Try it out via this demo, or build and run it on your own CPU or GPU.
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).
The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.
- System Requirements Document (SRD): docs/system-requirements.md captures the authoritative scope, architecture, and quality targets for the full stack.
- System Roadmap & Scope Register: docs/system-roadmap.md tracks capability maturity, owners, and milestone planning across all phases of the autonomous development loop.
- Iteration Log: docs/iteration-log.md records autonomous refinement cycles and links to future actions for Phases 4–12.
- Runtime Supervisor Guide: docs/runtime_supervisor.md explains asynchronous orchestration patterns and telemetry hooks.
- Deployment Guide: docs/deployment.md consolidates repeatable build, container, Kubernetes, and edge deployment instructions.
- Telemetry Dashboards: docs/telemetry-dashboards.md captures real-time observability profiles for TUI and web experiences.
- Documentation Review Checklist: docs/documentation-review-checklist.md defines the automation-enforced review gate for SRD, roadmap, and changelog updates.
SRD last updated: 2025-05-30
Roadmap last updated: 2025-05-30
The tested models are dummy setups used in a research context to demonstrate the inference performance of bitnet.cpp.
A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
demo.mp4
- 05/20/2025 BitNet Official GPU inference kernel
- 04/14/2025 BitNet Official 2B Parameter Model on Hugging Face
- 02/18/2025 Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
- 11/08/2024 BitNet a4.8: 4-bit Activations for 1-bit LLMs
- 10/21/2024 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- 10/17/2024 bitnet.cpp 1.0 released.
- 03/21/2024 The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ
- 02/27/2024 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- 10/17/2023 BitNet: Scaling 1-bit Transformers for Large Language Models
This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
| Model | Parameters | CPU | I2_S | TL1 | TL2 |
|---|---|---|---|---|---|
| BitNet-b1.58-2B-4T | 2.4B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.
| Model | Parameters | CPU | I2_S | TL1 | TL2 |
|---|---|---|---|---|---|
| bitnet_b1_58-large | 0.7B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| bitnet_b1_58-3B | 3.3B | x86 | ❌ | ❌ | ✅ |
| | | ARM | ❌ | ✅ | ❌ |
| Llama3-8B-1.58-100B-tokens | 8.0B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| Falcon3 Family | 1B-10B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| Falcon-E Family | 1B-3B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
- python>=3.9
- cmake>=3.22
- clang>=18
- For Windows users, install Visual Studio 2022. In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):
  - Desktop development with C++
  - C++ CMake Tools for Windows
  - Git for Windows
  - C++ Clang Compiler for Windows
  - MS-Build Support for LLVM Toolset (clang)
- For Debian/Ubuntu users, you can install clang with the automatic installation script:
  bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
- conda (highly recommended)
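Before building, you can quickly confirm that the toolchain meets these requirements; the commands below are standard version checks (adjust for your platform):
# Verify the prerequisites listed above
python --version   # expect 3.9 or newer
cmake --version    # expect 3.22 or newer
clang --version    # expect 18 or newer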
Important
If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues.
- Clone the repo
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
- Install the dependencies
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
Install the packaged Python utilities directly:
pip install .
- Build the project
cmake -S . -B build
cmake --build build -j
To run the fully automated bootstrapper with caching and Hugging Face downloads enabled, execute:
python setup_env.py --hf-repo microsoft/BitNet-b1.58-2B-4T --cache-dir ~/.cache/bitnet
Use python setup_env.py --help to inspect additional options such as --skip-build, --skip-model, and --force when re-running on provisioned machines.
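For example, a re-run on a machine that already has the compiled binaries and downloaded weights could combine those flags as follows (the exact combination is illustrative; check --help for the authoritative semantics):
# Skip the native build and the model download on an already-provisioned machine
python setup_env.py --hf-repo microsoft/BitNet-b1.58-2B-4T --skip-build --skip-model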
# Run inference with the quantized model
python run_inference.py \
--model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
--prompt "You are a helpful assistant" \
--threads auto \
    --conversation
usage: run_inference.py [-h] -m MODEL -p PROMPT [-n N_PREDICT] [-c CTX_SIZE] [-t THREADS]
[--temperature TEMPERATURE] [-b BATCH_SIZE] [--gpu-layers GPU_LAYERS]
[-cnv] [--build-dir BUILD_DIR]
[--log-level {CRITICAL,ERROR,WARNING,INFO,DEBUG}] [--dry-run] [--diagnostics]
[--extra-args ...]
Run BitNet inference via llama.cpp
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
Path to the GGUF model
-p PROMPT, --prompt PROMPT
Prompt to generate text from
-n N_PREDICT, --n-predict N_PREDICT
Number of tokens to generate
-c CTX_SIZE, --ctx-size CTX_SIZE
Context window size
-t THREADS, --threads THREADS
Thread count or 'auto'
--temperature TEMPERATURE
Sampling temperature
-b BATCH_SIZE, --batch-size BATCH_SIZE
Prompt batch size
--gpu-layers GPU_LAYERS
Number of layers to offload to the GPU (requires GPU
build)
-cnv, --conversation Enable chat mode
--build-dir BUILD_DIR
Directory that contains the compiled llama.cpp binaries
--log-level {CRITICAL,ERROR,WARNING,INFO,DEBUG}
Logging verbosity for runtime diagnostics
--dry-run Print the command that would be executed without running it
--diagnostics Show a health report for the runtime and exit
--extra-args ... Additional llama.cpp flags appended verbatim
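For example, the --dry-run flag above can be used to preview the underlying llama.cpp command before launching a full generation (model path reused from the example above):
# Print the llama.cpp invocation that would run, without executing it
python run_inference.py \
    --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    --prompt "You are a helpful assistant" \
    --dry-run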
See docs/deployment.md for a step-by-step checklist that
turns these scripts into a repeatable deployment pipeline.
When running against a GPU-enabled build of llama.cpp, specify
--gpu-layers <count> to offload the chosen number of transformer layers while
retaining the validated runtime checks provided by the Python wrapper.
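For example, a GPU-offload run might look like the following (the layer count of 24 is only illustrative; tune it to your GPU memory):
# Offload 24 transformer layers to the GPU (requires a GPU-enabled build)
python run_inference.py \
    --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    --prompt "You are a helpful assistant" \
    --gpu-layers 24 \
    --conversation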
For services that need to schedule multiple inference jobs concurrently or
enforce latency budgets, the new RuntimeSupervisor pairs with
BitNetRuntime to provide asyncio-based execution with configurable
concurrency and timeouts. See docs/runtime_supervisor.md
for integration examples and guidance on capturing structured telemetry from
each run.
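As a rough shell-level stand-in for what RuntimeSupervisor automates programmatically, two jobs with a fixed latency budget could be launched concurrently like this (the prompts and the 120-second budget are placeholders; refer to docs/runtime_supervisor.md for the actual Python API):
# Illustrative only: two concurrent inference jobs, each limited to 120 seconds
timeout 120 python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "First prompt" -n 64 &
timeout 120 python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Second prompt" -n 64 &
wait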
We provide scripts to run the inference benchmark for a given model.
usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]
Setup the environment for running the inference
required arguments:
-m MODEL, --model MODEL
Path to the model file.
optional arguments:
-h, --help
Show this help message and exit.
-n N_TOKEN, --n-token N_TOKEN
Number of generated tokens.
-p N_PROMPT, --n-prompt N_PROMPT
Prompt to generate text from.
-t THREADS, --threads THREADS
Number of threads to use.
Here's a brief explanation of each argument:
- -m, --model: The path to the model file. This is a required argument that must be provided when running the script.
- -n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
- -p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
- -t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
- -h, --help: Show the help message and exit. Use this argument to display usage information.
For example:
python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256-token prompt, utilizing 4 threads.
For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:
python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M
# Run benchmark with the generated model, use -m to specify the model path, -p to specify the prompt processed, -n to specify the number of token to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
# Prepare the .safetensors model file
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
The repository now includes lightweight sanity checks for the Python setup workflow. Execute them with:
python -m unittest discover -s tests
The tests are safe to run without downloading models or compiling native code; they validate the command runner and configuration
guards used by setup_env.py.
- tests/test_integration_kernels.py builds minimal CPU/GPU inference executables through CMake and validates kernel correctness. GPU checks are skipped when CUDA tooling is not detected.
- tests/test_cli_properties.py and tests/test_supervisor_properties.py rely on property-based fuzzing (powered by Hypothesis) to harden CLI parsing and asynchronous scheduling.
- tests/test_performance_profiles.py runs deterministic latency/throughput smoke tests to surface regressions in token generation speed.
- tests/test_gpu_kernel_unit.py compiles a CUDA micro-kernel with PyTorch extensions to sanity check PTX/ROCm execution paths.
Install optional dependencies for these scenarios via pip install -r tests/requirements-test.txt.
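For example, a single suite can be run on its own via unittest discovery (the pattern matches one of the file names listed above):
# Run only the property-based CLI suite, with verbose output
python -m unittest discover -s tests -p "test_cli_properties.py" -v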
A: This is an issue introduced in a recent version of llama.cpp. Please refer to this commit in the discussion to fix the issue.
A: Before building the project, verify your clang installation and access to Visual Studio tools by running:
clang -v
This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:
'clang' is not recognized as an internal or external command, operable program or batch file.
it indicates that your command-line window is not properly initialized for the Visual Studio tools.
• If you are using Command Prompt, run:
"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
• If you are using Windows PowerShell, run the following commands:
Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
These steps will initialize your environment and allow you to use the correct Visual Studio tools.