
Try it out via this [demo](https://demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net/), or build and run it on your own [CPU](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source) or [GPU](https://github.com/microsoft/BitNet/blob/main/gpu/README.md).

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support is coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of **1.37x** to **5.07x** on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by **55.4%** to **70.0%**, further boosting overall efficiency. On x86 CPUs, speedups range from **2.37x** to **6.17x**, with energy reductions between **71.9%** and **82.2%**. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the [technical report](https://arxiv.org/abs/2410.16144) for more details.

```bash
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
```

### OpenAI-Compatible API Server

BitNet includes a built-in HTTP server that provides an OpenAI-compatible API for easy integration with existing tools and frameworks.

#### Start the Server

After building the project, start the server with:

```bash
./build/bin/llama-server \
  --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 4096
```
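Before sending requests, you can check that the model has finished loading. The `llama-server` binary comes from llama.cpp, which exposes a simple health endpoint (an assumption worth verifying against your build):

```bash
# Should return an OK status once the model is loaded and ready
curl http://127.0.0.1:8080/health
```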

#### Use the API

The server provides standard OpenAI-compatible endpoints:

```bash
# Chat completions
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitnet",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'

# List models
curl http://127.0.0.1:8080/v1/models
```
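The chat endpoint also accepts the standard OpenAI `stream` parameter (supported by llama.cpp's server, on which this binary is based); a sketch of a streaming request, where response chunks arrive as server-sent events:

```bash
# Streamed chat completion: each chunk arrives as a "data: {...}" SSE line
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitnet",
    "messages": [{"role": "user", "content": "Write a haiku about CPUs."}],
    "stream": true
  }'
```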

#### Integration Examples

**Python (OpenAI SDK):**
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="dummy")
response = client.chat.completions.create(
    model="bitnet",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
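Note that the `api_key` value is arbitrary: the local server does not authenticate requests, but the OpenAI SDK requires a non-empty key.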

**LangChain:**
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="dummy",
    model="bitnet"
)

print(llm.invoke("Hello!").content)
```

This lets the server act as a drop-in replacement for the OpenAI API in tools like Open WebUI, Continue, and other OpenAI-compatible applications.
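
For tools that read the standard OpenAI environment variables (the official Python SDK does, and many compatible applications follow the same convention), pointing them at the local server can be as simple as:

```bash
# Read by the OpenAI Python SDK and many OpenAI-compatible tools
export OPENAI_BASE_URL="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="dummy"   # any non-empty value; the local server ignores it
```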

### FAQ (Frequently Asked Questions)📌

#### Q1: The build fails when compiling llama.cpp due to issues with std::chrono in log.cpp. How do I fix this?