
Try it out via this [demo](https://demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net/), or build and run it on your own [CPU](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source) or [GPU](https://github.com/microsoft/BitNet/blob/main/gpu/README.md).

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support is coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of **1.37x** to **5.07x** on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by **55.4%** to **70.0%**, further boosting overall efficiency. On x86 CPUs, speedups range from **2.37x** to **6.17x**, with energy reductions between **71.9%** and **82.2%**. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the [technical report](https://arxiv.org/abs/2410.16144) for more details.

```bash
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
```

### OpenAI-Compatible API Server

BitNet includes a built-in HTTP server that provides an OpenAI-compatible API for easy integration with existing tools and frameworks.

#### Start the Server

After building the project, start the server with:

```bash
./build/bin/llama-server \
  --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 4096
```
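Before sending requests, you can check that the model has finished loading. The `llama-server` binary comes from llama.cpp, which exposes a simple health endpoint (an assumption worth verifying against your build):

```bash
# Should return an OK status once the model is loaded and ready
curl http://127.0.0.1:8080/health
```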

#### Use the API

The server provides standard OpenAI-compatible endpoints:

```bash
# Chat completions
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitnet",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
  }'

# List models
curl http://127.0.0.1:8080/v1/models
```
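The chat endpoint also accepts the standard OpenAI `stream` parameter (supported by llama.cpp's server, on which this binary is based); a sketch of a streaming request, where response chunks arrive as server-sent events:

```bash
# Streamed chat completion: each chunk arrives as a "data: {...}" SSE line
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitnet",
    "messages": [{"role": "user", "content": "Write a haiku about CPUs."}],
    "stream": true
  }'
```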

#### Integration Examples

**Python (OpenAI SDK):**
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="dummy")
response = client.chat.completions.create(
    model="bitnet",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
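Note that the `api_key` value is arbitrary: the local server does not authenticate requests, but the OpenAI SDK requires a non-empty key.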

**LangChain:**
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="dummy",
    model="bitnet"
)

print(llm.invoke("Hello!").content)
```

This lets the server act as a drop-in replacement for the OpenAI API in tools like Open WebUI, Continue, and other OpenAI-compatible applications.
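
For tools that read the standard OpenAI environment variables (the official Python SDK does, and many compatible applications follow the same convention), pointing them at the local server can be as simple as:

```bash
# Read by the OpenAI Python SDK and many OpenAI-compatible tools
export OPENAI_BASE_URL="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="dummy"   # any non-empty value; the local server ignores it
```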

### FAQ (Frequently Asked Questions)📌

#### Q1: The build fails when compiling llama.cpp due to issues with std::chrono in log.cpp. How do I fix this?