From d0019917cb2045f33963af67049124418461e330 Mon Sep 17 00:00:00 2001
From: JasonOA888
Date: Fri, 13 Mar 2026 01:38:00 +0800
Subject: [PATCH] docs: add OpenAI-compatible API server documentation

Addresses #432

## Summary

- Document the built-in llama-server HTTP endpoint
- Provide OpenAI-compatible API usage examples
- Fix typo: 'will coming' -> 'will be coming'
- Enable drop-in integration with LangChain, Open WebUI, etc.

## Changes

- README.md: Added 'OpenAI-Compatible API Server' section
- README.md: Fixed NPU support description typo

## Why This Matters

Many users are unaware that BitNet includes a fully OpenAI-compatible API server. This documentation enables:

- Drop-in replacement for the OpenAI API in existing tools
- Persistent model loading (no cold start per request)
- Integration with job queues and proxy layers

## Test Plan

- [x] Verified llama-server binary exists in build/bin/
- [x] Tested curl examples against running server
- [x] Verified Python SDK integration
- [x] Checked LangChain example

---
 README.md | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3bb25596e..340c67e78 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 Try it out via this [demo](https://demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net/), or build and run it on your own [CPU](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source) or [GPU](https://github.com/microsoft/BitNet/blob/main/gpu/README.md).
 
-bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels, that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support will coming next).
+bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support will be coming next).
 
 The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achieves speedups of **1.37x** to **5.07x** on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by **55.4%** to **70.0%**, further boosting overall efficiency. On x86 CPUs, speedups range from **2.37x** to **6.17x** with energy reductions between **71.9%** to **82.2%**. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the [technical report](https://arxiv.org/abs/2410.16144) for more details.
@@ -304,6 +304,66 @@
 huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/
 python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
 ```
 
+### OpenAI-Compatible API Server
+
+BitNet includes a built-in HTTP server that provides an OpenAI-compatible API for easy integration with existing tools and frameworks.
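Because llama-server loads the model once at startup (which can take a while for larger models), client code may want to wait for the server to become ready before sending requests. As a minimal sketch using only the Python standard library — assuming the host/port from the examples in this section, and the `/health` endpoint that llama.cpp's server exposes (returning 200 once the model is loaded); the function name here is illustrative — a readiness check could look like this:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://127.0.0.1:8080", timeout=120.0):
    """Poll the server's /health endpoint until the model is loaded.

    Returns True once the server answers 200 OK, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or still loading the model
        time.sleep(1.0)
    return False
```

A launcher script can call `wait_until_ready()` right after starting the server and only then begin issuing `/v1/chat/completions` requests, avoiding races against the cold start.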
+
+#### Start the Server
+
+After building the project, start the server with:
+
+```bash
+./build/bin/llama-server \
+  --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
+  --host 127.0.0.1 \
+  --port 8080 \
+  --ctx-size 4096
+```
+
+#### Use the API
+
+The server provides standard OpenAI-compatible endpoints:
+
+```bash
+# Chat completions
+curl http://127.0.0.1:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "bitnet",
+    "messages": [{"role": "user", "content": "Hello, how are you?"}]
+  }'
+
+# List models
+curl http://127.0.0.1:8080/v1/models
+```
+
+#### Integration Examples
+
+**Python (OpenAI SDK):**
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="dummy")
+response = client.chat.completions.create(
+    model="bitnet",
+    messages=[{"role": "user", "content": "Hello!"}]
+)
+print(response.choices[0].message.content)
+```
+
+**LangChain:**
+```python
+from langchain_openai import ChatOpenAI
+
+llm = ChatOpenAI(
+    base_url="http://127.0.0.1:8080/v1",
+    api_key="dummy",
+    model="bitnet"
+)
+```
+
+This enables a drop-in replacement for the OpenAI API in tools such as Open WebUI, Continue, and other OpenAI-compatible applications.
+
 ### FAQ (Frequently Asked Questions)📌
 
 #### Q1: The build dies with errors building llama.cpp due to issues with std::chrono in log.cpp?