From d0019917cb2045f33963af67049124418461e330 Mon Sep 17 00:00:00 2001
From: JasonOA888
Date: Fri, 13 Mar 2026 01:38:00 +0800
Subject: [PATCH] docs: add OpenAI-compatible API server documentation

Addresses #432

## Summary

- Document the built-in llama-server HTTP endpoint
- Provide OpenAI-compatible API usage examples
- Fix typo: 'will coming' -> 'will be coming'
- Enable drop-in integration with LangChain, Open WebUI, etc.

## Changes

- README.md: Added 'OpenAI-Compatible API Server' section
- README.md: Fixed NPU support description typo

## Why This Matters

Many users are unaware that BitNet includes a fully OpenAI-compatible API server. This documentation enables:

- Drop-in replacement for the OpenAI API in existing tools
- Persistent model loading (no cold start per request)
- Integration with job queues and proxy layers

## Test Plan

- [x] Verified llama-server binary exists in build/bin/
- [x] Tested curl examples against running server
- [x] Verified Python SDK integration
- [x] Checked LangChain example

---
 README.md | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3bb25596e..340c67e78 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 Try it out via this [demo](https://demo-bitnet-h0h8hcfqeqhrf5gf.canadacentral-01.azurewebsites.net/), or build and run it on your own [CPU](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source) or [GPU](https://github.com/microsoft/BitNet/blob/main/gpu/README.md).
 
-bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels, that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support will coming next).
+bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support **fast** and **lossless** inference of 1.58-bit models on CPU and GPU (NPU support will be coming next).
 
 The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achieves speedups of **1.37x** to **5.07x** on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by **55.4%** to **70.0%**, further boosting overall efficiency. On x86 CPUs, speedups range from **2.37x** to **6.17x** with energy reductions between **71.9%** to **82.2%**. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the [technical report](https://arxiv.org/abs/2410.16144) for more details.
@@ -304,6 +304,66 @@
 huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/
 python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
 ```
 
+### OpenAI-Compatible API Server
+
+BitNet includes a built-in HTTP server that provides an OpenAI-compatible API for easy integration with existing tools and frameworks.
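Because llama-server loads the model once at startup (which can take a while for larger models), client code may want to wait for the server to become ready before sending requests. As a minimal sketch using only the Python standard library — assuming the host/port from the examples in this section, and the `/health` endpoint that llama.cpp's server exposes (returning 200 once the model is loaded); the function name here is illustrative — a readiness check could look like this:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://127.0.0.1:8080", timeout=120.0):
    """Poll the server's /health endpoint until the model is loaded.

    Returns True once the server answers 200 OK, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or still loading the model
        time.sleep(1.0)
    return False
```

A launcher script can call `wait_until_ready()` right after starting the server and only then begin issuing `/v1/chat/completions` requests, avoiding races against the cold start.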
+
+#### Start the Server
+
+After building the project, start the server with:
+
+```bash
+./build/bin/llama-server \
+  --model models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
+  --host 127.0.0.1 \
+  --port 8080 \
+  --ctx-size 4096
+```
+
+#### Use the API
+
+The server provides standard OpenAI-compatible endpoints:
+
+```bash
+# Chat completions
+curl http://127.0.0.1:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "bitnet",
+    "messages": [{"role": "user", "content": "Hello, how are you?"}]
+  }'
+
+# List models
+curl http://127.0.0.1:8080/v1/models
+```
+
+#### Integration Examples
+
+**Python (OpenAI SDK):**
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="dummy")
+response = client.chat.completions.create(
+    model="bitnet",
+    messages=[{"role": "user", "content": "Hello!"}]
+)
+print(response.choices[0].message.content)
+```
+
+**LangChain:**
+```python
+from langchain_openai import ChatOpenAI
+
+llm = ChatOpenAI(
+    base_url="http://127.0.0.1:8080/v1",
+    api_key="dummy",
+    model="bitnet"
+)
+```
+
+This enables a drop-in replacement for the OpenAI API in tools such as Open WebUI, Continue, and other OpenAI-compatible applications.
+
 ### FAQ (Frequently Asked Questions)📌
 
 #### Q1: The build dies with errors building llama.cpp due to issues with std::chrono in log.cpp?