Background
setup_env.py already builds llama-server as part of the cmake step (it lives at build/bin/llama-server after a successful build). This binary provides a fully OpenAI-compatible HTTP API (/v1/chat/completions, /v1/completions, /v1/models) — the same interface as llama.cpp's server.
The README currently documents only run_inference.py for inference, so the server binary goes unnoticed by most users even though it is already built.
What this unlocks
- Drop-in replacement for OpenAI API in downstream tools (LangChain, Open WebUI, custom apps) without code changes
- Persistent model loading (no 2-3s cold-start per request)
- Integration with job queues or proxy layers that speak OpenAI protocol
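To make the "drop-in" point concrete, here is a minimal sketch using only the Python standard library. The helper name is hypothetical, and the host, port, and model name simply mirror the example invocation in this issue; llama-server treats the model field as free-form when serving a single model.

```python
import json
import urllib.request

def build_chat_request(prompt, model="bitnet", base_url="http://127.0.0.1:8080"):
    """Build an OpenAI-style chat completion request for a local llama-server.

    Hypothetical helper for illustration; /v1/chat/completions is the
    endpoint llama-server exposes.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hello")
print(req.full_url)  # http://127.0.0.1:8080/v1/chat/completions
# With a server running, send it with:
#   body = json.load(urllib.request.urlopen(req))
#   print(body["choices"][0]["message"]["content"])
```

Because the request shape is plain OpenAI protocol, the same server also works unchanged with the official OpenAI SDKs by pointing their base URL at http://127.0.0.1:8080/v1.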
Minimal usage (after build)
./build/bin/llama-server --model models/BitNet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf --host 127.0.0.1 --port 8080 --parallel 1 --ctx-size 4096
# Then:
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"bitnet","messages":[{"role":"user","content":"Hello"}]}'
Question
Would the team be interested in a PR that:
- Documents this capability in the README (a single section — no code changes)
- Optionally adds a minimal Python wrapper script (consistent with the repo's Python-first style) to make the invocation discoverable
Happy to contribute either or both if there's interest. Flagging as a question first rather than opening a cold PR.
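For reference, the wrapper script mentioned above could be as small as the sketch below. The script name, flag names, and defaults are illustrative assumptions (not existing repo code); the idea is just to assemble the llama-server command line the way run_inference.py assembles its own invocation.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper sketch: launch the already-built llama-server binary
with an OpenAI-compatible HTTP API. Names and defaults are illustrative."""
import argparse

def build_server_cmd(args):
    # Assemble the argv for build/bin/llama-server from parsed arguments.
    return [
        "build/bin/llama-server",
        "--model", args.model,
        "--host", args.host,
        "--port", str(args.port),
        "--ctx-size", str(args.ctx_size),
    ]

def parse_args(argv=None):
    p = argparse.ArgumentParser(
        description="Serve a BitNet GGUF model over an OpenAI-compatible HTTP API"
    )
    p.add_argument("-m", "--model", required=True, help="path to the GGUF model file")
    p.add_argument("--host", default="127.0.0.1")
    p.add_argument("--port", type=int, default=8080)
    p.add_argument("--ctx-size", type=int, default=4096)
    return p.parse_args(argv)

cmd = build_server_cmd(
    parse_args(["-m", "models/BitNet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf"])
)
print(" ".join(cmd))
# To actually launch the server: subprocess.run(cmd)
```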