
Feature request: document/support llama-server HTTP endpoint for OpenAI-compatible serving #432

@parthalon025

Description


Background

setup_env.py already builds llama-server as part of the cmake step (it lives at build/bin/llama-server after a successful build). This binary provides a fully OpenAI-compatible HTTP API (/v1/chat/completions, /v1/completions, /v1/models) — the same interface as llama.cpp's server.

The README currently documents only run_inference.py for inference, so the server binary is built but goes undiscovered by most users.

What this unlocks

  • Drop-in replacement for OpenAI API in downstream tools (LangChain, Open WebUI, custom apps) without code changes
  • Persistent model loading (no 2-3s cold-start per request)
  • Integration with job queues or proxy layers that speak OpenAI protocol
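To illustrate the drop-in point above: once llama-server is running, any OpenAI-style client can target the local endpoint with no SDK required. A minimal sketch using only the Python standard library (the function names and the `"bitnet"` model alias are my own choices, not anything the repo defines):

```python
import json
import urllib.request


def build_chat_payload(prompt, model="bitnet"):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://127.0.0.1:8080"):
    """POST the payload to a locally running llama-server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    # OpenAI-compatible responses carry the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Swapping `base_url` is the only change needed to point an existing OpenAI-protocol integration at the local server.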

Minimal usage (after build)

./build/bin/llama-server \
    --model models/BitNet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
    --host 127.0.0.1 \
    --port 8080 \
    --parallel 1 \
    --ctx-size 4096

# Then:
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"bitnet","messages":[{"role":"user","content":"Hello"}]}'

Question

Would the team be interested in a PR that:

  1. Documents this capability in the README (a single section — no code changes)
  2. Optionally adds a minimal Python wrapper script (consistent with the repo's Python-first style) to make the invocation discoverable

Happy to contribute either or both if there's interest. Flagging as a question first rather than opening a cold PR.
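For item 2 above, the wrapper could be as thin as building the argv list and handing it to subprocess. A sketch of what I have in mind (the script name, function names, and defaults are all hypothetical; only the binary path and flags come from the build documented above):

```python
import subprocess
from pathlib import Path


def server_command(model_path, host="127.0.0.1", port=8080, ctx_size=4096):
    """Build the llama-server invocation as a list of argv strings."""
    binary = Path("build") / "bin" / "llama-server"
    return [
        str(binary),
        "--model", str(model_path),
        "--host", host,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
    ]


def serve(model_path, **kwargs):
    """Launch llama-server in the foreground; blocks until the server exits."""
    subprocess.run(server_command(model_path, **kwargs), check=True)
```

Keeping the flag-building separate from the subprocess call makes the invocation easy to print for discoverability, which is the main goal here.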
