
feat(api): add Anthropic Messages API compatibility endpoint #1272

Open
sufubao wants to merge 22 commits into main from support_anthropic

Conversation

Collaborator

@sufubao commented Apr 15, 2026

Summary

  • Adds POST /v1/messages endpoint compatible with the Anthropic Messages API, enabled via --enable_anthropic_api flag
  • Implements request/response translation between Anthropic and OpenAI formats using LiteLLM as the translation layer, reusing the existing OpenAI-compatible inference pipeline
  • Supports both streaming (SSE) and non-streaming responses, including tool_use content blocks, image content blocks, and proper Anthropic event sequencing
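To make the request-side mapping concrete, here is a minimal standalone sketch of the kind of translation the LiteLLM layer performs (the helper name and shape are hypothetical illustrations, not the PR's actual code; the real work happens inside LiteLLM's adapter):

```python
def anthropic_to_openai_request(body: dict) -> dict:
    """Hypothetical sketch: map an Anthropic Messages request to OpenAI chat shape.

    Illustrates the translation direction only; in the PR this is delegated to
    LiteLLM's experimental adapter rather than hand-written."""
    openai_req = {
        "model": body["model"],
        "max_tokens": body["max_tokens"],
        "messages": list(body["messages"]),
    }
    # Anthropic tools carry a top-level input_schema; the OpenAI format wraps
    # each tool in {"type": "function", "function": {..., "parameters": ...}}.
    if "tools" in body:
        openai_req["tools"] = [
            {
                "type": "function",
                "function": {
                    "name": t["name"],
                    "description": t.get("description", ""),
                    "parameters": t["input_schema"],
                },
            }
            for t in body["tools"]
        ]
    return openai_req
```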

Key files

| File | Description |
| --- | --- |
| lightllm/server/api_anthropic.py | Main handler, request/response translation, SSE bridge |
| lightllm/server/_litellm_shim.py | LiteLLM adapter wrapper with version compatibility checks |
| lightllm/server/api_http.py | Route registration for /v1/messages |
| lightllm/server/api_cli.py | --enable_anthropic_api CLI flag |
| setup.py | litellm declared as anthropic_api optional extra |
| docs/EN/source/tutorial/anthropic.rst | Usage documentation |

Test plan

Tested against Qwen3.5-122B-A10B (TP=4) with --enable_anthropic_api and litellm==1.82.6.

1. Non-streaming text

curl http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Say hi in one word"}]
  }'
Response
{
  "id": "msg_7c405f60e4994e4d961fd76b",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 19, "output_tokens": 512},
  "content": [{"type": "text", "text": "Hi"}],
  "stop_reason": "end_turn"
}

2. Streaming text

curl -N http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 1024,
    "stream": true,
    "messages": [{"role": "user", "content": "Say hi in one word"}]
  }'
Response
event: message_start
data: {"type":"message_start","message":{"id":"msg_6be0aa4718134dc88489208c","type":"message","role":"assistant","model":"qwen3_5","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":0,"output_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hi"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":0,"output_tokens":0}}

event: message_stop
data: {"type":"message_stop"}

3. Tool use (non-streaming)

curl http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 512,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}]
  }'
Response
{
  "id": "msg_cf00a67dbc1846cfa1a3ab19",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 282, "output_tokens": 95},
  "content": [
    {"type": "tool_use", "id": "call_1f6b59ecab9f4f078b821163", "name": "get_weather", "input": {"city": "Beijing"}}
  ],
  "stop_reason": "tool_use"
}

4. Tool use (streaming)

curl -N http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 512,
    "stream": true,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": "Weather in Shanghai?"}]
  }'
Response
event: message_start
data: {"type":"message_start","message":{"id":"msg_95cf683629dd49df89474719","type":"message","role":"assistant","model":"qwen3_5","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":0,"output_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"\n\n"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: content_block_start
data: {"type":"content_block_start","index":1,"content_block":{"type":"tool_use","id":"call_57c24e389dcc4da7a52b6669","name":"get_weather","input":{}}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"{"}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"\"city\": \"Shanghai\""}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"}"}}

event: content_block_stop
data: {"type":"content_block_stop","index":1}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"tool_use","stop_sequence":null},"usage":{"input_tokens":0,"output_tokens":0}}

event: message_stop
data: {"type":"message_stop"}

5. Image input (base64)

# Generate a valid base64 PNG (Python)
B64=$(python3 -c "
import base64, io
from PIL import Image
img = Image.new('RGB', (4, 4), color=(255, 0, 0))
buf = io.BytesIO()
img.save(buf, format='PNG')
print(base64.b64encode(buf.getvalue()).decode())
")

curl http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d "{
    \"model\": \"qwen3_5\",
    \"max_tokens\": 512,
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image/png\", \"data\": \"$B64\"}},
        {\"type\": \"text\", \"text\": \"What color is this image?\"}
      ]
    }]
  }"
Response
{
  "id": "msg_fea4e0913ac844e69afa65a6",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 82, "output_tokens": 141},
  "content": [{"type": "text", "text": "This image is **red**. Specifically, it appears to be a solid, bright red color."}],
  "stop_reason": "end_turn"
}

sufubao and others added 19 commits April 15, 2026 22:13
Introduces the opt-in flag that will gate the Anthropic Messages API
compatibility layer. The flag is currently inert — the endpoint itself
lands in subsequent commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Isolates LiteLLM's experimental_pass_through imports behind get_*
accessors so future upgrades only touch one file. Includes a startup
check (ensure_available) that fails loudly if litellm is missing when
--enable_anthropic_api is set.
The installed litellm in the development environment is 1.83.7 and
the experimental adapter import works against it. The previous value
(1.60.0) predated testing and made log and error messages misleading.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses code review feedback on the Anthropic API shim:
- Removes _cached_stream_wrapper_cls — it was forward-allocated for a
  future streaming task but has no current references. It can be
  reintroduced alongside its getter when the streaming work lands.
- Expands the deep-import error message with a concrete pinning command
  so operators hitting it after a LiteLLM upgrade know how to recover.
Locks down the exact I/O shapes we depend on. A failure here signals
that LiteLLM's experimental adapter contract has shifted and the shim
needs updating.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces anthropic_messages_impl plus request/response translation
helpers. Streaming returns 501 for now — it lands in a later commit.
Unit tests cover the dict-level translation path without requiring a
running server.

Also adds conftest.py at repo root to patch the transformers tokenizers
version check (tokenizers 0.22.x vs <0.22 requirement in transformers
4.49) so that api_models is importable in the unit-test environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The route is gated on --enable_anthropic_api; requests return 404
otherwise. Startup calls ensure_available so a missing litellm package
fails loudly at server boot rather than on the first request.
Adds _openai_sse_to_anthropic_events, an async generator that consumes
the existing chat_completions_impl streaming output and re-emits it as
Anthropic message_start/content_block_*/message_delta/message_stop
events. Tool-use streaming is still pending.
- Update stale module docstring to reflect streaming support.
- Drop per-delta output_tokens increment; rely on the trailing usage
  chunk as the single source of truth for token counts.
- Document the known-zero input_tokens in message_start and surface
  the real prompt token count in message_delta.usage instead.
- Wire up the previously-dead _run helper in the streaming bridge
  test and move the asyncio import to the top of the file.
These tests lock down the contract for _anthropic_to_chat_request (when
given an Anthropic tool schema, must emit an OpenAI-shaped tools array
with parameters key and function type) and _chat_response_to_anthropic
(when given a ChatCompletionResponse carrying a tool_calls message,
must emit an Anthropic tool_use content block and tool_use stop_reason).

These tests are added without runtime verification because the current
environment does not have torch/litellm installed. A follow-up commit
may be needed if either assertion fails when run in a full environment —
the plan documents the expected remediation (rename input_schema to
parameters on the request side; force stop_reason=tool_use on the
response side).
Replace the collapsed set literal {"get_weather", "get_weather"[:64]}
with a direct equality check. The slice is a no-op on short names, so
the set held a single element and the OR branch was dead. The 64-char
truncation contract will be exercised by a longer-name test if needed.
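The collapse described above can be demonstrated in isolation:

```python
# "get_weather" is well under 64 characters, so the truncation slice is a
# no-op and the set literal collapses to a single element — which is why the
# OR branch comparing against the truncated form could never fire.
name = "get_weather"
collapsed = {name, name[:64]}
print(len(collapsed))  # 1
```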
Exercises non-streaming text, streaming text, system prompt, and tool
use against a live LightLLM with --enable_anthropic_api. Not wired into
automated CI because it needs a GPU runner.
test_anthropic_api.py is a standalone CLI tool whose module-level
test_* functions take positional arguments pytest cannot inject.
Without this collect_ignore, running pytest test/ produces four
collection errors. The script is still invoked directly via
python test/test_api/test_anthropic_api.py ...
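The collect_ignore wiring described above might look like the following (a sketch; the conftest location and relative path are assumptions based on the invocation shown, not confirmed from the diff):

```python
# test/conftest.py fragment (hypothetical sketch): stop pytest from trying to
# collect the standalone CLI tool, whose test_* functions take positional
# arguments that pytest cannot inject.
collect_ignore = ["test_api/test_anthropic_api.py"]
```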
Install with: pip install 'lightllm[anthropic_api]'
Keeps the base install lean; only users of --enable_anthropic_api
need the heavy dependency. Upper bound tracks _litellm_shim.py's
_MAX_TESTED_LITELLM_VERSION (1.84.0); bump explicitly after
re-running the round-trip characterisation test against newer
releases.
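The optional-extra declaration described above might be wired roughly like this (a sketch only; the exact bounds come from the commit message — 1.83.7 as the tested development version and 1.84.0 as _MAX_TESTED_LITELLM_VERSION — and the real setup.py may differ):

```python
# setup.py fragment (hypothetical sketch): litellm installs only with the
# anthropic_api extra, keeping the base install lean.
extras_require = {
    # Lower bound: the version the shim was developed against; upper bound
    # tracks _MAX_TESTED_LITELLM_VERSION in _litellm_shim.py and is bumped
    # explicitly after re-running the round-trip characterisation test.
    "anthropic_api": ["litellm>=1.83.7,<=1.84.0"],
}
```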
New tutorial page covers enablement via --enable_anthropic_api, SDK
usage for non-streaming and streaming, supported features, and known
limitations (ignored cache_control, unsupported thinking/batch/files,
zero input_tokens in streaming message_start). Linked from the EN
tutorial toctree alongside the existing OpenAI API guide.
Final-review polish: add a vision smoke test for _anthropic_to_chat_request
to guard against LiteLLM silently dropping image content blocks, and
document that _openai_sse_to_anthropic_events' text_block_index stays
at 0 on the text-only path.

The vision test does not assert a specific OpenAI shape — LiteLLM's
adapter owns that contract and may normalise it differently across
releases — but it verifies the user message survives translation.
chat_completions_impl yields str ("data: {...}\n\n"), not bytes —
StreamingResponse encodes on the way out, but _openai_sse_to_anthropic_events
consumes body_iterator directly and so was reaching the raw str. The old
.split(b"\n") path raised TypeError on the first streaming request.

Normalise at entry (decode bytes → str, leave str alone) and switch the
inner splitter to str literals. Adds a regression test feeding a str
iterator so the bytes-only path no longer hides the real contract.

Reported from a live run: claude-code → /v1/messages → 500 with
TypeError: must be str or None, not bytes at api_anthropic.py:193.
_openai_sse_to_anthropic_events previously only handled delta.content
(text) and silently dropped delta.tool_calls. When a model returned a
streaming tool call, clients saw stop_reason=tool_use with zero tool_use
content blocks — the exact shape that makes Claude Code report
"The model's tool call could not be parsed (retry also failed)".

The bridge now:
  - tracks at most one open content block at a time (text or tool_use);
  - buffers tool-call id/name/arguments per OpenAI streaming index until
    a name is available, then emits content_block_start(tool_use);
  - flushes buffered arguments and any subsequent deltas as
    input_json_delta partial_json events;
  - closes the current block when switching between text and tool_use or
    between tool_use blocks;
  - closes any still-open block at end of stream before message_delta.

Adds test_stream_bridge_emits_tool_use_content_block: feeds a realistic
three-chunk tool-call stream and asserts the Anthropic event sequence
(message_start, tool_use content_block_start, input_json_delta x N,
content_block_stop, message_delta with stop_reason=tool_use, message_stop)
plus verifies the partial_json deltas reconstruct the original arguments.

Reproduced against a live server at /v1/messages with
--tool_call_parser qwen3_coder; pre-fix stream contained
  event: message_delta  data: {...,"stop_reason":"tool_use",...}
with no tool_use block. Post-fix stream contains the full sequence.
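The one-open-block-at-a-time behaviour described above can be sketched as a small state machine (a simplified, synchronous, hypothetical re-creation for illustration — single choice, no usage accounting; the real bridge is an async generator consuming OpenAI SSE chunks):

```python
def bridge(chunks):
    """Sketch of the text/tool_use content-block state machine described above.

    `chunks` is a list of simplified OpenAI-style delta dicts; returns the
    Anthropic event sequence as a list of dicts."""
    events = [{"type": "message_start"}]
    open_block = None  # None | "text" | "tool_use"
    index = -1

    def close():
        nonlocal open_block
        if open_block is not None:
            events.append({"type": "content_block_stop", "index": index})
            open_block = None

    for delta in chunks:
        if delta.get("content"):
            if open_block != "text":
                close()  # switching block kinds closes the current block
                index += 1
                events.append({"type": "content_block_start", "index": index,
                               "content_block": {"type": "text", "text": ""}})
                open_block = "text"
            events.append({"type": "content_block_delta", "index": index,
                           "delta": {"type": "text_delta",
                                     "text": delta["content"]}})
        for tc in delta.get("tool_calls", []):
            if tc.get("name"):  # name available -> open a tool_use block
                close()
                index += 1
                events.append({"type": "content_block_start", "index": index,
                               "content_block": {"type": "tool_use",
                                                 "id": tc.get("id", ""),
                                                 "name": tc["name"],
                                                 "input": {}}})
                open_block = "tool_use"
            if tc.get("arguments"):  # stream raw JSON as input_json_delta
                events.append({"type": "content_block_delta", "index": index,
                               "delta": {"type": "input_json_delta",
                                         "partial_json": tc["arguments"]}})
    close()  # close any still-open block before the trailing events
    events.append({"type": "message_delta"})
    events.append({"type": "message_stop"})
    return events
```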
Four fixes observed from live Claude Code + curl traffic against
/v1/messages, none functional-blocking but all wire-format drift that
confuses strict Anthropic clients:

1. message id prefix. LiteLLM passes LightLLM's raw numeric request id
   through (e.g. "56"). Strict clients expect msg_* — force it in
   _normalize_anthropic_response when the adapter's output does not
   already match.

2. Empty leading text blocks. LiteLLM's response adapter sometimes
   emits [{"type":"text","text":""}, {"type":"tool_use",...}] where
   the spec wants just [{"type":"tool_use",...}]. Filter empty text
   blocks in the normaliser.

3. provider_specific_fields leak. LiteLLM internal field surfaces on
   content blocks in the final response; strip it in the normaliser.

4. Error envelope. create_error_response (from api_openai) produces
   OpenAI's {"error":{"message":...,"type":...}} shape, which Claude
   Code does not parse — it surfaces a generic failure instead of the
   real message. Add _anthropic_error_response which returns the
   Anthropic envelope {"type":"error","error":{"type":...,"message":...}}
   with status -> error-type mapping from
   https://docs.anthropic.com/en/api/errors. Use it for every
   error-return path in anthropic_messages_impl, including a
   _rewrap_openai_error_as_anthropic helper that best-effort decodes
   OpenAI errors coming back from chat_completions_impl and rewraps
   them into Anthropic shape.

Extracts the cleanup logic into a new _normalize_anthropic_response
helper so it can be tested directly without mocking the whole LiteLLM
adapter path. Adds three unit tests covering (1)+(2)+(3) and an
already-good-id preservation case, plus an error-envelope shape test.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an Anthropic Messages API compatibility layer, enabling LightLLM to expose a /v1/messages endpoint compatible with Anthropic SDKs. The implementation includes a translation shim using LiteLLM, logic for mapping requests and responses between OpenAI and Anthropic formats, and a streaming bridge for SSE events. Documentation, CLI flags, and comprehensive unit and integration tests are also provided. Review feedback highlights the need for more robust SSE parsing to handle potential data fragmentation and the importance of catching translation errors to prevent unhandled internal server errors.

Comment thread lightllm/server/api_anthropic.py Outdated
Comment on lines +231 to +244
async for raw_chunk in openai_body_iterator:
    if not raw_chunk:
        continue
    # chat_completions_impl yields str ("data: {...}\n\n"); some callers or
    # middlewares may hand us bytes. Normalise to str so the splitter below
    # does not have to branch on type.
    if isinstance(raw_chunk, (bytes, bytearray)):
        raw_chunk = raw_chunk.decode("utf-8", errors="replace")
    # A single StreamingResponse chunk may contain multiple SSE lines.
    for line in raw_chunk.split("\n"):
        line = line.strip()
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
Contributor


high

The current SSE parsing logic is susceptible to fragmentation. If a raw_chunk ends in the middle of a line or a multi-byte UTF-8 character, the data will be corrupted or lost. It is safer to use a buffer to accumulate partial lines and an incremental decoder for UTF-8 bytes.

    buffer = ""
    async for raw_chunk in openai_body_iterator:
        if not raw_chunk:
            continue
        # chat_completions_impl yields str ("data: {...}\n\n"); some callers or
        # middlewares may hand us bytes. Normalise to str so the splitter below
        # does not have to branch on type.
        if isinstance(raw_chunk, (bytes, bytearray)):
            raw_chunk = raw_chunk.decode("utf-8", errors="replace")

        buffer += raw_chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line or not line.startswith("data: "):
                continue
            payload = line[len("data: "):]

Collaborator Author


Not applicable here. _openai_sse_to_anthropic_events consumes downstream.body_iterator directly — this is the raw Python async generator returned by chat_completions_impl's StreamingResponse, not bytes arriving from a network socket. Each yield in that generator is a complete SSE event ("data: {...}\n\n"), so TCP-level fragmentation and multi-byte UTF-8 splits cannot occur. The existing try/except json.loads (line 247) already handles any unexpected malformed payload. Adding a line buffer would add complexity without addressing a real failure mode in this architecture.

Comment thread lightllm/server/api_anthropic.py Outdated
Comment on lines +543 to +544
anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
return JSONResponse(anthropic_dict)
Contributor


medium

The call to _chat_response_to_anthropic can raise a RuntimeError (via the fallback translator) if tool calls are present but LiteLLM translation fails. This should be caught to return a proper Anthropic error response instead of allowing a generic 500 error to propagate.

Suggested change
anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
return JSONResponse(anthropic_dict)
try:
    anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
except Exception as exc:
    logger.exception("Anthropic response translation failed")
    return _anthropic_error_response(HTTPStatus.INTERNAL_SERVER_ERROR, str(exc))
return JSONResponse(anthropic_dict)

Collaborator Author


Fixed in b03d06c. _chat_response_to_anthropic is now wrapped in try/except — any exception (including the RuntimeError from _fallback_openai_to_anthropic when tool calls are present) returns a proper Anthropic error envelope via _anthropic_error_response(500, ...) instead of propagating as a generic 500.

