
feat(api): add Anthropic Messages API compatibility endpoint #1272

Open
sufubao wants to merge 22 commits into main from support_anthropic

Conversation

Collaborator

@sufubao commented Apr 15, 2026

Summary

  • Adds POST /v1/messages endpoint compatible with the Anthropic Messages API, enabled via --enable_anthropic_api flag
  • Implements request/response translation between Anthropic and OpenAI formats using LiteLLM as the translation layer, reusing the existing OpenAI-compatible inference pipeline
  • Supports both streaming (SSE) and non-streaming responses, including tool_use content blocks, image content blocks, and proper Anthropic event sequencing
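To make the request-side mapping concrete, here is a minimal standalone sketch of the kind of translation the LiteLLM layer performs (the helper name and shape are hypothetical illustrations, not the PR's actual code; the real work happens inside LiteLLM's adapter):

```python
def anthropic_to_openai_request(body: dict) -> dict:
    """Hypothetical sketch: map an Anthropic Messages request to OpenAI chat shape.

    Illustrates the translation direction only; in the PR this is delegated to
    LiteLLM's experimental adapter rather than hand-written."""
    openai_req = {
        "model": body["model"],
        "max_tokens": body["max_tokens"],
        "messages": list(body["messages"]),
    }
    # Anthropic tools carry a top-level input_schema; the OpenAI format wraps
    # each tool in {"type": "function", "function": {..., "parameters": ...}}.
    if "tools" in body:
        openai_req["tools"] = [
            {
                "type": "function",
                "function": {
                    "name": t["name"],
                    "description": t.get("description", ""),
                    "parameters": t["input_schema"],
                },
            }
            for t in body["tools"]
        ]
    return openai_req
```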

Key files

| File | Description |
| --- | --- |
| lightllm/server/api_anthropic.py | Main handler, request/response translation, SSE bridge |
| lightllm/server/_litellm_shim.py | LiteLLM adapter wrapper with version compatibility checks |
| lightllm/server/api_http.py | Route registration for /v1/messages |
| lightllm/server/api_cli.py | --enable_anthropic_api CLI flag |
| setup.py | litellm declared as anthropic_api optional extra |
| docs/EN/source/tutorial/anthropic.rst | Usage documentation |

Test plan

Tested against Qwen3.5-122B-A10B (TP=4) with --enable_anthropic_api and litellm==1.82.6.

1. Non-streaming text

curl http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Say hi in one word"}]
  }'
Response
{
  "id": "msg_7c405f60e4994e4d961fd76b",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 19, "output_tokens": 512},
  "content": [{"type": "text", "text": "Hi"}],
  "stop_reason": "end_turn"
}

2. Streaming text

curl -N http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 1024,
    "stream": true,
    "messages": [{"role": "user", "content": "Say hi in one word"}]
  }'
Response
event: message_start
data: {"type":"message_start","message":{"id":"msg_6be0aa4718134dc88489208c","type":"message","role":"assistant","model":"qwen3_5","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":0,"output_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hi"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":0,"output_tokens":0}}

event: message_stop
data: {"type":"message_stop"}

3. Tool use (non-streaming)

curl http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 512,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}]
  }'
Response
{
  "id": "msg_cf00a67dbc1846cfa1a3ab19",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 282, "output_tokens": 95},
  "content": [
    {"type": "tool_use", "id": "call_1f6b59ecab9f4f078b821163", "name": "get_weather", "input": {"city": "Beijing"}}
  ],
  "stop_reason": "tool_use"
}

4. Tool use (streaming)

curl -N http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3_5",
    "max_tokens": 512,
    "stream": true,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "messages": [{"role": "user", "content": "Weather in Shanghai?"}]
  }'
Response
event: message_start
data: {"type":"message_start","message":{"id":"msg_95cf683629dd49df89474719","type":"message","role":"assistant","model":"qwen3_5","content":[],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":0,"output_tokens":0,"cache_creation_input_tokens":0,"cache_read_input_tokens":0}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"\n\n"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: content_block_start
data: {"type":"content_block_start","index":1,"content_block":{"type":"tool_use","id":"call_57c24e389dcc4da7a52b6669","name":"get_weather","input":{}}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"{"}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"\"city\": \"Shanghai\""}}

event: content_block_delta
data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"}"}}

event: content_block_stop
data: {"type":"content_block_stop","index":1}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"tool_use","stop_sequence":null},"usage":{"input_tokens":0,"output_tokens":0}}

event: message_stop
data: {"type":"message_stop"}

5. Image input (base64)

# Generate a valid base64 PNG (Python)
B64=$(python3 -c "
import base64, io
from PIL import Image
img = Image.new('RGB', (4, 4), color=(255, 0, 0))
buf = io.BytesIO()
img.save(buf, format='PNG')
print(base64.b64encode(buf.getvalue()).decode())
")

curl http://localhost:17888/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: fake" \
  -H "anthropic-version: 2023-06-01" \
  -d "{
    \"model\": \"qwen3_5\",
    \"max_tokens\": 512,
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image/png\", \"data\": \"$B64\"}},
        {\"type\": \"text\", \"text\": \"What color is this image?\"}
      ]
    }]
  }"
Response
{
  "id": "msg_fea4e0913ac844e69afa65a6",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 82, "output_tokens": 141},
  "content": [{"type": "text", "text": "This image is **red**. Specifically, it appears to be a solid, bright red color."}],
  "stop_reason": "end_turn"
}

sufubao and others added 19 commits April 15, 2026 22:13
Introduces the opt-in flag that will gate the Anthropic Messages API
compatibility layer. The flag is currently inert — the endpoint itself
lands in subsequent commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Isolates LiteLLM's experimental_pass_through imports behind get_*
accessors so future upgrades only touch one file. Includes a startup
check (ensure_available) that fails loudly if litellm is missing when
--enable_anthropic_api is set.
The installed litellm in the development environment is 1.83.7 and
the experimental adapter import works against it. The previous value
(1.60.0) predated testing and made log and error messages misleading.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses code review feedback on the Anthropic API shim:
- Removes _cached_stream_wrapper_cls — it was forward-allocated for a
  future streaming task but has no current references. It can be
  reintroduced alongside its getter when the streaming work lands.
- Expands the deep-import error message with a concrete pinning command
  so operators hitting it after a LiteLLM upgrade know how to recover.
Locks down the exact I/O shapes we depend on. A failure here signals
that LiteLLM's experimental adapter contract has shifted and the shim
needs updating.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces anthropic_messages_impl plus request/response translation
helpers. Streaming returns 501 for now — it lands in a later commit.
Unit tests cover the dict-level translation path without requiring a
running server.

Also adds conftest.py at repo root to patch the transformers tokenizers
version check (tokenizers 0.22.x vs <0.22 requirement in transformers
4.49) so that api_models is importable in the unit-test environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The route is gated on --enable_anthropic_api; requests return 404
otherwise. Startup calls ensure_available so a missing litellm package
fails loudly at server boot rather than on the first request.
Adds _openai_sse_to_anthropic_events, an async generator that consumes
the existing chat_completions_impl streaming output and re-emits it as
Anthropic message_start/content_block_*/message_delta/message_stop
events. Tool-use streaming is still pending.
- Update stale module docstring to reflect streaming support.
- Drop per-delta output_tokens increment; rely on the trailing usage
  chunk as the single source of truth for token counts.
- Document the known-zero input_tokens in message_start and surface
  the real prompt token count in message_delta.usage instead.
- Wire up the previously-dead _run helper in the streaming bridge
  test and move the asyncio import to the top of the file.
These tests lock down the contract for _anthropic_to_chat_request (when
given an Anthropic tool schema, must emit an OpenAI-shaped tools array
with parameters key and function type) and _chat_response_to_anthropic
(when given a ChatCompletionResponse carrying a tool_calls message,
must emit an Anthropic tool_use content block and tool_use stop_reason).

These tests are added without runtime verification because the current
environment does not have torch/litellm installed. A follow-up commit
may be needed if either assertion fails when run in a full environment —
the plan documents the expected remediation (rename input_schema to
parameters on the request side; force stop_reason=tool_use on the
response side).
Replace the collapsed set literal {"get_weather", "get_weather"[:64]}
with a direct equality check. The slice is a no-op on short names, so
the set held a single element and the OR branch was dead. The 64-char
truncation contract will be exercised by a longer-name test if needed.
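The collapse described above can be demonstrated in isolation:

```python
# "get_weather" is well under 64 characters, so the truncation slice is a
# no-op and the set literal collapses to a single element — which is why the
# OR branch comparing against the truncated form could never fire.
name = "get_weather"
collapsed = {name, name[:64]}
print(len(collapsed))  # 1
```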
Exercises non-streaming text, streaming text, system prompt, and tool
use against a live LightLLM with --enable_anthropic_api. Not wired into
automated CI because it needs a GPU runner.
test_anthropic_api.py is a standalone CLI tool whose module-level
test_* functions take positional arguments pytest cannot inject.
Without this collect_ignore, running pytest test/ produces four
collection errors. The script is still invoked directly via
python test/test_api/test_anthropic_api.py ...
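The collect_ignore wiring described above might look like the following (a sketch; the conftest location and relative path are assumptions based on the invocation shown, not confirmed from the diff):

```python
# test/conftest.py fragment (hypothetical sketch): stop pytest from trying to
# collect the standalone CLI tool, whose test_* functions take positional
# arguments that pytest cannot inject.
collect_ignore = ["test_api/test_anthropic_api.py"]
```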
Install with: pip install 'lightllm[anthropic_api]'
Keeps the base install lean; only users of --enable_anthropic_api
need the heavy dependency. Upper bound tracks _litellm_shim.py's
_MAX_TESTED_LITELLM_VERSION (1.84.0); bump explicitly after
re-running the round-trip characterisation test against newer
releases.
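The optional-extra declaration described above might be wired roughly like this (a sketch only; the exact bounds come from the commit message — 1.83.7 as the tested development version and 1.84.0 as _MAX_TESTED_LITELLM_VERSION — and the real setup.py may differ):

```python
# setup.py fragment (hypothetical sketch): litellm installs only with the
# anthropic_api extra, keeping the base install lean.
extras_require = {
    # Lower bound: the version the shim was developed against; upper bound
    # tracks _MAX_TESTED_LITELLM_VERSION in _litellm_shim.py and is bumped
    # explicitly after re-running the round-trip characterisation test.
    "anthropic_api": ["litellm>=1.83.7,<=1.84.0"],
}
```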
New tutorial page covers enablement via --enable_anthropic_api, SDK
usage for non-streaming and streaming, supported features, and known
limitations (ignored cache_control, unsupported thinking/batch/files,
zero input_tokens in streaming message_start). Linked from the EN
tutorial toctree alongside the existing OpenAI API guide.
Final-review polish: add a vision smoke test for _anthropic_to_chat_request
to guard against LiteLLM silently dropping image content blocks, and
document that _openai_sse_to_anthropic_events' text_block_index stays
at 0 on the text-only path.

The vision test does not assert a specific OpenAI shape — LiteLLM's
adapter owns that contract and may normalise it differently across
releases — but it verifies the user message survives translation.
chat_completions_impl yields str ("data: {...}\n\n"), not bytes —
StreamingResponse encodes on the way out, but _openai_sse_to_anthropic_events
consumes body_iterator directly and so was reaching the raw str. The old
.split(b"\n") path raised TypeError on the first streaming request.

Normalise at entry (decode bytes → str, leave str alone) and switch the
inner splitter to str literals. Adds a regression test feeding a str
iterator so the bytes-only path no longer hides the real contract.

Reported from a live run: claude-code → /v1/messages → 500 with
TypeError: must be str or None, not bytes at api_anthropic.py:193.
_openai_sse_to_anthropic_events previously only handled delta.content
(text) and silently dropped delta.tool_calls. When a model returned a
streaming tool call, clients saw stop_reason=tool_use with zero tool_use
content blocks — the exact shape that makes Claude Code report
"The model's tool call could not be parsed (retry also failed)".

The bridge now:
  - tracks at most one open content block at a time (text or tool_use);
  - buffers tool-call id/name/arguments per OpenAI streaming index until
    a name is available, then emits content_block_start(tool_use);
  - flushes buffered arguments and any subsequent deltas as
    input_json_delta partial_json events;
  - closes the current block when switching between text and tool_use or
    between tool_use blocks;
  - closes any still-open block at end of stream before message_delta.

Adds test_stream_bridge_emits_tool_use_content_block: feeds a realistic
three-chunk tool-call stream and asserts the Anthropic event sequence
(message_start, tool_use content_block_start, input_json_delta x N,
content_block_stop, message_delta with stop_reason=tool_use, message_stop)
plus verifies the partial_json deltas reconstruct the original arguments.

Reproduced against a live server at /v1/messages with
--tool_call_parser qwen3_coder; pre-fix stream contained
  event: message_delta  data: {...,"stop_reason":"tool_use",...}
with no tool_use block. Post-fix stream contains the full sequence.
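The one-open-block-at-a-time behaviour described above can be sketched as a small state machine (a simplified, synchronous, hypothetical re-creation for illustration — single choice, no usage accounting; the real bridge is an async generator consuming OpenAI SSE chunks):

```python
def bridge(chunks):
    """Sketch of the text/tool_use content-block state machine described above.

    `chunks` is a list of simplified OpenAI-style delta dicts; returns the
    Anthropic event sequence as a list of dicts."""
    events = [{"type": "message_start"}]
    open_block = None  # None | "text" | "tool_use"
    index = -1

    def close():
        nonlocal open_block
        if open_block is not None:
            events.append({"type": "content_block_stop", "index": index})
            open_block = None

    for delta in chunks:
        if delta.get("content"):
            if open_block != "text":
                close()  # switching block kinds closes the current block
                index += 1
                events.append({"type": "content_block_start", "index": index,
                               "content_block": {"type": "text", "text": ""}})
                open_block = "text"
            events.append({"type": "content_block_delta", "index": index,
                           "delta": {"type": "text_delta",
                                     "text": delta["content"]}})
        for tc in delta.get("tool_calls", []):
            if tc.get("name"):  # name available -> open a tool_use block
                close()
                index += 1
                events.append({"type": "content_block_start", "index": index,
                               "content_block": {"type": "tool_use",
                                                 "id": tc.get("id", ""),
                                                 "name": tc["name"],
                                                 "input": {}}})
                open_block = "tool_use"
            if tc.get("arguments"):  # stream raw JSON as input_json_delta
                events.append({"type": "content_block_delta", "index": index,
                               "delta": {"type": "input_json_delta",
                                         "partial_json": tc["arguments"]}})
    close()  # close any still-open block before the trailing events
    events.append({"type": "message_delta"})
    events.append({"type": "message_stop"})
    return events
```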
Four fixes observed from live Claude Code + curl traffic against
/v1/messages, none functional-blocking but all wire-format drift that
confuses strict Anthropic clients:

1. message id prefix. LiteLLM passes LightLLM's raw numeric request id
   through (e.g. "56"). Strict clients expect msg_* — force it in
   _normalize_anthropic_response when the adapter's output does not
   already match.

2. Empty leading text blocks. LiteLLM's response adapter sometimes
   emits [{"type":"text","text":""}, {"type":"tool_use",...}] where
   the spec wants just [{"type":"tool_use",...}]. Filter empty text
   blocks in the normaliser.

3. provider_specific_fields leak. LiteLLM internal field surfaces on
   content blocks in the final response; strip it in the normaliser.

4. Error envelope. create_error_response (from api_openai) produces
   OpenAI's {"error":{"message":...,"type":...}} shape, which Claude
   Code does not parse — it surfaces a generic failure instead of the
   real message. Add _anthropic_error_response which returns the
   Anthropic envelope {"type":"error","error":{"type":...,"message":...}}
   with status -> error-type mapping from
   https://docs.anthropic.com/en/api/errors. Use it for every
   error-return path in anthropic_messages_impl, including a
   _rewrap_openai_error_as_anthropic helper that best-effort decodes
   OpenAI errors coming back from chat_completions_impl and rewraps
   them into Anthropic shape.

Extracts the cleanup logic into a new _normalize_anthropic_response
helper so it can be tested directly without mocking the whole LiteLLM
adapter path. Adds three unit tests covering (1)+(2)+(3) and an
already-good-id preservation case, plus an error-envelope shape test.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an Anthropic Messages API compatibility layer, enabling LightLLM to expose a /v1/messages endpoint compatible with Anthropic SDKs. The implementation includes a translation shim using LiteLLM, logic for mapping requests and responses between OpenAI and Anthropic formats, and a streaming bridge for SSE events. Documentation, CLI flags, and comprehensive unit and integration tests are also provided. Review feedback highlights the need for more robust SSE parsing to handle potential data fragmentation and the importance of catching translation errors to prevent unhandled internal server errors.

Comment thread lightllm/server/api_anthropic.py Outdated
Comment on lines +231 to +244
async for raw_chunk in openai_body_iterator:
    if not raw_chunk:
        continue
    # chat_completions_impl yields str ("data: {...}\n\n"); some callers or
    # middlewares may hand us bytes. Normalise to str so the splitter below
    # does not have to branch on type.
    if isinstance(raw_chunk, (bytes, bytearray)):
        raw_chunk = raw_chunk.decode("utf-8", errors="replace")
    # A single StreamingResponse chunk may contain multiple SSE lines.
    for line in raw_chunk.split("\n"):
        line = line.strip()
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
Contributor


high

The current SSE parsing logic is susceptible to fragmentation. If a raw_chunk ends in the middle of a line or a multi-byte UTF-8 character, the data will be corrupted or lost. It is safer to use a buffer to accumulate partial lines and an incremental decoder for UTF-8 bytes.

    buffer = ""
    async for raw_chunk in openai_body_iterator:
        if not raw_chunk:
            continue
        # chat_completions_impl yields str ("data: {...}\n\n"); some callers or
        # middlewares may hand us bytes. Normalise to str so the splitter below
        # does not have to branch on type.
        if isinstance(raw_chunk, (bytes, bytearray)):
            raw_chunk = raw_chunk.decode("utf-8", errors="replace")

        buffer += raw_chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line or not line.startswith("data: "):
                continue
            payload = line[len("data: "):]

Collaborator Author


Not applicable here. _openai_sse_to_anthropic_events consumes downstream.body_iterator directly — this is the raw Python async generator returned by chat_completions_impl's StreamingResponse, not bytes arriving from a network socket. Each yield in that generator is a complete SSE event ("data: {...}\n\n"), so TCP-level fragmentation and multi-byte UTF-8 splits cannot occur. The existing try/except json.loads (line 247) already handles any unexpected malformed payload. Adding a line buffer would add complexity without addressing a real failure mode in this architecture.

Comment thread lightllm/server/api_anthropic.py Outdated
Comment on lines +543 to +544
anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
return JSONResponse(anthropic_dict)
Contributor


medium

The call to _chat_response_to_anthropic can raise a RuntimeError (via the fallback translator) if tool calls are present but LiteLLM translation fails. This should be caught to return a proper Anthropic error response instead of allowing a generic 500 error to propagate.

Suggested change
anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
return JSONResponse(anthropic_dict)
try:
    anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
except Exception as exc:
    logger.exception("Anthropic response translation failed")
    return _anthropic_error_response(HTTPStatus.INTERNAL_SERVER_ERROR, str(exc))
return JSONResponse(anthropic_dict)

Collaborator Author


Fixed in b03d06c. _chat_response_to_anthropic is now wrapped in try/except — any exception (including the RuntimeError from _fallback_openai_to_anthropic when tool calls are present) returns a proper Anthropic error envelope via _anthropic_error_response(500, ...) instead of propagating as a generic 500.

