feat(api): add Anthropic Messages API compatibility endpoint#1272
Introduces the opt-in flag that will gate the Anthropic Messages API compatibility layer. The flag is currently inert — the endpoint itself lands in subsequent commits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Isolates LiteLLM's experimental_pass_through imports behind get_* accessors so future upgrades only touch one file. Includes a startup check (ensure_available) that fails loudly if litellm is missing when --enable_anthropic_api is set.
The installed litellm in the development environment is 1.83.7 and the experimental adapter import works against it. The previous value (1.60.0) predated testing and made log and error messages misleading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses code review feedback on the Anthropic API shim: - Removes _cached_stream_wrapper_cls — it was forward-allocated for a future streaming task but has no current references. It can be reintroduced alongside its getter when the streaming work lands. - Expands the deep-import error message with a concrete pinning command so operators hitting it after a LiteLLM upgrade know how to recover.
Locks down the exact I/O shapes we depend on. A failure here signals that LiteLLM's experimental adapter contract has shifted and the shim needs updating. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces anthropic_messages_impl plus request/response translation helpers. Streaming returns 501 for now — it lands in a later commit. Unit tests cover the dict-level translation path without requiring a running server. Also adds conftest.py at repo root to patch the transformers tokenizers version check (tokenizers 0.22.x vs <0.22 requirement in transformers 4.49) so that api_models is importable in the unit-test environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The route is gated on --enable_anthropic_api; requests return 404 otherwise. Startup calls ensure_available so a missing litellm package fails loudly at server boot rather than on the first request.
Adds _openai_sse_to_anthropic_events, an async generator that consumes the existing chat_completions_impl streaming output and re-emits it as Anthropic message_start/content_block_*/message_delta/message_stop events. Tool-use streaming is still pending.
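The text-only path of that mapping can be sketched as a minimal async generator. This is an illustration of the event shape only, not the production bridge — the function name and simplifications (single text block at index 0, no usage accounting) are assumptions:

```python
import asyncio
import json

async def sse_text_to_anthropic_events(openai_lines):
    # Illustrative sketch: message_start, one text content block, then
    # message_delta and message_stop, per the Anthropic streaming format.
    yield {"type": "message_start"}
    yield {"type": "content_block_start", "index": 0,
           "content_block": {"type": "text", "text": ""}}
    async for line in openai_lines:
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            yield {"type": "content_block_delta", "index": 0,
                   "delta": {"type": "text_delta", "text": delta["content"]}}
    yield {"type": "content_block_stop", "index": 0}
    yield {"type": "message_delta", "delta": {"stop_reason": "end_turn"}}
    yield {"type": "message_stop"}

async def _collect():
    async def openai_chunks():
        yield 'data: {"choices": [{"delta": {"content": "Hel"}}]}'
        yield 'data: {"choices": [{"delta": {"content": "lo"}}]}'
        yield 'data: [DONE]'
    return [event async for event in sse_text_to_anthropic_events(openai_chunks())]

events = asyncio.run(_collect())
```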
- Update stale module docstring to reflect streaming support. - Drop per-delta output_tokens increment; rely on the trailing usage chunk as the single source of truth for token counts. - Document the known-zero input_tokens in message_start and surface the real prompt token count in message_delta.usage instead. - Wire up the previously-dead _run helper in the streaming bridge test and move the asyncio import to the top of the file.
These tests lock down the contract for _anthropic_to_chat_request (when given an Anthropic tool schema, must emit an OpenAI-shaped tools array with parameters key and function type) and _chat_response_to_anthropic (when given a ChatCompletionResponse carrying a tool_calls message, must emit an Anthropic tool_use content block and tool_use stop_reason). These tests are added without runtime verification because the current environment does not have torch/litellm installed. A follow-up commit may be needed if either assertion fails when run in a full environment — the plan documents the expected remediation (rename input_schema to parameters on the request side; force stop_reason=tool_use on the response side).
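The request-side contract can be illustrated with a hypothetical translator — this is not the real `_anthropic_to_chat_request`, just the shape the tests pin down (rename `input_schema` to `parameters`, wrap as a `function` tool):

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    # Hypothetical helper: Anthropic's input_schema key becomes OpenAI's
    # parameters key, and the tool is wrapped with type "function".
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }

openai_tool = anthropic_tool_to_openai({
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
})
```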
Replace the collapsed set literal {"get_weather", "get_weather"[:64]}
with a direct equality check. The slice is a no-op on short names, so
the set held a single element and the OR branch was dead. The 64-char
truncation contract will be exercised by a longer-name test if needed.
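A quick demonstration of the collapse:

```python
# Slicing a name shorter than 64 characters is a no-op, so both set
# elements are equal and the literal collapses to a single member —
# which is why the OR branch could never fire.
name = "get_weather"
assert name[:64] == name
assert {name, name[:64]} == {name}

# Only a name longer than 64 characters makes the truncation observable.
long_name = "x" * 70
assert len(long_name[:64]) == 64
assert {long_name, long_name[:64]} != {long_name}
```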
Exercises non-streaming text, streaming text, system prompt, and tool use against a live LightLLM with --enable_anthropic_api. Not wired into automated CI because it needs a GPU runner.
test_anthropic_api.py is a standalone CLI tool whose module-level test_* functions take positional arguments pytest cannot inject. Without this collect_ignore, running pytest test/ produces four collection errors. The script is still invoked directly via python test/test_api/test_anthropic_api.py ...
Install with: pip install 'lightllm[anthropic_api]' Keeps the base install lean; only users of --enable_anthropic_api need the heavy dependency. Upper bound tracks _litellm_shim.py's _MAX_TESTED_LITELLM_VERSION (1.84.0); bump explicitly after re-running the round-trip characterisation test against newer releases.
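A sketch of the resulting extras declaration — the exact bounds shown are illustrative; the authoritative upper bound lives in `_litellm_shim.py`:

```python
# setup.py (sketch, not the shipped file). The lower bound reflects the
# version the adapter was tested against; the upper bound must track
# _litellm_shim.py's _MAX_TESTED_LITELLM_VERSION.
extras_require = {
    "anthropic_api": ["litellm>=1.83.7,<=1.84.0"],
}
```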
New tutorial page covers enablement via --enable_anthropic_api, SDK usage for non-streaming and streaming, supported features, and known limitations (ignored cache_control, unsupported thinking/batch/files, zero input_tokens in streaming message_start). Linked from the EN tutorial toctree alongside the existing OpenAI API guide.
Final-review polish: add a vision smoke test for _anthropic_to_chat_request to guard against LiteLLM silently dropping image content blocks, and document that _openai_sse_to_anthropic_events' text_block_index stays at 0 on the text-only path. The vision test does not assert a specific OpenAI shape — LiteLLM's adapter owns that contract and may normalise it differently across releases — but it verifies the user message survives translation.
chat_completions_impl yields str ("data: {...}\n\n"), not bytes —
StreamingResponse encodes on the way out, but _openai_sse_to_anthropic_events
consumes body_iterator directly and so was reaching the raw str. The old
.split(b"\n") path raised TypeError on the first streaming request.
Normalise at entry (decode bytes → str, leave str alone) and switch the
inner splitter to str literals. Adds a regression test feeding a str
iterator so the bytes-only path no longer hides the real contract.
Reported from a live run: claude-code → /v1/messages → 500 with
TypeError: must be str or None, not bytes at api_anthropic.py:193.
_openai_sse_to_anthropic_events previously only handled delta.content
(text) and silently dropped delta.tool_calls. When a model returned a
streaming tool call, clients saw stop_reason=tool_use with zero tool_use
content blocks — the exact shape that makes Claude Code report
"The model's tool call could not be parsed (retry also failed)".
The bridge now:
- tracks at most one open content block at a time (text or tool_use);
- buffers tool-call id/name/arguments per OpenAI streaming index until
a name is available, then emits content_block_start(tool_use);
- flushes buffered arguments and any subsequent deltas as
input_json_delta partial_json events;
- closes the current block when switching between text and tool_use or
between tool_use blocks;
- closes any still-open block at end of stream before message_delta.
Adds test_stream_bridge_emits_tool_use_content_block: feeds a realistic
three-chunk tool-call stream and asserts the Anthropic event sequence
(message_start, tool_use content_block_start, input_json_delta x N,
content_block_stop, message_delta with stop_reason=tool_use, message_stop)
plus verifies the partial_json deltas reconstruct the original arguments.
Reproduced against a live server at /v1/messages with
--tool_call_parser qwen3_coder; pre-fix stream contained
event: message_delta data: {...,"stop_reason":"tool_use",...}
with no tool_use block. Post-fix stream contains the full sequence.
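The buffering rule can be sketched with a simplified single-tool-call model. The real bridge tracks per-index state and interleaves text blocks; the function name and reduced event shapes here are illustrative only:

```python
import json

def bridge_tool_deltas(deltas):
    # Simplified model: hold id/name/arguments until a name is available,
    # then open the tool_use block and flush buffered arguments as
    # input_json_delta events.
    events = []
    buf = {"id": None, "name": None, "args": ""}
    opened = False
    for delta in deltas:
        if delta.get("id"):
            buf["id"] = delta["id"]
        if delta.get("name"):
            buf["name"] = delta["name"]
        if delta.get("arguments"):
            buf["args"] += delta["arguments"]
        if buf["name"] and not opened:
            events.append({
                "type": "content_block_start",
                "content_block": {"type": "tool_use",
                                  "id": buf["id"], "name": buf["name"]},
            })
            opened = True
        if opened and buf["args"]:
            events.append({
                "type": "content_block_delta",
                "delta": {"type": "input_json_delta",
                          "partial_json": buf["args"]},
            })
            buf["args"] = ""
    # Close the still-open block at end of stream.
    events.append({"type": "content_block_stop"})
    return events

# A realistic three-chunk stream: name first, arguments split across chunks.
events = bridge_tool_deltas([
    {"id": "call_1", "name": "get_weather"},
    {"arguments": '{"city":'},
    {"arguments": ' "Beijing"}'},
])
arguments = json.loads("".join(
    e["delta"]["partial_json"]
    for e in events if e["type"] == "content_block_delta"))
```

The same function also handles the name-arrives-late case: arguments buffered before the name are flushed in one `input_json_delta` as soon as the block opens.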
Four fixes observed from live Claude Code + curl traffic against
/v1/messages, none functional-blocking but all wire-format drift that
confuses strict Anthropic clients:
1. message id prefix. LiteLLM passes LightLLM's raw numeric request id
through (e.g. "56"). Strict clients expect msg_* — force it in
_normalize_anthropic_response when the adapter's output does not
already match.
2. Empty leading text blocks. LiteLLM's response adapter sometimes
emits [{"type":"text","text":""}, {"type":"tool_use",...}] where
the spec wants just [{"type":"tool_use",...}]. Filter empty text
blocks in the normaliser.
3. provider_specific_fields leak. LiteLLM internal field surfaces on
content blocks in the final response; strip it in the normaliser.
4. Error envelope. create_error_response (from api_openai) produces
OpenAI's {"error":{"message":...,"type":...}} shape, which Claude
Code does not parse — it surfaces a generic failure instead of the
real message. Add _anthropic_error_response which returns the
Anthropic envelope {"type":"error","error":{"type":...,"message":...}}
with status -> error-type mapping from
https://docs.anthropic.com/en/api/errors. Use it for every
error-return path in anthropic_messages_impl, including a
_rewrap_openai_error_as_anthropic helper that best-effort decodes
OpenAI errors coming back from chat_completions_impl and rewraps
them into Anthropic shape.
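A sketch of the envelope helper — the status-to-type table follows the Anthropic error docs linked above; the function itself is illustrative, not the shipped `_anthropic_error_response`:

```python
# Status code -> Anthropic error type, per
# https://docs.anthropic.com/en/api/errors
_STATUS_TO_ERROR_TYPE = {
    400: "invalid_request_error",
    401: "authentication_error",
    403: "permission_error",
    404: "not_found_error",
    413: "request_too_large",
    429: "rate_limit_error",
    500: "api_error",
    529: "overloaded_error",
}

def anthropic_error_body(status: int, message: str) -> dict:
    # Unknown statuses fall back to the generic api_error type.
    return {
        "type": "error",
        "error": {
            "type": _STATUS_TO_ERROR_TYPE.get(status, "api_error"),
            "message": message,
        },
    }

body = anthropic_error_body(429, "rate limited upstream")
```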
Extracts the cleanup logic into a new _normalize_anthropic_response
helper so it can be tested directly without mocking the whole LiteLLM
adapter path. Adds three unit tests covering (1)+(2)+(3) and an
already-good-id preservation case, plus an error-envelope shape test.
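A minimal sketch of the three normalisations (illustrative; the shipped helper is `_normalize_anthropic_response`):

```python
import uuid

def normalize_anthropic_response(resp: dict) -> dict:
    # (1) Force a msg_* id when the adapter passed through a raw request id.
    if not str(resp.get("id", "")).startswith("msg_"):
        resp["id"] = "msg_" + uuid.uuid4().hex[:24]
    blocks = []
    for block in resp.get("content", []):
        # (3) Strip LiteLLM's internal field before it reaches clients.
        block.pop("provider_specific_fields", None)
        # (2) Drop the empty text blocks the adapter sometimes prepends.
        if block.get("type") == "text" and not block.get("text"):
            continue
        blocks.append(block)
    resp["content"] = blocks
    return resp

raw = {
    "id": "56",
    "content": [
        {"type": "text", "text": "", "provider_specific_fields": None},
        {"type": "tool_use", "id": "call_1", "name": "get_weather",
         "input": {"city": "Beijing"}},
    ],
}
clean = normalize_anthropic_response(raw)
```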
Code Review
This pull request introduces an Anthropic Messages API compatibility layer, enabling LightLLM to expose a /v1/messages endpoint compatible with Anthropic SDKs. The implementation includes a translation shim using LiteLLM, logic for mapping requests and responses between OpenAI and Anthropic formats, and a streaming bridge for SSE events. Documentation, CLI flags, and comprehensive unit and integration tests are also provided. Review feedback highlights the need for more robust SSE parsing to handle potential data fragmentation and the importance of catching translation errors to prevent unhandled internal server errors.
```python
async for raw_chunk in openai_body_iterator:
    if not raw_chunk:
        continue
    # chat_completions_impl yields str ("data: {...}\n\n"); some callers or
    # middlewares may hand us bytes. Normalise to str so the splitter below
    # does not have to branch on type.
    if isinstance(raw_chunk, (bytes, bytearray)):
        raw_chunk = raw_chunk.decode("utf-8", errors="replace")
    # A single StreamingResponse chunk may contain multiple SSE lines.
    for line in raw_chunk.split("\n"):
        line = line.strip()
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
```
The current SSE parsing logic is susceptible to fragmentation. If a raw_chunk ends in the middle of a line or a multi-byte UTF-8 character, the data will be corrupted or lost. It is safer to use a buffer to accumulate partial lines and an incremental decoder for UTF-8 bytes.
```python
buffer = ""
async for raw_chunk in openai_body_iterator:
    if not raw_chunk:
        continue
    # chat_completions_impl yields str ("data: {...}\n\n"); some callers or
    # middlewares may hand us bytes. Normalise to str so the splitter below
    # does not have to branch on type.
    if isinstance(raw_chunk, (bytes, bytearray)):
        raw_chunk = raw_chunk.decode("utf-8", errors="replace")
    buffer += raw_chunk
    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        line = line.strip()
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
```
Not applicable here. _openai_sse_to_anthropic_events consumes downstream.body_iterator directly — this is the raw Python async generator returned by chat_completions_impl's StreamingResponse, not bytes arriving from a network socket. Each yield in that generator is a complete SSE event ("data: {...}\n\n"), so TCP-level fragmentation and multi-byte UTF-8 splits cannot occur. The existing try/except json.loads (line 247) already handles any unexpected malformed payload. Adding a line buffer would add complexity without addressing a real failure mode in this architecture.
```python
anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
return JSONResponse(anthropic_dict)
```
The call to _chat_response_to_anthropic can raise a RuntimeError (via the fallback translator) if tool calls are present but LiteLLM translation fails. This should be caught to return a proper Anthropic error response instead of allowing a generic 500 error to propagate.
```python
try:
    anthropic_dict = _chat_response_to_anthropic(downstream, tool_name_mapping, requested_model)
except Exception as exc:
    logger.exception("Anthropic response translation failed")
    return _anthropic_error_response(HTTPStatus.INTERNAL_SERVER_ERROR, str(exc))
return JSONResponse(anthropic_dict)
```
Fixed in b03d06c. _chat_response_to_anthropic is now wrapped in try/except — any exception (including the RuntimeError from _fallback_openai_to_anthropic when tool calls are present) returns a proper Anthropic error envelope via _anthropic_error_response(500, ...) instead of propagating as a generic 500.
Summary
- `POST /v1/messages` endpoint compatible with the Anthropic Messages API, enabled via the `--enable_anthropic_api` flag
- `tool_use` content blocks, image content blocks, and proper Anthropic event sequencing

Key files

- `lightllm/server/api_anthropic.py`
- `lightllm/server/_litellm_shim.py`
- `lightllm/server/api_http.py` — `/v1/messages` route
- `lightllm/server/api_cli.py` — `--enable_anthropic_api` CLI flag
- `setup.py` — `litellm` declared as `anthropic_api` optional extra
- `docs/EN/source/tutorial/anthropic.rst`

Test plan
Tested against Qwen3.5-122B-A10B (TP=4) with `--enable_anthropic_api` and `litellm==1.82.6`.

1. Non-streaming text
Response
```json
{
  "id": "msg_7c405f60e4994e4d961fd76b",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 19, "output_tokens": 512},
  "content": [{"type": "text", "text": "Hi"}],
  "stop_reason": "end_turn"
}
```

2. Streaming text
Response
3. Tool use (non-streaming)
Response
```json
{
  "id": "msg_cf00a67dbc1846cfa1a3ab19",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 282, "output_tokens": 95},
  "content": [
    {"type": "tool_use", "id": "call_1f6b59ecab9f4f078b821163", "name": "get_weather", "input": {"city": "Beijing"}}
  ],
  "stop_reason": "tool_use"
}
```

4. Tool use (streaming)
Response
5. Image input (base64)
Response
```json
{
  "id": "msg_fea4e0913ac844e69afa65a6",
  "type": "message",
  "role": "assistant",
  "model": "qwen3_5",
  "stop_sequence": null,
  "usage": {"input_tokens": 82, "output_tokens": 141},
  "content": [{"type": "text", "text": "This image is **red**. Specifically, it appears to be a solid, bright red color."}],
  "stop_reason": "end_turn"
}
```