fix(server): preserve assistant reasoning field across turns #394
mlxcel-server dropped the assistant `reasoning` field on input, so prior-turn reasoning was lost in multi-turn interleaved-reasoning conversations. Chat templates that render `message.get('reasoning')` (e.g. Gemma 4) never received it, so the model could not see its own prior thinking across turns. The output side (streaming reasoning_content) was already complete; this is the input-side fix for issue #362.
Add `reasoning: Option<String>` to the `Message` struct, accepting both `reasoning` and the OpenAI-compatible `reasoning_content` spelling via serde alias. Requests that carry a non-empty reasoning field now take the raw-JSON render path (alongside tool-bearing requests) so the field reaches the template; the typed ChatMessage path only carries role and content and would otherwise drop it.
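The routing rule described above can be sketched as a small predicate. This is a minimal illustration, not the server's actual code: the `Message` struct here is a simplified stand-in, and both the `has_tool_calls` flag and the `needs_raw_json_path` function are hypothetical names. It also assumes "non-empty" means a non-zero-length string.

```rust
/// Hypothetical, simplified stand-in for the server's `Message` type.
#[derive(Debug, Clone, Default)]
pub struct Message {
    pub role: String,
    pub content: String,
    /// Parallel reasoning channel; `None` when the client sent none.
    pub reasoning: Option<String>,
    /// Assumption: a simple flag standing in for "this message carries tool data".
    pub has_tool_calls: bool,
}

/// Returns true when any message needs the raw-JSON render path, i.e. it
/// carries tool data or a non-empty reasoning field. The typed path only
/// carries role and content, so either would otherwise be dropped.
pub fn needs_raw_json_path(messages: &[Message]) -> bool {
    messages.iter().any(|m| {
        let has_reasoning = m.reasoning.as_deref().map_or(false, |r| !r.is_empty());
        m.has_tool_calls || has_reasoning
    })
}
```

Under this sketch an empty-string reasoning field does not force the raw path, so plain text-only requests keep using the typed path.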
In `build_raw_json_messages_with_thinking`, forward the reasoning field using the same decision applied to inline `<think>` blocks: drop it when the turn is stripped by the rolling checkpoint (preserve_thinking=false, the default), forward it otherwise. Skip forwarding when the content already carries an inline `<think>` block so the same reasoning is never double-injected across the two channels.
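The forwarding decision can be written as one small function. The sketch below is an assumption-laden illustration rather than the body of `build_raw_json_messages_with_thinking`: the function name `forward_reasoning` and the `before_checkpoint` parameter are hypothetical, and the inline-block check is reduced to a plain substring test for `<think>`.

```rust
/// Decide whether the parallel reasoning field reaches the template.
///
/// Hypothetical parameters:
/// - `preserve_thinking`: the request-level flag (false by default).
/// - `before_checkpoint`: whether this assistant turn precedes the rolling
///   checkpoint and is therefore subject to stripping.
pub fn forward_reasoning<'a>(
    reasoning: Option<&'a str>,
    content: &str,
    preserve_thinking: bool,
    before_checkpoint: bool,
) -> Option<&'a str> {
    // Nothing to forward when the field is absent or empty.
    let reasoning = reasoning.filter(|r| !r.is_empty())?;

    // Same decision as for inline <think> blocks: a stripped turn loses it.
    if before_checkpoint && !preserve_thinking {
        return None;
    }

    // Double-injection guard: the content already carries the reasoning inline.
    if content.contains("<think>") {
        return None;
    }

    Some(reasoning)
}
```

Keeping both channels behind one decision means a template that renders the inline block and the `reasoning` field never sees the same text twice.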
Wire the same forwarding into the Anthropic Messages translator (assistant `thinking` blocks become the reasoning field) and the Responses API input mapping (a `reasoning` item rides on the following assistant turn) for cross-API consistency.
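As a rough illustration of the Anthropic side, the sketch below splits an assistant turn's content blocks into visible text and a parallel reasoning field. All types and names here (`ContentBlock`, `AssistantTurn`, `translate_assistant`) are hypothetical, and joining multiple `thinking` blocks with a newline is an assumption, not something the PR states.

```rust
/// Hypothetical subset of Anthropic-style assistant content blocks.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentBlock {
    Text(String),
    Thinking(String),
}

/// Hypothetical internal representation of a translated assistant turn.
#[derive(Debug, Clone, PartialEq)]
pub struct AssistantTurn {
    pub content: String,
    pub reasoning: Option<String>,
}

/// Route `thinking` blocks onto the reasoning field instead of dropping them.
pub fn translate_assistant(blocks: &[ContentBlock]) -> AssistantTurn {
    let mut text: Vec<&str> = Vec::new();
    let mut thinking: Vec<&str> = Vec::new();
    for block in blocks {
        match block {
            ContentBlock::Text(t) => text.push(t.as_str()),
            ContentBlock::Thinking(t) => thinking.push(t.as_str()),
        }
    }
    // Assumption: several thinking blocks are joined with a newline.
    let reasoning = if thinking.is_empty() {
        None
    } else {
        Some(thinking.join("\n"))
    };
    AssistantTurn { content: text.join(""), reasoning }
}
```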
Add regression tests: the multi-turn render path asserts reasoning reaches the prompt when preserve_thinking=true and is stripped by default, a double-injection guard test, the raw-path routing test, a serde reasoning/reasoning_content round-trip, and per-translator forwarding tests.
A `Reasoning` input item immediately followed by a `FunctionCallOutput` with no preceding `FunctionCall` left `pending_reasoning` alive, because the tool-call flush block (which calls `pending_reasoning.take()`) was skipped. The buffered reasoning then leaked onto the next assistant turn, violating the invariant that a reasoning item not followed by an assistant turn before the next turn boundary is dropped.
Fix: mirror the `Message` arm by setting `pending_reasoning = None` after the Tool message is pushed in the `FunctionCallOutput` arm. In the normal `Reasoning` -> `FunctionCall` -> `FunctionCallOutput` flow the flush already consumes the reasoning via `take()`, so the new assignment is a no-op and the normal flow is unaffected.
Two regression tests added: one asserts no leak in the malformed `[Reasoning, FunctionCallOutput, Message(assistant)]` sequence; the other confirms the normal tool flow still attaches reasoning to the function-call assistant turn.
docs/responses-api.md: correct the stale claim that reasoning input items are not fed back into the next prompt.
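The buffering invariant and the leak fix can be modeled with a small state machine. This is a simplified sketch, not the code in `responses_translator.rs`: the `InputItem` and `ChatTurn` types, the `flush_calls` helper, and the placeholder content of the function-call turn are all hypothetical.

```rust
/// Hypothetical subset of Responses API input items.
#[derive(Debug, Clone, PartialEq)]
pub enum InputItem {
    Reasoning(String),
    Message { role: String, content: String },
    FunctionCall { name: String },
    FunctionCallOutput { output: String },
}

/// Hypothetical chat turn produced by the mapping.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatTurn {
    pub role: String,
    pub content: String,
    pub reasoning: Option<String>,
}

/// Emit buffered function calls as one assistant turn, consuming any
/// buffered reasoning via `take()`. Does nothing when no calls are buffered.
fn flush_calls(out: &mut Vec<ChatTurn>, calls: &mut Vec<String>, pending: &mut Option<String>) {
    if calls.is_empty() {
        return;
    }
    out.push(ChatTurn {
        role: "assistant".to_string(),
        content: format!("tool_calls: {}", calls.join(", ")),
        reasoning: pending.take(),
    });
    calls.clear();
}

pub fn map_input(items: &[InputItem]) -> Vec<ChatTurn> {
    let mut out = Vec::new();
    let mut calls: Vec<String> = Vec::new();
    let mut pending_reasoning: Option<String> = None;

    for item in items {
        match item {
            InputItem::Reasoning(text) => {
                flush_calls(&mut out, &mut calls, &mut pending_reasoning);
                pending_reasoning = Some(text.clone());
            }
            InputItem::FunctionCall { name } => calls.push(name.clone()),
            InputItem::FunctionCallOutput { output } => {
                flush_calls(&mut out, &mut calls, &mut pending_reasoning);
                out.push(ChatTurn {
                    role: "tool".to_string(),
                    content: output.clone(),
                    reasoning: None,
                });
                // The fix: a tool message is a turn boundary, so drop any
                // reasoning the flush did not consume.
                pending_reasoning = None;
            }
            InputItem::Message { role, content } => {
                flush_calls(&mut out, &mut calls, &mut pending_reasoning);
                let reasoning = if role.as_str() == "assistant" {
                    pending_reasoning.take()
                } else {
                    None
                };
                out.push(ChatTurn { role: role.clone(), content: content.clone(), reasoning });
                pending_reasoning = None;
            }
        }
    }
    flush_calls(&mut out, &mut calls, &mut pending_reasoning);
    out
}
```

In this model, removing the assignment in the `FunctionCallOutput` arm reproduces the leak: the malformed sequence would attach the stale reasoning to the later assistant message.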
PR Finalization
MEDIUM fix applied
Tests: Two new tests in
73 tests passed in Docs
Lint / format
Commit: |
Summary
mlxcel-server dropped the assistant `reasoning` field on input, so prior-turn reasoning was lost in multi-turn interleaved-reasoning conversations. Chat templates that render `message.get('reasoning')` (e.g. Gemma 4) never received it. This is the input-side fix; the output side (streaming `reasoning_content`) was already complete.
What changed
- `src/server/types/request.rs`: add `reasoning: Option<String>` to `Message`, accepting both `reasoning` and the OpenAI-compatible `reasoning_content` via `#[serde(alias = "reasoning_content")]`, omitted from output when absent.
- `src/server/chat_request.rs`: route requests carrying a non-empty reasoning field through the raw-JSON render path (the typed `ChatMessage` path only carries role + content). In `build_raw_json_messages_with_thinking`, forward `msg["reasoning"]` using the same decision applied to inline `<think>` blocks.
- `src/server/anthropic_translator.rs`: assistant `thinking` blocks are forwarded onto the parallel `reasoning` field instead of being dropped.
- `src/server/responses_translator.rs`: a Responses `reasoning` input item is buffered and attached to the following assistant turn's `reasoning` field; it does not leak across a turn boundary.
- `Message` literals updated for the new field; tests added.
preserve_thinking interaction and double-injection
- `preserve_thinking=false` (default): a prior assistant turn (before the rolling checkpoint) has both its inline `<think>` block and its parallel `reasoning` field dropped, so prior thinking is removed consistently and cannot leak into the next prompt.
- `preserve_thinking=true`: the reasoning field is forwarded for retained turns.
- When the content already carries an inline `<think>` block, the parallel `reasoning` field is not also forwarded, so templates that render both channels never emit the same reasoning twice.
Cross-API wiring
Text-only messages and tool-result messages have no reasoning to carry, so they are unaffected.
Test plan
- `cargo test --lib --features metal,accelerate -- server::chat_request:: server::types::request:: server::anthropic_translator:: server::responses_translator::` (109 passed)
- `cargo clippy --lib --tests --features metal,accelerate -- -D warnings` (clean)
- `cargo fmt --check` (clean)
New tests: multi-turn render asserts reasoning reaches the prompt when `preserve_thinking=true` and is stripped by default; double-injection guard; raw-path routing; serde `reasoning`/`reasoning_content` round-trip; Anthropic and Responses forwarding.
Closes #362