Skip to content

fix(server): preserve assistant reasoning field across turns#394

Merged
inureyes merged 2 commits into
mainfrom
fix/issue-362-preserve-assistant-reasoning
Jun 22, 2026
Merged

fix(server): preserve assistant reasoning field across turns#394
inureyes merged 2 commits into
mainfrom
fix/issue-362-preserve-assistant-reasoning

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

mlxcel-server dropped the assistant reasoning field on input, so prior-turn reasoning was lost in multi-turn interleaved-reasoning conversations. Chat templates that render message.get('reasoning') (e.g. Gemma 4) never received it. This is the input-side fix; the output side (streaming reasoning_content) was already complete.

What changed

  • src/server/types/request.rs: add reasoning: Option<String> to Message, accepting both reasoning and the OpenAI-compatible reasoning_content via #[serde(alias = "reasoning_content")], omitted from output when absent.
  • src/server/chat_request.rs: route requests carrying a non-empty reasoning field through the raw-JSON render path (the typed ChatMessage path only carries role + content). In build_raw_json_messages_with_thinking, forward msg["reasoning"] using the same decision applied to inline <think> blocks.
  • src/server/anthropic_translator.rs: assistant thinking blocks are forwarded onto the parallel reasoning field instead of being dropped.
  • src/server/responses_translator.rs: a Responses reasoning input item is buffered and attached to the following assistant turn's reasoning field; it does not leak across a turn boundary.
  • Existing Message literals updated for the new field; tests added.

preserve_thinking interaction and double-injection

  • preserve_thinking=false (default): a prior assistant turn (before the rolling checkpoint) has both its inline <think> block and its parallel reasoning field dropped, so prior thinking is removed consistently and cannot leak into the next prompt.
  • preserve_thinking=true: the reasoning field is forwarded for retained turns.
  • Double-injection guard: when a retained turn's content already contains an inline <think> block, the parallel reasoning field is not also forwarded, so templates that render both channels never emit the same reasoning twice.

Cross-API wiring

  • Anthropic Messages and Responses translators both reconstruct assistant turns that reach the chat-template path, so both forward reasoning. The Anthropic Text-only message shape and tool-result messages have no reasoning to carry, so they are unaffected.

Test plan

  • cargo test --lib --features metal,accelerate -- server::chat_request:: server::types::request:: server::anthropic_translator:: server::responses_translator:: (109 passed)
  • cargo clippy --lib --tests --features metal,accelerate -- -D warnings (clean)
  • cargo fmt --check (clean)

New tests: multi-turn render asserts reasoning reaches the prompt when preserve_thinking=true and is stripped by default; double-injection guard; raw-path routing; serde reasoning/reasoning_content round-trip; Anthropic and Responses forwarding.

Closes #362

mlxcel-server dropped the assistant `reasoning` field on input, so prior-turn reasoning was lost in multi-turn interleaved-reasoning conversations. Chat templates that render `message.get('reasoning')` (e.g. Gemma 4) never received it, so the model could not see its own prior thinking across turns. The output side (streaming reasoning_content) was already complete; this is the input-side fix for issue #362.

Add `reasoning: Option<String>` to the `Message` struct, accepting both `reasoning` and the OpenAI-compatible `reasoning_content` spelling via serde alias. Requests that carry a non-empty reasoning field now take the raw-JSON render path (alongside tool-bearing requests) so the field reaches the template; the typed ChatMessage path only carries role and content and would otherwise drop it.

In `build_raw_json_messages_with_thinking`, forward the reasoning field using the same decision applied to inline `<think>` blocks: drop it when the turn is stripped by the rolling checkpoint (preserve_thinking=false, the default), forward it otherwise. Skip forwarding when the content already carries an inline `<think>` block so the same reasoning is never double-injected across the two channels.

Wire the same forwarding into the Anthropic Messages translator (assistant `thinking` blocks become the reasoning field) and the Responses API input mapping (a `reasoning` item rides on the following assistant turn) for cross-API consistency.

Add regression tests: the multi-turn render path asserts reasoning reaches the prompt when preserve_thinking=true and is stripped by default, a double-injection guard test, the raw-path routing test, a serde reasoning/reasoning_content round-trip, and per-translator forwarding tests.
@inureyes inureyes added status:review Under review type:bug Bug fixes, error corrections, or issue resolutions priority:medium Medium priority area:core mlxcel-core: MLX FFI, primitives, KV cache, layers area:inference Generation, sampling, decoding (incl. speculative, DRY) labels Jun 22, 2026
A Reasoning input item immediately followed by a FunctionCallOutput with no
preceding FunctionCall left pending_reasoning alive because the tool-call
flush block (which calls pending_reasoning.take()) was skipped. The buffered
reasoning then leaked onto the next assistant turn, violating the invariant
that a reasoning item not followed by an assistant turn before the next turn
boundary is dropped.

Fix: mirror the Message arm by clearing pending_reasoning = None after the
Tool message is pushed in the FunctionCallOutput arm. In the normal
Reasoning->FunctionCall->FunctionCallOutput flow the flush already consumes
the reasoning via take(), so the new assignment is a no-op and the normal
flow is unaffected.

Two regression tests added: one asserts no leak in the malformed
[Reasoning, FunctionCallOutput, Message(assistant)] sequence; the other
confirms the normal tool flow still attaches reasoning to the function-call
assistant turn.

docs/responses-api.md: correct the stale claim that reasoning input items
are not fed back into the next prompt.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization

MEDIUM fix applied

src/server/responses_translator.rs - FunctionCallOutput arm: the arm was already calling pending_reasoning.take() inside the tool-call flush block, but that block is skipped when pending_tool_calls is empty (malformed input: Reasoning item immediately followed by FunctionCallOutput with no preceding FunctionCall). The buffered reasoning then leaked onto the next assistant turn. Fixed by adding pending_reasoning = None; after the Tool message push, matching the pattern already used in the Message arm. In the normal Reasoning->FunctionCall->FunctionCallOutput flow the flush take() already consumes the reasoning, so this assignment is a no-op.

Tests

Two new tests in server::responses_translator::tests:

  • reasoning_does_not_leak_when_function_call_output_has_no_preceding_function_call: asserts reasoning == None on the assistant turn after the malformed sequence [Reasoning, FunctionCallOutput(no preceding FunctionCall), Message(assistant)].
  • reasoning_attaches_to_function_call_turn_in_normal_tool_flow: confirms the normal [Reasoning, FunctionCall, FunctionCallOutput, Message(assistant)] flow still attaches reasoning to the function-call assistant turn and not to the later text assistant turn.

73 tests passed in server::responses_translator:: and server::chat_request::.

Docs

docs/responses-api.md: corrected the stale claim that reasoning input items "are not fed back into the next prompt" to accurately describe the preservation behavior added by this PR.

Lint / format

cargo clippy --lib --tests --features metal,accelerate -- -D warnings: clean. The fmt tool cannot parse pre-existing let-chains in this file (confirmed pre-existing before this PR); the new code is formatted consistently with the surrounding file.

Commit: bd3bef703

@inureyes inureyes added status:done Completed and removed status:review Under review labels Jun 22, 2026
@inureyes inureyes merged commit f0576fa into main Jun 22, 2026
5 checks passed
@inureyes inureyes deleted the fix/issue-362-preserve-assistant-reasoning branch June 22, 2026 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:core mlxcel-core: MLX FFI, primitives, KV cache, layers area:inference Generation, sampling, decoding (incl. speculative, DRY) priority:medium Medium priority status:done Completed type:bug Bug fixes, error corrections, or issue resolutions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(server): preserve assistant reasoning field across turns (interleaved reasoning)

1 participant