fix(server): preserve assistant reasoning field across turns by inureyes · Pull Request #394 · lablup/mlxcel

inureyes · 2026-06-22T00:08:11Z

Summary

mlxcel-server dropped the assistant reasoning field on input, so prior-turn reasoning was lost in multi-turn interleaved-reasoning conversations. Chat templates that render message.get('reasoning') (e.g. Gemma 4) never received it. This is the input-side fix; the output side (streaming reasoning_content) was already complete.

What changed

src/server/types/request.rs: add reasoning: Option<String> to Message, accepting both reasoning and the OpenAI-compatible reasoning_content via #[serde(alias = "reasoning_content")], omitted from output when absent.
src/server/chat_request.rs: route requests carrying a non-empty reasoning field through the raw-JSON render path (the typed ChatMessage path only carries role + content). In build_raw_json_messages_with_thinking, forward msg["reasoning"] using the same decision applied to inline <think> blocks.
src/server/anthropic_translator.rs: assistant thinking blocks are forwarded onto the parallel reasoning field instead of being dropped.
src/server/responses_translator.rs: a Responses reasoning input item is buffered and attached to the following assistant turn's reasoning field; it does not leak across a turn boundary.
Existing Message literals updated for the new field; tests added.

preserve_thinking interaction and double-injection

preserve_thinking=false (default): a prior assistant turn (before the rolling checkpoint) has both its inline <think> block and its parallel reasoning field dropped, so prior thinking is removed consistently and cannot leak into the next prompt.
preserve_thinking=true: the reasoning field is forwarded for retained turns.
Double-injection guard: when a retained turn's content already contains an inline <think> block, the parallel reasoning field is not also forwarded, so templates that render both channels never emit the same reasoning twice.

Cross-API wiring

Anthropic Messages and Responses translators both reconstruct assistant turns that reach the chat-template path, so both forward reasoning. The Anthropic Text-only message shape and tool-result messages have no reasoning to carry, so they are unaffected.

Test plan

cargo test --lib --features metal,accelerate -- server::chat_request:: server::types::request:: server::anthropic_translator:: server::responses_translator:: (109 passed)
cargo clippy --lib --tests --features metal,accelerate -- -D warnings (clean)
cargo fmt --check (clean)

New tests: multi-turn render asserts reasoning reaches the prompt when preserve_thinking=true and is stripped by default; double-injection guard; raw-path routing; serde reasoning/reasoning_content round-trip; Anthropic and Responses forwarding.

Closes #362

mlxcel-server dropped the assistant `reasoning` field on input, so prior-turn reasoning was lost in multi-turn interleaved-reasoning conversations. Chat templates that render `message.get('reasoning')` (e.g. Gemma 4) never received it, so the model could not see its own prior thinking across turns. The output side (streaming reasoning_content) was already complete; this is the input-side fix for issue #362. Add `reasoning: Option<String>` to the `Message` struct, accepting both `reasoning` and the OpenAI-compatible `reasoning_content` spelling via serde alias. Requests that carry a non-empty reasoning field now take the raw-JSON render path (alongside tool-bearing requests) so the field reaches the template; the typed ChatMessage path only carries role and content and would otherwise drop it. In `build_raw_json_messages_with_thinking`, forward the reasoning field using the same decision applied to inline `<think>` blocks: drop it when the turn is stripped by the rolling checkpoint (preserve_thinking=false, the default), forward it otherwise. Skip forwarding when the content already carries an inline `<think>` block so the same reasoning is never double-injected across the two channels. Wire the same forwarding into the Anthropic Messages translator (assistant `thinking` blocks become the reasoning field) and the Responses API input mapping (a `reasoning` item rides on the following assistant turn) for cross-API consistency. Add regression tests: the multi-turn render path asserts reasoning reaches the prompt when preserve_thinking=true and is stripped by default, a double-injection guard test, the raw-path routing test, a serde reasoning/reasoning_content round-trip, and per-translator forwarding tests.

A Reasoning input item immediately followed by a FunctionCallOutput with no preceding FunctionCall left pending_reasoning alive because the tool-call flush block (which calls pending_reasoning.take()) was skipped. The buffered reasoning then leaked onto the next assistant turn, violating the invariant that a reasoning item not followed by an assistant turn before the next turn boundary is dropped. Fix: mirror the Message arm by clearing pending_reasoning = None after the Tool message is pushed in the FunctionCallOutput arm. In the normal Reasoning->FunctionCall->FunctionCallOutput flow the flush already consumes the reasoning via take(), so the new assignment is a no-op and the normal flow is unaffected. Two regression tests added: one asserts no leak in the malformed [Reasoning, FunctionCallOutput, Message(assistant)] sequence; the other confirms the normal tool flow still attaches reasoning to the function-call assistant turn. docs/responses-api.md: correct the stale claim that reasoning input items are not fed back into the next prompt.

inureyes · 2026-06-22T00:24:03Z

PR Finalization

MEDIUM fix applied

src/server/responses_translator.rs - FunctionCallOutput arm: the arm was already calling pending_reasoning.take() inside the tool-call flush block, but that block is skipped when pending_tool_calls is empty (malformed input: Reasoning item immediately followed by FunctionCallOutput with no preceding FunctionCall). The buffered reasoning then leaked onto the next assistant turn. Fixed by adding pending_reasoning = None; after the Tool message push, matching the pattern already used in the Message arm. In the normal Reasoning->FunctionCall->FunctionCallOutput flow the flush take() already consumes the reasoning, so this assignment is a no-op.

Tests

Two new tests in server::responses_translator::tests:

reasoning_does_not_leak_when_function_call_output_has_no_preceding_function_call: asserts reasoning == None on the assistant turn after the malformed sequence [Reasoning, FunctionCallOutput(no preceding FunctionCall), Message(assistant)].
reasoning_attaches_to_function_call_turn_in_normal_tool_flow: confirms the normal [Reasoning, FunctionCall, FunctionCallOutput, Message(assistant)] flow still attaches reasoning to the function-call assistant turn and not to the later text assistant turn.

73 tests passed in server::responses_translator:: and server::chat_request::.

Docs

docs/responses-api.md: corrected the stale claim that reasoning input items "are not fed back into the next prompt" to accurately describe the preservation behavior added by this PR.

Lint / format

cargo clippy --lib --tests --features metal,accelerate -- -D warnings: clean. The fmt tool cannot parse pre-existing let-chains in this file (confirmed pre-existing before this PR); the new code is formatted consistently with the surrounding file.

Commit: bd3bef703

inureyes added status:review Under review type:bug Bug fixes, error corrections, or issue resolutions priority:medium Medium priority area:core mlxcel-core: MLX FFI, primitives, KV cache, layers area:inference Generation, sampling, decoding (incl. speculative, DRY) labels Jun 22, 2026

inureyes added status:done Completed and removed status:review Under review labels Jun 22, 2026

inureyes merged commit f0576fa into main Jun 22, 2026
5 checks passed

inureyes deleted the fix/issue-362-preserve-assistant-reasoning branch June 22, 2026 00:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): preserve assistant reasoning field across turns#394

fix(server): preserve assistant reasoning field across turns#394
inureyes merged 2 commits into
mainfrom
fix/issue-362-preserve-assistant-reasoning

inureyes commented Jun 22, 2026

Uh oh!

inureyes commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

inureyes commented Jun 22, 2026

Summary

What changed

preserve_thinking interaction and double-injection

Cross-API wiring

Test plan

Uh oh!

inureyes commented Jun 22, 2026

PR Finalization

MEDIUM fix applied

Tests

Docs

Lint / format

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant