Skip to content

fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM#1204

Open
planetf1 wants to merge 5 commits into
generative-computing:mainfrom
planetf1:fix/issue-1093-thinking-vllm
Open

fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM#1204
planetf1 wants to merge 5 commits into
generative-computing:mainfrom
planetf1:fix/issue-1093-thinking-vllm

Conversation

@planetf1
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 commented Jun 4, 2026

ModelOption.THINKING=True did nothing for vLLM-served models. THINKING=False caused an API error. This PR fixes both.

Before: thinking-model users on vLLM had to resort to raw extra_body workarounds, and setting THINKING=False to suppress reasoning would raise a TypeError from the OpenAI SDK.
After: THINKING=True/False works correctly out of the box against vLLM (Qwen3, Gemma 4, GLM-4.5, and any model using --reasoning-parser), and against Ollama's OpenAI-compatible endpoint.

ModelOption.THINKING Before After
True reasoning_effort="medium" only — vLLM ignores it reasoning_effort="medium" + enable_thinking=True
False reasoning_effort=False — SDK TypeError enable_thinking=False; no reasoning_effort sent
"medium"/"high" unchanged unchanged

Scope: OpenAIBackend only (mellea/backends/openai.py). OllamaBackend (native SDK) was already correct. LiteLLMBackend is handled in sibling PR #1207.

Closes #1093. Part of #1201.

Test plan

  • test_reasoning_effort_conditional_passing — Tests 3+4 cover THINKING=True/False streaming; Test 5 verifies user extra_body merges correctly
  • test_reasoning_effort_bool_true — asserts both reasoning_effort="medium" and enable_thinking=True; adapter_name survives
  • test_reasoning_effort_bool_falseenable_thinking=False, no reasoning_effort, adapter_name survives
  • Live: Qwen3.6-27B on vLLM — THINKING=False suppresses <think>
  • Live: Gemma-4-31B-it on vLLM — THINKING=True enables reasoning, THINKING=False suppresses, merge works ✅
  • Live: qwen3.6:35b on Ollama 0.30.4 (OpenAI-compat) — THINKING=True populates mot._thinking, THINKING=False clean ✅
Why enable_thinking in extra_body — technical detail

chat_template_kwargs cannot be a direct kwarg to the OpenAI SDK's create() — the SDK raises TypeError on unknown parameters. extra_body is the SDK escape hatch: its contents are merged into the top-level HTTP request JSON body before the call. vLLM then reads chat_template_kwargs from the body and injects them into the Jinja chat template at render time.

enable_thinking is the vLLM-standardised toggle (Qwen3, Gemma 4, GLM-4.5, MiniMax-M2, and most models added since early 2026). When reasoning_effort is sent, vLLM v0.22.0+ auto-injects enable_thinking for recognised reasoning parsers — so the dual-set (reasoning_effort + enable_thinking) is consistent with that direction and ensures correctness even for servers that haven't yet updated. See the vLLM reasoning outputs docs.

Out of scope: IBM Granite 3.2 and DeepSeek-V3.1 use thinking=True (not enable_thinking) in their chat templates. The bool branch sends enable_thinking, which those models won't act on — this is a pre-existing gap, not a regression introduced here.

Regression safety

Sending chat_template_kwargs in extra_body to servers that don't understand it (OpenAI proper, DeepSeek, etc.) is safe — unknown extra_body fields are silently ignored.

If a caller passes THINKING=True and their own extra_body in model_options, we merge them rather than spreading two dicts (which would raise TypeError). Tested.

For the intrinsic path, chat_template_kwargs.adapter_name is written before the THINKING block. The new code merges enable_thinking into the existing dict rather than replacing it; tests assert adapter_name survives.

Minor: generate_log.extra["thinking"] now records the raw caller value (True/False) rather than the normalised "medium" string. Nothing reads this field functionally.

…s for vLLM

Closes generative-computing#1093.

Previously, ModelOption.THINKING was unconditionally mapped to
reasoning_effort, which only works for OpenAI/DeepSeek. For vLLM-served
models like Qwen3, the correct toggle is
extra_body.chat_template_kwargs.enable_thinking. Passing False via
reasoning_effort was also invalid for OpenAI (the SDK rejects it).

New behaviour:
- THINKING=True  → reasoning_effort="medium" (OpenAI/DeepSeek) +
                   chat_template_kwargs.enable_thinking=True (vLLM/Qwen3)
- THINKING=False → chat_template_kwargs.enable_thinking=False (vLLM/Qwen3);
                   no reasoning_effort sent (OpenAI disables reasoning by default)
- THINKING=str   → reasoning_effort=str as before (OpenAI-specific levels)

Servers that don't recognise one mechanism ignore it in extra_body, so
sending both is safe. The fix applies to both the intrinsic (non-streaming)
and chat-context (streaming) code paths. Tests added for True and False cases.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@github-actions github-actions Bot added the bug Something isn't working label Jun 4, 2026
planetf1 added 2 commits June 4, 2026 14:28
…sic tests

Both bool THINKING tests use an EmbeddedIntrinsicAdapter that sets
adapter_name in chat_template_kwargs before the THINKING block runs.
Without an assertion that adapter_name is preserved, the merge logic
could silently overwrite it with a fresh dict and the tests would
still pass.

Add the missing assertion to test_reasoning_effort_bool_true and
test_reasoning_effort_bool_false so a future regression in the merge
path is caught immediately.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…rror

When THINKING=bool, the streaming path now writes extra_params["extra_body"].
If the caller also passes extra_body in model_options it survives
filter_chat_completions_kwargs and appears in the backend_specific spread —
giving create() two extra_body sources and raising TypeError.

Pre-compute backend_specific before the create() call, pop any user-supplied
extra_body from it, and merge it into extra_params["extra_body"] so there is a
single authoritative source. Mirrors the pattern the intrinsic path already
uses (explicit extra_body= kwarg at line 715).

Also update the post_processing docstring: the `thinking` parameter now carries
the raw caller value (True/False/string), not always a string level.

Regression test (Test 5 in test_reasoning_effort_conditional_passing): passes
THINKING=True alongside extra_body and asserts both keys survive in the merged
dict with no TypeError.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@markstur
Copy link
Copy Markdown
Contributor

markstur commented Jun 4, 2026

This review was from Qwen:

Review Summary

A bug fix was reviewed that addresses a TypeError when THINKING=True and user-supplied extra_body are passed together.

Issues Found

Severity Issue
Medium Potential data loss: If user passes extra_body={"chat_template_kwargs": {...}} while THINKING=True, the user's dict completely overwrites the one containing enable_thinking, silently disabling thinking mode
Low Test gap: Test covers happy path but not the collision edge case

Recommendation

Add deep merge for chat_template_kwargs to prevent data loss:

if "chat_template_kwargs" in eb and "chat_template_kwargs" in user_extra_body:
    merged = {**eb["chat_template_kwargs"], **user_extra_body["chat_template_kwargs"]}
    user_extra_body["chat_template_kwargs"] = merged
eb.update(user_extra_body)
extra_params["extra_body"] = eb

UPDATE: Interrogating a bit... this comment was specifically about openai.py:945-946. Sorry if I shared too soon. 👀

…rve enable_thinking

Shallow eb.update(user_extra_body) was overwriting the backend-set
enable_thinking when the user also supplied extra_body with its own
chat_template_kwargs (e.g. adapter_name). Pop chat_template_kwargs
before the update and merge the two dicts key-by-key, matching the
same pattern used in LiteLLMBackend.

Add regression test (Test 6) covering the clobber path; the existing
Test 5 used a non-overlapping key and did not catch the bug.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1
Copy link
Copy Markdown
Contributor Author

planetf1 commented Jun 4, 2026

Thanks @markstur — the self-correction was right to be cautious, but Qwen spotted a real gap.

The merge in openai.py was shallow (eb.update(user_extra_body)), so a user passing extra_body={"chat_template_kwargs": {...}} alongside THINKING=True would have their dict overwrite the backend-set enable_thinking. The identical issue was already caught and fixed in LiteLLMBackend (PR #1207); I've now applied the same deep-merge pattern here and added a regression test (Test 6) that covers the overlapping-key path — the existing Test 5 used a non-overlapping key (guided_json) and wouldn't have caught it.

Copy link
Copy Markdown
Contributor

@markstur markstur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it's an improvement but there seems to be an edge case that'd be good to fix now.

Looks like there is a scenario where the shallow update is not right and a deep copy is needed. I tried to put the test and fix inline but I wouldn't trust the formatting.

Regarding nits about "is None", if you prefer as-is for readablity it looks fine. There might be room for improvement by using falsey conditions but sometimes that's too subtle to be an improvement.

Comment thread mellea/backends/openai.py Outdated
Comment thread mellea/backends/openai.py Outdated
Comment thread mellea/backends/openai.py
Comment thread test/backends/test_openai_ollama.py
@markstur
Copy link
Copy Markdown
Contributor

markstur commented Jun 4, 2026

if you agree that redundant None default and/or replacing is None with simpler truthy/falsy ifs, then please scan for additional places. I didn't want to nit, nit, nit because I'm thinking the author might prefer it the way it is written

…guard

Two call sites in OpenAIBackend used `.get(ModelOption.THINKING, None)`;
`dict.get` already defaults to None so the explicit argument was redundant.

Added a brief comment on the `is not None` guard to clarify why a plain
`if thinking:` check would be wrong: `THINKING=False` is a valid caller
value (suppresses the think block) and would be silently skipped if the
check were truthy-only.

Addresses review comments from @markstur on PR generative-computing#1204.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 requested a review from markstur June 5, 2026 13:49
Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope split with #1207 looks right. One note inline; one out-of-diff.

Out of diffmellea/backends/model_options.py:30 (THINKING docstring): with this PR, True/False and "medium"/"high" route through different mechanisms, and True is the only value that engages models like Gemma 4 that don't auto-inject enable_thinking from reasoning_effort. THINKING="medium" against vLLM/Gemma 4 silently does nothing. Either expand the docstring there, or open a follow-up against workstream 6 of #1201 to unify the semantics.

Comment thread mellea/backends/openai.py
eb["chat_template_kwargs"] = {
**eb.get("chat_template_kwargs", {}),
**user_ctk,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a one-liner noting chat_template_kwargs is the only key both sides write, so it's the only one that needs deep-merge. If the rewriter ever populates other nested vLLM keys (guided_decoding, structured_outputs), the eb.update(user_extra_body) above would clobber them silently.

@ajbozarth
Copy link
Copy Markdown
Contributor

Trying to do the review above Claude really struggled with scope drift in its review. constantly going off track and down rabbit holes into "what ifs" around the changes this PR makes, so I am unsure if the above review caught everything. A rough re-through of my own of the code change (not the tests) it looks solid though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: improve ModelOption.THINKING with vLLM through OpenAI

3 participants