fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM by planetf1 · Pull Request #1204 · generative-computing/mellea

planetf1 · 2026-06-04T13:12:33Z

ModelOption.THINKING=True did nothing for vLLM-served models. THINKING=False caused an API error. This PR fixes both.

Before: thinking-model users on vLLM had to resort to raw extra_body workarounds, and setting THINKING=False to suppress reasoning would raise a TypeError from the OpenAI SDK.
After: THINKING=True/False works correctly out of the box against vLLM (Qwen3, Gemma 4, GLM-4.5, and any model using --reasoning-parser), and against Ollama's OpenAI-compatible endpoint.

`ModelOption.THINKING`	Before	After
`True`	`reasoning_effort="medium"` only — vLLM ignores it	`reasoning_effort="medium"` + `enable_thinking=True`
`False`	`reasoning_effort=False` — SDK TypeError	`enable_thinking=False`; no `reasoning_effort` sent
`"medium"/"high"`	unchanged	unchanged

Scope: OpenAIBackend only (mellea/backends/openai.py). OllamaBackend (native SDK) was already correct. LiteLLMBackend is handled in sibling PR #1207.

Closes #1093. Part of #1201.

Test plan

test_reasoning_effort_conditional_passing — Tests 3+4 cover THINKING=True/False streaming; Test 5 verifies user extra_body merges correctly
test_reasoning_effort_bool_true — asserts both reasoning_effort="medium" and enable_thinking=True; adapter_name survives
test_reasoning_effort_bool_false — enable_thinking=False, no reasoning_effort, adapter_name survives
Live: Qwen3.6-27B on vLLM — THINKING=False suppresses <think> ✅
Live: Gemma-4-31B-it on vLLM — THINKING=True enables reasoning, THINKING=False suppresses, merge works ✅
Live: qwen3.6:35b on Ollama 0.30.4 (OpenAI-compat) — THINKING=True populates mot._thinking, THINKING=False clean ✅

Why enable_thinking in extra_body — technical detail

chat_template_kwargs cannot be a direct kwarg to the OpenAI SDK's create() — the SDK raises TypeError on unknown parameters. extra_body is the SDK escape hatch: its contents are merged into the top-level HTTP request JSON body before the call. vLLM then reads chat_template_kwargs from the body and injects them into the Jinja chat template at render time.

enable_thinking is the vLLM-standardised toggle (Qwen3, Gemma 4, GLM-4.5, MiniMax-M2, and most models added since early 2026). When reasoning_effort is sent, vLLM v0.22.0+ auto-injects enable_thinking for recognised reasoning parsers — so the dual-set (reasoning_effort + enable_thinking) is consistent with that direction and ensures correctness even for servers that haven't yet updated. See the vLLM reasoning outputs docs.

Out of scope: IBM Granite 3.2 and DeepSeek-V3.1 use thinking=True (not enable_thinking) in their chat templates. The bool branch sends enable_thinking, which those models won't act on — this is a pre-existing gap, not a regression introduced here.

Regression safety

Sending chat_template_kwargs in extra_body to servers that don't understand it (OpenAI proper, DeepSeek, etc.) is safe — unknown extra_body fields are silently ignored.

If a caller passes THINKING=True and their own extra_body in model_options, we merge them rather than spreading two dicts (which would raise TypeError). Tested.

For the intrinsic path, chat_template_kwargs.adapter_name is written before the THINKING block. The new code merges enable_thinking into the existing dict rather than replacing it; tests assert adapter_name survives.

Minor: generate_log.extra["thinking"] now records the raw caller value (True/False) rather than the normalised "medium" string. Nothing reads this field functionally.

…s for vLLM Closes generative-computing#1093. Previously, ModelOption.THINKING was unconditionally mapped to reasoning_effort, which only works for OpenAI/DeepSeek. For vLLM-served models like Qwen3, the correct toggle is extra_body.chat_template_kwargs.enable_thinking. Passing False via reasoning_effort was also invalid for OpenAI (the SDK rejects it). New behaviour: - THINKING=True → reasoning_effort="medium" (OpenAI/DeepSeek) + chat_template_kwargs.enable_thinking=True (vLLM/Qwen3) - THINKING=False → chat_template_kwargs.enable_thinking=False (vLLM/Qwen3); no reasoning_effort sent (OpenAI disables reasoning by default) - THINKING=str → reasoning_effort=str as before (OpenAI-specific levels) Servers that don't recognise one mechanism ignore it in extra_body, so sending both is safe. The fix applies to both the intrinsic (non-streaming) and chat-context (streaming) code paths. Tests added for True and False cases. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

…sic tests Both bool THINKING tests use an EmbeddedIntrinsicAdapter that sets adapter_name in chat_template_kwargs before the THINKING block runs. Without an assertion that adapter_name is preserved, the merge logic could silently overwrite it with a fresh dict and the tests would still pass. Add the missing assertion to test_reasoning_effort_bool_true and test_reasoning_effort_bool_false so a future regression in the merge path is caught immediately. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

…rror When THINKING=bool, the streaming path now writes extra_params["extra_body"]. If the caller also passes extra_body in model_options it survives filter_chat_completions_kwargs and appears in the backend_specific spread — giving create() two extra_body sources and raising TypeError. Pre-compute backend_specific before the create() call, pop any user-supplied extra_body from it, and merge it into extra_params["extra_body"] so there is a single authoritative source. Mirrors the pattern the intrinsic path already uses (explicit extra_body= kwarg at line 715). Also update the post_processing docstring: the `thinking` parameter now carries the raw caller value (True/False/string), not always a string level. Regression test (Test 5 in test_reasoning_effort_conditional_passing): passes THINKING=True alongside extra_body and asserts both keys survive in the merged dict with no TypeError. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

markstur · 2026-06-04T20:54:28Z

This review was from Qwen:

Review Summary

A bug fix was reviewed that addresses a TypeError when THINKING=True and user-supplied extra_body are passed together.

Issues Found

Severity	Issue
Medium	Potential data loss: If user passes `extra_body={"chat_template_kwargs": {...}}` while `THINKING=True`, the user's dict completely overwrites the one containing `enable_thinking`, silently disabling thinking mode
Low	Test gap: Test covers happy path but not the collision edge case

Recommendation

Add deep merge for chat_template_kwargs to prevent data loss:

if "chat_template_kwargs" in eb and "chat_template_kwargs" in user_extra_body:
    merged = {**eb["chat_template_kwargs"], **user_extra_body["chat_template_kwargs"]}
    user_extra_body["chat_template_kwargs"] = merged
eb.update(user_extra_body)
extra_params["extra_body"] = eb

UPDATE: Interrogating a bit... this comment was specifically about openai.py:945-946. Sorry if I shared too soon. 👀

…rve enable_thinking Shallow eb.update(user_extra_body) was overwriting the backend-set enable_thinking when the user also supplied extra_body with its own chat_template_kwargs (e.g. adapter_name). Pop chat_template_kwargs before the update and merge the two dicts key-by-key, matching the same pattern used in LiteLLMBackend. Add regression test (Test 6) covering the clobber path; the existing Test 5 used a non-overlapping key and did not catch the bug. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

planetf1 · 2026-06-04T21:20:51Z

Thanks @markstur — the self-correction was right to be cautious, but Qwen spotted a real gap.

The merge in openai.py was shallow (eb.update(user_extra_body)), so a user passing extra_body={"chat_template_kwargs": {...}} alongside THINKING=True would have their dict overwrite the backend-set enable_thinking. The identical issue was already caught and fixed in LiteLLMBackend (PR #1207); I've now applied the same deep-merge pattern here and added a regression test (Test 6) that covers the overlapping-key path — the existing Test 5 used a non-overlapping key (guided_json) and wouldn't have caught it.

markstur

In general, it's an improvement but there seems to be an edge case that'd be good to fix now.

Looks like there is a scenario where the shallow update is not right and a deep copy is needed. I tried to put the test and fix inline but I wouldn't trust the formatting.

Regarding nits about "is None", if you prefer as-is for readablity it looks fine. There might be room for improvement by using falsey conditions but sometimes that's too subtle to be an improvement.

markstur · 2026-06-04T21:48:31Z

if you agree that redundant None default and/or replacing is None with simpler truthy/falsy ifs, then please scan for additional places. I didn't want to nit, nit, nit because I'm thinking the author might prefer it the way it is written

@markstur

…guard Two call sites in OpenAIBackend used `.get(ModelOption.THINKING, None)`; `dict.get` already defaults to None so the explicit argument was redundant. Added a brief comment on the `is not None` guard to clarify why a plain `if thinking:` check would be wrong: `THINKING=False` is a valid caller value (suppresses the think block) and would be silently skipped if the check were truthy-only. Addresses review comments from @markstur on PR generative-computing#1204. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

ajbozarth

Scope split with #1207 looks right. One note inline; one out-of-diff.

Out of diff — mellea/backends/model_options.py:30 (THINKING docstring): with this PR, True/False and "medium"/"high" route through different mechanisms, and True is the only value that engages models like Gemma 4 that don't auto-inject enable_thinking from reasoning_effort. THINKING="medium" against vLLM/Gemma 4 silently does nothing. Either expand the docstring there, or open a follow-up against workstream 6 of #1201 to unify the semantics.

ajbozarth · 2026-06-05T16:46:06Z

+                eb["chat_template_kwargs"] = {
+                    **eb.get("chat_template_kwargs", {}),
+                    **user_ctk,
+                }


Add a one-liner noting chat_template_kwargs is the only key both sides write, so it's the only one that needs deep-merge. If the rewriter ever populates other nested vLLM keys (guided_decoding, structured_outputs), the eb.update(user_extra_body) above would clobber them silently.

ajbozarth · 2026-06-05T16:47:59Z

Trying to do the review above Claude really struggled with scope drift in its review. constantly going off track and down rabbit holes into "what ifs" around the changes this PR makes, so I am unsure if the above review caught everything. A rough re-through of my own of the code change (not the tests) it looks solid though.

github-actions Bot added the bug Something isn't working label Jun 4, 2026

planetf1 added 2 commits June 4, 2026 14:28

This was referenced Jun 4, 2026

Multi-turn chat silently drops reasoning output from thinking models #1201

Open

fix(litellm): pass api_base and enable_thinking for bool THINKING #1206

Closed

planetf1 marked this pull request as ready for review June 4, 2026 18:17

planetf1 requested a review from a team as a code owner June 4, 2026 18:17

planetf1 requested review from ajbozarth, akihikokuroda and markstur June 4, 2026 18:17

planetf1 mentioned this pull request Jun 4, 2026

fix(litellm): pass api_base and set enable_thinking for bool THINKING #1207

Merged

3 tasks

markstur requested changes Jun 4, 2026

View reviewed changes

Comment thread mellea/backends/openai.py Outdated

Comment thread mellea/backends/openai.py Outdated

Comment thread mellea/backends/openai.py

Comment thread test/backends/test_openai_ollama.py

planetf1 requested a review from markstur June 5, 2026 13:49

ajbozarth reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM#1204

fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM#1204
planetf1 wants to merge 5 commits into
generative-computing:mainfrom
planetf1:fix/issue-1093-thinking-vllm

planetf1 commented Jun 4, 2026 •

edited

Loading

Uh oh!

markstur commented Jun 4, 2026 •

edited

Loading

Uh oh!

planetf1 commented Jun 4, 2026

Uh oh!

markstur left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

markstur commented Jun 4, 2026

Uh oh!

ajbozarth left a comment

Uh oh!

ajbozarth Jun 5, 2026

Uh oh!

ajbozarth commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

planetf1 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

markstur commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review Summary

Issues Found

Recommendation

Uh oh!

planetf1 commented Jun 4, 2026

Uh oh!

markstur left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

markstur commented Jun 4, 2026

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

ajbozarth Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

ajbozarth commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

planetf1 commented Jun 4, 2026 •

edited

Loading

markstur commented Jun 4, 2026 •

edited

Loading