fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM#1204
fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM#1204planetf1 wants to merge 5 commits into
Conversation
…s for vLLM Closes generative-computing#1093. Previously, ModelOption.THINKING was unconditionally mapped to reasoning_effort, which only works for OpenAI/DeepSeek. For vLLM-served models like Qwen3, the correct toggle is extra_body.chat_template_kwargs.enable_thinking. Passing False via reasoning_effort was also invalid for OpenAI (the SDK rejects it). New behaviour: - THINKING=True → reasoning_effort="medium" (OpenAI/DeepSeek) + chat_template_kwargs.enable_thinking=True (vLLM/Qwen3) - THINKING=False → chat_template_kwargs.enable_thinking=False (vLLM/Qwen3); no reasoning_effort sent (OpenAI disables reasoning by default) - THINKING=str → reasoning_effort=str as before (OpenAI-specific levels) Servers that don't recognise one mechanism ignore it in extra_body, so sending both is safe. The fix applies to both the intrinsic (non-streaming) and chat-context (streaming) code paths. Tests added for True and False cases. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…sic tests Both bool THINKING tests use an EmbeddedIntrinsicAdapter that sets adapter_name in chat_template_kwargs before the THINKING block runs. Without an assertion that adapter_name is preserved, the merge logic could silently overwrite it with a fresh dict and the tests would still pass. Add the missing assertion to test_reasoning_effort_bool_true and test_reasoning_effort_bool_false so a future regression in the merge path is caught immediately. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…rror When THINKING=bool, the streaming path now writes extra_params["extra_body"]. If the caller also passes extra_body in model_options it survives filter_chat_completions_kwargs and appears in the backend_specific spread — giving create() two extra_body sources and raising TypeError. Pre-compute backend_specific before the create() call, pop any user-supplied extra_body from it, and merge it into extra_params["extra_body"] so there is a single authoritative source. Mirrors the pattern the intrinsic path already uses (explicit extra_body= kwarg at line 715). Also update the post_processing docstring: the `thinking` parameter now carries the raw caller value (True/False/string), not always a string level. Regression test (Test 5 in test_reasoning_effort_conditional_passing): passes THINKING=True alongside extra_body and asserts both keys survive in the merged dict with no TypeError. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
|
This review was from Qwen: Review SummaryA bug fix was reviewed that addresses a Issues Found
RecommendationAdd deep merge for if "chat_template_kwargs" in eb and "chat_template_kwargs" in user_extra_body:
merged = {**eb["chat_template_kwargs"], **user_extra_body["chat_template_kwargs"]}
user_extra_body["chat_template_kwargs"] = merged
eb.update(user_extra_body)
extra_params["extra_body"] = ebUPDATE: Interrogating a bit... this comment was specifically about openai.py:945-946. Sorry if I shared too soon. 👀 |
…rve enable_thinking Shallow eb.update(user_extra_body) was overwriting the backend-set enable_thinking when the user also supplied extra_body with its own chat_template_kwargs (e.g. adapter_name). Pop chat_template_kwargs before the update and merge the two dicts key-by-key, matching the same pattern used in LiteLLMBackend. Add regression test (Test 6) covering the clobber path; the existing Test 5 used a non-overlapping key and did not catch the bug. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
|
Thanks @markstur — the self-correction was right to be cautious, but Qwen spotted a real gap. The merge in |
markstur
left a comment
There was a problem hiding this comment.
In general, it's an improvement but there seems to be an edge case that'd be good to fix now.
Looks like there is a scenario where the shallow update is not right and a deep copy is needed. I tried to put the test and fix inline but I wouldn't trust the formatting.
Regarding nits about "is None", if you prefer as-is for readablity it looks fine. There might be room for improvement by using falsey conditions but sometimes that's too subtle to be an improvement.
|
if you agree that redundant None default and/or replacing is None with simpler truthy/falsy ifs, then please scan for additional places. I didn't want to nit, nit, nit because I'm thinking the author might prefer it the way it is written |
…guard Two call sites in OpenAIBackend used `.get(ModelOption.THINKING, None)`; `dict.get` already defaults to None so the explicit argument was redundant. Added a brief comment on the `is not None` guard to clarify why a plain `if thinking:` check would be wrong: `THINKING=False` is a valid caller value (suppresses the think block) and would be silently skipped if the check were truthy-only. Addresses review comments from @markstur on PR generative-computing#1204. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
ajbozarth
left a comment
There was a problem hiding this comment.
Scope split with #1207 looks right. One note inline; one out-of-diff.
Out of diff — mellea/backends/model_options.py:30 (THINKING docstring): with this PR, True/False and "medium"/"high" route through different mechanisms, and True is the only value that engages models like Gemma 4 that don't auto-inject enable_thinking from reasoning_effort. THINKING="medium" against vLLM/Gemma 4 silently does nothing. Either expand the docstring there, or open a follow-up against workstream 6 of #1201 to unify the semantics.
| eb["chat_template_kwargs"] = { | ||
| **eb.get("chat_template_kwargs", {}), | ||
| **user_ctk, | ||
| } |
There was a problem hiding this comment.
Add a one-liner noting chat_template_kwargs is the only key both sides write, so it's the only one that needs deep-merge. If the rewriter ever populates other nested vLLM keys (guided_decoding, structured_outputs), the eb.update(user_extra_body) above would clobber them silently.
|
Trying to do the review above Claude really struggled with scope drift in its review. constantly going off track and down rabbit holes into "what ifs" around the changes this PR makes, so I am unsure if the above review caught everything. A rough re-through of my own of the code change (not the tests) it looks solid though. |
ModelOption.THINKING=Truedid nothing for vLLM-served models.THINKING=Falsecaused an API error. This PR fixes both.Before: thinking-model users on vLLM had to resort to raw
extra_bodyworkarounds, and settingTHINKING=Falseto suppress reasoning would raise aTypeErrorfrom the OpenAI SDK.After:
THINKING=True/Falseworks correctly out of the box against vLLM (Qwen3, Gemma 4, GLM-4.5, and any model using--reasoning-parser), and against Ollama's OpenAI-compatible endpoint.ModelOption.THINKINGTruereasoning_effort="medium"only — vLLM ignores itreasoning_effort="medium"+enable_thinking=TrueFalsereasoning_effort=False— SDK TypeErrorenable_thinking=False; noreasoning_effortsent"medium"/"high"Scope:
OpenAIBackendonly (mellea/backends/openai.py).OllamaBackend(native SDK) was already correct.LiteLLMBackendis handled in sibling PR #1207.Closes #1093. Part of #1201.
Test plan
test_reasoning_effort_conditional_passing— Tests 3+4 coverTHINKING=True/Falsestreaming; Test 5 verifies userextra_bodymerges correctlytest_reasoning_effort_bool_true— asserts bothreasoning_effort="medium"andenable_thinking=True;adapter_namesurvivestest_reasoning_effort_bool_false—enable_thinking=False, noreasoning_effort,adapter_namesurvivesTHINKING=Falsesuppresses<think>✅THINKING=Trueenables reasoning,THINKING=Falsesuppresses, merge works ✅THINKING=Truepopulatesmot._thinking,THINKING=Falseclean ✅Why
enable_thinkinginextra_body— technical detailchat_template_kwargscannot be a direct kwarg to the OpenAI SDK'screate()— the SDK raisesTypeErroron unknown parameters.extra_bodyis the SDK escape hatch: its contents are merged into the top-level HTTP request JSON body before the call. vLLM then readschat_template_kwargsfrom the body and injects them into the Jinja chat template at render time.enable_thinkingis the vLLM-standardised toggle (Qwen3, Gemma 4, GLM-4.5, MiniMax-M2, and most models added since early 2026). Whenreasoning_effortis sent, vLLM v0.22.0+ auto-injectsenable_thinkingfor recognised reasoning parsers — so the dual-set (reasoning_effort+enable_thinking) is consistent with that direction and ensures correctness even for servers that haven't yet updated. See the vLLM reasoning outputs docs.Out of scope: IBM Granite 3.2 and DeepSeek-V3.1 use
thinking=True(notenable_thinking) in their chat templates. The bool branch sendsenable_thinking, which those models won't act on — this is a pre-existing gap, not a regression introduced here.Regression safety
Sending
chat_template_kwargsinextra_bodyto servers that don't understand it (OpenAI proper, DeepSeek, etc.) is safe — unknownextra_bodyfields are silently ignored.If a caller passes
THINKING=Trueand their ownextra_bodyinmodel_options, we merge them rather than spreading two dicts (which would raiseTypeError). Tested.For the intrinsic path,
chat_template_kwargs.adapter_nameis written before the THINKING block. The new code mergesenable_thinkinginto the existing dict rather than replacing it; tests assertadapter_namesurvives.Minor:
generate_log.extra["thinking"]now records the raw caller value (True/False) rather than the normalised"medium"string. Nothing reads this field functionally.