fix(litellm): pass api_base and set enable_thinking for bool THINKING#1207
Conversation
LiteLLMBackend had two bugs in ModelOption.THINKING handling: 1. _base_url was stored in __init__ but never passed as api_base to litellm.acompletion() or litellm.atext_completion(). Any backend constructed with a custom base_url (e.g. pointing at a vLLM server) silently ignored it; all requests routed to LiteLLM's default provider inference instead of the specified endpoint. 2. When THINKING=True/False (bool), chat_template_kwargs.enable_thinking was never set in extra_body. For vLLM models that require explicit opt-in (e.g. Gemma 4 with --reasoning-parser gemma4), THINKING=True had no effect. THINKING=False also incorrectly sent reasoning_effort=False, which is not a valid value. Fix: forward self._base_url as api_base in both acompletion() call sites. For bool THINKING, set extra_body.chat_template_kwargs.enable_thinking accordingly. THINKING=True also sets reasoning_effort="medium" for providers that route via that param (Ollama via LiteLLM's translation). THINKING=False sends only enable_thinking=False — no reasoning_effort. User-supplied extra_body is deep-merged so chat_template_kwargs keys from both sides survive. Tests: 7 new unit tests added to test/backends/test_litellm_thinking.py, covering all four THINKING variants plus api_base forwarding and extra_body merge safety. Tests mock litellm.acompletion so no server is required. Verified live against Qwen3.6-27B (vLLM) and Gemma-4-31B-it (vLLM) via litellm.acompletion() directly (see scratchpad/smoke_litellm_vllm.py). LiteLLM->Ollama path also verified (see scratchpad/smoke_litellm_ollama.py). Closes generative-computing#1206. Relates to generative-computing#1201. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Two regressions introduced in the previous commit:
1. User-supplied api_base in model_options survived
_make_backend_specific_and_remove (raw string keys pass through)
and collided with the new explicit api_base= kwarg, raising
TypeError. Fix: pop api_base from model_specific_options before
the call and let the user value take precedence. Applied to both
acompletion and atext_completion call sites.
2. extra_body deep-merge called .pop() on user_extra_body which
shared a reference with the caller's model_options (shallow copy
chain through replace_keys/remove_special_keys). Silent data loss:
chat_template_kwargs was deleted from the caller's dict on the
first call, so subsequent calls with the same options dropped it.
Fix: shallow-copy user_extra_body before mutating.
Also fix: use `if user_ctk is not None:` so an explicit empty {}
is not silently discarded.
Two regression tests added:
- test_user_api_base_in_model_options_takes_precedence
- test_extra_body_mutation_does_not_corrupt_caller_dict
All 19 tests pass.
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Sending api_base unconditionally meant any LiteLLMBackend constructed
without a base_url (e.g. for Anthropic, Watsonx, or any cloud provider)
would forward api_base="http://localhost:11434/v1" to LiteLLM, overriding
its provider-default endpoint inference and silently misrouting requests.
Fix: change the base_url __init__ default from "http://localhost:11434"
to None and track self._explicit_base_url. api_base is only forwarded
when the caller explicitly passed a base_url (or api_base in
model_options). For default/None base_url, api_base=None lets LiteLLM
infer the correct endpoint from the model prefix.
Test: test_no_api_base_forwarded_when_base_url_not_set confirms that
LiteLLMBackend("anthropic/...") sends api_base=None to LiteLLM.
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Mapping[str, Any] | Any is not assignable to dict[Any, Any]. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
psschwei
left a comment
There was a problem hiding this comment.
LGTM
a couple of non-blocking items below
| original_thinking = thinking # preserve raw caller value for the generate log | ||
| reasoning_params: dict[str, Any] = {} | ||
| if thinking is not None: | ||
| if type(thinking) is bool: |
There was a problem hiding this comment.
nit:
| if type(thinking) is bool: | |
| if isinstance(thinking, bool): |
though ok if we leave as is too
| model_id: str = "ollama_chat/" + str(model_ids.IBM_GRANITE_4_1_3B.ollama_name), | ||
| formatter: ChatFormatter | None = None, | ||
| base_url: str | None = "http://localhost:11434", | ||
| base_url: str | None = None, |
There was a problem hiding this comment.
I guess this is technically a breaking change, but if I understand the flow right the typically ollama call will handle this anyway so there's minimal user impact right?
There was a problem hiding this comment.
The old code accepted a base_url parameter but never forwarded it to LiteLLM. Looking at the pre-PR acompletion() call:
litellm.acompletion(
model=self._model_id,
messages=conversation,
tools=formatted_tools,
reasoning_effort=thinking,
drop_params=True,
**extra_params,
**model_specific_options,
)No api_base= anywhere. The value was stored in self._base_url but never used. LiteLLM always inferred the endpoint from the model prefix — ollama_chat/ → localhost:11434, anthropic/ → Anthropic's API, and so on.
That means the three cases worth considering are all safe:
- Default construction
LiteLLMBackend()— old:_base_urlset to localhost:11434 but ignored; new:_explicit_base_url=False,api_basenot forwarded. LiteLLM infers the same endpoint either way. - Explicit non-default Ollama
LiteLLMBackend(base_url="http://myserver:11434")— old: stored but silently ignored, so this was already broken; new: forwarded correctly for the first time. Fixed, not broken. - Cloud provider
LiteLLMBackend(model_id="anthropic/...")— old: LiteLLM inferred Anthropic's endpoint; new:_explicit_base_url=False, same outcome.
The default changes from "http://localhost:11434" to None because keeping a non-None default would cause the new forwarding logic to send api_base=localhost:11434 to cloud providers constructed without arguments — which would be wrong. None preserves the pre-PR behaviour: let LiteLLM infer.
LiteLLMBackendsilently ignored custom endpoints andTHINKINGhad no effect. This PR fixes both.Any backend constructed with a
base_url(e.g. pointing at a vLLM server) was broken:_base_urlwas stored but never forwarded tolitellm.acompletion(), so all requests routed to LiteLLM's default provider inference instead. Separately,THINKING=True/Falsenever setchat_template_kwargs.enable_thinking, so vLLM models requiring an explicit opt-in (e.g. Gemma 4) never got thinking.THINKING=Falsealso incorrectly sentreasoning_effort=False, which is an invalid value.base_urlapi_baseto both call sitesTHINKING=Truereasoning_effort="medium"only — Gemma 4 ignored itreasoning_effort="medium"+enable_thinking=TrueTHINKING=Falsereasoning_effort=False— invalid valueenable_thinking=Falseonly; noreasoning_effortextra_body+ THINKINGTypeErroron mergeScope:
LiteLLMBackendonly (mellea/backends/litellm.py). Sibling PR #1204 coversOpenAIBackend. Both are part of #1201.Closes #1206. Relates to #1201, #1204.
Test plan
uv run pytest test/backends/test_litellm_thinking.py -v— all 17 tests passuv run pytest -m "not qualitative"— full fast suite passesscratchpad/smoke_litellm_vllm.pyandscratchpad/smoke_litellm_ollama.py:reasoning_effort='medium'→ thinking engaged ✅reasoning_effort='medium'alone → no thinking ❌; withenable_thinking=True→ thinking ✅qwen3.6:35b):THINKING=True→mot._thinkingpopulated ✅;THINKING=False→ clean ✅Technical detail — why two mechanisms
vLLM models control thinking via
chat_template_kwargs.enable_thinking(a Jinja template variable). LiteLLM-compatible providers (Ollama, cloud) usereasoning_effort. Sending both means each server picks up whichever it understands.THINKING=Truesetsreasoning_effort="medium"(for LiteLLM's translation layer, e.g. Ollama →think=True) andenable_thinking=True(for vLLM).THINKING=Falsesends onlyenable_thinking=False— omittingreasoning_effortrather than sending an invalidFalse.String values (
"low","medium","high") still route toreasoning_effortonly; they are OpenAI-compatible values and don't setenable_thinking. For vLLM users wanting string-level control,True/Falseis the correct API.