fix(litellm): pass api_base and set enable_thinking for bool THINKING by planetf1 · Pull Request #1207 · generative-computing/mellea

planetf1 · 2026-06-04T18:19:09Z

LiteLLMBackend silently ignored custom endpoints and THINKING had no effect. This PR fixes both.

Any backend constructed with a base_url (e.g. pointing at a vLLM server) was broken: _base_url was stored but never forwarded to litellm.acompletion(), so all requests routed to LiteLLM's default provider inference instead. Separately, THINKING=True/False never set chat_template_kwargs.enable_thinking, so vLLM models requiring an explicit opt-in (e.g. Gemma 4) never got thinking. THINKING=False also incorrectly sent reasoning_effort=False, which is an invalid value.

Issue	Before	After
Custom `base_url`	Stored, never used — requests misrouted	Forwarded as `api_base` to both call sites
`THINKING=True`	`reasoning_effort="medium"` only — Gemma 4 ignored it	`reasoning_effort="medium"` + `enable_thinking=True`
`THINKING=False`	`reasoning_effort=False` — invalid value	`enable_thinking=False` only; no `reasoning_effort`
User `extra_body` + THINKING	`TypeError` on merge	Deep-merged correctly

Scope: LiteLLMBackend only (mellea/backends/litellm.py). Sibling PR #1204 covers OpenAIBackend. Both are part of #1201.

Closes #1206. Relates to #1201, #1204.

Test plan

uv run pytest test/backends/test_litellm_thinking.py -v — all 17 tests pass
uv run pytest -m "not qualitative" — full fast suite passes
Live via scratchpad/smoke_litellm_vllm.py and scratchpad/smoke_litellm_ollama.py:
- Qwen3.6-27B (vLLM): reasoning_effort='medium' → thinking engaged ✅
- Gemma-4-31B-it (vLLM): reasoning_effort='medium' alone → no thinking ❌; with enable_thinking=True → thinking ✅
- LiteLLM → Ollama (qwen3.6:35b): THINKING=True → mot._thinking populated ✅; THINKING=False → clean ✅

Technical detail — why two mechanisms

vLLM models control thinking via chat_template_kwargs.enable_thinking (a Jinja template variable). LiteLLM-compatible providers (Ollama, cloud) use reasoning_effort. Sending both means each server picks up whichever it understands. THINKING=True sets reasoning_effort="medium" (for LiteLLM's translation layer, e.g. Ollama → think=True) and enable_thinking=True (for vLLM). THINKING=False sends only enable_thinking=False — omitting reasoning_effort rather than sending an invalid False.

String values ("low", "medium", "high") still route to reasoning_effort only; they are OpenAI-compatible values and don't set enable_thinking. For vLLM users wanting string-level control, True/False is the correct API.

LiteLLMBackend had two bugs in ModelOption.THINKING handling: 1. _base_url was stored in __init__ but never passed as api_base to litellm.acompletion() or litellm.atext_completion(). Any backend constructed with a custom base_url (e.g. pointing at a vLLM server) silently ignored it; all requests routed to LiteLLM's default provider inference instead of the specified endpoint. 2. When THINKING=True/False (bool), chat_template_kwargs.enable_thinking was never set in extra_body. For vLLM models that require explicit opt-in (e.g. Gemma 4 with --reasoning-parser gemma4), THINKING=True had no effect. THINKING=False also incorrectly sent reasoning_effort=False, which is not a valid value. Fix: forward self._base_url as api_base in both acompletion() call sites. For bool THINKING, set extra_body.chat_template_kwargs.enable_thinking accordingly. THINKING=True also sets reasoning_effort="medium" for providers that route via that param (Ollama via LiteLLM's translation). THINKING=False sends only enable_thinking=False — no reasoning_effort. User-supplied extra_body is deep-merged so chat_template_kwargs keys from both sides survive. Tests: 7 new unit tests added to test/backends/test_litellm_thinking.py, covering all four THINKING variants plus api_base forwarding and extra_body merge safety. Tests mock litellm.acompletion so no server is required. Verified live against Qwen3.6-27B (vLLM) and Gemma-4-31B-it (vLLM) via litellm.acompletion() directly (see scratchpad/smoke_litellm_vllm.py). LiteLLM->Ollama path also verified (see scratchpad/smoke_litellm_ollama.py). Closes generative-computing#1206. Relates to generative-computing#1201. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Two regressions introduced in the previous commit: 1. User-supplied api_base in model_options survived _make_backend_specific_and_remove (raw string keys pass through) and collided with the new explicit api_base= kwarg, raising TypeError. Fix: pop api_base from model_specific_options before the call and let the user value take precedence. Applied to both acompletion and atext_completion call sites. 2. extra_body deep-merge called .pop() on user_extra_body which shared a reference with the caller's model_options (shallow copy chain through replace_keys/remove_special_keys). Silent data loss: chat_template_kwargs was deleted from the caller's dict on the first call, so subsequent calls with the same options dropped it. Fix: shallow-copy user_extra_body before mutating. Also fix: use `if user_ctk is not None:` so an explicit empty {} is not silently discarded. Two regression tests added: - test_user_api_base_in_model_options_takes_precedence - test_extra_body_mutation_does_not_corrupt_caller_dict All 19 tests pass. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Sending api_base unconditionally meant any LiteLLMBackend constructed without a base_url (e.g. for Anthropic, Watsonx, or any cloud provider) would forward api_base="http://localhost:11434/v1" to LiteLLM, overriding its provider-default endpoint inference and silently misrouting requests. Fix: change the base_url __init__ default from "http://localhost:11434" to None and track self._explicit_base_url. api_base is only forwarded when the caller explicitly passed a base_url (or api_base in model_options). For default/None base_url, api_base=None lets LiteLLM infer the correct endpoint from the model prefix. Test: test_no_api_base_forwarded_when_base_url_not_set confirms that LiteLLMBackend("anthropic/...") sends api_base=None to LiteLLM. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Mapping[str, Any] | Any is not assignable to dict[Any, Any]. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

psschwei

LGTM
a couple of non-blocking items below

psschwei · 2026-06-05T13:48:09Z

+        original_thinking = thinking  # preserve raw caller value for the generate log
+        reasoning_params: dict[str, Any] = {}
+        if thinking is not None:
+            if type(thinking) is bool:


nit:

Suggested change

if type(thinking) is bool:

if isinstance(thinking, bool):

though ok if we leave as is too

psschwei · 2026-06-05T13:50:51Z

        model_id: str = "ollama_chat/" + str(model_ids.IBM_GRANITE_4_1_3B.ollama_name),
        formatter: ChatFormatter | None = None,
-        base_url: str | None = "http://localhost:11434",
+        base_url: str | None = None,


I guess this is technically a breaking change, but if I understand the flow right the typically ollama call will handle this anyway so there's minimal user impact right?

The old code accepted a base_url parameter but never forwarded it to LiteLLM. Looking at the pre-PR acompletion() call:

litellm.acompletion( model=self._model_id, messages=conversation, tools=formatted_tools, reasoning_effort=thinking, drop_params=True, **extra_params, **model_specific_options, )

No api_base= anywhere. The value was stored in self._base_url but never used. LiteLLM always inferred the endpoint from the model prefix — ollama_chat/ → localhost:11434, anthropic/ → Anthropic's API, and so on.

That means the three cases worth considering are all safe:

Default construction LiteLLMBackend() — old: _base_url set to localhost:11434 but ignored; new: _explicit_base_url=False, api_base not forwarded. LiteLLM infers the same endpoint either way.

Explicit non-default Ollama LiteLLMBackend(base_url="http://myserver:11434") — old: stored but silently ignored, so this was already broken; new: forwarded correctly for the first time. Fixed, not broken.

Cloud provider LiteLLMBackend(model_id="anthropic/...") — old: LiteLLM inferred Anthropic's endpoint; new: _explicit_base_url=False, same outcome.

The default changes from "http://localhost:11434" to None because keeping a non-None default would cause the new forwarding logic to send api_base=localhost:11434 to cloud providers constructed without arguments — which would be wrong. None preserves the pre-PR behaviour: let LiteLLM infer.

github-actions Bot added the bug Something isn't working label Jun 4, 2026

planetf1 mentioned this pull request Jun 4, 2026

fix(backends): route ModelOption.THINKING=bool to chat_template_kwargs for vLLM #1204

Open

6 tasks

planetf1 added 3 commits June 4, 2026 19:32

fix(test): cast call_args.kwargs to dict to satisfy mypy return type

fc3f185

Mapping[str, Any] | Any is not assignable to dict[Any, Any]. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

planetf1 marked this pull request as ready for review June 4, 2026 21:36

planetf1 requested a review from a team as a code owner June 4, 2026 21:36

planetf1 requested review from AngeloDanducci, jakelorocco and psschwei June 4, 2026 21:36

planetf1 mentioned this pull request Jun 5, 2026

Multi-turn chat silently drops reasoning output from thinking models #1201

Open

psschwei approved these changes Jun 5, 2026

View reviewed changes

planetf1 added this pull request to the merge queue Jun 5, 2026

Merged via the queue into generative-computing:main with commit 1651e14 Jun 5, 2026
9 checks passed

planetf1 deleted the fix/issue-1206-litellm-api-base-thinking branch June 5, 2026 14:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(litellm): pass api_base and set enable_thinking for bool THINKING#1207

fix(litellm): pass api_base and set enable_thinking for bool THINKING#1207
planetf1 merged 4 commits into
generative-computing:mainfrom
planetf1:fix/issue-1206-litellm-api-base-thinking

planetf1 commented Jun 4, 2026 •

edited

Loading

Uh oh!

psschwei left a comment

Uh oh!

psschwei Jun 5, 2026

Uh oh!

psschwei Jun 5, 2026

Uh oh!

planetf1 Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

planetf1 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

psschwei left a comment

Choose a reason for hiding this comment

Uh oh!

psschwei Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

psschwei Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

planetf1 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

planetf1 commented Jun 4, 2026 •

edited

Loading