Skip to content

fix(litellm): pass api_base and set enable_thinking for bool THINKING#1207

Merged
planetf1 merged 4 commits into
generative-computing:mainfrom
planetf1:fix/issue-1206-litellm-api-base-thinking
Jun 5, 2026
Merged

fix(litellm): pass api_base and set enable_thinking for bool THINKING#1207
planetf1 merged 4 commits into
generative-computing:mainfrom
planetf1:fix/issue-1206-litellm-api-base-thinking

Conversation

@planetf1
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 commented Jun 4, 2026

LiteLLMBackend silently ignored custom endpoints and THINKING had no effect. This PR fixes both.

Any backend constructed with a base_url (e.g. pointing at a vLLM server) was broken: _base_url was stored but never forwarded to litellm.acompletion(), so all requests routed to LiteLLM's default provider inference instead. Separately, THINKING=True/False never set chat_template_kwargs.enable_thinking, so vLLM models requiring an explicit opt-in (e.g. Gemma 4) never got thinking. THINKING=False also incorrectly sent reasoning_effort=False, which is an invalid value.

Issue Before After
Custom base_url Stored, never used — requests misrouted Forwarded as api_base to both call sites
THINKING=True reasoning_effort="medium" only — Gemma 4 ignored it reasoning_effort="medium" + enable_thinking=True
THINKING=False reasoning_effort=False — invalid value enable_thinking=False only; no reasoning_effort
User extra_body + THINKING TypeError on merge Deep-merged correctly

Scope: LiteLLMBackend only (mellea/backends/litellm.py). Sibling PR #1204 covers OpenAIBackend. Both are part of #1201.

Closes #1206. Relates to #1201, #1204.

Test plan

  • uv run pytest test/backends/test_litellm_thinking.py -v — all 17 tests pass
  • uv run pytest -m "not qualitative" — full fast suite passes
  • Live via scratchpad/smoke_litellm_vllm.py and scratchpad/smoke_litellm_ollama.py:
    • Qwen3.6-27B (vLLM): reasoning_effort='medium' → thinking engaged ✅
    • Gemma-4-31B-it (vLLM): reasoning_effort='medium' alone → no thinking ❌; with enable_thinking=True → thinking ✅
    • LiteLLM → Ollama (qwen3.6:35b): THINKING=Truemot._thinking populated ✅; THINKING=False → clean ✅
Technical detail — why two mechanisms

vLLM models control thinking via chat_template_kwargs.enable_thinking (a Jinja template variable). LiteLLM-compatible providers (Ollama, cloud) use reasoning_effort. Sending both means each server picks up whichever it understands. THINKING=True sets reasoning_effort="medium" (for LiteLLM's translation layer, e.g. Ollama → think=True) and enable_thinking=True (for vLLM). THINKING=False sends only enable_thinking=False — omitting reasoning_effort rather than sending an invalid False.

String values ("low", "medium", "high") still route to reasoning_effort only; they are OpenAI-compatible values and don't set enable_thinking. For vLLM users wanting string-level control, True/False is the correct API.

LiteLLMBackend had two bugs in ModelOption.THINKING handling:

1. _base_url was stored in __init__ but never passed as api_base to
   litellm.acompletion() or litellm.atext_completion(). Any backend
   constructed with a custom base_url (e.g. pointing at a vLLM server)
   silently ignored it; all requests routed to LiteLLM's default
   provider inference instead of the specified endpoint.

2. When THINKING=True/False (bool), chat_template_kwargs.enable_thinking
   was never set in extra_body. For vLLM models that require explicit
   opt-in (e.g. Gemma 4 with --reasoning-parser gemma4), THINKING=True
   had no effect. THINKING=False also incorrectly sent
   reasoning_effort=False, which is not a valid value.

Fix: forward self._base_url as api_base in both acompletion() call
sites. For bool THINKING, set extra_body.chat_template_kwargs.enable_thinking
accordingly. THINKING=True also sets reasoning_effort="medium" for
providers that route via that param (Ollama via LiteLLM's translation).
THINKING=False sends only enable_thinking=False — no reasoning_effort.
User-supplied extra_body is deep-merged so chat_template_kwargs keys
from both sides survive.

Tests: 7 new unit tests added to test/backends/test_litellm_thinking.py,
covering all four THINKING variants plus api_base forwarding and
extra_body merge safety. Tests mock litellm.acompletion so no server
is required.

Verified live against Qwen3.6-27B (vLLM) and Gemma-4-31B-it (vLLM)
via litellm.acompletion() directly (see scratchpad/smoke_litellm_vllm.py).
LiteLLM->Ollama path also verified (see scratchpad/smoke_litellm_ollama.py).

Closes generative-computing#1206. Relates to generative-computing#1201.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 added 3 commits June 4, 2026 19:32
Two regressions introduced in the previous commit:

1. User-supplied api_base in model_options survived
   _make_backend_specific_and_remove (raw string keys pass through)
   and collided with the new explicit api_base= kwarg, raising
   TypeError. Fix: pop api_base from model_specific_options before
   the call and let the user value take precedence. Applied to both
   acompletion and atext_completion call sites.

2. extra_body deep-merge called .pop() on user_extra_body which
   shared a reference with the caller's model_options (shallow copy
   chain through replace_keys/remove_special_keys). Silent data loss:
   chat_template_kwargs was deleted from the caller's dict on the
   first call, so subsequent calls with the same options dropped it.
   Fix: shallow-copy user_extra_body before mutating.
   Also fix: use `if user_ctk is not None:` so an explicit empty {}
   is not silently discarded.

Two regression tests added:
- test_user_api_base_in_model_options_takes_precedence
- test_extra_body_mutation_does_not_corrupt_caller_dict

All 19 tests pass.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Sending api_base unconditionally meant any LiteLLMBackend constructed
without a base_url (e.g. for Anthropic, Watsonx, or any cloud provider)
would forward api_base="http://localhost:11434/v1" to LiteLLM, overriding
its provider-default endpoint inference and silently misrouting requests.

Fix: change the base_url __init__ default from "http://localhost:11434"
to None and track self._explicit_base_url. api_base is only forwarded
when the caller explicitly passed a base_url (or api_base in
model_options). For default/None base_url, api_base=None lets LiteLLM
infer the correct endpoint from the model prefix.

Test: test_no_api_base_forwarded_when_base_url_not_set confirms that
LiteLLMBackend("anthropic/...") sends api_base=None to LiteLLM.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Mapping[str, Any] | Any is not assignable to dict[Any, Any].

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review June 4, 2026 21:36
@planetf1 planetf1 requested a review from a team as a code owner June 4, 2026 21:36
Copy link
Copy Markdown
Member

@psschwei psschwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
a couple of non-blocking items below

original_thinking = thinking # preserve raw caller value for the generate log
reasoning_params: dict[str, Any] = {}
if thinking is not None:
if type(thinking) is bool:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if type(thinking) is bool:
if isinstance(thinking, bool):

though ok if we leave as is too

model_id: str = "ollama_chat/" + str(model_ids.IBM_GRANITE_4_1_3B.ollama_name),
formatter: ChatFormatter | None = None,
base_url: str | None = "http://localhost:11434",
base_url: str | None = None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is technically a breaking change, but if I understand the flow right the typically ollama call will handle this anyway so there's minimal user impact right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code accepted a base_url parameter but never forwarded it to LiteLLM. Looking at the pre-PR acompletion() call:

litellm.acompletion(
    model=self._model_id,
    messages=conversation,
    tools=formatted_tools,
    reasoning_effort=thinking,
    drop_params=True,
    **extra_params,
    **model_specific_options,
)

No api_base= anywhere. The value was stored in self._base_url but never used. LiteLLM always inferred the endpoint from the model prefix — ollama_chat/ → localhost:11434, anthropic/ → Anthropic's API, and so on.

That means the three cases worth considering are all safe:

  • Default construction LiteLLMBackend() — old: _base_url set to localhost:11434 but ignored; new: _explicit_base_url=False, api_base not forwarded. LiteLLM infers the same endpoint either way.
  • Explicit non-default Ollama LiteLLMBackend(base_url="http://myserver:11434") — old: stored but silently ignored, so this was already broken; new: forwarded correctly for the first time. Fixed, not broken.
  • Cloud provider LiteLLMBackend(model_id="anthropic/...") — old: LiteLLM inferred Anthropic's endpoint; new: _explicit_base_url=False, same outcome.

The default changes from "http://localhost:11434" to None because keeping a non-None default would cause the new forwarding logic to send api_base=localhost:11434 to cloud providers constructed without arguments — which would be wrong. None preserves the pre-PR behaviour: let LiteLLM infer.

@planetf1 planetf1 added this pull request to the merge queue Jun 5, 2026
Merged via the queue into generative-computing:main with commit 1651e14 Jun 5, 2026
9 checks passed
@planetf1 planetf1 deleted the fix/issue-1206-litellm-api-base-thinking branch June 5, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(litellm): pass api_base and enable_thinking for bool THINKING

2 participants