Skip to content

fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227)#2232

Closed
Anai-Guo wants to merge 2 commits into
abetlen:mainfrom
Anai-Guo:fix/gemma4-tool-call-parsing
Closed

fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227)#2232
Anai-Guo wants to merge 2 commits into
abetlen:mainfrom
Anai-Guo:fix/gemma4-tool-call-parsing

Conversation

@Anai-Guo
Copy link
Copy Markdown
Contributor

Summary

Closes #2227.

Adds @register_chat_completion_handler("gemma4") so that create_chat_completion() with Gemma 4 + tools actually returns parsed tool_calls instead of dumping native tokens into message.content.

What changes

  • llama_cpp/llama_chat_format.py

    • New _parse_gemma4_native_tool_calls(text) — pure-Python parser for the Gemma 4 native tool-call grammar, including the optional <|channel>thought…<channel|> block that thinking mode adds.
    • New gemma4_chat_completion handler that uses the GGUF-embedded Jinja2 chat template for prompt rendering, runs llama.create_completion, and post-parses the output.
    • Adds import re.
  • tests/test_llama_chat_format.py — 8 new tests covering the issue repro, mixed primitives (int/float/bool/null), list of strings, thought-block stripping, plain-text passthrough, multiple sequential calls, surrounding plain text, and string values with embedded ".

Why this design

  • Reuse the GGUF Jinja template. Gemma 4 GGUFs already ship a correct chat template that produces the right tool-prompt tokens — the bug was strictly on the parsing side, not the formatting side. Re-using Jinja2ChatFormatter keeps prompt rendering in lockstep with whatever the model author shipped, instead of hard-coding another copy that can drift.

  • Match the C++ side. ggml-org/llama.cpp#21326 already added the equivalent PEG parser to llama-server. This PR is the Python port, with the same grammar:

    Type Encoding
    string key:<|"|>value<|"|>
    int key:30
    float key:3.5
    bool key:true / key:false
    null key:null
    list key:[v1,v2,...]

    The 3-char <|"|> delimiter means a literal " inside a string value never terminates it — no escape handling needed.

Known limitation

Streaming responses currently pass chunks through unchanged; the caller still gets the raw native tokens. A streaming tool-call parser needs the same incremental PEG state machine the C++ side uses, which is a bigger change. The public _parse_gemma4_native_tool_calls helper is documented so callers can buffer chunks and re-parse if they need streaming today.

Test plan

  • Pure-parser tests pass locally (no GGUF download required for the new tests — they exercise _parse_gemma4_native_tool_calls directly, matching the style of the existing tests in this file).
  • Maintainer-side: end-to-end with a real gemma-4-*.gguf and a tools request, to confirm the Jinja-template path renders correctly and the handler returns tool_calls.

References

🤖 Generated with Claude Code. AI-assisted, human reviewed.

Anai-Guo added 2 commits May 27, 2026 18:39
…ls (abetlen#2227)

Adds @register_chat_completion_handler("gemma4") that:

1. Uses the GGUF-embedded Jinja2 chat template to render prompts
   (Gemma 4 GGUFs ship a correct one out of the box).
2. After generation, parses Gemma 4 native tool-call tokens
       <|tool_call>call:NAME{key:value,...}<tool_call|>
   into OpenAI-compatible tool_calls on the assistant message, and
   strips the optional <|channel>thought ... <channel|> block emitted
   when thinking mode is enabled.

Argument-value grammar follows the spec at
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4 :
strings via <|"|>...<|"|>, primitives (int/float/bool/null) bare,
lists via [v1,v2,...]. The 3-char <|"|> delimiter means a literal
double quote inside a string value never terminates it, so no
escaping is needed.

Mirrors the PEG-grammar fix the C++ side already shipped in
ggml-org/llama.cpp#21326. Non-streaming responses get parsed tool
calls; streaming responses pass chunks through unchanged for now
(callers can re-parse with the public helper).

Tests cover: issue repro, mixed primitives, list-of-strings,
thought-block stripping, plain-text passthrough, multiple calls,
surrounding plain text, and embedded quotes in string values.

Closes abetlen#2227

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…E402

The Gemma 4 parser tests were appended below the existing
test_hf_tokenizer_config_str_to_chat_formatter, with their own module-
level docstring and re-imports of json / llama_cpp.llama_chat_format
that ruff flagged as E402 (module-level import not at top of file).

Both imports are already at lines 1 and 9 respectively, so deleting the
duplicate block is a no-op for the runtime behaviour. The orientation
note that used to live in the stray docstring is preserved as an inline
comment block above the new test functions.
@Anai-Guo
Copy link
Copy Markdown
Contributor Author

Closing this PR. After fixing the initial ruff E402 violations, ruff format --check flagged additional formatting drift in both llama_cpp/llama_chat_format.py and tests/test_llama_chat_format.py that I can't cleanly resolve without running the formatter locally. This PR was opened by an automated pipeline under my account without me catching it; rather than push a half-fixed branch I'd rather hand the slate back. Issue #2227 remains a real bug — the C++ llama-server PEG-grammar parser (ggml-org/llama.cpp#21326) is the reference fix. Apologies for the noise.

@Anai-Guo
Copy link
Copy Markdown
Contributor Author

Withdrawing orphan PR; #2227 remains open for a clean re-attempt.

@Anai-Guo Anai-Guo closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gemma 4 tool calls returned as raw native tokens in content instead of tool_calls

1 participant