fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227)#2232
Closed
Anai-Guo wants to merge 2 commits into
Closed
fix(chat_format): parse Gemma 4 native tool-call tokens into tool_calls (#2227)#2232Anai-Guo wants to merge 2 commits into
Anai-Guo wants to merge 2 commits into
Conversation
…ls (abetlen#2227) Adds @register_chat_completion_handler("gemma4") that: 1. Uses the GGUF-embedded Jinja2 chat template to render prompts (Gemma 4 GGUFs ship a correct one out of the box). 2. After generation, parses Gemma 4 native tool-call tokens <|tool_call>call:NAME{key:value,...}<tool_call|> into OpenAI-compatible tool_calls on the assistant message, and strips the optional <|channel>thought ... <channel|> block emitted when thinking mode is enabled. Argument-value grammar follows the spec at https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4 : strings via <|"|>...<|"|>, primitives (int/float/bool/null) bare, lists via [v1,v2,...]. The 3-char <|"|> delimiter means a literal double quote inside a string value never terminates it, so no escaping is needed. Mirrors the PEG-grammar fix the C++ side already shipped in ggml-org/llama.cpp#21326. Non-streaming responses get parsed tool calls; streaming responses pass chunks through unchanged for now (callers can re-parse with the public helper). Tests cover: issue repro, mixed primitives, list-of-strings, thought-block stripping, plain-text passthrough, multiple calls, surrounding plain text, and embedded quotes in string values. Closes abetlen#2227 🤖 Generated with [Claude Code](https://claude.com/claude-code)
…E402 The Gemma 4 parser tests were appended below the existing test_hf_tokenizer_config_str_to_chat_formatter, with their own module- level docstring and re-imports of json / llama_cpp.llama_chat_format that ruff flagged as E402 (module-level import not at top of file). Both imports are already at lines 1 and 9 respectively, so deleting the duplicate block is a no-op for the runtime behaviour. The orientation note that used to live in the stray docstring is preserved as an inline comment block above the new test functions.
Contributor
Author
|
Closing this PR. After fixing the initial |
Contributor
Author
|
Withdrawing orphan PR; #2227 remains open for a clean re-attempt. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #2227.
Adds
@register_chat_completion_handler("gemma4")so thatcreate_chat_completion()with Gemma 4 + tools actually returns parsedtool_callsinstead of dumping native tokens intomessage.content.What changes
llama_cpp/llama_chat_format.py_parse_gemma4_native_tool_calls(text)— pure-Python parser for the Gemma 4 native tool-call grammar, including the optional<|channel>thought…<channel|>block that thinking mode adds.gemma4_chat_completionhandler that uses the GGUF-embedded Jinja2 chat template for prompt rendering, runsllama.create_completion, and post-parses the output.import re.tests/test_llama_chat_format.py— 8 new tests covering the issue repro, mixed primitives (int/float/bool/null), list of strings, thought-block stripping, plain-text passthrough, multiple sequential calls, surrounding plain text, and string values with embedded".Why this design
Reuse the GGUF Jinja template. Gemma 4 GGUFs already ship a correct chat template that produces the right tool-prompt tokens — the bug was strictly on the parsing side, not the formatting side. Re-using
Jinja2ChatFormatterkeeps prompt rendering in lockstep with whatever the model author shipped, instead of hard-coding another copy that can drift.Match the C++ side. ggml-org/llama.cpp#21326 already added the equivalent PEG parser to
llama-server. This PR is the Python port, with the same grammar:key:<|"|>value<|"|>key:30key:3.5key:true/key:falsekey:nullkey:[v1,v2,...]The 3-char
<|"|>delimiter means a literal"inside a string value never terminates it — no escape handling needed.Known limitation
Streaming responses currently pass chunks through unchanged; the caller still gets the raw native tokens. A streaming tool-call parser needs the same incremental PEG state machine the C++ side uses, which is a bigger change. The public
_parse_gemma4_native_tool_callshelper is documented so callers can buffer chunks and re-parse if they need streaming today.Test plan
_parse_gemma4_native_tool_callsdirectly, matching the style of the existing tests in this file).gemma-4-*.ggufand a tools request, to confirm the Jinja-template path renders correctly and the handler returnstool_calls.References
contentinstead oftool_calls#2227🤖 Generated with Claude Code. AI-assisted, human reviewed.