gemma4_31b: add OpenAI serving entrypoint#20473
Conversation
Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness. This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing. Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 3 New Failures, 3 Unrelated Failures, 2 Unclassified FailuresAs of commit 2e0ac26 with merge base 3169302 ( NEW FAILURES - The following jobs have failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
BROKEN TRUNK - The following jobs failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Adds an OpenAI-compatible serving entrypoint for the Gemma4 31B example, reusing the shared examples/llm_server harness while keeping Gemma-specific logic under examples/models/gemma4_31b. The model runner is refactored to use a dedicated Gemma4_31BEngine / LLMSession adapter, and a new C++ JSONL worker process is introduced for model execution.
Changes:
- Introduce
Gemma4_31BEngine+ session adapter and refactor the existing runner to use it. - Add a C++ JSONL worker (
gemma4_31b_worker) and a Python OpenAI-compatible launcher (serve.py) with Gemma-specific templating/tool parsing. - Wire new targets into CMake presets/Makefile/docs and add hermetic + opt-in on-device + CUDA no-bleed validation tests.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Updates Gemma4 31B build targets to include worker (+ CUDA no-bleed test). |
| examples/models/gemma4_31b/README.md | Documents the serving harness, worker binary, and no-bleed test usage. |
| examples/models/gemma4_31b/CMakePresets.json | Ensures presets build runner + worker (+ CUDA no-bleed test). |
| examples/models/gemma4_31b/CMakeLists.txt | Adds worker + no-bleed test executables and links engine implementation into targets. |
| examples/models/gemma4_31b/export.py | Exports mutable-buffer metadata for multi-session isolation support. |
| examples/models/gemma4_31b/gemma4_31b_engine.h | Declares the Gemma4 engine/session adapter API. |
| examples/models/gemma4_31b/gemma4_31b_engine.cpp | Implements engine/session logic, including mutable-state rebinding (CUDA/MLX). |
| examples/models/gemma4_31b/main.cpp | Refactors runner to use the new engine/session adapter. |
| examples/models/gemma4_31b/gemma4_31b_worker.cpp | Adds JSONL worker entrypoint for OpenAI-serving execution. |
| examples/models/gemma4_31b/serve.py | Adds OpenAI-compatible Python control-plane launcher for Gemma4 31B. |
| examples/models/gemma4_31b/test_serve.py | Adds hermetic launcher/control-plane tests. |
| examples/models/gemma4_31b/test_ondevice_serving.py | Adds opt-in on-device BOS/template regression test. |
| examples/models/gemma4_31b/test_gemma4_31b_nobleed.cpp | Adds CUDA integration proof for isolated interleaved multi-session execution. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::string prompt_text = read_prompt(); | ||
| if (!FLAGS_prompt_file.empty() && prompt_text.empty()) { | ||
| return 1; | ||
| } |
| std::vector<std::unique_ptr<llm::LLMSession>> held; | ||
| while (true) { | ||
| auto session = engine->create_session(); | ||
| if (session.error() != Error::Ok) { | ||
| break; | ||
| } | ||
| held.push_back(std::move(session.get())); | ||
| if (held.size() > static_cast<size_t>(config.max_sessions) + 2) { | ||
| break; | ||
| } | ||
| } | ||
| check( | ||
| "capacity enforced: create_session fails past max_sessions", | ||
| held.size() <= static_cast<size_t>(config.max_sessions)); |
|
@claude review the code plz |
|
Claude finished @Gasoonjia's task in 2m 36s —— View job Code Review — Gemma4 31B serving entrypointNice refactor. The Functional concerns1. static bool valid_temperature(float t) {
return t == -1.0f || (t >= 0.0f && t <= 1.0f);
}The OpenAI chat API accepts 2. Empty prompt file is treated as fatal — Tests3. No-bleed capacity assertion keys off the wrong value — Minor / nits
CIThe 3 reported "new failures" are Docker None of the above is blocking; (1) and (3) are the two I'd address before merge. The engine/session contract adherence (poisoning semantics, terminal-step accounting, position bookkeeping) all checks out against |
| int64_t min_prefill_chunk, | ||
| bool rebind_available | ||
| #ifdef GEMMA_HAS_MUTABLE_STATE | ||
| , |
There was a problem hiding this comment.
an extra , is a little wierd; plz remove it
| #endif | ||
|
|
||
| #if defined(EXECUTORCH_BUILD_CUDA) || defined(EXECUTORCH_BUILD_MLX) | ||
| #define GEMMA_HAS_MUTABLE_STATE 1 |
There was a problem hiding this comment.
i think we can remove this; gemma431b can only run on cuda or mlx -> EXECUTORCH_BUILD_CUDA or EXECUTORCH_BUILD_MLX will always be defined -> GEMMA_HAS_MUTABLE_STATE should always be 1.
| } | ||
|
|
||
| Error set_temperature(float temperature) { | ||
| temp_val_ = (temperature <= 0.0f) ? 1e-6f : temperature; |
There was a problem hiding this comment.
should we check the temperature here then set it?
| decode_pos_data_[0] = pos_; | ||
| std::vector<EValue> inputs; | ||
| #ifdef EXECUTORCH_BUILD_CUDA | ||
| ET_CHECK_OK_OR_RETURN_ERROR(copy_decode_inputs_to_cuda()); |
There was a problem hiding this comment.
why do we need to copy input token from host to gpu? Now input token of decode method should be always on GPU.
| endif() | ||
|
|
||
| if(TARGET mlxdelegate) | ||
| executorch_target_copy_mlx_metallib(gemma4_31b_runner) |
| "get_vocab_size": config.vocab_size, | ||
| "get_n_layers": config.num_hidden_layers, | ||
| "get_max_prefill_chunk": max_prefill, | ||
| "get_min_prefill_chunk": 5, |
There was a problem hiding this comment.
Do you pad up to this chunk size? Should this be backend specific? MLX can handle seqlen=1.
| constexpr const char* kMaxPrefillChunk = "get_max_prefill_chunk"; | ||
| constexpr const char* kMinPrefillChunk = "get_min_prefill_chunk"; | ||
|
|
||
| Result<uint64_t> read_sampled_token( |
Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.
This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.
Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.
#20001