gemma4_31b: add OpenAI serving entrypoint by mergennachin · Pull Request #20473 · pytorch/executorch

mergennachin · 2026-06-24T02:57:07Z

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.

This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.

Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.

#20001

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness. This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing. Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.

pytorch-bot · 2026-06-24T02:57:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 3 New Failures, 3 Unrelated Failures, 2 Unclassified Failures

As of commit 2e0ac26 with merge base 3169302 ():

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 5e7cec2ab620569102d74bcb3859ccdb6485be3239a0e3639f95de63afdbdd95 /exec failed with exit code 1
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t ca433c47c724a536be47cf06b5bb9d06fcfff35590184e190bd55e4340e286dc /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 30fff3cc03112459950e4a953a306bea73fd16f253e69458a973f9bcffd69fc0 /exec failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-24T02:58:25Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Adds an OpenAI-compatible serving entrypoint for the Gemma4 31B example, reusing the shared examples/llm_server harness while keeping Gemma-specific logic under examples/models/gemma4_31b. The model runner is refactored to use a dedicated Gemma4_31BEngine / LLMSession adapter, and a new C++ JSONL worker process is introduced for model execution.

Changes:

Introduce Gemma4_31BEngine + session adapter and refactor the existing runner to use it.
Add a C++ JSONL worker (gemma4_31b_worker) and a Python OpenAI-compatible launcher (serve.py) with Gemma-specific templating/tool parsing.
Wire new targets into CMake presets/Makefile/docs and add hermetic + opt-in on-device + CUDA no-bleed validation tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
Makefile	Updates Gemma4 31B build targets to include worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/README.md	Documents the serving harness, worker binary, and no-bleed test usage.
examples/models/gemma4_31b/CMakePresets.json	Ensures presets build runner + worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/CMakeLists.txt	Adds worker + no-bleed test executables and links engine implementation into targets.
examples/models/gemma4_31b/export.py	Exports mutable-buffer metadata for multi-session isolation support.
examples/models/gemma4_31b/gemma4_31b_engine.h	Declares the Gemma4 engine/session adapter API.
examples/models/gemma4_31b/gemma4_31b_engine.cpp	Implements engine/session logic, including mutable-state rebinding (CUDA/MLX).
examples/models/gemma4_31b/main.cpp	Refactors runner to use the new engine/session adapter.
examples/models/gemma4_31b/gemma4_31b_worker.cpp	Adds JSONL worker entrypoint for OpenAI-serving execution.
examples/models/gemma4_31b/serve.py	Adds OpenAI-compatible Python control-plane launcher for Gemma4 31B.
examples/models/gemma4_31b/test_serve.py	Adds hermetic launcher/control-plane tests.
examples/models/gemma4_31b/test_ondevice_serving.py	Adds opt-in on-device BOS/template regression test.
examples/models/gemma4_31b/test_gemma4_31b_nobleed.cpp	Adds CUDA integration proof for isolated interleaved multi-session execution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  std::string prompt_text = read_prompt();
+  if (!FLAGS_prompt_file.empty() && prompt_text.empty()) {
+    return 1;
  }


+  std::vector<std::unique_ptr<llm::LLMSession>> held;
+  while (true) {
+    auto session = engine->create_session();
+    if (session.error() != Error::Ok) {
+      break;
+    }
+    held.push_back(std::move(session.get()));
+    if (held.size() > static_cast<size_t>(config.max_sessions) + 2) {
+      break;
+    }
+  }
+  check(
+      "capacity enforced: create_session fails past max_sessions",
+      held.size() <= static_cast<size_t>(config.max_sessions));


Gasoonjia · 2026-06-24T20:20:10Z

@claude review the code plz

claude · 2026-06-24T20:20:55Z

Claude finished @Gasoonjia's task in 2m 36s —— View job

Code Review — Gemma4 31B serving entrypoint

Nice refactor. The LLMEngine/LLMSession split is clean, the generic worker_loop stays model-agnostic, and the warm-resume invariant (resident_token_ids.size() == position()) is well-documented and looks correct end-to-end. Comments below, ordered by severity.

Functional concerns

1. temperature validation rejects part of the OpenAI range — gemma4_31b_engine.cpp:412

static bool valid_temperature(float t) {
  return t == -1.0f || (t >= 0.0f && t <= 1.0f);
}

The OpenAI chat API accepts temperature ∈ [0, 2]. The worker passes the request value straight through (worker_loop.h:113), so a standards-compliant client sending temperature: 1.5 gets Error::InvalidArgument, which surfaces as a hard "prefill failed" for the whole request rather than a clamp. For something billed as an OpenAI-compatible entrypoint this is a surprising rejection. Consider accepting [0, 2] (or clamping in the control plane). Fix this →

2. Empty prompt file is treated as fatal — main.cpp:68-79, main.cpp:144 (also flagged by Copilot)
read_prompt() returns "" for both an unreadable file and a valid-but-empty file, and main() exits on any empty result. Only an actual open failure should be fatal. Either signal the open error distinctly (e.g. return std::optional) or only bail when !f.is_open().

Tests

3. No-bleed capacity assertion keys off the wrong value — test_gemma4_31b_nobleed.cpp:248-254 (also flagged by Copilot)
The loop bound and final check both use config.max_sessions, but the engine clamps real capacity to 1 when rebinding is unavailable (and serving_capacity() is the source of truth). As written, the test passes even if create_session() silently clamps to 1 — i.e. it doesn't actually prove the multi-session capacity it's named for. Assert against engine->serving_capacity().max_physical_sessions_without_weight_duplication instead. Fix this →

Minor / nits

gemma4_31b_engine.cpp:584 — add_token_piece(..., "<turn|>") and "<end_of_turn>" are added to eos_ids, but main.cpp's template emits <turn|> as the user-turn terminator inside the prompt. Since these only affect sampled-token EOS detection (not prompt tokens), this is fine — worth a one-line comment noting why both spellings are covered, as the asymmetry with format_prompt is easy to misread.
main.cpp vs gemma4_31b_worker.cpp — the chat-template string lives literally in main.cpp:85 (raw runner) while the server path renders via Python ChatTemplate. Two sources of truth for "what a Gemma prompt looks like." Acceptable for an example, but a drift risk worth a comment pointing each at the other.
gemma4_31b_engine.cpp:251 — session destructor calls mutable_state_->destroy_session() outside exec_mutex_. The worker loop is single-threaded so this is safe today, but every other mutable-state touch goes through run_locked/with_active_session; a brief note that session lifecycle is single-threaded would prevent a future concurrent caller from tripping here.

CI

The 3 reported "new failures" are Docker exec infra errors (docker exec ... failed with exit code 1) and the aarch64 wheel failure is a pre-existing CoreML InMemoryNode::Kind build break unrelated to this PR — none appear caused by these changes. A rebase onto viable/strict should clear the broken-trunk entries.

None of the above is blocking; (1) and (3) are the two I'd address before merge. The engine/session contract adherence (poisoning semantics, terminal-step accounting, position bookkeeping) all checks out against llm_session.h.
· branch gemma4-31b-serving-pr

Gasoonjia · 2026-06-24T20:22:13Z

+      int64_t min_prefill_chunk,
+      bool rebind_available
+#ifdef GEMMA_HAS_MUTABLE_STATE
+      ,


an extra , is a little wierd; plz remove it

Gasoonjia · 2026-06-24T20:23:57Z

+#endif
+
+#if defined(EXECUTORCH_BUILD_CUDA) || defined(EXECUTORCH_BUILD_MLX)
+#define GEMMA_HAS_MUTABLE_STATE 1


i think we can remove this; gemma431b can only run on cuda or mlx -> EXECUTORCH_BUILD_CUDA or EXECUTORCH_BUILD_MLX will always be defined -> GEMMA_HAS_MUTABLE_STATE should always be 1.

Gasoonjia · 2026-06-24T20:27:35Z

+  }
+
+  Error set_temperature(float temperature) {
+    temp_val_ = (temperature <= 0.0f) ? 1e-6f : temperature;


should we check the temperature here then set it?

Gasoonjia · 2026-06-24T20:29:23Z

+    decode_pos_data_[0] = pos_;
+    std::vector<EValue> inputs;
+#ifdef EXECUTORCH_BUILD_CUDA
+    ET_CHECK_OK_OR_RETURN_ERROR(copy_decode_inputs_to_cuda());


why do we need to copy input token from host to gpu? Now input token of decode method should be always on GPU.

metascroy · 2026-06-24T22:25:59Z

 endif()

 if(TARGET mlxdelegate)
  executorch_target_copy_mlx_metallib(gemma4_31b_runner)


Is this repeated twice?

metascroy · 2026-06-24T22:28:00Z

            "get_vocab_size": config.vocab_size,
            "get_n_layers": config.num_hidden_layers,
            "get_max_prefill_chunk": max_prefill,
+            "get_min_prefill_chunk": 5,


Do you pad up to this chunk size? Should this be backend specific? MLX can handle seqlen=1.

metascroy · 2026-06-24T23:05:35Z

+constexpr const char* kMaxPrefillChunk = "get_max_prefill_chunk";
+constexpr const char* kMinPrefillChunk = "get_min_prefill_chunk";
+
+Result<uint64_t> read_sampled_token(


Do we do topk as well?

Copilot AI review requested due to automatic review settings June 24, 2026 02:57

mergennachin requested review from kirklandsign and larryliu0820 as code owners June 24, 2026 02:57

mergennachin temporarily deployed to cadence June 24, 2026 02:57 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026

mergennachin requested review from Gasoonjia, digantdesai and metascroy June 24, 2026 02:57

Copilot started reviewing on behalf of mergennachin June 24, 2026 02:57 View session

mergennachin added the ciflow/cuda label Jun 24, 2026

Copilot AI reviewed Jun 24, 2026

View reviewed changes

mergennachin mentioned this pull request Jun 24, 2026

examples/models/gemma4_31b: CUDA Engine/Session adapter + OpenAI serving #20207

Closed

Gasoonjia reviewed Jun 24, 2026

View reviewed changes

metascroy reviewed Jun 24, 2026

View reviewed changes

metascroy approved these changes Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

gemma4_31b: add OpenAI serving entrypoint#20473

gemma4_31b: add OpenAI serving entrypoint#20473
mergennachin wants to merge 1 commit into
mainfrom
gemma4-31b-serving-pr

mergennachin commented Jun 24, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 24, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Gasoonjia commented Jun 24, 2026

Uh oh!

claude Bot commented Jun 24, 2026 •

edited

Loading

Uh oh!

Gasoonjia Jun 24, 2026

Uh oh!

Gasoonjia Jun 24, 2026

Uh oh!

Gasoonjia Jun 24, 2026

Uh oh!

Gasoonjia Jun 24, 2026

Uh oh!

metascroy Jun 24, 2026

Uh oh!

metascroy Jun 24, 2026

Uh oh!

metascroy Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

mergennachin commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473

❗ 1 Active SEVs

❌ 3 New Failures, 3 Unrelated Failures, 2 Unclassified Failures

Uh oh!

github-actions Bot commented Jun 24, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Gasoonjia commented Jun 24, 2026

Uh oh!

claude Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — Gemma4 31B serving entrypoint

Functional concerns

Tests

Minor / nits

CI

Uh oh!

Gasoonjia Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Gasoonjia Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Gasoonjia Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Gasoonjia Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

metascroy Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

metascroy Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

metascroy Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mergennachin commented Jun 24, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 24, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 24, 2026 •

edited

Loading