Skip to content

gemma4_31b: add OpenAI serving entrypoint#20473

Open
mergennachin wants to merge 1 commit into
mainfrom
gemma4-31b-serving-pr
Open

gemma4_31b: add OpenAI serving entrypoint#20473
mergennachin wants to merge 1 commit into
mainfrom
gemma4-31b-serving-pr

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.

This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.

Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.

#20001

Add Gemma4 31B model-specific serving support on top of the shared examples/llm_server harness.

This extracts the existing runner flow into a small Gemma4_31BEngine/LLMSession adapter, keeps main.cpp as a thin runner wrapper, and adds a C++ JSONL worker plus Python OpenAI-compatible launcher. The generic server remains model-agnostic; Gemma-specific behavior stays in examples/models/gemma4_31b, including chat-template options, BOS handling, channel cleanup, and Gemma tool-call parsing.

Also wire the worker into the existing Gemma CUDA/MLX CMake presets and Makefile targets, document the serving harness usage, and add validation coverage: hermetic launcher tests, an opt-in on-device BOS/template regression test, and a CUDA no-bleed integration proof for interleaved multi-session execution.
Copilot AI review requested due to automatic review settings June 24, 2026 02:57
@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20473

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 3 Unrelated Failures, 2 Unclassified Failures

As of commit 2e0ac26 with merge base 3169302 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an OpenAI-compatible serving entrypoint for the Gemma4 31B example, reusing the shared examples/llm_server harness while keeping Gemma-specific logic under examples/models/gemma4_31b. The model runner is refactored to use a dedicated Gemma4_31BEngine / LLMSession adapter, and a new C++ JSONL worker process is introduced for model execution.

Changes:

  • Introduce Gemma4_31BEngine + session adapter and refactor the existing runner to use it.
  • Add a C++ JSONL worker (gemma4_31b_worker) and a Python OpenAI-compatible launcher (serve.py) with Gemma-specific templating/tool parsing.
  • Wire new targets into CMake presets/Makefile/docs and add hermetic + opt-in on-device + CUDA no-bleed validation tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
Makefile Updates Gemma4 31B build targets to include worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/README.md Documents the serving harness, worker binary, and no-bleed test usage.
examples/models/gemma4_31b/CMakePresets.json Ensures presets build runner + worker (+ CUDA no-bleed test).
examples/models/gemma4_31b/CMakeLists.txt Adds worker + no-bleed test executables and links engine implementation into targets.
examples/models/gemma4_31b/export.py Exports mutable-buffer metadata for multi-session isolation support.
examples/models/gemma4_31b/gemma4_31b_engine.h Declares the Gemma4 engine/session adapter API.
examples/models/gemma4_31b/gemma4_31b_engine.cpp Implements engine/session logic, including mutable-state rebinding (CUDA/MLX).
examples/models/gemma4_31b/main.cpp Refactors runner to use the new engine/session adapter.
examples/models/gemma4_31b/gemma4_31b_worker.cpp Adds JSONL worker entrypoint for OpenAI-serving execution.
examples/models/gemma4_31b/serve.py Adds OpenAI-compatible Python control-plane launcher for Gemma4 31B.
examples/models/gemma4_31b/test_serve.py Adds hermetic launcher/control-plane tests.
examples/models/gemma4_31b/test_ondevice_serving.py Adds opt-in on-device BOS/template regression test.
examples/models/gemma4_31b/test_gemma4_31b_nobleed.cpp Adds CUDA integration proof for isolated interleaved multi-session execution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +143 to 146
std::string prompt_text = read_prompt();
if (!FLAGS_prompt_file.empty() && prompt_text.empty()) {
return 1;
}
Comment on lines +241 to +254
std::vector<std::unique_ptr<llm::LLMSession>> held;
while (true) {
auto session = engine->create_session();
if (session.error() != Error::Ok) {
break;
}
held.push_back(std::move(session.get()));
if (held.size() > static_cast<size_t>(config.max_sessions) + 2) {
break;
}
}
check(
"capacity enforced: create_session fails past max_sessions",
held.size() <= static_cast<size_t>(config.max_sessions));
@Gasoonjia

Copy link
Copy Markdown
Contributor

@claude review the code plz

@claude

claude Bot commented Jun 24, 2026

Copy link
Copy Markdown

Claude finished @Gasoonjia's task in 2m 36s —— View job


Code Review — Gemma4 31B serving entrypoint

Nice refactor. The LLMEngine/LLMSession split is clean, the generic worker_loop stays model-agnostic, and the warm-resume invariant (resident_token_ids.size() == position()) is well-documented and looks correct end-to-end. Comments below, ordered by severity.

Functional concerns

1. temperature validation rejects part of the OpenAI rangegemma4_31b_engine.cpp:412

static bool valid_temperature(float t) {
  return t == -1.0f || (t >= 0.0f && t <= 1.0f);
}

The OpenAI chat API accepts temperature ∈ [0, 2]. The worker passes the request value straight through (worker_loop.h:113), so a standards-compliant client sending temperature: 1.5 gets Error::InvalidArgument, which surfaces as a hard "prefill failed" for the whole request rather than a clamp. For something billed as an OpenAI-compatible entrypoint this is a surprising rejection. Consider accepting [0, 2] (or clamping in the control plane). Fix this →

2. Empty prompt file is treated as fatalmain.cpp:68-79, main.cpp:144 (also flagged by Copilot)
read_prompt() returns "" for both an unreadable file and a valid-but-empty file, and main() exits on any empty result. Only an actual open failure should be fatal. Either signal the open error distinctly (e.g. return std::optional) or only bail when !f.is_open().

Tests

3. No-bleed capacity assertion keys off the wrong valuetest_gemma4_31b_nobleed.cpp:248-254 (also flagged by Copilot)
The loop bound and final check both use config.max_sessions, but the engine clamps real capacity to 1 when rebinding is unavailable (and serving_capacity() is the source of truth). As written, the test passes even if create_session() silently clamps to 1 — i.e. it doesn't actually prove the multi-session capacity it's named for. Assert against engine->serving_capacity().max_physical_sessions_without_weight_duplication instead. Fix this →

Minor / nits

  • gemma4_31b_engine.cpp:584add_token_piece(..., "<turn|>") and "<end_of_turn>" are added to eos_ids, but main.cpp's template emits <turn|> as the user-turn terminator inside the prompt. Since these only affect sampled-token EOS detection (not prompt tokens), this is fine — worth a one-line comment noting why both spellings are covered, as the asymmetry with format_prompt is easy to misread.
  • main.cpp vs gemma4_31b_worker.cpp — the chat-template string lives literally in main.cpp:85 (raw runner) while the server path renders via Python ChatTemplate. Two sources of truth for "what a Gemma prompt looks like." Acceptable for an example, but a drift risk worth a comment pointing each at the other.
  • gemma4_31b_engine.cpp:251 — session destructor calls mutable_state_->destroy_session() outside exec_mutex_. The worker loop is single-threaded so this is safe today, but every other mutable-state touch goes through run_locked/with_active_session; a brief note that session lifecycle is single-threaded would prevent a future concurrent caller from tripping here.

CI

The 3 reported "new failures" are Docker exec infra errors (docker exec ... failed with exit code 1) and the aarch64 wheel failure is a pre-existing CoreML InMemoryNode::Kind build break unrelated to this PR — none appear caused by these changes. A rebase onto viable/strict should clear the broken-trunk entries.


None of the above is blocking; (1) and (3) are the two I'd address before merge. The engine/session contract adherence (poisoning semantics, terminal-step accounting, position bookkeeping) all checks out against llm_session.h.
· branch gemma4-31b-serving-pr

int64_t min_prefill_chunk,
bool rebind_available
#ifdef GEMMA_HAS_MUTABLE_STATE
,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an extra , is a little wierd; plz remove it

#endif

#if defined(EXECUTORCH_BUILD_CUDA) || defined(EXECUTORCH_BUILD_MLX)
#define GEMMA_HAS_MUTABLE_STATE 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can remove this; gemma431b can only run on cuda or mlx -> EXECUTORCH_BUILD_CUDA or EXECUTORCH_BUILD_MLX will always be defined -> GEMMA_HAS_MUTABLE_STATE should always be 1.

}

Error set_temperature(float temperature) {
temp_val_ = (temperature <= 0.0f) ? 1e-6f : temperature;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check the temperature here then set it?

decode_pos_data_[0] = pos_;
std::vector<EValue> inputs;
#ifdef EXECUTORCH_BUILD_CUDA
ET_CHECK_OK_OR_RETURN_ERROR(copy_decode_inputs_to_cuda());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to copy input token from host to gpu? Now input token of decode method should be always on GPU.

endif()

if(TARGET mlxdelegate)
executorch_target_copy_mlx_metallib(gemma4_31b_runner)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this repeated twice?

"get_vocab_size": config.vocab_size,
"get_n_layers": config.num_hidden_layers,
"get_max_prefill_chunk": max_prefill,
"get_min_prefill_chunk": 5,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you pad up to this chunk size? Should this be backend specific? MLX can handle seqlen=1.

constexpr const char* kMaxPrefillChunk = "get_max_prefill_chunk";
constexpr const char* kMinPrefillChunk = "get_min_prefill_chunk";

Result<uint64_t> read_sampled_token(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we do topk as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants