Add the EAGLE-3 speculative-decoding runner (CUDA) by digantdesai · Pull Request #20156 · pytorch/executorch

digantdesai · 2026-06-09T16:03:59Z

A C++ runner that drives the speculator .pte with the shifted (vLLM-EAGLE)
scheme: the draft pairs the target hidden state at position t with token t+1, so
each round runs one target forward (target_verify) and reseeds the next draft
chain from the hidden states verify already produced -- no standalone target
decode. Greedy verification keeps output identical to greedy target decoding.
target_verify runs on stable input buffers and can be captured as a CUDA graph.
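To illustrate why greedy verification leaves the output unchanged, here is a minimal sketch of the accept step. It is not code from this PR: `greedy_accept` and its argument layout are hypothetical. It assumes one verify forward yields the target's argmax token at every window position, so that entry i is the target's pick given the context plus the first i draft tokens.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not taken from the PR.
// draft:         tokens proposed by the draft chain, d[0..n-1].
// target_argmax: greedy target picks from one verify forward; entry i is the
//                target's choice given the context plus d[0..i-1], so it is
//                assumed to hold n + 1 entries.
// Returns the tokens to emit this round: the accepted draft prefix followed by
// exactly one target token (a correction on mismatch, a bonus on full accept).
std::vector<int64_t> greedy_accept(
    const std::vector<int64_t>& draft,
    const std::vector<int64_t>& target_argmax) {
  std::vector<int64_t> out;
  if (target_argmax.size() != draft.size() + 1) {
    return out; // malformed input: emit nothing
  }
  std::size_t i = 0;
  // Keep draft tokens only while they match what greedy decoding would pick.
  while (i < draft.size() && draft[i] == target_argmax[i]) {
    out.push_back(draft[i]);
    ++i;
  }
  // The first unmatched position (or the position past the last draft token)
  // always contributes the target's own token.
  out.push_back(target_argmax[i]);
  return out;
}
```

Every emitted token is either a draft token equal to the target's argmax or the target's argmax itself, so under these assumptions the sequence is the one plain greedy decoding would produce.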

The runner requires the .pte metadata (fails loudly if absent) and enforces the exported
prefill range [get_min_prefill_chunk, get_max_prefill_chunk] (no chunking). The
prefill bonus token is always emitted; the speculative loop runs only when more
tokens are requested, the bonus was not EOS, and a K-token verify window fits
within get_max_seq_len (so a one-token or near-context request returns without
seeding the draft). The chat template and stop tokens are flags defaulting to
Gemma 4 IT (--chat_prefix/--chat_suffix/--stop_ids/--stop_token, --bos_id -1 to
skip) so other target/tokenizer pairs run without code changes. Device-to-host
reads are error-checked; the printed tau excludes the free prefill token.
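The loop-entry conditions and the tau statistic described above could be sketched as follows. All names here (`LoopGate`, `should_speculate`, `tau`) are hypothetical and not from the PR. The sketch assumes the verify window starts right after the prompt, and that tau means tokens emitted per verify round.

```cpp
#include <cstdint>

// Hypothetical names throughout; the PR's actual variables may differ.
struct LoopGate {
  int64_t max_new_tokens; // tokens requested by the caller
  int64_t prompt_len;     // tokens consumed by prefill
  int64_t k;              // verify window length in tokens
  int64_t max_seq_len;    // value reported by get_max_seq_len
  bool bonus_is_eos;      // prefill bonus token was a stop token
};

// True when the speculative loop should start after the prefill bonus token.
bool should_speculate(const LoopGate& g) {
  // The bonus token already satisfies a one-token request.
  const bool more_requested = g.max_new_tokens > 1;
  // Assumption: the window occupies the k positions right after the prompt.
  const bool window_fits = g.prompt_len + g.k <= g.max_seq_len;
  return more_requested && !g.bonus_is_eos && window_fits;
}

// Mean tokens emitted per verify round, excluding the free prefill token.
double tau(int64_t emitted_tokens, int64_t verify_rounds) {
  if (verify_rounds <= 0) {
    return 0.0; // the loop never ran, so there is nothing to average
  }
  return static_cast<double>(emitted_tokens - 1) /
      static_cast<double>(verify_rounds);
}
```

Subtracting the prefill token keeps tau from being inflated on short generations, where that one free token would otherwise be a large share of the total.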

Authored with assistance from Claude Code.

[ghstack-poisoned]

digantdesai · 2026-06-09T16:04:00Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-09T16:04:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20156

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 1 Unrelated Failure, 2 Unclassified Failures

As of commit 593f812 with merge base da9158b:

NEW FAILURE - The following job has failed:

Lint / lintrunner (gh)
>>> Lint for examples/models/eagle3/main.cpp:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

Update

5725fee

[ghstack-poisoned]

meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jun 9, 2026

Update

dba68a6

[ghstack-poisoned]

digantdesai marked this pull request as ready for review June 23, 2026 19:22

digantdesai requested review from kirklandsign and larryliu0820 as code owners June 23, 2026 19:22

Update

593f812

[ghstack-poisoned]

digantdesai had a problem deploying to cadence June 25, 2026 05:11 — with GitHub Actions Error

digantdesai temporarily deployed to cadence June 25, 2026 05:11 — with GitHub Actions Inactive

digantdesai wants to merge 3 commits into gh/digantdesai/60/head from gh/digantdesai/61/head
