Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
… calls Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
📝 Walkthrough

Parallelism config creation in the speculative decoding example is now conditional. The Transformers plugin adds NVTX profiling decorators and lazily initializes Llama rotary embeddings.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
```python
if training_args.cp_size > 1 or training_args.dp_shard_size > 1:
    training_args.parallelism_config = ParallelismConfig(
        cp_size=training_args.cp_size, dp_shard_size=training_args.dp_shard_size
    )
```
Note: this is an unrelated bugfix related to #1045 (it does not fully solve the issue, just a single-GPU workaround).
As discussed in Slack, this issue is due to a transformers version mismatch. It should be fixed after updating transformers.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `modelopt/torch/speculative/plugins/transformers.py`:
- Around lines 908-910: The code can raise a NameError when `self.eagle_ttt_steps == 0`, because the loop that defines `ttt_step` never runs. Update the logic in `modify()` (or the surrounding block) to handle the zero case explicitly: either assert `self.eagle_ttt_steps >= 1` at the start of `modify()` to make the invariant explicit, or initialize `ttt_step` to a safe default (or skip the code that uses `ttt_step`) when `eagle_ttt_steps == 0`, and ensure `train_accs = torch.zeros(num_parallel, num_ttt, device=input_ids.device)` is still valid.
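The hazard can be reproduced in a few lines. The following is a hypothetical, self-contained sketch (the function name `run_ttt` and the accuracy update are illustrative stand-ins, not the real `modify()` code) showing the safe-default fix: track how many steps actually executed instead of reading the loop variable after the loop.

```python
import torch


def run_ttt(eagle_ttt_steps: int, num_parallel: int = 2) -> list:
    """Illustrates the zero-step hazard: if the loop never runs,
    `ttt_step` is undefined when the final slice executes."""
    num_ttt = max(eagle_ttt_steps, 1)
    train_accs = torch.zeros(num_parallel, num_ttt)
    executed_ttt_steps = 0  # safe default, valid even when the loop is skipped
    for ttt_step in range(eagle_ttt_steps):
        train_accs[:, ttt_step] = 1.0  # stand-in for the real accuracy update
        executed_ttt_steps = ttt_step + 1
    # Slicing with `ttt_step + 1` here would raise NameError
    # when eagle_ttt_steps == 0; the counter avoids that.
    return train_accs[:, :executed_ttt_steps].tolist()
```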
📒 Files selected for processing (1)
modelopt/torch/speculative/plugins/transformers.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
modelopt/torch/speculative/plugins/transformers.py (1)
909-912: ⚠️ Potential issue | 🟠 Major — Guard zero-step TTT to avoid undefined `ttt_step`.

If `self.eagle_ttt_steps == 0`, the loop at line 931 never runs, and line 989 references `ttt_step` before assignment.

🔧 Proposed fix

```diff
- train_accs = torch.zeros(num_parallel, num_ttt, device=input_ids.device)
+ train_accs = torch.zeros(num_parallel, num_ttt, device=input_ids.device)
+ executed_ttt_steps = 0
@@
  for ttt_step in range(self.eagle_ttt_steps):
@@
-     train_accs[i, ttt_step] = acc
+     train_accs[i, ttt_step] = acc
+     executed_ttt_steps = ttt_step + 1
      if not self.training:
          break
@@
- train_accs = train_accs[:, : ttt_step + 1].tolist()
+ train_accs = train_accs[:, :executed_ttt_steps].tolist()
```

Also applies to: 988-990
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `modelopt/torch/speculative/plugins/transformers.py`:
- Around lines 928-929: The RoPE initializer is only called in `HFEagleModel.forward`, but other entry points like `pseudo_speculative_generate()` call `_eagle_forward()` and can trigger `EagleModule.forward` before `rotary_emb` exists. Make every EAGLE entry path invoke the initializer: call `self.eagle_module._maybe_init_rope()` at the start of `EagleModule.forward`, and also ensure `_eagle_forward()` (and/or `pseudo_speculative_generate()`) invokes `_maybe_init_rope()` before any use of `rotary_emb`, so RoPE is always initialized regardless of which forward path is exercised.
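The lazy-init pattern the comment asks for can be sketched as follows. This is a hypothetical miniature (the class name, dimensions, and rotary-embedding stand-in are all illustrative, not the real EagleModule): every public entry point guards itself with `_maybe_init_rope()` before touching `rotary_emb`.

```python
import torch
from torch import nn


class EagleModuleSketch(nn.Module):
    """Illustrative module whose rotary embedding is built lazily on first use."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.dim = dim
        self.rotary_emb = None  # created lazily

    def _maybe_init_rope(self) -> None:
        if self.rotary_emb is None:
            # Stand-in for constructing the real rotary-embedding module.
            inv_freq = 1.0 / (10000 ** (torch.arange(0, self.dim, 2) / self.dim))
            self.register_buffer("inv_freq", inv_freq, persistent=False)
            self.rotary_emb = lambda pos: (
                torch.cos(pos[:, None] * self.inv_freq),
                torch.sin(pos[:, None] * self.inv_freq),
            )

    def forward(self, positions: torch.Tensor):
        self._maybe_init_rope()  # guard in forward ...
        return self.rotary_emb(positions)

    def pseudo_speculative_generate(self, positions: torch.Tensor):
        self._maybe_init_rope()  # ... and in every other entry path
        return self.rotary_emb(positions)
```

Whichever path runs first triggers initialization, so no caller can observe `rotary_emb` as `None`.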
📒 Files selected for processing (1)
modelopt/torch/speculative/plugins/transformers.py
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1044      +/-   ##
==========================================
- Coverage   70.29%   70.26%   -0.03%
==========================================
  Files         227      227
  Lines       25857    25868      +11
==========================================
  Hits        18176    18176
- Misses       7681     7692      +11
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Makes sense to me. Is there any perf comparison before/after the optimizations?
🧹 Nitpick comments (1)
modelopt/torch/speculative/plugins/transformers.py (1)
720-725: Consider a size guard before caching the full teacher softmax.

Caching `base_output_softmax_logits` as a full `[B, S, V]` tensor can materially increase peak memory (especially in large-vocab runs). A guarded fallback to per-slice softmax would keep this optimization safer across wider configs.

💡 Example guard pattern

```diff
- base_output_predict_tok = base_model_logits.argmax(dim=-1).detach()
- base_output_softmax_logits = torch.softmax(base_model_logits, dim=2).detach()
+ base_output_predict_tok = base_model_logits.argmax(dim=-1).detach()
+ cache_softmax = base_model_logits.numel() <= getattr(
+     self.eagle_config, "max_cached_teacher_prob_elems", 0
+ )
+ base_output_softmax_logits = (
+     torch.softmax(base_model_logits, dim=2).detach() if cache_softmax else None
+ )
```
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/speculative/plugins/transformers.py` around lines 720 - 725, The code currently caches base_output_softmax_logits as a full [B,S,V] tensor which can blow up memory for large vocabularies; add a size guard using eagle_config.draft_vocab_size and eagle_config.vocab_size (or an explicit max_vocab_for_full_softmax threshold) and only compute/cache full softmax when vocab_size*B*S is below the threshold, otherwise avoid storing base_output_softmax_logits and compute softmax per-slice on demand (or keep only argmax via base_output_predict_tok); update the block around base_model_logits, base_output_predict_tok and base_output_softmax_logits to branch on this guard and ensure downstream users handle the per-slice-compute path.
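The guard can be isolated into a small helper. This is a sketch under stated assumptions: the threshold constant `MAX_CACHED_SOFTMAX_ELEMS` and the helper name are hypothetical, not part of the real `eagle_config`.

```python
import torch

# Hypothetical threshold; the real config knob would live on eagle_config.
MAX_CACHED_SOFTMAX_ELEMS = 1_000_000


def maybe_cache_teacher_softmax(base_model_logits: torch.Tensor):
    """Cache the full [B, S, V] teacher softmax only when it is small enough;
    otherwise return None so callers compute softmax per slice on demand."""
    predict_tok = base_model_logits.argmax(dim=-1).detach()
    if base_model_logits.numel() <= MAX_CACHED_SOFTMAX_ELEMS:
        softmax_logits = torch.softmax(base_model_logits, dim=-1).detach()
    else:
        softmax_logits = None  # downstream code recomputes per slice
    return predict_tok, softmax_logits
```

Callers then branch on `softmax_logits is None`, trading a bit of recompute for bounded peak memory on large-vocab models.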
📒 Files selected for processing (1)
modelopt/torch/speculative/plugins/transformers.py
@h-guo18 I isolated the perf using an nsys profile with online training of a config for Llama 3.2 1B with K=3 at ISL 2048. Looking at the EAGLE3 FWD+BWD and excluding the target model forward pass, I get a 1.9x speed improvement (roughly 280ms per batch of 16 requests, down from 540ms on main).
@h-guo18 could you advise on this test failure? It seems like the Windows build doesn't have NVTX available. I'm not sure how modelopt CI works; what do you suggest to fix the CI?
Seems like only the test on Windows fails (link). Instead of installing it in the testing container, I think it's better to make it an optional dependency for minimal impact, e.g. wrap the decorator with a check:

```python
from contextlib import contextmanager

try:
    from torch.cuda import nvtx as torch_nvtx
except Exception:
    torch_nvtx = None


def _nvtx_available() -> bool:
    if torch_nvtx is None:
        return False
    try:
        torch_nvtx.range_push("probe")
        torch_nvtx.range_pop()
        return True
    except Exception:
        return False


_NVTX_ENABLED = _nvtx_available()


@contextmanager
def _null_range(msg: str):
    yield


def nvtx_range(msg: str):
    """Can be used as both a decorator and a context-manager fallback target."""
    if _NVTX_ENABLED:
        return torch_nvtx.range(msg)
    return _null_range(msg)


# To use it:
@nvtx_range("eagle_loss")
def compute_loss(x):
    return x.sum()
```
That's a huge speedup, thanks! I don't see any torch.compile in the current PR code, so I assume most of the speedup comes from precomputing the base_model_logits-related stuff?
Oh oops I forgot to push the torch.compile annotations |
A massive chunk comes from the torch.compile |
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/speculative/plugins/transformers.py`:
- Line 41: Make the NVTX import optional: replace the hard import of `nvtx` with a guarded import (try/except ImportError) that sets a module-level `nvtx = None` when unavailable, and add a decorator factory named `nvtx_range` that returns a no-op decorator when `nvtx` is None and `nvtx.range(...)` when it is present. Then replace all uses of the `@nvtx.range(...)` decorator in this module with `@nvtx_range(...)` so the code runs on systems without CUDA, without changing behavior when NVTX is available.
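A minimal version of that guarded-import pattern could look like this. It is a sketch, not the PR's actual code: it assumes the third-party `nvtx` package and uses its `push_range`/`pop_range` calls, whereas the reviewed module may wrap a different NVTX API.

```python
import functools

# Guarded import: fall back to a no-op when NVTX is unavailable (e.g. Windows CI).
try:
    import nvtx
except ImportError:
    nvtx = None


def nvtx_range(msg: str):
    """Decorator factory: a real NVTX range when available, a no-op otherwise."""
    def decorator(fn):
        if nvtx is None:
            return fn  # no NVTX: return the function unchanged

        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            nvtx.push_range(msg)
            try:
                return fn(*args, **kwargs)
            finally:
                nvtx.pop_range()
        return wrapped
    return decorator


@nvtx_range("eagle_loss")
def compute_loss(values):
    return sum(values)
```

With this shape, `@nvtx.range(...)` call sites become `@nvtx_range(...)` and behave identically when NVTX is installed.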
📒 Files selected for processing (1)
modelopt/torch/speculative/plugins/transformers.py
Makes sense. Besides, …
LGTM. Please run a full test before merging |
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…ups-torch-compile Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@h-guo18 torch.compile is now optional, but on by default. I think we can update it later to disable the feature if CP>1, should that become a problem. But given the significant difference in performance, I think we should try to have it on in as many cases as possible (this is vLLM's torch.compile philosophy).

I ran a 1000-step test on 1xB200 with batch size 16 for Llama 3.2-1B. Performance is much better with torch.compile enabled, 1.95 it/s vs. 1.45 it/s, and this is in online mode where the base model cost is unchanged (the base model is not compiled). Accuracy is identical for both loss and AR.

Config: …
Launch command: …
Results — torch.compile ON: … / torch.compile OFF: …
The tests should pass now. |
```python
self._prepare_eagle_inputs = torch.compile(self._prepare_eagle_inputs, dynamic=False)
self._eagle_forward = torch.compile(
    self._eagle_forward, dynamic=False, mode="max-autotune"
)
self._eagle_loss = torch.compile(self._eagle_loss, dynamic=False, fullgraph=True)
```
How about we add a try-except around these torch.compile calls, so that they fall back to eager whenever compilation fails?
I think there are many other (unknown) cases where torch.compile will fail, other than the Windows CI.
Then we can also separate the compilation of these 3 functions: e.g. even if `_eagle_forward` fails to compile (due to flex attention, perhaps), the other 2 functions can still be optimized.
I like this idea. However, I still think we would like to have full coverage of torch compile in CI if possible
I guess the linux test is passing, so for today we are covered. But it would be nice to have some confidence that the torch compile doesn't break (and just get skipped) in the future
Seems like some error could happen at runtime, even if compile works:
https://github.com/NVIDIA/Model-Optimizer/actions/runs/23258816928/job/67623073586?pr=1044#step:4:1727
We probably need to also set `torch._dynamo.config.suppress_errors = True` for fallback, in addition to the try-except before torch.compile.
Let's see if this one does the trick
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…VIDIA/Model-Optimizer into bchislett/eagle-speedups-torch-compile
What does this PR do?
Type of change: Optimization
Changes:
Usage
No changes to external interfaces
Testing
Ran training commands for benchmarking. Did not do a full training run, did not validate correctness.
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
Summary by CodeRabbit
New Features
Improvements