feat(okr30): add EAGLE3 Claude Code skills for triage, validation, and new-model support #1429

Open
yeyu-nvidia wants to merge 5 commits into main from yeyu/eagle3-claude-skills

@yeyu-nvidia yeyu-nvidia commented May 11, 2026

Summary

Adds four user-invocable Claude Code skills for the EAGLE3 offline pipeline (OKR 30 — Claude-assist experience).

  • /eagle3-triage: Diagnose a failed pipeline run. Failure tables for all 4 tasks covering vLLM server startup, hidden state dump (3 backends: TRT-LLM / HF / vLLM), training crashes, and benchmark failures. New-model-specific issue checklist (VLMs, MoE, SWA, custom tokenizers).
  • /eagle3-validate: Verify a completed run end-to-end. Artifact checks per task, AR threshold validation (≥ 2.1), structured validation report.
  • /eagle3-new-model: Guided workflow for adding a new model. Architecture lookup, GB200 GPU/TP calculation, dump backend selection, full YAML template with correct public-launcher script paths.
  • /eagle3-review-logs: Lightweight log reader. Finds sbatch_*.out files, reads all task logs, produces pass/fail summary with root causes and next steps.

Skills use public launcher paths (common/eagle3/, common/vllm/, etc.) and read sbatch_*.out files directly — no sandbox-specific tooling required.
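As a rough illustration of what the log reader does, the sketch below finds `sbatch_*.out` files under an experiment directory and keeps the last 200 lines of each. It is not the skill's implementation (the skill uses shell commands); the function names, the recursive search, and the worker count are assumptions made for the example, and the 200-line tail is taken from the review discussion further down.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_task_logs(experiment_dir):
    """Return all sbatch_*.out files under experiment_dir, sorted by path.

    Assumption: logs may sit in per-task subdirectories, so search recursively.
    """
    return sorted(Path(experiment_dir).rglob("sbatch_*.out"))


def tail(path, n=200):
    """Return the last n lines of a file without loading it all into memory."""
    with open(path, "r", errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def tail_all(experiment_dir, n=200, workers=8):
    """Tail every task log concurrently; returns {log_path: last_n_lines}."""
    logs = find_task_logs(experiment_dir)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each log with its tail.
        return dict(zip(logs, pool.map(lambda p: tail(p, n), logs)))
```

A thread pool is enough here because the work is I/O bound; a plain sequential loop gives the same result for a handful of logs.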

Related

Test plan

  • Run /eagle3-triage against a known-failed experiment and verify it identifies the root cause
  • Run /eagle3-validate against a passing experiment and verify AR check works
  • Run /eagle3-new-model to generate a config for a new model and verify the YAML is correct
  • Run /eagle3-review-logs and verify summary output matches actual log contents

Summary by CodeRabbit

  • Documentation
    • Added comprehensive EAGLE3 guides: onboarding new models and example config creation, a log-review workflow that generates structured pass/fail reports with root-cause one-liners and suggested fixes, a user-invocable end-to-end validation workflow for pipeline runs, and a triage/troubleshooting guide mapping common failure patterns to concrete remediation steps.

…d new-model support

Four user-invocable skills for the EAGLE3 offline pipeline:

- eagle3-triage: diagnose failed pipeline runs step-by-step; failure tables
  for all 4 tasks (vLLM data synthesis, hidden state dump with 3 backends,
  training, benchmark); new-model-specific issue checklist
- eagle3-validate: verify completed runs; artifact checks; AR threshold (>= 2.1);
  structured validation report with next-step guidance
- eagle3-new-model: guided workflow for adding a new model; architecture lookup,
  GPU/TP calculation for GB200, backend selection, full YAML template with
  correct public-launcher script paths
- eagle3-review-logs: lightweight log reader; finds sbatch .out files, reads all
  task logs, produces pass/fail summary with root causes

Skills use public launcher paths (common/eagle3/, common/vllm/, etc.) and read
sbatch .out files directly — no sandbox-specific tooling required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2caff668-140f-4d79-be7d-4de88e9dbd58

📥 Commits

Reviewing files that changed from the base of the PR and between 0cfd3f5 and cbdc31e.

📒 Files selected for processing (1)
  • .claude/skills/eagle3-new-model/SKILL.md
✅ Files skipped from review due to trivial changes (1)
  • .claude/skills/eagle3-new-model/SKILL.md

📝 Walkthrough

Adds four new Claude skill documents that guide creating model YAMLs, validating EAGLE3 pipeline runs, reviewing experiment logs, and triaging failures. All changes are documentation only; no code entities are modified.

Changes

EAGLE3 Pipeline Workflow Skills

| Layer | File(s) | Summary |
|---|---|---|
| New Model Configuration Skill | `.claude/skills/eagle3-new-model/SKILL.md` | Procedural guide to create hf_offline_eagle3.yaml for new models: extract architecture/serving properties, compute OCI‑HSG/GB200 GPU/node sizing, select hidden-state dump backend (vLLM/HF/TRT-LLM), author a 4-task pipeline (data synthesis, hidden-state dump, offline training, benchmark), add model-specific adjustments, and run a dry-run. |
| Pipeline Validation Skill | `.claude/skills/eagle3-validate/SKILL.md` | End-to-end validation workflow: locate latest experiment, inspect task logs for success/timeout/failure, verify artifacts under /scratchspace/, extract benchmark acceptance-rate (AR) and compare to threshold (>= 2.1), check training quality cues, and emit a structured validation report with PASS/FAIL and next steps. |
| Log Review and Analysis Skill | `.claude/skills/eagle3-review-logs/SKILL.md` | Systematic log review: locate sbatch_*.out Slurm logs, tail last 200 lines per task in parallel, analyze exit/cancellation signals, Python tracebacks, CUDA/Slurm failure modes, and success indicators; generate a structured markdown report with per-task diagnosis, suggested fixes, and a benign-patterns table. |
| Pipeline Failure Diagnosis Skill | `.claude/skills/eagle3-triage/SKILL.md` | Comprehensive triage workflow: find failed experiments, fetch Slurm logs, map error patterns to root causes and concrete fixes per task, run new-model-specific checks (VLM/SWA/trust_remote_code/MoE sizing/tokenizer cache), provide re-run commands and skip flags, and instruct updating tools/launcher/examples/EAGLE3_TRIAGE.md. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • kaix-nv
  • chadvoegele
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly describes the main changes: adding four EAGLE3 Claude Code skills for triage, validation, new-model support, and log review, matching the primary content of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR modifies only .claude/skills documentation (SKILL.md markdown files) with no Python code changes to modelopt or examples directories. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/eagle3-new-model/SKILL.md:
- Around line 33-38: The fenced code block containing the sizing formulas (the
lines starting with "BF16 weight size  = total_params × 2 bytes" through "tp    
= min(gpus_needed, 4)") needs a fence language added (for example use ```text)
to satisfy MD040; update the opening fence to include that language so the
Markdown linter recognizes it as a plain-text code block.
- Around line 1-9: Add the missing frontmatter key user_invocable: true to the
skill metadata so the skill becomes callable; edit the SKILL.md frontmatter for
the eagle3-new-model skill and insert user_invocable: true (boolean) alongside
name/description so the YAML now includes user_invocable: true.
- Around line 33-45: The TP calculation is inconsistent: the formula "tp =
min(gpus_needed, 4)" contradicts table rows that set tp=4 even when GPUs needed
is 1 or 2; update either the formula or the table so they match. Locate the
block with the formulas (BF16 weight size, GPUs needed, nodes, tp) and either
change the tp formula to "tp = min(max(gpus_needed, 4), 4)" or more sensibly "tp
= min(4, gpus_needed) if gpus_needed >= 4 else gpus_needed" (or adjust each
example row in the table to set tp = gpus_needed for 1–3 GPUs), and ensure the
entries for models (8B dense, 70B dense, 685B MoE, 1T MoE) reflect the chosen
rule consistently.

In @.claude/skills/eagle3-review-logs/SKILL.md:
- Around line 54-79: The markdown in the SKILL.md "Output a structured markdown
report:" section is triggering markdownlint MD022/MD031 because headings (e.g.,
"### Summary", "### Task Results", "## Step 4 — Suggest next steps") and the
fenced code block (the ```bash snippet) are not surrounded by blank lines;
update the template so every heading has a blank line before and after it and
ensure the fenced code block has a blank line before and after the ```bash fence
to satisfy MD022/MD031 and eliminate the formatting warnings.
- Around line 34-40: The text says “Read the last 200 lines of each log in
parallel” but the shown for-loop is sequential; either remove the phrase “in
parallel” or replace the sequential for-loop block with a true parallel command.
Concretely, update the snippet that currently uses the for f in $(find ...); do
... tail -200 ... done to use the parallel xargs pipeline (find ... | sort |
xargs -I{} -P 8 sh -c 'echo "=== {} ==="; tail -200 "{}"; echo') so it actually
runs tails in parallel, or else change the prose to say “sequentially” and keep
the existing for-loop.

In @.claude/skills/eagle3-triage/SKILL.md:
- Around line 148-163: Add a blank line before and after each fenced code block
in the SKILL.md section containing the two uv run examples (the triple-backtick
command blocks for "To skip task_0..." and "To run only task_1...") so there is
an empty line surrounding each ```...``` fence to satisfy MD031.

In @.claude/skills/eagle3-validate/SKILL.md:
- Around line 59-61: The fenced code block containing "Average Acceptance Length
{'accept': X, 'count': Y, 'ratio': Z.ZZ}" needs a language tag to satisfy MD040;
update the three-backtick fence from ``` to ```text so the block is recognized
as plain text (leave the content unchanged) in SKILL.md.
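
The log line quoted in the last comment lends itself to a small parser for the AR check. The sketch below is illustrative only: it assumes the dict after `Average Acceptance Length` is a valid Python literal and that its `ratio` field is the value compared against the 2.1 threshold, neither of which this PR states explicitly. The function names are hypothetical.

```python
import ast
import re

AR_THRESHOLD = 2.1  # threshold stated in the PR description (>= 2.1)

# Matches the benchmark log line and captures the trailing dict literal.
_AR_LINE = re.compile(r"Average Acceptance Length\s*(\{[^}]*\})")


def extract_ar(log_text):
    """Return the 'ratio' value from the first matching log line, or None."""
    match = _AR_LINE.search(log_text)
    if match is None:
        return None
    stats = ast.literal_eval(match.group(1))  # safe parse of the dict literal
    # Assumption: 'ratio' is the acceptance figure the skill validates.
    return float(stats["ratio"])


def ar_passes(log_text, threshold=AR_THRESHOLD):
    """True only if an AR value was found and it meets the threshold."""
    ar = extract_ar(log_text)
    return ar is not None and ar >= threshold
```

Returning `None` when the line is absent keeps "benchmark never reported an AR" distinct from "AR was reported but too low", which matters for a pass/fail report.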

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5681877c-5f95-498c-bd50-02be8e857617

📥 Commits

Reviewing files that changed from the base of the PR and between d30ebbd and 6c580d8.

📒 Files selected for processing (4)
  • .claude/skills/eagle3-new-model/SKILL.md
  • .claude/skills/eagle3-review-logs/SKILL.md
  • .claude/skills/eagle3-triage/SKILL.md
  • .claude/skills/eagle3-validate/SKILL.md

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.47%. Comparing base (555be6c) to head (cbdc31e).
⚠️ Report is 87 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1429      +/-   ##
==========================================
- Coverage   76.74%   72.47%   -4.28%     
==========================================
  Files         476      487      +11     
  Lines       51307    60914    +9607     
==========================================
+ Hits        39377    44147    +4770     
- Misses      11930    16767    +4837     
| Flag | Coverage Δ |
|---|---|
| unit | 53.92% <ø> (+1.38%) ⬆️ |

yeyu-nvidia and others added 2 commits May 11, 2026 11:47
Add `text` language specifiers to bare fenced code blocks:
- eagle3-new-model/SKILL.md: GPU calculation formula block
- eagle3-validate/SKILL.md: acceptance rate log output block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>

Add blank lines before fenced code blocks as required by MD031:
- eagle3-triage/SKILL.md: two re-run command blocks
- eagle3-review-logs/SKILL.md: suggested fix block and section headers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
hf_model: /hf-local/<org>/<model>
```

### task_0 — Data synthesis (`common/vllm/query.sh`)
Contributor

These steps should be obvious for agents to infer from existing launcher examples. I personally suggest letting the agent infer from the existing ones rather than providing general guidance here.
Reason: launcher examples are easier to test and maintain than md skills.

cc @ChenhanYu for discussion

Contributor Author

Reasonable — agents do infer most of this from existing YAMLs. I'll trim the skill to the non-obvious bit (backend-selection heuristic: HF for VLM/SWA, vLLM as default, TRT-LLM for plain text) and let the agent infer the rest from the example configs rather than spelling out a general step list.

| NVFP4 quant model | task_0/task_3 use quant container; task_1/task_2 use BF16 base model — add `hf_model_bf16` global_var |
| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args |

## Step 6 — Test with dry run
Contributor

In my experience, I have never told the agent to do a dry run, but it does one basically every time. I think the dry-run command may be obvious enough for the agent to read from the launcher interface.

Contributor Author

Agreed — dry-run is something the agent reads from the launcher interface anyway. I'll drop the explicit dry-run step.

uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v
```

## Step 7 — Update triage chart
Contributor

What's the relationship between the triage chart and pensive-intern's flow chart?

cc @ChenhanYu

Contributor Author

Good question — I should reconcile these. If pensive-intern's flow chart already covers the triage decision tree, the chart here is redundant and should point at it instead of duplicating. cc @ChenhanYu

Contributor Author

Checked Pensieve (docs) — it's the team's Slack memory/status bot (knowledge graph, weekly digests, JIRA ticket triage). It doesn't read EAGLE3 experiment logs, diagnose pipeline task failures, or validate runs, so there's no functional overlap with these skills, which operate on the launcher's experiments/ output and the 4-task pipeline. So I'd keep eagle3-triage/eagle3-review-logs rather than consolidate. Happy to reconsider if you were thinking of a different tool.


## Step 2 — Calculate GPU requirements (OCI-HSG / GB200)

OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node**
Contributor

This cluster info is obvious for the agent to infer. I don't think we need to hard-code it here.

Contributor Author

Agreed — the OCI-HSG cluster sizing shouldn't be hardcoded here. I'll remove it and let the agent infer node/GPU sizing from existing examples + the launcher interface.

| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) |

Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM does not support these).
Use **vLLM** for everything else as the default.
Contributor

Curious what's the consideration here

Contributor Author

No strong reason — this falls in the same bucket as the other "agent can infer this" notes. I'll trim it when I slim the skill down.
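
For reference, the backend-selection rule discussed in this thread (HF for VLMs or sliding window attention, vLLM as the default, TRT-LLM for pure-text models that support it) can be written as a small decision function. This is a sketch of the heuristic as quoted above, not code from the PR; the function name, parameter names, and return strings are hypothetical.

```python
def select_dump_backend(is_vlm, uses_sliding_window, prefer_trtllm=False):
    """Pick a hidden-state dump backend following the quoted heuristic.

    prefer_trtllm is a hypothetical opt-in flag for pure-text models that
    are known to have TRT-LLM support.
    """
    # TRT-LLM does not support VLMs or sliding window attention, so HF wins.
    if is_vlm or uses_sliding_window:
        return "hf"
    # TRT-LLM is only chosen when explicitly requested for a pure-text model.
    if prefer_trtllm:
        return "trtllm"
    # Everything else falls through to the default backend.
    return "vllm"
```

The ordering matters: the VLM/SWA check comes first so that an explicit TRT-LLM preference can never select a backend the model does not support.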

@@ -0,0 +1,99 @@
---
name: eagle3-review-logs
Contributor

@h-guo18 h-guo18 Jun 3, 2026

I never used any skills, but my agent does basically the same thing when given the launcher yaml example alone. Any specific reason we need this?

Contributor Author

The intent is triage-and-expose, not to replace human judgment: encode the launcher-specific surface (experiment dir layout, the 4-task mapping, acceptance-rate threshold, known per-model gotchas) so the agent reliably localizes and reports a failure for a human to act on. If your agent already does this well from the YAML alone, and pensive-intern covers the same surface, I'm happy to consolidate — drop eagle3-review-logs and keep only eagle3-new-model. cc @ChenhanYu

@@ -0,0 +1,179 @@
---
name: eagle3-triage
Contributor

Does this functionally overlap with pensive-intern? cc @ChenhanYu

Contributor Author

Possibly — if pensive-intern already covers EAGLE3 triage, there's functional overlap and I'd rather not duplicate. I'll check pensive-intern's scope; if it covers this, I'll drop eagle3-triage and keep only eagle3-new-model (config generation). cc @ChenhanYu

Contributor Author

Checked Pensieve (docs) — it's the team's Slack memory/status bot (knowledge graph, weekly digests, JIRA ticket triage). It doesn't read EAGLE3 experiment logs, diagnose pipeline task failures, or validate runs, so there's no functional overlap with these skills, which operate on the launcher's experiments/ output and the 4-task pipeline. So I'd keep eagle3-triage/eagle3-review-logs rather than consolidate. Happy to reconsider if you were thinking of a different tool.

- eagle3-new-model: add user_invocable:true (match sibling eagle3 skills);
  fix internally-inconsistent GPU-sizing table (tp is fixed at 4, full-node
  sharding; gpus_to_fit only sizes node count) and use consistent example rows;
  update vLLM backend note to native extractor (no speculators).
- eagle3-review-logs: correct "in parallel" wording (the loop is sequential).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
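
The sizing rule described in this commit message (tp fixed at 4, with `gpus_to_fit` only determining the node count) can be sketched as follows. The per-node figures (4 GPUs × 192 GB) and the BF16 weight formula come from the skill text quoted in this PR; the `gpus_to_fit` formula is an assumption, since the PR page does not show it, and it counts weights only with no headroom for KV cache or activations. The function name is hypothetical.

```python
import math

GPUS_PER_NODE = 4         # from the PR: OCI-HSG nodes have 4 GPUs
GPU_MEMORY_GB = 192       # from the PR: 192 GB HBM3e per GPU
BYTES_PER_BF16_PARAM = 2  # from the PR: BF16 weight size = total_params x 2 bytes


def size_cluster(total_params):
    """Estimate node count for a model; tp is always one full node."""
    weight_gb = total_params * BYTES_PER_BF16_PARAM / 1e9
    # Assumption: weights only; real sizing needs extra memory headroom.
    gpus_to_fit = max(1, math.ceil(weight_gb / GPU_MEMORY_GB))
    nodes = math.ceil(gpus_to_fit / GPUS_PER_NODE)
    return {
        "weight_gb": weight_gb,
        "gpus_to_fit": gpus_to_fit,
        "nodes": nodes,
        "tp": GPUS_PER_NODE,  # fixed at 4: shard across the full node
    }
```

Under these assumptions an 8B or 70B dense model fits on one node, while a 685B model needs two; in every case `tp` stays at 4, which is the consistency fix the commit describes.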
yeyu-nvidia added a commit that referenced this pull request Jun 3, 2026
- vLLM dump: replace speculators VllmHiddenStatesGenerator + runtime
  source-patches with vLLM's native extract_hidden_states extractor
  (ExampleHiddenStatesConnector). No third-party data-gen dependency.
- Delete legacy duplicate scripts in examples/speculative_decoding/pipeline/eagle3/
  (dump_offline_data{,_hf,_vllm}.sh, offline_training.sh) — canonical
  versions live in tools/launcher/common/eagle3/.
- Move EAGLE3 SKILL.md files out of this PR (consolidated into #1429).
- Kimi-K2.5 benchmark: forward --trust-remote-code.
- Update triage docs to match the launcher (train_eagle.sh, eagle3_quick_check.yaml,
  vLLM native extractor) and drop stale speculators/offline_training references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>

Per review (h-guo18): agents can infer task structure, GPU/node sizing, the
full YAML template, and the dry-run step from existing launcher examples, which
are easier to test and maintain than prose in the skill. Drop those sections and
keep only what is not obvious from the examples: the dump-backend selection
heuristic and the per-model adjustment gotchas. Reduced 220 -> 49 lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>

2 participants