23 commits
6016a0f
Add EAGLE3 offline launcher examples for 10 new models
yeyu-nvidia May 7, 2026
a6eeff4
Add EAGLE3 automation triage chart
yeyu-nvidia May 8, 2026
7c1388a
Port sandbox fixes: HF dump script, triage chart with test results
yeyu-nvidia May 8, 2026
642da1f
feat(eagle3): add vLLM hidden-state dump script and fix triage chart
yeyu-nvidia May 11, 2026
4abca8b
fix(launcher): use afterany dependency for allow_to_fail pipelines
yeyu-nvidia Apr 13, 2026
d0ad01b
fix(eagle3): fix code-quality CI failures in triage chart and vllm sc…
yeyu-nvidia May 11, 2026
eb830bd
fix(eagle3): pin speculators<0.5.0; document issues 6+7 in triage chart
yeyu-nvidia May 12, 2026
e1dd712
Fix torchvision import crash in vLLM container for dump_offline_data_…
yeyu-nvidia May 13, 2026
0b20534
Fix torch downgrade in dump_offline_data_vllm.sh breaking vllm._C
yeyu-nvidia May 14, 2026
2bedfa1
Fix compute_hidden_states_vllm.py for speculators 0.4.x API
yeyu-nvidia May 15, 2026
6ea8086
Remove transformers downgrade from dump_offline_data_vllm.sh
yeyu-nvidia May 19, 2026
86cc1c1
fix(eagle3): patch speculators/config.py for pydantic 2.13 compatibility
yeyu-nvidia May 19, 2026
ccfb6ef
fix(eagle3): fix tokenizer compatibility with transformers 5.x
yeyu-nvidia May 19, 2026
8ccb100
fix(eagle3): patch speculators for vLLM API compat (pydantic 2.13, Re…
yeyu-nvidia May 19, 2026
c1d8b8b
fix(eagle3): patch speculators vLLM scheduler to process all requests
yeyu-nvidia May 21, 2026
56bbdc6
fix(eagle3): support Ministral-3 (mistral3) VLM in offline training a…
yeyu-nvidia May 27, 2026
74c9c41
feat(eagle3): add pipeline configs, scripts, and triage docs for new …
yeyu-nvidia May 27, 2026
4e211f7
Add trust_remote_code to Ministral-3-8B EAGLE3 training config
yeyu-nvidia May 27, 2026
eb66bdb
Merge main into yeyu/eagle3-launcher-examples-new-models
yeyu-nvidia Jun 2, 2026
abe7cb1
Address review: use _LM_HEAD_PATHS/_EMBED_TOKENS_PATHS for Mistral su…
yeyu-nvidia Jun 2, 2026
1a7bb82
Address code review feedback
yeyu-nvidia Jun 2, 2026
fdbc2e1
Move quick_fail_check YAMLs from examples/ to tools/launcher/examples/
yeyu-nvidia Jun 2, 2026
8ea6b0a
Fix pre-commit auto-formatting (license headers, markdown blanks)
yeyu-nvidia Jun 2, 2026
215 changes: 215 additions & 0 deletions .claude/skills/eagle3-new-model/SKILL.md
Collaborator

Why does SKILL.md exist both here and in #1429?

@@ -0,0 +1,215 @@
---
Contributor

Can we split the SKILL.md files out into a separate PR?

name: eagle3-new-model
description: >
Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml
launcher config for a new model checkpoint, choosing the right hidden state dump
backend (TRT-LLM / HF / vLLM) and GPU configuration.
Use when user wants to run EAGLE3 on a model that does not yet have a YAML in
tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint.
---

# EAGLE3 New Model Configuration

This skill guides you through creating `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`
for a new model.

## Step 1 — Look up the model architecture

Determine these values from the HuggingFace model card, `config.json`, and vLLM docs:

| Property | Where to find it |
|---|---|
| Total / active parameters | Model card |
| Dense or MoE? | `config.json` (`num_experts`, `num_experts_per_tok`) |
| Attention type (MHA / GQA / MLA / SWA) | Model card |
| Multimodal? (vision encoder) | Model card |
| BF16 weight size (GB) | `total_params × 2 bytes` |
| Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) |
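
The MoE check in the table can be sketched in Python. This is a minimal illustration, not part of the launcher: `summarize_architecture` is a hypothetical helper, and the alternative key names `num_local_experts` and `n_routed_experts` are assumptions about how other model families spell the expert count. VLM configs often nest these fields under a sub-config, which this sketch does not handle.

```python
def summarize_architecture(config: dict) -> dict:
    """Summarize MoE-related fields from a parsed config.json dictionary."""
    # `num_experts` comes from the table above; the other two key names are
    # assumed alternatives used by some model families.
    expert_keys = ("num_experts", "num_local_experts", "n_routed_experts")
    num_experts = next((config[k] for k in expert_keys if config.get(k)), None)
    return {
        "is_moe": num_experts is not None and num_experts > 1,
        "num_experts": num_experts,
        "experts_per_token": config.get("num_experts_per_tok"),
        # A non-null sliding_window hints at SWA; confirm on the model card.
        "sliding_window": config.get("sliding_window"),
    }
```

Load the dictionary with `json.load` on the checkpoint's `config.json` and pass it in; treat the result as a hint to confirm against the model card.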

## Step 2 — Calculate GPU requirements (OCI-HSG / GB200)

OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node**

```text
BF16 weight size = total_params × 2 bytes
GPUs needed = ceil(weight_size_GB / 192)
nodes = ceil(gpus_needed / 4)
tp = min(gpus_needed, 4)
```
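
The same arithmetic as a small Python sketch. `plan_gpus` is a hypothetical helper, not part of the launcher, and it counts weights only; KV cache and activations need additional headroom, so treat the result as a lower bound.

```python
import math

GPU_MEMORY_GB = 192  # per-GPU memory on an OCI-HSG GB200 node (from the text above)
GPUS_PER_NODE = 4


def plan_gpus(total_params: float) -> dict:
    """Apply the Step 2 formulas to a parameter count (e.g. 70e9 for 70B)."""
    weight_gb = total_params * 2 / 1e9  # BF16: 2 bytes per parameter
    gpus = max(1, math.ceil(weight_gb / GPU_MEMORY_GB))
    nodes = math.ceil(gpus / GPUS_PER_NODE)
    tp = min(gpus, GPUS_PER_NODE)
    return {"weight_gb": weight_gb, "gpus": gpus, "nodes": nodes, "tp": tp}
```

For example, `plan_gpus(70e9)` gives the plan for a 70B dense model.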

| Model | Weights (BF16) | GPUs | nodes | tp |
|---|---|---|---|---|
| 8B dense | ~16 GB | 1 | 1 | 1 |
| 70B dense | ~140 GB | 1 | 1 | 1 |
| 685B MoE | ~1370 GB | 8 | 2 | 4 |
| 1T MoE | ~2000 GB | 11 | 3 | 4 |

## Step 3 — Choose the hidden state dump backend

| Backend | Script | When to use |
|---------|--------|-------------|
| vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators |
| HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, sliding window attention (SWA) |
| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) |

Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM does not support these).
Use **vLLM** as the default for everything else.
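
The selection rule above can be written as a short decision function. This is a sketch under stated assumptions: `choose_dump_backend` and its flags are hypothetical names, and `prefer_trtllm` stands in for an explicit opt-in, since the text makes vLLM the default and does not say when TRT-LLM should be preferred.

```python
SCRIPTS = {
    "vllm": "common/eagle3/dump_offline_data_vllm.sh",
    "hf": "common/eagle3/dump_offline_data_hf.sh",
    "trtllm": "common/eagle3/dump_offline_data.sh",
}


def choose_dump_backend(is_vlm: bool = False, uses_swa: bool = False,
                        needs_custom_code: bool = False,
                        prefer_trtllm: bool = False) -> str:
    """Return the backend key for task_1 following the Step 3 table."""
    # HF covers VLMs, custom-code models, and sliding window attention.
    if is_vlm or uses_swa or needs_custom_code:
        return "hf"
    # TRT-LLM only for pure-text models, and only on explicit request (assumption).
    if prefer_trtllm:
        return "trtllm"
    return "vllm"
```

`SCRIPTS[choose_dump_backend(is_vlm=True)]` then yields the script path to put in task_1.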

## Step 4 — Write the YAML

Create `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`.
Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`).

### Header comment

```yaml
# EAGLE3 offline speculative decoding pipeline for <org>/<model>.
#
# <Model> is a <size> <dense|MoE> model. <brief notes: attention type, special reqs>
# BF16 weights ~<size> GB — fits on <N> GB200 node(s) (<N> × 192 GB).
#
# <Special requirements (if any)>
#
# 4-step pipeline:
# task_0: Data synthesis — query vLLM server to generate prompt samples
# task_1: Dump hidden states — run target model to capture hidden states
# task_2: Offline training — train the EAGLE3 draft head
# task_3: Benchmark — evaluate speculative decoding speedup via vLLM
#
# Usage:
# uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes
# uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes

job_name: <Model>_EAGLE3_offline
pipeline:
  allow_to_fail: false
  skip: false
  note:

  global_vars:
    hf_model: /hf-local/<org>/<model>
```

### task_0 — Data synthesis (`common/vllm/query.sh`)

Args before `--` go to the vLLM server; args after `--` go to `query.py`.

```yaml
  task_0:
    script: common/vllm/query.sh
    args:
      - --model <<global_vars.hf_model>>
      - --tensor-parallel-size <TP>
      - --trust-remote-code  # add only if required
      - --  # separator
      - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default
      - --save /scratchspace/data
    environment:
      - HF_LOCAL: /hf-local
    slurm_config:
      _factory_: "slurm_factory"
      nodes: <nodes>
      ntasks_per_node: 1
      gpus_per_node: 4
      container: vllm/vllm-openai:latest
```
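
To make the `--` convention concrete, here is a minimal Python sketch of the split. `split_query_args` is a hypothetical illustration of the rule, not the code `query.sh` actually runs.

```python
def split_query_args(args: list[str]) -> tuple[list[str], list[str]]:
    """Split task_0 args at the first bare `--` separator.

    Returns (vllm_server_args, query_py_args). Without a separator,
    everything is treated as a server argument (an assumption).
    """
    if "--" not in args:
        return list(args), []
    sep = args.index("--")
    return args[:sep], args[sep + 1:]


server_args, query_args = split_query_args([
    "--model /hf-local/org/model",
    "--tensor-parallel-size 4",
    "--",
    "--data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default",
    "--save /scratchspace/data",
])
```

A flag placed on the wrong side of the separator reaches the wrong program, which is why `--trust-remote-code` goes before `--`.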

### task_1 — Hidden states (vLLM backend, default)

```yaml
  task_1:
    script: common/eagle3/dump_offline_data_vllm.sh
    args:
      - --input-data /scratchspace/data
      - --output-dir /scratchspace/offline_hidden_states
      - --max-seq-len 8192
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
    slurm_config:
      _factory_: "slurm_factory"
      nodes: <nodes>
      ntasks_per_node: 1
      gpus_per_node: 4
      container: vllm/vllm-openai:latest
```

For **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed.

For **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp <TP>` and `--moe-ep 1` (or appropriate EP).

### task_2 — Offline training (`common/eagle3/train_eagle.sh`)

```yaml
  task_2:
    script: common/eagle3/train_eagle.sh
    args:
      - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
      - model.model_name_or_path=<<global_vars.hf_model>>
      - data.offline_data_path=/scratchspace/offline_hidden_states
      - training.output_dir=/scratchspace/eagle3
      - training.training_seq_len=4096
      - training.disable_tqdm=true
      - training.ar_validate_steps=500000
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 1
      gpus_per_node: 4
      container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

> **MoE note:** For MoE models with large per-expert hidden dims, consider increasing
> `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`.
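
One way to apply the MoE note, sketched in Python. `widen_eagle_intermediate_size` is a hypothetical helper that works on already-parsed dictionaries; it only ever increases the value, and whether a larger draft MLP actually helps is something to validate per model.

```python
def widen_eagle_intermediate_size(eagle_config: dict, model_config: dict) -> dict:
    """Return a copy of eagle_config with intermediate_size raised to the
    target model's moe_intermediate_size when that value is larger."""
    updated = dict(eagle_config)
    moe_size = model_config.get("moe_intermediate_size")
    if moe_size is not None and moe_size > updated.get("intermediate_size", 0):
        updated["intermediate_size"] = moe_size
    return updated
```

Dense models, which have no `moe_intermediate_size`, pass through unchanged.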

### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`)

```yaml
  task_3:
    script: common/specdec_bench/quick_check.sh
    args:
      - --draft_model_dir /scratchspace/export
      - --draft_length 3
      - --output_length 4096
      - --engine VLLM
      - --tp_size <TP>
      - --ep_size 1
      - --speculative_algorithm EAGLE3
      - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
      - --concurrency 1
    environment:
      - HF_LOCAL: /hf-local
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
    slurm_config:
      _factory_: "slurm_factory"
      nodes: <nodes>
      ntasks_per_node: 1
      gpus_per_node: 4
      container: vllm/vllm-openai:latest
```

## Step 5 — Common model-specific adjustments

| Situation | What to change |
|---|---|
| Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) |
| VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 |
| Sliding window attention | Use `dump_offline_data_hf.sh` or `dump_offline_data_vllm.sh` for task_1 |
| MoE with large expert hidden dim | Increase `intermediate_size` in `eagle_config.json` |
| Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML |
| Custom tokenizer (e.g., tiktoken) | Set `TIKTOKEN_RS_CACHE_DIR` env var in task_0 and task_1 |
| NVFP4 quant model | task_0/task_3 use quant container; task_1/task_2 use BF16 base model — add `hf_model_bf16` global_var |
| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args |

## Step 6 — Test with dry run

Preview the resolved config before submitting:

```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v
```

## Step 7 — Update triage chart

After adding a new model, add a row to the test matrix in
`tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested).
Fill in results after running.
99 changes: 99 additions & 0 deletions .claude/skills/eagle3-review-logs/SKILL.md
@@ -0,0 +1,99 @@
---
name: eagle3-review-logs
description: >
Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory.
Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes
and fixes, and flags warnings. Use when the user asks to review job logs,
check experiment results, or diagnose why a specific task failed.
user_invocable: true
---

# Review EAGLE3 Experiment Logs

Analyze output logs from an EAGLE3 pipeline run launched via `launch.py` or `slurm.py`.

## Step 0 — Find experiment logs

Locate the experiment directory. The default is `experiments/` relative to the launcher root,
or wherever `--job-dir` was pointed.

```bash
ls -td experiments/cicd/cicd_* | head -10
```

Each experiment has one subdirectory per task (0–3). Logs are `sbatch_*.out` files inside:

```bash
find experiments/<exp_id>/ -name "sbatch_*.out" | sort
```

Do this in a single Bash call. If no experiments exist, ask the user for the directory.
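
The same lookup can be sketched in Python when the logs need to be grouped per task. `collect_task_logs` is a hypothetical helper; it assumes the first path component under the experiment directory identifies the task, which may not match every launcher layout.

```python
from collections import defaultdict
from pathlib import Path


def collect_task_logs(exp_dir) -> dict:
    """Group sbatch_*.out files under exp_dir by their top-level subdirectory."""
    logs = defaultdict(list)
    for path in sorted(Path(exp_dir).rglob("sbatch_*.out")):
        # Assumption: the first component below exp_dir names the task.
        task = path.relative_to(exp_dir).parts[0]
        logs[task].append(path)
    return dict(logs)
```

An empty result means no logs were found, which is the cue to ask the user for the directory.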

## Step 1 — Read all task logs

Read the last 200 lines of each log in parallel. Errors usually appear at the end:

```bash
for f in $(find experiments/<exp_id>/ -name "sbatch_*.out" | sort); do
  echo "=== $f ==="; tail -200 "$f"; echo
done
```

## Step 2 — Analyze

For each task log, check:

- **Exit / cancellation**: `DUE TO TIME LIMIT`, `FAILED`, signal (e.g., `signal 15`)
- **Python exceptions / tracebacks**: last exception is usually the root cause
- **CUDA errors**: OOM, NCCL timeout
- **Slurm state**: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY
- **Success indicators**: "Saved N samples", "Successfully processed N conversations", training loss line, AR output
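
A first-pass version of this checklist can be sketched with regular expressions. `classify_log` is a hypothetical helper, the patterns are assumptions about how these messages are worded in practice, and the result is a starting point for reading the log rather than a verdict.

```python
import re

# Checked in order; the first matching failure pattern wins.
FAILURE_PATTERNS = [
    ("TIMEOUT", re.compile(r"DUE TO TIME LIMIT")),
    ("OOM", re.compile(r"CUDA out of memory|OUT_OF_MEMORY")),
    ("FAIL", re.compile(r"Traceback \(most recent call last\)|\bFAILED\b")),
]
SUCCESS_PATTERN = re.compile(
    r"Saved \d+ samples|Successfully processed \d+ conversations"
)


def classify_log(text: str) -> str:
    """Return TIMEOUT, OOM, FAIL, PASS, or UNKNOWN for one task log."""
    for status, pattern in FAILURE_PATTERNS:
        if pattern.search(text):
            return status
    # Only report PASS when a success indicator is present.
    return "PASS" if SUCCESS_PATTERN.search(text) else "UNKNOWN"
```

`UNKNOWN` covers logs with neither a failure nor a success marker, such as training and benchmark logs, whose indicators (loss lines, AR output) are not matched here.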

## Step 3 — Produce report

Output a structured markdown report:

### Summary

- Overall status: PASSED / FAILED / MIXED / PARTIAL
- Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped

### Task Results

For each task (0–3):

**Task N — \<name\>: PASS / FAIL / TIMEOUT**
- Key output: (e.g., "3277/3295 samples generated" or "Script not found")
- Error (if failed): quoted error message, max 10 lines
- Root cause: one-line diagnosis
- Suggested fix: actionable step

### Warnings

Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput).
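
The report layout described in this step can be sketched as a small renderer. `TaskResult`, `overall_status`, and `render_report` are hypothetical names; the mapping to PASSED / MIXED / FAILED is an assumption (PARTIAL is not modeled), and the task heading format only approximates the one above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    index: int
    name: str
    status: str  # "PASS", "FAIL", or "TIMEOUT"
    key_output: str = ""
    error: str = ""
    root_cause: str = ""
    suggested_fix: str = ""


def overall_status(results) -> str:
    statuses = {r.status for r in results}
    if statuses == {"PASS"}:
        return "PASSED"
    return "MIXED" if "PASS" in statuses else "FAILED"


def render_report(results) -> str:
    lines = ["### Summary", "",
             f"- Overall status: {overall_status(results)}", "",
             "### Task Results", ""]
    for r in results:
        lines.append(f"**Task {r.index} ({r.name}): {r.status}**")
        if r.key_output:
            lines.append(f"- Key output: {r.key_output}")
        if r.status != "PASS":
            lines.append("- Error:")
            # Quote at most 10 lines of the error message.
            lines.extend(f"  > {e}" for e in r.error.splitlines()[:10])
            lines.append(f"- Root cause: {r.root_cause}")
            lines.append(f"- Suggested fix: {r.suggested_fix}")
        lines.append("")
    return "\n".join(lines)
```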

## Step 4 — Suggest next steps

Based on results:

- If a task failed due to a known issue, suggest the fix and how to re-run from that task:

```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml \
    pipeline.task_0.skip=true \
    --yes
```

- If the failure pattern is new (not in `tools/launcher/examples/EAGLE3_TRIAGE.md`),
suggest adding it to the triage chart using `/eagle3-triage` guidance.

- If all tasks passed, suggest running `/eagle3-validate` to confirm AR meets threshold.
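
When resuming from a later task, the skip overrides can be generated instead of typed by hand. This is a minimal sketch: `skip_overrides` is a hypothetical helper, and it assumes every task accepts the same `pipeline.task_N.skip=true` override shown for task_0 and that outputs of the skipped tasks are still present under `/scratchspace`.

```python
def skip_overrides(resume_from: int, num_tasks: int = 4) -> list[str]:
    """Build overrides that skip every task before `resume_from`."""
    if not 0 <= resume_from < num_tasks:
        raise ValueError(f"resume_from must be in [0, {num_tasks - 1}]")
    return [f"pipeline.task_{i}.skip=true" for i in range(resume_from)]
```

For example, `skip_overrides(2)` produces the overrides for re-running from task_2; append them to the `launch.py` command before `--yes`.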

## Known benign patterns (do NOT mark as failures)

| Pattern | Explanation |
|---|---|
| vLLM server exit code 143 | SIGTERM — server was killed after queries completed. Expected. |
| `CANCELLED AT ... DUE TO TASK FAILURE` after `exit code: 0` | Slurm cleanup of worker nodes after main task succeeded. |
| `destroy_process_group() was not called` | Benign PyTorch shutdown warning. |
| `tokenizer class ... not equal to the registered tokenizer class` | Harmless tokenizer mismatch warning. |
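
The table above can be turned into a filter that drops benign lines before a log is analyzed. This is a sketch under stated assumptions: `drop_benign_lines` is a hypothetical helper, and the regular expressions guess at the exact log wording, so adjust them against real logs.

```python
import re

ALWAYS_BENIGN = [
    re.compile(r"exit code:? 143"),
    re.compile(r"destroy_process_group\(\) was not called"),
    re.compile(r"tokenizer class .* not equal to the registered tokenizer class"),
]
CANCELLED = re.compile(r"CANCELLED AT .* DUE TO TASK FAILURE")
CLEAN_EXIT = re.compile(r"exit code: 0\b")


def drop_benign_lines(log_text: str) -> list[str]:
    """Return the log lines that are not known-benign patterns."""
    kept = []
    saw_clean_exit = False
    for line in log_text.splitlines():
        if CLEAN_EXIT.search(line):
            saw_clean_exit = True
        if any(p.search(line) for p in ALWAYS_BENIGN):
            continue
        # The cancellation message is benign only after a clean exit.
        if saw_clean_exit and CANCELLED.search(line):
            continue
        kept.append(line)
    return kept
```

A cancellation message with no preceding `exit code: 0` is kept, since in that case it may indicate a real failure.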