EAGLE3 new model support: pipeline configs, triage docs, and Ministral-3 fixes #1417
Open
yeyu-nvidia wants to merge 23 commits into main from yeyu/eagle3-launcher-examples-new-models
+4,829
−17
Commits
- 6016a0f: Add EAGLE3 offline launcher examples for 10 new models (yeyu-nvidia)
- a6eeff4: Add EAGLE3 automation triage chart (yeyu-nvidia)
- 7c1388a: Port sandbox fixes: HF dump script, triage chart with test results (yeyu-nvidia)
- 642da1f: feat(eagle3): add vLLM hidden-state dump script and fix triage chart (yeyu-nvidia)
- 4abca8b: fix(launcher): use afterany dependency for allow_to_fail pipelines (yeyu-nvidia)
- d0ad01b: fix(eagle3): fix code-quality CI failures in triage chart and vllm sc… (yeyu-nvidia)
- eb830bd: fix(eagle3): pin speculators<0.5.0; document issues 6+7 in triage chart (yeyu-nvidia)
- e1dd712: Fix torchvision import crash in vLLM container for dump_offline_data_… (yeyu-nvidia)
- 0b20534: Fix torch downgrade in dump_offline_data_vllm.sh breaking vllm._C (yeyu-nvidia)
- 2bedfa1: Fix compute_hidden_states_vllm.py for speculators 0.4.x API (yeyu-nvidia)
- 6ea8086: Remove transformers downgrade from dump_offline_data_vllm.sh (yeyu-nvidia)
- 86cc1c1: fix(eagle3): patch speculators/config.py for pydantic 2.13 compatibility (yeyu-nvidia)
- ccfb6ef: fix(eagle3): fix tokenizer compatibility with transformers 5.x (yeyu-nvidia)
- 8ccb100: fix(eagle3): patch speculators for vLLM API compat (pydantic 2.13, Re… (yeyu-nvidia)
- c1d8b8b: fix(eagle3): patch speculators vLLM scheduler to process all requests (yeyu-nvidia)
- 56bbdc6: fix(eagle3): support Ministral-3 (mistral3) VLM in offline training a… (yeyu-nvidia)
- 74c9c41: feat(eagle3): add pipeline configs, scripts, and triage docs for new … (yeyu-nvidia)
- 4e211f7: Add trust_remote_code to Ministral-3-8B EAGLE3 training config (yeyu-nvidia)
- eb66bdb: Merge main into yeyu/eagle3-launcher-examples-new-models (yeyu-nvidia)
- abe7cb1: Address review: use _LM_HEAD_PATHS/_EMBED_TOKENS_PATHS for Mistral su… (yeyu-nvidia)
- 1a7bb82: Address code review feedback (yeyu-nvidia)
- fdbc2e1: Move quick_fail_check YAMLs from examples/ to tools/launcher/examples/ (yeyu-nvidia)
- 8ea6b0a: Fix pre-commit auto-formatting (license headers, markdown blanks) (yeyu-nvidia)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,215 @@ | ||
| --- | ||
|
Contributor
Can we separate the SKILL.md files into another PR? |
||
| name: eagle3-new-model | ||
| description: > | ||
| Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml | ||
| launcher config for a new model checkpoint, choosing the right hidden state dump | ||
| backend (TRT-LLM / HF / vLLM) and GPU configuration. | ||
| Use when user wants to run EAGLE3 on a model that does not yet have a YAML in | ||
| tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint. | ||
| --- | ||
|
|
||
| # EAGLE3 New Model Configuration | ||
|
|
||
| This skill guides you through creating `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml` | ||
| for a new model. | ||
|
|
||
| ## Step 1 — Look up the model architecture | ||
|
|
||
| Determine these values from the HuggingFace model card, `config.json`, and vLLM docs: | ||
|
|
||
| | Property | Where to find it | | ||
| |---|---| | ||
| | Total / active parameters | Model card | | ||
| | Dense or MoE? | `config.json` → `num_experts`, `num_experts_per_tok` | | ||
| | Attention type (MHA / GQA / MLA / SWA) | Model card | | ||
| | Multimodal? (vision encoder) | Model card | | ||
| | BF16 weight size (GB) | `total_params × 2 bytes` | | ||
| | Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) | | ||
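As a rough illustration of the lookup above, the following sketch derives a few of these properties from an already-parsed `config.json` dictionary. The function name `summarize_architecture` is hypothetical, and the `vision_config`, `text_config`, and `sliding_window` keys are heuristics that hold for many HuggingFace checkpoints but not all; key names vary between model families, so the model card remains the authority.

```python
def summarize_architecture(config: dict) -> dict:
    """Derive coarse architecture flags from a parsed config.json dict.

    Hypothetical helper: the key names checked here are common conventions,
    not a guaranteed schema.
    """
    # Multimodal checkpoints often nest the language model under `text_config`.
    text_cfg = config.get("text_config") or config
    num_experts = text_cfg.get("num_experts")
    return {
        "is_moe": bool(num_experts and num_experts > 1),
        "num_experts": num_experts,
        "num_experts_per_tok": text_cfg.get("num_experts_per_tok"),
        # Assumption: a vision encoder is declared via a `vision_config` block.
        "has_vision_encoder": "vision_config" in config,
        # Assumption: sliding window attention is declared via `sliding_window`.
        "uses_sliding_window": text_cfg.get("sliding_window") is not None,
    }
```

Load the dictionary with `json.load` on the checkpoint's `config.json`, and treat a negative result as "not detected" rather than proof of absence.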
|
|
||
| ## Step 2 — Calculate GPU requirements (OCI-HSG / GB200) | ||
|
|
||
| OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node** | ||
|
|
||
| ```text | ||
| BF16 weight size = total_params × 2 bytes | ||
| GPUs needed = ceil(weight_size_GB / 192) | ||
| nodes = ceil(gpus_needed / 4) | ||
| tp = min(gpus_needed, 4) | ||
| ``` | ||
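The sizing rule above can be sketched in a few lines of Python. This is a minimal illustration that applies the formula literally; `gpu_plan` and the constant names are hypothetical, and the estimate covers weights only, so it leaves no headroom for KV cache or activations.

```python
import math

GPU_MEMORY_GB = 192  # per-GPU memory quoted for OCI-HSG nodes
GPUS_PER_NODE = 4


def gpu_plan(total_params: float) -> dict:
    """Apply the BF16 sizing formula to a parameter count."""
    weight_gb = total_params * 2 / 1e9  # 2 bytes per BF16 parameter
    gpus = max(1, math.ceil(weight_gb / GPU_MEMORY_GB))
    nodes = math.ceil(gpus / GPUS_PER_NODE)
    tp = min(gpus, GPUS_PER_NODE)
    return {"weight_gb": weight_gb, "gpus": gpus, "nodes": nodes, "tp": tp}


if __name__ == "__main__":
    for params in (8e9, 70e9, 400e9):
        print(f"{params:.0e} params -> {gpu_plan(params)}")
```

Treat the result as a lower bound: a model whose weights nearly fill a GPU will usually need more devices in practice.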
|
|
||
| | Model | Weights (BF16) | GPUs | nodes | tp | | ||
| |---|---|---|---|---| | ||
| | 8B dense | ~16 GB | 1 | 1 | 1 | | ||
| | 70B dense | ~140 GB | 1 | 1 | 1 | | ||
| | 685B MoE | ~1,370 GB | 8 | 2 | 4 | | ||
| | 1T MoE | ~2,000 GB | 11 | 3 | 4 | | ||
|
|
||
| ## Step 3 — Choose the hidden state dump backend | ||
|
|
||
| | Backend | Script | When to use | | ||
| |---------|--------|-------------| | ||
| | vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators | | ||
| | HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention | | ||
| | TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) | | ||
|
|
||
| Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM does not support these). | ||
| Use **vLLM** for everything else as the default. | ||
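The selection rule above could be encoded as a small helper. This is a sketch under stated assumptions: `choose_dump_backend` and its flags are hypothetical, and the `prefer_trtllm` switch is an illustrative opt-in because the text names vLLM as the default for everything that is not a VLM or SWA model.

```python
DUMP_SCRIPTS = {
    "vllm": "common/eagle3/dump_offline_data_vllm.sh",
    "hf": "common/eagle3/dump_offline_data_hf.sh",
    "trtllm": "common/eagle3/dump_offline_data.sh",
}


def choose_dump_backend(
    is_vlm: bool = False,
    uses_sliding_window: bool = False,
    prefer_trtllm: bool = False,
) -> str:
    """Return the backend key for the task_1 hidden state dump."""
    # VLMs and sliding window attention go through the HF backend.
    if is_vlm or uses_sliding_window:
        return "hf"
    # Hypothetical opt-in for pure-text models with TRT-LLM support.
    if prefer_trtllm:
        return "trtllm"
    return "vllm"
```

For example, `DUMP_SCRIPTS[choose_dump_backend(is_vlm=True)]` resolves to the HF dump script.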
|
|
||
| ## Step 4 — Write the YAML | ||
|
|
||
| Create `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`. | ||
| Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`). | ||
|
|
||
| ### Header comment | ||
|
|
||
| ```yaml | ||
| # EAGLE3 offline speculative decoding pipeline for <org>/<model>. | ||
| # | ||
| # <Model> is a <size> <dense|MoE> model. <brief notes: attention type, special reqs> | ||
| # BF16 weights ~<size> GB — fits on <N> GB200 node(s) (<N> × 192 GB). | ||
| # | ||
| # <Special requirements (if any)> | ||
| # | ||
| # 4-step pipeline: | ||
| # task_0: Data synthesis — query vLLM server to generate prompt samples | ||
| # task_1: Dump hidden states — run target model to capture hidden states | ||
| # task_2: Offline training — train the EAGLE3 draft head | ||
| # task_3: Benchmark — evaluate speculative decoding speedup via vLLM | ||
| # | ||
| # Usage: | ||
| # uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes | ||
| # uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes | ||
|
|
||
| job_name: <Model>_EAGLE3_offline | ||
| pipeline: | ||
|   allow_to_fail: false | ||
|   skip: false | ||
|   note: | ||
|
|
||
|   global_vars: | ||
|     hf_model: /hf-local/<org>/<model> | ||
| ``` | ||
| ### task_0 — Data synthesis (`common/vllm/query.sh`) | ||
|
|
||
| Args before `--` go to the vLLM server; args after `--` go to `query.py`. | ||
|
|
||
| ```yaml | ||
| task_0: | ||
|   script: common/vllm/query.sh | ||
|   args: | ||
|     - --model <<global_vars.hf_model>> | ||
|     - --tensor-parallel-size <TP> | ||
|     - --trust-remote-code # add only if required | ||
|     - -- # separator | ||
|     - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default | ||
|     - --save /scratchspace/data | ||
|   environment: | ||
|     - HF_LOCAL: /hf-local | ||
|   slurm_config: | ||
|     _factory_: "slurm_factory" | ||
|     nodes: <nodes> | ||
|     ntasks_per_node: 1 | ||
|     gpus_per_node: 4 | ||
|     container: vllm/vllm-openai:latest | ||
| ``` | ||
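The `--` separator convention used in task_0 can be illustrated with a short sketch. How `query.sh` actually parses its arguments is not shown in this diff, so `split_launcher_args` is a hypothetical stand-in that only demonstrates the intended split between server arguments and `query.py` arguments.

```python
def split_launcher_args(args: list[str]) -> tuple[list[str], list[str]]:
    """Split YAML-style arg entries into (server_args, query_args) at `--`."""
    # Each YAML entry may hold a flag and its value in one string.
    tokens = [token for entry in args for token in entry.split()]
    if "--" not in tokens:
        return tokens, []
    idx = tokens.index("--")
    return tokens[:idx], tokens[idx + 1:]


if __name__ == "__main__":
    server, query = split_launcher_args(
        ["--model /hf-local/org/model", "--tensor-parallel-size 4", "--", "--save /scratchspace/data"]
    )
    print("server:", server)
    print("query:", query)
```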
|
|
||
| ### task_1 — Hidden states (vLLM backend, default) | ||
|
|
||
| ```yaml | ||
| task_1: | ||
|   script: common/eagle3/dump_offline_data_vllm.sh | ||
|   args: | ||
|     - --input-data /scratchspace/data | ||
|     - --output-dir /scratchspace/offline_hidden_states | ||
|     - --max-seq-len 8192 | ||
|   environment: | ||
|     - HF_MODEL_CKPT: <<global_vars.hf_model>> | ||
|   slurm_config: | ||
|     _factory_: "slurm_factory" | ||
|     nodes: <nodes> | ||
|     ntasks_per_node: 1 | ||
|     gpus_per_node: 4 | ||
|     container: vllm/vllm-openai:latest | ||
| ``` | ||
|
|
||
| For **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed. | ||
|
|
||
| For **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp <TP>` and `--moe-ep 1` (or appropriate EP). | ||
|
|
||
| ### task_2 — Offline training (`common/eagle3/train_eagle.sh`) | ||
|
|
||
| ```yaml | ||
| task_2: | ||
|   script: common/eagle3/train_eagle.sh | ||
|   args: | ||
|     - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml | ||
|     - model.model_name_or_path=<<global_vars.hf_model>> | ||
|     - data.offline_data_path=/scratchspace/offline_hidden_states | ||
|     - training.output_dir=/scratchspace/eagle3 | ||
|     - training.training_seq_len=4096 | ||
|     - training.disable_tqdm=true | ||
|     - training.ar_validate_steps=500000 | ||
|   slurm_config: | ||
|     _factory_: "slurm_factory" | ||
|     nodes: 1 | ||
|     ntasks_per_node: 1 | ||
|     gpus_per_node: 4 | ||
|     container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10 | ||
| ``` | ||
|
|
||
| > **MoE note:** For MoE models with large per-expert hidden dims, consider increasing | ||
| > `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`. | ||
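One way to read the MoE note is sketched below. The helper `widen_eagle_intermediate_size` is hypothetical and works on plain dictionaries loaded from `eagle_config.json` and the target model's `config.json`; whether matching `moe_intermediate_size` is the right choice for a given model is a judgment call that this sketch does not settle.

```python
def widen_eagle_intermediate_size(eagle_config: dict, model_config: dict) -> dict:
    """Return a copy of eagle_config with intermediate_size raised if needed."""
    updated = dict(eagle_config)
    moe_size = model_config.get("moe_intermediate_size")
    # Only ever increase the draft head width; never shrink an existing value.
    if moe_size is not None and moe_size > updated.get("intermediate_size", 0):
        updated["intermediate_size"] = moe_size
    return updated
```

The function returns a new dictionary, so the caller decides whether to write the result back with `json.dump`.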
|
|
||
| ### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`) | ||
|
|
||
| ```yaml | ||
| task_3: | ||
|   script: common/specdec_bench/quick_check.sh | ||
|   args: | ||
|     - --draft_model_dir /scratchspace/export | ||
|     - --draft_length 3 | ||
|     - --output_length 4096 | ||
|     - --engine VLLM | ||
|     - --tp_size <TP> | ||
|     - --ep_size 1 | ||
|     - --speculative_algorithm EAGLE3 | ||
|     - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl | ||
|     - --concurrency 1 | ||
|   environment: | ||
|     - HF_LOCAL: /hf-local | ||
|     - HF_MODEL_CKPT: <<global_vars.hf_model>> | ||
|   slurm_config: | ||
|     _factory_: "slurm_factory" | ||
|     nodes: <nodes> | ||
|     ntasks_per_node: 1 | ||
|     gpus_per_node: 4 | ||
|     container: vllm/vllm-openai:latest | ||
| ``` | ||
|
|
||
| ## Step 5 — Common model-specific adjustments | ||
|
|
||
| | Situation | What to change | | ||
| |---|---| | ||
| | Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) | | ||
| | VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 | | ||
| | Sliding window attention | Use `dump_offline_data_hf.sh` or `dump_offline_data_vllm.sh` for task_1 | | ||
| | MoE with large expert hidden dim | Increase `intermediate_size` in eagle_config.json | | ||
| | Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML | | ||
| | Custom tokenizer (e.g., tiktoken) | Set `TIKTOKEN_RS_CACHE_DIR` env var in task_0 and task_1 | | ||
| | NVFP4 quant model | task_0/task_3 use quant container; task_1/task_2 use BF16 base model — add `hf_model_bf16` global_var | | ||
| | Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args | | ||
|
|
||
| ## Step 6 — Test with dry run | ||
|
|
||
| Preview the resolved config before submitting: | ||
|
|
||
| ```bash | ||
| uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v | ||
| ``` | ||
|
|
||
| ## Step 7 — Update triage chart | ||
|
|
||
| After adding a new model, add a row to the test matrix in | ||
| `tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested). | ||
| Fill in results after running. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| --- | ||
| name: eagle3-review-logs | ||
| description: > | ||
| Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory. | ||
| Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes | ||
| and fixes, and flags warnings. Use when the user asks to review job logs, | ||
| check experiment results, or diagnose why a specific task failed. | ||
| user_invocable: true | ||
| --- | ||
|
|
||
| # Review EAGLE3 Experiment Logs | ||
|
|
||
| Analyze output logs from an EAGLE3 pipeline run launched via `launch.py` or `slurm.py`. | ||
|
|
||
| ## Step 0 — Find experiment logs | ||
|
|
||
| Locate the experiment directory. The default is `experiments/` relative to the launcher root, | ||
| or wherever `--job-dir` was pointed. | ||
|
|
||
| ```bash | ||
| ls -td experiments/cicd/cicd_* | head -10 | ||
| ``` | ||
|
|
||
| Each experiment has one subdirectory per task (0–3). Logs are `sbatch_*.out` files inside: | ||
|
|
||
| ```bash | ||
| find experiments/<exp_id>/ -name "sbatch_*.out" | sort | ||
| ``` | ||
|
|
||
| Do this in a single Bash call. If no experiments exist, ask the user for the directory. | ||
|
|
||
| ## Step 1 — Read all task logs | ||
|
|
||
| Read the last 200 lines of each log in parallel. Errors appear at the end: | ||
|
|
||
| ```bash | ||
| for f in $(find experiments/<exp_id>/ -name "sbatch_*.out" | sort); do | ||
|   echo "=== $f ==="; tail -n 200 "$f"; echo | ||
| done | ||
| ``` | ||
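The same log discovery and tailing can be done from Python when a shell loop is inconvenient. This is a minimal sketch; `find_task_logs` and `tail_lines` are hypothetical helper names, and the only layout assumed is the `sbatch_*.out` naming described above.

```python
from collections import deque
from pathlib import Path


def find_task_logs(exp_dir: str) -> list[Path]:
    """Return all sbatch_*.out files under an experiment directory, sorted."""
    return sorted(Path(exp_dir).rglob("sbatch_*.out"))


def tail_lines(path: Path, n: int = 200) -> list[str]:
    """Return the last n lines of a log without loading the whole file."""
    # errors="replace" keeps going if the log contains undecodable bytes.
    with open(path, errors="replace") as fh:
        return list(deque(fh, maxlen=n))
```

A bounded `deque` keeps memory use constant, which matters for multi-gigabyte training logs.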
|
|
||
| ## Step 2 — Analyze | ||
|
|
||
| For each task log, check: | ||
|
|
||
| - **Exit / cancellation**: `DUE TO TIME LIMIT`, `FAILED`, signal (e.g., `signal 15`) | ||
| - **Python exceptions / tracebacks**: last exception is usually the root cause | ||
| - **CUDA errors**: OOM, NCCL timeout | ||
| - **Slurm state**: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY | ||
| - **Success indicators**: "Saved N samples", "Successfully processed N conversations", training loss line, AR output | ||
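The checklist above can be approximated with a first-pass classifier. This is a sketch, not the skill's actual logic: `classify_log` is hypothetical, the regular expressions are assumptions about how these messages appear in the logs, and known benign patterns should be filtered out before it runs.

```python
import re

# Checked in order, so a time limit wins over a generic failure marker.
FAILURE_PATTERNS = [
    ("TIMEOUT", re.compile(r"DUE TO TIME LIMIT")),
    ("OUT_OF_MEMORY", re.compile(r"CUDA out of memory|OUT_OF_MEMORY")),
    ("FAIL", re.compile(r"Traceback \(most recent call last\)|\bFAILED\b|signal 15")),
]
SUCCESS_PATTERN = re.compile(
    r"Saved \d+ samples|Successfully processed \d+ conversations"
)


def classify_log(text: str) -> str:
    """Return TIMEOUT, OUT_OF_MEMORY, FAIL, PASS, or UNKNOWN for a log tail."""
    for status, pattern in FAILURE_PATTERNS:
        if pattern.search(text):
            return status
    if SUCCESS_PATTERN.search(text):
        return "PASS"
    return "UNKNOWN"
```

An `UNKNOWN` result means no marker was found, so the log still needs to be read by hand.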
|
|
||
| ## Step 3 — Produce report | ||
|
|
||
| Output a structured markdown report: | ||
|
|
||
| ### Summary | ||
|
|
||
| - Overall status: PASSED / FAILED / MIXED / PARTIAL | ||
| - Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped | ||
|
|
||
| ### Task Results | ||
|
|
||
| For each task (0–3): | ||
|
|
||
| **Task N — \<name\>: PASS / FAIL / TIMEOUT** | ||
| - Key output: (e.g., "3277/3295 samples generated" or "Script not found") | ||
| - Error (if failed): quoted error message, max 10 lines | ||
| - Root cause: one-line diagnosis | ||
| - Suggested fix: actionable step | ||
|
|
||
| ### Warnings | ||
|
|
||
| Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput). | ||
|
|
||
| ## Step 4 — Suggest next steps | ||
|
|
||
| Based on results: | ||
|
|
||
| - If a task failed due to a known issue, suggest the fix and how to re-run from that task: | ||
|
|
||
| ```bash | ||
| uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml \ | ||
|   pipeline.task_0.skip=true \ | ||
|   --yes | ||
| ``` | ||
|
|
||
| - If the failure pattern is new (not in `tools/launcher/examples/EAGLE3_TRIAGE.md`), | ||
| suggest adding it to the triage chart using `/eagle3-triage` guidance. | ||
|
|
||
| - If all tasks passed, suggest running `/eagle3-validate` to confirm AR meets threshold. | ||
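The re-run suggestion above can be generated programmatically. The sketch below assumes that every earlier task can be skipped with the same `pipeline.task_N.skip=true` override shown for task_0, and that the outputs of skipped tasks are still present on disk; `rerun_command` is a hypothetical helper.

```python
def rerun_command(yaml_path: str, first_task_to_run: int) -> list[str]:
    """Build a launch.py command that skips every task before the given one."""
    cmd = ["uv", "run", "launch.py", "--yaml", yaml_path]
    # Assumption: one skip override per already-completed task.
    cmd += [f"pipeline.task_{i}.skip=true" for i in range(first_task_to_run)]
    cmd.append("--yes")
    return cmd


if __name__ == "__main__":
    print(" ".join(rerun_command("examples/<Org>/<Model>/hf_offline_eagle3.yaml", 2)))
```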
|
|
||
| ## Known benign patterns (do NOT mark as failures) | ||
|
|
||
| | Pattern | Explanation | | ||
| |---|---| | ||
| | vLLM server exit code 143 | SIGTERM — server was killed after queries completed. Expected. | | ||
| | `CANCELLED AT ... DUE TO TASK FAILURE` after `exit code: 0` | Slurm cleanup of worker nodes after main task succeeded. | | ||
| | `destroy_process_group() was not called` | Benign PyTorch shutdown warning. | | ||
| | `tokenizer class ... not equal to the registered tokenizer class` | Harmless tokenizer mismatch warning. | |
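A filter for the benign patterns above might look like the following sketch. The regular expressions are assumptions about the exact log wording, and `drop_benign` is a hypothetical helper; the Slurm cancellation line is only dropped once an `exit code: 0` line has been seen, mirroring the condition in the table.

```python
import re

ALWAYS_BENIGN = [
    re.compile(r"exit code:? 143"),
    re.compile(r"destroy_process_group\(\) was not called"),
    re.compile(r"tokenizer class .* not equal to the registered tokenizer class"),
]
CANCEL_AFTER_SUCCESS = re.compile(r"CANCELLED AT .* DUE TO TASK FAILURE")


def drop_benign(lines: list[str]) -> list[str]:
    """Remove log lines that match known benign patterns."""
    kept = []
    saw_clean_exit = False
    for line in lines:
        if "exit code: 0" in line:
            saw_clean_exit = True
        if any(p.search(line) for p in ALWAYS_BENIGN):
            continue
        # Only benign when the main task already exited cleanly.
        if saw_clean_exit and CANCEL_AFTER_SUCCESS.search(line):
            continue
        kept.append(line)
    return kept
```

Running this before any failure classification avoids flagging an expected SIGTERM or shutdown warning as a task failure.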
Why does SKILL.md exist both here and in #1429?