Add Megatron-Bridge PTQ quantize + export example scripts#1589
Add Megatron-Bridge PTQ quantize + export example scripts#1589kevalmorabia97 wants to merge 6 commits into
Conversation
Add a two-step post-training quantization flow for Megatron-Bridge models under examples/megatron_bridge/: - quantize.py: load an HF model via Megatron-Bridge, apply ModelOpt PTQ (via a --quant_cfg alias / full config name, or a --recipe YAML), with optional KV-cache quant, weight-only, compression, and MoE expert-ratio calibration; save a Megatron checkpoint (with ModelOpt state). Tensor / pipeline / expert parallelism are all supported. - export.py: load the quantized Megatron checkpoint, re-shard to TP=1, and export a HuggingFace (unified) checkpoint deployable with TensorRT-LLM / vLLM / SGLang. The unified exporter does not gather tensor-parallel shards (matching Megatron-LM, which forces TP=1 during export), hence the split. Add tests covering quantize -> export -> vLLM deployment and quantize -> Megatron reload with ModelOpt quantizers restored. Follow-ups: part 2 extends distill.py for quantization-aware distillation (QAD); part 3 adds NVFP4 + QAD-on-pruned-checkpoint experiments to the Nemotron-3 tutorial. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR implements an end-to-end post-training quantization (PTQ) and export workflow for Megatron-Bridge. Two new CLI scripts handle quantizing HuggingFace models to Megatron checkpoints and exporting them to unified HuggingFace checkpoints, with comprehensive documentation and integration tests validating the full pipeline. ChangesMegatron-Bridge PTQ Workflow
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Make the output-path flags explicit about the checkpoint format, with a shared 'export_' prefix matching prune_minitron.py's output_* convention: - quantize.py: --export_megatron_path (Megatron checkpoint output) - export.py: --export_unified_hf_path (unified HuggingFace output) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
3ebeedc to
1361339
Compare
|
/claude review |
Drop the standalone test_quantize_megatron_checkpoint_reload test and its _load_megatron_quantized.py helper: the export+vLLM test already reloads the quantized Megatron checkpoint (via export.py's load_megatron_model, at the harder TP>1 -> TP=1 reshard) and asserts hf_quant_config.json, which is only written when the reloaded model is quantized. Fold the modelopt_state-present assertion into that test so quantize.py's resumable-checkpoint contract stays covered without a second GPU job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1589 +/- ##
==========================================
- Coverage 73.22% 70.47% -2.76%
==========================================
Files 478 478
Lines 52421 52421
==========================================
- Hits 38387 36942 -1445
- Misses 14034 15479 +1445
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
/claude review |
- Reword the pack=False comment: it described 'backward compatibility', but this is a new file. Explain that calibration uses unpacked sequences (pack=True changes per-sample calibration statistics). - Guard the calibration_mode toggle with try/finally so a failure mid- calibration doesn't leave the flag set for the later compress / save calls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
The export+vLLM test was OOM-killed on CI (exit 137). CI has more GPU memory than local (where it passes), so this is a host-RAM OOM: loading vLLM in-process stacked its large footprint on top of the test process's torch / Megatron / transformers imports. Move the vLLM load+generate into a standalone _vllm_generate.py run as a subprocess (footprint reclaimed on exit) with a minimal config (swap_space=0, max_num_seqs=1, enforce_eager, small KV cache). Also rename test_quantize.py -> test_quantize_export.py since it now covers export.py too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
/claude review changes in examples/megatron_bridge directory |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/examples/megatron_bridge/_vllm_generate.py`:
- Around line 43-44: The current smoke-test uses "text is not None" which can't
detect an empty string; update the assertion after llm.generate(...) to require
a non-empty truthy string from the vLLM completion: check outputs and then
assert that outputs[0].outputs[0].text is non-empty (e.g., truthy or .strip()
length > 0) so CompletionOutput.text from the llm.generate call and
SamplingParams invocation must be validated as a non-empty string rather than
merely not None.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c55dd76c-7b35-4b30-bfb6-97fb7137f39b
📒 Files selected for processing (3)
examples/megatron_bridge/quantize.pytests/examples/megatron_bridge/_vllm_generate.pytests/examples/megatron_bridge/test_quantize_export.py
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/megatron_bridge/quantize.py
| outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=4)) | ||
| assert outputs and outputs[0].outputs[0].text is not None, "vLLM produced no output" |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In vLLM, can CompletionOutput.text returned by LLM.generate be None, or is it always a string?
💡 Result:
In vLLM, the CompletionOutput.text field is defined as a string (str) and is not nullable [1][2][3]. The authoritative source code for the CompletionOutput dataclass explicitly types the text field as text: str, ensuring it is always a string [2][3]. Consequently, it cannot be None [1][2].
Citations:
- 1: https://www.mintlify.com/vllm-project/vllm/api/outputs
- 2: https://github.com/vllm-project/vllm/blob/7cc302dd/vllm/outputs.py
- 3: https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py
Fix vLLM smoke-test assertion to check for non-empty text
- In vLLM,
CompletionOutput.textis always astr(non-nullable), sotext is not Nonecan’t detect the “generated nothing” case (e.g., empty string). Make the test assert truthy/non-empty text instead.
🔧 Suggested stronger assertion
- outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=4))
- assert outputs and outputs[0].outputs[0].text is not None, "vLLM produced no output"
+ outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=4))
+ assert outputs and outputs[0].outputs and outputs[0].outputs[0].text, (
+ "vLLM produced no output"
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/examples/megatron_bridge/_vllm_generate.py` around lines 43 - 44, The
current smoke-test uses "text is not None" which can't detect an empty string;
update the assertion after llm.generate(...) to require a non-empty truthy
string from the vLLM completion: check outputs and then assert that
outputs[0].outputs[0].text is non-empty (e.g., truthy or .strip() length > 0) so
CompletionOutput.text from the llm.generate call and SamplingParams invocation
must be validated as a non-empty string rather than merely not None.
What does this PR do?
Type of change: new example
Adds a two-step post-training quantization (PTQ) flow for Megatron-Bridge models under
examples/megatron_bridge/, mirroring the Megatron-LMquantize.sh/export.shsplit:quantize.py— loads an HF model via Megatron-Bridge, applies ModelOpt PTQ (via a--quant_cfgalias / full config name, or a--recipeYAML), with optional KV-cache quant, weight-only, compression, and MoE expert-ratio calibration, then saves a Megatron checkpoint (with ModelOpt state). Tensor / pipeline / expert parallelism are all supported, and the checkpoint can later be reloaded for further training (QAT / distillation).export.py— loads the quantized Megatron checkpoint, re-shards to TP=1, and exports a HuggingFace (unified) checkpoint deployable with TensorRT-LLM / vLLM / SGLang.Why the split? The unified HF exporter (
export_mcore_gpt_to_hf) does not gather tensor-parallel-sharded weights — Megatron-LM likewise forcesTP=1during its export step. Saving a TP-sharded Megatron checkpoint first lets us calibrate at TP>1 (to fit large models) and then reload re-sharded to TP=1 for the HF export. A combined single-script flow silently produced corrupt HF checkpoints under TP>1 (collided per-rank shards), which this split avoids.Usage
Testing
tests/examples/megatron_bridge/test_quantize.py(validated on a 2-GPU NeMo26.04container):test_quantize_export_and_vllm_deployment— quantize a tiny Qwen3 via a recipe at TP=2 →export.pyre-shards to TP=1 → load + generate with vLLM (skipped if vLLM absent).test_quantize_megatron_checkpoint_reload— quantize at TP=2 → reload the Megatron checkpoint via the bridge and assert ModelOpt quantizers were restored.Before your PR is "Ready for review"
CONTRIBUTING.md: N/A (no new dependencies)/claude review)Additional Information
The Nemotron-3 tutorial update to use these scripts is intentionally not included here — it ships with the part 3 PR alongside the NVFP4 + QAD experiments.
Summary by CodeRabbit
Documentation
New Features
Tests