Speed up slow unit/gpu/example tests#1616
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Optimize the slowest CPU unit tests (no coverage loss): - speculative: disable torch.compile (no-op on CPU) + module-scope fixture + smaller eagle arch (43s+25s -> <0.1s each) - calibrator: reduce histogram num_bins 2048->512/256 (~25s -> ~3s) - onnx conv/convtranspose: shrink structure-only model shapes (~21s -> ~1s) - ruler dataset: use local tiny tokenizer instead of gpt2 (hermetic + faster) - onnx dynamo export: one dtype + representative variant subset (~70s -> ~7s) - dist/autoquant: fork (not spawn) for gloo on CPU runners + world_size 4->2 Share a tiny_tokenizer fixture in tests/unit/torch/conftest.py. Gate slow autonas/fastnas tests behind @pytest.mark.manual, keeping one representative case per test enabled for sanity (preferring fastnas over autonas); core NAS behavior stays covered by lighter sibling tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR reduces unit-test workload and enforces hermetic offline execution by replacing remote model/dataset dependencies with local fixtures, shrinking model dimensions and calibration samples across tests, curating NAS/Dynamo test matrices, adding worker pool caching to reduce distributed setup overhead, and updating CI container versions and job timeouts. ChangesTest Infrastructure and Execution Updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/unit/torch/speculative/plugins/test_hf_speculative_lora.py (1)
52-56: 💤 Low valueModule-scoped fixture is appropriate for this optimization.
The scope change from function to module reduces setup overhead. Note that
test_forward_returns_losscallsmodel.train()on the shared fixture, which could create test order dependencies. However, the subsequent tests don't appear to assume eval mode, so this should be fine in practice.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/speculative/plugins/test_hf_speculative_lora.py` around lines 52 - 56, You changed the lora_eagle_model fixture scope to module, which is fine for performance but test_forward_returns_loss calls model.train() and may leave the shared model in train mode for other tests; update test_forward_returns_loss (or any test that calls model.train()) to restore the model state after running (e.g., call model.eval() or use a context that resets training mode) so the shared fixture is not mutated across tests, referencing the lora_eagle_model fixture and the test_forward_returns_loss test to locate the change.tests/unit/conftest.py (1)
24-27: ⚡ Quick winAdd a comment explaining the conditional import.
The import of
huggingface_hub.constantsis guarded to handle it being an optional dependency, but a brief comment naming the reason would improve clarity.📝 Suggested comment
+# Guard optional dependency: huggingface_hub may not be installed with contextlib.suppress(ImportError): import huggingface_hub.constants as _hf_constantsAs per coding guidelines, "Add a brief comment in those cases naming the reason" when placing imports inside functions or blocks rather than at module top.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/conftest.py` around lines 24 - 27, Add a short explanatory comment above the guarded import inside the contextlib.suppress block explaining why huggingface_hub.constants is imported conditionally (it is an optional dependency in tests) and that setting _hf_constants.HF_HUB_OFFLINE = True forces offline mode for tests; reference the guarded import (import huggingface_hub.constants as _hf_constants), the suppressing construct (contextlib.suppress(ImportError)), and the flag (_hf_constants.HF_HUB_OFFLINE) so reviewers understand the intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/example_tests.yml:
- Line 62: The non-PR job's TensorRT-LLM image tag is out of sync: update the
docker_image value used by the trtllm-non-pr job to match the trtllm-pr job's
"nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc17" so both jobs use the same image
version; locate the trtllm-non-pr job's docker_image entry and replace the
"1.3.0rc16" tag with "1.3.0rc17".
---
Nitpick comments:
In `@tests/unit/conftest.py`:
- Around line 24-27: Add a short explanatory comment above the guarded import
inside the contextlib.suppress block explaining why huggingface_hub.constants is
imported conditionally (it is an optional dependency in tests) and that setting
_hf_constants.HF_HUB_OFFLINE = True forces offline mode for tests; reference the
guarded import (import huggingface_hub.constants as _hf_constants), the
suppressing construct (contextlib.suppress(ImportError)), and the flag
(_hf_constants.HF_HUB_OFFLINE) so reviewers understand the intent.
In `@tests/unit/torch/speculative/plugins/test_hf_speculative_lora.py`:
- Around line 52-56: You changed the lora_eagle_model fixture scope to module,
which is fine for performance but test_forward_returns_loss calls model.train()
and may leave the shared model in train mode for other tests; update
test_forward_returns_loss (or any test that calls model.train()) to restore the
model state after running (e.g., call model.eval() or use a context that resets
training mode) so the shared fixture is not mutated across tests, referencing
the lora_eagle_model fixture and the test_forward_returns_loss test to locate
the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f9689516-fd34-4372-aa9e-089960550c5f
📒 Files selected for processing (27)
.github/workflows/example_tests.yml.github/workflows/gpu_tests.ymlCONTRIBUTING.mdtests/_test_utils/onnx/lib_test_models.pytests/_test_utils/torch/distributed/utils.pytests/_test_utils/torch/quantization/tensor_quantizer_common.pytests/_test_utils/torch/tokenizer/chat_template.jinjatests/_test_utils/torch/tokenizer/tokenizer_config.jsontests/examples/gpt_oss/test_gpt_oss_qat.pytests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.pytests/gpu/torch/nas/test_distributed_model.pytests/gpu/torch/nas/test_search_space_with_vision_models.pytests/gpu/torch/utils/test_dataset_utils.pytests/gpu_megatron/conftest.pytests/unit/conftest.pytests/unit/onnx/quantization/test_convtranspose_qdq.pytests/unit/onnx/test_gqa_graph_surgery.pytests/unit/torch/conftest.pytests/unit/torch/deploy/utils/test_torch_onnx_utils.pytests/unit/torch/nas/test_full_algorithms_vision.pytests/unit/torch/quantization/test_autoquant.pytests/unit/torch/quantization/test_calibrator.pytests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_calibration.pytests/unit/torch/speculative/plugins/test_hf_dflash.pytests/unit/torch/speculative/plugins/test_hf_speculative.pytests/unit/torch/speculative/plugins/test_hf_speculative_lora.pytests/unit/torch/utils/test_dataset_utils.py
💤 Files with no reviewable changes (1)
- tests/_test_utils/torch/tokenizer/tokenizer_config.json
Reduce per-test work (no tests removed/disabled; example-script args default to current behavior and only tests pass the fast values): - torch_onnx: add --trt_builder_optimization_level (test uses 0) for fast trtexec builds - diffusers: fewer calib/inference steps; add diffusion_trt --num-inference-steps; smaller cache_diffusion steps - speculative_decoding: cap eagle training to training.sample_size=64 - megatron_bridge: fewer train_iters / calib samples / shorter seq_length - llm_sparsity: data_prep --max_samples; small finetune data slice Reuse modelopt.torch.utils.dataset_utils in hf_pts.py: drop the bespoke get_calib_dataloader and call get_dataset_dataloader directly, so calibration supports any registered/HF/JSONL dataset (via a new --dataset arg) and includes attention_mask instead of input_ids only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
fork in the long-lived pytest process inherits locks held by background threads (OpenMP / torch intra-op pools), deadlocking the child (e.g. in dist.init_process_group) and hanging the job. Revert spawn_multiprocess_job to spawn; the world_size reduction remains the speedup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
/claude review |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/llm_sparsity/weight_sparsity/data_prep.py (1)
58-64: 💤 Low valueTruncation logic is correct. The
min(args.max_samples, len(ds))bound prevents over-selection, and rebuilding viatype(dataset)({...})preserves theDatasetDicttype. Reconstructing the container is a bit indirect; an in-place update reads more directly, but this is optional.♻️ Optional readability tweak
- if args.max_samples is not None: - dataset = type(dataset)( - { - split: ds.select(range(min(args.max_samples, len(ds)))) - for split, ds in dataset.items() - } - ) + if args.max_samples is not None: + for split, ds in dataset.items(): + dataset[split] = ds.select(range(min(args.max_samples, len(ds))))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_sparsity/weight_sparsity/data_prep.py` around lines 58 - 64, The truncation is fine but reconstructing the DatasetDict is indirect; instead modify the existing dataset in-place: when args.max_samples is not None iterate the DatasetDict entries and replace each split with ds.select(range(min(args.max_samples, len(ds)))) so you update the existing dataset object rather than calling type(dataset)(...). Locate the block that checks args.max_samples and change it to assign dataset[split] = ds.select(...) for each split (keeping the same min(args.max_samples, len(ds)) bound).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/diffusers/quantization/diffusion_trt.py`:
- Around line 189-194: The CLI argument --num-inference-steps is not
range-checked; add validation at parse time by enforcing it is a positive
integer: either supply a custom argparse type/validator for
"num-inference-steps" or, immediately after parse_args(), check
args.num_inference_steps > 0 and call parser.error(...) (or raise a clear
ValueError) with a descriptive message; update references where
"num-inference-steps" is declared in parser.add_argument and any subsequent uses
to rely on this validated value.
In `@examples/llm_sparsity/weight_sparsity/hf_pts.py`:
- Around line 107-114: Replace the hardcoded max_sample_length=512 when
constructing the calibration dataloader so it uses args.model_max_length (i.e.,
call get_dataset_dataloader with max_sample_length=args.model_max_length), and
enable packing by passing pack=True to get_dataset_dataloader for calibration;
also ensure padding_side is not forcing right-padding for calibration (use
tokenizer.padding_side or explicitly set padding_side="left") to avoid degrading
calibration accuracy. Reference: the calib_dataloader construction that calls
get_dataset_dataloader and the args.model_max_length/tokenizer padding settings.
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 509-515: Validate the documented 0-5 range for the CLI option by
enforcing it at the argparse boundary: update the parser.add_argument call for
"--trt_builder_optimization_level" to restrict allowed values (e.g., use
choices=range(0,6) or a custom type/validator) so only integers 0 through 5 are
accepted, or add an explicit post-parse check that raises a clear argparse-style
error if args.trt_builder_optimization_level is outside 0-5; target the
parser.add_argument invocation for "--trt_builder_optimization_level" to
implement this validation.
---
Nitpick comments:
In `@examples/llm_sparsity/weight_sparsity/data_prep.py`:
- Around line 58-64: The truncation is fine but reconstructing the DatasetDict
is indirect; instead modify the existing dataset in-place: when args.max_samples
is not None iterate the DatasetDict entries and replace each split with
ds.select(range(min(args.max_samples, len(ds)))) so you update the existing
dataset object rather than calling type(dataset)(...). Locate the block that
checks args.max_samples and change it to assign dataset[split] = ds.select(...)
for each split (keeping the same min(args.max_samples, len(ds)) bound).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f35294c0-7823-4dab-8a27-16720df79e40
📒 Files selected for processing (14)
.github/workflows/example_tests.ymlexamples/diffusers/quantization/diffusion_trt.pyexamples/llm_sparsity/weight_sparsity/data_prep.pyexamples/llm_sparsity/weight_sparsity/hf_pts.pyexamples/torch_onnx/torch_quant_to_onnx.pytests/examples/diffusers/test_cache_diffusion.pytests/examples/diffusers/test_diffusers.pytests/examples/diffusers/test_export_diffusers_hf_ckpt.pytests/examples/llm_sparsity/weight_sparsity/test_llama_sparsify.pytests/examples/megatron_bridge/test_distill.pytests/examples/megatron_bridge/test_prune_minitron.pytests/examples/megatron_bridge/test_quantize_export.pytests/examples/speculative_decoding/test_eagle.pytests/examples/torch_onnx/test_torch_quant_to_onnx.py
✅ Files skipped from review due to trivial changes (1)
- tests/examples/diffusers/test_export_diffusers_hf_ckpt.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/example_tests.yml
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1616 +/- ##
==========================================
- Coverage 74.58% 67.57% -7.02%
==========================================
Files 482 482
Lines 52943 52943
==========================================
- Hits 39489 35777 -3712
- Misses 13454 17166 +3712
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Add per-test timeouts via pytest-timeout, keyed by test dir (unit 60s, gpu/example/regression 300s), call-phase only; per-test overrides where a one-time CUDA JIT compile is expected. - Pre-compile the conv3d implicit-GEMM CUDA kernel in a dedicated _extensions test so the build cost doesn't land on the first kernel test. - Relocate tests to the lane that owns their dependency: rename the gpt-oss example test dir to match the example, move the vLLM sparse-attention plugin test to gpu_vllm (drop its importorskip), and guard the diffusers import in the partial-install unit lane. - Fix the eagle example override (data.sample_size, not training.sample_size). - Consolidate diffusers test model-path constants into a single SDXL_PATH and use tiny pipelines for the plumbing tests (incl. cache_diffusion); normalize local_id basenames. - llm_sparsity hf_pts.py: reuse get_dataset_dataloader, honor --model_max_length, and left-pad for calibration. - Skip the upstream-broken nemotron-sft-agentic-v2 gated smoke dataset. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/_test_utils/examples/models.py (1)
15-24: ⚡ Quick winMake exported model-path constants explicit with
__all__.This helper is a shared constants surface; listing exports explicitly prevents accidental API drift when constants are renamed or added.
♻️ Proposed change
import os +__all__ = [ + "BART_PATH", + "T5_PATH", + "MIXTRAL_PATH", + "WHISPER_PATH", + "TINY_LLAMA_PATH", + "QWEN_VL_PATH", + "SDXL_PATH", + "SD3_PATH", + "FLUX_SCHNELL_PATH", + "PIXART_PATH", +] + _LOCAL_ROOT = os.getenv("MODELOPT_LOCAL_MODEL_ROOT")As per coding guidelines: "Define the public API with
__all__at the top of each module and re-export viafrom .module import *in package__init__.pyfiles to keep the public API explicit and safe".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/_test_utils/examples/models.py` around lines 15 - 24, Add an explicit __all__ to declare the module's public API so exports don't drift; for this file, add a top-level __all__ list that includes the model-path symbols you intend to export (e.g., "_LOCAL_ROOT" and "_select_path") and ensure any package __init__.py re-exports use from .models import *; update the module to include that __all__ and keep the existing names (_LOCAL_ROOT, _select_path) unchanged.tests/gpu/torch/utils/test_dataset_utils.py (1)
63-65: ⚡ Quick winUse
xfail(strict=True)instead of a permanentskipfor the known upstream issue.A hard skip can mask upstream recovery indefinitely.
xfail(strict=True)keeps the failure expected but surfaces an unexpected pass when the upstream split is fixed.Suggested change
- pytest.param(k, marks=pytest.mark.skip(reason="upstream search-split schema cast failure")) + pytest.param( + k, + marks=pytest.mark.xfail( + reason="upstream search-split schema cast failure", + strict=True, + ), + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu/torch/utils/test_dataset_utils.py` around lines 63 - 65, The test case for the known upstream schema cast failure is being permanently skipped, which can hide when the issue is fixed. Update the `pytest.param` entry in `test_dataset_utils.py` for `nemotron-sft-agentic-v2` to use `pytest.mark.xfail(strict=True)` instead of `pytest.mark.skip`, keeping the failure expected while still surfacing an unexpected pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/_test_utils/examples/models.py`:
- Around line 15-24: Add an explicit __all__ to declare the module's public API
so exports don't drift; for this file, add a top-level __all__ list that
includes the model-path symbols you intend to export (e.g., "_LOCAL_ROOT" and
"_select_path") and ensure any package __init__.py re-exports use from .models
import *; update the module to include that __all__ and keep the existing names
(_LOCAL_ROOT, _select_path) unchanged.
In `@tests/gpu/torch/utils/test_dataset_utils.py`:
- Around line 63-65: The test case for the known upstream schema cast failure is
being permanently skipped, which can hide when the issue is fixed. Update the
`pytest.param` entry in `test_dataset_utils.py` for `nemotron-sft-agentic-v2` to
use `pytest.mark.xfail(strict=True)` instead of `pytest.mark.skip`, keeping the
failure expected while still surfacing an unexpected pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ab4c67a8-95c5-4a9a-93ad-f0f7420106b3
📒 Files selected for processing (22)
.github/workflows/example_tests.yml.github/workflows/gpu_tests.yml.github/workflows/unit_tests.ymlexamples/llm_sparsity/weight_sparsity/hf_pts.pyexamples/torch_onnx/torch_quant_to_onnx.pypyproject.tomltests/_test_utils/examples/models.pytests/conftest.pytests/examples/diffusers/test_cache_diffusion.pytests/examples/diffusers/test_diffusers.pytests/examples/diffusers/test_export_diffusers_hf_ckpt.pytests/examples/gpt-oss/test_gpt_oss_qat.pytests/examples/llm_ptq/_extensions/test_torch_extensions.pytests/examples/speculative_decoding/test_eagle.pytests/examples/vlm_ptq/_extensions/test_torch_extensions.pytests/gpu/_extensions/test_torch_extensions.pytests/gpu/_extensions/test_torch_kernels.pytests/gpu/onnx/quantization/autotune/test_workflow.pytests/gpu/torch/utils/test_dataset_utils.pytests/gpu_vllm/torch/sparsity/attention_sparsity/test_vllm_plugin.pytests/regression/torch/speculative/test_dflash.pytests/unit/torch/quantization/plugins/test_diffusers_wan_conv3d.py
💤 Files with no reviewable changes (2)
- tests/examples/gpt-oss/test_gpt_oss_qat.py
- tests/gpu_vllm/torch/sparsity/attention_sparsity/test_vllm_plugin.py
✅ Files skipped from review due to trivial changes (1)
- tests/gpu/_extensions/test_torch_extensions.py
|
/claude review |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Large but cohesive test-infra PR (52 files, +632/-374) with many small, individually-reasonable changes. Design review: this PR doesn't introduce a new abstraction — it leans on already-present tooling (pytest-timeout, mp.get_context, the existing DistributedWorkerPool, the existing tests/_test_utils/torch/tokenizer/ tiny tokenizer, existing create_tiny_*_dir helpers) rather than building a new framework. No design red flags.
A few semantic-shift items a human owner should confirm rather than auto-approve:
examples/llm_sparsity/weight_sparsity/hf_pts.pyswitchespadding_sidefrom"right"to"left"for calibration. The comment explains why (get_dataset_dataloaderwarns otherwise), but this is a behavior change to an example CLI, not just a test edit. Worth confirming the resulting calibrated checkpoints are still equivalent for downstream users who script against this example.tests/gpu/torch/nas/test_distributed_model.pydrops the FSDP1test_fsdpparametrize entirely (only FSDP2 remains). The PR body says "redundant parametrize cases marked manual" but this case is deleted, not marked manual — confirm FSDP1 NAS coverage is intentionally retired.tests/unit/torch/utils/test_dataset_utils.py: the live HF integration tests (@pytest.mark.integrationagainsthf-internal-testing/dataset_with_data_filesand the gated Nemotron smoke) are replaced with toy local-dir round-trips. The new gpu-lane filetests/gpu/torch/utils/test_dataset_utils.pykeeps the Nemotron smoke but does not re-add a liveload_dataset(<hub-id>)round-trip — i.e. nothing in CI exercises the actual HF-download branch ofget_dataset_samplesanymore. Confirm that's intentional.- Unit-test wall-clock lane budget cut 30→15 min — aggressive given the matrix of OS/Torch/TF combos; one slow run could fail the lane. Worth confirming a few real CI runs landed comfortably under 15 min before merging.
- 60s default unit-test timeout is reasonable, but
pytest_collection_modifyitemsraisespytest.UsageErrorfor any unmapped top-leveltests/<group>/directory. That's good fail-loud behavior, but means anyone adding a newtests/<group>/will get a confusing collection error until they edittests/conftest.py. Documenting this in CONTRIBUTING.md alongside the new hermetic-unit-test note would help.
No correctness bugs spotted in the diff (the renamed model-path constants, the mp.get_context("spawn") switch, the _pool_cache worker dedup, the chat-template move into chat_template.jinja, and the symlinked _extensions/test_torch_extensions.py files all look right).
Follow-ups from PR review and CI triage: - Restore live HF dataset round-trips (TestHfTinyDataset) to the gpu lane (tests/gpu/torch/utils/test_dataset_utils.py). The hermetic-unit-test work dropped the ungated hf-internal-testing round-trips without relocating them, leaving the get_dataset_samples Hub-download branch unexercised in CI. - Move the tiny_tokenizer fixture up to the root tests/conftest.py so both unit and gpu lanes share it; drop the duplicate pad_tokenizer fixture and use tiny_tokenizer in test_gptq and test_dataset_utils. - get_tiny_tokenizer: default to left padding (pad_token already set) so callers no longer hand-roll it; document the override. - Example lanes: install editable (-e) so example scripts launched as subprocesses resolve modelopt to the same source path as the test process, letting the pre-compiled CUDA extension cache be reused instead of recompiling. - speculative_decoding: give test_offline_eagle3_training[kimi-k2.5] an explicit timeout(360) (~256s on main + margin) now that a default per-test example timeout exists. - Replace the tiny tokenizer's verbose ChatML+think chat template with a terse generation-tagged Q:/A: template. The ChatML special tokens byte-exploded on the 128-vocab tokenizer (98 vs 28 tokens), pushing compute_hidden_states test conversations past max-seq-len and breaking the offline-PTQ tests; the terse template keeps answer-only-loss generation-tag masking that unit tests need. - CONTRIBUTING: brief note on the per-test timeout defaults. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
/claude review |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Large but cohesive test-infra PR (54 files, +763/-425). Design check passes — leverages existing tooling (pytest-timeout, mp.get_context, DistributedWorkerPool, the tiny tokenizer at tests/_test_utils/torch/tokenizer/, create_tiny_*_dir), no new abstraction introduced. Hermetic-unit-test goal is well-executed: tests/unit/conftest.py forces offline mode, the tiny_tokenizer fixture is shared, and Hub-only tests are relocated to tests/gpu/torch/utils/test_dataset_utils.py (live _HF_TINY round-trips and gated Nemotron smoke preserved there). Renames (SXDL_PATH→SDXL_PATH, SDXL_1_0_PATH→SDXL_PATH, LLAVA_PATH removal) all line up with their callers. The pytest_collection_modifyitems fail-loud on unmapped tests/<group>/ dirs is a nice guard, and timeout_func_only=true correctly excludes fixture setup from the budget.
A few items worth a human glance before merging:
- 💬 Author addressed the
padding_side"right"→"left"flip inexamples/llm_sparsity/weight_sparsity/hf_pts.pywith a comment ("Left padding is recommended for calibration;get_dataset_dataloaderwarns otherwise") — still warrants sign-off because this is a user-facing behavior change to an example CLI, not just a test edit, and downstream users scripting against this script will get different calibrated checkpoints. - 💬 PR body explains FSDP1
test_fsdpintests/gpu/torch/nas/test_distributed_model.pyis deleted (not marked@manual) because "FastNAS/AutoNAS aren't used with either FSDP1 or FSDP2" — flagging anyway because this is a coverage drop rather than a speedup, worth a human confirming the FSDP1-NAS path is indeed unused in production. - Unit-test lane budget cut 30→15 min is aggressive given the OS/Torch/TF matrix; one slow run can fail the lane. Worth confirming a few real CI runs landed comfortably under 15 min before merging.
- Adding a new top-level
tests/<group>/will fail collection withpytest.UsageErroruntil_DEFAULT_TIMEOUTis updated. CONTRIBUTING.md was updated to mention this — good — but consider whether the error message should mention the file path more explicitly (it already does: "add '{group}' to _DEFAULT_TIMEOUT in tests/conftest.py"). tests/gpu/torch/utils/test_dataset_utils.pypermanentlyskipsnemotron-sft-agentic-v2for an upstream schema cast failure —xfail(strict=True)would surface upstream recovery automatically; minor.
No correctness bugs spotted in the diff itself. The mp.get_context("spawn") switch (dropping the global set_start_method mutation), the _pool_cache worker-pool dedup in gpu_megatron/conftest.py, the chat-template move into chat_template.jinja, and the symlinked _extensions/test_torch_extensions.py files all look correct.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/gpu/torch/utils/test_dataset_utils.py (1)
107-112: ⚡ Quick winMove
datasetsimport to module top (or justify the in-function import).
from datasets import load_datasetis placed inside_hf_dump_to_jsonlwithout a documented reason. Since these tests already depend ondatasetstransitively throughget_dataset_samples/get_dataset_dataloader, a deferred import doesn't guard anything here and only delays surfacing an import error from collection time to mid-test. Either hoist it to the top with the other imports, or add a brief comment naming the concrete reason (e.g., optional-dependency guard).♻️ Proposed change
import json +from datasets import load_dataset import pytest from huggingface_hub import get_tokendef _hf_dump_to_jsonl(name: str, split: str, path) -> str: - from datasets import load_dataset - ds = load_dataset(name, split=split) ds.to_json(str(path), lines=True) return str(path)As per coding guidelines: "Place all imports at module top in both source and test files... The only acceptable in-function imports are for circular imports or optional dependencies... and those should carry a brief comment naming the reason."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu/torch/utils/test_dataset_utils.py` around lines 107 - 112, The in-function import "from datasets import load_dataset" inside _hf_dump_to_jsonl delays import errors and violates the module-top import guideline; either move the import to the module top with the other imports (and update any import ordering) or, if this is truly an optional/circular dependency, keep it but add a brief comment above the import inside _hf_dump_to_jsonl explaining the specific reason (e.g., "optional dependency guard" or "avoid circular import with X"). Ensure the symbol load_dataset remains referenced by _hf_dump_to_jsonl after the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/_example_tests_runner.yml:
- Line 52: The step currently expands inputs.pip_install_extras directly into
the shell command, which can allow shell injection; change it to pass the input
into an environment variable (e.g., PIP_INSTALL_EXTRAS) and validate it against
a small allowlist or a strict regex before using it, then reference the
validated env var in the pip command (e.g., python -m pip install -e
".${PIP_INSTALL_EXTRAS}" or build the install path safely). Ensure validation in
the workflow uses explicit allowed tokens or rejects any string containing
unsafe characters like ; & | $() ` and only permits known extras (or an exact
pattern such as ^\[.*\]$ or comma-separated alphanumerics), so callers cannot
inject arbitrary shell content.
---
Nitpick comments:
In `@tests/gpu/torch/utils/test_dataset_utils.py`:
- Around line 107-112: The in-function import "from datasets import
load_dataset" inside _hf_dump_to_jsonl delays import errors and violates the
module-top import guideline; either move the import to the module top with the
other imports (and update any import ordering) or, if this is truly an
optional/circular dependency, keep it but add a brief comment above the import
inside _hf_dump_to_jsonl explaining the specific reason (e.g., "optional
dependency guard" or "avoid circular import with X"). Ensure the symbol
load_dataset remains referenced by _hf_dump_to_jsonl after the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c7f6c7f0-2648-4139-acc1-274baed74f65
📒 Files selected for processing (10)
.github/workflows/_example_tests_runner.ymlCONTRIBUTING.mdtests/_test_utils/torch/tokenizer/chat_template.jinjatests/_test_utils/torch/transformers_models.pytests/conftest.pytests/examples/speculative_decoding/test_eagle.pytests/gpu/_extensions/test_torch_kernels.pytests/gpu/torch/quantization/test_gptq.pytests/gpu/torch/utils/test_dataset_utils.pytests/unit/torch/utils/test_dataset_utils.py
✅ Files skipped from review due to trivial changes (2)
- tests/_test_utils/torch/tokenizer/chat_template.jinja
- CONTRIBUTING.md
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/examples/speculative_decoding/test_eagle.py
- tests/gpu/_extensions/test_torch_kernels.py
- tests/conftest.py
- tests/unit/torch/utils/test_dataset_utils.py
| (None, False), # tiny_llama (from fixture), no FakeBase | ||
| ("moonshotai/Kimi-K2.5", True), # remote HF repo, FakeBaseModel | ||
| pytest.param( | ||
| "moonshotai/Kimi-K2.5", True, # remote HF repo, FakeBaseModel |
There was a problem hiding this comment.
@h-guo18 can we use tiny Kimi local models here?
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
| model_max_length=model_max_length, | ||
| padding_side="right", | ||
| # Left padding is recommended for calibration (get_dataset_dataloader warns otherwise). | ||
| padding_side="left", |
| type=int, | ||
| default=4, | ||
| choices=range(6), | ||
| help="trtexec --builderOptimizationLevel (0-5). Lower is much faster to build " |
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
03435f9 to
50e26aa
Compare
What does this PR do?
Type of change: test infrastructure / test speedups + CI stabilization
Make the test suite faster,
tests/unithermetic, and the CI lanes stable, without losing coverage. Most changes are mechanical test/infra edits; the buckets below cover the diff broadly.Unit tests — hermetic (no HF Hub): toy local datasets/configs + the local tiny tokenizer (with a checked-in chat template) replace Hub assets;
tests/unit/conftest.pyenforces offline mode. Genuinely-HF tests moved totests/gpu*(e.g. the newtests/gpu/torch/utils/test_dataset_utils.py).CONTRIBUTING.mddocuments the hermetic-unit-test expectation.Unit-test speedups (no coverage loss): speculative (disable CPU torch.compile), calibrator (fewer histogram bins), ONNX conv/dynamo (smaller shapes + representative subset), Ruler/sparse-attention (local tokenizer), data-parallel autoquant (world size 4→2). Shared
tiny_tokenizerfixture. The distributed test helper now uses a privatespawncontext instead of mutating the global start method (avoids cross-test contamination).Rarely-used autonas/fastnas tests: heavy parametrize cases marked
@pytest.mark.manual, one representative kept per test (fastnas preferred); lighter sibling tests still cover core behavior. The legacy FSDP1 NAS distributed test is also dropped: FastNAS/AutoNAS aren't used with either FSDP1 or FSDP2, and FSDP1 is superseded by the newer FSDP2 API — so we keep a single FSDP2 case as a sanity check and drop FSDP1, leaving the suite leaner.gpu_megatron: deduplicate distributed worker pools by world_size within a module (saves a redundant pool spin-up in multi-pool files; module-scoped, no cross-module reuse).
Example tests: reduce per-test work via args that default to current behavior (tests pass the fast values) — torch_onnx TRT optimization level, diffusers calibration/inference steps, eagle
sample_size, megatron_bridge iters/calib, llm_sparsity data slice, export safetensors-structurecalib_size. Also enable the recently addedgpt-ossexample tests in CI.Per-test timeouts:
pytest-timeoutwith a default per-directory timeout (60s unit / 300s gpu+example) enforced intests/conftest.py(timeout_func_onlyinpyproject.toml), so a new test cannot silently exceed the budget — an unmapped test dir crashes collection. A few inherently slow tests carry explicit higher per-test overrides (CUDA-compile, autotune, dflash).CUDA kernel pre-compilation: a dedicated
tests/gpu/_extensionstest JIT-builds the conv3d implicit-GEMM kernel up front (collected before the functional tests in the same process) so the one-time build cost no longer lands on — and time out — the first functional test that uses it. Mirrored into thellm_ptq/vlm_ptqexample lanes.Test relocation & optional-dependency guards: vLLM sparsity plugin test moved to
tests/gpu_vllm(drops the in-testimportorskip); diffusers-dependent unit test guarded withimportorskip("diffusers")for partial-install lanes;gpt_ossexample test dir renamed togpt-ossto match the CI matrix.Diffusers test models: shared model-path constants in
tests/_test_utils/examples/models.pyconsolidated/renamed and point at tinyhf-internal-testingtest pipes (SDXL/SD3/FLUX) so cachify/quantize/export tests run on toy weights;local_ids normalized.Shared dataset utils:
examples/llm_sparsity/.../hf_pts.pynow usesget_dataset_dataloader(drops the bespoke cnn_dailymail-onlyget_calib_dataloader; supports any registered/HF/JSONL dataset, includes attention_mask);data_prep.pygains--max_samples.CI workflows: container image bumps (pytorch 26.04→26.05, TRT-LLM rc16→rc17) and tightened lane timeouts (unit 30→15 min, gpu lanes trimmed, onnx example lane 45 min).
Imports at top of file: in-function imports across the test suite are moved to module top per the coding guideline, conservatively — optional deps stay guarded (in-function or behind a module-level
importorskip) intests/unitsince the partial-install lane runs without them, and build/hardware-availability imports (apex, triton, megatron/transformer_engine, tensorrt_llm) plus_test_utilslazy guards are left in place.Kernel warning filters: the repeated
filterwarningsblanket-ignore in sixtests/gpu/torch/kernels/**modules is consolidated into a scoped hook intests/gpu/torch/kernels/conftest.py(kernel tests only — the rest of the suite keeps surfacing warnings).Eagle example speedups:
torch.compile(eagle recipe default) added ~2 min to every eagle training test; it's now disabled in the eagle example tests except one smoke (test_llama_eagle3[1-False]), and the downstream resume / AR-validate / export tests point at the compile-free checkpoint. Measured:test_ar_validate139s→17s, offline training 142s→22s, streaming 140s→23s — the compile path is still smoke-tested once.Example lanes install editable (
-e): so example scripts launched as subprocesses resolvemodeloptto the same source path as the test process and reuse the pre-compiled CUDA-extension cache instead of recompiling (~2 min/test); verified in the TRT-LLM container.Tiny test tokenizer:
get_tiny_tokenizerdefaults to left padding (what decoder-LM calibration expects) and ships a terse generation-tagged chat template — replacing a verbose ChatML one that inflated tokenized length on the 128-vocab tokenizer and broke the offline-PTQ example tests'max-seq-lenfilter.Restored Hub-download coverage: the live (ungated) HF dataset round-trips exercising
get_dataset_samples' download branch now live intests/gpu/torch/utils/test_dataset_utils.py(they had been dropped from the hermetic unit file without a counterpart).Individual file changes not explicitly called out above fall under this general test/CI cleanup.
Testing
Unit + the touched gpu_megatron files validated locally; example/GPU lanes validated in CI.
Before your PR is "Ready for review"
Summary by CodeRabbit
Release Notes
Chores
Tests
Documentation
CONTRIBUTING.mdto emphasize offline test design.Chores