Stormcast refactor with domain parallel + DiT support, testing, Pydantic configs #1438
Merged: pzharrington merged 11 commits into NVIDIA:main on Feb 26, 2026.
Conversation

A collaborator triggered CI with the comment `/blossom-ci`.
nbren12 pushed a commit to nbren12/modulus that referenced this pull request on Mar 24, 2026:

…tic configs (NVIDIA#1438)

* Stormcast recipe refactor for domain parallelism
* Revert conditioning embedder setting
* Update README.md
* Rename ParallelManager to ParallelHelper
* Clarify sharding usage
* Fix 'scala' typo
* Improve documentation of sharded_dataloader and sharded_data_iter
* Centralize determination of sharded parameters
* Fix test failures
* Lint, readme polish, rename config

Co-authored-by: Peter Harrington <pharrington@nvidia.com>
Co-authored-by: Peter Harrington <48932392+pzharrington@users.noreply.github.com>
PhysicsNeMo Pull Request
Refactors the StormCast training code as a follow-up to #1379.
Major additions:
- Domain parallelism support, enabled via `training.domain_parallel_size`. The domain parallelism is handled by `ParallelManager` in `utils/parallel.py`. FSDP is now used for parallelism regardless of domain parallel size.
- `DiT` support for diffusion models, with a standard configuration with NATTEN loaded by the setup function in `nn.py`.
- Tests, runnable with `torchrun --standalone --nproc-per-node=<number_of_gpus> --no-python pytest test_training.py`.

More minor changes:
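For intuition, the split between data and domain parallelism follows the usual mesh arithmetic: the flat world of ranks is divided into data-parallel replicas of `domain_parallel_size` GPUs each. The sketch below is illustrative only; `split_world` and `mesh_coords` are hypothetical names, not the actual `utils/parallel.py` API.

```python
# Hypothetical sketch of the mesh-splitting arithmetic a helper like
# ParallelManager typically performs. Not the real utils/parallel.py API.

def split_world(world_size: int, domain_parallel_size: int) -> tuple[int, int]:
    """Return (data_parallel_size, domain_parallel_size) for a flat world."""
    if world_size % domain_parallel_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"domain_parallel_size={domain_parallel_size}"
        )
    return world_size // domain_parallel_size, domain_parallel_size

def mesh_coords(rank: int, domain_parallel_size: int) -> tuple[int, int]:
    """Map a flat rank to (data_parallel_rank, domain_parallel_rank)."""
    return rank // domain_parallel_size, rank % domain_parallel_size

# 8 GPUs with domain_parallel_size=2 -> 4 data-parallel replicas of 2 GPUs each
print(split_world(8, 2))   # (4, 2)
print(mesh_coords(5, 2))   # (2, 1): rank 5 is replica 2, domain shard 1
```

In PyTorch this grid would typically be realized as a 2-D `DeviceMesh`, with FSDP sharding along the data-parallel axis and ShardTensor along the domain axis.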
- `utils/logging.py` handles both TensorBoard and WandB logging, depending on which are selected.
- `utils/apex.py` enables using Apex LayerNorm with ShardTensor.

TODO / pending:
- `utils/loss.py` and `utils/sampler.py` contain versions of `EDMLoss` and `deterministic_sampler` modified to work with ShardTensor. TBD if we want to upstream the support to the PhysicsNeMo library.
- An initial updated README.md is added; README.md is still to be updated.
- `inference.py` is still to be properly checked for domain parallel support.

Description
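For context on the `EDMLoss` variant listed under TODO / pending: the EDM formulation (Karras et al., 2022) weights the denoising loss per noise level. A minimal sketch of that standard weight follows; the default `sigma_data = 0.5` is the common choice, and how the ShardTensor-aware version applies it is an assumption here, not taken from this PR.

```python
# Hedged sketch of the standard EDM loss weight; the ShardTensor-specific
# handling in utils/loss.py is not reproduced here.

def edm_loss_weight(sigma: float, sigma_data: float = 0.5) -> float:
    """lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2,
    which simplifies to 1/sigma_data^2 + 1/sigma^2."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2

print(edm_loss_weight(0.5))  # 8.0
print(edm_loss_weight(1.0))  # 5.0
```

Note the weight decreases monotonically as `sigma` grows, approaching `1 / sigma_data**2`, so low-noise samples are weighted most heavily.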
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.