
Stormcast refactor with domain parallel + DiT support, testing, Pydantic configs#1438

Merged
pzharrington merged 11 commits into NVIDIA:main from jleinonen:stormcast-domain-parallel
Feb 26, 2026

Conversation


@jleinonen jleinonen commented Feb 23, 2026

PhysicsNeMo Pull Request

Refactors the StormCast training code as a follow-up to #1379.

Major additions:

  1. Support for domain parallel training, controlled by training.domain_parallel_size. The domain parallelism is handled by ParallelManager in utils/parallel.py. FSDP is now used for parallelism regardless of domain parallel size.
  2. Support for DiT diffusion models, with a standard NATTEN-based configuration loaded by the setup function in nn.py.
  3. Configuration parsing using Pydantic. This eliminates most of the manual validation and default value checking as those are now handled by the Pydantic validation. Hydra is still used for loading configs but afterwards they are passed into Pydantic.
  4. A test suite using pytest. Includes tests for both single- and multi-GPU setups. A mock dataset that produces random data is included for testing. Multi-GPU tests can be run as torchrun --standalone --nproc-per-node=<number_of_gpus> --no-python pytest test_training.py.
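To illustrate item 3, a minimal sketch of the Hydra-to-Pydantic handoff is shown below; the field names and defaults here are hypothetical stand-ins, not the recipe's actual schema, which lives in the StormCast example's config code:

```python
from pydantic import BaseModel, Field, ValidationError

class TrainingConfig(BaseModel):
    # Hypothetical fields for illustration only
    batch_size: int = Field(8, gt=0)
    domain_parallel_size: int = Field(1, ge=1)
    lr: float = 1e-4

# Hydra still loads the config file; the resulting dict is then handed to
# Pydantic, which applies defaults and type/range validation automatically.
raw = {"batch_size": 16}
cfg = TrainingConfig(**raw)
# cfg.domain_parallel_size == 1: default applied without manual checking

# Out-of-range values are rejected without any hand-written validation code
rejected = False
try:
    TrainingConfig(batch_size=-1)
except ValidationError:
    rejected = True
```

This is the pattern that replaces most of the manual validation: the schema declares types, bounds, and defaults in one place, and invalid configs fail fast with a descriptive error.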

More minor changes:

  1. Logging is now passed through utils/logging.py, which handles TensorBoard and/or WandB logging depending on which are enabled.
  2. utils/apex.py enables using Apex LayerNorm with ShardTensor.
  3. Split training-time plotting into multiple plots as this allows one to view them as an animation, which makes it easier to see if the model is learning temporal dynamics properly.
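A minimal sketch of the kind of logging fan-out described in item 1; the class and method names here are hypothetical, not the actual API of utils/logging.py:

```python
class MetricLogger:
    """Hypothetical sketch: a single entry point that fans scalar metrics
    out to whichever backends (TensorBoard, WandB) the config enables."""

    def __init__(self, use_tensorboard=False, use_wandb=False, log_dir="./logs"):
        self.backends = []
        if use_tensorboard:
            from torch.utils.tensorboard import SummaryWriter
            writer = SummaryWriter(log_dir)
            self.backends.append(lambda tag, val, step: writer.add_scalar(tag, val, step))
        if use_wandb:
            import wandb
            self.backends.append(lambda tag, val, step: wandb.log({tag: val}, step=step))

    def log_scalar(self, tag, value, step):
        # Every enabled backend receives the same metric
        for backend in self.backends:
            backend(tag, value, step)

# With no backends enabled, logging calls are harmless no-ops; here we
# attach a recording backend just to demonstrate the fan-out.
logger = MetricLogger()
records = []
logger.backends.append(lambda tag, val, step: records.append((tag, val, step)))
logger.log_scalar("train/loss", 0.25, step=1)
```

The design keeps training code agnostic of the logging destination: the trainer calls one method, and the config decides where the metrics go.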

TODO / pending:

  1. To support domain parallelism, utils/loss.py and utils/sampler.py contain versions of EDMLoss and deterministic_sampler modified to work with ShardTensor. TBD whether we want to upstream this support to the PhysicsNeMo library.
  2. README.md still needs a full update; an initial updated version is included in this PR.
  3. inference.py still needs to be properly checked for domain parallel support.
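For context on TODO item 1: the unmodified EDM training loss (Karras et al., 2022) that the sharded version adapts can be sketched as below. This is an illustrative NumPy version, not the ShardTensor-aware implementation in utils/loss.py:

```python
import numpy as np

def edm_loss(denoise_fn, x, P_mean=-1.2, P_std=1.2, sigma_data=0.5, rng=None):
    """Standard (unsharded) EDM training loss for a single sample."""
    rng = rng or np.random.default_rng(0)
    sigma = np.exp(rng.normal(P_mean, P_std))               # log-normal noise level
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    noise = rng.normal(size=x.shape) * sigma                # Gaussian noise at level sigma
    denoised = denoise_fn(x + noise, sigma)                 # model's denoising estimate
    return weight * np.mean((denoised - x) ** 2)            # weighted reconstruction error

# A perfect denoiser recovers x exactly, so the loss is zero
x = np.zeros((4, 4))
loss = edm_loss(lambda y, s: x, x)
```

A sharded variant must make the spatial mean and the noise draw respect the domain decomposition, which is why the PR keeps a modified copy locally for now rather than relying on the library version.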

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@jleinonen jleinonen self-assigned this Feb 23, 2026
@pzharrington

/blossom-ci


@pzharrington pzharrington left a comment


Nice work! 🚀

@pzharrington pzharrington added this pull request to the merge queue Feb 26, 2026
Merged via the queue into NVIDIA:main with commit 79c2484 Feb 26, 2026
4 checks passed
nbren12 pushed a commit to nbren12/modulus that referenced this pull request Mar 24, 2026
…tic configs (NVIDIA#1438)

* Stormcast recipe refactor for domain parallelism

* Revert conditioning embedder setting

* Update README.md

* Rename ParallelManager to ParallelHelper

* Clarify sharding usage

* Fix 'scala' typo

* Improve documentation of sharded_dataloader and sharded_data_iter

* Centralize determination of sharded parameters

* Fix test failures

* Lint, readme polish, rename config

---------

Co-authored-by: Peter Harrington <pharrington@nvidia.com>
Co-authored-by: Peter Harrington <48932392+pzharrington@users.noreply.github.com>
