
Stormcast refactor with domain parallel + DiT support, testing, Pydantic configs#1438

Merged
pzharrington merged 11 commits into NVIDIA:main from jleinonen:stormcast-domain-parallel
Feb 26, 2026

Conversation


@jleinonen jleinonen commented Feb 23, 2026

PhysicsNeMo Pull Request

Refactors the StormCast training code as a follow-up to #1379.

Major additions:

  1. Support for domain parallel training, controlled by training.domain_parallel_size. The domain parallelism is handled by ParallelManager in utils/parallel.py. FSDP is now used for parallelism regardless of domain parallel size.
  2. Support for DiT diffusion models, with a standard NATTEN-based configuration loaded by the setup function in nn.py.
  3. Configuration parsing using Pydantic. This eliminates most of the manual validation and default value checking as those are now handled by the Pydantic validation. Hydra is still used for loading configs but afterwards they are passed into Pydantic.
  4. A test suite using pytest. Includes tests for both single- and multi-GPU setups. A mock dataset that produces random data is included for testing. Multi-GPU tests can be run as torchrun --standalone --nproc-per-node=<number_of_gpus> --no-python pytest test_training.py.
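To illustrate item 3, a minimal sketch of the Hydra-to-Pydantic handoff is shown below; the field names and defaults here are hypothetical stand-ins, not the recipe's actual schema, which lives in the StormCast example's config code:

```python
from pydantic import BaseModel, Field, ValidationError

class TrainingConfig(BaseModel):
    # Hypothetical fields for illustration only
    batch_size: int = Field(8, gt=0)
    domain_parallel_size: int = Field(1, ge=1)
    lr: float = 1e-4

# Hydra still loads the config file; the resulting dict is then handed to
# Pydantic, which applies defaults and type/range validation automatically.
raw = {"batch_size": 16}
cfg = TrainingConfig(**raw)
# cfg.domain_parallel_size == 1: default applied without manual checking

# Out-of-range values are rejected without any hand-written validation code
rejected = False
try:
    TrainingConfig(batch_size=-1)
except ValidationError:
    rejected = True
```

This is the pattern that replaces most of the manual validation: the schema declares types, bounds, and defaults in one place, and invalid configs fail fast with a descriptive error.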

More minor changes:

  1. Logging is now passed through utils/logging.py, which handles TensorBoard and/or WandB logging depending on which are enabled.
  2. utils/apex.py enables using Apex LayerNorm with ShardTensor.
  3. Split training-time plotting into multiple plots as this allows one to view them as an animation, which makes it easier to see if the model is learning temporal dynamics properly.
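A minimal sketch of the kind of logging fan-out described in item 1; the class and method names here are hypothetical, not the actual API of utils/logging.py:

```python
class MetricLogger:
    """Hypothetical sketch: a single entry point that fans scalar metrics
    out to whichever backends (TensorBoard, WandB) the config enables."""

    def __init__(self, use_tensorboard=False, use_wandb=False, log_dir="./logs"):
        self.backends = []
        if use_tensorboard:
            from torch.utils.tensorboard import SummaryWriter
            writer = SummaryWriter(log_dir)
            self.backends.append(lambda tag, val, step: writer.add_scalar(tag, val, step))
        if use_wandb:
            import wandb
            self.backends.append(lambda tag, val, step: wandb.log({tag: val}, step=step))

    def log_scalar(self, tag, value, step):
        # Every enabled backend receives the same metric
        for backend in self.backends:
            backend(tag, value, step)

# With no backends enabled, logging calls are harmless no-ops; here we
# attach a recording backend just to demonstrate the fan-out.
logger = MetricLogger()
records = []
logger.backends.append(lambda tag, val, step: records.append((tag, val, step)))
logger.log_scalar("train/loss", 0.25, step=1)
```

The design keeps training code agnostic of the logging destination: the trainer calls one method, and the config decides where the metrics go.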

TODO / pending:

  1. To support domain parallelism, utils/loss.py and utils/sampler.py contain versions of EDMLoss and deterministic_sampler modified to work with ShardTensor. TBD whether we want to upstream this support to the PhysicsNeMo library.
  2. README.md still needs a full update; an initial updated version is included in this PR.
  3. inference.py still needs to be properly checked for domain parallel support.
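For context on TODO item 1: the unmodified EDM training loss (Karras et al., 2022) that the sharded version adapts can be sketched as below. This is an illustrative NumPy version, not the ShardTensor-aware implementation in utils/loss.py:

```python
import numpy as np

def edm_loss(denoise_fn, x, P_mean=-1.2, P_std=1.2, sigma_data=0.5, rng=None):
    """Standard (unsharded) EDM training loss for a single sample."""
    rng = rng or np.random.default_rng(0)
    sigma = np.exp(rng.normal(P_mean, P_std))               # log-normal noise level
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    noise = rng.normal(size=x.shape) * sigma                # Gaussian noise at level sigma
    denoised = denoise_fn(x + noise, sigma)                 # model's denoising estimate
    return weight * np.mean((denoised - x) ** 2)            # weighted reconstruction error

# A perfect denoiser recovers x exactly, so the loss is zero
x = np.zeros((4, 4))
loss = edm_loss(lambda y, s: x, x)
```

A sharded variant must make the spatial mean and the noise draw respect the domain decomposition, which is why the PR keeps a modified copy locally for now rather than relying on the library version.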

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@jleinonen jleinonen self-assigned this Feb 23, 2026
@pzharrington

/blossom-ci


@pzharrington pzharrington left a comment


Nice work! 🚀

@pzharrington pzharrington added this pull request to the merge queue Feb 26, 2026
Merged via the queue into NVIDIA:main with commit 79c2484 Feb 26, 2026
4 checks passed
nbren12 pushed a commit to nbren12/modulus that referenced this pull request Mar 24, 2026
…tic configs (NVIDIA#1438)

* Stormcast recipe refactor for domain parallelism

* Revert conditioning embedder setting

* Update README.md

* Rename ParallelManager to ParallelHelper

* Clarify sharding usage

* Fix 'scala' typo

* Improve documentation of sharded_dataloader and sharded_data_iter

* Centralize determination of sharded parameters

* Fix test failures

* Lint, readme polish, rename config

---------

Co-authored-by: Peter Harrington <pharrington@nvidia.com>
Co-authored-by: Peter Harrington <48932392+pzharrington@users.noreply.github.com>
