Conversation

yaox12 (Member) commented Jan 29, 2026

Description

  1. Fuse scaling and unscaling of BF16 momentums into the kernels to avoid explicit FP32 copies, which reduces the peak memory footprint.
  2. Enable CUDA Graphs for BF16 momentums.

This PR enables this feature only for BF16 momentums, because BF16 doesn't require real scaling and unscaling, only a type conversion.
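
For intuition, here is a minimal sketch (not from the PR) of why BF16 needs only a cast: BF16 keeps FP32's 8-bit exponent, so converting a moment to BF16 and back loses mantissa precision but never dynamic range, and no multiplicative scale factor is required.

import torch

m_fp32 = torch.randn(1024)

# BF16 shares FP32's exponent width, so the round trip is a pure dtype
# cast: mantissa bits are dropped, but the value range is preserved.
m_bf16 = m_fp32.to(torch.bfloat16)
restored = m_bf16.to(torch.float32)

assert torch.allclose(m_fp32, restored, rtol=1e-2, atol=1e-3)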

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Xin Yao <xiny@nvidia.com>
yaox12 requested a review from timmoon10 January 29, 2026 07:22
yaox12 (Member, Author) commented Jan 29, 2026

@kunlunl Can you review as well?

cc @nanz-nv This should resolve the peak memory issue, but I haven't verified it.

This also enables the capturable mode for BF16 momentums.

greptile-apps bot (Contributor) commented Jan 29, 2026

Greptile Overview

Greptile Summary

This PR optimizes BF16 momentum handling in FusedAdam by fusing scaling/unscaling operations directly into the CUDA kernels, eliminating explicit FP32 copies and reducing peak memory footprint. It also enables CUDA Graphs (capturable mode) for BF16 momentums.

Key changes:

  • Added MOMENT_T template parameter to Adam kernel functors to support both FP32 and BF16 moment storage
  • Updated kernel validation to allow both moments (exp_avg/exp_avg_sq) to be either FP32 or BF16 (must match)
  • Added fuse_unscale flag in Python code that skips explicit BF16→FP32→BF16 conversions when moments are BF16
  • Modified get_unscaled_state() to optionally skip unscaling for BF16, letting kernel handle type conversion
  • Updated capturable mode validation to allow BF16 moments alongside FP32
  • Fixed store_param_remainders validation ordering issue in latest commit
  • Added test coverage for BF16 momentums in non-capturable mode
  • Updated deprecated torch.cuda.amp.GradScaler to torch.amp.GradScaler

Implementation approach:
The optimization leverages the fact that BF16 doesn't require true scaling/unscaling, just a type conversion. The kernel loads BF16 moments, performs the Adam math in FP32, then stores the results back as BF16, avoiding intermediate FP32 buffers in Python. This also keeps tensor pointers stable for CUDA Graph capture.
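
A rough PyTorch emulation of that flow (hypothetical names mirroring the summary above; in the real kernel the FP32 values live in registers, so no intermediate FP32 tensor is ever allocated):

import torch

def get_unscaled_state(state: torch.Tensor, skip_unscale: bool) -> torch.Tensor:
    # Emulates the fuse_unscale path: with skip_unscale, the BF16 moment
    # is handed to the kernel as-is instead of being copied to FP32 first.
    return state if skip_unscale else state.float()

def adam_moment_update(exp_avg: torch.Tensor, grad: torch.Tensor, beta1: float):
    # Emulates the kernel: load BF16, do the Adam math in FP32, store BF16
    # in place, so the moment tensor's storage never moves.
    m = exp_avg.float()
    m.mul_(beta1).add_(grad, alpha=1.0 - beta1)
    exp_avg.copy_(m)

exp_avg = torch.zeros(8, dtype=torch.bfloat16)
grad = torch.randn(8)

fuse_unscale = exp_avg.dtype == torch.bfloat16
state = get_unscaled_state(exp_avg, skip_unscale=fuse_unscale)
assert state is exp_avg  # no FP32 copy was made

ptr = exp_avg.data_ptr()
adam_moment_update(state, grad, beta1=0.9)
assert exp_avg.data_ptr() == ptr  # storage is stable: safe for CUDA Graphs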

Confidence Score: 4/5

  • Safe to merge with minor test coverage gap for capturable mode
  • Implementation is well-designed with proper type safety through template parameters and runtime validation. The fuse_unscale logic is sound and the latest commit fixed the store_param_remainders ordering bug. Score not 5 due to missing test coverage for the capturable mode with BF16 momentums (a key feature mentioned in PR description).
  • tests/pytorch/test_fused_optimizer.py - add capturable mode test for BF16 momentums

Important Files Changed

  • tests/pytorch/test_fused_optimizer.py: Added a test for BF16 momentums and fixed GradScaler API usage
  • transformer_engine/common/multi_tensor/adam.cu: Added MOMENT_T template parameter to support BF16 momentums; updated validation to allow BF16
  • transformer_engine/pytorch/optimizers/fused_adam.py: Updated capturable mode validation to allow BF16 moments; added fuse_unscale flag; modified get_unscaled_state to skip unscaling for BF16

Sequence Diagram

sequenceDiagram
    participant User
    participant FusedAdam as FusedAdam (Python)
    participant AdamKernel as Adam CUDA Kernel
    participant Moments as BF16 Moments (exp_avg/exp_avg_sq)

    Note over User,Moments: BF16 Moments with Fused Scaling Flow

    User->>FusedAdam: step() with BF16 moments
    FusedAdam->>FusedAdam: Check fuse_unscale flag (BF16 moments)
    FusedAdam->>FusedAdam: get_unscaled_state(skip_unscale=True)
    Note over FusedAdam: Skip explicit BF16→FP32 conversion
    FusedAdam->>AdamKernel: Call kernel with BF16 moments directly
    AdamKernel->>AdamKernel: Load BF16 moments, cast to FP32 for math
    AdamKernel->>AdamKernel: Perform Adam update in FP32
    AdamKernel->>Moments: Store updated moments as BF16
    Note over AdamKernel,Moments: Fused scaling: FP32→BF16 cast in kernel
    AdamKernel-->>FusedAdam: Return
    FusedAdam->>FusedAdam: Skip explicit scaling (fuse_unscale=True)
    Note over FusedAdam: Avoid explicit FP32 copy & scaling
    FusedAdam-->>User: Complete
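
Since stable pointers are what make the flow above graph-safe, here is a hypothetical usage sketch of capturing the step (assumes a CUDA device, that FusedAdam is importable from transformer_engine.pytorch.optimizers, and that it accepts the keyword arguments shown; verify against fused_adam.py):

import torch
from transformer_engine.pytorch.optimizers import FusedAdam

# Hypothetical sketch: capture an optimizer step with BF16 moments.
params = [torch.nn.Parameter(torch.randn(1024, device="cuda"))]
opt = FusedAdam(
    params,
    lr=1e-3,
    capturable=True,                 # required for CUDA Graph capture
    exp_avg_dtype=torch.bfloat16,    # BF16 moments, newly graph-capturable
    exp_avg_sq_dtype=torch.bfloat16,
)

# Warm-up step so optimizer state is allocated before capture.
for p in params:
    p.grad = torch.zeros_like(p)
opt.step()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    opt.step()  # BF16 moments keep stable pointers, so capture is valid

graph.replay()  # replays the captured optimizer step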

greptile-apps bot (Contributor) left a comment

3 files reviewed, 1 comment

yaox12 added the 2.13.0 label Jan 29, 2026
Signed-off-by: Xin Yao <xiny@nvidia.com>
greptile-apps bot (Contributor) left a comment

3 files reviewed, no comments

yaox12 and others added 2 commits January 29, 2026 07:47
Signed-off-by: Xin Yao <xiny@nvidia.com>
greptile-apps bot (Contributor) left a comment

2 files reviewed, 1 comment

Comment on lines +410 to +422
@pytest.mark.skipif(not is_bf16_available(), reason="bf16 is not supported")
def test_bf16_exp_avg_and_exp_avg_sq(self):
    self.gen_precision_aware_test(
        use_fp8_params=False,
        param_dtype=torch.bfloat16,
        use_master_weights=True,
        master_weight_dtype=torch.float32,
        grad_dtype=torch.float32,
        exp_avg_dtype=torch.bfloat16,
        exp_avg_sq_dtype=torch.bfloat16,
        master_rtol=2e-3,
        master_atol=2e-3,
    )
greptile-apps bot (Contributor) commented:

Consider adding a test for capturable mode (CUDA Graphs) with BF16 momentums, since the PR description mentions "Enable CUDA Graphs for BF16 momentums" as a key feature. The current test only covers non-capturable mode.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Signed-off-by: Xin Yao <xiny@nvidia.com>
greptile-apps bot (Contributor) left a comment

3 files reviewed, 1 comment

Comment on lines +410 to +422
@pytest.mark.skipif(not is_bf16_available(), reason="bf16 is not supported")
def test_bf16_exp_avg_and_exp_avg_sq(self):
    self.gen_precision_aware_test(
        use_fp8_params=False,
        param_dtype=torch.bfloat16,
        use_master_weights=True,
        master_weight_dtype=torch.float32,
        grad_dtype=torch.float32,
        exp_avg_dtype=torch.bfloat16,
        exp_avg_sq_dtype=torch.bfloat16,
        master_rtol=2e-3,
        master_atol=2e-3,
    )
greptile-apps bot (Contributor) commented:

Test only covers the non-capturable mode. Add a test for capturable mode with BF16 momentums, since the PR enables CUDA Graphs for this case.

Note: Check that gen_precision_aware_test supports a capturable parameter, or create a separate test method.
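
A hypothetical shape for that follow-up test, assuming gen_precision_aware_test grows a capturable parameter (it may not have one today, per the note above):

@pytest.mark.skipif(not is_bf16_available(), reason="bf16 is not supported")
def test_bf16_exp_avg_and_exp_avg_sq_capturable(self):
    # Hypothetical: relies on a `capturable` kwarg that may need to be
    # added to gen_precision_aware_test first.
    self.gen_precision_aware_test(
        use_fp8_params=False,
        param_dtype=torch.bfloat16,
        use_master_weights=True,
        master_weight_dtype=torch.float32,
        grad_dtype=torch.float32,
        exp_avg_dtype=torch.bfloat16,
        exp_avg_sq_dtype=torch.bfloat16,
        capturable=True,
        master_rtol=2e-3,
        master_atol=2e-3,
    )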
