
Conversation

@pggPL (Collaborator) commented Nov 4, 2025

Description

This PR adds detailed documentation for the Low Precision Training feature in Transformer Engine, covering FP8, MXFP8, NVFP4, and other quantization recipes for both PyTorch and JAX frameworks.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@pggPL force-pushed the docs_recipes branch 5 times, most recently from 79ed6d7 to 9649cd8 on December 1, 2025 16:34
@pggPL force-pushed the docs_recipes branch 4 times, most recently from 3053170 to 7905a74 on December 8, 2025 17:21
…m low precision training

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 4 commits December 8, 2025 18:38
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
… add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py
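
For orientation, a minimal sketch of the example pattern these changes describe (illustrative only: the capability threshold, tensor sizes, and recipe choice are placeholders, and the te.autocast signature follows the docs snippets quoted later in this conversation):

# Illustrative sketch, not the exact docs example.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# GPU capability assertion before the example runs (threshold shown here is illustrative).
assert torch.cuda.get_device_capability()[0] >= 9, "FP8 recipes need Hopper-class or newer GPUs"

# Format enum instead of the string 'E4M3'.
recipe = DelayedScaling(fp8_format=Format.E4M3)

# params_dtype set explicitly for the PyTorch examples.
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
inp = torch.randn(32, 1024, dtype=torch.bfloat16, device="cuda")

# Only forward/backward is kept; optimizer code was dropped from the recipe examples.
with te.autocast(enabled=True, recipe=recipe):
    out = layer(inp)
out.sum().backward()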

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL marked this pull request as ready for review December 8, 2025 21:34
@greptile-apps (Contributor) bot commented Dec 8, 2025

Greptile Overview

Greptile Summary

This PR adds comprehensive documentation for Transformer Engine's low precision training capabilities, covering FP8, MXFP8, and NVFP4 quantization recipes for both PyTorch and JAX frameworks.

Key Additions

  • Introduction documentation explaining mixed precision training fundamentals, BF16/FP16 formats, master weights, and the autocast API for both frameworks
  • Recipe-specific documentation for five quantization approaches: FP8 Current Scaling, FP8 Delayed Scaling, FP8 Block Scaling, MXFP8, and NVFP4, each with detailed technical explanations and framework-specific examples
  • Performance optimization guide covering transpose handling, memory efficiency, fused layers, and architecture-specific considerations (Hopper vs Blackwell)
  • 28 SVG diagrams illustrating concepts like scaling factors, data formats, tensor layouts, and distributed training flows
  • Code examples for both PyTorch and JAX demonstrating recipe usage, distributed training, and performance measurement
  • CSS styling for diagram colors, responsive SVGs, and tab navigation
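
For orientation (a sketch, not lifted from the docs themselves), the five recipes correspond to recipe classes in transformer_engine.common.recipe; the class names below follow those referenced in this PR's commits and comments, with Float8BlockScaling assumed here for the block-scaling recipe:

# Sketch: recipe classes for the five documented recipes. Hardware support varies
# (e.g. MXFP8 and NVFP4 target Blackwell-class GPUs).
from transformer_engine.common.recipe import (
    DelayedScaling,        # FP8 Delayed Scaling
    Float8CurrentScaling,  # FP8 Current Scaling
    Float8BlockScaling,    # FP8 Block Scaling (assumed class name)
    MXFP8BlockScaling,     # MXFP8
    NVFP4BlockScaling,     # NVFP4
)

recipe = Float8CurrentScaling()  # any of these can be passed to the autocast context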

Documentation Quality

The documentation is well-structured with clear progression from basic concepts to advanced optimization. Technical explanations are thorough, diagrams effectively illustrate complex concepts, and code examples cover both single-GPU and distributed scenarios.

Minor Issues

  • One arXiv reference appears to use a future date (line 10 in nvfp4.rst) - should be verified
  • One example could be more explicit about which recipe is being used (memory_usage_2_pytorch.py)

All previously identified issues from earlier review rounds have been addressed.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk as it only adds documentation
  • Documentation-only PR with comprehensive content covering low precision training recipes. All code examples are non-executable documentation snippets. Previous review issues have been addressed. Only minor style suggestions remain (arXiv reference verification and explicit recipe parameter).
  • No files require special attention

Important Files Changed

Filename – Overview

docs/features/low_precision_training/introduction/introduction.rst – Comprehensive introduction to mixed precision and FP8 training concepts
docs/features/low_precision_training/fp8_current_scaling/fp8_current_scaling.rst – Detailed documentation for the FP8 Current Scaling recipe with examples
docs/features/low_precision_training/fp8_delayed_scaling/fp8_delayed_scaling.rst – Explains FP8 Delayed Scaling with historical amax management
docs/features/low_precision_training/performance_considerations/performance_considerations.rst – In-depth performance optimization guidance for transpose handling and memory
docs/features/low_precision_training/nvfp4/nvfp4.rst – Documents the NVFP4 4-bit precision recipe with hierarchical scaling

Sequence Diagram

sequenceDiagram
    participant User as User/Developer
    participant Docs as Documentation
    participant Intro as Introduction
    participant Recipes as Recipe Docs<br/>(FP8/MXFP8/NVFP4)
    participant Perf as Performance Guide
    participant Examples as Code Examples

    User->>Docs: Read low precision training docs
    Docs->>Intro: 1. Learn mixed precision basics
    Note over Intro: BF16/FP16 concepts<br/>Master weights<br/>Autocast usage
    
    Intro->>Recipes: 2. Choose quantization recipe
    Note over Recipes: FP8 Current Scaling<br/>FP8 Delayed Scaling<br/>FP8 Block Scaling<br/>MXFP8<br/>NVFP4
    
    Recipes->>Examples: 3. Review framework examples
    Note over Examples: PyTorch examples<br/>JAX examples<br/>Distributed training
    
    Examples->>Perf: 4. Optimize performance
    Note over Perf: Transpose handling<br/>Memory optimization<br/>Fused layers
    
    Perf->>User: 5. Implement in production
    Note over User: Apply recipe with<br/>te.autocast(recipe=...)

@greptile-apps (Contributor) bot left a comment:

46 files reviewed, no comments

@ptrendx added the "documentation" label Dec 11, 2025
@pggPL mentioned this pull request Dec 16, 2025
@greptile-apps (Contributor) bot left a comment:

2 files reviewed, 2 comments

pggPL and others added 3 commits January 13, 2026 12:19
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

1 file reviewed, 1 comment

pggPL added 2 commits January 13, 2026 12:23
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

2 files reviewed, 2 comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

3 files reviewed, 3 comments

@greptile-apps (Contributor) bot left a comment:

1 file reviewed, 1 comment

@greptile-apps (Contributor) bot left a comment:

1 file reviewed, 1 comment

@greptile-apps (Contributor) bot left a comment:

No files reviewed, no comments

@greptile-apps (Contributor) bot left a comment:

3 files reviewed, 3 comments

@jberchtold-nvidia (Collaborator) left a comment:

Fantastic work! The diagrams and explanations are great and along with the code examples will hopefully make things a lot clearer to users

pggPL and others added 6 commits January 14, 2026 12:32
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Hopper (SM 9.0)

Blackwell and later (SM >= 10.0) – recipe is emulated with MXFP8. Note that this is done mainly for compatibility, MXFP8 is the preferred recipe on Blackwell.
@pggPL (Collaborator, Author) commented:

  • emulated = power of 2 scaling factor only
  • remove the compatibility part.
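
A toy illustration of the power-of-two restriction mentioned above (this is not TE's internal code; the rounding direction and the E4M3 max value are assumptions made for the sketch):

import math

def pow2_scale(amax: float, fp8_max: float = 448.0) -> float:
    # The ideal per-block scale would be fp8_max / amax; the emulated path only
    # allows power-of-two scales, so round down to the nearest power of two.
    ideal = fp8_max / amax
    return 2.0 ** math.floor(math.log2(ideal))

print(pow2_scale(3.7))  # 64.0 rather than the ideal ~121.08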

-----------------

Blackwell and later (SM 10.0+)  (no newline at end of file)
Blackwell and later (SM 10.0+)
@pggPL (Collaborator, Author) commented:

Comment about SM 12

pggPL and others added 2 commits January 29, 2026 15:25
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

5 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

5 files reviewed, 2 comments

===================================

NVFP4 is the first 4-bit recipe introduced in Transformer Engine –
please refer to the `NVFP4 paper <https://arxiv.org/abs/2509.25149>`__ for more details.
@greptile-apps (Contributor) bot commented:

arXiv ID 2509.25149 uses prefix 2509 (September 2025), which is in the future. Verify this is the correct reference or use a placeholder format like [arXiv link pending] until the paper is published.

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)

inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
with te.autocast(enabled=True):
@greptile-apps (Contributor) bot commented:

Missing recipe parameter. te.autocast(enabled=True) uses a default recipe, but for documentation clarity, explicitly specify which recipe is being used (e.g., recipe=DelayedScaling()) to match other examples.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
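
A sketch of what the suggested change could look like (the recipe kwarg name and the DelayedScaling default follow the comment above; they are not verified against the final docs):

# Quoted snippet with the recipe made explicit, as the comment suggests.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")

with te.autocast(enabled=True, recipe=DelayedScaling()):
    out = layer(inp)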

Labels: documentation (Improvements or additions to documentation)

Projects: None yet

3 participants