
Conversation

@pggPL (Collaborator) commented Nov 4, 2025

Description

This PR adds detailed documentation for the Low Precision Training feature in Transformer Engine, covering FP8, MXFP8, NVFP4, and other quantization recipes for both PyTorch and JAX frameworks.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@pggPL force-pushed the docs_recipes branch 5 times, most recently from 79ed6d7 to 9649cd8 on December 1, 2025 16:34
@pggPL force-pushed the docs_recipes branch 4 times, most recently from 3053170 to 7905a74 on December 8, 2025 17:21
…m low precision training

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 4 commits December 8, 2025 18:38
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
… add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py
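
For orientation, a minimal sketch of the example pattern these changes describe (illustrative only: the capability threshold, tensor sizes, and recipe choice are placeholders, and the te.autocast signature follows the docs snippets quoted later in this conversation):

# Illustrative sketch, not the exact docs example.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# GPU capability assertion before the example runs (threshold shown here is illustrative).
assert torch.cuda.get_device_capability()[0] >= 9, "FP8 recipes need Hopper-class or newer GPUs"

# Format enum instead of the string 'E4M3'.
recipe = DelayedScaling(fp8_format=Format.E4M3)

# params_dtype set explicitly for the PyTorch examples.
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
inp = torch.randn(32, 1024, dtype=torch.bfloat16, device="cuda")

# Only forward/backward is kept; optimizer code was dropped from the recipe examples.
with te.autocast(enabled=True, recipe=recipe):
    out = layer(inp)
out.sum().backward()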

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL marked this pull request as ready for review December 8, 2025 21:34
@greptile-apps (Contributor) bot commented Dec 8, 2025

Greptile Overview

Greptile Summary

This PR adds comprehensive documentation for Transformer Engine's low precision training capabilities, covering FP8, MXFP8, and NVFP4 quantization recipes for both PyTorch and JAX frameworks.

Key Additions

  • Introduction documentation explaining mixed precision training fundamentals, BF16/FP16 formats, master weights, and the autocast API for both frameworks
  • Recipe-specific documentation for five quantization approaches: FP8 Current Scaling, FP8 Delayed Scaling, FP8 Block Scaling, MXFP8, and NVFP4, each with detailed technical explanations and framework-specific examples
  • Performance optimization guide covering transpose handling, memory efficiency, fused layers, and architecture-specific considerations (Hopper vs Blackwell)
  • 28 SVG diagrams illustrating concepts like scaling factors, data formats, tensor layouts, and distributed training flows
  • Code examples for both PyTorch and JAX demonstrating recipe usage, distributed training, and performance measurement
  • CSS styling for diagram colors, responsive SVGs, and tab navigation
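
For orientation (a sketch, not lifted from the docs themselves), the five recipes correspond to recipe classes in transformer_engine.common.recipe; the class names below follow those referenced in this PR's commits and comments, with Float8BlockScaling assumed here for the block-scaling recipe:

# Sketch: recipe classes for the five documented recipes. Hardware support varies
# (e.g. MXFP8 and NVFP4 target Blackwell-class GPUs).
from transformer_engine.common.recipe import (
    DelayedScaling,        # FP8 Delayed Scaling
    Float8CurrentScaling,  # FP8 Current Scaling
    Float8BlockScaling,    # FP8 Block Scaling (assumed class name)
    MXFP8BlockScaling,     # MXFP8
    NVFP4BlockScaling,     # NVFP4
)

recipe = Float8CurrentScaling()  # any of these can be passed to the autocast context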

Documentation Quality

The documentation is well-structured with clear progression from basic concepts to advanced optimization. Technical explanations are thorough, diagrams effectively illustrate complex concepts, and code examples cover both single-GPU and distributed scenarios.

Minor Issues

  • One arXiv reference appears to use a future date (line 10 in nvfp4.rst) - should be verified
  • One example could be more explicit about which recipe is being used (memory_usage_2_pytorch.py)

All previously identified issues from earlier review rounds have been addressed.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk as it only adds documentation
  • Documentation-only PR with comprehensive content covering low precision training recipes. All code examples are non-executable documentation snippets. Previous review issues have been addressed. Only minor style suggestions remain (arXiv reference verification and explicit recipe parameter).
  • No files require special attention

Important Files Changed

Filename – Overview

docs/features/low_precision_training/introduction/introduction.rst – Comprehensive introduction to mixed precision and FP8 training concepts
docs/features/low_precision_training/fp8_current_scaling/fp8_current_scaling.rst – Detailed documentation for the FP8 Current Scaling recipe with examples
docs/features/low_precision_training/fp8_delayed_scaling/fp8_delayed_scaling.rst – Explains FP8 Delayed Scaling with historical amax management
docs/features/low_precision_training/performance_considerations/performance_considerations.rst – In-depth performance optimization guidance for transpose handling and memory
docs/features/low_precision_training/nvfp4/nvfp4.rst – Documents the NVFP4 4-bit precision recipe with hierarchical scaling

Sequence Diagram

sequenceDiagram
    participant User as User/Developer
    participant Docs as Documentation
    participant Intro as Introduction
    participant Recipes as Recipe Docs<br/>(FP8/MXFP8/NVFP4)
    participant Perf as Performance Guide
    participant Examples as Code Examples

    User->>Docs: Read low precision training docs
    Docs->>Intro: 1. Learn mixed precision basics
    Note over Intro: BF16/FP16 concepts<br/>Master weights<br/>Autocast usage
    
    Intro->>Recipes: 2. Choose quantization recipe
    Note over Recipes: FP8 Current Scaling<br/>FP8 Delayed Scaling<br/>FP8 Block Scaling<br/>MXFP8<br/>NVFP4
    
    Recipes->>Examples: 3. Review framework examples
    Note over Examples: PyTorch examples<br/>JAX examples<br/>Distributed training
    
    Examples->>Perf: 4. Optimize performance
    Note over Perf: Transpose handling<br/>Memory optimization<br/>Fused layers
    
    Perf->>User: 5. Implement in production
    Note over User: Apply recipe with<br/>te.autocast(recipe=...)

@greptile-apps (Contributor) bot left a comment:

46 files reviewed, no comments

@ptrendx added the "documentation" label Dec 11, 2025
@pggPL mentioned this pull request Dec 16, 2025
@greptile-apps (Contributor) bot left a comment:

2 files reviewed, 2 comments

pggPL and others added 3 commits January 13, 2026 12:19
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

1 file reviewed, 1 comment

pggPL added 2 commits January 13, 2026 12:23
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

2 files reviewed, 2 comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

3 files reviewed, 3 comments

@greptile-apps (Contributor) bot left a comment:

1 file reviewed, 1 comment

@greptile-apps (Contributor) bot left a comment:

1 file reviewed, 1 comment

@greptile-apps (Contributor) bot left a comment:

No files reviewed, no comments

@greptile-apps (Contributor) bot left a comment:

3 files reviewed, 3 comments

@jberchtold-nvidia (Collaborator) left a comment:

Fantastic work! The diagrams and explanations are great and along with the code examples will hopefully make things a lot clearer to users

pggPL and others added 6 commits January 14, 2026 12:32
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Hopper (SM 9.0)

Blackwell and later (SM >= 10.0) – recipe is emulated with MXFP8. Note that this is done mainly for compatibility, MXFP8 is the preferred recipe on Blackwell.
@pggPL (Collaborator, Author) commented:

  • emulated = power of 2 scaling factor only
  • remove the compatibility part.
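
A toy illustration of the power-of-two restriction mentioned above (this is not TE's internal code; the rounding direction and the E4M3 max value are assumptions made for the sketch):

import math

def pow2_scale(amax: float, fp8_max: float = 448.0) -> float:
    # The ideal per-block scale would be fp8_max / amax; the emulated path only
    # allows power-of-two scales, so round down to the nearest power of two.
    ideal = fp8_max / amax
    return 2.0 ** math.floor(math.log2(ideal))

print(pow2_scale(3.7))  # 64.0 rather than the ideal ~121.08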

-----------------

Blackwell and later (SM 10.0+)  (no newline at end of file)
Blackwell and later (SM 10.0+)
@pggPL (Collaborator, Author) commented:

Comment about SM 12

pggPL and others added 2 commits January 29, 2026 15:25
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

5 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@greptile-apps (Contributor) bot left a comment:

5 files reviewed, 2 comments

===================================

NVFP4 is the first 4-bit recipe introduced in Transformer Engine –
please refer to the `NVFP4 paper <https://arxiv.org/abs/2509.25149>`__ for more details.
@greptile-apps (Contributor) bot commented:

arXiv ID 2509.25149 uses prefix 2509 (September 2025), which is in the future. Verify this is the correct reference or use a placeholder format like [arXiv link pending] until the paper is published.

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)

inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
with te.autocast(enabled=True):
@greptile-apps (Contributor) bot commented:

Missing recipe parameter. te.autocast(enabled=True) uses a default recipe, but for documentation clarity, explicitly specify which recipe is being used (e.g., recipe=DelayedScaling()) to match other examples.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
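
A sketch of what the suggested change could look like (the recipe kwarg name and the DelayedScaling default follow the comment above; they are not verified against the final docs):

# Quoted snippet with the recipe made explicit, as the comment suggests.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16)
inp = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")

with te.autocast(enabled=True, recipe=DelayedScaling()):
    out = layer(inp)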

Labels: documentation (Improvements or additions to documentation)

Projects: None yet

3 participants