
Conversation

@Oleg-Goncharov (Collaborator) commented Nov 21, 2025

Description

This PR introduces a specialized CUDA kernel for NVFP4 quantization of BF16 inputs, optimized for the Blackwell architecture (sm100f family). The implementation achieves its performance gains by leveraging architecture-specific features, reaching the following throughput:

RN (round-to-nearest): 6.4 TB/s (rowwise only: 7.2 TB/s)

SR (stochastic rounding): 4.5 TB/s (rowwise only: 7.0 TB/s)

Rowwise + Colwise (transpose)

NVFP4 kernel performance 3

Rowwise only

a) round-to-nearest

NVFP4 cast rowwise

b) stochastic rounding

NVFP4 cast rowwise + SR

Below are performance measurements for quantizing a tensor with dimensions representative of DSv3, [8192×8, 7168], on an internal cluster (B300).

NVFP4 cast DSv3
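As a rough sanity check on these numbers (assuming the quoted TB/s figures count total bytes read plus bytes written, and one FP8 scale per 16-element NVFP4 block): the [8192×8, 7168] BF16 input is 65536 × 7168 × 2 B ≈ 0.94 GB, and the rowwise + colwise NVFP4 outputs with their block scales add roughly another 0.53 GB, for about 1.47 GB of memory traffic per cast. At the 6.4 TB/s round-to-nearest figure above, that corresponds to roughly 230 µs per tensor.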

Using --fast-math can improve the performance of the kernel's stochastic rounding (RNG) path by up to ~10%.

Thread-to-data mapping (colwise case)

To reduce shared memory bank conflicts, the following mapping is used when reading from and writing to the shmem buffers:

  • A single thread processes 16x2 elements (2 NVFP4 blocks).
  • Cells of the same color belong to the same warp.
  • Thread indices and their offsets are computed as:
const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
const int tid_X_colwise = thread_lane;

const int thread_offset_Y_colwise = tid_Y_colwise * SCALE_DIM;
const int thread_offset_X_colwise = tid_X_colwise * 2;

where SCALE_DIM=16.
The arrows in the figure below illustrate how thread indices increment, forming a zigzag pattern.

a) Reads from SHMEM Input Buffer

Colwise reads NVFP4

b) Writes to SHMEM Output Transpose Buffer

Colwise writes NVFP4
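To make the indexing above concrete, the following small host-side sketch (illustration only; it assumes 4 warps of 32 threads each and SCALE_DIM = 16, matching the formulas above) prints the per-thread offsets and shows the staggered, zigzag assignment of threads to 16x2 element patches:

#include <cstdio>

int main() {
  constexpr int SCALE_DIM = 16;
  constexpr int WARPS = 4;      // assumed warp count, for illustration only
  constexpr int WARP_SIZE = 32;
  for (int warp = 0; warp < WARPS; ++warp) {
    for (int thread_lane = 0; thread_lane < WARP_SIZE; ++thread_lane) {
      // Same formulas as above: consecutive lanes rotate through the four
      // Y groups, so neighbouring threads touch different shmem rows.
      const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
      const int tid_X_colwise = thread_lane;
      const int thread_offset_Y_colwise = tid_Y_colwise * SCALE_DIM;
      const int thread_offset_X_colwise = tid_X_colwise * 2;
      // Each thread owns a 16x2 patch (two NVFP4 blocks) starting at
      // (thread_offset_Y_colwise, thread_offset_X_colwise).
      printf("warp %d lane %2d -> Y=%2d X=%2d\n", warp, thread_lane,
             thread_offset_Y_colwise, thread_offset_X_colwise);
    }
  }
  return 0;
}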

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added a specialized kernel
  • Added the logic to use it when the conditions are met

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

greptile-apps bot commented Nov 21, 2025

Greptile Summary

This PR introduces a specialized CUDA kernel for NVFP4 quantization of BF16 tensors on the Blackwell architecture (sm_100+), achieving significant performance improvements (6.4 TB/s for round-to-nearest, 4.5 TB/s for stochastic rounding).

Key changes:

  • Added tuned 1D kernel (quantize_transpose_nvfp4_tuned_1D.cuh) using Blackwell-specific features: TMA async copies, memory barriers, and cluster launch control
  • Added PTX helper functions for mbarrier operations and cluster management in ptx.cuh
  • Updated dispatcher to route BF16+1D quantization cases to the tuned kernel
  • Enhanced test reference implementation to match kernel behavior with use_fast_math flag (FP8 scale quantization and BF16 truncation)

Critical issue:

  • The dispatcher at line 1171 of quantize_transpose_nvfp4.cuh is missing a runtime architecture check before calling the tuned kernel. This will cause runtime failures on non-Blackwell GPUs when compiled with CUDA 12.8+. The FP4_TYPE_SUPPORTED macro only checks compile-time CUDA version, not device capability.

Other observations:

  • Test improvements include better numerical accuracy handling and clearer error reporting
  • Added align_smem_ptr_per_TMA_requirements helper for TMA alignment requirements
  • Kernel uses sophisticated shared memory management with multiple buffers and swizzling to reduce bank conflicts

Confidence Score: 3/5

  • This PR has a critical runtime architecture check missing that will cause failures on non-Blackwell GPUs
  • The implementation is technically sound and well-structured with proper compile-time guards in the kernel code. However, the dispatcher lacks a runtime check for sm_100+ before calling the Blackwell-specific tuned kernel. This means the code will fail on Ampere/Hopper GPUs when built with CUDA 12.8+. The issue was previously identified, but the developer's response incorrectly conflated compile-time FP4 support with runtime GPU capability. Adding an is_supported_by_CC_100() check in the dispatcher would resolve this and bring the score to 5.
  • Pay close attention to transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh line 1171 - the dispatch logic needs a runtime architecture check

Important Files Changed

Filename: transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh
Overview: Added dispatch logic to tuned 1D kernel for BF16 inputs without 2D quantization. Missing runtime architecture check could cause failures on pre-Blackwell GPUs.

Filename: transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh
Overview: New tuned kernel implementation for NVFP4 quantization. Uses Blackwell-specific TMA, mbarrier, and cluster launch control. Compile-time guards present but no runtime check in caller.

Sequence Diagram

sequenceDiagram
    participant User
    participant Dispatcher as quantize_transpose<br/>(nvfp4/quantize_transpose_nvfp4.cuh)
    participant TunedKernel as quantize_transpose_tuned_1D<br/>(specialized/...tuned_1D.cuh)
    participant GPU as GPU Kernel<br/>(Blackwell-specific)
    participant PTX as PTX Instructions<br/>(TMA, mbarrier, cluster)

    User->>Dispatcher: quantize_transpose(input, output, config)
    Dispatcher->>Dispatcher: Check: !use_2d_quantization &&<br/>input.dtype == BF16
    alt BF16 + 1D quantization
        Dispatcher->>TunedKernel: quantize_transpose_tuned_1D()
        TunedKernel->>TunedKernel: Validate inputs & setup
        TunedKernel->>TunedKernel: Create TMA tensor maps
        TunedKernel->>TunedKernel: Launch kernel with config<br/>(USE_STOCHASTIC_ROUNDING,<br/>USE_FAST_MATH, RETURN_TRANSPOSE)
        TunedKernel->>GPU: quantize_transpose_nvfp4_tuned_1D_kernel<<<grid, block>>>
        GPU->>PTX: TMA async copy (global->shared)
        GPU->>PTX: mbarrier init/arrive/wait
        GPU->>PTX: Quantize BF16->NVFP4 with scaling
        GPU->>PTX: Optional: cluster launch control
        GPU->>PTX: TMA async copy (shared->global)
        GPU-->>TunedKernel: Return
        TunedKernel-->>Dispatcher: Return
    else Other configurations
        Dispatcher->>Dispatcher: Use generic kernel path
    end
    Dispatcher-->>User: Return quantized output
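For readers unfamiliar with the async copy / barrier flow in the diagram, here is a minimal sketch of that pattern written with the portable libcudacxx cuda::barrier and cuda::memcpy_async API. This is an illustration only: the actual kernel drives TMA and mbarrier through the hand-written PTX helpers in ptx.cuh, and the tile size and quantization step below are placeholders.

#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda_bf16.h>

namespace cg = cooperative_groups;

template <int TILE_ELTS>
__global__ void cast_tile_sketch(const __nv_bfloat16 *in, unsigned char *out_fp4) {
  __shared__ alignas(128) __nv_bfloat16 tile[TILE_ELTS];
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  auto block = cg::this_thread_block();
  if (block.thread_rank() == 0) {
    init(&bar, block.size());  // one expected arrival per thread in the block
  }
  block.sync();

  // Stage one input tile into shared memory asynchronously
  // (the "TMA async copy (global->shared)" step in the diagram).
  cuda::memcpy_async(block, tile, in + blockIdx.x * TILE_ELTS,
                     sizeof(__nv_bfloat16) * TILE_ELTS, bar);

  // The "mbarrier arrive/wait" step: block until the copy has landed.
  bar.arrive_and_wait();

  // Per-thread BF16 -> NVFP4 quantization with block scaling would go here,
  // followed by an asynchronous store of the packed results to global memory.
  (void)out_fp4;
}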

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Oleg-Goncharov Oleg-Goncharov changed the title [Common] Persistent NVFP4 kernel [Common] Persistent NVFP4 cast + transpose kernel Nov 22, 2025
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_nvfp4_persistent_kernel branch from 445c870 to a7a0652 on November 22, 2025 01:34
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/dispatch/quantize.cuh, line 250 (link)

    logic: missing is_supported_by_CC_100() check. forward pass (line 104) has this check, but backward pass doesn't. without it, the persistent kernel can be dispatched on non-Blackwell hardware and hit NVTE_DEVICE_ERROR at runtime.

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@ptrendx ptrendx added the 2.11.0 label Nov 25, 2025
@Oleg-Goncharov Oleg-Goncharov added the performance and enhancement labels Dec 4, 2025
@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments


@Oleg-Goncharov Oleg-Goncharov changed the title [Common] Persistent NVFP4 cast + transpose kernel [Common] Tuned NVFP4 cast + transpose kernel Dec 8, 2025
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, line 1171-1174 (link)

    logic: Missing runtime architecture check before dispatching to persistent kernel. The tuned 1D kernel uses Blackwell-specific PTX instructions (mbarrier variants, cluster launch control, etc.) and requires sm_100+. Should guard with is_supported_by_CC_100() check similar to other TMA kernel dispatches in the codebase (see quantize_fp8.cuh:535, gated.cuh:48).

4 files reviewed, 1 comment


@ptrendx ptrendx added the fp4 label Dec 11, 2025
@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_nvfp4_persistent_kernel branch from 91312be to a38eeff on December 12, 2025 15:45
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, line 1171-1174 (link)

    logic: Missing runtime check for sm_100+ before dispatching to tuned kernel. The tuned kernel uses Blackwell-specific features (TMA, cluster launch control) that will fail on non-sm_100 devices.

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. tests/cpp/operator/test_cast_nvfp4_transpose.cu, line 125-126 (link)

    syntax: typo in comment: 'Numercial' should be 'Numerical'

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Oleg-Goncharov Oleg-Goncharov changed the title [Common] Tuned NVFP4 cast + transpose kernel [Common] Tuned NVFP4 cast kernel Dec 15, 2025
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, line 1171-1173 (link)

    logic: Add runtime architecture check before dispatching to tuned kernel:

    The tuned kernel requires sm_100+ but this dispatch lacks the is_supported_by_CC_100() check that other sm_100+ kernels use (see cast/dispatch/gated.cuh:48, cast/fp8/quantize_fp8.cuh:535). Without this check, the kernel will launch on non-Blackwell devices and hit NVTE_DEVICE_ERROR at runtime.

5 files reviewed, 1 comment


@Oleg-Goncharov (Collaborator Author)

/te-ci

pre-commit-ci bot and others added 3 commits January 16, 2026 20:36
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment

Comment on lines +1171 to +1174
if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
return;
}

Contributor:

logic: missing runtime check for sm_100+ before dispatching to tuned kernel. kernel uses Blackwell-specific instructions (TMA, mbarrier, cluster launch control) that will cause runtime errors on older GPUs. add is_supported_by_CC_100() check like other architecture-specific kernels.

Suggested change
- if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
-   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
-   return;
- }
+ if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
+   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
+   return;
+ }

Collaborator Author:

It is guarded by the FP4_TYPE_SUPPORTED macro, which triggers the runtime NVTE error

Contributor:

You're absolutely right! The check should be within the #if FP4_TYPE_SUPPORTED block to prevent the runtime error. The current code will fall through to the NVTE_ERROR when FP4_TYPE_SUPPORTED is false.

Suggested change
- if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
-   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
-   return;
- }
+ if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
+   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
+   return;
+ }

However, I need to check if is_supported_by_CC_100() is available in this context. Could you show me where this function is defined or what the correct architecture check should be for sm_100+?
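
For reference, a runtime guard of this kind typically queries the device's compute capability. The sketch below is an illustration only: the helper name device_is_sm100_or_newer is hypothetical, and this is not necessarily how Transformer Engine's is_supported_by_CC_100() is implemented.

#include <cuda_runtime.h>

// Hypothetical standalone check: returns true on devices with compute
// capability 10.x or newer (Blackwell reports CC 10.x).
inline bool device_is_sm100_or_newer() {
  int device = 0;
  if (cudaGetDevice(&device) != cudaSuccess) {
    return false;
  }
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
    return false;
  }
  return prop.major >= 10;
}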

Oleg-Goncharov and others added 14 commits January 16, 2026 20:40
…(numerical truncation). Tightened the numerical tolerance

…formant when inputs are large

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_nvfp4_persistent_kernel branch from 7362e47 to bc26160 on January 16, 2026 20:58

@Oleg-Goncharov (Collaborator Author)

/te-ci

- const double atol = 0.05;
- const double rtol = 0.1;
+ const double atol = 1.0E-6;
+ const double rtol = 1.0E-6;

Collaborator:

Why is this changing? It looks like it got stricter, so if it passes, that is better. But I don't understand why the tolerance was looser before.

constexpr int THREADS_X_ROWWISE = TILE_DIM_X / ELTS_PER_THREAD;
constexpr int THREADS_Y_ROWWISE = THREADS_NUM / THREADS_X_ROWWISE;

constexpr int THREADS_X_TRANSP = TILE_DIM_X / 2;

Collaborator:

Can we use one naming convention for the transpose-related constants? Earlier you used THREADS_X_T and BUFF_OUT_T_DIM_Y (so "_T"), here it is "_TRANSP", and elsewhere "_TR" (like BUFFS_NUM_OUT_TR). Just choose one.

Collaborator Author:

Thanks, I’m fine to standardize. Please consolidate any remaining feedback into a single final pass by EOD today so we can merge without another round of incremental comments. This PR has been open since late November and was originally requested as time-sensitive.

inline void quantize_transpose_tuned_1D(const Tensor &input, const Tensor *noop, Tensor *output,
const QuantizationConfig *quant_config,
cudaStream_t stream) {
#if FP4_TYPE_SUPPORTED

Collaborator:

But isn't the TunableConfig struct for that purpose? To make sure that these values are tuned separately from the common part of the two kernels, and we keep the common part in one place only. Then it would not be tight coupling.

@Oleg-Goncharov (Collaborator Author)

TunableConfig/coupling: we already aligned on this with Przemek; we're not moving these values into TunableConfig for this change. Let's keep the scope as agreed.

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Comment on lines +1171 to +1173
if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
return;

Contributor:

logic: add runtime check for sm_100+ before dispatching to tuned kernel

Suggested change
- if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
-   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
-   return;
+ if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
+   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
+   return;
+ }

The tuned kernel uses Blackwell-specific instructions (TMA, mbarrier, cluster launch control) that will cause runtime errors on older GPUs. FP4_TYPE_SUPPORTED is only a compile-time check for CUDA version, not device capability.

@Oleg-Goncharov (Collaborator Author)

/te-ci


Labels

2.12.0, enhancement (New feature or request), fp4, performance (Performance issues)
