
Conversation

@Oleg-Goncharov (Collaborator) commented Nov 21, 2025

Description

This PR introduces a specialized CUDA kernel for NVFP4 quantization of BF16 inputs, optimized for the Blackwell architecture (sm100f family). The implementation achieves its performance gains by leveraging architecture-specific features, reaching the following throughput:

RN (round-to-nearest): 6.4 TB/s (rowwise only: 7.2 TB/s)

SR (stochastic rounding): 4.5 TB/s (rowwise only: 7.0 TB/s)

Rowwise + Colwise (transpose)

NVFP4 kernel performance 3

Rowwise only

a) round-to-nearest

NVFP4 cast rowwise

b) stochastic rounding

NVFP4 cast rowwise + SR

Below are performance measurements for quantizing a tensor with dimensions representative of DSv3, [8192×8, 7168], on an internal cluster (B300).

NVFP4 cast DSv3
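As a rough sanity check on these numbers (assuming the quoted TB/s figures count total bytes read plus bytes written, and one FP8 scale per 16-element NVFP4 block): the [8192×8, 7168] BF16 input is 65536 × 7168 × 2 B ≈ 0.94 GB, and the rowwise + colwise NVFP4 outputs with their block scales add roughly another 0.53 GB, for about 1.47 GB of memory traffic per cast. At the 6.4 TB/s round-to-nearest figure above, that corresponds to roughly 230 µs per tensor.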

Using --fast-math can improve the performance of the kernel's stochastic rounding (RNG) path by up to ~10%.

Thread-to-data mapping (colwise case)

To reduce shared memory bank conflicts, the following mapping is used when reading from and writing to the shmem buffers:

  • A single thread processes 16x2 elements (2 NVFP4 blocks).
  • Cells of the same color belong to the same warp.
  • Thread indices and their offsets are computed as:
const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
const int tid_X_colwise = thread_lane;

const int thread_offset_Y_colwise = tid_Y_colwise * SCALE_DIM;
const int thread_offset_X_colwise = tid_X_colwise * 2;

where SCALE_DIM=16.
The arrows in the figure below illustrate how thread indices increment, forming a zigzag pattern.

a) Reads from SHMEM Input Buffer

Colwise reads NVFP4

b) Writes to SHMEM Output Transpose Buffer

Colwise writes NVFP4
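To make the indexing above concrete, the following small host-side sketch (illustration only; it assumes 4 warps of 32 threads each and SCALE_DIM = 16, matching the formulas above) prints the per-thread offsets and shows the staggered, zigzag assignment of threads to 16x2 element patches:

#include <cstdio>

int main() {
  constexpr int SCALE_DIM = 16;
  constexpr int WARPS = 4;      // assumed warp count, for illustration only
  constexpr int WARP_SIZE = 32;
  for (int warp = 0; warp < WARPS; ++warp) {
    for (int thread_lane = 0; thread_lane < WARP_SIZE; ++thread_lane) {
      // Same formulas as above: consecutive lanes rotate through the four
      // Y groups, so neighbouring threads touch different shmem rows.
      const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
      const int tid_X_colwise = thread_lane;
      const int thread_offset_Y_colwise = tid_Y_colwise * SCALE_DIM;
      const int thread_offset_X_colwise = tid_X_colwise * 2;
      // Each thread owns a 16x2 patch (two NVFP4 blocks) starting at
      // (thread_offset_Y_colwise, thread_offset_X_colwise).
      printf("warp %d lane %2d -> Y=%2d X=%2d\n", warp, thread_lane,
             thread_offset_Y_colwise, thread_offset_X_colwise);
    }
  }
  return 0;
}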

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added a specialized kernel
  • Added the logic to use it when the conditions are met

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

greptile-apps bot commented Nov 21, 2025

Greptile Summary

This PR introduces a specialized CUDA kernel for NVFP4 quantization of BF16 tensors on the Blackwell architecture (sm_100+), achieving significant performance improvements (6.4 TB/s for round-to-nearest, 4.5 TB/s for stochastic rounding).

Key changes:

  • Added tuned 1D kernel (quantize_transpose_nvfp4_tuned_1D.cuh) using Blackwell-specific features: TMA async copies, memory barriers, and cluster launch control
  • Added PTX helper functions for mbarrier operations and cluster management in ptx.cuh
  • Updated dispatcher to route BF16+1D quantization cases to the tuned kernel
  • Enhanced test reference implementation to match kernel behavior with use_fast_math flag (FP8 scale quantization and BF16 truncation)

Critical issue:

  • The dispatcher at line 1171 of quantize_transpose_nvfp4.cuh is missing a runtime architecture check before calling the tuned kernel. This will cause runtime failures on non-Blackwell GPUs when compiled with CUDA 12.8+. The FP4_TYPE_SUPPORTED macro only checks compile-time CUDA version, not device capability.

Other observations:

  • Test improvements include better numerical accuracy handling and clearer error reporting
  • Added align_smem_ptr_per_TMA_requirements helper for TMA alignment requirements
  • Kernel uses sophisticated shared memory management with multiple buffers and swizzling to reduce bank conflicts

Confidence Score: 3/5

  • This PR has a critical runtime architecture check missing that will cause failures on non-Blackwell GPUs
  • The implementation is technically sound and well-structured with proper compile-time guards in the kernel code. However, the dispatcher lacks a runtime check for sm_100+ before calling the Blackwell-specific tuned kernel. This means the code will fail on Ampere/Hopper GPUs when built with CUDA 12.8+. The issue was previously identified, but the developer's response incorrectly conflated compile-time FP4 support with runtime GPU capability. Adding an is_supported_by_CC_100() check in the dispatcher would resolve this and bring the score to 5.
  • Pay close attention to transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh line 1171 - the dispatch logic needs a runtime architecture check

Important Files Changed

Filename: transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh
Overview: Added dispatch logic to tuned 1D kernel for BF16 inputs without 2D quantization. Missing runtime architecture check could cause failures on pre-Blackwell GPUs.

Filename: transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh
Overview: New tuned kernel implementation for NVFP4 quantization. Uses Blackwell-specific TMA, mbarrier, and cluster launch control. Compile-time guards present but no runtime check in caller.

Sequence Diagram

sequenceDiagram
    participant User
    participant Dispatcher as quantize_transpose<br/>(nvfp4/quantize_transpose_nvfp4.cuh)
    participant TunedKernel as quantize_transpose_tuned_1D<br/>(specialized/...tuned_1D.cuh)
    participant GPU as GPU Kernel<br/>(Blackwell-specific)
    participant PTX as PTX Instructions<br/>(TMA, mbarrier, cluster)

    User->>Dispatcher: quantize_transpose(input, output, config)
    Dispatcher->>Dispatcher: Check: !use_2d_quantization &&<br/>input.dtype == BF16
    alt BF16 + 1D quantization
        Dispatcher->>TunedKernel: quantize_transpose_tuned_1D()
        TunedKernel->>TunedKernel: Validate inputs & setup
        TunedKernel->>TunedKernel: Create TMA tensor maps
        TunedKernel->>TunedKernel: Launch kernel with config<br/>(USE_STOCHASTIC_ROUNDING,<br/>USE_FAST_MATH, RETURN_TRANSPOSE)
        TunedKernel->>GPU: quantize_transpose_nvfp4_tuned_1D_kernel<<<grid, block>>>
        GPU->>PTX: TMA async copy (global->shared)
        GPU->>PTX: mbarrier init/arrive/wait
        GPU->>PTX: Quantize BF16->NVFP4 with scaling
        GPU->>PTX: Optional: cluster launch control
        GPU->>PTX: TMA async copy (shared->global)
        GPU-->>TunedKernel: Return
        TunedKernel-->>Dispatcher: Return
    else Other configurations
        Dispatcher->>Dispatcher: Use generic kernel path
    end
    Dispatcher-->>User: Return quantized output
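For readers unfamiliar with the async copy / barrier flow in the diagram, here is a minimal sketch of that pattern written with the portable libcudacxx cuda::barrier and cuda::memcpy_async API. This is an illustration only: the actual kernel drives TMA and mbarrier through the hand-written PTX helpers in ptx.cuh, and the tile size and quantization step below are placeholders.

#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda_bf16.h>

namespace cg = cooperative_groups;

template <int TILE_ELTS>
__global__ void cast_tile_sketch(const __nv_bfloat16 *in, unsigned char *out_fp4) {
  __shared__ alignas(128) __nv_bfloat16 tile[TILE_ELTS];
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  auto block = cg::this_thread_block();
  if (block.thread_rank() == 0) {
    init(&bar, block.size());  // one expected arrival per thread in the block
  }
  block.sync();

  // Stage one input tile into shared memory asynchronously
  // (the "TMA async copy (global->shared)" step in the diagram).
  cuda::memcpy_async(block, tile, in + blockIdx.x * TILE_ELTS,
                     sizeof(__nv_bfloat16) * TILE_ELTS, bar);

  // The "mbarrier arrive/wait" step: block until the copy has landed.
  bar.arrive_and_wait();

  // Per-thread BF16 -> NVFP4 quantization with block scaling would go here,
  // followed by an asynchronous store of the packed results to global memory.
  (void)out_fp4;
}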

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Oleg-Goncharov Oleg-Goncharov changed the title [Common] Persistent NVFP4 kernel [Common] Persistent NVFP4 cast + transpose kernel Nov 22, 2025
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_nvfp4_persistent_kernel branch from 445c870 to a7a0652 on November 22, 2025 01:34
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/dispatch/quantize.cuh, line 250 (link)

    logic: missing is_supported_by_CC_100() check. forward pass (line 104) has this check, but backward pass doesn't. without it, the persistent kernel can be dispatched on non-Blackwell hardware and hit NVTE_DEVICE_ERROR at runtime.

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@ptrendx ptrendx added the 2.11.0 label Nov 25, 2025
@Oleg-Goncharov Oleg-Goncharov added the performance and enhancement labels Dec 4, 2025
@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments


@Oleg-Goncharov Oleg-Goncharov changed the title [Common] Persistent NVFP4 cast + transpose kernel [Common] Tuned NVFP4 cast + transpose kernel Dec 8, 2025
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, line 1171-1174 (link)

    logic: Missing runtime architecture check before dispatching to persistent kernel. The tuned 1D kernel uses Blackwell-specific PTX instructions (mbarrier variants, cluster launch control, etc.) and requires sm_100+. Should guard with is_supported_by_CC_100() check similar to other TMA kernel dispatches in the codebase (see quantize_fp8.cuh:535, gated.cuh:48).

4 files reviewed, 1 comment


@ptrendx ptrendx added the fp4 label Dec 11, 2025
@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_nvfp4_persistent_kernel branch from 91312be to a38eeff on December 12, 2025 15:45
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, line 1171-1174 (link)

    logic: Missing runtime check for sm_100+ before dispatching to tuned kernel. The tuned kernel uses Blackwell-specific features (TMA, cluster launch control) that will fail on non-sm_100 devices.

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. tests/cpp/operator/test_cast_nvfp4_transpose.cu, line 125-126 (link)

    syntax: typo in comment: 'Numercial' should be 'Numerical'

5 files reviewed, 1 comment


@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Oleg-Goncharov Oleg-Goncharov changed the title [Common] Tuned NVFP4 cast + transpose kernel [Common] Tuned NVFP4 cast kernel Dec 15, 2025
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, line 1171-1173 (link)

    logic: Add runtime architecture check before dispatching to tuned kernel:

    The tuned kernel requires sm_100+ but this dispatch lacks the is_supported_by_CC_100() check that other sm_100+ kernels use (see cast/dispatch/gated.cuh:48, cast/fp8/quantize_fp8.cuh:535). Without this check, the kernel will launch on non-Blackwell devices and hit NVTE_DEVICE_ERROR at runtime.

5 files reviewed, 1 comment


@Oleg-Goncharov (Collaborator Author)

/te-ci

pre-commit-ci bot and others added 3 commits January 16, 2026 20:36
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment

Comment on lines +1171 to +1174
if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
return;
}

Contributor:

logic: missing runtime check for sm_100+ before dispatching to tuned kernel. kernel uses Blackwell-specific instructions (TMA, mbarrier, cluster launch control) that will cause runtime errors on older GPUs. add is_supported_by_CC_100() check like other architecture-specific kernels.

Suggested change
- if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
-   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
-   return;
- }
+ if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
+   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
+   return;
+ }

Collaborator Author:

It is guarded by the FP4_TYPE_SUPPORTED macro, which triggers the runtime NVTE error

Contributor:

You're absolutely right! The check should be within the #if FP4_TYPE_SUPPORTED block to prevent the runtime error. The current code will fall through to the NVTE_ERROR when FP4_TYPE_SUPPORTED is false.

Suggested change
- if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
-   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
-   return;
- }
+ if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
+   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
+   return;
+ }

However, I need to check if is_supported_by_CC_100() is available in this context. Could you show me where this function is defined or what the correct architecture check should be for sm_100+?
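
For reference, a runtime guard of this kind typically queries the device's compute capability. The sketch below is an illustration only: the helper name device_is_sm100_or_newer is hypothetical, and this is not necessarily how Transformer Engine's is_supported_by_CC_100() is implemented.

#include <cuda_runtime.h>

// Hypothetical standalone check: returns true on devices with compute
// capability 10.x or newer (Blackwell reports CC 10.x).
inline bool device_is_sm100_or_newer() {
  int device = 0;
  if (cudaGetDevice(&device) != cudaSuccess) {
    return false;
  }
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
    return false;
  }
  return prop.major >= 10;
}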

Oleg-Goncharov and others added 14 commits January 16, 2026 20:40
…(numerical truncation). Tightened the numerical tolerance

…formant when inputs are large

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_nvfp4_persistent_kernel branch from 7362e47 to bc26160 on January 16, 2026 20:58

@Oleg-Goncharov (Collaborator Author)

/te-ci

- const double atol = 0.05;
- const double rtol = 0.1;
+ const double atol = 1.0E-6;
+ const double rtol = 1.0E-6;

Collaborator:

Why is this changing? It looks like it got stricter, so if it passes, that is better. But I don't understand why the tolerance was looser before.

constexpr int THREADS_X_ROWWISE = TILE_DIM_X / ELTS_PER_THREAD;
constexpr int THREADS_Y_ROWWISE = THREADS_NUM / THREADS_X_ROWWISE;

constexpr int THREADS_X_TRANSP = TILE_DIM_X / 2;

Collaborator:

Can we use one naming convention for the transpose-related constants? Earlier you used THREADS_X_T and BUFF_OUT_T_DIM_Y (so "_T"), here it is "_TRANSP", and elsewhere "_TR" (like BUFFS_NUM_OUT_TR). Just choose one.

Collaborator Author:

Thanks, I’m fine to standardize. Please consolidate any remaining feedback into a single final pass by EOD today so we can merge without another round of incremental comments. This PR has been open since late November and was originally requested as time-sensitive.

inline void quantize_transpose_tuned_1D(const Tensor &input, const Tensor *noop, Tensor *output,
const QuantizationConfig *quant_config,
cudaStream_t stream) {
#if FP4_TYPE_SUPPORTED

Collaborator:

But isn't the TunableConfig struct for that purpose? To make sure that these values are tuned separately from the common part of the two kernels, and we keep the common part in one place only. Then it would not be tight coupling.

@Oleg-Goncharov (Collaborator Author)

TunableConfig/coupling: we already aligned on this with Przemek; we're not moving these values into TunableConfig for this change. Let's keep the scope as agreed.

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Comment on lines +1171 to +1173
if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
return;

Contributor:

logic: add runtime check for sm_100+ before dispatching to tuned kernel

Suggested change
- if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
-   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
-   return;
+ if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
+   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
+   return;
+ }

The tuned kernel uses Blackwell-specific instructions (TMA, mbarrier, cluster launch control) that will cause runtime errors on older GPUs. FP4_TYPE_SUPPORTED is only a compile-time check for CUDA version, not device capability.

@Oleg-Goncharov (Collaborator Author)

/te-ci


Labels

2.12.0, enhancement (New feature or request), fp4, performance (Performance issues)
