[Common] Persistent Grouped NVFP4 quantization kernel#2743

Open
Oleg-Goncharov wants to merge 53 commits into NVIDIA:main from Oleg-Goncharov:pr_persistent_grouped_nvfp4_kernel

Conversation


@Oleg-Goncharov Oleg-Goncharov commented Mar 6, 2026

Description

This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling.
It is built on top of PR #2738, [Common] Persistent Grouped MXFP8 quantization kernel.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added persistent grouped kernel
  • Added test suite

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Oleg-Goncharov and others added 30 commits February 27, 2026 15:53
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
std::vector<fp4e2m1> out_data_rowwise_h(total_elts / 2);
std::vector<fp4e2m1> out_data_colwise_h(total_elts / 2);
std::vector<fp8e4m3> out_scales_rowwise_h(rowwise_scales_num);
std::vector<fp8e4m3> out_scales_colwise_h(colwise_scales_num);

Wrong variable used in "more mismatches" condition

mismatch_messages is only appended while total_mismatches <= max_mismatches_to_print (3), so its size() can never exceed max_mismatches_to_print. The condition is therefore always false and the "... and X more mismatches" line is dead code — even inside the print_detailed_summary branch. The comparison should use total_mismatches:

Suggested change
- if (mismatch_messages.size() > max_mismatches_to_print) {
+ if (total_mismatches > max_mismatches_to_print) {
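
The corrected reporting flow can be sketched as below. This is a minimal, self-contained illustration of the pattern the review describes, not the actual test helper; `report_mismatches` and its inputs are hypothetical stand-ins.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Messages are collected only up to the cap, so the "more mismatches" line
// must key off the full mismatch count, never the capped message list size.
int report_mismatches(const std::vector<bool>& results, int max_to_print) {
  std::vector<std::string> messages;
  int total_mismatches = 0;
  for (size_t i = 0; i < results.size(); ++i) {
    if (!results[i]) {
      ++total_mismatches;
      if (total_mismatches <= max_to_print) {
        messages.push_back("Mismatch at index " + std::to_string(i));
      }
    }
  }
  for (const auto& m : messages) std::printf("%s\n", m.c_str());
  if (total_mismatches > max_to_print) {  // not messages.size(): that is capped
    std::printf("... and %d more mismatches\n", total_mismatches - max_to_print);
  }
  return total_mismatches;
}
```

With the original `mismatch_messages.size()` comparison, the final branch could never fire because the list never grows past the cap.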

Comment on lines +388 to +396
cudaMemcpy(last_dims_d, last_dims_h.data(), num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice);
cudaMemcpy(offsets_d, offsets_h.data(), (num_tensors + 1) * sizeof(int64_t), cudaMemcpyHostToDevice);

cudaMemset(out_data_rowwise_d, 0, out_data_size);
cudaMemset(out_data_colwise_d, 0, out_data_size);
cudaMemset(out_scales_rowwise_d, 0, rowwise_scales_size);
cudaMemset(out_scales_colwise_d, 0, colwise_scales_size);

NVTEShape logical_shape_ = nvte_make_shape(logical_shape.data(), logical_shape.size());

CUDA API return values are not checked

All cudaMalloc, cudaMemcpy, and cudaMemset calls in performTest silently ignore their return values. A failed allocation would leave the pointer uninitialized (or null) and the test would proceed, likely crashing or producing a spurious cudaGetLastError failure that obscures the real problem.

Consider wrapping the calls with a helper that asserts success, e.g.:

ASSERT_EQ(cudaMalloc((void**)&in_data_d, in_data_size), cudaSuccess);
ASSERT_EQ(cudaMemcpy(in_data_d, grouped_input.data(), in_data_size, cudaMemcpyHostToDevice), cudaSuccess);

This pattern applies to all CUDA API calls from the allocation block down through the cudaMemset calls (lines 388–410).
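
A throwing wrapper is another common shape for this. The sketch below uses plain `int` status codes so it is self-contained; in the actual test the status type would be `cudaError_t`, `kSuccess` would be `cudaSuccess`, and the message would come from `cudaGetErrorString`.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for cudaSuccess; illustrative only.
constexpr int kSuccess = 0;

// Fails fast with a descriptive error instead of letting a bad status
// propagate into a confusing downstream failure.
inline void check_status(int status, const char* what) {
  if (status != kSuccess) {
    throw std::runtime_error(std::string(what) + " failed with status " +
                             std::to_string(status));
  }
}
```

Usage would look like `check_status(cudaMalloc((void**)&in_data_d, in_data_size), "cudaMalloc")`, keeping each call site to one line.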

Comment on lines 163 to 164
* \param[in] stream CUDA stream used for the operation.
*/

Binary-incompatible API change without a breaking-change marker

nvte_group_quantize_dbias (and the five related nvte_group_quantize_dbias_d* functions) previously accepted NVTETensor dbias; this PR changes the parameter to NVTEGroupedTensor dbias. Any existing C/C++ caller that was compiled against the old header will silently pass the wrong type at runtime. Even though both types are opaque pointers at the ABI level, callers that stored the dbias as NVTETensor will need to migrate.

The PR description marks this as "New feature (non-breaking change)", but this signature change will break downstream callers (Python bindings, external C++ users) that previously compiled against NVTETensor dbias. It is worth auditing all internal call sites (Python pybind layer, etc.) and explicitly documenting the migration in the PR / changelog.
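
The reason the break is silent can be shown in a few lines. The typedefs below are illustrative stand-ins, not the real Transformer Engine declarations: because both handle types are opaque pointers with identical ABI, a caller compiled against the old header links and runs without any diagnostic.

```cpp
// Stand-in opaque handle types mimicking the NVTETensor / NVTEGroupedTensor
// pattern; the real headers declare different opaque structs the same way.
typedef struct FakeTensorOpaque* FakeNVTETensor;
typedef struct FakeGroupedTensorOpaque* FakeNVTEGroupedTensor;

// Both are pointer-sized, so at the ABI level the linker and loader cannot
// tell them apart; only a recompile against the new header catches the change.
static_assert(sizeof(FakeNVTETensor) == sizeof(FakeNVTEGroupedTensor),
              "identical ABI: a stale caller passes the wrong handle silently");
```

This is why an audit of call sites (plus a changelog note) is the only reliable migration path for an opaque-handle signature change.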

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 97ec071 to fef9220 Compare March 9, 2026 11:55
@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 2e289c9 to 9e37b4c Compare March 9, 2026 14:51
Comment on lines +104 to +115

const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
(shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
? (tensor_id * (tensor_rows / chunk_dim_Y))
: (static_cast<size_t>(offsets_ptr[tensor_id]) / cols / chunk_dim_Y);

const size_t thread_id = blockIdx.x * blockDim.x + threadIdx.x;

if (thread_id * nvec >= cols) {

group_reduce_dbias_kernel uses last_logical_dim as cols for all shape representations

cols is unconditionally set to last_logical_dim:

const size_t cols = last_logical_dim;

For VARYING_LAST_DIM and VARYING_BOTH_DIMS shapes, each tensor has a different last dimension. Using the scalar last_logical_dim for all tensors will produce incorrect partial-dbias strides and wrong output write offsets (tensor_id * cols assumes uniform column counts). The same issue affects the dbias_in_offset_Y calculation for those shape representations.

cast.h documents that "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," but there is no runtime guard in grouped_reduce_dbias or this kernel to enforce that. If called with such shapes the function silently corrupts memory. Consider adding an explicit NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM && shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS, ...) guard in grouped_reduce_dbias before the kernel launch.

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from f6b5928 to 6a7409d Compare March 9, 2026 15:05
@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 50a4921 to 6c5cc7f Compare March 9, 2026 16:05
@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from c83b558 to f5e2ba0 Compare March 9, 2026 16:06
Comment on lines 175 to +176
void nvte_group_quantize_dbias(const NVTEGroupedTensor input, NVTEGroupedTensor output,
NVTETensor dbias, NVTETensor workspace, cudaStream_t stream);
NVTEGroupedTensor dbias, NVTETensor workspace, cudaStream_t stream);

The six nvte_group_quantize_dbias* functions (this one and nvte_group_quantize_dbias_dgelu, nvte_group_quantize_dbias_dsilu, nvte_group_quantize_dbias_drelu) now take NVTEGroupedTensor dbias instead of NVTETensor dbias. This is a signature change that may affect existing C/C++ callers compiled against the old header. While this appears intentional as part of the grouped tensor API consolidation, consider auditing Python bindings and any external C++ code to ensure compatibility, and explicitly document the migration path in the changelog.

Comment on lines +1044 to +1055

const bool use_single_work_grid = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS ||
shape_rep == ShapeRepresentation::VARYING_FIRST_DIM);

const size_t first_logical_dim = input->logical_shape.data[0];
const size_t last_logical_dim = input->logical_shape.data[1];
const size_t elts_total = first_logical_dim * last_logical_dim;
const size_t num_tensors = input->num_tensors;

NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
"Number of tensors in a group is larger than the MAX number of supported "
"descriptors (64).");

For the SAME_BOTH_DIMS case, the kernel computes per-tensor row counts via integer division (first_logical_dim / num_tensors), which silently truncates if first_logical_dim is not exactly divisible by num_tensors. This causes incorrect base offsets and may skip or overwrite the last few rows. Add a host-side check:

if (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS) {
    NVTE_CHECK(first_logical_dim % num_tensors == 0,
               "For SAME_BOTH_DIMS, first_logical_dim (", first_logical_dim,
               ") must be divisible by num_tensors (", num_tensors, ").");
}

@Oleg-Goncharov Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 811a146 to eace4a6 Compare March 9, 2026 17:59
Comment on lines +508 to +510
if (offset_bits % 8 != 0) {
NVTE_DEVICE_ERROR("Data offset is not byte-aligned.");
}

decode_block silently produces wrong results for VARYING_BOTH_DIMS when per-tensor element count is not ELTS_PER_CHUNK-aligned

block_id_in_current_tensor is computed as:

const size_t block_id_in_current_tensor = job.block_id - block.tensor_base / ELTS_PER_CHUNK;

block.tensor_base = offsets_ptr[tensor_id] is the cumulative element offset of this tensor. The integer division /ELTS_PER_CHUNK is exact only when every preceding tensor's element count (rows_t * cols_t) is a multiple of ELTS_PER_CHUNK (= CHUNK_DIM_Y * CHUNK_DIM_X = 16384). If that condition is violated, the division truncates and block_id_in_current_tensor (and therefore block_id_Y / block_id_X) is wrong, silently corrupting the output.

For VARYING_LAST_DIM this is guaranteed safe because first_logical_dim % 128 == 0 is validated and cols % 128 == 0 is required for TMA, so rows_t * cols_t ≥ 128 * 128 = ELTS_PER_CHUNK and is always a multiple. But for VARYING_BOTH_DIMS the host-side check is explicitly skipped:

if (shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS) {
    NVTE_CHECK(first_logical_dim % 128 == 0, ...);
}

…and there is no per-tensor row-dimension guard device-side either. A tensor with, e.g., rows=64, cols=256 (8192 elements = ELTS_PER_CHUNK / 2) would cause offsets_ptr[tensor_id+1] / ELTS_PER_CHUNK to truncate, breaking all subsequent block mappings.

Consider adding a host-side check for VARYING_BOTH_DIMS that each tensor satisfies rows_t % 128 == 0 and cols_t % 128 == 0, or at minimum assert that each offset is ELTS_PER_CHUNK-aligned:

for (size_t t = 0; t < num_tensors; ++t) {
    const size_t rows_t = first_dims_ptr[t];
    const size_t cols_t = last_dims_ptr[t];
    NVTE_CHECK(rows_t % CHUNK_DIM_Y == 0 && cols_t % CHUNK_DIM_X == 0,
               "VARYING_BOTH_DIMS: each tensor's rows and cols must be multiples of 128.");
}


Oleg-Goncharov and others added 2 commits March 10, 2026 15:12
Comment on lines +1057 to +1072
case ShapeRepresentation::SAME_BOTH_DIMS: {
NVTE_CHECK(first_logical_dim % num_tensors == 0,
"First logical dimension of a grouped tensor must be divisible by the number of "
"tensors.");
NVTE_CHECK((first_logical_dim / num_tensors) % 128 == 0,
"First dimension of each tensor in a group must be divisible by 128.");
break;
}
case ShapeRepresentation::VARYING_FIRST_DIM: {
NVTE_CHECK(first_logical_dim % 128 == 0,
"First logical dimension of a grouped tensor must be divisible by 128.");
break;
}
case ShapeRepresentation::VARYING_LAST_DIM: {
NVTE_CHECK(first_logical_dim % 128 == 0,
"First logical dimension of a grouped tensor must be divisible by 128.");

VARYING_BOTH_DIMS host check validates total element count, not per-tensor alignment

The check last_logical_dim % ELTS_PER_CHUNK == 0 passes as long as the sum of all per-tensor element counts is divisible by ELTS_PER_CHUNK (16 384). However, decode_block requires that each individual tensor's element count (rows_t * cols_t) is a multiple of ELTS_PER_CHUNK — otherwise the integer division block.tensor_base / ELTS_PER_CHUNK truncates and produces wrong block_id_Y / block_id_X values, silently corrupting the output.

A valid-looking invocation like two tensors with sizes 8192 and 8192 (total = 16 384, passes the check) would break decode_block because neither tensor is ELTS_PER_CHUNK-aligned.

Additionally, the error message says "Last logical dimension" but last_logical_dim for VARYING_BOTH_DIMS represents the total element count, not a spatial dimension — this is misleading to callers.

Since per-tensor dims are device-side, a practical fix is to add a per-tensor validation kernel (or assert inside update_tma_descriptors) and fix the error message:

case ShapeRepresentation::VARYING_BOTH_DIMS: {
    // last_logical_dim == total elements across all tensors for this shape rep
    NVTE_CHECK(last_logical_dim % ELTS_PER_CHUNK == 0,
               "Total element count across all tensors (", last_logical_dim,
               ") must be divisible by ELTS_PER_CHUNK (", ELTS_PER_CHUNK, ")."
               " Each individual tensor must also satisfy rows_t*cols_t % ELTS_PER_CHUNK == 0.");
    break;
}
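
The truncation failure mode can be reproduced numerically. The sketch below mirrors the review's constants (ELTS_PER_CHUNK = 16384) with illustrative tensor sizes; `local_block_id` is a simplified stand-in for the `decode_block` computation, not the kernel's actual code.

```cpp
#include <cstddef>

constexpr size_t ELTS_PER_CHUNK = 16384;  // CHUNK_DIM_Y * CHUNK_DIM_X

// Simplified form of: job.block_id - block.tensor_base / ELTS_PER_CHUNK.
// Exact only when tensor_base_elts is a multiple of ELTS_PER_CHUNK;
// otherwise the integer division truncates.
size_t local_block_id(size_t global_block_id, size_t tensor_base_elts) {
  return global_block_id - tensor_base_elts / ELTS_PER_CHUNK;
}
```

With two 8192-element tensors, the total (16384) passes the host-side divisibility check, yet tensor 1's base offset of 8192 truncates to chunk 0, so both tensors resolve to the same local block and alias each other's output.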

Comment on lines +390 to +415
}

__device__ __forceinline__ size_t get_tensor_cols_num(
const size_t tensor_id, const ShapeRepresentation shape_rep, const size_t last_logical_dim,
const int64_t *const __restrict__ last_dims_ptr) {
size_t cols_num = 0;
switch (shape_rep) {
case ShapeRepresentation::SAME_BOTH_DIMS:
case ShapeRepresentation::VARYING_FIRST_DIM:
cols_num = last_logical_dim;
break;
case ShapeRepresentation::VARYING_LAST_DIM:
case ShapeRepresentation::VARYING_BOTH_DIMS:
cols_num = static_cast<size_t>(last_dims_ptr[tensor_id]);
if (cols_num % 128 != 0) {
NVTE_DEVICE_ERROR(
"For non-single tensors, the last dimension of each tensor in a group "
"must be divisible by 128.");
}
break;
}
return cols_num;
}

__device__ __forceinline__ size_t
get_tensor_base_offset(const size_t tensor_id, const ShapeRepresentation shape_rep,

decode_block contains a division that is only exact under unvalidated assumptions

For the non-use_single_work_grid path (VARYING_LAST_DIM / VARYING_BOTH_DIMS):

const size_t block_id_in_current_tensor = job.block_id - block.tensor_base / ELTS_PER_CHUNK;

block.tensor_base = offsets_ptr[tensor_id] is the cumulative element count preceding this tensor. The division /ELTS_PER_CHUNK is exact only when every preceding tensor's element count is a multiple of ELTS_PER_CHUNK = 16384. For VARYING_LAST_DIM this is guaranteed because first_logical_dim % 128 == 0 and cols % 128 == 0 are both enforced, so rows * cols >= 128*128 = ELTS_PER_CHUNK and the product is always a multiple.

For VARYING_BOTH_DIMS, however, neither the host nor device enforces this per-tensor condition (the host only checks the total element sum). If any tensor has rows_t * cols_t % ELTS_PER_CHUNK != 0, the division truncates and block_id_Y / block_id_X are wrong for all subsequent tensors, silently writing quantized values to incorrect memory locations.

Adding a per-tensor validation in update_tma_descriptors (which already iterates tensors) would catch this early:

// Inside update_tma_descriptors, for each tensor_id:
if (shape_rep == ShapeRepresentation::VARYING_BOTH_DIMS) {
    if ((rows * cols) % ELTS_PER_CHUNK != 0) {
        NVTE_DEVICE_ERROR("VARYING_BOTH_DIMS: each tensor's rows*cols must be divisible by ELTS_PER_CHUNK.");
    }
}

Comment on lines +104 to +135

const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
(shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
? (tensor_id * (tensor_rows / chunk_dim_Y))
: (static_cast<size_t>(offsets_ptr[tensor_id]) / cols / chunk_dim_Y);

const size_t thread_id = blockIdx.x * blockDim.x + threadIdx.x;

if (thread_id * nvec >= cols) {
return;
}

const float *const thread_in_base = dbias_partial + dbias_in_offset_Y * cols + thread_id * nvec;
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;

ComputeVec ldg_vec;
ComputeVec acc_vec;
acc_vec.clear();
for (int i = 0; i < rows; ++i) {
ldg_vec.load_from(thread_in_base + i * cols);
#pragma unroll
for (int e = 0; e < nvec; ++e) {
acc_vec.data.elt[e] += ldg_vec.data.elt[e];
}
}

OutputVec stg_vec;
#pragma unroll
for (int e = 0; e < nvec; ++e) {

group_reduce_dbias_kernel uses last_logical_dim as cols for all tensors regardless of shape

cols is unconditionally set to last_logical_dim:

const size_t cols = last_logical_dim;

For VARYING_LAST_DIM and VARYING_BOTH_DIMS, each tensor has a distinct last dimension, but this kernel uses the same cols for all of them, producing two concrete errors:

  1. Wrong output offset: thread_out_base = dbias_output + tensor_id * cols + ... assumes all tensors have identical column counts. For varying-last-dim shapes the dbias output for tensor t should be written at the cumulative sum of all preceding tensors' column counts.

  2. Wrong dbias_in_offset_Y: offsets_ptr[tensor_id] / cols / chunk_dim_Y uses cols = last_logical_dim as the stride, which is incorrect when per-tensor column counts differ.

While cast.h documents "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," there is no runtime guard in grouped_reduce_dbias or this kernel to enforce that. A caller passing VARYING_LAST_DIM or VARYING_BOTH_DIMS shapes would silently corrupt memory. An explicit NVTE_CHECK should be added at the call site:

NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM &&
           shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS,
           "Grouped dbias is not supported for tensors with a varying last dimension.");

Comment on lines +619 to +627
if (first_dims[t] % 128 != 0) {
GTEST_SKIP();
}

if (shape_rep == VARYING_LAST_DIM || shape_rep == VARYING_BOTH_DIMS) {
if (last_dims[t] % 128 != 0) {
GTEST_SKIP();
}
}

Missing last-dim alignment guard for SAME_BOTH_DIMS and VARYING_FIRST_DIM

The VARYING_LAST_DIM / VARYING_BOTH_DIMS guard correctly mirrors the kernel's device-side % 128 check:

if (shape_rep == VARYING_LAST_DIM || shape_rep == VARYING_BOTH_DIMS) {
    if (last_dims[t] % 128 != 0) {
        GTEST_SKIP();
    }
}

But for SAME_BOTH_DIMS and VARYING_FIRST_DIM, there is no check on last_dims[t]. The kernel for these shapes does not enforce any alignment on cols, however get_nvfp4_scale_stride and the scale write loop use chunk_cols / SCALE_DIM (integer division), so if cols % 16 != 0 a trailing partial scale block would go unwritten. Test configs like {VARYING_FIRST_DIM, 3, 1024, 160, ...} exercise a last_dim = 160 which happens to be 10 * 16 — but the test guard does not verify this. Consider adding a % 16 check (or % 128 for stricter alignment) for SAME_BOTH_DIMS and VARYING_FIRST_DIM as well, to prevent tests from silently running with partially-initialized scale buffers:

if (last_dims[t] % 16 != 0) {
    GTEST_SKIP();
}
