[Common] Persistent Grouped NVFP4 quantization kernel #2743
Oleg-Goncharov wants to merge 53 commits into NVIDIA:main from
Conversation
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
```cpp
std::vector<fp4e2m1> out_data_rowwise_h(total_elts / 2);
std::vector<fp4e2m1> out_data_colwise_h(total_elts / 2);
std::vector<fp8e4m3> out_scales_rowwise_h(rowwise_scales_num);
std::vector<fp8e4m3> out_scales_colwise_h(colwise_scales_num);
```
Wrong variable used in "more mismatches" condition
`mismatch_messages` is only appended while `total_mismatches <= max_mismatches_to_print` (3), so its `size()` can never exceed `max_mismatches_to_print`. The condition is therefore always false, and the "... and X more mismatches" line is dead code, even inside the `print_detailed_summary` branch. The comparison should use `total_mismatches`:

```cpp
if (total_mismatches > max_mismatches_to_print) {
```
```cpp
cudaMemcpy(last_dims_d, last_dims_h.data(), num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice);
cudaMemcpy(offsets_d, offsets_h.data(), (num_tensors + 1) * sizeof(int64_t), cudaMemcpyHostToDevice);

cudaMemset(out_data_rowwise_d, 0, out_data_size);
cudaMemset(out_data_colwise_d, 0, out_data_size);
cudaMemset(out_scales_rowwise_d, 0, rowwise_scales_size);
cudaMemset(out_scales_colwise_d, 0, colwise_scales_size);

NVTEShape logical_shape_ = nvte_make_shape(logical_shape.data(), logical_shape.size());
```
CUDA API return values are not checked
All cudaMalloc, cudaMemcpy, and cudaMemset calls in performTest silently ignore their return values. A failed allocation would leave the pointer uninitialized (or null) and the test would proceed, likely crashing or producing a spurious cudaGetLastError failure that obscures the real problem.
Consider wrapping the calls with a helper that asserts success, e.g.:
```cpp
ASSERT_EQ(cudaMalloc((void**)&in_data_d, in_data_size), cudaSuccess);
ASSERT_EQ(cudaMemcpy(in_data_d, grouped_input.data(), in_data_size, cudaMemcpyHostToDevice), cudaSuccess);
```

This pattern applies to all CUDA API calls from the allocation block down through the `cudaMemset` calls (lines 388–410).
```cpp
 * \param[in] stream CUDA stream used for the operation.
 */
```
Binary-incompatible API change without a breaking-change marker
nvte_group_quantize_dbias (and the five related nvte_group_quantize_dbias_d* functions) previously accepted NVTETensor dbias; this PR changes the parameter to NVTEGroupedTensor dbias. Any existing C/C++ caller that was compiled against the old header will silently pass the wrong type at runtime. Even though both types are opaque pointers at the ABI level, callers that stored the dbias as NVTETensor will need to migrate.
The PR description marks this as "New feature (non-breaking change)", but this signature change will break downstream callers (Python bindings, external C++ users) that previously compiled against NVTETensor dbias. It is worth auditing all internal call sites (Python pybind layer, etc.) and explicitly documenting the migration in the PR / changelog.
Force-pushed from 97ec071 to fef9220
Resolved review threads on transformer_engine/common/cast/nvfp4/specialized/group_quantize_transpose_nvfp4_tuned_1D.cuh (one marked outdated).
Force-pushed from 2e289c9 to 9e37b4c
```cpp
const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
    (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
        ? (tensor_id * (tensor_rows / chunk_dim_Y))
        : (static_cast<size_t>(offsets_ptr[tensor_id]) / cols / chunk_dim_Y);

const size_t thread_id = blockIdx.x * blockDim.x + threadIdx.x;

if (thread_id * nvec >= cols) {
```
group_reduce_dbias_kernel uses last_logical_dim as cols for all shape representations
`cols` is unconditionally set to `last_logical_dim`:

```cpp
const size_t cols = last_logical_dim;
```

For VARYING_LAST_DIM and VARYING_BOTH_DIMS shapes, each tensor has a different last dimension. Using the scalar `last_logical_dim` for all tensors produces incorrect partial-dbias strides and wrong output write offsets (`tensor_id * cols` assumes uniform column counts). The same issue affects the `dbias_in_offset_Y` calculation for those shape representations.

`cast.h` documents that "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," but there is no runtime guard in `grouped_reduce_dbias` or this kernel to enforce that. If called with such shapes, the function silently corrupts memory. Consider adding an explicit `NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM && shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS, ...)` guard in `grouped_reduce_dbias` before the kernel launch.
Force-pushed from f6b5928 to 6a7409d
Force-pushed from 50a4921 to 6c5cc7f
Force-pushed from c83b558 to f5e2ba0
```diff
 void nvte_group_quantize_dbias(const NVTEGroupedTensor input, NVTEGroupedTensor output,
-                               NVTETensor dbias, NVTETensor workspace, cudaStream_t stream);
+                               NVTEGroupedTensor dbias, NVTETensor workspace, cudaStream_t stream);
```
The six nvte_group_quantize_dbias* functions (this one and nvte_group_quantize_dbias_dgelu, nvte_group_quantize_dbias_dsilu, nvte_group_quantize_dbias_drelu) now take NVTEGroupedTensor dbias instead of NVTETensor dbias. This is a signature change that may affect existing C/C++ callers compiled against the old header. While this appears intentional as part of the grouped tensor API consolidation, consider auditing Python bindings and any external C++ code to ensure compatibility, and explicitly document the migration path in the changelog.
```cpp
const bool use_single_work_grid = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS ||
                                   shape_rep == ShapeRepresentation::VARYING_FIRST_DIM);

const size_t first_logical_dim = input->logical_shape.data[0];
const size_t last_logical_dim = input->logical_shape.data[1];
const size_t elts_total = first_logical_dim * last_logical_dim;
const size_t num_tensors = input->num_tensors;

NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
           "Number of tensors in a group is larger than the MAX number of supported "
           "descriptors (64).");
```
For the SAME_BOTH_DIMS case, the kernel computes per-tensor row counts via integer division (first_logical_dim / num_tensors), which silently truncates if first_logical_dim is not exactly divisible by num_tensors. This causes incorrect base offsets and may skip or overwrite the last few rows. Add a host-side check:
```cpp
if (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS) {
  NVTE_CHECK(first_logical_dim % num_tensors == 0,
             "For SAME_BOTH_DIMS, first_logical_dim (", first_logical_dim,
             ") must be divisible by num_tensors (", num_tensors, ").");
}
```
Force-pushed from 811a146 to eace4a6
Resolved review threads on transformer_engine/common/cast/nvfp4/specialized/group_quantize_transpose_nvfp4_tuned_1D.cuh (one marked outdated).
```cpp
if (offset_bits % 8 != 0) {
  NVTE_DEVICE_ERROR("Data offset is not byte-aligned.");
}
```
decode_block silently produces wrong results for VARYING_BOTH_DIMS when per-tensor element count is not ELTS_PER_CHUNK-aligned
`block_id_in_current_tensor` is computed as:

```cpp
const size_t block_id_in_current_tensor = job.block_id - block.tensor_base / ELTS_PER_CHUNK;
```

`block.tensor_base = offsets_ptr[tensor_id]` is the cumulative element offset of this tensor. The integer division by `ELTS_PER_CHUNK` is exact only when every preceding tensor's element count (`rows_t * cols_t`) is a multiple of `ELTS_PER_CHUNK` (= `CHUNK_DIM_Y * CHUNK_DIM_X` = 16384). If that condition is violated, the division truncates and `block_id_in_current_tensor` (and therefore `block_id_Y` / `block_id_X`) is wrong, silently corrupting the output.

For VARYING_LAST_DIM this is guaranteed safe because `first_logical_dim % 128 == 0` is validated and `cols % 128 == 0` is required for TMA, so `rows_t * cols_t` is always a multiple of 128 * 128 = ELTS_PER_CHUNK. But for VARYING_BOTH_DIMS the host-side check is explicitly skipped:

```cpp
if (shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS) {
  NVTE_CHECK(first_logical_dim % 128 == 0, ...);
}
```

and there is no per-tensor row-dimension guard device-side either. A tensor with, e.g., rows=64, cols=256 (8192 elements = ELTS_PER_CHUNK / 2) would cause `offsets_ptr[tensor_id+1] / ELTS_PER_CHUNK` to truncate, breaking all subsequent block mappings.

Consider adding a host-side check for VARYING_BOTH_DIMS that each tensor satisfies `rows_t % 128 == 0` and `cols_t % 128 == 0`, or at minimum assert that each offset is ELTS_PER_CHUNK-aligned:

```cpp
for (size_t t = 0; t < num_tensors; ++t) {
  const size_t rows_t = first_dims_ptr[t];
  const size_t cols_t = last_dims_ptr[t];
  NVTE_CHECK(rows_t % CHUNK_DIM_Y == 0 && cols_t % CHUNK_DIM_X == 0,
             "VARYING_BOTH_DIMS: each tensor's rows and cols must be multiples of 128.");
}
```
```cpp
const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
```
group_reduce_dbias_kernel uses last_logical_dim as cols for all tensors, which is wrong for VARYING_LAST_DIM / VARYING_BOTH_DIMS
```cpp
const size_t cols = last_logical_dim;
```

For VARYING_LAST_DIM and VARYING_BOTH_DIMS each tensor has a distinct last dimension, but `cols` is set to the global `last_logical_dim` for all of them. This has two immediate consequences:

1. Wrong output offset: `thread_out_base = dbias_output + tensor_id * cols + ...` assumes all tensors have the same column count. For varying-last-dim shapes, the dbias output for tensor `t` should be written at the cumulative sum of all preceding tensors' column counts, not at `tensor_id * last_logical_dim`.
2. Wrong `dbias_in_offset_Y`: the calculation `offsets_ptr[tensor_id] / cols / chunk_dim_Y` uses the wrong stride for the same reason.

The comment in `cast.h` documents "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," but no runtime guard in `grouped_reduce_dbias` or this kernel enforces that. An explicit `NVTE_CHECK` should be added before the kernel launch to surface misuse early rather than silently corrupting memory:

```cpp
NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM &&
           shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS,
           "Grouped dbias is not yet supported for grouped tensors with a varying last dimension.");
```
```cpp
case ShapeRepresentation::SAME_BOTH_DIMS: {
  NVTE_CHECK(first_logical_dim % num_tensors == 0,
             "First logical dimension of a grouped tensor must be divisible by the number of "
             "tensors.");
  NVTE_CHECK((first_logical_dim / num_tensors) % 128 == 0,
             "First dimension of each tensor in a group must be divisible by 128.");
  break;
}
case ShapeRepresentation::VARYING_FIRST_DIM: {
  NVTE_CHECK(first_logical_dim % 128 == 0,
             "First logical dimension of a grouped tensor must be divisible by 128.");
  break;
}
case ShapeRepresentation::VARYING_LAST_DIM: {
  NVTE_CHECK(first_logical_dim % 128 == 0,
             "First logical dimension of a grouped tensor must be divisible by 128.");
```
VARYING_BOTH_DIMS host check validates total element count, not per-tensor alignment
The check `last_logical_dim % ELTS_PER_CHUNK == 0` passes as long as the sum of all per-tensor element counts is divisible by `ELTS_PER_CHUNK` (16384). However, `decode_block` requires that each individual tensor's element count (`rows_t * cols_t`) is a multiple of `ELTS_PER_CHUNK`; otherwise the integer division `block.tensor_base / ELTS_PER_CHUNK` truncates and produces wrong `block_id_Y` / `block_id_X` values, silently corrupting the output.

A valid-looking invocation like two tensors with sizes 8192 and 8192 (total = 16384, passes the check) would break `decode_block` because neither tensor is ELTS_PER_CHUNK-aligned.

Additionally, the error message says "Last logical dimension" but `last_logical_dim` for VARYING_BOTH_DIMS represents the total element count, not a spatial dimension, which is misleading to callers.

Since per-tensor dims are device-side, a practical fix is to add a per-tensor validation kernel (or assert inside `update_tma_descriptors`) and fix the error message:

```cpp
case ShapeRepresentation::VARYING_BOTH_DIMS: {
  // last_logical_dim == total elements across all tensors for this shape rep
  NVTE_CHECK(last_logical_dim % ELTS_PER_CHUNK == 0,
             "Total element count across all tensors (", last_logical_dim,
             ") must be divisible by ELTS_PER_CHUNK (", ELTS_PER_CHUNK, ")."
             " Each individual tensor must also satisfy rows_t*cols_t % ELTS_PER_CHUNK == 0.");
  break;
}
```
```cpp
__device__ __forceinline__ size_t get_tensor_cols_num(
    const size_t tensor_id, const ShapeRepresentation shape_rep, const size_t last_logical_dim,
    const int64_t *const __restrict__ last_dims_ptr) {
  size_t cols_num = 0;
  switch (shape_rep) {
    case ShapeRepresentation::SAME_BOTH_DIMS:
    case ShapeRepresentation::VARYING_FIRST_DIM:
      cols_num = last_logical_dim;
      break;
    case ShapeRepresentation::VARYING_LAST_DIM:
    case ShapeRepresentation::VARYING_BOTH_DIMS:
      cols_num = static_cast<size_t>(last_dims_ptr[tensor_id]);
      if (cols_num % 128 != 0) {
        NVTE_DEVICE_ERROR(
            "For non-single tensors, the last dimension of each tensor in a group "
            "must be divisible by 128.");
      }
      break;
  }
  return cols_num;
}

__device__ __forceinline__ size_t
get_tensor_base_offset(const size_t tensor_id, const ShapeRepresentation shape_rep,
```
decode_block contains a division that is only exact under unvalidated assumptions
For the non-`use_single_work_grid` path (VARYING_LAST_DIM / VARYING_BOTH_DIMS):

```cpp
const size_t block_id_in_current_tensor = job.block_id - block.tensor_base / ELTS_PER_CHUNK;
```

`block.tensor_base = offsets_ptr[tensor_id]` is the cumulative element count preceding this tensor. The division by `ELTS_PER_CHUNK` is exact only when every preceding tensor's element count is a multiple of `ELTS_PER_CHUNK` = 16384. For VARYING_LAST_DIM this is guaranteed because `first_logical_dim % 128 == 0` and `cols % 128 == 0` are both enforced, so `rows * cols` is always a multiple of 128 * 128 = ELTS_PER_CHUNK.

For VARYING_BOTH_DIMS, however, neither the host nor device enforces this per-tensor condition (the host only checks the total element sum). If any tensor has `rows_t * cols_t % ELTS_PER_CHUNK != 0`, the division truncates and `block_id_Y` / `block_id_X` are wrong for all subsequent tensors, silently writing quantized values to incorrect memory locations.

Adding a per-tensor validation in `update_tma_descriptors` (which already iterates tensors) would catch this early:

```cpp
// Inside update_tma_descriptors, for each tensor_id:
if (shape_rep == ShapeRepresentation::VARYING_BOTH_DIMS) {
  if ((rows * cols) % ELTS_PER_CHUNK != 0) {
    NVTE_DEVICE_ERROR("VARYING_BOTH_DIMS: each tensor's rows*cols must be divisible by ELTS_PER_CHUNK.");
  }
}
```
```cpp
const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
    (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
        ? (tensor_id * (tensor_rows / chunk_dim_Y))
        : (static_cast<size_t>(offsets_ptr[tensor_id]) / cols / chunk_dim_Y);

const size_t thread_id = blockIdx.x * blockDim.x + threadIdx.x;

if (thread_id * nvec >= cols) {
  return;
}

const float *const thread_in_base = dbias_partial + dbias_in_offset_Y * cols + thread_id * nvec;
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;

ComputeVec ldg_vec;
ComputeVec acc_vec;
acc_vec.clear();
for (int i = 0; i < rows; ++i) {
  ldg_vec.load_from(thread_in_base + i * cols);
#pragma unroll
  for (int e = 0; e < nvec; ++e) {
    acc_vec.data.elt[e] += ldg_vec.data.elt[e];
  }
}

OutputVec stg_vec;
#pragma unroll
for (int e = 0; e < nvec; ++e) {
```
group_reduce_dbias_kernel uses last_logical_dim as cols for all tensors regardless of shape
`cols` is unconditionally set to `last_logical_dim`:

```cpp
const size_t cols = last_logical_dim;
```

For VARYING_LAST_DIM and VARYING_BOTH_DIMS, each tensor has a distinct last dimension, but this kernel uses the same `cols` for all of them, producing two concrete errors:

1. Wrong output offset: `thread_out_base = dbias_output + tensor_id * cols + ...` assumes all tensors have identical column counts. For varying-last-dim shapes the dbias output for tensor `t` should be written at the cumulative sum of all preceding tensors' column counts.
2. Wrong `dbias_in_offset_Y`: `offsets_ptr[tensor_id] / cols / chunk_dim_Y` uses `cols = last_logical_dim` as the stride, which is incorrect when per-tensor column counts differ.

While `cast.h` documents "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," there is no runtime guard in `grouped_reduce_dbias` or this kernel to enforce that. A caller passing VARYING_LAST_DIM or VARYING_BOTH_DIMS shapes would silently corrupt memory. An explicit `NVTE_CHECK` should be added at the call site:

```cpp
NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM &&
           shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS,
           "Grouped dbias is not supported for tensors with a varying last dimension.");
```

```cpp
if (first_dims[t] % 128 != 0) {
  GTEST_SKIP();
}

if (shape_rep == VARYING_LAST_DIM || shape_rep == VARYING_BOTH_DIMS) {
  if (last_dims[t] % 128 != 0) {
    GTEST_SKIP();
  }
}
```
Missing GTEST_SKIP guard on last_dims[t] for SAME_BOTH_DIMS and VARYING_FIRST_DIM
The first-dim guard correctly matches the kernel's % 128 requirement:

```cpp
if (first_dims[t] % 128 != 0) {
  GTEST_SKIP();
}
```

and the VARYING_LAST_DIM / VARYING_BOTH_DIMS last-dim guard is aligned with the kernel's device-side error:

```cpp
if (shape_rep == VARYING_LAST_DIM || shape_rep == VARYING_BOTH_DIMS) {
  if (last_dims[t] % 128 != 0) {
    GTEST_SKIP();
  }
}
```

But for SAME_BOTH_DIMS and VARYING_FIRST_DIM, there is no check on `last_dims[t]`. The kernel for these shapes does not enforce any alignment on `cols`; however, `get_nvfp4_scale_stride` and the scale write loop use `chunk_cols / SCALE_DIM` (integer division), so if `cols % 16 != 0` a trailing partial scale block would go unwritten. Test configs like `{VARYING_FIRST_DIM, 3, 1024, 160, ...}` exercise a last_dim = 160, which happens to be 10 * 16, but the test guard does not verify this. Consider adding a % 16 check (or % 128 for stricter alignment) for SAME_BOTH_DIMS and VARYING_FIRST_DIM as well, to prevent tests from silently running with partially-initialized scale buffers:
```cpp
if (last_dims[t] % 16 != 0) {
  GTEST_SKIP();
}
```
Description
This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling.
It is built on top of PR #2738, [Common] Persistent Grouped MXFP8 quantization kernel.
Type of change
Changes
Checklist: