[Core] MXFP8 grouped GEMM + tensor-scaled FP8 fixes#2748
jberchtold-nvidia wants to merge 12 commits into NVIDIA:main from
Conversation
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary

This PR adds MXFP8 grouped GEMM support to the cuBLAS-LT backend and fixes tensor-scaled FP8 grouped GEMM scale pointer handling. The core change introduces per-tensor E8M0 scale pointer arrays in the GPU setup kernel and proper cuBLAS scale configuration.

Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant make_mxfp8_operand
    participant nvte_quantize
    participant nvte_swizzle_scaling_factors
    participant build_grouped_tensor
    participant nvte_grouped_gemm
    participant setup_grouped_gemm_kernel
    participant cublasLtMatmul

    User->>make_mxfp8_operand: BF16 tensor + shape + (is_A, transposed)
    make_mxfp8_operand->>nvte_quantize: BF16 → MXFP8 (rowwise or columnwise)
    make_mxfp8_operand->>nvte_swizzle_scaling_factors: swizzle E8M0 scales for GEMM
    make_mxfp8_operand-->>User: mxfp8_swizzled tensor
    User->>build_grouped_tensor: [mxfp8_swizzled tensors], NVTE_MXFP8_1D_SCALING
    build_grouped_tensor->>build_grouped_tensor: gather_scales() — pack E8M0 per-tensor scales contiguously
    build_grouped_tensor->>build_grouped_tensor: set use_random_padding=false (offsets = sum of numel, no gaps)
    build_grouped_tensor->>build_grouped_tensor: nvte_set_grouped_tensor_swizzled_scales(h, 1)
    build_grouped_tensor-->>User: GroupedBuffers (data + scale_inv + columnwise_scale_inv)
    User->>nvte_grouped_gemm: GroupedTensor A, B, C, D
    nvte_grouped_gemm->>nvte_grouped_gemm: select_grouped_operand — pick rowwise/columnwise per A/B transpose
    nvte_grouped_gemm->>nvte_grouped_gemm: GroupedGemmSetupWorkspace::from_buffers (16-byte aligned ptr arrays)
    nvte_grouped_gemm->>setup_grouped_gemm_kernel: launch (a_mxfp8_scale_base, b_mxfp8_scale_base)
    setup_grouped_gemm_kernel->>setup_grouped_gemm_kernel: a_scale_inv_ptrs[i] = base + a_offset/32
    setup_grouped_gemm_kernel-->>nvte_grouped_gemm: per-tensor A/B/C/D/alpha/beta/scale pointers filled
    nvte_grouped_gemm->>nvte_grouped_gemm: set_fp8_scale_pointers — CUBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0
    nvte_grouped_gemm->>cublasLtMatmul: execute grouped GEMM
    cublasLtMatmul-->>User: output D tensors
```
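The scale-pointer arithmetic in the setup kernel (`a_scale_inv_ptrs[i] = base + a_offset/32`) can be modeled in a few lines of host code. This is an illustrative sketch only, not the actual CUDA kernel: the function and parameter names (`fill_scale_ptrs`, `scale_base`, `data_offsets`) are assumptions. The one fixed fact it encodes is that with MXFP8 1D scaling one E8M0 scale byte covers a block of 32 data elements, so a tensor starting at data offset `off` has its scales at `scale_base + off / 32`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of the setup kernel's per-tensor scale-pointer math.
// One E8M0 scale byte covers 32 MXFP8 data elements, so a tensor whose
// packed data starts at `off` has its scales at scale_base + off / 32.
// All names here are hypothetical; the real code runs inside a GPU kernel.
std::vector<const std::uint8_t*> fill_scale_ptrs(
    const std::uint8_t* scale_base, const std::vector<std::size_t>& data_offsets) {
  std::vector<const std::uint8_t*> ptrs;
  ptrs.reserve(data_offsets.size());
  for (std::size_t off : data_offsets) {
    // Offsets must land on a 32-element scaling-block boundary.
    assert(off % 32 == 0 && "MXFP8 data offset must be 32-element aligned");
    ptrs.push_back(scale_base + off / 32);
  }
  return ptrs;
}
```

With contiguous packing (no padding between member tensors), the offsets are simply running sums of the element counts, and each scale pointer advances by 1/32 of that.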
/te-ci
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
/te-ci
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
/te-ci
```cpp
for (size_t i = 0; i < num_tensors; ++i) {
  tensors[i]->to_cpu();
  NVTE_CHECK_CUDA(cudaGetLastError());
  void* dst = static_cast<char*>(buffer.get()) + scale_offsets[i];
  const void* src = get_cpu_ptr_fn(tensors[i]);
  NVTE_CHECK_CUDA(cudaMemcpy(dst, src, numels[i], cudaMemcpyHostToDevice));
}
```
Redundant CPU sync for swizzled MXFP8 scales.
The loop calls tensors[i]->to_cpu() on line 1263, then immediately passes the tensor to get_cpu_ptr_fn(tensors[i]) on line 1267. However, both rowwise_cpu_scale_inv_ptr<uint8_t>() and columnwise_cpu_scale_inv_ptr<uint8_t>() internally call to_cpu() themselves (test_common.h lines 249 and 264), making the explicit call on line 1263 redundant.
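The redundancy can be seen with a toy model of the accessor pattern. Everything below is hypothetical (`ToyTensor`, `cpu_scale_inv_ptr`, `sync_count` are stand-ins, not Transformer Engine APIs); it only illustrates that when the accessor itself performs the device-to-host sync, an explicit `to_cpu()` beforehand does the same transfer twice.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the flagged pattern (all names hypothetical). The accessor
// syncs internally, so calling to_cpu() first repeats the D2H transfer.
struct ToyTensor {
  int sync_count = 0;           // counts simulated device-to-host copies
  std::vector<std::uint8_t> host_copy;

  void to_cpu() {               // stands in for the real cudaMemcpy D2H path
    ++sync_count;
    host_copy = {1, 2, 3};
  }

  const std::uint8_t* cpu_scale_inv_ptr() {
    to_cpu();                   // accessor syncs on its own, like the real one
    return host_copy.data();
  }
};
```

In the test loop above, dropping the explicit `to_cpu()` call would halve the simulated transfers with no change in the bytes copied.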
Additionally, the GPU pointers are available directly via get_rowwise_scale_inv().data_ptr and get_columnwise_scale_inv().data_ptr, allowing a device-to-device copy that avoids the round-trip entirely:
```diff
     for (size_t i = 0; i < num_tensors; ++i) {
-      tensors[i]->to_cpu();
-      NVTE_CHECK_CUDA(cudaGetLastError());
       void* dst = static_cast<char*>(buffer.get()) + scale_offsets[i];
-      const void* src = get_cpu_ptr_fn(tensors[i]);
-      NVTE_CHECK_CUDA(cudaMemcpy(dst, src, numels[i], cudaMemcpyHostToDevice));
+      NVTE_CHECK_CUDA(cudaMemcpy(dst,
+                                 has_rowwise ? tensors[i]->tensor_.get_rowwise_scale_inv().data_ptr
+                                             : tensors[i]->tensor_.get_columnwise_scale_inv().data_ptr,
+                                 numels[i],
+                                 cudaMemcpyDeviceToDevice));
     }
```
This improves both clarity and efficiency in test code.
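The destination offsets used by this copy loop come from the contiguous packing described earlier (`use_random_padding=false`, offsets = sum of numel, no gaps). That prefix-sum computation can be sketched in isolation; the function name `contiguous_offsets` is an assumption, not the repository's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of contiguous offset computation: each tensor's
// offset is the running sum of the preceding tensors' element counts,
// so packed regions abut with no gaps (use_random_padding = false).
std::vector<std::size_t> contiguous_offsets(const std::vector<std::size_t>& numels) {
  std::vector<std::size_t> offsets(numels.size());
  std::size_t running = 0;
  for (std::size_t i = 0; i < numels.size(); ++i) {
    offsets[i] = running;   // where tensor i's packed region begins
    running += numels[i];
  }
  return offsets;
}
```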
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Add documentation for scaling factors in common.h
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
/te-ci
/te-ci
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
/te-ci
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: