Skip to content

Fix CUTLASS Blackwell FMHA register spills on DRIVE Thor (sm_110a)#3308

Open
pzhao-eng wants to merge 1 commit into
NVIDIA:mainfrom
pzhao-eng:fix/sm110a-reg-reconfig
Open

Fix CUTLASS Blackwell FMHA register spills on DRIVE Thor (sm_110a)#3308
pzhao-eng wants to merge 1 commit into
NVIDIA:mainfrom
pzhao-eng:fix/sm110a-reg-reconfig

Conversation

@pzhao-eng

@pzhao-eng pzhao-eng commented Jun 8, 2026

Copy link
Copy Markdown

TL;DR

The CUTLASS Blackwell SM100 FMHA forward kernel spills ~529 MB of the softmax score tile to local memory on Thor (sm_110a) because the per-warp register re-partitioning (setmaxnreg) is silently compiled out for __CUDA_ARCH__ == 1100. Adding the missing 1100 arch clause to CUTLASS's CUDA_CTA_RECONFIG_ACTIVATED gate re-enables setmaxnreg, eliminates the spills, and makes the kernel 1.74× faster (1.33 ms → 762 µs) with no other changes.

  • Change: 2 lines added to cutlass/include/cutlass/arch/reg_reconfig.h.
  • Impact: spill traffic 529 MB → 0; DRAM/L2 write 535 MB → 6 MB (just the output); SM throughput 42% → 65%.

Problem

I was working on profiling the CUTLASS FMHA forward kernel (example 77) on DRIVE Thor (Sm100FmhaFwdKernelTmaWarpspecialized, fp16, B=1 H=16 D=64 S=2984, TileShape 256×128×64, built for sm_110a) showed the kernel writing ~535 MB to memory for a 6 MB output.
image

Root cause

The kernel is warp-specialized: each warp group is meant to be granted a different physical register budget at runtime via the setmaxnreg PTX instruction (warpgroup_reg_set<NumRegsSoftmax=192> for the Softmax warps, <96> for Correction, <32> for the rest).

setmaxnreg is only emitted when CUDA_CTA_RECONFIG_ACTIVATED is defined, and its __CUDA_ARCH__ whitelist in reg_reconfig.h omits 1100 (Thor):

// cutlass/include/cutlass/arch/reg_reconfig.h  (before)
(__CUDA_ARCH__ ==  900 && defined(__CUDA_ARCH_FEAT_SM90_ALL))
|| (__CUDA_ARCH__ == 1000 && defined(__CUDA_ARCH_FEAT_SM100_ALL))
|| (__CUDA_ARCH__ == 1010 && defined(__CUDA_ARCH_FEAT_SM101_ALL))
|| (__CUDA_ARCH__ == 1030 && defined(__CUDA_ARCH_FEAT_SM103_ALL))
|| (__CUDA_ARCH__ == 1200 && defined(__CUDA_ARCH_FEAT_SM120_ALL))
|| (__CUDA_ARCH__ == 1210 && defined(__CUDA_ARCH_FEAT_SM121_ALL))
//   ^^ 1100 (sm_110 / Thor) missing

On sm_110a, warpgroup_reg_set<192> therefore expands to nothing (confirmed: zero setmaxnreg in the cubin). All 16 warps then share the single flat __launch_bounds__(512, 1) budget = 65,536 / 512 = 128 regs/thread. The Softmax warps need ~192 to hold the 128-element fp32 score row across the rowmax→exp2→rescale→fp16-convert pipeline; capped at 128 they spill the row to local memory — 7.75M times.

Fix

Add the 1100 clause to both CUDA_CTA_RECONFIG_ACTIVATED blocks, matching the
exact pattern CUTLASS already uses for Thor elsewhere
(grid_dependency_control.h, config.h):

 // cutlass/include/cutlass/arch/reg_reconfig.h
      || (__CUDA_ARCH__ == 1030 && defined(__CUDA_ARCH_FEAT_SM103_ALL))     \
+     || (__CUDA_ARCH__ == 1100 && defined(__CUDA_ARCH_FEAT_SM110_ALL))     \
      || (__CUDA_ARCH__ == 1200 && defined(__CUDA_ARCH_FEAT_SM120_ALL))     \
 ...
      || (__CUDA_ARCH__ == 1030 && CUDA_ARCH_FAMILY(1030))  \
+     || (__CUDA_ARCH__ == 1100 && CUDA_ARCH_FAMILY(1100))  \
      || (__CUDA_ARCH__ == 1200 && CUDA_ARCH_FAMILY(1200))  \

Results

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant