Fix CUTLASS Blackwell FMHA register spills on DRIVE Thor (sm_110a) by pzhao-eng · Pull Request #3308 · NVIDIA/cutlass

pzhao-eng · 2026-06-08T02:38:13Z

TL;DR

The CUTLASS Blackwell SM100 FMHA forward kernel spills ~529 MB of the softmax score tile to local memory on Thor (sm_110a) because the per-warp register re-partitioning (setmaxnreg) is silently compiled out for __CUDA_ARCH__ == 1100. Adding the missing 1100 arch clause to CUTLASS's CUDA_CTA_RECONFIG_ACTIVATED gate re-enables setmaxnreg, eliminates the spills, and makes the kernel 1.74× faster (1.33 ms → 762 µs) with no other changes.

Change: 2 lines added to cutlass/include/cutlass/arch/reg_reconfig.h.
Impact: spill traffic 529 MB → 0; DRAM/L2 write 535 MB → 6 MB (just the output); SM throughput 42% → 65%.

Problem

I was working on profiling the CUTLASS FMHA forward kernel (example 77) on DRIVE Thor (Sm100FmhaFwdKernelTmaWarpspecialized, fp16, B=1 H=16 D=64 S=2984, TileShape 256×128×64, built for sm_110a) showed the kernel writing ~535 MB to memory for a 6 MB output.

Root cause

The kernel is warp-specialized: each warp group is meant to be granted a different physical register budget at runtime via the setmaxnreg PTX instruction (warpgroup_reg_set<NumRegsSoftmax=192> for the Softmax warps, <96> for Correction, <32> for the rest).

setmaxnreg is only emitted when CUDA_CTA_RECONFIG_ACTIVATED is defined, and its __CUDA_ARCH__ whitelist in reg_reconfig.h omits 1100 (Thor):

// cutlass/include/cutlass/arch/reg_reconfig.h  (before)
(__CUDA_ARCH__ ==  900 && defined(__CUDA_ARCH_FEAT_SM90_ALL))
|| (__CUDA_ARCH__ == 1000 && defined(__CUDA_ARCH_FEAT_SM100_ALL))
|| (__CUDA_ARCH__ == 1010 && defined(__CUDA_ARCH_FEAT_SM101_ALL))
|| (__CUDA_ARCH__ == 1030 && defined(__CUDA_ARCH_FEAT_SM103_ALL))
|| (__CUDA_ARCH__ == 1200 && defined(__CUDA_ARCH_FEAT_SM120_ALL))
|| (__CUDA_ARCH__ == 1210 && defined(__CUDA_ARCH_FEAT_SM121_ALL))
//   ^^ 1100 (sm_110 / Thor) missing

On sm_110a, warpgroup_reg_set<192> therefore expands to nothing (confirmed: zero setmaxnreg in the cubin). All 16 warps then share the single flat __launch_bounds__(512, 1) budget = 65,536 / 512 = 128 regs/thread. The Softmax warps need ~192 to hold the 128-element fp32 score row across the rowmax→exp2→rescale→fp16-convert pipeline; capped at 128 they spill the row to local memory — 7.75M times.

Fix

Add the 1100 clause to both CUDA_CTA_RECONFIG_ACTIVATED blocks, matching the
exact pattern CUTLASS already uses for Thor elsewhere
(grid_dependency_control.h, config.h):

 // cutlass/include/cutlass/arch/reg_reconfig.h
      || (__CUDA_ARCH__ == 1030 && defined(__CUDA_ARCH_FEAT_SM103_ALL))     \
+     || (__CUDA_ARCH__ == 1100 && defined(__CUDA_ARCH_FEAT_SM110_ALL))     \
      || (__CUDA_ARCH__ == 1200 && defined(__CUDA_ARCH_FEAT_SM120_ALL))     \
 ...
      || (__CUDA_ARCH__ == 1030 && CUDA_ARCH_FAMILY(1030))  \
+     || (__CUDA_ARCH__ == 1100 && CUDA_ARCH_FAMILY(1100))  \
      || (__CUDA_ARCH__ == 1200 && CUDA_ARCH_FAMILY(1200))  \

Results

Enable setmaxnreg for sm_110a (CUDA_ARCH 1100)

b569cec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CUTLASS Blackwell FMHA register spills on DRIVE Thor (sm_110a)#3308

Fix CUTLASS Blackwell FMHA register spills on DRIVE Thor (sm_110a)#3308
pzhao-eng wants to merge 1 commit into
NVIDIA:mainfrom
pzhao-eng:fix/sm110a-reg-reconfig

pzhao-eng commented Jun 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pzhao-eng commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Problem

Root cause

Fix

Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pzhao-eng commented Jun 8, 2026 •

edited

Loading