cuda: increase matmul rows/block from 8 to 16 by arpesenti · Pull Request #266 · antirez/ds4

arpesenti · 2026-05-27T15:57:43Z

CUDA: increase matmul rows/block from 8 to 16

Doubles the number of output rows processed per block in the warp-level Q8_0, HC-expand, grouped
matmul and MoE gate_up decode kernels, improving GPU occupancy on architectures with high CUDA core
counts. Prefill is unaffected.

Benchmark (DGX Spark GB10, q2, 128 gen tokens)

ctx (tokens)	before (t/s)	after (t/s)	Δ
2048	14.4	18.1	+25.7%
4096	14.9	19.1	+28.0%
6144	14.9	19.1	+28.0%
8192	14.7	18.7	+27.5%
16384	14.6	18.4	+26.7%
32768	13.5	16.8	+24.4%
40960	13.3	16.4	+23.3%
65536	12.5	15.3	+22.7%
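The Δ column is the relative change in decode throughput. A minimal sketch of that arithmetic follows; the function name `percent_change` is illustrative and not taken from the PR. Recomputing from the rounded table values may differ from the reported Δ by a few tenths of a percent, presumably because the reported figures come from unrounded measurements.

```c
#include <assert.h>

/* Hypothetical helper: relative change from `before` to `after`, in percent.
 * Both arguments are throughputs in tokens per second. */
static double percent_change(double before, double after) {
    assert(before > 0.0); /* a zero baseline has no meaningful relative change */
    return (after - before) / before * 100.0;
}
```

For the 2048-token row, `percent_change(14.4, 18.1)` evaluates to roughly 25.7, matching the table.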

Prefill unchanged across all frontiers (within ±1 t/s noise).

Changes

Q8_0/HC-expand/grouped matmul kernels: rows/block 8→16
MoE gate_up decode kernels: rows/block 128→256
Updated launch grid dimensions for all affected kernels
Fixed tests/cuda_long_context_smoke.c (missing comp_kv_f16 arg)
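Changing rows/block means the launch grid must shrink accordingly, since each block now covers twice as many output rows. The host-side calculation might look like the sketch below. It assumes a one-dimensional grid over output rows and a ceiling division so that a partial final block is still launched; the name `blocks_for_rows` is hypothetical, and the PR does not show how the actual kernels map rows to warps or threads.

```c
#include <assert.h>

/* Hypothetical helper, not from the ds4 sources: number of thread blocks
 * needed to cover n_rows output rows when each block handles rows_per_block
 * rows. Uses divide-plus-remainder rather than (n + r - 1) / r so that it
 * cannot overflow for large n_rows. */
static unsigned blocks_for_rows(unsigned n_rows, unsigned rows_per_block) {
    assert(rows_per_block > 0);
    return n_rows / rows_per_block + (n_rows % rows_per_block != 0);
}
```

Under these assumptions, a 4096-row matmul needs 512 blocks at 8 rows/block and 256 blocks at 16 rows/block. A kernel launched with the old grid size but the new rows/block would index past the end of the output, which is why the grid dimensions have to change together with the kernel constant.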

Verification

make cuda-spark — clean build
make cuda-regression — passed


gundemirbas · 2026-05-29T17:21:25Z

Any progress on merging this PR?

arpesenti wants to merge 1 commit into antirez:main from arpesenti:cuda/matmul-rows-per-block

3 participants
