cuda: increase matmul rows/block from 8 to 16 #266

Open: arpesenti wants to merge 1 commit into antirez:main from arpesenti:cuda/matmul-rows-per-block

@arpesenti

CUDA: increase matmul rows/block from 8 to 16

Doubles the number of output rows processed per block in the warp-level Q8_0, HC-expand, grouped
matmul and MoE gate_up decode kernels, improving GPU occupancy on architectures with high CUDA core
counts. Prefill is unaffected.

Benchmark (DGX Spark GB10, q2, 128 gen tokens)

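| ctx   | before (t/s) | after (t/s) | Δ      |
|-------|--------------|-------------|--------|
| 2048  | 14.4         | 18.1        | +25.7% |
| 4096  | 14.9         | 19.1        | +28.0% |
| 6144  | 14.9         | 19.1        | +28.0% |
| 8192  | 14.7         | 18.7        | +27.5% |
| 16384 | 14.6         | 18.4        | +26.7% |
| 32768 | 13.5         | 16.8        | +24.4% |
| 40960 | 13.3         | 16.4        | +23.3% |
| 65536 | 12.5         | 15.3        | +22.7% |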

Prefill unchanged across all frontiers (within ±1 t/s noise).
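
For readers who want to check the Δ column, here is a minimal sketch of the relative-change arithmetic. It assumes Δ is the change relative to the "before" throughput. `speedup_pct` is a hypothetical helper name and is not part of this PR.

```c
#include <assert.h>

/* Hypothetical helper (not from the PR): relative change in percent
 * when throughput goes from `before` to `after` tokens per second. */
static double speedup_pct(double before, double after) {
    assert(before > 0.0);  /* a zero baseline has no defined relative change */
    return (after - before) / before * 100.0;
}
```

For the 2048 ctx row, `speedup_pct(14.4, 18.1)` gives roughly 25.7. Recomputing other rows from the rounded table values can differ from the listed Δ by a few tenths of a percent, presumably because the listed figures were derived from unrounded measurements.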

Changes

  • Q8_0/HC-expand/grouped matmul kernels: rows/block 8→16
  • MoE gate_up decode kernels: rows/block 128→256
  • Updated launch grid dimensions for all affected kernels
  • Fixed tests/cuda_long_context_smoke.c (missing comp_kv_f16 arg)
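
The launch-grid update listed above follows from the rows/block change: if each block covers more output rows, fewer blocks are needed to cover the same matrix. A minimal sketch of that arithmetic, assuming a one-dimensional grid over output rows. `blocks_for_rows` is a hypothetical helper for illustration, not a function from this repository.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the PR): number of thread blocks needed
 * to cover n_rows output rows when each block handles rows_per_block rows.
 * Uses ceiling division so a partial final block is still launched. */
static size_t blocks_for_rows(size_t n_rows, size_t rows_per_block) {
    assert(rows_per_block > 0);
    return (n_rows + rows_per_block - 1) / rows_per_block;
}
```

Under these assumptions, a 4096-row output needs `blocks_for_rows(4096, 8)` = 512 blocks before the change and `blocks_for_rows(4096, 16)` = 256 after it. A kernel launched this way would also need a bounds check for row counts that are not a multiple of the rows/block value.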

Verification

  • make cuda-spark — clean build
  • make cuda-regression — passed

@gundemirbas

Any progress on merging this PR?