Add Q4_K MoE prefill kernels + CUDA 11.x compatibility for V100 by eazlong · Pull Request #279 · antirez/ds4

eazlong · 2026-05-28T10:04:17Z

Summary

Two independent fixes that together enable running ds4 on Tesla V100 (sm_70) GPUs with CUDA 11.x.

1. CUDA 11.x compatibility

cudaMemLocation is a CUDA 12.0+ API. On CUDA 11.x (common on V100 machines), compilation fails. Fix: guard with CUDART_VERSION >= 12000, fall back to int device on older CUDA.
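The guard described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `prefetch_target`, `make_prefetch_target`, and `prefetch_api` are hypothetical names, and `<cuda_runtime.h>` is deliberately not included so the sketch also builds as plain host C++ (where `CUDART_VERSION` is undefined and the fallback branch is taken).

```cpp
#include <cassert>
#include <string>

// In a real .cu file, <cuda_runtime.h> is included first and defines
// CUDART_VERSION (e.g. 11080 for CUDA 11.8, 12000 for CUDA 12.0).
// Without that header the macro is undefined, so the #else branch is used.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
// Newer runtime: describe the prefetch destination with cudaMemLocation.
using prefetch_target = cudaMemLocation;
static prefetch_target make_prefetch_target(int device) {
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;  // destination is a GPU
    loc.id   = device;                     // device ordinal
    return loc;
}
static const char *prefetch_api() { return "cudaMemLocation"; }
#else
// Older runtime (or host-only build): the destination is a plain device ordinal.
using prefetch_target = int;
static prefetch_target make_prefetch_target(int device) { return device; }
static const char *prefetch_api() { return "int device"; }
#endif
```

Keeping the version check in one helper means call sites do not need their own `#if` blocks; only the helper knows which form of the destination argument the runtime expects.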

2. Q4_K MoE prefill kernels

Models quantized with Q4_K expert weights (gate_type=12, down_type=12) fail prefill. The existing Q4_K kernels only supported single-token decode. This PR adds sorted-pair Q4_K prefill kernels for gate/up/mid and down projection, and routes Q4_K prefill through the non-expert-tile path.
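As a rough CPU-side illustration of sorted-pair dispatch followed by a sum reduction, the sketch below builds one (token, expert) pair per routed expert, sorts the pairs by expert so that work for the same expert is contiguous, and then adds the per-pair results back into each token. It is an assumption-laden sketch, not the PR's kernels: `Pair`, `build_sorted_pairs`, and `sum_pairs` are hypothetical names, each pair's output is reduced to a single float, and Q4_K dequantization and any routing weights are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One routed (token, expert) pair; slot is the position in the token's top-k list.
struct Pair {
    int token;
    int slot;
    int expert;
};

// Build n_tokens * top_k pairs from a row-major routing table and sort them
// by expert id, so all pairs that use the same expert are adjacent.
std::vector<Pair> build_sorted_pairs(const std::vector<int> &expert_ids,
                                     int n_tokens, int top_k) {
    std::vector<Pair> pairs;
    pairs.reserve(static_cast<std::size_t>(n_tokens) * top_k);
    for (int t = 0; t < n_tokens; ++t) {
        for (int s = 0; s < top_k; ++s) {
            pairs.push_back({t, s, expert_ids[t * top_k + s]});
        }
    }
    // stable_sort keeps token order within each expert group.
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const Pair &a, const Pair &b) { return a.expert < b.expert; });
    return pairs;
}

// Reduction step: accumulate each pair's output into its owning token.
// pair_out[i] is the (scalar, for brevity) result computed for pairs[i].
std::vector<float> sum_pairs(const std::vector<Pair> &pairs,
                             const std::vector<float> &pair_out, int n_tokens) {
    std::vector<float> out(n_tokens, 0.0f);
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        out[pairs[i].token] += pair_out[i];
    }
    return out;
}
```

With two tokens routed to experts {2, 0} and {0, 1}, the sorted order groups both expert-0 pairs first, which is what lets a kernel process one expert's weights for several tokens in a row during prefill.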

Test plan

V100 compile: make cuda CUDA_ARCH=sm_70
V100 benchmark: ./ds4-bench --cuda --ctx-start 2048 --ctx-max 16384 --step-incr 2048 --gen-tokens 64
Verify IQ2_XXS/Q2_K path still works (no regression)
Verify Metal backend unaffected

🤖 Generated with Claude Code

Two independent fixes that together enable V100 (sm_70) support:

1. CUDA 11.x compatibility (ds4_cuda.cu):
   - Guard cudaMemLocation usage with CUDART_VERSION >= 12000
   - Fall back to int device parameter on CUDA 11.x
   - Fix cudaMemPrefetchAsync signature (remove spurious 0 arg)

2. Q4_K MoE prefill (ds4_cuda.cu):
   - Add moe_gate_up_mid_sorted_q4K_qwarp32_kernel (gate/up/mid)
   - Add moe_down_sorted_q4K_qwarp32_kernel (down projection)
   - Remove n_tokens==1 restriction for Q4_K expert weights
   - Route Q4_K prefill through sorted (non-expert-tile) path
   - Disable expert tiles and P2-sorted for Q4_K (IQ2_XXS-only)

The existing Q4_K kernels only supported decode (n_tokens=1). Prefill with Q4_K expert weights now uses sorted-pair dispatch with the new Q4_K dot-product kernels, followed by moe_sum_kernel for reduction.

Also adds diagnostic fprintf at key prefill failure points to aid future debugging of GPU backend issues.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced May 30, 2026

'cuda prefill fail' on DGX Spark running driver version 580.159.03 #289

Closed

cuda prefill failed on NVIDIA A100-SXM4-80GB during both inference and bench #284

Open

eazlong wants to merge 1 commit into antirez:main from eazlong:q4k-prefill-v100
