Add Q4_K MoE prefill kernels + CUDA 11.x compatibility for V100 #279

Open: eazlong wants to merge 1 commit into antirez:main from eazlong:q4k-prefill-v100

eazlong commented May 28, 2026

Summary

Two independent fixes that together enable running ds4 on Tesla V100 (sm_70) GPUs with CUDA 11.x.

1. CUDA 11.x compatibility

cudaMemLocation is a CUDA 12.0+ API, so compilation fails on CUDA 11.x (common on V100 machines). Fix: guard the cudaMemLocation usage with CUDART_VERSION >= 12000 and fall back to an int device parameter on older CUDA.
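
A minimal sketch of the guard pattern described above, not the code from this PR. The names `ds4_prefetch_dst`, `ds4_make_prefetch_dst` and `ds4_prefetch_dst_device` are hypothetical helpers introduced only for illustration. `CUDART_VERSION` encodes the runtime version as 1000 * major + 10 * minor, so 12000 means CUDA 12.0. The sketch deliberately stops short of calling `cudaMemPrefetchAsync`, and it also compiles without the CUDA toolkit, in which case it takes the `int` branch.

```cpp
#include <cassert>

// Pull in the CUDA runtime header only when it is available, so this
// sketch also builds on a machine without the toolkit installed.
#if defined(__has_include)
#  if __has_include(<cuda_runtime.h>)
#    include <cuda_runtime.h>
#  endif
#endif

#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
// Newer runtime: describe the destination with a cudaMemLocation.
typedef cudaMemLocation ds4_prefetch_dst;   // hypothetical alias

static ds4_prefetch_dst ds4_make_prefetch_dst(int device) {
    ds4_prefetch_dst loc;
    loc.type = cudaMemLocationTypeDevice;   // destination is a GPU
    loc.id   = device;                      // device ordinal
    return loc;
}

static int ds4_prefetch_dst_device(const ds4_prefetch_dst &dst) {
    return dst.id;
}
#else
// CUDA 11.x (or no toolkit): the destination is a plain device ordinal.
typedef int ds4_prefetch_dst;               // hypothetical alias

static ds4_prefetch_dst ds4_make_prefetch_dst(int device) {
    return device;
}

static int ds4_prefetch_dst_device(const ds4_prefetch_dst &dst) {
    return dst;
}
#endif
```

Call sites can then build the destination once with `ds4_make_prefetch_dst(device)` and keep the version check in a single place instead of repeating the `#if` around every use.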

2. Q4_K MoE prefill kernels

Models quantized with Q4_K expert weights (gate_type=12, down_type=12) fail prefill. The existing Q4_K kernels only supported single-token decode. This PR adds sorted-pair Q4_K prefill kernels for gate/up/mid and down projection, and routes Q4_K prefill through the non-expert-tile path.
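
The sorted-pair idea can be sketched on the host as a small reference routine. This is an illustration under stated assumptions, not the kernels added in this PR: `RoutedPair`, `sort_pairs_by_expert` and `moe_prefill_reference` are hypothetical names, each expert is a single dense float matrix standing in for dequantized Q4_K weights (the real path has separate gate/up/mid and down projections), and applying the router weight during the final sum is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One routed (token, expert) pair with its router weight (hypothetical layout).
struct RoutedPair {
    int   token;
    int   expert;
    float weight;
};

// Sort pairs by expert id so all tokens routed to one expert are adjacent
// and that expert's weights are walked contiguously.
static std::vector<RoutedPair> sort_pairs_by_expert(std::vector<RoutedPair> pairs) {
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const RoutedPair &a, const RoutedPair &b) {
                         return a.expert < b.expert;
                     });
    return pairs;
}

// expert_w[e] is a row-major [n_out x n_in] matrix; x is [n_tokens x n_in].
// Returns y as [n_tokens x n_out].
static std::vector<float> moe_prefill_reference(
        const std::vector<std::vector<float>> &expert_w,
        const std::vector<float> &x,
        const std::vector<RoutedPair> &pairs,
        int n_tokens, int n_in, int n_out) {
    const std::vector<RoutedPair> sorted = sort_pairs_by_expert(pairs);

    // Step 1: one output row per sorted pair (the role of the per-pair kernels).
    std::vector<float> pair_out(sorted.size() * (size_t)n_out, 0.0f);
    for (size_t p = 0; p < sorted.size(); p++) {
        const float *w  = expert_w[sorted[p].expert].data();
        const float *xt = &x[(size_t)sorted[p].token * n_in];
        for (int o = 0; o < n_out; o++) {
            float acc = 0.0f;
            for (int i = 0; i < n_in; i++) acc += w[o * n_in + i] * xt[i];
            pair_out[p * n_out + o] = acc;
        }
    }

    // Step 2: reduce the pair outputs back into one row per token
    // (the role of a separate sum kernel).
    std::vector<float> y((size_t)n_tokens * n_out, 0.0f);
    for (size_t p = 0; p < sorted.size(); p++) {
        for (int o = 0; o < n_out; o++) {
            y[(size_t)sorted[p].token * n_out + o] +=
                sorted[p].weight * pair_out[p * n_out + o];
        }
    }
    return y;
}
```

Splitting the work into a per-pair pass and a per-token reduction means no two pairs write to the same output row during the first pass, which is what makes the sorted layout convenient for multi-token prefill.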

Test plan

  • V100 compile: make cuda CUDA_ARCH=sm_70
  • V100 benchmark: ./ds4-bench --cuda --ctx-start 2048 --ctx-max 16384 --step-incr 2048 --gen-tokens 64
  • Verify IQ2_XXS/Q2_K path still works (no regression)
  • Verify Metal backend unaffected

🤖 Generated with Claude Code

Two independent fixes that together enable V100 (sm_70) support:

1. CUDA 11.x compatibility (ds4_cuda.cu):
   - Guard cudaMemLocation usage with CUDART_VERSION >= 12000
   - Fall back to int device parameter on CUDA 11.x
   - Fix cudaMemPrefetchAsync signature (remove spurious 0 arg)

2. Q4_K MoE prefill (ds4_cuda.cu):
   - Add moe_gate_up_mid_sorted_q4K_qwarp32_kernel (gate/up/mid)
   - Add moe_down_sorted_q4K_qwarp32_kernel (down projection)
   - Remove n_tokens==1 restriction for Q4_K expert weights
   - Route Q4_K prefill through sorted (non-expert-tile) path
   - Disable expert tiles and P2-sorted for Q4_K (IQ2_XXS-only)

   The existing Q4_K kernels only supported decode (n_tokens=1).
   Prefill with Q4_K expert weights now uses sorted-pair dispatch
   with the new Q4_K dot-product kernels, followed by moe_sum_kernel
   for reduction.

Also adds diagnostic fprintf at key prefill failure points to aid
future debugging of GPU backend issues.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
