Add Q4_K MoE prefill kernels + CUDA 11.x compatibility for V100 #279
Open
eazlong wants to merge 1 commit into
Two independent fixes that together enable V100 (sm_70) support:

1. CUDA 11.x compatibility (ds4_cuda.cu):
   - Guard cudaMemLocation usage with CUDART_VERSION >= 12000
   - Fall back to int device parameter on CUDA 11.x
   - Fix cudaMemPrefetchAsync signature (remove spurious 0 arg)

2. Q4_K MoE prefill (ds4_cuda.cu):
   - Add moe_gate_up_mid_sorted_q4K_qwarp32_kernel (gate/up/mid)
   - Add moe_down_sorted_q4K_qwarp32_kernel (down projection)
   - Remove n_tokens==1 restriction for Q4_K expert weights
   - Route Q4_K prefill through sorted (non-expert-tile) path
   - Disable expert tiles and P2-sorted for Q4_K (IQ2_XXS-only)

The existing Q4_K kernels only supported decode (n_tokens=1). Prefill with Q4_K expert weights now uses sorted-pair dispatch with the new Q4_K dot-product kernels, followed by moe_sum_kernel for reduction.

Also adds diagnostic fprintf at key prefill failure points to aid future debugging of GPU backend issues.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 30, 2026
Summary
Two independent fixes that together enable running ds4 on Tesla V100 (sm_70) GPUs with CUDA 11.x.
1. CUDA 11.x compatibility
cudaMemLocation is a CUDA 12.0+ API. On CUDA 11.x (common on V100 machines), compilation fails. Fix: guard with CUDART_VERSION >= 12000, fall back to int device on older CUDA.
2. Q4_K MoE prefill kernels
Models quantized with Q4_K expert weights (gate_type=12, down_type=12) fail prefill. The existing Q4_K kernels only supported single-token decode. This PR adds sorted-pair Q4_K prefill kernels for gate/up/mid and down projection, and routes Q4_K prefill through the non-expert-tile path.
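The sorted-pair routing can be pictured with a small host-side C++ sketch. This is an illustration under stated assumptions, not the code in ds4_cuda.cu: `RoutedPair`, `sort_pairs_by_expert` and `sum_slots` are hypothetical names, the row-major layouts of `selected` and `pair_out` are assumed, and the Q4_K dot products that the new kernels perform between the two steps are left out. It also assumes any router weighting has already been applied to the per-pair outputs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One routed (token, expert) pair. Hypothetical type, not taken from ds4_cuda.cu.
struct RoutedPair {
    int32_t token;   // index of the token in the prefill batch
    int32_t slot;    // which of the token's top-k slots this pair fills
    int32_t expert;  // expert chosen by the router for that slot
};

// Counting sort of all (token, slot) pairs by expert id, so that pairs reading
// the same expert weights become adjacent. After the call, the pairs for
// expert e occupy sorted[expert_offsets[e]] .. sorted[expert_offsets[e + 1] - 1].
// Assumed layout: selected[t * top_k + s] is the expert for token t, slot s.
std::vector<RoutedPair> sort_pairs_by_expert(const std::vector<int32_t> &selected,
                                             int n_tokens, int top_k, int n_experts,
                                             std::vector<int32_t> &expert_offsets) {
    // Histogram of pairs per expert, shifted by one for the prefix sum.
    expert_offsets.assign(n_experts + 1, 0);
    for (int32_t e : selected) expert_offsets[e + 1]++;
    for (int e = 0; e < n_experts; e++) expert_offsets[e + 1] += expert_offsets[e];

    // Scatter each pair into its expert's run, preserving token order.
    std::vector<int32_t> cursor(expert_offsets.begin(), expert_offsets.end() - 1);
    std::vector<RoutedPair> sorted(selected.size());
    for (int t = 0; t < n_tokens; t++) {
        for (int s = 0; s < top_k; s++) {
            const int32_t e = selected[t * top_k + s];
            sorted[cursor[e]++] = RoutedPair{t, s, e};
        }
    }
    return sorted;
}

// Reduction step: add the top_k per-slot outputs of each token.
// Assumed layout: pair_out[(t * top_k + s) * n_embd + i].
std::vector<float> sum_slots(const std::vector<float> &pair_out,
                             int n_tokens, int top_k, int n_embd) {
    std::vector<float> out(static_cast<size_t>(n_tokens) * n_embd, 0.0f);
    for (int t = 0; t < n_tokens; t++)
        for (int s = 0; s < top_k; s++)
            for (int i = 0; i < n_embd; i++)
                out[t * n_embd + i] += pair_out[(t * top_k + s) * n_embd + i];
    return out;
}
```

Because each pair keeps its `token` and `slot`, per-pair results computed in expert order can be written back to a token-major buffer, which is what makes a plain per-token sum possible afterwards.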
Test plan
make cuda CUDA_ARCH=sm_70
./ds4-bench --cuda --ctx-start 2048 --ctx-max 16384 --step-incr 2048 --gen-tokens 64
🤖 Generated with Claude Code