CPU support for Q4_K routed experts (fixes #171) #272
I confirm the patch works with DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf on a PC with the --cpu option, producing correct output.
Hi @antirez, just a quick update to request a review of PR #272 (CPU support for Q4_K routed experts). Both loge-gh and OceanX89 have tested the patch and confirmed that it works. Specifically, OceanX89 mentioned in issue #171 that "ds4 is building and running fine" with this branch and that it now handles the Q4 models (which they reported give much better results than the Q2 version). Since this fixes the "expected IQ2_XXS expert tensors" crash on CPU and has been verified by other users, could you please merge it when you have a moment? Thank you very much for your work!
Thanks, checking.
Merged. It could probably go faster. I'm not sure what the interest is here: is it to have a CPU reference implementation for debugging, as with Q2, or to really use this on ARM systems with a lot of memory but no GPU?
Hi antirez, thank you for merging this! For me, the main motivation was correctness first, with CPU-only usability as a useful side benefit. I noticed issues #114 and #171 and wanted to close the functional gap: Q4_K routed experts were a real model format people were trying to run, but the CPU reference path crashed on them. The ARM NEON path was a natural extension since the quantization format already had the building blocks, so I included it as well. I agree it can probably go faster; this PR was intentionally focused on correctness first. I'm also quite fascinated by finding ways to push inference speed further in general! I have PR #261 exploring suffix-tree-based speculative decoding, and I'm looking into AVX2/AVX512 kernels on the CPU side. Beyond that, I'm interested in CPU-GPU hybrid deployment for large MoE models (shared experts on GPU, routed experts on CPU), similar to what KTransformers and fastllm are doing. I think ds4's clean architecture makes it a great fit for that kind of exploration.
Summary
Fixes #171: the CPU inference path (`--cpu`) now handles Q4_K routed expert tensors instead of crashing with `ds4: expected IQ2_XXS expert tensors`.

Related to #114: this PR fixes the CPU/reference routed-MoE Q4_K path that produced `expected IQ2_XXS expert tensors`. It does not change the Metal Q4_K kernels or investigate the BOS-repeat Metal generation behavior reported there.

What's the problem?
The `q4-imatrix` model variant quantizes all routed MoE experts to Q4_K (4-bit per weight). The GPU (Metal/CUDA) backends already support this via dedicated Q4_K kernels (`kernel_mul_mv_id_q4_K_*`). However, the CPU MoE matvec was hardcoded for IQ2_XXS gate/up and Q2_K down projections, so any Q4_K expert tensor would hit `ds4_die()`.

What changed?
New CPU dot product kernel (`ds4_vec_dot_q4_K_q8_K`):

- `byte_off = (j>>1)*32`, `shift = (j&1)*4`
- `q4_k_get_scale_min()`

Q4_K matvec workers (matching existing IQ2_XXS/Q2_K patterns):

- `matvec_q4_k_mid_worker`: gate/up mid-vector builder with SiLU, clamp, router weight
- `matvec_q4_k_accum_worker`: down projection accumulator across selected experts
- `matvec_q4_k_batch_mid_worker` / `matvec_q4_k_batch_accum_rows_worker`: batch prefill variants

Type-dispatch wrappers that route to the correct backend based on `tensor->type`:

- `matvec_experts_mid_prequant()`: IQ2_XXS or Q4_K gate/up
- `matvec_experts_down_accum_prequant()`: Q2_K or Q4_K down
- `matvec_expert_pair_prequant()` / `matvec_expert_down()`: trace/diagnostic paths
- `layer_routed_moe_batch()`: dispatches batch mid and batch down workers

Existing IQ2_XXS/Q2_K paths are unchanged; dispatch only activates when Q4_K tensors are detected.
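
For readers unfamiliar with the Q4_K layout, the indexing mentioned above (`byte_off = (j>>1)*32`, `shift = (j&1)*4`) can be illustrated with a minimal scalar sketch. It assumes the usual ggml-style Q4_K super-block: 256 weights split into 8 sub-blocks of 32, 128 bytes of packed nibbles, and 12 bytes holding eight 6-bit scales and eight 6-bit mins. The function names below (`sketch_scale_min`, `sketch_nibble`, `sketch_dequant_block`) are hypothetical and are not the functions added by this PR; the actual `q4_k_get_scale_min()` signature may differ.

```c
#include <stdint.h>

/* Hypothetical helper (assumed ggml-style packing): unpack the 6-bit scale
 * and 6-bit min of sub-block j (0..7) from the 12 packed scale bytes. */
static void sketch_scale_min(int j, const uint8_t *s, uint8_t *sc, uint8_t *mn) {
    if (j < 4) {
        /* Sub-blocks 0..3: low 6 bits of bytes 0..3 (scale) and 4..7 (min). */
        *sc = s[j] & 63;
        *mn = s[j + 4] & 63;
    } else {
        /* Sub-blocks 4..7: low 4 bits come from bytes 8..11, the top 2 bits
         * from the upper bits of bytes 0..3 (scale) and 4..7 (min). */
        *sc = (uint8_t)((s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4));
        *mn = (uint8_t)((s[j + 4] >> 4) | ((s[j] >> 6) << 4));
    }
}

/* Hypothetical helper: read the 4-bit quant of weight l (0..31) in sub-block j.
 * Two consecutive sub-blocks share the same 32 bytes: the even one uses the
 * low nibbles, the odd one uses the high nibbles. */
static uint8_t sketch_nibble(const uint8_t *qs, int j, int l) {
    int byte_off = (j >> 1) * 32;
    int shift = (j & 1) * 4;
    return (uint8_t)((qs[byte_off + l] >> shift) & 0x0F);
}

/* Dequantize one 256-weight block: w = d * scale * q - dmin * min.
 * d and dmin are passed as floats here; in the file format they are fp16. */
static void sketch_dequant_block(float d, float dmin, const uint8_t scales[12],
                                 const uint8_t qs[128], float out[256]) {
    for (int j = 0; j < 8; j++) {
        uint8_t sc, mn;
        sketch_scale_min(j, scales, &sc, &mn);
        for (int l = 0; l < 32; l++) {
            out[j * 32 + l] = d * (float)sc * (float)sketch_nibble(qs, j, l)
                            - dmin * (float)mn;
        }
    }
}
```

A scalar dequantizer like this is only useful as a reference; a dot product kernel would normally keep the nibbles as integers and apply the scales once per sub-block.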
Test
- `tests/test_q4k_dot.c`: block size validation, scale extraction round-trip, known-value dot product, and 50-block random reference comparison (scalar dequantize+dot vs kernel)
- `make q4k-dot-test`: one-command build + run

Validation status

- `make cpu` and `make ds4` (Metal)

If you have a 256 GB+ machine and can run:
a short prompt to confirm it produces coherent output (no crash, no garbage) would be very helpful. This is the main blocker for confidence in this PR.
Caveats
Test Plan
- `make cpu NATIVE_CPU_FLAG=`: builds successfully
- `make ds4 NATIVE_CPU_FLAG=`: builds successfully (Metal build unaffected)
- `make q4k-dot-test`: 4/4 pass
- `q4-imatrix` model on CPU (need community help)
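
The idea behind the reference comparison in the dot-product test (scalar dequantize+dot vs kernel) can be sketched roughly as follows. This is not the code in `tests/test_q4k_dot.c`. It assumes the scales and mins have already been unpacked into `sc[8]` and `mn[8]`, and that the activation block is Q8_K-like: 256 signed 8-bit values with a single float scale `d8`. Both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical float reference: dequantize every weight and activation,
 * then multiply-accumulate in floating point. */
static float sketch_dot_ref(float d, float dmin,
                            const uint8_t sc[8], const uint8_t mn[8],
                            const uint8_t qs[128],
                            float d8, const int8_t q8[256]) {
    float sum = 0.0f;
    for (int j = 0; j < 8; j++) {
        for (int l = 0; l < 32; l++) {
            int q = (qs[(j >> 1) * 32 + l] >> ((j & 1) * 4)) & 0x0F;
            float w = d * (float)sc[j] * (float)q - dmin * (float)mn[j];
            sum += w * (d8 * (float)q8[j * 32 + l]);
        }
    }
    return sum;
}

/* Hypothetical kernel-style version: accumulate in integers per sub-block
 * and apply the float scales only once at the end. */
static float sketch_dot_int(float d, float dmin,
                            const uint8_t sc[8], const uint8_t mn[8],
                            const uint8_t qs[128],
                            float d8, const int8_t q8[256]) {
    int32_t acc_q = 0; /* sum over j of sc[j] * dot(q4, q8) */
    int32_t acc_m = 0; /* sum over j of mn[j] * sum(q8)     */
    for (int j = 0; j < 8; j++) {
        int32_t dotq = 0, sum8 = 0;
        for (int l = 0; l < 32; l++) {
            int q = (qs[(j >> 1) * 32 + l] >> ((j & 1) * 4)) & 0x0F;
            int a = q8[j * 32 + l];
            dotq += q * a;
            sum8 += a;
        }
        acc_q += (int32_t)sc[j] * dotq;
        acc_m += (int32_t)mn[j] * sum8;
    }
    return d * d8 * (float)acc_q - dmin * d8 * (float)acc_m;
}
```

A test in this style would fill the inputs with random values and check that the two results agree within a small tolerance, since the float reference and the integer path round differently.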