CPU support for Q4_K routed experts (fixes #171) #272

Merged
antirez merged 2 commits into antirez:main from hexxyan:feat/cpu-q4k-routed-experts
May 30, 2026
@hexxyan
Contributor

@hexxyan hexxyan commented May 27, 2026

Summary

Fixes #171 — the CPU inference path (--cpu) now handles Q4_K routed expert tensors instead of crashing with ds4: expected IQ2_XXS expert tensors.

Related to #114: this PR fixes the CPU/reference routed-MoE Q4_K path that failed with the "expected IQ2_XXS expert tensors" error. It does not change the Metal Q4_K kernels or investigate the BOS-repeat generation behavior on Metal reported there.

What's the problem?

The q4-imatrix model variant quantizes all routed MoE experts to Q4_K (4-bit weights). The GPU backends (Metal/CUDA) already support this via dedicated Q4_K kernels (kernel_mul_mv_id_q4_K_*). The CPU MoE matvec, however, was hardcoded for IQ2_XXS gate/up and Q2_K down projections, so any Q4_K expert tensor would hit ds4_die().

What changed?

New CPU dot product kernel (ds4_vec_dot_q4_K_q8_K):

  • ARM NEON DOTPROD fast path + portable scalar fallback
  • Correctly handles Q4_K's nibble packing: 2 groups share 32 bytes via low/high nibble shift (byte_off = (j>>1)*32, shift = (j&1)*4)
  • Handles 6-bit scale/min packing via q4_k_get_scale_min()
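The nibble addressing and 6-bit scale/min unpacking described above can be sketched in scalar C. This is a minimal illustration and not the code from this PR: the struct and function names (q4k_block_sketch, q4k_scale_min_sketch, q4k_dot_sketch) are hypothetical, d and dmin are plain floats instead of fp16 to avoid a conversion helper, and the activations are plain floats instead of Q8_K blocks. The packing of the 12 scale bytes is assumed to follow the usual GGUF Q4_K convention.

```c
#include <stdint.h>

#define QK_K 256  /* weights per Q4_K super-block */

/* Simplified stand-in for a Q4_K block (assumption: the real layout
 * stores d and dmin as fp16; floats keep this sketch self-contained). */
typedef struct {
    float   d;             /* super-block multiplier for the 6-bit scales */
    float   dmin;          /* super-block multiplier for the 6-bit mins   */
    uint8_t scales[12];    /* 8 scales + 8 mins, 6 bits each              */
    uint8_t qs[QK_K / 2];  /* 256 4-bit quants, two per byte              */
} q4k_block_sketch;

/* Unpack the 6-bit scale and min of sub-block j (0..7) from 12 bytes.
 * Sub-blocks 0..3 use the low 6 bits of bytes 0..7; sub-blocks 4..7
 * combine a nibble of bytes 8..11 with the top 2 bits of bytes 0..7. */
static void q4k_scale_min_sketch(int j, const uint8_t *p,
                                 uint8_t *sc, uint8_t *mn) {
    if (j < 4) {
        *sc = p[j] & 63;
        *mn = p[j + 4] & 63;
    } else {
        *sc = (uint8_t)((p[j + 4] & 0x0F) | ((p[j - 4] >> 6) << 4));
        *mn = (uint8_t)((p[j + 4] >> 4)   | ((p[j]     >> 6) << 4));
    }
}

/* Scalar reference: dot product of one block with 256 float activations. */
static float q4k_dot_sketch(const q4k_block_sketch *b, const float *x) {
    float sum = 0.0f;
    for (int j = 0; j < 8; j++) {
        uint8_t sc, mn;
        q4k_scale_min_sketch(j, b->scales, &sc, &mn);
        const float scale = b->d * (float)sc;
        const float min   = b->dmin * (float)mn;
        const int byte_off = (j >> 1) * 32;  /* two sub-blocks share 32 bytes */
        const int shift    = (j & 1) * 4;    /* even: low nibble, odd: high   */
        for (int l = 0; l < 32; l++) {
            const int q = (b->qs[byte_off + l] >> shift) & 0x0F;
            sum += (scale * (float)q - min) * x[j * 32 + l];
        }
    }
    return sum;
}
```

Each weight is reconstructed as scale * q - min, so a block whose bytes are all 0x53 yields the value 3 for even sub-blocks and 5 for odd ones.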

Q4_K matvec workers (matching existing IQ2_XXS/Q2_K patterns):

  • matvec_q4_k_mid_worker — gate/up mid-vector builder with SiLU, clamp, router weight
  • matvec_q4_k_accum_worker — down projection accumulator across selected experts
  • matvec_q4_k_batch_mid_worker / matvec_q4_k_batch_accum_rows_worker — batch prefill variants

Type-dispatch wrappers that route to the correct backend based on tensor->type:

  • matvec_experts_mid_prequant() — IQ2_XXS or Q4_K gate/up
  • matvec_experts_down_accum_prequant() — Q2_K or Q4_K down
  • matvec_expert_pair_prequant() / matvec_expert_down() — trace/diagnostic paths
  • layer_routed_moe_batch() — dispatches batch mid and batch down workers

Existing IQ2_XXS/Q2_K paths are unchanged — dispatch only activates when Q4_K tensors are detected.
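The type-based dispatch can be pictured with a small sketch. The enums and function names below are hypothetical and only mirror the behavior described in this PR: gate/up accept IQ2_XXS or Q4_K, down accepts Q2_K or Q4_K, and anything else is rejected (where the real code would presumably report an error). Requiring gate and up to share a type is an assumption of the sketch.

```c
/* Hypothetical tensor types and kernel selectors, for illustration only. */
typedef enum {
    TYPE_IQ2_XXS_SKETCH,
    TYPE_Q2_K_SKETCH,
    TYPE_Q4_K_SKETCH,
    TYPE_OTHER_SKETCH
} tensor_type_sketch;

typedef enum {
    KERNEL_NONE_SKETCH,      /* unsupported combination */
    KERNEL_IQ2_XXS_SKETCH,   /* existing gate/up path   */
    KERNEL_Q2_K_SKETCH,      /* existing down path      */
    KERNEL_Q4_K_SKETCH       /* new path from this PR   */
} kernel_sketch;

/* Gate/up projections: IQ2_XXS or Q4_K. */
static kernel_sketch select_mid_kernel_sketch(tensor_type_sketch gate,
                                              tensor_type_sketch up) {
    if (gate != up) return KERNEL_NONE_SKETCH;
    switch (gate) {
    case TYPE_IQ2_XXS_SKETCH: return KERNEL_IQ2_XXS_SKETCH;
    case TYPE_Q4_K_SKETCH:    return KERNEL_Q4_K_SKETCH;
    default:                  return KERNEL_NONE_SKETCH;
    }
}

/* Down projection: Q2_K or Q4_K. */
static kernel_sketch select_down_kernel_sketch(tensor_type_sketch down) {
    switch (down) {
    case TYPE_Q2_K_SKETCH: return KERNEL_Q2_K_SKETCH;
    case TYPE_Q4_K_SKETCH: return KERNEL_Q4_K_SKETCH;
    default:               return KERNEL_NONE_SKETCH;
    }
}
```

Selecting the mid and down kernels independently keeps the pre-existing IQ2_XXS/Q2_K combination on exactly the code path it used before.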

Test

  • tests/test_q4k_dot.c — block size validation, scale extraction round-trip, known-value dot product, and 50-block random reference comparison (scalar dequantize+dot vs kernel)
  • make q4k-dot-test — one-command build + run
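The reference comparison has roughly the following shape, shown here on a single 32-weight sub-block. This is a sketch, not the contents of tests/test_q4k_dot.c: the function names are hypothetical, and the "factored" form (scale * sum(q*x) - min * sum(x)) is a common way to organize fused kernels, assumed here for illustration rather than taken from ds4_vec_dot_q4_K_q8_K.

```c
#include <stdint.h>

/* Reference: dequantize each weight as scale*q - min, then multiply. */
static float dot_dequant_sketch(const uint8_t *q, float scale, float min,
                                const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += (scale * (float)q[i] - min) * x[i];
    return sum;
}

/* Factored form: the scale and min are applied once per sub-block:
 *   sum((scale*q - min) * x) = scale * sum(q*x) - min * sum(x) */
static float dot_factored_sketch(const uint8_t *q, float scale, float min,
                                 const float *x, int n) {
    float qx = 0.0f, sx = 0.0f;
    for (int i = 0; i < n; i++) {
        qx += (float)q[i] * x[i];
        sx += x[i];
    }
    return scale * qx - min * sx;
}

/* Relative error of `got` against `want`; falls back to the absolute
 * error when the reference is exactly zero. */
static float rel_err_sketch(float got, float want) {
    float diff = got - want;
    if (diff < 0.0f) diff = -diff;
    const float mag = want < 0.0f ? -want : want;
    return mag > 0.0f ? diff / mag : diff;
}
```

A test along these lines fills q and x with random values, computes both results, and asserts that rel_err_sketch stays under a tolerance such as the 1% figure quoted in this PR.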

Validation status

  • Dot product kernel verified against scalar reference implementation (50 random blocks, <1% relative error)
  • Builds successfully for both make cpu and make ds4 (Metal)
  • Not tested end-to-end with actual q4-imatrix model — my machine does not have enough RAM (~153 GB model). I cannot validate full inference correctness myself.

If you have a machine with 256 GB or more of RAM, running a short prompt with:

./ds4 --cpu path/to/q4-imatrix.gguf

to confirm that it produces coherent output (no crash, no garbage) would be very helpful. This is the main blocker for confidence in this PR.

Caveats

Test Plan

  • make cpu NATIVE_CPU_FLAG= — builds successfully
  • make ds4 NATIVE_CPU_FLAG= — builds successfully (Metal build unaffected)
  • make q4k-dot-test — 4/4 pass
  • E2E validation with q4-imatrix model on CPU (need community help)

@hexxyan hexxyan force-pushed the feat/cpu-q4k-routed-experts branch from cc612e8 to 912f155 Compare May 27, 2026 19:11
@loge-gh

loge-gh commented May 28, 2026

I confirm the patch works with DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf on a PC with the --cpu option, producing correct output.

@hexxyan
Contributor Author

hexxyan commented May 30, 2026

Hi @antirez,

Just a quick update to request a review for PR #272 (CPU support for Q4_K routed experts).

Both loge-gh and OceanX89 have tested and confirmed that the patch works. Specifically, OceanX89 mentioned in issue #171 that "ds4 is building and running fine" with this branch and that it now successfully handles the Q4 models (which they reported give much better results than the Q2 version).

Since this fixes the expected IQ2_XXS expert tensors crash on CPU and has been verified by other users, could you please merge it when you have a moment?

Thank you very much for your work!

@antirez
Owner

antirez commented May 30, 2026

Thanks, checking.

@antirez antirez merged commit ba00a8a into antirez:main May 30, 2026
@antirez
Owner

antirez commented May 30, 2026

Merged. Probably could go faster, not sure what is the interest here, if to have like Q2 a CPU reference implementation for debugging, or to really use this in ARM systems with a lot of memory but no GPU?

@hexxyan
Contributor Author

hexxyan commented May 30, 2026

> Merged. Probably could go faster, not sure what is the interest here, if to have like Q2 a CPU reference implementation for debugging, or to really use this in ARM systems with a lot of memory but no GPU?

Hi antirez, thank you for merging this!

For me, the main motivation was correctness, with CPU-only usability as a useful side benefit. I noticed issues #114 and #171 and wanted to close the functional gap: Q4_K routed experts were a real model format people were trying to run, but the CPU reference path crashed on them. The ARM NEON path was a natural extension since the quantization format already had the building blocks, so I included it as well. I agree it can probably go faster; this PR was intentionally focused on correctness first.

I'm also quite fascinated by finding ways to push inference speed further in general! I have PR #261 exploring suffix-tree-based speculative decoding, and I'm looking into AVX2/AVX512 kernels on the CPU side. Beyond that, I'm interested in CPU-GPU hybrid deployment for large MoE models — shared experts on GPU, routed experts on CPU, similar to what KTransformers and fastllm are doing. I think ds4's clean architecture makes it a great fit for that kind of exploration.
Thanks again for ds4 — it's been a great project to learn from and contribute to!

Development

Successfully merging this pull request may close these issues.

does CPU support IQ4?

3 participants