CPU support for Q4_K routed experts (fixes #171) by hexxyan · Pull Request #272 · antirez/ds4

hexxyan · 2026-05-27T18:50:33Z

Summary

Fixes #171 — the CPU inference path (--cpu) now handles Q4_K routed expert tensors instead of crashing with ds4: expected IQ2_XXS expert tensors.

Related to #114: this PR fixes the CPU/reference routed-MoE Q4_K path that failed with the error expected IQ2_XXS expert tensors. It does not change the Metal Q4_K kernels or investigate the BOS-repeat Metal generation behavior reported there.

What's the problem?

The q4-imatrix model variant quantizes all routed MoE experts to Q4_K (4-bit quants, about 4.5 bits per weight once the per-block scales are counted). The GPU (Metal/CUDA) backends already support this via dedicated Q4_K kernels (kernel_mul_mv_id_q4_K_*). However, the CPU MoE matvec was hardcoded for IQ2_XXS gate/up and Q2_K down projections, so any Q4_K expert tensor would hit ds4_die().

What changed?

New CPU dot product kernel (ds4_vec_dot_q4_K_q8_K):

ARM NEON DOTPROD fast path + portable scalar fallback
Correctly handles Q4_K's nibble packing: 2 groups share 32 bytes via low/high nibble shift (byte_off = (j>>1)*32, shift = (j&1)*4)
Handles 6-bit scale/min packing via q4_k_get_scale_min()
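
The nibble addressing and 6-bit scale/min unpacking described above can be sketched in scalar C. This is a minimal illustration, assuming the block layout matches the ggml Q4_K convention (12 packed scale bytes, 128 quant bytes per 256 weights). The struct and the `*_sketch` names are hypothetical and are not the identifiers used in ds4, and `d`/`dmin` are stored as float here rather than fp16 to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define QK_K 256 /* weights per super-block */

/* Hypothetical stand-in for a Q4_K block (assumed ggml-style layout). */
typedef struct {
    float   d;            /* multiplier for the 6-bit sub-block scales */
    float   dmin;         /* multiplier for the 6-bit sub-block mins   */
    uint8_t scales[12];   /* 8 scales + 8 mins, 6 bits each, packed    */
    uint8_t qs[QK_K / 2]; /* 256 four-bit quants, two per byte         */
} q4k_block_sketch;

/* Unpack the 6-bit scale and min of sub-block j (0..7). */
static void q4k_scale_min_sketch(int j, const uint8_t *s,
                                 uint8_t *sc, uint8_t *mn) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        *sc = s[j] & 63;
        *mn = s[j + 4] & 63;
    } else {
        /* Low 4 bits live in s[8..11]; the top 2 bits are borrowed
         * from the upper bits of s[0..7]. */
        *sc = (uint8_t)((s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4));
        *mn = (uint8_t)((s[j + 4] >> 4)   | ((s[j]     >> 6) << 4));
    }
}

/* Quant i (0..31) of sub-block j: two sub-blocks share 32 bytes,
 * even j in the low nibbles, odd j in the high nibbles. */
static uint8_t q4k_quant_sketch(const q4k_block_sketch *b, int j, int i) {
    assert(j >= 0 && j < 8 && i >= 0 && i < 32);
    int byte_off = (j >> 1) * 32;
    int shift    = (j & 1) * 4;
    return (uint8_t)((b->qs[byte_off + i] >> shift) & 0x0F);
}

/* Dequantized weight: d * scale * q - dmin * min. */
static float q4k_weight_sketch(const q4k_block_sketch *b, int j, int i) {
    uint8_t sc, mn;
    q4k_scale_min_sketch(j, b->scales, &sc, &mn);
    return b->d * sc * q4k_quant_sketch(b, j, i) - b->dmin * mn;
}
```

A scalar reference dot product would call `q4k_weight_sketch` for each of the 8 x 32 weights and multiply by the dequantized activation; a real kernel would instead accumulate in the integer domain and apply the scales once per sub-block.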

Q4_K matvec workers (matching existing IQ2_XXS/Q2_K patterns):

matvec_q4_k_mid_worker — gate/up mid-vector builder with SiLU, clamp, router weight
matvec_q4_k_accum_worker — down projection accumulator across selected experts
matvec_q4_k_batch_mid_worker / matvec_q4_k_batch_accum_rows_worker — batch prefill variants

Type-dispatch wrappers that route to the correct backend based on tensor->type:

matvec_experts_mid_prequant() — IQ2_XXS or Q4_K gate/up
matvec_experts_down_accum_prequant() — Q2_K or Q4_K down
matvec_expert_pair_prequant() / matvec_expert_down() — trace/diagnostic paths
layer_routed_moe_batch() — dispatches batch mid and batch down workers

Existing IQ2_XXS/Q2_K paths are unchanged — dispatch only activates when Q4_K tensors are detected.
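
The dispatch idea can be sketched as a pair of small selector functions keyed on the tensor type. The enum values, struct, and function names below are hypothetical stand-ins rather than ds4's real definitions, and the requirement that gate and up share one type is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical tensor type tags and a minimal tensor descriptor. */
typedef enum { TT_IQ2_XXS, TT_Q2_K, TT_Q4_K, TT_F16 } tensor_type_sketch;
typedef struct { tensor_type_sketch type; } tensor_sketch;

typedef enum {
    KERNEL_NONE,     /* unsupported combination: caller should fail loudly */
    KERNEL_IQ2_XXS,
    KERNEL_Q2_K,
    KERNEL_Q4_K
} kernel_sketch;

/* Gate/up projections: IQ2_XXS (existing path) or Q4_K (new path).
 * ASSUMPTION: gate and up are expected to have the same type. */
static kernel_sketch pick_mid_kernel_sketch(const tensor_sketch *gate,
                                            const tensor_sketch *up) {
    if (gate == NULL || up == NULL || gate->type != up->type)
        return KERNEL_NONE;
    switch (gate->type) {
    case TT_IQ2_XXS: return KERNEL_IQ2_XXS;
    case TT_Q4_K:    return KERNEL_Q4_K;
    default:         return KERNEL_NONE;
    }
}

/* Down projection: Q2_K (existing path) or Q4_K (new path). */
static kernel_sketch pick_down_kernel_sketch(const tensor_sketch *down) {
    if (down == NULL)
        return KERNEL_NONE;
    switch (down->type) {
    case TT_Q2_K: return KERNEL_Q2_K;
    case TT_Q4_K: return KERNEL_Q4_K;
    default:      return KERNEL_NONE;
    }
}
```

Keeping the selection in one place leaves the existing workers untouched: a caller only reaches the Q4_K worker when the selector returns `KERNEL_Q4_K`, and can abort with a clear message on `KERNEL_NONE`.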

Test

tests/test_q4k_dot.c — block size validation, scale extraction round-trip, known-value dot product, and 50-block random reference comparison (scalar dequantize+dot vs kernel)
make q4k-dot-test — one-command build + run
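
For readers who want a feel for the random reference comparison, the sketch below checks an integer-domain dot product against a dequantize-then-dot reference. It is not the code in tests/test_q4k_dot.c: the block structs are simplified (scales and mins are kept unpacked, and the block multipliers are plain floats), all names are hypothetical, and the relative error uses a denominator floor of 1.0 so that near-zero reference values do not inflate the ratio.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

#define QK 256 /* weights per super-block */

/* Simplified, hypothetical blocks: scales/mins are stored unpacked. */
typedef struct { float d, dmin; uint8_t sc[8], mn[8]; uint8_t qs[QK / 2]; } q4_blk_sketch;
typedef struct { float d; int8_t qs[QK]; } q8_blk_sketch;

static int q4_at_sketch(const q4_blk_sketch *a, int j, int i) {
    return (a->qs[(j >> 1) * 32 + i] >> ((j & 1) * 4)) & 0x0F;
}

/* Reference: dequantize both operands, accumulate in double. */
static double dot_ref_sketch(const q4_blk_sketch *a, const q8_blk_sketch *b) {
    double s = 0.0;
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < 32; i++) {
            double w = (double)a->d * a->sc[j] * q4_at_sketch(a, j, i)
                     - (double)a->dmin * a->mn[j];
            s += w * ((double)b->d * b->qs[j * 32 + i]);
        }
    return s;
}

/* Kernel-style: integer sums per sub-block, scales applied once. */
static float dot_int_sketch(const q4_blk_sketch *a, const q8_blk_sketch *b) {
    int32_t sumq = 0, summ = 0;
    for (int j = 0; j < 8; j++) {
        int32_t dotj = 0, sumj = 0;
        for (int i = 0; i < 32; i++) {
            int q8 = b->qs[j * 32 + i];
            dotj += q4_at_sketch(a, j, i) * q8;
            sumj += q8;
        }
        sumq += a->sc[j] * dotj;
        summ += a->mn[j] * sumj;
    }
    return b->d * (a->d * (float)sumq - a->dmin * (float)summ);
}

static void fill_random_sketch(q4_blk_sketch *a, q8_blk_sketch *b) {
    a->d = 0.001f; a->dmin = 0.001f; b->d = 0.01f;
    for (int j = 0; j < 8; j++) {
        a->sc[j] = (uint8_t)(rand() % 64);
        a->mn[j] = (uint8_t)(rand() % 64);
    }
    for (int i = 0; i < QK / 2; i++) a->qs[i] = (uint8_t)(rand() % 256);
    for (int i = 0; i < QK; i++)     b->qs[i] = (int8_t)(rand() % 255 - 127);
}

/* Count blocks whose error exceeds tol, relative to max(|ref|, 1). */
static int count_mismatches_sketch(int nblocks, double tol, unsigned seed) {
    assert(nblocks > 0);
    srand(seed);
    int bad = 0;
    for (int n = 0; n < nblocks; n++) {
        q4_blk_sketch a; q8_blk_sketch b;
        fill_random_sketch(&a, &b);
        double ref = dot_ref_sketch(&a, &b);
        double den = fabs(ref) > 1.0 ? fabs(ref) : 1.0;
        if (fabs(ref - (double)dot_int_sketch(&a, &b)) / den > tol) bad++;
    }
    return bad;
}
```

The integer path relies on the identity that the min term only needs the per-sub-block sum of the activation quants, which is why a kernel can subtract it once per sub-block instead of once per weight.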

Validation status

Dot product kernel verified against scalar reference implementation (50 random blocks, <1% relative error)
Builds successfully for both make cpu and make ds4 (Metal)
Not tested end-to-end with actual q4-imatrix model — my machine does not have enough RAM (~153 GB model). I cannot validate full inference correctness myself.

If you have a machine with 256 GB or more of RAM, it would be very helpful if you could run a short prompt with:

./ds4 --cpu path/to/q4-imatrix.gguf

and confirm that it produces coherent output (no crash, no garbage). This is the main blocker for confidence in this PR.

Caveats

CUDA Q4_K routed expert support is not included (the CUDA path has separate matvec dispatch).
Metal Q4_K kernel correctness (the BOS-repeat issue reported in #114, 'q4-imatrix weights fail to load: "expected IQ2_XXS expert tensors"') is not addressed here.

Test Plan

make cpu NATIVE_CPU_FLAG= — builds successfully
make ds4 NATIVE_CPU_FLAG= — builds successfully (Metal build unaffected)
make q4k-dot-test — 4/4 pass
E2E validation with q4-imatrix model on CPU (need community help)

loge-gh · 2026-05-28T21:17:26Z

I confirm the patch works with DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf on a PC with the --cpu option, producing correct output.

hexxyan · 2026-05-30T14:45:54Z

Hi @antirez,

Just a quick update to request a review for PR #272 (CPU support for Q4_K routed experts).

Both loge-gh and OceanX89 have tested and confirmed that the patch works. Specifically, OceanX89 mentioned in issue #171 that "ds4 is building and running fine" with this branch and that it now successfully handles the Q4 models (which they reported give much better results than the Q2 version).

Since this fixes the expected IQ2_XXS expert tensors crash on CPU and has been verified by other users, could you please merge it when you have a moment?

Thank you very much for your work!

antirez · 2026-05-30T15:42:21Z

Thanks, checking.

antirez · 2026-05-30T15:54:14Z

Merged. It could probably go faster. I'm not sure what the interest is here: is it to have a CPU reference implementation for debugging, as with Q2, or to actually use this on ARM systems with a lot of memory but no GPU?

hexxyan · 2026-05-30T19:58:41Z

> Merged. It could probably go faster. I'm not sure what the interest is here: is it to have a CPU reference implementation for debugging, as with Q2, or to actually use this on ARM systems with a lot of memory but no GPU?

Hi antirez, thank you for merging this!

For me, the main motivation was correctness first, with CPU-only usability as a useful side benefit. I noticed issues #114 and #171 and wanted to close the functional gap — Q4_K routed experts were a real model format people were trying to run, but the CPU reference path crashed on them. The ARM NEON path was a natural extension since the quantization format already had the building blocks, so I included it as well. I agree it can probably go faster, this PR was intentionally focused on correctness first.

I'm also quite fascinated by finding ways to push inference speed further in general! I have PR #261 exploring suffix-tree-based speculative decoding, and I'm looking into AVX2/AVX512 kernels on the CPU side. Beyond that, I'm interested in CPU-GPU hybrid deployment for large MoE models — shared experts on GPU, routed experts on CPU, similar to what KTransformers and fastllm are doing. I think ds4's clean architecture makes it a great fit for that kind of exploration.
Thanks again for ds4 — it's been a great project to learn from and contribute to!

Add CPU Q4_K routed expert support (fixes antirez#171)

912f155

hexxyan force-pushed the feat/cpu-q4k-routed-experts branch from cc612e8 to 912f155 on May 27, 2026 19:11

Add q4k-dot-test Makefile target for standalone Q4_K unit tests

121d541

This was referenced May 27, 2026

does CPU support IQ4? #171

Closed

q4-imatrix weights fail to load: "expected IQ2_XXS expert tensors" #114

Closed

antirez merged commit ba00a8a into antirez:main May 30, 2026

antirez merged 2 commits into antirez:main from hexxyan:feat/cpu-q4k-routed-experts
