cpu: add x86 AVX2/AVX512 SIMD fast paths for quantized MoE dot products by hexxyan · Pull Request #305 · antirez/ds4

hexxyan · 2026-05-30T23:14:39Z

Summary

The x86 CPU path currently falls back to scalar code for all quantized dot products (Q2_K, Q4_K, IQ2_XXS) used in routed MoE expert inference. This PR adds AVX2 and AVX512BW SIMD fast paths, and explicit Makefile targets for building ISA-specific binaries.

The existing ARM NEON, scalar, Metal, and CUDA paths are unchanged.

Why does this matter?

Each MoE layer runs dot products against multiple expert weight blocks (typically 8 selected experts × gate/up/down projections). On x86 these are currently pure scalar loops that handle one multiply-accumulate at a time. A 256-bit AVX2 register holds 8 float32 lanes or 32 int8 lanes (16 values once int8 inputs are widened to int16 for madd_epi16), so each instruction covers many values; AVX512 doubles the register width again. For CPU-only inference on large-RAM x86 machines, this should make routed expert evaluation substantially faster, directly improving tok/s.

What changed?

New SIMD dot product kernels (compile-time selected via __AVX2__ / __AVX512F__+__AVX512BW__):

Kernel	AVX2	AVX512BW	AVX512_VNNI
`Q2_K × Q8_K`	✓	✓	—
`Q4_K × Q8_K`	✓	✓	✓ (dpbusd)
`IQ2_XXS × Q8_K`	✓	✓	—
`IQ2_XXS pair × Q8_K`	✓ (delegates)	✓ (delegates)	—
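To illustrate the compile-time selection pattern, here is a minimal sketch of an int8 × int8 dot product that takes an AVX2 path when __AVX2__ is defined and falls back to scalar otherwise. The names dot_i8 and dot_i8_scalar are hypothetical and are not the kernels in this PR; the real kernels also handle block layout and per-block scales, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar reference: sum of a[i] * b[i] over n int8 values. */
static int32_t dot_i8_scalar(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* Same result as dot_i8_scalar; the path is chosen by the compiler macro. */
static int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        /* Sign-extend 16 int8 values to 16 int16 lanes. */
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        /* Multiply int16 pairs and add adjacent products into 8 int32 lanes. */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    /* Horizontal sum of the 8 int32 lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E)); /* swap 64-bit halves */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1)); /* swap adjacent lanes */
    int32_t sum = _mm_cvtsi128_si32(s);
    /* Scalar tail for lengths that are not a multiple of 16. */
    for (; i < n; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
#else
    return dot_i8_scalar(a, b, n);
#endif
}
```

Because the accumulation is done in integers, the two paths can be compared for exact equality rather than within a tolerance.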

Key implementation details:

Q2_K: 2-bit extraction via shift+mask, madd_epi16 for signed dot product
Q4_K: maddubs_epi16 (unsigned × signed byte), AVX512 variant with dpbusd_epi32 when VNNI is available
IQ2_XXS: grid/sign lookup stays scalar (LUT-dependent), only int8 dot uses SIMD via madd_epi16
Proper zero-extension (ds4_zext_i128_to_i256) for AVX512 cvtepi8_epi16 — avoids undefined upper-lane garbage
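The maddubs_epi16 approach named for Q4_K can be sketched as follows. This is an illustration under assumptions, not the PR's kernel: it assumes 32 packed bytes hold 64 unsigned 4-bit values, with the low nibbles covering values 0..31 and the high nibbles covering values 32..63, and it ignores scales and mins. The names dot_q4 and dot_q4_scalar are hypothetical.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar reference for the assumed packing (low nibbles first, then high). */
static int32_t dot_q4_scalar(const uint8_t q4[32], const int8_t q8[64]) {
    int32_t sum = 0;
    for (int i = 0; i < 32; i++) {
        sum += (int32_t)(q4[i] & 0x0F) * q8[i];
        sum += (int32_t)(q4[i] >> 4) * q8[i + 32];
    }
    return sum;
}

static int32_t dot_q4(const uint8_t q4[32], const int8_t q8[64]) {
#if defined(__AVX2__)
    const __m256i mask = _mm256_set1_epi8(0x0F);
    __m256i packed = _mm256_loadu_si256((const __m256i *)q4);
    /* Split each byte into its low and high nibble (both in 0..15). */
    __m256i lo = _mm256_and_si256(packed, mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask);
    __m256i y0 = _mm256_loadu_si256((const __m256i *)q8);
    __m256i y1 = _mm256_loadu_si256((const __m256i *)(q8 + 32));
    /* maddubs: unsigned byte x signed byte, adjacent pairs summed to int16.
       With nibbles <= 15 each pair sum stays within +/-3840, so adding the
       two results cannot overflow or saturate int16. */
    __m256i p16 = _mm256_add_epi16(_mm256_maddubs_epi16(lo, y0),
                                   _mm256_maddubs_epi16(hi, y1));
    /* Widen int16 pairs to int32 by multiplying with 1 and adding. */
    __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32),
                              _mm256_extracti128_si256(p32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
    return _mm_cvtsi128_si32(s);
#else
    return dot_q4_scalar(q4, q8);
#endif
}
```

The operand order matters: maddubs_epi16 treats its first argument as unsigned and its second as signed, which is why the unpacked nibbles go first and the Q8 values second.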

New Makefile targets for explicit ISA-specific binaries:

make cpu               # default: -march=native, matches build machine
make cpu-avx2          # fixed AVX2 binary (ds4-avx2, ds4-server-avx2, ...)
make cpu-avx512        # fixed AVX512BW binary
make cpu-avx512-vnni   # fixed AVX512BW+VNNI binary

Each SIMD variant uses separate build/cpu-<suffix>/ object directories to avoid .o file conflicts. Architecture gating uses $(filter x86_64 amd64,$(UNAME_M)) — works across both Darwin and Linux. Non-x86 hosts get a clear error message.

New test file: tests/test_quant_dot.c replaces tests/test_q2k_dot.c, covering all three quant formats:

Block size validation
Known-value dot product
100-seed random reference comparison (scalar vs SIMD)
Multi-block accumulation
Q4_K nibble edge cases
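The seeded random reference comparison can be pictured roughly like this. It is a sketch of the general technique, not the contents of tests/test_quant_dot.c: BLOCK_LEN, the xorshift generator, and the function names are assumptions, and dot_unrolled merely stands in for a SIMD kernel under test.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_LEN = 256 }; /* assumed block length for this sketch */

typedef int32_t (*dot_fn)(const int8_t *a, const int8_t *b, size_t n);

/* Small deterministic generator so each seed reproduces the same block.
   The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Scalar reference implementation. */
static int32_t dot_ref(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* Stand-in for the fast kernel: 4-way unrolled with separate accumulators. */
static int32_t dot_unrolled(const int8_t *a, const int8_t *b, size_t n) {
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += (int32_t)a[i] * b[i];
        s1 += (int32_t)a[i + 1] * b[i + 1];
        s2 += (int32_t)a[i + 2] * b[i + 2];
        s3 += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) s0 += (int32_t)a[i] * b[i];
    return s0 + s1 + s2 + s3;
}

/* Runs one random block per seed (1..num_seeds) and counts disagreements. */
static int count_mismatches(dot_fn fast, dot_fn ref, uint32_t num_seeds) {
    int mismatches = 0;
    for (uint32_t seed = 1; seed <= num_seeds; seed++) {
        uint32_t state = seed;
        int8_t a[BLOCK_LEN], b[BLOCK_LEN];
        for (size_t i = 0; i < BLOCK_LEN; i++) {
            /* Values in -127..127. */
            a[i] = (int8_t)((int)(xorshift32(&state) % 255u) - 127);
            b[i] = (int8_t)((int)(xorshift32(&state) % 255u) - 127);
        }
        if (fast(a, b, BLOCK_LEN) != ref(a, b, BLOCK_LEN)) mismatches++;
    }
    return mismatches;
}
```

A call such as count_mismatches(dot_unrolled, dot_ref, 100) should return 0. Exact equality works here because the sums are integers; a comparison that includes float scales would likely need a tolerance instead.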

Validation status

Kernel correctness — tested via unit tests only:

make quant-dot-test — 11/11 pass (scalar vs AVX2 cross-checked on 100 random blocks per format)
make cpu-avx2 — builds all five binaries
make ds4_cpu.o -B — only pre-existing warnings
AVX512 paths compile-clean with -mavx512f -mavx512bw and -mavx512vnni

E2E model inference not tested — I do not have enough RAM to load a DeepSeek V4 model and run actual inference. The dot product kernels are verified correct in isolation, but real-world tok/s and output quality need validation on a large-RAM machine.

Community testing needed. If you have a large-RAM x86 machine (AVX2 or AVX512):

make quant-dot-test    # unit test first
make cpu               # or make cpu-avx2 / make cpu-avx512
./ds4 --cpu -m /path/to/model.gguf -p "Explain merge sort in one paragraph."

Useful reports:

CPU model and ISA support (lscpu | grep -i avx)
Model quant format (Q2_K, Q4_K, IQ2_XXS, etc.)
Whether make quant-dot-test passes
First-token latency and tokens/sec
Whether output looks sane (no garbage, no crash)

Notes

No runtime CPU feature detection is added. SIMD paths are compile-time via compiler macros.
This does not change CUDA or Metal paths.
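Since the path is fixed at compile time, one way to see which one a given binary contains is a macro ladder like the sketch below. The function simd_path_name and its return strings are hypothetical and not part of this PR; the macros are the ones GCC and Clang define for the corresponding -m flags.

```c
/* Reports which dot-product path this translation unit was compiled for.
   Checks the widest ISA first so a VNNI build does not report plain AVX2. */
static const char *simd_path_name(void) {
#if defined(__AVX512F__) && defined(__AVX512BW__) && defined(__AVX512VNNI__)
    return "avx512bw+vnni";
#elif defined(__AVX512F__) && defined(__AVX512BW__)
    return "avx512bw";
#elif defined(__AVX2__)
    return "avx2";
#else
    return "scalar";
#endif
}
```

The practical consequence of compile-time selection is that a binary built with a fixed ISA flag will fault with an illegal instruction on a CPU that lacks that ISA, so the binary has to match the target machine.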

Test Plan

make quant-dot-test — 11/11 pass (AVX2 cross-checked against scalar)
make cpu — builds successfully, default behavior unchanged
make cpu-avx2 — builds ds4-avx2 and siblings with fixed -mavx2 (no -march=native)
AVX512 compile-only check (ds4.c + test file, with/without VNNI)
git diff --check clean
E2E model inference on large-RAM x86 with AVX2 (need community help)
AVX512 runtime validation (need community help)

cpu: add x86 AVX2/AVX512 SIMD fast paths for quantized MoE dot products

3dbce9f

This was referenced May 30, 2026

q4-imatrix weights fail to load: "expected IQ2_XXS expert tensors" #114

Closed

does CPU support IQ4? #171

Closed

hexxyan wants to merge 1 commit into antirez:main from hexxyan:cpu-simd-dot-kernels
