cpu: add x86 AVX2/AVX512 SIMD fast paths for quantized MoE dot products #305

Open: hexxyan wants to merge 1 commit into antirez:main from hexxyan:cpu-simd-dot-kernels
hexxyan commented May 30, 2026

Summary

The x86 CPU path currently falls back to scalar code for all quantized dot products (Q2_K, Q4_K, IQ2_XXS) used in routed MoE expert inference. This PR adds AVX2 and AVX512BW SIMD fast paths, and explicit Makefile targets for building ISA-specific binaries.

The existing ARM NEON, scalar, Metal, and CUDA paths are unchanged.

Why does this matter?

Each MoE layer runs dot products against multiple expert weight blocks (typically 8 selected experts × gate/up/down projections). On x86 these are currently pure scalar loops that perform one multiply-accumulate per element. A 256-bit AVX2 register holds 8 float32, 16 int16, or 32 int8 lanes, so a single instruction covers that many elements; AVX512 doubles each of those counts. For CPU-only inference on large-RAM x86 machines, this should make routed expert evaluation substantially faster and directly improve tok/s, although the end-to-end gain has not been measured yet.
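To make the scalar-versus-SIMD difference concrete, here is a minimal sketch of an int8 dot product with a scalar reference and an AVX2 variant built on madd_epi16. It is an illustration only, not the kernel code in this PR: the names dot_i8_scalar and dot_i8_fast are hypothetical, and the real kernels also handle per-block scales and quantized weight unpacking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar reference: one multiply-accumulate per loop iteration. */
static int32_t dot_i8_scalar(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* Hypothetical fast path: 16 int8 pairs per iteration when built with AVX2,
 * scalar fallback otherwise. Assumes n is a multiple of 16. */
static int32_t dot_i8_fast(const int8_t *a, const int8_t *b, size_t n) {
    assert(n % 16 == 0);
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 16) {
        /* Sign-extend 16 int8 values to 16 int16 lanes. */
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        /* Multiply int16 lanes and add adjacent pairs into 8 int32 lanes. */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    /* Horizontal sum of the 8 int32 lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E)); /* swap 64-bit halves */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1)); /* swap adjacent lanes */
    return _mm_cvtsi128_si32(s);
#else
    return dot_i8_scalar(a, b, n);
#endif
}
```

Because the arithmetic is pure integer, both functions should return bit-identical results, which is what makes a scalar cross-check practical.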

What changed?

New SIMD dot product kernels (compile-time selected via __AVX2__ / __AVX512F__+__AVX512BW__):

Kernel AVX2 AVX512BW AVX512_VNNI
Q2_K × Q8_K
Q4_K × Q8_K ✓ (dpbusd)
IQ2_XXS × Q8_K
IQ2_XXS pair × Q8_K ✓ (delegates) ✓ (delegates)
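Compile-time selection of this kind can be sketched as a preprocessor ladder. The helper name quant_dot_isa is hypothetical, and the use of __AVX512VNNI__ and __ARM_NEON as gating macros is an assumption based on the macros GCC and Clang predefine; the PR text itself only names __AVX2__, __AVX512F__, and __AVX512BW__.

```c
/* Hypothetical helper: reports which kernel family the preprocessor picked.
 * The most specific ISA is tested first so that a VNNI build does not
 * silently fall through to the plain AVX512BW or AVX2 branch. */
static const char *quant_dot_isa(void) {
#if defined(__AVX512F__) && defined(__AVX512BW__) && defined(__AVX512VNNI__)
    return "avx512bw+vnni";
#elif defined(__AVX512F__) && defined(__AVX512BW__)
    return "avx512bw";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "scalar";
#endif
}
```

Printing the returned string at startup is one simple way to confirm which path a given binary was built with.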

Key implementation details:

  • Q2_K: 2-bit extraction via shift+mask, madd_epi16 for signed dot product
  • Q4_K: maddubs_epi16 (unsigned × signed byte), AVX512 variant with dpbusd_epi32 when VNNI is available
  • IQ2_XXS: grid/sign lookup stays scalar (LUT-dependent), only int8 dot uses SIMD via madd_epi16
  • Proper zero-extension (ds4_zext_i128_to_i256) of 128-bit inputs before the AVX512 cvtepi8_epi16 widening, so the upper 128 bits are zero instead of undefined garbage
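The shift+mask extraction and the unsigned × signed byte product mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the packing order (lowest bits first) is assumed and is not the actual Q2_K or Q4_K block layout, and unpack_2bit, unpack_4bit, and dot_u8_i8 are hypothetical names. The AVX2 branch relies on maddubs_epi16 saturating to int16, so it assumes weights of at most 15 (4-bit), where a pair sum cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Unpack four 2-bit values per byte, lowest bits first (assumed layout). */
static void unpack_2bit(const uint8_t *packed, size_t nbytes, uint8_t *out) {
    for (size_t i = 0; i < nbytes; i++)
        for (int shift = 0; shift < 8; shift += 2)
            out[4 * i + (size_t)(shift / 2)] = (uint8_t)((packed[i] >> shift) & 0x03);
}

/* Unpack two 4-bit values per byte: low nibble first, then high nibble. */
static void unpack_4bit(const uint8_t *packed, size_t nbytes, uint8_t *out) {
    for (size_t i = 0; i < nbytes; i++) {
        out[2 * i]     = (uint8_t)(packed[i] & 0x0F);
        out[2 * i + 1] = (uint8_t)(packed[i] >> 4);
    }
}

/* Dot product of unsigned weights (assumed <= 15) with signed activations. */
static int32_t dot_u8_i8(const uint8_t *w, const int8_t *x, size_t n) {
    int32_t sum = 0;
    size_t i = 0;
#if defined(__AVX2__)
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc = _mm256_setzero_si256();
    for (; i + 32 <= n; i += 32) {
        __m256i vw = _mm256_loadu_si256((const __m256i *)(w + i));
        __m256i vx = _mm256_loadu_si256((const __m256i *)(x + i));
        /* u8 * i8 products, adjacent pairs added into int16 (saturating). */
        __m256i p16 = _mm256_maddubs_epi16(vw, vx);
        /* Multiply by 1 and add adjacent pairs to widen into int32 lanes. */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(p16, ones));
    }
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
    sum = _mm_cvtsi128_si32(s);
#endif
    /* Scalar loop: handles the tail, or everything when AVX2 is disabled. */
    for (; i < n; i++) sum += (int32_t)w[i] * (int32_t)x[i];
    return sum;
}
```

For example, the byte 0xE4 (binary 11 10 01 00) unpacks to the 2-bit values 0, 1, 2, 3 under this ordering.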

New Makefile targets for explicit ISA-specific binaries:

make cpu               # default: -march=native, matches build machine
make cpu-avx2          # fixed AVX2 binary (ds4-avx2, ds4-server-avx2, ...)
make cpu-avx512        # fixed AVX512BW binary
make cpu-avx512-vnni   # fixed AVX512BW+VNNI binary

Each SIMD variant uses separate build/cpu-<suffix>/ object directories to avoid .o file conflicts. Architecture gating uses $(filter x86_64 amd64,$(UNAME_M)) — works across both Darwin and Linux. Non-x86 hosts get a clear error message.

New test file: tests/test_quant_dot.c replaces tests/test_q2k_dot.c, covering all three quant formats:

  • Block size validation
  • Known-value dot product
  • 100-seed random reference comparison (scalar vs SIMD)
  • Multi-block accumulation
  • Q4_K nibble edge cases
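The seeded reference comparison could look roughly like the sketch below. It is not the contents of tests/test_quant_dot.c: cross_check, lcg_next, dot_ref, and dot_unrolled are hypothetical names, the 256-element block length is an assumption, and dot_unrolled is only a stand-in for a SIMD kernel so the example compiles on any host.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 256 /* assumed block length for this sketch */

typedef int32_t (*dot_fn)(const int8_t *, const int8_t *, size_t);

/* Small deterministic LCG so every seed reproduces the same block. */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 24; /* top 8 bits: 0..255 */
}

/* Scalar reference kernel. */
static int32_t dot_ref(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* Stand-in for a SIMD kernel: four independent partial sums plus a tail. */
static int32_t dot_unrolled(const int8_t *a, const int8_t *b, size_t n) {
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += (int32_t)a[i] * b[i];
        s1 += (int32_t)a[i + 1] * b[i + 1];
        s2 += (int32_t)a[i + 2] * b[i + 2];
        s3 += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) s0 += (int32_t)a[i] * b[i];
    return s0 + s1 + s2 + s3;
}

/* Returns how many seeded random blocks produced different results. */
static int cross_check(dot_fn ref, dot_fn cand, uint32_t num_seeds) {
    int8_t a[BLOCK_LEN], b[BLOCK_LEN];
    int mismatches = 0;
    for (uint32_t seed = 0; seed < num_seeds; seed++) {
        uint32_t state = seed;
        for (size_t i = 0; i < BLOCK_LEN; i++) {
            a[i] = (int8_t)((int)lcg_next(&state) - 128); /* -128..127 */
            b[i] = (int8_t)((int)lcg_next(&state) - 128);
        }
        if (ref(a, b, BLOCK_LEN) != cand(a, b, BLOCK_LEN)) mismatches++;
    }
    return mismatches;
}
```

Calling cross_check(dot_ref, dot_unrolled, 100) mirrors the 100-seed comparison. Exact equality works here because the sketch is integer-only; kernels that apply float scales would likely need a tolerance instead.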

Validation status

Kernel correctness — tested via unit tests only:

  • make quant-dot-test — 11/11 pass (scalar vs AVX2 cross-checked on 100 random blocks per format)
  • make cpu-avx2 — builds all five binaries
  • make ds4_cpu.o -B — only pre-existing warnings
  • AVX512 paths compile-clean with -mavx512f -mavx512bw and -mavx512vnni

E2E model inference not tested — I do not have enough RAM to load a DeepSeek V4 model and run actual inference. The dot product kernels are verified correct in isolation, but real-world tok/s and output quality need validation on a large-RAM machine.

Community testing needed. If you have a large-RAM x86 machine (AVX2 or AVX512):

make quant-dot-test    # unit test first
make cpu               # or make cpu-avx2 / make cpu-avx512
./ds4 --cpu -m /path/to/model.gguf -p "Explain merge sort in one paragraph."

Useful reports:

  • CPU model and ISA support (lscpu | grep -i avx)
  • Model quant format (Q2_K, Q4_K, IQ2_XXS, etc.)
  • Whether make quant-dot-test passes
  • First-token latency and tokens/sec
  • Whether output looks sane (no garbage, no crash)

Notes

  • No runtime CPU feature detection is added. SIMD paths are compile-time via compiler macros.
  • This does not change CUDA or Metal paths.

Test Plan

  • make quant-dot-test — 11/11 pass (AVX2 cross-checked against scalar)
  • make cpu — builds successfully, default behavior unchanged
  • make cpu-avx2 — builds ds4-avx2 and siblings with fixed -mavx2 (no -march=native)
  • AVX512 compile-only check (ds4.c + test file, with/without VNNI)
  • git diff --check clean
  • E2E model inference on large-RAM x86 with AVX2 (need community help)
  • AVX512 runtime validation (need community help)
