cpu: add x86 AVX2/AVX512 SIMD fast paths for quantized MoE dot products #305
Open
hexxyan wants to merge 1 commit into
This was referenced May 30, 2026
Closed
Summary
The x86 CPU path currently falls back to scalar code for all quantized dot products (Q2_K, Q4_K, IQ2_XXS) used in routed MoE expert inference. This PR adds AVX2 and AVX512BW SIMD fast paths, and explicit Makefile targets for building ISA-specific binaries.
The existing ARM NEON, scalar, Metal, and CUDA paths are unchanged.
Why does this matter?
Each MoE layer runs dot products against multiple expert weight blocks (typically 8 selected experts × gate/up/down projections). On x86 these are currently pure scalar loops that issue one multiply-accumulate per value. A single AVX2 instruction operates on 8×float32 or 32×int8 lanes (16 when the bytes are first widened to int16), and AVX512 doubles that again. For CPU-only inference on large-RAM x86 machines, this should make routed expert evaluation substantially faster, directly improving tok/s.
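To make the scalar-versus-SIMD difference concrete, here is a minimal sketch of an unsigned × signed byte dot product written both ways. It is not the code from this PR: the function names are hypothetical, it assumes GCC or Clang, it assumes the unsigned operand is a 4-bit quant (0..15) so the saturating `maddubs` step cannot overflow, and it uses a runtime CPU check instead of the PR's compile-time selection so that it builds without `-mavx2`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_X86 1
#endif

/* Scalar reference: one multiply-accumulate per element. */
static int32_t dot_u8s8_scalar(const uint8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

#ifdef SKETCH_X86
/* AVX2 version: 32 byte pairs per loop iteration.
 * Assumption: a[i] <= 15, so the int16 pair sums in maddubs never saturate. */
__attribute__((target("avx2")))
static int32_t dot_u8s8_avx2(const uint8_t *a, const int8_t *b, size_t n) {
    assert(n % 32 == 0);
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        /* u8 * s8, adjacent pairs summed into 16 int16 lanes */
        __m256i p16 = _mm256_maddubs_epi16(va, vb);
        /* adjacent int16 pairs summed into 8 int32 lanes */
        __m256i p32 = _mm256_madd_epi16(p16, ones);
        acc = _mm256_add_epi32(acc, p32);
    }
    /* Horizontal sum of the 8 int32 lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E)); /* swap 64-bit halves */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1)); /* swap 32-bit neighbours */
    return _mm_cvtsi128_si32(s);
}
#endif

/* Use the AVX2 path when the CPU supports it, otherwise fall back to scalar. */
static int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) {
#ifdef SKETCH_X86
    if (n % 32 == 0 && __builtin_cpu_supports("avx2"))
        return dot_u8s8_avx2(a, b, n);
#endif
    return dot_u8s8_scalar(a, b, n);
}
```

Calling `dot_u8s8(a, b, 64)` on two 64-element buffers should return the same value as `dot_u8s8_scalar(a, b, 64)` on any host; on non-x86 machines the wrapper simply runs the scalar loop.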
What changed?
New SIMD dot product kernels (compile-time selected via __AVX2__ / __AVX512F__ + __AVX512BW__):
- Q2_K × Q8_K
- Q4_K × Q8_K
- IQ2_XXS × Q8_K
- IQ2_XXS pair × Q8_K

Key implementation details:
- madd_epi16 for signed dot product
- maddubs_epi16 (unsigned × signed byte), AVX512 variant with dpbusd_epi32 when VNNI is available
- madd_epi16
- (ds4_zext_i128_to_i256) for AVX512 cvtepi8_epi16: avoids undefined upper-lane garbage

New Makefile targets for explicit ISA-specific binaries:

Each SIMD variant uses separate build/cpu-<suffix>/ object directories to avoid .o file conflicts. Architecture gating uses $(filter x86_64 amd64,$(UNAME_M)), which works across both Darwin and Linux. Non-x86 hosts get a clear error message.

New test file: tests/test_quant_dot.c replaces tests/test_q2k_dot.c, covering all three quant formats:

Validation status
Kernel correctness (tested via unit tests only):
- make quant-dot-test: 11/11 pass (scalar vs AVX2 cross-checked on 100 random blocks per format)
- make cpu-avx2: builds all five binaries
- make ds4_cpu.o -B: only pre-existing warnings
- -mavx512f -mavx512bw and -mavx512vnni

E2E model inference not tested: I do not have enough RAM to load a DeepSeek V4 model and run actual inference. The dot product kernels are verified correct in isolation, but real-world tok/s and output quality need validation on a large-RAM machine.
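The cross-check pattern described above (a compile-time selected fast path compared against the scalar reference on random blocks) can be sketched roughly as follows. This is an illustration, not the contents of tests/test_quant_dot.c: the function names, the 256-element block length, and the xorshift generator are all assumptions, and the kernel is a plain signed int8 dot product rather than one of the real quant formats. When built without `-mavx2` the "fast" path is just the scalar loop, so the check passes trivially.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

#define BLOCK 256 /* assumed block length, for illustration only */

/* Scalar reference kernel. */
static int32_t dot_s8_scalar(const int8_t *a, const int8_t *b) {
    int32_t sum = 0;
    for (int i = 0; i < BLOCK; i++) sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}

/* Fast path, chosen at compile time from the compiler's ISA macros. */
static int32_t dot_s8_fast(const int8_t *a, const int8_t *b) {
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < BLOCK; i += 16) {
        /* Sign-extend 16 int8 values to int16, then multiply and pair-sum. */
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int32_t sum = 0;
    for (int i = 0; i < 8; i++) sum += lanes[i];
    return sum;
#else
    return dot_s8_scalar(a, b); /* no SIMD available at compile time */
#endif
}

/* Small deterministic xorshift32 generator so failures are reproducible. */
static uint32_t rng_state = 0x12345678u;
static uint32_t rng_next(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Returns the number of random blocks on which the two kernels disagree. */
static int cross_check(int n_blocks) {
    assert(n_blocks > 0);
    int mismatches = 0;
    int8_t a[BLOCK], b[BLOCK];
    for (int t = 0; t < n_blocks; t++) {
        for (int i = 0; i < BLOCK; i++) {
            a[i] = (int8_t)((int)(rng_next() % 255u) - 127); /* -127..127 */
            b[i] = (int8_t)((int)(rng_next() % 255u) - 127);
        }
        if (dot_s8_fast(a, b) != dot_s8_scalar(a, b)) mismatches++;
    }
    return mismatches;
}
```

A test would call `cross_check(100)` and treat any non-zero return as a failure. Because integer dot products are exact, the comparison can use equality; kernels that apply float scales would need a tolerance instead.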
Community testing needed. If you have a large-RAM x86 machine (AVX2 or AVX512):
Useful reports:
- (lscpu | grep -i avx)
- make quant-dot-test passes

Notes
Test Plan
- make quant-dot-test: 11/11 pass (AVX2 cross-checked against scalar)
- make cpu: builds successfully, default behavior unchanged
- make cpu-avx2: builds ds4-avx2 and siblings with fixed -mavx2 (no -march=native)
- git diff --check clean