metal: simdgroup MMA mini-GEMM for decode MoE [experimental] #306
Open
hexxyan wants to merge 1 commit into
Experimental decode-path optimization: replace scalar dot-product MoE with tiled simdgroup_multiply_accumulate mini-GEMM for the 6 selected experts in DeepSeek-V4 MoE layers.
New kernels:
- kernel_mul_mm_selected_pair_swiglu: fused gate+up projection with shared RHS activation tile and SwiGLU, Q2_K/Q4_K/IQ2_XXS templates
- kernel_mul_mm_selected_sum: down projection accumulating all experts directly into the output row (eliminates separate sum6 kernel)
Host dispatch (ds4_metal.m):
- Opt-in via DS4_METAL_DECODE_MOE_MINIGEMM=1 env var
- Legacy path unchanged when env var is not set
- Stage profiling via DS4_METAL_MOE_STAGE_PROFILE for A/B comparison
Also adds tools/validate_metal.sh for offline Metal shader compilation checking (requires Metal Toolchain component).
Status: compile-verified (C/ObjC + 9662 lines Metal shader pass). Runtime correctness and tok/s benchmark pending Apple Silicon testing. Not yet recommended for production use.
@nhwaani would you be willing to give this experimental PR a quick spin on Apple Silicon? This adds an opt-in decode MoE mini-GEMM path using simdgroup MMA. The most useful checks would be:
DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model>
./ds4-server -m <model>
DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model>
Optional stage profile:
DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model>
DS4_METAL_DECODE_MOE_MINIGEMM=1 DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model>
No pressure at all. Thanks!
Summary
Experimental optimization for the DeepSeek-V4 decode MoE path: replace scalar dot-product kernels with tiled simdgroup_multiply_accumulate mini-GEMM for the 6 selected experts.
New Metal kernels (metal/moe.metal):
- kernel_mul_mm_selected_pair_swiglu: fused gate+up projection with shared RHS activation tile and SwiGLU activation. Template instantiations for Q2_K, Q4_K, IQ2_XXS × F32/F16.
- kernel_mul_mm_selected_sum: down projection that accumulates all 6 experts directly into the output row, eliminating the separate sum kernel.
Host dispatch (ds4_metal.m):
- DS4_METAL_DECODE_MOE_MINIGEMM=1 environment variable
- DS4_METAL_MOE_STAGE_PROFILE for A/B comparison
Validation tool (tools/validate_metal.sh):
- xcrun metal
- ds4_gpu_full_source() (base header + all .metal files)
Verification
- make ds4
- ds4_test --server passes
- test_q4k_dot passes
- ds4-eval --self-test-extractors passes
How to Test
Notes
This is an experimental prototype. The mini-GEMM path is disabled by default and only activates when the env var is set. No performance claims yet — Apple Silicon testing is needed to confirm whether the simdgroup MMA tiles outperform the existing fused scalar dot-product path for the decode MoE workload (6 experts × small batch).
The hypothesis is that for decode (small M, large K), simdgroup MMA tiles can better utilize the GPU matrix hardware than per-expert scalar dot products. But the tile overhead may dominate at small problem sizes, so empirical measurement is essential.
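To make the tile-overhead concern concrete, here is a small sketch of how much of a tile's token dimension does useful work at small M. It relies on Metal simdgroup matrices being 8x8. The helper name `tile_utilization` is hypothetical, and the assumption that the token dimension maps onto one 8-wide side of the tile with zero padding may not match how the PR's kernels lay out their tiles.

```c
#include <stddef.h>

#define SIMD_TILE 8  /* Metal simdgroup matrices are 8x8 */

/* Fraction of the tile's token dimension doing useful work when m tokens
 * are padded up to a multiple of the tile size (assumed layout). */
static double tile_utilization(size_t m) {
    size_t padded = (m + SIMD_TILE - 1) / SIMD_TILE * SIMD_TILE;
    return padded ? (double)m / (double)padded : 0.0;
}
```

Under that assumption, single-token decode (M = 1) fills 1/8 of the token dimension, so whether the matrix hardware still wins over the scalar dot-product path remains an empirical question.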