metal: simdgroup MMA mini-GEMM for decode MoE [experimental] by hexxyan · Pull Request #306 · antirez/ds4

hexxyan · 2026-05-31T01:25:37Z

Summary

Experimental optimization for the DeepSeek-V4 decode MoE path: replace scalar dot-product kernels with tiled simdgroup_multiply_accumulate mini-GEMM for the 6 selected experts.

New Metal kernels (metal/moe.metal):

kernel_mul_mm_selected_pair_swiglu — fused gate+up projection with shared RHS activation tile and SwiGLU activation. Template instantiations for Q2_K, Q4_K, IQ2_XXS × F32/F16.
kernel_mul_mm_selected_sum — down projection that accumulates all 6 experts directly into the output row, eliminating the separate sum kernel.
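As a rough reference for what the two kernels compute together, here is a minimal pure-Python sketch of the unquantized math. It assumes SwiGLU means `silu(gate) * up`, and that the down projection may be scaled by a per-expert routing weight. The `weights` parameter and all function names are hypothetical and not taken from the PR, and the quantized formats (Q2_K, Q4_K, IQ2_XXS) are ignored.

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    # w is a list of rows; returns w @ x as a plain list
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_decode_reference(x, experts, weights=None):
    """Scalar reference for one token.

    experts: list of (w_gate, w_up, w_down) for the selected experts.
    weights: optional per-expert scale (hypothetical; defaults to 1.0).
    """
    if weights is None:
        weights = [1.0] * len(experts)
    out = None
    for (w_gate, w_up, w_down), rw in zip(experts, weights):
        # "pair_swiglu" stage: gate and up share the same input activation
        gate = matvec(w_gate, x)
        up = matvec(w_up, x)
        hidden = [silu(g) * u for g, u in zip(gate, up)]
        # "selected_sum" stage: accumulate every expert into one output row
        y = matvec(w_down, hidden)
        if out is None:
            out = [0.0] * len(y)
        for i, v in enumerate(y):
            out[i] += rw * v
    return out
```

A reference like this is only useful for checking a GPU path against known inputs; it says nothing about how the Metal kernels tile or schedule the work.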

Host dispatch (ds4_metal.m):

Opt-in via DS4_METAL_DECODE_MOE_MINIGEMM=1 environment variable
Default decode path completely unchanged
Stage profiling via DS4_METAL_MOE_STAGE_PROFILE for A/B comparison
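The opt-in behaviour described above can be illustrated with a small sketch. It assumes that only the exact value `1` enables the new path; the actual parsing in ds4_metal.m may accept other values, and the function names here are hypothetical.

```python
import os

def minigemm_enabled(env=None):
    # Assumption: only the literal string "1" turns the path on.
    env = os.environ if env is None else env
    return env.get("DS4_METAL_DECODE_MOE_MINIGEMM", "") == "1"

def select_decode_moe_path(env=None):
    # The default (scalar) path is used whenever the variable is unset.
    return "minigemm" if minigemm_enabled(env) else "scalar"
```

Passing a plain dict instead of the process environment keeps the selection logic easy to exercise without touching global state.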

Validation tool (tools/validate_metal.sh):

Offline Metal shader compilation check using xcrun metal
Mirrors the default source concatenation used by ds4_gpu_full_source() (base header + all .metal files)
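The concatenate-then-compile idea can be sketched as follows. This is not the contents of tools/validate_metal.sh: the header path, the alphabetical ordering of the `.metal` files, and both function names are assumptions made for illustration. The command shown is the standard `xcrun -sdk macosx metal -c` offline compile invocation.

```python
from pathlib import Path

def concat_metal_sources(header: Path, metal_dir: Path) -> str:
    # Base header first, then every .metal file.
    # Assumption: files are appended in sorted order.
    parts = [header.read_text()]
    for path in sorted(metal_dir.glob("*.metal")):
        parts.append(path.read_text())
    return "\n".join(parts)

def metal_compile_command(source: Path, output: Path) -> list[str]:
    # Offline compile of one Metal source file to an AIR object.
    return ["xcrun", "-sdk", "macosx", "metal",
            "-c", str(source), "-o", str(output)]
```

On a Mac with the Metal toolchain installed, the returned list could be passed to `subprocess.run(..., check=True)` after writing the concatenated text to a temporary `.metal` file.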

Verification

C/ObjC build passes (make ds4)
ds4_test --server passes
test_q4k_dot passes
ds4-eval --self-test-extractors passes
Full concatenated Metal source (~9.6k lines) compiles cleanly (unused-function warnings only)
Pending: Apple Silicon runtime pipeline creation
Pending: numerical correctness vs scalar path
Pending: tok/s benchmark (baseline vs mini-GEMM)
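For the numerical correctness item, one possible check is an element-wise comparison of outputs from the two paths. The sketch below assumes the outputs are available as two equal-length lists of floats. The PR does not describe a dump mechanism, so how they are obtained is left open, and the tolerance values are placeholders rather than project thresholds.

```python
def compare_outputs(baseline, candidate, rel_tol=1e-2, abs_tol=1e-3):
    """Return (ok, max_abs_err, worst_index) for two float sequences."""
    if len(baseline) != len(candidate):
        raise ValueError("length mismatch")
    ok = True
    max_abs = 0.0
    worst = -1
    for i, (b, c) in enumerate(zip(baseline, candidate)):
        err = abs(b - c)
        if err > max_abs:
            max_abs, worst = err, i
        # Written as "not (err <= bound)" so that a NaN also fails.
        if not (err <= abs_tol + rel_tol * abs(b)):
            ok = False
    return ok, max_abs, worst
```

Quantized weights and a different accumulation order make bit-exact agreement unlikely, which is why the sketch uses a mixed absolute and relative tolerance instead of equality.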

How to Test

# Build
make ds4

# Validate Metal shaders offline
tools/validate_metal.sh

# Baseline (scalar path)
./ds4-server -m <model.gguf>

# Mini-GEMM prototype
DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model.gguf>

# With stage profiling for A/B comparison
DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model.gguf>
DS4_METAL_DECODE_MOE_MINIGEMM=1 DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model.gguf>

Notes

This is an experimental prototype. The mini-GEMM path is disabled by default and only activates when the env var is set. No performance claims are made yet: Apple Silicon testing is needed to confirm whether the simdgroup MMA tiles outperform the existing fused scalar dot-product path for the decode MoE workload (6 experts × small batch).

The hypothesis is that for decode (small M, large K), simdgroup MMA tiles can better utilize the GPU matrix hardware than per-expert scalar dot products. But the tile overhead may dominate at small problem sizes, so empirical measurement is essential.
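The tile-overhead concern can be made concrete with a little arithmetic. Metal's simdgroup matrices are 8×8, so if the small dimension M is mapped onto tile rows (an assumption about the kernel layout, not something the PR states), any M below 8 leaves part of each tile idle. The sketch below computes that fraction and includes a plain tiled multiply that accumulates along K one tile at a time, to show that tiling changes the loop order but not the result. All names are hypothetical.

```python
TILE = 8  # Metal simdgroup matrices are 8x8

def tile_row_utilization(m, tile=TILE):
    # Fraction of tile rows doing useful work when m is padded up to a tile.
    padded = -(-m // tile) * tile  # ceiling to a multiple of tile
    return m / padded

def tiled_matmul(a, b, tile=TILE):
    """C = A @ B, accumulated tile by tile along the K dimension."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # One multiply-accumulate step on a (partial) tile
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c
```

Under that assumption a single decode token would fill one row in eight of a tile, which is the reason the hypothesis needs a measured tok/s comparison rather than a theoretical argument.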

Experimental decode-path optimization: replace scalar dot-product MoE with tiled simdgroup_multiply_accumulate mini-GEMM for the 6 selected experts in DeepSeek-V4 MoE layers.

New kernels:
- kernel_mul_mm_selected_pair_swiglu: fused gate+up projection with shared RHS activation tile and SwiGLU, Q2_K/Q4_K/IQ2_XXS templates
- kernel_mul_mm_selected_sum: down projection accumulating all experts directly into the output row (eliminates separate sum6 kernel)

Host dispatch (ds4_metal.m):
- Opt-in via DS4_METAL_DECODE_MOE_MINIGEMM=1 env var
- Legacy path unchanged when env var is not set
- Stage profiling via DS4_METAL_MOE_STAGE_PROFILE for A/B comparison

Also adds tools/validate_metal.sh for offline Metal shader compilation checking (requires Metal Toolchain component).

Status: compile-verified (C/ObjC + 9662 lines Metal shader pass). Runtime correctness and tok/s benchmark pending Apple Silicon testing. Not yet recommended for production use.

hexxyan · 2026-05-31T01:40:24Z

@nhwaani would you be willing to give this experimental PR a quick spin on Apple Silicon?

This adds an opt-in decode MoE mini-GEMM path using simdgroup_multiply_accumulate. It is disabled by default, so the existing path should be unchanged.

The most useful checks would be:

Does it build and run without crashing?

DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model>

Do you see any obvious numerical/output issues compared with the default path?
If you have time, a quick tok/s comparison would be amazing:

./ds4-server -m <model>
DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model>

Optional stage profile:

DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model>
DS4_METAL_DECODE_MOE_MINIGEMM=1 DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model>

No pressure at all. Thanks!

hexxyan wants to merge 1 commit into antirez:main from hexxyan:codex/decode-moe-minigemm
