Added prefill strategy benchmarking script and results by jijiaz · Pull Request #1923 · intel/auto-round

jijiaz · 2026-06-15T07:31:07Z

MoE Prefill Optimization Strategy Benchmarking

We benchmarked optimization strategies for MoE prefill kernels using realistic token distributions from production workloads. The experiment compared two approaches: the current per-expert baseline and a full pad-and-clip strategy that consolidates all experts into fewer kernel launches by padding.

Key Findings:

- Optimal threshold identification: a 256-token threshold consistently delivers the best performance across all quantization formats.
- Moderate speedups achieved: 1.05-1.12x improvements for most INT/FP8 formats at optimal thresholds.
- Launch reduction effectiveness: kernel launches reduced from 238 to ~38 (84% reduction) while maintaining competitive performance.
- Format-agnostic benefits: performance gains are consistent across INT8/INT4/INT2 (sym/asym) and FP8 (E4M3/E5M2) quantization schemes.

Performance Summary (representative formats):

| Format | Baseline (ms) | Optimized (ms) | Speedup | Threshold (tokens) |
|---|---|---|---|---|
| INT8_sym | 69.36 | 63.58 | 1.09x | 256 |
| INT4_sym | 68.09 | 62.07 | 1.10x | 256 |
| INT2_sym | 66.51 | 59.62 | 1.12x | 256 |
| FP8_E4M3 | 72.27 | 67.24 | 1.07x | 256 |
| FP8_E5M2 | 73.09 | 68.10 | 1.07x | 256 |
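The speedup column and the launch-reduction percentage can be re-derived from the reported numbers. The snippet below is only a sanity check on the arithmetic; the latencies and launch counts are copied from this description, and the helper name `speedup` is illustrative.

```python
# Baseline and optimized latencies (ms) copied from the performance summary.
results = {
    "INT8_sym": (69.36, 63.58),
    "INT4_sym": (68.09, 62.07),
    "INT2_sym": (66.51, 59.62),
    "FP8_E4M3": (72.27, 67.24),
    "FP8_E5M2": (73.09, 68.10),
}


def speedup(baseline_ms, optimized_ms):
    """Speedup is baseline latency divided by optimized latency."""
    return baseline_ms / optimized_ms


speedups = {fmt: round(speedup(b, o), 2) for fmt, (b, o) in results.items()}

# Launch counts quoted in the key findings: 238 (baseline) and ~38 (optimized).
launch_reduction_pct = round(100 * (238 - 38) / 238)

for fmt, s in speedups.items():
    print(f"{fmt}: {s:.2f}x")
print(f"launch reduction: {launch_reduction_pct}%")
```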

The results demonstrate that strategic padding can effectively trade memory bandwidth for reduced kernel launch overhead in MoE workloads. The approach brings moderate (5-12%) speedups with the current kernel.
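The description does not spell out how pad-and-clip schedules its launches, so the following is only one plausible reading, written as a launch-counting sketch: every non-empty expert is padded (or clipped) to the threshold and handled in a single fused launch, and each expert whose token count exceeds the threshold gets one extra launch for its overflow. The function names and this scheduling rule are assumptions, not the benchmark script's actual logic.

```python
def count_launches(tokens_per_expert, threshold=256):
    """Return (baseline_launches, pad_and_clip_launches) under the assumed scheme."""
    nonempty = [t for t in tokens_per_expert if t > 0]
    baseline = len(nonempty)  # per-expert baseline: one launch per non-empty expert
    if not nonempty:
        return 0, 0
    # Assumption: one fused launch covers all non-empty experts at `threshold`
    # rows each; experts above the threshold need one extra launch for overflow.
    overflow_experts = sum(1 for t in nonempty if t > threshold)
    return baseline, 1 + overflow_experts


def padding_overhead(tokens_per_expert, threshold=256):
    """Rows in the fused launch that are padding rather than real tokens."""
    return sum(threshold - min(t, threshold) for t in tokens_per_expert if t > 0)


counts = [0, 10, 300, 256, 700]  # toy token distribution, not benchmark data
print(count_launches(counts))    # launches before and after
print(padding_overhead(counts))  # padded rows paid for the fused launch
```

Under this reading, a lower threshold wastes fewer padded rows but leaves more experts with overflow launches, which is the trade-off the threshold sweep explores.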

Benchmark Configuration: 256 experts, ~32K routed tokens, N=3072, K=1536, group_size=128.
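For readers who want to turn these latencies into throughput, a rough conversion is sketched below. It assumes the standard GEMM count of 2·N·K floating-point operations per routed token and ignores dequantization and padding work, so it may not match the TFLOPS figure computed by the PR's own performance test. The function name is hypothetical, and treating "~32K" as 32 × 1024 tokens is an assumption.

```python
def moe_gemm_tflops(routed_tokens, n, k, latency_ms):
    """Approximate achieved TFLOPS for one MoE GEMM pass.

    Assumes 2 * N * K FLOPs per routed token (one multiply and one add per
    weight element); dequant and padded rows are not counted.
    """
    flops = 2 * routed_tokens * n * k
    seconds = latency_ms * 1e-3
    return flops / seconds / 1e12


# Example call with the benchmark dimensions and the INT8_sym optimized latency.
# 32 * 1024 is an assumed stand-in for "~32K routed tokens".
print(f"{moe_gemm_tflops(32 * 1024, 3072, 1536, 63.58):.2f} TFLOPS")
```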

Signed-off-by: lkk12014402 <kaokao.lv@intel.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…emma4-unified (intel#1879) Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

- Use [E, N, K] weight layout for pack helpers (matches their contract and the working correctness tests in test_moe.py).
- Drop the erroneous dequant.transpose(1, 2); dequant is already [E, N, K], which is what the baseline expects.
- Call ark.moe_gemm_prefill for quantized timing (the dedicated quantized prefill kernel) instead of the FP-only ark.moe_gemm.
- Add a per-test skip guard for quantized tests requiring moe_gemm_prefill so the FP perf cases still run when only moe_gemm is present.

…rness Print both `base mm(ms)` (matmul-only baseline, prior behavior) and `base+deq(ms)` (dequant + matmul, the apples-to-apples comparison for the current Stage-1 `moe_gemm_prefill`, which materialises an fp16/bf16 workspace before dispatching to the FP GEMM). Speedup is now reported against the dequant-inclusive baseline so the numbers reflect real end-to-end quantized prefill cost. The FP perf test (`test_perf_fp`) is unchanged: it has no dequant pass.

For the quantized paths the Python `moe_gemm_prefill` wrapper previously allocated a fresh `E*K*N*sizeof(act_dtype)` scratch tensor on every call. For real MoE prefill workloads the same shape repeats every step, so the allocator overhead is pure waste — and dominates the small-shape numbers. Move the workspace to a module-level cache keyed by `(device, dtype, E, K, N)`. The unquantized fast path is unchanged: it still uses a per-call transposed copy of `weights`. Added `clear_moe_prefill_workspace_cache()` for callers that need to drop the cached buffers.
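The caching pattern described above can be sketched in a few lines. This is not the wrapper's code: `get_workspace` and the `allocate` callback are hypothetical names, and a `bytearray` stands in for the device tensor so the example runs without any GPU library. Only the cache key `(device, dtype, E, K, N)` and the name `clear_moe_prefill_workspace_cache()` come from the commit message.

```python
# Module-level cache: one scratch buffer per (device, dtype, E, K, N) shape.
_WORKSPACE_CACHE = {}


def get_workspace(device, dtype, E, K, N, allocate):
    """Return a cached scratch buffer, allocating only on the first request."""
    key = (device, dtype, E, K, N)
    buf = _WORKSPACE_CACHE.get(key)
    if buf is None:
        buf = allocate(E * K * N)  # number of elements in the dequant workspace
        _WORKSPACE_CACHE[key] = buf
    return buf


def clear_moe_prefill_workspace_cache():
    """Drop all cached buffers so their memory can be reclaimed."""
    _WORKSPACE_CACHE.clear()


# Stand-in allocator: 2 bytes per element, as for fp16/bf16.
allocations = []


def fake_alloc(num_elements):
    allocations.append(num_elements)
    return bytearray(num_elements * 2)


a = get_workspace("xpu:0", "float16", 4, 8, 16, fake_alloc)
b = get_workspace("xpu:0", "float16", 4, 8, 16, fake_alloc)  # cache hit
print(a is b, len(allocations))
```

One consequence of this design is that a cached buffer is shared across calls with the same key, so callers must not assume its contents persist or are private between invocations.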

- Add `ark.moe(...)` dispatcher that auto-selects between `moe_gemm_decode` (GEMV-tuned, small tokens/expert) and `moe_gemm_prefill` (GEMM-tuned, many tokens/expert). Single call site for model code; `phase="decode"` or `phase="prefill"` skips the auto-dispatch host-device sync.
- Add `test_moe_unified.py`: bit-parity tests vs the underlying kernels across fp/int8/int4/int2/fp8 + dispatch correctness + error path.
- Add `test_moe_model_perf.py`: model-level forward (1 prefill + N decode steps over L MoE layers) comparing always_prefill, always_decode, manual_branch, unified_auto, unified_hinted strategies.
- `moe_gemm_decode` / `moe_gemm_prefill` are kept for backward compatibility.
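The dispatch rule in the first bullet can be illustrated with a small selector. The real `ark.moe(...)` signature and its selection heuristic are not shown in this PR text, so the function name, the `decode_max_tokens` cutoff, and the "busiest expert" criterion below are all assumptions. The sketch returns the kernel name rather than calling a kernel.

```python
def select_moe_kernel(tokens_per_expert, phase=None, decode_max_tokens=8):
    """Pick a kernel name for one MoE layer call.

    An explicit `phase` hint bypasses inspection of the token counts; in the
    real dispatcher those counts live on the device, which is why a hint
    avoids a host-device sync. The cutoff value is a hypothetical placeholder.
    """
    if phase is not None:
        if phase not in ("decode", "prefill"):
            raise ValueError(f"unknown phase: {phase!r}")
        return f"moe_gemm_{phase}"
    busiest = max(tokens_per_expert, default=0)
    if busiest <= decode_max_tokens:
        return "moe_gemm_decode"   # GEMV-tuned path, few tokens per expert
    return "moe_gemm_prefill"      # GEMM-tuned path, many tokens per expert


print(select_moe_kernel([1, 0, 2]))               # auto-dispatch, small batch
print(select_moe_kernel([512, 3, 40]))            # auto-dispatch, large batch
print(select_moe_kernel([1, 0, 2], phase="prefill"))  # hint overrides counts
```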

- Add test/test_ark/test_moe_model_perf.py: real MoE LLM perf scan (tiny by default, AR_MOE_PERF_FULL=1 for full checkpoints) over prefill + per-token decode latency. Compares FP reference vs ARK backend (and optional GPTQModel). Asserts ARK decode <= 2x FP to guard against silent ark.moe dispatcher regressions.
- Remove obsolete auto_round_extension/ark/test/test_moe_model_perf.py (synthetic kernel-trace bench; replaced by the model-level one).
- Register the `perf` pytest marker in pyproject.toml.
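The "ARK decode <= 2x FP" assertion amounts to a latency-ratio guard. A minimal sketch follows; the helper names, the warmup and iteration counts, and the use of a median are assumptions rather than details of the actual test, and plain Python callables stand in for the model forward passes.

```python
import statistics
import time


def median_latency_ms(fn, warmup=2, iters=10):
    """Median wall-clock latency of `fn` in milliseconds after a short warmup."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def assert_decode_within_budget(ark_ms, fp_ms, max_ratio=2.0):
    """Fail if the ARK decode latency exceeds `max_ratio` times the FP reference."""
    assert ark_ms <= max_ratio * fp_ms, (
        f"ARK decode {ark_ms:.3f} ms exceeds {max_ratio}x FP ({fp_ms:.3f} ms)"
    )


# Stand-in workload; a real test would time one decode step of each backend.
latency = median_latency_ms(lambda: sum(range(1000)))
assert_decode_within_budget(ark_ms=1.5, fp_ms=1.0)  # within the 2x budget
```

A loose 2x bound like this is meant to catch gross regressions, such as the dispatcher routing decode steps to the prefill kernel, rather than small performance drift.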

jijiaz and others added 29 commits June 15, 2026 11:53

Added prefill strategy benchmarking script and results

e18c72e

[pre-commit.ci] auto fixes from pre-commit.com hooks

dc8cb42

for more information, see https://pre-commit.ci

Update auto-round-lib release package build (intel#1895)

d3f9e73

Signed-off-by: chensuyue <suyue.chen@intel.com>

Fix CI coverage & bug grep issue (intel#1893)

bc97b13

Signed-off-by: chensuyue <suyue.chen@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

feat: add MXFP4/MXFP8 quantization support (llmc_compressor format) and related tests (intel#1865)

5e230ff

Signed-off-by: Xin He <xin3.he@intel.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix slow startup time of pytest coverage for unit tests (intel#1899)

1766dfa

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

feat: improve review-pr skill score from 76% to 90% (intel#1901)

98b004a

Co-authored-by: Yogesh Rao <yogesh-tessl@users.noreply.github.com> Co-authored-by: Liang Lv <liang1.lv@intel.com>

update llama-cpp-python installation for CUDA CI (intel#1907)

20ed3b0

Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>

fix gguf opt-rtn regression (intel#1905)

fbe2d7c

Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix: guard zero-division in GGUF quant kernels to avoid NaN block scales (intel#1909)

69037fc

Signed-off-by: Entrpi <entrpi@proton.me>

fallback compute type on b70 if needed (intel#1904)

f9211d5

Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[ARK] update README (intel#1906)

3257144

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix inplace rotation issue (intel#1903)

39f5b94

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix awq cuda CI (intel#1912)

95d4840

Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

enable MXFP clamp for model free (intel#1914)

a08f80a

Signed-off-by: Xin He <xin3.he@intel.com>

feat: add MoE prefill performance test with TFLOPS calculation

49fac43

docs: add MoE prefill performance test documentation

e6f3f72

refactor: change ARK() instance to module reference in MoE perf tests

eefb346

perf(moe_prefill): skip dequant for experts with zero tokens

4fc0e27

perf(moe_prefill): pack PACK_K K-outputs per dequant work-item

aedacdd

docs(moe_prefill): clarify PACK_K must divide group_size for hoist

4f2acb9

test: add accuracy UT for MoE prefill (ark.moe_gemm / moe_gemm_prefill)

732dbf2

jijiaz wants to merge 29 commits into intel:copilot/add-xpu-moe-decode-implementation from jijiaz:prefill-perf-benchmark
