Added prefill strategy benchmarking script and results #1923

Open
jijiaz wants to merge 29 commits into intel:copilot/add-xpu-moe-decode-implementation from jijiaz:prefill-perf-benchmark

Conversation

jijiaz commented Jun 15, 2026


MoE Prefill Optimization Strategy Benchmarking

We benchmarked optimization strategies for MoE prefill kernels using realistic token distributions from production workloads. The experiment compared two approaches: the current per-expert baseline and a full pad-and-clip strategy that pads experts so they can be consolidated into fewer kernel launches.
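
To make the launch-count trade-off concrete, here is a minimal sketch of a threshold-based padding model. It reflects only one possible reading of the strategy and is not the benchmarked implementation: the function name `plan_launches`, the grouping rule, and the sample token counts are all assumptions, and the clipping step is not modeled.

```python
from typing import List, Tuple


def plan_launches(tokens_per_expert: List[int], threshold: int) -> Tuple[int, int, int]:
    """Return (baseline_launches, padded_launches, padding_rows) for a toy model.

    Assumptions (not confirmed by the PR):
      * experts with zero routed tokens launch nothing;
      * experts with 1..threshold tokens are padded up to `threshold` rows
        and share a single batched launch;
      * experts above the threshold keep one launch each.
    """
    active = [t for t in tokens_per_expert if t > 0]
    small = [t for t in active if t <= threshold]
    large = [t for t in active if t > threshold]

    baseline_launches = len(active)                      # one launch per active expert
    padded_launches = len(large) + (1 if small else 0)   # one shared launch for the small group
    padding_rows = sum(threshold - t for t in small)     # wasted rows introduced by padding
    return baseline_launches, padded_launches, padding_rows


# Hypothetical token counts for eight experts (illustrative only).
counts = [0, 40, 120, 256, 300, 900, 10, 0]
baseline, padded, padding_rows = plan_launches(counts, threshold=256)
print(baseline, padded, padding_rows)
```

A larger threshold moves more experts into the shared launch but increases `padding_rows`, which is the cost the benchmark weighs against launch overhead.
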

Key Findings:

  • Optimal threshold: a 256-token threshold consistently delivers the best performance across all quantization formats
  • Moderate speedups: 1.05-1.12x for most INT/FP8 formats at the optimal threshold
  • Launch reduction: kernel launches drop from 238 to ~38 (84% reduction) while performance stays competitive
  • Format-agnostic benefits: gains are consistent across INT8/INT4/INT2 (sym/asym) and FP8 (E4M3/E5M2) quantization schemes

Performance Summary (representative formats):

| Format | Baseline (ms) | Optimized (ms) | Speedup | Threshold (tokens) |
|---|---|---|---|---|
| INT8_sym | 69.36 | 63.58 | 1.09x | 256 |
| INT4_sym | 68.09 | 62.07 | 1.10x | 256 |
| INT2_sym | 66.51 | 59.62 | 1.12x | 256 |
| FP8_E4M3 | 72.27 | 67.24 | 1.07x | 256 |
| FP8_E5M2 | 73.09 | 68.10 | 1.07x | 256 |

The results show that strategic padding can trade memory bandwidth for reduced kernel launch overhead in MoE workloads. With the current kernel, the approach yields moderate speedups of 5-12%.
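
The reported speedups and launch reduction can be re-derived from the figures in this description. The snippet below is only a consistency check on those published numbers; the variable names are illustrative, and the optimized launch count is approximate in the original (~38).

```python
# Baseline and optimized latencies (ms) copied from the performance summary.
results = {
    "INT8_sym": (69.36, 63.58),
    "INT4_sym": (68.09, 62.07),
    "INT2_sym": (66.51, 59.62),
    "FP8_E4M3": (72.27, 67.24),
    "FP8_E5M2": (73.09, 68.10),
}

# Speedup = baseline latency / optimized latency, rounded as in the summary.
speedups = {fmt: round(base / opt, 2) for fmt, (base, opt) in results.items()}

# Launch reduction, using the approximate optimized launch count of 38.
baseline_launches, optimized_launches = 238, 38
launch_reduction_pct = round(100 * (baseline_launches - optimized_launches) / baseline_launches)

for fmt, s in speedups.items():
    print(f"{fmt}: {s:.2f}x")
print(f"launch reduction: {launch_reduction_pct}%")
```
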

Benchmark Configuration: 256 experts, ~32K routed tokens, N=3072, K=1536, group_size=128.

jijiaz and others added 29 commits June 15, 2026 11:53
Signed-off-by: lkk12014402 <kaokao.lv@intel.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
…emma4-unified (intel#1879)

Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…nd related tests (intel#1865)

Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Co-authored-by: Yogesh Rao <yogesh-tessl@users.noreply.github.com>
Co-authored-by: Liang Lv <liang1.lv@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Xin He <xin3.he@intel.com>
- Use [E, N, K] weight layout for pack helpers (matches their contract
  and the working correctness tests in test_moe.py).
- Drop the erroneous dequant.transpose(1, 2); dequant is already
  [E, N, K], which is what the baseline expects.
- Call ark.moe_gemm_prefill for quantized timing (the dedicated
  quantized prefill kernel) instead of the FP-only ark.moe_gemm.
- Add a per-test skip guard for quantized tests requiring
  moe_gemm_prefill so the FP perf cases still run when only moe_gemm
  is present.
…rness

Print both `base mm(ms)` (matmul-only baseline, prior behavior) and
`base+deq(ms)` (dequant + matmul, the apples-to-apples comparison for
the current Stage-1 `moe_gemm_prefill` which materialises a fp16/bf16
workspace before dispatching to the FP GEMM). Speedup is now reported
against the dequant-inclusive baseline so the numbers reflect real
end-to-end quantized prefill cost.
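
As a rough illustration of why the two baselines differ, the sketch below computes both ratios from three timings. The helper name `report_speedups` and the sample timings are hypothetical and are not taken from the harness or its results.

```python
from typing import Tuple


def report_speedups(base_mm_ms: float, dequant_ms: float, kernel_ms: float) -> Tuple[float, float]:
    """Return (speedup_vs_matmul_only, speedup_vs_dequant_plus_matmul).

    base_mm_ms: matmul-only baseline time
    dequant_ms: time of the separate dequantization pass
    kernel_ms:  time of the quantized prefill call being measured
    """
    base_deq_ms = base_mm_ms + dequant_ms  # dequant-inclusive baseline
    return base_mm_ms / kernel_ms, base_deq_ms / kernel_ms


# Made-up timings, chosen only to show how the two ratios diverge.
mm_only, with_dequant = report_speedups(base_mm_ms=10.0, dequant_ms=2.5, kernel_ms=10.0)
print(f"vs base mm: {mm_only:.2f}x, vs base+deq: {with_dequant:.2f}x")
```
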

FP perf test (`test_perf_fp`) is unchanged: it has no dequant pass.
For the quantized paths the Python `moe_gemm_prefill` wrapper previously
allocated a fresh `E*K*N*sizeof(act_dtype)` scratch tensor on every call.
For real MoE prefill workloads the same shape repeats every step, so the
allocator overhead is pure waste — and dominates the small-shape numbers.

Move the workspace to a module-level cache keyed by
`(device, dtype, E, K, N)`. The unquantized fast path is unchanged: it
still uses a per-call transposed copy of `weights`. Added
`clear_moe_prefill_workspace_cache()` for callers that need to drop the
cached buffers.
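
A minimal sketch of the caching pattern described above, assuming a plain dictionary as the cache and a `bytearray` as a stand-in for the device tensor. Only the key layout and the name `clear_moe_prefill_workspace_cache` come from the commit message; `get_workspace`, `itemsize`, and `alloc` are hypothetical.

```python
from typing import Callable, Dict, Tuple

WorkspaceKey = Tuple[str, str, int, int, int]  # (device, dtype, E, K, N)

# Module-level cache: one scratch buffer per distinct key.
_workspace_cache: Dict[WorkspaceKey, bytearray] = {}


def get_workspace(device: str, dtype: str, E: int, K: int, N: int,
                  itemsize: int, alloc: Callable[[int], bytearray] = bytearray) -> bytearray:
    """Return a cached scratch buffer of E*K*N*itemsize bytes, allocating on first use."""
    key = (device, dtype, E, K, N)
    buf = _workspace_cache.get(key)
    if buf is None:
        # Cache miss: allocate once and reuse on every later call with the same key.
        buf = alloc(E * K * N * itemsize)
        _workspace_cache[key] = buf
    return buf


def clear_moe_prefill_workspace_cache() -> None:
    """Drop all cached buffers so their memory can be reclaimed."""
    _workspace_cache.clear()


first = get_workspace("xpu:0", "float16", E=2, K=4, N=8, itemsize=2)
second = get_workspace("xpu:0", "float16", E=2, K=4, N=8, itemsize=2)
print(first is second, len(first))
```

Keying on the shape and dtype means a repeated prefill step reuses the same buffer, at the cost of holding that memory until the cache is cleared.
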
- Add `ark.moe(...)` dispatcher that auto-selects between `moe_gemm_decode`
  (GEMV-tuned, small tokens/expert) and `moe_gemm_prefill` (GEMM-tuned,
  many tokens/expert). Single call site for model code; `phase="decode"`
  or `phase="prefill"` skips the auto-dispatch host-device sync.
- Add `test_moe_unified.py`: bit-parity tests vs the underlying kernels
  across fp/int8/int4/int2/fp8 + dispatch correctness + error path.
- Add `test_moe_model_perf.py`: model-level forward (1 prefill + N decode
  steps over L MoE layers) comparing always_prefill, always_decode,
  manual_branch, unified_auto, unified_hinted strategies.
- `moe_gemm_decode` / `moe_gemm_prefill` are kept for backward compatibility.
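
The dispatch logic in the first bullet might look roughly like the sketch below. It is not the `ark.moe` implementation: the function `select_moe_kernel`, the mean-tokens-per-expert heuristic, and the cutoff value are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Hypothetical cutoff: at or below this mean load per active expert, prefer the decode kernel.
DECODE_MAX_TOKENS_PER_EXPERT = 8


def select_moe_kernel(tokens_per_expert: Sequence[int], phase: Optional[str] = None) -> str:
    """Pick a kernel name from an explicit phase hint or from the routing counts."""
    if phase is not None:
        # Explicit hint: no need to inspect the counts. In a real backend the counts
        # live on the device, so this is the path that avoids a host-device sync.
        if phase not in ("decode", "prefill"):
            raise ValueError(f"unknown phase: {phase!r}")
        return f"moe_gemm_{phase}"

    active = [t for t in tokens_per_expert if t > 0]
    if not active:
        return "moe_gemm_decode"
    mean_tokens = sum(active) / len(active)
    if mean_tokens <= DECODE_MAX_TOKENS_PER_EXPERT:
        return "moe_gemm_decode"
    return "moe_gemm_prefill"


print(select_moe_kernel([1, 2, 0, 1]))
print(select_moe_kernel([512, 300]))
print(select_moe_kernel([512, 300], phase="decode"))
```
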
- Add test/test_ark/test_moe_model_perf.py: real MoE LLM perf scan
  (tiny by default, AR_MOE_PERF_FULL=1 for full checkpoints) over
  prefill + per-token decode latency. Compares FP reference vs ARK
  backend (and optional GPTQModel). Asserts ARK decode <= 2x FP to
  guard against silent ark.moe dispatcher regressions.
- Remove obsolete auto_round_extension/ark/test/test_moe_model_perf.py
  (synthetic kernel-trace bench; replaced by the model-level one).
- Register the `perf` pytest marker in pyproject.toml.