Added prefill strategy benchmarking script and results #1923
Open
jijiaz wants to merge 29 commits into
Signed-off-by: lkk12014402 <kaokao.lv@intel.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
…emma4-unified (intel#1879) Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: chensuyue <suyue.chen@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…nd related tests (intel#1865) Signed-off-by: Xin He <xin3.he@intel.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Co-authored-by: Yogesh Rao <yogesh-tessl@users.noreply.github.com> Co-authored-by: Liang Lv <liang1.lv@intel.com>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…les (intel#1909) Signed-off-by: Entrpi <entrpi@proton.me>
Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Xin He <xin3.he@intel.com>
- Use [E, N, K] weight layout for pack helpers (matches their contract and the working correctness tests in test_moe.py).
- Drop the erroneous dequant.transpose(1, 2); dequant is already [E, N, K], which is what the baseline expects.
- Call ark.moe_gemm_prefill for quantized timing (the dedicated quantized prefill kernel) instead of the FP-only ark.moe_gemm.
- Add a per-test skip guard for quantized tests requiring moe_gemm_prefill so the FP perf cases still run when only moe_gemm is present.
…rness

Print both `base mm(ms)` (matmul-only baseline, prior behavior) and `base+deq(ms)` (dequant + matmul, the apples-to-apples comparison for the current Stage-1 `moe_gemm_prefill`, which materialises a fp16/bf16 workspace before dispatching to the FP GEMM).

Speedup is now reported against the dequant-inclusive baseline so the numbers reflect real end-to-end quantized prefill cost.

The FP perf test (`test_perf_fp`) is unchanged: it has no dequant pass.
For the quantized paths the Python `moe_gemm_prefill` wrapper previously allocated a fresh `E*K*N*sizeof(act_dtype)` scratch tensor on every call. For real MoE prefill workloads the same shape repeats every step, so the allocator overhead is pure waste — and dominates the small-shape numbers. Move the workspace to a module-level cache keyed by `(device, dtype, E, K, N)`. The unquantized fast path is unchanged: it still uses a per-call transposed copy of `weights`. Added `clear_moe_prefill_workspace_cache()` for callers that need to drop the cached buffers.
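The caching idea in this commit message can be illustrated with a minimal, framework-free sketch. It is not the PR's code: `get_workspace`, `clear_workspace_cache`, the string device/dtype labels, and the `bytearray` stand-in for a device tensor are all hypothetical, chosen only so the example runs without a GPU runtime. The one detail taken from the message is the cache key `(device, dtype, E, K, N)` and the `E*K*N*sizeof(act_dtype)` buffer size.

```python
from typing import Dict, Tuple

# Key layout taken from the commit message: (device, dtype, E, K, N).
WorkspaceKey = Tuple[str, str, int, int, int]

# Module-level cache; a bytearray stands in for a device scratch tensor (assumption).
_WORKSPACE_CACHE: Dict[WorkspaceKey, bytearray] = {}

# Element sizes for the activation dtypes mentioned in the PR (fp16/bf16).
_DTYPE_BYTES = {"fp16": 2, "bf16": 2}


def get_workspace(device: str, dtype: str, E: int, K: int, N: int) -> bytearray:
    """Return a scratch buffer of E*K*N elements, allocating only on first use."""
    key = (device, dtype, E, K, N)
    buf = _WORKSPACE_CACHE.get(key)
    if buf is None:
        # First call for this shape: pay the allocation cost once.
        buf = bytearray(E * K * N * _DTYPE_BYTES[dtype])
        _WORKSPACE_CACHE[key] = buf
    return buf


def clear_workspace_cache() -> None:
    """Drop every cached buffer so the memory can be reclaimed."""
    _WORKSPACE_CACHE.clear()
```

Because the key includes the shape, a repeated prefill step with the same `(E, K, N)` reuses the buffer, while a new shape gets its own entry. The trade-off is that cached buffers stay alive until the cache is cleared explicitly.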
- Add `ark.moe(...)` dispatcher that auto-selects between `moe_gemm_decode` (GEMV-tuned, small tokens/expert) and `moe_gemm_prefill` (GEMM-tuned, many tokens/expert). Single call site for model code; `phase="decode"` or `phase="prefill"` skips the auto-dispatch host-device sync.
- Add `test_moe_unified.py`: bit-parity tests vs the underlying kernels across fp/int8/int4/int2/fp8 + dispatch correctness + error path.
- Add `test_moe_model_perf.py`: model-level forward (1 prefill + N decode steps over L MoE layers) comparing always_prefill, always_decode, manual_branch, unified_auto, unified_hinted strategies.
- `moe_gemm_decode` / `moe_gemm_prefill` are kept for backward compatibility.
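The dispatch behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not the `ark.moe` implementation: `choose_phase`, the `read_counts` callback (standing in for the host-device sync that fetches per-expert token counts), the `threshold` value, and the "busiest expert" heuristic are all hypothetical. Only the two phase names and the rule that an explicit hint skips the sync come from the commit message.

```python
from typing import Callable, Optional, Sequence


def choose_phase(
    read_counts: Callable[[], Sequence[int]],
    phase: Optional[str] = None,
    threshold: int = 8,  # hypothetical cut-over point, not taken from the PR
) -> str:
    """Pick "decode" or "prefill" for one MoE call.

    `read_counts` models the costly host-device sync; it is invoked only
    when no phase hint is supplied.
    """
    if phase is not None:
        if phase not in ("decode", "prefill"):
            raise ValueError(f"unknown phase: {phase!r}")
        return phase  # hinted path: no sync needed

    counts = read_counts()  # auto path: pay for the sync
    busiest = max(counts, default=0)
    # Many tokens per expert favour a GEMM-tuned kernel, few favour GEMV-tuned.
    return "prefill" if busiest > threshold else "decode"
```

Model code that already knows which phase it is in would pass `phase=...` on every call; the auto path is then a convenience for call sites that lack that information.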
- Add test/test_ark/test_moe_model_perf.py: real MoE LLM perf scan (tiny by default, AR_MOE_PERF_FULL=1 for full checkpoints) over prefill + per-token decode latency. Compares FP reference vs ARK backend (and optional GPTQModel). Asserts ARK decode <= 2x FP to guard against silent ark.moe dispatcher regressions.
- Remove obsolete auto_round_extension/ark/test/test_moe_model_perf.py (synthetic kernel-trace bench; replaced by the model-level one).
- Register the `perf` pytest marker in pyproject.toml.
MoE Prefill Optimization Strategy Benchmarking
We benchmarked optimization strategies for MoE prefill kernels using realistic token distributions from production workloads. The experiment compared two approaches: the current per-expert baseline and a full pad-and-clip strategy, which uses padding to consolidate all experts into fewer kernel launches.
Key Findings:
Performance Summary (representative formats):
The results show that padding can trade extra memory bandwidth for reduced kernel launch overhead in MoE workloads. With the current kernel, the approach yields a moderate speedup of 5-12%.
Benchmark Configuration: 256 experts, ~32K routed tokens, N=3072, K=1536, group_size=128.
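To get a feel for how much extra work padding introduces, the following sketch computes the padding overhead when each expert's token count is rounded up to a tile multiple. This is one plausible reading of the padding step, not the benchmark script itself: the function names and the `tile` parameter are hypothetical, and the real pad-and-clip strategy may choose padded sizes differently.

```python
from typing import Sequence, Tuple


def padded_rows(counts: Sequence[int], tile: int) -> Tuple[int, int]:
    """Return (real_rows, padded_rows) with each expert rounded up to `tile`."""
    real = sum(counts)
    # Ceiling division via negation: -(-c // tile) == ceil(c / tile) for c >= 0.
    padded = sum(-(-c // tile) * tile for c in counts)
    return real, padded


def padding_overhead(counts: Sequence[int], tile: int) -> float:
    """Fraction of extra rows processed because of padding."""
    real, padded = padded_rows(counts, tile)
    return (padded - real) / real if real else 0.0
```

For example, if the ~32K routed tokens were spread perfectly evenly over the 256 experts (128 each, assuming 32K means 32768), a hypothetical tile of 128 would add no padding at all. Skewed distributions, which are the realistic case, pay more, and that cost is what the reduced launch count has to outweigh.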