test(ark): add model-level MoE perf benchmark on XPU by Copilot · Pull Request #1813 · intel/auto-round

Copilot · 2026-05-14T04:27:10Z

Description

Replaces the synthetic kernel-trace auto_round_extension/ark/test/test_moe_model_perf.py with a real model-level perf benchmark under test/test_ark/test_moe_model_perf.py, modeled on test/test_ark/test_model.py. Exercises the full integration path (tokenizer → attention → MoE-MLP via ARK kernels → lm_head → sampler) instead of just the MoE GEMM in isolation.

New benchmark (test/test_ark/test_moe_model_perf.py)
- Loads Qwen/Qwen1.5-MoE-A2.7B and deepseek-ai/DeepSeek-V2-Lite; tiny slice (num_layers=2, num_experts=4) by default via helpers.get_tiny_model, full checkpoints gated behind AR_MOE_PERF_FULL=1.
- Quantizes with AutoRound(iters=0, nsamples=1, disable_opt_rtn=True), reloads on XPU through AutoRoundConfig(backend="ark").
- Parametrized over (model, bits∈{4,8}, dtype∈{fp16,bf16}).
- Measures prefill (single forward, 128-token synthetic prompt) and per-token decode (generate(max_new_tokens=32) after a 4-token warmup) using torch.xpu.Event; reports median of 3 runs.
- Compares FP reference vs ARK (and optionally GPTQModel — skipped if unavailable). Prints one table per dtype with prefill(ms) | decode(ms/tok) | tokens/s | speedup vs FP.
- Asserts ark_decode ≤ 2× fp_decode as a regression guard against silent ark.moe dispatcher slowdowns (e.g. the phase="auto" host-device sync overhead).
- Skip gates: no XPU, ARK extension missing / lacking moe_gemm_{decode,prefill} / ark.moe, or checkpoint not resolvable locally.
Removed auto_round_extension/ark/test/test_moe_model_perf.py (synthetic kernel-trace bench, superseded).
Registered the perf pytest marker in pyproject.toml so the suite is opt-in (-m perf) and excludable from default CI (-m "not perf").
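The measurement and regression-guard logic described above can be sketched as follows. This is a minimal, hardware-independent illustration: it uses `time.perf_counter` instead of `torch.xpu.Event`, and the helper names `median_latency_ms` and `check_decode_regression` are hypothetical, not taken from the test file.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def check_decode_regression(ark_ms_per_tok: float, fp_ms_per_tok: float,
                            max_ratio: float = 2.0) -> None:
    """Fail if ARK decode latency exceeds max_ratio times the FP reference."""
    assert ark_ms_per_tok <= max_ratio * fp_ms_per_tok, (
        f"ARK decode {ark_ms_per_tok:.3f} ms/tok exceeds "
        f"{max_ratio}x the FP reference {fp_ms_per_tok:.3f} ms/tok"
    )
```

On a real device the timed callable would also need to synchronize before the clock is read; device events handle that in the actual benchmark.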

Type of Change

Performance / Test

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/95841e6d-d5d1-4662-8db0-4dd69690bc28 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

for more information, see https://pre-commit.ci

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/91221649-2c90-4404-ae86-3321b1581428 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

a32543254 · 2026-05-18T15:33:53Z

@copilot resolve the merge conflicts in this pull request

…ecode-implementation # Conflicts: # auto_round_extension/ark/auto_round_kernel/ark.cpp Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot · 2026-05-18T15:36:17Z

> @copilot resolve the merge conflicts in this pull request

Merged origin/main in 430868d. The only conflict was in auto_round_extension/ark/auto_round_kernel/ark.cpp — I kept the new moe_gemm_decode_wrapper from this branch alongside the updated sage_dynamic_quant signature (added bias param) from main.

Copilot

Pull request overview

This PR adds an XPU-optimized MoE decode-phase GEMV kernel (small M per expert) with multiple weight formats, and wires it through the C++/PyTorch extension layer with corresponding unit tests.

Changes:

Added a SYCL decode GEMV kernel supporting FP16/BF16, INT8/INT4/INT2 (sym/asym), and FP8 (E4M3/E5M2) weights.
Exposed the kernel via pybind (moe_gemm_decode) and added a Python wrapper with argument validation.
Added unit tests covering the new decode paths and key validation error cases.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
auto_round_extension/ark/test/test_moe.py	Adds decode-path unit tests plus packing/dequant reference helpers for INT2/4/8 and FP8.
auto_round_extension/ark/auto_round_kernel/wrapper/include/sycl_tla_moe_decode.hpp	Introduces the new SYCL MoE decode GEMV kernel implementations and dispatch.
auto_round_extension/ark/auto_round_kernel/wrapper/include/sycl_tla_common.hpp	Declares the new `moe_gemm_decode` API (but docs currently lag implementation).
auto_round_extension/ark/auto_round_kernel/ark.cpp	Includes the new header and binds `moe_gemm_decode` via pybind.
auto_round_extension/ark/auto_round_kernel/__init__.py	Adds the `ARK.moe_gemm_decode` Python wrapper and validation logic.

Comments suppressed due to low confidence (2)

auto_round_extension/ark/auto_round_kernel/__init__.py:871

num_tokens_per_expert is converted to int32/contiguous but its device is not validated. If it’s a CPU tensor, the kernel will treat a host pointer as device memory. Please ensure num_tokens_per_expert is on XPU (and matches activations.device), or move it to XPU explicitly before calling into the extension.

            weights = weights.contiguous()

        if num_tokens_per_expert.dtype != torch.int32:
            num_tokens_per_expert = num_tokens_per_expert.to(torch.int32)
        if not num_tokens_per_expert.is_contiguous():
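One way the device check requested in this comment could look is sketched below. It is duck-typed on a `.device` attribute so it runs without PyTorch; the function name `require_same_device` is hypothetical, and comparing device strings is a simplifying assumption (for example, `"xpu"` and `"xpu:0"` would count as different here).

```python
from types import SimpleNamespace


def require_same_device(activations, num_tokens_per_expert) -> None:
    """Raise ValueError if the two tensor-like objects are on different devices."""
    act_dev = str(activations.device)
    cnt_dev = str(num_tokens_per_expert.device)
    if act_dev != cnt_dev:
        raise ValueError(
            f"num_tokens_per_expert is on {cnt_dev!r} but activations are on {act_dev!r}"
        )


# Stand-in objects; real callers would pass tensors.
acts = SimpleNamespace(device="xpu:0")
counts = SimpleNamespace(device="xpu:0")
require_same_device(acts, counts)  # passes silently
```

The alternative mentioned in the comment, moving the tensor to the activations' device before the call, trades a clear error for an implicit copy.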

auto_round_extension/ark/auto_round_kernel/__init__.py:896

group_size is used in modulo/division checks (e.g., K % group_size) without validating group_size > 0. Passing group_size=0 will raise a ZeroDivisionError rather than a clear ValueError. Please add an explicit check that group_size is a positive integer before any modulo/division operations.

            if scales is None:
                raise ValueError("scales is required for FP8 weights")
            if scales.dtype != activations.dtype:
                raise ValueError("scales dtype must match activations dtype")
            if K % group_size != 0:
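A minimal sketch of the suggested guard, assuming it runs before any modulo or division on `group_size`. The function name `validate_group_size` and the error messages are illustrative, not the wrapper's actual code.

```python
def validate_group_size(K: int, group_size: int) -> int:
    """Return the number of groups along K, or raise ValueError on bad input."""
    # Reject bool explicitly: it is a subclass of int in Python.
    if isinstance(group_size, bool) or not isinstance(group_size, int) or group_size <= 0:
        raise ValueError(f"group_size must be a positive integer, got {group_size!r}")
    if K % group_size != 0:
        raise ValueError(f"K ({K}) must be divisible by group_size ({group_size})")
    return K // group_size
```

With this ordering, `group_size=0` yields a `ValueError` with a clear message instead of a `ZeroDivisionError` from the modulo.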

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/132db2ab-85c0-45b6-81a7-b9baaa533e5e Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

for more information, see https://pre-commit.ci

chensuyue · 2026-06-17T04:02:28Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-17T04:02:35Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

- Use [E, N, K] weight layout for pack helpers (matches their contract and the working correctness tests in test_moe.py).
- Drop the erroneous dequant.transpose(1, 2); dequant is already [E, N, K], which is what the baseline expects.
- Call ark.moe_gemm_prefill for quantized timing (the dedicated quantized prefill kernel) instead of the FP-only ark.moe_gemm.
- Add a per-test skip guard for quantized tests requiring moe_gemm_prefill so the FP perf cases still run when only moe_gemm is present.

chensuyue · 2026-06-17T05:39:40Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-17T05:39:47Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

…rness

Print both `base mm(ms)` (matmul-only baseline, prior behavior) and `base+deq(ms)` (dequant + matmul, the apples-to-apples comparison for the current Stage-1 `moe_gemm_prefill`, which materialises an fp16/bf16 workspace before dispatching to the FP GEMM). Speedup is now reported against the dequant-inclusive baseline so the numbers reflect real end-to-end quantized prefill cost. The FP perf test (`test_perf_fp`) is unchanged: it has no dequant pass.
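To illustrate why the choice of baseline matters, the sketch below computes the speedup against both baselines for one INT4 row from the results posted later in this thread (small E=8, sym, act=`float16`). The function name and dictionary keys are hypothetical.

```python
def speedups(base_mm_ms: float, base_deq_ms: float, ark_ms: float) -> dict:
    """Speedup of the ARK kernel against each of the two baselines."""
    return {
        "vs_matmul_only": base_mm_ms / ark_ms,
        "vs_dequant_plus_matmul": base_deq_ms / ark_ms,
    }


# base mm = 11.0943 ms, base+deq = 46.2232 ms, ark = 14.3001 ms
row = speedups(11.0943, 46.2232, 14.3001)
print(f"{row['vs_dequant_plus_matmul']:.2f}x")  # 3.23x
```

Against the matmul-only baseline the same row would come out below 1x, which is why a comparison that omits the dequant pass understates the quantized kernel.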

For the quantized paths the Python `moe_gemm_prefill` wrapper previously allocated a fresh `E*K*N*sizeof(act_dtype)` scratch tensor on every call. For real MoE prefill workloads the same shape repeats every step, so the allocator overhead is pure waste — and dominates the small-shape numbers. Move the workspace to a module-level cache keyed by `(device, dtype, E, K, N)`. The unquantized fast path is unchanged: it still uses a per-call transposed copy of `weights`. Added `clear_moe_prefill_workspace_cache()` for callers that need to drop the cached buffers.
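The caching idea can be sketched in plain Python as below. A `bytearray` stands in for the device scratch tensor, and `get_workspace` / `clear_workspace_cache` are hypothetical names for illustration; only the key layout `(device, dtype, E, K, N)` is taken from the description above.

```python
from typing import Dict, Tuple

WorkspaceKey = Tuple[str, str, int, int, int]  # (device, dtype, E, K, N)
_workspace_cache: Dict[WorkspaceKey, bytearray] = {}


def get_workspace(device: str, dtype: str, E: int, K: int, N: int,
                  itemsize: int = 2) -> bytearray:
    """Return a cached scratch buffer for this shape, allocating it on first use."""
    key = (device, dtype, E, K, N)
    buf = _workspace_cache.get(key)
    if buf is None:
        # E*K*N elements of the activation dtype (2 bytes for fp16/bf16).
        buf = bytearray(E * K * N * itemsize)
        _workspace_cache[key] = buf
    return buf


def clear_workspace_cache() -> None:
    """Drop all cached buffers so their memory can be reclaimed."""
    _workspace_cache.clear()
```

Because the buffer is shared between calls with the same key, a caller must finish consuming the dequantized weights before the next call with that shape overwrites them.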

chensuyue · 2026-06-17T07:58:22Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-17T07:58:30Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

chensuyue · 2026-06-17T10:17:54Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-17T10:18:01Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

- Add `ark.moe(...)` dispatcher that auto-selects between `moe_gemm_decode` (GEMV-tuned, small tokens/expert) and `moe_gemm_prefill` (GEMM-tuned, many tokens/expert). Single call site for model code; `phase="decode"` or `phase="prefill"` skips the auto-dispatch host-device sync.
- Add `test_moe_unified.py`: bit-parity tests vs the underlying kernels across fp/int8/int4/int2/fp8 + dispatch correctness + error path.
- Add `test_moe_model_perf.py`: model-level forward (1 prefill + N decode steps over L MoE layers) comparing always_prefill, always_decode, manual_branch, unified_auto, unified_hinted strategies.
- `moe_gemm_decode` / `moe_gemm_prefill` are kept for backward compatibility.
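The dispatch rule can be sketched as below. The threshold value, the constant name, and `select_moe_kernel` are assumptions for illustration; the source does not state how `ark.moe` decides. In the real dispatcher the per-expert counts live on the device, so inspecting them in `phase="auto"` implies a host-device sync, which an explicit hint avoids.

```python
from typing import Sequence

# Hypothetical cut-over point between the GEMV-tuned and GEMM-tuned kernels.
DECODE_MAX_TOKENS_PER_EXPERT = 8


def select_moe_kernel(tokens_per_expert: Sequence[int], phase: str = "auto") -> str:
    """Return "decode" or "prefill" for the given per-expert token counts."""
    if phase in ("decode", "prefill"):
        return phase  # caller hint: the counts are never inspected
    if phase != "auto":
        raise ValueError(f"phase must be 'auto', 'decode' or 'prefill', got {phase!r}")
    busiest = max(tokens_per_expert, default=0)
    return "decode" if busiest <= DECODE_MAX_TOKENS_PER_EXPERT else "prefill"
```

A model would typically pass `phase="prefill"` for the prompt pass and `phase="decode"` for each generated token, keeping `"auto"` for call sites that cannot tell which phase they are in.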

a32543254 · 2026-06-17T11:08:23Z

Below is the formatted GitHub comment, ready to paste directly:

✅ MoE Prefill — Perf & Accuracy Test Results

Environment: xpu_available=True · xpu_lib=loaded · has_moe_gemm=True
Platform: Linux · Python 3.13.12 · pytest 9.0.2

📊 Performance Results

FP weights — ark.moe_gemm (prefill) vs per-expert A @ W.T

`float16`

shape	E	N	K	tokens	baseline (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.1743	5.3752	2.08x	1.6
medium E=8	8	4096	14336	528	12.7908	10.8748	1.18x	5.7
medium E=8	8	14336	4096	528	12.8744	10.8833	1.18x	5.7
large E=16	16	2048	2048	256	19.6544	4.9422	3.98x	0.4
large E=32	32	2048	2048	256	73.5049	5.4408	13.51x	0.4
large E=64	64	2048	2048	256	145.3573	7.5744	19.19x	0.3
uneven E=8	8	4096	4096	610	11.2116	5.3317	2.10x	3.8

`bfloat16`

shape	E	N	K	tokens	baseline (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.1654	5.3858	2.07x	1.6
medium E=8	8	4096	14336	528	12.7828	10.8717	1.18x	5.7
medium E=8	8	14336	4096	528	12.8719	10.9248	1.18x	5.7
large E=16	16	2048	2048	256	19.6520	4.9459	3.97x	0.4
large E=32	32	2048	2048	256	73.3472	5.4542	13.45x	0.4
large E=64	64	2048	2048	256	145.7464	7.5853	19.21x	0.3
uneven E=8	8	4096	4096	610	11.1995	5.3342	2.10x	3.8
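The TFLOPS column in these tables appears consistent with counting 2 · tokens · N · K floating-point operations per call and dividing by the `ark (ms)` time. That formula is inferred from the reported numbers, not stated in the source; the sketch below checks it against two `float16` rows, with `moe_tflops` as a hypothetical helper name.

```python
def moe_tflops(tokens: int, N: int, K: int, elapsed_ms: float) -> float:
    """Throughput in TFLOPS, assuming 2*tokens*N*K FLOPs per grouped GEMM."""
    flops = 2 * tokens * N * K  # one multiply and one add per MAC
    return flops / (elapsed_ms * 1e-3) / 1e12


print(round(moe_tflops(252, 4096, 4096, 5.3752), 1))    # 1.6 (small E=8)
print(round(moe_tflops(528, 4096, 14336, 10.8748), 1))  # 5.7 (medium E=8)
```

Under this reading, `tokens` is the total across all experts, which would explain why the large-E rows with only 256 tokens report low TFLOPS despite high speedups.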

INT4 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

sym, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0943	46.2232	14.3001	3.23x	0.6
medium E=8	8	4096	14336	528	13.5260	113.8531	47.2768	2.41x	1.3
medium E=8	8	14336	4096	528	13.0802	114.3094	49.0601	2.33x	1.3
large E=16	16	2048	2048	256	37.2911	61.6995	7.4836	8.24x	0.3
large E=32	32	2048	2048	256	73.4550	108.6259	11.7776	9.22x	0.2
large E=64	64	2048	2048	256	145.6920	204.0485	23.9238	8.53x	0.1
uneven E=8	8	4096	4096	610	11.2395	46.4091	13.9499	3.33x	1.5

sym, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0758	46.4585	14.0467	3.31x	0.6
medium E=8	8	4096	14336	528	13.4995	114.5090	47.2857	2.42x	1.3
medium E=8	8	14336	4096	528	13.0646	114.9398	49.0738	2.34x	1.3
large E=16	16	2048	2048	256	37.2981	61.8135	7.5655	8.17x	0.3
large E=32	32	2048	2048	256	73.3958	108.8733	13.0127	8.37x	0.2
large E=64	64	2048	2048	256	145.3949	204.3172	23.9301	8.54x	0.1
uneven E=8	8	4096	4096	610	11.2451	46.6819	14.2496	3.28x	1.4

asym, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0931	52.6291	16.7643	3.14x	0.5
medium E=8	8	4096	14336	528	13.4947	142.5273	59.3267	2.40x	1.0
medium E=8	8	14336	4096	528	13.1180	142.3340	59.6549	2.39x	1.0
large E=16	16	2048	2048	256	37.2608	58.9527	8.0059	7.36x	0.3
large E=32	32	2048	2048	256	73.4041	115.0355	14.2060	8.10x	0.2
large E=64	64	2048	2048	256	145.6150	219.5465	27.6512	7.94x	0.1
uneven E=8	8	4096	4096	610	11.2504	52.8442	16.6637	3.17x	1.2

asym, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0764	53.3380	16.7745	3.18x	0.5
medium E=8	8	4096	14336	528	13.4873	145.3749	59.2862	2.45x	1.0
medium E=8	8	14336	4096	528	13.0745	145.1425	59.6205	2.43x	1.0
large E=16	16	2048	2048	256	37.2752	61.6077	7.9958	7.71x	0.3
large E=32	32	2048	2048	256	73.5586	115.6253	15.2235	7.60x	0.1
large E=64	64	2048	2048	256	145.6446	224.0353	27.5478	8.13x	0.1
uneven E=8	8	4096	4096	610	11.2666	53.4504	16.6813	3.20x	1.2

INT8 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

sym, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0886	21.9720	12.6380	1.74x	0.7
medium E=8	8	4096	14336	528	13.5173	50.8977	41.1863	1.24x	1.5
medium E=8	8	14336	4096	528	13.0750	50.5824	41.0089	1.23x	1.5
large E=16	16	2048	2048	256	37.2774	43.2763	7.6975	5.62x	0.3
large E=32	32	2048	2048	256	73.5237	84.2765	12.0557	6.99x	0.2
large E=64	64	2048	2048	256	145.6569	167.5987	22.0110	7.61x	0.1
uneven E=8	8	4096	4096	610	11.2530	22.1476	12.5922	1.76x	1.6

sym, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0685	22.1704	12.6382	1.75x	0.7
medium E=8	8	4096	14336	528	13.4965	52.8375	41.1465	1.28x	1.5
medium E=8	8	14336	4096	528	13.0501	52.3792	41.0161	1.28x	1.5
large E=16	16	2048	2048	256	37.2833	43.3786	7.6832	5.65x	0.3
large E=32	32	2048	2048	256	73.4866	84.4736	13.1791	6.41x	0.2
large E=64	64	2048	2048	256	145.6534	167.8831	21.9684	7.64x	0.1
uneven E=8	8	4096	4096	610	11.2760	22.3109	12.5643	1.78x	1.6

asym, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0692	33.1561	14.4715	2.29x	0.6
medium E=8	8	4096	14336	528	13.4955	86.0909	49.3754	1.74x	1.3
medium E=8	8	14336	4096	528	13.0600	85.6853	47.5951	1.80x	1.3
large E=16	16	2048	2048	256	37.2747	49.9370	7.8903	6.33x	0.3
large E=32	32	2048	2048	256	73.3727	95.4454	12.4763	7.65x	0.2
large E=64	64	2048	2048	256	145.4552	187.9701	24.0674	7.81x	0.1
uneven E=8	8	4096	4096	610	11.2650	33.3232	14.4399	2.31x	1.4

asym, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0687	33.5627	14.4608	2.32x	0.6
medium E=8	8	4096	14336	528	13.5176	88.6815	49.3579	1.80x	1.3
medium E=8	8	14336	4096	528	13.0654	88.1732	47.5636	1.85x	1.3
large E=16	16	2048	2048	256	37.3091	50.2275	7.8594	6.39x	0.3
large E=32	32	2048	2048	256	73.4880	95.9222	13.5886	7.06x	0.2
large E=64	64	2048	2048	256	145.7669	188.5899	23.9677	7.87x	0.1
uneven E=8	8	4096	4096	610	11.2739	33.7794	14.4359	2.34x	1.4

INT2 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

sym, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0759	60.6933	8.4057	7.22x	1.0
medium E=8	8	4096	14336	528	13.4512	157.8263	28.1047	5.62x	2.2
medium E=8	8	14336	4096	528	13.0528	157.3870	28.3782	5.55x	2.2
large E=16	16	2048	2048	256	37.3023	66.5552	5.6929	11.69x	0.4
large E=32	32	2048	2048	256	73.2710	119.0248	8.0819	14.73x	0.3
large E=64	64	2048	2048	256	145.7239	235.9103	15.4014	15.32x	0.1
uneven E=8	8	4096	4096	610	11.2585	55.3790	8.3760	6.61x	2.4

sym, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0545	61.3696	8.4293	7.28x	1.0
medium E=8	8	4096	14336	528	13.4594	159.2971	28.6929	5.55x	2.2
medium E=8	8	14336	4096	528	13.0241	159.2343	28.4423	5.60x	2.2
large E=16	16	2048	2048	256	37.2961	68.3397	5.6837	12.02x	0.4
large E=32	32	2048	2048	256	73.4299	119.3870	8.1449	14.66x	0.3
large E=64	64	2048	2048	256	145.6200	236.9812	13.8138	17.16x	0.2
uneven E=8	8	4096	4096	610	11.2465	56.9521	8.3960	6.78x	2.4

asym, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0255	58.3124	10.5080	5.55x	0.8
medium E=8	8	4096	14336	528	13.3463	167.0102	34.5696	4.83x	1.8
medium E=8	8	14336	4096	528	12.9347	147.3313	34.8071	4.23x	1.8
large E=16	16	2048	2048	256	37.2887	65.5966	5.8136	11.28x	0.4
large E=32	32	2048	2048	256	73.2917	119.5064	8.3657	14.29x	0.3
large E=64	64	2048	2048	256	145.4086	222.8914	15.8196	14.09x	0.1
uneven E=8	8	4096	4096	610	11.2367	58.4050	10.4326	5.60x	2.0

asym, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.0206	61.5193	10.4709	5.88x	0.8
medium E=8	8	4096	14336	528	13.3318	171.7155	34.5221	4.97x	1.8
medium E=8	8	14336	4096	528	12.9161	150.1810	34.7625	4.32x	1.8
large E=16	16	2048	2048	256	37.2678	69.5612	2.3345	29.80x	0.9
large E=32	32	2048	2048	256	12.0685	32.2610	3.2826	9.83x	0.7
large E=64	64	2048	2048	256	7.8215	159.7134	13.6930	11.66x	0.2
uneven E=8	8	4096	4096	610	3.6469	53.0169	5.8798	9.02x	3.5

FP8 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

`float8_e4m3fn`, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	4.8305	15.3085	9.4883	1.61x	0.9
medium E=8	8	4096	14336	528	8.8149	38.8058	39.9981	0.97x	1.6
medium E=8	8	14336	4096	528	11.8783	46.9332	38.7341	1.21x	1.6
large E=16	16	2048	2048	256	29.4404	30.9897	7.5328	4.11x	0.3
large E=32	32	2048	2048	256	56.8703	40.6506	9.0923	4.47x	0.2
large E=64	64	2048	2048	256	52.2962	53.5669	13.2572	4.04x	0.2
uneven E=8	8	4096	4096	610	3.0481	9.6519	6.9644	1.39x	2.9

`float8_e4m3fn`, act=`bfloat16` ⚠️

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	2.2356	9.0504	5.1459	1.76x	1.6
medium E=8	8	4096	14336	528	3.6434	24.1100	18.2615	1.32x	3.4
medium E=8	8	14336	4096	528	3.5101	24.5804	581.1054	0.04x	0.1
large E=16	16	2048	2048	256	0.9075	3.7703	2.7270	1.38x	0.8
large E=32	32	2048	2048	256	2.1443	8056.9061	4604.9666	1.75x	0.0
large E=64	64	2048	2048	256	5654.7606	17.2789	8.8871	1.94x	0.2
uneven E=8	8	4096	4096	610	1.3802	7.3652	5.2822	1.39x	3.9

⚠️ Anomaly detected: several rows show extreme timing outliers (e.g. 581 ms in the ark column, 8056 ms in base+deq, 5654 ms in base mm), likely caused by measurement jitter, a JIT/warm-up effect, or memory thrash. Recommend re-running this configuration with additional warm-up iterations / lock-step timing.

`float8_e5m2`, act=`float16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	1.3118	6.8722	5.2438	1.31x	1.6
medium E=8	8	4096	14336	528	3.6575	23.1073	45.3614	0.51x	1.4
medium E=8	8	14336	4096	528	13.0823	55.0204	45.1934	1.22x	1.4
large E=16	16	2048	2048	256	37.3216	43.5314	7.7669	5.60x	0.3
large E=32	32	2048	2048	256	73.6025	86.1460	12.4799	6.90x	0.2
large E=64	64	2048	2048	256	145.4007	169.7390	24.1494	7.03x	0.1
uneven E=8	8	4096	4096	610	11.2092	23.8410	14.0927	1.69x	1.5

`float8_e5m2`, act=`bfloat16`

shape	E	N	K	tokens	base mm (ms)	base+deq (ms)	ark (ms)	speedup	TFLOPS
small E=8	8	4096	4096	252	11.1113	24.0478	14.1489	1.70x	0.6
medium E=8	8	4096	14336	528	13.5566	59.0466	45.3463	1.30x	1.4
medium E=8	8	14336	4096	528	13.0841	58.5334	45.2276	1.29x	1.4
large E=16	16	2048	2048	256	37.2955	43.7300	7.7709	5.63x	0.3
large E=32	32	2048	2048	256	73.4335	86.3980	13.3932	6.45x	0.2
large E=64	64	2048	2048	256	145.6443	171.5714	24.1137	7.12x	0.1
uneven E=8	8	4096	4096	610	11.2536	24.1764	14.0933	1.72x	1.5

🎯 Accuracy Results

18 / 18 PASSED ✅

dtype	Variant	Status
FP	`float16` / `bfloat16`	✅✅
INT4 sym	act=`float16` / act=`bfloat16`	✅✅
INT4 asym	act=`float16` / act=`bfloat16`	✅✅
INT8 sym	act=`float16` / act=`bfloat16`	✅✅
INT8 asym	act=`float16` / act=`bfloat16`	✅✅
INT2 sym	act=`float16` / act=`bfloat16`	✅✅
INT2 asym	act=`float16` / act=`bfloat16`	✅✅
FP8 `e4m3fn`	act=`float16` / act=`bfloat16`	✅✅
FP8 `e5m2`	act=`float16` / act=`bfloat16`	✅✅

📌 Summary

Path	Best speedup	Typical speedup	Notes
FP (fp16/bf16)	19.21x (E=64)	2–4x on small/medium	Scales strongly with expert count
INT4 (sym/asym)	9.22x	2.3–3.3x	Avoids costly per-expert dequant
INT8 (sym/asym)	7.87x	1.2–2.3x	Most benefit on `large E≥16`
INT2 (sym/asym)	29.80x	5–15x	Biggest wins overall
FP8 e4m3fn / e5m2	7.12x	1.3–4.5x	⚠️ e4m3fn + bf16 has unstable rows — re-run recommended

✅ All 18 accuracy tests pass.
✅ All 18 perf tests pass; ark.moe_gemm_prefill outperforms the baseline in nearly all configs.
⚠️ One config (FP8 e4m3fn, act=bfloat16) shows abnormal timings — needs investigation.

chensuyue · 2026-06-17T11:52:48Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-17T11:52:55Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

- Add test/test_ark/test_moe_model_perf.py: real MoE LLM perf scan (tiny by default, AR_MOE_PERF_FULL=1 for full checkpoints) over prefill + per-token decode latency. Compares FP reference vs ARK backend (and optional GPTQModel). Asserts ARK decode <= 2x FP to guard against silent ark.moe dispatcher regressions.
- Remove obsolete auto_round_extension/ark/test/test_moe_model_perf.py (synthetic kernel-trace bench; replaced by the model-level one).
- Register the `perf` pytest marker in pyproject.toml.

Copilot AI and others added 2 commits May 14, 2026 04:03

Add XPU MoE decode kernel with INT4 sym/asym and FP16/BF16 baselines

58b0900

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/95841e6d-d5d1-4662-8db0-4dd69690bc28 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Document int4 sign-extension trick

527eede

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/95841e6d-d5d1-4662-8db0-4dd69690bc28 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot AI assigned Copilot and a32543254 May 14, 2026

Copilot created this pull request from a session on behalf of a32543254 May 14, 2026 04:27 View session

[pre-commit.ci] auto fixes from pre-commit.com hooks

78ecc0c

for more information, see https://pre-commit.ci

Copilot started work on behalf of a32543254 May 14, 2026 06:14 View session

Copilot finished work on behalf of a32543254 May 14, 2026 06:30

Copilot AI requested a review from a32543254 May 14, 2026 06:30

Copilot started work on behalf of a32543254 May 14, 2026 06:56 View session

Copilot finished work on behalf of a32543254 May 14, 2026 07:00

Copilot started work on behalf of a32543254 May 14, 2026 07:04 View session

Copilot AI and others added 2 commits May 14, 2026 07:16

Add INT8/INT2/FP8 decode MoE GEMV kernels and tests

5dc9d95

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/91221649-2c90-4404-ae86-3321b1581428 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

docs: clarify int2 bit-indexing notation in moe_gemm_decode

f15093a

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/91221649-2c90-4404-ae86-3321b1581428 Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot finished work on behalf of a32543254 May 14, 2026 07:20

a32543254 changed the title ~~Add XPU MoE decode kernel (FP16/BF16 + INT4 sym/asym)~~ Add XPU MoE decode kernel (FP16/BF16 + INT2/4/8 sym/asym + FP8) May 15, 2026

a32543254 requested a review from luoyu-intel May 18, 2026 15:33

a32543254 marked this pull request as ready for review May 18, 2026 15:33

Copilot AI review requested due to automatic review settings May 18, 2026 15:33

Copilot started reviewing on behalf of a32543254 May 18, 2026 15:34 View session

Copilot started work on behalf of a32543254 May 18, 2026 15:35 View session

Merge remote-tracking branch 'origin/main' into copilot/add-xpu-moe-d…

430868d

…ecode-implementation # Conflicts: # auto_round_extension/ark/auto_round_kernel/ark.cpp Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot finished work on behalf of a32543254 May 18, 2026 15:37

Copilot AI reviewed May 18, 2026

View reviewed changes

Copilot started work on behalf of a32543254 May 19, 2026 06:19 View session

test: add perf comparison UT — moe_gemm_decode vs default XPU MoE

4395884

Agent-Logs-Url: https://github.com/intel/auto-round/sessions/132db2ab-85c0-45b6-81a7-b9baaa533e5e Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot finished work on behalf of a32543254 May 19, 2026 06:23

Copilot started work on behalf of a32543254 May 20, 2026 06:40 View session

a32543254 and others added 2 commits June 17, 2026 11:18

Merge branch 'main' into copilot/add-xpu-moe-decode-implementation

8a22540

[pre-commit.ci] auto fixes from pre-commit.com hooks

0e78f31

for more information, see https://pre-commit.ci

Copilot started work on behalf of a32543254 June 17, 2026 04:08 View session

Copilot AI added 2 commits June 17, 2026 04:11

feat: add MoE prefill performance test with TFLOPS calculation

d48f45d

docs: add MoE prefill performance test documentation

b07b59f

Copilot AI changed the title ~~Add quantized MoE prefill kernel for XPU (stage-1 functional baseline)~~ feat: add MoE prefill performance test with TFLOPS metrics Jun 17, 2026

Copilot finished work on behalf of a32543254 June 17, 2026 04:12

Copilot started work on behalf of a32543254 June 17, 2026 04:15 View session

refactor: change ARK() instance to module reference in MoE perf tests

5feb707

Copilot AI changed the title ~~feat: add MoE prefill performance test with TFLOPS metrics~~ Add MoE prefill performance tests with TFLOPS and unify ARK module usage Jun 17, 2026

Copilot AI added 3 commits June 17, 2026 06:54

perf(moe_prefill): skip dequant for experts with zero tokens

2caa066

Copilot AI added 2 commits June 17, 2026 09:39

perf(moe_prefill): pack PACK_K K-outputs per dequant work-item

3e5192c

docs(moe_prefill): clarify PACK_K must divide group_size for hoist

c5f7c21

Copilot AI added 2 commits June 17, 2026 10:31

test: add accuracy UT for MoE prefill (ark.moe_gemm / moe_gemm_prefill)

81712c8


4 participants
