test(ark): add model-level MoE perf benchmark on XPU #1813

Open
Copilot wants to merge 34 commits into main from copilot/add-xpu-moe-decode-implementation

Conversation

Copilot AI commented May 14, 2026

Contributor

Description

Replaces the synthetic kernel-trace auto_round_extension/ark/test/test_moe_model_perf.py with a real model-level perf benchmark under test/test_ark/test_moe_model_perf.py, modeled on test/test_ark/test_model.py. Exercises the full integration path (tokenizer → attention → MoE-MLP via ARK kernels → lm_head → sampler) instead of just the MoE GEMM in isolation.

  • New benchmark (test/test_ark/test_moe_model_perf.py)
    • Loads Qwen/Qwen1.5-MoE-A2.7B and deepseek-ai/DeepSeek-V2-Lite; tiny slice (num_layers=2, num_experts=4) by default via helpers.get_tiny_model, full checkpoints gated behind AR_MOE_PERF_FULL=1.
    • Quantizes with AutoRound(iters=0, nsamples=1, disable_opt_rtn=True), reloads on XPU through AutoRoundConfig(backend="ark").
    • Parametrized over (model, bits∈{4,8}, dtype∈{fp16,bf16}).
    • Measures prefill (single forward, 128-token synthetic prompt) and per-token decode (generate(max_new_tokens=32) after a 4-token warmup) using torch.xpu.Event; reports median of 3 runs.
    • Compares FP reference vs ARK (and optionally GPTQModel — skipped if unavailable). Prints one table per dtype with prefill(ms) | decode(ms/tok) | tokens/s | speedup vs FP.
    • Asserts ark_decode ≤ 2× fp_decode as a regression guard against silent ark.moe dispatcher slowdowns (e.g. the phase="auto" host-device sync overhead).
    • Skip gates: no XPU, ARK extension missing / lacking moe_gemm_{decode,prefill} / ark.moe, or checkpoint not resolvable locally.
  • Removed auto_round_extension/ark/test/test_moe_model_perf.py (synthetic kernel-trace bench, superseded).
  • Registered the perf pytest marker in pyproject.toml so the suite is opt-in (-m perf) and excludable from default CI (-m "not perf").
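
The median-of-3 timing and the 2× decode regression guard described above can be sketched as follows. This is a minimal illustration, not the benchmark's code: `median_ms` and `check_decode_regression` are hypothetical names, and `time.perf_counter` stands in for the `torch.xpu.Event` timing the description mentions so that the sketch runs without an XPU.

```python
import statistics
import time
from typing import Callable


def median_ms(fn: Callable[[], None], runs: int = 3, warmup: int = 1) -> float:
    """Median wall-clock latency of fn in milliseconds.

    Assumption: time.perf_counter replaces device events here; on XPU the
    benchmark would record device events and synchronize instead.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def check_decode_regression(ark_ms_per_tok: float, fp_ms_per_tok: float,
                            max_ratio: float = 2.0) -> float:
    """Return the ark/fp decode ratio, raising AssertionError above max_ratio."""
    ratio = ark_ms_per_tok / fp_ms_per_tok
    assert ratio <= max_ratio, f"ARK decode is {ratio:.2f}x FP (limit {max_ratio}x)"
    return ratio
```

Taking the median rather than the mean keeps a single slow run from tripping the guard.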

Type of Change

Performance / Test

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Copilot AI and others added 2 commits May 14, 2026 04:03
Copilot AI and others added 2 commits May 14, 2026 07:16
@a32543254 changed the title from "Add XPU MoE decode kernel (FP16/BF16 + INT4 sym/asym)" to "Add XPU MoE decode kernel (FP16/BF16 + INT2/4/8 sym/asym + FP8)" on May 15, 2026
@a32543254 requested a review from luoyu-intel May 18, 2026 15:33
@a32543254 marked this pull request as ready for review May 18, 2026 15:33
Copilot AI review requested due to automatic review settings May 18, 2026 15:33
@a32543254

@copilot resolve the merge conflicts in this pull request

…ecode-implementation

# Conflicts:
#	auto_round_extension/ark/auto_round_kernel/ark.cpp

Co-authored-by: a32543254 <53296245+a32543254@users.noreply.github.com>

Copilot AI commented May 18, 2026

Contributor Author

@copilot resolve the merge conflicts in this pull request

Merged origin/main in 430868d. The only conflict was in auto_round_extension/ark/auto_round_kernel/ark.cpp — I kept the new moe_gemm_decode_wrapper from this branch alongside the updated sage_dynamic_quant signature (added bias param) from main.

Copilot AI left a comment

Contributor

Pull request overview

This PR adds an XPU-optimized MoE decode-phase GEMV kernel (small M per expert) with multiple weight formats, and wires it through the C++/PyTorch extension layer with corresponding unit tests.

Changes:

  • Added a SYCL decode GEMV kernel supporting FP16/BF16, INT8/INT4/INT2 (sym/asym), and FP8 (E4M3/E5M2) weights.
  • Exposed the kernel via pybind (moe_gemm_decode) and added a Python wrapper with argument validation.
  • Added unit tests covering the new decode paths and key validation error cases.
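
For readers unfamiliar with grouped MoE GEMM, the computation the kernel accelerates can be written as a slow pure-Python reference. This is a sketch under stated assumptions: activation rows are taken to be already grouped by expert, weights use the `[E, N, K]` layout mentioned elsewhere in this thread, and `moe_gemm_reference` is a hypothetical name rather than a function from the PR.

```python
from typing import List

Matrix = List[List[float]]


def moe_gemm_reference(activations: Matrix, weights: List[Matrix],
                       num_tokens_per_expert: List[int]) -> Matrix:
    """Per-expert A @ W.T, with weights[e] shaped [N, K].

    Assumption: rows of `activations` are grouped by expert, so expert e
    owns the next num_tokens_per_expert[e] rows.
    """
    if sum(num_tokens_per_expert) != len(activations):
        raise ValueError("token counts must sum to the number of activation rows")
    out: Matrix = []
    row = 0
    for expert, count in enumerate(num_tokens_per_expert):
        w = weights[expert]
        for a in activations[row:row + count]:
            # One output value per weight row: dot(a, w[n]) over K.
            out.append([sum(x * y for x, y in zip(a, w_n)) for w_n in w])
        row += count
    return out
```

In the decode phase each expert sees only a few rows, so every per-expert product is close to a matrix-vector multiply, which is why a GEMV-tuned kernel is a natural fit.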

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| auto_round_extension/ark/test/test_moe.py | Adds decode-path unit tests plus packing/dequant reference helpers for INT2/4/8 and FP8. |
| auto_round_extension/ark/auto_round_kernel/wrapper/include/sycl_tla_moe_decode.hpp | Introduces the new SYCL MoE decode GEMV kernel implementations and dispatch. |
| auto_round_extension/ark/auto_round_kernel/wrapper/include/sycl_tla_common.hpp | Declares the new moe_gemm_decode API (but docs currently lag implementation). |
| auto_round_extension/ark/auto_round_kernel/ark.cpp | Includes the new header and binds moe_gemm_decode via pybind. |
| auto_round_extension/ark/auto_round_kernel/__init__.py | Adds the ARK.moe_gemm_decode Python wrapper and validation logic. |

Comments suppressed due to low confidence (2)

auto_round_extension/ark/auto_round_kernel/__init__.py:871

  • num_tokens_per_expert is converted to int32/contiguous but its device is not validated. If it’s a CPU tensor, the kernel will treat a host pointer as device memory. Please ensure num_tokens_per_expert is on XPU (and matches activations.device), or move it to XPU explicitly before calling into the extension.
            weights = weights.contiguous()

        if num_tokens_per_expert.dtype != torch.int32:
            num_tokens_per_expert = num_tokens_per_expert.to(torch.int32)
        if not num_tokens_per_expert.is_contiguous():
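
One way to act on this comment is a small device check before the call into the extension. The sketch below is an assumption-laden illustration: `require_same_device` is a hypothetical helper, and it relies only on a `.device` attribute (which torch tensors expose), so stand-in objects are used to keep it runnable without torch or an XPU.

```python
from types import SimpleNamespace


def require_same_device(name: str, tensor, reference) -> None:
    """Raise ValueError unless `tensor` is on the same device as `reference`.

    Works with any objects exposing a `.device` attribute.
    """
    if tensor.device != reference.device:
        raise ValueError(
            f"{name} must be on {reference.device}, got {tensor.device}"
        )


# Stand-in objects (hypothetical) so the sketch runs without torch.
activations = SimpleNamespace(device="xpu:0")
counts_on_cpu = SimpleNamespace(device="cpu")

try:
    require_same_device("num_tokens_per_expert", counts_on_cpu, activations)
except ValueError as err:
    print(err)
```

Raising is the stricter option; silently moving the tensor to the activation device is the alternative the comment also allows, at the cost of hiding a host-to-device copy.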

auto_round_extension/ark/auto_round_kernel/__init__.py:896

  • group_size is used in modulo/division checks (e.g., K % group_size) without validating group_size > 0. Passing group_size=0 will raise a ZeroDivisionError rather than a clear ValueError. Please add an explicit check that group_size is a positive integer before any modulo/division operations.
            if scales is None:
                raise ValueError("scales is required for FP8 weights")
            if scales.dtype != activations.dtype:
                raise ValueError("scales dtype must match activations dtype")
            if K % group_size != 0:
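
The requested guard could look like the sketch below, which checks `group_size` before any modulo or division. `validate_group_size` is a hypothetical helper name, not part of the wrapper.

```python
def validate_group_size(K: int, group_size: int) -> int:
    """Return the number of groups along K, or raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(group_size, bool) or not isinstance(group_size, int) or group_size <= 0:
        raise ValueError(f"group_size must be a positive integer, got {group_size!r}")
    if K % group_size != 0:
        raise ValueError(f"K ({K}) must be divisible by group_size ({group_size})")
    return K // group_size
```

With this ordering, `group_size=0` produces a clear `ValueError` instead of a `ZeroDivisionError` from the modulo.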

Comment thread auto_round_extension/ark/auto_round_kernel/__init__.py Outdated
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

Copilot AI changed the title from "Add quantized MoE prefill kernel for XPU (stage-1 functional baseline)" to "feat: add MoE prefill performance test with TFLOPS metrics" on Jun 17, 2026
Copilot AI changed the title from "feat: add MoE prefill performance test with TFLOPS metrics" to "Add MoE prefill performance tests with TFLOPS and unify ARK module usage" on Jun 17, 2026

- Use [E, N, K] weight layout for pack helpers (matches their contract
  and the working correctness tests in test_moe.py).
- Drop the erroneous dequant.transpose(1, 2); dequant is already
  [E, N, K], which is what the baseline expects.
- Call ark.moe_gemm_prefill for quantized timing (the dedicated
  quantized prefill kernel) instead of the FP-only ark.moe_gemm.
- Add a per-test skip guard for quantized tests requiring
  moe_gemm_prefill so the FP perf cases still run when only moe_gemm
  is present.
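
To make the pack and dequant steps concrete, here is a pure-Python sketch of symmetric per-group INT4 quantization for one weight row along K. It is illustrative only: the function names are hypothetical, and the nibble order (low nibble first) is an assumption that may not match the pack helpers in test_moe.py.

```python
from typing import List, Tuple


def quantize_int4_sym(row: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Symmetric per-group INT4: q in [-8, 7], one scale per group along K."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(v) for v in group) / 7.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        q.extend(max(-8, min(7, round(v / scale))) for v in group)
    return q, scales


def pack_int4(q: List[int]) -> bytes:
    """Pack two signed nibbles per byte, low nibble first (assumed order)."""
    if len(q) % 2:
        raise ValueError("need an even number of values")
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: sign-extend each 4-bit nibble."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def dequantize(q: List[int], scales: List[float], group_size: int) -> List[float]:
    """Multiply each value by the scale of the group it belongs to."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]
```

A dequant-then-matmul baseline built from helpers like these stays in `[E, N, K]` throughout, so no extra transpose is needed before computing `A @ W.T`.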
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

Copilot AI added 3 commits June 17, 2026 06:54
…rness

Print both `base mm(ms)` (matmul-only baseline, prior behavior) and
`base+deq(ms)` (dequant + matmul, the apples-to-apples comparison for
the current Stage-1 `moe_gemm_prefill` which materialises a fp16/bf16
workspace before dispatching to the FP GEMM). Speedup is now reported
against the dequant-inclusive baseline so the numbers reflect real
end-to-end quantized prefill cost.

FP perf test (`test_perf_fp`) is unchanged: it has no dequant pass.

For the quantized paths the Python `moe_gemm_prefill` wrapper previously
allocated a fresh `E*K*N*sizeof(act_dtype)` scratch tensor on every call.
For real MoE prefill workloads the same shape repeats every step, so the
allocator overhead is pure waste — and dominates the small-shape numbers.

Move the workspace to a module-level cache keyed by
`(device, dtype, E, K, N)`. The unquantized fast path is unchanged: it
still uses a per-call transposed copy of `weights`. Added
`clear_moe_prefill_workspace_cache()` for callers that need to drop the
cached buffers.
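
The caching scheme described in this commit message can be sketched as a dictionary keyed by `(device, dtype, E, K, N)`. This is a simplified illustration, not the wrapper's code: `get_workspace` and its `alloc` callback are hypothetical, and only the name `clear_moe_prefill_workspace_cache` comes from the message above.

```python
from typing import Callable, Dict, Tuple

WorkspaceKey = Tuple[str, str, int, int, int]  # (device, dtype, E, K, N)
_workspace_cache: Dict[WorkspaceKey, object] = {}


def get_workspace(device: str, dtype: str, E: int, K: int, N: int,
                  alloc: Callable[[int], object]) -> object:
    """Return a cached scratch buffer of E*K*N elements, allocating on first use.

    `alloc` is a hypothetical allocator callback taking an element count.
    """
    key = (device, dtype, E, K, N)
    buf = _workspace_cache.get(key)
    if buf is None:
        buf = alloc(E * K * N)
        _workspace_cache[key] = buf
    return buf


def clear_moe_prefill_workspace_cache() -> None:
    """Drop all cached buffers so their memory can be released."""
    _workspace_cache.clear()
```

The trade-off is that cached buffers stay alive until they are cleared, so a caller that cycles through many distinct shapes may want to clear the cache between phases.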
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

Copilot AI added 2 commits June 17, 2026 10:31
- Add `ark.moe(...)` dispatcher that auto-selects between `moe_gemm_decode`
  (GEMV-tuned, small tokens/expert) and `moe_gemm_prefill` (GEMM-tuned,
  many tokens/expert). Single call site for model code; `phase="decode"`
  or `phase="prefill"` skips the auto-dispatch host-device sync.
- Add `test_moe_unified.py`: bit-parity tests vs the underlying kernels
  across fp/int8/int4/int2/fp8 + dispatch correctness + error path.
- Add `test_moe_model_perf.py`: model-level forward (1 prefill + N decode
  steps over L MoE layers) comparing always_prefill, always_decode,
  manual_branch, unified_auto, unified_hinted strategies.
- `moe_gemm_decode` / `moe_gemm_prefill` are kept for backward compatibility.
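
The dispatch rule can be illustrated with a small selection function. This is a sketch only: the threshold value and the name `select_moe_kernel` are hypothetical, and the real `ark.moe` may use a different criterion. It does show why an explicit `phase` hint avoids inspecting the token counts.

```python
from typing import Sequence

DECODE_MAX_TOKENS_PER_EXPERT = 4  # hypothetical threshold, not taken from the PR


def select_moe_kernel(num_tokens_per_expert: Sequence[int], phase: str = "auto") -> str:
    """Return "decode" or "prefill" for a MoE call."""
    if phase in ("decode", "prefill"):
        return phase  # caller hint: the counts are never read
    if phase != "auto":
        raise ValueError(f"phase must be 'auto', 'decode' or 'prefill', got {phase!r}")
    # With device tensors, reading the counts on the host is the step that
    # would force a host-device sync.
    busiest = max(num_tokens_per_expert, default=0)
    return "decode" if busiest <= DECODE_MAX_TOKENS_PER_EXPERT else "prefill"
```

Model code that already knows which phase it is in (for example, a generate loop after the first forward) can pass the hint and skip the auto path entirely.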
@a32543254

Below is the formatted GitHub comment, ready to paste as-is:


✅ MoE Prefill — Perf & Accuracy Test Results

Environment: xpu_available=True · xpu_lib=loaded · has_moe_gemm=True
Platform: Linux · Python 3.13.12 · pytest 9.0.2


📊 Performance Results

FP weights — ark.moe_gemm (prefill) vs per-expert A @ W.T

float16

shape E N K tokens baseline (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.1743 5.3752 2.08x 1.6
medium E=8 8 4096 14336 528 12.7908 10.8748 1.18x 5.7
medium E=8 8 14336 4096 528 12.8744 10.8833 1.18x 5.7
large E=16 16 2048 2048 256 19.6544 4.9422 3.98x 0.4
large E=32 32 2048 2048 256 73.5049 5.4408 13.51x 0.4
large E=64 64 2048 2048 256 145.3573 7.5744 19.19x 0.3
uneven E=8 8 4096 4096 610 11.2116 5.3317 2.10x 3.8

bfloat16

shape E N K tokens baseline (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.1654 5.3858 2.07x 1.6
medium E=8 8 4096 14336 528 12.7828 10.8717 1.18x 5.7
medium E=8 8 14336 4096 528 12.8719 10.9248 1.18x 5.7
large E=16 16 2048 2048 256 19.6520 4.9459 3.97x 0.4
large E=32 32 2048 2048 256 73.3472 5.4542 13.45x 0.4
large E=64 64 2048 2048 256 145.7464 7.5853 19.21x 0.3
uneven E=8 8 4096 4096 610 11.1995 5.3342 2.10x 3.8
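
The speedup and TFLOPS columns in these tables appear consistent with the formulas sketched below: 2·tokens·N·K floating-point operations divided by the ARK time, where tokens is the total across experts, and baseline time divided by ARK time. This is inferred from the reported numbers rather than taken from the test source, and the function names are hypothetical. For the quantized tables, the commit message earlier in this thread states that speedup uses the dequant-inclusive baseline.

```python
def tflops(total_tokens: int, N: int, K: int, ms: float) -> float:
    """2 * tokens * N * K floating-point operations over elapsed time, in TFLOPS."""
    return 2.0 * total_tokens * N * K / (ms * 1e-3) / 1e12


def speedup(baseline_ms: float, ark_ms: float) -> float:
    """How many times faster the ARK kernel is than the baseline."""
    return baseline_ms / ark_ms


# float16 "small E=8" row: tokens=252, N=K=4096, baseline 11.1743 ms, ark 5.3752 ms
print(f"{tflops(252, 4096, 4096, 5.3752):.1f} TFLOPS, {speedup(11.1743, 5.3752):.2f}x")
# prints: 1.6 TFLOPS, 2.08x
```
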
INT4 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

sym, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0943 46.2232 14.3001 3.23x 0.6
medium E=8 8 4096 14336 528 13.5260 113.8531 47.2768 2.41x 1.3
medium E=8 8 14336 4096 528 13.0802 114.3094 49.0601 2.33x 1.3
large E=16 16 2048 2048 256 37.2911 61.6995 7.4836 8.24x 0.3
large E=32 32 2048 2048 256 73.4550 108.6259 11.7776 9.22x 0.2
large E=64 64 2048 2048 256 145.6920 204.0485 23.9238 8.53x 0.1
uneven E=8 8 4096 4096 610 11.2395 46.4091 13.9499 3.33x 1.5

sym, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0758 46.4585 14.0467 3.31x 0.6
medium E=8 8 4096 14336 528 13.4995 114.5090 47.2857 2.42x 1.3
medium E=8 8 14336 4096 528 13.0646 114.9398 49.0738 2.34x 1.3
large E=16 16 2048 2048 256 37.2981 61.8135 7.5655 8.17x 0.3
large E=32 32 2048 2048 256 73.3958 108.8733 13.0127 8.37x 0.2
large E=64 64 2048 2048 256 145.3949 204.3172 23.9301 8.54x 0.1
uneven E=8 8 4096 4096 610 11.2451 46.6819 14.2496 3.28x 1.4

asym, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0931 52.6291 16.7643 3.14x 0.5
medium E=8 8 4096 14336 528 13.4947 142.5273 59.3267 2.40x 1.0
medium E=8 8 14336 4096 528 13.1180 142.3340 59.6549 2.39x 1.0
large E=16 16 2048 2048 256 37.2608 58.9527 8.0059 7.36x 0.3
large E=32 32 2048 2048 256 73.4041 115.0355 14.2060 8.10x 0.2
large E=64 64 2048 2048 256 145.6150 219.5465 27.6512 7.94x 0.1
uneven E=8 8 4096 4096 610 11.2504 52.8442 16.6637 3.17x 1.2

asym, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0764 53.3380 16.7745 3.18x 0.5
medium E=8 8 4096 14336 528 13.4873 145.3749 59.2862 2.45x 1.0
medium E=8 8 14336 4096 528 13.0745 145.1425 59.6205 2.43x 1.0
large E=16 16 2048 2048 256 37.2752 61.6077 7.9958 7.71x 0.3
large E=32 32 2048 2048 256 73.5586 115.6253 15.2235 7.60x 0.1
large E=64 64 2048 2048 256 145.6446 224.0353 27.5478 8.13x 0.1
uneven E=8 8 4096 4096 610 11.2666 53.4504 16.6813 3.20x 1.2
INT8 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

sym, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0886 21.9720 12.6380 1.74x 0.7
medium E=8 8 4096 14336 528 13.5173 50.8977 41.1863 1.24x 1.5
medium E=8 8 14336 4096 528 13.0750 50.5824 41.0089 1.23x 1.5
large E=16 16 2048 2048 256 37.2774 43.2763 7.6975 5.62x 0.3
large E=32 32 2048 2048 256 73.5237 84.2765 12.0557 6.99x 0.2
large E=64 64 2048 2048 256 145.6569 167.5987 22.0110 7.61x 0.1
uneven E=8 8 4096 4096 610 11.2530 22.1476 12.5922 1.76x 1.6

sym, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0685 22.1704 12.6382 1.75x 0.7
medium E=8 8 4096 14336 528 13.4965 52.8375 41.1465 1.28x 1.5
medium E=8 8 14336 4096 528 13.0501 52.3792 41.0161 1.28x 1.5
large E=16 16 2048 2048 256 37.2833 43.3786 7.6832 5.65x 0.3
large E=32 32 2048 2048 256 73.4866 84.4736 13.1791 6.41x 0.2
large E=64 64 2048 2048 256 145.6534 167.8831 21.9684 7.64x 0.1
uneven E=8 8 4096 4096 610 11.2760 22.3109 12.5643 1.78x 1.6

asym, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0692 33.1561 14.4715 2.29x 0.6
medium E=8 8 4096 14336 528 13.4955 86.0909 49.3754 1.74x 1.3
medium E=8 8 14336 4096 528 13.0600 85.6853 47.5951 1.80x 1.3
large E=16 16 2048 2048 256 37.2747 49.9370 7.8903 6.33x 0.3
large E=32 32 2048 2048 256 73.3727 95.4454 12.4763 7.65x 0.2
large E=64 64 2048 2048 256 145.4552 187.9701 24.0674 7.81x 0.1
uneven E=8 8 4096 4096 610 11.2650 33.3232 14.4399 2.31x 1.4

asym, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0687 33.5627 14.4608 2.32x 0.6
medium E=8 8 4096 14336 528 13.5176 88.6815 49.3579 1.80x 1.3
medium E=8 8 14336 4096 528 13.0654 88.1732 47.5636 1.85x 1.3
large E=16 16 2048 2048 256 37.3091 50.2275 7.8594 6.39x 0.3
large E=32 32 2048 2048 256 73.4880 95.9222 13.5886 7.06x 0.2
large E=64 64 2048 2048 256 145.7669 188.5899 23.9677 7.87x 0.1
uneven E=8 8 4096 4096 610 11.2739 33.7794 14.4359 2.34x 1.4
INT2 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

sym, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0759 60.6933 8.4057 7.22x 1.0
medium E=8 8 4096 14336 528 13.4512 157.8263 28.1047 5.62x 2.2
medium E=8 8 14336 4096 528 13.0528 157.3870 28.3782 5.55x 2.2
large E=16 16 2048 2048 256 37.3023 66.5552 5.6929 11.69x 0.4
large E=32 32 2048 2048 256 73.2710 119.0248 8.0819 14.73x 0.3
large E=64 64 2048 2048 256 145.7239 235.9103 15.4014 15.32x 0.1
uneven E=8 8 4096 4096 610 11.2585 55.3790 8.3760 6.61x 2.4

sym, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0545 61.3696 8.4293 7.28x 1.0
medium E=8 8 4096 14336 528 13.4594 159.2971 28.6929 5.55x 2.2
medium E=8 8 14336 4096 528 13.0241 159.2343 28.4423 5.60x 2.2
large E=16 16 2048 2048 256 37.2961 68.3397 5.6837 12.02x 0.4
large E=32 32 2048 2048 256 73.4299 119.3870 8.1449 14.66x 0.3
large E=64 64 2048 2048 256 145.6200 236.9812 13.8138 17.16x 0.2
uneven E=8 8 4096 4096 610 11.2465 56.9521 8.3960 6.78x 2.4

asym, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0255 58.3124 10.5080 5.55x 0.8
medium E=8 8 4096 14336 528 13.3463 167.0102 34.5696 4.83x 1.8
medium E=8 8 14336 4096 528 12.9347 147.3313 34.8071 4.23x 1.8
large E=16 16 2048 2048 256 37.2887 65.5966 5.8136 11.28x 0.4
large E=32 32 2048 2048 256 73.2917 119.5064 8.3657 14.29x 0.3
large E=64 64 2048 2048 256 145.4086 222.8914 15.8196 14.09x 0.1
uneven E=8 8 4096 4096 610 11.2367 58.4050 10.4326 5.60x 2.0

asym, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.0206 61.5193 10.4709 5.88x 0.8
medium E=8 8 4096 14336 528 13.3318 171.7155 34.5221 4.97x 1.8
medium E=8 8 14336 4096 528 12.9161 150.1810 34.7625 4.32x 1.8
large E=16 16 2048 2048 256 37.2678 69.5612 2.3345 29.80x 0.9
large E=32 32 2048 2048 256 12.0685 32.2610 3.2826 9.83x 0.7
large E=64 64 2048 2048 256 7.8215 159.7134 13.6930 11.66x 0.2
uneven E=8 8 4096 4096 610 3.6469 53.0169 5.8798 9.02x 3.5
FP8 (group_size=128) — ark.moe_gemm_prefill vs dequant + A @ W.T

float8_e4m3fn, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 4.8305 15.3085 9.4883 1.61x 0.9
medium E=8 8 4096 14336 528 8.8149 38.8058 39.9981 0.97x 1.6
medium E=8 8 14336 4096 528 11.8783 46.9332 38.7341 1.21x 1.6
large E=16 16 2048 2048 256 29.4404 30.9897 7.5328 4.11x 0.3
large E=32 32 2048 2048 256 56.8703 40.6506 9.0923 4.47x 0.2
large E=64 64 2048 2048 256 52.2962 53.5669 13.2572 4.04x 0.2
uneven E=8 8 4096 4096 610 3.0481 9.6519 6.9644 1.39x 2.9

float8_e4m3fn, act=bfloat16 ⚠️

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 2.2356 9.0504 5.1459 1.76x 1.6
medium E=8 8 4096 14336 528 3.6434 24.1100 18.2615 1.32x 3.4
medium E=8 8 14336 4096 528 3.5101 24.5804 581.1054 0.04x 0.1
large E=16 16 2048 2048 256 0.9075 3.7703 2.7270 1.38x 0.8
large E=32 32 2048 2048 256 2.1443 8056.9061 4604.9666 1.75x 0.0
large E=64 64 2048 2048 256 5654.7606 17.2789 8.8871 1.94x 0.2
uneven E=8 8 4096 4096 610 1.3802 7.3652 5.2822 1.39x 3.9

⚠️ Anomaly detected: Several rows show extreme timing variance (e.g. 581ms, 8056ms, 5654ms) — likely measurement jitter, JIT/warm-up effect, or memory thrash. Recommend re-running this configuration with additional warm-up iterations / lock-step timing.
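
A re-run could flag unstable rows automatically by comparing the slowest repeat against the median. The sketch below is one possible heuristic: `is_unstable`, its threshold, and the sample values are all hypothetical and are not measurements from this PR.

```python
import statistics
from typing import Sequence


def is_unstable(samples_ms: Sequence[float], max_spread: float = 1.5) -> bool:
    """True when the slowest sample exceeds max_spread times the median."""
    if len(samples_ms) < 3:
        raise ValueError("need at least 3 samples to judge stability")
    return max(samples_ms) > max_spread * statistics.median(samples_ms)


# Hypothetical repeat timings for one configuration, in milliseconds.
print(is_unstable([18.2, 18.3, 581.1]))  # True: one repeat is a large outlier
print(is_unstable([10.8, 10.9, 10.9]))   # False: repeats agree closely
```

Rows flagged this way can be re-measured with more warm-up iterations before they are reported.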

float8_e5m2, act=float16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 1.3118 6.8722 5.2438 1.31x 1.6
medium E=8 8 4096 14336 528 3.6575 23.1073 45.3614 0.51x 1.4
medium E=8 8 14336 4096 528 13.0823 55.0204 45.1934 1.22x 1.4
large E=16 16 2048 2048 256 37.3216 43.5314 7.7669 5.60x 0.3
large E=32 32 2048 2048 256 73.6025 86.1460 12.4799 6.90x 0.2
large E=64 64 2048 2048 256 145.4007 169.7390 24.1494 7.03x 0.1
uneven E=8 8 4096 4096 610 11.2092 23.8410 14.0927 1.69x 1.5

float8_e5m2, act=bfloat16

shape E N K tokens base mm (ms) base+deq (ms) ark (ms) speedup TFLOPS
small E=8 8 4096 4096 252 11.1113 24.0478 14.1489 1.70x 0.6
medium E=8 8 4096 14336 528 13.5566 59.0466 45.3463 1.30x 1.4
medium E=8 8 14336 4096 528 13.0841 58.5334 45.2276 1.29x 1.4
large E=16 16 2048 2048 256 37.2955 43.7300 7.7709 5.63x 0.3
large E=32 32 2048 2048 256 73.4335 86.3980 13.3932 6.45x 0.2
large E=64 64 2048 2048 256 145.6443 171.5714 24.1137 7.12x 0.1
uneven E=8 8 4096 4096 610 11.2536 24.1764 14.0933 1.72x 1.5

🎯 Accuracy Results

18 / 18 PASSED

| dtype | Variant | Status |
| --- | --- | --- |
| FP | float16 / bfloat16 | ✅✅ |
| INT4 sym | act=float16 / act=bfloat16 | ✅✅ |
| INT4 asym | act=float16 / act=bfloat16 | ✅✅ |
| INT8 sym | act=float16 / act=bfloat16 | ✅✅ |
| INT8 asym | act=float16 / act=bfloat16 | ✅✅ |
| INT2 sym | act=float16 / act=bfloat16 | ✅✅ |
| INT2 asym | act=float16 / act=bfloat16 | ✅✅ |
| FP8 e4m3fn | act=float16 / act=bfloat16 | ✅✅ |
| FP8 e5m2 | act=float16 / act=bfloat16 | ✅✅ |

📌 Summary

| Path | Best speedup | Typical speedup | Notes |
| --- | --- | --- | --- |
| FP (fp16/bf16) | 19.21x (E=64) | 2–4x on small/medium | Scales strongly with expert count |
| INT4 (sym/asym) | 9.22x | 2.3–3.3x | Avoids costly per-expert dequant |
| INT8 (sym/asym) | 7.87x | 1.2–2.3x | Most benefit on large E≥16 |
| INT2 (sym/asym) | 29.80x | 5–15x | Biggest wins overall |
| FP8 e4m3fn / e5m2 | 7.12x | 1.3–4.5x | ⚠️ e4m3fn + bf16 has unstable rows; re-run recommended |

  • ✅ All 18 accuracy tests pass.
  • ✅ All 18 perf tests pass; ark.moe_gemm_prefill outperforms the baseline in nearly all configs.
  • ⚠️ One config (FP8 e4m3fn, act=bfloat16) shows abnormal timings — needs investigation.

@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

- Add test/test_ark/test_moe_model_perf.py: real MoE LLM perf scan
  (tiny by default, AR_MOE_PERF_FULL=1 for full checkpoints) over
  prefill + per-token decode latency. Compares FP reference vs ARK
  backend (and optional GPTQModel). Asserts ARK decode <= 2x FP to
  guard against silent ark.moe dispatcher regressions.
- Remove obsolete auto_round_extension/ark/test/test_moe_model_perf.py
  (synthetic kernel-trace bench; replaced by the model-level one).
- Register the `perf` pytest marker in pyproject.toml.
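
The tiny-versus-full gating on `AR_MOE_PERF_FULL` can be sketched as a small environment check. This is an illustration under assumptions: `model_slice_config` and the returned dictionary shape are hypothetical, while the variable name and the tiny defaults (`num_layers=2`, `num_experts=4`) come from the PR description.

```python
import os
from typing import Mapping, Optional


def model_slice_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick tiny-slice or full-checkpoint settings from AR_MOE_PERF_FULL."""
    env = os.environ if env is None else env
    if env.get("AR_MOE_PERF_FULL") == "1":
        return {"full": True}
    # Tiny defaults quoted in the PR description.
    return {"full": False, "num_layers": 2, "num_experts": 4}
```

Passing the environment as a parameter keeps the helper easy to unit-test without mutating `os.environ`.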


4 participants