perf(models): profile apertus/seed_oss decode and trim xIELU op #399
Merged
Characterizes the per-model dense decode overhead for apertus and seed_oss versus mlx-lm and recovers the one cheap, bit-exact lever found, satisfying the acceptance criteria for #269.

Characterization (M1 Ultra, mlxcel-bench-decode warmup harness, temp-0, --no-chat-template, 100 decode tokens, mlx-lm 0.31.3 baselines from the issue): the bulk of the gap reported in #269 was the bf16-scale quantized decode regression (#289), which was fixed by #290 after this issue was filed. Both families ship bf16 quant scales and were in the regressed set, so current main already closes most of the gap: apertus is ~83.5 tok/s vs mlx-lm ~85 (~0.98x, was 0.83x at filing) and seed_oss is ~20.4 tok/s vs mlx-lm ~20.3 (~1.0x, was 0.84x at filing).

The remaining apertus delta is diffuse per-op / kernel-launch overhead with no single high-value host-side lever: MLXCEL_PROFILE_FORWARD shows steady-state decode is GPU-eval-bound (~0.7ms host graph-build vs ~13ms GPU eval per token, host build is ~5% of the step), and MLXCEL_PROFILE_BLOCKS attributes the forced-serialized per-layer time as attn ~45% / xIELU MLP ~55% for apertus and attn ~38% / SwiGLU MLP ~62% for seed_oss. seed_oss is a vanilla Llama dense and carries no model-specific lever beyond what #290 already recovered.

Adds an env-gated, default-off forward profiling hook to apertus and seed_oss, mirroring the existing hooks in nemotron_h.rs and qwen3_moe.rs. MLXCEL_PROFILE_BLOCKS attributes single-token decode wall-clock to the attention vs MLP sub-blocks; MLXCEL_PROFILE_FORWARD splits the step into graph-build vs GPU-eval. Both flags are read once and cached in a OnceLock, and array_shape is only queried when a flag is set, so the steady-state decode path is unchanged (two atomic loads per token, no extra FFI calls or eval()).

Recovers one cheap, bit-exact op in apertus xIELU: both branches end in `+ beta * x`, so that shared term is now added once after the per-element select instead of inside each branch. This drops one elementwise add (and its kernel launch) per layer per token. The select only chooses which value beta*x is added to, so the arithmetic is unchanged.

Verified greedy temp-0 output is byte-identical to the pre-change baseline for both apertus (500 bytes) and seed_oss (308 bytes); the decode delta is within run-to-run noise (~83.3 -> ~83.5 tok/s on apertus), consistent with the GPU-eval-bound profile.

A fused/compiled xIELU activation (single kernel over the elementwise graph) is the only remaining real lever and is left as a documented follow-up, since it touches the C++ FFI bridge and risks f16 jitter.

Adds a CPU-only unit test (xielu_factored_beta_x_matches_branch_local) asserting the factored form equals the branch-local reference bit-for-bit in f32 across positive, negative, and zero inputs.
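The factoring argument can be illustrated with a scalar sketch. This is not the code from `apertus.rs`, which operates on MLX arrays: the `XieluParams` struct, both function names, the parameter values, and the exact branch bodies (the commonly cited xIELU form, `alpha_p * x^2` for positive inputs and `alpha_n * (expm1(min(x, eps)) - x)` otherwise) are assumptions made for illustration. What it shows is only that hoisting the shared `beta * x` term out of the select leaves every f32 result bit-identical.

```rust
/// Hypothetical scalar parameters; the real op works on MLX arrays.
#[derive(Clone, Copy)]
struct XieluParams {
    alpha_p: f32,
    alpha_n: f32,
    beta: f32,
    eps: f32,
}

/// Branch-local reference: each branch adds `beta * x` itself.
fn xielu_branch_local(x: f32, p: XieluParams) -> f32 {
    if x > 0.0 {
        p.alpha_p * x * x + p.beta * x
    } else {
        p.alpha_n * (x.min(p.eps).exp_m1() - x) + p.beta * x
    }
}

/// Factored form: select the branch value first, then add `beta * x` once.
fn xielu_factored(x: f32, p: XieluParams) -> f32 {
    let selected = if x > 0.0 {
        p.alpha_p * x * x
    } else {
        p.alpha_n * (x.min(p.eps).exp_m1() - x)
    };
    selected + p.beta * x
}

fn main() {
    // Illustrative values only, not taken from any checkpoint.
    let p = XieluParams { alpha_p: 0.8, alpha_n: 0.8, beta: 0.5, eps: -1e-6 };
    // Positive, negative, and (signed) zero inputs.
    for x in [-3.0f32, -0.5, -0.0, 0.0, 0.25, 2.0, 7.5] {
        let a = xielu_branch_local(x, p);
        let b = xielu_factored(x, p);
        // Compare raw bit patterns, not approximate equality.
        assert_eq!(a.to_bits(), b.to_bits(), "mismatch at x = {x}");
    }
    println!("factored form matches branch-local reference bit-for-bit");
}
```

For each element, both forms perform the same final addition on the same two operands; the select only decides which operand is used. That is why no tolerance is needed in the comparison.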
Implementation Review Summary
Intent
Findings Addressed
Verification notes
Minor (LOW, not fixed)
Verification
PR Finalization Complete
Summary
Branch is up to date with remote. Ready for merge.
Summary

Characterizes the per-model dense decode overhead for `apertus` and `seed_oss` versus mlx-lm and recovers the one cheap, bit-exact lever found, satisfying all three acceptance criteria of #269. The headline result is that most of the gap reported in the issue was the bf16-scale quantized decode regression (#289), which #290 fixed after the issue was filed, so current `main` already sits at ~0.98x / ~1.0x of mlx-lm.

Per-model characterization (M1 Ultra)

Measured with `mlxcel-bench-decode` (warmup pass + measured pass in one process, `--no-chat-template`, 100 decode tokens, temp-0 greedy), against the mlx-lm 0.31.3 baselines quoted in the issue.

- apertus (`Apertus-8B-Instruct-2509-4bit`): ~83.5 tok/s vs mlx-lm ~85 = ~0.98x (the issue reported 0.83x at filing). Both apertus and seed_oss ship bf16 quant scales and were in the set regressed by #289 (the bf16->f16 quant-scale promotion from #260, which regressed bf16-scale quantized model decode ~33-41% on M1 Ultra); #290 ("keep quantized models bf16", merged 2026-06-15, after #269 was filed 2026-06-13) recovered the bulk of the gap.
- seed_oss (`Seed-OSS-36B-Instruct-4bit`): ~20.4 tok/s vs mlx-lm ~20.3 = ~1.0x (the issue reported 0.84x at filing). seed_oss is a vanilla Llama dense and carries no model-specific lever beyond what #290 already recovered.
- `MLXCEL_PROFILE_FORWARD` shows steady-state apertus decode is GPU-eval-bound, ~0.7ms host graph-build vs ~13ms GPU eval per token (host build is only ~5% of the step), so host-side op-count reductions have very limited headroom. `MLXCEL_PROFILE_BLOCKS` attributes the forced-serialized per-layer time as attn ~45% / xIELU MLP ~55% (apertus) and attn ~38% / SwiGLU MLP ~62% (seed_oss). This matches the project's prior finding that these gaps are diffuse per-op / kernel-launch overhead with no single high-value lever.

What changed

- Adds an env-gated, default-off forward profiling hook to `src/models/apertus.rs` and `src/models/seed_oss.rs`, mirroring the existing hooks in `nemotron_h.rs` / `qwen3_moe.rs`: `MLXCEL_PROFILE_BLOCKS` attributes single-token decode wall-clock to the attention vs MLP sub-blocks, and `MLXCEL_PROFILE_FORWARD` splits the step into graph-build vs GPU-eval. Both flags are read once and cached in a `OnceLock`, and `array_shape` is only queried when a flag is set, so the steady-state decode path is unchanged (two atomic loads per token, no extra FFI calls or `eval()` when profiling is off).
- Recovers one cheap, bit-exact op in apertus xIELU (`apertus_xielu`): both branches end in `+ beta * x`, so that shared term is now added once after the per-element select instead of inside each branch, dropping one elementwise add (and its kernel launch) per layer per token. The select only chooses which value `beta * x` is added to, so the arithmetic is unchanged.
- Adds a CPU-only unit test `xielu_factored_beta_x_matches_branch_local` asserting the factored form equals the branch-local reference bit-for-bit in f32 across positive, negative, and zero inputs.

Greedy temp-0 output unchanged (byte-identical)

Verified the production decode path (no profiling env vars) is byte-identical to the pre-change `main` binary for both models: apertus 500-byte continuation and seed_oss 308-byte continuation both `diff`-clean. The decode delta is within run-to-run noise (~83.3 -> ~83.5 tok/s on apertus), consistent with the GPU-eval-bound profile above.

Recommended follow-up

A fused/compiled xIELU activation (a single `mx::core::compile` window or kernel over the ~12 elementwise ops) is the only remaining real lever for apertus, but it touches the C++ FFI bridge and risks f16 jitter, so it is intentionally out of scope here and should be a separate issue. Constant-hoisting the per-call `eps`/`zero` arrays in `apertus_xielu` was evaluated and rejected: it would need runtime-dtype-keyed caching to stay bit-identical and saves only host-side FFI calls, which the FORWARD profile shows are a negligible fraction of the step.

Test plan

- `cargo test --release -p mlxcel apertus_tests::` (9 passed, incl. new factoring test)
- `cargo test --release -p mlxcel seed_oss_tests::` (7 passed)
- `cargo clippy --features metal,accelerate -p mlxcel --lib --tests -- -D warnings` (clean)
- `cargo fmt --check` (clean)
- `MLXCEL_PROFILE_BLOCKS` and `MLXCEL_PROFILE_FORWARD` exercised on both models

Closes #269
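For readers unfamiliar with the pattern behind the profiling flags exercised in the test plan, the following is a minimal sketch of an env-gated, default-off hook cached in a `OnceLock`. It is not the code from this PR: `env_flag`, `profile_forward_enabled`, `build_graph`, `eval_graph`, and `decode_step` are hypothetical names, the two stand-in functions do trivial integer work rather than building or evaluating an MLX graph, and the rule for which environment values count as "on" is an assumption.

```rust
use std::sync::OnceLock;
use std::time::Instant;

/// Assumed parsing rule: any non-empty value other than "0" enables the flag.
fn env_flag(name: &str) -> bool {
    std::env::var(name)
        .map(|v| !v.is_empty() && v != "0")
        .unwrap_or(false)
}

/// Reads the environment once; later calls only load the cached bool.
fn profile_forward_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| env_flag("MLXCEL_PROFILE_FORWARD"))
}

/// Hypothetical stand-in for host-side graph construction.
fn build_graph(token: u32) -> u64 {
    u64::from(token) * 2 + 1
}

/// Hypothetical stand-in for forcing evaluation on the GPU.
fn eval_graph(graph: u64) -> u64 {
    graph.wrapping_mul(31)
}

fn decode_step(token: u32) -> u64 {
    // Fast path: with profiling off, no timers or extra work are added.
    if !profile_forward_enabled() {
        return eval_graph(build_graph(token));
    }
    // Profiling path: time graph-build and eval separately.
    let t0 = Instant::now();
    let graph = build_graph(token);
    let build = t0.elapsed();
    let t1 = Instant::now();
    let out = eval_graph(graph);
    let eval = t1.elapsed();
    eprintln!("forward: build={build:?} eval={eval:?}");
    out
}

fn main() {
    let out = decode_step(7);
    println!("decode_step(7) = {out}");
}
```

The design point is that the environment lookup happens at most once per process, so the per-token cost when profiling is disabled is a cached boolean check and the computed result is the same on both paths.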