perf(models): profile apertus/seed_oss decode and trim xIELU op by inureyes · Pull Request #399 · lablup/mlxcel

inureyes · 2026-06-22T03:11:24Z

Summary

Characterizes the per-model dense decode overhead for apertus and seed_oss versus mlx-lm and recovers the one cheap, bit-exact lever found, satisfying all three acceptance criteria of #269. The headline result is that most of the gap reported in the issue was the bf16-scale quantized decode regression (#289), which #290 fixed after the issue was filed, so current main already sits at ~0.98x / ~1.0x of mlx-lm.

Per-model characterization (M1 Ultra)

Measured with mlxcel-bench-decode (warmup pass + measured pass in one process, --no-chat-template, 100 decode tokens, temp-0 greedy), against the mlx-lm 0.31.3 baselines quoted in the issue.

apertus (Apertus-8B-Instruct-2509-4bit): ~83.5 tok/s vs mlx-lm ~85 = ~0.98x (the issue reported 0.83x at filing). Both apertus and seed_oss ship bf16 quant scales and were in the set regressed by #289 (the bf16->f16 quant-scale promotion from #260 regresses bf16-scale quantized model decode ~33-41% on M1 Ultra). #290 ("keep quantized models bf16", merged 2026-06-15, after #269 was filed 2026-06-13) recovered the bulk of the gap.
seed_oss (Seed-OSS-36B-Instruct-4bit): ~20.4 tok/s vs mlx-lm ~20.3 = ~1.0x (the issue reported 0.84x at filing). seed_oss is a vanilla Llama dense and carries no model-specific lever beyond what #290 already recovered.
Where the remaining overhead is: MLXCEL_PROFILE_FORWARD shows steady-state apertus decode is GPU-eval-bound, ~0.7ms host graph-build vs ~13ms GPU eval per token (host build is only ~5% of the step), so host-side op-count reductions have very limited headroom. MLXCEL_PROFILE_BLOCKS attributes the forced-serialized per-layer time as attn ~45% / xIELU MLP ~55% (apertus) and attn ~38% / SwiGLU MLP ~62% (seed_oss). This matches the project's prior finding that these gaps are diffuse per-op / kernel-launch overhead with no single high-value lever.
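As a rough check on the figures above, the arithmetic can be worked through in a few lines. This is only a sketch: the function names are hypothetical, and the headroom bound assumes host graph-build and GPU eval run back to back with no overlap, which the profile does not state.

```rust
/// Throughput ratio of mlxcel to the mlx-lm baseline (both in tok/s).
fn ratio(ours: f64, baseline: f64) -> f64 {
    ours / baseline
}

/// Share of the per-token step spent in host graph-build,
/// assuming the step is host time plus GPU time with no overlap.
fn host_share(host_ms: f64, gpu_ms: f64) -> f64 {
    host_ms / (host_ms + gpu_ms)
}

/// Upper bound on speedup if host graph-build cost dropped to zero
/// (same no-overlap assumption as above).
fn max_speedup_if_host_free(host_ms: f64, gpu_ms: f64) -> f64 {
    (host_ms + gpu_ms) / gpu_ms
}

fn main() {
    // Figures quoted in the characterization above.
    println!("apertus:  {:.2}x of mlx-lm", ratio(83.5, 85.0));
    println!("seed_oss: {:.2}x of mlx-lm", ratio(20.4, 20.3));
    println!("host share of step: {:.1}%", 100.0 * host_share(0.7, 13.0));
    println!(
        "best-case speedup from host-side work alone: {:.3}x",
        max_speedup_if_host_free(0.7, 13.0)
    );
}
```

Under that assumption, even removing all host-side work would buy only about 5%, which is why op-count reductions on the host are described as having very limited headroom.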

What changed

Adds an env-gated, default-off forward profiling hook to src/models/apertus.rs and src/models/seed_oss.rs, mirroring the existing hooks in nemotron_h.rs / qwen3_moe.rs: MLXCEL_PROFILE_BLOCKS attributes single-token decode wall-clock to the attention vs MLP sub-blocks, and MLXCEL_PROFILE_FORWARD splits the step into graph-build vs GPU-eval. Both flags are read once and cached in a OnceLock, and array_shape is only queried when a flag is set, so the steady-state decode path is unchanged (two atomic loads per token, no extra FFI calls or eval() when profiling is off).
Recovers one cheap, bit-exact op in apertus xIELU (apertus_xielu): both branches end in + beta * x, so that shared term is now added once after the per-element select instead of inside each branch, dropping one elementwise add (and its kernel launch) per layer per token. The select only chooses which value beta * x is added to, so the arithmetic is unchanged.
Adds a CPU-only unit test xielu_factored_beta_x_matches_branch_local asserting the factored form equals the branch-local reference bit-for-bit in f32 across positive, negative, and zero inputs.
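The xIELU factoring described above can be illustrated with a scalar f32 sketch. This is not the code from apertus.rs: the function names and parameter values are hypothetical, and the branch formulas assume the commonly published xIELU form (alpha_p * x^2 for x > 0, alpha_n * (expm1(min(x, eps)) - x) otherwise, each plus beta * x). Only the "+ beta * x" tail shared by both branches is taken from the PR text.

```rust
/// Branch-local reference: each branch adds `beta * x` itself.
/// The branch bodies are an assumed xIELU form, not copied from the PR.
fn xielu_branch_local(x: f32, alpha_p: f32, alpha_n: f32, beta: f32, eps: f32) -> f32 {
    if x > 0.0 {
        alpha_p * x * x + beta * x
    } else {
        alpha_n * (x.min(eps).exp_m1() - x) + beta * x
    }
}

/// Factored form: select the branch-specific core first,
/// then add the shared `beta * x` term exactly once.
fn xielu_factored(x: f32, alpha_p: f32, alpha_n: f32, beta: f32, eps: f32) -> f32 {
    let core = if x > 0.0 {
        alpha_p * x * x
    } else {
        alpha_n * (x.min(eps).exp_m1() - x)
    };
    core + beta * x
}

fn main() {
    // Hypothetical parameter values, chosen only for illustration.
    let (alpha_p, alpha_n, beta, eps) = (0.8_f32, 1.3_f32, 0.5_f32, -1e-6_f32);
    for x in [-4.0_f32, -1.0, -0.25, 0.0, 0.25, 1.0, 4.0, 12.5] {
        let a = xielu_branch_local(x, alpha_p, alpha_n, beta, eps);
        let b = xielu_factored(x, alpha_p, alpha_n, beta, eps);
        // Same operands in the same order, so the bit patterns must agree.
        assert_eq!(a.to_bits(), b.to_bits(), "mismatch at x = {x}");
    }
    println!("factored form matches branch-local reference bit-for-bit");
}
```

For any single element the final operation is core + beta * x with the same operands in both forms, so the two functions perform the same sequence of f32 operations and no rounding step changes. In the array version the saving comes from issuing one add after the select rather than one add per branch.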

Greedy temp-0 output unchanged (byte-identical)

Verified the production decode path (no profiling env vars) is byte-identical to the pre-change main binary for both models: apertus 500-byte continuation and seed_oss 308-byte continuation both diff-clean. The decode delta is within run-to-run noise (~83.3 -> ~83.5 tok/s on apertus), consistent with the GPU-eval-bound profile above.

Recommended follow-up

A fused/compiled xIELU activation (a single mx::core::compile window or kernel over the ~12 elementwise ops) is the only remaining real lever for apertus, but it touches the C++ FFI bridge and risks f16 jitter, so it is intentionally out of scope here and should be a separate issue. Constant-hoisting the per-call eps/zero arrays in apertus_xielu was evaluated and rejected: it would need runtime-dtype-keyed caching to stay bit-identical and saves only host-side FFI calls, which the FORWARD profile shows are a negligible fraction of the step.

Test plan

cargo test --release -p mlxcel apertus_tests:: (9 passed, incl. new factoring test)
cargo test --release -p mlxcel seed_oss_tests:: (7 passed)
cargo clippy --features metal,accelerate -p mlxcel --lib --tests -- -D warnings (clean)
cargo fmt --check (clean)
apertus greedy temp-0 output byte-identical to baseline (500 bytes, diff-clean)
seed_oss greedy temp-0 output byte-identical to baseline (308 bytes, diff-clean)
apertus decode re-benchmarked before/after (~83.3 -> ~83.5 tok/s, within noise)
MLXCEL_PROFILE_BLOCKS and MLXCEL_PROFILE_FORWARD exercised on both models

Closes #269

Characterizes the per-model dense decode overhead for apertus and seed_oss versus mlx-lm and recovers the one cheap, bit-exact lever found, satisfying the acceptance criteria for #269.

Characterization (M1 Ultra, mlxcel-bench-decode warmup harness, temp-0, --no-chat-template, 100 decode tokens, mlx-lm 0.31.3 baselines from the issue): the bulk of the gap reported in #269 was the bf16-scale quantized decode regression (#289), which was fixed by #290 after this issue was filed. Both families ship bf16 quant scales and were in the regressed set, so current main already closes most of the gap: apertus is ~83.5 tok/s vs mlx-lm ~85 (~0.98x, was 0.83x at filing) and seed_oss is ~20.4 tok/s vs mlx-lm ~20.3 (~1.0x, was 0.84x at filing).

The remaining apertus delta is diffuse per-op / kernel-launch overhead with no single high-value host-side lever: MLXCEL_PROFILE_FORWARD shows steady-state decode is GPU-eval-bound (~0.7ms host graph-build vs ~13ms GPU eval per token, host build is ~5% of the step), and MLXCEL_PROFILE_BLOCKS attributes the forced-serialized per-layer time as attn ~45% / xIELU MLP ~55% for apertus and attn ~38% / SwiGLU MLP ~62% for seed_oss. seed_oss is a vanilla Llama dense and carries no model-specific lever beyond what #290 already recovered.

Adds an env-gated, default-off forward profiling hook to apertus and seed_oss, mirroring the existing hooks in nemotron_h.rs and qwen3_moe.rs. MLXCEL_PROFILE_BLOCKS attributes single-token decode wall-clock to the attention vs MLP sub-blocks; MLXCEL_PROFILE_FORWARD splits the step into graph-build vs GPU-eval. Both flags are read once and cached in a OnceLock, and array_shape is only queried when a flag is set, so the steady-state decode path is unchanged (two atomic loads per token, no extra FFI calls or eval()).

Recovers one cheap, bit-exact op in apertus xIELU: both branches end in `+ beta * x`, so that shared term is now added once after the per-element select instead of inside each branch. This drops one elementwise add (and its kernel launch) per layer per token. The select only chooses which value `beta * x` is added to, so the arithmetic is unchanged.

Verified greedy temp-0 output is byte-identical to the pre-change baseline for both apertus (500 bytes) and seed_oss (308 bytes); the decode delta is within run-to-run noise (~83.3 -> ~83.5 tok/s on apertus), consistent with the GPU-eval-bound profile. A fused/compiled xIELU activation (single kernel over the elementwise graph) is the only remaining real lever and is left as a documented follow-up, since it touches the C++ FFI bridge and risks f16 jitter.

Adds a CPU-only unit test (xielu_factored_beta_x_matches_branch_local) asserting the factored form equals the branch-local reference bit-for-bit in f32 across positive, negative, and zero inputs.

inureyes · 2026-06-22T03:18:20Z

Implementation Review Summary

Intent

Characterize apertus/seed_oss dense decode overhead vs mlx-lm (#269) and recover the one cheap, bit-exact lever (factor the shared beta*x out of the xIELU select), without changing greedy temp-0 output.

Findings Addressed

None requiring code changes. No CRITICAL/HIGH/MEDIUM findings.

Verification notes

xIELU beta*x factoring is genuinely bit-exact. This is factoring a common addend out of a per-element select, not reassociation. For any element the result is core[i] + beta*x[i] with identical operands and operand order in both old and new forms; where_cond only selects (no re-rounding). multiply_scalar preserves input dtype (no float32 promotion, no astype), so beta*x is the same array as the old beta_x. Net effect is one fewer elementwise add per layer per token. Greedy temp-0 byte-identical claim is credible.
Test xielu_factored_beta_x_matches_branch_local is adequate. It exercises negative (-4, -1, -0.25), zero (0.0), and positive (0.25, 1, 4, 12.5) inputs and asserts to_bits() equality between the branch-local reference (xielu_scalar) and the factored form. Algebraically guaranteed to pass.
Profiling hooks are default-off and near-zero cost. Production path (env vars unset) pays two OnceLock atomic loads per token; array_shape is short-circuited and never called, no eval(), no Instant::now() (Option::map on None). This mirrors the nemotron_h.rs/qwen3_moe.rs pattern and is actually cleaner: those read std::env::var on every token, this caches in OnceLock. Hooks are wired into the real forward loop (forward_timed), not dead code.
Conventions clean. No shared function modified (only apertus_xielu, a private model fn, was touched), so no // Used by: update is owed. No stray AsType. Apple Silicon precision rules respected. TransformerBlock::forward is kept as a public delegating wrapper, preserving the API surface; it stays reachable via pub mod models, so no dead-code warning.
Scope contained. Only apertus.rs, apertus_tests.rs, seed_oss.rs changed; production logits path is restructured but semantically identical when profiling is off.
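The default-off pattern noted above (flag cached in a OnceLock, no Instant::now() when the flag is unset) might look roughly like the following. This is a minimal sketch, not the code in apertus.rs or seed_oss.rs: profile_blocks_enabled is named in this review, but its body here, the env_flag and timed helpers, and the "set to 1 means on" parsing are all assumptions made for illustration.

```rust
use std::sync::OnceLock;
use std::time::{Duration, Instant};

/// Hypothetical parsing rule: the flag counts as on only when set to "1".
fn env_flag(name: &str) -> bool {
    std::env::var(name).map(|v| v == "1").unwrap_or(false)
}

/// Reads the environment once; later calls only load the cached bool.
fn profile_blocks_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| env_flag("MLXCEL_PROFILE_BLOCKS"))
}

/// Runs `f`, timing it only when `enabled` is true.
/// When disabled, `Instant::now` is never called and the result is `None`.
fn timed<T>(enabled: bool, f: impl FnOnce() -> T) -> (T, Option<Duration>) {
    let start = enabled.then(Instant::now);
    let out = f();
    // A real hook would have to force GPU evaluation before reading the
    // clock; this stand-in only measures host-side work.
    (out, start.map(|t| t.elapsed()))
}

fn main() {
    // Stand-in for one sub-block of a decode step.
    let (sum, dt) = timed(profile_blocks_enabled(), || (0..1000u64).sum::<u64>());
    if let Some(dt) = dt {
        eprintln!("block took {dt:?}");
    }
    println!("result = {sum}");
}
```

With the variable unset, the per-call cost in this sketch is the cached flag load plus a branch, which is in the spirit of the "two atomic loads per token" description for the two flags.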

Minor (LOW, not fixed)

BlockTiming + profile_blocks_enabled/profile_forward_enabled are duplicated verbatim across apertus.rs and seed_oss.rs. This is consistent with the existing per-model profiling precedent (nemotron_h/qwen3_moe inline their own variants and do not share), so it is left as-is.

Verification

All stated requirements implemented (characterization, cheap recovery, output-unchanged)
No placeholder/mock code remaining
Integrated into project code flow (forward_timed in the layer loop; FORWARD/BLOCKS reachable)
Project conventions followed
Existing modules reused where applicable
No unintended structural changes (public API stable)
Tests pass (new scalar test verified by inspection; full build/clippy/tests run by orchestrator)

inureyes · 2026-06-22T03:38:25Z

PR Finalization Complete

Summary

Tests: xielu_factored_beta_x_matches_branch_local covers the bit-exact equivalence of the factored xIELU form; existing tests cover config parsing, field defaults, softplus numerics, and the branch formula. No gaps found, nothing added.
Documentation: MLXCEL_PROFILE_BLOCKS and MLXCEL_PROFILE_FORWARD are already documented in docs/environment-variables.md with "where implemented" / "model-specific forward profiling where implemented" phrasing, which is inherently non-stale and requires no update when new models adopt the hooks. No changes needed.
Lint/Format: cargo fmt produced no diff on the three touched files. cargo clippy --features metal,accelerate -p mlxcel --lib --tests -- -D warnings passed clean.

Branch is up to date with remote. Ready for merge.

inureyes added status:done Completed and removed status:review Under review labels Jun 22, 2026

inureyes merged commit 3d73b66 into main Jun 22, 2026
5 checks passed

inureyes deleted the perf/issue-269-apertus-seedoss-dense-decode branch June 22, 2026 03:39
