
perf(models): profile apertus/seed_oss decode and trim xIELU op #399

Merged
inureyes merged 1 commit into main from perf/issue-269-apertus-seedoss-dense-decode
Jun 22, 2026


@inureyes

Member

Summary

Characterizes the per-model dense decode overhead for apertus and seed_oss versus mlx-lm and recovers the one cheap, bit-exact lever found, satisfying all three acceptance criteria of #269. The headline result is that most of the gap reported in the issue was the bf16-scale quantized decode regression (#289), which #290 fixed after the issue was filed, so current main already sits at ~0.98x / ~1.0x of mlx-lm.

Per-model characterization (M1 Ultra)

Measured with mlxcel-bench-decode (warmup pass + measured pass in one process, --no-chat-template, 100 decode tokens, temp-0 greedy), against the mlx-lm 0.31.3 baselines quoted in the issue.

What changed

  • Adds an env-gated, default-off forward profiling hook to src/models/apertus.rs and src/models/seed_oss.rs, mirroring the existing hooks in nemotron_h.rs / qwen3_moe.rs: MLXCEL_PROFILE_BLOCKS attributes single-token decode wall-clock to the attention vs MLP sub-blocks, and MLXCEL_PROFILE_FORWARD splits the step into graph-build vs GPU-eval. Both flags are read once and cached in a OnceLock, and array_shape is only queried when a flag is set, so the steady-state decode path is unchanged (two atomic loads per token, no extra FFI calls or eval() when profiling is off).
  • Recovers one cheap, bit-exact op in apertus xIELU (apertus_xielu): both branches end in + beta * x, so that shared term is now added once after the per-element select instead of inside each branch, dropping one elementwise add (and its kernel launch) per layer per token. The select only chooses which value beta * x is added to, so the arithmetic is unchanged.
  • Adds a CPU-only unit test xielu_factored_beta_x_matches_branch_local asserting the factored form equals the branch-local reference bit-for-bit in f32 across positive, negative, and zero inputs.
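
The factoring described in the second bullet can be illustrated on scalars. The following is a minimal sketch, not the code in src/models/apertus.rs: the branch formulas, the parameter names (`alpha_p`, `alpha_n`, `beta`, `eps`), their values, and the function names are assumptions chosen for illustration. The PR itself only states that both branches end in `+ beta * x`. Because the select only picks which core value `beta * x` is added to, the operands and operand order of the final add are the same in both forms, so the results should agree bit-for-bit.

```rust
/// Branch-local reference: each branch adds `beta * x` itself.
/// The branch formulas are an assumed xIELU form, used only for illustration.
fn xielu_branch_local(x: f32, alpha_p: f32, alpha_n: f32, beta: f32, eps: f32) -> f32 {
    if x > 0.0 {
        alpha_p * x * x + beta * x
    } else {
        alpha_n * (x.min(eps).exp_m1() - x) + beta * x
    }
}

/// Factored form: select the branch-specific core first, then add the
/// shared `beta * x` term once, after the select.
fn xielu_factored(x: f32, alpha_p: f32, alpha_n: f32, beta: f32, eps: f32) -> f32 {
    let core = if x > 0.0 {
        alpha_p * x * x
    } else {
        alpha_n * (x.min(eps).exp_m1() - x)
    };
    core + beta * x
}

/// Returns true when both forms agree bit-for-bit on every input.
/// The parameter values below are hypothetical, not taken from a checkpoint.
fn forms_match_bitwise(inputs: &[f32]) -> bool {
    let (alpha_p, alpha_n, beta, eps) = (0.8_f32, 1.3_f32, 0.5_f32, -1e-6_f32);
    inputs.iter().all(|&x| {
        let reference = xielu_branch_local(x, alpha_p, alpha_n, beta, eps);
        let factored = xielu_factored(x, alpha_p, alpha_n, beta, eps);
        reference.to_bits() == factored.to_bits()
    })
}
```

Calling `forms_match_bitwise` with the inputs listed in the review below (-4, -1, -0.25, 0, 0.25, 1, 4, 12.5) exercises the negative, zero, and positive cases. In the array version the saving comes from the graph holding one elementwise add instead of two, which this scalar sketch does not model.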

Greedy temp-0 output unchanged (byte-identical)

Verified the production decode path (no profiling env vars) is byte-identical to the pre-change main binary for both models: apertus 500-byte continuation and seed_oss 308-byte continuation both diff-clean. The decode delta is within run-to-run noise (~83.3 -> ~83.5 tok/s on apertus), consistent with the GPU-eval-bound profile above.

Recommended follow-up

A fused/compiled xIELU activation (a single mlx::core::compile window or kernel over the ~12 elementwise ops) is the only remaining real lever for apertus, but it touches the C++ FFI bridge and risks f16 jitter, so it is intentionally out of scope here and should be a separate issue. Constant-hoisting the per-call eps/zero arrays in apertus_xielu was evaluated and rejected: it would need runtime-dtype-keyed caching to stay bit-identical and saves only host-side FFI calls, which the FORWARD profile shows are a negligible fraction of the step.

Test plan

  • cargo test --release -p mlxcel apertus_tests:: (9 passed, incl. new factoring test)
  • cargo test --release -p mlxcel seed_oss_tests:: (7 passed)
  • cargo clippy --features metal,accelerate -p mlxcel --lib --tests -- -D warnings (clean)
  • cargo fmt --check (clean)
  • apertus greedy temp-0 output byte-identical to baseline (500 bytes, diff-clean)
  • seed_oss greedy temp-0 output byte-identical to baseline (308 bytes, diff-clean)
  • apertus decode re-benchmarked before/after (~83.3 -> ~83.5 tok/s, within noise)
  • MLXCEL_PROFILE_BLOCKS and MLXCEL_PROFILE_FORWARD exercised on both models

Closes #269

Characterizes the per-model dense decode overhead for apertus and seed_oss versus mlx-lm and recovers the one cheap, bit-exact lever found, satisfying the acceptance criteria for #269.

Characterization (M1 Ultra, mlxcel-bench-decode warmup harness, temp-0, --no-chat-template, 100 decode tokens, mlx-lm 0.31.3 baselines from the issue). The bulk of the gap reported in #269 was the bf16-scale quantized decode regression (#289), which was fixed by #290 after this issue was filed. Both families ship bf16 quant scales and were in the regressed set, so current main already closes most of the gap: apertus is ~83.5 tok/s vs mlx-lm ~85 (~0.98x, was 0.83x at filing) and seed_oss is ~20.4 tok/s vs mlx-lm ~20.3 (~1.0x, was 0.84x at filing). The remaining apertus delta is diffuse per-op / kernel-launch overhead with no single high-value host-side lever: MLXCEL_PROFILE_FORWARD shows steady-state decode is GPU-eval-bound (~0.7ms host graph-build vs ~13ms GPU eval per token, so host build is ~5% of the step), and MLXCEL_PROFILE_BLOCKS attributes the forced-serialized per-layer time as attn ~45% / xIELU MLP ~55% for apertus and attn ~38% / SwiGLU MLP ~62% for seed_oss. seed_oss is a vanilla Llama-style dense model and carries no model-specific lever beyond what #290 already recovered.
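
The ratios quoted above follow directly from the measured figures. As a small worked sketch (the function names are hypothetical and exist only for this example), the arithmetic is:

```rust
/// Throughput of one runtime relative to a baseline, both in tok/s.
fn throughput_ratio(ours_tok_s: f64, baseline_tok_s: f64) -> f64 {
    ours_tok_s / baseline_tok_s
}

/// Fraction of a decode step spent building the graph on the host,
/// given host graph-build time and GPU-eval time in milliseconds.
fn host_share(build_ms: f64, eval_ms: f64) -> f64 {
    build_ms / (build_ms + eval_ms)
}
```

With the numbers from this commit message, `throughput_ratio(83.5, 85.0)` is about 0.98 for apertus, `throughput_ratio(20.4, 20.3)` is about 1.0 for seed_oss, and `host_share(0.7, 13.0)` is about 0.05, matching the ~5% host-build figure.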

Adds an env-gated, default-off forward profiling hook to apertus and seed_oss, mirroring the existing hooks in nemotron_h.rs and qwen3_moe.rs. MLXCEL_PROFILE_BLOCKS attributes single-token decode wall-clock to the attention vs MLP sub-blocks; MLXCEL_PROFILE_FORWARD splits the step into graph-build vs GPU-eval. Both flags are read once and cached in a OnceLock, and array_shape is only queried when a flag is set, so the steady-state decode path is unchanged (two atomic loads per token, no extra FFI calls or eval()).
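
A read-once, cached flag of the kind described above might look roughly like the sketch below. It is an illustration of the `OnceLock` pattern under stated assumptions, not the code from this PR: the names `profile_blocks_enabled` and `profile_forward_enabled` appear in the review, but the helper `flag_from` and the rule for which values count as "enabled" are hypothetical.

```rust
use std::ffi::OsStr;
use std::sync::OnceLock;

/// Interprets an environment value as a flag. Assumption for this sketch:
/// any set, non-empty value other than "0" enables the flag.
fn flag_from(value: Option<&OsStr>) -> bool {
    match value {
        Some(v) => !v.is_empty() && v != OsStr::new("0"),
        None => false,
    }
}

/// Reads MLXCEL_PROFILE_BLOCKS on the first call and caches the result, so
/// later calls only load the cached value instead of querying the environment.
fn profile_blocks_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| flag_from(std::env::var_os("MLXCEL_PROFILE_BLOCKS").as_deref()))
}

/// Same pattern for MLXCEL_PROFILE_FORWARD.
fn profile_forward_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| flag_from(std::env::var_os("MLXCEL_PROFILE_FORWARD").as_deref()))
}
```

Because the value is fixed at first use, changing the environment variable after the first decode step has no effect in this sketch. That is the trade-off for avoiding a per-token environment lookup.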

Recovers one cheap, bit-exact op in apertus xIELU: both branches end in `+ beta * x`, so that shared term is now added once after the per-element select instead of inside each branch. This drops one elementwise add (and its kernel launch) per layer per token. The select only chooses which value beta*x is added to, so the arithmetic is unchanged. Verified greedy temp-0 output is byte-identical to the pre-change baseline for both apertus (500 bytes) and seed_oss (308 bytes); the decode delta is within run-to-run noise (~83.3 -> ~83.5 tok/s on apertus), consistent with the GPU-eval-bound profile. A fused/compiled xIELU activation (single kernel over the elementwise graph) is the only remaining real lever and is left as a documented follow-up, since it touches the C++ FFI bridge and risks f16 jitter.

Adds a CPU-only unit test (xielu_factored_beta_x_matches_branch_local) asserting the factored form equals the branch-local reference bit-for-bit in f32 across positive, negative, and zero inputs.
inureyes added the type:performance, priority:low, area:models, area:inference, platform:macos, and status:review labels on Jun 22, 2026
@inureyes

Member Author

Implementation Review Summary

Intent

Characterize apertus/seed_oss dense decode overhead vs mlx-lm (#269) and recover the one cheap, bit-exact lever (factor the shared beta*x out of the xIELU select), without changing greedy temp-0 output.

Findings Addressed

  • None requiring code changes. No CRITICAL/HIGH/MEDIUM findings.

Verification notes

  • xIELU beta*x factoring is genuinely bit-exact. This is factoring a common addend out of a per-element select, not reassociation. For any element the result is core[i] + beta*x[i] with identical operands and operand order in both old and new forms; where_cond only selects (no re-rounding). multiply_scalar preserves input dtype (no float32 promotion, no astype), so beta*x is the same array as the old beta_x. Net effect is one fewer elementwise add per layer per token. Greedy temp-0 byte-identical claim is credible.
  • Test xielu_factored_beta_x_matches_branch_local is adequate. It exercises negative (-4, -1, -0.25), zero (0.0), and positive (0.25, 1, 4, 12.5) inputs and asserts to_bits() equality between the branch-local reference (xielu_scalar) and the factored form. Algebraically guaranteed to pass.
  • Profiling hooks are default-off and near-zero cost. Production path (env vars unset) pays two OnceLock atomic loads per token; array_shape is short-circuited and never called, no eval(), no Instant::now() (Option::map on None). This mirrors the nemotron_h.rs/qwen3_moe.rs pattern and is actually cleaner: those read std::env::var on every token, this caches in OnceLock. Hooks are wired into the real forward loop (forward_timed), not dead code.
  • Conventions clean. No shared function modified (only apertus_xielu, a private model fn, was touched), so no // Used by: update is owed. No stray AsType. Apple Silicon precision rules respected. TransformerBlock::forward is kept as a public delegating wrapper, preserving the API surface; it stays reachable via pub mod models, so no dead-code warning.
  • Scope contained. Only apertus.rs, apertus_tests.rs, seed_oss.rs changed; production logits path is restructured but semantically identical when profiling is off.
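
The "no Instant::now() (Option::map on None)" point in the notes above can be sketched as follows. This is a generic illustration of the pattern, not the `forward_timed` implementation: the helper name `timed` and its signature are hypothetical, and the real hook would also need to force evaluation of the lazy graph before reading the clock, which this sketch omits.

```rust
use std::time::{Duration, Instant};

/// Runs `f`, timing it only when `enabled` is true.
/// When `enabled` is false, `Instant::now` is never called and the
/// `map` over `None` does nothing, so the disabled path adds no clock reads.
fn timed<T>(enabled: bool, f: impl FnOnce() -> T) -> (T, Option<Duration>) {
    // `bool::then` calls `Instant::now` only when `enabled` is true.
    let start = enabled.then(Instant::now);
    let out = f();
    // `Option::map` on `None` skips the closure entirely.
    let elapsed = start.map(|t| t.elapsed());
    (out, elapsed)
}
```

A caller would pass the cached profiling flag as `enabled` and accumulate the returned `Option<Duration>` per sub-block only when it is `Some`.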

Minor (LOW, not fixed)

  • BlockTiming + profile_blocks_enabled/profile_forward_enabled are duplicated verbatim across apertus.rs and seed_oss.rs. This is consistent with the existing per-model profiling precedent (nemotron_h/qwen3_moe inline their own variants and do not share), so it is left as-is.

Verification

  • All stated requirements implemented (characterization, cheap recovery, output-unchanged)
  • No placeholder/mock code remaining
  • Integrated into project code flow (forward_timed in the layer loop; FORWARD/BLOCKS reachable)
  • Project conventions followed
  • Existing modules reused where applicable
  • No unintended structural changes (public API stable)
  • Tests pass (new scalar test verified by inspection; full build/clippy/tests run by orchestrator)

@inureyes

Member Author

PR Finalization Complete

Summary

  • Tests: xielu_factored_beta_x_matches_branch_local covers the bit-exact equivalence of the factored xIELU form; existing tests cover config parsing, field defaults, softplus numerics, and the branch formula. No gaps found, nothing added.
  • Documentation: MLXCEL_PROFILE_BLOCKS and MLXCEL_PROFILE_FORWARD are already documented in docs/environment-variables.md with "where implemented" / "model-specific forward profiling where implemented" phrasing, which is inherently non-stale and requires no update when new models adopt the hooks. No changes needed.
  • Lint/Format: cargo fmt produced no diff on the three touched files. cargo clippy --features metal,accelerate -p mlxcel --lib --tests -- -D warnings passed clean.

Branch is up to date with remote. Ready for merge.

inureyes added the status:done label and removed the status:review label on Jun 22, 2026
inureyes merged commit 3d73b66 into main on Jun 22, 2026
5 checks passed
inureyes deleted the perf/issue-269-apertus-seedoss-dense-decode branch on June 22, 2026 at 03:39

Labels

  • area:inference - Generation, sampling, decoding (incl. speculative, DRY)
  • area:models - Model architectures, weights, loading, metadata
  • platform:macos - macOS (Apple Silicon) specific
  • priority:low - Low priority
  • status:done - Completed
  • type:performance - Performance improvements

Development

Successfully merging this pull request may close these issues.

perf(models): apertus and seed_oss dense decode at ~0.83x of mlx-lm

1 participant