Tune suffix decoding defaults for M5 Max Metal #276

Open
nhwaani wants to merge 2 commits into antirez:main from nhwaani:tune-suffix-defaults-m5

@nhwaani nhwaani commented May 27, 2026

Context

This PR is related to the suffix-decoding work in #261.

Important note: the suffix-tree implementation itself is from the #261 branch. My contribution here is the M5 Max validation + default tuning that was merged into hexxyan/suffix-decoding in hexxyan#1.

So the preferred merge path can still be #261. This PR exists to make the M5 Max benchmark/tuning visible directly on antirez/ds4.

Summary

Suffix decoding is a model-free speculative decoder. It learns repeated token patterns from the prompt and previous output, then proposes likely next tokens from a suffix tree. The main model still verifies every proposed token, so correctness remains gated by the target model.
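To make the propose-then-verify loop concrete, here is a minimal toy sketch. It is not the ds4 implementation: it swaps the suffix tree for a plain linear scan over the token history, and `propose_draft`, `verify_draft`, and `cyclic_target` are hypothetical names invented for this illustration. The point it shows is that the committed tokens always come from the target model, so a bad draft costs time but never changes the output.

```python
def propose_draft(tokens, max_match=8, max_draft=4):
    """Find the longest suffix of `tokens` that also occurred earlier and
    return (match_len, tokens that followed that earlier occurrence).
    A real suffix tree answers this without rescanning the history."""
    n = len(tokens)
    for match_len in range(min(max_match, n - 1), 0, -1):
        suffix = tokens[n - match_len:]
        # Scan earlier positions, most recent first; the occurrence must
        # leave at least one following token to propose.
        for start in range(n - match_len - 1, -1, -1):
            if tokens[start:start + match_len] == suffix:
                end = start + match_len
                return match_len, tokens[end:end + max_draft]
    return 0, []


def verify_draft(context, draft, target_next):
    """Commit the target model's own token at each step and stop at the
    first position where the draft disagrees with it."""
    committed = []
    for tok in draft:
        expected = target_next(context + committed)
        committed.append(expected)
        if expected != tok:
            break
    return committed


def cyclic_target(context):
    """Stand-in for the target model: always continues the cycle a, b, c."""
    return "abc"[len(context) % 3]


history = list("abcabcab")
match_len, draft = propose_draft(history)
print(match_len, draft, verify_draft(history, draft, cyclic_target))
```

In this toy version a fully accepted draft commits exactly the drafted tokens, and a rejected one commits the accepted prefix plus the target's correction.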

On my M5 Max 128GB, the feature works, but the original default draft cap was too aggressive for the Metal verifier path:

old default: --suffix-spec-factor 1.0 --suffix-spec-offset 0

That often proposes long drafts. The verifier then pays to check the long draft, but often commits only a short prefix. In my benchmarks this caused a large slowdown.

This tuning changes the default to:

new default: --suffix-spec-factor 0.01 --suffix-spec-offset 2

In plain language: suffix decoding now tries short drafts first by default. Users can still increase the factor/offset for workloads with very high acceptance.
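A rough way to picture the two knobs, assuming the cap follows the ArcticInference-style rule `cap = floor(factor * match_len + offset)`. That rule is an assumption based on the commit message's mention of draft caps ported from ArcticInference; this PR does not spell out the formula, and `draft_cap` and `match_len` are hypothetical names used only for illustration.

```python
import math


def draft_cap(match_len, factor, offset):
    """Assumed cap on draft length: grows linearly with the matched suffix
    length (factor) on top of a fixed allowance (offset)."""
    return max(0, math.floor(factor * match_len + offset))


for match_len in (2, 8, 15):
    old = draft_cap(match_len, factor=1.0, offset=0)
    new = draft_cap(match_len, factor=0.01, offset=2)
    print(f"match_len={match_len:2d}  old cap={old:2d}  new cap={new}")
```

Under this assumption the old default lets the cap grow with the match length, while the new default keeps it at about 2 tokens for any realistic match length, which is consistent with the average draft lengths of 1.66 and 1.95 reported in the telemetry.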

Why this helps

The suffix-tree lookup is cheap. The expensive part is target-model verification.

Example debug timing from the same M5 Max setup:

suffix spec hit p=15 drafts=15 score=7.5000
mtp timing micro drafted=15 committed=2 verify=213 ms replay=50 ms total=264 ms

suffix spec hit p=8 drafts=8 score=4.0000
mtp timing micro drafted=8 committed=2 verify=94 ms replay=49 ms total=144 ms

suffix spec hit p=2 drafts=2 score=1.0000
mtp timing micro drafted=2 committed=2 verify=50 ms total=51 ms

So the safer default is not “draft as much as possible”. It is “draft a small amount unless the user opts into larger drafts”.
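The three timing lines above can be reduced to a single cost figure, milliseconds per committed token. The sketch below parses only the fields visible in those log lines; the regular expression and the helper name `ms_per_committed_token` are illustrative, not part of ds4.

```python
import re

LOG = """\
mtp timing micro drafted=15 committed=2 verify=213 ms replay=50 ms total=264 ms
mtp timing micro drafted=8 committed=2 verify=94 ms replay=49 ms total=144 ms
mtp timing micro drafted=2 committed=2 verify=50 ms total=51 ms
"""

# The replay field is optional: the fully accepted draft has no replay step.
PATTERN = re.compile(
    r"drafted=(\d+) committed=(\d+) verify=(\d+) ms"
    r"(?: replay=(\d+) ms)? total=(\d+) ms"
)


def ms_per_committed_token(log):
    """Return (drafted, committed, total_ms / committed) for each timing line."""
    rows = []
    for m in PATTERN.finditer(log):
        drafted, committed, total = int(m.group(1)), int(m.group(2)), int(m.group(5))
        rows.append((drafted, committed, total / committed))
    return rows


for drafted, committed, cost in ms_per_committed_token(LOG):
    print(f"drafted={drafted:2d} committed={committed} -> {cost:.1f} ms per committed token")
```

This gives 132.0, 72.0, and 25.5 ms per committed token. For rough comparison only, since the log lines may come from a different prompt, plain decode at the reported 29.47 tok/s is about 34 ms per token, so of these three steps only the 2-token draft beats it.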

Benchmark setup

Hardware and software:

  • MacBook Pro M5 Max, 128GB
  • macOS 26.5
  • DeepSeek V4 Flash q2 GGUF
  • Metal backend
  • ./ds4-bench --warm-weights

Branch tested:

Workloads:

  1. Agent JSONL transcript — repeated tool calls, JSON keys, pytest output, and status summaries.
  2. Code boilerplate — repeated Python functions and tests.

These are repetitive workloads where suffix decoding should have a chance to help.

Results: agent-style JSONL prompt

Command shape:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file /tmp/ds4_suffix_agent_jsonl.txt \
  --ctx-start 8192 \
  --ctx-max 8192 \
  --gen-tokens 512 \
  --warm-weights \
  [suffix flags] \
  --csv /tmp/out.csv

Average of 2 runs for baseline/new default; 1 run for old default.

| Mode | gen tok/s | decode sec | vs baseline | Notes |
|---|---|---|---|---|
| Baseline, no suffix | 29.47 | 16.93 | 1.00× | plain decode |
| Suffix, old default | 14.74 | 34.60 | 0.50× | over-drafts; much slower |
| Suffix, new default | 30.48 | 16.63 | 1.03× | short drafts; slightly faster |

Visual summary:

Agent JSONL, 8192 ctx, 512 gen tokens

old suffix default   14.74 tok/s  ███████████████░░░░░░░░░░░░░░░  50% of baseline
baseline             29.47 tok/s  ██████████████████████████████  100%
new suffix default   30.48 tok/s  ███████████████████████████████ 103% of baseline

Results: code-boilerplate prompt

Command shape:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file /tmp/ds4_suffix_code_boilerplate.py \
  --ctx-start 2048 \
  --ctx-max 2048 \
  --gen-tokens 512 \
  --warm-weights \
  [suffix flags] \
  --csv /tmp/out.csv

Average of 2 runs for baseline/new default; 1 run for old default.

| Mode | gen tok/s | decode sec | vs baseline | Notes |
|---|---|---|---|---|
| Baseline, no suffix | 35.05 | 14.13 | 1.00× | plain decode |
| Suffix, old default | 13.54 | 37.66 | 0.39× | over-drafts; much slower |
| Suffix, new default | 34.55 | 14.61 | 0.99× | near baseline; avoids large slowdown |

Visual summary:

Code boilerplate, 2048 ctx, 512 gen tokens

old suffix default   13.54 tok/s  ███████████░░░░░░░░░░░░░░░░░░░  39% of baseline
baseline             35.05 tok/s  ██████████████████████████████  100%
new suffix default   34.55 tok/s  █████████████████████████████░  99% of baseline

Telemetry

Agent JSONL, 8192 ctx, 512 generated tokens

| Mode | suffix hits | accepted draft tokens | avg draft len | gen tok/s |
|---|---|---|---|---|
| old default | 199 | 298 | 6.56 | 14.74 |
| new default | 201 | 296 | 1.66 | 30.48 |

Code boilerplate, 2048 ctx, 512 generated tokens

| Mode | suffix hits | accepted draft tokens | avg draft len | gen tok/s |
|---|---|---|---|---|
| old default | 207 | 295 | 9.27 | 13.54 |
| new default | 216 | 286 | 1.95 | 34.55 |

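The new default keeps almost the same number of accepted draft tokens (296 vs. 298 on agent JSONL, 286 vs. 295 on code boilerplate) while cutting the average draft length from 6.56 to 1.66 and from 9.27 to 1.95, which avoids the expensive long verification attempts.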
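The telemetry tables above also allow a rough acceptance-rate estimate. The sketch assumes that "avg draft len" is averaged per suffix hit, so that drafted tokens ≈ hits × avg draft len; the PR does not state this, so treat the percentages as approximate. The helper name `acceptance_rate` is illustrative.

```python
# (workload, mode) -> (suffix hits, accepted draft tokens, avg draft len)
TELEMETRY = {
    ("agent JSONL", "old default"): (199, 298, 6.56),
    ("agent JSONL", "new default"): (201, 296, 1.66),
    ("code boilerplate", "old default"): (207, 295, 9.27),
    ("code boilerplate", "new default"): (216, 286, 1.95),
}


def acceptance_rate(hits, accepted, avg_draft_len):
    """Accepted draft tokens divided by the estimated number of drafted tokens.
    Assumes drafted ~= hits * avg_draft_len (not confirmed by the PR)."""
    drafted = hits * avg_draft_len
    return accepted / drafted


for (workload, mode), (hits, accepted, avg_len) in TELEMETRY.items():
    rate = acceptance_rate(hits, accepted, avg_len)
    print(f"{workload:17s} {mode:12s} ~{hits * avg_len:7.0f} drafted, {rate:5.1%} accepted")
```

Under that assumption the old default verifies roughly 1,300 to 1,900 drafted tokens per run and accepts about 15% to 23% of them, while the new default verifies roughly 330 to 420 and accepts about 68% to 89%.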

What changed

Only defaults and docs changed:

  • ds4.c: engine fallback defaults
  • ds4_bench.c: bench defaults and help text
  • ds4_cli.c: help text
  • README.md: explain conservative default and when to increase it

No verifier logic changed.

Tests

make clean && make -j$(sysctl -n hw.ncpu) ds4-bench ds4_test suffix-tree-test
./tests/suffix_tree_test
./ds4_test --server
make test

All passed on M5 Max.

Caveat

This tuning comes from one Metal machine and two repetitive workloads. It does not claim to be universally optimal. The goal is to make the defaults safe once a user opts in: enabling suffix decoding should not cause a large slowdown out of the box, while the knobs stay exposed for larger-draft experiments.

hexxyan and others added 2 commits May 28, 2026 01:34
Implements an opt-in, model-free suffix trie (arXiv:2411.04975) for
speculative decoding in ds4, with probability estimation and adaptive
draft caps ported from Snowflake ArcticInference.

New files:
- ds4_suffix_tree.h/c: bounded CPU-resident suffix trie with sorted-array
  children, frequency-aging pruning, memory budget enforcement, probability
  estimation, and cached best-child index (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching,
  probability estimation, confidence filtering, best-child caching, and pruning

Modified files:
- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth,
  suffix_memory_budget, suffix_spec_factor, suffix_spec_offset,
  suffix_min_prob); suffix stats telemetry with draft_score_total
- ds4.c: session lifecycle integration, incremental learning via
  suffix_learned_len, two-phase draft selection (match_depth then query),
  score-based draft gating, independent spec_logits allocation
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: suffix telemetry CSV columns, 3 new config options
- Makefile: suffix-tree-test target
- README.md: usage documentation
- CONTRIBUTING.md: telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Disabled by default (--suffix-decoding flag required). No external
dependencies, no training, no GPU kernels. Falls back gracefully to
MTP or single-token decode when no suffix match is found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nhwaani nhwaani force-pushed the tune-suffix-defaults-m5 branch from 6bcd4c3 to 4faf050 Compare May 27, 2026 21:07
hexxyan (Contributor) commented May 27, 2026

Hi @nhwaani — thanks again for the excellent benchmark work on M5 Max. Your parameter tuning (factor 0.01, offset 2) is a valuable contribution and I've already incorporated it into PR #261 where the suffix-tree implementation is under review.

Your commit is preserved with full authorship credit — when PR #261 lands, you'll appear in the Contributors list.

Since the core implementation in this PR (commit b097d85) is the same as #261, would you consider closing this PR to keep the review focused in one place? Your tuning is already included there. Thanks!

nhwaani (Author) commented May 27, 2026

@hexxyan Happy to add some value with these clankers.
Would love to hear from @ivanfioravanti and @antirez as well, if they would consider it.

@nhwaani nhwaani changed the title from "Tune suffix defaults m5" to "Tune suffix decoding defaults for M5 Max Metal" May 27, 2026