Tune suffix decoding defaults for M5 Max Metal by nhwaani · Pull Request #276 · antirez/ds4

nhwaani · 2026-05-27T21:03:42Z

Context

This PR is related to the suffix-decoding work in #261.

Important note: the suffix-tree implementation itself is from the #261 branch. My contribution here is the M5 Max validation + default tuning that was merged into hexxyan/suffix-decoding in hexxyan#1.

So the preferred merge path can still be #261. This PR exists to make the M5 Max benchmark/tuning visible directly on antirez/ds4.

Summary

Suffix decoding is a model-free speculative decoder. It learns repeated token patterns from the prompt and previous output, then proposes likely next tokens from a suffix tree. The main model still verifies every proposed token, so correctness remains gated by the target model.
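
To make the idea concrete, here is a minimal Python sketch of suffix-based drafting followed by verification. It is not the suffix trie from #261: it scans the token history linearly instead of walking a tree, and `propose_draft`, `verify`, and `toy_target` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def propose_draft(history: Sequence[int], max_suffix: int, max_draft: int) -> List[int]:
    """Return tokens that followed the most recent earlier occurrence of the
    longest matching suffix of `history` (empty list if nothing matches)."""
    n = len(history)
    for k in range(min(max_suffix, n - 1), 0, -1):  # try the longest suffix first
        suffix = list(history[n - k:])
        for start in range(n - k - 1, -1, -1):  # most recent earlier match wins
            if list(history[start:start + k]) == suffix:
                return list(history[start + k:start + k + max_draft])
    return []


def verify(history: Sequence[int], draft: Sequence[int],
           target_next: Callable[[List[int]], int]) -> List[int]:
    """Keep the draft prefix the target model agrees with, then add the
    target's own next token, so every step commits at least one token."""
    ctx = list(history)
    committed: List[int] = []
    for tok in draft:
        if target_next(ctx) != tok:
            break
        committed.append(tok)
        ctx.append(tok)
    committed.append(target_next(ctx))
    return committed


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the real model: cycles 1, 2, 3, 4, 1, ..."""
    return ctx[-1] % 4 + 1


history = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
draft = propose_draft(history, max_suffix=4, max_draft=3)
print(draft)                                # [3, 4, 1]
print(verify(history, draft, toy_target))   # [3, 4, 1, 2]
```

The sketch calls the target once per token for readability. A real verifier would score the whole draft in one batched pass, and ds4's own accounting of committed tokens may differ from this generic scheme.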

On my M5 Max 128GB, the feature works, but the original default draft cap was too aggressive for the Metal verifier path:

old default: --suffix-spec-factor 1.0 --suffix-spec-offset 0

That often proposes long drafts. The verifier then pays to check the long draft, but often commits only a short prefix. In my benchmarks this caused a large slowdown.

This tuning changes the default to:

new default: --suffix-spec-factor 0.01 --suffix-spec-offset 2

In plain language: suffix decoding now tries short drafts first by default. Users can still increase the factor/offset for workloads with very high acceptance.
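
One way to read the two flags is as a cap on draft length that grows with the matched suffix length. The sketch below assumes the ArcticInference-style rule `cap = factor * match_len + offset`. The PR text does not spell out the formula, so `draft_cap` is a hypothetical helper, not ds4's actual code.

```python
def draft_cap(match_len: int, factor: float, offset: float) -> int:
    """Hypothetical draft cap (assumed rule): factor * match_len + offset."""
    return max(0, int(factor * match_len + offset))


for match_len in (2, 8, 15):
    old = draft_cap(match_len, factor=1.0, offset=0)   # old default
    new = draft_cap(match_len, factor=0.01, offset=2)  # new default
    print(f"match_len={match_len:2d}  old cap={old:2d}  new cap={new}")
```

Under this reading the old default lets the draft grow as long as the match, while the new default holds it at about 2 tokens for any realistic match length. That is consistent with the average draft lengths of 1.66 and 1.95 reported in the telemetry.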

Why this helps

The suffix-tree lookup is cheap. The expensive part is target-model verification.

Example debug timing from the same M5 Max setup:

suffix spec hit p=15 drafts=15 score=7.5000
mtp timing micro drafted=15 committed=2 verify=213 ms replay=50 ms total=264 ms

suffix spec hit p=8 drafts=8 score=4.0000
mtp timing micro drafted=8 committed=2 verify=94 ms replay=49 ms total=144 ms

suffix spec hit p=2 drafts=2 score=1.0000
mtp timing micro drafted=2 committed=2 verify=50 ms total=51 ms

So the safer default is not “draft as much as possible”. It is “draft a small amount unless the user opts into larger drafts”.
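
The cost difference is easier to see per committed token. This small calculation uses only the `drafted`, `committed`, and `total` values from the debug lines above; `ms_per_committed` is a helper name introduced here for illustration.

```python
# (drafted, committed, total_ms) copied from the three debug samples above
samples = [(15, 2, 264), (8, 2, 144), (2, 2, 51)]


def ms_per_committed(total_ms: float, committed: int) -> float:
    """Wall-clock cost of one speculative step divided by tokens it committed."""
    return total_ms / committed


for drafted, committed, total_ms in samples:
    cost = ms_per_committed(total_ms, committed)
    print(f"drafted={drafted:2d} committed={committed} -> {cost:.1f} ms per committed token")
# drafted=15 -> 132.0, drafted= 8 -> 72.0, drafted= 2 -> 25.5
```

All three steps commit the same two tokens, but the 15-token draft costs roughly five times as much per committed token as the 2-token draft.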

Benchmark setup

Hardware and software:

MacBook Pro M5 Max, 128GB
macOS 26.5
DeepSeek V4 Flash q2 GGUF
Metal backend
./ds4-bench --warm-weights

Branch tested:

hexxyan/suffix-decoding, after merging "Tune suffix decoding default draft cap for Metal" (hexxyan/ds4#1)
Head: 1df9f1a

Workloads:

Agent JSONL transcript — repeated tool calls, JSON keys, pytest output, and status summaries.
Code boilerplate — repeated Python functions and tests.

These are repetitive workloads where suffix decoding should have a chance to help.

Results: agent-style JSONL prompt

Command shape:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file /tmp/ds4_suffix_agent_jsonl.txt \
  --ctx-start 8192 \
  --ctx-max 8192 \
  --gen-tokens 512 \
  --warm-weights \
  [suffix flags] \
  --csv /tmp/out.csv

Average of 2 runs for baseline/new default; 1 run for old default.

Mode	gen tok/s	decode sec	vs baseline	Notes
Baseline, no suffix	29.47	16.93	1.00×	plain decode
Suffix, old default	14.74	34.60	0.50×	over-drafts; much slower
Suffix, new default	30.48	16.63	1.03×	short drafts; slightly faster

Visual summary:

Agent JSONL, 8192 ctx, 512 gen tokens

old suffix default   14.74 tok/s  ███████████████░░░░░░░░░░░░░░░  50% of baseline
baseline             29.47 tok/s  ██████████████████████████████  100%
new suffix default   30.48 tok/s  ███████████████████████████████ 103% of baseline

Results: code-boilerplate prompt

Command shape:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file /tmp/ds4_suffix_code_boilerplate.py \
  --ctx-start 2048 \
  --ctx-max 2048 \
  --gen-tokens 512 \
  --warm-weights \
  [suffix flags] \
  --csv /tmp/out.csv

Average of 2 runs for baseline/new default; 1 run for old default.

Mode	gen tok/s	decode sec	vs baseline	Notes
Baseline, no suffix	35.05	14.13	1.00×	plain decode
Suffix, old default	13.54	37.66	0.39×	over-drafts; much slower
Suffix, new default	34.55	14.61	0.99×	near baseline; avoids large slowdown

Visual summary:

Code boilerplate, 2048 ctx, 512 gen tokens

old suffix default   13.54 tok/s  ███████████░░░░░░░░░░░░░░░░░░░  39% of baseline
baseline             35.05 tok/s  ██████████████████████████████  100%
new suffix default   34.55 tok/s  █████████████████████████████░  99% of baseline

Telemetry

Agent JSONL, 8192 ctx, 512 generated tokens

Mode	suffix hits	accepted draft tokens	avg draft len	gen tok/s
old default	199	298	6.56	14.74
new default	201	296	1.66	30.48

Code boilerplate, 2048 ctx, 512 generated tokens

Mode	suffix hits	accepted draft tokens	avg draft len	gen tok/s
old default	207	295	9.27	13.54
new default	216	286	1.95	34.55

The new default keeps almost the same number of accepted draft tokens, but avoids expensive long verification attempts.
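
A rough acceptance rate can be derived from the telemetry tables above. The sketch assumes that "avg draft len" is averaged per suffix hit, so total drafted tokens are approximately `hits * avg_len`. That interpretation is an assumption, and `acceptance` is a hypothetical helper.

```python
# (label, suffix_hits, accepted_draft_tokens, avg_draft_len) from the tables above
rows = [
    ("agent JSONL, old default", 199, 298, 6.56),
    ("agent JSONL, new default", 201, 296, 1.66),
    ("code boilerplate, old default", 207, 295, 9.27),
    ("code boilerplate, new default", 216, 286, 1.95),
]


def acceptance(hits: int, accepted: int, avg_len: float) -> float:
    """Accepted draft tokens divided by estimated drafted tokens."""
    drafted = hits * avg_len  # assumption: avg draft len is per suffix hit
    return accepted / drafted


for label, hits, accepted, avg_len in rows:
    rate = acceptance(hits, accepted, avg_len)
    print(f"{label:30s} ~{hits * avg_len:7.1f} drafted, {rate:.0%} accepted")
```

Under that assumption the old default verifies about four to five times as many draft tokens for nearly the same number of accepted ones, which is where the extra verification time goes.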

What changed

Only defaults and docs changed:

ds4.c: engine fallback defaults
ds4_bench.c: bench defaults and help text
ds4_cli.c: help text
README.md: explain conservative default and when to increase it

No verifier logic changed.

Tests

make clean && make -j$(sysctl -n hw.ncpu) ds4-bench ds4_test suffix-tree-test
./tests/suffix_tree_test
./ds4_test --server
make test

All passed on M5 Max.

Caveat

This is tuning from one Metal machine and two repetitive workloads. It does not claim to be universally optimal. The goal is to make the opt-in default safe: avoid turning on a large slowdown by default, while keeping the knobs exposed for larger-draft experiments.

Implements an opt-in, model-free suffix trie (arXiv:2411.04975) for speculative decoding in ds4, with probability estimation and adaptive draft caps ported from Snowflake ArcticInference.

New files:

- ds4_suffix_tree.h/c: bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, memory budget enforcement, probability estimation, and cached best-child index (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning

Modified files:

- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob); suffix stats telemetry with draft_score_total
- ds4.c: session lifecycle integration, incremental learning via suffix_learned_len, two-phase draft selection (match_depth then query), score-based draft gating, independent spec_logits allocation
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: suffix telemetry CSV columns, 3 new config options
- Makefile: suffix-tree-test target
- README.md: usage documentation
- CONTRIBUTING.md: telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Disabled by default (--suffix-decoding flag required). No external dependencies, no training, no GPU kernels. Falls back gracefully to MTP or single-token decode when no suffix match is found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

hexxyan · 2026-05-27T21:35:03Z

Hi @nhwaani — thanks again for the excellent benchmark work on M5 Max. Your parameter tuning (factor 0.01, offset 2) is a valuable contribution and I've already incorporated it into PR #261 where the suffix-tree implementation is under review.

Your commit is preserved with full authorship credit — when PR #261 lands, you'll appear in the Contributors list.

Since the core implementation in this PR (commit b097d85) is the same as #261, would you consider closing this PR to keep the review focused in one place? Your tuning is already included there. Thanks!

nhwaani · 2026-05-27T21:37:42Z

@hexxyan Happy to add some value with these clankers.
Would love to hear from @ivanfioravanti and @antirez as well, if they are open to considering it.

hexxyan and others added 2 commits May 28, 2026 01:34

Tune suffix decoding default draft cap for Metal

4faf050

nhwaani force-pushed the tune-suffix-defaults-m5 branch from 6bcd4c3 to 4faf050 on May 27, 2026 21:07

nhwaani changed the title from "Tune suffix defaults m5" to "Tune suffix decoding defaults for M5 Max Metal" on May 27, 2026

nhwaani wants to merge 2 commits into antirez:main from nhwaani:tune-suffix-defaults-m5
