Add suffix-tree speculative decoding for repetitive/agentic generation patterns by hexxyan · Pull Request #261 · antirez/ds4

hexxyan · 2026-05-26T19:37:46Z

Thanks @antirez for this wonderful project.

Summary

This PR adds an opt-in, model-free suffix-tree draft source for speculative decoding, implementing the SuffixDecoding approach (arXiv:2411.04975) with probability estimation features ported from Snowflake ArcticInference.

The suffix trie learns repetitive token patterns from prompt and prior output, then proposes draft tokens at zero model cost. The target-model verifier accepts or rejects each draft, so output correctness is always guaranteed — the trie can only offer speedups, never change results.
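Why the verifier makes drafts safe can be sketched for the greedy (argmax) case. This is a minimal illustration, not ds4's verifier: `accepted_prefix_len` and its arguments are hypothetical names, and sampling at non-zero temperature would need a different acceptance rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not ds4's verifier: count how many draft tokens the
 * target model would have produced itself under greedy decoding.
 * target[i] is the target model's argmax token at draft position i. */
static size_t accepted_prefix_len(const int *draft, const int *target, size_t n)
{
    assert(draft != NULL && target != NULL);
    size_t k = 0;
    while (k < n && draft[k] == target[k])
        k++;
    /* Tokens from index k onward are discarded; the target's own token
     * at position k is emitted instead, so output never depends on the draft. */
    return k;
}
```

A rejected draft costs verification time but cannot change the emitted tokens, which is why a poor draft source shows up only as a slowdown.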

When this helps

The suffix trie is most effective when generated text contains repetitive or predictable subsequences:

Agentic / tool-use loops: repeated JSON schemas, function call templates, and response patterns are learned from the first few rounds and then drafted at zero model cost
Code generation: repeated imports, boilerplate, and structural tokens (brackets, indentation) have high suffix-tree hit rates
Multi-turn conversation: system prompts, formatting tokens, and common phrasings are learned once and reused across turns
Structured output: any fixed-format output (XML, JSON, CSV headers) that recurs within the context window

The SuffixDecoding paper (arXiv:2411.04975) reports 1.3–2.5× speedup on agentic benchmarks with their reference implementation. ds4's implementation uses the same core algorithm (suffix trie + frequency-based continuation) with a different internal data structure (sorted arrays vs hash maps), so actual speedup depends on the workload's repetition patterns and the model's baseline token latency.

Key properties

Disabled by default (--suffix-decoding flag required) — zero impact on existing behavior
No external dependencies: no draft model, no training, no GPU kernels
Reuses existing target-model verification path — every proposed token is verified against the target model's logits
Falls back gracefully to MTP or single-token decode when no suffix match is found
Works alone or combined with MTP: suffix matches are tried first; MTP remains the fallback. When both are enabled, either source can provide drafts.

What's included

New files:

ds4_suffix_tree.h/c — Bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, and best-effort memory budget (~500 lines)
tests/suffix_tree_test.c — 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning
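For readers unfamiliar with sorted-array children, a lookup can be sketched as a binary search over one node's child list. The `child_entry` layout and `find_child` name below are assumptions for illustration; the actual structures in ds4_suffix_tree.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical child record; the real layout may differ. */
typedef struct {
    int32_t  token; /* token id labelling the edge to this child */
    uint32_t freq;  /* how often this continuation was observed */
    uint32_t node;  /* index of the child node in a node pool */
} child_entry;

/* Binary search over children sorted by ascending token id.
 * Returns the index of the match, or -1 if the token is absent. */
static ptrdiff_t find_child(const child_entry *kids, size_t n, int32_t token)
{
    assert(kids != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (kids[mid].token < token)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && kids[lo].token == token) ? (ptrdiff_t)lo : -1;
}
```

Compared with a hash map per node, a sorted array keeps children contiguous and needs no per-node table, at the cost of O(log n) lookups and shifting entries on insert.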

Modified files:

ds4.h — 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob)
ds4.c — Session lifecycle integration, incremental learning, two-phase draft selection (match → query), score-based draft gating, independent spec_logits allocation for suffix-only batch verification
ds4_cli.c — 6 new CLI flags
ds4_bench.c — Suffix telemetry CSV columns, new config options (note: CSV schema expanded from 6 columns to include context memory breakdown, MTP stats, and suffix telemetry — downstream CSV parsers will need updating)
Makefile — suffix-tree-test target
README.md — Usage documentation
CONTRIBUTING.md — Telemetry column guidance
speed-bench/README.md — MTP and suffix bench sweep examples

Architecture

The suffix trie learns repetitive token patterns from prompt, checkpoint, and accepted generation tokens. During speculative decode:

ds4_suffix_tree_match_depth() finds the longest matching suffix in the trie
Adaptive draft cap is computed: cap = match_len × factor + offset
ds4_suffix_tree_query() follows the highest-frequency continuation path with probability estimation (prob *= child_freq / parent_freq)
Drafts below min_prob confidence are filtered; score is used for quality gating
All proposed tokens are verified by the existing target-model verifier
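The cap and confidence steps above can be sketched as follows. The formulas (cap = match_len × factor + offset, prob *= child_freq / parent_freq) are from this description; truncating the cap to an integer, clamping it to `hard_max`, and the function names are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Draft cap from the matched suffix length. Truncation to an integer and
 * clamping to [0, hard_max] are assumptions of this sketch. */
static int draft_cap(int match_len, float factor, float offset, int hard_max)
{
    assert(match_len >= 0 && hard_max >= 0);
    float cap = (float)match_len * factor + offset;
    if (cap < 0.0f) cap = 0.0f;
    if (cap > (float)hard_max) cap = (float)hard_max;
    return (int)cap;
}

/* Walk an already-selected best-child path. child_freq[i] / parent_freq[i]
 * is the observed share of the i-th continuation; the running product is
 * the path probability. Returns how many tokens pass the min_prob filter. */
static int confident_draft_len(const unsigned *child_freq,
                               const unsigned *parent_freq,
                               int path_len, int cap, float min_prob)
{
    assert(child_freq != NULL && parent_freq != NULL);
    float prob = 1.0f;
    int n = 0;
    while (n < path_len && n < cap) {
        if (parent_freq[n] == 0)
            break; /* no observations: nothing to estimate from */
        prob *= (float)child_freq[n] / (float)parent_freq[n];
        if (prob < min_prob)
            break; /* confidence dropped below the threshold */
        n++;
    }
    return n;
}
```

Under these assumptions an 8-token match gives a cap of 8 with factor 1.0 and offset 0, but a cap of 2 with factor 0.01 and offset 2, so a small factor makes the offset the dominant term.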

Incremental learning: the tree uses suffix_learned_len to track what's already been inserted, appending only new tokens per decode step instead of re-inserting the entire checkpoint. This avoids frequency inflation for older patterns and keeps per-step cost O(max_depth).
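A minimal sketch of that bookkeeping, assuming the token history is a flat array that only grows between calls. Only the name `suffix_learned_len` comes from the PR; the `learner` struct and the insertion stub are hypothetical stand-ins for the real trie.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical learner state; only suffix_learned_len is named in the PR. */
typedef struct {
    size_t suffix_learned_len; /* tokens already inserted into the trie */
    size_t insert_calls;       /* stand-in counter for real trie insertions */
} learner;

/* Stand-in for inserting the suffixes that end at tokens[pos]; a real trie
 * would walk at most max_depth tokens back from pos. */
static void insert_suffixes_ending_at(learner *l, const int *tokens, size_t pos)
{
    (void)tokens;
    (void)pos;
    l->insert_calls++;
}

/* Insert only the tokens appended since the previous call. */
static void learn_incremental(learner *l, const int *tokens, size_t n_tokens)
{
    assert(l != NULL && (tokens != NULL || n_tokens == 0));
    for (size_t pos = l->suffix_learned_len; pos < n_tokens; pos++)
        insert_suffixes_ending_at(l, tokens, pos);
    l->suffix_learned_len = n_tokens;
}
```

Calling it again with an unchanged history inserts nothing, which is what keeps frequencies of older patterns from being counted twice.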

Pruning: frequency aging (decrement all by 1, clamp at 0) followed by zero-frequency leaf removal, with a configurable memory budget (default 64 MB). Pruning runs up to 16 rounds per trigger; under heavy insert load the trie may temporarily exceed the budget before converging back.
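The aging scheme can be sketched on a flat node array. This is an illustration under simplifying assumptions: the budget is expressed as a node count rather than bytes, removed nodes are only marked dead rather than compacted, and the `node` layout is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_PRUNE_ROUNDS = 16 }; /* round limit quoted in the PR text */

typedef struct {
    uint32_t freq;       /* observation count, aged during pruning */
    uint32_t n_children; /* number of live children */
    int32_t  parent;     /* parent index, -1 for the root */
    int      alive;      /* 0 once the node has been pruned */
} node;

/* One round: age every live node by 1 (clamped at 0), then unlink live
 * leaves whose frequency reached zero. Returns the number removed. */
static size_t prune_round(node *nodes, size_t n)
{
    size_t removed = 0;
    for (size_t i = 0; i < n; i++)
        if (nodes[i].alive && nodes[i].freq > 0)
            nodes[i].freq--;
    for (size_t i = 0; i < n; i++) {
        node *nd = &nodes[i];
        if (!nd->alive || nd->parent < 0 || nd->freq != 0 || nd->n_children != 0)
            continue; /* keep the root, hot nodes, and interior nodes */
        nd->alive = 0;
        nodes[nd->parent].n_children--;
        removed++;
    }
    return removed;
}

/* Repeat rounds until the live count fits the budget or the limit is hit. */
static size_t prune_to_budget(node *nodes, size_t n, size_t live, size_t budget)
{
    assert(nodes != NULL);
    for (int r = 0; r < MAX_PRUNE_ROUNDS && live > budget; r++)
        live -= prune_round(nodes, n);
    return live;
}
```

Because each round removes only current leaves, a deep cold branch takes several rounds to disappear, which is one way a bounded round count can leave the trie over budget for a while.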

Verification

make suffix-tree-test NATIVE_CPU_FLAG=   # 8 unit tests, all pass
make cpu NATIVE_CPU_FLAG=                 # all CPU binaries compile
make ds4-bench NATIVE_CPU_FLAG=           # Metal binary compiles
./ds4-eval --self-test-extractors         # extractor self-tests pass

Important caveats

No e2e benchmark yet: I do not have access to 80GB+ hardware for full DeepSeek V4 end-to-end testing. The implementation is correct by construction (target verifier gates all output), but actual throughput numbers need real hardware.
Default parameters = no behavior change: With the original default settings (--suffix-spec-factor 1.0, --suffix-spec-offset 0.0, --suffix-min-prob 0.0), the implementation reproduces the original alpha=1 draft cap. The new parameters are configurable infrastructure for tuning. The defaults were later changed to --suffix-spec-factor 0.01 --suffix-spec-offset 2 (see the follow-up fixes).
Suffix-only batch verification: spec_logits is now allocated independently when --suffix-decoding is enabled (without requiring MTP). The structural prerequisite is in place but needs real-model validation.

Request for maintainers/community

If you have access to a machine with 80GB+ GPU running DeepSeek V4, help with the following would be appreciated:

End-to-end speculative decode test with --suffix-decoding
Throughput measurement (tokens/s) comparing: baseline → suffix-only → suffix+MTP
Draft acceptance rate under agentic/repetitive workloads
Validation of suffix-only batch verification correctness

AI assistance disclosure

This code was written with assistance from GPT-5.5-xhigh and GLM-5.1 models.

References

SuffixDecoding paper: https://arxiv.org/abs/2411.04975
SuffixDecoding project page: https://suffix-decoding.github.io/
ArcticInference (Snowflake): https://github.com/snowflakedb/ArcticInference
vLLM suffix decoding: https://docs.vllm.ai/en/latest/features/speculative_decoding/suffix/

Follow-up fixes (commits after initial PR)

Build/test fixes

Conservative default tuning (merged from @nhwaani): --suffix-spec-factor 0.01 --suffix-spec-offset 2 avoids over-drafting on Metal, keeping suffix decoding near baseline on M5 Max
Pre-allocated verifier scratch: eliminated per-step malloc/free in the speculative decode hot path

Bug fixes

--suffix-spec-offset 0 now works: sentinel defaults (-1.0f) + correct engine-side checks so explicit zero is respected
suffix_spec_factor zero-init safety: > 0.0f check so API callers with ds4_engine_options opt = {0} get engine defaults

Telemetry (speculative decode)

Added comprehensive telemetry printed on exit when --suffix-decoding is active:

spec_steps, first_draft_hit/miss — how often speculative decode runs and how often the first draft matches
accept_rate — committed draft tokens / verified draft tokens
N=X:full=Y:partial=Z histogram — distribution of full vs partial accepts by draft depth
verify_ms, replay_ms, draft_query_ms — timing breakdown across all verifier paths (decode2_exact, generic batch verifier, sequential fallback)

Telemetry covers all verifier paths, not just one branch. In ds4-bench, telemetry resets per frontier so each CSV row is an independent snapshot.

What we learned from M5 Max testing

Conservative defaults keep suffix decoding at ~1.02x (near baseline) on agent-style JSONL workloads
The main bottleneck is verifier cost: verify + replay time dominates draft query time
Dynamic adaptive cap (adjusting draft depth based on recent accept rate) is the next promising optimization direction

Implements an opt-in, model-free suffix trie (arXiv:2411.04975) for speculative decoding in ds4, with probability estimation and adaptive draft caps ported from Snowflake ArcticInference.

New files:
- ds4_suffix_tree.h/c: bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, memory budget enforcement, probability estimation, and cached best-child index (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning

Modified files:
- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob); suffix stats telemetry with draft_score_total
- ds4.c: session lifecycle integration, incremental learning via suffix_learned_len, two-phase draft selection (match_depth then query), score-based draft gating, independent spec_logits allocation
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: suffix telemetry CSV columns, 3 new config options
- Makefile: suffix-tree-test target
- README.md: usage documentation
- CONTRIBUTING.md: telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Disabled by default (--suffix-decoding flag required). No external dependencies, no training, no GPU kernels. Falls back gracefully to MTP or single-token decode when no suffix match is found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nhwaani · 2026-05-27T21:02:53Z

I tested this PR on an M5 Max 128GB with the DeepSeek V4 Flash q2 GGUF.

Good news: the feature builds and works on real hardware.

Passed:

./tests/suffix_tree_test
./ds4_test --server
make test

Main finding: the current default draft cap is too aggressive on my Metal runs. It often proposes long drafts, but the verifier only commits a short prefix. That makes default --suffix-decoding slower than baseline on the repetitive workloads I tested.

Example, agent-style JSONL prompt, 8192 ctx, 512 gen tokens:

Mode	gen tok/s	vs baseline
baseline	29.10	1.00x
suffix current default	16.69	0.57x
suffix conservative cap	32.17	1.11x

Repeated baseline vs conservative cap:

Run	baseline	conservative	gain
1	29.14	31.26	+7.3%
2	29.24	30.42	+4.0%
3	28.19	30.12	+6.8%

The useful tuning was:

--suffix-spec-factor 0.01 --suffix-spec-offset 2

I opened a small follow-up PR against this branch with the benchmark details and default tuning:

hexxyan#1

nhwaani · 2026-05-27T21:31:21Z

Small follow-up: I rebased the tuning branch onto hexxyan/suffix-decoding so the follow-up PR diff is now only the intended defaults/docs change (+12 / -9), and reran the benchmarks on that exact final branch.

Follow-up PR: hexxyan#1

Final M5 Max 128GB numbers:

Agent JSONL, 8192 ctx, 512 gen tokens

Mode	gen tok/s	vs baseline
baseline	30.82	1.00x
old suffix default	14.54	0.47x
new suffix default	31.37	1.02x

Code boilerplate, 2048 ctx, 512 gen tokens

Mode	gen tok/s	vs baseline
baseline	34.20	1.00x
old suffix default	13.02	0.38x
new suffix default	33.13	0.97x

Tests passed after rebase:

./tests/suffix_tree_test
./ds4_test --server
make test

So the main point is: the current default over-drafts and can cause a big slowdown on Metal; the conservative default avoids that and keeps suffix decoding near baseline / slightly faster on the agent-style workload.

hexxyan · 2026-05-27T21:34:53Z

Thanks @nhwaani for the thorough M5 Max benchmarking — the conservative defaults (factor 0.01, offset 2) are a clear improvement over the original aggressive cap. I've merged your tuning PR into this branch so the opt-in experience is safe out of the box.

Your commit is preserved with you as author, so when this PR lands you'll appear in the project's Contributors list as well. Appreciate the real-hardware testing!

nhwaani · 2026-05-27T21:41:14Z

Thanks, let me know if you see any concerns @hexxyan

STRML · 2026-05-28T01:42:35Z

Given the added complexity and the marginal performance benefits in these benchmarks, what's the reason to merge this? Is there a more representative benchmark for agentic coding?

- Fix sentinel defaults: use >= 0.0f check so --suffix-spec-offset 0 works
- Add speculative decode telemetry (accept rate, timing, hit/miss counts)
- Pre-allocate verifier scratch buffers in session (eliminate hot-path malloc)
- Enable decode2_exact for suffix tree N=2 drafts (faster verify path)
- Print telemetry summary on exit when suffix decoding is active

hexxyan · 2026-05-28T06:00:27Z

Suffix decoding fixes pushed

Just pushed a batch of fixes and improvements:

Fixed --suffix-spec-offset 0 — previously ignored due to wrong default check
Added speculative decode telemetry — prints accept rate, timing breakdown, hit/miss stats on exit
Pre-allocated verifier scratch — eliminated per-step malloc/free in the hot verification path
Enabled decode2_exact for suffix tree N=2 drafts — faster verification when suffix tree proposes exactly 2 draft tokens

@nhwaani — would you mind re-testing with the updated branch? The telemetry output will now print at the end, which should make it easier to see what's happening with accept rates and verify timing. A quick run like:

./ds4 -p "Your test prompt" --suffix-decoding --temp 0 -n 100

would be very helpful. Thanks!

nhwaani · 2026-05-28T10:58:59Z

Retested latest hexxyan/suffix-decoding on M5 Max 128GB.

Branch tested:

3d553db + 1800c4d

1800c4d is just a one-line telemetry fix I opened here:

hexxyan#2

It fixes impossible replay timing like:

replay=815247931.0ms

After the fix, telemetry is sane:

verify=8316.1ms replay=3.1ms

Tests passed:

make -j$(sysctl -n hw.ncpu) suffix-tree-test ds4_test ds4-bench
./tests/suffix_tree_test
./ds4_test --server

Model used: q2-q4-imatrix GGUF.

Agent JSONL, 8192 ctx, 512 gen

Mode	gen tok/s	vs baseline
baseline	29.72	1.00x
suffix new default	27.11	0.91x
suffix old aggressive	15.72	0.53x

Code boilerplate, 2048 ctx, 512 gen

Mode	gen tok/s	vs baseline
baseline	35.31	1.00x
suffix new default	31.57	0.89x
suffix old aggressive	14.69	0.42x

My read:

The conservative default is much safer than the old aggressive default.
It avoids the large over-drafting slowdown.
But on this q2-q4 M5 Metal run, I do not see a net speedup over baseline yet.

So I would not present this as a proven M5 Metal performance win today.

The merge argument, if any, is:

opt-in only
correctness-preserving verifier path
unit tests included
useful telemetry now available
gives us a tunable model-free draft source for future agent/tool-call workloads

I agree with @STRML that a more representative multi-turn coding-agent benchmark would be useful before making strong speedup claims.

- Revert decode2_exact for suffix tree (was causing slowdown, it's a correctness path not a fast path; suffix N=2 should use batch verify)
- Fix accept_rate formula: total_verified now counts draft_n (not draft_n-1) so the ratio correctly reflects draft token acceptance probability
- Fix factor sentinel: use > 0.0f so zero-initialized API opts get engine default (0.01) instead of literal 0.0
- Apply PR #2: verify_done timestamp captured unconditionally so replay telemetry doesn't print garbage when DS4_MTP_TIMING is off

hexxyan · 2026-05-28T13:17:48Z

Regression fixes pushed

Good catch on all four issues. Just pushed a fix commit (10229e1):

Reverted decode2_exact for suffix tree — back to strict_mtp only. This was the main cause of the slowdown: the exact verifier is a correctness path, not a fast path. Suffix N=2 now uses the batch verifier again.
Fixed accept_rate formula — total_verified was counting draft_n - 1 instead of draft_n, so full-accept steps inflated committed above verified, producing nonsensical rates like 2.5.
Fixed factor sentinel — changed to > 0.0f so zero-initialized API users get the engine default (0.01), not literal 0.0. CLI uses -1.0f sentinel as before.
Applied hexxyan/ds4#2 (Fix suffix telemetry replay timing): the verify_done timestamp is now captured unconditionally, so replay telemetry is valid even without DS4_MTP_TIMING.

@nhwaani, could you re-run your benchmarks with this version? The decode2_exact revert should bring performance back to roughly 1.02x of baseline.

Fix suffix telemetry replay timing

- Add total_verify_ms to all 5 generic batch verifier return points
- Add total_replay_ms to the 2 replay paths (exact replay + general replay)
- Always capture draft_query_ms (remove DS4_SUFFIX_SPEC_LOG gate)
- Add ds4_engine_spec_telemetry_reset() API for per-frontier reset
- Reset telemetry after each frontier in ds4-bench so CSV rows are independent snapshots, not cumulative

Now verify/replay/draft_query timing covers the suffix-only main path (generic batch verifier), not just the decode2_exact branch.

…ect N buckets

- draft_query_ms now always accumulates (removed DS4_SUFFIX_SPEC_LOG gate)
- micro_verify_done captured unconditionally, verify_ms accumulated once after verifier completes rather than at each return point
- total_replay_ms only counts post-verifier replay/restore cost
- accept_rate prints only when total_verified > 0
- N= bucket index fixed: i+1 instead of i+2 (N=1 means 1 draft token)
- first-draft miss now counts toward total_verified and partial_accept histogram so the two stay consistent

hexxyan · 2026-05-28T15:37:40Z

Telemetry completeness update pushed (`6eb2002`)

Fixed the remaining telemetry gaps that were still half-blind:

draft_query_ms always accumulates — removed the DS4_SUFFIX_SPEC_LOG gate, now every speculative step records draft tree query cost
verify_ms accumulated once after verifier completes, not duplicated at each return point
replay_ms only counts post-verifier replay/restore cost, not the verify itself
micro_verify_done unconditional — replay timing is valid even without DS4_MTP_TIMING
accept_rate guard: only prints when total_verified > 0 (no div-by-zero)
N= bucket fixed: i+1 instead of i+2 (N=1 means 1 draft token proposed)
First-draft miss now counts toward total_verified and partial_accept[N] histogram so the two stay consistent

The telemetry output now covers the full suffix-only main path. Example output on a repetitive workload should look something like:

ds4: spec telemetry: steps=53 first_hit=48 first_miss=4 committed=75 verified=30 accept_rate=0.xxx N=1:full=X:partial=Y verify=xx.xms draft_query=0.xms

All numbers should now be self-consistent. @nhwaani — if you get a chance to re-run, the telemetry should give a clear picture of where the time is going.

nhwaani · 2026-05-28T19:31:19Z

Retested latest hexxyan/suffix-decoding at 6eb2002 on M5 Max 128GB with the q2-q4-imatrix GGUF.

Build/tests passed:

make -j$(sysctl -n hw.ncpu) suffix-tree-test ds4_test ds4-bench
./tests/suffix_tree_test
./ds4_test --server
make test

3-run median results, 512 generated tokens:

Workload	Baseline	Suffix default	vs baseline
Agent JSONL, 8192 ctx	29.16 tok/s	31.97 tok/s	1.10x
Code boilerplate, 2048 ctx	33.79 tok/s	35.09 tok/s	1.04x

Raw gen tok/s:

Workload	Baseline runs	Suffix default runs
Agent JSONL	30.69, 27.86, 29.16	31.97, 31.26, 32.09
Code boilerplate	33.68, 33.79, 34.24	33.77, 35.17, 35.09
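For anyone reproducing the summary table from the raw runs, the arithmetic is a median of three followed by a ratio. A small sketch; the helper names are illustrative only.

```c
#include <assert.h>

/* Median of three values by direct comparison. */
static double median3(double a, double b, double c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

/* Throughput ratio of a candidate configuration against the baseline. */
static double speedup(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return candidate / baseline;
}
```

For the agent workload, median3(30.69, 27.86, 29.16) is 29.16 and median3(31.97, 31.26, 32.09) is 31.97, so the ratio is about 1.096, which rounds to the 1.10x shown above.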

Telemetry now looks sane and stable:

agent r1: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8524.6ms draft_query=1.0ms
agent r2: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8708.5ms draft_query=0.9ms
agent r3: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8498.9ms draft_query=0.9ms

code r1:  steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8834.8ms draft_query=0.8ms
code r2:  steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8465.3ms draft_query=0.8ms
code r3:  steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8497.3ms draft_query=1.0ms

I also did a single sanity run with the old aggressive cap (--suffix-spec-factor 1.0 --suffix-spec-offset 0):

Workload	Baseline	Current default	Old aggressive cap
Agent JSONL	30.31 tok/s	31.97 tok/s	13.64 tok/s
Code boilerplate	34.03 tok/s	34.74 tok/s	17.16 tok/s

So the latest fixes recover the expected behavior on my synthetic repetitive agent/code prompts: conservative suffix defaults are near-baseline to modestly faster, while avoiding the old aggressive over-drafting slowdown.

I’d still avoid broad performance claims until there is a representative multi-turn coding-agent benchmark, but this version looks healthy on M5 Max Metal.

hexxyan marked this pull request as ready for review May 26, 2026 19:40

hexxyan force-pushed the suffix-decoding branch from 097de6b to 39acdc6 Compare May 26, 2026 19:52

hexxyan changed the title ~~Add experimental suffix-tree speculative decoding draft source~~ Add suffix-tree speculative decoding for repetitive/agentic generation patterns May 26, 2026

hexxyan force-pushed the suffix-decoding branch from 39acdc6 to 3f83465 Compare May 27, 2026 17:31

hexxyan force-pushed the suffix-decoding branch from 3f83465 to b097d85 Compare May 27, 2026 17:35

Tune suffix decoding default draft cap for Metal

4faf050

Merge pull request #1 from nhwaani/tune-suffix-defaults-m5

1df9f1a

Thanks for the thorough benchmarking work! Merging with merge commit to preserve authorship.

hexxyan mentioned this pull request May 27, 2026

Tune suffix decoding defaults for M5 Max Metal #276 (Open)

Fix suffix telemetry replay timing

1800c4d

hexxyan and others added 3 commits May 28, 2026 21:18

Merge pull request #2 from nhwaani/fix-suffix-telemetry-replay

cb530d6

Fix suffix telemetry replay timing

This was referenced May 30, 2026

CPU support for Q4_K routed experts (fixes #171) #272 (Merged)

metal: simdgroup MMA mini-GEMM for decode MoE [experimental] #306 (Open)

hexxyan wants to merge 9 commits into antirez:main from hexxyan:suffix-decoding

3 participants
