Add suffix-tree speculative decoding for repetitive/agentic generation patterns#261

Open: hexxyan wants to merge 9 commits into antirez:main from hexxyan:suffix-decoding

@hexxyan hexxyan commented May 26, 2026

Thanks @antirez for this wonderful project.

Summary

This PR adds an opt-in, model-free suffix-tree draft source for speculative decoding, implementing the SuffixDecoding approach (arXiv:2411.04975) with probability estimation features ported from Snowflake ArcticInference.

The suffix trie learns repetitive token patterns from prompt and prior output, then proposes draft tokens at zero model cost. The target-model verifier accepts or rejects each draft, so output correctness is preserved: the trie can change decoding speed, but never the results.

When this helps

The suffix trie is most effective when generated text contains repetitive or predictable subsequences:

  • Agentic / tool-use loops: repeated JSON schemas, function call templates, and response patterns are learned from the first few rounds and then drafted at zero model cost
  • Code generation: repeated imports, boilerplate, and structural tokens (brackets, indentation) have high suffix-tree hit rates
  • Multi-turn conversation: system prompts, formatting tokens, and common phrasings are learned once and reused across turns
  • Structured output: any fixed-format output (XML, JSON, CSV headers) that recurs within the context window

The SuffixDecoding paper (arXiv:2411.04975) reports 1.3–2.5× speedup on agentic benchmarks with their reference implementation. ds4's implementation uses the same core algorithm (suffix trie + frequency-based continuation) with a different internal data structure (sorted arrays vs hash maps), so actual speedup depends on the workload's repetition patterns and the model's baseline token latency.

Key properties

  • Disabled by default (--suffix-decoding flag required) — zero impact on existing behavior
  • No external dependencies: no draft model, no training, no GPU kernels
  • Reuses existing target-model verification path — every proposed token is verified against the target model's logits
  • Falls back gracefully to MTP or single-token decode when no suffix match is found
  • Works alone or combined with MTP: suffix matches are tried first; MTP remains the fallback. When both are enabled, either source can provide drafts.

What's included

New files:

  • ds4_suffix_tree.h/c — Bounded CPU-resident suffix trie with sorted-array children, frequency-aging pruning, and best-effort memory budget (~500 lines)
  • tests/suffix_tree_test.c — 8 unit tests covering continuation matching, probability estimation, confidence filtering, best-child caching, and pruning

Modified files:

  • ds4.h — 6 new engine options (suffix_decoding, suffix_max_depth, suffix_memory_budget, suffix_spec_factor, suffix_spec_offset, suffix_min_prob)
  • ds4.c — Session lifecycle integration, incremental learning, two-phase draft selection (match → query), score-based draft gating, independent spec_logits allocation for suffix-only batch verification
  • ds4_cli.c — 6 new CLI flags
  • ds4_bench.c — Suffix telemetry CSV columns, new config options (note: CSV schema expanded from 6 columns to include context memory breakdown, MTP stats, and suffix telemetry — downstream CSV parsers will need updating)
  • Makefile: suffix-tree-test target
  • README.md — Usage documentation
  • CONTRIBUTING.md — Telemetry column guidance
  • speed-bench/README.md — MTP and suffix bench sweep examples

Architecture

The suffix trie learns repetitive token patterns from prompt, checkpoint, and accepted generation tokens. During speculative decode:

  1. ds4_suffix_tree_match_depth() finds the longest matching suffix in the trie
  2. Adaptive draft cap is computed: cap = match_len × factor + offset
  3. ds4_suffix_tree_query() follows the highest-frequency continuation path with probability estimation (prob *= child_freq / parent_freq)
  4. Drafts below min_prob confidence are filtered; score is used for quality gating
  5. All proposed tokens are verified by the existing target-model verifier
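
Steps 2 to 4 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code from ds4_suffix_tree.c: `draft_cap`, `confident_prefix`, and the `max_draft` clamp are hypothetical names, the cap is assumed to be truncated toward zero (the PR does not say how it is rounded), and the trie walk is reduced to an array of frequencies along the best-child path.

```c
#include <assert.h>

/* Hypothetical helper for step 2: cap = match_len * factor + offset,
 * truncated toward zero and clamped to [0, max_draft] (assumed policy). */
static int draft_cap(int match_len, float factor, float offset, int max_draft) {
    assert(match_len >= 0 && max_draft >= 0);
    float cap = (float)match_len * factor + offset;
    if (cap < 0.0f) cap = 0.0f;
    if (cap > (float)max_draft) cap = (float)max_draft;
    return (int)cap;
}

/* Hypothetical helper for steps 3 and 4.
 * path_freq[0] is the frequency of the matched node and path_freq[i + 1]
 * the frequency of the best child chosen at depth i. The running
 * probability is multiplied by child_freq / parent_freq at every hop.
 * Returns how many drafts stay at or above min_prob (at most cap) and
 * stores the probability of the last kept draft in *score_out. */
static int confident_prefix(const unsigned *path_freq, int n_nodes, int cap,
                            float min_prob, float *score_out) {
    float prob = 1.0f;   /* running probability of the candidate path */
    float score = 0.0f;  /* probability of the last kept draft, 0 if none */
    int kept = 0;
    for (int i = 0; i + 1 < n_nodes && kept < cap; i++) {
        if (path_freq[i] == 0) break;           /* avoid division by zero */
        prob *= (float)path_freq[i + 1] / (float)path_freq[i];
        if (prob < min_prob) break;             /* confidence filter */
        score = prob;
        kept++;
    }
    if (score_out) *score_out = score;
    return kept;
}
```

Under these assumptions, factor 0.01 with offset 2 yields a cap of 2 for any match shorter than 100 tokens, while factor 1.0 with offset 0 lets the cap grow with the match length.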

Incremental learning: the tree uses suffix_learned_len to track what's already been inserted, appending only new tokens per decode step instead of re-inserting the entire checkpoint. This avoids frequency inflation for older patterns and keeps per-step cost O(max_depth).
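
The watermark idea behind suffix_learned_len can be illustrated with a toy learner. This is a sketch only: `toy_learner`, `learn_incremental`, and the per-token counter array are hypothetical stand-ins for the real trie, and what happens when the sequence shrinks (here: nothing) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_VOCAB 16  /* hypothetical toy vocabulary size */

typedef struct {
    size_t learned_len;        /* tokens of the sequence already inserted */
    unsigned freq[TOY_VOCAB];  /* stand-in for trie node frequencies */
} toy_learner;

/* Inserts only tokens[learned_len .. n_tokens) and advances the watermark.
 * Tokens below the watermark are never re-counted, so frequencies of older
 * patterns do not inflate when the same sequence is presented again. */
static size_t learn_incremental(toy_learner *l, const int *tokens, size_t n_tokens) {
    size_t added = 0;
    for (size_t i = l->learned_len; i < n_tokens; i++) {
        assert(tokens[i] >= 0 && tokens[i] < TOY_VOCAB);
        l->freq[tokens[i]]++;
        added++;
    }
    if (n_tokens > l->learned_len) l->learned_len = n_tokens;
    return added;
}
```

Calling `learn_incremental` after every decode step with the full token history touches only the newly appended tokens, which is the property the paragraph above relies on.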

Pruning: frequency aging (decrement all by 1, clamp at 0) followed by zero-frequency leaf removal, with a configurable node budget (default 64 MB). Pruning runs up to 16 rounds per trigger and may temporarily exceed the budget under heavy insert load before converging back.
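
A rough sketch of one aging-and-prune cycle, assuming a flat node pool in which children always have a higher index than their parent. The `node` layout, `prune_round`, and `prune_to_budget` are hypothetical and the budget is counted in live nodes rather than bytes; only the aging rule, the leaf-removal rule, and the 16-round limit come from the description above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int parent;           /* index of the parent node, -1 for the root */
    unsigned freq;        /* observation count, aged on every round */
    unsigned n_children;  /* number of live children */
    int alive;            /* 0 once the node has been pruned */
} node;

/* One round: decrement every frequency (clamped at 0), then remove leaves
 * whose frequency reached 0. Walking from the highest index down lets a
 * whole zero-frequency chain disappear in a single round. */
static size_t prune_round(node *nodes, size_t n) {
    size_t removed = 0;
    for (size_t i = 0; i < n; i++)
        if (nodes[i].alive && nodes[i].freq > 0) nodes[i].freq--;
    for (size_t i = n; i-- > 1; ) {  /* index 0 is the root, never removed */
        node *nd = &nodes[i];
        if (!nd->alive || nd->freq != 0 || nd->n_children != 0) continue;
        assert(nd->parent >= 0);
        nd->alive = 0;
        nodes[nd->parent].n_children--;
        removed++;
    }
    return removed;
}

/* Repeats rounds until the live-node count fits the budget, at most 16 times.
 * Returns the live-node count afterwards, which may still exceed the budget. */
static size_t prune_to_budget(node *nodes, size_t n, size_t live, size_t budget) {
    for (int round = 0; round < 16 && live > budget; round++)
        live -= prune_round(nodes, n);
    return live;
}
```

Because every round ages all surviving nodes, frequently seen patterns lose weight as well; that is the trade-off of aging-based pruning.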

Verification

make suffix-tree-test NATIVE_CPU_FLAG=   # 8 unit tests, all pass
make cpu NATIVE_CPU_FLAG=                 # all CPU binaries compile
make ds4-bench NATIVE_CPU_FLAG=           # Metal binary compiles
./ds4-eval --self-test-extractors         # extractor self-tests pass

Important caveats

  • No e2e benchmark yet: I do not have access to 80GB+ hardware for full DeepSeek V4 end-to-end testing. The implementation is correct by construction (target verifier gates all output), but actual throughput numbers need real hardware.
  • Default parameters = no behavior change: With default settings (--suffix-spec-factor 1.0, --suffix-spec-offset 0.0, --suffix-min-prob 0.0), the implementation reproduces the original alpha=1 draft cap. The new parameters are configurable infrastructure for tuning.
  • Suffix-only batch verification: spec_logits is now allocated independently when --suffix-decoding is enabled (without requiring MTP). The structural prerequisite is in place but needs real-model validation.

Request for maintainers/community

If you have access to a machine with 80GB+ GPU running DeepSeek V4, help with the following would be appreciated:

  1. End-to-end speculative decode test with --suffix-decoding
  2. Throughput measurement (tokens/s) comparing: baseline → suffix-only → suffix+MTP
  3. Draft acceptance rate under agentic/repetitive workloads
  4. Validation of suffix-only batch verification correctness

AI assistance disclosure

This code was written with assistance from GPT-5.5-xhigh and GLM-5.1 models.

Follow-up fixes (commits after initial PR)

Build/test fixes

  • Conservative default tuning (merged from @nhwaani): --suffix-spec-factor 0.01 --suffix-spec-offset 2 avoids over-drafting on Metal, keeping suffix decoding near baseline on M5 Max
  • Pre-allocated verifier scratch: eliminated per-step malloc/free in the speculative decode hot path

Bug fixes

  • --suffix-spec-offset 0 now works: sentinel defaults (-1.0f) + correct engine-side checks so explicit zero is respected
  • suffix_spec_factor zero-init safety: > 0.0f check so API callers with ds4_engine_options opt = {0} get engine defaults
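
The two sentinel checks can be summarized in a short sketch. The function and macro names are hypothetical; the default values (0.01 and 2) and the comparison operators are the ones described in this thread, and the real ds4.c code may be organized differently.

```c
#include <assert.h>

#define SUFFIX_DEFAULT_FACTOR 0.01f  /* conservative default from this thread */
#define SUFFIX_DEFAULT_OFFSET 2.0f   /* conservative default from this thread */
#define SUFFIX_CLI_UNSET (-1.0f)     /* sentinel the CLI passes when a flag is absent */

/* Factor: only strictly positive values count as set, so both the CLI
 * sentinel and a zero-initialized options struct fall back to the default. */
static float resolve_spec_factor(float requested) {
    return requested > 0.0f ? requested : SUFFIX_DEFAULT_FACTOR;
}

/* Offset: zero is a legitimate explicit value, so only negative values
 * (the sentinel) fall back to the default. */
static float resolve_spec_offset(float requested) {
    return requested >= 0.0f ? requested : SUFFIX_DEFAULT_OFFSET;
}

/* Illustrative checks of the resolution rules. */
static void sentinel_examples(void) {
    assert(resolve_spec_offset(0.0f) == 0.0f);               /* explicit zero kept */
    assert(resolve_spec_offset(SUFFIX_CLI_UNSET) == SUFFIX_DEFAULT_OFFSET);
    assert(resolve_spec_factor(0.0f) == SUFFIX_DEFAULT_FACTOR);
    assert(resolve_spec_factor(SUFFIX_CLI_UNSET) == SUFFIX_DEFAULT_FACTOR);
}
```

The asymmetry is deliberate in this sketch: a factor of 0 is treated as "unset", whereas an offset of 0 is treated as a real request.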

Telemetry (speculative decode)

Added comprehensive telemetry printed on exit when --suffix-decoding is active:

  • spec_steps, first_draft_hit/miss — how often speculative decode runs and how often the first draft matches
  • accept_rate — committed draft tokens / verified draft tokens
  • N=X:full=Y:partial=Z histogram — distribution of full vs partial accepts by draft depth
  • verify_ms, replay_ms, draft_query_ms — timing breakdown across all verifier paths (decode2_exact, generic batch verifier, sequential fallback)

Telemetry covers all verifier paths, not just one branch. In ds4-bench, telemetry resets per frontier so each CSV row is an independent snapshot.

What we learned from M5 Max testing

  • Conservative defaults keep suffix decoding at ~1.02x (near baseline) on agent-style JSONL workloads
  • The main bottleneck is verifier cost: verify + replay time dominates draft query time
  • Dynamic adaptive cap (adjusting draft depth based on recent accept rate) is the next promising optimization direction

@hexxyan hexxyan marked this pull request as ready for review May 26, 2026 19:40
@hexxyan hexxyan force-pushed the suffix-decoding branch from 097de6b to 39acdc6 Compare May 26, 2026 19:52
@hexxyan hexxyan changed the title Add experimental suffix-tree speculative decoding draft source Add suffix-tree speculative decoding for repetitive/agentic generation patterns May 26, 2026
@hexxyan hexxyan force-pushed the suffix-decoding branch from 39acdc6 to 3f83465 Compare May 27, 2026 17:31
Implements an opt-in, model-free suffix trie (arXiv:2411.04975) for
speculative decoding in ds4, with probability estimation and adaptive
draft caps ported from Snowflake ArcticInference.

New files:
- ds4_suffix_tree.h/c: bounded CPU-resident suffix trie with sorted-array
  children, frequency-aging pruning, memory budget enforcement, probability
  estimation, and cached best-child index (~500 lines)
- tests/suffix_tree_test.c: 8 unit tests covering continuation matching,
  probability estimation, confidence filtering, best-child caching, and pruning

Modified files:
- ds4.h: 6 new engine options (suffix_decoding, suffix_max_depth,
  suffix_memory_budget, suffix_spec_factor, suffix_spec_offset,
  suffix_min_prob); suffix stats telemetry with draft_score_total
- ds4.c: session lifecycle integration, incremental learning via
  suffix_learned_len, two-phase draft selection (match_depth then query),
  score-based draft gating, independent spec_logits allocation
- ds4_cli.c: 6 new CLI flags
- ds4_bench.c: suffix telemetry CSV columns, 3 new config options
- Makefile: suffix-tree-test target
- README.md: usage documentation
- CONTRIBUTING.md: telemetry column guidance
- speed-bench/README.md: MTP and suffix bench sweep examples

Disabled by default (--suffix-decoding flag required). No external
dependencies, no training, no GPU kernels. Falls back gracefully to
MTP or single-token decode when no suffix match is found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hexxyan hexxyan force-pushed the suffix-decoding branch from 3f83465 to b097d85 Compare May 27, 2026 17:35

nhwaani commented May 27, 2026

I tested this PR on an M5 Max 128GB with the DeepSeek V4 Flash q2 GGUF.

Good news: the feature builds and works on real hardware.

Passed:

./tests/suffix_tree_test
./ds4_test --server
make test

Main finding: the current default draft cap is too aggressive on my Metal runs. It often proposes long drafts, but the verifier only commits a short prefix. That makes default --suffix-decoding slower than baseline on the repetitive workloads I tested.

Example, agent-style JSONL prompt, 8192 ctx, 512 gen tokens:

| Mode | gen tok/s | vs baseline |
| --- | --- | --- |
| baseline | 29.10 | 1.00x |
| suffix current default | 16.69 | 0.57x |
| suffix conservative cap | 32.17 | 1.11x |

Repeated baseline vs conservative cap:

| Run | baseline | conservative | gain |
| --- | --- | --- | --- |
| 1 | 29.14 | 31.26 | +7.3% |
| 2 | 29.24 | 30.42 | +4.0% |
| 3 | 28.19 | 30.12 | +6.8% |

The useful tuning was:

--suffix-spec-factor 0.01 --suffix-spec-offset 2

I opened a small follow-up PR against this branch with the benchmark details and default tuning:

hexxyan#1


nhwaani commented May 27, 2026

Small follow-up: I rebased the tuning branch onto hexxyan/suffix-decoding so the follow-up PR diff is now only the intended defaults/docs change (+12 / -9), and reran the benchmarks on that exact final branch.

Follow-up PR: hexxyan#1

Final M5 Max 128GB numbers:

Agent JSONL, 8192 ctx, 512 gen tokens

| Mode | gen tok/s | vs baseline |
| --- | --- | --- |
| baseline | 30.82 | 1.00x |
| old suffix default | 14.54 | 0.47x |
| new suffix default | 31.37 | 1.02x |

Code boilerplate, 2048 ctx, 512 gen tokens

| Mode | gen tok/s | vs baseline |
| --- | --- | --- |
| baseline | 34.20 | 1.00x |
| old suffix default | 13.02 | 0.38x |
| new suffix default | 33.13 | 0.97x |

Tests passed after rebase:

./tests/suffix_tree_test
./ds4_test --server
make test

So the main point is: the current default over-drafts and can cause a big slowdown on Metal; the conservative default avoids that and keeps suffix decoding near baseline / slightly faster on the agent-style workload.

Thanks for the thorough benchmarking work! Merging with merge commit to preserve authorship.

hexxyan commented May 27, 2026

Thanks @nhwaani for the thorough M5 Max benchmarking — the conservative defaults (factor 0.01, offset 2) are a clear improvement over the original aggressive cap. I've merged your tuning PR into this branch so the opt-in experience is safe out of the box.

Your commit is preserved with you as author, so when this PR lands you'll appear in the project's Contributors list as well. Appreciate the real-hardware testing!


nhwaani commented May 27, 2026

> Thanks @nhwaani for the thorough M5 Max benchmarking — the conservative defaults (factor 0.01, offset 2) are a clear improvement over the original aggressive cap. I've merged your tuning PR into this branch so the opt-in experience is safe out of the box.
>
> Your commit is preserved with you as author, so when this PR lands you'll appear in the project's Contributors list as well. Appreciate the real-hardware testing!

Thanks, let me know if you see any concerns @hexxyan


STRML commented May 28, 2026

Given the added complexity and the marginal performance benefits in these benchmarks, what's the reason to merge this? Is there a more representative benchmark for agentic coding?

- Fix sentinel defaults: use >= 0.0f check so --suffix-spec-offset 0 works
- Add speculative decode telemetry (accept rate, timing, hit/miss counts)
- Pre-allocate verifier scratch buffers in session (eliminate hot-path malloc)
- Enable decode2_exact for suffix tree N=2 drafts (faster verify path)
- Print telemetry summary on exit when suffix decoding is active

hexxyan commented May 28, 2026

Suffix decoding fixes pushed

Just pushed a batch of fixes and improvements:

  1. Fixed --suffix-spec-offset 0 — previously ignored due to wrong default check
  2. Added speculative decode telemetry — prints accept rate, timing breakdown, hit/miss stats on exit
  3. Pre-allocated verifier scratch — eliminated per-step malloc/free in the hot verification path
  4. Enabled decode2_exact for suffix tree N=2 drafts — faster verification when suffix tree proposes exactly 2 draft tokens

@nhwaani — would you mind re-testing with the updated branch? The telemetry output will now print at the end, which should make it easier to see what's happening with accept rates and verify timing. A quick run like:

./ds4 -p "Your test prompt" --suffix-decoding --temp 0 -n 100

would be very helpful. Thanks!


nhwaani commented May 28, 2026

Retested latest hexxyan/suffix-decoding on M5 Max 128GB.

Branch tested:

3d553db + 1800c4d

1800c4d is just a one-line telemetry fix I opened here:

hexxyan#2

It fixes impossible replay timing like:

replay=815247931.0ms

After the fix, telemetry is sane:

verify=8316.1ms replay=3.1ms

Tests passed:

make -j$(sysctl -n hw.ncpu) suffix-tree-test ds4_test ds4-bench
./tests/suffix_tree_test
./ds4_test --server

Model used: q2-q4-imatrix GGUF.

Agent JSONL, 8192 ctx, 512 gen

| Mode | gen tok/s | vs baseline |
| --- | --- | --- |
| baseline | 29.72 | 1.00x |
| suffix new default | 27.11 | 0.91x |
| suffix old aggressive | 15.72 | 0.53x |

Code boilerplate, 2048 ctx, 512 gen

| Mode | gen tok/s | vs baseline |
| --- | --- | --- |
| baseline | 35.31 | 1.00x |
| suffix new default | 31.57 | 0.89x |
| suffix old aggressive | 14.69 | 0.42x |

My read:

  • The conservative default is much safer than the old aggressive default.
  • It avoids the large over-drafting slowdown.
  • But on this q2-q4 M5 Metal run, I do not see a net speedup over baseline yet.

So I would not present this as a proven M5 Metal performance win today.

The merge argument, if any, is:

  • opt-in only
  • correctness-preserving verifier path
  • unit tests included
  • useful telemetry now available
  • gives us a tunable model-free draft source for future agent/tool-call workloads

I agree with @STRML that a more representative multi-turn coding-agent benchmark would be useful before making strong speedup claims.

- Revert decode2_exact for suffix tree (was causing slowdown, it's a
  correctness path not a fast path; suffix N=2 should use batch verify)
- Fix accept_rate formula: total_verified now counts draft_n (not draft_n-1)
  so the ratio correctly reflects draft token acceptance probability
- Fix factor sentinel: use > 0.0f so zero-initialized API opts get engine
  default (0.01) instead of literal 0.0
- Apply PR #2: verify_done timestamp captured unconditionally
  so replay telemetry doesn't print garbage when DS4_MTP_TIMING is off

hexxyan commented May 28, 2026

Regression fixes pushed

Good catch on all four issues. Just pushed a fix commit (10229e1):

  1. Reverted decode2_exact for suffix tree — back to strict_mtp only. This was the main cause of the slowdown: the exact verifier is a correctness path, not a fast path. Suffix N=2 now uses the batch verifier again.

  2. Fixed accept_rate formula: total_verified was counting draft_n - 1 instead of draft_n, so full-accept steps inflated committed above verified, producing nonsensical rates like 2.5.

  3. Fixed factor sentinel — changed to > 0.0f so zero-initialized API users get the engine default (0.01), not literal 0.0. CLI uses -1.0f sentinel as before.

  4. Applied PR hexxyan/ds4#2 (Fix suffix telemetry replay timing): the verify_done timestamp is now captured unconditionally, so replay telemetry is valid even without DS4_MTP_TIMING.

@nhwaani, could you re-run your benchmarks with this version? The decode2_exact revert should bring performance back to roughly 1.02x of baseline.

hexxyan and others added 3 commits May 28, 2026 21:18
- Add total_verify_ms to all 5 generic batch verifier return points
- Add total_replay_ms to the 2 replay paths (exact replay + general replay)
- Always capture draft_query_ms (remove DS4_SUFFIX_SPEC_LOG gate)
- Add ds4_engine_spec_telemetry_reset() API for per-frontier reset
- Reset telemetry after each frontier in ds4-bench so CSV rows are
  independent snapshots, not cumulative

Now verify/replay/draft_query timing covers the suffix-only main path
(generic batch verifier), not just the decode2_exact branch.
…ect N buckets

- draft_query_ms now always accumulates (removed DS4_SUFFIX_SPEC_LOG gate)
- micro_verify_done captured unconditionally, verify_ms accumulated once
  after verifier completes rather than at each return point
- total_replay_ms only counts post-verifier replay/restore cost
- accept_rate prints only when total_verified > 0
- N= bucket index fixed: i+1 instead of i+2 (N=1 means 1 draft token)
- first-draft miss now counts toward total_verified and partial_accept
  histogram so the two stay consistent

hexxyan commented May 28, 2026

Telemetry completeness update pushed (6eb2002)

Fixed the remaining telemetry gaps that were still half-blind:

  1. draft_query_ms always accumulates — removed the DS4_SUFFIX_SPEC_LOG gate, now every speculative step records draft tree query cost
  2. verify_ms accumulated once after verifier completes, not duplicated at each return point
  3. replay_ms only counts post-verifier replay/restore cost, not the verify itself
  4. micro_verify_done unconditional — replay timing is valid even without DS4_MTP_TIMING
  5. accept_rate guard: only prints when total_verified > 0 (no div-by-zero)
  6. N= bucket fixed: i+1 instead of i+2 (N=1 means 1 draft token proposed)
  7. First-draft miss now counts toward total_verified and partial_accept[N] histogram so the two stay consistent
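
Items 5 and 6 can be illustrated with a small sketch. The `spec_stats` struct, `MAX_DRAFT`, and the function names are hypothetical; only the output field format follows the telemetry lines shown in this thread.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_DRAFT 4  /* hypothetical upper bound on draft length */

typedef struct {
    unsigned long committed;           /* accepted draft tokens */
    unsigned long verified;            /* draft tokens checked by the verifier */
    unsigned long full[MAX_DRAFT];     /* index i: drafts of length i + 1, fully accepted */
    unsigned long partial[MAX_DRAFT];  /* index i: drafts of length i + 1, partly rejected */
} spec_stats;

/* Returns 1 and stores committed / verified, or returns 0 without touching
 * *rate_out when nothing was verified (no division by zero). */
static int accept_rate(unsigned long committed, unsigned long verified, double *rate_out) {
    assert(rate_out != 0);
    if (verified == 0) return 0;
    *rate_out = (double)committed / (double)verified;
    return 1;
}

/* Bucket index i corresponds to N = i + 1 proposed draft tokens. */
static int bucket_draft_len(int i) {
    return i + 1;
}

/* Prints one telemetry line; accept_rate is omitted when it is undefined
 * and empty histogram buckets are skipped. */
static void print_spec_stats(FILE *out, const spec_stats *s) {
    double rate;
    fprintf(out, "committed=%lu verified=%lu", s->committed, s->verified);
    if (accept_rate(s->committed, s->verified, &rate))
        fprintf(out, " accept_rate=%.3f", rate);
    for (int i = 0; i < MAX_DRAFT; i++) {
        if (s->full[i] == 0 && s->partial[i] == 0) continue;
        fprintf(out, " N=%d:full=%lu:partial=%lu",
                bucket_draft_len(i), s->full[i], s->partial[i]);
    }
    fputc('\n', out);
}
```

Calling `print_spec_stats(stderr, &stats)` at exit would print the rate only for runs in which at least one draft token was verified.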

The telemetry output now covers the full suffix-only main path. Example output on a repetitive workload should look something like:

ds4: spec telemetry: steps=53 first_hit=48 first_miss=4 committed=75 verified=30 accept_rate=0.xxx N=1:full=X:partial=Y verify=xx.xms draft_query=0.xms

All numbers should now be self-consistent. @nhwaani — if you get a chance to re-run, the telemetry should give a clear picture of where the time is going.


nhwaani commented May 28, 2026

Retested latest hexxyan/suffix-decoding at 6eb2002 on M5 Max 128GB with the q2-q4-imatrix GGUF.

Build/tests passed:

make -j$(sysctl -n hw.ncpu) suffix-tree-test ds4_test ds4-bench
./tests/suffix_tree_test
./ds4_test --server
make test

3-run median results, 512 generated tokens:

| Workload | Baseline | Suffix default | vs baseline |
| --- | --- | --- | --- |
| Agent JSONL, 8192 ctx | 29.16 tok/s | 31.97 tok/s | 1.10x |
| Code boilerplate, 2048 ctx | 33.79 tok/s | 35.09 tok/s | 1.04x |

Raw gen tok/s:

| Workload | Baseline runs | Suffix default runs |
| --- | --- | --- |
| Agent JSONL | 30.69, 27.86, 29.16 | 31.97, 31.26, 32.09 |
| Code boilerplate | 33.68, 33.79, 34.24 | 33.77, 35.17, 35.09 |
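
The median figures can be re-derived from the raw runs with a few lines of C. This is only an arithmetic check; `median3` and `speedup` are hypothetical helper names.

```c
#include <assert.h>

/* Median of three values: the one that lies between the other two. */
static double median3(double a, double b, double c) {
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

/* Throughput ratio of a candidate configuration over the baseline. */
static double speedup(double baseline_tok_s, double candidate_tok_s) {
    assert(baseline_tok_s > 0.0);
    return candidate_tok_s / baseline_tok_s;
}
```

For the agent workload, `median3(30.69, 27.86, 29.16)` is 29.16 and `median3(31.97, 31.26, 32.09)` is 31.97, so the ratio is about 1.096, which rounds to the reported 1.10x. The code workload gives 35.09 / 33.79, about 1.038, matching the reported 1.04x.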

Telemetry now looks sane and stable:

agent r1: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8524.6ms draft_query=1.0ms
agent r2: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8708.5ms draft_query=0.9ms
agent r3: steps=218 first_hit=159 first_miss=56 committed=294 verified=374 accept_rate=0.786 N=2:full=135:partial=80 verify=8498.9ms draft_query=0.9ms

code r1:  steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8834.8ms draft_query=0.8ms
code r2:  steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8465.3ms draft_query=0.8ms
code r3:  steps=221 first_hit=182 first_miss=31 committed=291 verified=329 accept_rate=0.884 N=1:full=66:partial=3 N=2:full=109:partial=35 verify=8497.3ms draft_query=1.0ms
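
As a consistency check, the counters above can be reproduced from the histogram under one assumed accounting rule: verification stops at the first rejected draft token, so a partial accept counts `accepted + 1` verified tokens and a full accept counts `draft_n`. This rule is inferred from the logged numbers, not taken from ds4.c, and the `tally` type and helper names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned long committed, verified;
    unsigned long full[2], partial[2];  /* index i: drafts of length i + 1 */
} tally;

/* Records `times` steps that proposed draft_n tokens and accepted `accepted`. */
static void record_many(tally *t, int draft_n, int accepted, unsigned long times) {
    assert(draft_n >= 1 && draft_n <= 2 && accepted >= 0 && accepted <= draft_n);
    for (unsigned long k = 0; k < times; k++) {
        t->committed += (unsigned long)accepted;
        if (accepted == draft_n) {
            t->verified += (unsigned long)draft_n;       /* every token checked */
            t->full[draft_n - 1]++;
        } else {
            t->verified += (unsigned long)accepted + 1;  /* stops at first reject */
            t->partial[draft_n - 1]++;
        }
    }
}

/* Agent run: N=2 full=135 partial=80, of which first_miss=56. */
static tally replay_agent_run(void) {
    tally t = {0};
    record_many(&t, 2, 2, 135);  /* both tokens accepted */
    record_many(&t, 2, 0, 56);   /* first token rejected */
    record_many(&t, 2, 1, 24);   /* first accepted, second rejected (80 - 56) */
    return t;
}

/* Code run: N=1 full=66 partial=3, N=2 full=109 partial=35, first_miss=31. */
static tally replay_code_run(void) {
    tally t = {0};
    record_many(&t, 1, 1, 66);
    record_many(&t, 1, 0, 3);    /* an N=1 partial is necessarily a first miss */
    record_many(&t, 2, 2, 109);
    record_many(&t, 2, 0, 28);   /* remaining first misses (31 - 3) */
    record_many(&t, 2, 1, 7);    /* 35 - 28 */
    return t;
}
```

Under this rule the replay yields committed=294, verified=374 for the agent run and committed=291, verified=329 for the code run, which gives 294/374 ≈ 0.786 and 291/329 ≈ 0.884, the logged accept rates. The histogram totals (215 and 213) also equal first_hit + first_miss in each run.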

I also did a single sanity run with the old aggressive cap (--suffix-spec-factor 1.0 --suffix-spec-offset 0):

| Workload | Baseline | Current default | Old aggressive cap |
| --- | --- | --- | --- |
| Agent JSONL | 30.31 tok/s | 31.97 tok/s | 13.64 tok/s |
| Code boilerplate | 34.03 tok/s | 34.74 tok/s | 17.16 tok/s |

So the latest fixes recover the expected behavior on my synthetic repetitive agent/code prompts: conservative suffix defaults are near-baseline to modestly faster, while avoiding the old aggressive over-drafting slowdown.

I’d still avoid broad performance claims until there is a representative multi-turn coding-agent benchmark, but this version looks healthy on M5 Max Metal.
