fix(router): use worker's authoritative token count for usage #392

Merged: inureyes merged 2 commits into main from fix/issue-387-authoritative-token-count on Jun 21, 2026

@inureyes

Summary

The disaggregated router derived usage.completion_tokens by counting emitted detokenized text pieces in its per-request on_token callback. That equals the worker's generated-token count for byte-level-BPE tokenizers (Qwen) but under-counts for byte-fallback tokenizers (Gemma), where a multi-byte character arrives as several <0xXX> model tokens but surfaces as one detokenized text piece (or none, for a byte-fallback first token). Because finish_reason derives from completion_tokens >= max_tokens, the under-count could also flip it between "length" and "stop", diverging from single-node.

This carries the worker's authoritative generated-token count over the serving wire protocol and has the router report it instead of frame-counting.
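
The under-count described above can be reproduced with a toy model. The sketch below is not the project's detokenizer: `parse_byte_token` and `count_tokens_and_pieces` are hypothetical helpers, and the only assumption taken from the summary is that byte-fallback tokens look like `<0xXX>` and that a text piece is emitted only once the buffered bytes form valid UTF-8.

```rust
/// Parse a byte-fallback token such as "<0xE4>" into its byte value.
/// Hypothetical helper for illustration only.
fn parse_byte_token(tok: &str) -> Option<u8> {
    let hex = tok.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}

/// Toy streaming detokenizer. Returns (model_tokens, emitted_text_pieces).
/// Byte tokens are buffered and surface as one piece only when the buffer
/// decodes as valid UTF-8; every other token surfaces as one piece.
fn count_tokens_and_pieces(tokens: &[&str]) -> (u64, u64) {
    let mut pending: Vec<u8> = Vec::new();
    let mut pieces = 0u64;
    for tok in tokens {
        match parse_byte_token(tok) {
            Some(byte) => {
                pending.push(byte);
                if std::str::from_utf8(&pending).is_ok() {
                    pieces += 1;
                    pending.clear();
                }
            }
            None => {
                // Simplification: an incomplete byte run is dropped here.
                pending.clear();
                pieces += 1;
            }
        }
    }
    (tokens.len() as u64, pieces)
}
```

With the input `["Hi", "<0xE4>", "<0xBD>", "<0xA0>"]` (the three bytes encode one CJK character) this returns `(4, 2)`: four model tokens but only two text pieces. With `max_tokens = 4`, a piece-based count of 2 would yield "stop" where the token-based count of 4 yields "length".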

What changed

  • src/distributed/disaggregated/serving_protocol.rs: add generated_tokens: Option<u64> to ResultFrame, #[serde(default)] for backward compatibility (a node predating the field decodes to None; an older router ignores the unknown field).
  • src/distributed/disaggregated/coordinator.rs: drain_generation_events now returns the worker's authoritative count from the terminal GenerateEvent::Done. The prefill node sets generated_tokens on its FirstToken frame (its true first-token count: 1 normally, 0 on immediate EOS); the decode node sets it on its terminal Continuation frame (its post-handoff count, accumulated across decode ticks). Intermediate continuation frames leave it None.
  • src/server/router_front.rs: drive_handoff_result now returns a HandoffOutcome summing the per-node counts. resolve_completion_tokens prefers the authoritative total, clamps it to max_tokens (the router bounds generation to its own budget, so a larger reported count can only come from a buggy or hostile node), and falls back to the emitted-piece count when no frame carried one. finish_reason is derived from the authoritative count with the same count >= max_tokens formula the worker uses, so it matches single-node. Applies to /v1/completions (streaming and non-streaming) and /v1/chat/completions (the finish chunk and non-streaming body now carry the derived finish_reason instead of a hard-coded "stop").
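
A minimal sketch of the resolution rule described for router_front.rs, as of this first commit. The function names come from the description, but the signatures, parameter names, and the standalone `finish_reason` helper are assumptions made for illustration; the real code may be shaped differently.

```rust
/// Prefer the worker's authoritative total, clamped to the router's budget;
/// fall back to the emitted-piece count when no frame carried a count.
/// Signature is an assumption, not the project's actual one.
fn resolve_completion_tokens(
    authoritative: Option<u64>,
    emitted_pieces: u64,
    max_tokens: u64,
) -> u64 {
    match authoritative {
        // A count above max_tokens can only come from a buggy or hostile node.
        Some(n) => n.min(max_tokens),
        // Mixed-version cluster: the prior frame-counting behavior.
        None => emitted_pieces,
    }
}

/// The `count >= max_tokens` formula the description attributes to the worker.
/// Hypothetical helper name.
fn finish_reason(completion_tokens: u64, max_tokens: u64) -> &'static str {
    if completion_tokens >= max_tokens {
        "length"
    } else {
        "stop"
    }
}
```

For a byte-fallback request with `max_tokens = 4` where the worker generated 4 tokens but only 2 text pieces surfaced, `resolve_completion_tokens(Some(4), 2, 4)` gives 4 and `finish_reason` gives "length"; with no authoritative count the same request resolves to 2 and "stop", which is the divergence this PR removes.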

Backward compatibility

The wire change is additive and optional. In a mixed-version cluster, a frame from an older prefill or decode node arrives with generated_tokens = None (serde default), and the router transparently falls back to the previous frame-counting behavior. A new node's extra field is ignored by an older router.

Test plan

  • cargo check --lib --tests --features metal,accelerate (exit 0)
  • cargo clippy --lib --tests --features metal,accelerate -- -D warnings (exit 0)
  • cargo test --lib --features metal,accelerate -- router_front::tests serving_protocol::tests (15 passed): protocol round-trip with the new field present and absent; router resolution covering authoritative-vs-fallback selection, the max_tokens clamp, and the finish_reason-flip case.
  • tests/disaggregated_router_e2e.rs::disaggregated_router_completions_match_single_node_byte_fallback: a new byte-fallback (Gemma) /v1/completions parity test asserting usage.completion_tokens and finish_reason match single-node. Gated behind #[ignore] plus a checkpoint-presence guard like the other real-model tests; not run in this environment (no Gemma checkpoint available). It requires a byte-fallback checkpoint that also supports the pool-backed paged handoff and must be run explicitly with --ignored.

Closes #387

The disaggregated router derived usage.completion_tokens by counting emitted detokenized text pieces in its per-request on_token callback. That equals the worker's generated-token count for byte-level-BPE tokenizers (Qwen) but under-counts for byte-fallback tokenizers (Gemma), where a multi-byte character arrives as several `<0xXX>` model tokens but surfaces as one text piece (or none, for a byte-fallback first token). The under-count could also flip finish_reason between "length" and "stop", since finish_reason derives from `completion_tokens >= max_tokens`.

Carry the worker's authoritative generated-token count over the serving wire protocol and have the router report it instead of frame-counting.

Wire protocol: add an optional `generated_tokens: Option<u64>` to ResultFrame, `#[serde(default)]` for backward compatibility. The prefill node sets it on its FirstToken frame (its true first-token count: 1 normally, 0 on immediate EOS); the decode node sets it on its terminal Continuation frame (its post-handoff count). Intermediate continuation frames and frames from a sender predating the field leave it None.

Router: drive_handoff_result now returns a HandoffOutcome that sums the per-node counts. The router reports that authoritative total for usage.completion_tokens and derives finish_reason from it with the same `count >= max_tokens` formula the worker uses, so the result matches single-node. When no frame carried a count (a mixed-version cluster), the router falls back to counting emitted text pieces, the prior behavior. The authoritative count is clamped to max_tokens so a buggy or hostile node cannot inflate the usage figure beyond the router's own budget.

This applies to both /v1/completions (the documented bug site, streaming and non-streaming) and /v1/chat/completions (the finish chunk and non-streaming body now carry the derived finish_reason instead of a hard-coded "stop").

Tests: protocol round-trip unit tests for the new field present and absent (serde default); router resolution unit tests covering the authoritative-vs-fallback selection, the max_tokens clamp, and the finish_reason-flip case. The /v1/completions E2E parity test is extended with a byte-fallback (Gemma) variant asserting usage.completion_tokens and finish_reason parity with single-node, gated behind the same #[ignore] plus checkpoint-presence guard as the other real-model tests (not run here: no Gemma checkpoint in this environment).

Closes #387
@inureyes added the status:review, type:bug, priority:low, and area:architecture labels on Jun 21, 2026
Use saturating_add when summing per-frame authoritative token counts so
a hostile or buggy node reporting a very large value cannot trigger a
debug-build overflow panic.

Clamp the frame-counted fallback path to max_tokens in
resolve_completion_tokens, matching the existing clamp on the
authoritative path and keeping both branches uniformly defensive.

Add a test (fallback_frame_count_clamped_to_max_tokens) to cover the
fallback-exceeds-budget case.
@inureyes

Finalization complete

Changes applied (commit bdc74f2)

src/server/router_front.rs

  • drive_handoff_result: changed generated_tokens.unwrap_or(0) + n to .saturating_add(n) in the add_count closure. A hostile or buggy remote node reporting a very large frame count can no longer cause a debug-build overflow panic.
  • resolve_completion_tokens: changed .unwrap_or(frame_counted) to .unwrap_or(frame_counted.min(max_tokens)). The fallback (frame-counted) branch now clamps to max_tokens the same way the authoritative path already does, making both branches uniformly defensive.

src/server/router_front_tests.rs

  • Added fallback_frame_count_clamped_to_max_tokens: asserts that when no authoritative count is present and the emitted-piece count exceeds the budget, resolve_completion_tokens returns max_tokens, not the raw frame count.
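
The two hardening changes can be sketched together. This is an illustration under assumptions, not the committed code: `add_count` is modeled as a free function rather than a closure, `sum_frame_counts` is a hypothetical stand-in for the loop inside drive_handoff_result, and the signatures are guesses.

```rust
/// Fold one frame's optional count into the running total.
/// saturating_add pins the sum at u64::MAX instead of panicking on
/// overflow in a debug build.
fn add_count(total: Option<u64>, frame_count: Option<u64>) -> Option<u64> {
    match frame_count {
        Some(n) => Some(total.unwrap_or(0).saturating_add(n)),
        // Frames without a count (intermediate or older senders) change nothing.
        None => total,
    }
}

/// Sum the per-frame counts; None means no frame carried a count.
/// Hypothetical helper standing in for the router's accumulation.
fn sum_frame_counts(frames: &[Option<u64>]) -> Option<u64> {
    frames.iter().fold(None, |acc, f| add_count(acc, *f))
}

/// Both branches are clamped to max_tokens after this commit.
fn resolve_completion_tokens(
    authoritative: Option<u64>,
    frame_counted: u64,
    max_tokens: u64,
) -> u64 {
    authoritative
        .map(|n| n.min(max_tokens))
        .unwrap_or(frame_counted.min(max_tokens))
}
```

A frame sequence such as `[Some(1), None, None, Some(7)]` (prefill count, two intermediate frames, decode terminal count) sums to `Some(8)`, while `[Some(u64::MAX), Some(5)]` saturates rather than overflowing. The fallback case covered by the new test corresponds to `resolve_completion_tokens(None, 9, 4)` returning 4.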

docs/distributed.md: no change. The file's disaggregated-inference section describes topology and the security trust model; it does not document token-count or usage derivation, so there was nothing to update.

Quality gate results

  • cargo clippy --lib --tests --features metal,accelerate -- -D warnings: clean (Finished in 15.56s, no warnings)
  • cargo test --lib --features metal,accelerate -- router_front::tests: 6/6 passed (5 pre-existing + 1 new)
  • cargo test --lib --features metal,accelerate -- serving_protocol::tests: 10/10 passed
  • cargo fmt: pre-existing failure on router_front.rs due to let-chain syntax (lines 368, 370, 604, 613) that rustfmt 1.8.0 cannot parse; this is not introduced by this commit. The test file (router_front_tests.rs) is format-clean per rustfmt --check.

@inureyes added the status:done label and removed the status:review label on Jun 21, 2026
@inureyes merged commit 455ed8d into main on Jun 21, 2026 (5 checks passed)
@inureyes deleted the fix/issue-387-authoritative-token-count branch on June 21, 2026 23:13
Development

Successfully merging this pull request may close these issues.

fix(router): carry the worker's true generated-token count over the disaggregated wire protocol
