fix(router): use worker's authoritative token count for usage#392
Merged
Conversation
The disaggregated router derived usage.completion_tokens by counting emitted detokenized text pieces in its per-request on_token callback. That equals the worker's generated-token count for byte-level-BPE tokenizers (Qwen) but under-counts for byte-fallback tokenizers (Gemma), where a multi-byte character arrives as several `<0xXX>` model tokens but surfaces as one text piece (or none, for a byte-fallback first token). The under-count also flipped finish_reason between "length" and "stop", since both derive from `completion_tokens >= max_tokens`. Carry the worker's authoritative generated-token count over the serving wire protocol and have the router report it instead of frame-counting. Wire protocol: add an optional `generated_tokens: Option<u64>` to ResultFrame, `#[serde(default)]` for backward compatibility. The prefill node sets it on its FirstToken frame (its true first-token count: 1 normally, 0 on immediate EOS); the decode node sets it on its terminal Continuation frame (its post-handoff count). Intermediate continuation frames and frames from a sender predating the field leave it None. Router: drive_handoff_result now returns a HandoffOutcome that sums the per-node counts. The router reports that authoritative total for usage.completion_tokens and derives finish_reason from it with the same `count >= max_tokens` formula the worker uses, so the result matches single-node. When no frame carried a count (a mixed-version cluster), the router falls back to counting emitted text pieces, the prior behavior. The authoritative count is clamped to max_tokens so a buggy or hostile node cannot inflate the usage figure beyond the router's own budget. This applies to both /v1/completions (the documented bug site, streaming and non-streaming) and /v1/chat/completions (the finish chunk and non-streaming body now carry the derived finish_reason instead of a hard-coded "stop"). Tests: protocol round-trip unit tests for the new field present and absent (serde default); router resolution unit tests covering the authoritative-vs-fallback selection, the max_tokens clamp, and the finish_reason-flip case. The /v1/completions E2E parity test is extended with a byte-fallback (Gemma) variant asserting usage.completion_tokens and finish_reason parity with single-node, gated behind the same #[ignore] plus checkpoint-presence guard as the other real-model tests (not run here: no Gemma checkpoint in this environment). Closes #387
Use saturating_add when summing per-frame authoritative token counts so a hostile or buggy node reporting a very large value cannot trigger a debug-build overflow panic. Clamp the frame-counted fallback path to max_tokens in resolve_completion_tokens, matching the existing clamp on the authoritative path and keeping both branches uniformly defensive. Add a test (fallback_frame_count_clamped_to_max_tokens) to cover the fallback-exceeds-budget case.
Member
Author
Finalization completeChanges applied (commit bdc74f2)
Quality gate results
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The disaggregated router derived
usage.completion_tokensby counting emitted detokenized text pieces in its per-requeston_tokencallback. That equals the worker's generated-token count for byte-level-BPE tokenizers (Qwen) but under-counts for byte-fallback tokenizers (Gemma), where a multi-byte character arrives as several<0xXX>model tokens but surfaces as one detokenized text piece (or none, for a byte-fallback first token). Becausefinish_reasonderives fromcompletion_tokens >= max_tokens, the under-count could also flip it between "length" and "stop", diverging from single-node.This carries the worker's authoritative generated-token count over the serving wire protocol and has the router report it instead of frame-counting.
What changed
src/distributed/disaggregated/serving_protocol.rs: addgenerated_tokens: Option<u64>toResultFrame,#[serde(default)]for backward compatibility (a node predating the field decodes toNone; an older router ignores the unknown field).src/distributed/disaggregated/coordinator.rs:drain_generation_eventsnow returns the worker's authoritative count from the terminalGenerateEvent::Done. The prefill node setsgenerated_tokenson itsFirstTokenframe (its true first-token count: 1 normally, 0 on immediate EOS); the decode node sets it on its terminalContinuationframe (its post-handoff count, accumulated across decode ticks). Intermediate continuation frames leave itNone.src/server/router_front.rs:drive_handoff_resultnow returns aHandoffOutcomesumming the per-node counts.resolve_completion_tokensprefers the authoritative total, clamps it tomax_tokens(the router bounds generation to its own budget, so a larger reported count can only come from a buggy or hostile node), and falls back to the emitted-piece count when no frame carried one.finish_reasonis derived from the authoritative count with the samecount >= max_tokensformula the worker uses, so it matches single-node. Applies to/v1/completions(streaming and non-streaming) and/v1/chat/completions(the finish chunk and non-streaming body now carry the derivedfinish_reasoninstead of a hard-coded "stop").Backward compatibility
The wire change is additive and optional. In a mixed-version cluster, a frame from an older prefill or decode node arrives with
generated_tokens = None(serde default), and the router transparently falls back to the previous frame-counting behavior. A new node's extra field is ignored by an older router.Test plan
cargo check --lib --tests --features metal,accelerate(exit 0)cargo clippy --lib --tests --features metal,accelerate -- -D warnings(exit 0)cargo test --lib --features metal,accelerate -- router_front::tests serving_protocol::tests(15 passed): protocol round-trip with the new field present and absent; router resolution covering authoritative-vs-fallback selection, themax_tokensclamp, and the finish_reason-flip case.tests/disaggregated_router_e2e.rs::disaggregated_router_completions_match_single_node_byte_fallback: a new byte-fallback (Gemma)/v1/completionsparity test assertingusage.completion_tokensandfinish_reasonmatch single-node. Gated behind#[ignore]plus a checkpoint-presence guard like the other real-model tests; not run in this environment (no Gemma checkpoint available). It requires a byte-fallback checkpoint that also supports the pool-backed paged handoff and must be run explicitly with--ignored.Closes #387