fix(router): use worker's authoritative token count for usage by inureyes · Pull Request #392 · lablup/mlxcel

inureyes · 2026-06-21T22:56:34Z

Summary

The disaggregated router derived usage.completion_tokens by counting emitted detokenized text pieces in its per-request on_token callback. That equals the worker's generated-token count for byte-level-BPE tokenizers (Qwen) but under-counts for byte-fallback tokenizers (Gemma), where a multi-byte character arrives as several <0xXX> model tokens but surfaces as one detokenized text piece (or none, for a byte-fallback first token). Because finish_reason derives from completion_tokens >= max_tokens, the under-count could also flip it between "length" and "stop", diverging from single-node.

This carries the worker's authoritative generated-token count over the serving wire protocol and has the router report it instead of frame-counting.
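
To make the under-count concrete, here is a minimal sketch of how a streaming detokenizer can emit fewer text pieces than the model generated tokens. `ModelToken` and `count_tokens_and_pieces` are hypothetical names invented for this illustration, not types from mlxcel, and the toy detokenizer simply holds byte-fallback tokens back until they form valid UTF-8.

```rust
/// Hypothetical model token: a literal text piece or a byte-fallback token
/// (the kind a tokenizer such as Gemma's renders as `<0xXX>`).
#[derive(Debug, Clone, PartialEq)]
pub enum ModelToken {
    Text(String),
    Byte(u8),
}

/// Returns (generated model tokens, emitted text pieces) for a toy streaming
/// detokenizer that buffers byte tokens until they decode as valid UTF-8.
pub fn count_tokens_and_pieces(tokens: &[ModelToken]) -> (usize, usize) {
    let mut pending: Vec<u8> = Vec::new();
    let mut pieces = 0usize;
    for tok in tokens {
        match tok {
            ModelToken::Text(s) => {
                // Toy behavior: discard any incomplete byte run.
                pending.clear();
                if !s.is_empty() {
                    pieces += 1;
                }
            }
            ModelToken::Byte(b) => {
                pending.push(*b);
                // A piece surfaces only once the buffered bytes are complete.
                if std::str::from_utf8(&pending).is_ok() {
                    pending.clear();
                    pieces += 1;
                }
            }
        }
    }
    (tokens.len(), pieces)
}
```

For the text piece `"Hi "` followed by the three byte tokens of one CJK character (`0xE4`, `0xBD`, `0xA0`), this returns `(4, 2)`: four generated tokens but only two pieces for a piece-counting callback to see. A lone leading byte token yields `(1, 0)`, the "or none" case described above.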

What changed

src/distributed/disaggregated/serving_protocol.rs: add generated_tokens: Option<u64> to ResultFrame, #[serde(default)] for backward compatibility (a node predating the field decodes to None; an older router ignores the unknown field).
src/distributed/disaggregated/coordinator.rs: drain_generation_events now returns the worker's authoritative count from the terminal GenerateEvent::Done. The prefill node sets generated_tokens on its FirstToken frame (its true first-token count: 1 normally, 0 on immediate EOS); the decode node sets it on its terminal Continuation frame (its post-handoff count, accumulated across decode ticks). Intermediate continuation frames leave it None.
src/server/router_front.rs: drive_handoff_result now returns a HandoffOutcome summing the per-node counts. resolve_completion_tokens prefers the authoritative total, clamps it to max_tokens (the router bounds generation to its own budget, so a larger reported count can only come from a buggy or hostile node), and falls back to the emitted-piece count when no frame carried one. finish_reason is derived from the authoritative count with the same count >= max_tokens formula the worker uses, so it matches single-node. Applies to /v1/completions (streaming and non-streaming) and /v1/chat/completions (the finish chunk and non-streaming body now carry the derived finish_reason instead of a hard-coded "stop").
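
The resolution rule described for router_front.rs can be sketched roughly as follows. The function names echo the description, but the signatures, the `u64` types, and the standalone `finish_reason` helper are assumptions made for illustration; the actual code in the PR may be shaped differently.

```rust
/// Sketch (assumed signature): prefer the worker's authoritative count,
/// clamped to the router's own budget; otherwise fall back to the number of
/// text pieces the router emitted.
pub fn resolve_completion_tokens(
    authoritative: Option<u64>,
    frame_counted: u64,
    max_tokens: u64,
) -> u64 {
    authoritative
        .map(|n| n.min(max_tokens))
        .unwrap_or(frame_counted)
}

/// Sketch (hypothetical helper) of the `count >= max_tokens` formula.
pub fn finish_reason(completion_tokens: u64, max_tokens: u64) -> &'static str {
    if completion_tokens >= max_tokens {
        "length"
    } else {
        "stop"
    }
}
```

With `max_tokens = 8`, a worker that generated 8 tokens while the router saw only 6 pieces resolves to 8 and `"length"` under this sketch. With no authoritative count, the same request resolves to 6 and `"stop"`, which is the flip the PR is fixing.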

Backward compatibility

The wire change is additive and optional. In a mixed-version cluster, a frame from an older prefill or decode node arrives with generated_tokens = None (serde default), and the router transparently falls back to the previous frame-counting behavior. A new node's extra field is ignored by an older router.

Test plan

cargo check --lib --tests --features metal,accelerate (exit 0)
cargo clippy --lib --tests --features metal,accelerate -- -D warnings (exit 0)
cargo test --lib --features metal,accelerate -- router_front::tests serving_protocol::tests (15 passed): protocol round-trip with the new field present and absent; router resolution covering authoritative-vs-fallback selection, the max_tokens clamp, and the finish_reason-flip case.
tests/disaggregated_router_e2e.rs::disaggregated_router_completions_match_single_node_byte_fallback: a new byte-fallback (Gemma) /v1/completions parity test asserting usage.completion_tokens and finish_reason match single-node. Gated behind #[ignore] plus a checkpoint-presence guard like the other real-model tests; not run in this environment (no Gemma checkpoint available). It requires a byte-fallback checkpoint that also supports the pool-backed paged handoff and must be run explicitly with --ignored.

Closes #387

The disaggregated router derived usage.completion_tokens by counting emitted detokenized text pieces in its per-request on_token callback. That equals the worker's generated-token count for byte-level-BPE tokenizers (Qwen) but under-counts for byte-fallback tokenizers (Gemma), where a multi-byte character arrives as several `<0xXX>` model tokens but surfaces as one text piece (or none, for a byte-fallback first token). The under-count also flipped finish_reason between "length" and "stop", since finish_reason derives from `completion_tokens >= max_tokens`.

Carry the worker's authoritative generated-token count over the serving wire protocol and have the router report it instead of frame-counting.

Wire protocol: add an optional `generated_tokens: Option<u64>` to ResultFrame, `#[serde(default)]` for backward compatibility. The prefill node sets it on its FirstToken frame (its true first-token count: 1 normally, 0 on immediate EOS); the decode node sets it on its terminal Continuation frame (its post-handoff count). Intermediate continuation frames and frames from a sender predating the field leave it None.

Router: drive_handoff_result now returns a HandoffOutcome that sums the per-node counts. The router reports that authoritative total for usage.completion_tokens and derives finish_reason from it with the same `count >= max_tokens` formula the worker uses, so the result matches single-node. When no frame carried a count (a mixed-version cluster), the router falls back to counting emitted text pieces, the prior behavior. The authoritative count is clamped to max_tokens so a buggy or hostile node cannot inflate the usage figure beyond the router's own budget.

This applies to both /v1/completions (the documented bug site, streaming and non-streaming) and /v1/chat/completions (the finish chunk and non-streaming body now carry the derived finish_reason instead of a hard-coded "stop").

Tests: protocol round-trip unit tests for the new field present and absent (serde default); router resolution unit tests covering the authoritative-vs-fallback selection, the max_tokens clamp, and the finish_reason-flip case. The /v1/completions E2E parity test is extended with a byte-fallback (Gemma) variant asserting usage.completion_tokens and finish_reason parity with single-node, gated behind the same #[ignore] plus checkpoint-presence guard as the other real-model tests (not run here: no Gemma checkpoint in this environment).

Closes #387

Use saturating_add when summing per-frame authoritative token counts so a hostile or buggy node reporting a very large value cannot trigger a debug-build overflow panic. Clamp the frame-counted fallback path to max_tokens in resolve_completion_tokens, matching the existing clamp on the authoritative path and keeping both branches uniformly defensive. Add a test (fallback_frame_count_clamped_to_max_tokens) to cover the fallback-exceeds-budget case.

inureyes · 2026-06-21T23:12:28Z

Finalization complete

Changes applied (commit `bdc74f2`)

src/server/router_front.rs

drive_handoff_result: changed generated_tokens.unwrap_or(0) + n to .saturating_add(n) in the add_count closure. A hostile or buggy remote node reporting a very large frame count can no longer cause a debug-build overflow panic.
resolve_completion_tokens: changed .unwrap_or(frame_counted) to .unwrap_or(frame_counted.min(max_tokens)). The fallback (frame-counted) branch now clamps to max_tokens the same way the authoritative path already does, making both branches uniformly defensive.
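
A rough sketch of the two hardening changes, under stated assumptions: the `.saturating_add(n)` and `.unwrap_or(frame_counted.min(max_tokens))` expressions come from the description above, while the free-function form of `add_count`, the name `resolve_clamped`, and the `u64` types are hypothetical stand-ins for code inside router_front.rs.

```rust
/// Sketch of the add_count accumulation: fold one frame's optional count
/// into the running total without ever overflowing.
pub fn add_count(total: Option<u64>, frame_count: Option<u64>) -> Option<u64> {
    match frame_count {
        // saturating_add pins the sum at u64::MAX instead of panicking
        // in a debug build.
        Some(n) => Some(total.unwrap_or(0).saturating_add(n)),
        // Frames without a count leave the total untouched.
        None => total,
    }
}

/// Sketch of the resolution step with both branches clamped to the budget.
pub fn resolve_clamped(
    authoritative: Option<u64>,
    frame_counted: u64,
    max_tokens: u64,
) -> u64 {
    authoritative
        .map(|n| n.min(max_tokens))
        .unwrap_or(frame_counted.min(max_tokens))
}
```

In this sketch, folding a prefill count of 1 and a decode count of 7 gives `Some(8)`, a frame reporting `u64::MAX` followed by another count stays at `u64::MAX`, and a fallback piece count of 12 against `max_tokens = 8` resolves to 8.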

src/server/router_front_tests.rs

Added fallback_frame_count_clamped_to_max_tokens: asserts that when no authoritative count is present and the emitted-piece count exceeds the budget, resolve_completion_tokens returns max_tokens, not the raw frame count.

docs/distributed.md: no change. The file's disaggregated-inference section describes topology and the security trust model; it does not document token-count or usage derivation, so there was nothing to update.

Quality gate results

cargo clippy --lib --tests --features metal,accelerate -- -D warnings: clean (Finished in 15.56s, no warnings)
cargo test --lib --features metal,accelerate -- router_front::tests: 6/6 passed (5 pre-existing + 1 new)
cargo test --lib --features metal,accelerate -- serving_protocol::tests: 10/10 passed
cargo fmt: pre-existing failure on router_front.rs due to let-chain syntax (lines 368, 370, 604, 613) that rustfmt 1.8.0 cannot parse; this is not introduced by this commit. The test file (router_front_tests.rs) is format-clean per rustfmt --check.

inureyes added the status:review, type:bug, priority:low, and area:architecture labels Jun 21, 2026

inureyes added the status:done label and removed the status:review label Jun 21, 2026

inureyes merged commit 455ed8d into main Jun 21, 2026
5 checks passed

inureyes deleted the fix/issue-387-authoritative-token-count branch June 21, 2026 23:13

inureyes mentioned this pull request Jun 22, 2026

fix(router): emit usage on the disaggregated /v1/chat/completions responses (streaming and non-streaming) #398

Open

4 tasks

inureyes merged 2 commits into main from fix/issue-387-authoritative-token-count
