
fix(tps): correct off-by-one in decode token count for generation TPS #23828

Open
paul90317 wants to merge 1 commit into ggml-org:master from paul90317:master


@paul90317

Overview

Generation TPS was computed from a decode token count that included one extra token, while the measured decode time did not cover the step that produced it. This mismatch between token and time accounting inflated the reported TPS.

This change fixes the off-by-one error in the decode token count so that the count and the decode duration cover the same steps.
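
To illustrate the mismatch, here is a minimal sketch with made-up numbers. It assumes the counted tokens include exactly one token whose step falls outside the timed decode window. The helper `tokens_per_second` and the two wrapper functions are hypothetical names for illustration and are not taken from the patch.

```cpp
#include <cassert>

// Hypothetical helper (not from the patch): tokens per second for
// n_tokens produced over t_ms milliseconds.
static double tokens_per_second(int n_tokens, double t_ms) {
    assert(n_tokens >= 0 && t_ms >= 0.0);
    return t_ms > 0.0 ? 1e3 * n_tokens / t_ms : 0.0;
}

// Assumed accounting before the fix: n_counted includes one token whose
// step was not timed, yet all of them are divided by the decode time.
static double generation_tps_before(int n_counted, double t_decode_ms) {
    return tokens_per_second(n_counted, t_decode_ms);
}

// Assumed accounting after the fix: only the tokens whose steps were
// actually timed are divided by the decode time.
static double generation_tps_after(int n_counted, double t_decode_ms) {
    const int n_timed = n_counted > 0 ? n_counted - 1 : 0;
    return tokens_per_second(n_timed, t_decode_ms);
}
```

With 101 counted tokens and 1000 ms of measured decode time, the first function reports 101 tokens/s and the second reports 100 tokens/s. The relative error shrinks as generations get longer, so short generations would be affected most under this assumption.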

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: yes (used for commit message drafting and wording)

paul90317 requested a review from a team as a code owner on May 28, 2026 16:19
@paul90317
Author

A potential issue is the stop condition, which also relies on slot.n_decoded:

// check the limits
if (slot.n_decoded > 0 && slot.has_next_token && !slot.has_budget(params_base)) {
    slot.stop = STOP_TYPE_LIMIT;
    slot.has_next_token = false;
}

This may cause the model to output one more token than the configured limit.

I will investigate a better solution.
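
The concern can be sketched with a simplified stand-in for the slot. Everything here is an assumption made for illustration: the `Slot` struct, its `has_budget()` rule (`n_decoded < n_predict`), and the `emit_tokens` driver are hypothetical and much simpler than the real server code. The sketch only shows that if the counter used by the limit check lags the emitted tokens by one, one extra token gets through.

```cpp
#include <cassert>

// Simplified stand-in for the server slot. Field names mirror the snippet
// above, but the budget rule is an assumption.
struct Slot {
    int  n_decoded      = 0;
    int  n_predict      = 0;
    bool has_next_token = true;

    bool has_budget() const { return n_decoded < n_predict; }
};

// Emits tokens until the limit check fires and returns how many were sent.
// If count_first_token is false, the first token is left out of n_decoded,
// as a TPS-motivated change to the counter might do.
static int emit_tokens(int n_predict, bool count_first_token) {
    assert(n_predict >= 0);

    Slot slot;
    slot.n_predict = n_predict;

    int n_emitted = 0;
    while (slot.has_next_token) {
        n_emitted++; // a token is sampled and sent to the client
        if (count_first_token || n_emitted > 1) {
            slot.n_decoded++;
        }

        // check the limits (same shape as the condition quoted above)
        if (slot.n_decoded > 0 && slot.has_next_token && !slot.has_budget()) {
            slot.has_next_token = false;
        }
    }
    return n_emitted;
}
```

Under these assumptions, `emit_tokens(4, true)` returns 4 and `emit_tokens(4, false)` returns 5. Keeping a separate counter for the TPS calculation, rather than changing the one used by the limit check, would be one way to avoid this coupling.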
