[https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path by tensorrt-cicd · Pull Request #14910 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-03T14:55:02Z

Summary

Root cause: In the disaggregated gen-first overlap path, both the context and generation requests run concurrently, so the context worker's usage accounting cannot be propagated to the generation worker as it is in the context-first path. The generation worker treats the entire transferred prompt as cached, over-reporting cached_tokens as the full prompt length (17) instead of the actual reused KV blocks (prompt_tokens-1 = 16). This caused test_disaggregated_overlap_gen_first to fail its cache-reuse usage check.
Fix: After gathering the concurrent context and generation responses, the gen-first path now rewrites the generation response's usage from the context worker's accounting via rewrite_usage_response_from_ctx, mirroring the context-first behavior. The test configs were also updated to enable block reuse and partial reuse so the reuse path is exercised, and the corresponding waives were removed since the test now passes.
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6223556

Summary by CodeRabbit

Bug Fixes
- Corrected generation usage accounting in disaggregated generation-first inference mode
Tests
- Removed test waivers for disaggregated overlap generation-first configurations
- Enabled KV cache block and partial reuse features in disaggregated test configurations

coderabbitai · 2026-06-03T15:02:37Z

📝 Walkthrough

Walkthrough

This PR modifies the generation-first disaggregation service to correctly adjust usage accounting by rewriting generation response usage from the context response, updates two test configuration files to enable cache reuse features (block and partial reuse), and removes two test waivers to enable previously skipped overlap tests.

Changes

Disaggregated generation-first flow improvements

Layer / File(s)	Summary
Usage rewriting in generation-first disaggregation `tensorrt_llm/serve/openai_disagg_service.py`	Imports `rewrite_usage_response_from_ctx` and applies it to adjust the generation response usage from the context response usage in the concurrent request path when context is required, replacing the raw `responses[-1]` return.
Test config cache reuse settings `tests/integration/defs/disaggregated/test_configs/disagg_config_overlap_gen_first.yaml`, `tests/integration/defs/disaggregated/test_configs/disagg_config_overlap_gen_first_pp4.yaml`	Both `context_servers` and `generation_servers` KV cache configs are updated to set `enable_block_reuse: True` and `enable_partial_reuse: True` for cache reuse in disaggregated overlap tests.
Enable disaggregated overlap tests `tests/integration/test_lists/waives.txt`	Removes two test waiver entries for `test_disaggregated_overlap_gen_first` with TinyLlama-1.1B-Chat-v1.0 in `ctx_pp1` and `ctx_pp4` configurations, allowing those tests to run.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14791: Directly conflicts on the same waiver entries in tests/integration/test_lists/waives.txt for test_disaggregated_overlap_gen_first.
NVIDIA/TensorRT-LLM#14787: Also modifies waiver entries for the same test_disaggregated_overlap_gen_first test cases with TinyLlama-1.1B-Chat-v1.0.

Suggested reviewers

jieli-matrix
LarryXFly
StanleySun639
xinhe-nv
crazydemo

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description check	✅ Passed	The PR description provides clear context on the root cause, the implemented fix, test verification steps, and relevant links, addressing the template requirements adequately.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately identifies the main fix: correcting cached_tokens accounting in the disaggregated gen-first overlap path, which is the core change across all modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ting In generation-first disaggregated scheduling the context and generation requests run concurrently, so the context worker's usage cannot be propagated to the generation worker through the request (as done in the context-first path). The generation worker then reports the entire transferred prompt as cached, over-reporting cached_tokens. Adopt the context worker's usage on the final response, mirroring the context-first propagation. Also enable KV block reuse in the overlap_gen_first test configs so the cache-reuse usage check is meaningful, and remove the related waivers. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

Shixiaowei02 · 2026-06-10T09:32:52Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-10T09:38:37Z

PR_Github #53292 [ run ] triggered by Bot. Commit: 953ffd2 Link to invocation

tensorrt-cicd requested review from a team as code owners June 3, 2026 14:55

tensorrt-cicd requested a review from hchings June 3, 2026 14:55

tensorrt-cicd assigned Shixiaowei02 Jun 3, 2026

github-actions Bot assigned tensorrt-cicd Jun 3, 2026

xinhe-nv approved these changes Jun 4, 2026

View reviewed changes

Shixiaowei02 force-pushed the repair-bot-bug6223556 branch from 38c081a to 953ffd2 Compare June 10, 2026 09:31

Shixiaowei02 requested a review from reasonsolo June 10, 2026 09:43

Shixiaowei02 changed the title ~~[https://nvbugs/6223556][fix] [https://nvbugs/6223556][fix] Correct cached_tokens accounting in disagg…~~ [https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path#14910

[https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path#14910
tensorrt-cicd wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6223556

tensorrt-cicd commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

Shixiaowei02 commented Jun 10, 2026

Uh oh!

tensorrt-cicd commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tensorrt-cicd commented Jun 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Links

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

Shixiaowei02 commented Jun 10, 2026

Uh oh!

tensorrt-cicd commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tensorrt-cicd commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading