Skip to content

[https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path#14910

Open
tensorrt-cicd wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6223556
Open

[https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path#14910
tensorrt-cicd wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6223556

Conversation

@tensorrt-cicd

@tensorrt-cicd tensorrt-cicd commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: In the disaggregated gen-first overlap path, both the context and generation requests run concurrently, so the context worker's usage accounting cannot be propagated to the generation worker as it is in the context-first path. The generation worker treats the entire transferred prompt as cached, over-reporting cached_tokens as the full prompt length (17) instead of the actual reused KV blocks (prompt_tokens-1 = 16). This caused test_disaggregated_overlap_gen_first to fail its cache-reuse usage check.
  • Fix: After gathering the concurrent context and generation responses, the gen-first path now rewrites the generation response's usage from the context worker's accounting via rewrite_usage_response_from_ctx, mirroring the context-first behavior. The test configs were also updated to enable block reuse and partial reuse so the reuse path is exercised, and the corresponding waives were removed since the test now passes.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Corrected generation usage accounting in disaggregated generation-first inference mode
  • Tests

    • Removed test waivers for disaggregated overlap generation-first configurations
    • Enabled KV cache block and partial reuse features in disaggregated test configurations

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR modifies the generation-first disaggregation service to correctly adjust usage accounting by rewriting generation response usage from the context response, updates two test configuration files to enable cache reuse features (block and partial reuse), and removes two test waivers to enable previously skipped overlap tests.

Changes

Disaggregated generation-first flow improvements

Layer / File(s) Summary
Usage rewriting in generation-first disaggregation
tensorrt_llm/serve/openai_disagg_service.py
Imports rewrite_usage_response_from_ctx and applies it to adjust the generation response usage from the context response usage in the concurrent request path when context is required, replacing the raw responses[-1] return.
Test config cache reuse settings
tests/integration/defs/disaggregated/test_configs/disagg_config_overlap_gen_first.yaml, tests/integration/defs/disaggregated/test_configs/disagg_config_overlap_gen_first_pp4.yaml
Both context_servers and generation_servers KV cache configs are updated to set enable_block_reuse: True and enable_partial_reuse: True for cache reuse in disaggregated overlap tests.
Enable disaggregated overlap tests
tests/integration/test_lists/waives.txt
Removes two test waiver entries for test_disaggregated_overlap_gen_first with TinyLlama-1.1B-Chat-v1.0 in ctx_pp1 and ctx_pp4 configurations, allowing those tests to run.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14791: Directly conflicts on the same waiver entries in tests/integration/test_lists/waives.txt for test_disaggregated_overlap_gen_first.
  • NVIDIA/TensorRT-LLM#14787: Also modifies waiver entries for the same test_disaggregated_overlap_gen_first test cases with TinyLlama-1.1B-Chat-v1.0.

Suggested reviewers

  • jieli-matrix
  • LarryXFly
  • StanleySun639
  • xinhe-nv
  • crazydemo
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description check ✅ Passed The PR description provides clear context on the root cause, the implemented fix, test verification steps, and relevant links, addressing the template requirements adequately.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately identifies the main fix: correcting cached_tokens accounting in the disaggregated gen-first overlap path, which is the core change across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

…ting

In generation-first disaggregated scheduling the context and generation
requests run concurrently, so the context worker's usage cannot be
propagated to the generation worker through the request (as done in the
context-first path). The generation worker then reports the entire
transferred prompt as cached, over-reporting cached_tokens. Adopt the
context worker's usage on the final response, mirroring the context-first
propagation. Also enable KV block reuse in the overlap_gen_first test
configs so the cache-reuse usage check is meaningful, and remove the
related waivers.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@Shixiaowei02 Shixiaowei02 force-pushed the repair-bot-bug6223556 branch from 38c081a to 953ffd2 Compare June 10, 2026 09:31
@Shixiaowei02

Copy link
Copy Markdown
Collaborator

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator Author

PR_Github #53292 [ run ] triggered by Bot. Commit: 953ffd2 Link to invocation

@Shixiaowei02 Shixiaowei02 requested a review from reasonsolo June 10, 2026 09:43
@Shixiaowei02 Shixiaowei02 changed the title [https://nvbugs/6223556][fix] [https://nvbugs/6223556][fix] Correct cached_tokens accounting in disagg… [https://nvbugs/6223556][fix] Correct cached_tokens accounting in dis-agg gen-first overlap path Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants