Skip to content

[None][fix] Support V2 KV cache beam search#15197

Open
yizhang-nv wants to merge 2 commits into
NVIDIA:mainfrom
yizhang-nv:yizhan/kvcm-v2-beam-reuse-main
Open

[None][fix] Support V2 KV cache beam search#15197
yizhang-nv wants to merge 2 commits into
NVIDIA:mainfrom
yizhang-nv:yizhan/kvcm-v2-beam-reuse-main

Conversation

@yizhang-nv

@yizhang-nv yizhang-nv commented Jun 10, 2026

Copy link
Copy Markdown
Member

Summary by CodeRabbit

Release Notes

  • New Features

    • KVCacheManagerV2 now supports beam search with configurable beam widths at generation time, removing the previous beam_width=1 restriction for broader compatibility.
  • Tests

    • Added integration and unit tests validating beam search functionality with the v2 KV cache manager, including scenarios with block reuse and partial token matching.

Description

Enable KV cache manager v2 to support beam-search expansion during generation. The change materializes page-index buffers for expanded beams, copies the live partial-tail/SSM page state from beam 0 for new beams, and keeps historical full blocks unmapped for expanded beams. It also disables partial reuse when V2 is configured with multi-beam capacity.

This keeps V2 selected for max_beam_width > 1 instead of falling back to the V1 KV cache manager. The runtime commit() path still only accepts beam_search_indices is None, but it no longer rejects a cache whose layout has already been expanded for beam search by overlap scheduling. Published blocks are treated as beam0-canonical for reuse, and non-default beam pages are cleared when a block is committed.

This adds targeted KV cache manager v2 unit coverage and LLM API accuracy coverage for V2 beam search.

Test Coverage

  • Syntax/diff checks passed.
  • Targeted V2 beam unit tests passed.
  • V2 beam LLM API and V1/V2 integration checks passed.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yizhang-nv yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch 2 times, most recently from 31a7221 to 86f1871 Compare June 10, 2026 05:58
@yizhang-nv yizhang-nv changed the title [None][fix] Support V2 KV cache beam reuse [None][fix] Support V2 KV cache beam search Jun 10, 2026
@yizhang-nv yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch 3 times, most recently from 109c0dc to 7be6aa3 Compare June 10, 2026 10:08
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch from 7be6aa3 to 9124618 Compare June 11, 2026 03:23
@yizhang-nv

Copy link
Copy Markdown
Member Author

/bot run

@yizhang-nv yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch from aeba2d9 to 431829c Compare June 11, 2026 04:54
@yizhang-nv

Copy link
Copy Markdown
Member Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53478 [ run ] triggered by Bot. Commit: 431829c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53479 [ run ] triggered by Bot. Commit: 431829c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53478 [ run ] completed with state ABORTED. Commit: 431829c

Link to invocation

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch from 431829c to 7c3dc1d Compare June 11, 2026 05:15
@yizhang-nv

Copy link
Copy Markdown
Member Author

/bot run

@yizhang-nv yizhang-nv marked this pull request as ready for review June 11, 2026 05:16
@yizhang-nv yizhang-nv requested review from a team as code owners June 11, 2026 05:16
@yizhang-nv yizhang-nv requested a review from joyang-nv June 11, 2026 05:16
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53484 [ run ] triggered by Bot. Commit: 7c3dc1d Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53479 [ run ] completed with state ABORTED. Commit: 431829c

Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR extends TensorRT-LLM's KVCacheManagerV2 to support generation-time beam widths up to max_beam_width instead of requiring beam_width == 1. Key changes include removing fallback constraints, implementing dynamic beam-width expansion via GPU page cloning, gating partial-reuse to single-beam scenarios, and adding comprehensive integration and unit tests to validate multi-beam beam-search decoding with the V2 cache manager.

Changes

KVCacheManagerV2 Multi-Beam Generation Support

Layer / File(s) Summary
Fallback constraint removal in executor
tensorrt_llm/_torch/pyexecutor/_util.py
Removes max_beam_width > 1 from the fallback decision in both main and draft KV-cache paths; updates warning/error messages to reflect broadened V2 manager compatibility.
KVCacheManagerV2 initialization for multi-beam
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Removes max_beam_width == 1 constructor restriction; introduces _set_page_index_bufs() helper for page-buffer reconnection based on actual kv_cache.beam_width; gates enable_partial_reuse to single-beam only; wires _ensure_generation_beam_width() into allocation, draft preparation, and dummy-request paths to enforce beam-width expansion at generation time.
KVCache dynamic beam-width expansion
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
Implements beam_width setter for increase/decrease at runtime; adds _append_beams() to allocate GPU slots and clone beam-0 pages into new beams (skipping history blocks) with SSM state extension; provides helper methods for block boundary and history-cursor calculations.
Block matching and commit behavior updates
tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py, tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
Early-stops partial-match emission when enable_partial_match is disabled; removes beam-search NotImplementedError from commit(); gates stop_committing() partial-block logic on enable_partial_match flag; relaxes _check_sanity() to tolerate missing holders for non-default beams before history cursor.
Integration tests for v2 beam search
tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py
Adds test_v2_beam_search accuracy test with V2 cache manager enabled; enhances V1/V2 comparison with new _assert_outputs_match helper and assert_token_ids_match flag for multi-beam token-id parity validation.
Unit tests and test configuration
tests/unittest/_torch/executor/test_overlap_scheduler.py, tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/test-db/l0_b200.yml, tests/integration/test_lists/test-db/l0_h100.yml
Refactors create_llm to wire V2 cache manager and beam-width configuration; adds test_overlap_scheduler_v2_beam_search_block_reuse_cache_hit to validate block reuse with multi-beam V2 manager; introduces TestPartialReuseGuard with test_partial_reuse_gate to verify partial-reuse gating; updates test lists with new beam-search entries.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

api-compatible

Suggested reviewers

  • venkywonka
  • ruodil
  • thorjohnsen
  • pcastonguay
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 24.53% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[None][fix] Support V2 KV cache beam search' accurately summarizes the main change: enabling KV cache manager v2 to support beam-search operations during generation, which is the core objective of this PR.
Description check ✅ Passed The PR description provides a clear explanation of the changes (enabling V2 beam search, materializing buffers, disabling partial reuse for multi-beam), lists test coverage (unit tests, LLM API, V1/V2 integration), and includes the completed PR checklist with all items appropriately reviewed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py (1)

144-149: ⚡ Quick win

Add strict=True to zip() calls for safer iteration.

In Python 3.10+, zip(..., strict=True) raises ValueError if the iterables have different lengths, catching bugs where output counts or beam counts unexpectedly diverge.

🔒 Proposed fix to add strict parameter
-    for i, (oa, ob) in enumerate(zip(outputs_a, outputs_b)):
+    for i, (oa, ob) in enumerate(zip(outputs_a, outputs_b, strict=True)):
         assert len(oa.outputs) == len(ob.outputs), (
             f"Prompt {i}: output beam count mismatch: "
             f"{label_a}={len(oa.outputs)}, {label_b}={len(ob.outputs)}"
         )
-        for beam_idx, (beam_a, beam_b) in enumerate(zip(oa.outputs, ob.outputs)):
+        for beam_idx, (beam_a, beam_b) in enumerate(zip(oa.outputs, ob.outputs, strict=True)):
             if assert_token_ids_match:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py` around lines
144 - 149, The test iterates paired sequences with zip() but doesn't detect
length mismatches; update the two zip calls in the loop that compare outputs_a
with outputs_b and oa.outputs with ob.outputs to use zip(..., strict=True) so a
ValueError is raised if lengths differ, referencing the variables outputs_a,
outputs_b and the inner beam iteration over oa.outputs and ob.outputs (the
enumerate(zip(...)) loops).
tensorrt_llm/_torch/pyexecutor/_util.py (1)

889-891: 💤 Low value

Consider aligning the draft fallback warning with the main model warning for consistency.

The draft model fallback warning uses different terminology ("disaggregated serving, event buffer max size > 0, or disagg config") compared to the main model warning at line 270 ("kv_connector_manager, event buffer max size > 0, or cache transceiver"). Since both check the same conditions, using consistent field-name references would improve clarity and reduce confusion about whether "disaggregated serving" and "disagg config" refer to the same or different conditions.

♻️ Suggested alignment
         logger.warning(
-            "KVCacheManagerV2 is not supported with disaggregated serving, event buffer max size > 0, or disagg config. "
+            "KVCacheManagerV2 is not supported with kv_connector_manager, event buffer max size > 0, or cache transceiver. "
             "Falling back to KVCacheManager for draft model.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 889 - 891, The
draft-fallback logger.warning message in KVCacheManagerV2 should use the same
terminology as the main model warning to avoid confusion: update the warning
emitted in the KVCacheManagerV2 fallback (the logger.warning call that currently
says "disaggregated serving, event buffer max size > 0, or disagg config") to
match the main model wording ("kv_connector_manager, event buffer max size > 0,
or cache transceiver") so both checks report the same field names and meaning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 2968-2981: In _ensure_generation_beam_width change the mutation
order so we don't permanently widen kv_cache if a later resize fails: either
call the resize operation on kv_cache (e.g., kv_cache.resize/whatever method
used by try_allocate_generation/_prepare_draft_resources) before assigning
kv_cache.beam_width, or wrap the assignment in a try/except and revert
kv_cache.beam_width to its original value if the subsequent resize raises (or if
OutOfPagesError occurs); ensure _set_page_index_bufs(req.py_request_id,
kv_cache) is only called after successful resize and keep references to the
original beam width to restore on failure so try_allocate_generation and
_prepare_draft_resources cannot leave the cache widened under memory pressure.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`:
- Around line 719-724: The commit() change removed the multi-beam guard and can
call _commit_block() while caches may have been expanded to multiple beam pages;
either collapse non-default beam pages into the default beam before calling
_commit_block() or reintroduce the single-beam guard to prevent multi-beam
commits until _commit_block() and SeqBlock.is_committed are made
multi-beam-safe. Specifically, in commit(), detect if any cache/block has width
> 1 (or if beam_search_indices is not None) and either (a) merge/drop pages so
len(self.pages)==1 for each SeqBlock before invoking _commit_block(), or (b)
restore the previous rejection behavior (raise/assert) to block multi-beam
commits; update references to _commit_block() and SeqBlock.is_committed
accordingly.
- Around line 1397-1421: The loop currently skips appending new beam rows for
historical blocks because of the early continue in the for ordinal, block in
typed_enumerate(self._blocks) loop; change the logic so that for each beam index
in typed_range(old_beam_width, new_beam_width) you always create new_beam_block
= filled_list(cast(BlockPage, None), num_life_cycles) and append it to
block.pages, but only perform slot allocation/copy/locking when
_is_block_before_history_cursor(ordinal) is False; i.e., move
block.pages.append(new_beam_block) outside the branch (or append an all-None row
when block_before_history_cursor is True) and keep the GPU copy path
(source_lock, _copy_gpu_page_storage_to_slot, UncommittedPage creation, etc.)
guarded by the existing block_before_history_cursor check so shape is preserved
for historical blocks.

---

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 889-891: The draft-fallback logger.warning message in
KVCacheManagerV2 should use the same terminology as the main model warning to
avoid confusion: update the warning emitted in the KVCacheManagerV2 fallback
(the logger.warning call that currently says "disaggregated serving, event
buffer max size > 0, or disagg config") to match the main model wording
("kv_connector_manager, event buffer max size > 0, or cache transceiver") so
both checks report the same field names and meaning.

In `@tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py`:
- Around line 144-149: The test iterates paired sequences with zip() but doesn't
detect length mismatches; update the two zip calls in the loop that compare
outputs_a with outputs_b and oa.outputs with ob.outputs to use zip(...,
strict=True) so a ValueError is raised if lengths differ, referencing the
variables outputs_a, outputs_b and the inner beam iteration over oa.outputs and
ob.outputs (the enumerate(zip(...)) loops).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77943b18-27f2-49e1-aa65-b3d3c86ac368

📥 Commits

Reviewing files that changed from the base of the PR and between 016fb4c and 7c3dc1d.

📒 Files selected for processing (11)
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/test-db/l0_b200.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/unittest/_torch/executor/test_overlap_scheduler.py
  • tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

Comment on lines +2968 to +2981
def _ensure_generation_beam_width(self, req: LlmRequest,
kv_cache: _KVCache) -> bool:
target_beam_width = BeamIndex(req.py_beam_width)
assert 1 <= target_beam_width <= self.max_beam_width
if kv_cache.beam_width == target_beam_width:
return True

try:
kv_cache.beam_width = target_beam_width
except OutOfPagesError:
return False

self._set_page_index_bufs(req.py_request_id, kv_cache)
return True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Roll back beam expansion if the later resize fails.

kv_cache.beam_width is mutated here before the caller's kv_cache.resize(...). In try_allocate_generation() and _prepare_draft_resources(), a failed resize now leaves the cache widened and still holding the cloned beam pages even though the scheduling attempt failed. Under memory pressure that makes the failure sticky: later iterations see fewer free pages than they should, and the request can keep starving the pool until it is explicitly evicted or freed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2968 - 2981,
In _ensure_generation_beam_width change the mutation order so we don't
permanently widen kv_cache if a later resize fails: either call the resize
operation on kv_cache (e.g., kv_cache.resize/whatever method used by
try_allocate_generation/_prepare_draft_resources) before assigning
kv_cache.beam_width, or wrap the assignment in a try/except and revert
kv_cache.beam_width to its original value if the subsequent resize raises (or if
OutOfPagesError occurs); ensure _set_page_index_bufs(req.py_request_id,
kv_cache) is only called after successful resize and keep references to the
original beam width to restore on failure so try_allocate_generation and
_prepare_draft_resources cannot leave the cache widened under memory pressure.

Comment on lines 719 to +724
def commit(
self,
accepted_input_tokens: Sequence[TokenIdExt],
beam_search_indices: Sequence[int] | None = None,
):
if self.beam_width != 1:
raise NotImplementedError("Not implemented yet for beam search")
assert beam_search_indices is None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

commit() now reaches a still-single-beam publication path.

Removing the multi-beam rejection here lets Line 738 call _commit_block() on caches that were expanded above width 1, but _commit_block() still asserts single-beam state at Line 1116 and SeqBlock.is_committed still assumes len(self.pages) == 1 at Line 105. The first full-block commit after beam expansion will either assert or publish beam 0 while leaving the extra beam pages attached to the same block.

Either collapse/drop non-default beam pages before committing, or keep the old guard until _commit_block() is made multi-beam-safe.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 719
- 724, The commit() change removed the multi-beam guard and can call
_commit_block() while caches may have been expanded to multiple beam pages;
either collapse non-default beam pages into the default beam before calling
_commit_block() or reintroduce the single-beam guard to prevent multi-beam
commits until _commit_block() and SeqBlock.is_committed are made
multi-beam-safe. Specifically, in commit(), detect if any cache/block has width
> 1 (or if beam_search_indices is not None) and either (a) merge/drop pages so
len(self.pages)==1 for each SeqBlock before invoking _commit_block(), or (b)
restore the previous rejection behavior (raise/assert) to block multi-beam
commits; update references to _commit_block() and SeqBlock.is_committed
accordingly.

Comment on lines +1397 to +1421
for ordinal, block in typed_enumerate(self._blocks):
source_beam_block = block.pages[DEFAULT_BEAM_INDEX]
block_before_history_cursor = self._is_block_before_history_cursor(ordinal)
if block_before_history_cursor:
continue
for beam_idx in typed_range(old_beam_width, new_beam_width):
new_beam_block: TypedIndexList[LifeCycleId, BlockPage] = filled_list(
cast(BlockPage, None), num_life_cycles
)
for lc, source_holder in typed_enumerate(source_beam_block):
if source_holder is None:
continue

source_page = source_holder.page
slot = new_slots[lc].pop()
# Beam expansion copies existing non-historical GPU pages into new beam pages.
# Fully historical blocks remain unmapped for new beams.
assert source_page.cache_level == GPU_LEVEL
source_lock = source_page.lock(self, beam_idx, ordinal, lc, skip_wait=False)
self._copy_gpu_page_storage_to_slot(source_lock.page, slot)
with self._record_event():
source_lock.unlock()
new_page = UncommittedPage(self, ordinal, lc, GPU_LEVEL, slot, beam_idx)
new_beam_block[lc] = new_page.lock(self, beam_idx, ordinal, lc, skip_wait=True)
block.pages.append(new_beam_block)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Append placeholder beam rows for historical blocks too.

The continue at Lines 1400-1401 skips block.pages.append(new_beam_block) entirely, so after self._beam_width is raised some historical blocks still have only old_beam_width rows. Later indexed access like b.pages[beam_id] or _block(ordinal, beam_idx) can then walk past the list for non-default beams. The new _check_sanity() exception at Lines 1560-1565 only tolerates missing holders inside an existing row; it does not cover missing rows.

A safe pattern here is to append an all-None beam row for every new beam on every block, and only skip the slot allocation/copy work for pre-history ordinals.

Suggested shape-preserving fix
         for ordinal, block in typed_enumerate(self._blocks):
             source_beam_block = block.pages[DEFAULT_BEAM_INDEX]
-            block_before_history_cursor = self._is_block_before_history_cursor(ordinal)
-            if block_before_history_cursor:
-                continue
             for beam_idx in typed_range(old_beam_width, new_beam_width):
                 new_beam_block: TypedIndexList[LifeCycleId, BlockPage] = filled_list(
                     cast(BlockPage, None), num_life_cycles
                 )
+                if self._is_block_before_history_cursor(ordinal):
+                    block.pages.append(new_beam_block)
+                    continue
                 for lc, source_holder in typed_enumerate(source_beam_block):
                     if source_holder is None:
                         continue
                     ...
                 block.pages.append(new_beam_block)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines
1397 - 1421, The loop currently skips appending new beam rows for historical
blocks because of the early continue in the for ordinal, block in
typed_enumerate(self._blocks) loop; change the logic so that for each beam index
in typed_range(old_beam_width, new_beam_width) you always create new_beam_block
= filled_list(cast(BlockPage, None), num_life_cycles) and append it to
block.pages, but only perform slot allocation/copy/locking when
_is_block_before_history_cursor(ordinal) is False; i.e., move
block.pages.append(new_beam_block) outside the branch (or append an all-None row
when block_before_history_cursor is True) and keep the GPU copy path
(source_lock, _copy_gpu_page_storage_to_slot, UncommittedPage creation, etc.)
guarded by the existing block_before_history_cursor check so shape is preserved
for historical blocks.

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53484 [ run ] completed with state ABORTED. Commit: 7c3dc1d

Link to invocation

@yizhang-nv

Copy link
Copy Markdown
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53514 [ run ] triggered by Bot. Commit: 7c3dc1d Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53514 [ run ] completed with state SUCCESS. Commit: 7c3dc1d
/LLM/main/L0_MergeRequest_PR pipeline #42672 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@fredricz-20070104

Copy link
Copy Markdown
Collaborator

Thanks for picking this up — V2 multi-beam has a lot of edges. Three correctness issues I think need to be addressed before merge:

  1. _commit_block still asserts single-beam. commit() no longer rejects max_beam_width > 1, but the downstream _commit_block in _kv_cache.py still has assert typed_len(seq_block.pages) == 1, "Must have 1 beam only". After beam expansion, the first full-block commit will hit this assert. If the intent (per the PR description) is for multi-beam caches to flow through commit(), this assertion needs to be relaxed and the non-beam-0 pages need to be released — consistent with the "publish beam 0 as canonical, drop the rest" policy already in commit(). Otherwise we'd need a guarantee that commit() always runs before beam expansion, and I don't see that guarantee in the current code.

  2. _append_beams skips historical blocks without padding block.pages. When _is_block_before_history_cursor(ordinal) is true the loop continues without appending a placeholder. After self._beam_width is bumped to new_bw, those historical blocks still have len(block.pages) == old_bw, so any later access via b.pages[beam_idx] (including _block(ordinal, beam_idx)) will IndexError. _check_sanity only tolerates holder=None, not missing rows. Two options: (a) append a placeholder _BeamBlock(holder=None) to historical blocks so pages stays rectangular, or (b) make every pages[beam_idx] access tolerate sparsity. (a) is the smaller, safer change.

  3. _ensure_generation_beam_width is not rolled back on caller failure. In try_allocate_generation / _prepare_draft_resources, beam width is bumped and pages are cloned before the caller's kv_cache.resize(...). If resize fails (OOM, etc.), the cache is left in a "already expanded, holding cloned pages" state and scheduling will keep retrying / failing. Either roll back the beam-width bump and cloned pages on caller resize failure, or move the resize inside _ensure_generation_beam_width so the whole expansion is transactional.

@fredricz-20070104

Copy link
Copy Markdown
Collaborator

Thanks for taking this on — V2 multi-beam has a lot of subtle edges. I see three correctness blockers (matching the CodeRabbit Critical/Major findings) that should be resolved before merge:

1. (Critical) _commit_block still hard-asserts single beam.
commit() no longer rejects max_beam_width > 1, but _kv_cache.py:1116 still has:

assert typed_len(seq_block.pages) == 1, "Must have 1 beam only"

So the first full-block commit after a beam-width expansion will fail. If the design intent (per the PR description) is to allow a multi-beam cache to flow through commit(), this assertion needs to be relaxed and non-default beam pages need to be released (treating beam 0 as canonical, consistent with what commit() already does for the beam_search_indices argument). Otherwise the invariant "commit() always runs before beam expansion" needs to be made explicit — I don't see that guarantee in the current code.

2. (Major) _append_beams leaves historical blocks' pages array shorter than the new beam width.
At _kv_cache.py:1397-1401, when _is_block_before_history_cursor(ordinal) is true the loop continues without appending a placeholder to block.pages. After self._beam_width is promoted to new_beam_width, those historical blocks still have len(block.pages) == old_beam_width. Any subsequent access via block.pages[beam_idx] (e.g. _block(ordinal, beam_idx)) will IndexError, and _check_sanity does not currently allow a pages row to be entirely missing — it only allows holder=None.

Two ways to fix this:

  • (a) Append a placeholder _BeamBlock (holder=None) for historical blocks too, so pages length stays consistent.
  • (b) Make every pages[beam_idx] access tolerant of a sparse array.

(a) is smaller and safer.

3. (Major) _ensure_generation_beam_width does not roll back if a later resize fails.
In resource_manager.py:2896-2902 (and similarly in _prepare_draft_resources / the draft path), _ensure_generation_beam_width is called first — it promotes beam_width and clones pages — and then kv_cache.resize(...) is called. If the resize fails (e.g. OOM), the cache is left in a half-expanded state with cloned pages held, and the scheduler will fail the same request indefinitely. Either roll back beam_width and the cloned pages when the caller's resize fails, or move the resize inside _ensure_generation_beam_width and make the whole thing transactional.

Once 1–3 are addressed I think this is ready to merge. Non-blocker side note for future cleanup: enable_partial_reuse = config.enable_partial_reuse and max_beam_width == 1 is gated on the configured max_beam_width, so a single-beam request running under a multi-beam configuration cannot benefit from partial reuse. V1 gates this on the cache's current beam width. Worth converging at some point but not a merge blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants