[None][fix] Support V2 KV cache beam search by yizhang-nv · Pull Request #15197 · NVIDIA/TensorRT-LLM

yizhang-nv · 2026-06-10T04:12:48Z

Summary by CodeRabbit

Release Notes

New Features
- KVCacheManagerV2 now supports beam search with configurable beam widths at generation time, removing the previous beam_width=1 restriction for broader compatibility.
Tests
- Added integration and unit tests validating beam search functionality with the v2 KV cache manager, including scenarios with block reuse and partial token matching.

Description

Enable KV cache manager v2 to support beam-search expansion during generation. The change materializes page-index buffers for expanded beams, copies the live partial-tail/SSM page state from beam 0 for new beams, and keeps historical full blocks unmapped for expanded beams. It also disables partial reuse when V2 is configured with multi-beam capacity.

This keeps V2 selected for max_beam_width > 1 instead of falling back to the V1 KV cache manager. The runtime commit() path still only accepts beam_search_indices is None, but it no longer rejects a cache whose layout has already been expanded for beam search by overlap scheduling. Published blocks are treated as beam0-canonical for reuse, and non-default beam pages are cleared when a block is committed.

This adds targeted KV cache manager v2 unit coverage and LLM API accuracy coverage for V2 beam search.

Test Coverage

Syntax/diff checks passed.
Targeted V2 beam unit tests passed.
V2 beam LLM API and V1/V2 integration checks passed.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

yizhang-nv · 2026-06-11T04:51:33Z

/bot run

yizhang-nv · 2026-06-11T04:54:18Z

/bot run

tensorrt-cicd · 2026-06-11T04:58:03Z

PR_Github #53478 [ run ] triggered by Bot. Commit: 431829c Link to invocation

tensorrt-cicd · 2026-06-11T04:59:42Z

PR_Github #53479 [ run ] triggered by Bot. Commit: 431829c Link to invocation

tensorrt-cicd · 2026-06-11T05:04:13Z

PR_Github #53478 [ run ] completed with state ABORTED. Commit: 431829c

Link to invocation

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

yizhang-nv · 2026-06-11T05:15:31Z

/bot run

tensorrt-cicd · 2026-06-11T05:21:55Z

PR_Github #53484 [ run ] triggered by Bot. Commit: 7c3dc1d Link to invocation

tensorrt-cicd · 2026-06-11T05:25:35Z

PR_Github #53479 [ run ] completed with state ABORTED. Commit: 431829c

Link to invocation

coderabbitai · 2026-06-11T05:31:27Z

📝 Walkthrough

Walkthrough

This PR extends TensorRT-LLM's KVCacheManagerV2 to support generation-time beam widths up to max_beam_width instead of requiring beam_width == 1. Key changes include removing fallback constraints, implementing dynamic beam-width expansion via GPU page cloning, gating partial-reuse to single-beam scenarios, and adding comprehensive integration and unit tests to validate multi-beam beam-search decoding with the V2 cache manager.

Changes

KVCacheManagerV2 Multi-Beam Generation Support

Layer / File(s)	Summary
Fallback constraint removal in executor `tensorrt_llm/_torch/pyexecutor/_util.py`	Removes `max_beam_width > 1` from the fallback decision in both main and draft KV-cache paths; updates warning/error messages to reflect broadened V2 manager compatibility.
KVCacheManagerV2 initialization for multi-beam `tensorrt_llm/_torch/pyexecutor/resource_manager.py`	Removes `max_beam_width == 1` constructor restriction; introduces `_set_page_index_bufs()` helper for page-buffer reconnection based on actual `kv_cache.beam_width`; gates `enable_partial_reuse` to single-beam only; wires `_ensure_generation_beam_width()` into allocation, draft preparation, and dummy-request paths to enforce beam-width expansion at generation time.
KVCache dynamic beam-width expansion `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`	Implements `beam_width` setter for increase/decrease at runtime; adds `_append_beams()` to allocate GPU slots and clone beam-0 pages into new beams (skipping history blocks) with SSM state extension; provides helper methods for block boundary and history-cursor calculations.
Block matching and commit behavior updates `tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py`, `tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`	Early-stops partial-match emission when `enable_partial_match` is disabled; removes beam-search `NotImplementedError` from `commit()`; gates `stop_committing()` partial-block logic on `enable_partial_match` flag; relaxes `_check_sanity()` to tolerate missing holders for non-default beams before history cursor.
Integration tests for v2 beam search `tests/integration/defs/accuracy/test_llm_api_pytorch.py`, `tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py`	Adds `test_v2_beam_search` accuracy test with V2 cache manager enabled; enhances V1/V2 comparison with new `_assert_outputs_match` helper and `assert_token_ids_match` flag for multi-beam token-id parity validation.
Unit tests and test configuration `tests/unittest/_torch/executor/test_overlap_scheduler.py`, `tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py`, `tests/integration/test_lists/qa/llm_function_core.txt`, `tests/integration/test_lists/test-db/l0_b200.yml`, `tests/integration/test_lists/test-db/l0_h100.yml`	Refactors `create_llm` to wire V2 cache manager and beam-width configuration; adds `test_overlap_scheduler_v2_beam_search_block_reuse_cache_hit` to validate block reuse with multi-beam V2 manager; introduces `TestPartialReuseGuard` with `test_partial_reuse_gate` to verify partial-reuse gating; updates test lists with new beam-search entries.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

api-compatible

Suggested reviewers

venkywonka
ruodil
thorjohnsen
pcastonguay

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 24.53% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[None][fix] Support V2 KV cache beam search' accurately summarizes the main change: enabling KV cache manager v2 to support beam-search operations during generation, which is the core objective of this PR.
Description check	✅ Passed	The PR description provides a clear explanation of the changes (enabling V2 beam search, materializing buffers, disabling partial reuse for multi-beam), lists test coverage (unit tests, LLM API, V1/V2 integration), and includes the completed PR checklist with all items appropriately reviewed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py (1)

144-149: ⚡ Quick win

Add strict=True to zip() calls for safer iteration.

In Python 3.10+, zip(..., strict=True) raises ValueError if the iterables have different lengths, catching bugs where output counts or beam counts unexpectedly diverge.

🔒 Proposed fix to add strict parameter

-    for i, (oa, ob) in enumerate(zip(outputs_a, outputs_b)):
+    for i, (oa, ob) in enumerate(zip(outputs_a, outputs_b, strict=True)):
         assert len(oa.outputs) == len(ob.outputs), (
             f"Prompt {i}: output beam count mismatch: "
             f"{label_a}={len(oa.outputs)}, {label_b}={len(ob.outputs)}"
         )
-        for beam_idx, (beam_a, beam_b) in enumerate(zip(oa.outputs, ob.outputs)):
+        for beam_idx, (beam_a, beam_b) in enumerate(zip(oa.outputs, ob.outputs, strict=True)):
             if assert_token_ids_match:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py` around lines
144 - 149, The test iterates paired sequences with zip() but doesn't detect
length mismatches; update the two zip calls in the loop that compare outputs_a
with outputs_b and oa.outputs with ob.outputs to use zip(..., strict=True) so a
ValueError is raised if lengths differ, referencing the variables outputs_a,
outputs_b and the inner beam iteration over oa.outputs and ob.outputs (the
enumerate(zip(...)) loops).

tensorrt_llm/_torch/pyexecutor/_util.py (1)

889-891: 💤 Low value

Consider aligning the draft fallback warning with the main model warning for consistency.

The draft model fallback warning uses different terminology ("disaggregated serving, event buffer max size > 0, or disagg config") compared to the main model warning at line 270 ("kv_connector_manager, event buffer max size > 0, or cache transceiver"). Since both check the same conditions, using consistent field-name references would improve clarity and reduce confusion about whether "disaggregated serving" and "disagg config" refer to the same or different conditions.
♻️ Suggested alignment
         logger.warning(
-            "KVCacheManagerV2 is not supported with disaggregated serving, event buffer max size > 0, or disagg config. "
+            "KVCacheManagerV2 is not supported with kv_connector_manager, event buffer max size > 0, or cache transceiver. "
             "Falling back to KVCacheManager for draft model.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 889 - 891, The
draft-fallback logger.warning message in KVCacheManagerV2 should use the same
terminology as the main model warning to avoid confusion: update the warning
emitted in the KVCacheManagerV2 fallback (the logger.warning call that currently
says "disaggregated serving, event buffer max size > 0, or disagg config") to
match the main model wording ("kv_connector_manager, event buffer max size > 0,
or cache transceiver") so both checks report the same field names and meaning.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 2968-2981: In _ensure_generation_beam_width change the mutation
order so we don't permanently widen kv_cache if a later resize fails: either
call the resize operation on kv_cache (e.g., kv_cache.resize/whatever method
used by try_allocate_generation/_prepare_draft_resources) before assigning
kv_cache.beam_width, or wrap the assignment in a try/except and revert
kv_cache.beam_width to its original value if the subsequent resize raises (or if
OutOfPagesError occurs); ensure _set_page_index_bufs(req.py_request_id,
kv_cache) is only called after successful resize and keep references to the
original beam width to restore on failure so try_allocate_generation and
_prepare_draft_resources cannot leave the cache widened under memory pressure.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`:
- Around line 719-724: The commit() change removed the multi-beam guard and can
call _commit_block() while caches may have been expanded to multiple beam pages;
either collapse non-default beam pages into the default beam before calling
_commit_block() or reintroduce the single-beam guard to prevent multi-beam
commits until _commit_block() and SeqBlock.is_committed are made
multi-beam-safe. Specifically, in commit(), detect if any cache/block has width
> 1 (or if beam_search_indices is not None) and either (a) merge/drop pages so
len(self.pages)==1 for each SeqBlock before invoking _commit_block(), or (b)
restore the previous rejection behavior (raise/assert) to block multi-beam
commits; update references to _commit_block() and SeqBlock.is_committed
accordingly.
- Around line 1397-1421: The loop currently skips appending new beam rows for
historical blocks because of the early continue in the for ordinal, block in
typed_enumerate(self._blocks) loop; change the logic so that for each beam index
in typed_range(old_beam_width, new_beam_width) you always create new_beam_block
= filled_list(cast(BlockPage, None), num_life_cycles) and append it to
block.pages, but only perform slot allocation/copy/locking when
_is_block_before_history_cursor(ordinal) is False; i.e., move
block.pages.append(new_beam_block) outside the branch (or append an all-None row
when block_before_history_cursor is True) and keep the GPU copy path
(source_lock, _copy_gpu_page_storage_to_slot, UncommittedPage creation, etc.)
guarded by the existing block_before_history_cursor check so shape is preserved
for historical blocks.

---

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 889-891: The draft-fallback logger.warning message in
KVCacheManagerV2 should use the same terminology as the main model warning to
avoid confusion: update the warning emitted in the KVCacheManagerV2 fallback
(the logger.warning call that currently says "disaggregated serving, event
buffer max size > 0, or disagg config") to match the main model wording
("kv_connector_manager, event buffer max size > 0, or cache transceiver") so
both checks report the same field names and meaning.

In `@tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py`:
- Around line 144-149: The test iterates paired sequences with zip() but doesn't
detect length mismatches; update the two zip calls in the loop that compare
outputs_a with outputs_b and oa.outputs with ob.outputs to use zip(...,
strict=True) so a ValueError is raised if lengths differ, referencing the
variables outputs_a, outputs_b and the inner beam iteration over oa.outputs and
ob.outputs (the enumerate(zip(...)) loops).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77943b18-27f2-49e1-aa65-b3d3c86ac368

📥 Commits

Reviewing files that changed from the base of the PR and between 016fb4c and 7c3dc1d.

📒 Files selected for processing (11)

tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py
tests/integration/test_lists/qa/llm_function_core.txt
tests/integration/test_lists/test-db/l0_b200.yml
tests/integration/test_lists/test-db/l0_h100.yml
tests/unittest/_torch/executor/test_overlap_scheduler.py
tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

coderabbitai · 2026-06-11T05:31:30Z

+    def _ensure_generation_beam_width(self, req: LlmRequest,
+                                      kv_cache: _KVCache) -> bool:
+        target_beam_width = BeamIndex(req.py_beam_width)
+        assert 1 <= target_beam_width <= self.max_beam_width
+        if kv_cache.beam_width == target_beam_width:
+            return True
+
+        try:
+            kv_cache.beam_width = target_beam_width
+        except OutOfPagesError:
+            return False
+
+        self._set_page_index_bufs(req.py_request_id, kv_cache)
+        return True


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Roll back beam expansion if the later resize fails.

kv_cache.beam_width is mutated here before the caller's kv_cache.resize(...). In try_allocate_generation() and _prepare_draft_resources(), a failed resize now leaves the cache widened and still holding the cloned beam pages even though the scheduling attempt failed. Under memory pressure that makes the failure sticky: later iterations see fewer free pages than they should, and the request can keep starving the pool until it is explicitly evicted or freed.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2968 - 2981, In _ensure_generation_beam_width change the mutation order so we don't permanently widen kv_cache if a later resize fails: either call the resize operation on kv_cache (e.g., kv_cache.resize/whatever method used by try_allocate_generation/_prepare_draft_resources) before assigning kv_cache.beam_width, or wrap the assignment in a try/except and revert kv_cache.beam_width to its original value if the subsequent resize raises (or if OutOfPagesError occurs); ensure _set_page_index_bufs(req.py_request_id, kv_cache) is only called after successful resize and keep references to the original beam width to restore on failure so try_allocate_generation and _prepare_draft_resources cannot leave the cache widened under memory pressure.

coderabbitai · 2026-06-11T05:31:30Z

    def commit(
        self,
        accepted_input_tokens: Sequence[TokenIdExt],
        beam_search_indices: Sequence[int] | None = None,
    ):
-        if self.beam_width != 1:
-            raise NotImplementedError("Not implemented yet for beam search")
+        assert beam_search_indices is None


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

commit() now reaches a still-single-beam publication path.

Removing the multi-beam rejection here lets Line 738 call _commit_block() on caches that were expanded above width 1, but _commit_block() still asserts single-beam state at Line 1116 and SeqBlock.is_committed still assumes len(self.pages) == 1 at Line 105. The first full-block commit after beam expansion will either assert or publish beam 0 while leaving the extra beam pages attached to the same block.

Either collapse/drop non-default beam pages before committing, or keep the old guard until _commit_block() is made multi-beam-safe.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 719 - 724, The commit() change removed the multi-beam guard and can call _commit_block() while caches may have been expanded to multiple beam pages; either collapse non-default beam pages into the default beam before calling _commit_block() or reintroduce the single-beam guard to prevent multi-beam commits until _commit_block() and SeqBlock.is_committed are made multi-beam-safe. Specifically, in commit(), detect if any cache/block has width > 1 (or if beam_search_indices is not None) and either (a) merge/drop pages so len(self.pages)==1 for each SeqBlock before invoking _commit_block(), or (b) restore the previous rejection behavior (raise/assert) to block multi-beam commits; update references to _commit_block() and SeqBlock.is_committed accordingly.

coderabbitai · 2026-06-11T05:31:30Z

+        for ordinal, block in typed_enumerate(self._blocks):
+            source_beam_block = block.pages[DEFAULT_BEAM_INDEX]
+            block_before_history_cursor = self._is_block_before_history_cursor(ordinal)
+            if block_before_history_cursor:
+                continue
+            for beam_idx in typed_range(old_beam_width, new_beam_width):
+                new_beam_block: TypedIndexList[LifeCycleId, BlockPage] = filled_list(
+                    cast(BlockPage, None), num_life_cycles
+                )
+                for lc, source_holder in typed_enumerate(source_beam_block):
+                    if source_holder is None:
+                        continue
+
+                    source_page = source_holder.page
+                    slot = new_slots[lc].pop()
+                    # Beam expansion copies existing non-historical GPU pages into new beam pages.
+                    # Fully historical blocks remain unmapped for new beams.
+                    assert source_page.cache_level == GPU_LEVEL
+                    source_lock = source_page.lock(self, beam_idx, ordinal, lc, skip_wait=False)
+                    self._copy_gpu_page_storage_to_slot(source_lock.page, slot)
+                    with self._record_event():
+                        source_lock.unlock()
+                    new_page = UncommittedPage(self, ordinal, lc, GPU_LEVEL, slot, beam_idx)
+                    new_beam_block[lc] = new_page.lock(self, beam_idx, ordinal, lc, skip_wait=True)
+                block.pages.append(new_beam_block)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Append placeholder beam rows for historical blocks too.

The continue at Lines 1400-1401 skips block.pages.append(new_beam_block) entirely, so after self._beam_width is raised some historical blocks still have only old_beam_width rows. Later indexed access like b.pages[beam_id] or _block(ordinal, beam_idx) can then walk past the list for non-default beams. The new _check_sanity() exception at Lines 1560-1565 only tolerates missing holders inside an existing row; it does not cover missing rows.

A safe pattern here is to append an all-None beam row for every new beam on every block, and only skip the slot allocation/copy work for pre-history ordinals.

Suggested shape-preserving fix

for ordinal, block in typed_enumerate(self._blocks): source_beam_block = block.pages[DEFAULT_BEAM_INDEX] - block_before_history_cursor = self._is_block_before_history_cursor(ordinal) - if block_before_history_cursor: - continue for beam_idx in typed_range(old_beam_width, new_beam_width): new_beam_block: TypedIndexList[LifeCycleId, BlockPage] = filled_list( cast(BlockPage, None), num_life_cycles ) + if self._is_block_before_history_cursor(ordinal): + block.pages.append(new_beam_block) + continue for lc, source_holder in typed_enumerate(source_beam_block): if source_holder is None: continue ... block.pages.append(new_beam_block)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 1397 - 1421, The loop currently skips appending new beam rows for historical blocks because of the early continue in the for ordinal, block in typed_enumerate(self._blocks) loop; change the logic so that for each beam index in typed_range(old_beam_width, new_beam_width) you always create new_beam_block = filled_list(cast(BlockPage, None), num_life_cycles) and append it to block.pages, but only perform slot allocation/copy/locking when _is_block_before_history_cursor(ordinal) is False; i.e., move block.pages.append(new_beam_block) outside the branch (or append an all-None row when block_before_history_cursor is True) and keep the GPU copy path (source_lock, _copy_gpu_page_storage_to_slot, UncommittedPage creation, etc.) guarded by the existing block_before_history_cursor check so shape is preserved for historical blocks.

tensorrt-cicd · 2026-06-11T05:59:02Z

PR_Github #53484 [ run ] completed with state ABORTED. Commit: 7c3dc1d

Link to invocation

yizhang-nv · 2026-06-11T07:15:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-11T07:22:11Z

PR_Github #53514 [ run ] triggered by Bot. Commit: 7c3dc1d Link to invocation

tensorrt-cicd · 2026-06-11T14:58:39Z

PR_Github #53514 [ run ] completed with state SUCCESS. Commit: 7c3dc1d
/LLM/main/L0_MergeRequest_PR pipeline #42672 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

fredricz-20070104 · 2026-06-12T02:09:06Z

Thanks for picking this up — V2 multi-beam has a lot of edges. Three correctness issues I think need to be addressed before merge:

_commit_block still asserts single-beam. commit() no longer rejects max_beam_width > 1, but the downstream _commit_block in _kv_cache.py still has assert typed_len(seq_block.pages) == 1, "Must have 1 beam only". After beam expansion, the first full-block commit will hit this assert. If the intent (per the PR description) is for multi-beam caches to flow through commit(), this assertion needs to be relaxed and the non-beam-0 pages need to be released — consistent with the "publish beam 0 as canonical, drop the rest" policy already in commit(). Otherwise we'd need a guarantee that commit() always runs before beam expansion, and I don't see that guarantee in the current code.
_append_beams skips historical blocks without padding block.pages. When _is_block_before_history_cursor(ordinal) is true the loop continues without appending a placeholder. After self._beam_width is bumped to new_bw, those historical blocks still have len(block.pages) == old_bw, so any later access via b.pages[beam_idx] (including _block(ordinal, beam_idx)) will IndexError. _check_sanity only tolerates holder=None, not missing rows. Two options: (a) append a placeholder _BeamBlock(holder=None) to historical blocks so pages stays rectangular, or (b) make every pages[beam_idx] access tolerate sparsity. (a) is the smaller, safer change.
_ensure_generation_beam_width is not rolled back on caller failure. In try_allocate_generation / _prepare_draft_resources, beam width is bumped and pages are cloned before the caller's kv_cache.resize(...). If resize fails (OOM, etc.), the cache is left in a "already expanded, holding cloned pages" state and scheduling will keep retrying / failing. Either roll back the beam-width bump and cloned pages on caller resize failure, or move the resize inside _ensure_generation_beam_width so the whole expansion is transactional.

fredricz-20070104 · 2026-06-12T02:11:24Z

Thanks for taking this on — V2 multi-beam has a lot of subtle edges. I see three correctness blockers (matching the CodeRabbit Critical/Major findings) that should be resolved before merge:

1. (Critical) _commit_block still hard-asserts single beam.
commit() no longer rejects max_beam_width > 1, but _kv_cache.py:1116 still has:

assert typed_len(seq_block.pages) == 1, "Must have 1 beam only"

So the first full-block commit after a beam-width expansion will fail. If the design intent (per the PR description) is to allow a multi-beam cache to flow through commit(), this assertion needs to be relaxed and non-default beam pages need to be released (treating beam 0 as canonical, consistent with what commit() already does for the beam_search_indices argument). Otherwise the invariant "commit() always runs before beam expansion" needs to be made explicit — I don't see that guarantee in the current code.

2. (Major) _append_beams leaves historical blocks' pages array shorter than the new beam width.
At _kv_cache.py:1397-1401, when _is_block_before_history_cursor(ordinal) is true the loop continues without appending a placeholder to block.pages. After self._beam_width is promoted to new_beam_width, those historical blocks still have len(block.pages) == old_beam_width. Any subsequent access via block.pages[beam_idx] (e.g. _block(ordinal, beam_idx)) will IndexError, and _check_sanity does not currently allow a pages row to be entirely missing — it only allows holder=None.

Two ways to fix this:

(a) Append a placeholder _BeamBlock (holder=None) for historical blocks too, so pages length stays consistent.
(b) Make every pages[beam_idx] access tolerant of a sparse array.

(a) is smaller and safer.

3. (Major) _ensure_generation_beam_width does not roll back if a later resize fails.
In resource_manager.py:2896-2902 (and similarly in _prepare_draft_resources / the draft path), _ensure_generation_beam_width is called first — it promotes beam_width and clones pages — and then kv_cache.resize(...) is called. If the resize fails (e.g. OOM), the cache is left in a half-expanded state with cloned pages held, and the scheduler will fail the same request indefinitely. Either roll back beam_width and the cloned pages when the caller's resize fails, or move the resize inside _ensure_generation_beam_width and make the whole thing transactional.

Once 1–3 are addressed I think this is ready to merge. Non-blocker side note for future cleanup: enable_partial_reuse = config.enable_partial_reuse and max_beam_width == 1 is gated on the configured max_beam_width, so a single-beam request running under a multi-beam configuration cannot benefit from partial reuse. V1 gates this on the cache's current beam width. Worth converging at some point but not a merge blocker.

github-actions Bot assigned yizhang-nv Jun 10, 2026

yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch 2 times, most recently from 31a7221 to 86f1871 Compare June 10, 2026 05:58

yizhang-nv changed the title ~~[None][fix] Support V2 KV cache beam reuse~~ [None][fix] Support V2 KV cache beam search Jun 10, 2026

yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch 3 times, most recently from 109c0dc to 7be6aa3 Compare June 10, 2026 10:08

Support beam search KV cache reuse

9124618

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch from 7be6aa3 to 9124618 Compare June 11, 2026 03:23

yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch from aeba2d9 to 431829c Compare June 11, 2026 04:54

test: restore beam search validation coverage

7c3dc1d

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

yizhang-nv force-pushed the yizhan/kvcm-v2-beam-reuse-main branch from 431829c to 7c3dc1d Compare June 11, 2026 05:15

yizhang-nv marked this pull request as ready for review June 11, 2026 05:16

yizhang-nv requested review from a team as code owners June 11, 2026 05:16

yizhang-nv requested a review from joyang-nv June 11, 2026 05:16

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

yizhang-nv mentioned this pull request Jun 11, 2026

[None][feat] Support beam search in KV cache manager v2 #14945

Closed

1 task

jieli-matrix approved these changes Jun 12, 2026

View reviewed changes

Conversation

yizhang-nv commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

yizhang-nv commented Jun 11, 2026

Uh oh!

yizhang-nv commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

yizhang-nv commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

yizhang-nv commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

fredricz-20070104 commented Jun 12, 2026

Uh oh!

fredricz-20070104 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yizhang-nv commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading