[None][fix] Support V2 KV cache beam search#15197
Conversation
31a7221 to
86f1871
Compare
109c0dc to
7be6aa3
Compare
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
7be6aa3 to
9124618
Compare
|
/bot run |
aeba2d9 to
431829c
Compare
|
/bot run |
|
PR_Github #53478 [ run ] triggered by Bot. Commit: |
|
PR_Github #53479 [ run ] triggered by Bot. Commit: |
|
PR_Github #53478 [ run ] completed with state |
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
431829c to
7c3dc1d
Compare
|
/bot run |
|
PR_Github #53484 [ run ] triggered by Bot. Commit: |
|
PR_Github #53479 [ run ] completed with state |
📝 WalkthroughWalkthroughThis PR extends TensorRT-LLM's KVCacheManagerV2 to support generation-time beam widths up to ChangesKVCacheManagerV2 Multi-Beam Generation Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py (1)
144-149: ⚡ Quick winAdd
strict=Truetozip()calls for safer iteration.In Python 3.10+,
zip(..., strict=True)raisesValueErrorif the iterables have different lengths, catching bugs where output counts or beam counts unexpectedly diverge.🔒 Proposed fix to add strict parameter
- for i, (oa, ob) in enumerate(zip(outputs_a, outputs_b)): + for i, (oa, ob) in enumerate(zip(outputs_a, outputs_b, strict=True)): assert len(oa.outputs) == len(ob.outputs), ( f"Prompt {i}: output beam count mismatch: " f"{label_a}={len(oa.outputs)}, {label_b}={len(ob.outputs)}" ) - for beam_idx, (beam_a, beam_b) in enumerate(zip(oa.outputs, ob.outputs)): + for beam_idx, (beam_a, beam_b) in enumerate(zip(oa.outputs, ob.outputs, strict=True)): if assert_token_ids_match:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py` around lines 144 - 149, The test iterates paired sequences with zip() but doesn't detect length mismatches; update the two zip calls in the loop that compare outputs_a with outputs_b and oa.outputs with ob.outputs to use zip(..., strict=True) so a ValueError is raised if lengths differ, referencing the variables outputs_a, outputs_b and the inner beam iteration over oa.outputs and ob.outputs (the enumerate(zip(...)) loops).tensorrt_llm/_torch/pyexecutor/_util.py (1)
889-891: 💤 Low valueConsider aligning the draft fallback warning with the main model warning for consistency.
The draft model fallback warning uses different terminology ("disaggregated serving, event buffer max size > 0, or disagg config") compared to the main model warning at line 270 ("kv_connector_manager, event buffer max size > 0, or cache transceiver"). Since both check the same conditions, using consistent field-name references would improve clarity and reduce confusion about whether "disaggregated serving" and "disagg config" refer to the same or different conditions.
♻️ Suggested alignment
logger.warning( - "KVCacheManagerV2 is not supported with disaggregated serving, event buffer max size > 0, or disagg config. " + "KVCacheManagerV2 is not supported with kv_connector_manager, event buffer max size > 0, or cache transceiver. " "Falling back to KVCacheManager for draft model.")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 889 - 891, The draft-fallback logger.warning message in KVCacheManagerV2 should use the same terminology as the main model warning to avoid confusion: update the warning emitted in the KVCacheManagerV2 fallback (the logger.warning call that currently says "disaggregated serving, event buffer max size > 0, or disagg config") to match the main model wording ("kv_connector_manager, event buffer max size > 0, or cache transceiver") so both checks report the same field names and meaning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 2968-2981: In _ensure_generation_beam_width change the mutation
order so we don't permanently widen kv_cache if a later resize fails: either
call the resize operation on kv_cache (e.g., kv_cache.resize/whatever method
used by try_allocate_generation/_prepare_draft_resources) before assigning
kv_cache.beam_width, or wrap the assignment in a try/except and revert
kv_cache.beam_width to its original value if the subsequent resize raises (or if
OutOfPagesError occurs); ensure _set_page_index_bufs(req.py_request_id,
kv_cache) is only called after successful resize and keep references to the
original beam width to restore on failure so try_allocate_generation and
_prepare_draft_resources cannot leave the cache widened under memory pressure.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py`:
- Around line 719-724: The commit() change removed the multi-beam guard and can
call _commit_block() while caches may have been expanded to multiple beam pages;
either collapse non-default beam pages into the default beam before calling
_commit_block() or reintroduce the single-beam guard to prevent multi-beam
commits until _commit_block() and SeqBlock.is_committed are made
multi-beam-safe. Specifically, in commit(), detect if any cache/block has width
> 1 (or if beam_search_indices is not None) and either (a) merge/drop pages so
len(self.pages)==1 for each SeqBlock before invoking _commit_block(), or (b)
restore the previous rejection behavior (raise/assert) to block multi-beam
commits; update references to _commit_block() and SeqBlock.is_committed
accordingly.
- Around line 1397-1421: The loop currently skips appending new beam rows for
historical blocks because of the early continue in the for ordinal, block in
typed_enumerate(self._blocks) loop; change the logic so that for each beam index
in typed_range(old_beam_width, new_beam_width) you always create new_beam_block
= filled_list(cast(BlockPage, None), num_life_cycles) and append it to
block.pages, but only perform slot allocation/copy/locking when
_is_block_before_history_cursor(ordinal) is False; i.e., move
block.pages.append(new_beam_block) outside the branch (or append an all-None row
when block_before_history_cursor is True) and keep the GPU copy path
(source_lock, _copy_gpu_page_storage_to_slot, UncommittedPage creation, etc.)
guarded by the existing block_before_history_cursor check so shape is preserved
for historical blocks.
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 889-891: The draft-fallback logger.warning message in
KVCacheManagerV2 should use the same terminology as the main model warning to
avoid confusion: update the warning emitted in the KVCacheManagerV2 fallback
(the logger.warning call that currently says "disaggregated serving, event
buffer max size > 0, or disagg config") to match the main model wording
("kv_connector_manager, event buffer max size > 0, or cache transceiver") so
both checks report the same field names and meaning.
In `@tests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.py`:
- Around line 144-149: The test iterates paired sequences with zip() but doesn't
detect length mismatches; update the two zip calls in the loop that compare
outputs_a with outputs_b and oa.outputs with ob.outputs to use zip(...,
strict=True) so a ValueError is raised if lengths differ, referencing the
variables outputs_a, outputs_b and the inner beam iteration over oa.outputs and
ob.outputs (the enumerate(zip(...)) loops).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 77943b18-27f2-49e1-aa65-b3d3c86ac368
📒 Files selected for processing (11)
tensorrt_llm/_torch/pyexecutor/_util.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.pytensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.pytests/integration/defs/accuracy/test_llm_api_pytorch.pytests/integration/defs/kv_cache/test_kv_cache_v2_scheduler.pytests/integration/test_lists/qa/llm_function_core.txttests/integration/test_lists/test-db/l0_b200.ymltests/integration/test_lists/test-db/l0_h100.ymltests/unittest/_torch/executor/test_overlap_scheduler.pytests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py
| def _ensure_generation_beam_width(self, req: LlmRequest, | ||
| kv_cache: _KVCache) -> bool: | ||
| target_beam_width = BeamIndex(req.py_beam_width) | ||
| assert 1 <= target_beam_width <= self.max_beam_width | ||
| if kv_cache.beam_width == target_beam_width: | ||
| return True | ||
|
|
||
| try: | ||
| kv_cache.beam_width = target_beam_width | ||
| except OutOfPagesError: | ||
| return False | ||
|
|
||
| self._set_page_index_bufs(req.py_request_id, kv_cache) | ||
| return True |
There was a problem hiding this comment.
Roll back beam expansion if the later resize fails.
kv_cache.beam_width is mutated here before the caller's kv_cache.resize(...). In try_allocate_generation() and _prepare_draft_resources(), a failed resize now leaves the cache widened and still holding the cloned beam pages even though the scheduling attempt failed. Under memory pressure that makes the failure sticky: later iterations see fewer free pages than they should, and the request can keep starving the pool until it is explicitly evicted or freed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2968 - 2981,
In _ensure_generation_beam_width change the mutation order so we don't
permanently widen kv_cache if a later resize fails: either call the resize
operation on kv_cache (e.g., kv_cache.resize/whatever method used by
try_allocate_generation/_prepare_draft_resources) before assigning
kv_cache.beam_width, or wrap the assignment in a try/except and revert
kv_cache.beam_width to its original value if the subsequent resize raises (or if
OutOfPagesError occurs); ensure _set_page_index_bufs(req.py_request_id,
kv_cache) is only called after successful resize and keep references to the
original beam width to restore on failure so try_allocate_generation and
_prepare_draft_resources cannot leave the cache widened under memory pressure.
| def commit( | ||
| self, | ||
| accepted_input_tokens: Sequence[TokenIdExt], | ||
| beam_search_indices: Sequence[int] | None = None, | ||
| ): | ||
| if self.beam_width != 1: | ||
| raise NotImplementedError("Not implemented yet for beam search") | ||
| assert beam_search_indices is None |
There was a problem hiding this comment.
commit() now reaches a still-single-beam publication path.
Removing the multi-beam rejection here lets Line 738 call _commit_block() on caches that were expanded above width 1, but _commit_block() still asserts single-beam state at Line 1116 and SeqBlock.is_committed still assumes len(self.pages) == 1 at Line 105. The first full-block commit after beam expansion will either assert or publish beam 0 while leaving the extra beam pages attached to the same block.
Either collapse/drop non-default beam pages before committing, or keep the old guard until _commit_block() is made multi-beam-safe.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines 719
- 724, The commit() change removed the multi-beam guard and can call
_commit_block() while caches may have been expanded to multiple beam pages;
either collapse non-default beam pages into the default beam before calling
_commit_block() or reintroduce the single-beam guard to prevent multi-beam
commits until _commit_block() and SeqBlock.is_committed are made
multi-beam-safe. Specifically, in commit(), detect if any cache/block has width
> 1 (or if beam_search_indices is not None) and either (a) merge/drop pages so
len(self.pages)==1 for each SeqBlock before invoking _commit_block(), or (b)
restore the previous rejection behavior (raise/assert) to block multi-beam
commits; update references to _commit_block() and SeqBlock.is_committed
accordingly.
| for ordinal, block in typed_enumerate(self._blocks): | ||
| source_beam_block = block.pages[DEFAULT_BEAM_INDEX] | ||
| block_before_history_cursor = self._is_block_before_history_cursor(ordinal) | ||
| if block_before_history_cursor: | ||
| continue | ||
| for beam_idx in typed_range(old_beam_width, new_beam_width): | ||
| new_beam_block: TypedIndexList[LifeCycleId, BlockPage] = filled_list( | ||
| cast(BlockPage, None), num_life_cycles | ||
| ) | ||
| for lc, source_holder in typed_enumerate(source_beam_block): | ||
| if source_holder is None: | ||
| continue | ||
|
|
||
| source_page = source_holder.page | ||
| slot = new_slots[lc].pop() | ||
| # Beam expansion copies existing non-historical GPU pages into new beam pages. | ||
| # Fully historical blocks remain unmapped for new beams. | ||
| assert source_page.cache_level == GPU_LEVEL | ||
| source_lock = source_page.lock(self, beam_idx, ordinal, lc, skip_wait=False) | ||
| self._copy_gpu_page_storage_to_slot(source_lock.page, slot) | ||
| with self._record_event(): | ||
| source_lock.unlock() | ||
| new_page = UncommittedPage(self, ordinal, lc, GPU_LEVEL, slot, beam_idx) | ||
| new_beam_block[lc] = new_page.lock(self, beam_idx, ordinal, lc, skip_wait=True) | ||
| block.pages.append(new_beam_block) |
There was a problem hiding this comment.
Append placeholder beam rows for historical blocks too.
The continue at Lines 1400-1401 skips block.pages.append(new_beam_block) entirely, so after self._beam_width is raised some historical blocks still have only old_beam_width rows. Later indexed access like b.pages[beam_id] or _block(ordinal, beam_idx) can then walk past the list for non-default beams. The new _check_sanity() exception at Lines 1560-1565 only tolerates missing holders inside an existing row; it does not cover missing rows.
A safe pattern here is to append an all-None beam row for every new beam on every block, and only skip the slot allocation/copy work for pre-history ordinals.
Suggested shape-preserving fix
for ordinal, block in typed_enumerate(self._blocks):
source_beam_block = block.pages[DEFAULT_BEAM_INDEX]
- block_before_history_cursor = self._is_block_before_history_cursor(ordinal)
- if block_before_history_cursor:
- continue
for beam_idx in typed_range(old_beam_width, new_beam_width):
new_beam_block: TypedIndexList[LifeCycleId, BlockPage] = filled_list(
cast(BlockPage, None), num_life_cycles
)
+ if self._is_block_before_history_cursor(ordinal):
+ block.pages.append(new_beam_block)
+ continue
for lc, source_holder in typed_enumerate(source_beam_block):
if source_holder is None:
continue
...
block.pages.append(new_beam_block)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py` around lines
1397 - 1421, The loop currently skips appending new beam rows for historical
blocks because of the early continue in the for ordinal, block in
typed_enumerate(self._blocks) loop; change the logic so that for each beam index
in typed_range(old_beam_width, new_beam_width) you always create new_beam_block
= filled_list(cast(BlockPage, None), num_life_cycles) and append it to
block.pages, but only perform slot allocation/copy/locking when
_is_block_before_history_cursor(ordinal) is False; i.e., move
block.pages.append(new_beam_block) outside the branch (or append an all-None row
when block_before_history_cursor is True) and keep the GPU copy path
(source_lock, _copy_gpu_page_storage_to_slot, UncommittedPage creation, etc.)
guarded by the existing block_before_history_cursor check so shape is preserved
for historical blocks.
|
PR_Github #53484 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #53514 [ run ] triggered by Bot. Commit: |
|
PR_Github #53514 [ run ] completed with state
|
|
Thanks for picking this up — V2 multi-beam has a lot of edges. Three correctness issues I think need to be addressed before merge:
|
|
Thanks for taking this on — V2 multi-beam has a lot of subtle edges. I see three correctness blockers (matching the CodeRabbit Critical/Major findings) that should be resolved before merge: 1. (Critical) assert typed_len(seq_block.pages) == 1, "Must have 1 beam only"So the first full-block commit after a beam-width expansion will fail. If the design intent (per the PR description) is to allow a multi-beam cache to flow through 2. (Major) Two ways to fix this:
(a) is smaller and safer. 3. (Major) Once 1–3 are addressed I think this is ready to merge. Non-blocker side note for future cleanup: |
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Enable KV cache manager v2 to support beam-search expansion during generation. The change materializes page-index buffers for expanded beams, copies the live partial-tail/SSM page state from beam 0 for new beams, and keeps historical full blocks unmapped for expanded beams. It also disables partial reuse when V2 is configured with multi-beam capacity.
This keeps V2 selected for
max_beam_width > 1instead of falling back to the V1 KV cache manager. The runtimecommit()path still only acceptsbeam_search_indices is None, but it no longer rejects a cache whose layout has already been expanded for beam search by overlap scheduling. Published blocks are treated as beam0-canonical for reuse, and non-default beam pages are cleared when a block is committed.This adds targeted KV cache manager v2 unit coverage and LLM API accuracy coverage for V2 beam search.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.