[TRTLLM-12339][chore] Early-dispatch cross-KV path in KVCacheManager #1
Open
eopXD wants to merge 1 commit into
prepare_resources had accumulated `if not is_cross` guards that expressed the common self-attention path as the negation of the rare cross-attention case. `is_cross` is a per-instance constant (derived from kv_cache_type at construction), so branching on it at every step re-checks something fixed for the object's lifetime.

Restructure so the common path reads as the primary narrative:

- prepare_resources: dispatch the cross pool to a dedicated _prepare_cross_resources() and return early, leaving the self-attention body unconditional. Shared context-sequence collection is factored into _collect_context_sequences().
- update_resources: flip the cross guard from `if not is_cross` to a positive `kv_cache_type != CROSS` check with an intent-revealing comment; the rewind/relocation body is otherwise unchanged.

No functional change: the cross instance still performs exactly the allocate-once subset (collect + add_sequence_batch + refresh_blocks), and the self/draft instance runs the full path as before. The now-redundant batch_ctx_requests list (identical to batch_llm_requests once the guard is removed) is collapsed.

Signed-off-by: Yueh-Ting Chen <yuehtingc@nvidia.com>
What

Readability refactor of `KVCacheManager.prepare_resources` / `update_resources`, stacked on top of `guiju/en_de3` (PR NVIDIA#13919). No functional change; purely structural.

The cross-attention support currently expresses the common self-attention path as the negation of the rare cross case (`if not is_cross` at ~5 sites across the two methods). Because `is_cross` is a per-instance constant (derived from `kv_cache_type` at construction), this both inverts the narrative (the dominant path reads as "not the special case") and re-checks a fixed value on every call.

Change

- `prepare_resources`: early-dispatch the cross pool to a dedicated `_prepare_cross_resources()` and `return`, leaving the self-attention body unconditional. Shared context-sequence collection moves into `_collect_context_sequences()` (single source of truth, no copy-paste of the loop).
- `update_resources`: flip the cross guard from `if not is_cross` to a positive `kv_cache_type != CROSS` check with an intent-revealing comment; the rewind/relocation body is otherwise unchanged (no helper extraction, since the guarded block is small enough to read inline).
- The now-redundant `batch_ctx_requests` list (identical to `batch_llm_requests` once the guard is removed) is collapsed.

Equivalence

- Cross instance: still performs exactly the allocate-once subset, collect → `add_sequence_batch` → `refresh_blocks`, with no decode-time token growth, draft-token reserve, or scheduler bookkeeping (encoder K/V is written once and never grows).
- Self/draft instance: runs the full path as before, because the `if not is_cross` blocks always executed for it.

Verification

- `pre-commit` (isort / yapf v0.43.0 / autoflake / ruff-legacy / codespell) clean on the changed file.
- `python -m py_compile` clean.

Intent is to make the dual-pool blending read top-to-bottom as the common case. If you'd rather fold it in differently, or not at all, please feel free; happy to adjust.
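
For readers without the diff open, the shape of the `prepare_resources` change and its equivalence claim can be sketched with a toy class. This is an illustration only, not the actual TensorRT-LLM code: `ToyKVCacheManager`, `CacheType`, `_grow_decode_tokens`, `prepare_resources_guarded`, `run`, the request dictionaries, and the order of the steps are all assumptions made for the example. Only the names `prepare_resources`, `_prepare_cross_resources`, `_collect_context_sequences`, `add_sequence_batch`, and `refresh_blocks` come from the description above.

```python
from enum import Enum, auto


class CacheType(Enum):
    """Hypothetical stand-in for the real kv_cache_type values."""
    SELF = auto()
    CROSS = auto()


class ToyKVCacheManager:
    """Toy model that records which steps run instead of touching a cache."""

    def __init__(self, kv_cache_type):
        self.kv_cache_type = kv_cache_type
        # Fixed for the object's lifetime, as the PR description notes.
        self.is_cross = kv_cache_type is CacheType.CROSS
        self.trace = []

    def _collect_context_sequences(self, requests):
        # Shared by both paths: pick out the context-phase requests.
        self.trace.append("collect")
        return [r for r in requests if r["is_context"]]

    def _add_sequence_batch(self, sequences):
        self.trace.append(("add_sequence_batch", len(sequences)))

    def _refresh_blocks(self):
        self.trace.append("refresh_blocks")

    def _grow_decode_tokens(self, requests):
        # Hypothetical placeholder for the self-attention-only work.
        self.trace.append("grow_decode_tokens")

    def prepare_resources_guarded(self, requests):
        """Old shape: the common path is written as 'not the cross case'."""
        sequences = self._collect_context_sequences(requests)
        self._add_sequence_batch(sequences)
        if not self.is_cross:
            self._grow_decode_tokens(requests)
        self._refresh_blocks()

    def prepare_resources(self, requests):
        """New shape: dispatch the rare cross case early, then run unguarded."""
        if self.is_cross:
            self._prepare_cross_resources(requests)
            return
        sequences = self._collect_context_sequences(requests)
        self._add_sequence_batch(sequences)
        self._grow_decode_tokens(requests)
        self._refresh_blocks()

    def _prepare_cross_resources(self, requests):
        # Allocate-once subset: collect, add_sequence_batch, refresh_blocks.
        sequences = self._collect_context_sequences(requests)
        self._add_sequence_batch(sequences)
        self._refresh_blocks()


def run(method_name, kv_cache_type, requests):
    """Run one variant on a fresh manager and return the recorded steps."""
    manager = ToyKVCacheManager(kv_cache_type)
    getattr(manager, method_name)(requests)
    return manager.trace
```

Comparing `run("prepare_resources_guarded", t, requests)` with `run("prepare_resources", t, requests)` for each cache type is one cheap way to check that a refactor of this kind leaves the sequence of steps unchanged in the toy model; it says nothing about the real implementation beyond illustrating the pattern.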