
[TRTLLM-12339][chore] Early-dispatch cross-KV path in KVCacheManager #1

Open
eopXD wants to merge 1 commit into cascade812:guiju/en_de3 from eopXD:eopxd/kvcm-cross-early-dispatch



eopXD commented Jun 11, 2026


What

Readability refactor of KVCacheManager.prepare_resources / update_resources, stacked on top of guiju/en_de3 (PR NVIDIA#13919). No functional change — purely structural.

The cross-attention support currently expresses the common self-attention path as the negation of the rare cross case (`if not is_cross` at ~5 sites across the two methods). Because `is_cross` is a per-instance constant (derived from `kv_cache_type` at construction), this both inverts the narrative (the dominant path reads as "not the special case") and re-checks a fixed value on every call.

Change

  • prepare_resources: early-dispatch the cross pool to a dedicated _prepare_cross_resources() and return, leaving the self-attention body unconditional. Shared context-sequence collection moves into _collect_context_sequences() (single source of truth, no copy-paste of the loop).
  • update_resources: flip the cross guard from `if not is_cross` to a positive `kv_cache_type != CROSS` check with an intent-revealing comment; the rewind/relocation body is otherwise unchanged (no helper extraction, since the guarded block is small enough to read inline).
  • Collapse the now-redundant batch_ctx_requests list (identical to batch_llm_requests once the guard is removed).
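For readers without the diff at hand, the resulting shape of prepare_resources can be sketched roughly as follows. This is a toy stand-in, not the actual TensorRT-LLM code: `ToyKVCacheManager`, `CacheType`, `Request`, the `calls` log, and the `grow_generation_tokens` / `reserve_draft_tokens` step names are hypothetical placeholders. Only `prepare_resources`, `_prepare_cross_resources`, `_collect_context_sequences`, `add_sequence_batch`, and `refresh_blocks` are names taken from this description, and the order of the self-attention steps is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CacheType(Enum):  # hypothetical stand-in for the real kv_cache_type values
    SELF = auto()
    CROSS = auto()


@dataclass
class Request:  # hypothetical minimal request record
    request_id: int
    is_context: bool


class ToyKVCacheManager:
    def __init__(self, kv_cache_type):
        self.kv_cache_type = kv_cache_type  # fixed for the object's lifetime
        self.calls = []  # records the steps taken, for illustration only

    def _collect_context_sequences(self, requests):
        # Shared by both paths, so the collection loop exists in one place.
        return [r.request_id for r in requests if r.is_context]

    def _prepare_cross_resources(self, requests):
        # Allocate-once subset: collect, add_sequence_batch, refresh_blocks.
        seqs = self._collect_context_sequences(requests)
        self.calls.append(("add_sequence_batch", seqs))
        self.calls.append(("refresh_blocks",))

    def prepare_resources(self, requests):
        # Early dispatch: the rare cross pool is handled and returns here.
        if self.kv_cache_type == CacheType.CROSS:
            self._prepare_cross_resources(requests)
            return

        # Self-attention path, now unconditional (placeholder step names).
        seqs = self._collect_context_sequences(requests)
        gen_ids = [r.request_id for r in requests if not r.is_context]
        self.calls.append(("add_sequence_batch", seqs))
        self.calls.append(("grow_generation_tokens", gen_ids))
        self.calls.append(("reserve_draft_tokens",))
        self.calls.append(("refresh_blocks",))
```

With one context request and one generation request, a `CacheType.CROSS` instance logs only the two allocate-once steps, while a `CacheType.SELF` instance logs all four.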

Equivalence

  • The cross instance still performs exactly the allocate-once subset: collect → add_sequence_batch → refresh_blocks, with no decode-time token growth, draft-token reserve, or scheduler bookkeeping (encoder K/V is written once and never grows).
  • The self/draft instance runs the full path unchanged; the old `if not is_cross` blocks always executed for it.
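The equivalence argument can be illustrated with a small model that compares the step sequence produced by the two shapes. This is only a sketch under assumptions: both functions are hypothetical, the `grow_tokens`, `reserve_draft_tokens`, and `scheduler_bookkeeping` labels and their ordering are placeholders, and nothing here exercises the real KVCacheManager.

```python
def prepare_guarded(is_cross, context_ids, generation_ids):
    """Old shape: one body, with self-only steps behind `if not is_cross`."""
    log = [("add_sequence_batch", tuple(context_ids))]
    if not is_cross:
        log.append(("grow_tokens", tuple(generation_ids)))
    if not is_cross:
        log.append(("reserve_draft_tokens",))
    log.append(("refresh_blocks",))
    if not is_cross:
        log.append(("scheduler_bookkeeping",))
    return log


def prepare_dispatched(is_cross, context_ids, generation_ids):
    """New shape: the cross case returns early, the rest is unconditional."""
    if is_cross:
        return [
            ("add_sequence_batch", tuple(context_ids)),
            ("refresh_blocks",),
        ]
    return [
        ("add_sequence_batch", tuple(context_ids)),
        ("grow_tokens", tuple(generation_ids)),
        ("reserve_draft_tokens",),
        ("refresh_blocks",),
        ("scheduler_bookkeeping",),
    ]


def shapes_agree(context_ids, generation_ids):
    # Both shapes must log identical steps for the cross and self cases.
    return all(
        prepare_guarded(flag, context_ids, generation_ids)
        == prepare_dispatched(flag, context_ids, generation_ids)
        for flag in (True, False)
    )
```

Calling `shapes_agree([1, 2], [3])` compares both instance kinds in one go; a similar recorded-call comparison against the real methods would be one way to back the "no functional change" claim.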

Verification

Intent is to make the dual-pool blending read top-to-bottom as the common case. If you'd rather fold it in differently — or not at all — please feel free; happy to adjust.

prepare_resources had accumulated `if not is_cross` guards that expressed
the common self-attention path as the negation of the rare cross-attention
case. `is_cross` is a per-instance constant (derived from kv_cache_type at
construction), so branching on it at every step re-checks something fixed
for the object's lifetime.

Restructure so the common path reads as the primary narrative:
- prepare_resources: dispatch the cross pool to a dedicated
  _prepare_cross_resources() and return early, leaving the self-attention
  body unconditional. Shared context-sequence collection is factored into
  _collect_context_sequences().
- update_resources: flip the cross guard from `if not is_cross` to a
  positive `kv_cache_type != CROSS` check with an intent-revealing comment;
  the rewind/relocation body is otherwise unchanged.

No functional change: the cross instance still performs exactly the
allocate-once subset (collect + add_sequence_batch + refresh_blocks), and
the self/draft instance runs the full path as before. The now-redundant
batch_ctx_requests list (identical to batch_llm_requests once the guard is
removed) is collapsed.

Signed-off-by: Yueh-Ting Chen <yuehtingc@nvidia.com>