Skip to content

[TRTLLM-12339][feat] Support T5 encoder-decoder models in the PyTorch backend#13919

Open
cascade812 wants to merge 44 commits into
NVIDIA:mainfrom
cascade812:guiju/en_de3
Open

[TRTLLM-12339][feat] Support T5 encoder-decoder models in the PyTorch backend#13919
cascade812 wants to merge 44 commits into
NVIDIA:mainfrom
cascade812:guiju/en_de3

Conversation

@cascade812

@cascade812 cascade812 commented May 8, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR adds the core encoder-decoder request flow, including encoder execution, decoder context/generation handling, cross-attention metadata, and cross-KV cache management.

Key Changes

  • Add PyTorch backend support for T5 model execution.
  • thop.attention support for encoder-decoder cross-attention.
  • Add dual KV cache pool support:
    • self-attention KV cache for decoder tokens
    • cross-attention KV cache for encoder outputs
  • Add precision conversion support for encoder-decoder model weights.
  • Support beam_width > 1 for encoder-decoder generation.
  • Plumb encoder-side inputs through LLM API request handling and executor flow.
  • Add unit and integration coverage for encoder-decoder request handling, scheduling, KV cache behavior, and T5 execution.

Below features are not supported in this PR, will come in separate PR.

  • cuda graph
  • TP

Test Coverage

  • Added T5 PyTorch integration tests.
  • Added encoder-decoder request API tests.
  • Added dual-pool KV cache and scheduler unit tests.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Added full encoder–decoder model support (BART, mBART, T5) and cross‑attention paths.
    • Exposed encoder_tokens and encoder output handling in the public APIs for encoder→decoder workflows.
    • Added relative position bias support in attention.
  • Improvements

    • Dual‑pool KV‑cache management and coordinated scheduling for self/cross attention to improve encoder‑decoder admission and budgeting.
    • Scheduler/executor now handle encoder phases, readiness gating, and encoder streaming.
  • Tests

    • New unit and integration tests covering encoder‑decoder, dual‑pool KV cache, and scheduling.

@coderabbitai

coderabbitai Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds encoder–decoder cross-attention end-to-end: mutable cross-KV scheduling, cross-attention/relative-bias plumbing in C++ and Python, PyTorch backends and CrossAttention, BART/T5 models, executor encoder pass and KV dual-pool creation, plus comprehensive unit and integration tests.

Changes

Encoder–decoder cross-attention stack

Layer / File(s) Summary
C++ schedulers and capacity API
cpp/include/...capacityScheduler.h, cpp/tensorrt_llm/batch_manager/capacityScheduler.cpp, cpp/tests/*
Make crossKvCacheManager a mutable optional, add encoder-init gating and mirrored cross scheduled-block bookkeeping, and add C++ unit tests for encoder-init dual-pool behavior.
C++ attention op and bindings
cpp/tensorrt_llm/thop/attentionOp.*, cpp/tensorrt_llm/common/attentionOp.cpp, cpp/tensorrt_llm/nanobind/*
Thread cross-attention, encoder lengths, and relative-attention-bias through THOP runner/enqueue and expose encoder_tokens to Python bindings.
PyTorch attention backends & modules
tensorrt_llm/_torch/attention_backend/*, tensorrt_llm/_torch/modules/*
Extend Attention APIs with cross/relative-bias args, add Vanilla SDPA fallback, RTT/THOP/TRTLLM plumbing for cross-attention and bias validation, and CrossAttention module.
Models: BART & T5
tensorrt_llm/_torch/models/*
Add full BART and T5 encoder–decoder model implementations, relative-position bias (T5), and HF weight conversion helpers.
KV-cache creator / resource manager
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py
Add cross-pool budget split, cross KV manager creation/registration, cross-specific prepare/update paths, and extend low-level KV manager construction to accept pool type and head geometry.
Model engine & executor
tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/py_executor.py
Add forward_encoder API, encoder_stream execution, scatter/attach encoder outputs, readiness event handling, and executor gating/guards for encoder–decoder flows.
Python schedulers
tensorrt_llm/_torch/pyexecutor/scheduler/*
Add encoder_requests bucket, readiness filtering helpers, Bind/Capacity scheduler wiring for cross manager and ENCODER_INIT gating, and V2 cross-KV reservation flow.
Public API & config
tensorrt_llm/llmapi/*, tensorrt_llm/executor/*
Propagate encoder_input_token_ids through executor/GenerationRequest and add PreprocessedInputs.encoder_input_token_ids; add ModelConfig.is_encoder_decoder and validate cross_kv_cache_fraction.
Tests
cpp/tests/*, tests/unittest/_torch/*, tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py
Add/modify unit and integration tests covering dual-pool KV splitting/wiring, encoder-init admission, readiness filtering, SDPA fallback, serializable scheduler outputs, and T5 end-to-end generation.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant PyExecutor
  participant Scheduler
  participant ModelEngine
  participant CrossKVManager
  Client->>PyExecutor: enqueue requests (encoder_input_token_ids)
  PyExecutor->>Scheduler: schedule (includes encoder_requests)
  Scheduler->>CrossKVManager: validate / reserve cross-KV (for CONTEXT_INIT)
  PyExecutor->>ModelEngine: forward_encoder(encoder_requests) on encoder_stream
  ModelEngine->>PyExecutor: packed encoder outputs + seq_lens
  PyExecutor->>Scheduler: mark encoder-ready -> move to CONTEXT_INIT
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • suyoggupta
  • nv-guomingz
  • venkywonka
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (12)
tensorrt_llm/executor/executor.py (1)

123-170: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mirror encoder_input_token_ids on generate().

This wires encoder inputs through generate_async(), but the synchronous generate() wrapper still has no way to accept or forward them. Users of the blocking executor API are therefore missing the new encoder-decoder request path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/executor.py` around lines 123 - 170, The synchronous
wrapper generate() does not accept or forward the new encoder_input_token_ids
parameter, so callers using the blocking API can't use the encoder-decoder path;
update the generate() signature to include encoder_input_token_ids:
Optional[Union[torch.Tensor, np.ndarray, list]] = None and ensure it passes this
argument through when constructing/dispatching the GenerationRequest (same name
used in generate_async and GenerationRequest). Also propagate
encoder_input_token_ids through any internal call sites or helper methods that
currently forward other generation arguments to generate_async or create the
GenerationRequest so behavior is identical to generate_async().
tensorrt_llm/_torch/attention_backend/vanilla.py (1)

400-410: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not use a causal mask on flash-attn cross-attention.

The SDPA fallback correctly disables causality when Q/K lengths differ, but the flash-attn path still sets causal=True for the default mask. For encoder-decoder cross-attention that blocks later encoder tokens and produces wrong attention scores in fp16/bf16.

Suggested fix
         attn_output_unpad = flash_attn_varlen_func(
             q,
             k,
             v,
             cu_seqlens_q,
             cu_seqlens_k,
             max_seqlen_q,
             max_seqlen_k,
             dropout_p=0.0,
             softmax_scale=softmax_scale,
-            causal=attention_mask == PredefinedAttentionMask.CAUSAL,
+            causal=(attention_mask == PredefinedAttentionMask.CAUSAL
+                    and not is_cross),
             alibi_slopes=None,
             deterministic=False,
             return_attn_probs=False,
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/vanilla.py` around lines 400 - 410, The
flash-attn call incorrectly applies a causal mask for cross-attention; update
the causal argument of flash_attn_varlen_func so it is True only for true
self-attention. Compute a cross-attention condition (e.g., compare cu_seqlens_q
and cu_seqlens_k shapes or values) and set causal = (attention_mask ==
PredefinedAttentionMask.CAUSAL) and not is_cross_attention before calling
flash_attn_varlen_func; use the same logic/sentinel used by the SDPA fallback to
detect differing Q/K lengths so encoder-decoder attention does not get a causal
mask.
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)

334-339: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t put raw encoder_output tensors into Diff without normalizing them first.

Diff is the cross-rank sync payload, but get_diff() only moves logits to CPU. As written, a CUDA encoder_output can escape into the transport/serialization path and either fail there or keep a large device allocation alive unexpectedly.

Suggested fix
     def get_diff(self) -> Diff:
         for i, context_logits in enumerate(self.diff.context_logits_list):
             self.diff.context_logits_list[i] = context_logits.to("cpu")
         for i, generation_logits in enumerate(self.diff.generation_logits_list):
             self.diff.generation_logits_list[i] = generation_logits.to("cpu")
+        if self.diff.encoder_output is not None:
+            self.diff.encoder_output = self.diff.encoder_output.to(
+                "cpu", non_blocking=True
+            )
         return self.diff

Also applies to: 357-358, 423-425

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 334 - 339,
get_diff currently only moves logits to CPU but may leave raw CUDA
encoder_output tensors inside Diff and other cross-rank payloads; update
get_diff (and the analogous places around lines where encoder_output is added —
e.g., the blocks referenced at 357-358 and 423-425) to normalize tensors before
inserting into Diff: for each tensor (encoder_output, items in
context_logits_list and generation_logits_list) call
.detach().cpu().contiguous() (or .clone().detach().cpu() if contiguous not
needed) to remove CUDA/device references and break autograd, and replace the
originals in Diff with these CPU tensors so no CUDA memory or device-bound
tensors are serialized or kept alive. Ensure you apply the same normalization
routine wherever encoder_output or other tensors are appended to Diff.

649-679: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the NVIDIA header to this modified Python file.

This module was updated in the PR but still lacks the required NVIDIA copyright header/current modification year.

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 649 - 679, Add
the required NVIDIA copyright header (with the current modification year) to the
top of this Python source file (llm_request.py); place the standard NVIDIA
header block before any imports or code (above the class/constructor that
contains the shown super().__init__ calls and symbols like encoder_input_tokens,
encoder_output_len, py_return_encoder_output) and update the copyright year to
the current year.
cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h (1)

169-183: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the copyright year in this modified header.

This header was changed for the PR, but its NVIDIA copyright line still ends at 2024.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h` around lines 169
- 183, The file capacityScheduler.h was modified but its NVIDIA copyright header
still ends at 2024; update the copyright year to the current year (2026) in the
header comment so the top-of-file NVIDIA copyright line reflects the
modification (e.g., replace "2024" with "2026" or use a range like "2024-2026"),
then save and commit the change.
cpp/tensorrt_llm/thop/attentionOp.cpp (1)

679-699: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require encoder lengths on the cross-attention path.

cross_attention=true is now accepted even when encoder_input_lengths is absent, but Runner::run() only populates the cross-attention length pointer when that tensor exists. That leaves the enqueue params under-specified for the new mode and turns a bad caller into wrong kernel bounds instead of a deterministic error.

Suggested guard
     TLLM_CHECK_WITH_INFO(is_mla_enable || is_fused_qkv || use_sage_attn || cross_attention,
         "Only fused QKV is supported for non-MLA non-SageAttention non-cross attention now");
     TLLM_CHECK_WITH_INFO(update_kv_cache || cross_attention, "KV cache update cannot be disabled now");
+    TLLM_CHECK_WITH_INFO(
+        !cross_attention || encoder_input_lengths.has_value(),
+        "encoder_input_lengths must be provided for cross attention");
+    TLLM_CHECK_WITH_INFO(
+        !cross_attention || !update_kv_cache || cross_kv.has_value(),
+        "cross_kv must be provided when cross attention updates KV cache");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/thop/attentionOp.cpp` around lines 679 - 699, When
cross_attention is true you must require encoder_input_lengths to be present to
avoid creating under-specified enqueue params; add a runtime check in the
attention op (near the existing TLLM_CHECK_WITH_INFO calls in the function that
has parameters cross_attention and encoder_input_lengths) that asserts
encoder_input_lengths.has_value() when cross_attention is true and emits a clear
error via TLLM_CHECK_WITH_INFO (or similar) referencing cross_attention and
encoder_input_lengths; this will surface a deterministic error (rather than
letting Runner::run produce missing cross-attention length pointers and invalid
kernel bounds).
tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (1)

60-89: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Initialize is_encoder_init_state explicitly on the request mocks.

KVCacheV2Scheduler._sort_requests() now branches on req.is_encoder_init_state. On plain Mock objects, a missing attribute becomes a truthy child mock, so context requests get misclassified as encoder-init and the new chunk-order assertions can pass even if that logic regresses.

Suggested fix
 def make_ctx_request(
@@
     req.lora_task_id = lora_task_id
     req.is_context_init_state = True
+    req.is_encoder_init_state = False
     req.is_generation_in_progress_state = False
     req.encoder_output_len = encoder_output_len
     req.py_encoder_output_ready_event = None
     req.py_skip_cross_kv_projection = False
     return req
 
 
 def make_encoder_request(request_id, encoder_output_len, lora_task_id=None):
@@
     req.encoder_output_len = encoder_output_len
     req.lora_task_id = lora_task_id
     req.is_context_init_state = False
+    req.is_encoder_init_state = True
     req.is_generation_in_progress_state = False
     req.is_first_context_chunk = True
     return req

Also applies to: 92-102

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` around lines 60
- 89, The request mocks created by make_ctx_request are missing an explicit
is_encoder_init_state attribute so Mock returns a truthy child mock and causes
KVCacheV2Scheduler._sort_requests to misclassify context requests; update
make_ctx_request to set req.is_encoder_init_state = False (and likewise set
is_encoder_init_state explicitly in the other helper around lines 92-102) so
encoder-init branching in _sort_requests behaves deterministically.
tensorrt_llm/_torch/pyexecutor/_util.py (1)

1-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA file header.

This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.

As per coding guidelines, "**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 1 - 25, This file is
missing the required NVIDIA copyright/header; add the official NVIDIA file
header (with the current modification year) to the very top of
tensorrt_llm/_torch/pyexecutor/_util.py before any imports so the header
precedes the existing module-level imports and symbols (e.g., torch,
tensorrt_llm, DecodingMode, MODEL_CLASS_VISION_ENCODER_MAPPING); if this is a
modification of an existing file, update the year in the header to the current
year.
tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py (1)

1-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA file header.

This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.

As per coding guidelines, "**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 1 - 16,
Add the required NVIDIA copyright/header block to the very top of this Python
module (above all imports) per CODING_GUIDELINES.md, updating the modification
year to the current year; ensure the header matches the project's standard
NVIDIA copyright template and includes the copyright owner and year range, then
commit the file with the header in place (this file contains symbols like the
module imports tensorrt_llm.bindings/internal, the import lines for LlmRequest
and LlmRequestState, and references to CapacitySchedulerPolicy and logger to
help locate the file).
tensorrt_llm/executor/result.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required NVIDIA source header to this modified Python file.

This file was modified but still lacks the required NVIDIA copyright/license header.

Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import asyncio

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/result.py` at line 1, This Python module
(tensorrt_llm.executor.result.py) is missing the required NVIDIA
copyright/license header; add the standard NVIDIA source header (with the
current modification year) at the very top of result.py before any code (e.g.,
before the existing "import asyncio") so the file complies with the project's
C++/Python source header policy.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

2-3: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update copyright year range on this modified file.

The file changed in this PR but the header still ends at 2025.

Suggested patch
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp` around lines 2 - 3,
Update the copyright header's year range in the SPDX comments: change the
SPDX-FileCopyrightText line that currently ends "2025" to include the current
year (e.g., "2026") so the range reads "2022-2026" while leaving the
SPDX-License-Identifier line unchanged.
tests/unittest/_torch/executor/test_request_utils.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA source header to this modified test file.

Even for unit-test Python sources, the repository rule requires the NVIDIA copyright/license header.

Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """Tests for request_utils.py functions.

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_request_utils.py` at line 1, Add the
required NVIDIA source header to the top of the modified test module
test_request_utils.py by inserting the standard NVIDIA copyright/license header
(with the current modification year) as the first thing in the file, before the
existing module docstring and imports so the file complies with the repository’s
source-header rule.
🟡 Minor comments (8)
legacy_enc_dec_architecture.md-123-123 (1)

123-123: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifier to fenced code block.

The fenced code block should specify a language identifier for better rendering and to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 out/<tpX>/encoder/rank0.engine, config.json
 out/<tpX>/decoder/rank0.engine, config.json
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @legacy_enc_dec_architecture.md at line 123, The fenced code block at the end
of the file lacks a language identifier; update the block delimiter to include a
language (e.g., change totext) so it becomes a labeled fenced code block
for proper rendering and linting—target the fenced block containing the lines
"out//encoder/rank0.engine, config.json" and
"out//decoder/rank0.engine, config.json" and add the language identifier.


</details>

</blockquote></details>
<details>
<summary>tests/unittest/llmapi/test_llm_args.py-342-353 (1)</summary><blockquote>

`342-353`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add boundary acceptance tests for `cross_kv_cache_fraction`.**

These new tests cover invalid values, but not valid boundaries. Please also assert that `0.0`, `1.0`, and `None` are accepted.

<details>
<summary>🧪 Suggested test additions</summary>

```diff
+@pytest.mark.parametrize("value", [0.0, 1.0, None])
+def test_KvCacheConfig_accepts_cross_kv_cache_fraction_boundaries(value):
+    cfg = KvCacheConfig(cross_kv_cache_fraction=value)
+    assert cfg.cross_kv_cache_fraction == value
```
</details>

As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, Add
positive-boundary tests for KvCacheConfig to ensure cross_kv_cache_fraction
accepts 0.0, 1.0, and None: create new tests (e.g.,
test_KvCacheConfig_accepts_zero_fraction,
test_KvCacheConfig_accepts_one_fraction,
test_KvCacheConfig_accepts_none_fraction) that instantiate
KvCacheConfig(cross_kv_cache_fraction=0.0),
KvCacheConfig(cross_kv_cache_fraction=1.0), and
KvCacheConfig(cross_kv_cache_fraction=None) respectively and assert the object
is created (no exception) and the attribute equals the provided value; place
these alongside the existing failing-value tests that reference KvCacheConfig so
behavior is fully covered.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/inputs/data.py-38-45 (1)</summary><blockquote>

`38-45`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**The public prompt types still require `prompt` / `prompt_token_ids`.**

`prompt_inputs()` now accepts encoder-only dictionaries, but `TextPrompt` and `TokensPrompt` still model `prompt` / `prompt_token_ids` as required keys. That leaves the published type contract out of sync with the runtime API for encoder-decoder requests.

 


Also applies to: 78-85

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/data.py` around lines 38 - 45, TextPrompt and
TokensPrompt currently require prompt / prompt_token_ids but prompt_inputs()
accepts encoder-only dicts; update the prompt type definitions to make prompt
and prompt_token_ids optional (NotRequired/Optional) to match encoder-decoder
requests. Locate the TextPrompt and TokensPrompt type/class definitions and
change the required keys for "prompt" and "prompt_token_ids" to optional (align
with encoder_input_token_ids / encoder_inputs used elsewhere) so the public type
contract matches runtime behavior; ensure any related docstrings/comments
reflect the optionality.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/resource_manager.py-704-732 (1)</summary><blockquote>

`704-732`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add the NVIDIA header to this modified Python file.**

This file was changed in the PR but still starts without the required NVIDIA copyright header/current modification year.

 
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 704 - 732,
This Python source (tensorrt_llm/_torch/pyexecutor/resource_manager.py) is
missing the required NVIDIA copyright header; add the standard NVIDIA
copyright/header comment block (with the current modification year) at the very
top of the file before any imports or code, ensuring it follows the project's
header format used in other .py files; no functional code changes to symbols
like CacheTypeCpp, kv_cache_type, add_sequence_batch, refresh_blocks, or
class/method definitions are needed—just insert the header comment block exactly
as mandated by the coding guidelines.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/modules/encoder_decoder_layer.py-29-36 (1)</summary><blockquote>

`29-36`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Split abstract method declarations onto proper multi-line bodies to comply with PEP 8.**

Both `forward` methods have one-line bodies that violate flake8 E704 ("multiple statements on one line"). Move the `...` ellipsis to its own line:

<details>
<summary>Suggested fix</summary>

```diff
     `@abstractmethod`
     def forward(
         self,
         hidden_states: torch.Tensor,
         attn_metadata: AttentionMetadata,
         position_ids: Optional[torch.IntTensor] = None,
         **kwargs,
-    ) -> torch.Tensor: ...
+    ) -> torch.Tensor:
+        ...
@@
     `@abstractmethod`
     def forward(
         self,
         position_ids: torch.IntTensor,
         hidden_states: torch.Tensor,
         attn_metadata: AttentionMetadata,
         encoder_hidden_states: Optional[torch.Tensor] = None,
         cross_attn_metadata: Optional[AttentionMetadata] = None,
         skip_cross_kv_projection: bool = False,
         **kwargs,
-    ) -> torch.Tensor: ...
+    ) -> torch.Tensor:
+        ...
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/encoder_decoder_layer.py` around lines 29 - 36,
The abstract forward method currently uses a one-line body with the ellipsis,
violating flake8 E704; update the forward declaration in the class (the def
forward(...) method) so the signature ends with a colon and the ellipsis is
placed on its own indented line (i.e., split the method into a proper multi-line
body) to satisfy PEP 8 and flake8 checks.
```

</details>

</blockquote></details>
<details>
<summary>encoder_decoder_porting_guide.md-9-10 (1)</summary><blockquote>

`9-10`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Update the top-level contract to match the implementation.**

This still says enc-dec is “V2-only” and baseline beam width is `1`, but this PR adds V1 wiring/smoke coverage and explicitly calls out `beam_width > 1` generation support. Please align the guide’s opening contract so readers do not build against stale constraints.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@encoder_decoder_porting_guide.md` around lines 9 - 10, The top-level contract
incorrectly states "V2-only" and beam width = 1; update the opening paragraph to
reflect that encoder-decoder models (T5, BART, mBART) now include V1
wiring/smoke coverage and support generation with beam_width > 1. Change
references to require use_kv_cache_manager_v2=True and KVCacheManagerV2
constraints to note that V2 is required for full enc-dec features (explicit SELF
+ CROSS pools) but V1 has limited wiring/smoke tests in this PR, and ensure the
contract clearly documents supported beam_width behavior and runtime
requirements (PyTorch vs. other runtimes).
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/model_engine.py-4400-4402 (1)</summary><blockquote>

`4400-4402`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Consider converting assert to explicit RuntimeError for production safety.**

The assertion on line 4400-4402 validates that the packed encoder length doesn't exceed `max_num_tokens`. While the assertion message is informative, assertions can be disabled with `-O` flag in production Python. Consider using an explicit check with `RuntimeError` for consistent error behavior:

<details>
<summary>🛡️ Suggested change</summary>

```diff
-        assert num_tokens <= self.max_num_tokens, (
-            f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
-            f"({self.max_num_tokens})")
+        if num_tokens > self.max_num_tokens:
+            raise RuntimeError(
+                f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
+                f"({self.max_num_tokens}). Reduce batch size or sequence lengths.")
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4400 - 4402, The
assert checking packed encoder length should be replaced with an explicit
runtime check to avoid being bypassed by Python -O; in the scope where
num_tokens and self.max_num_tokens are available (the code block containing the
current assert), change to an if statement that tests if num_tokens >
self.max_num_tokens and raise RuntimeError with the same formatted message
(include num_tokens and self.max_num_tokens) so the error is always raised in
production.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py-449-450 (1)</summary><blockquote>

`449-450`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**`PyMicroBatchScheduler.schedule()` bypasses the new readiness gate.**

`_can_be_scheduled()` now rejects decoder-context requests until `py_encoder_output_ready_event` is set, but the scheduling loop re-inlines only the state-range check. A direct call to `schedule()` can still admit an unready `CONTEXT_INIT` request.
 


Also applies to: 488-494

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 449 -
450, The scheduling loop in PyMicroBatchScheduler.schedule() bypasses the new
decoder-context readiness gate, so update schedule() to consult the same logic
as _can_be_scheduled(): before admitting a request (particularly CONTEXT_INIT /
decoder-context requests) call is_decoder_context_request_ready(req) or call
_can_be_scheduled(req) so that py_encoder_output_ready_event is respected; apply
the same fix to the second occurrence referenced (the block around the other
admission at lines ~488-494) so no CONTEXT_INIT request is scheduled until
py_encoder_output_ready_event allows it.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (12)</summary><blockquote>

<details>
<summary>tests/unittest/llmapi/test_llm_args.py (1)</summary><blockquote>

`342-353`: **QA list update is unnecessary for this test change.**

This PR delta is unit-test scope in `tests/unittest/...`; no `tests/integration/defs/` entry changes are needed for QA scheduled lists.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, The change is
only a unit-test update under tests/unittest and the QA scheduled lists should
not be modified; revert or remove any edits in tests/integration/defs/ (or any
QA list files) introduced by this PR, leaving the two new/updated tests
(test_KvCacheConfig_rejects_cross_kv_cache_fraction_below_zero and
test_KvCacheConfig_rejects_cross_kv_cache_fraction_above_one) and the
KvCacheConfig-related assertions as the sole change set.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/llmapi/llm_args.py (1)</summary><blockquote>

`2607-2617`: _⚡ Quick win_

**Prefer declarative field constraints for `cross_kv_cache_fraction`.**

The custom validator at lines 2607–2616 duplicates constraint logic that belongs in the field definition. Pydantic v2 `Field(ge=0, le=1)` on `Optional[float]` enforces range validation on provided values while allowing `None`, matching the current validator behavior. This aligns with the codebase pattern (e.g., `free_gpu_memory_fraction` uses the same approach) and the coding guideline: "In Python Pydantic models, prefer `PositiveInt`, `NonNegativeInt`, `NonNegativeFloat`, `PositiveFloat`, or `Field(gt=0)` for numeric constraints instead of custom validators."

<details>
<summary>♻️ Proposed refactor</summary>

```diff
     cross_kv_cache_fraction: Optional[float] = Field(
         default=None,
+        ge=0,
+        le=1,
         description=
         "The fraction of the KV Cache memory should be reserved for cross attention. If set to p, self attention will use 1-p of KV Cache memory and cross attention will use p of KV Cache memory. Default is 50%. Should only be set when using encoder-decoder model."
     )

@@
-    `@field_validator`('cross_kv_cache_fraction')
-    `@classmethod`
-    def validate_cross_kv_cache_fraction(cls, v: Optional[float]):
-        if v is None:
-            return v
-        if not 0 <= v <= 1:
-            raise ValueError(
-                "kv_cache_config.cross_kv_cache_fraction must be a float between 0 and 1"
-            )
-        return v
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 2607 - 2617, The custom
validator validate_cross_kv_cache_fraction duplicates range checks; remove that
validator and declare the field cross_kv_cache_fraction as Optional[float] with
Pydantic constraints (e.g., Field(ge=0, le=1)) in its definition so None is
allowed but any provided value must be between 0 and 1, matching the pattern
used for free_gpu_memory_fraction.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)</summary><blockquote>

`77-80`: _⚡ Quick win_

**Declare `CROSS_KV_CACHE_MANAGER` as a normal enum member.**

Using `locals()` here is just reflection for a plain enum constant. It makes the symbol harder to grep and reason about, and the direct declaration is clearer.

  
<details>
<summary>Suggested fix</summary>

```diff
 class ResourceManagerType(enum.Enum):
     KV_CACHE_MANAGER = "KV_CACHE_MANAGER"
     DRAFT_KV_CACHE_MANAGER = "DRAFT_KV_CACHE_MANAGER"
-    locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"
+    CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER"
     PEFT_CACHE_MANAGER = "PEFT_CACHE_MANAGER"
     SEQ_SLOT_MANAGER = "SEQ_SLOT_MANAGER"
     SPEC_RESOURCE_MANAGER = "SPEC_RESOURCE_MANAGER"
```
</details>
As per coding guidelines, "Python code should avoid reflection when functionality can be easily achieved without it".

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 77 - 80, The
enum ResourceManagerType currently injects a member via reflection using
locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"; replace
that reflective line with a direct enum member declaration like
CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER" inside the ResourceManagerType
class so the member is explicit and discoverable (update any references to
ResourceManagerType.CROSS_KV_CACHE_MANAGER if needed).
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/executor/test_encoder_step.py (1)</summary><blockquote>

`406-427`: _⚡ Quick win_

**Add a beam-expanded case here.**

`beam_width > 1` is a headline behavior of this PR, but this helper is only tested with one row per request. A regression where encoder lengths or cached-token counts are not replicated correctly per beam would still pass; a small parametrized case with duplicated beams and mixed cached/new cross-KV would close that gap.



As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix ... Note missing negative tests, missing parametrization where multiple backends or dtypes apply."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_encoder_step.py` around lines 406 - 427,
The test test_builds_metadata_for_mixed_projection_and_cached_sequences lacks a
beam-expanded case; add a subcase where you call
engine._prepare_encoder_decoder_cross_attention_inputs with beam_width > 1
(e.g., 2) and ensure inputs and metadata are validated for replicated beams:
pass encoder_hidden_states duplicated per beam (or a batch with beam dimension),
provide encoder_seq_lens and encoder_num_cached_tokens_per_seq expanded/repeated
per beam, and assert that inputs["encoder_hidden_states"] points to the original
tensor or its proper beam-expanded view, inputs["skip_cross_kv_projection"]
remains False, inputs["cross_attn_metadata"] is metadata.cross_metadata with
metadata.cross_metadata.prepared True, metadata.cross_kv_cache_manager equals
the _FakeResourceManager's cross_manager, and that metadata.encoder_seq_lens and
metadata.encoder_num_cached_tokens_per_seq have correct replicated values for
each beam; reference
test_builds_metadata_for_mixed_projection_and_cached_sequences,
_prepare_encoder_decoder_cross_attention_inputs, _FakeAttentionMetadata, and
_FakeResourceManager when making these additions.
```

</details>

</blockquote></details>
<details>
<summary>tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py (1)</summary><blockquote>

`374-606`: **Consider wiring this into the QA functional list.**

This is core single-node functional coverage for a new encoder-decoder flow, so it likely deserves an entry in `tests/integration/test_lists/qa/llm_function_core.txt` if you want it exercised in scheduled QA. From this file alone, a perf-list update does not look necessary.



As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py` around lines 374 -
606, The new encoder-decoder integration tests
(test_t5_pytorch_generate_encoder_decoder_end_to_end and
test_t5_pytorch_generate_encoder_decoder_cuda_graph_mixed_encoder_lengths_batch)
should be added to the QA functional test list so they run under scheduled QA;
update the QA list (llm_function_core.txt) to include a short entry referencing
these tests (model encoder-decoder flow / T5 PyTorch backend) so the change
under tests/integration/defs/... is explicitly exercised by the QA pipeline.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/llmapi/llm.py (1)</summary><blockquote>

`628-636`: _💤 Low value_

**Minor indentation style issue flagged by linter.**

Static analysis flagged E121 (continuation line under-indented) at line 636. The closing parenthesis should align with the opening of the function call or maintain consistent indentation with the parameters.



<details>
<summary>🔧 Suggested fix for indentation</summary>

```diff
         (prompt_token_ids, prompt, query_token_ids, multimodal_params,
          encoder_input_token_ids) = self._preprocess(
-             inputs,
-             sampling_params,
-             disaggregated_params,
-             encoder_inputs=encoder_inputs,
-             encoder_input_token_ids=encoder_input_token_ids,
-             decoder_input_token_ids=decoder_input_token_ids,
-         )
+            inputs,
+            sampling_params,
+            disaggregated_params,
+            encoder_inputs=encoder_inputs,
+            encoder_input_token_ids=encoder_input_token_ids,
+            decoder_input_token_ids=decoder_input_token_ids,
+        )
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm.py` around lines 628 - 636, The continuation call to
self._preprocess is under-indented causing E121; fix by aligning the closing
parenthesis with the opening parenthesis of the call or by indenting it to match
the parameter block so the call looks consistently formatted; locate the call to
_preprocess (the line starting with (prompt_token_ids, prompt, query_token_ids,
multimodal_params, encoder_input_token_ids) = self._preprocess(...)) and adjust
the indentation of the closing parenthesis and any continuation lines so they
align with the opening self._preprocess( or match the indentation level used for
the other multi-line calls in the file.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/modeling/test_modeling_enc_dec.py (1)</summary><blockquote>

`684-688`: _💤 Low value_

**Consider extracting `_make_cross_attn` as a module-level helper.**

The smoke benchmark instantiates `TestCrossAttentionTrtllmBackend()` just to call its `_make_cross_attn` method. This is functional but couples the smoke test to the test class structure.

Consider moving `_make_cross_attn` to a module-level helper function (similar to `_build_trtllm_cross_metadata`) for cleaner reuse across test classes.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modeling/test_modeling_enc_dec.py` around lines 684 -
688, Extract the TestCrossAttentionTrtllmBackend._make_cross_attn method into a
module-level helper function (e.g., make_cross_attn) and update callers to use
that helper instead of instantiating TestCrossAttentionTrtllmBackend;
specifically, move the implementation out of the class into a top-level function
alongside helpers like _build_trtllm_cross_metadata, keep the same parameters
(hidden_size, num_heads, head_dim, dtype, device), preserve behavior and
returned tensor/device handling, and replace usages such as
TestCrossAttentionTrtllmBackend()._make_cross_attn(...) with
make_cross_attn(...).
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/model_engine.py (2)</summary><blockquote>

`1872-1905`: _💤 Low value_

**Consider extracting the nested `update_cross_metadata` function.**

The nested `update_cross_metadata` function directly modifies internal attributes like `_seq_lens` and `_seq_lens_cuda`. While this works, it couples tightly to `AttentionMetadata` internals.

Additionally, the function captures variables from the enclosing scope (`attn_metadata`, `encoder_seq_lens_tensor`, etc.) which makes the data flow less explicit. Consider whether this could be a method on `AttentionMetadata` itself or at minimum accept all dependencies as parameters for clarity.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 1872 - 1905, The
nested function update_cross_metadata tightly couples to AttentionMetadata
internals and captures external variables (attn_metadata,
encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq,
cross_kv_cache_manager), making data flow implicit; refactor by extracting
update_cross_metadata into either an instance/class method on AttentionMetadata
(e.g., AttentionMetadata.update_for_cross(attn_metadata,
encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq,
cross_kv_cache_manager)) or a standalone function that takes all dependencies as
explicit parameters, and move creation of KVCacheParams (currently constructed
inside update_cross_metadata) into that method so it only mutates public
attributes (avoid direct writes to _seq_lens/_seq_lens_cuda) and document
required inputs.
```

</details>

---

`4468-4477`: _💤 Low value_

**Consider using `_get_top_level_model()` for consistency with the embedding lookup pattern.**

The encoder discovery at lines 4468-4472 retrieves the encoder directly from `self.model` and `self.model.model`, but the embedding lookup at line 4484 uses `_get_top_level_model()` to unwrap compiled models. For consistency and to guard against potential edge cases with `torch.compile` wrapped submodules, refactor the encoder discovery to use the same `_get_top_level_model()` pattern.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4468 - 4477, The
encoder lookup currently reads encoder from self.model and self.model.model
directly which can miss wrapped/compiled submodules; change the discovery to
first call _get_top_level_model(self.model) (the same helper used by the
embedding lookup) and then get the "encoder" attribute (and fallback to
inner.model.encoder if necessary) so encoder resolution mirrors the embedding
pattern—update the block that references encoder and inner to use
_get_top_level_model(...) and retain the existing AttributeError message if no
encoder is found.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/executor/test_py_scheduler.py (1)</summary><blockquote>

`2506-2516`: _⚡ Quick win_

**Add the constrained cross-pool case for `MAX_UTILIZATION`.**

This test uses a roomy cross pool, so it still passes if `ENCODER_INIT` is accidentally charged against cross-KV blocks in the max-utilization path. Using something like `MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)` here would mirror the guaranteed-no-evict regression case and actually prove encoder admission is cross-pool-neutral under both policies.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_py_scheduler.py` around lines 2506 -
2516, The test test_max_utilization_admits_encoder_with_cross_pool currently
uses a roomy cross pool; change the cross_kv MockKVCacheManager instantiation to
mimic the constrained cross-pool case (e.g.,
MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)) so the test
verifies MAX_UTILIZATION does not incorrectly charge ENCODER_INIT to cross-KV
blocks; keep the same requests (make_encoder_request and
make_generation_request), call scheduler.schedule_request, and retain the
existing assertions on fitting and paused to ensure encoder admission is
cross-pool-neutral under this policy.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/llmapi/test_encoder_decoder_request_api.py (1)</summary><blockquote>

`77-82`: _⚡ Quick win_

**Add one behavioral `BaseLLM.generate` test, not just a signature check.**

The async path is covered end-to-end, but the sync wrapper only gets a parameter-order assertion. If `generate()` drops `encoder_input_token_ids` or handles the conflict checks differently from `generate_async()`, this file stays green.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_encoder_decoder_request_api.py` around lines 77 -
82, The test currently only asserts parameter order for BaseLLM.generate; add a
behavioral unit test that calls BaseLLM.generate (or a small test
subclass/mocked LLM implementing generate_async) to verify that when both
priority and encoder_input_token_ids/encoder_inputs kwargs are provided,
generate forwards encoder inputs and enforces the same conflict/priority
behavior as generate_async — specifically instantiate a fake LLM where
generate_async records received kwargs, call the synchronous BaseLLM.generate
wrapper with both priority and encoder_input_token_ids/encoder_inputs, and
assert the recorded args include the encoder inputs (and that priority
wins/behavior matches generate_async) to exercise the actual code paths rather
than just signatures.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/executor/test_request_utils.py (1)</summary><blockquote>

`64-109`: **QA list update is not needed for this test change.**

This is unit-test scope only (`tests/unittest/**`), so integration/release QA list edits are unnecessary in this PR.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_request_utils.py` around lines 64 - 109,
This PR only changes unit tests (e.g.,
test_executor_request_to_llm_request_preserves_encoder_tokens and
test_scheduled_requests_does_not_query_encoder_context_chunk_state using
ScheduledRequests/EncoderInitRequest), so remove any unrelated QA list /
integration/release QA edits from the commit and keep only the test file
modifications; ensure the diff contains solely the test changes and no QA list
or release checklist updates before finalizing the PR.
```

</details>

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Enterprise

**Run ID**: `be032c9d-11c9-4f74-b7b7-0350127c8b65`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between f8572abec4bf623dc142d7df8babc67ab1616996 and e4752c379acc0e88142ba6d285b588b3e869be7b.

</details>

<details>
<summary>📒 Files selected for processing (48)</summary>

* `cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h`
* `cpp/tensorrt_llm/batch_manager/capacityScheduler.cpp`
* `cpp/tensorrt_llm/common/attentionOp.cpp`
* `cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp`
* `cpp/tensorrt_llm/nanobind/thop/bindings.cpp`
* `cpp/tensorrt_llm/thop/attentionOp.cpp`
* `cpp/tensorrt_llm/thop/attentionOp.h`
* `cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp`
* `encoder_decoder_porting_guide.md`
* `legacy_enc_dec_architecture.md`
* `tensorrt_llm/_torch/attention_backend/interface.py`
* `tensorrt_llm/_torch/attention_backend/trtllm.py`
* `tensorrt_llm/_torch/attention_backend/trtllm_gen.py`
* `tensorrt_llm/_torch/attention_backend/vanilla.py`
* `tensorrt_llm/_torch/model_config.py`
* `tensorrt_llm/_torch/models/__init__.py`
* `tensorrt_llm/_torch/models/modeling_bart.py`
* `tensorrt_llm/_torch/models/modeling_t5.py`
* `tensorrt_llm/_torch/modules/cross_attention.py`
* `tensorrt_llm/_torch/modules/encoder_decoder_layer.py`
* `tensorrt_llm/_torch/modules/rms_norm.py`
* `tensorrt_llm/_torch/pyexecutor/_util.py`
* `tensorrt_llm/_torch/pyexecutor/llm_request.py`
* `tensorrt_llm/_torch/pyexecutor/model_engine.py`
* `tensorrt_llm/_torch/pyexecutor/model_loader.py`
* `tensorrt_llm/_torch/pyexecutor/py_executor.py`
* `tensorrt_llm/_torch/pyexecutor/resource_manager.py`
* `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py`
* `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py`
* `tensorrt_llm/executor/base_worker.py`
* `tensorrt_llm/executor/executor.py`
* `tensorrt_llm/executor/request.py`
* `tensorrt_llm/executor/result.py`
* `tensorrt_llm/inputs/data.py`
* `tensorrt_llm/llmapi/llm.py`
* `tensorrt_llm/llmapi/llm_args.py`
* `tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py`
* `tests/unittest/_torch/executor/test_dual_pool_kv_cache.py`
* `tests/unittest/_torch/executor/test_encoder_step.py`
* `tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py`
* `tests/unittest/_torch/executor/test_py_scheduler.py`
* `tests/unittest/_torch/executor/test_request_utils.py`
* `tests/unittest/_torch/modeling/test_modeling_enc_dec.py`
* `tests/unittest/_torch/test_model_config.py`
* `tests/unittest/api_stability/references/llm.yaml`
* `tests/unittest/api_stability/references/request_output.yaml`
* `tests/unittest/llmapi/test_encoder_decoder_request_api.py`
* `tests/unittest/llmapi/test_llm_args.py`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread tensorrt_llm/_torch/attention_backend/interface.py
Comment thread tensorrt_llm/_torch/attention_backend/trtllm_gen.py Outdated
Comment thread tensorrt_llm/_torch/attention_backend/trtllm.py Outdated
Comment thread tensorrt_llm/_torch/models/__init__.py
Comment thread tensorrt_llm/_torch/models/modeling_bart.py
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py
Comment thread tensorrt_llm/inputs/data.py Outdated
@pengbowang-nv

Copy link
Copy Markdown
Collaborator

Hi, I'd first like to know if it is possible that we may split this single PR into multiple smaller PRs, so as to reduce the difficulty for reviewer and change scope? Thanks!

cascade812 added 13 commits May 13, 2026 14:42
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
@cascade812

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53123 [ run ] triggered by Bot. Commit: 2641898 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53123 [ run ] completed with state FAILURE. Commit: 2641898
/LLM/main/L0_MergeRequest_PR pipeline #42330 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@cascade812

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53229 [ run ] triggered by Bot. Commit: d3d176c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53229 [ run ] completed with state SUCCESS. Commit: d3d176c
/LLM/main/L0_MergeRequest_PR pipeline #42424 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@cascade812

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53343 [ run ] triggered by Bot. Commit: d3d176c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53343 [ run ] completed with state FAILURE. Commit: d3d176c
/LLM/main/L0_MergeRequest_PR pipeline #42524 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@cascade812

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53387 [ run ] triggered by Bot. Commit: d3d176c Link to invocation

@tburt-nv

Copy link
Copy Markdown
Collaborator

pipeline 42424 actually passed:

  • [2026-06-10T07:39:02.468Z] DGX_B200-PyTorch-3/test_unittests.py::test_unittests_v2[unittest/kv_cache_manager_v2_tests/] <- test_unittests.py PASSED
  • [2026-06-10T07:29:31.808Z] DGX_B200-PyTorch-1/test_unittests.py::test_unittests_v2[unittest/_torch/thop/parallel] <- test_unittests.py PASSED

@tburt-nv

Copy link
Copy Markdown
Collaborator

/bot skip --comment "pipeline 42424 passed"

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53389 [ skip ] triggered by Bot. Commit: d3d176c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53387 [ run ] completed with state ABORTED. Commit: d3d176c

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53389 [ skip ] completed with state SUCCESS. Commit: d3d176c
Skipping testing for commit d3d176c

Link to invocation

@cascade812 cascade812 enabled auto-merge (squash) June 10, 2026 23:07
quant_mode=self.quant_mode,
q_scaling=self.q_scaling,
position_embedding_type=self.position_embedding_type,
position_embedding_type=forward_args.position_embedding_type,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to the forward_args section

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question as below

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forward_args.position_embedding_type should move to the "Per-call (AttentionForwardArgs)" section, but I prefer to not add position_embedding_type to forward_args, see my previous comment.

encoder_input_lengths=forward_args.encoder_input_lengths,
relative_attention_bias=forward_args.relative_attention_bias,
relative_attention_max_distance=forward_args.
relative_attention_max_distance,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move these new arguments to the correct sections.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuxianq can you clarify this? I think new arguments are already in the forward_args section?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You put encoder_input_lengths/relative_attention_bias /relative_attention_max_distance in the "Literals intentionally None" section, please move it to the "Per-call (AttentionForwardArgs)" section.


use_trtllm_gen = False
if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION:
if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION and not metadata.is_cross:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add not metadata.is_cross here, move it to trtllm_gen_backend.is_supported

- disaggregated/test_disaggregated_single_gpu.py::test_disaggregated_logits[True-TinyLlama-1.1B-Chat-v1.0]
- disaggregated/test_disaggregated_single_gpu.py::test_disaggregated_logprobs[False-TinyLlama-1.1B-Chat-v1.0]
- disaggregated/test_disaggregated_single_gpu.py::test_disaggregated_logprobs[True-TinyLlama-1.1B-Chat-v1.0]
- llmapi/test_llm_api_pytorch_bart.py::test_bart_pytorch_generate_encoder_decoder_end_to_end[bf16-kv-v1-cuda-graph-off-greedy-bart-large-cnn]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which arch support cross attention with your PR? We only test sm90, how about sm100?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross attention is now supported on both SM90 and SM100 by falling back to THOP attention.
I'll have a separate PR that adds SM100 cross-attention support to trtllm-gen.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THOP uses trtllm-gen for sm100 and XQA/fmha_v2 for sm90, if trtllm-gen cross-attention kernels are aded in a follow-up PR, this PR will only contain sm90 support, right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it supports attn_backend="TRTLLM" inside which it falls back to thop for cross attention on both sm90 and sm100.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are trtllm_gen_backend and thop in trtllm attention backend, trtllm_gen_backend means flashinfer's trtllm-gen kernels (we will refactor it in #15204 to make it clearer), but we also integrate trtllm-gen kernels to thop. If there is no trtllm-gen kernels for cross attention, thop also cannot support it. Do you mean we already have cross attention support in trtllm-gen, but you have not enabled it for trtllm_gen_backend?

forward_args.cross_kv = None
forward_args.encoder_input_lengths = None

forward_args.position_embedding_type = (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need to determine the position_embedding_type during forward? It is a static info, we should update the position_embedding_type in attention backend's init and remove position_embedding_type from forward_args.

@eopXD eopXD left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes under resource_manager.py come from a perspective of "if not is_cross", from a perspective of supporting the cross attention path you are trying to achieve. This is understandable.

However it is a slight code smell since you are branching out the main code path.You may check cascade812#1 for reference.

All other changes look good to me now. Thank you for your patience.

is_cross = self.kv_cache_type == CacheTypeCpp.CROSS
if not is_cross:
if not self.is_draft:
_update_kv_cache_draft_token_location(self, scheduled_batch,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This looks like a drive-by fix to achieve symmetry of the v2 manager's path of update_resources. I think we can remove this in case this dillutes the goal of this MR. A follow-up for this addition should also be good. My nit comment is to remove it here.

dim=1).contiguous()
else:
forward_args.cross_kv = None
forward_args.encoder_input_lengths = metadata.kv_lens_cuda_runtime

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have already pass metadata.kv_lens_cuda_runtime to thop, don't need to pass encoder_input_lengths again, we can remove encoder_input_lengths from forward_args

not forward_args.is_fused_qkv and k is not None
and v is not None) or (metadata.is_cross
and not forward_args.update_kv_cache
and k is None and v is None)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding assertion, we should update forward_args.update_kv_cache for cross attention here instead of L1432

forward_args.cross_kv = torch.cat([k_flat, v_flat],
dim=1).contiguous()
else:
forward_args.cross_kv = None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line since cross_kv is None by default.

forward_args.is_fused_qkv = True
else:
forward_args.cross_kv = None
forward_args.encoder_input_lengths = None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this else block since they are None by default.

# stay as literal ``None`` until DeepSeek V4 sparse-MLA lands.
sparse_mla_topk_lens=None,
compressed_kv_cache_pool_ptr=None,
cross_attention=metadata.is_cross,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the same name for kwargs, e.g., is_cross=metadata.is_cross

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants