[TRTLLM-12339][feat] Support T5 encoder-decoder models in the PyTorch backend#13919
[TRTLLM-12339][feat] Support T5 encoder-decoder models in the PyTorch backend#13919cascade812 wants to merge 44 commits into
Conversation
📝 WalkthroughWalkthroughAdds encoder–decoder cross-attention end-to-end: mutable cross-KV scheduling, cross-attention/relative-bias plumbing in C++ and Python, PyTorch backends and CrossAttention, BART/T5 models, executor encoder pass and KV dual-pool creation, plus comprehensive unit and integration tests. ChangesEncoder–decoder cross-attention stack
Sequence Diagram(s)sequenceDiagram
participant Client
participant PyExecutor
participant Scheduler
participant ModelEngine
participant CrossKVManager
Client->>PyExecutor: enqueue requests (encoder_input_token_ids)
PyExecutor->>Scheduler: schedule (includes encoder_requests)
Scheduler->>CrossKVManager: validate / reserve cross-KV (for CONTEXT_INIT)
PyExecutor->>ModelEngine: forward_encoder(encoder_requests) on encoder_stream
ModelEngine->>PyExecutor: packed encoder outputs + seq_lens
PyExecutor->>Scheduler: mark encoder-ready -> move to CONTEXT_INIT
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (12)
tensorrt_llm/executor/executor.py (1)
123-170:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMirror
encoder_input_token_idsongenerate().This wires encoder inputs through
generate_async(), but the synchronousgenerate()wrapper still has no way to accept or forward them. Users of the blocking executor API are therefore missing the new encoder-decoder request path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/executor/executor.py` around lines 123 - 170, The synchronous wrapper generate() does not accept or forward the new encoder_input_token_ids parameter, so callers using the blocking API can't use the encoder-decoder path; update the generate() signature to include encoder_input_token_ids: Optional[Union[torch.Tensor, np.ndarray, list]] = None and ensure it passes this argument through when constructing/dispatching the GenerationRequest (same name used in generate_async and GenerationRequest). Also propagate encoder_input_token_ids through any internal call sites or helper methods that currently forward other generation arguments to generate_async or create the GenerationRequest so behavior is identical to generate_async().tensorrt_llm/_torch/attention_backend/vanilla.py (1)
400-410:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not use a causal mask on flash-attn cross-attention.
The SDPA fallback correctly disables causality when Q/K lengths differ, but the flash-attn path still sets
causal=Truefor the default mask. For encoder-decoder cross-attention that blocks later encoder tokens and produces wrong attention scores in fp16/bf16.Suggested fix
attn_output_unpad = flash_attn_varlen_func( q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p=0.0, softmax_scale=softmax_scale, - causal=attention_mask == PredefinedAttentionMask.CAUSAL, + causal=(attention_mask == PredefinedAttentionMask.CAUSAL + and not is_cross), alibi_slopes=None, deterministic=False, return_attn_probs=False, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/attention_backend/vanilla.py` around lines 400 - 410, The flash-attn call incorrectly applies a causal mask for cross-attention; update the causal argument of flash_attn_varlen_func so it is True only for true self-attention. Compute a cross-attention condition (e.g., compare cu_seqlens_q and cu_seqlens_k shapes or values) and set causal = (attention_mask == PredefinedAttentionMask.CAUSAL) and not is_cross_attention before calling flash_attn_varlen_func; use the same logic/sentinel used by the SDPA fallback to detect differing Q/K lengths so encoder-decoder attention does not get a causal mask.tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
334-339:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t put raw
encoder_outputtensors intoDiffwithout normalizing them first.
Diffis the cross-rank sync payload, butget_diff()only moves logits to CPU. As written, a CUDAencoder_outputcan escape into the transport/serialization path and either fail there or keep a large device allocation alive unexpectedly.Suggested fix
def get_diff(self) -> Diff: for i, context_logits in enumerate(self.diff.context_logits_list): self.diff.context_logits_list[i] = context_logits.to("cpu") for i, generation_logits in enumerate(self.diff.generation_logits_list): self.diff.generation_logits_list[i] = generation_logits.to("cpu") + if self.diff.encoder_output is not None: + self.diff.encoder_output = self.diff.encoder_output.to( + "cpu", non_blocking=True + ) return self.diffAlso applies to: 357-358, 423-425
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 334 - 339, get_diff currently only moves logits to CPU but may leave raw CUDA encoder_output tensors inside Diff and other cross-rank payloads; update get_diff (and the analogous places around lines where encoder_output is added — e.g., the blocks referenced at 357-358 and 423-425) to normalize tensors before inserting into Diff: for each tensor (encoder_output, items in context_logits_list and generation_logits_list) call .detach().cpu().contiguous() (or .clone().detach().cpu() if contiguous not needed) to remove CUDA/device references and break autograd, and replace the originals in Diff with these CPU tensors so no CUDA memory or device-bound tensors are serialized or kept alive. Ensure you apply the same normalization routine wherever encoder_output or other tensors are appended to Diff.
649-679:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd the NVIDIA header to this modified Python file.
This module was updated in the PR but still lacks the required NVIDIA copyright header/current modification year.
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 649 - 679, Add the required NVIDIA copyright header (with the current modification year) to the top of this Python source file (llm_request.py); place the standard NVIDIA header block before any imports or code (above the class/constructor that contains the shown super().__init__ calls and symbols like encoder_input_tokens, encoder_output_len, py_return_encoder_output) and update the copyright year to the current year.cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h (1)
169-183:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the copyright year in this modified header.
This header was changed for the PR, but its NVIDIA copyright line still ends at 2024.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h` around lines 169 - 183, The file capacityScheduler.h was modified but its NVIDIA copyright header still ends at 2024; update the copyright year to the current year (2026) in the header comment so the top-of-file NVIDIA copyright line reflects the modification (e.g., replace "2024" with "2026" or use a range like "2024-2026"), then save and commit the change.cpp/tensorrt_llm/thop/attentionOp.cpp (1)
679-699:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRequire encoder lengths on the cross-attention path.
cross_attention=trueis now accepted even whenencoder_input_lengthsis absent, butRunner::run()only populates the cross-attention length pointer when that tensor exists. That leaves the enqueue params under-specified for the new mode and turns a bad caller into wrong kernel bounds instead of a deterministic error.Suggested guard
TLLM_CHECK_WITH_INFO(is_mla_enable || is_fused_qkv || use_sage_attn || cross_attention, "Only fused QKV is supported for non-MLA non-SageAttention non-cross attention now"); TLLM_CHECK_WITH_INFO(update_kv_cache || cross_attention, "KV cache update cannot be disabled now"); + TLLM_CHECK_WITH_INFO( + !cross_attention || encoder_input_lengths.has_value(), + "encoder_input_lengths must be provided for cross attention"); + TLLM_CHECK_WITH_INFO( + !cross_attention || !update_kv_cache || cross_kv.has_value(), + "cross_kv must be provided when cross attention updates KV cache");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/thop/attentionOp.cpp` around lines 679 - 699, When cross_attention is true you must require encoder_input_lengths to be present to avoid creating under-specified enqueue params; add a runtime check in the attention op (near the existing TLLM_CHECK_WITH_INFO calls in the function that has parameters cross_attention and encoder_input_lengths) that asserts encoder_input_lengths.has_value() when cross_attention is true and emits a clear error via TLLM_CHECK_WITH_INFO (or similar) referencing cross_attention and encoder_input_lengths; this will surface a deterministic error (rather than letting Runner::run produce missing cross-attention length pointers and invalid kernel bounds).tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (1)
60-89:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winInitialize
is_encoder_init_stateexplicitly on the request mocks.
KVCacheV2Scheduler._sort_requests()now branches onreq.is_encoder_init_state. On plainMockobjects, a missing attribute becomes a truthy child mock, so context requests get misclassified as encoder-init and the new chunk-order assertions can pass even if that logic regresses.Suggested fix
def make_ctx_request( @@ req.lora_task_id = lora_task_id req.is_context_init_state = True + req.is_encoder_init_state = False req.is_generation_in_progress_state = False req.encoder_output_len = encoder_output_len req.py_encoder_output_ready_event = None req.py_skip_cross_kv_projection = False return req def make_encoder_request(request_id, encoder_output_len, lora_task_id=None): @@ req.encoder_output_len = encoder_output_len req.lora_task_id = lora_task_id req.is_context_init_state = False + req.is_encoder_init_state = True req.is_generation_in_progress_state = False req.is_first_context_chunk = True return reqAlso applies to: 92-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` around lines 60 - 89, The request mocks created by make_ctx_request are missing an explicit is_encoder_init_state attribute so Mock returns a truthy child mock and causes KVCacheV2Scheduler._sort_requests to misclassify context requests; update make_ctx_request to set req.is_encoder_init_state = False (and likewise set is_encoder_init_state explicitly in the other helper around lines 92-102) so encoder-init branching in _sort_requests behaves deterministically.tensorrt_llm/_torch/pyexecutor/_util.py (1)
1-25:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd the required NVIDIA file header.
This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.
As per coding guidelines, "
**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 1 - 25, This file is missing the required NVIDIA copyright/header; add the official NVIDIA file header (with the current modification year) to the very top of tensorrt_llm/_torch/pyexecutor/_util.py before any imports so the header precedes the existing module-level imports and symbols (e.g., torch, tensorrt_llm, DecodingMode, MODEL_CLASS_VISION_ENCODER_MAPPING); if this is a modification of an existing file, update the year in the header to the current year.tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py (1)
1-16:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd the required NVIDIA file header.
This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.
As per coding guidelines, "
**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 1 - 16, Add the required NVIDIA copyright/header block to the very top of this Python module (above all imports) per CODING_GUIDELINES.md, updating the modification year to the current year; ensure the header matches the project's standard NVIDIA copyright template and includes the copyright owner and year range, then commit the file with the header in place (this file contains symbols like the module imports tensorrt_llm.bindings/internal, the import lines for LlmRequest and LlmRequestState, and references to CapacitySchedulerPolicy and logger to help locate the file).tensorrt_llm/executor/result.py (1)
1-1:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd required NVIDIA source header to this modified Python file.
This file was modified but still lacks the required NVIDIA copyright/license header.
Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import asyncioAs per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/executor/result.py` at line 1, This Python module (tensorrt_llm.executor.result.py) is missing the required NVIDIA copyright/license header; add the standard NVIDIA source header (with the current modification year) at the very top of result.py before any code (e.g., before the existing "import asyncio") so the file complies with the project's C++/Python source header policy.cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
2-3:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate copyright year range on this modified file.
The file changed in this PR but the header still ends at 2025.
Suggested patch
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp` around lines 2 - 3, Update the copyright header's year range in the SPDX comments: change the SPDX-FileCopyrightText line that currently ends "2025" to include the current year (e.g., "2026") so the range reads "2022-2026" while leaving the SPDX-License-Identifier line unchanged.tests/unittest/_torch/executor/test_request_utils.py (1)
1-1:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd the required NVIDIA source header to this modified test file.
Even for unit-test Python sources, the repository rule requires the NVIDIA copyright/license header.
Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + """Tests for request_utils.py functions.As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_request_utils.py` at line 1, Add the required NVIDIA source header to the top of the modified test module test_request_utils.py by inserting the standard NVIDIA copyright/license header (with the current modification year) as the first thing in the file, before the existing module docstring and imports so the file complies with the repository’s source-header rule.
🟡 Minor comments (8)
legacy_enc_dec_architecture.md-123-123 (1)
123-123:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd language identifier to fenced code block.
The fenced code block should specify a language identifier for better rendering and to satisfy markdown linting rules.
📝 Proposed fix
-``` +```text out/<tpX>/encoder/rank0.engine, config.json out/<tpX>/decoder/rank0.engine, config.json</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@legacy_enc_dec_architecture.mdat line 123, The fenced code block at the end
of the file lacks a language identifier; update the block delimiter to include a
language (e.g., changetotext) so it becomes a labeled fenced code block
for proper rendering and linting—target the fenced block containing the lines
"out//encoder/rank0.engine, config.json" and
"out//decoder/rank0.engine, config.json" and add the language identifier.</details> </blockquote></details> <details> <summary>tests/unittest/llmapi/test_llm_args.py-342-353 (1)</summary><blockquote> `342-353`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Add boundary acceptance tests for `cross_kv_cache_fraction`.** These new tests cover invalid values, but not valid boundaries. Please also assert that `0.0`, `1.0`, and `None` are accepted. <details> <summary>🧪 Suggested test additions</summary> ```diff +@pytest.mark.parametrize("value", [0.0, 1.0, None]) +def test_KvCacheConfig_accepts_cross_kv_cache_fraction_boundaries(value): + cfg = KvCacheConfig(cross_kv_cache_fraction=value) + assert cfg.cross_kv_cache_fraction == value ``` </details> As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix." <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, Add positive-boundary tests for KvCacheConfig to ensure cross_kv_cache_fraction accepts 0.0, 1.0, and None: create new tests (e.g., test_KvCacheConfig_accepts_zero_fraction, test_KvCacheConfig_accepts_one_fraction, test_KvCacheConfig_accepts_none_fraction) that instantiate KvCacheConfig(cross_kv_cache_fraction=0.0), KvCacheConfig(cross_kv_cache_fraction=1.0), and KvCacheConfig(cross_kv_cache_fraction=None) respectively and assert the object is created (no exception) and the attribute equals the provided value; place these alongside the existing failing-value tests that reference KvCacheConfig so behavior is fully covered. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/inputs/data.py-38-45 (1)</summary><blockquote> `38-45`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **The public prompt types still require `prompt` / `prompt_token_ids`.** `prompt_inputs()` now accepts encoder-only dictionaries, but `TextPrompt` and `TokensPrompt` still model `prompt` / `prompt_token_ids` as required keys. That leaves the published type contract out of sync with the runtime API for encoder-decoder requests. Also applies to: 78-85 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/inputs/data.py` around lines 38 - 45, TextPrompt and TokensPrompt currently require prompt / prompt_token_ids but prompt_inputs() accepts encoder-only dicts; update the prompt type definitions to make prompt and prompt_token_ids optional (NotRequired/Optional) to match encoder-decoder requests. Locate the TextPrompt and TokensPrompt type/class definitions and change the required keys for "prompt" and "prompt_token_ids" to optional (align with encoder_input_token_ids / encoder_inputs used elsewhere) so the public type contract matches runtime behavior; ensure any related docstrings/comments reflect the optionality. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/pyexecutor/resource_manager.py-704-732 (1)</summary><blockquote> `704-732`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Add the NVIDIA header to this modified Python file.** This file was changed in the PR but still starts without the required NVIDIA copyright header/current modification year. As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year". <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 704 - 732, This Python source (tensorrt_llm/_torch/pyexecutor/resource_manager.py) is missing the required NVIDIA copyright header; add the standard NVIDIA copyright/header comment block (with the current modification year) at the very top of the file before any imports or code, ensuring it follows the project's header format used in other .py files; no functional code changes to symbols like CacheTypeCpp, kv_cache_type, add_sequence_batch, refresh_blocks, or class/method definitions are needed—just insert the header comment block exactly as mandated by the coding guidelines. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/modules/encoder_decoder_layer.py-29-36 (1)</summary><blockquote> `29-36`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Split abstract method declarations onto proper multi-line bodies to comply with PEP 8.** Both `forward` methods have one-line bodies that violate flake8 E704 ("multiple statements on one line"). Move the `...` ellipsis to its own line: <details> <summary>Suggested fix</summary> ```diff `@abstractmethod` def forward( self, hidden_states: torch.Tensor, attn_metadata: AttentionMetadata, position_ids: Optional[torch.IntTensor] = None, **kwargs, - ) -> torch.Tensor: ... + ) -> torch.Tensor: + ... @@ `@abstractmethod` def forward( self, position_ids: torch.IntTensor, hidden_states: torch.Tensor, attn_metadata: AttentionMetadata, encoder_hidden_states: Optional[torch.Tensor] = None, cross_attn_metadata: Optional[AttentionMetadata] = None, skip_cross_kv_projection: bool = False, **kwargs, - ) -> torch.Tensor: ... + ) -> torch.Tensor: + ... ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/encoder_decoder_layer.py` around lines 29 - 36, The abstract forward method currently uses a one-line body with the ellipsis, violating flake8 E704; update the forward declaration in the class (the def forward(...) method) so the signature ends with a colon and the ellipsis is placed on its own indented line (i.e., split the method into a proper multi-line body) to satisfy PEP 8 and flake8 checks. ``` </details> </blockquote></details> <details> <summary>encoder_decoder_porting_guide.md-9-10 (1)</summary><blockquote> `9-10`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Update the top-level contract to match the implementation.** This still says enc-dec is “V2-only” and baseline beam width is `1`, but this PR adds V1 wiring/smoke coverage and explicitly calls out `beam_width > 1` generation support. Please align the guide’s opening contract so readers do not build against stale constraints. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@encoder_decoder_porting_guide.md` around lines 9 - 10, The top-level contract incorrectly states "V2-only" and beam width = 1; update the opening paragraph to reflect that encoder-decoder models (T5, BART, mBART) now include V1 wiring/smoke coverage and support generation with beam_width > 1. Change references to require use_kv_cache_manager_v2=True and KVCacheManagerV2 constraints to note that V2 is required for full enc-dec features (explicit SELF + CROSS pools) but V1 has limited wiring/smoke tests in this PR, and ensure the contract clearly documents supported beam_width behavior and runtime requirements (PyTorch vs. other runtimes). ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/pyexecutor/model_engine.py-4400-4402 (1)</summary><blockquote> `4400-4402`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Consider converting assert to explicit RuntimeError for production safety.** The assertion on line 4400-4402 validates that the packed encoder length doesn't exceed `max_num_tokens`. While the assertion message is informative, assertions can be disabled with `-O` flag in production Python. Consider using an explicit check with `RuntimeError` for consistent error behavior: <details> <summary>🛡️ Suggested change</summary> ```diff - assert num_tokens <= self.max_num_tokens, ( - f"encoder packed length ({num_tokens}) exceeds max_num_tokens " - f"({self.max_num_tokens})") + if num_tokens > self.max_num_tokens: + raise RuntimeError( + f"encoder packed length ({num_tokens}) exceeds max_num_tokens " + f"({self.max_num_tokens}). Reduce batch size or sequence lengths.") ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4400 - 4402, The assert checking packed encoder length should be replaced with an explicit runtime check to avoid being bypassed by Python -O; in the scope where num_tokens and self.max_num_tokens are available (the code block containing the current assert), change to an if statement that tests if num_tokens > self.max_num_tokens and raise RuntimeError with the same formatted message (include num_tokens and self.max_num_tokens) so the error is always raised in production. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py-449-450 (1)</summary><blockquote> `449-450`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **`PyMicroBatchScheduler.schedule()` bypasses the new readiness gate.** `_can_be_scheduled()` now rejects decoder-context requests until `py_encoder_output_ready_event` is set, but the scheduling loop re-inlines only the state-range check. A direct call to `schedule()` can still admit an unready `CONTEXT_INIT` request. Also applies to: 488-494 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 449 - 450, The scheduling loop in PyMicroBatchScheduler.schedule() bypasses the new decoder-context readiness gate, so update schedule() to consult the same logic as _can_be_scheduled(): before admitting a request (particularly CONTEXT_INIT / decoder-context requests) call is_decoder_context_request_ready(req) or call _can_be_scheduled(req) so that py_encoder_output_ready_event is respected; apply the same fix to the second occurrence referenced (the block around the other admission at lines ~488-494) so no CONTEXT_INIT request is scheduled until py_encoder_output_ready_event allows it. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🧹 Nitpick comments (12)</summary><blockquote> <details> <summary>tests/unittest/llmapi/test_llm_args.py (1)</summary><blockquote> `342-353`: **QA list update is unnecessary for this test change.** This PR delta is unit-test scope in `tests/unittest/...`; no `tests/integration/defs/` entry changes are needed for QA scheduled lists. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, The change is only a unit-test update under tests/unittest and the QA scheduled lists should not be modified; revert or remove any edits in tests/integration/defs/ (or any QA list files) introduced by this PR, leaving the two new/updated tests (test_KvCacheConfig_rejects_cross_kv_cache_fraction_below_zero and test_KvCacheConfig_rejects_cross_kv_cache_fraction_above_one) and the KvCacheConfig-related assertions as the sole change set. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/llmapi/llm_args.py (1)</summary><blockquote> `2607-2617`: _⚡ Quick win_ **Prefer declarative field constraints for `cross_kv_cache_fraction`.** The custom validator at lines 2607–2616 duplicates constraint logic that belongs in the field definition. Pydantic v2 `Field(ge=0, le=1)` on `Optional[float]` enforces range validation on provided values while allowing `None`, matching the current validator behavior. This aligns with the codebase pattern (e.g., `free_gpu_memory_fraction` uses the same approach) and the coding guideline: "In Python Pydantic models, prefer `PositiveInt`, `NonNegativeInt`, `NonNegativeFloat`, `PositiveFloat`, or `Field(gt=0)` for numeric constraints instead of custom validators." <details> <summary>♻️ Proposed refactor</summary> ```diff cross_kv_cache_fraction: Optional[float] = Field( default=None, + ge=0, + le=1, description= "The fraction of the KV Cache memory should be reserved for cross attention. If set to p, self attention will use 1-p of KV Cache memory and cross attention will use p of KV Cache memory. Default is 50%. Should only be set when using encoder-decoder model." ) @@ - `@field_validator`('cross_kv_cache_fraction') - `@classmethod` - def validate_cross_kv_cache_fraction(cls, v: Optional[float]): - if v is None: - return v - if not 0 <= v <= 1: - raise ValueError( - "kv_cache_config.cross_kv_cache_fraction must be a float between 0 and 1" - ) - return v ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/llmapi/llm_args.py` around lines 2607 - 2617, The custom validator validate_cross_kv_cache_fraction duplicates range checks; remove that validator and declare the field cross_kv_cache_fraction as Optional[float] with Pydantic constraints (e.g., Field(ge=0, le=1)) in its definition so None is allowed but any provided value must be between 0 and 1, matching the pattern used for free_gpu_memory_fraction. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)</summary><blockquote> `77-80`: _⚡ Quick win_ **Declare `CROSS_KV_CACHE_MANAGER` as a normal enum member.** Using `locals()` here is just reflection for a plain enum constant. It makes the symbol harder to grep and reason about, and the direct declaration is clearer. <details> <summary>Suggested fix</summary> ```diff class ResourceManagerType(enum.Enum): KV_CACHE_MANAGER = "KV_CACHE_MANAGER" DRAFT_KV_CACHE_MANAGER = "DRAFT_KV_CACHE_MANAGER" - locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER" + CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER" PEFT_CACHE_MANAGER = "PEFT_CACHE_MANAGER" SEQ_SLOT_MANAGER = "SEQ_SLOT_MANAGER" SPEC_RESOURCE_MANAGER = "SPEC_RESOURCE_MANAGER" ``` </details> As per coding guidelines, "Python code should avoid reflection when functionality can be easily achieved without it". <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 77 - 80, The enum ResourceManagerType currently injects a member via reflection using locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"; replace that reflective line with a direct enum member declaration like CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER" inside the ResourceManagerType class so the member is explicit and discoverable (update any references to ResourceManagerType.CROSS_KV_CACHE_MANAGER if needed). ``` </details> </blockquote></details> <details> <summary>tests/unittest/_torch/executor/test_encoder_step.py (1)</summary><blockquote> `406-427`: _⚡ Quick win_ **Add a beam-expanded case here.** `beam_width > 1` is a headline behavior of this PR, but this helper is only tested with one row per request. A regression where encoder lengths or cached-token counts are not replicated correctly per beam would still pass; a small parametrized case with duplicated beams and mixed cached/new cross-KV would close that gap. As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix ... Note missing negative tests, missing parametrization where multiple backends or dtypes apply." <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_encoder_step.py` around lines 406 - 427, The test test_builds_metadata_for_mixed_projection_and_cached_sequences lacks a beam-expanded case; add a subcase where you call engine._prepare_encoder_decoder_cross_attention_inputs with beam_width > 1 (e.g., 2) and ensure inputs and metadata are validated for replicated beams: pass encoder_hidden_states duplicated per beam (or a batch with beam dimension), provide encoder_seq_lens and encoder_num_cached_tokens_per_seq expanded/repeated per beam, and assert that inputs["encoder_hidden_states"] points to the original tensor or its proper beam-expanded view, inputs["skip_cross_kv_projection"] remains False, inputs["cross_attn_metadata"] is metadata.cross_metadata with metadata.cross_metadata.prepared True, metadata.cross_kv_cache_manager equals the _FakeResourceManager's cross_manager, and that metadata.encoder_seq_lens and metadata.encoder_num_cached_tokens_per_seq have correct replicated values for each beam; reference test_builds_metadata_for_mixed_projection_and_cached_sequences, _prepare_encoder_decoder_cross_attention_inputs, _FakeAttentionMetadata, and _FakeResourceManager when making these additions. ``` </details> </blockquote></details> <details> <summary>tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py (1)</summary><blockquote> `374-606`: **Consider wiring this into the QA functional list.** This is core single-node functional coverage for a new encoder-decoder flow, so it likely deserves an entry in `tests/integration/test_lists/qa/llm_function_core.txt` if you want it exercised in scheduled QA. From this file alone, a perf-list update does not look necessary. As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/." <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py` around lines 374 - 606, The new encoder-decoder integration tests (test_t5_pytorch_generate_encoder_decoder_end_to_end and test_t5_pytorch_generate_encoder_decoder_cuda_graph_mixed_encoder_lengths_batch) should be added to the QA functional test list so they run under scheduled QA; update the QA list (llm_function_core.txt) to include a short entry referencing these tests (model encoder-decoder flow / T5 PyTorch backend) so the change under tests/integration/defs/... is explicitly exercised by the QA pipeline. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/llmapi/llm.py (1)</summary><blockquote> `628-636`: _💤 Low value_ **Minor indentation style issue flagged by linter.** Static analysis flagged E121 (continuation line under-indented) at line 636. The closing parenthesis should align with the opening of the function call or maintain consistent indentation with the parameters. <details> <summary>🔧 Suggested fix for indentation</summary> ```diff (prompt_token_ids, prompt, query_token_ids, multimodal_params, encoder_input_token_ids) = self._preprocess( - inputs, - sampling_params, - disaggregated_params, - encoder_inputs=encoder_inputs, - encoder_input_token_ids=encoder_input_token_ids, - decoder_input_token_ids=decoder_input_token_ids, - ) + inputs, + sampling_params, + disaggregated_params, + encoder_inputs=encoder_inputs, + encoder_input_token_ids=encoder_input_token_ids, + decoder_input_token_ids=decoder_input_token_ids, + ) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/llmapi/llm.py` around lines 628 - 636, The continuation call to self._preprocess is under-indented causing E121; fix by aligning the closing parenthesis with the opening parenthesis of the call or by indenting it to match the parameter block so the call looks consistently formatted; locate the call to _preprocess (the line starting with (prompt_token_ids, prompt, query_token_ids, multimodal_params, encoder_input_token_ids) = self._preprocess(...)) and adjust the indentation of the closing parenthesis and any continuation lines so they align with the opening self._preprocess( or match the indentation level used for the other multi-line calls in the file. ``` </details> </blockquote></details> <details> <summary>tests/unittest/_torch/modeling/test_modeling_enc_dec.py (1)</summary><blockquote> `684-688`: _💤 Low value_ **Consider extracting `_make_cross_attn` as a module-level helper.** The smoke benchmark instantiates `TestCrossAttentionTrtllmBackend()` just to call its `_make_cross_attn` method. This is functional but couples the smoke test to the test class structure. Consider moving `_make_cross_attn` to a module-level helper function (similar to `_build_trtllm_cross_metadata`) for cleaner reuse across test classes. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modeling/test_modeling_enc_dec.py` around lines 684 - 688, Extract the TestCrossAttentionTrtllmBackend._make_cross_attn method into a module-level helper function (e.g., make_cross_attn) and update callers to use that helper instead of instantiating TestCrossAttentionTrtllmBackend; specifically, move the implementation out of the class into a top-level function alongside helpers like _build_trtllm_cross_metadata, keep the same parameters (hidden_size, num_heads, head_dim, dtype, device), preserve behavior and returned tensor/device handling, and replace usages such as TestCrossAttentionTrtllmBackend()._make_cross_attn(...) with make_cross_attn(...). ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/pyexecutor/model_engine.py (2)</summary><blockquote> `1872-1905`: _💤 Low value_ **Consider extracting the nested `update_cross_metadata` function.** The nested `update_cross_metadata` function directly modifies internal attributes like `_seq_lens` and `_seq_lens_cuda`. While this works, it couples tightly to `AttentionMetadata` internals. Additionally, the function captures variables from the enclosing scope (`attn_metadata`, `encoder_seq_lens_tensor`, etc.) which makes the data flow less explicit. Consider whether this could be a method on `AttentionMetadata` itself or at minimum accept all dependencies as parameters for clarity. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 1872 - 1905, The nested function update_cross_metadata tightly couples to AttentionMetadata internals and captures external variables (attn_metadata, encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq, cross_kv_cache_manager), making data flow implicit; refactor by extracting update_cross_metadata into either an instance/class method on AttentionMetadata (e.g., AttentionMetadata.update_for_cross(attn_metadata, encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq, cross_kv_cache_manager)) or a standalone function that takes all dependencies as explicit parameters, and move creation of KVCacheParams (currently constructed inside update_cross_metadata) into that method so it only mutates public attributes (avoid direct writes to _seq_lens/_seq_lens_cuda) and document required inputs. ``` </details> --- `4468-4477`: _💤 Low value_ **Consider using `_get_top_level_model()` for consistency with the embedding lookup pattern.** The encoder discovery at lines 4468-4472 retrieves the encoder directly from `self.model` and `self.model.model`, but the embedding lookup at line 4484 uses `_get_top_level_model()` to unwrap compiled models. For consistency and to guard against potential edge cases with `torch.compile` wrapped submodules, refactor the encoder discovery to use the same `_get_top_level_model()` pattern. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4468 - 4477, The encoder lookup currently reads encoder from self.model and self.model.model directly which can miss wrapped/compiled submodules; change the discovery to first call _get_top_level_model(self.model) (the same helper used by the embedding lookup) and then get the "encoder" attribute (and fallback to inner.model.encoder if necessary) so encoder resolution mirrors the embedding pattern—update the block that references encoder and inner to use _get_top_level_model(...) and retain the existing AttributeError message if no encoder is found. ``` </details> </blockquote></details> <details> <summary>tests/unittest/_torch/executor/test_py_scheduler.py (1)</summary><blockquote> `2506-2516`: _⚡ Quick win_ **Add the constrained cross-pool case for `MAX_UTILIZATION`.** This test uses a roomy cross pool, so it still passes if `ENCODER_INIT` is accidentally charged against cross-KV blocks in the max-utilization path. Using something like `MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)` here would mirror the guaranteed-no-evict regression case and actually prove encoder admission is cross-pool-neutral under both policies. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_py_scheduler.py` around lines 2506 - 2516, The test test_max_utilization_admits_encoder_with_cross_pool currently uses a roomy cross pool; change the cross_kv MockKVCacheManager instantiation to mimic the constrained cross-pool case (e.g., MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)) so the test verifies MAX_UTILIZATION does not incorrectly charge ENCODER_INIT to cross-KV blocks; keep the same requests (make_encoder_request and make_generation_request), call scheduler.schedule_request, and retain the existing assertions on fitting and paused to ensure encoder admission is cross-pool-neutral under this policy. ``` </details> </blockquote></details> <details> <summary>tests/unittest/llmapi/test_encoder_decoder_request_api.py (1)</summary><blockquote> `77-82`: _⚡ Quick win_ **Add one behavioral `BaseLLM.generate` test, not just a signature check.** The async path is covered end-to-end, but the sync wrapper only gets a parameter-order assertion. If `generate()` drops `encoder_input_token_ids` or handles the conflict checks differently from `generate_async()`, this file stays green. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_encoder_decoder_request_api.py` around lines 77 - 82, The test currently only asserts parameter order for BaseLLM.generate; add a behavioral unit test that calls BaseLLM.generate (or a small test subclass/mocked LLM implementing generate_async) to verify that when both priority and encoder_input_token_ids/encoder_inputs kwargs are provided, generate forwards encoder inputs and enforces the same conflict/priority behavior as generate_async — specifically instantiate a fake LLM where generate_async records received kwargs, call the synchronous BaseLLM.generate wrapper with both priority and encoder_input_token_ids/encoder_inputs, and assert the recorded args include the encoder inputs (and that priority wins/behavior matches generate_async) to exercise the actual code paths rather than just signatures. ``` </details> </blockquote></details> <details> <summary>tests/unittest/_torch/executor/test_request_utils.py (1)</summary><blockquote> `64-109`: **QA list update is not needed for this test change.** This is unit-test scope only (`tests/unittest/**`), so integration/release QA list edits are unnecessary in this PR. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_request_utils.py` around lines 64 - 109, This PR only changes unit tests (e.g., test_executor_request_to_llm_request_preserves_encoder_tokens and test_scheduled_requests_does_not_query_encoder_context_chunk_state using ScheduledRequests/EncoderInitRequest), so remove any unrelated QA list / integration/release QA edits from the commit and keep only the test file modifications; ensure the diff contains solely the test changes and no QA list or release checklist updates before finalizing the PR. ``` </details> </blockquote></details> </blockquote></details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Enterprise **Run ID**: `be032c9d-11c9-4f74-b7b7-0350127c8b65` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between f8572abec4bf623dc142d7df8babc67ab1616996 and e4752c379acc0e88142ba6d285b588b3e869be7b. </details> <details> <summary>📒 Files selected for processing (48)</summary> * `cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h` * `cpp/tensorrt_llm/batch_manager/capacityScheduler.cpp` * `cpp/tensorrt_llm/common/attentionOp.cpp` * `cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp` * `cpp/tensorrt_llm/nanobind/thop/bindings.cpp` * `cpp/tensorrt_llm/thop/attentionOp.cpp` * `cpp/tensorrt_llm/thop/attentionOp.h` * `cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp` * `encoder_decoder_porting_guide.md` * `legacy_enc_dec_architecture.md` * `tensorrt_llm/_torch/attention_backend/interface.py` * `tensorrt_llm/_torch/attention_backend/trtllm.py` * `tensorrt_llm/_torch/attention_backend/trtllm_gen.py` * `tensorrt_llm/_torch/attention_backend/vanilla.py` * `tensorrt_llm/_torch/model_config.py` * `tensorrt_llm/_torch/models/__init__.py` * `tensorrt_llm/_torch/models/modeling_bart.py` * `tensorrt_llm/_torch/models/modeling_t5.py` * `tensorrt_llm/_torch/modules/cross_attention.py` * `tensorrt_llm/_torch/modules/encoder_decoder_layer.py` * `tensorrt_llm/_torch/modules/rms_norm.py` * `tensorrt_llm/_torch/pyexecutor/_util.py` * `tensorrt_llm/_torch/pyexecutor/llm_request.py` * `tensorrt_llm/_torch/pyexecutor/model_engine.py` * `tensorrt_llm/_torch/pyexecutor/model_loader.py` * `tensorrt_llm/_torch/pyexecutor/py_executor.py` * `tensorrt_llm/_torch/pyexecutor/resource_manager.py` * `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` * `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py` * `tensorrt_llm/executor/base_worker.py` * `tensorrt_llm/executor/executor.py` * `tensorrt_llm/executor/request.py` * `tensorrt_llm/executor/result.py` * `tensorrt_llm/inputs/data.py` * `tensorrt_llm/llmapi/llm.py` * `tensorrt_llm/llmapi/llm_args.py` * `tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py` * `tests/unittest/_torch/executor/test_dual_pool_kv_cache.py` * `tests/unittest/_torch/executor/test_encoder_step.py` * `tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` * `tests/unittest/_torch/executor/test_py_scheduler.py` * `tests/unittest/_torch/executor/test_request_utils.py` * `tests/unittest/_torch/modeling/test_modeling_enc_dec.py` * `tests/unittest/_torch/test_model_config.py` * `tests/unittest/api_stability/references/llm.yaml` * `tests/unittest/api_stability/references/request_output.yaml` * `tests/unittest/llmapi/test_encoder_decoder_request_api.py` * `tests/unittest/llmapi/test_llm_args.py` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
|
Hi, I'd first like to know if it is possible that we may split this single PR into multiple smaller PRs, so as to reduce the difficulty for reviewer and change scope? Thanks! |
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #53123 [ run ] triggered by Bot. Commit: |
|
PR_Github #53123 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #53229 [ run ] triggered by Bot. Commit: |
|
PR_Github #53229 [ run ] completed with state
|
|
/bot run |
|
PR_Github #53343 [ run ] triggered by Bot. Commit: |
|
PR_Github #53343 [ run ] completed with state
|
|
/bot run |
|
PR_Github #53387 [ run ] triggered by Bot. Commit: |
|
pipeline 42424 actually passed:
|
|
/bot skip --comment "pipeline 42424 passed" |
|
PR_Github #53389 [ skip ] triggered by Bot. Commit: |
|
PR_Github #53387 [ run ] completed with state |
|
PR_Github #53389 [ skip ] completed with state |
| quant_mode=self.quant_mode, | ||
| q_scaling=self.q_scaling, | ||
| position_embedding_type=self.position_embedding_type, | ||
| position_embedding_type=forward_args.position_embedding_type, |
There was a problem hiding this comment.
Please move this to the forward_args section
There was a problem hiding this comment.
same question as below
There was a problem hiding this comment.
forward_args.position_embedding_type should move to the "Per-call (AttentionForwardArgs)" section, but I prefer to not add position_embedding_type to forward_args, see my previous comment.
| encoder_input_lengths=forward_args.encoder_input_lengths, | ||
| relative_attention_bias=forward_args.relative_attention_bias, | ||
| relative_attention_max_distance=forward_args. | ||
| relative_attention_max_distance, |
There was a problem hiding this comment.
Please move these new arguments to the correct sections.
There was a problem hiding this comment.
@yuxianq can you clarify this? I think new arguments are already in the forward_args section?
There was a problem hiding this comment.
You put encoder_input_lengths/relative_attention_bias /relative_attention_max_distance in the "Literals intentionally None" section, please move it to the "Per-call (AttentionForwardArgs)" section.
|
|
||
| use_trtllm_gen = False | ||
| if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION: | ||
| if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION and not metadata.is_cross: |
There was a problem hiding this comment.
Don't add not metadata.is_cross here, move it to trtllm_gen_backend.is_supported
| - disaggregated/test_disaggregated_single_gpu.py::test_disaggregated_logits[True-TinyLlama-1.1B-Chat-v1.0] | ||
| - disaggregated/test_disaggregated_single_gpu.py::test_disaggregated_logprobs[False-TinyLlama-1.1B-Chat-v1.0] | ||
| - disaggregated/test_disaggregated_single_gpu.py::test_disaggregated_logprobs[True-TinyLlama-1.1B-Chat-v1.0] | ||
| - llmapi/test_llm_api_pytorch_bart.py::test_bart_pytorch_generate_encoder_decoder_end_to_end[bf16-kv-v1-cuda-graph-off-greedy-bart-large-cnn] |
There was a problem hiding this comment.
Which arch support cross attention with your PR? We only test sm90, how about sm100?
There was a problem hiding this comment.
Cross attention is now supported on both SM90 and SM100 by falling back to THOP attention.
I'll have a separate PR that adds SM100 cross-attention support to trtllm-gen.
There was a problem hiding this comment.
THOP uses trtllm-gen for sm100 and XQA/fmha_v2 for sm90, if trtllm-gen cross-attention kernels are aded in a follow-up PR, this PR will only contain sm90 support, right?
There was a problem hiding this comment.
Currently it supports attn_backend="TRTLLM" inside which it falls back to thop for cross attention on both sm90 and sm100.
There was a problem hiding this comment.
There are trtllm_gen_backend and thop in trtllm attention backend, trtllm_gen_backend means flashinfer's trtllm-gen kernels (we will refactor it in #15204 to make it clearer), but we also integrate trtllm-gen kernels to thop. If there is no trtllm-gen kernels for cross attention, thop also cannot support it. Do you mean we already have cross attention support in trtllm-gen, but you have not enabled it for trtllm_gen_backend?
| forward_args.cross_kv = None | ||
| forward_args.encoder_input_lengths = None | ||
|
|
||
| forward_args.position_embedding_type = ( |
There was a problem hiding this comment.
Why we need to determine the position_embedding_type during forward? It is a static info, we should update the position_embedding_type in attention backend's init and remove position_embedding_type from forward_args.
There was a problem hiding this comment.
The changes under resource_manager.py come from a perspective of "if not is_cross", from a perspective of supporting the cross attention path you are trying to achieve. This is understandable.
However it is a slight code smell since you are branching out the main code path.You may check cascade812#1 for reference.
All other changes look good to me now. Thank you for your patience.
| is_cross = self.kv_cache_type == CacheTypeCpp.CROSS | ||
| if not is_cross: | ||
| if not self.is_draft: | ||
| _update_kv_cache_draft_token_location(self, scheduled_batch, |
There was a problem hiding this comment.
Nit: This looks like a drive-by fix to achieve symmetry of the v2 manager's path of update_resources. I think we can remove this in case this dillutes the goal of this MR. A follow-up for this addition should also be good. My nit comment is to remove it here.
| dim=1).contiguous() | ||
| else: | ||
| forward_args.cross_kv = None | ||
| forward_args.encoder_input_lengths = metadata.kv_lens_cuda_runtime |
There was a problem hiding this comment.
We have already pass metadata.kv_lens_cuda_runtime to thop, don't need to pass encoder_input_lengths again, we can remove encoder_input_lengths from forward_args
| not forward_args.is_fused_qkv and k is not None | ||
| and v is not None) or (metadata.is_cross | ||
| and not forward_args.update_kv_cache | ||
| and k is None and v is None) |
There was a problem hiding this comment.
Instead of adding assertion, we should update forward_args.update_kv_cache for cross attention here instead of L1432
| forward_args.cross_kv = torch.cat([k_flat, v_flat], | ||
| dim=1).contiguous() | ||
| else: | ||
| forward_args.cross_kv = None |
There was a problem hiding this comment.
Remove this line since cross_kv is None by default.
| forward_args.is_fused_qkv = True | ||
| else: | ||
| forward_args.cross_kv = None | ||
| forward_args.encoder_input_lengths = None |
There was a problem hiding this comment.
Remove this else block since they are None by default.
| # stay as literal ``None`` until DeepSeek V4 sparse-MLA lands. | ||
| sparse_mla_topk_lens=None, | ||
| compressed_kv_cache_pool_ptr=None, | ||
| cross_attention=metadata.is_cross, |
There was a problem hiding this comment.
Use the same name for kwargs, e.g., is_cross=metadata.is_cross
Description
This PR adds the core encoder-decoder request flow, including encoder execution, decoder context/generation handling, cross-attention metadata, and cross-KV cache management.
Key Changes
beam_width > 1for encoder-decoder generation.Below features are not supported in this PR, will come in separate PR.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.Summary by CodeRabbit
New Features
Improvements
Tests