executorch: log once when the TensorRT delegate stages host I/O; document the device-handling contract by shoumikhin · Pull Request #4328 · pytorch/TensorRT

shoumikhin · 2026-06-09T19:37:30Z

What

Two small, self-contained changes to the ExecuTorch TensorRT delegate runtime (cpp/.../executorch/TensorRTBackend.{h,cpp}):

1. Observability for the zero-copy path. execute() binds device-resident I/O straight into the execution context (zero-copy), but stages host-resident I/O through a per-call device allocation plus cudaMemcpyAsync without emitting any signal. This change adds a one-shot ET_LOG(Info), guarded by a new staged_warned flag on EngineHandle that is read and written under the existing mu lock, the first time an engine stages host I/O. A caller who intends device-resident, zero-copy I/O can then tell when they have fallen off the fast path. Purely additive.
2. Document the device-handling contract. A WHY-only comment at execute()'s header records three intentional choices that diverge from the CUDA/AOTI delegate: (a) the runtime ignores the AOT target_device metadata and sniffs pointers at runtime (so asserting device_type would wrongly reject host inputs today); (b) it stages H2D/D2H itself rather than via ExecuTorch device-copy ops or a DeviceAllocator; and (c) the engine-baked device_id is the runtime source of truth for the GPU.
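The one-shot guard in item 1 can be sketched roughly as follows. This is a minimal illustration and not the code in TensorRTBackend.cpp: only the names EngineHandle, mu, and staged_warned come from the description above. The name field, the log_lines() sink (standing in for ET_LOG(Info)), and the io_is_host parameter (standing in for the result of the runtime pointer check) are hypothetical.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the delegate's per-engine state. Only `mu` and
// `staged_warned` are named in the PR description; `name` is illustrative.
struct EngineHandle {
  std::mutex mu;
  bool staged_warned = false;
  std::string name;
};

// Hypothetical log sink standing in for ET_LOG(Info). It records each line
// so the number of emitted lines can be inspected.
std::vector<std::string>& log_lines() {
  static std::vector<std::string> lines;
  return lines;
}

// Sketch of one execute() call. `io_is_host` is an assumption that stands in
// for whatever the runtime pointer check reports for this call's I/O.
void execute(EngineHandle& engine, bool io_is_host) {
  std::lock_guard<std::mutex> lock(engine.mu);
  if (!io_is_host) {
    return;  // zero-copy fast path: nothing staged, nothing logged
  }
  if (!engine.staged_warned) {
    // First host-staged call for this engine: log once, then stay quiet.
    engine.staged_warned = true;
    log_lines().push_back(engine.name + ": staging host I/O via device copy");
  }
  // The per-call device allocation and async copies would happen here.
}
```

Because the flag lives on the engine handle rather than in a global, each engine reports at most once, and calls that stay on the device-resident path never touch the log.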

Why now

ExecuTorch is moving on-device memory planning toward the default. Once that lands, device-aware planning is meant to keep delegate I/O on device; a regression that reinserts a host op between two GPU delegates would silently restore copies. The one-shot log makes that visible, and the contract comment keeps the intentional divergences from being "fixed" into breakage.

Risk

Low. (1) is additive logging guarded by a flag under the existing lock; (2) is a comment. No fast-path or control-flow change.

Test plan

Device-resident I/O: the new Info line is NOT emitted (fast path unchanged).
Host I/O: exactly one Info line per engine on the first execute().
Repeated calls / a second engine: one line per engine total.
Existing TensorRT delegate tests pass; outputs unchanged.

Draft for review; part of a small device-support readiness series (a partitioner-side target_device fix + a readiness test follow).

…the device-handling contract

The ExecuTorch TensorRT delegate binds device-resident I/O pointers straight into the execution context (zero-copy). When an I/O tensor is in host memory, it stages the tensor through a per-call device allocation plus cudaMemcpyAsync, with no signal. That is an invisible performance cliff, and a silent regression hazard once device-aware memory planning is expected to keep delegate I/O on device.

Emit a single Info log the first time an engine stages host I/O. Also document, at execute()'s header, the delegate's deliberate device-handling contract (runtime pointer-sniffing instead of reading the AOT device metadata, self-managed host/device staging, and the engine-baked device_id as the runtime source of truth) so a future change does not "reconcile" it into breakage.

shoumikhin · 2026-06-10T00:17:29Z

@narendasan @lanluo-nvidia — would appreciate a review when you have a chance.

This adds a one-shot ET_LOG(Info) the first time the ExecuTorch TensorRT delegate has to stage host I/O (a CPU↔device copy), plus a comment documenting the delegate's device-handling contract. It's prep for ExecuTorch's device-memory-planning flip: once delegate I/O is GPU-resident the copies disappear, and this makes the non-optimal (host-staged) path visible/diagnosable without per-run log spam.

The red CI is the current repo-wide Python-3.10 / Windows infra outage — torch_tensorrt fails to install (pypi.nvidia.com TLS error on Windows + a nightly-TRT weight-streaming flake), so the 3.10 test lanes error at import. Same failures show on unrelated PRs (#4318, #4262); the C++/lint/build lanes here are green. Not caused by this change.

meta-cla Bot added the cla signed label Jun 9, 2026

github-actions Bot added the component: api [C++] Issues re: C++ API label Jun 9, 2026

github-actions Bot requested a review from narendasan June 9, 2026 19:37

shoumikhin mentioned this pull request Jun 9, 2026

executorch: derive the TensorRT delegate target_device from the engine's real device index #4329

Merged

shoumikhin force-pushed the et-trt-host-staging-log branch from 2660c91 to 9e64259 Compare June 9, 2026 20:29

shoumikhin marked this pull request as ready for review June 9, 2026 20:55

shoumikhin closed this Jun 10, 2026

shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:et-trt-host-staging-log
