
executorch: log once when the TensorRT delegate stages host I/O; document the device-handling contract #4328

Closed
shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:et-trt-host-staging-log


@shoumikhin

Contributor

What

Two small, self-contained changes to the ExecuTorch TensorRT delegate runtime (cpp/.../executorch/TensorRTBackend.{h,cpp}):

  1. Observability for the zero-copy path. execute() binds device-resident I/O straight into the execution context (zero-copy), but stages host-resident I/O through a per-call device allocation + cudaMemcpyAsync without emitting any signal. This adds a one-shot ET_LOG(Info) the first time an engine stages host I/O, guarded by a new staged_warned flag on EngineHandle that is read and written under the existing mu lock. A caller intending device-resident, zero-copy I/O can then tell when they have fallen off the fast path. Purely additive.

  2. Document the device-handling contract. A WHY-only comment at execute()'s header records three intentional choices that diverge from the CUDA/AOTI delegate: (a) the runtime ignores the AOT target_device metadata and sniffs pointers at runtime (so asserting device_type would wrongly reject host inputs today), (b) it stages H2D/D2H itself rather than via ExecuTorch device-copy ops / a DeviceAllocator, and (c) the engine-baked device_id is the runtime source of truth for the GPU.
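A minimal sketch of the one-shot pattern from item 1, not the actual TensorRTBackend.cpp code. Only `EngineHandle`, `mu`, and `staged_warned` are names taken from this description. `IoTensor`, its `on_device` field (a stand-in for the result of the runtime pointer sniff), the `g_info_log` vector (a stand-in for the ET_LOG(Info) sink), the message text, and the choice to hold `mu` for the whole call are assumptions made so the example compiles without ExecuTorch or CUDA.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the ET_LOG(Info) sink, so the sketch runs standalone.
static std::vector<std::string> g_info_log;

// Hypothetical I/O descriptor; on_device stands in for the runtime pointer sniff.
struct IoTensor {
  void* data;
  bool on_device;
};

struct EngineHandle {
  std::mutex mu;               // the existing per-engine lock
  bool staged_warned = false;  // new flag, only touched while mu is held
};

// Returns how many tensors took the staged (host) path on this call.
std::size_t execute(EngineHandle& engine, const std::vector<IoTensor>& io) {
  std::lock_guard<std::mutex> lock(engine.mu);
  std::size_t staged = 0;
  for (const IoTensor& t : io) {
    if (t.on_device) {
      continue;  // zero-copy: the pointer would be bound directly
    }
    ++staged;  // real code: per-call device allocation + cudaMemcpyAsync
    if (!engine.staged_warned) {
      engine.staged_warned = true;  // later host tensors and calls stay quiet
      g_info_log.push_back("TensorRT delegate: staging host I/O (hypothetical message)");
    }
  }
  return staged;
}
```

Because the flag lives on the engine handle rather than in a function-local static, each engine reports independently, which is what the test plan below expects.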

Why now

ExecuTorch is moving on-device memory planning toward the default. Once that lands, device-aware planning is meant to keep delegate I/O on device; a regression that reinserts a host op between two GPU delegates would silently restore copies. The one-shot log makes that visible, and the contract comment keeps the intentional divergences from being "fixed" into breakage.

Risk

Low. (1) is additive logging guarded by a flag under the existing lock; (2) is a comment. No fast-path or control-flow change.

Test plan

  • Device-resident I/O: the new Info line is NOT emitted (fast path unchanged).
  • Host I/O: exactly one Info line per engine on the first execute().
  • Repeated calls / a second engine: one line per engine total.
  • Existing TensorRT delegate tests pass; outputs unchanged.
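The first three bullets can be expressed as checks against a test double. This is a sketch under stated assumptions, not the PR's test code: `FakeEngine`, `info_lines`, and `test_plan_holds` are hypothetical names, and the double models only the one-shot flag, not TensorRT or the real log sink.

```cpp
#include <mutex>

// Hypothetical test double: models only the one-shot flag, not TensorRT.
struct FakeEngine {
  std::mutex mu;
  bool staged_warned = false;
  int info_lines = 0;  // counts what would be ET_LOG(Info) emissions

  void execute(bool has_host_io) {
    std::lock_guard<std::mutex> lock(mu);
    if (has_host_io && !staged_warned) {
      staged_warned = true;
      ++info_lines;
    }
  }
};

// Runs the first three test-plan bullets; returns true if all of them hold.
bool test_plan_holds() {
  FakeEngine device_only, host_a, host_b;
  device_only.execute(false);  // device-resident I/O: no line expected
  device_only.execute(false);
  host_a.execute(true);        // first host-staged call: one line
  host_a.execute(true);        // repeated calls: still one line
  host_a.execute(false);
  host_b.execute(true);        // a second engine logs its own single line
  return device_only.info_lines == 0 && host_a.info_lines == 1 &&
         host_b.info_lines == 1;
}
```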

Draft for review; part of a small device-support readiness series (a partitioner-side target_device fix and a readiness test will follow).

…the device-handling contract

The ExecuTorch TensorRT delegate binds device-resident I/O pointers straight into the execution context (zero-copy). When an I/O tensor is host memory it silently stages through a per-call device allocation plus cudaMemcpyAsync, with no signal. That is an invisible performance cliff, and a silent regression hazard once device-aware memory planning is expected to keep delegate I/O on device. Emit a single Info log the first time an engine stages host I/O.

Also document, at execute()'s header, the delegate's deliberate device-handling contract (runtime pointer-sniffing instead of reading the AOT device metadata, self-managed host/device staging, and the engine-baked device_id as the runtime source of truth) so a future change does not "reconcile" it into breakage.
@shoumikhin shoumikhin force-pushed the et-trt-host-staging-log branch from 2660c91 to 9e64259 Compare June 9, 2026 20:29
@shoumikhin shoumikhin marked this pull request as ready for review June 9, 2026 20:55
@shoumikhin

Contributor Author

@narendasan @lanluo-nvidia — would appreciate a review when you have a chance.

This adds a one-shot ET_LOG(Info) the first time the ExecuTorch TensorRT delegate has to stage host I/O (a CPU↔device copy), plus a comment documenting the delegate's device-handling contract. It's prep for ExecuTorch's device-memory-planning flip: once delegate I/O is GPU-resident the copies disappear, and this makes the non-optimal (host-staged) path visible/diagnosable without per-run log spam.

The red CI is the current repo-wide Python-3.10 / Windows infra outage — torch_tensorrt fails to install (pypi.nvidia.com TLS error on Windows + a nightly-TRT weight-streaming flake), so the 3.10 test lanes error at import. Same failures show on unrelated PRs (#4318, #4262); the C++/lint/build lanes here are green. Not caused by this change.

@shoumikhin shoumikhin closed this Jun 10, 2026