Add TensorRT weight streaming support to the ExecuTorch delegate by shoumikhin · Pull Request #4336 · pytorch/TensorRT

shoumikhin · 2026-06-11T00:08:15Z

Summary

Add TensorRT weight streaming support to the Torch-TensorRT ExecuTorch delegate, so a model whose weights do not all fit in GPU memory can run when exported to an ExecuTorch program. Ref #4334.

Torch-TensorRT already builds a weight-streamable engine when you compile with enable_weight_streaming=True, but the ExecuTorch delegate never set a budget on the engine at load time, so large models could not stream. This change sets the budget in the delegate's init(), after the engine is deserialized and before the execution context is created, which is the same pattern the other Torch-TensorRT runtimes already use.

How it works

By default the delegate applies TensorRT's automatic budget, computed at load time from the free memory on the actual GPU, gated on getStreamableWeightsSize() > 0. So an engine built with enable_weight_streaming=True runs out of the box and adapts to the deploy device. Nothing is baked into the .pte for this default case.

An explicit budget is a non-negative number of bytes and can be set two ways, in order of precedence:

Load time (preferred): an ExecuTorch backend option named weight_streaming_budget, passed by the caller via Module::load(LoadBackendOptionsMap) and read in init() with BackendInitContext::get_runtime_spec. This lets a deployment size the budget for its own GPU without re-exporting. It is the same load-time pattern CoreML and XNNPACK use.
Export time (default / fallback): the same key baked into the .pte via torch_tensorrt.save(output_format="executorch", weight_streaming_budget=N). This is used when no load-time option is given, and it is the only channel for loaders that cannot pass backend options yet (the ExecuTorch Python and Android runtimes).

Resolution order in the delegate is: load-time option, then the baked value, then automatic. The value is a decimal string on the wire because ExecuTorch's typed integer option is only 32 bits wide, and a byte budget can exceed what a signed 32-bit integer holds (about 2 GiB).
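The precedence rule and the reason for the string encoding can be sketched in a few lines of Python. This is only an illustration, not the delegate's code: resolve_budget and INT32_MAX are hypothetical names, and the strings are assumed to have been validated already.

```python
from typing import Optional

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit option could carry


def resolve_budget(load_time: Optional[str], baked: Optional[str]) -> Optional[int]:
    """Return the explicit budget in bytes, or None to request the automatic budget.

    Hypothetical helper: both arguments are the decimal strings described above,
    or None when that channel was not used. Validation is omitted here.
    """
    for raw in (load_time, baked):  # the load-time option wins over the baked value
        if raw is not None:
            return int(raw)
    return None  # neither channel set: fall back to TensorRT's automatic budget


eight_gib = 8 * 1024**3
print(eight_gib > INT32_MAX)                       # True: 8 GiB does not fit in 32 bits
print(resolve_budget("8589934592", "1073741824"))  # 8589934592
print(resolve_budget(None, "1073741824"))          # 1073741824
print(resolve_budget(None, None))                  # None
```

Returning None for the automatic case keeps "no explicit budget" distinct from an explicit budget of 0 bytes, which is a valid non-negative value.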

import torch_tensorrt

compiled = torch_tensorrt.compile(
    model, arg_inputs=example_inputs, enable_weight_streaming=True
)

# Default: the delegate applies the automatic budget at load, so a large model runs.
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs, output_format="executorch"
)

# Optional export-time default budget (overridable at load):
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs,
    output_format="executorch",
    weight_streaming_budget=8 * 1024**3,  # 8 GiB
)

// C++ runtime: use the automatic or baked budget (no change needed), or override
// the budget at load for this specific GPU with a backend option.
executorch::runtime::BackendOptions<1> trt_opts;
trt_opts.set_option("weight_streaming_budget", "8589934592");  // 8 GiB
executorch::runtime::LoadBackendOptionsMap options;
options.set_options("TensorRTBackend", trt_opts.view());

executorch::extension::Module module("model.pte");
module.load(options);
auto outputs = module.forward(inputs);

Changes

TensorRTBackend::init applies the budget via setWeightStreamingBudgetV2 before creating the execution context, gated on getStreamableWeightsSize() > 0. It resolves the budget as load-time runtime spec, then baked compile spec, then automatic.
New standalone WeightStreamingBudget parser (cpp/include and cpp/src), unit tested without a GPU. It uses std::from_chars and accepts only a non-negative decimal integer.
Python save(..., weight_streaming_budget=...) writes the export-time default; validation lives in _compile.py. Passing the budget through compile_specs is rejected in favor of the keyword argument.
C++ gtest cases for the parser and a CPU Python test suite.
Bazel and CMake wiring for the new files.
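To make the parser's contract concrete, here is a rough Python equivalent of "accepts only a non-negative decimal integer". It is a sketch under assumptions, not a port of the C++ code: parse_budget is a hypothetical name, and the 64-bit upper bound is my assumption because the description above does not state the integer width.

```python
INT64_MAX = 2**63 - 1  # assumed upper bound; the PR text does not state the width


def parse_budget(text: str) -> int:
    """Parse a byte budget given as a strict decimal string.

    Python's int() alone is too lenient (it accepts whitespace, '+', '-',
    and underscores), so the input is checked to be ASCII digits only first.
    """
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"budget must be a non-negative decimal integer, got {text!r}")
    value = int(text)
    if value > INT64_MAX:
        raise ValueError(f"budget {text!r} does not fit in 64 bits")
    return value


print(parse_budget("8589934592"))  # 8589934592
for bad in ("", "-1", "+5", " 8", "1_000", "0x10", "1.5"):
    try:
        parse_budget(bad)
    except ValueError:
        print(f"rejected {bad!r}")
```

The empty string is rejected because str.isdigit() returns False for it, which matches the intent that a missing value is not a valid budget.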

Requirements

The load-time override requires an ExecuTorch build that provides BackendInitContext::get_runtime_spec and LoadBackendOptionsMap. The export-time default and the automatic budget work without them.

Backward compatibility

The new code only runs when the engine was built for weight streaming. Engines built with the default settings report zero streamable weights and skip the new path, so they behave exactly as before. This covers every existing .pte.
There is no change to the .pte format or the engine blob.
The one intended behavior change is that a streamable engine now applies the automatic budget at load, which enables the large model case and matches the PyTorch runtimes.

Edge cases

A budget on an engine that was not built for streaming is ignored with a log.
An explicit budget larger than the streamable size is clamped to the streamable size.
A malformed or negative budget is rejected at export (TypeError or ValueError) and at load (Error::InvalidProgram).
If the automatic budget cannot be applied, the runtime retries with maximum streaming before failing.
For a model split into more than one engine, an explicit byte budget applies to each engine and emits a warning. Leave weight_streaming_budget as None for multi-engine models.
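The first two edge cases can be read as a small decision function. The sketch below is illustrative only: plan_budget and its return tuple are hypothetical, and clamping to the streamable size is my reading of "is clamped" rather than something the list states outright.

```python
from typing import Optional, Tuple


def plan_budget(streamable_bytes: int, explicit: Optional[int]) -> Tuple[str, Optional[int]]:
    """Decide what to do for one engine, given its streamable weight size.

    Returns (action, bytes): "skip" for an engine not built for streaming,
    "automatic" when no explicit budget was given, otherwise "explicit" with
    the budget clamped to the streamable size (assumed clamp target).
    """
    if streamable_bytes <= 0:
        return ("skip", None)  # any budget is ignored; the delegate only logs
    if explicit is None:
        return ("automatic", None)
    return ("explicit", min(explicit, streamable_bytes))


print(plan_budget(0, 8 * 1024**3))            # ('skip', None)
print(plan_budget(6 * 1024**3, None))         # ('automatic', None)
print(plan_budget(6 * 1024**3, 8 * 1024**3))  # ('explicit', 6442450944)
```

For a multi-engine model this function would run once per engine with the same explicit value, which is why an explicit byte count is discouraged there.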

Status and validation

The C++ value parser was built and run on CPU, and the Python validation logic was checked on CPU.
The full C++ build, the Python test suite, and the GPU end-to-end tests run in CI.

Follow-ups

Expose LoadBackendOptionsMap in ExecuTorch's Python (and Android) runtime bindings so non-C++ loaders can also set the budget at load. Until then the export-time default covers them.
Add a live, after-load setter (change the budget on an already loaded model) via the backend set_option API. This needs the execution context to be destroyed and recreated, like TRTEngine::set_device_memory_budget, so it is deferred.

narendasan

Why do we want to set the budget at serialization time? This is a runtime configurable setting. Should we expose some sort of API to let someone set this when they deserialize or later?

shoumikhin · 2026-06-12T20:10:31Z

Good call, and I agree the budget should be a runtime setting. Here is what the PR does today, and a change that gives you exactly what you are asking for.

What happens today

If you do not set a budget, nothing is written into the .pte. At load time the delegate asks TensorRT to pick an automatic budget based on the GPU it is actually running on. So the normal case already adapts to the deployment device, with no value baked in.
If you do set an explicit budget, that number is stored in the .pte and applied at load. It is applied inside the delegate's init(), right after the engine is deserialized and just before the execution context is created. This timing is required: TensorRT does not let you change the budget once an execution context exists.

So the only thing fixed at export is the explicit number, and you are right that a fixed byte count is really a per-deployment choice.

Proposed change (a real load-time API)

Read the budget at load from ExecuTorch's existing backend options. In init() the delegate will look for a weight_streaming_budget option in BackendInitContext (the LoadBackendOptionsMap a caller passes to Module::load). These options arrive before the execution context is created, so the timing stays correct and no context rebuild is needed.
Order of precedence: load-time option first, then the value baked at export, then automatic. This is the same pattern CoreML uses for compute_unit and XNNPACK uses for its load options.
The value is passed as a string, because ExecuTorch's typed integer option is only 32 bit and a byte budget can be larger than 2 GB.

Why keep the export-time value too

ExecuTorch's Python and Android load paths do not expose backend options yet (only C++ Module and iOS do). So for anyone loading a .pte from Python or Android, the value baked at export is currently the only way to set an explicit budget. I would keep it as an overridable default rather than remove it.

Changing it after load (the "or later" part)

This is doable as a follow-up, not in this PR. TensorRT requires destroying and recreating the execution context to change the budget, and ExecuTorch's post-load set_option is global to the backend with no per-engine handle. The CUDA backend already implements set_option, so there is a pattern to follow when we do it.

One question so I build the right thing

Will these large models be served mainly from the C++ runtime or from Python? If C++, the load-time option covers it and we can treat the baked value as just a default. If Python, we need to keep the baked value until ExecuTorch exposes backend options to Python, which I am happy to help add upstream.

Apply a TensorRT weight streaming budget in the delegate init(), after the engine is deserialized and before the execution context is created (the budget cannot be changed once a context exists). When the engine was built with weight streaming, the delegate applies TensorRT's automatic budget by default, mirroring the PyTorch runtimes. An explicit budget can be set two ways, in order of precedence: a load-time ExecuTorch backend option ("weight_streaming_budget" runtime spec passed to Module::load), or the same key baked into the .pte at export via torch_tensorrt.save(output_format="executorch", weight_streaming_budget=N). The load-time option lets a deployment size the budget for its own GPU without re-exporting; the export-time value is the default and the only channel for loaders that cannot pass backend options yet (Python, Android). The value is a non-negative decimal byte count, string-encoded on the wire. Non-streamable engines make no budget call, so existing programs are unchanged. The one intended behavior change is that a streamable engine now applies the automatic budget on load, which enables running models whose weights exceed GPU memory. Ref pytorch#4334

cehongwang

Overall OK. Some minor comments

cehongwang · 2026-06-12T23:15:22Z

+        return Error::InvalidProgram;
+      }
+      is_explicit = true;
+    }


What happens if len = 0? Give a warning to the user

cehongwang · 2026-06-12T23:19:43Z

+    resolved_compile_specs = _resolve_executorch_compile_specs(
+        exp_program,
+        list(executorch_compile_specs),
+        kwargs.get("weight_streaming_budget"),


Here we expect users to set the total budget right? If there is a graph break, is there a correct mapping to the per-engine budget?

meta-cla Bot added the cla signed label Jun 11, 2026

github-actions Bot added component: tests Issues re: Tests component: api [Python] Issues re: Python API component: api [C++] Issues re: C++ API labels Jun 11, 2026

shoumikhin force-pushed the weight_streaming_executorch_delegate branch 5 times, most recently from 983583d to 053e902 Compare June 11, 2026 04:56

shoumikhin changed the title ~~Prototype: TensorRT weight streaming in the ExecuTorch delegate~~ Add TensorRT weight streaming support to the ExecuTorch delegate Jun 11, 2026

narendasan requested a review from cehongwang June 11, 2026 19:47

narendasan reviewed Jun 12, 2026


shoumikhin force-pushed the weight_streaming_executorch_delegate branch from 053e902 to c354c29 Compare June 12, 2026 20:43

shoumikhin force-pushed the weight_streaming_executorch_delegate branch from c354c29 to a996b54 Compare June 12, 2026 21:27

cehongwang requested changes Jun 12, 2026

shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:weight_streaming_executorch_delegate
