
Add TensorRT weight streaming support to the ExecuTorch delegate #4336

Draft
shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:weight_streaming_executorch_delegate
shoumikhin commented Jun 11, 2026

Contributor

Summary

Add TensorRT weight streaming support to the Torch-TensorRT ExecuTorch delegate, so a model whose weights do not all fit in GPU memory can run when exported to an ExecuTorch program. Ref #4334.

Torch-TensorRT already builds a weight-streamable engine when you compile with enable_weight_streaming=True, but the ExecuTorch delegate never set a budget on the engine at load time, so large models could not stream. This change sets the budget in the delegate's init(), after the engine is deserialized and before the execution context is created, which is the same pattern the other Torch-TensorRT runtimes already use.

How it works

By default the delegate applies TensorRT's automatic budget, computed at load time from the free memory on the actual GPU, gated on getStreamableWeightsSize() > 0. So an engine built with enable_weight_streaming=True runs out of the box and adapts to the deploy device. Nothing is baked into the .pte for this default case.
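The gating described above can be sketched in Python against a duck-typed engine object. This is an illustration only: the delegate itself is C++, `apply_default_budget` and `StubEngine` are hypothetical names, and the attribute names are assumed to follow the TensorRT Python bindings (`streamable_weights_size`, `weight_streaming_budget_v2`, `get_weight_streaming_automatic_budget`). A stub stands in for a real engine so the sketch runs without a GPU.

```python
class StubEngine:
    """Stand-in for a deserialized engine; not a real TensorRT object."""

    def __init__(self, streamable_weights_size, automatic_budget):
        self.streamable_weights_size = streamable_weights_size
        self._automatic_budget = automatic_budget
        self.weight_streaming_budget_v2 = None  # unset until a budget is applied

    def get_weight_streaming_automatic_budget(self):
        return self._automatic_budget


def apply_default_budget(engine):
    """Apply the automatic budget only if the engine has streamable weights.

    Intended to run before the execution context is created.
    Returns True when a budget was applied.
    """
    if engine.streamable_weights_size <= 0:
        return False  # not built for streaming: leave the engine untouched
    engine.weight_streaming_budget_v2 = engine.get_weight_streaming_automatic_budget()
    return True


GIB = 1024**3
streaming = StubEngine(streamable_weights_size=10 * GIB, automatic_budget=6 * GIB)
plain = StubEngine(streamable_weights_size=0, automatic_budget=0)
applied_streaming = apply_default_budget(streaming)
applied_plain = apply_default_budget(plain)
```

The non-streamable engine is never touched, which is the property the backward compatibility section relies on.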

An explicit budget is a non-negative number of bytes and can be set two ways, in order of precedence:

  1. Load time (preferred): an ExecuTorch backend option named weight_streaming_budget, passed by the caller via Module::load(LoadBackendOptionsMap) and read in init() with BackendInitContext::get_runtime_spec. This lets a deployment size the budget for its own GPU without re-exporting. It is the same load-time pattern CoreML and XNNPACK use.
  2. Export time (default / fallback): the same key baked into the .pte via torch_tensorrt.save(output_format="executorch", weight_streaming_budget=N). This is used when no load-time option is given, and it is the only channel for loaders that cannot pass backend options yet (the ExecuTorch Python and Android runtimes).

Resolution order in the delegate is: load-time option, then the baked value, then automatic. The value is a decimal string on the wire because ExecuTorch's typed integer option is only 32-bit and a byte budget can exceed 2 GB.
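As a rough model of that precedence, the following Python sketch resolves a budget from the two optional sources and falls back to automatic. `resolve_budget` is a hypothetical helper written for illustration, not the delegate's C++ code, and it assumes both inputs are already validated decimal strings. The last lines show why a 32-bit integer option would not be enough for an 8 GiB budget.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit option could carry


def resolve_budget(load_time_option=None, baked_value=None):
    """Return (source, budget_bytes); budget_bytes is None for automatic."""
    if load_time_option is not None:
        return ("load-time", int(load_time_option))
    if baked_value is not None:
        return ("export-time", int(baked_value))
    return ("automatic", None)


eight_gib = 8 * 1024**3
wire_value = str(eight_gib)  # string-encoded, as on the wire
fits_in_int32 = eight_gib <= INT32_MAX
```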

import torch_tensorrt

compiled = torch_tensorrt.compile(
    model, arg_inputs=example_inputs, enable_weight_streaming=True
)

# Default: the delegate applies the automatic budget at load, so a large model runs.
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs, output_format="executorch"
)

# Optional export-time default budget (overridable at load):
torch_tensorrt.save(
    compiled, "model.pte", arg_inputs=example_inputs,
    output_format="executorch",
    weight_streaming_budget=8 * 1024**3,  # 8 GiB
)

// C++ runtime: use the automatic or baked budget (no change needed), or override
// the budget at load for this specific GPU with a backend option.
executorch::runtime::BackendOptions<1> trt_opts;
trt_opts.set_option("weight_streaming_budget", "8589934592");  // 8 GiB
executorch::runtime::LoadBackendOptionsMap options;
options.set_options("TensorRTBackend", trt_opts.view());

executorch::extension::Module module("model.pte");
module.load(options);
auto outputs = module.forward(inputs);

Changes

  • TensorRTBackend::init applies the budget via setWeightStreamingBudgetV2 before creating the execution context, gated on getStreamableWeightsSize() > 0. It resolves the budget as load-time runtime spec, then baked compile spec, then automatic.
  • New standalone WeightStreamingBudget parser (cpp/include and cpp/src), unit tested without a GPU. It uses std::from_chars and accepts only a non-negative decimal integer.
  • Python save(..., weight_streaming_budget=...) writes the export-time default; validation lives in _compile.py. Passing the budget through compile_specs is rejected in favor of the keyword argument.
  • C++ gtest cases for the parser and a CPU Python test suite.
  • Bazel and CMake wiring for the new files.
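To make the parser's acceptance rule concrete, here is a minimal Python model of a strict non-negative decimal parse. It is not the C++ WeightStreamingBudget parser: `parse_budget` is a hypothetical name, and the signed 64-bit upper bound is an assumption, since the description does not state the integer width.

```python
INT64_MAX = 2**63 - 1  # assumed upper bound; the actual width is not stated above


def parse_budget(text):
    """Return the budget in bytes, or None if text is not a plain decimal integer."""
    # Reject empty input, signs, whitespace, exponents, and any non-ASCII digits.
    if not text or any(ch not in "0123456789" for ch in text):
        return None
    value = int(text)
    return value if value <= INT64_MAX else None
```

Inputs such as `"-1"`, `"+5"`, `" 42"`, `"1e9"`, and `"12abc"` are all rejected in this model, while `"0"` is accepted as a valid budget.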

Requirements

The load-time override uses ExecuTorch's BackendInitContext::get_runtime_spec and LoadBackendOptionsMap. The export-time default and the automatic budget do not depend on either API.

Backward compatibility

  • The new code only runs when the engine was built for weight streaming. Engines built with the default settings report zero streamable weights and skip the new path, so they behave exactly as before. This covers every existing .pte.
  • There is no change to the .pte format or the engine blob.
  • The one intended behavior change is that a streamable engine now applies the automatic budget at load, which enables the large model case and matches the PyTorch runtimes.

Edge cases

  • A budget on an engine that was not built for streaming is ignored with a log.
  • An explicit budget larger than the streamable size is clamped.
  • A malformed or negative budget is rejected at export (TypeError or ValueError) and at load (Error::InvalidProgram).
  • If the automatic budget cannot be applied, the runtime retries with maximum streaming before failing.
  • For a model split into more than one engine, an explicit byte budget applies to each engine and emits a warning. Leave weight_streaming_budget as None for multi-engine models.
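The clamping and multi-engine cases can be illustrated with a small Python model. Both function names are hypothetical, and the sketch does not say whether the clamp happens in the delegate or inside TensorRT; it only shows the resulting numbers.

```python
GIB = 1024**3


def effective_budget(requested, streamable_size):
    """Budget that ends up applied to one engine, or None if it is ignored."""
    if streamable_size <= 0:
        return None  # engine was not built for streaming: budget is ignored
    return min(requested, streamable_size)  # clamp to the streamable size


def per_engine_budgets(requested, streamable_sizes):
    """An explicit budget is applied to every engine, not divided among them."""
    return [effective_budget(requested, size) for size in streamable_sizes]


# One explicit 4 GiB budget on a model split into three engines.
budgets = per_engine_budgets(4 * GIB, [6 * GIB, 2 * GIB, 0])
total_applied = sum(b for b in budgets if b is not None)
```

In this example the applied budgets sum to 6 GiB even though the caller asked for 4 GiB, which is why an explicit value on a multi-engine model emits a warning.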

Status and validation

  • The C++ value parser was built and run on CPU, and the Python validation logic was checked on CPU.
  • The full C++ build, the Python test suite, and the GPU end-to-end tests run in CI.

Follow-ups

  • Expose LoadBackendOptionsMap in ExecuTorch's Python (and Android) runtime bindings so non-C++ loaders can also set the budget at load. Until then the export-time default covers them.
  • Add a live, after-load setter (change the budget on an already loaded model) via the backend set_option API. This needs the execution context to be destroyed and recreated, like TRTEngine::set_device_memory_budget, so it is deferred.

meta-cla bot added the cla signed label Jun 11, 2026
github-actions bot added the component: tests, component: api [Python], and component: api [C++] labels Jun 11, 2026
shoumikhin force-pushed the weight_streaming_executorch_delegate branch 5 times, most recently from 983583d to 053e902 June 11, 2026 04:56
shoumikhin changed the title from "Prototype: TensorRT weight streaming in the ExecuTorch delegate" to "Add TensorRT weight streaming support to the ExecuTorch delegate" Jun 11, 2026
narendasan requested a review from cehongwang June 11, 2026 19:47

narendasan left a comment

Collaborator

Why do we want to set the budget at serialization time? This is a runtime configurable setting. Should we expose some sort of API to let someone set this when they deserialize or later?

shoumikhin commented Jun 12, 2026

Contributor Author

Good call, and I agree the budget should be a runtime setting. Here is what the PR does today, and a change that gives you exactly what you are asking for.

What happens today

  • If you do not set a budget, nothing is written into the .pte. At load time the delegate asks TensorRT to pick an automatic budget based on the GPU it is actually running on. So the normal case already adapts to the deployment device, with no value baked in.
  • If you do set an explicit budget, that number is stored in the .pte and applied at load. It is applied inside the delegate's init(), right after the engine is deserialized and just before the execution context is created. This timing is required: TensorRT does not let you change the budget once an execution context exists.

So the only thing fixed at export is the explicit number, and you are right that a fixed byte count is really a per-deployment choice.

Proposed change (a real load-time API)

  • Read the budget at load from ExecuTorch's existing backend options. In init() the delegate will look for a weight_streaming_budget option in BackendInitContext (the LoadBackendOptionsMap a caller passes to Module::load). These options arrive before the execution context is created, so the timing stays correct and no context rebuild is needed.
  • Order of precedence: load-time option first, then the value baked at export, then automatic. This is the same pattern CoreML uses for compute_unit and XNNPACK uses for its load options.
  • The value is passed as a string, because ExecuTorch's typed integer option is only 32-bit and a byte budget can be larger than 2 GB.

Why keep the export-time value too

ExecuTorch's Python and Android load paths do not expose backend options yet (only C++ Module and iOS do). So for anyone loading a .pte from Python or Android, the value baked at export is currently the only way to set an explicit budget. I would keep it as an overridable default rather than remove it.

Changing it after load (the "or later" part)

This is doable as a follow-up, not in this PR. TensorRT requires destroying and recreating the execution context to change the budget, and ExecuTorch's post-load set_option is global to the backend with no per-engine handle. The CUDA backend already implements set_option, so there is a pattern to follow when we do it.

One question so I build the right thing

Will these large models be served mainly from the C++ runtime or from Python? If C++, the load-time option covers it and we can treat the baked value as just a default. If Python, we need to keep the baked value until ExecuTorch exposes backend options to Python, which I am happy to help add upstream.

shoumikhin force-pushed the weight_streaming_executorch_delegate branch from 053e902 to c354c29 June 12, 2026 20:43

Apply a TensorRT weight streaming budget in the delegate init(), after the engine
is deserialized and before the execution context is created (the budget cannot be
changed once a context exists). When the engine was built with weight streaming,
the delegate applies TensorRT's automatic budget by default, mirroring the PyTorch
runtimes.

An explicit budget can be set two ways, in order of precedence: a load-time
ExecuTorch backend option ("weight_streaming_budget" runtime spec passed to
Module::load), or the same key baked into the .pte at export via
torch_tensorrt.save(output_format="executorch", weight_streaming_budget=N). The
load-time option lets a deployment size the budget for its own GPU without
re-exporting; the export-time value is the default and the only channel for
loaders that cannot pass backend options yet (Python, Android). The value is a
non-negative decimal byte count, string-encoded on the wire.

Non-streamable engines make no budget call, so existing programs are unchanged.
The one intended behavior change is that a streamable engine now applies the
automatic budget on load, which enables running models whose weights exceed GPU
memory.

Ref pytorch#4334

shoumikhin force-pushed the weight_streaming_executorch_delegate branch from c354c29 to a996b54 June 12, 2026 21:27

cehongwang left a comment

Collaborator

Overall OK. Some minor comments

return Error::InvalidProgram;
}
is_explicit = true;
}

Collaborator

What happens if len = 0? Give a warning to the user

resolved_compile_specs = _resolve_executorch_compile_specs(
exp_program,
list(executorch_compile_specs),
kwargs.get("weight_streaming_budget"),

Collaborator

Here we expect users to set the total budget, right? If there is a graph break, is there a correct mapping to the per-engine budget?


Labels

cla signed, component: api [C++], component: api [Python], component: tests


3 participants