
WebGPU: add memory aliasing for intermediate tensor buffers #19305

Draft
digantdesai wants to merge 6 commits into main from wgpu_memory_aliasing

Conversation

@digantdesai
Contributor

Use ETVK's mem_obj_id in the WebGPU runtime to implement memory aliasing

The export pipeline already runs a greedy memory planning pass that assigns
mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime
was ignoring it and allocating a dedicated WGPUBuffer per tensor.

Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the
same mem_obj_id now share a single WGPUBuffer sized to the largest user.
Constants and tensors without a mem_obj_id still get dedicated buffers.

Add a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both
correctness and that memory aliasing produces savings (~20% for this model).

Co-authored with Claude.

Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with
ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation,
we want to know immediately rather than silently creating a dummy 4-byte
buffer that masks the issue.

Co-authored with Claude.

Invert the condition to eliminate the empty if-body with a comment.

Co-authored with Claude.

Export and run the chained-add memory aliasing test in
test_build_webgpu.sh so it runs automatically instead of requiring
a manual WEBGPU_TEST_CHAINED_MODEL env var.

Co-authored with Claude.

Longer chain produces more intermediates, giving the memory planner
more opportunity to alias buffers. Expected output: 3x + 3y.

Co-authored with Claude.

Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the
dedicated buffer path and the destructor stay consistent. Previously
the buffer would leak and get overwritten by the shared buffer pass.

Also make the chained-add test actually fail when aliasing is absent
instead of just printing informational messages.

Co-authored with Claude.
@pytorch-bot

pytorch-bot Bot commented May 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs

As of commit 55ceea4 with merge base cf01617:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 5, 2026
@digantdesai
Contributor Author

@claude review this PR

@github-actions

github-actions Bot commented May 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@claude

claude Bot commented May 5, 2026

Claude finished @digantdesai's task in 1m 37s


Review: WebGPU memory aliasing for intermediate tensor buffers

  • Read changed files and understand the implementation
  • Review WebGPUBackend changes
  • Review WebGPUGraph changes (core aliasing logic)
  • Review test changes
  • Post review feedback

Overall: Clean, well-structured implementation. The approach of using ETVK's mem_obj_id to share GPU buffers between non-overlapping intermediate tensors is sound and the test coverage is good. A few items worth addressing:


Issues

1. Global mutable state for memory stats (WebGPUBackend.cpp:41)

static WebGPUMemoryStats s_last_memory_stats;

This is a process-wide global that gets overwritten on every init() call. If two models are loaded concurrently (or sequentially), the stats from the first are silently lost. This is fine for a test-only helper, but consider documenting that intent (e.g., naming it something like s_last_memory_stats_for_testing) or gating it behind a test/debug flag so it doesn't ship in production builds. If this is intentionally test-only, could it live in the test file instead?

2. Destructor correctness — shared buffer double-release guard (WebGPUGraph.cpp:55-58)

for (size_t i = 0; i < tensors_.size(); i++) {
    if (tensors_[i].buffer &&
        (i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
      wgpuBufferRelease(tensors_[i].buffer);
    }
}

The guard i >= tensor_mem_obj_ids_.size() should never be true since both vectors are resized to num_vals in build(). If build() throws before that resize (e.g., device is null), both vectors are empty and the loop wouldn't execute anyway. The check is harmless but somewhat misleading—a simple tensor_mem_obj_ids_[i] < 0 would suffice given the invariant. Minor nit, no action needed.

3. total_bytes() excludes tensor_buffer_bytes field (WebGPUGraph.h:44-46)

size_t total_bytes() const {
    return shared_buffer_bytes + unshared_tensor_buffer_bytes +
        staging_buffer_bytes + uniform_buffer_bytes;
}

This is correct (it equals tensor_buffer_bytes + staging + uniform), but it's mildly confusing that total_bytes() doesn't reference the tensor_buffer_bytes field directly. Rewriting as return tensor_buffer_bytes + staging_buffer_bytes + uniform_buffer_bytes; would make the semantics clearer, since tensor_buffer_bytes is already computed as shared + unshared in memory_stats().



Minor Suggestions

4. Test assertion could verify specific aliasing count (test_webgpu_native.cpp:143)

The test asserts stats.num_shared_objects > 0 — for a 5-op chained add with 2 inputs + 1 output + intermediates, you could assert a more specific expected value (e.g., that num_shared_objects equals a known count) to catch regressions where aliasing becomes less effective. Though I recognize this is fragile if the partitioner changes its allocation strategy.

5. export_chained_add_model not invocable from CLI (test_add.py)

The export functions are defined but only callable programmatically or via the test script's -c block. Not a blocker, just noting that adding a --export-chained CLI arg would be nice for manual testing.


What looks good

  • The two-pass approach (collect max sizes in pass 1, allocate shared buffers in pass 2) is clean and avoids over-allocation.
  • The destructor properly separates ownership: shared buffers are released once via shared_buffers_, unshared per-tensor buffers are released individually.
  • The memory stats struct gives good observability into the aliasing savings.
  • The chained-add test validates both correctness and memory savings, which is the right thing to test here.
  • Replacing zero-size buffer fallbacks with asserts (ET_CHECK_MSG) is a good cleanup — zero-size tensors would indicate a bug upstream.

Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only).
