
WebGPU: add memory aliasing for intermediate tensor buffers #19305

Draft
digantdesai wants to merge 6 commits into main from wgpu_memory_aliasing

Conversation

@digantdesai
Contributor

Use ETVK's mem_obj_id in the WebGPU runtime to implement memory aliasing

The export pipeline already runs a greedy memory planning pass that assigns
mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime
was ignoring it and allocating a dedicated WGPUBuffer per tensor.

Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the
same mem_obj_id now share a single WGPUBuffer sized to the largest user.
Constants and tensors without a mem_obj_id still get dedicated buffers.

Add a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both
correctness and that memory aliasing produces savings (~20% for this model).

Co-authored with Claude.

Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with
ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation,
we want to know immediately rather than silently creating a dummy 4-byte
buffer that masks the issue.

Co-authored with Claude.

Invert the condition to eliminate the empty if-body with a comment.

Co-authored with Claude.

Export and run the chained-add memory aliasing test in
test_build_webgpu.sh so it runs automatically instead of requiring
a manual WEBGPU_TEST_CHAINED_MODEL env var.

Co-authored with Claude.

Longer chain produces more intermediates, giving the memory planner
more opportunity to alias buffers. Expected output: 3x + 3y.

Co-authored with Claude.

Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the
dedicated buffer path and the destructor stay consistent. Previously
the buffer would leak and get overwritten by the shared buffer pass.

Also make the chained-add test actually fail when aliasing is absent
instead of just printing informational messages.

Co-authored with Claude.
@pytorch-bot

pytorch-bot Bot commented May 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs

As of commit 55ceea4 with merge base cf01617:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 5, 2026
@digantdesai
Contributor Author

@claude review this PR

@github-actions

github-actions Bot commented May 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@claude

claude Bot commented May 5, 2026

Claude finished @digantdesai's task in 1m 37s


Review: WebGPU memory aliasing for intermediate tensor buffers

  • Read changed files and understand the implementation
  • Review WebGPUBackend changes
  • Review WebGPUGraph changes (core aliasing logic)
  • Review test changes
  • Post review feedback

Overall: Clean, well-structured implementation. The approach of using ETVK's mem_obj_id to share GPU buffers between non-overlapping intermediate tensors is sound and the test coverage is good. A few items worth addressing:


Issues

1. Global mutable state for memory stats (WebGPUBackend.cpp:41)

static WebGPUMemoryStats s_last_memory_stats;

This is a process-wide global that gets overwritten on every init() call. If two models are loaded concurrently (or sequentially), the stats from the first are silently lost. This is fine for a test-only helper, but consider documenting that intent (e.g., naming it something like s_last_memory_stats_for_testing) or gating it behind a test/debug flag so it doesn't ship in production builds. If this is intentionally test-only, could it live in the test file instead?

2. Destructor correctness — shared buffer double-release guard (WebGPUGraph.cpp:55-58)

for (size_t i = 0; i < tensors_.size(); i++) {
    if (tensors_[i].buffer &&
        (i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
      wgpuBufferRelease(tensors_[i].buffer);
    }
}

The guard i >= tensor_mem_obj_ids_.size() should never be true since both vectors are resized to num_vals in build(). If build() throws before that resize (e.g., device is null), both vectors are empty and the loop wouldn't execute anyway. The check is harmless but somewhat misleading—a simple tensor_mem_obj_ids_[i] < 0 would suffice given the invariant. Minor nit, no action needed.

3. total_bytes() excludes tensor_buffer_bytes field (WebGPUGraph.h:44-46)

size_t total_bytes() const {
    return shared_buffer_bytes + unshared_tensor_buffer_bytes +
        staging_buffer_bytes + uniform_buffer_bytes;
}

This is correct (it equals tensor_buffer_bytes + staging + uniform), but it's mildly confusing that total_bytes() doesn't reference the tensor_buffer_bytes field directly. Rewriting as return tensor_buffer_bytes + staging_buffer_bytes + uniform_buffer_bytes; would make the semantics clearer, since tensor_buffer_bytes is already computed as shared + unshared in memory_stats().



Minor Suggestions

4. Test assertion could verify specific aliasing count (test_webgpu_native.cpp:143)

The test asserts stats.num_shared_objects > 0 — for a 5-op chained add with 2 inputs + 1 output + intermediates, you could assert a more specific expected value (e.g., that num_shared_objects equals a known count) to catch regressions where aliasing becomes less effective. Though I recognize this is fragile if the partitioner changes its allocation strategy.

5. export_chained_add_model not invocable from CLI (test_add.py)

The export functions are defined but only callable programmatically or via the test script's -c block. Not a blocker, just noting that adding a --export-chained CLI arg would be nice for manual testing.


What looks good

  • The two-pass approach (collect max sizes in pass 1, allocate shared buffers in pass 2) is clean and avoids over-allocation.
  • The destructor properly separates ownership: shared buffers are released once via shared_buffers_, unshared per-tensor buffers are released individually.
  • The memory stats struct gives good observability into the aliasing savings.
  • The chained-add test validates both correctness and memory savings, which is the right thing to test here.
  • Replacing zero-size buffer fallbacks with asserts (ET_CHECK_MSG) is a good cleanup — zero-size tensors would indicate a bug upstream.

Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only).
