Use caller CUDA stream for D2H and H2D copies (#20498)

Open: Conarnar wants to merge 1 commit into pytorch:main from Conarnar:export-D109590531

@Conarnar Conarnar commented Jun 24, 2026

Summary:

`CudaAllocator` memory copies now support asynchronous copies on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` call `cudaMemcpyAsync` on that stream and synchronize the stream before returning. This preserves the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.
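
The path selection described above can be sketched roughly as follows. This is not the code from the diff: the CUDA runtime calls are replaced by recording stand-ins (`fake_memcpy`, `fake_memcpy_async`, `fake_stream_synchronize`) so the sketch builds without a GPU, and `FakeStream`, `g_caller_stream`, `copy_blocking` and `self_check` are hypothetical names. Representing "no caller stream" as a null pointer is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Stand-ins for the CUDA runtime so the sketch builds without a GPU.
// Real code would call cudaMemcpy, cudaMemcpyAsync and
// cudaStreamSynchronize from <cuda_runtime.h> instead.
struct FakeStream { int id; };
static std::vector<std::string> g_calls;  // order of runtime calls issued

static void fake_memcpy(void* dst, const void* src, std::size_t n) {
  g_calls.push_back("memcpy");
  std::memcpy(dst, src, n);
}
static void fake_memcpy_async(void* dst, const void* src, std::size_t n,
                              const FakeStream&) {
  g_calls.push_back("memcpy_async");
  std::memcpy(dst, src, n);
}
static void fake_stream_synchronize(const FakeStream&) {
  g_calls.push_back("stream_sync");
}

// Hypothetical counterpart of getCallerStream(); nullptr means "not set".
static const FakeStream* g_caller_stream = nullptr;

// Blocking copy: async copy plus stream sync when a caller stream exists,
// otherwise the plain synchronous copy.
void copy_blocking(void* dst, const void* src, std::size_t n) {
  if (g_caller_stream != nullptr) {
    fake_memcpy_async(dst, src, n, *g_caller_stream);
    fake_stream_synchronize(*g_caller_stream);  // keeps the call blocking
  } else {
    fake_memcpy(dst, src, n);
  }
}

// Exercises both paths and checks which calls were recorded.
void self_check() {
  const char src[3] = {'a', 'b', 'c'};
  char dst[3] = {0, 0, 0};

  g_calls.clear();
  g_caller_stream = nullptr;
  copy_blocking(dst, src, sizeof(src));
  assert(g_calls.size() == 1 && g_calls[0] == "memcpy");

  const FakeStream stream{7};
  g_calls.clear();
  g_caller_stream = &stream;
  copy_blocking(dst, src, sizeof(src));
  assert(g_calls.size() == 2 && g_calls[1] == "stream_sync");
  g_caller_stream = nullptr;  // do not leave a dangling pointer
}
```

In both branches the function returns only after the data has been copied, so existing callers see no change in behavior. Only the stream on which the copy is ordered differs.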

Additionally:

  • Added null-pointer and zero-byte validation: a null `dst` or `src` returns `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
  • Assert the single-GPU case (device index 0 or -1) until multi-GPU stream validation is added.
  • Wired the `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
  • Added `test_cuda_allocator` with coverage for the sync and async paths and for error handling.
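
A minimal sketch of the argument checks listed above, under stated assumptions: the `Error` enum here is a local stand-in that models only the two values named in the summary, `validated_copy` and `is_supported_device_index` are hypothetical names, `std::memcpy` stands in for the CUDA copy, and running the null check before the zero-byte check is an assumption, since the summary does not state the order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Local stand-in for the runtime's error type; only the two values named
// in the PR summary are modeled.
enum class Error { Ok, InvalidArgument };

// Single-GPU guard: only device index 0 or -1 is accepted for now.
constexpr bool is_supported_device_index(int index) {
  return index == 0 || index == -1;
}

// Hypothetical validated copy. Assumption: the null check runs before the
// zero-byte check; the PR summary does not specify the order.
Error validated_copy(void* dst, const void* src, std::size_t nbytes,
                     int device_index) {
  assert(is_supported_device_index(device_index));
  if (dst == nullptr || src == nullptr) {
    return Error::InvalidArgument;  // reject instead of crashing in the copy
  }
  if (nbytes == 0) {
    return Error::Ok;  // nothing to copy, return early
  }
  std::memcpy(dst, src, nbytes);  // stands in for the CUDA copy
  return Error::Ok;
}
```

Returning an error code for bad pointers lets the caller decide how to recover, whereas the device-index check stays an assertion because it guards a configuration the code does not yet support.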

Differential Revision: D109590531

Copilot AI review requested due to automatic review settings June 24, 2026 22:51

pytorch-bot Bot commented Jun 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 1 Unrelated Failure

As of commit 3d8da75 with merge base 45a14b9:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026

meta-codesync Bot commented Jun 24, 2026

@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
`@pytorchbot label "release notes: none"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot changed the title Use caller CUDA stream for D2H and H2D copies Use caller CUDA stream for D2H and H2D copies (#20498) Jun 24, 2026
@Conarnar Conarnar force-pushed the export-D109590531 branch from 3ac4dc3 to 3d8da75 Compare June 24, 2026 23:12
Labels

CLA Signed, meta-exported