Use caller CUDA stream for D2H and H2D copies (#20498)
@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531.
Summary:
CudaAllocator memory copies now support asynchronous copies on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning. This preserves the blocking API contract while allowing the copy to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation: null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Asserted the single-GPU case (device index 0 or -1) until multi-GPU stream validation is added.
- Wired the `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for the sync/async paths and error handling.

Differential Revision: D109590531
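The control flow described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeRuntime`, `Stream`, and `copy_blocking` are hypothetical names, the runtime methods are host-only stand-ins for `cudaMemcpy`, `cudaMemcpyAsync`, and `cudaStreamSynchronize` so the sketch compiles without the CUDA toolkit, and passing the caller stream as a nullable pointer is an assumption about how `getCallerStream()` might be consumed. Checking for null pointers before the zero-byte early return is also an assumption; the summary does not state the order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins (not ExecuTorch or CUDA types).
using Stream = int;  // stands in for cudaStream_t
enum class Error { Ok, InvalidArgument, Internal };

// Host-only fake of the three CUDA calls named in the summary.
// Each method counts its invocations so the chosen path is observable.
struct FakeRuntime {
  int sync_copies = 0;
  int async_copies = 0;
  int stream_syncs = 0;

  // Stand-in for cudaMemcpy (blocking copy).
  bool memcpy_sync(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
    ++sync_copies;
    return true;
  }
  // Stand-in for cudaMemcpyAsync (copy enqueued on a stream).
  bool memcpy_async(void* dst, const void* src, std::size_t n, Stream) {
    std::memcpy(dst, src, n);
    ++async_copies;
    return true;
  }
  // Stand-in for cudaStreamSynchronize.
  bool stream_synchronize(Stream) {
    ++stream_syncs;
    return true;
  }
};

// Blocking copy that prefers the caller's stream when one is provided.
// caller_stream == nullptr models "no caller stream is set".
Error copy_blocking(FakeRuntime& rt, void* dst, const void* src,
                    std::size_t nbytes, const Stream* caller_stream) {
  // Assumed order: reject null pointers first, then allow empty copies.
  if (dst == nullptr || src == nullptr) {
    return Error::InvalidArgument;
  }
  if (nbytes == 0) {
    return Error::Ok;  // nothing to copy
  }
  if (caller_stream != nullptr) {
    // Async path: enqueue on the caller's stream, then wait for it so the
    // function still returns only after the copy has completed.
    if (!rt.memcpy_async(dst, src, nbytes, *caller_stream)) {
      return Error::Internal;
    }
    if (!rt.stream_synchronize(*caller_stream)) {
      return Error::Internal;
    }
    return Error::Ok;
  }
  // Fallback: synchronous copy, as before this change.
  return rt.memcpy_sync(dst, src, nbytes) ? Error::Ok : Error::Internal;
}
```

Synchronizing the stream before returning is what keeps the function blocking from the caller's point of view: the copy is ordered with other work on the caller's stream, yet the destination buffer is safe to read as soon as the call returns.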