Fix TOCTOU heap-buffer-overflow in SDPA per-thread scratch #20470
derekdixu wants to merge 1 commit
Summary:

The SDPA flash-attention kernel allocates per-thread scratch space using the threadpool's current thread count, then dispatches parallel work that independently re-reads the thread count. On a 96-core host post-fork, the threadpool can be resized between the two reads, so the parallel dispatcher creates more tasks than there are scratch slots. Worker threads then index past the end of the buffer, triggering a heap-buffer-overflow (reproduced in 11 of 20 runs, 55%).

This adds an optional `num_threads` parameter to `parallel_for` and `calc_num_tasks_and_chunk_size`. When it is `<= 0` (the default), they read the threadpool's current count as before, so existing callers are unchanged. The SDPA kernel now passes the same thread count it used to size the buffer, guaranteeing `num_tasks <= num_thread` and keeping every worker's `ompIdx` in bounds.

Reviewed By: GregoryComer

Differential Revision: D109464749

Signed-off-by: Chris Edmonds <edmondsc@meta.com>
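To make the race concrete, here is a minimal, single-threaded sketch of the bookkeeping described in the summary. It is not the ExecuTorch implementation: `g_pool_threads`, `calc_num_tasks`, `DispatchResult`, and `simulate` are hypothetical stand-ins, the resize is simulated by a plain store between the two reads, and no worker actually touches the scratch buffer. It only shows why re-reading the thread count can yield more tasks than slots, and why passing the sizing count through avoids that.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the threadpool's live thread count.
static std::atomic<int64_t> g_pool_threads{4};

// Floats of scratch per thread; an arbitrary value for illustration.
constexpr std::size_t kSlotFloats = 16;

// Simplified task-count calculation (assumed shape, not the real helper).
// num_threads <= 0 means "read the pool's current count", mirroring the
// default behavior described in the summary.
int64_t calc_num_tasks(int64_t begin, int64_t end, int64_t grain_size,
                       int64_t num_threads = 0) {
  if (end <= begin) {
    return 0;
  }
  if (num_threads <= 0) {
    num_threads = g_pool_threads.load();  // second, independent read
  }
  grain_size = std::max<int64_t>(grain_size, 1);
  const int64_t max_tasks = (end - begin + grain_size - 1) / grain_size;
  return std::min(num_threads, max_tasks);
}

struct DispatchResult {
  int64_t scratch_slots;  // slots allocated from the first read
  int64_t num_tasks;      // tasks the dispatcher would create
  bool in_bounds() const { return num_tasks <= scratch_slots; }
};

// Models one kernel call in which the pool is resized to `resized_to`
// between sizing the scratch buffer and dispatching the parallel work.
DispatchResult simulate(bool pin_thread_count, int64_t resized_to) {
  assert(resized_to > 0);

  // Read #1: size the per-thread scratch buffer.
  const int64_t num_thread = g_pool_threads.load();
  std::vector<float> scratch(static_cast<std::size_t>(num_thread) * kSlotFloats);

  // Simulated resize landing between the two reads.
  g_pool_threads.store(resized_to);

  // Read #2 happens inside calc_num_tasks unless the count is pinned.
  const int64_t num_tasks =
      calc_num_tasks(0, 1024, 1, pin_thread_count ? num_thread : 0);

  return {static_cast<int64_t>(scratch.size() / kSlotFloats), num_tasks};
}
```

With the pool starting at 4 threads and growing to 8 mid-call, the unpinned path in this sketch reports 8 tasks for 4 slots, while the pinned path caps the task count at the 4 slots that were allocated.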
@derekdixu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109464749.