[ExecuTorch][WebGPU] SDPA: branchless aligned/tail loads in the QK/AV kernels by JulianCloudNTH · Pull Request #20493 · pytorch/executorch

JulianCloudNTH · 2026-06-24T19:53:42Z

Stack from ghstack (oldest at bottom):

Branchless aligned/tail loads + vec4 storage bindings — drop the always-true per-lane bounds checks in the tiled QK/AV hot loops, split the AV context contraction into a branch-free aligned body plus a checked tail, and declare the head-dim-indexed SDPA storage buffers as array<vec4<f32>> so the loads/stores are forced-vectorized (addresses review feedback to mirror Vulkan's vec4 bindings).

Problem: The tiled QK/AV vec4 loaders run 4 per-lane if bounds checks on every load, in every contraction iteration (8 loads/iter). But head_dim is always a multiple of 4, so the D-axis checks never reject a lane, and the AV context axis only needs a bounds check on the last ragged chunk. Separately, the storage buffers were declared array<f32>, so the 4-lane loads/stores were not guaranteed to compile to aligned 128-bit vector accesses.
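To make the dead-check argument concrete, here is a minimal CPU-side sketch in C++ (not the WGSL from this PR). The names `load_vec4_checked`, `load_vec4_unchecked`, and `count_out_of_range_lanes` are hypothetical stand-ins for the shader loaders. It assumes a row of `head_dim` floats walked in steps of 4.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;

// "Before" shape (assumed): every lane is guarded by its own bounds check,
// and an out-of-range lane is zero-filled.
Vec4 load_vec4_checked(const std::vector<float>& row, std::size_t d,
                       std::size_t head_dim) {
  Vec4 v{0.0f, 0.0f, 0.0f, 0.0f};
  for (std::size_t lane = 0; lane < 4; ++lane) {
    if (d + lane < head_dim) {
      v[lane] = row[d + lane];
    }
  }
  return v;
}

// "After" shape (assumed): no per-lane checks. Only valid when the row
// length is a multiple of 4 and d is a multiple of 4.
Vec4 load_vec4_unchecked(const std::vector<float>& row, std::size_t d) {
  assert(d % 4 == 0 && d + 4 <= row.size());
  return {row[d], row[d + 1], row[d + 2], row[d + 3]};
}

// Counts the lanes that fall outside the row while walking it in steps of 4,
// i.e. the lanes for which a per-lane check would actually do something.
std::size_t count_out_of_range_lanes(std::size_t head_dim) {
  std::size_t out_of_range = 0;
  for (std::size_t d = 0; d < head_dim; d += 4) {
    for (std::size_t lane = 0; lane < 4; ++lane) {
      if (d + lane >= head_dim) {
        ++out_of_range;
      }
    }
  }
  return out_of_range;
}
```

With `head_dim = 64` no lane is ever out of range, so the checked and unchecked loaders return the same values. With `head_dim = 6` the second chunk has two out-of-range lanes, which is the case the host guard rejects.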

Solution: Remove the dead checks, split the ragged axis, and vectorize the bindings:

Before: load_q_vec4/load_k_vec4 (and AV load_a_vec4/load_v_d4) do 4 per-lane bounds ifs per call; the AV c4 loop runs checked loads for every chunk; t_q/t_k_cache/t_v_cache/t_out are array<f32> accessed element-by-element.
After: QK loads are a plain unchecked vec4 (D%4==0, host-guarded); AV runs a branch-free aligned body over c4 in [0, context_len - context_len%4) then a 0-or-1 checked tail; the head-dim-indexed buffers t_q/t_k_cache/t_v_cache/t_out are array<vec4<f32>> indexed [base/4u], and AV writes a single aligned store_out_vec4.
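The `[base/4u]` indexing can be illustrated with a small C++ model. This is a sketch, not the shader code: `repack_as_vec4`, `load_scalar`, and `load_vec4` are hypothetical names, and the repacking copy stands in for re-declaring the same bytes as `array<vec4<f32>>`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec4 {
  float x, y, z, w;
};

// Stand-in for viewing a flat array<f32> buffer as array<vec4<f32>>:
// groups every four consecutive floats into one element.
std::vector<Vec4> repack_as_vec4(const std::vector<float>& flat) {
  assert(flat.size() % 4 == 0);
  std::vector<Vec4> out;
  out.reserve(flat.size() / 4);
  for (std::size_t i = 0; i < flat.size(); i += 4) {
    out.push_back(Vec4{flat[i], flat[i + 1], flat[i + 2], flat[i + 3]});
  }
  return out;
}

// Element-by-element access at a scalar offset (the array<f32> style).
Vec4 load_scalar(const std::vector<float>& flat, std::size_t base) {
  return Vec4{flat[base], flat[base + 1], flat[base + 2], flat[base + 3]};
}

// Single 4-wide access (the array<vec4<f32>> style). The scalar offset must
// be a multiple of 4 for base / 4 to address the same four floats.
Vec4 load_vec4(const std::vector<Vec4>& buf, std::size_t base) {
  assert(base % 4 == 0);
  return buf[base / 4];
}

bool same(const Vec4& a, const Vec4& b) {
  return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w;
}
```

For any scalar offset `base` that is a multiple of 4, both loaders return the same four values. That is why the D % 4 == 0 requirement matters once the bindings are vec4-typed.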

Implementation:

QK: load_q_vec4/load_k_vec4 drop the per-lane D checks and return t_q[base/4u] / t_k_cache[base/4u].
AV: branch-free load_a_vec4_nc/load_v_d4_nc for the aligned body; checked load_a_vec4/load_v_d4 for the tail; V reads t_v_cache[base/4u]; output is one aligned store_out_vec4.
Bindings: t_q, t_k_cache (QK) and t_v_cache, t_out (AV) are array<vec4<f32>>. t_attn_weights and the softmax buffer stay array<f32> — they are context_len-indexed (row stride not 4-aligned) and written per-element under the causal mask, so a vec4 binding there would need a padded scratch row.
Host: add a D % 4 == 0 guard in Sdpa.cpp — WGSL has no SDPA_PAD_D pad-load, so fail loud rather than read past the row; this guard also makes every [base/4u] index 4-aligned and every buffer a 16-byte multiple.
Test: add a reject_d6 (head_dim=6) config + an expect_reject harness branch asserting the guard rejects a non-aligned head_dim at load.
Mirrors Vulkan sdpa_compute_out_tiled.glsl (aligned/tail split) and Vulkan's array<vec4> SDPA bindings.
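The aligned/tail split for the AV contraction can be sketched on the CPU as follows. This is a simplified model under stated assumptions: one output `Vec4` (a single head-dim chunk), one row of attention weights `a`, and one `Vec4` of V per context position. The function names are hypothetical and do not correspond to the WGSL helpers listed above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;

// acc += w * v, lane by lane.
void accumulate(Vec4& acc, float w, const Vec4& v) {
  for (std::size_t i = 0; i < 4; ++i) {
    acc[i] += w * v[i];
  }
}

// One chunk of four context positions with bounds checks: out-of-range
// positions load weight 0 and V = 0, so they contribute 0 to the sum.
void chunk_checked(Vec4& acc, const std::vector<float>& a,
                   const std::vector<Vec4>& v, std::size_t c4,
                   std::size_t context_len) {
  for (std::size_t lane = 0; lane < 4; ++lane) {
    const std::size_t c = c4 + lane;
    const bool in_range = c < context_len;
    const float w = in_range ? a[c] : 0.0f;
    const Vec4 vv = in_range ? v[c] : Vec4{0.0f, 0.0f, 0.0f, 0.0f};
    accumulate(acc, w, vv);
  }
}

// One chunk of four context positions with no checks at all.
void chunk_unchecked(Vec4& acc, const std::vector<float>& a,
                     const std::vector<Vec4>& v, std::size_t c4) {
  for (std::size_t lane = 0; lane < 4; ++lane) {
    accumulate(acc, a[c4 + lane], v[c4 + lane]);
  }
}

// Reference: checked loads for every chunk.
Vec4 av_all_checked(const std::vector<float>& a, const std::vector<Vec4>& v,
                    std::size_t context_len) {
  Vec4 acc{0.0f, 0.0f, 0.0f, 0.0f};
  for (std::size_t c4 = 0; c4 < context_len; c4 += 4) {
    chunk_checked(acc, a, v, c4, context_len);
  }
  return acc;
}

// Split: branch-free body over [0, context_len - context_len % 4),
// then zero or one checked tail chunk.
Vec4 av_split(const std::vector<float>& a, const std::vector<Vec4>& v,
              std::size_t context_len) {
  assert(a.size() >= context_len && v.size() >= context_len);
  Vec4 acc{0.0f, 0.0f, 0.0f, 0.0f};
  const std::size_t aligned_end = context_len - context_len % 4;
  for (std::size_t c4 = 0; c4 < aligned_end; c4 += 4) {
    chunk_unchecked(acc, a, v, c4);
  }
  if (aligned_end < context_len) {
    chunk_checked(acc, a, v, aligned_end, context_len);
  }
  return acc;
}
```

Both variants visit the same chunks in the same order and perform the same per-lane operations on in-range positions, so in this model they produce equal results for any `context_len`.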

Constraints:

Requires head_dim % 4 == 0 (true for every Llama config, D=64); enforced by a loud host throw, not a silent narrowing.
Bit-identical output: the aligned body processes the same chunks in the same accumulation order as the scalar loop, the tail's out-of-range lanes contribute 0, and the vec4 bindings read/write the same bytes as the scalar version.
No KV-cache layout, dispatch, or uniform change.
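A host-side guard of the kind described above might look like the following sketch. The function names, the exception type, and the message are assumptions for illustration. The actual check in `Sdpa.cpp` and the `expect_reject` harness branch are not shown in this PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical guard: reject any head_dim that is not a multiple of 4 before
// a kernel with unchecked vec4 loads is dispatched.
void check_head_dim_vec4_aligned(std::int64_t head_dim) {
  if (head_dim % 4 != 0) {
    throw std::invalid_argument(
        "SDPA: head_dim must be a multiple of 4, got " +
        std::to_string(head_dim));
  }
}

// Hypothetical expect_reject-style helper: true if the guard throws.
bool guard_rejects(std::int64_t head_dim) {
  try {
    check_head_dim_vec4_aligned(head_dim);
  } catch (const std::invalid_argument&) {
    return true;
  }
  return false;
}
```

In this sketch `guard_rejects(6)` is true, matching the intent of the `reject_d6` config, while `guard_rejects(64)` is false.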

Co-authored with Claude Code.
@exported-using-ghexport

Differential Revision: D109521069

[ghstack-poisoned]

pytorch-bot · 2026-06-24T19:53:47Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20493

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 11 Pending, 2 Unrelated Failures

As of commit 192f09f with merge base e03f777:

NEW FAILURE - The following job has failed:

pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t 01174de7bff9009e906654fb4bb57ededdf45b84e2a15b5c500104eb7ad01348 /exec failed with exit code 3

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest-buck / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-24T19:54:47Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]

… kernels Pull Request resolved: #20493 ghstack-source-id: 396717582 @exported-using-ghexport Differential Revision: [D109521069](https://our.internmc.facebook.com/intern/diff/D109521069/)

SS-JIA

Review automatically exported from Phabricator review in Meta.

[ghstack-poisoned]

… kernels Pull Request resolved: #20493 ghstack-source-id: 396792517 @exported-using-ghexport Differential Revision: [D109521069](https://our.internmc.facebook.com/intern/diff/D109521069/)

Update

6e022e1

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 24, 2026

[ExecuTorch][WebGPU] Register-tile the SDPA QK/AV kernels #20405

Open

JulianCloudNTH temporarily deployed to cadence June 24, 2026 19:53 — with GitHub Actions Inactive

This was referenced Jun 24, 2026

[ExecuTorch][WebGPU] Coalesce SDPA AV V-cache reads along contiguous head-dim #20459

Open

[ExecuTorch][WebGPU] SDPA: skip QK contraction for fully-masked causal tiles #20492

Open

meta-cla Bot added the CLA Signed label Jun 24, 2026

Update

8ba8c50

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 24, 2026 21:40 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 24, 2026

SS-JIA approved these changes Jun 24, 2026


Update

192f09f

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 25, 2026 02:35 — with GitHub Actions Inactive

JulianCloudNTH wants to merge 3 commits into gh/JulianCloudNTH/63/base from gh/JulianCloudNTH/63/head
