[ExecuTorch][WebGPU] rms_norm: add a vec4 kernel for 4-aligned row widths by pytorchbot · Pull Request #20502 · pytorch/executorch

pytorchbot · 2026-06-25T00:18:26Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20458 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/53/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/53/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/52/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/53/orig

@diff-train-skip-merge

…dths

Pull Request resolved: #20458

Add optimized vec4 kernel for bandwidth-bound rms_norm on Llama decode.

**Problem**: Scalar kernel loads one element per lane per iteration, which is bandwidth-limited on Llama decode.

**Solution**: Add vec4 kernel that loads/stores four contiguous elements as `vec4<f32>` and squares them with `dot(v, v)`, cutting loop iterations 4× and widening memory transactions.

**Routing Logic**:
- Use vec4 when: `row_width % 4 == 0`
- Otherwise: Fall back to scalar kernel

**Constraints**:
- `row_width % 4 == 0`: vec4 kernel has no partial-texel tail handling
- Llama models (all hidden sizes 4-aligned) satisfy constraint

**Implementation**:
- New kernel: `rms_norm_vec4.wgsl` (same 64-lane workgroup)
- Shared infrastructure: Same bind layout, Params, dispatch
- Numerical: Float reassociation differs, not bit-identical to scalar

**Performance**: ~33% faster on Apple M4 Pro / Metal across benchmark shapes (largest on decode, smallest on long prefill where already bandwidth-bound).

This change was authored with assistance from Claude.

ghstack-source-id: 396677654
@exported-using-ghexport

Differential Revision: [D109333390](https://our.internmc.facebook.com/intern/diff/D109333390/)
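For readers without the diff at hand, the routing rule and the reassociation note can be illustrated with a CPU-side sketch. This is not the code from the PR. The names `use_vec4`, `rms_norm_scalar`, `rms_norm_vec4`, and `rms_norm` are hypothetical, and so is the `eps` default. The sketch also ignores the 64-lane workgroup reduction and runs in Python's double precision rather than `f32`. It only shows that summing squares four at a time changes the order of additions, not the mathematical result.

```python
import math

EPS = 1e-6  # assumed default; the PR description does not state an epsilon


def use_vec4(row_width: int) -> bool:
    """Routing predicate from the PR description: vec4 only for 4-aligned rows."""
    return row_width % 4 == 0


def _normalize(row, weight, sum_sq, eps):
    # Shared tail: scale each element by 1/sqrt(mean(x^2) + eps) and the weight.
    inv_rms = 1.0 / math.sqrt(sum_sq / len(row) + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


def rms_norm_scalar(row, weight, eps=EPS):
    """One element per loop iteration, like the scalar kernel."""
    sum_sq = 0.0
    for x in row:
        sum_sq += x * x
    return _normalize(row, weight, sum_sq, eps)


def rms_norm_vec4(row, weight, eps=EPS):
    """Four contiguous elements per iteration; the inner sum stands in for dot(v, v)."""
    if not use_vec4(len(row)):
        raise ValueError("vec4 path has no tail handling; row width must be 4-aligned")
    sum_sq = 0.0
    for i in range(0, len(row), 4):
        v = row[i:i + 4]
        # Partial sum of four squares is added as one term, so the addition
        # order differs from the scalar loop (float reassociation).
        sum_sq += v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + v[3] * v[3]
    return _normalize(row, weight, sum_sq, eps)


def rms_norm(row, weight, eps=EPS):
    """Dispatch: vec4 when the row width is a multiple of 4, scalar otherwise."""
    kernel = rms_norm_vec4 if use_vec4(len(row)) else rms_norm_scalar
    return kernel(row, weight, eps)
```

Because the two paths add the squares in a different order, a test comparing them should use a tolerance (for example `math.isclose`) rather than exact equality, which matches the "not bit-identical to scalar" note above.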

pytorch-bot · 2026-06-25T00:18:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20502

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI runner label rename: rebase PRs using old linux.rocm.gpu.gfx950.* labels

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-cla Bot added the CLA Signed label Jun 25, 2026

JulianCloudNTH merged commit ab74ef0 into gh/JulianCloudNTH/52/orig Jun 25, 2026
168 of 175 checks passed

JulianCloudNTH deleted the gh/JulianCloudNTH/53/orig branch June 25, 2026 00:25


