Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (recreate D108798741) by zonglinpeng · Pull Request #20499 · pytorch/executorch

zonglinpeng · 2026-06-24T23:40:03Z

Summary:
Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through operator_fallback.bzl, it places the PDX SIMD fast path directly into the existing executorch operator dequantize_per_tensor_out in executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp (per-tensor function only; per_channel/tensor/tensor_args variants are untouched).

When the input and output buffers are 16-byte aligned (dequant_simd_aligned), the per-tensor path runs an inline PDX SIMD loop (xb_vecMxf32/xb_vecMx32/PDX_MUL_MXF32); otherwise it falls back to the NNLib path (xa_nn_elm_dequantize_*). The result is numerically identical to the original op — the same float-domain affine (x - zero_point) * scale.
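The dispatch described above can be sketched in portable C++. This is a minimal illustration, not the code in `op_dequantize.cpp`: the helper names (`both_aligned16`, `dequantize_scalar`, `dequantize_blocked`, `dequantize_per_tensor`) are hypothetical, the 4-lane width is an assumption, and the blocked loop is a plain-C++ stand-in for the PDX intrinsics, which are only available in the Xtensa toolchain.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when both buffers start on a 16-byte boundary.
inline bool both_aligned16(const void* in, const void* out) {
  const auto a = reinterpret_cast<std::uintptr_t>(in);
  const auto b = reinterpret_cast<std::uintptr_t>(out);
  return ((a | b) & 0xF) == 0;
}

// Reference path: float-domain affine (x - zero_point) * scale per element.
// Stands in for the NNLib fallback in this sketch.
inline void dequantize_scalar(const std::int8_t* in, float* out, std::size_t n,
                              float scale, std::int32_t zero_point) {
  const float zp = static_cast<float>(zero_point);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}

// Stand-in for the SIMD loop: full blocks of kLanes elements, then a scalar
// tail. The arithmetic is the same as the reference path, so results match.
inline void dequantize_blocked(const std::int8_t* in, float* out, std::size_t n,
                               float scale, std::int32_t zero_point) {
  constexpr std::size_t kLanes = 4;  // assumed lane count, for illustration
  const float zp = static_cast<float>(zero_point);
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      out[i + lane] = (static_cast<float>(in[i + lane]) - zp) * scale;
    }
  }
  for (; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}

// Returns true when the aligned fast path was taken, false on fallback.
inline bool dequantize_per_tensor(const std::int8_t* in, float* out,
                                  std::size_t n, float scale,
                                  std::int32_t zero_point) {
  if (both_aligned16(in, out)) {
    dequantize_blocked(in, out, n, scale, zero_point);
    return true;
  }
  dequantize_scalar(in, out, n, scale, zero_point);
  return false;
}
```

Because both paths evaluate the same float expression in the same order, the output does not depend on which branch the alignment check selects.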

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, PDX_SUB_MX32); it uses the float-domain asymmetric path from D108798741 as requested. The macro fast paths (ASYM_DEQUANTIZE_IMPL_CHANNEL/SYM_DEQUANTIZE_IMPL_CHANNEL) get the static_cast<CTYPE_OUT>((x - zp) * scale) parenthesization required to build clean under the G3 dev mode's -Werror,-Wdouble-promotion.
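The parenthesization matters because of where the narrowing cast sits. The sketch below assumes `scale` is a `double` and the output type is `float`; `dequantize_one` is a hypothetical function written for illustration, not the macro body from the PR.

```cpp
#include <cstdint>

// Hypothetical single-element version of the affine dequantize.
template <typename CTYPE_IN, typename CTYPE_OUT>
inline CTYPE_OUT dequantize_one(CTYPE_IN x, std::int32_t zp, double scale) {
  // The subtraction is done in integers, the multiply in double, and one
  // explicit cast narrows the final product to CTYPE_OUT.
  //
  // By contrast, `static_cast<CTYPE_OUT>(x - zp) * scale` with a float output
  // would multiply a float by a double, implicitly widening the float operand,
  // which is the pattern -Wdouble-promotion reports.
  return static_cast<CTYPE_OUT>((static_cast<std::int32_t>(x) - zp) * scale);
}
```

With the cast wrapped around the whole product, every conversion in the expression is either integer-to-double or an explicit narrowing, so no implicit float-to-double promotion remains for the warning to flag.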

For A/B measurement this also adds op_dequantize_baseline.cpp under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the -Wdouble-promotion fix). It defines impl::G3::native::dequantize_per_tensor_out, so the shared benchmark source from D109441948 is linked into two binaries — _optimized (against the real executorch op) and _stock (against the snapshot) — and compared on the cycle-accurate G3 ISS. operators_header visibility is extended to the Jarvis test package so the snapshot can include operators.h.

Differential Revision: D109500113


pytorch-bot · 2026-06-24T23:40:07Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20499

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 1 Unrelated Failure

As of commit 60754f2 with merge base 45a14b9:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner (gh)
>>> Lint for backends/cadence/fusion_g3/operators/op_dequantize.cpp:
pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t 3b013a8c90794732bf92d1c1260b6419ce6351a47542a5ad553e6983ce1a3753 /exec failed with exit code 3
pull / unittest-buck / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-06-24T23:40:10Z

❌ - login: @zonglinpeng / name: Zonglin Peng. The commit (60754f2) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

meta-codesync · 2026-06-24T23:40:10Z

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109500113.

github-actions · 2026-06-24T23:40:51Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-cla Bot added the CLA Signed label Jun 24, 2026

meta-codesync Bot added the meta-exported label Jun 24, 2026

meta-codesync Bot temporarily deployed to cadence June 24, 2026 23:40 Inactive

zonglinpeng wants to merge 1 commit into pytorch:main from zonglinpeng:export-D109500113
