Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (recreate D108798741) #20499

Open: zonglinpeng wants to merge 1 commit into pytorch:main from zonglinpeng:export-D109500113

@zonglinpeng

Summary:
Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through operator_fallback.bzl, it places the PDX SIMD fast path directly into the existing executorch operator dequantize_per_tensor_out in executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp (per-tensor function only; per_channel/tensor/tensor_args variants are untouched).

When the input and output buffers are 16-byte aligned (dequant_simd_aligned), the per-tensor path runs an inline PDX SIMD loop (xb_vecMxf32/xb_vecMx32/PDX_MUL_MXF32); otherwise it falls back to the NNLib path (xa_nn_elm_dequantize_*). The result is numerically identical to the original op — the same float-domain affine (x - zero_point) * scale.
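
The dispatch described above can be sketched in portable C++. This is an illustration only, not the code in `op_dequantize.cpp`: it assumes an int8 input and float output, uses a 4-lane scalar loop as a stand-in for the PDX vector loop, and a plain scalar loop as a stand-in for the NNLib call. The names `is_aligned_16`, `dequantize_scalar`, `dequantize_blocked`, and `dequantize_per_tensor_sketch` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
  return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Stand-in for the fallback path: one element at a time,
// float-domain affine (x - zero_point) * scale.
inline void dequantize_scalar(const int8_t* in, float* out, size_t n,
                              float scale, int32_t zero_point) {
  const float zp = static_cast<float>(zero_point);
  for (size_t i = 0; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}

// Stand-in for the SIMD loop: processes 4 "lanes" per iteration,
// then finishes the remainder with a scalar tail.
inline void dequantize_blocked(const int8_t* in, float* out, size_t n,
                               float scale, int32_t zero_point) {
  assert(is_aligned_16(in) && is_aligned_16(out));
  const float zp = static_cast<float>(zero_point);
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    for (size_t lane = 0; lane < 4; ++lane) {
      out[i + lane] = (static_cast<float>(in[i + lane]) - zp) * scale;
    }
  }
  for (; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}

// Returns true when the aligned fast path was taken.
inline bool dequantize_per_tensor_sketch(const int8_t* in, float* out,
                                         size_t n, float scale,
                                         int32_t zero_point) {
  if (is_aligned_16(in) && is_aligned_16(out)) {
    dequantize_blocked(in, out, n, scale, zero_point);
    return true;
  }
  dequantize_scalar(in, out, n, scale, zero_point);
  return false;
}
```

Both branches evaluate the same expression per element, which is why the two paths in this sketch agree bit for bit; only the iteration strategy differs.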

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, PDX_SUB_MX32); it uses the float-domain asymmetric path from D108798741 as requested. The macro fast paths (ASYM_DEQUANTIZE_IMPL_CHANNEL/SYM_DEQUANTIZE_IMPL_CHANNEL) get the static_cast<CTYPE_OUT>((x - zp) * scale) parenthesization required to build cleanly under the G3 dev mode's -Werror,-Wdouble-promotion.
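
To illustrate why the parenthesization matters for `-Wdouble-promotion`, here is a minimal sketch. It assumes integer `x` and `zp` and a `double` scale, which may not match the operand types in the actual macros; `dequantize_one` is a hypothetical name. The warning fires when a `float` is implicitly promoted to `double`, so the cast has to wrap the whole product rather than one operand.

```cpp
#include <cstdint>

// Sketch under stated assumptions: x and zp are integers, scale is double.
template <typename CTYPE_OUT>
constexpr CTYPE_OUT dequantize_one(int32_t x, int32_t zp, double scale) {
  // The product is formed first (int * double -> double), then narrowed
  // exactly once by an explicit cast, so no float is ever promoted.
  return static_cast<CTYPE_OUT>((x - zp) * scale);

  // By contrast, with CTYPE_OUT = float a form such as
  //   static_cast<CTYPE_OUT>(x - zp) * scale
  // multiplies a float by a double; the float operand is implicitly
  // promoted, which is the pattern -Wdouble-promotion reports.
}

// Compile-time check of the affine on exactly representable values.
static_assert(dequantize_one<float>(10, 3, 0.5) == 3.5f, "(10 - 3) * 0.5");
```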

For A/B measurement this also adds op_dequantize_baseline.cpp under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the -Wdouble-promotion fix). It defines impl::G3::native::dequantize_per_tensor_out, so the shared benchmark source from D109441948 is linked into two binaries — _optimized (against the real executorch op) and _stock (against the snapshot) — and compared on the cycle-accurate G3 ISS. operators_header visibility is extended to the Jarvis test package so the snapshot can include operators.h.

Differential Revision: D109500113

@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20499

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 1 Unrelated Failure

As of commit 60754f2 with merge base 45a14b9 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 24, 2026
@linux-foundation-easycla

CLA Not Signed

@meta-codesync

meta-codesync Bot commented Jun 24, 2026

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109500113.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.
