Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (recreate D108798741) #20499
zonglinpeng wants to merge 1 commit into
… D108798741)

Summary:
Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through `operator_fallback.bzl`, it places the PDX SIMD fast path directly into the existing executorch operator `dequantize_per_tensor_out` in `executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp` (per-tensor function only; `per_channel`/`tensor`/`tensor_args` variants are untouched).

When the input and output buffers are 16-byte aligned (`dequant_simd_aligned`), the per-tensor path runs an inline PDX SIMD loop (`xb_vecMxf32`/`xb_vecMx32`/`PDX_MUL_MXF32`); otherwise it falls back to the NNLib path (`xa_nn_elm_dequantize_*`). The result is numerically identical to the original op: the same float-domain affine `(x - zero_point) * scale`.

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, `PDX_SUB_MX32`); it uses the float-domain asymmetric path from D108798741 as requested. The macro fast paths (`ASYM_DEQUANTIZE_IMPL_CHANNEL`/`SYM_DEQUANTIZE_IMPL_CHANNEL`) get the `static_cast<CTYPE_OUT>((x - zp) * scale)` parenthesization required to build clean under the G3 `dev` mode's `-Werror,-Wdouble-promotion`.

For A/B measurement this also adds `op_dequantize_baseline.cpp` under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the `-Wdouble-promotion` fix). It defines `impl::G3::native::dequantize_per_tensor_out`, so the shared benchmark source from D109441948 is linked into two binaries, `_optimized` (against the real executorch op) and `_stock` (against the snapshot), and compared on the cycle-accurate G3 ISS. `operators_header` visibility is extended to the Jarvis test package so the snapshot can include `operators.h`.

Differential Revision: D109500113
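To make the dispatch described in the summary concrete, here is a minimal portable sketch of the idea: check 16-byte alignment of both buffers, take a blocked "fast" loop when aligned, and otherwise fall back to a plain reference loop. It is not the PR's code. The names `is_aligned_16`, `dequantize_scalar`, `dequantize_blocked`, and `dequantize_per_tensor_sketch` are hypothetical, the 4-lane block width is an assumption (16 bytes / 4-byte float), and the inner lane loop only stands in for the PDX intrinsics, whose signatures are not shown in the summary. The scalar loop likewise stands in for the NNLib fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the PR's `dequant_simd_aligned` check;
// the real helper's signature is not shown in the summary.
inline bool is_aligned_16(const void* p) {
  return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Reference path (stand-in for the NNLib fallback): float-domain affine
// (x - zero_point) * scale, one element at a time.
inline void dequantize_scalar(const std::int8_t* in, float* out, std::size_t n,
                              float scale, std::int32_t zero_point) {
  const float zp = static_cast<float>(zero_point);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}

// Stand-in for the inline SIMD loop: processes kLanes elements per iteration
// (assumed width), then finishes the remainder with a scalar tail.
inline void dequantize_blocked(const std::int8_t* in, float* out, std::size_t n,
                               float scale, std::int32_t zero_point) {
  constexpr std::size_t kLanes = 4;  // assumption: 16 bytes / sizeof(float)
  const float zp = static_cast<float>(zero_point);
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t l = 0; l < kLanes; ++l) {
      // Convert to float first, then subtract and multiply in float.
      out[i + l] = (static_cast<float>(in[i + l]) - zp) * scale;
    }
  }
  for (; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}

// Returns true if the aligned fast path was taken, false on fallback.
inline bool dequantize_per_tensor_sketch(const std::int8_t* in, float* out,
                                         std::size_t n, float scale,
                                         std::int32_t zero_point) {
  assert(in != nullptr && out != nullptr);
  if (is_aligned_16(in) && is_aligned_16(out)) {
    dequantize_blocked(in, out, n, scale, zero_point);
    return true;
  }
  dequantize_scalar(in, out, n, scale, zero_point);
  return false;
}
```

Both paths evaluate the same float expression per element, which is why the sketch produces identical outputs regardless of which branch runs; only throughput would differ on real hardware.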
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20499
Note: Links to docs will display an error until the docs builds have been completed.
❗ 2 Active SEVs
There are 2 currently active SEVs. If your PR is affected, please view them below:
❌ 3 New Failures, 1 Unrelated Failure
As of commit 60754f2 with merge base 45a14b9:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109500113.
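The `-Wdouble-promotion` parenthesization mentioned in the summary can be illustrated with a small template. This is a sketch under stated assumptions, not the macro from `op_dequantize.cpp`: the function name `dequantize_one` is hypothetical, and it assumes `scale` is a `double` while `CTYPE_OUT` is `float`, since the macro's actual parameter types are not shown. Under those assumptions, a form like `static_cast<CTYPE_OUT>(x - zp) * scale` would cast the difference to `float` and then implicitly promote it to `double` for the multiply, which is the kind of conversion the warning flags. Wrapping the whole product in the cast leaves a single explicit conversion.

```cpp
#include <cstdint>

// Assumption: `scale` is double and CTYPE_OUT is float; the real macro's
// parameter types are not shown in the PR summary.
template <typename CTYPE_OUT, typename CTYPE_IN>
CTYPE_OUT dequantize_one(CTYPE_IN x, std::int32_t zp, double scale) {
  // (x - zp) is an integer difference; multiplying by `scale` converts it
  // to double directly (an int-to-double conversion, not a float-to-double
  // promotion). The single static_cast then narrows the product explicitly.
  return static_cast<CTYPE_OUT>((x - zp) * scale);
}
```

For example, `dequantize_one<float, std::int8_t>(5, 3, 0.5)` evaluates `(5 - 3) * 0.5` in `double` and narrows once to `float`.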