[executorch][gemma4] fuse MLP gate/up at GGUF load #20481
Draft
Gasoonjia wants to merge 7 commits into
Summary:
- Add MLX lowering for aten.leaky_relu.default using existing GreaterEqual, Multiply, and Where nodes.
- Add focused MLX op tests for custom negative_slope values, including a slope above 1.

Test Plan:
- python -m py_compile backends/mlx/ops.py backends/mlx/test/test_ops.py
- git diff --check HEAD^..HEAD
- PATH="$PWD/.venv-mlx/bin:$PATH" .venv-mlx/bin/lintrunner backends/mlx/ops.py backends/mlx/test/test_ops.py
- .venv-mlx/bin/python -m executorch.backends.mlx.test.run_all_tests leaky_relu --timeout 180

cc @metascroy
…20408) The MLX backend already keeps its mutable state in an execution context separate from its constant data. This PR exposes a way for external callers to configure that, and uses it to support serve.py on MLX, as on the CUDA backend.
### Summary
Add profiling support for the NXP backend.

### Test plan
All CI tests passed, including the new test for the profiling feature.

---------

Signed-off-by: Irina Korchakova <irina.trukhina@nxp.com>
Differential Revision: D108478011 Pull Request resolved: #20453
Differential Revision: D109082060 Pull Request resolved: #20403
Summary:
Fuse each gemma4_31b MLP's gate_proj|up_proj into a single [2*intermediate, hidden] coalesced-int4 matmul, applied by default in the CUDA export. This issues one activation-quant + one W4A8 matvec per layer instead of two, cutting per-token launch + activation-quant overhead in the launch-bound decode path. Only Q4_K (CudaCoalescedInt4Tensor) gate/up pairs are fused; any other quant type (e.g. Q6_K) is left as two matmuls (guarded, still correct).

Builds on the already-landed kv_len-bounded tq4_sdpa kernel + gemma4_31b call-site (kv_len + mask_is_causal), which recovered 128k decode from ~2.8 to ~43 tok/s.

With both, ET gemma4_31b 128k+TurboQuant decode beats llama.cpp at every measured context (cuda_graph ON):

| ctx  | ET (tok/s) | llama.cpp (tok/s) |
|------|------------|-------------------|
| 512  | 44.80      | 42.77             |
| 2K   | 43.20      | 41.97             |
| 8K   | 42.23      | 41.23             |
| 32K  | 41.64      | 40.27             |
| 127K | 38.41      | 35.97             |

TurboQuant KV compression kept; prefill restored (6-8x) with no regression; output quality preserved.

Test Plan:
- Fusion numerics: fused vs unfused MLP through the real W4A8 int4_plain_mm kernel = bit-exact (max_abs_diff 0.0, cos 1.000000) for decode (T=1) and prefill (T=4).
- Export + run: fused module exported via CudaPartitioner and executed through executor_runner (RC=0, cos 0.999915 vs eager). Full 31B export logs "Fused gate+up on 60 MLP layers".
- Decode A/B (gemma4_31b 128k+TQ, cuda_graph ON, 5x median): table above; beats llama.cpp at 512 -> 127K. nsys: tq4_sdpa 91.7% -> 2.9% of decode.
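To make the fusion concrete, here is a minimal sketch of the idea in plain Python: stack the gate rows and the up rows into one [2*intermediate, hidden] matrix, run a single matvec, and split the result. It is not the PR's CUDA code; `matvec`, `fuse_gate_up`, and `fused_projection` are hypothetical helper names, and small integer lists stand in for the quantized weights so the comparison is exact.

```python
# Illustrative only: plain Python lists stand in for quantized weight tensors.
# All helper names here are hypothetical, not ExecuTorch APIs.

def matvec(weight, x):
    """Multiply a [rows, hidden] matrix by a length-hidden vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_gate_up(gate_w, up_w):
    """Stack gate rows, then up rows, into one [2*intermediate, hidden] matrix."""
    if len(gate_w[0]) != len(up_w[0]):
        raise ValueError("gate and up must share the hidden (K) dimension")
    return gate_w + up_w


def fused_projection(gate_up_w, x, intermediate):
    """Run one matvec, then split the output back into gate and up halves."""
    out = matvec(gate_up_w, x)
    return out[:intermediate], out[intermediate:]


# Integer weights keep the fused/unfused comparison exact.
gate_w = [[1, 2, 3], [4, 5, 6]]   # intermediate=2, hidden=3
up_w = [[7, 8, 9], [1, 0, -1]]
x = [1, -1, 2]

gate_out, up_out = fused_projection(fuse_gate_up(gate_w, up_w), x, intermediate=2)

# The two halves match the two separate projections.
assert gate_out == matvec(gate_w, x)
assert up_out == matvec(up_w, x)
```

Because each output row depends only on its own weight row, stacking rows changes how many kernels are launched, not what each row computes. That is consistent with the bit-exact decode result reported in the test plan.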
…a+mlx)

Summary:
Move the gemma4 MLP gate_proj|up_proj fusion to a single backend-agnostic point in the GGUF loader, and make the model forward consume it. Supersedes the earlier CUDA-only export-time fusion (reverted here).

- gguf_loader.py: before any backend conversion (_convert_weight), buffer each layer's raw gate/up ExportableGGUFTensor and, once both arrive, row-concat their raw GGUF blocks along the output dim into one fused gate_up ExportableGGUFTensor (gate rows then up rows). Both backends then pack the already-fused weight with NO per-type concat: CUDA (Q4_K -> CudaCoalescedInt4Tensor, Q6_K -> CudaDp4aPlanarInt6Tensor) and MLX (ExportableGGUFTensor). Guards: same ggml_type + K; non-fuseable pairs and unpaired leftovers fall through unfused.
- Gemma4MLP: when a fused gate_up_proj is present, run one matmul and split the [.., 2*intermediate_size] output back into gate/up; otherwise use the separate projections. The shared MLP stays safe for unfused checkpoints and the prequant/HF load paths (no gate_up_proj -> original path, no crash).
- Revert the previous CUDA-localized fusion (cuda_source_transformations.py and export.py back to their original form). The kv_len-bounded tq4_sdpa kernel + call-site (already on main) are unchanged.

Single fusion point widens applicability (CUDA + MLX, incl. Q6_K) and keeps the model def backend-agnostic. Decode win is unchanged (same fused matmul, produced at load instead of at export).

Test Plan:
- Raw concat (real GGUF blk.0 ffn, q4_k): fused.dequantize() == [gate; up] stacked, bit-exact; fused CudaCoalescedInt4Tensor rows [:N]/[N:] qdata+scale+zero bit-identical to gate/up.
- Model-def fused vs unfused forward through real W4A8 int4_plain_mm: decode (T=1) bit-exact (cos 1.000000); prefill (T=4) cos 0.999988 -- the only delta is cuBLAS GEMM shape-dependent fp ordering (N=43008 vs 21504, identical weights), benign and inherent to any gate/up fusion.
- Full CUDA GGUF export (gemma4_31b, --turboquant, max-seq-len 131072): loader logs "Fused gate+up on 60 MLP layers", TurboQuant swaps 10 layers, AOTI build clean (model.pte + 26.18GB aoti_cuda_blob.ptd, "Done.").
- Decode via gemma4_31b_runner on the new build: coherent output, no NaN; prefill 1375 tok/s, decode 38.3 tok/s (no cuda_graph sanity).
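The loader-side buffering and guards described above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in gguf_loader.py: `RawTensor`, `fuse_gate_up_stream`, the `blk.N.ffn_gate.weight` / `blk.N.ffn_up.weight` name pattern, and the fused name `ffn_gate_up` are all hypothetical stand-ins, and each row's raw quantized blocks are modeled as one `bytes` object.

```python
import re
from dataclasses import dataclass


@dataclass
class RawTensor:
    """Hypothetical stand-in for a raw (not yet backend-converted) GGUF tensor."""
    name: str
    ggml_type: str
    k: int        # input (hidden) dimension
    rows: list    # one bytes object of raw quantized blocks per output row


# Assumed tensor naming; the real loader may match names differently.
_FFN = re.compile(r"^blk\.(\d+)\.ffn_(gate|up)\.weight$")


def fuse_gate_up_stream(tensors):
    """Buffer gate/up per layer and emit one fused tensor once both have arrived."""
    pending = {}   # layer index -> {"gate": RawTensor, "up": RawTensor}
    out = []
    for t in tensors:
        m = _FFN.match(t.name)
        if m is None:
            out.append(t)            # not a gate/up weight: pass through
            continue
        layer, kind = int(m.group(1)), m.group(2)
        slot = pending.setdefault(layer, {})
        slot[kind] = t
        if len(slot) < 2:
            continue                 # wait for the partner tensor
        gate, up = slot["gate"], slot["up"]
        del pending[layer]
        # Guards: same ggml_type and same K, otherwise leave the pair unfused.
        if gate.ggml_type == up.ggml_type and gate.k == up.k:
            out.append(RawTensor(f"blk.{layer}.ffn_gate_up.weight",
                                 gate.ggml_type, gate.k,
                                 gate.rows + up.rows))   # gate rows, then up rows
        else:
            out.extend([gate, up])
    # Unpaired leftovers fall through unfused.
    for slot in pending.values():
        out.extend(slot.values())
    return out


stream = [
    RawTensor("token_embd.weight", "Q6_K", 4, [b"e0"]),
    RawTensor("blk.0.ffn_gate.weight", "Q4_K", 4, [b"g0", b"g1"]),
    RawTensor("blk.0.ffn_up.weight", "Q4_K", 4, [b"u0", b"u1"]),
    RawTensor("blk.1.ffn_gate.weight", "Q4_K", 4, [b"g0"]),
    RawTensor("blk.1.ffn_up.weight", "Q6_K", 4, [b"u0"]),   # type mismatch
    RawTensor("blk.2.ffn_gate.weight", "Q4_K", 4, [b"g0"]), # no partner
]
result = fuse_gate_up_stream(stream)
names = [t.name for t in result]
```

In this sketch only layer 0 is fused; layer 1 fails the type guard and layer 2 has no partner, so both pass through unchanged. Concatenating before any backend conversion is what lets each backend pack the fused tensor the same way it would pack any single weight.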
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20481

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs
There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 3 New Failures, 3 Unrelated Failures, 2 Unclassified Failures
As of commit 638f07a with merge base 65bc0ca:

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
8b145b5 to
1c371e2