[gemma4_31b][cuda] Export Gemma4-31B @128k on 5090 by Gasoonjia · Pull Request #20480 · pytorch/executorch

Gasoonjia · 2026-06-24T08:35:40Z

The current gemma4-31b cannot be exported on a consumer GPU such as the 5090, for three reasons:

1. During int4_dispatch, the whole lm_head matmul weight is dequantized to bf16 in one step for prefill, which duplicates the weight in memory.
2. When lowering to AOTI-CUDA, the whole model, including the KV cache, is moved onto the GPU, so GPU memory consumption grows dramatically as the context length increases.
3. No autotune config for kernels such as sdpa works on a consumer GPU like the 5090.

Three CUDA-export memory optimizations, all gated behind the existing low_memory_mode compile spec (no impact on other models or on runtime):

int4_dispatch: chunk the inline _dequant_matmul along N for vocab-sized weights, gated behind a low-memory flag with an N>65536 threshold so only the lm_head crosses it. Avoids transiently materializing the full ~10 GiB bf16 lm_head during AOTI autotune / cpp_wrapper. The prefill MLP path is untouched -> zero runtime impact.
cuda_backend / aoti_backend: skip occupying the GPU with the KV-cache buffers during AOTI compile. A new move_program_to_device hook places KV constants on the target device but immediately frees their storage (resize_(0)), so the fake-tensor device check passes while no real KV bytes sit on the GPU during autotune. The emptied buffers are re-synthesized as zeros at the _unlift_graph clone and at serialization, and excluded from constant dedup (resize_(0) gives every KV data_ptr 0, which would otherwise collapse same-shape caches across layers). All gated behind low_memory_mode.
tq4_sdpa: add BLOCK_N=16 (and a BLOCK_M=32) autotune config. The superset is kept for big-shared-memory GPUs (A100/H100); the Triton autotuner auto-prunes configs that exceed a GPU's shared memory (OutOfResources -> inf), so the same config list also works on the 5090 (Blackwell, ~101 KB SMEM) where the previous smallest config did not fit.
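
The chunking idea in the int4_dispatch item can be sketched in plain Python. This is a minimal illustration, not the PR's implementation: the list-of-lists layout, the helper names `dequant_rows` and `matmul_t`, the `low_memory` parameter, and the dequant convention `w = (q - zero) * scale` are all assumptions made for the example. Only the two constants come from the diff under review.

```python
from typing import List

Matrix = List[List[float]]

# Values taken from the diff under review.
_DEQUANT_N_THRESHOLD = 65536
_DEQUANT_N_CHUNK = 32768


def dequant_rows(qdata, scale, zero, group_size, start, stop) -> Matrix:
    """Dequantize weight rows [start, stop) with w = (q - zero) * scale.

    qdata is (N, K) with int4 values 0..15; scale and zero are
    (N, K // group_size). This convention is an assumption for the sketch.
    """
    rows = []
    for n in range(start, stop):
        row = []
        for k, q in enumerate(qdata[n]):
            g = k // group_size  # quantization group of column k
            row.append((q - zero[n][g]) * scale[n][g])
        rows.append(row)
    return rows


def matmul_t(x: Matrix, w: Matrix) -> Matrix:
    """Compute x @ w.T for x of shape (M, K) and w of shape (N, K)."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


def dequant_matmul(x, qdata, scale, zero, group_size, low_memory=False,
                   threshold=_DEQUANT_N_THRESHOLD,
                   chunk=_DEQUANT_N_CHUNK) -> Matrix:
    """Dequantize and multiply, chunking along N only for very large N."""
    n_total = len(qdata)
    if not low_memory or n_total <= threshold:
        # Small weights: materialize the full dequantized matrix at once.
        w = dequant_rows(qdata, scale, zero, group_size, 0, n_total)
        return matmul_t(x, w)

    # Large weights: only one chunk of dequantized rows is alive at a time.
    out: Matrix = [[] for _ in x]
    for start in range(0, n_total, chunk):
        stop = min(start + chunk, n_total)
        w_chunk = dequant_rows(qdata, scale, zero, group_size, start, stop)
        part = matmul_t(x, w_chunk)
        for out_row, part_row in zip(out, part):
            out_row.extend(part_row)  # concatenate along the N axis
    return out
```

Because each output column depends on exactly one weight row, the chunked and unchunked paths perform the same arithmetic per element; only the peak size of the dequantized intermediate changes.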

Full Gemma4-31B on 128k TQ export: peak 28.0 GiB, runtime output correct ("...Paris.").
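
The dedup hazard mentioned in the cuda_backend / aoti_backend item can be illustrated with a toy model. Everything here is hypothetical: `Buf`, `free_storage`, and `dedup` are stand-ins invented for the example and do not mirror the actual AOTI constant-dedup code. The sketch only shows why keying on `data_ptr` collapses buffers whose storage has been freed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Buf:
    """Toy stand-in for a constant tensor (hypothetical, not a torch type)."""
    name: str
    shape: Tuple[int, ...]
    data_ptr: int
    is_kv_cache: bool = False

    def free_storage(self) -> None:
        # Stand-in for freeing the storage with resize_(0): the pointer
        # becomes 0 while the logical shape is kept.
        self.data_ptr = 0


def dedup(constants: List[Buf], skip_emptied: bool) -> Dict[str, str]:
    """Map each constant name to the name of its canonical representative."""
    seen: Dict[Tuple[int, Tuple[int, ...]], str] = {}
    alias: Dict[str, str] = {}
    for c in constants:
        if skip_emptied and c.data_ptr == 0:
            # Emptied buffers carry no identity in their pointer, so they
            # are left out of dedup and stay distinct.
            alias[c.name] = c.name
            continue
        key = (c.data_ptr, c.shape)
        alias[c.name] = seen.setdefault(key, c.name)
    return alias


constants = [
    Buf("layer0.k_cache", (1, 8, 128), 0x1000, is_kv_cache=True),
    Buf("layer1.k_cache", (1, 8, 128), 0x2000, is_kv_cache=True),
    Buf("embed.weight", (16, 4), 0x3000),
    Buf("lm_head.weight", (16, 4), 0x3000),  # genuinely shared storage
]
for c in constants:
    if c.is_kv_cache:
        c.free_storage()

naive = dedup(constants, skip_emptied=False)
safe = dedup(constants, skip_emptied=True)
```

In this toy run, `naive` aliases `layer1.k_cache` to `layer0.k_cache` because both now report pointer 0 with the same shape, while `safe` keeps the two caches separate and still merges the two weights that really share storage.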

pytorch-bot · 2026-06-24T08:35:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20480

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 3 New Failures, 3 Unrelated Failures

As of commit 993cff5 with merge base 1b726b2:

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 240605ac100851bf3f29b75b801e35308a3562f7979541483ce6d8133c30e0d5 /exec failed with exit code 1
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 485e4fcfd099959114159bdfd57ae874b14a1aa5dbb354818ebb69da8cb8ed06 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 66f95c521d611bd0dd325b49e963b2632bd6467386ee969ca66a6240c99ca7ad /exec failed with exit code 1

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-24T08:36:33Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@128k

Three CUDA-export memory optimizations:

- tq4_sdpa: add BLOCK_N=16 (and a BLOCK_M=32) autotune config. The superset is kept for big-shared-memory GPUs (A100/H100); the Triton autotuner auto-prunes configs that exceed a GPU's shared memory (OutOfResources -> inf), so the same config list also works on the 5090 (Blackwell, ~101 KB SMEM) where the previous smallest config did not fit.
- int4_dispatch: chunk the inline _dequant_matmul along N for vocab-sized weights (N>65536, i.e. only the lm_head). Avoids transiently materializing the full ~10 GiB bf16 lm_head when AOTI executes the int4_plain_mm custom op during autotune / cpp_wrapper. The runtime decode path uses the C++ dp4a shim and the M>4 prefill inline path is below the threshold, so this never enters the runtime graph -> zero runtime / accuracy impact. Applied unconditionally (no flag).
- cuda_backend / aoti_backend: skip occupying the GPU with the KV-cache buffers during AOTI compile (gated behind low_memory_mode). A new move_program_to_device hook places KV constants on the target device but immediately frees their storage (resize_(0)), so the fake-tensor device check passes while no real KV bytes sit on the GPU during autotune. The emptied buffers are re-synthesized as zeros at the _unlift_graph clone and at serialization, and excluded from constant dedup (resize_(0) gives every KV data_ptr 0, which would otherwise collapse same-shape caches across layers).

Result on 2xA100: Gemma4-31B @128k no-TQ export peak 36.3 -> 27.0 GiB; the exported model runs correctly (output "...Paris.").
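
The pruning behaviour described for tq4_sdpa (a config that runs out of resources is scored as infinitely slow, so it is never selected) can be modelled without Triton. This is a simplified sketch under stated assumptions: the shared-memory formula, the cost model, the head dimension, the config values, and both memory limits are invented for illustration, and the local `OutOfResources` class only stands in for the real exception.

```python
import math
from typing import Dict, List, NamedTuple


class Config(NamedTuple):
    block_m: int
    block_n: int
    num_stages: int


class OutOfResources(Exception):
    """Local stand-in for the autotuner's out-of-resources error."""


def smem_bytes(cfg: Config, head_dim: int = 256, elem_bytes: int = 2) -> int:
    # Hypothetical footprint: one Q tile plus staged K and V tiles.
    tiles = cfg.block_m + 2 * cfg.num_stages * cfg.block_n
    return elem_bytes * head_dim * tiles


def bench(cfg: Config, smem_limit: int) -> float:
    if smem_bytes(cfg) > smem_limit:
        raise OutOfResources(f"{cfg} needs {smem_bytes(cfg)} bytes")
    # Made-up cost model: larger tiles are assumed to run faster.
    return 1.0 / (cfg.block_m * cfg.block_n)


def autotune(configs: List[Config], smem_limit: int) -> Config:
    timings: Dict[Config, float] = {}
    for cfg in configs:
        try:
            timings[cfg] = bench(cfg, smem_limit)
        except OutOfResources:
            timings[cfg] = math.inf  # pruned: can never win
    best = min(timings, key=timings.get)
    if math.isinf(timings[best]):
        raise RuntimeError("no config fits in shared memory")
    return best


# Hypothetical limits and config lists, chosen only to show the effect.
BIG_SMEM = 164 * 1024
SMALL_SMEM = 101 * 1024
OLD_CONFIGS = [Config(128, 64, 2), Config(64, 64, 2)]
NEW_CONFIGS = OLD_CONFIGS + [Config(32, 16, 2)]
```

With these made-up numbers, `autotune(NEW_CONFIGS, BIG_SMEM)` still picks the same config as the old list would, `autotune(NEW_CONFIGS, SMALL_SMEM)` falls back to the added small config, and `autotune(OLD_CONFIGS, SMALL_SMEM)` raises because every candidate is pruned.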

digantdesai · 2026-06-24T18:44:50Z

+_DEQUANT_N_THRESHOLD = 65536
+_DEQUANT_N_CHUNK = 32768


Aren't these kind of device specific?

digantdesai · 2026-06-24T18:46:27Z

    return _dequant_matmul(self, qdata, scale, zero, group_size)


+# Chunked dequant for the export GPU budget. The lm_head dequant (N = vocab_size,


I wish there were a better way to do this, i.e. why does this logic need to be aware of export issues?

metascroy · 2026-06-24T22:22:42Z


+# Chunked dequant for the export GPU budget. The lm_head dequant (N = vocab_size,
+# e.g. 262144) runs through the int4_plain_mm custom op (M=1); AOTI executes that
+# op's CUDA impl during autotune / cpp_wrapper codegen, where it transiently holds


Is this just a crude way of doing tile level dequant?

meta-cla Bot added the CLA Signed label Jun 24, 2026

Gasoonjia temporarily deployed to cadence June 24, 2026 08:35 — with GitHub Actions Inactive

Gasoonjia had a problem deploying to cadence June 24, 2026 08:35 — with GitHub Actions Error

Gasoonjia force-pushed the gemma4_31b_export_under_32gb branch from 498a419 to 993cff5 Compare June 24, 2026 08:55

Gasoonjia temporarily deployed to cadence June 24, 2026 08:55 — with GitHub Actions Inactive

mergennachin requested review from digantdesai and metascroy June 24, 2026 18:07

digantdesai reviewed Jun 24, 2026

metascroy reviewed Jun 24, 2026

Gasoonjia wants to merge 1 commit into main from gemma4_31b_export_under_32gb

3 participants
