[gemma4_31b][cuda] Export Gemma4-31B @128k on 5090#20480
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20480
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEVs
There are 1 currently active SEVs. If your PR is affected, please view them below:
❌ 3 New Failures, 3 Unrelated Failures
As of commit 993cff5 with merge base 1b726b2:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Three CUDA-export memory optimizations:
- tq4_sdpa: add BLOCK_N=16 (and a BLOCK_M=32) autotune config. The superset is kept for big-shared-memory GPUs (A100/H100); the Triton autotuner auto-prunes configs that exceed a GPU's shared memory (OutOfResources -> inf), so the same config list also works on the 5090 (Blackwell, ~101 KB SMEM) where the previous smallest config did not fit.
- int4_dispatch: chunk the inline _dequant_matmul along N for vocab-sized weights (N>65536, i.e. only the lm_head). Avoids transiently materializing the full ~10 GiB bf16 lm_head when AOTI executes the int4_plain_mm custom op during autotune / cpp_wrapper. The runtime decode path uses the C++ dp4a shim and the M>4 prefill inline path is below the threshold, so this never enters the runtime graph -> zero runtime / accuracy impact. Applied unconditionally (no flag).
- cuda_backend / aoti_backend: skip occupying the GPU with the KV-cache buffers during AOTI compile (gated behind low_memory_mode). A new move_program_to_device hook places KV constants on the target device but immediately frees their storage (resize_(0)), so the fake-tensor device check passes while no real KV bytes sit on the GPU during autotune. The emptied buffers are re-synthesized as zeros at the _unlift_graph clone and at serialization, and excluded from constant dedup (resize_(0) gives every KV data_ptr 0, which would otherwise collapse same-shape caches across layers).

Result on 2xA100: Gemma4-31B @128k no-TQ export peak 36.3 -> 27.0 GiB; the exported model runs correctly (output "...Paris.").
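The int4_dispatch chunking described above can be illustrated with a small pure-Python sketch. This is not the PR's code: the function names, the `(q - zero) * scale` dequant convention, and the tiny threshold and chunk values are assumptions chosen so the example runs without PyTorch. It only shows the idea that, above a threshold on N, the weight is dequantized one N-chunk at a time so the full dequantized matrix never exists at once.

```python
# Toy stand-ins for _DEQUANT_N_THRESHOLD (65536) and _DEQUANT_N_CHUNK (32768).
N_THRESHOLD = 8
N_CHUNK = 4


def dequant_rows(qdata, scale, zero, group_size, start, stop):
    """Dequantize rows [start, stop) of an N x K quantized weight.

    Assumed convention (not taken from the PR): w = (q - zero) * scale,
    with one scale/zero pair per group of `group_size` columns.
    """
    rows = []
    for n in range(start, stop):
        row = []
        for k, q in enumerate(qdata[n]):
            g = k // group_size
            row.append((q - zero[n][g]) * scale[n][g])
        rows.append(row)
    return rows


def matmul_xwT(x, w):
    """Compute x @ w.T for x of shape M x K and w of shape n x K."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


def dequant_matmul(x, qdata, scale, zero, group_size):
    """Unchunked reference: all N dequantized rows are alive at once."""
    w = dequant_rows(qdata, scale, zero, group_size, 0, len(qdata))
    return matmul_xwT(x, w)


def chunked_dequant_matmul(x, qdata, scale, zero, group_size, chunk_sizes=None):
    """Same result as dequant_matmul, but at most N_CHUNK rows are
    dequantized at any time when N exceeds N_THRESHOLD."""
    n_total = len(qdata)
    if n_total <= N_THRESHOLD:
        return dequant_matmul(x, qdata, scale, zero, group_size)
    out = [[] for _ in x]
    for start in range(0, n_total, N_CHUNK):
        stop = min(start + N_CHUNK, n_total)
        w = dequant_rows(qdata, scale, zero, group_size, start, stop)
        if chunk_sizes is not None:
            chunk_sizes.append(len(w))  # record how many rows were live
        part = matmul_xwT(x, w)
        for out_row, part_row in zip(out, part):
            out_row.extend(part_row)  # concatenate along N
    return out
```

Because each output column depends only on its own weight row, concatenating the per-chunk results along N reproduces the unchunked output exactly; only the peak size of the dequantized intermediate changes.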
498a419 to 993cff5
_DEQUANT_N_THRESHOLD = 65536
_DEQUANT_N_CHUNK = 32768
Aren't these kind of device specific?
return _dequant_matmul(self, qdata, scale, zero, group_size)

# Chunked dequant for the export GPU budget. The lm_head dequant (N = vocab_size,
I wish there were a better way to do this, i.e. why does this logic need to be aware of export issues?
# Chunked dequant for the export GPU budget. The lm_head dequant (N = vocab_size,
# e.g. 262144) runs through the int4_plain_mm custom op (M=1); AOTI executes that
# op's CUDA impl during autotune / cpp_wrapper codegen, where it transiently holds
Is this just a crude way of doing tile level dequant?
Currently, Gemma4-31B cannot be exported successfully on a consumer GPU such as the 5090, for three reasons:
Three CUDA-export memory optimizations, all gated behind the existing low_memory_mode compile spec (no impact on other models or on runtime):
- int4_dispatch: chunk the inline _dequant_matmul along N for vocab-sized weights, gated behind a low-memory flag with an N>65536 threshold so only the lm_head crosses it. Avoids transiently materializing the full ~10 GiB bf16 lm_head during AOTI autotune / cpp_wrapper. The prefill MLP path is untouched -> zero runtime impact.
- cuda_backend / aoti_backend: skip occupying the GPU with the KV-cache buffers during AOTI compile. A new move_program_to_device hook places KV constants on the target device but immediately frees their storage (resize_(0)), so the fake-tensor device check passes while no real KV bytes sit on the GPU during autotune. The emptied buffers are re-synthesized as zeros at the _unlift_graph clone and at serialization, and excluded from constant dedup (resize_(0) gives every KV data_ptr 0, which would otherwise collapse same-shape caches across layers). All gated behind low_memory_mode.
- tq4_sdpa: add BLOCK_N=16 (and a BLOCK_M=32) autotune config. The superset is kept for big-shared-memory GPUs (A100/H100); the Triton autotuner auto-prunes configs that exceed a GPU's shared memory (OutOfResources -> inf), so the same config list also works on the 5090 (Blackwell, ~101 KB SMEM) where the previous smallest config did not fit.
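The "OutOfResources -> inf" pruning behind the tq4_sdpa item can be pictured with a toy autotuner. The sketch below is a simulation only, not Triton's implementation or API: the `OutOfResources` class here is a local stand-in, and the shared-memory formula, cost model, config list, and memory limits are all hypothetical. It shows why one superset config list can serve both large-SMEM and small-SMEM GPUs, as long as at least one config fits.

```python
import math
from dataclasses import dataclass


class OutOfResources(Exception):
    """Local stand-in for the error raised when a config needs too much SMEM."""


@dataclass(frozen=True)
class Config:
    block_m: int
    block_n: int


def smem_bytes(cfg, head_dim=512, elem_bytes=2):
    # Hypothetical estimate: one Q tile plus one K and one V tile in SMEM.
    return (cfg.block_m + 2 * cfg.block_n) * head_dim * elem_bytes


def bench(cfg, smem_limit):
    """Pretend to time a config; fail if it does not fit in shared memory."""
    need = smem_bytes(cfg)
    if need > smem_limit:
        raise OutOfResources(f"{cfg} needs {need} B, limit is {smem_limit} B")
    # Toy cost model (assumption): larger tiles are faster.
    return 1.0 / (cfg.block_m * cfg.block_n)


def autotune(configs, smem_limit):
    """Time every config; configs that do not fit get a time of infinity."""
    timings = {}
    for cfg in configs:
        try:
            timings[cfg] = bench(cfg, smem_limit)
        except OutOfResources:
            timings[cfg] = math.inf  # pruned: can never be selected
    best = min(timings, key=timings.get)
    if math.isinf(timings[best]):
        raise RuntimeError("no config fits in shared memory")
    return best, timings


# Hypothetical superset: three larger configs plus one small one.
CONFIGS = [Config(128, 64), Config(64, 64), Config(64, 32), Config(32, 16)]
```

With a generous limit the largest config wins, while with a limit of 101 * 1024 bytes only the smallest config survives in this toy model. Without that small config in the list, `autotune` would raise, which mirrors the situation where the previous smallest config did not fit.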
Full Gemma4-31B on 128k TQ export: peak 28.0 GiB, runtime output correct ("...Paris.").
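The dedup exclusion mentioned in the cuda_backend / aoti_backend item can be shown with a toy model. This is a sketch under assumptions, not the backend's code: `FakeBuffer`, `dedup`, and the `(data_ptr, shape)` dedup key are hypothetical, and a `data_ptr` of 0 simply models a buffer whose storage was freed with resize_(0).

```python
from dataclasses import dataclass


@dataclass
class FakeBuffer:
    name: str
    shape: tuple
    data_ptr: int  # 0 models a buffer whose storage has been freed


def dedup(buffers, skip_empty):
    """Map each buffer name to the name of the buffer it is merged into.

    Buffers with the same (data_ptr, shape) key are treated as aliases
    (hypothetical dedup rule). With skip_empty=True, buffers whose
    data_ptr is 0 are never merged.
    """
    first_seen = {}
    alias = {}
    for buf in buffers:
        if skip_empty and buf.data_ptr == 0:
            alias[buf.name] = buf.name  # keep every emptied buffer distinct
            continue
        key = (buf.data_ptr, buf.shape)
        alias[buf.name] = first_seen.setdefault(key, buf.name)
    return alias


buffers = [
    FakeBuffer("layer0.k_cache", (1, 8, 1024, 128), 0),
    FakeBuffer("layer1.k_cache", (1, 8, 1024, 128), 0),
    FakeBuffer("embed.weight", (256, 64), 0x1000),
    FakeBuffer("lm_head.weight", (256, 64), 0x1000),  # genuinely tied
]
```

In this model, naive dedup merges `layer1.k_cache` into `layer0.k_cache` because both report pointer 0 and the same shape, even though they are separate caches. Skipping emptied buffers keeps the caches distinct while the genuinely tied weights are still merged.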