
Add quantization support for DiffusionGemma #1935

Open
lvliang-intel wants to merge 10 commits into main from lvl/support_diffusiongemma

Conversation

@lvliang-intel

Contributor

Description

Add quantization support for DiffusionGemma (transformers ≥ 5.11.0). The model declares encoder/decoder MoE gate_up_proj / down_proj as tied across stacks; AutoRound's MoE-unfuse step splits these fused 3D parameters into per-expert linears, which leaves the declared _tied_weights_keys referring to nonexistent parameter names and breaks downstream loading/export.
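The failure mode described above can be sketched at the level of parameter names alone. This is a minimal illustration, not AutoRound's or transformers' code: the helper names `unfuse_names` and `stale_tied_keys`, the target-to-source dict layout of the tie declaration, and the exact parameter paths are all assumptions made for the example.

```python
# Hypothetical illustration: names only, no tensors are involved.
NUM_EXPERTS = 4  # kept small for the demo; real expert counts are much larger


def unfuse_names(param_names, num_experts=NUM_EXPERTS):
    """Mimic the renaming effect of splitting fused MoE parameters per expert."""
    out = []
    for name in param_names:
        if name.endswith("experts.gate_up_proj"):
            prefix = name[: -len("gate_up_proj")]
            for i in range(num_experts):
                out.append(f"{prefix}{i}.gate_proj.weight")
                out.append(f"{prefix}{i}.up_proj.weight")
        elif name.endswith("experts.down_proj"):
            prefix = name[: -len("down_proj")]
            for i in range(num_experts):
                out.append(f"{prefix}{i}.down_proj.weight")
        else:
            out.append(name)
    return out


def stale_tied_keys(tied_keys, param_names):
    """Return tie entries whose target or source name no longer exists."""
    existing = set(param_names)
    return {
        target: source
        for target, source in tied_keys.items()
        if target not in existing or source not in existing
    }


before = [
    "model.encoder.language_model.layers.0.experts.gate_up_proj",
    "model.encoder.language_model.layers.0.experts.down_proj",
    "model.decoder.layers.0.experts.gate_up_proj",
    "model.decoder.layers.0.experts.down_proj",
]
# Assumed layout of the tie declaration: target name -> source name.
tied = {
    "model.encoder.language_model.layers.0.experts.gate_up_proj":
        "model.decoder.layers.0.experts.gate_up_proj",
    "model.encoder.language_model.layers.0.experts.down_proj":
        "model.decoder.layers.0.experts.down_proj",
}

after = unfuse_names(before)
print(len(stale_tied_keys(tied, before)))  # 0: every tied name still exists
print(len(stale_tied_keys(tied, after)))   # 2: both ties point at removed names
```

Once the fused names are gone, every tie that mentions them is dangling, which is the condition that would have to be repaired or dropped before loading or export.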

Type of Change

New feature

Related Issues

None

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

lvliang-intel and others added 4 commits June 17, 2026 15:25
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lvliang-intel

Contributor Author

Transformers backend test result:

test_diffusiongemma_transformers.py

CUDA_VISIBLE_DEVICES=2 python test_diffusiongemma_transformers.py
DiffusionGemma int4 load test (transformers backend)
model_path : /mnt/disk1/lvl/auto-round-bugfix/tmp_diffusiongemma_int4/diffusiongemma-26B-A4B-it-w4g128
device : cuda:0 (CUDA visible: 1, torch=2.11.0+cu128, cuda=12.8)
dtype : bfloat16
max_new : 256
backend : auto

[1/5] Loading config ...
architectures: ['DiffusionGemmaForBlockDiffusion']
model_type: diffusion_gemma
canvas_length: 256
quantization_config: {'autoround_version': '0.14.0', 'bits': 4, 'block_name_to_quantize': 'model.encoder.language_model.layers,model.decoder.layers', 'data_type': 'int', 'group_size': 128, 'iters': 1, 'packing_format': 'auto_round:auto_gptq', 'quant_method': 'auto-round', 'sym': True, 'extra_config': {'model.decoder.embed_tokens': {'bits': 16, 'data_type': 'fp'}}}
elapsed: 0.0s

[2/5] Loading tokenizer ...
tokenizer: GemmaTokenizer, vocab=262144
elapsed: 2.3s
processor: Gemma4Processor (multimodal config ready)

[3/5] Loading generation_config ...
DiffusionGemmaGenerationConfig {
  "confidence_threshold": 0.005,
  "eos_token_id": [
    1,
    106,
    50
  ],
  "max_denoising_steps": 48,
  "max_new_tokens": 256,
  "pad_token_id": 0,
  "sampler_config": {
    "_cls_name": "EntropyBoundSamplerConfig",
    "entropy_bound": 0.1
  },
  "stability_threshold": 1,
  "t_max": 0.8,
  "t_min": 0.4
}

[4/5] Loading quantized model onto cuda:0 ...
2026-06-18 09:47:53 INFO replace_modules.py L120: Experts (before replacement) [model.encoder.language_model.layers.0.experts] (DiffusionGemmaTextExperts):
DiffusionGemmaTextExperts(
  (act_fn): GELUTanh()
)
[transformers] loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-06-18 09:47:53 INFO device.py L1447: Before applying custom replacements 'peak_ram': 1.5GB
2026-06-18 09:47:58 INFO moe_experts_interface.py L655: [MoE Prep] Unfused 60 MOE experts modules
2026-06-18 09:47:59 INFO device.py L1447: After applying custom replacements 'peak_ram': 1.55GB
2026-06-18 09:47:59 INFO replace_modules.py L93: Prepared 60 MOE modules for quantization
2026-06-18 09:47:59 INFO replace_modules.py L120: Experts (after replacement) [model.encoder.language_model.layers.0.experts] (DiffusionGemmaTextExperts):
DiffusionGemmaTextExperts(
  (act_fn): GELUTanh()
  (0-127): 128 x _ExpertContainer(
    (down_proj): Linear(in_features=704, out_features=2816, bias=False)
    (gate_proj): Linear(in_features=2816, out_features=704, bias=False)
    (up_proj): Linear(in_features=2816, out_features=704, bias=False)
  )
)
2026-06-18 09:48:07 WARNING backend.py L1176: Better backend is found, please install all the following requirements to enable it.
2026-06-18 09:48:07 WARNING backend.py L1176: pip install -v "gptqmodel>=2.0" --no-build-isolation
Loading weights: 100%|████████████████████████████████████████| 71614/71614 [00:16<00:00, 4385.89it/s]
entry: DiffusionGemmaForBlockDiffusion.from_pretrained -> success
load complete: elapsed 48.9s
total params: 1.33B
QuantLinear submodules: 23510
GPU memory: free=0.9G / total=85.1G
device of first model parameter: cuda:0
GPU(cuda:0) memory usage: allocated=28.15G, reserved=52.33G

[5/5] Running a simple Q&A ...
prompt token count: 26
prompt content (first 200 chars): '<|turn>user\nWhy is the sky blue? Please answer in one short paragraph.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>'
gen_kwargs: max_new_tokens=256, max_denoising_steps=48, eos_token_id=[1, 106, 50]
generate() elapsed: 150.40s

Q: Why is the sky blue? Please answer in one short paragraph.
A: the sky appears blue because of a phenomenon called Rayleigh scattering. As person enters Earth's atmosphere, it collides with gas molecules and scatters in all directions; blue light travels in shorter, shorter waves that are more vastly scattered by these small particles than other colors, which is why way eyes perceive a primarily blue sky.

transformers load + inference succeeded
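
As a rough sanity check on the load log above, the reported QuantLinear count can be compared with the number of per-expert linears implied by the unfuse step. This is only a sketch: it assumes every one of the 60 unfused MoE modules has 128 experts (the log prints layer 0 only) and that every expert linear was replaced by a QuantLinear.

```python
# Figures copied from the log; the two assumptions are noted inline.
moe_modules = 60            # "[MoE Prep] Unfused 60 MOE experts modules"
experts_per_module = 128    # "(0-127): 128 x _ExpertContainer" (assumed for all modules)
linears_per_expert = 3      # down_proj, gate_proj, up_proj
reported_quant_linears = 23510  # "QuantLinear submodules: 23510"

# Assumes every expert linear became a QuantLinear.
expert_linears = moe_modules * experts_per_module * linears_per_expert
non_expert_linears = reported_quant_linears - expert_linears

print(expert_linears)      # 23040
print(non_expert_linears)  # 470
```

Under those assumptions, the expert linears account for the large majority of the quantized submodules, and the remainder would be attention and other non-expert linears.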

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


if modules_to_not_convert and not user_specified:
    _DEFAULT_SKIP_KEYWORDS = ("embed", "embed_tokens", "lm_head", "output_embed", "norm")
    modules_to_not_convert = [

Contributor

Any large change in this file is risky.

Could you provide more details? From my understanding, everything listed above except lm_head should no longer be a linear layer and therefore should already be handled correctly. As for lm_head, it has its own dedicated logic, so I'm not sure why changes would be needed there.

Contributor Author

The original logic is correct for normal models: get_keys_to_not_convert returns only "lm_head/embed", which aren't quantizable linear layers, so a filter would be a no-op. DiffusionGemma breaks that assumption because of one regex entry in _tied_weights_keys in transformers' "modeling_diffusion_gemma.py":

r"encoder.language_model.layers.(?:[^.]+.)*weight" → r"decoder.layers.(?:[^.]+.)*weight"

This ties every .weight under the encoder layers to the matching decoder weight, i.e. self_attn.q_proj.weight, k_proj.weight, v_proj.weight, o_proj.weight, and the per-expert MLP weights. These are exactly the real nn.Linear layers we want to quantize. Without the filter, essentially every encoder and decoder attention/MLP linear is excluded from quantization, which is a bug. The filter restricts the auto-derived skip list back to "embed/lm_head/norm", which is what get_keys_to_not_convert was meant to return.
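
A minimal sketch of that effect, under stated assumptions: the pattern is copied as quoted in this comment (the real source may escape the dots), the module names are hypothetical examples, and the keyword filter is written as a plain substring match rather than as the PR's actual code.

```python
import re

# Pattern as quoted in this thread; unescaped dots still match literal dots.
TIED_PATTERN = re.compile(r"encoder.language_model.layers.(?:[^.]+.)*weight")
_DEFAULT_SKIP_KEYWORDS = ("embed", "embed_tokens", "lm_head", "output_embed", "norm")

# Hypothetical module names, chosen to cover each case.
candidates = [
    "model.encoder.language_model.layers.0.self_attn.q_proj",
    "model.encoder.language_model.layers.0.self_attn.o_proj",
    "model.encoder.language_model.layers.0.input_layernorm",
    "model.decoder.embed_tokens",
    "lm_head",
]

# Modules whose .weight is caught by the tied-weights regex.
tied_by_regex = [n for n in candidates if TIED_PATTERN.search(n + ".weight")]

# Assumed shape of the auto-derived skip list: tied modules plus embed/lm_head.
auto_skip = tied_by_regex + ["model.decoder.embed_tokens", "lm_head"]

# Keep only entries that contain one of the default skip keywords.
filtered_skip = [
    n for n in auto_skip if any(k in n for k in _DEFAULT_SKIP_KEYWORDS)
]

print(tied_by_regex)
print(filtered_skip)
```

In this sketch the attention projections are caught by the regex and would be skipped, while the keyword filter drops them from the skip list and keeps only the norm, embedding, and lm_head entries.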

Do you have a better (and simple) idea for fixing this issue?

Comment thread auto_round/utils/common.py Outdated
@wenhuach21

Contributor

Could you use auto-round-best to tune a model and upload it to the Intel space? Thanks!

@lvliang-intel

Contributor Author

Could you use auto-round-best to tune a model and upload it to the Intel space? Thanks!

Yes, WIP.

@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@chensuyue

Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lvliang-intel

Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


3 participants