Add quantization support for DiffusionGemma by lvliang-intel · Pull Request #1935 · intel/auto-round

lvliang-intel · 2026-06-17T13:59:13Z

Description

Add quantization support for DiffusionGemma (transformers ≥ 5.11.0). The model declares encoder/decoder MoE gate_up_proj / down_proj as tied across stacks; AutoRound's MoE-unfuse step splits these fused 3D parameters into per-expert linears, which leaves the declared _tied_weights_keys referring to nonexistent parameter names and breaks downstream loading/export.
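To make the failure mode concrete, here is a minimal sketch in plain Python that works on parameter names only (no torch). The module paths mirror the ones in this PR's logs, but the shape of `_tied_weights_keys` used here (a target-to-source dict of literal names) and the helpers `unfuse_names` and `stale_tied_keys` are assumptions for illustration, not AutoRound or transformers APIs.

```python
# Illustrative only: parameter names stand in for real tensors.
NUM_EXPERTS = 4  # hypothetical small value to keep the example short


def unfuse_names(param_names, num_experts=NUM_EXPERTS):
    """Replace fused expert parameter names with per-expert linear weight names."""
    out = set()
    for name in param_names:
        if name.endswith(".experts.gate_up_proj"):
            prefix = name[: -len("gate_up_proj")]  # keeps the trailing "experts."
            for i in range(num_experts):
                out.add(f"{prefix}{i}.gate_proj.weight")
                out.add(f"{prefix}{i}.up_proj.weight")
        elif name.endswith(".experts.down_proj"):
            prefix = name[: -len("down_proj")]
            for i in range(num_experts):
                out.add(f"{prefix}{i}.down_proj.weight")
        else:
            out.add(name)
    return out


def stale_tied_keys(tied_keys, param_names):
    """Return tied entries whose target or source name no longer exists."""
    return {
        target: source
        for target, source in tied_keys.items()
        if target not in param_names or source not in param_names
    }


fused = {
    "model.encoder.language_model.layers.0.experts.gate_up_proj",
    "model.encoder.language_model.layers.0.experts.down_proj",
    "model.decoder.layers.0.experts.gate_up_proj",
    "model.decoder.layers.0.experts.down_proj",
}
# Assumed layout: encoder parameter (target) tied to decoder parameter (source).
tied = {
    "model.encoder.language_model.layers.0.experts.gate_up_proj":
        "model.decoder.layers.0.experts.gate_up_proj",
    "model.encoder.language_model.layers.0.experts.down_proj":
        "model.decoder.layers.0.experts.down_proj",
}

unfused = unfuse_names(fused)
before = stale_tied_keys(tied, fused)    # every tied name still exists
after = stale_tied_keys(tied, unfused)   # fused names are gone after unfusing
print(len(before), len(after))
```

After the unfuse step, every tied entry points at a name that is no longer in the model, which is the situation the description says breaks loading and export.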

Type of Change

New feature

Related Issues

None

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

chensuyue · 2026-06-17T14:32:12Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-17T14:32:22Z

Azure Pipelines successfully started running 1 pipeline(s).

lvliang-intel · 2026-06-18T01:56:25Z

Transformers backend test result:

test_diffusiongemma_transformers.py

CUDA_VISIBLE_DEVICES=2 python test_diffusiongemma_transformers.py
DiffusionGemma int4 load test (transformers backend)
model_path : /mnt/disk1/lvl/auto-round-bugfix/tmp_diffusiongemma_int4/diffusiongemma-26B-A4B-it-w4g128
device : cuda:0 (CUDA visible: 1, torch=2.11.0+cu128, cuda=12.8)
dtype : bfloat16
max_new : 256
backend : auto

[1/5] Loading config ...
architectures: ['DiffusionGemmaForBlockDiffusion']
model_type: diffusion_gemma
canvas_length: 256
quantization_config: {'autoround_version': '0.14.0', 'bits': 4, 'block_name_to_quantize': 'model.encoder.language_model.layers,model.decoder.layers', 'data_type': 'int', 'group_size': 128, 'iters': 1, 'packing_format': 'auto_round:auto_gptq', 'quant_method': 'auto-round', 'sym': True, 'extra_config': {'model.decoder.embed_tokens': {'bits': 16, 'data_type': 'fp'}}}
elapsed: 0.0s

[2/5] Loading tokenizer ...
tokenizer: GemmaTokenizer, vocab=262144
elapsed: 2.3s
processor: Gemma4Processor (multimodal config ready)

[3/5] Loading generation_config ...
DiffusionGemmaGenerationConfig {
  "confidence_threshold": 0.005,
  "eos_token_id": [
    1,
    106,
    50
  ],
  "max_denoising_steps": 48,
  "max_new_tokens": 256,
  "pad_token_id": 0,
  "sampler_config": {
    "_cls_name": "EntropyBoundSamplerConfig",
    "entropy_bound": 0.1
  },
  "stability_threshold": 1,
  "t_max": 0.8,
  "t_min": 0.4
}

[4/5] Loading quantized model onto cuda:0 ...
2026-06-18 09:47:53 INFO replace_modules.py L120: Experts (before replacement) [model.encoder.language_model.layers.0.experts] (DiffusionGemmaTextExperts):
DiffusionGemmaTextExperts(
  (act_fn): GELUTanh()
)
[transformers] loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-06-18 09:47:53 INFO device.py L1447: Before applying custom replacements 'peak_ram': 1.5GB
2026-06-18 09:47:58 INFO moe_experts_interface.py L655: [MoE Prep] Unfused 60 MOE experts modules
2026-06-18 09:47:59 INFO device.py L1447: After applying custom replacements 'peak_ram': 1.55GB
2026-06-18 09:47:59 INFO replace_modules.py L93: Prepared 60 MOE modules for quantization
2026-06-18 09:47:59 INFO replace_modules.py L120: Experts (after replacement) [model.encoder.language_model.layers.0.experts] (DiffusionGemmaTextExperts):
DiffusionGemmaTextExperts(
  (act_fn): GELUTanh()
  (0-127): 128 x _ExpertContainer(
    (down_proj): Linear(in_features=704, out_features=2816, bias=False)
    (gate_proj): Linear(in_features=2816, out_features=704, bias=False)
    (up_proj): Linear(in_features=2816, out_features=704, bias=False)
  )
)
2026-06-18 09:48:07 WARNING backend.py L1176: Better backend is found, please install all the following requirements to enable it.
2026-06-18 09:48:07 WARNING backend.py L1176: pip install -v "gptqmodel>=2.0" --no-build-isolation
Loading weights: 100%|████████████████████████████████████████| 71614/71614 [00:16<00:00, 4385.89it/s]
entry: DiffusionGemmaForBlockDiffusion.from_pretrained -> success
load complete: elapsed 48.9s
total params: 1.33B
QuantLinear submodules: 23510
GPU memory: free=0.9G / total=85.1G
device of first model parameter: cuda:0
GPU(cuda:0) memory usage: allocated=28.15G, reserved=52.33G

[5/5] Running a simple Q&A ...
prompt token count: 26
prompt content (first 200 chars): '<|turn>user\nWhy is the sky blue? Please answer in one short paragraph.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>'
gen_kwargs: max_new_tokens=256, max_denoising_steps=48, eos_token_id=[1, 106, 50]
generate() elapsed: 150.40s

Q: Why is the sky blue? Please answer in one short paragraph.
A: the sky appears blue because of a phenomenon called Rayleigh scattering. As person enters Earth's atmosphere, it collides with gas molecules and scatters in all directions; blue light travels in shorter, shorter waves that are more vastly scattered by these small particles than other colors, which is why way eyes perceive a primarily blue sky.

transformers load + inference succeeded
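For anyone reproducing this load test, the `quantization_config` printed in step [1/5] can be inspected with a few lines of plain Python. This is only a sketch: the dict is copied from the log above, and `quantized_blocks` and `layer_bits` are hypothetical helpers showing how the block list and a per-layer `extra_config` override could be read. They do not reflect how AutoRound itself resolves layer settings.

```python
# quantization_config as printed in step [1/5] of the log above.
quantization_config = {
    "autoround_version": "0.14.0",
    "bits": 4,
    "block_name_to_quantize": "model.encoder.language_model.layers,model.decoder.layers",
    "data_type": "int",
    "group_size": 128,
    "iters": 1,
    "packing_format": "auto_round:auto_gptq",
    "quant_method": "auto-round",
    "sym": True,
    "extra_config": {"model.decoder.embed_tokens": {"bits": 16, "data_type": "fp"}},
}


def quantized_blocks(cfg):
    """Split the comma-separated block prefixes into a list."""
    return cfg["block_name_to_quantize"].split(",")


def layer_bits(cfg, layer_name):
    """Hypothetical lookup: an exact-name override wins over the global bits."""
    override = cfg.get("extra_config", {}).get(layer_name, {})
    return override.get("bits", cfg["bits"])


print(quantized_blocks(quantization_config))
print(layer_bits(quantization_config, "model.decoder.embed_tokens"))
print(layer_bits(quantization_config, "model.decoder.layers.0.self_attn.q_proj"))
```

Both the encoder and decoder layer stacks are listed as quantized blocks, while `model.decoder.embed_tokens` is kept at 16-bit floating point through `extra_config`.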

chensuyue · 2026-06-18T04:32:39Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-18T04:32:47Z

Azure Pipelines successfully started running 1 pipeline(s).

wenhuach21 · 2026-06-18T05:12:44Z

+
+    if modules_to_not_convert and not user_specified:
+        _DEFAULT_SKIP_KEYWORDS = ("embed", "embed_tokens", "lm_head", "output_embed", "norm")
+        modules_to_not_convert = [


Any large change in this file is risky.

Could you provide more details? From my understanding, everything listed above except lm_head should no longer be a linear layer and therefore should already be handled correctly. As for lm_head, it has its own dedicated logic, so I'm not sure why changes would be needed there.

The original logic is correct for normal models: get_keys_to_not_convert returns only "lm_head/embed" entries, which aren't quantizable linear layers, so a filter would be a no-op. DiffusionGemma breaks that assumption because of one regex entry in the _tied_weights_keys of transformers' "modeling_diffusion_gemma.py":

r"encoder.language_model.layers.(?:[^.]+.)*weight" → r"decoder.layers.(?:[^.]+.)*weight"

This ties every .weight under the encoder layers to the matching decoder weight, i.e. self_attn.q_proj.weight, k_proj.weight, v_proj.weight, o_proj.weight, and the per-expert MLP weights. These are exactly the real nn.Linear layers we want to quantize. Without the filter, essentially every encoder and decoder attention/MLP linear is excluded from quantization, which is a bug. The filter restricts the auto-derived skip list back to "embed/lm_head/norm", which is what get_keys_to_not_convert was meant to return.
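A minimal sketch of the effect described above, in plain Python. It assumes the dots in the quoted pattern are meant literally (they are escaped here), uses a handful of made-up parameter names, and implements the keyword filter as a simple substring test. `filter_skip_list` is a hypothetical helper, not the code in this PR; only the keyword tuple is copied from the diff under review.

```python
import re

# Encoder-side tied-weights pattern quoted above, with dots escaped (an assumption).
ENCODER_TIED = re.compile(r"encoder\.language_model\.layers\.(?:[^.]+\.)*weight")

# Keyword tuple copied from the diff under review.
_DEFAULT_SKIP_KEYWORDS = ("embed", "embed_tokens", "lm_head", "output_embed", "norm")


def filter_skip_list(names, keywords=_DEFAULT_SKIP_KEYWORDS):
    """Keep only skip-list entries that mention one of the default keywords."""
    return [name for name in names if any(key in name for key in keywords)]


# Hypothetical parameter names, for illustration only.
params = [
    "model.encoder.language_model.layers.0.self_attn.q_proj.weight",
    "model.encoder.language_model.layers.0.experts.3.down_proj.weight",
    "model.encoder.language_model.embed_tokens.weight",
    "lm_head.weight",
]

# Which parameters the tied-weights pattern sweeps up.
tied_hits = [p for p in params if ENCODER_TIED.search(p)]

# Pretend the auto-derived skip list contains the owning module of every entry.
auto_skip = [p.rsplit(".", 1)[0] for p in params]
filtered = filter_skip_list(auto_skip)

print(tied_hits)
print(filtered)
```

In this toy example the pattern matches the attention and expert linears, and the keyword filter trims the skip list back to the embedding and lm_head entries, so the real linear layers stay eligible for quantization.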

Do you have a better idea (it needs to be simple) to fix this issue?

wenhuach21 · 2026-06-18T05:15:24Z

Could you use auto-round-best to tune a model and upload it to the Intel space? Thanks!

lvliang-intel · 2026-06-18T06:01:18Z

Could you use auto-round-best to tune a model and upload it to the Intel space? Thanks!

Yes, work in progress.

chensuyue · 2026-06-18T06:43:17Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-18T06:43:27Z

Azure Pipelines successfully started running 1 pipeline(s).

chensuyue · 2026-06-22T02:57:57Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-22T02:58:06Z

Azure Pipelines successfully started running 1 pipeline(s).

lvliang-intel · 2026-06-22T07:02:30Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-06-22T07:02:41Z

Azure Pipelines successfully started running 1 pipeline(s).

lvliang-intel and others added 4 commits June 17, 2026 15:25

Support DiffusionGemma model

20a55ad

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'main' of https://github.com/intel/auto-round into lvl/s…

1350440

…upport_diffusiongemma

add ut

0991c18

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f413a48

for more information, see https://pre-commit.ci

lvliang-intel added 2 commits June 18, 2026 11:46

fix ci

6e666ec

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'lvl/support_diffusiongemma' of https://github.com/intel…

f0c060d

…/auto-round into lvl/support_diffusiongemma

wenhuach21 reviewed Jun 18, 2026

Comment thread auto_round/utils/common.py Outdated

lvliang-intel added 2 commits June 18, 2026 14:01

remove redundant line

e0bad09

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'main' into lvl/support_diffusiongemma

8735287

lvliang-intel added 2 commits June 22, 2026 10:02

Merge branch 'main' of https://github.com/intel/auto-round into lvl/s…

837b17b

…upport_diffusiongemma

Merge branch 'lvl/support_diffusiongemma' of https://github.com/intel…

f2072c3

…/auto-round into lvl/support_diffusiongemma

3 participants