Add quantization support for DiffusionGemma #1935
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
…upport_diffusiongemma
for more information, see https://pre-commit.ci
/azp run Unit-Test-CUDA-AutoRound
Azure Pipelines successfully started running 1 pipeline(s).
Transformers backend test result: test_diffusiongemma_transformers.py
CUDA_VISIBLE_DEVICES=2 python test_diffusiongemma_transformers.py
[1/5] Loading config ...
[2/5] Loading tokenizer ...
[3/5] Loading generation_config ...
[4/5] Loading quantized model onto cuda:0 ...
[5/5] Running a simple Q&A ...
Q: Why is the sky blue? Please answer in one short paragraph.
transformers load + inference succeeded
…/auto-round into lvl/support_diffusiongemma
if modules_to_not_convert and not user_specified:
    _DEFAULT_SKIP_KEYWORDS = ("embed", "embed_tokens", "lm_head", "output_embed", "norm")
    modules_to_not_convert = [
Any large change in this file is risky.
Could you provide more details? From my understanding, everything listed above except lm_head should no longer be a linear layer and therefore should already be handled correctly. As for lm_head, it has its own dedicated logic, so I'm not sure why changes would be needed there.
The original logic is correct for normal models: get_keys_to_not_convert returns only "lm_head"/"embed" entries, which aren't quantizable linear layers, so a filter would be a no-op. DiffusionGemma breaks that assumption because of one regex entry in _tied_weights_keys in transformers' "modeling_diffusion_gemma.py":
r"encoder.language_model.layers.(?:[^.]+.)*weight" → r"decoder.layers.(?:[^.]+.)*weight"
This ties every .weight under the encoder layers to the matching decoder weight, i.e. self_attn.q_proj.weight, k_proj.weight, v_proj.weight, o_proj.weight, and the per-expert MLP weights. These are exactly the real nn.Linear layers we want to quantize, so without the filter essentially every encoder and decoder attention/MLP linear is excluded from quantization, which is a bug. The filter restricts the auto-derived skip list back to "embed/lm_head/norm", which is what get_keys_to_not_convert was meant to return.
Do you have a better idea (it needs to be simple) for fixing this issue?
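To make the failure mode concrete, here is a minimal standalone sketch of the keyword filter described above. It is not the code in this PR: the module names are hypothetical, the helper `filter_skip_list` is invented for illustration, and the regex is the one quoted above with its dots escaped, which is an assumption about the original pattern.

```python
import re

# Tied-weight pattern quoted in the comment, with dots escaped (assumption).
TIED_PATTERN = re.compile(r"encoder\.language_model\.layers\.(?:[^.]+\.)*weight")

# Keyword tuple taken from the diff under review.
DEFAULT_SKIP_KEYWORDS = ("embed", "embed_tokens", "lm_head", "output_embed", "norm")


def filter_skip_list(modules_to_not_convert, keywords=DEFAULT_SKIP_KEYWORDS):
    """Keep only auto-derived skip entries whose name contains a skip keyword.

    Hypothetical helper: it mirrors the idea of the filter, not its exact code.
    """
    return [
        name
        for name in modules_to_not_convert
        if any(keyword in name for keyword in keywords)
    ]


# Hypothetical auto-derived skip list, polluted by the tied attention linears.
auto_derived = [
    "encoder.language_model.embed_tokens",
    "encoder.language_model.layers.0.self_attn.q_proj",
    "encoder.language_model.layers.0.self_attn.o_proj",
    "decoder.layers.0.self_attn.q_proj",
    "lm_head",
]

# The tied pattern matches a real linear weight, which is why it ends up skipped.
tied_hit = (
    TIED_PATTERN.fullmatch("encoder.language_model.layers.0.self_attn.q_proj.weight")
    is not None
)

# After filtering, only the embedding and lm_head entries remain skipped.
filtered = filter_skip_list(auto_derived)
```

Substring matching keeps the filter simple, at the cost of also skipping any module whose name happens to contain one of the keywords.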
Could you use auto-round-best to tune a model and upload it to the Intel space? Thanks!
Yes, WIP.
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
…upport_diffusiongemma
…/auto-round into lvl/support_diffusiongemma
Description
Add quantization support for DiffusionGemma (transformers ≥ 5.11.0). The model declares encoder/decoder MoE gate_up_proj / down_proj as tied across stacks; AutoRound's MoE-unfuse step splits these fused 3D parameters into per-expert linears, which leaves the declared _tied_weights_keys referring to nonexistent parameter names and breaks downstream loading/export.
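The stale-reference problem can be illustrated with a small sketch. Everything here is hypothetical: the parameter names, the tied patterns, and the helper `stale_tied_patterns` are made up for illustration and are not taken from transformers or AutoRound. The sketch only shows how a tied-weight declaration can stop matching any parameter once fused expert tensors are split into per-expert linears.

```python
import re


def stale_tied_patterns(tied_patterns, param_names):
    """Return the tied-weight patterns that match no existing parameter name."""
    return [
        pattern
        for pattern in tied_patterns
        if not any(re.fullmatch(pattern, name) for name in param_names)
    ]


# Hypothetical fused 3D expert parameters, as declared before unfusing.
fused_params = [
    "decoder.layers.0.experts.gate_up_proj",
    "decoder.layers.0.experts.down_proj",
]

# Hypothetical per-expert linears after the MoE-unfuse step.
unfused_params = [
    "decoder.layers.0.experts.0.gate_proj.weight",
    "decoder.layers.0.experts.0.up_proj.weight",
    "decoder.layers.0.experts.0.down_proj.weight",
]

# Hypothetical tied-weight declarations written against the fused names.
tied = [
    r"decoder\.layers\.\d+\.experts\.gate_up_proj",
    r"decoder\.layers\.\d+\.experts\.down_proj",
]

stale_before = stale_tied_patterns(tied, fused_params)
stale_after = stale_tied_patterns(tied, unfused_params)
```

Before unfusing, every declaration resolves to a parameter; afterwards both are stale, which is the kind of mismatch that can break loading or export code that trusts the declared tied keys.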
Type of Change
New feature
Related Issues
None
Checklist Before Submitting
/azp run Unit-Test-CUDA-AutoRound.