Describe the bug
In QwenImagePipeline, when users manually pre-compute prompt embeddings to optimize memory usage (e.g., placing the encoder and transformer on different GPUs), Classifier-Free Guidance (CFG) is disabled, with only a warning, if negative_prompt_embeds_mask is None.
However, encode_prompt explicitly converts an all-ones mask to None as an optimization. This creates a logical contradiction where the pipeline's own encoder returns a valid state (None) that the __call__ method subsequently rejects, causing CFG to fail with a warning.
Reproduction
If you manually extract embeddings and pass them to the pipeline:
# 1. Manually encode prompts
pos_embeds, pos_mask = pipeline.encode_prompt("A photo of a cat")
neg_embeds, neg_mask = pipeline.encode_prompt("bad quality")
# Note: neg_mask is often `None` here because `encode_prompt` optimizes `prompt_embeds_mask.all() -> None`
# 2. Pass them to the pipeline
image = pipeline(
    prompt_embeds=pos_embeds,
    prompt_embeds_mask=pos_mask,
    negative_prompt_embeds=neg_embeds,
    negative_prompt_embeds_mask=neg_mask,  # This passes None
    true_cfg_scale=4.0,
).images[0]
Output Warning:
true_cfg_scale is passed as 4.0, but classifier-free guidance is not enabled since no negative_prompt is provided.
Root Cause Analysis:
In pipeline_qwenimage.py:
1. The has_neg_prompt check in __call__ requires the mask to NOT be None:
has_neg_prompt = negative_prompt is not None or (
    negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
)
2. But encode_prompt intentionally sets the mask to None if it's full:
if prompt_embeds_mask is not None:
    # ... (reshape logic)
    if prompt_embeds_mask.all():
        prompt_embeds_mask = None  # <--- Here!
Because neg_mask becomes None, has_neg_prompt evaluates to False, and do_true_cfg is set to False.
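The interaction can be isolated without loading the model. The sketch below is a plain-Python model of the two snippets above, not the library code: `collapse_full_mask` and `has_neg_prompt_current` are hypothetical names, masks are nested lists of 0/1 instead of tensors, and the embeddings are a placeholder object.

```python
def collapse_full_mask(mask):
    """Model of the encode_prompt optimization: an all-ones mask becomes None."""
    if mask is not None and all(all(row) for row in mask):
        return None
    return mask


def has_neg_prompt_current(negative_prompt, negative_prompt_embeds, negative_prompt_embeds_mask):
    """Model of the has_neg_prompt check quoted from __call__."""
    return negative_prompt is not None or (
        negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
    )


neg_embeds = object()  # placeholder standing in for the embeddings tensor

# Unpadded prompt: the mask is all ones, so it collapses to None.
neg_mask = collapse_full_mask([[1, 1, 1, 1]])
cfg_seen = has_neg_prompt_current(None, neg_embeds, neg_mask)

# Padded prompt: the mask survives, so the same check passes.
padded_mask = collapse_full_mask([[1, 1, 0, 0]])
cfg_seen_padded = has_neg_prompt_current(None, neg_embeds, padded_mask)

print(cfg_seen, cfg_seen_padded)
```

In this model, whether the negative embeddings are detected depends only on whether the mask happens to contain padding, which is unrelated to whether the user asked for CFG.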
Expected behavior:
The presence of negative_prompt_embeds alone should be sufficient to trigger has_neg_prompt = True. The negative_prompt_embeds_mask being None should simply mean "no masking is required" (all valid), which is consistent with the behavior of encode_prompt.
The check in __call__ should probably be relaxed to:
has_neg_prompt = negative_prompt is not None or negative_prompt_embeds is not None
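As a quick sanity check of the relaxed condition, the following sketch evaluates it for the three relevant input combinations. It is an illustration only: `has_neg_prompt_relaxed` is a hypothetical helper name, and the embeddings are a placeholder object, not a real tensor.

```python
def has_neg_prompt_relaxed(negative_prompt, negative_prompt_embeds):
    """Proposed check: the mask no longer takes part in the decision."""
    return negative_prompt is not None or negative_prompt_embeds is not None


neg_embeds = object()  # placeholder standing in for the embeddings tensor

cases = {
    "text prompt only": has_neg_prompt_relaxed("bad quality", None),
    "embeds with mask=None": has_neg_prompt_relaxed(None, neg_embeds),
    "nothing passed": has_neg_prompt_relaxed(None, None),
}

for name, result in cases.items():
    print(f"{name}: {result}")
```

The only case whose outcome differs from the current check is pre-computed embeddings with a None mask. CFG stays disabled when neither a negative prompt nor negative embeddings are supplied.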
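Temporary Workaround:
Users currently have to manually create an all-ones mask before passing the embeddings to the pipeline:
import torch

if neg_mask is None:
    # Rebuild the mask that encode_prompt collapsed to None, on the same device as the embeddings
    neg_mask = torch.ones(neg_embeds.shape[:2], dtype=torch.long, device=neg_embeds.device)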
Logs
System Info
- 🤗 Diffusers version: 0.37.1
- Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.10.20
- PyTorch version (GPU?): 2.11.0+cu130 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 1.8.0
- Transformers version: 5.4.0
- Accelerate version: 1.13.0
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.7.0
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 3060, 12288 MiB
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help?
@yiyixuxu @DN6