Conversation
yiyixuxu
left a comment
Thanks for the PR! I left some feedback.
```python
def __init__(self, hidden_size: int, num_heads: int, ffn_hidden_size: int, eps: float = 1e-6, qk_layernorm: bool = True):
    super().__init__()
    self.adaLN_sa_ln = RMSNorm(hidden_size, eps=eps)
    self.self_attention = Attention(
```
Can you create a custom attention class for ERNIE? See the Flux2 example: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_flux2.py#L493
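A minimal sketch of what such a dedicated class could look like. The name `ErnieImageAttention`, the projection layout, and the use of `LayerNorm` as a stand-in for the PR's `RMSNorm` (so the sketch runs on older torch versions) are all assumptions for illustration, not the actual Flux2 or ERNIE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ErnieImageAttention(nn.Module):
    """Illustrative self-attention block with optional per-head QK normalization.

    The real PR uses RMSNorm; LayerNorm is used here only so the sketch
    runs on torch versions that predate nn.RMSNorm.
    """

    def __init__(self, hidden_size: int, num_heads: int, qk_layernorm: bool = True, eps: float = 1e-6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.to_q = nn.Linear(hidden_size, hidden_size)
        self.to_k = nn.Linear(hidden_size, hidden_size)
        self.to_v = nn.Linear(hidden_size, hidden_size)
        self.to_out = nn.Linear(hidden_size, hidden_size)
        self.norm_q = nn.LayerNorm(self.head_dim, eps=eps) if qk_layernorm else nn.Identity()
        self.norm_k = nn.LayerNorm(self.head_dim, eps=eps) if qk_layernorm else nn.Identity()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, s, _ = hidden_states.shape
        # Project and split into heads: [b, s, h*d] -> [b, h, s, d]
        q = self.to_q(hidden_states).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.to_k(hidden_states).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.to_v(hidden_states).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.norm_q(q), self.norm_k(k)
        out = F.scaled_dot_product_attention(q, k, v)
        # Merge heads back: [b, h, s, d] -> [b, s, h*d]
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.to_out(out)
```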
```python
return x.reshape(B, D, Hp * Wp).transpose(1, 2).contiguous()
```
```python
class TimestepEmbedding(nn.Module):
```
Is this the same as our TimestepEmbedding? Should we reuse it?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L1261
```python
return ErnieImageTransformer2DModelOutput(sample=output) if return_dict else (output,)
```
```python
def _pad_text(self, text_hiddens: List[torch.Tensor], device: torch.device, dtype: torch.dtype):
```
Ohh, we are padding the text embeddings here. Is it possible to move this outside of the model, into the pipeline? E.g. you could pass image_ids, text_ids, and text_seq_lens instead.
I think padding the text embeddings inside the transformer would affect torch.compile too.
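As a sketch, the padding could live in the pipeline as a small standalone helper. The name `pad_text_embeds` and the zero-padding behavior are assumptions about what the model-side `_pad_text` currently does:

```python
from typing import List, Tuple

import torch


def pad_text_embeds(text_hiddens: List[torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
    """Pad a list of [seq_len_i, dim] prompt embeddings to one
    [batch, max_seq_len, dim] tensor, returning the true lengths so the
    transformer can build its attention mask without re-padding."""
    text_seq_lens = torch.tensor([t.shape[0] for t in text_hiddens])
    max_len = int(text_seq_lens.max())
    dim = text_hiddens[0].shape[-1]
    # Zero-pad each sequence up to the batch maximum length.
    padded = text_hiddens[0].new_zeros(len(text_hiddens), max_len, dim)
    for i, t in enumerate(text_hiddens):
        padded[i, : t.shape[0]] = t
    return padded, text_seq_lens
```

With a fixed-shape padded tensor and explicit `text_seq_lens` passed in, the transformer's forward stays shape-stable, which is friendlier to torch.compile.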
```python
return out.float()
```
```python
class EmbedND3(nn.Module):
```
```suggestion
class ErnieImageEmbedND3(nn.Module):
```
Can we follow our naming conventions and add the ErnieImage prefix everywhere?
```python
self.vae_scale_factor = 16  # VAE downsample factor
```
```python
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
```
Why did you write a custom from_pretrained method here? Is there any reason you could not use the from_pretrained inherited from DiffusionPipeline?
```python
if hasattr(self.pe, "_hf_hook") and hasattr(self.pe._hf_hook, "execution_device"):
    pe_device = self.pe._hf_hook.execution_device
else:
    pe_device = device
```
```suggestion
pe_device = device or self._execution_device
```
This is basically self._execution_device, no? https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_utils.py#L1136
```python
text_hiddens = self.encode_prompt(prompt, device, num_images_per_prompt)

# CFG with negative prompt
do_cfg = guidance_scale > 1.0
```
Can we add a do_classifier_free_guidance property instead?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py#L590
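A minimal sketch of such a property. The `_guidance_scale` attribute name follows the pattern used in other diffusers pipelines, but how and where it is set here is an assumption; the stand-in class is only for illustration:

```python
class ErnieImagePipeline:  # illustrative stand-in for the real pipeline class
    def __init__(self, guidance_scale: float = 1.0):
        # In the real pipeline this would be set at the start of __call__.
        self._guidance_scale = guidance_scale

    @property
    def do_classifier_free_guidance(self) -> bool:
        # CFG is only active when the scale actually amplifies the
        # conditional branch over the unconditional one.
        return self._guidance_scale > 1.0
```

The property keeps the CFG decision in one place, so call sites read `if self.do_classifier_free_guidance:` instead of recomputing `guidance_scale > 1.0` locally.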
```python
# CFG with negative prompt
do_cfg = guidance_scale > 1.0
if do_cfg:
```
```suggestion
if self.do_classifier_free_guidance:
```
What does this PR do?
We have introduced a new text-to-image model called ERNIE-Image, which will soon be open-sourced to the community. This PR includes the model architecture definition, the pipeline, as well as the related documentation and test files.
Before submitting
See the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.