Conversation
yiyixuxu
left a comment
Thanks for the PR! I left some feedback.
```python
def __init__(self, hidden_size: int, num_heads: int, ffn_hidden_size: int, eps: float = 1e-6, qk_layernorm: bool = True):
    super().__init__()
    self.adaLN_sa_ln = RMSNorm(hidden_size, eps=eps)
    self.self_attention = Attention(
```
Can you create a custom attention class for ERNIE? See the Flux2 example: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_flux2.py#L493
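A minimal sketch of what such a dedicated class could look like. The name `ErnieImageAttention`, the projection layout, and the use of `LayerNorm` as a stand-in for the PR's `RMSNorm` (so the sketch runs on older torch versions) are all assumptions for illustration, not the actual Flux2 or ERNIE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ErnieImageAttention(nn.Module):
    """Illustrative self-attention block with optional per-head QK normalization.

    The real PR uses RMSNorm; LayerNorm is used here only so the sketch
    runs on torch versions that predate nn.RMSNorm.
    """

    def __init__(self, hidden_size: int, num_heads: int, qk_layernorm: bool = True, eps: float = 1e-6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.to_q = nn.Linear(hidden_size, hidden_size)
        self.to_k = nn.Linear(hidden_size, hidden_size)
        self.to_v = nn.Linear(hidden_size, hidden_size)
        self.to_out = nn.Linear(hidden_size, hidden_size)
        self.norm_q = nn.LayerNorm(self.head_dim, eps=eps) if qk_layernorm else nn.Identity()
        self.norm_k = nn.LayerNorm(self.head_dim, eps=eps) if qk_layernorm else nn.Identity()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, s, _ = hidden_states.shape
        # Project and split into heads: [b, s, h*d] -> [b, h, s, d]
        q = self.to_q(hidden_states).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.to_k(hidden_states).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.to_v(hidden_states).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.norm_q(q), self.norm_k(k)
        out = F.scaled_dot_product_attention(q, k, v)
        # Merge heads back: [b, h, s, d] -> [b, s, h*d]
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.to_out(out)
```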
```python
return x.reshape(B, D, Hp * Wp).transpose(1, 2).contiguous()
```
```python
class TimestepEmbedding(nn.Module):
```
Is this the same as our TimestepEmbedding? Should we reuse it?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L1261
```python
return ErnieImageTransformer2DModelOutput(sample=output) if return_dict else (output,)
```
```python
def _pad_text(self, text_hiddens: List[torch.Tensor], device: torch.device, dtype: torch.dtype):
```
Ohh, we are padding the text embeddings here. Is it possible to move this outside of the model, into the pipeline? E.g. you could pass image_ids, text_ids, and text_seq_lens instead.
I think padding the text embeddings inside the transformer would affect torch.compile too.
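As a sketch, the padding could live in the pipeline as a small standalone helper. The name `pad_text_embeds` and the zero-padding behavior are assumptions about what the model-side `_pad_text` currently does:

```python
from typing import List, Tuple

import torch


def pad_text_embeds(text_hiddens: List[torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
    """Pad a list of [seq_len_i, dim] prompt embeddings to one
    [batch, max_seq_len, dim] tensor, returning the true lengths so the
    transformer can build its attention mask without re-padding."""
    text_seq_lens = torch.tensor([t.shape[0] for t in text_hiddens])
    max_len = int(text_seq_lens.max())
    dim = text_hiddens[0].shape[-1]
    # Zero-pad each sequence up to the batch maximum length.
    padded = text_hiddens[0].new_zeros(len(text_hiddens), max_len, dim)
    for i, t in enumerate(text_hiddens):
        padded[i, : t.shape[0]] = t
    return padded, text_seq_lens
```

With a fixed-shape padded tensor and explicit `text_seq_lens` passed in, the transformer's forward stays shape-stable, which is friendlier to torch.compile.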
```python
return out.float()
```
```python
class EmbedND3(nn.Module):
```
```suggestion
class ErnieImageEmbedND3(nn.Module):
```
Can we follow our naming conventions and add the ErnieImage prefix everywhere?
```python
self.vae_scale_factor = 16  # VAE downsample factor
```
```python
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
```
Why did you write a custom from_pretrained method here? Is there any reason you could not use the from_pretrained inherited from DiffusionPipeline?
```python
if hasattr(self.pe, "_hf_hook") and hasattr(self.pe._hf_hook, "execution_device"):
    pe_device = self.pe._hf_hook.execution_device
else:
    pe_device = device
```
```suggestion
pe_device = device or self._execution_device
```
This is basically self._execution_device, no? https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_utils.py#L1136
```python
text_hiddens = self.encode_prompt(prompt, device, num_images_per_prompt)

# CFG with negative prompt
do_cfg = guidance_scale > 1.0
```
Can we add a do_classifier_free_guidance property instead?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py#L590
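A minimal sketch of such a property. The `_guidance_scale` attribute name follows the pattern used in other diffusers pipelines, but how and where it is set here is an assumption; the stand-in class is only for illustration:

```python
class ErnieImagePipeline:  # illustrative stand-in for the real pipeline class
    def __init__(self, guidance_scale: float = 1.0):
        # In the real pipeline this would be set at the start of __call__.
        self._guidance_scale = guidance_scale

    @property
    def do_classifier_free_guidance(self) -> bool:
        # CFG is only active when the scale actually amplifies the
        # conditional branch over the unconditional one.
        return self._guidance_scale > 1.0
```

The property keeps the CFG decision in one place, so call sites read `if self.do_classifier_free_guidance:` instead of recomputing `guidance_scale > 1.0` locally.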
```python
# CFG with negative prompt
do_cfg = guidance_scale > 1.0
if do_cfg:
```
```suggestion
if self.do_classifier_free_guidance:
```
What does this PR do?
We have introduced a new text-to-image model called ERNIE-Image, which will soon be open-sourced to the community. This PR includes the model architecture definition, the pipeline, as well as the related documentation and test files.
Before submitting
See the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.