
Cosmos Transfer2.5 Auto-Regressive Inference Pipeline #13114

Open
miguelmartin75 wants to merge 3 commits into huggingface:main from miguelmartin75:cosmos/transfer2.5-ar

Conversation

Contributor

@miguelmartin75 miguelmartin75 commented Feb 10, 2026

What does this PR do?

This builds off #13066 by adding auto-regressive inference for Cosmos Transfer2.5. This pipeline does not require the controlnet or controls to be input. From the documentation:

The call function can be used in two modes: with or without controls.
When controls are not provided (controls is None), inference works in the same manner as predict2.5 (see
Cosmos2_5_PredictPipeline). This mode strictly uses the base transformer (self.transformer) to perform
inference and accepts as input an optional image or video along with a prompt / negative_prompt, and
can be used in the following ways:
- Text2World: image=None, video=None, prompt provided.
- Image2World: image provided, video=None, prompt provided.
- Video2World: video provided, image=None, prompt provided.
When controls are provided and a ControlNet is attached, controls drive the conditioning and video and
image are ignored. Controls are assumed to be pre-processed, e.g. edge maps are pre-computed.
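The mode rules above can be sketched as a small dispatch function. This is only an illustration of the documented behavior: `select_mode` is a hypothetical helper that is not part of the PR, and raising an error when both `image` and `video` are passed is an assumption, since the description only lists the cases where at most one of them is set.

```python
from typing import Any, Optional


def select_mode(
    image: Optional[Any] = None,
    video: Optional[Any] = None,
    controls: Optional[Any] = None,
) -> str:
    """Hypothetical helper: name the inference mode implied by the inputs."""
    if controls is not None:
        # Controls drive the conditioning; image and video are ignored.
        return "transfer"
    if image is not None and video is not None:
        # Assumption: the description does not say what happens here.
        raise ValueError("Provide at most one of `image` or `video`.")
    if image is not None:
        return "image2world"
    if video is not None:
        return "video2world"
    # Neither image nor video: generation is driven by the prompt alone.
    return "text2world"
```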
Setting num_frames restricts the total number of output frames. If it is not provided or is set to None
(default), the number of output frames matches the input video, image or controls respectively.
Auto-regressive inference is supported: a sliding window of num_frames_per_chunk frames is used per
denoising loop. When auto-regressive inference is performed, the previous
num_latent_conditional_frames or num_conditional_frames frames are used to condition the following
denoising loops.
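To make the sliding window concrete, here is a minimal sketch of how chunk boundaries could be computed. It assumes each chunk after the first re-uses the last `num_conditional_frames` frames of the previous chunk, so consecutive windows overlap by that amount. `chunk_windows` is a hypothetical helper, not the PR's implementation, and it does not pad a short final chunk, which the real pipeline may handle differently.

```python
from typing import List, Tuple


def chunk_windows(
    total_frames: int,
    num_frames_per_chunk: int,
    num_conditional_frames: int,
) -> List[Tuple[int, int]]:
    """Hypothetical helper: return [start, end) frame ranges, one per denoising loop."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= num_conditional_frames < num_frames_per_chunk:
        raise ValueError("need 0 <= num_conditional_frames < num_frames_per_chunk")

    # Assumption: each new chunk starts on the frames that condition it,
    # so the window advances by the number of newly generated frames.
    stride = num_frames_per_chunk - num_conditional_frames
    windows = []
    start = 0
    while True:
        end = min(start + num_frames_per_chunk, total_frames)
        windows.append((start, end))
        if end >= total_frames:
            break
        start += stride
    return windows


# 10 frames, 5 per chunk, 1 conditioning frame carried over between chunks.
print(chunk_windows(10, 5, 1))  # [(0, 5), (4, 9), (8, 10)]
```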

Who can review?

@miguelmartin75 miguelmartin75 force-pushed the cosmos/transfer2.5-ar branch 3 times, most recently from 775f4b8 to 0d0eeae Compare February 12, 2026 22:19
Collaborator

@yiyixuxu yiyixuxu left a comment


thanks for the PR!

my main question is would it make sense to make this pipeline strictly ControlNet-focused? looking at the pipeline code, this would simplify the pipeline quite a bit

self,
image: PipelineImageInput | None = None,
video: List[PipelineImageInput] | None = None,
controls: Optional[PipelineImageInput | List[PipelineImageInput]] = None,
Collaborator


maybe we should make this pipeline strictly about ControlNet (i.e. not make it optional) and then remove the image and video arguments? this is how the other ControlNet pipelines behave anyway
if users want to run without a ControlNet, they can switch to the base pipeline

else:
width = int((height + 16) * (frame.shape[2] / frame.shape[1])) # NOTE: assuming C H W

if num_latent_conditional_frames is not None and num_conditional_frames is not None:
Collaborator


any reason we need two arguments here? is it possible we only keep num_conditional_frames?

Contributor Author


this is done so the user can provide either one: the official GitHub repo uses num_conditional_frames for transfer but num_latent_conditional_frames for predict, so I figured it would be best to support both
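For reference, one way to accept either argument while rejecting both is sketched below. This is not the PR's code: `resolve_latent_conditional_frames` is a hypothetical helper, and the pixel-to-latent conversion assumes a causal video VAE with a temporal compression factor of 4, where the first frame maps to one latent frame. Both the factor and the formula are assumptions to check against the actual VAE config.

```python
from typing import Optional


def resolve_latent_conditional_frames(
    num_conditional_frames: Optional[int] = None,
    num_latent_conditional_frames: Optional[int] = None,
    temporal_compression: int = 4,  # assumption: not taken from the PR
) -> Optional[int]:
    """Hypothetical helper: normalize both arguments to a latent-frame count."""
    if num_conditional_frames is not None and num_latent_conditional_frames is not None:
        # Mirrors the check in the diff: the two arguments are mutually exclusive.
        raise ValueError(
            "Provide only one of `num_conditional_frames` or `num_latent_conditional_frames`."
        )
    if num_latent_conditional_frames is not None:
        return num_latent_conditional_frames
    if num_conditional_frames is None:
        # Neither was given: leave the choice of default to the caller.
        return None
    if num_conditional_frames == 0:
        return 0
    # Assumed causal-VAE mapping: 1 frame -> 1 latent, then 1 latent per
    # `temporal_compression` additional frames.
    return 1 + (num_conditional_frames - 1) // temporal_compression
```

Keeping a single internal quantity means the rest of the pipeline only has to reason about latent frames, whichever argument the caller used.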

@sayakpaul sayakpaul requested a review from DN6 February 20, 2026 04:36