Clarification on how OCR annotations are used during training

Hi, thank you for releasing this excellent work.

While reading the paper, I found one point unclear: how the OCR annotations are actually incorporated into training.

From the paper, I understand the following:

- PaddleOCR is applied to images from OBELICS and Zero250M
- The recognized text is tokenized
- 100 fine-grained tags are constructed for each image
- OCR data is introduced in Stage 2 together with video supervision

However, the paper does not seem to explicitly describe how these OCR-derived tags enter the training objective.
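
To make the question concrete, here is one reading I could imagine. This is purely my guess and not something the paper states: the OCR-derived tags of an image are treated as positives in a multi-label objective over a tag vocabulary. The function names, the vocabulary indexing, and the choice of binary cross-entropy are all my own assumptions for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def multilabel_tag_loss(logits, positive_tag_ids):
    """Mean binary cross-entropy over a tag vocabulary (hypothetical objective).

    logits: one raw score per tag in the vocabulary (assumed model output).
    positive_tag_ids: vocabulary indices of the OCR-derived tags for one image.
    """
    if not logits:
        raise ValueError("logits must contain at least one score")
    positives = set(positive_tag_ids)
    total = 0.0
    for tag_id, z in enumerate(logits):
        # -log(sigmoid(z)) == softplus(-z) for a positive tag,
        # -log(1 - sigmoid(z)) == softplus(z) for a negative tag.
        total += softplus(-z) if tag_id in positives else softplus(z)
    return total / len(logits)


# Toy example: a 4-tag vocabulary where tags 0 and 2 were found by OCR.
loss = multilabel_tag_loss([2.0, -1.0, 0.5, -3.0], positive_tag_ids=[0, 2])
```

Is the actual objective something along these lines, or are the tags used differently, for example as classification targets with a margin-based softmax or as text for a contrastive loss?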

Clarification on how OCR annotations are used during training #105
