Logical conflict between data loading and collation. #1319

@wenqibiao

In _load_ultrachat_conversations, the messages are constructed using only the user role:
msgs = [{"role": "user", "content": prompt}]

However, LanguageDataCollator.__call__ unconditionally skips any sample that has no assistant turn:

if not any(m.get("role") == "assistant" for m in messages):
    continue

As a result, every sample loaded this way appears to be skipped during training, because the pre-processed data contains no assistant responses. Is that right?
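
To make the concern concrete, here is a minimal sketch of the interaction as I understand it. It does not call the real code: `load_prompt_only` and `filter_like_collator` are hypothetical stand-ins that copy only the two snippets quoted above, so the actual loader or collator may do more than this.

```python
def load_prompt_only(prompts):
    """Hypothetical stand-in for the loader: each conversation has a user turn only."""
    return [[{"role": "user", "content": prompt}] for prompt in prompts]


def filter_like_collator(conversations):
    """Hypothetical stand-in for the collator check: drop conversations without an assistant turn."""
    kept = []
    for messages in conversations:
        if not any(m.get("role") == "assistant" for m in messages):
            continue  # same condition as the check quoted above
        kept.append(messages)
    return kept


conversations = load_prompt_only(["What is 2 + 2?", "Name a prime number."])
kept = filter_like_collator(conversations)
print(f"{len(kept)} of {len(conversations)} samples survive")  # prints "0 of 2 samples survive"
```

If the real code paths behave like these stand-ins, nothing from this loader would reach the training batch.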
