[ML] Randomize train/test cluster boundary assignment in RDataLoader #22196

Open
siliataider wants to merge 2 commits into root-project:master from siliataider:rdataloader

@siliataider (Contributor) commented May 8, 2026

This pull request:

Previously, RDataLoader always assigned the leading portion of each cluster to training and the trailing portion to validation. As a result, the train/validation split was identical across runs, regardless of the seed.

This PR fixes the issue by using the shuffle seed to randomly decide, per cluster, whether training takes the prefix or suffix of that cluster.
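
As a rough illustration of the idea (not the actual RClusterLoader code), the per-cluster decision could be sketched in Python as follows. The function name `split_clusters`, its parameters, and the 50/50 coin flip are assumptions made for this example only.

```python
import random


def split_clusters(cluster_sizes, validation_fraction, seed):
    """Hypothetical helper: split each cluster into train/validation entry ranges.

    Returns two lists of half-open (start, end) entry ranges, one pair per
    cluster. The seed decides, per cluster, whether training takes the
    prefix or the suffix of that cluster.
    """
    rng = random.Random(seed)  # seeded generator, so a given seed is reproducible
    train_ranges, val_ranges = [], []
    start = 0
    for size in cluster_sizes:
        end = start + size
        n_val = int(size * validation_fraction)
        n_train = size - n_val
        if rng.random() < 0.5:
            # Training takes the prefix, validation the suffix.
            train_ranges.append((start, start + n_train))
            val_ranges.append((start + n_train, end))
        else:
            # Validation takes the prefix, training the suffix.
            val_ranges.append((start, start + n_val))
            train_ranges.append((start + n_val, end))
        start = end
    return train_ranges, val_ranges
```

Calling `split_clusters([100, 100, 100], 0.25, seed=42)` twice gives the same ranges, while a different seed will generally flip the prefix/suffix choice for some clusters. The split sizes per cluster stay the same in either case.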

This PR fixes #22194

@siliataider siliataider requested a review from vepadulano as a code owner May 8, 2026 16:17
@siliataider siliataider self-assigned this May 8, 2026
@siliataider siliataider added the in:ML Everything under ROOT/ML label May 8, 2026
github-actions (Bot) commented May 8, 2026

Test Results

    20 files      20 suites   2d 23h 17m 56s ⏱️
 3 829 tests  3 829 ✅ 0 💤 0 ❌
69 072 runs  69 072 ✅ 0 💤 0 ❌

Results for commit 87db451.

♻️ This comment has been updated with latest results.

@vepadulano (Member) left a comment

Nice! See a few minor comments from my side.

Comment thread bindings/pyroot/pythonizations/test/ml_dataloader.py Outdated
Comment thread tree/ml/inc/ROOT/ML/RClusterLoader.hxx Outdated

Labels

in:ML Everything under ROOT/ML

Development

Successfully merging this pull request may close these issues.

RDataLoader results in same train/test split across runs

2 participants