
feat: Use HF datasets for data logic#33

Open
Pringled wants to merge 10 commits into main from update-dataset-handling
Conversation

@Pringled (Member)

No description provided.

@Pringled Pringled requested a review from stephantul March 22, 2026 14:23
@stephantul (Contributor) left a comment:

Lots of small things, nothing major. Most of it relates to prior things we could fix.


Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

## Citation
Contributor:

Maybe we can add a Zenodo citation for the software as well, if that's possible? (Not this PR, just future.)

{"text": texts, "embedding": [e.tolist() for e in embeddings]},
features=_FEATURES,
)
part.save_to_disk(str(checkpoints_dir / f"part_{part_idx:08d}"))
Contributor:

Regarding saving: you can save this directly as a Parquet file, e.g.,

shard_00000.parquet
shard_00001.parquet

To do this, you can apparently first force it into a single shard, and then save it to parquet:

single_shard = ds.shard(num_shards=1, index=0)
single_shard.to_parquet(f"shard_{part_idx:08d}.parquet")

Note that this only works for Dataset objects, not DatasetDicts.

part.save_to_disk(str(checkpoints_dir / f"part_{part_idx:08d}"))


def _compact_checkpoints(checkpoints_dir: Path, output_dir: Path, keep_checkpoints: bool) -> None:
Contributor:

If you do the above, you'd just need to write the metadata. But I think you can get away with not writing any metadata tbh.

@@ -53,7 +124,7 @@
if i * batch_size >= max_means:
Contributor:

Maybe rename this to max_rows, or call the other variable means_done (I prefer renaming this line).

logger.info(f"Reached maximum number of means: {max_means}")
break
if largest_batch and i <= largest_batch:
if i * batch_size < rows_done:
Contributor:

You compute i * batch_size twice.
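A sketch of hoisting the repeated expression into a local, with placeholder values for batch_size, max_means, and rows_done (the rows_seen name is illustrative, not from the PR):

```python
# Illustrative loop: compute i * batch_size once per iteration and reuse it.
batch_size, max_means, rows_done = 4, 12, 8  # placeholder values

processed = []
for i in range(100):
    rows_seen = i * batch_size  # hoisted: computed once, used twice below
    if rows_seen >= max_means:
        break
    if rows_seen < rows_done:
        continue  # this batch was already covered by an earlier checkpoint
    processed.append(i)
print(processed)  # [2]
```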

json.dump(texts, open(output_dir_path / f"feature_{i}.json", "w"), indent=4)
np.save(output_dir_path / f"feature_{i}.npy", embeddings)
_save_checkpoint(checkpoints_dir, texts, embeddings, part_idx)
part_idx += 1
Contributor:

part_idx is necessarily equal to i // _SAVE_EVERY, so no need to bookkeep it here.
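The invariant the comment describes can be checked directly (the _SAVE_EVERY value is illustrative, and the increment-after-save loop mirrors the PR's bookkeeping as a sketch, not verbatim):

```python
_SAVE_EVERY = 3  # illustrative value: batches per checkpoint part

# The part a batch belongs to is derivable from its index, so no
# separate part_idx counter is needed:
parts = [i // _SAVE_EVERY for i in range(9)]
print(parts)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Equivalent to bookkeeping part_idx by incrementing at each save point:
part_idx, bookkept = 0, []
for i in range(9):
    bookkept.append(part_idx)
    if (i + 1) % _SAVE_EVERY == 0:  # a save happens every _SAVE_EVERY batches
        part_idx += 1
assert bookkept == parts
```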

Contributor:

This is kind of random: if someone switched batch size after resuming, the resume logic itself would still work, but you could no longer derive part_idx from i. So the above holds, except when people switch batch size.

Relying on i * batch_size is therefore a bit brittle. What I would suggest instead:

Reinterpret _SAVE_EVERY as a number of items. That way, your shard size no longer depends on the batch size. Right now it's actually a bit weird that we produce much smaller shards for smaller batch sizes. (This still wouldn't solve the resume problem, though.)
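A sketch of the item-based reinterpretation (SAVE_EVERY_ITEMS and shard_index are hypothetical names): the shard an item falls into depends only on the item count, so the shard layout survives a batch-size change.

```python
SAVE_EVERY_ITEMS = 1000  # hypothetical: shard size measured in items, not batches


def shard_index(items_done: int) -> int:
    # Depends only on how many items were processed, not on batch size.
    return items_done // SAVE_EVERY_ITEMS


# 2560 items land in shard 2 whether they arrived in batches of 64 or 256:
assert shard_index(64 * 40) == shard_index(256 * 10) == 2
print(shard_index(2560))  # 2
```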

model.max_seq_length = max_length
logger.info(f"Set tokenizer maximum length to {max_length}.")
# Binding i in case the dataset is empty.
i = 0
Contributor:

not sure if this is necessary any more (not part of this PR, I know, sorry!)



2 participants