
feat: Use HF datasets for data logic#33

Open
Pringled wants to merge 10 commits into main from update-dataset-handling
Conversation

@Pringled (Member)

No description provided.

@Pringled Pringled requested a review from stephantul March 22, 2026 14:23
@stephantul (Contributor) left a comment:

Lots of small things, nothing major. Most of it relates to prior things we could fix.


Tokenlearn was developed by the [Minish](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).

## Citation
Contributor:

Maybe we can add a Zenodo citation for the software as well, if that's possible? (Not this PR, just future.)

{"text": texts, "embedding": [e.tolist() for e in embeddings]},
features=_FEATURES,
)
part.save_to_disk(str(checkpoints_dir / f"part_{part_idx:08d}"))
Contributor:

Regarding saving: you can save this directly as a Parquet file, e.g.,

shard_00000.parquet
shard_00001.parquet

To do this, you can apparently first force it into a single shard, and then save it to parquet:

single_shard = ds.shard(num_shards=1, index=0)
single_shard.to_parquet(f"shard_{part_idx:08d}.parquet")

Note that this only works for Dataset objects, not DatasetDicts.

part.save_to_disk(str(checkpoints_dir / f"part_{part_idx:08d}"))


def _compact_checkpoints(checkpoints_dir: Path, output_dir: Path, keep_checkpoints: bool) -> None:
Contributor:

If you do the above, you'd just need to write the metadata. But I think you can get away with not writing any metadata tbh.

@@ -53,7 +124,7 @@
if i * batch_size >= max_means:
Contributor:

Maybe rename this to max_rows, or call the other variable means_done (I prefer renaming this line).

logger.info(f"Reached maximum number of means: {max_means}")
break
if largest_batch and i <= largest_batch:
if i * batch_size < rows_done:
Contributor:

You compute i * batch_size twice.
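A sketch of hoisting the repeated expression into a local, with placeholder values for batch_size, max_means, and rows_done (the rows_seen name is illustrative, not from the PR):

```python
# Illustrative loop: compute i * batch_size once per iteration and reuse it.
batch_size, max_means, rows_done = 4, 12, 8  # placeholder values

processed = []
for i in range(100):
    rows_seen = i * batch_size  # hoisted: computed once, used twice below
    if rows_seen >= max_means:
        break
    if rows_seen < rows_done:
        continue  # this batch was already covered by an earlier checkpoint
    processed.append(i)
print(processed)  # [2]
```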

json.dump(texts, open(output_dir_path / f"feature_{i}.json", "w"), indent=4)
np.save(output_dir_path / f"feature_{i}.npy", embeddings)
_save_checkpoint(checkpoints_dir, texts, embeddings, part_idx)
part_idx += 1
Contributor:

part_idx is necessarily equal to i // _SAVE_EVERY, so no need to bookkeep it here.
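The invariant the comment describes can be checked directly (the _SAVE_EVERY value is illustrative, and the increment-after-save loop mirrors the PR's bookkeeping as a sketch, not verbatim):

```python
_SAVE_EVERY = 3  # illustrative value: batches per checkpoint part

# The part a batch belongs to is derivable from its index, so no
# separate part_idx counter is needed:
parts = [i // _SAVE_EVERY for i in range(9)]
print(parts)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Equivalent to bookkeeping part_idx by incrementing at each save point:
part_idx, bookkept = 0, []
for i in range(9):
    bookkept.append(part_idx)
    if (i + 1) % _SAVE_EVERY == 0:  # a save happens every _SAVE_EVERY batches
        part_idx += 1
assert bookkept == parts
```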

Contributor:

This is kind of random: if someone switched batch size after resuming, the resume logic itself would still work, but you could no longer derive part_idx from i. So the above holds, except when people switch batch size.

Relying on i * batch_size is therefore a bit brittle. What I would suggest instead:

Reinterpret _SAVE_EVERY as a number of items. That way, your shard size no longer depends on the batch size. Right now it's actually a bit weird that we produce much smaller shards for smaller batch sizes. (This still wouldn't solve the resume problem, though.)
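A sketch of the item-based reinterpretation (SAVE_EVERY_ITEMS and shard_index are hypothetical names): the shard an item falls into depends only on the item count, so the shard layout survives a batch-size change.

```python
SAVE_EVERY_ITEMS = 1000  # hypothetical: shard size measured in items, not batches


def shard_index(items_done: int) -> int:
    # Depends only on how many items were processed, not on batch size.
    return items_done // SAVE_EVERY_ITEMS


# 2560 items land in shard 2 whether they arrived in batches of 64 or 256:
assert shard_index(64 * 40) == shard_index(256 * 10) == 2
print(shard_index(2560))  # 2
```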

model.max_seq_length = max_length
logger.info(f"Set tokenizer maximum length to {max_length}.")
# Binding i in case the dataset is empty.
i = 0
Contributor:

not sure if this is necessary any more (not part of this PR, I know, sorry!)



2 participants