Conversation

Force-pushed from b0e722f to 723ccd7.
- Add a `zarr` packaged module that reads Zarr v3 stores (`zarr.json`) and Zarr v2 consolidated stores (`.zmetadata`)
- Stream Zarr stores via fsspec-compatible backends (including `hf://`)
- Add unit tests and a small benchmark script
- Document Zarr streaming in the streaming guide
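For context, the two supported layouts differ only in which root metadata file marks the store: Zarr v3 uses `zarr.json`, Zarr v2 consolidated uses `.zmetadata`. A minimal sketch of telling them apart (the helper name `zarr_store_format` is hypothetical, not part of this PR):

```python
import posixpath
from typing import Optional


# Hypothetical helper (not from the PR): classify a Zarr store by its root
# metadata file. Zarr v3 stores carry a `zarr.json` at the root, while Zarr v2
# consolidated stores carry a `.zmetadata` file.
def zarr_store_format(metadata_path: str) -> Optional[str]:
    name = posixpath.basename(metadata_path.rstrip("/"))
    if name == "zarr.json":
        return "v3"
    if name == ".zmetadata":
        return "v2-consolidated"
    return None
```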
Force-pushed from 723ccd7 to 082567a.
Hi! It looks like the GitHub Actions check suites for this PR are awaiting approval. Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.
lhoestq left a comment:
Great PR ! It looks mostly good to me, I added a few suggestions.
Btw the current implementation for streaming returns a StreamingIterableDataset with num_shards=1, which corresponds to 1 metadata file.
For large datasets it's maybe more practical to have more fine-grained sharding, e.g. at the data file level. Wdyt ?
```python
try:
    from fsspec.core import url_to_fs
```
fsspec is a dependency of datasets so no need to check for this :)
Good point, removed the defensive fsspec check since it’s a required dependency.
Fixed in 7092e53.
```python
# Zarr stores are directory-based; users typically pass the root metadata file (Zarr v3: `zarr.json`,
# Zarr v2 consolidated: `.zmetadata`) explicitly via `data_files`.
".zarr": ("zarr", {}),
```
let's add zarr.json to METADATA_FILENAMES and .zmetadata to METADATA_EXTENSIONS, this way no need to explicitly pass the root metadata file via data_files - they will be auto-included in data_files :)
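A rough sketch of what that auto-inclusion rule could look like (the constant names mirror the reviewer's suggestion; the filter function itself is hypothetical, not the library's actual resolution code):

```python
# Sketch only: register the Zarr root metadata files so data-file resolution
# can pick them up automatically, without the user passing them via data_files.
METADATA_FILENAMES = {"zarr.json"}      # exact-name matches (Zarr v3 root)
METADATA_EXTENSIONS = {".zmetadata"}    # suffix matches (Zarr v2 consolidated)


def is_zarr_metadata_file(path: str) -> bool:
    # Hypothetical filter: check the final path component against both sets.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name in METADATA_FILENAMES or any(name.endswith(ext) for ext in METADATA_EXTENSIONS)
```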
```md
'language_score': 0.9900368452072144, 'token_count': 716}

## Streaming scientific formats (HDF5 and Zarr)
```
this doesn't feel like a section dedicated to streaming in general
maybe let's have a new dedicated set of pages in the docs about scientific data ? in addition to the existing ones 'audio', 'vision', 'text' and 'tabular' ?
Agreed, I moved the HDF5/Zarr content out of the streaming guide into a dedicated “Scientific data” docs page and linked to it from stream.mdx.
Fixed in 7092e53.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the review. I agree on sharding: `num_shards` currently tracks the number of input Zarr stores, so it is often 1, which is not ideal for large datasets.

I can implement finer-grained sharding using row-range shards aligned to axis-0 chunk boundaries (instead of one shard per metadata file), with a config knob like `rows_per_shard` / `target_num_shards`. I can add this in this PR if you want it before merge, or do it as a focused follow-up PR right after.
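The chunk-aligned sharding idea above can be sketched as follows (the function and parameter names, like `target_num_shards`, are illustrative rather than the final API):

```python
import math


# Illustrative sketch: split `num_rows` rows into shards whose boundaries fall
# on axis-0 chunk boundaries, aiming for roughly `target_num_shards` shards.
def row_range_shards(num_rows, chunk_rows, target_num_shards):
    total_chunks = math.ceil(num_rows / chunk_rows)
    chunks_per_shard = max(1, math.ceil(total_chunks / target_num_shards))
    shards = []
    start = 0
    while start < num_rows:
        # Each shard covers a whole number of chunks, so no chunk is read twice.
        end = min(start + chunks_per_shard * chunk_rows, num_rows)
        shards.append((start, end))
        start = end
    return shards
```

For example, 100 rows in 10-row chunks with a target of 4 shards yields four row ranges of 30, 30, 30 and 10 rows, each aligned to the chunk grid.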
```python
def _generate_shards(self, metadata_files, storage_options):
    yield from metadata_files
```
Could you also implement _generate_num_examples before merging ?
This will be useful to know how many rows a dataset has and show it e.g. on the HF website
Suggested change:

```diff
 def _generate_shards(self, metadata_files, storage_options):
     yield from metadata_files
+
+def _generate_num_examples(self, metadata_files, storage_options):
+    ...
```
as an example here is how it's implemented for lance:
```python
def _generate_num_examples(
    self,
    fragments: Optional[List["lance.LanceFragment"]],
    lance_files_paths: Optional[list[str]],
    lance_files: Optional[List["lance.file.LanceFileReader"]],
):
    if fragments:
        for fragment in fragments:
            yield fragment.count_rows()
    else:
        for lance_file in lance_files:
            yield lance_file.num_rows()
```
Done, implemented in 559db69 and follow-up commits.
I added `_generate_num_examples`, and Zarr now reports row counts per shard/store, so the split `num_examples` metadata is populated.
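For reference, a minimal sketch of how a row count can be derived from already-parsed Zarr v2 consolidated metadata, assuming the conventional `.zmetadata` layout where the `metadata` mapping holds `{array}/.zarray` descriptors (the helper name is hypothetical, not from the PR):

```python
# Hypothetical sketch: read the axis-0 length of one array out of a parsed
# Zarr v2 consolidated metadata document (.zmetadata). Assumes the standard
# layout: {"metadata": {"<array>/.zarray": {"shape": [...], ...}, ...}}.
def num_rows_v2(consolidated, array_key):
    zarray = consolidated["metadata"][f"{array_key}/.zarray"]
    return zarray["shape"][0]
```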
```python
consolidated: bool = True


class Zarr(datasets.ArrowBasedBuilder):
```
(this is the mixin for _generate_num_examples)
Suggested change:

```diff
-class Zarr(datasets.ArrowBasedBuilder):
+class Zarr(datasets.ArrowBasedBuilder, datasets.builder._CountableBuilderMixin):
```
Done in 559db69 and follow-up. Zarr now inherits from datasets.builder._CountableBuilderMixin.
Done ✅ 🤗
@lhoestq quick ping. I pushed the requested updates including sharding and `_generate_num_examples`. Could you take another look when you have time? I can rerun or fix CI items right away if you want.
lhoestq left a comment:
Sorry for the delay :)
Thanks for the changes, the PR is almost ready to merge ! I just left one comment about files globbing that should stay unchanged, lmk what you think
```diff
-if (info["type"] == "file" or (info.get("islink") and os.path.isfile(os.path.realpath(filepath))))
+if (
+    info["type"] == "file"
+    or (info["type"] == "directory" and filepath.rstrip("/\\").endswith(".zarr"))
```
I think we should be able to make it work without breaking this function, which is meant to glob files and not directories
For example the Lance builder uses `.lance` directories containing the data (with folders named `transactions`, `_indices` etc.) and detects the root directory using a simple heuristic:
datasets/src/datasets/packaged_modules/lance/lance.py
Lines 62 to 69 in 1bd0a5c
you could do the same in the Zarr builder, where you can find the root .zarr directory using a similar trick
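A sketch of that heuristic adapted to Zarr: walk the path components and stop at the first one ending in `.zarr` (the helper name is made up, and this is modeled on the Lance trick rather than copied from either builder):

```python
from typing import Optional


# Hypothetical heuristic, modeled on the Lance builder's trick: given any path
# inside a store, find the enclosing `.zarr` root directory, so the glob
# function itself can keep matching files only.
def find_zarr_root(path: str) -> Optional[str]:
    parts = path.rstrip("/").split("/")
    for i, part in enumerate(parts):
        if part.endswith(".zarr"):
            return "/".join(parts[: i + 1])
    return None
```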
```python
"""Zarr packaged module for 🤗 Datasets."""


from .zarr import Zarr, ZarrConfig  # noqa: F401
```
Suggested change:

```diff
 """Zarr packaged module for 🤗 Datasets."""
-
-
 from .zarr import Zarr, ZarrConfig  # noqa: F401
```
Merged main into `…-poc` (conflicts resolved in `src/datasets/data_files.py`).
Hi @lhoestq, quick update on this PR. I added support for loading Zarr store roots directly. I also updated tests to cover store root paths, wildcard store roots, and v2 store root loading, and validated everything from a fresh clone.
FYI, I also created a Zarr collection and am working on expanding it. Please merge if everything looks good to you. Thank you. 🤗
Add initial Zarr streaming support (POC).
This introduces a `zarr` packaged module and docs/tests to validate basic loading.
Note: I pushed a follow-up commit to fix an accidental duplication in `benchmarks/benchmark_zarr_streaming.py` (the file now contains a single benchmark script).