
Add Zarr streaming support (POC)#7983

Open
KOKOSde wants to merge 11 commits into huggingface:main from KOKOSde:feat/zarr-streaming-poc

Conversation


@KOKOSde KOKOSde commented Feb 3, 2026

Add initial Zarr streaming support (POC).

This introduces a zarr packaged module and docs/tests to validate basic loading.

Note: I pushed a follow-up commit to fix an accidental duplication in benchmarks/benchmark_zarr_streaming.py (file now contains a single benchmark script).

@KOKOSde force-pushed the feat/zarr-streaming-poc branch from b0e722f to 723ccd7 on February 3, 2026 at 23:51
Fahad Alghanim added 4 commits February 3, 2026 17:09
- Add a `zarr` packaged module that reads Zarr v3 stores (zarr.json) and Zarr v2 consolidated stores (.zmetadata)
- Stream Zarr stores via fsspec-compatible backends (including hf://)
- Add unit tests and a small benchmark script
- Document Zarr streaming in the streaming guide
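As a rough illustration of the two layouts the commits describe, a store root can be classified by which metadata file it holds (hypothetical helper, not code from this PR):

```python
def detect_zarr_store_format(root_files):
    # Hypothetical helper (not from the PR): classify a Zarr store root by
    # the metadata file it contains. Zarr v3 keeps `zarr.json` at the root;
    # Zarr v2 consolidated stores keep `.zmetadata`.
    names = set(root_files)
    if "zarr.json" in names:
        return "v3"
    if ".zmetadata" in names:
        return "v2-consolidated"
    return None
```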
@KOKOSde force-pushed the feat/zarr-streaming-poc branch from 723ccd7 to 082567a on February 4, 2026 at 00:09

KOKOSde commented Feb 9, 2026

Hi! It looks like the GitHub Actions check suites for this PR are in action_required (no workflows actually ran). This is usually due to fork workflow approval.

Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.


@lhoestq lhoestq left a comment


Great PR ! It looks mostly good to me, I added a few suggestions.

Btw the current implementation for streaming returns a StreamingIterableDataset with num_shards=1, which corresponds to 1 metadata file.

For large datasets it's maybe more practical to have more fine-grained sharding, e.g. at the data file level. Wdyt ?

Comment on lines +149 to +150
```python
try:
    from fsspec.core import url_to_fs
```
Member


fsspec is a dependency of datasets so no need to check for this :)


@KOKOSde KOKOSde Feb 17, 2026


Good point, removed the defensive fsspec check since it’s a required dependency.
Fixed in 7092e53.

Comment on lines +93 to +95
```python
# Zarr stores are directory-based; users typically pass the root metadata file (Zarr v3: `zarr.json`,
# Zarr v2 consolidated: `.zmetadata`) explicitly via `data_files`.
".zarr": ("zarr", {}),
```
Member


let's add zarr.json to METADATA_FILENAMES and .zmetadata to METADATA_EXTENSIONS, this way no need to explicitly pass the root metadata file via data_files - they will be auto-included in data_files :)
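A rough sketch of that suggestion, using simplified stand-ins for the constants in datasets' data-files resolution (the surrounding values here are illustrative, only the Zarr entries come from the review):

```python
# Hypothetical, simplified stand-ins for the real constants; the point is
# that adding the Zarr entries makes root metadata auto-included when
# data_files are resolved.
METADATA_FILENAMES = ["README.md", "zarr.json"]
METADATA_EXTENSIONS = [".zmetadata"]

def is_metadata_file(name):
    # A file is treated as metadata if its name matches exactly or if it
    # ends with a registered metadata extension.
    return name in METADATA_FILENAMES or any(name.endswith(ext) for ext in METADATA_EXTENSIONS)
```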

Contributor Author


Done! Added zarr.json to the metadata filenames and .zmetadata to the metadata extensions, so Zarr root metadata gets auto-included and users can just pass the .zarr store root.
Fixed in 7092e53.

Comment thread docs/source/stream.mdx Outdated
'language_score': 0.9900368452072144, 'token_count': 716}
```

## Streaming scientific formats (HDF5 and Zarr)
Member


this doesn't feel like a section dedicated to streaming in general

maybe let's have a new dedicated set of pages in the docs about scientific data ? in addition to the existing ones 'audio', 'vision', 'text' and 'tabular' ?

Contributor Author


Agreed, I moved the HDF5/Zarr content out of the streaming guide into a dedicated "Scientific data" docs page and linked to it from stream.mdx.
Fixed in 7092e53.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


KOKOSde commented Feb 17, 2026

> Great PR ! It looks mostly good to me, I added a few suggestions.
>
> Btw the current implementation for streaming returns a StreamingIterableDataset with num_shards=1, which corresponds to 1 metadata file.
>
> For large datasets it's maybe more practical to have more fine-grained sharding, e.g. at the data file level. Wdyt ?

Thanks for the review. I agree on sharding: num_shards currently tracks the number of input Zarr stores, so it is often 1, which is not ideal for large datasets.

I can implement finer-grained sharding using row-range shards aligned to axis-0 chunk boundaries (instead of one shard per metadata file), with a config knob like rows_per_shard / target_num_shards.

I can add this in this PR if you want it before merge, or do it as a focused follow-up PR right after.
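The row-range sharding described above might look roughly like this (hypothetical function and knob names, not the PR's code):

```python
import math

def row_range_shards(num_rows, chunk_rows, target_num_shards):
    # Hypothetical sketch: split [0, num_rows) into row ranges aligned to
    # axis-0 chunk boundaries, aiming for about `target_num_shards` shards,
    # so no chunk is ever read by two shards.
    num_chunks = math.ceil(num_rows / chunk_rows)
    chunks_per_shard = max(1, math.ceil(num_chunks / target_num_shards))
    step = chunks_per_shard * chunk_rows
    return [(start, min(start + step, num_rows)) for start in range(0, num_rows, step)]
```

For example, 100 rows with 16-row chunks and a target of 4 shards yields the ranges (0, 32), (32, 64), (64, 96), (96, 100).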


@lhoestq lhoestq left a comment


Feel free to include it in this PR if it's ok for you.

Thanks for the changes, I just have one last comment I didn't think of earlier. The rest looks good to me :)

btw it seems some tests are failing in tests/packaged_modules/test_zarr.py

Comment on lines +72 to +74
```python
def _generate_shards(self, metadata_files, storage_options):
    yield from metadata_files
```
Member


Could you also implement _generate_num_examples before merging ?
This will be useful to know how many rows a dataset has and show it e.g. on the HF website

Suggested change:

```python
def _generate_shards(self, metadata_files, storage_options):
    yield from metadata_files

def _generate_num_examples(self, metadata_files, storage_options):
    ...
```

As an example, here is how it's implemented for lance:

    def _generate_num_examples(
        self,
        fragments: Optional[List["lance.LanceFragment"]],
        lance_files_paths: Optional[list[str]],
        lance_files: Optional[List["lance.file.LanceFileReader"]],
    ):
        if fragments:
            for fragment in fragments:
                yield fragment.count_rows()
        else:
            for lance_file in lance_files:
                yield lance_file.num_rows()

Contributor Author


Done! Implemented in 559db69 and follow-up commits.
I added _generate_num_examples, and Zarr now reports row counts per shard/store, so the split's num_examples metadata is populated.
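Row counts can come straight from store metadata. A minimal sketch for a Zarr v3 array, assuming only the fields the v3 spec puts in `zarr.json` (this is not the PR's implementation):

```python
import json

def num_rows_from_v3_array_metadata(zarr_json_text):
    # Hypothetical sketch: a Zarr v3 array's root `zarr.json` carries the
    # array `shape`, so the axis-0 extent (row count) can be read from
    # metadata alone, without fetching any chunk data.
    meta = json.loads(zarr_json_text)
    if meta.get("node_type") != "array":
        raise ValueError("expected Zarr v3 array metadata")
    return meta["shape"][0]
```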

```python
    consolidated: bool = True


class Zarr(datasets.ArrowBasedBuilder):
```
Member


(this is the mixin for _generate_num_examples)

Suggested change:

```python
class Zarr(datasets.ArrowBasedBuilder, datasets.builder._CountableBuilderMixin):
```

Contributor Author


Done in 559db69 and follow-up. Zarr now inherits from datasets.builder._CountableBuilderMixin.


KOKOSde commented Feb 20, 2026

> Feel free to include it in this PR if it's ok for you.
>
> Thanks for the changes, I just have one last comment I didn't think of earlier. The rest looks good to me :)
>
> btw it seems some tests are failing in tests/packaged_modules/test_zarr.py

Done ✅ 🤗
I included the sharding/counting follow-ups in this PR: _generate_num_examples, _CountableBuilderMixin, finer-grained sharding controls, and the robustness fixes/tests/docs.


KOKOSde commented Mar 3, 2026

@lhoestq quick ping. I pushed the requested updates including sharding and _generate_num_examples. Could you take another look when you have time?

I can rerun or fix CI items right away if you want.


@lhoestq lhoestq left a comment


Sorry for the delay :)

Thanks for the changes, the PR is almost ready to merge ! I just left one comment about the file-globbing code, which should stay unchanged, lmk what you think

Comment thread src/datasets/data_files.py Outdated
```diff
-if (info["type"] == "file" or (info.get("islink") and os.path.isfile(os.path.realpath(filepath))))
+if (
+    info["type"] == "file"
+    or (info["type"] == "directory" and filepath.rstrip("/\\").endswith(".zarr"))
```
Member


I think we should be able to make it work without breaking this function, which is meant to glob files and not directories.

For example, the Lance builder uses .lance directories containing the data (with folders named _transactions, _indices, etc.) and detects the root directory using a simple heuristic:

```python
def resolve_dataset_uris(files: List[str]) -> List[str]:
    dataset_uris = set()
    for file_path in files:
        path = Path(file_path)
        if path.parent.name in {"_transactions", "_indices", "_versions"}:
            dataset_root = path.parent.parent
            dataset_uris.add(str(dataset_root))
    return list(dataset_uris)
```

you could do the same in the Zarr builder, where you can find the root .zarr directory using a similar trick
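A hypothetical Zarr analogue of that heuristic (function name and details are illustrative, not from the PR):

```python
from pathlib import PurePosixPath

def resolve_zarr_store_roots(files):
    # Hypothetical analogue of the Lance trick for Zarr: walk each globbed
    # path up to the first component ending in `.zarr` and treat that
    # directory as the store root.
    roots = set()
    for file_path in files:
        parts = PurePosixPath(file_path).parts
        for i, part in enumerate(parts):
            if part.endswith(".zarr"):
                roots.add(str(PurePosixPath(*parts[: i + 1])))
                break
    return sorted(roots)
```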

Comment on lines +1 to +3
```python
"""Zarr packaged module for 🤗 Datasets."""

from .zarr import Zarr, ZarrConfig  # noqa: F401
```
Member


Suggested change (drop the module docstring):

```python
from .zarr import Zarr, ZarrConfig  # noqa: F401
```

Contributor Author


@lhoestq
Thanks for the review, I pushed 484be3d: reverted the resolve_pattern directory handling, moved .zarr root detection into the Zarr builder, and updated tests (.zgroup-based non-consolidated coverage).

Could you take another look?
Thanks!


KOKOSde commented Mar 23, 2026

Hi @lhoestq, quick update on this PR.

I added support for loading Zarr store roots directly through data_files, including paths like .../brain_00000.zarr and wildcard patterns like *.zarr.

I also updated tests to cover store root paths, wildcard store roots, and v2 store root loading, and validated everything from a fresh clone.

I ran this against KokosDev/single-cell-brain-zarr with 5 use cases a user might hit:

1. Metadata discovery and schema inspection
2. Analytics sampling on a single shard
3. Multi-shard ingestion with explicit shard lists
4. Multi-shard ingestion with wildcard glob patterns
5. Training pipeline smoke test plus export pipeline validation

All passed.

FYI I also created a Zarr collection and I am working on expanding it:
https://huggingface.co/datasets/KokosDev/single-cell-brain-zarr
It has over 23k downloads in 15 days.

Please merge if everything looks good to you. Thank you. 🤗
