Conversation

Force-pushed from b0e722f to 723ccd7.
- Add a `zarr` packaged module that reads Zarr v3 stores (`zarr.json`) and Zarr v2 consolidated stores (`.zmetadata`)
- Stream Zarr stores via fsspec-compatible backends (including `hf://`)
- Add unit tests and a small benchmark script
- Document Zarr streaming in the streaming guide
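For context, the two supported layouts differ only in which root metadata file marks the store: Zarr v3 uses `zarr.json`, Zarr v2 consolidated uses `.zmetadata`. A minimal sketch of telling them apart (the helper name `zarr_store_format` is hypothetical, not part of this PR):

```python
import posixpath
from typing import Optional


# Hypothetical helper (not from the PR): classify a Zarr store by its root
# metadata file. Zarr v3 stores carry a `zarr.json` at the root, while Zarr v2
# consolidated stores carry a `.zmetadata` file.
def zarr_store_format(metadata_path: str) -> Optional[str]:
    name = posixpath.basename(metadata_path.rstrip("/"))
    if name == "zarr.json":
        return "v3"
    if name == ".zmetadata":
        return "v2-consolidated"
    return None
```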
Force-pushed from 723ccd7 to 082567a.
Hi! It looks like the GitHub Actions check suites for this PR are awaiting approval. Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.
lhoestq left a comment:
Great PR ! It looks mostly good to me, I added a few suggestions.
Btw the current implementation for streaming returns a StreamingIterableDataset with num_shards=1, which corresponds to 1 metadata file.
For large datasets it's maybe more practical to have more fine-grained sharding, e.g. at the data file level. Wdyt ?
```python
try:
    from fsspec.core import url_to_fs
```
fsspec is a dependency of datasets so no need to check for this :)
Good point, removed the defensive fsspec check since it’s a required dependency.
Fixed in 7092e53.
```python
# Zarr stores are directory-based; users typically pass the root metadata file (Zarr v3: `zarr.json`,
# Zarr v2 consolidated: `.zmetadata`) explicitly via `data_files`.
".zarr": ("zarr", {}),
```
let's add zarr.json to METADATA_FILENAMES and .zmetadata to METADATA_EXTENSIONS, this way no need to explicitly pass the root metadata file via data_files - they will be auto-included in data_files :)
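A rough sketch of what that auto-inclusion rule could look like (the constant names mirror the reviewer's suggestion; the filter function itself is hypothetical, not the library's actual resolution code):

```python
# Sketch only: register the Zarr root metadata files so data-file resolution
# can pick them up automatically, without the user passing them via data_files.
METADATA_FILENAMES = {"zarr.json"}      # exact-name matches (Zarr v3 root)
METADATA_EXTENSIONS = {".zmetadata"}    # suffix matches (Zarr v2 consolidated)


def is_zarr_metadata_file(path: str) -> bool:
    # Hypothetical filter: check the final path component against both sets.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name in METADATA_FILENAMES or any(name.endswith(ext) for ext in METADATA_EXTENSIONS)
```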
```md
'language_score': 0.9900368452072144, 'token_count': 716}

## Streaming scientific formats (HDF5 and Zarr)
```
this doesn't feel like a section dedicated to streaming in general
maybe let's have a new dedicated set of pages in the docs about scientific data ? in addition to the existing ones 'audio', 'vision', 'text' and 'tabular' ?
Agreed, I moved the HDF5/Zarr content out of the streaming guide into a dedicated “Scientific data” docs page and linked to it from stream.mdx.
Fixed in 7092e53.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the review. I agree on sharding: `num_shards` currently tracks the number of input Zarr stores, so it is often 1, which is not ideal for large datasets.

I can implement finer-grained sharding using row-range shards aligned to axis-0 chunk boundaries (instead of one shard per metadata file), with a config knob like `rows_per_shard` / `target_num_shards`. I can add this in this PR if you want it before merge, or do it as a focused follow-up PR right after.
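The chunk-aligned sharding idea above can be sketched as follows (the function and parameter names, like `target_num_shards`, are illustrative rather than the final API):

```python
import math


# Illustrative sketch: split `num_rows` rows into shards whose boundaries fall
# on axis-0 chunk boundaries, aiming for roughly `target_num_shards` shards.
def row_range_shards(num_rows, chunk_rows, target_num_shards):
    total_chunks = math.ceil(num_rows / chunk_rows)
    chunks_per_shard = max(1, math.ceil(total_chunks / target_num_shards))
    shards = []
    start = 0
    while start < num_rows:
        # Each shard covers a whole number of chunks, so no chunk is read twice.
        end = min(start + chunks_per_shard * chunk_rows, num_rows)
        shards.append((start, end))
        start = end
    return shards
```

For example, 100 rows in 10-row chunks with a target of 4 shards yields four row ranges of 30, 30, 30 and 10 rows, each aligned to the chunk grid.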
```python
def _generate_shards(self, metadata_files, storage_options):
    yield from metadata_files
```
Could you also implement _generate_num_examples before merging ?
This will be useful to know how many rows a dataset has and show it e.g. on the HF website
Suggested change:

```diff
 def _generate_shards(self, metadata_files, storage_options):
     yield from metadata_files
+
+def _generate_num_examples(self, metadata_files, storage_options):
+    ...
```
as an example here is how it's implemented for lance:
```python
def _generate_num_examples(
    self,
    fragments: Optional[List["lance.LanceFragment"]],
    lance_files_paths: Optional[list[str]],
    lance_files: Optional[List["lance.file.LanceFileReader"]],
):
    if fragments:
        for fragment in fragments:
            yield fragment.count_rows()
    else:
        for lance_file in lance_files:
            yield lance_file.num_rows()
```
Done, implemented in 559db69 and follow-up commits.
I added `_generate_num_examples`, and Zarr now reports row counts per shard/store, so the split `num_examples` metadata is populated.
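For reference, a minimal sketch of how a row count can be derived from already-parsed Zarr v2 consolidated metadata, assuming the conventional `.zmetadata` layout where the `metadata` mapping holds `{array}/.zarray` descriptors (the helper name is hypothetical, not from the PR):

```python
# Hypothetical sketch: read the axis-0 length of one array out of a parsed
# Zarr v2 consolidated metadata document (.zmetadata). Assumes the standard
# layout: {"metadata": {"<array>/.zarray": {"shape": [...], ...}, ...}}.
def num_rows_v2(consolidated, array_key):
    zarray = consolidated["metadata"][f"{array_key}/.zarray"]
    return zarray["shape"][0]
```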
```python
consolidated: bool = True


class Zarr(datasets.ArrowBasedBuilder):
```
(this is the mixin for _generate_num_examples)
Suggested change:

```diff
-class Zarr(datasets.ArrowBasedBuilder):
+class Zarr(datasets.ArrowBasedBuilder, datasets.builder._CountableBuilderMixin):
```
Done in 559db69 and follow-up. Zarr now inherits from datasets.builder._CountableBuilderMixin.
Done ✅ 🤗
@lhoestq quick ping. I pushed the requested updates including sharding and `_generate_num_examples`. Could you take another look when you have time? I can rerun or fix CI items right away if you want.
lhoestq left a comment:
Sorry for the delay :)
Thanks for the changes, the PR is almost ready to merge ! I just left one comment about files globbing that should stay unchanged, lmk what you think
```diff
-if (info["type"] == "file" or (info.get("islink") and os.path.isfile(os.path.realpath(filepath))))
+if (
+    info["type"] == "file"
+    or (info["type"] == "directory" and filepath.rstrip("/\\").endswith(".zarr"))
```
I think we should be able to make it work without breaking this function, which is meant to glob files and not directories
For example the Lance builder uses `.lance` directories containing the data (with folders named `transactions`, `_indices` etc.) and detects the root directory using a simple heuristic:
datasets/src/datasets/packaged_modules/lance/lance.py
Lines 62 to 69 in 1bd0a5c
you could do the same in the Zarr builder, where you can find the root .zarr directory using a similar trick
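A sketch of that heuristic adapted to Zarr: walk the path components and stop at the first one ending in `.zarr` (the helper name is made up, and this is modeled on the Lance trick rather than copied from either builder):

```python
from typing import Optional


# Hypothetical heuristic, modeled on the Lance builder's trick: given any path
# inside a store, find the enclosing `.zarr` root directory, so the glob
# function itself can keep matching files only.
def find_zarr_root(path: str) -> Optional[str]:
    parts = path.rstrip("/").split("/")
    for i, part in enumerate(parts):
        if part.endswith(".zarr"):
            return "/".join(parts[: i + 1])
    return None
```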
```python
"""Zarr packaged module for 🤗 Datasets."""


from .zarr import Zarr, ZarrConfig  # noqa: F401
```
Suggested change:

```diff
 """Zarr packaged module for 🤗 Datasets."""
-
-
 from .zarr import Zarr, ZarrConfig  # noqa: F401
```
Merged main into `…-poc` (conflicts resolved in `src/datasets/data_files.py`).
Hi @lhoestq, quick update on this PR. I added support for loading Zarr store roots directly. I also updated tests to cover store root paths, wildcard store roots, and v2 store root loading, and validated everything from a fresh clone.
FYI, I also created a Zarr collection and am working on expanding it. Please merge if everything looks good to you. Thank you. 🤗
Add initial Zarr streaming support (POC).
This introduces a `zarr` packaged module and docs/tests to validate basic loading.
Note: I pushed a follow-up commit to fix an accidental duplication in `benchmarks/benchmark_zarr_streaming.py` (the file now contains a single benchmark script).