perf: cache lexicographic chunk coords in sharding codec by d-v-b · Pull Request #4012 · zarr-developers/zarr-python

d-v-b · 2026-05-26T21:02:15Z

Fixes a performance regression introduced in #3826 that made morton-order-indexing slower.

for details, see claude's summary below.

The subchunk_write_order feature (#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via
np.array(list(_subchunk_order_iter(..., "lexicographic"))), and to_dict_vectorized rebuilt a tuple key per row with tuple(coords.ravel()). For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression).

Add cached _lexicographic_order (array) and _lexicographic_order_keys (tuples) helpers in indexing.py, mirroring _morton_order/_morton_order_keys, and pass the cached keys into to_dict_vectorized instead of deriving them row-by-row. This restores write throughput to the pre-#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty).

[Description of PR]

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

The subchunk_write_order feature (zarr-developers#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via `np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and `to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`. For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression). Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys` (tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`, and pass the cached keys into `to_dict_vectorized` instead of deriving them row-by-row. This restores write throughput to the pre-zarr-developers#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-26T21:11:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.56%. Comparing base (8932bbe) to head (2b8b095).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4012   +/-   ##
=======================================
  Coverage   93.55%   93.56%           
=======================================
  Files          88       88           
  Lines       11896    11912   +16     
=======================================
+ Hits        11129    11145   +16     
  Misses        767      767

Files with missing lines	Coverage Δ
src/zarr/codecs/sharding.py	`92.20% <100.00%> (+0.06%)`	⬆️
src/zarr/core/indexing.py	`96.30% <100.00%> (+0.05%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codspeed-hq · 2026-05-26T21:28:36Z

Merging this PR will improve performance by 76.71%

⚡ 3 improved benchmarks
✅ 63 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`test_sharded_morton_write_single_chunk[(32, 32, 32)-memory]`	348.6 ms	198.7 ms	+75.45%
⚡	WallTime	`test_sharded_morton_write_single_chunk[(33, 33, 33)-memory]`	386.8 ms	217.2 ms	+78.06%
⚡	WallTime	`test_sharded_morton_write_single_chunk[(30, 30, 30)-memory]`	292.6 ms	165.7 ms	+76.62%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing d-v-b:perf-sharding-coord-cache (3372b05) with main (c0e2afa)²}

6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
No successful run was found on main (1cda981) during the generation of this report, so c0e2afa was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

maxrjones · 2026-06-01T19:32:23Z

@@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]:
    def to_dict_vectorized(


I think the following reads cleaner than adding another parameter. The reader already knows its own chunks_per_shard, so it can fetch both cached structures itself instead of having them threaded in. With this, the call site simplifies to just shard_reader.to_dict_vectorized().

I tested it still fixes the regression — same ~1.6–1.8× speedup over main on test_sharded_morton_write_single_chunk, and the difference vs the committed version is within run-to-run noise.

def to_dict_vectorized(self) -dict[tuple[int, ...], Buffer | None]: """Build a dict of chunk coordinates to buffers using vectorized lookup. The full per-shard chunk coordinate grid (both the array used for the vectorized index lookup and the plain tuples used as dict keys) is cached on ``chunks_per_shard``, so neither is rebuilt on every call. For a shard with tens of thousands of chunks this avoids reconstructing that many tuples on every partial write. Returns ------- dict mapping chunk coordinate tuples to Buffer or None """ chunks_per_shard = self.index.chunks_per_shard chunk_coords_array = _lexicographic_order(chunks_per_shard) chunk_coords_keys = _lexicographic_order_keys(chunks_per_shard) starts, ends, valid = self.index.get_chunk_slices_vectorized(chunk_coords_array) result: dict[tuple[int, ...], Buffer | None] = {} for i, coords in enumerate(chunk_coords_keys): if valid[i]: result[coords] = self.buf[int(starts[i]) : int(ends[i])] else: result[coords] = None return result

that's great feedback! have a look at 6e02c8f and see if things look better to you

Address review feedback: `_ShardReader.to_dict_vectorized` took the lexicographic coordinate array and key tuples as parameters, even though the reader already knows its own `chunks_per_shard` and both structures are `lru_cache`d. Thread nothing in — fetch them inside the method via `_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no perf change; the call site collapses to `to_dict_vectorized()`. Add a unit test covering the method directly across 0-d, 1-d, and 2-d shard grids: present chunks map to their stored bytes, empty chunks to None, and every lexicographic coordinate appears as a key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ilan-gold

Apologies for the regression!

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

d-v-b added the benchmark Code will be benchmarked in a CI job. label May 26, 2026

d-v-b requested a review from ilan-gold May 26, 2026 21:04

d-v-b requested a review from maxrjones May 27, 2026 15:53

Merge branch 'main' into perf-sharding-coord-cache

6053018

d-v-b requested a review from mkitti May 29, 2026 14:54

Merge branch 'main' into perf-sharding-coord-cache

d74d857

maxrjones assigned d-v-b and maxrjones Jun 1, 2026

maxrjones reviewed Jun 1, 2026

View reviewed changes

d-v-b and others added 2 commits June 2, 2026 15:26

Merge branch 'main' into perf-sharding-coord-cache

2b8b095

ilan-gold approved these changes Jun 2, 2026

View reviewed changes

Comment thread src/zarr/core/indexing.py Outdated

Update src/zarr/core/indexing.py

51d0f05

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: cache lexicographic chunk coords in sharding codec#4012

perf: cache lexicographic chunk coords in sharding codec#4012
d-v-b wants to merge 6 commits into
zarr-developers:mainfrom
d-v-b:perf-sharding-coord-cache

d-v-b commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented May 26, 2026

Uh oh!

maxrjones Jun 1, 2026

Uh oh!

d-v-b Jun 2, 2026

Uh oh!

ilan-gold left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]:
		def to_dict_vectorized(

Uh oh!

Conversation

d-v-b commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codspeed-hq Bot commented May 26, 2026

Merging this PR will improve performance by 76.71%

Performance Changes

Footnotes

Uh oh!

maxrjones Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

d-v-b Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

ilan-gold left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented May 26, 2026 •

edited

Loading