Skip to content

perf: cache lexicographic chunk coords in sharding codec#4012

Open
d-v-b wants to merge 6 commits into
zarr-developers:mainfrom
d-v-b:perf-sharding-coord-cache
Open

perf: cache lexicographic chunk coords in sharding codec#4012
d-v-b wants to merge 6 commits into
zarr-developers:mainfrom
d-v-b:perf-sharding-coord-cache

Conversation

@d-v-b
Copy link
Copy Markdown
Contributor

@d-v-b d-v-b commented May 26, 2026

Fixes a performance regression introduced in #3826 that made morton-order-indexing slower.

for details, see claude's summary below.

The subchunk_write_order feature (#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via
np.array(list(_subchunk_order_iter(..., "lexicographic"))), and to_dict_vectorized rebuilt a tuple key per row with tuple(coords.ravel()). For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression).

Add cached _lexicographic_order (array) and _lexicographic_order_keys (tuples) helpers in indexing.py, mirroring _morton_order/_morton_order_keys, and pass the cached keys into to_dict_vectorized instead of deriving them row-by-row. This restores write throughput to the pre-#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty).

[Description of PR]

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

The subchunk_write_order feature (zarr-developers#3826) regressed sharded write
performance: _encode_partial_single rebuilt the full per-shard chunk
coordinate grid on every write via
`np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and
`to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`.
For a single-chunk write into a shard with tens of thousands of chunks this
roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk,
matching the -44% CodSpeed regression).

Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys`
(tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`,
and pass the cached keys into `to_dict_vectorized` instead of deriving them
row-by-row. This restores write throughput to the pre-zarr-developers#3826 baseline while
preserving identical chunk ordering (verified equal to np.ndindex across
shapes including 0-d and empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label May 26, 2026
@d-v-b d-v-b requested a review from ilan-gold May 26, 2026 21:04
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.56%. Comparing base (8932bbe) to head (2b8b095).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4012   +/-   ##
=======================================
  Coverage   93.55%   93.56%           
=======================================
  Files          88       88           
  Lines       11896    11912   +16     
=======================================
+ Hits        11129    11145   +16     
  Misses        767      767           
Files with missing lines Coverage Δ
src/zarr/codecs/sharding.py 92.20% <100.00%> (+0.06%) ⬆️
src/zarr/core/indexing.py 96.30% <100.00%> (+0.05%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 26, 2026

Merging this PR will improve performance by 76.71%

⚡ 3 improved benchmarks
✅ 63 untouched benchmarks
⏩ 6 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] 348.6 ms 198.7 ms +75.45%
WallTime test_sharded_morton_write_single_chunk[(33, 33, 33)-memory] 386.8 ms 217.2 ms +78.06%
WallTime test_sharded_morton_write_single_chunk[(30, 30, 30)-memory] 292.6 ms 165.7 ms +76.62%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.


Comparing d-v-b:perf-sharding-coord-cache (3372b05) with main (c0e2afa)2

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (1cda981) during the generation of this report, so c0e2afa was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@d-v-b d-v-b requested a review from maxrjones May 27, 2026 15:53
@d-v-b d-v-b requested a review from mkitti May 29, 2026 14:54
Comment thread src/zarr/codecs/sharding.py Outdated
@@ -266,13 +269,19 @@ def __iter__(self) -> Iterator[tuple[int, ...]]:
def to_dict_vectorized(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the following reads cleaner than adding another parameter. The reader already knows its own chunks_per_shard, so it can fetch both cached structures itself instead of having them threaded in. With this, the call site simplifies to just shard_reader.to_dict_vectorized().

I tested it still fixes the regression — same ~1.6–1.8× speedup over main on test_sharded_morton_write_single_chunk, and the difference vs the committed version is within run-to-run noise.

    def to_dict_vectorized(self) -dict[tuple[int, ...], Buffer | None]:
        """Build a dict of chunk coordinates to buffers using vectorized lookup.

        The full per-shard chunk coordinate grid (both the array used for the
        vectorized index lookup and the plain tuples used as dict keys) is
        cached on ``chunks_per_shard``, so neither is rebuilt on every call.
        For a shard with tens of thousands of chunks this avoids reconstructing
        that many tuples on every partial write.

        Returns
        -------
        dict mapping chunk coordinate tuples to Buffer or None
        """
        chunks_per_shard = self.index.chunks_per_shard
        chunk_coords_array = _lexicographic_order(chunks_per_shard)
        chunk_coords_keys = _lexicographic_order_keys(chunks_per_shard)
        starts, ends, valid = self.index.get_chunk_slices_vectorized(chunk_coords_array)

        result: dict[tuple[int, ...], Buffer | None] = {}
        for i, coords in enumerate(chunk_coords_keys):
            if valid[i]:
                result[coords] = self.buf[int(starts[i]) : int(ends[i])]
            else:
                result[coords] = None

        return result

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's great feedback! have a look at 6e02c8f and see if things look better to you

d-v-b and others added 2 commits June 2, 2026 15:26
Address review feedback: `_ShardReader.to_dict_vectorized` took the
lexicographic coordinate array and key tuples as parameters, even though
the reader already knows its own `chunks_per_shard` and both structures
are `lru_cache`d. Thread nothing in — fetch them inside the method via
`_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no
perf change; the call site collapses to `to_dict_vectorized()`.

Add a unit test covering the method directly across 0-d, 1-d, and 2-d
shard grids: present chunks map to their stored bytes, empty chunks to
None, and every lexicographic coordinate appears as a key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the regression!

Comment thread src/zarr/core/indexing.py Outdated
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmark Code will be benchmarked in a CI job.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants