Add support for structured dtypes to zarr3 driver, open structs as void
#271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
BrianMichell wants to merge 56 commits into google:master from BrianMichell:v3_structs_and_void
Conversation
…t for raw bits dtype
Implement shim for `open_as_void` driver level flag
* Begin removing void field shim
* Fully removed void string shim
* Cleanup debug prints
* Remove shimmed validation
* Remove unnecessary comment
* Prefer false over zero for ternary clarity

* Implement a more general and portable example set
* Fix driver cache bug
* Update example for template
* Cleanup example
* Remove testing examples from source

* Use the appropriate fill value for open_as_void structured data
* Cleanup
Collaborator
I'll try to get to this in about a week. Before I look this one over, please double-check that the prior PR works for you. Also look over this one and see if any of the suggestions from the other one apply.
Matches the pattern from zarr v2 driver (PR google#272). When both "field" and "open_as_void" are specified in the spec, return an error since these options are mutually exclusive - field selects a specific field from a structured array, while open_as_void provides raw byte access to the entire structure.
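A minimal usage sketch of this mutual exclusivity check, assuming the C++ tensorstore::Open API with a JSON spec; the option names "field" and "open_as_void" come from this PR, while the kvstore layout, field name, and the function itself are illustrative:

    // Hedged sketch (not driver code): combining "field" and "open_as_void"
    // should fail spec validation. kvstore layout is illustrative only.
    #include <cassert>
    #include <nlohmann/json.hpp>
    #include "tensorstore/open.h"
    #include "tensorstore/tensorstore.h"

    void MutualExclusivityExample() {
      // "field" selects one member of the struct, while "open_as_void"
      // exposes the whole struct as raw bytes, so both together are rejected.
      auto invalid = tensorstore::Open(::nlohmann::json{
                         {"driver", "zarr3"},
                         {"kvstore", {{"driver", "memory"}}},
                         {"field", "y"},
                         {"open_as_void", true}})
                         .result();
      assert(!invalid.ok());
    }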
The zarr3 URL syntax cannot represent field selection or void access mode. Following the pattern from zarr v2 driver (PR google#272), ToUrl() now returns an error when either of these options is specified instead of silently ignoring them.
…trip

Following the pattern from zarr v2 driver (PR google#272), override GetBoundSpecData in ZarrDataCache to set spec.open_as_void from ChunkCacheImpl::open_as_void_. This ensures that when you open a store with open_as_void=true and then call spec(), the resulting spec correctly has open_as_void=true set. Without this fix, opening a store with open_as_void=true and then getting its spec would lose the open_as_void flag, causing incorrect behavior if the spec is used to re-open the store.
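A hedged sketch of the round-trip this fix protects, assuming the array already exists in the kvstore; store.spec() and Spec::ToJson() are the standard API, and the kvstore layout is illustrative:

    // Hedged sketch (not driver code) of the spec round-trip described above.
    #include <cassert>
    #include <nlohmann/json.hpp>
    #include "tensorstore/open.h"
    #include "tensorstore/spec.h"
    #include "tensorstore/tensorstore.h"

    void SpecRoundTrip() {
      auto store = tensorstore::Open(::nlohmann::json{
                       {"driver", "zarr3"},
                       {"kvstore", {{"driver", "memory"}}},
                       {"open_as_void", true}})
                       .result()
                       .value();
      // Without the GetBoundSpecData override, the spec derived from the
      // open store would silently drop open_as_void.
      auto spec_json = store.spec().value().ToJson().value();
      assert(spec_json["open_as_void"] == true);
    }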
Add comprehensive tests for open_as_void functionality following the patterns from zarr v2 driver (PR google#272):

Tests that PASS:
- OpenAsVoidSimpleType: Verifies simple type arrays can be opened with open_as_void, gaining an extra dimension for bytes
- OpenAsVoidSpecRoundtrip: Verifies open_as_void preserved in spec JSON
- OpenAsVoidGetBoundSpecData: Verifies spec() on void store returns open_as_void=true (tests the GetBoundSpecData fix)
- OpenAsVoidCannotUseWithField: Verifies mutual exclusivity validation
- OpenAsVoidUrlNotSupported: Verifies ToUrl() rejects open_as_void
- FieldSelectionUrlNotSupported: Verifies ToUrl() rejects selected_field

Tests marked TODO (pending codec chain implementation):
- OpenAsVoidStructuredType
- OpenAsVoidWithCompression
- OpenAsVoidReadWrite
- OpenAsVoidWriteRoundtrip

Also fixes BUILD file: adds :metadata dependency to :chunk_cache target to provide the dtype.h header that chunk_cache.h includes.
The codec chain is prepared for the original dtype and chunk shape (without the extra bytes dimension). For void access:

DecodeChunk:
- Strip the bytes dimension from the grid's chunk_shape to get the original shape
- Decode using the original codec shape
- Reinterpret the decoded bytes as [chunk_shape..., bytes_per_elem]

EncodeChunk:
- Input has shape [chunk_shape..., bytes_per_elem] of byte_t
- Create a view with the original chunk shape and element_size
- Encode using the original codec

This follows the pattern from zarr v2 (PR google#272) where the void metadata has the chunk_layout computed to match encoded/decoded layouts.
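A plain-C++ illustration of the reinterpretation step (not the driver's actual code): a decoded [4, 4] int32 chunk is exposed for void access as a [4, 4, 4] byte array without rearranging any data. The function name and shapes are hypothetical:

    // Illustrative only: reinterpret a decoded [4, 4] int32 chunk as a
    // [4, 4, sizeof(int32_t)] byte array for void access.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    std::vector<std::byte> ReinterpretAsVoid(const std::vector<std::int32_t>& decoded) {
      // DecodeChunk decodes with the original codec shape [4, 4]; the void
      // view then exposes the same bytes with an extra trailing dimension of
      // size bytes_per_elem (4 for int32).
      std::vector<std::byte> bytes(decoded.size() * sizeof(std::int32_t));
      std::memcpy(bytes.data(), decoded.data(), bytes.size());
      // Logical shape of `bytes` is now [4, 4, 4]; no data is rearranged,
      // only the indexing changes.
      return bytes;
    }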
Add tests that verify:
- OpenAsVoidReadWrite: Write data via typed access, read via void access, verifying the byte layout is correct
- OpenAsVoidWriteRoundtrip: Write via typed access, verify byte values can be read via void access with correct little-endian layout

These tests verify the DecodeChunk fix works correctly for reading data written with the original dtype through void (byte) access.
Verify that open_as_void works correctly when the array uses compression codecs (gzip). The fix to DecodeChunk properly handles the bytes->bytes codec chain when decoding for void access.
For void access, the codec handling differs between:
- Non-structured types: codec prepared for [chunk_shape] with the original dtype. Need to decode/encode then reinterpret bytes.
- Structured types: codec already prepared for [chunk_shape, bytes_per_elem] with byte dtype. Just decode/encode directly.

Add original_is_structured parameter to cache constructors to properly distinguish these cases in DecodeChunk and EncodeChunk. This follows the pattern from zarr v2 (PR google#272) where CreateVoidMetadata() creates a modified metadata for void access.
The structured type with void access requires additional work to handle rank mismatch between spec transform (based on original shape) and void access transform (which adds the bytes dimension). Mark as TODO for now.
For void access, the domain needs to include an extra dimension for bytes_per_outer_element. This requires:
1. Deferring rank setting in the JSON binder until after open_as_void is known, then adding 1 to the rank for void access.
2. Building the domain directly in GetDomain() when open_as_void=true and the metadata constraints include dtype and shape, adding the extra bytes dimension.

This enables void access to work correctly with simple (non-structured) types when creating arrays.
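A small sketch of the resulting domain shape, assuming an original [100, 100] int32 array; IndexDomainBuilder usage is shown for illustration and is not the driver's GetDomain() implementation:

    // Illustrative only: the void-access domain gains a trailing dimension of
    // extent bytes_per_outer_element (4 for int32).
    #include "tensorstore/index_space/index_domain_builder.h"

    tensorstore::IndexDomain<> VoidDomainFor100x100Int32() {
      // Original domain: [100, 100]; void domain: [100, 100, 4].
      return tensorstore::IndexDomainBuilder(3)
          .origin({0, 0, 0})
          .shape({100, 100, 4})
          .Finalize()
          .value();
    }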
The structured type void access requires additional work in GetNewMetadata to properly handle the extra bytes dimension. The current implementation doesn't correctly propagate the void rank through all the metadata validation and domain building code paths. For now, disable this test and leave as TODO for future work.
Update comments in the JSON binder initialization to better explain the void field's field_shape and how it affects the schema rank. Also update the TODO for the structured type void access test to more accurately describe the remaining work needed:
- GetNewMetadata needs to handle field_shape dimensions
- SetChunkLayoutFromMetadata needs dimension mismatch handling
When encoding data through void access, the codec expects the original dtype (e.g., int32), not the synthesized void dtype (byte_t). This fix:
1. Adds original_dtype_ member to ZarrLeafChunkCache and ZarrShardedChunkCache to store the original dtype from metadata.
2. Updates EncodeChunk to use original_dtype_ when creating the SharedArray for encoding, ensuring the codec receives data in the correct format.
3. Passes original_dtype through MakeZarrChunkCache and ZarrShardSubChunkCache constructors.

This fixes writing through void access, both with and without compression.
Add test to verify that writing through void access with compression enabled works correctly. The test:
1. Creates an array with gzip compression
2. Initializes with zeros via typed access
3. Opens as void and writes raw bytes
4. Reads back through void access to verify the write
5. Reads back through typed access to verify byte interpretation

This test exercises the EncodeChunk path for void access with the codec chain including compression.
Add tests to verify that GetSpecInfo correctly computes rank when open_as_void=true (mirroring v2 test patterns):
- GetSpecInfoOpenAsVoidWithKnownRank: Verifies full_rank = chunked_rank + 1
- GetSpecInfoOpenAsVoidWithDynamicRank: Verifies dynamic rank handling
- GetSpecInfoOpenAsVoidWithoutDtype: Verifies behavior without dtype
- GetSpecInfoOpenAsVoidRankConsistency: Verifies spec rank matches opened store

Also adds TODO for OpenAsVoidFillValue test - fill_value handling for void access requires additional implementation (similar to v2's CreateVoidMetadata which converts fill_value to byte array).
Implement proper fill_value conversion for void access mode:
1. Add is_void_access() virtual method to DataCacheBase to expose whether the cache was opened with open_as_void=true.
2. Modify ZarrDriver::GetFillValue to convert fill_value to byte array representation when in void access mode. This copies bytes from each field's fill_value at their respective offsets, similar to v2's CreateVoidMetadata handling.
3. Add OpenAsVoidFillValue test to verify that:
   - Normal store returns the expected scalar fill_value
   - Void store returns fill_value as byte array with correct shape
   - Byte representation matches the original value (little endian)
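A plain-C++ sketch of the byte-level fill_value packing described above, assuming a hypothetical struct of {uint8 x; int16 y;} with offsets 0 and 1 and a 3-byte outer element; the driver itself works on its metadata structures rather than a helper like this:

    // Illustrative only: pack per-field fill values into one byte array at
    // their struct offsets. Assumes a little-endian host, matching the
    // little-endian byte layout checked by the test.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    std::vector<unsigned char> VoidFillValue() {
      const std::uint8_t x_fill = 7;
      const std::int16_t y_fill = -2;
      std::vector<unsigned char> fill(3, 0);       // bytes_per_outer_element
      std::memcpy(fill.data() + 0, &x_fill, 1);    // field x at offset 0
      std::memcpy(fill.data() + 1, &y_fill, 2);    // field y at offset 1
      return fill;  // void fill_value: a byte array of shape [3]
    }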
Fix EncodeChunk to properly handle structured types:
1. For single non-structured field: encode directly (existing behavior)
2. For structured types (multiple fields): combine field arrays into a single byte array by copying each field's data at their respective byte offsets, then encode the combined byte array.

This matches the pattern in DecodeChunk which extracts fields from a decoded byte array.

Add OpenAsVoidStructuredType test that:
- Creates an array with structured dtype (uint8 + int16 fields)
- Writes data using field access
- Opens with open_as_void=true
- Verifies rank is original_rank + 1
- Verifies bytes dimension is 3 (1 + 2 bytes)
- Verifies dtype is byte
1. OpenAsVoidStructuredType: Now actually reads and verifies byte content
   - Reads raw bytes through void access
   - Uses proper stride calculation for the returned array
   - Verifies y field bytes at all 4 positions (little-endian int16)
   - x field is 0 (fill value) since we only wrote to the y field
2. Add GetSpecInfoOpenAsVoidWithStructuredDtype test
   - Verifies spec rank = chunked_rank + 1 with structured dtype
   - Tests structured dtype with int32 + uint16 fields
   - Matches v2 test coverage
Test that open_as_void correctly detects when the underlying metadata has been changed to an incompatible dtype. ResolveBounds should fail with kFailedPrecondition when the stored metadata has a different bytes_per_outer_element than what was expected. This matches the v2 test that verifies metadata consistency checking works properly with void access.
Verifies that void access works correctly with sharded arrays:
- Void access flags propagate through sharded caches
- Reading bytes through sharded void access returns correct data
- Writing bytes through sharded void access round-trips correctly
…y with zarr2

- Remove invalid oneOf constraint that didn't properly express mutual exclusivity
- Update field description to match zarr2 style (document mutual exclusivity)
- Update open_as_void description to document mutual exclusivity with field
- Add oneOf type constraint for field to match zarr2 (string or null)

The actual mutual exclusivity validation is done in code via jb::Initialize.
…nsform

For consistency with GetDomain(), explicitly set implicit_lower_bounds in GetExternalToInternalTransform when building the void access transform. Both methods now follow the same pattern of explicitly setting both implicit_lower_bounds and implicit_upper_bounds.
Add assertion that num_fields == 1 in the void access path of DecodeChunk. Void access always uses a single synthesized field, so this assertion helps catch any inconsistency between GetDataCache and DecodeChunk.
Apply changes based on feedback from google#272
Add assertions in EncodeChunk and DecodeChunk to verify that arrays are C-contiguous before performing direct memcpy operations:
- In EncodeChunk: verify component arrays are C-contiguous
- In DecodeChunk: verify decoded byte arrays are C-contiguous

These assertions validate assumptions about array layouts that the chunk cache relies on for correct operation. The chunk cache write path (AsyncWriteArray) allocates C-order arrays, and the codec chain produces C-contiguous decoded arrays.

Also adds the necessary includes and BUILD dependencies for IsContiguousLayout and c_order.
Replace raw memcpy loops with CopyArray using strided ArrayViews for structured type encoding and decoding. This follows the standard TensorStore pattern (as used in zarr v2 with internal::EncodeArray) where array copies are done via IterateOverArrays, which safely handles any source/destination strides.

The key insight is creating an ArrayView with strides that represent the interleaved field positions within the struct layout:
- For a field at byte_offset B within a struct of size S
- The strides are [..., S] instead of [..., field_size]
- This allows CopyArray to correctly interleave/deinterleave fields

This approach:
1. Removes the need for contiguity assertions (CopyArray handles any layout)
2. Is consistent with zarr v2's use of internal::EncodeArray
3. Uses the standard IterateOverArrays iteration pattern

The void access decode path retains its memcpy with assertion because it's a simple byte reinterpretation where both arrays are known to be C-contiguous (destination freshly allocated, source from codec chain).
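A plain-C++ illustration of the stride trick (the driver itself goes through ArrayView/CopyArray rather than an explicit loop): a field at byte offset B inside a struct of size S is addressed with an element stride of S, shown here for a hypothetical {uint8 x; int16 y;} struct:

    // Illustrative only: extract field y (int16 at byte offset B = 1) from a
    // packed array of 3-byte structs, stepping by the struct size S = 3
    // rather than by the field size.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    std::vector<std::int16_t> ExtractY(const std::vector<unsigned char>& packed,
                                       std::size_t num_structs) {
      constexpr std::size_t kStructSize = 3;  // S
      constexpr std::size_t kYOffset = 1;     // B
      std::vector<std::int16_t> out(num_structs);
      for (std::size_t i = 0; i < num_structs; ++i) {
        // Source element i lives at byte i * S + B; CopyArray performs the
        // same walk via an ArrayView whose innermost byte stride is S.
        std::memcpy(&out[i], packed.data() + i * kStructSize + kYOffset,
                    sizeof(std::int16_t));
      }
      return out;
    }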
Replace manual stride computation loops with ComputeStrides() from contiguous_layout.h. This is the standard TensorStore utility for computing C-order (or Fortran-order) byte strides given a shape and innermost element stride.

The manual loop:

    Index stride = bytes_per_outer_element;
    for (DimensionIndex i = rank; i-- > 0;) {
      strides[i] = stride;
      stride *= shape[i];
    }

is exactly equivalent to:

    ComputeStrides(c_order, bytes_per_outer_element, shape, strides);
Replace manual loops with standard library and TensorStore utilities:

1. DimensionSet::UpTo(rank) - Creates a DimensionSet with bits [0, rank) set to true. Replaces:

       DimensionSet s(false);
       for (i = 0; i < rank; ++i) s[i] = true;

2. std::fill_n for origins (all zeros) and std::copy_n for the shape copy. This is more idiomatic and clearer than explicit index loops.

These are standard patterns used throughout TensorStore for similar operations on dimension sets and shape vectors.
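A short, hedged sketch combining these utilities; the dimension_set.h header path and the surrounding function are assumptions, and the shapes are illustrative:

    // Illustrative only: idiomatic replacements for the manual loops above.
    #include <algorithm>
    #include <array>
    #include "tensorstore/index.h"
    #include "tensorstore/util/dimension_set.h"  // assumed header path

    void FillLayout() {
      const tensorstore::DimensionIndex rank = 3;
      const std::array<tensorstore::Index, 3> source_shape = {4, 4, 4};

      // Bits [0, rank) set to true, replacing the explicit set-bit loop.
      auto implicit_bounds = tensorstore::DimensionSet::UpTo(rank);
      (void)implicit_bounds;

      std::array<tensorstore::Index, 3> origin;
      std::array<tensorstore::Index, 3> shape;
      std::fill_n(origin.begin(), rank, tensorstore::Index{0});  // all-zero origins
      std::copy_n(source_shape.begin(), rank, shape.begin());    // copy the shape
    }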
The sub-chunk cache in sharding mode uses a grid from the sharding codec state, which doesn't know about void access. This caused issues:
1. Shape mismatch: The grid's component shape was [4, 4] but decoded arrays had shape [4, 4, 4] (with bytes dimension)
2. Invalid key generation: The grid's chunk_shape affected cell indexing

Fix by:
- Add `grid_has_void_dimension_` flag to track whether the grid includes the bytes dimension (false for sub-chunk caches)
- For sub-chunk caches with void access on non-structured types, create a modified grid with:
  - Component chunk_shape including bytes dimension [4, 4, 4]
  - Grid chunk_shape unchanged [4, 4] (for cell indexing)
  - Proper chunked_to_cell_dimensions mapping

This enables void access to work correctly with sharding codecs.
The ZarrShardSubChunkCache template had duplicate member variables (open_as_void_, original_is_structured_, bytes_per_element_) that were already present in the base class ChunkCacheImpl (ZarrLeafChunkCache). Access these through the ChunkCacheImpl:: prefix instead, to follow the DRY principle and maintain consistency with other TensorStore patterns.
Reviewed the code for potential inconsistencies and fixed some bugs
Supersedes #264
Resolves comments 1 and 2