GH-50027: [Format][C++][Python] Add arrow.range canonical extension type#50028
Open
Hoeze wants to merge 8 commits into
Add a canonical extension type for bounded ranges (mathematical intervals),
distinct from Arrow's calendar Interval (duration) type.
- Spec: docs/source/format/CanonicalExtensions.rst adds the Range section.
Storage is Struct<lower, upper> with both bounds nullable (null = +/-infinity,
treated as exclusive). A closed parameter (left/right/both/neither, pandas
vocabulary) is carried as JSON extension metadata; the subtype is read from
storage. Disambiguates from the calendar Interval type per DB convention
(INTERVAL = duration, RANGE/PERIOD = bounded set).
- C++ reference impl: cpp/src/arrow/extension/range.{h,cc} (RangeType/RangeArray)
with serialize/deserialize, storage validation, registration in the global
registry, tests, and CMake/meson wiring.
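The bound semantics described above (null bound = unbounded and exclusive, closedness chosen from left/right/both/neither) can be illustrated with a small pure-Python sketch. This is not the PR's C++ or PyArrow code; the function name `contains` and the constant `VALID_CLOSED` are hypothetical and exist only for illustration.

```python
# Hypothetical helper illustrating the range semantics from the spec text.
# None stands in for an Arrow null bound (-infinity for lower, +infinity for upper).
VALID_CLOSED = ("left", "right", "both", "neither")


def contains(value, lower, upper, closed):
    """Return True if value lies in the range described by the bounds."""
    if closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {VALID_CLOSED}, got {closed!r}")

    if lower is None:
        # Unbounded below: every finite value is above -infinity.
        lower_ok = True
    elif closed in ("left", "both"):
        lower_ok = value >= lower
    else:
        lower_ok = value > lower

    if upper is None:
        # Unbounded above: every finite value is below +infinity.
        upper_ok = True
    elif closed in ("right", "both"):
        upper_ok = value <= upper
    else:
        upper_ok = value < upper

    return lower_ok and upper_ok
```

For example, `contains(5, 1, 5, "left")` is false because the upper bound is open, while `contains(5, 1, None, "left")` is true because a null upper bound means the range is unbounded above.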
The closedness is no longer defaulted on the wire: empty metadata or a JSON object without a "closed" key is now rejected by Deserialize, so a serialized arrow.range is always unambiguous. The C++ convenience default argument for constructing a RangeType in code is left-closed ([lower, upper)), matching the PostgreSQL/Rust/Python range convention. Spec and tests updated.
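A minimal sketch of the strict metadata handling described above, written in pure Python with only the standard library. It assumes the metadata is a JSON object of the form `{"closed": "<value>"}`; the exact serialized layout and the function names `serialize_closed` and `deserialize_closed` are assumptions for illustration, not the PR's actual API.

```python
import json

VALID_CLOSED = ("left", "right", "both", "neither")


def serialize_closed(closed):
    """Write the closedness explicitly; no default is applied on the wire."""
    if closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {VALID_CLOSED}, got {closed!r}")
    return json.dumps({"closed": closed})


def deserialize_closed(serialized):
    """Parse metadata, rejecting empty input and objects without a "closed" key."""
    if not serialized:
        raise ValueError("empty metadata: 'closed' must be given explicitly")
    try:
        obj = json.loads(serialized)
    except json.JSONDecodeError as exc:
        raise ValueError(f"metadata is not valid JSON: {exc}") from exc
    if not isinstance(obj, dict) or "closed" not in obj:
        raise ValueError("metadata must be a JSON object with a 'closed' key")
    closed = obj["closed"]
    if closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {VALID_CLOSED}, got {closed!r}")
    return closed
```

The point of the design is that the left-closed default exists only as a construction convenience in code, so two implementations can never disagree about what a serialized type with missing metadata means.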
Verified by building the arrow-canonical-extensions-test target (50/50 pass, 10/10 RangeType). Two fixes to the previously-uncompiled test:
- include arrow/array/array_nested.h for the full StructArray definition (it is only forward-declared in type_fwd.h).
- wrap the CheckDeserialize helper in an anonymous namespace to avoid a link-time collision with the identically named helper in opaque_test.cc.
Rationale for this change
This is a draft implementation for #50027, a new canonical range extension type.
What changes are included in this PR?
This PR provides the spec text, a C++ reference implementation, PyArrow bindings, and the supporting documentation.
Are these changes tested?
I ran the tests locally but have not yet tried the changes in any other project.
Note that I made heavy use of AI to create this PR and copied many structures from the fixed shape tensor extension type. I reviewed each change and hope the changes I made are meaningful.
Nevertheless, I am not sure whether the C++ parts are comprehensive or if I missed anything; this is my first contribution to Arrow.
Are there any user-facing changes?
No, this is an addition of a new canonical extension type.
arrow.range canonical extension type for bounded ranges #50027