GH-50027: [Format][C++][Python] Add arrow.range canonical extension type #50028

Open
Hoeze wants to merge 8 commits into apache:main from Hoeze:feat/arrow-range-extension

Hoeze commented May 24, 2026

Rationale for this change

This is a draft implementation for #50027, a new canonical range extension type.

What changes are included in this PR?

This PR provides the spec text, a C++ reference implementation, PyArrow bindings, and the supporting documentation.

Are these changes tested?

I ran the tests locally but have not yet tried the changes in any other project.

Note that I made heavy use of AI to create this PR and copied many structures from the fixed shape tensor extension type. I reviewed each change and hope the changes I made are meaningful.
Nevertheless, I am not sure whether the C++ parts are comprehensive or if I missed anything; this is my first contribution to Arrow.

Are there any user-facing changes?

No, this is an addition of a new canonical extension type.

Hoeze added 8 commits May 24, 2026 13:03
Add a canonical extension type for bounded ranges (mathematical intervals),
distinct from Arrow's calendar Interval (duration) type.

- Spec: docs/source/format/CanonicalExtensions.rst adds the Range section.
  Storage is Struct<lower, upper> with both bounds nullable (null = +/-infinity,
  treated as exclusive). A closed parameter (left/right/both/neither, pandas
  vocabulary) is carried as JSON extension metadata; the subtype is read from
  storage. Disambiguates from the calendar Interval type per DB convention
  (INTERVAL = duration, RANGE/PERIOD = bounded set).
- C++ reference impl: cpp/src/arrow/extension/range.{h,cc} (RangeType/RangeArray)
  with serialize/deserialize, storage validation, registration in the global
  registry, tests, and CMake/meson wiring.
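
To make the bound semantics described in the spec bullet above concrete, here is a minimal Python sketch of a membership test under those rules. It is not code from this PR: `range_contains` and `VALID_CLOSED` are hypothetical names, plain Python values stand in for the `lower` and `upper` struct fields, and `None` stands in for a null (unbounded) bound.

```python
from typing import Optional

# The four values of the "closed" parameter named in the spec text.
VALID_CLOSED = ("left", "right", "both", "neither")


def range_contains(
    lower: Optional[float],
    upper: Optional[float],
    closed: str,
    value: float,
) -> bool:
    """Return True if `value` lies in the range described by the arguments.

    Hypothetical helper for illustration only; not part of Arrow or PyArrow.
    """
    if closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {VALID_CLOSED}, got {closed!r}")

    # A null lower bound means -infinity, so it never excludes a finite value.
    if lower is not None:
        lower_inclusive = closed in ("left", "both")
        if value < lower or (value == lower and not lower_inclusive):
            return False

    # A null upper bound means +infinity, so it never excludes a finite value.
    if upper is not None:
        upper_inclusive = closed in ("right", "both")
        if value > upper or (value == upper and not upper_inclusive):
            return False

    return True
```

With `closed="left"`, `range_contains(0, 10, "left", 0)` is true and `range_contains(0, 10, "left", 10)` is false, which is the half-open `[lower, upper)` reading.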

The closedness is no longer defaulted on the wire: empty metadata or a JSON
object without a "closed" key is now rejected by Deserialize, so a serialized
arrow.range is always unambiguous. The C++ convenience default argument for
constructing a RangeType in code is left-closed ([lower, upper)), matching the
PostgreSQL/Rust/Python range convention. Spec and tests updated.
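
The stricter deserialization rule described above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the C++ `Deserialize` implementation from this PR: `parse_range_metadata` is a hypothetical name, and the only details taken from the description are that the metadata is a JSON object, that it must carry a `"closed"` key, and that the value is one of left/right/both/neither.

```python
import json

# The four values of the "closed" parameter named in the spec text.
VALID_CLOSED = ("left", "right", "both", "neither")


def parse_range_metadata(serialized: str) -> str:
    """Return the `closed` value from serialized metadata, or raise ValueError.

    Hypothetical helper for illustration only; not part of Arrow or PyArrow.
    """
    # Empty metadata is rejected instead of falling back to a default.
    if not serialized:
        raise ValueError("arrow.range metadata must not be empty")

    try:
        metadata = json.loads(serialized)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arrow.range metadata is not valid JSON: {exc}") from exc

    if not isinstance(metadata, dict):
        raise ValueError("arrow.range metadata must be a JSON object")

    # A JSON object without a "closed" key is rejected as well.
    if "closed" not in metadata:
        raise ValueError('arrow.range metadata is missing the "closed" key')

    closed = metadata["closed"]
    if closed not in VALID_CLOSED:
        raise ValueError(f'"closed" must be one of {VALID_CLOSED}, got {closed!r}')
    return closed
```

Rejecting a missing key on the wire while keeping a default only in the constructor means the default is a coding convenience and never something a reader of serialized data has to infer.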

Verified by building the arrow-canonical-extensions-test target (50/50 pass,
10/10 RangeType). Two fixes to the previously-uncompiled test:
- include arrow/array/array_nested.h for the full StructArray definition
  (it is only forward-declared in type_fwd.h).
- wrap the CheckDeserialize helper in an anonymous namespace to avoid a
  link-time collision with the identically named helper in opaque_test.cc.
@github-actions

⚠️ GitHub issue #50027 has been automatically assigned in GitHub to PR creator.
