[Format] Add arrow.range canonical extension type for bounded ranges #50027

@Hoeze

Describe the enhancement requested

Arrow has no canonical way to represent a bounded range (a mathematical interval with a lower and an upper endpoint), e.g. a numeric range [0, 10), a date range, or a timestamp period. Today such data is modeled ad hoc, either with two separate columns or with system-specific extension types, which hurts interoperability. A canonical range type would be useful to libraries such as pandas, Polars/Polars-bio, IRanges/PyRanges, and database connectors.

Note this is distinct from Arrow's existing calendar Interval type (INTERVAL_MONTHS / INTERVAL_DAY_TIME / INTERVAL_MONTH_DAY_NANO), which represents a duration (a signed amount of time), not a bounded set. Databases like PostgreSQL make the same distinction: SQL uses INTERVAL for durations and RANGE / PERIOD for bounded sets. This proposal follows that convention by naming the type arrow.range.

Proposed design:

  • Extension name: arrow.range.

  • Storage type: Struct<lower: T, upper: T>. When subtype T is nullable, a null bound represents an unbounded (infinite) endpoint.

    • Field names lower / upper (PostgreSQL convention) are chosen deliberately for ordering clarity. Note that pandas uses left / right for the corresponding bounds.
    • The subtype T may be any orderable Arrow type (the numeric, temporal and decimal families, etc.). Nested or non-comparable types are out of scope.
  • Metadata: a JSON object {"closed": "..."}.

    • Parameter closed: one of left, right, both, neither (pandas vocabulary). left = lower inclusive, upper exclusive; right = lower exclusive, upper inclusive; both = both bounds inclusive; neither = both bounds exclusive.
    • closed is required on the wire so a serialized arrow.range is always unambiguous. Unknown JSON keys are ignored for forward compatibility.
  • A range is empty implicitly when lower > upper, or when lower == upper with at least one bound exclusive. A range with lower > upper is therefore valid (it denotes the empty set), not an error.
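To make the metadata and emptiness rules above concrete, here is a minimal pure-Python sketch. It is not tied to any Arrow implementation; the function names (serialize_range_metadata, parse_range_metadata, is_empty) are hypothetical and only illustrate the proposed semantics. One assumption goes beyond the text above: a range with a null (unbounded) endpoint is treated as non-empty, since the proposal does not say how emptiness interacts with unbounded endpoints.

```python
import json

# Vocabulary proposed for the "closed" metadata key.
VALID_CLOSED = {"left", "right", "both", "neither"}


def serialize_range_metadata(closed):
    """Encode the extension metadata as a JSON object (hypothetical helper)."""
    if not isinstance(closed, str) or closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {sorted(VALID_CLOSED)}")
    return json.dumps({"closed": closed}).encode("utf-8")


def parse_range_metadata(raw):
    """Decode the metadata; 'closed' is required, unknown keys are ignored."""
    obj = json.loads(raw.decode("utf-8"))
    if not isinstance(obj, dict) or "closed" not in obj:
        raise ValueError("metadata must be a JSON object with a 'closed' key")
    closed = obj["closed"]
    if not isinstance(closed, str) or closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {sorted(VALID_CLOSED)}")
    return closed  # any other keys in obj are deliberately not inspected


def is_empty(lower, upper, closed):
    """Apply the implicit emptiness rule to a single range.

    None stands for a null bound, i.e. an unbounded endpoint.
    Assumption: a range with an unbounded endpoint is never empty.
    """
    if lower is None or upper is None:
        return False
    if lower > upper:
        return True  # valid, denotes the empty set
    if lower == upper:
        # Only [x, x] contains a point; any exclusive bound makes it empty.
        return closed != "both"
    return False
```

For example, is_empty(3, 3, "left") is True because [3, 3) contains no point, while is_empty(3, 3, "both") is False because [3, 3] contains exactly 3.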

Relation to pandas

This mirrors pandas' interval support and deliberately reuses its vocabulary:

  • pandas.Interval is the scalar form: an immutable bounded interval whose closed parameter takes exactly left, right, both, or neither. This is the vocabulary adopted here for the closed metadata.
  • pandas.IntervalIndex / pandas.arrays.IntervalArray is the columnar form: it stores parallel left and right bound arrays, directly analogous to the proposed Struct<lower, upper> storage.
  • Crucially, closed is part of pandas' dtype itself (interval[T, left] and interval[T, right] are distinct dtypes), so a typed interval column carries exactly one closed value. Constructing an array from intervals with differing closedness raises ValueError, and concatenating columns of differing closedness falls back to untyped object dtype. This uniform, type-level closed maps one-to-one onto the proposed object-level closed metadata; no per-element closedness is required.

So arrow.range would give pandas' IntervalArray / IntervalIndex a natural, lossless Arrow representation for round-tripping.
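As a rough illustration of that round trip, the sketch below splits a pandas IntervalArray into the proposed lower / upper storage plus closed metadata, then rebuilds it. It uses only existing pandas APIs (IntervalArray.from_arrays, .left, .right, .closed); plain Python dicts stand in for the Arrow struct storage and the JSON metadata, since arrow.range itself does not exist yet.

```python
import pandas as pd

# A typed interval column: every element shares closed="left".
intervals = pd.arrays.IntervalArray.from_arrays(
    [0, 10, 20], [10, 20, 30], closed="left"
)

# Stand-in for Struct<lower: T, upper: T> storage (not a real Arrow array).
storage = {
    "lower": intervals.left.tolist(),
    "upper": intervals.right.tolist(),
}
# Stand-in for the proposed {"closed": "..."} extension metadata.
metadata = {"closed": intervals.closed}

# Rebuild the pandas array from storage + metadata alone.
restored = pd.arrays.IntervalArray.from_arrays(
    storage["lower"], storage["upper"], closed=metadata["closed"]
)
```

Because closed lives on the dtype in pandas, a single metadata value is enough to restore the whole column; nothing per-element needs to be stored.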

Component(s)

Format
