Describe the enhancement requested
Arrow has no canonical way to represent a bounded range (a mathematical interval with a lower and an upper endpoint), e.g. a numeric range [0, 10), a date range, or a timestamp period. Today such data is modeled ad hoc with two separate columns or with system-specific extension types, which hurts interoperability. A canonical range type would be useful to libraries such as pandas, Polars/Polars-bio, IRanges/PyRanges, and database connectors.
Note this is distinct from Arrow's existing calendar Interval type (INTERVAL_MONTHS / INTERVAL_DAY_TIME / INTERVAL_MONTH_DAY_NANO), which represents a duration (a signed amount of time), not a bounded set. Databases like PostgreSQL make the same distinction: SQL uses INTERVAL for durations and RANGE / PERIOD for bounded sets. This proposal follows that convention by naming the type arrow.range.
Proposed design:
- Extension name: arrow.range.
- Storage type: Struct<lower: T, upper: T>. When subtype T is nullable, a null bound represents an unbounded (infinite) endpoint.
  - Field names lower / upper (PostgreSQL convention) are chosen deliberately for ordering clarity. (Note that pandas uses left / right for the field names.)
  - The subtype T may be any orderable Arrow type (the numeric, temporal and decimal families, etc.). Nested or non-comparable types are out of scope.
- Metadata: a JSON object {"closed": "..."}.
  - Parameter closed: one of left, right, both, neither (pandas vocabulary; left = lower inclusive / upper exclusive, etc.).
  - closed is required on the wire so a serialized arrow.range is always unambiguous. Unknown JSON keys are ignored for forward compatibility.
- A range is empty implicitly when lower > upper, or when lower == upper with at least one bound exclusive. A range with lower > upper is therefore valid (it denotes the empty set), not an error.
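
As a rough illustration of the metadata and emptiness rules above, here is a minimal pure-Python sketch. The helper names `parse_closed` and `is_empty` are hypothetical and not part of any Arrow library. `None` stands in for a null (unbounded) bound, and treating a range with a null bound as non-empty is an assumption, since the proposal only defines emptiness for two concrete bounds.

```python
import json

# The four values allowed for the `closed` parameter in the proposal.
VALID_CLOSED = {"left", "right", "both", "neither"}


def parse_closed(serialized: bytes) -> str:
    """Read the required `closed` key from the JSON metadata.

    Unknown keys are ignored; a missing or invalid `closed` is an error.
    (Hypothetical helper, for illustration only.)
    """
    meta = json.loads(serialized.decode("utf-8"))
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")
    closed = meta.get("closed")
    if closed not in VALID_CLOSED:
        raise ValueError(f"invalid or missing 'closed': {closed!r}")
    return closed


def is_empty(lower, upper, closed: str) -> bool:
    """Apply the implicit-emptiness rule to a single range.

    `None` models a null (unbounded) bound. Assumption: a range with
    an unbounded endpoint is never considered empty.
    """
    if lower is None or upper is None:
        return False
    if lower > upper:
        return True
    # Equal bounds are empty unless both endpoints are inclusive.
    return lower == upper and closed != "both"
```

For example, `is_empty(3, 3, "left")` is true because [3, 3) contains no values, while `is_empty(3, 3, "both")` is false because [3, 3] contains exactly the value 3.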
Relation to pandas
This mirrors pandas' interval support and deliberately reuses its vocabulary:
- pandas.Interval is the scalar form: an immutable bounded interval whose closed parameter takes exactly left, right, both, or neither, which is the vocabulary adopted here for the closed metadata.
- pandas.IntervalIndex / pandas.arrays.IntervalArray is the columnar form: it stores parallel left and right bound arrays, directly analogous to the proposed Struct<lower, upper> storage.
- Crucially, closed is part of pandas' dtype itself (interval[T, left] and interval[T, right] are distinct dtypes), so a typed interval column carries exactly one closed: constructing an array from intervals with differing closedness raises ValueError, and concatenating columns of differing closedness falls back to untyped object dtype. This uniform, type-level closed maps one-to-one onto the proposed object-level closed metadata; no per-element closedness is required.
So arrow.range would give pandas' IntervalArray / IntervalIndex a natural, lossless Arrow representation for round-tripping.
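
The pandas behavior described above can be checked with a short sketch. The `storage` dict and `metadata` string are only a hypothetical illustration of how an IntervalArray's bounds and closed value might line up with the proposed layout; they are not an existing pandas or Arrow API.

```python
import json

import pandas as pd

# Two intervals closed on the left: [0, 10) and [10, 20).
intervals = pd.arrays.IntervalArray.from_breaks([0, 10, 20], closed="left")

# Hypothetical mapping onto the proposed layout: the parallel left/right
# arrays become the lower/upper struct fields, and the array-wide
# `closed` value becomes the type-level metadata.
storage = {
    "lower": intervals.left.tolist(),
    "upper": intervals.right.tolist(),
}
metadata = json.dumps({"closed": intervals.closed})

# A single typed array cannot mix closedness: pandas raises ValueError.
try:
    pd.arrays.IntervalArray(
        [pd.Interval(0, 1, closed="left"), pd.Interval(1, 2, closed="right")]
    )
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Because closed is a single value per array, one metadata entry is enough to describe every element, which is what makes the round trip lossless in this sketch.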
Component(s)
Format