Skip to content

Add TimestampWithOffset canonical extension type#558

Open
serramatutu wants to merge 19 commits intoapache:mainfrom
serramatutu:serramatutu/TimestampWithOffset/go
Open

Add TimestampWithOffset canonical extension type#558
serramatutu wants to merge 19 commits intoapache:mainfrom
serramatutu:serramatutu/TimestampWithOffset/go

Conversation

@serramatutu
Copy link
Copy Markdown
Contributor

@serramatutu serramatutu commented Oct 30, 2025

Which issue does this PR close?

This PR implements the new arrow.timestamp_with_offset canonical extension type for arrow-go.

Rationale for this change

Be compatible with Arrow spec.

What changes are included in this PR?

This commit adds a new TimestampWithOffset extension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. The offset in minutes can be primitive encoded, dictionary encoded or run-end encoded.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, this is a new canonical extension type.

@felipecrv felipecrv self-requested a review November 1, 2025 01:55
felipecrv added a commit to apache/arrow that referenced this pull request Dec 5, 2025
…48002)

### Rationale for this change

Closes #44248

Arrow has no built-in canonical way of representing the `TIMESTAMP WITH
TIME ZONE` SQL type, which is present across multiple different database
systems. Not having a native way to represent this forces users to
either convert to UTC and drop the time zone, which may have correctness
implications, or use bespoke workarounds. A new
`arrow.timestamp_with_offset` extension type would introduce a standard
canonical way of representing that information.

Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558

[DISCUSS] [thread in the mailing
list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk).

### What changes are included in this PR?

Proposal and documentation for `arrow.timestamp_with_offset` canonical
extension type.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Yes, this is an extension to the arrow format.

* GitHub Issue: #44248

---------

Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch 3 times, most recently from 95230ad to b9b8bf2 Compare December 22, 2025 14:33
@serramatutu serramatutu changed the title [DRAFT] Add TimestampWithOffset extension type Add TimestampWithOffset extension type Dec 22, 2025
@serramatutu serramatutu marked this pull request as ready for review December 22, 2025 14:42
@serramatutu serramatutu changed the title Add TimestampWithOffset extension type Add TimestampWithOffset canonical extension type Dec 22, 2025
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from b9b8bf2 to ccdd288 Compare December 22, 2025 15:17
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment on lines +506 to +507
timestamps.UnsafeAppendBoolToBitmap(false)
offsets.UnsafeAppendBoolToBitmap(false)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, these are non-nullable according to the spec.

Copy link
Copy Markdown
Contributor Author

@serramatutu serramatutu Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did change this but now the IPC roundtrip tests are failing because the internal count of nulls is different between the thing before IPC and after IPC.

To make them pass we'll need to relax the arrow.RecordEqual() to ignore NullN() when the field is not nullable. Alternatively we can change it so that an array builder sets nulls = 0 if the array is not nullable, regardless of what has been set before in calls to UnsafeAppendBoolToBitmap() etc

Related: https://lists.apache.org/thread/7gbqjwykh1ob3xbvwph3ljsdl5c7kxpd

Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset_test.go Outdated
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from ccdd288 to cee17a3 Compare January 30, 2026 12:21
@serramatutu serramatutu requested a review from zeroshade February 2, 2026 13:29
Comment thread arrow/array/encoded.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment on lines +151 to +152
*arrow.Int8Type | *arrow.Int16Type | *arrow.Int32Type | *arrow.Int64Type |
*arrow.Uint8Type | *arrow.Uint16Type | *arrow.Uint32Type | *arrow.Uint64Type
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the DictIndexType are both identical, instead of duplicating this we should either create a single type constraint or embed one in the other:

type TimestampWithOffsetRunEndsType interface {
     DictIndexType
}

My personal preference would be to have a single type though, but I'm not averse to having the two separate ones if necessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: 231cb1e

Copy link
Copy Markdown
Contributor Author

@serramatutu serramatutu Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I had to revert. Spec says those types are different:

Dict index is either signed or unsigned int of 8-64 bits

The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits.

Run ends is a signed int of 16-64 bits

The run end type of a Run-End Encoded type can only be a signed integer type with width 16 to 64 bits.

https://arrow.apache.org/docs/format/Columnar.html#data-types

Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset.go Outdated
Comment thread arrow/extensions/timestamp_with_offset_test.go Outdated
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from 2762d64 to 0249806 Compare April 10, 2026 21:56
This commit adds a new `TimestampWithOffset` extension type that can be
used to represent timestamps with per-row timezone information. It
stores information in a `struct` with 2 fields, `timestamp=[T, "UTC"]`,
where `T` can be any `arrow.TimeUnit` and `offset_minutes=int16`, which
represents the offset in minutes from the UTC timestamp.
This commit allows `TimestampWithOffset` to be dict-encoded.

 - I made `NewTimestampWithOffsetType` take in an input
  `offsetType arrow.DataType`. It returns an error if the data type is not
  valid.

- I added a new infallible `NewTimestampWithOffsetTypePrimitiveEncoded`
  to make the encoding explicit.

- I added `NewTimestampWithOffsetTypeDictionaryEncoded` which returns an
  error in case the given type is not a valid dictionary key type.

- I made all tests run in a for loop with all possible allowed encoding
  types, ensuring all encodings work.
Smartly iterate over offsets if they're run-end encoded instead of doing
a binary search at every iteration.

This makes the loops O(n) instead of O(n*logn).
Changed a lot of things based on Matt's suggestions.
Stop using `time=0` as a sentinel for null in `iterValues()`. Instead,
return a tuple of `(time.Time, bool)` where the bool indicates whether the
value is valid or not.

Added a test for it.
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from 0249806 to bc577e9 Compare April 15, 2026 10:35
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from bc577e9 to 8a7d6d7 Compare April 15, 2026 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants