Add TimestampWithOffset canonical extension type#558
Add TimestampWithOffset canonical extension type#558serramatutu wants to merge 19 commits intoapache:mainfrom
TimestampWithOffset canonical extension type#558Conversation
…48002) ### Rationale for this change Closes #44248 Arrow has no built-in canonical way of representing the `TIMESTAMP WITH TIME ZONE` SQL type, which is present across multiple different database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new `arrow.timestamp_with_offset` extension type would introduce a standard canonical way of representing that information. Rust implementation: apache/arrow-rs#8743 Go implementation: apache/arrow-go#558 [DISCUSS] [thread in the mailing list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk). ### What changes are included in this PR? Proposal and documentation for `arrow.timestamp_with_offset` canonical extension type. ### Are these changes tested? N/A ### Are there any user-facing changes? Yes, this is an extension to the arrow format. * GitHub Issue: #44248 --------- Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
95230ad to
b9b8bf2
Compare
TimestampWithOffset extension typeTimestampWithOffset extension type
TimestampWithOffset extension typeTimestampWithOffset canonical extension type
b9b8bf2 to
ccdd288
Compare
| timestamps.UnsafeAppendBoolToBitmap(false) | ||
| offsets.UnsafeAppendBoolToBitmap(false) |
There was a problem hiding this comment.
same as above, these are non-nullable according to the spec.
There was a problem hiding this comment.
I did change this but now the IPC roundtrip tests are failing because the internal count of nulls is different between the thing before IPC and after IPC.
To make them pass we'll need to relax the arrow.RecordEqual() to ignore NullN() when the field is not nullable. Alternatively we can change it so that an array builder sets nulls = 0 if the array is not nullable, regardless of what has been set before in calls to UnsafeAppendBoolToBitmap() etc
Related: https://lists.apache.org/thread/7gbqjwykh1ob3xbvwph3ljsdl5c7kxpd
ccdd288 to
cee17a3
Compare
| *arrow.Int8Type | *arrow.Int16Type | *arrow.Int32Type | *arrow.Int64Type | | ||
| *arrow.Uint8Type | *arrow.Uint16Type | *arrow.Uint32Type | *arrow.Uint64Type |
There was a problem hiding this comment.
this and the DictIndexType are both identical, instead of duplicating this we should either create a single type constraint or embed one in the other:
type TimestampWithOffsetRunEndsType interface {
DictIndexType
}My personal preference would be to have a single type though, but I'm not averse to having the two separate ones if necessary.
There was a problem hiding this comment.
Actually I had to revert. Spec says those types are different:
Dict index is either signed or unsigned int of 8-64 bits
The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits.
Run ends is a signed int of 16-64 bits
The run end type of a Run-End Encoded type can only be a signed integer type with width 16 to 64 bits.
https://arrow.apache.org/docs/format/Columnar.html#data-types
2762d64 to
0249806
Compare
This commit adds a new `TimestampWithOffset` extension type that can be used to represent timestamps with per-row timezone information. It stores information in a `struct` with 2 fields, `timestamp=[T, "UTC"]`, where `T` can be any `arrow.TimeUnit` and `offset_minutes=int16`, which represents the offset in minutes from the UTC timestamp.
This commit allows `TimestampWithOffset` to be dict-encoded. - I made `NewTimestampWithOffsetType` take in an input `offsetType arrow.DataType`. It returns an error if the data type is not valid. - I added a new infallible `NewTimestampWithOffsetTypePrimitiveEncoded` to make the encoding explicit. - I added `NewTimestampWithOffsetTypeDictionaryEncoded` which returns an error in case the given type is not a valid dictionary key type. - I made all tests run in a for loop with all possible allowed encoding types, ensuring all encodings work.
Smartly iterate over offsets if they're run-end encoded instead of doing a binary search at every iteration. This makes the loops O(n) instead of O(n*logn).
Changed a lot of things based on Matt's suggestions.
Stop using `time=0` as a sentinel for null in `iterValues()`. Instead, return a tuple of `(time.Time, bool)` where the bool indicates whether the value is valid or not. Added a test for it.
0249806 to
bc577e9
Compare
bc577e9 to
8a7d6d7
Compare
Which issue does this PR close?
This PR implements the new
arrow.timestamp_with_offsetcanonical extension type forarrow-go.Rationale for this change
Be compatible with Arrow spec.
What changes are included in this PR?
This commit adds a new
TimestampWithOffsetextension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. The offset in minutes can be primitive encoded, dictionary encoded or run-end encoded.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, this is a new canonical extension type.