fix(encoding): use LargeList internally for Utf8/Binary to prevent offset overflow#6811

Open
beinan wants to merge 10 commits into
lance-format:mainfrom
beinan:fix-arrow-offset-overflow
Contributor

@beinan beinan commented May 17, 2026

Summary

When reading large fragments that contain variable-length data (such as strings or binary), PyArrow's 32-bit (2 GiB) offset limit is easily exceeded, resulting in a cryptic `Offset overflow error: 2151740513` panic.
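To make the number in that error message concrete, here is a minimal sketch showing that the reported offset lies just past the largest value a signed 32-bit offset can hold. The constant and function names are hypothetical and are not part of the Lance codebase.

```rust
/// The offset value quoted in the error message above.
const REPORTED_OFFSET: i64 = 2_151_740_513;

/// Returns how far `offset` lies beyond `i32::MAX`, or `None` if it still
/// fits in a 32-bit signed offset. Hypothetical helper for illustration only.
fn overflow_past_i32(offset: i64) -> Option<i64> {
    let max = i64::from(i32::MAX); // 2_147_483_647
    if offset > max {
        Some(offset - max)
    } else {
        None
    }
}
```

Calling `overflow_past_i32(REPORTED_OFFSET)` yields `Some(4_256_866)`, meaning the chunk was roughly 4 MiB over the limit.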

This updates the decoder to always use `LargeList<UInt8>` (i64 offsets) internally. When the target is a `Utf8` or `Binary` array, `BinaryArrayDecoder` downcasts to i32 offsets if the chunk data fits within 2 GiB. If the chunk exceeds 2 GiB, it now panics with a descriptive error that suggests reducing the batch size or switching to `LargeUtf8`/`LargeBinary`.
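The downcast step described above could look roughly like the following sketch. It is not the code from this PR: the function name is hypothetical, it works on a plain slice rather than Arrow buffers, and it returns a `Result` instead of panicking so that the failure path is easy to inspect.

```rust
/// Hypothetical helper: narrow i64 offsets to i32 offsets when every offset fits.
/// On failure, the error message names the two remedies mentioned in the PR description.
fn try_downcast_offsets(offsets: &[i64]) -> Result<Vec<i32>, String> {
    offsets
        .iter()
        .map(|&offset| {
            // `i32::try_from` rejects anything outside the i32 range
            // instead of silently truncating it.
            i32::try_from(offset).map_err(|_| {
                format!(
                    "offset {offset} does not fit in i32 (max {}); reduce the batch size \
                     or read the column as LargeUtf8/LargeBinary",
                    i32::MAX
                )
            })
        })
        .collect()
}
```

With small inputs such as `&[0, 3, 7]` the offsets come back unchanged as `i32`; an input containing `2_151_740_513` produces the descriptive error.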

Fixes #2775.

🤖 Generated with MAI Agents

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 17, 2026
Contributor Author

beinan commented May 17, 2026

I ran some quick benchmarks on an x86 pod to measure the impact of the i64 offset buffering.

Here are the results comparing this branch against main for the UTF-8 decoder benchmarks:

Standard UTF-8 Data (decode_primitive/utf8)

  • Main: 832.32 µs (458.83 MiB/s)
  • This PR: 830.20 µs (460.00 MiB/s)
  • Verdict: No change in performance detected. The vectorized i64 math combined with the fast offset downcast adds essentially zero overhead.

Fixed-Length UTF-8 Data (decode_primitive/fixed-utf8)

  • Main: 195.93 µs (778.77 MiB/s)
  • This PR: 217.07 µs (702.93 MiB/s)
  • Verdict: ~9.7% regression in throughput (about 10.8% more time per iteration, computed from the figures above). Building the LargeList internal buffer does slightly more work (allocating and writing 64-bit offsets), which makes a statistically significant difference when the strings are strictly fixed-size and the benchmark finishes in microseconds.
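One way to see why the wider offsets cost more for short fixed-size strings is to compare the size of the offsets buffer with the size of the value data. The sketch below uses hypothetical helper names and made-up string sizes; the benchmark's actual row count and string length are not stated in this thread.

```rust
/// Bytes needed for the offsets buffer of a variable-length array:
/// `num_values + 1` offsets, each `offset_width` bytes wide.
fn offset_buffer_bytes(num_values: usize, offset_width: usize) -> usize {
    (num_values + 1) * offset_width
}

/// Size of the offsets buffer relative to the value data when every string
/// is `value_len` bytes long. Assumes `num_values * value_len` is non-zero.
fn offsets_to_data_ratio(num_values: usize, value_len: usize, offset_width: usize) -> f64 {
    offset_buffer_bytes(num_values, offset_width) as f64 / (num_values * value_len) as f64
}
```

For example, with 1,000 strings of 8 bytes each, i32 offsets take 4,004 bytes and i64 offsets take 8,008 bytes against 8,000 bytes of value data, so the offsets buffer goes from about half the size of the data to slightly more than all of it.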

Overall, the i64 offset buffering adds safety and crash prevention with no measurable impact on typical variable-length string data, at the cost of a small penalty for strictly fixed-length string columns.
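The relative changes can be recomputed directly from the reported timings and throughputs. The following is a small sketch of that arithmetic with a hypothetical helper name; it uses only the figures quoted in the benchmark lists above.

```rust
/// Percentage change of `candidate` relative to `baseline`.
/// Positive means the candidate value is larger than the baseline.
fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Time change for decode_primitive/utf8 (µs, main vs. this PR).
fn utf8_time_change() -> f64 {
    percent_change(832.32, 830.20)
}

/// Time change for decode_primitive/fixed-utf8 (µs, main vs. this PR).
fn fixed_utf8_time_change() -> f64 {
    percent_change(195.93, 217.07)
}

/// Throughput change for decode_primitive/fixed-utf8 (MiB/s, main vs. this PR).
fn fixed_utf8_throughput_change() -> f64 {
    percent_change(778.77, 702.93)
}
```

From these figures, the fixed-utf8 case takes about 10.8% more time per iteration, which corresponds to a throughput drop of about 9.7%, while the variable-length utf8 case changes by roughly a quarter of a percent.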


codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 73.11828% with 25 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...-encoding/src/previous/encodings/logical/binary.rs | 72.61% | 22 Missing and 1 partial ⚠️ |
| rust/lance-encoding/src/decoder.rs | 77.77% | 0 Missing and 2 partials ⚠️ |

Development

Successfully merging this pull request may close these issues.

Offset overflow errors can be confusing for users