fix(encoding): use LargeList internally for Utf8/Binary to prevent offset overflow by beinan · Pull Request #6811 · lance-format/lance

beinan · 2026-05-17T02:38:41Z

Summary

When reading large fragments containing variable-length data (such as strings or binary), PyArrow's 32-bit offset limit (2 GiB) can easily be exceeded, resulting in a cryptic "Offset overflow error: 2151740513" panic.

This updates the decoder to always use LargeList<UInt8> (i64 offsets) internally. When targeting Utf8 or Binary arrays, BinaryArrayDecoder will safely downcast to i32 offsets if the chunk data fits within 2 GiB. If the chunk strictly exceeds 2 GiB, it now panics with a descriptive error suggesting a batch size reduction or switching to LargeUtf8/LargeBinary.
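The narrowing step described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `downcast_offsets` is a hypothetical helper name, and the real `BinaryArrayDecoder` works on Arrow buffers rather than plain slices.

```rust
/// Hypothetical helper, not the PR's actual code: narrow 64-bit offsets to
/// 32-bit, returning an error when any offset is out of range for i32.
fn downcast_offsets(offsets: &[i64]) -> Result<Vec<i32>, String> {
    let mut narrowed = Vec::with_capacity(offsets.len());
    for &offset in offsets {
        if offset < 0 || offset > i32::MAX as i64 {
            return Err(format!(
                "offset {} does not fit in i32 (max {}); reduce the batch size \
                 or use LargeUtf8/LargeBinary",
                offset,
                i32::MAX
            ));
        }
        // Safe: the range check above guarantees the value fits.
        narrowed.push(offset as i32);
    }
    Ok(narrowed)
}

fn main() {
    // Three values totalling 11 bytes: the offsets fit comfortably in i32.
    assert_eq!(downcast_offsets(&[0, 5, 5, 11]), Ok(vec![0, 5, 5, 11]));

    // The value from the reported error is past i32::MAX (2147483647).
    let err = downcast_offsets(&[0, 2_151_740_513]).unwrap_err();
    println!("{}", err);
}
```

Keeping the offsets as i64 until the end means the overflow is detected in one place, where a useful message can be attached, rather than surfacing as a failed addition deep inside the decoder.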

Fixes #2775.

🤖 Generated with MAI Agents

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

beinan · 2026-05-17T04:36:06Z

I ran some quick benchmarks on an x86 pod to measure the impact of the i64 offset buffering.

Here are the results comparing this branch against main for the UTF-8 decoder benchmarks:

Standard UTF-8 Data (decode_primitive/utf8)

Main: 832.32 µs (458.83 MiB/s)
This PR: 830.20 µs (460.00 MiB/s)
Verdict: No change in performance detected. The vectorized i64 math combined with the fast offset downcast adds essentially zero overhead.

Fixed-Length UTF-8 Data (decode_primitive/fixed-utf8)

Main: 195.93 µs (778.77 MiB/s)
This PR: 217.07 µs (702.93 MiB/s)
Verdict: roughly a 10% regression by the figures above (about 9.7% lower throughput, or 10.8% more time per iteration). Building the internal LargeList buffer does slightly more work (allocating and writing 64-bit offsets), which makes a statistically significant difference when the strings are strictly fixed-size and the benchmark finishes in microseconds.

Overall, the safety and crash prevention added by the i64 offset buffering have no measurable impact on typical variable-length string data, with a small penalty for perfectly fixed-length string columns.
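The relative change implied by the quoted fixed-utf8 figures can be recomputed directly. The sketch below uses only the four numbers given in this comment. A benchmark tool's own change estimate may be derived from different statistics, so it need not match exactly.

```rust
/// Relative change from `before` to `after`, in percent.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Figures quoted above for decode_primitive/fixed-utf8.
    let time_change = percent_change(195.93, 217.07); // µs per iteration
    let throughput_change = percent_change(778.77, 702.93); // MiB/s

    println!("time per iteration: {:+.1}%", time_change); // +10.8%
    println!("throughput:         {:+.1}%", throughput_change); // -9.7%
}
```

The two percentages differ because time and throughput are reciprocals: a 10.8% increase in time corresponds to a 1 - 1/1.108 ≈ 9.7% drop in throughput.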

codecov · 2026-05-17T06:22:45Z

Codecov Report

❌ Patch coverage is 73.11828% with 25 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...-encoding/src/previous/encodings/logical/binary.rs	72.61%	22 Missing and 1 partial ⚠️
rust/lance-encoding/src/decoder.rs	77.77%	0 Missing and 2 partials ⚠️
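The reported percentages are consistent with one another if partially covered lines count as missing. The line totals below (84, 9 and 93) are inferred from the percentages and do not appear in the report, so treat them as an assumption.

```rust
/// Coverage percentage given the total changed lines and the number of
/// lines that are missing or only partially covered.
fn coverage_percent(total: u32, not_covered: u32) -> f64 {
    (total - not_covered) as f64 / total as f64 * 100.0
}

fn main() {
    // Inferred totals (assumption): 84 changed lines in binary.rs and
    // 9 in decoder.rs, 93 in the whole patch.
    let binary = coverage_percent(84, 22 + 1); // 22 missing, 1 partial
    let decoder = coverage_percent(9, 2); // 0 missing, 2 partials
    let patch = coverage_percent(84 + 9, 25); // 25 lines missing coverage

    println!("binary.rs:  {:.4}%", binary);
    println!("decoder.rs: {:.4}%", decoder);
    println!("patch:      {:.5}%", patch);
}
```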


claude Bot reviewed May 17, 2026


github-actions Bot added the bug Something isn't working label May 17, 2026

Beinan Wang added 2 commits May 16, 2026 19:49

fix(encoding): return Result instead of panicking on offset overflow

9a907b8

Fix location macro in binary.rs

8e93d97

fix: use lance_core::Error::internal instead of Internal struct

ef7c7d2

Beinan Wang added 6 commits May 16, 2026 23:38

style: run cargo fmt

785d6cc

test: add coverage for LargeList to List downcast and overflow protection

72bb5b1

test: fix InvalidInput error matching in test

bb5e77f

style: fix rustfmt errors (semicolon and line length)

6c09146

style: fix remaining rustfmt formatting issues

7d32f0a

test: fix binary overflow regression test

aacbf22

beinan wants to merge 10 commits into lance-format:main from beinan:fix-arrow-offset-overflow
