fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325
Open
zhf999 wants to merge 2 commits into
Open
fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325zhf999 wants to merge 2 commits into
zhf999 wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds coverage and fixes range computation for Parquet column chunks that include a dictionary page, preventing computed read ranges from exceeding the actual chunk boundary.
Changes:
- Fix
ComputePageRangesto account for dictionary page size when computing data-page read ranges andchunk_end. - Add an end-to-end unit test reproducing/guarding the dictionary-encoding chunk-end overshoot bug.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/paimon/format/parquet/page_filtered_row_group_reader.cpp | Adjusts chunk boundary and fallback read range sizing when a dictionary page is present. |
| src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp | Adds a regression test ensuring computed ranges don’t exceed the true column chunk end with dictionary encoding. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
318
to
+331
| int64_t data_page_offset = col_chunk->data_page_offset(); | ||
| int64_t total_compressed_size = col_chunk->total_compressed_size(); | ||
| int64_t chunk_end = data_page_offset + total_compressed_size; | ||
|
|
||
| int64_t data_page_compressed_size = col_chunk->total_compressed_size(); | ||
| // Dictionary page: always include if present | ||
| if (col_chunk->has_dictionary_page()) { | ||
| int64_t dict_offset = col_chunk->dictionary_page_offset(); | ||
| int64_t dict_size = data_page_offset - dict_offset; | ||
| // if dictionary exists, the data page size should be reduced by the dictionary | ||
| data_page_compressed_size -= dict_size; | ||
| if (dict_size > 0) { | ||
| ranges.push_back({dict_offset, dict_size}); | ||
| } | ||
| } | ||
|
|
||
| int64_t chunk_end = data_page_offset + data_page_compressed_size; |
Comment on lines
+744
to
+746
| auto val_array = val_builder.Finish().ValueOrDie(); | ||
| auto field = arrow::field("val", arrow::int32()); | ||
| auto struct_array = arrow::StructArray::Make({val_array}, {field}).ValueOrDie(); |
Comment on lines
+753
to
+754
| ASSERT_OK_AND_ASSIGN(std::shared_ptr<OutputStream> out, | ||
| fs_->Create(file_name, /*overwrite=*/false)); |
Comment on lines
+807
to
+820
| ASSERT_FALSE(ranges.empty()); | ||
|
|
||
| // The critical check: no range should extend beyond the true chunk end. | ||
| // With the bug, the last data page's range would use chunk_end = data_page_offset + | ||
| // total_compressed_size, which overshoots by the dictionary page size. | ||
| for (size_t i = 0; i < ranges.size(); ++i) { | ||
| int64_t range_end = ranges[i].offset + ranges[i].length; | ||
| EXPECT_LE(range_end, true_chunk_end) | ||
| << "Range " << i << " [offset=" << ranges[i].offset << ", length=" << ranges[i].length | ||
| << "] exceeds true chunk end (" << true_chunk_end << "). " | ||
| << "This indicates chunk_end is computed as data_page_offset + " | ||
| "total_compressed_size instead of dictionary_page_offset + " | ||
| "total_compressed_size."; | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
When dictionary encoding is enabled,
ComputePageRangescomputeschunk_endasdata_page_offset + total_compressed_size. This is incorrect becausetotal_compressed_sizecovers the entire column chunk including the dictionary page,but
data_page_offsetpoints to the first data page (after the dictionary page).This causes the last data page's byte range to overshoot the actual chunk boundary
by exactly the dictionary page size, potentially reading into the next column chunk
or beyond the file.
The fix subtracts the dictionary page size from
total_compressed_sizebeforecomputing
chunk_end, so thatchunk_end = data_page_offset + (total_compressed_size - dict_size).Tests
ComputePageRangesWithDictionaryEncodingunit test that writes a Parquet filewith dictionary encoding enabled and verifies that no computed page range exceeds
the true column chunk boundary.
API and Format
No.
Documentation
No.
Generative AI tooling
Claude 4.7 opus