Skip to content

fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325

Open
zhf999 wants to merge 2 commits into
alibaba:mainfrom
zhf999:page-length-fix
Open

fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325
zhf999 wants to merge 2 commits into
alibaba:mainfrom
zhf999:page-length-fix

Conversation

@zhf999
Copy link
Copy Markdown
Contributor

@zhf999 zhf999 commented Jun 1, 2026

Purpose

When dictionary encoding is enabled, ComputePageRanges computes chunk_end as
data_page_offset + total_compressed_size. This is incorrect because
total_compressed_size covers the entire column chunk including the dictionary page,
but data_page_offset points to the first data page (after the dictionary page).

This causes the last data page's byte range to overshoot the actual chunk boundary
by exactly the dictionary page size, potentially reading into the next column chunk
or beyond the file.

The fix subtracts the dictionary page size from total_compressed_size before
computing chunk_end, so that chunk_end = data_page_offset + (total_compressed_size - dict_size).

Tests

  • Added ComputePageRangesWithDictionaryEncoding unit test that writes a Parquet file
    with dictionary encoding enabled and verifies that no computed page range exceeds
    the true column chunk boundary.

API and Format

No.

Documentation

No.

Generative AI tooling

Claude 4.7 opus

Copilot AI review requested due to automatic review settings June 1, 2026 08:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds coverage and fixes range computation for Parquet column chunks that include a dictionary page, preventing computed read ranges from exceeding the actual chunk boundary.

Changes:

  • Fix ComputePageRanges to account for dictionary page size when computing data-page read ranges and chunk_end.
  • Add an end-to-end unit test reproducing/guarding the dictionary-encoding chunk-end overshoot bug.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/paimon/format/parquet/page_filtered_row_group_reader.cpp Adjusts chunk boundary and fallback read range sizing when a dictionary page is present.
src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp Adds a regression test ensuring computed ranges don’t exceed the true column chunk end with dictionary encoding.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 318 to +331
int64_t data_page_offset = col_chunk->data_page_offset();
int64_t total_compressed_size = col_chunk->total_compressed_size();
int64_t chunk_end = data_page_offset + total_compressed_size;

int64_t data_page_compressed_size = col_chunk->total_compressed_size();
// Dictionary page: always include if present
if (col_chunk->has_dictionary_page()) {
int64_t dict_offset = col_chunk->dictionary_page_offset();
int64_t dict_size = data_page_offset - dict_offset;
// if dictionary exists, the data page size should be reduced by the dictionary
data_page_compressed_size -= dict_size;
if (dict_size > 0) {
ranges.push_back({dict_offset, dict_size});
}
}

int64_t chunk_end = data_page_offset + data_page_compressed_size;
Comment on lines +744 to +746
auto val_array = val_builder.Finish().ValueOrDie();
auto field = arrow::field("val", arrow::int32());
auto struct_array = arrow::StructArray::Make({val_array}, {field}).ValueOrDie();
Comment on lines +753 to +754
ASSERT_OK_AND_ASSIGN(std::shared_ptr<OutputStream> out,
fs_->Create(file_name, /*overwrite=*/false));
Comment on lines +807 to +820
ASSERT_FALSE(ranges.empty());

// The critical check: no range should extend beyond the true chunk end.
// With the bug, the last data page's range would use chunk_end = data_page_offset +
// total_compressed_size, which overshoots by the dictionary page size.
for (size_t i = 0; i < ranges.size(); ++i) {
int64_t range_end = ranges[i].offset + ranges[i].length;
EXPECT_LE(range_end, true_chunk_end)
<< "Range " << i << " [offset=" << ranges[i].offset << ", length=" << ranges[i].length
<< "] exceeds true chunk end (" << true_chunk_end << "). "
<< "This indicates chunk_end is computed as data_page_offset + "
"total_compressed_size instead of dictionary_page_offset + "
"total_compressed_size.";
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants