fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present by zhf999 · Pull Request #325 · alibaba/paimon-cpp

zhf999 · 2026-06-01T08:57:38Z

Purpose

When dictionary encoding is enabled, ComputePageRanges computes chunk_end as
data_page_offset + total_compressed_size. This is incorrect because
total_compressed_size covers the entire column chunk including the dictionary page,
but data_page_offset points to the first data page (after the dictionary page).

This causes the last data page's byte range to overshoot the actual chunk boundary
by exactly the dictionary page size, potentially reading into the next column chunk
or beyond the file.

The fix subtracts the dictionary page size from total_compressed_size before
computing chunk_end, so that chunk_end = data_page_offset + (total_compressed_size - dict_size).

Tests

Added ComputePageRangesWithDictionaryEncoding unit test that writes a Parquet file
with dictionary encoding enabled and verifies that no computed page range exceeds
the true column chunk boundary.

API and Format

No.

Documentation

No.

Generative AI tooling

Claude 4.7 opus

…age is present

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds coverage and fixes range computation for Parquet column chunks that include a dictionary page, preventing computed read ranges from exceeding the actual chunk boundary.

Changes:

Fix ComputePageRanges to account for dictionary page size when computing data-page read ranges and chunk_end.
Add an end-to-end unit test reproducing/guarding the dictionary-encoding chunk-end overshoot bug.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
src/paimon/format/parquet/page_filtered_row_group_reader.cpp	Adjusts chunk boundary and fallback read range sizing when a dictionary page is present.
src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp	Adds a regression test ensuring computed ranges don’t exceed the true column chunk end with dictionary encoding.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

        int64_t data_page_offset = col_chunk->data_page_offset();
-        int64_t total_compressed_size = col_chunk->total_compressed_size();
-        int64_t chunk_end = data_page_offset + total_compressed_size;
-
+        int64_t data_page_compressed_size = col_chunk->total_compressed_size();
        // Dictionary page: always include if present
        if (col_chunk->has_dictionary_page()) {
            int64_t dict_offset = col_chunk->dictionary_page_offset();
            int64_t dict_size = data_page_offset - dict_offset;
+            // if dictionary exists, the data page size should be reduced by the dictionary
+            data_page_compressed_size -= dict_size;
            if (dict_size > 0) {
                ranges.push_back({dict_offset, dict_size});
            }
        }

+        int64_t chunk_end = data_page_offset + data_page_compressed_size;


+    auto val_array = val_builder.Finish().ValueOrDie();
+    auto field = arrow::field("val", arrow::int32());
+    auto struct_array = arrow::StructArray::Make({val_array}, {field}).ValueOrDie();


+    ASSERT_OK_AND_ASSIGN(std::shared_ptr<OutputStream> out,
+                         fs_->Create(file_name, /*overwrite=*/false));


+    ASSERT_FALSE(ranges.empty());
+
+    // The critical check: no range should extend beyond the true chunk end.
+    // With the bug, the last data page's range would use chunk_end = data_page_offset +
+    // total_compressed_size, which overshoots by the dictionary page size.
+    for (size_t i = 0; i < ranges.size(); ++i) {
+        int64_t range_end = ranges[i].offset + ranges[i].length;
+        EXPECT_LE(range_end, true_chunk_end)
+            << "Range " << i << " [offset=" << ranges[i].offset << ", length=" << ranges[i].length
+            << "] exceeds true chunk end (" << true_chunk_end << "). "
+            << "This indicates chunk_end is computed as data_page_offset + "
+               "total_compressed_size instead of dictionary_page_offset + "
+               "total_compressed_size.";
+    }


fix: fix calculated chunk_end exceed true chunk end when dictionary p…

3d10b2b

…age is present

Copilot AI review requested due to automatic review settings June 1, 2026 08:57

Copilot AI reviewed Jun 1, 2026

View reviewed changes

Merge branch 'main' into page-length-fix

8988391

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325

fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325
zhf999 wants to merge 2 commits into
alibaba:mainfrom
zhf999:page-length-fix

zhf999 commented Jun 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		ASSERT_OK_AND_ASSIGN(std::shared_ptr<OutputStream> out,
		fs_->Create(file_name, /overwrite=/false));

Conversation

zhf999 commented Jun 1, 2026

Purpose

Tests

API and Format

Documentation

Generative AI tooling

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants