[core] Fix DataEvolutionFileStoreScan schema-evolution filtering by duanyyyyyyy · Pull Request #8084 · apache/paimon

duanyyyyyyy · 2026-06-02T12:30:05Z

Purpose

The filterByStats(ManifestEntry) per-entry "filter irrelevant column files"
optimization for DE tables (#6647) lives in the wrong layer. It cannot see the
row-id-range grouping and has no access to inputFilter, so it inevitably
makes wrong decisions that break correctness for several common queries:

RENAME COLUMN silently drops pre-rename files. The check compares by
column name; the renamed field's latest name does not appear in the
pre-rename file's writeCols, so the file is dropped at manifest pruning.
```
ALTER TABLE t RENAME COLUMN score TO grade;
SELECT COUNT(*) FROM t WHERE grade < 60;
-- before: 0   (pre-rename rows lost)
```
ADD COLUMN + a predicate on the new column silently drops pre-ALTER
files. Without inputFilter in scope, the check cannot tell
WHERE new_col IS NULL (must keep) from WHERE new_col = 'x' (can drop) —
both look like "no requested column physically present".
```
ALTER TABLE t ADD COLUMN new_col STRING;
SELECT COUNT(*) FROM t WHERE new_col IS NULL;
-- before: 0   (pre-ALTER rows lost)
```
ADD COLUMN + projection only of the new column silently drops pre-ALTER
files, losing rowCount NULL-fill rows that should have been emitted.
Without the row-id-range grouping in scope, the single-entry layer cannot
know that the dropped file was the only carrier of its rows.

This PR relocates the optimization into postFilterManifestEntries, where
both inputFilter and the row-id-range grouping are available. The same
column-pruning intent is preserved, but now expressed correctly:

Compare by field id, not by name. RENAME and TYPE CHANGE no longer
matter because field id is the stable identity across schemas.
Drop a file from a group only when another file in the same group
carries one of the requested columns. If no file in the group does, keep
one as a row-count representative so the reader can emit rowCount
NULL-filled rows for the projection.
Predicate-based stats pruning, when inputFilter is set, runs in the
list-version filterByStats over the same groups.

Two supporting changes make the predicate path correct against all-NULL
columns:

DataEvolutionArray gains an optional missingFieldLong fallback
(default NO_MISSING_FIELD_FALLBACK preserves current behavior for
non-stats consumers). When set, a position whose rowOffsets[pos] == -1
— the field is absent from every file in the group — reports
isNullAt = false and getLong = fallback.
evolutionStats passes groupRowCount as that fallback for nullCounts.
LeafPredicate.test then sees nullCount == rowCount, short-circuits
the "unknown stats → keep" branch, and lets IsNull keep pre-ALTER files
while IsNotNull / Equal / GreaterThan / etc. drop them. The sentinel
rowOffsets[pos] == -2 ("field exists in a file but its stats were not
captured") keeps the previous "unknown" semantic.

Effect:

|---|---|---|
| WHERE new_col IS NULL | pre-ALTER rows lost | correct, NULL-filled |
| WHERE new_col = 'x' | pre-ALTER rows lost; pre-ALTER files dropped at wrong layer | post-ALTER row(s) returned; pre-ALTER files pruned by stats |
| WHERE renamed_col < x | pre-rename rows lost | correct, stats-pruned by field id |
| SELECT new_col no predicate | pre-ALTER rows lost | row-disjoint files kept (NULL-fill); columnar-split files pruned per-group |

Tests

DataEvolutionTableTest gains ten cases:

testAddColumnIsNullKeepsPreAlterRows — central repro for bug (2)
testAddColumnEqualityPredicatePrunesPreAlterFiles — verifies the "all NULL"
stats encoding lets Equal drop pre-ALTER files
testRenameColumnPredicateKeepsPreRenameRows — central repro for bug (1)
testNoFilterProjectionPrunesColumnarSplitFiles — columnar-split column
pruning
testNoFilterProjectionKeepsRowDisjointFiles — central repro for bug (3):
row-disjoint single-file groups stay even when their carrier is the only
source of rowCount
testColumnarSplitWithPredicateOnFileAColumn /
testColumnarSplitWithPredicateOnFileBColumn — predicate-based group
pruning composes with per-file column pruning
testThreeWayColumnarSplitPruning — three files per row id range, one
column each
testMixedColumnarSplitAndRowDisjoint — both group topologies in one table
testSystemFieldOnlyProjectionIsNotPruned — system-field-only readType
is filtered out of pruning, files flow through unchanged

All 35 cases in DataEvolutionTableTest and 199 cases across DE / blob /
global-index / append-only / commit suites pass with no pre-existing UT
regressions.

JingsongLi

LGTM. I checked the pruning change against the data-evolution read path: using field ids here matches how DataEvolutionSplitRead merges fields, so renamed columns are kept while newly-added columns that are absent from old files can still be skipped.\n\nAlso verified the focused test locally:\n\n\nmvn -pl paimon-core -Pfast-build -Dtest=DataEvolutionFileStoreScanTest test\n\n\nTests run: 8, Failures: 0, Errors: 0, Skipped: 0.

duanyyyyyyy · 2026-06-02T14:59:48Z

LGTM. I checked the pruning change against the data-evolution read path: using field ids here matches how DataEvolutionSplitRead merges fields, so renamed columns are kept while newly-added columns that are absent from old files can still be skipped.\n\nAlso verified the focused test locally:\n\n\nmvn -pl paimon-core -Pfast-build -Dtest=DataEvolutionFileStoreScanTest test\n\n\nTests run: 8, Failures: 0, Errors: 0, Skipped: 0.

Can hold on later?
I need to do some full test for data evolution.
Seems some other case will also have issue like add column.

The "filter irrelevant column files" optimization (apache#6647) tried to skip DE-table files that carry none of a query's requested columns. The intent — saving I/O on columnar-split groups where a row's columns are split across multiple files — is correct, but the implementation lived in filterByStats(ManifestEntry), the single-entry layer that has no view of the row-id-range grouping and no access to inputFilter. Three concrete bugs fell out of that layer choice: 1) RENAME COLUMN silently dropped pre-rename files. The check ran on column names; the renamed field's latest name does not appear in the pre-rename file's writeCols, so the comparison evaluated to "no overlap" and dropped the file. 2) ADD COLUMN + a predicate on the new column silently dropped pre-ALTER files. The check has no view of inputFilter, so it cannot tell "WHERE new_col IS NULL must keep the file" from "WHERE new_col = 'x' can drop the file" — both look like "no requested column physically present". 3) ADD COLUMN + projection of only the new column silently dropped pre-ALTER files, losing rowCount rows that should have come back as NULL-filled. The check has no view of the row-id-range grouping, so it could not know that a file dropped from a row-disjoint group was the only carrier of its rows. Relocate the optimization to postFilterManifestEntries, where inputFilter and the row-id-range grouping are both available. The same column-pruning intent now reads: - Compare by field id, not by name. RENAME and TYPE CHANGE no longer matter because field id is the stable identity across schemas. - Drop a file from a group only when another file in the same group carries one of the requested columns. If no file in the group does, keep one as a row-count representative so the reader can emit rowCount NULL-filled rows for the projection. - Predicate-based stats pruning, when inputFilter is set, runs in the list-version filterByStats over the same groups (see below). Two supporting changes make the predicate path correct against all-NULL columns: - DataEvolutionArray gains an optional missingFieldLong fallback (default NO_MISSING_FIELD_FALLBACK preserves current behavior for non-stats consumers). When set, a position whose rowOffsets entry is -1 — meaning the field is absent from every file in the group — reports isNullAt=false and getLong=fallback. - evolutionStats passes groupRowCount as that fallback for nullCounts. LeafPredicate.test then sees nullCount == rowCount, short-circuits the "unknown stats -> keep" branch, and lets IsNull keep pre-ALTER files while IsNotNull / Equal / GreaterThan / etc. drop them. The sentinel rowOffsets entry -2 ("field exists in a file but its stats were not captured") keeps the previous "unknown" semantic. Effect: - WHERE new_col IS NULL: pre-ALTER files kept, reader NULL-fills, correct. - WHERE new_col IS NOT NULL / = / > / etc.: stats prune pre-ALTER files via the "all NULL" encoding. - WHERE renamed_col <predicate>: evolutionStats matches by field id, LeafPredicate sees real min/max, stats prune correctly. - SELECT new_col with no predicate: row-disjoint groups keep their single file (NULL-filled rows preserved); columnar-split groups drop files that carry no requested column. Coverage: DataEvolutionTableTest gains ten cases — - testAddColumnIsNullKeepsPreAlterRows - testAddColumnEqualityPredicatePrunesPreAlterFiles - testRenameColumnPredicateKeepsPreRenameRows - testNoFilterProjectionPrunesColumnarSplitFiles - testNoFilterProjectionKeepsRowDisjointFiles - testColumnarSplitWithPredicateOnFileAColumn - testColumnarSplitWithPredicateOnFileBColumn - testThreeWayColumnarSplitPruning - testMixedColumnarSplitAndRowDisjoint - testSystemFieldOnlyProjectionIsNotPruned covering the three central bug repros, single-column projection, mixed projections, three-way column splits, mixed columnar-split / row-disjoint topologies, predicate composition, and the system-field-only fallthrough. Repro for the central bug: CREATE TABLE t (id INT) WITH ('data-evolution.enabled'='true'); INSERT INTO t VALUES (1); ALTER TABLE t ADD COLUMN new_col STRING; INSERT INTO t VALUES (2, 'v'); SELECT COUNT(*) FROM t WHERE new_col IS NULL; -- before: 0 (pre-ALTER file dropped at single filterByStats) -- after: 1 Repro for the rename regression: CREATE TABLE t (id INT, score INT) WITH ('data-evolution.enabled'='true'); INSERT INTO t VALUES (1, 10); ALTER TABLE t RENAME COLUMN score TO grade; INSERT INTO t VALUES (2, 100); SELECT COUNT(*) FROM t WHERE grade < 60; -- before: 0 (pre-rename file dropped at single filterByStats) -- after: 1 Signed-off-by: duanyyyyyyy <yan.duan9759@gmail.com>

JingsongLi

LGTM after another pass. I rechecked the grouped pruning path for the field-id based matching, ADD COLUMN all-NULL stats fallback, projection-only post-filtering, and the row-tracking assumptions for data-evolution tables. I did not find any blocking issue.

I also verified the focused PR cases from a clean worktree:

mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=DataEvolutionTableTest#testAddColumnIsNullKeepsPreAlterRows+testAddColumnEqualityPredicatePrunesPreAlterFiles+testRenameColumnPredicateKeepsPreRenameRows+testNoFilterProjectionPrunesColumnarSplitFiles+testNoFilterProjectionKeepsRowDisjointFiles+testColumnarSplitWithPredicateOnFileAColumn+testColumnarSplitWithPredicateOnFileBColumn+testThreeWayColumnarSplitPruning+testMixedColumnarSplitAndRowDisjoint+testSystemFieldOnlyProjectionIsNotPruned test

Result: Tests run: 10, Failures: 0, Errors: 0, Skipped: 0.

duanyyyyyyy force-pushed the de-rename-fix-match-by-field-id branch from 516c526 to cdff3c1 Compare June 2, 2026 13:46

JingsongLi reviewed Jun 2, 2026

View reviewed changes

duanyyyyyyy force-pushed the de-rename-fix-match-by-field-id branch from cdff3c1 to 5daa733 Compare June 3, 2026 09:50

duanyyyyyyy changed the title ~~[core] Match read columns by field id in DataEvolutionFileStoreScan~~ [core] Fix DataEvolutionFileStoreScan schema-evolution filtering Jun 3, 2026

duanyyyyyyy force-pushed the de-rename-fix-match-by-field-id branch from 5daa733 to 0e2dd89 Compare June 3, 2026 11:30

JingsongLi reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core] Fix DataEvolutionFileStoreScan schema-evolution filtering#8084

[core] Fix DataEvolutionFileStoreScan schema-evolution filtering#8084
duanyyyyyyy wants to merge 1 commit into
apache:masterfrom
duanyyyyyyy:de-rename-fix-match-by-field-id

duanyyyyyyy commented Jun 2, 2026 •

edited

Loading

Uh oh!

JingsongLi left a comment

Uh oh!

duanyyyyyyy commented Jun 2, 2026

Uh oh!

JingsongLi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

duanyyyyyyy commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Tests

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

duanyyyyyyy commented Jun 2, 2026

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

duanyyyyyyy commented Jun 2, 2026 •

edited

Loading