Skip to content

fix(scan): use current_schema for default-snapshot column validation#2566

Open
nazq wants to merge 1 commit into
apache:mainfrom
nazq:scan-current-schema-on-evolution
Open

fix(scan): use current_schema for default-snapshot column validation#2566
nazq wants to merge 1 commit into
apache:mainfrom
nazq:scan-current-schema-on-evolution

Conversation

@nazq
Copy link
Copy Markdown

@nazq nazq commented Jun 2, 2026

Which issue does this PR close?

What changes are included in this PR?

TableScanBuilder::build() currently validates caller-supplied column names against snapshot.schema(metadata), which returns the schema the snapshot was written under. For default scans (no explicit snapshot_id) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through an UpdateSchemaAction commit — this rejects columns that are valid against the post-evolution schema:

DataInvalid => Column note not found in table. Schema: <pre-evolution schema>

Fix

crates/iceberg/src/scan/mod.rs:221:

- let schema = snapshot.schema(self.table.metadata())?;
+ let schema = if self.snapshot_id.is_some() {
+     snapshot.schema(self.table.metadata())?
+ } else {
+     self.table.metadata().current_schema().clone()
+ };
  • Explicit snapshot_id (time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns.
  • Default scan (no snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (get_arrow_projection_mask_with_field_ids, which reads PARQUET:field_id metadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names.

This matches PyIceberg's approach in pyiceberg/io/pyarrow.py::_task_to_record_batches — project by field ID, rename the arrow batch on the way out.

The column-name validation loop (lines 224–237) and the subsequent field_id_by_name lookup (lines 256–261) share the same schema variable, so the fix is one assignment with a 12-line comment explaining the invariant.

Why this wasn't caught earlier

UpdateSchemaAction (#2120) shipped with metadata-only tests in crates/catalog/loader/tests/schema_update_suite.rs — none of them exercise table.scan().select_columns(...) after the schema commit. The pre-existing crates/integration_tests/tests/read_evolved_schema.rs only uses table.scan().build() with no select_columns, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename.

Are these changes tested?

Yes — three regression tests in scan::tests, all on the same minimal fixture (make_schema_evolved_table) with current-schema-id=1 (id, value) and a sole snapshot at schema-id=0 (id, tmp):

Test What it covers
test_default_scan_uses_current_schema_after_evolution select(["id", "value"]) succeeds in the default scan — the post-evolution column name resolves
test_default_scan_rejects_old_name_after_rename select(["id", "tmp"]) fails with DataInvalid — the pre-evolution name is no longer part of the public vocabulary in a default scan
test_snapshot_id_scan_uses_snapshot_schema snapshot_id(1).select(["id", "tmp"]) succeeds (time-travel keeps the old vocabulary); snapshot_id(1).select(["id", "value"]) fails

All 34 existing scan::tests still pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean.

A default table scan (no explicit snapshot_id) currently validates
caller-supplied column names against the snapshot's schema_id, not the
table's current schema. After an UpdateSchemaAction commit changes the
current schema (rename/add/delete column), pre-existing snapshots still
point at the old schema_id, so the validation loop in
TableScanBuilder::build rejects names that are valid against the
post-evolution schema with:

    DataInvalid => Column <new_name> not found in table. Schema: <old>

The downstream Parquet projection
(arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids)
already maps field IDs to on-disk column names via PARQUET:field_id
metadata, so resolving names against the current schema is safe
end-to-end — field IDs are stable across schema versions, and the
file's original column names live in the parquet metadata until the
file is rewritten.

Fix: branch on whether the caller asked for a specific snapshot.
Explicit snapshot_id (time-travel) keeps the snapshot-time vocabulary;
default scan uses the table's current schema.

Tests: three regression tests on a fixture with current-schema-id=1
(id, value) and a sole snapshot at schema-id=0 (id, tmp):

* test_default_scan_uses_current_schema_after_evolution — select(['id','value'])
  succeeds in the default scan
* test_default_scan_rejects_old_name_after_rename — select(['id','tmp'])
  fails with DataInvalid in the default scan
* test_snapshot_id_scan_uses_snapshot_schema — snapshot_id(1).select(['id','tmp'])
  succeeds (time-travel), and snapshot_id(1).select(['id','value']) fails

All 1299 iceberg lib tests pass (37 in scan::tests = 34 existing + 3
new). Clippy + rustfmt clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Table scan rejects current-schema column names after UpdateSchemaAction commit

1 participant