fix(scan): use current_schema for default-snapshot column validation by nazq · Pull Request #2566 · apache/iceberg-rust

nazq · 2026-06-02T00:19:49Z

Which issue does this PR close?

Closes Table scan rejects current-schema column names after UpdateSchemaAction commit #2565.

What changes are included in this PR?

TableScanBuilder::build() currently validates caller-supplied column names against snapshot.schema(metadata), which returns the schema the snapshot was written under. For default scans (no explicit snapshot_id) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through an UpdateSchemaAction commit — this rejects columns that are valid against the post-evolution schema:

DataInvalid => Column note not found in table. Schema: <pre-evolution schema>

Fix

crates/iceberg/src/scan/mod.rs:221:

- let schema = snapshot.schema(self.table.metadata())?;
+ let schema = if self.snapshot_id.is_some() {
+     snapshot.schema(self.table.metadata())?
+ } else {
+     self.table.metadata().current_schema().clone()
+ };

Explicit snapshot_id (time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns.
Default scan (no snapshot_id): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (get_arrow_projection_mask_with_field_ids, which reads PARQUET:field_id metadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names.

This matches PyIceberg's approach in pyiceberg/io/pyarrow.py::_task_to_record_batches — project by field ID, rename the arrow batch on the way out.

The column-name validation loop (lines 224–237) and the subsequent field_id_by_name lookup (lines 256–261) share the same schema variable, so the fix is one assignment with a 12-line comment explaining the invariant.

Why this wasn't caught earlier

UpdateSchemaAction (#2120) shipped with metadata-only tests in crates/catalog/loader/tests/schema_update_suite.rs — none of them exercise table.scan().select_columns(...) after the schema commit. The pre-existing crates/integration_tests/tests/read_evolved_schema.rs only uses table.scan().build() with no select_columns, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename.

Are these changes tested?

Yes — three regression tests in scan::tests, all on the same minimal fixture (make_schema_evolved_table) with current-schema-id=1 (id, value) and a sole snapshot at schema-id=0 (id, tmp):

Test	What it covers
`test_default_scan_uses_current_schema_after_evolution`	`select(["id", "value"])` succeeds in the default scan — the post-evolution column name resolves
`test_default_scan_rejects_old_name_after_rename`	`select(["id", "tmp"])` fails with `DataInvalid` — the pre-evolution name is no longer part of the public vocabulary in a default scan
`test_snapshot_id_scan_uses_snapshot_schema`	`snapshot_id(1).select(["id", "tmp"])` succeeds (time-travel keeps the old vocabulary); `snapshot_id(1).select(["id", "value"])` fails

All 34 existing scan::tests still pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean.

A default table scan (no explicit snapshot_id) currently validates caller-supplied column names against the snapshot's schema_id, not the table's current schema. After an UpdateSchemaAction commit changes the current schema (rename/add/delete column), pre-existing snapshots still point at the old schema_id, so the validation loop in TableScanBuilder::build rejects names that are valid against the post-evolution schema with: DataInvalid => Column <new_name> not found in table. Schema: <old> The downstream Parquet projection (arrow/reader/projection.rs::get_arrow_projection_mask_with_field_ids) already maps field IDs to on-disk column names via PARQUET:field_id metadata, so resolving names against the current schema is safe end-to-end — field IDs are stable across schema versions, and the file's original column names live in the parquet metadata until the file is rewritten. Fix: branch on whether the caller asked for a specific snapshot. Explicit snapshot_id (time-travel) keeps the snapshot-time vocabulary; default scan uses the table's current schema. Tests: three regression tests on a fixture with current-schema-id=1 (id, value) and a sole snapshot at schema-id=0 (id, tmp): * test_default_scan_uses_current_schema_after_evolution — select(['id','value']) succeeds in the default scan * test_default_scan_rejects_old_name_after_rename — select(['id','tmp']) fails with DataInvalid in the default scan * test_snapshot_id_scan_uses_snapshot_schema — snapshot_id(1).select(['id','tmp']) succeeds (time-travel), and snapshot_id(1).select(['id','value']) fails All 1299 iceberg lib tests pass (37 in scan::tests = 34 existing + 3 new). Clippy + rustfmt clean.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scan): use current_schema for default-snapshot column validation#2566

fix(scan): use current_schema for default-snapshot column validation#2566
nazq wants to merge 1 commit into
apache:mainfrom
nazq:scan-current-schema-on-evolution

nazq commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nazq commented Jun 2, 2026

Which issue does this PR close?

What changes are included in this PR?

Fix

Why this wasn't caught earlier

Are these changes tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant