feat(datafusion): support PARTITIONED BY for identity-partitioned external tables #2575

Open
huan233usc wants to merge 4 commits into apache:main from huan233usc:feat/datafusion-external-table-partitioned-by

@huan233usc huan233usc commented Jun 3, 2026

Which issue does this PR close?

What changes are included in this PR?

CREATE EXTERNAL TABLE ... STORED AS ICEBERG (via IcebergTableProviderFactory) previously rejected any PARTITIONED BY clause outright.

DataFusion's PARTITIONED BY grammar only accepts plain column names — it cannot express Iceberg transforms such as bucket(16, id) or days(ts) (unlike Spark's native DSv2 grammar). Given that constraint, this PR:

  • Stops rejecting table_partition_cols in check_cmd.
  • Adds validate_partition_columns, run after the table is loaded:
    • If the table's default partition spec uses any non-identity transform, returns a clear FeatureUnsupported error naming the offending field/transform.
    • Otherwise validates that the declared columns exactly match the identity partition columns in order (consistent with PartitionSpec::is_compatible_with and Java's PartitionSpec.compatibleWith, where field order is significant).
  • Omitting PARTITIONED BY keeps the previous behavior: any table — including non-identity partitioned ones — can still be registered for read-only access.
  • A TODO is left to support non-identity transforms once DataFusion's grammar can express them.
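The validation rules above can be sketched in isolation as follows. This is a minimal illustration, not the PR's code: `Transform`, `PartitionField`, `ValidationError`, and the `field` helper are hypothetical stand-ins for the real iceberg-rust types, and the function takes a plain slice of partition fields rather than a loaded `Table`.

```rust
// Hypothetical stand-ins for the Iceberg partition-spec types (assumptions, not the real API).
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Identity,
    Bucket(u32),
    Day,
}

#[derive(Debug, Clone)]
struct PartitionField {
    source_column: String,
    transform: Transform,
}

#[derive(Debug, PartialEq)]
enum ValidationError {
    /// The spec uses a transform that PARTITIONED BY cannot express.
    FeatureUnsupported(String),
    /// The declared columns differ from the identity partition columns.
    Mismatch(String),
}

/// Convenience constructor used only for illustration.
fn field(name: &str, transform: Transform) -> PartitionField {
    PartitionField { source_column: name.to_string(), transform }
}

fn validate_partition_columns(
    spec: &[PartitionField],
    declared: &[String],
) -> Result<(), ValidationError> {
    // Omitting PARTITIONED BY skips validation, so any table can still be registered.
    if declared.is_empty() {
        return Ok(());
    }

    // Reject the clause if any field uses a non-identity transform, naming the field.
    if let Some(f) = spec.iter().find(|f| f.transform != Transform::Identity) {
        return Err(ValidationError::FeatureUnsupported(format!(
            "PARTITIONED BY cannot express transform {:?} on column '{}'",
            f.transform, f.source_column
        )));
    }

    // Field order is significant: compare the two column lists position by position.
    let expected: Vec<&str> = spec.iter().map(|f| f.source_column.as_str()).collect();
    let actual: Vec<&str> = declared.iter().map(String::as_str).collect();
    if expected != actual {
        return Err(ValidationError::Mismatch(format!(
            "declared {:?} but the table is partitioned by {:?}",
            actual, expected
        )));
    }
    Ok(())
}
```

Under these assumptions, declaring `(region, event_date)` against a spec ordered `(event_date, region)` yields `Mismatch`, while any non-empty declaration against a bucket-partitioned spec yields `FeatureUnsupported`.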

Example

CREATE EXTERNAL TABLE my_iceberg_table
STORED AS ICEBERG LOCATION '/path/to/metadata.json'
PARTITIONED BY (event_date);

Are these changes tested?

Yes. Added unit tests in table_provider_factory.rs plus two metadata fixtures (bucket-partitioned and multi-identity-partitioned):

  • single identity column match / mismatch
  • multiple identity columns match / wrong order / subset (count mismatch)
  • non-identity (bucket[4]) transform rejected with a clear error
  • non-identity partitioned table still registers when PARTITIONED BY is omitted

cargo test -p iceberg-datafusion and cargo clippy -p iceberg-datafusion --all-targets pass.

…ernal tables

`CREATE EXTERNAL TABLE ... STORED AS ICEBERG` previously rejected any
`PARTITIONED BY` clause. Since DataFusion's grammar only accepts plain
column names (it cannot express transforms such as `bucket[N]` or `day`),
allow the clause for identity-partitioned tables and validate that the
declared columns match the table's default partition spec, in order.

Tables partitioned with non-identity transforms can still be registered by
omitting the clause; specifying it returns a clear error pointing at the
offending transform.

Closes apache#2050
/// non-identity transforms, can still be registered for read-only access without declaring
/// its partitioning.
fn validate_partition_columns(table: &Table, declared_partition_cols: &[String]) -> Result<()> {
    if declared_partition_cols.is_empty() {

The behavior here is open for discussion.

We could instead skip validating the declared columns against the partition spec. The upside is that this unblocks users who want to create an external table that is partitioned (possibly with transforms DataFusion does not support). The downside is that the SQL would not be strictly accurate.
