feat(datafusion): support PARTITIONED BY for identity-partitioned external tables #2575
Open
huan233usc wants to merge 4 commits into
…ernal tables

`CREATE EXTERNAL TABLE ... STORED AS ICEBERG` previously rejected any `PARTITIONED BY` clause. Since DataFusion's grammar only accepts plain column names (it cannot express transforms such as `bucket[N]` or `day`), allow the clause for identity-partitioned tables and validate that the declared columns match the table's default partition spec, in order.

Tables partitioned with non-identity transforms can still be registered by omitting the clause; specifying it returns a clear error pointing at the offending transform.

Closes apache#2050
huan233usc
commented
Jun 3, 2026
/// non-identity transforms, can still be registered for read-only access without declaring
/// its partitioning.
fn validate_partition_columns(table: &Table, declared_partition_cols: &[String]) -> Result<()> {
    if declared_partition_cols.is_empty() {
Contributor
Author
The behavior here is open for discussion.
We could choose to skip validating the partition spec. Pro: it unblocks users who want to create an external table over a partitioned table, including cases where the partitioning is not supported by DataFusion. Con: the SQL would not strictly reflect the table's actual partitioning.
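To make the trade-off concrete, here is a minimal standalone sketch of the two options. The `PartitionCheck` enum and `check_declared_columns` function are hypothetical names introduced only for illustration; they are not part of this PR, iceberg-rust, or DataFusion, and the table's partition columns are modeled as plain string slices.

```rust
/// Hypothetical policy switch used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionCheck {
    /// Declared columns must equal the table's partition columns, in order.
    Strict,
    /// Accept whatever the SQL declares, even if it does not match the table.
    Ignore,
}

/// Compares the columns named in `PARTITIONED BY` with the table's own
/// partition columns under the chosen policy.
fn check_declared_columns(
    policy: PartitionCheck,
    table_cols: &[&str],
    declared: &[&str],
) -> Result<(), String> {
    match policy {
        // Lenient option: never fails, so the SQL may misdescribe the table.
        PartitionCheck::Ignore => Ok(()),
        // Strict option: an omitted clause or an exact, ordered match passes.
        PartitionCheck::Strict if declared.is_empty() || table_cols == declared => Ok(()),
        PartitionCheck::Strict => Err(format!(
            "declared partition columns {:?} do not match table partition columns {:?}",
            declared, table_cols
        )),
    }
}
```

Under `Strict`, a statement that declares `PARTITIONED BY (dt)` for a table partitioned by `region` fails at registration time; under `Ignore`, the same statement succeeds and the mismatch goes unnoticed.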
Which issue does this PR close?
What changes are included in this PR?
`CREATE EXTERNAL TABLE ... STORED AS ICEBERG` (via `IcebergTableProviderFactory`) previously rejected any `PARTITIONED BY` clause outright.

DataFusion's `PARTITIONED BY` grammar only accepts plain column names; it cannot express Iceberg transforms such as `bucket(16, id)` or `days(ts)` (unlike Spark's native DSv2 grammar). Given that constraint, this PR:

- Stops rejecting `table_partition_cols` in `check_cmd`.
- Adds `validate_partition_columns`, run after the table is loaded:
  - If the default partition spec contains a non-identity transform, it returns a `FeatureUnsupported` error naming the offending field/transform.
  - The declared columns must match the spec's identity fields in order (as with `PartitionSpec::is_compatible_with` and Java's `PartitionSpec.compatibleWith`, where field order is significant).
- Omitting `PARTITIONED BY` keeps the previous behavior: any table, including non-identity partitioned ones, can still be registered for read-only access.
- A `TODO` is left to support non-identity transforms once DataFusion's grammar can express them.

Example
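As a rough illustration of the validation rules described above, the following standalone Rust sketch models the check with simplified types. `Transform`, `PartitionField`, the `field` helper, and the `String` error are hypothetical stand-ins, not the iceberg-rust API; the real `validate_partition_columns` takes a `&Table` and returns the crate's own `Result`. The order in which the two checks run is also an assumption.

```rust
use std::fmt;

/// Simplified stand-in for an Iceberg partition transform (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Identity,
    Bucket(u32),
    Day,
}

impl fmt::Display for Transform {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Transform::Identity => write!(f, "identity"),
            Transform::Bucket(n) => write!(f, "bucket[{n}]"),
            Transform::Day => write!(f, "day"),
        }
    }
}

/// One field of the table's default partition spec (hypothetical shape).
#[derive(Debug, Clone)]
struct PartitionField {
    source_column: String,
    transform: Transform,
}

/// Small constructor helper for the examples (hypothetical).
fn field(col: &str, transform: Transform) -> PartitionField {
    PartitionField { source_column: col.to_string(), transform }
}

/// Checks the columns declared in `PARTITIONED BY` against the spec.
fn validate_partition_columns(
    spec: &[PartitionField],
    declared: &[String],
) -> Result<(), String> {
    // No clause: keep the previous behavior and accept any table.
    if declared.is_empty() {
        return Ok(());
    }
    // Plain column names cannot describe a non-identity transform.
    if let Some(f) = spec.iter().find(|f| f.transform != Transform::Identity) {
        return Err(format!(
            "unsupported: partition field on column '{}' uses non-identity transform {}",
            f.source_column, f.transform
        ));
    }
    // Order-sensitive comparison of declared and actual partition columns.
    let actual: Vec<&str> = spec.iter().map(|f| f.source_column.as_str()).collect();
    let wanted: Vec<&str> = declared.iter().map(String::as_str).collect();
    if actual != wanted {
        return Err(format!(
            "declared partition columns {:?} do not match table partition columns {:?}",
            wanted, actual
        ));
    }
    Ok(())
}
```

In this sketch, a spec of identity fields `region, dt` accepts the declaration `region, dt` but rejects `dt, region`, while a `bucket[4]` spec is accepted only when no columns are declared.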
Are these changes tested?
Yes. Added unit tests in `table_provider_factory.rs` plus two metadata fixtures (bucket-partitioned and multi-identity-partitioned):

- a non-identity (`bucket[4]`) transform is rejected with a clear error
- registration still succeeds when `PARTITIONED BY` is omitted

`cargo test -p iceberg-datafusion` and `cargo clippy -p iceberg-datafusion --all-targets` pass.