
[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517

Open
yaooqinn wants to merge 1 commit into apache:master from yaooqinn:SPARK-55716

Conversation

@yaooqinn
Member

@yaooqinn yaooqinn commented Feb 26, 2026

What changes were proposed in this pull request?

V1 file-based DataSource writes (Parquet, ORC, JSON, etc.) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement controlled by spark.sql.fileSource.insert.enforceNotNull.

Changes:

  1. CreateDataSourceTableCommand: Preserves user-specified nullability by recursively merging nullability flags from the user schema into the resolved dataSource.schema. Previously it stored dataSource.schema directly, which is all-nullable due to DataSource.resolveRelation() calling dataSchema.asNullable (SPARK-13738).

  2. PreprocessTableInsertion: Restores nullability flags from the catalog schema before null checks, ensuring AssertNotNull is injected when needed. Gated behind spark.sql.fileSource.insert.enforceNotNull.

  3. New config: spark.sql.fileSource.insert.enforceNotNull (default false) — when set to true, enables NOT NULL constraint enforcement for V1 file-based tables, consistent with the behavior for other data sources and V2 catalog tables.

  4. SparkGetColumnsOperation: Fixed IS_NULLABLE to respect column.nullable instead of always returning "YES".

Why are the changes needed?

DataSource.resolveRelation() calls dataSchema.asNullable (added in SPARK-13738 for read safety), which strips all NOT NULL constraints recursively. CreateDataSourceTableCommand then stores this all-nullable schema in the catalog, permanently losing NOT NULL information. As a result, PreprocessTableInsertion never injects AssertNotNull for V1 file source tables.

Note: InsertableRelation (e.g., SimpleInsertSource) does NOT have this problem because it preserves the original schema (SPARK-24583).
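The information loss described above can be illustrated with a minimal sketch, again assuming a simplified dict-based schema shape rather than Spark's `StructType`:

```python
# Simplified illustration of the asNullable behavior (SPARK-13738) described
# above: every field, at every nesting level, is marked nullable, so NOT NULL
# declarations are lost before the schema reaches the catalog.
# The {name: (nullable, children)} shape is illustrative, not Spark's types.

def as_nullable(schema: dict) -> dict:
    """Recursively mark every field of the schema nullable."""
    return {
        name: (True, as_nullable(children))
        for name, (nullable, children) in schema.items()
    }

declared = {"id": (False, {}), "s": (False, {"b": (False, {})})}
stored = as_nullable(declared)  # what the catalog ends up with
# stored == {"id": (True, {}), "s": (True, {"b": (True, {})})}
```

Once `stored` replaces `declared` in the catalog, there is nothing left for `PreprocessTableInsertion` to enforce, which is why change 1 re-merges the user flags before the schema is persisted.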

Does this PR introduce any user-facing change?

No change in default behavior. Users can opt in to NOT NULL enforcement for V1 file source tables by setting spark.sql.fileSource.insert.enforceNotNull to true.

How was this patch tested?

  • Added 7 new tests in InsertSuite covering top-level, nested struct, array, and map null constraint enforcement.
  • Fixed 3 existing interval column test assertions in SparkMetadataOperationSuite.

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with GitHub Copilot.

@yaooqinn
Member Author

cc @dongjoon-hyun @cloud-fan @gengliangwang. Is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussion of this area.

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 26, 2026

Hi, @yaooqinn

  • Apache Spark didn't claim to support SQL CONSTRAINT for v1 yet, did it? IIUC, it should be handled explicitly by the user in the FROM/SELECT clause of that INSERT statement.
  • SPARK-51207 (SPIP: Constraints in DSv2) is a fairly new feature, available only since Apache Spark 4.1.0.

For me, this PR seems to introduce a new feature rather than fix a bug.

cc @aokolnychyi , @peter-toth , too.

@yaooqinn
Member Author

yaooqinn commented Feb 27, 2026

Apache Spark didn't claim to support SQL CONSTRAINT for v1 yet, did it? IIUC, it should be handled explicitly by the user in the FROM/SELECT clause of that INSERT statement.

@dongjoon-hyun
I'm not sure. StoreAssignmentPolicy was introduced in 3.0, and meanwhile the NOT NULL keyword has long been supported (though perhaps not fully or well documented). It looks like V2 support is an addition built on top of that. In legacy V1, it partially works for data sources where asNullable is not called.

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 27, 2026

IIUC, StoreAssignmentPolicy is supposed to check type casting only. As for nullability itself, I'm not sure that was achieved by SPARK-28730.

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 27, 2026

Anyway, let's wait for @gengliangwang's and @cloud-fan's comments. Maybe I lost track of the code changes.

@yaooqinn
Member Author

IIUC, StoreAssignmentPolicy is supposed to check type casting only.

@dongjoon-hyun
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L447

FYI, null constraints for table output rely on it too.
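The runtime check that the linked TableOutputResolver code injects (via AssertNotNull) amounts to the following. This is a Spark-free sketch under simplified assumptions: rows are plain dicts, and `enforce_not_null` is an illustrative name, not Spark's expression.

```python
# Sketch of what an injected AssertNotNull buys at write time: a null flowing
# into a non-nullable target column fails the write instead of being silently
# stored. Names and the dict-based row shape are illustrative, not Spark's.

def enforce_not_null(row: dict, non_nullable: set[str]) -> dict:
    """Raise if any non-nullable column of `row` holds None; else pass it through."""
    for col in non_nullable:
        if row.get(col) is None:
            raise ValueError(f"NULL value appeared in non-nullable column: {col}")
    return row

ok = enforce_not_null({"id": 1, "note": None}, {"id"})  # passes: id is set
try:
    enforce_not_null({"id": None}, {"id"})              # fails: id is NULL
except ValueError:
    pass
```

Without the nullability flags restored from the catalog schema, the `non_nullable` set is effectively empty for V1 file source tables, so this check never fires.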

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 27, 2026

@cloud-fan
Contributor

This is probably too breaking a change, and many users may already treat this bug as a feature.

If users need strict not null enforcement, they should migrate to DS v2.

@gengliangwang
Member

+1 with @cloud-fan . At least, the default behavior should not be changed in 4.2.

@yaooqinn
Member Author

Yeah, I agree with you that this is too breaking. However, many of our clients and developers have raised doubts about this strange behavior of Spark, and there are no clear docs on it.

they should migrate to DS v2.

Do you mean by using built-in sources with v2 code path or suggesting users migrate to third-party formats?

@cloud-fan
Contributor

I think using plain file source tables is not recommended now; people should switch to lakehouse table formats.

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 27, 2026

Since DSv2 migration is an independent topic from this DSv1 bug, I made a PR for DSv2 specifically, @cloud-fan , @gengliangwang , @yaooqinn .

@yaooqinn
Member Author

Hi @dongjoon-hyun, built-in file sources with the DSv2 code path have the same issue, based on my unit tests.

@yaooqinn
Member Author

yaooqinn commented Feb 27, 2026

@cloud-fan Instead of making NOT NULL enforcement the default behavior, would you mind if we give it a shot and provide an alternative option for users who want built-in Parquet and Delta Lake to behave the same on NOT NULL constraints?

For streaming sources, we have such an option for toggling asNullable.

…ble inserts

V1 file-based DataSource writes (parquet/orc/json) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement by:

1. CreateDataSourceTableCommand: Preserve user-specified nullability by recursively merging nullability flags from the user schema into the resolved dataSource.schema.
2. PreprocessTableInsertion: Restore nullability flags from the catalog schema before null checks. This ensures AssertNotNull is injected when needed, gated by spark.sql.fileSource.insert.enforceNotNull.
3. Config: spark.sql.fileSource.insert.enforceNotNull (default false) - when true, enforces NOT NULL constraints for file-based tables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yaooqinn yaooqinn changed the title [SPARK-55716][SQL] Fix V1 file source NOT NULL constraint enforcement [SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts Feb 28, 2026

4 participants