fix: implement Spark-compatible null handling for arrays_overlap #3674

Draft
andygrove wants to merge 1 commit into apache:main from andygrove:fix/arrays-overlap-null-handling

@andygrove
Member

Which issue does this PR close?

Closes #3645.

Rationale for this change

DataFusion's array_has_any function treats NULL == NULL, so arrays_overlap(array(1, NULL), array(NULL, 2)) incorrectly returns true instead of null. This violates Spark's three-valued null semantics for arrays_overlap.

What changes are included in this PR?

  • Add a custom Spark-compatible arrays_overlap implementation in Rust (native/spark-expr/src/array_funcs/arrays_overlap.rs) that intercepts calls made under the array_has_any function name and applies Spark's null handling:
    • true when arrays share a common non-null element
    • null when no common non-null elements exist but either array contains nulls
    • false when no common elements and no nulls
  • Type-specialized fast paths for primitive types (HashSet<T::Native>) and strings (HashSet<&str>), with a ScalarValue fallback for complex types
  • Dispatch and downcast hoisted to batch level; HashSet reused across rows for efficiency
  • Supports both ListArray and LargeListArray via GenericListArray<O>
  • Remove ignore annotations from previously-failing SQL tests
  • Add compatibility documentation in the user guide
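
For reviewers who want the three result rules above in concrete form, here is a minimal sketch in plain Rust. It is not the code in arrays_overlap.rs: the names `overlap_row` and `overlap_batch` are hypothetical, rows are modeled as slices of `Option<i32>` rather than Arrow arrays, and cases the list above does not spell out (a NULL array as a whole, or special treatment of empty arrays) are not modeled. It only illustrates the true/null/false decision and the idea of reusing one HashSet across rows.

```rust
use std::collections::HashSet;

/// Three-valued overlap for a single row (hypothetical helper).
/// Returns Some(true), Some(false), or None for SQL NULL.
/// `seen` is a caller-owned scratch set that is cleared and reused per row.
fn overlap_row(
    left: &[Option<i32>],
    right: &[Option<i32>],
    seen: &mut HashSet<i32>,
) -> Option<bool> {
    seen.clear();
    let mut has_null = false;

    // Collect the non-null elements of the left array; remember any null.
    for v in left {
        match v {
            Some(x) => {
                seen.insert(*x);
            }
            None => has_null = true,
        }
    }

    // A shared non-null element decides the result immediately.
    for v in right {
        match v {
            Some(x) if seen.contains(x) => return Some(true),
            Some(_) => {}
            None => has_null = true,
        }
    }

    // No common non-null element: null if either side had a null, else false.
    if has_null {
        None
    } else {
        Some(false)
    }
}

/// Applies `overlap_row` to every row, allocating the scratch set only once.
fn overlap_batch(rows: &[(Vec<Option<i32>>, Vec<Option<i32>>)]) -> Vec<Option<bool>> {
    let mut seen = HashSet::new();
    rows.iter()
        .map(|(l, r)| overlap_row(l, r, &mut seen))
        .collect()
}
```

Under these assumptions, the reproduction from the rationale, arrays_overlap(array(1, NULL), array(NULL, 2)), maps to `overlap_row(&[Some(1), None], &[None, Some(2)], &mut seen)` and yields `None`, because a null on one side never matches a null on the other.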

How are these changes tested?

  • 6 Rust unit tests covering: overlap, no overlap, null-only overlap returning null, null array, empty array, and a multi-row scenario matching the issue reproduction
  • Re-enabled 2 SQL-based integration tests that were previously ignored due to this bug

Replace DataFusion's array_has_any (which treats NULL == NULL) with a
custom implementation that follows Spark's three-valued logic:
- true when arrays share a common non-null element
- null when no common non-null elements but either array has nulls
- false when no common elements and no nulls

Closes apache#3645
Development

Successfully merging this pull request may close these issues.

array_overlap correctness issue