
feat: add support for array_position expression #3172

Open
andygrove wants to merge 29 commits into apache:main from andygrove:feature/array-position

Conversation

@andygrove
Member

@andygrove andygrove commented Jan 15, 2026

Which issue does this PR close?

Closes #3157.
Closes #3153.

Rationale for this change

Spark's array_position function is not currently accelerated by Comet. Adding native support allows this expression to run on the native execution engine.

What changes are included in this PR?

Adds native Comet support for Spark's array_position function, which returns the 1-based position of an element in an array, or 0 if not found.

This required a custom Rust implementation because DataFusion's array_position returns UInt64 and null when not found, while Spark returns Int64 (LongType) and 0.

Key implementation details:

  • Returns Int64 to match Spark's LongType
  • Returns 0 when element is not found (Spark behavior)
  • Returns null when array is null or search element is null
  • Supports both List and LargeList array types
  • Handles NaN equality to match Spark's ordering.equiv() semantics (NaN == NaN)
  • Uses flat values buffer with direct offset access for efficient element comparison
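
The Spark semantics listed above (1-based index, 0 when absent, null propagation, NaN == NaN) can be sketched in plain Rust, independent of the Arrow types the actual implementation uses. This is a hypothetical illustration of the behavior, not the Comet code:

```rust
// Sketch of Spark's array_position semantics, with Option standing in
// for SQL NULL. Hypothetical; the real code operates on Arrow arrays.
fn array_position(arr: Option<&[Option<f64>]>, elem: Option<f64>) -> Option<i64> {
    // Null array or null search element -> NULL (Spark behavior)
    let (arr, elem) = (arr?, elem?);
    for (i, v) in arr.iter().enumerate() {
        // Null elements inside the array are skipped, never matched
        if let Some(v) = v {
            // NaN == NaN, matching Spark's ordering.equiv() semantics
            if (v.is_nan() && elem.is_nan()) || *v == elem {
                return Some(i as i64 + 1); // 1-based position
            }
        }
    }
    Some(0) // not found -> 0, not NULL (unlike DataFusion's built-in)
}
```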

Benchmark Results

OpenJDK 64-Bit Server VM 17.0.17+10 on Mac OS X 26.2
Apple M3 Ultra
array_position - int array:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                88             92           3         11.9          83.8       1.0X
Comet (Scan)                                        128            138           7          8.2         122.0       0.7X
Comet (Scan + Exec)                                  73             76           3         14.4          69.4       1.2X

array_position - string array:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               157            163           3          6.7         149.5       1.0X
Comet (Scan)                                        269            275           4          3.9         256.9       0.6X
Comet (Scan + Exec)                                 147            150           2          7.1         140.4       1.1X

Comet native execution is 1.1-1.2X faster than Spark for this expression when using the native DataFusion scan.

How are these changes tested?

SQL file-based tests in spark/src/test/resources/sql-tests/expressions/array/array_position.sql covering:

  • Integer, string, boolean, tinyint, smallint, bigint, float, double, decimal, date, and timestamp arrays
  • Null handling (null array, null search element, null elements within arrays)
  • Column vs literal argument combinations
  • Empty arrays and duplicate elements
  • NaN edge cases for float and double arrays

A microbenchmark was also added in CometArrayExpressionBenchmark.

Implements Spark's array_position function which returns the 1-based
position of an element in an array, returning 0 if not found.

This required a custom Rust implementation because DataFusion's
array_position returns UInt64 and null when not found, while Spark
returns Int64 (LongType) and 0.

Key implementation details:
- Returns Int64 to match Spark's LongType
- Returns 0 when element is not found (Spark behavior)
- Returns null when array is null or search element is null
- Supports both List and LargeList array types

Closes apache#3153

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as draft January 15, 2026 02:28
@codecov-commenter

codecov-commenter commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.96%. Comparing base (f09f8af) to head (36c3320).
⚠️ Report is 907 commits behind head on main.

Files with missing lines Patch % Lines
...src/main/scala/org/apache/comet/serde/arrays.scala 75.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3172      +/-   ##
============================================
+ Coverage     56.12%   59.96%   +3.84%     
- Complexity      976     1462     +486     
============================================
  Files           119      175      +56     
  Lines         11743    16180    +4437     
  Branches       2251     2684     +433     
============================================
+ Hits           6591     9703    +3112     
- Misses         4012     5128    +1116     
- Partials       1140     1349     +209     

☔ View full report in Codecov by Sentry.

@andygrove andygrove marked this pull request as ready for review January 15, 2026 23:36
# Conflicts:
#	docs/source/user-guide/latest/configs.md
#	native/spark-expr/src/comet_scalar_funcs.rs
@andygrove andygrove marked this pull request as draft January 30, 2026 01:48
@andygrove
Member Author

Moving this to draft until #3328 is merged

andygrove and others added 2 commits February 10, 2026 11:07
Move array_position tests from CometArrayExpressionSuite to a SQL file
test and fall back to Spark when all arguments are literals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as ready for review February 11, 2026 00:08
…o feature/array-position

# Conflicts:
#	native/spark-expr/src/array_funcs/mod.rs
#	native/spark-expr/src/comet_scalar_funcs.rs
andygrove and others added 4 commits February 18, 2026 07:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove stray array_repeat references from merge conflict resolution.
Add NULL val row to test data and add tests for all supported array
element types: boolean, tinyint, smallint, bigint, float, double,
decimal, date, and timestamp.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@comphead
Contributor

Just wondering, can we reuse DF? The built-in function gets optimized in apache/datafusion#20532

@andygrove
Member Author

Just wondering, can we reuse DF? The built-in function gets optimized in apache/datafusion#20532

The DF implementation isn't compatible with Spark though.

@andygrove andygrove marked this pull request as draft March 16, 2026 17:43
… review feedback

- Use typed array downcasting instead of ScalarValue for element
  comparison, improving performance from 0.4X to 0.7-0.8X of Spark
- Add getSupportLevel override marking as Incompatible (NaN equality)
- Add NaN edge case tests for float/double arrays
- Add CometArrayExpressionBenchmark microbenchmark
- Make spark_array_position function private
- Update docs to mark array_position as supported
Treat NaN == NaN in float/double comparisons, matching Spark's
ordering.equiv() behavior. This makes array_position Compatible
rather than Incompatible.
Avoid per-row subarray allocation from list_array.value(row_index).
Instead, downcast the flat values buffer once and iterate using
offset ranges directly. Improves from 0.7-0.8X to 0.9X of Spark.
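
The offset-based iteration this commit describes can be sketched without Arrow: a list array stores one flat values buffer plus an offsets array, and row i's elements are the slice values[offsets[i]..offsets[i+1]]. A minimal plain-Rust sketch of the idea (hypothetical names, not the Comet code):

```rust
// Simplified model of Arrow's list layout: one shared values buffer
// plus offsets delimiting each row. Hypothetical sketch only.
fn positions(values: &[i64], offsets: &[usize], needle: i64) -> Vec<i64> {
    let mut out = Vec::with_capacity(offsets.len() - 1);
    for row in 0..offsets.len() - 1 {
        // Each row is a zero-copy slice of the flat buffer; no per-row
        // subarray allocation as with list_array.value(row_index).
        let slice = &values[offsets[row]..offsets[row + 1]];
        let pos = slice
            .iter()
            .position(|v| *v == needle)
            .map(|i| i as i64 + 1) // 1-based
            .unwrap_or(0); // 0 when not found
        out.push(pos);
    }
    out
}
```
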
Switches benchmark to use SCAN_NATIVE_DATAFUSION for the Comet cases,
avoiding JVM parquet reader overhead. Results now show Comet is
1.1-1.2X faster than Spark.
@andygrove andygrove marked this pull request as ready for review March 18, 2026 12:08
# Conflicts:
#	native/spark-expr/src/comet_scalar_funcs.rs
#	spark/src/test/scala/org/apache/spark/sql/benchmark/CometArrayExpressionBenchmark.scala
@mbutrovich
Contributor

Claude summarized my notes for me. Hopefully it didn't transcribe anything wrong or hallucinate :)

PR #3172: array_position

Good implementation overall. Spark compatibility looks correct: returns Int64/0 (not DataFusion's UInt64/null), NaN equality handled properly, null array/element returns null, null values within arrays are skipped, 1-based indexing is right. Benchmarks show 1.1-1.2X faster than Spark.

NULL buffer handling

All the typed paths (position_primitive, position_float, position_boolean, position_string, position_fallback) build a Vec<Option<i64>> and convert to Int64Array::from(). This constructs the null buffer element-by-element.

Recent DataFusion optimizations (e.g. apache/datafusion#21464, apache/datafusion#21471, apache/datafusion#21482) show 3-37% improvement by computing the null buffer upfront:

  1. NullBuffer::union(list_array.nulls(), element.nulls()) to get the combined output nulls
  2. Build Vec<i64> (not Option<i64>) for the values
  3. Construct via Int64Array::new(ScalarBuffer::from(values), combined_nulls)

This avoids per-row is_null() checks in the hot loop and the Option tracking overhead. Not a blocker given the speedup is already decent, but it's a straightforward optimization that could widen the gap further.
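
In plain-Rust terms, the suggested pattern keeps validity separate from values instead of interleaving them through Option. A sketch of the idea only; the real code would use arrow's NullBuffer::union and Int64Array::new, and the names here are hypothetical:

```rust
// Compute output validity once by AND-ing the per-row validity masks
// (the analogue of NullBuffer::union), then fill a plain i64 buffer
// with no per-row Option tracking. Null slots hold a don't-care value
// (0) that the validity mask hides, mirroring Int64Array::new.
fn positions_with_validity(
    rows: &[Vec<i64>],
    row_valid: &[bool],  // validity of the list column
    elem: i64,
    elem_valid: &[bool], // validity of the search element
) -> (Vec<i64>, Vec<bool>) {
    let validity: Vec<bool> = row_valid
        .iter()
        .zip(elem_valid)
        .map(|(a, b)| *a && *b)
        .collect();
    let values: Vec<i64> = rows
        .iter()
        .map(|row| {
            row.iter()
                .position(|v| *v == elem)
                .map(|i| i as i64 + 1)
                .unwrap_or(0)
        })
        .collect();
    (values, validity)
}
```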

Signature

Signature::variadic_any accepts any number of arguments of any type. Since array_position always takes exactly two, Signature::new(TypeSignature::Any(2), Volatility::Immutable) would be more precise. The serde already validates the arity, so this is just defense in depth.

Unwrap on downcast

downcast_ref::<GenericListArray<O>>().unwrap() in generic_array_position could panic on invalid input. Should be unreachable given the match guard, but .ok_or_else(|| DataFusionError::Internal("expected list array".into()))? would be safer and consistent with patterns elsewhere.
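
The suggested pattern maps directly onto std's downcast: return an error instead of panicking when the cast fails. A minimal stdlib sketch (the real code downcasts an Arrow array and would produce DataFusionError::Internal; the names here are stand-ins):

```rust
use std::any::Any;

// Hypothetical stand-in for the Arrow downcast: an error instead of a
// panic when the dynamic type is not what the caller expected.
fn as_i64_vec(arr: &dyn Any) -> Result<&Vec<i64>, String> {
    arr.downcast_ref::<Vec<i64>>()
        .ok_or_else(|| "expected list array".to_string())
}
```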

Test gap

Consider adding a nested array test like array_position(array(array(1,2), array(3,4)), array(1,2)). Spark's unit tests cover this case and it would exercise the position_fallback code path.

- Compute combined null buffer upfront via NullBuffer::union and use
  Vec<i64> with Int64Array::new() instead of Vec<Option<i64>>, avoiding
  per-row null tracking overhead in all typed paths
- Use TypeSignature::Any(2) instead of variadic_any for precise arity
- Replace .unwrap() on downcast with .ok_or_else() for safer error
  handling
- Add nested array test cases to exercise position_fallback code path
# Conflicts:
#	native/spark-expr/src/comet_scalar_funcs.rs
#	spark/src/main/scala/org/apache/comet/serde/arrays.scala
Contributor

@comphead comphead left a comment


Thanks @andygrove, we would need to backport it to the datafusion-spark crate.

For testing, should we test all possible data types?

@andygrove
Member Author

andygrove commented Apr 21, 2026

Thanks @andygrove, we would need to backport it to the datafusion-spark crate.

For testing, should we test all possible data types?

I added support tests for timestamp_ntz

@andygrove
Member Author

Thanks @andygrove, we would need to backport it to the datafusion-spark crate.
For testing, should we test all possible data types?

I added support tests for timestamp_ntz

Also added some nested array columnar tests

# Conflicts:
#	spark/src/main/scala/org/apache/comet/serde/arrays.scala


Development

Successfully merging this pull request may close these issues.

  • [Feature] Support Spark expression: array_position
  • [Feature] Support Spark expression: length_of_json_array

4 participants