
feat: add support for array_position expression #3172

Open
andygrove wants to merge 29 commits into apache:main from andygrove:feature/array-position

Conversation

@andygrove
Member

@andygrove andygrove commented Jan 15, 2026

Which issue does this PR close?

Closes #3157.
Closes #3153.

Rationale for this change

Spark's array_position function is not currently accelerated by Comet. Adding native support allows this expression to run on the native execution engine.

What changes are included in this PR?

Adds native Comet support for Spark's array_position function, which returns the 1-based position of an element in an array, or 0 if not found.

This required a custom Rust implementation because DataFusion's array_position returns UInt64 and null when not found, while Spark returns Int64 (LongType) and 0.

Key implementation details:

  • Returns Int64 to match Spark's LongType
  • Returns 0 when element is not found (Spark behavior)
  • Returns null when array is null or search element is null
  • Supports both List and LargeList array types
  • Handles NaN equality to match Spark's ordering.equiv() semantics (NaN == NaN)
  • Uses flat values buffer with direct offset access for efficient element comparison
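
The Spark semantics listed above (1-based index, 0 when absent, null propagation, NaN == NaN) can be sketched in plain Rust, independent of the Arrow types the actual implementation uses. This is a hypothetical illustration of the behavior, not the Comet code:

```rust
// Sketch of Spark's array_position semantics, with Option standing in
// for SQL NULL. Hypothetical; the real code operates on Arrow arrays.
fn array_position(arr: Option<&[Option<f64>]>, elem: Option<f64>) -> Option<i64> {
    // Null array or null search element -> NULL (Spark behavior)
    let (arr, elem) = (arr?, elem?);
    for (i, v) in arr.iter().enumerate() {
        // Null elements inside the array are skipped, never matched
        if let Some(v) = v {
            // NaN == NaN, matching Spark's ordering.equiv() semantics
            if (v.is_nan() && elem.is_nan()) || *v == elem {
                return Some(i as i64 + 1); // 1-based position
            }
        }
    }
    Some(0) // not found -> 0, not NULL (unlike DataFusion's built-in)
}
```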

Benchmark Results

OpenJDK 64-Bit Server VM 17.0.17+10 on Mac OS X 26.2
Apple M3 Ultra
array_position - int array:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                88             92           3         11.9          83.8       1.0X
Comet (Scan)                                        128            138           7          8.2         122.0       0.7X
Comet (Scan + Exec)                                  73             76           3         14.4          69.4       1.2X

array_position - string array:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               157            163           3          6.7         149.5       1.0X
Comet (Scan)                                        269            275           4          3.9         256.9       0.6X
Comet (Scan + Exec)                                 147            150           2          7.1         140.4       1.1X

Comet native execution is 1.1-1.2X faster than Spark for this expression when using the native DataFusion scan.

How are these changes tested?

SQL file-based tests in spark/src/test/resources/sql-tests/expressions/array/array_position.sql covering:

  • Integer, string, boolean, tinyint, smallint, bigint, float, double, decimal, date, and timestamp arrays
  • Null handling (null array, null search element, null elements within arrays)
  • Column vs literal argument combinations
  • Empty arrays and duplicate elements
  • NaN edge cases for float and double arrays

A microbenchmark was also added in CometArrayExpressionBenchmark.

Implements Spark's array_position function which returns the 1-based
position of an element in an array, returning 0 if not found.

This required a custom Rust implementation because DataFusion's
array_position returns UInt64 and null when not found, while Spark
returns Int64 (LongType) and 0.

Key implementation details:
- Returns Int64 to match Spark's LongType
- Returns 0 when element is not found (Spark behavior)
- Returns null when array is null or search element is null
- Supports both List and LargeList array types

Closes apache#3153

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as draft January 15, 2026 02:28
@codecov-commenter

codecov-commenter commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.96%. Comparing base (f09f8af) to head (36c3320).
⚠️ Report is 907 commits behind head on main.

Files with missing lines Patch % Lines
...src/main/scala/org/apache/comet/serde/arrays.scala 75.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3172      +/-   ##
============================================
+ Coverage     56.12%   59.96%   +3.84%     
- Complexity      976     1462     +486     
============================================
  Files           119      175      +56     
  Lines         11743    16180    +4437     
  Branches       2251     2684     +433     
============================================
+ Hits           6591     9703    +3112     
- Misses         4012     5128    +1116     
- Partials       1140     1349     +209     

☔ View full report in Codecov by Sentry.

@andygrove andygrove marked this pull request as ready for review January 15, 2026 23:36
# Conflicts:
#	docs/source/user-guide/latest/configs.md
#	native/spark-expr/src/comet_scalar_funcs.rs
@andygrove andygrove marked this pull request as draft January 30, 2026 01:48
@andygrove
Member Author

Moving this to draft until #3328 is merged

andygrove and others added 2 commits February 10, 2026 11:07
Move array_position tests from CometArrayExpressionSuite to a SQL file
test and fall back to Spark when all arguments are literals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as ready for review February 11, 2026 00:08
…o feature/array-position

# Conflicts:
#	native/spark-expr/src/array_funcs/mod.rs
#	native/spark-expr/src/comet_scalar_funcs.rs
andygrove and others added 4 commits February 18, 2026 07:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove stray array_repeat references from merge conflict resolution.
Add NULL val row to test data and add tests for all supported array
element types: boolean, tinyint, smallint, bigint, float, double,
decimal, date, and timestamp.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@comphead
Contributor

Just wondering, can we reuse DF? The built-in function gets optimized in apache/datafusion#20532

@andygrove
Member Author

Just wondering, can we reuse DF? The built-in function gets optimized in apache/datafusion#20532

The DF implementation isn't compatible with Spark though.

@andygrove andygrove marked this pull request as draft March 16, 2026 17:43
… review feedback

- Use typed array downcasting instead of ScalarValue for element
  comparison, improving performance from 0.4X to 0.7-0.8X of Spark
- Add getSupportLevel override marking as Incompatible (NaN equality)
- Add NaN edge case tests for float/double arrays
- Add CometArrayExpressionBenchmark microbenchmark
- Make spark_array_position function private
- Update docs to mark array_position as supported
Treat NaN == NaN in float/double comparisons, matching Spark's
ordering.equiv() behavior. This makes array_position Compatible
rather than Incompatible.
Avoid per-row subarray allocation from list_array.value(row_index).
Instead, downcast the flat values buffer once and iterate using
offset ranges directly. Improves from 0.7-0.8X to 0.9X of Spark.
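
The offset-based iteration this commit describes can be sketched without Arrow: a list array stores one flat values buffer plus an offsets array, and row i's elements are the slice values[offsets[i]..offsets[i+1]]. A minimal plain-Rust sketch of the idea (hypothetical names, not the Comet code):

```rust
// Simplified model of Arrow's list layout: one shared values buffer
// plus offsets delimiting each row. Hypothetical sketch only.
fn positions(values: &[i64], offsets: &[usize], needle: i64) -> Vec<i64> {
    let mut out = Vec::with_capacity(offsets.len() - 1);
    for row in 0..offsets.len() - 1 {
        // Each row is a zero-copy slice of the flat buffer; no per-row
        // subarray allocation as with list_array.value(row_index).
        let slice = &values[offsets[row]..offsets[row + 1]];
        let pos = slice
            .iter()
            .position(|v| *v == needle)
            .map(|i| i as i64 + 1) // 1-based
            .unwrap_or(0); // 0 when not found
        out.push(pos);
    }
    out
}
```
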
Switches benchmark to use SCAN_NATIVE_DATAFUSION for the Comet cases,
avoiding JVM parquet reader overhead. Results now show Comet is
1.1-1.2X faster than Spark.
@andygrove andygrove marked this pull request as ready for review March 18, 2026 12:08
# Conflicts:
#	native/spark-expr/src/comet_scalar_funcs.rs
#	spark/src/test/scala/org/apache/spark/sql/benchmark/CometArrayExpressionBenchmark.scala
@mbutrovich
Contributor

Claude summarized my notes for me. Hopefully it didn't transcribe anything wrong or hallucinate :)

PR #3172: array_position

Good implementation overall. Spark compatibility looks correct: returns Int64/0 (not DataFusion's UInt64/null), NaN equality handled properly, null array/element returns null, null values within arrays are skipped, 1-based indexing is right. Benchmarks show 1.1-1.2X faster than Spark.

NULL buffer handling

All the typed paths (position_primitive, position_float, position_boolean, position_string, position_fallback) build a Vec<Option<i64>> and convert to Int64Array::from(). This constructs the null buffer element-by-element.

Recent DataFusion optimizations (e.g. apache/datafusion#21464, apache/datafusion#21471, apache/datafusion#21482) show 3-37% improvement by computing the null buffer upfront:

  1. NullBuffer::union(list_array.nulls(), element.nulls()) to get the combined output nulls
  2. Build Vec<i64> (not Option<i64>) for the values
  3. Construct via Int64Array::new(ScalarBuffer::from(values), combined_nulls)

This avoids per-row is_null() checks in the hot loop and the Option tracking overhead. Not a blocker given the speedup is already decent, but it's a straightforward optimization that could widen the gap further.
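
In plain-Rust terms, the suggested pattern keeps validity separate from values instead of interleaving them through Option. A sketch of the idea only; the real code would use arrow's NullBuffer::union and Int64Array::new, and the names here are hypothetical:

```rust
// Compute output validity once by AND-ing the per-row validity masks
// (the analogue of NullBuffer::union), then fill a plain i64 buffer
// with no per-row Option tracking. Null slots hold a don't-care value
// (0) that the validity mask hides, mirroring Int64Array::new.
fn positions_with_validity(
    rows: &[Vec<i64>],
    row_valid: &[bool],  // validity of the list column
    elem: i64,
    elem_valid: &[bool], // validity of the search element
) -> (Vec<i64>, Vec<bool>) {
    let validity: Vec<bool> = row_valid
        .iter()
        .zip(elem_valid)
        .map(|(a, b)| *a && *b)
        .collect();
    let values: Vec<i64> = rows
        .iter()
        .map(|row| {
            row.iter()
                .position(|v| *v == elem)
                .map(|i| i as i64 + 1)
                .unwrap_or(0)
        })
        .collect();
    (values, validity)
}
```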

Signature

Signature::variadic_any accepts any number of arguments of any type. Since array_position always takes exactly two, Signature::new(TypeSignature::Any(2), Volatility::Immutable) would be more precise. The serde already validates the arity, so this is just defense in depth.

Unwrap on downcast

downcast_ref::<GenericListArray<O>>().unwrap() in generic_array_position could panic on invalid input. Should be unreachable given the match guard, but .ok_or_else(|| DataFusionError::Internal("expected list array".into()))? would be safer and consistent with patterns elsewhere.
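
The suggested pattern maps directly onto std's downcast: return an error instead of panicking when the cast fails. A minimal stdlib sketch (the real code downcasts an Arrow array and would produce DataFusionError::Internal; the names here are stand-ins):

```rust
use std::any::Any;

// Hypothetical stand-in for the Arrow downcast: an error instead of a
// panic when the dynamic type is not what the caller expected.
fn as_i64_vec(arr: &dyn Any) -> Result<&Vec<i64>, String> {
    arr.downcast_ref::<Vec<i64>>()
        .ok_or_else(|| "expected list array".to_string())
}
```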

Test gap

Consider adding a nested array test like array_position(array(array(1,2), array(3,4)), array(1,2)). Spark's unit tests cover this case and it would exercise the position_fallback code path.

- Compute combined null buffer upfront via NullBuffer::union and use
  Vec<i64> with Int64Array::new() instead of Vec<Option<i64>>, avoiding
  per-row null tracking overhead in all typed paths
- Use TypeSignature::Any(2) instead of variadic_any for precise arity
- Replace .unwrap() on downcast with .ok_or_else() for safer error
  handling
- Add nested array test cases to exercise position_fallback code path
# Conflicts:
#	native/spark-expr/src/comet_scalar_funcs.rs
#	spark/src/main/scala/org/apache/comet/serde/arrays.scala
Contributor

@comphead comphead left a comment


Thanks @andygrove, we would need to backport it to the datafusion-spark crate.

For testing, should we test all possible data types?

@andygrove
Member Author

andygrove commented Apr 21, 2026

Thanks @andygrove, we would need to backport it to the datafusion-spark crate.

For testing, should we test all possible data types?

I added support tests for timestamp_ntz

@andygrove
Member Author

Thanks @andygrove, we would need to backport it to the datafusion-spark crate.
For testing, should we test all possible data types?

I added support tests for timestamp_ntz

Also added some nested array columnar tests

# Conflicts:
#	spark/src/main/scala/org/apache/comet/serde/arrays.scala


Development

Successfully merging this pull request may close these issues.

  • [Feature] Support Spark expression: array_position
  • [Feature] Support Spark expression: length_of_json_array

4 participants