Skip to content

[C++][Python] Implement search_sorted kernel for all primitive types and run-end encoded arrays #49677

@Alex-PLACET

Description

@Alex-PLACET

Describe the enhancement requested

Implement a kernel that replicates the semantics of NumPy’s searchsorted function (without the sorter argument).
The kernel should support all primitive types as well as run-end encoded arrays.

Example API

>>> sorted_array = pa.array()
>>> to_search = pa.array()

>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array>
[
  0,
  1,
  3,
  5
]

>>> pc.search_sorted(sorted_array, to_search, side='right')
<pyarrow.lib.UInt64Array>
[
  0,
  3,
  3,
  5
]

Explanation

  • 50 < all → index 0
  • 200 equals first occurrence → index 1 (side='left')
  • 200 for side='right' → index 3
  • 250 between 200 and 300 → index 3
  • 400 > all values → index 5

Null Handling

Nulls in the first (sorted) array

If nulls are clustered first:

sorted_array = pa.array([null, 200, 300, 300])
to_search = pa.array()

Expected:

  • side='left'[1, 1, 2, 4]
  • side='right'[1, 2, 2, 4]

If nulls are clustered last:

sorted_array = pa.array([200, 300, 300, null, null])

Expected:

  • side='left'[0, 0, 1, 3]
  • side='right'[0, 1, 1, 3]

Nulls in the second (to_search) array

Two options:

  1. Emit nulls for null search keys.
  2. Match nulls within the null portion of the sorted array.

Example for (2):

>>> sorted_array = pa.array([null, null, 200])  # nulls first
>>> to_search = pa.array([null, 100, 300])
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
  0,
  2,
  3
]
>>> pc.search_sorted(sorted_array, to_search, side='right')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
  1,
  2,
  3
]
>>> sorted_array = pa.array([200, null, null])  # nulls last
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
  1,
  0,
  1
]
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
  3,
  0,
  1
]

Requirements

  • Implement for all primitive types (int*, float*, boolean, etc.)
  • Support run-end encoded arrays
  • Handle both side='left' and side='right'
  • Define consistent behavior for null placement
  • Return a UInt64Array of insertion indices

References

Component(s)

Python, C++

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions