-
Notifications
You must be signed in to change notification settings - Fork 4.1k
[C++][Python] Implement search_sorted kernel for all primitive types and run-end encoded arrays #49677
Copy link
Copy link
Open
Description
Describe the enhancement requested
Implement a kernel that replicates the semantics of NumPy’s searchsorted function (without the sorter argument).
The kernel should support all primitive types as well as run-end encoded arrays.
Example API
>>> sorted_array = pa.array()
>>> to_search = pa.array()
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array>
[
0,
1,
3,
5
]
>>> pc.search_sorted(sorted_array, to_search, side='right')
<pyarrow.lib.UInt64Array>
[
0,
3,
3,
5
]Explanation
50< all → index0200equals first occurrence → index1(side='left')200forside='right'→ index3250between200and300→ index3400> all values → index5
Null Handling
Nulls in the first (sorted) array
If nulls are clustered first:
sorted_array = pa.array([null, 200, 300, 300])
to_search = pa.array()Expected:
side='left'→[1, 1, 2, 4]side='right'→[1, 2, 2, 4]
If nulls are clustered last:
sorted_array = pa.array([200, 300, 300, null, null])Expected:
side='left'→[0, 0, 1, 3]side='right'→[0, 1, 1, 3]
Nulls in the second (to_search) array
Two options:
- Emit nulls for null search keys.
- Match nulls within the null portion of the sorted array.
Example for (2):
>>> sorted_array = pa.array([null, null, 200]) # nulls first
>>> to_search = pa.array([null, 100, 300])
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
0,
2,
3
]
>>> pc.search_sorted(sorted_array, to_search, side='right')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
1,
2,
3
]
>>> sorted_array = pa.array([200, null, null]) # nulls last
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
1,
0,
1
]
>>> pc.search_sorted(sorted_array, to_search, side='left')
<pyarrow.lib.UInt64Array object at 0x7dc6c1bcefe0>
[
3,
0,
1
]Requirements
- Implement for all primitive types (
int*,float*,boolean, etc.) - Support run-end encoded arrays
- Handle both
side='left'andside='right' - Define consistent behavior for null placement
- Return a
UInt64Arrayof insertion indices
References
Component(s)
Python, C++
Reactions are currently unavailable
Metadata
Metadata
Assignees
Type
Fields
Give feedbackNo fields configured for issues without a type.