Add SIMDBinarySearch strategy (opt-in 8-way SIMD-gather binary search) by ChrisRackauckas-Claude · Pull Request #70 · SciML/FindFirstFunctions.jl

ChrisRackauckas-Claude · 2026-05-20T15:57:46Z

Adds a new SIMDBinarySearch <: SearchStrategy that explores whether SIMD
gather instructions can beat scalar binary search. The result on
hot-cache workloads is an honest negative: scalar binary search wins
almost everywhere, because branch-mispredict / dependency-chain latency
dominates over throughput, and the SIMD gather, probe-index math, and
bitmask reduction together cost more per iteration than the log₂(n)
scalar compares they save.

Kept as opt-in, not added to Auto.

Algorithm

8-way SIMD-gather binary search. Each iteration over bracket [lo, hi]
of length n:

1. Compute 8 probe indices p_k = lo + ((k-1)*(n-1)) ÷ 7 (so p_1 = lo,
   p_8 = hi, six interior probes evenly spaced).
2. SIMD.vgather(v, idx): load all 8 values in one instruction.
3. Lane-wise compare against x returns Vec{8, Bool}.
4. SIMD.bitmask(mask) packs the 8 lanes into UInt8; trailing_zeros
   finds the first lane where the predicate flipped.
5. Narrow to the relevant 2-probe bracket and recurse.
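
To make the probe arithmetic concrete, here is a minimal scalar sketch of a single step. It is not the PR's code: `probe_step` is a hypothetical name, the predicate direction (`v[p] > x`) is an assumption, and a plain loop stands in for the gather, lane-wise compare, and bitmask reduction.

```julia
# Scalar emulation of one 8-way probe step on the bracket [lo, hi].
# `probe_step` is a hypothetical helper, not part of FindFirstFunctions.jl.
function probe_step(v::AbstractVector, x, lo::Int, hi::Int)
    n = hi - lo + 1
    # p_k = lo + ((k-1)*(n-1)) ÷ 7, so p_1 == lo and p_8 == hi.
    probes = ntuple(k -> lo + ((k - 1) * (n - 1)) ÷ 7, 8)
    # Bit k-1 is set when v[p_k] > x (assumed predicate); this stands in
    # for the lane-wise compare followed by the bitmask reduction.
    mask = 0x00
    for k in 1:8
        if v[probes[k]] > x
            mask |= 0x01 << (k - 1)
        end
    end
    # First lane where the predicate flipped; 9 when no lane is set,
    # because trailing_zeros(0x00) == 8 for a UInt8.
    first_gt = trailing_zeros(mask) + 1
    return probes, mask, first_gt
end

v = collect(10:10:1000)                  # 100 sorted values
probes, mask, first_gt = probe_step(v, 345, 1, 100)
```

With these inputs the probes are (1, 15, 29, 43, 57, 71, 85, 100) and the fourth lane is the first to exceed 345, so the next bracket would run from probe 3 up to one before probe 4, i.e. [29, 42].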

The bracket shrinks ~7-8× per iteration. Once it reaches the base-case
threshold of 16, the search falls through to
Base.searchsortedlast(v, x, lo, hi, Forward). Specialised for
DenseVector{Int64} and DenseVector{Float64}; other eltypes and
non-Forward orderings fall back to BinaryBracket. Any hint is ignored.
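
The overall loop, including the base-case fall-through, can be modelled in plain scalar Julia. This is a sketch under assumptions, not the PR's implementation: `eightway_last` and its `basecase` keyword are hypothetical, the loop condition (bracket length strictly greater than the threshold) is a guess, and the base case uses the public `searchsortedlast` on a view rather than the five-argument method named above.

```julia
# Scalar model of the 8-way search; returns the same index as
# searchsortedlast(v, x). `eightway_last` is a hypothetical name.
function eightway_last(v::AbstractVector, x; basecase::Int = 16)
    lo, hi = 1, length(v)                 # assumes 1-based indexing
    while hi - lo + 1 > basecase
        n = hi - lo + 1
        # A SIMD version would evaluate all 8 lanes at once; this scalar
        # model walks the probes and stops at the first one that sorts after x.
        first_gt, prev, cur = 9, lo, lo
        for k in 1:8
            cur = lo + ((k - 1) * (n - 1)) ÷ 7
            if isless(x, v[cur])
                first_gt = k
                break
            end
            prev = cur
        end
        first_gt == 1 && return lo - 1    # x sorts before v[lo]
        first_gt == 9 && return hi        # no probe exceeded x, so hi is the answer
        lo, hi = prev, cur - 1            # narrow to the 2-probe bracket
    end
    # Base case: scalar search on what is left of the bracket.
    return lo - 1 + searchsortedlast(view(v, lo:hi), x)
end

# Usage: the result should agree with Base on sorted input.
w = collect(0.5:0.5:50.0)
agrees = eightway_last(w, 12.3) == searchsortedlast(w, 12.3)
```

The invariant is that the answer always lies in [lo - 1, hi], which is why the two early returns are safe once the bracket has been narrowed.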

Bench results

bench/simd_binary_sweep.jl. "Ratio S/B" is SIMD/Base — lower means
SIMD wins.

| eltype | n | Hot cache: winner (S/B) | Cold cache: winner (S/B) |
|---|---|---|---|
| Float64 | 256 | Base (1.37×) | tie (1.05×) |
| Float64 | 1024 | Base (1.32×) | tie (1.00×) |
| Float64 | 4096 | Base (1.32×) | Base (1.31×) |
| Float64 | 16384 | Base (1.21×) | SIMD (0.85×) |
| Float64 | 65536 | Base (1.12×) | SIMD (0.92×) |
| Float64 | 262144 | tie (1.05×) | tie (1.03×) |
| Float64 | 1048576 | tie (1.01×) | SIMD (0.74×) |
| Int64 | 256–10⁶ | Base wins everywhere | mostly Base, ties at large n |

Hot cache: scalar Base.searchsortedlast wins virtually everywhere —
the SIMD overhead exceeds the log₂(n) compares saved.
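
A rough step count shows how few compares there are to save. The sketch below is an idealization rather than a measurement: it assumes each gather iteration shrinks the bracket by exactly 7× and that the base case takes over at length 16, and both function names are hypothetical.

```julia
# Scalar binary search: about log2(n) dependent compares.
scalar_steps(n::Integer) = ceil(Int, log2(n))

# Idealized 8-way search: number of gather iterations until the bracket
# fits the base case, plus the scalar compares left for the base case.
function eightway_steps(n::Integer; shrink::Int = 7, basecase::Int = 16)
    gathers = 0
    while n > basecase
        n = cld(n, shrink)                # assumed exact 7x shrink per iteration
        gathers += 1
    end
    return gathers, ceil(Int, log2(max(n, 1)))
end

scalar_steps(2^20)      # 20 scalar compares
eightway_steps(2^20)    # (6, 4): 6 gather iterations, then ~4 scalar compares
```

Under these assumptions, at n = 2²⁰ six gather iterations replace about 16 scalar compares, so each gather iteration has to cost less than roughly 2.7 scalar steps just to break even.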

Cold cache: mixed. Float64 SIMD wins at n = 16384, 65536, and 2²⁰
(best 0.74× at n = 2²⁰), with a tie at n = 262144, because gather
amortizes cold-load latency. Int64 SIMD never wins: Base takes most
sizes, with ties at large n, because Int compares are so cheap that
the gather/mask overhead never pays.

Recommendation: opt-in only, not in `Auto`

The only clear wins are cold-cache Float64 at n ≥ 16384, and even
there n = 262144 is a tie. That regime is hardware-dependent (gather
latency, vector unit width, LLC size), and "is v cold?" is not
something Auto can probe cheaply. Folding this into Auto would
penalize the much more common hot-cache case for a narrow worst-case
win.

For batched sorted-query workloads, the existing hinted strategies
(ExpFromLeft, BracketGallop, SIMDLinearScan) remain strictly
better — they get O(1) amortized cost from a good hint that
SIMDBinarySearch throws away by design.
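
To illustrate why a hint helps on sorted query batches, here is a minimal galloping search that starts from the previous answer. It is a generic exponential-search sketch, not the library's ExpFromLeft or BracketGallop implementation; `gallop_last` and `batch_last` are hypothetical names.

```julia
# searchsortedlast that starts at `hint` and gallops right in doubling steps.
function gallop_last(v::AbstractVector, x, hint::Integer)
    n = length(v)
    n == 0 && return 0
    lo = clamp(hint, 1, n)
    # Hint overshoots: the answer lies left of it, so search that prefix.
    isless(x, v[lo]) && return searchsortedlast(view(v, 1:lo-1), x)
    step = 1
    hi = lo + step
    while hi <= n && !isless(x, v[hi])
        lo = hi
        step *= 2
        hi = lo + step
    end
    # Now v[lo] <= x and either hi > n or v[hi] > x.
    return lo - 1 + searchsortedlast(view(v, lo:min(hi - 1, n)), x)
end

# Sorted queries: each answer becomes the hint for the next query.
function batch_last(v::AbstractVector, queries)
    out = Int[]
    hint = 1
    for x in queries
        hint = gallop_last(v, x, hint)
        push!(out, hint)
    end
    return out
end
```

When consecutive queries land close together, the gallop stops after a step or two, which is the amortized saving a hint-ignoring strategy cannot get.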

Tests

105,536 tests pass. New SIMDBinarySearch correctness @safetestset
with ~20,910 fresh tests: Int64 fuzz (10k), Float64 fuzz (10k),
basecase-boundary sweep, edge cases (empty, single-element, x outside
range, exact match, duplicates, constant vector, hint-ignored sanity,
non-Int64/Float64 fallback, reverse-order fallback).
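
A fuzz check of this kind can be sketched as a differential test against Base. This is not the PR's test code: `fuzz_against_base`, the value ranges, and the seed are illustrative assumptions, and the harness works for any candidate with the signature `search(v, x)`.

```julia
using Random

# Compare `search(v, x)` with Base.searchsortedlast on random sorted input.
# Returns `nothing` on success, or the first failing (v, x) pair.
function fuzz_against_base(search; trials::Int = 1_000, seed::Int = 1)
    rng = MersenneTwister(seed)
    for _ in 1:trials
        n = rand(rng, 0:64)                  # includes empty and single-element
        v = sort!(rand(rng, -20:20, n))      # small range, so duplicates are likely
        x = rand(rng, -25:25)                # sometimes outside the range of v
        search(v, x) == searchsortedlast(v, x) || return (v, x)
    end
    return nothing
end

# Usage with a trivially correct reference: on sorted input, the number
# of elements <= x equals searchsortedlast(v, x).
linear_last(v, x) = count(<=(x), v)
result = fuzz_against_base(linear_last)
```

Returning the failing pair rather than throwing makes it easy to turn a failure into a fixed regression case.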

🤖 Note for reviewer: this PR is in draft. Please ignore until reviewed
by @ChrisRackauckas.

Single-query binary search that probes 8 candidate positions per iteration via `SIMD.vgather` + `SIMD.bitmask`, shrinking the bracket by ~8x per step instead of 2x. Specialised for `DenseVector{Int64}` and `DenseVector{Float64}`; other eltypes and non-Forward orderings fall back to `BinaryBracket`.

Strategy is opt-in only; `Auto` does not pick it. The bench sweep (`bench/simd_binary_sweep.jl`) shows scalar `BinaryBracket` beats it on hot cache across every n from 256 to ~1M for both Float64 and Int64 (SIMD/Base ratio 1.2x–1.7x). On cold cache (working set > LLC) the picture shifts but is mixed: Float64 SIMDBinarySearch starts winning at n >= 16384 (ratio ~0.7–0.9x at n = 1M); Int64 still loses everywhere except a tie at n = 1M. The SIMD step's gather overhead and per-step probe-position math together cost more than the log2(n) scalar compares it saves at most n.

Recommendation: keep opt-in. Don't fold into Auto. Workloads with very large Float64 vectors that miss LLC may want to pin it manually, but the hardware-dependent crossover argues against an Auto branch.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>

ChrisRackauckas-Claude wants to merge 1 commit into SciML:main from ChrisRackauckas-Claude:simd-binary-search
