Add SIMDBinarySearch strategy (opt-in 8-way SIMD-gather binary search)#70

Draft
ChrisRackauckas-Claude wants to merge 1 commit into SciML:main from ChrisRackauckas-Claude:simd-binary-search

Conversation

@ChrisRackauckas-Claude
Contributor

Adds a new SIMDBinarySearch <: SearchStrategy exploring whether SIMD
gather instructions can beat scalar binary search. Honest negative
result on hot-cache workloads: hot scalar binary search wins almost
everywhere because branch-mispredict / dependency-chain latency
dominates over throughput, and the SIMD gather + probe-index math +
bitmask reduction costs more per iteration than the log₂(n) scalar
compares saved.

Kept as opt-in, not added to Auto.

Algorithm

8-way SIMD-gather binary search. Each iteration over bracket [lo, hi]
of length n:

  1. Compute 8 probe indices p_k = lo + ((k-1)*(n-1)) ÷ 7 (so p_1 = lo,
    p_8 = hi, six interior probes evenly spaced).
  2. SIMD.vgather(v, idx) — load all 8 values in one instruction.
  3. Lane-wise compare against x returns Vec{8, Bool}.
  4. SIMD.bitmask(mask) packs the 8 lanes into UInt8; trailing_zeros
    finds the first lane where the predicate flipped.
  5. Narrow to the relevant 2-probe bracket and recurse.

Bracket shrinks ~7-8× per iteration; once it reaches the basecase
threshold of 16, the search falls through to
Base.searchsortedlast(v, x, lo, hi, Forward). Specialised
for DenseVector{Int64} and DenseVector{Float64}. Other eltypes /
non-Forward orderings fall back to BinaryBracket. Ignores any hint.
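
The steps above can be sketched in plain Julia. This is a scalar emulation for illustration only, not the PR's implementation: the names `probe_index` and `eightway_searchsortedlast` are hypothetical, the eight loads are done in a loop rather than with `SIMD.vgather`, the mask is built by hand rather than with `SIMD.bitmask`, and the basecase test (`> basecase`) is an assumption.

```julia
# Probe index from step 1: p_k = lo + ((k-1)*(n-1)) ÷ 7, so p_1 == lo and p_8 == hi.
probe_index(lo, n, k) = lo + ((k - 1) * (n - 1)) ÷ 7

# Hypothetical scalar emulation of the 8-probe narrowing loop.
# Returns the same index as searchsortedlast(v, x) for a Forward-sorted v.
function eightway_searchsortedlast(v::AbstractVector, x; basecase::Int = 16)
    lo, hi = firstindex(v), lastindex(v)
    while hi - lo + 1 > basecase
        n = hi - lo + 1
        # Bit k-1 is set when v[p_k] > x. A SIMD version would gather the
        # eight values and do one lane-wise compare instead of this loop.
        mask = 0x00
        for k in 1:8
            flipped = isless(x, v[probe_index(lo, n, k)])
            mask |= UInt8(flipped) << (k - 1)
        end
        j = trailing_zeros(mask) + 1      # first lane with v[p_j] > x; 9 if none
        j == 1 && return lo - 1           # x is below every element of the bracket
        j == 9 && return hi               # v[hi] <= x, so hi is the answer
        # v[p_{j-1}] <= x < v[p_j]: keep the bracket between those two probes.
        lo, hi = probe_index(lo, n, j - 1), probe_index(lo, n, j) - 1
    end
    return searchsortedlast(v, x, lo, hi, Base.Order.Forward)
end
```

Because the probes include both endpoints, the eight probes cut the bracket into seven sub-brackets, which is where the roughly 7× shrink per iteration comes from.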

Bench results

bench/simd_binary_sweep.jl. Each cell names the winner, with
"Ratio S/B" (SIMD/Base) in parentheses; a ratio below 1 means SIMD wins.

| eltype | n | Hot cache | Cold cache |
|---|---|---|---|
| Float64 | 256 | Base (1.37×) | tie (1.05×) |
| Float64 | 1024 | Base (1.32×) | tie (1.00×) |
| Float64 | 4096 | Base (1.32×) | Base (1.31×) |
| Float64 | 16384 | Base (1.21×) | SIMD (0.85×) |
| Float64 | 65536 | Base (1.12×) | SIMD (0.92×) |
| Float64 | 262144 | tie (1.05×) | tie (1.03×) |
| Float64 | 1048576 | tie (1.01×) | SIMD (0.74×) |
| Int64 | 256–10⁶ | Base wins everywhere | mostly Base, ties at large n |

Hot cache: scalar Base.searchsortedlast wins virtually everywhere —
the SIMD overhead exceeds the log₂(n) compares saved.
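
To put rough numbers on "compares saved", here is a small sketch that counts gather iterations under an idealised model: each iteration shrinks a bracket of length n to at most `cld(n - 1, 7)` elements, stopping once n is at or below a basecase of 16. The function name `gather_steps` and the exact stopping rule are assumptions for illustration, not taken from the PR.

```julia
# Hypothetical worst-case model: count gather iterations until the bracket
# length is at most `basecase`. Returns (iterations, remaining bracket length).
function gather_steps(n::Int; basecase::Int = 16)
    steps = 0
    while n > basecase
        n = cld(n - 1, 7)   # largest gap between two adjacent probes
        steps += 1
    end
    return steps, n
end

gather_steps(2^20)   # (6, 9)
gather_steps(256)    # (2, 6)
```

Under this model, n = 2²⁰ takes 6 gather iterations plus a scalar search over 9 elements, against roughly 20 scalar compares for plain binary search. Each gather iteration therefore has to cost less than about three scalar compare steps to break even, which is consistent with the hot-cache numbers above.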

Cold cache: mixed. Float64 SIMD wins at n = 16384, 65536, and 2²⁰ (best
0.74× at n=2²⁰), with a tie at n = 262144, because gather amortizes
cold-load latency. Int64 SIMD never wins: Base is ahead at most n, with
ties only at large n. Int compares are so cheap the gather/mask overhead
never pays.

Recommendation: opt-in only, not in Auto

The only clear win is cold-cache Float64 at n ≥ 16384. That regime is
hardware-dependent (gather latency, vector unit width, LLC size), and
"is v cold" isn't something Auto can cheaply detect. Folding into Auto
would penalize the much more common hot-cache case for a narrow
cold-cache win.

For batched sorted-query workloads, the existing hinted strategies
(ExpFromLeft, BracketGallop, SIMDLinearScan) remain strictly
better — they get O(1) amortized cost from a good hint that
SIMDBinarySearch throws away by design.
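
For readers unfamiliar with why a hint helps, the sketch below shows generic galloping (exponential search) to the right of a hint. It is not the package's ExpFromLeft or BracketGallop code; `gallop_right` and `batched_searchsortedlast` are hypothetical names, and the sketch assumes queries arrive in ascending order so the previous answer is a valid lower bound for the next one.

```julia
# Hypothetical galloping search. Precondition: hint == firstindex(v) - 1,
# or v[hint] <= x, so the answer is never to the left of hint.
function gallop_right(v::AbstractVector, x, hint::Int)
    lo, hi = firstindex(v), lastindex(v)
    left, step = hint, 1
    right = left + step
    # Double the step while the probed element is still <= x.
    while right <= hi && !isless(x, v[right])
        left = right
        step *= 2
        right = left + step
    end
    # The answer now lies in left:(right - 1), clipped to the vector's indices.
    return searchsortedlast(v, x, max(left, lo), min(right - 1, hi), Base.Order.Forward)
end

# Ascending queries: reuse each answer as the hint for the next query.
function batched_searchsortedlast(v::AbstractVector, queries)
    out = Vector{Int}(undef, length(queries))
    hint = firstindex(v) - 1
    for (i, x) in enumerate(queries)
        hint = gallop_right(v, x, hint)
        out[i] = hint
    end
    return out
end
```

When consecutive answers are close together, each query costs only a few probes near the hint, independent of the length of v.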

Tests

105,536 tests pass. New SIMDBinarySearch correctness @safetestset
with ~20,910 fresh tests: Int64 fuzz (10k), Float64 fuzz (10k),
basecase-boundary sweep, edge cases (empty, single-element, x outside
range, exact match, duplicates, constant vector, hint-ignored sanity,
non-Int64/Float64 fallback, reverse-order fallback).

🤖 Note for reviewer: this PR is in draft. Please ignore until reviewed
by @ChrisRackauckas.

Single-query binary search that probes 8 candidate positions per iteration
via `SIMD.vgather` + `SIMD.bitmask`, shrinking the bracket by ~7-8x per step
instead of 2x. Specialised for `DenseVector{Int64}` and `DenseVector{Float64}`;
other eltypes and non-Forward orderings fall back to `BinaryBracket`.

Strategy is opt-in only — `Auto` does not pick it. The bench sweep
(`bench/simd_binary_sweep.jl`) shows scalar `BinaryBracket` beats it on
hot cache across every n from 256 to ~1M for both Float64 and Int64
(SIMD/Base ratio 1.2x–1.7x). On cold cache (working set > LLC) the picture
shifts but is mixed: Float64 SIMDBinarySearch starts winning at n >= 16384
(ratio ~0.7–0.9x, lowest at n = 1M); Int64 still loses everywhere except a tie at
n = 1M. The SIMD step's gather overhead and per-step probe-position math
together cost more than the log2(n) scalar compares it saves at most n.

Recommendation: keep opt-in. Don't fold into Auto. Workloads with very
large Float64 vectors that miss LLC may want to pin it manually, but the
hardware-dependent crossover argues against an Auto branch.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>