Add SIMDBinarySearch strategy (opt-in 8-way SIMD-gather binary search)#70
Draft
ChrisRackauckas-Claude wants to merge 1 commit into
Single-query binary search that probes 8 candidate positions per iteration
via `SIMD.vgather` + `SIMD.bitmask`, shrinking the bracket by ~8x per step
instead of 2x. Specialised for `DenseVector{Int64}` and `DenseVector{Float64}`;
other eltypes and non-Forward orderings fall back to `BinaryBracket`.
Strategy is opt-in only — `Auto` does not pick it. The bench sweep
(`bench/simd_binary_sweep.jl`) shows scalar `BinaryBracket` beats it on
hot cache across every n from 256 to ~1M for both Float64 and Int64
(SIMD/Base ratio 1.2x–1.7x). On cold cache (working set > LLC) the picture
shifts but is mixed: Float64 SIMDBinarySearch starts winning at n >= 16384
(ratio ~0.7–0.9x at n = 1M); Int64 still loses everywhere except a tie at
n = 1M. The SIMD step's gather overhead and per-step probe-position math
together cost more than the log2(n) scalar compares it saves at most n.
Recommendation: keep opt-in. Don't fold into Auto. Workloads with very
large Float64 vectors that miss LLC may want to pin it manually, but the
hardware-dependent crossover argues against an Auto branch.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Adds a new `SIMDBinarySearch <: SearchStrategy` exploring whether SIMD
gather instructions can beat scalar binary search. Honest negative
result on hot-cache workloads: hot scalar binary search wins almost
everywhere because branch-mispredict / dependency-chain latency
dominates over throughput, and the SIMD gather + probe-index math +
bitmask reduction costs more per iteration than the log₂(n) scalar
compares saved.

Kept as opt-in, not added to `Auto`.

Algorithm
8-way SIMD-gather binary search. Each iteration over bracket `[lo, hi]` of length `n`:

1. Probe positions: `p_k = lo + ((k-1)*(n-1)) ÷ 7` (so `p_1 = lo`, `p_8 = hi`, six interior probes evenly spaced).
2. `SIMD.vgather(v, idx)`: load all 8 values in one instruction.
3. A lane-wise comparison of the gathered values against `x` returns a `Vec{8, Bool}` mask.
4. `SIMD.bitmask(mask)` packs the 8 lanes into a `UInt8`; `trailing_zeros` finds the first lane where the predicate flipped.
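The steps above can be modelled in plain scalar Julia, which may help when reasoning about the bracket update. This is only a sketch under stated assumptions, not the PR's implementation: it replaces the gather, compare and bitmask with an ordinary loop, uses no SIMD.jl calls, and assumes Forward ordering with the lane predicate taken to be `isless(x, v[p])`. The names `probe8_searchsortedlast` and `BASECASE` are hypothetical.

```julia
# Scalar model of one 8-way probe search (hypothetical sketch, no SIMD.jl).
const BASECASE = 16  # assumed: bracket length at which we defer to Base

function probe8_searchsortedlast(v::AbstractVector, x)
    lo, hi = firstindex(v), lastindex(v)
    # Invariant: the searchsortedlast answer lies in lo-1:hi.
    while hi - lo + 1 > BASECASE
        n = hi - lo + 1
        # Eight probe positions: p_1 == lo, p_8 == hi, six interior probes.
        probes = ntuple(k -> lo + ((k - 1) * (n - 1)) ÷ 7, Val(8))
        # Bit k-1 is set when x < v[p_k]; this loop stands in for
        # the gather + lane-wise compare + bitmask sequence.
        mask = 0x00
        for k in 1:8
            mask |= UInt8(isless(x, v[probes[k]])) << (k - 1)
        end
        # Number of leading probes with v[p_k] <= x (8 when no bit is set).
        t = trailing_zeros(mask)
        t == 0 && return lo - 1   # x sorts before v[lo]
        t == 8 && return hi       # v[hi] <= x
        # Predicate flipped between p_t and p_{t+1}: keep that sub-bracket.
        lo, hi = probes[t], probes[t + 1] - 1
    end
    # Base case: scalar binary search on the remaining bracket.
    return searchsortedlast(v, x, lo, hi, Base.Order.Forward)
end
```

On a sorted vector without NaNs this should agree with `searchsortedlast(v, x)`; each loop iteration keeps one of the seven gaps between adjacent probes, which is where the roughly 7x shrink per step comes from.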
Bracket shrinks ~7-8× per iteration; once the bracket reaches the
base-case threshold of 16, the search falls through to
`Base.searchsortedlast(v, x, lo, hi, Forward)`. Specialised for
`DenseVector{Int64}` and `DenseVector{Float64}`. Other eltypes /
non-Forward orderings fall back to `BinaryBracket`. Ignores any hint.

Bench results
From `bench/simd_binary_sweep.jl`. "Ratio S/B" is SIMD/Base; lower means
SIMD wins.

Hot cache: scalar `Base.searchsortedlast` wins virtually everywhere,
because the SIMD overhead exceeds the log₂(n) compares saved.

Cold cache: mixed. Float64 SIMD wins at n ≥ 16384 (best 0.74× at n = 2²⁰)
because gather amortizes cold-load latency. Int64 SIMD loses everywhere:
Int compares are so cheap the gather/mask overhead never pays.
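As a rough illustration of what "compares saved" amounts to, the step counts can be estimated from the figures stated in this PR alone. The estimate assumes a shrink factor of about 7 per SIMD iteration and a base-case threshold of 16, and ignores early exits. For n = 2²⁰:

```latex
k_{\mathrm{SIMD}} \approx \left\lceil \log_7 \frac{n}{16} \right\rceil
  = \left\lceil \log_7 2^{16} \right\rceil
  = \lceil 5.70 \rceil = 6,
\qquad
k_{\mathrm{scalar}} \approx \log_2 \frac{n}{16} = 16 .
```

So about 6 gather iterations (8 loads each) stand in for about 16 scalar halving steps before both variants reach a 16-element bracket. The SIMD path only wins if each gather iteration costs less than roughly 2.7 scalar steps.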
Recommendation: opt-in only, not in `Auto`

The only clear win is cold-cache Float64 at n ≥ 16384. That regime is
hardware-dependent (gather latency, vector unit width, LLC size), and
"is v cold" isn't a cheap probe for `Auto`. Folding into `Auto` would
penalize the much more common hot-cache case for a narrow worst-case
win.
For batched sorted-query workloads, the existing hinted strategies
(`ExpFromLeft`, `BracketGallop`, `SIMDLinearScan`) remain strictly
better: they get O(1) amortized cost from a good hint that
`SIMDBinarySearch` throws away by design.

Tests
105,536 tests pass. New `SIMDBinarySearch correctness` `@safetestset`
with ~20,910 fresh tests: Int64 fuzz (10k), Float64 fuzz (10k),
basecase-boundary sweep, edge cases (empty, single-element, x outside
range, exact match, duplicates, constant vector, hint-ignored sanity,
non-Int64/Float64 fallback, reverse-order fallback).
🤖 Note for reviewer: this PR is in draft. Please ignore until reviewed
by @ChrisRackauckas.