Fast Split Block Bloom Filter for x86_64 and aarch64. wyhash-style hash + one-cache-line probe, with AVX2 mask compute on x86_64 and a hand-tuned NEON path on aarch64 (Apple Silicon, AWS Graviton 2/3/4, Ampere). The fastest single-key SBBF kernel we've measured — see Performance.
git clone https://github.com/dmatth1/quickbloom
cd quickbloom
make # build/libquickbloom.{a,so}
sudo make install # /usr/local by default; override with PREFIX=#include <quickbloom.h>
void* f = qb_new(qb_estimate_bits(100000, 0.01)); // 100k items, 1% FP
qb_insert(f, "hello", 5);
if (qb_contains(f, "hello", 5)) { /* probably present */ }
qb_free(f);Link with -lquickbloom -lm, or use pkg-config --cflags --libs quickbloom. examples/hello_quickbloom.c is the full runnable
version (make example).
Tests and benches:
make test # native C tests
python3 test/test_bloom.py # Python tests (incl. comparisons)
CC=clang python3 tools/bench_all.py --comparisons # bench sweep, all candidates
CC=clang python3 tools/bench_all.py --sizes S,M # subset
CC=clang python3 tools/bench_all.py --target-fp 0.001 # equal-FP sizing
make bench-hash # per-hash kernel cost
make bench-qb # native C bench of the qb_* kernelsSplit Block Bloom Filter (Apache Parquet spec):
- 256-bit blocks (8 × 32-bit words), K=8 (one bit per word). Every probe touches exactly one cache line.
- Spec-mandated salt vector, so the bitset is bit-identical to other Parquet SBBF implementations (arrow-cpp, arrow-rs, Velox, DuckDB, Impala) for the same 64-bit hash input.
- Power-of-2 block count. The Parquet-mandated fastrange block
index
((h >> 32) * nblocks) >> 32collapses to a single right shift of the upper hash half, so we get spec compliance at the same one-cycle cost as a bitmask. - AVX2 SIMD mask compute:
vpmullo+vpsrli+vpsllv+vptest. - 4-way unrolled bulk paths. Insert uses sequential load-or-store per key so store-to-load forwarding handles aliased blocks.
- Hash on the 16-byte fast path is a single 128-bit multiply with
xor-fold of the halves (wyhash-style, seeded so structured zero
inputs don't collapse). Variable-length keys use
fasthash64.
Full declarations are in quickbloom.h.
// Lifecycle. qb_new returns NULL on alloc failure; qb_free accepts NULL.
void* qb_new(size_t nbits);
void qb_free(void* p);
// Single-key API.
void qb_insert(void* p, const void* key, size_t len);
int qb_contains(void* p, const void* key, size_t len);
// Bulk API. qb_contains_bulk returns the number of hits.
void qb_insert_bulk(void* p, const uint8_t* keys, size_t klen, size_t n);
size_t qb_contains_bulk(void* p, const uint8_t* keys, size_t klen, size_t n);
// Pre-hashed (skip the built-in wymum step).
void qb_insert_prehash(void* p, uint64_t hash);
int qb_contains_prehash(void* p, uint64_t hash);
void qb_insert_prehash_bulk(void* p, const uint64_t* hashes, size_t n);
size_t qb_contains_prehash_bulk(void* p, const uint64_t* hashes, size_t n);
// Sizing helper. Returns recommended bit count for ~n items at FP rate fp.
size_t qb_estimate_bits(size_t n, double fp);
// Serialization. Raw bitset bytes, bit-identical to the Parquet
// on-disk layout (arrow-rs / arrow-cpp / Velox / DuckDB / Impala).
size_t qb_serialized_size(void* p);
void qb_serialize(void* p, uint8_t* dst);
void* qb_deserialize(const uint8_t* bytes, size_t nbytes);- Concurrent reads (
qb_contains*) are safe. - Concurrent writes (
qb_insert*) are not safe (non-atomic bit-set). Callers must synchronize. (Same convention as arrow-rs, impala, parquet-rs.) - Reads concurrent with writes may see partial state but never a false negative once the write completes.
Two backends, selected at build time by uname -m:
- x86_64 — built with
-O3 -mavx2 -mbmi2 -mfma -maes. Runs on any x86_64 with AVX2 + BMI2: Intel Haswell (2013)+ or AMD Excavator (2015)+ / Zen 1–5. - aarch64 — built with
-O3(ARMv8-A NEON is baseline). Runs on every Graviton (1–4), Apple Silicon, Ampere Altra, Azure Cobalt, and any other ARMv8-A core.
Compiler must be gcc or clang on both arches — the 16-byte hash
uses __uint128_t for the 128-bit multiply, which MSVC doesn't
accept.
Numbers are illustrative and host-dependent. Run
make benchon your hardware. Captured on Intel Xeon @ 2.8 GHz (Cascade Lake-class, 1 MB L2 per core, 33 MB L3), clang -O3, contains-miss, min ns/op. Sizes span the cache hierarchy: S fits in L2, M and L spill to L3.
Native C bench (make bench-qb), single-key API, min ns/op across
isolated runs:
| Size | hash+bloom | prehash |
|---|---|---|
| S (128 KB) | 2.85 | 2.06 |
| M (2 MB) | 5.47 | 3.10 |
| L (32 MB) | 16.09 | 9.56 |
Hash+bloom is qb_contains(bytes, len) called one key at a time.
Prehash is qb_contains_prehash(uint64) — the kernel alone, for
callers that already have a 64-bit hash. The S→M jump reflects
this host's L2/L3 boundary at 1 MB per core; on hosts with larger
L2 (Sapphire Rapids and newer have 2–8 MB per core) M stays
L2-resident and the jump shrinks.
A 4-way-unrolled qb_*_bulk path is also exported for callers
with bulk workloads — it issues 4 block loads in parallel and
trades latency for throughput, dropping S prehash miss to ~1.25 ns
(~1.7× the single-key path).
Single-key kernel cost on every row. Krassovsky is the documented exception (¹).
| Implementation | K | S | M | L | FP rate |
|---|---|---|---|---|---|
| quickbloom | 8 | 1.50 | 3.97 | 14.27 | 0.00032 |
boost.bloom ² |
8 | 2.11 | 3.62 | 15.61 | 0.00028 |
xorfuse (binary fuse) |
3 | 4.38 | 7.68 | 27.60 | 0.00394 |
krassovsky ¹ |
5 | 4.96 | 5.36 | 20.54 | 0.00345 |
classic |
8 | 8.44 | 12.25 | 30.01 | 0.00016 |
impala |
8 | 8.45 | 11.14 | 23.07 | 0.00032 |
bloomsday |
8 | 9.13 | 17.29 | 51.51 | 0.00033 |
fastbloom (Rust) |
8 | 10.57 | 16.73 | 53.47 | 0.00010 |
arrow_rs SBBF ³ |
8 | — | — | — | 0.00032 |
¹ krassovsky isn't directly comparable: it's a batched-only
design (vpgatherqq loads 4 blocks per instruction; no
comparably-tuned single-key path). The numbers above are measured
through its 8-way bloom_*_batch8_bulk entry points. For per-key
workloads it doesn't apply. K=5 also gives ~10× higher FP than
quickbloom at equal sizing.
² boost.bloom is Boost 1.89's fast_multiblock32<8> —
hand-tuned AVX2/NEON SBBF (K=8, 256-bit blocks, same Parquet block
geometry). The closest competitor in this table. Quickbloom wins
at S (1.4× faster in L2) and is within run noise at M/L (the gap
is ~1.1× either direction, swappable between runs as DRAM jitter
dominates). For Parquet-shape workloads on x86_64 or aarch64,
boost.bloom is a legitimate alternative when a Boost dependency
is acceptable; quickbloom wins on no-deps + small-key throughput.
³ arrow_rs (parquet crate) exposes no insert_hash /
check_hash. Hash+bloom numbers are 21.40 / 29.16 / 75.43 ns at
S/M/L; structurally the same SBBF + XXH64 as impala, so the
kernel would land near impala's.
Where the prehash gap comes from:
- Vs
boost.bloom: same SBBF block geometry and same hand-tuned SIMD shape. The L2-resident gap is inner-loop micro-tuning; past L2 both are DRAM-bound and the kernels converge. - Vs
classic: SBBF reads one cache line;classicdoes K=8 scattered probes. - Vs
impala/arrow_rs: same SBBF block geometry, but scalar bit-test vs SIMDvpmullo+vpsllv+_mm256_testc_si256. - Vs
bloomsday: same SBBF block geometry, but pure safe Rust relying on LLVM auto-vectorization vs hand-written intrinsics. Auto-vectorization can't match a hand-tuned kernel here. - Vs
fastbloom: non-SBBF multi-block layout touches more memory per probe. - Vs
xorfuse: different class — static set, ~9 bits/key, 3 cache lines per probe. Loses on latency, wins on memory.
16-byte keys, same compile flags as the bloom bench, make bench-hash (median of 5 isolated runs): wymum16 1.67 ns
(quickbloom), XXH64 3.50 ns (impala, arrow_rs), SipHash-1-3
3.38 ns (fastbloom). Add to any candidate's prehash number for
hash+bloom latency. Real-world fastbloom overhead is closer to
~15 ns once the Rust Hasher trait dispatch and per-call seed
load are included.
| Variant | Outcome | Why |
|---|---|---|
| AVX-512 single-key (zmm SBBF masks) | slower than AVX2 on SPR | freq throttle on Sapphire Rapids exceeds the width gain; SBBF mask compute doesn't saturate Zmm. Untested on Zen 4/5 (no license penalty) and Granite Rapids (lighter throttle) — may flip there. |
| AVX-512 batched (zmm 8-way) | slower than AVX2 on SPR | same throttle; same Zen 4/5 + Granite Rapids caveat |
| AESENC hash (replacing wymum) | slower than wymum | port-0 contention with vpmullo SBBF mask compute |
vpgatherqq contains |
slower than scalar 8-byte loads | ~12–15 cyc gather latency on SPR loses to 8 direct loads |
Two parallel mum hash chains |
no improvement | hash already saturates the multiply pipe |
| Insert-side collision skip | slightly slower | branch mispredict cost > dup-work savings on random keys |
| 4-way unroll w/ batched store on insert | 16 false negatives | aliased blocks need serialized L+O+S per key for store-load forwarding |
| Shift-based mask compute (skip xor-fold) | failed FP test | low bits of r.lo from a 64×64→128 multiply have weak diffusion |
| Bitmask block-index (skip fastrange) | failed bit-compat | h & mask reads the low bits of h32; Parquet's fastrange (h32 * nblocks) >> 32 reads the high bits, so bitsets diverged from arrow-rs / arrow-cpp / Velox / DuckDB / Impala. Fastrange on power-of-2 nblocks collapses to a single right shift — same one-cycle cost, spec-compliant. |
| Boost.Bloom shift-trick mask compute (drop salts) | no gain, ~5% worse at M | Boost's fast_multiblock32 slices 8 bit positions from one mixed hash with staggered shifts instead of a per-lane multiply — a real win over a scalar SBBF. But quickbloom's mask compute already runs in parallel with the block load and is fully hidden under load latency at every cache regime (~14 cyc L2 / ~40 L3 / ~200 DRAM ≥ either mask path). Cheaper mask compute recovers no exposed latency; the extra MCG remix + GPR→vector moves cost a hair of throughput. 8-round A/B min-of-mins: salt 2.06/3.09/10.60 vs shift 2.06/3.28/12.76 ns (S/M/L single-key). Kept the salt multiply, which also gives Parquet bit-compat for free. |
| PGO (clang -fprofile-use) | <2% | code is mostly loop-free SIMD; PGO mostly helps branchy code |
simde portable shim (NEON via _mm256_*) |
1.2–2.2× slower than hand NEON | simde lowers _mm256_testc_si256 as two _mm_testc halves with an early-exit branch + GPR round-trips on the OR/store path; hand shim uses ldp q,q + umaxv.4s straight-line. Measured on M1 bench-qb: insert hurt most (~2×), miss least (~1.2× at L). Kept the hand shim. |
Bench harness. Sizes S/M/L/XL derived from the host's L2-per-core
and L3 at runtime (override with QB_L2_KB / QB_L3_MB). Default
sweep is S,M,L; XL needs --sizes and ~1 GB RAM. Random 16-byte
keys, deterministic across runs, separate seeds for inserted vs unseen.
Repeats with warmup; reports min / median / p90 (min is the headline).
CPU, caches, and compiler are printed in the bench header.
Two benches, two roles. make bench-qb is the quickbloom-only
native C bench (no ctypes overhead); it prints single-key (headline)
and bulk-amortized (qb_*_bulk, unrolled — quickbloom-only, not
comparable to other libraries). tools/bench_all.py is the
cross-candidate sweep and reports single-key on every row.
Correctness.
test/test_quickbloom.c— round-trip + FP ceiling + variable-length keys + structured-input regression (sequential u128, padded ASCII, UUIDv4-shaped) +qb_*_bulk≡qb_*bit-identity oracle.test/test_bloom.py— round-trip + FP rate cap across all candidates.test/test_compat_arrow_rs.py— feeds the same XXH64 hash to quickbloom (prehash API) and the arrow-rsSbbfshim, asserts every probe agrees on 50k-insert / 100k-query. Verifies the Parquet bit-identical claim on every CI run, on both x86_64 and aarch64.
quickbloom.h public header
quickbloom.c the SBBF implementation (incl. qb_estimate_bits)
Makefile build libquickbloom.{a,so} + test + example; supports install
quickbloom.pc.in pkg-config template; installed as quickbloom.pc
examples/ minimal usage example
test/ correctness tests (native C + Python cross-impl)
tools/bench_hash.c per-hash kernel-cost bench (make bench-hash)
tools/bench_qb.c native C bench of the qb_* kernels (make bench-qb)
tools/fuzz_*.c libFuzzer harnesses, incl. fuzz_deserialize (make fuzz)
tools/xxh64_helper.c standalone XXH64 .so for the cross-validation test
test/test_compat_arrow_rs.py bit-compat cross-validation vs arrow-rs Sbbf
.github/workflows/ CI: build + test on gcc and clang
harness.py compile + diff_test + benchmark infrastructure (shared by test/bench)
tools/bench_all.py benchmark sweep across cache regimes; --target-fp / --comparisons
test/test_bloom.py Python correctness tests (fixed-length and variable-length keys)
comparisons/ reference Bloom implementations (impala, krassovsky, classic,
xorfuse, fastbloom, arrow-rs) plus their own README
0.2.0. Semver on the C ABI in quickbloom.h; version macros
QUICKBLOOM_VERSION_{MAJOR,MINOR,PATCH,STRING} are exposed there.
Shared library uses standard SONAME (libquickbloom.so.0 →
libquickbloom.so.0.2.0); quickbloom.pc ships for pkg-config.
SOVERSION stays 0 across the 0.1 → 0.2 bump — the aarch64
backend is purely additive and the C ABI is byte-identical.
MIT — see LICENSE.