Skip to content

dmatth1/quickbloom

Repository files navigation

quickbloom

Fast Split Block Bloom Filter for x86_64 and aarch64. wyhash-style hash + one-cache-line probe, with AVX2 mask compute on x86_64 and a hand-tuned NEON path on aarch64 (Apple Silicon, AWS Graviton 2/3/4, Ampere). The fastest single-key SBBF kernel we've measured — see Performance.

Quick start

git clone https://github.com/dmatth1/quickbloom
cd quickbloom
make                # build/libquickbloom.{a,so}
sudo make install   # /usr/local by default; override with PREFIX=
#include <quickbloom.h>

void* f = qb_new(qb_estimate_bits(100000, 0.01));   // 100k items, 1% FP
qb_insert(f, "hello", 5);
if (qb_contains(f, "hello", 5)) { /* probably present */ }
qb_free(f);

Link with -lquickbloom -lm, or use pkg-config --cflags --libs quickbloom. examples/hello_quickbloom.c is the full runnable version (make example).

Tests and benches:

make test                                        # native C tests
python3 test/test_bloom.py                       # Python tests (incl. comparisons)
CC=clang python3 tools/bench_all.py --comparisons      # bench sweep, all candidates
CC=clang python3 tools/bench_all.py --sizes S,M        # subset
CC=clang python3 tools/bench_all.py --target-fp 0.001  # equal-FP sizing
make bench-hash                                  # per-hash kernel cost
make bench-qb                                    # native C bench of the qb_* kernels

Algorithm

Split Block Bloom Filter (Apache Parquet spec):

  • 256-bit blocks (8 × 32-bit words), K=8 (one bit per word). Every probe touches exactly one cache line.
  • Spec-mandated salt vector, so the bitset is bit-identical to other Parquet SBBF implementations (arrow-cpp, arrow-rs, Velox, DuckDB, Impala) for the same 64-bit hash input.
  • Power-of-2 block count. The Parquet-mandated fastrange block index ((h >> 32) * nblocks) >> 32 collapses to a single right shift of the upper hash half, so we get spec compliance at the same one-cycle cost as a bitmask.
  • AVX2 SIMD mask compute: vpmullo + vpsrli + vpsllv + vptest.
  • 4-way unrolled bulk paths. Insert uses sequential load-or-store per key so store-to-load forwarding handles aliased blocks.
  • Hash on the 16-byte fast path is a single 128-bit multiply with xor-fold of the halves (wyhash-style, seeded so structured zero inputs don't collapse). Variable-length keys use fasthash64.

ABI

Full declarations are in quickbloom.h.

// Lifecycle. qb_new returns NULL on alloc failure; qb_free accepts NULL.
void*  qb_new(size_t nbits);
void   qb_free(void* p);

// Single-key API.
void   qb_insert(void* p, const void* key, size_t len);
int    qb_contains(void* p, const void* key, size_t len);

// Bulk API. qb_contains_bulk returns the number of hits.
void   qb_insert_bulk(void* p, const uint8_t* keys, size_t klen, size_t n);
size_t qb_contains_bulk(void* p, const uint8_t* keys, size_t klen, size_t n);

// Pre-hashed (skip the built-in wymum step).
void   qb_insert_prehash(void* p, uint64_t hash);
int    qb_contains_prehash(void* p, uint64_t hash);
void   qb_insert_prehash_bulk(void* p, const uint64_t* hashes, size_t n);
size_t qb_contains_prehash_bulk(void* p, const uint64_t* hashes, size_t n);

// Sizing helper. Returns recommended bit count for ~n items at FP rate fp.
size_t qb_estimate_bits(size_t n, double fp);

// Serialization. Raw bitset bytes, bit-identical to the Parquet
// on-disk layout (arrow-rs / arrow-cpp / Velox / DuckDB / Impala).
size_t qb_serialized_size(void* p);
void   qb_serialize(void* p, uint8_t* dst);
void*  qb_deserialize(const uint8_t* bytes, size_t nbytes);

Thread safety

  • Concurrent reads (qb_contains*) are safe.
  • Concurrent writes (qb_insert*) are not safe (non-atomic bit-set). Callers must synchronize. (Same convention as arrow-rs, impala, parquet-rs.)
  • Reads concurrent with writes may see partial state but never a false negative once the write completes.

Build target

Two backends, selected at build time by uname -m:

  • x86_64 — built with -O3 -mavx2 -mbmi2 -mfma -maes. Runs on any x86_64 with AVX2 + BMI2: Intel Haswell (2013)+ or AMD Excavator (2015)+ / Zen 1–5.
  • aarch64 — built with -O3 (ARMv8-A NEON is baseline). Runs on every Graviton (1–4), Apple Silicon, Ampere Altra, Azure Cobalt, and any other ARMv8-A core.

Compiler must be gcc or clang on both arches — the 16-byte hash uses __uint128_t for the 128-bit multiply, which MSVC doesn't accept.

Performance

Numbers are illustrative and host-dependent. Run make bench on your hardware. Captured on Intel Xeon @ 2.8 GHz (Cascade Lake-class, 1 MB L2 per core, 33 MB L3), clang -O3, contains-miss, min ns/op. Sizes span the cache hierarchy: S fits in L2, M and L spill to L3.

quickbloom

Native C bench (make bench-qb), single-key API, min ns/op across isolated runs:

Size hash+bloom prehash
S (128 KB) 2.85 2.06
M (2 MB) 5.47 3.10
L (32 MB) 16.09 9.56

Hash+bloom is qb_contains(bytes, len) called one key at a time. Prehash is qb_contains_prehash(uint64) — the kernel alone, for callers that already have a 64-bit hash. The S→M jump reflects this host's L2/L3 boundary at 1 MB per core; on hosts with larger L2 (Sapphire Rapids and newer have 2–8 MB per core) M stays L2-resident and the jump shrinks.

A 4-way-unrolled qb_*_bulk path is also exported for callers with bulk workloads — it issues 4 block loads in parallel and trades latency for throughput, dropping S prehash miss to ~1.25 ns (~1.7× the single-key path).

Comparison

Single-key kernel cost on every row. Krassovsky is the documented exception (¹).

Implementation K S M L FP rate
quickbloom 8 1.50 3.97 14.27 0.00032
boost.bloom ² 8 2.11 3.62 15.61 0.00028
xorfuse (binary fuse) 3 4.38 7.68 27.60 0.00394
krassovsky ¹ 5 4.96 5.36 20.54 0.00345
classic 8 8.44 12.25 30.01 0.00016
impala 8 8.45 11.14 23.07 0.00032
bloomsday 8 9.13 17.29 51.51 0.00033
fastbloom (Rust) 8 10.57 16.73 53.47 0.00010
arrow_rs SBBF ³ 8 0.00032

¹ krassovsky isn't directly comparable: it's a batched-only design (vpgatherqq loads 4 blocks per instruction; no comparably-tuned single-key path). The numbers above are measured through its 8-way bloom_*_batch8_bulk entry points. For per-key workloads it doesn't apply. K=5 also gives ~10× higher FP than quickbloom at equal sizing.

² boost.bloom is Boost 1.89's fast_multiblock32<8> — hand-tuned AVX2/NEON SBBF (K=8, 256-bit blocks, same Parquet block geometry). The closest competitor in this table. Quickbloom wins at S (1.4× faster in L2) and is within run noise at M/L (the gap is ~1.1× either direction, swappable between runs as DRAM jitter dominates). For Parquet-shape workloads on x86_64 or aarch64, boost.bloom is a legitimate alternative when a Boost dependency is acceptable; quickbloom wins on no-deps + small-key throughput.

³ arrow_rs (parquet crate) exposes no insert_hash / check_hash. Hash+bloom numbers are 21.40 / 29.16 / 75.43 ns at S/M/L; structurally the same SBBF + XXH64 as impala, so the kernel would land near impala's.

Where the prehash gap comes from:

  • Vs boost.bloom: same SBBF block geometry and same hand-tuned SIMD shape. The L2-resident gap is inner-loop micro-tuning; past L2 both are DRAM-bound and the kernels converge.
  • Vs classic: SBBF reads one cache line; classic does K=8 scattered probes.
  • Vs impala / arrow_rs: same SBBF block geometry, but scalar bit-test vs SIMD vpmullo + vpsllv + _mm256_testc_si256.
  • Vs bloomsday: same SBBF block geometry, but pure safe Rust relying on LLVM auto-vectorization vs hand-written intrinsics. Auto-vectorization can't match a hand-tuned kernel here.
  • Vs fastbloom: non-SBBF multi-block layout touches more memory per probe.
  • Vs xorfuse: different class — static set, ~9 bits/key, 3 cache lines per probe. Loses on latency, wins on memory.

Per-key hash cost

16-byte keys, same compile flags as the bloom bench, make bench-hash (median of 5 isolated runs): wymum16 1.67 ns (quickbloom), XXH64 3.50 ns (impala, arrow_rs), SipHash-1-3 3.38 ns (fastbloom). Add to any candidate's prehash number for hash+bloom latency. Real-world fastbloom overhead is closer to ~15 ns once the Rust Hasher trait dispatch and per-call seed load are included.

Optimizations we tried and rejected

Variant Outcome Why
AVX-512 single-key (zmm SBBF masks) slower than AVX2 on SPR freq throttle on Sapphire Rapids exceeds the width gain; SBBF mask compute doesn't saturate Zmm. Untested on Zen 4/5 (no license penalty) and Granite Rapids (lighter throttle) — may flip there.
AVX-512 batched (zmm 8-way) slower than AVX2 on SPR same throttle; same Zen 4/5 + Granite Rapids caveat
AESENC hash (replacing wymum) slower than wymum port-0 contention with vpmullo SBBF mask compute
vpgatherqq contains slower than scalar 8-byte loads ~12–15 cyc gather latency on SPR loses to 8 direct loads
Two parallel mum hash chains no improvement hash already saturates the multiply pipe
Insert-side collision skip slightly slower branch mispredict cost > dup-work savings on random keys
4-way unroll w/ batched store on insert 16 false negatives aliased blocks need serialized L+O+S per key for store-load forwarding
Shift-based mask compute (skip xor-fold) failed FP test low bits of r.lo from a 64×64→128 multiply have weak diffusion
Bitmask block-index (skip fastrange) failed bit-compat h & mask reads the low bits of h32; Parquet's fastrange (h32 * nblocks) >> 32 reads the high bits, so bitsets diverged from arrow-rs / arrow-cpp / Velox / DuckDB / Impala. Fastrange on power-of-2 nblocks collapses to a single right shift — same one-cycle cost, spec-compliant.
Boost.Bloom shift-trick mask compute (drop salts) no gain, ~5% worse at M Boost's fast_multiblock32 slices 8 bit positions from one mixed hash with staggered shifts instead of a per-lane multiply — a real win over a scalar SBBF. But quickbloom's mask compute already runs in parallel with the block load and is fully hidden under load latency at every cache regime (~14 cyc L2 / ~40 L3 / ~200 DRAM ≥ either mask path). Cheaper mask compute recovers no exposed latency; the extra MCG remix + GPR→vector moves cost a hair of throughput. 8-round A/B min-of-mins: salt 2.06/3.09/10.60 vs shift 2.06/3.28/12.76 ns (S/M/L single-key). Kept the salt multiply, which also gives Parquet bit-compat for free.
PGO (clang -fprofile-use) <2% code is mostly loop-free SIMD; PGO mostly helps branchy code
simde portable shim (NEON via _mm256_*) 1.2–2.2× slower than hand NEON simde lowers _mm256_testc_si256 as two _mm_testc halves with an early-exit branch + GPR round-trips on the OR/store path; hand shim uses ldp q,q + umaxv.4s straight-line. Measured on M1 bench-qb: insert hurt most (~2×), miss least (~1.2× at L). Kept the hand shim.

Methodology

Bench harness. Sizes S/M/L/XL derived from the host's L2-per-core and L3 at runtime (override with QB_L2_KB / QB_L3_MB). Default sweep is S,M,L; XL needs --sizes and ~1 GB RAM. Random 16-byte keys, deterministic across runs, separate seeds for inserted vs unseen. Repeats with warmup; reports min / median / p90 (min is the headline). CPU, caches, and compiler are printed in the bench header.

Two benches, two roles. make bench-qb is the quickbloom-only native C bench (no ctypes overhead); it prints single-key (headline) and bulk-amortized (qb_*_bulk, unrolled — quickbloom-only, not comparable to other libraries). tools/bench_all.py is the cross-candidate sweep and reports single-key on every row.

Correctness.

  • test/test_quickbloom.c — round-trip + FP ceiling + variable-length keys + structured-input regression (sequential u128, padded ASCII, UUIDv4-shaped) + qb_*_bulkqb_* bit-identity oracle.
  • test/test_bloom.py — round-trip + FP rate cap across all candidates.
  • test/test_compat_arrow_rs.py — feeds the same XXH64 hash to quickbloom (prehash API) and the arrow-rs Sbbf shim, asserts every probe agrees on 50k-insert / 100k-query. Verifies the Parquet bit-identical claim on every CI run, on both x86_64 and aarch64.

Files

quickbloom.h         public header
quickbloom.c         the SBBF implementation (incl. qb_estimate_bits)
Makefile             build libquickbloom.{a,so} + test + example; supports install
quickbloom.pc.in     pkg-config template; installed as quickbloom.pc
examples/            minimal usage example
test/                correctness tests (native C + Python cross-impl)
tools/bench_hash.c   per-hash kernel-cost bench (make bench-hash)
tools/bench_qb.c     native C bench of the qb_* kernels (make bench-qb)
tools/fuzz_*.c       libFuzzer harnesses, incl. fuzz_deserialize (make fuzz)
tools/xxh64_helper.c standalone XXH64 .so for the cross-validation test
test/test_compat_arrow_rs.py   bit-compat cross-validation vs arrow-rs Sbbf
.github/workflows/   CI: build + test on gcc and clang
harness.py           compile + diff_test + benchmark infrastructure (shared by test/bench)
tools/bench_all.py   benchmark sweep across cache regimes; --target-fp / --comparisons
test/test_bloom.py   Python correctness tests (fixed-length and variable-length keys)
comparisons/         reference Bloom implementations (impala, krassovsky, classic,
                     xorfuse, fastbloom, arrow-rs) plus their own README

Versioning

0.2.0. Semver on the C ABI in quickbloom.h; version macros QUICKBLOOM_VERSION_{MAJOR,MINOR,PATCH,STRING} are exposed there. Shared library uses standard SONAME (libquickbloom.so.0libquickbloom.so.0.2.0); quickbloom.pc ships for pkg-config. SOVERSION stays 0 across the 0.1 → 0.2 bump — the aarch64 backend is purely additive and the C ABI is byte-identical.

License

MIT — see LICENSE.

About

Fast single-key bloom filter implementation

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors