quickbloom

Fast Split Block Bloom Filter for x86_64 and aarch64. wyhash-style hash + one-cache-line probe, with AVX2 mask compute on x86_64 and a hand-tuned NEON path on aarch64 (Apple Silicon, AWS Graviton 2/3/4, Ampere). The fastest single-key SBBF kernel we've measured — see Performance.

Quick start

git clone https://github.com/dmatth1/quickbloom
cd quickbloom
make                # build/libquickbloom.{a,so}
sudo make install   # /usr/local by default; override with PREFIX=

#include <quickbloom.h>

void* f = qb_new(qb_estimate_bits(100000, 0.01));   // 100k items, 1% FP
qb_insert(f, "hello", 5);
if (qb_contains(f, "hello", 5)) { /* probably present */ }
qb_free(f);

Link with -lquickbloom -lm, or use pkg-config --cflags --libs quickbloom. examples/hello_quickbloom.c is the full runnable version (make example).

Tests and benches:

make test                                        # native C tests
python3 test/test_bloom.py                       # Python tests (incl. comparisons)
CC=clang python3 tools/bench_all.py --comparisons      # bench sweep, all candidates
CC=clang python3 tools/bench_all.py --sizes S,M        # subset
CC=clang python3 tools/bench_all.py --target-fp 0.001  # equal-FP sizing
make bench-hash                                  # per-hash kernel cost
make bench-qb                                    # native C bench of the qb_* kernels

Algorithm

Split Block Bloom Filter (Apache Parquet spec):

256-bit blocks (8 × 32-bit words), K=8 (one bit per word). Every probe touches exactly one cache line.
Spec-mandated salt vector, so the bitset is bit-identical to other Parquet SBBF implementations (arrow-cpp, arrow-rs, Velox, DuckDB, Impala) for the same 64-bit hash input.
Power-of-2 block count. The Parquet-mandated fastrange block index ((h >> 32) * nblocks) >> 32 collapses to a single right shift of the upper hash half, so we get spec compliance at the same one-cycle cost as a bitmask.
AVX2 SIMD mask compute: vpmullo + vpsrli + vpsllv + vptest.
4-way unrolled bulk paths. Insert uses sequential load-or-store per key so store-to-load forwarding handles aliased blocks.
Hash on the 16-byte fast path is a single 128-bit multiply with xor-fold of the halves (wyhash-style, seeded so structured zero inputs don't collapse). Variable-length keys use fasthash64.

ABI

Full declarations are in quickbloom.h.

// Lifecycle. qb_new returns NULL on alloc failure; qb_free accepts NULL.
void*  qb_new(size_t nbits);
void   qb_free(void* p);

// Single-key API.
void   qb_insert(void* p, const void* key, size_t len);
int    qb_contains(void* p, const void* key, size_t len);

// Bulk API. qb_contains_bulk returns the number of hits.
void   qb_insert_bulk(void* p, const uint8_t* keys, size_t klen, size_t n);
size_t qb_contains_bulk(void* p, const uint8_t* keys, size_t klen, size_t n);

// Pre-hashed (skip the built-in wymum step).
void   qb_insert_prehash(void* p, uint64_t hash);
int    qb_contains_prehash(void* p, uint64_t hash);
void   qb_insert_prehash_bulk(void* p, const uint64_t* hashes, size_t n);
size_t qb_contains_prehash_bulk(void* p, const uint64_t* hashes, size_t n);

// Sizing helper. Returns recommended bit count for ~n items at FP rate fp.
size_t qb_estimate_bits(size_t n, double fp);

// Serialization. Raw bitset bytes, bit-identical to the Parquet
// on-disk layout (arrow-rs / arrow-cpp / Velox / DuckDB / Impala).
size_t qb_serialized_size(void* p);
void   qb_serialize(void* p, uint8_t* dst);
void*  qb_deserialize(const uint8_t* bytes, size_t nbytes);

Thread safety

Concurrent reads (qb_contains*) are safe.
Concurrent writes (qb_insert*) are not safe (non-atomic bit-set). Callers must synchronize. (Same convention as arrow-rs, impala, parquet-rs.)
Reads concurrent with writes may see partial state but never a false negative once the write completes.

Build target

Two backends, selected at build time by uname -m:

x86_64 — built with -O3 -mavx2 -mbmi2 -mfma -maes. Runs on any x86_64 with AVX2 + BMI2: Intel Haswell (2013)+ or AMD Excavator (2015)+ / Zen 1–5.
aarch64 — built with -O3 (ARMv8-A NEON is baseline). Runs on every Graviton (1–4), Apple Silicon, Ampere Altra, Azure Cobalt, and any other ARMv8-A core.

Compiler must be gcc or clang on both arches — the 16-byte hash uses __uint128_t for the 128-bit multiply, which MSVC doesn't accept.

Performance

Numbers are illustrative and host-dependent. Run make bench on your hardware. Captured on Intel Xeon @ 2.8 GHz (Cascade Lake-class, 1 MB L2 per core, 33 MB L3), clang -O3, contains-miss, min ns/op. Sizes span the cache hierarchy: S fits in L2, M and L spill to L3.

quickbloom

Native C bench (make bench-qb), single-key API, min ns/op across isolated runs:

Size	hash+bloom	prehash
S (128 KB)	2.85	2.06
M (2 MB)	5.47	3.10
L (32 MB)	16.09	9.56

Hash+bloom is qb_contains(bytes, len) called one key at a time. Prehash is qb_contains_prehash(uint64) — the kernel alone, for callers that already have a 64-bit hash. The S→M jump reflects this host's L2/L3 boundary at 1 MB per core; on hosts with larger L2 (Sapphire Rapids and newer have 2–8 MB per core) M stays L2-resident and the jump shrinks.

A 4-way-unrolled qb_*_bulk path is also exported for callers with bulk workloads — it issues 4 block loads in parallel and trades latency for throughput, dropping S prehash miss to ~1.25 ns (~1.7× the single-key path).

Comparison

Single-key kernel cost on every row. Krassovsky is the documented exception (¹).

Implementation	K	S	M	L	FP rate
quickbloom	8	1.50	3.97	14.27	0.00032
`boost.bloom` ²	8	2.11	3.62	15.61	0.00028
`xorfuse` (binary fuse)	3	4.38	7.68	27.60	0.00394
`krassovsky` ¹	5	4.96	5.36	20.54	0.00345
`classic`	8	8.44	12.25	30.01	0.00016
`impala`	8	8.45	11.14	23.07	0.00032
`bloomsday`	8	9.13	17.29	51.51	0.00033
`fastbloom` (Rust)	8	10.57	16.73	53.47	0.00010
`arrow_rs` SBBF ³	8	—	—	—	0.00032

¹ krassovsky isn't directly comparable: it's a batched-only design (vpgatherqq loads 4 blocks per instruction; no comparably-tuned single-key path). The numbers above are measured through its 8-way bloom_*_batch8_bulk entry points. For per-key workloads it doesn't apply. K=5 also gives ~10× higher FP than quickbloom at equal sizing.

² boost.bloom is Boost 1.89's fast_multiblock32<8> — hand-tuned AVX2/NEON SBBF (K=8, 256-bit blocks, same Parquet block geometry). The closest competitor in this table. Quickbloom wins at S (1.4× faster in L2) and is within run noise at M/L (the gap is ~1.1× either direction, swappable between runs as DRAM jitter dominates). For Parquet-shape workloads on x86_64 or aarch64, boost.bloom is a legitimate alternative when a Boost dependency is acceptable; quickbloom wins on no-deps + small-key throughput.

³ arrow_rs (parquet crate) exposes no insert_hash / check_hash. Hash+bloom numbers are 21.40 / 29.16 / 75.43 ns at S/M/L; structurally the same SBBF + XXH64 as impala, so the kernel would land near impala's.

Where the prehash gap comes from:

Vs boost.bloom: same SBBF block geometry and same hand-tuned SIMD shape. The L2-resident gap is inner-loop micro-tuning; past L2 both are DRAM-bound and the kernels converge.
Vs classic: SBBF reads one cache line; classic does K=8 scattered probes.
Vs impala / arrow_rs: same SBBF block geometry, but scalar bit-test vs SIMD vpmullo + vpsllv + _mm256_testc_si256.
Vs bloomsday: same SBBF block geometry, but pure safe Rust relying on LLVM auto-vectorization vs hand-written intrinsics. Auto-vectorization can't match a hand-tuned kernel here.
Vs fastbloom: non-SBBF multi-block layout touches more memory per probe.
Vs xorfuse: different class — static set, ~9 bits/key, 3 cache lines per probe. Loses on latency, wins on memory.

Per-key hash cost

16-byte keys, same compile flags as the bloom bench, make bench-hash (median of 5 isolated runs): wymum16 1.67 ns (quickbloom), XXH64 3.50 ns (impala, arrow_rs), SipHash-1-3 3.38 ns (fastbloom). Add to any candidate's prehash number for hash+bloom latency. Real-world fastbloom overhead is closer to ~15 ns once the Rust Hasher trait dispatch and per-call seed load are included.

Optimizations we tried and rejected

Variant	Outcome	Why
AVX-512 single-key (zmm SBBF masks)	slower than AVX2 on SPR	freq throttle on Sapphire Rapids exceeds the width gain; SBBF mask compute doesn't saturate Zmm. Untested on Zen 4/5 (no license penalty) and Granite Rapids (lighter throttle) — may flip there.
AVX-512 batched (zmm 8-way)	slower than AVX2 on SPR	same throttle; same Zen 4/5 + Granite Rapids caveat
AESENC hash (replacing wymum)	slower than wymum	port-0 contention with `vpmullo` SBBF mask compute
`vpgatherqq` contains	slower than scalar 8-byte loads	~12–15 cyc gather latency on SPR loses to 8 direct loads
Two parallel `mum` hash chains	no improvement	hash already saturates the multiply pipe
Insert-side collision skip	slightly slower	branch mispredict cost > dup-work savings on random keys
4-way unroll w/ batched store on insert	16 false negatives	aliased blocks need serialized L+O+S per key for store-load forwarding
Shift-based mask compute (skip xor-fold)	failed FP test	low bits of `r.lo` from a 64×64→128 multiply have weak diffusion
Bitmask block-index (skip fastrange)	failed bit-compat	`h & mask` reads the low bits of `h32`; Parquet's fastrange `(h32 * nblocks) >> 32` reads the high bits, so bitsets diverged from arrow-rs / arrow-cpp / Velox / DuckDB / Impala. Fastrange on power-of-2 nblocks collapses to a single right shift — same one-cycle cost, spec-compliant.
Boost.Bloom shift-trick mask compute (drop salts)	no gain, ~5% worse at M	Boost's `fast_multiblock32` slices 8 bit positions from one mixed hash with staggered shifts instead of a per-lane multiply — a real win over a scalar SBBF. But quickbloom's mask compute already runs in parallel with the block load and is fully hidden under load latency at every cache regime (~14 cyc L2 / ~40 L3 / ~200 DRAM ≥ either mask path). Cheaper mask compute recovers no exposed latency; the extra MCG remix + GPR→vector moves cost a hair of throughput. 8-round A/B min-of-mins: salt 2.06/3.09/10.60 vs shift 2.06/3.28/12.76 ns (S/M/L single-key). Kept the salt multiply, which also gives Parquet bit-compat for free.
PGO (clang -fprofile-use)	<2%	code is mostly loop-free SIMD; PGO mostly helps branchy code
simde portable shim (NEON via `_mm256_*`)	1.2–2.2× slower than hand NEON	simde lowers `_mm256_testc_si256` as two `_mm_testc` halves with an early-exit branch + GPR round-trips on the OR/store path; hand shim uses `ldp q,q` + `umaxv.4s` straight-line. Measured on M1 `bench-qb`: insert hurt most (~2×), miss least (~1.2× at L). Kept the hand shim.

Methodology

Bench harness. Sizes S/M/L/XL derived from the host's L2-per-core and L3 at runtime (override with QB_L2_KB / QB_L3_MB). Default sweep is S,M,L; XL needs --sizes and ~1 GB RAM. Random 16-byte keys, deterministic across runs, separate seeds for inserted vs unseen. Repeats with warmup; reports min / median / p90 (min is the headline). CPU, caches, and compiler are printed in the bench header.

Two benches, two roles. make bench-qb is the quickbloom-only native C bench (no ctypes overhead); it prints single-key (headline) and bulk-amortized (qb_*_bulk, unrolled — quickbloom-only, not comparable to other libraries). tools/bench_all.py is the cross-candidate sweep and reports single-key on every row.

Correctness.

test/test_quickbloom.c — round-trip + FP ceiling + variable-length keys + structured-input regression (sequential u128, padded ASCII, UUIDv4-shaped) + qb_*_bulk ≡ qb_* bit-identity oracle.
test/test_bloom.py — round-trip + FP rate cap across all candidates.
test/test_compat_arrow_rs.py — feeds the same XXH64 hash to quickbloom (prehash API) and the arrow-rs Sbbf shim, asserts every probe agrees on 50k-insert / 100k-query. Verifies the Parquet bit-identical claim on every CI run, on both x86_64 and aarch64.

Files

quickbloom.h         public header
quickbloom.c         the SBBF implementation (incl. qb_estimate_bits)
Makefile             build libquickbloom.{a,so} + test + example; supports install
quickbloom.pc.in     pkg-config template; installed as quickbloom.pc
examples/            minimal usage example
test/                correctness tests (native C + Python cross-impl)
tools/bench_hash.c   per-hash kernel-cost bench (make bench-hash)
tools/bench_qb.c     native C bench of the qb_* kernels (make bench-qb)
tools/fuzz_*.c       libFuzzer harnesses, incl. fuzz_deserialize (make fuzz)
tools/xxh64_helper.c standalone XXH64 .so for the cross-validation test
test/test_compat_arrow_rs.py   bit-compat cross-validation vs arrow-rs Sbbf
.github/workflows/   CI: build + test on gcc and clang
harness.py           compile + diff_test + benchmark infrastructure (shared by test/bench)
tools/bench_all.py   benchmark sweep across cache regimes; --target-fp / --comparisons
test/test_bloom.py   Python correctness tests (fixed-length and variable-length keys)
comparisons/         reference Bloom implementations (impala, krassovsky, classic,
                     xorfuse, fastbloom, arrow-rs) plus their own README

Versioning

0.2.0. Semver on the C ABI in quickbloom.h; version macros QUICKBLOOM_VERSION_{MAJOR,MINOR,PATCH,STRING} are exposed there. Shared library uses standard SONAME (libquickbloom.so.0 → libquickbloom.so.0.2.0); quickbloom.pc ships for pkg-config. SOVERSION stays 0 across the 0.1 → 0.2 bump — the aarch64 backend is purely additive and the C ABI is byte-identical.

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

quickbloom

Quick start

Algorithm

ABI

Thread safety

Build target

Performance

quickbloom

Comparison

Per-key hash cost

Optimizations we tried and rejected

Methodology

Files

Versioning

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
.github/workflows		.github/workflows
comparisons		comparisons
examples		examples
test		test
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
harness.py		harness.py
quickbloom.c		quickbloom.c
quickbloom.h		quickbloom.h
quickbloom.pc.in		quickbloom.pc.in

Folders and files

Latest commit

History

Repository files navigation

quickbloom

Quick start

Algorithm

ABI

Thread safety

Build target

Performance

quickbloom

Comparison

Per-key hash cost

Optimizations we tried and rejected

Methodology

Files

Versioning

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages