Fix 96.7% data dedup in generated KV cache entries
Problem:
The benchmark generates KV cache data by slicing from a shared 256 MB
precomputed buffer. For large models like llama3.1-70b (where a single
entry can be 2.5 GB), the buffer content repeats ~10x within one .npy
file. The existing XOR stamp used a single repeating 4 KB pattern per
key, which made different entries unique but left identical blocks at
different positions within the same entry.
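The repetition can be reproduced in miniature. The sketch below shrinks the sizes (16-byte buffer, 4-byte blocks, standing in for 256 MB and 4 KB) and is only an illustration of the slicing behavior described above, not the generator's actual code:

```python
import numpy as np

# Toy reproduction of the problem, with sizes shrunk from 256 MB / 2.5 GB
# so the effect is easy to see (the real generator's slicing may differ).
BUFFER = 16    # stands in for the 256 MB precomputed buffer
ENTRY = 160    # stands in for a ~10x larger entry
BLOCK = 4      # stands in for a 4 KB block

buf = np.arange(BUFFER, dtype=np.uint8)
entry = np.tile(buf, ENTRY // BUFFER)  # buffer content repeats 10x

blocks = {entry[i:i + BLOCK].tobytes() for i in range(0, ENTRY, BLOCK)}
# 40 blocks on "disk", but only 4 distinct ones: a 90% dedup ratio,
# the small-scale analogue of the measured 96.7%.
assert len(blocks) == BUFFER // BLOCK
```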
Measured on Kingston DC3000ME NVMe with 100 users, prefill-only,
llama3.1-70b-instruct, 400+ GB written:
Before: 109M 4 KB blocks, only 3.6M unique -- 96.7% dedup ratio
After:  109M 4 KB blocks, all 109M unique  -- 0.0% dedup ratio
This matters because NVMe controllers with inline dedup/compression
would silently write far less physical data, inflating reported
throughput and making benchmark results unreliable.
Fix:
Two-layer XOR stamp in KVCacheGenerator._apply_xor_stamp():
- Layer 1 (existing): XOR every block with a key-derived 4 KB pattern.
Ensures different keys produce different on-disk data.
- Layer 2 (new): XOR the first 8 bytes of each block with its position
index within the entry. Ensures the same buffer content at different
offsets produces different blocks on disk.
Overhead: Layer 2 touches 8 bytes per 4 KB block (0.2% of data). Net cost vs the original stamp: <1%.
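A minimal sketch of what such a two-layer stamp could look like. This is a hypothetical reconstruction from the description above, not the actual `_apply_xor_stamp()` implementation; the function name, signature, and uint8-view approach are assumptions:

```python
import numpy as np

BLOCK = 4096  # dedup granularity the stamp defends against

def apply_xor_stamp(data: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Hypothetical reconstruction of the two-layer stamp described above;
    the real KVCacheGenerator._apply_xor_stamp() may differ in detail.

    data:    1-D uint8 view of the entry payload
    pattern: key-derived 4 KB uint8 pattern (Layer 1)
    """
    out = data.copy()
    n = out.size
    # Layer 1: XOR every byte with the repeating key-derived pattern,
    # so different keys produce different on-disk data.
    reps = -(-n // pattern.size)  # ceiling division
    out ^= np.tile(pattern, reps)[:n]
    # Layer 2: XOR the first 8 bytes of each 4 KB block with the block's
    # position index, so identical buffer content at different offsets
    # diverges on disk. Touches 8 / 4096 = ~0.2% of the data.
    n_blocks = n // BLOCK
    idx = np.arange(n_blocks, dtype=np.uint64)
    heads = out[: n_blocks * BLOCK].reshape(n_blocks, BLOCK)[:, :8]
    heads ^= idx.view(np.uint8).reshape(n_blocks, 8)  # platform-endian byte order
    return out
```

Note that block 0 is XORed with index 0 and so is untouched by Layer 2; uniqueness still holds because every other block's leading bytes now differ from it and from each other.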
Tests (pytest -v -s -k TestDedup):
- test_cross_entry_no_dedup: Generate 50 entries with different keys,
hash every 4 KB block, assert zero duplicates across all entries.
Validates that two .npy files never share a block on disk.
- test_intra_entry_no_dedup: Generate one large entry (384 MB, exceeds
the 256 MB buffer), hash every 4 KB block within it, assert zero
duplicates inside that single file. This is the test that catches
the 96.7% bug -- without Layer 2, repeated buffer regions produce
identical blocks at different file offsets.
- test_combined_dedup_many_entries: Generate 10 entries of varying
sizes (128 to 8192 tokens), hash all blocks together, assert zero
duplicates across the combined dataset. End-to-end validation of
a realistic mixed workload.
- test_determinism_preserves_dedup_resistance: Create two separate
generators with the same seed, verify byte-identical output for
the same keys (reproducibility), then verify the output also has
zero duplicate blocks (dedup resistance is not sacrificed).
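The block-hashing step these tests share can be sketched as follows. This is a hypothetical helper, not the actual test code; the name and interface are assumptions:

```python
import hashlib

def count_duplicate_blocks(data: bytes, block_size: int = 4096) -> int:
    """Count fixed-size blocks whose content has already been seen,
    i.e. the blocks an inline-dedup controller would skip writing."""
    seen: set[bytes] = set()
    dups = 0
    for off in range(0, len(data) - block_size + 1, block_size):
        digest = hashlib.sha256(data[off:off + block_size]).digest()
        if digest in seen:
            dups += 1
        else:
            seen.add(digest)
    return dups
```

Each test then reduces to generating entries and asserting `count_duplicate_blocks(entry_bytes) == 0`.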
Other changes:
- benchmark.py: Print KV block size context in results output
(block sizes can be hundreds of MB, not 4 KB pages)
- test_kv_cache.py: Fix NaN comparison in test_determinism_same_key
(XOR stamp can produce NaN float16 bit patterns; use byte-level
comparison instead of np.array_equal, which fails because NaN != NaN)
- docs/proposal: Update generation mode names and QoS defaults
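The NaN pitfall behind the test_determinism_same_key fix can be demonstrated directly (assuming NumPy; the arrays below are illustrative, not the generator's output):

```python
import numpy as np

# Two byte-identical float16 arrays containing a NaN.
a = np.array([1.0, np.nan], dtype=np.float16)
b = a.copy()

# np.array_equal follows IEEE semantics: NaN != NaN, so even
# byte-identical buffers compare unequal ...
assert not np.array_equal(a, b)

# ... while a byte-level comparison confirms they are bit-identical.
assert a.tobytes() == b.tobytes()
```

`np.array_equal(a, b, equal_nan=True)` would also pass here, but it treats any NaN as equal to any other NaN; the byte-level comparison additionally distinguishes different NaN bit patterns, which is what a determinism check on raw XOR-stamped output actually needs.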
FileSystemGuy approved these changes on Mar 17, 2026.
dslik approved these changes on Mar 17, 2026.
dslik left a comment:
Looks OK; may need to change to 512-byte dedupe granularity, but that can be a separate PR.