Fix 96.7% data dedup in generated KV cache entries
Problem:
The benchmark generates KV cache data by slicing from a shared 256 MB
precomputed buffer. For large models like llama3.1-70b (where a single
entry can be 2.5 GB), the buffer content repeats ~10x within one .npy
file. The existing XOR stamp used a single repeating 4 KB pattern per
key, which made different entries unique but left identical blocks at
different positions within the same entry.
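The repetition can be reproduced in miniature. The sketch below shrinks the sizes (16-byte buffer, 4-byte blocks, standing in for 256 MB and 4 KB) and is only an illustration of the slicing behavior described above, not the generator's actual code:

```python
import numpy as np

# Toy reproduction of the problem, with sizes shrunk from 256 MB / 2.5 GB
# so the effect is easy to see (the real generator's slicing may differ).
BUFFER = 16    # stands in for the 256 MB precomputed buffer
ENTRY = 160    # stands in for a ~10x larger entry
BLOCK = 4      # stands in for a 4 KB block

buf = np.arange(BUFFER, dtype=np.uint8)
entry = np.tile(buf, ENTRY // BUFFER)  # buffer content repeats 10x

blocks = {entry[i:i + BLOCK].tobytes() for i in range(0, ENTRY, BLOCK)}
# 40 blocks on "disk", but only 4 distinct ones: a 90% dedup ratio,
# the small-scale analogue of the measured 96.7%.
assert len(blocks) == BUFFER // BLOCK
```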
Measured on Kingston DC3000ME NVMe with 100 users, prefill-only,
llama3.1-70b-instruct, 400+ GB written:
Before: 109M 4 KB blocks, only 3.6M unique -- 96.7% dedup ratio
After:  109M 4 KB blocks, all 109M unique  -- 0.0% dedup ratio
This matters because NVMe controllers with inline dedup/compression
would silently write far less physical data, inflating reported
throughput and making benchmark results unreliable.
Fix:
Two-layer XOR stamp in KVCacheGenerator._apply_xor_stamp():
- Layer 1 (existing): XOR every block with a key-derived 4 KB pattern.
Ensures different keys produce different on-disk data.
- Layer 2 (new): XOR the first 8 bytes of each block with its position
index within the entry. Ensures the same buffer content at different
offsets produces different blocks on disk.
Overhead: Layer 2 touches 8 bytes per 4 KB block (0.2% of data). Net cost vs the original stamp: <1%.
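A minimal sketch of what such a two-layer stamp could look like. This is a hypothetical reconstruction from the description above, not the actual `_apply_xor_stamp()` implementation; the function name, signature, and uint8-view approach are assumptions:

```python
import numpy as np

BLOCK = 4096  # dedup granularity the stamp defends against

def apply_xor_stamp(data: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Hypothetical reconstruction of the two-layer stamp described above;
    the real KVCacheGenerator._apply_xor_stamp() may differ in detail.

    data:    1-D uint8 view of the entry payload
    pattern: key-derived 4 KB uint8 pattern (Layer 1)
    """
    out = data.copy()
    n = out.size
    # Layer 1: XOR every byte with the repeating key-derived pattern,
    # so different keys produce different on-disk data.
    reps = -(-n // pattern.size)  # ceiling division
    out ^= np.tile(pattern, reps)[:n]
    # Layer 2: XOR the first 8 bytes of each 4 KB block with the block's
    # position index, so identical buffer content at different offsets
    # diverges on disk. Touches 8 / 4096 = ~0.2% of the data.
    n_blocks = n // BLOCK
    idx = np.arange(n_blocks, dtype=np.uint64)
    heads = out[: n_blocks * BLOCK].reshape(n_blocks, BLOCK)[:, :8]
    heads ^= idx.view(np.uint8).reshape(n_blocks, 8)  # platform-endian byte order
    return out
```

Note that block 0 is XORed with index 0 and so is untouched by Layer 2; uniqueness still holds because every other block's leading bytes now differ from it and from each other.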
Tests (pytest -v -s -k TestDedup):
- test_cross_entry_no_dedup: Generate 50 entries with different keys,
hash every 4 KB block, assert zero duplicates across all entries.
Validates that two .npy files never share a block on disk.
- test_intra_entry_no_dedup: Generate one large entry (384 MB, exceeds
the 256 MB buffer), hash every 4 KB block within it, assert zero
duplicates inside that single file. This is the test that catches
the 96.7% bug -- without Layer 2, repeated buffer regions produce
identical blocks at different file offsets.
- test_combined_dedup_many_entries: Generate 10 entries of varying
sizes (128 to 8192 tokens), hash all blocks together, assert zero
duplicates across the combined dataset. End-to-end validation of
a realistic mixed workload.
- test_determinism_preserves_dedup_resistance: Create two separate
generators with the same seed, verify byte-identical output for
the same keys (reproducibility), then verify the output also has
zero duplicate blocks (dedup resistance is not sacrificed).
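The block-hashing step these tests share can be sketched as follows. This is a hypothetical helper, not the actual test code; the name and interface are assumptions:

```python
import hashlib

def count_duplicate_blocks(data: bytes, block_size: int = 4096) -> int:
    """Count fixed-size blocks whose content has already been seen,
    i.e. the blocks an inline-dedup controller would skip writing."""
    seen: set[bytes] = set()
    dups = 0
    for off in range(0, len(data) - block_size + 1, block_size):
        digest = hashlib.sha256(data[off:off + block_size]).digest()
        if digest in seen:
            dups += 1
        else:
            seen.add(digest)
    return dups
```

Each test then reduces to generating entries and asserting `count_duplicate_blocks(entry_bytes) == 0`.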
Other changes:
- benchmark.py: Print KV block size context in results output
(block sizes can be hundreds of MB, not 4 KB pages)
- test_kv_cache.py: Fix NaN comparison in test_determinism_same_key
(XOR stamp can produce NaN float16 bit patterns; use byte-level
comparison instead of np.array_equal, which fails because NaN != NaN)
- docs/proposal: Update generation mode names and QoS defaults
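The NaN pitfall behind the test_determinism_same_key fix can be demonstrated directly (assuming NumPy; the arrays below are illustrative, not the generator's output):

```python
import numpy as np

# Two byte-identical float16 arrays containing a NaN.
a = np.array([1.0, np.nan], dtype=np.float16)
b = a.copy()

# np.array_equal follows IEEE semantics: NaN != NaN, so even
# byte-identical buffers compare unequal ...
assert not np.array_equal(a, b)

# ... while a byte-level comparison confirms they are bit-identical.
assert a.tobytes() == b.tobytes()
```

`np.array_equal(a, b, equal_nan=True)` would also pass here, but it treats any NaN as equal to any other NaN; the byte-level comparison additionally distinguishes different NaN bit patterns, which is what a determinism check on raw XOR-stamped output actually needs.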
FileSystemGuy approved these changes on Mar 17, 2026.
dslik approved these changes on Mar 17, 2026.
dslik left a comment:
Looks OK; may need to change to 512-byte dedupe granularity, but that can be a separate PR.