Kvcache modular refactor #274

Merged
FileSystemGuy merged 1 commit into mlcommons:main from hazemawadalla:kvcache-modular-refactor on Mar 17, 2026

Conversation

@hazemawadalla
Contributor

Fix 96.7% data dedup in generated KV cache entries
Problem:
The benchmark generates KV cache data by slicing from a shared 256 MB
precomputed buffer. For large models like llama3.1-70b (where a single
entry can be 2.5 GB), the buffer content repeats ~10x within one .npy
file. The existing XOR stamp used a single repeating 4 KB pattern per
key, which made different entries unique but left identical blocks at
different positions within the same entry.

Measured on Kingston DC3000ME NVMe with 100 users, prefill-only,
llama3.1-70b-instruct, 400+ GB written:
Before: 109M 4KB blocks, only 3.6M unique -- 96.7% dedup ratio
After: 109M 4KB blocks, all 109M unique -- 0.0% dedup ratio

This matters because NVMe controllers with inline dedup/compression
would silently write far less physical data, inflating reported
throughput and making benchmark results unreliable.
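
The reported ratio can be re-derived from the block counts above. The following is a small arithmetic check, not code from the PR; `dedup_ratio` is a hypothetical helper, and "4 KB" is assumed to mean 4096 bytes.

```python
BLOCK_SIZE = 4096  # bytes; assumes "4 KB" in the description means 4096


def dedup_ratio(total_blocks: int, unique_blocks: int) -> float:
    """Fraction of blocks an ideal block-level deduplicator could drop."""
    return 1.0 - unique_blocks / total_blocks


# Block counts quoted in the measurement above.
before = dedup_ratio(109_000_000, 3_600_000)
after = dedup_ratio(109_000_000, 109_000_000)
logical_bytes = 109_000_000 * BLOCK_SIZE

print(f"before: {before:.1%}")                       # before: 96.7%
print(f"after: {after:.1%}")                         # after: 0.0%
print(f"logical data: {logical_bytes / 1e9:.0f} GB")  # logical data: 446 GB
```

Under that assumption, 109M blocks correspond to roughly 446 GB of logical writes, which is consistent with the "400+ GB written" figure.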

Fix:
Two-layer XOR stamp in KVCacheGenerator._apply_xor_stamp():

  • Layer 1 (existing): XOR every block with a key-derived 4 KB pattern.
    Ensures different keys produce different on-disk data.
  • Layer 2 (new): XOR the first 8 bytes of each block with its position
    index within the entry. Ensures the same buffer content at different
    offsets produces different blocks on disk.

Overhead: Layer 2 touches 8 of every 4096 bytes (about 0.2% of the data). Net cost vs original: <1%.
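
To illustrate the two layers, here is a minimal sketch operating on raw bytes. It is not the actual `KVCacheGenerator._apply_xor_stamp()` implementation: `key_pattern`, `apply_two_layer_stamp`, `unique_blocks`, the SHA-256 based pattern derivation, and the little-endian 64-bit index encoding are all assumptions made for the example.

```python
import hashlib
import numpy as np

BLOCK_SIZE = 4096  # bytes; assumed to match the 4 KB block in the description


def key_pattern(key: str) -> np.ndarray:
    """Derive a deterministic 4 KB pattern from a key (hypothetical derivation)."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "little")
    return np.random.default_rng(seed).integers(0, 256, size=BLOCK_SIZE, dtype=np.uint8)


def apply_two_layer_stamp(raw: np.ndarray, key: str, positional: bool = True) -> np.ndarray:
    """Return a stamped copy of `raw` (uint8, length a multiple of BLOCK_SIZE)."""
    blocks = raw.reshape(-1, BLOCK_SIZE).copy()
    # Layer 1: XOR every block with the same key-derived pattern.
    blocks ^= key_pattern(key)
    if positional:
        # Layer 2: XOR the first 8 bytes of each block with its block index,
        # encoded here as a little-endian 64-bit integer.
        n = blocks.shape[0]
        index_bytes = np.arange(n, dtype="<u8").view(np.uint8).reshape(n, 8)
        blocks[:, :8] ^= index_bytes
    return blocks.reshape(-1)


def unique_blocks(data: np.ndarray) -> int:
    """Count distinct BLOCK_SIZE-byte blocks in a uint8 array."""
    return len({row.tobytes() for row in data.reshape(-1, BLOCK_SIZE)})


# A 16-block buffer repeated 4 times stands in for the 256 MB buffer
# repeating inside one large entry.
buffer = np.random.default_rng(0).integers(0, 256, size=16 * BLOCK_SIZE, dtype=np.uint8)
entry = np.tile(buffer, 4)

layer1_only = apply_two_layer_stamp(entry, "user0", positional=False)
both_layers = apply_two_layer_stamp(entry, "user0")

print(unique_blocks(layer1_only))  # 16: repeats of the buffer stay identical
print(unique_blocks(both_layers))  # 64: every block is distinct
```

With Layer 1 alone, each repeat of the buffer is XORed with the same pattern, so the repeats remain identical to one another. The positional stamp changes only 8 bytes per block, which is where the roughly 0.2% figure (8 / 4096) comes from.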

Tests (pytest -v -s -k TestDedup):

  • test_cross_entry_no_dedup: Generate 50 entries with different keys,
    hash every 4 KB block, assert zero duplicates across all entries.
    Validates that two .npy files never share a block on disk.
  • test_intra_entry_no_dedup: Generate one large entry (384 MB, exceeds
    the 256 MB buffer), hash every 4 KB block within it, assert zero
    duplicates inside that single file. This is the test that catches
    the 96.7% bug -- without Layer 2, repeated buffer regions produce
    identical blocks at different file offsets.
  • test_combined_dedup_many_entries: Generate 10 entries of varying
    sizes (128 to 8192 tokens), hash all blocks together, assert zero
    duplicates across the combined dataset. End-to-end validation of
    a realistic mixed workload.
  • test_determinism_preserves_dedup_resistance: Create two separate
    generators with the same seed, verify byte-identical output for
    the same keys (reproducibility), then verify the output also has
    zero duplicate blocks (dedup resistance is not sacrificed).
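
The block-hashing check these tests describe could look roughly like the sketch below. It is not taken from `test_kv_cache.py`; `count_blocks` is a hypothetical helper, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import numpy as np

BLOCK_SIZE = 4096  # bytes; assumed dedup granularity


def count_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return (total, unique) counts over the full blocks in `data`."""
    total = len(data) // block_size
    digests = set()
    for i in range(total):
        block = data[i * block_size:(i + 1) * block_size]
        digests.add(hashlib.sha256(block).digest())
    return total, len(digests)


# An 8-block random buffer repeated 3 times, with no stamping applied,
# mimics an entry that is larger than the shared buffer.
buffer = np.random.default_rng(0).bytes(8 * BLOCK_SIZE)
entry = buffer * 3

total, unique = count_blocks(entry)
print(total, unique)  # 24 8
```

A dedup-resistance test in this style would assert `unique == total` on the stamped output. The unstamped example above fails that assertion, which is the intra-entry case the description says Layer 2 is meant to fix.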

Other changes:

  • benchmark.py: Print KV block size context in results output
    (block sizes can be hundreds of MB, not 4 KB pages)
  • test_kv_cache.py: Fix NaN comparison in test_determinism_same_key
    (XOR stamp can produce NaN float16 bit patterns; use byte-level
    comparison instead of np.array_equal which fails on NaN != NaN)
  • docs/proposal: Update generation mode names and QoS defaults
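
The NaN issue mentioned for `test_determinism_same_key` can be reproduced in isolation. This is a standalone sketch, not the test itself; the bit patterns are chosen for illustration.

```python
import numpy as np

# 0x7E00 is a float16 NaN bit pattern; 0x3C00 is float16 1.0.
bits = np.array([0x7E00, 0x3C00], dtype=np.uint16)
a = bits.view(np.float16)
b = bits.copy().view(np.float16)

value_equal = np.array_equal(a, b)    # False: NaN != NaN under IEEE 754
byte_equal = a.tobytes() == b.tobytes()  # True: the underlying bytes match

print(value_equal, byte_equal)  # False True
```

`np.array_equal` also accepts `equal_nan=True` (NumPy 1.19 and later), but that treats every NaN as equal regardless of payload. Comparing bytes is the stricter check when the goal is byte-identical output.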

@hazemawadalla hazemawadalla requested a review from a team March 17, 2026 07:38
@hazemawadalla hazemawadalla requested a review from a team as a code owner March 17, 2026 07:38
@github-actions
github-actions bot commented Mar 17, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@hazemawadalla hazemawadalla force-pushed the kvcache-modular-refactor branch from 97af46d to 3134852 March 17, 2026 07:46
@hazemawadalla hazemawadalla force-pushed the kvcache-modular-refactor branch from 3134852 to 0aa9aee March 17, 2026 07:53

@dslik dslik left a comment

Looks OK. May need to change to 512-byte dedupe granularity, but that can be a separate PR.

@FileSystemGuy FileSystemGuy merged commit ceac3d9 into mlcommons:main Mar 17, 2026
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 17, 2026