Add FLOP_EFFICIENCY eviction policy for prefix caching #3923

Open
lmcafee-nvidia wants to merge 6 commits into NVIDIA:main from lmcafee-nvidia:prefix-caching-mamba-evict-flops

@lmcafee-nvidia
Contributor

Summary

  • Add a new FLOP_EFFICIENCY eviction policy that scores cached blocks by recency and by FLOPs saved per byte of cache, and evicts the lowest-scoring blocks first
  • Extend the existing REF_ZERO and LRU eviction policies with this FLOP-aware alternative

Test plan

  • New tests in test_dynamic_prefix_caching.py pass
  • Existing prefix caching tests still pass

🤖 Generated with Claude Code

Implements a FLOP-aware eviction policy (inspired by Marconi,
arXiv:2411.19379) that combines recency and per-block FLOP efficiency
into a utility score. Blocks representing longer prefixes save more
FLOPs per byte of cache and are preferentially retained under memory
pressure.

The utility score is recency + alpha * flop_efficiency, with both terms
min-max normalized to [0, 1] across cached blocks. Alpha is configurable
via --inference-dynamic-batching-prefix-caching-flop-alpha (default 1.0).
Setting alpha=0 falls back to pure LRU behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lmcafee-nvidia lmcafee-nvidia self-assigned this Mar 18, 2026
lmcafee-nvidia and others added 2 commits March 18, 2026 18:04
…e above FLOP-specific tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace test_flop_efficiency_prefers_longer_prefix and
test_flop_efficiency_alpha_zero_fallback with a single
test_flop_efficiency_eviction_ordering parameterized over alpha
(0.0/1.0/100.0) to demonstrate all 3 eviction regimes:
oldest-first, medium-first, and shallowest-first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
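
The three regimes named in the commit message above can be reproduced with a toy example. This is a sketch under stated assumptions, not the actual `test_flop_efficiency_eviction_ordering`: the block names, timestamps, and efficiency values are invented for illustration, `first_evicted` is a hypothetical helper, and a plain loop stands in for test parameterization to keep the snippet dependency-free.

```python
def min_max(values):
    """Min-max normalize to [0, 1]; if all values are equal, return zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def first_evicted(blocks, alpha):
    """Return the name of the block with the lowest utility score."""
    names = list(blocks)
    recency = min_max([blocks[n]["last_used"] for n in names])
    efficiency = min_max([blocks[n]["flop_efficiency"] for n in names])
    scores = {n: r + alpha * e for n, r, e in zip(names, recency, efficiency)}
    return min(scores, key=scores.get)


# Invented values: the oldest block is the deepest prefix (highest FLOP
# efficiency) and the newest block is the shallowest (lowest efficiency).
BLOCKS = {
    "oldest_deep": {"last_used": 0.0, "flop_efficiency": 100.0},
    "medium": {"last_used": 4.0, "flop_efficiency": 30.0},
    "newest_shallow": {"last_used": 10.0, "flop_efficiency": 10.0},
}

# alpha -> block expected to be evicted first in this toy setup
EXPECTED = {
    0.0: "oldest_deep",       # recency only: oldest-first
    1.0: "medium",            # balanced: medium-first
    100.0: "newest_shallow",  # efficiency dominates: shallowest-first
}

for alpha, victim in EXPECTED.items():
    assert first_evicted(BLOCKS, alpha) == victim
```

At `alpha=1.0` the oldest and newest blocks each score 1.0 (one maximal term apiece), while the medium block scores about 0.62, so it goes first.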
@lmcafee-nvidia lmcafee-nvidia marked this pull request as ready for review March 19, 2026 01:15
@lmcafee-nvidia lmcafee-nvidia requested review from a team as code owners March 19, 2026 01:15
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 19, 2026 01:15
lmcafee-nvidia and others added 2 commits March 18, 2026 18:15
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lmcafee-nvidia
Contributor Author

/ok to test 42d89dd

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lmcafee-nvidia
Contributor Author

/ok to test 5399cc7

2 participants