Refactor prefix caching #4618
Pull request overview
This PR refactors PyTorch prefix caching to support multimodal identity, SSM/state-cache checkpoints, routed-expert replay, and observability while keeping unsupported VL/spec/sliding-window paths disabled.
Changes:
- Adds multimodal-aware prefix-cache keys and content hashing for supported PyTorch VL paths.
- Adds SSM prefix-cache checkpoint lifecycle across scheduler, trie, state manager, input creation, model forward, and engine loop.
- Adds routed-expert replay, prefix-cache metrics, CLI/config options, and LongBenchV2 eval wiring.
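To illustrate why token-only keys are not enough for VL requests: image placeholder tokens can be identical across different images, so two prompts may share token ids while differing in content. The following is a minimal sketch, not the PR's implementation. The function names `block_key` and `prefix_keys` and the `(start, end, content_hash)` span layout are assumptions made for illustration.

```python
import hashlib
from typing import List, Sequence, Tuple

# Hypothetical span layout: (start, end, content_hash), with `end` exclusive.
MMSpan = Tuple[int, int, str]


def block_key(parent_key: bytes, token_ids: Sequence[int], mm_hashes: Sequence[str] = ()) -> bytes:
    """Hash one block from its parent key, its token ids, and any multimodal content hashes."""
    h = hashlib.sha256()
    h.update(parent_key)
    h.update(b'|tok|')
    h.update(','.join(str(t) for t in token_ids).encode())
    for mm in mm_hashes:
        h.update(b'|mm|')
        h.update(mm.encode())
    return h.digest()


def prefix_keys(token_ids: Sequence[int], block_size: int, mm_spans: Sequence[MMSpan] = ()) -> List[bytes]:
    """Return one chained key per full block; blocks overlapped by a multimodal span include its hash."""
    keys: List[bytes] = []
    parent = b''
    for i in range(len(token_ids) // block_size):
        lo, hi = i * block_size, (i + 1) * block_size
        # Collect the content hashes of every multimodal span that overlaps this block.
        extra = [h for (s, e, h) in mm_spans if s < hi and e > lo]
        parent = block_key(parent, token_ids[lo:hi], extra)
        keys.append(parent)
    return keys
```

With this scheme, two requests that share a text prefix but attach different images would match only up to the first block that an image span touches.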
Reviewed changes
Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lmdeploy/pytorch/paging/block_trie.py | Adds multimodal keys, SSM checkpoint indexing/lifecycle, routed-expert replay, and stats rollback support. |
| lmdeploy/pytorch/paging/scheduler.py | Coordinates tentative prefix matches, rollback, SSM state availability, and migration match fix. |
| lmdeploy/pytorch/paging/state_manager.py | Splits shared state slots into runtime and checkpoint ownership. |
| lmdeploy/pytorch/engine/inputs_maker.py | Builds compact SSM restore/save offsets and adjusts long-context multimodal chunking. |
| lmdeploy/pytorch/engine/model_agent/agent.py | Copies SSM checkpoint state before/after model forward. |
| lmdeploy/pytorch/engine/engine_loop.py | Publishes checkpoint saves/restores around forward prefetching. |
| lmdeploy/pytorch/engine/cache_engine.py | Adds validated state-cache slot copy support. |
| lmdeploy/pytorch/messages.py | Adds prefix-cache metadata/state and multimodal clamp/hash helpers. |
| lmdeploy/pytorch/multimodal/data_type.py | Adds deterministic multimodal content hashing. |
| lmdeploy/pytorch/model_inputs.py | Carries compact SSM restore/save offsets through model inputs. |
| lmdeploy/pytorch/config.py | Adds prefix-cache SSM budget and decode checkpoint interval config. |
| lmdeploy/messages.py | Exposes new PyTorch engine config fields. |
| lmdeploy/pytorch/engine/executor/base.py | Keeps SSM prefix cache enabled and disables prefix cache for spec decoding. |
| lmdeploy/pytorch/engine/config_builder.py | Propagates new prefix-cache config fields. |
| lmdeploy/pytorch/engine/engine.py | Precomputes multimodal hashes when prefix caching is enabled. |
| lmdeploy/pytorch/block.py | Removes prefix-cache node state from logical block container. |
| lmdeploy/pytorch/paging/seq_states/states.py | Releases/discards prefix-cache state on sequence free. |
| lmdeploy/pytorch/paging/eviction_helper/recompute_eviction_helper.py | Evicts SSM checkpoints to recover runtime state slots. |
| lmdeploy/pytorch/strategies/ar/model_inputs.py | Preserves SSM save offsets during decoding reindex. |
| lmdeploy/pytorch/strategies/ar/step_inputs.py | Threads SSM save offsets through AR input reindexing. |
| lmdeploy/pytorch/strategies/ar_spec/step_inputs.py | Threads SSM save offsets through AR-spec reindexing. |
| lmdeploy/pytorch/strategies/dllm/step_inputs.py | Threads SSM save offsets through DLLM reindexing. |
| lmdeploy/vl/model/preprocess_utils.py | Expands single/per-video video multimodal items. |
| lmdeploy/serve/core/vl_async_engine.py | Enables VL prefix caching only for supported PyTorch new-preprocess paths. |
| lmdeploy/pytorch/models/qwen3_vl.py | Normalizes multimodal metadata construction. |
| lmdeploy/metrics/loggers.py | Exports prefix-cache hit rate to Prometheus/log metrics. |
| lmdeploy/cli/utils.py | Adds CLI arguments for SSM prefix-cache state settings. |
| lmdeploy/cli/serve.py | Wires new prefix-cache CLI settings into API server config. |
| lmdeploy/cli/cli.py | Adds new prefix-cache CLI settings to chat parser. |
| eval/eval.py | Adds LongBenchV2 selection and optional judger handling. |
| eval/config.py | Adds LongBenchV2 config and tunable OpenCompass parameters. |
| autotest/utils/run_restful_chat.py | Normalizes an MCQ docstring character. |
| tests/pytorch/paging/test_block_trie.py | Adds coverage for multimodal keys, routed experts, and SSM checkpoint lifecycle. |
| tests/pytorch/paging/test_scheduler.py | Adds SSM state scheduling, rollback, migration, and long-context prefix-cache tests. |
| tests/pytorch/engine/test_inputs_maker.py | Adds long-context and compact SSM offset tests. |
| tests/pytorch/engine/test_executor_base.py | Adds executor/config coverage for prefix-cache state settings. |
| tests/pytorch/engine/test_cache_engine.py | Adds state-cache copy validation tests. |
| tests/test_lmdeploy/test_vl/test_preprocess_utils.py | Adds video multimodal expansion tests. |
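The `cache_engine.py` row mentions validated state-cache slot copies, and the PR description refers to sorted/coalesced offset copies. One plausible reading is that per-slot copies are sorted and merged into contiguous runs so fewer copy operations are issued. The sketch below shows only that idea in plain Python. `coalesce_copies` is a hypothetical helper, not the signature of `StateCacheEngine.copy_caches()`, and the duplicate-destination check is an assumed form of validation.

```python
from typing import List, Sequence, Tuple


def coalesce_copies(src: Sequence[int], dst: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Merge (src, dst) slot pairs into (src_start, dst_start, length) runs."""
    if len(src) != len(dst):
        raise ValueError('src and dst must have the same length')
    if len(set(dst)) != len(dst):
        # Assumed validation: two copies must not write the same destination slot.
        raise ValueError('duplicate destination slot')

    runs: List[Tuple[int, int, int]] = []
    for s, d in sorted(zip(src, dst)):
        if runs:
            run_src, run_dst, length = runs[-1]
            # Extend the current run when both offsets continue it contiguously.
            if s == run_src + length and d == run_dst + length:
                runs[-1] = (run_src, run_dst, length + 1)
                continue
        runs.append((s, d, 1))
    return runs
```

For example, copying slots 3, 4, 5 to 10, 11, 12 collapses into a single run of length 3, while an unrelated pair stays a separate run.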
…fix-caching # Conflicts: # lmdeploy/messages.py # lmdeploy/pytorch/engine/cache_engine.py # lmdeploy/pytorch/engine/model_agent/agent.py
…hing # Conflicts: # tests/pytorch/engine/test_executor_base.py
…hing # Conflicts: # lmdeploy/pytorch/multimodal/data_type.py # lmdeploy/vl/model/preprocess_utils.py # tests/test_lmdeploy/test_vl/test_preprocess_utils.py
…hing # Conflicts: # lmdeploy/pytorch/paging/state_manager.py
    def prefix_cache_state_budget(parser):
        """Add argument prefix_cache_state_budget to parser."""

        return parser.add_argument('--prefix-cache-state-budget',
                                   type=int,
                                   default=0,
                                   help='Extra SSM state-cache slots budgeted for prefix-cache checkpoints. '
                                   '0 adds no extra slots, but checkpoints may borrow idle runtime state slots. '
                                   'Only used by the PyTorch engine.')

    @staticmethod
    def prefix_cache_decode_state_interval(parser):
Please recommend their values for qwen3.5-35b/qwen3.5-35b-fp8/qwen3.5-397b/qwen3.5-397b-fp8
prefix_cache_state_budget is the number of extra state slots reserved for caching only. It can be set to the same value as max-batch-size if GPU memory allows.
prefix_cache_decode_state_interval is the interval between checkpoint saves. It should be chosen from the input distribution: if most session lengths exceed 1024, you can use 1024 as the prefix_cache_decode_state_interval.
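The budget semantics can be pictured with a toy slot pool. This is a sketch under stated assumptions, not the PR's `state_manager.py`: the class `StateSlotPool` and its counters are hypothetical, and the policy of evicting a borrowed checkpoint when a runtime slot is needed is inferred from the help text ("may borrow idle runtime state slots") and the eviction-helper description.

```python
class StateSlotPool:
    """Toy pool: `num_runtime` runtime slots plus `budget` extra checkpoint-only slots."""

    def __init__(self, num_runtime: int, budget: int = 0):
        self.free_runtime = num_runtime  # idle runtime slots
        self.free_extra = budget         # idle slots reserved for checkpoints
        self.borrowed = 0                # runtime slots currently holding checkpoints

    def alloc_checkpoint(self) -> bool:
        """Prefer an extra slot; otherwise borrow an idle runtime slot."""
        if self.free_extra > 0:
            self.free_extra -= 1
            return True
        if self.free_runtime > 0:
            self.free_runtime -= 1
            self.borrowed += 1
            return True
        return False

    def alloc_runtime(self) -> bool:
        """Take an idle runtime slot, evicting a borrowed checkpoint if none is idle."""
        if self.free_runtime == 0 and self.borrowed > 0:
            self.borrowed -= 1
            self.free_runtime += 1
        if self.free_runtime == 0:
            return False
        self.free_runtime -= 1
        return True
```

In this model a budget of 0 still allows checkpoints while the batch is not full, but those checkpoints are the first thing lost under load. A budget close to max-batch-size keeps checkpoints out of the runtime slots entirely.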
    if self.prefix_cache_decode_state_interval > 0:
        assert self.prefix_cache_decode_state_interval % self.block_size == 0, (
            'prefix_cache_decode_state_interval must be a multiple of block_size')
Better to update the help information of prefix_cache_decode_state_interval
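One possible shape for that help update, as a standalone `argparse` sketch. The flag spelling `--prefix-cache-decode-state-interval`, the default of `0`, and the help wording are assumptions derived from the function name and the PR description, and `validate_interval` is a hypothetical helper that mirrors the assertion in the diff above.

```python
import argparse


def add_decode_state_interval(parser: argparse.ArgumentParser) -> argparse.Action:
    """Register the decode checkpoint interval with a help text that states the constraint."""
    return parser.add_argument(
        '--prefix-cache-decode-state-interval',  # assumed flag spelling
        type=int,
        default=0,  # assumed default
        help='Interval between SSM prefix-cache checkpoint saves during decoding. '
        'Must be a multiple of the cache block size. '
        '0 disables decode checkpoint saves only. '
        'Only used by the PyTorch engine.')


def validate_interval(interval: int, block_size: int) -> None:
    """Reject intervals that are negative or not aligned to block_size."""
    if interval < 0:
        raise ValueError('interval must be non-negative')
    if interval > 0 and interval % block_size != 0:
        raise ValueError('interval must be a multiple of block_size')
```

Stating the block-size constraint in the help text lets users find it before the config assertion fires.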
…hing # Conflicts: # lmdeploy/pytorch/engine/engine.py # lmdeploy/vl/model/preprocess_utils.py
Summary
This PR extends PyTorch prefix caching so it is correct for multimodal/VLM requests and SSM/state-cache models, and adds best-effort routed-expert replay for prefix-cache hits.
The main goals are:
Key Changes
Multimodal Prefix-Cache Identity
- `start`/`end`, `modality`, `content_hash`
- `BlockTrie` keys from token-only matching to token + multimodal extra hashes.
- `HistoryMultiModals` paths.
SSM / State-Cache Prefix Caching
- `prefix_cache_decode_state_interval=0` as “disable decode checkpoint saves only”; prefill/chunk checkpoint saves may still work.
- `prefix_cache_state_budget` represent extra checkpoint capacity; budget `0` may still borrow idle runtime state slots.
- `StateCacheEngine.copy_caches()` support for sorted/coalesced offset copies.
Routed Expert Replay
Metrics And Debuggability
Misc
Notes / Limitations