Skip to content

fix(linear-att): fix latent prefix-cache ref/buffer leaks#1348

Merged
hiworldwzj merged 4 commits into
ModelTC:mainfrom
sufubao:fix_cache_hit_rare
Jun 14, 2026
Merged

fix(linear-att): fix latent prefix-cache ref/buffer leaks#1348
hiworldwzj merged 4 commits into
ModelTC:mainfrom
sufubao:fix_cache_hit_rare

Conversation

@sufubao

@sufubao sufubao commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator

Summary

Four latent correctness defects in the linear-attention (Qwen3.5 / Qwen3-Next) hybrid prefix cache, found via adversarial audit + property-based fuzzing.

Correctness fixes

  • Root in eviction set_add_node added the root node to _evict_tree_set when the tree emptied (root is a leaf then), contradicting _evict's own assert node is not self.root_node. Now excluded.
  • Root ref leak on miss / trim-to-emptymatch_prefix takes a root ref on descent; the two "no match" returns handed None to the caller without releasing it, so root.ref_counter drifted up on every miss. Both returns now release it.
  • Root ref leak in deref_to_first_big_page_node — the big-page downgrade path leaked the same root ref when it bottomed out at root. Fixed.
  • Big-page state-buffer leak / assert-crash — big-page state ids accumulated in req.linear_att_len_to_big_page_id during chunked prefill were neither inserted nor freed when a request was paused/aborted mid-prefill (fallback branch of _linear_att_free_req), tripping free_a_req_mem's assert len(...) == 0 (worker crash) or leaking slots with asserts off. Now released on the non-insert exit paths.

The three root-ref issues are latent (root carries zero tokens and is never evictable), so they didn't affect results/memory/accuracy. The big-page leak is a reachable worker crash for long-context serving with --linear_att_page_block_num set.

Tests

New CPU-only tests (no GPU required):

  • test_linear_att_radix_cache_invariants.py — property-based fuzzer (small-page regime): random insert/match/dec/steal/evict, checking ref accounting, set membership, buffer-id conservation, value integrity, and root-ref balance after every op.
  • test_linear_att_radix_cache_bigpage.py — same for the big-page regime.
  • test_linear_att_free_req_big_page.py — regression tests for the pause/abort big-page release.

pytest unit_tests/server/router/dynamic_prompt/ unit_tests/server/router/model_infer/test_linear_att_free_req_big_page.py → passed; black + flake8 clean.

Risk / compatibility

The fixes don't change matching, eviction selection, or returned KV — only ref/buffer accounting on miss and pause/abort paths.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces auto-derivation for --linear_att_hash_page_size based on max_req_total_len to optimize prefix-cache hit rates for hybrid linear-attention models, along with extensive documentation and warnings against using excessively large page sizes in serving workloads. It also fixes critical bugs in the radix cache, including root reference counter drift during cache misses or trims, and memory leaks of big-page state buffers when requests are paused or aborted mid-prefill. Comprehensive unit and invariant fuzz tests are added to verify these fixes. A review comment suggests adding a defensive fallback check for args.max_req_total_len in api_start.py to prevent a potential TypeError if it is None.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightllm/server/api_start.py Outdated
Comment on lines +314 to +315
while derived * max_page_num < args.max_req_total_len:
derived *= 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential TypeError if args.max_req_total_len is None (as it is typed as Optional[int]), it is safer to add a defensive fallback check before using it in the comparison loop.

Suggested change
while derived * max_page_num < args.max_req_total_len:
derived *= 2
max_len = args.max_req_total_len if args.max_req_total_len is not None else 16384
while derived * max_page_num < max_len:
derived *= 2

Four latent defects in LinearAttPagedRadixCache / _linear_att_free_req,
found via adversarial audit + property-based fuzzing:

- Root in eviction set: _add_node added root to _evict_tree_set when the
  tree emptied (root is a leaf then), contradicting _evict's own
  'assert node is not self.root_node'. Now excluded.
- Root ref leak on miss / trim-to-empty: match_prefix takes a root ref on
  descent; the two 'no match' returns handed None to the caller without
  releasing it, so root.ref_counter drifted up on every miss. Both returns
  now release it.
- Root ref leak in deref_to_first_big_page_node: the big-page downgrade path
  leaked the same root ref when it bottomed out at root. Fixed.
- Big-page state-buffer leak / assert-crash: big-page state ids accumulated
  in req.linear_att_len_to_big_page_id during chunked prefill were neither
  inserted nor freed when a request was paused/aborted mid-prefill (fallback
  branch of _linear_att_free_req), tripping free_a_req_mem's assert (worker
  crash) or leaking slots with asserts off. Now released on the non-insert
  exit paths.

New CPU-only tests (no GPU): property-based invariant fuzzers for the
small-page and big-page regimes, plus regression tests for the pause/abort
big-page release.

The three root-ref issues are latent (root carries zero tokens and is never
evictable). The big-page leak is a reachable worker crash for long-context
serving with --linear_att_page_block_num set.
@sufubao sufubao force-pushed the fix_cache_hit_rare branch from aa8f40e to 187cbda Compare June 14, 2026 01:03
@sufubao sufubao changed the title fix(linear-att): auto-derive prefix-cache page size + fix latent cache bugs fix(linear-att): fix latent prefix-cache ref/buffer leaks Jun 14, 2026
@hiworldwzj hiworldwzj merged commit b0231de into ModelTC:main Jun 14, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants