fix(rscthrottler): avoid double-counting RSS and reserved in S3 write throttler (4.0-dev)#24931
fix(rscthrottler): avoid double-counting RSS and reserved in S3 write throttler (4.0-dev)#24931aunjgr wants to merge 1 commit into
Conversation
… throttler (matrixorigin#24892) The S3 write throttler's `acquireWithinRSSLimit` had two gates: 1. **Pool check** via `Available()` — double-counts S3 write buffers (in RSS as mmap'd mpool allocations + in reserved as throttler tokens). Under `ForceRefresh()` this returned 0 available, permanently rejecting all writes (matrixorigin#24881). 2. **RSS gate** `rss + reserved + ask > total × limitRate` — same double-counting. Even with stale RSS the sum `rss + reserved` pushes over the cap, causing false rejects. And the RSS gate's original purpose (preventing OOM from non-S3 memory pressure, PR matrixorigin#24268) is now handled by RSS scavenging (85%/92% thresholds with graduated cache eviction + FreeOSMemory). **Fix**: - **Pool check**: use `limit - reserved` directly instead of `Available()`. The pool only needs to bound forward-looking reservations against the throttler limit. - **RSS gate**: removed. Physical memory is protected by: - `pinnedRate >= 0.80` hard ceiling at `AcquirePolicyForCNFlushS3` - RSS scavenging (85%/92%) with cache eviction + FreeOSMemory `Available()` is unchanged — other throttler consumers (workspace, merge) rely on its `rss + reserved` semantics. Approved by: @XuPeng-SH
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
XuPeng-SH
left a comment
There was a problem hiding this comment.
I understand the double-counting bug this PR is fixing, but I think the current backport swings too far in the other direction: it removes RSS from the CNFlushS3 admission decision entirely.
That means if the process is already near the cgroup/host limit because of non-S3 memory, AcquirePolicyForCNFlushS3 can still admit more S3 reservations as long as reserved is below the pool limit. In other words, the policy now protects the forward-looking reservation pool, but no longer protects physical memory headroom from unrelated RSS pressure.
A concrete example from the new test shape is: actualTotalMemory=100, limit=90, reserved=0, rss=91. With the new code, pinnedRate() is still 0 because it only looks at reserved, so the request falls through to acquireWithinRSSLimit() and can still be granted. That is exactly the kind of non-S3 RSS pressure case the old gate was preventing. RSS scavenging is useful, but it is not an equivalent guarantee here: it is asynchronous / best-effort and cannot reclaim live heap/query memory.
So my current view is that this is not just a coverage question. The fix removes the double-counting bug, but it also removes the only hard backpressure for high RSS caused by non-S3 memory. I think the right fix needs to discount the overlapping S3 portion from RSS without dropping physical-memory enforcement altogether.
Cherry-pick of #24892 (f964de9) to 4.0-dev.
Fixes #24881
Fixes #24888
What this PR does
The S3 write throttler's
acquireWithinRSSLimithad two gates:Available()— double-counts S3 write buffers (in RSS as mmap'd mpool allocations + in reserved as throttler tokens). UnderForceRefresh()this returned 0 available, permanently rejecting all writes ([Bug]: Failed to load vector data during regression testing, but it works when isolation #24881).rss + reserved + ask > total × limitRate— same double-counting. The RSS gate's original purpose (preventing OOM from non-S3 memory pressure, PR fix: throttle CN S3 writes under memory pressure #24268) is now handled by RSS scavenging (85%/92% thresholds with graduated cache eviction + FreeOSMemory).Fix:
limit - reserveddirectly instead ofAvailable(). The pool only needs to bound forward-looking reservations against the throttler limit.pinnedRate >= 0.80ceiling + RSS scavenging.Available()is unchanged — other throttler consumers rely on its semantics.