persist: bound fetch read-ahead on all replicas#36988
Open
antiguru wants to merge 1 commit into
Open
Conversation
The persist fetch semaphore bounds the bytes of fetched-but-not-yet-decoded parts held in memory. It was gated on `is_cc_active` and, where active, sized to the full process memory limit, both under the assumption that this data spills to lgalloc-backed disk. That assumption no longer holds: there is no spill disk, the data is heap-resident, and the cc/non-cc distinction is gone. As a result the semaphore was effectively disabled (`MAX_PERMITS`) on the replicas that matter, so persist_source could read ahead without bound during hydration. A production heap profile showed fetched-but-undecoded blobs dominating the heap, with a large transient spike over steady state. Size the semaphore from the announced memory limit on every replica, and lower the default budget from 1.0x to 0.1x of the limit: the read-ahead buffers must now coexist with the arrangements being built in the same scarce heap, so the bound has to sit well below the full limit. Rewrite the flag docs to describe a hard heap bound rather than an lgalloc spill budget. Also make the bounded-memory suite start clusterd with `--announce-memory-limit`, mirroring production. Without it the semaphore was unbounded in the suite, so it never exercised persist's memory budget. Add a `persist-source-read-ahead` scenario that re-hydrates a large index and must stay within the announced limit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
a0bd7b4 to
fa932e3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The persist fetch semaphore bounds the bytes of fetched-but-not-yet-decoded parts held in memory. It was gated on
is_cc_activeand, where active, sized to the full process memory limit, under the assumption that this data spills to lgalloc-backed disk. That assumption no longer holds: there is no spill disk, the data is heap-resident, and the cc/non-cc distinction is gone. So the semaphore was effectively unbounded (MAX_PERMITS) on the replicas that matter, andpersist_sourcecould read ahead without bound during hydration. A production heap profile showed fetched-but-undecoded blobs dominating the heap with a large transient spike over steady state.Relates to CLU-113. Overlaps with #36910, which addresses the same problem but folds in the larger
shard_sourcedeasync refactor; this is the minimal standalone fix.Description
Size the semaphore from the announced memory limit on every replica, and lower the default budget from 1.0x to 0.1x of the limit. The read-ahead buffers are heap-resident and must coexist with the arrangements being built in the same scarce memory, so the bound has to sit well below the full limit. The mechanism is otherwise unchanged: the permit is acquired before download and held with the part until after decode, and a single part larger than the budget still acquires the whole budget rather than deadlocking.
The flag docs are rewritten to describe a hard heap bound instead of an lgalloc spill budget.
This caps read-ahead overshoot; it does not make an undersized replica fit. If an arrangement alone approaches the limit, hydration can still OOM.
Verification
The bounded-memory suite never passed
--announce-memory-limit, so the semaphore was unbounded there and the suite never exercised persist's memory budget. clusterd now announces its limit in every scenario, mirroring production, and a newpersist-source-read-aheadscenario re-hydrates a large index that must stay within the announced limit. A unit test asserts the semaphore is sized from the memory limit regardless ofis_cc_active.The new scenario's
clusterd_memorythreshold is an estimate, not yet empirically confirmed; it needs a bounded-memory CI run (and possiblyminimization-search) to tune. Activating the announced limit also changes the memory behavior of the existing scenarios, which CI should validate.🤖 Generated with Claude Code