Skip to content

persist: bound fetch read-ahead on all replicas#36988

Open
antiguru wants to merge 1 commit into
MaterializeInc:mainfrom
antiguru:persist-fetch-backpressure
Open

persist: bound fetch read-ahead on all replicas#36988
antiguru wants to merge 1 commit into
MaterializeInc:mainfrom
antiguru:persist-fetch-backpressure

Conversation

@antiguru

Copy link
Copy Markdown
Member

Motivation

The persist fetch semaphore bounds the bytes of fetched-but-not-yet-decoded parts held in memory. It was gated on is_cc_active and, where active, sized to the full process memory limit, under the assumption that this data spills to lgalloc-backed disk. That assumption no longer holds: there is no spill disk, the data is heap-resident, and the cc/non-cc distinction is gone. So the semaphore was effectively unbounded (MAX_PERMITS) on the replicas that matter, and persist_source could read ahead without bound during hydration. A production heap profile showed fetched-but-undecoded blobs dominating the heap with a large transient spike over steady state.

Relates to CLU-113. Overlaps with #36910, which addresses the same problem but folds in the larger shard_source deasync refactor; this is the minimal standalone fix.

Description

Size the semaphore from the announced memory limit on every replica, and lower the default budget from 1.0x to 0.1x of the limit. The read-ahead buffers are heap-resident and must coexist with the arrangements being built in the same scarce memory, so the bound has to sit well below the full limit. The mechanism is otherwise unchanged: the permit is acquired before download and held with the part until after decode, and a single part larger than the budget still acquires the whole budget rather than deadlocking.

The flag docs are rewritten to describe a hard heap bound instead of an lgalloc spill budget.

This caps read-ahead overshoot; it does not make an undersized replica fit. If an arrangement alone approaches the limit, hydration can still OOM.

Verification

The bounded-memory suite never passed --announce-memory-limit, so the semaphore was unbounded there and the suite never exercised persist's memory budget. clusterd now announces its limit in every scenario, mirroring production, and a new persist-source-read-ahead scenario re-hydrates a large index that must stay within the announced limit. A unit test asserts the semaphore is sized from the memory limit regardless of is_cc_active.

The new scenario's clusterd_memory threshold is an estimate, not yet empirically confirmed; it needs a bounded-memory CI run (and possibly minimization-search) to tune. Activating the announced limit also changes the memory behavior of the existing scenarios, which CI should validate.

🤖 Generated with Claude Code

The persist fetch semaphore bounds the bytes of fetched-but-not-yet-decoded
parts held in memory. It was gated on `is_cc_active` and, where active, sized
to the full process memory limit, both under the assumption that this data
spills to lgalloc-backed disk. That assumption no longer holds: there is no
spill disk, the data is heap-resident, and the cc/non-cc distinction is gone.

As a result the semaphore was effectively disabled (`MAX_PERMITS`) on the
replicas that matter, so persist_source could read ahead without bound during
hydration. A production heap profile showed fetched-but-undecoded blobs
dominating the heap, with a large transient spike over steady state.

Size the semaphore from the announced memory limit on every replica, and lower
the default budget from 1.0x to 0.1x of the limit: the read-ahead buffers must
now coexist with the arrangements being built in the same scarce heap, so the
bound has to sit well below the full limit. Rewrite the flag docs to describe a
hard heap bound rather than an lgalloc spill budget.

Also make the bounded-memory suite start clusterd with
`--announce-memory-limit`, mirroring production. Without it the semaphore was
unbounded in the suite, so it never exercised persist's memory budget. Add a
`persist-source-read-ahead` scenario that re-hydrates a large index and must
stay within the announced limit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@antiguru antiguru force-pushed the persist-fetch-backpressure branch from a0bd7b4 to fa932e3 Compare June 12, 2026 07:57
@antiguru antiguru marked this pull request as ready for review June 12, 2026 14:37
@antiguru antiguru requested a review from a team as a code owner June 12, 2026 14:37
@antiguru antiguru requested review from DAlperin, def- and petrosagg June 12, 2026 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant