
[adapter][persist] PreAllocate Shards to remove from DDL path #35786

Open
mtabebe wants to merge 2 commits into MaterializeInc:main from mtabebe:ma/ddl/prealloc-shards

Conversation

@mtabebe
Contributor

@mtabebe mtabebe commented Mar 30, 2026

Motivation

A single CREATE TABLE currently makes ~27 CRDB round-trips, with
open_data_handles being the largest chunk. Two of those operations,
upgrade_version and open_critical_since, don't depend on the table
schema and can be performed ahead of time. This PR performs them in a
background pool so they can be skipped on the DDL critical path.

Solution:

Commit 1: Pre-opened shard pool

Adds ShardPool<T>, a thread-safe pool of pre-opened persist shards, and a
background replenishment task (shard_pool_replenish_task) that keeps the
pool filled to a configurable target size.
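A minimal sketch of what such a pool could look like, assuming internals the PR does not spell out (the real `ShardPool<T>` holds pre-opened persist handles; here the element type and counters are illustrative):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative thread-safe pool of pre-opened entries. The name mirrors
/// the PR description; the internals here are assumptions, not the real code.
pub struct ShardPool<T> {
    entries: Mutex<VecDeque<T>>,
    target_size: usize,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<T> ShardPool<T> {
    pub fn new(target_size: usize) -> Self {
        ShardPool {
            entries: Mutex::new(VecDeque::new()),
            target_size,
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Take a pre-opened entry, recording a hit or a miss metric.
    pub fn take(&self) -> Option<T> {
        let entry = self.entries.lock().unwrap().pop_front();
        if entry.is_some() {
            self.hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.misses.fetch_add(1, Ordering::Relaxed);
        }
        entry
    }

    /// Return an entry to the pool (what the replenish task would call).
    pub fn put(&self, entry: T) {
        self.entries.lock().unwrap().push_back(entry);
    }

    /// How many entries the replenish task still needs to open
    /// to reach the configured target size.
    pub fn deficit(&self) -> usize {
        self.target_size
            .saturating_sub(self.entries.lock().unwrap().len())
    }
}
```

The background task would periodically check `deficit()` and pre-open that many shards, so DDL-time `take()` calls usually hit.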

Commit 2: Catalog tracking for crash recovery

Without catalog tracking, shards pre-opened by the pool but never claimed
(e.g., due to a crash) are leaked permanently. This commit adds a
pre_allocated_shards catalog collection (same pattern as
unfinalized_shards) to track them.

Results

Benchmarks show ~18% DDL latency improvement:

| Metric | Before | After | Saved |
| --- | --- | --- | --- |
| Per-DDL latency | 91.5ms | 76.9ms | 14.6ms |
| 500 tables | 50.7s | 41.2s | 9.4s |
| CRDB round-trips per DDL | 27 | 22 | 5 |

Rollout

The pool is controlled by storage_shard_pool_enabled (default: true).
It can be disabled at runtime with:

ALTER SYSTEM SET storage_shard_pool_enabled = false;

When disabled, DDL falls back to the existing ShardId::new() path with
no behavioral change.
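The selection logic at DDL time can be pictured roughly as follows; the function and parameter names here are hypothetical, chosen only to illustrate the pool-hit/fallback behavior described above:

```rust
// Hypothetical DDL-time allocation: a pool hit skips the pre-openable
// round-trips; a miss, or the flag being disabled, falls back to opening
// a fresh shard exactly as the existing ShardId::new() path does.
fn allocate_shard(pool_enabled: bool, pool: &mut Vec<String>, fresh: fn() -> String) -> String {
    if pool_enabled {
        pool.pop().unwrap_or_else(fresh)
    } else {
        fresh()
    }
}
```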

@github-actions
Contributor

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@mtabebe force-pushed the ma/ddl/prealloc-shards branch from 6bab44e to e57346c on March 30, 2026 at 15:48
mtabebe added 2 commits March 30, 2026 11:51
Problem:

A single CREATE TABLE makes ~27 CRDB round-trips, with `open_data_handles`
being the largest chunk. Two of these operations, `upgrade_version` and
`open_critical_since`, don't depend on the table schema and can be done
ahead of time.

Solution:

Add a background shard pool that pre-opens persist shards so they can be
reused at DDL time, skipping the pre-openable CRDB round-trips.

- `ShardPool<T>`: thread-safe pool with take/put and hit/miss metrics
- `shard_pool_replenish_task`: background task that fills the pool to
  target size, controlled by dyncfgs:
  - `storage_shard_pool_enabled` (default: true)
  - `storage_shard_pool_target_size` (default: 10)
  - `storage_shard_pool_replenish_interval` (default: 1s)
- Epoch fencing ensures correctness across multiple environmentd instances
- The write handle still opens at DDL time since it needs `RelationDesc`

Add new metrics: `mz_shard_pool_size`, `mz_shard_pool_hits_total`,
`mz_shard_pool_misses_total`.

Results:

Release-build benchmarks show ~18% DDL latency improvement:
- Per-DDL: 91.5ms → 76.9ms (14.6ms saved)
- 500 tables: 50.7s → 41.2s (9.4s saved)
- CRDB round-trips: 27 → 22 per DDL
Problem:

Without catalog tracking, shards pre-opened by the pool but never claimed
(e.g., due to a crash) are leaked permanently. We need a crash recovery
mechanism.

Solution:

Track pre-opened shard IDs in a new `pre_allocated_shards` catalog
collection, following the same pattern as `unfinalized_shards`.

- New `PreAllocatedShard`/`PreAllocatedShardKey` catalog types with proto
  definitions (bumps to v81)
- `ShardPool::drain_pending_inserts()` batches catalog writes into the
  existing `prepare_state` transaction, adding zero extra CRDB round-trips
  on the DDL critical path
- Crash recovery: on restart, `initialize_state` moves any unclaimed
  pre-allocated shards to `unfinalized_shards` for GC by
  `finalize_shards_task`
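The recovery step can be sketched as a simple hand-off between the two collections; the function shape below is an assumption, with the collection names taken from the commit message:

```rust
use std::collections::BTreeSet;

// Sketch of crash recovery at boot: anything still in `pre_allocated_shards`
// was opened by the pool but never bound to a table, so it is safe to move
// to `unfinalized_shards` for GC by the finalizer task.
fn recover_unclaimed(
    pre_allocated_shards: &mut BTreeSet<String>,
    unfinalized_shards: &mut BTreeSet<String>,
) {
    unfinalized_shards.extend(std::mem::take(pre_allocated_shards));
}
```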
@mtabebe force-pushed the ma/ddl/prealloc-shards branch from e57346c to 59f253c on March 30, 2026 at 15:51
@mtabebe changed the title from "Ma/ddl/prealloc shards" to "[adapter][persist] PreAllocate Shards to remove from DDL path" on Mar 30, 2026
@mtabebe
Contributor Author

mtabebe commented Mar 30, 2026

Nightly is clean except for CDC failures that arise from this new commit: #35513

@mtabebe
Contributor Author

mtabebe commented Mar 30, 2026

@BugBot run

@cursor

cursor bot commented Mar 30, 2026

PR Summary

Medium Risk
Adds a new durable catalog collection and protobuf/state-update wiring for pre_allocated_shards, which changes persisted catalog state and bootstrap/apply paths; mistakes could cause upgrade/rollback or crash-recovery issues. Runtime behavior is otherwise mostly additive and gated by new shard-pool system parameters.

Overview
Introduces a new durable catalog collection, PreAllocatedShard, to persistently track shards that have been pre-opened by a background shard pool so they can be reclaimed after crashes instead of being leaked.

Wires PreAllocatedShard through catalog protobufs/hashes, persist snapshot/trace handling, transaction batching, adapter bootstrap/apply logic (treated similarly to UnfinalizedShard but with no in-memory state updates), and catalog-debug tooling. Adds new system parameters (storage_shard_pool_enabled, storage_shard_pool_target_size, storage_shard_pool_replenish_interval) to mzcompose defaults and parallel-workload flag flipping.

Written by Cursor Bugbot for commit 59f253c.

@mtabebe mtabebe marked this pull request as ready for review March 30, 2026 19:30
@mtabebe mtabebe requested review from a team as code owners March 30, 2026 19:30
@mtabebe mtabebe requested a review from ohbadiah March 30, 2026 19:30

@cursor cursor bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@bkirwi
Contributor

bkirwi commented Mar 30, 2026

I don't totally follow why this approach is necessary, FWIW, mostly because it's not clear to me why we need to do anything synchronous with Persist in the DDL path.

One of the generated comments suggests that we're waiting on two roundtrips - upgrading the version and initializing the critical since. IIUC we shouldn't need to upgrade the version on a brand-new shard at all. Downgrading the since shouldn't really have to happen synchronously either, since it just allows compaction... though I'd totally believe that the structure of the storage controller makes this painful. In the long run it's probably worth putting some effort into cleaning up that aspect of the storage controller, though!

@mtabebe
Contributor Author

mtabebe commented Mar 31, 2026

> I don't totally follow why this approach is necessary, FWIW, mostly because it's not clear to me why we need to do anything synchronous with Persist in the DDL path.
>
> One of the generated comments suggests that we're waiting on two roundtrips - upgrading the version and initializing the critical since. IIUC we shouldn't need to upgrade the version on a brand-new shard at all. Downgrading the since shouldn't really have to happen synchronously either, since it just allows compaction... though I'd totally believe that the structure of the storage controller makes this painful. In the long run it's probably worth putting some effort into cleaning up that aspect of the storage controller, though!

@bkirwi: (Maybe I am misunderstanding the code/your comment.) But what I observed when profiling and reading the code is that when we create a table, we allocate a persist shard for it. Specifically: create_collections_for_bootstrap is called, which in turn calls open_data_handles. That function then makes several subcalls that turn into CRDB operations: upgrade_version, open_critical_since, and open_writer. This PR aims to hide the latency of the first two calls while keeping the logic the same. (We can't remove open_writer, since it needs the actual RelationDesc schema.)

I do agree with your point that we shouldn't need to upgrade the version on a brand new shard at all. So maybe we could just strip this out? The since handling is more entangled, as it is part of the collection state.

So my intention with this change was to be pragmatic: leave things the way they are, just move it out of the actual critical path by doing things ahead of time.

@bkirwi
Contributor

bkirwi commented Apr 1, 2026

> But what I observed when profiling and looking at the code is that when we create a table is that we allocate a persist shard for that table. Specifically: create_collections_for_bootstrap is called [...]

This description of the flow sounds accurate to me! The thing I'm not sure about is why we'd need to call create_collections synchronously in the DDL path. Of course those handles do need to be created sooner or later, but the pattern we're using increasingly in eg. apply_catalog_implications lets us push that kind of work downstream of the actual write to the catalog. (So: getting it off the critical path by doing it after returning a successful response instead of pushing it before the request.) In theory the only thing we need to do to record a shard in the catalog is generate a new ShardId for it, and that takes no time at all, really...

> So my intention with this change was to be pragmatic: leave things the way they are, just move it out of the actual critical path by doing things ahead of time.

For sure. I think the way things are is not great - there's a general consensus I think that the storage controller is a bit of a mess - so I'm enthusiastic about taking the chance to unpick some of these weird flows instead of working around them. But there's a lot to be said for being pragmatic, and especially if there's some urgency here it makes sense why you'd take this path instead.
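The alternative flow bkirwi sketches above, generating only the id synchronously and pushing handle-opening downstream of the catalog write, might look roughly like this; every name here is hypothetical, standing in for the real adapter machinery:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of the "defer the work" pattern (all names hypothetical): the DDL
// path only generates and records a shard id, which is cheap, and hands the
// slow open_data_handles-style work to a background worker so the client
// gets its response without waiting on persist round-trips.
fn create_table_ddl(open_queue: &mpsc::Sender<u64>, next_shard_id: u64) -> u64 {
    // 1. Record the new shard id in the catalog write (cheap, synchronous).
    let shard_id = next_shard_id;
    // 2. Enqueue the expensive handle-opening for a worker to do later.
    open_queue.send(shard_id).unwrap();
    // 3. Return to the client immediately.
    shard_id
}
```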

@mtabebe
Contributor Author

mtabebe commented Apr 2, 2026

That makes sense, Ben. I will think about what to do here.

My understanding is that the March DDL goal was a spike on how to improve things. This approach does that, and it is also evidence that we can further untangle things later, since that isn't a specific requirement/goal from the plan.

So, maybe I'll take a stab at untangling things as a Friday project?

@bkirwi
Contributor

bkirwi commented Apr 2, 2026

Whatever makes sense to you and the SQL team, yeah!

