
[adapter][persist] PreAllocate Shards to remove from DDL path #35786

Open
mtabebe wants to merge 2 commits into MaterializeInc:main from mtabebe:ma/ddl/prealloc-shards

Conversation

@mtabebe
Contributor

@mtabebe mtabebe commented Mar 30, 2026

Motivation

A single CREATE TABLE currently makes ~27 CRDB round-trips, with
open_data_handles being the largest chunk. Two of those operations,
upgrade_version and open_critical_since, don't depend on the table
schema and can be performed ahead of time. This PR performs them in a
background pool so they can be skipped on the DDL critical path.

Solution:

Commit 1: Pre-opened shard pool

Adds ShardPool<T>, a thread-safe pool of pre-opened persist shards, and a
background replenishment task (shard_pool_replenish_task) that keeps the
pool filled to a configurable target size.
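A minimal sketch of what such a pool could look like, assuming internals the PR does not spell out (the real `ShardPool<T>` holds pre-opened persist handles; here the element type and counters are illustrative):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative thread-safe pool of pre-opened entries. The name mirrors
/// the PR description; the internals here are assumptions, not the real code.
pub struct ShardPool<T> {
    entries: Mutex<VecDeque<T>>,
    target_size: usize,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<T> ShardPool<T> {
    pub fn new(target_size: usize) -> Self {
        ShardPool {
            entries: Mutex::new(VecDeque::new()),
            target_size,
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Take a pre-opened entry, recording a hit or a miss metric.
    pub fn take(&self) -> Option<T> {
        let entry = self.entries.lock().unwrap().pop_front();
        if entry.is_some() {
            self.hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.misses.fetch_add(1, Ordering::Relaxed);
        }
        entry
    }

    /// Return an entry to the pool (what the replenish task would call).
    pub fn put(&self, entry: T) {
        self.entries.lock().unwrap().push_back(entry);
    }

    /// How many entries the replenish task still needs to open
    /// to reach the configured target size.
    pub fn deficit(&self) -> usize {
        self.target_size
            .saturating_sub(self.entries.lock().unwrap().len())
    }
}
```

The background task would periodically check `deficit()` and pre-open that many shards, so DDL-time `take()` calls usually hit.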

Commit 2: Catalog tracking for crash recovery

Without catalog tracking, shards pre-opened by the pool but never claimed
(e.g., due to a crash) are leaked permanently. This commit adds a
pre_allocated_shards catalog collection (same pattern as
unfinalized_shards) to track them.

Results

Benchmarks show ~18% DDL latency improvement:

| Metric | Before | After | Saved |
| --- | --- | --- | --- |
| Per-DDL latency | 91.5ms | 76.9ms | 14.6ms |
| 500 tables | 50.7s | 41.2s | 9.4s |
| CRDB round-trips per DDL | 27 | 22 | 5 |

Rollout

The pool is controlled by storage_shard_pool_enabled (default: true).
It can be disabled at runtime with:

ALTER SYSTEM SET storage_shard_pool_enabled = false;

When disabled, DDL falls back to the existing ShardId::new() path with
no behavioral change.
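The selection logic at DDL time can be pictured roughly as follows; the function and parameter names here are hypothetical, chosen only to illustrate the pool-hit/fallback behavior described above:

```rust
// Hypothetical DDL-time allocation: a pool hit skips the pre-openable
// round-trips; a miss, or the flag being disabled, falls back to opening
// a fresh shard exactly as the existing ShardId::new() path does.
fn allocate_shard(pool_enabled: bool, pool: &mut Vec<String>, fresh: fn() -> String) -> String {
    if pool_enabled {
        pool.pop().unwrap_or_else(fresh)
    } else {
        fresh()
    }
}
```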

@github-actions
Contributor

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@mtabebe force-pushed the ma/ddl/prealloc-shards branch from 6bab44e to e57346c on March 30, 2026 at 15:48
mtabebe added 2 commits March 30, 2026 11:51
Problem:

A single CREATE TABLE makes ~27 CRDB round-trips, with `open_data_handles`
being the largest chunk. Two of these operations, `upgrade_version` and
`open_critical_since`, don't depend on the table schema and can be done
ahead of time.

Solution:

Add a background shard pool that pre-opens persist shards so they can be
reused at DDL time, skipping the pre-openable CRDB round-trips.

- `ShardPool<T>`: thread-safe pool with take/put and hit/miss metrics
- `shard_pool_replenish_task`: background task that fills the pool to
  target size, controlled by dyncfgs:
  - `storage_shard_pool_enabled` (default: true)
  - `storage_shard_pool_target_size` (default: 10)
  - `storage_shard_pool_replenish_interval` (default: 1s)
- Epoch fencing ensures correctness across multiple environmentd instances
- The write handle still opens at DDL time since it needs `RelationDesc`

Add new metrics: `mz_shard_pool_size`, `mz_shard_pool_hits_total`,
`mz_shard_pool_misses_total`.

Results:

Release-build benchmarks show ~18% DDL latency improvement:
- Per-DDL: 91.5ms → 76.9ms (14.6ms saved)
- 500 tables: 50.7s → 41.2s (9.4s saved)
- CRDB round-trips: 27 → 22 per DDL
Problem:

Without catalog tracking, shards pre-opened by the pool but never claimed
(e.g., due to a crash) are leaked permanently. We need a crash recovery
mechanism.

Solution:

Track pre-opened shard IDs in a new `pre_allocated_shards` catalog
collection, following the same pattern as `unfinalized_shards`.

- New `PreAllocatedShard`/`PreAllocatedShardKey` catalog types with proto
  definitions (bumps to v81)
- `ShardPool::drain_pending_inserts()` batches catalog writes into the
  existing `prepare_state` transaction, adding zero extra CRDB round-trips
  on the DDL critical path
- Crash recovery: on restart, `initialize_state` moves any unclaimed
  pre-allocated shards to `unfinalized_shards` for GC by
  `finalize_shards_task`
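The recovery step can be sketched as a simple hand-off between the two collections; the function shape below is an assumption, with the collection names taken from the commit message:

```rust
use std::collections::BTreeSet;

// Sketch of crash recovery at boot: anything still in `pre_allocated_shards`
// was opened by the pool but never bound to a table, so it is safe to move
// to `unfinalized_shards` for GC by the finalizer task.
fn recover_unclaimed(
    pre_allocated_shards: &mut BTreeSet<String>,
    unfinalized_shards: &mut BTreeSet<String>,
) {
    unfinalized_shards.extend(std::mem::take(pre_allocated_shards));
}
```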
@mtabebe force-pushed the ma/ddl/prealloc-shards branch from e57346c to 59f253c on March 30, 2026 at 15:51
@mtabebe changed the title from "Ma/ddl/prealloc shards" to "[adapter][persist] PreAllocate Shards to remove from DDL path" on Mar 30, 2026
@mtabebe
Contributor Author

mtabebe commented Mar 30, 2026

Nightly is clean except for CDC failures that arise from this new commit: #35513

@mtabebe
Contributor Author

mtabebe commented Mar 30, 2026

@BugBot run

@cursor

cursor bot commented Mar 30, 2026

PR Summary

Medium Risk
Adds a new durable catalog collection and protobuf/state-update wiring for pre_allocated_shards, which changes persisted catalog state and bootstrap/apply paths; mistakes could cause upgrade/rollback or crash-recovery issues. Runtime behavior is otherwise mostly additive and gated by new shard-pool system parameters.

Overview
Introduces a new durable catalog collection, PreAllocatedShard, to persistently track shards that have been pre-opened by a background shard pool so they can be reclaimed after crashes instead of being leaked.

Wires PreAllocatedShard through catalog protobufs/hashes, persist snapshot/trace handling, transaction batching, adapter bootstrap/apply logic (treated similarly to UnfinalizedShard but with no in-memory state updates), and catalog-debug tooling. Adds new system parameters (storage_shard_pool_enabled, storage_shard_pool_target_size, storage_shard_pool_replenish_interval) to mzcompose defaults and parallel-workload flag flipping.

Written by Cursor Bugbot for commit 59f253c.

@mtabebe mtabebe marked this pull request as ready for review March 30, 2026 19:30
@mtabebe mtabebe requested review from a team as code owners March 30, 2026 19:30
@mtabebe mtabebe requested a review from ohbadiah March 30, 2026 19:30

@cursor cursor bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@bkirwi
Contributor

bkirwi commented Mar 30, 2026

I don't totally follow why this approach is necessary, FWIW, mostly because it's not clear to me why we need to do anything synchronous with Persist in the DDL path.

One of the generated comments suggests that we're waiting on two roundtrips - upgrading the version and initializing the critical since. IIUC we shouldn't need to upgrade the version on a brand-new shard at all. Downgrading the since shouldn't really have to happen synchronously either, since it just allows compaction... though I'd totally believe that the structure of the storage controller makes this painful. In the long run it's probably worth putting some effort into cleaning up that aspect of the storage controller, though!

@mtabebe
Contributor Author

mtabebe commented Mar 31, 2026

> I don't totally follow why this approach is necessary, FWIW, mostly because it's not clear to me why we need to do anything synchronous with Persist in the DDL path.
>
> One of the generated comments suggests that we're waiting on two roundtrips - upgrading the version and initializing the critical since. IIUC we shouldn't need to upgrade the version on a brand-new shard at all. Downgrading the since shouldn't really have to happen synchronously either, since it just allows compaction... though I'd totally believe that the structure of the storage controller makes this painful. In the long run it's probably worth putting some effort into cleaning up that aspect of the storage controller, though!

@bkirwi: (Maybe I am misunderstanding the code/your comment.) But what I observed when profiling and reading the code is that when we create a table, we allocate a persist shard for it. Specifically: create_collections_for_bootstrap is called, which in turn calls open_data_handles. That function then makes several subcalls that turn into CRDB operations: upgrade_version, open_critical_since, and open_writer. This PR aims to hide the latency of the first two calls while keeping the logic the same. (We can't remove open_writer, since it needs the actual RelationDesc schema.)

I do agree with your point that we shouldn't need to upgrade the version on a brand new shard at all. So maybe we could just strip this out? The since handling is more entangled, as it is part of the collection state.

So my intention with this change was to be pragmatic: leave things the way they are, just move it out of the actual critical path by doing things ahead of time.

@bkirwi
Contributor

bkirwi commented Apr 1, 2026

> But what I observed when profiling and looking at the code is that when we create a table is that we allocate a persist shard for that table. Specifically: create_collections_for_bootstrap is called [...]

This description of the flow sounds accurate to me! The thing I'm not sure about is why we'd need to call create_collections synchronously in the DDL path. Of course those handles do need to be created sooner or later, but the pattern we're using increasingly in eg. apply_catalog_implications lets us push that kind of work downstream of the actual write to the catalog. (So: getting it off the critical path by doing it after returning a successful response instead of pushing it before the request.) In theory the only thing we need to do to record a shard in the catalog is generate a new ShardId for it, and that takes no time at all, really...

> So my intention with this change was to be pragmatic: leave things the way they are, just move it out of the actual critical path by doing things ahead of time.

For sure. I think the way things are is not great - there's a general consensus I think that the storage controller is a bit of a mess - so I'm enthusiastic about taking the chance to unpick some of these weird flows instead of working around them. But there's a lot to be said for being pragmatic, and especially if there's some urgency here it makes sense why you'd take this path instead.
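The alternative flow bkirwi sketches above, generating only the id synchronously and pushing handle-opening downstream of the catalog write, might look roughly like this; every name here is hypothetical, standing in for the real adapter machinery:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of the "defer the work" pattern (all names hypothetical): the DDL
// path only generates and records a shard id, which is cheap, and hands the
// slow open_data_handles-style work to a background worker so the client
// gets its response without waiting on persist round-trips.
fn create_table_ddl(open_queue: &mpsc::Sender<u64>, next_shard_id: u64) -> u64 {
    // 1. Record the new shard id in the catalog write (cheap, synchronous).
    let shard_id = next_shard_id;
    // 2. Enqueue the expensive handle-opening for a worker to do later.
    open_queue.send(shard_id).unwrap();
    // 3. Return to the client immediately.
    shard_id
}
```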

@mtabebe
Contributor Author

mtabebe commented Apr 2, 2026

That makes sense, Ben. I will think about what to do here.

My understanding is that the March DDL goal was a spike on how to improve things. This approach does that, and it is also evidence that we can further untangle things later, since that isn't a specific requirement/goal from the plan.

So, maybe I'll take a stab at untangling things as a Friday project?

@bkirwi
Contributor

bkirwi commented Apr 2, 2026

Whatever makes sense to you and the SQL team, yeah!

