[adapter][persist] PreAllocate Shards to remove from DDL path#35786
mtabebe wants to merge 2 commits into MaterializeInc:main
Conversation
Force-pushed 6bab44e to e57346c
Problem: A single `CREATE TABLE` makes ~27 CRDB round-trips, with `open_data_handles` being the largest chunk. Two of these operations, `upgrade_version` and `open_critical_since`, don't depend on the table schema and can be done ahead of time.

Solution: Add a background shard pool that pre-opens persist shards so they can be reused at DDL time, skipping the pre-openable CRDB round-trips.
- `ShardPool<T>`: thread-safe pool with take/put and hit/miss metrics
- `shard_pool_replenish_task`: background task that fills the pool to the target size, controlled by dyncfgs:
  - `storage_shard_pool_enabled` (default: true)
  - `storage_shard_pool_target_size` (default: 10)
  - `storage_shard_pool_replenish_interval` (default: 1s)
- Epoch fencing ensures correctness across multiple environmentd instances
- The write handle still opens at DDL time since it needs the `RelationDesc`

New metrics: `mz_shard_pool_size`, `mz_shard_pool_hits_total`, `mz_shard_pool_misses_total`.

Results: Release-build benchmarks show an ~18% DDL latency improvement:
- Per-DDL: 91.5ms → 76.9ms (14.6ms saved)
- 500 tables: 50.7s → 41.2s (9.4s saved)
- CRDB round-trips: 27 → 22 per DDL
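The pool described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the field names, the `&str` payload in the usage example, and the bare atomic counters standing in for the `mz_shard_pool_hits_total`/`mz_shard_pool_misses_total` Prometheus metrics are all assumptions.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Sketch of a thread-safe pool of pre-opened shard state with
/// hit/miss accounting. `T` would be the pre-opened handle bundle.
pub struct ShardPool<T> {
    entries: Mutex<VecDeque<T>>,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<T> ShardPool<T> {
    pub fn new() -> Self {
        ShardPool {
            entries: Mutex::new(VecDeque::new()),
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Take a pre-opened entry if one is available, recording a hit or miss.
    /// On a miss, DDL would fall back to opening a shard inline.
    pub fn take(&self) -> Option<T> {
        let entry = self.entries.lock().unwrap().pop_front();
        match &entry {
            Some(_) => self.hits.fetch_add(1, Ordering::Relaxed),
            None => self.misses.fetch_add(1, Ordering::Relaxed),
        };
        entry
    }

    /// Add a pre-opened entry; called by the background replenish task.
    pub fn put(&self, entry: T) {
        self.entries.lock().unwrap().push_back(entry);
    }

    /// Current pool size (would back the `mz_shard_pool_size` gauge).
    pub fn size(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

fn main() {
    let pool = ShardPool::new();
    pool.put("shard-a");
    assert_eq!(pool.take(), Some("shard-a")); // hit
    assert_eq!(pool.take(), None); // miss: caller falls back to inline open
    assert_eq!(pool.hits.load(Ordering::Relaxed), 1);
    assert_eq!(pool.misses.load(Ordering::Relaxed), 1);
}
```

Keeping the interior mutability behind a `Mutex<VecDeque<_>>` means `take`/`put` need only `&self`, so the pool can be shared between the DDL path and the replenish task without extra locking at the call sites.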
Problem: Without catalog tracking, shards pre-opened by the pool but never claimed (e.g., due to a crash) are leaked permanently. We need a crash-recovery mechanism.

Solution: Track pre-opened shard IDs in a new `pre_allocated_shards` catalog collection, following the same pattern as `unfinalized_shards`.
- New `PreAllocatedShard`/`PreAllocatedShardKey` catalog types with proto definitions (bumps the catalog to v81)
- `ShardPool::drain_pending_inserts()` batches catalog writes into the existing `prepare_state` transaction, adding zero extra CRDB round-trips on the DDL critical path
- Crash recovery: on restart, `initialize_state` moves any unclaimed pre-allocated shards to `unfinalized_shards` for GC by `finalize_shards_task`
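The restart-time recovery step might look roughly like this. The function name and the `String` stand-in for shard IDs are assumptions; the real code moves entries between catalog collections inside `initialize_state`.

```rust
use std::collections::BTreeSet;

/// Sketch of crash recovery: after a restart, nothing in the in-memory pool
/// survives, so every shard still listed in `pre_allocated_shards` is
/// unclaimed and gets handed to the finalizer (`finalize_shards_task`) for GC.
fn recover_unclaimed(
    pre_allocated_shards: &mut BTreeSet<String>,
    unfinalized_shards: &mut BTreeSet<String>,
) {
    unfinalized_shards.extend(pre_allocated_shards.iter().cloned());
    pre_allocated_shards.clear();
}

fn main() {
    let mut pre: BTreeSet<String> =
        ["s1", "s2"].iter().map(|s| s.to_string()).collect();
    let mut unf: BTreeSet<String> = BTreeSet::new();
    recover_unclaimed(&mut pre, &mut unf);
    assert!(pre.is_empty());
    assert_eq!(unf.len(), 2);
}
```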
Force-pushed e57346c to 59f253c
Nightly is clean except for CDC failures that arise from this new commit: #35513
@BugBot run
PR Summary (Medium Risk), written by Cursor Bugbot for commit 59f253c; updates automatically on new commits.
I don't totally follow why this approach is necessary, FWIW, mostly because it's not clear to me why we need to do anything synchronous with Persist in the DDL path. One of the generated comments suggests that we're waiting on two roundtrips - upgrading the version and initializing the critical since. IIUC we shouldn't need to upgrade the version on a brand-new shard at all. Downgrading the since shouldn't really have to happen synchronously either, since it just allows compaction... though I'd totally believe that the structure of the storage controller makes this painful. In the long run it's probably worth putting some effort into cleaning up that aspect of the storage controller, though!
@bkirwi: (Maybe I am misunderstanding the code/your comment.) But what I observed when profiling and reading the code is that when we create a table, we allocate a persist shard for that table. Specifically: I do agree with your point that we shouldn't need to upgrade the version on a brand-new shard at all, so maybe we could just strip that out? The since handling is entangled since it is part of the collection state. So my intention with this change was to be pragmatic: leave things the way they are, and just move the work out of the actual critical path by doing it ahead of time.
This description of the flow sounds accurate to me! The thing I'm not sure about is why we'd need to call create_collections synchronously in the DDL path. Of course those handles do need to be created sooner or later, but the pattern we're using increasingly in e.g.
For sure. I think the way things are is not great - there's a general consensus I think that the storage controller is a bit of a mess - so I'm enthusiastic about taking the chance to unpick some of these weird flows instead of working around them. But there's a lot to be said for being pragmatic, and especially if there's some urgency here it makes sense why you'd take this path instead.
That makes sense, Ben. I will think about what to do here. My understanding is that the DDL goal for March was a spike on how to improve things, and this approach does that. It's also evidence that we can further untangle things, even though that isn't a specific requirement/goal from the plan. So maybe I'll take a stab at untangling things as a Friday project?
Whatever makes sense to you and the SQL team, yeah!
Motivation
A single `CREATE TABLE` currently makes ~27 CRDB round-trips, with `open_data_handles` being the largest chunk. Two of those operations, `upgrade_version` and `open_critical_since`, don't depend on the table schema and can be done ahead of time. This PR pre-performs them in a background pool so they can be skipped on the DDL critical path.
Solution:
Commit 1: Pre-opened shard pool
Adds `ShardPool<T>`, a thread-safe pool of pre-opened persist shards, and a background replenishment task (`shard_pool_replenish_task`) that keeps the pool filled to a configurable target size.
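A rough sketch of the replenish step the task would perform each tick. The names are hypothetical: a plain `VecDeque` stands in for the pool, and a closure stands in for the pre-open work (the `upgrade_version`/`open_critical_since` round-trips done ahead of time).

```rust
use std::collections::VecDeque;

/// Hypothetical sketch: top the pool up to the dyncfg target size
/// (`storage_shard_pool_target_size`, default 10), opening one new
/// shard per missing slot. The real task runs this on an interval.
fn replenish_once<T>(pool: &mut VecDeque<T>, target: usize, mut open_shard: impl FnMut() -> T) {
    while pool.len() < target {
        pool.push_back(open_shard());
    }
}

fn main() {
    let mut pool: VecDeque<u64> = VecDeque::new();
    let mut next = 0u64;
    // Initial fill to the (assumed) default target of 10.
    replenish_once(&mut pool, 10, || { next += 1; next });
    assert_eq!(pool.len(), 10);
    // A DDL takes one entry; the next tick tops the pool back up.
    pool.pop_front();
    replenish_once(&mut pool, 10, || { next += 1; next });
    assert_eq!(pool.len(), 10);
}
```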
Commit 2: Catalog tracking for crash recovery
Without catalog tracking, shards pre-opened by the pool but never claimed (e.g., due to a crash) are leaked permanently. This commit adds a `pre_allocated_shards` catalog collection (same pattern as `unfinalized_shards`) to track them.
Results
Benchmarks show ~18% DDL latency improvement:
Rollout
The pool is controlled by `storage_shard_pool_enabled` (default: `true`) and can be disabled at runtime. When disabled, DDL falls back to the existing `ShardId::new()` path with no behavioral change.