feat(dir-catalog): copy-on-write directory manifest rewrites #6794
jackye1995 wants to merge 10 commits into
Replace merge-insert/delete manifest mutations with always copy-on-write full rewrites. Each mutation scans the latest __manifest dataset, streams transformed rows into a replacement data file, and commits a new version with replacement scalar indices built inline.

- Migrate metadata column from Utf8 to Lance JSON (LargeBinary)
- Remove base_objects column and LabelList index
- Build BTree (object_id), Bitmap (object_type), and FTS (metadata) indices during each streaming rewrite
- Add overwrite-with-replacement-indices commit support in Lance
- Handle concurrency via strict overwrite with full-rewrite retry
- Backward compatible: old schema datasets (Utf8 metadata, base_objects) are read correctly and migrated on first write
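The rewrite-and-retry flow described above can be sketched with an in-memory toy store. This is a minimal illustration under stated assumptions, not the Lance implementation: `Row`, `Snapshot`, `Store`, `rewrite`, and `upsert_with_retry` are hypothetical names, a `BTreeMap` stands in for the replacement index, and the commit is a plain version check.

```rust
use std::collections::BTreeMap;

/// One manifest row (hypothetical, simplified shape).
#[derive(Clone, Debug, PartialEq)]
pub struct Row {
    pub object_id: String,
    pub object_type: String,
    pub metadata: String,
}

/// An immutable committed version: rows plus the index built alongside them.
#[derive(Clone, Debug)]
pub struct Snapshot {
    pub version: u64,
    pub rows: Vec<Row>,
    /// object_id -> row position, rebuilt on every rewrite.
    pub by_id: BTreeMap<String, usize>,
}

/// Returned when another writer committed first.
#[derive(Debug, PartialEq)]
pub struct Conflict;

/// Toy store that keeps only the latest snapshot.
pub struct Store {
    latest: Snapshot,
}

impl Store {
    pub fn new() -> Self {
        Store { latest: Snapshot { version: 0, rows: Vec::new(), by_id: BTreeMap::new() } }
    }

    pub fn latest(&self) -> Snapshot {
        self.latest.clone()
    }

    /// Strict overwrite: succeeds only if nothing was committed since `read_version`.
    pub fn commit(
        &mut self,
        read_version: u64,
        rows: Vec<Row>,
        by_id: BTreeMap<String, usize>,
    ) -> Result<u64, Conflict> {
        if self.latest.version != read_version {
            return Err(Conflict);
        }
        let version = read_version + 1;
        self.latest = Snapshot { version, rows, by_id };
        Ok(version)
    }
}

/// Stream every row of `base` through `transform` (None drops the row),
/// append `extra`, and build the replacement index in the same pass.
pub fn rewrite(
    base: &Snapshot,
    transform: impl Fn(&Row) -> Option<Row>,
    extra: Vec<Row>,
) -> (Vec<Row>, BTreeMap<String, usize>) {
    let mut rows = Vec::new();
    let mut by_id = BTreeMap::new();
    for row in base.rows.iter().filter_map(|r| transform(r)).chain(extra) {
        by_id.insert(row.object_id.clone(), rows.len());
        rows.push(row);
    }
    (rows, by_id)
}

/// Upsert one row; on conflict, redo the whole rewrite from the new latest version.
pub fn upsert_with_retry(store: &mut Store, row: Row, max_attempts: usize) -> Result<u64, Conflict> {
    for _ in 0..max_attempts {
        let base = store.latest();
        let id = row.object_id.clone();
        let (rows, by_id) = rewrite(
            &base,
            |r| if r.object_id == id { None } else { Some(r.clone()) },
            vec![row.clone()],
        );
        match store.commit(base.version, rows, by_id) {
            Ok(version) => return Ok(version),
            Err(Conflict) => continue,
        }
    }
    Err(Conflict)
}
```

In this sketch a stale writer never merges into a newer version; it only gets `Err(Conflict)` and must rebuild from scratch, which is the trade the description makes in exchange for never running merge-insert.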
The benchmark binary measures read (list_namespaces, list_tables, describe_table) and write (create_namespace, create_table) operations at configurable concurrency levels. It supports --variant and --inline-optimization flags to compare the baseline merge-insert and copy-on-write implementations.
Multi-process coordinator/worker architecture: the coordinator spawns N child processes, each with an independent namespace instance (no shared cache). Supports S3 root paths, cold-read (fresh namespace per op), warm-read (cached), and write operations. A separate seed mode populates manifests with a configurable entry count.
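For readers who want to reproduce figures such as p90 and p99 from raw per-operation timings, the following is a minimal sketch of how latency samples could be reduced. It assumes the nearest-rank percentile definition; the benchmark binary's actual method is not stated here, and `percentile` and `throughput` are hypothetical helper names.

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples; `p` must be in 1..=100.
/// Returns None for an empty sample set or an out-of-range `p`.
pub fn percentile(samples: &[Duration], p: u32) -> Option<Duration> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // rank = ceil(p * n / 100), computed in integers to avoid float rounding.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// Completed operations per second of wall-clock time.
pub fn throughput(ops: usize, wall: Duration) -> f64 {
    ops as f64 / wall.as_secs_f64()
}
```

With 200 operations per concurrency level, nearest-rank p99 is the 198th smallest sample, so a single slow operation out of 200 does not move p99 but two or more in the top three can.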
Performance Benchmark: S3 Multi-Process (1K manifest entries)
Setup: c7i.48xlarge EC2, S3 Standard us-east-1, 1000 manifest entries (333 namespaces + 667 tables), multi-process concurrency (each worker is a separate OS process with an independent namespace/S3 connection and no shared cache), 200 operations per concurrency level. Three variants tested:
Write Performance
Tail latency at c=1: baseline_opt_on p90=1,779ms / p99=2,692ms vs copy_on_write p90=512ms / p99=657ms, i.e. 3.5x better p90 and 4.1x better p99 (no periodic compaction spikes).
Warm Read Performance (cached metadata, S3 data pages)
Cold Read Performance (fresh namespace load per operation)
Key Takeaways
Performance Benchmark: S3 Standard + S3 Express (1K manifest, multi-process)
Setup: c7i.48xlarge EC2, 1000 manifest entries (333 namespaces + 667 tables), multi-process concurrency (each worker is a separate OS process with an independent namespace/S3 connection), 200 operations per concurrency level. Four variants tested:
Write Performance — S3 Standard
Write Performance — S3 Express (same AZ)
Cold Read Performance — S3 Standard
Warm Read Performance — S3 Standard
Warm Read Performance — S3 Express
Key Findings
Write throughput (vs baseline_opt_on):
Write latency:
Read performance:
Index trade-off:
S3 Express vs S3 Standard (CoW with index):
…nline cleanup

- Remove FTS background channel asymmetry: accumulate metadata in ManifestIndexAccumulator during streaming, build all 3 indices (BTree, Bitmap, FTS) uniformly after the stream completes.
- Replace CommitBuilder with direct manifest commit: expose write_manifest_file and ManifestWriteConfig as public API, add Dataset::commit_handler() accessor, construct Manifest via new_from_previous and commit directly.
- Remove inline cleanup on retry: drop cleanup_uncommitted_overwrite_files and cleanup_uncommitted_index_uuids, rely on offline GC for orphaned files.
- Add index verification tests: test_manifest_indices_are_complete_and_versioned checks all 3 indices are present, versioned, and have fragment bitmaps; test_manifest_reads_use_indexed_scans verifies explain plans show ScalarIndexQuery for BTree/Bitmap filters and MatchQuery for FTS.
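The accumulate-then-build pattern from the first item can be sketched as follows. This is an illustration only: `IndexAccumulator`, `BuiltIndices`, `observe`, and `finish` are hypothetical names rather than the PR's ManifestIndexAccumulator API, and std collections stand in for the real BTree, Bitmap, and FTS index formats.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Collects index inputs while rows stream past; builds nothing until `finish`.
#[derive(Default)]
pub struct IndexAccumulator {
    ids: Vec<(String, u64)>,
    types: Vec<(String, u64)>,
    metadata: Vec<(String, u64)>,
}

/// Stand-ins for the three indices, all produced in one uniform step.
pub struct BuiltIndices {
    /// object_id -> row id (stand-in for the BTree index).
    pub by_id: BTreeMap<String, u64>,
    /// object_type -> row ids (stand-in for the Bitmap index).
    pub by_type: HashMap<String, BTreeSet<u64>>,
    /// lowercased token -> row ids (stand-in for the FTS index).
    pub by_token: HashMap<String, BTreeSet<u64>>,
}

impl IndexAccumulator {
    /// Record one streamed row. No index work happens here.
    pub fn observe(&mut self, row_id: u64, object_id: &str, object_type: &str, metadata: &str) {
        self.ids.push((object_id.to_string(), row_id));
        self.types.push((object_type.to_string(), row_id));
        self.metadata.push((metadata.to_string(), row_id));
    }

    /// Build all three indices after the stream has completed.
    pub fn finish(self) -> BuiltIndices {
        let by_id: BTreeMap<String, u64> = self.ids.into_iter().collect();

        let mut by_type: HashMap<String, BTreeSet<u64>> = HashMap::new();
        for (object_type, row_id) in self.types {
            by_type.entry(object_type).or_default().insert(row_id);
        }

        let mut by_token: HashMap<String, BTreeSet<u64>> = HashMap::new();
        for (text, row_id) in self.metadata {
            // Naive tokenizer: split on anything that is not alphanumeric.
            for token in text.split(|c: char| !c.is_alphanumeric()).filter(|t| !t.is_empty()) {
                by_token.entry(token.to_lowercase()).or_default().insert(row_id);
            }
        }

        BuiltIndices { by_id, by_type, by_token }
    }
}
```

Keeping all three builds in `finish` means none of them needs its own background channel, at the cost of holding the accumulated inputs in memory until the stream ends.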
seed-large writes a __manifest Lance table directly with N rows, bypassing the namespace API. Triggers one CoW rewrite to build indices. Adds --initial-entries flag to run mode for result tracking.
The mutation lock already serializes local writes. Use get_cached() on first attempt (no I/O) and get_refreshed() only on retry after conflict. Make checkout_version on success non-fatal so the return path doesn't block on I/O if the cache promotion fails.
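The cached-first, refresh-on-retry policy can be sketched like this. The method names get_cached and get_refreshed come from the commit message, but the signatures, the `Remote` and `CachedHandle` types, and the version-counter commit are assumptions made for illustration.

```rust
/// Stand-in for the remote dataset; only its version matters here.
pub struct Remote {
    pub version: u64,
}

/// Local handle that remembers the last version it saw.
pub struct CachedHandle {
    cached: u64,
    /// Number of refreshes performed, i.e. simulated remote I/O round trips.
    pub refreshes: u32,
}

impl CachedHandle {
    pub fn new(version: u64) -> Self {
        CachedHandle { cached: version, refreshes: 0 }
    }

    /// No I/O: return whatever version was seen last.
    pub fn get_cached(&self) -> u64 {
        self.cached
    }

    /// Simulated I/O: reload the latest version from the remote.
    pub fn get_refreshed(&mut self, remote: &Remote) -> u64 {
        self.refreshes += 1;
        self.cached = remote.version;
        self.cached
    }
}

/// First attempt trusts the cache; later attempts refresh after a conflict.
pub fn commit_with_retry(handle: &mut CachedHandle, remote: &mut Remote, max_attempts: u32) -> Option<u64> {
    for attempt in 0..max_attempts {
        let base = if attempt == 0 {
            handle.get_cached()
        } else {
            handle.get_refreshed(remote)
        };
        if base == remote.version {
            remote.version += 1;
            // Promote the new version into the cache. In this toy the update
            // cannot fail; the commit message makes the real step non-fatal.
            handle.cached = remote.version;
            return Some(remote.version);
        }
        // Conflict: another writer committed first, so loop and refresh.
    }
    None
}
```

When only local writers exist, the mutation lock keeps the cache current and the loop never pays for a refresh; the refresh cost appears only when another process has committed in between.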
S3X no-index: 6.8/s create-ns, 6.2/s declare-table at 1K entries. S3X with-index: 5.7/s create-ns, 5.1/s declare-table at 1K entries. Indexed point lookup flat from 100K to 1M (9ms warm on S3X). CoW full rewrite + 3 indices at 1M: 2.0s S3X, 3.2s S3.
Summary
Each mutation scans the latest __manifest dataset, streams transformed rows into a replacement data file, and commits a new version with replacement scalar indices built inline.