feat(dir-catalog): copy-on-write directory manifest rewrites #6794

Draft: jackye1995 wants to merge 10 commits into lance-format:main from jackye1995:jack/copy-on-write-dir-manifest

@jackye1995
Contributor

Summary

  • Replace merge-insert/delete manifest mutations with always copy-on-write full rewrites, where each mutation scans the latest __manifest dataset, streams transformed rows into a replacement data file, and commits a new version with replacement scalar indices built inline.
  • Migrate metadata column from Utf8 to Lance JSON, remove base_objects column and LabelList index, and build BTree/Bitmap/FTS indices during each streaming rewrite.
  • Add overwrite-with-replacement-indices commit support in Lance core and handle concurrency via strict overwrite with full-rewrite retry.
  • Backward compatible with old-schema datasets (Utf8 metadata, base_objects column): they are read correctly and migrated on first write.
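
To make the copy-on-write flow concrete, here is a minimal in-memory sketch of one mutation: read the latest version, stream transformed rows into a replacement, and commit with a strict version check. It is a model under stated assumptions, not the Lance implementation. `VersionedManifest`, `rewrite`, and `ConflictError` are hypothetical names, and a Python list stands in for the `__manifest` data file.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional

Row = dict  # one manifest entry, e.g. {"object_id": ..., "object_type": ...}


class ConflictError(Exception):
    """Another writer committed first (hypothetical name)."""


@dataclass
class VersionedManifest:
    """In-memory stand-in for the __manifest dataset: one row list per version."""
    versions: List[List[Row]] = field(default_factory=lambda: [[]])

    @property
    def latest(self) -> int:
        return len(self.versions) - 1

    def scan(self, version: int) -> Iterator[Row]:
        yield from self.versions[version]

    def commit_overwrite(self, read_version: int, rows: Iterable[Row]) -> int:
        # Strict overwrite: reject the commit if anyone committed since we read.
        if read_version != self.latest:
            raise ConflictError(f"read v{read_version}, latest is v{self.latest}")
        self.versions.append(list(rows))  # earlier versions are never modified
        return self.latest


def rewrite(
    manifest: VersionedManifest,
    transform: Callable[[Row], Optional[Row]],
    extra: Iterable[Row] = (),
) -> int:
    """Full rewrite: stream every row through `transform`, then append `extra`.

    Returning None from `transform` drops the row (a delete).
    """
    read_version = manifest.latest

    def stream() -> Iterator[Row]:
        for row in manifest.scan(read_version):
            out = transform(row)
            if out is not None:
                yield out
        yield from extra

    return manifest.commit_overwrite(read_version, stream())
```

An insert passes the new rows as `extra` with an identity `transform`; a delete passes a `transform` that returns `None` for the rows to remove. Either way the whole manifest is rewritten, which is the trade the PR makes to keep reads on a single compact fragment.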

Replace merge-insert/delete manifest mutations with always copy-on-write
full rewrites. Each mutation scans the latest __manifest dataset, streams
transformed rows into a replacement data file, and commits a new version
with replacement scalar indices built inline.

- Migrate metadata column from Utf8 to Lance JSON (LargeBinary)
- Remove base_objects column and LabelList index
- Build BTree (object_id), Bitmap (object_type), and FTS (metadata)
  indices during each streaming rewrite
- Add overwrite-with-replacement-indices commit support in Lance
- Handle concurrency via strict overwrite with full-rewrite retry
- Backward compatible: old schema datasets (Utf8 metadata, base_objects)
  are read correctly and migrated on first write
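
The migrate-on-first-write behaviour can be pictured as a per-row transform applied during the rewrite. The sketch below is illustrative only: `migrate_row` and `read_metadata` are hypothetical helpers, and UTF-8 encoded JSON bytes stand in for Lance's JSON encoding, which this sketch does not reproduce.

```python
import json
from typing import Any, Optional


def migrate_row(row: dict) -> dict:
    """Convert an old-schema row to the new layout (idempotent).

    Old schema: metadata is a str (Utf8) and a base_objects column exists.
    New schema: metadata is binary JSON and base_objects is gone.
    """
    out = {k: v for k, v in row.items() if k != "base_objects"}
    meta = out.get("metadata")
    if isinstance(meta, str):
        # Parse first so invalid JSON fails loudly instead of being copied over.
        out["metadata"] = json.dumps(json.loads(meta), sort_keys=True).encode("utf-8")
    return out


def read_metadata(row: dict) -> Optional[Any]:
    """Read metadata from either schema, so old datasets stay readable."""
    meta = row.get("metadata")
    if meta is None:
        return None
    if isinstance(meta, bytes):
        meta = meta.decode("utf-8")
    return json.loads(meta)
```

Because `migrate_row` leaves already-migrated rows unchanged, it can run on every rewrite without tracking whether a dataset has been migrated.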
@github-actions (bot) added the enhancement and java labels on May 15, 2026
@jackye1995 changed the title from "feat: copy-on-write directory manifest rewrites" to "feat(dir-catalog): copy-on-write directory manifest rewrites" on May 15, 2026
The benchmark binary measures read (list_namespaces, list_tables, describe_table) and
write (create_namespace, create_table) operations at configurable
concurrency levels. It supports --variant and --inline-optimization flags
to compare the baseline merge-insert and copy-on-write implementations.

Multi-process coordinator/worker architecture: the coordinator spawns N
child processes, each with an independent namespace instance (no shared
cache). Supports S3 root paths, cold-read (fresh namespace per op),
warm-read (cached), and write operations. A separate seed mode populates
manifests with a configurable entry count.
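
For readers interpreting the benchmark output, a latency/throughput summary of this kind can be computed roughly as follows. This is a sketch, not the benchmark binary's code: `percentile` and `summarize` are hypothetical names, the nearest-rank percentile method is an assumption, and so is defining throughput as completed operations divided by wall-clock time.

```python
from typing import Dict, List, Sequence


def percentile(sorted_ms: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100 over sorted samples."""
    if not sorted_ms:
        raise ValueError("no samples")
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (q * len(sorted_ms) + 99) // 100)
    return sorted_ms[rank - 1]


def summarize(latencies_ms: List[float], wall_clock_s: float) -> Dict[str, float]:
    """Reduce per-operation latencies from all workers to p50/p90/p99 and ops/s."""
    s = sorted(latencies_ms)
    return {
        "p50_ms": percentile(s, 50),
        "p90_ms": percentile(s, 90),
        "p99_ms": percentile(s, 99),
        "ops_per_s": len(s) / wall_clock_s,
    }
```

With multiple worker processes, the coordinator would concatenate each worker's latency list before calling `summarize`, and measure `wall_clock_s` across the whole run rather than per worker.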
@jackye1995
Contributor Author

Performance Benchmark: S3 Multi-Process (1K manifest entries)

Setup: c7i.48xlarge EC2, S3 Standard us-east-1, 1000 manifest entries (333 namespaces + 667 tables), multi-process concurrency (each worker is a separate OS process with independent namespace/S3 connection — no shared cache), 200 operations per concurrency level.

Three variants tested:

  • baseline_opt_on: main branch with inline optimization enabled (compaction + index merge after every 8 fragments)
  • baseline_opt_off: main branch with inline optimization disabled
  • copy_on_write: this PR — always copy-on-write full rewrite with streaming indices

Write Performance

| Concurrency | baseline_opt_on p50 (tput) | baseline_opt_off p50 (tput) | copy_on_write p50 (tput) |
|---|---|---|---|
| 1 | 540ms (1.3 ops/s) | 1,646ms (0.6 ops/s) | 421ms (2.3 ops/s) |
| 5 | 677ms (1.5 ops/s) | 2,410ms (1.1 ops/s) | 408ms (2.4 ops/s) |
| 10 | 790ms (1.2 ops/s) | 2,732ms (1.2 ops/s) | 409ms (2.4 ops/s) |
| 20 | 1,587ms (1.3 ops/s) | 3,050ms (0.7 ops/s) | 561ms (2.5 ops/s) |

Tail latency at c=1: baseline_opt_on p90=1,779ms / p99=2,692ms vs copy_on_write p90=512ms / p99=657ms — 3.5x better p90, 4.1x better p99 (no periodic compaction spikes).
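
The tail-latency ratios quoted above follow directly from the c=1 percentiles. A quick check, where `speedup` is just an illustrative helper:

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times lower the candidate latency is than the baseline."""
    return baseline_ms / candidate_ms


p90 = speedup(1779, 512)  # baseline_opt_on p90 vs copy_on_write p90
p99 = speedup(2692, 657)  # baseline_opt_on p99 vs copy_on_write p99
print(f"p90 {p90:.1f}x, p99 {p99:.1f}x")  # prints "p90 3.5x, p99 4.1x"
```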

Warm Read Performance (cached metadata, S3 data pages)

| Concurrency | baseline_opt_on p50 (tput) | baseline_opt_off p50 (tput) | copy_on_write p50 (tput) |
|---|---|---|---|
| 1 | 119ms (7.5 ops/s) | 937ms (1.0 ops/s) | 95ms (9.1 ops/s) |
| 10 | 131ms (34.5 ops/s) | 910ms (6.8 ops/s) | 107ms (52.2 ops/s) |
| 20 | 122ms (41.1 ops/s) | 928ms (10.3 ops/s) | 110ms (52.1 ops/s) |
| 100 | 194ms (23.0 ops/s) | 2,885ms (1.4 ops/s) | 122ms (56.1 ops/s) |

Cold Read Performance (fresh namespace load per operation)

| Concurrency | baseline_opt_on p50 (tput) | baseline_opt_off p50 (tput) | copy_on_write p50 (tput) |
|---|---|---|---|
| 1 | 366ms (2.6 ops/s) | 871ms (1.1 ops/s) | 326ms (2.7 ops/s) |
| 10 | 360ms (25.3 ops/s) | 864ms (10.5 ops/s) | 317ms (29.0 ops/s) |
| 20 | 332ms (50.8 ops/s) | 887ms (16.0 ops/s) | 306ms (58.6 ops/s) |

Key Takeaways

  • Write throughput: CoW sustains ~2.3-2.5 ops/s vs baseline's ~1.2-1.5 ops/s — 1.6-2x improvement
  • Write tail latency: CoW eliminates compaction spikes — 3.5-4.1x lower p90/p99
  • Warm read at high concurrency: CoW scales to 2.4x throughput at c=100 (single-fragment reads vs multi-fragment scans)
  • Cold read: ~11-15% improvement — S3 manifest load dominates, modest structural benefit
  • baseline_opt_off is 2-10x slower than opt_on on reads, confirming inline optimization is essential for the old design but comes with write penalty

@jackye1995
Contributor Author

Performance Benchmark: S3 Standard + S3 Express (1K manifest, multi-process)

Setup: c7i.48xlarge EC2, 1000 manifest entries (333 namespaces + 667 tables), multi-process concurrency (each worker is a separate OS process with independent namespace/S3 connection), 200 operations per concurrency level.

Four variants tested:

  • baseline_opt_on: main branch, inline optimization enabled (compaction + index merge every 8 fragments)
  • baseline_opt_off: main branch, inline optimization disabled
  • copy_on_write (shown as cow_with_index in the tables below): this PR, CoW rewrite with inline index building
  • cow_no_index: this PR, CoW rewrite with inline_optimization_enabled=false (indices created offline)

Write Performance — S3 Standard

| Concurrency (c) | baseline_opt_on | cow_with_index | cow_no_index |
|---|---|---|---|
| 1 | 540ms (1.3/s) | 421ms (2.3/s) | 300ms (3.2/s) |
| 2 | 619ms (1.2/s) | 406ms (2.3/s) | 308ms (3.1/s) |
| 5 | 677ms (1.5/s) | 408ms (2.4/s) | 275ms (3.5/s) |
| 10 | 790ms (1.2/s) | 409ms (2.4/s) | 309ms (3.7/s) |
| 20 | 1,587ms (1.3/s) | 561ms (2.5/s) | 721ms (4.0/s) |
| 100 | 6,751ms (1.3/s) | 9,131ms (2.4/s) | 4,879ms (4.4/s) |
| 200 | 15,455ms (0.7/s) | 15,757ms (1.9/s) | 5,273ms (4.5/s) |

Write Performance — S3 Express (same AZ)

| Concurrency (c) | baseline_opt_on | cow_with_index | cow_no_index |
|---|---|---|---|
| 1 | 535ms (1.6/s) | 273ms (3.7/s) | 247ms (4.0/s) |
| 2 | 556ms (1.5/s) | 280ms (3.5/s) | 257ms (3.9/s) |
| 5 | 894ms (2.2/s) | 721ms (4.1/s) | 494ms (4.8/s) |
| 10 | 1,249ms (2.5/s) | 762ms (4.1/s) | 654ms (5.0/s) |
| 20 | 1,892ms (1.6/s) | 1,012ms (5.0/s) | 905ms (5.7/s) |
| 100 | 5,001ms (1.5/s) | 2,252ms (3.2/s) | 2,259ms (3.3/s) |
| 200 | 18,156ms (0.5/s) | 6,089ms (2.7/s) | 6,285ms (4.7/s) |

Cold Read Performance — S3 Standard

| Concurrency (c) | baseline_opt_on | baseline_opt_off | cow_with_index | cow_no_index |
|---|---|---|---|---|
| 1 | 366ms (2.6/s) | 871ms (1.1/s) | 326ms (2.7/s) | 274ms (3.6/s) |
| 10 | 360ms (25.3/s) | 864ms (10.5/s) | 317ms (29.0/s) | 265ms (33.0/s) |
| 20 | 332ms (50.8/s) | 887ms (16.0/s) | 306ms (58.6/s) | 270ms (65.3/s) |

Warm Read Performance — S3 Standard

| Concurrency (c) | baseline_opt_on | baseline_opt_off | cow_with_index | cow_no_index |
|---|---|---|---|---|
| 1 | 119ms (7.5/s) | 937ms (1.0/s) | 95ms (9.1/s) | 117ms (7.6/s) |
| 10 | 131ms (34.5/s) | 910ms (6.8/s) | 107ms (52.2/s) | 99ms (23.8/s) |
| 50 | 125ms (96.7/s) | 944ms (7.8/s) | 116ms (57.2/s) | 103ms (86.8/s) |

Warm Read Performance — S3 Express

| Concurrency (c) | baseline_opt_on | baseline_opt_off | cow_with_index | cow_no_index |
|---|---|---|---|---|
| 1 | 125ms (7.5/s) | 298ms (3.1/s) | 109ms (8.6/s) | 109ms (8.6/s) |
| 10 | 122ms (51.3/s) | 324ms (12.9/s) | 107ms (58.3/s) | 107ms (57.9/s) |
| 50 | 125ms (96.7/s) | 379ms (7.8/s) | 114ms (106.3/s) | 106ms (107.5/s) |

Key Findings

Write throughput (vs baseline_opt_on):

  • CoW with index: 1.8-2.5x throughput on S3 Standard, 2.3-5.4x on S3 Express
  • CoW no index: 2.5-6.4x throughput on S3 Standard, 2.5-9.4x on S3 Express
  • Peak single-writer throughput: 3.2/s (S3 Standard no-index), 4.0/s (S3 Express no-index)

Write latency:

  • CoW with index eliminates compaction spikes: 3.5x better p90, 4.1x better p99 vs baseline
  • CoW no index saves additional 23-29% on S3 Standard by skipping 3 index file writes per commit

Read performance:

  • Cold reads: CoW is 11-16% faster (single compact fragment vs multi-fragment); no-index is even faster (fewer files to load)
  • Warm reads: CoW with index is 13-20% faster at low concurrency; warm describe_table benefits from BTree index for point lookups
  • baseline_opt_off is 2-10x slower than all other variants on reads

Index trade-off:

  • Skipping indices saves ~25% write latency on S3 Standard (~9% on S3 Express where per-request latency is lower)
  • Cold reads improve without index (fewer files)
  • Warm list operations are unaffected at 1K scale
  • Warm describe_table (point lookup) degrades 6-30% without BTree index — would worsen at larger scales

S3 Express vs S3 Standard (CoW with index):

  • Write: 35% faster (273ms vs 421ms at c=1) — lower per-request latency compounds across multiple S3 PUTs
  • Cold read: 13% faster (285ms vs 326ms)
  • Warm read: similar (~109ms vs ~95ms)
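
The "N% faster" figures in this comparison are relative latency reductions. As a worked check using the numbers quoted in the bullets above (`percent_faster` is an illustrative helper, not part of the benchmark):

```python
def percent_faster(slower_ms: float, faster_ms: float) -> float:
    """Latency reduction as a percentage of the slower measurement."""
    return 100.0 * (slower_ms - faster_ms) / slower_ms


write = percent_faster(421, 273)  # S3 Standard vs S3 Express, write p50 at c=1
cold = percent_faster(326, 285)   # S3 Standard vs S3 Express, cold read
print(f"write {write:.0f}%, cold read {cold:.0f}%")  # prints "write 35%, cold read 13%"
```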

…nline cleanup

- Remove FTS background channel asymmetry: accumulate metadata in
  ManifestIndexAccumulator during streaming, build all 3 indices
  (BTree, Bitmap, FTS) uniformly after the stream completes.
- Replace CommitBuilder with direct manifest commit: expose
  write_manifest_file and ManifestWriteConfig as public API, add
  Dataset::commit_handler() accessor, construct Manifest via
  new_from_previous and commit directly.
- Remove inline cleanup on retry: drop cleanup_uncommitted_overwrite_files
  and cleanup_uncommitted_index_uuids, rely on offline GC for orphaned files.
- Add index verification tests: test_manifest_indices_are_complete_and_versioned
  checks all 3 indices are present, versioned, and have fragment bitmaps;
  test_manifest_reads_use_indexed_scans verifies explain plans show
  ScalarIndexQuery for BTree/Bitmap filters and MatchQuery for FTS.
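
The accumulate-then-build pattern from this commit can be sketched as below. Only the name `ManifestIndexAccumulator` comes from the commit message; the methods, the data structures (a sorted list for BTree, sets of row ids for Bitmap, a token map for FTS), and the tokenizer are simplified stand-ins chosen for illustration.

```python
import re
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple


class ManifestIndexAccumulator:
    """Collects indexed columns while rows stream past; builds indices at the end."""

    def __init__(self) -> None:
        self.ids: List[str] = []
        self.types: List[str] = []
        self.metadata: List[str] = []

    def observe(self, row: dict) -> None:
        self.ids.append(row["object_id"])
        self.types.append(row["object_type"])
        self.metadata.append(row.get("metadata") or "")

    def finish(self):
        # BTree stand-in: (object_id, row_id) pairs in sorted order.
        btree: List[Tuple[str, int]] = sorted(
            (oid, rid) for rid, oid in enumerate(self.ids)
        )
        # Bitmap stand-in: object_type -> set of row ids.
        bitmap: Dict[str, Set[int]] = defaultdict(set)
        for rid, typ in enumerate(self.types):
            bitmap[typ].add(rid)
        # FTS stand-in: lowercase token -> set of row ids.
        fts: Dict[str, Set[int]] = defaultdict(set)
        for rid, text in enumerate(self.metadata):
            for token in re.findall(r"\w+", text.lower()):
                fts[token].add(rid)
        return btree, dict(bitmap), dict(fts)


def stream_with_indices(
    rows: Iterable[dict], acc: ManifestIndexAccumulator
) -> Iterator[dict]:
    """Pass rows through unchanged to the writer while feeding the accumulator."""
    for row in rows:
        acc.observe(row)
        yield row


def point_lookup(btree: List[Tuple[str, int]], object_id: str) -> Optional[int]:
    """Binary search for an exact object_id; returns its row id or None."""
    i = bisect_left(btree, (object_id,))
    if i < len(btree) and btree[i][0] == object_id:
        return btree[i][1]
    return None
```

Building all three indices in `finish()` keeps the streaming path uniform: no index needs its own background channel, at the cost of holding the indexed columns in memory until the stream ends.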
seed-large writes a __manifest Lance table directly with N rows,
bypassing the namespace API. It then triggers one CoW rewrite to build
indices. Adds an --initial-entries flag to run mode for result tracking.

The mutation lock already serializes local writes. Use get_cached()
on the first attempt (no I/O) and get_refreshed() only on retry after a
conflict. Make checkout_version on success non-fatal so the return
path doesn't block on I/O if the cache promotion fails.
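
The cached-first retry policy described in this commit might look like the following. This is a sketch under assumptions: `get_cached` and `get_refreshed` are named in the commit message, but `SnapshotSource`, `commit_with_retry`, `ConflictError`, the `promote` callback, and the attempt limit are hypothetical.

```python
from typing import Callable, Optional


class ConflictError(Exception):
    """The strict overwrite lost a race with another writer (hypothetical name)."""


class SnapshotSource:
    """Hypothetical cache in front of a version store; counts refresh I/O."""

    def __init__(self, store: dict) -> None:
        self.store = store
        self.cached_version = store["version"]
        self.refreshes = 0

    def get_cached(self) -> int:
        return self.cached_version  # no I/O

    def get_refreshed(self) -> int:
        self.refreshes += 1  # models a round trip to object storage
        self.cached_version = self.store["version"]
        return self.cached_version


def commit_with_retry(
    source: SnapshotSource,
    try_commit: Callable[[int], int],
    promote: Optional[Callable[[int], None]] = None,
    max_attempts: int = 3,
) -> int:
    for attempt in range(max_attempts):
        # First attempt trusts the cache; later attempts pay for a refresh.
        version = source.get_cached() if attempt == 0 else source.get_refreshed()
        try:
            new_version = try_commit(version)
        except ConflictError:
            continue  # redo the full rewrite against the refreshed version
        if promote is not None:
            try:
                promote(new_version)
            except Exception:
                pass  # cache promotion is best-effort; the commit already succeeded
        return new_version
    raise ConflictError(f"gave up after {max_attempts} attempts")
```

In the uncontended case this path performs no refresh I/O before the commit. Only a writer that loses a race pays for `get_refreshed()`.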
S3X no-index: 6.8/s create-ns, 6.2/s declare-table at 1K entries.
S3X with-index: 5.7/s create-ns, 5.1/s declare-table at 1K entries.
Indexed point lookup flat from 100K to 1M (9ms warm on S3X).
CoW full rewrite + 3 indices at 1M: 2.0s S3X, 3.2s S3.