Skip to content

Supervoxel splitting with base+fork support and locks#534

Open
akhileshh wants to merge 74 commits into
pcgv3from
akhilesh/sv-splitting-locks
Open

Supervoxel splitting with base+fork support and locks#534
akhileshh wants to merge 74 commits into
pcgv3from
akhilesh/sv-splitting-locks

Conversation

@akhileshh
Copy link
Copy Markdown
Contributor

@akhileshh akhileshh commented Apr 23, 2026

Summary

Supervoxel splits in OCDBT-backed segmentations are now atomic under worker failure, with an operator tool for recovering stuck ops.

Locks

Two indefinite locks, both holding their cells on exception:

  • Root lock wraps multicut → split detection → voxel-level split → commit.
  • L2 chunk lock nests inside, scoped to rewritten chunks + 1-chunk margin for the edge-routing shell.

On unclean exit, both hold and the L2 scope is written to the op-log row.

Stuck-op signal

Non-empty scope field on the op-log row. Covers both CREATED (worker crash) and FAILED (caught exception) paths; status alone missed FAILED. Clean exit clears it.

Cleanup-then-replay

Concurrent ops on non-overlapping chunks advance OCDBT during the outage, so recovery can't use one pinned view.

  • Cleanup: read pre-op voxels from an OCDBT handle pinned at the op's timestamp, write them back unpinned at the same chunk keys. Chunks outside the scope are untouched.
  • Replay: reruns via the privileged-repair path against latest state. Indefinite-lock exit releases cells by value-match on the original op ID.

OCDBT time-travel

Tensorstore version field in the spec (integer generation or ISO-8601 Z timestamp, interpreted as commit_time ≤ T). Threaded as pinned_at. Pinned handles are read-only.

Reader contract

dataset_info now publishes the full kvstore spec (kvstack layers + OCDBT config + data prefixes) instead of path fragments, so readers pass it verbatim to tensorstore. The multi-scale open asserts fork presence, failing with a clear message instead of a tensorstore internal.

Drop the inline propagate_to_coarser_scales call from write_seg; coarser
mips are now the async downsample worker's responsibility. write_seg is
back to a single base-scale tensorstore write, so SV splits no longer
block on the full pyramid update. TestWriteSeg updated to assert the
coarser scales stay zero after write_seg (propagation tested separately
via TestPropagateToCoarserScales).
split_supervoxel now returns its base-resolution bbox. split_with_sv_splits
collects one per call and attaches the list to Result.seg_bbox (new
optional field). publish_edit includes the list in the payload and sets
downsample="true" so the worker only runs on edits that touched base seg.

List kept unmerged — lets the worker skip tiles outside the actual
change region.
workers/downsample_worker.py consumes edits-exchange messages flagged
downsample="true" and writes each non-base mip within the SV-split
bbox. graph/downsample.py splits the region into pyramid_blocks (sized
so no two blocks share a storage chunk at any mip), then either
tinybrain'd in one call (fast path, typical small edits) or per-mip
(fallback when base read exceeds memory budget). Write filtering keeps
OCDBT delta proportional to the actual change.

DownsampleBlockLock serializes overlapping jobs via kvdbclient's
row-key lock API; 26-byte hash-prefixed keys avoid tablet hot-spots.
Depends on kvdbclient lock_by_row_key / unlock_by_row_key /
renew_lock_by_row_key landing first.
Used by the async downsample worker as the mip-pyramid kernel.
Closes a concurrency gap where two SV splits on overlapping L2 chunks
but distinct roots can't be serialized by root locks — they acquire
disjoint root-lock sets and race on seg state. L2ChunkLock serializes
them via the kvdbclient row-key lock primitive.

Row key = 2-byte blake2b hash + 8-byte uint64 chunk_id (10 bytes).
Hash prefix keeps spatially-clustered L2 chunks from hot-spotting a
single bigtable tablet under concurrent load.

Primitive only — callers land separately.

RowKeyLockRegistry helper moved to tests/helpers.py so L2ChunkLock and
DownsampleBlockLock tests share it instead of duplicating.
Callers now get Cut(atomic_edges) | PreviewCut(ccs, illegal_split) |
SvSplitRequired(sv_remapping) instead of unwrapping-by-convention or
catching SupervoxelSplitRequiredError. The exception still unwinds
inside LocalMincutGraph — cheapest way to bail out of deep path code —
but it's caught once at the run_multicut boundary and never escapes,
so callers don't use raise/catch for control flow.
MulticutOperation._apply dispatches on the tagged multicut result and,
when an SV split is needed, calls the new edits_sv.split_supervoxels
under its surrounding RootLock, refreshes source/sink SV IDs from seg,
and retries multicut against the post-split graph. Root lock spans the
whole critical section; L2ChunkLock held only around the split loop.

This closes two races that existed when split_with_sv_splits handled
the flow outside any lock: same-root (root lock now never released
between multicut and commit) and cross-root (L2ChunkLock serializes
overlapping split regions). split_with_sv_splits is deleted;
handle_split calls cg.remove_edges directly.
Pre-compute each rep's bbox from the chunk coords of its CC members in
sv_remapping (no coord-padding, no resolution-axis assumption).
split_supervoxels builds the union lock set across reps — sparse chunks
plus one L2-chunk margin for update_edges's 1-voxel overlap read —
acquires once, then loops per-rep splits.

_update_chunks surfaces the change_chunks that actually got new SV IDs;
write_seg_chunks fires one tensorstore future per change-chunk and
awaits together, so only chunks with real label changes hit OCDBT.
Gap chunks between CC pieces and neighbor chunks read for the overlap
never get rewritten, keeping the delta proportional to the edit.

split_supervoxels also threads back the fresh source/sink SV IDs from
the in-memory new_seg block (same bytes that just landed on storage),
so the retry multicut sees current IDs without an extra seg read.

Drops _get_whole_sv (dead since the sv_remapping switch).

Adds a high-level architecture doc covering the end-to-end flow,
concurrency design, and durable invariants.
Enable opening a CG's OCDBT at a prior commit via the driver's `version`
spec field (int generation or ISO-8601 commit_time upper bound).
Groundwork for operator-driven replay of failed SV splits, which needs
clean pre-op reads against append-only storage.

OCDBT stamps commits from absl::Now() with no caller-override hook, so
pins will use OperationTimeStamp captured under the L2 chunk lock rather
than aligning OCDBT commit times to operation time.
@codecov

This comment was marked as spam.

split_supervoxels is now a pure planner returning a SplitResult
(seg_bboxes, source_ids_fresh, sink_ids_fresh, seg_writes,
bigtable_rows). No lock acquisition, no writes. The caller
(MulticutOperation._apply) holds the L2 chunk locks and fires
the consolidated persist — OCDBT chunks + bigtable rows — inside
an inner lock scope.

seg_writes is a flat list of (voxel_slices, data) pairs across all
reps so write_seg_chunks fires every chunk write as one parallel
tensorstore batch. Removes the per-rep serialization in the old
write_seg_chunks loop.

get_seg_source_and_destination_ocdbt gains a pinned_at kwarg, forwarded
to build_cg_ocdbt_spec — used later by the recovery path.
New L2 chunk counterpart to IndefiniteRootLock, keyed by chunk row.
L2ChunkLock now acquires via lock_by_row_key_with_indefinite so a
temporal acquire sees a crashed op's indefinite cell and refuses.
IndefiniteL2ChunkLock records its chunk scope on the op-log row's
L2ChunkLockScope column at __enter__ and clears it on clean exit,
giving recovery a durable scope without a bigtable-wide scan.

Both indefinite locks (root and L2 chunk) now short-circuit __exit__
when an exception is propagating: cells stay held, scope stays set.
Partial writes may exist after an exception; leaving the cells forces
subsequent ops to refuse at lock-acquire and the operator to run
recovery explicitly.

privileged_mode=True on either lock is the operator recovery escape
hatch: skips acquire, pre-populates acquired_keys so __exit__'s
value-matched release deletes the crashed op's cells.

RowKeyLockRegistry (test helper) gains the three new kvdbclient
primitives.
Operator recovery for SV-split ops that crashed mid-write. A worker
death inside IndefiniteL2ChunkLock leaves per-chunk indefinite cells
set and records the chunk scope on the op-log row. Recovery reverts
partial OCDBT writes using a version-pinned read of pre-op voxels,
then replays the op normally.

list_stuck scans OperationLogs for ops still at CREATED past a
min-age threshold. replay(cg, op_id) runs cleanup_partial_writes
followed by repair.edits.repair_operation(..., unlock=True);
IndefiniteL2ChunkLock's privileged-mode __exit__ deletes the crashed
op's pre-existing cells after the replay's writes land.

Architecture-level operator guide at docs/sv_splitting_recovery.md,
linked from docs/sv_splitting.md's Concurrency section.
Pass the op's timestamp from execute → _apply → split_supervoxels →
split_supervoxel → copy_parents_and_add_lineage / add_new_edges so
every new-SV bigtable mutation lands at the op's logical write time.
Gets atomic visibility under a parent_ts filter and makes replay's
override_ts actually control what time-filtered readers see after a
repair_operation. Parent-copy and Child-list writes deliberately keep
the old cell's timestamp so pre-op readers still see the old
hierarchy.

Replace the seven-tuple in/out soup in split_supervoxels with named
dataclasses: SvSplitTask (plan_sv_splits → split_supervoxel input)
and SvSplitOutcome (split_supervoxel's per-task output). Drop the
two unused return fields on split_supervoxel.
list_stuck filter switches from Status==CREATED to "L2ChunkLockScope
set and Status != SUCCESS past min_age". The authoritative signal for
stuck-ness is "scope recorded, not cleared" — worker crash (Status
stays CREATED) and Python exception during persist (Status=EXCEPTION
after Fix 1) both fall under it. Ops without scope aren't blocking
other ops and are outside stuck_ops' concern.

replay now verifies each chunk in the recorded scope actually has
Concurrency.IndefiniteLock held by this op_id before running cleanup
or repair. If cells are missing or held by a different op, raises
with a clear error. Protects against double-replay (first run already
released cells) and out-of-band clearing (manual bigtable edit,
buggy release path) — both would have cleanup_partial_writes revert
chunks that aren't ours.
fork_base_manifest is now an explicit step — invoked from the ingest
CLI's --ocdbt path or the seg_ocdbt notebook — rather than being
auto-triggered on first ws_ocdbt_scales access. ws_ocdbt_scales
asserts fork_exists() so a missing fork fails with a clear pointer
instead of a tensorstore mismatch/not-found error.
Stuck-op detection keys off L2ChunkLockScope being populated, not
Status=CREATED — that filter also covers caught-exception paths
(Status=FAILED) where cells are held but the row isn't CREATED.

Recovery now verifies each chunk's IndefiniteLock is actually held
by the op before cleaning up, so a stale scope can't have us revert
chunks another op owns. Reflect both in docs and the list CLI help.
Drop "mode-downsample" from the SV-split diagram — tinybrain owns
the algorithm; the doc shouldn't pin it.
@akhileshh akhileshh changed the title sv splitting locks Supervoxel splitting with locks Apr 24, 2026
@akhileshh akhileshh changed the title Supervoxel splitting with locks Supervoxel splitting with base+fork support and locks Apr 24, 2026
@akhileshh akhileshh requested review from fcollman and sdorkenw April 24, 2026 02:29
`_rep_bbox` enveloped every piece of the cross-chunk-connected rep —
for physical SVs split into many pieces across chunks, the bbox grew
far wider than the cut surface needs. Replace with `_coords_bbox`:
envelope of the user-placed source/sink coords plus a one-chunk
margin (matches the existing L2 lock margin and 1-voxel shell).

After the seg read, `cut_supervoxels` is intersected with the IDs
present in seg, so the "whole sv" set names only the rep pieces the
bbox actually touches. Pieces of the rep outside the bbox keep their
existing IDs — their cross-chunk edges to in-bbox split fragments
are routed via the 1-voxel shell, edges between two unsplit pieces
don't change.

Adds `TestCoordsBbox` covering envelope+margin, volume-bound clipping,
and that `plan_sv_splits` returns a tight bbox regardless of how
distant the rep's other pieces sit.
handle_supervoxel_id_lookup and id_helpers.get_atomic_ids_from_coords
both short-circuit to lookup_svs_from_seg whenever ocdbt_seg is true,
regardless of node-id layer. 2D slice clicks send L1 IDs from a view
that may be stale after an SV split; 3D mesh clicks send a root and
no L1 at all. Either way the current SV is what matters, so we read
seg at the click coords and let downstream same/different-root
checks surface any staleness with the sv_id->root diagnostic.

Also vectorize lookup_svs_from_seg's per-coord indexing into a single
advanced-index op.
`_schema_from_src` was passing both `domain` and `shape` to ts.open
when cloning the source schema to the destination. For sources with
a non-zero `voxel_offset`, `domain` carries absolute bounds (e.g.
[17756, 62244)) while `shape` implies an origin of 0, and tensorstore
refuses to merge them — base creation fails on any precomputed source
that doesn't start at the origin.

`domain` already encodes both extent and offset, so passing only it
is sufficient and avoids the conflict.
cloudbuild now uses BuildKit registry cache (`:buildcache`) so
unchanged stages reuse the prior build's layer artifacts and
already-warm nodes skip re-downloading them on pull.

The fresh-ingest CLI no longer infers `ocdbt_populate_base` from
`base_exists` — manifest presence didn't reflect whether chunks
were actually copied. Replaced with an explicit `--populate-base`
flag; operator sets it on first ingest and omits on subsequent
runs.
In-repo notes on every OCDBT spec field, config sub-field,
ocdbt_coordinator resource field, DistributedCoordinatorServer
constructor field, and runtime defaults — verified empirically against
the venv's tensorstore binary by probing with intentional-bad-value
plus spec round-trips.

Captures the distributed-mode constraints: atomic transactions
forbidden, cooperator-forwarded RPC bounded at gRPC's 4 MiB default
with no Python knob to raise it, leases held per btree node so
disjoint user-key writes still cross-forward. Catalogs the levers that
do exist (max_decoded_node_bytes, max_inline_value_bytes,
lease_duration) and ones that look promising but aren't fields.
@akhileshh akhileshh force-pushed the akhilesh/sv-splitting-locks branch from 9b86434 to 5873aa8 Compare May 21, 2026 21:30
akhileshh added 20 commits May 21, 2026 21:36
Reading tensorstore's distributed btree_writer.cc StagePending: values
≤ max_inline_value_bytes are carried inline in the encoded mutation, so
the cooperator's per-leaf WriteRequest packs all of them together. With
the previous 1 MiB threshold, every compressed_segmentation chunk that
landed below 1 MiB stayed inline and contributed its full bytes to the
gRPC forward — which blows past tensorstore's hardcoded 4 MiB gRPC
max-receive whenever a few inline chunks land on the same btree leaf.

Dropping to 4 KiB pushes every chunk value through WriteData into a d/
file and replaces it in the mutation with an IndirectDataReference, so
forwarded RPCs carry only small refs regardless of value size. Small
metadata (info JSON, populate-marker files) still stays inline.

Tradeoff: the old comment cited "7× GCS bloat" at the 100-byte default
— that ratio came from per-chunk independent zstd framing dominating
tiny payloads, not chunk-sized ones. At 4 KiB+ payloads the per-chunk
zstd overhead is in the single-digit-percent range. Reference doc
updated to credit max_inline_value_bytes as the actual RPC-size lever
and correct the cooperator-batching section.
tensorstore raises ValueError with an absl status-code prefix on
transport-level failures. A DNS hiccup or a GCS 5xx in is_chunk_populated
or mark_chunk_populated was killing the whole populate task. Wrap the
four marker / populate-meta helpers in a tenacity retry that matches
UNAVAILABLE / DEADLINE_EXCEEDED / ABORTED / INTERNAL prefixes only —
NOT_FOUND, INVALID_ARGUMENT, RESOURCE_EXHAUSTED still propagate.
…ask-body mode flags

`ingest layer N` now owns the whole OCDBT lifecycle when N matches
ocdbt_populate_layer: idempotently runs setup_base (base + fork +
config reconcile), starts the coordinator, queues tasks. Moves
setup_base out of `ingest graph` so there's a single command that
manages OCDBT. Re-pickles the IngestionManager so workers see the
resolved config.

Adds three flags to `ingest layer`:
  --queue-only/-q  : skip the coordinator (one is running elsewhere).
  --ocdbt-only/-o  : workers run only OCDBT populate, skipping
    add_parent_chunk. For re-populating a freshly created/wiped OCDBT
    base against an already-built bigtable graph.
  --ingest-only/-i : workers run only add_parent_chunk, skipping
    OCDBT populate. For rebuilding bigtable against an existing
    OCDBT base.

Refactors create_parent_chunk to a single function with a `mode`
parameter bound at queue time via functools.partial — no duplicated
OCDBT-populate block. _post_task_completion runs unconditionally so
layer-progress tracking in redis stays consistent across modes.

Adds IngestionManager.is_ocdbt_populate_layer(layer) so the compound
guard isn't repeated at every call site.
`setup_base` calls `fork_base_manifest` once at graph creation, before
populate has committed most of its writes to base. Any base commit
after fork-creation doesn't propagate into the fork's manifest, so
kvstack-routed reads through the fork miss every chunk written to base
afterwards — meshing (and any other OCDBT seg read) returns zeros.

Adds `ensure_fork_synced(ws, graph_id)` in graph/ocdbt/main.py and
calls it from `ChunkedGraphMeta.ws_ocdbt_scales` lazy init. The helper
re-snapshots the fork manifest from base when the fork's data prefix
is empty (no edits) and the manifests differ. Edit-free is the safety
guard: an SV-split writes value/btree files under <graph_id>_d/ before
updating the fork manifest, so a non-empty prefix means an edit either
landed or is in flight and must be preserved. The check is race-free
vs concurrent populate (which writes only to base).

Re-enables the OCDBT path in `get_local_segmentation` that was
temporarily disabled while the kvstack reads returned zeros.
The unsharded-dynamic-mesh subdir name has been a load-bearing
constant ("dynamic") shared implicitly between mesh_worker (reads
`custom_data["mesh"].get("dynamic_mesh_dir", "dynamic")`) and
neuroglancer (hardcodes the same literal in graphene/backend.ts).
Surface it through:

  - `handle_info` (the Flask `/info` endpoint NG fetches) — so
    `info["dynamic_mesh_dir"]` is exposed alongside `mesh_dir`.
  - `get_json_info` (used internally by meshing modules to build
    CloudVolume `info=` kwargs) — for consistency.

Default preserves existing datasets. Requires a matching NG-side
patch (graphene backend) to read this field instead of the
hardcoded literal.
After a bigtable row-copy clones an existing graph into a new
graph_id, two graph_id-bearing pieces of runtime state must be
reconciled before the new table is usable:

  1. `custom_data["mesh"]["dynamic_mesh_dir"]` carries forward the
     source table's value, which means the new table's edited mesh
     fragments would write into the source's dynamic mesh dir and
     overwrite each other at the same fragment-id keys. Extends the
     existing `graph_id != self.graph_id` shim in ChunkedGraph.__init__
     to also rewrite this field — only when the key is present in the
     copied custom_data — using the new graph_id directly:
        custom_data["mesh"]["dynamic_mesh_dir"] = f"dynamic_{graph_id}"
     `mesh.dir` (initial sharded meshes) is dataset-scoped and
     intentionally shared, so it's not touched.

  2. The per-CG OCDBT fork directory doesn't exist for the new
     graph_id, so `ws_ocdbt_scales` would fire `assert fork_exists`
     and any OCDBT-side read (meshing, sv-lookup, edits) would
     blow up. Replaces the assertion with `if not fork_exists:
     fork_base_manifest(...)`. Idempotent and race-safe: concurrent
     opens write identical base-manifest bytes to the same path;
     can't race with an edit because edits pre-suppose the fork.
     `ensure_fork_synced` immediately after keeps it in step with
     base.

Net effect: copy the bigtable rows, then `ChunkedGraph(graph_id=...)`
once. The shim updates meta in a single update_meta call and the
ws_ocdbt_scales lazy-init creates+syncs the fork. No operator step.
Two symptoms of the same root cause — upstream
`tiangolo/uwsgi-nginx-flask:python3.12` was rebuilt, its digest
changed, and that broke us two ways:

1) Stage-1 cache invalidated → builds went from ~1.5min to 7+min
   (full `conda env create` re-runs on every CI build because the
   FROM layer hash now changes whenever upstream rebuilds the tag).

2) The new base image carried a newer transitive Python package
   set into the conda env, leaving `importlib_metadata` in a
   partial-overlay state in `/app/venv` — `__init__.py` from one
   version on top of file tree from another, missing `_context.py`.
   Imports through messagingclient → google.cloud.pubsub_v1 →
   opentelemetry → opentelemetry.util._importlib_metadata crashed
   with `ModuleNotFoundError: No module named
   'importlib_metadata._context'`.

Pin the base image by digest so subsequent builds get the same
Python file set and the layer cache stays warm. Bump opentelemetry
(via pip-compile --upgrade) to 1.42.x, which dropped the
third-party `importlib_metadata` wrapper in favor of stdlib
`importlib.metadata` — verified inside the 1.42.1 wheel
(opentelemetry/util/_importlib_metadata.py imports
`from importlib.metadata import ...`). With opentelemetry no
longer pulling in the third-party package, no Python 3.12
consumer remains, so pip-compile correctly removed
`importlib-metadata` from the lockfile entirely.

Net dep changes:
  opentelemetry-api/sdk            1.39.1  → 1.42.1
  opentelemetry-semantic-conventions 0.60b1 → 0.63b1
  google-cloud-pubsub              2.35.0  → 2.38.0
  importlib-metadata               8.7.1   → dropped
  + routine minor bumps across pip-compiled deps

requirements.in unchanged. Bump the Dockerfile digest manually
when you want a fresher base.
- conftest bootstrap fixtures unpack the OcdbtConfig slot
- ws_cv mock assertions accept the loader's two CloudVolume constructions (info-fetch + cached-handle with info=)
- ws_ocdbt tests bypass OCDBT meta resolution and the fork-sync side-channel that hit real GCS, and relax the call-args check now that get_seg_source_and_destination_ocdbt takes the OcdbtConfig positionally
- cleanup_partial_writes patches _read_source_scales at the call site (ocdbt.main) instead of the package re-export
Opening an existing OCDBT must not embed a `config` block — tensorstore
validates every field against the on-disk manifest and raises
FAILED_PRECONDITION on any drift. Drift bricks every base that was
created with the older default for `max_inline_value_bytes`.

Only `create_base_ocdbt` keeps `config`; `build_cg_ocdbt_spec` and
`open_base_ocdbt` no longer pass it. Functions still accept the
`OcdbtConfig` arg so callers don't change; it's just not asserted into
the open-time spec.
…state warning

The function's sole job is closing the setup-base race: snapshot the
fork manifest from base before the fork's first edit. Once any edit
lands the function has nothing to do — base is immutable post-setup,
so manifests *will* diverge by design as the fork moves forward.

Short-circuit on edit-files first (lists `<graph_id>_d/`, no manifest
reads on the hot path) and remove the warning that fired on every
runtime open of every edited CG with a misleading "stale vs base"
framing.
Writes every mesh.* field a new or freshly-copied graph needs to serve
fragments — watershed-cv `info["mesh"]`, sharded mesh spec per layer,
`info["mesh_metadata"]` (uniform draco grid + dynamic dir), and the
bigtable `custom_data["mesh"]` block including `initial_ts`.

Static values come from a `mesh_config:` block in the dataset yaml,
parsed via the new `MeshConfig` dataclass (mirrors `OcdbtConfig`). No
in-code defaults — yaml must supply every static field; only the
graph-id-derived `dynamic_mesh_dir` is optional.

`initial_ts` is set once and preserved on re-runs (changing it would
reclassify node ids and silently break served manifests). When unset,
it's derived from a root id sampled near the volume center.

The CLI subcommand is `ingest mesh_meta GRAPH_ID DATASET_YAML`. It
does not gate on redis ingest progress — the operator decides when
the graph is ready.
…a graph

Every cached manifest fragment (initial + dynamic, present fragments
and DOES_NOT_EXIST markers) now expires after 3 days. Bounds the
cache against stale "X" markers that survived a remesh and against
fragments whose path/offset changed under them.

New POST /table/<table_id>/clear_manifest_cache (no node_id) wipes
every cached fragment under the graph's namespace via SCAN+DEL,
batched so a large fragment set doesn't block redis. Reuses the
existing admin permission decorator used by the per-node clear route.
profile() gains three opt-in kwargs (with_memory, with_rss, counters)
that capture tracemalloc Python heap peak, psutil RSS peak via a 50 ms
sampler thread, and per-call deltas on a caller-supplied counter dict.
Results land in a new self.blocks list of BlockMetrics; metrics_report()
prints them as a fixed-width table.

Default-off so existing profile("name") callers see no behavior change.
Sampler and tracemalloc state are torn down on context exit.
Adds pychunkedgraph.graph.dry_run.is_dry_run() (strict "1" match on
PCG_DRY_RUN) and wires it through every write site in the SV-split
edit flow:

- operation._write short-circuits via is_dry_run() instead of the
  cg.meta.READ_ONLY @Property (which stays in place for unrelated
  callers).
- A new GraphEditOperation._persist_rows centralizes BT mutation
  writes from _apply and _create_log_record paths; the three external
  cg.client.write callsites in operation.py collapse to one-liners.
- ocdbt.write_seg_chunks returns early so OCDBT delta writes are
  skipped.
- RootLock / IndefiniteRootLock / L2ChunkLock / IndefiniteL2ChunkLock
  __enter__ and __exit__ gain an early-return placed before the
  existing privileged_mode check, so dry-run wins regardless of
  replay state.

Replaces the harness-side monkey-patch design with a single global
toggle: setting PCG_DRY_RUN=1 makes any edit operation a read-only
dry run. Default-off; "true" / "0" / "" / unset all leave production
behavior unchanged.
split_profile drives an SV split end-to-end under PCG_DRY_RUN=1,
records per-stage timing + memory + IO into a HierarchicalProfiler,
and snapshots each stage's intermediate value into SplitInputs so
the user can replay a single heavy stage standalone after editing
its source. The full run is pickled under
<tempfile.gettempdir()>/pcg_split_profile/<graph_id>/<payload_sha>/
with a collision check on save and the cache path echoed at the
end of run_split_profile so the run can be copied for retention
before the OS tmp policy reclaims it.

Adds dry_run_scope() context manager next to is_dry_run() so the
harness (and any future caller) sets/restores PCG_DRY_RUN through
one helper rather than poking os.environ.
metrics_report now prints a compact human-readable table — no
banner, auto-scaled time (ms / s) and bytes (KB / MB / GB),
thousands-separator counters, signed rss delta, and all-zero
counter columns hidden. Format helpers _fmt_time / _fmt_bytes /
_fmt_count live module-private. print_report stays unchanged for
timing-only callers.

split_profile.profile_call(cg, name, fn, *args, **kwargs) wraps
one callable under dry_run_scope + count_io + profiler.profile,
returning (profiler, result). Lets the user replay a single stage
standalone after editing its source — pairs with load_run() to
iterate on one heavy stage without redoing the full flow.

ocdbt.ensure_fork_synced honors PCG_DRY_RUN — a defensive gate
on the one OCDBT write outside the gated edit flow. Steady-state
no-op for graphs with prior edits.
Inline get_profiler().profile() blocks now wrap the suspect operations
inside split_supervoxel, update_edges, and split_supervoxel_helper —
seg_read, binary_seg, geodesic_split, build_coords, kdtrees,
get_subgraph, edge_dedup, get_roots_inner, _get_new_edges,
connect_seeds, split_growing, etc. Each captures wall time, Python
heap peak, RSS peak, and IO counters as one BlockMetrics row.

HierarchicalProfiler.__init__ gains with_memory / with_rss instance
defaults (both True) and a default_counters attribute, so inline call
sites stay short — just `with _prof.profile(name):`.

run_split_profile drives the global profiler (reset + enable on
entry, disable on exit) so all inline blocks land in one
profiler.blocks list. Drops the previous top-level profile() wrappers
around the mincut / plan / split_supervoxels / remove_edges stages;
the wrappers stay as capture-only hooks for SplitInputs replay. New
overwrite=False kwarg on run_split_profile; when True, wipes the
cache dir before starting.

Profiler stays a no-op when not explicitly enabled.
…a fastremap

build_coords_by_label now delegates to fastremap.point_cloud — a
single C++ pass that emits uint16 coords grouped by label, treating
0 as background. Before the call, update_edges masks new_seg in
place with fastremap.mask_except(new_seg, new_ids ∪ all_edge_svs)
so point_cloud only emits coords for labels the downstream readers
(kdtrees + coords_by_label.get in _get_new_edges) will actually
query. update_edges is reordered so get_subgraph and
get_roots_inner run first to supply that label set. The caller
does not read seg after update_edges returns, so the in-place
mutation is safe.

remap_to_root drops fastremap.remap(..., in_place=False) — which
allocated a full-size shadow array of seg just to compare against
the target root — for fastremap.mask_except(seg, root_labels,
in_place=True). Same filter, no shadow.

binary_seg replaces np.isin(seg, supervoxel_ids) — which allocates
int-array auxiliaries the size of seg — with a per-SV `==` / `|=`
loop. Computed over seg[voxel_overlap_crop] directly so the
boolean is only as big as the region split_supervoxel_helper
consumes. Empty supervoxel_ids returns all-False, matching the
prior np.isin contract.

split_supervoxel computed fastremap.unique(seg) twice — once for
cut_supervoxels narrowing, once in get_roots — on the same
unmutated seg. Hoisted into a single seg_unique block whose result
both readers reuse, so the sort buffer is paid once at low rss
residual.

Harness-side: metrics_report and print_report no longer gate on
self.enabled (the harness disables the profiler before printing,
which was silencing the reports). run_split_profile saves the
cache before disabling. The metrics table adds cum_wall (from the
first profile() block since reset), rss_start, rss_peak columns;
drops the noisy ocdbt_reads counter.
_resolve_label3_touching_vectorized decided each stray label-3
component's side by dilating the label-1 and label-2 masks over the
whole volume and intersecting with the component mask — two
full-volume binary_dilations that dominated enforce_cc.

The label-3 voxels occupy a small region, so dilate the side masks
only inside the label-3 bounding box, padded by one voxel so the
dilation still reaches the bordering side voxels. The per-component
border counts are identical to the full-volume dilation, so the
split result is unchanged; only the dilated region shrinks to the
label-3 neighborhood.

Enforcement stays at full resolution after upsampling; moving it to
the downsampled grid fragments the labels under the foreground mask
and breaks the single-connected-component guarantee _update_chunks
relies on (covered by test_single_cc_guarantee_with_downsample).

Profiler-side: metrics_report lays the blocks out as a tree (one
column per nesting level, parent-first). Inline profile() blocks
remain only on the steps still time- or memory-suspect (>=2s wall
or >=~0.5GB rss); the cheap steps earlier profiling cleared are
unwrapped to keep the report focused.
@akhileshh akhileshh force-pushed the akhilesh/sv-splitting-locks branch from 03d8c3d to b3887d4 Compare May 28, 2026 14:51
akhileshh added 2 commits May 28, 2026 15:07
The SV-split path called cg.get_roots() without a timestamp in both
split_supervoxel (for building sv_root_map) and update_edges (for
mapping edge SVs to roots). Every other operation in the codebase
pins these reads to parent_ts — the pre-op read snapshot passed to
execute() — so a stuck-op replay sees the same graph state as the
original attempt.

Threads parent_ts from MulticutOperation._apply() → split_supervoxels
→ split_supervoxel → update_edges, and passes it to both get_roots
calls. Normal first-attempt runs pass parent_ts=None, which resolves
to current time as before; only replay paths (stuck_ops.replay)
supply a pinned snapshot.
Group the supervoxel-split modules under graph/sv_split/ (cutting,
edges, edits, profile, debug) and move the five splitting design docs
into the same package as README/algorithm/design/edges/recovery, so
the split flow and its documentation live behind one boundary instead
of being scattered across graph/, debug/, and docs/. sv_split/__init__
exposes only cutting/edges/edits; profile and debug stay out of it to
keep their app-layer (Flask) and dry-run deps off the core import path.

Promote the hierarchical profiler out of debug/ into a standalone
profiler/ package: hierarchical (HierarchicalProfiler, BlockMetrics,
RSS sampler), utils (formatters), main (global instance + accessors).

operation.py calls plan_sv_splits / split_supervoxels via the
sv_split.edits module object rather than a direct symbol import, so
profile.py's capture wrappers patch the live attribute at call time.

Repoint stuck_ops.py's docstring at the relocated recovery doc.
@akhileshh akhileshh force-pushed the akhilesh/sv-splitting-locks branch from e212d23 to 6cd5e2e Compare May 28, 2026 20:30
akhileshh added 4 commits May 28, 2026 22:30
Seed snapping ran a cKDTree over every voxel of the supervoxel, twice
per split (~47s, ~40% of a split), because the caller disables the
boundary-only and downsample candidate reductions for safety. Add a
use_bbox lever to snap_seeds_to_segment that restricts the candidate
scan to a box around the seeds — the nearest true voxel is always near
a seed, so the snapped result is identical while the KDTree shrinks by
orders of magnitude. A grow-until-nonempty loop guarantees the window
always contains the true nearest voxel, so the box size is never a
correctness constraint. Both split_supervoxel_helper call sites opt in.

Also: add a plain-terms algorithm overview, and correct the README
bbox description (the read/cut region is the seed envelope plus a
one-chunk margin, not the rep's full piece-set envelope).
The stray-fragment resolver's first action is a full-volume
connected-components labeling of label-3 voxels, after which it
early-returns when there are none. Splits that produce no strays still
paid for that whole-volume scan just to find nothing to do. Guard the
resolver behind a label-3 voxel count so the CC labeling runs only when
a fragment was actually stranded; the count is created solely by the
single-CC enforcement above, so an empty check is exactly the resolver's
own no-op condition.

Route cutting.py's per-stage logging through the package logger
(logger.debug, with the label-3 count at logger.note) instead of the
verbose-gated print helper, and drop the now-unused verbose parameter
from the split functions and their callers.
The supervoxel-split step copied the full overlap crop so the fresh IDs lived in a buffer separate from the later
in-place masks. Write them straight into seg's crop view instead, and
capture the OCDBT write payloads plus the source/sink id lookups from
that crop before masking — so they keep the unmasked neighbour SV IDs
the write must preserve. The root mask then runs only for update_edges,
which reads only its own wanted-label set and so is unaffected by the
masked neighbours. Peak memory drops by roughly the crop size; the only
remaining copies are per-changed-chunk, proportional to the edit.
…ot mask

Make pinky (nonzero voxel_offset) SV splits work end to end. Three
coupled changes plus the seg_unique optimization, all in the SV-split
path:

- chunks_overlapping_bbox gains an `origin` arg; the SV-split callers
  pass meta.voxel_bounds[:, 0] so the chunk lattice is anchored where
  PCG's chunks actually start. The origin-0 lattice put new supervoxel
  IDs in the wrong chunk on any offset dataset (and could split a grid
  block across two real chunks); zero-offset graphs were unaffected,
  which is why this stayed hidden.

- Remove the root mask in _route_edges_and_rows. update_edges already
  trims seg to {new ids} ∪ {edge-partner svs} internally; the outer
  mask, keyed on pre-split ids, deleted the freshly written new-SV
  voxels from seg once those ids were written in place — update_edges
  then raised KeyError building their kdtrees.

- seg_unique computes the distinct SV set per chunk over the segment-id
  field, restoring the chunk bits before the final union, instead of a
  full-uint64 unique over the whole read.

Add _assert_same_chunk: every new supervoxel must share a chunk with the
supervoxel it split from. split_supervoxel decomposed into _SplitCtx /
stage helpers. profile.py saves the failing traceback to the run dir
(load_traceback to read it back) and logs only the final exception line.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant