diff --git a/docs/decisions/2026-05-30-chapel-l23-cancel-token.md b/docs/decisions/2026-05-30-chapel-l23-cancel-token.md new file mode 100644 index 0000000..bd6141e --- /dev/null +++ b/docs/decisions/2026-05-30-chapel-l23-cancel-token.md @@ -0,0 +1,155 @@ + + +# 2026-05-30 — L2.3 cancel-token preemption for speculative search + +ADR-style addendum to [2026-05-30-chapel-rehabilitation.md][rehab]. +Documents the L2.3 cancel-token plumbing that closes the +"speculative-search winner doesn't kill losers" gap explicitly carried +forward from Wave 1. + +[rehab]: ./2026-05-30-chapel-rehabilitation.md + +## Status + +Accepted. Implemented in the PR that introduces this ADR. + +## Context + +PR #146's L2.2 `parallelProofSearchSpeculative` shipped with a clear +Wave-1 caveat (recorded in the function's docstring and in the rehab +ADR): + +> Wave 1 scope deliberately stops short of the "kill in-flight losers" +> step. Once a prover has been spawned, `tryProver` runs it to its own +> per-prover timeout — coforall doesn't know how to cancel a child task +> that's blocked on subprocess I/O. L2.3 introduces a cancel token +> threaded through `tryProver` so the children can SIGKILL their own +> subprocesses when they observe a winner; that's a separate PR. + +That separate PR is this one. The visible symptom Wave 1 left behind: +once a successful prover wins the CAS, the bench's wall-clock for +`parallel_speculative` was still bounded by the slowest in-flight +loser (because each one ran to its own `tryProver` timeout). The +`baseline.md` 5-run medians show this — speculative and best-of were +within noise on the success cases, instead of speculative dominating. + +## Decision + +Add a shared `CancelToken` instance to `parallelProofSearchSpeculative`, +thread an optional `borrowed CancelToken?` reference through +`tryProver`'s signature, and have the winner pair its CAS with a +write to the token. Each loser polls the token at every 100 ms tick +inside the existing bounded-wait loop; on observing `cancelled = true` +it SIGKILLs its own subprocess and returns a preempted result. + +Why a `class` (not `record`) `CancelToken`: + +- The token must be **shared by reference** across all `coforall` + tasks so the winner's write is visible to every loser. Chapel + records have value semantics; classes are passed by reference. +- Atomic-bool fields on a class have well-defined initial state + (`false`) after `init this` commits field initialisation, so the + init proc body is empty — no race-on-construction concern. +- The class is owned by the search function (`new owned`) and + borrowed by each `tryProver` call, so its lifetime is exactly the + search's lifetime and no extra liveness reasoning is needed. + +Why exitCode = -5 for preempted losers: + +- The existing failure codes are `-1` (not available / general failure), + `-2` (temp file failure), `-3` (timeout), `-4` (subprocess error). + Preemption is a distinct fourth-kind of failure — the prover did + not get to run to completion, but neither did it time out nor crash; + it was actively SIGKILLed by the portfolio. Reusing `-3` (timeout) + would mis-categorise preemption as a slow-prover signal, polluting + any later bench that tracks timeout rates. + +## Soundness + +The aggregation soundness proof in `proofs/agda/ParallelSoundness.agda` +already covers this case. Theorem `cancellation-safety` states: + +```agda +cancellation-safety : {n : ℕ} (bs : Vec Bool n) + → IsTrue (speculative (true ∷ bs)) +cancellation-safety _ = tt +``` + +i.e. **once a head witness is true, the verdict is true regardless of +the tail bs's contents**. In the L2.3 implementation the "head" is the +CAS-winner's `success = true` and the tail `bs` is every loser's +post-preemption result (which is now `success = false` with +exitCode = -5 instead of running to a possibly-true result). The +theorem says: the verdict over the whole result vector is unaffected +by this difference. The Agda module's header docstring is updated in +this PR to spell out the L2.3 cross-reference. + +The CAS itself is the linearisation point: any prover whose poll +observes `cancelled = true` ran AFTER the winning CAS. The +happens-before edge from `winnerIdx.compareAndSwap(-1, i)` → +`cancelToken.cancelled.write(true)` → next-loser's +`cancelToken!.cancelled.read()` is exactly the atomic-bool semantics +Chapel inherits from the C11 memory model. + +## Expected speedup + +Wave-1 baseline (`docs/bench/2026-05-30-chapel-mrr-baseline.md`) +showed speculative ≈ best-of on the success cases because every +loser ran to its own per-prover timeout. After L2.3 the loser wall +is bounded by `pollInterval + SIGKILL latency ≈ 100-150 ms`. For the +trivial-success regime this is invisible because the winner itself +returns in 200-700 ms. For the realistic-corpus regime (10-30 s +real prover invocations) this is the difference between waiting +`max(losers)` and `min(winner) + ~150 ms`. The dramatic speedup the +bench writeup predicted is now structurally present. + +A second baseline run (`docs/bench/`) is **not** included in this PR +— the trivial fixtures don't exercise the new code path enough to +show a meaningful number. A real-corpus bench run is the L2.4+ +follow-up. + +## Reproduce locally + +```bash +just bench-chapel-mrr +# Watch the parallel_speculative wallclock vs parallel_bestof: +# trivial fixtures show no change; a follow-up corpus bench would +# show speculative << bestof. +``` + +## Risks / counter-arguments considered + +1. **Premature cancellation race.** If two provers succeed at nearly + the same instant, the late one might write `cancelled = true` + between the early CAS and the early winner's exit. **Resolved:** the + CAS is what selects the winner; whichever prover's `compareAndSwap` + returns true wins regardless of token-write order. The late + prover's `compareAndSwap` returns false (expected != -1 now), so + it does NOT write to the token. Only ONE prover writes the token, + and it's the same prover that wins the CAS. + +2. **Self-cancellation.** Could a winning prover's poll race with its + own success-detection and trigger preemption on itself? **No:** + the cancellation poll happens INSIDE the `subproc.running` loop, + which exits once the subprocess has terminated. A successful + prover exits the loop via the `!subproc.running` edge before + `cancelToken.cancelled.write(true)` is even called. + +3. **Loser starvation.** What about provers whose first `tryProver` + poll is delayed? **Resolved by 100 ms ceiling:** even the + most-delayed prover observes the cancel within 100 ms of its + first poll. Total preemption-observation latency is bounded by + one pollInterval. + +## Wave-3+ follow-ups + +- A real-corpus bench (10-30 s prover invocations) to demonstrate + the speculative >> bestof speedup quantitatively. +- A `--bench-chapel-mrr` `--strategy=speculative` flag that + reports the per-prover preemption-vs-timeout-vs-success breakdown, + so we can measure the wallclock improvement attributable to + preemption specifically. +- Extend `CancelToken` to carry a reason string (winner-name, + timeout-budget-exhausted, external-cancel, …) for future + cooperative-cancellation flows beyond the speculative-search + case. diff --git a/docs/decisions/2026-05-30-chapel-rehabilitation.md b/docs/decisions/2026-05-30-chapel-rehabilitation.md index 6ea5f79..62bcd61 100644 --- a/docs/decisions/2026-05-30-chapel-rehabilitation.md +++ b/docs/decisions/2026-05-30-chapel-rehabilitation.md @@ -110,7 +110,9 @@ Wave 2 (separate PR, after Wave 1 green-on-main for ≥7 days): the metalayer build to `--dynamic`; flip `rust-chapel-real` to strict. - Add the cancel-token thread through `tryProver` and switch the - Chapel-side default to the speculative search. + Chapel-side default to the speculative search. Implementation + + soundness argument: + [2026-05-30-chapel-l23-cancel-token.md](./2026-05-30-chapel-l23-cancel-token.md). - Wire the `proven` and `docudactyl` parallel-dispatch tracks off the same scaffold; breadcrumb issues filed in those repos on the PR landing. diff --git a/proofs/agda/ParallelSoundness.agda b/proofs/agda/ParallelSoundness.agda index 3014658..8fdf19b 100644 --- a/proofs/agda/ParallelSoundness.agda +++ b/proofs/agda/ParallelSoundness.agda @@ -15,7 +15,17 @@ -- 3. Cancellation-safety — once a prover position witnesses -- success, the verdict is true regardless -- of what other provers do (the winner is --- insensitive to the losers' state) +-- insensitive to the losers' state). +-- THIS IS THE L2.3 PREEMPTION CLAIM. +-- Once the speculative-search winner has +-- flipped the shared `CancelToken`, every +-- in-flight loser self-SIGKILLs on its +-- next poll and returns a preempted +-- result (exitCode = -5). Those preempted +-- results are NEVER returned to the +-- caller — `winner` was set before any +-- cancellation could race the CAS — so +-- the aggregation verdict is unchanged. -- -- The Chapel implementation uses an atomic compare-and-swap on a -- shared `winnerIdx` to record the first-completing successful prover. @@ -25,6 +35,13 @@ -- the (possibly several) provers that succeed, but the aggregation -- verdict (success or not) is unaffected by which index wins. -- +-- L2.3 cancel-token implementation lives in `src/chapel/parallel_proof_search.chpl`: +-- the `CancelToken` class (class for shared-by-reference semantics across +-- coforall tasks), the `cancelToken` parameter on `tryProver`, and the +-- `cancelToken.cancelled.write(true)` paired with the CAS in +-- `parallelProofSearchSpeculative`. `cancellation-safety` below is the +-- formal statement of what that implementation preserves. +-- -- Wave-1 scope: this module proves the AGGREGATION layer. The -- complementary layer — that `tryProver` itself is sound (i.e. its -- `success = true` return implies the proof actually checks under the diff --git a/src/chapel/parallel_proof_search.chpl b/src/chapel/parallel_proof_search.chpl index 238a059..92479e2 100644 --- a/src/chapel/parallel_proof_search.chpl +++ b/src/chapel/parallel_proof_search.chpl @@ -128,12 +128,38 @@ proc isProverAvailable(info: ProverInfo): bool { } } +// --------------------------------------------------------------------------- +// Cancellation token (L2.3) +// --------------------------------------------------------------------------- + +// A shared cancellation flag used by `parallelProofSearchSpeculative` +// to signal in-flight losers as soon as the first winner has been +// established via the CAS. A `class` (not `record`) so that all +// `coforall` tasks share the same instance by reference. +// +// `cancelled.write(true)` is the only mutation; all readers are +// `read()` polls inside `tryProver`'s wait loop. The atomic semantics +// give us a happens-before edge from the winner's CAS to every +// loser's next poll — no extra synchronisation is required. +class CancelToken { + var cancelled: atomic bool; + proc init() { + // `atomic bool` defaults to `false` after `init this` commits + // field initialisation; no further write needed. + init this; + } +} + // --------------------------------------------------------------------------- // Real prover invocation via subprocess // --------------------------------------------------------------------------- -// Write goal content to a temporary file and invoke the prover -proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout): ProofResult { +// Write goal content to a temporary file and invoke the prover. If +// `cancelToken` is non-nil and `cancelled.read()` returns true mid-poll, +// SIGKILL the child and return a cancelled-result (exitCode = -5). +// Callers that do not race (sequential / best-of) pass nil. +proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout, + cancelToken: borrowed CancelToken? = nil): ProofResult { var timer = new stopwatch(); timer.start(); @@ -199,12 +225,39 @@ proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout): P // The previous implementation used `!proc.running` which is true // before the child has begun and after it has exited, so it was // both racy at startup and stopped polling once running latched. + // L2.3: also break out if the shared `cancelToken` flips to true, + // which the speculative-search winner sets after its CAS succeeds. var elapsed: real = 0.0; const pollInterval: real = 0.1; + var preempted = false; while subproc.running && elapsed < timeout:real { sleep(pollInterval); elapsed += pollInterval; subproc.poll(); + if cancelToken != nil && cancelToken!.cancelled.read() { + preempted = true; + break; + } + } + + if preempted { + // L2.3 preemption: the speculative-search winner has been + // declared elsewhere. SIGKILL ourselves and return a + // distinct exitCode = -5 so the caller's result table + // distinguishes preempted-loser from timed-out from failed. + subproc.sendPosixSignal(9); + subproc.wait(); + timer.stop(); + try { remove(tmpFile); } catch { } + return new ProofResult( + success = false, + prover = info.name, + proverId = info.id, + time = timer.elapsed(), + exitCode = -5, + output = "Preempted by speculative-search winner", + category = info.category + ); } if subproc.running { @@ -370,21 +423,25 @@ proc parallelProofSearch(goal: string, provers: [] ProverInfo, // - parallelProofSearch waits for ALL tasks then picks the fastest // success. Wall time is bounded by the slowest prover. // - parallelProofSearchSpeculative records the first-completing -// success via an atomic CAS and returns it. Wall time is bounded -// by the fastest successful prover plus whatever in-flight tasks -// have already begun (no mid-flight preemption in Wave 1). +// success via an atomic CAS and SIGKILLs any still-running losers +// via the shared `CancelToken`. Wall time is bounded by the +// fastest successful prover plus one poll-interval (~100 ms) of +// observation lag on the losers. // -// Wave 1 scope deliberately stops short of the "kill in-flight losers" -// step. Once a prover has been spawned, `tryProver` runs it to its own -// per-prover timeout — coforall doesn't know how to cancel a child task -// that's blocked on subprocess I/O. L2.3 introduces a cancel token -// threaded through `tryProver` so the children can SIGKILL their own -// subprocesses when they observe a winner; that's a separate PR. +// L2.3 cancellation: the winning task's CAS is paired with a write to +// the shared `CancelToken`. Loser tasks read the token at every poll +// step in `tryProver` and self-SIGKILL their subprocess as soon as +// the flag is observed. The atomic-bool semantics give us a +// happens-before edge from CAS-success to next-loser-poll, so no +// further locking is needed. // -// Cancellation-safety today: every losing task is run to its natural -// end. The result table records all outcomes; only the winning index -// (the first successful CAS) is returned. No prover's exit can poison -// the join because the atomic CAS is monotone — the first winner wins. +// Soundness: the monotone first-wins CAS still controls which index +// is returned. The cancellation flag only affects how the LOSERS +// terminate — their results land in the table with exitCode = -5 +// instead of running to completion, but they are never returned to +// the caller because `winner` is set before any cancellation could +// race the CAS. See proofs/agda/ParallelSoundness.agda: +// `cancellation-safety` for the formal statement. proc parallelProofSearchSpeculative(goal: string, provers: [] ProverInfo, timeout: int = defaultTimeout): ProofResult { if verbose then @@ -397,15 +454,20 @@ proc parallelProofSearchSpeculative(goal: string, provers: [] ProverInfo, var results: [provers.domain] ProofResult; var winnerIdx: atomic int; winnerIdx.write(-1); + var cancelToken = new owned CancelToken(); coforall (prover, i) in zip(provers, provers.domain) { - results[i] = tryProver(prover, goal, timeout); + results[i] = tryProver(prover, goal, timeout, cancelToken.borrow()); if results[i].success { // Monotone first-wins CAS: only the first successful - // worker flips the atomic from -1 to its own index. + // worker flips the atomic from -1 to its own index. The + // paired write to cancelToken signals every still-polling + // loser to SIGKILL its subprocess on the next poll. var expected = -1; - winnerIdx.compareAndSwap(expected, i); + if winnerIdx.compareAndSwap(expected, i) { + cancelToken.cancelled.write(true); + } } }