hyperpolymath · hyperpolymath · May 30, 2026 · May 30, 2026
diff --git a/docs/decisions/2026-05-30-chapel-l23-cancel-token.md b/docs/decisions/2026-05-30-chapel-l23-cancel-token.md
@@ -0,0 +1,155 @@
+<!-- SPDX-License-Identifier: MPL-2.0 -->
+
+# 2026-05-30 — L2.3 cancel-token preemption for speculative search
+
+ADR-style addendum to [2026-05-30-chapel-rehabilitation.md][rehab].
+Documents the L2.3 cancel-token plumbing that closes the
+"speculative-search winner doesn't kill losers" gap explicitly carried
+forward from Wave 1.
+
+[rehab]: ./2026-05-30-chapel-rehabilitation.md
+
+## Status
+
+Accepted. Implemented in the PR that introduces this ADR.
+
+## Context
+
+PR #146's L2.2 `parallelProofSearchSpeculative` shipped with a clear
+Wave-1 caveat (recorded in the function's docstring and in the rehab
+ADR):
+
+> Wave 1 scope deliberately stops short of the "kill in-flight losers"
+> step. Once a prover has been spawned, `tryProver` runs it to its own
+> per-prover timeout — coforall doesn't know how to cancel a child task
+> that's blocked on subprocess I/O. L2.3 introduces a cancel token
+> threaded through `tryProver` so the children can SIGKILL their own
+> subprocesses when they observe a winner; that's a separate PR.
+
+That separate PR is this one. The visible symptom Wave 1 left behind:
+once a successful prover wins the CAS, the bench's wall-clock for
+`parallel_speculative` was still bounded by the slowest in-flight
+loser (because each one ran to its own `tryProver` timeout). The
+`baseline.md` 5-run medians show this — speculative and best-of were
+within noise on the success cases, instead of speculative dominating.
+
+## Decision
+
+Add a shared `CancelToken` instance to `parallelProofSearchSpeculative`,
+thread an optional `borrowed CancelToken?` reference through
+`tryProver`'s signature, and have the winner pair its CAS with a
+write to the token. Each loser polls the token at every 100 ms tick
+inside the existing bounded-wait loop; on observing `cancelled = true`
+it SIGKILLs its own subprocess and returns a preempted result.
+
+Why a `class` (not `record`) `CancelToken`:
+
+- The token must be **shared by reference** across all `coforall`
+  tasks so the winner's write is visible to every loser. Chapel
+  records have value semantics; classes are passed by reference.
+- Atomic-bool fields on a class have well-defined initial state
+  (`false`) after `init this` commits field initialisation, so the
+  init proc body is empty — no race-on-construction concern.
+- The class is owned by the search function (`new owned`) and
+  borrowed by each `tryProver` call, so its lifetime is exactly the
+  search's lifetime and no extra liveness reasoning is needed.
+
+Why exitCode = -5 for preempted losers:
+
+- The existing failure codes are `-1` (not available / general failure),
+  `-2` (temp file failure), `-3` (timeout), `-4` (subprocess error).
+  Preemption is a distinct fourth-kind of failure — the prover did
+  not get to run to completion, but neither did it time out nor crash;
+  it was actively SIGKILLed by the portfolio. Reusing `-3` (timeout)
+  would mis-categorise preemption as a slow-prover signal, polluting
+  any later bench that tracks timeout rates.
+
+## Soundness
+
+The aggregation soundness proof in `proofs/agda/ParallelSoundness.agda`
+already covers this case. Theorem `cancellation-safety` states:
+
+```agda
+cancellation-safety : {n : ℕ} (bs : Vec Bool n)
+                    → IsTrue (speculative (true ∷ bs))
+cancellation-safety _ = tt
+```
+
+i.e. **once a head witness is true, the verdict is true regardless of
+the tail bs's contents**. In the L2.3 implementation the "head" is the
+CAS-winner's `success = true` and the tail `bs` is every loser's
+post-preemption result (which is now `success = false` with
+exitCode = -5 instead of running to a possibly-true result). The
+theorem says: the verdict over the whole result vector is unaffected
+by this difference. The Agda module's header docstring is updated in
+this PR to spell out the L2.3 cross-reference.
+
+The CAS itself is the linearisation point: any prover whose poll
+observes `cancelled = true` ran AFTER the winning CAS. The
+happens-before edge from `winnerIdx.compareAndSwap(-1, i)` →
+`cancelToken.cancelled.write(true)` → next-loser's
+`cancelToken!.cancelled.read()` is exactly the atomic-bool semantics
+Chapel inherits from the C11 memory model.
+
+## Expected speedup
+
+Wave-1 baseline (`docs/bench/2026-05-30-chapel-mrr-baseline.md`)
+showed speculative ≈ best-of on the success cases because every
+loser ran to its own per-prover timeout. After L2.3 the loser wall
+is bounded by `pollInterval + SIGKILL latency ≈ 100-150 ms`. For the
+trivial-success regime this is invisible because the winner itself
+returns in 200-700 ms. For the realistic-corpus regime (10-30 s
+real prover invocations) this is the difference between waiting
+`max(losers)` and `min(winner) + ~150 ms`. The dramatic speedup the
+bench writeup predicted is now structurally present.
+
+A second baseline run (`docs/bench/`) is **not** included in this PR
+— the trivial fixtures don't exercise the new code path enough to
+show a meaningful number. A real-corpus bench run is the L2.4+
+follow-up.
+
+## Reproduce locally
+
+```bash
+just bench-chapel-mrr
+# Watch the parallel_speculative wallclock vs parallel_bestof:
+# trivial fixtures show no change; a follow-up corpus bench would
+# show speculative << bestof.
+```
+
+## Risks / counter-arguments considered
+
+1. **Premature cancellation race.** If two provers succeed at nearly
+   the same instant, the late one might write `cancelled = true`
+   between the early CAS and the early winner's exit. **Resolved:** the
+   CAS is what selects the winner; whichever prover's `compareAndSwap`
+   returns true wins regardless of token-write order. The late
+   prover's `compareAndSwap` returns false (expected != -1 now), so
+   it does NOT write to the token. Only ONE prover writes the token,
+   and it's the same prover that wins the CAS.
+
+2. **Self-cancellation.** Could a winning prover's poll race with its
+   own success-detection and trigger preemption on itself? **No:**
+   the cancellation poll happens INSIDE the `subproc.running` loop,
+   which exits once the subprocess has terminated. A successful
+   prover exits the loop via the `!subproc.running` edge before
+   `cancelToken.cancelled.write(true)` is even called.
+
+3. **Loser starvation.** What about provers whose first `tryProver`
+   poll is delayed? **Resolved by 100 ms ceiling:** even the
+   most-delayed prover observes the cancel within 100 ms of its
+   first poll. Total preemption-observation latency is bounded by
+   one pollInterval.
+
+## Wave-3+ follow-ups
+
+- A real-corpus bench (10-30 s prover invocations) to demonstrate
+  the speculative >> bestof speedup quantitatively.
+- A `--bench-chapel-mrr` `--strategy=speculative` flag that
+  reports the per-prover preemption-vs-timeout-vs-success breakdown,
+  so we can measure the wallclock improvement attributable to
+  preemption specifically.
+- Extend `CancelToken` to carry a reason string (winner-name,
+  timeout-budget-exhausted, external-cancel, …) for future
+  cooperative-cancellation flows beyond the speculative-search
+  case.
diff --git a/docs/decisions/2026-05-30-chapel-rehabilitation.md b/docs/decisions/2026-05-30-chapel-rehabilitation.md
@@ -110,7 +110,9 @@ Wave 2 (separate PR, after Wave 1 green-on-main for ≥7 days):
   the metalayer build to `--dynamic`; flip `rust-chapel-real` to
   strict.
 - Add the cancel-token thread through `tryProver` and switch the
-  Chapel-side default to the speculative search.
+  Chapel-side default to the speculative search. Implementation +
+  soundness argument:
+  [2026-05-30-chapel-l23-cancel-token.md](./2026-05-30-chapel-l23-cancel-token.md).
 - Wire the `proven` and `docudactyl` parallel-dispatch tracks
   off the same scaffold; breadcrumb issues filed in those repos
   on the PR landing.

diff --git a/proofs/agda/ParallelSoundness.agda b/proofs/agda/ParallelSoundness.agda
@@ -15,7 +15,17 @@
 --   3.  Cancellation-safety      — once a prover position witnesses
 --                                  success, the verdict is true regardless
 --                                  of what other provers do (the winner is
---                                  insensitive to the losers' state)
+--                                  insensitive to the losers' state).
+--                                  THIS IS THE L2.3 PREEMPTION CLAIM.
+--                                  Once the speculative-search winner has
+--                                  flipped the shared `CancelToken`, every
+--                                  in-flight loser self-SIGKILLs on its
+--                                  next poll and returns a preempted
+--                                  result (exitCode = -5). Those preempted
+--                                  results are NEVER returned to the
+--                                  caller — `winner` was set before any
+--                                  cancellation could race the CAS — so
+--                                  the aggregation verdict is unchanged.
 --
 -- The Chapel implementation uses an atomic compare-and-swap on a
 -- shared `winnerIdx` to record the first-completing successful prover.
@@ -25,6 +35,13 @@
 -- the (possibly several) provers that succeed, but the aggregation
 -- verdict (success or not) is unaffected by which index wins.
 --
+-- L2.3 cancel-token implementation lives in `src/chapel/parallel_proof_search.chpl`:
+-- the `CancelToken` class (class for shared-by-reference semantics across
+-- coforall tasks), the `cancelToken` parameter on `tryProver`, and the
+-- `cancelToken.cancelled.write(true)` paired with the CAS in
+-- `parallelProofSearchSpeculative`. `cancellation-safety` below is the
+-- formal statement of what that implementation preserves.
+--
 -- Wave-1 scope: this module proves the AGGREGATION layer. The
 -- complementary layer — that `tryProver` itself is sound (i.e. its
 -- `success = true` return implies the proof actually checks under the

diff --git a/src/chapel/parallel_proof_search.chpl b/src/chapel/parallel_proof_search.chpl
@@ -128,12 +128,38 @@ proc isProverAvailable(info: ProverInfo): bool {
     }
 }
 
+// ---------------------------------------------------------------------------
+// Cancellation token (L2.3)
+// ---------------------------------------------------------------------------
+
+// A shared cancellation flag used by `parallelProofSearchSpeculative`
+// to signal in-flight losers as soon as the first winner has been
+// established via the CAS. A `class` (not `record`) so that all
+// `coforall` tasks share the same instance by reference.
+//
+// `cancelled.write(true)` is the only mutation; all readers are
+// `read()` polls inside `tryProver`'s wait loop. The atomic semantics
+// give us a happens-before edge from the winner's CAS to every
+// loser's next poll — no extra synchronisation is required.
+class CancelToken {
+    var cancelled: atomic bool;
+    proc init() {
+        // `atomic bool` defaults to `false` after `init this` commits
+        // field initialisation; no further write needed.
+        init this;
+    }
+}
+
 // ---------------------------------------------------------------------------
 // Real prover invocation via subprocess
 // ---------------------------------------------------------------------------
 
-// Write goal content to a temporary file and invoke the prover
-proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout): ProofResult {
+// Write goal content to a temporary file and invoke the prover. If
+// `cancelToken` is non-nil and `cancelled.read()` returns true mid-poll,
+// SIGKILL the child and return a cancelled-result (exitCode = -5).
+// Callers that do not race (sequential / best-of) pass nil.
+proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout,
+               cancelToken: borrowed CancelToken? = nil): ProofResult {
     var timer = new stopwatch();
     timer.start();
 
@@ -199,12 +225,39 @@ proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout): P
         // The previous implementation used `!proc.running` which is true
         // before the child has begun and after it has exited, so it was
         // both racy at startup and stopped polling once running latched.
+        // L2.3: also break out if the shared `cancelToken` flips to true,
+        // which the speculative-search winner sets after its CAS succeeds.
         var elapsed: real = 0.0;
         const pollInterval: real = 0.1;
+        var preempted = false;
         while subproc.running && elapsed < timeout:real {
             sleep(pollInterval);
             elapsed += pollInterval;
             subproc.poll();
+            if cancelToken != nil && cancelToken!.cancelled.read() {
+                preempted = true;
+                break;
+            }
+        }
+
+        if preempted {
+            // L2.3 preemption: the speculative-search winner has been
+            // declared elsewhere. SIGKILL ourselves and return a
+            // distinct exitCode = -5 so the caller's result table
+            // distinguishes preempted-loser from timed-out from failed.
+            subproc.sendPosixSignal(9);
+            subproc.wait();
+            timer.stop();
+            try { remove(tmpFile); } catch { }
+            return new ProofResult(
+                success = false,
+                prover = info.name,
+                proverId = info.id,
+                time = timer.elapsed(),
+                exitCode = -5,
+                output = "Preempted by speculative-search winner",
+                category = info.category
+            );
         }
 
         if subproc.running {
@@ -370,21 +423,25 @@ proc parallelProofSearch(goal: string, provers: [] ProverInfo,
 //   - parallelProofSearch waits for ALL tasks then picks the fastest
 //     success. Wall time is bounded by the slowest prover.
 //   - parallelProofSearchSpeculative records the first-completing
-//     success via an atomic CAS and returns it. Wall time is bounded
-//     by the fastest successful prover plus whatever in-flight tasks
-//     have already begun (no mid-flight preemption in Wave 1).
+//     success via an atomic CAS and SIGKILLs any still-running losers
+//     via the shared `CancelToken`. Wall time is bounded by the
+//     fastest successful prover plus one poll-interval (~100 ms) of
+//     observation lag on the losers.
 //
-// Wave 1 scope deliberately stops short of the "kill in-flight losers"
-// step. Once a prover has been spawned, `tryProver` runs it to its own
-// per-prover timeout — coforall doesn't know how to cancel a child task
-// that's blocked on subprocess I/O. L2.3 introduces a cancel token
-// threaded through `tryProver` so the children can SIGKILL their own
-// subprocesses when they observe a winner; that's a separate PR.
+// L2.3 cancellation: the winning task's CAS is paired with a write to
+// the shared `CancelToken`. Loser tasks read the token at every poll
+// step in `tryProver` and self-SIGKILL their subprocess as soon as
+// the flag is observed. The atomic-bool semantics give us a
+// happens-before edge from CAS-success to next-loser-poll, so no
+// further locking is needed.
 //
-// Cancellation-safety today: every losing task is run to its natural
-// end. The result table records all outcomes; only the winning index
-// (the first successful CAS) is returned. No prover's exit can poison
-// the join because the atomic CAS is monotone — the first winner wins.
+// Soundness: the monotone first-wins CAS still controls which index
+// is returned. The cancellation flag only affects how the LOSERS
+// terminate — their results land in the table with exitCode = -5
+// instead of running to completion, but they are never returned to
+// the caller because `winner` is set before any cancellation could
+// race the CAS. See proofs/agda/ParallelSoundness.agda:
+// `cancellation-safety` for the formal statement.
 proc parallelProofSearchSpeculative(goal: string, provers: [] ProverInfo,
                                     timeout: int = defaultTimeout): ProofResult {
     if verbose then
@@ -397,15 +454,20 @@ proc parallelProofSearchSpeculative(goal: string, provers: [] ProverInfo,
     var results: [provers.domain] ProofResult;
     var winnerIdx: atomic int;
     winnerIdx.write(-1);
+    var cancelToken = new owned CancelToken();
 
     coforall (prover, i) in zip(provers, provers.domain) {
-        results[i] = tryProver(prover, goal, timeout);
+        results[i] = tryProver(prover, goal, timeout, cancelToken.borrow());
 
         if results[i].success {
             // Monotone first-wins CAS: only the first successful
-            // worker flips the atomic from -1 to its own index.
+            // worker flips the atomic from -1 to its own index. The
+            // paired write to cancelToken signals every still-polling
+            // loser to SIGKILL its subprocess on the next poll.
             var expected = -1;
-            winnerIdx.compareAndSwap(expected, i);
+            if winnerIdx.compareAndSwap(expected, i) {
+                cancelToken.cancelled.write(true);
+            }
         }
     }