Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
155 changes: 155 additions & 0 deletions docs/decisions/2026-05-30-chapel-l23-cancel-token.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
<!-- SPDX-License-Identifier: MPL-2.0 -->

# 2026-05-30 — L2.3 cancel-token preemption for speculative search

ADR-style addendum to [2026-05-30-chapel-rehabilitation.md][rehab].
Documents the L2.3 cancel-token plumbing that closes the
"speculative-search winner doesn't kill losers" gap explicitly carried
forward from Wave 1.

[rehab]: ./2026-05-30-chapel-rehabilitation.md

## Status

Accepted. Implemented in the PR that introduces this ADR.

## Context

PR #146's L2.2 `parallelProofSearchSpeculative` shipped with a clear
Wave-1 caveat (recorded in the function's docstring and in the rehab
ADR):

> Wave 1 scope deliberately stops short of the "kill in-flight losers"
> step. Once a prover has been spawned, `tryProver` runs it to its own
> per-prover timeout — coforall doesn't know how to cancel a child task
> that's blocked on subprocess I/O. L2.3 introduces a cancel token
> threaded through `tryProver` so the children can SIGKILL their own
> subprocesses when they observe a winner; that's a separate PR.

That separate PR is this one. The visible symptom Wave 1 left behind:
once a successful prover wins the CAS, the bench's wall-clock for
`parallel_speculative` was still bounded by the slowest in-flight
loser (because each one ran to its own `tryProver` timeout). The
`baseline.md` 5-run medians show this — speculative and best-of were
within noise on the success cases, instead of speculative dominating.

## Decision

Add a shared `CancelToken` instance to `parallelProofSearchSpeculative`,
thread an optional `borrowed CancelToken?` reference through
`tryProver`'s signature, and have the winner pair its CAS with a
write to the token. Each loser polls the token at every 100 ms tick
inside the existing bounded-wait loop; on observing `cancelled = true`
it SIGKILLs its own subprocess and returns a preempted result.

Why a `class` (not `record`) `CancelToken`:

- The token must be **shared by reference** across all `coforall`
tasks so the winner's write is visible to every loser. Chapel
records have value semantics; classes are passed by reference.
- Atomic-bool fields on a class have well-defined initial state
(`false`) after `init this` commits field initialisation, so the
init proc body is empty — no race-on-construction concern.
- The class is owned by the search function (`new owned`) and
borrowed by each `tryProver` call, so its lifetime is exactly the
search's lifetime and no extra liveness reasoning is needed.

Why exitCode = -5 for preempted losers:

- The existing failure codes are `-1` (not available / general failure),
`-2` (temp file failure), `-3` (timeout), `-4` (subprocess error).
Preemption is a distinct fourth-kind of failure — the prover did
not get to run to completion, but neither did it time out nor crash;
it was actively SIGKILLed by the portfolio. Reusing `-3` (timeout)
would mis-categorise preemption as a slow-prover signal, polluting
any later bench that tracks timeout rates.

## Soundness

The aggregation soundness proof in `proofs/agda/ParallelSoundness.agda`
already covers this case. Theorem `cancellation-safety` states:

```agda
cancellation-safety : {n : ℕ} (bs : Vec Bool n)
→ IsTrue (speculative (true ∷ bs))
cancellation-safety _ = tt
```

i.e. **once a head witness is true, the verdict is true regardless of
the tail bs's contents**. In the L2.3 implementation the "head" is the
CAS-winner's `success = true` and the tail `bs` is every loser's
post-preemption result (which is now `success = false` with
exitCode = -5 instead of running to a possibly-true result). The
theorem says: the verdict over the whole result vector is unaffected
by this difference. The Agda module's header docstring is updated in
this PR to spell out the L2.3 cross-reference.

The CAS itself is the linearisation point: any prover whose poll
observes `cancelled = true` ran AFTER the winning CAS. The
happens-before edge from `winnerIdx.compareAndSwap(-1, i)` →
`cancelToken.cancelled.write(true)` → next-loser's
`cancelToken!.cancelled.read()` is exactly the atomic-bool semantics
Chapel inherits from the C11 memory model.

## Expected speedup

Wave-1 baseline (`docs/bench/2026-05-30-chapel-mrr-baseline.md`)
showed speculative ≈ best-of on the success cases because every
loser ran to its own per-prover timeout. After L2.3 the loser wall
is bounded by `pollInterval + SIGKILL latency ≈ 100-150 ms`. For the
trivial-success regime this is invisible because the winner itself
returns in 200-700 ms. For the realistic-corpus regime (10-30 s
real prover invocations) this is the difference between waiting
`max(losers)` and `min(winner) + ~150 ms`. The dramatic speedup the
bench writeup predicted is now structurally present.

A second baseline run (`docs/bench/`) is **not** included in this PR
— the trivial fixtures don't exercise the new code path enough to
show a meaningful number. A real-corpus bench run is the L2.4+
follow-up.

## Reproduce locally

```bash
just bench-chapel-mrr
# Watch the parallel_speculative wallclock vs parallel_bestof:
# trivial fixtures show no change; a follow-up corpus bench would
# show speculative << bestof.
```

## Risks / counter-arguments considered

1. **Premature cancellation race.** If two provers succeed at nearly
the same instant, the late one might write `cancelled = true`
between the early CAS and the early winner's exit. **Resolved:** the
CAS is what selects the winner; whichever prover's `compareAndSwap`
returns true wins regardless of token-write order. The late
prover's `compareAndSwap` returns false (expected != -1 now), so
it does NOT write to the token. Only ONE prover writes the token,
and it's the same prover that wins the CAS.

2. **Self-cancellation.** Could a winning prover's poll race with its
own success-detection and trigger preemption on itself? **No:**
the cancellation poll happens INSIDE the `subproc.running` loop,
which exits once the subprocess has terminated. A successful
prover exits the loop via the `!subproc.running` edge before
`cancelToken.cancelled.write(true)` is even called.

3. **Loser starvation.** What about provers whose first `tryProver`
poll is delayed? **Resolved by 100 ms ceiling:** even the
most-delayed prover observes the cancel within 100 ms of its
first poll. Total preemption-observation latency is bounded by
one pollInterval.

## Wave-3+ follow-ups

- A real-corpus bench (10-30 s prover invocations) to demonstrate
the speculative >> bestof speedup quantitatively.
- A `--bench-chapel-mrr` `--strategy=speculative` flag that
reports the per-prover preemption-vs-timeout-vs-success breakdown,
so we can measure the wallclock improvement attributable to
preemption specifically.
- Extend `CancelToken` to carry a reason string (winner-name,
timeout-budget-exhausted, external-cancel, …) for future
cooperative-cancellation flows beyond the speculative-search
case.
4 changes: 3 additions & 1 deletion docs/decisions/2026-05-30-chapel-rehabilitation.md
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,9 @@ Wave 2 (separate PR, after Wave 1 green-on-main for ≥7 days):
the metalayer build to `--dynamic`; flip `rust-chapel-real` to
strict.
- Add the cancel-token thread through `tryProver` and switch the
Chapel-side default to the speculative search.
Chapel-side default to the speculative search. Implementation +
soundness argument:
[2026-05-30-chapel-l23-cancel-token.md](./2026-05-30-chapel-l23-cancel-token.md).
- Wire the `proven` and `docudactyl` parallel-dispatch tracks
off the same scaffold; breadcrumb issues filed in those repos
on the PR landing.
Expand Down
19 changes: 18 additions & 1 deletion proofs/agda/ParallelSoundness.agda
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,17 @@
-- 3. Cancellation-safety — once a prover position witnesses
-- success, the verdict is true regardless
-- of what other provers do (the winner is
-- insensitive to the losers' state)
-- insensitive to the losers' state).
-- THIS IS THE L2.3 PREEMPTION CLAIM.
-- Once the speculative-search winner has
-- flipped the shared `CancelToken`, every
-- in-flight loser self-SIGKILLs on its
-- next poll and returns a preempted
-- result (exitCode = -5). Those preempted
-- results are NEVER returned to the
-- caller — `winner` was set before any
-- cancellation could race the CAS — so
-- the aggregation verdict is unchanged.
--
-- The Chapel implementation uses an atomic compare-and-swap on a
-- shared `winnerIdx` to record the first-completing successful prover.
Expand All @@ -25,6 +35,13 @@
-- the (possibly several) provers that succeed, but the aggregation
-- verdict (success or not) is unaffected by which index wins.
--
-- L2.3 cancel-token implementation lives in `src/chapel/parallel_proof_search.chpl`:
-- the `CancelToken` class (class for shared-by-reference semantics across
-- coforall tasks), the `cancelToken` parameter on `tryProver`, and the
-- `cancelToken.cancelled.write(true)` paired with the CAS in
-- `parallelProofSearchSpeculative`. `cancellation-safety` below is the
-- formal statement of what that implementation preserves.
--
-- Wave-1 scope: this module proves the AGGREGATION layer. The
-- complementary layer — that `tryProver` itself is sound (i.e. its
-- `success = true` return implies the proof actually checks under the
Expand Down
98 changes: 80 additions & 18 deletions src/chapel/parallel_proof_search.chpl
Original file line number Diff line number Diff line change
Expand Up @@ -128,12 +128,38 @@ proc isProverAvailable(info: ProverInfo): bool {
}
}

// ---------------------------------------------------------------------------
// Cancellation token (L2.3)
// ---------------------------------------------------------------------------

// A shared cancellation flag used by `parallelProofSearchSpeculative`
// to signal in-flight losers as soon as the first winner has been
// established via the CAS. A `class` (not `record`) so that all
// `coforall` tasks share the same instance by reference.
//
// `cancelled.write(true)` is the only mutation; all readers are
// `read()` polls inside `tryProver`'s wait loop. The atomic semantics
// give us a happens-before edge from the winner's CAS to every
// loser's next poll — no extra synchronisation is required.
class CancelToken {
var cancelled: atomic bool;
proc init() {
// `atomic bool` defaults to `false` after `init this` commits
// field initialisation; no further write needed.
init this;
}
}

// ---------------------------------------------------------------------------
// Real prover invocation via subprocess
// ---------------------------------------------------------------------------

// Write goal content to a temporary file and invoke the prover
proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout): ProofResult {
// Write goal content to a temporary file and invoke the prover. If
// `cancelToken` is non-nil and `cancelled.read()` returns true mid-poll,
// SIGKILL the child and return a cancelled-result (exitCode = -5).
// Callers that do not race (sequential / best-of) pass nil.
proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout,
cancelToken: borrowed CancelToken? = nil): ProofResult {
var timer = new stopwatch();
timer.start();

Expand Down Expand Up @@ -199,12 +225,39 @@ proc tryProver(info: ProverInfo, goal: string, timeout: int = defaultTimeout): P
// The previous implementation used `!proc.running` which is true
// before the child has begun and after it has exited, so it was
// both racy at startup and stopped polling once running latched.
// L2.3: also break out if the shared `cancelToken` flips to true,
// which the speculative-search winner sets after its CAS succeeds.
var elapsed: real = 0.0;
const pollInterval: real = 0.1;
var preempted = false;
while subproc.running && elapsed < timeout:real {
sleep(pollInterval);
elapsed += pollInterval;
subproc.poll();
if cancelToken != nil && cancelToken!.cancelled.read() {
preempted = true;
break;
}
}

if preempted {
// L2.3 preemption: the speculative-search winner has been
// declared elsewhere. SIGKILL ourselves and return a
// distinct exitCode = -5 so the caller's result table
// distinguishes preempted-loser from timed-out from failed.
subproc.sendPosixSignal(9);
subproc.wait();
timer.stop();
try { remove(tmpFile); } catch { }
return new ProofResult(
success = false,
prover = info.name,
proverId = info.id,
time = timer.elapsed(),
exitCode = -5,
output = "Preempted by speculative-search winner",
category = info.category
);
}

if subproc.running {
Expand Down Expand Up @@ -370,21 +423,25 @@ proc parallelProofSearch(goal: string, provers: [] ProverInfo,
// - parallelProofSearch waits for ALL tasks then picks the fastest
// success. Wall time is bounded by the slowest prover.
// - parallelProofSearchSpeculative records the first-completing
// success via an atomic CAS and returns it. Wall time is bounded
// by the fastest successful prover plus whatever in-flight tasks
// have already begun (no mid-flight preemption in Wave 1).
// success via an atomic CAS and SIGKILLs any still-running losers
// via the shared `CancelToken`. Wall time is bounded by the
// fastest successful prover plus one poll-interval (~100 ms) of
// observation lag on the losers.
//
// Wave 1 scope deliberately stops short of the "kill in-flight losers"
// step. Once a prover has been spawned, `tryProver` runs it to its own
// per-prover timeout — coforall doesn't know how to cancel a child task
// that's blocked on subprocess I/O. L2.3 introduces a cancel token
// threaded through `tryProver` so the children can SIGKILL their own
// subprocesses when they observe a winner; that's a separate PR.
// L2.3 cancellation: the winning task's CAS is paired with a write to
// the shared `CancelToken`. Loser tasks read the token at every poll
// step in `tryProver` and self-SIGKILL their subprocess as soon as
// the flag is observed. The atomic-bool semantics give us a
// happens-before edge from CAS-success to next-loser-poll, so no
// further locking is needed.
//
// Cancellation-safety today: every losing task is run to its natural
// end. The result table records all outcomes; only the winning index
// (the first successful CAS) is returned. No prover's exit can poison
// the join because the atomic CAS is monotone — the first winner wins.
// Soundness: the monotone first-wins CAS still controls which index
// is returned. The cancellation flag only affects how the LOSERS
// terminate — their results land in the table with exitCode = -5
// instead of running to completion, but they are never returned to
// the caller because `winner` is set before any cancellation could
// race the CAS. See proofs/agda/ParallelSoundness.agda:
// `cancellation-safety` for the formal statement.
proc parallelProofSearchSpeculative(goal: string, provers: [] ProverInfo,
timeout: int = defaultTimeout): ProofResult {
if verbose then
Expand All @@ -397,15 +454,20 @@ proc parallelProofSearchSpeculative(goal: string, provers: [] ProverInfo,
var results: [provers.domain] ProofResult;
var winnerIdx: atomic int;
winnerIdx.write(-1);
var cancelToken = new owned CancelToken();

coforall (prover, i) in zip(provers, provers.domain) {
results[i] = tryProver(prover, goal, timeout);
results[i] = tryProver(prover, goal, timeout, cancelToken.borrow());

if results[i].success {
// Monotone first-wins CAS: only the first successful
// worker flips the atomic from -1 to its own index.
// worker flips the atomic from -1 to its own index. The
// paired write to cancelToken signals every still-polling
// loser to SIGKILL its subprocess on the next poll.
var expected = -1;
winnerIdx.compareAndSwap(expected, i);
if winnerIdx.compareAndSwap(expected, i) {
cancelToken.cancelled.write(true);
}
}
}

Expand Down
Loading