fix(appkit): decouple cache in-flight execution from per-caller abort…#398
Conversation
… signal The cache in-flight deduplication map shared a single Promise across concurrent callers with the same cacheKey. When one caller's signal aborted (e.g. a React StrictMode mount/cleanup pair), fn()'s rejection cascaded through the shared promise to every other awaiter — including still-connected SSE consumers, which broadcast the abort as UPSTREAM_ERROR to the browser. Replace the bare promise map with a reference-counted in-flight entry owning its own AbortController. Callers pass their own callerSignal; the shared controller aborts only when every caller has bailed (refCount -> 0). Each caller's await is raced against its own signal so local aborts reject locally without poisoning the shared result. The catch block no longer wraps abort errors as ExecutionError.statementFailed when the shared controller has already aborted, since no live awaiter would observe them anyway. CacheInterceptor forwards context.signal as callerSignal and swaps context.signal to the cache's shared signal for the duration of fn() so the inner interceptor chain (timeout/retry/telemetry) and the underlying I/O (e.g. analytics SDK calls) observe the shared lifecycle rather than the per-request stream signal. Existing cache + plugin + stream + analytics test suites pass unchanged (715 tests). Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
36f0243 to
8457ca2
Compare
pkosiec
left a comment
There was a problem hiding this comment.
Tested this PR and it works! 👍
ff5da21 to
afd0cb7
Compare
There was a problem hiding this comment.
Approved after some improvements to the original PR.
What the last commit improves over the original
1. Race condition fix
Problem: A caller could join an already-aborted entry by incrementing refCount on a dead InFlightEntry (one whose sharedController was already aborted). The .finally() cleanup also unconditionally deleted the map entry, which could remove a newer entry for the same key that had been created after the abort.
Fix:
- Added guard
!existing.sharedController.signal.abortedwhen joining an existing entry — prevents latching onto a zombie entry - Made
.finally()conditional:if (this.inFlightRequests.get(cacheKey) === entry)— only cleans up if the map still holds this entry, not a newer one
2. Grace period for React StrictMode
Problem: The abort fired immediately when refCount hit 0. Under React StrictMode's mount→unmount→remount cycle, the unmount fires the abort before the remount can join the in-flight entry, causing a fresh execution instead of sharing the already-in-progress one.
Fix:
- Added
ABORT_GRACE_PERIOD_MS = 100constant andabortTimerfield onInFlightEntry - Changed
release()to usesetTimeout— delays the abort so a remount arriving within the grace period can cancel the timer and join the existing entry - New callers joining cancel any pending abort timer via
clearTimeout clear()method cleans up pending timers to prevent leaks
3. Proper AbortError for stream disconnects
Problem: stream-manager.ts was passing plain strings (e.g. "Server shutdown", "Client disconnected") to controller.abort(). This meant signal.reason was a string rather than a proper DOMException with name === "AbortError". Downstream code checking for abort errors via instanceof DOMException or error.name === "AbortError" would fail to detect these as aborts, potentially misclassifying them.
Fix: Changed all 3 abort call sites to use new DOMException("...", "AbortError").
4. Span lifecycle fix
- Removed premature
span.end()in the dedup path — the original calledspan.end()beforeawait waitWithRefCount(), which meant the span was closed before the awaited work completed. The span is now properly ended by thefinallyblock of the outer try/catch - Removed premature
span.setStatus(OK)in the dedup path — if the caller aborts mid-wait, the span was already marked OK before the outer catch re-sets it to ERROR. Now both dedup and cache-miss paths set status consistently after the await resolves
5. Comprehensive test suite
Added 11 new tests covering:
- One caller aborts while another waits — waiting caller resolves normally
- All callers abort — shared controller signal is aborted (with grace period)
- Pre-aborted callerSignal throws immediately without executing fn
- Single caller abort mid-flight
- Deduped caller abort doesn't poison first caller
- fn rejects while multiple callers wait — all receive the error
- Caller aborts after promise already resolved — gets the resolved value
- New caller after previous entry fully aborted gets fresh execution
- Grace period: new caller joins before timer fires — no abort, single execution
- Grace period: no new caller arrives — timer fires and aborts shared controller
- Non-signal caller regression (fn ignores signal parameter)
Also updated all ~20 test files' getOrExecute mock signatures to match the new API.
…d, abort types
Address review findings and edge cases in the ref-counted cache abort
handling:
- Prevent callers from joining already-aborted in-flight entries by
guarding on sharedController.signal.aborted
- Add 100ms grace period before aborting shared execution so React
StrictMode remounts can rejoin the in-flight entry
- Make .finally() cleanup conditional to avoid removing a newer entry
for the same cache key
- Use proper DOMException("…", "AbortError") for stream disconnect
abort reasons instead of plain strings
- Remove premature span.setStatus(OK) in dedup path — let the outer
try/catch handle status consistently
- Rename _waitWithRefCount → waitWithRefCount (TypeScript private, not
convention prefix)
- Clear pending abort timers on cache clear() and when new callers join
- Add comprehensive test suite for abort/ref-counting (11 tests)
- Update all getOrExecute mock signatures across test files
Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
|
Tested both locally and on prod - everything works 👍 Merging without an additional review - Mario is OOO and the changes in my commit are hardening fixes on top of his original ref-counting implementation:
No architectural changes, no new dependencies — just edge-case fixes and test coverage for the feature Mario introduced in the first commit. |
Summary
Under React 19
<StrictMode>, navigating to/analyticsin dev surfaced this error in the browser console:Root cause
CacheManager.getOrExecutededupes concurrent callers with the samecacheKeyby sharing a singlePromiseininFlightRequests. Under StrictMode:useAnalyticsQueryfires a request → server registers an in-flight promise for the query'scacheKey.fetch→ server'sres.on("close")aborts the stream'sAbortControllerwith"Client disconnected"→ the SDK'sexecutor.query(..., signal)rejects withStatement failed: The operation was aborted.(anExecutionErrorcarryingstatusCode).stream-manager._processGeneratorInBackgroundcatches that rejection on mount 2's stream entry (which still has a connected client), categorizes it asUPSTREAM_ERROR(becausestatusCodeis set anderror.name !== "AbortError"), and broadcasts it to the browser.Smoking-gun log from the catch on mount 2's stream:
{"clientsRemaining": 1, "errorCode": "UPSTREAM_ERROR", "abortReason": }
clientsRemaining: 1+ noabortReasonproves the abort came in via the shared in-flight promise, not via this stream's own controller.This is impossible without StrictMode (only one mount → no inflight join) and impossible on the reconnect view (different code path, no
CacheInterceptor).Fix
Reference-count the cache's in-flight execution behind a cache-owned
AbortController:inFlightRequestsnow holds{ promise, refCount, sharedController }instead of barePromises.getOrExecute(key, fn, userKey, options)acceptsoptions.callerSignaland passes a cache-ownedsharedController.signalintofn(sharedSignal). The shared controller is aborted only whenrefCountdrops to 0 — i.e. every caller has bailed._waitWithRefCount(entry, callerSignal)races the entry's promise against each caller's own signal. A caller-side abort rejects that caller's promise locally and decrementsrefCount; other still-connected awaiters continue to share the underlying execution.ExecutionError.statementFailedwhen the shared controller has already aborted (no live awaiter would observe them anyway), andentry.promise.catch(() => {})suppresses unhandled-rejection warnings when every caller leaves beforefn()resolves.CacheInterceptorforwardscontext.signalascallerSignaland swapscontext.signalto the cache'ssharedSignalfor the duration offn()so the inner interceptor chain (timeout/retry/telemetry) and the underlying I/O (e.g. analyticsexecutor.query(..., signal)) observe the shared lifecycle instead of the per-request stream signal.External callers (e.g.
cache.getOrExecute(key, async () => {...}, userKey)) keep working unchanged because the newsharedSignalparameter is optional.Test plan
pnpm exec vitest run --project appkit src/cache src/plugin src/stream src/plugins/analytics— 715 tests, all green.apps/dev-playground: load/analyticsunder<StrictMode>— noUPSTREAM_ERRORin console; tiles load with data.telemetry-example-plugin.ts's directcache.getOrExecuteuse without a signal).