Skip to content

fix(appkit): decouple cache in-flight execution from per-caller abort…#398

Merged
pkosiec merged 2 commits into
mainfrom
mario/6220b406
May 25, 2026
Merged

fix(appkit): decouple cache in-flight execution from per-caller abort…#398
pkosiec merged 2 commits into
mainfrom
mario/6220b406

Conversation

@MarioCadenas
Copy link
Copy Markdown
Collaborator

@MarioCadenas MarioCadenas commented May 21, 2026

Summary

Under React 19 <StrictMode>, navigating to /analytics in dev surfaced this error in the browser console:

[useAnalyticsQuery] Code: UPSTREAM_ERROR, Message: Statement failed: The operation was aborted.
Removing <StrictMode> made it disappear, which made it look like a UI quirk. It wasn't — it was a real cache bug that StrictMode happened to expose.

Root cause

CacheManager.getOrExecute dedupes concurrent callers with the same cacheKey by sharing a single Promise in inFlightRequests. Under StrictMode:

  1. Mount 1 of each useAnalyticsQuery fires a request → server registers an in-flight promise for the query's cacheKey.
  2. StrictMode cleanup aborts mount 1's fetch → server's res.on("close") aborts the stream's AbortController with "Client disconnected" → the SDK's executor.query(..., signal) rejects with Statement failed: The operation was aborted. (an ExecutionError carrying statusCode).
  3. Mount 2 (still alive) hits the in-flight branch and awaits the same rejected promise.
  4. stream-manager._processGeneratorInBackground catches that rejection on mount 2's stream entry (which still has a connected client), categorizes it as UPSTREAM_ERROR (because statusCode is set and error.name !== "AbortError"), and broadcasts it to the browser.
    Smoking-gun log from the catch on mount 2's stream:
    {"clientsRemaining": 1, "errorCode": "UPSTREAM_ERROR", "abortReason": }
    clientsRemaining: 1 + no abortReason proves the abort came in via the shared in-flight promise, not via this stream's own controller.
    This is impossible without StrictMode (only one mount → no inflight join) and impossible on the reconnect view (different code path, no CacheInterceptor).

Fix

Reference-count the cache's in-flight execution behind a cache-owned AbortController:

  • inFlightRequests now holds { promise, refCount, sharedController } instead of bare Promises.
  • getOrExecute(key, fn, userKey, options) accepts options.callerSignal and passes a cache-owned sharedController.signal into fn(sharedSignal). The shared controller is aborted only when refCount drops to 0 — i.e. every caller has bailed.
  • New private _waitWithRefCount(entry, callerSignal) races the entry's promise against each caller's own signal. A caller-side abort rejects that caller's promise locally and decrements refCount; other still-connected awaiters continue to share the underlying execution.
  • The catch block no longer wraps abort-shaped errors as ExecutionError.statementFailed when the shared controller has already aborted (no live awaiter would observe them anyway), and entry.promise.catch(() => {}) suppresses unhandled-rejection warnings when every caller leaves before fn() resolves.
  • CacheInterceptor forwards context.signal as callerSignal and swaps context.signal to the cache's sharedSignal for the duration of fn() so the inner interceptor chain (timeout/retry/telemetry) and the underlying I/O (e.g. analytics executor.query(..., signal)) observe the shared lifecycle instead of the per-request stream signal.
    External callers (e.g. cache.getOrExecute(key, async () => {...}, userKey)) keep working unchanged because the new sharedSignal parameter is optional.

Test plan

  • pnpm exec vitest run --project appkit src/cache src/plugin src/stream src/plugins/analytics — 715 tests, all green.
  • Manual repro in apps/dev-playground: load /analytics under <StrictMode> — no UPSTREAM_ERROR in console; tiles load with data.
  • Verified cache dedup still works post-fix (the in-flight join branch still fires per logs; only the shared-fate-on-abort behavior changed).
  • Reviewer sanity-check: confirm there's no behavior regression for non-deduped callers (e.g. telemetry-example-plugin.ts's direct cache.getOrExecute use without a signal).

@MarioCadenas MarioCadenas requested a review from pkosiec May 21, 2026 14:34
@MarioCadenas MarioCadenas marked this pull request as ready for review May 21, 2026 14:35
@MarioCadenas MarioCadenas requested a review from a team as a code owner May 21, 2026 14:35
… signal

The cache in-flight deduplication map shared a single Promise across
concurrent callers with the same cacheKey. When one caller's signal
aborted (e.g. a React StrictMode mount/cleanup pair), fn()'s rejection
cascaded through the shared promise to every other awaiter — including
still-connected SSE consumers, which broadcast the abort as
UPSTREAM_ERROR to the browser.

Replace the bare promise map with a reference-counted in-flight entry
owning its own AbortController. Callers pass their own callerSignal;
the shared controller aborts only when every caller has bailed
(refCount -> 0). Each caller's await is raced against its own signal so
local aborts reject locally without poisoning the shared result. The
catch block no longer wraps abort errors as ExecutionError.statementFailed
when the shared controller has already aborted, since no live awaiter
would observe them anyway.

CacheInterceptor forwards context.signal as callerSignal and swaps
context.signal to the cache's shared signal for the duration of fn() so
the inner interceptor chain (timeout/retry/telemetry) and the underlying
I/O (e.g. analytics SDK calls) observe the shared lifecycle rather than
the per-request stream signal.

Existing cache + plugin + stream + analytics test suites pass unchanged
(715 tests).

Signed-off-by: MarioCadenas <MarioCadenas@users.noreply.github.com>
@pkosiec pkosiec force-pushed the mario/6220b406 branch 2 times, most recently from 36f0243 to 8457ca2 Compare May 22, 2026 14:38
Copy link
Copy Markdown
Member

@pkosiec pkosiec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this PR and it works! 👍

Comment thread packages/appkit/src/cache/index.ts
@pkosiec pkosiec force-pushed the mario/6220b406 branch 2 times, most recently from ff5da21 to afd0cb7 Compare May 25, 2026 10:13
Copy link
Copy Markdown
Member

@pkosiec pkosiec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after some improvements to the original PR.

What the last commit improves over the original

1. Race condition fix

Problem: A caller could join an already-aborted entry by incrementing refCount on a dead InFlightEntry (one whose sharedController was already aborted). The .finally() cleanup also unconditionally deleted the map entry, which could remove a newer entry for the same key that had been created after the abort.

Fix:

  • Added guard !existing.sharedController.signal.aborted when joining an existing entry — prevents latching onto a zombie entry
  • Made .finally() conditional: if (this.inFlightRequests.get(cacheKey) === entry) — only cleans up if the map still holds this entry, not a newer one

2. Grace period for React StrictMode

Problem: The abort fired immediately when refCount hit 0. Under React StrictMode's mount→unmount→remount cycle, the unmount fires the abort before the remount can join the in-flight entry, causing a fresh execution instead of sharing the already-in-progress one.

Fix:

  • Added ABORT_GRACE_PERIOD_MS = 100 constant and abortTimer field on InFlightEntry
  • Changed release() to use setTimeout — delays the abort so a remount arriving within the grace period can cancel the timer and join the existing entry
  • New callers joining cancel any pending abort timer via clearTimeout
  • clear() method cleans up pending timers to prevent leaks

3. Proper AbortError for stream disconnects

Problem: stream-manager.ts was passing plain strings (e.g. "Server shutdown", "Client disconnected") to controller.abort(). This meant signal.reason was a string rather than a proper DOMException with name === "AbortError". Downstream code checking for abort errors via instanceof DOMException or error.name === "AbortError" would fail to detect these as aborts, potentially misclassifying them.

Fix: Changed all 3 abort call sites to use new DOMException("...", "AbortError").

4. Span lifecycle fix

  • Removed premature span.end() in the dedup path — the original called span.end() before await waitWithRefCount(), which meant the span was closed before the awaited work completed. The span is now properly ended by the finally block of the outer try/catch
  • Removed premature span.setStatus(OK) in the dedup path — if the caller aborts mid-wait, the span was already marked OK before the outer catch re-sets it to ERROR. Now both dedup and cache-miss paths set status consistently after the await resolves

5. Comprehensive test suite

Added 11 new tests covering:

  • One caller aborts while another waits — waiting caller resolves normally
  • All callers abort — shared controller signal is aborted (with grace period)
  • Pre-aborted callerSignal throws immediately without executing fn
  • Single caller abort mid-flight
  • Deduped caller abort doesn't poison first caller
  • fn rejects while multiple callers wait — all receive the error
  • Caller aborts after promise already resolved — gets the resolved value
  • New caller after previous entry fully aborted gets fresh execution
  • Grace period: new caller joins before timer fires — no abort, single execution
  • Grace period: no new caller arrives — timer fires and aborts shared controller
  • Non-signal caller regression (fn ignores signal parameter)

Also updated all ~20 test files' getOrExecute mock signatures to match the new API.

…d, abort types

Address review findings and edge cases in the ref-counted cache abort
handling:

- Prevent callers from joining already-aborted in-flight entries by
  guarding on sharedController.signal.aborted
- Add 100ms grace period before aborting shared execution so React
  StrictMode remounts can rejoin the in-flight entry
- Make .finally() cleanup conditional to avoid removing a newer entry
  for the same cache key
- Use proper DOMException("…", "AbortError") for stream disconnect
  abort reasons instead of plain strings
- Remove premature span.setStatus(OK) in dedup path — let the outer
  try/catch handle status consistently
- Rename _waitWithRefCount → waitWithRefCount (TypeScript private, not
  convention prefix)
- Clear pending abort timers on cache clear() and when new callers join
- Add comprehensive test suite for abort/ref-counting (11 tests)
- Update all getOrExecute mock signatures across test files

Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
@pkosiec
Copy link
Copy Markdown
Member

pkosiec commented May 25, 2026

Tested both locally and on prod - everything works 👍

Merging without an additional review - Mario is OOO and the changes in my commit are hardening fixes on top of his original ref-counting implementation:

  • Defensive guards: prevent joining already-aborted entries, conditional .finally() cleanup to avoid removing newer entries for the same key, clearing abort timers on cache.clear()
  • 100ms grace period before aborting shared execution so React StrictMode remounts can rejoin in-flight entries
  • Proper DOMException("…", "AbortError") instead of plain strings for abort reasons in stream-manager
  • Removed premature span.setStatus(OK) / span.end() in the dedup path
  • 11 new tests covering abort/ref-counting edge cases (single abort, dual abort, grace period join/expiry, pre-aborted signal, fn rejection, post-resolve abort, fresh execution after full abort)
  • ~20 test files with mechanical mock signature updates (fn: () => Promise<T>fn: (signal?: AbortSignal) => Promise<T>) to match the new getOrExecute API

No architectural changes, no new dependencies — just edge-case fixes and test coverage for the feature Mario introduced in the first commit.

@pkosiec pkosiec merged commit 37a6b0f into main May 25, 2026
9 checks passed
@pkosiec pkosiec deleted the mario/6220b406 branch May 25, 2026 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants