Fix infinite 401 retry loop when app logs resubscribe fails#7055
Fix infinite 401 retry loop when app logs resubscribe fails#7055craigmichaelmartin wants to merge 4 commits intomainfrom
Conversation
When the JWT for app log polling expires, handleFetchAppLogsError calls onResubscribe() to get a fresh token. If onResubscribe() itself throws (network error, expired session), the exception propagates unhandled, causing the dev polling loop to retry every 5s with the same expired JWT — an infinite 401 loop. Wrap onResubscribe() in a try-catch so failures fall back to a 60s retry interval instead of propagating. This is a 12x reduction in 401 rate per affected client and allows self-healing when the underlying issue resolves.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
We detected some changes at Caution DO NOT create changesets for features which you do not wish to be included in the public changelog of the next CLI release. |
Coverage report
Test suite run success3912 tests passing in 1502 suites. Report generated by 🧪jest coverage report action from 9ec758c |
craigmichaelmartin
left a comment
There was a problem hiding this comment.
Review assisted by pair-review
|
|
||
| // Then | ||
| expect(failingResubscribe).toHaveBeenCalled() | ||
| expect(vi.getTimerCount()).toEqual(1) |
There was a problem hiding this comment.
🐛 Bug: This test's assertions — failingResubscribe was called and vi.getTimerCount() === 1 — are satisfied by two different code paths. Without the new try-catch in handleFetchAppLogsError, the rejected promise from onResubscribe() propagates up to pollAppLogs's outer catch block (line 131 of poll-app-logs.ts), which also schedules a retry timer (at 5s instead of 60s). So removing the fix entirely would not cause this test to fail. Two missing assertions compound the problem: (1) no verification that the inner catch path was taken, and (2) no verification that the retry interval is 60s rather than 5s — which is the key behavioral change this fix introduces.
Suggestion: Add assertions that pin down the exact code path and interval. For example:
// Given
const outputDebugSpy = vi.spyOn(output, 'outputDebug')
const timeoutSpy = vi.spyOn(global, 'setTimeout')
...
// Then
expect(failingResubscribe).toHaveBeenCalled()
expect(outputDebugSpy).toHaveBeenCalledWith(expect.stringContaining('Failed to resubscribe'))
expect(timeoutSpy).toHaveBeenCalledWith(expect.any(Function), 60000)
expect(vi.getTimerCount()).toEqual(1)The outputDebug spy confirms the inner catch was entered (not the outer catch, which calls outputWarn). The setTimeout spy confirms the 60s interval, following the pattern in usePollAppLogs.test.tsx line 414.
There was a problem hiding this comment.
Resolved in fa2d200 — added outputDebug spy, outputWarn spy (verifying the inner catch triggers onThrottle → outputWarn, not the outer catch), and setTimeout spy asserting the 60s interval.
| expect(outputResult).not.toHaveBeenCalled() | ||
| }) | ||
|
|
||
| test('should handle 401 with resubscribe failure and retry at throttle interval', async () => { |
There was a problem hiding this comment.
💬 Suggestion: The test name says "handle 401 with resubscribe failure" but handleFetchAppLogsError is fully mocked (lines 111-114), so the actual try-catch in utils.ts is never exercised. This test verifies that renderJsonLogs correctly handles a null-token/throttle-interval return value — which is valuable caller-integration coverage — but the name suggests it tests the failure path itself. The actual fix is only integration-tested via poll-app-logs.test.ts (line 388), which provides a real failingResubscribe callback. This follows the file's existing pattern of mocking handleFetchAppLogsError (same approach as the 500 error test at line 133), so it's consistent — but the name could be more precise.
Suggestion: Consider renaming to clarify what's actually being tested:
| test('should handle 401 with resubscribe failure and retry at throttle interval', async () => { | |
| test('should retry at throttle interval when handleFetchAppLogsError returns null token', async () => { |
There was a problem hiding this comment.
Resolved in fa2d200 — renamed to 'should retry at throttle interval when handleFetchAppLogsError returns null token'.
|
|
||
| expect(handleFetchAppLogsError).toHaveBeenCalled() | ||
| expect(pollAppLogs).toHaveBeenCalled() | ||
| expect(vi.getTimerCount()).toEqual(1) |
There was a problem hiding this comment.
💡 Improvement: The test sets throttleRetryInterval = 60000 and has handleFetchAppLogsError return it, but only asserts vi.getTimerCount() — not the interval value. The key behavioral difference between this scenario and a normal retry is the 60s interval. Without verifying it, this test is indistinguishable from a normal polling retry test.
Suggestion: Add an assertion on setTimeout to verify the interval:
| expect(vi.getTimerCount()).toEqual(1) | |
| expect(handleFetchAppLogsError).toHaveBeenCalled() | |
| expect(pollAppLogs).toHaveBeenCalled() | |
| expect(setTimeout).toHaveBeenCalledWith(expect.any(Function), throttleRetryInterval) | |
| expect(vi.getTimerCount()).toEqual(1) |
There was a problem hiding this comment.
Resolved in fa2d200 — added expect(setTimeout).toHaveBeenCalledWith(expect.any(Function), throttleRetryInterval) assertion.
| try { | ||
| nextJwtToken = await input.onResubscribe() | ||
| } catch (resubscribeError) { | ||
| outputDebug(`Failed to resubscribe to app logs: ${resubscribeError}`) |
There was a problem hiding this comment.
💡 Improvement: When resubscribe fails, only outputDebug is called — invisible in normal mode. This differs from the established pattern in this function: the 429 branch calls input.onThrottle() (line 130) and the generic error branch calls input.onUnknownError() (line 133), both surfacing user-visible messages in all three callers (warnings via outputWarn in poll-app-logs, structured JSON in render-json-logs, React error state in usePollAppLogs). Users will see no indication that authentication failed and the CLI is backing off for 60 seconds.
Suggestion: Consider whether calling input.onThrottle(retryIntervalMs) here would be appropriate, since the behavior (backing off for 60s) is analogous to the 429 case. This would give all three callers the opportunity to inform the user. Alternatively, a new callback like onResubscribeError could provide more specific messaging if the throttle framing doesn't fit.
There was a problem hiding this comment.
Resolved in fa2d200 — went with calling input.onThrottle(retryIntervalMs) in the catch block since the behavior (60s backoff) is analogous to the 429 case. All three callers now surface a user-visible message on resubscribe failure.
Code Review from River 🌊Overall assessment: Targeted, correct fix for the immediate symptom. Should ship — it'll reduce load by ~99% for this failure mode. But a follow-up is needed. What this PR gets right
What remains unfixed
Observability contextFrom Observe data (last 1h), the App Management GraphQL endpoint is receiving ~22K 401 requests/hour — all of them are |
dmerand
left a comment
There was a problem hiding this comment.
Dropping in a bug that looks like it needs addressing.
| @@ -109,17 +110,26 @@ export interface AppLogsOptions { | |||
|
|
|||
| export const handleFetchAppLogsError = async ( | |||
There was a problem hiding this comment.
resubscribeFailed is not enough to drive the new counter reset logic. A false value currently means either "resubscribe succeeded" or "this was some other error path, like 429/500, so no resubscribe happened". All three callers reset their consecutive-failure counter on false, which means unrelated errors between two 401 resubscribe failures can clear the counter and prevent the cutoff from triggering when it should.
Either return a distinct state for successful resubscribe vs. non-401 errors, or have the callers reset only when a fresh token is actually returned. This affects [[file:packages/app/src/cli/services/app-logs/dev/poll-app-logs.ts:107-117]], [[file:packages/app/src/cli/services/app-logs/logs-command/render-json-logs.ts:44-54]], and [[file:packages/app/src/cli/services/app-logs/logs-command/ui/components/hooks/usePollAppLogs.ts:169-179]].

Summary
onResubscribe()inhandleFetchAppLogsErrorwith try-catch so resubscribe failures use a 60s retry interval instead of propagating as unhandled exceptionsdev/poll-app-logs.ts(catch block retried every 5s with the same expired JWT)logs-command/render-json-logs.tslogs-command/usePollAppLogs.tspoll-app-logsandrender-json-logsWHY are these changes introduced?
WHAT is this pull request doing?
How to test your changes?
Post-release steps
Measuring impact
How do we know this change was effective? Please choose one:
Checklist