Skip to content

Fix infinite 401 retry loop when app logs resubscribe fails#7055

Open
craigmichaelmartin wants to merge 4 commits intomainfrom
03-19-fix_infinite_401_retry_loop_when_app_logs_resubscribe_fails
Open

Fix infinite 401 retry loop when app logs resubscribe fails#7055
craigmichaelmartin wants to merge 4 commits intomainfrom
03-19-fix_infinite_401_retry_loop_when_app_logs_resubscribe_fails

Conversation

@craigmichaelmartin
Copy link
Contributor

@craigmichaelmartin craigmichaelmartin commented Mar 19, 2026

Summary

  • Wrap onResubscribe() in handleFetchAppLogsError with try-catch so resubscribe failures use a 60s retry interval instead of propagating as unhandled exceptions
  • Fixes infinite 401 retry loop in dev/poll-app-logs.ts (catch block retried every 5s with the same expired JWT)
  • Fixes unhandled crash in logs-command/render-json-logs.ts
  • Fixes silent polling stop in logs-command/usePollAppLogs.ts
  • Adds tests for the resubscribe failure path in both poll-app-logs and render-json-logs

WHY are these changes introduced?

WHAT is this pull request doing?

How to test your changes?

Post-release steps

Measuring impact

How do we know this change was effective? Please choose one:

  • n/a - this doesn't need measurement, e.g. a linting rule or a bug-fix
  • Existing analytics will cater for this addition
  • PR includes analytics changes to measure impact

Checklist

  • I've considered possible cross-platform impacts (Mac, Linux, Windows)
  • I've considered possible documentation changes

When the JWT for app log polling expires, handleFetchAppLogsError calls
onResubscribe() to get a fresh token. If onResubscribe() itself throws
(network error, expired session), the exception propagates unhandled,
causing the dev polling loop to retry every 5s with the same expired
JWT — an infinite 401 loop.

Wrap onResubscribe() in a try-catch so failures fall back to a 60s
retry interval instead of propagating. This is a 12x reduction in 401
rate per affected client and allows self-healing when the underlying
issue resolves.
Copy link
Contributor Author

craigmichaelmartin commented Mar 19, 2026

@craigmichaelmartin craigmichaelmartin marked this pull request as ready for review March 19, 2026 19:51
@github-actions
Copy link
Contributor

We detected some changes at packages/*/src and there are no updates in the .changeset.
If the changes are user-facing, run pnpm changeset add to track your changes and include them in the next release CHANGELOG.

Caution

DO NOT create changesets for features which you do not wish to be included in the public changelog of the next CLI release.

@github-actions
Copy link
Contributor

Coverage report

St.
Category Percentage Covered / Total
🟢 Statements 82.21% 14930/18160
🟡 Branches 74.65% 7393/9904
🟢 Functions 81.36% 3767/4630
🟢 Lines 82.61% 14117/17089

Test suite run success

3912 tests passing in 1502 suites.

Report generated by 🧪jest coverage report action from 9ec758c

Copy link
Contributor Author

@craigmichaelmartin craigmichaelmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Review assisted by pair-review


// Then
expect(failingResubscribe).toHaveBeenCalled()
expect(vi.getTimerCount()).toEqual(1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: This test's assertions — failingResubscribe was called and vi.getTimerCount() === 1 — are satisfied by two different code paths. Without the new try-catch in handleFetchAppLogsError, the rejected promise from onResubscribe() propagates up to pollAppLogs's outer catch block (line 131 of poll-app-logs.ts), which also schedules a retry timer (at 5s instead of 60s). So removing the fix entirely would not cause this test to fail. Two missing assertions compound the problem: (1) no verification that the inner catch path was taken, and (2) no verification that the retry interval is 60s rather than 5s — which is the key behavioral change this fix introduces.

Suggestion: Add assertions that pin down the exact code path and interval. For example:

// Given
const outputDebugSpy = vi.spyOn(output, 'outputDebug')
const timeoutSpy = vi.spyOn(global, 'setTimeout')
...
// Then
expect(failingResubscribe).toHaveBeenCalled()
expect(outputDebugSpy).toHaveBeenCalledWith(expect.stringContaining('Failed to resubscribe'))
expect(timeoutSpy).toHaveBeenCalledWith(expect.any(Function), 60000)
expect(vi.getTimerCount()).toEqual(1)

The outputDebug spy confirms the inner catch was entered (not the outer catch, which calls outputWarn). The setTimeout spy confirms the 60s interval, following the pattern in usePollAppLogs.test.tsx line 414.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in fa2d200 — added outputDebug spy, outputWarn spy (verifying the inner catch triggers onThrottleoutputWarn, not the outer catch), and setTimeout spy asserting the 60s interval.

expect(outputResult).not.toHaveBeenCalled()
})

test('should handle 401 with resubscribe failure and retry at throttle interval', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Suggestion: The test name says "handle 401 with resubscribe failure" but handleFetchAppLogsError is fully mocked (lines 111-114), so the actual try-catch in utils.ts is never exercised. This test verifies that renderJsonLogs correctly handles a null-token/throttle-interval return value — which is valuable caller-integration coverage — but the name suggests it tests the failure path itself. The actual fix is only integration-tested via poll-app-logs.test.ts (line 388), which provides a real failingResubscribe callback. This follows the file's existing pattern of mocking handleFetchAppLogsError (same approach as the 500 error test at line 133), so it's consistent — but the name could be more precise.

Suggestion: Consider renaming to clarify what's actually being tested:

Suggested change
test('should handle 401 with resubscribe failure and retry at throttle interval', async () => {
test('should retry at throttle interval when handleFetchAppLogsError returns null token', async () => {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in fa2d200 — renamed to 'should retry at throttle interval when handleFetchAppLogsError returns null token'.


expect(handleFetchAppLogsError).toHaveBeenCalled()
expect(pollAppLogs).toHaveBeenCalled()
expect(vi.getTimerCount()).toEqual(1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Improvement: The test sets throttleRetryInterval = 60000 and has handleFetchAppLogsError return it, but only asserts vi.getTimerCount() — not the interval value. The key behavioral difference between this scenario and a normal retry is the 60s interval. Without verifying it, this test is indistinguishable from a normal polling retry test.

Suggestion: Add an assertion on setTimeout to verify the interval:

Suggested change
expect(vi.getTimerCount()).toEqual(1)
expect(handleFetchAppLogsError).toHaveBeenCalled()
expect(pollAppLogs).toHaveBeenCalled()
expect(setTimeout).toHaveBeenCalledWith(expect.any(Function), throttleRetryInterval)
expect(vi.getTimerCount()).toEqual(1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in fa2d200 — added expect(setTimeout).toHaveBeenCalledWith(expect.any(Function), throttleRetryInterval) assertion.

try {
nextJwtToken = await input.onResubscribe()
} catch (resubscribeError) {
outputDebug(`Failed to resubscribe to app logs: ${resubscribeError}`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Improvement: When resubscribe fails, only outputDebug is called — invisible in normal mode. This differs from the established pattern in this function: the 429 branch calls input.onThrottle() (line 130) and the generic error branch calls input.onUnknownError() (line 133), both surfacing user-visible messages in all three callers (warnings via outputWarn in poll-app-logs, structured JSON in render-json-logs, React error state in usePollAppLogs). Users will see no indication that authentication failed and the CLI is backing off for 60 seconds.

Suggestion: Consider whether calling input.onThrottle(retryIntervalMs) here would be appropriate, since the behavior (backing off for 60s) is analogous to the 429 case. This would give all three callers the opportunity to inform the user. Alternatively, a new callback like onResubscribeError could provide more specific messaging if the throttle framing doesn't fit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in fa2d200 — went with calling input.onThrottle(retryIntervalMs) in the catch block since the behavior (60s backoff) is analogous to the 429 case. All three callers now surface a user-visible message on resubscribe failure.

@craigmichaelmartin
Copy link
Contributor Author

Code Review from River 🌊

Overall assessment: Targeted, correct fix for the immediate symptom. Should ship — it'll reduce load by ~99% for this failure mode. But a follow-up is needed.

What this PR gets right

  • Correctly identifies the unhandled exception in the resubscribe path as the root cause of the retry storm
  • Uses the existing throttle interval (60s) which is reasonable
  • Good test coverage for both the poll and JSON render code paths
  • Small, focused change — minimal risk

What remains unfixed

  1. The CLI never refreshes the Identity token. The resubscribeCallback just calls subscribeToAppLogs again with the same expired bearer token. So the CLI will keep retrying every 60s indefinitely — failing every time — until the user manually restarts. The session becomes a zombie.

  2. No user-facing error message. The outputDebug is only visible in debug mode. A regular user will see their log stream silently stop with no indication of what happened. Consider adding an outputWarn like: "Your session token has expired. Please restart shopify app logs."

  3. No max retry count or escalation. Consider capping retries and then surfacing a clear error asking the user to restart.

Observability context

From Observe data (last 1h), the App Management GraphQL endpoint is receiving ~22K 401 requests/hour — all of them are AppLogsSubscribe, from only ~8 distinct CLI sessions. This confirms the tight retry loop this PR addresses. The 60s backoff should reduce this to ~480 requests/hour from those same clients.

Copy link
Contributor

@dmerand dmerand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping in a bug that looks like it needs addressing.

@@ -109,17 +110,26 @@ export interface AppLogsOptions {

export const handleFetchAppLogsError = async (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resubscribeFailed is not enough to drive the new counter reset logic. A false value currently means either "resubscribe succeeded" or "this was some other error path, like 429/500, so no resubscribe happened". All three callers reset their consecutive-failure counter on false, which means unrelated errors between two 401 resubscribe failures can clear the counter and prevent the cutoff from triggering when it should.

Either return a distinct state for successful resubscribe vs. non-401 errors, or have the callers reset only when a fresh token is actually returned. This affects [[file:packages/app/src/cli/services/app-logs/dev/poll-app-logs.ts:107-117]], [[file:packages/app/src/cli/services/app-logs/logs-command/render-json-logs.ts:44-54]], and [[file:packages/app/src/cli/services/app-logs/logs-command/ui/components/hooks/usePollAppLogs.ts:169-179]].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants