
feat: add retry resilience to imsApiCall#1438

Merged
ravverma merged 3 commits into main from feat/ims-api-call-resilience
Mar 16, 2026
Conversation

@ravverma
Contributor

Summary

  • Added exponential-backoff retry logic to ImsBaseClient.imsApiCall
    to handle transient IMS failures without request-flooding the service
  • Max 2 retries (3 total attempts) — hard cap; retries only on 5xx
    server errors and 429 rate-limiting; never on 4xx client errors
  • Exponential back-off: 1 000 ms → 2 000 ms between attempts
    (retryBaseDelayMs is an overridable instance property so tests run
    at 0 ms without fake-timer complexity)
  • Audit-trail logging at every retry decision:
    • WARN on each retry: endpoint, HTTP status / network error,
      attempt N/3, next delay
    • INFO on successful recovery: endpoint, attempt that succeeded,
      total retries used
    • ERROR on final exhaustion after all 3 attempts
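The back-off schedule above can be sketched numerically; a minimal illustration of the described doubling (variable names assumed, not the PR's actual code):

```javascript
// Backoff schedule sketch for the values described above (names assumed):
// delay before retry n = retryBaseDelayMs * 2 ** n
const retryBaseDelayMs = 1000;
const delays = [0, 1].map((attempt) => retryBaseDelayMs * 2 ** attempt);
// delays is [1000, 2000]: 1 s before the first retry, 2 s before the second
```

Overriding `retryBaseDelayMs` to 0, as the tests do, collapses the whole schedule to zero delay without touching timers.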

Test plan

  • All 74 existing tests pass (no behavioural change for non-retryable
    scenarios; retryable-status mocks updated with .times(3))
  • New describe('imsApiCall retry behavior') block (6 tests):
    • retries on 5xx → success on attempt 2
    • retries on 429 → success on attempt 2
    • retries on network error → success on attempt 3
    • exhausts all 3 attempts on network error → throws
    • exhausts all 3 attempts on 5xx → caller receives last response
    • no retry on 4xx → returns immediately with zero warn logs
  • Line/statement coverage 100 %; branch coverage 99.48 % (> 97 % threshold)

🤖 Generated with Claude Code

Adds exponential-backoff retry logic to `ImsBaseClient.imsApiCall` to
handle transient IMS failures without flooding the service with requests.

- Max 2 retries (3 total attempts) — hard cap to protect IMS
- Retries on: 5xx server errors and 429 rate-limiting
- Never retries 4xx client errors (no point hammering on auth failures)
- Exponential back-off: 1 000 ms → 2 000 ms between attempts
- `retryBaseDelayMs` is an overridable instance property so tests run
  at 0 ms delay without fake timers
- Audit-trail logging: warn on each retry (endpoint, status/error,
  attempt N/3, delay); info on successful recovery (attempt used,
  retry count); error on final exhaustion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ravverma ravverma requested review from ekremney and solaris007 March 13, 2026 12:49
@ravverma ravverma added the enhancement New feature or request label Mar 13, 2026
@ravverma ravverma self-assigned this Mar 13, 2026
Member

@ekremney ekremney left a comment


Nice work — clean retry logic with good observability. Two suggestions:

1. Add a defensive throw after the loop to remove the eslint-disable consistent-return

The loop logic is correct today, but the linter can't prove it always returns/throws. An explicit throw makes the contract clear for future readers:

    } // end for
    throw new Error(`IMS ${endpoint}: retry loop exited unexpectedly`);
  }

2. Add jitter to the backoff delay

Fixed delays mean correlated callers (e.g. multiple Lambdas hitting IMS rate limits simultaneously) retry at the same wall-clock times. Adding ±20% random jitter is standard practice to avoid thundering herd:

const delay = this.retryBaseDelayMs * (2 ** attempt) * (0.8 + Math.random() * 0.4);
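A quick sanity check on the bounds this formula produces (illustrative sketch, not PR code; `this.retryBaseDelayMs` replaced by a local constant):

```javascript
// Sanity check on the suggested ±20% jitter range (illustrative, not PR code):
const retryBaseDelayMs = 1000;
const attempt = 1;
const delay = retryBaseDelayMs * (2 ** attempt) * (0.8 + Math.random() * 0.4);
// Math.random() is in [0, 1), so delay always falls in [1600, 2400):
// within ±20% of the deterministic 2000 ms for this attempt
```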

Member

@solaris007 solaris007 left a comment


Hey @ravverma,

Strengths

  • Excellent placement of retry logic at the imsApiCall chokepoint (ims-base-client.js:110) - all 11+ call sites across ImsClient and ImsPromiseClient get resilience automatically without caller changes.
  • #isRetryableStatus correctly limits retries to 429 and 5xx, never retrying 4xx auth errors like 401/403 which indicate configuration problems, not transient failures (ims-base-client.js:38-41).
  • Conservative retry budget (2 retries, 3 total attempts) with exponential backoff (1s, 2s) caps worst-case added latency at 3 seconds - appropriate for backend-to-backend auth service communication (ims-base-client.js:19-22).
  • Token endpoint POST retries are safe for IMS - both grant_type=authorization_code and grant_type=client_credentials are idempotent from the caller's perspective. Adobe IMS client codes are reusable, so duplicate POSTs simply return a new (or cached) token.
  • Headers are prepared once outside the retry loop (ims-base-client.js:112), ensuring the same credential is used across all attempts rather than re-acquiring tokens on each retry.
  • Thorough test coverage: dedicated retry behavior block (6 tests) covering 5xx, 429, network error recovery, retry exhaustion, and 4xx non-retry. Existing failure tests properly updated with .times(3) (ims-client.test.js:761-855).
  • Audit-trail logging at every retry decision point - warn on retry, info on recovery, error on exhaustion - with endpoint, status, attempt number, and delay included.

Issues

Important (Should Fix)

  1. Implicit undefined return masked by eslint-disable - ims-base-client.js:105

    The // eslint-disable-next-line consistent-return suppresses a legitimate structural warning. While all current code paths do terminate (the final attempt either returns the response or throws), the function has no explicit return/throw after the for-loop. If someone later changes the retry math or adds a condition, the function could silently return undefined, causing a confusing TypeError when the caller tries to call .ok or .json() on the result.

    How to fix: Remove the eslint-disable and add an explicit throw after the for-loop:

    } // end for
    throw new Error('imsApiCall: unexpected fall-through after retry loop');
  2. 429 responses do not respect Retry-After header - ims-base-client.js:122-136

    When IMS returns 429, it typically includes a Retry-After header indicating how long to wait. The code ignores this and uses the fixed exponential backoff (1s, 2s). If IMS says "back off for 60 seconds" and we retry after 1 second, we get 429 again - burning all three attempts for nothing. This is especially problematic in SpaceCat's multi-instance deployment where several service instances could hit the rate limit simultaneously, amplifying the pressure rather than relieving it.

    How to fix: Read response.headers.get('Retry-After') when status is 429. If present and numeric, use Math.max(parsedRetryAfter * 1000, delay) as the sleep duration. Fall back to exponential backoff when the header is absent.
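A minimal sketch of this fix, with assumed names (`nextDelayMs` is illustrative, not the PR's actual code):

```javascript
// Hypothetical helper: compute the sleep before the next retry.
// `response` is a fetch Response; `delay` is the exponential-backoff delay in ms.
function nextDelayMs(response, delay) {
  if (response.status !== 429) return delay;
  const header = response.headers.get('Retry-After');
  const seconds = Number.parseInt(header, 10);
  // Fall back to the backoff delay when the header is absent or non-numeric.
  if (Number.isNaN(seconds)) return delay;
  // Honor the server's requested wait, but never sleep less than the backoff.
  return Math.max(seconds * 1000, delay);
}
```

With `Retry-After: 60` this waits 60 s instead of burning the remaining attempts at 1 s and 2 s.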

  3. Response body not consumed on retryable HTTP errors - ims-base-client.js:122-136

    When a retryable status (e.g. 503) triggers a retry, the response object is discarded without consuming its body. With @adobe/fetch (undici under the hood), unconsumed response bodies hold the underlying socket open until GC or connection timeout. In a retry loop against a flapping IMS service, this could exhaust the HTTP/2 connection pool or cause socket leaks.

    How to fix: Add response.body?.cancel() or await response.text() before the continue statement to explicitly release the connection.

Minor (Nice to Have)

  1. callerName extraction via stack trace runs inside the retry loop - ims-base-client.js:139

    new Error().stack.split(...) is called on every successful attempt. While not expensive, it could be captured once before the loop starts and reused.

  2. Log messages use multi-line string concatenation - ims-base-client.js:127-133, 153-157, 161-164

    Template literals would be more readable than + concatenation across multiple lines. Style preference, not functional.

Recommendations

  • Add jitter to exponential backoff. The current delay is deterministic (baseDelay * 2^attempt). If multiple SpaceCat instances start their retry cycle simultaneously after an IMS blip, they all retry at exactly the same moments (thundering herd). Standard practice: delay * (0.5 + Math.random() * 0.5).
  • Track getImsAdminProfile / getImsAdminOrganizations consolidation. These methods use raw fetch() instead of imsApiCall(), bypassing retry, duration logging, and tracing. A follow-up PR to consolidate them would close the gap.
  • Document worst-case latency. getImsOrganizationDetails makes 3+ sequential imsApiCall calls. If each hits the retry ceiling, a single invocation could add 12-15 seconds of delay. Callers behind API Gateway (29s timeout) should be aware.
  • Consider making retry policy configurable via constructor. Accepting { maxRetries, baseDelayMs } in the config object would future-proof the design for callers needing different retry budgets.

Assessment

Ready to merge? With fixes

The retry mechanism is fundamentally sound - correct layer, correct status code filter, conservative budget, good test coverage, and safe for IMS token endpoint POSTs. The three Important items should be addressed before merge: the implicit return path (structural fragility), the Retry-After header (operational correctness for 429), and the unconsumed response body (connection pool health under retries).

- Add ±20% jitter to exponential backoff to avoid thundering herd
- Respect Retry-After header on 429 responses (takes precedence over base delay)
- Drain response body via response.body?.cancel?.() before discarding on retry to avoid socket leaks
- Move callerName extraction before the retry loop (avoids re-capturing on each attempt)
- Replace eslint-disable with explicit defensive throw after loop guarded by c8 ignore
- Add test for Retry-After header path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

This PR will trigger a minor release when merged.

@ravverma
Contributor Author

Hi @solaris007 @ekremney
thank you for the quick review. I've addressed the comments; please find the final change summary below.

feat: Add retry resilience to imsApiCall (#1438)

Problem

ImsBaseClient.imsApiCall had no retry logic. Transient IMS failures (5xx
server errors, 429 rate-limiting, network blips) surfaced immediately as
hard errors.

Solution

Added exponential backoff retry to imsApiCall in ImsBaseClient, with
safeguards to avoid flooding IMS.


Changes — ims-base-client.js

| # | What | Detail |
|---|------|--------|
| 1 | Max 2 retries (3 total attempts) | `static #MAX_RETRIES = 2` — hard cap prevents request floods |
| 2 | Retryable statuses only | 429 (rate-limit) and 5xx (server error); 4xx client errors are never retried |
| 3 | Exponential backoff | Base delay doubles each attempt: 1 000 ms → 2 000 ms |
| 4 | ±20% jitter | `delay * (0.8 + Math.random() * 0.4)` spreads retries to avoid thundering herd |
| 5 | Retry-After header support | On 429, reads `Retry-After` (seconds) and uses `Math.max(baseDelay, retryAfterMs)` |
| 6 | Response body draining | `await response.body?.cancel?.()` before retry to avoid socket leaks |
| 7 | Audit-trail logging | warn on each retry with endpoint, status, attempt count; info on recovery showing retries used |
| 8 | Configurable base delay | `retryBaseDelayMs` exposed as instance property so tests can set it to 0 without fake timers |
| 9 | Defensive throw after loop | Explicit `throw new Error('imsApiCall: retry loop exited unexpectedly')` replaces eslint-disable consistent-return |
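The behaviors listed above compose into a single loop; the following is an illustrative sketch under assumptions (all names invented here, not the actual `ims-base-client.js` implementation):

```javascript
// Sketch of the retry policy described above (names assumed, not PR code).
const MAX_RETRIES = 2; // 3 total attempts

const isRetryable = (status) => status === 429 || status >= 500;

async function callWithRetry(
  doFetch,
  { baseDelayMs = 1000, sleep = (ms) => new Promise((r) => { setTimeout(r, ms); }) } = {},
) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt += 1) {
    let response;
    try {
      response = await doFetch();
    } catch (err) {
      // Network error: rethrow on the last attempt, otherwise back off and retry.
      if (attempt === MAX_RETRIES) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
      continue;
    }
    // Non-retryable status, or last attempt: hand the response to the caller.
    if (!isRetryable(response.status) || attempt === MAX_RETRIES) return response;
    // Drain the body so the socket can be reused before retrying.
    await response.body?.cancel?.();
    // Exponential backoff; on 429, honor Retry-After (seconds) if numeric.
    let delay = baseDelayMs * 2 ** attempt;
    const retryAfter = Number.parseInt(response.headers?.get?.('Retry-After'), 10);
    if (response.status === 429 && !Number.isNaN(retryAfter)) {
      delay = Math.max(retryAfter * 1000, delay);
    }
    // ±20% jitter to spread correlated callers.
    await sleep(delay * (0.8 + Math.random() * 0.4));
  }
  // Defensive: the loop always returns or throws, but make the contract explicit.
  throw new Error('callWithRetry: retry loop exited unexpectedly');
}
```

Injecting `sleep` (or a zero `baseDelayMs`) keeps tests fast without fake timers, mirroring the `retryBaseDelayMs = 0` approach in the test suite.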

Changes — ims-client.test.js

  • Added client.retryBaseDelayMs = 0 in beforeEach for all suites that
    exercise retryable paths (zero delay keeps CI fast)
  • Added .times(3) on nock interceptors for all retryable status mocks
    (500, 503, 429) so retry attempts don't leak unmatched requests
  • Added new describe('imsApiCall retry behavior') suite with 7 tests:
    • Retries on 5xx, succeeds on 2nd attempt
    • Retries on 429, succeeds on 2nd attempt
    • Respects Retry-After header on 429
    • Retries on network error, succeeds on 3rd attempt
    • Exhausts all retries on network error, throws
    • Returns last bad response when all 5xx retries exhausted
    • Does not retry on 4xx client errors

Coverage

100% lines · 100% statements · 100% functions · ≥97% branches

@ravverma ravverma requested a review from solaris007 March 13, 2026 15:22
Member

@solaris007 solaris007 left a comment


Hey @ravverma,

Thanks for the focused follow-up - all three Important findings from the prior review have been addressed cleanly.

Strengths

  • All prior findings systematically addressed in a single commit - good engineering discipline.
  • Correct layering - retry at the imsApiCall chokepoint gives all 11+ call sites resilience automatically without caller changes (ims-base-client.js:110).
  • Jitter + Retry-After composition is well-designed: Math.max(baseDelay, retryAfterMs) * jitter respects IMS rate-limit signals while preventing thundering herd (ims-base-client.js:137-142).
  • response.body?.cancel?.() before retry correctly releases the underlying socket for connection reuse in the undici-backed @adobe/fetch (ims-base-client.js:148).
  • Defensive throw after the for-loop (ims-base-client.js:185) replaces the eslint-disable with a structural safety net.
  • callerName extraction moved before the loop - no longer runs on every attempt.
  • Auth-safe by design: tokens cannot expire during the ~3s worst-case retry window, FormData bodies are safely reusable across attempts, and noAuth: true on token endpoints prevents recursive token fetch during retries.
  • Thorough test suite: 7 dedicated retry tests covering 5xx recovery, 429 recovery, Retry-After header, network error recovery, exhaustion paths, and 4xx non-retry.

Issues

Minor (Nice to Have)

  1. Retry-After parsing assumes seconds format - ims-base-client.js:137-139

    parseInt(retryAfterHeader, 10) handles the delay-seconds format but not the HTTP-date format allowed by RFC 7231 (e.g., Thu, 13 Mar 2026 16:00:00 GMT). If an HTTP-date is received, parseInt returns NaN, which propagates through to setTimeout(fn, NaN) - treated as 0ms by Node.js, silently defeating the backoff. IMS practically always sends seconds, so this is low risk. A Number.isFinite() guard would make it robust:

    const retryAfterSec = Number(retryAfterHeader);
    const retryAfterMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
      ? retryAfterSec * 1000 : 0;
  2. Retry-After test uses Retry-After: 0 - ims-client.test.js:811-824

    This exercises the header parsing path but produces the same delay as if the header were absent (since retryBaseDelayMs is also 0 in tests). The test proves the header is read without error, but doesn't assert it influences the computed delay. A test with a non-zero Retry-After and non-zero base delay would give stronger confidence.

Recommendations

  • Consider a small parseRetryAfter(headerValue) utility with a Number.isFinite guard - not needed for this PR, but useful if retry logic spreads to other clients.
  • Follow-up: document worst-case latency (3 attempts, ~3s added delay) in JSDoc so callers behind API Gateway (29s timeout) understand tail latency.
  • Follow-up: consolidate getImsAdminProfile/getImsAdminOrganizations which still use raw fetch() and bypass retry/tracing.

Assessment

Ready to merge: Yes. All three Important findings from the prior review are properly resolved. The retry logic is sound - correct layer, conservative budget, proper backoff with jitter and Retry-After respect, socket cleanup, and good observability. The remaining items are minor polish.

@ravverma ravverma merged commit 2be8308 into main Mar 16, 2026
7 checks passed
@ravverma ravverma deleted the feat/ims-api-call-resilience branch March 16, 2026 05:23
solaris007 pushed a commit that referenced this pull request Mar 16, 2026
@solaris007
Member

🎉 This PR is included in version @adobe/spacecat-shared-ims-client-v1.12.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
