
feat: add retry resilience to imsApiCall#1438

Merged
ravverma merged 3 commits into main from feat/ims-api-call-resilience
Mar 16, 2026
Conversation

@ravverma
Contributor

Summary

  • Added exponential-backoff retry logic to ImsBaseClient.imsApiCall
    to handle transient IMS failures without request-flooding the service
  • Max 2 retries (3 total attempts) — hard cap; retries only on 5xx
    server errors and 429 rate-limiting; never on 4xx client errors
  • Exponential back-off: 1 000 ms → 2 000 ms between attempts
    (retryBaseDelayMs is an overridable instance property so tests run
    at 0 ms without fake-timer complexity)
  • Audit-trail logging at every retry decision:
    • WARN on each retry: endpoint, HTTP status / network error,
      attempt N/3, next delay
    • INFO on successful recovery: endpoint, attempt that succeeded,
      total retries used
    • ERROR on final exhaustion after all 3 attempts
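The back-off schedule above can be sketched numerically; a minimal illustration of the described doubling (variable names assumed, not the PR's actual code):

```javascript
// Backoff schedule sketch for the values described above (names assumed):
// delay before retry n = retryBaseDelayMs * 2 ** n
const retryBaseDelayMs = 1000;
const delays = [0, 1].map((attempt) => retryBaseDelayMs * 2 ** attempt);
// delays is [1000, 2000]: 1 s before the first retry, 2 s before the second
```

Overriding `retryBaseDelayMs` to 0, as the tests do, collapses the whole schedule to zero delay without touching timers.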

Test plan

  • All 74 existing tests pass (no behavioural change for non-retryable
    scenarios; retryable-status mocks updated with .times(3))
  • New describe('imsApiCall retry behavior') block (6 tests):
    • retries on 5xx → success on attempt 2
    • retries on 429 → success on attempt 2
    • retries on network error → success on attempt 3
    • exhausts all 3 attempts on network error → throws
    • exhausts all 3 attempts on 5xx → caller receives last response
    • no retry on 4xx → returns immediately with zero warn logs
  • Line/statement coverage 100 %; branch coverage 99.48 % (> 97 % threshold)

🤖 Generated with Claude Code

Adds exponential-backoff retry logic to `ImsBaseClient.imsApiCall` to
handle transient IMS failures without flooding the service with requests.

- Max 2 retries (3 total attempts) — hard cap to protect IMS
- Retries on: 5xx server errors and 429 rate-limiting
- Never retries 4xx client errors (no point hammering on auth failures)
- Exponential back-off: 1 000 ms → 2 000 ms between attempts
- `retryBaseDelayMs` is an overridable instance property so tests run
  at 0 ms delay without fake timers
- Audit-trail logging: warn on each retry (endpoint, status/error,
  attempt N/3, delay); info on successful recovery (attempt used,
  retry count); error on final exhaustion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ravverma ravverma requested review from ekremney and solaris007 March 13, 2026 12:49
@ravverma ravverma added the enhancement New feature or request label Mar 13, 2026
@ravverma ravverma self-assigned this Mar 13, 2026
Member

@ekremney ekremney left a comment


Nice work — clean retry logic with good observability. Two suggestions:

1. Add a defensive throw after the loop to remove the eslint-disable consistent-return

The loop logic is correct today, but the linter can't prove it always returns/throws. An explicit throw makes the contract clear for future readers:

    } // end for
    throw new Error(`IMS ${endpoint}: retry loop exited unexpectedly`);
  }

2. Add jitter to the backoff delay

Fixed delays mean correlated callers (e.g. multiple Lambdas hitting IMS rate limits simultaneously) retry at the same wall-clock times. Adding ±20% random jitter is standard practice to avoid thundering herd:

const delay = this.retryBaseDelayMs * (2 ** attempt) * (0.8 + Math.random() * 0.4);
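A quick sanity check on the bounds this formula produces (illustrative sketch, not PR code; `this.retryBaseDelayMs` replaced by a local constant):

```javascript
// Sanity check on the suggested ±20% jitter range (illustrative, not PR code):
const retryBaseDelayMs = 1000;
const attempt = 1;
const delay = retryBaseDelayMs * (2 ** attempt) * (0.8 + Math.random() * 0.4);
// Math.random() is in [0, 1), so delay always falls in [1600, 2400):
// within ±20% of the deterministic 2000 ms for this attempt
```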

Member

@solaris007 solaris007 left a comment


Hey @ravverma,

Strengths

  • Excellent placement of retry logic at the imsApiCall chokepoint (ims-base-client.js:110) - all 11+ call sites across ImsClient and ImsPromiseClient get resilience automatically without caller changes.
  • #isRetryableStatus correctly limits retries to 429 and 5xx, never retrying 4xx auth errors like 401/403 which indicate configuration problems, not transient failures (ims-base-client.js:38-41).
  • Conservative retry budget (2 retries, 3 total attempts) with exponential backoff (1s, 2s) caps worst-case added latency at 3 seconds - appropriate for backend-to-backend auth service communication (ims-base-client.js:19-22).
  • Token endpoint POST retries are safe for IMS - both grant_type=authorization_code and grant_type=client_credentials are idempotent from the caller's perspective. Adobe IMS client codes are reusable, so duplicate POSTs simply return a new (or cached) token.
  • Headers are prepared once outside the retry loop (ims-base-client.js:112), ensuring the same credential is used across all attempts rather than re-acquiring tokens on each retry.
  • Thorough test coverage: dedicated retry behavior block (6 tests) covering 5xx, 429, network error recovery, retry exhaustion, and 4xx non-retry. Existing failure tests properly updated with .times(3) (ims-client.test.js:761-855).
  • Audit-trail logging at every retry decision point - warn on retry, info on recovery, error on exhaustion - with endpoint, status, attempt number, and delay included.

Issues

Important (Should Fix)

  1. Implicit undefined return masked by eslint-disable - ims-base-client.js:105

    The // eslint-disable-next-line consistent-return suppresses a legitimate structural warning. While all current code paths do terminate (the final attempt either returns the response or throws), the function has no explicit return/throw after the for-loop. If someone later changes the retry math or adds a condition, the function could silently return undefined, causing a confusing TypeError when the caller tries to call .ok or .json() on the result.

    How to fix: Remove the eslint-disable and add an explicit throw after the for-loop:

    } // end for
    throw new Error('imsApiCall: unexpected fall-through after retry loop');
  2. 429 responses do not respect Retry-After header - ims-base-client.js:122-136

    When IMS returns 429, it typically includes a Retry-After header indicating how long to wait. The code ignores this and uses the fixed exponential backoff (1s, 2s). If IMS says "back off for 60 seconds" and we retry after 1 second, we get 429 again - burning all three attempts for nothing. This is especially problematic in SpaceCat's multi-instance deployment where several service instances could hit the rate limit simultaneously, amplifying the pressure rather than relieving it.

    How to fix: Read response.headers.get('Retry-After') when status is 429. If present and numeric, use Math.max(parsedRetryAfter * 1000, delay) as the sleep duration. Fall back to exponential backoff when the header is absent.
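A minimal sketch of this fix, with assumed names (`nextDelayMs` is illustrative, not the PR's actual code):

```javascript
// Hypothetical helper: compute the sleep before the next retry.
// `response` is a fetch Response; `delay` is the exponential-backoff delay in ms.
function nextDelayMs(response, delay) {
  if (response.status !== 429) return delay;
  const header = response.headers.get('Retry-After');
  const seconds = Number.parseInt(header, 10);
  // Fall back to the backoff delay when the header is absent or non-numeric.
  if (Number.isNaN(seconds)) return delay;
  // Honor the server's requested wait, but never sleep less than the backoff.
  return Math.max(seconds * 1000, delay);
}
```

With `Retry-After: 60` this waits 60 s instead of burning the remaining attempts at 1 s and 2 s.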

  3. Response body not consumed on retryable HTTP errors - ims-base-client.js:122-136

    When a retryable status (e.g. 503) triggers a retry, the response object is discarded without consuming its body. With @adobe/fetch (undici under the hood), unconsumed response bodies hold the underlying socket open until GC or connection timeout. In a retry loop against a flapping IMS service, this could exhaust the HTTP/2 connection pool or cause socket leaks.

    How to fix: Add response.body?.cancel() or await response.text() before the continue statement to explicitly release the connection.

Minor (Nice to Have)

  1. callerName extraction via stack trace runs inside the retry loop - ims-base-client.js:139

    new Error().stack.split(...) is called on every successful attempt. While not expensive, it could be captured once before the loop starts and reused.

  2. Log messages use multi-line string concatenation - ims-base-client.js:127-133, 153-157, 161-164

    Template literals would be more readable than + concatenation across multiple lines. Style preference, not functional.

Recommendations

  • Add jitter to exponential backoff. The current delay is deterministic (baseDelay * 2^attempt). If multiple SpaceCat instances start their retry cycle simultaneously after an IMS blip, they all retry at exactly the same moments (thundering herd). Standard practice: delay * (0.5 + Math.random() * 0.5).
  • Track getImsAdminProfile / getImsAdminOrganizations consolidation. These methods use raw fetch() instead of imsApiCall(), bypassing retry, duration logging, and tracing. A follow-up PR to consolidate them would close the gap.
  • Document worst-case latency. getImsOrganizationDetails makes 3+ sequential imsApiCall calls. If each hits the retry ceiling, a single invocation could add 12-15 seconds of delay. Callers behind API Gateway (29s timeout) should be aware.
  • Consider making retry policy configurable via constructor. Accepting { maxRetries, baseDelayMs } in the config object would future-proof the design for callers needing different retry budgets.

Assessment

Ready to merge? With fixes

The retry mechanism is fundamentally sound - correct layer, correct status code filter, conservative budget, good test coverage, and safe for IMS token endpoint POSTs. The three Important items should be addressed before merge: the implicit return path (structural fragility), the Retry-After header (operational correctness for 429), and the unconsumed response body (connection pool health under retries).

- Add ±20% jitter to exponential backoff to avoid thundering herd
- Respect Retry-After header on 429 responses (takes precedence over base delay)
- Drain response body via response.body?.cancel?.() before discarding on retry to avoid socket leaks
- Move callerName extraction before the retry loop (avoids re-capturing on each attempt)
- Replace eslint-disable with explicit defensive throw after loop guarded by c8 ignore
- Add test for Retry-After header path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

This PR will trigger a minor release when merged.

@ravverma
Contributor Author

Hi @solaris007 @ekremney
thank you for the quick review. I've addressed the comments; please find the final change summary below.

feat: Add retry resilience to imsApiCall (#1438)

Problem

ImsBaseClient.imsApiCall had no retry logic. Transient IMS failures (5xx
server errors, 429 rate-limiting, network blips) surfaced immediately as
hard errors.

Solution

Added exponential backoff retry to imsApiCall in ImsBaseClient, with
safeguards to avoid flooding IMS.


Changes — ims-base-client.js

| # | What | Detail |
|---|------|--------|
| 1 | Max 2 retries (3 total attempts) | `static #MAX_RETRIES = 2` — hard cap prevents request floods |
| 2 | Retryable statuses only | 429 (rate-limit) and 5xx (server error); 4xx client errors are never retried |
| 3 | Exponential backoff | Base delay doubles each attempt: 1 000 ms → 2 000 ms |
| 4 | ±20% jitter | `delay * (0.8 + Math.random() * 0.4)` spreads retries to avoid thundering herd |
| 5 | Retry-After header support | On 429, reads `Retry-After` (seconds) and uses `Math.max(baseDelay, retryAfterMs)` |
| 6 | Response body draining | `await response.body?.cancel?.()` before retry to avoid socket leaks |
| 7 | Audit-trail logging | warn on each retry with endpoint, status, attempt count; info on recovery showing retries used |
| 8 | Configurable base delay | `retryBaseDelayMs` exposed as instance property so tests can set it to 0 without fake timers |
| 9 | Defensive throw after loop | Explicit `throw new Error('imsApiCall: retry loop exited unexpectedly')` replaces eslint-disable consistent-return |
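The behaviors listed above compose into a single loop; the following is an illustrative sketch under assumptions (all names invented here, not the actual `ims-base-client.js` implementation):

```javascript
// Sketch of the retry policy described above (names assumed, not PR code).
const MAX_RETRIES = 2; // 3 total attempts

const isRetryable = (status) => status === 429 || status >= 500;

async function callWithRetry(
  doFetch,
  { baseDelayMs = 1000, sleep = (ms) => new Promise((r) => { setTimeout(r, ms); }) } = {},
) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt += 1) {
    let response;
    try {
      response = await doFetch();
    } catch (err) {
      // Network error: rethrow on the last attempt, otherwise back off and retry.
      if (attempt === MAX_RETRIES) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
      continue;
    }
    // Non-retryable status, or last attempt: hand the response to the caller.
    if (!isRetryable(response.status) || attempt === MAX_RETRIES) return response;
    // Drain the body so the socket can be reused before retrying.
    await response.body?.cancel?.();
    // Exponential backoff; on 429, honor Retry-After (seconds) if numeric.
    let delay = baseDelayMs * 2 ** attempt;
    const retryAfter = Number.parseInt(response.headers?.get?.('Retry-After'), 10);
    if (response.status === 429 && !Number.isNaN(retryAfter)) {
      delay = Math.max(retryAfter * 1000, delay);
    }
    // ±20% jitter to spread correlated callers.
    await sleep(delay * (0.8 + Math.random() * 0.4));
  }
  // Defensive: the loop always returns or throws, but make the contract explicit.
  throw new Error('callWithRetry: retry loop exited unexpectedly');
}
```

Injecting `sleep` (or a zero `baseDelayMs`) keeps tests fast without fake timers, mirroring the `retryBaseDelayMs = 0` approach in the test suite.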

Changes — ims-client.test.js

  • Added client.retryBaseDelayMs = 0 in beforeEach for all suites that
    exercise retryable paths (zero delay keeps CI fast)
  • Added .times(3) on nock interceptors for all retryable status mocks
    (500, 503, 429) so retry attempts don't leak unmatched requests
  • Added new describe('imsApiCall retry behavior') suite with 7 tests:
    • Retries on 5xx, succeeds on 2nd attempt
    • Retries on 429, succeeds on 2nd attempt
    • Respects Retry-After header on 429
    • Retries on network error, succeeds on 3rd attempt
    • Exhausts all retries on network error, throws
    • Returns last bad response when all 5xx retries exhausted
    • Does not retry on 4xx client errors

Coverage

100% lines · 100% statements · 100% functions · ≥97% branches

@ravverma ravverma requested a review from solaris007 March 13, 2026 15:22
Member

@solaris007 solaris007 left a comment


Hey @ravverma,

Thanks for the focused follow-up - all three Important findings from the prior review have been addressed cleanly.

Strengths

  • All prior findings systematically addressed in a single commit - good engineering discipline.
  • Correct layering - retry at the imsApiCall chokepoint gives all 11+ call sites resilience automatically without caller changes (ims-base-client.js:110).
  • Jitter + Retry-After composition is well-designed: Math.max(baseDelay, retryAfterMs) * jitter respects IMS rate-limit signals while preventing thundering herd (ims-base-client.js:137-142).
  • response.body?.cancel?.() before retry correctly releases the underlying socket for connection reuse in the undici-backed @adobe/fetch (ims-base-client.js:148).
  • Defensive throw after the for-loop (ims-base-client.js:185) replaces the eslint-disable with a structural safety net.
  • callerName extraction moved before the loop - no longer runs on every attempt.
  • Auth-safe by design: tokens cannot expire during the ~3s worst-case retry window, FormData bodies are safely reusable across attempts, and noAuth: true on token endpoints prevents recursive token fetch during retries.
  • Thorough test suite: 7 dedicated retry tests covering 5xx recovery, 429 recovery, Retry-After header, network error recovery, exhaustion paths, and 4xx non-retry.

Issues

Minor (Nice to Have)

  1. Retry-After parsing assumes seconds format - ims-base-client.js:137-139

    parseInt(retryAfterHeader, 10) handles the delay-seconds format but not the HTTP-date format allowed by RFC 7231 (e.g., Thu, 13 Mar 2026 16:00:00 GMT). If an HTTP-date is received, parseInt returns NaN, which propagates through to setTimeout(fn, NaN) - treated as 0ms by Node.js, silently defeating the backoff. IMS practically always sends seconds, so this is low risk. A Number.isFinite() guard would make it robust:

    const retryAfterSec = Number(retryAfterHeader);
    const retryAfterMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
      ? retryAfterSec * 1000 : 0;
  2. Retry-After test uses Retry-After: 0 - ims-client.test.js:811-824

    This exercises the header parsing path but produces the same delay as if the header were absent (since retryBaseDelayMs is also 0 in tests). The test proves the header is read without error, but doesn't assert it influences the computed delay. A test with a non-zero Retry-After and non-zero base delay would give stronger confidence.

Recommendations

  • Consider a small parseRetryAfter(headerValue) utility with a Number.isFinite guard - not needed for this PR, but useful if retry logic spreads to other clients.
  • Follow-up: document worst-case latency (3 attempts, ~3s added delay) in JSDoc so callers behind API Gateway (29s timeout) understand tail latency.
  • Follow-up: consolidate getImsAdminProfile/getImsAdminOrganizations which still use raw fetch() and bypass retry/tracing.

Assessment

Ready to merge: Yes. All three Important findings from the prior review are properly resolved. The retry logic is sound - correct layer, conservative budget, proper backoff with jitter and Retry-After respect, socket cleanup, and good observability. The remaining items are minor polish.

@ravverma ravverma merged commit 2be8308 into main Mar 16, 2026
7 checks passed
@ravverma ravverma deleted the feat/ims-api-call-resilience branch March 16, 2026 05:23
solaris007 pushed a commit that referenced this pull request Mar 16, 2026
@solaris007
Member

🎉 This PR is included in version @adobe/spacecat-shared-ims-client-v1.12.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
