feat: add retry resilience to imsApiCall #1438
Adds exponential-backoff retry logic to `ImsBaseClient.imsApiCall` to handle transient IMS failures without flooding the service with requests.

- Max 2 retries (3 total attempts) — hard cap to protect IMS
- Retries on: 5xx server errors and 429 rate-limiting
- Never retries 4xx client errors (no point hammering on auth failures)
- Exponential back-off: 1 000 ms → 2 000 ms between attempts
- `retryBaseDelayMs` is an overridable instance property so tests run at 0 ms delay without fake timers
- Audit-trail logging: warn on each retry (endpoint, status/error, attempt N/3, delay); info on successful recovery (attempt used, retry count); error on final exhaustion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ekremney
left a comment
Nice work — clean retry logic with good observability. Two suggestions:
1. Add a defensive throw after the loop to remove the `eslint-disable consistent-return`
The loop logic is correct today, but the linter can't prove it always returns/throws. An explicit throw makes the contract clear for future readers:
```js
  } // end for
  throw new Error(`IMS ${endpoint}: retry loop exited unexpectedly`);
}
```

2. Add jitter to the backoff delay
Fixed delays mean correlated callers (e.g. multiple Lambdas hitting IMS rate limits simultaneously) retry at the same wall-clock times. Adding ±20% random jitter is standard practice to avoid thundering herd:
```js
const delay = this.retryBaseDelayMs * (2 ** attempt) * (0.8 + Math.random() * 0.4);
```
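A minimal standalone sketch of that jitter math, with a hypothetical `computeBackoffDelay` helper (the name and signature are illustrative, not part of this PR):

```javascript
// Hypothetical helper: exponential backoff with ±20% random jitter.
// attempt is 0-based; baseDelayMs plays the role of retryBaseDelayMs.
function computeBackoffDelay(baseDelayMs, attempt) {
  const exponential = baseDelayMs * (2 ** attempt); // 1000, 2000, 4000, ...
  const jitterFactor = 0.8 + Math.random() * 0.4;   // uniform in [0.8, 1.2)
  return exponential * jitterFactor;
}
```

For `baseDelayMs = 1000` and `attempt = 1` this always lands in the [1600, 2400) ms window, so correlated callers spread out instead of retrying in lockstep.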
solaris007
left a comment
Hey @ravverma,
Strengths
- Excellent placement of retry logic at the `imsApiCall` chokepoint (ims-base-client.js:110) - all 11+ call sites across `ImsClient` and `ImsPromiseClient` get resilience automatically without caller changes.
- `#isRetryableStatus` correctly limits retries to 429 and 5xx, never retrying 4xx auth errors like 401/403, which indicate configuration problems, not transient failures (ims-base-client.js:38-41).
- Conservative retry budget (2 retries, 3 total attempts) with exponential backoff (1s, 2s) caps worst-case added latency at 3 seconds - appropriate for backend-to-backend auth service communication (ims-base-client.js:19-22).
- Token endpoint POST retries are safe for IMS - both `grant_type=authorization_code` and `grant_type=client_credentials` are idempotent from the caller's perspective. Adobe IMS client codes are reusable, so duplicate POSTs simply return a new (or cached) token.
- Headers are prepared once outside the retry loop (ims-base-client.js:112), ensuring the same credential is used across all attempts rather than re-acquiring tokens on each retry.
- Thorough test coverage: dedicated retry behavior block (6 tests) covering 5xx, 429, network error recovery, retry exhaustion, and 4xx non-retry. Existing failure tests properly updated with `.times(3)` (ims-client.test.js:761-855).
- Audit-trail logging at every retry decision point - warn on retry, info on recovery, error on exhaustion - with endpoint, status, attempt number, and delay included.
Issues
Important (Should Fix)
- **Implicit undefined return masked by eslint-disable** - ims-base-client.js:105

  The `// eslint-disable-next-line consistent-return` suppresses a legitimate structural warning. While all current code paths do terminate (the final attempt either returns the response or throws), the function has no explicit return/throw after the for-loop. If someone later changes the retry math or adds a condition, the function could silently return `undefined`, causing a confusing `TypeError` when the caller tries to access `.ok` or call `.json()` on the result.

  How to fix: remove the eslint-disable and add an explicit throw after the for-loop:

  ```js
  } // end for
  throw new Error('imsApiCall: unexpected fall-through after retry loop');
  ```

- **429 responses do not respect `Retry-After` header** - ims-base-client.js:122-136

  When IMS returns 429, it typically includes a `Retry-After` header indicating how long to wait. The code ignores this and uses the fixed exponential backoff (1s, 2s). If IMS says "back off for 60 seconds" and we retry after 1 second, we get 429 again - burning all three attempts for nothing. This is especially problematic in SpaceCat's multi-instance deployment, where several service instances could hit the rate limit simultaneously, amplifying the pressure rather than relieving it.

  How to fix: read `response.headers.get('Retry-After')` when status is 429. If present and numeric, use `Math.max(parsedRetryAfter * 1000, delay)` as the sleep duration. Fall back to exponential backoff when the header is absent.

- **Response body not consumed on retryable HTTP errors** - ims-base-client.js:122-136

  When a retryable status (e.g. 503) triggers a retry, the `response` object is discarded without consuming its body. With `@adobe/fetch` (undici under the hood), unconsumed response bodies hold the underlying socket open until GC or connection timeout. In a retry loop against a flapping IMS service, this could exhaust the HTTP/2 connection pool or cause socket leaks.

  How to fix: add `response.body?.cancel()` or `await response.text()` before the `continue` statement to explicitly release the connection.
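To make the three fixes concrete, here is a condensed, self-contained sketch of a retry loop incorporating all of them. This is illustrative only: `imsApiCallSketch`, `doFetch`, and `sleep` are hypothetical stand-ins, not the client's actual API.

```javascript
// Sketch: a retry loop showing (1) an explicit post-loop throw,
// (2) Retry-After handling on 429, (3) body drain before retrying.
const MAX_RETRIES = 2;

async function imsApiCallSketch(doFetch, endpoint, { baseDelayMs = 1000, sleep = () => {} } = {}) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt += 1) {
    const response = await doFetch(endpoint);
    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt === MAX_RETRIES) {
      return response; // success, non-retryable error, or budget exhausted
    }
    let delay = baseDelayMs * (2 ** attempt);
    if (response.status === 429) {
      // Fix 2: honor Retry-After (seconds) when present and numeric
      const retryAfter = Number(response.headers.get('Retry-After'));
      if (Number.isFinite(retryAfter) && retryAfter > 0) {
        delay = Math.max(retryAfter * 1000, delay);
      }
    }
    // Fix 3: drain the discarded body so the socket is released
    await response.body?.cancel?.();
    await sleep(delay);
  }
  // Fix 1: explicit throw instead of eslint-disable consistent-return
  throw new Error(`IMS ${endpoint}: retry loop exited unexpectedly`);
}
```

With an injectable `doFetch` and `sleep`, the retry, Retry-After, and drain behavior can all be unit-tested without real timers or network.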
Minor (Nice to Have)
- **`callerName` extraction via stack trace runs inside the retry loop** - ims-base-client.js:139

  `new Error().stack.split(...)` is called on every successful attempt. While not expensive, it could be captured once before the loop starts and reused.

- **Log messages use multi-line string concatenation** - ims-base-client.js:127-133, 153-157, 161-164

  Template literals would be more readable than `+` concatenation across multiple lines. Style preference, not functional.
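For the first point, a minimal sketch of the hoist (function names are hypothetical; the real client's stack-parsing details may differ):

```javascript
// Illustrative only: derive the caller's frame once, before any retry
// loop, instead of re-deriving it on every attempt.
function captureCallerName() {
  // Frame 0 is "Error", frame 1 is captureCallerName itself,
  // frame 2 is the function that invoked us.
  const frames = new Error().stack.split('\n');
  return (frames[2] || 'unknown').trim();
}

function demoApiCall() {
  const callerName = captureCallerName(); // computed once, up front
  // ...a retry loop would reuse callerName on every attempt and log line...
  return callerName;
}
```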
Recommendations
- Add jitter to exponential backoff. The current delay is deterministic (`baseDelay * 2^attempt`). If multiple SpaceCat instances start their retry cycle simultaneously after an IMS blip, they all retry at exactly the same moments (thundering herd). Standard practice: `delay * (0.5 + Math.random() * 0.5)`.
- Track `getImsAdminProfile` / `getImsAdminOrganizations` consolidation. These methods use raw `fetch()` instead of `imsApiCall()`, bypassing retry, duration logging, and tracing. A follow-up PR to consolidate them would close the gap.
- Document worst-case latency. `getImsOrganizationDetails` makes 3+ sequential `imsApiCall` calls. If each hits the retry ceiling, a single invocation could add 12-15 seconds of delay. Callers behind API Gateway (29s timeout) should be aware.
- Consider making retry policy configurable via constructor. Accepting `{ maxRetries, baseDelayMs }` in the config object would future-proof the design for callers needing different retry budgets.
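The last recommendation could look roughly like this. A sketch only: the class name and option shape are hypothetical, not the client's actual constructor API.

```javascript
// Sketch of a configurable retry policy accepted via the config object.
// Defaults match the PR's current constants (2 retries, 1s base delay).
class RetryPolicyClient {
  constructor({ maxRetries = 2, baseDelayMs = 1000 } = {}) {
    this.maxRetries = maxRetries;
    this.baseDelayMs = baseDelayMs;
  }

  // Delay before retry number `attempt` (0-based), doubling each time.
  delayFor(attempt) {
    return this.baseDelayMs * (2 ** attempt);
  }
}
```

Callers needing a different budget would pass `new RetryPolicyClient({ maxRetries: 1, baseDelayMs: 250 })`, while existing callers keep today's behavior via the defaults.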
Assessment
Ready to merge? With fixes
The retry mechanism is fundamentally sound - correct layer, correct status code filter, conservative budget, good test coverage, and safe for IMS token endpoint POSTs. The three Important items should be addressed before merge: the implicit return path (structural fragility), the Retry-After header (operational correctness for 429), and the unconsumed response body (connection pool health under retries).
- Add ±20% jitter to exponential backoff to avoid thundering herd
- Respect Retry-After header on 429 responses (takes precedence over base delay)
- Drain response body via `response.body?.cancel?.()` before discarding on retry to avoid socket leaks
- Move `callerName` extraction before the retry loop (avoids re-capturing on each attempt)
- Replace eslint-disable with explicit defensive throw after loop, guarded by c8 ignore
- Add test for Retry-After header path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR will trigger a minor release when merged.
Hi @solaris007 @ekremney
| # | What | Detail |
|---|---|---|
| 1 | Max 2 retries (3 total attempts) | static #MAX_RETRIES = 2 — hard cap prevents request floods |
| 2 | Retryable statuses only | 429 (rate-limit) and 5xx (server error); 4xx client errors are never retried |
| 3 | Exponential backoff | Base delay doubles each attempt: 1 000 ms → 2 000 ms |
| 4 | ±20% jitter | delay * (0.8 + Math.random() * 0.4) spreads retries to avoid thundering herd |
| 5 | Retry-After header support | On 429, reads Retry-After (seconds) and uses Math.max(baseDelay, retryAfterMs) |
| 6 | Response body draining | await response.body?.cancel?.() before retry to avoid socket leaks |
| 7 | Audit-trail logging | warn on each retry with endpoint, status, attempt count; info on recovery showing retries used |
| 8 | Configurable base delay | retryBaseDelayMs exposed as instance property so tests can set it to 0 without fake timers |
| 9 | Defensive throw after loop | Explicit throw new Error('imsApiCall: retry loop exited unexpectedly') replaces eslint-disable consistent-return |
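Rows 3-5 compose into one delay computation. A self-contained sketch of that composition (the helper name is hypothetical, and the random source is injectable here purely so the math is deterministic to test):

```javascript
// Sketch: Retry-After (if valid) takes precedence over the exponential
// base via Math.max, then ±20% jitter is applied to the winner.
function composeRetryDelay(baseDelayMs, attempt, retryAfterHeader, random = Math.random) {
  const exponential = baseDelayMs * (2 ** attempt);
  const retryAfterSec = Number(retryAfterHeader);
  const retryAfterMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
    ? retryAfterSec * 1000
    : 0;
  const jitter = 0.8 + random() * 0.4; // uniform in [0.8, 1.2)
  return Math.max(exponential, retryAfterMs) * jitter;
}
```

So a 429 with `Retry-After: 5` on the first attempt waits roughly 5 seconds (±20%) rather than the 1-second exponential base, while a missing or unparseable header falls back to pure exponential backoff with jitter.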
Changes — ims-client.test.js
- Added `client.retryBaseDelayMs = 0` in `beforeEach` for all suites that exercise retryable paths (zero delay keeps CI fast)
- Added `.times(3)` on nock interceptors for all retryable status mocks (500, 503, 429) so retry attempts don't leak unmatched requests
- Added new `describe('imsApiCall retry behavior')` suite with 7 tests:
  - Retries on 5xx, succeeds on 2nd attempt
  - Retries on 429, succeeds on 2nd attempt
  - Respects `Retry-After` header on 429
  - Retries on network error, succeeds on 3rd attempt
  - Exhausts all retries on network error, throws
  - Returns last bad response when all 5xx retries exhausted
  - Does not retry on 4xx client errors
Coverage
100% lines · 100% statements · 100% functions · ≥97% branches
solaris007
left a comment
Hey @ravverma,
Thanks for the focused follow-up - all three Important findings from the prior review have been addressed cleanly.
Strengths
- All prior findings systematically addressed in a single commit - good engineering discipline.
- Correct layering - retry at the `imsApiCall` chokepoint gives all 11+ call sites resilience automatically without caller changes (ims-base-client.js:110).
- Jitter + Retry-After composition is well-designed: `Math.max(baseDelay, retryAfterMs) * jitter` respects IMS rate-limit signals while preventing thundering herd (ims-base-client.js:137-142).
- `response.body?.cancel?.()` before retry correctly releases the underlying socket for connection reuse in the undici-backed `@adobe/fetch` (ims-base-client.js:148).
- Defensive throw after the for-loop (ims-base-client.js:185) replaces the eslint-disable with a structural safety net.
- `callerName` extraction moved before the loop - no longer runs on every attempt.
- Auth-safe by design: tokens cannot expire during the ~3s worst-case retry window, FormData bodies are safely reusable across attempts, and `noAuth: true` on token endpoints prevents recursive token fetch during retries.
- Thorough test suite: 7 dedicated retry tests covering 5xx recovery, 429 recovery, Retry-After header, network error recovery, exhaustion paths, and 4xx non-retry.
Issues
Minor (Nice to Have)
- **Retry-After parsing assumes seconds format** - ims-base-client.js:137-139

  `parseInt(retryAfterHeader, 10)` handles the `delay-seconds` format but not the `HTTP-date` format allowed by RFC 7231 (e.g., `Thu, 13 Mar 2026 16:00:00 GMT`). If an HTTP-date is received, `parseInt` returns `NaN`, which propagates through to `setTimeout(fn, NaN)` - treated as 0 ms by Node.js, silently defeating the backoff. IMS practically always sends seconds, so this is low risk. A `Number.isFinite()` guard would make it robust:

  ```js
  const retryAfterSec = Number(retryAfterHeader);
  const retryAfterMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
    ? retryAfterSec * 1000
    : 0;
  ```

- **Retry-After test uses `Retry-After: 0`** - ims-client.test.js:811-824

  This exercises the header parsing path but produces the same delay as if the header were absent (since `retryBaseDelayMs` is also 0 in tests). The test proves the header is read without error, but doesn't assert it influences the computed delay. A test with a non-zero `Retry-After` and non-zero base delay would give stronger confidence.
Recommendations
- Consider a small `parseRetryAfter(headerValue)` utility with a `Number.isFinite` guard - not needed for this PR, but useful if retry logic spreads to other clients.
- Follow-up: document worst-case latency (3 attempts, ~3s added delay) in JSDoc so callers behind API Gateway (29s timeout) understand tail latency.
- Follow-up: consolidate `getImsAdminProfile` / `getImsAdminOrganizations`, which still use raw `fetch()` and bypass retry/tracing.
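A sketch of what that `parseRetryAfter` utility might look like, extended to the RFC 7231 HTTP-date form as well. The `now` parameter is an assumption added here purely to make the date path deterministic to test:

```javascript
// Hypothetical parseRetryAfter(headerValue) utility: returns a delay in
// milliseconds, or 0 when the header is absent or unparseable (so the
// caller falls back to exponential backoff).
function parseRetryAfter(headerValue, now = Date.now()) {
  if (headerValue == null) return 0;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) {
    return seconds > 0 ? seconds * 1000 : 0; // delay-seconds form
  }
  const dateMs = Date.parse(headerValue);    // HTTP-date form
  if (Number.isFinite(dateMs)) {
    return Math.max(dateMs - now, 0);        // never negative
  }
  return 0; // garbage header: ignore, use exponential backoff
}
```

This keeps the `setTimeout(fn, NaN)` failure mode impossible: every path returns a finite, non-negative number.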
Assessment
Ready to merge: Yes. All three Important findings from the prior review are properly resolved. The retry logic is sound - correct layer, conservative budget, proper backoff with jitter and Retry-After respect, socket cleanup, and good observability. The remaining items are minor polish.
## [@adobe/spacecat-shared-ims-client-v1.12.0](https://github.com/adobe/spacecat-shared/compare/@adobe/spacecat-shared-ims-client-v1.11.13...@adobe/spacecat-shared-ims-client-v1.12.0) (2026-03-16)

### Features

* add retry resilience to imsApiCall ([#1438](#1438)) ([2be8308](2be8308))
🎉 This PR is included in version @adobe/spacecat-shared-ims-client-v1.12.0 🎉 The release is available on: Your semantic-release bot 📦🚀
Summary
- Adds exponential-backoff retry logic to `ImsBaseClient.imsApiCall` to handle transient IMS failures without request-flooding the service
- Retries on 5xx server errors and 429 rate-limiting; never on 4xx client errors
- `retryBaseDelayMs` is an overridable instance property so tests run at 0 ms without fake-timer complexity
- WARN on each retry: endpoint, HTTP status / network error, attempt N/3, next delay
- INFO on successful recovery: endpoint, attempt that succeeded, total retries used
- ERROR on final exhaustion after all 3 attempts

Test plan
- Updated existing failure tests for retry scenarios; retryable-status mocks updated with `.times(3)`
- New `describe('imsApiCall retry behavior')` block (6 tests)

🤖 Generated with Claude Code