Adds retry support to the Amazon.Lambda.DurableExecution #2363
GarrettBeatty wants to merge 1 commit into
Pull request overview
Builds on PR #2360 to add retry support to the Amazon.Lambda.DurableExecution SDK. Failed steps can now be retried with configurable backoff and jitter via service-mediated retries (the SDK checkpoints a RETRY operation and suspends the Lambda so the user is not billed during backoff). Adds at-most-once semantics for non-idempotent steps via a synchronously-flushed START checkpoint that allows crash detection on replay.
Changes:
- New public retry API: `IRetryStrategy`, `RetryDecision`, `RetryStrategy` factories (`Default`/`Transient`/`None`/`Exponential`/`FromDelegate`), `JitterStrategy`, `StepSemantics`, and `StepConfig.RetryStrategy`/`StepConfig.Semantics`.
- `StepOperation` adds `PENDING` (retry-timer) and `STARTED` (AtMostOnce crash-recovery) replay arms, a `HandleStepFailureAsync` decision tree, and START-checkpoint emission (sync for AtMostOnce, fire-and-forget for AtLeastOnce).
- 21 new unit tests plus integration-test updates asserting `StepStarted` events and richer history logging.
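
For a feel of the shape of the new API, here is a minimal sketch of the strategy contract. It is not the SDK's source: every type name ending in `Sketch` or `Sketches` is hypothetical, `attemptNumber` is assumed to be 1-based, and only the `ShouldRetry(Exception, int)` signature and the `ShouldRetry` flag plus `Delay` pair come from this PR's description.

```csharp
using System;

// Hypothetical mirror of the decision value described in this PR (not the SDK type).
public readonly struct RetryDecisionSketch
{
    public RetryDecisionSketch(bool shouldRetry, TimeSpan delay)
    {
        ShouldRetry = shouldRetry;
        Delay = delay;
    }

    public bool ShouldRetry { get; }
    public TimeSpan Delay { get; }

    public static RetryDecisionSketch Stop => new RetryDecisionSketch(false, TimeSpan.Zero);

    public static RetryDecisionSketch RetryAfter(TimeSpan delay) => new RetryDecisionSketch(true, delay);
}

// Hypothetical mirror of the strategy interface.
public interface IRetryStrategySketch
{
    // Assumption: attemptNumber is 1-based, so 1 means the first attempt just failed.
    RetryDecisionSketch ShouldRetry(Exception exception, int attemptNumber);
}

// Delegate-backed strategy, in the spirit of RetryStrategy.FromDelegate(...).
public sealed class DelegateRetryStrategySketch : IRetryStrategySketch
{
    private readonly Func<Exception, int, RetryDecisionSketch> _decide;

    public DelegateRetryStrategySketch(Func<Exception, int, RetryDecisionSketch> decide)
    {
        _decide = decide ?? throw new ArgumentNullException(nameof(decide));
    }

    public RetryDecisionSketch ShouldRetry(Exception exception, int attemptNumber)
    {
        return _decide(exception, attemptNumber);
    }
}

public static class RetryStrategySketches
{
    // Retries only TimeoutException, with a fixed delay, while attemptNumber < maxAttempts.
    public static IRetryStrategySketch TimeoutsOnly(int maxAttempts, TimeSpan delay)
    {
        return new DelegateRetryStrategySketch((ex, attempt) =>
            ex is TimeoutException && attempt < maxAttempts
                ? RetryDecisionSketch.RetryAfter(delay)
                : RetryDecisionSketch.Stop);
    }
}
```

Under these assumptions, a policy built with `RetryStrategySketches.TimeoutsOnly(3, TimeSpan.FromSeconds(2))` retries a `TimeoutException` after attempts 1 and 2, and stops at attempt 3 or on any other exception type.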
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Config/IRetryStrategy.cs | New strategy interface + RetryDecision struct |
| Config/RetryStrategy.cs | ExponentialRetryStrategy, DelegateRetryStrategy, JitterStrategy, StepSemantics, factories |
| Config/StepConfig.cs | Adds RetryStrategy and Semantics properties |
| Internal/StepOperation.cs | PENDING/STARTED replay arms, retry decision tree, START-checkpoint emission |
| Internal/TerminationManager.cs | Adds RetryScheduled termination reason |
| Internal/CheckpointBatcher.cs | Doc-only update describing fire-and-forget semantics |
| Tests/RetryStrategyTests.cs | 14 unit tests for exponential math/jitter/filters/delegate |
| Tests/DurableContextTests.cs | 6 retry/AtMostOnce/Pending replay tests |
| Tests/DurableFunctionTests.cs | Updated to assert START + SUCCEED + WAIT-START flat sequence |
| IntegrationTests/*.cs | Add StepStarted-event assertions; richer history dump in DurableFunctionDeployment |
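
The exponential math and jitter that `RetryStrategyTests` exercises can be sketched roughly as below. This is an illustration under stated assumptions, not the SDK's `ExponentialRetryStrategy`: the names `JitterSketch` and `BackoffSketch` are hypothetical, the delay formula `min(maxDelay, initialDelay * rate^(attempt - 1))` is a common convention that the PR does not spell out, and so are the jitter ranges (Half samples from 50% to 100% of the capped delay, Full from 0% to 100%). The random sample is passed in so the function stays deterministic.

```csharp
using System;

// Hypothetical stand-in for the PR's JitterStrategy enum.
public enum JitterSketch { None, Half, Full }

public static class BackoffSketch
{
    // Computes the delay before the next attempt.
    // Assumption: attemptNumber is 1-based, so the delay after the first failure is initialDelay.
    // random01 is a caller-supplied sample in [0, 1).
    public static TimeSpan ComputeDelay(
        int attemptNumber,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        double backoffRate,
        JitterSketch jitter,
        double random01)
    {
        if (attemptNumber < 1) throw new ArgumentOutOfRangeException(nameof(attemptNumber));
        if (random01 < 0.0 || random01 >= 1.0) throw new ArgumentOutOfRangeException(nameof(random01));

        // Exponential growth, capped at maxDelay.
        double uncappedSeconds = initialDelay.TotalSeconds * Math.Pow(backoffRate, attemptNumber - 1);
        double cappedSeconds = Math.Min(uncappedSeconds, maxDelay.TotalSeconds);

        double seconds;
        switch (jitter)
        {
            case JitterSketch.Full:
                // Anywhere from 0 up to (but excluding) the capped delay.
                seconds = cappedSeconds * random01;
                break;
            case JitterSketch.Half:
                // Anywhere from half the capped delay up to (but excluding) the full delay.
                seconds = cappedSeconds * (0.5 + 0.5 * random01);
                break;
            default:
                seconds = cappedSeconds;
                break;
        }

        return TimeSpan.FromSeconds(seconds);
    }
}
```

With a 5s initial delay, a 60s cap, a rate of 2, and `JitterSketch.None`, this sketch yields 5s, 10s, 20s, 40s, and then 60s from the fifth attempt onward.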
      var history = await deployment.WaitForHistoryAsync(
          arn!,
    -     h => (h.Events?.Count(e => e.StepSucceededDetails != null) ?? 0) >= 2
    +     h => (h.Events?.Count(e => e.EventType == EventType.StepStarted) ?? 0) >= 2
Now that we emit START steps (which retries need), the integration tests assert on them.
    COPY bin/publish/ ${LAMBDA_TASK_ROOT}

    ENTRYPOINT ["/var/task/bootstrap"]
    - /// Replay semantics — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
    + /// Replay branches — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
      /// <list type="bullet">
      /// <item>Fresh: no prior state → run func → emit SUCCEED → return result.</item>
In the previous PR only SUCCEEDED or FAILED mattered. For replays we now also need to track how many times the function was executed, which is derived from the number of STARTED steps.
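
The replay branches discussed in this thread can be pictured as a pure decision function. The sketch below is illustrative only: all type and member names are hypothetical, and two behaviours are assumptions not stated in the PR, namely that a recorded `FAILED` status rethrows the stored failure and that a `STARTED` record under at-least-once semantics simply runs the step again.

```csharp
using System;

// Hypothetical names throughout; this is not the SDK's StepOperation.
public enum StepStatusSketch { None, Started, Pending, Succeeded, Failed }

public enum StepSemanticsSketch { AtLeastOncePerRetry, AtMostOncePerRetry }

public enum ReplayActionSketch
{
    Execute,
    Suspend,
    ReturnCachedResult,
    RethrowRecordedFailure,
    TreatAsFailedAttempt
}

public static class ReplaySketch
{
    public static ReplayActionSketch Decide(
        StepStatusSketch status,
        StepSemanticsSketch semantics,
        DateTimeOffset? nextAttemptAt,
        DateTimeOffset now)
    {
        switch (status)
        {
            case StepStatusSketch.None:
                // Fresh: no prior state, so run the user function.
                return ReplayActionSketch.Execute;

            case StepStatusSketch.Succeeded:
                return ReplayActionSketch.ReturnCachedResult;

            case StepStatusSketch.Failed:
                // Assumption: a terminal failure is surfaced again on replay.
                return ReplayActionSketch.RethrowRecordedFailure;

            case StepStatusSketch.Pending:
                // Retry timer: suspend again if it has not elapsed, otherwise re-execute.
                return nextAttemptAt.HasValue && nextAttemptAt.Value > now
                    ? ReplayActionSketch.Suspend
                    : ReplayActionSketch.Execute;

            case StepStatusSketch.Started:
                // STARTED with no terminal record means the previous invocation died mid-step.
                // At-most-once: count it as a failed attempt instead of running user code again.
                // Assumption: at-least-once just executes again.
                return semantics == StepSemanticsSketch.AtMostOncePerRetry
                    ? ReplayActionSketch.TreatAsFailedAttempt
                    : ReplayActionSketch.Execute;

            default:
                throw new ArgumentOutOfRangeException(nameof(status));
        }
    }
}
```

Keeping the decision free of I/O makes each branch easy to cover with a unit test, for example a `Pending` status with a timestamp in the future versus one in the past.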
Stacked PRs:
#2216
What
Adds retry support to the `Amazon.Lambda.DurableExecution` SDK on top of the foundation in #2360. After this PR, a step that throws can be retried with configurable backoff and jitter; durable executions resume after the retry timer elapses without billing Lambda compute during the wait.

Public API introduced:
- `IRetryStrategy`: the strategy interface.
- `RetryDecision`: returned by `IRetryStrategy.ShouldRetry`; a `ShouldRetry` flag plus `Delay`.
- `RetryStrategy`: factories `Default`, `Transient`, `None`, `Exponential(...)`, `FromDelegate(...)`.
- `JitterStrategy`: `None`/`Half`/`Full` for exponential backoff.
- `StepSemantics`: `AtLeastOncePerRetry` (default) / `AtMostOncePerRetry`.
- `StepConfig.RetryStrategy`, `StepConfig.Semantics`: new properties on `StepConfig`.

Why
Real workflows fail. A step that calls a flaky downstream service or hits a transient throttle needs to retry without restarting the whole workflow. Durable execution makes service-mediated retries possible: the SDK checkpoints a `RETRY` operation with a `NextAttemptDelaySeconds`, suspends the Lambda, and the service re-invokes us when the timer fires. The user's compute isn't billed during the wait.

`AtMostOncePerRetry` semantics handle non-idempotent steps (e.g. charging a card): a `START` checkpoint is durably persisted before user code runs, so a Lambda crash mid-execution can be detected on replay and routed through the retry strategy rather than re-executing.

How
Retry control flow. When a step throws, `StepOperation.HandleStepFailureAsync` consults the configured `IRetryStrategy.ShouldRetry(ex, attemptNumber)`. If the decision says retry, the SDK enqueues a `RETRY` checkpoint carrying `NextAttemptDelaySeconds`, then suspends via `TerminationManager.SuspendAndAwait` so `RunAsync` returns `Pending` to the service. On the next invocation, `StepOperation.ReplayAsync` sees `Status == PENDING` and either re-suspends (timer not yet elapsed) or re-executes (timer fired) with the carried-forward attempt counter.

At-most-once semantics. For non-idempotent steps, `Semantics = AtMostOncePerRetry` writes a `START` checkpoint and blocks until the batcher flushes it before user code runs. If Lambda crashes between user code and the `SUCCEED` flush, replay sees `STARTED` with no terminal record and routes through `HandleStepFailureAsync` as a failed attempt instead of re-executing, so the side effect runs at most once per attempt.

Retry strategy contract. `IRetryStrategy.ShouldRetry(Exception, int attemptNumber)` returns a `RetryDecision`. `ExponentialRetryStrategy` supports configurable max attempts, initial/max delay, backoff rate, jitter (`None`/`Half`/`Full`), and exception filtering by type or message regex. Built-in factories: `RetryStrategy.Default` (6 attempts, 5s/60s, 2× backoff, full jitter), `Transient` (3 attempts, 1s/5s, half jitter), `None`. `RetryStrategy.FromDelegate(...)` for arbitrary policies.

Key files:
- `Config/IRetryStrategy.cs`: strategy interface + `RetryDecision` value type
- `Config/RetryStrategy.cs`: built-in strategies, `ExponentialRetryStrategy`, `JitterStrategy`, `StepSemantics`, `DelegateRetryStrategy`
- `Config/StepConfig.cs`: adds `RetryStrategy` and `Semantics` properties
- `Internal/StepOperation.cs`: adds `PENDING` (retry timer) and `STARTED` (AtMostOnce crash recovery) replay arms; `HandleStepFailureAsync` decision tree
- `Internal/TerminationManager.cs`: adds `RetryScheduled` reason

Testing
21 new unit tests in `Amazon.Lambda.DurableExecution.Tests` (130 total, up from 109 in #2360):
- `RetryStrategyTests` (14 tests): exponential backoff math, jitter strategies, max-attempt exhaustion, exception-type and message-pattern filtering, delegate strategies
- `DurableContextTests` retry block (7 tests): `FailsWithRetryStrategy_CheckpointsRetryAndSuspends`, `FailsNoRetryStrategy_CheckpointsFail`, `RetryExhausted_CheckpointsFail`, `PendingWithFutureTimestamp_Suspends`, `PendingWithPastTimestamp_ReExecutes`, `AtMostOnce_FlushesStartBeforeExecution`, `AtMostOnce_StartedReplay_TriggersRetryHandler`

Integration tests (`Amazon.Lambda.DurableExecution.IntegrationTests`): `RetrySucceeds` and `RetryExhausts` end-to-end against the real durable-execution service.

Out of scope (follow-up PRs)
- `MapAsync` / `ParallelAsync` / `RunInChildContextAsync` / `WaitForConditionAsync`
- `CallbackAsync`, `InvokeAsync`
- `DefaultJsonCheckpointSerializer`
- `DurableLogger` replay-suppression (currently `NullLogger`)
- `[DurableExecution]` attribute
- `DurableTestRunner` / `Amazon.Lambda.DurableExecution.Testing` package
- `dotnet new lambda.DurableFunction` blueprint