Adds retry support to the Amazon.Lambda.DurableExecution #2363

Open
GarrettBeatty wants to merge 1 commit into GarrettBeatty/stack/2 from GarrettBeatty/stack/3

@GarrettBeatty GarrettBeatty commented May 12, 2026

Stacked PRs:


#2216

What

Adds retry support to the Amazon.Lambda.DurableExecution SDK on top of the foundation in #2360. After this PR, a step that throws can be retried with configurable backoff and jitter. Durable executions resume after the retry timer elapses, and Lambda compute is not billed during the wait.

Public API introduced:

| Type | Purpose |
| --- | --- |
| IRetryStrategy | Decides whether a failed step should retry, and with what delay. |
| RetryDecision | Output of IRetryStrategy.ShouldRetry: ShouldRetry flag plus Delay. |
| RetryStrategy | Static factory: Default, Transient, None, Exponential(...), FromDelegate(...). |
| JitterStrategy | None / Half / Full for exponential backoff. |
| StepSemantics | AtLeastOncePerRetry (default) / AtMostOncePerRetry. |
| StepConfig.RetryStrategy, StepConfig.Semantics | Per-step retry configuration. |

Why

Real workflows fail. A step that calls a flaky downstream service or hits a transient throttle needs to retry without restarting the whole workflow. Durable execution makes service-mediated retries possible: the SDK checkpoints a RETRY operation with a NextAttemptDelaySeconds, suspends the Lambda, and the service re-invokes us when the timer fires. The user's compute isn't billed during the wait.

AtMostOncePerRetry semantics handle non-idempotent steps (e.g. charging a card): a START checkpoint is durably persisted before user code runs, so a Lambda crash mid-execution can be detected on replay and routed through the retry strategy rather than re-executing.

How

Retry control flow. When a step throws, StepOperation.HandleStepFailureAsync consults the configured IRetryStrategy.ShouldRetry(ex, attemptNumber). If the decision says retry, the SDK enqueues a RETRY checkpoint carrying NextAttemptDelaySeconds, then suspends via TerminationManager.SuspendAndAwait so RunAsync returns Pending to the service. On the next invocation, StepOperation.ReplayAsync sees Status == PENDING and either re-suspends (timer not yet elapsed) or re-executes (timer fired) with the carried-forward attempt counter.
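The control flow above can be sketched as two small decision functions. This is a minimal illustration, not the code in StepOperation: RetryDecision and IRetryStrategy are redeclared here as stand-ins so the block compiles on its own, and FixedDelayStrategy, FailureOutcome, PendingOutcome, and RetryFlowSketch are hypothetical names. Rounding the delay up to whole seconds is an assumption.

```csharp
using System;

// Hypothetical stand-ins for illustration; the SDK's real types may differ.
public readonly struct RetryDecision
{
    public RetryDecision(bool shouldRetry, TimeSpan delay)
    {
        ShouldRetry = shouldRetry;
        Delay = delay;
    }

    public bool ShouldRetry { get; }
    public TimeSpan Delay { get; }
}

public interface IRetryStrategy
{
    RetryDecision ShouldRetry(Exception exception, int attemptNumber);
}

// Example policy: retry with a fixed delay until maxAttempts is reached.
public sealed class FixedDelayStrategy : IRetryStrategy
{
    private readonly int _maxAttempts;
    private readonly TimeSpan _delay;

    public FixedDelayStrategy(int maxAttempts, TimeSpan delay)
    {
        _maxAttempts = maxAttempts;
        _delay = delay;
    }

    public RetryDecision ShouldRetry(Exception exception, int attemptNumber) =>
        new RetryDecision(attemptNumber < _maxAttempts, _delay);
}

public enum FailureOutcome { CheckpointRetryAndSuspend, CheckpointFail }
public enum PendingOutcome { Suspend, ReExecute }

public static class RetryFlowSketch
{
    // What to do when user code throws on attempt `attemptNumber` (1-based).
    public static FailureOutcome OnStepFailure(
        IRetryStrategy strategy, Exception ex, int attemptNumber, out int nextAttemptDelaySeconds)
    {
        nextAttemptDelaySeconds = 0;
        if (strategy == null) return FailureOutcome.CheckpointFail; // no strategy configured

        RetryDecision decision = strategy.ShouldRetry(ex, attemptNumber);
        if (!decision.ShouldRetry) return FailureOutcome.CheckpointFail; // exhausted or filtered out

        // Assumption: the delay is rounded up to whole seconds for the RETRY checkpoint.
        nextAttemptDelaySeconds = (int)Math.Ceiling(decision.Delay.TotalSeconds);
        return FailureOutcome.CheckpointRetryAndSuspend;
    }

    // What to do on replay when the step's recorded status is PENDING.
    public static PendingOutcome OnPendingReplay(DateTimeOffset nextAttemptAt, DateTimeOffset now) =>
        now < nextAttemptAt ? PendingOutcome.Suspend : PendingOutcome.ReExecute;
}
```

The two functions line up with the unit-test names listed under Testing: a failure with a strategy checkpoints RETRY and suspends, a failure without one (or with attempts exhausted) checkpoints FAIL, and a PENDING replay suspends or re-executes depending on whether the timestamp is in the future or the past.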

At-most-once semantics. For non-idempotent steps, Semantics = AtMostOncePerRetry writes a START checkpoint and blocks until the batcher flushes it before user code runs. If Lambda crashes between user code and the SUCCEED flush, replay sees STARTED with no terminal record and routes through HandleStepFailureAsync as a failed attempt instead of re-executing — the side effect runs at most once per attempt.
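One way to picture the replay arms is as a lookup from the recorded checkpoint status and the step's semantics to an action. The sketch below is illustrative only: RecordedStatus, ReplayAction, and ReplaySketch are hypothetical names, StepSemantics is redeclared locally, and the AtLeastOncePerRetry handling of STARTED (simply re-execute) and the FAILED arm are assumptions that the description does not spell out.

```csharp
using System;

// Hypothetical stand-ins for illustration; not the SDK's actual enums.
public enum StepSemantics { AtLeastOncePerRetry, AtMostOncePerRetry }
public enum RecordedStatus { None, Started, Pending, Succeeded, Failed }

public enum ReplayAction
{
    Execute,                // run user code
    TreatAsFailedAttempt,   // route through the failure handler / retry strategy
    CheckRetryTimer,        // re-suspend or re-execute depending on the timer
    ReturnCachedResult,     // replay the recorded result without running user code
    RethrowRecordedFailure  // surface the recorded terminal failure
}

public static class ReplaySketch
{
    public static ReplayAction Decide(RecordedStatus status, StepSemantics semantics) => status switch
    {
        // Fresh step: nothing recorded yet.
        RecordedStatus.None => ReplayAction.Execute,

        // START was flushed but no terminal record exists: the previous
        // invocation crashed mid-step, so do not run the side effect again.
        RecordedStatus.Started when semantics == StepSemantics.AtMostOncePerRetry
            => ReplayAction.TreatAsFailedAttempt,

        // Assumption: under at-least-once semantics a dangling START just re-executes.
        RecordedStatus.Started => ReplayAction.Execute,

        RecordedStatus.Pending => ReplayAction.CheckRetryTimer,
        RecordedStatus.Succeeded => ReplayAction.ReturnCachedResult,
        RecordedStatus.Failed => ReplayAction.RethrowRecordedFailure,
        _ => throw new ArgumentOutOfRangeException(nameof(status)),
    };
}
```

The only arm that depends on Semantics is STARTED, which is where the at-most-once guarantee is enforced.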

Retry strategy contract. IRetryStrategy.ShouldRetry(Exception, int attemptNumber) returns a RetryDecision. ExponentialRetryStrategy supports configurable max attempts, initial/max delay, backoff rate, jitter (None/Half/Full), and exception filtering by type or message regex. Built-in factories: RetryStrategy.Default (6 attempts, 5s/60s, 2× backoff, full jitter), Transient (3 attempts, 1s/5s, half jitter), None. RetryStrategy.FromDelegate(...) wraps arbitrary policies.
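For readers who want the backoff arithmetic spelled out, here is a minimal sketch. It assumes the conventional formula delay = min(initial × rate^(attempt − 1), max), with Half jitter drawing from [delay/2, delay) and Full jitter from [0, delay). The PR text does not confirm these exact formulas, and JitterKind, BackoffSketch, and ComputeDelay are hypothetical names, not the SDK's API.

```csharp
using System;

// Hypothetical stand-in; the SDK's real JitterStrategy may be defined differently.
public enum JitterKind { None, Half, Full }

public static class BackoffSketch
{
    // attempt is 1-based: attempt 1 has just failed, and the result is the
    // wait before attempt 2.
    public static TimeSpan ComputeDelay(
        int attempt, TimeSpan initial, TimeSpan max, double rate, JitterKind jitter, Random rng)
    {
        // Exponential growth, clamped to the configured maximum.
        double raw = initial.TotalSeconds * Math.Pow(rate, attempt - 1);
        double capped = Math.Min(raw, max.TotalSeconds);

        double seconds = jitter switch
        {
            // Keep half the delay fixed and randomize the other half.
            JitterKind.Half => capped / 2 + rng.NextDouble() * (capped / 2),
            // Randomize the whole delay.
            JitterKind.Full => rng.NextDouble() * capped,
            _ => capped,
        };
        return TimeSpan.FromSeconds(seconds);
    }
}
```

Under these assumptions, the Default parameters (5s initial, 60s max, 2× backoff) with jitter disabled would give waits of 5s, 10s, 20s, 40s, and 60s after attempts 1 through 5; the fifth value is capped from 80s.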

Key files:

  • Config/IRetryStrategy.cs — strategy interface + RetryDecision value type
  • Config/RetryStrategy.cs — built-in strategies, ExponentialRetryStrategy, JitterStrategy, StepSemantics, DelegateRetryStrategy
  • Config/StepConfig.cs — adds RetryStrategy and Semantics properties
  • Internal/StepOperation.cs — adds PENDING (retry timer) and STARTED (AtMostOnce crash recovery) replay arms; HandleStepFailureAsync decision tree
  • Internal/TerminationManager.cs — adds RetryScheduled reason

Testing

21 new unit tests in Amazon.Lambda.DurableExecution.Tests (130 total, up from 109 in #2360):

  • RetryStrategyTests (14 tests): exponential backoff math, jitter strategies, max-attempt exhaustion, exception-type and message-pattern filtering, delegate strategies
  • DurableContextTests retry block (7 tests): FailsWithRetryStrategy_CheckpointsRetryAndSuspends, FailsNoRetryStrategy_CheckpointsFail, RetryExhausted_CheckpointsFail, PendingWithFutureTimestamp_Suspends, PendingWithPastTimestamp_ReExecutes, AtMostOnce_FlushesStartBeforeExecution, AtMostOnce_StartedReplay_TriggersRetryHandler

Integration tests (Amazon.Lambda.DurableExecution.IntegrationTests) — RetrySucceeds and RetryExhausts end-to-end against the real durable-execution service.

Out of scope (follow-up PRs)

  • MapAsync / ParallelAsync / RunInChildContextAsync / WaitForConditionAsync
  • CallbackAsync, InvokeAsync
  • DefaultJsonCheckpointSerializer
  • DurableLogger replay-suppression (currently NullLogger)
  • Annotations source-generator integration / [DurableExecution] attribute
  • DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
  • dotnet new lambda.DurableFunction blueprint

GarrettBeatty added a commit that referenced this pull request May 12, 2026
stack-info: PR: #2363, branch: GarrettBeatty/stack/3

Copilot AI left a comment

Pull request overview

Builds on PR #2360 to add retry support to the Amazon.Lambda.DurableExecution SDK. Failed steps can now be retried with configurable backoff and jitter via service-mediated retries (the SDK checkpoints a RETRY operation and suspends the Lambda so the user is not billed during backoff). Adds at-most-once semantics for non-idempotent steps via a synchronously-flushed START checkpoint that allows crash detection on replay.

Changes:

  • New public retry API: IRetryStrategy, RetryDecision, RetryStrategy factories (Default/Transient/None/Exponential/FromDelegate), JitterStrategy, StepSemantics, and StepConfig.RetryStrategy/StepConfig.Semantics.
  • StepOperation adds PENDING (retry-timer) and STARTED (AtMostOnce crash-recovery) replay arms, a HandleStepFailureAsync decision tree, and START-checkpoint emission (sync for AtMostOnce, fire-and-forget for AtLeastOnce).
  • 21 new unit tests plus integration-test updates asserting StepStarted events and richer history logging.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Config/IRetryStrategy.cs | New strategy interface + RetryDecision struct |
| Config/RetryStrategy.cs | ExponentialRetryStrategy, DelegateRetryStrategy, JitterStrategy, StepSemantics, factories |
| Config/StepConfig.cs | Adds RetryStrategy and Semantics properties |
| Internal/StepOperation.cs | PENDING/STARTED replay arms, retry decision tree, START-checkpoint emission |
| Internal/TerminationManager.cs | Adds RetryScheduled termination reason |
| Internal/CheckpointBatcher.cs | Doc-only update describing fire-and-forget semantics |
| Tests/RetryStrategyTests.cs | 14 unit tests for exponential math/jitter/filters/delegate |
| Tests/DurableContextTests.cs | 6 retry/AtMostOnce/Pending replay tests |
| Tests/DurableFunctionTests.cs | Updated to assert START + SUCCEED + WAIT-START flat sequence |
| IntegrationTests/*.cs | Add StepStarted-event assertions; richer history dump in DurableFunctionDeployment |

  var history = await deployment.WaitForHistoryAsync(
      arn!,
-     h => (h.Events?.Count(e => e.StepSucceededDetails != null) ?? 0) >= 2
+     h => (h.Events?.Count(e => e.EventType == EventType.StepStarted) ?? 0) >= 2
Contributor Author

Now that we emit START steps (which are needed for retries), we assert them in the integration tests.


COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]
- /// Replay semantics — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
+ /// Replay branches — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
  /// <list type="bullet">
  /// <item>Fresh: no prior state → run func → emit SUCCEED → return result.</item>
Contributor Author

In the previous PR, only SUCCEEDED or FAILED mattered. Now, for replays, we also need to track how many times the function was executed, which we do by counting STARTED steps.


Labels

Release Not Needed Add this label if a PR does not need to be released.
