Open
…ommit any outbound messages
Pull request overview
Fixes a race condition in the Azure Storage backend when extended sessions are enabled and an orchestration performs ContinueAsNew, where activity responses can arrive before the new execution ID is checkpointed, causing a stuck orchestration.
Changes:
- Moves `session.UpdateRuntimeState(runtimeState)` earlier in `CompleteTaskOrchestrationWorkItemAsync` so the in-memory session reflects the new execution ID before outbound messages are committed.
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Outdated
added 2 commits
February 24, 2026 09:41
…sessions-race-condition
…o initial misdiagnosis, and changed the IsOutOfOrder logic instead
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
Currently we have a subtle race condition that can occur if an orchestration attempts to continue-as-new. The flow is as follows:
1. `TaskOrchestrationDispatcher` calls `CompleteTaskOrchestrationWorkItemAsync`.
2. A `TaskScheduled` event is sent to start a new Activity.
3. A `TaskCompleted` event is sent back to the orchestration, all before `CompleteTaskOrchestrationWorkItemAsync` has updated the orchestration's state in storage to reflect the new execution ID.
4. A call to `LockNextTaskOrchestrationWorkItemAsync` is made, which retrieves the `TaskCompleted` event. The `TaskCompleted` event is addressed to the new execution ID, but since the orchestration's state has not yet been updated in storage, there is no record or history for that execution ID. This call to determine out-of-order messages should detect that this is potentially an "out of order" `TaskCompleted` message, since the instance does "not yet exist". However, `IsOutOfOrderMessage` decides that the message is okay, because this [condition](https://github.com/Azure/durabletask/blob/1b04239f01b8376e084d9b957bf15b546700dd64/src/DurableTask.AzureStorage/Messaging/OrchestrationSession.cs#L163) evaluates to true.
5. Later in the `LockNextTaskOrchestrationWorkItemAsync` method, when we attempt to retrieve information about this orchestration instance with the new execution ID in storage, we find none and fail at this point. We delete the `TaskCompleted` event, which leaves the orchestration permanently stuck in a running state.

The core of the issue is that the checks in `IsOutOfOrderMessage` are not conservative enough. In this case, the checkpoint time of the session is indeed higher than that of the `TaskCompleted` event, because the orchestration's state is retrieved after the message is received. This one condition evaluating to true should not be enough to decide that the message is not out of order.

This PR changes the logic so that if any of the following conditions is met, we treat the message as out of order: the instance does not exist and the message has not yet been dequeued 5 times, the checkpoint is stale, or a scheduled event does not exist for a completion event.
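The difference between the two decision rules can be sketched in isolation. The following is a minimal illustration, not the code in `OrchestrationSession.cs`: `OutOfOrderSignals`, `OutOfOrderSketch`, and both method names are hypothetical, and each flag stands in for one of the three checks named above without modeling how the real code computes it. The permissive variant is a simplified model of the problem as described here (one passing check is enough to accept the message), not a transcription of the previous implementation.

```csharp
// Hypothetical stand-in for the three checks named in the PR description.
// Each flag is true when that check considers the message suspicious.
public readonly record struct OutOfOrderSignals(
    bool InstanceMissingWithinGracePeriod, // no record of the instance, and the message has not yet been dequeued 5 times
    bool CheckpointIsStale,                // the session's checkpoint is stale relative to the message
    bool CompletionLacksScheduledEvent);   // a completion event with no matching scheduled event in history

public static class OutOfOrderSketch
{
    // Simplified model of the problem: the message is only treated as
    // out of order when every check objects, so one passing check accepts it.
    public static bool IsOutOfOrderPermissive(OutOfOrderSignals s) =>
        s.InstanceMissingWithinGracePeriod
        && s.CheckpointIsStale
        && s.CompletionLacksScheduledEvent;

    // Model of the rule this PR describes: any single objection is enough
    // to treat the message as out of order.
    public static bool IsOutOfOrderConservative(OutOfOrderSignals s) =>
        s.InstanceMissingWithinGracePeriod
        || s.CheckpointIsStale
        || s.CompletionLacksScheduledEvent;
}
```

For the race described above, the signals would be `new OutOfOrderSignals(true, false, true)`: the instance has no record yet, the checkpoint looks fresh because the state was fetched after the message arrived, and no scheduled event exists in history. Under these assumptions the permissive rule accepts the message, while the conservative rule flags it as out of order, so it is not deleted while the new execution ID is still being checkpointed.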
Resolves #1302