Fix race condition with continue-as-new#1303

Open
sophiatev wants to merge 4 commits into main from
stevosyan/fix-continue-as-new-with-extended-sessions-race-condition

Conversation

sophiatev (Contributor) commented Feb 23, 2026

Currently we have a subtle race condition that can occur if an orchestration attempts to continue-as-new. The flow is as follows:

  1. An orchestration continues-as-new with a new execution ID, and the TaskOrchestrationDispatcher calls CompleteTaskOrchestrationWorkItemAsync.
  2. In the completion call, outbound messages are committed. Say one of these is a TaskScheduled event to start a new Activity.
  3. The Activity completes and sends a TaskCompleted event back to the orchestration, all before the CompleteTaskOrchestrationWorkItemAsync has updated the orchestration's state in storage to reflect the new execution ID.
  4. A call to LockNextTaskOrchestrationWorkItemAsync is made which retrieves the TaskCompleted event. The TaskCompleted event is addressed to the new execution ID, but since the orchestration's state has not yet been updated in storage, there is no record or history for that execution ID. The check for out-of-order messages should detect that this is potentially an "out of order" TaskCompleted message, since the instance does "not yet exist". However, IsOutOfOrderMessage decides that the message is okay, because this [condition](https://github.com/Azure/durabletask/blob/1b04239f01b8376e084d9b957bf15b546700dd64/src/DurableTask.AzureStorage/Messaging/OrchestrationSession.cs#L163) evaluates to true.
  5. Later on, in the LockNextTaskOrchestrationWorkItemAsync method, when we attempt to retrieve information about this orchestration instance with the new execution ID in storage, we find none, and fail at this point. We delete the TaskCompleted event, which leaves the orchestration permanently stuck in a running state.
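
The interleaving in steps 1 to 5 can be modeled with a small Python sketch. This is not the DurableTask.AzureStorage code: `FakeBackend`, `complete_work_item_steps`, and `lock_next_work_item` are hypothetical stand-ins, and the only behavior modeled is the ordering assumed above (outbound messages are committed before the new execution ID is written to storage).

```python
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Hypothetical in-memory stand-in for the storage backend."""
    # instance_id -> execution ID currently recorded in storage
    stored_execution_id: dict = field(default_factory=dict)
    # pending messages addressed to orchestrations
    control_queue: list = field(default_factory=list)


def complete_work_item_steps(backend, instance_id, new_execution_id):
    """Model of the completion call, pausing after each step so that
    other work can be interleaved between them."""
    # Step A: outbound messages are committed (the Activity gets scheduled).
    yield "messages_committed"
    # Step B: only now is the new execution ID written to storage.
    backend.stored_execution_id[instance_id] = new_execution_id
    yield "state_updated"


def lock_next_work_item(backend, instance_id):
    """Dequeue one message and report whether storage already knows
    the execution ID that the message is addressed to."""
    message = backend.control_queue.pop(0)
    found = backend.stored_execution_id.get(instance_id) == message["execution_id"]
    return message, found


backend = FakeBackend(stored_execution_id={"orch-1": "exec-old"})
completion = complete_work_item_steps(backend, "orch-1", "exec-new")

next(completion)  # messages committed, state not yet updated
# The Activity finishes quickly and replies to the *new* execution ID.
backend.control_queue.append({"type": "TaskCompleted", "execution_id": "exec-new"})
_, found_during_race = lock_next_work_item(backend, "orch-1")

next(completion)  # state is now updated
backend.control_queue.append({"type": "TaskCompleted", "execution_id": "exec-new"})
_, found_after_update = lock_next_work_item(backend, "orch-1")
```

In this model `found_during_race` is `False` and `found_after_update` is `True`: the same message is only resolvable once the state update has landed, which is the window the race falls into.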

The core of the issue is that the checks in IsOutOfOrderMessage are not conservative enough. In this case, the checkpoint time of the session is indeed higher than that of the TaskCompleted event, because the orchestration's state is retrieved after the message is received. Just this one condition evaluating to true should not be enough to decide that the message is not out of order.

This PR changes the logic so that we treat the message as out of order if any one of the following conditions is met: this is a non-existent instance and the message hasn't yet been dequeued 5 times, the checkpoint is stale, or a scheduled event does not exist for a completion event.
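
As a rough illustration of the difference, the sketch below reduces the three conditions to booleans and compares an "all signals must agree" rule with an "any signal is enough" rule. It is a simplified model based only on the description above, not the C# implementation of IsOutOfOrderMessage; `OutOfOrderSignals`, `is_out_of_order_before`, and `is_out_of_order_after` are hypothetical names, and how each signal is actually computed is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutOfOrderSignals:
    """Hypothetical summary of the three conditions named in this PR."""
    # No record of the instance in storage, and the message has not
    # yet been dequeued 5 times.
    nonexistent_and_recently_dequeued: bool
    # The session's checkpoint is considered stale.
    checkpoint_is_stale: bool
    # The message is a completion event with no matching scheduled event.
    completion_without_scheduled_event: bool

    def as_tuple(self):
        return (
            self.nonexistent_and_recently_dequeued,
            self.checkpoint_is_stale,
            self.completion_without_scheduled_event,
        )


def is_out_of_order_before(signals):
    # Simplified model of the old behavior: a single "looks fine"
    # check is enough to accept the message.
    return all(signals.as_tuple())


def is_out_of_order_after(signals):
    # Simplified model of the new behavior: any suspicious signal
    # marks the message as out of order.
    return any(signals.as_tuple())


# The race from the steps above: the instance has no record yet and no
# scheduled event is in its history, but the checkpoint looks fresh.
race = OutOfOrderSignals(
    nonexistent_and_recently_dequeued=True,
    checkpoint_is_stale=False,
    completion_without_scheduled_event=True,
)
accepted_before = not is_out_of_order_before(race)
out_of_order_after = is_out_of_order_after(race)
```

For the race scenario only the checkpoint signal is clear, so the old rule accepts the message (`accepted_before` is `True`) while the new rule treats it as out of order (`out_of_order_after` is `True`).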

Resolves #1302

Copilot AI review requested due to automatic review settings February 23, 2026 21:58
Copilot AI (Contributor) left a comment

Pull request overview

Fixes a race condition in the Azure Storage backend when extended sessions are enabled and an orchestration performs ContinueAsNew, where activity responses can arrive before the new execution ID is checkpointed, causing a stuck orchestration.

Changes:

  • Moves session.UpdateRuntimeState(runtimeState) earlier in CompleteTaskOrchestrationWorkItemAsync so the in-memory session reflects the new execution ID before outbound messages are committed.

Sophia Tevosyan added 2 commits February 24, 2026 09:41
…o initial misdiagnosis, and changed the IsOutOfOrder logic instead
Copilot AI review requested due to automatic review settings February 24, 2026 17:56
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


sophiatev changed the title from "Fix the race condition for continue-as-new with extended sessions enabled" to "Fix race condition with continue-as-new" on Feb 24, 2026
Development

Successfully merging this pull request may close these issues.

Orchestration is stuck in the Running state

1 participant