Skip to content

Wait for MARKER_RECORDED to fire version callback on replay#2821

Open
eamsden wants to merge 7 commits intomasterfrom
eamsden/interleaved-UpdateCompleted-getversion
Open

Wait for MARKER_RECORDED to fire version callback on replay#2821
eamsden wants to merge 7 commits intomasterfrom
eamsden/interleaved-UpdateCompleted-getversion

Conversation

@eamsden
Copy link
Copy Markdown
Contributor

@eamsden eamsden commented Mar 30, 2026

What was changed

  • Changed VersionStateMachine replay behavior behind a new SDK flag VERSION_WAIT_FOR_MARKER so getVersion() no longer resumes workflow code when the fake RECORD_MARKER command is created during replay.
  • Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback.

Why?

Replay was resuming getVersion() too early relative to update completion protocol handling. In the interleaved update history from issue #2796, that allowed replay to queue the second version marker before the update completion work had been matched, which produced a replay-only nondeterminism error:

[TMPRL1100] getVersion call before the existing version marker event

Delaying the replay callback until the actual marker event is matched makes version replay ordering consistent with replayed side effects and fixes replay for the recorded history.

Checklist

  1. Closes Replay fails: UpdateCompleted events between version markers breaks replay #2796 but does not fix replay of existing histories.

  2. How was this tested:

  • ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
  • ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
  • Verified that the original repro history now replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler.
  1. Any docs updates needed?
  • No docs updates needed. This is an internal replay-ordering fix with regression tests.

@eamsden eamsden changed the title Eamsden/interleaved update completed getversion Wait for MARKER_RECORDED to fire version callback on replay Mar 30, 2026
@eamsden eamsden marked this pull request as ready for review March 30, 2026 17:51
@eamsden eamsden requested a review from a team as a code owner March 30, 2026 17:51
eamsden added 6 commits March 31, 2026 12:08
Add a Java replay test that mirrors the Kotlin GreetingWorkflow sample and replays the provided workflow history fixture.

Assert that replay fails with the embedded TMPRL1100 NonDeterministicException message so the reproducer stays pinned to the reported failure mode.
Run a constrained experiment for the interleaved update replay bug by changing VersionStateMachine replay timing only for histories with SKIP_YIELD_ON_VERSION set. In that path, getVersion still returns synchronously, but the replay callback is no longer fired at fake RECORD_MARKER command creation and is instead delayed until the real MARKER_RECORDED event is matched.

The goal of the experiment was to verify that flagged histories do not depend on the current early replay callback or its extra eventLoop scheduling. The legacy interleaved update repro history does not have SKIP_YIELD_ON_VERSION, so it continues to fail unchanged and serves as the control case.

Verified with:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest"
Change VersionStateMachine replay semantics so getVersion no longer resumes workflow code when the fake RECORD_MARKER command is created. Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback, which makes version-marker ordering consistent with replayed side effects.

This fixes the interleaved update replay bug reproduced by testGetVersionInterleavedUpdateReplayHistory.json. That history previously failed replay with [TMPRL1100] because the second getVersion callback ran ahead of update completion protocol handling. After this change, the same recorded history replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler.

The earlier flag-gated experiment showed that delaying the callback was already safe for histories with SKIP_YIELD_ON_VERSION. This commit removes that temporary gating and applies the same replay ordering to all histories.

Verified with:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch from ff59a05 to 43ab9df Compare March 31, 2026 17:08
have elected not to fix that specific history. Make sure there is a
run-then-replay regression test which confirms new histories won't be
broken that way.
* to make sure new workflows run with {@link SdkFlag#SKIP_YIELD_ON_VERSION} by default to avoid
* interleaved histories.
*/
@Ignore("Recorded history predates SKIP_YIELD_ON_VERSION. Use the live-history replay test.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore this test? Given that the history doesn't contain the flag, it should replay exactly as it did originally. That's specifically the reason for such replay test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a replay test where the history contains the flag and consequent events.

That is:

  1. change the code on your local machine to make the feature flag turned on by default
  2. run the test workflow
  3. capture that workflow history to a JSON file in the resources directory
  4. revert back the change you did in 1
  5. add a replay test that uses the history file captured in 3

That test sequence basically correspond to a user that would execute a workflow on a version of the SDK where the flag is enabled, but then rollback their deployment to a version of the SDK where the flag is supported but not enabled by default.

* Changes replay behavior of GetVersion to wait for the matching marker event before executing
* the callback.
*/
VERSION_WAIT_FOR_MARKER(5),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're getting more of these. We should think ahead at how to make sure this remains maintainable in the future.

Would you mind adding some comments on this flag and others above, indicating:

  • Which release introduced support for the flag (assume next patch release for new flags)
  • Which release turned it on (or "pending" if not yet turned on)
  • Ticket to the bug that the flag resolves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replay fails: UpdateCompleted events between version markers breaks replay

2 participants