Wait for MARKER_RECORDED to fire version callback on replay#2821
Wait for MARKER_RECORDED to fire version callback on replay#2821
Conversation
Add a Java replay test that mirrors the Kotlin GreetingWorkflow sample and replays the provided workflow history fixture. Assert that replay fails with the embedded TMPRL1100 NonDeterministicException message so the reproducer stays pinned to the reported failure mode.
Run a constrained experiment for the interleaved update replay bug by changing VersionStateMachine replay timing only for histories with SKIP_YIELD_ON_VERSION set. In that path, getVersion still returns synchronously, but the replay callback is no longer fired at fake RECORD_MARKER command creation and is instead delayed until the real MARKER_RECORDED event is matched. The goal of the experiment was to verify that flagged histories do not depend on the current early replay callback or its extra eventLoop scheduling. The legacy interleaved update repro history does not have SKIP_YIELD_ON_VERSION, so it continues to fail unchanged and serves as the control case. Verified with: ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest" ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest"
Change VersionStateMachine replay semantics so getVersion no longer resumes workflow code when the fake RECORD_MARKER command is created. Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback, which makes version-marker ordering consistent with replayed side effects. This fixes the interleaved update replay bug reproduced by testGetVersionInterleavedUpdateReplayHistory.json. That history previously failed replay with [TMPRL1100] because the second getVersion callback ran ahead of update completion protocol handling. After this change, the same recorded history replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler. The earlier flag-gated experiment showed that delaying the callback was already safe for histories with SKIP_YIELD_ON_VERSION. This commit removes that temporary gating and applies the same replay ordering to all histories. Verified with: ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest" ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
`VERSION_WAIT_FOR_MARKER`
ff59a05 to
43ab9df
Compare
have elected not to fix that specific history. Make sure there is a run-then-replay regression test which confirms new histories won't be broken that way.
| * to make sure new workflows run with {@link SdkFlag#SKIP_YIELD_ON_VERSION} by default to avoid | ||
| * interleaved histories. | ||
| */ | ||
| @Ignore("Recorded history predates SKIP_YIELD_ON_VERSION. Use the live-history replay test.") |
There was a problem hiding this comment.
Why ignore this test? Given that the history doesn't contain the flag, it should replay exactly as it did originally. That's specifically the reason for such replay test.
There was a problem hiding this comment.
Also add a replay test where the history contains the flag and consequent events.
That is:
- change the code on your local machine to make the feature flag turned on by default
- run the test workflow
- capture that workflow history to a JSON file in the resources directory
- revert back the change you did in 1
- add a replay test that uses the history file captured in 3
That test sequence basically correspond to a user that would execute a workflow on a version of the SDK where the flag is enabled, but then rollback their deployment to a version of the SDK where the flag is supported but not enabled by default.
| * Changes replay behavior of GetVersion to wait for the matching marker event before executing | ||
| * the callback. | ||
| */ | ||
| VERSION_WAIT_FOR_MARKER(5), |
There was a problem hiding this comment.
We're getting more of these. We should think ahead at how to make sure this remains maintainable in the future.
Would you mind adding some comments on this flag and others above, indicating:
- Which release introduced support for the flag (assume next patch release for new flags)
- Which release turned it on (or "pending" if not yet turned on)
- Ticket to the bug that the flag resolves.
What was changed
VersionStateMachinereplay behavior behind a new SDK flagVERSION_WAIT_FOR_MARKERsogetVersion()no longer resumes workflow code when the fakeRECORD_MARKERcommand is created during replay.MARKER_RECORDEDevent is matched before firing the version callback.Why?
Replay was resuming
getVersion()too early relative to update completion protocol handling. In the interleaved update history from issue#2796, that allowed replay to queue the second version marker before the update completion work had been matched, which produced a replay-only nondeterminism error:[TMPRL1100] getVersion call before the existing version marker eventDelaying the replay callback until the actual marker event is matched makes version replay ordering consistent with replayed side effects and fixes replay for the recorded history.
Checklist
Closes Replay fails: UpdateCompleted events between version markers breaks replay #2796 but does not fix replay of existing histories.
How was this tested:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"WorkflowReplayerand the lower-level direct-query replay task handler.