direct: Make cluster resize more resilient: fallback to regular update if resize failed due to INVALID_STATE #5716
Open
denik wants to merge 9 commits into
Cluster resize (num_workers/autoscale-only change) was only classified as Resize when the remote cluster was Running at plan time. A saved plan could fail with INVALID_STATE at apply time if the cluster terminated between plan and apply.
Fix: always classify num_workers/autoscale-only changes as Resize. DoResize tries Clusters.Resize first; on INVALID_STATE it falls back to the full clusters/edit path with the same retry loop as DoUpdate.
Update testserver to return INVALID_STATE from Resize when the cluster is not Running, matching real API behavior. Add acceptance test for the terminated-cluster fallback path.
Co-authored-by: Isaac
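The classification rule described above could be sketched roughly as follows. This is only an illustration, not the PR's code: the `Action` type, the `classify` and `isResizeField` helpers, and the field-path strings are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Action is a hypothetical stand-in for the planner's action type.
type Action string

const (
	ActionSkip   Action = "skip"
	ActionResize Action = "resize"
	ActionUpdate Action = "update"
)

// isResizeField reports whether a changed field path only affects cluster size.
// The path format ("autoscale.max_workers") is an assumption for this sketch.
func isResizeField(path string) bool {
	return path == "num_workers" || path == "autoscale" || strings.HasPrefix(path, "autoscale.")
}

// classify picks an action from the changed field paths alone.
// It deliberately does not look at the remote cluster state, so the same
// plan is produced whether or not the cluster is running at plan time.
func classify(changed []string) Action {
	if len(changed) == 0 {
		return ActionSkip
	}
	for _, p := range changed {
		if !isResizeField(p) {
			// Any non-size change requires the full update path.
			return ActionUpdate
		}
	}
	return ActionResize
}

func main() {
	fmt.Println(classify([]string{"num_workers"}))
	fmt.Println(classify([]string{"autoscale.max_workers"}))
	fmt.Println(classify([]string{"num_workers", "spark_version"}))
}
```

Because the decision depends only on the changed attributes, a stale view of the cluster state cannot make the plan wrong; the state check moves to apply time.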
Approval status: pending
Integration test report
Commit: 0e73b99
21 interesting tests: 13 SKIP, 7 KNOWN, 1 RECOVERED
Top 4 slowest tests (at least 2 minutes):
Changes
DoResize now tries clusters/resize first and falls back to clusters/edit (via DoUpdate) on INVALID_STATE. PlanEntry is threaded through to DoResize so the fallback reuses the existing edit+retry logic without duplication.
Why
A saved plan records the action at plan time (e.g. resize when the cluster was running). If the cluster terminates before the plan is applied, the resize API returns INVALID_STATE and the deploy fails. The fallback makes apply resilient to this race.
This also helps for local-only plan (#5680) where we don't have remote state available to check. With this change we can always plan 'resize' based on changed attributes.
Tests
New acceptance test resize-terminated-fallback: plans while the cluster is running (plan shows resize), terminates the cluster, applies the saved plan, and confirms both a failed resize request and a successful edit fallback in the request log.
This PR was written by Claude Code.