direct: Make cluster resize more resilient: fallback to regular update if resize failed due to INVALID_STATE#5716

Open
denik wants to merge 9 commits into main from denik/clusters-resize-then-edit

@denik denik commented Jun 25, 2026

Contributor

Changes

DoResize now tries clusters/resize first and falls back to clusters/edit (via DoUpdate) on INVALID_STATE. PlanEntry is threaded through to DoResize so the fallback reuses the existing edit+retry logic without duplication.
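
The resize-then-edit flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `apiError`, `isInvalidState`, and `resizeWithFallback` are hypothetical names, and the two callbacks stand in for the clusters/resize and clusters/edit calls.

```go
package main

import (
	"errors"
	"fmt"
)

// apiError is a hypothetical stand-in for an API error that carries an error code.
type apiError struct {
	Code string
}

func (e *apiError) Error() string { return "api error: " + e.Code }

// isInvalidState reports whether err wraps an apiError with code INVALID_STATE.
func isInvalidState(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.Code == "INVALID_STATE"
}

// resizeWithFallback calls resize first and falls back to edit only when
// resize fails with INVALID_STATE. Any other error is returned unchanged.
// It returns the name of the path that succeeded.
func resizeWithFallback(resize, edit func() error) (string, error) {
	err := resize()
	if err == nil {
		return "resize", nil
	}
	if !isInvalidState(err) {
		return "", err
	}
	if err := edit(); err != nil {
		return "", err
	}
	return "edit", nil
}

func main() {
	// Simulate a cluster that terminated between plan and apply.
	terminated := func() error {
		return fmt.Errorf("resize: %w", &apiError{Code: "INVALID_STATE"})
	}
	editOK := func() error { return nil }

	path, err := resizeWithFallback(terminated, editOK)
	fmt.Println(path, err)
}
```

Only INVALID_STATE triggers the fallback in this sketch; other failures (for example a permission error) surface to the caller as before.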

Why

A saved plan records the action at plan time (e.g. resize when the cluster was running). If the cluster terminates before the plan is applied, the resize API returns INVALID_STATE and the deploy fails. The fallback makes apply resilient to this race.

This also helps with local-only plans (#5680), where remote state is not available to check. With this change, 'resize' can always be planned based on the changed attributes alone.

Tests

New acceptance test resize-terminated-fallback: plans while the cluster is running (plan shows resize), terminates the cluster, applies the saved plan, and confirms both a failed resize request and a successful edit fallback in the request log.
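
The request-log check at the end of that test could look something like the sketch below. The `loggedRequest` type, the `sawResizeThenEdit` helper, and the example paths are assumptions made for illustration; the actual acceptance test compares recorded output rather than calling a helper like this.

```go
package main

import (
	"fmt"
	"strings"
)

// loggedRequest is a hypothetical, simplified view of one recorded API call.
type loggedRequest struct {
	Path   string
	Status int
}

// sawResizeThenEdit reports whether the log contains a failed clusters/resize
// request followed later by a successful clusters/edit request.
func sawResizeThenEdit(requests []loggedRequest) bool {
	failedResize := false
	for _, r := range requests {
		switch {
		case strings.HasSuffix(r.Path, "clusters/resize") && r.Status >= 400:
			failedResize = true
		case failedResize && strings.HasSuffix(r.Path, "clusters/edit") && r.Status < 300:
			return true
		}
	}
	return false
}

func main() {
	// Illustrative paths; the API version prefix is an assumption.
	requests := []loggedRequest{
		{Path: "/api/2.1/clusters/resize", Status: 400},
		{Path: "/api/2.1/clusters/edit", Status: 200},
	}
	fmt.Println(sawResizeThenEdit(requests)) // prints true
}
```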

This PR was written by Claude Code.

Cluster resize (num_workers/autoscale-only change) was only classified
as Resize when the remote cluster was Running at plan time. A saved plan
could fail with INVALID_STATE at apply time if the cluster terminated
between plan and apply.

Fix: always classify num_workers/autoscale-only changes as Resize.
DoResize tries Clusters.Resize first; on INVALID_STATE it falls back to
the full clusters/edit path with the same retry loop as DoUpdate.
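
As a rough illustration of the new classification rule, the sketch below decides the action purely from the set of changed fields, without looking at remote cluster state. The `classify` function, its string results, and the handling of an empty change set are assumptions for this example, not the planner's real interface.

```go
package main

import "fmt"

// resizeOnlyFields lists the fields that, per the commit message, qualify a
// change as a resize. The map itself is an illustrative assumption.
var resizeOnlyFields = map[string]bool{
	"num_workers": true,
	"autoscale":   true,
}

// classify returns "resize" when every changed field is a resize-only field
// and "update" otherwise. It never consults remote cluster state.
// An empty change set yields "skip" (an assumption made for this sketch).
func classify(changed []string) string {
	if len(changed) == 0 {
		return "skip"
	}
	for _, field := range changed {
		if !resizeOnlyFields[field] {
			return "update"
		}
	}
	return "resize"
}

func main() {
	fmt.Println(classify([]string{"num_workers"}))
	fmt.Println(classify([]string{"autoscale", "spark_version"}))
}
```

Because the decision no longer depends on whether the cluster was Running at plan time, a wrong guess is corrected at apply time by the INVALID_STATE fallback.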

Update testserver to return INVALID_STATE from Resize when the cluster
is not Running, matching real API behavior. Add acceptance test for the
terminated-cluster fallback path.
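
The test server behavior described here might be modeled along these lines. This is a simplified sketch under stated assumptions: `fakeServer`, `fakeCluster`, the state strings, and the HTTP status codes are illustrative and are not taken from libs/testserver/clusters.go.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory cluster record.
type fakeCluster struct {
	State      string
	NumWorkers int
}

// fakeServer is a hypothetical in-memory stand-in for the clusters API.
type fakeServer struct {
	clusters map[string]*fakeCluster
}

// resize applies the new worker count only when the cluster is running.
// It returns an HTTP status and an error code; both values are assumptions
// chosen for illustration.
func (s *fakeServer) resize(id string, numWorkers int) (int, string) {
	c, ok := s.clusters[id]
	if !ok {
		return 404, "RESOURCE_DOES_NOT_EXIST"
	}
	if c.State != "RUNNING" {
		// Reject the resize and leave the cluster untouched.
		return 400, "INVALID_STATE"
	}
	c.NumWorkers = numWorkers
	return 200, ""
}

func main() {
	s := &fakeServer{clusters: map[string]*fakeCluster{
		"c1": {State: "TERMINATED", NumWorkers: 2},
	}}
	status, code := s.resize("c1", 4)
	fmt.Println(status, code, s.clusters["c1"].NumWorkers)
}
```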

Co-authored-by: Isaac

github-actions Bot commented Jun 25, 2026

Contributor

Approval status: pending

/acceptance/bundle/ - needs approval

5 files changed
Suggested: @pietern
Also eligible: @andrewnester, @janniklasrose, @shreyas-goenka, @anton-107, @lennartkats-db

/bundle/ - needs approval

Files: bundle/direct/apply.go, bundle/direct/dresources/adapter.go, bundle/direct/dresources/cluster.go
Suggested: @pietern
Also eligible: @andrewnester, @janniklasrose, @shreyas-goenka, @anton-107, @lennartkats-db

General files (require maintainer)

Files: NEXT_CHANGELOG.md, libs/testserver/clusters.go
Based on git history:

  • @pietern -- recent work in bundle/direct/dresources/, ./, bundle/direct/

Any maintainer (@andrewnester, @anton-107, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@denik denik temporarily deployed to test-trigger-is June 25, 2026 09:37 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 25, 2026 09:40 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 25, 2026 09:42 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 25, 2026 09:47 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 25, 2026 09:50 — with GitHub Actions Inactive
@eng-dev-ecosystem-bot

Collaborator

Integration test report

Commit: 0e73b99

Run: 28161793102

| Env | 🟨 KNOWN | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- | --- | --- |
| 🟨 aws linux | 7 | 1 | 13 | 243 | 1025 | 5:23 |
| 🟨 aws windows | 7 | 1 | 13 | 245 | 1023 | 7:56 |
| 💚 aws-ucws linux | | 8 | 13 | 330 | 942 | 5:07 |
| 💚 aws-ucws windows | | 8 | 13 | 332 | 940 | 6:05 |
| 💚 azure linux | | 2 | 15 | 246 | 1023 | 4:54 |
| 💚 azure windows | | 2 | 15 | 248 | 1021 | 5:54 |
| 💚 azure-ucws linux | | 2 | 15 | 335 | 938 | 5:25 |
| 💚 azure-ucws windows | | 2 | 15 | 337 | 936 | 5:21 |
| 💚 gcp linux | | 2 | 15 | 245 | 1025 | 4:16 |
| 💚 gcp windows | | 2 | 15 | 247 | 1023 | 5:05 |

21 interesting tests: 13 SKIP, 7 KNOWN, 1 RECOVERED

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 TestAccept | 🟨K | 🟨K | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/ssh/connection | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 TestFetchRepositoryInfoAPI_FromRepo | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |

Top 4 slowest tests (at least 2 minutes):

| duration | env | testname |
| --- | --- | --- |
| 3:19 | azure windows | TestAccept |
| 3:18 | aws-ucws windows | TestAccept |
| 3:18 | gcp windows | TestAccept |
| 3:09 | azure-ucws windows | TestAccept |

@denik denik changed the title from "direct: fall back to edit when resize fails with INVALID_STATE" to "Make cluster resize more resilient: fallback to regular update if resize failed due to INVALID_STATE" Jun 25, 2026
@denik denik temporarily deployed to test-trigger-is June 25, 2026 11:07 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 25, 2026 11:09 — with GitHub Actions Inactive
@denik denik changed the title from "Make cluster resize more resilient: fallback to regular update if resize failed due to INVALID_STATE" to "direct: Make cluster resize more resilient: fallback to regular update if resize failed due to INVALID_STATE" Jun 25, 2026
@denik denik temporarily deployed to test-trigger-is June 25, 2026 11:14 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 25, 2026 11:18 — with GitHub Actions Inactive
@denik denik deployed to test-trigger-is June 25, 2026 11:18 — with GitHub Actions Active