Skip to content

fix(rest): retry on store conflict in read-modify-write handlers#149

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/activate-retry-on-conflict
Jun 12, 2026
Merged

fix(rest): retry on store conflict in read-modify-write handlers#149
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/activate-retry-on-conflict

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 12, 2026

Copy link
Copy Markdown
Member

What

POST /v1/resource-definitions/{rd}/resources/{node}/activate could answer 409 conflict: store object was modified, retry the request when a controller reconciler updated the Resource CRD between the handler's read and write (release-gate TestGroupFRActivateDeactivate). Upstream LINSTOR never returns 409 on activate and the Python CLI does not retry it, so the conflict leaked straight to the operator.

The activate/deactivate handlers persisted the flag toggle via a bare Get -> mutate -> Update with no conflict retry. This PR routes them through the store's established typed-Patch helpers (PatchResourceSpec and siblings), which re-fetch the live object and re-apply the mutation under RetryOnConflict with backoff — no stale-snapshot re-PUT.

Audit

The whole REST surface was audited for the same read-modify-write-no-retry shape. Also converted:

  • node evict/restore flag toggles, node create upsert, node_type modify -> PatchNodeSpec
  • explicit resource-create and make-available layer-list pass-throughs, rd clone prop edits, resource-connection path delete, per-RD DRBD shared-secret -> PatchResourceDefinitionSpec
  • per-VD passphrase rotation -> PatchVolumeDefinitionSpec
  • volume-group create/modify/delete -> PatchResourceGroup

Toggle-disk, migrate-disk, resource/RG/VD modify, props endpoints, and the autoplace main paths already used the retrying patch helpers and are unchanged. Wire-visible status envelopes (duplicate VG 409, absent VG warn no-op, 404 precedence) are preserved.

Testing

  • New regression test injects a conflicting concurrent write between the handler's read and write via a fake-client interceptor and asserts activate/deactivate answer 200 with no lost update (fails with 409 on the pre-fix handler).
  • go test ./... green; go test -race -count=1 ./pkg/rest/... green; golangci-lint run 0 issues.
  • Integration: go test -tags=integration -run TestGroupF ./tests/integration/... green; TestGroupFRActivateDeactivate passes 5/5 consecutive runs.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced concurrent update handling across resources, nodes, and resource definitions to prevent data loss during conflicts.
    • Improved robustness of property updates and state changes under contention.
  • Tests

    • Added regression test for resource activation/deactivation under concurrent modification.

Andrei Kvapil (kvaps) and others added 2 commits June 12, 2026 19:31
The activate/deactivate handlers persisted the INACTIVE flag toggle
via a bare Get -> mutate -> Update, so a controller reconciler writing
the same Resource CRD between the read and the write surfaced the
store's optimistic-lock conflict as a 409 to the operator. Upstream
LINSTOR never returns 409 on activate and the Python CLI does not
retry it, which failed the release-gate integration suite.

Route the flag mutation through PatchResourceSpec: the store
re-fetches the live object and re-applies the mutation under
RetryOnConflict with backoff, so concurrent reconciler writes
converge instead of bouncing the request.

Add a regression test that injects a conflicting concurrent write
between the handler's read and write via a fake-client interceptor
and asserts both endpoints answer 200 with no lost update.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…t-retrying patches

Audit follow-up to the activate/deactivate 409 fix: the same
Get -> mutate -> Update shape (no retry, stale wire snapshot on
conflict) survived in several other write handlers. Any reconciler
or sibling write landing between the read and the write leaked the
store's optimistic-lock conflict to the operator as a 409.

Convert them to the established typed-Patch helpers, which re-fetch
the live object and re-apply the mutation under RetryOnConflict:

- node evict/restore flag toggles -> PatchNodeSpec
- node create upsert and node_type modify -> PatchNodeSpec
- explicit resource-create and make-available layer-list
  pass-throughs -> PatchResourceDefinitionSpec
- rd clone prop edits -> PatchResourceDefinitionSpec
- resource-connection path delete -> PatchResourceDefinitionSpec
- per-RD DRBD shared-secret set -> PatchResourceDefinitionSpec
- per-VD passphrase rotation -> PatchVolumeDefinitionSpec
- volume-group create/modify/delete -> PatchResourceGroup

Status-specific envelopes (duplicate VG 409, absent VG warn-mask
no-op) are preserved via closure sentinels so the patch aborts
without a persist on those paths.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR replaces read-modify-write patterns (Get → mutate local copy → Update) with typed patch operations that include conflict retry semantics across nine REST handler files. Node, ResourceDefinition, Resource, and ResourceGroup handlers now apply mutations via PatchNodeSpec, PatchResourceDefinitionSpec, PatchResourceSpec, and PatchResourceGroup, preventing lost updates from concurrent reconciler writes. A new conflict-handling regression test validates the retry behavior.

Changes

Patch-Based Concurrency for Storage Operations

Layer / File(s) Summary
Node Spec Patch Operations
pkg/rest/nodes.go, pkg/rest/node_lifecycle.go
Node registration (upsert-on-conflict), node-type updates, and flag mutations now use PatchNodeSpec with inline mutation closures, replacing Get + Update cycles.
ResourceDefinition Spec Patch Operations
pkg/rest/autoplace.go, pkg/rest/drbd_passphrase.go, pkg/rest/rd_clone.go, pkg/rest/resource_connections.go, pkg/rest/vd_passphrase_bug_233.go
LayerStack stamping, DRBD shared-secret, clone property edits, connection path deletion, and VD passphrase rotation now use PatchResourceDefinitionSpec with conflict retry. Early VD existence check added to prevent validation errors masking 404.
Resource Spec Patch Operations and Conflict Testing
pkg/rest/resource_adjust.go, pkg/rest/resource_adjust_conflict_test.go
Activate/deactivate flag mutations use PatchResourceSpec with documented conflict-retry behavior. New test suite verifies that concurrent out-of-band writes are not lost when the handler retries on store conflicts via injected side-writes.
ResourceGroup Spec Patch Operations with Sentinel Errors
pkg/rest/resource_group_extras.go
VolumeGroup create, update, and delete handlers use PatchResourceGroup with internal sentinel errors (duplicate/not-found) returned from patch closures, allowing handlers to map VG-specific conditions to correct HTTP responses without surfacing store errors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Patches instead of writes so grand,
No more conflicts across the land,
Retry when the state has changed,
Concurrent writes are rearranged,
Lost updates? Not a chance—
Each spec gets its conflict dance! 🎭

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main objective of the pull request: converting read-modify-write handlers to use retry-on-conflict patch helpers to fix a race condition in store operations.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/activate-retry-on-conflict

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors several REST handlers to use Patch operations with retry-on-conflict instead of the Get-mutate-Update pattern, preventing concurrent reconciler writes from causing 409 conflicts. It also adds a regression test to verify that the store absorbs conflicts and retries successfully. The review feedback highlights a critical issue in pkg/rest/nodes.go where overwriting live with *n during node re-registration silently wipes existing node flags and custom properties, and suggests merging properties and preserving flags to prevent data loss.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/rest/nodes.go
Comment on lines +581 to +585
err = s.Store.Nodes().PatchNodeSpec(r.Context(), n.Name, func(live *apiv1.Node) error {
*live = *n

return nil
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Completely overwriting live with *n during an upsert (re-registration) will silently wipe any existing node flags (such as EVICTED) and custom properties (such as topology properties or Aux/* keys) that are not present in the incoming registration request n.

Since satellite re-registrations or operator re-runs of linstor node create typically do not carry these controller-managed flags or custom metadata, this can lead to accidental loss of critical state (e.g., aborting an active node eviction process).

Consider preserving the existing flags and merging the properties to prevent data loss.

		err = s.Store.Nodes().PatchNodeSpec(r.Context(), n.Name, func(live *apiv1.Node) error {
			existingFlags := live.Flags
			existingProps := live.Props

			*live = *n

			// Preserve existing flags (like EVICTED) if not explicitly overwritten
			for _, f := range existingFlags {
				if !slices.Contains(live.Flags, f) {
					live.Flags = append(live.Flags, f)
				}
			}

			// Merge properties so custom metadata (like topology labels or Aux/*) is not wiped
			if len(live.Props) == 0 {
				live.Props = existingProps
			} else if len(existingProps) > 0 {
				mergedProps := make(map[string]string, len(existingProps)+len(live.Props))
				for k, v := range existingProps {
					mergedProps[k] = v
				}
				for k, v := range live.Props {
					mergedProps[k] = v
				}
				live.Props = mergedProps
			}

			return nil
		})

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/rest/autoplace.go`:
- Around line 1558-1570: The RD patch call that sets rd.LayerStack when
env.LayerList is present currently ignores errors (the returned value from
s.Store.ResourceDefinitions().PatchResourceDefinitionSpec is discarded),
allowing the handler to proceed to createOrPromoteResource() and return success
even if the patch failed or returned ErrNotFound; change this by capturing the
error from PatchResourceDefinitionSpec (called with rdName and the rd mutation
closure) and if non-nil return or propagate that error from the handler
(handling ErrNotFound specifically if needed) instead of continuing to create
the resource so the caller sees the concrete failure and the new Resource cannot
be created out-of-sync with its RD.

In `@pkg/rest/nodes.go`:
- Around line 576-585: The PatchNodeSpec closure used in
s.Store.Nodes().PatchNodeSpec should not do a full object replace via "*live =
*n" because that overwrites controller-managed fields; instead merge only the
create-API owned fields from "n" into "live" (or explicitly copy/preserve
controller-managed spec fields such as the DRBD ranges referenced in
pkg/store/k8s/patch_bug_208_test.go) before returning. Update the closure passed
to s.Store.Nodes().PatchNodeSpec to set only the allowed spec fields (or to read
and reassign preserved fields from the original live node into the result)
rather than replacing the whole "live" object.

In `@pkg/rest/resource_adjust_conflict_test.go`:
- Around line 100-108: The test currently only rejects negative RetCode; update
the assertion to require the exact success code returned by mutateResourceFlag()
instead of just non-negative. After decoding into rc (apiv1.APICallRc), assert
len(rc) > 0 and that rc[0].RetCode equals the expected maskInfo success value
(the same symbol returned by mutateResourceFlag(), e.g. maskInfo or the
corresponding constant), failing the test if it does not match to catch
wire-contract regressions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b0bee9f-111f-49b6-ac0b-603d17806865

📥 Commits

Reviewing files that changed from the base of the PR and between c349d79 and 4860dc3.

📒 Files selected for processing (10)
  • pkg/rest/autoplace.go
  • pkg/rest/drbd_passphrase.go
  • pkg/rest/node_lifecycle.go
  • pkg/rest/nodes.go
  • pkg/rest/rd_clone.go
  • pkg/rest/resource_adjust.go
  • pkg/rest/resource_adjust_conflict_test.go
  • pkg/rest/resource_connections.go
  • pkg/rest/resource_group_extras.go
  • pkg/rest/vd_passphrase_bug_233.go

Comment thread pkg/rest/autoplace.go
Comment on lines 1558 to +1570
if len(env.LayerList) > 0 {
rd, getErr := s.Store.ResourceDefinitions().Get(r.Context(), rdName)
if getErr == nil && len(rd.LayerStack) == 0 {
rd.LayerStack = append([]string(nil), env.LayerList...)
_ = s.Store.ResourceDefinitions().Update(r.Context(), &rd)
}
// Bug 204b shape: typed-Patch with retry-on-conflict; the
// "only if unset" guard re-runs against the live RD on every
// retry so a concurrent reconciler write can't surface a 409
// (and a concurrent LayerStack set isn't clobbered).
_ = s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
func(rd *apiv1.ResourceDefinition) error {
if len(rd.LayerStack) == 0 {
rd.LayerStack = append([]string(nil), env.LayerList...)
}

return nil
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't swallow the RD patch failure on explicit create.

If layer_list is present and this patch fails, the handler still continues with createOrPromoteResource() and returns success even though the parent RD never recorded the requested LayerStack. That can leave the new Resource out of sync with its definition, and it also discards concrete failures like ErrNotFound after a concurrent RD delete.

Suggested fix
 	if len(env.LayerList) > 0 {
-		_ = s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
-			func(rd *apiv1.ResourceDefinition) error {
-				if len(rd.LayerStack) == 0 {
-					rd.LayerStack = append([]string(nil), env.LayerList...)
-				}
-
-				return nil
-			})
+		err := s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
+			func(rd *apiv1.ResourceDefinition) error {
+				if len(rd.LayerStack) == 0 {
+					rd.LayerStack = append([]string(nil), env.LayerList...)
+				}
+
+				return nil
+			})
+		if err != nil {
+			writeStoreError(w, err)
+
+			return nil, false
+		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(env.LayerList) > 0 {
rd, getErr := s.Store.ResourceDefinitions().Get(r.Context(), rdName)
if getErr == nil && len(rd.LayerStack) == 0 {
rd.LayerStack = append([]string(nil), env.LayerList...)
_ = s.Store.ResourceDefinitions().Update(r.Context(), &rd)
}
// Bug 204b shape: typed-Patch with retry-on-conflict; the
// "only if unset" guard re-runs against the live RD on every
// retry so a concurrent reconciler write can't surface a 409
// (and a concurrent LayerStack set isn't clobbered).
_ = s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
func(rd *apiv1.ResourceDefinition) error {
if len(rd.LayerStack) == 0 {
rd.LayerStack = append([]string(nil), env.LayerList...)
}
return nil
})
if len(env.LayerList) > 0 {
// Bug 204b shape: typed-Patch with retry-on-conflict; the
// "only if unset" guard re-runs against the live RD on every
// retry so a concurrent reconciler write can't surface a 409
// (and a concurrent LayerStack set isn't clobbered).
err := s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
func(rd *apiv1.ResourceDefinition) error {
if len(rd.LayerStack) == 0 {
rd.LayerStack = append([]string(nil), env.LayerList...)
}
return nil
})
if err != nil {
writeStoreError(w, err)
return nil, false
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/autoplace.go` around lines 1558 - 1570, The RD patch call that sets
rd.LayerStack when env.LayerList is present currently ignores errors (the
returned value from s.Store.ResourceDefinitions().PatchResourceDefinitionSpec is
discarded), allowing the handler to proceed to createOrPromoteResource() and
return success even if the patch failed or returned ErrNotFound; change this by
capturing the error from PatchResourceDefinitionSpec (called with rdName and the
rd mutation closure) and if non-nil return or propagate that error from the
handler (handling ErrNotFound specifically if needed) instead of continuing to
create the resource so the caller sees the concrete failure and the new Resource
cannot be created out-of-sync with its RD.

Comment thread pkg/rest/nodes.go
Comment on lines +576 to +585
// Bug 205 shape: the re-register replace is routed through
// PatchNodeSpec so a satellite Hello / reconciler write
// landing between the AlreadyExists probe and the persist
// re-applies the wire snapshot under RetryOnConflict
// instead of leaking a 409 into the registration loop.
err = s.Store.Nodes().PatchNodeSpec(r.Context(), n.Name, func(live *apiv1.Node) error {
*live = *n

return nil
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid full-replacing the live node inside the retry patch.

Line 582 assigns *live = *n, so this closure still reapplies the POST snapshot as a wholesale spec replacement. That means an idempotent re-register can wipe live-only node fields that were never sent in the request, including the typed DRBD ranges guarded by pkg/store/k8s/patch_bug_208_test.go:52-110. Merge only the fields that the create API owns, or explicitly preserve controller-managed spec fields before returning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/nodes.go` around lines 576 - 585, The PatchNodeSpec closure used in
s.Store.Nodes().PatchNodeSpec should not do a full object replace via "*live =
*n" because that overwrites controller-managed fields; instead merge only the
create-API owned fields from "n" into "live" (or explicitly copy/preserve
controller-managed spec fields such as the DRBD ranges referenced in
pkg/store/k8s/patch_bug_208_test.go) before returning. Update the closure passed
to s.Store.Nodes().PatchNodeSpec to set only the allowed spec fields (or to read
and reassign preserved fields from the original live node into the result)
rather than replacing the whole "live" object.

Comment on lines +100 to +108
var rc []apiv1.APICallRc

if err := json.NewDecoder(resp.Body).Decode(&rc); err != nil {
t.Fatalf("decode envelope: %v", err)
}

if len(rc) == 0 || rc[0].RetCode < 0 {
t.Fatalf("envelope: got %+v, want non-empty MASK_INFO success", rc)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the expected success code here.

RetCode < 0 only rejects fatal-error responses. A 200 envelope with the wrong non-negative ret_code would still pass, so this regression test can miss a wire-contract regression while the status code stays 200. Since mutateResourceFlag() returns a plain maskInfo success entry, pin that exact value.

Suggested fix
-			if len(rc) == 0 || rc[0].RetCode < 0 {
-				t.Fatalf("envelope: got %+v, want non-empty MASK_INFO success", rc)
-			}
+			if len(rc) != 1 || rc[0].RetCode != maskInfo {
+				t.Fatalf("envelope: got %+v, want one MASK_INFO success entry", rc)
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/resource_adjust_conflict_test.go` around lines 100 - 108, The test
currently only rejects negative RetCode; update the assertion to require the
exact success code returned by mutateResourceFlag() instead of just
non-negative. After decoding into rc (apiv1.APICallRc), assert len(rc) > 0 and
that rc[0].RetCode equals the expected maskInfo success value (the same symbol
returned by mutateResourceFlag(), e.g. maskInfo or the corresponding constant),
failing the test if it does not match to catch wire-contract regressions.

@kvaps Andrei Kvapil (kvaps) merged commit 662ea38 into main Jun 12, 2026
27 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant