fix(rest): retry on store conflict in read-modify-write handlers#149
Conversation
The activate/deactivate handlers persisted the INACTIVE flag toggle via a bare Get -> mutate -> Update, so a controller reconciler writing the same Resource CRD between the read and the write surfaced the store's optimistic-lock conflict as a 409 to the operator. Upstream LINSTOR never returns 409 on activate and the Python CLI does not retry it, which failed the release-gate integration suite. Route the flag mutation through PatchResourceSpec: the store re-fetches the live object and re-applies the mutation under RetryOnConflict with backoff, so concurrent reconciler writes converge instead of bouncing the request. Add a regression test that injects a conflicting concurrent write between the handler's read and write via a fake-client interceptor and asserts both endpoints answer 200 with no lost update. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…t-retrying patches Audit follow-up to the activate/deactivate 409 fix: the same Get -> mutate -> Update shape (no retry, stale wire snapshot on conflict) survived in several other write handlers. Any reconciler or sibling write landing between the read and the write leaked the store's optimistic-lock conflict to the operator as a 409. Convert them to the established typed-Patch helpers, which re-fetch the live object and re-apply the mutation under RetryOnConflict: - node evict/restore flag toggles -> PatchNodeSpec - node create upsert and node_type modify -> PatchNodeSpec - explicit resource-create and make-available layer-list pass-throughs -> PatchResourceDefinitionSpec - rd clone prop edits -> PatchResourceDefinitionSpec - resource-connection path delete -> PatchResourceDefinitionSpec - per-RD DRBD shared-secret set -> PatchResourceDefinitionSpec - per-VD passphrase rotation -> PatchVolumeDefinitionSpec - volume-group create/modify/delete -> PatchResourceGroup Status-specific envelopes (duplicate VG 409, absent VG warn-mask no-op) are preserved via closure sentinels so the patch aborts without a persist on those paths. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 WalkthroughWalkthroughThis PR replaces read-modify-write patterns (Get → mutate local copy → Update) with typed patch operations that include conflict retry semantics across nine REST handler files. Node, ResourceDefinition, Resource, and ResourceGroup handlers now apply mutations via PatchNodeSpec, PatchResourceDefinitionSpec, PatchResourceSpec, and PatchResourceGroup, preventing lost updates from concurrent reconciler writes. A new conflict-handling regression test validates the retry behavior. ChangesPatch-Based Concurrency for Storage Operations
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors several REST handlers to use Patch operations with retry-on-conflict instead of the Get-mutate-Update pattern, preventing concurrent reconciler writes from causing 409 conflicts. It also adds a regression test to verify that the store absorbs conflicts and retries successfully. The review feedback highlights a critical issue in pkg/rest/nodes.go where overwriting live with *n during node re-registration silently wipes existing node flags and custom properties, and suggests merging properties and preserving flags to prevent data loss.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| err = s.Store.Nodes().PatchNodeSpec(r.Context(), n.Name, func(live *apiv1.Node) error { | ||
| *live = *n | ||
|
|
||
| return nil | ||
| }) |
There was a problem hiding this comment.
Completely overwriting live with *n during an upsert (re-registration) will silently wipe any existing node flags (such as EVICTED) and custom properties (such as topology properties or Aux/* keys) that are not present in the incoming registration request n.
Since satellite re-registrations or operator re-runs of linstor node create typically do not carry these controller-managed flags or custom metadata, this can lead to accidental loss of critical state (e.g., aborting an active node eviction process).
Consider preserving the existing flags and merging the properties to prevent data loss.
err = s.Store.Nodes().PatchNodeSpec(r.Context(), n.Name, func(live *apiv1.Node) error {
existingFlags := live.Flags
existingProps := live.Props
*live = *n
// Preserve existing flags (like EVICTED) if not explicitly overwritten
for _, f := range existingFlags {
if !slices.Contains(live.Flags, f) {
live.Flags = append(live.Flags, f)
}
}
// Merge properties so custom metadata (like topology labels or Aux/*) is not wiped
if len(live.Props) == 0 {
live.Props = existingProps
} else if len(existingProps) > 0 {
mergedProps := make(map[string]string, len(existingProps)+len(live.Props))
for k, v := range existingProps {
mergedProps[k] = v
}
for k, v := range live.Props {
mergedProps[k] = v
}
live.Props = mergedProps
}
return nil
})There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/rest/autoplace.go`:
- Around line 1558-1570: The RD patch call that sets rd.LayerStack when
env.LayerList is present currently ignores errors (the returned value from
s.Store.ResourceDefinitions().PatchResourceDefinitionSpec is discarded),
allowing the handler to proceed to createOrPromoteResource() and return success
even if the patch failed or returned ErrNotFound; change this by capturing the
error from PatchResourceDefinitionSpec (called with rdName and the rd mutation
closure) and if non-nil return or propagate that error from the handler
(handling ErrNotFound specifically if needed) instead of continuing to create
the resource so the caller sees the concrete failure and the new Resource cannot
be created out-of-sync with its RD.
In `@pkg/rest/nodes.go`:
- Around line 576-585: The PatchNodeSpec closure used in
s.Store.Nodes().PatchNodeSpec should not do a full object replace via "*live =
*n" because that overwrites controller-managed fields; instead merge only the
create-API owned fields from "n" into "live" (or explicitly copy/preserve
controller-managed spec fields such as the DRBD ranges referenced in
pkg/store/k8s/patch_bug_208_test.go) before returning. Update the closure passed
to s.Store.Nodes().PatchNodeSpec to set only the allowed spec fields (or to read
and reassign preserved fields from the original live node into the result)
rather than replacing the whole "live" object.
In `@pkg/rest/resource_adjust_conflict_test.go`:
- Around line 100-108: The test currently only rejects negative RetCode; update
the assertion to require the exact success code returned by mutateResourceFlag()
instead of just non-negative. After decoding into rc (apiv1.APICallRc), assert
len(rc) > 0 and that rc[0].RetCode equals the expected maskInfo success value
(the same symbol returned by mutateResourceFlag(), e.g. maskInfo or the
corresponding constant), failing the test if it does not match to catch
wire-contract regressions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8b0bee9f-111f-49b6-ac0b-603d17806865
📒 Files selected for processing (10)
pkg/rest/autoplace.gopkg/rest/drbd_passphrase.gopkg/rest/node_lifecycle.gopkg/rest/nodes.gopkg/rest/rd_clone.gopkg/rest/resource_adjust.gopkg/rest/resource_adjust_conflict_test.gopkg/rest/resource_connections.gopkg/rest/resource_group_extras.gopkg/rest/vd_passphrase_bug_233.go
| if len(env.LayerList) > 0 { | ||
| rd, getErr := s.Store.ResourceDefinitions().Get(r.Context(), rdName) | ||
| if getErr == nil && len(rd.LayerStack) == 0 { | ||
| rd.LayerStack = append([]string(nil), env.LayerList...) | ||
| _ = s.Store.ResourceDefinitions().Update(r.Context(), &rd) | ||
| } | ||
| // Bug 204b shape: typed-Patch with retry-on-conflict; the | ||
| // "only if unset" guard re-runs against the live RD on every | ||
| // retry so a concurrent reconciler write can't surface a 409 | ||
| // (and a concurrent LayerStack set isn't clobbered). | ||
| _ = s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName, | ||
| func(rd *apiv1.ResourceDefinition) error { | ||
| if len(rd.LayerStack) == 0 { | ||
| rd.LayerStack = append([]string(nil), env.LayerList...) | ||
| } | ||
|
|
||
| return nil | ||
| }) |
There was a problem hiding this comment.
Don't swallow the RD patch failure on explicit create.
If layer_list is present and this patch fails, the handler still continues with createOrPromoteResource() and returns success even though the parent RD never recorded the requested LayerStack. That can leave the new Resource out of sync with its definition, and it also discards concrete failures like ErrNotFound after a concurrent RD delete.
Suggested fix
if len(env.LayerList) > 0 {
- _ = s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
- func(rd *apiv1.ResourceDefinition) error {
- if len(rd.LayerStack) == 0 {
- rd.LayerStack = append([]string(nil), env.LayerList...)
- }
-
- return nil
- })
+ err := s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName,
+ func(rd *apiv1.ResourceDefinition) error {
+ if len(rd.LayerStack) == 0 {
+ rd.LayerStack = append([]string(nil), env.LayerList...)
+ }
+
+ return nil
+ })
+ if err != nil {
+ writeStoreError(w, err)
+
+ return nil, false
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if len(env.LayerList) > 0 { | |
| rd, getErr := s.Store.ResourceDefinitions().Get(r.Context(), rdName) | |
| if getErr == nil && len(rd.LayerStack) == 0 { | |
| rd.LayerStack = append([]string(nil), env.LayerList...) | |
| _ = s.Store.ResourceDefinitions().Update(r.Context(), &rd) | |
| } | |
| // Bug 204b shape: typed-Patch with retry-on-conflict; the | |
| // "only if unset" guard re-runs against the live RD on every | |
| // retry so a concurrent reconciler write can't surface a 409 | |
| // (and a concurrent LayerStack set isn't clobbered). | |
| _ = s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName, | |
| func(rd *apiv1.ResourceDefinition) error { | |
| if len(rd.LayerStack) == 0 { | |
| rd.LayerStack = append([]string(nil), env.LayerList...) | |
| } | |
| return nil | |
| }) | |
| if len(env.LayerList) > 0 { | |
| // Bug 204b shape: typed-Patch with retry-on-conflict; the | |
| // "only if unset" guard re-runs against the live RD on every | |
| // retry so a concurrent reconciler write can't surface a 409 | |
| // (and a concurrent LayerStack set isn't clobbered). | |
| err := s.Store.ResourceDefinitions().PatchResourceDefinitionSpec(r.Context(), rdName, | |
| func(rd *apiv1.ResourceDefinition) error { | |
| if len(rd.LayerStack) == 0 { | |
| rd.LayerStack = append([]string(nil), env.LayerList...) | |
| } | |
| return nil | |
| }) | |
| if err != nil { | |
| writeStoreError(w, err) | |
| return nil, false | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/rest/autoplace.go` around lines 1558 - 1570, The RD patch call that sets
rd.LayerStack when env.LayerList is present currently ignores errors (the
returned value from s.Store.ResourceDefinitions().PatchResourceDefinitionSpec is
discarded), allowing the handler to proceed to createOrPromoteResource() and
return success even if the patch failed or returned ErrNotFound; change this by
capturing the error from PatchResourceDefinitionSpec (called with rdName and the
rd mutation closure) and if non-nil return or propagate that error from the
handler (handling ErrNotFound specifically if needed) instead of continuing to
create the resource so the caller sees the concrete failure and the new Resource
cannot be created out-of-sync with its RD.
| // Bug 205 shape: the re-register replace is routed through | ||
| // PatchNodeSpec so a satellite Hello / reconciler write | ||
| // landing between the AlreadyExists probe and the persist | ||
| // re-applies the wire snapshot under RetryOnConflict | ||
| // instead of leaking a 409 into the registration loop. | ||
| err = s.Store.Nodes().PatchNodeSpec(r.Context(), n.Name, func(live *apiv1.Node) error { | ||
| *live = *n | ||
|
|
||
| return nil | ||
| }) |
There was a problem hiding this comment.
Avoid full-replacing the live node inside the retry patch.
Line 582 assigns *live = *n, so this closure still reapplies the POST snapshot as a wholesale spec replacement. That means an idempotent re-register can wipe live-only node fields that were never sent in the request, including the typed DRBD ranges guarded by pkg/store/k8s/patch_bug_208_test.go:52-110. Merge only the fields that the create API owns, or explicitly preserve controller-managed spec fields before returning.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/rest/nodes.go` around lines 576 - 585, The PatchNodeSpec closure used in
s.Store.Nodes().PatchNodeSpec should not do a full object replace via "*live =
*n" because that overwrites controller-managed fields; instead merge only the
create-API owned fields from "n" into "live" (or explicitly copy/preserve
controller-managed spec fields such as the DRBD ranges referenced in
pkg/store/k8s/patch_bug_208_test.go) before returning. Update the closure passed
to s.Store.Nodes().PatchNodeSpec to set only the allowed spec fields (or to read
and reassign preserved fields from the original live node into the result)
rather than replacing the whole "live" object.
| var rc []apiv1.APICallRc | ||
|
|
||
| if err := json.NewDecoder(resp.Body).Decode(&rc); err != nil { | ||
| t.Fatalf("decode envelope: %v", err) | ||
| } | ||
|
|
||
| if len(rc) == 0 || rc[0].RetCode < 0 { | ||
| t.Fatalf("envelope: got %+v, want non-empty MASK_INFO success", rc) | ||
| } |
There was a problem hiding this comment.
Assert the expected success code here.
RetCode < 0 only rejects fatal-error responses. A 200 envelope with the wrong non-negative ret_code would still pass, so this regression test can miss a wire-contract regression while the status code stays 200. Since mutateResourceFlag() returns a plain maskInfo success entry, pin that exact value.
Suggested fix
- if len(rc) == 0 || rc[0].RetCode < 0 {
- t.Fatalf("envelope: got %+v, want non-empty MASK_INFO success", rc)
- }
+ if len(rc) != 1 || rc[0].RetCode != maskInfo {
+ t.Fatalf("envelope: got %+v, want one MASK_INFO success entry", rc)
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/rest/resource_adjust_conflict_test.go` around lines 100 - 108, The test
currently only rejects negative RetCode; update the assertion to require the
exact success code returned by mutateResourceFlag() instead of just
non-negative. After decoding into rc (apiv1.APICallRc), assert len(rc) > 0 and
that rc[0].RetCode equals the expected maskInfo success value (the same symbol
returned by mutateResourceFlag(), e.g. maskInfo or the corresponding constant),
failing the test if it does not match to catch wire-contract regressions.
What
POST /v1/resource-definitions/{rd}/resources/{node}/activatecould answer409 conflict: store object was modified, retry the requestwhen a controller reconciler updated the Resource CRD between the handler's read and write (release-gateTestGroupFRActivateDeactivate). Upstream LINSTOR never returns 409 on activate and the Python CLI does not retry it, so the conflict leaked straight to the operator.The activate/deactivate handlers persisted the flag toggle via a bare Get -> mutate -> Update with no conflict retry. This PR routes them through the store's established typed-Patch helpers (
PatchResourceSpecand siblings), which re-fetch the live object and re-apply the mutation underRetryOnConflictwith backoff — no stale-snapshot re-PUT.Audit
The whole REST surface was audited for the same read-modify-write-no-retry shape. Also converted:
node_typemodify ->PatchNodeSpecPatchResourceDefinitionSpecPatchVolumeDefinitionSpecPatchResourceGroupToggle-disk, migrate-disk, resource/RG/VD modify, props endpoints, and the autoplace main paths already used the retrying patch helpers and are unchanged. Wire-visible status envelopes (duplicate VG 409, absent VG warn no-op, 404 precedence) are preserved.
Testing
go test ./...green;go test -race -count=1 ./pkg/rest/...green;golangci-lint run0 issues.go test -tags=integration -run TestGroupF ./tests/integration/...green;TestGroupFRActivateDeactivatepasses 5/5 consecutive runs.Summary by CodeRabbit
Release Notes
Bug Fixes
Tests