Skip to content

fix(rest): absorb informer-cache lag across read-after-write REST paths#151

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/rest-cache-lag-read-after-write
Jun 12, 2026
Merged

fix(rest): absorb informer-cache lag across read-after-write REST paths#151
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/rest-cache-lag-read-after-write

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 12, 2026

Copy link
Copy Markdown
Member

Problem

The REST server reads through the manager's informer-cached client while writes go straight to the apiserver. Any read that immediately follows a create — same wire client, sub-millisecond gap — can therefore be served by a cache that has not observed the write yet, surfacing a spurious 404 for an object that demonstrably exists. The mitigation pattern already exists (pkg/rest/cache_retry.go: RG/RD helpers, later extended to the snapshot/CSI-restore hot path), but it covered only the paths that had already failed in CI.

The latest victim was TestGroupB/PhysicalStorageCDPPropsPerKind/file-thin (status: got 404, want 202): the POST /v1/physical-storage/{node} handler matched the requested device paths against a device list served by a cache still trailing the discovery loop's PhysicalDevice write, found no match, and 404'd a valid create-device-pool request.

What changed

Instead of fixing one more endpoint, this PR closes out the whole class. Every store read in pkg/rest was enumerated and classified; the cache-retry pattern (3 attempts, 200 ms apart — a real 404 still surfaces after the budget, steady-state hits pay zero extra latency) is now applied uniformly to the operator-visible read-after-write paths:

  • Physical storage CDP — the device-path match retries while no device matches and none is busy (the failing case). A busy device (409) returns immediately since it is observably in the cache.
  • Nodes — single GET after register; the re-register upsert fall-through (Create returned AlreadyExists — authoritative proof of existence — so a NotFound from the immediate patch read is definitionally cache lag).
  • Storage pools — single GET after create; the node-existence probe on pool create (node registered and pools created in the same reconcile pass).
  • Volume definitions — create probe, list probe and single GET (linstor-csi POSTs RD and VD back-to-back; VDs live inline on the RD, so a stale RD revision also reads as NotFound).
  • Resource definitions — the RG gates on RD create, including the post-write re-check that would otherwise mistake cache lag for a concurrent RG delete and roll back a freshly created RD.
  • Resources — per-node GET and per-resource volume reads after autoplace / spawn / make-available.
  • Snapshots — list probe and the create-side source-RD read (the snapshot GET/restore reads were already mitigated).
  • Query-size-info — the per-RG capacity preflight that runs right after the RG is ensured.

Deliberately left alone (staleness harmless or NotFound is the contract): pure list endpoints (absence renders as an empty list), delete-path idempotency probes, NotFound-is-normal upsert probes, best-effort soft-fail reads, the process-local KV store and storage-pool-definition registry, and the controller-props bag (a missing singleton folds into an empty map; no 404 can surface).

The three pre-existing helpers were deduplicated behind a generic core so all helpers share the exact same budget and error-wrapping semantics.

Testing

  • New L1 regressions extend the lag-store pattern: the CDP device-list path end-to-end on the wire (202 under cache miss, attach lands on the real device), the genuine-no-match 404 after the budget, the busy-device 409 short-circuit, and the node re-register patch path.
  • TestGroupB green 3×; full integration suite (-tags=integration, envtest 1.34.1, linstor-client 1.27.1) green.
  • go test ./... green; go test -race ./pkg/rest green; golangci-lint run — 0 issues.

Summary by CodeRabbit

  • New Features

    • Added generic cache-retry mechanism to handle informer cache synchronization lag.
  • Bug Fixes

    • Reduced spurious 404 errors during create-then-read operations across resource, snapshot, volume definition, storage pool, and node endpoints.
    • Improved resilience in high-frequency operations by retrying cache lookups during synchronization windows.
  • Tests

    • Added comprehensive tests for cache-miss scenarios covering device attachment, node re-registration, and resource reads.

Andrei Kvapil (kvaps) and others added 2 commits June 13, 2026 01:29
The RG / RD / Snapshot cache-retry helpers carried three identical
copies of the retry-on-NotFound loop. Fold them into a single
generic getWithCacheRetry core so upcoming helpers for other object
kinds reuse the exact same budget and error-wrapping semantics
instead of growing a fourth copy.

No behaviour change: same attempts, delay, error wrapping and
context-cancellation handling.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The REST server reads through the manager's informer-cached client
while writes go straight to the apiserver, so any read that
immediately follows a create can hit a cache that has not observed
the write yet and surface a spurious 404. The CDP path
(POST /v1/physical-storage/{node}) was the latest victim: the
device-list match ran against a cache still trailing the discovery
loop's PhysicalDevice write and 404'd a valid create-device-pool.

Sweep the remaining members of the class and apply the existing
cache-retry pattern (3 attempts, 200 ms apart; a real 404 still
surfaces after the budget):

- physical-storage CDP device match (the failing case)
- node single GET and the storage-pool create node probe
  (register node then read/create pools in one reconcile pass)
- node re-register upsert fall-through (Create says AlreadyExists,
  so a NotFound from the immediate patch read is cache lag)
- storage-pool single GET (create then read-back)
- volume-definition create probe, list probe and single GET
  (linstor-csi POSTs RD and VD back-to-back)
- RD-create resource-group gates, including the post-write re-check
  that would otherwise roll back a freshly created RD when the RG
  was created moments earlier
- per-node resource GET and per-resource volume reads
  (autoplace / make-available then immediate read)
- snapshot list probe and snapshot-create source-RD read
- query-size-info RG read (capacity preflight after RG ensure)

Pure list endpoints, delete-path idempotency probes, the
process-local KV store / storage-pool-definition registry and the
controller-props bag (missing folds into an empty map, no 404) are
deliberately left alone.

Regression coverage extends the lag-store pattern to the
physical-device list and the node re-register patch path.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR centralizes cache-retry logic by introducing a generic getWithCacheRetry[T] helper in cache_retry.go, refactoring all existing cache-retry functions to use it, and applying retry-aware reads across REST handlers to tolerate informer-cache lag during resource creation hot paths. Comprehensive test infrastructure and test cases verify cache-miss scenarios.

Changes

Informer Cache Lag Resilience Across REST Handlers

Layer / File(s) Summary
Generic cache-retry infrastructure and refactored helpers
pkg/rest/cache_retry.go
Introduces getWithCacheRetry[T] generic retry loop with fixed delay, attempt budgets, and context-cancellation handling. Refactors all existing cache-retry helpers (RG, RD, Snapshot, Node, StoragePool, VolumeDefinition, Resource) and patchNodeSpecWithCacheRetry to delegate to this central helper, eliminating duplicate retry logic. Updates pickCDPDevicesWithCacheRetry to use centralized retry semantics while preserving device-selection behavior.
Test infrastructure and cache-lag simulators
pkg/rest/cache_retry_test.go
Adds lagPhysicalDeviceStore to hide a device from ListForNode() for configured calls, and lagNodePatchStore to return ErrNotFound from PatchNodeSpec() for initial patch attempts. Extends flakyStore with PhysicalDevices() and Nodes() methods to route calls through lag wrappers.
End-to-end cache-lag scenario tests
pkg/rest/cache_retry_test.go
Adds four test cases: TestPhysicalStorageCDP_SurvivesCacheMissOnDeviceList (device hidden initially, attach succeeds after retry), TestPickCDPDevicesWithCacheRetry_RealNoMatchStillEmpty (non-existent device retries then returns empty), TestPickCDPDevicesWithCacheRetry_BusyDeviceReturnsImmediately (busy device detected in cache, returns immediately), TestNodeReRegister_SurvivesCacheMissOnPatch (node patch initially NotFound, second registration succeeds).
ResourceGroup and ResourceDefinition validation with cache-retry
pkg/rest/resource_definitions.go, pkg/rest/query_size_info.go
Applies getRGWithCacheRetry to RG existence checks in refuseRDCreateOnRGDeletedRace (rollback-on-concurrent-deletion) and refuseRDCreateOnUnknownRG (parent RG gate). Also applies getRGWithCacheRetry to RG lookup in handleQuerySizeInfo for capacity-query hot path with added documentation of cache-staleness scenario.
Node read and upsert patch operations with cache-retry
pkg/rest/nodes.go
Updates handleNodeGet to use getNodeWithCacheRetry for post-create GET hot path. Updates node-upsert conflict-recovery in upsertNodeAndDiskless to route PatchNodeSpec through patchNodeSpecWithCacheRetry when Create returns ErrAlreadyExists.
StoragePool, Node, and PhysicalDevice cache-retry
pkg/rest/storage_pools.go, pkg/rest/physical_storage.go
Applies getStoragePoolWithCacheRetry to handleNodeStoragePoolGet for pool-create hot path. Applies getNodeWithCacheRetry to handleNodeStoragePoolCreate to retry before failing. Applies pickCDPDevicesWithCacheRetry to handlePhysicalStorageCreate for device-target selection with retry-on-lag semantics.
Snapshot list and hydration with RD cache-retry
pkg/rest/snapshots.go
Updates handleSnapshotList to check parent RD existence via getRDWithCacheRetry for create-then-list hot path. Updates hydrateSnapshotFromRD to fetch source RD via getRDWithCacheRetry before populating snapshot fields, ensuring hydration succeeds when cache trails RD writes.
VolumeDefinition list, get, and create with RD cache-retry
pkg/rest/volume_definitions.go
Applies getRDWithCacheRetry to parent RD existence check in handleVDList. Applies getVDWithCacheRetry to specific VD fetch in handleVDGet. Adds early RD existence check via getRDWithCacheRetry in handleVDCreate before request-body decoding.
Resource and Autoplace read cache-retry
pkg/rest/autoplace.go
Updates handleResourceGet to use getRDWithCacheRetry for RD existence check and getResourceWithCacheRetry for per-node Resource fetch, absorbing cache lag on CreateVolume/Attach hot path. Response redaction, annotation-stripping, and volume-wrapping logic unchanged.
Per-resource Volume list and get cache-retry
pkg/rest/volumes_per_resource.go
Updates both handleVolumesPerResourceList and handleVolumesPerResourceGet to fetch target Resource via getResourceWithCacheRetry, tolerating informer-cache lag during replica-create hot path. List and single-get response behavior (empty slices, typed 404s) unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cozystack/blockstor#150: Both PRs address informer-cache lag on snapshot read-after-write by changing snapshot cache-retry helpers; this PR centralizes all retry logic via getWithCacheRetry while the related PR adds snapshot-specific helpers.
  • cozystack/blockstor#149: Both PRs update the node upsert conflict-recovery path in upsertNodeAndDiskless to use patch-based updates instead of stale read/update flows when Create returns ErrAlreadyExists.

Poem

🐰 A rabbit hops through caches deep,
Where retries wake and data creep,
With getWithCacheRetry's gentle hand,
Lag is tamed across the land—
No 404s on freshly born,
Just happy reads from dusk to morn! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: applying cache-lag fixes across read-after-write REST paths, which is the primary focus of all file changes in this PR.
Docstring Coverage ✅ Passed Docstring coverage is 93.75% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/rest-cache-lag-read-after-write

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the cache-retry mechanism by introducing a generic getWithCacheRetry helper and applying it across various REST endpoints (such as nodes, storage pools, volume definitions, resources, and physical storage) to mitigate informer-cache lag and prevent spurious 404 errors on hot paths. The feedback identifies a potential flakiness issue in TestPickCDPDevicesWithCacheRetry_BusyDeviceReturnsImmediately due to a timing-based assertion and suggests a more robust approach by counting the number of store calls instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +693 to +695
if elapsed := time.Since(start); elapsed >= cacheRetryDelay {
t.Fatalf("busy short-circuit took %s, expected < one cacheRetryDelay (%s)", elapsed, cacheRetryDelay)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The timing-based assertion elapsed >= cacheRetryDelay is prone to flakiness in heavily loaded CI environments where CPU scheduling delays can easily exceed 200ms.\n\nTo make this test 100% robust, consider verifying that the operation returned immediately by counting the number of calls to ListForNode instead.\n\nYou can achieve this by:\n1. Updating lagPhysicalDeviceStore.ListForNode to always increment f.calls (even when f.calls >= f.missBudget).\n2. Wrapping st in flakyStore with missBudget: 0 in this test.\n3. Asserting that calls == 1 after the call.\n\nExample:\ngo\n pds := &lagPhysicalDeviceStore{\n PhysicalDeviceStore: st.PhysicalDevices(),\n missBudget: 0,\n }\n flaky := &flakyStore{\n Store: st,\n pds: pds,\n }\n\n targets, busy, err := pickCDPDevicesWithCacheRetry(t.Context(), flaky, \"n1\", []string{\"/dev/disk/by-id/scsi-busy\"})\n // ...\n pds.mu.Lock()\n calls := pds.calls\n pds.mu.Unlock()\n\n if calls != 1 {\n t.Fatalf(\"expected exactly 1 call, got %d\", calls)\n }\n

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/rest/cache_retry_test.go`:
- Around line 693-695: The wall-clock assertion comparing elapsed to
cacheRetryDelay is scheduler-sensitive and flaky; replace it by asserting the
retry path was not executed via a deterministic call-count. Remove the
time.Since(start)/elapsed check and instead track the number of invocations of
the retryable/mock callback (e.g., add an atomic/int counter that the function
under test increments) and assert that counter is 0 (or unchanged) after the
operation; reference the existing symbols elapsed/start/cacheRetryDelay to
locate the timing check and the retry callback used in this test and replace the
timing assertion with the call-count assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78c91387-3f8d-4fea-9296-b0ad0612b675

📥 Commits

Reviewing files that changed from the base of the PR and between e4b747c and f36adb5.

📒 Files selected for processing (11)
  • pkg/rest/autoplace.go
  • pkg/rest/cache_retry.go
  • pkg/rest/cache_retry_test.go
  • pkg/rest/nodes.go
  • pkg/rest/physical_storage.go
  • pkg/rest/query_size_info.go
  • pkg/rest/resource_definitions.go
  • pkg/rest/snapshots.go
  • pkg/rest/storage_pools.go
  • pkg/rest/volume_definitions.go
  • pkg/rest/volumes_per_resource.go

Comment on lines +693 to +695
if elapsed := time.Since(start); elapsed >= cacheRetryDelay {
t.Fatalf("busy short-circuit took %s, expected < one cacheRetryDelay (%s)", elapsed, cacheRetryDelay)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Avoid scheduler-sensitive wall-clock assertion in busy-device short-circuit test.

Line 693’s elapsed < cacheRetryDelay check can fail on loaded CI runners even when no retry occurs, making this test flaky. Assert behavior via call count instead of elapsed wall time.

Suggested deterministic update
@@
-	start := time.Now()
-
-	targets, busy, err := pickCDPDevicesWithCacheRetry(t.Context(), st, "n1", []string{"/dev/disk/by-id/scsi-busy"})
+	flaky := &flakyStore{
+		Store: st,
+		pds: &lagPhysicalDeviceStore{
+			PhysicalDeviceStore: st.PhysicalDevices(),
+			hideName:            "__do-not-hide__",
+			missBudget:          cacheRetryAttempts,
+		},
+	}
+
+	targets, busy, err := pickCDPDevicesWithCacheRetry(t.Context(), flaky, "n1", []string{"/dev/disk/by-id/scsi-busy"})
 	if err != nil {
 		t.Fatalf("pickCDPDevicesWithCacheRetry: %v", err)
 	}
@@
-	if elapsed := time.Since(start); elapsed >= cacheRetryDelay {
-		t.Fatalf("busy short-circuit took %s, expected < one cacheRetryDelay (%s)", elapsed, cacheRetryDelay)
-	}
+	flaky.pds.mu.Lock()
+	gotCalls := flaky.pds.calls
+	flaky.pds.mu.Unlock()
+	if gotCalls != 1 {
+		t.Fatalf("expected exactly 1 ListForNode call for busy short-circuit, got %d", gotCalls)
+	}

As per coding guidelines, "**/*_test.go: L1 unit tests must run on every commit using go test ./...", so this assertion should avoid scheduler-dependent timing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/cache_retry_test.go` around lines 693 - 695, The wall-clock
assertion comparing elapsed to cacheRetryDelay is scheduler-sensitive and flaky;
replace it by asserting the retry path was not executed via a deterministic
call-count. Remove the time.Since(start)/elapsed check and instead track the
number of invocations of the retryable/mock callback (e.g., add an atomic/int
counter that the function under test increments) and assert that counter is 0
(or unchanged) after the operation; reference the existing symbols
elapsed/start/cacheRetryDelay to locate the timing check and the retry callback
used in this test and replace the timing assertion with the call-count
assertion.

Source: Coding guidelines

@kvaps Andrei Kvapil (kvaps) merged commit 2017184 into main Jun 12, 2026
15 checks passed
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 13, 2026
… (release-gate sweep) (#155)

* fix(stand): inject csi-sanity mTLS certs as PEM content, not file paths

The csi-sanity Job hung for the full harness wait (rc=1, 3600s) on the
release-gate sweep. Root cause: golinstor v0.58.0 (vendored by
piraeus-csi v1.10.1) reads LS_ROOT_CA / LS_USER_CERTIFICATE / LS_USER_KEY
as PEM *content* and feeds the value straight into
x509.CertPool.AppendCertsFromPEM / tls.X509KeyPair. The path-based
LS_*_FILE variants only exist from golinstor v0.61.0.

The manifest passed file paths (/etc/linstor/client/*.crt), so golinstor
tried to PEM-decode the literal path string and failed at startup with
"failed to get a valid certificate from 'LS_ROOT_CA'". linstor-csi never
opened /csi/csi.sock, the csi-sanity container looped on the missing
socket and exited, and the Job went Failed (not Complete) — wedging the
sweep's `kubectl wait --for=condition=complete` until the 1h timeout.

Inject the cert-manager secret keys directly as env values via
secretKeyRef so golinstor receives PEM bytes, and drop the now-unused
client-tls volume mount. With this fix the Job runs the CSI contract
end to end (linstor-csi connects over mTLS to controller:3371).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* docs(cli-parity): whitelist intentional CLI deltas surfaced by refresh

The release-gate cli-parity-refresh flagged 12 new non-PARITY rows. Five
of them are genuine, intentional blockstor behaviours that reproduce on a
freshly-seeded parity fixture (verified live against the dev-stand
LINSTOR 1.33.2 oracle), but were not matched by the existing whitelist:

- 07 `rd l --resource-definitions <rd>`: the flag-qualified sibling of the
  `stampRDLayerDataFromStack` Layers-column delta already accepted as row
  81; the bare `rd l` whitelist string did not literally cover it.
- 33 `s d <rd> <nonexistent-snap>` and 42 `r d <rd> <nonexistent-node>`:
  blockstor's idempotent-delete envelope (SUCCESS "already absent" vs the
  upstream WARNING "not found"); required for idempotent CSI/operator
  retries.
- 40 `n c <node> <ip> --node-type Satellite`: satellite-registration
  success-envelope shape (no inline node UUID; "no active connection"
  reported as SUCCESS-class rather than WARNING). Node is registered
  identically; envelope shape only.
- 16 `ps l`: BS `/v1/physical-storage` omits the per-device `size` key,
  crashing python-linstor's renderer. Hardware-discovery convenience
  surface unused by linstor-csi; whitelisted until the DTO field lands.

None of the twelve are regressions from the recent merges (#131-#151).
The remaining seven flagged rows are ambient stand-state drift (leftover
`test`/`testrg` resource-groups on BS, a stale `orc-tbtest` resource plus
old error-reports on the oracle) and are deliberately NOT whitelisted —
they return to PARITY once the stand is cleaned, and masking them would
blind the gate to real regressions on the core list commands.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant