fix(rest): absorb informer-cache lag across read-after-write REST paths#151
Conversation
The RG / RD / Snapshot cache-retry helpers carried three identical copies of the retry-on-NotFound loop. Fold them into a single generic getWithCacheRetry core so upcoming helpers for other object kinds reuse the exact same budget and error-wrapping semantics instead of growing a fourth copy. No behaviour change: same attempts, delay, error wrapping and context-cancellation handling. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The REST server reads through the manager's informer-cached client
while writes go straight to the apiserver, so any read that
immediately follows a create can hit a cache that has not observed
the write yet and surface a spurious 404. The CDP path
(POST /v1/physical-storage/{node}) was the latest victim: the
device-list match ran against a cache still trailing the discovery
loop's PhysicalDevice write and 404'd a valid create-device-pool.
Sweep the remaining members of the class and apply the existing
cache-retry pattern (3 attempts, 200 ms apart; a real 404 still
surfaces after the budget):
- physical-storage CDP device match (the failing case)
- node single GET and the storage-pool create node probe
(register node then read/create pools in one reconcile pass)
- node re-register upsert fall-through (Create says AlreadyExists,
so a NotFound from the immediate patch read is cache lag)
- storage-pool single GET (create then read-back)
- volume-definition create probe, list probe and single GET
(linstor-csi POSTs RD and VD back-to-back)
- RD-create resource-group gates, including the post-write re-check
that would otherwise roll back a freshly created RD when the RG
was created moments earlier
- per-node resource GET and per-resource volume reads
(autoplace / make-available then immediate read)
- snapshot list probe and snapshot-create source-RD read
- query-size-info RG read (capacity preflight after RG ensure)
Pure list endpoints, delete-path idempotency probes, the
process-local KV store / storage-pool-definition registry and the
controller-props bag (missing folds into an empty map, no 404) are
deliberately left alone.
Regression coverage extends the lag-store pattern to the
physical-device list and the node re-register patch path.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 WalkthroughWalkthroughThis PR centralizes cache-retry logic by introducing a generic ChangesInformer Cache Lag Resilience Across REST Handlers
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the cache-retry mechanism by introducing a generic getWithCacheRetry helper and applying it across various REST endpoints (such as nodes, storage pools, volume definitions, resources, and physical storage) to mitigate informer-cache lag and prevent spurious 404 errors on hot paths. The feedback identifies a potential flakiness issue in TestPickCDPDevicesWithCacheRetry_BusyDeviceReturnsImmediately due to a timing-based assertion and suggests a more robust approach by counting the number of store calls instead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if elapsed := time.Since(start); elapsed >= cacheRetryDelay { | ||
| t.Fatalf("busy short-circuit took %s, expected < one cacheRetryDelay (%s)", elapsed, cacheRetryDelay) | ||
| } |
There was a problem hiding this comment.
The timing-based assertion elapsed >= cacheRetryDelay is prone to flakiness in heavily loaded CI environments where CPU scheduling delays can easily exceed 200ms.\n\nTo make this test 100% robust, consider verifying that the operation returned immediately by counting the number of calls to ListForNode instead.\n\nYou can achieve this by:\n1. Updating lagPhysicalDeviceStore.ListForNode to always increment f.calls (even when f.calls >= f.missBudget).\n2. Wrapping st in flakyStore with missBudget: 0 in this test.\n3. Asserting that calls == 1 after the call.\n\nExample:\ngo\n pds := &lagPhysicalDeviceStore{\n PhysicalDeviceStore: st.PhysicalDevices(),\n missBudget: 0,\n }\n flaky := &flakyStore{\n Store: st,\n pds: pds,\n }\n\n targets, busy, err := pickCDPDevicesWithCacheRetry(t.Context(), flaky, \"n1\", []string{\"/dev/disk/by-id/scsi-busy\"})\n // ...\n pds.mu.Lock()\n calls := pds.calls\n pds.mu.Unlock()\n\n if calls != 1 {\n t.Fatalf(\"expected exactly 1 call, got %d\", calls)\n }\n
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/rest/cache_retry_test.go`:
- Around line 693-695: The wall-clock assertion comparing elapsed to
cacheRetryDelay is scheduler-sensitive and flaky; replace it by asserting the
retry path was not executed via a deterministic call-count. Remove the
time.Since(start)/elapsed check and instead track the number of invocations of
the retryable/mock callback (e.g., add an atomic/int counter that the function
under test increments) and assert that counter is 0 (or unchanged) after the
operation; reference the existing symbols elapsed/start/cacheRetryDelay to
locate the timing check and the retry callback used in this test and replace the
timing assertion with the call-count assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 78c91387-3f8d-4fea-9296-b0ad0612b675
📒 Files selected for processing (11)
pkg/rest/autoplace.gopkg/rest/cache_retry.gopkg/rest/cache_retry_test.gopkg/rest/nodes.gopkg/rest/physical_storage.gopkg/rest/query_size_info.gopkg/rest/resource_definitions.gopkg/rest/snapshots.gopkg/rest/storage_pools.gopkg/rest/volume_definitions.gopkg/rest/volumes_per_resource.go
| if elapsed := time.Since(start); elapsed >= cacheRetryDelay { | ||
| t.Fatalf("busy short-circuit took %s, expected < one cacheRetryDelay (%s)", elapsed, cacheRetryDelay) | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Avoid scheduler-sensitive wall-clock assertion in busy-device short-circuit test.
Line 693’s elapsed < cacheRetryDelay check can fail on loaded CI runners even when no retry occurs, making this test flaky. Assert behavior via call count instead of elapsed wall time.
Suggested deterministic update
@@
- start := time.Now()
-
- targets, busy, err := pickCDPDevicesWithCacheRetry(t.Context(), st, "n1", []string{"/dev/disk/by-id/scsi-busy"})
+ flaky := &flakyStore{
+ Store: st,
+ pds: &lagPhysicalDeviceStore{
+ PhysicalDeviceStore: st.PhysicalDevices(),
+ hideName: "__do-not-hide__",
+ missBudget: cacheRetryAttempts,
+ },
+ }
+
+ targets, busy, err := pickCDPDevicesWithCacheRetry(t.Context(), flaky, "n1", []string{"/dev/disk/by-id/scsi-busy"})
if err != nil {
t.Fatalf("pickCDPDevicesWithCacheRetry: %v", err)
}
@@
- if elapsed := time.Since(start); elapsed >= cacheRetryDelay {
- t.Fatalf("busy short-circuit took %s, expected < one cacheRetryDelay (%s)", elapsed, cacheRetryDelay)
- }
+ flaky.pds.mu.Lock()
+ gotCalls := flaky.pds.calls
+ flaky.pds.mu.Unlock()
+ if gotCalls != 1 {
+ t.Fatalf("expected exactly 1 ListForNode call for busy short-circuit, got %d", gotCalls)
+ }As per coding guidelines, "**/*_test.go: L1 unit tests must run on every commit using go test ./...", so this assertion should avoid scheduler-dependent timing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/rest/cache_retry_test.go` around lines 693 - 695, The wall-clock
assertion comparing elapsed to cacheRetryDelay is scheduler-sensitive and flaky;
replace it by asserting the retry path was not executed via a deterministic
call-count. Remove the time.Since(start)/elapsed check and instead track the
number of invocations of the retryable/mock callback (e.g., add an atomic/int
counter that the function under test increments) and assert that counter is 0
(or unchanged) after the operation; reference the existing symbols
elapsed/start/cacheRetryDelay to locate the timing check and the retry callback
used in this test and replace the timing assertion with the call-count
assertion.
Source: Coding guidelines
… (release-gate sweep) (#155) * fix(stand): inject csi-sanity mTLS certs as PEM content, not file paths The csi-sanity Job hung for the full harness wait (rc=1, 3600s) on the release-gate sweep. Root cause: golinstor v0.58.0 (vendored by piraeus-csi v1.10.1) reads LS_ROOT_CA / LS_USER_CERTIFICATE / LS_USER_KEY as PEM *content* and feeds the value straight into x509.CertPool.AppendCertsFromPEM / tls.X509KeyPair. The path-based LS_*_FILE variants only exist from golinstor v0.61.0. The manifest passed file paths (/etc/linstor/client/*.crt), so golinstor tried to PEM-decode the literal path string and failed at startup with "failed to get a valid certificate from 'LS_ROOT_CA'". linstor-csi never opened /csi/csi.sock, the csi-sanity container looped on the missing socket and exited, and the Job went Failed (not Complete) — wedging the sweep's `kubectl wait --for=condition=complete` until the 1h timeout. Inject the cert-manager secret keys directly as env values via secretKeyRef so golinstor receives PEM bytes, and drop the now-unused client-tls volume mount. With this fix the Job runs the CSI contract end to end (linstor-csi connects over mTLS to controller:3371). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * docs(cli-parity): whitelist intentional CLI deltas surfaced by refresh The release-gate cli-parity-refresh flagged 12 new non-PARITY rows. Five of them are genuine, intentional blockstor behaviours that reproduce on a freshly-seeded parity fixture (verified live against the dev-stand LINSTOR 1.33.2 oracle), but were not matched by the existing whitelist: - 07 `rd l --resource-definitions <rd>`: the flag-qualified sibling of the `stampRDLayerDataFromStack` Layers-column delta already accepted as row 81; the bare `rd l` whitelist string did not literally cover it. - 33 `s d <rd> <nonexistent-snap>` and 42 `r d <rd> <nonexistent-node>`: blockstor's idempotent-delete envelope (SUCCESS "already absent" vs the upstream WARNING "not found"); required for idempotent CSI/operator retries. - 40 `n c <node> <ip> --node-type Satellite`: satellite-registration success-envelope shape (no inline node UUID; "no active connection" reported as SUCCESS-class rather than WARNING). Node is registered identically; envelope shape only. - 16 `ps l`: BS `/v1/physical-storage` omits the per-device `size` key, crashing python-linstor's renderer. Hardware-discovery convenience surface unused by linstor-csi; whitelisted until the DTO field lands. None of the twelve are regressions from the recent merges (#131-#151). The remaining seven flagged rows are ambient stand-state drift (leftover `test`/`testrg` resource-groups on BS, a stale `orc-tbtest` resource plus old error-reports on the oracle) and are deliberately NOT whitelisted — they return to PARITY once the stand is cleaned, and masking them would blind the gate to real regressions on the core list commands. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> --------- Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
Problem
The REST server reads through the manager's informer-cached client while writes go straight to the apiserver. Any read that immediately follows a create — same wire client, sub-millisecond gap — can therefore be served by a cache that has not observed the write yet, surfacing a spurious 404 for an object that demonstrably exists. The mitigation pattern already exists (
pkg/rest/cache_retry.go: RG/RD helpers, later extended to the snapshot/CSI-restore hot path), but it covered only the paths that had already failed in CI.The latest victim was
TestGroupB/PhysicalStorageCDPPropsPerKind/file-thin(status: got 404, want 202): thePOST /v1/physical-storage/{node}handler matched the requested device paths against a device list served by a cache still trailing the discovery loop's PhysicalDevice write, found no match, and 404'd a valid create-device-pool request.What changed
Instead of fixing one more endpoint, this PR closes out the whole class. Every store read in
pkg/restwas enumerated and classified; the cache-retry pattern (3 attempts, 200 ms apart — a real 404 still surfaces after the budget, steady-state hits pay zero extra latency) is now applied uniformly to the operator-visible read-after-write paths:Createreturned AlreadyExists — authoritative proof of existence — so a NotFound from the immediate patch read is definitionally cache lag).Deliberately left alone (staleness harmless or NotFound is the contract): pure list endpoints (absence renders as an empty list), delete-path idempotency probes, NotFound-is-normal upsert probes, best-effort soft-fail reads, the process-local KV store and storage-pool-definition registry, and the controller-props bag (a missing singleton folds into an empty map; no 404 can surface).
The three pre-existing helpers were deduplicated behind a generic core so all helpers share the exact same budget and error-wrapping semantics.
Testing
TestGroupBgreen 3×; full integration suite (-tags=integration, envtest 1.34.1, linstor-client 1.27.1) green.go test ./...green;go test -race ./pkg/restgreen;golangci-lint run— 0 issues.Summary by CodeRabbit
New Features
Bug Fixes
Tests