fix(rest): absorb informer-cache lag on snapshot reads in the CSI restore hot path by kvaps · Pull Request #150 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-12T20:13:10Z

Summary

The release-gate integration run failed TestGroupJ/CSICreateVolumeFromSnapshot: the snapshot create POST succeeded, but the immediate follow-up GET /v1/resource-definitions/{rd}/snapshots/{snap} answered 404, failing the whole CreateVolume-from-snapshot flow with GetSnapshot pvc-clone-src/snap-1: 404 Not Found.

Root cause

The REST server reads through the manager's informer-cached client while writes go straight to the apiserver. linstor-csi's CreateVolume-from-snapshot sequence — create the snapshot, immediately GET it back (size guard), then POST snapshot-restore-resource — reads the snapshot before the cache has observed the write, so a spurious NotFound surfaces as 404. This is the same read-after-write class already mitigated for ResourceGroup/ResourceDefinition reads on the CreateVolume hot path (getRGWithCacheRetry / getRDWithCacheRetry); snapshot reads never received the mitigation. In production with several apiserver replicas behind a ClusterIP the same 404 fires whenever a follow-up call round-robins onto a replica whose cache trails the write.

Fix

Extend the established cache-retry helper family with getSnapshotWithCacheRetry (same retry budget and semantics as the RG/RD helpers).
Use it in the snapshot GET handler and in both snapshot-restore handlers (resource-restore and volume-definition-restore) — the three snapshot reads on the CSI restore hot path.
Genuinely missing snapshots still answer 404 after the retry budget; envelope shapes are unchanged.

Testing

New regression tests model the cache lag in a SnapshotStore wrapper (a successful Create arms NotFound misses for subsequent Gets) and pin create-then-immediately-GET and create-then-immediately-restore on the wire. Both reproduce the exact 404 without the fix and pass with it.
TestGroupJ (including CSICreateVolumeFromSnapshot) green 3x consecutively under envtest.
Full integration suite, go test ./..., and go test -race ./pkg/rest/... green; golangci-lint clean.

Summary by CodeRabbit

Bug Fixes
- Improved snapshot operations reliability. Snapshot retrieval and restore operations now automatically retry on temporary cache misses that can occur when reading snapshots immediately after creation. This fix eliminates intermittent "snapshot not found" failures that previously occurred during write-then-read sequences, enhancing reliability in snapshot workflows.

…tore hot path linstor-csi's CreateVolume-from-snapshot sequence POSTs the snapshot create and immediately GETs it back (size guard), then POSTs snapshot-restore-resource. The create writes straight to the apiserver while the follow-up reads are served from the controller-runtime informer cache, which may not have observed the write yet — the spurious 404 fails the whole CreateVolume (TestGroupJ/CSICreateVolumeFromSnapshot in the release gate). This is the same read-after-write class Bug 124 already mitigated for RG/RD reads on the CreateVolume path; extend the established cache-retry helper family to snapshot reads and use it in the snapshot GET handler and both restore handlers. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Model the informer-cache read-after-write lag in a SnapshotStore wrapper (a successful Create arms NotFound misses for subsequent Gets of that snapshot) and pin the linstor-csi hot-path contract on the wire: snapshot create POST followed by an immediate GET and an immediate snapshot-restore-resource POST must both succeed once the cache catches up within the retry budget, while a genuinely unknown snapshot still surfaces 404. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-12T20:13:41Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8426a0e2-14ce-4e7c-a508-c7bcb1f0b868

📥 Commits

Reviewing files that changed from the base of the PR and between 662ea38 and a249784.

📒 Files selected for processing (4)

pkg/rest/cache_retry.go
pkg/rest/cache_retry_test.go
pkg/rest/snapshot_restore.go
pkg/rest/snapshots.go

📝 Walkthrough

Walkthrough

This PR adds cache-lag resilience to snapshot read operations. A new getSnapshotWithCacheRetry helper retries on transient store.ErrNotFound errors to handle informer-cache sync delays across apiserver replicas. The helper is integrated into snapshot GET and restore handlers, and validated with cache-lag simulation tests.

Changes

Snapshot cache-lag mitigation

Layer / File(s)	Summary
Snapshot cache-retry helper `pkg/rest/cache_retry.go`	`getSnapshotWithCacheRetry` implements a 3-attempt retry loop for snapshot reads, returning immediately on non-NotFound errors, respecting context cancellation, and wrapping final NotFound errors with retry count. Inline comments document the informer-cache-lag race scenario when snapshots are created and read across sibling replicas.
Handler integration `pkg/rest/snapshots.go`, `pkg/rest/snapshot_restore.go`	Snapshot GET handler and both snapshot-restore handlers now use `getSnapshotWithCacheRetry` instead of direct store access, adding cache-lag tolerance so reads succeed immediately after creation despite transient informer-cache misses.
Test infrastructure and validation `pkg/rest/cache_retry_test.go`	Added `lagSnapshotStore` to simulate informer-cache lag by returning `ErrNotFound` for a configurable budget after snapshot creation, and extended `flakyStore` to support it. Unit tests verify the helper succeeds after cache misses and surfaces real NotFound errors. End-to-end HTTP tests assert immediate create-then-read and create-then-restore operations succeed despite simulated cache-lag 404s.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A snapshot cache that lags behind,
Now retries thrice to find what's mine,
Three sibling servers, sync in time,
Create and read in perfect rhyme! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding cache-retry logic for snapshot reads to handle informer-cache lag in the CSI restore flow.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/csi-snapshot-read-cache-lag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a retry mechanism (getSnapshotWithCacheRetry) to handle informer-cache lag when retrieving snapshots immediately after creation, preventing spurious 404 errors. This logic is integrated into snapshot retrieval and restore endpoints, accompanied by comprehensive tests modeling cache lag. Feedback on the changes suggests replacing time.After with time.NewTimer in the retry loop to prevent potential timer leaks upon context cancellation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T20:15:09Z

+		select {
+		case <-ctx.Done():
+			return apiv1.Snapshot{}, errors.Wrap(ctx.Err(), "get snapshot: context cancelled")
+		case <-time.After(cacheRetryDelay):
+		}


Using time.After in a loop or a frequently called function can lead to short-term timer leaks because the underlying timer is not garbage collected until it expires, even if the context is cancelled and the select block exits early. Using time.NewTimer and stopping it on context cancellation avoids this resource leak.

timer := time.NewTimer(cacheRetryDelay) select { case <-ctx.Done(): timer.Stop() return apiv1.Snapshot{}, errors.Wrap(ctx.Err(), "get snapshot: context cancelled") case <-timer.C: }

Andrei Kvapil (kvaps) and others added 2 commits June 12, 2026 22:56

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit e4b747c into main Jun 12, 2026
15 checks passed

coderabbitai Bot mentioned this pull request Jun 12, 2026

fix(rest): absorb informer-cache lag across read-after-write REST paths #151

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rest): absorb informer-cache lag on snapshot reads in the CSI restore hot path#150

fix(rest): absorb informer-cache lag on snapshot reads in the CSI restore hot path#150
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/csi-snapshot-read-cache-lag

Andrei Kvapil (kvaps) commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Andrei Kvapil (kvaps) commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrei Kvapil (kvaps) commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading