fix(rest): absorb informer-cache lag on snapshot reads in the CSI restore hot path#150
Conversation
…tore hot path linstor-csi's CreateVolume-from-snapshot sequence POSTs the snapshot create and immediately GETs it back (size guard), then POSTs snapshot-restore-resource. The create writes straight to the apiserver while the follow-up reads are served from the controller-runtime informer cache, which may not have observed the write yet — the spurious 404 fails the whole CreateVolume (TestGroupJ/CSICreateVolumeFromSnapshot in the release gate). This is the same read-after-write class Bug 124 already mitigated for RG/RD reads on the CreateVolume path; extend the established cache-retry helper family to snapshot reads and use it in the snapshot GET handler and both restore handlers. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Model the informer-cache read-after-write lag in a SnapshotStore wrapper (a successful Create arms NotFound misses for subsequent Gets of that snapshot) and pin the linstor-csi hot-path contract on the wire: snapshot create POST followed by an immediate GET and an immediate snapshot-restore-resource POST must both succeed once the cache catches up within the retry budget, while a genuinely unknown snapshot still surfaces 404. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR adds cache-lag resilience to snapshot read operations. A new ChangesSnapshot cache-lag mitigation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a retry mechanism (getSnapshotWithCacheRetry) to handle informer-cache lag when retrieving snapshots immediately after creation, preventing spurious 404 errors. This logic is integrated into snapshot retrieval and restore endpoints, accompanied by comprehensive tests modeling cache lag. Feedback on the changes suggests replacing time.After with time.NewTimer in the retry loop to prevent potential timer leaks upon context cancellation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| select { | ||
| case <-ctx.Done(): | ||
| return apiv1.Snapshot{}, errors.Wrap(ctx.Err(), "get snapshot: context cancelled") | ||
| case <-time.After(cacheRetryDelay): | ||
| } |
There was a problem hiding this comment.
Using time.After in a loop or a frequently called function can lead to short-term timer leaks because the underlying timer is not garbage collected until it expires, even if the context is cancelled and the select block exits early. Using time.NewTimer and stopping it on context cancellation avoids this resource leak.
timer := time.NewTimer(cacheRetryDelay)
select {
case <-ctx.Done():
timer.Stop()
return apiv1.Snapshot{}, errors.Wrap(ctx.Err(), "get snapshot: context cancelled")
case <-timer.C:
}
Summary
The release-gate integration run failed
TestGroupJ/CSICreateVolumeFromSnapshot: the snapshot create POST succeeded, but the immediate follow-upGET /v1/resource-definitions/{rd}/snapshots/{snap}answered 404, failing the whole CreateVolume-from-snapshot flow withGetSnapshot pvc-clone-src/snap-1: 404 Not Found.Root cause
The REST server reads through the manager's informer-cached client while writes go straight to the apiserver. linstor-csi's CreateVolume-from-snapshot sequence — create the snapshot, immediately GET it back (size guard), then POST
snapshot-restore-resource— reads the snapshot before the cache has observed the write, so a spurious NotFound surfaces as 404. This is the same read-after-write class already mitigated for ResourceGroup/ResourceDefinition reads on the CreateVolume hot path (getRGWithCacheRetry/getRDWithCacheRetry); snapshot reads never received the mitigation. In production with several apiserver replicas behind a ClusterIP the same 404 fires whenever a follow-up call round-robins onto a replica whose cache trails the write.Fix
getSnapshotWithCacheRetry(same retry budget and semantics as the RG/RD helpers).Testing
TestGroupJ(includingCSICreateVolumeFromSnapshot) green 3x consecutively under envtest.go test ./..., andgo test -race ./pkg/rest/...green; golangci-lint clean.Summary by CodeRabbit