Skip to content

fix(rest): absorb informer-cache lag on snapshot reads in the CSI restore hot path#150

Merged
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/csi-snapshot-read-cache-lag
Jun 12, 2026
Merged

fix(rest): absorb informer-cache lag on snapshot reads in the CSI restore hot path#150
Andrei Kvapil (kvaps) merged 2 commits into
mainfrom
fix/csi-snapshot-read-cache-lag

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 12, 2026

Copy link
Copy Markdown
Member

Summary

The release-gate integration run failed TestGroupJ/CSICreateVolumeFromSnapshot: the snapshot create POST succeeded, but the immediate follow-up GET /v1/resource-definitions/{rd}/snapshots/{snap} answered 404, failing the whole CreateVolume-from-snapshot flow with GetSnapshot pvc-clone-src/snap-1: 404 Not Found.

Root cause

The REST server reads through the manager's informer-cached client while writes go straight to the apiserver. linstor-csi's CreateVolume-from-snapshot sequence — create the snapshot, immediately GET it back (size guard), then POST snapshot-restore-resource — reads the snapshot before the cache has observed the write, so a spurious NotFound surfaces as 404. This is the same read-after-write class already mitigated for ResourceGroup/ResourceDefinition reads on the CreateVolume hot path (getRGWithCacheRetry / getRDWithCacheRetry); snapshot reads never received the mitigation. In production with several apiserver replicas behind a ClusterIP the same 404 fires whenever a follow-up call round-robins onto a replica whose cache trails the write.

Fix

  • Extend the established cache-retry helper family with getSnapshotWithCacheRetry (same retry budget and semantics as the RG/RD helpers).
  • Use it in the snapshot GET handler and in both snapshot-restore handlers (resource-restore and volume-definition-restore) — the three snapshot reads on the CSI restore hot path.
  • Genuinely missing snapshots still answer 404 after the retry budget; envelope shapes are unchanged.

Testing

  • New regression tests model the cache lag in a SnapshotStore wrapper (a successful Create arms NotFound misses for subsequent Gets) and pin create-then-immediately-GET and create-then-immediately-restore on the wire. Both reproduce the exact 404 without the fix and pass with it.
  • TestGroupJ (including CSICreateVolumeFromSnapshot) green 3x consecutively under envtest.
  • Full integration suite, go test ./..., and go test -race ./pkg/rest/... green; golangci-lint clean.

Summary by CodeRabbit

  • Bug Fixes
    • Improved snapshot operations reliability. Snapshot retrieval and restore operations now automatically retry on temporary cache misses that can occur when reading snapshots immediately after creation. This fix eliminates intermittent "snapshot not found" failures that previously occurred during write-then-read sequences, enhancing reliability in snapshot workflows.

Andrei Kvapil (kvaps) and others added 2 commits June 12, 2026 22:56
…tore hot path

linstor-csi's CreateVolume-from-snapshot sequence POSTs the snapshot
create and immediately GETs it back (size guard), then POSTs
snapshot-restore-resource. The create writes straight to the
apiserver while the follow-up reads are served from the
controller-runtime informer cache, which may not have observed the
write yet — the spurious 404 fails the whole CreateVolume
(TestGroupJ/CSICreateVolumeFromSnapshot in the release gate).

This is the same read-after-write class Bug 124 already mitigated
for RG/RD reads on the CreateVolume path; extend the established
cache-retry helper family to snapshot reads and use it in the
snapshot GET handler and both restore handlers.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Model the informer-cache read-after-write lag in a SnapshotStore
wrapper (a successful Create arms NotFound misses for subsequent
Gets of that snapshot) and pin the linstor-csi hot-path contract on
the wire: snapshot create POST followed by an immediate GET and an
immediate snapshot-restore-resource POST must both succeed once the
cache catches up within the retry budget, while a genuinely unknown
snapshot still surfaces 404.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8426a0e2-14ce-4e7c-a508-c7bcb1f0b868

📥 Commits

Reviewing files that changed from the base of the PR and between 662ea38 and a249784.

📒 Files selected for processing (4)
  • pkg/rest/cache_retry.go
  • pkg/rest/cache_retry_test.go
  • pkg/rest/snapshot_restore.go
  • pkg/rest/snapshots.go

📝 Walkthrough

Walkthrough

This PR adds cache-lag resilience to snapshot read operations. A new getSnapshotWithCacheRetry helper retries on transient store.ErrNotFound errors to handle informer-cache sync delays across apiserver replicas. The helper is integrated into snapshot GET and restore handlers, and validated with cache-lag simulation tests.

Changes

Snapshot cache-lag mitigation

Layer / File(s) Summary
Snapshot cache-retry helper
pkg/rest/cache_retry.go
getSnapshotWithCacheRetry implements a 3-attempt retry loop for snapshot reads, returning immediately on non-NotFound errors, respecting context cancellation, and wrapping final NotFound errors with retry count. Inline comments document the informer-cache-lag race scenario when snapshots are created and read across sibling replicas.
Handler integration
pkg/rest/snapshots.go, pkg/rest/snapshot_restore.go
Snapshot GET handler and both snapshot-restore handlers now use getSnapshotWithCacheRetry instead of direct store access, adding cache-lag tolerance so reads succeed immediately after creation despite transient informer-cache misses.
Test infrastructure and validation
pkg/rest/cache_retry_test.go
Added lagSnapshotStore to simulate informer-cache lag by returning ErrNotFound for a configurable budget after snapshot creation, and extended flakyStore to support it. Unit tests verify the helper succeeds after cache misses and surfaces real NotFound errors. End-to-end HTTP tests assert immediate create-then-read and create-then-restore operations succeed despite simulated cache-lag 404s.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A snapshot cache that lags behind,
Now retries thrice to find what's mine,
Three sibling servers, sync in time,
Create and read in perfect rhyme! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding cache-retry logic for snapshot reads to handle informer-cache lag in the CSI restore flow.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/csi-snapshot-read-cache-lag

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a retry mechanism (getSnapshotWithCacheRetry) to handle informer-cache lag when retrieving snapshots immediately after creation, preventing spurious 404 errors. This logic is integrated into snapshot retrieval and restore endpoints, accompanied by comprehensive tests modeling cache lag. Feedback on the changes suggests replacing time.After with time.NewTimer in the retry loop to prevent potential timer leaks upon context cancellation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/rest/cache_retry.go
Comment on lines +162 to +166
select {
case <-ctx.Done():
return apiv1.Snapshot{}, errors.Wrap(ctx.Err(), "get snapshot: context cancelled")
case <-time.After(cacheRetryDelay):
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using time.After in a loop or a frequently called function can lead to short-term timer leaks because the underlying timer is not garbage collected until it expires, even if the context is cancelled and the select block exits early. Using time.NewTimer and stopping it on context cancellation avoids this resource leak.

		timer := time.NewTimer(cacheRetryDelay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return apiv1.Snapshot{}, errors.Wrap(ctx.Err(), "get snapshot: context cancelled")
		case <-timer.C:
		}

@kvaps Andrei Kvapil (kvaps) merged commit e4b747c into main Jun 12, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant