cherrypick: fix QCloud PUT retry#24928
Open
jiangxinmeng1 wants to merge 5 commits into
Open
Conversation
(cherry picked from commit c1d47a0)
(cherry picked from commit ec2b815)
When setErr cancels the derived context in WriteMultipartParallel, the deferred abort call was using the already-canceled context, causing the AbortMultipartUpload request to be dropped. This leaked incomplete multipart uploads on COS/AWS for read error paths (e.g., size mismatch errors that reach setErr before CompleteMultipartUpload). Save the parent context before WithCancel and use context.WithoutCancel(parentCtx) for the abort defer so the abort request can always proceed regardless of context cancellation. Add tests that assert AbortMultipartUpload is sent for size-mismatch read errors in both QCloud and AWS multipart paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> (cherry picked from commit 698abf7)
The COS SDK enables retry by default (RetryOpt.Count = 3). MatrixOne wraps all operations in its own DoWithRetry, so one transient failure was retried twice — first by the SDK and then by MatrixOne's outer loop — multiplying request count and stretching failure latency. Set client.Conf.RetryOpt.Count = 0 in NewQCloudSDK so only MatrixOne's DoWithRetry handles retries. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> (cherry picked from commit f5cb84c)
client.Conf.RetryOpt.Count = 0 disables COS SDK built-in retries globally, but two direct SDK calls were not wrapped in MatrixOne's DoWithRetry: - client.Bucket.Head during NewQCloudSDK initialization - AbortMultipartUpload during WriteMultipartParallel error cleanup Both are now wrapped so transient COS failures won't fail startup or drop cleanup. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> (cherry picked from commit 5704cd3)
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What type of PR is this?
Which issue(s) does this PR fix or relate to?
Fixes #24916
What this PR does / why we need it:
Cherry-pick fixes from #24917 to 4.0-dev.\n\nCommits:\n- c1d47a0 fix: retry qcloud put object with seekable reader\n- ec2b815 fix: reject multipart size mismatch before upload\n- 698abf7 fix: use non-canceled context for multipart upload abort\n- f5cb84c fix: disable COS SDK built-in retry to avoid double retry\n- 5704cd3 fix: wrap Bucket.Head and AbortMultipartUpload in DoWithRetry\n\nTested:\n- go test -mod=mod ./pkg/fileservice -run 'TestQCloudSDKWriteRetriesSeekablePut|TestQCloudSDKWriteNonSeekableFallbackPreservesSizeHint|TestAwsParallelMultipartAbortOnReadError|TestCOSParallelMultipartAbortOnReadError' -count=1\n\nNotes:\n- go test ./pkg/fileservice is blocked by existing inconsistent vendoring in the local checkout.\n- go test -mod=mod ./pkg/fileservice failed after running the package, but the output was dominated by existing package logs and the targeted cherry-pick tests above pass.