Skip to content

cherrypick: fix QCloud PUT retry#24928

Open
jiangxinmeng1 wants to merge 5 commits into
matrixorigin:4.0-devfrom
jiangxinmeng1:cherry-pick-24917-to-4.0-dev
Open

cherrypick: fix QCloud PUT retry#24928
jiangxinmeng1 wants to merge 5 commits into
matrixorigin:4.0-devfrom
jiangxinmeng1:cherry-pick-24917-to-4.0-dev

Conversation

@jiangxinmeng1

@jiangxinmeng1 jiangxinmeng1 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) does this PR fix or relate to?

Fixes #24916

What this PR does / why we need it:

Cherry-pick fixes from #24917 to 4.0-dev.\n\nCommits:\n- c1d47a0 fix: retry qcloud put object with seekable reader\n- ec2b815 fix: reject multipart size mismatch before upload\n- 698abf7 fix: use non-canceled context for multipart upload abort\n- f5cb84c fix: disable COS SDK built-in retry to avoid double retry\n- 5704cd3 fix: wrap Bucket.Head and AbortMultipartUpload in DoWithRetry\n\nTested:\n- go test -mod=mod ./pkg/fileservice -run 'TestQCloudSDKWriteRetriesSeekablePut|TestQCloudSDKWriteNonSeekableFallbackPreservesSizeHint|TestAwsParallelMultipartAbortOnReadError|TestCOSParallelMultipartAbortOnReadError' -count=1\n\nNotes:\n- go test ./pkg/fileservice is blocked by existing inconsistent vendoring in the local checkout.\n- go test -mod=mod ./pkg/fileservice failed after running the package, but the output was dominated by existing package logs and the targeted cherry-pick tests above pass.

jiangxinmeng1 and others added 5 commits June 11, 2026 11:43
When setErr cancels the derived context in WriteMultipartParallel,
the deferred abort call was using the already-canceled context,
causing the AbortMultipartUpload request to be dropped. This leaked
incomplete multipart uploads on COS/AWS for read error paths (e.g.,
size mismatch errors that reach setErr before CompleteMultipartUpload).

Save the parent context before WithCancel and use
context.WithoutCancel(parentCtx) for the abort defer so the abort
request can always proceed regardless of context cancellation.

Add tests that assert AbortMultipartUpload is sent for size-mismatch
read errors in both QCloud and AWS multipart paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit 698abf7)
The COS SDK enables retry by default (RetryOpt.Count = 3). MatrixOne wraps
all operations in its own DoWithRetry, so one transient failure was retried
twice — first by the SDK and then by MatrixOne's outer loop — multiplying
request count and stretching failure latency.

Set client.Conf.RetryOpt.Count = 0 in NewQCloudSDK so only MatrixOne's
DoWithRetry handles retries.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit f5cb84c)
client.Conf.RetryOpt.Count = 0 disables COS SDK built-in retries globally,
but two direct SDK calls were not wrapped in MatrixOne's DoWithRetry:
- client.Bucket.Head during NewQCloudSDK initialization
- AbortMultipartUpload during WriteMultipartParallel error cleanup
Both are now wrapped so transient COS failures won't fail startup or drop cleanup.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit 5704cd3)
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants