disagg: limit S3 read amplification on compute nodes #10771
JaySon-Huang wants to merge 36 commits into pingcap:master
Conversation
Skipping CI for Draft Pull Request.

[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the listed files.

Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused its review to avoid overwhelming the thread with comments on an influx of new commits.
📝 Walkthrough

Adds a node-level S3 read token-bucket limiter and S3 read metrics recording; wires the limiter through IORateLimiter → ClientFactory → TiFlashS3Client → S3 clients; makes S3 reads and FileCache downloads limiter-aware (chunked reads, background downloads); adds bounded follower waits, new metrics, failpoints, tests, and an HTTP cache-evict API.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant S3RandomAccessFile
    participant S3ReadLimiter
    participant IStream as S3_IStream
    Client->>S3RandomAccessFile: readImpl(buf, size)
    alt Limiter active
        S3RandomAccessFile->>S3ReadLimiter: getSuggestedChunkSize(...)
        loop per chunk
            S3RandomAccessFile->>S3ReadLimiter: requestBytes(chunk, DirectRead)
            S3ReadLimiter-->>S3RandomAccessFile: grant (or block until tokens)
            S3RandomAccessFile->>IStream: read(chunk)
            IStream-->>S3RandomAccessFile: bytes_read
            S3RandomAccessFile->>S3RandomAccessFile: finalizeRead(...)
        end
    else Limiter disabled
        S3RandomAccessFile->>IStream: read(size)
        IStream-->>S3RandomAccessFile: bytes_read
        S3RandomAccessFile->>S3RandomAccessFile: finalizeRead(...)
    end
    S3RandomAccessFile-->>Client: return bytes_read
```
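The chunked path above can be sketched with a toy token bucket. This is an illustrative model only: `TokenBucket`, `requestBytes`, and `suggestedChunkSize` are simplified stand-ins for the real `S3ReadLimiter` (which refills on a timer rather than synchronously), and the "wait" is simulated by counting refill periods instead of sleeping.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical, simplified sketch of a limiter-aware chunked read path.
// Not the actual TiFlash API; names and refill behavior are illustrative.
class TokenBucket
{
public:
    explicit TokenBucket(uint64_t bytes_per_refill)
        : capacity(bytes_per_refill)
        , tokens(bytes_per_refill)
    {}

    // Suggest a chunk no larger than one refill's budget, so a single
    // request can never exceed the burst cap.
    uint64_t suggestedChunkSize(uint64_t preferred) const { return std::min(preferred, capacity); }

    // Consume tokens; returns the number of simulated refill periods the
    // caller would have blocked for (0 means served immediately).
    uint64_t requestBytes(uint64_t bytes)
    {
        uint64_t waited_periods = 0;
        while (tokens < bytes)
        {
            tokens += capacity; // simulate one refill period elapsing
            ++waited_periods;
        }
        tokens -= bytes;
        return waited_periods;
    }

private:
    uint64_t capacity;
    uint64_t tokens;
};

// Chunked read: split a large read into limiter-sized chunks so one call
// cannot drain the whole node budget at once. Returns total bytes "read".
uint64_t chunkedRead(TokenBucket & limiter, uint64_t size, uint64_t preferred_chunk)
{
    uint64_t total = 0;
    while (total < size)
    {
        const uint64_t chunk = limiter.suggestedChunkSize(preferred_chunk);
        const uint64_t to_read = std::min(size - total, chunk);
        limiter.requestBytes(to_read); // may "block" until tokens are available
        total += to_read; // a real reader would now pull from the S3 stream
    }
    return total;
}
```

The key property the diagram relies on: no single `requestBytes` call asks for more than one refill's worth of tokens, so a huge read cannot starve other readers for longer than one refill period per chunk.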
```mermaid
sequenceDiagram
    actor Leader
    actor Follower
    participant FileCache
    participant FileSegment
    participant BGDownloader as BG_Downloader
    participant S3ReadLimiter
    Leader->>FileCache: get(key) — insert Empty placeholder
    FileCache->>BGDownloader: schedule bgDownloadExecutor(enqueue_time)
    BGDownloader->>S3ReadLimiter: requestBytes(..., FileCacheDownload)
    Note over S3ReadLimiter: may block until budget available (token bucket)
    BGDownloader->>BGDownloader: downloadToLocal(...) (chunked when limiter active)
    BGDownloader-->>FileSegment: setState(Complete) and notify waiters
    Leader-->>FileCache: return miss
    Follower->>FileCache: get(key) — sees Empty
    FileCache->>FileSegment: waitForNotEmptyFor(timeout)
    alt Background finishes within timeout
        FileSegment-->>Follower: Complete (hit)
    else Timeout or Failed
        FileSegment-->>Follower: Miss (timeout/failed)
    end
```
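The bounded follower wait in the diagram above boils down to a condition-variable `wait_for` with a predicate. A minimal sketch, assuming a simplified `FileSegmentSketch` with only the states relevant here (the real TiFlash `FileSegment` API differs):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative sketch of the follower-side bounded wait; names are
// hypothetical, only the wait_for pattern matters.
class FileSegmentSketch
{
public:
    enum class Status
    {
        Empty, // download still in flight (or never finished)
        Complete,
        Failed,
    };

    // Background downloader publishes the final state and wakes waiters.
    void setStatus(Status s)
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            status = s;
        }
        cv.notify_all();
    }

    // Follower: wait up to `timeout` for the in-flight download to finish.
    // Empty on return means the wait timed out; callers treat both Empty
    // and Failed as a cache miss instead of starting another download.
    Status waitForNotEmptyFor(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait_for(lock, timeout, [this] { return status != Status::Empty; });
        return status;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    Status status = Status::Empty;
};
```

The bounded wait is what deduplicates same-key downloads: followers piggyback on the leader's in-flight download for a limited time instead of each issuing their own S3 GET.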
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 warning
Force-pushed: 4253d99 → cd65279 → df14a20

Signed-off-by: JaySon-Huang <tshent@qq.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/S3/FileCache.cpp (1)
Lines 1317-1342: ⚠️ Potential issue | 🔴 Critical: Failed downloads can corrupt `cache_used`.

`finalizeReservedSize()` rebases the reservation to `content_length`, but the segment keeps its old estimated size until line 1357. The new cleanup on lines 1395-1398 / 1450-1451 removes the failed segment and releases `file_seg->getSize()`, so any failure after line 1317 with `content_length != estimated_size` will either leak cache capacity or underflow `cache_used`. Persist the finalized reserved size before any later throw point, and add a regression test that calls `get(...)` with `std::nullopt` or an intentionally wrong size.

Also applies to: 1355-1357, 1390-1399, 1446-1452
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/S3/FileCache.cpp` around lines 1317 - 1342, finalizeReservedSize updates the reservation but the FileSegment still holds the old estimated size, so persist the finalized size into the segment immediately after a successful finalizeReservedSize(...) call (i.e. set file_seg->setSize(content_length) or equivalent on the FileSegment) before any later operations that can throw (e.g. before prepareParentDir, toTemporaryFilename, downloadToLocal, or std::filesystem::rename) and ensure failed paths release the same size that was reserved; also add a regression test that calls get(...) with std::nullopt and another test that requests an intentionally wrong size to verify cache_used/cleanup behavior on download failures.
🧹 Nitpick comments (1)
dbms/src/Storages/S3/S3RandomAccessFile.cpp (1)
Lines 144-171: Refresh the suggested chunk size inside the loop.

Both chunked paths snapshot `getSuggestedChunkSize()` once before the loop. If `s3_max_read_bytes_per_sec` is lowered while a large read/seek is already in flight, the remaining iterations keep using the old chunk size and only pick up the tighter burst cap on the next call.

♻️ Minimal change

```diff
-    const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
     size_t total_gcount = 0;
     while (total_gcount < size)
     {
+        const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
         auto to_read = std::min(size - total_gcount, static_cast<size_t>(chunk_size));
         read_limiter->requestBytes(to_read, S3ReadSource::DirectRead);
         istr.read(buf + total_gcount, to_read);
         auto gcount = istr.gcount();
         total_gcount += gcount;
@@
-    const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
     size_t total_ignored = 0;
     const auto bytes_to_ignore = static_cast<size_t>(offset - cur_offset);
     while (total_ignored < bytes_to_ignore)
     {
+        const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
         auto to_ignore = std::min(bytes_to_ignore - total_ignored, static_cast<size_t>(chunk_size));
         read_limiter->requestBytes(to_ignore, S3ReadSource::DirectRead);
         istr.ignore(to_ignore);
         auto ignored = istr.gcount();
         total_ignored += ignored;
```

Also applies to: 287-310
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/S3/S3RandomAccessFile.cpp` around lines 144 - 171, The code in S3RandomAccessFile::readChunked captures read_limiter->getSuggestedChunkSize(...) once before the loop so if s3_max_read_bytes_per_sec is lowered during a large in-flight read the loop continues using the stale chunk_size; update the implementation to call read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size) inside the while loop before computing to_read so each iteration observes the current limiter suggestion, and apply the same change to the other chunked path referenced in the review (the second chunked read block around lines 287-310).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/grafana/tiflash_summary.json`:
- Around line 14015-14024: The Y-axis for the "I/O Limiter Pending Duration"
panel is configured with format "short" but should use seconds; update the
panel's first yaxes entry (the object under "yaxes" for the panel named "I/O
Limiter Pending Duration") by changing "format": "short" to "format": "s" so the
histogram quantiles render with seconds units.
- Around line 20558-20572: The two queries using histogram_quantile both use
0.999 but their legendFormat indicates different percentiles; update the expr
for the query with refId "D" to use histogram_quantile(0.99, ...) (i.e., replace
0.999 with 0.99) so the metric computed by the histogram_quantile call matches
the legendFormat "99%-{{stage}}-{{file_type}} {{$additional_groupby}}"; ensure
you only change the numeric quantile in the expression for refId "D" and keep
the rest of the selector and labels identical to the refId "B" query.
- Around line 20790-20800: The legendFormat in the Grafana target is using
{{additional_groupby}} which prevents template variable interpolation; update
the legendFormat for the target (the object with "refId": "A" and "expr":
"sum(rate(tiflash_storage_remote_cache_wait_on_downloading_result...") to use
the templated variable form {{$additional_groupby}} instead of
{{additional_groupby}} so Grafana will substitute the variable at render time.
📒 Files selected for processing (16)
- dbms/src/IO/Checksum/ChecksumBuffer.cpp
- dbms/src/Interpreters/Context.cpp
- dbms/src/Interpreters/Context.h
- dbms/src/Server/Server.cpp
- dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
- dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
- dbms/src/Storages/S3/FileCache.cpp
- dbms/src/Storages/S3/S3Common.cpp
- dbms/src/Storages/S3/S3RandomAccessFile.cpp
- dbms/src/Storages/S3/S3ReadLimiter.cpp
- dbms/src/Storages/S3/S3ReadLimiter.h
- dbms/src/Storages/S3/tests/gtest_filecache.cpp
- docs/tiflash_http_api.md
- metrics/grafana/tiflash_summary.json
✅ Files skipped from review due to trivial changes (2)
- dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
- dbms/src/Server/Server.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
Signed-off-by: JaySon-Huang <tshent@qq.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dbms/src/Storages/S3/FileCache.cpp`:
- Around line 1272-1278: The code only calls
s3_read_metrics_recorder->recordBytes(rbuf.count(), ...) after copyData()
succeeds, so partial reads during exceptions are not recorded; update the
FileCache download logic (the block using ReadBufferFromIStream rbuf,
WriteBufferFromWritableFile wbuf, copyData and wbuf.sync) to record rbuf.count()
in an unconditional cleanup path (e.g., use a try/finally or RAII guard) so
recordBytes is invoked whether copyData() throws or not; apply the same change
to the other symmetric block around lines handling
ReadBufferFromIStream/copyData (the 1289-1294 case) so both success and
partial-failure byte counts are reported.
- Around line 1396-1403: The current failure cleanup calls remove(s3_key, true)
which flows into removeImpl() and increments eviction metrics; change the code
paths (around FileCache::FileSegment failure handling at the shown block and the
similar block at lines ~1452-1455) to unpublish the failed placeholder and
release its reservation without counting it as an eviction—either call an
existing lighter-weight API (e.g. an unpublishPlaceholder/unpublish or
releaseReservation method if present) or add a new helper (e.g.
unpublishPlaceholder(s3_key) or removeImpl(s3_key, /*count_metrics*/ false))
that removes the published placeholder without increasing
type_dtfile_evict/type_dtfile_evict_bytes; ensure
FileSegment::setStatus(FileSegment::Status::Failed) and bg_download_fail_count
handling remain the same and only replace the remove(..., force=true) call with
this non-metric-cleanup path.
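The two FileCache comments above (record partial bytes unconditionally, and clean up failed placeholders without counting an eviction) both suggest small mechanisms. The first can be expressed as a scope-exit guard; this is a hypothetical sketch (`RecordBytesOnExit` and the surrounding names are illustrative, not the TiFlash API) showing that partial bytes get recorded on both the success and the exception path:

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <utility>

// Scope-exit guard: its destructor runs whether the enclosing scope exits
// normally or via an exception, so byte accounting is never skipped.
class RecordBytesOnExit
{
public:
    RecordBytesOnExit(std::function<uint64_t()> count_fn_, uint64_t & recorded_)
        : count_fn(std::move(count_fn_))
        , recorded(recorded_)
    {}

    ~RecordBytesOnExit() { recorded += count_fn(); }

private:
    std::function<uint64_t()> count_fn;
    uint64_t & recorded;
};

// Simulated download: `bytes_copied` plays the role of rbuf.count().
// Returns true on success; on failure it throws after a partial copy.
bool downloadWithRecording(uint64_t & metrics_bytes, bool fail_midway)
{
    uint64_t bytes_copied = 0;
    // Guard is declared after bytes_copied, so it is destroyed first and
    // can still read the final count.
    RecordBytesOnExit guard([&] { return bytes_copied; }, metrics_bytes);
    bytes_copied = 512; // partial progress before the potential failure point
    if (fail_midway)
        throw std::runtime_error("simulated copyData failure");
    bytes_copied = 1024; // full object copied
    return true;
}
```

With this shape, `recordBytes`-style accounting lives in exactly one place and the failure path cannot silently drop the partial transfer.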
In `@dbms/src/Storages/S3/tests/gtest_filecache.cpp`:
- Around line 1268-1304: The test currently only calls
waitForBgDownload(file_cache) which can pass even if background downloads
failed; update the test (TEST_F FileCacheTest
BgDownloadWorksWithSharedS3ReadLimiter) to explicitly assert the downloads
succeeded by verifying the cached entries for the two generated objects (use the
same objects from genObjects) are present and valid after waitForBgDownload —
e.g., call file_cache.get(S3FilenameView::fromKey(objects[0].key),
objects[0].size) and file_cache.get(S3FilenameView::fromKey(objects[1].key),
objects[1].size) and ASSERT they return non-null / valid results (or assert on
the on-disk cache file existence) to ensure the paused downloads actually
completed successfully.
In `@metrics/grafana/tiflash_summary.json`:
- Around line 21106-21117: The legendFormat in the target (inside the "targets"
entry with "refId": "A") is using "{{additional_groupby}}" which prevents
Grafana template interpolation; update the legendFormat value to use the
template variable with a dollar prefix (i.e., "{{$additional_groupby}}") so it
matches the expr usage and renders the variable instead of literal text.
📒 Files selected for processing (17)
- dbms/src/IO/Checksum/ChecksumBuffer.cpp
- dbms/src/Interpreters/Context.cpp
- dbms/src/Interpreters/Context.h
- dbms/src/Server/Server.cpp
- dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
- dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
- dbms/src/Storages/S3/FileCache.cpp
- dbms/src/Storages/S3/FileCache.h
- dbms/src/Storages/S3/S3Common.cpp
- dbms/src/Storages/S3/S3RandomAccessFile.cpp
- dbms/src/Storages/S3/S3ReadLimiter.cpp
- dbms/src/Storages/S3/S3ReadLimiter.h
- dbms/src/Storages/S3/tests/gtest_filecache.cpp
- docs/tiflash_http_api.md
- metrics/grafana/tiflash_summary.json
✅ Files skipped from review due to trivial changes (2)
- dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
- dbms/src/Storages/S3/S3RandomAccessFile.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
- dbms/src/Storages/S3/FileCache.h
Signed-off-by: JaySon-Huang <tshent@qq.com>
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 33 out of 34 changed files in this pull request and generated 1 comment.
@coderabbitai full review

✅ Actions performed: full review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (3)
dbms/src/Storages/S3/S3ReadLimiter.h (1)
Lines 59-63: Clarify stale "stream dimension" wording in class docs.

This class currently exposes byte-rate limiting APIs only; mentioning a present "stream dimension" here can confuse future readers.
✍️ Suggested doc-only tweak
```diff
- /// The stream dimension is best-effort protection against too many live response bodies, not a
- /// replacement for byte throttling and not a safe cap on reader object count. In TiFlash a
- /// `S3RandomAccessFile` may keep its body stream open across scheduling gaps, so a low stream
- /// limit can block forward progress even when the node is no longer transferring many bytes.
+ /// This limiter currently enforces byte-rate only. Stream-count limiting was intentionally
+ /// removed because stream lifetime is not a reliable proxy for real network occupancy.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/S3/S3ReadLimiter.h` around lines 59-63, the class doc comment for S3ReadLimiter currently mentions a "stream dimension" which is stale and misleading because the class only exposes byte-rate limiting APIs; update the comment above the S3ReadLimiter constructor (`S3ReadLimiter(UInt64 max_read_bytes_per_sec_ = 0, UInt64 refill_period_ms_ = 100)`) to remove or rephrase the "stream dimension" wording and clearly state that only byte-rate limiting is enforced, keeping the rest of the explanation about scheduling gaps and refill behavior intact.

dbms/src/Common/TiFlashMetrics.cpp (1)
Line 32: Give `bg_download_stage` more headroom.

`ExpBuckets{0.0001, 2, 20}` tops out at roughly 52 s (0.0001 × 2^19 ≈ 52.4). The manual experiment attached to this PR already shows minute-scale background-download waits, so `tiflash_storage_remote_cache_bg_download_stage_seconds` will pin to its top bucket and flatten p99/p999 above that. Consider widening only the bg-stage histogram range.

♻️ Suggested tweak

```diff
-constexpr auto remote_cache_bg_download_stage_buckets = ExpBuckets{0.0001, 2, 20};
+constexpr auto remote_cache_bg_download_stage_buckets = ExpBuckets{0.01, 2, 20};
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Common/TiFlashMetrics.cpp` at line 32, the histogram for the bg download stage currently uses `remote_cache_bg_download_stage_buckets = ExpBuckets{0.0001, 2, 20}`, which tops out too low and will saturate for minute-scale waits; widen only this histogram so `tiflash_storage_remote_cache_bg_download_stage_seconds` has headroom for minute-scale durations.

dbms/src/Storages/S3/tests/gtest_filecache.cpp (1)
Lines 1335-1338: Timing assertion may be sensitive to system load.

The assertion `ASSERT_GE(watch.elapsedMilliseconds(), 200)` validates that the limiter throttles the download, but the exact threshold (200 ms) could be sensitive to CI runner load. Consider adding a small tolerance or documenting the expected timing rationale if flakiness is observed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/S3/tests/gtest_filecache.cpp` around lines 1335 - 1338, The timing assertion using AtomicStopwatch and ASSERT_GE(watch.elapsedMilliseconds(), 200) is brittle under variable CI load; update the test around file_cache.get(S3FilenameView::fromKey(object_key), object_size) and waitForBgDownload(file_cache) to allow a small tolerance (e.g., compare against an expected_delay minus a small delta or assert elapsed >= expected_delay_with_tolerance) or document the rationale; adjust the assertion to use a configurable/clearly named constant (expected_delay and tolerance) so the limiter check remains robust across environments.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dbms/src/IO/BaseFile/RateLimiter.h`:
- Line 223: The S3 limiter is being set to nullptr when the config is zero so
files constructed earlier (S3RandomAccessFile ctor snapshotting
client_ptr->getS3ReadLimiter()) keep nullptr; instead ensure getS3ReadLimiter()
always returns a valid (non-null) shared_ptr by publishing a no-op/unthrottled
S3::S3ReadLimiter instance when the configured rate is 0 and atomically swapping
in a real throttling S3ReadLimiter when the config becomes non-zero (or update
an existing limiter’s internal state rather than replacing with nullptr). Locate
the s3_read_limiter management in RateLimiter.cpp (and the getS3ReadLimiter
declaration) and change the zero-rate path to create/publish a non-throttling
limiter so existing snapshots in S3RandomAccessFile continue to be valid across
reloads.
In `@dbms/src/IO/BaseFile/tests/gtest_rate_limiter.cpp`:
- Around line 424-432: The test TEST(S3ReadLimiterTest,
LargeRequestDoesNotWaitForever) currently calls
S3::S3ReadLimiter::requestBytes(...) directly on the test thread which can hang
CI if the limiter regresses; instead spawn the requestBytes call in a background
task (e.g., std::async or std::thread) and use a wait_for() with a timeout to
assert the task completed (and that AtomicStopwatch::elapsedMilliseconds() is
under the expected bound), referencing S3::S3ReadLimiter, requestBytes,
S3::S3ReadSource::DirectRead and AtomicStopwatch to locate and replace the
direct blocking call with a timed async/wait_for assertion.
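The timed-wait hardening suggested above can be factored into a small helper. A sketch assuming only the standard library; `finishesWithin` is a hypothetical name, and the blocking lambda stands in for `S3::S3ReadLimiter::requestBytes(...)`:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Run a potentially blocking call on a background thread and bound the
// wait, so a limiter regression fails the assertion instead of hanging
// the test thread indefinitely.
bool finishesWithin(std::function<void()> blocking_call, std::chrono::milliseconds timeout)
{
    auto fut = std::async(std::launch::async, std::move(blocking_call));
    // ready: the call returned within the bound; timeout: it still blocks.
    const bool ok = fut.wait_for(timeout) == std::future_status::ready;
    // Caveat: the future returned by std::async joins in its destructor,
    // so a genuinely stuck call would still hang here; real test suites
    // pair this pattern with a harness-level hard timeout.
    return ok;
}
```

In a gtest this would become `ASSERT_TRUE(finishesWithin(...))`, turning a hang into a clear failure message.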
In `@dbms/src/Storages/S3/S3Common.h`:
- Around line 185-191: setS3ReadLimiter currently replaces the shared_ptr which
breaks existing S3RandomAccessFile snapshots; instead make the published
S3ReadLimiter object stable by updating its configuration in-place: in
setS3ReadLimiter(const std::shared_ptr<S3ReadLimiter>& limiter) if
shared_s3_read_limiter is non-null and limiter is non-null call
shared_s3_read_limiter->updateConfig(...) (or equivalent) to apply new limits,
otherwise (when there is no existing limiter) assign the pointer; also ensure
you still call shared_tiflash_client->setS3ReadLimiter(shared_s3_read_limiter)
so clients keep referencing the same long-lived S3ReadLimiter instance (refer to
setS3ReadLimiter, shared_s3_read_limiter, S3ReadLimiter::updateConfig, and
S3RandomAccessFile constructor).
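The update-in-place approach recommended for `setS3ReadLimiter` can be sketched as a limiter whose rate is an atomic mutated by `updateConfig()`, so every reader that captured the `shared_ptr` once observes later reloads without any pointer swap. Names (`ReloadableLimiter`, `updateConfig`, `throttlingEnabled`) are illustrative, not the TiFlash API:

```cpp
#include <atomic>
#include <cstdint>
#include <memory>

// One long-lived limiter object; config reloads mutate it in place.
class ReloadableLimiter
{
public:
    explicit ReloadableLimiter(uint64_t bytes_per_sec)
        : max_bytes_per_sec(bytes_per_sec)
    {}

    // Config-reload path: atomically publish the new rate; 0 disables.
    void updateConfig(uint64_t bytes_per_sec) { max_bytes_per_sec.store(bytes_per_sec); }

    // Hot read path: a rate of 0 means "no throttling".
    bool throttlingEnabled() const { return max_bytes_per_sec.load() != 0; }

private:
    std::atomic<uint64_t> max_bytes_per_sec;
};
```

The design choice: because readers snapshot the `shared_ptr` once in their constructor, correctness across reloads has to come from mutating the shared object, not from replacing it; the only gap left is the first 0 → nonzero transition when no object was ever published.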
📒 Files selected for processing (34)
- dbms/src/Common/FailPoint.cpp
- dbms/src/Common/TiFlashMetrics.cpp
- dbms/src/Common/TiFlashMetrics.h
- dbms/src/IO/BaseFile/IORateLimitConfig.cpp
- dbms/src/IO/BaseFile/IORateLimitConfig.h
- dbms/src/IO/BaseFile/RateLimiter.cpp
- dbms/src/IO/BaseFile/RateLimiter.h
- dbms/src/IO/BaseFile/tests/gtest_rate_limiter.cpp
- dbms/src/IO/Checksum/ChecksumBuffer.cpp
- dbms/src/Interpreters/Context.cpp
- dbms/src/Interpreters/Context.h
- dbms/src/Interpreters/Settings.h
- dbms/src/Server/Server.cpp
- dbms/src/Server/tests/gtest_storage_config.cpp
- dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
- dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
- dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
- dbms/src/Storages/S3/FileCache.cpp
- dbms/src/Storages/S3/FileCache.h
- dbms/src/Storages/S3/MockS3Client.h
- dbms/src/Storages/S3/S3Common.cpp
- dbms/src/Storages/S3/S3Common.h
- dbms/src/Storages/S3/S3RandomAccessFile.cpp
- dbms/src/Storages/S3/S3RandomAccessFile.h
- dbms/src/Storages/S3/S3ReadLimiter.cpp
- dbms/src/Storages/S3/S3ReadLimiter.h
- dbms/src/Storages/S3/S3ReadLimiter_fwd.h
- dbms/src/Storages/S3/tests/gtest_filecache.cpp
- dbms/src/Storages/S3/tests/gtest_s3client.cpp
- dbms/src/Storages/S3/tests/gtest_s3file.cpp
- docs/design/2026-03-24-disagg-s3-node-level-backpressure-and-filecache-dedup.md
- docs/tiflash_http_api.md
- metrics/grafana/tiflash_summary.json
```cpp
void setS3ReadLimiter(const std::shared_ptr<S3ReadLimiter> & limiter)
{
    std::unique_lock lock_init(mtx_init);
    shared_s3_read_limiter = limiter;
    if (shared_tiflash_client != nullptr)
        shared_tiflash_client->setS3ReadLimiter(shared_s3_read_limiter);
}
```
Keep the published limiter object stable across reloads.
dbms/src/Server/Server.cpp hits this setter on config reload, but dbms/src/Storages/S3/S3RandomAccessFile.cpp snapshots the limiter once in its constructor. If a node starts with s3_max_read_bytes_per_sec = 0 and reload later enables it, already-open readers keep the old nullptr and bypass throttling for the rest of the query. Prefer updating one long-lived S3ReadLimiter via updateConfig() instead of swapping the shared_ptr.
The nonzero -> 0 reload path is already handled by keeping the existing S3ReadLimiter object alive and applying updateConfig(0) in place, so readers that already snapped the shared limiter can observe disable without pointer recreation.
What still remains intentionally limited today is the first 0 -> nonzero transition when startup published no limiter object: previously constructed readers that snapped nullptr will not start throttling retroactively, and only newly created readers will see the newly published limiter. I added a code comment and TODO in RateLimiter.cpp to document this behavior and the possible follow-up of publishing a no-op limiter even at rate 0 if we want stronger reload semantics later.
The PR description images show:
- Image 1 (Node exporter – Network overview): Grafana Network panels with no drops/errors, traffic bursts concentrated on bond1 up to ~1.23 GB/s, packet peaks ~818 k p/s, link speeds stable over 2026-04-05 18:37:43–20:22:54.
- Image 2 (TiFlash Summary – PageStorage Rate Limiter): Limiter throughput shows S3 direct/filecache bursts; thresholds pegged at 200 MiB/s for s3_read and bg_write; pending queues up to 30; S3 read pending P99≈1.31 s, P999≈2.77 s in the same time window.
- Image 3 (TiFlash Summary – Remote Cache): Remote cache ops/flow with spikes; bounded waits up to ~7 s; queue depth up to ~150–200; occasional “too_many_download” rejects.
- Image 4 (TiFlash Summary – Node/resource): Store/available sizes oscillate with workload; uptime ~2 hours; transient CPU spikes >1000%; memory peaks ~11 GiB.
@JaySon-Huang thanks for the clarification. Your described semantics match the code paths: nonzero → 0 keeps the same S3ReadLimiter instance and calls updateConfig(0), so in-flight readers that captured the pointer observe disablement; the only non-retroactive case is the very first 0 → nonzero when the process started with no limiter object and older readers had snapped nullptr.
Given that, I’m good with the current behavior for this PR. As a low-risk follow-up to tighten reload semantics later, you could consider instantiating and publishing a no-op limiter at startup (limit=0). That would make the initial 0 → nonzero transition retroactive for already-open readers without changing hot-path costs when disabled, since:
- existing readers would already hold a stable pointer; a reload would just updateConfig(new_rate),
- per-read locking remains unchanged (no extra get() per read).
I’ve recorded this nuance so we don’t flag it again.
✏️ Learnings added
Learnt from: JaySon-Huang
Repo: pingcap/tiflash PR: 10771
File: dbms/src/Storages/S3/S3Common.h:185-191
Timestamp: 2026-04-05T19:04:11.469Z
Learning: TiFlash S3ReadLimiter reload semantics: nonzero→0 is handled in place via S3ReadLimiter::updateConfig(0) so existing readers observe disablement; the initial 0→nonzero after startup with no limiter object is intentionally non‑retroactive—readers that captured nullptr continue unthrottled until recreated. A possible future enhancement is to publish a no‑op limiter (rate=0) at startup to make 0→nonzero retroactive.
🧠 Learnings used
Learnt from: CR
Repo: pingcap/tiflash PR: 0
File: dbms/src/Storages/AGENTS.md:0-0
Timestamp: 2026-01-26T03:03:41.424Z
Learning: Applies to dbms/src/Storages/**/{DeltaMerge,KVStore,Page,S3}/**/*.{h,hpp,cpp} : Use `std::shared_ptr` for `StorageDeltaMerge`, `DeltaMergeStore`, `Region`, and `Context` objects
What problem does this PR solve?
Issue Number: close #10752
Problem Summary:
In disaggregated read workloads, compute nodes can generate uncontrolled S3 traffic when cold reads and repeated cache misses amplify remote reads on the same node. We need a production-safe way to bound S3 traffic, reduce duplicate downloads for the same object, and make the resulting behavior observable and tunable.
What is changed and how it works?
More specifically:
- Add `S3ReadLimiter` to cap node-level S3 read bytes for both direct reads and FileCache downloads, configured via `storage.io_rate_limit.s3_max_read_bytes_per_sec`
- Drop the `s3_max_get_object_streams` stream-cap design after validation showed that a reader-lifetime stream token can stall disaggregated query progress
- Wire the limiter through `ClientFactory` so all S3 clients on the node observe the same node-level byte budget
- Reloading the limit to `0` now updates the existing limiter instance instead of replacing it, so in-flight readers observe the disable and the related metric is updated consistently
- Make `S3RandomAccessFile` use limiter-aware chunked read/seek paths while sharing common read/seek tail logic
- Make `FileCache` downloads reuse the same limiter, add `profiles.default.dt_filecache_wait_on_downloading_ms`, and let follower requests perform a bounded wait on an in-flight same-key download instead of immediately creating more remote-read pressure
- Add metrics: wait-on-downloading results by `result x file_type`, background-download stages by `stage x file_type`, and `too_many_download` rejects by `file_type`
- Add design doc `docs/design/2026-03-24-disagg-s3-node-level-backpressure-and-filecache-dedup.md`

Check List
Tests
Manual test
Tested with following py script
Side effects
Documentation
New configs
Release note
Summary by CodeRabbit
New Features
Improvements
Tests
Documentation