
disagg: limit S3 read amplification on compute nodes #10771

Open
JaySon-Huang wants to merge 36 commits into pingcap:master from JaySon-Huang:disagg_network_limit

Conversation

@JaySon-Huang
Contributor

@JaySon-Huang JaySon-Huang commented Mar 25, 2026

What problem does this PR solve?

Issue Number: close #10752

Problem Summary:

In disaggregated read workloads, compute nodes can generate uncontrolled S3 traffic when cold reads and repeated cache misses amplify remote reads on the same node. We need a production-safe way to bound S3 traffic, reduce duplicate downloads for the same object, and make the resulting behavior observable and tunable.

What is changed and how it works?

  • add a node-level S3 byte limiter and wire it into both direct reads and FileCache downloads
  • avoid duplicate S3 downloads for the same key by adding a bounded wait on in-flight FileCache entries
  • add phase-2 remote-cache observability for the bounded wait and background download stages
  • refine the FileCache and S3 read paths for maintainability, comments, failure handling, and reload consistency
  • add an English design document for the latest disagg S3 backpressure plan

More specifically:

  • add S3ReadLimiter to cap node-level S3 read bytes for both direct reads and FileCache downloads
  • add storage.io_rate_limit.s3_max_read_bytes_per_sec
  • remove the earlier s3_max_get_object_streams stream-cap design after validation showed that a reader-lifetime stream token can stall disaggregated query progress
  • publish the shared limiter through ClientFactory so all S3 clients on the node observe the same node-level byte budget
  • keep the limiter object stable across config reloads; reloading to 0 now updates the existing limiter instance instead of replacing it, so in-flight readers observe the disable and the related metric is updated consistently
  • make S3RandomAccessFile use limiter-aware chunked read/seek paths while sharing common read/seek tail logic
  • make FileCache downloads reuse the same limiter, add profiles.default.dt_filecache_wait_on_downloading_ms, and let follower requests perform bounded wait on an in-flight same-key download instead of immediately creating more remote-read pressure
  • handle bounded-wait download failures by removing failed placeholders eagerly so later retries can proceed
  • add remote-cache observability for:
    • bounded wait result / bytes / wait-seconds by result x file_type
    • background download stage seconds by stage x file_type
    • reject counts for too_many_download by file_type
    • background download queue depth
  • add an English design doc under docs/design/2026-03-24-disagg-s3-node-level-backpressure-and-filecache-dedup.md
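The core of the change is a token bucket over S3 read bytes. The sketch below illustrates the idea; the names loosely mirror the PR's S3ReadLimiter (requestBytes, getSuggestedChunkSize), but every signature, the 100 ms refill period, and the debt-based overdraft policy are assumptions for illustration, not the actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified node-level token bucket for S3 read bytes (illustrative only).
class SimpleS3ReadLimiter
{
public:
    explicit SimpleS3ReadLimiter(uint64_t max_bytes_per_sec_, uint64_t refill_period_ms_ = 100)
        : max_bytes_per_sec(max_bytes_per_sec_)
        , refill_period_ms(refill_period_ms_)
        , tokens(static_cast<int64_t>(refillAmount()))
    {}

    // Block until some tokens are available, then consume `bytes`.
    // The balance may go negative; the debt delays later callers.
    void requestBytes(uint64_t bytes)
    {
        std::unique_lock lock(mu);
        if (max_bytes_per_sec == 0) // 0 disables limiting
            return;
        while (tokens <= 0)
        {
            refill();
            if (tokens > 0)
                break;
            cv.wait_for(lock, std::chrono::milliseconds(refill_period_ms));
        }
        tokens -= static_cast<int64_t>(bytes);
    }

    // Cap a caller's chunk size at one refill period's budget so a single
    // large read cannot drain the whole bucket in one request.
    uint64_t getSuggestedChunkSize(uint64_t preferred) const
    {
        if (max_bytes_per_sec == 0)
            return preferred;
        return std::min<uint64_t>(preferred, std::max<uint64_t>(1, refillAmount()));
    }

private:
    uint64_t refillAmount() const { return max_bytes_per_sec * refill_period_ms / 1000; }

    // Top up tokens for elapsed refill periods; burst cap is one period's budget.
    void refill()
    {
        auto now = std::chrono::steady_clock::now();
        auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(now - last_refill).count();
        auto period = static_cast<int64_t>(refill_period_ms);
        if (elapsed_ms >= period)
        {
            auto burst_cap = static_cast<int64_t>(refillAmount());
            tokens = std::min(tokens + (elapsed_ms / period) * burst_cap, burst_cap);
            last_refill = now;
        }
    }

    std::mutex mu;
    std::condition_variable cv;
    const uint64_t max_bytes_per_sec;
    const uint64_t refill_period_ms;
    int64_t tokens;
    std::chrono::steady_clock::time_point last_refill = std::chrono::steady_clock::now();
};
```

Because both direct reads and FileCache downloads draw from the same bucket, the node's total S3 receive rate converges on the configured budget regardless of which path dominates.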

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Manual test

Tested with the following Python script:

# Round 1: cold-cache baseline without node-level S3 byte limiting and without FileCache bounded wait.
# Round 2: cold-cache baseline with two concurrent queries and no parameter changes.
# Round 3: cold-cache run with node-level S3 byte limiting enabled at 150 MiB/s.
# Round 4: keep the same S3 byte limit, enable FileCache bounded wait at 5000ms.
# Round 5: raise the S3 byte limit from 150 MiB/s to 200 MiB/s and rerun the cold-cache query.
# Round 6: keep the same config as round 5 and run the target query twice concurrently.
The six-round run_cold_disagg_read.py experiment clearly reproduces the target scenario described in the design doc: under cold-cache conditions, two concurrent queries without any limiter push the host bond1 receive peak to about 2.48 GB/s, while direct reads and FileCache downloads both surge at the same time. This is exactly the node-level S3 read amplification problem the proposal is meant to control. After enabling s3_max_read_bytes_per_sec, TiFlash shows sustained s3_read_byte pending activity, and the host-level network peak is consistently reduced to the few-hundred-MB/s range. That confirms the node-level byte limiter is effective not only in internal accounting but also at the real machine network level, turning the system from “unbounded amplification that can blow up the node network” into a bounded, observable, and tunable degraded state.

In this workload, 150 MiB/s is too conservative and significantly stretches TiFlash execution time. Raising the limit to 200 MiB/s noticeably improves both tiflash_task execution time and overall slowlog wall time in the single-query case, while still keeping network usage under control. For the current sample, s3_max_read_bytes_per_sec = 209715200 together with dt_filecache_wait_on_downloading_ms = 5000 is the most balanced configuration observed so far. Bounded wait helps mainly on the merged path in single-query runs, but under two-query concurrency it also starts producing substantial cold-data hits/timeouts, which indicates that its benefit grows once read overlap becomes strong enough. At this point, the dominant remaining bottleneck is no longer uncontrolled network traffic but FileCache background-download supply: bg_downloading_count, bg_download_queue_count, and too_many_download stay high across multiple rounds. The next optimization priority should therefore be downloader concurrency / queue / admission behavior rather than further increasing the bounded-wait budget.
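The sensitivity to the cap value follows from simple arithmetic: under a node-level byte limit, a cold read of N bytes needs at least N / cap seconds of S3 transfer time. A tiny sketch (the 60 GiB working-set size below is a made-up illustration, not a measurement from this test):

```cpp
#include <cassert>
#include <cstdint>

// Lower bound on wall time for a cold read under a node-level byte cap.
constexpr double minWallSeconds(uint64_t total_bytes, uint64_t bytes_per_sec)
{
    return static_cast<double>(total_bytes) / static_cast<double>(bytes_per_sec);
}
```

For the hypothetical 60 GiB cold read, the floor drops from about 409.6 s at 150 MiB/s to 307.2 s at 200 MiB/s, consistent with the direction of the improvement seen when the limit was raised.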

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
    New configs
profiles.default.dt_filecache_wait_on_downloading_ms
storage.io_rate_limit.s3_max_read_bytes_per_sec
  • Contains experimental features
  • Changes MySQL compatibility
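For reference, a hypothetical config fragment wiring both new knobs; the values echo the "most balanced" settings from the manual test above and are workload-specific examples, not general recommendations:

```toml
[storage.io_rate_limit]
# Node-level cap on S3 read bytes per second; 0 disables the limiter.
s3_max_read_bytes_per_sec = 209715200  # 200 MiB/s

[profiles.default]
# How long a follower request may wait on an in-flight same-key FileCache
# download before being treated as a miss; 0 disables the bounded wait.
dt_filecache_wait_on_downloading_ms = 5000
```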

Release note

limit S3 read amplification on compute nodes

Summary by CodeRabbit

  • New Features

    • Node-level S3 read-rate limiter; bounded follower wait for in-flight FileCache downloads; two per-node HTTP cache-evict endpoints.
  • Improvements

    • Limiter-aware chunked reads/seeks and downloads, background-download queue/stage timing, failed-placeholder cleanup, expanded remote-cache & I/O-limiter metrics, runtime wait tuning setting.
  • Tests

    • New unit/integration coverage for limiter, FileCache wait semantics, background-failure and scheduling cleanup paths, and limiter publication.
  • Documentation

    • Added design doc and HTTP API docs.

@JaySon-Huang JaySon-Huang self-assigned this Mar 25, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), and release-note-none (denotes a PR that doesn't merit a release note) labels on Mar 25, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jayson-huang and additionally assign yudongusa for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

Adds a node-level S3 read token-bucket limiter and S3 read metrics/recording; wires the limiter through IORateLimiter → ClientFactory → TiFlashS3Client → S3 clients; makes S3 reads and FileCache downloads limiter-aware (chunked reads, bg downloads), adds bounded follower waits, new metrics, failpoints, tests, and an HTTP cache-evict API.

Changes

Cohort / File(s) Summary
Fail Points
dbms/src/Common/FailPoint.cpp
Register new fail-points file_cache_bg_download_fail, file_cache_bg_download_schedule_fail.
Metrics
dbms/src/Common/TiFlashMetrics.h, dbms/src/Common/TiFlashMetrics.cpp
Add remote-cache enums, metric families (counters/histograms), bucket boundaries, and accessor methods for per-(file_type × result/reason/stage) metrics.
Rate-Limit Config
dbms/src/IO/BaseFile/IORateLimitConfig.h, dbms/src/IO/BaseFile/IORateLimitConfig.cpp
Add s3_max_read_bytes_per_sec field, parsing, toString, and equality comparison updates.
IORateLimiter
dbms/src/IO/BaseFile/RateLimiter.h, dbms/src/IO/BaseFile/RateLimiter.cpp
Add s3_read_limiter member, public getS3ReadLimiter(), create/update logic from config, and propagate stop.
S3 Read Limiter (API & Impl)
dbms/src/Storages/S3/S3ReadLimiter.h, dbms/src/Storages/S3/S3ReadLimiter.cpp, dbms/src/Storages/S3/S3ReadLimiter_fwd.h
Introduce S3ReadLimiter (token-bucket), S3ReadMetricsRecorder, refill/update, blocking requestBytes, suggested-chunk helper, stop semantics, and metrics recording.
S3 Client / Factory
dbms/src/Storages/S3/S3Common.h, dbms/src/Storages/S3/S3Common.cpp, dbms/src/Storages/S3/MockS3Client.h
Propagate shared S3ReadLimiter/metrics-recorder into TiFlashS3Client/MockS3Client; add client accessors and ClientFactory setters to publish shared limiter/recorder.
S3 RandomAccessFile
dbms/src/Storages/S3/S3RandomAccessFile.h, dbms/src/Storages/S3/S3RandomAccessFile.cpp
Add limiter-aware chunked read/seek flows, finalize helpers, integrate read_limiter/metrics-recorder, and centralize post-read/seek handling.
FileCache
dbms/src/Storages/S3/FileCache.h, dbms/src/Storages/S3/FileCache.cpp
Add bounded follower wait (waitForNotEmptyFor), pass enqueue_time to bg executor, integrate S3ReadLimiter into download path, record bg queue/stage metrics, add failpoint-triggered bg failure handling and cleanup, and runtime wait_on_downloading_ms.
Server wiring
dbms/src/Server/Server.cpp
Publish IORateLimiter’s S3ReadLimiter into ClientFactory after init and on config reload.
Settings
dbms/src/Interpreters/Settings.h
Add dt_filecache_wait_on_downloading_ms setting (UInt64, default 0).
Tests
tests across dbms/src/IO/BaseFile/tests/*, dbms/src/Server/tests/*, dbms/src/Storages/S3/tests/*
Add unit/integration tests for S3ReadLimiter behavior, config parsing, FileCache bounded-wait semantics, failpoint-driven bg failures, and publication of the shared limiter.
HTTP status / Evict API
dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp, .../ProxyFFIStatusService.h, tests, docs
Add /tiflash/cache/evict/mark and /tiflash/cache/evict/minmax handlers, parsing helpers, tests, and documentation.
Misc — robustness & logging
various files (ChecksumBuffer.cpp, DMFileReader.cpp, etc.)
Tighter seek error logging and added exception logging/handling around seeks.
Docs / Design
docs/design/...-disagg-s3-node-level-backpressure-and-filecache-dedup.md, docs/tiflash_http_api.md
Add design doc describing node-level S3 backpressure, FileCache deduplication, metrics, and new HTTP API doc.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant S3RandomAccessFile as S3RandomAccessFile
    participant S3ReadLimiter as S3ReadLimiter
    participant IStream as S3_IStream

    Client->>S3RandomAccessFile: readImpl(buf, size)
    alt Limiter active
        S3RandomAccessFile->>S3ReadLimiter: getSuggestedChunkSize(...)
        loop per chunk
            S3RandomAccessFile->>S3ReadLimiter: requestBytes(chunk, DirectRead)
            S3ReadLimiter-->>S3RandomAccessFile: grant (or block until tokens)
            S3RandomAccessFile->>IStream: read(chunk)
            IStream-->>S3RandomAccessFile: bytes_read
            S3RandomAccessFile->>S3RandomAccessFile: finalizeRead(...)
        end
    else Limiter disabled
        S3RandomAccessFile->>IStream: read(size)
        IStream-->>S3RandomAccessFile: bytes_read
        S3RandomAccessFile->>S3RandomAccessFile: finalizeRead(...)
    end
    S3RandomAccessFile-->>Client: return bytes_read
sequenceDiagram
    actor Leader
    actor Follower
    participant FileCache
    participant FileSegment
    participant BGDownloader as BG_Downloader
    participant S3ReadLimiter as S3ReadLimiter

    Leader->>FileCache: get(key) — insert Empty placeholder
    FileCache->>BGDownloader: schedule bgDownloadExecutor(enqueue_time)
    BGDownloader->>S3ReadLimiter: requestBytes(..., FileCacheDownload)
    Note over S3ReadLimiter: may block until budget available (token-bucket)
    BGDownloader->>BGDownloader: downloadToLocal(...) (chunked when limiter active)
    BGDownloader-->>FileSegment: setState(Complete) and notify waiters
    Leader-->>FileCache: return miss

    Follower->>FileCache: get(key) — sees Empty
    FileCache->>FileSegment: waitForNotEmptyFor(timeout)
    alt Background finishes within timeout
        FileSegment-->>Follower: Complete (hit)
    else Timeout or Failed
        FileSegment-->>Follower: Miss (timeout/failed)
    end
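The follower-side bounded wait in the diagram above can be sketched with a condition variable. The names (waitForNotEmptyFor, setStatus, the Empty/Complete/Failed states) follow the walkthrough, but the struct below is a simplified illustration and the PR's actual signatures may differ.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Minimal stand-in for a FileCache segment that followers can wait on.
struct MiniFileSegment
{
    enum class Status { Empty, Complete, Failed };

    // Called by the background downloader when the download finishes.
    void setStatus(Status s)
    {
        {
            std::lock_guard lock(mu);
            status = s;
        }
        cv.notify_all();
    }

    // Wait up to `timeout` for the state to leave Empty. Returning Empty
    // means the follower timed out and should be treated as a cache miss.
    Status waitForNotEmptyFor(std::chrono::milliseconds timeout)
    {
        std::unique_lock lock(mu);
        cv.wait_for(lock, timeout, [this] { return status != Status::Empty; });
        return status;
    }

    std::mutex mu;
    std::condition_variable cv;
    Status status = Status::Empty;
};
```

A timed-out or failed wait degrades to a cache miss, so the wait bounds the extra latency a follower pays instead of stalling the query indefinitely.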

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • JinheLin
  • CalvinNeo
  • wshwsh12

Poem

🐇 I nibbled tokens, tiny and bright,
I kept S3 rivers from rushing the night.
Followers pause a beat or two,
Background hops, then finishes through.
Quiet bytes — my carrot delight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'disagg: limit S3 read amplification on compute nodes' clearly summarizes the main change, which is adding node-level S3 byte limiting to prevent read amplification in disaggregated workloads.
Linked Issues check ✅ Passed The PR fully addresses the objectives from issue #10752: prevents MPP query failures from network saturation by implementing a node-level S3ReadLimiter, reduces duplicate downloads via bounded FileCache waits, provides tunable controls via s3_max_read_bytes_per_sec and dt_filecache_wait_on_downloading_ms, and adds observability metrics for S3/remote-cache activity.
Out of Scope Changes check ✅ Passed All code changes are within scope: S3ReadLimiter implementation, FileCache bounded-wait support, configuration changes, metrics integration, and HTTP cache eviction endpoints are all required by the linked objectives. Design documentation and test expansions are appropriate supporting changes.
Description check ✅ Passed The pull request description is comprehensive, well-structured, and covers all required sections including problem statement, detailed changes, test coverage, and documentation.


@ti-chi-bot ti-chi-bot bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the do-not-merge/needs-linked-issue label on Mar 25, 2026
@JaySon-Huang JaySon-Huang force-pushed the disagg_network_limit branch 4 times, most recently from 4253d99 to cd65279 on April 3, 2026 09:24
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang JaySon-Huang force-pushed the disagg_network_limit branch from cd65279 to df14a20 on April 3, 2026 10:09
@JaySon-Huang JaySon-Huang changed the title from "disagg: throttle S3 readers" to "disagg: limit S3 read amplification on compute nodes" on Apr 4, 2026
@ti-chi-bot ti-chi-bot bot added the release-note label (denotes a PR that will be considered when it comes time to generate release notes) and removed the release-note-none label (denotes a PR that doesn't merit a release note) on Apr 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/S3/FileCache.cpp (1)

1317-1342: ⚠️ Potential issue | 🔴 Critical

Failed downloads can corrupt cache_used.

finalizeReservedSize() rebases the reservation to content_length, but the segment keeps its old estimated size until Line 1357. The new cleanup on Lines 1395-1398 / 1450-1451 removes the failed segment and releases file_seg->getSize(), so any failure after Line 1317 with content_length != estimated_size will either leak cache capacity or underflow cache_used.

Persist the finalized reserved size before any later throw point, and add a regression that calls get(...) with std::nullopt or an intentionally wrong size.

Also applies to: 1355-1357, 1390-1399, 1446-1452

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/S3/FileCache.cpp` around lines 1317 - 1342,
finalizeReservedSize updates the reservation but the FileSegment still holds the
old estimated size, so persist the finalized size into the segment immediately
after a successful finalizeReservedSize(...) call (i.e. set
file_seg->setSize(content_length) or equivalent on the FileSegment) before any
later operations that can throw (e.g. before prepareParentDir,
toTemporaryFilename, downloadToLocal, or std::filesystem::rename) and ensure
failed paths release the same size that was reserved; also add a regression test
that calls get(...) with std::nullopt and another test that requests an
intentionally wrong size to verify cache_used/cleanup behavior on download
failures.
🧹 Nitpick comments (1)
dbms/src/Storages/S3/S3RandomAccessFile.cpp (1)

144-171: Refresh the suggested chunk size inside the loop.

Both chunked paths snapshot getSuggestedChunkSize() once before the loop. If s3_max_read_bytes_per_sec is lowered while a large read/seek is already in flight, the remaining iterations keep using the old chunk size and only pick up the tighter burst cap on the next call.

♻️ Minimal change
-    const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
     size_t total_gcount = 0;
     while (total_gcount < size)
     {
+        const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
         auto to_read = std::min(size - total_gcount, static_cast<size_t>(chunk_size));
         read_limiter->requestBytes(to_read, S3ReadSource::DirectRead);
         istr.read(buf + total_gcount, to_read);
         auto gcount = istr.gcount();
         total_gcount += gcount;
@@
-    const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
     size_t total_ignored = 0;
     const auto bytes_to_ignore = static_cast<size_t>(offset - cur_offset);
     while (total_ignored < bytes_to_ignore)
     {
+        const auto chunk_size = read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size);
         auto to_ignore = std::min(bytes_to_ignore - total_ignored, static_cast<size_t>(chunk_size));
         read_limiter->requestBytes(to_ignore, S3ReadSource::DirectRead);
         istr.ignore(to_ignore);
         auto ignored = istr.gcount();
         total_ignored += ignored;

Also applies to: 287-310

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/S3/S3RandomAccessFile.cpp` around lines 144 - 171, The code
in S3RandomAccessFile::readChunked captures
read_limiter->getSuggestedChunkSize(...) once before the loop so if
s3_max_read_bytes_per_sec is lowered during a large in-flight read the loop
continues using the stale chunk_size; update the implementation to call
read_limiter->getSuggestedChunkSize(s3_read_limiter_preferred_chunk_size) inside
the while loop before computing to_read so each iteration observes the current
limiter suggestion, and apply the same change to the other chunked path
referenced in the review (the second chunked read block around lines 287-310).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@metrics/grafana/tiflash_summary.json`:
- Around line 14015-14024: The Y-axis for the "I/O Limiter Pending Duration"
panel is configured with format "short" but should use seconds; update the
panel's first yaxes entry (the object under "yaxes" for the panel named "I/O
Limiter Pending Duration") by changing "format": "short" to "format": "s" so the
histogram quantiles render with seconds units.
- Around line 20558-20572: The two queries using histogram_quantile both use
0.999 but their legendFormat indicates different percentiles; update the expr
for the query with refId "D" to use histogram_quantile(0.99, ...) (i.e., replace
0.999 with 0.99) so the metric computed by the histogram_quantile call matches
the legendFormat "99%-{{stage}}-{{file_type}} {{$additional_groupby}}"; ensure
you only change the numeric quantile in the expression for refId "D" and keep
the rest of the selector and labels identical to the refId "B" query.
- Around line 20790-20800: The legendFormat in the Grafana target is using
{{additional_groupby}} which prevents template variable interpolation; update
the legendFormat for the target (the object with "refId": "A" and "expr":
"sum(rate(tiflash_storage_remote_cache_wait_on_downloading_result...") to use
the templated variable form {{$additional_groupby}} instead of
{{additional_groupby}} so Grafana will substitute the variable at render time.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 21ce3b0d-3139-4d04-9ef8-019ec8a0c894

📥 Commits

Reviewing files that changed from the base of the PR and between a1a397d and 9412be4.

📒 Files selected for processing (16)
  • dbms/src/IO/Checksum/ChecksumBuffer.cpp
  • dbms/src/Interpreters/Context.cpp
  • dbms/src/Interpreters/Context.h
  • dbms/src/Server/Server.cpp
  • dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
  • dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
  • dbms/src/Storages/S3/FileCache.cpp
  • dbms/src/Storages/S3/S3Common.cpp
  • dbms/src/Storages/S3/S3RandomAccessFile.cpp
  • dbms/src/Storages/S3/S3ReadLimiter.cpp
  • dbms/src/Storages/S3/S3ReadLimiter.h
  • dbms/src/Storages/S3/tests/gtest_filecache.cpp
  • docs/tiflash_http_api.md
  • metrics/grafana/tiflash_summary.json
✅ Files skipped from review due to trivial changes (2)
  • dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
  • dbms/src/Server/Server.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dbms/src/Storages/S3/FileCache.cpp`:
- Around line 1272-1278: The code only calls
s3_read_metrics_recorder->recordBytes(rbuf.count(), ...) after copyData()
succeeds, so partial reads during exceptions are not recorded; update the
FileCache download logic (the block using ReadBufferFromIStream rbuf,
WriteBufferFromWritableFile wbuf, copyData and wbuf.sync) to record rbuf.count()
in an unconditional cleanup path (e.g., use a try/finally or RAII guard) so
recordBytes is invoked whether copyData() throws or not; apply the same change
to the other symmetric block around lines handling
ReadBufferFromIStream/copyData (the 1289-1294 case) so both success and
partial-failure byte counts are reported.
- Around line 1396-1403: The current failure cleanup calls remove(s3_key, true)
which flows into removeImpl() and increments eviction metrics; change the code
paths (around FileCache::FileSegment failure handling at the shown block and the
similar block at lines ~1452-1455) to unpublish the failed placeholder and
release its reservation without counting it as an eviction—either call an
existing lighter-weight API (e.g. an unpublishPlaceholder/unpublish or
releaseReservation method if present) or add a new helper (e.g.
unpublishPlaceholder(s3_key) or removeImpl(s3_key, /*count_metrics*/ false))
that removes the published placeholder without increasing
type_dtfile_evict/type_dtfile_evict_bytes; ensure
FileSegment::setStatus(FileSegment::Status::Failed) and bg_download_fail_count
handling remain the same and only replace the remove(..., force=true) call with
this non-metric-cleanup path.

In `@dbms/src/Storages/S3/tests/gtest_filecache.cpp`:
- Around line 1268-1304: The test currently only calls
waitForBgDownload(file_cache) which can pass even if background downloads
failed; update the test (TEST_F FileCacheTest
BgDownloadWorksWithSharedS3ReadLimiter) to explicitly assert the downloads
succeeded by verifying the cached entries for the two generated objects (use the
same objects from genObjects) are present and valid after waitForBgDownload —
e.g., call file_cache.get(S3FilenameView::fromKey(objects[0].key),
objects[0].size) and file_cache.get(S3FilenameView::fromKey(objects[1].key),
objects[1].size) and ASSERT they return non-null / valid results (or assert on
the on-disk cache file existence) to ensure the paused downloads actually
completed successfully.

In `@metrics/grafana/tiflash_summary.json`:
- Around line 21106-21117: The legendFormat in the target (inside the "targets"
entry with "refId": "A") is using "{{additional_groupby}}" which prevents
Grafana template interpolation; update the legendFormat value to use the
template variable with a dollar prefix (i.e., "{{$additional_groupby}}") so it
matches the expr usage and renders the variable instead of literal text.
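The "unconditional cleanup path (e.g., use a try/finally or RAII guard)" suggested for recordBytes above can be sketched as a small scope guard. Everything here is illustrative; the PR's actual recorder API may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <utility>

// Records however many bytes the wrapped read produced when the enclosing
// scope exits, whether copyData() returned normally or threw.
class RecordBytesOnExit
{
public:
    RecordBytesOnExit(std::function<uint64_t()> count_fn_, std::function<void(uint64_t)> record_fn_)
        : count_fn(std::move(count_fn_))
        , record_fn(std::move(record_fn_))
    {}

    // record_fn must not throw: the destructor also runs during stack unwinding.
    ~RecordBytesOnExit() { record_fn(count_fn()); }

private:
    std::function<uint64_t()> count_fn;
    std::function<void(uint64_t)> record_fn;
};
```

In the FileCache download path the guard would capture rbuf.count(), so partial reads are still reported even when copyData() throws mid-transfer.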

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 313aa7ad-b13b-4365-a774-ed266e69586b

📥 Commits

Reviewing files that changed from the base of the PR and between a1a397d and d440f12.

📒 Files selected for processing (17)
  • dbms/src/IO/Checksum/ChecksumBuffer.cpp
  • dbms/src/Interpreters/Context.cpp
  • dbms/src/Interpreters/Context.h
  • dbms/src/Server/Server.cpp
  • dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
  • dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
  • dbms/src/Storages/S3/FileCache.cpp
  • dbms/src/Storages/S3/FileCache.h
  • dbms/src/Storages/S3/S3Common.cpp
  • dbms/src/Storages/S3/S3RandomAccessFile.cpp
  • dbms/src/Storages/S3/S3ReadLimiter.cpp
  • dbms/src/Storages/S3/S3ReadLimiter.h
  • dbms/src/Storages/S3/tests/gtest_filecache.cpp
  • docs/tiflash_http_api.md
  • metrics/grafana/tiflash_summary.json
✅ Files skipped from review due to trivial changes (2)
  • dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
  • dbms/src/Storages/S3/S3RandomAccessFile.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
  • dbms/src/Storages/S3/FileCache.h

Signed-off-by: JaySon-Huang <tshent@qq.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@JaySon-Huang JaySon-Huang requested a review from Copilot April 5, 2026 17:25
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 2 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 33 out of 34 changed files in this pull request and generated 1 comment.



@JaySon-Huang
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Apr 5, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
dbms/src/Storages/S3/S3ReadLimiter.h (1)

59-63: Clarify stale “stream dimension” wording in class docs.

This class currently exposes byte-rate limiting APIs only; mentioning a present “stream dimension” here can confuse future readers.

✍️ Suggested doc-only tweak
-    /// The stream dimension is best-effort protection against too many live response bodies, not a
-    /// replacement for byte throttling and not a safe cap on reader object count. In TiFlash a
-    /// `S3RandomAccessFile` may keep its body stream open across scheduling gaps, so a low stream
-    /// limit can block forward progress even when the node is no longer transferring many bytes.
+    /// This limiter currently enforces byte-rate only. Stream-count limiting was intentionally
+    /// removed because stream lifetime is not a reliable proxy for real network occupancy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/S3/S3ReadLimiter.h` around lines 59 - 63, The class doc
comment for S3ReadLimiter currently mentions a "stream dimension" which is stale
and misleading because the class only exposes byte-rate limiting APIs; update
the comment above the S3ReadLimiter constructor (S3ReadLimiter(UInt64
max_read_bytes_per_sec_ = 0, UInt64 refill_period_ms_ = 100)) to remove or
rephrase the “stream dimension” wording and clearly state that only byte-rate
limiting is enforced (and that any mention of limiting concurrent/open streams
is best-effort or not provided), keeping the rest of the explanation about
scheduling gaps and refill behavior intact.
dbms/src/Common/TiFlashMetrics.cpp (1)

32-32: Give bg_download_stage more headroom.

ExpBuckets{0.0001, 2, 20} tops out at roughly 52s. The manual experiment attached to this PR already shows minute-scale background-download waits, so tiflash_storage_remote_cache_bg_download_stage_seconds will pin to its top bucket and flatten p99/p999 above that. Consider widening only the bg-stage histogram range.

♻️ Suggested tweak
-constexpr auto remote_cache_bg_download_stage_buckets = ExpBuckets{0.0001, 2, 20};
+constexpr auto remote_cache_bg_download_stage_buckets = ExpBuckets{0.01, 2, 20};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Common/TiFlashMetrics.cpp` at line 32, The histogram for bg download
stage currently uses remote_cache_bg_download_stage_buckets = ExpBuckets{0.0001,
2, 20} which tops out too low and will saturate for minute-scale waits; update
the buckets to widen only this histogram (e.g., change the last parameter from
20 to 30) in TiFlashMetrics.cpp by modifying
remote_cache_bg_download_stage_buckets so
tiflash_storage_remote_cache_bg_download_stage_seconds has headroom for
minute-scale durations.
dbms/src/Storages/S3/tests/gtest_filecache.cpp (1)

1335-1338: Timing assertion may be sensitive to system load.

The assertion ASSERT_GE(watch.elapsedMilliseconds(), 200) validates that the limiter throttles the download, but the exact threshold (200ms) could be sensitive to CI runner load. Consider adding a small tolerance or documenting the expected timing rationale if flakiness is observed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/S3/tests/gtest_filecache.cpp` around lines 1335 - 1338, The
timing assertion using AtomicStopwatch and
ASSERT_GE(watch.elapsedMilliseconds(), 200) is brittle under variable CI load;
update the test around file_cache.get(S3FilenameView::fromKey(object_key),
object_size) and waitForBgDownload(file_cache) to allow a small tolerance (e.g.,
compare against an expected_delay minus a small delta or assert elapsed >=
expected_delay_with_tolerance) or document the rationale; adjust the assertion
to use a configurable/clearly named constant (expected_delay and tolerance) so
the limiter check remains robust across environments.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dbms/src/IO/BaseFile/RateLimiter.h`:
- Line 223: The S3 limiter is being set to nullptr when the config is zero so
files constructed earlier (S3RandomAccessFile ctor snapshotting
client_ptr->getS3ReadLimiter()) keep nullptr; instead ensure getS3ReadLimiter()
always returns a valid (non-null) shared_ptr by publishing a no-op/unthrottled
S3::S3ReadLimiter instance when the configured rate is 0 and atomically swapping
in a real throttling S3ReadLimiter when the config becomes non-zero (or update
an existing limiter’s internal state rather than replacing with nullptr). Locate
the s3_read_limiter management in RateLimiter.cpp (and the getS3ReadLimiter
declaration) and change the zero-rate path to create/publish a non-throttling
limiter so existing snapshots in S3RandomAccessFile continue to be valid across
reloads.

In `@dbms/src/IO/BaseFile/tests/gtest_rate_limiter.cpp`:
- Around line 424-432: The test TEST(S3ReadLimiterTest,
LargeRequestDoesNotWaitForever) currently calls
S3::S3ReadLimiter::requestBytes(...) directly on the test thread which can hang
CI if the limiter regresses; instead spawn the requestBytes call in a background
task (e.g., std::async or std::thread) and use a wait_for() with a timeout to
assert the task completed (and that AtomicStopwatch::elapsedMilliseconds() is
under the expected bound), referencing S3::S3ReadLimiter, requestBytes,
S3::S3ReadSource::DirectRead and AtomicStopwatch to locate and replace the
direct blocking call with a timed async/wait_for assertion.

In `@dbms/src/Storages/S3/S3Common.h`:
- Around line 185-191: setS3ReadLimiter currently replaces the shared_ptr which
breaks existing S3RandomAccessFile snapshots; instead make the published
S3ReadLimiter object stable by updating its configuration in-place: in
setS3ReadLimiter(const std::shared_ptr<S3ReadLimiter>& limiter) if
shared_s3_read_limiter is non-null and limiter is non-null call
shared_s3_read_limiter->updateConfig(...) (or equivalent) to apply new limits,
otherwise (when there is no existing limiter) assign the pointer; also ensure
you still call shared_tiflash_client->setS3ReadLimiter(shared_s3_read_limiter)
so clients keep referencing the same long-lived S3ReadLimiter instance (refer to
setS3ReadLimiter, shared_s3_read_limiter, S3ReadLimiter::updateConfig, and
S3RandomAccessFile constructor).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 10de3fcb-9c5d-47f6-b1a7-8b48cbf070d8

📥 Commits

Reviewing files that changed from the base of the PR and between 9433514 and b767295.

📒 Files selected for processing (34)
  • dbms/src/Common/FailPoint.cpp
  • dbms/src/Common/TiFlashMetrics.cpp
  • dbms/src/Common/TiFlashMetrics.h
  • dbms/src/IO/BaseFile/IORateLimitConfig.cpp
  • dbms/src/IO/BaseFile/IORateLimitConfig.h
  • dbms/src/IO/BaseFile/RateLimiter.cpp
  • dbms/src/IO/BaseFile/RateLimiter.h
  • dbms/src/IO/BaseFile/tests/gtest_rate_limiter.cpp
  • dbms/src/IO/Checksum/ChecksumBuffer.cpp
  • dbms/src/Interpreters/Context.cpp
  • dbms/src/Interpreters/Context.h
  • dbms/src/Interpreters/Settings.h
  • dbms/src/Server/Server.cpp
  • dbms/src/Server/tests/gtest_storage_config.cpp
  • dbms/src/Storages/DeltaMerge/File/DMFileReader.cpp
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp
  • dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h
  • dbms/src/Storages/KVStore/FFI/tests/gtest_status_server.cpp
  • dbms/src/Storages/S3/FileCache.cpp
  • dbms/src/Storages/S3/FileCache.h
  • dbms/src/Storages/S3/MockS3Client.h
  • dbms/src/Storages/S3/S3Common.cpp
  • dbms/src/Storages/S3/S3Common.h
  • dbms/src/Storages/S3/S3RandomAccessFile.cpp
  • dbms/src/Storages/S3/S3RandomAccessFile.h
  • dbms/src/Storages/S3/S3ReadLimiter.cpp
  • dbms/src/Storages/S3/S3ReadLimiter.h
  • dbms/src/Storages/S3/S3ReadLimiter_fwd.h
  • dbms/src/Storages/S3/tests/gtest_filecache.cpp
  • dbms/src/Storages/S3/tests/gtest_s3client.cpp
  • dbms/src/Storages/S3/tests/gtest_s3file.cpp
  • docs/design/2026-03-24-disagg-s3-node-level-backpressure-and-filecache-dedup.md
  • docs/tiflash_http_api.md
  • metrics/grafana/tiflash_summary.json

Comment on lines +185 to +191
void setS3ReadLimiter(const std::shared_ptr<S3ReadLimiter> & limiter)
{
std::unique_lock lock_init(mtx_init);
shared_s3_read_limiter = limiter;
if (shared_tiflash_client != nullptr)
shared_tiflash_client->setS3ReadLimiter(shared_s3_read_limiter);
}

@coderabbitai coderabbitai bot Apr 5, 2026


⚠️ Potential issue | 🟠 Major

Keep the published limiter object stable across reloads.

dbms/src/Server/Server.cpp hits this setter on config reload, but dbms/src/Storages/S3/S3RandomAccessFile.cpp snapshots the limiter once in its constructor. If a node starts with s3_max_read_bytes_per_sec = 0 and reload later enables it, already-open readers keep the old nullptr and bypass throttling for the rest of the query. Prefer updating one long-lived S3ReadLimiter via updateConfig() instead of swapping the shared_ptr.


Contributor Author


The nonzero → 0 reload path is already handled by keeping the existing S3ReadLimiter object alive and applying updateConfig(0) in place, so readers that have already snapshotted the shared limiter observe the disable without any pointer recreation.
What remains intentionally limited today is the first 0 → nonzero transition when startup published no limiter object: readers constructed earlier snapshotted nullptr and will not start throttling retroactively; only newly created readers see the newly published limiter. I added a code comment and TODO in RateLimiter.cpp to document this behavior and the possible follow-up of publishing a no-op limiter even at rate 0, if we want stronger reload semantics later.


The PR description images show:

  • Image 1 (Node exporter – Network overview): Grafana Network panels with no drops/errors, traffic bursts concentrated on bond1 up to ~1.23 GB/s, packet peaks ~818 k p/s, link speeds stable over 2026-04-05 18:37:43–20:22:54.
  • Image 2 (TiFlash Summary – PageStorage Rate Limiter): Limiter throughput shows S3 direct/filecache bursts; thresholds pegged at 200 MiB/s for s3_read and bg_write; pending queues up to 30; S3 read pending P99≈1.31 s, P999≈2.77 s in the same time window.
  • Image 3 (TiFlash Summary – Remote Cache): Remote cache ops/flow with spikes; bounded waits up to ~7 s; queue depth up to ~150–200; occasional “too_many_download” rejects.
  • Image 4 (TiFlash Summary – Node/resource): Store/available sizes oscillate with workload; uptime ~2 hours; transient CPU spikes >1000%; memory peaks ~11 GiB.

@JaySon-Huang thanks for the clarification. Your described semantics match the code paths: nonzero → 0 keeps the same S3ReadLimiter instance and calls updateConfig(0), so in-flight readers that captured the pointer observe disablement; the only non-retroactive case is the very first 0 → nonzero transition when the process started with no limiter object and older readers had snapshotted nullptr.

Given that, I’m good with the current behavior for this PR. As a low-risk follow-up to tighten reload semantics later, you could consider instantiating and publishing a no-op limiter at startup (limit=0). That would make the initial 0 → nonzero transition retroactive for already-open readers without changing hot-path costs when disabled, since:

  • existing readers would already hold a stable pointer; a reload would just updateConfig(new_rate),
  • per-read locking remains unchanged (no extra get() per read).

I’ve recorded this nuance so we don’t flag it again.


✏️ Learnings added
Learnt from: JaySon-Huang
Repo: pingcap/tiflash PR: 10771
File: dbms/src/Storages/S3/S3Common.h:185-191
Timestamp: 2026-04-05T19:04:11.469Z
Learning: TiFlash S3ReadLimiter reload semantics: nonzero→0 is handled in place via S3ReadLimiter::updateConfig(0) so existing readers observe disablement; the initial 0→nonzero after startup with no limiter object is intentionally non‑retroactive—readers that captured nullptr continue unthrottled until recreated. A possible future enhancement is to publish a no‑op limiter (rate=0) at startup to make 0→nonzero retroactive.



🧠 Learnings used
Learnt from: CR
Repo: pingcap/tiflash PR: 0
File: dbms/src/Storages/AGENTS.md:0-0
Timestamp: 2026-01-26T03:03:41.424Z
Learning: Applies to dbms/src/Storages/**/{DeltaMerge,KVStore,Page,S3}/**/*.{h,hpp,cpp} : Use `std::shared_ptr` for `StorageDeltaMerge`, `DeltaMergeStore`, `Region`, and `Context` objects


Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid MPP queries exhaust tiflash-compute network under disagg arch

2 participants