Skip to content

[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199

Open
AntiTopQuark wants to merge 33 commits intoapache:masterfrom
AntiTopQuark:support_tso
Open

[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199
AntiTopQuark wants to merge 33 commits intoapache:masterfrom
AntiTopQuark:support_tso

Conversation

@AntiTopQuark
Copy link
Copy Markdown

@AntiTopQuark AntiTopQuark commented Mar 11, 2026

What problem does this PR solve?

Issue Number: close #61198

Related #57921

Problem Summary:

Release note

  • Implement a global monotonically increasing Timestamp Oracle (TSO) service that generates unique, monotonically increasing timestamps for transactions.
    The service calibrates its initial timestamp at startup and periodically updates it to maintain a time window.
    A TSO timestamp encodes the physical time and a logical counter; it is assembled and extracted by the new TSOTimestamp class.
  • Introduce TSOService, a master-only daemon that manages global timestamps.
    The service exposes two main methods:
    • getTSO() – returns a new TSO timestamp for transaction commits.
    • getCurrentTSO() – returns the current TSO without bumping the logical counter.
  • Add multiple configuration properties to control the behavior of the TSO feature:
    • experimental_enable_feature_tso – enables/disables the TSO feature.
    • tso_service_update_interval_ms – interval in milliseconds for the TSO service to update its window.
    • max_update_tso_retry_count and max_get_tso_retry_count – retry limits for updating and obtaining TSOs.
    • tso_service_window_duration_ms – length of the time window allocated by the TSO service.
    • tso_time_offset_debug_mode – debug offset for the physical time.
    • enable_tso_persist_journal and enable_tso_checkpoint_module – persistence switches for edit log and checkpoint.
  • Table property: Introduce enable_tso which can be configured in CREATE TABLE or modified via ALTER TABLE. Only tables with enable_tso = true generate commit TSO for transactions; when disabled, commit_tso remains -1.
  • Transaction and commit integration:
    • During commit, TransactionState now fetches a commit TSO from TSOService when TSO is enabled and stores it in the transaction state and TableCommitInfo.
    • The commit TSO is recorded per partition (via TPartitionVersionInfo.commit_tso), and is persisted with each rowset (see next item).
  • Rowset and meta changes:
    • Rowset::make_visible now accepts a commit_tso parameter and writes it to RowsetMeta.
    • RowsetMetaPB adds a new field commit_tso to persist commit timestamps.
    • information_schema.rowsets introduces a new column COMMIT_TSO allowing users to query the commit timestamp for each rowset.
    • Pending publish tasks, asynchronous publish tasks and other internal structures have been extended to carry commit TSO.
  • External interface:
    A new REST endpoint /api/tso is added for retrieving current TSO information. It returns a JSON payload containing:
    • window_end_physical_time – end of the current TSO time window.
    • current_tso – the current composed 64‑bit TSO.
    • current_tso_physical_time and current_tso_logical_counter – the decomposed physical and logical parts of the current TSO. This API does not increment the logical counter.
  • Metrics & observability:
    New metrics counters (e.g., tso_clock_drift_detected, tso_clock_backward_detected, tso_clock_calculated, tso_clock_updated) expose state and health of the TSO service.
  • Regression & unit tests:
    New unit tests verify TSOTimestamp bit manipulation, TSOService behavior, commit TSO propagation, and the /api/tso endpoint. Regression tests verify that rowset commit timestamps are populated when TSO is enabled and that the API returns increasing TSOs.

Impact and Compatibility

  • Experimental: the TSO feature is currently guarded by experimental_enable_feature_tso. It is disabled by default and can be enabled in front-end configuration. When enabled, old FE versions without this feature cannot replay edit log entries containing TSO operations; therefore upgrade all FEs before enabling.
  • Table compatibility: tables created before enabling TSO remain unaffected unless explicitly modified to set enable_tso to true. Tables with TSO enabled will produce commit TSO for each rowset and may require downstream consumers to handle the new commit_tso field.
  • Client API: clients can call /api/tso to inspect current TSO values. No existing API is modified.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.45% (1798/2263)
Line Coverage 64.65% (32249/49881)
Region Coverage 65.60% (16145/24611)
Branch Coverage 56.06% (8611/15360)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@doris-robot
Copy link
Copy Markdown

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.65% (32290/49944)
Region Coverage 65.57% (16170/24659)
Branch Coverage 55.96% (8611/15388)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.63% (32279/49944)
Region Coverage 65.55% (16165/24659)
Branch Coverage 55.95% (8610/15388)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 44.34% (188/424) 🎉
Increment coverage report
Complete coverage report

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

…TSO)

Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.66% (32294/49944)
Region Coverage 65.59% (16173/24659)
Branch Coverage 55.97% (8612/15388)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 44.34% (188/424) 🎉
Increment coverage report
Complete coverage report

Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.24% (1798/2269)
Line Coverage 64.53% (32281/50023)
Region Coverage 65.45% (16165/24699)
Branch Coverage 55.85% (8609/15414)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@doris-robot
Copy link
Copy Markdown

TPC-H: Total hot run time: 36676 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6755fa1061376778da12ffdc0a7d55f410eb3900, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17632	4532	4353	4353
q2	q3	10634	1028	633	633
q4	4716	465	354	354
q5	7776	802	625	625
q6	240	220	185	185
q7	1010	881	647	647
q8	9516	945	808	808
q9	9938	7467	7447	7447
q10	8982	4112	3755	3755
q11	576	354	342	342
q12	810	685	570	570
q13	17982	4880	4014	4014
q14	477	480	450	450
q15	q16	569	556	462	462
q17	929	1056	820	820
q18	6682	6330	5766	5766
q19	1179	1312	898	898
q20	1528	1446	1616	1446
q21	4911	3312	2730	2730
q22	505	400	371	371
Total cold run time: 106592 ms
Total hot run time: 36676 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5532	5227	5364	5227
q2	q3	1945	2397	2012	2012
q4	1005	1451	890	890
q5	4614	4933	4811	4811
q6	231	206	174	174
q7	7952	7900	7686	7686
q8	3272	3605	3427	3427
q9	27508	27430	27218	27218
q10	4591	5015	4563	4563
q11	1036	932	907	907
q12	577	676	497	497
q13	4458	4987	4040	4040
q14	465	481	470	470
q15	q16	551	573	556	556
q17	3771	3865	3792	3792
q18	7547	7196	6936	6936
q19	1173	1190	1215	1190
q20	2337	2444	2267	2267
q21	14015	13414	13352	13352
q22	596	546	487	487
Total cold run time: 93176 ms
Total hot run time: 90502 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.67% (1796/2283)
Line Coverage 64.40% (32280/50128)
Region Coverage 65.26% (16159/24761)
Branch Coverage 55.73% (8612/15454)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 58.53% (271/463) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.92% (19966/37732)
Line Coverage 36.45% (187288/513766)
Region Coverage 32.70% (145252/444150)
Branch Coverage 33.88% (63662/187922)

@AntiTopQuark AntiTopQuark requested a review from morningman March 30, 2026 16:03
@AntiTopQuark
Copy link
Copy Markdown
Author

Hi, @morningman and @morrySnow .I have fixed all the comments. Please help review this code again.

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.54% (27176/36952)
Line Coverage 57.07% (292346/512219)
Region Coverage 54.34% (243592/448270)
Branch Coverage 56.08% (105713/188488)

@gavinchou
Copy link
Copy Markdown
Contributor

LGTM

gavinchou
gavinchou previously approved these changes Mar 31, 2026
Copy link
Copy Markdown
Contributor

@gavinchou gavinchou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 31, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@AntiTopQuark
Copy link
Copy Markdown
Author

Hi, @morningman and @morrySnow . Please help review this code again.

@dataroaring
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues worth addressing before this lands.

  1. High: fe/fe-core/src/main/java/org/apache/doris/tso/TSOService.java makes image persistence of windowEndTSO optional even though calibration already requires journal persistence. If enable_tso_feature=true, enable_tso_persist_journal=true, and enable_tso_checkpoint_module=false, a checkpoint can be written without the last reserved TSO window. After that image is taken and older TSO journals are compacted away, the next master restart will reload windowEndTSO=0 from image and can calibrate from wall clock instead of the last issued window, violating the feature's global monotonicity guarantee.
  2. Medium: fe/fe-core/src/main/java/org/apache/doris/httpv2/rest/TSOAction.java exposes cached TSO values even when the feature has been disabled or the service is uncalibrated, because the endpoint does not validate feature state and TSOService does not clear the cached timestamp on disable. That makes /api/tso report stale state as if it were authoritative.

Critical checkpoint conclusions:

  • Goal of the task: Partially achieved. The PR wires TSO through FE commit, publish RPC, and BE rowset metadata, but the checkpoint/restart path still has a correctness hole.
  • Minimality/focus: Broad but reasonable for the feature scope.
  • Concurrency: Locking in TSOService and BE publish paths looks consistent; I did not find a deadlock, but the disable path leaves stale in-memory state observable through the API.
  • Lifecycle/static initialization: Not fully correct. The TSO lifecycle across journal replay, checkpoint, and master restart is incomplete when the image module is disabled.
  • Configuration: New configs are present, but enable_tso_persist_journal and enable_tso_checkpoint_module can currently be combined into a broken recovery mode.
  • Compatibility/incompatible change: New FE op code / meta module handling exists, but the optional checkpoint persistence weakens recovery compatibility semantics.
  • Parallel code paths: commit_tso propagation appears covered in the main publish path, async publish path, and report-based republish path.
  • Special conditional checks: The persist-journal precondition in calibration is good; the HTTP API is missing equivalent feature-state validation.
  • Test coverage: Good unit/regression coverage for normal flows, but I did not see coverage for checkpoint-disabled restart recovery or API behavior while disabled/uninitialized.
  • Observability: Logging and metrics are generally sufficient.
  • Transaction/persistence: Not correct yet because the durable source of windowEndTSO can disappear after checkpoint compaction.
  • Data writes/modifications: FE-to-BE commit_tso plumbing and BE rowset persistence look consistent in the reviewed paths.
  • FE-BE variable passing: The new thrift/proto fields are propagated through publish and recovery paths I checked.
  • Performance: No obvious hot-path regression stood out during review.
  • Other issues: None beyond the two findings above.

}

public long saveTSO(CountingDataOutputStream dos, long checksum) throws IOException {
if (!Config.enable_tso_checkpoint_module) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

saveTSO() is gated by enable_tso_checkpoint_module, but calibrateTimestamp() already requires enable_tso_persist_journal=true, so TSO can be actively handing out timestamps while this image module stays disabled. That creates a monotonicity hole after checkpoint compaction: once a checkpoint is taken without windowEndTSO and the pre-checkpoint TSO journal entries are purged, the next master restart reloads windowEndTSO=0 from image and can calibrate from wall clock instead of the last reserved window. In that scenario the new master can generate a TSO lower than one already issued before the checkpoint. This state needs to be made impossible, either by always saving windowEndTSO into the image whenever journal persistence is enabled, or by rejecting the config combination that enables TSO without checkpoint persistence.

Copy link
Copy Markdown
Author

@AntiTopQuark AntiTopQuark Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intentionally keep these two switches decoupled:

  • enable_tso_persist_journal serves as the strict correctness gate for enabling TSO; if this configuration is set to false, calibrateTimestamp() will fail fast.

  • enable_tso_checkpoint_module is a compatibility switch for image format and rollback scenarios, not the switch for enabling TSO.

Even when the checkpoint module is disabled, window-end events are still replayed from the journal (OP_TSO_TIMESTAMP_WINDOW_END).Therefore, this configuration combination is a compatibility mode chosen by the operator, with known trade‑offs, rather than an unintended state.

cc @dataroaring

return ResponseEntityBuilder.badRequest("Current FE is not master");
}
// Get current TSO information without increasing it
long windowEndPhysicalTime = env.getTSOService().getWindowEndTSO();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint never checks whether TSO is enabled or calibrated before returning getWindowEndTSO() / getCurrentTSO(). That is observable because TSOService.runAfterCatalogReady() only flips isInitialized to false when the feature is disabled; it does not clear the cached globalTimestamp or windowEndTSO. After ADMIN SET FRONTEND CONFIG ('enable_tso_feature'='false'), /api/tso will still return the previous timestamp as if it were current. Please either clear the cached state on disable, or make /api/tso reject requests unless the feature is enabled and calibrated.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an intentional design, not an oversight.
The /api/tso endpoint is intended as a state observation interface, not a timestamp issuance interface.

The actual path for externally allocating TSO is getTSO(), where strict validation is performed on both enable_tso_feature and isInitialized. Requests will be rejected directly if TSO is not enabled or not calibrated.

In contrast, /api/tso only reads in-memory snapshots of the current state (getWindowEndTSO() / getCurrentTSO()) for observation and troubleshooting purposes, and does not advance the timestamp.

cc @dataroaring

Copy link
Copy Markdown
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Thanks for this substantial feature addition! The TSO design follows established patterns (similar to TiDB/PD). Here are my findings:


Critical Issues

1. Config naming inconsistency — enable_tso_feature vs experimental_enable_feature_tso

The Config field is enable_tso_feature but the error message in PropertyAnalyzer.analyzeEnableTso() says "experimental_enable_tso_feature" (note the word order difference: enable_tso_feature vs enable_feature_tso). This will confuse users trying to set the right config.

2. getTSO() sleeps in the transaction commit path

getTSO() is called from DatabaseTransactionMgr.getCommitTSO() during commit. It contains retry loops that call sleep(200) up to tso_max_get_retry_count (default 10) times. This means a transaction commit can block for up to 2 seconds waiting for the environment. The Env.getCurrentEnv().isReady() and isMaster() checks should be done once before entering the retry loop rather than repeated inside it.

3. Logical counter overflow recovery gap

In getTSO(), when logical > MAX_LOGICAL_COUNTER, the code sleeps and retries. But generateTSO() returns logicalCounter + 1 without resetting the counter — the reset only happens when updateTimestamp() daemon advances the physical time via setTSOPhysical(). If the daemon tick hasn't run yet, all concurrent getTSO() calls will see overflow and fail until the daemon catches up.

4. unprotectedCommitTransaction signature change

unprotectedCommitTransaction and unprotectedCommitTransaction2PC now throw TransactionCommitFailedException. This is a behavioral change — all callers need to be verified. Subclasses (e.g. cloud variants) that override these methods must also update their signatures.

5. calibrateTimestamp() requires enable_tso_persist_journal=true but it defaults to false

Enabling enable_tso_feature=true without also setting enable_tso_persist_journal=true causes the service to perpetually fail calibration in runAfterCatalogReady(). These config dependencies should be enforced or at least clearly documented.


Moderate Issues

6. _async_publish_tasks still uses anonymous std::tuple

storage_engine.h changed the async publish map from pair to tuple<int64_t, int64_t, int64_t> with positional access (std::get<0>, etc.). Since DiscontinuousVersionTablet was correctly introduced as a named struct for the other case, consider doing the same here for consistency.

7. Rowset::make_visible has no default for commit_tso

The signature changed from make_visible(Version) to make_visible(Version, int64_t commit_tso) without a default parameter. The BE txn_manager.h overloads correctly use = -1, but rowset.h does not. Any code (including out-of-tree or test code) calling the old signature will break.

8. /api/tso authentication check

TSOAction.getTSO() calls executeCheckPassword(request, response) but doesn't check its return value. Other REST actions in the codebase typically verify the result or use checkGlobalAuth. This may allow unauthenticated access to TSO state.

9. TSOClockBackwardException propagates through MasterDaemon

If clock backward exceeds the threshold, a RuntimeException propagates from runAfterCatalogReady(). Depending on how MasterDaemon handles uncaught exceptions in the run loop, this could kill the daemon thread permanently.

10. Concurrency model mixes ReentrantLock and AtomicLong/AtomicBoolean

windowEndTSO is an AtomicLong read without the lock in updateTimestamp(), while isInitialized is an AtomicBoolean set under lock in some paths but checked without it in others. This is technically safe but makes the concurrency contract harder to reason about. Consider documenting which fields are protected by what.


Minor Issues

11. Missing spaces in config descriptions

Several @ConfField description strings are missing trailing spaces before concatenation:

  • "retry 3 times" + "to update""retry 3 timesto update"
  • "5000ms from BDBJE once." → should have space before "will apply"
  • Same pattern in tso_max_get_retry_count, tso_service_window_duration_ms, tso_time_offset_debug_mode

12. FeMetaVersion bump mentioned but not in diff

The PR description mentions bumping VERSION_CURRENT to VERSION_141, but this file is not in the changed files list. Is the version bump missing or in a separate PR?

13. Regression test may silently pass when TSO is disabled

test_tso_api.groovy and test_tso_rowset_commit_tso.groovy — if the test cluster doesn't have enable_tso_feature and enable_tso_persist_journal set, the tests may skip or pass without exercising the feature.


Overall the BE-side plumbing (threading commit_tso through publish, rowset meta, information_schema) is clean. The FE core needs attention on the concurrency/retry model in getTSO() and the config dependency chain.

…riptions to ensure clarity of the descriptive information.
@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@AntiTopQuark
Copy link
Copy Markdown
Author

Code Review

Thanks for this substantial feature addition! The TSO design follows established patterns (similar to TiDB/PD). Here are my findings:

Critical Issues

1. Config naming inconsistency — enable_tso_feature vs experimental_enable_feature_tso

The Config field is enable_tso_feature but the error message in PropertyAnalyzer.analyzeEnableTso() says "experimental_enable_tso_feature" (note the word order difference: enable_tso_feature vs enable_feature_tso). This will confuse users trying to set the right config.

2. getTSO() sleeps in the transaction commit path

getTSO() is called from DatabaseTransactionMgr.getCommitTSO() during commit. It contains retry loops that call sleep(200) up to tso_max_get_retry_count (default 10) times. This means a transaction commit can block for up to 2 seconds waiting for the environment. The Env.getCurrentEnv().isReady() and isMaster() checks should be done once before entering the retry loop rather than repeated inside it.

3. Logical counter overflow recovery gap

In getTSO(), when logical > MAX_LOGICAL_COUNTER, the code sleeps and retries. But generateTSO() returns logicalCounter + 1 without resetting the counter — the reset only happens when updateTimestamp() daemon advances the physical time via setTSOPhysical(). If the daemon tick hasn't run yet, all concurrent getTSO() calls will see overflow and fail until the daemon catches up.

4. unprotectedCommitTransaction signature change

unprotectedCommitTransaction and unprotectedCommitTransaction2PC now throw TransactionCommitFailedException. This is a behavioral change — all callers need to be verified. Subclasses (e.g. cloud variants) that override these methods must also update their signatures.

5. calibrateTimestamp() requires enable_tso_persist_journal=true but it defaults to false

Enabling enable_tso_feature=true without also setting enable_tso_persist_journal=true causes the service to perpetually fail calibration in runAfterCatalogReady(). These config dependencies should be enforced or at least clearly documented.

Moderate Issues

6. _async_publish_tasks still uses anonymous std::tuple

storage_engine.h changed the async publish map from pair to tuple<int64_t, int64_t, int64_t> with positional access (std::get<0>, etc.). Since DiscontinuousVersionTablet was correctly introduced as a named struct for the other case, consider doing the same here for consistency.

7. Rowset::make_visible has no default for commit_tso

The signature changed from make_visible(Version) to make_visible(Version, int64_t commit_tso) without a default parameter. The BE txn_manager.h overloads correctly use = -1, but rowset.h does not. Any code (including out-of-tree or test code) calling the old signature will break.

8. /api/tso authentication check

TSOAction.getTSO() calls executeCheckPassword(request, response) but doesn't check its return value. Other REST actions in the codebase typically verify the result or use checkGlobalAuth. This may allow unauthenticated access to TSO state.

9. TSOClockBackwardException propagates through MasterDaemon

If clock backward exceeds the threshold, a RuntimeException propagates from runAfterCatalogReady(). Depending on how MasterDaemon handles uncaught exceptions in the run loop, this could kill the daemon thread permanently.

10. Concurrency model mixes ReentrantLock and AtomicLong/AtomicBoolean

windowEndTSO is an AtomicLong read without the lock in updateTimestamp(), while isInitialized is an AtomicBoolean set under lock in some paths but checked without it in others. This is technically safe but makes the concurrency contract harder to reason about. Consider documenting which fields are protected by what.

Minor Issues

11. Missing spaces in config descriptions

Several @ConfField description strings are missing trailing spaces before concatenation:

  • "retry 3 times" + "to update""retry 3 timesto update"
  • "5000ms from BDBJE once." → should have space before "will apply"
  • Same pattern in tso_max_get_retry_count, tso_service_window_duration_ms, tso_time_offset_debug_mode

12. FeMetaVersion bump mentioned but not in diff

The PR description mentions bumping VERSION_CURRENT to VERSION_141, but this file is not in the changed files list. Is the version bump missing or in a separate PR?

13. Regression test may silently pass when TSO is disabled

test_tso_api.groovy and test_tso_rowset_commit_tso.groovy — if the test cluster doesn't have enable_tso_feature and enable_tso_persist_journal set, the tests may skip or pass without exercising the feature.

Overall the BE-side plumbing (threading commit_tso through publish, rowset meta, information_schema) is clean. The FE core needs attention on the concurrency/retry model in getTSO() and the config dependency chain.

  1. Config naming inconsistency
    enable_tso_feature is the code field name, while experimental_enable_tso_feature is the external config name for the experimental feature. The experimental prefix follows the system's established convention.

  2. getTSO() contains sleep in the transaction commit path
    This is an intentional bounded retry, designed to handle transient states during master‑slave failover or readiness flutter. Moving the checks outside the loop would turn recoverable transient errors into immediate failures. The maximum blocking time is also configurable, not unbounded.

  3. Logical counter overflow "recovery gap"
    This is by design as backpressure, not a bug: on overflow, we wait for the daemon to advance physical time to guarantee monotonicity and persistence semantics. Reset and persistence are not performed on the hot getTSO() path.

  4. unprotectedCommitTransaction* signature change
    This is an explicit upgrade to failure semantics (abort commit if TSO acquisition fails). All in‑tree call chains have been aligned, and I have not found any overridden methods that are not adapted.

  5. calibrateTimestamp() requires enable_tso_persist_journal=true
    This is an intentional fail‑fast constraint to prevent unsafe execution such as “TSO enabled without persistence”. The default value being false is part of a gradual rollout strategy, not a defect.

  6. _async_publish_tasks uses tuple
    This minimizes code changes, and relevant comments have already been added in the code.

  7. Rowset::make_visible has no default parameter
    All in‑tree call sites now explicitly pass commit_tso, so there is no issue causing existing code to fail at compile or runtime.

  8. /api/tso missing authentication check
    The semantics of executeCheckPassword() are to throw an exception directly on failure, rather than relying on return value checking. Many other REST interfaces in the codebase follow the same pattern.

  9. TSOClockBackwardException causes daemon thread exit
    This will not happen. Daemon.run() catches Throwable at the top level, so the thread proceeds to the next round and is not permanently terminated by this exception.

  10. Mixed use of ReentrantLock and Atomic*
    This is intentional layering: high‑frequency read‑only states use atomic variables, while compound state such as globalTimestamp uses locks. The thread‑safety model is clear and correct.

  11. Spaces in config description strings
    This has been fixed.

  12. FeMetaVersion bump mentioned but not present in code
    During earlier code review, it was determined that the FeMetaVersion change was unnecessary. That part of the code has been removed, and the description updated accordingly.

  13. Regression tests “silently pass” when TSO is disabled
    Both regression tests explicitly run ADMIN SET FRONTEND CONFIG at the beginning to enable experimental_enable_tso_feature and enable_tso_persist_journal. They do not run and pass vacuously with TSO disabled.

@AntiTopQuark
Copy link
Copy Markdown
Author

Hi @morningman @gavinchou @morrySnow @dataroaring , Please review this PR again, thank you very much~

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 31, 2026
@morningman
Copy link
Copy Markdown
Contributor

run buildall

@doris-robot
Copy link
Copy Markdown

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.48% (1798/2291)
Line Coverage 64.16% (32272/50296)
Region Coverage 65.01% (16173/24878)
Branch Coverage 55.51% (8624/15536)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 58.53% (271/463) 🎉
Increment coverage report
Complete coverage report

@doris-robot
Copy link
Copy Markdown

TPC-H: Total hot run time: 26880 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9a463a79f087084c9f05555425a86d176a1bf4b4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17598	4459	4302	4302
q2	q3	10645	791	555	555
q4	4668	363	256	256
q5	7572	1201	1001	1001
q6	180	173	145	145
q7	785	839	672	672
q8	10038	1516	1360	1360
q9	5552	4986	4739	4739
q10	6369	1977	1643	1643
q11	473	251	240	240
q12	771	576	470	470
q13	18072	2725	1928	1928
q14	229	244	210	210
q15	q16	747	723	667	667
q17	732	823	459	459
q18	5902	5449	5220	5220
q19	1105	973	627	627
q20	524	484	372	372
q21	4519	1875	1726	1726
q22	467	317	288	288
Total cold run time: 96948 ms
Total hot run time: 26880 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4716	4621	4579	4579
q2	q3	3955	4301	3849	3849
q4	881	1190	773	773
q5	4105	4413	4371	4371
q6	188	178	143	143
q7	1772	1654	1562	1562
q8	2532	2682	2618	2618
q9	7845	7379	7458	7379
q10	3759	4088	3703	3703
q11	509	427	403	403
q12	488	606	462	462
q13	2469	2875	2068	2068
q14	281	296	279	279
q15	q16	708	781	711	711
q17	1174	1322	1341	1322
q18	7478	6965	6795	6795
q19	931	906	897	897
q20	2081	2159	1998	1998
q21	4067	3532	3376	3376
q22	505	433	411	411
Total cold run time: 50444 ms
Total hot run time: 47699 ms

@doris-robot
Copy link
Copy Markdown

TPC-DS: Total hot run time: 168366 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9a463a79f087084c9f05555425a86d176a1bf4b4, data reload: false

query5	4337	647	508	508
query6	328	225	202	202
query7	4202	474	264	264
query8	336	238	226	226
query9	8717	2741	2702	2702
query10	512	372	321	321
query11	7042	5068	4884	4884
query12	180	129	122	122
query13	1286	446	354	354
query14	5790	3728	3435	3435
query14_1	2790	2773	2788	2773
query15	197	192	172	172
query16	969	480	454	454
query17	989	696	606	606
query18	2439	439	343	343
query19	214	207	180	180
query20	134	122	125	122
query21	210	140	105	105
query22	13197	13414	13048	13048
query23	16189	15838	16485	15838
query23_1	16305	16123	16151	16123
query24	7641	1700	1289	1289
query24_1	1297	1244	1216	1216
query25	547	463	412	412
query26	1247	263	146	146
query27	2800	475	293	293
query28	4552	1878	1832	1832
query29	841	570	488	488
query30	299	223	184	184
query31	1008	940	860	860
query32	84	72	74	72
query33	517	339	293	293
query34	921	867	526	526
query35	622	676	595	595
query36	1115	1146	1003	1003
query37	136	95	80	80
query38	2986	2915	2892	2892
query39	872	848	825	825
query39_1	798	791	807	791
query40	234	153	133	133
query41	62	57	59	57
query42	254	254	257	254
query43	230	249	224	224
query44	
query45	194	186	178	178
query46	944	1010	611	611
query47	2153	2170	2101	2101
query48	303	322	229	229
query49	639	453	380	380
query50	691	287	211	211
query51	4179	4158	4036	4036
query52	257	267	254	254
query53	301	342	281	281
query54	293	296	264	264
query55	91	90	89	89
query56	315	320	310	310
query57	1912	1670	1731	1670
query58	277	271	265	265
query59	2818	2955	2732	2732
query60	339	334	318	318
query61	159	156	159	156
query62	660	595	535	535
query63	306	275	275	275
query64	5080	1302	1034	1034
query65	
query66	1464	457	370	370
query67	24228	24265	24194	24194
query68	
query69	427	329	307	307
query70	981	983	952	952
query71	344	314	298	298
query72	3006	2864	2627	2627
query73	561	545	322	322
query74	9584	9622	9455	9455
query75	2870	2778	2483	2483
query76	2294	1043	685	685
query77	375	405	317	317
query78	10990	11125	10489	10489
query79	1141	772	584	584
query80	1546	664	578	578
query81	542	264	225	225
query82	1345	158	123	123
query83	355	294	247	247
query84	285	122	104	104
query85	1196	503	450	450
query86	436	312	288	288
query87	3139	3189	2948	2948
query88	3538	2678	2690	2678
query89	415	365	347	347
query90	1814	175	179	175
query91	170	175	142	142
query92	78	74	69	69
query93	927	857	502	502
query94	594	325	299	299
query95	603	331	385	331
query96	639	531	239	239
query97	2496	2491	2406	2406
query98	236	220	228	220
query99	1029	988	910	910
Total cold run time: 250940 ms
Total hot run time: 168366 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.94% (19996/37772)
Line Coverage 36.47% (187659/514499)
Region Coverage 32.72% (145458/444549)
Branch Coverage 33.91% (63828/188250)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.53% (27200/36991)
Line Coverage 57.08% (292803/512955)
Region Coverage 54.17% (243064/448667)
Branch Coverage 55.99% (105727/188816)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add a global monotonically increasing timestamp service (TSO) for incremental computation in Doris

9 participants