consumer: dedupe replayed DML by applied watermark for storage consumer #4328

Merged
ti-chi-bot[bot] merged 3 commits into master from wk989898-patch-1
Mar 3, 2026

@wk989898
Collaborator

@wk989898 wk989898 commented Mar 2, 2026

What problem does this PR solve?

Issue Number: close #4327, ref #4051

What is changed and how it works?

The storage consumer also needs the fix: dedupe replayed DML by the applied watermark.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes

    • Avoid replaying stale schema or data operations by tracking and honoring per-table version watermarks, reducing duplicate or out-of-order effects.
    • Stop further processing after successful schema application to prevent redundant handling.
  • Chores

    • Enhanced operational logs to include per-table watermark and version context for clearer troubleshooting.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 2, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue in the storage consumer where the order of processing Data Definition Language (DDL) and Data Manipulation Language (DML) events was incorrect. The changes introduce a revised sorting mechanism to ensure that for any given table version, all DML operations are processed and applied before any corresponding DDL operations, thereby maintaining data integrity and consistency during replication.

Highlights

  • Corrected DDL/DML Ordering: The processing order of DDL (Data Definition Language) and DML (Data Manipulation Language) events in the storage consumer has been corrected. For the same table version, DML operations will now be consumed before DDL operations.
  • Refactored Sorting Logic: The complex sorting logic for DmlPathKey has been extracted from an anonymous function within handleNewFiles into a new, dedicated helper function named dmlPathKeyLess.
  • Improved Code Clarity: Internal comments related to the DDL/DML ordering and the purpose of the fake DML key for schema files have been updated to accurately reflect the new processing strategy.

Changelog
  • cmd/storage-consumer/consumer.go
    • Updated comments to clarify the intended DDL/DML processing order, specifically noting that DMLs should precede DDLs for the same table version.
    • Refactored the inline sorting logic for DmlPathKey into a new, standalone function dmlPathKeyLess.
    • Implemented the dmlPathKeyLess function to correctly sort DmlPathKey instances, ensuring DML files are ordered before schema (DDL) files for identical table versions.
    • Adjusted the call to sort.Slice in handleNewFiles to utilize the newly introduced dmlPathKeyLess function.
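
The ordering this summary describes (from the revision it reviewed) can be sketched as a comparison function. This is a minimal sketch, not the PR's code: `pathKey`, `isSchemaKey`, and `pathKeyLess` are hypothetical stand-ins for `DmlPathKey` and `dmlPathKeyLess`, and the tie-breaking fields after the DML/DDL rule are an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// pathKey is a simplified, hypothetical stand-in for the consumer's DmlPathKey.
type pathKey struct {
	Schema, Table string
	TableVersion  uint64
	PartitionNum  int64
	Date          string
}

// schemaFilePartition marks the fake key that represents a schema (DDL) file.
const schemaFilePartition = -1

// isSchemaKey reports whether k is the fake key used for a schema file.
func isSchemaKey(k pathKey) bool {
	return k.PartitionNum == schemaFilePartition && k.Date == ""
}

// pathKeyLess orders keys by table version; within one version,
// DML keys sort before the schema (DDL) key.
func pathKeyLess(a, b pathKey) bool {
	if a.TableVersion != b.TableVersion {
		return a.TableVersion < b.TableVersion
	}
	if isSchemaKey(a) != isSchemaKey(b) {
		return !isSchemaKey(a) // the DML key goes first
	}
	// Assumed tie-breakers, only to make the order total.
	if a.PartitionNum != b.PartitionNum {
		return a.PartitionNum < b.PartitionNum
	}
	if a.Date != b.Date {
		return a.Date < b.Date
	}
	if a.Schema != b.Schema {
		return a.Schema < b.Schema
	}
	return a.Table < b.Table
}

func main() {
	keys := []pathKey{
		{Schema: "test", Table: "t", TableVersion: 2, PartitionNum: schemaFilePartition},
		{Schema: "test", Table: "t", TableVersion: 2, PartitionNum: 0, Date: "2026-03-02"},
		{Schema: "test", Table: "t", TableVersion: 1, PartitionNum: schemaFilePartition},
	}
	sort.Slice(keys, func(i, j int) bool { return pathKeyLess(keys[i], keys[j]) })
	for _, k := range keys {
		fmt.Printf("version=%d schemaFile=%v\n", k.TableVersion, isSchemaKey(k))
	}
}
```

Keeping the comparison in a named function, as the changelog notes, makes it possible to unit test the ordering without going through `handleNewFiles`.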

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 2, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 2, 2026

Walkthrough

Added table-level DDL watermark tracking and DDL replay guards to the storage consumer: record max executed DDL version per table, skip stale DDL/DML based on that watermark, update watermark after successful DDL execution, and add related logging and control-flow guards.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Storage consumer DDL/watermark changes: cmd/storage-consumer/consumer.go | Introduce tableDDLWatermark state and initialize it in newConsumer. In handleNewFiles, skip stale schema.json (DDL) and stale DML when key.TableVersion <= ddlWatermark, log warnings with schema/table/version and watermark, update watermark on successful DDL execution, and early-continue after DDL handling. Added log message enhancements including watermark context. |
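
The watermark rule summarized above can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the merged code: `tableKey`, `watermarks`, `isStale`, and `advance` are hypothetical names, and the `<=` comparison follows the walkthrough's wording.

```go
package main

import "fmt"

// tableKey identifies a table; a hypothetical stand-in for the consumer's map key.
type tableKey struct {
	Schema, Table string
}

// watermarks tracks the highest DDL table version executed per table.
type watermarks struct {
	applied map[tableKey]uint64
}

func newWatermarks() *watermarks {
	return &watermarks{applied: make(map[tableKey]uint64)}
}

// isStale reports whether a file at tableVersion is at or below the table's
// watermark. An unseen table has watermark 0, so any positive version is fresh.
func (w *watermarks) isStale(k tableKey, tableVersion uint64) bool {
	return tableVersion <= w.applied[k]
}

// advance records a successfully executed DDL. It never moves the watermark back.
func (w *watermarks) advance(k tableKey, tableVersion uint64) {
	if tableVersion > w.applied[k] {
		w.applied[k] = tableVersion
	}
}

func main() {
	w := newWatermarks()
	t := tableKey{Schema: "test", Table: "t"}
	// Versions 100 and 200 arrive, are replayed, and then 300 arrives.
	for _, v := range []uint64{100, 200, 100, 200, 300} {
		if w.isStale(t, v) {
			fmt.Printf("skip replayed version %d (watermark %d)\n", v, w.applied[t])
			continue
		}
		// A real consumer would execute the DDL here and advance only on success.
		w.advance(t, v)
		fmt.Printf("applied version %d\n", v)
	}
}
```

Advancing the watermark only after the DDL succeeds means a failed DDL is retried rather than silently skipped on the next pass.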

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

size/L

Suggested reviewers

  • 3AceShowHand
  • wlwilliamx
  • hongyunyan

Poem

🐰
I hop through schemas, watermark in paw,
Skipping stale echoes I once saw,
When DDLs land and versions climb,
I mark the table, one at a time,
Hooray — the stream stays neat and raw!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete. It provides linked issue numbers but lacks detailed explanation of what changed, how it works, test coverage details, and answers to required questions. | Expand the description to explain the watermark deduplication mechanism, specify which tests were added/modified, and provide explicit answers to the performance/compatibility and documentation questions. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: deduplicating replayed DML using an applied watermark mechanism in the storage consumer. |
| Linked Issues check | ✅ Passed | The code changes implement DDL/DML replay deduplication via watermark tracking in the storage consumer, which directly addresses the instability in issue #4327. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on storage consumer DDL/DML handling and watermark tracking, directly aligned with fixing the test instability issue. |

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request corrects the processing order of DDL and DML events in the storage consumer, ensuring that for the same table version, DML events are processed before DDL events. This is achieved by refactoring the sorting logic into a new dmlPathKeyLess function with the corrected comparison logic. The change is clear and addresses the issue. My main feedback is to add unit tests for the new critical sorting function to ensure its correctness and prevent future regressions.

Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/storage-consumer/consumer.go (1)

547-547: ⚠️ Potential issue | 🔴 Critical

Handle DML flush errors before continuing to DDL.

Line 547 drops the flushDMLEvents error. If flush fails, processing can continue and execute later DDL, breaking ordering guarantees under failure paths.

Suggested fix
-		c.flushDMLEvents(ctx, tableID)
+		if err := c.flushDMLEvents(ctx, tableID); err != nil {
+			return errors.Trace(err)
+		}
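
Why the dropped error matters can be shown with a small stand-alone sketch. Everything here is hypothetical (`fakeSink`, `applyDDL`, and their methods are not the consumer's API), and it wraps the error with the standard library's `fmt.Errorf` rather than the `errors.Trace` call used in the suggested fix.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSink is a hypothetical sink that records which steps ran.
type fakeSink struct {
	failFlush bool
	executed  []string
}

// flushDML pretends to flush buffered DML for a table.
func (s *fakeSink) flushDML(tableID int64) error {
	if s.failFlush {
		return errors.New("flush failed")
	}
	s.executed = append(s.executed, "flush")
	return nil
}

// execDDL pretends to execute a DDL statement.
func (s *fakeSink) execDDL(query string) error {
	s.executed = append(s.executed, "ddl")
	return nil
}

// applyDDL flushes pending DML first and runs the DDL only if the flush succeeded.
func applyDDL(s *fakeSink, tableID int64, query string) error {
	if err := s.flushDML(tableID); err != nil {
		return fmt.Errorf("flush DML for table %d: %w", tableID, err)
	}
	return s.execDDL(query)
}

func main() {
	s := &fakeSink{failFlush: true}
	err := applyDDL(s, 1, "ALTER TABLE t ADD COLUMN c INT")
	fmt.Println("error:", err)
	fmt.Println("steps executed:", s.executed)
}
```

With the early return, a failed flush leaves the DDL unexecuted, so the DML-before-DDL ordering holds on the failure path as well.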
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/storage-consumer/consumer.go` at line 547, The call to
c.flushDMLEvents(ctx, tableID) currently ignores any returned error which can
allow subsequent DDL to run and break ordering; update the call site to capture
the error (err := c.flushDMLEvents(ctx, tableID)), check it, and if non-nil
log/return/propagate the error (e.g., return fmt.Errorf or propagate the
existing function's error) so processing stops and DDL is not executed when DML
flush fails, preserving ordering guarantees; ensure you reference
c.flushDMLEvents and the enclosing function's error flow when adding the
early-return or error propagation.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1fdee2 and bfbfee6.

📒 Files selected for processing (1)
  • cmd/storage-consumer/consumer.go

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 2, 2026
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 2, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, wlwilliamx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,wlwilliamx]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot bot commented Mar 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-02 09:51:19.246782717 +0000 UTC m=+178923.824861910: ☑️ agreed by 3AceShowHand.
  • 2026-03-02 12:15:50.736433743 +0000 UTC m=+187595.314512947: ☑️ agreed by wlwilliamx.

@wk989898
Collaborator Author

wk989898 commented Mar 2, 2026

/retest

@wk989898 wk989898 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 2, 2026
@3AceShowHand
Collaborator

/retest

@ti-chi-bot ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 3, 2026
@wk989898 wk989898 changed the title from "consumer: correct the order of ddl and dml in storage consumer" to "consumer: dedupe replayed DML by applied watermark for storage consumer" Mar 3, 2026
@3AceShowHand
Collaborator

/test all

@wk989898 wk989898 removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2026
@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 3, 2026
@wk989898 wk989898 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2026
@wk989898
Collaborator Author

wk989898 commented Mar 3, 2026

/test storage

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
cmd/storage-consumer/consumer.go (1)

541-564: Explicitly skip schema sentinel keys with empty query.

When PartitionNum == -1 and Date == "" but tableDef.Query is empty, this currently falls through to DML handling (no-op today). Add an early continue to avoid unnecessary fake table-id allocation and processing.

♻️ Proposed refactor
-		if key.PartitionNum == fakePartitionNumForSchemaFile && len(key.Date) == 0 && len(tableDef.Query) > 0 {
+		if key.PartitionNum == fakePartitionNumForSchemaFile && len(key.Date) == 0 {
+			if len(tableDef.Query) == 0 {
+				continue
+			}
 			if key.TableVersion <= ddlWatermark {
 				log.Warn("DDL event replayed with stale table version, ignore it",
 					zap.String("schema", key.Schema), zap.String("table", key.Table),
 					zap.Uint64("tableVersion", key.TableVersion), zap.Uint64("ddlWatermark", ddlWatermark),
 					zap.String("query", tableDef.Query))
 				continue
 			}
@@
 			log.Info("execute ddl event successfully",
 				zap.String("query", tableDef.Query),
 				zap.String("schema", key.Schema), zap.String("table", key.Table),
 				zap.Uint64("ddlWatermark", c.tableDDLWatermark[tableKey]))
 			continue
 		}
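
The branch order in the proposed refactor can be restated as a pure decision function. This is a minimal sketch of the proposal, not of the merged code: `classify` and the `action` values are hypothetical, while `fakePartitionNumForSchemaFile` and the `-1` value come from the review comment.

```go
package main

import "fmt"

// fakePartitionNumForSchemaFile marks the sentinel key for a schema file.
const fakePartitionNumForSchemaFile = -1

// action names the branch taken for one key; hypothetical, for illustration only.
type action string

const (
	actionSkipEmpty action = "skip: sentinel key with empty query"
	actionSkipStale action = "skip: stale DDL"
	actionRunDDL    action = "execute DDL"
	actionDML       action = "handle as DML"
)

// classify mirrors the proposed branch order: detect the sentinel key first,
// then skip empty queries, then apply the watermark guard.
func classify(partitionNum int64, date, query string, tableVersion, ddlWatermark uint64) action {
	if partitionNum == fakePartitionNumForSchemaFile && len(date) == 0 {
		if len(query) == 0 {
			return actionSkipEmpty
		}
		if tableVersion <= ddlWatermark {
			return actionSkipStale
		}
		return actionRunDDL
	}
	return actionDML
}

func main() {
	fmt.Println(classify(-1, "", "", 10, 0))
	fmt.Println(classify(-1, "", "ALTER TABLE t ADD COLUMN c INT", 10, 10))
	fmt.Println(classify(-1, "", "ALTER TABLE t ADD COLUMN c INT", 11, 10))
	fmt.Println(classify(0, "2026-03-02", "", 11, 10))
}
```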
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/storage-consumer/consumer.go` around lines 541 - 564, The code path
handling schema sentinel keys doesn't skip cases where key.PartitionNum ==
fakePartitionNumForSchemaFile and key.Date == "" but tableDef.Query is empty,
causing unnecessary DML handling and fake table-id allocation; update the
conditional in the loop that checks key.PartitionNum, key.Date and
tableDef.Query (the block that creates ddlEvent via tableDef.ToDDLEvent, calls
c.sink.WriteBlockEvent, and sets c.tableDDLWatermark[tableKey]) to add an early
continue when tableDef.Query is empty (i.e., move on to the next loop
iteration if tableDef.Query == ""), so sentinel keys with empty queries are
skipped before any fake table-id or DML processing occurs.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79d584e and 35a42a9.

📒 Files selected for processing (1)
  • cmd/storage-consumer/consumer.go

@wk989898 wk989898 removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2026
@wk989898
Collaborator Author

wk989898 commented Mar 3, 2026

/retest

1 similar comment
@3AceShowHand
Collaborator

/retest

@ti-chi-bot ti-chi-bot bot merged commit 4f8911d into master Mar 3, 2026
26 checks passed
@ti-chi-bot ti-chi-bot bot deleted the wk989898-patch-1 branch March 3, 2026 12:17
tenfyzhong pushed a commit that referenced this pull request Mar 18, 2026
…er (#4328)

ref #4051, close #4327

(cherry picked from commit 4f8911d)
Signed-off-by: tenfyzhong <tenfy@tenfy.cn>

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

unstable storage test ddl_for_split_tables_with_random_move_table

3 participants