Skip to content

tests: fix next gen integration TiDB startup flakiness#4278

Open
wlwilliamx wants to merge 2 commits intopingcap:masterfrom
wlwilliamx:fix/next-gen-start-tidb-cluster-fail
Open

tests: fix next gen integration TiDB startup flakiness#4278
wlwilliamx wants to merge 2 commits intopingcap:masterfrom
wlwilliamx:fix/next-gen-start-tidb-cluster-fail

Conversation

@wlwilliamx
Copy link
Copy Markdown
Collaborator

@wlwilliamx wlwilliamx commented Feb 26, 2026

What problem does this PR solve?

Issue Number: close #4277

What is changed and how it works?

  • Fix next-gen TiDB config generation in integration test utils:
    • Keep socket and max-server-connections at root level.
    • Keep run-auto-analyze / server-memory-quota under [performance] so TiDB won't treat them as invalid options.
  • Ensure stop_tidb_cluster kills leaked cdc_pulsar_consumer to avoid accumulating background processes across cases/groups.

Check List

Tests

  • Integration test
  • Manual test
bash -n tests/integration_tests/_utils/start_tidb_cluster_nextgen
bash -n tests/integration_tests/_utils/start_tls_tidb_cluster_impl_nextgen
bash -n tests/integration_tests/_utils/stop_tidb_cluster

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

None

Summary by CodeRabbit

  • Tests

    • Improved cluster configuration generation and validation to avoid invalid options and ensure tuning keys are correctly placed.
    • Enhanced TLS/security settings for test deployments, including consistent TLS credential fields and root-level connection limits.
    • Minor test formatting adjustments with no behavioral change.
  • Chores

    • Refined cleanup to terminate test-run processes and prevent resource leaks across cases.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2026
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses flakiness in next-gen TiDB integration tests by rectifying issues in how TiDB configuration files are generated and by improving test environment cleanup. The changes ensure that TiDB starts reliably by adhering to its configuration parsing rules and prevent background processes from accumulating, leading to more stable and predictable test runs.

Highlights

  • TiDB Configuration Fixes: Corrected TiDB configuration generation in integration tests to properly place socket and max-server-connections at the root level, and relocated performance-related configurations (run-auto-analyze, server-memory-quota) to the [performance] section to ensure they are correctly parsed.
  • Test Environment Cleanup: Enhanced the stop_tidb_cluster utility to terminate cdc_pulsar_consumer processes, preventing resource leaks and improving test stability.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • tests/integration_tests/_utils/start_tidb_cluster_nextgen
    • Updated the randomGenSocketsConf function to prepend the socket configuration at the root level of the TiDB config file and remove any existing socket entries.
    • Added a comment explaining the section-sensitive nature of TiDB config keys.
    • Moved max-server-connections from the [instance] section to the root level.
    • Moved run-auto-analyze and server-memory-quota from the root level to the [performance] section.
  • tests/integration_tests/_utils/start_tls_tidb_cluster_impl_nextgen
    • Added a comment explaining the section-sensitive nature of TiDB config keys.
    • Moved max-server-connections from the [instance] section to the root level.
    • Moved run-auto-analyze and server-memory-quota from the root level to the [performance] section.
  • tests/integration_tests/_utils/stop_tidb_cluster
    • Added cdc_pulsar_consumer to the list of processes to be killed during cluster shutdown.
Activity
  • No specific activity (comments, reviews, or progress updates) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

Fixes TiDB/TiKV TOML generation to place section-sensitive keys at correct root/section levels, expands TLS/security fields, prepends root-level socket lines, and adds termination of background cdc_pulsar_consumer to stop process leaks in integration test teardown.

Changes

Cohort / File(s) Summary
TiDB next-gen config generation
tests/integration_tests/_utils/start_tidb_cluster_nextgen
Prepend root-level socket = ... instead of appending; remove stray socket/instance.* lines via temp-file rewrite. Move tuning keys (e.g., max-server-connections, run-auto-analyze, server-memory-quota, analyze-always-skip-wide-columns) to correct root or [performance] sections and add explanatory comments.
TLS TiDB config generation
tests/integration_tests/_utils/start_tls_tidb_cluster_impl_nextgen
Mirror the non-TLS config fixes for TLS deployments: add root max-server-connections = 0, move tuning knobs under [performance], remove misplaced [instance] entries, and add TLS [security] fields (ssl-ca, ssl-cert, ssl-key, cluster-ssl-ca, cluster-ssl-cert, cluster-ssl-key) in system and per-keyspace TOMLs.
Cluster shutdown cleanup & misc tests
tests/integration_tests/_utils/stop_tidb_cluster, pkg/sink/sqlmodel/row_change_test.go, tools/workload/schema/bank3/bank.go
Add pkill for cdc_pulsar_consumer to teardown to prevent leaked consumers; minor test formatting change in row_change_test.go; trivial partition-line tweak in bank.go.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested reviewers

  • tenfyzhong
  • wk989898
  • asddongmen

Poem

🐰 I nibbled configs, sorted each line,
Root keys hop first, sections align,
I chased stray sockets out of the den,
Killed ghosts of consumers, so tests run again,
TLS snug and tidy — a rabbit's small win.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ❓ Inconclusive The minor formatting change in bank3 workload test is unrelated to #4277 objectives, though it has minimal impact. All other changes are directly scoped to fixing the identified integration test issues. Clarify whether the indentation change in pkg/sink/sqlmodel/row_change_test.go and tools/workload/schema/bank3/bank.go are intentional or accidental formatting fixes unrelated to the main PR objective.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main objective: fixing TiDB startup flakiness in next-generation integration tests.
Description check ✅ Passed The description follows the template structure, includes the linked issue number, explains what was changed and how it works, specifies test types, and provides a release note.
Linked Issues check ✅ Passed All code changes directly address the two root causes documented in #4277: fixing TiDB config key placement and ensuring cdc_pulsar_consumer process cleanup.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses flakiness in next-generation integration tests by correcting TiDB configuration generation and improving the test cleanup process. The changes relocate TiDB configuration options to their proper sections within the TOML files, and fix the socket configuration generation to ensure it's a root-level setting. Furthermore, the stop_tidb_cluster script is updated to kill leftover cdc_pulsar_consumer processes, preventing resource leaks between test executions. I have one minor suggestion to enhance the robustness of temporary file creation in a shell script.

Comment on lines +32 to +33
} >"$config_file.tmp"
mv "$config_file.tmp" "$config_file"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a fixed temporary filename like .tmp can be problematic if multiple instances of this script run concurrently. To make the temporary filename unique and avoid potential race conditions, consider appending the process ID ($$). A more robust solution would be to use mktemp.

	} >"$config_file.tmp.$$"
	mv "$config_file.tmp.$$" "$config_file"

@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test pull-cdc-pulsar-light-integration-test-next-gen

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Feb 26, 2026

@wlwilliamx: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-build
/test pull-cdc-kafka-integration-heavy
/test pull-cdc-kafka-integration-light
/test pull-cdc-mysql-integration-heavy
/test pull-cdc-mysql-integration-light
/test pull-cdc-storage-integration-heavy
/test pull-cdc-storage-integration-light
/test pull-check
/test pull-error-log-review
/test pull-unit-test

The following commands are available to trigger optional jobs:

/test pull-build-next-gen
/test pull-cdc-kafka-integration-heavy-next-gen
/test pull-cdc-kafka-integration-light-next-gen
/test pull-cdc-mysql-integration-heavy-next-gen
/test pull-cdc-mysql-integration-light-next-gen
/test pull-cdc-pulsar-integration-heavy
/test pull-cdc-pulsar-integration-heavy-next-gen
/test pull-cdc-pulsar-integration-light
/test pull-cdc-pulsar-integration-light-next-gen
/test pull-cdc-storage-integration-heavy-next-gen
/test pull-cdc-storage-integration-light-next-gen
/test pull-unit-test-next-gen

Use /test all to run the following jobs that were automatically triggered:

pull-error-log-review
Details

In response to this:

/test pull-cdc-pulsar-light-integration-test-next-gen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test pull-cdc-pulsar-integration-light-next-gen

@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test pull-cdc-pulsar-integration-heavy-next-gen

@tenfyzhong
Copy link
Copy Markdown
Collaborator

/test next-gen

3 similar comments
@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test next-gen

@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test next-gen

@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test next-gen

@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/retest

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 11, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenfyzhong, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [tenfyzhong,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 12, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Mar 12, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-11 11:53:26.568968711 +0000 UTC m=+437438.081026392: ☑️ agreed by tenfyzhong.
  • 2026-03-12 09:29:38.939851731 +0000 UTC m=+515210.451909502: ☑️ agreed by wk989898.

…t-tidb-cluster-fail

# Conflicts:
#	tests/integration_tests/_utils/start_tidb_cluster_nextgen
@wlwilliamx
Copy link
Copy Markdown
Collaborator Author

/test all

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/integration_tests/_utils/start_tidb_cluster_nextgen (1)

365-471: Consider extracting common TiDB config generation into a helper function.

The four TiDB config blocks share significant structure (root-level keys, [log.file], [performance], [security]). A helper function accepting keyspace-name, tikv-worker-url, and service-scope as parameters could reduce duplication and ensure consistency. This is entirely optional given the current clarity.

♻️ Example helper function approach
# Generate TiDB config with consistent structure
gen_tidb_config() {
    local out_file="$1"
    local keyspace="$2"
    local tikv_worker_url="${3:-}"
    local service_scope="${4:-}"
    
    cat >"$out_file" <<EOF
keyspace-name = "$keyspace"
enable-telemetry = false
mem-quota-query = 671088640
max-server-connections = 0
${tikv_worker_url:+tikv-worker-url = "$tikv_worker_url"}

[log.file]
max-backups = 100

[performance]
tcp-keep-alive = true
run-auto-analyze = false
server-memory-quota = 2684354560

[security]
enable-sem = false
${service_scope:+
[instance]
tidb_service_scope = '$service_scope'}
EOF
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration_tests/_utils/start_tidb_cluster_nextgen` around lines 365 -
471, Multiple nearly identical TiDB config blocks (creating upstream/downstream
and system/keyspace files) should be consolidated into a helper to avoid
duplication; add a function (e.g., gen_tidb_config) that accepts parameters for
out_file, keyspace-name, optional tikv-worker-url and optional service-scope and
use it to replace the repeated cat >>"$OUT_DIR/…". Ensure the helper preserves
existing behavior: append existing "$tidb_config" when present, create the
target TEST_DATA_DIR directory, and conditionally include the tikv-worker-url
line (use UP_TIKV_WORKER_HOST/UP_TIKV_WORKER_PORT or
DOWN_TIKV_WORKER_HOST/DOWN_TIKV_WORKER_PORT) and the
[instance]/tidb_service_scope entry when needed; then call this helper four
times for upstream-system, upstream-keyspace, downstream-system,
downstream-keyspace (and keep the cp of the other toml file).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/integration_tests/_utils/start_tidb_cluster_nextgen`:
- Around line 365-471: Multiple nearly identical TiDB config blocks (creating
upstream/downstream and system/keyspace files) should be consolidated into a
helper to avoid duplication; add a function (e.g., gen_tidb_config) that accepts
parameters for out_file, keyspace-name, optional tikv-worker-url and optional
service-scope and use it to replace the repeated cat >>"$OUT_DIR/…". Ensure the
helper preserves existing behavior: append existing "$tidb_config" when present,
create the target TEST_DATA_DIR directory, and conditionally include the
tikv-worker-url line (use UP_TIKV_WORKER_HOST/UP_TIKV_WORKER_PORT or
DOWN_TIKV_WORKER_HOST/DOWN_TIKV_WORKER_PORT) and the
[instance]/tidb_service_scope entry when needed; then call this helper four
times for upstream-system, upstream-keyspace, downstream-system,
downstream-keyspace (and keep the cp of the other toml file).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8184c3dc-80ac-4cd7-add7-45abe3a6f77d

📥 Commits

Reviewing files that changed from the base of the PR and between 76bb15b and a1ccaa2.

📒 Files selected for processing (3)
  • pkg/sink/sqlmodel/row_change_test.go
  • tests/integration_tests/_utils/start_tidb_cluster_nextgen
  • tools/workload/schema/bank3/bank.go
✅ Files skipped from review due to trivial changes (1)
  • tools/workload/schema/bank3/bank.go

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Mar 12, 2026

@wlwilliamx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cdc-storage-integration-heavy-next-gen 76bb15b link false /test pull-cdc-storage-integration-heavy-next-gen
pull-cdc-mysql-integration-heavy a1ccaa2 link true /test pull-cdc-mysql-integration-heavy

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tests/integration: pulsar next gen jobs fail to start TiDB (process leak + invalid TiDB config)

3 participants