Skip to content

ci(prow): add tiflash sanitizer jobs and replay automation#4390

Merged
ti-chi-bot[bot] merged 4 commits intoPingCAP-QE:mainfrom
lybcodes:codex/split-tiflash-and-replay
Apr 1, 2026
Merged

ci(prow): add tiflash sanitizer jobs and replay automation#4390
ti-chi-bot[bot] merged 4 commits intoPingCAP-QE:mainfrom
lybcodes:codex/split-tiflash-and-replay

Conversation

@lybcodes
Copy link
Copy Markdown
Contributor

@lybcodes lybcodes commented Mar 31, 2026

Summary

  • add native Prow presubmits for pingcap/tiflash sanitizer workflows:
    • pull-sanitizer-asan
    • pull-sanitizer-tsan
  • add local replay automation for native Prow jobs
  • wire replay automation presubmit in pingcap-qe/ci

Scope

  • .ci/replay-prow-job.sh
  • prow-jobs/pingcap-qe/ci/presubmits.yaml
  • prow-jobs/pingcap/tiflash/latest-presubmits.yaml

Issue Link

Copy link
Copy Markdown

@ti-chi-bot ti-chi-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary

This PR introduces a new script (replay-prow-job.sh) to replay Prow jobs locally in different modes (pod, prowjob, and auto), facilitating easier testing and debugging. It also adds new presubmit jobs for sanitizer testing (ASan and TSan) with dedicated configurations. The approach is modular, leveraging YAML anchors for shared configurations, and includes extensive documentation within the script. The implementation is thorough, but there are areas for improvement in error handling, modularity, and edge case coverage.

Critical Issues

  • Potential Command Injection in run_with_sanitizer_runtime
    File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    Lines: 91-92
    The function run_with_sanitizer_runtime passes its arguments directly to setarch without sanitizing them. If the input to this function is derived from external sources, it could lead to command injection vulnerabilities.
    Solution: Sanitize inputs or restrict run_with_sanitizer_runtime to trusted arguments.
    run_with_sanitizer_runtime() {
        local args=("$@")
        for arg in "${args[@]}"; do
            if [[ "$arg" =~ [^a-zA-Z0-9_\-/] ]]; then
                echo "Invalid argument: $arg" >&2
                exit 1
            fi
        done
        setarch "${arch_name}" -R "${args[@]}"
    }

Code Improvements

  1. Error Handling in run_with_sanitizer_runtime
    File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    Lines: 91-92
    The fallback mechanism for setarch failure does not log the reason for the failure clearly.
    Solution: Add detailed error messages.

    if ! setarch "${arch_name}" -R "$@"; then
        echo "Error: setarch command failed for architecture ${arch_name}" >&2
        "$@"
    fi
  2. Excessive Argument Parsing in replay-prow-job.sh
    File: .ci/replay-prow-job.sh
    Lines: 1-1607
    The script performs extensive argument parsing, making it harder to maintain. Consider modularizing argument parsing logic into separate functions.
    Solution: Divide argument parsing into smaller, reusable functions, e.g., parse_mode_args, parse_common_args.

  3. Hardcoded Timeouts
    File: .ci/replay-prow-job.sh
    Lines: 1345 (--run-timeout 3600)
    The timeout values are hardcoded in several places. These should ideally be configurable via environment variables or centralized configuration.
    Solution: Use a default timeout value and allow overrides via CLI or environment variables.

    DEFAULT_TIMEOUT=3600
    RUN_TIMEOUT="${RUN_TIMEOUT:-$DEFAULT_TIMEOUT}"

Best Practices

  1. Missing Unit Tests for the Script
    File: .ci/replay-prow-job.sh
    The script is complex but lacks any test coverage. Add unit tests for critical functions like collect_changed_files, select_jobs_for_file, etc.
    Solution: Create a test suite using bash-test or migrate critical logic to a scripting language with better testing support (e.g., Python).

  2. Inconsistent Naming Conventions
    File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    Lines: 38-139
    The naming convention for job names (pull-sanitizer-asan, pull-sanitizer-tsan) is inconsistent with existing jobs (pull-integration-test).
    Solution: Standardize job names across all presubmits, e.g., pull-test-sanitizer-asan, pull-test-sanitizer-tsan.

  3. Documentation Clarity
    File: .ci/replay-prow-job.sh
    Lines: 1-30 (Usage Section)
    The documentation in the usage function is overly verbose and could confuse users.
    Solution: Consolidate the usage information and provide examples in a separate README file.

  4. Code Duplication in Replay Logic
    File: .ci/replay-prow-job.sh
    Lines: 1345-1607
    The logic for pod mode and prowjob mode has significant overlap. Use shared functions for common operations like preparing environment variables and handling job execution.
    Solution: Extract shared logic into helper functions.

Suggested Refactor Example

Refactor common replay logic into a helper function (prepare_job_execution) to reduce duplication:

prepare_job_execution() {
    local mode="$1"
    local config="$2"
    local job_name="$3"
    local namespace="$4"

    log "Preparing job execution: mode=${mode}, job=${job_name}, config=${config}"
    run_kubectl apply --namespace "${namespace}" -f "${config}" || fatal "Failed to apply config"
}

Conclusion

The PR is well-structured and introduces valuable functionality for local replay of Prow jobs and sanitizer testing. Addressing the identified issues will improve maintainability, security, and usability of the code.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new utility script, .ci/replay-prow-job.sh, designed to replay Prow jobs locally or within a cluster using various modes (pod, prowjob, or auto). It also adds a new CI job, pull-replay-prow-jobs, to automate the replay of changed Prow configurations. Furthermore, new presubmit jobs for TiFlash have been added to perform AddressSanitizer (ASan) and ThreadSanitizer (TSan) testing. The review feedback highlights an opportunity to improve CI efficiency and reliability by baking system dependencies (such as git, jq, ruby, kubectl, and ccache) directly into the Docker images rather than installing them at runtime during job execution.

Comment thread prow-jobs/pingcap-qe/ci/presubmits.yaml
Comment thread prow-jobs/pingcap/tiflash/latest-presubmits.yaml
@lybcodes lybcodes changed the title ci(prow): split tiflash sanitizer jobs and replay automation ci(prow): add tiflash sanitizer jobs and replay automation Mar 31, 2026
@lybcodes lybcodes requested a review from Copilot March 31, 2026 10:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR advances the migration away from ci.pingcap.net by adding native Prow presubmits for TiFlash sanitizer workflows and introducing an automated “replay changed Prow jobs” mechanism (plus a presubmit hook in pingcap-qe/ci) to validate native Prow job changes.

Changes:

  • Added pull-sanitizer-asan / pull-sanitizer-tsan presubmits for pingcap/tiflash.
  • Added .ci/replay-prow-job.sh to replay native Prow jobs locally (pod / prowjob / auto) and to auto-replay jobs affected by YAML diffs.
  • Wired a pull-replay-prow-jobs optional presubmit in pingcap-qe/ci to run the replay automation on changed prow-jobs/**/*.y{a,}ml.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
prow-jobs/pingcap/tiflash/latest-presubmits.yaml Adds TiFlash sanitizer presubmits and shared anchors/scripts for sanitizer execution.
prow-jobs/pingcap-qe/ci/presubmits.yaml Adds an optional presubmit to auto-replay changed native Prow job configs.
.ci/replay-prow-job.sh New automation script to extract/replay Prow jobs and batch-replay jobs inferred from YAML diffs.

Comment thread prow-jobs/pingcap/tiflash/latest-presubmits.yaml Outdated
Comment thread prow-jobs/pingcap/tiflash/latest-presubmits.yaml Outdated
Comment thread .ci/replay-prow-job.sh Outdated
Comment thread .ci/replay-prow-job.sh Outdated
Copy link
Copy Markdown

@ti-chi-bot ti-chi-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary

This PR introduces native Prow presubmits for pingcap/tiflash sanitizer workflows (pull-sanitizer-asan and pull-sanitizer-tsan) and adds local replay automation for native Prow jobs. The implementation includes a new shell script (.ci/replay-prow-job.sh) for job replay, YAML updates for presubmit configurations, and significant logic for handling sanitizer builds. Overall, the implementation is thorough but requires refinement in error handling, script modularity, and documentation for maintainability.


Critical Issues

  1. Error Handling for Unsupported Sanitizers (File: latest-presubmits.yaml, Line: ~9)

    • The script exits with a generic error message if an unsupported sanitizer is provided (ASan or TSan are supported). This can lead to confusion in debugging.
    • Suggested Solution: Use a more descriptive error message and add a fallback/default behavior. For example:
      if [[ "${sanitizer}" != "ASan" && "${sanitizer}" != "TSan" ]]; then
        echo "Error: Unsupported sanitizer '${sanitizer}'. Valid options are 'ASan' and 'TSan'." >&2
        exit 1
      fi
  2. Performance Bottleneck in Submodule Updates (File: .ci/replay-prow-job.sh, Lines ~1300)

    • Submodule update logic in the pod mode replay script (git submodule update --init --recursive) runs unconditionally, which may significantly slow down job execution for repositories with large or unnecessary submodules.
    • Suggested Solution: Allow selective submodule updates using predefined paths or a configuration file.

Code Improvements

  1. Complexity in .ci/replay-prow-job.sh (File: .ci/replay-prow-job.sh, Lines ~1-1617)

    • The script is overly long and difficult to navigate. Key functionalities like auto_replay_changed_jobs and run_pod_mode could benefit from modularization.
    • Suggested Solution: Break down the script into smaller scripts or functions, each handling specific tasks (e.g., job detection, job replay, environment setup).
  2. Resource Limits in Prow Jobs (File: latest-presubmits.yaml, Lines ~140-200)

    • Memory and CPU limits for the sanitizer containers are set at high values (32Gi memory, 12 CPUs). These limits may cause scheduling issues in shared environments.
    • Suggested Solution: Analyze and reduce resource limits based on actual usage, or allow them to be configurable via environment variables.
  3. Redundant Code in latest-presubmits.yaml (Lines ~140-200)

    • The pull_sanitizer_job configuration is repeated twice with minor differences (ASan vs TSan). This duplication increases maintenance overhead.
    • Suggested Solution: Refactor common configurations using YAML anchors and reuse them. For example:
      pull_sanitizer_job: &pull_sanitizer_job
        <<: *common_job_config
        spec:
          containers:
            - <<: *pull_sanitizer_container
              env:
                - name: SANITIZER
                  value: "{{SANITIZER_TYPE}}"

Best Practices

  1. Documentation for .ci/replay-prow-job.sh (File: .ci/replay-prow-job.sh)

    • The script lacks detailed comments and usage examples for critical functions. This makes it harder for new contributors to understand its behavior.
    • Suggested Solution: Add comments explaining the purpose of each major function and provide examples of expected inputs/outputs.
  2. Testing Coverage for Replay Automation

    • There is no indication of automated tests for the new replay functionality in pull-replay-prow-jobs.
    • Suggested Solution: Add unit tests for the shell script and integration tests for the replayed jobs using mock data.
  3. Environment Variable Defaults (File: .ci/replay-prow-job.sh, Line ~680)

    • Hardcoded environment variable defaults (master, main) may not work for all repositories.
    • Suggested Solution: Allow repository-specific configurations via a .env file or command-line arguments.

Conclusion

The PR delivers important CI improvements for pingcap/tiflash but needs refinement in modularity, resource optimization, and documentation. Addressing these issues will improve maintainability and ensure scalability in larger environments.

Copy link
Copy Markdown
Contributor

@wuhuizuo wuhuizuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in .ci/replay-prow-job.sh.

--run-timeout is documented as the full replay timeout and is threaded through pull-replay-prow-jobs, but pod mode never actually enforces it. After the pod becomes Ready, the script just does kubectl exec ... /tmp/prow_run.sh, so a hung replay can keep running until the outer Prow job timeout instead of the configured replay timeout. That makes --run-timeout effectively a no-op for the default --mode pod path used by this PR.

I think this needs a real timeout wrapper around the pod execution path, or an explicit pod kill/cancel once RUN_TIMEOUT is exceeded.

Copy link
Copy Markdown

@ti-chi-bot ti-chi-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary

This PR introduces a new .ci/replay-prow-job.sh script to facilitate local replay of Prow jobs using pod or prowjob modes, adds sanitizer presubmits (pull-sanitizer-asan and pull-sanitizer-tsan) for pingcap/tiflash, and integrates replay automation as a presubmit in pingcap-qe/ci. The approach is comprehensive, including robust configuration parsing, job replay automation, and clear documentation in the script. While the changes are functional and well-structured, some areas need refinement in terms of error handling, potential edge cases, and maintainability.


Critical Issues

  1. Potential Infinite Loop in Pod Mode Execution

    • File: .ci/replay-prow-job.sh
    • Lines: log "run job command (timeout=${RUN_TIMEOUT}s)", trap 'exit 0' TERM INT; while true; do wait || true; sleep 1; done
    • Issue: The container command in pod mode (trap 'exit 0' TERM INT; while true; do wait || true; sleep 1; done) risks running indefinitely if the actual job inside the pod fails or does not terminate properly. This could result in resource exhaustion or hanging jobs.
    • Suggested Fix: Replace the indefinite loop with explicit monitoring of the job exit status and terminate the pod gracefully if the job does not finish within the expected timeframe. Example:
      trap 'exit 0' TERM INT; timeout "${RUN_TIMEOUT}" wait || exit 1
  2. Sanitizer Validation Logic May Cause Silent Failures

    • File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    • Lines: if [[ "${sanitizer}" != "ASan" && "${sanitizer}" != "TSan" ]]; then
    • Issue: The sanitizer validation logic may silently exit with code 1, which could be challenging to diagnose in CI logs.
    • Suggested Fix: Add explicit error messaging and ensure logs clearly indicate the failure reason:
      echo "Error: Unsupported sanitizer '${sanitizer}'. Valid options are [ASan, TSan]." >&2
      exit 1

Code Improvements

  1. Hardcoded Resource Values for Sanitizer Jobs

    • File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    • Lines: memory: 32Gi, cpu: "12"
    • Issue: Hardcoding resource requests and limits can lead to inefficiencies if cluster capacity changes. Consider making these configurable via environment variables or YAML parameters.
    • Suggested Fix: Use placeholders or environment variables for resource definitions:
      requests:
        cpu: "${SANITIZER_CPU_REQUEST:-12}"
        memory: "${SANITIZER_MEMORY_REQUEST:-32Gi}"
      limits:
        cpu: "${SANITIZER_CPU_LIMIT:-12}"
        memory: "${SANITIZER_MEMORY_LIMIT:-32Gi}"
  2. Verbose Flag Behavior in Replay Script

    • File: .ci/replay-prow-job.sh
    • Lines: vlog(), if [[ "${VERBOSE:-false}" == "true" ]]; then
    • Issue: The verbosity option (--verbose) is implemented but does not cover all critical steps, such as replay execution or resource cleanup.
    • Suggested Fix: Expand vlog() usage to include additional debug information, such as the pod spec, environment variables, and replay targets.

Best Practices

  1. Testing Coverage for .ci/replay-prow-job.sh

    • Issue: The script lacks automated tests to validate functionality, especially for edge cases (e.g., missing environment variables, invalid modes).
    • Suggested Fix: Add unit tests or integration tests using a shell testing framework like bats or mock Kubernetes APIs.
  2. Documentation Clarity in Replay Script

    • File: .ci/replay-prow-job.sh
    • Issue: While the usage block is helpful, it does not clearly document the required Kubernetes permissions for the script to function.
    • Suggested Fix: Add a section to the usage block detailing required permissions (e.g., kubectl access, namespace privileges).
  3. Consistency in YAML Style

    • Files: All YAML files (presubmits.yaml, latest-presubmits.yaml)
    • Issue: Mixing YAML anchors and explicit definitions can be confusing for maintainers. For example, <<: *pull_sanitizer_job is used inconsistently.
    • Suggested Fix: Standardize YAML styling by either fully using anchors or explicit definitions.

Next Steps

  1. Address the critical issues to avoid runtime interruptions and improve reliability.
  2. Refactor parts of the codebase to enhance maintainability and configuration flexibility.
  3. Add testing and documentation to ensure future maintainability and clarity.

Copy link
Copy Markdown

@ti-chi-bot ti-chi-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary

This PR introduces new Prow presubmit jobs for sanitizers (pull-sanitizer-asan and pull-sanitizer-tsan) and adds a local replay automation script (replay-prow-job.sh) to facilitate testing and debugging of Prow jobs. The implementation includes YAML updates for job definitions and a detailed Bash script for replaying jobs locally or in ProwJob mode. The overall quality of the PR is strong, but there are several areas where improvements can be made for robustness, maintainability, and efficiency.


Critical Issues

  1. Security Risk: Hardcoded Secrets in YAML

    • File: .ci/replay-prow-job.sh, prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    • Line: Secrets like REPLAY_GIT_TOKEN are potentially exposed when injected into pods via environment variables.
    • Why it's an issue: Hardcoding sensitive information in the YAML configuration or scripts can lead to unintentional leaks and security vulnerabilities.
    • Suggested Solution: Use Kubernetes secrets to securely manage sensitive values. Replace:
      env:
        - name: REPLAY_GIT_TOKEN
          value: <your-token>
      With:
      env:
        - name: REPLAY_GIT_TOKEN
          valueFrom:
            secretKeyRef:
              name: git-token-secret
              key: token
  2. Inefficient Caching Logic in Sanitizer Script

    • File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    • Line: Sanitizer script reinitializes ccache settings every time.
    • Why it's an issue: Frequent reinitialization of ccache can degrade performance and potentially result in unnecessary overhead during repeated job executions.
    • Suggested Solution: Skip initialization if ccache is already configured. Add a check:
      if ! ccache -p | grep -q "cache_dir"; then
          ccache -o cache_dir=/path/to/cache
          ...
      fi

Code Improvements

  1. Error Handling in replay-prow-job.sh

    • File: .ci/replay-prow-job.sh
    • Line: Functions like run_kubectl_with_deadline silently ignore errors.
    • Why it's an issue: Failure in critical steps like pod creation or execution can lead to undetected issues in workflow replay.
    • Suggested Solution: Add error logging and retry logic for critical steps:
      run_kubectl_with_deadline() {
          ...
          if ! run_kubectl "$@"; then
              log "Error in step: $step_desc"
              exit 1
          fi
      }
  2. Unbounded Resource Requests

    • File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    • Line: Resources for pull-sanitizer jobs are set to high values (32Gi memory, 12 CPUs).
    • Why it's an issue: High resource allocation can lead to inefficiencies or scheduling issues in Kubernetes clusters.
    • Suggested Solution: Add resource limits based on observed workload requirements. For example:
      resources:
        requests:
          cpu: "8"
          memory: 16Gi
        limits:
          cpu: "10"
          memory: 20Gi
  3. Code Duplication in Sanitizer Definitions

    • File: prow-jobs/pingcap/tiflash/latest-presubmits.yaml
    • Line: pull-sanitizer-asan and pull-sanitizer-tsan jobs define similar configurations.
    • Why it's an issue: Duplication makes maintenance more difficult.
    • Suggested Solution: Use YAML anchors to reduce repetition:
      pull_sanitizer_base_job: &pull_sanitizer_base_job
        <<: *pull_sanitizer_job
        spec:
          containers:
            - <<: *pull_sanitizer_container
      ...
      - <<: *pull_sanitizer_base_job
        name: pull-sanitizer-asan
        env:
          - name: SANITIZER
            value: ASan
      - <<: *pull_sanitizer_base_job
        name: pull-sanitizer-tsan
        env:
          - name: SANITIZER
            value: TSan

Best Practices

  1. Missing Tests for replay-prow-job.sh

    • File: .ci/replay-prow-job.sh
    • Why it's an issue: The script lacks automated test coverage to validate its correctness.
    • Suggested Solution: Add unit tests using a shell testing framework like bats. Example:
      @test "test replay-prow-job.sh --help" {
          run bash .ci/replay-prow-job.sh --help
          [ "$status" -eq 0 ]
      }
  2. Inconsistent Logging Levels

    • File: .ci/replay-prow-job.sh
    • Line: The script mixes log and vlog without clear separation.
    • Why it's an issue: Lack of consistent logging levels can make debugging harder.
    • Suggested Solution: Standardize logging levels using verbosity flags. For instance:
      log() {
          if [[ "${VERBOSE:-false}" == "true" ]]; then
              echo "[DEBUG] $*" >&2
          else
              echo "[INFO] $*" >&2
          fi
      }
  3. Improve Documentation and Usage Examples

    • File: .ci/replay-prow-job.sh
    • Why it's an issue: The usage section is comprehensive but lacks examples for edge cases like running jobs with custom namespaces.
    • Suggested Solution: Extend usage documentation with common scenarios:
      # Examples:
      # Replay in a custom namespace
      .ci/replay-prow-job.sh --namespace custom-namespace --mode pod ...

Conclusion

This PR is well-structured and achieves its objectives effectively. Addressing the issues above will improve security, maintainability, and performance, making the solution more robust. The critical security concerns should be prioritized, followed by simplification of YAML definitions and enhanced error handling.

@lybcodes
Copy link
Copy Markdown
Contributor Author

lybcodes commented Apr 1, 2026

@wuhuizuo Thanks for the catch.You were right.
I’ve updated .ci/replay-prow-job.sh so --run-timeout is now enforced as a full replay deadline across the entire pod flow.

Copy link
Copy Markdown
Contributor

@wuhuizuo wuhuizuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed the latest updates.

The new pod replay flow now carries a single deadline through apply / wait / cp / exec, validates --run-timeout, and force-deletes the replay pod when that deadline is exceeded. That addresses my previous blocking concern that --run-timeout was a no-op in pod mode.

I don't have additional blocking comments from the CI side.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wuhuizuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the lgtm label Apr 1, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 1, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-01 04:11:26.907811425 +0000 UTC m=+324692.113171472: ☑️ agreed by wuhuizuo.

@ti-chi-bot ti-chi-bot Bot added the approved label Apr 1, 2026
@ti-chi-bot ti-chi-bot Bot merged commit e00d9c8 into PingCAP-QE:main Apr 1, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

3 participants