ci(prow): add tiflash sanitizer jobs and replay automation#4390
ci(prow): add tiflash sanitizer jobs and replay automation#4390ti-chi-bot[bot] merged 4 commits intoPingCAP-QE:mainfrom
Conversation
There was a problem hiding this comment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR introduces a new script (replay-prow-job.sh) to replay Prow jobs locally in different modes (pod, prowjob, and auto), facilitating easier testing and debugging. It also adds new presubmit jobs for sanitizer testing (ASan and TSan) with dedicated configurations. The approach is modular, leveraging YAML anchors for shared configurations, and includes extensive documentation within the script. The implementation is thorough, but there are areas for improvement in error handling, modularity, and edge case coverage.
Critical Issues
- Potential Command Injection in
run_with_sanitizer_runtime
File:prow-jobs/pingcap/tiflash/latest-presubmits.yaml
Lines: 91-92
The functionrun_with_sanitizer_runtimepasses its arguments directly tosetarchwithout sanitizing them. If the input to this function is derived from external sources, it could lead to command injection vulnerabilities.
Solution: Sanitize inputs or restrictrun_with_sanitizer_runtimeto trusted arguments.run_with_sanitizer_runtime() { local args=("$@") for arg in "${args[@]}"; do if [[ "$arg" =~ [^a-zA-Z0-9_\-/] ]]; then echo "Invalid argument: $arg" >&2 exit 1 fi done setarch "${arch_name}" -R "${args[@]}" }
Code Improvements
-
Error Handling in
run_with_sanitizer_runtime
File:prow-jobs/pingcap/tiflash/latest-presubmits.yaml
Lines: 91-92
The fallback mechanism forsetarchfailure does not log the reason for the failure clearly.
Solution: Add detailed error messages.if ! setarch "${arch_name}" -R "$@"; then echo "Error: setarch command failed for architecture ${arch_name}" >&2 "$@" fi
-
Excessive Argument Parsing in
replay-prow-job.sh
File:.ci/replay-prow-job.sh
Lines: 1-1607
The script performs extensive argument parsing, making it harder to maintain. Consider modularizing argument parsing logic into separate functions.
Solution: Divide argument parsing into smaller, reusable functions, e.g.,parse_mode_args,parse_common_args. -
Hardcoded Timeouts
File:.ci/replay-prow-job.sh
Lines: 1345 (--run-timeout 3600)
The timeout values are hardcoded in several places. These should ideally be configurable via environment variables or centralized configuration.
Solution: Use a default timeout value and allow overrides via CLI or environment variables.DEFAULT_TIMEOUT=3600 RUN_TIMEOUT="${RUN_TIMEOUT:-$DEFAULT_TIMEOUT}"
Best Practices
-
Missing Unit Tests for the Script
File:.ci/replay-prow-job.sh
The script is complex but lacks any test coverage. Add unit tests for critical functions likecollect_changed_files,select_jobs_for_file, etc.
Solution: Create a test suite usingbash-testor migrate critical logic to a scripting language with better testing support (e.g., Python). -
Inconsistent Naming Conventions
File:prow-jobs/pingcap/tiflash/latest-presubmits.yaml
Lines: 38-139
The naming convention for job names (pull-sanitizer-asan,pull-sanitizer-tsan) is inconsistent with existing jobs (pull-integration-test).
Solution: Standardize job names across all presubmits, e.g.,pull-test-sanitizer-asan,pull-test-sanitizer-tsan. -
Documentation Clarity
File:.ci/replay-prow-job.sh
Lines: 1-30 (Usage Section)
The documentation in theusagefunction is overly verbose and could confuse users.
Solution: Consolidate the usage information and provide examples in a separate README file. -
Code Duplication in Replay Logic
File:.ci/replay-prow-job.sh
Lines: 1345-1607
The logic for pod mode and prowjob mode has significant overlap. Use shared functions for common operations like preparing environment variables and handling job execution.
Solution: Extract shared logic into helper functions.
Suggested Refactor Example
Refactor common replay logic into a helper function (prepare_job_execution) to reduce duplication:
prepare_job_execution() {
local mode="$1"
local config="$2"
local job_name="$3"
local namespace="$4"
log "Preparing job execution: mode=${mode}, job=${job_name}, config=${config}"
run_kubectl apply --namespace "${namespace}" -f "${config}" || fatal "Failed to apply config"
}Conclusion
The PR is well-structured and introduces valuable functionality for local replay of Prow jobs and sanitizer testing. Addressing the identified issues will improve maintainability, security, and usability of the code.
There was a problem hiding this comment.
Code Review
This pull request introduces a new utility script, .ci/replay-prow-job.sh, designed to replay Prow jobs locally or within a cluster using various modes (pod, prowjob, or auto). It also adds a new CI job, pull-replay-prow-jobs, to automate the replay of changed Prow configurations. Furthermore, new presubmit jobs for TiFlash have been added to perform AddressSanitizer (ASan) and ThreadSanitizer (TSan) testing. The review feedback highlights an opportunity to improve CI efficiency and reliability by baking system dependencies (such as git, jq, ruby, kubectl, and ccache) directly into the Docker images rather than installing them at runtime during job execution.
There was a problem hiding this comment.
Pull request overview
This PR advances the migration away from ci.pingcap.net by adding native Prow presubmits for TiFlash sanitizer workflows and introducing an automated “replay changed Prow jobs” mechanism (plus a presubmit hook in pingcap-qe/ci) to validate native Prow job changes.
Changes:
- Added
pull-sanitizer-asan/pull-sanitizer-tsanpresubmits forpingcap/tiflash. - Added
.ci/replay-prow-job.shto replay native Prow jobs locally (pod / prowjob / auto) and to auto-replay jobs affected by YAML diffs. - Wired a
pull-replay-prow-jobsoptional presubmit inpingcap-qe/cito run the replay automation on changedprow-jobs/**/*.y{a,}ml.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| prow-jobs/pingcap/tiflash/latest-presubmits.yaml | Adds TiFlash sanitizer presubmits and shared anchors/scripts for sanitizer execution. |
| prow-jobs/pingcap-qe/ci/presubmits.yaml | Adds an optional presubmit to auto-replay changed native Prow job configs. |
| .ci/replay-prow-job.sh | New automation script to extract/replay Prow jobs and batch-replay jobs inferred from YAML diffs. |
There was a problem hiding this comment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR introduces native Prow presubmits for pingcap/tiflash sanitizer workflows (pull-sanitizer-asan and pull-sanitizer-tsan) and adds local replay automation for native Prow jobs. The implementation includes a new shell script (.ci/replay-prow-job.sh) for job replay, YAML updates for presubmit configurations, and significant logic for handling sanitizer builds. Overall, the implementation is thorough but requires refinement in error handling, script modularity, and documentation for maintainability.
Critical Issues
-
Error Handling for Unsupported Sanitizers (File:
latest-presubmits.yaml, Line: ~9)- The script exits with a generic error message if an unsupported sanitizer is provided (
ASanorTSanare supported). This can lead to confusion in debugging. - Suggested Solution: Use a more descriptive error message and add a fallback/default behavior. For example:
if [[ "${sanitizer}" != "ASan" && "${sanitizer}" != "TSan" ]]; then echo "Error: Unsupported sanitizer '${sanitizer}'. Valid options are 'ASan' and 'TSan'." >&2 exit 1 fi
- The script exits with a generic error message if an unsupported sanitizer is provided (
-
Performance Bottleneck in Submodule Updates (File:
.ci/replay-prow-job.sh, Lines ~1300)- Submodule update logic in the pod mode replay script (
git submodule update --init --recursive) runs unconditionally, which may significantly slow down job execution for repositories with large or unnecessary submodules. - Suggested Solution: Allow selective submodule updates using predefined paths or a configuration file.
- Submodule update logic in the pod mode replay script (
Code Improvements
-
Complexity in
.ci/replay-prow-job.sh(File:.ci/replay-prow-job.sh, Lines ~1-1617)- The script is overly long and difficult to navigate. Key functionalities like
auto_replay_changed_jobsandrun_pod_modecould benefit from modularization. - Suggested Solution: Break down the script into smaller scripts or functions, each handling specific tasks (e.g., job detection, job replay, environment setup).
- The script is overly long and difficult to navigate. Key functionalities like
-
Resource Limits in Prow Jobs (File:
latest-presubmits.yaml, Lines ~140-200)- Memory and CPU limits for the sanitizer containers are set at high values (
32Gimemory,12 CPUs). These limits may cause scheduling issues in shared environments. - Suggested Solution: Analyze and reduce resource limits based on actual usage, or allow them to be configurable via environment variables.
- Memory and CPU limits for the sanitizer containers are set at high values (
-
Redundant Code in
latest-presubmits.yaml(Lines ~140-200)- The
pull_sanitizer_jobconfiguration is repeated twice with minor differences (ASanvsTSan). This duplication increases maintenance overhead. - Suggested Solution: Refactor common configurations using YAML anchors and reuse them. For example:
pull_sanitizer_job: &pull_sanitizer_job <<: *common_job_config spec: containers: - <<: *pull_sanitizer_container env: - name: SANITIZER value: "{{SANITIZER_TYPE}}"
- The
Best Practices
-
Documentation for
.ci/replay-prow-job.sh(File:.ci/replay-prow-job.sh)- The script lacks detailed comments and usage examples for critical functions. This makes it harder for new contributors to understand its behavior.
- Suggested Solution: Add comments explaining the purpose of each major function and provide examples of expected inputs/outputs.
-
Testing Coverage for Replay Automation
- There is no indication of automated tests for the new replay functionality in
pull-replay-prow-jobs. - Suggested Solution: Add unit tests for the shell script and integration tests for the replayed jobs using mock data.
- There is no indication of automated tests for the new replay functionality in
-
Environment Variable Defaults (File:
.ci/replay-prow-job.sh, Line ~680)- Hardcoded environment variable defaults (
master,main) may not work for all repositories. - Suggested Solution: Allow repository-specific configurations via a
.envfile or command-line arguments.
- Hardcoded environment variable defaults (
Conclusion
The PR delivers important CI improvements for pingcap/tiflash but needs refinement in modularity, resource optimization, and documentation. Addressing these issues will improve maintainability and ensure scalability in larger environments.
wuhuizuo
left a comment
There was a problem hiding this comment.
I found one blocking issue in .ci/replay-prow-job.sh.
--run-timeout is documented as the full replay timeout and is threaded through pull-replay-prow-jobs, but pod mode never actually enforces it. After the pod becomes Ready, the script just does kubectl exec ... /tmp/prow_run.sh, so a hung replay can keep running until the outer Prow job timeout instead of the configured replay timeout. That makes --run-timeout effectively a no-op for the default --mode pod path used by this PR.
I think this needs a real timeout wrapper around the pod execution path, or an explicit pod kill/cancel once RUN_TIMEOUT is exceeded.
There was a problem hiding this comment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR introduces a new .ci/replay-prow-job.sh script to facilitate local replay of Prow jobs using pod or prowjob modes, adds sanitizer presubmits (pull-sanitizer-asan and pull-sanitizer-tsan) for pingcap/tiflash, and integrates replay automation as a presubmit in pingcap-qe/ci. The approach is comprehensive, including robust configuration parsing, job replay automation, and clear documentation in the script. While the changes are functional and well-structured, some areas need refinement in terms of error handling, potential edge cases, and maintainability.
Critical Issues
-
Potential Infinite Loop in Pod Mode Execution
- File:
.ci/replay-prow-job.sh - Lines:
log "run job command (timeout=${RUN_TIMEOUT}s)",trap 'exit 0' TERM INT; while true; do wait || true; sleep 1; done - Issue: The container command in pod mode (
trap 'exit 0' TERM INT; while true; do wait || true; sleep 1; done) risks running indefinitely if the actual job inside the pod fails or does not terminate properly. This could result in resource exhaustion or hanging jobs. - Suggested Fix: Replace the indefinite loop with explicit monitoring of the job exit status and terminate the pod gracefully if the job does not finish within the expected timeframe. Example:
trap 'exit 0' TERM INT; timeout "${RUN_TIMEOUT}" wait || exit 1
- File:
-
Sanitizer Validation Logic May Cause Silent Failures
- File:
prow-jobs/pingcap/tiflash/latest-presubmits.yaml - Lines:
if [[ "${sanitizer}" != "ASan" && "${sanitizer}" != "TSan" ]]; then - Issue: The sanitizer validation logic may silently exit with code 1, which could be challenging to diagnose in CI logs.
- Suggested Fix: Add explicit error messaging and ensure logs clearly indicate the failure reason:
echo "Error: Unsupported sanitizer '${sanitizer}'. Valid options are [ASan, TSan]." >&2 exit 1
- File:
Code Improvements
-
Hardcoded Resource Values for Sanitizer Jobs
- File:
prow-jobs/pingcap/tiflash/latest-presubmits.yaml - Lines:
memory: 32Gi,cpu: "12" - Issue: Hardcoding resource requests and limits can lead to inefficiencies if cluster capacity changes. Consider making these configurable via environment variables or YAML parameters.
- Suggested Fix: Use placeholders or environment variables for resource definitions:
requests: cpu: "${SANITIZER_CPU_REQUEST:-12}" memory: "${SANITIZER_MEMORY_REQUEST:-32Gi}" limits: cpu: "${SANITIZER_CPU_LIMIT:-12}" memory: "${SANITIZER_MEMORY_LIMIT:-32Gi}"
- File:
-
Verbose Flag Behavior in Replay Script
- File:
.ci/replay-prow-job.sh - Lines:
vlog(),if [[ "${VERBOSE:-false}" == "true" ]]; then - Issue: The verbosity option (
--verbose) is implemented but does not cover all critical steps, such as replay execution or resource cleanup. - Suggested Fix: Expand
vlog()usage to include additional debug information, such as the pod spec, environment variables, and replay targets.
- File:
Best Practices
-
Testing Coverage for
.ci/replay-prow-job.sh- Issue: The script lacks automated tests to validate functionality, especially for edge cases (e.g., missing environment variables, invalid modes).
- Suggested Fix: Add unit tests or integration tests using a shell testing framework like
batsor mock Kubernetes APIs.
-
Documentation Clarity in Replay Script
- File:
.ci/replay-prow-job.sh - Issue: While the usage block is helpful, it does not clearly document the required Kubernetes permissions for the script to function.
- Suggested Fix: Add a section to the usage block detailing required permissions (e.g.,
kubectlaccess, namespace privileges).
- File:
-
Consistency in YAML Style
- Files: All YAML files (
presubmits.yaml,latest-presubmits.yaml) - Issue: Mixing YAML anchors and explicit definitions can be confusing for maintainers. For example,
<<: *pull_sanitizer_jobis used inconsistently. - Suggested Fix: Standardize YAML styling by either fully using anchors or explicit definitions.
- Files: All YAML files (
Next Steps
- Address the critical issues to avoid runtime interruptions and improve reliability.
- Refactor parts of the codebase to enhance maintainability and configuration flexibility.
- Add testing and documentation to ensure future maintainability and clarity.
There was a problem hiding this comment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This PR introduces new Prow presubmit jobs for sanitizers (pull-sanitizer-asan and pull-sanitizer-tsan) and adds a local replay automation script (replay-prow-job.sh) to facilitate testing and debugging of Prow jobs. The implementation includes YAML updates for job definitions and a detailed Bash script for replaying jobs locally or in ProwJob mode. The overall quality of the PR is strong, but there are several areas where improvements can be made for robustness, maintainability, and efficiency.
Critical Issues
-
Security Risk: Hardcoded Secrets in YAML
- File:
.ci/replay-prow-job.sh,prow-jobs/pingcap/tiflash/latest-presubmits.yaml - Line: Secrets like
REPLAY_GIT_TOKENare potentially exposed when injected into pods via environment variables. - Why it's an issue: Hardcoding sensitive information in the YAML configuration or scripts can lead to unintentional leaks and security vulnerabilities.
- Suggested Solution: Use Kubernetes secrets to securely manage sensitive values. Replace:
With:
env: - name: REPLAY_GIT_TOKEN value: <your-token>
env: - name: REPLAY_GIT_TOKEN valueFrom: secretKeyRef: name: git-token-secret key: token
- File:
-
Inefficient Caching Logic in Sanitizer Script
- File:
prow-jobs/pingcap/tiflash/latest-presubmits.yaml - Line: Sanitizer script reinitializes
ccachesettings every time. - Why it's an issue: Frequent reinitialization of
ccachecan degrade performance and potentially result in unnecessary overhead during repeated job executions. - Suggested Solution: Skip initialization if
ccacheis already configured. Add a check:if ! ccache -p | grep -q "cache_dir"; then ccache -o cache_dir=/path/to/cache ... fi
- File:
Code Improvements
-
Error Handling in
replay-prow-job.sh- File:
.ci/replay-prow-job.sh - Line: Functions like
run_kubectl_with_deadlinesilently ignore errors. - Why it's an issue: Failure in critical steps like pod creation or execution can lead to undetected issues in workflow replay.
- Suggested Solution: Add error logging and retry logic for critical steps:
run_kubectl_with_deadline() { ... if ! run_kubectl "$@"; then log "Error in step: $step_desc" exit 1 fi }
- File:
-
Unbounded Resource Requests
- File:
prow-jobs/pingcap/tiflash/latest-presubmits.yaml - Line: Resources for
pull-sanitizerjobs are set to high values (32Gimemory,12CPUs). - Why it's an issue: High resource allocation can lead to inefficiencies or scheduling issues in Kubernetes clusters.
- Suggested Solution: Add resource limits based on observed workload requirements. For example:
resources: requests: cpu: "8" memory: 16Gi limits: cpu: "10" memory: 20Gi
- File:
-
Code Duplication in Sanitizer Definitions
- File:
prow-jobs/pingcap/tiflash/latest-presubmits.yaml - Line:
pull-sanitizer-asanandpull-sanitizer-tsanjobs define similar configurations. - Why it's an issue: Duplication makes maintenance more difficult.
- Suggested Solution: Use YAML anchors to reduce repetition:
pull_sanitizer_base_job: &pull_sanitizer_base_job <<: *pull_sanitizer_job spec: containers: - <<: *pull_sanitizer_container ... - <<: *pull_sanitizer_base_job name: pull-sanitizer-asan env: - name: SANITIZER value: ASan - <<: *pull_sanitizer_base_job name: pull-sanitizer-tsan env: - name: SANITIZER value: TSan
- File:
Best Practices
-
Missing Tests for
replay-prow-job.sh- File:
.ci/replay-prow-job.sh - Why it's an issue: The script lacks automated test coverage to validate its correctness.
- Suggested Solution: Add unit tests using a shell testing framework like
bats. Example:@test "test replay-prow-job.sh --help" { run bash .ci/replay-prow-job.sh --help [ "$status" -eq 0 ] }
- File:
-
Inconsistent Logging Levels
- File:
.ci/replay-prow-job.sh - Line: The script mixes
logandvlogwithout clear separation. - Why it's an issue: Lack of consistent logging levels can make debugging harder.
- Suggested Solution: Standardize logging levels using verbosity flags. For instance:
log() { if [[ "${VERBOSE:-false}" == "true" ]]; then echo "[DEBUG] $*" >&2 else echo "[INFO] $*" >&2 fi }
- File:
-
Improve Documentation and Usage Examples
- File:
.ci/replay-prow-job.sh - Why it's an issue: The usage section is comprehensive but lacks examples for edge cases like running jobs with custom namespaces.
- Suggested Solution: Extend usage documentation with common scenarios:
# Examples: # Replay in a custom namespace .ci/replay-prow-job.sh --namespace custom-namespace --mode pod ...
- File:
Conclusion
This PR is well-structured and achieves its objectives effectively. Addressing the issues above will improve security, maintainability, and performance, making the solution more robust. The critical security concerns should be prioritized, followed by simplification of YAML definitions and enhanced error handling.
|
@wuhuizuo Thanks for the catch.You were right. |
wuhuizuo
left a comment
There was a problem hiding this comment.
Re-reviewed the latest updates.
The new pod replay flow now carries a single deadline through apply / wait / cp / exec, validates --run-timeout, and force-deletes the replay pod when that deadline is exceeded. That addresses my previous blocking concern that --run-timeout was a no-op in pod mode.
I don't have additional blocking comments from the CI side.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wuhuizuo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
Summary
pingcap/tiflashsanitizer workflows:pull-sanitizer-asanpull-sanitizer-tsanpingcap-qe/ciScope
.ci/replay-prow-job.shprow-jobs/pingcap-qe/ci/presubmits.yamlprow-jobs/pingcap/tiflash/latest-presubmits.yamlIssue Link