
Commit 707bf1a

ci: reduce unit test flakiness and shard re-run cost (#3844)
A unit-test shard recently failed on a timing race rather than a real regression - a run-engine waitpoint test sleeps 1250ms waiting on a 1000ms timeout that's processed by a ~1000ms worker poll, so on a CPU-starved shard the margin evaporates and the whole matrix goes red. Because `fail-fast` defaults on, that one flake cancels the sibling shards, and the only recovery is re-running the entire matrix "just to be sure" - which is itself slow.

This is the low-risk first pass at that pain:

- `fail-fast: false` on the webapp and internal shard matrices, so one flaky shard no longer cancels its siblings. "Re-run failed jobs" now re-runs just the failed shard instead of the whole matrix.
- CI-scoped `retry: process.env.CI ? 2 : 0` on the timing-sensitive packages (`run-engine`, `redis-worker`, `schedule-engine`). Flakes self-heal in CI; local runs stay at `retry: 0` so they still surface in dev. A stopgap until the timing tests are made deterministic.
- `fetch-depth: 1` on the unit-test checkouts - they don't use git history, so the full clone was wasted setup time across ~20 jobs.
- Reconcile the pre-pull image tags with what testcontainers actually pulls (`redis:7-alpine` -> `redis:7.2`, `ryuk:0.11.0` -> `ryuk:0.14.0`) and add `minio/minio:latest` to the webapp pre-pull. Otherwise those images pull unauthenticated at test time and risk Docker Hub rate-limit flakes (worst on fork PRs, where the authenticated pre-pull is skipped entirely).

Deeper follow-ups - bigger runners, turbo remote cache, runtime-weighted sharding, and the real root-cause fix (container reuse / template-DB isolation + deterministic timing tests) - are tracked under TRI-10484.
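The timing race in the commit message can be put into numbers with a toy model. This is only a sketch: it assumes the worker wakes at fixed ticks and handles a timeout on the first tick at or after the timeout's due time, which may not match how run-engine actually schedules work. The names `processedAtMs`, `marginMs`, `firstTickMs` and `schedulingDelayMs` are hypothetical and exist only for illustration.

```typescript
// Toy model (an assumption, not run-engine's real scheduler): the worker wakes
// at firstTickMs, firstTickMs + pollIntervalMs, ... and handles a timeout on
// the first tick that is at or after the moment the timeout becomes due.
function processedAtMs(dueMs: number, pollIntervalMs: number, firstTickMs: number): number {
  if (pollIntervalMs <= 0) {
    throw new RangeError("pollIntervalMs must be positive");
  }
  if (firstTickMs >= dueMs) {
    return firstTickMs;
  }
  const ticksNeeded = Math.ceil((dueMs - firstTickMs) / pollIntervalMs);
  return firstTickMs + ticksNeeded * pollIntervalMs;
}

// Slack left when the test wakes from its fixed sleep. A negative value means
// the test asserts before the timeout has been handled, i.e. it flakes.
// schedulingDelayMs stands in for extra latency on a CPU-starved runner.
function marginMs(
  sleepMs: number,
  dueMs: number,
  pollIntervalMs: number,
  firstTickMs: number,
  schedulingDelayMs: number = 0
): number {
  return sleepMs - (processedAtMs(dueMs, pollIntervalMs, firstTickMs) + schedulingDelayMs);
}

// Figures from the commit message: 1250ms sleep, 1000ms timeout, ~1000ms poll.
const bestCase = marginMs(1250, 1000, 1000, 0); // tick lands exactly on the due time
const shiftedTick = marginMs(1250, 1000, 1000, 200); // tick lands 200ms after it
const starved = marginMs(1250, 1000, 1000, 200, 100); // plus 100ms of scheduling delay

console.log({ bestCase, shiftedTick, starved });
```

Under this model the test has at most 250ms of slack, and a modest tick offset combined with scheduling delay is enough to push the margin below zero.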
1 parent 16d59aa commit 707bf1a

6 files changed

Lines changed: 23 additions & 12 deletions

.github/workflows/unit-tests-internal.yml

Lines changed: 6 additions & 4 deletions
@@ -16,6 +16,8 @@ jobs:
     name: "🧪 Unit Tests: Internal"
     runs-on: ubuntu-latest
     strategy:
+      # one flaky shard shouldn't cancel its siblings - lets us re-run only the failed shard
+      fail-fast: false
       matrix:
         shardIndex: [1, 2, 3, 4, 5, 6, 7, 8]
         shardTotal: [8]
@@ -53,7 +55,7 @@ jobs:
       - name: ⬇️ Checkout repo
         uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:
-          fetch-depth: 0
+          fetch-depth: 1
           persist-credentials: false

       - name: ⎔ Setup pnpm
@@ -84,8 +86,8 @@ jobs:
           echo "Pre-pulling Docker images with authenticated session..."
           docker pull postgres:14
           docker pull clickhouse/clickhouse-server:25.4-alpine
-          docker pull redis:7-alpine
-          docker pull testcontainers/ryuk:0.11.0
+          docker pull redis:7.2
+          docker pull testcontainers/ryuk:0.14.0
           docker pull electricsql/electric:1.2.4
           echo "Image pre-pull complete"

@@ -123,7 +125,7 @@ jobs:
       - name: ⬇️ Checkout repo
         uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:
-          fetch-depth: 0
+          fetch-depth: 1
           persist-credentials: false

       - name: ⎔ Setup pnpm

.github/workflows/unit-tests-packages.yml

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ jobs:
5353
- name: ⬇️ Checkout repo
5454
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
5555
with:
56-
fetch-depth: 0
56+
fetch-depth: 1
5757
persist-credentials: false
5858

5959
- name: ⎔ Setup pnpm
@@ -84,8 +84,8 @@ jobs:
8484
echo "Pre-pulling Docker images with authenticated session..."
8585
docker pull postgres:14
8686
docker pull clickhouse/clickhouse-server:25.4-alpine
87-
docker pull redis:7-alpine
88-
docker pull testcontainers/ryuk:0.11.0
87+
docker pull redis:7.2
88+
docker pull testcontainers/ryuk:0.14.0
8989
docker pull electricsql/electric:1.2.4
9090
echo "Image pre-pull complete"
9191
@@ -123,7 +123,7 @@ jobs:
123123
- name: ⬇️ Checkout repo
124124
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
125125
with:
126-
fetch-depth: 0
126+
fetch-depth: 1
127127
persist-credentials: false
128128

129129
- name: ⎔ Setup pnpm

.github/workflows/unit-tests-webapp.yml

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,8 @@ jobs:
1616
name: "🧪 Unit Tests: Webapp"
1717
runs-on: ubuntu-latest
1818
strategy:
19+
# one flaky shard shouldn't cancel its siblings - lets us re-run only the failed shard
20+
fail-fast: false
1921
matrix:
2022
shardIndex: [1, 2, 3, 4, 5, 6, 7, 8]
2123
shardTotal: [8]
@@ -53,7 +55,7 @@ jobs:
5355
- name: ⬇️ Checkout repo
5456
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
5557
with:
56-
fetch-depth: 0
58+
fetch-depth: 1
5759
persist-credentials: false
5860

5961
- name: ⎔ Setup pnpm
@@ -84,9 +86,10 @@ jobs:
8486
echo "Pre-pulling Docker images with authenticated session..."
8587
docker pull postgres:14
8688
docker pull clickhouse/clickhouse-server:25.4-alpine
87-
docker pull redis:7-alpine
88-
docker pull testcontainers/ryuk:0.11.0
89+
docker pull redis:7.2
90+
docker pull testcontainers/ryuk:0.14.0
8991
docker pull electricsql/electric:1.2.4
92+
docker pull minio/minio:latest
9093
echo "Image pre-pull complete"
9194
9295
- name: 📥 Download deps
@@ -131,7 +134,7 @@ jobs:
131134
- name: ⬇️ Checkout repo
132135
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
133136
with:
134-
fetch-depth: 0
137+
fetch-depth: 1
135138
persist-credentials: false
136139

137140
- name: ⎔ Setup pnpm
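
The pre-pull changes in the three workflows come down to keeping two image lists in sync: what the workflow pulls with an authenticated session, and what the tests pull at run time. A minimal sketch of that comparison follows. The helper `missingFromPrePull` is hypothetical and not part of the repo, and the "pulled by tests" list is assumed from the commit message rather than read from the testcontainers setup.

```typescript
// Hypothetical helper: returns every image the tests pull that the
// authenticated pre-pull step does not cover.
function missingFromPrePull(
  prePulled: readonly string[],
  pulledByTests: readonly string[]
): string[] {
  const have = new Set(prePulled);
  return pulledByTests.filter((image) => !have.has(image));
}

// Webapp pre-pull list before this commit (taken from the diff).
const oldPrePull = [
  "postgres:14",
  "clickhouse/clickhouse-server:25.4-alpine",
  "redis:7-alpine",
  "testcontainers/ryuk:0.11.0",
  "electricsql/electric:1.2.4",
];

// Assumed from the commit message: the tags the webapp tests actually pull.
const pulledByWebappTests = [
  "postgres:14",
  "clickhouse/clickhouse-server:25.4-alpine",
  "redis:7.2",
  "testcontainers/ryuk:0.14.0",
  "electricsql/electric:1.2.4",
  "minio/minio:latest",
];

// Images that would have been pulled unauthenticated at test time.
const uncovered = missingFromPrePull(oldPrePull, pulledByWebappTests);
console.log(uncovered);
```

A check like this could run as a CI step so the two lists cannot drift apart silently, though the commit itself only updates the tags by hand.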

internal-packages/run-engine/vitest.config.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,8 @@ export default defineConfig({
44
test: {
55
include: ["**/*.test.ts"],
66
globals: true,
7+
// CI-only: absorbs timing races (real-clock waits vs worker poll interval) under shard CPU contention
8+
retry: process.env.CI ? 2 : 0,
79
isolate: true,
810
fileParallelism: false,
911
testTimeout: 120_000,

internal-packages/schedule-engine/vitest.config.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,8 @@ import { defineConfig } from "vitest/config";
33
export default defineConfig({
44
test: {
55
globals: true,
6+
// CI-only: absorbs timing races (real-clock waits vs worker poll interval) under shard CPU contention
7+
retry: process.env.CI ? 2 : 0,
68
environment: "node",
79
setupFiles: ["./test/setup.ts"],
810
testTimeout: 30000,

packages/redis-worker/vitest.config.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,8 @@ export default defineConfig({
44
test: {
55
include: ["**/*.test.ts"],
66
globals: true,
7+
// CI-only: absorbs timing races (real-clock waits vs worker poll interval) under shard CPU contention
8+
retry: process.env.CI ? 2 : 0,
79
fileParallelism: false,
810
},
911
});
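
The `retry: process.env.CI ? 2 : 0` line added to all three configs depends on JavaScript truthiness of an environment string. The sketch below isolates that expression in a hypothetical helper, `ciRetryCount`, which is not part of the repo, to show which values enable retries. GitHub Actions sets `CI=true` on its runners, so the CI branch applies there.

```typescript
// Hypothetical helper mirroring the expression `process.env.CI ? 2 : 0`.
// The environment is passed in explicitly so the logic can be checked
// without touching the real process environment.
function ciRetryCount(env: Record<string, string | undefined>): number {
  return env.CI ? 2 : 0;
}

const onCi = ciRetryCount({ CI: "true" }); // GitHub Actions sets CI=true
const local = ciRetryCount({}); // CI unset on a dev machine
const emptyString = ciRetryCount({ CI: "" }); // empty string is falsy
const literalFalse = ciRetryCount({ CI: "false" }); // non-empty string is truthy

console.log({ onCi, local, emptyString, literalFalse });
```

One consequence of the truthiness check is that exporting `CI=false` locally still enables retries, because any non-empty string is truthy. Unsetting the variable is the reliable way to get `retry: 0`.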
