Commit f277f5f

feat(db): zero-downtime migration safety lint + db-migrate skill (#5041)
* feat(db): zero-downtime migration safety lint + db-migrate skill

  Add scripts/check-migrations-safety.ts (check:migrations), a CI gate that classifies statements in newly-added migrations into hard errors (rewrite), annotate-to-acknowledge contract ops (`-- migration-safe: <reason>`), and backfill warnings. Wire it into test-build.yml. Add the /db-migrate skill as the judgment half (expand/contract phasing, app-code cross-ref, annotation authoring).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(skills): run cleanup and db-migrate safety checks in /ship

* fix(db): address review — DROP INDEX lock symmetry, RENAME CONSTRAINT false-positive, alter-type literal match

  - Non-concurrent DROP INDEX is now a hard error (ACCESS EXCLUSIVE lock), symmetric with CREATE INDEX; DROP INDEX CONCURRENTLY after a COMMIT passes clean. Removes the false-confidence annotate path.
  - RENAME rule narrowed to RENAME COLUMN / table RENAME TO; RENAME CONSTRAINT and ALTER INDEX ... RENAME (metadata-only) no longer flagged.
  - alter-type regex now requires TYPE to follow the column identifier, so it no longer matches TYPE inside a string default.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(db): enforce IF EXISTS on DROP INDEX CONCURRENTLY for replay idempotency

  Symmetric with the CREATE INDEX CONCURRENTLY rule: a post-COMMIT DROP INDEX CONCURRENTLY replays from the top on failure, so without IF EXISTS it aborts re-dropping an already-gone index.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* improvement(skills): gate /ship cleanup on UI changes; default migration base to staging

  - /ship runs /cleanup only when the diff touches UI code (.tsx or apps/sim/components|hooks|stores); the six passes are React-only.
  - /ship runs check:migrations against origin/staging (the PR base).
  - check:migrations default baseRef is now origin/staging instead of origin/main.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(db-migrate): add contract-pending TODO convention for deferred drops

  Establishes a durable, greppable marker (`contract-pending(<precondition>): ...`) left on the legacy column in schema.ts when an expand defers a drop, so the contract phase doesn't rot. The outstanding-work list is `grep -rn contract-pending`; the contract PR's `-- migration-safe:` annotation references the expand and deletes the marker.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent a49e755 commit f277f5f

6 files changed

Lines changed: 802 additions & 6 deletions


.agents/skills/db-migrate/SKILL.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
---
name: db-migrate
description: Author or review a Drizzle DB migration for zero-downtime safety — expand/contract phasing, backward-compatibility with the deployed app version, and writing the `-- migration-safe` acknowledgment the check:migrations lint requires. Use when adding/editing files under `packages/db/migrations/` or changing `packages/db/schema.ts`.
---

# DB Migrate Skill

You make schema changes that survive a deploy without downtime. The `check:migrations` lint (`scripts/check-migrations-safety.ts`) is the deterministic gate; you are the judgment that decides whether a flagged change is actually safe and writes the annotation that satisfies it.

## The window (why this matters)

A deploy runs the migration, then rolls out the new app image via blue/green. The two are **not atomic and cannot be** — during cutover the old task set keeps serving against the **already-migrated** schema. So:

> Every migration must be backward-compatible with the app version that is *already deployed*.

If a migration drops a column the old code still reads, renames one, or adds a `NOT NULL` the old inserts don't populate, the old code throws until traffic fully shifts — the downtime we're guarding against. You can't fix this by reordering the pipeline; the only fix is discipline.

## Expand / contract

Split every breaking change across **two deploys**:

1. **Expand** (this PR): additive, backward-compatible schema + code that tolerates *both* the old and new shape.
2. **Contract** (a later PR, after expand is fully deployed): remove the old thing, now that nothing reads it.

Never put expand and contract in the same PR. If this PR both removes the code that used a column *and* drops the column, the old code is still live during cutover — split it.
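To make "code that tolerates both the old and new shape" concrete, here is a minimal sketch of the expand phase of a column rename. The `UserRow` shape and the `readName` / `writeName` helpers are hypothetical and not taken from this repo; the sketch only illustrates the read-new-then-old and dual-write pattern.

```typescript
// Hypothetical row shape mid-rename: both columns exist during the expand phase.
interface UserRow {
  fullName: string | null // new column, added nullable by the expand migration
  name: string | null // legacy column, dropped later by the contract migration
}

// Read new-then-old so rows written by the old app version still resolve.
function readName(row: UserRow): string | null {
  return row.fullName ?? row.name
}

// Dual-write so the old app version, which only knows `name`, keeps working.
function writeName(row: UserRow, value: string): UserRow {
  return { ...row, fullName: value, name: value }
}
```

Once every deployed version reads only the new column, the fallback in `readName` and the legacy write in `writeName` can be removed, and the drop can follow in a later deploy.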

### Per-operation playbook

| You want to | Do (deploy 1 = expand) | Do (deploy 2 = contract) |
|---|---|---|
| Add a required column | `ADD COLUMN` nullable or `DEFAULT`; code writes it | backfill, then `SET NOT NULL` |
| Rename a column/table | add the new name; code dual-writes / reads new-then-old | drop the old name |
| Drop a column/table | stop all reads/writes in code; ship it | `DROP` (annotate) |
| Change a column type | add a new column of the new type; dual-write | backfill, swap reads, drop old |
| Add FK / CHECK | `ADD CONSTRAINT ... NOT VALID` | `VALIDATE CONSTRAINT` separately |
| Index an existing table | `COMMIT;` breakpoint → `SET lock_timeout = 0` → `CREATE INDEX CONCURRENTLY IF NOT EXISTS` (see `packages/db/scripts/migrate.ts`) ||
| Drop an index | `COMMIT;` breakpoint → `DROP INDEX CONCURRENTLY` — plain `DROP INDEX` takes ACCESS EXCLUSIVE on the table ||
| Backfill data | batched + idempotent `UPDATE` (keyset/`WHERE`, bounded) ||

A `CREATE INDEX`, `ADD COLUMN`, or `ADD CONSTRAINT` against a table **created in the same migration** is always safe (no rows, no live traffic) — the lint already suppresses those.
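The "Backfill data" row asks for a batched, idempotent, keyset-bounded `UPDATE`. The loop below is a minimal sketch of that shape, simulated over an in-memory array instead of a real database. `Row`, `backfillBatch`, and `runBackfill` are hypothetical names, and the SQL in the comment is only an approximation of what each batch stands in for.

```typescript
// Hypothetical in-memory stand-in for the table being backfilled.
interface Row {
  id: number
  legacyName: string
  displayName: string | null
}

// One bounded batch. Roughly the shape of:
//   UPDATE t SET display_name = legacy_name
//   WHERE id IN (SELECT id FROM t
//                WHERE id > $cursor AND display_name IS NULL
//                ORDER BY id LIMIT $batchSize)
// The IS NULL filter makes a replay a no-op for rows that are already filled.
function backfillBatch(
  rows: Row[],
  cursor: number,
  batchSize: number
): { updated: number; cursor: number } {
  const batch = rows
    .filter((r) => r.id > cursor && r.displayName === null)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize)
  for (const row of batch) {
    row.displayName = row.legacyName
  }
  const last = batch[batch.length - 1]
  return { updated: batch.length, cursor: last ? last.id : cursor }
}

// Keyset loop: advance the cursor until a batch comes back empty.
function runBackfill(rows: Row[], batchSize: number): number {
  let cursor = Number.NEGATIVE_INFINITY
  let total = 0
  for (;;) {
    const result = backfillBatch(rows, cursor, batchSize)
    if (result.updated === 0) return total
    total += result.updated
    cursor = result.cursor
  }
}
```

Running `runBackfill` a second time over the same rows updates nothing, which is the replay property the playbook asks for.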

## Tracking the contract (don't let it rot)

The contract half is deferred to a later deploy — and that is exactly when it gets forgotten, leaving dead columns, orphaned tables, and `NOT NULL`s that never land. Every deferred contract must become a durable, greppable TODO.

When an expand defers a drop, leave a **`contract-pending`** marker on the legacy column/table in `packages/db/schema.ts` — that is the file you will be editing when you finally do the drop, so the reminder lives where the work happens:

```ts
// contract-pending(after #5035 is fully deployed): drop once permission-check.ts stops reading it
workspaceId: text('workspace_id'),
```

Format: `contract-pending(<precondition>): <what to drop> — <why it's safe once the precondition holds>`. The precondition names the PR/release that removes the last reader and **must be fully deployed** before the contract ships.

- **The TODO list is a grep** — always accurate, never drifts: `grep -rn "contract-pending" packages/db apps/sim`. Run it when starting migration work to see what is owed.
- For anything with a real owner or schedule, also open a tracking issue and put its number in the marker.
- **Close the loop in the contract PR:** the contract migration's `-- migration-safe:` annotation references the expand, and you **delete the `contract-pending` marker** in the same PR:
  ```sql
  -- migration-safe: contract of #5035 — workspace_id readers removed there, deployed 2026-06-10
  ALTER TABLE "permission_group" DROP COLUMN "workspace_id";
  ```
- An expand merged **without** a marker for the drop it defers, or a contract merged **without** removing its marker, is a bug — flag it in review.
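Because the marker has a fixed `contract-pending(<precondition>): ...` prefix, it can also be parsed rather than only grepped. The following is a minimal sketch, not part of this commit: `ContractMarker` and `findContractMarkers` are hypothetical names, and the sketch assumes the precondition itself contains no closing parenthesis.

```typescript
interface ContractMarker {
  file: string
  line: number // 1-based line number of the marker
  precondition: string
  note: string
}

// Matches `contract-pending(<precondition>): <rest of the line>`.
const MARKER = /contract-pending\(([^)]*)\):\s*(.*)$/

function findContractMarkers(file: string, source: string): ContractMarker[] {
  const markers: ContractMarker[] = []
  source.split(/\r?\n/).forEach((text, index) => {
    const match = MARKER.exec(text)
    if (match) {
      markers.push({
        file,
        line: index + 1,
        precondition: (match[1] ?? '').trim(),
        note: (match[2] ?? '').trim(),
      })
    }
  })
  return markers
}
```

A report built on this could list each owed contract next to its precondition, which makes it easier to check whether that precondition has actually been deployed.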

## The judgment the lint can't do

The lint flags risky *shapes*; it cannot know whether a given drop is *safe right now*. For each flagged statement, do the work it can't:

1. **Is the dependency gone?** Grep the app for the table/column: search `apps/sim` and `packages` for the column name, the Drizzle field (camelCase), and the table object. If any live read/write remains, it is **not** safe — fix the code first.
2. **Did the expand already ship?** The removal of that read/write must be in a deploy that is *already out*, not this same PR. If it's in this PR, split: land the code change now, do the destructive migration in a follow-up after it deploys.
3. **Backfills:** confirm the `UPDATE`/`DELETE` is batched (bounded `WHERE`/keyset, not a single whole-table statement), idempotent (safe to replay — a failed migration re-runs unjournaled files from the top), and safe under concurrent writes from the still-live old app.
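Step 1 above asks for several spellings of the same identifier. A small helper can derive them, as sketched below. `snakeToCamel` and `searchTerms` are hypothetical names, and the assumption that the Drizzle field and table object are the camelCase forms of the SQL names is a convention guess, not something the lint checks.

```typescript
// Convert a snake_case SQL identifier to the camelCase form a Drizzle schema
// field would typically use (an assumption about naming, not a guarantee).
function snakeToCamel(identifier: string): string {
  return identifier.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase())
}

// Every spelling worth searching for before declaring a drop safe.
function searchTerms(table: string, column: string): string[] {
  const terms = [column, snakeToCamel(column), table, snakeToCamel(table)]
  return Array.from(new Set(terms)) // de-duplicate, keeping first-seen order
}
```

Each returned term would then be searched for under `apps/sim` and `packages`; any hit that is a live read or write means the drop is not yet safe.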

## Workflow

1. Edit `packages/db/schema.ts`, then `cd packages/db && bunx drizzle-kit generate` to produce the SQL. If this is an expand that defers a drop, leave a `contract-pending` marker on the legacy column (see "Tracking the contract"). If this is the contract, delete the marker it resolves.
2. Hand-edit the generated SQL where the playbook requires it: `CONCURRENTLY` + `COMMIT;` breakpoint for indexes on existing tables, `NOT VALID` for constraints, batching for backfills.
3. Run `bun run check:migrations` (base defaults to `origin/staging`).
   - **Hard errors** (`add-not-null-no-default`, `rename`, `index-not-concurrent`, `constraint-not-valid`, …): rewrite into expand/contract. Do **not** try to annotate them away — the lint won't accept it.
   - **Annotate tier** (`drop-table`, `drop-column`, `drop-default`, `set-not-null`, `alter-type`, `drop-index`): only after you've confirmed steps 1–3 above, add a comment on the line directly above the statement:
     ```sql
     -- migration-safe: `secret` read removed in v0.6.1 (#1234), shipped two deploys ago
     ALTER TABLE "webhook" DROP COLUMN "secret";
     ```
     The reason must be specific and name the PR/version that removed the dependency. An empty reason fails the lint.
   - **Warnings** (`data-backfill`): non-blocking, but confirm the batching/idempotency before merging.
4. Verify locally: `cd packages/db && bun run db:migrate` against a dev DB.
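The annotate-tier rule in step 3 (a `-- migration-safe:` comment on the line directly above the statement, with a non-empty reason) can be sketched as a small lookup. This is an illustration of the rule as described here, not the implementation in `scripts/check-migrations-safety.ts`; `Annotation` and `annotationAbove` are hypothetical names.

```typescript
type Annotation =
  | { kind: 'ok'; reason: string }
  | { kind: 'empty' } // marker present but no reason given
  | { kind: 'missing' } // no marker on the line directly above

const PREFIX = '-- migration-safe:'

// Inspect only the line directly above the statement, so an annotation
// written for an earlier statement cannot bleed onto a later one.
function annotationAbove(lines: string[], statementIndex: number): Annotation {
  const above = statementIndex > 0 ? (lines[statementIndex - 1] ?? '').trim() : ''
  if (!above.startsWith(PREFIX)) return { kind: 'missing' }
  const reason = above.slice(PREFIX.length).trim()
  return reason.length > 0 ? { kind: 'ok', reason } : { kind: 'empty' }
}
```

Under this sketch, `missing` and `empty` would both block the merge; only `ok` carries a reason a reviewer can check against the PR or version it names.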

## Hard rule

Never annotate a destructive statement just to make the lint pass. The annotation is a claim that you verified the old code no longer depends on it. If you can't make that claim truthfully, the change belongs in a later deploy — tell the user to split it.

.agents/skills/ship/SKILL.md

Lines changed: 11 additions & 6 deletions
@@ -1,6 +1,6 @@
 ---
 name: ship
-description: Commit, push, and open a PR to staging in one shot
+description: Commit, push, and open a PR to staging in one shot — runs the cleanup pass and, when migrations changed, the db-migrate safety review first
 ---

 # Ship Command
@@ -16,12 +16,17 @@ When the user runs `/ship`:
    - Types: `fix`, `feat`, `improvement`, `chore`
    - Scope: short identifier (e.g., `undo-redo`, `api`, `ui`)
    - Keep it concise
-3. **Run pre-ship checks** from the repo root before staging:
+3. **Run the cleanup pass** — only if the diff modifies UI code (any `.tsx` file, or anything under `apps/sim/components/`, `apps/sim/hooks/`, or `apps/sim/stores/`): `/cleanup`
+   - The six code-quality skills (effects, memo, callbacks, state, React Query, emcn) only apply to React code, so skip this step entirely when no UI was touched. When it runs, it applies fixes so they land in this commit.
+4. **Run migration safety** — only if the diff touches `packages/db/migrations/**` or `packages/db/schema.ts`:
+   - Run `/db-migrate` to review the migration for zero-downtime safety (expand/contract phasing, backward-compatibility with the deployed app version).
+   - `bun run check:migrations origin/staging` must pass (staging is the PR base). Do not silence a flagged statement with a `-- migration-safe:` annotation unless `/db-migrate` confirmed the old code no longer depends on it; otherwise split the destructive change into a later deploy.
+5. **Run pre-ship checks** from the repo root before staging:
    - `bun run lint` to fix formatting issues
    - `bun run check:api-validation:strict` to catch boundary contract failures before CI
-4. **Stage and commit** the changes with the generated message
-5. **Push to origin** using the current branch name
-6. **Create a PR** to staging with a description in the user's voice
+6. **Stage and commit** the changes with the generated message
+7. **Push to origin** using the current branch name
+8. **Create a PR** to staging with a description in the user's voice

 ## Commit Message Format

@@ -77,7 +82,7 @@ gh pr create --base staging --title "COMMIT_MESSAGE" --body "PR_BODY"

 - Short, direct bullet points
 - No unnecessary explanation
-- "Tested manually" is acceptable for testing section; include lint and boundary validation results when run
+- "Tested manually" is acceptable for testing section; include lint, boundary validation, and (when migrations changed) `check:migrations` results when run
 - Checkboxes filled in appropriately
 - No screenshots section unless UI changes
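Steps 3 and 4 of the updated `/ship` flow are gated on which paths the diff touches. A minimal sketch of those two predicates follows. `touchesUi` and `touchesMigrations` are hypothetical names, and the sketch assumes repo-root-relative paths with forward slashes, such as the output of `git diff --name-only`.

```typescript
const UI_DIRS = ['apps/sim/components/', 'apps/sim/hooks/', 'apps/sim/stores/']

// Step 3 gate: any .tsx file, or anything under the three UI directories.
function touchesUi(paths: string[]): boolean {
  return paths.some((p) => p.endsWith('.tsx') || UI_DIRS.some((dir) => p.startsWith(dir)))
}

// Step 4 gate: any file under packages/db/migrations/, or the schema file itself.
function touchesMigrations(paths: string[]): boolean {
  return paths.some(
    (p) => p.startsWith('packages/db/migrations/') || p === 'packages/db/schema.ts'
  )
}
```

The two gates are independent, so a diff that changes both a component and a migration would trigger both steps.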
.github/workflows/test-build.yml

Lines changed: 10 additions & 0 deletions
@@ -118,6 +118,16 @@ jobs:
       - name: Verify realtime prune graph
         run: bun run check:realtime-prune

+      - name: Migration safety (zero-downtime) audit
+        run: |
+          if [ "${{ github.event_name }}" = "pull_request" ]; then
+            BASE_REF="origin/${{ github.base_ref }}"
+            git fetch --depth=1 origin "${{ github.base_ref }}" 2>/dev/null || true
+          else
+            BASE_REF="HEAD~1"
+          fi
+          bun run check:migrations "$BASE_REF"
+
       - name: Type-check realtime server
         run: bunx turbo run type-check --filter=@sim/realtime
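The base-ref choice in the new workflow step reduces to one branch: pull requests diff against their base branch, everything else against the previous commit. A minimal sketch of that decision, with `resolveBaseRef` as a hypothetical helper (the `git fetch` side effect in the shell script is not modeled):

```typescript
// Mirrors the shell branch: PRs compare against origin/<base>, pushes against HEAD~1.
function resolveBaseRef(eventName: string, baseRef: string): string {
  return eventName === 'pull_request' ? `origin/${baseRef}` : 'HEAD~1'
}
```

For a PR into `staging` this gives `origin/staging`, which matches the lint's own default base.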
package.json

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@
     "check:zustand-v5": "bun run scripts/check-zustand-v5-selectors.ts",
     "check:react-query": "bun run scripts/check-react-query-patterns.ts --check",
     "check:utils": "bun run scripts/check-utils-enforcement.ts",
+    "check:migrations": "bun run scripts/check-migrations-safety.ts",
     "mship-contracts:generate": "bun run scripts/sync-mothership-stream-contract.ts",
     "mship-contracts:check": "bun run scripts/sync-mothership-stream-contract.ts --check",
     "mship-tools:generate": "bun run scripts/sync-tool-catalog.ts",
scripts/check-migrations-safety.test.ts

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
/**
 * Run with: bun test scripts/check-migrations-safety.test.ts
 * (Root scripts are bun-native and not part of the turbo/vitest workspaces.)
 */
import { describe, expect, test } from 'bun:test'
import { lintSql } from './check-migrations-safety.ts'

const rules = (sql: string) => lintSql(sql).map((f) => `${f.tier}:${f.rule}`)

describe('additive / safe', () => {
  test('nullable add column passes', () => {
    expect(lintSql('ALTER TABLE "webhook" ADD COLUMN "provider_config" json;')).toEqual([])
  })

  test('NOT NULL with DEFAULT passes', () => {
    expect(lintSql('ALTER TABLE "user" ADD COLUMN "flag" boolean DEFAULT false NOT NULL;')).toEqual(
      []
    )
  })

  test('CREATE TABLE plus index and FK on that new table passes', () => {
    const sql = `CREATE TABLE "kb" ("id" text PRIMARY KEY NOT NULL, "user_id" text NOT NULL);
--> statement-breakpoint
CREATE INDEX "kb_user_id_idx" ON "kb" USING btree ("user_id");
--> statement-breakpoint
ALTER TABLE "kb" ADD CONSTRAINT "kb_user_fk" FOREIGN KEY ("user_id") REFERENCES "user"("id");`
    expect(lintSql(sql)).toEqual([])
  })

  test('CONCURRENTLY index after a COMMIT breakpoint passes', () => {
    const sql = `COMMIT;
--> statement-breakpoint
SET lock_timeout = 0;
--> statement-breakpoint
CREATE INDEX CONCURRENTLY IF NOT EXISTS "idx_x" ON "embedding" ("kb_id");`
    expect(lintSql(sql)).toEqual([])
  })
})

describe('hard errors', () => {
  test('ADD COLUMN NOT NULL without default', () => {
    expect(rules('ALTER TABLE "user" ADD COLUMN "email" text NOT NULL;')).toEqual([
      'error:add-not-null-no-default',
    ])
  })

  test('RENAME column', () => {
    expect(rules('ALTER TABLE "marketplace" RENAME COLUMN "executions" TO "views";')).toEqual([
      'error:rename',
    ])
  })

  test('CREATE INDEX on existing table without CONCURRENTLY', () => {
    expect(rules('CREATE INDEX "idx_y" ON "embedding" ("kb_id");')).toEqual([
      'error:index-not-concurrent',
    ])
  })

  test('CONCURRENTLY index without IF NOT EXISTS', () => {
    const sql = `COMMIT;
--> statement-breakpoint
CREATE INDEX CONCURRENTLY "idx_z" ON "embedding" ("kb_id");`
    expect(rules(sql)).toEqual(['error:concurrent-index-not-idempotent'])
  })

  test('CONCURRENTLY index without a preceding COMMIT', () => {
    expect(
      rules('CREATE INDEX CONCURRENTLY IF NOT EXISTS "idx_z" ON "embedding" ("kb_id");')
    ).toEqual(['error:concurrent-index-no-commit'])
  })

  test('ADD FOREIGN KEY on existing table without NOT VALID', () => {
    expect(
      rules(
        'ALTER TABLE "session" ADD CONSTRAINT "s_fk" FOREIGN KEY ("uid") REFERENCES "user"("id");'
      )
    ).toEqual(['error:constraint-not-valid'])
  })
})

describe('annotate tier', () => {
  const drop = 'ALTER TABLE "webhook" DROP COLUMN "secret";'

  test('DROP COLUMN unannotated fails', () => {
    expect(rules(drop)).toEqual(['error:drop-column'])
  })

  test('DROP COLUMN annotated passes', () => {
    const sql = `-- migration-safe: secret read removed in v0.6.1 (#1234), shipped two deploys ago\n${drop}`
    expect(lintSql(sql)).toEqual([])
  })

  test('annotation tolerates an intervening statement-breakpoint line', () => {
    const sql = `ALTER TABLE "webhook" ADD COLUMN "provider_config" json;
--> statement-breakpoint
-- migration-safe: secret read removed in v0.6.1 (#1234)
${drop}`
    expect(lintSql(sql)).toEqual([])
  })

  test('dangling annotation with empty reason fails', () => {
    const sql = `-- migration-safe:\n${drop}`
    const found = lintSql(sql)
    expect(found).toHaveLength(1)
    expect(found[0].tier).toBe('error')
    expect(found[0].message).toContain('no reason')
  })

  test('annotation on the wrong statement does not bleed', () => {
    const sql = `-- migration-safe: removing secret
ALTER TABLE "webhook" ADD COLUMN "x" json;
--> statement-breakpoint
${drop}`
    expect(rules(sql)).toEqual(['error:drop-column'])
  })

  test('type change and DROP TABLE are annotate-tier', () => {
    expect(
      rules(
        'ALTER TABLE "user_table_rows" ALTER COLUMN "order_key" SET DATA TYPE text COLLATE "C";'
      )
    ).toEqual(['error:alter-type'])
    expect(rules('DROP TABLE "marketplace_execution" CASCADE;')).toEqual(['error:drop-table'])
  })
})

describe('warnings (non-blocking)', () => {
  test('UPDATE backfill warns but does not error', () => {
    const found = lintSql('UPDATE "user_table_definitions" SET "schema" = \'{}\' WHERE id = \'1\';')
    expect(found.map((f) => f.tier)).toEqual(['warn'])
  })

  test('UPDATE without WHERE flags the whole-table note', () => {
    const found = lintSql('UPDATE "user" SET "active" = true;')
    expect(found[0].tier).toBe('warn')
    expect(found[0].message).toContain('no WHERE')
  })
})

describe('review fixes', () => {
  test('RENAME CONSTRAINT is metadata-only — not flagged', () => {
    expect(
      lintSql('ALTER TABLE "permission_group" RENAME CONSTRAINT "old_fk" TO "new_fk";')
    ).toEqual([])
  })

  test('ALTER INDEX ... RENAME is metadata-only — not flagged', () => {
    expect(lintSql('ALTER INDEX "old_idx" RENAME TO "new_idx";')).toEqual([])
  })

  test('table RENAME TO is still a hard error', () => {
    expect(rules('ALTER TABLE "marketplace" RENAME TO "listings";')).toEqual(['error:rename'])
  })

  test('plain DROP INDEX is a hard error (ACCESS EXCLUSIVE lock)', () => {
    expect(rules('DROP INDEX "permission_group_workspace_name_unique";')).toEqual([
      'error:drop-index-not-concurrent',
    ])
  })

  test('DROP INDEX CONCURRENTLY after a COMMIT passes clean', () => {
    const sql = `COMMIT;
--> statement-breakpoint
DROP INDEX CONCURRENTLY IF EXISTS "stale_idx";`
    expect(lintSql(sql)).toEqual([])
  })

  test('DROP INDEX CONCURRENTLY without IF EXISTS is not idempotent', () => {
    const sql = `COMMIT;
--> statement-breakpoint
DROP INDEX CONCURRENTLY "stale_idx";`
    expect(rules(sql)).toEqual(['error:concurrent-drop-index-not-idempotent'])
  })

  test('DROP INDEX CONCURRENTLY without a preceding COMMIT errors', () => {
    expect(rules('DROP INDEX CONCURRENTLY IF EXISTS "stale_idx";')).toEqual([
      'error:concurrent-drop-index-no-commit',
    ])
  })

  test('alter-type does not match TYPE inside a string default', () => {
    expect(lintSql(`ALTER TABLE "x" ALTER COLUMN "y" SET DEFAULT 'change TYPE later';`)).toEqual([])
  })
})

describe('parser robustness', () => {
  test('semicolon inside a string literal does not split', () => {
    expect(lintSql(`ALTER TABLE "x" ADD COLUMN "y" text DEFAULT 'a;b' NOT NULL;`)).toEqual([])
  })

  test('dollar-quoted DO block is one statement; FK on a new table is suppressed', () => {
    const sql = `CREATE TABLE "jobs" ("id" text PRIMARY KEY NOT NULL, "wid" text NOT NULL);
--> statement-breakpoint
DO $$ BEGIN
ALTER TABLE "jobs" ADD CONSTRAINT "jobs_fk" FOREIGN KEY ("wid") REFERENCES "workspace"("id");
EXCEPTION WHEN duplicate_object THEN null;
END $$;`
    expect(lintSql(sql)).toEqual([])
  })
})
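The two parser-robustness cases imply that a migration cannot be split on every `;`. The real splitter lives in `scripts/check-migrations-safety.ts` and is not reproduced in this excerpt, so what follows is an independent sketch of one way to satisfy those cases. `splitStatements` is a hypothetical name, and the sketch deliberately ignores block comments and semicolons inside double-quoted identifiers.

```typescript
// Split a migration into statements without breaking on a ";" that sits
// inside a string literal, a dollar-quoted body, or a "--" line comment.
function splitStatements(sql: string): string[] {
  const statements: string[] = []
  let current = ''
  let i = 0

  // Append sql[i..end) verbatim and move the cursor past it.
  const take = (end: number): void => {
    current += sql.slice(i, end)
    i = end
  }

  while (i < sql.length) {
    const ch = sql.charAt(i)

    if (ch === '-' && sql.charAt(i + 1) === '-') {
      // Line comment (this also covers "--> statement-breakpoint").
      const newline = sql.indexOf('\n', i)
      take(newline === -1 ? sql.length : newline)
    } else if (ch === "'") {
      // Single-quoted literal; a doubled '' is an escaped quote, not the end.
      let j = i + 1
      while (j < sql.length) {
        if (sql.charAt(j) === "'") {
          if (sql.charAt(j + 1) !== "'") break
          j += 1
        }
        j += 1
      }
      take(Math.min(j + 1, sql.length))
    } else if (ch === '$') {
      // Dollar-quoted body: $$ ... $$ or $tag$ ... $tag$.
      const open = /^\$(?:[A-Za-z_][A-Za-z0-9_]*)?\$/.exec(sql.slice(i))
      if (open) {
        const tag = open[0]
        const close = sql.indexOf(tag, i + tag.length)
        take(close === -1 ? sql.length : close + tag.length)
      } else {
        take(i + 1)
      }
    } else if (ch === ';') {
      // Statement boundary: emit what was collected, without the semicolon.
      if (current.trim() !== '') statements.push(current.trim())
      current = ''
      i += 1
    } else {
      take(i + 1)
    }
  }

  if (current.trim() !== '') statements.push(current.trim())
  return statements
}
```

With this shape, the `DO $$ ... $$` block from the last test stays one statement even though it contains two semicolons, and `DEFAULT 'a;b'` does not split the `ALTER TABLE` around it.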
