fix(lakekeeper): drop per-tenant PG database+role on duckling delete by benben · Pull Request #649 · PostHog/duckgres

benben · 2026-06-01T16:02:53Z

Companion to posthog/charts#11522 (cnpg-shard composition cleanup).
That PR makes the cnpg-shard tenant role+DB Delete-managed; this PR
covers the external + dev/orbstack paths where the lakekeeper_
Postgres database on the shared metadata store is created by the
duckgres lakekeeper provisioner itself, not by the Crossplane
composition.

Without this drop, recreating a duckling with the same orgID fails
in two ways:

1. Role password drift

provider-sql's `ALTER ROLE` path on cnpg-shard doesn't fire on
`passwordSecretRef` change; on external we have no `ALTER` path
at all, so the stale role in the shared RDS keeps the previous
lifetime's password and the Lakekeeper migrate Job hits SASL fail.

2. Lakekeeper encryption-key drift

The provisioner generates a fresh `encryption-key` in the per-org
k8s Secret on every provision, so the old encrypted rows in
`lakekeeper_.*` (storage profile secrets in particular)
decrypt with the new key as "Wrong key or corrupt data", and every
`CREATE TABLE` returns `SecretFetchError 500`.

Observed in mw-dev today

Both symptoms surfaced on `ben-cnpg-ice` and `ben-ext-ice` after a
delete + recreate cycle; `ben-aur-ice` survived because the
per-duckling Aurora cluster is fully wiped on delete.

Changes

`postgres_admin.go`: add `DropDatabase` + `DropRole` (idempotent,
NotFound-tolerant; `DROP DATABASE WITH (FORCE)` so a lingering
Lakekeeper connection doesn't block, `DROP OWNED ... CASCADE`
before `DROP ROLE` so cluster-wide grants don't keep the role
pinned).
`lakekeeper_provisioner.go`: extend `DeleteForOrg` signature with
`ProvisioningInputs`; when `!PGPreProvisioned` and `AdminDSN` is
set, drop the `lakekeeper_` DB+role after the k8s
teardown. Best-effort: PG failures are logged and swallowed so a
flaky metadata store or already-destroyed Aurora cluster doesn't
permanently block duckling deletion. cnpg-shard short-circuits
because the composition (posthog/charts#11522) owns that
teardown.
`controller.go`: resolve `LakekeeperInputs` BEFORE deleting the
Duckling CR. The CR carries the metadata-store master credentials
the resolver folds into `AdminDSN`; once the CR is gone the
resolver can't reconstruct them. Best-effort: resolve failures
degrade to k8s-only teardown rather than blocking delete.
tests: live-PG round-trip for the Drop helpers (`PG_ADMIN_DSN`-
gated, same posture as existing `EnsureDatabase` tests); unit
tests asserting that `PGPreProvisioned` and empty `AdminDSN` both
skip the PG cleanup branch.

🤖 Generated with Claude Code

PR PostHog/charts#11522 fixes the cnpg-shard path by switching the composition's cnpg-tenant-{role,database} resources to full Delete managementPolicy. The external + dev/orbstack paths need a parallel fix on the duckgres side: the lakekeeper_<orgid> Postgres database on the shared metadata store is created by the duckgres lakekeeper provisioner (EnsureDatabase + EnsureRole), and the existing DeleteForOrg only cleans up the k8s pieces. Without this drop, recreating a duckling with the same orgID fails in two ways: 1. Role password drift: provider-sql's ALTER ROLE path on cnpg-shard doesn't fire on passwordSecretRef change; on external we have no ALTER path at all, so the stale role in the shared RDS keeps the previous lifetime's password and the Lakekeeper migrate Job hits SASL fail. 2. Lakekeeper encryption-key drift: the provisioner generates a fresh encryption-key in the per-org k8s Secret on every provision, so the old encrypted rows in lakekeeper_<orgid>.* (storage profile secrets in particular) decrypt with the new key as "Wrong key or corrupt data", and every CREATE TABLE returns SecretFetchError 500. Both symptoms observed in mw-dev today on ben-cnpg-ice and ben-ext-ice after a delete + recreate cycle; ben-aur-ice survived because the per-duckling Aurora cluster is fully wiped on delete. Changes: - postgres_admin.go: add DropDatabase + DropRole (idempotent, NotFound-tolerant; DROP DATABASE WITH (FORCE) so a lingering Lakekeeper connection doesn't block, DROP OWNED CASCADE before DROP ROLE so cluster-wide grants don't keep the role pinned) - lakekeeper_provisioner.go: extend DeleteForOrg signature with ProvisioningInputs; when !PGPreProvisioned and AdminDSN is set, drop the lakekeeper_<orgid> DB+role after the k8s teardown. Best-effort: PG failures are logged and swallowed so a flaky metadata store or already-destroyed Aurora cluster doesn't permanently block duckling deletion. cnpg-shard is short-circuited because the composition (PR PostHog/charts#11522) owns its teardown. - controller.go: resolve LakekeeperInputs BEFORE deleting the Duckling CR. The CR carries the metadata-store master credentials the resolver folds into AdminDSN; once the CR is gone the resolver can't reconstruct them. Best-effort: resolve failures degrade to k8s-only teardown rather than blocking delete. - tests: live-PG round-trip for Drop helpers (PG_ADMIN_DSN-gated, same posture as existing EnsureDatabase tests); unit tests asserting that PGPreProvisioned and empty AdminDSN both skip the PG cleanup branch.

#649 wired DropDatabase+DropRole into DeleteForOrg, but the actual teardown failed in mw-dev with: drop database lakekeeper_ben_ext_ice: ERROR: must be owner of database lakekeeper_ben_ext_ice (SQLSTATE 42501) drop role lakekeeper_ben_ext_ice: ERROR: role "lakekeeper_ben_ext_ice" cannot be dropped because some objects depend on it (SQLSTATE 2BP01) EnsureRole runs ALTER DATABASE OWNER TO <role> + ALTER SCHEMA public OWNER TO <role>, so the lakekeeper_<orgid> DB is owned by the per-tenant role, not by the admin. On shared RDS (external metadata case), the admin is a non-superuser tenant of the shared cluster (e.g. ducklingexample), so DROP DATABASE fails with 42501 and the subsequent DROP ROLE fails with 2BP01 because the role still owns the database. DropDatabase: - probe pg_database; no-op if absent - look up the current owner, GRANT it to CURRENT_USER, then ALTER DATABASE OWNER TO CURRENT_USER before DROP DATABASE - GRANT + ALTER OWNER are best-effort so the path stays correct when the admin already is the owner / is a superuser DropRole: - probe pg_roles; no-op if absent - GRANT the role to CURRENT_USER so REASSIGN OWNED + DROP OWNED can run on shared RDS (the operations require role-membership in PG14+) - REASSIGN OWNED first (handles owned database objects), then DROP OWNED CASCADE (cluster-wide grants + default privileges), then DROP ROLE. The first two are best-effort; only DROP ROLE failure is returned.

benben merged commit 5823ccd into main Jun 1, 2026
22 checks passed

benben deleted the ben/lakekeeper-drop-db-on-delete branch June 1, 2026 16:13

benben mentioned this pull request Jun 1, 2026

fix(lakekeeper): drop database when admin is not the owner #650

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lakekeeper): drop per-tenant PG database+role on duckling delete#649

fix(lakekeeper): drop per-tenant PG database+role on duckling delete#649
benben merged 1 commit into
mainfrom
ben/lakekeeper-drop-db-on-delete

benben commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

benben commented Jun 1, 2026

1. Role password drift

2. Lakekeeper encryption-key drift

Observed in mw-dev today

Changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant