fix(lakekeeper): drop per-tenant PG database+role on duckling delete#649
Merged
Conversation
PR PostHog/charts#11522 fixes the cnpg-shard path by switching the composition's cnpg-tenant-{role,database} resources to full Delete managementPolicy. The external + dev/orbstack paths need a parallel fix on the duckgres side: the lakekeeper_<orgid> Postgres database on the shared metadata store is created by the duckgres lakekeeper provisioner (EnsureDatabase + EnsureRole), and the existing DeleteForOrg only cleans up the k8s pieces. Without this drop, recreating a duckling with the same orgID fails in two ways: 1. Role password drift: provider-sql's ALTER ROLE path on cnpg-shard doesn't fire on passwordSecretRef change; on external we have no ALTER path at all, so the stale role in the shared RDS keeps the previous lifetime's password and the Lakekeeper migrate Job hits SASL fail. 2. Lakekeeper encryption-key drift: the provisioner generates a fresh encryption-key in the per-org k8s Secret on every provision, so the old encrypted rows in lakekeeper_<orgid>.* (storage profile secrets in particular) decrypt with the new key as "Wrong key or corrupt data", and every CREATE TABLE returns SecretFetchError 500. Both symptoms observed in mw-dev today on ben-cnpg-ice and ben-ext-ice after a delete + recreate cycle; ben-aur-ice survived because the per-duckling Aurora cluster is fully wiped on delete. Changes: - postgres_admin.go: add DropDatabase + DropRole (idempotent, NotFound-tolerant; DROP DATABASE WITH (FORCE) so a lingering Lakekeeper connection doesn't block, DROP OWNED CASCADE before DROP ROLE so cluster-wide grants don't keep the role pinned) - lakekeeper_provisioner.go: extend DeleteForOrg signature with ProvisioningInputs; when !PGPreProvisioned and AdminDSN is set, drop the lakekeeper_<orgid> DB+role after the k8s teardown. Best-effort: PG failures are logged and swallowed so a flaky metadata store or already-destroyed Aurora cluster doesn't permanently block duckling deletion. cnpg-shard is short-circuited because the composition (PR PostHog/charts#11522) owns its teardown. - controller.go: resolve LakekeeperInputs BEFORE deleting the Duckling CR. The CR carries the metadata-store master credentials the resolver folds into AdminDSN; once the CR is gone the resolver can't reconstruct them. Best-effort: resolve failures degrade to k8s-only teardown rather than blocking delete. - tests: live-PG round-trip for Drop helpers (PG_ADMIN_DSN-gated, same posture as existing EnsureDatabase tests); unit tests asserting that PGPreProvisioned and empty AdminDSN both skip the PG cleanup branch.
benben
added a commit
that referenced
this pull request
Jun 1, 2026
#649 wired DropDatabase+DropRole into DeleteForOrg, but the actual teardown failed in mw-dev with: drop database lakekeeper_ben_ext_ice: ERROR: must be owner of database lakekeeper_ben_ext_ice (SQLSTATE 42501) drop role lakekeeper_ben_ext_ice: ERROR: role "lakekeeper_ben_ext_ice" cannot be dropped because some objects depend on it (SQLSTATE 2BP01) EnsureRole runs ALTER DATABASE OWNER TO <role> + ALTER SCHEMA public OWNER TO <role>, so the lakekeeper_<orgid> DB is owned by the per-tenant role, not by the admin. On shared RDS (external metadata case), the admin is a non-superuser tenant of the shared cluster (e.g. ducklingexample), so DROP DATABASE fails with 42501 and the subsequent DROP ROLE fails with 2BP01 because the role still owns the database. DropDatabase: - probe pg_database; no-op if absent - look up the current owner, GRANT it to CURRENT_USER, then ALTER DATABASE OWNER TO CURRENT_USER before DROP DATABASE - GRANT + ALTER OWNER are best-effort so the path stays correct when the admin already is the owner / is a superuser DropRole: - probe pg_roles; no-op if absent - GRANT the role to CURRENT_USER so REASSIGN OWNED + DROP OWNED can run on shared RDS (the operations require role-membership in PG14+) - REASSIGN OWNED first (handles owned database objects), then DROP OWNED CASCADE (cluster-wide grants + default privileges), then DROP ROLE. The first two are best-effort; only DROP ROLE failure is returned.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Companion to posthog/charts#11522 (cnpg-shard composition cleanup).
That PR makes the cnpg-shard tenant role+DB Delete-managed; this PR
covers the external + dev/orbstack paths where the lakekeeper_
Postgres database on the shared metadata store is created by the
duckgres lakekeeper provisioner itself, not by the Crossplane
composition.
Without this drop, recreating a duckling with the same orgID fails
in two ways:
1. Role password drift
provider-sql's `ALTER ROLE` path on cnpg-shard doesn't fire on
`passwordSecretRef` change; on external we have no `ALTER` path
at all, so the stale role in the shared RDS keeps the previous
lifetime's password and the Lakekeeper migrate Job hits SASL fail.
2. Lakekeeper encryption-key drift
The provisioner generates a fresh `encryption-key` in the per-org
k8s Secret on every provision, so the old encrypted rows in
`lakekeeper_.*` (storage profile secrets in particular)
decrypt with the new key as "Wrong key or corrupt data", and every
`CREATE TABLE` returns `SecretFetchError 500`.
Observed in mw-dev today
Both symptoms surfaced on `ben-cnpg-ice` and `ben-ext-ice` after a
delete + recreate cycle; `ben-aur-ice` survived because the
per-duckling Aurora cluster is fully wiped on delete.
Changes
NotFound-tolerant; `DROP DATABASE WITH (FORCE)` so a lingering
Lakekeeper connection doesn't block, `DROP OWNED ... CASCADE`
before `DROP ROLE` so cluster-wide grants don't keep the role
pinned).
`ProvisioningInputs`; when `!PGPreProvisioned` and `AdminDSN` is
set, drop the `lakekeeper_` DB+role after the k8s
teardown. Best-effort: PG failures are logged and swallowed so a
flaky metadata store or already-destroyed Aurora cluster doesn't
permanently block duckling deletion. cnpg-shard short-circuits
because the composition (posthog/charts#11522) owns that
teardown.
Duckling CR. The CR carries the metadata-store master credentials
the resolver folds into `AdminDSN`; once the CR is gone the
resolver can't reconstruct them. Best-effort: resolve failures
degrade to k8s-only teardown rather than blocking delete.
gated, same posture as existing `EnsureDatabase` tests); unit
tests asserting that `PGPreProvisioned` and empty `AdminDSN` both
skip the PG cleanup branch.
🤖 Generated with Claude Code