Skip to content

fix(lakekeeper): drop per-tenant PG database+role on duckling delete#649

Merged
benben merged 1 commit into
mainfrom
ben/lakekeeper-drop-db-on-delete
Jun 1, 2026
Merged

fix(lakekeeper): drop per-tenant PG database+role on duckling delete#649
benben merged 1 commit into
mainfrom
ben/lakekeeper-drop-db-on-delete

Conversation

@benben
Copy link
Copy Markdown
Member

@benben benben commented Jun 1, 2026

Companion to posthog/charts#11522 (cnpg-shard composition cleanup).
That PR makes the cnpg-shard tenant role+DB Delete-managed; this PR
covers the external + dev/orbstack paths where the lakekeeper_
Postgres database on the shared metadata store is created by the
duckgres lakekeeper provisioner itself, not by the Crossplane
composition.

Without this drop, recreating a duckling with the same orgID fails
in two ways:

1. Role password drift

provider-sql's `ALTER ROLE` path on cnpg-shard doesn't fire on
`passwordSecretRef` change; on external we have no `ALTER` path
at all, so the stale role in the shared RDS keeps the previous
lifetime's password and the Lakekeeper migrate Job hits SASL fail.

2. Lakekeeper encryption-key drift

The provisioner generates a fresh `encryption-key` in the per-org
k8s Secret on every provision, so the old encrypted rows in
`lakekeeper_.*` (storage profile secrets in particular)
decrypt with the new key as "Wrong key or corrupt data", and every
`CREATE TABLE` returns `SecretFetchError 500`.

Observed in mw-dev today

Both symptoms surfaced on `ben-cnpg-ice` and `ben-ext-ice` after a
delete + recreate cycle; `ben-aur-ice` survived because the
per-duckling Aurora cluster is fully wiped on delete.

Changes

  • `postgres_admin.go`: add `DropDatabase` + `DropRole` (idempotent,
    NotFound-tolerant; `DROP DATABASE WITH (FORCE)` so a lingering
    Lakekeeper connection doesn't block, `DROP OWNED ... CASCADE`
    before `DROP ROLE` so cluster-wide grants don't keep the role
    pinned).
  • `lakekeeper_provisioner.go`: extend `DeleteForOrg` signature with
    `ProvisioningInputs`; when `!PGPreProvisioned` and `AdminDSN` is
    set, drop the `lakekeeper_` DB+role after the k8s
    teardown. Best-effort: PG failures are logged and swallowed so a
    flaky metadata store or already-destroyed Aurora cluster doesn't
    permanently block duckling deletion. cnpg-shard short-circuits
    because the composition (posthog/charts#11522) owns that
    teardown.
  • `controller.go`: resolve `LakekeeperInputs` BEFORE deleting the
    Duckling CR. The CR carries the metadata-store master credentials
    the resolver folds into `AdminDSN`; once the CR is gone the
    resolver can't reconstruct them. Best-effort: resolve failures
    degrade to k8s-only teardown rather than blocking delete.
  • tests: live-PG round-trip for the Drop helpers (`PG_ADMIN_DSN`-
    gated, same posture as existing `EnsureDatabase` tests); unit
    tests asserting that `PGPreProvisioned` and empty `AdminDSN` both
    skip the PG cleanup branch.

🤖 Generated with Claude Code

PR PostHog/charts#11522 fixes the cnpg-shard path by switching the
composition's cnpg-tenant-{role,database} resources to full Delete
managementPolicy. The external + dev/orbstack paths need a parallel
fix on the duckgres side: the lakekeeper_<orgid> Postgres database
on the shared metadata store is created by the duckgres lakekeeper
provisioner (EnsureDatabase + EnsureRole), and the existing
DeleteForOrg only cleans up the k8s pieces.

Without this drop, recreating a duckling with the same orgID fails
in two ways:

1. Role password drift: provider-sql's ALTER ROLE path on cnpg-shard
   doesn't fire on passwordSecretRef change; on external we have no
   ALTER path at all, so the stale role in the shared RDS keeps the
   previous lifetime's password and the Lakekeeper migrate Job hits
   SASL fail.

2. Lakekeeper encryption-key drift: the provisioner generates a
   fresh encryption-key in the per-org k8s Secret on every provision,
   so the old encrypted rows in lakekeeper_<orgid>.* (storage profile
   secrets in particular) decrypt with the new key as
   "Wrong key or corrupt data", and every CREATE TABLE returns
   SecretFetchError 500.

Both symptoms observed in mw-dev today on ben-cnpg-ice and
ben-ext-ice after a delete + recreate cycle; ben-aur-ice survived
because the per-duckling Aurora cluster is fully wiped on delete.

Changes:
- postgres_admin.go: add DropDatabase + DropRole (idempotent,
  NotFound-tolerant; DROP DATABASE WITH (FORCE) so a lingering
  Lakekeeper connection doesn't block, DROP OWNED CASCADE before
  DROP ROLE so cluster-wide grants don't keep the role pinned)
- lakekeeper_provisioner.go: extend DeleteForOrg signature with
  ProvisioningInputs; when !PGPreProvisioned and AdminDSN is set,
  drop the lakekeeper_<orgid> DB+role after the k8s teardown.
  Best-effort: PG failures are logged and swallowed so a flaky
  metadata store or already-destroyed Aurora cluster doesn't
  permanently block duckling deletion. cnpg-shard is short-circuited
  because the composition (PR PostHog/charts#11522) owns its
  teardown.
- controller.go: resolve LakekeeperInputs BEFORE deleting the
  Duckling CR. The CR carries the metadata-store master credentials
  the resolver folds into AdminDSN; once the CR is gone the resolver
  can't reconstruct them. Best-effort: resolve failures degrade to
  k8s-only teardown rather than blocking delete.
- tests: live-PG round-trip for Drop helpers (PG_ADMIN_DSN-gated,
  same posture as existing EnsureDatabase tests); unit tests
  asserting that PGPreProvisioned and empty AdminDSN both skip the
  PG cleanup branch.
@benben benben merged commit 5823ccd into main Jun 1, 2026
22 checks passed
@benben benben deleted the ben/lakekeeper-drop-db-on-delete branch June 1, 2026 16:13
benben added a commit that referenced this pull request Jun 1, 2026
#649 wired DropDatabase+DropRole into DeleteForOrg, but the actual
teardown failed in mw-dev with:

  drop database lakekeeper_ben_ext_ice: ERROR: must be owner of
    database lakekeeper_ben_ext_ice (SQLSTATE 42501)
  drop role lakekeeper_ben_ext_ice: ERROR: role
    "lakekeeper_ben_ext_ice" cannot be dropped because some objects
    depend on it (SQLSTATE 2BP01)

EnsureRole runs ALTER DATABASE OWNER TO <role> + ALTER SCHEMA public
OWNER TO <role>, so the lakekeeper_<orgid> DB is owned by the
per-tenant role, not by the admin. On shared RDS (external metadata
case), the admin is a non-superuser tenant of the shared cluster
(e.g. ducklingexample), so DROP DATABASE fails with 42501 and the
subsequent DROP ROLE fails with 2BP01 because the role still owns
the database.

DropDatabase:
- probe pg_database; no-op if absent
- look up the current owner, GRANT it to CURRENT_USER, then
  ALTER DATABASE OWNER TO CURRENT_USER before DROP DATABASE
- GRANT + ALTER OWNER are best-effort so the path stays correct when
  the admin already is the owner / is a superuser

DropRole:
- probe pg_roles; no-op if absent
- GRANT the role to CURRENT_USER so REASSIGN OWNED + DROP OWNED can
  run on shared RDS (the operations require role-membership in PG14+)
- REASSIGN OWNED first (handles owned database objects), then
  DROP OWNED CASCADE (cluster-wide grants + default privileges),
  then DROP ROLE. The first two are best-effort; only DROP ROLE
  failure is returned.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant