fix: resolve marketplace bundle architecture blockers by bussyjd · Pull Request #541 · ObolNetwork/obol-stack

bussyjd · 2026-05-24T13:32:11Z

Summary

This branch is the corrective follow-up for the architecture review of PRs above #509 that were folded into the marketplace bundle.

It addresses review blockers from #518, #520, #533, #534, #535, #536, and #540, and resolves the #528 pre-release ownership-migration concern by removing that one-off migration surface entirely. The project is not production released, so carrying live-cluster Helm ownership mutation code is unnecessary risk; the release template now warns pre-release testers how to reset a local test stack if they hit ownership errors.

Key fixes:

Consolidate obol-frontend RBAC into one rendered template and remove the duplicate ClusterRole / ClusterRoleBinding owner.
Preserve mutable runtime ConfigMaps around obol x402 setup / EnsureVerifier so a base Helm sync cannot erase LiteLLM models or buyer auth state.
Remove the obsolete pre-release Bedag/raw-to-base migration script and upgrade doc.
Add release-template guidance for pre-release testers who hit Helm invalid ownership metadata: back up local test data, then recreate the stack with obol stack down, obol stack purge --force, obol stack init, and obol stack up.
Publish drain state through ERC-8004 registration services as available: false plus drainEndsAt during the drain grace period.
Make the serviceoffer-controller process exit when the leader-owned controller loop fails, allowing Kubernetes to restart the pod and release leadership cleanly.
Ensure expired drains still tear down routes when an agent-backed offer is no longer ready.
Align observability docs with the current metric names and label set.
Add a Helm rendered-object uniqueness check so duplicate Kubernetes identities cannot silently re-enter the base chart.

Review Map

flowchart LR
  P518["#518 leader election"] --> F["fix branch"]
  P520["#520 x402 verifier Helm sync"] --> F
  P528["#528 pre-release ownership migration"] --> F
  P533["#533 Helm smoke coverage"] --> F
  P534["#534 observability docs"] --> F
  P535["#535 drain lifecycle"] --> F
  P536["#536 marketplace bundle"] --> F
  P540["#540 frontend RBAC follow-up"] --> F

  F --> B["base: feat/marketplace-bundle"]

The common failure mode was not syntax. The reviewed PRs were mostly locally reasonable, but when composed into the bundle they created architectural conflicts: duplicate chart ownership, broad Helm reconciliation against mutable runtime state, lifecycle state that was visible in Kubernetes but not in ERC-8004 identity, and controller paths that could leave stale routes behind.

Chart Ownership

The bundle had two templates emitting the same frontend RBAC object identities:

flowchart TD
  A["base chart render"] --> B["obol-frontend.yaml"]
  A --> C["obol-frontend-rbac.yaml"]
  B --> D["ClusterRole obol-frontend-openclaw-discovery"]
  B --> E["ClusterRoleBinding obol-frontend-openclaw-discovery"]
  C --> D
  C --> E
  D --> X["duplicate object identity"]
  E --> X

This branch makes obol-frontend.yaml the single owner:

flowchart TD
  A["base chart render"] --> B["obol-frontend.yaml"]
  B --> C["HTTPRoute"]
  B --> D["ClusterRole"]
  B --> E["ClusterRoleBinding"]
  R["obol-frontend-rbac.yaml"] -. removed .-> D
  R -. removed .-> E

The final RBAC surface keeps the frontend local-operator trust boundary but removes unnecessary secrets access. It includes read-only PurchaseRequest and RegistrationRequest permissions because the frontend displays those states but does not own writes.

Runtime Config Preservation

EnsureVerifier correctly moved x402 verifier deployment to the base Helm release, but the base release also owns bootstrap ConfigMaps in the llm namespace. A full base sync can therefore overwrite runtime state created by model setup and buyer flows.

The new flow snapshots only the mutable runtime ConfigMaps, runs the canonical base sync, and restores merged data afterward:

sequenceDiagram
  participant CLI as obol x402 setup
  participant K8s as Kubernetes API
  participant Helm as helmfile base sync
  participant LiteLLM as litellm-config
  participant Buyer as buyer ConfigMaps

  CLI->>K8s: read litellm-config
  CLI->>K8s: read x402-buyer-config
  CLI->>K8s: read x402-buyer-auths
  CLI->>Helm: sync base release
  Helm->>K8s: reconcile canonical chart objects
  CLI->>LiteLLM: restore current base config + prior user models
  CLI->>Buyer: restore prior buyer keys, current keys win on conflicts

flowchart LR
  S["snapshot before sync"] --> H["helmfile --selector name=base sync"]
  H --> M["merge after sync"]
  M --> A["litellm-config"]
  M --> B["x402-buyer-config"]
  M --> C["x402-buyer-auths"]

  A --> A1["current base paid/* route retained"]
  A --> A2["previous configured models appended by model_name"]
  B --> B1["previous runtime keys retained"]
  C --> C1["previous auth pool keys retained"]

The LiteLLM merge intentionally keeps the post-sync base config as canonical and appends previous user model entries that are not already present by model_name. For buyer ConfigMaps, the merge is key based: previous runtime keys survive, and current chart keys win on direct conflicts.

Pre-release Migration Surface

Because this project has not shipped as a production release, we do not need to preserve a one-off mutation path for clusters created between internal PRs. The safer architecture is to keep the chart source of truth clean, remove the special-case adoption helper, and give pre-release testers clear reset guidance in the release notes.

flowchart TD
  A["pre-release cluster state"] -. no public compatibility contract .-> B["discard one-off adoption helper"]
  B --> C["delete hack/migrate-bedag-raw-to-base.sh"]
  B --> D["delete docs/upgrade-from-pre-pr-523.md"]
  B --> E["release-template warning"]
  E --> F["back up local test data"]
  F --> G["obol stack down"]
  G --> H["obol stack purge --force"]
  H --> I["obol stack init"]
  I --> J["obol stack up"]
  C --> K["fresh install / current stack path remains canonical"]
  D --> K
  J --> K

This removes direct live-cluster Helm annotation mutation from the repository rather than polishing it into a supported upgrade story. The release note still gives affected testers a concrete path out if they hit Helm ownership errors.

Drain Lifecycle

The previous drain implementation kept Kubernetes status and route behavior partially aligned, but ERC-8004 consumers could not see that a service was draining. The registration document now marks draining services as unavailable until the grace period ends.

sequenceDiagram
  participant Operator
  participant Offer as ServiceOffer
  participant Controller
  participant Route as HTTPRoute / payment gate
  participant Reg as agent-registration.json
  participant Consumer as ERC-8004 consumer

  Operator->>Offer: obol sell stop --grace
  Offer->>Controller: spec.drainAt + spec.drainGracePeriod
  Controller->>Reg: service.available=false, drainEndsAt=<RFC3339>
  Controller->>Route: keep serving during grace window
  Consumer->>Reg: discover service is draining
  Controller->>Route: delete after drain expiry
  Controller->>Offer: status RoutePublished=False, PaymentGateReady=False, Drained

The same drain metadata is applied to the generated web service and custom registered services. OASF taxonomy entries remain taxonomy-only and do not get endpoint availability metadata.

For agent-backed offers, the controller previously returned early when the referenced Agent was not ready. This could skip route teardown after the drain grace expired. The reconcile path now checks expired drains before returning the waiting-for-agent status and deletes route children when the offer is already drained.

Controller Failure Semantics

A leader-owned controller loop failure should not leave the process alive while the lease machinery may still be active. The leader callback now exits non-zero if controller.Run returns an error:

flowchart TD
  A["pod becomes leader"] --> B["controller.Run(ctx)"]
  B -->|nil| C["normal shutdown path"]
  B -->|error| D["log failure"]
  D --> E["os.Exit(1)"]
  E --> F["Kubernetes restarts pod"]
  F --> G["new leader election"]

This keeps failure ownership clear: if the elected controller loop dies, the pod dies too, and restart policy handles recovery.

CI Guardrail

The Helm template smoke now stores rendered YAML and checks rendered object identity uniqueness:

flowchart TD
  A["helm template base"] --> B["base-rendered.yaml"]
  B --> C["scan apiVersion/kind/metadata.name/metadata.namespace"]
  C -->|unique| D["smoke passes"]
  C -->|duplicate| E["print duplicate identity and fail"]

This specifically guards the duplicate RBAC regression class. Helm can render duplicate objects without failing; Kubernetes then receives an order-dependent manifest. The CI check turns that into a pre-merge failure.

Observability Docs

The docs now match the implemented metric surface:

x402:revenue:7d_by_offer
x402:revenue:7d_by_offer_chain_asset_symbol
obol_x402_verifier_charged_requests_total
no documented route label requirement for the revenue series

Validation

Parallel validation was run against the fix branch:

OBOL_LLM_ENDPOINT=http://silvermesh.v1337.lan:8081 OBOL_LLM_MODEL=qwen36-apex-i-compact go test ./internal/x402 ./internal/serviceoffercontroller ./internal/erc8004 ./cmd/serviceoffer-controller -count=1
Local equivalent of .github/workflows/helm-template-smoke.yml for base and cloudflared, including the new duplicate rendered-object check.
git diff --check
bash -n flows/*.sh
Live upgrade check against running k3d cluster obol-stack-working-longhorn, created from pre-consolidation defaults with legacy bedag/raw releases: branch-rendered base Helm upgrade failed with Helm invalid ownership metadata for Namespace "erpc", matching the release warning. The cluster remained Ready and Helm release revisions were unchanged after the manual Helmfile attempt.
Repository scan confirmed no stale references to hack/migrate-bedag-raw-to-base.sh, docs/upgrade-from-pre-pr-523.md, or the removed migration command.
SilverMesh health check: GET /health returned {"status":"ok"}.
SilverMesh chat smoke returned SILVERMESH_OK using qwen36-apex-i-compact with thinking disabled.
SilverMesh architectural patch review returned PASS for frontend RBAC ownership, runtime ConfigMap preservation, pre-release migration-surface removal plus live-tested release warning, ERC-8004 drain visibility, leader-election failure semantics, drained route cleanup, observability docs, and Helm duplicate detection.

Compatibility Notes

The ERC-8004 additions are additive JSON fields and omit when not draining.
The frontend RBAC object names are unchanged; only duplicate ownership is removed and read-only CRD permissions are consolidated.
Runtime ConfigMap restore is scoped to known mutable ConfigMaps in llm; unrelated chart-owned objects are left to Helm.
The pre-release Helm ownership adoption helper is intentionally removed instead of becoming a supported upgrade path; affected testers are directed to recreate local test stacks.

bussyjd added 4 commits May 24, 2026 17:30

fix: resolve marketplace bundle architecture blockers

c3ba469

chore: remove pre-release migration script

82cbfae

docs: warn pre-release testers about stack reset

94418db

docs: clarify pre-release ownership warning

46189cd

bussyjd marked this pull request as ready for review May 24, 2026 13:52

bussyjd merged commit 1dbbf60 into feat/marketplace-bundle May 24, 2026

This was referenced May 24, 2026

fix(rbac): grant frontend read access to PurchaseRequest + RegistrationRequest #540

Closed

feat: x402 marketplace + architecture review bundle (#513-#535) #536

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resolve marketplace bundle architecture blockers#541

fix: resolve marketplace bundle architecture blockers#541
bussyjd merged 4 commits into
feat/marketplace-bundlefrom
fix/marketplace-bundle-architecture-review

bussyjd commented May 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bussyjd commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Review Map

Chart Ownership

Runtime Config Preservation

Pre-release Migration Surface

Drain Lifecycle

Controller Failure Semantics

CI Guardrail

Observability Docs

Validation

Compatibility Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bussyjd commented May 24, 2026 •

edited

Loading