Skip to content

fix: resolve marketplace bundle architecture blockers#541

Merged
bussyjd merged 4 commits into
feat/marketplace-bundlefrom
fix/marketplace-bundle-architecture-review
May 24, 2026
Merged

fix: resolve marketplace bundle architecture blockers#541
bussyjd merged 4 commits into
feat/marketplace-bundlefrom
fix/marketplace-bundle-architecture-review

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 24, 2026

Summary

This branch is the corrective follow-up for the architecture review of PRs above #509 that were folded into the marketplace bundle.

It addresses review blockers from #518, #520, #533, #534, #535, #536, and #540, and resolves the #528 pre-release ownership-migration concern by removing that one-off migration surface entirely. The project is not production released, so carrying live-cluster Helm ownership mutation code is unnecessary risk; the release template now warns pre-release testers how to reset a local test stack if they hit ownership errors.

Key fixes:

  • Consolidate obol-frontend RBAC into one rendered template and remove the duplicate ClusterRole / ClusterRoleBinding owner.
  • Preserve mutable runtime ConfigMaps around obol x402 setup / EnsureVerifier so a base Helm sync cannot erase LiteLLM models or buyer auth state.
  • Remove the obsolete pre-release Bedag/raw-to-base migration script and upgrade doc.
  • Add release-template guidance for pre-release testers who hit Helm invalid ownership metadata: back up local test data, then recreate the stack with obol stack down, obol stack purge --force, obol stack init, and obol stack up.
  • Publish drain state through ERC-8004 registration services as available: false plus drainEndsAt during the drain grace period.
  • Make the serviceoffer-controller process exit when the leader-owned controller loop fails, allowing Kubernetes to restart the pod and release leadership cleanly.
  • Ensure expired drains still tear down routes when an agent-backed offer is no longer ready.
  • Align observability docs with the current metric names and label set.
  • Add a Helm rendered-object uniqueness check so duplicate Kubernetes identities cannot silently re-enter the base chart.

Review Map

flowchart LR
  P518["#518 leader election"] --> F["fix branch"]
  P520["#520 x402 verifier Helm sync"] --> F
  P528["#528 pre-release ownership migration"] --> F
  P533["#533 Helm smoke coverage"] --> F
  P534["#534 observability docs"] --> F
  P535["#535 drain lifecycle"] --> F
  P536["#536 marketplace bundle"] --> F
  P540["#540 frontend RBAC follow-up"] --> F

  F --> B["base: feat/marketplace-bundle"]
Loading

The common failure mode was not syntax. The reviewed PRs were mostly locally reasonable, but when composed into the bundle they created architectural conflicts: duplicate chart ownership, broad Helm reconciliation against mutable runtime state, lifecycle state that was visible in Kubernetes but not in ERC-8004 identity, and controller paths that could leave stale routes behind.

Chart Ownership

The bundle had two templates emitting the same frontend RBAC object identities:

flowchart TD
  A["base chart render"] --> B["obol-frontend.yaml"]
  A --> C["obol-frontend-rbac.yaml"]
  B --> D["ClusterRole obol-frontend-openclaw-discovery"]
  B --> E["ClusterRoleBinding obol-frontend-openclaw-discovery"]
  C --> D
  C --> E
  D --> X["duplicate object identity"]
  E --> X
Loading

This branch makes obol-frontend.yaml the single owner:

flowchart TD
  A["base chart render"] --> B["obol-frontend.yaml"]
  B --> C["HTTPRoute"]
  B --> D["ClusterRole"]
  B --> E["ClusterRoleBinding"]
  R["obol-frontend-rbac.yaml"] -. removed .-> D
  R -. removed .-> E
Loading

The final RBAC surface keeps the frontend local-operator trust boundary but removes unnecessary secrets access. It includes read-only PurchaseRequest and RegistrationRequest permissions because the frontend displays those states but does not own writes.

Runtime Config Preservation

EnsureVerifier correctly moved x402 verifier deployment to the base Helm release, but the base release also owns bootstrap ConfigMaps in the llm namespace. A full base sync can therefore overwrite runtime state created by model setup and buyer flows.

The new flow snapshots only the mutable runtime ConfigMaps, runs the canonical base sync, and restores merged data afterward:

sequenceDiagram
  participant CLI as obol x402 setup
  participant K8s as Kubernetes API
  participant Helm as helmfile base sync
  participant LiteLLM as litellm-config
  participant Buyer as buyer ConfigMaps

  CLI->>K8s: read litellm-config
  CLI->>K8s: read x402-buyer-config
  CLI->>K8s: read x402-buyer-auths
  CLI->>Helm: sync base release
  Helm->>K8s: reconcile canonical chart objects
  CLI->>LiteLLM: restore current base config + prior user models
  CLI->>Buyer: restore prior buyer keys, current keys win on conflicts
Loading
flowchart LR
  S["snapshot before sync"] --> H["helmfile --selector name=base sync"]
  H --> M["merge after sync"]
  M --> A["litellm-config"]
  M --> B["x402-buyer-config"]
  M --> C["x402-buyer-auths"]

  A --> A1["current base paid/* route retained"]
  A --> A2["previous configured models appended by model_name"]
  B --> B1["previous runtime keys retained"]
  C --> C1["previous auth pool keys retained"]
Loading

The LiteLLM merge intentionally keeps the post-sync base config as canonical and appends previous user model entries that are not already present by model_name. For buyer ConfigMaps, the merge is key based: previous runtime keys survive, and current chart keys win on direct conflicts.

Pre-release Migration Surface

Because this project has not shipped as a production release, we do not need to preserve a one-off mutation path for clusters created between internal PRs. The safer architecture is to keep the chart source of truth clean, remove the special-case adoption helper, and give pre-release testers clear reset guidance in the release notes.

flowchart TD
  A["pre-release cluster state"] -. no public compatibility contract .-> B["discard one-off adoption helper"]
  B --> C["delete hack/migrate-bedag-raw-to-base.sh"]
  B --> D["delete docs/upgrade-from-pre-pr-523.md"]
  B --> E["release-template warning"]
  E --> F["back up local test data"]
  F --> G["obol stack down"]
  G --> H["obol stack purge --force"]
  H --> I["obol stack init"]
  I --> J["obol stack up"]
  C --> K["fresh install / current stack path remains canonical"]
  D --> K
  J --> K
Loading

This removes direct live-cluster Helm annotation mutation from the repository rather than polishing it into a supported upgrade story. The release note still gives affected testers a concrete path out if they hit Helm ownership errors.

Drain Lifecycle

The previous drain implementation kept Kubernetes status and route behavior partially aligned, but ERC-8004 consumers could not see that a service was draining. The registration document now marks draining services as unavailable until the grace period ends.

sequenceDiagram
  participant Operator
  participant Offer as ServiceOffer
  participant Controller
  participant Route as HTTPRoute / payment gate
  participant Reg as agent-registration.json
  participant Consumer as ERC-8004 consumer

  Operator->>Offer: obol sell stop --grace
  Offer->>Controller: spec.drainAt + spec.drainGracePeriod
  Controller->>Reg: service.available=false, drainEndsAt=<RFC3339>
  Controller->>Route: keep serving during grace window
  Consumer->>Reg: discover service is draining
  Controller->>Route: delete after drain expiry
  Controller->>Offer: status RoutePublished=False, PaymentGateReady=False, Drained
Loading

The same drain metadata is applied to the generated web service and custom registered services. OASF taxonomy entries remain taxonomy-only and do not get endpoint availability metadata.

For agent-backed offers, the controller previously returned early when the referenced Agent was not ready. This could skip route teardown after the drain grace expired. The reconcile path now checks expired drains before returning the waiting-for-agent status and deletes route children when the offer is already drained.

Controller Failure Semantics

A leader-owned controller loop failure should not leave the process alive while the lease machinery may still be active. The leader callback now exits non-zero if controller.Run returns an error:

flowchart TD
  A["pod becomes leader"] --> B["controller.Run(ctx)"]
  B -->|nil| C["normal shutdown path"]
  B -->|error| D["log failure"]
  D --> E["os.Exit(1)"]
  E --> F["Kubernetes restarts pod"]
  F --> G["new leader election"]
Loading

This keeps failure ownership clear: if the elected controller loop dies, the pod dies too, and restart policy handles recovery.

CI Guardrail

The Helm template smoke now stores rendered YAML and checks rendered object identity uniqueness:

flowchart TD
  A["helm template base"] --> B["base-rendered.yaml"]
  B --> C["scan apiVersion/kind/metadata.name/metadata.namespace"]
  C -->|unique| D["smoke passes"]
  C -->|duplicate| E["print duplicate identity and fail"]
Loading

This specifically guards the duplicate RBAC regression class. Helm can render duplicate objects without failing; Kubernetes then receives an order-dependent manifest. The CI check turns that into a pre-merge failure.

Observability Docs

The docs now match the implemented metric surface:

  • x402:revenue:7d_by_offer
  • x402:revenue:7d_by_offer_chain_asset_symbol
  • obol_x402_verifier_charged_requests_total
  • no documented route label requirement for the revenue series

Validation

Parallel validation was run against the fix branch:

  • OBOL_LLM_ENDPOINT=http://silvermesh.v1337.lan:8081 OBOL_LLM_MODEL=qwen36-apex-i-compact go test ./internal/x402 ./internal/serviceoffercontroller ./internal/erc8004 ./cmd/serviceoffer-controller -count=1
  • Local equivalent of .github/workflows/helm-template-smoke.yml for base and cloudflared, including the new duplicate rendered-object check.
  • git diff --check
  • bash -n flows/*.sh
  • Live upgrade check against running k3d cluster obol-stack-working-longhorn, created from pre-consolidation defaults with legacy bedag/raw releases: branch-rendered base Helm upgrade failed with Helm invalid ownership metadata for Namespace "erpc", matching the release warning. The cluster remained Ready and Helm release revisions were unchanged after the manual Helmfile attempt.
  • Repository scan confirmed no stale references to hack/migrate-bedag-raw-to-base.sh, docs/upgrade-from-pre-pr-523.md, or the removed migration command.
  • SilverMesh health check: GET /health returned {"status":"ok"}.
  • SilverMesh chat smoke returned SILVERMESH_OK using qwen36-apex-i-compact with thinking disabled.
  • SilverMesh architectural patch review returned PASS for frontend RBAC ownership, runtime ConfigMap preservation, pre-release migration-surface removal plus live-tested release warning, ERC-8004 drain visibility, leader-election failure semantics, drained route cleanup, observability docs, and Helm duplicate detection.

Compatibility Notes

  • The ERC-8004 additions are additive JSON fields and omit when not draining.
  • The frontend RBAC object names are unchanged; only duplicate ownership is removed and read-only CRD permissions are consolidated.
  • Runtime ConfigMap restore is scoped to known mutable ConfigMaps in llm; unrelated chart-owned objects are left to Helm.
  • The pre-release Helm ownership adoption helper is intentionally removed instead of becoming a supported upgrade path; affected testers are directed to recreate local test stacks.

@bussyjd bussyjd marked this pull request as ready for review May 24, 2026 13:52
@bussyjd bussyjd merged commit 1dbbf60 into feat/marketplace-bundle May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant