fix: resolve marketplace bundle architecture blockers#541
Merged
bussyjd merged 4 commits intoMay 24, 2026
Merged
Conversation
This was referenced May 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This branch is the corrective follow-up for the architecture review of PRs above #509 that were folded into the marketplace bundle.
It addresses review blockers from #518, #520, #533, #534, #535, #536, and #540, and resolves the #528 pre-release ownership-migration concern by removing that one-off migration surface entirely. The project is not production released, so carrying live-cluster Helm ownership mutation code is unnecessary risk; the release template now warns pre-release testers how to reset a local test stack if they hit ownership errors.
Key fixes:
obol-frontendRBAC into one rendered template and remove the duplicateClusterRole/ClusterRoleBindingowner.obol x402 setup/EnsureVerifierso a base Helm sync cannot erase LiteLLM models or buyer auth state.invalid ownership metadata: back up local test data, then recreate the stack withobol stack down,obol stack purge --force,obol stack init, andobol stack up.available: falseplusdrainEndsAtduring the drain grace period.Review Map
The common failure mode was not syntax. The reviewed PRs were mostly locally reasonable, but when composed into the bundle they created architectural conflicts: duplicate chart ownership, broad Helm reconciliation against mutable runtime state, lifecycle state that was visible in Kubernetes but not in ERC-8004 identity, and controller paths that could leave stale routes behind.
Chart Ownership
The bundle had two templates emitting the same frontend RBAC object identities:
This branch makes
obol-frontend.yamlthe single owner:The final RBAC surface keeps the frontend local-operator trust boundary but removes unnecessary
secretsaccess. It includes read-only PurchaseRequest and RegistrationRequest permissions because the frontend displays those states but does not own writes.Runtime Config Preservation
EnsureVerifiercorrectly moved x402 verifier deployment to the base Helm release, but the base release also owns bootstrap ConfigMaps in thellmnamespace. A full base sync can therefore overwrite runtime state created by model setup and buyer flows.The new flow snapshots only the mutable runtime ConfigMaps, runs the canonical base sync, and restores merged data afterward:
The LiteLLM merge intentionally keeps the post-sync base config as canonical and appends previous user model entries that are not already present by
model_name. For buyer ConfigMaps, the merge is key based: previous runtime keys survive, and current chart keys win on direct conflicts.Pre-release Migration Surface
Because this project has not shipped as a production release, we do not need to preserve a one-off mutation path for clusters created between internal PRs. The safer architecture is to keep the chart source of truth clean, remove the special-case adoption helper, and give pre-release testers clear reset guidance in the release notes.
This removes direct live-cluster Helm annotation mutation from the repository rather than polishing it into a supported upgrade story. The release note still gives affected testers a concrete path out if they hit Helm ownership errors.
Drain Lifecycle
The previous drain implementation kept Kubernetes status and route behavior partially aligned, but ERC-8004 consumers could not see that a service was draining. The registration document now marks draining services as unavailable until the grace period ends.
The same drain metadata is applied to the generated
webservice and custom registered services. OASF taxonomy entries remain taxonomy-only and do not get endpoint availability metadata.For agent-backed offers, the controller previously returned early when the referenced Agent was not ready. This could skip route teardown after the drain grace expired. The reconcile path now checks expired drains before returning the waiting-for-agent status and deletes route children when the offer is already drained.
Controller Failure Semantics
A leader-owned controller loop failure should not leave the process alive while the lease machinery may still be active. The leader callback now exits non-zero if
controller.Runreturns an error:This keeps failure ownership clear: if the elected controller loop dies, the pod dies too, and restart policy handles recovery.
CI Guardrail
The Helm template smoke now stores rendered YAML and checks rendered object identity uniqueness:
This specifically guards the duplicate RBAC regression class. Helm can render duplicate objects without failing; Kubernetes then receives an order-dependent manifest. The CI check turns that into a pre-merge failure.
Observability Docs
The docs now match the implemented metric surface:
x402:revenue:7d_by_offerx402:revenue:7d_by_offer_chain_asset_symbolobol_x402_verifier_charged_requests_totalroutelabel requirement for the revenue seriesValidation
Parallel validation was run against the fix branch:
OBOL_LLM_ENDPOINT=http://silvermesh.v1337.lan:8081 OBOL_LLM_MODEL=qwen36-apex-i-compact go test ./internal/x402 ./internal/serviceoffercontroller ./internal/erc8004 ./cmd/serviceoffer-controller -count=1.github/workflows/helm-template-smoke.ymlfor base and cloudflared, including the new duplicate rendered-object check.git diff --checkbash -n flows/*.shobol-stack-working-longhorn, created from pre-consolidation defaults with legacybedag/rawreleases: branch-renderedbaseHelm upgrade failed with Helminvalid ownership metadataforNamespace "erpc", matching the release warning. The cluster remainedReadyand Helm release revisions were unchanged after the manual Helmfile attempt.hack/migrate-bedag-raw-to-base.sh,docs/upgrade-from-pre-pr-523.md, or the removed migration command.GET /healthreturned{"status":"ok"}.SILVERMESH_OKusingqwen36-apex-i-compactwith thinking disabled.Compatibility Notes
llm; unrelated chart-owned objects are left to Helm.