Closed
7 changes: 7 additions & 0 deletions .github/release-template.md
@@ -96,6 +96,13 @@ repositories or docs.]
## Breaking changes / Migration notes

- [Delete this section if there are no breaking changes.]
- **Upgrading from a pre-PR #523 cluster**: PR #523 relocated six `bedag/raw`
helmfile releases into the `base` chart. Existing clusters must run
`bash hack/migrate-bedag-raw-to-base.sh` once before `obol stack up` to
transfer Helm ownership annotations; otherwise `helm upgrade base` fails
with `invalid ownership metadata`. See
[`docs/upgrade-from-pre-pr-523.md`](../docs/upgrade-from-pre-pr-523.md).
Fresh installs are unaffected.

## Known issues

82 changes: 82 additions & 0 deletions docs/upgrade-from-pre-pr-523.md
@@ -0,0 +1,82 @@
# Upgrading clusters created before PR #523

PR [#523](https://github.com/ObolNetwork/obol-stack/pull/523) relocates six
`bedag/raw` helmfile releases into the `base` chart so the stack has one
source of truth for everything it ships in the `erpc`, `obol-frontend`, and
`llm` namespaces.

**Fresh installs are unaffected.** This page only applies if you are
upgrading a cluster that was created **before** PR #523 was merged.

## Symptom

Running `obol stack up` on a pre-#523 cluster fails during `helm upgrade base`
with errors of the form:

```
Error: UPGRADE FAILED: <resource> exists and cannot be imported into the
current release: invalid ownership metadata; annotation validation error:
key "meta.helm.sh/release-name" must equal "base"; current value is
"<legacy-release>"
```

Helm refuses to adopt a resource whose ownership metadata names another
release. About ten resources are affected (Namespaces, HTTPRoutes,
Middlewares, ConfigMaps, PrometheusRule, PodMonitor,
ClusterRole/ClusterRoleBinding), enough that fixing them by hand is
error-prone.
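
The rule behind this error can be sketched in shell. This is an illustration of the check the message describes, not Helm's implementation. It assumes Helm adopts an existing resource only when its `app.kubernetes.io/managed-by` label is `Helm` and both `meta.helm.sh/release-*` annotations match the release being upgraded. The helper name `can_adopt` and its argument order are hypothetical.

```bash
# Sketch only: mirrors the ownership rule quoted in the error message.
# can_adopt is a hypothetical helper, not part of Helm or obol.
# Usage: can_adopt <want-release> <want-namespace> \
#                  <managed-by label> <release-name> <release-namespace>
can_adopt() {
  local want_release="$1" want_namespace="$2"
  local managed_by="$3" release_name="$4" release_namespace="$5"

  if [[ "$managed_by" != "Helm" ]]; then
    echo "blocked: label app.kubernetes.io/managed-by is '$managed_by', want 'Helm'"
    return 1
  fi
  if [[ "$release_name" != "$want_release" ]]; then
    echo "blocked: meta.helm.sh/release-name is '$release_name', want '$want_release'"
    return 1
  fi
  if [[ "$release_namespace" != "$want_namespace" ]]; then
    echo "blocked: meta.helm.sh/release-namespace is '$release_namespace', want '$want_namespace'"
    return 1
  fi
  echo "ok"
}
```

For example, `can_adopt base kube-system Helm erpc-httproute erpc` reports a `release-name` mismatch, which is the situation a pre-#523 cluster is in.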

## When to run the migration script

- **Run once**, **before** `obol stack up`, against any cluster created
before PR #523 merged.
- The script is **idempotent** — safe to re-run if `obol stack up` is
interrupted or if you migrate one cluster at a time.
- Fresh clusters (`obol stack init && obol stack up` on an empty machine)
do **not** need it.

```bash
# Optional: set KUBECONFIG explicitly (this is the path the script assumes by default)
export KUBECONFIG="$HOME/.config/obol/kubeconfig.yaml"

bash hack/migrate-bedag-raw-to-base.sh
obol stack up
```

## What the script does

It rewrites the ownership metadata on the affected resources (two
annotations and one label) so Helm treats them as members of the `base`
release:

```
meta.helm.sh/release-name=base
meta.helm.sh/release-namespace=kube-system
app.kubernetes.io/managed-by=Helm
```

It covers the six legacy `bedag/raw` releases removed by PR #523, plus one
release that may linger on partially upgraded clusters:

| Legacy release | Namespace |
|---|---|
| `obol-frontend-rbac` | `obol-frontend` |
| `obol-frontend-httproute` | `obol-frontend` |
| `erpc-httproute` | `erpc` |
| `erpc-x402-middleware` | `erpc` |
| `erpc-metadata` | `erpc` |
| `llm-buyer-podmonitor` | `llm` |
| `x402-verifier-podmonitor` | `x402` (partial-upgrade clusters from before PR #513 hardening) |

It also adopts a small set of resources that may exist with no Helm
ownership at all (`namespace/erpc`, `namespace/obol-frontend`,
`prometheusrule/x402-verifier` in `x402`) so the next `helm upgrade base`
can manage them cleanly.
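
The per-resource logic can be sketched as two small shell helpers. This is a simplified illustration, not the script itself: `plan_action` and `print_adopt_commands` are hypothetical names, and the second helper only prints the `kubectl` commands so they can be reviewed before anything is changed.

```bash
# Sketch only: hypothetical helpers that mirror the script's per-resource steps.

# Decide what to do from the current meta.helm.sh/release-name value.
plan_action() {
  local current="$1"
  if [[ "$current" == "base" ]]; then
    echo "skip"       # already owned by base
  elif [[ -z "$current" ]]; then
    echo "adopt"      # no Helm ownership metadata at all
  else
    echo "migrate"    # owned by a legacy release
  fi
}

# Print (do not run) the commands that hand a resource to the base release.
# Usage: print_adopt_commands <kind/name> [namespace]
print_adopt_commands() {
  local target="$1" scope=""
  if [[ -n "${2:-}" ]]; then
    scope=" -n $2"
  fi
  echo "kubectl annotate ${target}${scope} meta.helm.sh/release-name=base meta.helm.sh/release-namespace=kube-system --overwrite"
  echo "kubectl label ${target}${scope} app.kubernetes.io/managed-by=Helm --overwrite"
}
```

Running `print_adopt_commands prometheusrule/x402-verifier x402` prints the two commands for that resource; piping the output to `bash` would apply them.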

## Verifying the migration

After running the script, `obol stack up` should succeed without the
`invalid ownership metadata` errors. To spot-check a single resource:

```bash
kubectl get httproute -n obol-frontend obol-frontend \
-o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}{"\n"}'
# → base
```
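
To check more than one resource at a time, a small filter over tab-separated `kubectl` output may help. This is a sketch under assumptions: `not_on_base` is a hypothetical helper, and it expects one line per resource in the form `namespace<TAB>name<TAB>release-name`, as produced by the `jsonpath` expression shown in the comment.

```bash
# Sketch only: not_on_base is a hypothetical helper.
# Reads "namespace<TAB>name<TAB>release-name" lines on stdin and prints every
# resource whose release-name annotation is missing or is not "base".
not_on_base() {
  awk -F '\t' '$3 != "base" {
    owner = ($3 == "" ? "<none>" : $3)
    printf "%s/%s owner=%s\n", $1, $2, owner
  }'
}

# Example (needs cluster access):
#   kubectl get httproute -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.meta\.helm\.sh/release-name}{"\n"}{end}' \
#     | not_on_base
```

The filter lists every resource of that kind that `base` does not own, including resources that legitimately belong to other charts, so read the output against the table of legacy releases above rather than treating every line as a failure.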
82 changes: 82 additions & 0 deletions hack/migrate-bedag-raw-to-base.sh
@@ -0,0 +1,82 @@
#!/usr/bin/env bash
# Migrate resources from the legacy bedag/raw helmfile releases to the
# base chart that now owns them after obol-stack PR #523.
#
# Symptom this fixes:
# Error: UPGRADE FAILED: <resource> exists and cannot be imported
# into the current release: invalid ownership metadata
#
# Run once before `obol stack up` against any cluster deployed before
# PR #523 merged.
#
# Idempotent — safe to re-run.

set -euo pipefail

# Default to the obol kubeconfig and export it so kubectl actually sees the value.
export KUBECONFIG="${KUBECONFIG:-$HOME/.config/obol/kubeconfig.yaml}"

ORPHAN_RELEASES=(
obol-frontend-rbac
obol-frontend-httproute
erpc-httproute
erpc-x402-middleware
erpc-metadata
llm-buyer-podmonitor
x402-verifier-podmonitor # removed by PR #513's hardening; kept in case partially upgraded clusters still have it
)

migrate_one() {
  # $1 is "Kind/name" or "Kind/name -n namespace". Split it into separate
  # kubectl arguments; passing it as one quoted word makes kubectl fail on
  # namespaced resources.
  local -a target
  read -r -a target <<< "$1"
  local current
  current=$(kubectl get "${target[@]}" -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}' 2>/dev/null || true)
  if [[ "$current" == "base" ]]; then
    echo " $1: already on base, skipping"
    return 0
  fi
  if [[ -z "$current" ]]; then
    echo " $1: no Helm metadata, adopting into base"
  else
    echo " $1: was on '$current', migrating to base"
  fi
  kubectl annotate "${target[@]}" \
    meta.helm.sh/release-name=base \
    meta.helm.sh/release-namespace=kube-system --overwrite >/dev/null
  kubectl label "${target[@]}" app.kubernetes.io/managed-by=Helm --overwrite >/dev/null
}

echo "==> Scanning for resources owned by legacy bedag/raw releases..."
for release in "${ORPHAN_RELEASES[@]}"; do
echo "release: $release"
kubectl get all,clusterrole,clusterrolebinding,role,rolebinding,configmap,httproute,middleware,podmonitor,servicemonitor,prometheusrule,referencegrant,namespace \
-A -o json 2>/dev/null \
| jq -r --arg rel "$release" '.items[]
| select(.metadata.annotations["meta.helm.sh/release-name"] == $rel)
| "\(.kind)/\(.metadata.name)\(if .metadata.namespace then " -n " + .metadata.namespace else "" end)"' \
| while read -r target; do
[[ -z "$target" ]] && continue
migrate_one "$target"
done
done

# Some resources were never Helm-owned (e.g. PrometheusRule x402-verifier may have
# been created via kubectl apply somewhere). Adopt them into base too if they exist
# in the namespaces base now owns.
echo "==> Adopting unowned resources base will now claim..."
declare -a UNOWNED_TARGETS=(
"namespace/erpc"
"namespace/obol-frontend"
"prometheusrule/x402-verifier -n x402"
)
for target in "${UNOWNED_TARGETS[@]}"; do
  # Entries may carry "-n <namespace>", so split them into kubectl arguments
  # explicitly rather than relying on unquoted expansion.
  read -r -a args <<< "$target"
  if kubectl get "${args[@]}" >/dev/null 2>&1; then
    owner=$(kubectl get "${args[@]}" -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}' 2>/dev/null || true)
    if [[ -z "$owner" || "$owner" == "base" ]]; then
      echo " $target: $([ -z "$owner" ] && echo "adopting" || echo "already base")"
      kubectl annotate "${args[@]}" meta.helm.sh/release-name=base meta.helm.sh/release-namespace=kube-system --overwrite >/dev/null
      kubectl label "${args[@]}" app.kubernetes.io/managed-by=Helm --overwrite >/dev/null
    fi
  fi
done

echo ""
echo "✓ Migration complete. You may now run 'obol stack up'."
2 changes: 1 addition & 1 deletion internal/agentcrd/agent.go
@@ -32,7 +32,7 @@ func Namespace(name string) string {
}

// HostHomePath is where the agent's .hermes data lives on the host. The
// cluster mounts this into the Hermes pod via hostPath; writing
// cluster mounts this into the Hermes pod via the data PVC; writing
// SOUL.md/skills here puts them inside the pod automatically.
func HostHomePath(cfg *config.Config, name string) string {
desc := agentruntime.Describe(agentruntime.Hermes)
124 changes: 26 additions & 98 deletions internal/hermes/hermes.go
@@ -245,10 +245,7 @@ func Sync(cfg *config.Config, id string, u *ui.UI) error {
return fmt.Errorf("helmfile sync failed: %w", err)
}

// Host-side chown the PVC backing dirs to the in-pod UID/GID, bypassing
// the user-namespacing that defeats the in-pod `init-hermes-perms`
// chown from #446 (see ensureHermesPVCOwnership doc comment for details).
ensureHermesPVCOwnership(cfg, id, u)
fixHermesDataPVCK3dFallback(cfg, id, u)

// Publish wallet-metadata ConfigMap for the frontend (namespace now exists).
applyWalletMetadataConfigMap(cfg, id, deploymentDir)
@@ -775,20 +772,8 @@ func generateValues(namespace, hostname, dashboardHostname, agentBaseURL, token,
runAsUser: %d
runAsGroup: %d
fsGroup: %d
fsGroupChangePolicy: OnRootMismatch
initContainers:
- name: init-hermes-perms
image: %s
imagePullPolicy: IfNotPresent
securityContext:
runAsUser: 0
runAsGroup: 0
command:
- sh
- -c
- chown -R %d:%d /data
volumeMounts:
- name: data
mountPath: /data
- name: init-hermes-data
image: %s
imagePullPolicy: IfNotPresent
@@ -862,7 +847,7 @@ func generateValues(namespace, hostname, dashboardHostname, agentBaseURL, token,
value: %s
- name: OBOL_SKILLS_DIR
value: /data/.hermes/%s
`, desc.DataPVCName, namespace, desc.ServiceName, desc.ServiceName, namespace, desc.ServiceName, desc.ServiceName, desc.ServiceName, desc.ServiceName, containerUID, containerGID, containerGID, quoteYAML(image()), containerUID, containerGID, quoteYAML(image()), desc.ServiceName, quoteYAML(image()), quoteYAML(hermesBinary), desc.DefaultPort, desc.DefaultPort, quoteYAML(primary), quoteYAML(namespace), obolSkillsDirName)
`, desc.DataPVCName, namespace, desc.ServiceName, desc.ServiceName, namespace, desc.ServiceName, desc.ServiceName, desc.ServiceName, desc.ServiceName, containerUID, containerGID, containerGID, quoteYAML(image()), desc.ServiceName, quoteYAML(image()), quoteYAML(hermesBinary), desc.DefaultPort, desc.DefaultPort, quoteYAML(primary), quoteYAML(namespace), obolSkillsDirName)

if agentBaseURL != "" {
fmt.Fprintf(&b, " - name: AGENT_BASE_URL\n value: %s\n", quoteYAML(agentBaseURL))
@@ -1022,7 +1007,6 @@ func syncRuntimeFiles(cfg *config.Config, id string, configData []byte, u *ui.UI
if err := removeLegacyHeartbeat(targetDir); err != nil {
return err
}
fixRuntimeVolumeOwnership(cfg, targetDir, u)
return nil
}

@@ -1314,94 +1298,38 @@ func fixRuntimeVolumeOwnership(cfg *config.Config, hostPath string, u *ui.UI) {
}
}

// hermesPVCPaths returns the host-side PVC backing directories owned by the
// Hermes pod and chowned to containerUID:containerGID.
//
// Intentionally limited to PVCs that the Hermes container itself mounts —
// `remote-signer-keystores` is excluded even though it sits in the same
// namespace because the remote-signer pod runs as runAsUser=65532 with
// fsGroup=1000 (obol/remote-signer chart) and forcing its volume to
// 10000:10000 (Hermes' UID) makes the remote-signer crash-loop on
// `failed to load keystores: Permission denied (os error 13)` against
// the read-only /data/keystores mount. The local-path-provisioner default
// of 1000:1000 already matches that pod's fsGroup contract, so leaving
// that volume untouched is the safe behavior.
func hermesPVCPaths(cfg *config.Config, id string) []string {
namespace := agentruntime.Namespace(agentruntime.Hermes, id)
return []string{
filepath.Join(cfg.DataDir, namespace, agentruntime.Describe(agentruntime.Hermes).DataPVCName),
// fsGroup should own Hermes' data volume. This fallback only repairs legacy
// k3d/userns clusters when the init container is already visibly stuck.
func fixHermesDataPVCK3dFallback(cfg *config.Config, id string, u *ui.UI) {
backendName := "k3d"
if data, err := os.ReadFile(filepath.Join(cfg.ConfigDir, ".stack-backend")); err == nil {
backendName = strings.TrimSpace(string(data))
}
if backendName != "k3d" {
return
}
}

// ensureHermesPVCOwnership host-side chowns the Hermes PVC backing directories
// to containerUID:containerGID so the agent's init containers can write under
// /data on the first start.
//
// Why this is needed (issue #475):
// - The embedded k3d config (internal/embed/k3d-config.yaml) sets
// KubeletInUserNamespace=true. Pod "root" maps to a host subuid that
// lacks chown authority over the host bind-mount path provisioned by
// local-path-provisioner. The in-pod `init-hermes-perms` chown added in
// #446 (commit c066baa) silently no-ops in this configuration.
// - local-path-provisioner's helper-pod sets the dir to 1000:1000 (see
// internal/embed/infrastructure/base/templates/local-path.yaml). Hermes
// runs as 10000:10000, so the next init container fails on
// `mkdir /data/.hermes/home: Permission denied`.
//
// The fix is to chown from outside the user namespace: `docker exec` into the
// k3d server container runs at the host Docker daemon's authority, which is
// real root and is not subject to the kubelet's user-namespacing.
//
// Best-effort. Waits up to 60s for each PVC to be Bound (local-path uses
// WaitForFirstConsumer, so the host dir doesn't exist until the consuming
// pod is scheduled). On non-k3d backends fixRuntimeVolumeOwnership falls
// back to a plain os.Chown.
//
// If a Hermes pod is currently stuck in Init:CrashLoopBackOff because of the
// pre-fix permissions, deletes it so kubelet re-creates with the corrected
// perms immediately rather than after exponential backoff (up to ~5 min).
// Skips the delete when no pod is stuck so repeated `Sync` calls
// (e.g. `obol model sync` after `obol model prefer`) do not gratuitously
// restart a healthy agent.
func ensureHermesPVCOwnership(cfg *config.Config, id string, u *ui.UI) {
namespace := agentruntime.Namespace(agentruntime.Hermes, id)
kubeconfigPath := filepath.Join(cfg.ConfigDir, "kubeconfig.yaml")
kubectlBin := filepath.Join(cfg.BinDir, "kubectl")

// Wait only for the PVCs hermesPVCPaths chowns. remote-signer-keystores
// is intentionally NOT in this loop — see the doc comment on
// hermesPVCPaths for why.
for _, pvc := range []string{
agentruntime.Describe(agentruntime.Hermes).DataPVCName,
} {
waitCmd := exec.Command(kubectlBin,
"wait", "--for=jsonpath={.status.phase}=Bound",
"--timeout=60s", "pvc/"+pvc, "-n", namespace)
waitCmd.Env = append(os.Environ(), "KUBECONFIG="+kubeconfigPath)
_ = waitCmd.Run() // best-effort; continue even on timeout
if !hermesInitContainerStuck(cfg, namespace) {
return
}

for _, p := range hermesPVCPaths(cfg, id) {
fixRuntimeVolumeOwnership(cfg, p, u)
}
hostPath := filepath.Join(cfg.DataDir, namespace, agentruntime.Describe(agentruntime.Hermes).DataPVCName)
fixRuntimeVolumeOwnership(cfg, hostPath, u)

if hermesInitStuck(cfg, namespace) {
deleteCmd := exec.Command(kubectlBin,
"-n", namespace, "delete", "pod",
"-l", "app.kubernetes.io/name=hermes",
"--ignore-not-found", "--wait=false")
deleteCmd.Env = append(os.Environ(), "KUBECONFIG="+kubeconfigPath)
if err := deleteCmd.Run(); err == nil && u != nil {
u.Info("Restarted Hermes pod to apply fresh volume ownership")
}
kubeconfigPath := filepath.Join(cfg.ConfigDir, "kubeconfig.yaml")
kubectlBin := filepath.Join(cfg.BinDir, "kubectl")
deleteCmd := exec.Command(kubectlBin,
"-n", namespace, "delete", "pod",
"-l", "app.kubernetes.io/name=hermes",
"--ignore-not-found", "--wait=false")
deleteCmd.Env = append(os.Environ(), "KUBECONFIG="+kubeconfigPath)
if err := deleteCmd.Run(); err == nil && u != nil {
u.Info("Restarted Hermes pod after best-effort k3d PVC ownership repair")
}
}

// hermesInitStuck reports whether at least one Hermes pod has an init
// container in CrashLoopBackOff or an Error waiting state — the signature of
// the perm-denied symptom this fix targets. Returns false on any kubectl
// failure so that a transient API hiccup does not trigger spurious restarts.
func hermesInitStuck(cfg *config.Config, namespace string) bool {
func hermesInitContainerStuck(cfg *config.Config, namespace string) bool {
kubeconfigPath := filepath.Join(cfg.ConfigDir, "kubeconfig.yaml")
kubectlBin := filepath.Join(cfg.BinDir, "kubectl")
cmd := exec.Command(kubectlBin,