fix(controller/render): Restricted PSS securityContext on httpd workloads #529

Closed
bussyjd wants to merge 5 commits into feat/restricted-pss-sweep from fix/controller-render-restricted-pss

bussyjd (Collaborator) commented May 24, 2026

Summary

Cross-PR interaction fix surfaced by the 14-PR integration test campaign (Bug #3).

PR #521 enforces the Restricted Pod Security Standard on the x402 (and llm) namespaces. The serviceoffer-controller renders two httpd-based Deployments without a securityContext:

  • obol-skill-md — publishes /skill.md and api/services.json
  • agentidentity-<name>-registration — publishes /.well-known/agent-registration.json

Both pods are rejected at admission with:

violates PodSecurity "restricted:latest":
  allowPrivilegeEscalation != false
  unrestricted capabilities (must drop ["ALL"])
  runAsNonRoot != true
  seccompProfile must be RuntimeDefault or Localhost

so they never start. The marketplace API then returns STACK_UNREACHABLE because skill-md isn't reachable.
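The four lines of the admission error map onto four independent checks. A simplified sketch of those checks follows; the struct types and the `restrictedViolations` function are hypothetical stand-ins written for illustration, not the Kubernetes admission code, and the real Restricted profile has more rules than the four shown here.

```go
package main

import "fmt"

// Hypothetical stand-ins for the handful of securityContext fields
// involved. Real code would use the k8s.io/api/core/v1 types.
type podSecurity struct {
	RunAsNonRoot   *bool
	SeccompProfile string // "", "RuntimeDefault", "Localhost", ...
}

type containerSecurity struct {
	AllowPrivilegeEscalation *bool
	DropCapabilities         []string
}

// restrictedViolations returns one message per failed check, in the
// same order as the admission error quoted above. Unset fields fail.
func restrictedViolations(pod podSecurity, c containerSecurity) []string {
	var out []string
	if c.AllowPrivilegeEscalation == nil || *c.AllowPrivilegeEscalation {
		out = append(out, "allowPrivilegeEscalation != false")
	}
	if !contains(c.DropCapabilities, "ALL") {
		out = append(out, `unrestricted capabilities (must drop ["ALL"])`)
	}
	if pod.RunAsNonRoot == nil || !*pod.RunAsNonRoot {
		out = append(out, "runAsNonRoot != true")
	}
	if pod.SeccompProfile != "RuntimeDefault" && pod.SeccompProfile != "Localhost" {
		out = append(out, "seccompProfile must be RuntimeDefault or Localhost")
	}
	return out
}

func contains(xs []string, want string) bool {
	for _, x := range xs {
		if x == want {
			return true
		}
	}
	return false
}

func boolPtr(b bool) *bool { return &b }

func main() {
	// No securityContext at all: all four checks fail.
	for _, v := range restrictedViolations(podSecurity{}, containerSecurity{}) {
		fmt.Println(v)
	}

	// The values this PR sets: no violations remain.
	fixed := restrictedViolations(
		podSecurity{RunAsNonRoot: boolPtr(true), SeccompProfile: "RuntimeDefault"},
		containerSecurity{AllowPrivilegeEscalation: boolPtr(false), DropCapabilities: []string{"ALL"}},
	)
	fmt.Println(len(fixed))
}
```

Because every field defaults to "unset", a Deployment rendered without any securityContext trips all four checks at once, which matches the rejection seen here.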

Fix

Adds Restricted-PSS-compliant securityContext blocks to both render functions in internal/serviceoffercontroller/render.go:

  • Pod: runAsNonRoot: true, runAsUser: 1000, runAsGroup: 1000, fsGroup: 1000, seccompProfile.type: RuntimeDefault
  • Container (httpd): allowPrivilegeEscalation: false, capabilities.drop: ["ALL"]

The two securityContext payloads are factored into helpers (restrictedPodSecurityContext, restrictedContainerSecurityContext) so future controller-rendered workloads can reuse the same Restricted defaults.
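A minimal sketch of what the two helpers might look like. The helper names come from this PR, but the signatures and the use of plain `map[string]any` values are assumptions made to keep the example dependency-free; the actual render.go may well use typed `corev1.PodSecurityContext` and `corev1.SecurityContext` structs instead.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restrictedPodSecurityContext returns the pod-level fields listed
// above. The map-based return type is an assumption of this sketch.
func restrictedPodSecurityContext() map[string]any {
	return map[string]any{
		"runAsNonRoot":   true,
		"runAsUser":      int64(1000),
		"runAsGroup":     int64(1000),
		"fsGroup":        int64(1000),
		"seccompProfile": map[string]any{"type": "RuntimeDefault"},
	}
}

// restrictedContainerSecurityContext returns the container-level
// fields listed above.
func restrictedContainerSecurityContext() map[string]any {
	return map[string]any{
		"allowPrivilegeEscalation": false,
		"capabilities":             map[string]any{"drop": []any{"ALL"}},
	}
}

func main() {
	// A pod spec fragment shaped like the httpd workloads described
	// in this PR; only the security-relevant fields are shown.
	podSpec := map[string]any{
		"securityContext": restrictedPodSecurityContext(),
		"containers": []any{
			map[string]any{
				"name":            "httpd",
				"securityContext": restrictedContainerSecurityContext(),
			},
		},
	}

	out, err := json.MarshalIndent(podSpec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Each helper returns a fresh value on every call, so one renderer mutating its copy cannot affect another workload's securityContext.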

Both Deployments already bind httpd to 8080 (httpd -f -p 8080 -h /www), which non-root UID 1000 can bind cleanly. No port or Service changes were required.

Why UID 1000

That's the canonical busybox non-root UID and the only Linux user/group the upstream busybox:1.36 image exposes besides root. The httpd payload is a read-only ConfigMap projection, and the new fsGroup: 1000 keeps the projected volumes readable.

Tests

  • New: TestBuildSkillCatalogDeployment_RestrictedPSS
  • New: TestBuildAgentIdentityRegistrationDeployment_RestrictedPSS
  • Shared helper assertRestrictedPSS asserts every Restricted-PSS field on the rendered Deployment so regressions show up at the renderer, not at PSS admission in a live cluster.

go build ./...                                       # clean
go test ./internal/serviceoffercontroller/...        # PASS (5.8s)
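The idea behind a shared assertion helper can be sketched as a path-based check over the rendered object. Everything below is hypothetical: `lookup`, `missingRestrictedFields`, and the map-shaped pod spec are illustrative names and shapes, not the PR's actual `assertRestrictedPSS`, which presumably runs inside `go test` against the real Deployment type.

```go
package main

import "fmt"

// lookup walks nested map[string]any values along path and reports
// whether the full path exists.
func lookup(obj map[string]any, path ...string) (any, bool) {
	var cur any = obj
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[key]
		if !ok {
			return nil, false
		}
	}
	return cur, true
}

// dropsAll reports whether v is a list that contains "ALL".
func dropsAll(v any) bool {
	list, _ := v.([]any)
	for _, item := range list {
		if item == "ALL" {
			return true
		}
	}
	return false
}

// missingRestrictedFields lists every Restricted field that is absent
// or has the wrong value in a rendered pod spec.
func missingRestrictedFields(podSpec map[string]any) []string {
	var missing []string
	expect := func(obj map[string]any, want any, label string, path ...string) {
		if got, ok := lookup(obj, path...); !ok || got != want {
			missing = append(missing, label)
		}
	}

	expect(podSpec, true, "pod runAsNonRoot", "securityContext", "runAsNonRoot")
	expect(podSpec, "RuntimeDefault", "pod seccompProfile.type",
		"securityContext", "seccompProfile", "type")

	containers, _ := podSpec["containers"].([]any)
	for _, raw := range containers {
		c, ok := raw.(map[string]any)
		if !ok {
			continue
		}
		expect(c, false, "container allowPrivilegeEscalation",
			"securityContext", "allowPrivilegeEscalation")
		drop, _ := lookup(c, "securityContext", "capabilities", "drop")
		if !dropsAll(drop) {
			missing = append(missing, "container capabilities.drop")
		}
	}
	return missing
}

func main() {
	// A pod spec with one container and no securityContext anywhere.
	bare := map[string]any{
		"containers": []any{map[string]any{"name": "httpd"}},
	}
	for _, field := range missingRestrictedFields(bare) {
		fmt.Println("missing:", field)
	}
}
```

Checking each field by name means a regression in the renderer names the exact field that went missing, rather than surfacing later as an admission rejection in a live cluster.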

Test plan

  • go build ./... clean
  • go test ./internal/serviceoffercontroller/... green
  • New unit tests fail when securityContext is removed (covered by helper invariants)
  • Live verification: deploy on a stack with PR #521 enforcement, confirm both httpd Deployments reach Ready and /skill.md + /.well-known/agent-registration.json resolve through Traefik
  • Re-run the 14-PR integration matrix and confirm Bug #3 is closed without the manual Deployment patch

Related

HananINouman and others added 5 commits May 22, 2026 22:53
PR #481 only repaired hermes-<id> volumes after hermes.Sync (master agent).
Child agents live under agent-<name> and are provisioned by the controller or
agent-factory without that path, so hermes-data stayed 1000:1000 while Hermes
runs as 10000:10000 and crash-looped on Permission denied under /data/.hermes.

Extend EnsureHermesDataPVCOwnership to agent-<name>/hermes-data, call it from
obol agent new and obol sell demo quant, and add obol agent repair-perms for
factory-only creates that cannot docker-exec the k3d node from in-cluster.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace host-side Hermes PVC ownership repair with Kubernetes fsGroup and keep only a tiny k3d fallback.
PR #511's host-side chown workaround was superseded by PR #514. This merge records the conflict resolution while keeping main's native Kubernetes fsGroup implementation.
…oads

PR #521 enforces Restricted Pod Security Standard on x402 + llm
namespaces. The controller renders two httpd-based Deployments
(obol-skill-md publisher + agentidentity-default-registration
well-known/agent-registration.json publisher) without securityContext,
so PSS admission rejects them and they never start. Result:
marketplace API returns STACK_UNREACHABLE because skill-md isn't
reachable.

Adds Restricted-compliant securityContext to both renderers:
  pod:        runAsNonRoot, runAsUser=1000, runAsGroup=1000,
              seccompProfile=RuntimeDefault, fsGroup=1000
  container:  allowPrivilegeEscalation=false, drop ALL capabilities

Both Deployments already bind httpd to 8080, which is non-root
safe, so no port change is required.

Surfaced by the 14-PR integration test campaign. The integration
test workaround patched the running Deployments manually:
plans/integration-test-results-final-20260524.md Bug #3.
@bussyjd bussyjd changed the base branch from main to feat/restricted-pss-sweep May 24, 2026 06:48
bussyjd (Collaborator, Author) commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

@bussyjd bussyjd closed this May 24, 2026