feat(multi-node): engine DaemonSet bundling + Caddy sticky LB + GPU/CPU pool placement#80

Open: aucahuasi wants to merge 73 commits into main from dev/distributed-streamgl-gpu

aucahuasi commented Apr 22, 2026

Summary

Enables Graphistry to run correctly across multiple GPU nodes from a single Helm release. All GPU-bound services (nginx, forge-etl-python, dask-cuda-worker, streamgl-gpu, streamgl-viz, streamgl-sessions) are consolidated into a single engine DaemonSet pod per GPU node, fronted by Caddy as a session-affinity L7 load balancer. The prior multi-node model (leader + follower helm releases per node) is retired in favour of a single-namespace deployment.

The architectural change

engine DaemonSet (one bundled pod per GPU node)

New templates/engine/engine-daemonset.yaml colocates nginx, fep, dask, and the streamgl-* services as sibling containers in one pod per GPU node. Tier-aware: 3 containers (nginx + fep + dask) at analytics, 6 containers (adds streamgl-{viz,sessions,gpu}) at viz/full. Intra-pod hostAliases pin the app-layer hostnames (streamgl-viz, streamgl-gpu, forge-etl-python, dask-cuda-worker) to 127.0.0.1, so every intra-stack HTTP hop is localhost. streamgl-gpu's PM2 localhost IPC continues to work unchanged because all gpu-router and PM2-forked gpu-worker children are intra-pod.

Companion templates:

  • templates/engine/engine-nginx-cfg.yml -- supplementary nginx conf.d ConfigMap that adds an intra-pod :8080 Host-header dispatcher, loaded alongside the production default.conf.template.
  • templates/engine/engine-service.yaml -- engine-headless Service for the Caddy upstream pool (gated on tier.analytics) plus four shim Services (streamgl-viz, streamgl-sessions, streamgl-gpu, forge-etl-python; all gated on tier.viz) so the production nginx FQDN-suffixed hostnames keep resolving through CoreDNS for any path that does not go through hostAliases.

This bundle supersedes the earlier streamgl-gpu router/spawner split (added in 511876d, retired in 699bc4f). Bundling keeps PM2 IPC working unchanged, makes every intra-stack hop localhost (lower latency, better data locality), and has the same blast radius as a router-split design because each viz session is pinned to one engine pod by Caddy's session-affinity routing anyway. Per-node redundancy comes from running multiple engine pods (one per GPU node) under the headless Service.

The engine DaemonSet's forge-etl-python container now sets GRAPHISTRY_DASK_LOCAL_AFFINITY=1 so each fep submission's persist/compute carries a soft worker-affinity hint pinning it to the dask-cuda-worker on the same pod (matched by HOSTNAME prefix). This is a cross-repo dependency on graphistry/graphistry#3097, which adds the dask_affinity helper that reads this env var and produces the kwargs. With multiple GPU nodes registered to the scheduler, the hint eliminates the cross-node shuffle path that would otherwise ship ~256 MiB partitions over the cluster overlay (~2x ETL speedup measured at 4/92/512 MiB). The hint is soft (allow_other_workers=True) and decays to a no-op on miss, so single-node deployments are byte-identical to pre-change.
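The soft-affinity behaviour described above can be sketched as follows. This is a minimal illustration, not the dask_affinity helper from graphistry/graphistry#3097: the function names, the shape of the worker mapping, and the direction of the prefix match (worker name starts with the pod HOSTNAME) are all assumptions. Only the env var name and the workers / allow_other_workers kwargs come from the description and from Dask's client API.

```python
import os


def affinity_enabled(env=os.environ):
    """True when the chart has set GRAPHISTRY_DASK_LOCAL_AFFINITY=1."""
    return env.get("GRAPHISTRY_DASK_LOCAL_AFFINITY") == "1"


def local_affinity_kwargs(workers, hostname, enabled):
    """Return soft worker-affinity kwargs, or {} when the hint does not apply.

    workers:  mapping of worker address -> info dict with a "name" key
              (hypothetical shape, chosen for this sketch).
    hostname: the pod's HOSTNAME.
    """
    if not enabled or not hostname:
        return {}
    local = [
        addr
        for addr, info in workers.items()
        if str(info.get("name", "")).startswith(hostname)
    ]
    if not local:
        # Miss: no worker on this pod, so the hint decays to a no-op.
        return {}
    # Soft hint: prefer the local worker but allow the scheduler to use others.
    return {"workers": sorted(local), "allow_other_workers": True}


# Hypothetical two-node cluster, one dask-cuda-worker per engine pod.
workers = {
    "tcp://10.42.0.7:40001": {"name": "engine-abc12-gpu0"},
    "tcp://10.42.1.9:40001": {"name": "engine-xyz89-gpu0"},
}
print(local_affinity_kwargs(workers, "engine-abc12", enabled=True))
print(local_affinity_kwargs(workers, "engine-other", enabled=True))
```

The returned dict is meant to be splatted into a persist or compute call (for example `client.persist(ddf, **kwargs)`); an empty dict leaves the call unchanged, which is how a single-node deployment would stay identical to its pre-change behaviour.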

Caddy as session-affinity L7

Caddy load-balances across engine-headless using its dynamic a upstream source (A-record lookups that pick up live pod IPs, refresh 10s). Viz session channels are pinned by session value, not per browser: /streamgl/* (which carries id=<session>), the /graph/socket.io WebSocket, and the /graph/graph.html page (both of which carry session=<session>) use lb_policy query (Caddy's highest-random-weight / rendezvous hashing). Because query hashes the parameter value and ignores the parameter name, id= and session= resolve the same session to the same engine pod, so every client of one session URL converges on one pod. That convergence is what makes a shared session collaborative (node drags, filter exclusions, and histogram color encodings sync live across browsers) and lets that pod reuse its caches (forge-etl-python ETL/dask cache, streamgl-gpu GPU objects, streamgl-viz CPU/nBody state).

All other traffic (landing page, nexus API, static assets, ETL uploads) uses a per-browser HMAC graphistry_sticky cookie (signed with engine.cookieSecret) whose first-assignment policy is caddy.lb.fallback. Distinct sessions still spread evenly across the pool because rendezvous hashing of distinct session ids is stateless and uniform.
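The session pinning above relies on rendezvous (highest-random-weight) hashing. The sketch below illustrates the idea only: the SHA-256 scoring function and the pod names are assumptions for the example, and Caddy's internal hash is not reproduced here.

```python
import hashlib


def hrw_pick(pods, value):
    """Pick the pod whose hash(pod, value) score is highest."""

    def score(pod):
        digest = hashlib.sha256(f"{pod}|{value}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(pods, key=score)


def route(pods, query):
    """Hash the session value whether it arrives as id= or session=."""
    value = query.get("id") or query.get("session") or ""
    return hrw_pick(pods, value)


pods = ["engine-a", "engine-b"]  # hypothetical engine pod names

# The parameter name does not matter, only the value does.
print(route(pods, {"id": "s1"}) == route(pods, {"session": "s1"}))  # True

# Adding a pod only moves the sessions that the new pod wins.
grown = pods + ["engine-c"]
moved = [
    s
    for s in (f"session-{i}" for i in range(100))
    if hrw_pick(grown, s) != hrw_pick(pods, s)
]
print(all(hrw_pick(grown, s) == "engine-c" for s in moved))  # True
```

The selection needs no shared state: any Caddy replica given the same pod list and the same session value computes the same winner, which is why every client of one session URL converges on one pod.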

New operator-tunable knobs in values.yaml:

  • caddy.enabled -- toggle Caddy + caddy-ingress entirely. false lets operators front engine-headless with their own ingress controller (Pattern B); the operator owns TLS termination and session affinity in that case. The render gate on caddy-cfg.yml, caddy-deployment.yaml, and caddy-ingress.yml is now tier.analytics AND caddy.enabled.
  • caddy.tls.mode -- external | self | off. external trusts X-Forwarded-Proto: https from private_ranges so the sticky cookie keeps Secure+SameSite=None. self terminates from existingSecret or ACME. off is plain HTTP only.
  • caddy.lb.fallback -- first-assignment policy for the per-browser graphistry_sticky cookie on non-session traffic (landing page, nexus API, static assets, ETL uploads), used when no cookie is set yet (default round_robin). The viz session channels do not use this; they pin by session value via the hardcoded query / HRW policy, so a shared session stays collaborative regardless of this setting (and session affinity cannot be misconfigured or broken through it).
  • caddy.accessLog.{enabled,output,format,level} -- Caddy access logging (one structured JSON line per request on the Caddy pod's stdout). On by default so every topology has request visibility at the L7 entry point, including non-ingress fronts (tls.mode self/external/off, e.g. Tanzu via an Avi LoadBalancer) where Caddy is the only request-log layer; the /caddy/health/ liveness probe is excluded (via log_skip) so it does not flood the log. Set enabled=false behind a cluster ingress that already logs requests, or to cut volume.
  • caddy.service -- type / loadBalancerIP / nodePort / nodePortHttps / annotations / externalTrafficPolicy. Cloud, Tanzu, MetalLB, NodePort all selectable from values; per-platform annotation hints inline.
  • caddy.upstreamImage -- escape hatch to use caddy:2.10-alpine directly while the bundled wrapper image lags upstream PR caddyserver/caddy#6115 ("reverseproxy: cookie should be Secure and SameSite=None when TLS", the cookie-LB Secure+SameSite=None fix). Liveness probe switches to httpGet when set (no curl in the official image).

Caddy Pod template gains a checksum/config annotation that hashes the rendered Caddyfile into the Pod spec, so any change to TLS mode / lb.fallback / cookieSecret triggers an automatic rollout on helm upgrade. Without this, ConfigMap-only changes left Caddy with stale parsed config in memory until something forced a pod bounce.
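The checksum/config mechanism can be illustrated with a short sketch. It assumes the annotation value is a SHA-256 hex digest of the rendered Caddyfile (the usual Helm sha256sum pattern); the function name and the sample Caddyfile fragments are hypothetical.

```python
import hashlib


def pod_template_annotations(rendered_caddyfile):
    """Annotations for the Pod template, derived from the rendered config."""
    digest = hashlib.sha256(rendered_caddyfile.encode("utf-8")).hexdigest()
    return {"checksum/config": digest}


# Illustrative fragments only, differing in the fallback policy.
before = "lb_policy cookie graphistry_sticky {\n  fallback round_robin\n}\n"
after = "lb_policy cookie graphistry_sticky {\n  fallback least_conn\n}\n"

old = pod_template_annotations(before)
new = pod_template_annotations(after)

# A changed annotation changes the Pod template, so the Deployment rolls.
print(old != new)  # True
# An unchanged config yields the same annotation, so no rollout is triggered.
print(pod_template_annotations(before) == old)  # True
```

Because the digest lives in the Pod template rather than in the ConfigMap, the Deployment controller sees a spec change and performs a normal rolling update.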

Heterogeneous cluster placement: GPU/CPU pools + dedicated-tenant taints

Two new operator-tunable values let the chart land correctly on clusters that mix GPU and CPU node pools, or that use NodePool/managed taints to keep workloads off shared infra:

  • engine.nodeSelector (default {}) -- when non-empty, the engine DaemonSet uses this selector instead of global.nodeSelector (falls back to global when empty). Lets operators target the engine pod at GPU-labelled nodes (graphistry.io/role=gpu) while keeping the chart's CPU-side workloads (caddy, nexus, redis, dask-scheduler, notebook, pivot, gak-{private,public}) on a separate pool via global.nodeSelector (graphistry.io/role=cpu).
  • global.tolerations (default []) -- applied to every chart-rendered workload (11 templates: caddy, dask-scheduler, gak-private/public, http-tools netshoot+whoami, nexus, notebook, pivot, redis, engine DaemonSet). Lets operators opt the chart's pods into nodes carrying operator-defined taints: NVIDIA GPU Operator's nvidia.com/gpu=true:NoSchedule, GKE/EKS managed GPU node-pool taints (nvidia.com/gpu=present:NoSchedule), dedicated-tenant taints (dedicated=graphistry:NoSchedule), etc. Empty [] keeps current behaviour byte-identical. Tolerations are permissive (not directive), so adding a GPU-pool toleration to global.tolerations is harmless for non-GPU workloads -- they still land on whichever node global.nodeSelector admits them to.
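The two values above act on different axes: the selector decides where a pod is allowed, and tolerations decide which taints it accepts. The sketch below models that under simplifying assumptions (exact label match, Equal-operator tolerations only); the helper names are hypothetical and this is not the scheduler's actual algorithm.

```python
def effective_selector(engine_selector, global_selector):
    """engine.nodeSelector wins when non-empty, else fall back to global."""
    return engine_selector if engine_selector else global_selector


def tolerates(taint, tolerations):
    """Simplified match: same key and value, and a compatible effect."""
    for tol in tolerations:
        if (
            tol["key"] == taint["key"]
            and tol.get("value") == taint.get("value")
            and tol.get("effect", taint["effect"]) == taint["effect"]
        ):
            return True
    return False


def can_schedule(node, selector, tolerations):
    labels_ok = all(node["labels"].get(k) == v for k, v in selector.items())
    taints_ok = all(tolerates(t, tolerations) for t in node["taints"])
    return labels_ok and taints_ok


gpu_node = {
    "labels": {"graphistry.io/role": "gpu"},
    "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
}
cpu_node = {"labels": {"graphistry.io/role": "cpu"}, "taints": []}

global_selector = {"graphistry.io/role": "cpu"}
engine_selector = {"graphistry.io/role": "gpu"}
global_tolerations = [
    {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
]

engine_sel = effective_selector(engine_selector, global_selector)
cpu_sel = effective_selector({}, global_selector)

print(can_schedule(gpu_node, engine_sel, global_tolerations))  # True
print(can_schedule(gpu_node, cpu_sel, global_tolerations))     # False
print(can_schedule(cpu_node, cpu_sel, global_tolerations))     # True
```

The second result shows why a GPU-pool toleration in global.tolerations is harmless for CPU-side workloads: the toleration only permits the tainted node, while the selector still excludes it.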

Bug fix: dcgm-exporter and http-tools (netshoot, whoami) were the only chart workloads that did not honour global.nodeSelector; now they do. Pre-fix these would leak onto non-GPU nodes in mixed-pool clusters even when the operator constrained the rest of the chart to GPU nodes.

PVC tier-gating dropped (correctness fix)

gak-private, gak-public, and uploads-files PersistentVolumeClaims are no longer gated on tier.full / tier.analytics. With Retain reclaim policy, gating made tier downgrades leave PVs Released with stale claimRef; later upgrades created new PVCs with fresh UIDs that no longer matched, and the new PVCs sat Pending indefinitely. Consuming Deployments stay tier-gated; only the storage object is unconditional.
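The failure mode behind this fix can be modelled in a few lines. This is a deliberately simplified sketch of claimRef matching, with hypothetical names and UIDs; the real binding logic lives in the Kubernetes PV controller and considers more fields.

```python
def can_bind(pv, pvc):
    """A PV with a claimRef only matches the exact claim it references."""
    ref = pv.get("claimRef")
    if ref is None:
        return True  # Available PV: any compatible claim may bind.
    return (ref["namespace"], ref["name"], ref["uid"]) == (
        pvc["namespace"],
        pvc["name"],
        pvc["uid"],
    )


# PV retained after a tier downgrade deleted the gated PVC.
released_pv = {
    "claimRef": {"namespace": "graphistry", "name": "uploads-files", "uid": "uid-1"}
}
# Same name, but recreated on the next upgrade with a fresh UID.
recreated_pvc = {"namespace": "graphistry", "name": "uploads-files", "uid": "uid-2"}
# A PVC that was never deleted keeps its UID across tier changes.
surviving_pvc = {"namespace": "graphistry", "name": "uploads-files", "uid": "uid-1"}

print(can_bind(released_pv, recreated_pvc))  # False
print(can_bind(released_pv, surviving_pvc))  # True
```

Keeping the PVC unconditional means the claim is never deleted during a tier transition, so its UID and the PV's claimRef stay in agreement.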

Validation

End-to-end on a 2-node k3s cluster (node1 = k3s server with NFS server colocated, node2 = k3s agent / NFS client) with NFS RWX storage, tier viz, two engine pods (one per node), Caddy as ClusterIP fronted by k3s Traefik on node1.

  • ETL 5-call sequence at analytics tier: /readarrow, /upload, /preshape, /properties, /download all returned 200 on both nodes. Local-worker affinity hint (added in graphistry_master) was exercised: each fep submission's persist/compute kwargs named the dask-cuda-worker on its own node, verified by HOSTNAME-prefix match against scheduler worker info. Behavior decayed to a no-op when a fep call landed on a node without a registered local worker, so single-node deployments are byte-identical to pre-change.
  • Multi-node session test at viz tier: browsers opened sessions against the public Caddy endpoint on node1. Each viz session pinned to one engine pod by session value (HRW): two browsers opening the same session URL converged on the same pod and collaborated in real time (node drags, filter exclusions, and histogram color encodings synced across both), while distinct sessions spread across the two engine pods. A session's /streamgl/* reads and its /graph/socket.io WebSocket landed on the pod holding its gpu-router and nBody state, confirmed via gk_status_gpu_list (the session id present on exactly one pod) and per-pod log tailing. Long-idle sessions (60s+ inactivity) survived without disconnect.
  • Tier transitions: upgraded viz -> analytics -> viz against a release with non-empty PVCs. gak/uploads PVCs survived the round trip and re-bound to their existing PVs (PV/PVC UID stable, claimRef intact). Pre-fix this would have left the new PVCs Pending.
  • Caddyfile config rollout: edited only caddy.lb.fallback in values and re-ran helm upgrade. Caddy Pod rolled automatically because the rendered Caddyfile checksum changed; ConfigMap-only edits no longer require manual pod bounce.
  • Concurrent upload test (carried over from earlier validation in this branch): 1.77 GB of concurrent uploads across both ingress replicas (now both inside engine pods) with zero errors; the only render glitch observed on a 7th concurrent tab was the HTTP/1.1 6-connections-per-origin browser limit on plain-HTTP localhost (documented upstream, dissolves under HTTPS/HTTP/2), not chart or backend.
  • Telemetry stack on multi-node k3s (3 GPUs across 2 nodes): Pre-fix, Grafana's DCGM dashboard showed metrics for only 1 GPU because prometheus scraped the Service VIP and kube-proxy load-balanced to a single DaemonSet pod per scrape. Post-fix (kubernetes_sd_configs role: pod + relabel rules), all 3 GPUs visible with stable per-node labels. node-exporter validated equivalently. The new prometheus-rbac.yaml ServiceAccount + Role + RoleBinding satisfies prometheus's apiserver pod-discovery; automountServiceAccountToken: true is explicit on the prometheus pod for clusters that flip the namespace default.
  • Platform-tier nexus-proxy end-to-end (tier=platform, postgres + nexus + nexus-proxy slice): All five v1-to-v2 deprecation shims (/etl, /api/check, /api/encrypt, /api/decrypt, /api/v1/etl/vgraph/*) returned 410 Gone with the documented upgrade message. Live /api/v1/* routes (/datasets/, /files/, /organization/, /team/, /named-endpoint/, /my/user/entitlements/) returned 200 with real data after a Django session login at /accounts/login/. v2 ETL routes routed correctly through nginx (returned upstream errors at platform tier as expected — the GPU backends streamgl-gpu and forge-etl-python intentionally don't render at this tier). Tier transition platform -> analytics removed nexus-proxy and brought up the engine DaemonSet's nginx container; analytics -> platform reversed it cleanly.
  • Caddy stability: Two crash modes fixed. First, the single-line handle @grafana { reverse_proxy ... } block syntax tripped Caddy v2's parser ("Unexpected next token after '{' on same line"); it is now expanded to multi-line form. Second, the /caddy/health/ endpoint was being intercepted by the catch-all handle and proxied to engine-headless (no endpoints) after respond wrote 200, because respond is non-terminal in Caddy v2; wrapping it in a terminal handle /caddy/health/ { respond 200 ... } broke the resulting 30s SIGTERM-then-restart liveness loop. The new templates/caddy/_helpers.tpl defines graphistry.caddy.healthHandle and graphistry.caddy.telemetryHandles, removing the 3-4× duplication of identical handle blocks across the tls.mode = external | self | off branches.

Telemetry as a properly-structured subchart

telemetry is now a Helm subchart consumed via dependencies: with condition: global.ENABLE_OPEN_TELEMETRY. Helm fully prunes the subchart's templates when the flag is off (was previously per-template {{- if }} gates inside the parent).
Telemetry-specific values move to a top-level telemetry: block in the parent's values.yaml (subchart-canonical layout); only cross-cutting values (OTLP endpoint, instance name, image-pull config, default scheduling, storage class) stay under global.*.

The parent's shared _helpers.tpl (graphistry.tier.* helpers) is extracted into a new graphistry-common library subchart so the telemetry subchart can reuse the same tier gating via a dependency entry.

New operator-tunable knobs:

  • telemetry.dcgmExporter.useExternal + telemetry.dcgmExporter.externalEndpoint (default false / "") — when true the chart skips its own dcgm-exporter DaemonSet+Service and points prometheus + otel-collector at an externally-managed endpoint.
    Use cases: GKE with NVIDIA GPU Operator (the bundled exporter image fails on Container-Optimized OS), or any cluster already running the GPU Operator's DCGM module (avoids two DaemonSets scraping the same GPUs). Format host:port, no scheme.
  • telemetry.nodeExporter.useExternal + telemetry.nodeExporter.externalEndpoint — same pattern for clusters already running kube-prometheus-stack's node-exporter; avoids two DaemonSets per node.
  • telemetry.prometheus.enableAdminAPI (default false) — when true passes --web.enable-admin-api to prometheus, enabling tsdb/delete_series, tsdb/clean_tombstones, snapshot, and shutdown. Mirrors kube-prometheus-stack's prometheusSpec.enableAdminAPI default. Off in production; flip in dev/test to drop stale series after scrape-config refactors.
  • telemetry.prometheus.retention (default 15d) — local TSDB retention.
  • telemetry.{prometheus,jaeger,grafana}.persistence.{enabled,size,storageClassName} -- per-component PVC config for the otel-collector backend stack. Storage class falls back to global.storageClassNameOverride then retain-sc. Without persistence, Grafana dashboards/sessions and Jaeger traces are lost on every pod restart.

Topology + correctness changes:

  • Prometheus is now a single-replica Deployment with Recreate strategy + RWO PVC (was per-node DaemonSet, no global view, no retention). Multi-replica HA is out of scope; users who need it should add Thanos or remote_write to a managed backend in their overrides.
  • prometheus scrape config: dcgm-exporter and node-exporter jobs switched from Service-VIP static_configs to kubernetes_sd_configs with role: pod + relabel rules that set the node label from pod_node_name and __address__ from pod_ip. Pre-fix prometheus scraped the Service VIP and kube-proxy load-balanced to one DaemonSet pod per scrape, silently dropping the other nodes' GPU/host metrics from dashboards. Post-fix every DaemonSet pod is a distinct scrape target with a stable per-node label.
  • New templates/prometheus-rbac.yaml: ServiceAccount + namespace-scoped Role + RoleBinding granting read-only pods / services / endpoints get / list / watch. Required by prometheus's kubernetes_sd_configs apiserver discovery. The prometheus pod sets automountServiceAccountToken: true explicitly — the K8s default is true, but locked-down clusters (some Tanzu/OpenShift profiles, hardened GKE namespaces) flip it to false at namespace level, which would silently break pod discovery.
  • otel-collector ConfigMap unified: previously otel-collector-cloud-configmap.yaml and otel-collector-configmap.yaml rendered separately based on OTEL_CLOUD_MODE; now one templated ConfigMap with cloud-mode/self-hosted branches inline. Cloud-mode credentials read from a pre-created Secret via secretKeyRef (never inlined into values.yaml).
  • checksum/config annotation on otel-collector / prometheus / grafana / jaeger pod templates — ConfigMap content changes now trigger a rollout on helm upgrade (was previously a no-op until manual pod delete).
  • Per-component telemetry Ingresses (grafana-ingress.yaml, jaeger-ingress.yaml, prometheus-ingress.yaml) removed; replaced by Caddy path routes (/grafana, /jaeger, /prometheus) on the parent chart's main Ingress.
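The scrape-config change in the list above (Service VIP replaced by per-pod discovery) can be sketched as a relabel step. The `__meta_kubernetes_pod_ip` and `__meta_kubernetes_pod_node_name` labels are the ones Prometheus exposes for role: pod; the port, pod IPs, and node names below are assumptions for the example.

```python
def relabel(meta, port):
    """Build one scrape target from a discovered pod's metadata labels."""
    return {
        "__address__": f"{meta['__meta_kubernetes_pod_ip']}:{port}",
        "node": meta["__meta_kubernetes_pod_node_name"],
    }


# Hypothetical dcgm-exporter pods, one per GPU node.
discovered_pods = [
    {"__meta_kubernetes_pod_ip": "10.42.0.11", "__meta_kubernetes_pod_node_name": "node1"},
    {"__meta_kubernetes_pod_ip": "10.42.1.12", "__meta_kubernetes_pod_node_name": "node2"},
]

DCGM_PORT = 9400  # assumed exporter port for this sketch

targets = [relabel(meta, DCGM_PORT) for meta in discovered_pods]
for target in targets:
    print(target)

# Pre-fix equivalent: one static target (the Service VIP), so each scrape
# reached only whichever pod kube-proxy happened to pick.
vip_targets = [{"__address__": f"dcgm-exporter:{DCGM_PORT}"}]
print(len(targets), len(vip_targets))  # 2 1
```

With one target per pod, every node contributes its own series on every scrape, and the node label stays stable because it comes from the pod's scheduling metadata rather than from the load balancer's choice.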

Platform-tier nexus-proxy (nginx fronting nexus, v1-to-v2 endpoint rewrites)

New templates/nexus-proxy/nexus-proxy-deployment.yaml + nexus-proxy-service.yaml: platform-tier-only nginx Deployment + Service that fronts nexus and provides the v1-to-v2 endpoint rewrites + deprecated-endpoint 410 shims baked into the graphistry/nginx image. Renders only when global.tier == "platform" (exact match, not >=); at analytics+ the engine DaemonSet's nginx container plays the same role with intra-pod localhost dispatch to the streamgl/forge backends, so the standalone Deployment would be redundant.
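The difference between the exact-match gate and the at-least gates can be sketched as below. The tier ordering (platform < analytics < viz < full) is inferred from this description rather than quoted from the chart, and the function names are hypothetical stand-ins for the graphistry.tier.* helpers.

```python
TIERS = ["platform", "analytics", "viz", "full"]  # assumed ordering


def tier_at_least(current, minimum):
    return TIERS.index(current) >= TIERS.index(minimum)


def renders_nexus_proxy(tier):
    """Exact match: the standalone proxy exists only at platform tier."""
    return tier == "platform"


def engine_containers(tier):
    """Containers in the engine DaemonSet pod for a given tier."""
    if not tier_at_least(tier, "analytics"):
        return []
    containers = ["nginx", "forge-etl-python", "dask-cuda-worker"]
    if tier_at_least(tier, "viz"):
        containers += ["streamgl-viz", "streamgl-sessions", "streamgl-gpu"]
    return containers


for tier in TIERS:
    print(tier, renders_nexus_proxy(tier), len(engine_containers(tier)))
```

Under these assumptions exactly one nginx fronts nexus at every tier: the standalone nexus-proxy at platform, and the engine pod's nginx container from analytics upward.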

Why it exists: the v1-to-v2 rewrites Graphistry clients depend on (deprecation 410s on /etl, /api/check, /api/encrypt, /api/decrypt, /api/v1/etl/vgraph/*; live forwarding for /api/v1/{datasets,files,organization,team,named-endpoint,...}; parallel /api/v2/etl/... routes) live in the graphistry/nginx image's default.conf.template rendered by render_templates.sh. Pre-fix, tier=platform rendered postgres + nexus only, so a nexus-only deployment had no externally-reachable HTTP surface honouring the v1 paths.

The platform tier now ships a deployable slice of postgres + nexus + nexus-proxy with no transitive dependencies on analytics-tier services. The Deployment's only init container waits for the nexus Service alone (not redis / dask-scheduler / streamgl), so the slice is genuinely self-contained — useful for downstream products that need only the auth foundation (e.g. as the Nexus-only auth backend behind another product's chart).

In-cluster reachability (the actual ask for service-to-service integrations like Louie):

  • http://nexus-proxy.<ns>.svc.cluster.local:80 — both v1 and v2 paths
  • http://nexus.<ns>.svc.cluster.local:8000 — raw nexus, v2 only

External reachability is deliberately not chart-managed at platform tier (no Ingress, no Caddy). Operators bring their own L7 (Pulumi, Ansible, kustomize, manual Ingress), or use port-forward for dev/test:

kubectl -n <ns> port-forward --address 0.0.0.0 svc/nexus-proxy 8080:80
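With the port-forward above running, the deprecation shims can be smoke-checked with a small script. This is a sketch under assumptions: it issues plain GET requests (the shims may also answer other methods), the concrete vgraph path stands in for the /api/v1/etl/vgraph/* wildcard, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

DEPRECATED_PATHS = [
    "/etl",
    "/api/check",
    "/api/encrypt",
    "/api/decrypt",
    "/api/v1/etl/vgraph/example",  # placeholder under the vgraph/* wildcard
]


def http_status(base_url):
    """Return a function that maps a path to its HTTP status code."""

    def get(path):
        try:
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            # urllib raises for 4xx/5xx; the code is what we want to inspect.
            return err.code

    return get


def failing_shims(get_status):
    """Paths that did not answer 410 Gone."""
    return [path for path in DEPRECATED_PATHS if get_status(path) != 410]


# Offline demonstration with a stand-in for the real endpoint.
def fake_status(path):
    return 200 if path == "/api/check" else 410


print(failing_shims(fake_status))  # ['/api/check']
```

Against a live port-forward the call would be `failing_shims(http_status("http://localhost:8080"))`, and an empty list means every shim answered 410.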

Reuses existing values and PVCs: NginxResources, global.{nodeSelector,tolerations,imagePullSecrets,restartPolicy}, postgres secret refs, and the local-media-mount + data-mount PVCs. The image's content-directory expectations for /streamgl, /pivot, and /upload paths are satisfied by emptyDir mounts — those routes return 404 at platform tier, which is the correct behaviour when the backends don't exist.

Complementary changes (details in CHANGELOG.md)

  • Storage-agnostic chart: PVC templates unified on global.storage.accessMode + global.storageClassNameOverride. Removed ENABLE_CLUSTER_MODE, IS_FOLLOWER, multiNode, clusterVolume, provisioner, REDIS_URL_NEXUS_FEP, longhornDashboard, and the hardcoded datamount-longhorn / postgres-longhorn / retain-sc-cluster StorageClass names. Longhorn becomes just another backend operators can point a SC at.
  • Dedicated retain-sc-postgres StorageClass for the postgres-cluster chart, per Crunchy PGO's documented pattern.
  • Dask Kubernetes Operator removed: dask-cluster.yml CRD, dask.operator toggle, operator sections from all platform READMEs and Sphinx docs, ArgoCD app, CD subchart, chart bundler entries, ACR import script, dev-compose setup script.
  • OTEL / Redis Service hardening: otel-collector Service flipped to ClusterIP + internalTrafficPolicy: Local (DaemonSet collector model, per opentelemetry-operator#1401); Redis Service flipped to ClusterIP (was LoadBalancer, a latent security default).
  • charts/values-overrides/examples/cluster/ rewritten end-to-end (~+800 lines): legacy leader/follower multi-namespace example deleted. New single-namespace multi-node guide adds:
    • Architecture diagram of the engine-DaemonSet topology and Caddy session-affinity ingress
    • cookieSecret rotation procedure (HMAC key for graphistry_sticky)
    • Two-axis scheduling model (nodeSelector for where-allowed vs tolerations for what-taints-accepted) with a worked 5-node A100 walkthrough
    • Cost-optimised variant (mixed GPU/CPU pools using engine.nodeSelector + global.tolerations)
    • Three approaches for pinning CPU singletons (caddy, nexus, redis, postgres) to a small CPU pool
    • End-to-end verification commands (pod placement, session-to-pod affinity, ETL 5-call)
    • cluster/retain-sc-nfs.yaml for operators who prefer to apply the Retain StorageClass separately
  • k3s README updated: new "Postgres StorageClass" / "Graphistry StorageClass" subsections, two-SC install flow, same-namespace invariant note. PV-cleanup runbook reorganized into three explicit options (Graphistry-only / postgres-only / both) so operators don't accidentally wipe the wrong chart's data.
  • Chart bug fixes: dcgm-exporter DaemonSet and netshoot / whoami http-tools gained the missing nodeSelector blocks (they were the only workload templates that did not honour global.nodeSelector).
  • Documentation refactored to modern helm-docs standards, abandoned code removed: values.yaml carries helm-docs # -- field annotations so the README value tables are generated rather than hand-maintained and drifting; the chart READMEs / NOTES / CLUSTER / TROUBLESHOOTING were restructured around the single-namespace topology; and dead templates, values, scripts, and doc sections left over from the retired leader-follower model are deleted rather than carried forward (the storage-agnostic, Dask-operator-removal, and example-rewrite items above are part of this sweep).
  • examples/gk.sh operator helpers for the multi-pod engine: per-pod and fan-out log tailing (gk_logs_engine[_all] [container ...], container-scoped, colorized per container, GK_LOGS_TAIL-bounded so a tail starts from "now" instead of replaying rotated history), and gk_status_* health/session helpers that emit pod + host (GPU node) tagged JSON, so gk_status_gpu_list | jq resolves which engine pod and physical host holds a given viz session.

Chart versions: graphistry-helm 0.4.3 -> 0.5.0 (minor, not patch: this PR is a breaking re-architecture, not a fix), postgres-cluster 0.7.5 -> 0.8.0. The two are sibling charts, installed separately and versioned independently; postgres-cluster's appVersion tracks the Crunchy PGO release it deploys (5.2.0 / Postgres 14). This release pairs graphistry-helm 0.5.0 with postgres-cluster 0.8.0. graphistry-helm's appVersion now tracks the Graphistry release it deploys (2.50.7, matching global.tag), correcting a long-standing mirror of the chart version; global.tag is bumped v2.50.6 -> v2.50.7 so the chart ships the current release.

Chart renames. Both charts are renamed to Helm-convention names: Graphistry-Helm-Chart -> graphistry-helm and postgrescluster -> postgres-cluster. The prior name: fields used uppercase / no-dash forms that break Helm's naming rule (lowercase, dashes) and did not match their already-correct directories. Only the name: fields and published-name references change; the directories (charts/graphistry-helm, charts/postgres-cluster) are unchanged, so ArgoCD path-based sources and the docs build are untouched, and chart-releaser publishes the new names automatically on the next release (existing Graphistry-Helm-Chart 0.4.x / postgrescluster 0.7.x index entries remain for current installs). The Crunchy postgresclusters CR kind is upstream and unaffected.

Test plan

  • 2-node k3s + NFS RWX: helm install succeeds end-to-end
  • ETL 5-call sequence (analytics tier) succeeds on both nodes; local-worker affinity exercised and decays to no-op on miss
  • Multi-node viz sessions pinned per-session (HRW) across engine pods; two browsers on one session URL converge and collaborate; distinct sessions spread across pods; reconnects stay pinned
  • Tier transitions (viz -> analytics -> viz) preserve PVC binding
  • Caddyfile config edits trigger automatic Pod rollout via checksum/config on helm upgrade
  • dcgm-exporter respects global.nodeSelector (prior bug: leaked onto non-GPU nodes)
  • helm uninstall + reinstall rebinds PVCs via volumeName workflow, preserves data
  • Concurrent multi-GB uploads across nodes, zero backend errors
  • Mixed GPU/CPU node-pool test: engine.nodeSelector=graphistry.io/role=gpu + CPU singletons pinned via global.nodeSelector=graphistry.io/role=cpu; verify engine pods land only on GPU nodes and caddy/nexus/redis/dask-scheduler land only on CPU nodes
  • global.tolerations validation against NVIDIA GPU Operator taint (nvidia.com/gpu=true:NoSchedule) on a managed cluster (GKE/EKS GPU node-pool)
  • Dedicated-tenant taint validation (dedicated=graphistry:NoSchedule) -- chart pods schedule onto tainted nodes only when the toleration is set
  • Telemetry stack on multi-node k3s: all GPUs and all nodes visible via per-node labels (post kubernetes_sd_configs switch)
  • Platform-tier slice (tier=platform): postgres + nexus + nexus-proxy renders standalone; v1-to-v2 deprecation 410s + authenticated /api/v1/* round-trips return real data; tier transitions to/from analytics are clean
  • Caddy stable on tls.mode = external | self | off paths; /caddy/health/ no longer trips the 30s SIGTERM liveness loop
  • Production-scale load
  • Longhorn RWX integration

Related

…emove Dask Operator

Replace dask-cuda-worker Deployment (with K8s-level replicas scaling) with a
DaemonSet matching the same pattern used by forge-etl-python and streamgl-gpu.
This aligns the Helm chart with the docker-compose GPU architecture where all
GPU services run one pod per node and scale workers internally via env vars
(DASK_NUM_WORKERS, DCW_CUDA_VISIBLE_DEVICES) rather than K8s replicas.

The previous Deployment model (dask.workers replicas) contradicted the app-level
multi-GPU configuration: multiple replicas on the same node would see the same
CUDA_VISIBLE_DEVICES and compete for GPU memory. The DaemonSet model ensures
one pod per GPU node with app-controlled worker processes and round-robin GPU
assignment, consistent with forge-etl-python and streamgl-gpu.

Remove the Dask Kubernetes Operator integration (dask.operator toggle,
DaskCluster CRD template, operator install docs, Argo CD app, ACR import,
chart-bundler gathering, dev-compose setup). The operator's pod-level scaling
model conflicts with Graphistry's app-level GPU management where services
control their own GPU assignment and worker counts via environment variables.

Remove forgeWorkers Helm value — FORGE_NUM_WORKERS is now controlled exclusively
via env vars (default: 4), matching how DASK_NUM_WORKERS and STREAMGL_NUM_WORKERS
already work. This gives operators a single, consistent interface for GPU worker
configuration across all services.

Templates changed:
- dask-cuda-worker-daemonset.yaml: rewritten as DaemonSet (was Deployment)
- dask-cluster.yml: deleted (DaskCluster operator mode)
- dask-scheduler-deployment.yaml: removed dask.operator guard
- forge-etl-python-daemonset.yaml: removed forgeWorkers Helm value reference

Values cleaned:
- Removed dask.workers, dask.operator, forgeWorkers from values.yaml
- Removed forgeWorkers from k3s example values

Docs/infra cleaned (23 files):
- Removed Dask Operator install/troubleshoot sections from all READMEs
  (k3s, gke, tanzu, cluster, troubleshooting.md)
- Removed dask-kubernetes-operator-docs.rst and index.rst reference
- Removed dask.workers and forgeWorkers from graphistry-helm-docs.rst
- Rewrote troubleshooting.md Dask architecture section to document
  DaemonSet model and app-level GPU/worker configuration
- Removed dask-operator-cd.yaml Argo CD app
- Removed dask operator from cd/repo/Chart.yaml, bundler.sh,
  helm-dev-setup-deploy.sh, ACR import script, docs Makefile

Tested: deployed on jorge7 GKE cluster, dask-cuda-worker DaemonSet pod
starts correctly, detects 2 GPUs (CUDA_VISIBLE_DEVICES=0,1), forge-etl-python
connects to dask-scheduler and initializes 4 workers with round-robin GPU
assignment (0,1,0,1).
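
The round-robin assignment reported above can be sketched as a modulo over the visible devices. This is an illustration only, assuming a simple index-modulo scheme; the function name is hypothetical and the services' actual assignment code is not shown here.

```python
def assign_gpus(num_workers, visible_devices):
    """Assign each worker a GPU id, cycling through the visible devices."""
    gpus = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if not gpus:
        raise ValueError("no visible GPU devices")
    return [gpus[i % len(gpus)] for i in range(num_workers)]


# 4 workers on a node exposing CUDA_VISIBLE_DEVICES=0,1
print(assign_gpus(4, "0,1"))  # ['0', '1', '0', '1']
```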
…streamgl-gpu router split

Make the chart correctly deploy Graphistry across multiple GPU nodes
from a single Helm release. Three interlocking chart changes land the
core of the multi-node story; everything else in this commit is
complementary refactor cleanup.

1. streamgl-gpu split (router Deployment + spawner DaemonSet)
---------------------------------------------------------------
New `streamgl-gpu-deployment.yaml` deploys a cluster-wide router
(replicas=1) that holds the session-to-worker registry in memory and
proxies `/streamgl/*` WebSocket traffic to whichever worker owns a
session. New `streamgl-gpu-networkpolicy.yaml` restricts the router's
`/internal/*` endpoints (register, deregister, worker-event) to
in-cluster callers as defense-in-depth on top of the nginx-level
denial of `/streamgl/internal/`.

Workers stay per-node as a DaemonSet, but the in-pod spawner now
registers each worker with the remote router over HTTP rather than
relying on PM2's localhost event bus. That removes the PM2 IPC
limitation that blocked multi-node in every prior design. Workers on
bob are now discoverable and schedulable from jorge7 (and vice versa)
via `/internal/register`, and sessions fan out across all GPU nodes.

New Helm value `StreamglGpuWorkerResources` separates the DaemonSet
(GPU) resource block from `StreamglGpuResources` (router, no GPU),
since the two workloads have very different resource profiles. Worker
count and GPU visibility stay on env vars (`STREAMGL_NUM_WORKERS`,
`CUDA_VISIBLE_DEVICES`) consistent with the `DASK_NUM_WORKERS` and
`FORGE_NUM_WORKERS` pattern.

Corresponds to the app-layer PR graphistry/graphistry#3087.

2. nginx: Deployment -> DaemonSet
---------------------------------
nginx was a single-replica Deployment, which on a multi-node cluster
funnelled all external traffic through one node and every upload
through one `forge-etl-python` pod via the load-balanced Service.
That pinning broke the upload path on NFS because nginx on node A
wrote the request body to the RWX PVC and `forge-etl-python` on node
B read it as 0 bytes until node B's NFS client attribute cache
refreshed, well past `FORGE_MAX_FILE_WAIT_MS`. Uploads 500'd with
"Waited longer than 10000 ms for from_path ... to be populated".

Running nginx as a DaemonSet gives each GPU node its own ingress pod.
Caddy still reverse-proxies to the `nginx` Service with the default
Cluster policy, so kube-proxy distributes external traffic across
every node's nginx replica. Each node's nginx then hands off to its
local `forge-etl-python` via the Local routing below, so writer and
reader share a single kubelet's NFS client. Load distribution is
automatic, matches the DaemonSet count, and requires no `replicas:`
tuning or HPA. The `rollingUpdate`/`maxSurge`/`Recreate` branch is
gone (DaemonSets only support `RollingUpdate`/`OnDelete`); rolling
updates now proceed one node at a time with `maxUnavailable: 1`.
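
A minimal sketch of the update strategy described above, shown as a fragment of a DaemonSet manifest. Only the `updateStrategy` fields come from the text; the selector, pod template, and everything else are omitted, so this is not a complete manifest.

```yaml
# Fragment of a DaemonSet manifest (sketch; selector and pod template omitted).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx
spec:
  updateStrategy:
    type: RollingUpdate      # DaemonSets accept RollingUpdate or OnDelete
    rollingUpdate:
      maxUnavailable: 1      # replace one node's pod at a time
```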

3. forge-etl-python Service: internalTrafficPolicy: Local
---------------------------------------------------------
Pairs with the nginx DaemonSet above. Every Service call to
`forge-etl-python` routes to the DaemonSet pod on the caller's node,
so nginx's write to the `uploads-files` PVC is read back through the same
kubelet's NFS client. The NFS cross-node coherence race goes away.

Trade-off: nginx on node A cannot fall back to `forge-etl-python` on
node B if node A's pod is unhealthy; DaemonSet per-node liveness
probes cover that failure mode. Single-node deployments are
unaffected (there is only one endpoint). This is the same pattern
already applied to `otel-collector` for the same data-locality reason
(upstream OpenTelemetry Operator guidance).
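
As a rough illustration of the routing change, a Service using `internalTrafficPolicy: Local` could look like the sketch below. The selector label follows the `io.kompose.service` convention mentioned elsewhere in this PR, and the port number is a placeholder; neither is copied from the chart.

```yaml
# Sketch only: selector label and port are assumptions, not chart values.
apiVersion: v1
kind: Service
metadata:
  name: forge-etl-python
spec:
  type: ClusterIP
  internalTrafficPolicy: Local   # in-cluster callers reach only endpoints on their own node
  selector:
    io.kompose.service: forge-etl-python   # assumed pod label
  ports:
    - name: http
      port: 8080          # placeholder port
      targetPort: 8080
```

With this policy, a caller on a node that has no ready endpoint gets no backend at all, which is the trade-off noted above.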

---

Validated end-to-end on a 2-node k3s cluster (jorge7 + bob) with NFS
RWX: 1.77 GB of concurrent uploads across both nginx replicas with
zero errors; 6 active WebSocket sessions distributed 3/3 across GPU
nodes; cross-tab node-move sync working across the streamgl-gpu
router; no retry amplification under sustained load.

Complementary changes in this commit (details in CHANGELOG.md):

- Storage-agnostic chart: PVC templates unified on
  `global.storage.accessMode` + `global.storageClassNameOverride`.
  Removed `ENABLE_CLUSTER_MODE`, `IS_FOLLOWER`, `multiNode`,
  `clusterVolume`, `provisioner`, `REDIS_URL_NEXUS_FEP`,
  `longhornDashboard`, and the hardcoded `datamount-longhorn` /
  `postgres-longhorn` / `retain-sc-cluster` StorageClass names.
  Longhorn becomes just another backend operators can point a SC at.
- Dedicated `retain-sc-postgres` StorageClass for the postgres-cluster
  chart, per Crunchy PGO's documented pattern.
- `LEADER_OTEL_EXPORTER_OTLP_ENDPOINT` removed; `otel-collector`
  Service flipped to `ClusterIP` + `internalTrafficPolicy: Local`;
  Redis Service flipped to `ClusterIP` (was LoadBalancer).
- `charts/values-overrides/examples/cluster/` replaced: legacy
  leader/follower multi-namespace example deleted, new single-
  namespace multi-node guide added with NFS as the documented default
  plus a `retain-sc-nfs.yaml` manifest for operators who prefer to
  apply the Retain StorageClass separately.
- k3s README: new "Postgres StorageClass" and "Graphistry
  StorageClass" subsections, two-SC install flow, same-namespace
  invariant note, Cleanup filter extended to cover both charts.
- Fixes: `dcgm-exporter` DaemonSet and `netshoot`/`whoami` http-tools
  gained the missing `nodeSelector` blocks (were the only templates
  that did not honour `global.nodeSelector`).
@aucahuasi aucahuasi self-assigned this Apr 22, 2026
@aucahuasi aucahuasi changed the title Dev/distributed streamgl gpu feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and NFS-coherent forge-etl-python routing Apr 22, 2026
@aucahuasi aucahuasi changed the title feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and NFS-coherent forge-etl-python routing feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and distributed FS coherent forge-etl-python routing Apr 22, 2026
… docs

Follow-up to 511876d, picking up cleanups and operator-facing docs
surfaced during multi-node validation on the k3s cluster.

- templates/dask: remove vestigial dask-cuda-worker Service (dead since
  66f1c63 "DKO HOTFIX" 2023-02-14); Dask workers register by pod IP +
  ephemeral port with the scheduler's in-memory registry, never through
  a Service. Inline comment documents why Dask bypasses Services by
  design, and why the dashboard stays bound to localhost.

- templates/forge-etl: init container wait switched from
  "service dask-cuda-worker" to "pod -lio.kompose.service=..." since
  the Service is gone; pod-label readiness is strictly stronger than
  Service-existence as an init gate (a Service can render with zero
  ready backends).

- templates/streamgl NetworkPolicy: inline DEV-MODE GAP and CNI
  REQUIREMENT caveats. Policy is gated on global.devMode == false and
  is inert at runtime on non-enforcing CNIs (vanilla flannel, stock
  AWS VPC CNI without the add-on); nginx L7 deny is the only remaining
  defense on those clusters.

- examples/cluster/README: operator-facing NetworkPolicy CNI section
  listing enforcers (kube-router in k3s >=1.25, Calico, Cilium, Antrea,
  Weave, GKE with --enable-network-policy) vs non-enforcers, with a
  pointer to test_1_networkpolicy.md for a 5-minute verification.

- examples/troubleshooting: document HTTP/1.1 6-connections-per-origin
  browser limit that hangs the 7th viz tab on plain-HTTP localhost; fix
  is HTTP/2 multiplex via Caddy "tls internal" (dev), mkcert for a
  named host, or automatic ACME (prod). Diagnosed by Manfred on hub.

- docs/configure-storageclass: new "StorageClass defaults per backend"
  table (9 backends x default volumeBindingMode x default reclaimPolicy
  x override-needed column) so operators know up front which knobs
  must be set at SC creation rather than discovering it from a pending
  PVC.
…vices into single bundled pod per node

Replaces the per-service Deployments/DaemonSets (nginx, forge-etl-python,
dask-cuda-worker, streamgl-gpu, streamgl-viz, streamgl-sessions) with a
single `engine` DaemonSet pod per GPU node that colocates all of them as
sibling containers. Referred to internally as "fatpod" during development;
final name is `engine` for both single-node and multi-node deployments at
analytics+ tier.

Why this design over the previous per-service / streamgl-gpu router split

  - Latency: every intra-stack HTTP hop (viz <-> gpu, fep <-> dask,
    nginx <-> any) is now localhost. The streaming hot path no longer
    crosses the CNI overlay on viz frames or ETL submissions.
  - Data locality: dask-cuda-worker, fep, and streamgl-gpu run on the
    same physical host as their consumers; partitioned cuDF/dask_cudf
    data lives next to compute. Pairs with the local-worker affinity
    hint added to fep ETL submissions in graphistry_master.
  - Resilience: equivalent to previous designs. Browser sessions are
    pinned across engine pods via Caddy's HMAC-signed `graphistry_sticky`
    cookie; node loss has the same blast radius as before.
  - Simplicity: streamgl-gpu's PM2 localhost IPC continues to work
    unchanged because gpu-router and PM2-forked gpu-worker children
    are intra-pod. No router/spawner HTTP split, no app-layer changes.

New templates
-------------

  templates/engine/engine-daemonset.yaml   pod definition, tier-aware:
                                           3 containers at `analytics`
                                           (nginx + fep + dask), 6 at
                                           `viz`/`full` (adds
                                           streamgl-{viz,sessions,gpu})
  templates/engine/engine-nginx-cfg.yml    supplementary :8080 Host-
                                           header dispatcher inside the
                                           pod, loaded as conf.d
                                           alongside the production
                                           default.conf
  templates/engine/engine-service.yaml     engine-headless for Caddy
                                           upstreams (gated tier.analytics)
                                           plus 4 shim Services
                                           (streamgl-viz, streamgl-sessions,
                                           streamgl-gpu, forge-etl-python,
                                           gated tier.viz) so the
                                           production nginx FQDN-suffixed
                                           hostnames still resolve

Removed templates (subsumed by engine)
--------------------------------------

  templates/nginx/nginx-deployment.yaml
  templates/nginx/nginx-log-exporter-configmap.yaml  (sidecar dropped)
  templates/forge-etl/forge-etl-python-daemonset.yaml
  templates/dask/dask-cuda-worker-daemonset.yaml
  templates/streamgl/streamgl-gpu-daemonset.yaml
  templates/streamgl/streamgl-gpu-deployment.yaml      (router-split design)
  templates/streamgl/streamgl-gpu-networkpolicy.yaml   (router-split design)
  templates/streamgl/streamgl-sessions-deployment.yaml
  templates/streamgl/streamgl-viz-deployment.yaml

Caddy as the L7 layer
---------------------

Caddy now load-balances browser sessions across engine-headless with
cookie stickiness (`graphistry_sticky`, HMAC-signed on
`engine.cookieSecret`) using Caddy's `dynamic a` resolver against the
headless Service for live pod-IP refresh.

New operator knobs in values.yaml:

  caddy.enabled        toggle Caddy + caddy-ingress entirely. `false`
                       lets operators front engine-headless with their
                       own ingress controller (Pattern B). The render
                       gate on caddy-cfg.yml, caddy-deployment.yaml,
                       and caddy-ingress.yml is now
                       `tier.analytics AND caddy.enabled`.
  caddy.tls.mode       `external` | `self` | `off`. external trusts
                       XFP=https from private_ranges so the sticky
                       cookie keeps Secure+SameSite=None. self
                       terminates from existingSecret or ACME. off is
                       plain HTTP only.
  caddy.lb.fallback    first-time-assignment policy when no cookie is
                       set (default round_robin).
  caddy.service        type / loadBalancerIP / nodePort /
                       nodePortHttps / annotations /
                       externalTrafficPolicy. Cloud, Tanzu, MetalLB,
                       NodePort all selectable from values.
  caddy.upstreamImage  escape hatch to use upstream `caddy:2.10-alpine`
                       directly while the bundled wrapper image lags
                       upstream PR caddyserver/caddy#6115. Liveness
                       switches to httpGet when set (no curl in the
                       official image).

Caddy Pod template gains a `checksum/config` annotation that hashes the
rendered Caddyfile ConfigMap into the Pod spec, so any change to TLS
mode / lb.fallback / cookieSecret triggers an automatic rollout on
`helm upgrade`. Without this, ConfigMap-only changes left Caddy with
stale parsed config in memory until something forced a pod bounce.
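
The annotation described above is commonly written with the standard Helm checksum idiom. The fragment below is a sketch of that idiom, assuming the Caddyfile ConfigMap template lives at `caddy/caddy-cfg.yml` under the chart's `templates/` directory; the chart's actual expression may differ.

```yaml
# Fragment of the Caddy Deployment pod template (sketch).
spec:
  template:
    metadata:
      annotations:
        # Hash of the rendered ConfigMap template; the path is an assumption.
        checksum/config: {{ include (print $.Template.BasePath "/caddy/caddy-cfg.yml") . | sha256sum }}
```

Any value that changes the rendered Caddyfile changes the hash, which changes the Pod template and so triggers a rollout.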

PVC tier-gating dropped (correctness fix)
-----------------------------------------

`gak-private`, `gak-public`, and `uploads-files` PersistentVolumeClaims
are no longer gated on tier.full / tier.analytics. With Retain reclaim
policy, gating made tier downgrades leave PVs Released with stale
claimRef; later upgrades created new PVCs with fresh UIDs that no
longer matched, and the new PVCs sat Pending indefinitely. Consuming
Deployments stay tier-gated; only the storage object is unconditional.

Other cleanup
-------------

  - Dask Kubernetes Operator removed: dask-cluster.yml CRD,
    `dask.operator` toggle, operator sections from all platform READMEs
    and Sphinx docs, ArgoCD app, CD subchart, chart bundler entries,
    ACR import script, dev-compose setup script.
  - Cluster mode (ENABLE_CLUSTER_MODE / IS_FOLLOWER) wiring stripped
    from all Deployment/DaemonSet templates. Legacy
    cluster/{cluster-storage,follower,global-common,leader}.yaml
    replaced by a single-namespace multi-node guide and
    cluster/retain-sc-nfs.yaml.
  - Redis Service: LoadBalancer -> ClusterIP (latent security default).
  - otel-collector Service: LoadBalancer -> ClusterIP +
    internalTrafficPolicy: Local (matches DaemonSet collector model;
    opentelemetry-operator#1401).
  - dcgm-exporter, netshoot, whoami: added missing `nodeSelector`
    blocks honouring global.nodeSelector.

values.yaml additions
---------------------

  - `engine` block: cookieSecret, uploadsScratchSizeLimit.
  - `caddy` sub-blocks: enabled, tls{mode, existingSecret, acmeEmail,
    domains}, lb{fallback}, service{type, loadBalancerIP, nodePort,
    nodePortHttps, annotations, externalTrafficPolicy}, upstreamImage.
  - `global.storage.accessMode` (default ReadWriteOnce).
  - TOPOLOGY note covering Pattern A (Caddy as L7) vs Pattern B
    (operator's ingress controller as L7) with explicit operator
    responsibilities under each pattern.

k3s example values
------------------

  - Exercises the new knobs end-to-end: caddy.enabled: true,
    caddy.tls.mode: "off", caddy.lb.fallback: round_robin,
    caddy.service.type: ClusterIP, caddy.upstreamImage:
    caddy:2.10-alpine, engine.cookieSecret,
    engine.uploadsScratchSizeLimit. Tier set to `viz`.
  - PV-cleanup runbook in the README reorganized into three explicit
    options (Graphistry-only / postgres-only / both) to prevent
    operators accidentally wiping the wrong chart's data.
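
For orientation, the knobs listed above would sit in a values overlay roughly as follows. This is a sketch assembled from the key names in this commit message, not a copy of the shipped example file; the secret and size values are placeholders.

```yaml
# Illustrative values overlay; cookieSecret and the size limit are placeholders.
global:
  tier: "viz"
caddy:
  enabled: true
  tls:
    mode: "off"            # plain HTTP only
  lb:
    fallback: round_robin  # first-time assignment when no sticky cookie is set
  service:
    type: ClusterIP
  upstreamImage: caddy:2.10-alpine
engine:
  cookieSecret: "replace-with-a-long-random-string"   # placeholder
  uploadsScratchSizeLimit: 10Gi                        # placeholder size
```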

Verification
------------

Two-node test bed: node1 (k3s server, NFS server colocated) and
node2 (k3s agent, NFS client). RWX storage backed by NFS, tier
`viz`, two engine pods (one per node) load-balanced by Caddy
(ClusterIP) fronted by k3s Traefik on node1.

  ETL 5-call sequence -- analytics tier
    /readarrow, /upload, /preshape, /properties, /download all returned
    200 on both nodes. Local-worker affinity hint exercised: each fep
    submission's persist/compute kwargs named the dask-cuda-worker on
    its own node (verified by HOSTNAME-prefix match against scheduler
    worker info). Behavior decayed to no-op when a fep landed on a node
    without a registered local worker.

  Session test -- viz tier, multi-node
    Browsers opened sessions against the public Caddy endpoint on
    node1. Each browser pinned to one engine pod via the
    `graphistry_sticky` cookie across page reloads and WebSocket
    reconnects; sessions on node1 were undisturbed by traffic
    landing on node2, and vice versa. Verified sticky distribution by
    tab count vs cookie value. Long-idle sessions (60s+ inactivity)
    survived without disconnect.

  Tier transitions
    Upgraded `viz -> analytics -> viz` against a release with
    non-empty PVCs. gak/uploads PVCs survived the round trip and
    re-bound to their existing PVs (PV/PVC UID stable, claimRef
    intact). Pre-fix this would have left the new PVCs Pending.

  Caddyfile config rollout
    Edited only `caddy.lb.fallback` in values and re-ran
    `helm upgrade`. Caddy Pod rolled automatically because the
    rendered Caddyfile checksum changed; ConfigMap-only edits no
    longer require manual pod bounce.
@aucahuasi aucahuasi changed the title feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and distributed FS coherent forge-etl-python routing feat(multi-node): engine DaemonSet bundling GPU services + Caddy sticky LB Apr 26, 2026
…erations) + dask local-affinity wiring + cluster README rewrite

Three coherent changes shipping together to support heterogeneous K8s
clusters with mixed GPU/CPU node pools and dedicated-tenant taints.

1. Multi-node placement infrastructure
--------------------------------------

* charts/graphistry-helm/values.yaml:
  - New `engine.nodeSelector` (default `{}`). When set, the engine DaemonSet
    uses this selector instead of `global.nodeSelector` (falls back to
    global when empty). Lets operators target the engine pod at GPU-labelled
    nodes while keeping the chart's CPU-side workloads (caddy, nexus, redis,
    dask-scheduler, notebook, pivot, gak-{private,public}) on a separate
    pool.
  - New `global.tolerations` (default `[]`). Applied to every chart-
    rendered workload. Lets operators opt the chart's pods into nodes
    carrying operator-defined taints: NVIDIA GPU Operator's
    `nvidia.com/gpu=true:NoSchedule`, GKE/EKS managed GPU node-pool taints,
    dedicated-tenant taints (`dedicated=foo:NoSchedule`), etc. Empty `[]`
    keeps current behaviour byte-identical.
  - Inline docstrings explain the recommended `graphistry.io/role=gpu/cpu`
    labelling convention, the GPU-Operator and managed-pool taint patterns,
    and why tolerations are permissive (not directive) so adding a GPU-pool
    toleration to global.tolerations is harmless for non-GPU workloads.

* templates/{caddy,dask,engine,graph-app-kit,http-tools,nexus,notebook,
  pivot,redis}/*-deployment.yaml + engine-daemonset.yaml:
  Eleven templates now emit a `tolerations:` block when
  `.Values.global.tolerations` is non-empty (standard `{{- with ... }}`
  guard pattern). engine-daemonset.yaml additionally rewires nodeSelector
  to `(engine.nodeSelector | default global.nodeSelector)` so engine can
  override placement independently.
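
A sketch of how the two new values might be combined for a mixed GPU/CPU cluster, using the `graphistry.io/role` labelling convention and the GPU Operator taint named above. The label values assume nodes have already been labelled that way by a cluster admin.

```yaml
# Illustrative values overlay for a mixed GPU/CPU cluster.
global:
  nodeSelector:
    graphistry.io/role: cpu      # CPU-side workloads (caddy, nexus, redis, ...)
  tolerations:
    - key: nvidia.com/gpu        # matches the taint nvidia.com/gpu=true:NoSchedule
      operator: Equal
      value: "true"
      effect: NoSchedule
engine:
  nodeSelector:
    graphistry.io/role: gpu      # engine DaemonSet lands only on GPU-labelled nodes
```

Because tolerations only permit scheduling onto tainted nodes and do not attract pods to them, the CPU-side workloads still follow `global.nodeSelector`.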

2. Cluster deployment guide rewrite
-----------------------------------

* charts/values-overrides/examples/cluster/README.md (+943 / -130):
  Near-rewrite of the multi-node deployment guide. New sections:

  - "Cluster architecture: engine DaemonSet and Caddy" -- bird's-eye view
    of the bundled engine pod, Caddy sticky LB across pods, why both
    pieces are necessary together.
  - `engine.cookieSecret` production hardening + rotation behaviour
    (existing browser sessions are dropped on secret change).
  - "Node placement: GPU vs CPU pools, dedicated tenants" -- the two
    scheduling axes (nodeSelector for where-allowed, tolerations for
    what-taints-accepted), the chart's scheduling values, recommended
    labelling convention.
  - Worked walkthrough: 5-node A100 cluster with a dedicated LLM tenant
    on node 5 (label + taint + toleration), step-by-step from cluster-
    admin labelling through values file authoring through the LLM team
    deploying their own workload separately.
  - Cost-optimised variant (separate GPU/CPU pools).
  - Three approaches for pinning CPU singletons to one specific node
    (label-one-node / pin-by-hostname / dedicated-pin-label), with
    tradeoffs and gotchas, plus a list of approaches not recommended.
  - Defensive matrix: what each safety net catches (selector-only vs
    selector+taint vs taint-without-toleration).
  - Common per-distro pre-baked taints reference.
  - Verification commands matrix to confirm placement after install.

  The existing NFS / `global.storage.accessMode: ReadWriteMany` Step 4
  content is preserved.
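
For reference, the storage values mentioned here would look roughly like this in an overlay. The StorageClass name is hypothetical (borrowed from the `retain-sc-nfs.yaml` filename) and should be replaced with whatever class the cluster actually provides.

```yaml
# Illustrative values overlay for RWX storage on a multi-node cluster.
global:
  storage:
    accessMode: ReadWriteMany               # chart default is ReadWriteOnce
  storageClassNameOverride: retain-sc-nfs   # hypothetical StorageClass name
```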

3. Dask local-affinity wiring
-----------------------------

* templates/engine/engine-daemonset.yaml (forge-etl-python container env
  block): set `GRAPHISTRY_DASK_LOCAL_AFFINITY=1`. Wires the chart-side
  half of the dask local-worker affinity feature shipping in
  graphistry_master PR #3097. fep's `server/util/dask_affinity.py:
  persist_kwargs()` reads this env var to gate per-submission
  `scheduler_info()` round-trips and pin `client.persist`/`compute` calls
  to the co-located dask-cuda-worker on the same node (~2x ETL speedup
  empirically validated at 4 MiB / 92 MiB / 512 MiB upload sizes). Hint
  is soft -- `allow_other_workers=True` is always set fep-side, so the
  scheduler still falls back to a remote worker when the local one is
  busy or down. Hardcoded "1" matches the precedent of other engine-
  DaemonSet-required constants like `OTEL_SERVICE_NAME` and `PORT`. If
  the env var is missing, fep's helper short-circuits and the deployment
  runs byte-identically to pre-change.
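
The chart-side wiring amounts to one environment variable on the container. A sketch of the relevant fragment, with the rest of the container spec omitted:

```yaml
# Fragment of the engine DaemonSet pod spec (sketch; other fields omitted).
containers:
  - name: forge-etl-python
    env:
      - name: GRAPHISTRY_DASK_LOCAL_AFFINITY
        value: "1"   # env values must be strings, hence the quotes
```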
@aucahuasi aucahuasi changed the title feat(multi-node): engine DaemonSet bundling GPU services + Caddy sticky LB feat(multi-node): engine DaemonSet bundling + Caddy sticky LB + GPU/CPU pool placement Apr 26, 2026
…loss on mixed pools

Bug
---
The graph-app-kit-{public,private} Deployments and the Jupyter notebook
Deployment hardcoded `nodeSelector: global.nodeSelector` with no override
hook. On a mixed GPU/CPU node-pool deployment -- the recommended pattern
where global.nodeSelector pins the platform-tier majority (caddy, nexus,
redis, dask-scheduler) to a CPU pool -- this routed the GPU-bound gak
Streamlit views and the notebook RAPIDS kernel to CPU nodes. Pods came up
green but lost CUDA capability:

  - graph-app-kit (both public + private) runs the same RAPIDS-on-CUDA
    runtime as forge-etl-python (cudf, pygraphistry); the Streamlit
    dashboards drive graph layout against the GPU stack. compose pins
    both `runtime: nvidia` for this reason.
  - notebook ships a `Python 3.8 (RAPIDS)` ipykernel (cudf/cuml/cugraph)
    that fails to load CUDA when the kernel container has no GPU
    access. compose pins it `runtime: nvidia`.

The hardcoded selector blocked any per-workload override, so operators
running mixed pools could not pin gak/notebook to GPU nodes without
forking the chart.

Fix
---
Three templates now use the override-with-fallback Helm idiom:

  templates/graph-app-kit/graph-app-kit-public-deployment.yaml
  templates/graph-app-kit/graph-app-kit-private-deployment.yaml
  templates/notebook/notebook-deployment.yaml

      nodeSelector: {{- (.Values.X.nodeSelector | default .Values.global.nodeSelector)
                       | toYaml | nindent 8 }}

New Helm values (default `{}`):

  - graphAppKit.nodeSelector  -- applies to both gak Deployments
  - notebook.nodeSelector     -- applies to the notebook Deployment

Empty default falls back to global.nodeSelector, so single-pool
deployments are byte-identical to pre-fix. Mixed-pool operators set
these explicitly to a GPU-labelled pool; the values.yaml docstrings
spell out the pattern with a `graphistry.io/role: gpu` example.
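
A sketch of a mixed-pool overlay using the two new values, assuming nodes are labelled with the `graphistry.io/role` convention:

```yaml
# Illustrative values overlay: platform workloads on CPU nodes, GPU-bound ones on GPU nodes.
global:
  nodeSelector:
    graphistry.io/role: cpu
graphAppKit:
  nodeSelector:
    graphistry.io/role: gpu   # applies to both gak Deployments
notebook:
  nodeSelector:
    graphistry.io/role: gpu   # applies to the notebook Deployment
```

Leaving either block at its `{}` default keeps that workload on `global.nodeSelector`.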

Pivot is intentionally not changed: it is a Node.js HTTP-only service
that embeds the streamgl-viz iframe in browser-side pages. No GPU
bindings server-side, so it stays correctly tied to global.nodeSelector.

Docs
----
  - cluster/README.md: corrected the GPU-vs-CPU classification (notebook
    and gak were previously listed as CPU-only, which was wrong -- both
    are GPU-bound). Split the topology diagram into GPU-bound vs
    CPU-only boxes, added a per-workload breakdown table, and added a
    defensive matrix row describing the silent-runtime-regression
    failure mode this fix addresses.
  - troubleshooting.md: same misclassification fix in the multi-node
    placement guidance.
  - k3s/k3s_example_values.yaml: illustrative usage of the two new
    nodeSelector overrides for the mixed-pool example.
  - CHANGELOG.md: bug-fix entries for both new values.
Telemetry now ships as a properly-structured Helm subchart consumed via
dependencies + condition: global.ENABLE_OPEN_TELEMETRY, with a top-level
`telemetry:` block in the parent's values.yaml for subchart-specific knobs
and `global.*` reserved for cross-cutting values (OTLP endpoint, instance,
image-pull, scheduling, storage class). The shared `_helpers.tpl` is
extracted into a new `graphistry-common` library subchart so the telemetry
subchart can reuse `graphistry.tier.*` gating via a dependency entry.

Multi-node Prometheus scraping is fixed. The dcgm-exporter and
node-exporter jobs switched from Service-VIP `static_configs` to
`kubernetes_sd_configs` with `role: pod` and relabel rules that set the
`node` label from `pod_node_name` and `__address__` from `pod_ip`. With
the old static-target config, kube-proxy load-balanced to one DaemonSet
pod per scrape and silently dropped the other nodes' GPU/host metrics.
Validated on a 2-node 3-GPU k3s cluster: pre-fix 1 GPU visible, post-fix
all 3 with stable per-`node` labels. New `templates/prometheus-rbac.yaml`
(ServiceAccount + namespace-scoped Role + RoleBinding for read-only
`pods`/`services`/`endpoints`) covers apiserver discovery; the prometheus
pod sets `automountServiceAccountToken: true` explicitly so locked-down
clusters (some Tanzu/OpenShift profiles, hardened GKE namespaces) that
flip the default to false don't silently break discovery.
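
The scrape-config change can be sketched as below. The `__meta_kubernetes_pod_*` source labels are standard Prometheus Kubernetes service-discovery labels; the pod label used for filtering and the metrics port are assumptions, not values taken from the chart.

```yaml
# Sketch of a pod-role scrape job; the `app` label and port 9400 are assumptions.
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the exporter pods (assumed label).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: dcgm-exporter
        action: keep
      # Stable per-node label taken from the pod's scheduling node.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
      # Scrape each pod directly by IP instead of through the Service VIP.
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        replacement: ${1}:9400
        target_label: __address__
```

Scraping by pod IP yields one target per DaemonSet pod, so no node's metrics depend on which backend kube-proxy happens to pick.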

Telemetry-stack persistence is normalized. Prometheus is now a
single-replica Deployment with `Recreate` strategy + RWO PVC (was
per-node DaemonSet, no global view, no retention) with
`telemetry.prometheus.retention=15d` and a new
`telemetry.prometheus.enableAdminAPI` knob (default false; mirrors
kube-prometheus-stack's default; flip in dev to drop stale series after
scrape-config refactors). Grafana and Jaeger gain
`persistence.{enabled,size,storageClassName}` blocks matching prometheus.
The otel-collector cloud and self-hosted ConfigMaps are merged into one
templated ConfigMap with cloud-mode credentials sourced via
`secretKeyRef`. New `useExternal` + `externalEndpoint` knobs on
`telemetry.dcgmExporter` and `telemetry.nodeExporter` let operators
defer to NVIDIA GPU Operator's exporter (mandatory on GKE
Container-Optimized OS, where the bundled image fails) or
kube-prometheus-stack's node-exporter, avoiding double DaemonSets.
`checksum/config` annotations on otel-collector / prometheus / grafana /
jaeger pod templates make `helm upgrade` actually roll workloads when
their ConfigMaps change.
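
As an illustration, the telemetry knobs named above might be set like this in the parent chart's values. The `externalEndpoint` value and its host:port format are hypothetical and depend on where the external exporter is exposed.

```yaml
# Illustrative values overlay; externalEndpoint is a hypothetical address.
global:
  ENABLE_OPEN_TELEMETRY: true
telemetry:
  prometheus:
    retention: 15d
    enableAdminAPI: false     # flip to true in dev to drop stale series
  dcgmExporter:
    useExternal: true         # defer to an existing exporter such as GPU Operator's
    externalEndpoint: "nvidia-dcgm-exporter.gpu-operator.svc:9400"   # hypothetical
```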

Caddy stability: two crash modes fixed. The single-line
`handle @grafana { reverse_proxy ... }` blocks tripped Caddy v2's parser
("Unexpected next token after '{' on same line") -- expanded to
multi-line form. The `/caddy/health/` endpoint was being routed by the
catch-all `handle` to `engine-headless` (no endpoints) after `respond`
wrote 200, because `respond` is non-terminal in Caddy v2; wrapped in a
terminal `handle /caddy/health/ { respond 200 ... }` to break the 30s
SIGTERM-then-restart liveness loop. Two new named templates in
`templates/caddy/_helpers.tpl` (`graphistry.caddy.healthHandle` and
`graphistry.caddy.telemetryHandles`) remove the 3-4x duplication of
identical handle blocks across the `tls.mode = external | self | off`
branches in `caddy-cfg.yml`.

Per-component telemetry Ingresses (grafana/jaeger/prometheus) are removed
in favour of Caddy path routes on the parent chart's main Ingress;
NOTES.txt updates "Ingress paths" wording to "Caddy paths". The gke
README's "Fix DCGM GPU Metrics on GKE" section is rewritten to use
`telemetry.dcgmExporter.useExternal: true` instead of an out-of-band
`kubectl patch`. The k3s example values exercise the new caddy/engine
knobs and enable telemetry self-hosted by default.
Adds a platform-tier-only nginx Deployment + Service (`nexus-proxy`) that
fronts nexus and provides the v1-to-v2 endpoint rewrites and deprecated-
endpoint 410 shims that live in the `graphistry/nginx` image. Renders only
when `global.tier == "platform"` (exact match, not `>=`); at analytics+
the engine DaemonSet's nginx container plays the same role with intra-pod
localhost dispatch to streamgl/forge backends, so the standalone
Deployment would be redundant at higher tiers.

The platform tier now ships a deployable slice of postgres + nexus +
nexus-proxy with no transitive dependencies on analytics-tier services.
The Deployment's only init container waits for the nexus Service alone
(not redis / dask-scheduler / streamgl) so the slice is genuinely
self-contained. Reuses existing values and PVCs: `NginxResources`,
`global.{nodeSelector,tolerations,imagePullSecrets,restartPolicy}`,
postgres secret refs, and the `local-media-mount` + `data-mount` PVCs.
The image's content-directory expectations for `/streamgl`, `/pivot`,
and `/upload` paths are satisfied by emptyDir mounts -- those routes
return 404 at platform tier, which is the correct behaviour when the
backends don't exist.

End-to-end validation at `tier=platform`: the deprecation shims (`/etl`,
`/api/check`, `/api/encrypt`, `/api/decrypt`, `/api/v1/etl/vgraph/*`)
return 410 Gone with the documented upgrade message; live `/api/v1/*`
routes (`/datasets/`, `/files/`, `/organization/`, `/team/`,
`/named-endpoint/`) return 200 with real data after a Django session
login at `/accounts/login/`; v2 ETL routes (`/api/v2/etl/vgraph/`,
`/api/v2/etl/datasets/<id>/{gfql,kepler}/...`) route through nginx as
expected (upstream errors at platform tier because GPU backends are
absent by design). Switching back to `tier=analytics` correctly removes
the standalone nexus-proxy in favour of the engine DaemonSet's nginx
container.

NOTES.txt is now tier-aware: at platform it lists `nexus-proxy` in
DEPLOYED SERVICES, prints both in-cluster URLs and a
`port-forward --address 0.0.0.0` recipe for browser access, and skips
the ACCESS and TELEMETRY blocks (Caddy and the telemetry stack don't
render at this tier). At analytics+ both blocks render as before. The
tier descriptions are also updated to reflect the engine-DaemonSet
collapse from 0.4.4 -- the single DaemonSet with tier-conditional
sibling containers replaces the per-service Deployments listed in the
old text.

Documentation updated to match: the k3s README's Deployment Tiers
section now documents `nexus-proxy` in the platform-tier row and
rewrites analytics+ rows around the engine DaemonSet shape; the
"Services per tier" table replaces standalone
`nginx`/`forge-etl-python`/`dask-cuda-worker`/`streamgl-*` rows with
`engine` DaemonSet container rows showing the tier-conditional sibling
set. troubleshooting.md Section 9 "Accessing Graphistry" gains a new
"Tier matters for the access path" subsection covering the platform-
tier path (no Caddy/Ingress; `nexus-proxy` is the entry point;
port-forward + curl smoke tests for v1/v2 routing); the existing
Caddy/Ingress flow is scoped as "Verification (analytics+ tier)".

The k3s example values file's tier comment records the platform-tier
validation run and the port-forward recipe. The active value is
`tier: "analytics"` for normal usage.
Caddy now renders at every tier (was tier >= analytics) so platform-tier
deployments share the same TLS / Ingress / caddy.service.{type,...}
machinery as analytics+. Tier promotions no longer churn external-facing
config: the Service name, Ingress, port-forward target, and cert-manager
wiring are byte-identical from platform through full. The catch-all
reverse_proxy upstream switches by tier via a new
graphistry.caddy.upstreamHandle helper -- nexus-proxy at platform tier
(single replica, no sticky-cookie ceremony; mirrors main-branch's
caddy -> nginx:80 production shape, just retargeted at nexus-proxy which
carries the v1-to-v2 endpoint shims), dynamic engine-headless with HMAC
sticky-cookie LB at analytics+. The 3 inline catch-all blocks across the
external/self/off tlsMode branches collapse to one include each.

Adds ingress.enabled (default true) for operators whose external L7 is
at the Service layer rather than the K8s Ingress layer (Brad/Dell on
Tanzu NSX-T pointing an external LB at caddy.service.type=NodePort,
BBAI's Pulumi-managed LB, service mesh, or dev port-forward). Default
true preserves today's behavior so existing values overlays render an
unchanged Ingress. The Bitnami-style ingress.{className, hosts, tls,
annotations} keys are intentionally not added; equivalents already exist
under global.ingressClassName, global.domain, caddy.tls.{existingSecret,
acmeEmail, domains}, caddy.service.annotations, and
ingress.management.annotations. values.yaml now carries an explicit
Bitnami-shape -> chart-shape mapping table so operators coming from
other charts can find the right keys.

Drive-by fix to a pre-existing bug: ingress.management.annotations was
documented and exemplified with service.beta.kubernetes.io/* cloud-LB
annotations, but those are Service-resource annotations -- they're inert
on an Ingress. Replaced the comment + commented-out examples with
annotations that actually take effect on an Ingress (nginx-ingress
internal class, ALB scheme, GKE gce-internal, cert-manager cluster
issuer, body-size). Cloud-LB annotations belong on
caddy.service.annotations, where they're already correctly documented.
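
To make the distinction concrete, the sketch below places Ingress-level annotations under `ingress.management.annotations` and a cloud-LB annotation under `caddy.service.annotations`. The annotation keys are real upstream ones, but the issuer name and the specific choices are illustrative, not the chart's shipped examples.

```yaml
# Illustrative values overlay; issuer name is hypothetical.
ingress:
  enabled: true
  management:
    annotations:
      # These take effect on an Ingress resource.
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/proxy-body-size: "0"   # disables the body-size check
caddy:
  service:
    annotations:
      # Service-resource annotation; it would be inert on an Ingress.
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```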

Verified on jorge7 k3s at platform tier: all 5 v1 deprecation shim paths
return byte-identical 410s through Caddy vs. direct nexus-proxy, /
returns 200, /accounts/login/ returns 200, Caddy /caddy/health/ returns
{"success": true}, Ingress is picked up by Traefik. helm template lints
clean across 4 tiers x 3 tls.mode combinations.
`make html` now copies the 10 canonical chart READMEs into
docs/source/ with relative-link rewriting, runs frigate via a Python
wrapper that bypasses its broken --no-deps CLI (frigate archived
2024-12), and renders the result. 7 stale hand-written .rst pages
deleted; 3 frigate-generated .rst files gitignored as derived
artifacts. index.rst restructured into 5 captioned sections;
10mins-to-k8s.rst rewritten as platform-agnostic 8-step skeleton
with placeholders.

Other docs cleanup: TROUBLESHOOTING.md moved to repo root,
CLUSTER.md into the chart dir; new postgres-cluster, cd, aks
READMEs; top-level README repository-structure HTML -> nested
markdown list.
@aucahuasi aucahuasi requested a review from albarralnunez April 28, 2026 09:00
…nt guide

Telemetry storage fixes
- Prometheus, Jaeger, and Grafana pod templates now set securityContext.fsGroup
  per workload (65534 / 10001 / 472, matching each upstream image's hardcoded
  UID); kubelet chowns the PVC mount on first attach. Pre-fix the three pods
  crashlooped with permission-denied on Longhorn (and any other block CSI
  driver) because the data directory inherited root:root 0755 ownership and
  the non-root container processes could not write to it. Configurable per
  workload via telemetry.<workload>.securityContext.fsGroup; set to {} to
  disable on NFS deployments that prefer chmod 0777 + no_root_squash.
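
A sketch of the per-workload override, assuming the `<workload>` placeholder expands to `prometheus`, `jaeger`, and `grafana` (the exact key names are not spelled out above). The values shown are the defaults listed in this commit.

```yaml
telemetry:
  prometheus:
    securityContext:
      fsGroup: 65534   # matches the Prometheus image's UID
  jaeger:
    securityContext:
      fsGroup: 10001   # matches the Jaeger image's UID
  grafana:
    securityContext:
      fsGroup: 472     # matches the Grafana image's UID
```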

Longhorn deployment documentation (charts/graphistry-helm/CLUSTER.md)
- New end-to-end Longhorn 1.11.x section: architecture (control plane vs data
  plane, CSI vs iSCSI vs iscsid), when to choose Longhorn over NFS, per-node
  prerequisites (open-iscsi / iscsid running, ext4/XFS data path, shared mount
  propagation), Helm install with defaultReplicaCount=2, defaultDataPath, and
  persistence.defaultClass=false, Node CR multi-disk patches, retain-sc-longhorn
  StorageClass creation (RWO replicated block) plus optional RWX share-manager
  class, smoke test, and cleanup runbook.
- Top-level README volumeBindingMode table grew a Replicated-block row covering
  Longhorn RWO, Rook-Ceph RBD, OpenEBS Mayastor / cStor, and Portworx; the
  existing Longhorn entry now reads "Longhorn RWX (share-manager NFSv4.1
  re-export)" so the two modes are no longer conflated.
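
For orientation, a retain-sc-longhorn StorageClass along the lines described above might look like this. It is a sketch, not the manifest from CLUSTER.md: the replica count mirrors the defaultReplicaCount=2 mentioned above, and the remaining parameters are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-sc-longhorn
provisioner: driver.longhorn.io   # Longhorn's CSI driver name
reclaimPolicy: Retain             # keep the PV when the PVC is deleted
allowVolumeExpansion: true        # assumption; omit if resizing is not wanted
parameters:
  numberOfReplicas: "2"           # RWO replicated block, two replicas
  staleReplicaTimeout: "30"       # minutes; illustrative value
```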

Docs style sweep across all six READMEs (top-level, postgres-cluster,
graphistry-helm, CLUSTER, telemetry, TROUBLESHOOTING): semantic chart-name +
section-name link text replaces file-path-as-link-text in prose; em-dashes,
double-dashes, and prose arrows replaced with semicolons, colons, parentheses,
and English connectives in narrative prose; tables, ASCII diagrams, and code
blocks keep their structural symbols. Five pre-existing broken inbound links
(../troubleshooting.md, ../cluster/README.md, ../k3s/, ../gke/, ../tanzu/)
repaired in-flight.
@aucahuasi aucahuasi marked this pull request as ready for review April 29, 2026 08:58

@albarralnunez albarralnunez left a comment

Contributor

LGTM

aucahuasi added 9 commits May 26, 2026 10:29
…ipeline

Values key restructuring
- rollingUpdate boolean + sibling maxSurge value collapsed into the
  rollingUpdate.{enabled,maxSurge} block; applied across nexus, dask-
  scheduler, caddy, redis, notebook, pivot, gak-public, gak-private,
  and the engine DaemonSet rollout strategy.
- graphAppKitPublic / graphAppKitPrivate top-level booleans replaced by
  graphAppKit.public.enabled / graphAppKit.private.enabled. PVC and
  Deployment templates for gak read the new keys.
- nginx.service.httpPort replaces hardcoded port 80 across engine-
  service, nexus-proxy-service, caddy-ingress, and nginx-ingress-dev;
  nginxPorts.portOne removed.
- ingress.proxyBodySize replaces the camelCase root ProxyBodySize key.
- pivot.devRepository replaces the pivotDev.repository nested key.
- Nexus chart-wide env keys collapsed into nexus.env map:
  graphistryCPUMode, nodeEnv, appEnvironment, djangoSettingsModule,
  djangoDebug, sessionCookieAge, jwtExpirationDelta, enableDjangoSilk.
  Operators that overrode these in overlays need to move them under
  nexus.env.
- dask-scheduler REMOTE_DASK env removed (unused since the scheduler-as-
  coordinator pattern in v0.4.x).
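
An overlay written against the old keys might migrate roughly as follows. The old-key shapes in the comments are inferred from the list above, and every value is illustrative rather than a chart default.

```yaml
# Old keys (no longer read by the chart), kept here as comments:
#   graphAppKitPublic: true
#   graphAppKitPrivate: false
#   ProxyBodySize: ...
#   pivotDev:
#     repository: ...

# New keys:
rollingUpdate:
  enabled: true
  maxSurge: 1                # illustrative value

graphAppKit:
  public:
    enabled: true
  private:
    enabled: false

nginx:
  service:
    httpPort: 80             # replaces the previously hardcoded port

ingress:
  proxyBodySize: "100m"      # illustrative size

pivot:
  devRepository: registry.example.com/pivot-dev   # hypothetical repository
```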

Operator-explicit TLS
- global.tls, global.tlsStaging, tlsEmail Helm values removed. Their
  seven prior jobs (cert-manager Ingress annotations, spec.tls block,
  ClusterIssuer resources, Nexus COOKIE_SECURE, Caddyfile HSTS, NOTES
  URL scheme, Grafana GF_SERVER_ROOT_URL auto-derivation) are now
  operator-explicit or auto-derived from caddy.tls.mode + ingress.tls.
- ingress.tls Helm value (default []): passthrough into the Caddy
  Ingress spec.tls. Each entry {hosts: [...], secretName: ...} is
  rendered verbatim. Operators wiring TLS at the upstream cluster
  ingress (nginx-ingress, Traefik, Tanzu LB, cert-manager) set this
  directly.
- graphistry.externalScheme library helper centralises the "is this
  cluster reachable over HTTPS?" decision. caddy-cfg.yml,
  nexus-deployment.yaml, and NOTES.txt now read the helper; single
  source of truth replaces three separate inline tls/tlsStaging checks.
- Grafana GF_SERVER_ROOT_URL auto-derivation from global.domain + tls
  removed. Default reverts to "/grafana" (relative). Operators set
  telemetry.grafana.GF_SERVER_ROOT_URL explicitly when absolute URLs are
  needed (OAuth callbacks, share-dashboard links, alert webhook image
  URLs). Auto-derivation also coupled the telemetry subchart to
  parent-chart values that subcharts shouldn't see.
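
A sketch of the operator-explicit form, using the `{hosts, secretName}` entry shape described above. The hostname and Secret name are hypothetical.

```yaml
ingress:
  tls:
    - hosts:
        - graphistry.example.com    # hypothetical hostname
      secretName: graphistry-tls    # hypothetical TLS Secret, created out of band

telemetry:
  grafana:
    # Only needed when absolute URLs matter (OAuth callbacks, share links).
    GF_SERVER_ROOT_URL: "https://graphistry.example.com/grafana"
```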

Crypto Secret management
- crypto.jwtSecret (default graphistry-jwt) and crypto.nexusSecret
  (default graphistry-nexus) Helm values: names of two operator-managed
  K8s Secrets the chart references via envFrom: secretRef:. graphistry-
  jwt holds DJANGO_SECRET_KEY (loaded into nexus + the three streamgl
  containers for shared JWT signing); graphistry-nexus holds the three
  GRAPHISTRY_NEXUS_* keys (nexus only; bulk envFrom lets operators layer
  additional nexus-only sensitive vars into the same Secret).
- graphistry.assertCrypto library helper: lookup-based pre-flight that
  runs on every install + upgrade render. Fails at template time when
  either Secret is missing in .Release.Namespace, pointing operators at
  the bootstrap script.
- charts/graphistry-helm/crypto-bootstrap/create-secrets.sh: operator
  bootstrap tool. Takes <graphistry-namespace> as required arg;
  generates CSPRNG values via a one-off busybox Pod (image already in
  the airgap bundle; needs only kubectl on the host) and creates both
  Secrets via stdin-piped manifests so values never appear in kubectl
  argv. --help covers custom Secret names, external secret-management
  tooling (External Secrets Operator, helmfile + sops, vault, Sealed
  Secrets), manual kubectl create -f - equivalents, and /dev/urandom
  fallbacks.
- README "Create Crypto Secrets (Required)" section + NOTES.txt step 4
  document the contract and rotation rules. Rotation table scopes the
  GRAPHISTRY_NEXUS_ENCRYPTION_KEYS data loss correctly: connector
  credentials in the nexus Postgres Connector.keyjson JSONB column
  become unrecoverable on single-value rotation; graph data, datasets,
  files, user accounts are unaffected.
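
For operators who manage Secrets outside the bootstrap script, the graphistry-jwt Secret has roughly the shape below. This is a sketch under assumptions: the namespace is hypothetical, the value is a placeholder to be replaced with CSPRNG output, and create-secrets.sh remains the path the chart documents.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: graphistry-jwt    # must match crypto.jwtSecret
  namespace: graphistry   # hypothetical; use the release namespace
type: Opaque
stringData:
  DJANGO_SECRET_KEY: "<replace-with-generated-value>"
```

Applying a manifest like this from stdin (`kubectl create -f -`) keeps the value out of kubectl argv, which is the same property the bootstrap script relies on.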

Per-component env / extraEnv / extraEnvFrom extensibility
- Added across every chart workload: nexus, the engine-pod siblings
  (nginx, streamgl-viz, streamgl-sessions, streamgl-gpu, forge-etl-
  python), graph-app-kit-public, graph-app-kit-private, notebook,
  pivot, dask-scheduler, nexus-proxy, redis. Chart defaults stay in
  <container>.env (deep-mergeable map); operators layer additional
  entries via <container>.extraEnv (supports valueFrom: secretKeyRef)
  and <container>.extraEnvFrom without forking templates. PORT is
  filtered on engine-pod siblings since per-container PORT differs
  from the chart-wide default.
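
A sketch of the layering on one workload, assuming `extraEnv` and `extraEnvFrom` take the standard Kubernetes EnvVar and EnvFromSource list shapes. The variable and Secret names are hypothetical.

```yaml
nexus:
  extraEnv:
    - name: MY_API_TOKEN           # hypothetical variable
      valueFrom:
        secretKeyRef:
          name: my-app-secrets     # hypothetical Secret
          key: api-token
  extraEnvFrom:
    - secretRef:
        name: my-extra-nexus-env   # hypothetical Secret, loaded in bulk
```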

Removed templates
- templates/tls/letsencrypt-{production,staging}.yaml: chart-owned
  ClusterIssuer resources. Cert issuance is no longer the chart's
  responsibility; operators install cert-manager (cluster-scoped)
  and define ClusterIssuers separately.
- templates/http-tools/{netshoot,whoami}.yaml: dev / network-debug pods.
  Run on demand instead via kubectl run -it --rm --image=nicolaka/
  netshoot ... -- bash; see TROUBLESHOOTING.md.
- templates/network/grph-networkpolicy.yaml: an incomplete attempt at
  pod-to-pod policy (missing egress allowlist for DNS / Postgres /
  external HTTPS, wrong PGO label selector). The permissive variant
  could also undo a cluster-wide default-deny baseline. NetworkPolicy
  is a cluster-security concern; operators write their own tailored to
  their CNI and label scheme.
- templates/service-monitors/nginx-service-monitor.yaml: stale
  ServiceMonitor for a Service that no longer exists at its pre-engine
  location; the engine DaemonSet's nginx sibling is scraped via the
  telemetry subchart's existing config.
- templates/ingress/forward-headers.yaml: a one-off Middleware resource
  for an ingress controller variant the chart no longer assumes;
  superseded by ingress.management.annotations passthrough.
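
With the chart-owned ClusterIssuers gone, an operator-managed equivalent could look like the sketch below. It is not the content of the removed templates; the name, email, and ingress class are assumptions.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production    # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com        # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-production-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx   # match the cluster's ingress controller
```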

Docs build pipeline
- docs/Makefile gains a helm-docs-gen target rendering per-chart
  parameter tables via the helm-docs binary, replacing the deprecated
  frigate path (frigate has been archived since 2024-12).
- docs/_tools/helm-docs-table.gotmpl: chart-specific helm-docs template.
- docs/_tools/frigate_nodeps.py removed.
- docs/requirements.txt + docs/source/conf.py minor adjustments tracking
  the helm-docs path; docs/source/.gitignore swaps frigate-generated
  .rst entries for helm-docs-generated .md entries.
- docs/source/_static/custom.css: Sphinx site styling.
- .readthedocs.yaml: build.jobs.pre_build runs `make -C docs helm-docs-
  install + readmes + helm-docs-gen` so the published RTD site matches
  what `make html` produces locally. Without the pre_build hook, the
  toctree-referenced chart-README + Values pages would silently be
  missing on read-the-docs.
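
The pre_build hook described above might be wired as in this sketch. The `os` and Python versions are assumptions, as is collapsing the three make targets into one invocation; only the `build.jobs.pre_build` key and the target names come from the commit message.

```yaml
# .readthedocs.yaml (sketch)
version: 2

build:
  os: ubuntu-22.04     # assumed build image
  tools:
    python: "3.11"     # assumed Python version
  jobs:
    pre_build:
      # Generate the chart READMEs and helm-docs pages before Sphinx runs.
      - make -C docs helm-docs-install readmes helm-docs-gen

sphinx:
  configuration: docs/source/conf.py
```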

values.yaml documentation pass
- helm-docs `# --` annotation markers moved to the comment block
  immediately preceding the value (helm-docs requires contiguous comment
  blocks).
- Em-dashes and double-dashes in prose replaced with ;, :, () per
  project style and helm-docs annotation-parsing constraints.

Examples + adjacent cleanup
- Stale top-level `tls: false` removed from gke / tanzu / microk8s
  example values files; the templates only ever read `.Values.global.
  tls`, so these overrides were silently ignored from day one.
- charts/values-overrides/examples/k3s/k3s_example_values.yaml
  simplified to track current chart defaults.
- charts/postgres-cluster/values.yaml adjustments tracking the parent
  chart's storage and TLS value renames.
- charts/graphistry-helm/CLUSTER.md wording polish.

CHANGELOG.md
- 0.4.4 `### Added` gains entries for crypto.{jwt,nexus}Secret,
  graphistry.assertCrypto, create-secrets.sh, chart-injected envFrom
  wiring, the README Crypto Secrets section, and per-component env
  extensibility. `### Upgrade notes` gains a Crypto Secrets migration
  entry pointing operators at the bootstrap script.
Hardening
- Add graphistry.assertCrypto library helper: values-shape pre-flight
  for crypto.jwtSecret and crypto.nexusSecret, no lookup-based check
  so it works on Argo CD / helm template / helm-diff offline.
- Quote ingress.proxyBodySize in the nginx-ingress annotation so
  operators setting a bare integer per the values comment do not crash
  K8s admission with the metadata.annotations type-mismatch error.
- Route graphistry.tier.platform through tierLevel for the same bogus-
  tier validation as tier.analytics, tier.viz, tier.full; pre-fix the
  helper bypassed validation and bogus tiers were caught incidentally
  by sibling templates.
- Caddy ACME email directive in tls.mode=self now renders only when
  no static cert is present; previously emitted as dead config when
  both existingSecret and acmeEmail were set.
- nginx-ingress-specific annotations on the chart Ingress now gate on
  global.ingressClassName == "nginx"; operators on Traefik / ALB /
  GKE / NSX get a clean Ingress without dead annotations.
- streamgl-gpu liveness probe computes the list-server port from
  PORT + STREAMGL_NUM_WORKERS + 1 at exec time instead of hardcoding
  8095. Default behavior (N=4) is byte-equivalent; non-default N
  values now work without CrashLoopBackOff.
- dask-cuda-worker container no longer inherits .Values.forgeetlpython
  env. New daskcudaworker.env / extraEnv / extraEnvFrom keys mirror
  the forge-etl-python 3-layer pattern; cross-cutting vars
  (GRAPHISTRY_CPU_MODE, REDIS_DB) mirrored as defaults in both blocks
  with cross-reference comments.
- crypto-bootstrap/create-secrets.sh partial-parse failure no longer
  dumps raw kubectl stdout (which holds any CSPRNG values the Pod
  produced) to stderr; opt-in BOOTSTRAP_DEBUG=1 for the raw dump.
- caddy.tls.mode enum consolidates legacy ingress.enabled + three-value
  mode enum + externalTerminatesUpstream into one knob; previously
  illegal combinations (mode=self redirect loop, cloud-LB silently
  downgrading to http://) are unrepresentable.
- Caddy /grafana, /prometheus, /jaeger reverse_proxy handles now gate
  on tier.analytics+ alongside the existing ENABLE_OPEN_TELEMETRY and
  OTEL_CLOUD_MODE checks; platform-tier deployments no longer emit
  Caddy routes to backends that did not render.
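
The streamgl-gpu liveness change can be pictured as the probe sketch below. It assumes PORT defaults to 8090 (inferred from 8095 = PORT + 4 + 1), that curl is present in the image, and that the probed path is `/`; none of these are confirmed above.

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # List-server port = PORT + STREAMGL_NUM_WORKERS + 1, computed at
      # exec time. With the assumed PORT=8090 and N=4 this is 8095.
      - |
        port=$(( PORT + STREAMGL_NUM_WORKERS + 1 ))
        curl -fsS "http://127.0.0.1:${port}/" > /dev/null
```

Computing the port inside the container means a non-default worker count moves the probe target along with the list server, rather than leaving the probe pointed at a port nothing listens on.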

Dead-code purge
- charts/k8s-dashboard/ chart + cd/argo-apps/templates/k8s-dash-cd.yaml.
  Verbatim copy-paste of upstream kubernetes/dashboard v2.6.1 (June
  2022), default disabled, untouched since 2022-09, 4 years behind
  upstream. Operators wanting it install upstream's own chart.
- charts/values-overrides/internal/skinny-values.yaml,
  charts/values-overrides/internal/eks-dev-values.yaml,
  charts/values-overrides/examples/microk8s_example_values.yaml,
  charts/values-overrides/examples/azure_example_values.yaml. All four
  set keys the 0.4.x chart API no longer reads. skinny-values.yaml in
  particular set graphistryCPUMode: "1" which the chart silently
  ignored, so an operator using it for a CPU-only EKS dev cluster got
  GPU containers crashing instead of CPU mode. Maintained per-distro
  examples (k3s/, gke/, aks/, tanzu/) cover the same surface for those
  distros; microk8s and azure operators copy from k3s/ or aks/.
- DEVELOP.md, dev-compose/, acr-bootstrap/, and
  .github/workflows/dev-cluster-deployment.yaml. Webcoderz-era 2022
  Azure DevOps Pipelines + manual workflow_dispatch GH Action that
  hard-coded the legacy MULTINODE / TLS / APP_TAG env-var contract
  against the unreleased multiNode + datamount-longhorn chart path.
  The Azure DevOps Pipeline was never wired up in Graphistry's
  organization; the GH Action was manual-only; dev-compose/ was its
  sole consumer.
- templates/ingress/nginx-ingress-dev.yml: a duplicate devMode-only
  Ingress routing directly to the nginx Service and bypassing Caddy.
- templates/pvc/uploads-files-persistentvolumeclaim.yaml + the
  volumeName.uploadsFiles Helm value + NOTES.txt / README / CLUSTER.md
  / TROUBLESHOOTING.md references. The PVC has been dead code since
  the engine DaemonSet refactor switched nginx body-temp scratch to a
  per-pod emptyDir.
- forgeetlpython.env / extraEnv iteration on the dask-cuda-worker
  container (the F1 leak fix above).

Operator docs
- New retain-sc-telemetry StorageClass requirement for the telemetry
  subchart. Parallel to retain-sc-postgres for Postgres. Decoupled
  from the chart-wide retain-sc so multi-node operators whose retain-
  sc is NFS/RWX do not silently land Prometheus TSDB / Jaeger Badger
  / Grafana SQLite on a networked filesystem (POSIX fsync + flock
  semantics violated). Telemetry subchart README gains a TL;DR
  mirroring the postgres-cluster pattern (heredoc manifest + skip-
  path bullets) and a Storage section with per-backend provisioner
  table. Top-level README "Three StorageClasses to create" entry.
  Chart README pointer next to the existing Postgres Prereq line.
- New "Diagnose engine internals" subsection in TROUBLESHOOTING.md
  documenting the kubectl port-forward path for operators hitting the
  internalTrafficPolicy: Local silent connection drop on off-pod
  callers (sidecars on CPU pool nodes, manual kubectl exec curl).
- "Access Graphistry" table in the chart README and the ACCESS block
  in NOTES.txt are now tier-aware. Nexus landing + dashboard works at
  every tier; ETL REST API requires analytics+; /graph/* browser viz
  requires viz+. Previously labeled every URL "Graphistry:" with no
  qualifier.
- New STORAGE TOPOLOGY block in NOTES.txt printed at install time
  when global.storage.accessMode is ReadWriteOnce (the default).
  Reframes accessMode as the single declaration for single-node vs
  multi-node intent. Industry survey confirmed no mainstream Helm
  chart auto-detects topology at render time (Helm lookup is
  unreliable on Argo CD / helm template / --dry-run=client); the
  contract is operator-declared. CLUSTER.md sidebar names the
  "Multi-Attach error for volume" symptom for operators searching by
  failure mode.
- README Migration section gains a "Plan a maintenance window" entry
  for the 0.4.3 to 0.4.4 upgrade. Lists the affected Services, the
  selector flip mechanism, expected outage window (60-300s on single-
  node k3s with images cached; 5-15 min per node on multi-node with
  fresh image pulls), and verification commands.
- Top-level README "PVCs per tier" table corrected to match chart
  reality: gak-public and gak-private PVCs render at every tier (the
  consuming Deployments stay tier-gated to full). Tier-gated PVCs on
  Retain SC would orphan PVs with stale claimRefs on downgrade-then-
  upgrade cycles.
- README "Optional" pointer for the telemetry subchart TL;DR next to
  the existing Postgres Prereq pointer; keeps the same level of
  abstraction as the rest of the doc tree.
- nginx.env block in values.yaml documents two image-supported optional
  env vars (LONGHORN_NAMESPACE, EXTRA_NGINX_LOGS) that the chart does
  not set as defaults but operators can opt into via env or extraEnv.
- OTEL_RESOURCE_ATTRIBUTES added to the otelEnv map as a commented-
  out default with a doc comment naming the OTel semantic-conventions
  spec.
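
A retain-sc-telemetry StorageClass on a node-local backend might look like the sketch below. The provisioner shown is the k3s local-path one and is only an example; the telemetry README's per-backend table is the authority for other distros.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-sc-telemetry
provisioner: rancher.io/local-path        # example backend; substitute per distro
reclaimPolicy: Retain                     # keep telemetry data across PVC deletion
volumeBindingMode: WaitForFirstConsumer   # bind on the node where the pod schedules
```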

Verification
- helm lint passes on every change.
- ASCII only; no em-dashes in prose; no commit hashes or PR refs in
  chart-shipped docs (CHANGELOG body entries restate rationale inline
  for airgapped operators).
Continuation of a0d5cba covering the remaining blocker / critical /
docs items in the 0.4.4 review surface. CHANGELOG.md carries one
entry per item with full failure-mode and fix detail; this summary
groups by severity.

BLOCKERS

- Rolling update strategy default: documented per-workload race
  window rather than reverting or restructuring. The flip from
  Recreate to RollingUpdate does not deadlock helm upgrade on
  correctly-configured clusters; the K8s-level deadlock only fires
  in the multi-node + RWO misconfig the chart already warns about
  via NOTES.txt. The actual residual risk is a per-app race during
  the ~30-90s pod-overlap window, affecting redis snapshots, nexus
  uploads, pivot/notebook saves, and caddy ACME (only at
  tls.mode=self). New TROUBLESHOOTING section 12 documents the
  per-workload verdict and the rollingUpdate.enabled=false opt-out.

- Engine env priority: documented rather than reordered. The
  chart-literals-after-extraEnv emission order is deliberate
  defense (chart-owned wiring stays authoritative). New
  TROUBLESHOOTING section 13 documents the full emission order,
  SSA-vs-default-helm collision behavior, and the dedicated
  .Values.global.* channels operators should use for legitimate
  overrides of chart-owned keys. values.yaml gains a one-line
  pointer at the first engine-container extraEnv block.
  engine-daemonset.yaml "K8s API rejects duplicate env keys"
  comments rewritten to accurately describe SSA vs non-SSA.

- Multi-node otel-collector scrape: replaced the static_configs
  block in prometheus-configmap.yaml with kubernetes_sd_configs
  role: pod selecting label io.kompose.service: otel-collector,
  relabeling __address__ to pod IP. Single-replica Prometheus now
  scrapes every per-node collector instead of routing through the
  Service VIP, which was blocked by internalTrafficPolicy: Local.
  Split into two scrape jobs: otel-collector on :8889 for engine
  OTLP-derived metrics, and otel-collector-self on :8888 for the
  collector's own telemetry (queue size, drop counters, export
  failures). Service unchanged; only the Prometheus scrape path
  bypasses it.

- Notebook gak mount gating: the notebook Deployment unconditionally
  mounted gak-public and gak-private PVCs even when the operator
  disabled either gak variant via graphAppKit.public.enabled or
  graphAppKit.private.enabled. The PVC templates were correctly
  gated but the consumer Deployment was not, leaving the notebook
  pod stuck FailedMount on a PVC the chart never rendered. Matching
  .enabled gates added around the two volumeMount entries and the
  two volume entries; default rendering byte-identical when both
  variants are enabled.
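
The two scrape jobs from the otel-collector item can be sketched in Prometheus configuration as follows. The relabel rules are an assumption about how the selection and address rewrite are expressed; the job names, label, and ports come from the description above.

```yaml
scrape_configs:
  - job_name: otel-collector          # engine OTLP-derived metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled io.kompose.service=otel-collector.
      - source_labels: [__meta_kubernetes_pod_label_io_kompose_service]
        regex: otel-collector
        action: keep
      # Scrape the pod IP directly, bypassing the Service VIP.
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: "$1:8889"

  - job_name: otel-collector-self     # the collector's own telemetry
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_io_kompose_service]
        regex: otel-collector
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: "$1:8888"
```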

CRITICAL

- Crypto bootstrap shape validation: four regex checks added after
  CSPRNG extraction, before any kubectl create. Catches generators
  whose output is non-empty but wrong-format (line-wrapped base64,
  broken od column width, rate-limited urandom). Mismatches
  accumulate into a bash array so all four surface in one error
  naming the generator image.

- Crypto bootstrap precondition Secret-presence check: ahead of
  the CSPRNG generator, kubectl get secret is run for both
  graphistry-jwt and graphistry-nexus. If either exists, the script
  lists which, prints a two-bullet resolution guide (full re-init:
  delete BOTH and rerun; rotation: follow the MultiFernet README
  recipe), and exits 1 before any CSPRNG values are generated. The
  prior delete-one-and-rerun pattern silently created partial key
  rotation, invalidating either browser sessions or connector
  credentials with no error signal.

- engine.cookieSecret Caddyfile injection: the value rendered into
  lb_policy cookie graphistry_sticky <SECRET> { without quoting.
  Whitespace split the directive into multiple positional args
  (parse error); literal newlines broke the directive entirely
  (EOF). Wrapped the helper's emission with | quote (whitespace
  tolerated; embedded " escaped; # / { / } carried through
  literally) and added a template-time fail if the value contains
  a literal newline. Gated on $isAnalyticsPlus so platform-tier
  is not affected. Default rendering byte-shape unchanged.

- Caddy wrapper cookie-LB Secure/SameSite fix delivered via
  global.tag bump v2.50.4 to v2.50.6 across base values, chart
  README, and the k3s / gke / tanzu example overlays. v2.50.6
  rebases the bundled wrapper on Caddy 2.11.2 (caddyserver/caddy
  PR #6115 merged in v2.8.0). Pre-v2.50.6 wrappers shipped Caddy
  v2.7.6 and silently dropped graphistry_sticky in cross-site
  embed contexts, breaking session pinning. caddy.upstreamImage
  remains as an escape hatch for operators needing a specific
  Caddy version. Stale "currently pinned to v2.7.6" comments
  refreshed (history retained for operators debugging older
  deployments).

- Caddy phantom :443 port: gated both the Pod's containerPort: 443
  and the Service's :443 port block (including the nested
  nodePortHttps clause) on tls.mode=self. The Caddyfile only emits
  a :443 site block in tls.mode=self; in ingress / external / off
  Caddy listens on :80 only. Pre-fix, cloud LB TCP health checks
  on :443 saw connection-refused; with externalTrafficPolicy:
  Local, kube-proxy blackholed traffic. Same edit unified the
  caddy-deployment.yaml pre-existing default "external" inline
  fallbacks to default "ingress", matching values.yaml and the
  three other Caddy templates.

- Telemetry N-fold timeseries cardinality: added
  internalTrafficPolicy: Local to dcgm-exporter and node-exporter
  Services. Pre-fix, each per-node otel-collector's internal
  prometheus receiver scraped via Service hostname and kube-proxy
  load-balanced cluster-wide; resulting metrics carried the
  collector's own host.name resource attribute, fanning a single
  exporter pod's metrics to N distinct timeseries upstream
  (Grafana Cloud / Mimir bill by active timeseries; Datadog by
  submissions). Local policy makes each collector scrape its
  node-local exporter only. Self-hosted mode unaffected;
  Prometheus already scrapes pod IP direct via kubernetes_sd_configs.

- Floating :latest pins: groundnuty/k8s-wait-for pinned to v2.0
  across engine DaemonSet (wait-for-redis, wait-for-dask-scheduler),
  nexus-proxy Deployment (wait-for-nexus), and otel-collector
  DaemonSet (init-prometheus, init-jaeger): 10 references across
  3 templates, both default and custom-registry render branches.
  crypto-bootstrap busybox:latest pinned to busybox:1.38.0-musl.
  charts/graphistry-helm/README.md airgap section rewritten in the
  same pass: dropped a stale pre-engine-consolidation pod list,
  bumped the v2.50.1 example tag to v2.50.6, and mirrored the
  existing "Create Docker Hub Secret" section's "contact Graphistry
  Support" tone (operators do not enumerate images themselves).

- NOTES.txt TROUBLESHOOTING URL fix: replaced the link to the deleted
  /charts/values-overrides/examples/troubleshooting.md with the
  repo-root /TROUBLESHOOTING.md, where the docs-pipeline rewire moved
  the file earlier in this version.

- NOTES.txt legacy-key migration warning: six-key hasKey-gated
  warning added (graphAppKitPublic, graphAppKitPrivate,
  pivotDev.repository, volumeName.uploadsFiles, multiNode,
  clusterVolume). Slim shape: lists detected keys by name, points
  at the chart README Migration section for recipes. Inline comment
  documents why rollingUpdate-bool is intentionally NOT detected
  (Deployment templates access .Values.rollingUpdate.enabled
  directly and Helm cannot index a bool; helm install errors at
  template-render time with "can't evaluate field enabled in type
  interface {}" before NOTES is reached, so the README Migration
  paragraph is the recovery path). Chart README Migration section
  extended with six new "**You had X.**" paragraphs matching the
  existing TLS migration tone.
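
The telemetry cardinality fix comes down to one field on the exporter Services. The sketch below shows it for dcgm-exporter; the selector label and port number are assumptions rather than the chart's rendered values.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
spec:
  # Route in-cluster callers only to endpoints on their own node, so each
  # per-node collector scrapes its node-local exporter.
  internalTrafficPolicy: Local
  selector:
    app: dcgm-exporter   # hypothetical label
  ports:
    - name: metrics
      port: 9400         # dcgm-exporter's usual metrics port
      targetPort: 9400
```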

DOCS / MINOR

Many smaller cleanups; one CHANGELOG entry per item. Highlights:

- CHANGELOG StreamglGpuWorkerResources typo corrected to
  StreamglGpuResources (matches the actual chart values key).
- nexusDev.repository added to the legacy-key migration warning
  alongside its pivotDev sibling.
- OTEL_EXPORTER_JAEGER_ENDPOINT gated on self-hosted mode (cloud
  pods no longer carry the dangling jaeger:4317 reference).
- otel-collector liveness probe comment corrected to reflect the
  actual emitted path ("/", not "/healthz").
- New NOTES.txt advisory: COOKIE SECRET WARNING when
  engine.cookieSecret equals the placeholder default and tier is
  analytics+, pointing operators at openssl rand -hex 32.
- TROUBLESHOOTING.md "Expected output (k3s with Traefik)" updated
  to reflect the 0.4.4 single-Caddy-Ingress topology (telemetry
  UIs now path-routed through Caddy reverse_proxy, no standalone
  grafana / prometheus / jaeger Ingresses).
- nginx env block in engine-daemonset.yaml gains the PORT filter
  for manifest symmetry with its five sibling containers.
- engine-wait-for-postgres port driven by new global.POSTGRES_PORT
  value (default empty, falls back to 5432). Operators on RDS
  Proxy / PgBouncer on 6432 no longer hit pg_isready timeouts.
- New TROUBLESHOOTING section documenting the dask-cuda-worker
  liveness -> in-pod nginx dispatcher -> forge-etl-python:8100/
  workerhealth chain and cascading-restart behavior under nginx
  instability.
- engine DaemonSet terminationGracePeriodSeconds bumped 30 to 60
  for dask-cuda-worker drain headroom.
- gak-{public,private} PVC comments clarify that the new .enabled
  gate is a per-variant opt-out (entire variant suppressed),
  distinct from the intentionally-absent tier gate.
- TROUBLESHOOTING section 13 extended with a subsection naming
  notebook, pivot, graph-app-kit-public, and graph-app-kit-private
  as sharing the engine pod's env emission-order pattern.
- NOTES.txt Crypto Secrets reminder lifted out of the devMode
  gate; PREREQUISITES list inside the gate renumbered 1-3.
- NOTES.txt Available tiers line no longer implies the chart
  bundles postgres (postgres-cluster is a separate Helm release
  prereq).
- New NOTES.txt advisory: EXTERNAL TLS TOPOLOGY when
  caddy.tls.mode=external, warning about HSTS pinning if the
  upstream LB serves plain HTTP.
- Grafana defaults: GF_AUTH_ANONYMOUS_ENABLED true to false,
  GF_AUTH_ANONYMOUS_ORG_ROLE Admin to Viewer. Pre-fix, anyone
  reaching the chart's domain via Caddy /grafana got unauthenticated
  Grafana Admin (modify dashboards, run arbitrary queries against
  Prometheus, add datasources). admin/admin seed credentials
  unchanged for dev convenience.
- --stdin=true removed from the crypto-bootstrap kubectl run
  invocation (unused attach stream; some kubectl versions drop
  initial stdout bytes via the attach side effect).
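
Two of the values touched in this list, shown as an overlay sketch. The Grafana key paths are assumed by analogy with telemetry.grafana.GF_SERVER_ROOT_URL, and the quoting of the port is a guess at the expected type.

```yaml
global:
  POSTGRES_PORT: "6432"   # e.g. PgBouncer; leave empty to fall back to 5432

telemetry:
  grafana:
    GF_AUTH_ANONYMOUS_ENABLED: false      # new default
    GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer    # new default, applies only if anonymous access is re-enabled
```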

OUT OF SCOPE

- Defensive trap around the bootstrap-script kubectl run pod
  (orphan-pod cleanup on Ctrl-C): a defense that does not constrain
  any attacker who has the access required to exploit the window
  (RBAC rarely partitions pods/log from secrets; any actor at the
  operator's terminal trivially bypasses the trap by editing the
  script). Memory rule defense-must-constrain-attacker captures
  the reasoning.
- assertCrypto include in engine-daemonset.yaml and
  nexus-proxy-deployment.yaml: deferred. The assertion is caught
  incidentally today because nexus-deployment.yaml always renders
  and includes it; defense-in-depth across templates would close
  fragility under future template-render-order changes.
- readinessProbe + startupProbe coverage across the six engine
  sibling containers plus notebook / pivot / gak / nexus: real
  architectural work; deferred to a separate workstream.

Verified: helm lint clean; helm template renders cleanly across
default values plus the scenario-specific overlays exercised per
item (rollingUpdate bool, caddy.tls.mode variants, telemetry
cloud + self-hosted, graphAppKit variant disables, legacy-key
overlays, cookieSecret placeholder).
Four narrow additions catching foot-guns that were not in the prior
review pass but are cheap to close. None are install-time fatal; all
four are documentation or comment-only and do not change any chart
template behavior.

- graphistry.tier.* helpers (charts/graphistry-common/templates/
  _helpers.tpl) emit the literal string "true" when the deployment
  tier is at or above the named level, and the EMPTY STRING when not.
  There is no "false" branch. All 30 existing call sites correctly
  use eq ... "true" or ne ... "true", but a future contributor
  reaching for eq ... "false" silently always evaluates false, and
  one reaching for {{ if (include ...) }} truthiness is subject to
  Go template's non-empty-string-truthy rule (the literal string
  "false" would be truthy if the helper were ever symmetrised).
  Added a file-level docstring documenting the contract with explicit
  positive and negative example forms; per-helper one-liners now
  read "empty string otherwise" instead of stopping at the truthy
  case. No behavior change.
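The contract above can be illustrated with a shell analogue. This is only a sketch: the real helpers are Helm templates, `tier_at_least_viz` is a hypothetical name, and the tier ordering (analytics below viz and full) is assumed from the tier descriptions elsewhere in this PR.

```shell
# Shell analogue only; the real helpers are Helm templates and
# tier_at_least_viz is a hypothetical name.
tier_at_least_viz() {
  # Prints "true" at or above the viz tier and nothing otherwise.
  case "$1" in
    viz|full) printf 'true' ;;
  esac
}

for tier in analytics viz; do
  result=$(tier_at_least_viz "$tier")
  # Correct positive form: compare against the literal "true".
  if [ "$result" = "true" ]; then echo "$tier: at or above viz"; fi
  # Correct negative form: compare against "true" with inequality.
  if [ "$result" != "true" ]; then echo "$tier: below viz"; fi
  # Wrong form: the helper never emits "false", so this never fires.
  if [ "$result" = "false" ]; then echo "$tier: unreachable"; fi
done
```

The same reasoning applies to the template forms: test equality or inequality against "true", never against "false".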

- charts/graphistry-helm/CLUSTER.md Storage section gains a new
  "Adding a node to an existing single-node deployment" subsection.
  The chart's STORAGE TOPOLOGY warning in NOTES.txt fires at every
  helm install and helm upgrade with global.storage.accessMode set
  to ReadWriteOnce, but does NOT re-fire when cluster node count
  changes out-of-band (operator joins a 2nd GPU node months after
  install). The new subsection walks the prevention path (provision
  RWX backend; swap StorageClass; migrate data explicitly; flip
  accessMode to ReadWriteMany; helm upgrade; then join the new node)
  and the recovery path (cordon the new node first if already joined
  and the engine pod is stuck Multi-Attach). Lands after the existing
  "Symptom you are debugging" subsection so operators reading either
  the prevention or recovery path encounter the right adjacent
  content. Doc-only addition.

- otelEnv block in values.yaml gains a comment block documenting
  Helm's map-merge vs atomic-replace semantics. otelEnv is a
  top-level map with seven chart defaults (OTLP endpoint, four
  per-signal timeouts, metric push interval, metric export timeout).
  Overlay files (-f overlay.yaml) and per-key --set otelEnv.KEY=VAL
  deep-merge per key; whole-map --set otelEnv="{KEY: VAL}" atomically
  REPLACES every default. The atomic-replace form silently drops the
  OTLP endpoint and all timeouts, producing missing-telemetry symptoms
  with no render-time signal. The new comment block tells operators
  which override forms merge and which replace, and recommends
  overlay files or per-key --set for additions. Same comment shape
  as the existing nexus.env merge-semantics guidance.
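A minimal sketch of the override forms described above. The key name and value are hypothetical placeholders (the chart's real otelEnv keys are not reproduced here), and `<release>` / `<chart>` stand for the operator's own release name and chart path.

```shell
# Sketch: EXAMPLE_EXTRA_KEY and its value are hypothetical examples.
write_otel_overlay() {
  # $1 = path of the overlay file to create.
  cat > "$1" <<'EOF'
otelEnv:
  EXAMPLE_EXTRA_KEY: "example-value"
EOF
}

# Merging forms (the chart defaults are kept):
#   helm upgrade <release> <chart> -f overlay.yaml
#   helm upgrade <release> <chart> --set otelEnv.EXAMPLE_EXTRA_KEY=example-value
# Replacing form (every chart default is dropped):
#   helm upgrade <release> <chart> --set otelEnv="{EXAMPLE_EXTRA_KEY: example-value}"
```

Rendering with helm template and inspecting the collector env is one way to confirm which form was actually applied.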

- templates/NOTES.txt analytics-tier ACCESS block gains a single
  line under the existing "Browser visualization (/graph/*) NOT
  available at analytics tier" warning, naming /streamgl/* asset-path
  502 behavior at analytics tier as expected (not a chart bug).
  Engine-pod nginx unconditionally has a proxy_pass to
  http://streamgl-viz:8080 from the bundled default.conf (chart
  cannot path-gate it; that lives in the upstream nginx image),
  but the streamgl-viz shim Service is gated on tier.viz and does
  not render at analytics. /streamgl/* requests (from old session
  bookmarks, third-party iframe embeds, dashboard widgets pointing
  at viz URLs) return 502 from the engine nginx. The new NOTES line
  names the audience and the expected behavior so analytics-tier
  operators recognize the failure mode without filing a chart bug.

Storage in-flight upload window during helm upgrade rolls (also
floated as a doc candidate) is deliberately left undocumented at
the chart level: the maintenance-window pattern in the existing
Migration guidance already covers it, and an explicit per-Deployment
drain procedure would model operator state machinery the chart
deliberately leaves unmodeled (consistent with the deferred
readiness/startup-probe and preStop-hook work).

Verified: helm lint clean; helm template renders cleanly; no
em-dashes in any touched file; no internal review-process language
in any chart-shipped file.
…atomicity

Continuation of the 0.4.4 hardening surface. CHANGELOG.md carries one
entry per item with full failure-mode and verification detail; this
summary groups by concern.

ENGINE POD PROBES

- nginx sibling container gains a readinessProbe mirroring the existing
  livenessProbe (curl -f http://localhost/healthz) with
  initialDelaySeconds: 8, periodSeconds: 10, failureThreshold: 3,
  timeoutSeconds: 5. Caddy's `dynamic a engine-headless ...` resolver
  publishes pod IPs to its upstream pool on a 10s refresh as soon as
  the pod enters Endpoints, which without a readinessProbe happens as
  soon as the pod is Ready (containers Running), before nginx is bound
  on :80. During DaemonSet rollover (updateStrategy: RollingUpdate
  with maxUnavailable: 1, one engine pod per node rolling sequentially)
  Caddy would route browser traffic to a not-yet-serving pod, producing
  transient 502s on every rollover. Pre-0.4.4 the same gap existed on
  the standalone nginx Deployment but was masked by the Recreate-default
  strategy (no overlap window) and by Caddy resolving through kube-proxy
  with TCP retry semantics; 0.4.4's DaemonSet RollingUpdate + headless
  Service + dynamic resolver makes the gap operator-visible. The wrapper
  image's /healthz returns HTTP 201, and the container becomes ready in
  roughly one second from start (verified at runtime against
  graphistry/streamgl-nginx:v2.50.6-universal); the 8s initialDelay is
  conservative buffer for K8s scheduling overhead. Pod IPs no longer
  enter Endpoints until nginx is bound; rolling upgrades stop dropping
  browser sessions on the new-pod side.

- dask-cuda-worker stage-2 liveness target changed from
  http://forge-etl-python:8080/workerhealth (which routed through the
  in-pod nginx dispatcher via hostAliases: forge-etl-python -> 127.0.0.1
  -> nginx :8080 -> fep :8100) to http://localhost:8100/workerhealth
  (direct to fep in the shared engine-pod network namespace at fep's
  bind port). The earlier indirect form coupled dask's liveness to nginx
  availability: any nginx container restart (image pull, OOM,
  ConfigMap-triggered rollout) made the dispatcher unavailable, dask's
  liveness failed failureThreshold: 3 consecutive times, kubelet killed
  and restarted the dask container, and in-flight Arrow batches were
  lost (no preStop hook; terminationGracePeriodSeconds: 60 only helps
  if the worker can respond to SIGTERM, which it cannot if liveness
  has already failed). The direct-localhost form decouples the cascade.
  Probe semantic preserved: fep's /workerhealth handler validates
  dask-cluster reachability internally before responding, so the probe
  still asserts "the dask cluster is operational through fep"; it just
  no longer requires nginx routing to make the assertion. Verified
  end-to-end against the running graphistry/etl-server-python:v2.50.6-13
  container: the image binds 0.0.0.0:$PORT, the chart's engine pod sets
  PORT=8100 for the forge-etl-python sibling, and
  curl http://localhost:$PORT/workerhealth from the same network
  namespace returns {"success": true}.
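The direct-localhost check can be sketched as follows. This is not the chart's probe script; the function names are hypothetical, and the only facts taken from above are the URL shape and the {"success": true} payload.

```shell
# Sketch of a direct-localhost health check; not the chart's probe script.
body_reports_success() {
  # Succeeds when the JSON body on stdin contains "success": true.
  grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

check_workerhealth() {
  # $1 = fep bind port. curl -f fails on HTTP errors; --max-time bounds a hang.
  curl -fsS --max-time 5 "http://localhost:$1/workerhealth" | body_reports_success
}
```

Called as `check_workerhealth "$PORT"` from the same network namespace, the check exercises fep without touching the nginx dispatcher.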

- TROUBLESHOOTING.md section 14 ("Engine Container Probe Dependencies")
  rewritten: the dask-cuda-worker subsection now documents the
  direct-localhost variant as current and the routed-through-nginx
  variant as historical/superseded; new subsection on the nginx
  readinessProbe documents its role in headless-Service rollover
  semantics and the rationale for the 8s initialDelay.

MIGRATION WARNING CORRECTNESS

- NOTES.txt legacy-TLS warning gate was checking the wrong nesting
  level. 0.4.3 declared tls and tlsStaging at the TOP level
  (values.yaml:268,271 on origin/main) and the shipped 0.4.3 overlays
  (k3s, gke, tanzu, microk8s) all set them top-level. The warning was
  checking hasKey .Values.global "tls" / "tlsStaging", which never
  fires for the actual operator-overlay shape. Result: an operator
  upgrading from 0.4.3 with tls: true got no warning that TLS handling
  had moved to operator-explicit caddy.tls.mode + ingress.tls (could
  end up serving plain HTTP believing the chart still terminates TLS).
  Fixed by switching the three hasKey paths to top-level and dropping
  the `global.` prefix from the body labels. Verified the warning fires
  on --set tls=true and does NOT false-positive on the implausible
  --set global.tls=true shape.

- NOTES.txt legacy-v0.4.3 migration block dropped its multiNode /
  clusterVolume detection entirely. The original C14 addition checked
  these at the wrong nesting level (top-level hasKey .Values "multiNode"
  while 0.4.3 templates uniformly read .Values.global.multiNode /
  .Values.global.clusterVolume). On audit of the actual operator
  population, no customer ever shipped an overlay using the chart's
  pre-0.4.4 leader/follower cluster-mode prerequisites that these two
  keys configured. Correcting the hasKey path would defend an empty
  customer population. Removed the two hasKey entries from the outer
  or-chain and the two if blocks in the body; chart README Migration
  section parallel paragraphs (`**You had multiNode: true.**` and
  `**You had clusterVolume: ....**`) also dropped on the same logic.
  The five remaining keys in the warning block (graphAppKitPublic,
  graphAppKitPrivate, nexusDev.repository, pivotDev.repository,
  volumeName.uploadsFiles) are genuinely top-level in 0.4.3 and shipped
  in operator overlays, so those detections remain and continue to fire.

- TROUBLESHOOTING.md "Multi-Attach error on engine PVCs (multi-node)"
  step 2 rephrased from "Deploy in cluster mode with NFS, EFS, Longhorn
  RWX, or CephFS" to "Switch the chart's global.storage.accessMode to
  ReadWriteMany and back retain-sc with an RWX storage backend (NFS,
  EFS, Longhorn RWX, CephFS, Azure Files)". Removes ambiguous "cluster
  mode" phrasing that could be misread as the never-shipped legacy
  leader/follower cluster feature.

BOOTSTRAP SCRIPT ATOMICITY

- crypto-bootstrap/create-secrets.sh Secret creation step refactored
  from two sequential kubectl create -f - heredocs into one multi-doc
  stdin submission separated by `---`. Two motivations stack:
  (1) values continue not to appear in kubectl argv (`ps` /
  /proc/<pid>/cmdline cannot see them, same as before), and (2) the
  failure window where graphistry-jwt could end up orphaned without
  graphistry-nexus shrinks from "any moment between two separate
  kubectl processes" to "any moment within a single kubectl process
  reading one stdin stream". The shape is honestly NOT transactionally
  atomic at the apiserver (kubectl still processes docs sequentially;
  an admission-webhook failure on the second doc still leaves the
  first committed), but operator-side interruption between docs
  (Ctrl-C, terminal disconnect, OOM of the shell) cannot leave the
  script in the half-created state, since both docs are already in the
  single stdin stream one kubectl process is reading. The existing
  precondition Secret-presence check ahead of
  the CSPRNG generator continues to catch any orphan-jwt state on
  rerun. Help text reference updated from "the two kubectl create -f -
  invocations" to "the multi-doc kubectl create -f - invocation".
  Verified via kubectl create --dry-run=client -f - with multi-doc
  stdin: both Secrets processed and reported in one invocation.
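The multi-doc shape can be sketched like this. It is an illustration under assumptions, not the bootstrap script itself: the Secret names come from the text above, while the key names inside each Secret and the helper names are hypothetical.

```shell
# Sketch only: key names inside each Secret are hypothetical placeholders.
gen_secret() {
  # 32 random bytes from the kernel CSPRNG, hex-encoded (64 characters).
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

render_secrets() {
  # Emits both Secrets as one multi-document YAML stream on stdout.
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: graphistry-jwt
type: Opaque
stringData:
  jwt-secret: "$(gen_secret)"
---
apiVersion: v1
kind: Secret
metadata:
  name: graphistry-nexus
type: Opaque
stringData:
  nexus-secret: "$(gen_secret)"
EOF
}

# One kubectl process reads both documents from stdin, so the values never
# appear in argv. Dry-run form:
#   render_secrets | kubectl create --dry-run=client -f -
```

Dropping `--dry-run=client` submits both documents for real; as noted above, that is still sequential at the apiserver, not transactional.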

OUT OF SCOPE (no change)

- streamgl-gpu liveness probe port: verified against the streamgl-gpu
  container. Both the current chart probe target
  (localhost:$((PORT+STREAMGL_NUM_WORKERS+1))/check-workers, =:8095 at
  K8s PORT=8090, N=4) and the pre-0.4.4 standalone probe target
  (localhost:$PORT/check-workers, =:8090 in K8s) return identical
  {"status":200,"workers":[...]} payloads at runtime. Source confirms
  /check-workers is canonically registered on the PORT+N+1 endpoint and
  exposed on PORT through a proxy special case; both shapes work, the
  current chart shape is more direct. No fix needed.

- $RANDOM-based POD_NAME in the bootstrap script: theoretical name
  collision under concurrent same-namespace runs only; concurrent
  bootstrap of the same namespace is not a real workflow and the pod
  is --rm short-lived.

Verified: helm lint clean; helm template renders cleanly; bash -n
clean on the bootstrap script; no em-dashes added to any touched file.

Three narrow follow-ups on top of the prior 0.4.4 hardening passes.
CHANGELOG.md carries one entry per item with full failure-mode and
verification detail; this summary groups by concern.

PROBE PORT CONSISTENCY

- templates/nexus/nexus-deployment.yaml and templates/pivot/
pivot-deployment.yaml init containers' pg_isready checks now read
.Values.global.POSTGRES_PORT (default 5432) in both branches of the
external-host vs in-cluster-host if/else, matching the earlier
engine-init fix. Both files previously hardcoded -p 5432 in both
branches; the chart now has a consistent POSTGRES_PORT story across
all three wait-for-postgres init probes (engine, nexus, pivot). On
default Crunchy (5432) nothing changes; on external Postgres at a
non-standard port (RDS Proxy on 6432, PgBouncer on 6432, etc.) nexus
and pivot stop hanging in Init while the apps themselves would have
been fine (the chart wires POSTGRES_PORT into the runtime env from
the Crunchy pguser secretRef or from the global.POSTGRES_HOST
overlay; only the init-container readiness gate was reading the
wrong port). Verified via
helm template --set global.POSTGRES_PORT=6432 that both files render
the override in both branches; default render still emits -p 5432.
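A wait-for-postgres gate of this kind can be sketched as below. The chart's real init-container script may differ; here an environment variable stands in for the rendered .Values.global.POSTGRES_PORT value, and the function names are hypothetical.

```shell
# Sketch of a wait-for-postgres gate; the chart's real init script may differ.
pg_isready_args() {
  # $1 = host; the port comes from POSTGRES_PORT with the 5432 default.
  echo "-h $1 -p ${POSTGRES_PORT:-5432}"
}

wait_for_postgres() {
  # Polls until the server accepts connections on the configured port.
  # The command substitution is left unquoted so the args split into words.
  until pg_isready $(pg_isready_args "$1"); do
    sleep 2
  done
}
```

With POSTGRES_PORT unset the gate probes 5432; with POSTGRES_PORT=6432 it probes the same port the application will later connect to.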

COMMENT CORRECTNESS

- templates/engine/engine-daemonset.yaml stale comment near the
streamgl-gpu env block claimed `STREAMGL_NUM_WORKERS still comes
through .Values.env like the existing chart`. The claim was true on
origin/main (where STREAMGL_NUM_WORKERS: "4" was a chart-default
.Values.env entry) but 0.4.4's values restructuring removed the
default entry. The streamgl-gpu container's PORT+N+1 list-server
probe formula (${PORT}+${STREAMGL_NUM_WORKERS:-4}+1) still resolves
correctly today because both the image's internal default and the
probe's bash :-4 fallback are 4. Operators tuning N for non-default
GPU counts must set STREAMGL_NUM_WORKERS via .Values.streamglgpu.env
or .Values.env; the container reads the same env var the probe
formula reads, so the router and probe stay in sync under tuning.
Comment rewritten to describe this accurately so future contributors
do not assume a chart-default that no longer exists. Doc-only.
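The fallback behavior of the probe formula can be sketched in isolation. `probe_port` is a hypothetical helper name; the arithmetic and the :-4 default are the ones quoted above.

```shell
# Sketch of the probe-port formula with the :-4 fallback quoted above.
probe_port() {
  # Reads PORT and STREAMGL_NUM_WORKERS from the environment, as the probe does.
  echo $(( PORT + ${STREAMGL_NUM_WORKERS:-4} + 1 ))
}

PORT=8090
unset STREAMGL_NUM_WORKERS
echo "N unset -> :$(probe_port)"   # falls back to N=4
STREAMGL_NUM_WORKERS=8
echo "N=8     -> :$(probe_port)"   # follows the tuned value
```

Because the probe and the container read the same variable, setting it once keeps both in sync.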

POSTGRES-CLUSTER OPERATOR NOTES

- New charts/postgres-cluster/templates/NOTES.txt (this chart
previously had none): emits a post-install / post-upgrade callout
covering operator prerequisites (PGO controller running in the
postgres-operator namespace; StorageClass pre-created with Retain
reclaim policy + block-storage RWO backend + WaitForFirstConsumer
binding; not NFS, because PostgreSQL relies on fsync() durability
semantics NFS does not guarantee), the upgrade rename hazard
introduced when 0.4.4 unified the default StorageClass name from
older variants (retain-sc, retain-sc-cluster) to retain-sc-postgres
(PVC.spec.storageClassName is immutable; default-config upgrade
across the rename cannot rebind the existing PV to the new SC, so
PGO's data PVC stays Pending, or comes up on an empty volume if the
operator provisions a fresh retain-sc-postgres without restoring
from backup, orphaning the prior Retain-policy PV), the two safe
upgrade paths (pin global.storageClassNameOverride to the existing
SC name, OR backup-and-restore onto a fresh SC), the resource
sizing posture (default global.PostgresResources is dev-sized and
will not survive production workloads), the active pgBackRest
backup cron schedules, the data and backup-repo PVC sizes,
post-install verification commands, and the resolved
storageClassName / postgres host / db / user / credentials secret
values so the operator can sanity-check configuration at every
install / upgrade. All site-specific data (storage class, secret
name, PVC sizes, resource limits, backup cron strings, host / db /
user) is rendered from .Values so operator overrides flow through
to the NOTES text without divergence; the only literal in the
template is the `default "retain-sc-postgres"` fallback string,
which matches postgres-cluster.yaml's own template default at lines
50/81 so the NOTES SC value never diverges from what the rendered
PVC actually requests. Pointers to the chart README (architecture,
service routing, credential plumbing, cleanup) and to TROUBLESHOOTING
section 7 (runtime diagnostics) keep the NOTES focused on
install-time-actionable detail rather than duplicating reference
material. Verified default render + override render (storage class,
postgres host / db / user, PVC sizes, schedules, resource limits)
all flow through; hardcoding scan clean for backup cron strings,
resource limits, volume sizes, and secret name; em-dash scan clean.

OUT OF SCOPE (verified, no change)

- otel collector self-metrics :8888 binding: surfaced as a possible
multi-tenant exposure concern. The endpoint exposes operational
metrics (request counts, queue depths) and not secrets; the chart's
deployment target is single-tenant clusters; rewiring to 127.0.0.1
+ a Service + NetworkPolicy adds churn without constraining an
attacker who already has cluster-pod-network access. Accepted as-is.

- Engine DaemonSet assertCrypto include: surfaced as a parallel to
the nexus-deployment.yaml top-of-file include. Verified the nexus
include is NOT gated by a tier conditional; it always fires on
helm template, regardless of which tier renders the rest of the
manifest. The helper checks crypto.jwtSecret and crypto.nexusSecret;
the engine consumes only crypto.jwtSecret (three sites in
engine-daemonset.yaml) and is always deployed alongside nexus.
Adding the include to engine-daemonset.yaml would emit a duplicate
fail-fast message on the same condition nexus already catches.
Skipped as redundant.

Verified: helm lint clean on both charts (graphistry-helm,
postgres-cluster); helm template renders cleanly with defaults and
under all override scenarios listed above; no em-dashes added to any
touched file; no internal review-process language in any chart-shipped
file or in this commit message.

Six narrow follow-ups on top of the prior 0.4.4 hardening passes. The
common theme is the engine-consolidation's single-replica L7 + headless
sticky-LB topology: regressions and latent gaps that were masked by the
pre-0.4.4 multi-Deployment chart become operator-visible once Caddy is
the streaming-continuity bottleneck and the engine pod is the bundled
GPU sibling group. CHANGELOG.md carries one entry per item with full
failure-mode and verification detail; this summary groups by concern.

ENGINE POD

- templates/engine/engine-service.yaml streamgl-viz shim Service
  publishes a second port (3100 -> 3100) to match a secondary listener
  the streamgl-viz container binds inside the engine pod alongside its
  :8081 PORT-env-driven HTTP port. The chart sets K8S_NAMESPACE_SUFFIX=
  ".<ns>.svc.cluster.local" so resolver-path lookups for streamgl-viz
  go through the shim Service rather than the in-pod hostAliases; pre-fix
  the shim only mapped 8080 -> 8081, so resolver-path lookups for the
  secondary port returned connection-refused / 502 (the engine-
  consolidation added the shim-Service pattern but only covered one of
  streamgl-viz's two bound ports). The app-code path through hostAliases
  was unaffected, so only resolver-path consumers dialing through the
  engine-pod nginx broke. New port inherits internalTrafficPolicy: Local,
  matching the documented engine-shim-Service contract.

- templates/engine/engine-daemonset.yaml forge-etl-python container PORT
  moved 8100 -> 8200, with $streamglgpuPort (8090) and $fepPort (8200)
  promoted to template-local variables at the top of the file as the
  single source of truth for the streamgl-gpu / fep PORT emissions AND
  the new defensive guard. The engine-nginx-cfg.yml dispatcher map and
  the engine-service.yaml shim move in lockstep; engine-nginx-cfg.yml
  carries a cross-reference comment since it is a separate template and
  Helm template-local vars do not cross files. Pre-fix layout gave
  streamgl-gpu only 9 free slots above its base (8091..8099) before
  colliding with fep at 8100, so raising STREAMGL_NUM_WORKERS to 9+ on
  a high-GPU node would crash: the gpu-router image's hardcoded math
  (workers at PORT+i, list-server at PORT+N+1) puts the list-server on
  8100 at N=9. Moving fep to 8200 raises the ceiling to N<=108, which
  exceeds any realistic GPU-per-node SKU. A defensive fail guard inside
  the viz tier-gate extracts STREAMGL_NUM_WORKERS from .Values.streamglgpu.env
  or the legacy .Values.env list and refuses render with a multi-line
  diagnostic (full port layout, ceiling math, remediation paths) if
  PORT+N+1 would still collide with fep. FORGE_NUM_WORKERS is unaffected
  by this geometry: fep's worker processes share one bound socket, so
  raising it does not consume additional ports. Verified end-to-end:
  default-render (N=4) emits 8200 at every PORT site / probe / shim /
  dispatcher entry and zero :8100 residue; N=108 renders without the
  guard firing; N=109 / 120 / 200 all fail at helm template with the
  diagnostic; analytics tier (no streamgl-gpu container) skips the guard
  regardless of N.
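The ceiling math above can be reproduced with plain shell arithmetic. This is a sketch, not the chart's guard (which is a Helm template fail); the function names are hypothetical, and the port values are the ones stated in this entry.

```shell
# Sketch of the streamgl-gpu port geometry; not the chart's actual guard.
list_server_port() {
  # $1 = streamgl-gpu base PORT, $2 = STREAMGL_NUM_WORKERS
  echo $(( $1 + $2 + 1 ))
}

max_safe_workers() {
  # Largest N whose list-server port stays below the fep port.
  # $1 = streamgl-gpu base PORT, $2 = fep PORT
  echo $(( $2 - $1 - 2 ))
}

collides_with_fep() {
  # Succeeds (exit 0) when the list-server would reach the fep port.
  # $1 = base PORT, $2 = N, $3 = fep PORT
  [ "$(list_server_port "$1" "$2")" -ge "$3" ]
}

echo "old layout ceiling: N <= $(max_safe_workers 8090 8100)"
echo "new layout ceiling: N <= $(max_safe_workers 8090 8200)"
```

With fep at 8100 the ceiling is N=8; with fep at 8200 it is N=108, matching the guard's pass/fail boundary described above.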

- templates/engine/engine-daemonset.yaml nginx-container header comment
  block had a stale K8S_NAMESPACE_SUFFIX="" claim that contradicted the
  actual env value (".<ns>.svc.cluster.local") set ~30 lines below in
  the same container's env. Two contradictory comments described the
  same env var in the same file; the second one (above the env line)
  correctly described the FQDN-suffixed model that consumes the shim
  Services. Rewrote the header comment to match the actual model and
  name both routing paths (nginx-resolver path through the shim Services
  with internalTrafficPolicy: Local; app-code path through hostAliases
  to the in-pod :8080 dispatcher). The stale comment had masked the
  cause of the engine-service resolver-path regression in earlier
  readings; correcting it makes the routing model unambiguous for future
  maintainers. Doc-only.

- templates/engine/engine-daemonset.yaml hostAliases list gains
  "nginx" -> 127.0.0.1 in the always-on bucket alongside forge-etl-python
  and dask-cuda-worker. The chart wires viz's ETL submission target as
  FORGE_ETL_HOSTNAME: "nginx" + FORGE_ETL_PORT: "80", and NGINX_HOST: "nginx"
  is read by streamgl-sessions. Pre-fix the bare lookup fell through
  CoreDNS to the default-Cluster-policy nginx Service, so kube-proxy
  balanced each connection across all engine pods cluster-wide and viz's
  ETL submission could land on a remote engine pod's nginx, defeating the
  "every intra-stack HTTP hop is localhost" goal for that hop on multi-
  node deployments. Functionally correct (the remote nginx forwards to
  its own local fep), so MINOR not BLOCKER, but the cross-pod latency on
  a hot path is real. The new hostAlias forces the lookup to 127.0.0.1
  so viz hits the same-node nginx :80 which forwards to the same-node fep
  on $fepPort. The nginx Service stays in place as the off-pod fallback.
  Same-pass cleanup: dropped two pre-existing commit-hash and PR-number
  references in the surrounding header tier-gating comment and the
  dask-cuda-worker rationale paragraph (chart ships to airgapped
  customers without git history, so commit-refs age out of context).

CADDY

- templates/caddy/caddy-deployment.yaml gains a readinessProbe mirroring
  the existing livenessProbe in both image branches (exec curl for the
  wrapper image, httpGet for caddy.upstreamImage). Pre-fix Caddy ran
  replicas: 1 + RollingUpdate maxUnavailable: 0% + a default maxSurge,
  so the surge pod IS the only path during every helm upgrade and every
  Caddyfile checksum/config flip. kube-proxy added the surge pod IP to
  the Service endpoints the instant its container went Running, but
  Caddy still had to read+adapt the Caddyfile and bind :80 after that;
  browsers routed in the cold-start window got connection-refused.
  Liveness did not close the gap (gates restarts, not endpoint membership).
  The Caddyfile's `respond /caddy/health/ 200` directive is terminal
  (sits before reverse_proxy nginx:80), so the probe answers ~0ms once
  Caddy is bound and never touches the upstream; failureThreshold: 1 is
  the precise signal (success criterion is binary "Caddy is bound").
  Verified end-to-end against the wrapper image
  graphistry/caddy:v2.50.6-universal: curl -sf returns HTTP 200 in
  ~0.2ms; Caddy version v2.11.2 confirmed. Same fix shape as the
  engine-pod nginx readinessProbe added in the prior commit.

- templates/caddy/caddy-deployment.yaml Caddy container gains
  lifecycle.preStop: ["sh","-c","sleep 10"] and the pod spec gains
  terminationGracePeriodSeconds: 90. Caddy's default grace_period is
  infinite (verified live via `caddy adapt`: reported
  grace_period: (default=infinite)), so it drains in-flight requests
  forever on SIGTERM until something cuts it off; on K8s that "something"
  is the pod's terminationGracePeriodSeconds. The implicit 30s default
  was too short for long-lived viz WebSocket / streamgl streaming
  sessions (the design goal of the engine-consolidation), which routinely
  hold connections open for minutes, so every rolling Caddy restart was
  hard-cutting active viz tabs on the SIGKILL. preStop sleep 10 gives
  kube-proxy time to deprogram the endpoint before SIGTERM
  (kube-proxy iptables-sync interval is 5-10s on most clusters);
  terminationGracePeriodSeconds: 90 gives ~80s of actual drain budget.
  Browsers cut at the 90s ceiling reconnect through the cookie-based
  sticky LB and pin back to the same engine pod on retry, so the cut
  session continues on the same backend across the Caddy swap window
  (sticky cookie HMAC survives the new Caddy pod since both replicas
  share engine.cookieSecret).

OUT OF SCOPE (verified, no change)

- engine.cookieSecret default sentinel: surfaced as a possible forged-
  sticky-cookie concern. The cookie is lb_policy cookie HMAC for sticky
  LB pin-only; auth runs through nexus / JWT and is independent. To
  exploit a forged sticky cookie an attacker must inject it into a
  victim's browser (XSS or MITM), and either capability already grants
  outcomes worse than session pinning. A forced rotation defends a
  window the attacker has already opened. Existing NOTES warning at
  install / upgrade when the placeholder is unchanged stays; no
  template fail() gate added (operator-hostile for first-time installs
  on a non-auth concern).

Verified: helm lint clean; helm template renders cleanly at default
N, at N=108 (last safe), and rejects with the diagnostic at N=109+;
em-dash / internal-impl-name / commit-ref / review-process-language
scans clean across all five edited files and the CHANGELOG entries.
…luster-pass follow-ups

The prior commit (engine-pod resolver-path port plan + caddy zero-downtime
gates) shipped a whitespace-trim bug in the streamgl-gpu list-server port-
collision guard that swallowed the `- name: streamgl-gpu` header into the
preceding comment divider and corrupted streamgl-sessions via YAML last-
wins merge: viz/full-tier deploys rendered 5 containers instead of 6 with
streamgl-gpu absent and streamgl-sessions running a malformed gpu-router.
helm template and helm lint both reported success because the YAML is
technically valid (a comment plus a key-merge), and the prior pass's
verification grepped for the new fep port number instead of counting
containers. This commit hot-fixes that BLOCKER, plus an in-scope storage
race the same review pass surfaced, plus the CLUSTER.md docs that drifted
during the engine port-plan move. One commit because the hot-fix and the
storage fix touch the same file; CHANGELOG.md carries one entry per item
with full failure-mode and verification detail.

CHART HOTFIX

- templates/engine/engine-daemonset.yaml: the inner `{{- end -}}` closing
  the streamgl-gpu port-collision guard had a trailing `-}}` that stripped
  the newline before `- name: streamgl-gpu`. Combined with the outer viz
  tier-gate already stripping the newline after the comment divider, the
  divider line and the container header collapsed onto one line so YAML
  read the entire streamgl-gpu container definition as part of a comment.
  Drop the trailing dash on the inner end. Verified by container count
  rather than port grep: viz/full render 6 main containers, analytics
  renders 3, streamgl-gpu has its own image / PORT=8090 / probe at every
  tier it should render, streamgl-sessions has its own image at every
  tier it should render. The port-collision guard itself still fires:
  passes STREAMGL_NUM_WORKERS=108 (last safe under fep:8200, list-server
  at 8199), fails 109/120/200 with the full diagnostic.
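A header-presence check in the spirit of "count containers, not ports" can be sketched as below. It is an assumption-laden illustration: `missing_containers` is a hypothetical helper, it expects the rendered manifest in a file, and it would also match any other list item in that file that happens to share a container's name.

```shell
# Sketch: report expected containers whose "- name:" header is missing.
missing_containers() {
  # $1 = rendered manifest file; remaining args = expected container names.
  # A header swallowed into a comment line starts with "#", so it does not
  # match the anchored pattern and the name is reported as missing.
  manifest=$1
  shift
  for name in "$@"; do
    grep -Eq "^[[:space:]]*- name: ${name}[[:space:]]*\$" "$manifest" || echo "$name"
  done
}

# Usage, assuming the render was saved first:
#   helm template <release> <chart> > rendered.yaml
#   missing_containers rendered.yaml streamgl-gpu streamgl-sessions
```

Empty output means every expected header sits on its own line; any printed name points at a container that did not render as a list item.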

ENGINE POD STORAGE

- templates/engine/engine-daemonset.yaml viz static UI bundle moves from
  a subPath on the shared `data-mount` PVC (`static/viz-build`) to a
  per-pod `viz-build-cache` emptyDir. The streamgl-viz image carries the
  bundle (~31 MiB); the engine-copy-viz-static init container copies it
  into the emptyDir; the sibling nginx container mounts the same emptyDir
  at /opt/graphistry/apps/core/viz/build and serves it. Pre-fix every
  engine pod's init container `cp -r`'d byte-identical content into the
  same shared subPath on the RWX PVC, so multi-node deployments raced on
  initial boot (N pods writing the same files concurrently) and on every
  rolling DaemonSet upgrade (one new pod rewriting the bundle while old
  pods on other nodes still read it from the same files). `cp -r` is
  non-atomic per file (truncate + rewrite), so a concurrent nginx reader
  could observe partial / hash-mismatched chunks until convergence;
  self-heals once writers finish with identical bytes, but the window is
  operator-visible on a multi-node viz load. Per-pod emptyDir removes
  cross-pod read/write entirely: only this pod's init writes it, only
  this pod's nginx reads it. Per-pod disk overhead is ~31 MiB; on a
  10-node DaemonSet that is ~310 MiB total across the cluster, negligible.
  Mirrors the existing nexus-proxy-deployment.yaml `viz-build-empty`
  emptyDir pattern at platform tier (which renders empty because no
  streamgl-viz exists at that tier; the analytics+ engine pod just
  populates the equivalent emptyDir from the streamgl-viz image's bundle).
  Verified at viz/full tier: nginx mounts viz-build-cache, init copies
  into it. Verified at analytics: init container absent, nginx viz-build
  mount absent. Verified: zero stale `static/viz-build` references in
  the rendered output at any tier.

DOC DRIFT FROM THE ENGINE PORT-PLAN MOVE

- charts/graphistry-helm/CLUSTER.md `forge-etl-python` intra-pod port
  references updated from `:8100` to `:8200` in two places (step-by-step
  example and "every cross-container call in the pod" table), matching
  the chart's `$fepPort` move in the prior commit.

- templates/NOTES.txt GPU HEALTH CHECKS cudfhealth curl target updated
  from `localhost:8080/cudfhealth` to `localhost:8200/cudfhealth`. Pre-fix
  the curl hit nginx's intra-pod dispatcher on :8080 with no Host header,
  returning nginx's default-handling response rather than fep's
  `/cudfhealth` payload. Post-fix the curl hits fep directly via its bind port.

CLUSTER.MD DOC PASS

- charts/graphistry-helm/CLUSTER.md NFS recipe gains a "Why
  `no_root_squash`" rationale paragraph explaining the engine-pod root-UID
  + EACCES failure mode under the NFS server's default `root_squash`, plus
  the one-line alternative for operators whose security policy forbids
  `no_root_squash` (explicit `securityContext.runAsUser` + `fsGroup` plus
  re-owned export; chart doesn't ship that path today).

- charts/graphistry-helm/CLUSTER.md Variant A (single-pool) gains a
  trade-off paragraph naming the CPU-only singletons that end up on
  GPU-labelled nodes under this configuration and pointing at Variant B
  for operators with GPU-capacity constraints.

- charts/graphistry-helm/CLUSTER.md engine.cookieSecret "Setting it"
  rewrites the operator-provisioned-Secret bullet to clarify the chart
  does not create the Secret, and shows the full `kubectl create secret
  generic ...` + `helm upgrade --set engine.cookieSecret=$(...)` pattern
  with the Secret name marked illustrative.

- charts/graphistry-helm/CLUSTER.md Caddy section gains a new
  "Availability posture: single-replica trade-off and HA recipe"
  subsection between Rotation behaviour and Why both pieces are necessary
  together. Two tables enumerate what the planned path is protected by
  (checksum/config + RollingUpdate maxUnavailable: 0% + readinessProbe
  + preStop + grace) and what the unplanned path looks like today
  (node-eviction-timeout 5min + reschedule + boot before a replacement
  Caddy pod stands up; cluster-wide nodeSelector outage blocks the
  schedule entirely). Explains why multi-replica Caddy is safe in the
  chart's design (cookie HMAC flows through engine.cookieSecret -> the
  Caddyfile ConfigMap -> every Caddy pod; lb_policy cookie is HMAC-
  deterministic, no shared in-memory state). Lists the four operator-
  overlay knobs for HA (replicas bump, topologySpreadConstraints with
  kubernetes.io/hostname, PodDisruptionBudget v1 minAvailable: 1,
  Service externalTrafficPolicy trade-off note), with honest disclosure
  that the chart does not currently surface all of these as Helm values
  so operators wanting HA today either patch the template, fork the
  chart, or apply a Helm post-renderer. Chart's `replicas: 1` default
  is unchanged in this commit.
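
The "no shared in-memory state" argument for multi-replica Caddy can be illustrated with a minimal sketch. It assumes the sticky cookie is an HMAC-SHA256 of the upstream address keyed by the shared secret; the function names, pod addresses, and secret below are hypothetical and are not taken from the chart.

```python
import hashlib
import hmac
from typing import List, Optional


def cookie_for(secret: str, upstream: str) -> str:
    """Derive the sticky-cookie value for one upstream (assumed HMAC-SHA256)."""
    return hmac.new(secret.encode(), upstream.encode(), hashlib.sha256).hexdigest()


def pick_upstream(secret: str, cookie: str, upstreams: List[str]) -> Optional[str]:
    """Return the upstream whose derived value equals the cookie, else None."""
    for upstream in upstreams:
        if hmac.compare_digest(cookie_for(secret, upstream), cookie):
            return upstream
    return None  # unknown cookie: a real proxy would apply its fallback policy


# Hypothetical engine pod addresses and secret, for illustration only.
UPSTREAMS = ["10.0.0.11:8080", "10.0.0.12:8080"]
SECRET = "example-cookie-secret"

issued_by_replica_a = cookie_for(SECRET, UPSTREAMS[1])
# Replica B never saw the first request; it only shares the secret.
resolved_by_replica_b = pick_upstream(SECRET, issued_by_replica_a, UPSTREAMS)
```

Under that assumption any replica holding the same secret resolves the cookie to the same pod, which is why the replica count can grow without a shared session store.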

Verified: helm lint clean; engine DaemonSet render at viz/analytics/full
tiers shows the correct main-container set (6/3/6) with each container
holding its own image/PORT/probe; port-collision guard fires at the
documented N ceiling; no stale `:8100` or `static/viz-build` in the
rendered output; no em-dashes added to any edited file; no internal
review-process language or commit-hash refs in any edited file or
the CHANGELOG entries.
…ling pass

The telemetry subchart's `otel-collector` and `node-exporter` DaemonSets
both defaulted their per-workload `nodeSelector` to `{}` and chained
through to `.Values.global.nodeSelector` via Helm's `| default` falsy
treatment of empty maps. Under the parent chart's documented split-pool
topology (`engine.nodeSelector: graphistry.io/role=gpu`,
`global.nodeSelector: graphistry.io/role=cpu`) this silently pinned
both DaemonSets to CPU nodes only.
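
The fall-through can be modelled in a few lines. This is a sketch of the template semantics only, assuming `default` treats an empty map as unset; `helm_default` is a hypothetical stand-in, not a Helm API.

```python
def helm_default(fallback, value):
    """Model of `value | default fallback`: any empty value yields the fallback."""
    return value if value else fallback


global_selector = {"graphistry.io/role": "cpu"}  # split-pool parent value
per_workload_selector = {}                       # subchart default

# With the chain: the empty map is falsy, so the global selector wins.
chained = helm_default(global_selector, per_workload_selector)

# Without the chain: the empty map renders literally.
unchained = per_workload_selector
```

An explicitly set per-workload selector is non-empty, so it survives in both variants.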

Combined with the collector Service's `internalTrafficPolicy: Local`
(intentional, keeps OTLP gRPC on the same node), engine pods on GPU
nodes pushed OTLP to a Service with zero local endpoints there, and
every span / metric / log was silently dropped with no render-time or
app-side error. node-exporter had the parallel bug for OS metrics:
GPU-node CPU / memory / disk / network were all dropped. Both are reproducible
via `helm template --set 'global.nodeSelector.graphistry\.io/role=cpu'
--set 'engine.nodeSelector.graphistry\.io/role=gpu' --set
global.ENABLE_OPEN_TELEMETRY=true` on HEAD before this commit; the
collector lands on CPU nodes, node-exporter lands on CPU nodes, engine
runs on GPU nodes, and OTLP / OS-metric data silently goes nowhere.
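
A small model of the failure, assuming a Local-policy Service only ever hands a producer an endpoint on its own node. The node names are hypothetical.

```python
from typing import Iterable, List


def nodes_without_local_endpoint(
    producer_nodes: Iterable[str], collector_nodes: Iterable[str]
) -> List[str]:
    """Nodes whose producers have no same-node collector endpoint to reach."""
    return sorted(set(producer_nodes) - set(collector_nodes))


cpu_nodes = ["cpu-1", "cpu-2"]  # hypothetical node names
gpu_nodes = ["gpu-1", "gpu-2"]

# Before: the collector inherits the CPU selector, engine pods run on GPU nodes.
dropped_before = nodes_without_local_endpoint(gpu_nodes, cpu_nodes)

# After: the collector schedules on every node the tolerations admit.
dropped_after = nodes_without_local_endpoint(gpu_nodes, cpu_nodes + gpu_nodes)
```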

CHART FIX

- charts/telemetry/templates/otel-collector.yaml and
  charts/telemetry/templates/node-exporter.yaml: both DaemonSet
  `nodeSelector` template directives drop the `| default
  .Values.global.nodeSelector` fallback. Empty per-workload selector
  now renders literally as `{}`, scheduling both DaemonSets on every
  node the tolerations admit. Operator override is preserved: setting
  `telemetry.openTelemetryCollector.nodeSelector` or
  `telemetry.nodeExporter.nodeSelector` to a non-empty map still
  constrains either DaemonSet directly. dcgm-exporter uses the same
  template pattern but its subchart values ship a non-empty default
  (`nvidia.com/gpu.present: "true"`) that never triggered the chain;
  unchanged. Parent chart per-workload Deployments (caddy, nexus,
  redis, dask-scheduler, pivot, notebook, gak-{public,private})
  still chain to `global.nodeSelector` as before; the no-chain
  treatment is targeted to the two telemetry DaemonSets where per-node
  co-scheduling is load-bearing for the Local-policy Service.

- charts/telemetry/values.yaml `openTelemetryCollector.tolerations`
  default changes from `[]` to `[{key: nvidia.com/gpu, operator:
  Equal, value: "true", effect: NoSchedule}, {operator: Exists}]`,
  mirroring the shape `nodeExporter.tolerations` already shipped.
  Without this, even with the nodeSelector chain-fix the collector
  would still fail to schedule on GPU nodes commonly tainted
  `nvidia.com/gpu=true:NoSchedule` by the NVIDIA GPU Operator (or
  managed GPU node-pools on GKE / EKS / AKS); the `Exists` broad
  fallback closes the failure mode for every parent-chart topology
  the chart advertises (single-pool, split-pool, dedicated-tenant).
  `concat` with `.Values.global.tolerations` preserved so operator-
  supplied cluster-wide tolerations still stack on top. Subchart
  values comment blocks on the affected fields rewritten in detail.

- charts/graphistry-helm/values.yaml `global.nodeSelector` comment
  block gains a "Telemetry exception" paragraph immediately under the
  existing multi-pool guidance, explicitly naming the two DaemonSets
  that deliberately do NOT inherit this knob and noting that
  dcgm-exporter is independent (pins to `nvidia.com/gpu.present=true`
  regardless). This keeps operators who follow the recommended
  split-pool layout from being surprised by the no-chain behavior the
  next time they grep the chart for `nodeSelector`.
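
Why the collector needs the broad toleration in addition to the selector fix can be checked with a simplified taint matcher. This is a sketch of the matching rules only (effect, key, operator, value), not the scheduler's implementation; the tenant taint is hypothetical.

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Simplified Kubernetes toleration match against a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists matches every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def admits(tolerations: List[Dict[str, str]], taints: List[Dict[str, str]]) -> bool:
    """A node admits the pod when every taint is tolerated by some toleration."""
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in taints)


GPU_TAINT = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
TENANT_TAINT = {"key": "dedicated", "value": "tenant-a", "effect": "NoSchedule"}  # hypothetical

OLD_DEFAULT: List[Dict[str, str]] = []
NEW_DEFAULT = [
    {"key": "nvidia.com/gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"},
    {"operator": "Exists"},
]
```

With the old empty default, `admits(OLD_DEFAULT, [GPU_TAINT])` is false, so the selector fix alone would not have placed the collector on tainted GPU nodes.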

VERIFIED

- Split-pool render (`--set
  'global.nodeSelector.graphistry\.io/role=cpu' --set
  'engine.nodeSelector.graphistry\.io/role=gpu' --set
  global.ENABLE_OPEN_TELEMETRY=true`) now emits otel-collector
  `nodeSelector: {}` + the broad tolerations, node-exporter
  `nodeSelector: {}` + its existing broad tolerations. Engine still
  pinned to GPU; dcgm-exporter still pinned to
  `nvidia.com/gpu.present`; prometheus / jaeger / grafana / parent-
  chart Deployments still pinned to `global=cpu` as documented.

- Single-pool default render (`--set global.ENABLE_OPEN_TELEMETRY=
  true`) still emits otel-collector + node-exporter `nodeSelector:
  {}`.

- Operator override path verified: `--set
  telemetry.openTelemetryCollector.nodeSelector.dedicated=graphistry-
  only` renders the explicit selector instead of `{}`.

- `helm lint` clean. 58 docs parse cleanly via PyYAML's
  `safe_load_all`. Em-dash, internal-impl-name, and review-process-
  language scans clean across all edited files.

SUBCHART README

- charts/telemetry/README.md cloud-mode rendering bullet: replaced
  the stale "ConfigMaps / Ingresses do not render in cloud mode"
  claim with "ConfigMaps / Services / PVCs do not render" + the
  one-line "subchart ships no Ingress templates of its own in either
  mode" qualifier. Forward-points at the new "Accessing the UIs"
  section.

- charts/telemetry/README.md gains "Accessing the UIs" between the
  cloud-mode rendering note and "Cloud credentials". It documents the
  parent chart's three Caddy reverse-proxy handles (`/grafana/`,
  `/prometheus/`, `/jaeger/`) with backend Service / port; an explicit
  callout that the in-cluster Service ports listed in the ASCII diagram
  (`:3000`, `:9090`, `:16686`) are NOT operator-facing endpoints (the
  diagram label was misleading some operators toward port-forward);
  the three gating conditions (`global.tier >= analytics`,
  `global.ENABLE_OPEN_TELEMETRY: true`,
  `telemetry.OTEL_CLOUD_MODE: false`); the backend sub-path
  configuration, so operators changing prefixes change all three
  sides in lock-step; and a fallback `kubectl port-forward` recipe
  for when Caddy is unavailable.

- charts/telemetry/README.md Scheduling section rewritten. Opens
  with a corrected note that the inheritance-from-global
  pattern is per-workload, not blanket. New "Engine↔collector
  co-scheduling invariant" subsection names the Local-policy-Service
  requirement, lists the producer / scrape-target set, and explains
  why the collector's no-chain + broad-tolerations defaults close
  the silent-drop in every documented parent-chart topology. Per-
  workload subsections for `nodeExporter` (no-chain rationale),
  `dcgmExporter` (always pins to GPU label, unchanged), and the
  three single-replica Deployments. Self-hosted and cloud-mode
  ASCII diagrams updated to clarify `dcgm on GPU nodes only;
  node-exporter every node` instead of the false "DaemonSets, one
  pod per node" generalisation. New troubleshooting subsection
  "Traces or metrics missing from EVERY engine node (split-pool)"
  with the verification commands and overlay-debug recipe for
  operators who layered explicit `nodeSelector` overrides on top of
  the new defaults.
aucahuasi added 23 commits June 5, 2026 11:30
extraEnvFrom (the operator envFrom escape hatch for bulk-importing a custom
Secret/ConfigMap) was empty on every pod and redundant. Each pod that has
sensitive auth already has a dedicated channel: gak via gak-secret, postgres
via postgresEnv, stripe via stripe-secret, crypto via the jwtSecret envFrom.
pivot and the compute pods (streamgl-*, dask-cuda, notebook, nginx) have no
operator secrets at all.

Removed from gak (public + private), pivot, notebook, and the engine
containers (nginx, forge-etl, dask-cuda; streamgl-viz/sessions/gpu keep their
crypto jwtSecret envFrom). nexus keeps extraEnvFrom for now; it is handled with
the upcoming nexusSecret refactor.

Behavior preserving: every removed block was empty, so the chart renders
byte-identical. Dry-run clean; lint passes.
…block

The nexus-only secret moves out of the global crypto block and splits by
enforcement:
- crypto.jwtSecret stays global (DJANGO_SECRET_KEY is shared by nexus + the
  streamgl engine containers).
- nexus.requiredSecret (was crypto.nexusSecret): the three signing/encryption
  keys nexus has no defaults for and crashes without (GRAPHISTRY_NEXUS_*).
  Chart-asserted, loaded via envFrom.
- nexus.optionalSecret (new): one optional envFrom Secret for every
  feature-gated credential nexus reads only when a feature is enabled (LDAP,
  email, OAuth, Slack, Stripe). secretRef optional: true, so it may be absent
  on minimal installs; not asserted; skipped when the name is empty.

Stripe folds into optionalSecret: the 14 STRIPE_* / DJSTRIPE_WEBHOOK_SECRET
secretKeyRef entries (previously a hardcoded Secret with per-var key remapping
like test-public-key) are dropped; operators add those keys to optionalSecret
by env-var name, auto-imported via envFrom. The non-secret
DJSTRIPE_WEBHOOK_VALIDATION moves to nexus.env. The redundant nexus
extraEnvFrom is removed (optionalSecret is now the single operator-secret
channel; the placeholder AUTH_LDAP_BIND_PASSWORD that would shadow it is gone).

Verified: the 3 required keys + DJANGO_SECRET_KEY still load; assertCrypto
fires on empty nexus.requiredSecret; optionalSecret is skipped when empty;
full chart dry-run clean across tiers; lint passes.

NOTES.txt / README still document the old crypto.nexusSecret and need updating
to the required/optional split; done in the follow-up docs pass.
The nexus template's extraEnvFrom was replaced by nexus.optionalSecret in the
secret split, leaving nexus.extraEnvFrom in values.yaml with no consumer. Drop
the dead key + its comment. With this, extraEnvFrom is gone from the chart
entirely (operator secrets flow through the dedicated per-pod Secrets:
gak-secret, nexus.optionalSecret, postgres, crypto).
The per-pod env-config comments still described the retired layered model
(the chart-wide `env:` list, extraEnv, extraEnvFrom, the "3-layer pattern").
Rewrite them to match what the chart actually does now:
- Each pod's `env` is an operator-patchable MAP, deep-merged over the env the
  chart injects for that pod (the chart picks exactly the vars that pod's app
  and entrypoint read); operator wins per key.
- nexus: the env-config block now documents nexus.env (plain) plus
  requiredSecret / optionalSecret (secret-sourced via envFrom), and the
  env-shadows-envFrom precedence rule.
- Fix stale pointers: STREAMGL_NUM_WORKERS / CUDA pass through the pod's env
  map (not the deleted global list); JWT_AUTH_COOKIE is chart-managed;
  SMTP creds go in nexus.optionalSecret; DCW pinning vs the pod's
  CUDA_VISIBLE_DEVICES.

Comments only; the rendered chart is unchanged. Dry-run clean; lint passes.
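
The per-key precedence described above can be sketched as follows. This assumes a flat name-to-value map and a name-sorted render order; the values and the `render_env` helper are hypothetical, and only the variable names come from the comments being rewritten.

```python
from typing import Dict, List


def render_env(chart_env: Dict[str, str], operator_env: Dict[str, str]) -> List[Dict[str, str]]:
    """Overlay the operator map on the chart map; the operator wins per key."""
    merged = {**chart_env, **operator_env}
    # Sorting by name is an assumption made here to keep the output stable.
    return [{"name": name, "value": str(value)} for name, value in sorted(merged.items())]


chart_defaults = {"STREAMGL_NUM_WORKERS": "1", "CUDA_VISIBLE_DEVICES": "0"}  # hypothetical values
operator_patch = {"STREAMGL_NUM_WORKERS": "4"}

rendered = render_env(chart_defaults, operator_patch)
```

Patching one key leaves every other chart-injected variable in place, which is the behavior the retired list-based model could not offer.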
… secrets; expand abbreviations

Aligns operator-facing docs with the chart's current shape after the nexus
secret split (crypto.nexusSecret became nexus.requiredSecret +
nexus.optionalSecret) and the env-map refactor (global .Values.env and
per-container extraEnv / extraEnvFrom removed).

chart README.md: the "Create Crypto Secrets" section is split into three
subsections under "Create Graphistry Secrets" (JWT Secret; Nexus Required
Secret, with Rotating keys nested; Nexus Optional Secret), with a shared
CSPRNG recipe block in the lead-in. TL;DR reference updated.

10-MINUTES-TO-K8S.md: "Bootstrap Crypto Secrets" renamed to "Bootstrap
Required Secrets"; the two required Secret names link to the new README
anchors; "atomically" softened to "in a single command".

NOTES.txt: crypto.nexusSecret became nexus.requiredSecret; README section
reference updated.

_helpers.tpl: the assertCrypto fail message dropped the dangling reference
to the deleted crypto-bootstrap/create-secrets.sh script and now points at
the README "Create Graphistry Secrets" section.

values.yaml: per-pod env-doc comments rewritten to the pickEnv map model;
streamgl container names spelled out in full.

TROUBLESHOOTING.md: section 13 env priority corrected to the current
operator channel (per-container env map only); section 14 dask shorthand
expanded to dask-cuda-worker.

CLUSTER.md, examples, postgres-cluster README: external-Postgres injection
no longer references the deleted .Values.env; tier comments and prose use
full canonical service names (forge-etl-python, dask-cuda-worker,
graph-app-kit, the streamgl containers) instead of fep / dask / gak
shorthand.

Docs only; chart render unchanged. Dry-run clean across tiers; lint passes.
0.4.4 is unreleased (no git tag; [Development] empty; actively edited). Its
notes still described the intermediate crypto.nexusSecret / extraEnv /
extraEnvFrom / .Values.env model that was reworked within the same cycle and
does not ship. Updated 11 entries in place to the as-shipped state:

- crypto.nexusSecret became nexus.requiredSecret + nexus.optionalSecret;
  "Create Crypto Secrets" became "Create Graphistry Secrets" (Added entries,
  Upgrade notes, assertCrypto, envFrom wiring).
- per-component env / extraEnv / extraEnvFrom became the per-container env
  map; the dask-cuda-worker leak fix no longer references extraEnv or the
  3-layer pattern; dask-cuda-worker added to the env-map workload list.
- incidental .Values.env override-channel mentions corrected: streamgl-gpu
  and the port guard read .Values.streamglgpu.env; nginx reads
  .Values.nginx.env; grafana overrides via telemetry.grafana.env.

The only surviving .Values.env references are the historical ones in the
stale-comment-fix entry, describing the list that was removed. Docs only.
New "## Environment Variable Configuration" section (before "Install
Graphistry") documents the chart's env convention in one place: every
workload exposes a plain `env` map that Helm deep-merges over the chart's
per-workload defaults (patch one key, the rest stay). Tables the maps by
workload: nexus / pivot / notebook / graphAppKit, the six engine-pod
container maps, the shared `otelEnv`, and all six telemetry subchart maps
(openTelemetryCollector, prometheus, jaeger, grafana, dcgmExporter,
nodeExporter).

Links the Graphistry admin docs for the env-var catalog (what each variable
does), points to "Create Graphistry Secrets" for sensitive values (env maps
are non-secret only), and to TROUBLESHOOTING section 13 for the chart-owned
keys that must use dedicated global.* channels. Docs only; render unchanged.
…nippets

Three install snippets (postgres-cluster README, examples README, chart
README) used `helm upgrade -i pg-cluster ...` while 10-MINUTES, TROUBLESHOOTING,
and the CHANGELOG upgrade notes used `postgres-cluster`. postgres-cluster
README even used both. Standardize on `postgres-cluster` (matches the chart
name and the majority of docs) so operators get the same release name
regardless of which page they follow.
gk.sh's gk_install_pgc ran `helm upgrade -i pg-cluster`, the lone pg-cluster
holdout the release-name cleanup missed (introduced in 8e096ae). Align it with
the documented `postgres-cluster` release name so the script and the docs agree
and teardown is `helm uninstall postgres-cluster`.
…led Option A

The PV-cleanup docs offered an inverted "everything except the app PVs" command
mislabeled "drop postgres only" (examples/README literally called it "Option A:
drop postgres-cluster PVs only"); its `test(app) | not` filter actually deleted
postgres AND telemetry, so an operator dropping Postgres silently wiped
Grafana/Prometheus/Jaeger too.

Replace with a per-group helper, no inversion:
- chart README: `del_pv <namespace> <regex>` (deletes only Released PVs whose
  claim matches) plus a commented, labelled menu, one line per data group
  (postgres, telemetry, graphistry app data, all).
- examples/README: points at the chart README for the helper and shows the
  menu using $GRAPHISTRY_K8S_NS, instead of re-defining it.
- gk.sh: `gk_del_pv <pg|telemetry|app|all>` plus gk_del_pv_pg/_telemetry/_app/
  _all wrappers. It tables the matching Released PVs (name, capacity, access,
  status, claim, storageclass) and confirms before deleting; only Released PVs
  are ever touched, never a Bound volume.

Validated against a live cluster: per-group listing, confirm/abort, and delete
all work; bad arg returns the usage error with no kubectl call.
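
The difference between the inverted filter and the per-group helper can be shown on a toy inventory. Everything below is a sketch: the PV names, claim names, and `APP_PATTERN` are hypothetical, and the functions only model the selection logic, not the kubectl calls.

```python
import re
from typing import Dict, List

# Hypothetical PV inventory; names and claims are illustrative only.
PVS = [
    {"name": "pv-pg", "status": "Released", "claim": "graphistry/postgres-cluster-pgdata"},
    {"name": "pv-pg-live", "status": "Bound", "claim": "graphistry/postgres-cluster-repo"},
    {"name": "pv-grafana", "status": "Released", "claim": "graphistry/grafana-data"},
    {"name": "pv-app", "status": "Released", "claim": "graphistry/gak-public"},
]

APP_PATTERN = r"gak-|data-mount"  # assumed stand-in for the app-PV regex


def inverted_everything_but_app(pvs: List[Dict[str, str]]) -> List[str]:
    """The retired approach: select whatever does NOT look like app data."""
    return [
        pv["name"]
        for pv in pvs
        if pv["status"] == "Released" and not re.search(APP_PATTERN, pv["claim"])
    ]


def del_pv_candidates(pvs: List[Dict[str, str]], namespace: str, pattern: str) -> List[str]:
    """Per-group selection: Released PVs in `namespace` whose claim matches."""
    picked = []
    for pv in pvs:
        ns, _, claim = pv["claim"].partition("/")
        if pv["status"] == "Released" and ns == namespace and re.search(pattern, claim):
            picked.append(pv["name"])
    return picked
```

On this inventory the inverted filter selects both the postgres and the grafana volume, while `del_pv_candidates(PVS, "graphistry", "postgres")` selects only the Released postgres volume and never the Bound one.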
… list

The rendered install NOTES still said `helm upgrade -i pg-cluster` in
PREREQUISITES (the release-name cleanup and earlier sweeps were .md-only and
skipped this .txt) and abbreviated the tier list as `nginx+fep+dask` and
`viz + gak + ...`. Use the documented `postgres-cluster` release name and the
full service names (forge-etl-python, dask-cuda-worker, graph-app-kit). The
gak-public / gak-private PVC names are real resource names and unchanged.
…tainers

The earlier POSTGRES_* env cleanup verified app code but missed the image
entrypoints. streamgl-viz, streamgl-gpu, forge-etl-python, and pivot run a
Postgres-readiness wait in their entrypoint that connects with
POSTGRES_HOST/PORT/DB/USER/PASSWORD before starting the server. With those
vars stripped, the connect failed and the entrypoint looped
"Waiting for PostgreSQL to become available..." forever; the servers never
started, so /pivot and the streamgl endpoints returned 502 Bad Gateway.

Re-add graphistry.postgresEnv to those four containers, matching nexus and
streamgl-sessions, which kept it and stayed healthy. dask-cuda-worker (its
GPU-worker entrypoint talks to dask-scheduler, not Postgres) and nginx do not
run that wait and are correctly left without it.

postgres-cluster README: the "every Graphistry pod gets postgres connection
details" line listed pods that do not connect (dask-scheduler, dask-cuda-worker,
caddy, nginx, redis, notebook). Corrected to the six that do, split by reason
(nexus and streamgl-sessions for data; viz/gpu/forge/pivot for the readiness
wait).

Verified against a live cluster (viz/gpu/forge/pivot were stuck, sessions was
serving) and by render: POSTGRES_HOST present on exactly those six; lint clean.
… reuse caches

Switch the engine upstream from per-browser cookie stickiness to per-session
affinity for the viz session channels. This is the change that makes real-time
collaboration work on a multi-pod engine and lets a session's clients share one
pod's warm caches.

Problem. A viz session's working state lives in one engine pod's memory, spread
across its containers: GPU graph buffers and layout objects in streamgl-gpu,
per-session graph state and the session registry in streamgl-viz and
streamgl-sessions, and in-progress ETL and dask results in forge-etl-python.
None of it is shared across pods. The previous Caddyfile pinned by the
graphistry_sticky cookie, which is per browser: two browsers (or two users)
opening the same session URL received different cookies and landed on different
engine pods. The result was a duplicate, independent copy of the session per
browser: no collaboration (a node move in one window never reached the other),
and the GPU, CPU, and ETL caches were built twice.

Fix. Route the session channels by the session value using Caddy's `query` load
balancing policy (rendezvous, highest-random-weight hashing). Three handles
replace the single cookie catch-all (templates/caddy/_helpers.tpl):

  @StreamGL path /streamgl/*                            -> lb_policy query id
  @viz      path /graph/socket.io* /graph/graph.html*   -> lb_policy query session
  (catch-all)                                           -> lb_policy cookie graphistry_sticky

Why this is correct and safe:
- The streamgl channels carry the session as `id=<sessionId>` and the viz channel
  as `session=<sessionId>`. Caddy's `query` policy hashes the value only (it is
  key-agnostic), so `query id` and `query session` resolve the same session to the
  same pod. No application change was needed.
- Rendezvous hashing is computed over each pod's address, not its pool index, so
  the two independent reverse_proxy blocks agree on the same pod, and adding or
  removing a GPU node moves only about 1/N of sessions instead of reshuffling all
  of them.
- /graph/graph.html is pinned too, so the page that creates the per-session viz
  state lands on the same pod the WebSocket and streamgl calls will hit (no
  split-brain).
- Non-session traffic (landing page, nexus API, static assets) is stateless and
  stays on the per-browser cookie, so it keeps load-balancing freely.
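
The two properties claimed above (key-agnostic hashing, and minimal movement when the pool changes) can be checked with a generic rendezvous-hashing sketch. SHA-256 is a stand-in here, not Caddy's hash function, and the pod addresses and session ids are hypothetical.

```python
import hashlib
from typing import Dict, List


def score(pod: str, value: str) -> int:
    """Weight of (pod, value); SHA-256 here is a stand-in, not Caddy's hash."""
    digest = hashlib.sha256(f"{pod}|{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick(pods: List[str], value: str) -> str:
    """Rendezvous (HRW) choice: the pod with the highest score for this value."""
    return max(pods, key=lambda pod: score(pod, value))


def route(pods: List[str], query: Dict[str, str], key: str) -> str:
    """Hash only the value stored under `key`; the key name is never hashed."""
    return pick(pods, query[key])


PODS = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]  # hypothetical
SESSIONS = [f"session-{n}" for n in range(300)]

# /streamgl/* carries id=<sessionId>; the viz channel carries session=<sessionId>.
# The second block sees the pool in a different order, yet agrees on the pod.
same_pod = all(
    route(PODS, {"id": s}, "id") == route(list(reversed(PODS)), {"session": s}, "session")
    for s in SESSIONS
)

# Remove one pod: only sessions that lived on it are reassigned.
removed = PODS[0]
survivors = PODS[1:]
moved = [s for s in SESSIONS if pick(PODS, s) != pick(survivors, s)]
only_removed_pod_moved = all(pick(PODS, s) == removed for s in moved)
```

Because each pod's score depends only on its own address and the session value, dropping a pod cannot change the relative ranking of the remaining pods, so sessions on surviving pods stay put.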

The shared `dynamic a engine-headless` discovery block is factored into a new
graphistry.caddy.engineDiscovery helper rather than copy-pasted across the three
handles.

Verification:
- `caddy validate` on caddy:2.11-alpine reports Valid configuration.
- Live two-browser test on a 2-node engine DaemonSet: both browsers converge on
  one pod (the session shows on a single gpu-worker, not duplicated); animation
  play, node moves, filter exclusions, and histogram color encodings all
  synchronize between the windows.

Docs. The README "Caddy configuration" section gains a "Session affinity and
collaboration" subsection (the cache-reuse and collaboration rationale, the
rendezvous-hashing properties, the route map, the collaboration flow, operator
notes). CLUSTER.md, values.yaml, and the Tanzu example are reframed from
per-browser cookie stickiness to per-session routing, including the corrected
cookie-rotation behavior: rotating engine.cookieSecret no longer orphans live
sessions, since they route by value, not by the cookie.
The engine is a DaemonSet (one pod per GPU node) and Caddy session-routes by
value, so a single curl through the ingress only shows the pod a request hashed
to. These helpers exec into every engine pod and curl its own localhost
endpoints, so operators see all GPU nodes:
- gk_status_gpu_list / gk_status_gpu_list_full: streamgl-gpu /streamgl/list per
  pod (filtered sessions-per-worker, or full raw JSON).
- gk_status_streamgl: streamgl-viz / -sessions / -gpu health per pod.
- gk_status_forge:    forge-etl-python /health, /cudfhealth, /workerhealth per pod.
- gk_status_engine:   full per-pod roll-up (streamgl + forge-etl-python).
- gk_get_token <user> <password>: print a Graphistry API JWT via the REST auth
  endpoint /api/v2/auth/token/generate.

Containers absent at the current tier report "unreachable". Validated live
against a 2-node engine DaemonSet.
…uting

The root README's Architecture diagram still showed the old per-browser
graphistry_sticky cookie pinning each browser to one engine pod. Update it to
the session-affinity model (Caddy hashes each session to one engine pod, so a
session's clients converge for live collaboration and warm-cache reuse), and
add a one-line pointer to the "Session affinity and collaboration" section for
the full routing model.
…operator safety)

Build out the chart README's "Session affinity and collaboration" section and
align CLUSTER.md and values.yaml.

- Route map table: every browser-facing path, the engine container the in-pod
  nginx dispatches it to (streamgl-gpu, streamgl-viz, forge-etl-python, nexus,
  streamgl-sessions, static), and whether Caddy routes by session value or the
  per-browser cookie. Plus the intra-pod localhost path that session-triggered
  ETL takes, which never reaches Caddy.

- New sessions and load balancing: rendezvous hashing is stateless, so a
  brand-new session id maps to a pod with no prior state; because session ids
  are random, the hash spreads sessions uniformly across pods, which is how viz
  sessions load-balance across GPU nodes. No-session requests fall to the query
  policy's random fallback.

- Two-strategy table: session channels use `query` (HRW over the session id);
  everything else uses `cookie` (round-robin places a new browser via
  caddy.lb.fallback, the HMAC cookie then pins it). Clarifies the cookie
  remembers a round-robin choice rather than hashing the browser onto a pod.

- Operator safety: the session-channel routing is hardcoded in the chart and is
  not a value, so it cannot be reconfigured or broken. caddy.lb.fallback governs
  only the non-session cookie catch-all. Rewrote the stale caddy.lb.fallback
  comment in values.yaml (which still claimed the cookie pins viz sessions) to
  say this plainly, and refreshed the least_conn warning for the new model.

- Link "highest-random-weight" to Caddy's load-balancing documentation, and
  spell out /graph/socket.io and /graph/graph.html instead of "the /graph
  WebSocket and page".
…r wording

- README: the non-session cookie's first-assignment is caddy.lb.fallback
  (round_robin by default; also least_conn, weighted_round_robin, random), not
  round-robin-only. State that only this non-session fallback is configurable;
  the session-channel HRW routing is fixed.

- CLUSTER: a user's next session lands on a pod by rendezvous hashing of its id,
  not Caddy's round-robin (linked to Caddy's load-balancing docs). The dead-pod
  appendix now describes failover for both policies: the query policy re-selects
  the next-best hash, the cookie policy uses caddy.lb.fallback.
Expose Caddy access logging as caddy.accessLog.{enabled,output,format,level},
rendering a site `log` block in the Caddyfile, plus a log_skip for the health
probe.

Why: the docker-compose Caddyfile ships access logging as a commented `log`
block operators uncomment (the hosted SaaS staging runs with it on), but the
chart's generated Caddyfile dropped that capability, so K8s operators had no
supported way to enable it (ConfigMap edits are overwritten on helm upgrade).
This restores parity.

Default enabled=true: behind a cluster ingress (tls.mode=ingress) the ingress
already logs requests, but non-ingress fronts (tls.mode self/external/off, e.g.
Tanzu via an Avi LoadBalancer, or mesh/dev) put Caddy at the edge with no other
request-log layer. On by default gives every topology request visibility; set
caddy.accessLog.enabled=false behind a logging ingress or to cut volume. Note
this is a behavior change on upgrade: existing installs start emitting access
logs unless they set enabled=false.

The K8s readiness/liveness probes hit /caddy/health/ every ~3s and would
otherwise dominate the log, so the accessLog helper emits `log_skip
/caddy/health/` alongside the `log` block. log_skip is ordered before `handle`
in Caddy's directive order, so the probe still answers 200, it just isn't
logged; it's owned by the feature flag, not hardcoded in the health handler.

output/format/level mirror Caddy's own `log` sub-directives (defaults
stdout/json/info; access logs only emit at INFO and ERROR); an empty sub-value
is omitted so Caddy's default applies. Rendered into all three site shapes
(ingress/external, self, off) via a new graphistry.caddy.accessLog helper.
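
The omit-when-empty rule can be sketched as a plain renderer. This models the helper's described behavior only; `render_access_log` is a hypothetical name, and the exact whitespace of the chart's output is an assumption.

```python
from typing import Dict


def render_access_log(cfg: Dict[str, object]) -> str:
    """Render a `log` block plus the health-probe `log_skip`, or nothing."""
    if not cfg.get("enabled"):
        return ""
    lines = ["log {"]
    for key in ("output", "format", "level"):
        value = cfg.get(key, "")
        if value:  # an empty sub-value is omitted so Caddy's default applies
            lines.append(f"\t{key} {value}")
    lines.append("}")
    lines.append("log_skip /caddy/health/")
    return "\n".join(lines)


defaults = {"enabled": True, "output": "stdout", "format": "json", "level": "info"}
no_format = dict(defaults, format="")
disabled = dict(defaults, enabled=False)
```

Tying `log_skip` to the same flag means a disabled access log renders neither directive.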

Verified live on a 2-node cluster: rolled the Caddy pod, confirmed the deployed
Caddyfile carries the log block + log_skip, and the access log shows real
requests (page loads, the bootstrap/302/session flow, /streamgl/* by session
id) with zero /caddy/health/ lines. caddy validate clean across modes; helm
lint clean.
… caddy dedupe

The engine is a DaemonSet (one pod per GPU node) with 10 containers per pod, and
Caddy HRW-pins each viz session to one pod, so the old single-pod/all-container
helpers misled and flooded on multi-node clusters.

- gk_engine: export GRAPHISTRY_K8S_ENGINE_PODS (newline list of ALL engine pods)
  instead of a single head -1 pod; gk_pods shows every pod. Consumers iterate
  `for pod in $GRAPHISTRY_K8S_ENGINE_PODS` or `set --` for the first.
- gk_logs_engine [container ...] / gk_logs_engine_all [container ...]: optional
  container args (aliases gpu/viz/forge/dask/sessions/nginx via _gk_engine_container,
  or full names); no args = all containers. 1 container uses -c; 2+ tail all and
  filter the [pod/container] prefix client-side (kubectl has no multi-container
  select). _all fans out across every engine pod.
- gk_logs_engine_all uses ONE foreground `kubectl logs -l ... --prefix` instead of
  a `kubectl logs -f &` per pod: the background tails sat in their own process
  groups and survived Ctrl-C (orphaned, kept printing). One foreground process
  stops cleanly; --prefix tags each line [pod/container].
- _gk_colorize: awk colorizer tints each line's [pod/container] prefix (only the
  prefix, not the body) by container. TTY/NO_COLOR gated (no-op when redirected);
  GK_LOGS_COLOR=always|never and GK_COLOR_* env overrides. Core awk only, so
  portable across gawk/mawk/BSD awk.
- gk_logs_engine (single pod) and gk_logs: --max-log-requests (engine pod has 10
  containers, over kubectl's default 5-stream follow cap).
- gk_caddy: head -1 on CADDY_POD so a rollout's two-pod window doesn't break
  gk_logs_caddy ("more than one slash").

Quick-check live-debug helpers, not a substitute for a log aggregator.
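
The client-side filter for the two-or-more-container case can be sketched as below. The alias table is inferred from the names in this message, and the bracketed prefix shape in the sample lines is an assumption about the `--prefix` output; the real helper is shell and awk.

```python
import re
from typing import Iterable, List, Optional

# Alias table inferred from the container names above; treat it as an assumption.
ALIASES = {
    "gpu": "streamgl-gpu",
    "viz": "streamgl-viz",
    "sessions": "streamgl-sessions",
    "forge": "forge-etl-python",
    "dask": "dask-cuda-worker",
    "nginx": "nginx",
}

PREFIX = re.compile(r"^\[([^\]]+)\]")


def container_of(line: str) -> Optional[str]:
    """Extract the container name: the last path segment of the bracketed prefix."""
    match = PREFIX.match(line)
    return match.group(1).rsplit("/", 1)[-1] if match else None


def select(lines: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Keep only lines whose prefix names one of the wanted containers."""
    names = {ALIASES.get(name, name) for name in wanted}
    return [line for line in lines if container_of(line) in names]


# Hypothetical log lines, for illustration only.
LINES = [
    "[pod/engine-abc/streamgl-gpu] worker ready",
    "[pod/engine-abc/nginx] GET /healthz 200",
    "[pod/engine-xyz/forge-etl-python] upload accepted",
    "unprefixed line",
]
```

Matching on the prefix rather than the body is what keeps a log line that merely mentions another container's name from leaking through.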
…yable

The gk_status_* helpers printed `== pod ==` headers interleaved with JSON/text,
which isn't parseable and loses pod context when you copy one block. Each now
emits a single JSON array, one object per engine pod, with the pod name in `pod`
and the node it runs on (the GPU host) in `host`:

- gk_status_gpu_list      -> [{pod, host, workers:[{name, sessionIDs}]}]
- gk_status_gpu_list_full -> [{pod, host, list:<raw /streamgl/list>}]
- gk_status_streamgl      -> [{pod, host, streamgl-viz, streamgl-sessions, streamgl-gpu}]
- gk_status_forge         -> [{pod, host, health, cudfhealth, workerhealth}]
- gk_status_engine        -> [{pod, host, streamgl:{...}, forge:{...}}]

Each endpoint response is embedded as JSON when parseable, else as a string (so
forge's plain-text `cudfhealth: ok` stays a string); an unreachable container is
marked "unreachable" and the pod still appears (not silently dropped). New
_gk_streamgl_obj / _gk_forge_obj keep streamgl/forge/engine DRY; _gk_add_host
slurps the per-pod object stream into the array and injects `host` from one
pod-to-node kubectl call (not one per pod). Output pretty-prints for humans and
parses for tooling, e.g. which GPU host holds a session:

  gk_status_gpu_list | jq -r '.[] | select(.workers[].sessionIDs[]=="<sid>") | .host'
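
For tooling that prefers Python over jq, the same embedding rule and lookup can be sketched as follows. The helper names and the sample payload are hypothetical; only the `{pod, host, workers:[{name, sessionIDs}]}` shape comes from the list above.

```python
import json
from typing import Any, Dict, List, Optional


def embed(body: Optional[str]) -> Any:
    """Embed a response as JSON when parseable, else keep the raw string."""
    if body is None:
        return "unreachable"
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return body


def hosts_for_session(status: List[Dict[str, Any]], session_id: str) -> List[str]:
    """Hosts whose engine pod has a worker holding the given session."""
    return [
        pod["host"]
        for pod in status
        if any(session_id in worker["sessionIDs"] for worker in pod["workers"])
    ]


# Hypothetical gk_status_gpu_list output.
STATUS = [
    {"pod": "engine-abc", "host": "gpu-1", "workers": [{"name": "w0", "sessionIDs": ["s1"]}]},
    {"pod": "engine-xyz", "host": "gpu-2", "workers": [{"name": "w0", "sessionIDs": []}]},
]
```

Unlike the jq one-liner, this version reports each host at most once even if several workers on the pod list the session.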

gk_get_token unchanged (returns a bare token for capture, not a status object).
kubectl logs -f on a single pod defaults to --tail=-1 (all retained log), so the
log tailers replayed the whole rotated history before following, which is a slow
flood on a long-running, chatty pod (streamgl-gpu, forge during a viz). Add
--tail="${GK_LOGS_TAIL:-1}" to every kubectl logs -f call so they start from
"now" (default 1 line) then follow. GK_LOGS_TAIL=-1 restores all-history,
GK_LOGS_TAIL=500 gives more scrollback. gk_logs_engine_all already defaulted to
10 via the -l selector; the tail is now uniform and configurable across all log helpers.
….5->0.8.0

graphistry-helm: this PR is a breaking re-architecture, so a minor bump (0.5.0),
not a patch. appVersion corrected to track the Graphistry release it deploys
(2.50.7, = global.tag) instead of mirroring the chart version; global.tag bumped
v2.50.6 -> v2.50.7 so the chart ships the current release.

postgrescluster: 0.7.5 -> 0.8.0 (substantive values restructure + retain-sc-postgres
StorageClass landed on an unchanged version). appVersion stays 5.2.0 (the Crunchy
PGO release the PostgresCluster CR deploys is unchanged). Fixed the misleading
"version should match the CRD" comment (version = chart SemVer; appVersion = PGO).

Swept the user-facing --version / image-tag references across READMEs, examples,
NOTES, and gk.sh. Left intentionally: the telemetry / graphistry-common subcharts
at 0.4.4 (internal; parent 0.5.0 depends on them validly via Chart.lock), the
historical "Caddy 2.11.2 since v2.50.6" thresholds, and the prior-release 0.4.3
migration refs.
aucahuasi added the documentation and enhancement labels Jun 8, 2026
aucahuasi added 5 commits June 8, 2026 12:52
… postgres refs)

c379f7b's version sweep was scoped too narrowly and missed two sets of refs:
- the repo-root README.md, CHANGELOG.md, TROUBLESHOOTING.md (the bump only
  touched charts/);
- the 0.7.5 postgres-chart-version references living inside graphistry-helm
  chart files (examples/gk.sh, 10-MINUTES-TO-K8S.md, examples/README.md,
  README.md, templates/NOTES.txt) -- that sweep was scoped to
  charts/postgres-cluster/ only.

Brings all of them to 0.5.0 / 0.8.0. The CHANGELOG's current release entry is
relabeled [Version 0.4.4] -> [Version 0.5.0] along with its self-references, and
the TROUBLESHOOTING helm-list example's appVersion column is updated to 2.50.7
(graphistry-helm now tracks the deployed Graphistry release). Preserved: the
[Version 0.4.3] entry, the historical postgres-cluster 0.7.4 -> 0.7.5 line, and
the internal telemetry / graphistry-common subchart versions (0.4.4).

Graphistry-Helm-Chart -> graphistry-helm, postgrescluster -> postgres-cluster.
The Chart.yaml `name:` fields violated Helm's naming rule (lowercase, dashes, no
uppercase) and did not match their already-correct directory names. Fixed the
names plus every published-name reference (install commands, repo table, .tgz
filenames, prose) across READMEs, examples, NOTES, gk.sh, CHANGELOG, and
TROUBLESHOOTING.
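
The character part of the naming rule can be checked mechanically. The sketch below is an assumption-laden illustration: `is_valid_chart_name` is a hypothetical helper, and it covers only the lowercase-letters, digits, and dashes convention, not whether the name matches its directory.

```shell
# Hypothetical check: lowercase letters and digits, words separated by single dashes.
is_valid_chart_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'
}
```

Under this check `graphistry-helm` and `postgres-cluster` pass and `Graphistry-Helm-Chart` fails.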

Scope:
- Directories are unchanged (charts/graphistry-helm, charts/postgres-cluster
  were already convention-compliant), so ArgoCD path-based sources and the docs
  build are untouched.
- The Crunchy CR kind (postgresclusters, kubectl get postgrescluster), the CRD
  FQDN, and the PGO source path are upstream identifiers and are preserved.
- No rendered resource changes: the parent chart uses .Chart.Name only in the
  cosmetic NOTES post-install line, not in any label/selector; the telemetry
  subchart's .Chart.Name resolves to "telemetry", unaffected by this rename.

Published-name impact: chart-releaser auto-discovers charts under charts/ and
reads the name from Chart.yaml, so the next release publishes graphistry-helm
0.5.0 / postgres-cluster 0.8.0 automatically. The prior Graphistry-Helm-Chart
0.4.x / postgrescluster 0.7.x index entries remain for existing installs; new
installs use the new names.
…+ docs version

Bumps the bundled subcharts to match the parent so no 0.4.4 remains anywhere:
- graphistry-common 0.4.4 -> 0.5.0 (version + appVersion; a type:library that
  renders nothing, so appVersion just mirrors the chart version).
- telemetry 0.4.4 -> 0.5.0 (version); appVersion -> 2.50.7. telemetry is
  Graphistry's telemetry stack -- the collector config, instrumentation, and
  dashboards are developed and released in lockstep with the app -- so its
  appVersion tracks the Graphistry release, not the chart version. The deployed
  third-party components (otel-collector, prometheus, grafana, jaeger,
  dcgm/node-exporter) carry their own versions in their image tags.
- parent dependency pins + Chart.lock -> 0.5.0.
- docs/source/conf.py Sphinx release -> v0.5.0 (was missed; lives outside charts/).

Verified: helm lint + helm template pass; zero 0.4.4 remaining repo-wide.
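
A repo-wide "zero remaining" check of this kind can be sketched as a fixed-string grep. The helper names `stale_refs` and `sweep_is_clean` are hypothetical, not the commands actually used for this commit, and a plain substring match would also hit longer strings that merely contain the version.

```shell
# Hypothetical helper: list every literal occurrence of a version string under a directory.
stale_refs() {
  # -r recurses, -n prints line numbers, -F matches the version as a fixed string
  # so that "." is literal rather than a regex wildcard.
  grep -rnF -- "$2" "$1"
}

# Succeeds only when grep reports "no match" (exit status 1), not on a grep error.
sweep_is_clean() {
  rc=0
  stale_refs "$1" "$2" >/dev/null || rc=$?
  [ "$rc" -eq 1 ]
}
```

For example, `sweep_is_clean . 0.4.4` succeeds only once no file under the current directory still mentions 0.4.4.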
…y rendered image

Closes the air-gap gaps so a single value redirects all images the charts render:

- Telemetry stack (grafana, prometheus, jaeger, otel-collector, dcgm-exporter,
  node-exporter) now honors global.containerregistry.name via the same two-branch
  pattern as k8s-wait-for, flattening to <reg>/<basename>:<tag> in air-gap. They
  previously redirected only via per-component telemetry.<comp>.image overrides.
- Grafana init busybox pinned to 1.38.0 and made redirectable (was bare `busybox`,
  i.e. floating :latest and unredirectable).
- engine wait-for-postgres init (the graphistry.waitForPostgres helper): the
  crunchy-postgres image now redirects too (was hardcoded to the upstream Crunchy
  registry, the one image that escaped the redirect).
- README air-gapped section + values.yaml containerregistry annotation now state
  that one value redirects every image (app + telemetry + third-party); dropped the
  misleading short-name example.
- postgres-cluster README gains an air-gapped section, cross-linked both ways with
  the graphistry-helm air-gapped section.

Verified: helm lint clean on both charts; rendering with a custom
global.containerregistry.name leaves zero images on their upstream registries.
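
The two-branch resolution described above can be approximated in shell. This is a sketch under assumptions: the chart does this in Helm templates, `resolve_image` is a hypothetical name, and the branch condition (registry value set versus empty) is inferred from the description rather than copied from the templates.

```shell
# Hypothetical sketch of the two-branch image resolution.
# Arguments: <registry or empty> <upstream image path> <tag>
resolve_image() {
  if [ -n "$1" ]; then
    # Air-gap branch: drop the upstream registry/org path and keep only the
    # basename, giving <reg>/<basename>:<tag>.
    printf '%s/%s:%s\n' "$1" "${2##*/}" "$3"
  else
    # Default branch: leave the upstream reference untouched.
    printf '%s:%s\n' "$2" "$3"
  fi
}
```

With an illustrative registry value, `resolve_image registry.local:5000 quay.io/prometheus/node-exporter v1` flattens to `registry.local:5000/node-exporter:v1`; with an empty first argument the upstream path is kept as-is.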