Open-source SRE copilot — observability, FinOps, runbook automation, and incident response across Kubernetes and AWS / Azure / GCP.
Nudgebee is an open-source SRE copilot that watches your Kubernetes clusters and AWS / Azure / GCP accounts, turns raw signals into ranked findings, and walks operators through investigation and remediation. It bundles:
- Observability ingestion — Kubernetes events, metrics, traces, plus cloud-provider scans across AWS, Azure, and GCP.
- FinOps & cost optimization — surface unused / underutilized resources (idle workloads, oversized pods, stale snapshots, dangling volumes) and right-sizing recommendations across cloud and Kubernetes.
- LLM-powered triage — agentic planners that reproduce, root-cause, and propose fixes for incidents.
- ChatOps — Slack / Teams chatbot for SRE workflows: query state, run runbooks, ack alerts, and drive investigations from the channel where on-call already lives.
- Runbook automation — codify recurring fixes as reusable runbooks and trigger them from chat, alert, or schedule.
- Ticketing + notifications — bidirectional sync with Jira, ServiceNow, PagerDuty, Zenduty; alert delivery to Slack, Teams, email.
Dashboard screenshot — to be added. Track discussion thread or contribute via a PR.
The fastest way to run Nudgebee from source: infrastructure runs in containers, while the backend and frontend run from source on the host. This is the path contributors should use.
- Docker (with `docker compose`) or Podman Desktop (with `podman-compose`)
- Go 1.26+
- Node 22+ and npm
git clone https://github.com/nudgebee/nudgebee.git
cd nudgebee
docker compose up -d
The default compose profile starts Postgres, Redis, RabbitMQ, Qdrant, Temporal, and a one-shot migrations container that applies the Postgres + RabbitMQ schema and then exits. Re-runs are safe — golang-migrate is idempotent against an up-to-date tracker. To also run the backend and frontend in containers (instead of from source), use docker compose --profile full up -d.
See api-server/migrations/README.md for how migration tracking works and how to add a new migration.
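To illustrate why re-running the migrations container is safe, here is a toy version tracker. It is a sketch of the general idea only, not golang-migrate's implementation: the `tenants` table, the migration list, and the single-column `schema_migrations` layout are assumptions made for the example, and SQLite stands in for Postgres.

```python
import sqlite3

# Hypothetical migrations, ordered by version.
MIGRATIONS = [
    (1, "CREATE TABLE tenants (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE tenants ADD COLUMN created_at TEXT"),
]


def migrate(conn: sqlite3.Connection) -> list[int]:
    """Apply only migrations newer than the tracked version; return what ran."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    current = row[0] or 0  # MAX() yields NULL on an empty tracker
    applied = []
    for version, statement in MIGRATIONS:
        if version <= current:
            continue  # already applied on an earlier run
        conn.execute(statement)
        # Keep a single row holding the latest applied version.
        conn.execute("DELETE FROM schema_migrations")
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
        )
        applied.append(version)
    conn.commit()
    return applied
```

Calling `migrate` twice on the same connection applies both migrations the first time and nothing the second time, which is the property the compose migrations container relies on.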
# macOS / Linux / WSL
cp api-server/services/.env.example api-server/services/.env
# Windows PowerShell
Copy-Item api-server\services\.env.example api-server\services\.env
Then generate the encryption key and replace the __REPLACE__ placeholder for
NUDGEBEE_ENCRYPTION_KEY in the new .env:
openssl rand -hex 32
Keep this value — you'll paste the same key into app/.env in step 5, and
into every other service's .env if you later run more from source (see
Local Stack Bootstrap below). Rotating it after data is written makes
previously-encrypted DB rows unreadable, so treat it like a database master
password.
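If `openssl` is not available (for example on a stock Windows shell), the same kind of key can be produced with Python's standard library. The sketch below also fills in the placeholder; it assumes the example file contains a line of the exact form `NUDGEBEE_ENCRYPTION_KEY=__REPLACE__`, which should be checked against the real `.env.example`.

```python
import re
import secrets

PLACEHOLDER = "__REPLACE__"


def generate_key() -> str:
    """Return 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def fill_encryption_key(env_text: str, key: str) -> str:
    """Replace the placeholder on the NUDGEBEE_ENCRYPTION_KEY line only."""
    pattern = re.compile(
        r"^(NUDGEBEE_ENCRYPTION_KEY=)" + re.escape(PLACEHOLDER) + r"[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + key, env_text)


# Stand-in for the contents of api-server/services/.env (assumed layout).
sample = "CLICKHOUSE_ENABLED=false\nNUDGEBEE_ENCRYPTION_KEY=__REPLACE__\n"
key = generate_key()
updated = fill_encryption_key(sample, key)
```

Only the named variable is touched, so other `__REPLACE__` placeholders in the file stay visible and still have to be handled deliberately.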
Other defaults work as-is against the compose stack from step 2. Read the inline comments before any non-local deploy — a few other values (private keys) also need rotation.
# macOS / Linux / WSL (requires make)
cd api-server/services
make run
# Windows (no make required — runs the same command directly)
cd api-server\services
go run ./cmd
Listens on http://localhost:8000. Leave it running.
In a new terminal:
# macOS / Linux / WSL
cp app/.env.example app/.env
# Windows PowerShell
Copy-Item app\.env.example app\.env
Replace __REPLACE__ for NUDGEBEE_ENCRYPTION_KEY with the same value you
generated in step 3. The app can't decrypt what services-server writes unless
these match.
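A mismatch between the two files is easy to miss, so a quick check can help. This is a minimal sketch with a deliberately simple `KEY=VALUE` parser (not a full dotenv implementation); the helper names are illustrative and not part of the repo.

```python
def read_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def keys_match(services_env: str, app_env: str,
               name: str = "NUDGEBEE_ENCRYPTION_KEY") -> bool:
    """True only if both files set the same, non-placeholder value."""
    a = read_env(services_env).get(name)
    b = read_env(app_env).get(name)
    return bool(a) and a != "__REPLACE__" and a == b
```

In practice the two arguments would be the text of `api-server/services/.env` and `app/.env`, for example read with `pathlib.Path(...).read_text()`.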
The NEXTAUTH_SECRET in the example is a dev-only sample; rotate it for any
non-local deploy.
cd app
npm install --legacy-peer-deps
npm run dev
Open http://localhost:3000.
On the sign-in page, click Admin Login. Then:
- Email: any address (e.g. `dev@example.com`) — a tenant + admin user are created automatically on first sign-in.
- Password: literally `Test!24#5` — the value of `NEXTAUTH_DUMMY_CREDS_PASSWORD` shipped in `app/.env.example`. Type it exactly; this is the dummy-credentials provider, not your own password.
The sample values in steps 3 and 5 above are fine for local dev. For any non-local deployment, generate fresh values and review the notes below.
| Var | Used by | How to generate | Notes |
|---|---|---|---|
| `APP_DATABASE_URL` | services-server | — | Compose default: postgres://postgres:postgrespassword@localhost:5432/nudgebee?sslmode=disable. Use localhost from the host, postgres hostname from inside the compose network. |
| `NUDGEBEE_ENCRYPTION_KEY` | services-server and app | openssl rand -hex 32 | Encrypts integration credentials and other sensitive columns. Must match between services-server and app. Rotating it makes previously-encrypted rows unreadable — there is no automatic re-encryption migration. |
| `ACTION_API_SERVER_TOKEN` | services-server and app | openssl rand -hex 32 | Optional. Shared secret for internal app↔services-server action calls. Defaults to empty on both sides, which disables the check (fine for local dev). If you set it, the value must match in both files. |
| `NEXTAUTH_SECRET` | app | openssl rand -base64 32 | Signs both NextAuth session cookies and the inner HS256 session JWT (used by nbctl / Bearer-flow callers). Rotating it logs everyone out and invalidates outstanding bearer tokens. |
| `NEXTAUTH_DUMMY_CREDS_ENABLED` / `_PASSWORD` | app | — | Enables the any-email/password provider. Use for local development only; turn it off in any deployment exposed beyond your laptop. |
| `RABBIT_MQ_USERNAME` / `_PASSWORD` / `_HOST` / `_PORT` | services-server | — | Compose defaults: guest / guest / localhost / 5672. |
| `CLICKHOUSE_ENABLED` | services-server | — | false for the OSS local stack — ClickHouse is not part of compose. |
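The `NEXTAUTH_SECRET` notes say that rotating the secret invalidates outstanding bearer tokens. The sketch below shows why, using a generic HS256 JWT built with the standard library. The claim names and secrets are made up for illustration, and this is not NextAuth's own token code.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: str) -> str:
    """Build header.payload.signature with an HMAC-SHA256 signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                               separators=(",", ":")).encode())
    payload = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def verify_hs256(token: str, secret: str) -> bool:
    """Recompute the signature with `secret` and compare in constant time."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(secret.encode(), f"{header}.{payload}".encode("ascii"),
                        hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)


# A token issued before rotation (hypothetical claims and secrets).
token = sign_hs256({"sub": "dev@example.com"}, "old-secret")
```

`verify_hs256(token, "old-secret")` succeeds, while `verify_hs256(token, "new-secret")` fails: the signature is a function of the secret, so every token issued before a rotation stops verifying at once.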
- `error pinging postgres: lookup postgres: no such host` from backend → `APP_DATABASE_URL` in `api-server/services/.env` still uses container hostname. Replace `@postgres:5432` with `@localhost:5432`.
- `migrate: error: pq: relation "..." already exists` → tracker schema drifted from actual tables. Inspect `SELECT version, dirty FROM nudgebee.schema_migrations;` and use `migrate force <version>` to align. See api-server/migrations/README.md.
- Action call from frontend returns 502 `RPC gateway could not handle the operation` → the requested action isn't registered in `app/src/lib/actions.yaml`, or it's a subscription / fragment / parse error. Check the frontend dev-server console for the unhandled reason.
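For the first item, the hostname swap can be done without hand-editing the URL. A minimal sketch using `urllib.parse`; the `use_host` helper is illustrative, and it assumes a plain `host:port` authority (no IPv6 literals).

```python
from urllib.parse import urlsplit, urlunsplit


def use_host(dsn: str, new_host: str) -> str:
    """Return `dsn` with only the hostname replaced; credentials, port,
    path, and query string are preserved."""
    parts = urlsplit(dsn)
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    _, colon, port = hostport.partition(":")  # assumes no IPv6 brackets
    netloc = f"{userinfo}{sep}{new_host}{colon}{port}"
    return urlunsplit(parts._replace(netloc=netloc))


in_compose = "postgres://postgres:postgrespassword@postgres:5432/nudgebee?sslmode=disable"
from_host = use_host(in_compose, "localhost")
```

Here `from_host` equals the compose default quoted in the environment table, with `@localhost:5432` in place of `@postgres:5432`.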
The umbrella chart is published as a public OCI artifact at oci://ghcr.io/nudgebee/charts/nudgebee and bundles Postgres, RabbitMQ, Redis, Qdrant, and Temporal as subcharts.
# 1. Generate a permanent encryption key — store this securely.
# Losing it makes previously-encrypted DB rows unreadable.
export NUDGEBEE_ENC_KEY=$(openssl rand -hex 32)
echo "Save this key: $NUDGEBEE_ENC_KEY"
# 2. Install
helm install nudgebee oci://ghcr.io/nudgebee/charts/nudgebee \
--namespace nudgebee --create-namespace \
--set nudgebee_secret.NUDGEBEE_ENCRYPTION_KEY="$NUDGEBEE_ENC_KEY" \
--wait --timeout 20m
To pin a specific version, pass --version <X.Y.Z> (latest is used by default). To install from source instead — useful when iterating on chart changes — clone the repo, run helm dep update deploy/kubernetes/nudgebee, and point helm install at the local path.
The post-install hook applies database migrations automatically. Once the pods are ready:
kubectl -n nudgebee port-forward svc/app 3000:80
# Retrieve the bootstrap admin password
kubectl -n nudgebee get secret nudgebee \
-o jsonpath='{.data.NEXTAUTH_DUMMY_CREDS_PASSWORD}' | base64 -d
Open http://localhost:3000 and sign in with any email + that password. See deploy/kubernetes/README.md for production-grade configuration (ingress, TLS, external Postgres, ClickHouse, observability sidecars).
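The `base64 -d` step in the command above is needed because Kubernetes stores Secret `data` values base64-encoded. Where `base64` is not on the PATH, the value printed by `kubectl ... -o jsonpath=...` can be decoded with a few lines of Python. This is a sketch, and the helper name is illustrative.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one base64-encoded Secret value as UTF-8 text."""
    # strip() drops a trailing newline from copy-paste or shell output.
    return base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
```

`validate=True` makes the call raise on characters outside the base64 alphabet rather than silently skipping them, which helps catch a truncated paste.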
The platform is live but empty. A quick tour that takes ~10 minutes:
- Connect a cloud account or a Kubernetes cluster — Settings → Integrations. Onboarding a real cluster lets the K8s collector populate the knowledge graph; an AWS / GCP / Azure account lights up the spend + recommendations surfaces. Without at least one of these, most of the dashboard is intentionally empty.
- Wire up a notification channel — Settings → Integrations → Slack / Teams / Email. Without one, you can still try the product, but notifications + chatops flows won't reach you.
- Run a sample runbook — Runbooks → Library. The bundled library has runnable examples (health-check loops, K8s investigation, cost spotlights). Run one manually to see end-to-end orchestration.
- Ask the AI assistant something — bottom-right corner of the dashboard. It can answer questions about your connected clusters, walk you through a recommendation, or kick off an investigation.
- Hit a bug? Open an issue using the bug template.
- Idea for a feature? Use the feature template.
- Want to contribute code? Read CONTRIBUTING.md — covers CLA, branch model, PR conventions, and local-dev debugging tips. Look for issues tagged `good first issue` (coming soon — see #30295 for context).
- Question that doesn't fit a template? Email `dev@nudgebee.com` or use the contact links on the new-issue page.
Nudgebee is a Kubernetes-native monorepo of Go, Python, and TypeScript services. The high-level flow:
┌─────────────────────────────────────────────────┐
│ Browser (Next.js dashboard — `app/`) │
└────────────────────┬────────────────────────────┘
│ HTTP / WebSocket
▼
┌─────────────────────────────────────────────────┐
│ app (Next.js server) — RPC gateway + NextAuth │
└────┬───────────────────┬──────────────────┬─────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────┐ ┌──────────────┐
│ api-server │ │ llm-server │ │ ticket-server│
│ services │◀─│ + rag-server │ │ notifications│
│ (Go / Gin) │ │ + code- │ │ runbook │
└────────┬────────┘ │ analysis │ └──────┬───────┘
│ └──────┬───────┘ │
│ │ │
┌─────────────┼──────────────────┴─────────────────┘
▼ ▼ ▼ ▼
Postgres RabbitMQ events Qdrant Temporal
(state) (cross-service) (vectors) (workflows)
▲
│
┌──────┴──────────────────────────────────┐
│ Collectors │
│ cloud-collector k8s-collector relay │
│ ml-k8s-server │
└─────────────────────────────────────────┘
- `app/` — Next.js dashboard; in-process RPC gateway at `/api/graphql` forwards client calls to backend `/rpc/*` handlers.
- `api-server/services/` — core Go backend (Gin) for tenants, accounts, recommendations, integrations.
- `llm/llm-server` + `llm/rag-server` + `llm/code-analysis` — LLM session state, retrieval-augmented context, on-demand code analysis.
- `ticket-server/` — bidirectional sync with external ticketing (Jira, ServiceNow, PagerDuty, Zenduty).
- `runbook-server/` — runbook orchestration via Temporal workflows. See runbook-server/README.md for architecture, env reference, API, and task framework.
- `notifications-server/` — Slack/Teams/email alert delivery.
- `collector-server/` — cloud-scan + Kubernetes metrics pipelines; `relay-server` bridges in-cluster agents back to the central plane.
- `ml-k8s-server/` — Python ML pipelines for workload right-sizing.
For per-service detail, see the Project Structure table below or each module's README.md.
The repo ships a docker-compose.yaml that wires every service against the public container registry. You rarely need to bring all 18 containers up — start only the services that match what you're working on. The table below lists each service and the minimum set of upstream services required for it to boot successfully.
| Service | Image | Notes |
|---|---|---|
| `postgres` | postgres:16 | Primary RDBMS. App schema applied by golang-migrate on deploy. |
| `rabbitmq` | rabbitmq:3-management | Message bus. UI at :15672. |
| `redis` | redis:7-alpine | Cache. |
| `qdrant` | qdrant/qdrant:v1.16.0 | Vector store for RAG / LLM. |
| `temporal` | temporalio/auto-setup:1.29.1 | Workflow engine. Backed by postgres (creates temporal + temporal_visibility DBs on first boot). Required by workflow-server. |
| `temporal-ui` | temporalio/ui:2.44.0 | Optional Temporal Web UI at :8233. |
ClickHouse is intentionally not in the local-dev compose. Backend services run with
`CLICKHOUSE_ENABLED=false` locally; bring up a `clickhouse` container manually if you're working on analytics-pipeline code.
| Service | Min upstream deps | Why |
|---|---|---|
| `api-server-services` | postgres, rabbitmq, redis | Core backend. Won't bootstrap RabbitMQ consumers without RabbitMQ; queries fail without Postgres; cache pulls hit Redis. ClickHouse is gated on CLICKHOUSE_ENABLED. |
| `ticket-server` | postgres, rabbitmq | DB writes + async ticketing sync. |
| `workflow-server` (runbook-server) | postgres, rabbitmq, temporal | Workflow state in PG; events on RMQ; Temporal SDK calls Dial(7233) at startup and crashes if unreachable. |
| `notifications-server` | postgres, rabbitmq | Persists messages, consumes RMQ events. |
| `cloud-collector` | rabbitmq | Publishes scrape events to RMQ. ClickHouse writes are skipped when disabled. |
| `relay-server` | postgres, rabbitmq | K8s gateway; tunnel state in PG, events on RMQ. |
| `k8s-collector-app` | rabbitmq | Publishes K8s metrics to RMQ. ClickHouse writes skipped when disabled. |
| `ml-k8s-server` | postgres | Reads/writes scaling features. |
| `llm-server` | postgres, qdrant | LLM session state + vector lookups. Spawns code-analysis per account on demand (not a long-running compose service). |
| `rag-server` | postgres, qdrant | RAG retrieval against Qdrant; metadata in PG. |
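Because `workflow-server` exits if Temporal is unreachable at startup, it can help to wait for port 7233 before launching it from source. A minimal sketch of such a readiness probe; the helper name, host, and timeouts are illustrative assumptions, and an open TCP port only suggests, rather than proves, that Temporal has finished its own setup.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 7233)` could gate the start of `workflow-server`; the same helper works for Postgres (5432) or RabbitMQ (5672) when starting other services by hand.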
| Service | Min upstream deps | Why |
|---|---|---|
| `app` (Next.js) | api-server-services | All GraphQL operations are served by the in-process RPC gateway in the Next.js server (/api/graphql) and forwarded to api-server-services action handlers. RabbitMQ/Redis are required indirectly via api-server-services' own deps. |
- DB exploration: `postgres` (connect with `psql` / DBeaver).
- Login + dashboard render: `postgres` + `api-server-services` + `app` (pulls RabbitMQ/Redis transitively).
- Cloud findings pipeline: add `rabbitmq` + `cloud-collector` (start a manual `clickhouse` container too if you want findings persisted; set `CLICKHOUSE_ENABLED=true` in backend env).
- LLM/RAG flows: add `qdrant` + `llm-server` + `rag-server`.
Start a subset with docker compose up -d <service> [<service> ...] (or podman-compose up -d ...); transitive deps are pulled in automatically via depends_on.
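To see what "transitive" means for a given target, the dependency tables above can be walked by hand or with a short script. In the sketch below, the mapping is transcribed from those tables and may not match the `depends_on` entries in `docker-compose.yaml` exactly; the function name is illustrative.

```python
# Minimum upstream deps per service, transcribed from the tables above.
MIN_DEPS = {
    "postgres": [], "rabbitmq": [], "redis": [], "qdrant": [],
    "temporal": ["postgres"],
    "api-server-services": ["postgres", "rabbitmq", "redis"],
    "ticket-server": ["postgres", "rabbitmq"],
    "workflow-server": ["postgres", "rabbitmq", "temporal"],
    "notifications-server": ["postgres", "rabbitmq"],
    "cloud-collector": ["rabbitmq"],
    "relay-server": ["postgres", "rabbitmq"],
    "k8s-collector-app": ["rabbitmq"],
    "ml-k8s-server": ["postgres"],
    "llm-server": ["postgres", "qdrant"],
    "rag-server": ["postgres", "qdrant"],
    "app": ["api-server-services"],
}


def services_to_start(targets: list[str]) -> list[str]:
    """Return the targets plus every service they depend on, sorted by name."""
    needed = set()
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name in needed:
            continue  # already visited via another path
        needed.add(name)
        stack.extend(MIN_DEPS[name])
    return sorted(needed)
```

For instance, `services_to_start(["app"])` expands to `api-server-services`, `app`, `postgres`, `rabbitmq`, and `redis`, which matches the "Login + dashboard render" combination listed above.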
Each module has its own README with setup and development instructions.
| Module | Description | README |
|---|---|---|
| `app/` | Frontend dashboard (Next.js + React, NextAuth) | app/README.md |
| Module | Description | README |
|---|---|---|
| `api-server/` | GraphQL API layer overview | api-server/README.md |
| `api-server/services/` | Core backend Go services (Gin) | api-server/services/README.md |
| `api-server/migrations/` | DB migrations (Postgres via golang-migrate, ClickHouse, RabbitMQ) | api-server/migrations/README.md |
| Module | Description | README |
|---|---|---|
| `collector-server/cloud-collector/` | AWS/cloud data collection | collector-server/cloud-collector/README.md |
| `collector-server/k8s-collector/app/` | K8s metrics aggregation (Python) | collector-server/k8s-collector/app/README.md |
| `collector-server/k8s-collector/relay-server/` | K8s relay gateway (WebSocket) | collector-server/k8s-collector/relay-server/README.md |
| Module | Description | README |
|---|---|---|
| `ml-k8s-server/` | ML models & K8s autoscaling | ml-k8s-server/README.md |
| `llm/llm-server/` | LLM inference service | llm/llm-server/README.md |
| `llm/code-analysis/` | Code analysis engine | llm/code-analysis/README.md |
| `llm/rag-server/` | RAG (Retrieval Augmented Generation) | llm/rag-server/README.md |
| `llm/benchmark/` | LLM benchmarking | llm/benchmark/README.md |
| Module | Description | README |
|---|---|---|
| `runbook-server/` | Runbook orchestration + automation engine (Temporal) | runbook-server/README.md |
| `ticket-server/` | External ticketing integration (Jira, ServiceNow, PagerDuty, Zenduty) | ticket-server/README.md |
| `notifications-server/` | Notification delivery (Slack, Teams, email) | notifications-server/README.md |
| Module | Description | README |
|---|---|---|
| `deploy/kubernetes/` | Helm charts & Kubernetes config files | deploy/kubernetes/README.md |
| Module | Description |
|---|---|
| `app-e2e-tests/` | End-to-end integration tests |
We welcome contributions! Before opening your first PR:
- Read CONTRIBUTING.md for the development workflow, conventional-commit format, and PR guidelines.
- Review the Code of Conduct.
- Browse open issues — look for `good first issue` and `help wanted` labels if you're getting started.
By contributing, you agree your contributions are licensed under the Apache License, Version 2.0. On your first PR, the CLA Assistant bot will post a one-click sign link — subsequent PRs need no further action.
If you believe you have found a security vulnerability in Nudgebee, please do not open a public GitHub issue. Instead follow the responsible disclosure process in SECURITY.md.
Nudgebee ships with no telemetry or product analytics. No data leaves your cluster except what you explicitly configure — notification webhooks, ticket-system sync, LLM provider calls, and any outbound integrations you wire up.
- Questions and ideas — GitHub Discussions
- Bugs and feature requests — GitHub Issues
- Security reports — SECURITY.md
Apache License 2.0 — see LICENSE.