Production RAG Assistant is a production-style Retrieval-Augmented Generation backend built with FastAPI, Postgres/pgvector, hybrid retrieval, provider switching, deterministic evals, observability, and a minimal browser UI.
The project is designed to run locally without paid model calls by default.
Fake providers are enabled out of the box. OpenAI-compatible embedding,
generation, query rewrite, and reranking can be enabled through .env when a
real provider key is available.
- FastAPI API for chat, streaming chat, documents, workspaces, sessions, export jobs, health, and metrics.
- Workspace management API with create, update, list, detail, soft archive, bulk archive, restore, bulk restore operations, and operation audit logging.
- Postgres + pgvector schema with Alembic migrations, including an export job foundation for asynchronous downloads.
- Markdown ingestion, chunking, content hashing, fake embeddings, OpenAI embeddings, and reindexing.
- Hybrid retrieval with vector search, sparse search, metadata filters, RRF fusion, optional query rewrite, optional session-history contextualization, and optional OpenAI listwise reranking.
- Fake generator and OpenAI Responses API generator, including streaming.
- Refusal guards for unsafe, out-of-scope, low-confidence, and empty-retrieval cases.
- Provider timeout, retry, error mapping, structured logs, Prometheus metrics, latency metrics, token metrics, and cost estimates.
- Deterministic eval gate with JSONL datasets and trend recording.
- Minimal web UI at /app/ with sessions, history, SSE chat, document upload, reindex actions, workspace creation, editing, archive/restore actions, admin overview, workspace search, pagination, status filters, bulk archive/restore actions, cross-page matching bulk preview/confirmation, archived-workspace read-only guards, chat log audit filters, chat log audit export, chat log audit details, workspace operation audit filters, workspace operation audit details, and chat error recovery.
- Dockerfile, production-style Compose stack, deployment runbook, and CI workflow.
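The RRF fusion step in the hybrid retrieval bullet can be illustrated with a minimal sketch. It assumes each retriever returns an ordered list of chunk IDs and uses the commonly cited constant k = 60; the function name, the chunk IDs, and the constant are illustrative assumptions, not taken from this project's code.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical IDs from vector search
sparse_hits = ["c1", "c9", "c3"]  # hypothetical IDs from sparse search
fused = rrf_fuse([vector_hits, sparse_hits])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists (here c1 and c3) rise above chunks found by only one retriever, which is the main reason to fuse by rank rather than by raw scores that live on different scales.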
flowchart TD
A["Markdown documents"] --> B["Ingestion and chunking"]
B --> C["Postgres documents and chunks"]
C --> D["pgvector embeddings and sparse search vector"]
E["POST /chat or /chat/stream"] --> F["API key and workspace check"]
F --> G["Optional session history"]
G --> H["Question refusal guard"]
H --> I["Optional query rewrite"]
I --> J["Vector retrieval"]
I --> K["Sparse retrieval"]
J --> L["RRF fusion"]
K --> L
L --> M["Retrieval refusal guard"]
M --> N["Optional rerank"]
N --> O["RAG prompt"]
O --> P["Generator"]
P --> Q["Citations, usage, logs, metrics"]
backend/
  app/
    api/            FastAPI routes and API security
    core/           config, logging, request id, tracing, rate limit
    db/             models, repositories, sessions, Alembic migrations
    observability/  Prometheus metrics registry
    rag/            embeddings, retrieval, reranking, generation, pipeline
    static/         browser UI served by FastAPI
  tests/            unit and integration-style tests
ingestion/          Markdown parsing, cleaning, chunking, hashing, ingest CLI
evals/              eval datasets, runner, reports, trend recorder
data/raw/           seed Markdown documents
monitoring/         Grafana dashboard and Prometheus alert templates
docs/               handoff, configuration, deployment, observability docs
Create local configuration:
Copy-Item .env.example .env
Validate Compose without printing secrets:
docker compose -f docker-compose.prod.yml config --quiet
uv run python -m backend.app.core.config_check --production
Start the production-style local stack:
docker compose -f docker-compose.prod.yml up -d --build
Open the UI:
http://127.0.0.1:8000/app/
Health check:
curl.exe http://127.0.0.1:8000/health
If port 8000 is already in use, set API_PORT in .env before starting the
stack.
Install dependencies and run checks with uv:
uv sync
uv run ruff check .
uv run pytest
Run database migrations:
uv run alembic upgrade head
Run the API directly on the host:
uv run uvicorn backend.app.main:app --host 127.0.0.1 --port 8000
Run the default pipeline smoke:
uv run python -m backend.app.rag.pipeline_smoke
Run the document-management smoke:
uv run python -m evals.document_management_smoke
Run the eval gate:
uv run python -m evals.run --format summary --fail-on-failure --no-output
Current local baseline: 599 passed.
Runtime configuration comes from .env. Keep .env local-only and use
.env.example as the template. The full configuration reference is
docs/CONFIGURATION.md.
Production secret manager mapping is documented in
docs/SECRET_MANAGER_MAPPING.md.
Run uv run python -m backend.app.core.config_check --production before
deploying to a shared or real production environment; it reports only variable
names and remediation guidance, never secret values.
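The "names only, never values" behaviour of such a check can be sketched as follows. This is an illustrative outline rather than the project's actual config_check: the list of required variables, the function name, and the messages are assumptions.

```python
# Hypothetical list of variables a production check might require.
REQUIRED_IN_PRODUCTION = ("API_KEYS", "OPENAI_API_KEY")


def find_problems(env: dict[str, str]) -> list[str]:
    """Return remediation messages that name variables but never echo values."""
    problems = []
    for name in REQUIRED_IN_PRODUCTION:
        if not env.get(name, "").strip():
            problems.append(f"{name}: missing or empty; set it in the secret manager")
    # The local default key is documented publicly, so flag it without printing it.
    if env.get("API_KEYS", "") == "dev-key":
        problems.append("API_KEYS: still the local development default; replace it")
    return problems


for message in find_problems({"API_KEYS": "dev-key", "OPENAI_API_KEY": ""}):
    print(message)
```

Building each message from the variable name alone means the output is safe to paste into CI logs or a ticket.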
Default local mode:
EMBEDDING_PROVIDER=fake
GENERATOR_PROVIDER=fake
QUERY_REWRITER_PROVIDER=none
RERANKER_PROVIDER=none
API_KEYS=dev-key
API_KEY_ROLES=
Enable real OpenAI-compatible providers only when OPENAI_API_KEY is set:
EMBEDDING_PROVIDER=openai
GENERATOR_PROVIDER=openai
QUERY_REWRITER_PROVIDER=openai
RERANKER_PROVIDER=openai
OPENAI_API_KEY=<set in local .env or secret manager>
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-5.4-nano
QUERY_REWRITE_MODEL=gpt-5.4-nano
RERANKER_MODEL=gpt-5.4-nano
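The two modes above amount to a selection rule: stay on the fake provider unless a real provider is requested and a key is present. The sketch below illustrates that rule together with one way a deterministic fake embedding could work. Both function names, the vector size, and the hashing scheme are assumptions for illustration, not the project's implementation.

```python
import hashlib


def fake_embedding(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in vector derived from a SHA-256 digest (dim <= 32)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def choose_embedding_provider(env: dict[str, str]) -> str:
    """Pick the provider name, refusing 'openai' when no key is configured."""
    provider = env.get("EMBEDDING_PROVIDER", "fake")
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        raise ValueError("EMBEDDING_PROVIDER=openai requires OPENAI_API_KEY")
    return provider


print(choose_embedding_provider({"EMBEDDING_PROVIDER": "fake"}))
print(len(fake_embedding("What problem does FlashAttention solve?")))
```

Because the fake vector depends only on the input text, repeated runs produce identical results, which is what makes tests and evals repeatable without paid model calls.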
After changing the embedding provider for an existing database, reindex stored chunk embeddings so stored vectors and query vectors use the same model:
uv run python -m backend.app.rag.reindex_embeddings --workspace-id public --write
Chat:
curl.exe -X POST http://127.0.0.1:8000/chat `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-H "X-Workspace-ID: public" `
-d "{\"question\":\"What problem does FlashAttention solve?\"}"
Streaming chat:
curl.exe -N -X POST http://127.0.0.1:8000/chat/stream `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-H "X-Workspace-ID: public" `
-d "{\"question\":\"What problem does FlashAttention solve?\"}"
Create a chat session:
curl.exe -X POST http://127.0.0.1:8000/chat/sessions `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-H "X-Workspace-ID: public" `
-d "{\"title\":\"LLM systems\"}"
Archive and restore a workspace:
curl.exe -X POST http://127.0.0.1:8000/workspaces/tenant-a/archive `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-d "{\"reason\":\"temporary tenant cleanup\"}"
curl.exe -X POST http://127.0.0.1:8000/workspaces/tenant-a/restore `
-H "Authorization: Bearer dev-key"
Bulk archive and restore workspaces:
curl.exe "http://127.0.0.1:8000/workspaces/bulk/preview?status=active&q=tenant&sample_limit=20" `
-H "Authorization: Bearer dev-key"
curl.exe -X POST http://127.0.0.1:8000/workspaces/bulk/archive-matching `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-d "{\"q\":\"tenant\",\"status\":\"active\",\"expected_total\":2,\"confirm\":true,\"reason\":\"temporary cleanup\"}"
curl.exe -X POST http://127.0.0.1:8000/workspaces/bulk/archive `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-d "{\"ids\":[\"tenant-a\",\"tenant-b\"],\"reason\":\"temporary cleanup\"}"
curl.exe -X POST http://127.0.0.1:8000/workspaces/bulk/restore `
-H "Authorization: Bearer dev-key" `
-H "Content-Type: application/json" `
-d "{\"ids\":[\"tenant-a\",\"tenant-b\"]}"
Archive and restore operations write workspace_audit_logs records with the
request id, hashed API key, action, affected workspace ids, and operation
metadata.
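The shape of such an audit record can be sketched as below. The field list follows the sentence above, but the class name, the field names, and the choice of SHA-256 for hashing the API key are assumptions; the source only states that the key is stored hashed.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def hash_api_key(api_key: str) -> str:
    # SHA-256 is an assumption; the point is that the raw key is never stored.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class WorkspaceAuditRecord:
    request_id: str
    api_key_hash: str
    action: str
    workspace_ids: tuple[str, ...]
    metadata: dict = field(default_factory=dict)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


record = WorkspaceAuditRecord(
    request_id="req-123",  # hypothetical request id
    api_key_hash=hash_api_key("dev-key"),
    action="archive",
    workspace_ids=("tenant-a",),
    metadata={"reason": "temporary tenant cleanup"},
)
print(record.action, record.workspace_ids, record.api_key_hash[:12])
```

Storing a hash still lets an operator group operations by caller without exposing a credential to anyone who can read the audit table.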
Query workspace operation audit logs:
curl.exe "http://127.0.0.1:8000/workspaces/audit-logs?action=archive&workspace_id=tenant-a&limit=20&offset=0" `
-H "Authorization: Bearer dev-key"
The /app/ Admin overview also exposes these records with action, workspace
ID, request ID, and time-range filters.
Asynchronous export is built from the export_jobs table,
ExportJobRepository, the /exports/jobs API, and the export worker. Jobs start
as pending, are claimed by a worker and marked running, and then finish as
succeeded or failed. Worker output is written under EXPORT_STORAGE_DIR.
In production compose, the API and export-worker services share the
export_prod_data volume so the API can download files written by the worker.
Admin export buttons create a job, poll its status, and download the completed
file through /exports/jobs/{job_id}/download. The existing /chat/logs/export
route remains synchronous for compatibility. Failed export jobs can be retried
explicitly with POST /exports/jobs/{job_id}/retry, which resets the job to
pending for the worker to claim again.
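The job lifecycle described above can be summarized as a small state machine. This is a sketch of the allowed transitions as the text describes them; the enum, the table, and the function name are illustrative and not the project's ExportJobRepository API.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed moves: claim, finish, explicit retry, and stale-job recovery.
ALLOWED = {
    JobStatus.PENDING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.PENDING},
    JobStatus.FAILED: {JobStatus.PENDING},
    JobStatus.SUCCEEDED: set(),
}


def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move export job from {current.value} to {target.value}")
    return target


status = transition(JobStatus.PENDING, JobStatus.RUNNING)  # worker claims the job
status = transition(status, JobStatus.FAILED)              # worker reports a failure
status = transition(status, JobStatus.PENDING)             # explicit retry
print(status.value)
```

Keeping the transitions in one table makes it easy to see that a succeeded job is terminal and that a retry is only meaningful for a failed one.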
Create and inspect an export job:
curl.exe -X POST http://127.0.0.1:8000/exports/jobs `
-H "Authorization: Bearer dev-key" `
-H "X-Workspace-ID: public" `
-H "Content-Type: application/json" `
-d "{\"export_type\":\"chat_logs\",\"format\":\"jsonl\",\"filters\":{\"limit\":1000,\"offset\":0}}"
curl.exe "http://127.0.0.1:8000/exports/jobs?status=pending&export_type=chat_logs" `
-H "Authorization: Bearer dev-key" `
-H "X-Workspace-ID: public"
Run one worker pass:
uv run python -m backend.app.exporting.worker
Run a continuously polling worker locally:
uv run python -m backend.app.exporting.worker --loop
The loop sleeps for EXPORT_WORKER_POLL_INTERVAL_SECONDS when there is no
pending job. Production compose starts this loop as the export-worker service:
docker compose -f docker-compose.prod.yml logs -f export-worker
Each worker iteration first resets stale running jobs back to pending when
their started_at age exceeds EXPORT_JOB_RUNNING_TIMEOUT_SECONDS. This lets a
new worker process recover jobs left behind by a crashed or interrupted worker.
The worker also deletes expired top-level .jsonl and .csv files from
EXPORT_STORAGE_DIR after EXPORT_FILE_RETENTION_SECONDS; job metadata remains
available for audit, and old downloads return 404 export file not found after
the file is removed.
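The two housekeeping rules above, stale-job detection and file retention, can be sketched in a few lines. The function names are hypothetical, and measuring file age by modification time is an assumption; the source does not say which timestamp the worker uses.

```python
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def is_stale(started_at: datetime, now: datetime, timeout_seconds: int) -> bool:
    """A running job is stale once its age exceeds the running timeout."""
    return now - started_at > timedelta(seconds=timeout_seconds)


def delete_expired_exports(
    storage_dir: Path, retention_seconds: int, now_ts: Optional[float] = None
) -> list[str]:
    """Delete top-level .jsonl/.csv files older than the retention window."""
    now_ts = time.time() if now_ts is None else now_ts
    removed = []
    for path in storage_dir.iterdir():  # top level only, no recursion
        if path.is_file() and path.suffix in {".jsonl", ".csv"}:
            # Assumption: age is measured from the file's modification time.
            if now_ts - path.stat().st_mtime > retention_seconds:
                path.unlink()
                removed.append(path.name)
    return sorted(removed)
```

In this sketch, `timeout_seconds` would come from EXPORT_JOB_RUNNING_TIMEOUT_SECONDS and `retention_seconds` from EXPORT_FILE_RETENTION_SECONDS. Restricting deletion to two suffixes at the top level keeps the cleanup from touching anything else that happens to live in the storage directory.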
Download a completed job:
curl.exe "http://127.0.0.1:8000/exports/jobs/<job-id>/download" `
-H "Authorization: Bearer dev-key" `
-H "X-Workspace-ID: public" `
-o chat-logs.jsonl
Retry a failed export job:
curl.exe -X POST "http://127.0.0.1:8000/exports/jobs/<job-id>/retry" `
-H "Authorization: Bearer dev-key" `
-H "X-Workspace-ID: public"
Archived workspaces remain readable for audit and recovery, but write-oriented
operations return 409 workspace archived. This includes chat, streaming chat,
chat session creation, document upload, document deletion, and document reindex.
The web UI mirrors this policy by disabling write controls for the current
workspace after it detects archived_at.
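The read-only policy for archived workspaces can be pictured as a guard that runs before every write-oriented handler. The sketch below uses plain Python rather than the project's FastAPI code; the class and function names are hypothetical, and only the 409 status, the "workspace archived" detail, and the archived_at field come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


class WorkspaceArchivedError(Exception):
    """Carries the HTTP status and detail an API layer would return."""

    status_code = 409
    detail = "workspace archived"


@dataclass
class Workspace:
    id: str
    archived_at: Optional[datetime] = None


def require_writable(workspace: Workspace) -> Workspace:
    """Pass active workspaces through and reject archived ones."""
    if workspace.archived_at is not None:
        raise WorkspaceArchivedError(workspace.id)
    return workspace


active = require_writable(Workspace(id="public"))
archived = Workspace(id="tenant-a", archived_at=datetime.now(timezone.utc))
try:
    require_writable(archived)
except WorkspaceArchivedError as exc:
    print(exc.status_code, exc.detail)
```

Read paths simply skip the guard, which is how archived workspaces can stay available for audit and recovery while chat, uploads, deletions, and reindexing are refused.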
Run before committing:
uv run ruff check .
uv run pytest
uv run python -m backend.app.core.config_check
uv run python -m evals.run --format summary --fail-on-failure --no-output
uv run python -m backend.app.rag.pipeline_smoke
uv run python -m evals.document_management_smoke
docker compose -f docker-compose.prod.yml config --quiet
rg -n "s[k]-" backend docs .github ingestion evals pyproject.toml README.md Makefile docker-compose.yml docker-compose.prod.yml .env.example Dockerfile .dockerignore
The secret scan should only match intentional placeholders, never real keys.
- Project handoff and quick start
- Configuration and secrets guide
- Secret manager mapping
- Deployment runbook
- Release checklist
- Release v0.1.0 notes
- GitHub Release v0.1.0 body
- Observability guide
- Database observability guide
- Eval trends guide
docker build -t production-rag-assistant:local .