py-rag-engine

End-to-end Python RAG (Retrieval-Augmented Generation) engine: PDF / Markdown ingestion, recursive and semantic chunking, embeddings via LM Studio (or any OpenAI-compatible endpoint), PostgreSQL + pgvector with HNSW cosine index, hybrid search (dense ANN + Postgres FTS) fused with Reciprocal Rank Fusion, Cross-Encoder re-ranking, a FastAPI REST API, offline RAGAS evaluation, and structured tests on Python 3.11 / 3.12.

CI CodeQL Dependency review Publish Python License: MIT Last Commit Ruff pytest CodeFactor pre-commit


py-rag-engine is a small, production-minded RAG engine: ingest PDF and Markdown files, chunk them recursively (with optional embedding-based semantic splitting), embed through LM Studio, store vectors in PostgreSQL with pgvector (HNSW cosine index) alongside a tsvector GIN index for full-text search, retrieve with three composable stages (dense ANN + Postgres FTS → Reciprocal Rank Fusion → Cross-Encoder re-rank), and optionally generate grounded answers via an LM Studio chat model. A FastAPI app exposes the whole pipeline behind /health, /documents, and /query. An offline RAGAS runner benchmarks chunk size and embedding-model combinations. Canonical repository: github.com/esousa97/py-rag-engine.

Pipeline overview

PDF / Markdown
      │
      ▼
  Ingestion ──► Recursive / Semantic Chunking ──► SHA-256 deduplication
      │
      ▼
  Embedding (LM Studio / any OpenAI-compatible endpoint)
      │
      ▼
  PostgreSQL + pgvector  ◄──  HNSW cosine index (bge-m3 · 1024 dims)
                         ◄──  GIN index on tsvector (Postgres FTS)
      │
      ├──► Dense recall  (top-20 via pgvector cosine ANN)
      │
      └──► FTS recall    (top-20 via ts_rank_cd, websearch_to_tsquery)
                │
                ▼
       Reciprocal Rank Fusion (RRF, k=60)
                │
                ▼
  Cross-Encoder re-rank  (ms-marco-MiniLM-L-6-v2, top-5 final)
                │
                ▼
  RerankedResult list ordered by relevance score
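
The fusion stage in the diagram can be illustrated with a minimal sketch of Reciprocal Rank Fusion, where each chunk's fused score is the sum of 1 / (k + rank) over the lists it appears in, with k=60 as above. The function name `rrf_fuse` and the chunk IDs are hypothetical; this is not the repository's `reciprocal_rank_fusion` implementation, only the standard formula.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["c1", "c2", "c3"]  # hypothetical dense-recall order
fts_hits = ["c3", "c1", "c4"]    # hypothetical FTS-recall order
fused = rrf_fuse([dense_hits, fts_hits])
print([doc_id for doc_id, _ in fused])
```

Chunks that appear in both lists accumulate two contributions, which is why a chunk ranked moderately well by both retrievers can outrank one that only a single retriever found.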

Demo (quick smoke test)

Create a virtual environment, install the package, and run a fast wiring check against a small sample document — no Postgres or LM Studio required for the unit suite.

Linux / macOS (bash)

python -m venv .venv
source .venv/bin/activate
pip install -e ".[api,embeddings]"
pip install pytest

pytest -q

Windows (PowerShell)

py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e ".[api,embeddings]"
pip install pytest

python -m pytest -q

To exercise the full pipeline end-to-end (ingest → embed → store → hybrid search → rerank), start Postgres and LM Studio (see "Quick Start" below) and run python scripts\demo_rerank.py data\gdp_document_0.pdf. To drive the REST API, run uvicorn "py_rag_engine.api:create_app" --factory --host 127.0.0.1 --port 8001. See docs/architecture.md for the full module map and docs/evaluation.md for the metric definitions.

Features

| Area | What you get |
| --- | --- |
| Ingestion | PDF page extraction via pypdf and Markdown (.md / .markdown) loaders with source-aware metadata. |
| Chunking | Recursive splitter with dynamic overlap, plus optional semantic chunking using cosine similarity between paragraph embeddings. A standalone async SemanticChunker exposes from_sample(...) to auto-calibrate the cosine-distance threshold from a representative sample (calibrate_distance_threshold). |
| Dedup | SHA-256 content hash per chunk; upserts on (embedding_model, content_hash) keep the table idempotent across re-runs. |
| Storage | PostgreSQL + pgvector with HNSW cosine index, JSONB metadata GIN index, and a tsvector GIN index for full-text search. |
| Retrieval | Three-stage: dense pgvector ANN + Postgres FTS (ts_rank_cd, websearch_to_tsquery) fused via Reciprocal Rank Fusion (RRF, k=60), then re-ranked by a Cross-Encoder. |
| Re-ranking | CrossEncoderReranker (default ms-marco-MiniLM-L-6-v2) with lazy model loading and an injectable predict for tests. |
| Generation | Optional grounded answer via an LM Studio chat model (generate_answer assembles the prompt + citation contexts). |
| API | FastAPI with /health, /documents (upload + list), and /query exposing four retrieval modes via use_hybrid / use_rerank toggles. |
| Evaluation | Offline RAGAS runner with faithfulness / answer relevancy / context precision; outputs a timestamped JSON report and a config ranking. |
| Tests | 132 unit / integration tests on Python 3.11 and 3.12 in CI; integration suite gated on TEST_POSTGRES_URL + LM_STUDIO_BASE_URL. |
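
The Dedup row can be illustrated with a small in-memory sketch: hash each chunk's text with SHA-256 and key the store on (embedding_model, content_hash), so re-ingesting the same text overwrites rather than duplicates. The dict stands in for the Postgres table, and the helper names are hypothetical; whether the project normalizes text before hashing is not stated here, so the sketch hashes the raw UTF-8 bytes.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text (assumption: raw UTF-8 bytes, no normalization)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_chunks(
    store: dict[tuple[str, str], str], chunks: list[str], embedding_model: str
) -> dict[tuple[str, str], str]:
    """Insert or overwrite each chunk under its (embedding_model, content_hash) key."""
    for text in chunks:
        store[(embedding_model, content_hash(text))] = text
    return store


store: dict[tuple[str, str], str] = {}
upsert_chunks(store, ["alpha", "beta", "alpha"], "bge-m3")
upsert_chunks(store, ["alpha", "beta"], "bge-m3")  # second run adds nothing new
print(len(store))
```

Including the embedding model in the key means the same text embedded with a different model is stored as a separate row, which matches the multi-model comparison the evaluation runner performs.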

Tech stack

| Component | Role |
| --- | --- |
| Python 3.10+ | Language and runtime (CI runs 3.11 and 3.12) |
| PostgreSQL 16 + pgvector | Vector storage, HNSW cosine ANN, FTS via tsvector |
| SQLAlchemy 2 + psycopg 3 | DB engine and connection pool |
| LM Studio (OpenAI-compatible) | Embeddings (bge-m3) and chat (qwen2.5-7b-instruct) |
| sentence-transformers | Cross-Encoder re-ranking and local semantic embeddings |
| langchain-text-splitters | Recursive character chunking |
| pypdf | PDF page extraction |
| FastAPI + uvicorn | REST API |
| RAGAS (optional) | Faithfulness / answer relevancy / context precision metrics |
| pytest / pytest-cov | Tests and coverage |
| Ruff | Lint + format |

Prerequisites

| Requirement | Version tested | Notes |
| --- | --- | --- |
| Python | 3.11 / 3.12 / 3.14 | >=3.10 per pyproject.toml |
| PostgreSQL + pgvector | pg16 + pgvector 0.8.2 | via Docker (see below) |
| LM Studio | any | OpenAI-compatible server on localhost:1234 |
| Embedding model | gpustack/bge-m3-GGUF (bge-m3-Q8_0) | 1024d, multilingual |
| Chat model (for eval) | Qwen/Qwen2.5-7B-Instruct-GGUF Q4_K_M | ~4.7 GB, recommended for 12 GB GPUs |
| sentence-transformers | 5.4.x | Cross-Encoder re-ranking + local embeddings |
| Git LFS | any | to clone the Cross-Encoder weights |

No external job/database server beyond Postgres is required; the API and CLI scripts are stateless.

Installation and usage

From source (recommended)

git clone https://github.com/esousa97/py-rag-engine.git
cd py-rag-engine
pip install -e ".[api,embeddings]"
# add ".[eval]" too if you plan to run the RAGAS evaluation

Development install (editable)

pip install -e ".[api,embeddings,eval]"
pip install pytest

PyPI

There is no PyPI publish workflow yet — install from source as shown above. A publish.yml workflow would build wheels/sdists on release and upload them via PyPI trusted publishing; the badge link above is wired so that any future publish workflow renders without further edits.

Dependency review is a recommended CI add-on (via actions/dependency-review-action) and would run on every pull request alongside lint and tests. The current ci.yml runs the test matrix only.

Quick Start

1. Postgres + pgvector via Docker

$env:POSTGRES_PASSWORD = "<choose-a-strong-password>"
$env:POSTGRES_DB       = "rag"
$env:POSTGRES_PORT     = "5434"

docker run -d `
  --name rag-pgvector `
  -e POSTGRES_PASSWORD=$env:POSTGRES_PASSWORD `
  -e POSTGRES_DB=$env:POSTGRES_DB `
  -p "$($env:POSTGRES_PORT):5432" `
  pgvector/pgvector:pg16

2. LM Studio — embedding + chat model

Install LM Studio, download gpustack/bge-m3-GGUF (embeddings) and Qwen/Qwen2.5-7B-Instruct-GGUF Q4_K_M (chat, for grounded answers / evaluation), load both in Developer → Local Server, and click Start Server. Verify:

Invoke-RestMethod http://localhost:1234/v1/models | Select-Object -ExpandProperty data | Format-Table id

3. Run the REST API

$env:EVAL_POSTGRES_URL     = "postgresql+psycopg://postgres:$env:POSTGRES_PASSWORD@localhost:5434/rag"
$env:LM_STUDIO_BASE_URL    = "http://localhost:1234"
$env:LM_STUDIO_EMBED_MODEL = "text-embedding-bge-m3"
$env:LM_STUDIO_CHAT_MODEL  = "qwen2.5-7b-instruct"

.venv\Scripts\python.exe -m uvicorn "py_rag_engine.api:create_app" --factory --host 127.0.0.1 --port 8001

4. Drive it with curl

# Health
curl -s http://127.0.0.1:8001/health
# {"status":"ok","postgres":"ok","lm_studio":"ok"}

# Ingest
curl -X POST "http://127.0.0.1:8001/documents?chunk_size=1024" \
     -F "file=@data/gdp_document_0.pdf"

# Hybrid + Cross-Encoder rerank + grounded answer
curl -X POST http://127.0.0.1:8001/query -H "Content-Type: application/json" \
  -d '{"question":"How does HNSW work in pgvector?","top_k":3,
       "use_hybrid":true,"use_rerank":true,"generate_answer":true}'

Retrieval modes (single endpoint)

| use_hybrid | use_rerank | Mode | Score returned |
| --- | --- | --- | --- |
| false | false | dense (pgvector ANN only) | cosine similarity |
| true | false | hybrid (dense + FTS via RRF) | RRF score |
| false | true | dense_rerank (cosine + Cross-Encoder) | rerank score |
| true | true | hybrid_rerank (RRF + Cross-Encoder) | rerank score |
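
As a rough illustration of how the two toggles select a mode, the sketch below copies the four combinations into a lookup and builds the same /query request as the curl call using only the standard library. The helper name `build_query_request` is hypothetical, and the response schema is not documented in this README, so the sketch stops at constructing the request.

```python
import json
import urllib.request

# (use_hybrid, use_rerank) -> (mode, score returned), copied from the retrieval-modes table.
MODES: dict[tuple[bool, bool], tuple[str, str]] = {
    (False, False): ("dense", "cosine similarity"),
    (True, False): ("hybrid", "RRF score"),
    (False, True): ("dense_rerank", "rerank score"),
    (True, True): ("hybrid_rerank", "rerank score"),
}


def build_query_request(
    base_url: str,
    question: str,
    *,
    use_hybrid: bool,
    use_rerank: bool,
    top_k: int = 3,
    generate_answer: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST /query request with the JSON fields used in the curl example."""
    payload = {
        "question": question,
        "top_k": top_k,
        "use_hybrid": use_hybrid,
        "use_rerank": use_rerank,
        "generate_answer": generate_answer,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "http://127.0.0.1:8001",
    "How does HNSW work in pgvector?",
    use_hybrid=True,
    use_rerank=True,
)
mode, score_kind = MODES[(True, True)]
print(mode, score_kind, request.full_url)
```

With the API running, passing `request` to `urllib.request.urlopen` sends it; the scores in the response are only comparable within one mode, since cosine similarity, RRF scores, and Cross-Encoder scores live on different scales.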

End-to-end demo CLI

python scripts\demo_rerank.py data\gdp_document_0.pdf `
  --query "What is the geotechnical investigation methodology used in this report?" `
  --reranker-model "$PWD\.cache\ms-marco-MiniLM-L-6-v2"
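
The re-ranking stage this demo exercises can be sketched with an injectable scoring function, in the spirit of the injectable predict mentioned under Features. The `rerank` helper and the toy `overlap_predict` scorer are hypothetical stand-ins, not the repository's `CrossEncoderReranker`; a real Cross-Encoder would score each (query, passage) pair with a model instead of counting shared words.

```python
from collections.abc import Callable, Sequence

# A predict function scores a batch of (query, passage) pairs, one float per pair.
PredictFn = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str, candidates: list[str], predict: PredictFn, top_k: int = 5
) -> list[tuple[str, float]]:
    """Score every candidate against the query and keep the top_k highest scores."""
    pairs = [(query, text) for text in candidates]
    scores = predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:top_k]]


def overlap_predict(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer for tests: number of lowercase words shared by query and passage."""
    return [
        float(len(set(query.lower().split()) & set(text.lower().split())))
        for query, text in pairs
    ]


results = rerank(
    "hnsw index",
    ["postgres fts", "hnsw cosine index", "hnsw graph"],
    overlap_predict,
    top_k=2,
)
print(results)
```

Because the scorer is passed in, unit tests can run without downloading model weights, while production code can supply the `predict` method of a sentence-transformers `CrossEncoder`, which also accepts a list of text pairs.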

Programmatic chunker + embedder API

For ad-hoc scripts that just need to embed text or split it into semantic chunks (without the full ingestion → Postgres pipeline), the top-level SemanticChunker and VectorClient cover the entire flow with async + type-checked APIs:

import asyncio
from pathlib import Path

from openai import AsyncOpenAI
from py_rag_engine import SemanticChunker, VectorClient, calibrate_distance_threshold

async def main() -> None:
    # Any OpenAI-compatible endpoint. LM Studio uses a dummy api_key.
    openai_client = AsyncOpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")
    embedder = VectorClient(
        provider="openai",                # or "sentence-transformers" for local
        model="text-embedding-bge-m3",
        client=openai_client,
        batch_size=32,
        max_retries=5,                    # exponential backoff on HTTP 429
        initial_backoff=1.0,
    )

    # Calibrate the cosine-distance threshold from a representative sample.
    # For tiny / non-representative samples, prefer a fixed threshold.
    sample_text = Path("data/eval_document.md").read_text(encoding="utf-8")
    threshold = await calibrate_distance_threshold(
        sample_texts=sample_text,
        embedder=embedder,
        percentile=0.85,
        margin=0.05,
    )
    chunker = SemanticChunker(embedder=embedder, distance_threshold=threshold)

    # Equivalent one-liner:
    # chunker = await SemanticChunker.from_sample(sample_text, embedder=embedder)

    # read_text closes the file handle, unlike a bare open(...).read()
    text = Path("examples/article.md").read_text(encoding="utf-8")
    chunks = await chunker.chunk(text, page=1, source="article.md")

    for c in chunks:
        # idempotency: c.content_hash is sha256(c.text); safe to upsert
        print(c.metadata.chunk_index, c.content_hash[:10], c.text[:80])

asyncio.run(main())

Key points

  • SemanticChunker groups adjacent paragraphs whose cosine distance is below distance_threshold and emits DocumentChunk(text, metadata, content_hash) — the SHA-256 content_hash is what PostgresEmbeddingStore uses for idempotent upserts.
  • VectorClient.get_embeddings(texts) batches the input and retries RateLimitError / HTTP 429 with exponential backoff (initial_backoff * 2**attempt, capped at max_backoff, plus optional jitter). Non-rate-limit errors propagate immediately.
  • For bge-m3, same-topic paragraph distances empirically cluster around 0.45 and topic shifts around 0.65; DEFAULT_DISTANCE_THRESHOLD = 0.55 is tuned to sit in that gap. Run scripts/probe_semantic_distances.py to inspect distances for your own corpus.
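
The grouping rule in the first bullet can be sketched as follows: adjacent paragraphs stay in the same chunk while the cosine distance between their embeddings is below the threshold, and a new chunk starts otherwise. This is a simplified illustration with hypothetical helper names and toy 2-D vectors, not the `SemanticChunker` code; it compares only consecutive paragraphs and assumes non-zero vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero and of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def group_paragraphs(
    paragraphs: list[str], vectors: list[list[float]], distance_threshold: float = 0.55
) -> list[list[str]]:
    """Start a new group whenever the distance to the previous paragraph reaches the threshold."""
    if not paragraphs:
        return []
    groups = [[paragraphs[0]]]
    for i in range(1, len(paragraphs)):
        if cosine_distance(vectors[i - 1], vectors[i]) < distance_threshold:
            groups[-1].append(paragraphs[i])  # same topic: extend the current chunk
        else:
            groups.append([paragraphs[i]])    # topic shift: open a new chunk
    return groups


paragraphs = ["intro to HNSW", "HNSW parameters", "billing policy"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings for illustration
groups = group_paragraphs(paragraphs, vectors)
print(groups)
```

A threshold between the two empirical clusters (0.45 and 0.65 for bge-m3) keeps same-topic neighbours together while still splitting at topic shifts; a corpus with a different distance distribution would need its own value.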

See docs/architecture.md for the full API contract and calibration guidance.

Resilience (LM Studio retries)

The LMStudioClient wraps every embedding / chat call with retries on OSError, WinError 10054 (Windows socket reset under load), and malformed JSON, using exponential backoff tuned via LMStudioConfig.retries and LMStudioConfig.backoff. Failed calls are logged and surface to the API as 502 / 503 responses. Tune behaviour via env vars and the wiring in src/py_rag_engine/clients/lm_studio.py and src/py_rag_engine/api/routes.py.
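
A retry loop of this kind can be sketched as below, using the delay formula quoted under Key points (initial_backoff * 2**attempt, capped at max_backoff, plus optional jitter). The function names, the default of 30 seconds for max_backoff, and the injectable `sleep` are assumptions made for illustration; this is not the `LMStudioClient` source.

```python
import json
import random
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")

# Errors treated as transient: socket-level failures and malformed JSON bodies.
RETRYABLE = (OSError, json.JSONDecodeError)


def backoff_delay(
    attempt: int, initial_backoff: float = 1.0, max_backoff: float = 30.0, jitter: float = 0.0
) -> float:
    """Exponential delay capped at max_backoff, plus uniform jitter in [0, jitter]."""
    delay = min(initial_backoff * 2**attempt, max_backoff)
    return delay + random.uniform(0.0, jitter)


def call_with_retries(
    fn: Callable[[], T], retries: int = 3, sleep: Callable[[float], None] = time.sleep
) -> T:
    """Call fn, retrying transient errors up to `retries` times before re-raising."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == retries:
                raise  # out of attempts: let the caller map this to a 502 / 503
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky() -> str:
    """Simulated call that fails twice with a socket reset, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError("simulated socket reset")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky, retries=5, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Injecting `sleep` keeps the example fast to test; `ConnectionResetError` is a subclass of `OSError`, which is how a reset such as WinError 10054 surfaces in Python.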

Documentation

| Document | Contents |
| --- | --- |
| LICENSE | MIT License |
| docs/architecture.md | Module map, pipeline diagram, data flow |
| docs/evaluation.md | RAGAS metrics, run modes, report schema |
| pyproject.toml | Build config, optional extras (api, embeddings, eval), Ruff settings |

Project layout

| Path | Role |
| --- | --- |
| src/py_rag_engine/config.py | LMStudioConfig / PostgresConfig / EvalConfig (env-driven) |
| src/py_rag_engine/clients/lm_studio.py | LMStudioClient HTTP wrapper + detect_chat_model |
| src/py_rag_engine/domain.py | DocumentChunk, ChunkMetadata |
| src/py_rag_engine/vector_math.py | cosine_similarity (numpy) |
| src/py_rag_engine/ingestion/ | PDF + Markdown loaders, ingest_file / ingest_path orchestration |
| src/py_rag_engine/chunking/ | Recursive splitter with dynamic overlap; embedding-based semantic splitting |
| src/py_rag_engine/chunker.py | Public SemanticChunker + calibrate_distance_threshold (async; threshold auto-tuning) |
| src/py_rag_engine/embeddings/ | SHA-256 hashing, make_lm_studio_embed, make_sentence_transformer_embed |
| src/py_rag_engine/embedder.py | Public VectorClient (OpenAI-compatible or SentenceTransformer, async batching + rate-limit backoff) |
| src/py_rag_engine/storage/postgres.py | PostgresEmbeddingStore (pgvector HNSW + Postgres FTS) |
| src/py_rag_engine/retrieval/hybrid.py | retrieve_hybrid, reciprocal_rank_fusion |
| src/py_rag_engine/retrieval/rerank.py | CrossEncoderReranker, rerank_candidates |
| src/py_rag_engine/retrieval/service.py | retrieve_with_rerank, retrieve_hybrid_with_rerank |
| src/py_rag_engine/api/ | FastAPI factory + lifespan, REST routes, Pydantic schemas |
| src/py_rag_engine/generation/lm_studio_chat.py | generate_answer grounded in context |
| src/py_rag_engine/evaluation/ | Gold-standard loader, metrics, official-RAGAS adapter, EvalRunner |
| scripts/demo_rerank.py | E2E demo CLI (dense + Cross-Encoder rerank) |
| scripts/demo_hybrid.py | Hybrid Search demo CLI (dense vs FTS vs RRF) |
| scripts/eval_ragas.py | Offline RAGAS CLI |
| scripts/process_document.py | Standalone chunking CLI |
| data/ | Sample PDF metadata, eval document, gold-standard Q&A |
| tests/ | pytest suite (chunking, ingestion, storage, retrieval, API, evaluation) |
| .github/workflows/ci.yml | CI: pytest matrix on Python 3.11 / 3.12 |

Tests

pip install -e ".[api,embeddings]"
pip install pytest
pytest -q
132 passed, 1 skipped

The single skipped test is tests/test_postgres_integration.py. It is gated on TEST_POSTGRES_URL + LM_STUDIO_BASE_URL and, when both are set, embeds three sentences and round-trips them through pgvector.
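
One common way to implement such gating is to check the environment once and skip when anything is missing. The helper below is a hypothetical sketch of that check, not the code in tests/test_postgres_integration.py; only the two variable names come from this README.

```python
import os
from collections.abc import Mapping

# Both variables must be set (and non-empty) for the integration test to run.
REQUIRED_VARS = ("TEST_POSTGRES_URL", "LM_STUDIO_BASE_URL")


def missing_integration_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_integration_vars({"TEST_POSTGRES_URL": "postgresql+psycopg://..."})
print(missing)
```

In a pytest module this could back a marker such as `pytest.mark.skipif(bool(missing_integration_vars()), reason="integration env vars not set")`, which would produce the single skipped test reported above when neither service is configured.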

Integration test (full round-trip)

$env:POSTGRES_PASSWORD         = "<your-password>"
$env:TEST_POSTGRES_URL         = "postgresql+psycopg://postgres:$env:POSTGRES_PASSWORD@localhost:5434/rag"
$env:LM_STUDIO_BASE_URL        = "http://localhost:1234"
$env:LM_STUDIO_EMBEDDING_MODEL = "text-embedding-bge-m3"
$env:STORAGE_EMBEDDING_MODEL   = "bge-m3"
python -m pytest -q

Coverage

pip install -e ".[api,embeddings]"
pip install pytest pytest-cov
pytest --cov=py_rag_engine --cov-report=term-missing

Offline RAGAS evaluation

scripts/eval_ragas.py runs the full pipeline against a 10-question gold-standard set and writes a JSON report comparing chunk sizes and embedding models. Three metrics are computed per question — faithfulness, answer relevancy, context precision — and an overall_ranking block ranks configurations by mean score.

| Mode | Env var | Configs × Questions | Wall time¹ |
| --- | --- | --- | --- |
| Smoke (sanity check) | EVAL_SMOKE=1 | 1 × 1 | ~30 s |
| Quick (single model, 3 Qs) | EVAL_QUICK=1 EVAL_SKIP_MINILM=1 | 3 × 3 | ~3 min |
| Full single-model | EVAL_SKIP_MINILM=1 | 3 × 10 | ~8 min |
| Full comparison | (no flags) | 6 × 10 | ~15 min |

¹ Qwen2.5-7B-Instruct Q4_K_M on RTX 3060 12 GB. See docs/evaluation.md for the full metric definitions and report schema.
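
Ranking configurations by mean score can be sketched as below. The key names, configuration labels, and scores are invented for illustration and are not taken from a real report; the actual report schema is defined in docs/evaluation.md, and the runner may weight or aggregate metrics differently.

```python
from statistics import mean

# Hypothetical metric keys; the real report may name them differently.
METRICS = ("faithfulness", "answer_relevancy", "context_precision")


def rank_configs(results: dict[str, list[dict[str, float]]]) -> list[tuple[str, float]]:
    """Average the three metrics per question, then per config, and sort best-first."""
    ranking = []
    for config, rows in results.items():
        per_question = [mean(row[metric] for metric in METRICS) for row in rows]
        ranking.append((config, mean(per_question)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Illustrative numbers only, not measured results.
example = {
    "bge-m3 / chunk 512": [
        {"faithfulness": 0.9, "answer_relevancy": 0.8, "context_precision": 0.7},
    ],
    "bge-m3 / chunk 1024": [
        {"faithfulness": 0.6, "answer_relevancy": 0.6, "context_precision": 0.6},
    ],
}
ranking = rank_configs(example)
print(ranking[0][0])
```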

Contributing

Issues and pull requests are welcome. Run pytest -q and ruff check src tests before opening a PR; keep modules single-purpose and prefer extending existing scripts over adding new top-level entry points.

Changelog

Tracked via GitHub releases and the commit history.

License

MIT.

Author

Enoque Sousa

Project status: Study project
