
Intro To Agentic RAG System Documentation

This codebase implements an agentic RAG engine built on the LangGraph framework. It supports hierarchical (parent/child) indexing, hybrid semantic and keyword search, and pluggable LLM providers.

Table of Contents

Quick Start | Architecture Overview | Project Structure | Configuration Guide | Common Customizations | Observability | Advanced Topics | Troubleshooting


Quick Start

Installation

Install the necessary Python packages:

pip install -r requirements.txt

Starting the Web UI

Launch the interactive chat server:

python project/app.py

The application will be available at http://localhost:7860 (default Gradio port).

Prerequisites

| Requirement | Details |
| --- | --- |
| Python | 3.11 or higher |
| RAM | 16 GB for Ollama · 8 GB minimum for cloud providers |
| Ollama | Required only for local inference — install Ollama, then run ollama pull qwen3:4b-instruct-2507-q4_K_M |
| API key | Required only for cloud providers — copy project/.env.example to project/.env and fill in your key |

Architecture Overview

This system implements an advanced RAG pipeline with the following key features:

  • Parent-Child Chunking: Documents are split into small child chunks (for precise retrieval) linked to larger parent chunks (for rich context)
  • Hybrid Search: Combines dense embeddings and sparse (BM25) retrieval for optimal results (see the sketch after this list)
  • LangGraph Agent: Orchestrates query rewriting, retrieval, and response generation
  • Multi-Provider Support: Seamlessly switch between Ollama, OpenAI GPT, Google Gemini, and Anthropic Claude
  • Vector Storage: Uses Qdrant for efficient similarity search
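
As a minimal sketch of how hybrid retrieval can be wired up with langchain-qdrant (the collection and vector names below mirror project/config.py; the actual wiring lives in project/db/vector_db_manager.py and may differ):

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse, QdrantVectorStore, RetrievalMode
from qdrant_client import QdrantClient

client = QdrantClient(path="qdrant_db")  # local, file-backed Qdrant
# Assumes the collection already exists with matching dense + sparse config
store = QdrantVectorStore(
    client=client,
    collection_name="document_child_chunks",
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
    sparse_embedding=FastEmbedSparse(model_name="Qdrant/bm25"),
    retrieval_mode=RetrievalMode.HYBRID,  # fuse dense similarity with BM25 scores
    sparse_vector_name="sparse",
)
docs = store.similarity_search("How are parent chunks stored?", k=5)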

Data Flow

PDF → Markdown Conversion → Parent/Child Chunking → Vector Indexing → Agent Retrieval → LLM Response
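
To make the parent/child link in this pipeline concrete, a hypothetical illustration (the actual field names live in project/document_chunker.py and may differ):

# Hypothetical illustration of the parent-child link; real field names may differ.
parent = {"id": "p-001", "text": "<a 2000-4000 character section>"}
child = {
    "text": "<a ~500 character slice of the parent>",
    "metadata": {"parent_id": "p-001"},  # retrieval hits the child, then fetches its parent for context
}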

Project Structure

Entry Point & Configuration

| File | Purpose |
| --- | --- |
| project/app.py | Application entry point, launches Gradio UI |
| project/config.py | Central configuration hub - edit this for provider/model/chunking changes |
| project/utils.py | PDF to Markdown conversion and context token estimation |
| project/document_chunker.py | Parent/child splitting logic with cleaning and merging rules |
| project/Dockerfile | Dockerfile with Ollama for local deployment |

Core System

| File | Purpose |
| --- | --- |
| project/core/rag_system.py | System bootstrap - creates managers and compiles LangGraph agent |
| project/core/document_manager.py | Document ingestion pipeline (convert, chunk, index) |
| project/core/chat_interface.py | Thin wrapper for agent graph interaction |
| project/core/observability.py | Optional Langfuse tracing — callback handler lifecycle |

Database Layer

| File | Purpose |
| --- | --- |
| project/db/vector_db_manager.py | Qdrant client wrapper with embedding initialization |
| project/db/parent_store_manager.py | File-backed storage for parent chunks |

RAG Agent (LangGraph)

| File | Purpose |
| --- | --- |
| project/rag_agent/graph.py | Graph builder and compilation logic |
| project/rag_agent/graph_state.py | Shared and per-agent graph state definitions; answer accumulation/reset logic |
| project/rag_agent/nodes.py | Node implementations (summarize, rewrite, agent execution, aggregate) |
| project/rag_agent/edges.py | Conditional edge routing logic (e.g., routing based on query clarity) |
| project/rag_agent/tools.py | Retrieval tools (search_child_chunks, retrieve_parent_chunks) |
| project/rag_agent/prompts.py | System prompts for agent behavior |
| project/rag_agent/schemas.py | Structured output schemas (Pydantic models) |

User Interface

| File | Purpose |
| --- | --- |
| project/ui/css.py | Custom CSS styling for the Gradio interface |
| project/ui/gradio_app.py | Gradio UI implementation with document upload and chat |

Configuration Guide

All primary settings are in project/config.py. Key parameters:

Directory Configuration

MARKDOWN_DIR = "markdown_docs"        # Storage for converted PDF → Markdown files
PARENT_STORE_PATH = "parent_store"    # File-backed storage for parent chunks
QDRANT_DB_PATH = "qdrant_db"          # Local Qdrant vector database path

Qdrant Configuration

CHILD_COLLECTION = "document_child_chunks"  # Collection name for child chunks
SPARSE_VECTOR_NAME = "sparse"               # Named sparse vector field (BM25)
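
For reference, a sketch of how these names map onto Qdrant's collection API (the 768 dimension assumes the default dense model; the real creation logic lives in project/db/vector_db_manager.py):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, SparseVectorParams, VectorParams

client = QdrantClient(path="qdrant_db")
client.create_collection(
    collection_name="document_child_chunks",                          # CHILD_COLLECTION
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # all-mpnet-base-v2 -> 768 dims
    sparse_vectors_config={"sparse": SparseVectorParams()},           # SPARSE_VECTOR_NAME, used for BM25
)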

Model Configuration

# Default: single model configuration
DENSE_MODEL = "sentence-transformers/all-mpnet-base-v2"
SPARSE_MODEL = "Qdrant/bm25"
LLM_MODEL = "qwen3:4b-instruct-2507-q4_K_M"
LLM_TEMPERATURE = 0  # 0 = deterministic, 1 = creative

Agent Configuration

# Hard limits to prevent infinite loops
MAX_TOOL_CALLS = 8       # Maximum tool calls per agent run
MAX_ITERATIONS = 10      # Maximum agent loop iterations
GRAPH_RECURSION_LIMIT = 50 # Maximum number of steps before hitting a stop condition

# Context compression thresholds
BASE_TOKEN_THRESHOLD = 2000     # Initial token threshold for compression
TOKEN_GROWTH_FACTOR = 0.9       # Multiplier applied after each compression
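
The two compression values interact multiplicatively; a small illustration of the resulting schedule:

# Illustration only: each compression pass lowers the threshold,
# so long conversations get compressed progressively earlier.
threshold = 2000          # BASE_TOKEN_THRESHOLD
for step in range(3):
    print(f"pass {step}: compress when context exceeds {threshold:.0f} tokens")
    threshold *= 0.9      # TOKEN_GROWTH_FACTOR
# pass 0: compress when context exceeds 2000 tokens
# pass 1: compress when context exceeds 1800 tokens
# pass 2: compress when context exceeds 1620 tokens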

Text Splitter Configuration

CHILD_CHUNK_SIZE = 500              # Size of chunks used for retrieval
CHILD_CHUNK_OVERLAP = 100           # Overlap between chunks (prevents context loss)
MIN_PARENT_SIZE = 2000              # Minimum parent chunk size
MAX_PARENT_SIZE = 4000             # Maximum parent chunk size

# Markdown header splitting strategy
HEADERS_TO_SPLIT_ON = [
    ("#", "H1"),
    ("##", "H2"),
    ("###", "H3")
]
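
A minimal sketch of how a headers list like this drives LangChain's Markdown splitter (the project's full pipeline in project/document_chunker.py adds cleaning and merging on top):

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]  # HEADERS_TO_SPLIT_ON
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
sections = splitter.split_text("# Guide\n\nIntro text.\n\n## Setup\n\nSteps here.")
for doc in sections:
    print(doc.metadata, "->", doc.page_content)
# {'H1': 'Guide'} -> Intro text.
# {'H1': 'Guide', 'H2': 'Setup'} -> Steps here.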

Langfuse Observability (Optional)

LANGFUSE_ENABLED = False               # Set to True via LANGFUSE_ENABLED env var
LANGFUSE_PUBLIC_KEY = ""               # From your Langfuse project settings
LANGFUSE_SECRET_KEY = ""               # From your Langfuse project settings
LANGFUSE_BASE_URL = "http://localhost:3000"  # Langfuse Cloud or self-hosted URL

Common Customizations

1. Switching LLM Provider (Single Provider)

Performance Note: LLMs with 7B+ parameters typically offer superior reasoning, context comprehension, and response quality compared to smaller models. This applies to both proprietary and open-source models, as long as they support native tool/function calling, which is required for agentic RAG workflows.

If you want to permanently switch from one provider to another (e.g., Ollama → Google Gemini), follow these steps:

Step 1: Install the provider's SDK

pip install langchain-google-genai

Step 2: Set environment variable

export GOOGLE_API_KEY="your-google-key"

Step 3: Update project/config.py

LLM_MODEL = "gemini-2.5-pro"
LLM_TEMPERATURE = 0

Step 4: Modify project/core/rag_system.py

Replace:

llm = ChatOllama(model=config.LLM_MODEL, temperature=config.LLM_TEMPERATURE)

With:

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model=config.LLM_MODEL, temperature=config.LLM_TEMPERATURE)

2. Multi-Provider Configuration

This approach allows you to maintain multiple provider configurations and switch between them easily.

Step 1: Install required SDKs

pip install langchain-openai langchain-anthropic langchain-google-genai

Step 2: Set environment variables

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"

Step 3: Update project/config.py with multi-provider configuration

# --- Multi-Provider LLM Configuration ---
LLM_CONFIGS = {
    "ollama": {
        "model": "ministral-3:14b-instruct-2512-q4_K_M",
        "url":"http://localhost:11434",
        "temperature": 0
    },
    "openai": {
        "model": "gpt-4o",
        "temperature": 0
    },
    "anthropic": {
        "model": "claude-sonnet-4-6",
        "temperature": 0
    },
    "google": {
        "model": "gemini-2.5-flash",
        "temperature": 0
    }
}

# Switch providers by changing this single line
ACTIVE_LLM_CONFIG = "ollama"

Step 4: Modify project/core/rag_system.py in the initialize() method

Replace the existing LLM initialization with:

def initialize(self):
    self.vector_db.create_collection(self.collection_name)
    collection = self.vector_db.get_collection(self.collection_name)
    
    # Load active configuration
    active_config = config.LLM_CONFIGS[config.ACTIVE_LLM_CONFIG]
    model = active_config["model"]
    temperature = active_config["temperature"]
    
    if config.ACTIVE_LLM_CONFIG == "ollama":
        from langchain_ollama import ChatOllama
        llm = ChatOllama(model=model, temperature=temperature, base_url=active_config["url"])
        
    elif config.ACTIVE_LLM_CONFIG == "openai":
        from langchain_openai import ChatOpenAI
        llm = ChatOpenAI(model=model, temperature=temperature)
        
    elif config.ACTIVE_LLM_CONFIG == "anthropic":
        from langchain_anthropic import ChatAnthropic
        llm = ChatAnthropic(model=model, temperature=temperature)
        
    elif config.ACTIVE_LLM_CONFIG == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        llm = ChatGoogleGenerativeAI(model=model, temperature=temperature)
        
    else:
        raise ValueError(f"Unsupported LLM provider: {config.ACTIVE_LLM_CONFIG}")
    
    # Continue with tool and graph initialization
    tools = ToolFactory(collection).create_tools()
    self.agent_graph = create_agent_graph(llm, tools)

Switching Providers: Simply change ACTIVE_LLM_CONFIG in config.py:

ACTIVE_LLM_CONFIG = "google"  # Switch to Gemini Pro
# ACTIVE_LLM_CONFIG = "anthropic"  # Or Claude Sonnet
# ACTIVE_LLM_CONFIG = "openai"  # Or GPT-4o

Provider Reference Table:

| Provider | Environment Variable | Import Statement | Example Models |
| --- | --- | --- | --- |
| OpenAI | OPENAI_API_KEY | from langchain_openai import ChatOpenAI | gpt-4o, gpt-4o-mini |
| Anthropic | ANTHROPIC_API_KEY | from langchain_anthropic import ChatAnthropic | claude-opus-4-6, claude-sonnet-4-6 |
| Google | GOOGLE_API_KEY | from langchain_google_genai import ChatGoogleGenerativeAI | gemini-2.5-pro, gemini-2.5-flash |
| Ollama | None (local) | from langchain_ollama import ChatOllama | qwen3:4b-instruct-2507-q4_K_M, ministral-3:8b-instruct-2512-q4_K_M, llama3.1:8b-instruct-q6_K |

3. Changing Embedding Models

Why change? Trade-offs between speed, cost, and quality.

Step 1: Update project/config.py

# Example: Faster, smaller model
DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Example: Higher quality, slower model
# DENSE_MODEL = "sentence-transformers/all-mpnet-base-v2"

# Example: Gemma embeddings (Google's open model)
# DENSE_MODEL = "google/embeddinggemma-300m"

# Example: Qwen embeddings (Alibaba's multilingual model)
# DENSE_MODEL = "Qwen/Qwen3-Embedding-8B"

SPARSE_MODEL = "Qdrant/bm25"  # Usually no need to change

Step 2: Re-index your documents

⚠️ Important: Changing embeddings requires re-indexing all documents through the Gradio UI.

Implementation Details (in project/db/vector_db_manager.py):

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse
import config

self.__dense_embeddings = HuggingFaceEmbeddings(model_name=config.DENSE_MODEL)
self.__sparse_embeddings = FastEmbedSparse(model_name=config.SPARSE_MODEL)
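
After changing DENSE_MODEL, a quick sanity check of the new vector dimension (useful when recreating the Qdrant collection):

from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print(len(emb.embed_query("sanity check")))  # 384 for all-MiniLM-L6-v2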

Popular Embedding Models:

| Model | Context Size | Vector Dimension | Speed | Quality | Use Case |
| --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 256 tokens | 384 | Fast | Good | General purpose, quick semantic similarity |
| all-mpnet-base-v2 | 512 tokens | 768 | Medium | Excellent | High-accuracy semantic search |
| bge-large-en-v1.5 | 512 tokens | 1024 | Slow | Best | Production-grade retrieval on GPU |
| google/embeddinggemma-300m | 2048 tokens | 768 | Fast | Very Good | Lightweight, efficient multilingual retrieval |
| Qwen/Qwen3-Embedding-8B | 32768 tokens | 4096 | Slow | Excellent (SOTA) | Large-scale multilingual embeddings, long-context RAG |

4. Adjusting Chunking Strategy

Why adjust? Balance between retrieval precision and context richness.

Step 1: Update chunk sizes in project/config.py

# For short, factual queries (e.g., technical documentation)
CHILD_CHUNK_SIZE = 300
CHILD_CHUNK_OVERLAP = 60
MIN_PARENT_SIZE = 1500
MAX_PARENT_SIZE = 8000

# For narrative or contextual queries (e.g., legal documents)
# CHILD_CHUNK_SIZE = 800
# CHILD_CHUNK_OVERLAP = 150
# MIN_PARENT_SIZE = 3000
# MAX_PARENT_SIZE = 15000

Step 2 (Optional): Replace the splitter in project/document_chunker.py

Default (Character-based):

from langchain_text_splitters import RecursiveCharacterTextSplitter

self.__child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=config.CHILD_CHUNK_SIZE,
    chunk_overlap=config.CHILD_CHUNK_OVERLAP
)

Alternative (Sentence-aware):

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Note: this splitter counts tokens, not characters, and tokens_per_chunk
# must not exceed the underlying sentence-transformer's token window.
self.__child_splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=config.CHILD_CHUNK_SIZE,
    chunk_overlap=config.CHILD_CHUNK_OVERLAP
)

Step 3: Re-run ingestion pipeline

Upload documents again through the Gradio interface to apply new chunking.

Chunking Guidelines:

⚠️ Disclaimer: These are empirical guidelines. Optimal sizes depend on:

  • Child chunk → embedding model's context window (e.g. 256 tokens for all-MiniLM-L6-v2, 512 for bge-large-en-v1.5): child size should not exceed it
  • Parent chunk → generative model's context window (e.g. 8K, 32K, 128K tokens): parent must fit within the context sent to the LLM alongside the query

Always validate values empirically on your own corpus.

| Document Type | Child Size | Parent Size | Reasoning |
| --- | --- | --- | --- |
| Technical Docs | 300-500 | 2000-4000 | Precise lookups, code snippets |
| Legal Contracts | 600-1000 | 5000-15000 | Context-heavy, definitions |
| Research Papers | 400-600 | 3000-8000 | Balance of precision and context |
| FAQs / Knowledge Base | 200-400 | 1500-4000 | Short, focused answers |
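
The sizes above are in characters; to compare them against token windows, a rough heuristic (the ~4 characters per token ratio is an assumption for English text and varies by tokenizer):

# Rough heuristic: ~4 characters per English token (assumption)
def approx_tokens(chars: int) -> int:
    return chars // 4

print(approx_tokens(500))   # ~125 tokens: fits a 256-token embedding window
print(approx_tokens(4000))  # ~1000 tokens: parent must still fit the LLM context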

5. Agent Configuration

Tune agent behavior in project/config.py:

# Hard limits to prevent infinite loops
MAX_TOOL_CALLS = 8       # Maximum tool calls per agent run
MAX_ITERATIONS = 10      # Maximum agent loop iterations
GRAPH_RECURSION_LIMIT = 50 # Maximum number of steps before hitting a stop condition

# Context compression thresholds
BASE_TOKEN_THRESHOLD = 2000     # Initial token threshold for compression
TOKEN_GROWTH_FACTOR = 0.9       # Multiplier applied after each compression

| Parameter | Effect |
| --- | --- |
| MAX_TOOL_CALLS | Increase for complex queries, decrease to speed up simple ones |
| MAX_ITERATIONS | Controls how many reasoning loops the agent can run |
| GRAPH_RECURSION_LIMIT | Increase for complex graphs |
| BASE_TOKEN_THRESHOLD | Delay compression by increasing this value |
| TOKEN_GROWTH_FACTOR | Lower values compress more aggressively |
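
Note that GRAPH_RECURSION_LIMIT is enforced by LangGraph at invocation time; a sketch of how a compiled graph accepts it (agent_graph here stands in for the graph compiled in project/core/rag_system.py):

# Sketch: pass the limit through the standard LangGraph runtime config.
result = agent_graph.invoke(
    {"question": "What is parent-child chunking?"},
    config={"recursion_limit": config.GRAPH_RECURSION_LIMIT},
)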

Observability

Optional tracing with Langfuse captures every LLM call, tool invocation, and graph transition, which is useful for debugging agent behavior, tracking costs, and evaluating retrieval quality.

Enabling Langfuse

  1. Sign up on Langfuse Cloud, create an organization, then create a project and generate API keys from the project settings.
  2. Set environment variables (or copy .env.example to .env):

export LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_BASE_URL=https://cloud.langfuse.com   # or your self-hosted URL

  3. Run the app normally. Traces appear in your Langfuse dashboard.

To disable tracing, set LANGFUSE_ENABLED=false or leave the variables unset. The application runs identically either way.

For additional details on integrating Langfuse with LangChain or LangGraph, see the official documentation.
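
Under the hood, tracing amounts to attaching a Langfuse callback handler to graph invocations. A minimal sketch, assuming the v2-style SDK import (project/core/observability.py manages the handler lifecycle and may differ):

import os
from langfuse.callback import CallbackHandler  # v2-style import; v3 moved it to langfuse.langchain

handler = CallbackHandler(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_BASE_URL", "https://cloud.langfuse.com"),
)

# Attach to any LangChain/LangGraph call to trace it:
result = agent_graph.invoke(state, config={"callbacks": [handler]})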

What gets traced

| Component | Traced operations |
| --- | --- |
| Graph nodes | summarize_history, rewrite_query, orchestrator, compress_context, fallback_response, aggregate_answers |
| Tools | search_child_chunks, retrieve_parent_chunks (arguments + results) |
| Structured output | QueryAnalysis parsing in the rewrite step |
| Subgraph fan-out | Parallel agent invocations via Send() |

Hosting options

For a detailed comparison of observability platforms (LangSmith, Arize Phoenix, AgentOps, Braintrust, Helicone) and the full self-hosting setup, see observability.ipynb.


Advanced Topics

Customizing the RAG Agent

Location: project/rag_agent/

Add/Remove Nodes: Edit graph.py and nodes.py

Example: Adding a fact-checking node

# In nodes.py
def fact_check_node(state):
    # Your fact-checking logic
    return {"fact_checked": True}

# In graph.py
builder.add_node("fact_check", fact_check_node)
builder.add_edge("retrieve", "fact_check")

Modify Conditional Routing: Edit edges.py to change graph flow logic

Example from the system - routing based on query clarity:

from typing import Literal

from langgraph.types import Send  # Send enables fan-out to parallel agent runs

def route_after_rewrite(state: State) -> Literal["request_clarification"] | list[Send]:
    """Routes to human input if the question is unclear; otherwise processes all rewritten queries."""
    if not state.get("questionIsClear", False):
        return "request_clarification"
    # Fan-out: send each rewritten question to parallel processing
    return [
        Send("agent", {"question": query, "question_index": idx, "messages": []})
        for idx, query in enumerate(state["rewrittenQuestions"])
    ]

This pattern allows the agent to either request clarification from the user or fan-out multiple query variations for parallel retrieval.

Modify Prompts: Edit prompts.py to change agent behavior and response style

Add Custom Tools: Extend tools.py with new retrieval strategies or external integrations

Replacing Storage Backends

Vector Database:

  • Default: Local Qdrant
  • Alternatives: Remote Qdrant Cloud, Pinecone, Weaviate
  • Edit: project/db/vector_db_manager.py

Parent Store:

  • Default: JSON file
  • Alternatives: PostgreSQL, MongoDB, S3
  • Edit: project/db/parent_store_manager.py
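
For example, swapping the JSON file for PostgreSQL mostly means reimplementing the store's read/write methods. A hypothetical sketch (the method names, table schema, and psycopg2-style connection are assumptions, not the project's actual interface):

# Hypothetical parent-store replacement; the real interface in
# project/db/parent_store_manager.py may use different method names.
class PostgresParentStore:
    def __init__(self, conn):
        self.conn = conn  # e.g., a psycopg2 connection

    def put(self, chunk_id: str, content: str) -> None:
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO parent_chunks (id, content) VALUES (%s, %s) "
                "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content",
                (chunk_id, content),
            )
        self.conn.commit()

    def get(self, chunk_id: str) -> str | None:
        with self.conn.cursor() as cur:
            cur.execute("SELECT content FROM parent_chunks WHERE id = %s", (chunk_id,))
            row = cur.fetchone()
        return row[0] if row else None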

Extending the UI

Location: project/ui/gradio_app.py

Add runtime settings, admin panels, or analytics:

with gr.Accordion("Advanced Settings", open=False):
    provider_dropdown = gr.Dropdown(
        choices=["openai", "anthropic", "google", "ollama"],
        label="LLM Provider"
    )

Docker Deployment

⚠️ System Requirements: At least 8GB of RAM allocated to Docker. The default Ollama model needs approximately 3.3GB to run.

Build and Run

# Build image
docker build -t agentic-rag -f project/Dockerfile .

# Run container
docker run --name rag-assistant -p 7860:7860 agentic-rag

Optional: GPU acceleration (NVIDIA only):

docker run --gpus all --name rag-assistant -p 7860:7860 agentic-rag

Common commands:

docker stop rag-assistant      # Stop
docker start rag-assistant     # Restart
docker logs -f rag-assistant   # View logs
docker rm -f rag-assistant     # Remove

⚠️ Performance Note: On Windows/Mac, Docker runs via a Linux VM which may slow down I/O operations like document indexing. LLM inference speed is largely unaffected. On Linux, performance is comparable to running locally.

⚠️ Data Persistence: The default docker run command does not persist indexed documents — they are lost when the container is removed. Mount a volume to keep them:

docker run --name rag-assistant -p 7860:7860 \
  -v $(pwd)/qdrant_db:/app/qdrant_db \
  -v $(pwd)/parent_store:/app/parent_store \
  agentic-rag

Once running, open http://localhost:7860.


Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| "Model not found" error | Incorrect model name for provider | Verify LLM_MODEL matches the provider's API (e.g., gpt-4o-mini, not gpt4-mini) |
| Low-quality retrieval results | Poor embedding model or chunk configuration | Re-index with better embeddings (e.g., all-mpnet-base-v2) or adjust chunk sizes |
| Slow response times | Large embedding model or high top_k value | Use a smaller embedding model (e.g., all-MiniLM-L6-v2) or reduce top_k in the retrieval tools |
| API rate limits exceeded | Too many requests to an external provider | Add retry logic with exponential backoff or switch to local Ollama models |
| Out-of-memory errors | Large document set or embedding model | Use smaller embeddings, reduce batch size, or enable GPU acceleration |
| Empty retrieval results | Collection not indexed or wrong collection name | Verify documents are uploaded and the CHILD_COLLECTION name matches the config |
| Import errors after provider switch | Missing SDK installation | Install the required package: pip install langchain-{provider} |
| Inconsistent answers across runs | High temperature setting | Set LLM_TEMPERATURE = 0 in config for deterministic responses |