
Intro To Agentic RAG System Documentation

This codebase implements an agentic RAG engine built on the LangGraph framework. It supports hierarchical (parent/child) indexing, hybrid semantic and keyword search, and pluggable LLM providers.

Table of Contents

Quick Start | Architecture Overview | Project Structure | Configuration Guide | Common Customizations | Observability | Advanced Topics | Troubleshooting


Quick Start

Installation

Install the necessary Python packages:

pip install -r requirements.txt

Starting the Web UI

Launch the interactive chat server:

python project/app.py

The application will be available at http://localhost:7860 (default Gradio port).

Prerequisites

| Requirement | Details |
| --- | --- |
| Python | 3.11 or higher |
| RAM | 16 GB for Ollama · 8 GB minimum for cloud providers |
| Ollama | Required only for local inference — install Ollama, then run ollama pull qwen3:4b-instruct-2507-q4_K_M |
| API key | Required only for cloud providers — copy project/.env.example to project/.env and fill in your key |

Architecture Overview

This system implements an advanced RAG pipeline with the following key features:

  • Parent-Child Chunking: Documents are split into small child chunks (for precise retrieval) linked to larger parent chunks (for rich context)
  • Hybrid Search: Combines dense embeddings and sparse (BM25) retrieval for optimal results (see the sketch after this list)
  • LangGraph Agent: Orchestrates query rewriting, retrieval, and response generation
  • Multi-Provider Support: Seamlessly switch between Ollama, OpenAI GPT, Google Gemini, and Anthropic Claude
  • Vector Storage: Uses Qdrant for efficient similarity search
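
As a minimal sketch of how hybrid retrieval can be wired up with langchain-qdrant (the collection and vector names below mirror project/config.py; the actual wiring lives in project/db/vector_db_manager.py and may differ):

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse, QdrantVectorStore, RetrievalMode
from qdrant_client import QdrantClient

client = QdrantClient(path="qdrant_db")  # local, file-backed Qdrant
# Assumes the collection already exists with matching dense + sparse config
store = QdrantVectorStore(
    client=client,
    collection_name="document_child_chunks",
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"),
    sparse_embedding=FastEmbedSparse(model_name="Qdrant/bm25"),
    retrieval_mode=RetrievalMode.HYBRID,  # fuse dense similarity with BM25 scores
    sparse_vector_name="sparse",
)
docs = store.similarity_search("How are parent chunks stored?", k=5)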

Data Flow

PDF → Markdown Conversion → Parent/Child Chunking → Vector Indexing → Agent Retrieval → LLM Response
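
To make the parent/child link in this pipeline concrete, a hypothetical illustration (the actual field names live in project/document_chunker.py and may differ):

# Hypothetical illustration of the parent-child link; real field names may differ.
parent = {"id": "p-001", "text": "<a 2000-4000 character section>"}
child = {
    "text": "<a ~500 character slice of the parent>",
    "metadata": {"parent_id": "p-001"},  # retrieval hits the child, then fetches its parent for context
}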

Project Structure

Entry Point & Configuration

| File | Purpose |
| --- | --- |
| project/app.py | Application entry point, launches Gradio UI |
| project/config.py | Central configuration hub - edit this for provider/model/chunking changes |
| project/utils.py | PDF to Markdown conversion and context token estimation |
| project/document_chunker.py | Parent/child splitting logic with cleaning and merging rules |
| project/Dockerfile | Dockerfile with Ollama for local deployment |

Core System

| File | Purpose |
| --- | --- |
| project/core/rag_system.py | System bootstrap - creates managers and compiles LangGraph agent |
| project/core/document_manager.py | Document ingestion pipeline (convert, chunk, index) |
| project/core/chat_interface.py | Thin wrapper for agent graph interaction |
| project/core/observability.py | Optional Langfuse tracing — callback handler lifecycle |

Database Layer

| File | Purpose |
| --- | --- |
| project/db/vector_db_manager.py | Qdrant client wrapper with embedding initialization |
| project/db/parent_store_manager.py | File-backed storage for parent chunks |

RAG Agent (LangGraph)

| File | Purpose |
| --- | --- |
| project/rag_agent/graph.py | Graph builder and compilation logic |
| project/rag_agent/graph_state.py | Shared and per-agent graph state definitions; answer accumulation/reset logic |
| project/rag_agent/nodes.py | Node implementations (summarize, rewrite, agent execution, aggregate) |
| project/rag_agent/edges.py | Conditional edge routing logic (e.g., routing based on query clarity) |
| project/rag_agent/tools.py | Retrieval tools (search_child_chunks, retrieve_parent_chunks) |
| project/rag_agent/prompts.py | System prompts for agent behavior |
| project/rag_agent/schemas.py | Structured output schemas (Pydantic models) |

User Interface

| File | Purpose |
| --- | --- |
| project/ui/css.py | Custom CSS styling for the Gradio interface |
| project/ui/gradio_app.py | Gradio UI implementation with document upload and chat |

Configuration Guide

All primary settings are in project/config.py. Key parameters:

Directory Configuration

MARKDOWN_DIR = "markdown_docs"        # Storage for converted PDF → Markdown files
PARENT_STORE_PATH = "parent_store"    # File-backed storage for parent chunks
QDRANT_DB_PATH = "qdrant_db"          # Local Qdrant vector database path

Qdrant Configuration

CHILD_COLLECTION = "document_child_chunks"  # Collection name for child chunks
SPARSE_VECTOR_NAME = "sparse"               # Named sparse vector field (BM25)
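
For reference, a sketch of how these names map onto Qdrant's collection API (the 768 dimension assumes the default dense model; the real creation logic lives in project/db/vector_db_manager.py):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, SparseVectorParams, VectorParams

client = QdrantClient(path="qdrant_db")
client.create_collection(
    collection_name="document_child_chunks",                          # CHILD_COLLECTION
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # all-mpnet-base-v2 -> 768 dims
    sparse_vectors_config={"sparse": SparseVectorParams()},           # SPARSE_VECTOR_NAME, used for BM25
)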

Model Configuration

# Default: single model configuration
DENSE_MODEL = "sentence-transformers/all-mpnet-base-v2"
SPARSE_MODEL = "Qdrant/bm25"
LLM_MODEL = "qwen3:4b-instruct-2507-q4_K_M"
LLM_TEMPERATURE = 0  # 0 = deterministic, 1 = creative

Agent Configuration

# Hard limits to prevent infinite loops
MAX_TOOL_CALLS = 8       # Maximum tool calls per agent run
MAX_ITERATIONS = 10      # Maximum agent loop iterations
GRAPH_RECURSION_LIMIT = 50 # Maximum number of steps before hitting a stop condition

# Context compression thresholds
BASE_TOKEN_THRESHOLD = 2000     # Initial token threshold for compression
TOKEN_GROWTH_FACTOR = 0.9       # Multiplier applied after each compression
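
The two compression values interact multiplicatively; a small illustration of the resulting schedule:

# Illustration only: each compression pass lowers the threshold,
# so long conversations get compressed progressively earlier.
threshold = 2000          # BASE_TOKEN_THRESHOLD
for step in range(3):
    print(f"pass {step}: compress when context exceeds {threshold:.0f} tokens")
    threshold *= 0.9      # TOKEN_GROWTH_FACTOR
# pass 0: compress when context exceeds 2000 tokens
# pass 1: compress when context exceeds 1800 tokens
# pass 2: compress when context exceeds 1620 tokens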

Text Splitter Configuration

CHILD_CHUNK_SIZE = 500              # Size of chunks used for retrieval
CHILD_CHUNK_OVERLAP = 100           # Overlap between chunks (prevents context loss)
MIN_PARENT_SIZE = 2000              # Minimum parent chunk size
MAX_PARENT_SIZE = 4000             # Maximum parent chunk size

# Markdown header splitting strategy
HEADERS_TO_SPLIT_ON = [
    ("#", "H1"),
    ("##", "H2"),
    ("###", "H3")
]
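
A minimal sketch of how a headers list like this drives LangChain's Markdown splitter (the project's full pipeline in project/document_chunker.py adds cleaning and merging on top):

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]  # HEADERS_TO_SPLIT_ON
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
sections = splitter.split_text("# Guide\n\nIntro text.\n\n## Setup\n\nSteps here.")
for doc in sections:
    print(doc.metadata, "->", doc.page_content)
# {'H1': 'Guide'} -> Intro text.
# {'H1': 'Guide', 'H2': 'Setup'} -> Steps here.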

Langfuse Observability (Optional)

LANGFUSE_ENABLED = False               # Set to True via LANGFUSE_ENABLED env var
LANGFUSE_PUBLIC_KEY = ""               # From your Langfuse project settings
LANGFUSE_SECRET_KEY = ""               # From your Langfuse project settings
LANGFUSE_BASE_URL = "http://localhost:3000"  # Langfuse Cloud or self-hosted URL

Common Customizations

1. Switching LLM Provider (Single Provider)

Performance Note: LLMs with 7B+ parameters typically offer superior reasoning, context comprehension, and response quality compared to smaller models. This applies to both proprietary and open-source models, as long as they support native tool/function calling, which is required for agentic RAG workflows.

If you want to permanently switch from one provider to another (e.g., Ollama → Google Gemini), follow these steps:

Step 1: Install the provider's SDK

pip install langchain-google-genai

Step 2: Set environment variable

export GOOGLE_API_KEY="your-google-key"

Step 3: Update project/config.py

LLM_MODEL = "gemini-2.5-pro"
LLM_TEMPERATURE = 0

Step 4: Modify project/core/rag_system.py

Replace:

llm = ChatOllama(model=config.LLM_MODEL, temperature=config.LLM_TEMPERATURE)

With:

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model=config.LLM_MODEL, temperature=config.LLM_TEMPERATURE)

2. Multi-Provider Configuration

This approach allows you to maintain multiple provider configurations and switch between them easily.

Step 1: Install required SDKs

pip install langchain-openai langchain-anthropic langchain-google-genai

Step 2: Set environment variables

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"

Step 3: Update project/config.py with multi-provider configuration

# --- Multi-Provider LLM Configuration ---
LLM_CONFIGS = {
    "ollama": {
        "model": "ministral-3:14b-instruct-2512-q4_K_M",
        "url":"http://localhost:11434",
        "temperature": 0
    },
    "openai": {
        "model": "gpt-4o",
        "temperature": 0
    },
    "anthropic": {
        "model": "claude-sonnet-4-6",
        "temperature": 0
    },
    "google": {
        "model": "gemini-2.5-flash",
        "temperature": 0
    }
}

# Switch providers by changing this single line
ACTIVE_LLM_CONFIG = "ollama"

Step 4: Modify project/core/rag_system.py in the initialize() method

Replace the existing LLM initialization with:

def initialize(self):
    self.vector_db.create_collection(self.collection_name)
    collection = self.vector_db.get_collection(self.collection_name)
    
    # Load active configuration
    active_config = config.LLM_CONFIGS[config.ACTIVE_LLM_CONFIG]
    model = active_config["model"]
    temperature = active_config["temperature"]
    
    if config.ACTIVE_LLM_CONFIG == "ollama":
        from langchain_ollama import ChatOllama
        llm = ChatOllama(model=model, temperature=temperature, base_url=active_config["url"])
        
    elif config.ACTIVE_LLM_CONFIG == "openai":
        from langchain_openai import ChatOpenAI
        llm = ChatOpenAI(model=model, temperature=temperature)
        
    elif config.ACTIVE_LLM_CONFIG == "anthropic":
        from langchain_anthropic import ChatAnthropic
        llm = ChatAnthropic(model=model, temperature=temperature)
        
    elif config.ACTIVE_LLM_CONFIG == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        llm = ChatGoogleGenerativeAI(model=model, temperature=temperature)
        
    else:
        raise ValueError(f"Unsupported LLM provider: {config.ACTIVE_LLM_CONFIG}")
    
    # Continue with tool and graph initialization
    tools = ToolFactory(collection).create_tools()
    self.agent_graph = create_agent_graph(llm, tools)

Switching Providers: Simply change ACTIVE_LLM_CONFIG in config.py:

ACTIVE_LLM_CONFIG = "google"  # Switch to Gemini Pro
# ACTIVE_LLM_CONFIG = "anthropic"  # Or Claude Sonnet
# ACTIVE_LLM_CONFIG = "openai"  # Or GPT-4o

Provider Reference Table:

| Provider | Environment Variable | Import Statement | Example Models |
| --- | --- | --- | --- |
| OpenAI | OPENAI_API_KEY | from langchain_openai import ChatOpenAI | gpt-4o, gpt-4o-mini |
| Anthropic | ANTHROPIC_API_KEY | from langchain_anthropic import ChatAnthropic | claude-opus-4-6, claude-sonnet-4-6 |
| Google | GOOGLE_API_KEY | from langchain_google_genai import ChatGoogleGenerativeAI | gemini-2.5-pro, gemini-2.5-flash |
| Ollama | None (local) | from langchain_ollama import ChatOllama | qwen3:4b-instruct-2507-q4_K_M, ministral-3:8b-instruct-2512-q4_K_M, llama3.1:8b-instruct-q6_K |

3. Changing Embedding Models

Why change? Trade-offs between speed, cost, and quality.

Step 1: Update project/config.py

# Example: Faster, smaller model
DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Example: Higher quality, slower model
# DENSE_MODEL = "sentence-transformers/all-mpnet-base-v2"

# Example: Gemma embeddings (Google's open model)
# DENSE_MODEL = "google/embeddinggemma-300m"

# Example: Qwen embeddings (Alibaba's multilingual model)
# DENSE_MODEL = "Qwen/Qwen3-Embedding-8B"

SPARSE_MODEL = "Qdrant/bm25"  # Usually no need to change

Step 2: Re-index your documents

⚠️ Important: Changing embeddings requires re-indexing all documents through the Gradio UI.

Implementation Details (in project/db/vector_db_manager.py):

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse
import config

self.__dense_embeddings = HuggingFaceEmbeddings(model_name=config.DENSE_MODEL)
self.__sparse_embeddings = FastEmbedSparse(model_name=config.SPARSE_MODEL)
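
After changing DENSE_MODEL, a quick sanity check of the new vector dimension (useful when recreating the Qdrant collection):

from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print(len(emb.embed_query("sanity check")))  # 384 for all-MiniLM-L6-v2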

Popular Embedding Models:

| Model | Context Size | Vector Dimension | Speed | Quality | Use Case |
| --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 256 tokens | 384 | Fast | Good | General purpose, quick semantic similarity |
| all-mpnet-base-v2 | 512 tokens | 768 | Medium | Excellent | High-accuracy semantic search |
| bge-large-en-v1.5 | 512 tokens | 1024 | Slow | Best | Production-grade retrieval on GPU |
| google/embeddinggemma-300m | 2048 tokens | 768 | Fast | Very Good | Lightweight, efficient multilingual retrieval |
| Qwen/Qwen3-Embedding-8B | 32768 tokens | 4096 | Slow | Excellent (SOTA) | Large-scale multilingual embeddings, long-context RAG |

4. Adjusting Chunking Strategy

Why adjust? Balance between retrieval precision and context richness.

Step 1: Update chunk sizes in project/config.py

# For short, factual queries (e.g., technical documentation)
CHILD_CHUNK_SIZE = 300
CHILD_CHUNK_OVERLAP = 60
MIN_PARENT_SIZE = 1500
MAX_PARENT_SIZE = 8000

# For narrative or contextual queries (e.g., legal documents)
# CHILD_CHUNK_SIZE = 800
# CHILD_CHUNK_OVERLAP = 150
# MIN_PARENT_SIZE = 3000
# MAX_PARENT_SIZE = 15000

Step 2 (Optional): Replace the splitter in project/document_chunker.py

Default (Character-based):

from langchain_text_splitters import RecursiveCharacterTextSplitter

self.__child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=config.CHILD_CHUNK_SIZE,
    chunk_overlap=config.CHILD_CHUNK_OVERLAP
)

Alternative (Sentence-aware):

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Note: this splitter counts tokens, not characters, and tokens_per_chunk
# must not exceed the underlying sentence-transformer's token window.
self.__child_splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=config.CHILD_CHUNK_SIZE,
    chunk_overlap=config.CHILD_CHUNK_OVERLAP
)

Step 3: Re-run ingestion pipeline

Upload documents again through the Gradio interface to apply new chunking.

Chunking Guidelines:

⚠️ Disclaimer: These are empirical guidelines. Optimal sizes depend on:

  • Child chunk → embedding model's context window (e.g. 256 tokens for all-MiniLM-L6-v2, 512 for bge-large-en-v1.5): child size should not exceed it
  • Parent chunk → generative model's context window (e.g. 8K, 32K, 128K tokens): parent must fit within the context sent to the LLM alongside the query

Always validate values empirically on your own corpus.

| Document Type | Child Size | Parent Size | Reasoning |
| --- | --- | --- | --- |
| Technical Docs | 300-500 | 2000-4000 | Precise lookups, code snippets |
| Legal Contracts | 600-1000 | 5000-15000 | Context-heavy, definitions |
| Research Papers | 400-600 | 3000-8000 | Balance of precision and context |
| FAQs / Knowledge Base | 200-400 | 1500-4000 | Short, focused answers |
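
The sizes above are in characters; to compare them against token windows, a rough heuristic (the ~4 characters per token ratio is an assumption for English text and varies by tokenizer):

# Rough heuristic: ~4 characters per English token (assumption)
def approx_tokens(chars: int) -> int:
    return chars // 4

print(approx_tokens(500))   # ~125 tokens: fits a 256-token embedding window
print(approx_tokens(4000))  # ~1000 tokens: parent must still fit the LLM context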

5. Agent Configuration

Tune agent behavior in project/config.py:

# Hard limits to prevent infinite loops
MAX_TOOL_CALLS = 8       # Maximum tool calls per agent run
MAX_ITERATIONS = 10      # Maximum agent loop iterations
GRAPH_RECURSION_LIMIT = 50 # Maximum number of steps before hitting a stop condition

# Context compression thresholds
BASE_TOKEN_THRESHOLD = 2000     # Initial token threshold for compression
TOKEN_GROWTH_FACTOR = 0.9       # Multiplier applied after each compression

| Parameter | Effect |
| --- | --- |
| MAX_TOOL_CALLS | Increase for complex queries, decrease to speed up simple ones |
| MAX_ITERATIONS | Controls how many reasoning loops the agent can run |
| GRAPH_RECURSION_LIMIT | Increase for complex graphs |
| BASE_TOKEN_THRESHOLD | Delay compression by increasing this value |
| TOKEN_GROWTH_FACTOR | Lower values compress more aggressively |
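
Note that GRAPH_RECURSION_LIMIT is enforced by LangGraph at invocation time; a sketch of how a compiled graph accepts it (agent_graph here stands in for the graph compiled in project/core/rag_system.py):

# Sketch: pass the limit through the standard LangGraph runtime config.
result = agent_graph.invoke(
    {"question": "What is parent-child chunking?"},
    config={"recursion_limit": config.GRAPH_RECURSION_LIMIT},
)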

Observability

Optional tracing with Langfuse captures every LLM call, tool invocation, and graph transition, which is useful for debugging agent behavior, tracking costs, and evaluating retrieval quality.

Enabling Langfuse

  1. Sign up on Langfuse Cloud, create an organization, then create a project and generate API keys from the project settings.
  2. Set environment variables (or copy .env.example to .env):

export LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_BASE_URL=https://cloud.langfuse.com   # or your self-hosted URL

  3. Run the app normally. Traces appear in your Langfuse dashboard.

To disable tracing, set LANGFUSE_ENABLED=false or leave the variables unset. The application runs identically either way.

For additional details on integrating Langfuse with LangChain or LangGraph, see the official documentation.
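
Under the hood, tracing amounts to attaching a Langfuse callback handler to graph invocations. A minimal sketch, assuming the v2-style SDK import (project/core/observability.py manages the handler lifecycle and may differ):

import os
from langfuse.callback import CallbackHandler  # v2-style import; v3 moved it to langfuse.langchain

handler = CallbackHandler(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_BASE_URL", "https://cloud.langfuse.com"),
)

# Attach to any LangChain/LangGraph call to trace it:
result = agent_graph.invoke(state, config={"callbacks": [handler]})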

What gets traced

| Component | Traced operations |
| --- | --- |
| Graph nodes | summarize_history, rewrite_query, orchestrator, compress_context, fallback_response, aggregate_answers |
| Tools | search_child_chunks, retrieve_parent_chunks (arguments + results) |
| Structured output | QueryAnalysis parsing in the rewrite step |
| Subgraph fan-out | Parallel agent invocations via Send() |

Hosting options

For a detailed comparison of observability platforms (LangSmith, Arize Phoenix, AgentOps, Braintrust, Helicone) and the full self-hosting setup, see observability.ipynb.


Advanced Topics

Customizing the RAG Agent

Location: project/rag_agent/

Add/Remove Nodes: Edit graph.py and nodes.py

Example: Adding a fact-checking node

# In nodes.py
def fact_check_node(state):
    # Your fact-checking logic
    return {"fact_checked": True}

# In graph.py
builder.add_node("fact_check", fact_check_node)
builder.add_edge("retrieve", "fact_check")

Modify Conditional Routing: Edit edges.py to change graph flow logic

Example from the system - routing based on query clarity:

from typing import Literal

from langgraph.types import Send  # Send enables fan-out to parallel agent runs

def route_after_rewrite(state: State) -> Literal["request_clarification"] | list[Send]:
    """Routes to human input if the question is unclear; otherwise processes all rewritten queries."""
    if not state.get("questionIsClear", False):
        return "request_clarification"
    # Fan-out: send each rewritten question to parallel processing
    return [
        Send("agent", {"question": query, "question_index": idx, "messages": []})
        for idx, query in enumerate(state["rewrittenQuestions"])
    ]

This pattern allows the agent to either request clarification from the user or fan-out multiple query variations for parallel retrieval.

Modify Prompts: Edit prompts.py to change agent behavior and response style

Add Custom Tools: Extend tools.py with new retrieval strategies or external integrations

Replacing Storage Backends

Vector Database:

  • Default: Local Qdrant
  • Alternatives: Remote Qdrant Cloud, Pinecone, Weaviate
  • Edit: project/db/vector_db_manager.py

Parent Store:

  • Default: JSON file
  • Alternatives: PostgreSQL, MongoDB, S3
  • Edit: project/db/parent_store_manager.py
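
For example, swapping the JSON file for PostgreSQL mostly means reimplementing the store's read/write methods. A hypothetical sketch (the method names, table schema, and psycopg2-style connection are assumptions, not the project's actual interface):

# Hypothetical parent-store replacement; the real interface in
# project/db/parent_store_manager.py may use different method names.
class PostgresParentStore:
    def __init__(self, conn):
        self.conn = conn  # e.g., a psycopg2 connection

    def put(self, chunk_id: str, content: str) -> None:
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO parent_chunks (id, content) VALUES (%s, %s) "
                "ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content",
                (chunk_id, content),
            )
        self.conn.commit()

    def get(self, chunk_id: str) -> str | None:
        with self.conn.cursor() as cur:
            cur.execute("SELECT content FROM parent_chunks WHERE id = %s", (chunk_id,))
            row = cur.fetchone()
        return row[0] if row else None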

Extending the UI

Location: project/ui/gradio_app.py

Add runtime settings, admin panels, or analytics:

with gr.Accordion("Advanced Settings", open=False):
    provider_dropdown = gr.Dropdown(
        choices=["openai", "anthropic", "google", "ollama"],
        label="LLM Provider"
    )

Docker Deployment

⚠️ System Requirements: At least 8GB of RAM allocated to Docker. The default Ollama model needs approximately 3.3GB to run.

Build and Run

# Build image
docker build -t agentic-rag -f project/Dockerfile .

# Run container
docker run --name rag-assistant -p 7860:7860 agentic-rag

Optional: GPU acceleration (NVIDIA only):

docker run --gpus all --name rag-assistant -p 7860:7860 agentic-rag

Common commands:

docker stop rag-assistant      # Stop
docker start rag-assistant     # Restart
docker logs -f rag-assistant   # View logs
docker rm -f rag-assistant     # Remove

⚠️ Performance Note: On Windows/Mac, Docker runs via a Linux VM which may slow down I/O operations like document indexing. LLM inference speed is largely unaffected. On Linux, performance is comparable to running locally.

⚠️ Data Persistence: The default docker run command does not persist indexed documents — they are lost when the container is removed. Mount a volume to keep them:

docker run --name rag-assistant -p 7860:7860 \
  -v $(pwd)/qdrant_db:/app/qdrant_db \
  -v $(pwd)/parent_store:/app/parent_store \
  agentic-rag

Once running, open http://localhost:7860.


Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| "Model not found" error | Incorrect model name for provider | Verify LLM_MODEL matches the provider's API (e.g., gpt-4o-mini, not gpt4-mini) |
| Low-quality retrieval results | Poor embedding model or chunk configuration | Re-index with better embeddings (e.g., all-mpnet-base-v2) or adjust chunk sizes |
| Slow response times | Large embedding model or high top_k value | Use a smaller embedding model (e.g., all-MiniLM-L6-v2) or reduce top_k in the retrieval tools |
| API rate limits exceeded | Too many requests to an external provider | Add retry logic with exponential backoff or switch to local Ollama models |
| Out-of-memory errors | Large document set or embedding model | Use smaller embeddings, reduce batch size, or enable GPU acceleration |
| Empty retrieval results | Collection not indexed or wrong collection name | Verify documents are uploaded and the CHILD_COLLECTION name matches the config |
| Import errors after provider switch | Missing SDK installation | Install the required package: pip install langchain-{provider} |
| Inconsistent answers across runs | High temperature setting | Set LLM_TEMPERATURE = 0 in config for deterministic responses |