This codebase powers an Agentic RAG Engine built on the LangGraph framework. It supports hierarchical indexing, hybrid semantic and keyword search, and pluggable LLM providers.
Quick Start | Architecture Overview | Project Structure | Configuration Guide | Common Customizations | Observability | Advanced Topics | Troubleshooting
Install the necessary Python packages:

```bash
pip install -r requirements.txt
```

Launch the interactive chat server:

```bash
python project/app.py
```

The application will be available at http://localhost:7860 (the default Gradio port).
| Requirement | Details |
|---|---|
| Python | 3.11 or higher |
| RAM | 16 GB for Ollama · 8 GB minimum for cloud providers |
| Ollama | Required only for local inference — install Ollama, then run ollama pull qwen3:4b-instruct-2507-q4_K_M |
| API key | Required only for cloud providers — copy project/.env.example → project/.env and fill in your key |
This system implements an advanced RAG pipeline with the following key features:
- Parent-Child Chunking: Documents are split into small child chunks (for precise retrieval) linked to larger parent chunks (for rich context)
- Hybrid Search: Combines dense embeddings and sparse (BM25) retrieval for optimal results
- LangGraph Agent: Orchestrates query rewriting, retrieval, and response generation
- Multi-Provider Support: Seamlessly switch between Ollama, OpenAI GPT, Google Gemini, and Anthropic Claude
- Vector Storage: Uses Qdrant for efficient similarity search
PDF → Markdown Conversion → Parent/Child Chunking → Vector Indexing → Agent Retrieval → LLM Response
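The parent-child linkage at the heart of this pipeline can be sketched in plain Python. This is a simplified illustration only, not the project's actual document_chunker.py logic; the chunk sizes and ID scheme are made up for the example:

```python
def build_parent_child_index(text, parent_size=2000, child_size=500):
    """Split text into parent chunks, then split each parent into
    child chunks that remember their parent's ID."""
    parents = {}
    children = []
    for p_idx in range(0, len(text), parent_size):
        parent_id = f"parent-{p_idx // parent_size}"
        parent_text = text[p_idx:p_idx + parent_size]
        parents[parent_id] = parent_text
        # Children are what gets embedded and searched; the parent
        # is what gets sent to the LLM after a child matches.
        for c_idx in range(0, len(parent_text), child_size):
            children.append({
                "parent_id": parent_id,
                "text": parent_text[c_idx:c_idx + child_size],
            })
    return parents, children

parents, children = build_parent_child_index("x" * 5000)
```

At query time the small children give precise matches, and the stored parent ID lets the agent pull back the larger surrounding context.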
| File | Purpose |
|---|---|
| project/app.py | Application entry point, launches Gradio UI |
| project/config.py | Central configuration hub - edit this for provider/model/chunking changes |
| project/utils.py | PDF to Markdown conversion and context token estimation |
| project/document_chunker.py | Parent/child splitting logic with cleaning and merging rules |
| project/Dockerfile | Dockerfile with Ollama for local deployment |
| File | Purpose |
|---|---|
| project/core/rag_system.py | System bootstrap - creates managers and compiles the LangGraph agent |
| project/core/document_manager.py | Document ingestion pipeline (convert, chunk, index) |
| project/core/chat_interface.py | Thin wrapper for agent graph interaction |
| project/core/observability.py | Optional Langfuse tracing — callback handler lifecycle |
| File | Purpose |
|---|---|
| project/db/vector_db_manager.py | Qdrant client wrapper with embedding initialization |
| project/db/parent_store_manager.py | File-backed storage for parent chunks |
| File | Purpose |
|---|---|
| project/rag_agent/graph.py | Graph builder and compilation logic |
| project/rag_agent/graph_state.py | Shared and per-agent graph state definitions; answer accumulation/reset logic |
| project/rag_agent/nodes.py | Node implementations (summarize, rewrite, agent execution, aggregate) |
| project/rag_agent/edges.py | Conditional edge routing logic (e.g., routing based on query clarity) |
| project/rag_agent/tools.py | Retrieval tools (search_child_chunks, retrieve_parent_chunks) |
| project/rag_agent/prompts.py | System prompts for agent behavior |
| project/rag_agent/schemas.py | Structured output schemas (Pydantic models) |
| File | Purpose |
|---|---|
| project/ui/css.py | Custom CSS styling for the Gradio interface |
| project/ui/gradio_app.py | Gradio UI implementation with document upload and chat |
All primary settings are in project/config.py. Key parameters:
```python
# --- Storage paths ---
MARKDOWN_DIR = "markdown_docs"        # Storage for converted PDF → Markdown files
PARENT_STORE_PATH = "parent_store"    # File-backed storage for parent chunks
QDRANT_DB_PATH = "qdrant_db"          # Local Qdrant vector database path
CHILD_COLLECTION = "document_child_chunks"  # Collection name for child chunks
SPARSE_VECTOR_NAME = "sparse"         # Named sparse vector field (BM25)

# Default: single model configuration
DENSE_MODEL = "sentence-transformers/all-mpnet-base-v2"
SPARSE_MODEL = "Qdrant/bm25"
LLM_MODEL = "qwen3:4b-instruct-2507-q4_K_M"
LLM_TEMPERATURE = 0                   # 0 = deterministic, 1 = creative

# Hard limits to prevent infinite loops
MAX_TOOL_CALLS = 8                    # Maximum tool calls per agent run
MAX_ITERATIONS = 10                   # Maximum agent loop iterations
GRAPH_RECURSION_LIMIT = 50            # Maximum number of steps before hitting a stop condition

# Context compression thresholds
BASE_TOKEN_THRESHOLD = 2000           # Initial token threshold for compression
TOKEN_GROWTH_FACTOR = 0.9             # Multiplier applied after each compression

# Chunking
CHILD_CHUNK_SIZE = 500                # Size of chunks used for retrieval
CHILD_CHUNK_OVERLAP = 100             # Overlap between chunks (prevents context loss)
MIN_PARENT_SIZE = 2000                # Minimum parent chunk size
MAX_PARENT_SIZE = 4000                # Maximum parent chunk size

# Markdown header splitting strategy
HEADERS_TO_SPLIT_ON = [
    ("#", "H1"),
    ("##", "H2"),
    ("###", "H3"),
]

# Observability
LANGFUSE_ENABLED = False              # Set to True via LANGFUSE_ENABLED env var
LANGFUSE_PUBLIC_KEY = ""              # From your Langfuse project settings
LANGFUSE_SECRET_KEY = ""              # From your Langfuse project settings
LANGFUSE_BASE_URL = "http://localhost:3000"  # Langfuse Cloud or self-hosted URL
```

Performance Note: LLMs with 7B+ parameters typically offer superior reasoning, context comprehension, and response quality compared to smaller models. This applies to both proprietary and open-source models, as long as they support native tool/function calling, which is required for agentic RAG workflows.
If you want to permanently switch from one provider to another (e.g., Ollama → Google Gemini), follow these steps:
Step 1: Install the provider's SDK

```bash
pip install langchain-google-genai
```

Step 2: Set environment variable

```bash
export GOOGLE_API_KEY="your-google-key"
```

Step 3: Update project/config.py

```python
LLM_MODEL = "gemini-2.5-pro"
LLM_TEMPERATURE = 0
```

Step 4: Modify project/core/rag_system.py

Replace:

```python
llm = ChatOllama(model=config.LLM_MODEL, temperature=config.LLM_TEMPERATURE)
```

With:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model=config.LLM_MODEL, temperature=config.LLM_TEMPERATURE)
```

A more flexible approach lets you maintain multiple provider configurations and switch between them easily:
Step 1: Install required SDKs

```bash
pip install langchain-openai langchain-anthropic langchain-google-genai
```

Step 2: Set environment variables

```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
```

Step 3: Update project/config.py with multi-provider configuration
```python
# --- Multi-Provider LLM Configuration ---
LLM_CONFIGS = {
    "ollama": {
        "model": "ministral-3:14b-instruct-2512-q4_K_M",
        "url": "http://localhost:11434",
        "temperature": 0
    },
    "openai": {
        "model": "gpt-4o",
        "temperature": 0
    },
    "anthropic": {
        "model": "claude-sonnet-4-6",
        "temperature": 0
    },
    "google": {
        "model": "gemini-2.5-flash",
        "temperature": 0
    }
}

# Switch providers by changing this single line
ACTIVE_LLM_CONFIG = "ollama"
```

Step 4: Modify project/core/rag_system.py in the initialize() method

Replace the existing LLM initialization with:
```python
def initialize(self):
    self.vector_db.create_collection(self.collection_name)
    collection = self.vector_db.get_collection(self.collection_name)

    # Load active configuration
    active_config = config.LLM_CONFIGS[config.ACTIVE_LLM_CONFIG]
    model = active_config["model"]
    temperature = active_config["temperature"]

    if config.ACTIVE_LLM_CONFIG == "ollama":
        from langchain_ollama import ChatOllama
        llm = ChatOllama(model=model, temperature=temperature, base_url=active_config["url"])
    elif config.ACTIVE_LLM_CONFIG == "openai":
        from langchain_openai import ChatOpenAI
        llm = ChatOpenAI(model=model, temperature=temperature)
    elif config.ACTIVE_LLM_CONFIG == "anthropic":
        from langchain_anthropic import ChatAnthropic
        llm = ChatAnthropic(model=model, temperature=temperature)
    elif config.ACTIVE_LLM_CONFIG == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        llm = ChatGoogleGenerativeAI(model=model, temperature=temperature)
    else:
        raise ValueError(f"Unsupported LLM provider: {config.ACTIVE_LLM_CONFIG}")

    # Continue with tool and graph initialization
    tools = ToolFactory(collection).create_tools()
    self.agent_graph = create_agent_graph(llm, tools)
```

Switching Providers: Simply change ACTIVE_LLM_CONFIG in config.py:

```python
ACTIVE_LLM_CONFIG = "google"      # Switch to Gemini
# ACTIVE_LLM_CONFIG = "anthropic" # Or Claude Sonnet
# ACTIVE_LLM_CONFIG = "openai"    # Or GPT-4o
```

Provider Reference Table:
| Provider | Environment Variable | Import Statement | Example Models |
|---|---|---|---|
| OpenAI | OPENAI_API_KEY | from langchain_openai import ChatOpenAI | gpt-4o, gpt-4o-mini |
| Anthropic | ANTHROPIC_API_KEY | from langchain_anthropic import ChatAnthropic | claude-opus-4-6, claude-sonnet-4-6 |
| Google | GOOGLE_API_KEY | from langchain_google_genai import ChatGoogleGenerativeAI | gemini-2.5-pro, gemini-2.5-flash |
| Ollama | None (local) | from langchain_ollama import ChatOllama | qwen3:4b-instruct-2507-q4_K_M, ministral-3:8b-instruct-2512-q4_K_M, llama3.1:8b-instruct-q6_K |
Why change? Trade-offs between speed, cost, and quality.
Step 1: Update project/config.py

```python
# Example: Faster, smaller model
DENSE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Example: Higher quality, slower model
# DENSE_MODEL = "sentence-transformers/all-mpnet-base-v2"

# Example: Gemma embeddings (Google's open model)
# DENSE_MODEL = "google/embeddinggemma-300m"

# Example: Qwen embeddings (Alibaba's multilingual model)
# DENSE_MODEL = "Qwen/Qwen3-Embedding-8B"

SPARSE_MODEL = "Qdrant/bm25"  # Usually no need to change
```

Step 2: Re-index your documents

Implementation Details (in project/db/vector_db_manager.py):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import FastEmbedSparse
import config

self.__dense_embeddings = HuggingFaceEmbeddings(model_name=config.DENSE_MODEL)
self.__sparse_embeddings = FastEmbedSparse(model_name=config.SPARSE_MODEL)
```

Popular Embedding Models:
| Model | Context Size | Vector Dimension | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 256 tokens | 384 | Fast | Good | General purpose, quick semantic similarity |
| all-mpnet-base-v2 | 512 tokens | 768 | Medium | Excellent | High-accuracy semantic search |
| bge-large-en-v1.5 | 512 tokens | 1024 | Slow | Best | Production-grade retrieval on GPU |
| google/embeddinggemma-300m | 2048 tokens | 768 | Fast | Very Good | Lightweight, efficient multilingual retrieval |
| Qwen/Qwen3-Embedding-8B | 32768 tokens | 4096 | Slow | Excellent / SOTA | Large-scale multilingual embeddings, long-context RAG |
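Whichever dense model you pick, hybrid search must merge the dense and sparse (BM25) result lists into one ranking. A common, simple fusion method is Reciprocal Rank Fusion (RRF); the sketch below is illustrative only — the actual fusion here is handled inside the Qdrant/LangChain stack, and k=60 is the conventional RRF constant, not a project setting:

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """Combine two ranked lists of doc IDs with Reciprocal Rank Fusion:
    score(d) = sum over lists of 1 / (k + rank_of_d_in_list)."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc2" ranks near the top of both lists, so it wins overall
fused = rrf_fuse(["doc1", "doc2", "doc3"], ["doc2", "doc4", "doc1"])
```

Documents that appear high in both rankings dominate, which is why hybrid search is robust to either retriever missing a match.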
Why adjust? Balance between retrieval precision and context richness.
Step 1: Update chunk sizes in project/config.py

```python
# For short, factual queries (e.g., technical documentation)
CHILD_CHUNK_SIZE = 300
CHILD_CHUNK_OVERLAP = 60
MIN_PARENT_SIZE = 1500
MAX_PARENT_SIZE = 8000

# For narrative or contextual queries (e.g., legal documents)
# CHILD_CHUNK_SIZE = 800
# CHILD_CHUNK_OVERLAP = 150
# MIN_PARENT_SIZE = 3000
# MAX_PARENT_SIZE = 15000
```

Step 2 (Optional): Replace the splitter in project/document_chunker.py

Default (Character-based):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

self.__child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=config.CHILD_CHUNK_SIZE,
    chunk_overlap=config.CHILD_CHUNK_OVERLAP
)
```

Alternative (Sentence-aware):

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Note: this splitter counts tokens rather than characters and takes
# tokens_per_chunk instead of chunk_size, so adjust values accordingly
self.__child_splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=config.CHILD_CHUNK_SIZE,
    chunk_overlap=config.CHILD_CHUNK_OVERLAP
)
```

Step 3: Re-run ingestion pipeline

Upload documents again through the Gradio interface to apply new chunking.
Chunking Guidelines:
⚠️ Disclaimer: These are empirical guidelines. Optimal sizes depend on:
- Child chunk → embedding model's context window (e.g. 256 tokens for all-MiniLM-L6-v2, 512 for bge-large-en-v1.5): child size should not exceed it
- Parent chunk → generative model's context window (e.g. 8K, 32K, 128K tokens): parent must fit within the context sent to the LLM alongside the query
Always validate values empirically on your own corpus.
| Document Type | Child Size | Parent Size | Reasoning |
|---|---|---|---|
| Technical Docs | 300-500 | 2000-4000 | Precise lookups, code snippets |
| Legal Contracts | 600-1000 | 5000-15000 | Context-heavy, definitions |
| Research Papers | 400-600 | 3000-8000 | Balance of precision and context |
| FAQs / Knowledge Base | 200-400 | 1500-4000 | Short, focused answers |
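When tuning these values, it helps to estimate how many child chunks a document will produce: the first chunk covers chunk_size characters, and each later chunk advances by (chunk_size - overlap). A back-of-the-envelope helper (a rough approximation for a character-based splitter, not the project's actual chunk count, which also depends on header splitting and merging rules):

```python
import math

def estimate_chunk_count(doc_chars, chunk_size=500, overlap=100):
    """Approximate child-chunk count: 1 chunk for the first chunk_size
    chars, plus one more per (chunk_size - overlap) chars after that."""
    if doc_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((doc_chars - chunk_size) / step)

# A 100k-character document with the default 500/100 settings
n = estimate_chunk_count(100_000)
```

Larger overlap raises the chunk count (and indexing cost) for the same corpus, which is part of the precision/cost trade-off discussed above.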
Tune agent behavior in project/config.py:
```python
# Hard limits to prevent infinite loops
MAX_TOOL_CALLS = 8           # Maximum tool calls per agent run
MAX_ITERATIONS = 10          # Maximum agent loop iterations
GRAPH_RECURSION_LIMIT = 50   # Maximum number of steps before hitting a stop condition

# Context compression thresholds
BASE_TOKEN_THRESHOLD = 2000  # Initial token threshold for compression
TOKEN_GROWTH_FACTOR = 0.9    # Multiplier applied after each compression
```

| Parameter | Effect |
|---|---|
| MAX_TOOL_CALLS | Increase for complex queries, decrease to speed up simple ones |
| MAX_ITERATIONS | Controls how many reasoning loops the agent can run |
| GRAPH_RECURSION_LIMIT | Increase for complex graphs |
| BASE_TOKEN_THRESHOLD | Delay compression by increasing this value |
| TOKEN_GROWTH_FACTOR | Lower values compress more aggressively |
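The interplay of BASE_TOKEN_THRESHOLD and TOKEN_GROWTH_FACTOR can be illustrated with a tiny simulation. This is an assumption inferred from the comments above (the actual logic lives in the graph's compression node): after each compression the threshold is multiplied by the factor, so a factor below 1 makes each subsequent compression trigger earlier.

```python
def threshold_schedule(base=2000, factor=0.9, compressions=5):
    """Threshold after n compressions: base * factor ** n.
    With factor < 1 the threshold shrinks over time, so lower
    factors compress more and more aggressively."""
    return [round(base * factor ** n) for n in range(compressions + 1)]

# Default settings: 2000, then 1800, 1620, 1458, ...
```

Raising BASE_TOKEN_THRESHOLD shifts the whole schedule up (compression starts later); raising TOKEN_GROWTH_FACTOR toward 1 flattens it (later compressions trigger no earlier than the first).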
Optional tracing with Langfuse captures every LLM call, tool invocation, and graph transition, which is useful for debugging agent behavior, tracking costs, and evaluating retrieval quality.
- Sign up on Langfuse Cloud, create an organization, then create a project and generate API keys from the project settings.
- Set environment variables (or copy .env.example to .env):

```bash
export LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_BASE_URL=https://cloud.langfuse.com  # or your self-hosted URL
```

- Run the app normally. Traces appear in your Langfuse dashboard.
To disable tracing, set LANGFUSE_ENABLED=false or leave the variables unset. The application runs identically either way.
For additional details on integrating Langfuse with LangChain or LangGraph, see the official documentation.
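The enable/disable decision can be sketched as a plain environment-variable check. This is illustrative only; the real lifecycle handling lives in project/core/observability.py, and its exact parsing may differ:

```python
import os

def langfuse_enabled(env=None):
    """Tracing is on only when LANGFUSE_ENABLED is an explicit truthy
    string; unset, empty, or 'false' leaves the app running untraced."""
    if env is None:
        env = os.environ
    return env.get("LANGFUSE_ENABLED", "").strip().lower() in {"1", "true", "yes"}

# Unset variable -> disabled; "true" (any case) -> enabled
```

Because the check defaults to off, forgetting the variable can never break the application, matching the "runs identically either way" guarantee above.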
| Component | Traced operations |
|---|---|
| Graph nodes | summarize_history, rewrite_query, orchestrator, compress_context, fallback_response, aggregate_answers |
| Tools | search_child_chunks, retrieve_parent_chunks (arguments + results) |
| Structured output | QueryAnalysis parsing in the rewrite step |
| Subgraph fan-out | Parallel agent invocations via Send() |
- Langfuse Cloud — sign up at cloud.langfuse.com, free up to 50K observations/month.
- Self-hosted — MIT-licensed, deploy with Docker Compose. See the official self-hosting docs.
For a detailed comparison of observability platforms (LangSmith, Arize Phoenix, AgentOps, Braintrust, Helicone) and the full self-hosting setup, see observability.ipynb.
Location: project/rag_agent/
Add/Remove Nodes: Edit graph.py and nodes.py
Example: Adding a fact-checking node

```python
# In nodes.py
def fact_check_node(state):
    # Your fact-checking logic
    return {"fact_checked": True}

# In graph.py
builder.add_node("fact_check", fact_check_node)
builder.add_edge("retrieve", "fact_check")
```

Modify Conditional Routing: Edit edges.py to change graph flow logic

Example from the system - routing based on query clarity:

```python
def route_after_rewrite(state: State) -> Literal["request_clarification"] | list[Send]:
    """Routes to human input if the question is unclear;
    otherwise fans out all rewritten queries."""
    if not state.get("questionIsClear", False):
        return "request_clarification"
    else:
        # Fan-out: send each rewritten question to parallel processing
        return [
            Send("agent", {"question": query, "question_index": idx, "messages": []})
            for idx, query in enumerate(state["rewrittenQuestions"])
        ]
```

This pattern allows the agent to either request clarification from the user or fan out multiple query variations for parallel retrieval.
Modify Prompts: Edit prompts.py to change agent behavior and response style
Add Custom Tools: Extend tools.py with new retrieval strategies or external integrations
Vector Database:
- Default: Local Qdrant
- Alternatives: Remote Qdrant Cloud, Pinecone, Weaviate
- Edit: project/db/vector_db_manager.py
Parent Store:
- Default: JSON file
- Alternatives: PostgreSQL, MongoDB, S3
- Edit: project/db/parent_store_manager.py
Location: project/ui/gradio_app.py
Add runtime settings, admin panels, or analytics:
```python
with gr.Accordion("Advanced Settings", open=False):
    provider_dropdown = gr.Dropdown(
        choices=["openai", "anthropic", "google", "ollama"],
        label="LLM Provider"
    )
```
⚠️ System Requirements: At least 8GB of RAM allocated to Docker. The default Ollama model needs approximately 3.3GB to run.
```bash
# Build image
docker build -t agentic-rag -f project/Dockerfile .

# Run container
docker run --name rag-assistant -p 7860:7860 agentic-rag
```

Optional: GPU acceleration (NVIDIA only):

```bash
docker run --gpus all --name rag-assistant -p 7860:7860 agentic-rag
```

Common commands:

```bash
docker stop rag-assistant     # Stop
docker start rag-assistant    # Restart
docker logs -f rag-assistant  # View logs
docker rm -f rag-assistant    # Remove
```
⚠️ Performance Note: On Windows/Mac, Docker runs via a Linux VM which may slow down I/O operations like document indexing. LLM inference speed is largely unaffected. On Linux, performance is comparable to running locally.
⚠️ Data Persistence: The default docker run command does not persist indexed documents — they are lost when the container is removed. Mount volumes to keep them:

```bash
docker run --name rag-assistant -p 7860:7860 \
  -v $(pwd)/qdrant_db:/app/qdrant_db \
  -v $(pwd)/parent_store:/app/parent_store \
  agentic-rag
```
Once running, open http://localhost:7860.
| Issue | Cause | Solution |
|---|---|---|
| "Model not found" error | Incorrect model name for provider | Verify LLM_MODEL matches provider's API (e.g., gpt-4o-mini not gpt4-mini) |
| Low-quality retrieval results | Poor embedding model or chunk configuration | Re-index with better embeddings (e.g., all-mpnet-base-v2) or adjust chunk sizes |
| Slow response times | Large embedding model or high top_k value | Use smaller embedding models (e.g., all-MiniLM-L6-v2) or reduce top_k in retrieval tools |
| API rate limits exceeded | Too many requests to external provider | Add retry logic with exponential backoff or switch to local Ollama models |
| Out of memory errors | Large document set or embedding model | Use smaller embeddings, reduce batch size, or enable GPU acceleration |
| Empty retrieval results | Collection not indexed or wrong collection name | Verify documents are uploaded and CHILD_COLLECTION name matches in config |
| Import errors after provider switch | Missing SDK installation | Install required package: pip install langchain-{provider} |
| Inconsistent answers across runs | High temperature setting | Set LLM_TEMPERATURE = 0 in config for deterministic responses |