This document contains detailed implementation notes, patterns, and design decisions. Read this when you're working on specific components or need to understand internal behavior.
- File Processor Details
- Database Layer Implementation
- Embedding Layer Implementation
- Configuration System
- MCP Server Architecture
- Processing Patterns
- Testing Strategy
- Dependencies
- File discovery via recursive traversal
- Respects `.gitignore`, standard ignore patterns (e.g., `node_modules`, `.git`, build directories), and configurable additional patterns
- File reading with multiple encoding fallbacks to handle diverse source files
- Syntax-aware chunking for supported languages, with fallback to line-aware chunking
- Extensible file type detection (40+ programming languages and formats)
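The encoding fallback mentioned in the list above can be sketched roughly as follows. This is an illustration only, not the project's implementation: the function name `read_text_with_fallbacks` and the encoding order are assumptions.

```python
from pathlib import Path
from typing import Optional, Sequence

# Hypothetical fallback order; the real file processor may try other encodings.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_text_with_fallbacks(
    path: Path, encodings: Sequence[str] = FALLBACK_ENCODINGS
) -> Optional[str]:
    """Return the file's text using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next candidate
        except OSError as exc:
            # Unreadable file: report it and let the caller move on.
            print(f"Could not read {path}: {exc}")
            return None
    return None
```

Returning `None` instead of raising lets the caller skip the file and keep processing the rest of the codebase.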
Syntax-aware chunking (preferred):
- Uses tree-sitter to parse supported languages: Python, JavaScript, TypeScript, Go, Rust, Java, C++, C
- Chunks by syntax boundaries (functions, classes, etc.)
- Automatically falls back to line-aware chunking on parse errors
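To illustrate chunking on syntax boundaries with a fallback on parse errors, here is a small sketch. It deliberately swaps tree-sitter for Python's standard-library `ast` module so it runs without extra packages, and it handles Python source only. The function name is hypothetical, and the fallback simply returns the whole file as one chunk.

```python
import ast
from typing import List


def chunk_python_by_syntax(source: str) -> List[str]:
    """Split Python source into one chunk per top-level statement.

    Uses ast as a stand-in for tree-sitter. Comments and blank lines that sit
    between top-level statements are not included in any chunk.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Stand-in for the line-aware fallback: keep the file as a single chunk.
        return [source] if source else []

    lines = source.splitlines(keepends=True)
    chunks: List[str] = []
    for node in tree.body:
        start = node.lineno  # 1-indexed line of the statement itself
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            # Decorators sit above the def/class line; include them in the chunk.
            start = min(d.lineno for d in decorators)
        # end_lineno is inclusive and available on Python 3.8+.
        chunks.append("".join(lines[start - 1:node.end_lineno]))
    return chunks
```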
Line-aware chunking (fallback):
- Character-based with line-break awareness
- Ensures chunks don't split mid-line
- Used for unsupported languages or when syntax parsing fails
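A minimal sketch of line-aware chunking, assuming a character budget per chunk and whole lines as the smallest unit. The function name `chunk_by_lines` is hypothetical; the default of 1024 mirrors the documented `CODE_RAG_CHUNK_SIZE` default.

```python
from typing import List


def chunk_by_lines(text: str, chunk_size: int = 1024) -> List[str]:
    """Group whole lines into chunks of at most chunk_size characters.

    A single line longer than chunk_size becomes its own oversized chunk
    rather than being split mid-line.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in text.splitlines(keepends=True):
        # Close the current chunk if adding this line would exceed the budget.
        if current and current_len + len(line) > chunk_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the returned chunks reproduces the input text exactly, so no content is lost or duplicated.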
Each chunk includes:
```python
{
    "file_path": "src/code_rag/example.py",
    "chunk_index": 0,           # Position in file (0-indexed)
    "total_chunks": 5,          # Total chunks from this file
    "content": "chunk text..."  # The actual code chunk
}
```

Extend `SyntaxChunker.LANGUAGE_PACKAGES` and install the tree-sitter binding:
```python
LANGUAGE_PACKAGES = {
    "python": "tree-sitter-python",
    "javascript": "tree-sitter-javascript",
    # Add new languages here
}
```

- File reading errors: Print message, continue processing other files
- Gitignore parsing errors: Silently ignored
- Syntax chunking errors: Automatically fall back to basic chunking

Consider adding structured logging if more visibility is needed.
Any database implementation must provide:
```python
from abc import ABC, abstractmethod
from typing import List


class DatabaseInterface(ABC):
    @abstractmethod
    def initialize(self, collection_name: str, dimension: int) -> None:
        """Initialize/connect to database with collection and vector dimension"""
        pass

    @abstractmethod
    def add_documents(self, documents: List[str], embeddings: List[List[float]],
                      metadatas: List[dict]) -> None:
        """Batch insert documents with their embeddings and metadata"""
        pass

    @abstractmethod
    def query(self, query_embedding: List[float], n_results: int) -> List[dict]:
        """Similarity search, return top n results"""
        pass

    @abstractmethod
    def is_processed(self, collection_name: str) -> bool:
        """Check if collection has documents (for idempotency)"""
        pass

    @abstractmethod
    def close(self) -> None:
        """Clean up resources"""
        pass
```

Properties:
- Embedded database (no server required)
- Persistent on-disk storage
- Cosine similarity for vector search
- Collection-based organization
Initialization: Creates/loads collection with specified dimension
Key behavior: Uses cosine distance metric by default
Properties:
- Network-based (requires Qdrant server)
- Supports distributed deployments
- L2 distance for vector search
- Collection-based organization
Initialization: Connects to server, creates collection if needed
Key behavior: Uses L2 (Euclidean) distance metric
Different backends use different vector distance metrics:
- ChromaDB: Cosine similarity
- Qdrant: L2 distance
Results may differ slightly between backends when querying the same data. Both are valid; choice depends on use case and infrastructure requirements.
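The effect of the metric can be shown on toy vectors. The sketch below is independent of ChromaDB and Qdrant, and the vectors and names are invented for illustration. Cosine similarity ignores vector length while L2 distance does not, so the two can rank the same documents differently. For unit-length vectors the squared L2 distance equals 2 − 2·cos, so the rankings agree; whether a given embedding model returns normalized vectors is not stated here and should be checked per model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.dist(a, b)


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.hypot(*v)
    return [x / norm for x in v]


def rank_by_cosine(query: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[str]:
    # Higher similarity is better.
    return sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)


def rank_by_l2(query: Sequence[float], docs: Dict[str, Sequence[float]]) -> List[str]:
    # Lower distance is better.
    return sorted(docs, key=lambda name: l2_distance(query, docs[name]))


query = [1.0, 0.0]
docs = {
    "far_same_direction": [10.0, 0.0],  # identical direction, large magnitude
    "near_off_axis": [1.0, 0.5],        # close by, slightly different direction
}

print(rank_by_cosine(query, docs))  # ['far_same_direction', 'near_off_axis']
print(rank_by_l2(query, docs))      # ['near_off_axis', 'far_same_direction']

# After normalization both metrics produce the same ordering.
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
print(rank_by_cosine(query, unit_docs) == rank_by_l2(query, unit_docs))  # True
```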
```python
from abc import ABC, abstractmethod
from typing import List


class EmbeddingInterface(ABC):
    @abstractmethod
    def embed(self, text: str) -> List[float]:
        """Embed single text string"""
        pass

    @abstractmethod
    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Embed multiple texts efficiently"""
        pass

    @abstractmethod
    def embed_query(self, text: str) -> List[float]:
        """Embed query text (may apply model-specific prefixes)"""
        pass

    @abstractmethod
    def get_dimension(self) -> int:
        """Report vector dimension"""
        pass
```

- Type: General-purpose sentence transformer
- Dimension: 384
- Best for: Balanced performance and speed
- Query prefix: None
- API key: Not required
- Type: Code-optimized embedding model
- Dimension: 768
- Best for: Code semantic search
- Query prefix: "Represent this query for searching relevant code: "
- API key: Not required
- Implementation note: Overrides `embed_query()` to apply prefix
- Type: OpenAI API-based
- Dimension: 1536
- Best for: High quality, cloud-based
- Query prefix: None
- API key: Required (`OPENAI_API_KEY` environment variable)
Some models perform better when queries have instruction prefixes. Implement this pattern:
```python
class CodeRankEmbedding(EmbeddingInterface):
    QUERY_INSTRUCTION_PREFIX = "Represent this query for searching relevant code: "

    def embed_query(self, text: str) -> List[float]:
        """Override to prepend instruction prefix for queries"""
        prefixed_text = self.QUERY_INSTRUCTION_PREFIX + text
        return self.embed(prefixed_text)

    def embed(self, text: str) -> List[float]:
        """Regular embedding for documents (no prefix)"""
        # ... implementation

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Batch embedding for documents (no prefix)"""
        # ... implementation
```

Important: Only apply the prefix in `embed_query()`. Keep `embed()` and `embed_batch()` unchanged so document embeddings remain unprefixed.
Embedding models:
- Are initialized once at startup
- Maintain no internal state about processed documents
- Can be reused across all operations without side effects
- Thread-safe (for most implementations)
Configuration is read from environment variables in `src/code_rag/config/config.py`. All have fallback defaults:
| Variable | Default | Description |
|---|---|---|
| `CODE_RAG_EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Embedding model name |
| `CODE_RAG_DATABASE_TYPE` | `chroma` | Database backend (`chroma` or `qdrant`) |
| `CODE_RAG_DATABASE_PATH` | `.code-rag` | Persist directory for embeddings |
| `CODE_RAG_CHUNK_SIZE` | `1024` | Characters per code chunk |
| `CODE_RAG_BATCH_SIZE` | `32` | Documents per batch before insert |
| `CODE_RAG_ADDITIONAL_IGNORE_PATTERNS` | (empty) | Comma-separated additional ignore patterns |

CLI arguments override environment variables:
```bash
# Environment selects qdrant
export CODE_RAG_DATABASE_TYPE=qdrant
# CLI override: uses chroma despite environment
code-rag --database chroma
```

To add a new configuration option:
- Add to environment variable list in `config.py`
- Provide sensible default
- Add CLI argument in `main.py` if needed
- Document in this file
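The precedence described in this section (CLI argument, then environment variable, then built-in default) can be sketched with `argparse` by using the environment value as the argument's default. The function name `resolve_database_type` is hypothetical and the real `config.py` and `main.py` may be organized differently; only the variable name, the `--database` flag, and the `chroma` default come from this document.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_database_type(
    argv: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the backend name: CLI flag if given, else environment, else 'chroma'."""
    env = os.environ if env is None else env
    # The environment value (or the built-in default) becomes the argparse default,
    # so an explicit --database on the command line always wins.
    env_default = env.get("CODE_RAG_DATABASE_TYPE", "chroma")
    parser = argparse.ArgumentParser(prog="code-rag")
    parser.add_argument("--database", choices=["chroma", "qdrant"], default=env_default)
    args = parser.parse_args(list(argv))
    return args.database
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.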
The MCP server exposes Code-RAG functionality to AI assistants (Claude, etc.) via the Model Context Protocol.
- Transport: stdio (standard input/output)
- Protocol: JSON-RPC messages
- Tools: Single tool `search_codebase`
- Auto-indexing: Transparently indexes codebases on first search

```python
import asyncio


async def async_main():
    """Async server implementation"""
    # ... MCP server logic


def main():
    """Synchronous entry point for setuptools"""
    asyncio.run(async_main())
```

Why this matters: Setuptools `console_scripts` cannot handle async entry points. The entry point must be synchronous (`def main()`) and call `asyncio.run()` internally.

Never do this:

```python
# WRONG - setuptools cannot call this
async def main():
    ...  # async code
```

MCP mode auto-indexes without user prompts:
- First search on a codebase triggers indexing
- No validation prompts or progress callbacks
- Silent operation for seamless AI assistant integration
- Controlled via `validate_codebase=False` parameter

To add a new tool:
- Define tool schema in `list_tools()` handler:

```python
@server.list_tools()
async def list_tools() -> List[Dict]:
    return [
        {
            "name": "search_codebase",
            "description": "...",
            "inputSchema": {...}
        },
        # Add new tool here
    ]
```

- Implement handler in `call_tool()`:

```python
@server.call_tool()
async def call_tool(name: str, arguments: Dict) -> List[Dict]:
    if name == "search_codebase":
        ...  # existing implementation
    elif name == "new_tool":
        ...  # new implementation
```

Optimization: Chunks are embedded and inserted in batches (default 32) rather than individually.
Flow:
- Process files → accumulate chunks
- When batch size reached → embed all chunks
- Insert batch into database
- Continue until all files processed
- Flush remaining chunks
Rationale: Reduces database I/O overhead and improves embedding throughput (models process batches more efficiently).
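The flow above can be sketched as a small loop. This is a simplified illustration: `index_in_batches` is a hypothetical name, and `embed_batch` and `insert` are passed in as plain callables rather than the project's actual embedding and database objects.

```python
from typing import Callable, Iterable, List


def index_in_batches(
    chunks: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    insert: Callable[[List[str], List[List[float]]], None],
    batch_size: int = 32,
) -> int:
    """Embed and insert chunks in groups of batch_size; return the batch count."""
    pending: List[str] = []
    batches = 0
    for chunk in chunks:
        pending.append(chunk)
        if len(pending) >= batch_size:
            # Batch is full: embed all pending chunks in one call, then insert.
            insert(pending, embed_batch(pending))
            pending = []
            batches += 1
    if pending:
        # Flush the final partial batch.
        insert(pending, embed_batch(pending))
        batches += 1
    return batches
```

Because `chunks` is consumed lazily, at most `batch_size` chunks are held in memory at a time.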
Pattern: Before processing, check if database already contains documents.
Implementation:
```python
if db.is_processed(collection_name) and not args.reindex:
    print("Codebase already processed. Use --reindex to force.")
    return
```

Behavior:
- No `--reindex` flag: Skip processing if documents exist
- With `--reindex` flag: Always reprocess and replace embeddings

Rationale: Prevents duplicate embeddings from multiple runs, saves time on subsequent invocations.
Every chunk stores metadata:
- file_path: Source file location
- chunk_index: Position within file (0-indexed)
- total_chunks: How many chunks came from this file
Purpose: Search results can be traced back to exact source location. Enables features like "show me the full file" or "navigate to line X".
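As an illustration of how this metadata might be used, the sketch below groups search hits by file and renders a human-readable location. It assumes each hit is a flat dict carrying the three metadata keys; the actual shape returned by `query()` is not specified here, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_results_by_file(results: List[dict]) -> Dict[str, List[dict]]:
    """Group hits by file_path and order each group by chunk_index."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for hit in results:
        grouped[hit["file_path"]].append(hit)
    for hits in grouped.values():
        hits.sort(key=lambda hit: hit["chunk_index"])
    return dict(grouped)


def describe_location(hit: dict) -> str:
    """Render a 1-based 'chunk N of M' label for display."""
    return f'{hit["file_path"]} (chunk {hit["chunk_index"] + 1} of {hit["total_chunks"]})'
```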
Design: Database and embedding implementations are interchangeable via configuration.
Wiring:
- Interfaces defined in separate files (`database_interface.py`, `embedding_interface.py`)
- Implementations are independent modules
- CLI orchestration layer instantiates based on configuration:

```python
# Example from main.py
if args.database == "chroma":
    db = ChromaDB(persist_directory=args.path)
elif args.database == "qdrant":
    db = QdrantDB(url=args.qdrant_url)
```

Adding new implementation:
- Create class extending interface
- Implement all abstract methods
- Add conditional instantiation in CLI
- Wire configuration if needed
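As a rough example of the first two steps, here is a hypothetical in-memory backend. The abstract base class is repeated in condensed form so the sketch runs on its own. `InMemoryDB` is not part of the project, and the shape of the dicts returned by `query()` is an assumption.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class DatabaseInterface(ABC):
    """Condensed copy of the documented interface."""

    @abstractmethod
    def initialize(self, collection_name: str, dimension: int) -> None: ...
    @abstractmethod
    def add_documents(self, documents: List[str], embeddings: List[List[float]],
                      metadatas: List[dict]) -> None: ...
    @abstractmethod
    def query(self, query_embedding: List[float], n_results: int) -> List[dict]: ...
    @abstractmethod
    def is_processed(self, collection_name: str) -> bool: ...
    @abstractmethod
    def close(self) -> None: ...


def _cosine(a: List[float], b: List[float]) -> float:
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryDB(DatabaseInterface):
    """Hypothetical backend that keeps documents in a dict (lost on exit)."""

    def __init__(self) -> None:
        self._collections: Dict[str, List[Tuple[str, List[float], dict]]] = {}
        self._active: Optional[str] = None
        self._dimension = 0

    def initialize(self, collection_name: str, dimension: int) -> None:
        self._collections.setdefault(collection_name, [])
        self._active = collection_name
        self._dimension = dimension

    def add_documents(self, documents, embeddings, metadatas) -> None:
        if self._active is None:
            raise RuntimeError("call initialize() before add_documents()")
        rows = self._collections[self._active]
        for doc, emb, meta in zip(documents, embeddings, metadatas):
            if len(emb) != self._dimension:
                raise ValueError(f"expected dimension {self._dimension}, got {len(emb)}")
            rows.append((doc, emb, meta))

    def query(self, query_embedding, n_results) -> List[dict]:
        rows = self._collections.get(self._active or "", [])
        scored = [(_cosine(query_embedding, emb), doc, meta) for doc, emb, meta in rows]
        scored.sort(key=lambda item: item[0], reverse=True)
        # Assumed result shape: content, metadata, and a similarity score.
        return [{"content": doc, "metadata": meta, "score": score}
                for score, doc, meta in scored[:n_results]]

    def is_processed(self, collection_name: str) -> bool:
        return bool(self._collections.get(collection_name))

    def close(self) -> None:
        self._collections.clear()
```

A backend like this would still need a branch in the CLI wiring before it could be selected through configuration.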
```bash
# Full test suite
pytest
# With coverage
pytest --cov=code_rag
# Specific module
pytest tests/test_file_processor.py
```

- Unit tests: Individual components (file processor, chunking, embeddings)
- Integration tests: Database operations, end-to-end indexing
- Mock patterns: External dependencies (OpenAI API, Qdrant server) are mocked
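One way to keep tests independent of model downloads and network calls is a deterministic fake embedding. The sketch below is an assumption about how such a test double could look, not a description of the project's existing mocks. `FakeEmbedding` is a hypothetical name, and the interface is repeated in condensed form so the block runs on its own.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import List


class EmbeddingInterface(ABC):
    """Condensed copy of the documented interface."""

    @abstractmethod
    def embed(self, text: str) -> List[float]: ...
    @abstractmethod
    def embed_batch(self, texts: List[str]) -> List[List[float]]: ...
    @abstractmethod
    def embed_query(self, text: str) -> List[float]: ...
    @abstractmethod
    def get_dimension(self) -> int: ...


class FakeEmbedding(EmbeddingInterface):
    """Hash-based stand-in: same text always yields the same vector."""

    def __init__(self, dimension: int = 8) -> None:
        self._dimension = dimension

    def embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map digest bytes to floats in [0, 1]; carries no semantic meaning.
        return [digest[i % len(digest)] / 255.0 for i in range(self._dimension)]

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        return [self.embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        return self.embed(text)

    def get_dimension(self) -> int:
        return self._dimension
```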
```bash
# Format code
black src/code_rag/
# Lint
flake8 src/code_rag/
# Type checking (if configured)
mypy src/code_rag/
```

| Package | Purpose |
|---|---|
| `chromadb` | Default vector database |
| `qdrant-client` | Alternative vector database |
| `sentence-transformers` | Local embedding models |
| `openai` | OpenAI API embeddings |
| `tree-sitter` | Syntax-aware chunking |
| `python-dotenv` | Environment configuration |
| `mcp` | Model Context Protocol server |

Required for syntax-aware chunking:
- `tree-sitter-python`
- `tree-sitter-javascript`
- `tree-sitter-typescript`
- `tree-sitter-go`
- `tree-sitter-rust`
- `tree-sitter-java`
- `tree-sitter-cpp`
- `tree-sitter-c`

Principle: No circular imports. Each module imports only what it directly uses.
Flow:
- CLI imports: file processor, database implementations, embedding implementations, config
- Database implementations: Independent of each other
- Embedding implementations: Independent and self-contained
- File processor: No dependencies on database or embedding layers
System continues processing even when individual operations fail:
- File reading errors → print message, continue with other files
- Gitignore parsing errors → silently ignored, continue
- Syntax chunking errors → fall back to basic chunking, continue
Consider adding structured logging if more visibility into failures is needed.
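A minimal sketch of what that could look like with the standard `logging` module, applied to the file-reading case. The logger name, the `read_all` helper, and the `file_path` field passed through `extra` are all hypothetical; the current implementation prints messages instead.

```python
import logging
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical logger name; any handler or formatter can be attached to it.
logger = logging.getLogger("code_rag.file_processor")


def read_all(paths: Iterable[Path]) -> Dict[Path, str]:
    """Read every readable file; log and skip the ones that fail."""
    contents: Dict[Path, str] = {}
    for path in paths:
        try:
            contents[path] = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            # 'extra' attaches the path as a field on the log record, which a
            # JSON or key-value formatter can emit as structured data.
            logger.warning("skipping %s: %s", path, exc, extra={"file_path": str(path)})
            continue
    return contents
```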
Different backends use different similarity metrics:
- ChromaDB: Cosine similarity (normalized dot product)
- Qdrant: L2 distance (Euclidean distance)
Both are valid; results may differ slightly. This is expected behavior, not a bug.
To add support for a new language:
- Install tree-sitter binding: `pip install tree-sitter-{language}`
- Add to `SyntaxChunker.LANGUAGE_PACKAGES` mapping
- Test on sample files from that language
- Update documentation

- Larger batch sizes raise embedding throughput but use more memory
- Larger chunks preserve more context but reduce granularity
- Syntax-aware chunking is slower than line-aware chunking but keeps functions and classes intact
- Embedding models vary significantly in speed and quality
Tune based on your codebase size and quality requirements.