The dependency_analysis_services module is a comprehensive repository analysis system that orchestrates multi-language code parsing, call graph generation, and repository structure analysis. It serves as the core engine for understanding code dependencies and relationships across diverse programming languages.
Primary Purpose: Enable automated analysis of GitHub repositories to extract function definitions, call relationships, and visual representations of code structure across 9+ programming languages.
Key Capabilities:
- Multi-language AST parsing and call graph generation
- Repository structure analysis with intelligent filtering
- Cross-language function relationship resolution
- Visualization-ready graph data generation
- GitHub repository cloning and automated cleanup
The module consists of three main components working together:
AnalysisService (Main Orchestrator)
- Coordinates the complete analysis workflow
- Manages repository cloning and cleanup
- Delegates structure analysis to RepoAnalyzer
- Delegates call graph analysis to CallGraphAnalyzer
CallGraphAnalyzer (Multi-Language Coordinator)
- Routes files to language-specific analyzers
- Consolidates results from all languages
- Resolves function call relationships
- Generates visualization data
RepoAnalyzer (Structure Analysis)
- Builds file tree with pattern-based filtering
- Validates file paths for security
- Provides file and size statistics
Responsibility: Central orchestrator for the complete analysis workflow.
Key Methods:
analyze_repository_full(): Performs a full repository analysis, including call graph generation.
Workflow:
- Clones GitHub repository to temporary directory
- Parses repository information from URL
- Analyzes repository structure with filtering
- Performs multi-language call graph analysis
- Consolidates results into AnalysisResult object
- Cleans up temporary files
Returns: AnalysisResult with:
- Repository metadata
- Function definitions
- Call relationships
- File tree structure
- Visualization data
- README content (if available)
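The clone, analyze, clean up sequence above can be sketched with a `try`/`finally` block so the temporary directory is removed even when analysis fails. This is a minimal illustration, not the module's implementation: `run_with_temp_clone`, `fake_clone`, and `list_files` are hypothetical names, and the clone step is replaced by a stand-in that writes one file.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def run_with_temp_clone(
    clone: Callable[[str, Path], None],
    analyze: Callable[[Path], Dict[str, Any]],
    url: str,
) -> Dict[str, Any]:
    """Clone into a temp dir, analyze it, and always remove the dir afterwards."""
    temp_dir = Path(tempfile.mkdtemp(prefix="repo_analysis_"))
    try:
        clone(url, temp_dir)      # hypothetical clone step (e.g. a git wrapper)
        return analyze(temp_dir)  # hypothetical structure + call graph analysis
    finally:
        # Runs on success and on failure, so no temp files are left behind.
        shutil.rmtree(temp_dir, ignore_errors=True)


# Stand-ins used instead of a real git clone and a real analyzer.
def fake_clone(url: str, dest: Path) -> None:
    (dest / "main.py").write_text("def f():\n    pass\n")


def list_files(repo_dir: Path) -> Dict[str, Any]:
    return {"files": sorted(p.name for p in repo_dir.iterdir()), "dir": str(repo_dir)}


result = run_with_temp_clone(fake_clone, list_files, "https://github.com/user/repo")
```

Passing the clone and analysis steps in as callables keeps the cleanup logic testable without network access.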
analyze_repository_structure_only(): Lightweight structure-only analysis without call graph generation.
Use Cases:
- Quick repository exploration
- Codebase size assessment
- File distribution analysis
Returns: Dictionary with repository info and file tree
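For codebase size assessment, a returned file tree can be reduced to a couple of totals. The sketch below assumes the nested node shape documented for the file tree (`type`, `children`, `_size_bytes`); `summarize_tree` and the `total_bytes` key are hypothetical names chosen for illustration.

```python
from typing import Any, Dict


def summarize_tree(node: Dict[str, Any]) -> Dict[str, int]:
    """Recursively count files and sum their sizes in a nested file tree."""
    if node.get("type") == "file":
        return {"total_files": 1, "total_bytes": node.get("_size_bytes", 0)}
    totals = {"total_files": 0, "total_bytes": 0}
    for child in node.get("children", []):
        sub = summarize_tree(child)
        totals["total_files"] += sub["total_files"]
        totals["total_bytes"] += sub["total_bytes"]
    return totals


tree = {
    "type": "directory", "name": "repo_root", "path": ".",
    "children": [
        {"type": "file", "name": "main.py", "path": "src/main.py",
         "extension": ".py", "_size_bytes": 1024},
        {"type": "directory", "name": "lib", "path": "lib", "children": [
            {"type": "file", "name": "util.js", "path": "lib/util.js",
             "extension": ".js", "_size_bytes": 512},
        ]},
    ],
}
summary = summarize_tree(tree)
```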
analyze_local_repository(): Analyzes an already-cloned local repository.
Features:
- No network operations required
- Optional language filtering
- File count limiting for performance
Data Flow:
- Client calls analyze_repository_full(url) on AnalysisService
- AnalysisService clones repository to temporary directory
- RepoAnalyzer analyzes file structure and returns file tree
- CallGraphAnalyzer extracts code files from file tree
- For each file, language-specific analyzer parses content
- CallGraphAnalyzer consolidates all functions and relationships
- CallGraphAnalyzer resolves function call relationships
- CallGraphAnalyzer deduplicates call relationships
- AnalysisService cleans up temporary directory
- Client receives AnalysisResult with all analysis data
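The deduplication step in the flow above might look like the following. The source does not say which fields identify a duplicate, so treating `(caller, callee)` as the edge identity is an assumption, and the `CallRelationship` class here is a simplified stand-in.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CallRelationship:
    caller: str
    callee: str
    call_line: int
    is_resolved: bool = False


def deduplicate(relationships: List[CallRelationship]) -> List[CallRelationship]:
    """Keep the first relationship seen for each (caller, callee) pair."""
    seen = set()
    unique = []
    for rel in relationships:
        key = (rel.caller, rel.callee)  # assumed identity of an edge
        if key not in seen:
            seen.add(key)
            unique.append(rel)
    return unique


rels = [
    CallRelationship("a.py:main", "a.py:helper", 10, True),
    CallRelationship("a.py:main", "a.py:helper", 25, True),  # same edge, later line
    CallRelationship("a.py:main", "b.py:load", 12, True),
]
unique = deduplicate(rels)
```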
Responsibility: Multi-language call graph orchestration and function relationship analysis.
Key Features:
extract_code_files(file_tree) → List[Dict]
- Filters files based on supported extensions
- Extracts language information
- Returns structured code file list
Supported Extensions: .py, .js, .ts, .java, .kt, .cs, .c, .cpp, .php, etc.
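As a rough sketch of extension-based filtering, the function below walks a nested file tree and keeps files whose extension maps to a known language. The mapping table and the shape of the returned dictionaries are assumptions; the real `extract_code_files` may differ.

```python
from typing import Any, Dict, List

# Assumed extension-to-language mapping; the real table may differ.
EXTENSION_TO_LANGUAGE = {
    ".py": "python", ".js": "javascript", ".ts": "typescript",
    ".java": "java", ".kt": "kotlin", ".cs": "csharp",
    ".c": "c", ".cpp": "cpp", ".php": "php",
}


def extract_code_files(file_tree: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect files with a supported extension from a nested file tree."""
    code_files = []
    stack = [file_tree]
    while stack:
        node = stack.pop()
        if node.get("type") == "file":
            extension = (node.get("extension") or "").lower()
            language = EXTENSION_TO_LANGUAGE.get(extension)
            if language:
                code_files.append(
                    {"path": node["path"], "name": node["name"], "language": language}
                )
        else:
            stack.extend(node.get("children", []))
    return code_files


tree = {
    "type": "directory", "name": "repo_root", "path": ".",
    "children": [
        {"type": "file", "name": "main.py", "path": "src/main.py", "extension": ".py"},
        {"type": "file", "name": "README.md", "path": "README.md", "extension": ".md"},
    ],
}
code_files = extract_code_files(tree)
```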
Routes files to appropriate language-specific analyzers:
JavaScript/TypeScript → tree-sitter analyzer
Python → AST analyzer
Java/Kotlin → tree-sitter analyzer
C/C++/C# → tree-sitter analyzer
PHP → tree-sitter analyzer + namespace resolution
Resolution Strategy:
- Build lookup table with multiple keys per function:
- Full ID (file path + function name)
- Short name
- Component ID
- Match function calls to definitions using available keys
- Mark resolved vs unresolved relationships
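The resolution strategy above can be illustrated with a lookup table that registers each function under several keys. This is a simplified sketch: the `Node` and `CallRelationship` classes are trimmed stand-ins, `resolve_calls` is a hypothetical name, and keeping the first definition when short names collide is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    id: str
    name: str
    component_id: str


@dataclass
class CallRelationship:
    caller: str
    callee: str
    call_line: int
    is_resolved: bool = False


def resolve_calls(nodes: List[Node], calls: List[CallRelationship]) -> None:
    """Rewrite each callee to a known node ID when any lookup key matches."""
    lookup: Dict[str, str] = {}
    for node in nodes:
        # setdefault keeps the first definition when short names collide.
        for key in (node.id, node.component_id, node.name):
            lookup.setdefault(key, node.id)
    for call in calls:
        target = lookup.get(call.callee)
        if target is not None:
            call.callee = target
            call.is_resolved = True


nodes = [Node("utils.py:helper", "helper", "utils.py::helper")]
calls = [
    CallRelationship("main.py:run", "helper", 7),        # matches by short name
    CallRelationship("main.py:run", "requests.get", 9),  # external, stays unresolved
]
resolve_calls(nodes, calls)
```

Matching by short name is convenient but ambiguous when two files define the same function name, which is one reason a resolver may prefer the full ID or component ID when available.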
Produces Cytoscape.js-compatible graph data:
{
  "cytoscape": {
    "elements": [
      {
        "data": {
          "id": "file.py:function_name",
          "label": "function_name",
          "file": "path/to/file.py",
          "type": "function",
          "language": "python"
        },
        "classes": ["node-function", "lang-python"]
      },
      {
        "data": {
          "id": "caller->callee",
          "source": "file.py:caller",
          "target": "file.py:callee",
          "line": 42
        },
        "classes": ["edge-call"]
      }
    ]
  },
  "summary": {
    "total_nodes": 150,
    "total_edges": 243,
    "unresolved_calls": 12
  }
}
Responsibility: Repository structure analysis with intelligent file filtering.
Key Features:
- ✓ Rejects symlinks to prevent traversal attacks
- ✓ Validates paths don't escape repository root
- ✓ Permission error handling
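A path check along the lines of the bullets above could be written with `pathlib`. This is a sketch under assumptions: `is_safe_path` is a hypothetical helper, and the real validation may apply different rules.

```python
import tempfile
from pathlib import Path


def is_safe_path(repo_root: Path, candidate: Path) -> bool:
    """Return True only for non-symlink paths that stay inside repo_root."""
    if candidate.is_symlink():
        return False
    try:
        root = repo_root.resolve()
        resolved = candidate.resolve()
    except (OSError, RuntimeError):
        # Covers permission errors and symlink loops.
        return False
    return resolved == root or root in resolved.parents


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    inside_ok = is_safe_path(root, root / "src" / "main.py")
    outside_ok = is_safe_path(root, root / ".." / "elsewhere.txt")
```

Resolving both paths before comparing them means `..` segments cannot be used to step outside the repository root.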
_build_file_tree(repo_dir) → Dict
Structure:
{
  "type": "directory",
  "name": "repo_root",
  "path": ".",
  "children": [
    {
      "type": "file",
      "name": "main.py",
      "path": "src/main.py",
      "extension": ".py",
      "_size_bytes": 1024
    }
  ]
}
Include Patterns (DEFAULT):
["*.py", "*.js", "*.ts", "*.java", "*.kt", "*.cs", "*.c", "*.cpp", "*.php", "*.go", "*.rs"]
- Customizable per instance
- If specified, replaces defaults entirely
Exclude Patterns (DEFAULT):
[".*", "*/.*", "node_modules/*", "__pycache__/*", ".git/*", "*.egg-info/*", "dist/*", "build/*", "venv/*"]
- Customizable per instance
- Merges with defaults
Filter Logic:
- Check relative path against exclude patterns
- Check filename against exclude patterns
- Validate directory escaping
- For files: verify against include patterns
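The exclude-then-include order above can be sketched with the standard library. `should_include` is a hypothetical helper, the directory-escape check is omitted for brevity, and `fnmatchcase` is used so matching is case-sensitive on every platform.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def should_include(rel_path: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    """Apply exclude patterns to the path and filename, then include patterns."""
    filename = rel_path.rsplit("/", 1)[-1]
    for pattern in exclude:
        if fnmatchcase(rel_path, pattern) or fnmatchcase(filename, pattern):
            return False
    return any(
        fnmatchcase(rel_path, pattern) or fnmatchcase(filename, pattern)
        for pattern in include
    )


include = ["*.py", "*.js"]
exclude = ["node_modules/*", "__pycache__/*", ".*"]

kept = should_include("src/main.py", include, exclude)
vendored = should_include("node_modules/lib/index.js", include, exclude)
hidden = should_include(".eslintrc.js", include, exclude)
not_code = should_include("README.md", include, exclude)
```

Checking exclusions first means an excluded directory wins even when a file inside it matches an include pattern.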
| Language | Analyzer Type | Status |
|---|---|---|
| Python | AST (ast module) | ✓ Stable |
| JavaScript | tree-sitter | ✓ Stable |
| TypeScript | tree-sitter | ✓ Stable |
| Java | tree-sitter | ✓ Stable |
| Kotlin | tree-sitter | ✓ Stable |
| C# | tree-sitter | ✓ Stable |
| C | tree-sitter | ✓ Stable |
| C++ | tree-sitter | ✓ Stable |
| PHP | tree-sitter + namespace resolver | ✓ Stable |
CallGraphAnalyzer routes files to language-specific analyzers:
- Python files → PythonASTAnalyzer (using Python AST module)
- JavaScript/TypeScript files → TreeSitterJSAnalyzer (tree-sitter)
- Java files → TreeSitterJavaAnalyzer (tree-sitter)
- Kotlin files → TreeSitterKotlinAnalyzer (tree-sitter)
- C# files → TreeSitterCSharpAnalyzer (tree-sitter)
- C files → TreeSitterCAnalyzer (tree-sitter)
- C++ files → TreeSitterCppAnalyzer (tree-sitter)
- PHP files → TreeSitterPHPAnalyzer (tree-sitter + NamespaceResolver)
Each analyzer extracts function definitions and call relationships, which are consolidated into a unified set of Function Nodes and CallRelationships.
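One common way to implement this kind of routing is a dispatch table keyed by language. The sketch below uses stub functions in place of the real analyzers (`analyze_python` and `analyze_javascript` are hypothetical stand-ins, not PythonASTAnalyzer or TreeSitterJSAnalyzer) and also counts files whose language has no analyzer.

```python
from typing import Callable, Dict, List, Tuple

AnalyzerResult = Tuple[List[dict], List[dict]]  # (nodes, call relationships)


def analyze_python(path: str, source: str) -> AnalyzerResult:
    # Stand-in for an AST-based analyzer.
    return [{"id": f"{path}:stub", "language": "python"}], []


def analyze_javascript(path: str, source: str) -> AnalyzerResult:
    # Stand-in for a tree-sitter-based analyzer.
    return [{"id": f"{path}:stub", "language": "javascript"}], []


ANALYZERS: Dict[str, Callable[[str, str], AnalyzerResult]] = {
    "python": analyze_python,
    "javascript": analyze_javascript,
    "typescript": analyze_javascript,  # JS and TS share one analyzer here
}


def analyze_files(files: List[dict]) -> Tuple[List[dict], List[dict], int]:
    """Route each file by language and merge the results; count skipped files."""
    nodes, relationships, unsupported = [], [], 0
    for info in files:
        analyzer = ANALYZERS.get(info["language"])
        if analyzer is None:
            unsupported += 1  # skipped, but counted for the summary
            continue
        file_nodes, file_rels = analyzer(info["path"], info.get("source", ""))
        nodes.extend(file_nodes)
        relationships.extend(file_rels)
    return nodes, relationships, unsupported


files = [
    {"path": "a.py", "language": "python"},
    {"path": "b.ts", "language": "typescript"},
    {"path": "c.rb", "language": "ruby"},
]
nodes, relationships, skipped = analyze_files(files)
```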
All analyzers return:
Tuple[List[Node], List[CallRelationship]]
# Node represents a function/method
Node:
- id: str # Unique identifier
- name: str # Function name
- file_path: str # Source file path
- node_type: str # 'function', 'method', 'class_method'
- parameters: List[str] # Parameter names
- docstring: Optional[str] # Documentation
- component_id: str # Path::ClassName::methodName
# CallRelationship represents a function call
CallRelationship:
- caller: str # Calling function ID
- callee: str # Called function name (may be unresolved)
- call_line: int # Line number of call
- is_resolved: bool # Whether callee matched to definition
The analysis workflow follows these sequential steps:
- Repository Cloning - Clone GitHub repository to temporary directory
- URL Parsing - Extract owner, repository name, and URL from GitHub URL
- Structure Analysis - Analyze file tree with include/exclude filtering
- File Extraction - Extract code files organized by language
- Language Analysis - Route files to language-specific analyzers in parallel
  - Python files → AST analysis
  - JS/TS files → tree-sitter analysis
  - Java/Kotlin files → tree-sitter analysis
  - PHP files → tree-sitter + namespace resolver
  - C/C++/C# files → tree-sitter analysis
- Result Consolidation - Merge functions and relationships from all languages
- Relationship Resolution - Match function calls to definitions
- Deduplication - Remove duplicate call relationships
- Visualization - Generate Cytoscape.js-compatible graph data
- Cleanup - Remove temporary directory
- Return Results - Package into AnalysisResult object
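The visualization step in the workflow above could be sketched as a function that turns nodes and calls into Cytoscape.js-style elements. It follows the element shape shown in this document; `build_cytoscape`, the dictionary-based inputs, and the choice to drop edges whose target is unknown are assumptions.

```python
from typing import Any, Dict, List


def build_cytoscape(nodes: List[Dict[str, Any]], calls: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Turn function nodes and matched calls into Cytoscape.js-style elements."""
    elements = []
    for n in nodes:
        elements.append({
            "data": {"id": n["id"], "label": n["name"], "file": n["file_path"],
                     "type": "function", "language": n["language"]},
            "classes": ["node-function", f"lang-{n['language']}"],
        })
    known_ids = {n["id"] for n in nodes}
    unresolved = 0
    for c in calls:
        if c["callee"] not in known_ids:
            unresolved += 1  # no target node, so no edge can be drawn
            continue
        elements.append({
            "data": {"id": f"{c['caller']}->{c['callee']}", "source": c["caller"],
                     "target": c["callee"], "line": c["call_line"]},
            "classes": ["edge-call"],
        })
    return {
        "cytoscape": {"elements": elements},
        "summary": {"total_nodes": len(nodes),
                    "total_edges": len(calls) - unresolved,
                    "unresolved_calls": unresolved},
    }


nodes = [
    {"id": "a.py:main", "name": "main", "file_path": "a.py", "language": "python"},
    {"id": "a.py:helper", "name": "helper", "file_path": "a.py", "language": "python"},
]
calls = [
    {"caller": "a.py:main", "callee": "a.py:helper", "call_line": 3},
    {"caller": "a.py:main", "callee": "print", "call_line": 4},
]
graph = build_cytoscape(nodes, calls)
```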
Individual file analysis is protected by a 30-second timeout:
- Prevents hangs on very large or problematic files
- Platform-aware (Unix signal-based, Windows-compatible)
- Logs warning and continues to next file
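A signal-based timeout of the kind described above might be written as a context manager. This is a sketch, not the module's `timeout()`: it relies on `SIGALRM`, which exists only on Unix and only works in the main thread, and the no-op fallback elsewhere is an assumption about what "Windows-compatible" means.

```python
import signal
import threading
from contextlib import contextmanager


@contextmanager
def timeout(seconds: float):
    """Raise TimeoutError if the body runs longer than `seconds` (Unix main thread only)."""
    usable = hasattr(signal, "SIGALRM") and (
        threading.current_thread() is threading.main_thread()
    )
    if not usable:
        # Assumed fallback: no enforcement where SIGALRM is unavailable.
        yield
        return

    def _on_alarm(signum, frame):
        raise TimeoutError(f"File parsing exceeded {seconds}s timeout")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


with timeout(30):
    parsed = len("def f(): pass")  # stands in for parsing one file
```

Restoring the previous handler in `finally` keeps the context manager from interfering with other code that uses `SIGALRM`.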
# Each file analysis is wrapped in try/except
try:
    self._analyze_code_file(repo_dir, file_info)
    files_analyzed += 1
except Exception as e:
    files_failed += 1
    logger.warning(f"Failed to analyze {file_path}: {e}")
    # Continue with next file
- Tracks all temporary directories
- Automatic cleanup on AnalysisService destruction
- Explicit cleanup_all() method available
- Defensive error handling during cleanup
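Tracking temporary directories and removing them defensively can be sketched as follows. `TempDirTracker` is a hypothetical class used only to illustrate the pattern; it is not the AnalysisService implementation.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


class TempDirTracker:
    """Remembers every temp directory it creates so all can be removed later."""

    def __init__(self) -> None:
        self._dirs: List[Path] = []

    def create(self) -> Path:
        path = Path(tempfile.mkdtemp(prefix="analysis_"))
        self._dirs.append(path)
        return path

    def cleanup_all(self) -> None:
        while self._dirs:
            # ignore_errors keeps one failed removal from blocking the rest.
            shutil.rmtree(self._dirs.pop(), ignore_errors=True)

    def __del__(self) -> None:
        try:
            self.cleanup_all()
        except Exception:
            pass  # never raise from a finalizer


tracker = TempDirTracker()
first, second = tracker.create(), tracker.create()
tracker.cleanup_all()
```

Because `__del__` is not guaranteed to run promptly (or at all at interpreter exit), calling `cleanup_all()` explicitly remains the more reliable option.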
- Silently skips unsupported languages
- Reports unsupported file counts in summary
- Continues analysis with supported languages
Input Dependencies:
- language_analyzers - Provides language-specific AST parsing functions
- dependency_analyzer_models - Provides Node and CallRelationship data models
- dependency_analyzer_utils - Provides utility functions and patterns
Output Dependencies:
- documentation_generation - Consumes AnalysisResult to generate documentation
- dependency_graph_construction - Uses analysis data for graph operations
AnalysisResult feeds into:
- documentation_generation: Uses functions, relationships, and file tree to generate documentation
- Frontend: Visualization data rendered in web UI via Cytoscape.js
- API Responses: Serialized for JSON response to clients
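Since the result types are dataclasses, one plausible route to a JSON response is `dataclasses.asdict` followed by `json.dumps`. The classes below are trimmed stand-ins (`Result` is hypothetical and much smaller than AnalysisResult), and the actual serialization path is not shown in this document.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    id: str
    name: str
    file_path: str


@dataclass
class Result:  # simplified stand-in for AnalysisResult
    functions: List[Node] = field(default_factory=list)
    summary: Dict[str, Any] = field(default_factory=dict)
    readme_content: Optional[str] = None


result = Result(
    functions=[Node("main.py:run", "run", "main.py")],
    summary={"total_functions": 1},
)
payload = json.dumps(asdict(result))  # asdict recurses into nested dataclasses
decoded = json.loads(payload)
```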
from codewiki.src.be.dependency_analyzer.analysis.analysis_service import AnalysisService

service = AnalysisService()
result = service.analyze_repository_full(
    github_url="https://github.com/user/repo",
    include_patterns=["*.py", "*.js"],
    exclude_patterns=["*test*", "*spec*"]
)
# Access results
print(f"Functions: {len(result.functions)}")
print(f"Relationships: {len(result.relationships)}")
print(f"Visualization ready: {result.visualization is not None}")

result = service.analyze_repository_structure_only(
    github_url="https://github.com/user/repo"
)
# Quick overview without call graph
print(f"Total files: {result['file_summary']['total_files']}")
print(f"File tree: {result['file_tree']}")

result = service.analyze_local_repository(
    repo_path="/path/to/local/repo",
    max_files=50,
    languages=["python", "javascript"]
)
# Fast analysis of pre-cloned repository
print(f"Nodes: {len(result['nodes'])}")
print(f"Relationships: {len(result['relationships'])}")

RepoAnalyzer constructor:
analyzer = RepoAnalyzer(
    include_patterns=["*.py", "src/**/*.py"],  # Only these files
    exclude_patterns=["*test*", "__pycache__"]  # Additionally ignore these
)
Pattern Matching:
- Uses fnmatch for glob pattern matching
- Supports *, ?, [seq], [!seq]
- Directory patterns can use the / separator
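A few checks make the glob semantics concrete. They use the standard library's `fnmatchcase` (the case-sensitive variant of `fnmatch`), on the assumption that the module relies on standard `fnmatch` behavior; note in particular that `*` also matches path separators.

```python
from fnmatch import fnmatchcase

# "*" is not limited to one path segment: it also matches "/".
matches_nested = fnmatchcase("src/pkg/main.py", "*.py")

# "?" matches exactly one character.
matches_single = fnmatchcase("a.c", "?.c")

# "[seq]" and "[!seq]" match (or exclude) one character from a set.
matches_set = fnmatchcase("test_1.py", "test_[0-9].py")
matches_negated = fnmatchcase("test_a.py", "test_[!0-9].py")

# A directory pattern matches everything below that directory.
matches_dir = fnmatchcase("node_modules/lib/index.js", "node_modules/*")
```

One consequence is that a pattern such as `*test*` excludes any path containing "test" anywhere, including directory names.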
# Analyze only specific languages
result = service.analyze_local_repository(
    repo_path="/path/to/repo",
    languages=["python", "javascript"]  # Skip other languages
)
- Repository Cloning: O(repository_size)
- Structure Analysis: O(file_count)
- Call Graph Analysis: O(file_count + function_count)
- Visualization Generation: O(node_count + edge_count)
- Per-File Timeout: 30 seconds per file prevents pathological cases
- Selective Language Analysis: Only analyze supported languages
- File Count Limiting: max_files parameter for large repositories
- Lazy Loading: Visualization generated only when needed
- Deduplication: Reduces relationship graph size
For a medium repository (1000 files, 500 functions):
- Structure analysis: ~2-5 seconds
- Call graph analysis: ~10-20 seconds
- Visualization generation: <1 second
- Cleanup: <1 second
- Total: ~15-30 seconds
@dataclass
class AnalysisResult:
    repository: Repository  # Repository metadata
    functions: List[Node]  # All extracted functions
    relationships: List[CallRelationship]  # Function calls
    file_tree: Dict[str, Any]  # Nested file structure
    summary: Dict[str, Any]  # Statistics
    visualization: Dict[str, Any]  # Cytoscape data
    readme_content: Optional[str]  # README from repo

@dataclass
class Node:
    id: str  # Unique identifier
    name: str  # Function/method name
    file_path: str  # Source file path
    node_type: str  # 'function' | 'method'
    line_number: int  # Definition line
    parameters: List[str]  # Parameter names
    docstring: Optional[str]  # Documentation
    component_id: Optional[str]  # Qualified name

@dataclass
class CallRelationship:
    caller: str  # Calling function ID
    callee: str  # Called function name
    call_line: int  # Line of call
    is_resolved: bool  # Matched to definition?

service = AnalysisService()
# 1. Analyze repository
result = service.analyze_repository_full("https://github.com/example/project")
# 2. Use with documentation generator (from documentation_generation module)
from codewiki.src.be.documentation_generator import DocumentationGenerator
gen = DocumentationGenerator()
doc = gen.generate_documentation(
    analysis_result=result,
    module_name="example.project"
)
# 3. Clean up
service.cleanup_all()

service = AnalysisService()
# 1. Get structure only
structure = service.analyze_repository_structure_only(
    github_url="https://github.com/example/project",
    exclude_patterns=["**/test/**", "**/*_test.py"]
)
# 2. Inspect file distribution by extension (walk the whole tree, not only the top level)
files_by_extension = {}
stack = [structure['file_tree']]
while stack:
    node = stack.pop()
    if node.get('type') == 'file':
        ext = node.get('extension')
        files_by_extension[ext] = files_by_extension.get(ext, 0) + 1
    stack.extend(node.get('children', []))
print(f"File distribution: {files_by_extension}")

service = AnalysisService()
# Analyze repository with mixed languages
result = service.analyze_repository_full("https://github.com/example/monorepo")
# Access language-specific metrics
summary = result.summary
print(f"Languages found: {summary['languages_analyzed']}")
print(f"Unsupported files: {summary.get('unsupported_files', 0)}")
print(f"Functions extracted: {summary['total_functions']}")
# Visualization includes language information
for element in result.visualization['cytoscape']['elements']:
    if 'language' in element['data']:
        print(f"{element['data']['label']} ({element['data']['language']})")

- AnalysisService Tests:
  - Repository cloning simulation
  - Temporary directory cleanup
  - Error handling on invalid URLs
- CallGraphAnalyzer Tests:
  - File extraction filtering
  - Language routing
  - Relationship resolution logic
  - Deduplication correctness
- RepoAnalyzer Tests:
  - Pattern matching accuracy
  - Security validation (symlink rejection)
  - File tree structure correctness
  - Size calculations
- Small test repositories with known structures
- Multi-language test repositories
- Repositories with various edge cases:
  - Very large files
  - Symlinks
  - Permission errors
  - Complex module relationships
Symptom: "File parsing exceeded 30s timeout"
Solution:
- Increase the timeout in the timeout() context manager
- Use the max_files parameter to limit scope
- Exclude problematic files with exclude_patterns
Symptom: High unresolved_calls count in visualization
Causes:
- Cross-language calls (Python calling JS library)
- Dynamically resolved calls (getattr, eval)
- External library calls
Mitigation:
- Ensure include/exclude patterns capture all source files
- Check for namespace resolution issues in PHP analyzer
- Review function ID generation in language analyzers
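When investigating a high unresolved count, it can help to see which callee names account for most of it. The helper below is a hypothetical diagnostic that assumes relationships are available as dictionaries with `callee` and `is_resolved` keys.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_unresolved(relationships: List[Dict], limit: int = 5) -> List[Tuple[str, int]]:
    """Return the most frequently called names that were never matched."""
    counts = Counter(r["callee"] for r in relationships if not r["is_resolved"])
    return counts.most_common(limit)


relationships = [
    {"caller": "a.py:main", "callee": "print", "is_resolved": False},
    {"caller": "a.py:run", "callee": "print", "is_resolved": False},
    {"caller": "a.py:run", "callee": "requests.get", "is_resolved": False},
    {"caller": "a.py:main", "callee": "a.py:run", "is_resolved": True},
]
top = top_unresolved(relationships)
```

If the list is dominated by built-ins and third-party calls, the unresolved count is expected; project-local names near the top point to a pattern or ID-generation problem instead.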
Symptom: Expected functions not in results
Verification:
- Check language is supported
- Verify file matches include patterns
- Check file wasn't excluded by exclude patterns
- Review language analyzer logs for parse errors
- Ensure proper file encoding (UTF-8)
- language_analyzers - Language-specific parsing implementations
- dependency_analyzer_models - Data models and structures
- dependency_graph_construction - Graph building utilities
- documentation_generation - Doc generation from analysis results
- dependency_analyzer_utils - Utility functions and helpers
The dependency_analysis_services module provides a robust, multi-language repository analysis platform that:
✓ Orchestrates complex multi-step analysis workflows
✓ Supports 9+ programming languages with extensible architecture
✓ Generates visualization-ready call graphs
✓ Handles errors gracefully with per-file resilience
✓ Manages resources with automatic cleanup
✓ Filters intelligently with pattern-based inclusion/exclusion
✓ Integrates seamlessly with documentation generation pipeline
It serves as the analytical backbone of the CodeWiki system, transforming raw source code into structured, queryable function dependency information.