The documentation_generation module is the orchestration core of the CodeWiki system, responsible for coordinating the entire automated documentation generation workflow. It manages the end-to-end process of analyzing codebases, grouping components into logical modules, and generating comprehensive documentation using LLM-based agents.
Transform raw code analysis (dependency graphs, components) into organized, hierarchical documentation through intelligent module clustering and iterative document generation using a dynamic programming approach.
- Dynamic Programming Strategy: Processes modules bottom-up (leaf modules first), then aggregates documentation for parent modules
- Agent-Based Generation: Delegates module-specific documentation to sub-agents via backend LLM
- Hierarchical Documentation: Generates documentation for nested module structures with contextual parent overviews
- Metadata Tracking: Records generation metadata, statistics, and configuration for documentation artifacts
DocumentationGenerator (Main Orchestrator)
├── DependencyGraphBuilder (Analysis)
├── LLMBackend (Generation Engine)
├── Configuration Management
├── File Operations
└── Logging & Progress Tracking
Input Layer:
- Config Object, Commit ID, LLMBackend → DocumentationGenerator
Core Processing:
- DocumentationGenerator orchestrates DependencyGraphBuilder
- DependencyGraphBuilder → Module Clustering
Documentation Stages:
- Leaf Module Docs (direct component analysis)
- Parent Module Docs (aggregates children)
- Repository Overview (aggregates all modules)
Output Layer:
- Markdown files, Metadata JSON, Module Tree artifacts
Purpose: Main orchestrator for the documentation generation pipeline
Key Responsibilities:
- Initialization: Set up configuration, dependency graph builder, and LLM backend
- Module Processing: Coordinate generation of documentation for all modules
- Hierarchy Management: Build and maintain module tree structure
- Documentation Aggregation: Generate parent module and repository documentation from child components
- Metadata Management: Create and maintain generation metadata
Initializes the generator with configuration and backend resources.
- Creates DependencyGraphBuilder for code analysis
- Sets up LLM backend for document generation
- Stores commit ID for metadata tracking
Main entry point - executes the complete documentation generation pipeline.
Process Flow:
- Build dependency graph from codebase
- Cluster components into logical modules (if not cached)
- Generate documentation for all modules in dependency order
- Create metadata file with generation statistics
Orchestrates documentation generation for all modules using dynamic programming.
Algorithm:
- Get topological sort of modules (leaves first)
- For each leaf module: delegate to backend agent for generation
- For each parent module: aggregate children docs, generate overview
- For repository: create system-wide overview
Generates documentation for parent/overview modules.
Process:
- Load module tree and child documentation
- Build context structure with 1-depth children docs
- Call LLM with REPO_OVERVIEW_PROMPT or MODULE_OVERVIEW_PROMPT
- Parse response (handles both wrapped and raw markdown)
- Save documentation file
Constructs input context for overview generation.
Responsibilities:
- Loads child module documentation files
- Marks target module with metadata flag
- Handles file path resolution with name variations
- Returns enriched module tree with embedded docs
Determines module processing sequence using topological sort.
Algorithm: Bottom-up traversal
- Recursively collect leaf modules first
- Add parent modules after their children
- Returns list of (path, module_name) tuples in processing order
Determines if a module has no children (leaf node).
Logic:
- Checks if "children" key exists and is non-empty
- True if children dict is empty or missing
Creates metadata.json with generation information.
Metadata Structure:
{
"generation_info": {
"timestamp": "ISO 8601 timestamp",
"main_model": "LLM model name",
"generator_version": "1.0.1",
"repo_path": "Repository path",
"commit_id": "Git commit hash"
},
"statistics": {
"total_components": "Number",
"leaf_nodes": "Number",
"max_depth": "Configured max depth"
},
"files_generated": ["list of markdown files"]
}Phase-by-Phase Processing:
-
Initialization
- Start with repository
- Build dependency graph using DependencyGraphBuilder
- Extract components and leaf nodes
-
Clustering Decision
- Check if module tree exists
- If no: Cluster modules using LLM
- If yes: Load cached module tree
- Save module tree to disk
-
Module Processing
- Perform topological sort (leaf-first order)
- Process each module in order
-
Module Generation (branch by type)
- Leaf Modules: Generate component-specific documentation
- Parent Modules: Aggregate children docs and generate overview
- Save generated documentation
-
Repository Overview
- Aggregate all module documentation
- Generate system-wide overview
-
Finalization
- Create metadata.json with generation info
- Complete documentation generation
Generation Workflow:
-
Analysis Phase:
- Main Process → DependencyGraphBuilder:
build_dependency_graph() - Returns: (components, leaf_nodes)
- Main Process → DependencyGraphBuilder:
-
Clustering Phase:
- Main Process → Clustering Module:
cluster_modules(components, leaf_nodes) - Clustering Module → LLMBackend:
complete(clustering_prompt) - Returns: module_tree back to Main Process
- Save to filesystem
- Main Process → Clustering Module:
-
Module Documentation Phase (for each module in leaf-first order):
- Leaf Modules:
- Main Process → LLMBackend:
run_module_agent(components...) - Returns: generated_docs
- Main Process → LLMBackend:
- Parent Modules:
- Main Process → FileSystem: load child module docs
- Main Process → LLMBackend:
complete(overview_prompt) - Returns: overview_docs
- Save module docs to filesystem
- Leaf Modules:
-
Overview Phase:
- Main Process → LLMBackend:
complete(repo_overview_prompt) - Returns: repo_overview
- Save overview.md
- Main Process → LLMBackend:
-
Finalization Phase:
- Create and save metadata.json
Pattern: Bottom-up computation with memoization
- Application: Process leaf modules first, store results, use for parent computation
- Benefit: Avoids redundant generation; enables incremental documentation updates
- Implementation:
get_processing_order()returns leaves-first traversal; tracked withprocessed_modulesset
Pattern: Progressive context enrichment
- Application: Parents built from children; repo built from all modules
- Benefit: Maintains consistency across documentation hierarchy
- Implementation:
build_overview_structure()embeds child docs in parent context
Pattern: Pluggable LLM backends
- Application: Different LLM implementations (CAW, PydanticAI, OpenAI-compatible)
- Benefit: Support multiple LLM providers without changing core logic
- Implementation: Accepts
LLMBackendinstance; delegates generation via.complete()and.run_module_agent()
Pattern: Centralized file operations
- Application: All I/O through
file_managerutility - Benefit: Consistent error handling, path normalization, JSON/text handling
- Implementation: Uses
file_manager.save_json(),file_manager.load_text(), etc.
Pattern: Fallback mechanisms for robustness
- Application: Module path resolution tries multiple filename variants
- Benefit: Handles sub-agent file naming inconsistencies
- Implementation:
_resolve_child_docs_path()tries variations (spaces→underscores, lowercasing, etc.)
- Used for: Analyzing codebase and extracting component dependencies
- Interface:
build_dependency_graph()→ returns(components, leaf_nodes) - Role: Provides foundation for module clustering
- Used for: All LLM-based generation (clustering, documentation, overviews)
- Interface:
.complete(prompt, model=None)→ returns generated text.run_module_agent(module_name, components, ...)→ returns updated module_tree
- Role: Performs intelligent module clustering and document generation
- Used for: Repository path, documentation output directory, LLM models, max depth
- Key Fields:
repo_path,docs_dir,main_model,cluster_model,max_depth - Role: Central configuration source
- REPO_OVERVIEW_PROMPT: Template for repository-level overview generation
- MODULE_OVERVIEW_PROMPT: Template for parent module overview generation
- Role: Guide LLM output format and content quality
- Consumes: Main entry point for CLI-initiated documentation generation
- Interface: Creates DocumentationGenerator instance, calls
.run() - Usage:
DocumentationGenerator(config).run()
- Consumes: Generated documentation files and metadata
- Interface: Reads from
docs_dirafter generation completes - Usage: Displays generated markdown and module tree in web UI
The system maintains two versions of the module tree:
-
first_module_tree.json: Initial clustering result
- Created once during clustering phase
- Never modified
- Preserved for reference and reproducibility
-
module_tree.json: Working copy during generation
- Updated by sub-agents as they process modules
- Reloaded at each iteration to capture changes
- Final version represents complete hierarchy
Sub-agents may modify the module tree structure (e.g., adjusting hierarchy, adding metadata). The generator handles this by:
- Reloading module_tree at each iteration
- Supporting dynamic structure changes
- Preserving first_module_tree for reference
# Module processing failures don't stop pipeline
try:
process_module()
except Exception as e:
logger.error(f"Failed to process module: {e}")
continue # Process next moduleBenefit: Partial documentation generation continues despite individual module failures
# Try multiple filename variants before giving up
candidates = [
child_name,
child_name.replace(" ", "_"),
child_name.replace(" ", "-"),
]
# Returns first existing match or NoneBenefit: Handles inconsistent sub-agent file naming conventions
# Handle both wrapped and raw markdown
if "<OVERVIEW>" in response:
content = response.split("<OVERVIEW>")[1].split("</OVERVIEW>")[0]
else:
content = response # Use raw responseBenefit: Works with different LLM response formats
| Parameter | Impact | Default Behavior |
|---|---|---|
repo_path |
Root directory for code analysis | Required; error if missing |
docs_dir |
Output directory for documentation | Required; created if missing |
main_model |
Primary LLM for documentation generation | Stored in metadata |
cluster_model |
Separate model for module clustering | Falls back to main_model if empty |
max_depth |
Module hierarchy depth limit | Affects clustering granularity |
The configuration object distributes parameters to various components:
- repo_path → DependencyGraphBuilder (source code location)
- docs_dir → DocumentationGenerator (output directory)
- main_model → LLMBackend (primary LLM model)
- cluster_model → Module Clustering (clustering-specific model)
- max_depth → Clustering Module (hierarchy depth limit)
- Files:
{module_name}.md(one per module) - Content: Module overview, purpose, components, relationships
- Generated by: Module agents for leaf modules; orchestrator for parents
- Example:
authentication.md,api_server.md
- File:
overview.md - Content: System-wide architecture, module relationships, design patterns
- Generated by: DocumentationGenerator from all module documentation
- Replaces:
{repo_name}.mdif generated in single-module case
- Files:
first_module_tree.json,module_tree.json - Content: Hierarchical module structure with component mappings
- Used by: Frontend for navigation; subsequent generations for caching
- File:
metadata.json - Content: Generation timestamp, models used, statistics, generated files list
- Lifecycle: Created at end of each generation
docs/
├── overview.md # Repository overview
├── authentication.md # Module: Authentication
├── api_server.md # Module: API Server
├── database.md # Module: Database
├── module_tree.json # Complete module hierarchy
├── first_module_tree.json # Initial clustering (immutable)
└── metadata.json # Generation metadata
The get_processing_order() method implements a depth-first traversal that guarantees leaves are processed before parents:
def collect_modules(tree, path):
for module_name, module_info in tree.items():
current_path = path + [module_name]
# Recursively process children first
if module_info.get("children"):
collect_modules(module_info["children"], current_path)
# Add module after its children (leaf-first order)
processing_order.append((current_path, module_name))Properties:
- Correctness: All parent dependencies appear after children
- Completeness: Every module in tree is included
- Efficiency: Single O(n) traversal
- Determinism: Consistent order across runs (dictionary iteration order)
Condition: len(module_tree) == 0 (entire repo fits in context)
Handling:
- Generates documentation for entire repository in one pass
- Treats repository as single module
- Renames output from
{repo_name}.md→overview.md - Skips parent aggregation phase
Benefit: Optimal for small/focused codebases
Condition: Child module docs file not found during parent generation
Handling:
- Logs warning
- Proceeds with empty string for that child's docs
- Parent overview generated with incomplete context
Rationale: Prevents total pipeline failure from missing individual modules
Condition: Sub-agent saves {module_name} but tree references it differently
Handling:
- Tries original name
- Tries with spaces→underscores
- Tries with spaces→hyphens
- Tries with spaces removed
- Returns first match or None
Rationale: Accommodates inconsistent naming across sub-agents
The documentation generation process follows these sequential phases:
-
Analyze Code (Initialize)
- DependencyGraphBuilder extracts components & leaf nodes
- Builds dependency graph from source code
-
Cluster Modules (Dependencies extracted)
- LLM-based clustering groups components
- Creates initial module hierarchy
-
Generate Leaf Docs (Module tree created)
- Sub-agents process each leaf module
- Generates component-specific documentation
-
Generate Parent Docs (Leaf docs complete)
- Aggregates child documentation
- Synthesizes parent/overview documentation
-
Generate Repo Overview (Parent docs complete)
- Synthesizes all module documentation
- Creates system-wide architecture overview
-
Create Metadata (Overview complete)
- Records generation information
- Collects statistics & timestamps
-
Complete (Generation complete)
- All artifacts finalized and saved
Current Behavior:
- Module tree cached in
first_module_tree.json - Documentation generation always runs (no caching of generated docs)
- Can be extended for partial re-generation
Future Enhancement Opportunity:
- Check if module docs exist before re-generation
- Skip already-processed modules
- Enable incremental updates
| Phase | Complexity | Notes |
|---|---|---|
| Dependency Graph Building | O(n) | Linear scan of code |
| Module Clustering | O(n log n) | LLM clustering over components |
| Topological Sort | O(m + n) | DFS over module tree |
| Documentation Generation | O(n × L) | n modules × L token length |
| Overview Generation | O(m × L) | m parent nodes × context size |
Where:
- n = number of components/leaf nodes
- m = number of parent modules
- L = average LLM context window
- Single-module generation: Clusters all modules in one LLM call
- Multi-module generation: Processes sequentially (can be parallelized)
- Deep hierarchies: May exceed context window limits (config via
max_depth) - Large codebases: Parent aggregation includes child doc context (can be summarized if needed)
- Cache generated documentation files
- Check existence before re-generation
- Support partial updates on codebase changes
- Process independent leaf modules concurrently
- Maintain synchronization for parent aggregation
- Reduce total generation time
- Summarize child documentation for parent generation
- Implement selective context inclusion
- Support generation of larger codebases
- Integrate with progress tracking system
- Real-time status updates to frontend
- Estimated time remaining
- Generate documentation in multiple formats (HTML, JSON, RST)
- Support template customization
- Enable documentation versioning
For detailed information on connected modules, see:
- Dependency Analysis Services - Code analysis foundation
- LLM Backends - Generation engines and provider integration
- CLI Core - Command-line interface integration
- Frontend Web App - Documentation presentation layer
- Shared Config & Utils - Configuration and utilities
- Agent Tools - Sub-agent capabilities and dependencies
The documentation_generation module is the orchestration hub of CodeWiki, coordinating the transformation of raw code analysis into organized, hierarchical documentation. Its key strengths are:
- Intelligent Hierarchy: Builds documentation from leaf modules up to repository overview
- LLM-Powered: Leverages language models for understanding and synthesis
- Robust Pipeline: Graceful error handling and fallback mechanisms
- Flexible Architecture: Pluggable backends and configuration-driven behavior
- Metadata Tracking: Maintains generation provenance and statistics
By combining dependency analysis with intelligent clustering and iterative generation, it enables developers to understand complex codebases through automatically generated, contextually-informed documentation.