Documentation Generation Module

Overview

The documentation_generation module is the orchestration core of the CodeWiki system, responsible for coordinating the entire automated documentation generation workflow. It manages the end-to-end process of analyzing codebases, grouping components into logical modules, and generating comprehensive documentation using LLM-based agents.

Core Responsibility

Transform raw code analysis (dependency graphs, components) into organized, hierarchical documentation through intelligent module clustering and iterative document generation using a dynamic programming approach.

Key Characteristics

Dynamic Programming Strategy: Processes modules bottom-up (leaf modules first), then aggregates documentation for parent modules
Agent-Based Generation: Delegates module-specific documentation to sub-agents via backend LLM
Hierarchical Documentation: Generates documentation for nested module structures with contextual parent overviews
Metadata Tracking: Records generation metadata, statistics, and configuration for documentation artifacts

Architecture

Component Structure

DocumentationGenerator (Main Orchestrator)
├── DependencyGraphBuilder (Analysis)
├── LLMBackend (Generation Engine)
├── Configuration Management
├── File Operations
└── Logging & Progress Tracking

Module Interactions

Input Layer:

Config Object, Commit ID, LLMBackend → DocumentationGenerator

Core Processing:

DocumentationGenerator orchestrates DependencyGraphBuilder
DependencyGraphBuilder → Module Clustering

Documentation Stages:

Leaf Module Docs (direct component analysis)
Parent Module Docs (aggregates children)
Repository Overview (aggregates all modules)

Output Layer:

Markdown files, Metadata JSON, Module Tree artifacts

Core Components

DocumentationGenerator Class

Purpose: Main orchestrator for the documentation generation pipeline

Key Responsibilities:

Initialization: Set up configuration, dependency graph builder, and LLM backend
Module Processing: Coordinate generation of documentation for all modules
Hierarchy Management: Build and maintain module tree structure
Documentation Aggregation: Generate parent module and repository documentation from child components
Metadata Management: Create and maintain generation metadata

Key Methods

`__init__(config, commit_id, backend)`

Initializes the generator with configuration and backend resources.

Creates DependencyGraphBuilder for code analysis
Sets up LLM backend for document generation
Stores commit ID for metadata tracking

`run()`

Main entry point: executes the complete documentation generation pipeline.

Process Flow:

Build dependency graph from codebase
Cluster components into logical modules (if not cached)
Generate documentation for all modules in dependency order
Create metadata file with generation statistics

`generate_module_documentation(components, leaf_nodes)`

Orchestrates documentation generation for all modules using dynamic programming.

Algorithm:

Get topological sort of modules (leaves first)
For each leaf module: delegate to backend agent for generation
For each parent module: aggregate children docs, generate overview
For repository: create system-wide overview
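
The four steps above can be sketched as a single bottom-up loop. This is a minimal illustration, not the module's actual code: `generate_all`, `processing_order`, `leaf_fn`, and `parent_fn` are hypothetical names, and the tree shape (module name mapped to a dict with an optional "children" dict) is assumed from the descriptions in this document.

```python
def is_leaf(info: dict) -> bool:
    # A module is a leaf when "children" is missing or empty
    return not info.get("children")


def processing_order(tree: dict, path: tuple = ()) -> list:
    # Post-order walk: children are emitted before their parent
    order = []
    for name, info in tree.items():
        current = path + (name,)
        order.extend(processing_order(info.get("children") or {}, current))
        order.append((current, info))
    return order


def generate_all(tree: dict, leaf_fn, parent_fn) -> dict:
    # leaf_fn and parent_fn stand in for the backend calls (assumption)
    docs = {}
    for path, info in processing_order(tree):
        if is_leaf(info):
            docs[path] = leaf_fn(path, info)
        else:
            # Child docs are already stored because of the leaf-first order
            child_docs = {c: docs[path + (c,)] for c in info["children"]}
            docs[path] = parent_fn(path, child_docs)
    # Repository overview: aggregate the top-level modules under the empty path
    docs[()] = parent_fn((), {name: docs[(name,)] for name in tree})
    return docs
```

Because every child result is stored before its parent is visited, each module is generated exactly once and parents only read stored results.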

`generate_parent_module_docs(module_path, working_dir)`

Generates documentation for parent/overview modules.

Process:

Load module tree and child documentation
Build context structure containing the docs of direct (depth-1) children only
Call LLM with REPO_OVERVIEW_PROMPT or MODULE_OVERVIEW_PROMPT
Parse response (handles both wrapped and raw markdown)
Save documentation file

`build_overview_structure(module_tree, module_path, working_dir)`

Constructs input context for overview generation.

Responsibilities:

Loads child module documentation files
Marks target module with metadata flag
Handles file path resolution with name variations
Returns enriched module tree with embedded docs
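
As a rough sketch of these responsibilities, the function below copies the tree, flags the target module, and embeds each direct child's markdown. The flag name `is_target_for_overview_generation`, the "docs" key, and the `{child_name}.md` file layout are assumptions made for illustration; filename-variant resolution is omitted here, and a missing file simply yields an empty string.

```python
import copy
from pathlib import Path


def build_overview_structure(module_tree: dict, module_path: list, working_dir: str) -> dict:
    tree = copy.deepcopy(module_tree)  # never mutate the caller's tree

    # Walk down to the target module; an empty path means the repository root
    children = tree
    target = None
    for name in module_path:
        target = children[name]
        children = target.setdefault("children", {})
    if target is not None:
        target["is_target_for_overview_generation"] = True  # hypothetical flag name

    # Embed the docs of direct children only
    for child_name, child_info in children.items():
        doc_file = Path(working_dir) / f"{child_name}.md"
        child_info["docs"] = (
            doc_file.read_text(encoding="utf-8") if doc_file.exists() else ""
        )
    return tree
```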

`get_processing_order(module_tree, parent_path)`

Determines module processing sequence using topological sort.

Algorithm: Bottom-up traversal

Recursively collect leaf modules first
Add parent modules after their children
Returns list of (path, module_name) tuples in processing order

`is_leaf_module(module_info)`

Determines if a module has no children (leaf node).

Logic:

Checks whether the "children" key exists and is non-empty
Returns True if the children dict is empty or missing, False otherwise

`create_documentation_metadata(working_dir, components, num_leaf_nodes)`

Creates metadata.json with generation information.

Metadata Structure:

{
    "generation_info": {
        "timestamp": "ISO 8601 timestamp",
        "main_model": "LLM model name",
        "generator_version": "1.0.1",
        "repo_path": "Repository path",
        "commit_id": "Git commit hash"
    },
    "statistics": {
        "total_components": "Number",
        "leaf_nodes": "Number",
        "max_depth": "Configured max depth"
    },
    "files_generated": ["list of markdown files"]
}
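
A structure like the one above could be assembled and written as follows. This is a sketch only: `write_metadata` and its keyword parameters are hypothetical, and listing the `*.md` files found in the output directory is one assumed way to populate "files_generated".

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_metadata(working_dir, *, main_model, repo_path, commit_id,
                   total_components, num_leaf_nodes, max_depth,
                   generator_version="1.0.1"):
    out_dir = Path(working_dir)
    metadata = {
        "generation_info": {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
            "main_model": main_model,
            "generator_version": generator_version,
            "repo_path": repo_path,
            "commit_id": commit_id,
        },
        "statistics": {
            "total_components": total_components,
            "leaf_nodes": num_leaf_nodes,
            "max_depth": max_depth,
        },
        # Assumption: every markdown file in the output directory was generated
        "files_generated": sorted(p.name for p in out_dir.glob("*.md")),
    }
    (out_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=4), encoding="utf-8"
    )
    return metadata
```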

Generation Pipeline

Processing Architecture

Phase-by-Phase Processing:

Initialization
- Start with repository
- Build dependency graph using DependencyGraphBuilder
- Extract components and leaf nodes
Clustering Decision
- Check if module tree exists
- If no: Cluster modules using LLM
- If yes: Load cached module tree
- Save module tree to disk
Module Processing
- Perform topological sort (leaf-first order)
- Process each module in order
Module Generation (branch by type)
- Leaf Modules: Generate component-specific documentation
- Parent Modules: Aggregate children docs and generate overview
- Save generated documentation
Repository Overview
- Aggregate all module documentation
- Generate system-wide overview
Finalization
- Create metadata.json with generation info
- Complete documentation generation

Data Flow

Generation Workflow:

Analysis Phase:
- Main Process → DependencyGraphBuilder: build_dependency_graph()
- Returns: (components, leaf_nodes)
Clustering Phase:
- Main Process → Clustering Module: cluster_modules(components, leaf_nodes)
- Clustering Module → LLMBackend: complete(clustering_prompt)
- Returns: module_tree back to Main Process
- Save to filesystem
Module Documentation Phase (for each module in leaf-first order):
- Leaf Modules:
  - Main Process → LLMBackend: run_module_agent(components...)
  - Returns: generated_docs
- Parent Modules:
  - Main Process → FileSystem: load child module docs
  - Main Process → LLMBackend: complete(overview_prompt)
  - Returns: overview_docs
- Save module docs to filesystem
Overview Phase:
- Main Process → LLMBackend: complete(repo_overview_prompt)
- Returns: repo_overview
- Save overview.md
Finalization Phase:
- Create and save metadata.json

Key Design Patterns

1. Dynamic Programming Approach

Pattern: Bottom-up computation with memoization

Application: Process leaf modules first, store results, use for parent computation
Benefit: Avoids redundant generation within a run; lays the groundwork for incremental documentation updates
Implementation: get_processing_order() returns leaves-first traversal; tracked with processed_modules set

2. Hierarchical Context Building

Pattern: Progressive context enrichment

Application: Parents built from children; repo built from all modules
Benefit: Maintains consistency across documentation hierarchy
Implementation: build_overview_structure() embeds child docs in parent context

3. Strategy Pattern for Backends

Pattern: Pluggable LLM backends

Application: Different LLM implementations (CAW, PydanticAI, OpenAI-compatible)
Benefit: Support multiple LLM providers without changing core logic
Implementation: Accepts LLMBackend instance; delegates generation via .complete() and .run_module_agent()
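
One way to express this pluggable interface in Python is a structural `Protocol`. The sketch below is an assumption about shape, not the project's actual class: the two method names come from this document, while the reduced `run_module_agent` parameter list, `EchoBackend`, and `generate_overview` are hypothetical.

```python
from typing import Optional, Protocol


class LLMBackend(Protocol):
    def complete(self, prompt: str, model: Optional[str] = None) -> str: ...

    def run_module_agent(self, module_name: str, components: dict) -> dict: ...


class EchoBackend:
    """Toy backend for illustration; a real one would call an LLM provider."""

    def complete(self, prompt: str, model: Optional[str] = None) -> str:
        return f"[{model or 'default'}] {prompt}"

    def run_module_agent(self, module_name: str, components: dict) -> dict:
        return {module_name: {"components": sorted(components), "children": {}}}


def generate_overview(backend: LLMBackend, prompt: str) -> str:
    # Core logic depends only on the interface, never on a concrete provider
    return backend.complete(prompt)
```

Swapping providers then means passing a different object with the same two methods; the orchestration code is unchanged.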

4. File Abstraction

Pattern: Centralized file operations

Application: All I/O through file_manager utility
Benefit: Consistent error handling, path normalization, JSON/text handling
Implementation: Uses file_manager.save_json(), file_manager.load_text(), etc.

5. Graceful Degradation

Pattern: Fallback mechanisms for robustness

Application: Module path resolution tries multiple filename variants
Benefit: Handles sub-agent file naming inconsistencies
Implementation: _resolve_child_docs_path() tries variations (spaces→underscores, lowercasing, etc.)

Integration Points

Dependencies (Inbound)

DependencyGraphBuilder [ref: dependency_analysis_services.md]

Used for: Analyzing codebase and extracting component dependencies
Interface: build_dependency_graph() → returns (components, leaf_nodes)
Role: Provides foundation for module clustering

LLMBackend [ref: llm_backends.md]

Used for: All LLM-based generation (clustering, documentation, overviews)
Interface:
- .complete(prompt, model=None) → returns generated text
- .run_module_agent(module_name, components, ...) → returns updated module_tree
Role: Performs intelligent module clustering and document generation

Configuration [ref: shared_config_and_utils.md]

Used for: Repository path, documentation output directory, LLM models, max depth
Key Fields: repo_path, docs_dir, main_model, cluster_model, max_depth
Role: Central configuration source

Prompt Templates

REPO_OVERVIEW_PROMPT: Template for repository-level overview generation
MODULE_OVERVIEW_PROMPT: Template for parent module overview generation
Role: Guide LLM output format and content quality

Dependents (Outbound)

CLI Documentation Generator [ref: cli_core.md]

Role: Main entry point for CLI-initiated documentation generation
Interface: Creates DocumentationGenerator instance, calls .run()
Usage: DocumentationGenerator(config).run()

Frontend Web App [ref: frontend_web_app.md]

Consumes: Generated documentation files and metadata
Interface: Reads from docs_dir after generation completes
Usage: Displays generated markdown and module tree in web UI

Module Tree Structure

First Module Tree vs Module Tree

The system maintains two versions of the module tree:

first_module_tree.json: Initial clustering result
- Created once during clustering phase
- Never modified
- Preserved for reference and reproducibility
module_tree.json: Working copy during generation
- Updated by sub-agents as they process modules
- Reloaded at each iteration to capture changes
- Final version represents complete hierarchy

Modification Strategy

Sub-agents may modify the module tree structure (e.g., adjusting hierarchy, adding metadata). The generator handles this by:

Reloading module_tree at each iteration
Supporting dynamic structure changes
Preserving first_module_tree for reference
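
The reload-per-iteration idea can be illustrated with a short loop. This is a sketch under assumptions: `run_with_reload` and the `process` callback are hypothetical, and the tree is read straight from a JSON file rather than through the project's file utilities.

```python
import json
from pathlib import Path


def run_with_reload(tree_file: Path, module_names: list, process) -> None:
    for name in module_names:
        # Re-read the tree every iteration so that edits written to disk by
        # the previous step (for example by a sub-agent) are picked up
        tree = json.loads(tree_file.read_text(encoding="utf-8"))
        process(name, tree, tree_file)
```

Holding one in-memory copy for the whole run would be cheaper, but it would hide any structural changes that sub-agents persist between iterations.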

Error Handling & Robustness

Exception Handling Strategy

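import logging

logger = logging.getLogger(__name__)

def process_all(processing_order, process_module):  # illustrative wrapper
    # Module processing failures don't stop the pipeline
    for module_path, module_name in processing_order:
        try:
            process_module(module_path, module_name)
        except Exception as e:
            logger.error(f"Failed to process module {module_name}: {e}")
            continue  # Process next module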

Benefit: Partial documentation generation continues despite individual module failures

File Resolution Robustness

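from pathlib import Path

def _resolve_child_docs_path(working_dir, child_name):  # signature assumed
    # Try multiple filename variants before giving up
    candidates = [
        child_name,
        child_name.replace(" ", "_"),
        child_name.replace(" ", "-"),
        child_name.replace(" ", ""),
    ]
    paths = (Path(working_dir) / f"{name}.md" for name in candidates)
    # Returns first existing match or None
    return next((p for p in paths if p.exists()), None)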

Benefit: Handles inconsistent sub-agent file naming conventions

Response Format Flexibility

def extract_overview(response):  # illustrative helper name
    # Handle both wrapped and raw markdown
    if "<OVERVIEW>" in response:
        content = response.split("<OVERVIEW>")[1].split("</OVERVIEW>")[0]
    else:
        content = response  # Use raw response
    return content

Benefit: Works with different LLM response formats

Configuration Impact

Key Configuration Parameters

Parameter	Impact	Default Behavior
`repo_path`	Root directory for code analysis	Required; error if missing
`docs_dir`	Output directory for documentation	Required; created if missing
`main_model`	Primary LLM for documentation generation	Stored in metadata
`cluster_model`	Separate model for module clustering	Falls back to main_model if empty
`max_depth`	Module hierarchy depth limit	Affects clustering granularity

Configuration Flow

The configuration object distributes parameters to various components:

repo_path → DependencyGraphBuilder (source code location)
docs_dir → DocumentationGenerator (output directory)
main_model → LLMBackend (primary LLM model)
cluster_model → Module Clustering (clustering-specific model)
max_depth → Clustering Module (hierarchy depth limit)
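
The `cluster_model` fallback described in the parameter table can be sketched with a small dataclass. The field names come from this document; the class layout, the `max_depth` default of 2, and the `effective_cluster_model` helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Config:
    repo_path: str
    docs_dir: str
    main_model: str
    cluster_model: str = ""  # empty means "use main_model"
    max_depth: int = 2       # assumed default, not taken from the source

    def effective_cluster_model(self) -> str:
        # Fall back to the primary model when no clustering model is set
        return self.cluster_model or self.main_model
```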

Output Artifacts

Generated Files

Module Documentation

Files: {module_name}.md (one per module)
Content: Module overview, purpose, components, relationships
Generated by: Module agents for leaf modules; orchestrator for parents
Example: authentication.md, api_server.md

Repository Overview

File: overview.md
Content: System-wide architecture, module relationships, design patterns
Generated by: DocumentationGenerator from all module documentation
Replaces: {repo_name}.md in the single-module case (that file is renamed to overview.md)

Module Tree Files

Files: first_module_tree.json, module_tree.json
Content: Hierarchical module structure with component mappings
Used by: Frontend for navigation; subsequent generations for caching

Metadata

File: metadata.json
Content: Generation timestamp, models used, statistics, generated files list
Lifecycle: Created at end of each generation

File Organization Example

docs/
├── overview.md                    # Repository overview
├── authentication.md              # Module: Authentication
├── api_server.md                 # Module: API Server
├── database.md                   # Module: Database
├── module_tree.json              # Complete module hierarchy
├── first_module_tree.json        # Initial clustering (immutable)
└── metadata.json                 # Generation metadata

Processing Order Algorithm

Topological Sort Implementation

The get_processing_order() method implements a depth-first traversal that guarantees leaves are processed before parents:

def get_processing_order(module_tree, parent_path=None):
    processing_order = []

    def collect_modules(tree, path):
        for module_name, module_info in tree.items():
            current_path = path + [module_name]

            # Recursively process children first
            if module_info.get("children"):
                collect_modules(module_info["children"], current_path)

            # Add module after its children (leaf-first order)
            processing_order.append((current_path, module_name))

    collect_modules(module_tree, parent_path or [])
    return processing_order

Properties:

Correctness: Every parent appears after all of its children
Completeness: Every module in the tree is included exactly once
Efficiency: Single O(n) traversal of the module tree
Determinism: Consistent order across runs (Python dicts preserve insertion order)

Special Cases

Single-Module Repository

Condition: len(module_tree) == 0 (entire repo fits in context)

Handling:

Generates documentation for entire repository in one pass
Treats repository as single module
Renames output from {repo_name}.md → overview.md
Skips parent aggregation phase

Benefit: Optimal for small/focused codebases
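
The rename step in this special case might look like the sketch below. `promote_single_module_doc` is a hypothetical helper, and it assumes the single-pass output was written as `{repo_name}.md` directly inside the docs directory.

```python
from pathlib import Path


def promote_single_module_doc(docs_dir, repo_name):
    source = Path(docs_dir) / f"{repo_name}.md"
    target = Path(docs_dir) / "overview.md"
    if not source.exists():
        return None  # nothing to rename
    # Path.replace overwrites an existing overview.md if one is present
    source.replace(target)
    return target
```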

Missing Child Documentation

Condition: Child module docs file not found during parent generation

Handling:

Logs warning
Proceeds with empty string for that child's docs
Parent overview generated with incomplete context

Rationale: Prevents total pipeline failure from missing individual modules

Multiple File Name Variants

Condition: Sub-agent saves a module's docs under a filename that differs from the {module_name} recorded in the module tree

Handling:

Tries original name
Tries with spaces→underscores
Tries with spaces→hyphens
Tries with spaces removed
Returns first match or None

Rationale: Accommodates inconsistent naming across sub-agents

Lifecycle and State Management

Generation Phases

The documentation generation process follows these sequential phases:

Analyze Code (Initialize)
- DependencyGraphBuilder extracts components & leaf nodes
- Builds dependency graph from source code
Cluster Modules (Dependencies extracted)
- LLM-based clustering groups components
- Creates initial module hierarchy
Generate Leaf Docs (Module tree created)
- Sub-agents process each leaf module
- Generates component-specific documentation
Generate Parent Docs (Leaf docs complete)
- Aggregates child documentation
- Synthesizes parent/overview documentation
Generate Repo Overview (Parent docs complete)
- Synthesizes all module documentation
- Creates system-wide architecture overview
Create Metadata (Overview complete)
- Records generation information
- Collects statistics & timestamps
Complete (Generation complete)
- All artifacts finalized and saved

Caching & Resumption

Current Behavior:

Module tree cached in first_module_tree.json
Documentation generation always runs (no caching of generated docs)
Can be extended for partial re-generation

Future Enhancement Opportunity:

Check if module docs exist before re-generation
Skip already-processed modules
Enable incremental updates

Performance Characteristics

Complexity Analysis

Phase	Complexity	Notes
Dependency Graph Building	O(n)	Linear scan of code
Module Clustering	O(n log n)	LLM clustering over components
Topological Sort	O(m + n)	DFS over module tree
Documentation Generation	O(n × L)	n modules × L token length
Overview Generation	O(m × L)	m parent nodes × context size

Where:

n = number of components/leaf nodes
m = number of parent modules
L = average number of tokens per LLM call (bounded by the context window)

Scalability Considerations

Single-module generation: Documents the entire repository in one pass when it fits in context
Multi-module generation: Processes sequentially (can be parallelized)
Deep hierarchies: May exceed context window limits (config via max_depth)
Large codebases: Parent aggregation includes child doc context (can be summarized if needed)

Future Enhancement Opportunities

1. Incremental Generation

Cache generated documentation files
Check existence before re-generation
Support partial updates on codebase changes
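
Since this is a proposed enhancement rather than current behavior, the following is only a sketch of what an existence check could look like. `modules_needing_generation` is a hypothetical helper and assumes docs are stored as `{module_name}.md` in the docs directory.

```python
from pathlib import Path


def modules_needing_generation(module_names, docs_dir):
    # Keep only the modules that have no markdown file on disk yet
    return [
        name for name in module_names
        if not (Path(docs_dir) / f"{name}.md").exists()
    ]
```

An existence check alone cannot detect stale docs; handling codebase changes would need an extra signal such as the commit ID stored in metadata.json.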

2. Parallel Processing

Process independent leaf modules concurrently
Maintain synchronization for parent aggregation
Reduce total generation time
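
A possible shape for this enhancement, sketched with the standard library's thread pool. `generate_leaves_concurrently` and the `generate` callback are hypothetical; the sketch assumes leaf generation is I/O-bound (waiting on LLM calls) and that leaf modules do not write to shared state.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_leaves_concurrently(leaf_names, generate, max_workers=4):
    # Leaves are independent of each other, so they can run side by side
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order
        results = list(pool.map(generate, leaf_names))
    # Leaving the "with" block waits for every task: this is the
    # synchronization point before parent aggregation can start
    return dict(zip(leaf_names, results))
```

Note that sub-agents which update module_tree.json concurrently would need additional coordination that this sketch does not provide.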

3. Context Optimization

Summarize child documentation for parent generation
Implement selective context inclusion
Support generation of larger codebases

4. Generation Progress Tracking

Integrate with progress tracking system
Real-time status updates to frontend
Estimated time remaining

5. Structured Output

Generate documentation in multiple formats (HTML, JSON, RST)
Support template customization
Enable documentation versioning

Summary

The documentation_generation module is the orchestration hub of CodeWiki, coordinating the transformation of raw code analysis into organized, hierarchical documentation. Its key strengths are:

Intelligent Hierarchy: Builds documentation from leaf modules up to repository overview
LLM-Powered: Leverages language models for understanding and synthesis
Robust Pipeline: Graceful error handling and fallback mechanisms
Flexible Architecture: Pluggable backends and configuration-driven behavior
Metadata Tracking: Maintains generation provenance and statistics

By combining dependency analysis with intelligent clustering and iterative generation, it enables developers to understand complex codebases through automatically generated, contextually-informed documentation.
