LeanRAG Hierarchy Tuning Guide

This walk-through explains how chunking, extraction, and hierarchy construction interact, and what to tweak when you want different aggregation levels or graph shapes.

Pipeline Recap

python leanrag_cli.py chunk <dataset.jsonl> writes _chunk.json artifacts per input file (no hierarchy yet).
python leanrag_cli.py extract <output_dir> merges raw LLM triples into the shared SQLite store. Entities are keyed by name at this stage, so records with the same name collapse into one row. Every entity defaults to level 0, and relationships are stored at level 0 as well.
python leanrag_cli.py build <output_dir> --refresh drives _cluster_utils.HierarchicalClustering to create aggregates, writes all levels back to SQLite via persist_hierarchy_to_sqlite, refreshes Qdrant, and reloads Neo4j (one node per (name, level) pair).
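
The three stages above can be chained from a small driver script. This is a minimal sketch, not part of the repo: pipeline_commands and run_pipeline are hypothetical helper names, and it assumes the script is launched from the directory that contains leanrag_cli.py and that output_dir is where the chunk stage wrote its artifacts.

```python
import subprocess
import sys


def pipeline_commands(dataset, output_dir, refresh=True):
    """Return the chunk, extract, and build invocations in execution order.

    Hypothetical helper: only the sub-commands and the --refresh flag come
    from the guide; the function itself is an illustration.
    """
    build = [sys.executable, "leanrag_cli.py", "build", output_dir]
    if refresh:
        build.append("--refresh")
    return [
        [sys.executable, "leanrag_cli.py", "chunk", dataset],
        [sys.executable, "leanrag_cli.py", "extract", output_dir],
        build,
    ]


def run_pipeline(dataset, output_dir, refresh=True):
    """Run each stage in turn and stop at the first failing command."""
    for cmd in pipeline_commands(dataset, output_dir, refresh):
        subprocess.run(cmd, check=True)


# Inspect the commands without executing them.
for cmd in pipeline_commands("dataset.jsonl", "output"):
    print(" ".join(cmd[1:]))
```

Calling run_pipeline("dataset.jsonl", "output") would execute the stages for real; check=True makes a failed stage raise instead of letting later stages run on stale data.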

Where Levels Come From

Level 0 = base entities from extraction (leanrag_cli.py:176). The count equals the number of unique names after merge.
Higher levels = synthetic communities: _cluster_utils.py:596 clusters the current layer and calls process_cluster to build LLM summaries. Those new nodes feed the next layer.
The loop continues until either the clusterer signals that no further grouping is needed or the fallback safeguard is reached. Expect a final "root" layer of size 1.
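
The layer-by-layer loop described above can be sketched as follows. This is an illustration under stated assumptions, not the code in _cluster_utils.py: build_levels, cluster_fn, and summarise_fn are hypothetical names, the clusterer is replaced by a fixed-size chunker, and the LLM summary is replaced by string concatenation.

```python
def build_levels(base_entities, cluster_fn, summarise_fn, max_depth=10):
    """Stack layers until a single root node remains (illustrative only)."""
    levels = [list(base_entities)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Stand-in safeguard: past max_depth, stop clustering and collapse.
        groups = cluster_fn(current) if len(levels) < max_depth else None
        # If the clusterer yields nothing, or fails to shrink the layer,
        # fold everything into one group so the loop always terminates.
        if not groups or len(groups) >= len(current):
            groups = [current]
        level = len(levels)
        levels.append([summarise_fn(group, level) for group in groups])
    return levels


def chunk_by_three(items):
    """Stand-in for the real clusterer: fixed groups of at most three."""
    return [items[i:i + 3] for i in range(0, len(items), 3)]


def fake_summary(group, level):
    """Stand-in for the LLM summary built by process_cluster."""
    return {"name": "+".join(e["name"] for e in group), "level": level}


base = [{"name": f"e{i}", "level": 0} for i in range(10)]
levels = build_levels(base, chunk_by_three, fake_summary)
print([len(layer) for layer in levels])  # [10, 4, 2, 1]
```

Each new layer is built only from the nodes of the layer below it, which is why a change at Level 0 (for example a different merge key) shifts the size of every layer above.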

Key Parameters & Hooks

Merge behaviour: adjust the merge logic in leanrag_cli.py (search for # Merge entities by name). Removing the merge or using richer keys (e.g., name + source_id) keeps more Level-0 rows.
Clustering controls (see _cluster_utils.py):
- cluster_size (default 20) influences how many groups we target per layer.
- cluster_threshold and the BIC convergence check steer GMM sensitivity.
- The fallback splitter (added near max_depth calculation) keeps the pipeline from terminating early; tweak max_groups or remove it if you want the original behaviour.
LLM prompts: PROMPTS['aggregate_entities'] and PROMPTS['cluster_cluster_relation'] control summarisation formatting. Editing these affects aggregate naming and relation synthesis.
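
To see how the merge key changes the Level-0 count, here is a minimal sketch. merge_entities and the record fields (name, source_id, description) are assumptions made for illustration; the actual merge logic lives in leanrag_cli.py and may store different fields.

```python
def merge_entities(raw_entities, key_fn=lambda e: e["name"]):
    """Collapse raw entity records that share a key (hypothetical helper)."""
    merged = {}
    for ent in raw_entities:
        key = key_fn(ent)
        rec = merged.setdefault(
            key,
            {"name": ent["name"], "level": 0, "descriptions": [], "source_ids": []},
        )
        # Keep each distinct description and source once, in first-seen order.
        if ent["description"] not in rec["descriptions"]:
            rec["descriptions"].append(ent["description"])
        if ent["source_id"] not in rec["source_ids"]:
            rec["source_ids"].append(ent["source_id"])
    return list(merged.values())


raw = [
    {"name": "Qdrant", "source_id": "doc1", "description": "vector store"},
    {"name": "Qdrant", "source_id": "doc2", "description": "vector database"},
    {"name": "Neo4j", "source_id": "doc1", "description": "graph database"},
]

by_name = merge_entities(raw)
by_name_and_source = merge_entities(raw, key_fn=lambda e: (e["name"], e["source_id"]))
print(len(by_name), len(by_name_and_source))  # 2 3
```

Keying by name alone folds the two Qdrant records into one row; keying by (name, source_id) keeps them apart, so the first clustering layer starts from more nodes.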

Suggested Tuning Workflow

Start from the repo defaults on a fresh branch (git checkout -b tuning/hierarchy).
Modify a single parameter (e.g., raise cluster_size to 40).
Run python leanrag_cli.py build <dir> --refresh and inspect:
- CLI stats (SQLite vs. Neo4j counts).
- level_summary.txt for level distributions.
- python neo4j_check.py plus visualize for graph shape.
Iterate. If you alter extraction, rerun chunk → extract → build to propagate changes.
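
When comparing runs, a quick per-level count is often enough to tell whether a parameter change had any effect. The sketch below assumes a table named entities with name and level columns; that schema is hypothetical, so adjust the query to whatever persist_hierarchy_to_sqlite actually writes. It uses an in-memory database so it runs on its own.

```python
import sqlite3


def level_distribution(conn):
    """Return {level: entity_count} from a hypothetical entities table."""
    rows = conn.execute(
        "SELECT level, COUNT(*) FROM entities GROUP BY level ORDER BY level"
    ).fetchall()
    return dict(rows)


# Toy data standing in for the real store: 4 base rows, 2 aggregates, 1 root.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (name TEXT, level INTEGER)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("ab", 1), ("cd", 1), ("root", 2)],
)

dist = level_distribution(conn)
for level, count in dist.items():
    print(f"level {level}: {count}")
```

A top level with more than one row is a hint that the hierarchy did not converge to a single root.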

Keep this guide handy when exploring alternate clustering strategies or when you onboard new contributors to the hierarchy subsystem.
