This walk-through explains how chunking, extraction, and hierarchy construction interact, and what to tweak when you want different aggregation levels or graph shapes.
- `python leanrag_cli.py chunk <dataset.jsonl>` writes `_chunk.json` artifacts per input file (no hierarchy yet).
- `python leanrag_cli.py extract <output_dir>` merges raw LLM triples into the shared SQLite store. Entities are keyed by name here, so duplicates collapse. Level defaults to `0` and relationships land at level `0` as well.
- `python leanrag_cli.py build <output_dir> --refresh` drives `_cluster_utils.HierarchicalClustering` to create aggregates, writes all levels back to SQLite via `persist_hierarchy_to_sqlite`, refreshes Qdrant, and reloads Neo4j (one node per `(name, level)` pair).
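
The name-keyed merge in the extract step can be pictured with a small sketch. The record fields (`name`, `description`, `source_id`) and the helper `merge_entities_by_name` are assumptions made for illustration; they are not the code in `leanrag_cli.py`.

```python
def merge_entities_by_name(raw_entities):
    """Collapse records that share a name (hypothetical helper, not the repo's code)."""
    merged = {}
    for record in raw_entities:
        name = record["name"]
        # First sighting of a name creates the level-0 row; later ones extend it.
        entry = merged.setdefault(
            name,
            {"name": name, "level": 0, "descriptions": [], "source_ids": []},
        )
        description = record.get("description")
        if description and description not in entry["descriptions"]:
            entry["descriptions"].append(description)
        source_id = record.get("source_id")
        if source_id and source_id not in entry["source_ids"]:
            entry["source_ids"].append(source_id)
    return list(merged.values())


# Illustrative records only (assumed shape).
raw = [
    {"name": "Qdrant", "description": "vector store", "source_id": "chunk-1"},
    {"name": "Neo4j", "description": "graph database", "source_id": "chunk-1"},
    {"name": "Qdrant", "description": "stores embeddings", "source_id": "chunk-2"},
]
level0 = merge_entities_by_name(raw)
print(len(level0))  # 2: the two "Qdrant" records collapse into one row
```

Because the key is the name alone, the Level-0 count tracks unique names rather than raw extractions.
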
- Level `0` = base entities from extraction (`leanrag_cli.py:176`). The count equals the number of unique names after merge.
- Higher levels = synthetic communities: `_cluster_utils.py:596` clusters the current layer and calls `process_cluster` to build LLM summaries. Those new nodes feed the next layer.
- The loop continues until either the clusterer says “enough” or we reach the fallback safeguard. Expect a final “root” layer of size 1.
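
The layer-by-layer loop described above can be sketched as follows. This is a toy model, not the repo's implementation: `toy_cluster` uses fixed-size slices in place of the GMM, `toy_summarise` stands in for the LLM summary produced by `process_cluster`, and the `max_depth` cap is an assumed form of the safeguard.

```python
def toy_cluster(nodes, cluster_size):
    """Stand-in clusterer: fixed-size slices instead of a GMM (assumption)."""
    return [nodes[i:i + cluster_size] for i in range(0, len(nodes), cluster_size)]


def toy_summarise(group, level, index):
    """Stand-in for the LLM summary step: just names the aggregate."""
    return f"L{level}-community-{index}"


def build_layers(base_names, cluster_size=20, max_depth=10):
    """Return a list of layers, where layers[0] is the Level-0 entities."""
    layers = [list(base_names)]
    # Stop at a single root node, or when the depth cap is hit.
    while len(layers[-1]) > 1 and len(layers) <= max_depth:
        level = len(layers)
        groups = toy_cluster(layers[-1], cluster_size)
        # Each group becomes one synthetic node that feeds the next layer.
        layers.append([toy_summarise(g, level, i) for i, g in enumerate(groups)])
    return layers


layers = build_layers([f"entity-{i}" for i in range(45)], cluster_size=20)
print([len(layer) for layer in layers])  # [45, 3, 1]
```

With 45 base entities and groups of at most 20, the toy run produces three Level-1 communities and a single root, which mirrors the "root layer of size 1" expectation.
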
- Merge behaviour: adjust the merge logic in `leanrag_cli.py` (search for `# Merge entities by name`). Removing the merge or using richer keys (e.g., `name + source_id`) keeps more Level-0 rows.
- Clustering controls (see `_cluster_utils.py`):
  - `cluster_size` (default `20`) influences how many groups we target per layer.
  - `cluster_threshold` and the BIC convergence check steer GMM sensitivity.
  - The fallback splitter (added near the `max_depth` calculation) keeps the pipeline from terminating early; tweak `max_groups` or remove it if you want the original behaviour.
- LLM prompts: `PROMPTS['aggregate_entities']` and `PROMPTS['cluster_cluster_relation']` control summarisation formatting. Editing these affects aggregate naming and relation synthesis.
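
To make the fallback splitter concrete, here is one way such a safeguard could work. This is a sketch under assumptions: the trigger condition (the clusterer yields at most one group while the layer is still larger than `max_groups`), the even split, and the default of 4 are all guesses for illustration, and the function names are hypothetical rather than taken from `_cluster_utils.py`.

```python
def split_fallback(nodes, max_groups=4):
    """Split a layer evenly into at most max_groups groups (hypothetical)."""
    n_groups = max(1, min(max_groups, len(nodes)))
    base, extra = divmod(len(nodes), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups absorb one leftover node each.
        size = base + (1 if i < extra else 0)
        groups.append(nodes[start:start + size])
        start += size
    return groups


def cluster_with_fallback(nodes, clusterer, max_groups=4):
    """Use the clusterer's groups unless it fails to split a large layer."""
    groups = [g for g in clusterer(nodes) if g]
    if len(groups) <= 1 and len(nodes) > max_groups:
        return split_fallback(nodes, max_groups)
    return groups


def degenerate_clusterer(layer):
    """Puts everything into one group, so the hierarchy would end early."""
    return [layer]


nodes = [f"entity-{i}" for i in range(10)]
groups = cluster_with_fallback(nodes, degenerate_clusterer, max_groups=4)
print([len(g) for g in groups])  # [3, 3, 2, 2]
```

Raising `max_groups` in a scheme like this yields wider intermediate layers; removing the fallback lets a degenerate clustering collapse the layer straight to the root.
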
- Branch from the repo defaults (`git checkout -b tuning/hierarchy`).
- Modify a single parameter (e.g., raise `cluster_size` to `40`).
- Run `python leanrag_cli.py build <dir> --refresh` and inspect:
  - CLI stats (SQLite vs. Neo4j counts).
  - `level_summary.txt` for level distributions.
  - `python neo4j_check.py` plus `visualize` for graph shape.
- Iterate. If you alter extraction, rerun `chunk → extract → build` to propagate changes.
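
When comparing runs, a per-level count is usually the quickest signal. The sketch below computes one directly from SQLite; the table name `entities` and its `name` and `level` columns are assumptions for illustration, so adapt the query to the actual schema written by `persist_hierarchy_to_sqlite`.

```python
import sqlite3


def level_distribution(conn):
    """Return {level: entity_count} from an assumed entities(name, level) table."""
    rows = conn.execute(
        "SELECT level, COUNT(*) FROM entities GROUP BY level ORDER BY level"
    ).fetchall()
    return dict(rows)


# In-memory stand-in for the shared store, with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (name TEXT, level INTEGER, PRIMARY KEY (name, level))"
)
conn.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [
        ("Qdrant", 0), ("Neo4j", 0), ("SQLite", 0), ("GMM", 0),
        ("Storage backends", 1), ("Clustering", 1),
        ("Pipeline", 2),
    ],
)
dist = level_distribution(conn)
print(dist)  # {0: 4, 1: 2, 2: 1}
```

Recording this distribution before and after a parameter change makes it easy to see whether the tweak widened or flattened the hierarchy.
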
Keep this guide handy when exploring alternate clustering strategies or when you onboard new contributors to the hierarchy subsystem.