LeanRAG Hierarchy Tuning Guide

This walk-through explains how chunking, extraction, and hierarchy construction interact, and what to tweak when you want different aggregation levels or graph shapes.

Pipeline Recap

  1. python leanrag_cli.py chunk <dataset.jsonl> writes _chunk.json artifacts per input file (no hierarchy yet).
  2. python leanrag_cli.py extract <output_dir> merges raw LLM triples into the shared SQLite store. Entities are keyed by name here, so duplicates collapse into a single row. Every entity defaults to level 0, and relationships are stored at level 0 as well.
  3. python leanrag_cli.py build <output_dir> --refresh drives _cluster_utils.HierarchicalClustering to create aggregates, writes all levels back to SQLite via persist_hierarchy_to_sqlite, refreshes Qdrant, and reloads Neo4j (one node per (name, level) pair).
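The three stages above can be chained from a small driver script. The following is a minimal sketch, not part of the repo: the helper names pipeline_commands and run_pipeline are hypothetical, and only the subcommands and the --refresh flag are taken from the recap. It defaults to a dry run so nothing is executed until you opt in.

```python
import subprocess
import sys
from typing import List

CLI = "leanrag_cli.py"  # script name as given in the recap above


def pipeline_commands(dataset: str, output_dir: str) -> List[List[str]]:
    """Return the three CLI invocations in the order the recap lists them."""
    py = sys.executable
    return [
        [py, CLI, "chunk", dataset],
        [py, CLI, "extract", output_dir],
        [py, CLI, "build", output_dir, "--refresh"],
    ]


def run_pipeline(dataset: str, output_dir: str, dry_run: bool = True) -> None:
    """Print each stage and, unless dry_run is set, execute it."""
    for cmd in pipeline_commands(dataset, output_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True aborts at the first failing stage, so build
            # never runs against a half-finished extract.
            subprocess.run(cmd, check=True)
```

Calling run_pipeline("dataset.jsonl", "output_dir") prints the three commands; pass dry_run=False to run them for real.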

Where Levels Come From

  • Level 0 = base entities from extraction (leanrag_cli.py:176). The count equals the number of unique names after merge.
  • Higher levels = synthetic communities: _cluster_utils.py:596 clusters the current layer and calls process_cluster to build LLM summaries. Those new nodes feed the next layer.
  • The loop continues until either the clusterer signals that no further grouping is needed or the fallback safeguard is reached. Expect a final “root” layer of size 1.
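The layer-by-layer loop described above can be pictured with a toy model. This is a simplified sketch under loud assumptions: cluster_layer uses fixed-size slices as a stand-in for the real clusterer, summarize stands in for the LLM summary built by process_cluster, and max_levels is a hypothetical safeguard. None of these functions mirror the actual code in _cluster_utils.py.

```python
from typing import Dict, List

Node = Dict[str, object]


def cluster_layer(nodes: List[Node], cluster_size: int) -> List[List[Node]]:
    """Stand-in clusterer: fixed-size slices instead of a real GMM."""
    return [nodes[i:i + cluster_size] for i in range(0, len(nodes), cluster_size)]


def summarize(cluster: List[Node], level: int, index: int) -> Node:
    """Stand-in for the LLM summary: a synthetic parent node."""
    return {
        "name": f"L{level}_community_{index}",
        "level": level,
        "children": [n["name"] for n in cluster],
    }


def build_levels(base_names: List[str], cluster_size: int = 20,
                 max_levels: int = 10) -> List[List[Node]]:
    """Stack layers until a single root remains or max_levels is hit."""
    layers: List[List[Node]] = [
        [{"name": n, "level": 0, "children": []} for n in base_names]
    ]
    while len(layers[-1]) > 1 and len(layers) <= max_levels:
        level = len(layers)
        groups = cluster_layer(layers[-1], cluster_size)
        # The new aggregate nodes become the input of the next iteration.
        layers.append([summarize(g, level, i) for i, g in enumerate(groups)])
    return layers
```

With 45 base names and cluster_size=20, this toy model yields layers of 45, 3, and 1 nodes: Level 0, one layer of communities, and the root.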

Key Parameters & Hooks

  • Merge behaviour: adjust the merge logic in leanrag_cli.py (search for # Merge entities by name). Removing the merge or using richer keys (e.g., name + source_id) keeps more Level-0 rows.
  • Clustering controls (see _cluster_utils.py):
    • cluster_size (default 20) influences how many groups we target per layer.
    • cluster_threshold and the BIC convergence check steer GMM sensitivity.
    • The fallback splitter (added near the max_depth calculation) keeps the pipeline from terminating early; tweak max_groups or remove the splitter if you want the original behaviour.
  • LLM prompts: PROMPTS['aggregate_entities'] and PROMPTS['cluster_cluster_relation'] control summarisation formatting. Editing these affects aggregate naming and relation synthesis.
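To see why the merge key matters for Level-0 size, here is a small illustrative sketch. It does not reproduce the merge logic in leanrag_cli.py; the record fields (name, source_id, description) and the helper names are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, List

Entity = Dict[str, str]


def merge_entities(entities: List[Entity],
                   key: Callable[[Entity], Hashable]) -> List[Entity]:
    """Collapse entities that share a key, concatenating descriptions."""
    merged: Dict[Hashable, Entity] = {}
    for entity in entities:
        k = key(entity)
        if k in merged:
            # Same key: fold the description into the row already kept.
            merged[k]["description"] += " | " + entity["description"]
        else:
            merged[k] = dict(entity)  # copy so the input stays untouched
    return list(merged.values())


def by_name(entity: Entity) -> Hashable:
    """Key on name only: duplicates across sources collapse."""
    return entity["name"]


def by_name_and_source(entity: Entity) -> Hashable:
    """Richer key: the same name from different sources stays separate."""
    return (entity["name"], entity["source_id"])


records = [
    {"name": "Paris", "source_id": "doc1", "description": "capital of France"},
    {"name": "Paris", "source_id": "doc2", "description": "city on the Seine"},
    {"name": "Lyon", "source_id": "doc1", "description": "city in France"},
]

print(len(merge_entities(records, by_name)))             # 2 Level-0 rows
print(len(merge_entities(records, by_name_and_source)))  # 3 Level-0 rows
```

More Level-0 rows mean more input for the first clustering pass, which in turn tends to change how many aggregate layers are produced.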

Suggested Tuning Workflow

  1. Branch off the repo defaults so you can always diff against them (git checkout -b tuning/hierarchy).
  2. Modify a single parameter (e.g., raise cluster_size to 40).
  3. Run python leanrag_cli.py build <dir> --refresh and inspect:
    • CLI stats (SQLite vs. Neo4j counts).
    • level_summary.txt for level distributions.
    • python neo4j_check.py plus visualize for graph shape.
  4. Iterate. If you alter extraction, rerun chunk → extract → build to propagate changes.
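When comparing runs, it can help to capture per-level counts before and after a parameter change. The sketch below assumes a hypothetical SQLite table named entities with name and level columns; the real schema written by persist_hierarchy_to_sqlite may differ, so adjust the query to match your store.

```python
import sqlite3
from typing import Dict

# Assumed schema (hypothetical): entities(name TEXT, level INTEGER)
LEVEL_QUERY = "SELECT level, COUNT(*) FROM entities GROUP BY level ORDER BY level"


def level_counts(conn: sqlite3.Connection) -> Dict[int, int]:
    """Return {level: number of nodes} for one run's store."""
    return {level: count for level, count in conn.execute(LEVEL_QUERY)}


def diff_counts(before: Dict[int, int], after: Dict[int, int]) -> Dict[int, int]:
    """Per-level change in node count between two runs (after - before)."""
    levels = sorted(set(before) | set(after))
    return {lvl: after.get(lvl, 0) - before.get(lvl, 0) for lvl in levels}
```

Take a snapshot with level_counts before changing a parameter, rebuild, take a second snapshot, and pass both to diff_counts. A level that appears in only one run shows up as a full gain or loss.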

Keep this guide handy when exploring alternate clustering strategies or when you onboard new contributors to the hierarchy subsystem.