MCP T18 — Incremental indexing (file-hash based) #665

@DvirDukhan

Context

Today, every call to Project.analyze_sources() re-analyzes the entire repo from scratch. For a 5,000-file codebase this is the difference between an interactive workflow and a coffee break — and after the MCP server lands, agents will be calling index_repo repeatedly during a session as they edit code.

This ticket adds file-hash-based incremental indexing: track per-file content hashes in Redis, diff against the current state on each index call, and only re-analyze changed files.

Builds on T17's per-branch storage (each branch tracks its own hash map).

Scope (in)

  1. Hash storage — track per-file content hashes in Redis under {repo}:{branch}_files (a Redis hash, field per file path → SHA256). Persisted at the end of every full or incremental index.
  2. Project.analyze_sources(incremental=True) — walk the file tree, compute current hashes, diff against stored hashes:
    • Unchanged files → skip the analyzer entirely
    • Modified files → call existing delete_files() to remove old graph entities for this file, then re-run the analyzer (first pass) on just these files
    • Deleted files → call delete_files() only
    • New files → analyze normally
  3. Second pass (LSP symbol resolution): for v1, correctness takes priority over speed. If any file changed, run the second pass over the entire branch graph. Per-file second-pass optimization is deferred.
  4. Persist the new hash map to Redis at the end (atomic — old map stays until new one is fully written).
  5. Project API — expose was_incremental: bool and files_changed: list[str] for callers.
  6. CLI: cgraph index . defaults to incremental when a graph already exists for (project, branch); new --full flag forces a full re-index.
  7. MCP tool: index_repo(..., incremental=True) is the default (consumed by T4: [MCP T4] index_repo MCP tool #652); response includes mode: "full"|"incremental" and files_changed: list[str].
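
The four-way classification in step 2 can be sketched as a pure function over two path → hash maps. This is a minimal illustration, not the project's code: HashDiff, diff_hashes, to_delete and to_analyze are hypothetical names, and including deleted paths in files_changed is an assumption the ticket does not settle.

```python
from dataclasses import dataclass, field


@dataclass
class HashDiff:
    """Result of comparing stored hashes against the current file tree."""
    unchanged: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)
    deleted: list[str] = field(default_factory=list)
    added: list[str] = field(default_factory=list)

    @property
    def to_delete(self) -> list[str]:
        # Paths whose old graph entities must go (the delete_files() input).
        return sorted(self.modified + self.deleted)

    @property
    def to_analyze(self) -> list[str]:
        # Paths that need a first-pass analyzer run.
        return sorted(self.modified + self.added)

    @property
    def files_changed(self) -> list[str]:
        # Assumption: deleted paths count as "changed" for reporting.
        return sorted(self.modified + self.deleted + self.added)


def diff_hashes(stored: dict[str, str], current: dict[str, str]) -> HashDiff:
    """Classify every path seen in either map into exactly one bucket."""
    diff = HashDiff()
    for path in sorted(stored.keys() | current.keys()):
        if path not in current:
            diff.deleted.append(path)
        elif path not in stored:
            diff.added.append(path)
        elif stored[path] == current[path]:
            diff.unchanged.append(path)
        else:
            diff.modified.append(path)
    return diff
```

Keeping the diff free of Redis and analyzer calls makes it easy to unit-test, and it matches the note that the hash diff should be the only place that decides what to re-analyze.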

Edge cases handled

  • First-time indexing of a branch → falls back to full
  • Hash store missing or corrupted → falls back to full with a warning logged to stderr
  • File renames → treated as delete + add (rename detection deferred to Phase 2)
  • Aborted previous run leaving stale hashes → next full run overwrites
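
One way to express the fallback rules above is a single decision function. This is a sketch under stated assumptions: choose_mode and its parameters are hypothetical, "corrupted" is approximated as "a value is not a 64-character lowercase hex digest", and a missing store is only warned about when a graph already exists (otherwise it is treated as a first-time index).

```python
import re
import sys
from typing import Mapping, Optional

# A SHA256 hex digest is exactly 64 lowercase hex characters.
_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def _warn(message: str) -> None:
    print(f"warning: {message}", file=sys.stderr)


def choose_mode(
    stored: Optional[Mapping[str, str]],
    graph_exists: bool,
    force_full: bool = False,
) -> str:
    """Return "full" or "incremental" for the upcoming index run."""
    if force_full or not graph_exists:
        # Explicit --full, or first-time indexing of this branch: no warning.
        return "full"
    if not stored:
        _warn("hash store missing; falling back to full index")
        return "full"
    for path, digest in stored.items():
        ok = (
            isinstance(path, str)
            and isinstance(digest, str)
            and _SHA256_HEX.fullmatch(digest) is not None
        )
        if not ok:
            _warn(f"hash store corrupted at {path!r}; falling back to full index")
            return "full"
    return "incremental"
```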

Scope (out)

  • Per-file second-pass / LSP optimization (Phase 2).
  • Rename detection (Phase 2).
  • Cross-branch incremental (each branch has its own hash store).
  • Watching the filesystem for changes (this is pull-based; user/agent calls index_repo).

Files

  • modified api/project.py (new incremental flag, hash diff orchestration, was_incremental / files_changed attributes)
  • modified api/info.py (new file-hash get/set helpers under {repo}:{branch}_files)
  • modified api/analyzers/source_analyzer.py (incremental orchestration over the changed-file set)
  • modified api/cli.py (--full flag on index and index-repo)
  • modified api/mcp/tools/structural.py (consume incremental flag; report mode + files_changed)
  • new tests/test_incremental_indexing.py
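
The file-hash get/set helpers planned for api/info.py might look roughly like the following. It is a sketch, not the project's code: the function names and the ":tmp" staging suffix are hypothetical, and `client` is assumed to be a redis-py style client exposing delete, hset(name, mapping=...), rename and hgetall. Writing to a staging key and then renaming it over the live key is one way to keep the old map in place until the new one is fully written.

```python
def files_key(repo: str, branch: str) -> str:
    """Key layout from the ticket: {repo}:{branch}_files."""
    return f"{repo}:{branch}_files"


def save_file_hashes(client, repo: str, branch: str, hashes: dict[str, str]) -> None:
    """Replace the stored map in one step; the old map survives until RENAME."""
    key = files_key(repo, branch)
    if not hashes:
        # HSET rejects an empty mapping, so an empty repo just clears the key.
        client.delete(key)
        return
    tmp = f"{key}:tmp"  # hypothetical staging key
    client.delete(tmp)  # drop leftovers from an aborted earlier run
    client.hset(tmp, mapping=hashes)
    client.rename(tmp, key)  # RENAME overwrites the destination atomically


def load_file_hashes(client, repo: str, branch: str) -> dict[str, str]:
    """Return path -> hash, decoding bytes if the client does not."""
    def as_str(value) -> str:
        return value.decode("utf-8") if isinstance(value, bytes) else value

    raw = client.hgetall(files_key(repo, branch))
    return {as_str(k): as_str(v) for k, v in raw.items()}
```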

Acceptance criteria

  • Index fixture → re-index with no changes → second run reports mode=incremental, files_changed=[] and skips the analyzer for every file (assert via analyzer call count, not wall clock).
  • Modify one file → re-index → only that file's entities are deleted+re-added; other entities untouched (verify by node-id snapshot diff).
  • Delete a file → re-index → its entities are removed from the graph.
  • Add a new file → re-index → its entities appear.
  • First run on a fresh branch automatically falls back to full (no hash store yet).
  • --full CLI flag forces full re-index even when graph exists.
  • Corrupted hash store → falls back to full with a warning logged.
  • MCP index_repo integration test exercises an unchanged → modified → deleted → added sequence end-to-end.
  • Existing full-index tests still pass (incremental is opt-in at the API level, even if CLI defaults to it).
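
The "assert via analyzer call count" idea could be tested along these lines. Everything here is a toy stand-in: incremental_index is not the real indexer, and CountingAnalyzer is a hypothetical test double; a real test would wrap or patch the project's analyzer instead.

```python
import hashlib


class CountingAnalyzer:
    """Test double that records which paths it was asked to analyze."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def __call__(self, path: str) -> None:
        self.calls.append(path)


def incremental_index(files: dict[str, bytes], stored: dict[str, str], analyze) -> dict[str, str]:
    """Toy indexer: analyze only paths whose SHA256 differs from `stored`."""
    current = {p: hashlib.sha256(data).hexdigest() for p, data in files.items()}
    for path in sorted(current):
        if stored.get(path) != current[path]:
            analyze(path)
    return current


def test_reindex_only_touches_changed_files() -> None:
    files = {"a.py": b"x = 1\n", "b.py": b"y = 2\n"}
    analyzer = CountingAnalyzer()

    # First run: nothing stored yet, so every file is analyzed.
    stored = incremental_index(files, {}, analyzer)
    assert analyzer.calls == ["a.py", "b.py"]

    # No-op re-index: the analyzer must not be called at all.
    analyzer.calls.clear()
    stored = incremental_index(files, stored, analyzer)
    assert analyzer.calls == []

    # Modify one file: only that file is re-analyzed.
    files["b.py"] = b"y = 3\n"
    incremental_index(files, stored, analyzer)
    assert analyzer.calls == ["b.py"]
```

Counting calls keeps the test deterministic on slow or noisy CI machines, where a wall-clock threshold would be flaky.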

Dependencies

Notes for the implementer

  • Use SHA256 over file bytes (not mtime) — mtime is unreliable across git checkouts and CI environments.
  • The hash diff should be the only place that decides what to re-analyze. Don't sprinkle incremental logic deep into individual analyzers; orchestrate it in source_analyzer.py.
  • Be careful with delete_files() — it must remove all graph entities tied to a file (Functions, Classes, edges) without leaving orphans. Verify with a node-count assertion in the test.
  • The second-pass-over-everything decision is intentional for v1. Don't try to be clever here; the goal is correctness, and the first pass is where most of the win is.
  • When the hash store is missing/corrupted, log clearly to stderr so users notice and aren't surprised by a slow "incremental" run.
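
As an illustration of the first note, content hashing over file bytes can be done with the standard library alone. This is a sketch: hash_file and hash_tree are hypothetical names, and the suffix filter is an assumption standing in for whatever file selection the analyzers already apply.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA256 hex digest of the file's bytes, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path, suffixes: tuple[str, ...] = (".py",)) -> dict[str, str]:
    """Map repo-relative POSIX paths to content hashes for matching files."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in suffixes
    }
```

Because only bytes are hashed, touching a file or checking it out again with a new mtime leaves its entry unchanged.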

Metadata

Labels

enhancement: New feature or request; mcp: MCP server (model context protocol) work
