Implement graph edges collection support with admin and ingest updates #2

Open
voarsh2 wants to merge 13 commits into mit-base-salvage from mit-base-salvage-symbol

Conversation


@voarsh2 voarsh2 commented Feb 9, 2026

Summary by CodeRabbit

  • New Features

    • Graph-edge indexing (calls/imports) with dedicated graph collections, optional backfill, pipeline sync, indexer info endpoint
    • CLI flag to clear local indexing caches; support for collections without vectors
    • VS Code uploader: bundled bridge mode and new bridge configuration; bridge packaged into extension
    • Bridge: resource listing/reading proxy with timeouts and dedupe
  • Bug Fixes

    • Best-effort cleanup and copying of companion graph collections on delete/copy/recreate
    • File deletions also remove associated graph edges (non-fatal on failure)
  • Improvements

    • Admin UI reports graph copy/delete status in redirects
    • Stricter path exclusions for upload/watch flows
    • Admin-triggered ingest forces explicit collection handling
  • Tests

    • Added tests for admin redirects and ingest CLI behavior

…arity

feat(graph-edges): add qdrant _graph backfill + symbol_graph accel; mirror admin lifecycle
feat: materialize _graph edges and use for symbol_graph; ensure clone/delete handles companion collections
graph: add _graph edge collection + backfill; prune/watcher cleanup; admin copy/delete parity
feat(admin+graph): copy/delete companion _graph collections; add UI status + tests
Adds a command-line option to clear indexing caches before running the ingest process.

This ensures a clean indexing run by removing any stale file hash or symbol caches. This is useful for scenarios where the underlying code has changed significantly, invalidating the existing cache.

Also, ensures `CTXCE_FORCE_COLLECTION_NAME` disables multi-repo enumeration in `ingest_code`, forcing the use of the specified collection name, and clarifies its purpose in `indexing_admin.py`.

Broken by commit 2e6317d

coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)

Walkthrough

Adds materialized graph-edge storage and backfill for Qdrant, integrates best-effort graph sync into indexing and deletion flows, adds path-exclusion for upload clients, a cache-clear CLI flag, admin redirect graph feedback and companion collection copy/delete attempts, plus tests and debug/debug-field handling.

Changes

Cohort / File(s): Summary

  • Graph Edges Core (scripts/ingest/graph_edges.py)
    New module providing <base>_graph naming, ensure/create logic, deterministic payload-only edge upserts, deletion by caller_path, and an incremental backfill tick with state, retries, and vector-mode handling.
  • Indexing Pipeline & CLI (scripts/ingest/pipeline.py, scripts/ingest/cli.py, scripts/ingest_code.py)
    Best-effort graph-edge sync after upserts, lazy-import fallbacks and re-exports, a new CLI flag --clear-indexing-caches, and admin-spawned ingest forced-collection env handling (CTXCE_FORCE_COLLECTION_NAME).
  • Collection Admin & Admin UI (scripts/collection_admin.py, scripts/indexing_admin.py, scripts/upload_service.py, templates/admin/acl.html)
    Best-effort delete/copy of companion <name>_graph collections; admin redirects now include graph_deleted/graph_copied when inferred; CTXCE_FORCE_COLLECTION_NAME enforced for admin-spawned ingest; UI banner shows copy/delete and graph status.
  • MCP / Query Integration (scripts/mcp_impl/symbol_graph.py, scripts/mcp_impl/search.py, scripts/mcp_indexer_server.py)
    Symbol queries now consult <collection>_graph for callers/importers first and hydrate results; repo search accepts debug and strips internal debug fields when debug=false; new indexer info resource endpoint.
  • Backfill Orchestration (scripts/watch_index_core/pseudo.py, scripts/ingest_code.py)
    Config flags GRAPH_EDGES_BACKFILL and GRAPH_EDGES_BACKFILL_MAX_FILES and a guarded backfill tick; ingest_code exposes graph backfill/delete/upsert symbols with no-op fallbacks when unavailable.
  • Deletion / Prune / Watcher Integration (scripts/prune.py, scripts/watch_index_core/processor.py)
    delete_graph_edges_by_path gains an optional repo param and path normalization; the file-deletion flow triggers best-effort graph-edge deletion alongside point deletions (errors ignored).
  • Upload Clients — Path Exclusions (scripts/remote_upload_client.py, scripts/standalone_upload_client.py)
    Added _excluded_dirnames() and _is_ignored_path() and applied them to change detection, watcher events, get_all_code_files, move detection, and pending-change processing to ignore VCS/build/IDE/dev-workspace paths.
  • Admin Tests & CLI Tests (tests/test_admin_collection_delete.py, tests/test_ingest_cli.py, tests/test_staging_lifecycle.py)
    New/updated tests covering the graph_deleted redirect param, forced-collection disabling of multi-repo enumeration, and the admin copy endpoint reporting graph clone status in the redirect; spawn-ingest env behavior test updated.
  • Other Bridge / Misc (ctx-mcp-bridge/src/mcpServer.js, scripts/mcp_impl/search.py)
    Bridge: resource listing/reading with timeouts, dedupe helpers, OAuth endpoint gating. Search: debug flag propagation and debug-field stripping in results.

Sequence Diagrams

sequenceDiagram
    participant Client
    participant Pipeline
    participant Graph as graph_edges Module
    participant Qdrant

    Client->>Pipeline: upsert file (points + calls/imports)
    activate Pipeline
    Pipeline->>Graph: ensure_graph_collection(base)
    activate Graph
    Graph->>Qdrant: create/verify <base>_graph collection
    Qdrant-->>Graph: collection ready
    deactivate Graph

    Pipeline->>Graph: delete_edges_by_path(caller_path, repo)
    activate Graph
    Graph->>Qdrant: delete matching payload-only edge docs
    Qdrant-->>Graph: deletion result
    deactivate Graph

    Pipeline->>Graph: upsert_file_edges(caller_path, calls, imports, repo)
    activate Graph
    Graph->>Qdrant: upsert payload-only edge docs with deterministic IDs
    Qdrant-->>Graph: upsert ack
    deactivate Graph

    deactivate Pipeline
    Pipeline-->>Client: indexing complete
sequenceDiagram
    participant MCP as UserQuery
    participant SymbolGraph
    participant Graph as graph_edges (Query)
    participant Qdrant

    MCP->>SymbolGraph: query_callers(symbol)
    activate SymbolGraph
    SymbolGraph->>Graph: _query_graph_edges_collection(type=calls)
    activate Graph
    Graph->>Qdrant: query <base>_graph filter by edge_type=calls
    Qdrant-->>Graph: edge documents
    Graph->>Graph: hydrate & dedupe results
    Graph-->>SymbolGraph: hydrated caller results
    deactivate Graph

    alt graph results exist
        SymbolGraph-->>MCP: return hydrated results
    else
        SymbolGraph->>Qdrant: fallback query on callers[] array field
        Qdrant-->>SymbolGraph: legacy results
        SymbolGraph-->>MCP: return legacy results
    end
    deactivate SymbolGraph
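To make the second diagram concrete, here is a minimal sketch of the graph-first lookup with legacy fallback, assuming the qdrant-client Python library; the payload field names (edge_type, callee_symbol, caller_path) follow the summaries above, and the fallback filter on a callers[] array field is an assumption about the legacy schema, not a confirmed detail of this repo.

from qdrant_client import QdrantClient
from qdrant_client import models as qmodels

GRAPH_COLLECTION_SUFFIX = "_graph"

def find_callers(client: QdrantClient, collection: str, symbol: str, limit: int = 20):
    # Preferred path: payload-only edge docs in the companion graph collection.
    graph_coll = f"{collection}{GRAPH_COLLECTION_SUFFIX}"
    try:
        points, _ = client.scroll(
            collection_name=graph_coll,
            scroll_filter=qmodels.Filter(must=[
                qmodels.FieldCondition(key="edge_type", match=qmodels.MatchValue(value="calls")),
                qmodels.FieldCondition(key="callee_symbol", match=qmodels.MatchValue(value=symbol)),
            ]),
            limit=limit,
            with_payload=True,
            with_vectors=False,
        )
        if points:
            # Dedupe caller paths before hydrating them from the main collection.
            return sorted({p.payload.get("caller_path") for p in points if p.payload and p.payload.get("caller_path")})
    except Exception:
        pass  # Graph collection missing or unreachable: fall back to the legacy path.
    # Legacy fallback: filter the main collection on a callers[] array field (assumed schema).
    points, _ = client.scroll(
        collection_name=collection,
        scroll_filter=qmodels.Filter(must=[
            qmodels.FieldCondition(key="callers", match=qmodels.MatchValue(value=symbol)),
        ]),
        limit=limit,
        with_payload=True,
        with_vectors=False,
    )
    return sorted({(p.payload.get("metadata") or {}).get("path") for p in points if p.payload and (p.payload.get("metadata") or {}).get("path")})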

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through nodes and payload lanes,
Stitching calls and import chains,
Backfills tick and edges bloom — hooray!
Paths trimmed tidy so I can play,
A rabbit maps your code today. 🐇✨

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 38.20%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the main change: implementing graph edges collection support with updates to admin and ingest systems, which is reflected throughout the changeset.
  • Merge Conflict Detection ✅ Passed: No merge conflicts detected when merging into mit-base-salvage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@scripts/mcp_impl/symbol_graph.py`:
- Around line 342-352: Inner functions _scroll and _scroll_main close over loop
variables (flt and p) by reference which can cause latent bugs; change their
definitions to bind the current loop value explicitly (e.g., add flt=None /
p=None as default parameters and use that inside, or accept the value as a
parameter) so each coroutine captures the intended value when created; update
the calls that invoke _scroll() and _scroll_main() to pass no args (if using
defaults) or the bound value accordingly to ensure each task uses the correct
flt and p.
🧹 Nitpick comments (15)
scripts/remote_upload_client.py (1)

523-557: _excluded_dirnames() rebuilds the set on every _is_ignored_path call — consider caching.

Since _is_ignored_path is called for every path in detect_file_changes, get_all_code_files, and the event handler, _excluded_dirnames() reconstructs the set each time. The env vars it depends on won't change mid-process.

Also, parts = set(rel.parts) on Line 544 includes the filename itself, so a code file literally named build, dist, out, etc. (no extension) would be incorrectly ignored. This is unlikely for real code files but worth noting.

♻️ Cache the excluded set (optional)
+    `@functools.cached_property`
+    def _excluded_dirs(self) -> frozenset:
+        excluded = {
+            "node_modules", "vendor", "dist", "build", "target", "out",
+            ".git", ".hg", ".svn", ".vscode", ".idea", ".venv", "venv",
+            "__pycache__", ".pytest_cache", ".mypy_cache", ".cache",
+            ".context-engine", ".context-engine-uploader", ".codebase",
+        }
+        dev_remote = os.environ.get("DEV_REMOTE_MODE") == "1" or os.environ.get("REMOTE_UPLOAD_MODE") == "development"
+        if dev_remote:
+            excluded.add("dev-workspace")
+        return frozenset(excluded)

And to fix the filename false-positive, check only directory parts:

-        parts = set(rel.parts)
-        if parts & self._excluded_dirnames():
+        dir_parts = set(rel.parts[:-1]) if len(rel.parts) > 1 else set()
+        if dir_parts & self._excluded_dirnames():
scripts/standalone_upload_client.py (1)

733-763: Identical logic to remote_upload_client.py — keep-in-sync comment is appreciated but fragile.

The "Keep in sync with get_all_code_files exclusions" comment helps, but these two files now share ~30 lines of identical exclusion logic. Same note as the other file applies: _excluded_dirnames() is rebuilt per call, and parts = set(rel.parts) includes the filename.

Since the standalone client is intentionally dependency-free, the duplication is understandable. Consider at minimum adding a test that asserts the excluded dir sets match across both clients to catch drift.
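
A minimal sketch of such a drift test, with heavy assumptions: the class names RemoteUploadClient and StandaloneUploadClient are placeholders, and the test sidesteps __init__ on the premise that _excluded_dirnames() reads only environment variables, which should be verified before adopting this.

import scripts.remote_upload_client as remote_mod
import scripts.standalone_upload_client as standalone_mod

def test_excluded_dirnames_match_across_clients(monkeypatch):
    # Pin the env so both clients compute the same baseline exclusion set.
    monkeypatch.delenv("DEV_REMOTE_MODE", raising=False)
    monkeypatch.delenv("REMOTE_UPLOAD_MODE", raising=False)
    # Hypothetical class names; __new__ skips __init__ since the method is assumed env-only.
    remote = remote_mod.RemoteUploadClient.__new__(remote_mod.RemoteUploadClient)
    standalone = standalone_mod.StandaloneUploadClient.__new__(standalone_mod.StandaloneUploadClient)
    assert remote._excluded_dirnames() == standalone._excluded_dirnames()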

scripts/collection_admin.py (1)

493-505: Log the graph-copy failure instead of silently swallowing it.

The recursive copy_collection_qdrant call is properly guarded against infinite recursion (the _graph suffix check). However, silently passing on exceptions makes debugging difficult when graph copies fail unexpectedly.

♻️ Add minimal logging
     if not src.endswith("_graph") and not dest.endswith("_graph"):
         try:
             copy_collection_qdrant(
                 source=f"{src}_graph",
                 target=f"{dest}_graph",
                 qdrant_url=base_url,
                 overwrite=overwrite,
             )
-        except Exception:
-            pass
+        except Exception as exc:
+            import logging as _logging
+            _logging.getLogger(__name__).debug(
+                "Best-effort graph collection copy %s_graph -> %s_graph failed: %s",
+                src, dest, exc,
+            )
scripts/ingest/graph_edges.py (2)

109-144: Broad except on vector-less creation attempt may mask transient errors.

Line 117 catches any exception from the vector-less create_collection call and falls back to the named-vector variant. If the failure is a transient network error rather than "unsupported config," the fallback will likely also fail — so the net effect is acceptable. Just noting this for awareness.

♻️ Optional: log the fallback reason at debug level
         try:
             client.create_collection(
                 collection_name=graph_coll,
                 vectors_config={},
             )
             _GRAPH_VECTOR_MODE[graph_coll] = "none"
-        except Exception:
+        except Exception as vec_exc:
+            logger.debug("Vector-less creation failed for %s, trying named vector: %s", graph_coll, vec_exc)
             client.create_collection(

147-149: Edge ID uses : as a separator — consider a more robust delimiter.

If any component (repo, path, symbol) contains :, different edge tuples could produce the same hash input. E.g., "calls:a:b:c:d" is ambiguous. A null byte (\x00) would be safer as a delimiter since it doesn't appear in file paths or symbol names.

This is low-risk in practice since paths and symbols rarely contain colons.

♻️ Use null byte separator
 def _edge_id(edge_type: str, repo: str, caller_path: str, callee_symbol: str) -> str:
-    key = f"{edge_type}:{repo}:{caller_path}:{callee_symbol}"
+    key = f"{edge_type}\x00{repo}\x00{caller_path}\x00{callee_symbol}"
     return hashlib.sha256(key.encode("utf-8", errors="ignore")).hexdigest()[:32]
scripts/mcp_impl/symbol_graph.py (2)

27-28: GRAPH_COLLECTION_SUFFIX is duplicated from scripts/ingest/graph_edges.py.

Both files define GRAPH_COLLECTION_SUFFIX = "_graph". This is a minor DRY concern; consider importing from graph_edges or extracting to a shared constants module if the two modules will evolve together.


381-432: N+1 hydration queries — acceptable given the limit cap.

Each caller path triggers a separate scroll call against the main collection (lines 400-408). With the default limit of 20, this means up to 20 sequential queries. For the current use case this is fine, but if limit grows significantly, consider batching hydration using a single scroll with a MatchAny filter on metadata.path.
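
If batching does become necessary, a single scroll with a MatchAny filter could look roughly like this (a sketch, assuming qdrant-client and the metadata.path payload key mentioned above):

from qdrant_client import models as qmodels

def hydrate_paths_batched(client, collection: str, caller_paths: list[str], limit: int = 20):
    # One scroll for all caller paths instead of one scroll per path.
    points, _ = client.scroll(
        collection_name=collection,
        scroll_filter=qmodels.Filter(must=[
            qmodels.FieldCondition(
                key="metadata.path",
                match=qmodels.MatchAny(any=caller_paths),
            ),
        ]),
        limit=limit,
        with_payload=True,
        with_vectors=False,
    )
    return points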

scripts/upload_service.py (1)

1241-1273: Import pooled_qdrant_client and use it instead of creating a standalone QdrantClient.

The rest of the codebase uses pooled_qdrant_client for Qdrant interactions. Here, a new QdrantClient is instantiated just to check if the _graph collection exists. Using the pooled client would be more consistent and avoid creating an extra connection. This requires importing pooled_qdrant_client from scripts.qdrant_client_manager.

♻️ Suggested refactor using pooled client

At the top of the file with other imports, add:

try:
    from scripts.qdrant_client_manager import pooled_qdrant_client
except Exception:
    pooled_qdrant_client = None

Then replace the collection check logic:

     graph_copied: Optional[str] = None
     try:
         if not name.endswith("_graph") and not str(new_name).endswith("_graph"):
-            from qdrant_client import QdrantClient  # type: ignore
-
-            cli = QdrantClient(
-                url=QDRANT_URL,
-                api_key=os.environ.get("QDRANT_API_KEY"),
-                timeout=float(os.environ.get("QDRANT_TIMEOUT", "5") or 5),
-            )
-            try:
-                cli.get_collection(collection_name=f"{new_name}_graph")
-                graph_copied = "1"
-            except Exception:
-                graph_copied = "0"
-            finally:
-                try:
-                    cli.close()
-                except Exception:
-                    pass
+            if pooled_qdrant_client is not None:
+                with pooled_qdrant_client(url=QDRANT_URL, api_key=os.environ.get("QDRANT_API_KEY")) as cli:
+                    try:
+                        cli.get_collection(collection_name=f"{new_name}_graph")
+                        graph_copied = "1"
+                    except Exception:
+                        graph_copied = "0"
     except Exception:
         graph_copied = None
scripts/ingest/cli.py (1)

195-209: Remove duplication — reuse scripts/collection_health.clear_indexing_caches instead.

scripts/collection_health.py already has a clear_indexing_caches(workspace_path, repo_name) function that clears both file-hash and symbol caches. It uses public clear_repo_cache/clear_cache APIs instead of relying on private helpers (_ws._get_repo_state_dir, _ws._get_cache_path), which are fragile if workspace_state internals change. The existing implementation also handles errors more gracefully.

Since the return value is discarded at all call sites, wrap the existing function in a thin adapter:

♻️ Reuse existing helper
+from scripts.collection_health import clear_indexing_caches as _clear_indexing_caches_impl
+
 ...
 
-    def _clear_indexing_caches(workspace_root: Path, repo_name: str | None) -> None:
-        try:
-            _ws.clear_symbol_cache(workspace_path=str(workspace_root), repo_name=repo_name)
-        except Exception:
-            pass
-        try:
-            if _ws.is_multi_repo_mode() and repo_name:
-                state_dir = _ws._get_repo_state_dir(repo_name)
-                cache_path = state_dir / _ws.CACHE_FILENAME
-            else:
-                cache_path = _ws._get_cache_path(workspace_root)
-            if cache_path.exists():
-                cache_path.unlink()
-        except Exception:
-            pass
+    def _clear_indexing_caches(workspace_root: Path, repo_name: str | None) -> None:
+        try:
+            _clear_indexing_caches_impl(str(workspace_root), repo_name=repo_name)
+        except Exception:
+            pass
scripts/ingest/pipeline.py (2)

656-689: Silent except Exception: pass swallows errors without any diagnostic trace.

While this block is intentionally best-effort, silently swallowing all exceptions (including unexpected ones like TypeError or AttributeError from API misuse) makes production debugging very difficult. A debug-level log would preserve the "safe to skip" intent while aiding troubleshooting.

Also, this entire block (env check → lazy import → ensure → delete → upsert) is duplicated nearly verbatim at lines 1404–1435 in process_file_with_smart_reindexing. Consider extracting a shared helper (e.g., _sync_graph_edges_best_effort(client, collection, file_path, repo, calls, imports)) to keep both call sites in sync and reduce copy-paste drift.

♻️ Proposed: extract helper and add minimal logging

Add a helper near the top of the file (after imports):

def _sync_graph_edges_best_effort(
    client: QdrantClient,
    collection: str,
    file_path: str,
    repo: str | None,
    calls: list[str] | None,
    imports: list[str] | None,
) -> None:
    """Best-effort sync of file-level graph edges. Safe to skip on failure."""
    enabled = str(os.environ.get("GRAPH_EDGES_ENABLE", "1") or "").strip().lower() in {
        "1", "true", "yes", "on",
    }
    if not enabled:
        return
    try:
        from scripts.ingest.graph_edges import (
            delete_edges_by_path,
            ensure_graph_collection,
            upsert_file_edges,
        )
        ensure_graph_collection(client, collection)
        delete_edges_by_path(client, collection, caller_path=file_path, repo=repo)
        upsert_file_edges(client, collection, caller_path=file_path, repo=repo, calls=calls, imports=imports)
    except Exception as exc:
        try:
            print(f"[graph_edges] best-effort sync failed for {file_path}: {exc}")
        except Exception:
            pass

Then replace both blocks:

-        # Optional: materialize file-level graph edges in a companion `<collection>_graph` store.
-        # This is an accelerator for symbol_graph callers/importers and is safe to skip on failure.
-        try:
-            enabled = str(os.environ.get("GRAPH_EDGES_ENABLE", "1") or "").strip().lower() in {
-                "1",
-                "true",
-                "yes",
-                "on",
-            }
-            if enabled:
-                from scripts.ingest.graph_edges import (
-                    delete_edges_by_path as _delete_edges_by_path,
-                    ensure_graph_collection as _ensure_graph_collection,
-                    upsert_file_edges as _upsert_file_edges,
-                )
-
-                _ensure_graph_collection(client, collection)
-                # Important: delete stale edges for this file before upserting the new set.
-                _delete_edges_by_path(
-                    client,
-                    collection,
-                    caller_path=str(file_path),
-                    repo=repo_tag,
-                )
-                _upsert_file_edges(
-                    client,
-                    collection,
-                    caller_path=str(file_path),
-                    repo=repo_tag,
-                    calls=calls,
-                    imports=imports,
-                )
-        except Exception:
-            pass
+        _sync_graph_edges_best_effort(client, collection, str(file_path), repo_tag, calls, imports)

1404-1435: Duplicate graph-edge sync block — same concern as lines 656–689.

This is a near-verbatim copy of the graph-edge sync block in _index_single_file_inner. If you extract the helper suggested above, this becomes a one-liner too, and the two paths stay in lockstep automatically.

scripts/ingest_code.py (1)

215-232: Asymmetric fallback: graph_edges_backfill_tick = None vs. no-op functions for the other two.

This is a valid design choice (callers should explicitly check before invoking a periodic tick), but it creates a subtle API inconsistency in the façade. Any consumer calling graph_edges_backfill_tick(...) without a None check will get a TypeError at runtime when graph_edges is unavailable.

If this is intentional (forcing callers to guard the call), consider adding a brief docstring or inline comment explaining why the backfill tick uses None while the others use no-op stubs.
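
For illustration, the guard a caller would need might look like the sketch below; the tick's parameters are placeholders, and scripts.ingest_code is assumed to be importable as a module.

import scripts.ingest_code as ingest_code

def maybe_backfill(client, collection, max_files=100):
    # Unlike the no-op stubs, the backfill tick is None when graph_edges is unavailable,
    # so it must be checked before calling.
    tick = getattr(ingest_code, "graph_edges_backfill_tick", None)
    if tick is None:
        return 0  # graph support unavailable; skip the periodic tick
    return tick(client, collection, max_files=max_files)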

scripts/indexing_admin.py (2)

930-935: Best-effort graph companion cleanup — consider a debug log for production visibility.

The _graph suffix guard correctly prevents recursive deletion. The silent except Exception: pass is consistent with the best-effort pattern, but a brief log message (even at debug level) would help operators understand why graph data might persist after a collection delete, especially in cases where the Qdrant API returns a non-404 error (e.g., timeout, auth failure).

🔧 Minimal logging suggestion
         if not name.endswith("_graph"):
             try:
                 cli.delete_collection(collection_name=f"{name}_graph")
-            except Exception:
-                pass
+            except Exception as exc:
+                try:
+                    print(f"[indexing_admin] best-effort graph collection delete failed for {name}_graph: {exc}")
+                except Exception:
+                    pass

960-965: Same pattern as delete_collection_qdrant — same optional logging suggestion applies.

scripts/prune.py (1)

42-90: Consider reusing the shared _normalize_path function for consistency.

The path normalization logic in delete_graph_edges_by_path is nearly identical to graph_edges.py's _normalize_path helper, but there's a minor exception-handling divergence: if os.path.normpath fails, graph_edges.py still applies backslash replacement while prune.py returns the path unchanged. While this is unlikely to cause issues in practice, extracting _normalize_path into a shared utility module would eliminate duplication and guarantee identical behavior across both modules.
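
A shared helper might look like the following sketch; the module path scripts/ingest/path_utils.py is hypothetical, and the behavior mirrors the graph_edges.py variant (backslash replacement still applied when normpath fails):

# Hypothetical shared module, e.g. scripts/ingest/path_utils.py
import os

def normalize_path(path: str) -> str:
    """Normalize separators so graph-edge upserts and prune deletions agree on keys."""
    p = str(path or "")
    try:
        p = os.path.normpath(p)
    except Exception:
        pass  # keep the raw string; backslash replacement below still applies
    return p.replace("\\", "/")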

@voarsh2 voarsh2 changed the title from "Mit base salvage symbol" to "Implement graph edges collection support with admin and ingest updates" on Feb 9, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@scripts/collection_admin.py`:
- Around line 493-510: Module-level logger is missing causing NameError in the
except block; add a module logger by importing logging and creating logger =
logging.getLogger(__name__) near the top-level imports so logger.debug in the
best-effort graph copy block (around copy_collection_qdrant usage) can run
without raising; ensure the import and logger creation are placed at module
scope so all functions in scripts/collection_admin.py can use logger.
🧹 Nitpick comments (7)
scripts/mcp_impl/symbol_graph.py (1)

27-34: Consider expiring the missing-collection cache.

_MISSING_GRAPH_COLLECTIONS is a permanent negative cache, so long-lived processes won’t re-check if collections appear later. A TTL or manual reset hook would keep it fresh.
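
One way to add expiry, sketched with illustrative names and a made-up TTL (the real module keeps a plain set; this variant stores timestamps instead):

import time

_MISSING_GRAPH_COLLECTIONS: dict[str, float] = {}  # collection name -> time last seen missing
_MISSING_TTL_SECONDS = 300.0

def mark_graph_collection_missing(name: str) -> None:
    _MISSING_GRAPH_COLLECTIONS[name] = time.monotonic()

def is_graph_collection_known_missing(name: str) -> bool:
    ts = _MISSING_GRAPH_COLLECTIONS.get(name)
    if ts is None:
        return False
    if time.monotonic() - ts > _MISSING_TTL_SECONDS:
        # Entry expired: let the next lookup re-check Qdrant.
        _MISSING_GRAPH_COLLECTIONS.pop(name, None)
        return False
    return True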

scripts/standalone_upload_client.py (1)

733-768: Consider sharing ignore logic with scripts/remote_upload_client.py.

_excluded_dirnames and _is_ignored_path mirror the other client; extracting a shared helper would reduce drift over time.

scripts/remote_upload_client.py (2)

523-562: Duplicated logic with standalone_upload_client.py.

Both _excluded_dirnames and _is_ignored_path are near-identical copies of the same methods in scripts/standalone_upload_client.py (lines 732–767). The comment on line 524 acknowledges this ("Keep in sync with standalone_upload_client exclusions"), but keeping them manually in sync is error-prone.

Consider extracting these into a shared utility module to avoid divergence.


556-561: Extensionless-file lookup is case-sensitive on the dict side.

Line 560 compares rel.name.lower() against the lowered keys in extensionless. However, idx.EXTENSIONLESS_FILES keys may already be lowercase by convention. If a key were ever added with mixed case, this set(…keys()) (without lowering each key) would silently miss it.

The standalone client at line 766 uses (EXTENSIONLESS_FILES or {}).keys() without lowering keys either. This is consistent but fragile.

Suggested defensive fix
         try:
-            extensionless = set((idx.EXTENSIONLESS_FILES or {}).keys())
+            extensionless = {k.lower() for k in (idx.EXTENSIONLESS_FILES or {}).keys()}
         except Exception:
             extensionless = set()
scripts/ingest/graph_edges.py (3)

67-149: Silently swallowed index-creation errors may hide configuration issues.

Line 141–142: the except Exception: pass when creating payload indexes means a persistent configuration issue (e.g., invalid field schema, Qdrant version mismatch) will never surface. Since this is called once per collection lifecycle, a logger.debug would be inexpensive and aids troubleshooting.

Suggested improvement
             try:
                 client.create_payload_index(
                     collection_name=graph_coll,
                     field_name=field,
                     field_schema=qmodels.PayloadSchemaType.KEYWORD,
                 )
-            except Exception:
-                pass
+            except Exception as exc:
+                logger.debug("Payload index creation for %s.%s skipped: %s", graph_coll, field, exc)

240-276: delete_edges_by_path returns 1 on success regardless of actual delete count.

The return value is typed -> int but always returns 1 on success and 0 on failure, not the number of edges deleted. This could be confusing for callers that expect an actual count. Consider renaming or documenting the return semantics clearly.
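
If the current semantics are kept, stating them in the docstring would avoid surprises; a sketch (signature taken from the call sites shown earlier, body elided):

def delete_edges_by_path(client, collection: str, caller_path: str, repo: str | None = None) -> int:
    """Delete all graph edges originating from caller_path.

    Returns 1 if the delete request was issued successfully and 0 on failure.
    This is a success flag, not a count of deleted edges.
    """
    ...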


279-385: Backfill tick looks correct; minor note on lazy import time.

The incremental scroll + per-file upsert logic is clean. The retry backoff at lines 341–345 works correctly.

Line 343: import time is lazily imported inside the retry block. This is harmless (Python caches module imports) but unconventional. Consider moving it to the top-level imports for clarity.
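
For reference, a rough sketch of the scroll-with-retry shape described above, with time imported at module level as suggested; names, limits, and backoff values are illustrative rather than the module's actual ones.

import time

def scroll_with_retry(client, collection: str, offset=None, limit: int = 128, retries: int = 3):
    delay = 0.5
    for attempt in range(retries):
        try:
            return client.scroll(
                collection_name=collection,
                offset=offset,
                limit=limit,
                with_payload=True,
                with_vectors=False,
            )
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts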

Implements a `debug` parameter for the repo search functionality.

When `debug=true`, includes verbose internal fields (like components, rerank_counters, and code_signals) in the search results.
When `debug=false` (default), strips these fields to reduce token consumption.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/mcp_indexer_server.py (1)

1649-1681: ⚠️ Potential issue | 🟡 Minor

code_search doesn't forward debug (or output_format, repo).

code_search is documented as an "exact alias of repo_search" but its signature and forwarding call are missing debug, output_format, and repo parameters. Users calling code_search cannot access debug mode.

Proposed fix (debug only, matching this PR's scope)
 async def code_search(
     query: Any = None,
     ...
     compact: Any = None,
+    debug: Any = None,
     kwargs: Any = None,
 ) -> Dict[str, Any]:
     ...
     return await repo_search(
         ...
         compact=compact,
+        debug=debug,
         kwargs=kwargs,
     )
🤖 Fix all issues with AI agents
In `@scripts/mcp_impl/search.py`:
- Around line 494-498: The assignment to debug_raw is unused and should be
removed; in the function where debug is processed (the lines assigning debug_raw
= debug and debug = _to_bool(debug, False)), delete the debug_raw = debug
statement and keep only debug = _to_bool(debug, False) so the code no longer
defines an unused variable (referencing the debug_raw symbol and the _to_bool
call to locate the change).
- Around line 80-91: The function _strip_debug_fields currently ignores the
keep_paths parameter and always strips host_path/container_path because
_DEBUG_RESULT_FIELDS includes those keys; update _strip_debug_fields to respect
keep_paths by only removing host_path and container_path when keep_paths is
False (e.g., compute the set of fields to strip from _DEBUG_RESULT_FIELDS minus
{"host_path","container_path"} when keep_paths is True, or copy the item and pop
those keys conditionally), leaving other debug removals unchanged; ensure the
docstring stays consistent with the new behavior.
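
A sketch of the shape the second prompt asks for; the exact contents of _DEBUG_RESULT_FIELDS beyond the two path keys are assumed from the descriptions above.

_DEBUG_RESULT_FIELDS = {
    "components", "rerank_counters", "code_signals",  # verbose internals (assumed set)
    "host_path", "container_path",
}
_PATH_FIELDS = {"host_path", "container_path"}

def _strip_debug_fields(item: dict, keep_paths: bool = False) -> dict:
    """Return a copy of item without debug-only fields; keep path fields when requested."""
    to_strip = _DEBUG_RESULT_FIELDS - _PATH_FIELDS if keep_paths else _DEBUG_RESULT_FIELDS
    return {k: v for k, v in item.items() if k not in to_strip}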

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@ctx-mcp-bridge/src/mcpServer.js`:
- Around line 81-96: The function dedupeResourceTemplates is incorrectly reading
uriTemplate from template.resourceTemplate.uriTemplate which is always undefined
for MCP ResourceTemplate objects; update the extraction to read
template.uriTemplate (and keep a safe typeof string check) so uri is correctly
derived from each ResourceTemplate, then the seen/skip logic will work and out
will contain deduplicated templates; reference the dedupeResourceTemplates
function and the templates/uriTemplate symbols when making the change.

Adds a new "bundled" mode for the MCP bridge, allowing the extension to use a pre-packaged bridge binary.

The extension now attempts to use the bundled bridge if the `mcpBridgeMode` setting is set to "bundled". If not found, it falls back to external resolution.

Also rewords the "localOnly" setting description to reflect the addition of "mcpBridgeMode".
- Updates code search to accept additional parameters for debugging and output formatting.

- Caches the excluded directory names in the upload clients to improve performance and prevent unintended behavior from runtime changes to exclusion settings.

- Adds a fallback mechanism for Qdrant client connections in the admin copy collection function, ensuring resilience in case pooled client acquisition fails.
Ensures the user is prompted to create a virtual environment and install dependencies when both the initial Python environment and auto-detected environments fail to provide the required modules. Removes the conditional check that prevented the prompt when modules were missing from an auto-detected interpreter.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.github/workflows/claude.yaml:
- Around line 12-13: The workflow uses the pull_request_target event combined
with checking out the PR head (github.event.pull_request.head.sha), which lets
untrusted PR code run with elevated permissions; change the workflow event to
pull_request or, if pull_request_target is required, update the checkout step to
use the base branch (github.event.pull_request.base.sha) instead of the PR head,
or add an explicit human-approval job that gates any steps requiring secrets;
search for the token usage and the checkout action (references:
pull_request_target, github.event.pull_request.head.sha,
github.event.pull_request.base.sha) and implement one of these safer
alternatives.
- Around line 18-27: The conditional in the workflow only checks
github.event.comment.author_association, so jobs triggered by
pull_request_target events (where the PR body contains "@claude") never pass
because there is no comment object; update the if expression to validate author
association for the correct event payloads by adding checks for
github.event.pull_request.author_association and
github.event.review.author_association (in addition to
github.event.comment.author_association and
github.event.issue.author_association) or conditionally select the association
based on event type (e.g., check github.event_name == 'pull_request' ?
github.event.pull_request.author_association :
github.event.comment.author_association) while preserving the existing
sender.type == 'User' and allowed associations (OWNER, MEMBER, COLLABORATOR) so
PR-open/synchronize events will run when `@claude` is in the PR body.

In `@scripts/remote_upload_client.py`:
- Around line 1419-1428: The watcher currently only checks idx.CODE_EXTS using
the file suffix, so extensionless code files (e.g., Dockerfile, Makefile,
.gitignore) are skipped; change the filter in both the src_path and dest_path
branches to consult CODE_EXTS using the file suffix when present and fall back
to the filename when suffix is empty — e.g., compute key =
src_path.suffix.lower() or src_path.name when suffix is empty and use
idx.CODE_EXTS.get(key, "unknown") != "unknown" while preserving the existing
self.client._is_ignored_path checks (apply same change for dest_path handling).
🧹 Nitpick comments (7)
vscode-extension/build/build.sh (1)

83-95: Guard missing bin/src to avoid hard build failures.

With set -e, missing bin or src will abort packaging even though the bridge source exists. Consider making these copies conditional (mirroring the node_modules behavior) to keep bundling best‑effort.

🛠️ Suggested hardening
 if [[ -d "$BRIDGE_SRC" && -f "$BRIDGE_SRC/package.json" ]]; then
     echo "Bundling MCP bridge npm package into staged extension..."
     mkdir -p "$STAGE_DIR/$BRIDGE_DIR"
-    cp -a "$BRIDGE_SRC/bin" "$STAGE_DIR/$BRIDGE_DIR/"
-    cp -a "$BRIDGE_SRC/src" "$STAGE_DIR/$BRIDGE_DIR/"
+    if [[ -e "$BRIDGE_SRC/bin" ]]; then
+        cp -a "$BRIDGE_SRC/bin" "$STAGE_DIR/$BRIDGE_DIR/"
+    else
+        echo "Warning: Bridge bin not found; skipping."
+    fi
+    if [[ -d "$BRIDGE_SRC/src" ]]; then
+        cp -a "$BRIDGE_SRC/src" "$STAGE_DIR/$BRIDGE_DIR/"
+    else
+        echo "Warning: Bridge src not found; skipping."
+    fi
     cp "$BRIDGE_SRC/package.json" "$STAGE_DIR/$BRIDGE_DIR/"
scripts/mcp_indexer_server.py (1)

1193-1226: Consider forwarding the debug parameter in repo_search_compat for consistency.

The new debug parameter added to repo_search is not forwarded through repo_search_compat. Clients using this compatibility wrapper won't be able to enable debug mode.

♻️ Proposed fix to forward debug parameter
         forward = {
             "query": query,
             "limit": limit,
             "per_path": args.get("per_path"),
             "include_snippet": args.get("include_snippet"),
             "context_lines": args.get("context_lines"),
             "rerank_enabled": args.get("rerank_enabled"),
             "rerank_top_n": args.get("rerank_top_n"),
             "rerank_return_m": args.get("rerank_return_m"),
             "rerank_timeout_ms": args.get("rerank_timeout_ms"),
             "highlight_snippet": args.get("highlight_snippet"),
             "collection": args.get("collection"),
             "session": args.get("session"),
             "workspace_path": args.get("workspace_path"),
             "language": args.get("language"),
             "under": args.get("under"),
             "kind": args.get("kind"),
             "symbol": args.get("symbol"),
             "path_regex": args.get("path_regex"),
             "path_glob": args.get("path_glob"),
             "not_glob": args.get("not_glob"),
             "ext": args.get("ext"),
             "not_": not_value,
             "case": args.get("case"),
             "compact": args.get("compact"),
+            "debug": args.get("debug"),
             "mode": args.get("mode"),
             "repo": args.get("repo"),  # Cross-codebase isolation
             "output_format": args.get("output_format"),  # "json" or "toon"
             # Alias passthroughs captured by repo_search(**kwargs)
             "queries": queries,
             "q": args.get("q"),
             "text": args.get("text"),
             "top_k": args.get("top_k"),
         }
scripts/upload_service.py (1)

1245-1286: Best-effort graph copy detection logic is acceptable but complex.

The nested try-except structure correctly implements a fallback chain (pooled client → direct client → default to "0"), which is appropriate for a non-critical "best-effort" feature. A few observations:

  1. Line 1271: The or 5 is redundant since the default is already "5".
  2. The try-except-pass on cli.close() (lines 1279-1282) is acceptable for cleanup code where logging would add noise.

Consider extracting the graph-existence check into a helper function if this pattern is reused elsewhere, but it's acceptable as-is for a single use case.

♻️ Minor cleanup for redundant default
                     cli = QdrantClient(
                         url=QDRANT_URL,
                         api_key=os.environ.get("QDRANT_API_KEY"),
-                        timeout=float(os.environ.get("QDRANT_TIMEOUT", "5") or 5),
+                        timeout=float(os.environ.get("QDRANT_TIMEOUT", "5")),
                     )
vscode-extension/context-engine-uploader/mcp_bridge.js (1)

99-107: Potential mismatch between reported kind and actual source.

When mode === 'bundled' but the bundled binary is not found, findLocalBridgeBin() falls back to external resolution (configured path or env override). If an external binary exists, binPath will be set, but kind would still be reported as 'bundled' because line 106 only checks the mode, not the actual source of the binary.

Consider tracking whether the path came from the bundled source:

♻️ Proposed fix to accurately report the binary source
 function resolveBridgeCliInvocation() {
-  const binPath = findLocalBridgeBin();
   const mode = getBridgeMode();
+  const bundledBin = mode === 'bundled' ? findBundledBridgeBin() : undefined;
+  const binPath = bundledBin || findLocalBridgeBin();
+  const isBundled = !!bundledBin;
   if (binPath) {
     return {
       command: 'node',
       args: [binPath],
-      kind: mode === 'bundled' ? 'bundled' : 'local'
+      kind: isBundled ? 'bundled' : 'local'
     };
   }

Note: This would require refactoring findLocalBridgeBin() to skip bundled resolution when called from here, or extracting the bundled check.

.github/workflows/claude.yaml (2)

44-54: Redundant configuration: settings file and inline settings parameter both defined.

The Claude settings are configured twice:

  1. Lines 44-54: Written to /home/runner/.claude/settings.json
  2. Line 64: Passed via settings parameter

This duplication is unnecessary and could cause confusion if they diverge. Pick one approach.

♻️ Proposed fix: Remove the redundant settings file step
-      - name: Create Claude settings file
-        run: |
-          mkdir -p /home/runner/.claude
-          cat > /home/runner/.claude/settings.json << 'EOF'
-          {
-            "env": {
-              "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
-              "ANTHROPIC_AUTH_TOKEN": "${{ secrets.CUSTOM_ENDPOINT_API_KEY }}"
-            }
-          }
-          EOF
-
       - name: Run Claude Code

Also applies to: 64-64


39-39: Consider pinning GitHub Actions to specific commit SHAs for enhanced supply chain security.

This workflow has write permissions (contents: write, pull-requests: write, issues: write) and accesses secrets (CUSTOM_ENDPOINT_API_KEY), making it a higher-value target. Pinning actions/checkout@v4 and anthropics/claude-code-action@v1 to full commit SHAs prevents potential supply chain attacks if an action repository is compromised. Major version tags can change without notice, whereas commit SHAs are immutable.

vscode-extension/context-engine-uploader/python_env.js (1)

363-365: Redundant cache check.

This cache check will never return true at this point in the flow—venvPython was just resolved and the cache is only populated on line 368 after checkPythonDeps succeeds. This block is effectively dead code.

🧹 Proposed cleanup
     setPythonOverridePath(venvPython);
     log(`Using private venv interpreter: ${getPythonOverridePath()}`);
     const venvKey = cacheKey(venvPython, workingDirectory);
-    if (depCheckCache.get(venvKey)) {
-        return true;
-    }
     const finalOk = await checkPythonDeps(venvPython, workingDirectory, { showInterpreterError: true });
     if (finalOk) {
         depCheckCache.set(venvKey, true);

Comment on lines +12 to +13
pull_request_target:
types: [opened, synchronize]

⚠️ Potential issue | 🔴 Critical

Critical: pull_request_target with PR head checkout creates a "pwn request" vulnerability.

Using pull_request_target grants write permissions and access to secrets, but checking out github.event.pull_request.head.sha (the untrusted PR code) allows malicious PRs to execute arbitrary code with those elevated privileges. An attacker could modify workflow files or exfiltrate secrets.

Safer alternatives:

  1. Use pull_request event instead (runs in PR context without secrets access)
  2. If pull_request_target is required, only checkout the base branch, not the PR head
  3. Add an explicit approval step before running on external PRs
🔒 Recommended fix: Use pull_request event or avoid checking out PR head
-  pull_request_target:
-    types: [opened, synchronize]
+  pull_request:
+    types: [opened, synchronize]

Or if you need secrets access, don't checkout PR head:

      - name: Checkout repository
        uses: actions/checkout@v4
-        with:
-          # This correctly checks out the PR's head commit for pull_request_target events.
-          ref: ${{ github.event.pull_request.head.sha }}

Also applies to: 38-42

🤖 Prompt for AI Agents
In @.github/workflows/claude.yaml around lines 12 - 13, The workflow uses the
pull_request_target event combined with checking out the PR head
(github.event.pull_request.head.sha), which lets untrusted PR code run with
elevated permissions; change the workflow event to pull_request or, if
pull_request_target is required, update the checkout step to use the base branch
(github.event.pull_request.base.sha) instead of the PR head, or add an explicit
human-approval job that gates any steps requiring secrets; search for the token
usage and the checkout action (references: pull_request_target,
github.event.pull_request.head.sha, github.event.pull_request.base.sha) and
implement one of these safer alternatives.

Comment on lines +18 to +27
if: >
(contains(github.event.comment.body, '@claude') ||
contains(github.event.review.body, '@claude') ||
contains(github.event.issue.body, '@claude') ||
contains(github.event.pull_request.body, '@claude')) &&
(github.event.sender.type == 'User' && (
github.event.comment.author_association == 'OWNER' ||
github.event.comment.author_association == 'MEMBER' ||
github.event.comment.author_association == 'COLLABORATOR'
))

⚠️ Potential issue | 🟠 Major

Condition logic is incomplete for pull_request_target events.

The author_association check only references github.event.comment.author_association, which is undefined for pull_request_target opened/synchronize events (there's no comment). This causes the job to never run for those triggers even when @claude is in the PR body.

If you intend to support PR-triggered runs, you need to check associations conditionally:

🛠️ Proposed fix to handle different event types
     if: >
       (contains(github.event.comment.body, '@claude') ||
       contains(github.event.review.body, '@claude') ||
       contains(github.event.issue.body, '@claude') ||
       contains(github.event.pull_request.body, '@claude')) &&
-      (github.event.sender.type == 'User' && (
-        github.event.comment.author_association == 'OWNER' ||
-        github.event.comment.author_association == 'MEMBER' ||
-        github.event.comment.author_association == 'COLLABORATOR'
+      github.event.sender.type == 'User' && (
+        github.event.comment.author_association == 'OWNER' ||
+        github.event.comment.author_association == 'MEMBER' ||
+        github.event.comment.author_association == 'COLLABORATOR' ||
+        github.event.review.author_association == 'OWNER' ||
+        github.event.review.author_association == 'MEMBER' ||
+        github.event.review.author_association == 'COLLABORATOR' ||
+        github.event.issue.author_association == 'OWNER' ||
+        github.event.issue.author_association == 'MEMBER' ||
+        github.event.issue.author_association == 'COLLABORATOR' ||
+        github.event.pull_request.author_association == 'OWNER' ||
+        github.event.pull_request.author_association == 'MEMBER' ||
+        github.event.pull_request.author_association == 'COLLABORATOR'
-      ))
+      )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if: >
(contains(github.event.comment.body, '@claude') ||
contains(github.event.review.body, '@claude') ||
contains(github.event.issue.body, '@claude') ||
contains(github.event.pull_request.body, '@claude')) &&
(github.event.sender.type == 'User' && (
github.event.comment.author_association == 'OWNER' ||
github.event.comment.author_association == 'MEMBER' ||
github.event.comment.author_association == 'COLLABORATOR'
))
if: >
(contains(github.event.comment.body, '@claude') ||
contains(github.event.review.body, '@claude') ||
contains(github.event.issue.body, '@claude') ||
contains(github.event.pull_request.body, '@claude')) &&
github.event.sender.type == 'User' && (
github.event.comment.author_association == 'OWNER' ||
github.event.comment.author_association == 'MEMBER' ||
github.event.comment.author_association == 'COLLABORATOR' ||
github.event.review.author_association == 'OWNER' ||
github.event.review.author_association == 'MEMBER' ||
github.event.review.author_association == 'COLLABORATOR' ||
github.event.issue.author_association == 'OWNER' ||
github.event.issue.author_association == 'MEMBER' ||
github.event.issue.author_association == 'COLLABORATOR' ||
github.event.pull_request.author_association == 'OWNER' ||
github.event.pull_request.author_association == 'MEMBER' ||
github.event.pull_request.author_association == 'COLLABORATOR'
)
🤖 Prompt for AI Agents
In @.github/workflows/claude.yaml around lines 18 - 27, The conditional in the
workflow only checks github.event.comment.author_association, so jobs triggered
by pull_request_target events (where the PR body contains "@claude") never pass
because there is no comment object; update the if expression to validate author
association for the correct event payloads by adding checks for
github.event.pull_request.author_association and
github.event.review.author_association (in addition to
github.event.comment.author_association and
github.event.issue.author_association) or conditionally select the association
based on event type (e.g., check github.event_name == 'pull_request' ?
github.event.pull_request.author_association :
github.event.comment.author_association) while preserving the existing
sender.type == 'User' and allowed associations (OWNER, MEMBER, COLLABORATOR) so
PR-open/synchronize events will run when `@claude` is in the PR body.

Comment on lines 1419 to 1428
# Always check src_path
src_path = Path(event.src_path)
if idx.CODE_EXTS.get(src_path.suffix.lower(), "unknown") != "unknown":
if not self.client._is_ignored_path(src_path) and idx.CODE_EXTS.get(src_path.suffix.lower(), "unknown") != "unknown":
paths_to_process.append(src_path)

# For FileMovedEvent, also process the destination path
if hasattr(event, 'dest_path') and event.dest_path:
dest_path = Path(event.dest_path)
if idx.CODE_EXTS.get(dest_path.suffix.lower(), "unknown") != "unknown":
if not self.client._is_ignored_path(dest_path) and idx.CODE_EXTS.get(dest_path.suffix.lower(), "unknown") != "unknown":
paths_to_process.append(dest_path)

⚠️ Potential issue | 🟠 Major

Event-based watcher skips extensionless code files.

Line 1421/1427 only checks idx.CODE_EXTS, so files like Dockerfile, Makefile, or .gitignore (now supported elsewhere) won’t be enqueued under watchdog. Those edits won’t sync in watch mode.

🛠️ Suggested fix to include extensionless files in watchdog filtering
@@
+    def _is_code_path(self, path: Path) -> bool:
+        fname = path.name.lower()
+        if path.suffix.lower() in idx.CODE_EXTS:
+            return True
+        try:
+            extensionless = set((idx.EXTENSIONLESS_FILES or {}).keys())
+        except Exception:
+            extensionless = set()
+        return fname in extensionless or fname.startswith("dockerfile")
@@
-                if not self.client._is_ignored_path(src_path) and idx.CODE_EXTS.get(src_path.suffix.lower(), "unknown") != "unknown":
+                if not self.client._is_ignored_path(src_path) and self.client._is_code_path(src_path):
                     paths_to_process.append(src_path)
@@
-                    if not self.client._is_ignored_path(dest_path) and idx.CODE_EXTS.get(dest_path.suffix.lower(), "unknown") != "unknown":
+                    if not self.client._is_ignored_path(dest_path) and self.client._is_code_path(dest_path):
                         paths_to_process.append(dest_path)
🤖 Prompt for AI Agents
In `@scripts/remote_upload_client.py` around lines 1419 - 1428, The watcher
currently only checks idx.CODE_EXTS using the file suffix, so extensionless code
files (e.g., Dockerfile, Makefile, .gitignore) are skipped; change the filter in
both the src_path and dest_path branches to consult CODE_EXTS using the file
suffix when present and fall back to the filename when suffix is empty — e.g.,
compute key = src_path.suffix.lower() or src_path.name when suffix is empty and
use idx.CODE_EXTS.get(key, "unknown") != "unknown" while preserving the existing
self.client._is_ignored_path checks (apply same change for dest_path handling).
