Implement graph edges collection support with admin and ingest updates #2
voarsh2 wants to merge 13 commits into mit-base-salvage from …
Conversation
…arity
feat(graph-edges): add qdrant _graph backfill + symbol_graph accel; mirror admin lifecycle
feat: materialize _graph edges and use for symbol_graph; ensure clone/delete handles companion collections
graph: add _graph edge collection + backfill; prune/watcher cleanup; admin copy/delete parity
feat(admin+graph): copy/delete companion _graph collections; add UI status + tests
Adds a command-line option to clear indexing caches before running the ingest process. This ensures a clean indexing run by removing any stale file-hash or symbol caches, which is useful when the underlying code has changed significantly and invalidated the existing cache. Also ensures that `CTXCE_FORCE_COLLECTION_NAME` disables multi-repo enumeration in `ingest_code`, forcing use of the specified collection name, and clarifies its purpose in `indexing_admin.py`. This restores behavior broken by commit 2e6317d.
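As a rough illustration of the forced-collection behaviour described above, a minimal sketch follows; the function and variable names are illustrative, not the actual code in `ingest_code` — only the `CTXCE_FORCE_COLLECTION_NAME` variable comes from this PR:

```python
import os

def resolve_target_collections(enumerate_repo_collections):
    """Decide which collections to ingest into.

    If CTXCE_FORCE_COLLECTION_NAME is set, multi-repo enumeration is skipped
    and everything is ingested into that single collection; otherwise the
    normal per-repo enumeration runs.
    """
    forced = (os.environ.get("CTXCE_FORCE_COLLECTION_NAME") or "").strip()
    if forced:
        return [forced]  # single forced collection, no enumeration
    return list(enumerate_repo_collections())
```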
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review; this behavior is configurable in the review settings.
Walkthrough: Adds materialized graph-edge storage and backfill for Qdrant, integrates best-effort graph sync into indexing and deletion flows, adds path-exclusion for upload clients, a cache-clear CLI flag, admin redirect graph feedback and companion collection copy/delete attempts, plus tests and debug-field handling.
Sequence Diagrams
sequenceDiagram
participant Client
participant Pipeline
participant Graph as graph_edges Module
participant Qdrant
Client->>Pipeline: upsert file (points + calls/imports)
activate Pipeline
Pipeline->>Graph: ensure_graph_collection(base)
activate Graph
Graph->>Qdrant: create/verify <base>_graph collection
Qdrant-->>Graph: collection ready
deactivate Graph
Pipeline->>Graph: delete_edges_by_path(caller_path, repo)
activate Graph
Graph->>Qdrant: delete matching payload-only edge docs
Qdrant-->>Graph: deletion result
deactivate Graph
Pipeline->>Graph: upsert_file_edges(caller_path, calls, imports, repo)
activate Graph
Graph->>Qdrant: upsert payload-only edge docs with deterministic IDs
Qdrant-->>Graph: upsert ack
deactivate Graph
deactivate Pipeline
Pipeline-->>Client: indexing complete
sequenceDiagram
participant MCP as UserQuery
participant SymbolGraph
participant Graph as graph_edges (Query)
participant Qdrant
MCP->>SymbolGraph: query_callers(symbol)
activate SymbolGraph
SymbolGraph->>Graph: _query_graph_edges_collection(type=calls)
activate Graph
Graph->>Qdrant: query <base>_graph filter by edge_type=calls
Qdrant-->>Graph: edge documents
Graph->>Graph: hydrate & dedupe results
Graph-->>SymbolGraph: hydrated caller results
deactivate Graph
alt graph results exist
SymbolGraph-->>MCP: return hydrated results
else
SymbolGraph->>Qdrant: fallback query on callers[] array field
Qdrant-->>SymbolGraph: legacy results
SymbolGraph-->>MCP: return legacy results
end
deactivate SymbolGraph
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/mcp_impl/symbol_graph.py`:
- Around line 342-352: Inner functions _scroll and _scroll_main close over loop
variables (flt and p) by reference which can cause latent bugs; change their
definitions to bind the current loop value explicitly (e.g., add flt=None /
p=None as default parameters and use that inside, or accept the value as a
parameter) so each coroutine captures the intended value when created; update
the calls that invoke _scroll() and _scroll_main() to pass no args (if using
defaults) or the bound value accordingly to ensure each task uses the correct
flt and p.
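A minimal sketch of the late-binding fix the prompt above describes; the names `gather_scrolls` and `scroll_fn` are illustrative, not the actual `_scroll`/`_scroll_main` helpers in `symbol_graph.py`:

```python
import asyncio

async def gather_scrolls(filters, scroll_fn):
    """Run one scroll per filter, binding each loop value explicitly."""
    tasks = []
    for flt in filters:
        # The default argument freezes the current value of `flt` at definition
        # time, so each coroutine uses its own filter rather than the last one
        # left over from the loop.
        async def _scroll(flt=flt):
            return await scroll_fn(flt)
        tasks.append(asyncio.create_task(_scroll()))
    return await asyncio.gather(*tasks)
```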
🧹 Nitpick comments (15)
scripts/remote_upload_client.py (1)
523-557: `_excluded_dirnames()` rebuilds the set on every `_is_ignored_path` call — consider caching.
Since `_is_ignored_path` is called for every path in `detect_file_changes`, `get_all_code_files`, and the event handler, `_excluded_dirnames()` reconstructs the set each time. The env vars it depends on won't change mid-process.
Also, `parts = set(rel.parts)` on Line 544 includes the filename itself, so a code file literally named `build`, `dist`, `out`, etc. (no extension) would be incorrectly ignored. This is unlikely for real code files but worth noting.
♻️ Cache the excluded set (optional)
+ @functools.cached_property
+ def _excluded_dirs(self) -> frozenset:
+     excluded = {
+         "node_modules", "vendor", "dist", "build", "target", "out",
+         ".git", ".hg", ".svn", ".vscode", ".idea", ".venv", "venv",
+         "__pycache__", ".pytest_cache", ".mypy_cache", ".cache",
+         ".context-engine", ".context-engine-uploader", ".codebase",
+     }
+     dev_remote = os.environ.get("DEV_REMOTE_MODE") == "1" or os.environ.get("REMOTE_UPLOAD_MODE") == "development"
+     if dev_remote:
+         excluded.add("dev-workspace")
+     return frozenset(excluded)

And to fix the filename false-positive, check only directory parts:

- parts = set(rel.parts)
- if parts & self._excluded_dirnames():
+ dir_parts = set(rel.parts[:-1]) if len(rel.parts) > 1 else set()
+ if dir_parts & self._excluded_dirnames():

scripts/standalone_upload_client.py (1)
733-763: Identical logic to `remote_upload_client.py` — the keep-in-sync comment is appreciated but fragile.
The "Keep in sync with get_all_code_files exclusions" comment helps, but these two files now share ~30 lines of identical exclusion logic. The same note as the other file applies: `_excluded_dirnames()` is rebuilt per call, and `parts = set(rel.parts)` includes the filename.
Since the standalone client is intentionally dependency-free, the duplication is understandable. Consider at minimum adding a test that asserts the excluded dir sets match across both clients to catch drift.
scripts/collection_admin.py (1)
493-505: Log the graph-copy failure instead of silently swallowing it.
The recursive `copy_collection_qdrant` call is properly guarded against infinite recursion (the `_graph` suffix check). However, silently passing on exceptions makes debugging difficult when graph copies fail unexpectedly.
♻️ Add minimal logging
  if not src.endswith("_graph") and not dest.endswith("_graph"):
      try:
          copy_collection_qdrant(
              source=f"{src}_graph",
              target=f"{dest}_graph",
              qdrant_url=base_url,
              overwrite=overwrite,
          )
-     except Exception:
-         pass
+     except Exception as exc:
+         import logging as _logging
+         _logging.getLogger(__name__).debug(
+             "Best-effort graph collection copy %s_graph -> %s_graph failed: %s",
+             src, dest, exc,
+         )

scripts/ingest/graph_edges.py (2)
109-144: Broad `except` on the vector-less creation attempt may mask transient errors.
Line 117 catches any exception from the vector-less `create_collection` call and falls back to the named-vector variant. If the failure is a transient network error rather than "unsupported config," the fallback will likely also fail — so the net effect is acceptable. Just noting this for awareness.
♻️ Optional: log the fallback reason at debug level
  try:
      client.create_collection(
          collection_name=graph_coll,
          vectors_config={},
      )
      _GRAPH_VECTOR_MODE[graph_coll] = "none"
- except Exception:
+ except Exception as vec_exc:
+     logger.debug("Vector-less creation failed for %s, trying named vector: %s", graph_coll, vec_exc)
      client.create_collection(
147-149: Edge ID uses `:` as a separator — consider a more robust delimiter.
If any component (repo, path, symbol) contains `:`, different edge tuples could produce the same hash input. E.g., `"calls:a:b:c:d"` is ambiguous. A null byte (`\x00`) would be safer as a delimiter since it doesn't appear in file paths or symbol names.
This is low-risk in practice since paths and symbols rarely contain colons.
♻️ Use null byte separator
  def _edge_id(edge_type: str, repo: str, caller_path: str, callee_symbol: str) -> str:
-     key = f"{edge_type}:{repo}:{caller_path}:{callee_symbol}"
+     key = f"{edge_type}\x00{repo}\x00{caller_path}\x00{callee_symbol}"
      return hashlib.sha256(key.encode("utf-8", errors="ignore")).hexdigest()[:32]

scripts/mcp_impl/symbol_graph.py (2)
27-28: `GRAPH_COLLECTION_SUFFIX` is duplicated from `scripts/ingest/graph_edges.py`.
Both files define `GRAPH_COLLECTION_SUFFIX = "_graph"`. This is a minor DRY concern; consider importing from `graph_edges` or extracting to a shared constants module if the two modules will evolve together.
381-432: N+1 hydration queries — acceptable given the `limit` cap.
Each caller path triggers a separate `scroll` call against the main collection (lines 400-408). With the default limit of 20, this means up to 20 sequential queries. For the current use case this is fine, but if `limit` grows significantly, consider batching hydration using a single scroll with a `MatchAny` filter on `metadata.path`, as sketched below.
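A sketch of what batched hydration could look like, assuming the `metadata.path` payload field named above; the function name and the limit heuristic are illustrative, not the code in `symbol_graph.py`:

```python
from qdrant_client import QdrantClient, models as qmodels

def hydrate_paths_batched(client: QdrantClient, collection: str, paths: list[str]):
    """Fetch main-collection points for all caller paths with a single scroll
    (one MatchAny filter) instead of one scroll per path."""
    if not paths:
        return []
    flt = qmodels.Filter(
        must=[
            qmodels.FieldCondition(
                key="metadata.path",
                match=qmodels.MatchAny(any=paths),
            )
        ]
    )
    points, _next = client.scroll(
        collection_name=collection,
        scroll_filter=flt,
        limit=max(len(paths) * 4, 64),  # heuristic cap; tune as needed
        with_payload=True,
        with_vectors=False,
    )
    return points
```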
scripts/upload_service.py (1)
1241-1273: Import `pooled_qdrant_client` and use it instead of creating a standalone `QdrantClient`.
The rest of the codebase uses `pooled_qdrant_client` for Qdrant interactions. Here, a new `QdrantClient` is instantiated just to check if the `_graph` collection exists. Using the pooled client would be more consistent and avoid creating an extra connection. This requires importing `pooled_qdrant_client` from `scripts.qdrant_client_manager`.
♻️ Suggested refactor using pooled client
At the top of the file with other imports, add:
try:
    from scripts.qdrant_client_manager import pooled_qdrant_client
except Exception:
    pooled_qdrant_client = None

Then replace the collection check logic:
  graph_copied: Optional[str] = None
  try:
      if not name.endswith("_graph") and not str(new_name).endswith("_graph"):
-         from qdrant_client import QdrantClient  # type: ignore
-
-         cli = QdrantClient(
-             url=QDRANT_URL,
-             api_key=os.environ.get("QDRANT_API_KEY"),
-             timeout=float(os.environ.get("QDRANT_TIMEOUT", "5") or 5),
-         )
-         try:
-             cli.get_collection(collection_name=f"{new_name}_graph")
-             graph_copied = "1"
-         except Exception:
-             graph_copied = "0"
-         finally:
-             try:
-                 cli.close()
-             except Exception:
-                 pass
+         if pooled_qdrant_client is not None:
+             with pooled_qdrant_client(url=QDRANT_URL, api_key=os.environ.get("QDRANT_API_KEY")) as cli:
+                 try:
+                     cli.get_collection(collection_name=f"{new_name}_graph")
+                     graph_copied = "1"
+                 except Exception:
+                     graph_copied = "0"
  except Exception:
      graph_copied = None

scripts/ingest/cli.py (1)
195-209: Remove duplication — reuse `scripts/collection_health.clear_indexing_caches` instead.
`scripts/collection_health.py` already has a `clear_indexing_caches(workspace_path, repo_name)` function that clears both file-hash and symbol caches. It uses public `clear_repo_cache`/`clear_cache` APIs instead of relying on private helpers (`_ws._get_repo_state_dir`, `_ws._get_cache_path`), which are fragile if `workspace_state` internals change. The existing implementation also handles errors more gracefully.
Since the return value is discarded at all call sites, wrap the existing function in a thin adapter:
♻️ Reuse existing helper
+ from scripts.collection_health import clear_indexing_caches as _clear_indexing_caches_impl
+ ...
- def _clear_indexing_caches(workspace_root: Path, repo_name: str | None) -> None:
-     try:
-         _ws.clear_symbol_cache(workspace_path=str(workspace_root), repo_name=repo_name)
-     except Exception:
-         pass
-     try:
-         if _ws.is_multi_repo_mode() and repo_name:
-             state_dir = _ws._get_repo_state_dir(repo_name)
-             cache_path = state_dir / _ws.CACHE_FILENAME
-         else:
-             cache_path = _ws._get_cache_path(workspace_root)
-         if cache_path.exists():
-             cache_path.unlink()
-     except Exception:
-         pass
+ def _clear_indexing_caches(workspace_root: Path, repo_name: str | None) -> None:
+     try:
+         _clear_indexing_caches_impl(str(workspace_root), repo_name=repo_name)
+     except Exception:
+         pass

scripts/ingest/pipeline.py (2)
656-689: Silent `except Exception: pass` swallows errors without any diagnostic trace.
While this block is intentionally best-effort, silently swallowing all exceptions (including unexpected ones like `TypeError` or `AttributeError` from API misuse) makes production debugging very difficult. A debug-level log would preserve the "safe to skip" intent while aiding troubleshooting.
Also, this entire block (env check → lazy import → ensure → delete → upsert) is duplicated nearly verbatim at lines 1404–1435 in `process_file_with_smart_reindexing`. Consider extracting a shared helper (e.g., `_sync_graph_edges_best_effort(client, collection, file_path, repo, calls, imports)`) to keep both call sites in sync and reduce copy-paste drift.
♻️ Proposed: extract helper and add minimal logging
Add a helper near the top of the file (after imports):
def _sync_graph_edges_best_effort(
    client: QdrantClient,
    collection: str,
    file_path: str,
    repo: str | None,
    calls: list[str] | None,
    imports: list[str] | None,
) -> None:
    """Best-effort sync of file-level graph edges. Safe to skip on failure."""
    enabled = str(os.environ.get("GRAPH_EDGES_ENABLE", "1") or "").strip().lower() in {
        "1", "true", "yes", "on",
    }
    if not enabled:
        return
    try:
        from scripts.ingest.graph_edges import (
            delete_edges_by_path,
            ensure_graph_collection,
            upsert_file_edges,
        )
        ensure_graph_collection(client, collection)
        delete_edges_by_path(client, collection, caller_path=file_path, repo=repo)
        upsert_file_edges(client, collection, caller_path=file_path, repo=repo, calls=calls, imports=imports)
    except Exception as exc:
        try:
            print(f"[graph_edges] best-effort sync failed for {file_path}: {exc}")
        except Exception:
            pass

Then replace both blocks:
- # Optional: materialize file-level graph edges in a companion `<collection>_graph` store.
- # This is an accelerator for symbol_graph callers/importers and is safe to skip on failure.
- try:
-     enabled = str(os.environ.get("GRAPH_EDGES_ENABLE", "1") or "").strip().lower() in {
-         "1",
-         "true",
-         "yes",
-         "on",
-     }
-     if enabled:
-         from scripts.ingest.graph_edges import (
-             delete_edges_by_path as _delete_edges_by_path,
-             ensure_graph_collection as _ensure_graph_collection,
-             upsert_file_edges as _upsert_file_edges,
-         )
-
-         _ensure_graph_collection(client, collection)
-         # Important: delete stale edges for this file before upserting the new set.
-         _delete_edges_by_path(
-             client,
-             collection,
-             caller_path=str(file_path),
-             repo=repo_tag,
-         )
-         _upsert_file_edges(
-             client,
-             collection,
-             caller_path=str(file_path),
-             repo=repo_tag,
-             calls=calls,
-             imports=imports,
-         )
- except Exception:
-     pass
+ _sync_graph_edges_best_effort(client, collection, str(file_path), repo_tag, calls, imports)
1404-1435: Duplicate graph-edge sync block — same concern as lines 656–689.
This is a near-verbatim copy of the graph-edge sync block in `_index_single_file_inner`. If you extract the helper suggested above, this becomes a one-liner too, and the two paths stay in lockstep automatically.

scripts/ingest_code.py (1)
215-232: Asymmetric fallback: `graph_edges_backfill_tick = None` vs. no-op functions for the other two.
This is a valid design choice (callers should explicitly check before invoking a periodic tick), but it creates a subtle API inconsistency in the façade. Any consumer calling `graph_edges_backfill_tick(...)` without a `None` check will get a `TypeError` at runtime when graph_edges is unavailable.
If this is intentional (forcing callers to guard the call), consider adding a brief docstring or inline comment explaining why the backfill tick uses `None` while the others use no-op stubs; see the sketch below.
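A sketch of the façade shape being discussed, with a comment on why the tick is `None` rather than a stub. The exact import names and call signature in `ingest_code.py` are assumptions based on module names used elsewhere in this review:

```python
# Optional accelerator: degrade gracefully when graph_edges is unavailable.
try:
    from scripts.ingest.graph_edges import (
        ensure_graph_collection,
        upsert_file_edges,
        graph_edges_backfill_tick,
    )
except Exception:
    def ensure_graph_collection(*args, **kwargs):
        return None  # no-op stub: callers may invoke unconditionally

    def upsert_file_edges(*args, **kwargs):
        return None  # no-op stub

    # Deliberately None (not a stub): a periodic tick that silently does
    # nothing would hide the fact that backfill is disabled, so callers
    # must check for None before scheduling it.
    graph_edges_backfill_tick = None

# Caller side: the tick needs an explicit guard, unlike the stubs above.
if graph_edges_backfill_tick is not None:
    graph_edges_backfill_tick()  # real call likely takes arguments; omitted in this sketch
```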
scripts/indexing_admin.py (2)
930-935: Best-effort graph companion cleanup — consider a debug log for production visibility.
The `_graph` suffix guard correctly prevents recursive deletion. The silent `except Exception: pass` is consistent with the best-effort pattern, but a brief log message (even at debug level) would help operators understand why graph data might persist after a collection delete, especially in cases where the Qdrant API returns a non-404 error (e.g., timeout, auth failure).
🔧 Minimal logging suggestion
  if not name.endswith("_graph"):
      try:
          cli.delete_collection(collection_name=f"{name}_graph")
-     except Exception:
-         pass
+     except Exception as exc:
+         try:
+             print(f"[indexing_admin] best-effort graph collection delete failed for {name}_graph: {exc}")
+         except Exception:
+             pass
960-965: Same pattern as `delete_collection_qdrant` — the same optional logging suggestion applies.

scripts/prune.py (1)
42-90: Consider reusing the shared `_normalize_path` function for consistency.
The path normalization logic in `delete_graph_edges_by_path` is nearly identical to `graph_edges.py`'s `_normalize_path` helper, but there's a minor exception-handling divergence: if `os.path.normpath` fails, `graph_edges.py` still applies backslash replacement while `prune.py` returns the path unchanged. While this is unlikely to cause issues in practice, extracting `_normalize_path` into a shared utility module would eliminate duplication and guarantee identical behavior across both modules; a possible shape is sketched below.
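A possible shape for the shared helper, as a sketch only (the name and module placement are suggestions; the behaviour mirrors what the comment above attributes to `graph_edges.py`):

```python
import os

def normalize_edge_path(path: str) -> str:
    """Normalize a caller path for graph-edge payloads.

    Applies the backslash-to-slash replacement even if os.path.normpath
    raises, so graph_edges.py and prune.py agree on the stored form.
    """
    try:
        path = os.path.normpath(path)
    except Exception:
        pass  # keep the original string; still normalize separators below
    return path.replace("\\", "/")
```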
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/collection_admin.py`:
- Around line 493-510: Module-level logger is missing causing NameError in the
except block; add a module logger by importing logging and creating logger =
logging.getLogger(__name__) near the top-level imports so logger.debug in the
best-effort graph copy block (around copy_collection_qdrant usage) can run
without raising; ensure the import and logger creation are placed at module
scope so all functions in scripts/collection_admin.py can use logger.
🧹 Nitpick comments (7)
scripts/mcp_impl/symbol_graph.py (1)
27-34: Consider expiring the missing-collection cache.
`_MISSING_GRAPH_COLLECTIONS` is a permanent negative cache, so long-lived processes won't re-check if collections appear later. A TTL or manual reset hook would keep it fresh; a possible shape is sketched below.
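One way a TTL could be layered onto the negative cache, as a sketch. The review only states that the cache is permanent; the dict-of-timestamps shape, the helper names, and the five-minute TTL below are assumptions, not the current code in `symbol_graph.py`:

```python
import time

_MISSING_GRAPH_COLLECTIONS: dict[str, float] = {}  # name -> last-miss timestamp
_MISSING_TTL_SECONDS = 300.0

def mark_graph_collection_missing(name: str) -> None:
    _MISSING_GRAPH_COLLECTIONS[name] = time.monotonic()

def graph_collection_known_missing(name: str) -> bool:
    """True only while the negative-cache entry is still fresh."""
    ts = _MISSING_GRAPH_COLLECTIONS.get(name)
    if ts is None:
        return False
    if time.monotonic() - ts > _MISSING_TTL_SECONDS:
        # Entry expired: allow the next lookup to re-check Qdrant.
        _MISSING_GRAPH_COLLECTIONS.pop(name, None)
        return False
    return True
```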
scripts/standalone_upload_client.py (1)
733-768: Consider sharing ignore logic with `scripts/remote_upload_client.py`.
`_excluded_dirnames` and `_is_ignored_path` mirror the other client; extracting a shared helper would reduce drift over time.

scripts/remote_upload_client.py (2)
523-562: Duplicated logic with `standalone_upload_client.py`.
Both `_excluded_dirnames` and `_is_ignored_path` are near-identical copies of the same methods in `scripts/standalone_upload_client.py` (lines 732–767). The comment on line 524 acknowledges this ("Keep in sync with standalone_upload_client exclusions"), but keeping them manually in sync is error-prone.
Consider extracting these into a shared utility module to avoid divergence.
556-561: Extensionless-file lookup is case-sensitive on the dict side.
Line 560 compares `rel.name.lower()` against the keys in `extensionless`. However, `idx.EXTENSIONLESS_FILES` keys may already be lowercase only by convention. If a key were ever added with mixed case, this `set(…keys())` (without lowering each key) would silently miss it.
The standalone client at line 766 uses `(EXTENSIONLESS_FILES or {}).keys()` without lowering keys either. This is consistent but fragile.
Suggested defensive fix
  try:
-     extensionless = set((idx.EXTENSIONLESS_FILES or {}).keys())
+     extensionless = {k.lower() for k in (idx.EXTENSIONLESS_FILES or {}).keys()}
  except Exception:
      extensionless = set()

scripts/ingest/graph_edges.py (3)
67-149: Silently swallowed index-creation errors may hide configuration issues.
Lines 141–142: the `except Exception: pass` when creating payload indexes means a persistent configuration issue (e.g., invalid field schema, Qdrant version mismatch) will never surface. Since this is called once per collection lifecycle, a `logger.debug` would be inexpensive and aids troubleshooting.
Suggested improvement
  try:
      client.create_payload_index(
          collection_name=graph_coll,
          field_name=field,
          field_schema=qmodels.PayloadSchemaType.KEYWORD,
      )
- except Exception:
-     pass
+ except Exception as exc:
+     logger.debug("Payload index creation for %s.%s skipped: %s", graph_coll, field, exc)
240-276: `delete_edges_by_path` returns 1 on success regardless of actual delete count.
The return value is typed `-> int` but always returns `1` on success and `0` on failure, not the number of edges deleted. This could be confusing for callers that expect an actual count. Consider renaming or documenting the return semantics clearly; a counted variant is sketched below.
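If an actual count is wanted, one option is to count the matching edge docs before deleting them. This is a sketch only: the payload field names `caller_path` and `repo` are assumptions, and the count/delete pair is not atomic, so concurrent writers could skew the number:

```python
from qdrant_client import QdrantClient, models as qmodels

def delete_edges_by_path_counted(client: QdrantClient, base_collection: str,
                                 caller_path: str, repo: str | None = None) -> int:
    """Delete edge docs for one caller path and return how many were removed."""
    graph_coll = f"{base_collection}_graph"
    must = [qmodels.FieldCondition(key="caller_path", match=qmodels.MatchValue(value=caller_path))]
    if repo:
        must.append(qmodels.FieldCondition(key="repo", match=qmodels.MatchValue(value=repo)))
    flt = qmodels.Filter(must=must)
    # Count the matching edge docs first, then delete them by the same filter.
    matched = client.count(collection_name=graph_coll, count_filter=flt, exact=True).count
    client.delete(
        collection_name=graph_coll,
        points_selector=qmodels.FilterSelector(filter=flt),
    )
    return matched
```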
279-385: Backfill tick looks correct; minor note on lazy `import time`.
The incremental scroll + per-file upsert logic is clean. The retry backoff at lines 341–345 works correctly.
Line 343: `import time` is lazily imported inside the retry block. This is harmless (Python caches module imports) but unconventional. Consider moving it to the top-level imports for clarity.
Implements a `debug` parameter for the repo search functionality. When `debug=true`, includes verbose internal fields (like components, rerank_counters, and code_signals) in the search results. When `debug=false` (default), strips these fields to reduce token consumption.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/mcp_indexer_server.py (1)
1649-1681: ⚠️ Potential issue | 🟡 Minor
`code_search` doesn't forward `debug` (or `output_format`, `repo`).
`code_search` is documented as an "exact alias of repo_search", but its signature and forwarding call are missing the `debug`, `output_format`, and `repo` parameters. Users calling `code_search` cannot access debug mode.
Proposed fix (debug only, matching this PR's scope)
  async def code_search(
      query: Any = None,
      ...
      compact: Any = None,
+     debug: Any = None,
      kwargs: Any = None,
  ) -> Dict[str, Any]:
      ...
      return await repo_search(
          ...
          compact=compact,
+         debug=debug,
          kwargs=kwargs,
      )
🤖 Fix all issues with AI agents
In `@scripts/mcp_impl/search.py`:
- Around line 494-498: The assignment to debug_raw is unused and should be
removed; in the function where debug is processed (the lines assigning debug_raw
= debug and debug = _to_bool(debug, False)), delete the debug_raw = debug
statement and keep only debug = _to_bool(debug, False) so the code no longer
defines an unused variable (referencing the debug_raw symbol and the _to_bool
call to locate the change).
- Around line 80-91: The function _strip_debug_fields currently ignores the
keep_paths parameter and always strips host_path/container_path because
_DEBUG_RESULT_FIELDS includes those keys; update _strip_debug_fields to respect
keep_paths by only removing host_path and container_path when keep_paths is
False (e.g., compute the set of fields to strip from _DEBUG_RESULT_FIELDS minus
{"host_path","container_path"} when keep_paths is True, or copy the item and pop
those keys conditionally), leaving other debug removals unchanged; ensure the
docstring stays consistent with the new behavior.
…or Codex compatibility
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@ctx-mcp-bridge/src/mcpServer.js`:
- Around line 81-96: The function dedupeResourceTemplates is incorrectly reading
uriTemplate from template.resourceTemplate.uriTemplate which is always undefined
for MCP ResourceTemplate objects; update the extraction to read
template.uriTemplate (and keep a safe typeof string check) so uri is correctly
derived from each ResourceTemplate, then the seen/skip logic will work and out
will contain deduplicated templates; reference the dedupeResourceTemplates
function and the templates/uriTemplate symbols when making the change.
Adds a new "bundled" mode for the MCP bridge, allowing the extension to use a pre-packaged bridge binary. The extension now attempts to use the bundled bridge if the `mcpBridgeMode` setting is set to "bundled". If not found, it falls back to external resolution. Also rewords the "localOnly" setting description to reflect the addition of "mcpBridgeMode".
- Updates code search to accept additional parameters for debugging and output formatting.
- Caches the excluded directory names in the upload clients to improve performance and prevent unintended behavior from runtime changes to exclusion settings.
- Adds a fallback mechanism for Qdrant client connections in the admin copy collection function, ensuring resilience in case pooled client acquisition fails.
Ensures the user is prompted to create a virtual environment and install dependencies when both the initial Python environment and auto-detected environments fail to provide the required modules. Removes the conditional check that prevented the prompt when modules were missing from an auto-detected interpreter.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @.github/workflows/claude.yaml:
- Around line 12-13: The workflow uses the pull_request_target event combined
with checking out the PR head (github.event.pull_request.head.sha), which lets
untrusted PR code run with elevated permissions; change the workflow event to
pull_request or, if pull_request_target is required, update the checkout step to
use the base branch (github.event.pull_request.base.sha) instead of the PR head,
or add an explicit human-approval job that gates any steps requiring secrets;
search for the token usage and the checkout action (references:
pull_request_target, github.event.pull_request.head.sha,
github.event.pull_request.base.sha) and implement one of these safer
alternatives.
- Around line 18-27: The conditional in the workflow only checks
github.event.comment.author_association, so jobs triggered by
pull_request_target events (where the PR body contains "@claude") never pass
because there is no comment object; update the if expression to validate author
association for the correct event payloads by adding checks for
github.event.pull_request.author_association and
github.event.review.author_association (in addition to
github.event.comment.author_association and
github.event.issue.author_association) or conditionally select the association
based on event type (e.g., check github.event_name == 'pull_request' ?
github.event.pull_request.author_association :
github.event.comment.author_association) while preserving the existing
sender.type == 'User' and allowed associations (OWNER, MEMBER, COLLABORATOR) so
PR-open/synchronize events will run when `@claude` is in the PR body.
In `@scripts/remote_upload_client.py`:
- Around line 1419-1428: The watcher currently only checks idx.CODE_EXTS using
the file suffix, so extensionless code files (e.g., Dockerfile, Makefile,
.gitignore) are skipped; change the filter in both the src_path and dest_path
branches to consult CODE_EXTS using the file suffix when present and fall back
to the filename when suffix is empty — e.g., compute key =
src_path.suffix.lower() or src_path.name when suffix is empty and use
idx.CODE_EXTS.get(key, "unknown") != "unknown" while preserving the existing
self.client._is_ignored_path checks (apply same change for dest_path handling).
🧹 Nitpick comments (7)
vscode-extension/build/build.sh (1)
83-95: Guard missing `bin`/`src` to avoid hard build failures.
With `set -e`, a missing `bin` or `src` will abort packaging even though the bridge source exists. Consider making these copies conditional (mirroring the node_modules behavior) to keep bundling best-effort.
🛠️ Suggested hardening
  if [[ -d "$BRIDGE_SRC" && -f "$BRIDGE_SRC/package.json" ]]; then
      echo "Bundling MCP bridge npm package into staged extension..."
      mkdir -p "$STAGE_DIR/$BRIDGE_DIR"
-     cp -a "$BRIDGE_SRC/bin" "$STAGE_DIR/$BRIDGE_DIR/"
-     cp -a "$BRIDGE_SRC/src" "$STAGE_DIR/$BRIDGE_DIR/"
+     if [[ -e "$BRIDGE_SRC/bin" ]]; then
+         cp -a "$BRIDGE_SRC/bin" "$STAGE_DIR/$BRIDGE_DIR/"
+     else
+         echo "Warning: Bridge bin not found; skipping."
+     fi
+     if [[ -d "$BRIDGE_SRC/src" ]]; then
+         cp -a "$BRIDGE_SRC/src" "$STAGE_DIR/$BRIDGE_DIR/"
+     else
+         echo "Warning: Bridge src not found; skipping."
+     fi
      cp "$BRIDGE_SRC/package.json" "$STAGE_DIR/$BRIDGE_DIR/"

scripts/mcp_indexer_server.py (1)
1193-1226: Consider forwarding the `debug` parameter in `repo_search_compat` for consistency.
The new `debug` parameter added to `repo_search` is not forwarded through `repo_search_compat`. Clients using this compatibility wrapper won't be able to enable debug mode.
♻️ Proposed fix to forward debug parameter
  forward = {
      "query": query,
      "limit": limit,
      "per_path": args.get("per_path"),
      "include_snippet": args.get("include_snippet"),
      "context_lines": args.get("context_lines"),
      "rerank_enabled": args.get("rerank_enabled"),
      "rerank_top_n": args.get("rerank_top_n"),
      "rerank_return_m": args.get("rerank_return_m"),
      "rerank_timeout_ms": args.get("rerank_timeout_ms"),
      "highlight_snippet": args.get("highlight_snippet"),
      "collection": args.get("collection"),
      "session": args.get("session"),
      "workspace_path": args.get("workspace_path"),
      "language": args.get("language"),
      "under": args.get("under"),
      "kind": args.get("kind"),
      "symbol": args.get("symbol"),
      "path_regex": args.get("path_regex"),
      "path_glob": args.get("path_glob"),
      "not_glob": args.get("not_glob"),
      "ext": args.get("ext"),
      "not_": not_value,
      "case": args.get("case"),
      "compact": args.get("compact"),
+     "debug": args.get("debug"),
      "mode": args.get("mode"),
      "repo": args.get("repo"),  # Cross-codebase isolation
      "output_format": args.get("output_format"),  # "json" or "toon"
      # Alias passthroughs captured by repo_search(**kwargs)
      "queries": queries,
      "q": args.get("q"),
      "text": args.get("text"),
      "top_k": args.get("top_k"),
  }

scripts/upload_service.py (1)
1245-1286: Best-effort graph copy detection logic is acceptable but complex.The nested try-except structure correctly implements a fallback chain (pooled client → direct client → default to "0"), which is appropriate for a non-critical "best-effort" feature. A few observations:
- Line 1271: The
or 5is redundant since the default is already"5".- The
try-except-passoncli.close()(lines 1279-1282) is acceptable for cleanup code where logging would add noise.Consider extracting the graph-existence check into a helper function if this pattern is reused elsewhere, but it's acceptable as-is for a single use case.
♻️ Minor cleanup for redundant default
cli = QdrantClient( url=QDRANT_URL, api_key=os.environ.get("QDRANT_API_KEY"), - timeout=float(os.environ.get("QDRANT_TIMEOUT", "5") or 5), + timeout=float(os.environ.get("QDRANT_TIMEOUT", "5")), )vscode-extension/context-engine-uploader/mcp_bridge.js (1)
99-107: Potential mismatch between reportedkindand actual source.When
mode === 'bundled'but the bundled binary is not found,findLocalBridgeBin()falls back to external resolution (configured path or env override). If an external binary exists,binPathwill be set, butkindwould still be reported as'bundled'because line 106 only checks the mode, not the actual source of the binary.Consider tracking whether the path came from the bundled source:
♻️ Proposed fix to accurately report the binary source
function resolveBridgeCliInvocation() { - const binPath = findLocalBridgeBin(); const mode = getBridgeMode(); + const bundledBin = mode === 'bundled' ? findBundledBridgeBin() : undefined; + const binPath = bundledBin || findLocalBridgeBin(); + const isBundled = !!bundledBin; if (binPath) { return { command: 'node', args: [binPath], - kind: mode === 'bundled' ? 'bundled' : 'local' + kind: isBundled ? 'bundled' : 'local' }; }Note: This would require refactoring
findLocalBridgeBin()to skip bundled resolution when called from here, or extracting the bundled check..github/workflows/claude.yaml (2)
44-54: Redundant configuration: settings file and inlinesettingsparameter both defined.The Claude settings are configured twice:
- Lines 44-54: Written to
/home/runner/.claude/settings.json- Line 64: Passed via
settingsparameterThis duplication is unnecessary and could cause confusion if they diverge. Pick one approach.
♻️ Proposed fix: Remove the redundant settings file step
- - name: Create Claude settings file - run: | - mkdir -p /home/runner/.claude - cat > /home/runner/.claude/settings.json << 'EOF' - { - "env": { - "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic", - "ANTHROPIC_AUTH_TOKEN": "${{ secrets.CUSTOM_ENDPOINT_API_KEY }}" - } - } - EOF - - name: Run Claude CodeAlso applies to: 64-64
39-39: Consider pinning GitHub Actions to specific commit SHAs for enhanced supply chain security.
This workflow has write permissions (
contents: write,pull-requests: write,issues: write) and accesses secrets (CUSTOM_ENDPOINT_API_KEY), making it a higher-value target. Pinningactions/checkout@v4andanthropics/claude-code-action@v1to full commit SHAs prevents potential supply chain attacks if an action repository is compromised. Major version tags can change without notice, whereas commit SHAs are immutable.vscode-extension/context-engine-uploader/python_env.js (1)
363-365: Redundant cache check.
This cache check will never return `true` at this point in the flow — `venvPython` was just resolved and the cache is only populated on line 368 after `checkPythonDeps` succeeds. This block is effectively dead code.
🧹 Proposed cleanup
  setPythonOverridePath(venvPython);
  log(`Using private venv interpreter: ${getPythonOverridePath()}`);
  const venvKey = cacheKey(venvPython, workingDirectory);
- if (depCheckCache.get(venvKey)) {
-   return true;
- }
  const finalOk = await checkPythonDeps(venvPython, workingDirectory, { showInterpreterError: true });
  if (finalOk) {
    depCheckCache.set(venvKey, true);
pull_request_target:
  types: [opened, synchronize]
Critical: pull_request_target with PR head checkout creates a "pwn request" vulnerability.
Using pull_request_target grants write permissions and access to secrets, but checking out github.event.pull_request.head.sha (the untrusted PR code) allows malicious PRs to execute arbitrary code with those elevated privileges. An attacker could modify workflow files or exfiltrate secrets.
Safer alternatives:
- Use `pull_request` event instead (runs in PR context without secrets access)
- If `pull_request_target` is required, only checkout the base branch, not the PR head
- Add an explicit approval step before running on external PRs
🔒 Recommended fix: Use pull_request event or avoid checking out PR head
- pull_request_target:
- types: [opened, synchronize]
+ pull_request:
+ types: [opened, synchronize]

Or if you need secrets access, don't checkout PR head:
- name: Checkout repository
uses: actions/checkout@v4
- with:
- # This correctly checks out the PR's head commit for pull_request_target events.
- ref: ${{ github.event.pull_request.head.sha }}

Also applies to: 38-42
if: >
  (contains(github.event.comment.body, '@claude') ||
  contains(github.event.review.body, '@claude') ||
  contains(github.event.issue.body, '@claude') ||
  contains(github.event.pull_request.body, '@claude')) &&
  (github.event.sender.type == 'User' && (
  github.event.comment.author_association == 'OWNER' ||
  github.event.comment.author_association == 'MEMBER' ||
  github.event.comment.author_association == 'COLLABORATOR'
  ))
Condition logic is incomplete for pull_request_target events.
The author_association check only references github.event.comment.author_association, which is undefined for pull_request_target opened/synchronize events (there's no comment). This causes the job to never run for those triggers even when @claude is in the PR body.
If you intend to support PR-triggered runs, you need to check associations conditionally:
🛠️ Proposed fix to handle different event types
if: >
(contains(github.event.comment.body, '@claude') ||
contains(github.event.review.body, '@claude') ||
contains(github.event.issue.body, '@claude') ||
contains(github.event.pull_request.body, '@claude')) &&
- (github.event.sender.type == 'User' && (
- github.event.comment.author_association == 'OWNER' ||
- github.event.comment.author_association == 'MEMBER' ||
- github.event.comment.author_association == 'COLLABORATOR'
+ github.event.sender.type == 'User' && (
+ github.event.comment.author_association == 'OWNER' ||
+ github.event.comment.author_association == 'MEMBER' ||
+ github.event.comment.author_association == 'COLLABORATOR' ||
+ github.event.review.author_association == 'OWNER' ||
+ github.event.review.author_association == 'MEMBER' ||
+ github.event.review.author_association == 'COLLABORATOR' ||
+ github.event.issue.author_association == 'OWNER' ||
+ github.event.issue.author_association == 'MEMBER' ||
+ github.event.issue.author_association == 'COLLABORATOR' ||
+ github.event.pull_request.author_association == 'OWNER' ||
+ github.event.pull_request.author_association == 'MEMBER' ||
+ github.event.pull_request.author_association == 'COLLABORATOR'
- ))
+ )
# Always check src_path
src_path = Path(event.src_path)
- if idx.CODE_EXTS.get(src_path.suffix.lower(), "unknown") != "unknown":
+ if not self.client._is_ignored_path(src_path) and idx.CODE_EXTS.get(src_path.suffix.lower(), "unknown") != "unknown":
    paths_to_process.append(src_path)

# For FileMovedEvent, also process the destination path
if hasattr(event, 'dest_path') and event.dest_path:
    dest_path = Path(event.dest_path)
-   if idx.CODE_EXTS.get(dest_path.suffix.lower(), "unknown") != "unknown":
+   if not self.client._is_ignored_path(dest_path) and idx.CODE_EXTS.get(dest_path.suffix.lower(), "unknown") != "unknown":
        paths_to_process.append(dest_path)
Event-based watcher skips extensionless code files.
Line 1421/1427 only checks idx.CODE_EXTS, so files like Dockerfile, Makefile, or .gitignore (now supported elsewhere) won’t be enqueued under watchdog. Those edits won’t sync in watch mode.
🛠️ Suggested fix to include extensionless files in watchdog filtering
@@
+ def _is_code_path(self, path: Path) -> bool:
+ fname = path.name.lower()
+ if path.suffix.lower() in idx.CODE_EXTS:
+ return True
+ try:
+ extensionless = set((idx.EXTENSIONLESS_FILES or {}).keys())
+ except Exception:
+ extensionless = set()
+ return fname in extensionless or fname.startswith("dockerfile")
@@
- if not self.client._is_ignored_path(src_path) and idx.CODE_EXTS.get(src_path.suffix.lower(), "unknown") != "unknown":
+ if not self.client._is_ignored_path(src_path) and self.client._is_code_path(src_path):
paths_to_process.append(src_path)
@@
- if not self.client._is_ignored_path(dest_path) and idx.CODE_EXTS.get(dest_path.suffix.lower(), "unknown") != "unknown":
+ if not self.client._is_ignored_path(dest_path) and self.client._is_code_path(dest_path):
paths_to_process.append(dest_path)🤖 Prompt for AI Agents
In `@scripts/remote_upload_client.py` around lines 1419 - 1428, The watcher
currently only checks idx.CODE_EXTS using the file suffix, so extensionless code
files (e.g., Dockerfile, Makefile, .gitignore) are skipped; change the filter in
both the src_path and dest_path branches to consult CODE_EXTS using the file
suffix when present and fall back to the filename when suffix is empty — e.g.,
compute key = src_path.suffix.lower() or src_path.name when suffix is empty and
use idx.CODE_EXTS.get(key, "unknown") != "unknown" while preserving the existing
self.client._is_ignored_path checks (apply same change for dest_path handling).
Summary by CodeRabbit
New Features
Bug Fixes
Improvements
Tests