feat+fix: C# support, investigation-grade trace output, BM25 search, execution flows, channel detection #162

Open
Koolerx wants to merge 17 commits into DeusData:main from Koolerx:fix/csharp-and-trace-improvements

@Koolerx Koolerx commented Mar 27, 2026

Summary

17 commits adding major features and fixing critical bugs across the MCP handler, extraction layer, and pipeline. Developed while stress-testing against large enterprise codebases (C# monolith ~109K nodes, Node.js/TS Hapi.js monorepo ~144K nodes, React/TS monorepo ~9K nodes) and running real investigation scenarios.

Bug Fixes

1. trace_call_path Class → Method resolution

File: src/mcp/mcp.c

When targeting a Class/Interface node, BFS now resolves through DEFINES_METHOD edges to find callable methods, then runs BFS from each. Previously returned 0 results for class names. Also expands edge types to include HTTP_CALLS and ASYNC_CALLS.

2. detect_changes use-after-free

File: src/mcp/mcp.c

Filenames came from a stack buffer reused each fgets() iteration, and node names were freed before serialization. Switched to the copying variants yyjson_mut_arr_add_strcpy / yyjson_mut_obj_add_strcpy.

3. Route path validation

File: src/pipeline/pass_httplinks.c

Vendored/minified JS files produced false positive routes (JS operators as HTTP methods). Adds blocklist filter for keywords and operators.

4. C# inheritance via base_list

File: internal/cbm/extract_defs.c

Tree-sitter C# base_list nodes weren't handled. Adds explicit traversal with generic type argument stripping. INHERITS edges: 210 → 1,588 (7.5x).

5. Crash on 0-edge nodes + fuzzy name fallback

File: src/mcp/mcp.c

Fixes double-free when tracing nodes with 0 edges. Moves traversal array to heap. Adds fuzzy substring fallback when exact name match returns 0 results.

6. Class has_method uses correct node ID

File: src/mcp/mcp.c

DEFINES_METHOD BFS was using method IDs (from class resolution) instead of the Class node ID. Fixed to query from the Class node directly. Result: 30 methods found for test class (was 0).

New Features

7. get_architecture returns full analysis

File: src/mcp/mcp.c

Wired cbm_store_get_architecture() into the MCP handler. Returns languages, hotspots, routes, entry points, packages, clusters.

8. Louvain clustering with semantic labels

File: src/store/store.c

cbm_louvain() was implemented but never called. Added arch_clusters() that runs community detection and derives semantic labels from member file paths (e.g., Controllers, Services, Storage instead of Cluster_N).

9. Hapi.js route extraction

Files: src/pipeline/httplink.c, pass_httplinks.c, pass_parallel.c

New cbm_extract_hapi_routes() for { method: 'GET', path: '/api/...', handler: ... } object-literal patterns. Mini-parser tracks enclosing brace scope. Wired into both prescan and disk fallback paths. Test Hapi repo: 0 → 1,665 routes.

10. BM25 full-text search via SQLite FTS5

Files: src/store/store.c, Makefile.cbm

FTS5 virtual table synced via triggers + bulk backfill after B-tree dump. New query parameter on search_graph uses bm25() ranking. Structural boost: Function/Method +10, Class +5, Route +8. Excludes File/Module/Folder noise.

11. Execution flow detection

Files: src/store/store.c, src/pipeline/pipeline.c, src/mcp/mcp.c

Auto-detects cross-community execution flows from entry points via BFS + Louvain. Domain-weighted terminal naming avoids generic names. New processes/process_steps tables. New list_processes and get_process_steps MCP tools. 300 flows detected in test repo.

12. Socket.IO + EventEmitter channel detection

Files: src/pipeline/httplink.c, src/store/store.c, src/mcp/mcp.c

Extracts emit/listen patterns from JS/TS/Python/C# source. C# extractor resolves const string constants to actual channel names. New channels table. New get_channels MCP tool with cross-project querying (no link step needed). 210 channels in TS repo, 73 in C# repo.

13. get_impact blast radius tool

File: src/mcp/mcp.c

New MCP tool: upstream/downstream BFS with depth-grouped results (d1_will_break, d2_likely_affected, d3_may_need_testing). Risk assessment (LOW/MEDIUM/HIGH/CRITICAL). Affected processes cross-referenced.

14. Cypher JSON property access

File: src/cypher/cypher.c

Unknown properties in WHERE clauses now fall through to json_extract() on properties_json. Enables WHERE n.is_entry_point = 'true'.

15. Investigation-grade categorized trace output

File: src/mcp/mcp.c

Complete rewrite of trace_call_path output: incoming: {calls, imports, extends}, outgoing: {calls, has_method, extends}, plus transitive_callers (isolated, capped at 50), disambiguation with candidates, process participation, and matched node info with file:line.

16. C# delegate/event handler call resolution

Files: internal/cbm/extract_calls.c, internal/cbm/extract_unified.c

Three extraction-layer fixes:

  • event += MethodName creates CALLS edge (bare method reference)
  • delegate?.Invoke() resolves to receiver name (delegate dispatch)
  • Lambda expressions inside += don't create scope boundaries (calls attributed to parent method)

17. C# channel detection with constant resolution

File: src/pipeline/httplink.c

New cbm_extract_csharp_channels(): two-pass scan that first collects const string mappings, then resolves .Emit(CONSTANT) and .OnRequest<T>(CONSTANT) to actual channel names. Handles both constant references and string literals.

Testing

All changes compile clean with -Wall -Wextra -Werror. Stress-tested against:

  • Large C# monolith (~109K nodes, ~10.7K files) — class hierarchy, delegate chains, channel detection
  • Node.js/TS Hapi.js monorepo (~144K nodes, ~3.9K files) — routes, BM25 search, process detection, Socket.IO channels
  • React/TS monorepo (~9K nodes) — component tracing, routes
  • Real investigation scenarios: upstream error tracing, blast radius analysis, auth flow walkthrough, cross-service channel tracing

Your Name added 7 commits March 27, 2026 12:57
When trace_call_path targets a Class or Interface node, the BFS now resolves
through DEFINES_METHOD edges to find the actual callable methods, then runs
BFS from each method and merges results. Previously, tracing a class name
returned 0 results because Class nodes have no direct CALLS edges — only
their Method children do.

Also expands edge types to include HTTP_CALLS and ASYNC_CALLS alongside
CALLS for broader cross-service coverage.

Node selection improved: when multiple nodes share the same name (e.g. a
Class and its constructor Method), prefer the Class for resolution since
constructors rarely have interesting outbound CALLS.

Tested: C# class tracing went from 0 to 87 callees and
8 callers. TS repos unchanged at 50 callers.
…free

detect_changes was using yyjson_mut_arr_add_str / yyjson_mut_obj_add_str
which borrow pointers. The file name came from a stack buffer reused each
fgets() iteration, and node names were freed by cbm_store_free_nodes before
serialization. This caused corrupted output with null bytes embedded in
filenames (e.g. 'CLAUDE.md\0\0\0ings.json').

Switch to yyjson_mut_arr_add_strcpy / yyjson_mut_obj_add_strcpy which copy
the strings into yyjson's internal allocator, making them safe across the
buffer reuse and free boundaries.
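
A minimal sketch of the borrow-versus-copy problem described above, written without yyjson so it stands alone. The names copy_string and collect_names are hypothetical; the heap copy plays the role that the strcpy variants play inside yyjson's allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: heap-copy a string so it outlives the source buffer. */
char *copy_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *out = malloc(n);
    if (out) memcpy(out, s, n);
    return out;
}

/* Splits newline-separated text into names. Each line passes through one
 * reused stack buffer, mirroring the fgets() loop in detect_changes. */
size_t collect_names(const char *text, char **names, size_t max) {
    char line[256];                      /* overwritten on every iteration */
    size_t count = 0;
    while (*text && count < max) {
        size_t len = strcspn(text, "\n");
        size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;
        memcpy(line, text, copy);
        line[copy] = '\0';
        /* names[count] = line; would borrow: every entry would alias `line`. */
        names[count] = copy_string(line);
        if (!names[count]) break;
        count++;
        text += len;
        if (*text == '\n') text++;
    }
    return count;
}
```

The caller owns the returned strings and frees them after serialization, so the lifetime no longer depends on the loop's buffer.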
Vendored/minified JS files (tsc.js, typescript.js) inside non-JS repos
produce false positive routes when the Express route extractor matches
JS operators and keywords as route paths.

Add a validation filter that rejects:
- JS/TS operators: !, +, ++, -, --, :, ~
- JS/TS keywords: void, null, true, false, throw, this, typeof, etc.
- Single-character non-slash paths (*, ?, #)
- Paths with no alphanumeric or slash characters

Also trims leading/trailing whitespace before comparison to catch
'void ' and 'throw ' variants from minified source.

Tested: Routes went from 42 (20 garbage) to 22 real routes in test C# repo.
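
The filter rules above could be sketched roughly as follows. The function name is_plausible_route_path and the exact blocklist contents are assumptions for illustration; the keyword list here covers only the entries named in this message.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative subset of rejected tokens (operators, then keywords). */
static const char *const BLOCKED[] = {
    "!", "+", "++", "-", "--", ":", "~",
    "void", "null", "true", "false", "throw", "this", "typeof",
};

bool is_plausible_route_path(const char *raw) {
    /* Trim leading and trailing whitespace into a local buffer. */
    while (*raw && isspace((unsigned char)*raw)) raw++;
    size_t len = strlen(raw);
    while (len > 0 && isspace((unsigned char)raw[len - 1])) len--;
    char path[256];
    if (len == 0 || len >= sizeof path) return false;
    memcpy(path, raw, len);
    path[len] = '\0';

    /* A single-character path is only valid when it is "/". */
    if (len == 1 && path[0] != '/') return false;

    for (size_t i = 0; i < sizeof BLOCKED / sizeof BLOCKED[0]; i++)
        if (strcmp(path, BLOCKED[i]) == 0) return false;

    /* Require at least one alphanumeric or slash character. */
    for (size_t i = 0; i < len; i++)
        if (isalnum((unsigned char)path[i]) || path[i] == '/') return true;
    return false;
}
```

Trimming before the comparison is what lets 'void ' and 'throw ' hit the same blocklist entries as their untrimmed forms.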
The tree-sitter C# grammar represents class inheritance via 'base_list'
child nodes (e.g. 'class Foo : Bar, IBaz'). The extract_base_classes
function didn't handle this node type, causing most C# inheritance to
be missed.

Add explicit traversal of base_list children, extracting type identifiers
from both direct identifier nodes and wrapper nodes (simple_base_type,
primary_constructor_base_type). Generic type arguments are stripped for
resolution (List<int> → List).

Tested: INHERITS edges went from 210 to 1,588 in test C# repo (7.5x improvement).
Verified results include real C# domain classes (e.g.
ClassA→BaseClassB, TestSuite→TestsBase, etc.).
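
For illustration, the generic-argument stripping step (List<int> → List) can be approximated on plain strings. The real extractor works on tree-sitter nodes; strip_generic_args is a hypothetical name and this is only a sketch of the string-level effect.

```c
#include <string.h>

/* Copies `type` into `out` without its generic argument list,
 * i.e. everything from the first '<' onward is dropped. */
void strip_generic_args(const char *type, char *out, size_t out_size) {
    if (out_size == 0) return;
    size_t len = strcspn(type, "<");     /* length up to '<' or end of string */
    if (len >= out_size) len = out_size - 1;
    memcpy(out, type, len);
    out[len] = '\0';
}
```

Cutting at the first '<' also handles nested arguments such as Dictionary<string, List<int>>, since only the outer type name is needed for resolution.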
The get_architecture MCP handler was only returning node/edge label counts
(identical to get_graph_schema). The store has a full architecture analysis
function cbm_store_get_architecture() that computes languages, hotspots,
routes, entry points, packages, clusters, and layers — but it was never
called from the MCP handler.

Wire all architecture aspects into the response:
- languages: file counts per language
- hotspots: highest fan-in functions
- routes: HTTP route definitions
- entry_points: main/handler functions
- packages: top-level module groupings
- clusters: Louvain community detection results

Use strcpy variants for all architecture strings since they're freed by
cbm_store_architecture_free before any potential reuse.

Tested: get_architecture went from 0 for all fields to 10 languages,
10 hotspots, 13 routes, 20 entry points, 15 packages.
The cbm_louvain() function was fully implemented but never called.
Add arch_clusters() that loads all callable nodes and CALLS edges,
runs Louvain community detection, groups results by community ID,
and populates cbm_cluster_info_t with member counts and top-5 nodes
per cluster sorted by largest communities first.

Wire into cbm_store_get_architecture() dispatch for the 'clusters' aspect.
Cap output at 20 clusters. Top nodes per cluster are selected by iterating
community members (degree-based sorting can be added later).

Tested: Test C# repo went from 0 to 20 clusters. Largest cluster has 3,205 members
(test code), second has 1,881 (core API functions).
Add cbm_extract_hapi_routes() that handles the Hapi.js route registration
pattern: { method: 'GET', path: '/api/...', handler: ... }. Uses a
mini-parser that finds method:/path: property pairs within the same object
literal by tracking enclosing brace scope. Also extracts handler references.

Wired into both the prescan (parallel) path in pass_parallel.c and the
disk fallback path in pass_httplinks.c for both per-function and
module-level source scanning.

Tested: Test TS/Hapi repo went from 0 to 1,665 routes.
CBM now finds every route definition AND API call site, compared to
only 12 from external service proxy routes with the previous tool.
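
A rough sketch of the brace-scope idea, assuming a much simpler parser than cbm_extract_hapi_routes(): it locates the first method: key, finds the object literal that encloses it by matching braces in both directions, and then reads method: and path: from that span. The names quoted_value and extract_hapi_route are hypothetical, and braces inside string literals are not handled.

```c
#include <string.h>

/* Copies the quoted string that follows `key` inside [begin, end) into out. */
static int quoted_value(const char *begin, const char *end, const char *key,
                        char *out, size_t out_size) {
    size_t klen = strlen(key);
    for (const char *p = begin; p + klen <= end; p++) {
        if (strncmp(p, key, klen) != 0) continue;
        const char *q = p + klen;
        while (q < end && *q == ' ') q++;
        if (q >= end || (*q != '\'' && *q != '"')) continue;
        char quote = *q++;
        const char *close = memchr(q, quote, (size_t)(end - q));
        if (!close) return 0;
        size_t len = (size_t)(close - q);
        if (len >= out_size) return 0;
        memcpy(out, q, len);
        out[len] = '\0';
        return 1;
    }
    return 0;
}

/* Extracts method and path from the first route-like object in src.
 * Returns 1 when both properties were found in the same object literal. */
int extract_hapi_route(const char *src, char *method, size_t msize,
                       char *path, size_t psize) {
    const char *m = strstr(src, "method:");
    if (!m) return 0;

    /* Walk backward to the '{' that opens the enclosing object literal. */
    const char *open = NULL;
    int depth = 0;
    for (const char *p = m; p > src; ) {
        p--;
        if (*p == '}') depth++;
        else if (*p == '{') {
            if (depth == 0) { open = p; break; }
            depth--;
        }
    }
    if (!open) return 0;

    /* Walk forward to the matching '}', skipping nested handler bodies. */
    const char *close = open + 1;
    depth = 1;
    while (*close && depth > 0) {
        if (*close == '{') depth++;
        else if (*close == '}') depth--;
        if (depth > 0) close++;
    }
    if (depth != 0) return 0;

    return quoted_value(open, close, "method:", method, msize) &&
           quoted_value(open, close, "path:", path, psize);
}
```

Requiring both keys inside the same brace pair is what keeps an unrelated path: property elsewhere in the file from being paired with a route's method.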
@Koolerx Koolerx force-pushed the fix/csharp-and-trace-improvements branch from 33a7d1d to 58fff9e Compare March 27, 2026 18:03
@Koolerx Koolerx changed the title fix: C# support improvements and MCP handler bug fixes fix+feat: C# support, MCP bug fixes, Hapi routes, Louvain clustering Mar 27, 2026
Your Name added 10 commits March 27, 2026 14:31
Add a nodes_fts FTS5 virtual table synced via triggers for INSERT/UPDATE/DELETE.
Enable SQLITE_ENABLE_FTS5 in both production and test Makefile flags.

New 'query' parameter on search_graph: when set, uses FTS5 MATCH with
bm25() ranking instead of regex matching. Multi-word queries are tokenized
into OR terms for broad matching (e.g. 'authentication middleware' matches
nodes containing either word, ranked by relevance).

The direct B-tree dump pipeline bypasses SQLite triggers, so add a bulk
FTS5 backfill step after indexing:
  INSERT INTO nodes_fts SELECT id, name, qualified_name, label, file_path FROM nodes

Add cbm_store_exec() public API for raw SQL execution.
Falls back gracefully to regex path if FTS5 is unavailable.

Tested: 'authentication middleware' query returns 242 ranked results
(was 0). 'session recording upload' returns 4,722 ranked results with
relevant routes, controllers, and constants at the top.
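
The multi-word tokenization could look something like this. build_or_query is a hypothetical name, and the sketch assumes plain word tokens; a real implementation would also need to quote or escape characters that are special in FTS5 query syntax.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Joins whitespace-separated words with " OR " for use in an FTS5 MATCH.
 * Returns 0 on success, -1 if `out` is too small. */
int build_or_query(const char *input, char *out, size_t out_size) {
    size_t used = 0;
    int words = 0;
    if (out_size == 0) return -1;
    out[0] = '\0';
    while (*input) {
        while (*input && isspace((unsigned char)*input)) input++;
        size_t len = 0;
        while (input[len] && !isspace((unsigned char)input[len])) len++;
        if (len == 0) break;             /* only trailing whitespace was left */
        int n = snprintf(out + used, out_size - used, "%s%.*s",
                         words ? " OR " : "", (int)len, input);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
        words++;
        input += len;
    }
    return 0;
}
```

With this shape, 'authentication middleware' becomes the MATCH expression authentication OR middleware, which is why nodes containing either word are returned and then ordered by bm25().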
… + Louvain

Add process detection as a post-indexing pass that discovers cross-community
execution flows:

1. Find all entry point nodes (is_entry_point=true or Route label)
2. Load CALLS edges and run Louvain community detection
3. BFS from each entry point to depth 8, max 200 visited nodes
4. Identify the deepest node that crosses a Louvain community boundary
5. Name the flow 'EntryPoint → Terminal' with process_type=cross_community
6. Store to new processes + process_steps tables

New schema: 'processes' table (id, project, label, process_type, step_count,
entry_point_id, terminal_id) and 'process_steps' table (process_id, node_id, step).

New store API: cbm_store_detect_processes(), cbm_store_list_processes(),
cbm_store_get_process_steps() with corresponding free functions.

New MCP tool: list_processes returns up to 300 processes ordered by step count.

Tested: TS/Hapi monorepo detects 300 cross-community processes, matching
the flow count from competing tools. Examples: 'ssoCallbackHandler →
catchUnexpectedResponse', 'exportCourse → sendSQSMessage'.
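
Steps 3 and 4 above can be sketched as a bounded BFS over an edge list. This is a simplified model, not the store implementation: edge_t, find_flow_terminal and the MAX_NODES limit are hypothetical, and "crosses a community boundary" is interpreted here as "belongs to a different community than the entry point".

```c
#include <stddef.h>

#define MAX_NODES   256   /* assumed capacity of this sketch */
#define MAX_DEPTH   8     /* BFS depth limit from the commit message */
#define MAX_VISITED 200   /* visited-node cap from the commit message */

typedef struct { int from, to; } edge_t;

/* Returns the deepest node reachable from `entry` whose community differs
 * from the entry's community, or -1 if the flow never leaves it. */
int find_flow_terminal(const edge_t *edges, size_t n_edges,
                       const int *community, int n_nodes, int entry) {
    int depth[MAX_NODES];
    int queue[MAX_NODES];
    int head = 0, tail = 0, visited = 0;
    int best = -1, best_depth = 0;

    if (n_nodes > MAX_NODES || entry < 0 || entry >= n_nodes) return -1;
    for (int i = 0; i < n_nodes; i++) depth[i] = -1;   /* -1 = unvisited */
    depth[entry] = 0;
    queue[tail++] = entry;

    while (head < tail && visited < MAX_VISITED) {
        int node = queue[head++];
        visited++;
        if (community[node] != community[entry] && depth[node] > best_depth) {
            best = node;
            best_depth = depth[node];
        }
        if (depth[node] == MAX_DEPTH) continue;        /* do not expand further */
        for (size_t e = 0; e < n_edges; e++) {
            if (edges[e].from != node) continue;
            int next = edges[e].to;
            if (next < 0 || next >= n_nodes || depth[next] != -1) continue;
            depth[next] = depth[node] + 1;
            queue[tail++] = next;                      /* each node enqueued once */
        }
    }
    return best;
}
```

For a chain 0 → 1 → 2 → 3 where nodes 2 and 3 sit in another community, the terminal is node 3, giving a flow named after the entry point and that node.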
Detect emit/listen channel patterns in JS/TS/Python source files during
indexing. Extracts socket.emit/on, io.emit/on, emitter.emit/on patterns
with a regex scanner that identifies receiver names against a whitelist
of known channel communicators (socket, io, emitter, eventBus, etc.).

Filters out generic Node.js stream events (error, close, data, etc.)
and classifies transport as 'socketio' or 'eventemitter' based on
receiver name.

New schema: 'channels' table (project, channel_name, direction, transport,
node_id, file_path, function_name) with indexes on channel_name and project.

New store API: cbm_store_detect_channels() scans source from disk for all
indexed Function/Method/Module nodes in JS/TS/Python files.
cbm_store_find_channels() queries by project and/or channel name with
partial matching. Automatic cross-repo matching at query time (no link step).

New MCP tool: get_channels returns matched channels with emitter/listener
info, filterable by channel name and project.

Tested: TS monorepo detects 210 channel references including Socket.IO
subscribe/unsubscribe flows between UI and server.
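
The receiver whitelist and event filter might be modelled like this. The lists contain only the names mentioned above, and the mapping of each receiver to a transport is an assumption; classify_transport and is_generic_stream_event are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subsets; the real whitelist and filter are longer. */
static const char *const SOCKETIO_RECEIVERS[] = { "socket", "io" };
static const char *const EMITTER_RECEIVERS[]  = { "emitter", "eventBus" };
static const char *const GENERIC_EVENTS[]     = { "error", "close", "data" };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

static bool in_list(const char *s, const char *const *list, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(s, list[i]) == 0) return true;
    return false;
}

/* Returns "socketio" or "eventemitter", or NULL when the receiver is not
 * a known channel communicator and the call should be ignored. */
const char *classify_transport(const char *receiver) {
    if (in_list(receiver, SOCKETIO_RECEIVERS, COUNT(SOCKETIO_RECEIVERS)))
        return "socketio";
    if (in_list(receiver, EMITTER_RECEIVERS, COUNT(EMITTER_RECEIVERS)))
        return "eventemitter";
    return NULL;
}

/* True for Node.js stream events that should not be recorded as channels. */
bool is_generic_stream_event(const char *event) {
    return in_list(event, GENERIC_EVENTS, COUNT(GENERIC_EVENTS));
}
```

A call such as socket.emit('subscribe', ...) would pass both checks and be stored with transport 'socketio', while stream.on('data', ...) would be dropped.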
node_prop() previously returned empty string for any property not in the
hardcoded column list (name, qualified_name, label, file_path, start_line,
end_line). Now falls through to json_extract_prop() on the node's
properties_json field for unknown properties.

Enables Cypher queries like:
  WHERE n.is_entry_point = 'true'
  WHERE n.is_test = '1'
  WHERE n.confidence > '0.5'

Also adds 'file' as an alias for 'file_path' and 'id' for the node ID.

Tested: "MATCH (n:Function) WHERE n.is_entry_point = 'true'" returns 10
controller handlers (previously 0).
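
The fall-through can be illustrated with a toy node struct and a naive lookup for flat JSON objects. This is not the actual node_prop() or json_extract_prop(); node_t and flat_json_get are hypothetical, and the lookup is not a real JSON parser (no escapes, no nesting, no whitespace before the colon).

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *label;
    const char *file_path;
    const char *properties_json;   /* flat object, e.g. {"is_test":"1"} */
} node_t;

/* Naive lookup of "key": in a flat JSON object. Returns 1 if found. */
static int flat_json_get(const char *json, const char *key,
                         char *out, size_t out_size) {
    char needle[64];
    int n = snprintf(needle, sizeof needle, "\"%s\":", key);
    if (out_size == 0 || n < 0 || (size_t)n >= sizeof needle) return 0;
    const char *p = strstr(json, needle);
    if (!p) return 0;
    p += n;
    while (*p == ' ') p++;
    const char *end;
    if (*p == '"') {                       /* string value: strip the quotes */
        p++;
        end = strchr(p, '"');
        if (!end) return 0;
    } else {                               /* bare value: true, 0.5, ... */
        end = p + strcspn(p, ",}");
    }
    size_t len = (size_t)(end - p);
    if (len >= out_size) len = out_size - 1;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}

/* Known columns first, then properties_json, then empty string. */
const char *node_prop(const node_t *node, const char *key,
                      char *out, size_t out_size) {
    const char *column = NULL;
    if (out_size == 0) return "";
    if (strcmp(key, "name") == 0) column = node->name;
    else if (strcmp(key, "label") == 0) column = node->label;
    else if (strcmp(key, "file_path") == 0 || strcmp(key, "file") == 0)
        column = node->file_path;
    if (column) snprintf(out, out_size, "%s", column);
    else if (!flat_json_get(node->properties_json, key, out, out_size))
        out[0] = '\0';
    return out;
}
```

Because every value comes back as text, comparisons in WHERE clauses are string comparisons, which matches the quoted 'true' and '1' in the examples above.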
…results

QFix 1 — trace_call_path disambiguation + file paths:
- When multiple callable symbols match, includes a 'candidates' array
  with name, label, file_path, line for each (like IDE go-to-definition)
- Every BFS result node now includes file_path, label, start_line
- Adds matched_file, matched_label, matched_line to the root response

QFix 2 — domain-weighted flow terminal naming:
- Reduced BFS max_results from 200 to 50 to prevent generic utility
  functions from becoming terminals
- Terminal candidates scored by: name length (domain names are longer),
  CamelCase bonus, domain verb bonus (Handler, Controller, Service, etc.),
  penalty for generic names (update, get, set, findOne, push, etc.)
- Result: 2/300 flows end in generic names (was ~280/300)
- Step count range: 3-51 (was 3-201)

QFix 3 — FTS5 search structural filtering:
- Exclude File/Module/Folder/Section/Variable/Project nodes from results
- Structural boost: Function/Method +10, Class/Interface/Type +5, Route +8
- High fan-in bonus: nodes with >5 CALLS in-degree get +3
- Result: 'authentication middleware' returns verifyJwt, apiMiddleware,
  createAuthRequestConfig (was returning Folder/Module/Section noise)
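
The filtering and boost rules from QFix 3 can be written as a small scoring helper. structural_boost is a hypothetical name, and how the bonus is combined with the bm25() score is not stated here, so the sketch only computes the additive bonus.

```c
#include <string.h>

/* Returns the additive rank bonus for a node, or -1 when the label is
 * excluded from full-text results altogether. */
int structural_boost(const char *label, int calls_in_degree) {
    static const char *const excluded[] = {
        "File", "Module", "Folder", "Section", "Variable", "Project",
    };
    for (size_t i = 0; i < sizeof excluded / sizeof excluded[0]; i++)
        if (strcmp(label, excluded[i]) == 0) return -1;

    int bonus = 0;
    if (strcmp(label, "Function") == 0 || strcmp(label, "Method") == 0)
        bonus = 10;
    else if (strcmp(label, "Class") == 0 || strcmp(label, "Interface") == 0 ||
             strcmp(label, "Type") == 0)
        bonus = 5;
    else if (strcmp(label, "Route") == 0)
        bonus = 8;

    if (calls_in_degree > 5) bonus += 3;   /* high fan-in bonus */
    return bonus;
}
```

Under these rules a Route with six incoming CALLS scores 11, slightly ahead of a rarely called Function at 10.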
Gap 1 — Semantic cluster labels:
Replace auto-numbered 'Cluster_N' with directory-derived semantic labels.
For each cluster, sample up to 50 member file paths, extract the most common
non-generic directory segment (skip src/lib/dist/test/node_modules/shared),
and TitleCase the result. Falls back to 'Cluster_N' when no
directory has >= 3 occurrences.
Result: 'Services', 'Components', 'Controllers', 'Storage', 'Models',
'Stores', 'Scenarios', 'Courses' — matching competing tool quality.
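
A sketch of the label derivation, under two assumptions that the message does not spell out: every non-generic directory segment of a path is counted (not just one per path), and only the first letter is capitalized. cluster_label and the size limits are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_SAMPLED_PATHS 50
#define MAX_SEGMENTS      64
#define SEG_LEN           64
#define MIN_OCCURRENCES   3

static const char *const GENERIC[] = {
    "src", "lib", "dist", "test", "node_modules", "shared",
};

static int is_generic(const char *seg) {
    for (size_t i = 0; i < sizeof GENERIC / sizeof GENERIC[0]; i++)
        if (strcmp(seg, GENERIC[i]) == 0) return 1;
    return 0;
}

/* Writes a label derived from member file paths, or "Cluster_N". */
void cluster_label(const char *const *paths, size_t n_paths, int cluster_id,
                   char *out, size_t out_size) {
    char segs[MAX_SEGMENTS][SEG_LEN];
    int counts[MAX_SEGMENTS];
    size_t n_segs = 0;

    for (size_t p = 0; p < n_paths && p < MAX_SAMPLED_PATHS; p++) {
        const char *s = paths[p];
        const char *slash;
        /* Only text before a '/' is a directory; the file name is skipped. */
        while ((slash = strchr(s, '/')) != NULL) {
            size_t len = (size_t)(slash - s);
            if (len > 0 && len < SEG_LEN) {
                char seg[SEG_LEN];
                memcpy(seg, s, len);
                seg[len] = '\0';
                if (!is_generic(seg)) {
                    size_t i = 0;
                    while (i < n_segs && strcmp(segs[i], seg) != 0) i++;
                    if (i == n_segs && n_segs < MAX_SEGMENTS) {
                        strcpy(segs[n_segs], seg);
                        counts[n_segs++] = 0;
                    }
                    if (i < n_segs) counts[i]++;
                }
            }
            s = slash + 1;
        }
    }

    size_t best = n_segs;                  /* n_segs means "none qualified" */
    for (size_t i = 0; i < n_segs; i++)
        if (counts[i] >= MIN_OCCURRENCES &&
            (best == n_segs || counts[i] > counts[best]))
            best = i;

    if (best == n_segs) {
        snprintf(out, out_size, "Cluster_%d", cluster_id);
        return;
    }
    snprintf(out, out_size, "%s", segs[best]);
    if (out_size > 0 && out[0])
        out[0] = (char)toupper((unsigned char)out[0]);
}
```

The threshold of three occurrences keeps a single stray directory from naming a whole cluster, at the cost of falling back to Cluster_N for small or scattered communities.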

Gap 2 — Process participation in trace_call_path:
After BFS traversal, query the processes table to find all execution flows
the traced function participates in (as entry point, terminal, or by name
substring match in the flow label). Includes up to 20 flows with label,
process_type, and step_count directly in the trace response — no separate
tool call needed.
…ss steps

Major rewrite of trace_call_path output for investigation-grade quality:

Categorized edges (Fixes A+D):
- incoming: { calls: [...], imports: [...], extends: [...] }
- outgoing: { calls: [...], has_method: [...], extends: [...] }
- Separate transitive_callers for depth > 1 (avoids noise in main results)
Each category queried independently via single-hop BFS on specific edge types.

Broader caller coverage (Fix A):
- Include USAGE and RAISES edges alongside CALLS for incoming queries
- Query both the Class node and its methods as BFS roots
- Result: MeteorError upstream goes from 9 to 39 callers

Noise elimination (Fix C):
- Default depth 1 for categorized results (direct only)
- Transitive callers isolated in separate field, capped at 50
- No more 106 render() methods polluting results

New get_impact tool (Fix F):
- BFS upstream/downstream with depth-grouped results
- d1_will_break / d2_likely_affected / d3_may_need_testing
- Risk assessment: LOW / MEDIUM / HIGH / CRITICAL based on d1 count
- Affected processes cross-referenced by name
- Tested: protectedUpdate returns CRITICAL (38 direct, 162 transitive)

New get_process_steps tool (Fix E):
- Returns ordered step list for a specific process ID
- Each step includes name, qualified_name, file_path
- Enables step-by-step flow debugging
Fix crash (double-free) when tracing nodes with 0 in-degree and 0 out-degree
(e.g. Type nodes, empty Class stubs). Detect early via cbm_store_node_degree
and return basic match info without attempting BFS traversal. Also move the
traversal result array from stack to heap to prevent stack smashing with
many start IDs.

Add fuzzy name fallback: when exact name match returns 0 results, run a
regex search with '.*name.*' pattern and return up to 10 suggestions with
name, label, file_path, line. This handles cases like searching for
'RecordingSession' when only 'ContinuousRecordingSessionDataGen' exists.
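
For a literal name, the '.*name.*' pattern is equivalent to a substring test, so the fallback can be sketched without a regex engine. suggest_names is a hypothetical name, and the sketch returns only the names, not the label, file_path and line that the real response includes.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SUGGESTIONS 10

/* Collects up to MAX_SUGGESTIONS names that contain `query` as a substring.
 * `out` must have room for MAX_SUGGESTIONS pointers. Returns the count. */
size_t suggest_names(const char *const *names, size_t n, const char *query,
                     const char **out) {
    size_t found = 0;
    for (size_t i = 0; i < n && found < MAX_SUGGESTIONS; i++)
        if (strstr(names[i], query) != NULL)
            out[found++] = names[i];
    return found;
}
```

Searching for 'RecordingSession' against a list containing 'ContinuousRecordingSessionDataGen' yields that one suggestion instead of an empty result.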
Three fixes for C# delegate and event subscription patterns that were
invisible to the call graph:

Fix 1 — Bare method reference subscription:
  event += MethodName creates a CALLS edge from the subscribing method
  to the handler. Detects assignment_expression with += operator where
  the RHS is an identifier or member_access_expression.
  e.g. socket.OnConnected += SocketOnConnected

Fix 2 — Delegate .Invoke() resolution:
  delegate?.Invoke(args) resolved to 'Invoke' which matches nothing.
  Now detects conditional_access_expression and member_access_expression
  where the method is 'Invoke', extracts the receiver (delegate property)
  name as the call target instead.
  e.g. OnConnected?.Invoke(this, e) → CALLS edge to 'OnConnected'

Fix 3 — Lambda event body scope attribution:
  Lambda expressions inside += assignments no longer create a new scope
  boundary. Calls inside the lambda body are attributed to the enclosing
  method that subscribes the event, not to an anonymous lambda scope.
  This means all handler logic is correctly attributed to the method
  that registers the event subscription.
  e.g. socket.OnError += (s, e) => { ErrorOnce(...); } attributes
  the ErrorOnce call to the method containing the += statement.

Tested on C# codebase: SocketOnConnected gained 1 incoming caller
(from += subscription) and 1 outgoing call (from ?.Invoke resolution).
InitializeExternalClient gained 10 additional outgoing calls from
lambda body attribution (30 total, up from 20).
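
Fix 2 can be illustrated on callee text rather than on tree-sitter nodes. This is a string-level approximation with a hypothetical name, resolve_call_target: when the callee ends in ?.Invoke or .Invoke, the last identifier of the receiver becomes the call target.

```c
#include <string.h>

/* Writes the resolved call target for `callee` into `out`.
 * "OnConnected?.Invoke" -> "OnConnected"; other callees are copied as-is. */
void resolve_call_target(const char *callee, char *out, size_t out_size) {
    /* "?.Invoke" must be tested first, since it also ends in ".Invoke". */
    static const char *const suffixes[] = { "?.Invoke", ".Invoke" };
    size_t len = strlen(callee);
    if (out_size == 0) return;

    for (size_t i = 0; i < 2; i++) {
        size_t slen = strlen(suffixes[i]);
        if (len > slen && strcmp(callee + len - slen, suffixes[i]) == 0) {
            len -= slen;                       /* drop the Invoke suffix */
            size_t start = len;                /* keep the last identifier */
            while (start > 0 && callee[start - 1] != '.') start--;
            callee += start;
            len -= start;
            break;
        }
    }
    if (len >= out_size) len = out_size - 1;
    memcpy(out, callee, len);
    out[len] = '\0';
}
```

Resolving to the delegate's own name is what lets the CALLS edge line up with the += subscription from Fix 1, which targets the same event member.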
Fix A — Class node 0-degree early exit:
The crash guard that returns early for nodes with 0 CALLS edges was
incorrectly catching Class/Interface nodes that have DEFINES_METHOD and
INHERITS edges (cbm_store_node_degree only counts CALLS). Re-add the
is_class_like exemption so Class nodes always proceed to DEFINES_METHOD
resolution. Cap method resolution to 5 methods to prevent excessive BFS.

Fix A2 — has_method uses Class node ID:
The DEFINES_METHOD BFS was using method start_ids (from class resolution)
as the BFS root, but DEFINES_METHOD edges go FROM the Class TO Methods.
Use the original Class node ID for the has_method query.
Result: 30 methods found (GitNexus: 29), extends chain shown.

Fix B1 — Add .cs to channel detection file filter:
Channel detection SQL now includes .cs files alongside JS/TS/Python.

Fix B2 — C# channel extraction with constant resolution:
New cbm_extract_csharp_channels() in httplink.c that handles:
- const string CONSTANT = "value" → builds name-to-value map
- .Emit(CONSTANT, ...) → resolves to string value, marks as emit
- .OnRequest<T>(CONSTANT, ...) → resolves to string value, marks as listen
- .Emit("literal", ...) → direct string literal matching
Result: 73 channel references, 35 unique channels in C# repo (was 0).
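
The second pass of the two-pass scan reduces to a lookup once the constant map exists. A minimal sketch, assuming the first pass has already produced name/value pairs; const_entry_t and resolve_channel_name are hypothetical, as is the constant used in the usage note.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;    /* C# constant identifier */
    const char *value;   /* string it was declared with */
} const_entry_t;

/* Resolves the first argument of .Emit(...) or .OnRequest<T>(...) to a
 * channel name. Accepts a quoted literal or a known constant name.
 * Returns 1 on success, 0 if the argument cannot be resolved. */
int resolve_channel_name(const const_entry_t *consts, size_t n_consts,
                         const char *arg, char *out, size_t out_size) {
    size_t len = strlen(arg);
    if (out_size == 0) return 0;

    if (len >= 2 && arg[0] == '"' && arg[len - 1] == '"') {
        arg++;                              /* string literal: strip quotes */
        len -= 2;
    } else {
        const char *value = NULL;
        for (size_t i = 0; i < n_consts; i++) {
            if (strcmp(consts[i].name, arg) == 0) {
                value = consts[i].value;
                break;
            }
        }
        if (!value) return 0;               /* unknown identifier */
        arg = value;
        len = strlen(value);
    }
    if (len >= out_size) return 0;
    memcpy(out, arg, len);
    out[len] = '\0';
    return 1;
}
```

With a map entry such as SESSION_STARTED = "session:started", both .Emit(SESSION_STARTED, ...) and .Emit("session:started", ...) resolve to the same channel name, so emitters and listeners match regardless of which form they use.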
@Koolerx Koolerx changed the title fix+feat: C# support, MCP bug fixes, Hapi routes, Louvain clustering feat+fix: C# support, investigation-grade trace output, BM25 search, execution flows, channel detection Mar 27, 2026
@DeusData DeusData added the enhancement, parsing/quality, and language-request labels Mar 29, 2026