perf: compute CFG in Rust native engine for all languages#342
perf: compute CFG in Rust native engine for all languages#342carlos-alm merged 10 commits intomainfrom
Conversation
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
Critical Issue Identified:
Minor Issues:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsLanguage ID Inconsistency: C# language ID mismatch will cause silent CFG extraction failures in production Final Recommendation
Required Changes:
|
Greptile SummaryThis PR ports the Most algorithmic issues have been resolved — multi-catch, infinite-loop reachability, switch/continue, labeled statements, Python try-else, Go One remaining issue: Ruby The core feature is well-implemented, but the Ruby rescue-else case must be fixed for full correctness. Confidence Score: 3/5
|
| end_line: Some(end_line(&member)), | ||
| decorators: None, | ||
| complexity: None, | ||
| cfg: build_function_cfg(&member, "go"), |
There was a problem hiding this comment.
Interface methods (method_elem in interface_type) don't have bodies - they're just signatures. build_function_cfg will return a trivial entry->exit CFG. Consider using cfg: None for interface methods to match how complexity: None is set.
There was a problem hiding this comment.
Fixed in d5aab61 — changed to cfg: None for interface methods (method_elem), matching the existing complexity: None pattern.
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
Addressed both review comments in d5aab61:
|
crates/codegraph-core/src/cfg.rs
Outdated
| let label_node = stmt.child_by_field_name("label"); | ||
| let body = stmt.child_by_field_name("body"); | ||
| if let (Some(label_node), Some(body)) = (label_node, body) { | ||
| let label_name = label_node.utf8_text(&[]).unwrap_or("").to_string(); |
There was a problem hiding this comment.
utf8_text(&[]) will panic for any labeled loop
Node::utf8_text(source) slices source[start_byte..end_byte]. Passing &[] (empty slice) is safe only if start_byte == end_byte == 0, which is never true for a real identifier node inside a function body. In practice this panics with index out of bounds: the len is 0 but the index is N.
The consequence is two-fold:
- Panic: Any file processed by the native engine that contains a labeled statement (valid in JS/TS, Go, Java, C#) will panic at parse time.
- Silent wrong CFG (if somehow unwrap_or catches it): Every label name is stored as
""inlabel_map, so a nested labeled break likebreak outersilently resolves to the innermost label instead of the correct one.
The root cause is that CfgBuilder has no access to the source bytes. build_function_cfg needs to accept a source: &[u8] parameter and thread it through to the builder.
The same bug applies on lines 521 and 542 for break and continue label extraction.
// Suggested fix: add source to CfgBuilder and build_function_cfg
pub struct CfgBuilder<'a> {
rules: &'a CfgRules,
source: &'a [u8], // <-- add this
...
}
pub fn build_function_cfg(function_node: &Node, lang_id: &str, source: &[u8]) -> Option<CfgData> {
let rules = get_cfg_rules(lang_id)?;
let mut builder = CfgBuilder::new(rules, source);
...
}Then replace all three utf8_text(&[]) calls with utf8_text(self.source).
There was a problem hiding this comment.
Critical bug, agreed. utf8_text(&[]) will panic with index-out-of-bounds for any labeled loop/break/continue since the label node has a non-zero byte offset into source. Will fix by adding source: &[u8] to build_function_cfg() and CfgBuilder, then updating all three call sites (lines 449, 521, 542) to use utf8_text(self.source). All extractor call sites will need to pass source bytes through.
There was a problem hiding this comment.
Fixed in 32a9463 — added source: &[u8] to CfgBuilder and build_function_cfg, replaced all 3 utf8_text(&[]) calls with utf8_text(self.source), and updated all 24 extractor call sites to pass source bytes.
crates/codegraph-core/src/cfg.rs
Outdated
| let body_block = self.make_block("loop_body", None, None, None); | ||
| self.add_edge(header, body_block, "branch_true"); |
There was a problem hiding this comment.
"branch_true" edge kind misleading for unconditional infinite loop
Unlike process_for_loop and process_while_loop, a Rust loop {} has no condition — the body is always entered. Using "branch_true" implies a condition check exists, which is incorrect and inconsistent with how the WASM JS side distinguishes loop types. A "fallthrough" edge (matching the entry edge from the loop header used everywhere else for unconditional control flow) would be semantically accurate here.
This is also why Go/Rust/Ruby are intentionally absent from the parity tests — if this were compared against the WASM implementation it would likely diverge on edge kinds for infinite loops.
| let body_block = self.make_block("loop_body", None, None, None); | |
| self.add_edge(header, body_block, "branch_true"); | |
| self.add_edge(header, body_block, "fallthrough"); |
There was a problem hiding this comment.
Correct — Rust loop {} is unconditional, there is no condition check. Using "branch_true" is semantically wrong and inconsistent with the WASM JS side. Will change to "fallthrough" for the entry edge into the loop body in process_infinite_loop.
There was a problem hiding this comment.
Fixed in 32a9463 — changed to "fallthrough" in process_infinite_loop.
Additional Comments (1)
The However, the scan loop only enters the Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducting comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality
|
| end_line: Some(end_line(&child)), | ||
| decorators: None, | ||
| complexity: None, | ||
| cfg: build_function_cfg(&child, "javascript"), |
There was a problem hiding this comment.
TypeScript interface method signatures receive non-null CFG
method_signature and property_signature nodes have no body field. When build_function_cfg is called on them, it finds no body, short-circuits to builder.add_edge(entry, exit, "fallthrough"), and returns a non-null CfgData with 2 blocks and 1 edge.
This contrasts with every other language extractor in this PR — Go, Java, C#, Ruby, PHP, and Rust all explicitly set cfg: None for interface/abstract members (matching complexity: None). For TS interface methods, the def.cfg?.blocks?.length check in cfg.js evaluates to 2, so these trivial 2-block CFGs get written to cfg_blocks/cfg_edges in the DB for every method signature in every TypeScript interface. This inflates the DB unnecessarily for zero analytical value.
| cfg: build_function_cfg(&child, "javascript"), | |
| cfg: None, |
There was a problem hiding this comment.
Fixed in e5bc19e — set cfg: None for method_signature and property_signature in the JS/TS extractor, matching Go/Java/C#/Ruby/PHP pattern for bodyless interface members.
crates/codegraph-core/src/cfg.rs
Outdated
| let join_block = self.make_block("body", None, None, None); | ||
|
|
||
| // Switch acts like a break target | ||
| self.loop_stack.push(LoopCtx { header_idx: switch_header, exit_idx: join_block }); |
There was a problem hiding this comment.
continue inside a switch resolves to switch header, not enclosing loop
The switch is pushed onto loop_stack so that break statements inside cases correctly target join_block (the switch exit). However, since loop_stack.last() is also what an unlabeled continue resolves to, any continue statement inside a switch case will incorrectly produce an edge to switch_header instead of the nearest enclosing loop's header.
Consider Java:
for (int i = 0; i < 10; i++) {
switch (x) {
case 1: continue; // should continue the *for*, not re-enter the switch
}
}With the current stack: [for_loop_ctx, switch_ctx], loop_stack.last() returns switch_ctx, so continue emits current -> switch_header instead of current -> for_header.
The standard fix is to track switches separately from loops (e.g., a switch_exit_stack) so that unlabeled continue skips the switch and targets the enclosing loop. The JS WASM implementation should be cross-checked to confirm whether this inconsistency pre-exists or is newly introduced here.
There was a problem hiding this comment.
Fixed in e5bc19e — added is_loop: bool to LoopCtx. Loop pushes set is_loop: true, switch push sets is_loop: false. Continue handler now walks back through loop_stack with .iter().rev().find(|ctx| ctx.is_loop) to skip switch entries.
| const complexityRules = COMPLEXITY_RULES.get(langId); | ||
| if (!complexityRules) continue; | ||
| // complexityRules only needed for WASM fallback path |
There was a problem hiding this comment.
Safety guard for complexityRules removed without replacement
The previous guard if (!complexityRules) continue; ensured that if a language has CFG rules but no matching entry in COMPLEXITY_RULES, the WASM fallback path would not call findFunctionNode(..., undefined). After this change, complexityRules can be undefined when COMPLEXITY_RULES.get(langId) returns nothing, but it is still passed directly to findFunctionNode on line 1165 whenever !tree is false and the native path wasn't taken.
All current CFG languages do have complexity rules, so this is not immediately broken. But removing the guard silently removes a safety net: if a language is ever added to CFG rules but not to COMPLEXITY_RULES, the WASM fallback will silently malfunction rather than safely skipping.
Consider preserving the guard, or at a minimum adding a comment that explicitly acknowledges complexityRules may be undefined and that findFunctionNode must tolerate it.
There was a problem hiding this comment.
Fixed in e5bc19e — re-added the guard: if (!tree || !complexityRules) continue. This prevents undefined from being passed to findFunctionNode when COMPLEXITY_RULES.get(langId) returns nothing.
| // Check if all defs already have native CFG — skip WASM parse if so | ||
| const allNative = symbols.definitions | ||
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | ||
| .every((d) => d.cfg?.blocks?.length); | ||
|
|
||
| // WASM fallback if no cached tree and not all native | ||
| if (!tree && !allNative) { |
There was a problem hiding this comment.
needsFallback false-positive for files with bodyless methods
Array.prototype.every returns true for an empty array (vacuous truth), so files with no functions are handled correctly. However, methods that intentionally have cfg: null — interface methods in Go, Java, and C#, or abstract methods — also fail the .every((d) => d.cfg?.blocks?.length) check because null?.blocks?.length is undefined (falsy).
As a result, any file containing even one bodyless method (e.g., a Go file with an interface type) will set needsFallback = true and trigger WASM parser initialization for the whole pipeline, even though the concrete methods in that file already have complete native CFG data and the bodyless methods are correctly handled by the if (!tree) continue guard later.
This doesn't cause correctness issues (the per-definition WASM guard is still in place), but it defeats the performance goal for any file that mixes concrete implementations with interface/abstract declarations.
There was a problem hiding this comment.
Fixed in e5bc19e — changed .every((d) => d.cfg?.blocks?.length) to .every((d) => d.cfg === null || d.cfg?.blocks?.length). Bodyless methods with cfg: null are now considered "handled by native" and no longer trigger WASM fallback.
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducting comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality
|
| let case_body_node = case_clause.child_by_field_name("body") | ||
| .or_else(|| case_clause.child_by_field_name("consequence")); | ||
|
|
||
| let case_stmts: Vec<Node> = if let Some(body_node) = case_body_node { | ||
| self.get_statements(&body_node) | ||
| } else { | ||
| let value_node = case_clause.child_by_field_name("value"); | ||
| let pattern_node = case_clause.child_by_field_name("pattern"); | ||
| let cursor2 = &mut case_clause.walk(); | ||
| case_clause.named_children(cursor2) | ||
| .filter(|child| { | ||
| if let Some(ref v) = value_node { if child.id() == v.id() { return false; } } | ||
| if let Some(ref p) = pattern_node { if child.id() == p.id() { return false; } } | ||
| child.kind() != "switch_label" | ||
| }) | ||
| .collect() | ||
| }; |
There was a problem hiding this comment.
Rust match_arm bodies are discarded from CFG
In tree-sitter-rust, a match_arm node has pattern and value fields — but no body field. The current code checks for body and consequence (lines 867-868), finds neither, then adds value_node to an exclude filter (line 878) and collects the remaining named children into case_stmts.
This means the actual arm expression/body is filtered out, so case_stmts is empty for every match arm. Any CFG-relevant constructs inside a Rust match arm (nested if, return, loop, etc.) are completely skipped.
Example that produces wrong CFG:
match state {
State::Ready => {
if self.queue.is_empty() { return None; }
self.process()
}
_ => None,
}The nested if and return would not appear in the CFG.
Fix: Fall back to using the value field as the case body for Rust match arms:
let case_body_node = case_clause.child_by_field_name("body")
.or_else(|| case_clause.child_by_field_name("consequence"))
.or_else(|| case_clause.child_by_field_name("value")); // Rust match_armRemove value_node from the exclude filter, or update the filter condition since value_node now matches case_body_node.
There was a problem hiding this comment.
Fixed in e5bc19e — added else-if branch for child_by_field_name("value") before the generic named-children fallback. Rust match arm expressions are now included as case_stmts instead of being filtered out.
| fn process_infinite_loop(&mut self, loop_stmt: &Node, current: u32) -> Option<u32> { | ||
| let header = self.make_block("loop_header", Some(node_line(loop_stmt)), Some(node_line(loop_stmt)), Some("loop")); | ||
| self.add_edge(current, header, "fallthrough"); | ||
|
|
||
| let exit = self.make_block("body", None, None, None); | ||
|
|
||
| self.loop_stack.push(LoopCtx { header_idx: header, exit_idx: exit }); | ||
| self.update_label_map(header, exit); | ||
|
|
||
| let body = loop_stmt.child_by_field_name("body"); | ||
| let body_block = self.make_block("loop_body", None, None, None); | ||
| self.add_edge(header, body_block, "fallthrough"); | ||
|
|
||
| if let Some(body) = body { | ||
| let stmts = self.get_statements(&body); | ||
| let body_end = self.process_statements(&stmts, body_block); | ||
| if let Some(be) = body_end { | ||
| self.add_edge(be, header, "loop_back"); | ||
| } | ||
| } | ||
|
|
||
| // No loop_exit from header — only exit via break | ||
| self.loop_stack.pop(); | ||
| Some(exit) | ||
| } |
There was a problem hiding this comment.
Infinite loop without break makes subsequent code appear reachable
process_infinite_loop always returns Some(exit) (line 826), causing the caller to emit a fallthrough edge from exit to the next block. However, the exit block only receives edges from break statements inside the loop body. If no break exists, exit has no predecessors — it is an unreachable block — yet code after the loop is still connected to it, making dead code appear live in the CFG.
Example:
loop { do_something(); }
// This line is dead, but the CFG shows it as reachable
unreachable_code();A more accurate result would return None when no break targeted exit. One way to detect this is to check after processing the body whether any edge targeting exit was added:
let had_break = self.edges.iter().any(|e| e.target_index == exit);
if had_break { Some(exit) } else { None }This matches the semantics of return None from process_return/process_throw (which also unconditionally terminate control flow).
There was a problem hiding this comment.
Fixed in e5bc19e — after processing the loop body, we now check if the exit block has any incoming edges. If no break targeted exit, return None instead of Some(exit) to indicate subsequent code is unreachable.
| if matches_opt(kind, self.rules.continue_node) { | ||
| let label_name = stmt.child_by_field_name("label") | ||
| .map(|n| n.utf8_text(self.source).unwrap_or("").to_string()); | ||
|
|
||
| let target = if let Some(ref name) = label_name { | ||
| self.label_map.iter().rev() | ||
| .find(|(n, _)| n == name) | ||
| .and_then(|(_, ctx)| ctx.header_idx) | ||
| } else { | ||
| self.loop_stack.last().map(|ctx| ctx.header_idx) | ||
| }; | ||
|
|
||
| if let Some(target) = target { | ||
| self.set_end_line(current, node_line(stmt)); | ||
| self.add_edge(current, target, "continue"); | ||
| return None; | ||
| } | ||
| return Some(current); | ||
| } |
There was a problem hiding this comment.
continue inside a switch resolves to switch header, not enclosing loop
The switch is pushed onto loop_stack (line 838 in process_switch) so that break statements correctly target the switch exit. However, unlabeled continue statements also resolve to loop_stack.last() (line 551), which may now be a switch context instead of an enclosing loop.
Example in Java:
for (int i = 0; i < 10; i++) {
switch (x) {
case 1: continue; // should continue the *for*, not re-enter the switch
}
}With loop_stack = [for_loop_ctx, switch_ctx], continue would emit an edge to switch_header instead of for_header.
Standard fix: Track switches separately from loops (e.g., a switch_exit_stack) so that unlabeled continue skips the switch and targets the enclosing loop. Labeled continue statements already work correctly by searching label_map first (lines 546-549).
There was a problem hiding this comment.
Fixed in e5bc19e — same fix as the other continue/switch comment. LoopCtx now has is_loop flag, continue skips non-loop entries.
| const allNative = symbols.definitions | ||
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | ||
| .every((d) => d.cfg?.blocks?.length); | ||
|
|
||
| // WASM fallback if no cached tree and not all native | ||
| if (!tree && !allNative) { |
There was a problem hiding this comment.
allNative check has false-positive for files with bodyless methods
The check on line 1114 uses .every((d) => d.cfg?.blocks?.length). When a definition has cfg: null (interface methods in Go, Java, C#, or abstract methods), the optional chaining produces undefined (falsy), causing every() to return false.
As a result, any file containing even one bodyless method will set allNative = false and trigger WASM parser initialization for the entire pipeline on line 1117, even though the concrete methods in that file already have complete native CFG data and the bodyless methods are correctly skipped later (line 1160-1161).
This doesn't cause correctness issues (the per-definition native-CFG check on line 1160 is still in place), but it defeats the performance goal for any file mixing concrete implementations with interface/abstract declarations — exactly what the PR is optimizing for.
Fix: Adjust the check to exclude definitions with cfg: null:
const allNative = symbols.definitions
.filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line && d.cfg !== null)
.every((d) => d.cfg?.blocks?.length);There was a problem hiding this comment.
Fixed in e5bc19e — same fix as the other allNative comment. cfg === null is now treated as native-handled.
- Set cfg: None for TS interface method_signature/property_signature (no body) - Tag loop_stack entries with is_loop flag; continue now skips switch entries - Use match_arm value field as case body for Rust instead of filtering it out - Return None from process_infinite_loop when no break targets the exit block - Treat cfg: null as native-handled in allNative check to avoid false WASM fallback - Re-add complexityRules guard before passing to findFunctionNode
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality
|
|
Addressed all 7 new review comments in e5bc19e:
|
| if_consequent_field: None, | ||
| for_nodes: &["for_statement", "enhanced_for_statement"], | ||
| while_node: Some("while_statement"), | ||
| while_nodes: &[], | ||
| do_node: Some("do_statement"), | ||
| infinite_loop_node: None, | ||
| unless_node: None, | ||
| until_node: None, | ||
| switch_node: Some("switch_expression"), | ||
| switch_nodes: &[], | ||
| case_node: Some("switch_block_statement_group"), | ||
| case_nodes: &["switch_rule"], | ||
| default_node: None, | ||
| try_node: Some("try_statement"), | ||
| catch_node: Some("catch_clause"), | ||
| finally_node: Some("finally_clause"), | ||
| return_node: Some("return_statement"), | ||
| throw_node: Some("throw_statement"), | ||
| break_node: Some("break_statement"), | ||
| continue_node: Some("continue_statement"), | ||
| block_node: Some("block"), | ||
| block_nodes: &[], | ||
| labeled_node: Some("labeled_statement"), | ||
| }; | ||
|
|
||
| pub static CSHARP_CFG: CfgRules = CfgRules { | ||
| if_node: Some("if_statement"), | ||
| if_nodes: &[], | ||
| elif_node: None, | ||
| else_clause: None, |
There was a problem hiding this comment.
Java switch_statement absent from CFG rules
JAVA_CFG maps only switch_expression as a switch node:
switch_node: Some("switch_expression"),
switch_nodes: &[],In tree-sitter-java, the traditional switch statement compiles to switch_statement, and the newer (Java 14+) switch-as-value form compiles to switch_expression. The vast majority of Java code—all code targeting pre-Java 14—uses switch_statement. As written, any Java function with a traditional switch statement will have that switch silently skipped by process_switch, producing a CFG where the entire switch appears as a linear "fall-through" block with no case branching. The switch block's internal break statements will also be unresolved, causing silent incorrect flow.
Fix: add switch_statement to either switch_node or switch_nodes:
| if_consequent_field: None, | |
| for_nodes: &["for_statement", "enhanced_for_statement"], | |
| while_node: Some("while_statement"), | |
| while_nodes: &[], | |
| do_node: Some("do_statement"), | |
| infinite_loop_node: None, | |
| unless_node: None, | |
| until_node: None, | |
| switch_node: Some("switch_expression"), | |
| switch_nodes: &[], | |
| case_node: Some("switch_block_statement_group"), | |
| case_nodes: &["switch_rule"], | |
| default_node: None, | |
| try_node: Some("try_statement"), | |
| catch_node: Some("catch_clause"), | |
| finally_node: Some("finally_clause"), | |
| return_node: Some("return_statement"), | |
| throw_node: Some("throw_statement"), | |
| break_node: Some("break_statement"), | |
| continue_node: Some("continue_statement"), | |
| block_node: Some("block"), | |
| block_nodes: &[], | |
| labeled_node: Some("labeled_statement"), | |
| }; | |
| pub static CSHARP_CFG: CfgRules = CfgRules { | |
| if_node: Some("if_statement"), | |
| if_nodes: &[], | |
| elif_node: None, | |
| else_clause: None, | |
| switch_node: Some("switch_statement"), | |
| switch_nodes: &["switch_expression"], |
Both node types share the same switch_block body structure in the Java grammar, so the existing process_switch logic handles them identically once they are recognized.
There was a problem hiding this comment.
Fixed in 6312ce7 — changed JAVA_CFG to switch_node: Some("switch_statement"), switch_nodes: &["switch_expression"]. Both traditional switch statements and Java 14+ switch expressions are now recognized.
| // Create DB | ||
| const dbPath = path.join(tmpDir, '.codegraph', 'graph.db'); | ||
| db = new Database(dbPath); | ||
| db.pragma('journal_mode = WAL'); | ||
| initSchema(db); | ||
|
|
||
| // Insert definition nodes for parent resolution | ||
| const insertNode = db.prepare( | ||
| 'INSERT INTO nodes (name, kind, file, line, end_line) VALUES (?, ?, ?, ?, ?)', | ||
| ); | ||
| for (const [relPath, symbols] of allSymbols) { | ||
| for (const def of symbols.definitions || []) { | ||
| insertNode.run(def.name, def.kind, relPath, def.line, def.endLine); | ||
| } |
There was a problem hiding this comment.
C# test asserts new kind for a node that is emitted as throw
The fixture only contains a single new expression:
throw new ArgumentNullException("result");walk_ast_nodes_with_config processes throw_statement (or throw_expression) first because "throw_statement" and "throw_expression" are both in CSHARP_AST_CONFIG.throw_types. It emits the node as kind: "throw" with name: "ArgumentNullException" (via extract_throw_target) and returns early, so the inner object_creation_expression is never visited independently. No kind: "new" row is ever inserted into the DB.
This means both assertions in the test will fail when the native binary is available and the describe.skipIf guard is lifted in CI:
expect(news.length).toBeGreaterThanOrEqual(1); // fails — no 'new' rows
expect(news.some((n) => n.name.includes('ArgumentNullException'))).toBe(true); // failsEither add a standalone new expression to the fixture (e.g. var ex = new ArgumentNullException("x");), or fix the assertion to check for the throw kind instead:
const throwNodes = db
.prepare("SELECT * FROM ast_nodes WHERE kind = 'throw' AND file LIKE '%fixture.cs'")
.all();
expect(throwNodes.some((n) => n.name.includes('ArgumentNullException'))).toBe(true);There was a problem hiding this comment.
Fixed in 6312ce7 — added a standalone new expression (var ex = new ArgumentNullException("x")) to the C# fixture so the kind:new assertion has a dedicated node independent of the throw path.
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducting comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
Minor Issues Noted but Not Blocking:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsAll previous critical issues have been addressed: The major bugs identified in earlier review rounds (language ID mismatch, utf8_text panics, continue-in-switch semantics, infinite loop reachability) have been properly fixed in recent commits. Final Recommendation
Recommended Follow-up (non-blocking):
|
|
Addressed both new comments in 6312ce7:
|
crates/codegraph-core/src/cfg.rs
Outdated
| catch_handler = Some(child); | ||
| } | ||
| if matches_opt(child.kind(), self.rules.finally_node) { | ||
| finally_handler = Some(child); | ||
| } | ||
| } | ||
|
|
||
| if let Some(catch_node) = catch_handler { | ||
| let catch_block = self.make_block("catch", Some(node_line(&catch_node)), None, Some("catch")); | ||
| self.add_edge(try_block, catch_block, "exception"); | ||
|
|
||
| let catch_body_node = catch_node.child_by_field_name("body"); |
There was a problem hiding this comment.
process_try_catch silently drops all catch clauses except the last
catch_handler is an Option<Node> that gets unconditionally overwritten on each iteration. For a try { } catch(A) { } catch(B) { } block:
- Iteration 1:
catch_handler = Some(catch_A) - Iteration 2:
catch_handler = Some(catch_B)← overwrites, A is lost
The resulting CFG completely omits every catch clause except the last one. This affects all languages that allow multiple catches: Java (catch_clause), C# (catch_clause), Python (except_clause), and PHP (catch_clause). None of the parity-test fixtures exercise try/catch, so this won't be caught by the existing tests.
Fix: collect all catch clauses into a Vec and wire each one independently to the try block via an "exception" edge:
let mut catch_handlers: Vec<Node> = Vec::new();
let mut finally_handler: Option<Node> = None;
let cursor = &mut try_stmt.walk();
for child in try_stmt.named_children(cursor) {
if matches_opt(child.kind(), self.rules.catch_node) {
catch_handlers.push(child);
}
if matches_opt(child.kind(), self.rules.finally_node) {
finally_handler = Some(child);
}
}
// Then process each catch_handler in catch_handlersThere was a problem hiding this comment.
Fixed in 73d4160 — changed catch_handler from Option to Vec. All catch clauses are now collected and each is processed independently with its own exception edge from the try block.
There was a problem hiding this comment.
Fixed in commit 73d4160 — changed catch_handler: Option<Node> to catch_handlers: Vec<Node> and wired each catch clause independently via "exception" edges.
There was a problem hiding this comment.
Already implemented this way — process_try_catch collects all catch clauses into a Vec<Node> (line 937: let mut catch_handlers: Vec<Node> = Vec::new()) and iterates over each one independently, wiring separate "exception" edges from the try block to each catch block. The Option<Node> approach described in the comment was never in the merged code.
There was a problem hiding this comment.
Fixed in e020420 — changed catch_handler from Option to a Vec and iterate all catch clauses. Each catch clause now gets its own CFG block with an edge from the try body, and all catch exits merge into the join block.
| } | ||
| } | ||
|
|
||
| fn process_for_loop(&mut self, for_stmt: &Node, current: u32) -> Option<u32> { | ||
| let header = self.make_block("loop_header", Some(node_line(for_stmt)), Some(node_line(for_stmt)), Some("for")); | ||
| self.add_edge(current, header, "fallthrough"); | ||
|
|
||
| let exit = self.make_block("body", None, None, None); |
There was a problem hiding this comment.
update_label_map fills ALL pending label placeholders, not just the innermost one
When a loop is created, update_label_map fills every label in label_map whose header_idx is still None. This is correct when nested labels each wrap exactly one loop, but breaks down when an outer label wraps a non-loop block that contains an inner loop:
outer: { // label_map: [("outer", {None, None})]
inner: for (...) { // label_map: [("outer", {None,None}), ("inner", {None,None})]
break outer; // should jump to the block after `outer:`, not the for-loop exit
}
}When process_for_loop is entered for the inner loop, update_label_map(inner_header, inner_exit) fills both "outer" and "inner" with the inner loop's exit. break outer then incorrectly targets the inner loop exit.
The fix is to only fill the most recently pushed unresolved label (i.e., the one that syntactically wraps the current loop):
fn update_label_map(&mut self, header_idx: u32, exit_idx: u32) {
if let Some((_, ctx)) = self.label_map.iter_mut().rev()
.find(|(_, ctx)| ctx.header_idx.is_none())
{
ctx.header_idx = Some(header_idx);
ctx.exit_idx = Some(exit_idx);
}
}This is valid in Java, JavaScript/TypeScript, and Go where labels can prefix any statement (not only loops).
There was a problem hiding this comment.
Fixed in 73d4160 — changed update_label_map to use .iter_mut().rev().find(|ctx| ctx.header_idx.is_none()) so only the most recently pushed unresolved label is filled, preventing inner loops from incorrectly resolving outer labels.
There was a problem hiding this comment.
Fixed in commit 73d4160 — changed update_label_map to only fill the most recently pushed unresolved label using .iter_mut().rev().find(|(_, ctx)| ctx.header_idx.is_none()).
There was a problem hiding this comment.
Already implemented exactly this way — update_label_map uses .iter_mut().rev().find(|(_, ctx)| ctx.header_idx.is_none()) (line 721) to fill only the most recently pushed unresolved label, not all pending labels. The suggested fix matches the current code verbatim.
There was a problem hiding this comment.
Fixed in e020420 — update_label_map now only fills the most recently added label entry (the innermost one) rather than all pending entries. Outer labels retain their None header_idx until their own loop is created.
| const hasNativeCfg = symbols.definitions | ||
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | ||
| .every((d) => d.cfg?.blocks?.length); | ||
| if (!hasNativeCfg) { | ||
| needsFallback = true; | ||
| break; | ||
| } |
There was a problem hiding this comment.
hasNativeCfg check still missing d.cfg === null guard at the needsFallback gate
The per-definition allNative check (added ~20 lines lower) was correctly fixed with d.cfg === null || d.cfg?.blocks?.length, but this earlier hasNativeCfg check that guards the global needsFallback flag still uses only d.cfg?.blocks?.length.
For a Go file with an interface type, the interface method signatures are emitted with cfg: null. null?.blocks?.length evaluates to undefined (falsy), so .every() returns false, hasNativeCfg is false, needsFallback = true, and the WASM parser is initialized for every file in the run — exactly the performance regression the PR aims to eliminate.
| const hasNativeCfg = symbols.definitions | |
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | |
| .every((d) => d.cfg?.blocks?.length); | |
| if (!hasNativeCfg) { | |
| needsFallback = true; | |
| break; | |
| } | |
| const hasNativeCfg = symbols.definitions | |
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | |
| .every((d) => d.cfg === null || d.cfg?.blocks?.length); |
There was a problem hiding this comment.
Fixed in a59c9bc. Added \ guard to the \ check, matching the existing \ pattern below. Interface methods with \ no longer trigger unnecessary WASM parser initialization.
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality
|
|
Addressed both new comments in 73d4160:
|
| self.loop_stack.push(LoopCtx { header_idx: header, exit_idx: exit, is_loop: true }); | ||
| self.update_label_map(header, exit); | ||
|
|
||
| let body = for_stmt.child_by_field_name("body"); | ||
| let body_block = self.make_block("loop_body", None, None, None); | ||
| self.add_edge(header, body_block, "branch_true"); | ||
|
|
||
| if let Some(body) = body { | ||
| let stmts = self.get_statements(&body); | ||
| let body_end = self.process_statements(&stmts, body_block); | ||
| if let Some(be) = body_end { | ||
| self.add_edge(be, header, "loop_back"); | ||
| } | ||
| } | ||
|
|
||
| self.add_edge(header, exit, "loop_exit"); | ||
| self.loop_stack.pop(); | ||
| Some(exit) | ||
| } | ||
|
|
||
| fn process_while_loop(&mut self, while_stmt: &Node, current: u32) -> Option<u32> { | ||
| let header = self.make_block("loop_header", Some(node_line(while_stmt)), Some(node_line(while_stmt)), Some("while")); | ||
| self.add_edge(current, header, "fallthrough"); | ||
|
|
There was a problem hiding this comment.
Go for {} produces spurious loop_exit edge
process_for_loop unconditionally emits header → exit with "loop_exit", which is correct for conditional loops (for i := 0; i < n; i++) but wrong for Go's infinite loops (for {}). In tree-sitter-go there is no infinite_loop_node — all Go loops, including for {}, are for_statement nodes, so they all arrive here.
For a Go for {} body with no break, the exit block will have exactly one predecessor (the loop_exit edge) but no actual runtime path reaches it. Any code written after for {} will appear reachable in the CFG even though it is dead code — the same semantic bug that was already fixed for Rust's loop {} via process_infinite_loop.
The condition field is absent on for {} nodes. A targeted fix is to check whether the for_statement has a condition field; if not, treat it like an infinite loop and only return Some(exit) when a break targeted it:
fn process_for_loop(&mut self, for_stmt: &Node, current: u32) -> Option<u32> {
let is_infinite = for_stmt.child_by_field_name("condition").is_none()
&& for_stmt.child_by_field_name("left").is_none(); // Go range uses "left"
let header = self.make_block("loop_header", ...);
self.add_edge(current, header, "fallthrough");
let exit = self.make_block("body", None, None, None);
self.loop_stack.push(LoopCtx { header_idx: header, exit_idx: exit, is_loop: true });
self.update_label_map(header, exit);
// ... process body ...
if is_infinite {
self.loop_stack.pop();
let has_break = self.edges.iter().any(|e| e.target_index == exit);
if has_break { Some(exit) } else { None }
} else {
self.add_edge(header, exit, "loop_exit");
self.loop_stack.pop();
Some(exit)
}
}There was a problem hiding this comment.
Fixed in a59c9bc. Added condition_field to CfgRules. process_for_loop now checks whether the for node has the condition field — if absent (Go for {}), it skips the loop_exit edge and only returns Some(exit) if a break targeted the exit block, matching the existing process_infinite_loop pattern.
| } | ||
| if matches_opt(child.kind(), self.rules.finally_node) { | ||
| finally_handler = Some(child); | ||
| } | ||
| } | ||
|
|
||
| if !catch_handlers.is_empty() { | ||
| let mut catch_ends: Vec<Option<u32>> = Vec::new(); | ||
|
|
||
| for catch_node in &catch_handlers { | ||
| let catch_block = self.make_block("catch", Some(node_line(catch_node)), None, Some("catch")); |
There was a problem hiding this comment.
Python try…except…else block silently dropped from CFG
The loop here only collects children whose kind matches rules.catch_node (Python: "except_clause") or rules.finally_node (Python: "finally_clause"). In tree-sitter-python a try_statement can also contain an else_clause child — the else block that runs only when the try body completed without raising an exception.
Because else_clause matches neither catch_node nor finally_node, it is silently ignored. The resulting CFG omits the else block entirely: execution flows from try_block → join_block directly, bypassing what should be a distinct "success" path. For Python code like:
try:
x = risky()
except ValueError:
handle()
else:
success() # ← this block is invisible in the CFGThe fix is to also look for rules.else_clause in this loop (Python reuses "else_clause" for both if…else and try…else), wire it between try_end and join_block, and exclude it from the try_body collection. Since else_clause is already a field on CfgRules, no struct change is required — only process_try_catch needs updating.
There was a problem hiding this comment.
Fixed in a59c9bc. Added else_node to CfgRules (Python sets it to "else_clause"). process_try_catch now: (1) filters else_clause out of the try body children, (2) processes it as a distinct success path wired between try_end and the downstream finally/join block, so it only executes when the try body completes without exception.
| catch_node: Some("catch_clause"), | ||
| finally_node: Some("finally_clause"), | ||
| return_node: Some("return_statement"), | ||
| throw_node: Some("throw_statement"), |
There was a problem hiding this comment.
C# switch_expression silently skipped in CFG
CSHARP_CFG only lists switch_node: Some("switch_statement") with switch_nodes: &[]. In tree-sitter-csharp, C# 8+ switch expressions compile to a switch_expression node (distinct from switch_statement), and each arm is a switch_expression_arm. Because neither appears in switch_nodes or case_nodes, any C# function containing a switch expression will have the entire construct treated as a regular sequential statement — no branching, no case edges, and any return/throw inside the arms will not appear in the CFG.
This is the same structural gap that was already identified and fixed for Java (where both switch_statement and switch_expression are now listed). C# should receive the same treatment:
switch_node: Some("switch_statement"),
switch_nodes: &["switch_expression"],
case_node: Some("switch_section"),
case_nodes: &["switch_expression_arm"],The tree-sitter-csharp switch_expression_arm node has a value field (the arm's expression), which is already handled by the existing child_by_field_name("value") fallback in process_switch (the same path used for Rust match arms). No additional logic changes should be needed — only the grammar rule additions above.
There was a problem hiding this comment.
Fixed in a59c9bc. Added "switch_expression" to CSHARP_CFG.switch_nodes and "switch_expression_arm" to case_nodes, so C# 8+ switch expressions are now recognized and processed by the CFG builder.
| continue_node: Some("next"), | ||
| block_node: None, | ||
| block_nodes: &["then", "do", "body_statement"], |
There was a problem hiding this comment.
Ruby method-level rescue not captured — most common Ruby rescue pattern produces incorrect CFG
RUBY_CFG sets try_node: Some("begin"), so only explicit begin ... rescue ... end blocks are recognized as try/catch. However, the idiomatic Ruby pattern for error handling is a method-level rescue that appears directly inside the method body — without an explicit begin block:
def process(data)
risky_operation(data)
rescue ArgumentError => e # ← direct child of body_statement, never inside begin
handle_error(e)
endIn tree-sitter-ruby, a method node's body is a body_statement. Since "body_statement" is listed in RUBY_CFG.block_nodes, its children (including any rescue nodes) are returned as flat statements by get_statements. When process_statement encounters a rescue node, it doesn't match try_node ("begin"), so it falls through to the "regular statement" handler — the rescue block is silently folded into the preceding sequential block with no exception edge.
The impact is significant: any Ruby method using the canonical rescue pattern will have an incorrect CFG — the error-handling path is invisible, and the rescue body is included in the normal flow block.
A targeted fix is to also detect "rescue" in process_statement when it appears as a top-level child of the function body, and route it through process_try_catch logic, or add a dedicated handler that creates an exception edge from the preceding block to the rescue block.
There was a problem hiding this comment.
Fixed in a59c9bc. Added try_nodes field to CfgRules. RUBY_CFG sets try_nodes: &["body_statement"]. In process_statement, when a node matches try_nodes and actually contains a rescue child, it dispatches to process_try_catch — so method-level rescue without explicit begin is now captured.
Port buildFunctionCFG algorithm from JS (cfg.js) to Rust (cfg.rs) with per-language CfgRules for all 8 supported languages. Each extractor now calls build_function_cfg() on function/method AST nodes during extraction, storing the CFG directly on the Definition struct. JS pipeline updated to use native CFG data when available (def.cfg), falling back to WASM tree walk only when native data is absent. This eliminates the need for WASM re-parsing in the CFG phase for native engine builds. Rust changes: - New cfg.rs module with CfgRules struct and 8 language configs - CfgBlock, CfgEdge, CfgData types in types.rs - All extractors call build_function_cfg for function/method defs JS changes: - parser.js normalizeNativeSymbols maps def.cfg through - cfg.js buildCFGData checks def.cfg before WASM fallback - Skips WASM parser init when all defs have native CFG Tests: 1437 pass, new cfg-all-langs.test.js with JS-side mock tests and native parity tests (block/edge count + type matching). Impact: 38 functions changed, 56 affected
…interface methods - Change "c_sharp" to "csharp" in cfg.rs, complexity.rs, and csharp.rs to match the canonical ID from parser_registry::lang_id_str() - Set cfg: None for Go interface method_elem signatures since they have no bodies, matching the existing complexity: None pattern Impact: 6 functions changed, 6 affected
…infinite loops
Fix two issues in the Rust native CFG construction:
1. CfgBuilder called utf8_text(&[]) to extract label names for labeled
loops, break, and continue statements. Passing an empty slice panics
when the label node has a non-zero byte offset. Add a `source` field
to CfgBuilder, thread it through from build_function_cfg, and use
utf8_text(self.source) for all 3 call sites. Updated all 8 extractor
files to pass source bytes.
2. process_infinite_loop used "branch_true" for the edge from loop header
to body, but Rust `loop {}` is unconditional — there is no condition
to branch on. Changed to "fallthrough" to match the semantics.
Impact: 14 functions changed, 18 affected
- Set cfg: None for TS interface method_signature/property_signature (no body) - Tag loop_stack entries with is_loop flag; continue now skips switch entries - Use match_arm value field as case body for Rust instead of filtering it out - Return None from process_infinite_loop when no break targets the exit block - Treat cfg: null as native-handled in allNative check to avoid false WASM fallback - Re-add complexityRules guard before passing to findFunctionNode
Impact: 2 functions changed, 0 affected
73d4160 to
e020420
Compare
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
REMAINING CRITICAL ISSUES:
Minor Issues Noted but Not Blocking:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsPerformance Goal Defeat: The missing Final Recommendation
Recommended Follow-up (non-blocking):
|
| if !has_default { | ||
| self.add_edge(switch_header, join_block, "branch_false"); | ||
| } |
There was a problem hiding this comment.
Python match with case _: wildcard emits spurious "branch_false" exit edge
PYTHON_CFG.default_node is None (line 97), so the is_default flag on line 860 is always false for Python — even when a case_clause is the wildcard arm (case _:). After all clauses are processed, has_default remains false, and this unconditional fallthrough edge is always added:
if !has_default {
self.add_edge(switch_header, join_block, "branch_false");
}For an exhaustive Python match with a wildcard catch-all, this edge makes the CFG appear as if the match can fall through without matching any case, even though case _: is exhaustive.
Fix: Inspect the pattern child of the case clause to detect wildcard patterns in tree-sitter-python:
// In process_switch, when processing a case_clause:
let is_default_pattern = case_clause
.child_by_field_name("pattern")
.map(|p| p.kind() == "wildcard_pattern")
.unwrap_or(false);
let is_default = matches_opt(cc_kind, self.rules.default_node) || is_default_pattern;Alternatively, add a default_pattern_node field to CfgRules for languages (like Python) whose wildcard defaults are identified by the pattern child's kind.
There was a problem hiding this comment.
Fixed in a59c9bc. Added wildcard_pattern_node to CfgRules (Python sets it to "wildcard_pattern"). In process_switch, the is_default check now also detects case clauses whose first named child matches the wildcard pattern node, so case _: correctly sets has_default = true and prevents the spurious branch_false fallthrough edge.
src/cfg.js
Outdated
| const hasNativeCfg = symbols.definitions | ||
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | ||
| .every((d) => d.cfg?.blocks?.length); |
There was a problem hiding this comment.
hasNativeCfg pre-scan missing d.cfg === null guard, defeating performance optimization
This check at line 1057 determines whether to initialize the WASM parser for the entire run. However, it uses only .every((d) => d.cfg?.blocks?.length) without accounting for cfg: null (intentionally-null CFG for interface/abstract methods).
When a definition has cfg: null, the optional chaining produces undefined (falsy), causing .every() to return false. Any file containing even one interface method (valid in Go, Java, C#) will set needsFallback = true and initialize the WASM parser for the whole file and pipeline, even though the concrete methods already have complete native CFG data.
This defeats the performance goal of the PR for any file mixing concrete implementations with interface/abstract declarations — exactly the common case.
Compare with the corrected allNative check at line 1114, which already has the fix:
.every((d) => d.cfg === null || d.cfg?.blocks?.length)Fix: Apply the same guard to line 1059:
const hasNativeCfg = symbols.definitions
.filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line)
.every((d) => d.cfg === null || d.cfg?.blocks?.length);There was a problem hiding this comment.
Addressed — see reply to the original comment above.
crates/codegraph-core/src/cfg.rs
Outdated
| switch_node: Some("switch_statement"), | ||
| switch_nodes: &[], |
There was a problem hiding this comment.
C# switch_expression silently skipped — C# 8+ expressions treated as sequential statements
CSHARP_CFG lists only:
switch_node: Some("switch_statement"),
switch_nodes: &[],In tree-sitter-csharp, C# 8+ switch expressions compile to switch_expression nodes with switch_expression_arm children. Since neither switch_expression nor its arms appear in any of these lists, any C# function containing a modern switch expression is treated as a single sequential statement — no branching, no case edges, and any control flow inside the arms (return, throw, break) is completely omitted from the CFG.
This mirrors the gap that was fixed for Java, which now lists both switch_statement and switch_expression. C# should receive the same treatment.
Fix: Add switch_expression to the configuration:
switch_node: Some("switch_statement"),
switch_nodes: &["switch_expression"],
case_node: Some("switch_section"),
case_nodes: &["switch_expression_arm"],There was a problem hiding this comment.
Addressed — see reply to the original comment above.
| try_node: Some("begin"), | ||
| catch_node: Some("rescue"), |
There was a problem hiding this comment.
Ruby method-level rescue not recognized — idiomatic Ruby error-handling pattern silently omitted
RUBY_CFG.try_node is Some("begin"), so only explicit begin ... rescue ... end blocks are recognized as try/catch constructs. However, idiomatic Ruby exception handling uses method-level rescue directly inside the method body without an explicit begin block:
def process(data)
risky_operation(data)
rescue ArgumentError => e # ← direct child of body_statement, never inside begin
handle_error(e)
endIn tree-sitter-ruby, the method body is a body_statement, and rescue nodes are direct children of it (listed in block_nodes on line 261). When process_statement encounters a rescue node at the top level of the method body, it doesn't match try_node ("begin"), so it falls through to the generic statement handler — the rescue block is silently folded into the preceding sequential code instead of being recognized as an exception handler.
This affects the majority of Ruby code, which uses method-level rescue rather than explicit begin blocks.
Fix: Extend the try detection to recognize rescue at the top level of method bodies, or treat standalone rescue nodes specially in process_statement when they appear as children of block_nodes.
There was a problem hiding this comment.
Addressed — see reply to the original comment above.
crates/codegraph-core/src/cfg.rs
Outdated
| fn process_for_loop(&mut self, for_stmt: &Node, current: u32) -> Option<u32> { | ||
| let header = self.make_block("loop_header", Some(node_line(for_stmt)), Some(node_line(for_stmt)), Some("for")); | ||
| self.add_edge(current, header, "fallthrough"); | ||
|
|
||
| let exit = self.make_block("body", None, None, None); | ||
|
|
||
| self.loop_stack.push(LoopCtx { header_idx: header, exit_idx: exit, is_loop: true }); | ||
| self.update_label_map(header, exit); | ||
|
|
||
| let body = for_stmt.child_by_field_name("body"); | ||
| let body_block = self.make_block("loop_body", None, None, None); | ||
| self.add_edge(header, body_block, "branch_true"); | ||
|
|
||
| if let Some(body) = body { | ||
| let stmts = self.get_statements(&body); | ||
| let body_end = self.process_statements(&stmts, body_block); | ||
| if let Some(be) = body_end { | ||
| self.add_edge(be, header, "loop_back"); | ||
| } | ||
| } | ||
|
|
||
| self.add_edge(header, exit, "loop_exit"); |
There was a problem hiding this comment.
Go for {} infinite loop produces spurious "loop_exit" edge, making dead code appear reachable
GO_CFG has infinite_loop_node: None, so all Go loops — including for {} infinite loops — are processed by process_for_loop. This function unconditionally emits:
self.add_edge(header, exit, "loop_exit"); // line 750and returns Some(exit) (line 752), indicating that the exit block is reachable.
For a Go for {} loop with no break statement, the exit block has no incoming edges except from this spurious "loop_exit" edge. Any code after for {} is consequently marked as reachable in the CFG, even though it is dead code:
for {} // infinite loop, never terminates
unreachable_code() // marked as reachable in CFG, but is dead(Rust's loop {} handles this correctly via process_infinite_loop, which returns None when no break targeted the exit block.)
Fix: In process_for_loop, check whether the loop has a condition field. If absent (indicating an infinite loop like Go's for {}), use the same logic as process_infinite_loop: return None if no break targeted the exit block, making subsequent code correctly unreachable.
There was a problem hiding this comment.
Addressed — see reply to the original comment above.
| let stmts: Vec<Node> = try_stmt.named_children(cursor) | ||
| .filter(|child| { | ||
| let ck = child.kind(); | ||
| !matches_opt(ck, self.rules.catch_node) | ||
| && !matches_opt(ck, self.rules.finally_node) | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
Python try…except…else block silently dropped — else_clause incorrectly included in try body
The fallback path for extracting the try body (when there's no explicit body field) collects all children except those matching catch_node and finally_node:
let stmts: Vec<Node> = try_stmt.named_children(cursor)
.filter(|child| {
let ck = child.kind();
!matches_opt(ck, self.rules.catch_node) // except_clause
&& !matches_opt(ck, self.rules.finally_node)
})
.collect();In tree-sitter-python, a try_statement can contain an else_clause child — the block that runs only when the try body completed without raising an exception. Since else_clause matches neither filter, it passes through and is included in stmts, getting processed as part of the try body itself.
This is semantically wrong. The else block should execute only after the try succeeds (no exception), not as an integral part of the try body. The CFG omits the else block as a distinct "success path" and instead merges it into the try block.
Fix: Also filter out else_clause children:
.filter(|child| {
let ck = child.kind();
!matches_opt(ck, self.rules.catch_node)
&& !matches_opt(ck, self.rules.finally_node)
&& ck != "else_clause" // Add this
})Then process the else clause as a distinct path between try_end and join_block, after the try succeeds but before any finally block.
There was a problem hiding this comment.
Addressed — see reply to the original comment above.
| if_consequent_field: Some("body"), | ||
| for_nodes: &["for_statement", "foreach_statement"], | ||
| while_node: Some("while_statement"), | ||
| while_nodes: &[], | ||
| do_node: Some("do_statement"), | ||
| infinite_loop_node: None, | ||
| unless_node: None, | ||
| until_node: None, | ||
| switch_node: Some("switch_statement"), | ||
| switch_nodes: &[], | ||
| case_node: Some("case_statement"), | ||
| case_nodes: &[], | ||
| default_node: Some("default_statement"), | ||
| try_node: Some("try_statement"), | ||
| catch_node: Some("catch_clause"), | ||
| finally_node: Some("finally_clause"), | ||
| return_node: Some("return_statement"), | ||
| throw_node: Some("throw_expression"), | ||
| break_node: Some("break_statement"), | ||
| continue_node: Some("continue_statement"), | ||
| block_node: Some("compound_statement"), | ||
| block_nodes: &[], | ||
| labeled_node: None, | ||
| }; | ||
|
|
||
| /// Get CFG rules for a language ID. | ||
| pub fn get_cfg_rules(lang_id: &str) -> Option<&'static CfgRules> { | ||
| match lang_id { | ||
| "javascript" | "typescript" | "tsx" => Some(&JS_TS_CFG), | ||
| "python" => Some(&PYTHON_CFG), |
There was a problem hiding this comment.
PHP match expression absent from PHP_CFG
PHP 8.0 introduced match expressions, which are represented as match_expression nodes in tree-sitter-php. The current PHP_CFG only lists switch_node: Some("switch_statement") with switch_nodes: &[].
Because match_expression is not in switch_nodes, any PHP 8 function using match will fall through to the "regular statement" handler in process_statement. The entire match construct is treated as sequential code — no branching, no case edges, and any return/throw inside a match arm is completely invisible in the CFG.
Example that produces wrong CFG:
function getLabel(int $status): string {
return match($status) {
200 => 'OK',
404 => throw new NotFoundException(),
default => 'Unknown',
};
}The throw inside the match arm and the return of the whole expression would not appear in the CFG.
Fix: Add match_expression to switch_nodes and match_arm (or the equivalent tree-sitter-php arm node kind) to case_nodes:
switch_node: Some("switch_statement"),
switch_nodes: &["match_expression"],
case_node: Some("case_statement"),
case_nodes: &["match_conditional_expression", "match_default_expression"],(Verify node kinds against tree-sitter-php grammar.)
| if_consequent_field: None, | ||
| for_nodes: &["for_expression"], | ||
| while_node: Some("while_expression"), | ||
| while_nodes: &["while_let_expression"], | ||
| do_node: None, | ||
| infinite_loop_node: Some("loop_expression"), | ||
| unless_node: None, | ||
| until_node: None, | ||
| switch_node: Some("match_expression"), | ||
| switch_nodes: &[], | ||
| case_node: Some("match_arm"), | ||
| case_nodes: &[], | ||
| default_node: None, | ||
| try_node: None, | ||
| catch_node: None, | ||
| finally_node: None, | ||
| return_node: Some("return_expression"), | ||
| throw_node: None, | ||
| break_node: Some("break_expression"), | ||
| continue_node: Some("continue_expression"), | ||
| block_node: Some("block"), | ||
| block_nodes: &[], | ||
| labeled_node: None, | ||
| }; | ||
|
|
||
| pub static JAVA_CFG: CfgRules = CfgRules { | ||
| if_node: Some("if_statement"), | ||
| if_nodes: &[], | ||
| elif_node: None, | ||
| else_clause: None, |
There was a problem hiding this comment.
Rust labeled loops silently produce incorrect CFG
RUST_CFG.labeled_node: None is correct — Rust does not use a wrapping labeled_statement node. Instead, in tree-sitter-rust, loop labels ('outer:) are an optional label field directly on loop_expression, for_expression, and while_expression. This means:
process_statementnever pushes any label tolabel_mapfor Rust (thelabeled_nodebranch is never taken).- When
break 'outerorcontinue 'outeris encountered,stmt.child_by_field_name("label")extracts the label name ('outer), butlabel_mapis empty sotarget = None. - With
target = None, the code falls through toreturn Some(current)— no edge is added — so the labeled break/continue silently does nothing. Execution appears to continue normally rather than jumping to the loop exit.
Example that produces wrong CFG:
'outer: for i in 0..10 {
for j in 0..10 {
if condition { break 'outer; } // ← no edge added, CFG is wrong
}
}
unreachable_code(); // appears reachable in CFGFix: In process_for_loop, process_while_loop, and process_infinite_loop, check for an optional label field on the loop node and register it in label_map before processing the body:
fn process_for_loop(&mut self, for_stmt: &Node, current: u32) -> Option<u32> {
// Register Rust-style loop label if present
let rust_label = for_stmt.child_by_field_name("name") // or "label" — verify field name
.map(|n| n.utf8_text(self.source).unwrap_or("").to_string());
if let Some(ref lname) = rust_label {
self.label_map.push((lname.clone(), LabelCtx { header_idx: None, exit_idx: None }));
}
// ... existing loop logic ...
if let Some(ref lname) = rust_label {
self.label_map.retain(|(n, _)| n != lname);
}
}| const hasNativeCfg = symbols.definitions | ||
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | ||
| .every((d) => d.cfg?.blocks?.length); | ||
| if (!hasNativeCfg) { |
There was a problem hiding this comment.
hasNativeCfg check inconsistent with allNative check below
The allNative check on line 1114 correctly handles cfg: null for bodyless methods:
.every((d) => d.cfg === null || d.cfg?.blocks?.length)But this earlier hasNativeCfg check still uses only:
.every((d) => d.cfg?.blocks?.length)For any file containing an interface or abstract method (which gets cfg: null), null?.blocks?.length evaluates to undefined (falsy), causing every() to return false. This sets needsFallback = true and initializes the WASM parser for the entire run, even though all concrete methods in those files already have complete native CFG data. The per-definition fallback guard on line 1160 still produces correct results, so this is purely a performance regression.
| const hasNativeCfg = symbols.definitions | |
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | |
| .every((d) => d.cfg?.blocks?.length); | |
| if (!hasNativeCfg) { | |
| const hasNativeCfg = symbols.definitions | |
| .filter((d) => (d.kind === 'function' || d.kind === 'method') && d.line) | |
| .every((d) => d.cfg === null || d.cfg?.blocks?.length); |
- Add cfg === null guard to hasNativeCfg check in src/cfg.js, matching
the allNative pattern (prevents unnecessary WASM parser init for
interface methods)
- Add condition_field to CfgRules; process_for_loop now treats
conditionless for loops (Go `for {}`) as infinite loops without
spurious loop_exit edge
- Add else_node to CfgRules; process_try_catch now extracts Python
else_clause as a distinct success-only path instead of folding it
into the try body
- Add switch_expression to CSHARP_CFG.switch_nodes and
switch_expression_arm to case_nodes for C# 8+ pattern matching
- Add try_nodes to CfgRules; Ruby body_statement containing rescue
is now recognized as a try construct via RUBY_CFG.try_nodes
- Add wildcard_pattern_node to CfgRules; Python match/case with
case _: wildcard now correctly sets has_default, preventing
spurious branch_false fallthrough edge
Impact: 5 functions changed, 13 affected
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
Minor Issues Noted but Not Blocking:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsAll previous critical issues have been addressed: The major bugs identified in earlier review rounds (language ID mismatch, utf8_text panics, continue-in-switch semantics, infinite loop reachability) have been properly fixed in recent commits. Final Recommendation
Recommended Follow-up (non-blocking):
|
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducting comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
Minor Issues Noted but Not Blocking:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsAll previous critical issues have been addressed: The major bugs identified in earlier review rounds (language ID mismatch, utf8_text panics, continue-in-switch semantics, infinite loop reachability) have been properly fixed in recent commits. Final Recommendation
Recommended Follow-up (non-blocking):
|
* perf: compute dataflow analysis in Rust native engine Port per-file dataflow extraction from WASM (JS) to the Rust native engine, eliminating the tree-sitter WASM dependency for dataflow. Rust changes: - Add dataflow.rs with DataflowRules struct, 8 per-language static configs (JS/TS, Python, Go, Rust, Java, C#, PHP, Ruby), ParamStrategy enum for per-language param extraction, scope tracking with binding confidence, and extract_dataflow() recursive visitor - Add 6 NAPI structs to types.rs (DataflowParam, DataflowReturn, DataflowAssignment, DataflowArgFlow, DataflowMutation, DataflowResult) and dataflow field on FileSymbols - Call extract_dataflow() after extract_symbols() in parallel.rs - Add lang_id_str() to LanguageKind for rules lookup JS changes: - Extend normalizeNativeSymbols() to map native dataflow result - Add native bypass in buildDataflowEdges(): use symbols.dataflow when present, fall back to WASM extraction otherwise Impact: 30 functions changed, 32 affected * fix: address PR review feedback for dataflow-in-rust - Add method_call_expression to RUST_DATAFLOW call_nodes so Rust obj.method(arg) calls are no longer silently excluded from dataflow - Remove var_declaration from GO_DATAFLOW var_declarator_nodes since its child fields don't match the left/right schema used by short_var_declaration - Remove unreachable snake_case fallbacks in normalizeNativeSymbols since napi-rs #[napi(js_name)] guarantees camelCase property names Impact: 1 functions changed, 3 affected * fix(native): address dataflow review comments — conditional extraction, depth limit, method mutations, truncation parity - Add include_dataflow flag to parse_files_parallel/parse_file to skip dataflow extraction when not needed (threaded from builder.js opts) - Add MAX_VISIT_DEPTH (200) to prevent stack overflow on deeply nested ASTs - Add method_call_name_field to DataflowRules for languages where method calls use a different field than call_function_field (fixes dead mutating_methods for Rust's method_call_expression) - Fix truncate() to use chars().count() instead of byte length for parity with JS str.length on non-ASCII content Impact: 11 functions changed, 31 affected * fix(native): use correct receiver field for Rust method call mutations Impact: 1 functions changed, 4 affected * fix(native): add depth guard to collect_identifiers in dataflow Impact: 2 functions changed, 2 affected * perf: skip dataflow computation in incremental CFG-only rebuilds When only needsCfg is true in the pending analysis path, the native engine was still computing dataflow for every file via engineOpts defaulting dataflow to true. The results were immediately discarded since the needsDataflow guard was false. Override engineOpts.dataflow in the analysis call site so it only runs when actually needed. Impact: 1 functions changed, 0 affected * fix: use >= for depth guard in collect_identifiers to match visit() Impact: 1 functions changed, 4 affected
10b56a3 to
c0d1626
Compare
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
Minor Issues Noted but Not Blocking:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsAll previous critical issues have been addressed: The major bugs identified in earlier review rounds (language ID mismatch, utf8_text panics, continue-in-switch semantics, infinite loop reachability) have been properly fixed in recent commits. Final Recommendation
Recommended Follow-up (non-blocking):
|
Cross-engine parity tests comparing WASM extractDataflow() output against native Rust parseFile(include_dataflow=true) for the three languages that lacked coverage. Tests auto-skip when the native binary doesn't include dataflow support (requires local Rust build). Covers parameters, returns, assignments, argFlows, and mutations per language — 13 test cases total.
|
Addressed the incomplete test coverage note — added cross-engine dataflow parity tests for Go, Rust, and Ruby in 13 test cases covering parameters, returns, assignments, argFlows, and mutations per language. Tests auto-skip when the native binary doesn't include dataflow support (the published npm prebuilt v3.0.0 doesn't — requires a local Rust build with the new dataflow module). |
|
Claude finished @carlos-alm's task —— View job Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducted comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
Phase 3: Test Integrity ✅Test Coverage Assessment:
Phase 4: Code Quality ✅Rust Implementation Analysis:
Minor Issues Noted but Not Blocking:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck - double WASM parsing when native data available Critical ConcernsAll previous critical issues have been addressed: The major bugs identified in earlier review rounds (language ID mismatch, utf8_text panics, continue-in-switch semantics, infinite loop reachability) have been properly fixed in recent commits. Final Recommendation
Recommended Follow-up (non-blocking):
|
Reviewing PR: perf: compute CFG in Rust native engine for all languagesConducting comprehensive PR review following the 4-phase analysis process:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization in CFG (Control Flow Graph) building phase. Current JS implementation does WASM re-parsing (~169ms) even when native AST data is already available from the Rust engine. This creates unnecessary double-parsing overhead. Approach Justification: This is a legitimate performance optimization that eliminates redundant WASM tree-walking by computing CFG data during the initial native extraction phase. The approach maintains backward compatibility with WASM fallback when native data is unavailable. This addresses a real bottleneck, not masking symptoms. Phase 2: Configuration Change Detection ✅No High-Risk Configuration Changes Detected:
|
Summary
buildFunctionCFGalgorithm from JS (src/cfg.js) to Rust (crates/codegraph-core/src/cfg.rs) with per-languageCfgRulesfor all 8 supported languages (JS/TS/TSX, Python, Go, Rust, Java, C#, Ruby, PHP)build_function_cfg()on function/method AST nodes during extraction, storing CFG data directly on theDefinitionstructcfg.js,parser.js) updated to use native CFG data when available (def.cfg), falling back to WASM tree walk only when native data is absentChanges
Rust (~750 lines):
cfg.rs:CfgRulesstruct, 8 language configs,CfgBuilderstate machine porting the full algorithm (if/elif/else 3 patterns, for/while/do-while/infinite loops, switch/match, try/catch/finally, break/continue with labels, loop stack)types.rs:CfgBlock,CfgEdge,CfgDatastructs with napi bindingsbuild_function_cfg(node, lang_id)for function/method definitionsJS (~30 lines):
parser.js:normalizeNativeSymbolsmapsdef.cfgthrough to JScfg.js:buildCFGDatachecksdef.cfgbefore WASM fallback, skips WASM parser init when all defs have native CFGTest plan
cfg-all-langs.test.js— JS-side mock tests (2 pass locally)buildFunctionCFGPR 2 of 4 in the WASM double-parse elimination plan.