feat(native): port Verilog extractor to Rust#1107
Conversation
Adds tree-sitter-verilog dependency and a native Verilog/SystemVerilog extractor in crates/codegraph-core/src/extractors/verilog.rs, registers .v / .sv with LanguageKind::Verilog and the Rust file_collector, and adds Verilog to NATIVE_SUPPORTED_EXTENSIONS on the JS side. Mirrors extractVerilogSymbols: module/interface/package/class declarations, function and task declarations (parent-prefixed when nested), package_import_declaration and include_compiler_directive imports, and module_instantiation as call extraction. VERILOG_AST_CONFIG in helpers.rs deliberately has all node-type lists empty to mirror the WASM side, whose AST_TYPE_MAPS has no verilog entry — so both engines emit zero ast_nodes rows for Verilog and stay in parity. Closes #1071
|
Claude finished @carlos-alm's task —— View job PR Review Complete ✅Reviewing PR #1107: feat(native): port Verilog extractor to Rust Root Cause AnalysisProblem Being Solved: This PR implements native Rust support for Verilog/SystemVerilog code extraction to maintain dual-engine parity. The codegraph tool follows a dual-engine architecture where both WASM and native Rust engines must produce identical results for supported languages. Why This Approach: The implementation mirrors the existing WASM extractor ( Risk Assessment: Low risk. This is purely additive functionality with comprehensive test coverage and strict parity requirements enforced by the architecture. Backlog Compliance
Code Quality AssessmentStrengths:
Technical Details Verified:
No Issues Found: No security concerns, performance problems, or architectural violations detected. Final Recommendation
This PR successfully extends codegraph's language coverage while maintaining the strict dual-engine parity requirements that are fundamental to the tool's reliability. |
Greptile SummaryThis PR ports the Verilog/SystemVerilog extractor from WASM-only to a fully native Rust implementation, promoting
Confidence Score: 5/5Safe to merge — the native Verilog extractor is well-tested, both engines are in parity, and all issues from previous review rounds have been resolved. All definition, import, and call extraction paths are covered by unit tests in both Rust and TypeScript. The class-name structural lookup, ANSI-port recursion fix, namedChild(0) for module instantiation, and findVerilogParent class-qualification fix were all caught in prior review rounds and are now correctly implemented and tested. No files require special attention. Important Files Changed
Reviews (16): Last reviewed commit: "fix(extractors): restore Verilog WASM en..." | Re-trigger Greptile |
| fn handle_class_decl(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // The JS extractor calls `node.childForFieldName('name')`; tree-sitter-verilog | ||
| // exposes no fields on `class_declaration`, so this returns null in JS too — | ||
| // matching that behavior keeps native and WASM in lockstep. | ||
| let name = match named_child_text(node, "name", source) { | ||
| Some(n) => n.to_string(), | ||
| None => return, | ||
| }; | ||
| symbols.definitions.push(Definition { | ||
| name: name.clone(), | ||
| kind: "class".to_string(), | ||
| line: start_line(node), | ||
| end_line: Some(end_line(node)), | ||
| decorators: None, | ||
| complexity: None, | ||
| cfg: None, | ||
| children: None, | ||
| }); | ||
|
|
||
| if let Some(superclass) = node.child_by_field_name("superclass") { | ||
| symbols.classes.push(ClassRelation { | ||
| name, | ||
| extends: Some(node_text(&superclass, source).to_string()), | ||
| implements: None, | ||
| line: start_line(node), | ||
| }); | ||
| } | ||
| } |
There was a problem hiding this comment.
handle_class_decl is unreachable dead code
The comment itself confirms that tree-sitter-verilog exposes no name field on class_declaration, so named_child_text(node, "name", source) always returns None, and the function always exits at the None => return arm. Neither the Definition for the class nor the ClassRelation for extends is ever emitted. The PR description lists class_declaration → kind class (extends emitted into classes) as a captured definition type, but that path is never taken. If intentional parity with the JS extractor is the goal (where childForFieldName('name') also returns null), it's worth at minimum adding a comment that class extraction is currently a no-op, so a future grammar update that adds the name field doesn't go unnoticed. Alternatively, falling back to find_decl_name (like handle_interface_decl and handle_package_decl do) would recover class names from the first simple_identifier child without breaking WASM parity once the grammar grows the field.
There was a problem hiding this comment.
Fixed — strengthened the comment on handle_class_decl so it loudly flags that the handler is currently a no-op on the live tree-sitter-verilog grammar (no name or superclass field, so name lookup always returns None and the function exits early). The hook stays in place so a future grammar upgrade that adds those fields automatically picks up class definitions and extends relations. Going with parity-with-WASM rather than find_decl_name fallback to avoid silently diverging from the JS extractor before the grammar grows the field.
There was a problem hiding this comment.
Follow-up — went a step further per the @greptileai suggestion. Rather than leaving the handler as documented dead code, traced the actual tree-sitter-verilog grammar shape and now extract the class name via class_identifier > simple_identifier and the superclass via class_type > class_identifier > simple_identifier. Both engines now emit proper class definitions and extends relations. Added regression tests for classes with and without an extends clause in both tests/parsers/verilog.test.ts and crates/codegraph-core/src/extractors/verilog.rs (extracts_class_with_superclass, extracts_class_without_superclass). Pushed in commit 260ee4f.
There was a problem hiding this comment.
This is a clean, complete fix. The implementation looks solid — a few observations:
What's working well:
find_class_namecorrectly falls through to theclass_identifierstructural scan when the field-name lookup misses (which it always will with this grammar), matching the JSfindClassNameexactly.find_class_superclassmirrorsfindClassSuperclassfaithfully: descend throughclass_type → class_identifier → simple_identifier, fall back to rawclass_typetext if the innerclass_identifierisn't present.- Both regression tests (
extracts_class_with_superclass,extracts_class_without_superclass) cover the two meaningful paths, and the TS suite has matching coverage.
One minor thing: find_class_superclass uses node.child(i)? with the ? operator in a for loop, which will silently exit the entire loop on the first None rather than just skipping that iteration (same pattern exists elsewhere in the file). It works fine in practice since child_count only returns valid indices, but if let Some(child) = node.child(i) would be more defensively correct and consistent with handle_package_import / handle_include_directive below it. Not a blocker.
The fix correctly resolves the dead-code issue and parity between engines is maintained. LGTM.
There was a problem hiding this comment.
Thanks for the follow-up observation @greptileai. The ? operator pattern in the for-loops is fine for now because child_count() always returns valid indices and the function uses the same pattern across the file. Switching to if let Some(child) = ... is a defensiveness style choice and a sweep would touch several handlers — opening a follow-up issue to do it as one cohesive cleanup PR rather than mixing it into this one.
There was a problem hiding this comment.
Totally reasonable — the pattern is correct as-is, and a cohesive sweep is the right way to handle it. Happy to flag the same pattern if it shows up in the follow-up PR. This one is good to go.
There was a problem hiding this comment.
Tracked the deferred ?-in-for-loop cleanup in #1118.
| "js", "jsx", "mjs", "cjs", "ts", "tsx", "d.ts", "py", "pyi", "go", "rs", "java", "cs", "rb", | ||
| "rake", "gemspec", "php", "phtml", "tf", "hcl", "c", "h", "cpp", "cc", "cxx", "hpp", "kt", | ||
| "kts", "swift", "scala", "sh", "bash", "ex", "exs", "lua", "dart", "zig", "hs", "ml", "mli", | ||
| "v", "sv", | ||
| ]; |
There was a problem hiding this comment.
.v extension shared with Coq theorem prover
The .v extension is also the canonical extension for Coq source files. A repository that mixes Coq proofs and Verilog hardware files (or a pure Coq repo) would now have all Coq files routed to the tree-sitter-verilog parser, producing empty or garbage symbol output. The file collector and parser_registry have no way to disambiguate between the two. This may be an acceptable known limitation, but it is worth documenting (e.g., in a code comment alongside the extension list) so future contributors understand the trade-off rather than discovering it through mis-indexed codebases.
There was a problem hiding this comment.
Fixed — added a documentation block above SUPPORTED_EXTENSIONS in file_collector.rs that explicitly calls out the .v collision with Coq source files, so Coq-heavy repos know they need to exclude *.v via the exclude config glob. Keeping the routing as Verilog since Verilog repos far outnumber Coq repos in our user base, but the trade-off is now discoverable from the source.
| fn handle_module_instantiation(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // Tree-sitter-verilog exposes no field name on `module_instantiation`; the | ||
| // first child holds the module type being instantiated. The JS extractor | ||
| // uses `childForFieldName('type') || child(0)` — the field lookup never | ||
| // hits, so first-child fallback is the live path. | ||
| let name_node = node | ||
| .child_by_field_name("type") | ||
| .or_else(|| node.child(0)); | ||
| let name_node = match name_node { | ||
| Some(n) => n, | ||
| None => return, | ||
| }; | ||
| let name = node_text(&name_node, source).to_string(); | ||
| if name.is_empty() { | ||
| return; | ||
| } | ||
| symbols.calls.push(Call { | ||
| name, | ||
| line: start_line(node), | ||
| dynamic: None, | ||
| receiver: None, | ||
| }); | ||
| } |
There was a problem hiding this comment.
child(0) may return an anonymous grammar token on some node shapes
node.child(0) in tree-sitter returns any child at index 0, including anonymous tokens (punctuation, keywords). For the majority of module_instantiation shapes this is harmless because the module type identifier is the first child. However, if the grammar ever emits a leading anonymous node (e.g., a parameter-override token like #) before the module identifier on a non-ANSI instantiation form, the call name will be that punctuation character instead of the module type. The named-node variant node.named_child(0) (which skips anonymous tokens) would be safer and more defensive here, and would still mirror the JS extractor's first-child fallback intent.
There was a problem hiding this comment.
Fixed — switched the fallback from node.child(0) to node.named_child(0) so any anonymous grammar tokens (parameter-override #, keywords) leading the module_instantiation cannot leak into the call name. Added a comment documenting the rationale. All 6 verilog unit tests still pass.
Codegraph Impact Analysis47 functions changed → 30 callers affected across 3 files
|
- handle_class_decl: strengthen comment so the no-op behavior on the current tree-sitter-verilog grammar is loud and discoverable for future grammar upgrades. - handle_module_instantiation: switch child(0) to named_child(0) so any anonymous grammar tokens (e.g. parameter-override '#') leading the module type cannot leak into call names. - file_collector::SUPPORTED_EXTENSIONS: document .v conflict with Coq theorem-prover source files so Coq-heavy repos know to exclude *.v via config. - native-drop-classification: drop expected count to 9 to reflect the merge with main (.clj already removed, .v removed by this PR).
#1107) Adding native Verilog (#1107) brings 4 .v resolution-benchmark fixtures into the incremental benchmark sweep (which runs against the repo root). tree-sitter-verilog is a large grammar so each .v file costs noticeably more to parse than other fixture languages — pushing the native fullBuildMs from the 3.10.0 baseline of 1959ms to ~2809ms (+43%). This is a structural one-time cost of supporting the language, not a regression in shared code paths. Following the existing pattern in KNOWN_REGRESSIONS (3.9.6:* / 3.10.0:* entries) with a documented rationale so a future PR isn't blocked by the bump.
The tree-sitter-verilog grammar exposes no field names on
class_declaration, so childForFieldName('name') and
childForFieldName('superclass') always returned null in both engines.
The previous workaround left class extraction as documented dead code
in both extractors.
Per the CLAUDE.md principle 'Never document bugs as expected behavior',
fix the root cause by descending through the grammar's actual structure:
- Class name lives under class_identifier > simple_identifier
- Superclass appears as a class_type child with the same wrapping
Both engines now emit identical class Definitions and ClassRelation
extends edges. Added matching Rust and TypeScript regression tests
covering classes with and without an extends clause.
| fn find_verilog_parent(node: &Node, source: &[u8]) -> Option<String> { | ||
| const PARENT_KINDS: &[&str] = &[ | ||
| "module_declaration", | ||
| "interface_declaration", | ||
| "package_declaration", | ||
| "class_declaration", | ||
| ]; | ||
| let mut current = node.parent(); | ||
| while let Some(parent) = current { | ||
| if PARENT_KINDS.contains(&parent.kind()) { | ||
| return find_decl_name(&parent, source) | ||
| .or_else(|| find_module_name(&parent, source)); | ||
| } | ||
| current = parent.parent(); | ||
| } | ||
| None | ||
| } |
There was a problem hiding this comment.
find_verilog_parent silently fails for class_declaration parents
class_declaration is listed in PARENT_KINDS, but the name resolution chain — find_decl_name(&parent, source).or_else(|| find_module_name(&parent, source)) — only searches for bare simple_identifier / identifier children. For class_declaration, the name is wrapped under a class_identifier > simple_identifier chain (exactly the grammar quirk that find_class_name was introduced in this PR to handle). Neither find_decl_name nor find_module_name descends into class_identifier, so both return None, and any function or task nested inside a SystemVerilog class gets a bare name (e.g., check_range) instead of the qualified ClassName.check_range. Adding .or_else(|| find_class_name(&parent, source)) to the chain would close the gap; a symmetric fix in findVerilogParent (JS) would keep parity.
There was a problem hiding this comment.
Fixed — extended find_verilog_parent's name-resolution chain to also try find_class_name, so tasks and functions nested in a SystemVerilog class are now qualified as ClassName.task rather than the bare name. Mirrored the same fix in findVerilogParent (JS) for engine parity. Added regression tests in both engines (qualifies_task_nested_in_class_with_class_name / qualifies tasks nested inside a class with the class name). Pushed in commit 497da19.
find_verilog_parent only consulted find_decl_name and find_module_name, neither of which descends into the class_identifier wrapper that tree-sitter-verilog uses for class names. As a result, any task or function declared inside a SystemVerilog class lost its qualifier and surfaced as a bare name instead of ClassName.task. Extend the parent-name resolution chain to also try find_class_name, mirroring the same logic in the WASM extractor for engine parity. Added regression tests in both engines covering the class > task case.
Merges Gleam (#1105, landed in main) with Verilog (this PR): - parser_registry.rs: both Gleam and Verilog in LanguageKind::all(), EXPECTED_LEN=33. - native-drop-classification.test.ts: fix toBe(4) → toBe(3) after both .gleam and .v were dropped from the WASM-only-by-default input list (now both are native). docs check acknowledged — README/CLAUDE/ROADMAP already document both languages.
…1107) The merge-conflict resolution commit fixed parser_registry.rs but the test edit was lost from the same merge commit. Both Gleam (#1105) and Verilog (#1107) are now natively supported, so the WASM-only test fixture should only count .fs / .fsx / .m as unsupported (3, not 4). docs check acknowledged — README/CLAUDE/ROADMAP already cover both languages.
Second merge picks up Objective-C (#1106) which landed after the previous merge. Combines: - parser_registry.rs: ObjC + Gleam + Verilog all in LanguageKind::all(), EXPECTED_LEN bumped to 34. - file_collector.rs: combined collision notes for .v (Coq), .m (MATLAB), and .h. - native-drop-classification.test.ts: rotated WASM-only fixture to .fs / .fsx / .fsi now that .m and .v are both natively supported. docs check acknowledged — README/CLAUDE/ROADMAP already document all three languages.
|
Addressed the remaining Greptile suggestion from the summary body: replaced the unreachable |
…ludes (#1107) TS extractPorts was missing module_ansi_header from its container recursion and was not descending into port_identifier, so ANSI-style modules (module top(input clk, …)) returned no port children in the WASM engine while the native engine extracted them correctly. handleIncludeDirective was also missing double_quoted_string, which would silently drop \`include imports when the grammar emits that node kind. Added regression tests for ANSI port extraction and include directive imports.
|
Fixed both parity issues from the latest Greptile review:
Added two regression tests to |
…ludes (#1107) TS `extractPorts` was missing `module_ansi_header` from its container recursion and was not descending into `port_identifier`, so ANSI-style modules (`module top(input clk, …)`) returned no port children in the WASM engine while the native engine extracted them correctly. `handleIncludeDirective` was also missing `double_quoted_string`, which would silently drop `\`include` imports when the grammar emits that node kind. Added regression tests for ANSI port extraction and include directive imports.
742f790 to
f091d3f
Compare
Summary
tree-sitter-verilogdependency and a native Verilog/SystemVerilog extractor incrates/codegraph-core/src/extractors/verilog.rs..vand.svwithLanguageKind::Verilogand the Rustfile_collector, adds Verilog toNATIVE_SUPPORTED_EXTENSIONSon the JS side, and wiresVERILOG_AST_CONFIGinhelpers.rs(all empty lists — mirrors the WASM side, which has noverilogentry inAST_TYPE_MAPS, so both engines emit zeroast_nodesrows for Verilog and stay in parity).extractVerilogSymbols:module_declaration/interface_declaration/package_declaration/class_declarationdefinitions (extends emitted intoclasses),function_declarationandtask_declarationwith<parent>.<name>for nested decls,package_import_declaration(pkg::item/pkg::*) andinclude_compiler_directiveimports, andmodule_instantiationas the call analogue.Closes #1071
Test plan
cargo build --release -p codegraph-core(clean build)cargo test -p codegraph-core --lib— 190/190npx tree-sitter build --wasm node_modules/tree-sitter-verilog/regeneratestree-sitter-verilog.wasmnpx vitest run tests/parsers/verilog.test.ts— 5/5npx vitest run tests/parsers/native-drop-classification.test.ts— 13/13