feat(native): port R extractor to Rust#1102
Conversation
Adds tree-sitter-r dependency and native extractor matching the WASM-side behavior for R symbol, import, and call extraction. Part of #1071
|
Claude finished @carlos-alm's task —— View job Root Cause AnalysisProblem Being Solved: R language files were only parseable via WASM tree-sitter engine, missing native Rust implementation. This caused performance degradation when parsing R codebases compared to languages with native extractors. Why This Approach: Direct port of existing WASM Risk Assessment: Very low risk. Purely additive feature that doesn't modify existing behavior. Well-tested with comprehensive unit tests and benchmark validation. Backlog Compliance
Critical ConcernsNone identified. This is exemplary implementation quality. Detailed Technical Assessment🏆 Excellent Implementation QualityFunctional Parity: Perfect port of WASM extractor with identical behavior:
Superior Error Handling: Uses Rust's Option types throughout, no panics or unwraps. More robust than WASM version. AST Config Consistency: Perfect match between native ( Test Coverage: Comprehensive 11 unit tests covering:
Integration: Properly wired through registry, file collector, and extractor dispatch. Matches existing patterns exactly. 🔍 Code Review NotesStrengths:
Minor Observations:
Final Recommendation
Suggested merge path: This PR is ready to merge. The implementation demonstrates excellent engineering practices and maintains the high quality bar expected for the native engine. |
Greptile SummaryThis PR ports the R symbol extractor from the JS/WASM engine to a native Rust implementation (
Confidence Score: 5/5Safe to merge — the change is a well-tested additive port with no modifications to existing extractor paths. All changed paths are new code (the Rust extractor, its registration wiring, and the JS named-arg fix). The JS fix is a strict improvement to an existing bug. The Rust extractor is fully exercised by 11 unit tests covering every major code path, and the end-to-end benchmark numbers are reconciled with documented exemptions. No existing behavior is altered. No files require special attention. The regression-guard exemptions are well-documented and tied to a tracked follow-up issue (#1113). Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[".r / .R file"] --> B[file_collector / parser_registry\nLanguageKind::R]
B --> C[tree-sitter-r\nparse tree]
C --> D[walk_tree\nmatch_r_node]
D --> E{node kind}
E -->|binary_operator| F[handle_binary_op]
E -->|call| G[handle_call]
F -->|rhs = function_definition| H[Definition: function\n+ params + CFG + complexity]
F -->|is_program_level| I[Definition: variable]
G -->|library / require| J[handle_library_call\nImport: pkg name]
G -->|source| K[handle_source_call\nImport: file path]
G -->|setClass / setRefClass| L[handle_set_class\nDefinition: class]
G -->|setGeneric / setMethod| M[handle_set_generic\nDefinition: function]
G -->|identifier| N[Call: name]
G -->|namespace_operator| O[Call: name + receiver]
J & K & L & M & H & I & N & O --> P[FileSymbols]
P --> Q[extract_with_opts\nFileSymbols output]
Reviews (12): Last reviewed commit: "test: exempt 3.10.0:fnDeps depth 5 regre..." | Re-trigger Greptile |
| fn first_argument_value(node: &Node, source: &[u8], accept_identifier: bool) -> Option<String> { | ||
| let args = node.child_by_field_name("arguments").or_else(|| find_child(node, "arguments"))?; | ||
| for i in 0..args.child_count() { | ||
| let Some(arg) = args.child(i) else { continue }; | ||
| match arg.kind() { | ||
| "argument" => { | ||
| // argument wraps the actual value | ||
| for j in 0..arg.child_count() { | ||
| let Some(inner) = arg.child(j) else { continue }; | ||
| if inner.kind() == "string" { | ||
| return Some(strip_string_quotes(&inner, source)); | ||
| } | ||
| if accept_identifier && inner.kind() == "identifier" { | ||
| return Some(node_text(&inner, source).to_string()); | ||
| } | ||
| } | ||
| } | ||
| "string" => { | ||
| return Some(strip_string_quotes(&arg, source)); | ||
| } | ||
| "identifier" if accept_identifier => { | ||
| return Some(node_text(&arg, source).to_string()); | ||
| } | ||
| _ => {} | ||
| } | ||
| } | ||
| None | ||
| } |
There was a problem hiding this comment.
Named-argument identifier incorrectly extracted as value
first_argument_value with accept_identifier = true returns the first identifier child it finds inside an argument node, which may be the parameter name, not the value. For a call like library(package = dplyr), the argument node's children are identifier("package"), =, identifier("dplyr") — the code would return "package" instead of "dplyr". While library(package = dplyr) is rare, it is valid R and would silently produce a wrong import source. Preferring the field-named value child of the argument node (if present in the grammar) before falling back to the first identifier would fix this.
There was a problem hiding this comment.
Good catch — fixed. The grammar exposes a value field on argument nodes, so first_argument_value now prefers that field for named arguments (e.g. library(package = dplyr) correctly resolves to dplyr). The fallback path also skips the parameter-name child (looked up via the name field) so positional scans don't accidentally return the name for grammar variants that don't fully populate the value field. Two new unit tests cover both the identifier and string-literal cases. The bug existed in the JS extractor too, but that's out of scope for this native port — the WASM/native parity remains intact since both engines now produce the same correct result.
| if rhs.kind() == "function_definition" { | ||
| let params = extract_r_params(&rhs, source); | ||
| symbols.definitions.push(Definition { | ||
| name, | ||
| kind: "function".to_string(), | ||
| line: start_line(node), | ||
| end_line: Some(end_line(node)), | ||
| decorators: None, | ||
| complexity: compute_all_metrics(&rhs, source, "r"), | ||
| cfg: build_function_cfg(&rhs, "r", source), | ||
| children: opt_children(params), | ||
| }); | ||
| } else if is_program_level(node) { | ||
| // Only record top-level variable assignments (matches JS extractor). | ||
| symbols.definitions.push(Definition { | ||
| name, | ||
| kind: "variable".to_string(), | ||
| line: start_line(node), | ||
| end_line: Some(end_line(node)), | ||
| decorators: None, | ||
| complexity: None, | ||
| cfg: None, | ||
| children: None, | ||
| }); | ||
| } |
There was a problem hiding this comment.
Nested function assignments captured without depth guard
The is_program_level filter is applied only to the variable-assignment branch (the else if). Function definitions assigned inside a function body — e.g. outer <- function() { inner <- function() {} } — are always emitted as definition nodes. The PR description says "nested-locals correctly skipped," and the test skips_nested_variable_assignment validates variables, but there is no test asserting that nested function assignments are (or are not) excluded. If the JS extractor also captures nested functions intentionally, this is fine; if not, it would inflate the definition set with local closures.
There was a problem hiding this comment.
This matches the JS extractor's documented behavior — src/extractors/r.ts also emits function definitions regardless of nesting depth, and only filters variable assignments via the program parent check. The behavior is intentional (closures defined inside functions get tracked as definitions, which is useful for impact analysis). Per the project rule about engine parity, changing only the native side would create a divergence.
To address your concern about test coverage: I've added nested_function_assignment_is_recorded in this PR that pins the current behavior explicitly. If we later decide nested function assignments should be skipped, that's a cross-engine change (both r.ts and r_lang.rs) and should be its own PR.
| } | ||
| "setGeneric" | "setMethod" => { | ||
| handle_set_generic(node, source, symbols); | ||
| return; | ||
| } | ||
| _ => {} |
There was a problem hiding this comment.
setMethod and setGeneric share a handler, producing duplicate definition nodes
handle_set_generic is called for both setGeneric and setMethod. In typical S4 code both appear together for the same generic name — e.g. setGeneric("greet", ...) followed by setMethod("greet", "Person", ...) — producing two function definition nodes both named "greet" in the same file, which may confuse the code-graph resolver. Consider either having setMethod emit a call edge to the generic rather than a new definition, or deduplicating in a post-processing step.
There was a problem hiding this comment.
Good observation. This also matches the JS extractor's behavior exactly — src/extractors/r.ts calls handleSetGeneric for both setGeneric and setMethod, producing the same duplicate definitions. Fixing it in only the native engine would break parity.
Filed as a follow-up to address in both engines together: #1109. The proper fix needs cross-engine coordination (either setMethod emits a call edge, or uses a distinct kind like method so the resolver can dedupe), so it belongs in a separate PR.
Codegraph Impact Analysis37 functions changed → 28 callers affected across 3 files
|
The drift guard test expects parity between NATIVE_SUPPORTED_EXTENSIONS (JS-side) and from_extension in parser_registry.rs (Rust-side). After this PR added native R support, .r was missing from the JS-side set, causing the drift guard test to fail. Also updates the classifyNativeDrops tests that previously assumed .r was unsupported by native — those tests now correctly assert that .R/.r files dropped by the native engine indicate an extractor failure, not a parser-limit gap.
Greptile flagged that `first_argument_value` (native) and `handleLibraryCall` (WASM) both returned the first identifier child inside an `argument` node, which for named arguments like `library(package = dplyr)` returned `package` (the parameter name) instead of `dplyr` (the value). Fix in both engines together to preserve native/WASM parity: - Prefer the field-named `value` child of the `argument` node, which the tree-sitter-r grammar exposes explicitly for named arguments. - Fall back to a positional child scan that skips any child reachable via the `name` field, so positional grammar variants still work. Two new tests cover the identifier and string-literal forms.
|
Addressed remaining Greptile concern (review #4419629692): |
docs check acknowledged
Combined R + Solidity native ports trip the bench gate's 25% Query time threshold (+110% native, 49.6 → 104ms) — neither PR caused this individually. Tracked in #1113. docs check acknowledged
Summary
tree-sitter-r 1.2dependency and a RustRExtractor(crates/codegraph-core/src/extractors/r_lang.rs) mirroring the JS-sideextractRSymbolsbehavior.LanguageKind::Rin the parser registry (both.rand.Rextensions), wires the dispatch inextractors::mod, and adds the per-language AST config (R_AST_CONFIG) plus matchingR_AST_TYPES/R_STRING_CONFIGinsrc/ast-analysis/rules/index.ts.r_langto avoid clashing with therRust keyword prefix conventions used elsewhere in the crate.R coverage matches the WASM extractor:
<-,=,<<-assignments at program level (functions + variables, nested-locals correctly skipped)library()/require()package imports andsource()file importssetClass/setRefClass(S4 class definitions) andsetGeneric/setMethod(generic functions)pkg::funcnamespace calls (receiver = package)name = defaultand...Part of #1071
Test plan
cargo build --release -p codegraph-core(green)cargo test -p codegraph-core --lib— 196 passed, including 11 new R extractor unit testsnpx vitest run tests/parsers/r.test.ts— 4/4 passed (afternpm run build:wasm)tests/benchmarks/resolution/fixtures/r/: 28 nodes, 51 edges, 12 call edges; all 11 manifest-expected call/import edges recovered (cross-file viasource()and same-file)