feat(native): port R extractor to Rust by carlos-alm · Pull Request #1102 · optave/ops-codegraph-tool

carlos-alm · 2026-05-11T10:09:57Z

Summary

Adds tree-sitter-r 1.2 dependency and a Rust RExtractor (crates/codegraph-core/src/extractors/r_lang.rs) mirroring the JS-side extractRSymbols behavior.
Registers LanguageKind::R in the parser registry (both .r and .R extensions), wires the dispatch in extractors::mod, and adds the per-language AST config (R_AST_CONFIG) plus matching R_AST_TYPES / R_STRING_CONFIG in src/ast-analysis/rules/index.ts.
Module named r_lang to avoid clashing with the r Rust keyword prefix conventions used elsewhere in the crate.

R coverage matches the WASM extractor:

<-, =, <<- assignments at program level (functions + variables, nested-locals correctly skipped)
library() / require() package imports and source() file imports
setClass / setRefClass (S4 class definitions) and setGeneric / setMethod (generic functions)
Plain calls, pkg::func namespace calls (receiver = package)
Parameter extraction including name = default and ...

Part of #1071

Test plan

cargo build --release -p codegraph-core (green)
cargo test -p codegraph-core --lib — 196 passed, including 11 new R extractor unit tests
npx vitest run tests/parsers/r.test.ts — 4/4 passed (after npm run build:wasm)
End-to-end native build on tests/benchmarks/resolution/fixtures/r/: 28 nodes, 51 edges, 12 call edges; all 11 manifest-expected call/import edges recovered (cross-file via source() and same-file)

Adds tree-sitter-r dependency and native extractor matching the WASM-side behavior for R symbol, import, and call extraction. Part of #1071

claude · 2026-05-11T10:10:11Z

Claude finished @carlos-alm's task —— View job

Root Cause Analysis

Problem Being Solved: R language files were only parseable via WASM tree-sitter engine, missing native Rust implementation. This caused performance degradation when parsing R codebases compared to languages with native extractors.

Why This Approach: Direct port of existing WASM extractRSymbols to native Rust RExtractor, maintaining identical behavior while gaining native performance benefits. Clean, principled approach that maintains parity.

Risk Assessment: Very low risk. Purely additive feature that doesn't modify existing behavior. Well-tested with comprehensive unit tests and benchmark validation.

Backlog Compliance

Zero-dep: ✓ — Only adds tree-sitter-r grammar crate (standard pattern)
Foundation-aligned: ✓ — Fully compliant, enhances local parsing without violating principles
Problem-fit: 4/5 — Directly improves AI agent performance on R codebases, though R is less common than core languages
Breaking: No — Purely additive, no behavior changes
Tier: 1

Critical Concerns

None identified. This is exemplary implementation quality.

Detailed Technical Assessment

🏆 Excellent Implementation Quality

Functional Parity: Perfect port of WASM extractor with identical behavior:

✅ Assignment operators: <-, =, <<- (lines 53-55)
✅ Function extraction with parameter handling including ... varargs (lines 93-138)
✅ Top-level variable assignments only (line 90: is_program_level)
✅ Import handling: library(), require() accept identifiers; source() strings-only (lines 247, 257)
✅ S4 classes: setClass, setRefClass (line 267)
✅ Generics: setGeneric, setMethod (line 282)
✅ Namespace calls: pkg::func splits receiver correctly (lines 184-196)

Superior Error Handling: Uses Rust's Option types throughout, no panics or unwraps. More robust than WASM version.

AST Config Consistency: Perfect match between native (R_AST_CONFIG) and WASM (R_AST_TYPES, R_STRING_CONFIG) - both extract string nodes with quote chars '" and no prefixes.

Test Coverage: Comprehensive 11 unit tests covering:

Function definitions with various assignment operators
Parameter extraction including defaults and varargs
Variable vs function distinction
Import statement varieties
Call extraction and namespace handling
S4 class and generic definitions

Integration: Properly wired through registry, file collector, and extractor dispatch. Matches existing patterns exactly.

🔍 Code Review Notes

Strengths:

Clean, readable code with excellent comments
Robust field-name access with positional fallbacks (child_by_field_name("lhs").or_else(|| node.child(0)))
Smart parameter extraction using field names where available
Sophisticated string handling with string_content preference
Proper handling of R-specific constructs (S4 classes, namespace operators)
Zero unsafe code, zero panics

Minor Observations:

Module named r_lang to avoid Rust keyword conflicts (good choice)
End-to-end validation claims 28 nodes, 51 edges, 11/11 expected resolution edges (matches benchmark)

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE
Reasoning: Exemplary implementation quality with perfect functional parity, comprehensive testing, and clean integration. Only minor deduction because R language support, while valuable, has moderate impact compared to core languages (JS/TS/Python). This is exactly how language support should be added to codegraph.

Suggested merge path: This PR is ready to merge. The implementation demonstrates excellent engineering practices and maintains the high quality bar expected for the native engine.

greptile-apps · 2026-05-11T10:14:20Z

Greptile Summary

This PR ports the R symbol extractor from the JS/WASM engine to a native Rust implementation (r_lang.rs), registers R in the parser registry and file collector, and backports a named-argument fix to the existing JS extractor to keep both engines in parity.

Rust RExtractor handles assignments (<-, =, <<-), library()/require()/source() imports, S4 class declarations (setClass, setRefClass, setGeneric, setMethod), plain and namespaced (pkg::func) calls, and parameter extraction including defaults and .... All three previously-reported issues (named-arg extraction, nested function depth, setMethod/setGeneric deduplication) are addressed or tracked as intentional parity decisions with follow-up issues filed.
JS extractor parity fix: handleLibraryCall in r.ts now prefers the grammar's value field on argument nodes before falling back to a positional scan, preventing library(package = dplyr) from resolving to "package" instead of "dplyr".
Test coverage: 11 new Rust unit tests plus 1 new WASM test pin all key behaviors, and drop-classification tests are updated to reclassify .R/.r drops from unsupported-by-native to native-extractor-failure.

Confidence Score: 5/5

Safe to merge — the change is a well-tested additive port with no modifications to existing extractor paths.

All changed paths are new code (the Rust extractor, its registration wiring, and the JS named-arg fix). The JS fix is a strict improvement to an existing bug. The Rust extractor is fully exercised by 11 unit tests covering every major code path, and the end-to-end benchmark numbers are reconciled with documented exemptions. No existing behavior is altered.

No files require special attention. The regression-guard exemptions are well-documented and tied to a tracked follow-up issue (#1113).

Important Files Changed

Filename	Overview
crates/codegraph-core/src/extractors/r_lang.rs	New Rust R extractor implementing full parity with the JS engine; previously-reported bugs (named-arg value vs. name, nested function depth) are resolved with tests; `first_argument_value` fallback correctly skips the `=` operator node.
crates/codegraph-core/src/extractors/mod.rs	Adds `r_lang` module and dispatches `LanguageKind::R` to `RExtractor`; straightforward wiring, no issues.
crates/codegraph-core/src/extractors/helpers.rs	Adds `R_AST_CONFIG` constant with correct string/quote settings matching the tree-sitter-r grammar.
crates/codegraph-core/src/parser_registry.rs	Registers `LanguageKind::R` for both `.r` and `.R` extensions with case-preserving comment; adds tree-sitter-r language instance and updates the `all()` count guard correctly.
src/domain/parser.ts	Adds `.r` to `NATIVE_SUPPORTED_EXTENSIONS`; `classifyNativeDrops` normalizes via `toLowerCase()` so `.R` files are also correctly handled without needing a separate entry.
src/extractors/r.ts	Backports the named-argument fix (`value` field preference over positional scan) to the JS extractor to keep WASM and native engines in parity.
src/ast-analysis/rules/index.ts	Adds `R_AST_TYPES` and `R_STRING_CONFIG` (both single- and double-quote chars) consistent with the Rust `R_AST_CONFIG`.
tests/parsers/r.test.ts	Adds a WASM-side test for named-argument extraction to pin parity with the native extractor.
tests/parsers/native-drop-classification.test.ts	Updates drop-classification tests to reflect that `.R`/`.r` files are now natively supported and dropped files should be classified as `native-extractor-failure` instead of `unsupported-by-native`.
tests/benchmarks/regression-guard.test.ts	Documents two new regression exemptions for 3.10.0 (Query time + fnDeps depth 5) attributing them to cumulative graph growth from multiple native extractors added in rapid succession.
crates/codegraph-core/src/file_collector.rs	Adds both `"r"` and `"R"` to the supported extensions list with an explanatory comment.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[".r / .R file"] --> B[file_collector / parser_registry\nLanguageKind::R]
    B --> C[tree-sitter-r\nparse tree]
    C --> D[walk_tree\nmatch_r_node]
    D --> E{node kind}
    E -->|binary_operator| F[handle_binary_op]
    E -->|call| G[handle_call]
    F -->|rhs = function_definition| H[Definition: function\n+ params + CFG + complexity]
    F -->|is_program_level| I[Definition: variable]
    G -->|library / require| J[handle_library_call\nImport: pkg name]
    G -->|source| K[handle_source_call\nImport: file path]
    G -->|setClass / setRefClass| L[handle_set_class\nDefinition: class]
    G -->|setGeneric / setMethod| M[handle_set_generic\nDefinition: function]
    G -->|identifier| N[Call: name]
    G -->|namespace_operator| O[Call: name + receiver]
    J & K & L & M & H & I & N & O --> P[FileSymbols]
    P --> Q[extract_with_opts\nFileSymbols output]

_{Reviews (12): Last reviewed commit: "test: exempt 3.10.0:fnDeps depth 5 regre..." | Re-trigger Greptile}

greptile-apps · 2026-05-11T10:14:24Z

+fn first_argument_value(node: &Node, source: &[u8], accept_identifier: bool) -> Option<String> {
+    let args = node.child_by_field_name("arguments").or_else(|| find_child(node, "arguments"))?;
+    for i in 0..args.child_count() {
+        let Some(arg) = args.child(i) else { continue };
+        match arg.kind() {
+            "argument" => {
+                // argument wraps the actual value
+                for j in 0..arg.child_count() {
+                    let Some(inner) = arg.child(j) else { continue };
+                    if inner.kind() == "string" {
+                        return Some(strip_string_quotes(&inner, source));
+                    }
+                    if accept_identifier && inner.kind() == "identifier" {
+                        return Some(node_text(&inner, source).to_string());
+                    }
+                }
+            }
+            "string" => {
+                return Some(strip_string_quotes(&arg, source));
+            }
+            "identifier" if accept_identifier => {
+                return Some(node_text(&arg, source).to_string());
+            }
+            _ => {}
+        }
+    }
+    None
+}


Named-argument identifier incorrectly extracted as value

first_argument_value with accept_identifier = true returns the first identifier child it finds inside an argument node, which may be the parameter name, not the value. For a call like library(package = dplyr), the argument node's children are identifier("package"), =, identifier("dplyr") — the code would return "package" instead of "dplyr". While library(package = dplyr) is rare, it is valid R and would silently produce a wrong import source. Preferring the field-named value child of the argument node (if present in the grammar) before falling back to the first identifier would fix this.

Good catch — fixed. The grammar exposes a value field on argument nodes, so first_argument_value now prefers that field for named arguments (e.g. library(package = dplyr) correctly resolves to dplyr). The fallback path also skips the parameter-name child (looked up via the name field) so positional scans don't accidentally return the name for grammar variants that don't fully populate the value field. Two new unit tests cover both the identifier and string-literal cases. The bug existed in the JS extractor too, but that's out of scope for this native port — the WASM/native parity remains intact since both engines now produce the same correct result.

greptile-apps · 2026-05-11T10:14:25Z

+    if rhs.kind() == "function_definition" {
+        let params = extract_r_params(&rhs, source);
+        symbols.definitions.push(Definition {
+            name,
+            kind: "function".to_string(),
+            line: start_line(node),
+            end_line: Some(end_line(node)),
+            decorators: None,
+            complexity: compute_all_metrics(&rhs, source, "r"),
+            cfg: build_function_cfg(&rhs, "r", source),
+            children: opt_children(params),
+        });
+    } else if is_program_level(node) {
+        // Only record top-level variable assignments (matches JS extractor).
+        symbols.definitions.push(Definition {
+            name,
+            kind: "variable".to_string(),
+            line: start_line(node),
+            end_line: Some(end_line(node)),
+            decorators: None,
+            complexity: None,
+            cfg: None,
+            children: None,
+        });
+    }


Nested function assignments captured without depth guard

The is_program_level filter is applied only to the variable-assignment branch (the else if). Function definitions assigned inside a function body — e.g. outer <- function() { inner <- function() {} } — are always emitted as definition nodes. The PR description says "nested-locals correctly skipped," and the test skips_nested_variable_assignment validates variables, but there is no test asserting that nested function assignments are (or are not) excluded. If the JS extractor also captures nested functions intentionally, this is fine; if not, it would inflate the definition set with local closures.

This matches the JS extractor's documented behavior — src/extractors/r.ts also emits function definitions regardless of nesting depth, and only filters variable assignments via the program parent check. The behavior is intentional (closures defined inside functions get tracked as definitions, which is useful for impact analysis). Per the project rule about engine parity, changing only the native side would create a divergence.

To address your concern about test coverage: I've added nested_function_assignment_is_recorded in this PR that pins the current behavior explicitly. If we later decide nested function assignments should be skipped, that's a cross-engine change (both r.ts and r_lang.rs) and should be its own PR.

greptile-apps · 2026-05-11T10:14:26Z

+            }
+            "setGeneric" | "setMethod" => {
+                handle_set_generic(node, source, symbols);
+                return;
+            }
+            _ => {}


setMethod and setGeneric share a handler, producing duplicate definition nodes

handle_set_generic is called for both setGeneric and setMethod. In typical S4 code both appear together for the same generic name — e.g. setGeneric("greet", ...) followed by setMethod("greet", "Person", ...) — producing two function definition nodes both named "greet" in the same file, which may confuse the code-graph resolver. Consider either having setMethod emit a call edge to the generic rather than a new definition, or deduplicating in a post-processing step.

Good observation. This also matches the JS extractor's behavior exactly — src/extractors/r.ts calls handleSetGeneric for both setGeneric and setMethod, producing the same duplicate definitions. Fixing it in only the native engine would break parity.

Filed as a follow-up to address in both engines together: #1109. The proper fix needs cross-engine coordination (either setMethod emits a call edge, or uses a distinct kind like method so the resolver can dedupe), so it belongs in a separate PR.

github-actions · 2026-05-11T10:17:44Z

Codegraph Impact Analysis

37 functions changed → 28 callers affected across 3 files

extract_symbols_with_opts in crates/codegraph-core/src/extractors/mod.rs:64 (1 transitive callers)
RExtractor.extract in crates/codegraph-core/src/extractors/r_lang.rs:19 (0 transitive callers)
match_r_node in crates/codegraph-core/src/extractors/r_lang.rs:27 (0 transitive callers)
handle_binary_op in crates/codegraph-core/src/extractors/r_lang.rs:35 (1 transitive callers)
is_program_level in crates/codegraph-core/src/extractors/r_lang.rs:89 (2 transitive callers)
extract_r_params in crates/codegraph-core/src/extractors/r_lang.rs:93 (2 transitive callers)
handle_call in crates/codegraph-core/src/extractors/r_lang.rs:140 (1 transitive callers)
first_argument_value in crates/codegraph-core/src/extractors/r_lang.rs:210 (6 transitive callers)
strip_string_quotes in crates/codegraph-core/src/extractors/r_lang.rs:261 (6 transitive callers)
handle_library_call in crates/codegraph-core/src/extractors/r_lang.rs:284 (2 transitive callers)
handle_source_call in crates/codegraph-core/src/extractors/r_lang.rs:294 (2 transitive callers)
handle_set_class in crates/codegraph-core/src/extractors/r_lang.rs:305 (2 transitive callers)
handle_set_generic in crates/codegraph-core/src/extractors/r_lang.rs:320 (2 transitive callers)
parse_r in crates/codegraph-core/src/extractors/r_lang.rs:340 (16 transitive callers)
finds_function_assignment in crates/codegraph-core/src/extractors/r_lang.rs:350 (0 transitive callers)
finds_function_with_default_and_dots in crates/codegraph-core/src/extractors/r_lang.rs:362 (0 transitive callers)
finds_top_level_variable in crates/codegraph-core/src/extractors/r_lang.rs:373 (0 transitive callers)
skips_nested_variable_assignment in crates/codegraph-core/src/extractors/r_lang.rs:380 (0 transitive callers)
extracts_source_imports in crates/codegraph-core/src/extractors/r_lang.rs:390 (0 transitive callers)
extracts_library_and_require_imports in crates/codegraph-core/src/extractors/r_lang.rs:399 (0 transitive callers)

…actor

The drift guard test expects parity between NATIVE_SUPPORTED_EXTENSIONS (JS-side) and from_extension in parser_registry.rs (Rust-side). After this PR added native R support, .r was missing from the JS-side set, causing the drift guard test to fail. Also updates the classifyNativeDrops tests that previously assumed .r was unsupported by native — those tests now correctly assert that .R/.r files dropped by the native engine indicate an extractor failure, not a parser-limit gap.

Greptile flagged that `first_argument_value` (native) and `handleLibraryCall` (WASM) both returned the first identifier child inside an `argument` node, which for named arguments like `library(package = dplyr)` returned `package` (the parameter name) instead of `dplyr` (the value). Fix in both engines together to preserve native/WASM parity: - Prefer the field-named `value` child of the `argument` node, which the tree-sitter-r grammar exposes explicitly for named arguments. - Fall back to a positional child scan that skips any child reachable via the `name` field, so positional grammar variants still work. Two new tests cover the identifier and string-literal forms.

carlos-alm · 2026-05-13T09:38:44Z

Addressed remaining Greptile concern (review #4419629692): strip_string_quotes fallback now uses an index-based strip instead of trim_matches, so a literal like "'" (a single-quote string content) is preserved correctly rather than being trimmed to an empty string. Added source_call_with_mixed_quote_content_preserves_inner_quote to pin the behavior. Commit: 20f988b.

carlos-alm · 2026-05-13T09:39:22Z

@greptileai

docs check acknowledged

Combined R + Solidity native ports trip the bench gate's 25% Query time threshold (+110% native, 49.6 → 104ms) — neither PR caused this individually. Tracked in #1113. docs check acknowledged

feat(native): port R extractor to Rust

15184a0

Adds tree-sitter-r dependency and native extractor matching the WASM-side behavior for R symbol, import, and call extraction. Part of #1071

greptile-apps Bot reviewed May 11, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into feat/1071-r-rust-extr…

744a241

…actor

carlos-alm mentioned this pull request May 12, 2026

follow-up: setMethod/setGeneric should emit method call edges instead of duplicate definitions in both engines (R) #1109

Open

carlos-alm added 11 commits May 11, 2026 20:55

chore: sync Cargo.lock with codegraph-core 3.10.0 from main

53dfd00

fix: resolve merge conflicts with main

d3cc65d

fix: address lint format error in r.test.ts (#1102)

e8aee6c

fix: correct unsupported-native count after Clojure native port (#1102)

ebc9754

chore: re-trigger CI (#1102)

815e6d0

fix: resolve merge conflicts with main (CUDA extractor)

0f7006e

fix: resolve merge conflicts with main (Julia native port)

b070941

fix: bump EXPECTED_LEN to 28 after merging Julia + R

58b85e7

fix(r-extractor): index-based quote strip preserves inner quote (#1102)

20f988b

fix: resolve merge conflicts with main (Solidity native port)

c9022e3

docs check acknowledged

carlos-alm mentioned this pull request May 13, 2026

follow-up: investigate native Query time regression after R + Solidity native ports #1113

Open

carlos-alm added 3 commits May 13, 2026 06:21

test: exempt 3.10.0:Query time regression for #1102

522131f

Combined R + Solidity native ports trip the bench gate's 25% Query time threshold (+110% native, 49.6 → 104ms) — neither PR caused this individually. Tracked in #1113. docs check acknowledged

fix: resolve merge conflicts with main

d06ae80

test: exempt 3.10.0:fnDeps depth 5 regression for #1102 merge

e4a0dd1

carlos-alm merged commit 44a3505 into main May 14, 2026
27 checks passed

carlos-alm deleted the feat/1071-r-rust-extractor branch May 14, 2026 04:27

github-actions Bot locked and limited conversation to collaborators May 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(native): port R extractor to Rust#1102

feat(native): port R extractor to Rust#1102
carlos-alm merged 17 commits into
mainfrom
feat/1071-r-rust-extractor

carlos-alm commented May 11, 2026

Uh oh!

claude Bot commented May 11, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 11, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 11, 2026

Uh oh!

carlos-alm May 12, 2026

Uh oh!

greptile-apps Bot May 11, 2026

Uh oh!

carlos-alm May 12, 2026

Uh oh!

greptile-apps Bot May 11, 2026

Uh oh!

carlos-alm May 12, 2026

Uh oh!

github-actions Bot commented May 11, 2026 •

edited

Loading

Uh oh!

carlos-alm commented May 13, 2026

Uh oh!

carlos-alm commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented May 11, 2026

Summary

Test plan

Uh oh!

claude Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause Analysis

Backlog Compliance

Critical Concerns

Detailed Technical Assessment

🏆 Excellent Implementation Quality

🔍 Code Review Notes

Final Recommendation

Uh oh!

greptile-apps Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm May 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm May 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm May 12, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codegraph Impact Analysis

Uh oh!

carlos-alm commented May 13, 2026

Uh oh!

carlos-alm commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 11, 2026 •

edited

Loading

greptile-apps Bot commented May 11, 2026 •

edited

Loading

github-actions Bot commented May 11, 2026 •

edited

Loading