feat(native): port F# extractor to Rust#1104
Conversation
Adds tree-sitter-fsharp dependency and a native F# extractor in crates/codegraph-core/src/extractors/fsharp.rs. Registers .fs/.fsx/.fsi with LanguageKind::FSharp and the Rust file_collector, adds FSharp to NATIVE_SUPPORTED_EXTENSIONS on the JS side, and wires FSHARP_AST_TYPES / FSHARP_STRING_CONFIG so the native and JS engines extract identical ast_nodes for F# source. Mirrors extractFSharpSymbols: named/anonymous modules as module, function declarations (with parameter children) as function, type definitions as type / class / record / enum / interface (mapped from the F# node kind), type-member function bindings as method, value bindings as variable, and import declarations + dot-expression / application call extraction.
|
Claude finished @carlos-alm's task —— View job Reviewing this F# extractor PR... Todo List
Phase 1: Root Cause Analysis ✅Problem Being Solved: This PR ports the F# extractor from JavaScript/WASM to native Rust to achieve performance parity with other supported languages. F# was previously only supported via tree-sitter WASM, which is slower than the native Rust extractors. Why This Approach: The dual-engine architecture requires both engines to produce identical results. F# support existed in WASM-only form, and this PR adds the missing native Rust implementation to complete the dual-engine support. Risk Assessment: Low risk - this is additive functionality following established patterns. No configuration changes, no relaxed tests, purely extending existing architecture. Phase 2: Configuration Change Detection ✅✅ No high-risk changes detected:
Phase 3: Test Integrity ✅✅ Strong test integrity:
Phase 4: Code Quality Review ✅✅ Excellent implementation quality: Rust Extractor (
Infrastructure Integration:
Dual-Engine Parity:
Phase 5: Backlog Compliance ✅Root Cause AnalysisProblem Being Solved: Completing F# native support to achieve dual-engine performance parity Backlog Compliance
Critical ConcernsNone — this is a high-quality implementation following established patterns. Final Recommendation
|
Greptile SummaryThis PR ports the F# symbol extractor from the JS/WASM engine to native Rust using
Confidence Score: 5/5This PR is safe to merge — it adds a self-contained new extractor with no modifications to existing extraction paths. The Rust extractor is a line-for-line faithful port of the battle-tested JS counterpart. Every handler was cross-checked against src/extractors/fsharp.ts and found consistent. All issues raised in earlier review rounds are addressed in the current HEAD. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["F# file (.fs/.fsx/.fsi)"] --> B{Engine}
B -->|Native Rust| C["LanguageKind::FSharp"]
B -->|WASM JS| D["tree-sitter-fsharp.wasm"]
C --> E["FSharpExtractor.extract"]
E --> F["walk_tree"]
E --> G["walk_ast_nodes_with_config FSHARP_AST_CONFIG"]
F --> H["match node.kind()"]
H --> I["named_module: Definition kind=module"]
H --> J["function_declaration_left: Definition kind=function with dedup guard"]
H --> K["type_definition: Definition kind=enum/record/class/interface/type"]
H --> L["import_decl: Import"]
H --> M["application_expression: Call"]
H --> N["dot_expression: Call with receiver"]
G --> O["ast_nodes: string literals"]
D --> P["extractFSharpSymbols"]
P --> Q["walkFSharpNode same handlers - JS parity"]
Reviews (12): Last reviewed commit: "fix: resolve merge conflicts with main (..." | Re-trigger Greptile |
| "long_identifier_or_op" => { | ||
| // Inner child is either `long_identifier` (qualified, e.g. | ||
| // `Repository.save`) or `identifier` (bare, e.g. `validateUser`). | ||
| // Fall back to the wrapper text if neither exists (e.g. | ||
| // operator forms like `( + )`). | ||
| let inner = find_child(&func_node, "long_identifier") | ||
| .or_else(|| find_child(&func_node, "identifier")); | ||
| let name = match inner { | ||
| Some(n) => node_text(&n, source).to_string(), | ||
| None => node_text(&func_node, source).to_string(), | ||
| }; | ||
| symbols.calls.push(Call { | ||
| name, | ||
| line: start_line(node), | ||
| dynamic: None, | ||
| receiver: None, | ||
| }); | ||
| } |
There was a problem hiding this comment.
Divergence from JS extractor in
handle_application
Two behavioural differences exist versus the JS handleApplication that the PR claims to mirror:
-
Search order flipped: The JS extractor tries
identifierfirst, thenlong_identifierinside along_identifier_or_opwrapper (findChild(funcNode, 'identifier') || findChild(funcNode, 'long_identifier')). The Rust version trieslong_identifierfirst. For a node containing both kinds, the preferred result will differ. -
Extra fallback emits operator calls: When neither child is found (e.g., an operator expression like
( + )), JS emits nothing. Rust falls back to the raw text offunc_nodeand still pushes aCall. This means every operator application in an F# file produces a spurious call entry in the native engine that the WASM engine never produces, diverging the two outputs.
There was a problem hiding this comment.
Fixed in the merge resolution commit. The handle_application branch for long_identifier_or_op now matches the JS extractor exactly:
- Search order is now
identifierfirst, thenlong_identifier(matchesfindChild(funcNode, 'identifier') || findChild(funcNode, 'long_identifier')in the JS extractor). - When neither child is present (operator forms like
( + )), the Rust extractor emits nothing — mirroring the JS extractor's silent skip. The previous fallback that pushed aCallwith the rawfunc_nodetext has been removed.
See crates/codegraph-core/src/extractors/fsharp.rs:242-260.
Codegraph Impact Analysis21 functions changed → 5 callers affected across 2 files
|
…-extractor # Conflicts: # crates/codegraph-core/Cargo.toml # crates/codegraph-core/src/change_detection.rs # crates/codegraph-core/src/extractors/helpers.rs # crates/codegraph-core/src/extractors/mod.rs # crates/codegraph-core/src/file_collector.rs # crates/codegraph-core/src/parser_registry.rs # src/ast-analysis/rules/index.ts # src/domain/parser.ts
docs check acknowledged — merge commit only
| @@ -106,6 +108,7 @@ impl LanguageKind { | |||
| "hs" => Some(Self::Haskell), | |||
There was a problem hiding this comment.
.fsi signature files parsed with the main F# grammar
All three extensions map to a single FSharp variant backed by LANGUAGE_FSHARP. The upstream ionide/tree-sitter-fsharp ships two distinct parsers: fsharp (for .fs/.fsx) and fsharp_signature (for .fsi). If the crate exposes a LANGUAGE_FSHARP_SIGNATURE constant — analogous to how tree-sitter-ocaml exposes both LANGUAGE_OCAML and LANGUAGE_OCAML_INTERFACE (already handled in this codebase via the separate OcamlInterface variant) — then parsing .fsi files with LANGUAGE_FSHARP will produce ERROR nodes for constructs that are valid in the signature grammar but not in the source grammar (e.g., bare val declarations), silently yielding empty or wrong extractions for every signature file.
There was a problem hiding this comment.
Tracked in #1114 — this is genuinely out of scope for this PR.
The JS engine that this PR ports from has the same behavior: src/domain/parser.ts:828 routes .fs/.fsx/.fsi through a single tree-sitter-fsharp.wasm built from the main F# grammar (scripts/build-wasm.ts:208 only references the fsharp sub-grammar). Both engines therefore remain in parity — neither uses fsharp_signature.
Fixing this requires changes on both sides: a new FSharpSignature LanguageKind variant + extractor on the Rust side, AND a new tree-sitter-fsharp-signature.wasm grammar build + registry entry on the JS side. That is materially larger than this PR's mandate (port the existing JS extractor as-is) and is best done as a dedicated follow-up so the new signature grammar can be designed and tested for both engines together.
…-extractor # Conflicts: # Cargo.lock # crates/codegraph-core/src/extractors/mod.rs # crates/codegraph-core/src/file_collector.rs # crates/codegraph-core/src/parser_registry.rs
| fn handle_function_decl(node: &Node, source: &[u8], symbols: &mut FileSymbols) { | ||
| // function_declaration_left: first child is the function name identifier, | ||
| // followed by argument_patterns. | ||
| let name_node = match find_child(node, "identifier") { | ||
| Some(n) => n, | ||
| None => return, | ||
| }; | ||
| let raw_name = node_text(&name_node, source).to_string(); | ||
|
|
||
| let params = extract_fsharp_params(node, source); | ||
| let module_name = enclosing_module_name(node, source); | ||
| let qualified = match module_name { | ||
| Some(m) => format!("{}.{}", m, raw_name), | ||
| None => raw_name, | ||
| }; | ||
|
|
||
| // JS extractor uses the parent's endLine (the function_or_value_defn) for | ||
| // a tighter bound; do the same to preserve parity. | ||
| let end = node.parent().unwrap_or(*node); | ||
|
|
||
| symbols.definitions.push(Definition { | ||
| name: qualified, | ||
| kind: "function".to_string(), | ||
| line: start_line(node), | ||
| end_line: Some(end_line(&end)), | ||
| decorators: None, | ||
| complexity: compute_all_metrics(&end, source, "fsharp"), | ||
| cfg: build_function_cfg(&end, "fsharp", source), | ||
| children: opt_children(params), | ||
| }); |
There was a problem hiding this comment.
Missing deduplication guard diverges from JS extractor
The JS handleFunctionDecl has an explicit guard (lines 92–94 in src/extractors/fsharp.ts) that skips adding a definition when one with the same name and line already exists. The comment reads: "Avoid duplicates — the walk will also visit children." This guard exists because tree-sitter-fsharp nests function_declaration_left nodes for curried multi-parameter functions (e.g. let add x y = x + y), causing the walk to encounter two nodes at the same start line that both expose an identifier "add" as a direct child. The Rust walk_tree visits every node exactly once in DFS order, so both the outer and inner function_declaration_left are processed, and handle_function_decl pushes two identical Definition entries where the JS engine pushes one. Any downstream consumer that de-dupes by name+line will be unaffected, but callers that use definitions.len() or iterate all entries will see doubled results for every multi-parameter function in an F# file.
There was a problem hiding this comment.
Fixed in 2dc3fbf. The Rust extractor now mirrors the JS dedup guard: before pushing a function definition, it checks whether one with the same raw identifier name and start line already exists, and skips if so. The check uses the unqualified raw_name to match the JS behavior exactly (where the dedup compares against nameNode.text). This eliminates the doubled Definition entries for curried multi-parameter functions like let add x y = ….
See crates/codegraph-core/src/extractors/fsharp.rs:55-78.
Summary
tree-sitter-fsharpdependency and a native F# extractor incrates/codegraph-core/src/extractors/fsharp.rs..fs/.fsx/.fsiwithLanguageKind::FSharpand the Rustfile_collector, adds F# toNATIVE_SUPPORTED_EXTENSIONSon the JS side, and wiresFSHARP_AST_TYPES/FSHARP_STRING_CONFIGon both the native and JS sides so the two engines extract identicalast_nodesfor F# source.extractFSharpSymbols: named/anonymous modules asmodule, function declarations (with parameter children) asfunction, type definitions astype/class/record/enum/interface(mapped from the F# node kind), type-member function bindings asmethod, value bindings asvariable, plus import declarations anddot expression/applicationcall extraction.Closes #1071
Test plan
cargo build --release -p codegraph-core(clean build)cargo test -p codegraph-core --lib— 184/184npx tree-sitter build --wasmregeneratestree-sitter-fsharp.wasmnpx vitest run tests/parsers/fsharp.test.ts— 5/5npx vitest run tests/parsers/native-drop-classification.test.ts— 13/13