Tree-sitter chunker fails on deep C++ ASTs (recursion → segfault); no per-file isolation; no CLI index persistence

First, thanks for shipping semble; the agent-first design + zero-API-key install is exactly what's missing from the code-search-for-agents space. We piloted it against a large C++ codebase (ClickHouse, ~25K files, ~2M LOC of templated C++) and hit a chain of three issues that compound, plus a fourth (bonus) issue in CLI mode. Reporting them together because Issues 2 and 3 are why Issue 1 is unrecoverable rather than just slow.

## Repro target

- Repo: [ClickHouse/ClickHouse](https://github.com/ClickHouse/ClickHouse) at master
- Version: `semble==0.1.10` installed via `uv tool install semble`
- Python: 3.10.20 (uv-managed)
- Kernel: Linux 6.1, aarch64
- Filesystem: ext4

## Issue 1 — `RecursionError` in chunker on deeply-templated C++ ASTs

```
File ".../semble/chunking/core.py", line 93, in _merge_node_inner
    groups.extend(_merge_node_inner(child, desired_length))
  [Previous line repeated 987 more times]
RecursionError: maximum recursion depth exceeded
```

ClickHouse has many template-heavy `.cpp`/`.h` files (`MergeTreeData.cpp` and friends) where the tree-sitter parse tree nests deeper than Python's default 1000-frame recursion limit. `_merge_node_inner` recurses unboundedly on AST depth and trips the limit.

**Suggested fix**: cap the recursion depth in `_merge_node_inner` and switch to an iterative fallback (or a coarser chunk) past the cap. Files that would have been chunked at a finer grain still get chunked, just less precisely.
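A minimal sketch of what the iterative fallback could look like, assuming nothing about semble's internals: `Node` is a hypothetical stand-in for a tree-sitter node (a byte span plus children), and `split_iterative` is an illustrative name, not an existing semble function. It only splits oversized nodes into child spans and does not re-merge small siblings or preserve text between children, which the real chunker presumably does.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node: a byte span plus children."""
    start: int
    end: int
    children: list[Node] = field(default_factory=list)


def split_iterative(root: Node, desired_length: int) -> list[tuple[int, int]]:
    """Split a tree into (start, end) spans using an explicit stack.

    A node is emitted whole if it fits in desired_length or has no children;
    otherwise its children are visited instead. Memory use grows with tree
    size on the heap, not with Python call depth, so no RecursionError.
    """
    spans: list[tuple[int, int]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.end - node.start <= desired_length or not node.children:
            spans.append((node.start, node.end))
        else:
            # Push children reversed so they are popped in source order.
            stack.extend(reversed(node.children))
    return spans


# Build a tree nested 5000 levels deep, well past the default recursion limit.
deep = Node(0, 10)
for i in range(5000):
    deep = Node(0, 10 + i + 1, [deep, Node(10 + i, 10 + i + 1)])

spans = split_iterative(deep, desired_length=10)
```

The same loop could serve either as the only traversal or as the fallback taken once a depth counter in the recursive path passes a cap.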

## Issue 2 — Bumping `setrecursionlimit` segfaults instead of erroring

`sys.setrecursionlimit(50000)` made the chunker run further but produced a coredump (no Python traceback). Python's recursion counter governs Python frames; the underlying C stack is bounded by `RLIMIT_STACK` (~8 MB default on Linux). At ~50k Python frames the C stack overflows and the process dies with `SIGSEGV`. So the apparent workaround is strictly worse than the original — it converts a recoverable Python exception into an uncontrolled process exit.

**This is why Issue 1 is the only safe failure mode today, and why fixing it via cap-and-fallback (rather than "just raise the recursion limit") is the right shape.**
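To make the two independent limits concrete, here is a small sketch that reads both using only the standard library. `stack_limits` is an illustrative helper name; the `resource` module is Unix-only, so this assumes a Linux or macOS host.

```python
import resource
import sys
from typing import Dict, Optional


def stack_limits() -> Dict[str, Optional[int]]:
    """Report the Python frame limit and the soft C stack limit side by side."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return {
        # Enforced by the interpreter: exceeding it raises RecursionError.
        "python_frames": sys.getrecursionlimit(),
        # Enforced by the kernel: exceeding it kills the process with SIGSEGV.
        # None means the soft limit is unlimited.
        "c_stack_bytes": None if soft == resource.RLIM_INFINITY else soft,
    }


limits = stack_limits()
```

Raising the first number with `sys.setrecursionlimit` does nothing to the second, which is why the process can run out of C stack before Python ever gets a chance to raise.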

## Issue 3 — No per-file error isolation in `chunk_source`

Even with Issue 1 fixed for typical files, any single file that still trips the chunker aborts the entire index build. There's no `try/except` around `chunk_source(file)` in the indexing loop, so one bad file kills indexing for the other 24,999 files.

**Suggested fix**: wrap per-file chunking in `try/except`, log the file path + exception class, and skip. The resulting index is partial-but-useful instead of nonexistent. The operator gets a clear list of "these files weren't chunked" that they can inspect.
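Roughly the shape we have in mind, as a sketch rather than a patch: `chunk_all` and its `chunk_file` callable are hypothetical names standing in for the indexing loop and the per-file `chunk_source` call, whose real signatures we haven't checked.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("indexer")


def chunk_all(
    paths: Iterable[str],
    chunk_file: Callable[[str], List[str]],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Chunk every path, skipping files whose chunking raises.

    Returns (chunks, skipped), where skipped holds (path, exception class name).
    """
    chunks: List[str] = []
    skipped: List[Tuple[str, str]] = []
    for path in paths:
        try:
            chunks.extend(chunk_file(path))
        except Exception as exc:  # RecursionError is a subclass of Exception
            logger.warning("skipping %s: %s", path, type(exc).__name__)
            skipped.append((path, type(exc).__name__))
    return chunks, skipped
```

This only helps while the failure is still a Python exception. It cannot catch the `SIGSEGV` from Issue 2, which is another reason to keep the recursion limit at its default.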

## Issue 4 (bonus) — CLI mode doesn't persist its index

Once we worked around Issues 1-3 to index a small subset (the SQL-only test corpus, ~8,500 files, no C++), `semble search` from the CLI re-indexed on every invocation. The MCP-server form caches indexes per session, but agents that invoke semble via shell subprocess (i.e., agents not using MCP — many production agent frameworks shell out for tools) pay the index cost on every call. On the test-corpus subset that's 24 seconds per call; on a larger surface where the cap-and-skip fixes above let indexing complete, it'd be much higher.

**Suggested fix**: persist the index to disk (e.g., `<repo>/.semble/index.bin` or `~/.cache/semble/<repo-hash>/index.bin`) and load on subsequent CLI invocations when source mtimes haven't moved. Same semantics as the MCP-server cache, just persisted across processes.
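A sketch of the load-or-rebuild logic, under stated assumptions: `fingerprint`, `load_or_build`, and the `build` callable are hypothetical names, the index is assumed to be picklable, and staleness is detected from path + mtime + size only. The cache is a local file the tool wrote itself; `pickle` should not be used to load files from untrusted sources.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable, Iterable, List


def fingerprint(files: Iterable[Path]) -> str:
    """Hash each file's path, mtime, and size; any change alters the digest."""
    h = hashlib.sha256()
    for p in sorted(files):
        st = p.stat()
        h.update(f"{p}\0{st.st_mtime_ns}\0{st.st_size}\n".encode())
    return h.hexdigest()


def load_or_build(
    files: Iterable[Path],
    cache_path: Path,
    build: Callable[[List[Path]], Any],
) -> Any:
    """Return the cached index if the sources are unchanged, else rebuild it."""
    files = list(files)
    fp = fingerprint(files)
    if cache_path.exists():
        try:
            with cache_path.open("rb") as f:
                cached = pickle.load(f)
            if cached.get("fingerprint") == fp:
                return cached["index"]
        except (pickle.UnpicklingError, EOFError, AttributeError, KeyError):
            pass  # Corrupt or unexpected cache contents: fall through and rebuild.
    index = build(files)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = cache_path.with_suffix(".tmp")
    with tmp.open("wb") as f:
        pickle.dump({"fingerprint": fp, "index": index}, f)
    tmp.replace(cache_path)  # Atomic rename so readers never see a partial file.
    return index
```

Stat-ing ~8,500 files should be far cheaper than re-chunking and re-embedding them, so unchanged repos would skip nearly all of the per-call cost.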

## Why this combo matters

Issues 1+3 together mean a single deeply-templated file kills indexing for the entire repo, with no fallback. Issue 4 means the operator can't even amortize the index cost by running queries in batch — every call rebuilds. For agents shelling out 3-5 times per investigation, that's 70-120 seconds per task on a small subset, more on the full repo.

Happy to test patches against ClickHouse master if you want a real-world repro, and to share the `.sembleignore` we used (`.worktrees/`, `build/`, the large `contrib/` subtrees) so you can reproduce locally without needing a full ClickHouse checkout.

Excellent project; really hoping this gets unblocked for C++ workloads at this scale.
