First, thanks for shipping semble; the agent-first design + zero-API-key install is exactly what's missing from the code-search-for-agents space. We piloted it against a large C++ codebase (ClickHouse, ~25K files, ~2M LOC of templated C++) and hit a chain of three issues that compound, plus a fourth, separate CLI issue. Reporting them together because (2) and (3) are why (1) is unrecoverable rather than just slow.
Repro target
- Repo: ClickHouse/ClickHouse at master
- Version: semble==0.1.10, installed via uv tool install semble
- Python: 3.10.20 (uv-managed)
- Kernel: Linux 6.1, aarch64
- Filesystem: ext4
Issue 1 — RecursionError in chunker on deeply-templated C++ ASTs
File ".../semble/chunking/core.py", line 93, in _merge_node_inner
groups.extend(_merge_node_inner(child, desired_length))
[Previous line repeated 987 more times]
RecursionError: maximum recursion depth exceeded
ClickHouse has many template-heavy .cpp/.h files (MergeTreeData.cpp and friends) where the tree-sitter parse tree nests deeper than Python's default 1000-frame recursion limit. _merge_node_inner recurses unboundedly on AST depth and trips the limit.
Suggested fix: cap the recursion depth in _merge_node_inner and switch to an iterative fallback (or a coarser chunk) past the cap. Files that would have been chunked at a finer grain still get chunked, just less precisely.
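To make the cap-and-fallback shape concrete, here is a minimal sketch of an explicit-stack traversal. It is not semble's code: `Node`, `split_node_iterative`, and `max_depth` are hypothetical names, `Node` only stands in for a tree-sitter node (a byte span plus children), and the sketch skips the sibling-merging step and any text that sits between children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node: a byte span plus children."""
    start: int
    end: int
    children: list["Node"] = field(default_factory=list)


def split_node_iterative(root, desired_length, max_depth=200):
    """Return (start, end) spans, splitting nodes larger than desired_length.

    An explicit stack replaces recursion, so tree depth never consumes
    Python frames. Past max_depth a subtree is emitted whole, which is the
    coarser-chunk fallback.
    """
    spans = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        size = node.end - node.start
        if size <= desired_length or not node.children or depth >= max_depth:
            spans.append((node.start, node.end))
            continue
        # Push children in reverse so they are popped in source order.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return spans
```

With this shape, a pathological file costs heap memory proportional to the tree's width at each level rather than interpreter stack proportional to its depth, and the depth cap only decides how coarse the deepest chunks are.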
Issue 2 — Bumping setrecursionlimit segfaults instead of erroring
sys.setrecursionlimit(50000) made the chunker run further but produced a coredump (no Python traceback). Python's recursion counter governs Python frames; the underlying C stack is bounded by RLIMIT_STACK (~8 MB default on Linux). At ~50k Python frames the C stack overflows and the process dies with SIGSEGV. So the apparent workaround is strictly worse than the original — it converts a recoverable Python exception into an uncontrolled process exit.
This is why Issue 1 is the only safe failure mode today, and why fixing it via cap-and-fallback (rather than "just raise the recursion limit") is the right shape.
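For contrast, a small sketch of the recoverable path: with the recursion limit left alone, excess depth surfaces as a `RecursionError` that the caller can catch and handle. The helpers `nest`, `depth`, and `safe_depth` are illustrative names only and have nothing to do with semble's internals.

```python
import sys


def nest(levels):
    """Build a list nested `levels` deep, e.g. nest(2) == [[[]]]."""
    node = []
    for _ in range(levels):
        node = [node]
    return node


def depth(node):
    """Recursive depth of a singly nested list; one Python frame per level."""
    return 1 if not node else 1 + depth(node[0])


def safe_depth(node):
    """Return the depth, or None if the interpreter's recursion limit is hit."""
    try:
        return depth(node)
    except RecursionError:
        # Still inside the interpreter: the caller can log, skip, or fall back.
        return None


# Deeper than whatever limit is currently configured, so this yields None
# instead of crashing the process.
too_deep = nest(sys.getrecursionlimit() + 100)
result = safe_depth(too_deep)
```

Once the limit is raised far enough for the C stack to overflow first, no `except` clause runs at all, which is the failure described above.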
Issue 3 — No per-file error isolation in chunk_source
Even with Issue 1 fixed for typical files, any single file that still trips the chunker aborts the entire index build. There's no try/except around chunk_source(file) in the indexing loop, so one bad file kills indexing for the other 24,999 files.
Suggested fix: wrap per-file chunking in try/except, log the file path + exception class, and skip. The resulting index is partial-but-useful instead of nonexistent. Operator gets a clear list of "these files weren't chunked" they can inspect.
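A minimal sketch of that isolation, assuming the indexing loop can be expressed as "call a chunker once per path". `chunk_all` and the `chunk_file` callable are hypothetical; the real loop around `chunk_source` may look different.

```python
import logging

logger = logging.getLogger("index_build")


def chunk_all(paths, chunk_file):
    """Chunk every path, skipping files whose chunking raises.

    Returns (chunks_by_path, skipped), where skipped is a list of
    (path, exception_class_name) pairs the operator can inspect.
    """
    chunks = {}
    skipped = []
    for path in paths:
        try:
            chunks[path] = chunk_file(path)
        except Exception as exc:  # RecursionError is an Exception subclass
            name = type(exc).__name__
            logger.warning("skipping %s: %s", path, name)
            skipped.append((path, name))
    return chunks, skipped
```

Catching `Exception` rather than `BaseException` leaves `KeyboardInterrupt` and `SystemExit` alone, so an operator can still abort a long index build.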
Issue 4 (bonus) — CLI mode doesn't persist its index
Once we worked around Issues 1-3 to index a small subset (the SQL-only test corpus, ~8,500 files, no C++), semble search from the CLI re-indexed on every invocation. The MCP-server form caches indexes per session, but agents that invoke semble via shell subprocess (i.e., agents not using MCP — many production agent frameworks shell out for tools) pay the index cost on every call. On the test-corpus subset that's 24 seconds per call; on a larger surface where the cap-and-skip fixes above let indexing complete, it'd be much higher.
Suggested fix: persist the index to disk (e.g., <repo>/.semble/index.bin or ~/.cache/semble/<repo-hash>/index.bin) and load on subsequent CLI invocations when source mtimes haven't moved. Same semantics as the MCP-server cache, just persisted across processes.
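One way this could work, sketched under assumptions: the index is picklable, and "mtimes haven't moved" is approximated by hashing each file's path, mtime, and size. `fingerprint`, `load_or_build`, and the `build` callable are hypothetical names, not part of semble.

```python
import hashlib
import os
import pickle
from pathlib import Path


def fingerprint(files):
    """Hash (path, mtime_ns, size) of every file; changes when any source does."""
    h = hashlib.sha256()
    for f in sorted(files):
        st = os.stat(f)
        h.update(f"{f}\0{st.st_mtime_ns}\0{st.st_size}\n".encode())
    return h.hexdigest()


def load_or_build(files, cache_path, build):
    """Return the cached index if the sources are unchanged, else rebuild it."""
    fp = fingerprint(files)
    cache_path = Path(cache_path)
    if cache_path.exists():
        try:
            with cache_path.open("rb") as fh:
                cached = pickle.load(fh)
            if cached.get("fingerprint") == fp:
                return cached["index"]
        except (pickle.UnpicklingError, EOFError, AttributeError, KeyError):
            pass  # Corrupt or foreign cache file: fall through and rebuild.
    index = build(files)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = cache_path.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump({"fingerprint": fp, "index": index}, fh)
    os.replace(tmp, cache_path)  # Atomic swap so readers never see a partial file.
    return index
```

Pickle is used here only to keep the sketch short; since unpickling can execute code, a real cache format should be something safer if the cache directory could ever be written by another party.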
Why this combo matters
Issues 1+3 together mean a single deeply-templated file kills indexing for the entire repo, with no fallback. Issue 4 means the operator can't even amortize the index cost by running queries in batch, because every call rebuilds. For agents shelling out 3-5 times per investigation, that's roughly 72-120 seconds per task (3-5 calls at 24 seconds each) on a small subset, and more on the full repo.
Happy to test patches against ClickHouse master if you want a real-world repro, and to share the .sembleignore we used (.worktrees/, build/, the large contrib/ subtrees) so you can reproduce locally without needing a full ClickHouse checkout.
Excellent project; really hoping this gets unblocked for C++ workloads at this scale.