Merged
110 changes: 82 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
@@ -69,19 +69,7 @@ semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
​```

If you anticipate doing more than one search, use `semble index` to create an index.

​```bash
semble index ./my-project -o my_index
​```

You can then reuse this index later on:

​```bash
semble search "save_pretrained" --index my_index
​```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
The index is built on first run (and cached for subsequent runs) and invalidated automatically when files change.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

@@ -97,20 +85,17 @@ Use `semble find-related` to discover code similar to a known location (pass `fi
semble find-related src/auth.py 42 ./my-project
​```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

### Workflow

1. Index the repo using `semble index -o cached_index`.
2. Start with `semble search` to find relevant chunks. Pass the index to achieve results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
1. Start with `semble search` to find relevant chunks. The index is built and cached automatically.
2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
3. Inspect full files only when the returned chunk does not give enough context.
4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
```

</details>
@@ -330,7 +315,55 @@ Add to `~/.config/zed/settings.json` (or `.zed/settings.json` in your project):
By default the MCP server indexes only code files. To also index documentation, config, or everything, append `--content docs`, `--content config`, or `--content all` to the server command, or a combination, e.g. `--content code docs`. For example, in Claude Code: `claude mcp add semble -s user -- uvx --from "semble[mcp]" semble --content all`.


## Sub-agent setup
<a id="bash-agentsmd"></a>

## Bash / AGENTS.md

An alternative to MCP is to invoke Semble via Bash. Sub-agents cannot call MCP tools directly, so this is the only option for sub-agent support; it can also be used alongside MCP for the top-level agent.

To add Bash support, append the following to your `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, or equivalent:

```markdown
## Code Search

Use `semble search` to find code by describing what it does or naming a symbol/identifier, instead of grep:

​```bash
semble search "authentication flow" ./my-project
semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
​```

The index is built on first run (and cached for subsequent runs) and invalidated automatically when files change.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

​```bash
semble search "deployment guide" ./my-project --content docs
semble search "database host port" ./my-project --content config
semble search "authentication" ./my-project --content all
​```

Use `semble find-related` to discover code similar to a known location (pass `file_path` and `line` from a prior search result):

​```bash
semble find-related src/auth.py 42 ./my-project
​```

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

### Workflow

1. Start with `semble search` to find relevant chunks. The index is built and cached automatically.
2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
3. Inspect full files only when the returned chunk does not give enough context.
4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
```

### Sub-agent setup

Claude Code, Gemini CLI, Cursor, OpenCode, GitHub Copilot CLI, and Kiro all support a dedicated semble search sub-agent. Run `semble init` once in your project root:

@@ -350,13 +383,9 @@ If semble is not on `$PATH`, prefix the command with `uvx --from "semble[mcp]"`.
Semble also ships as a standalone CLI. This is useful in scripts or anywhere you want search results without an MCP session.

```bash
# Search a local repo
# Search a local repo (index is built and cached automatically)
semble search "authentication flow" ./my-project

# Index first for faster repeated searches (--index works with any command below)
semble index ./my-project -o my-index
semble search "authentication flow" --index my-index

# Search a remote repo (cloned on demand)
semble search "save model to disk" https://github.com/MinishLab/model2vec

@@ -372,6 +401,31 @@ semble find-related src/auth.py 42 ./my-project

`--content` accepts `code` (default), `docs`, `config`, or `all`. `path` defaults to the current directory when omitted; git URLs are accepted. If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.
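
For scripted use, the commands above can be assembled programmatically. The following is a minimal sketch and not part of semble: the helper names `build_search_command`, `build_find_related_command`, and `run` are hypothetical, and only the subcommands and flags shown in this README (`search`, `find-related`, `--top-k`, `--content`) are used.

```python
from __future__ import annotations

import subprocess

# Content values listed in the README; `code` is the documented default.
CONTENT_CHOICES = ("code", "docs", "config", "all")


def build_search_command(
    query: str,
    path: str = ".",
    top_k: int | None = None,
    content: str | None = None,
) -> list[str]:
    """Return the argv for `semble search` (hypothetical helper, not part of semble)."""
    cmd = ["semble", "search", query, path]
    if top_k is not None:
        cmd += ["--top-k", str(top_k)]
    if content is not None:
        if content not in CONTENT_CHOICES:
            raise ValueError(f"content must be one of {CONTENT_CHOICES}, got {content!r}")
        cmd += ["--content", content]
    return cmd


def build_find_related_command(file_path: str, line: int, path: str = ".") -> list[str]:
    """Return the argv for `semble find-related` (hypothetical helper)."""
    return ["semble", "find-related", file_path, str(line), path]


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout; assumes `semble` is on PATH."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Building the argument list separately from running it keeps the flag handling easy to check without invoking the CLI.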

<details>
<summary>Controlling which files are indexed</summary>

Semble reads `.gitignore` and `.sembleignore` files to determine which files to index. Both files use standard gitignore syntax and their patterns are merged. `.sembleignore` lets you add semble-specific rules without touching `.gitignore`. Rules are applied recursively, so a `.sembleignore` in a subdirectory applies to that subtree.

**Excluding files:** add patterns the same way you would in `.gitignore`:

```
# .sembleignore
# exclude the generated directory
generated/
# exclude Go protobuf files
*.pb.go
```

**Including non-default extensions:** prefix the extension pattern with `!` to force-include files that semble wouldn't index by default:

```
# .sembleignore
# include Protobuf files
!*.proto
# include COBOL files
!*.cob
```

Semble also always skips a set of well-known non-source directories regardless of ignore files (e.g. `node_modules/`, `.venv/`, `dist/`, `build/`, `__pycache__/`, and similar).
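
To illustrate how exclude patterns and `!` include patterns can combine, here is a deliberately simplified sketch. It is not semble's matcher and does not implement full gitignore semantics (no `**`, no anchoring, no per-directory scoping): the function names `parse_patterns` and `decide` are hypothetical, matching uses Python's `fnmatch`, and the last matching pattern wins.

```python
from __future__ import annotations

from fnmatch import fnmatch
from pathlib import PurePosixPath


def parse_patterns(text: str) -> list[str]:
    """Keep non-empty lines that are not comments (a comment must start its line)."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def decide(path: str, patterns: list[str]) -> str | None:
    """Return "exclude", "include", or None when no pattern matches.

    Simplification: patterns ending in "/" match parent directory names,
    all other patterns match the file name only. The last match wins.
    """
    parts = PurePosixPath(path).parts
    verdict = None
    for raw in patterns:
        negated = raw.startswith("!")
        pattern = raw[1:] if negated else raw
        if pattern.endswith("/"):
            matched = any(fnmatch(part, pattern.rstrip("/")) for part in parts[:-1])
        else:
            matched = fnmatch(parts[-1], pattern)
        if matched:
            verdict = "include" if negated else "exclude"
    return verdict


patterns = parse_patterns("# .sembleignore\ngenerated/\n*.pb.go\n!*.proto\n")
print(decide("generated/api/client.py", patterns))
print(decide("api/service.proto", patterns))
```

A result of `None` means no rule applied, in which case the tool's default extension and directory handling would decide.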

</details>

<details>
<summary>Savings</summary>

@@ -394,7 +448,7 @@ semble savings --verbose # also show breakdown by call type

Savings are calculated as follows: for each call, semble records the total character count of the unique files containing returned chunks and the character count of the snippets returned. Estimated tokens saved is `(file chars − snippet chars) / 4` (4 chars per token). This is a conservative estimate: the baseline is reading matched files in full, which is how coding agents often explore unfamiliar code.
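
The estimate described above can be written out as a short sketch. The function names and the record fields `file_chars` and `snippet_chars` are hypothetical and are not taken from semble's stats format; only the formula and the 4-characters-per-token ratio come from the description.

```python
from __future__ import annotations

CHARS_PER_TOKEN = 4  # ratio stated in the README


def estimate_tokens_saved(file_chars: int, snippet_chars: int) -> float:
    """Tokens saved for one call: (file chars - snippet chars) / 4."""
    return (file_chars - snippet_chars) / CHARS_PER_TOKEN


def total_tokens_saved(calls: list[dict[str, int]]) -> float:
    """Sum the per-call estimate over recorded calls (hypothetical record shape)."""
    return sum(estimate_tokens_saved(c["file_chars"], c["snippet_chars"]) for c in calls)


# One call whose matched files total 12,000 chars, with 800 chars of snippets returned.
print(estimate_tokens_saved(12_000, 800))  # 2800.0
```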

Stats are stored in `~/.semble/savings.jsonl`.
Stats are stored in the OS cache folder (`~/Library/Caches/semble/` on macOS, `~/.cache/semble/` on Linux, `%LOCALAPPDATA%\semble\Cache\` on Windows).
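
To locate that folder from a script, the three paths listed above can be mirrored as follows. This is a sketch under assumptions: `semble_cache_dir` is a hypothetical helper, semble may resolve the directory through a library and honor environment overrides that are not modeled here.

```python
from __future__ import annotations

import os
import sys
from pathlib import Path


def semble_cache_dir(
    platform: str = sys.platform,
    home: Path | None = None,
    local_app_data: str | None = None,
) -> Path:
    """Return the cache folder documented for each OS (hypothetical helper)."""
    home = home if home is not None else Path.home()
    if platform == "darwin":
        # macOS: ~/Library/Caches/semble/
        return home / "Library" / "Caches" / "semble"
    if platform.startswith("win"):
        # Windows: %LOCALAPPDATA%\semble\Cache\
        base = local_app_data or os.environ.get("LOCALAPPDATA")
        if base is None:
            raise RuntimeError("LOCALAPPDATA is not set")
        return Path(base) / "semble" / "Cache"
    # Linux and other platforms: ~/.cache/semble/
    return home / ".cache" / "semble"
```

The platform, home directory, and `LOCALAPPDATA` value are parameters so the function can be exercised for all three systems from any machine.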

</details>

4 changes: 2 additions & 2 deletions benchmarks/baselines/ablations.py
@@ -17,7 +17,7 @@
)
from benchmarks.run_benchmark import RepoResult, evaluate
from semble import SembleIndex
from semble.index.dense import _DEFAULT_MODEL_NAME
from semble.utils import DEFAULT_MODEL_NAME

# alpha=None → raw mode, input depends on query
# alpha=0.0 → hybrid pipeline, BM25-only input
@@ -129,7 +129,7 @@ def main() -> None:

summary = {
"tool": "semble-ablations",
"model": _DEFAULT_MODEL_NAME,
"model": DEFAULT_MODEL_NAME,
"by_mode": summarize_modes(results, modes),
"repos": [asdict(r) for r in results],
}
4 changes: 2 additions & 2 deletions benchmarks/run_benchmark.py
@@ -16,8 +16,8 @@
)
from benchmarks.metrics import ndcg_at_k, target_rank
from semble import SembleIndex
from semble.index.dense import _DEFAULT_MODEL_NAME
from semble.types import SearchResult
from semble.utils import DEFAULT_MODEL_NAME

_LATENCY_RUNS = 5
_DIRECT_TOP_K = 10
@@ -259,7 +259,7 @@ def _save_results(results: list[RepoResult]) -> None:
n_repos = len(results)
output = {
"tool": "semble-hybrid",
"model": _DEFAULT_MODEL_NAME,
"model": DEFAULT_MODEL_NAME,
"summary": {
"ndcg10": round(sum(r.ndcg10 for r in results) / n_repos, 4),
"tokens": round(sum(r.tokens for r in results) / n_repos, 0),
4 changes: 2 additions & 2 deletions benchmarks/speed_benchmark.py
@@ -11,8 +11,8 @@
from benchmarks.data import RepoSpec, Task, available_repo_specs, load_tasks, save_results
from benchmarks.tools import run_colgrep_files, run_ripgrep_count
from semble import SembleIndex
from semble.index.dense import _DEFAULT_MODEL_NAME
from semble.types import EmbeddingMatrix
from semble.utils import DEFAULT_MODEL_NAME

# One representative repo per language (medium size, healthy NDCG on the main benchmark).
_REPOS: list[str] = [
@@ -192,7 +192,7 @@ def main() -> None:

print("Loading semble model...", file=sys.stderr)
started = time.perf_counter()
semble_model = StaticModel.from_pretrained(_DEFAULT_MODEL_NAME)
semble_model = StaticModel.from_pretrained(DEFAULT_MODEL_NAME)
print(f" loaded in {(time.perf_counter() - started) * 1000:.0f}ms", file=sys.stderr)

print("Loading CodeRankEmbed...", file=sys.stderr)
4 changes: 2 additions & 2 deletions benchmarks/token_efficiency.py
@@ -24,10 +24,10 @@
target_matches_location,
)
from semble import SembleIndex
from semble.index.dense import _DEFAULT_MODEL_NAME
from semble.index.file_walker import DEFAULT_IGNORED_DIRS, FILE_TYPES, FileCategory
from semble.ranking.boosting import _STOPWORDS as _SEMBLE_STOPWORDS
from semble.types import Chunk
from semble.utils import DEFAULT_MODEL_NAME

_RG_INCLUDE_GLOBS: tuple[str, ...] = tuple(
f"*{ext}" for ext, spec in FILE_TYPES.items() if spec.category == FileCategory.CODE
@@ -378,7 +378,7 @@ def run_recall(args: argparse.Namespace) -> None:

print("Loading tokenizer + model...", file=sys.stderr)
enc = tiktoken.get_encoding(_TOKENIZER_NAME)
model = StaticModel.from_pretrained(_DEFAULT_MODEL_NAME)
model = StaticModel.from_pretrained(DEFAULT_MODEL_NAME)

method_curves: dict[str, MethodCurves] = defaultdict(list)
print(f"\n{'Repo':<22} {'Language':<12} {'Tasks':>6} {'Time':>8}", file=sys.stderr)
27 changes: 6 additions & 21 deletions src/semble/agents/claude.md
@@ -12,19 +12,7 @@ semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `semble index` to create an index.

```bash
semble index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
semble search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
The index is built and cached automatically on first run and invalidated when files change.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

@@ -40,17 +28,14 @@ Use `semble find-related` to discover code similar to a known location (pass `fi
semble find-related src/auth.py 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

### Workflow

1. Index the repo using `semble index -o cached_index`.
2. Start with `semble search` to find relevant chunks. Pass the index to achieve results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
1. Start with `semble search` to find relevant chunks. The index is built and cached automatically.
2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
3. Inspect full files only when the returned chunk does not give enough context.
4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
27 changes: 6 additions & 21 deletions src/semble/agents/copilot.md
@@ -12,19 +12,7 @@ semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `semble index` to create an index.

```bash
semble index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
semble search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
The index is built and cached automatically on first run and invalidated when files change.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

@@ -40,17 +28,14 @@ Use `semble find-related` to discover code similar to a known location (pass `fi
semble find-related src/auth.py 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

### Workflow

1. Index the repo using `semble index -o cached_index`.
2. Start with `semble search` to find relevant chunks. Pass the index to achieve results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
1. Start with `semble search` to find relevant chunks. The index is built and cached automatically.
2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
3. Inspect full files only when the returned chunk does not give enough context.
4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
27 changes: 6 additions & 21 deletions src/semble/agents/cursor.md
@@ -11,19 +11,7 @@ semble search "save_pretrained" ./my-project
semble search "save model to disk" ./my-project --top-k 10
```

If you anticipate doing more than one search, use `semble index` to create an index.

```bash
semble index ./my-project -o my_index
```

You can then reuse this index later on:

```bash
semble search "save_pretrained" --index my_index
```

An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
The index is built and cached automatically on first run and invalidated when files change.

Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

@@ -39,17 +27,14 @@ Use `semble find-related` to discover code similar to a known location (pass `fi
semble find-related src/auth.py 42 ./my-project
```

Like search, `find-related` also accepts an `--index` argument.

`path` defaults to the current directory when omitted; git URLs are accepted.

If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

### Workflow

1. Index the repo using `semble index -o cached_index`.
2. Start with `semble search` to find relevant chunks. Pass the index to achieve results faster.
3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
4. Inspect full files only when the returned chunk does not give enough context.
5. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
1. Start with `semble search` to find relevant chunks. The index is built and cached automatically.
2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
3. Inspect full files only when the returned chunk does not give enough context.
4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.