feat: add hybrid search mode with Reciprocal Rank Fusion #157
Adds a 'hybrid' search mode that combines vector similarity search with keyword matching using Reciprocal Rank Fusion (RRF). RRF operates on rank positions rather than raw scores, making it robust to scale incompatibility between embedding distances and keyword match counts. Items appearing in both result sets receive a consensus boost.

Formula: RRF_score(d) = sum(1/(k + rank_i(d))) + consensus_boost
Reference: Cormack, Clarke & Buettcher (2009)

The keyword search uses INSTR-based term matching on the existing code_chunks_vec table, requiring no schema changes or FTS5 setup.

Also includes exclude_paths support (from cocoindex-io#146) as it shares the same full-scan query path.

Usage:
MCP: {"query": "auth", "mode": "hybrid"}
CLI: ccc search 'auth' --mode hybrid

Closes cocoindex-io#44
- Protocol: encode/decode with exclude_paths, mode, backward compat
- RRF fusion: vector-only, keyword-only, consensus boost, limit, score formula
- 8 tests, all passing
@gemini-code-assist review
- Use CTE instead of non-standard HAVING without GROUP BY
- Fetch more vector results before RRF fusion (limit*3) to avoid missing relevant matches that rank just outside the vector limit
- Extract _RRF_CONSENSUS_BOOST as named constant (was magic 0.003)
- Lower min keyword length to 2 (captures io, go, fs, db)
- Use regex tokenizer instead of naive split for code-like terms
- Validate mode parameter in query_codebase() and CLI
- Add 5 new tests for keyword extraction (13 total)
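The tokenizer and minimum-length changes listed above could look roughly like the following. This is a minimal sketch, not the PR's code: the function name `extract_keywords`, the constant `MIN_KEYWORD_LEN`, and the exact regex are assumptions made for illustration.

```python
import re

# Assumed value, taken from the commit note "Lower min keyword length to 2".
MIN_KEYWORD_LEN = 2

# Hypothetical pattern: runs of letters, digits and underscores, so that
# code-like input such as "db.get_user(id)" is split on punctuation.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")


def extract_keywords(query: str, min_len: int = MIN_KEYWORD_LEN) -> list[str]:
    """Return lowercased, de-duplicated keywords in first-seen order."""
    seen: set[str] = set()
    keywords: list[str] = []
    for token in _TOKEN_RE.findall(query.lower()):
        if len(token) >= min_len and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


if __name__ == "__main__":
    # A whitespace split would keep "db.get_user(id)" as one term.
    print("db.get_user(id)".split())
    print(extract_keywords("a db.get_user(id)"))
```

With a minimum length of 2, short but meaningful terms such as io, go, fs and db survive, while single characters are dropped.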
- CodeChunk: creation, embedding type flexibility
- QueryResult: creation, score range
- Chunking exports: Chunk, TextPosition, ChunkerFn, CHUNKER_REGISTRY
- 8 new tests, all passing
I had this idea once, but I think this feature is already included in LanceDB, which cocoindex uses.
Thanks for the feedback @faysou! You're right that LanceDB has built-in vector search, but this PR adds something different: hybrid search combining BM25 (keyword) + vector (semantic) via Reciprocal Rank Fusion (RRF). The key difference:
In practice, searching for Happy to discuss further or adjust the approach!
review: hybrid search + exclude_paths + logging fix
chore: add tests for schema/chunking + ruff cleanup
Sorry about the accidental close. I deleted the source branch while merging internal PRs on my fork, which caused GitHub to auto-close this. Branch restored, PR reopened. 🙏
When I researched this, I thought ccc was already configured for hybrid search.
@faysou I just checked the upstream You might be thinking of the cocoindex framework (which does use LanceDB), but cocoindex-code ( This PR adds keyword matching + RRF fusion on top of the existing sqlite-vec infrastructure, with no new dependencies.
CCC uses sqlite-vec, not LanceDB. Here are examples of hybrid vector + BM25 search that could work with sqlite-vec. Great idea and PR! I need HybridSearch in CCC!
@qdrddr Thanks! That's exactly the pattern: FTS5 for BM25 + vec0 for semantic, fused with RRF. This PR implements it end-to-end with the existing sqlite-vec backend. @georgeh0 @badmonster0, I would appreciate a review when you get a chance. Happy to adjust the implementation if needed.
Summary
Adds a hybrid search mode that combines vector similarity search with keyword matching using Reciprocal Rank Fusion (RRF).
Problem
As described in #44, pure vector search can miss results where the query terms appear literally in the code but the semantic embedding doesn't capture the match well (e.g., specific function names, error messages, config keys).
Solution
Hybrid Search with RRF
When mode='hybrid' is specified, the vector similarity results and the keyword matches from the code_chunks_vec table are fused with RRF:
RRF_score(d) = Σ 1/(k + rank_i(d))
RRF operates on rank positions rather than raw scores, making it robust to scale incompatibility between embedding distances and keyword match counts.
Reference: Cormack, Clarke & Buettcher (2009)
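To make the fusion step concrete, here is a minimal sketch of RRF with a consensus boost. It is not the PR's `_fuse_rrf()`: the function name `fuse_rrf`, its signature, and k = 60 (the value used in the cited paper) are assumptions, and the boost of 0.003 is taken from the commit note about the earlier magic constant. Each input list is assumed to contain unique ids ordered best first.

```python
from collections import defaultdict

# Assumption: k = 60 as in Cormack et al.; the PR's actual value is not shown.
RRF_K = 60
# Assumption: 0.003, the value the commit notes mention for the consensus boost.
CONSENSUS_BOOST = 0.003


def fuse_rrf(vector_ids, keyword_ids, k=RRF_K, boost=CONSENSUS_BOOST, limit=10):
    """Fuse two ranked id lists (best first) into (id, score) pairs."""
    scores = defaultdict(float)
    # Each list contributes 1 / (k + rank), with ranks starting at 1.
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ids found by both searches receive the extra consensus boost.
    for doc_id in set(vector_ids) & set(keyword_ids):
        scores[doc_id] += boost
    # Highest score first; ties are broken by id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    for doc_id, score in fuse_rrf(["a", "b", "c"], ["b", "d"]):
        print(f"{doc_id}: {score:.6f}")
```

Only rank positions enter the score, so embedding distances and keyword match counts never need to be put on a common scale. In the example, "b" ranks first because it appears in both lists.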
No Schema Changes Required
The keyword search uses INSTR(LOWER(content), LOWER(?)) on the existing auxiliary columns of code_chunks_vec. No FTS5 table or index changes needed.
Files Changed
- query.py: _keyword_query(), _fuse_rrf(), mode parameter in query_codebase()
- protocol.py: mode field on SearchRequest
- client.py: mode parameter
- project.py: mode parameter
- daemon.py: mode parameter
- server.py: mode MCP tool parameter
- cli.py: --mode CLI flag
Usage
MCP tool:
{"query": "authentication middleware", "mode": "hybrid"}
CLI:
ccc search 'authentication middleware' --mode hybrid
Default mode remains semantic for backward compatibility.
Closes #44
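For the INSTR-based keyword search described under "No Schema Changes Required", the following sketch shows one way such a query could be written with Python's built-in sqlite3 module. It is an illustration under stated assumptions: an ordinary table named code_chunks_vec stands in for the real sqlite-vec table, the file_path column and the function name `keyword_search` are hypothetical, and the match-count ranking is a guess at how results might be ordered. The CTE mirrors the commit note about replacing HAVING without GROUP BY.

```python
import sqlite3


def keyword_search(conn, keywords, limit=10):
    """Rank rows by how many keywords occur in content (case-insensitive)."""
    if not keywords:
        return []
    # One "(INSTR(...) > 0)" term per keyword; each evaluates to 0 or 1 in SQLite.
    # Only placeholders are interpolated, never user input.
    match_expr = " + ".join(
        "(INSTR(LOWER(content), LOWER(?)) > 0)" for _ in keywords
    )
    sql = f"""
        WITH scored AS (
            SELECT file_path, {match_expr} AS match_count
            FROM code_chunks_vec
        )
        SELECT file_path, match_count
        FROM scored
        WHERE match_count > 0
        ORDER BY match_count DESC, file_path
        LIMIT ?
    """
    return conn.execute(sql, (*keywords, limit)).fetchall()


def make_demo_db():
    """Build an in-memory stand-in table (a plain table, not sqlite-vec)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE code_chunks_vec (file_path TEXT, content TEXT)")
    conn.executemany(
        "INSERT INTO code_chunks_vec VALUES (?, ?)",
        [
            ("auth.py", "def check_auth(token): return verify(token)"),
            ("db.py", "def connect_db(): pass"),
            ("mw.py", "class AuthMiddleware: token = None"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(keyword_search(make_demo_db(), ["auth", "token"]))
```

Because INSTR scans every row, this is a full-scan query, which fits the PR's remark that exclude_paths shares the same full-scan path.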