
feat: add hybrid search mode with Reciprocal Rank Fusion #157

Open
filhocf wants to merge 8 commits into cocoindex-io:main from filhocf:feat/hybrid-search



@filhocf filhocf commented May 3, 2026

Summary

Adds a hybrid search mode that combines vector similarity search with keyword matching using Reciprocal Rank Fusion (RRF).

Problem

As described in #44, pure vector search can miss results where the query terms appear literally in the code but the semantic embedding doesn't capture the match well (e.g., specific function names, error messages, config keys).

Solution

Hybrid Search with RRF

When mode='hybrid' is specified:

  1. Vector search runs as usual (semantic similarity via vec0)
  2. Keyword search runs in parallel using INSTR-based term matching on the existing code_chunks_vec table
  3. Results are fused using Reciprocal Rank Fusion: RRF_score(d) = Σ 1/(k + rank_i(d))
  4. Items appearing in both result sets receive a consensus boost

RRF operates on rank positions rather than raw scores, making it robust to scale incompatibility between embedding distances and keyword match counts.

Reference: Cormack, Clarke & Buettcher (2009)
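The fusion step described above could look roughly like the following sketch. It is not the PR's `_fuse_rrf()` code: the name `fuse_rrf`, its signature, and `k = 60` (the constant used in the cited paper) are assumptions, and `0.003` is the boost value the commit notes mention as the original literal.

```python
from collections import defaultdict

RRF_K = 60               # assumption: value from the RRF paper, not confirmed for this PR
CONSENSUS_BOOST = 0.003  # value mentioned in the commit notes


def fuse_rrf(vector_ids, keyword_ids, limit, k=RRF_K, boost=CONSENSUS_BOOST):
    """Fuse two ranked ID lists (best first) with Reciprocal Rank Fusion.

    Hypothetical helper for illustration only.
    """
    scores = defaultdict(float)
    # Each list contributes 1 / (k + rank) for every item it contains.
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Items found by both searches get a small additive consensus boost.
    for doc_id in set(vector_ids) & set(keyword_ids):
        scores[doc_id] += boost
    # Highest score first; ties are broken by ID so the order is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ordered[:limit]


fused = fuse_rrf(["a", "b", "c"], ["c", "d"], limit=3)
```

Here `c` is only third in the vector list but first in the keyword list, so its two rank contributions plus the boost place it ahead of `a` and `b`. Only rank positions enter the score; the raw distances and match counts never do.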

No Schema Changes Required

The keyword search uses INSTR(LOWER(content), LOWER(?)) on the existing auxiliary columns of code_chunks_vec. No FTS5 table or index changes are needed.
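As a rough illustration of INSTR-based term matching, the sketch below counts how many query terms occur in each chunk. It uses a plain in-memory table named `code_chunks` as a stand-in for the vec0-backed `code_chunks_vec`, so the table layout, the `file_path` column, and the helper name `keyword_query` are all assumptions rather than the PR's actual `_keyword_query()`.

```python
import sqlite3


def keyword_query(conn, terms, limit):
    """Rank chunks by the number of query terms they contain (case-insensitive).

    Hypothetical helper; the real query runs against code_chunks_vec.
    """
    if not terms:
        return []
    # One INSTR test per term; each evaluates to 0 or 1 and they are summed.
    hits = " + ".join("(INSTR(LOWER(content), LOWER(?)) > 0)" for _ in terms)
    # The CTE computes the count once so the outer query can filter on it.
    sql = f"""
        WITH scored AS (
            SELECT file_path, {hits} AS match_count
            FROM code_chunks
        )
        SELECT file_path, match_count
        FROM scored
        WHERE match_count > 0
        ORDER BY match_count DESC, file_path
        LIMIT ?
    """
    return conn.execute(sql, [*terms, limit]).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE code_chunks (file_path TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO code_chunks VALUES (?, ?)",
    [
        ("auth.py", "def handleRequest(req): check_token(req)"),
        ("db.py", "def connect(): pass"),
        ("mw.py", "class AuthMiddleware: validates the bearer token"),
    ],
)
rows = keyword_query(conn, ["handleRequest", "token"], limit=10)
```

Because the match is a plain substring test, `token` also matches inside `check_token`. That suits identifier lookups, but it is not a relevance model such as BM25.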

Files Changed

| File | Change |
|------|--------|
| query.py | _keyword_query(), _fuse_rrf(), mode parameter in query_codebase() |
| protocol.py | mode field on SearchRequest |
| client.py | Pass-through mode parameter |
| project.py | Pass-through mode parameter |
| daemon.py | Pass-through mode parameter |
| server.py | New mode MCP tool parameter |
| cli.py | New --mode CLI flag |
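To show how a new `mode` field can stay backward compatible on the wire, here is a minimal sketch. It assumes a JSON-encoded dataclass; the real `SearchRequest` in protocol.py may use a different serializer and other fields, so `encode`, `decode`, and the field layout are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SearchRequest:
    """Hypothetical stand-in for the request type in protocol.py."""
    query: str
    mode: str = "semantic"  # default keeps older clients working
    exclude_paths: list = field(default_factory=list)


def encode(request):
    """Serialize a request to a JSON string."""
    return json.dumps(asdict(request))


def decode(payload):
    """Parse a JSON payload; fields missing from older clients use defaults."""
    return SearchRequest(**json.loads(payload))


old_client_request = decode('{"query": "auth"}')
new_client_request = decode(encode(SearchRequest("auth", mode="hybrid")))
```

A payload written before `mode` existed decodes to a semantic search, which matches the PR's stated default.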

Usage

MCP tool:

{"query": "authentication middleware", "mode": "hybrid"}

CLI:

ccc search 'authentication middleware' --mode hybrid

Default mode remains semantic for backward compatibility.
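The flag and its default could be wired up along these lines. This is a sketch using `argparse`; the actual cli.py may use a different CLI library, and the assumption that `semantic` and `hybrid` are the only accepted values is inferred from this description.

```python
import argparse

VALID_MODES = ("semantic", "hybrid")  # assumption: only these two modes exist


def validate_mode(mode):
    """Reject unknown modes early, as a query_codebase()-style check might."""
    if mode not in VALID_MODES:
        raise ValueError(f"invalid mode {mode!r}; expected one of {VALID_MODES}")
    return mode


def build_parser():
    """Build a minimal 'ccc search' parser with a --mode flag."""
    parser = argparse.ArgumentParser(prog="ccc")
    sub = parser.add_subparsers(dest="command", required=True)
    search = sub.add_parser("search")
    search.add_argument("query")
    # Omitting --mode falls back to the pre-existing semantic behaviour.
    search.add_argument("--mode", choices=VALID_MODES, default="semantic")
    return parser


hybrid_args = build_parser().parse_args(
    ["search", "authentication middleware", "--mode", "hybrid"]
)
default_args = build_parser().parse_args(["search", "authentication middleware"])
```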

Closes #44

filhocf added 2 commits May 3, 2026 20:34
Adds a 'hybrid' search mode that combines vector similarity search with
keyword matching using Reciprocal Rank Fusion (RRF).

RRF operates on rank positions rather than raw scores, making it robust
to scale incompatibility between embedding distances and keyword match
counts. Items appearing in both result sets receive a consensus boost.

Formula: RRF_score(d) = sum(1/(k + rank_i(d))) + consensus_boost
Reference: Cormack, Clarke & Buettcher (2009)

The keyword search uses INSTR-based term matching on the existing
code_chunks_vec table, requiring no schema changes or FTS5 setup.

Also includes exclude_paths support (from cocoindex-io#146) as it shares the
same full-scan query path.

Usage:
  MCP: {"query": "auth", "mode": "hybrid"}
  CLI: ccc search 'auth' --mode hybrid

Closes cocoindex-io#44

- Protocol: encode/decode with exclude_paths, mode, backward compat
- RRF fusion: vector-only, keyword-only, consensus boost, limit, score formula
- 8 tests, all passing
Author

filhocf commented May 3, 2026

@gemini-code-assist review

filhocf added 4 commits May 3, 2026 20:56
- Use CTE instead of non-standard HAVING without GROUP BY
- Fetch more vector results before RRF fusion (limit*3) to avoid
  missing relevant matches that rank just outside the vector limit
- Extract _RRF_CONSENSUS_BOOST as named constant (was magic 0.003)
- Lower min keyword length to 2 (captures io, go, fs, db)
- Use regex tokenizer instead of naive split for code-like terms
- Validate mode parameter in query_codebase() and CLI
- Add 5 new tests for keyword extraction (13 total)

- CodeChunk: creation, embedding type flexibility
- QueryResult: creation, score range
- Chunking exports: Chunk, TextPosition, ChunkerFn, CHUNKER_REGISTRY
- 8 new tests, all passing
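The keyword-extraction changes in the commit notes above (a regex tokenizer and a minimum keyword length of 2) might be sketched as follows. The exact pattern used in the PR is not shown, so the regex, the lowercasing, the de-duplication, and the name `extract_keywords` are assumptions.

```python
import re

# Assumption: identifiers are runs of letters, digits, and underscores.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")
MIN_KEYWORD_LEN = 2  # short enough to keep terms such as io, go, fs, db


def extract_keywords(query):
    """Split a query into lowercase, de-duplicated keyword terms."""
    seen = set()
    keywords = []
    for token in _TOKEN_RE.findall(query):
        lowered = token.lower()
        if len(lowered) >= MIN_KEYWORD_LEN and lowered not in seen:
            seen.add(lowered)
            keywords.append(lowered)
    return keywords


terms = extract_keywords("fs.readFile(path) in go")
```

A plain whitespace split would keep `fs.readFile(path)` as a single token, which would not match a chunk containing only `readFile`. Splitting on punctuation recovers the individual identifiers.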
Contributor

faysou commented May 4, 2026

I had this idea once, but I think this feature is already included in LanceDB, which cocoindex uses.

Author

filhocf commented May 4, 2026

Thanks for the feedback @faysou! You're right that LanceDB has built-in vector search, but this PR adds something different: hybrid search combining BM25 (keyword) + vector (semantic) via Reciprocal Rank Fusion (RRF).

The key difference:

  • LanceDB's native search is pure vector similarity — great for semantic matches but misses exact keyword hits (function names, error codes, config keys)
  • BM25 excels at exact/partial term matching but has no semantic understanding
  • RRF merges both ranked lists without needing score normalization, giving you the best of both worlds

In practice, searching for handleRequest with pure vector search might return semantically similar functions but miss the exact one. Hybrid search catches both. This is especially valuable for code search where identifiers matter as much as meaning.

Happy to discuss further or adjust the approach!

review: hybrid search + exclude_paths + logging fix
@filhocf filhocf closed this May 4, 2026
@filhocf filhocf deleted the feat/hybrid-search branch May 4, 2026 11:45
chore: add tests for schema/chunking + ruff cleanup
@filhocf filhocf reopened this May 4, 2026
Author

filhocf commented May 4, 2026

Sorry about the accidental close — I deleted the source branch while merging internal PRs on my fork, which caused GitHub to auto-close this. Branch restored, PR reopened. 🙏

Contributor

faysou commented May 4, 2026

When I researched this, I came away thinking that ccc is already configured for hybrid search.

Author

filhocf commented May 4, 2026

@faysou I just checked the upstream query.py — the search backend is sqlite-vec (vec0), not LanceDB. There's no FTS, no keyword matching, and no hybrid search in the current codebase. The only search paths are _knn_query() (vec0 KNN index) and _full_scan_query() (full scan with vec_distance_L2), both purely vector-based.

You might be thinking of the cocoindex framework (which does use LanceDB), but cocoindex-code (ccc) is a separate project with a different storage backend.

This PR adds keyword matching + RRF fusion on top of the existing sqlite-vec infrastructure, with no new dependencies.


qdrddr commented May 14, 2026

CCC uses sqlite-vec, not LanceDB.

Here is an example of hybrid search (vector + BM25) that could work with sqlite-vec:

https://github.com/asg017/sqlite-vec/blob/5778fecfebaddafc23b69a3a4b91a8ee80e37a92/examples/nbc-headlines/3_search.ipynb

Great idea and PR! I need HybridSearch in CCC!

Author

filhocf commented May 14, 2026

@qdrddr Thanks! That's exactly the pattern — FTS5 for BM25 + vec0 for semantic, fused with RRF. This PR implements it end-to-end with the existing sqlite-vec backend.

@georgeh0 @badmonster0 — would appreciate a review when you get a chance. Happy to adjust the implementation if needed.

Development

Successfully merging this pull request may close these issues.

Add hybrid (vector + keyword) search

3 participants