feat(vectordb): add Qdrant backend support #232

Open
mclamee wants to merge 1 commit into volcengine:main from mclamee:feature/qdrant-backend-support

@mclamee mclamee commented Feb 20, 2026

Summary

Add Qdrant as an alternative open-source vector database backend, giving users a self-hosted option alongside VikingDB/Volcengine.

  • QdrantCollection: Full ICollection implementation — dense/sparse hybrid search, scalar search with native order_by, full-text keyword search via payload indexes, multimodal search, aggregation with pagination, and proper filter translation
  • QdrantProject: Collection lifecycle management with configurable distance metrics and vector dimensions
  • QdrantConfig: Configuration model (URL, API key, gRPC, timeout)
  • Backend factory: Register qdrant backend type in viking_vector_index_backend.py
  • Optional dependency: pip install openviking[qdrant] (qdrant-client >= 1.9.0)

Design Decisions

1. ID System — Dual-Tracking (String ↔ UUID)

Problem: VikingDB uses arbitrary string IDs (e.g. "doc_123"), but Qdrant requires UUID or uint64 point IDs.

Solution: Deterministic UUID5 mapping with original ID preservation.

  • string_to_qdrant_id(id) converts string IDs to UUID5 using a fixed namespace (f47ac10b-58cc-4372-a567-0e02b2c3d479), ensuring the same string always maps to the same UUID — stable across processes and restarts.
  • The original string ID is stored in _original_id payload field for round-trip fidelity.
  • All query results reconstruct the original string ID from the payload, so callers never see UUIDs.
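
The mapping above can be sketched with the standard library alone. The namespace is the one quoted in this description; `build_payload` and `restore_id` are hypothetical helper names used only for illustration and may not match the code in the PR.

```python
import uuid

# Namespace quoted in the PR description.
QDRANT_ID_NAMESPACE = uuid.UUID("f47ac10b-58cc-4372-a567-0e02b2c3d479")


def string_to_qdrant_id(original_id: str) -> str:
    """Map an arbitrary string ID to a deterministic UUID5 string."""
    return str(uuid.uuid5(QDRANT_ID_NAMESPACE, original_id))


def build_payload(original_id: str, fields: dict) -> dict:
    """Hypothetical helper: keep the caller's ID next to the other fields."""
    return {**fields, "_original_id": original_id}


def restore_id(payload: dict) -> str:
    """Hypothetical helper: recover the caller-facing ID from a payload."""
    return payload["_original_id"]


point_id = string_to_qdrant_id("doc_123")
payload = build_payload("doc_123", {"title": "hello"})
```

Because UUID5 is a pure function of the namespace and the name, calling `string_to_qdrant_id("doc_123")` in another process yields the same point ID, and `restore_id(payload)` gives back `"doc_123"`.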

2. Vector Storage Model — Named Vectors

Problem: VikingDB separates "Index" (search config) from "Store" (data container). Each index has its own vector field. Qdrant uses a flat point model with named vectors.

Solution: Map VikingDB index names to Qdrant named vectors.

  • Dense vector → "default" named vector
  • Sparse vector → "sparse" named vector
  • The vector_field_name from schema's VectorIndex maps to named vectors at collection creation time.
  • Qdrant collection is created with VectorParams per dense index and SparseVectorParams per sparse index.
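
As a rough illustration of the flat point model, the sketch below assembles one point as a plain dict in the shape of Qdrant's REST upsert body, with the `"default"` and `"sparse"` named vectors side by side. `build_point` is a hypothetical helper, not a function from this PR, and the sample ID and values are made up.

```python
from typing import Sequence


def build_point(
    point_id: str,
    dense: Sequence[float],
    sparse_indices: Sequence[int],
    sparse_values: Sequence[float],
    payload: dict,
) -> dict:
    """Hypothetical helper: one point carrying both named vectors."""
    if len(sparse_indices) != len(sparse_values):
        raise ValueError("sparse indices and values must have equal length")
    return {
        "id": point_id,
        "vector": {
            # Dense index -> "default" named vector.
            "default": list(dense),
            # Sparse index -> "sparse" named vector (indices + values).
            "sparse": {
                "indices": list(sparse_indices),
                "values": list(sparse_values),
            },
        },
        "payload": dict(payload),
    }


point = build_point(
    "00000000-0000-0000-0000-000000000001",  # made-up ID for the example
    [0.1, 0.2, 0.3],
    [7, 42],
    [0.5, 0.3],
    {"_original_id": "doc_123"},
)
```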

3. Filter DSL Translation

Problem: OpenViking uses a JSON-based filter DSL ({"op": "must", "field_name": ..., "conds": [...]}), while Qdrant uses typed Pydantic models (Filter, FieldCondition, MatchValue, Range, etc.).

Solution: Recursive _build_qdrant_filter() translates the full DSL:

  • must / must_not → Filter(must=[...]) / Filter(must_not=[...])
  • and / or → Nested Filter with must / should
  • range (dict with gt/gte/lt/lte) → Range(...)
  • in (list of values) → MatchAny(any=[...])
  • Scalar equality → MatchValue(value=...)
  • Full-text match → MatchText(text=...)
  • must_not conditions at any nesting level are properly collected and applied.
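
A minimal sketch of the recursive translation, under stated assumptions: it emits plain dicts in the shape of Qdrant's REST filter JSON rather than the Pydantic models, it reads the field from the `"field_name"` key as written above, and it assumes `range` nodes carry their bounds as top-level `gt`/`gte`/`lt`/`lte` keys. `build_filter` is a hypothetical stand-in for `_build_qdrant_filter()`, and full-text matching is left out.

```python
RANGE_KEYS = ("gt", "gte", "lt", "lte")


def _match(field: str, values: list) -> dict:
    """One value -> equality match; several values -> match-any."""
    if len(values) == 1:
        return {"key": field, "match": {"value": values[0]}}
    return {"key": field, "match": {"any": list(values)}}


def build_filter(node: dict) -> dict:
    """Recursively translate a DSL node into a Qdrant-style filter dict."""
    op = node["op"]
    if op == "must":
        return {"must": [_match(node["field_name"], node["conds"])]}
    if op == "must_not":
        return {"must_not": [_match(node["field_name"], node["conds"])]}
    if op == "range":
        bounds = {k: node[k] for k in RANGE_KEYS if k in node}
        return {"must": [{"key": node["field_name"], "range": bounds}]}
    if op in ("and", "or"):
        clause = "must" if op == "and" else "should"
        # Nested filters are themselves valid conditions in Qdrant.
        return {clause: [build_filter(child) for child in node["conds"]]}
    raise ValueError(f"unsupported op: {op!r}")


example = build_filter(
    {
        "op": "and",
        "conds": [
            {"op": "must", "field_name": "lang", "conds": ["en", "fr"]},
            {"op": "must_not", "field_name": "status", "conds": ["deleted"]},
            {"op": "range", "field_name": "year", "gte": 2020},
        ],
    }
)
```

Each branch returns a complete filter, so a `must_not` nested inside an `and` or `or` keeps its negation at whatever depth it appears.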

4. Hybrid Search — RRF Fusion

Problem: VikingDB has built-in hybrid search (dense + sparse + rerank in one call). Qdrant requires explicit orchestration.

Solution: Qdrant's prefetch + Fusion.RRF pattern:

client.query_points(
    prefetch=[
        Prefetch(query=dense_vector, using="default", limit=limit),
        Prefetch(query=SparseVector(...), using="sparse", limit=limit),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
)

This achieves Reciprocal Rank Fusion natively in Qdrant without an external reranker.
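
For readers unfamiliar with RRF, the pure-Python sketch below shows the idea behind the fusion step: each result list contributes 1 / (k + rank) per document, and the sums decide the final order. It illustrates the technique only and is not Qdrant's implementation; `rrf_fuse` is a hypothetical name, and k = 60 is the conventional constant, which may differ from what Qdrant uses internally.

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(rankings: list, k: int = 60, limit: Optional[int] = None) -> list:
    """Fuse ranked ID lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if limit is None else fused[:limit]


dense_hits = ["a", "b", "c"]   # e.g. results of the dense prefetch
sparse_hits = ["b", "d", "a"]  # e.g. results of the sparse prefetch
fused = rrf_fuse([dense_hits, sparse_hits])
```

Documents that rank well in both lists rise to the top, even if neither list ranks them first.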

5. Sparse Vector Key Hashing

Problem: VikingDB sparse vectors use string term keys (e.g. {"hello": 0.5, "world": 0.3}), but Qdrant sparse vectors require integer indices.

Solution: Stable hashing via hashlib.md5:

import hashlib

def _stable_sparse_index(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16) % (2**31)

Using MD5 (truncated to 8 hex chars) ensures:

  • Cross-process stability (unlike Python's hash() which is randomized per process via PYTHONHASHSEED)
  • Deterministic mapping across restarts
  • Sufficient range (2^31) to minimize collisions for typical vocabulary sizes

6. Schema Flexibility

Problem: VikingDB enforces strict field schemas. Qdrant is schema-less by default (any payload field can be stored).

Solution:

  • Field schema from create_collection is used to configure vector dimensions and named vectors, but payload fields are stored freely.
  • TextIndex payload indexes are auto-created for string fields to enable full-text MatchText search.
  • Index tracking via _created_indexes set to avoid redundant index creation.

7. Sorting & Aggregation

Problem: VikingDB supports order_by in fetch and aggregation natively. Qdrant added order_by in v1.9.0.

Solution:

  • fetch_data_by_sort → Qdrant's native scroll(order_by=...) with OrderBy(key, direction).
  • aggregate_data → Paginated scroll() with client-side grouping and counting (Qdrant lacks server-side aggregation).
  • search_by_random → Random unit vector search (matching LocalCollection approach), since Qdrant has no native random sampling.
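
The paginated client-side aggregation can be sketched as follows. `scroll_page` is a hypothetical callable standing in for a wrapper around the client's `scroll()`; it is assumed to return a page of payloads plus the next offset, with `None` marking the last page. `make_fake_scroll` exists only so the example runs without a server.

```python
from collections import Counter
from typing import Callable, Optional


def aggregate_counts(scroll_page: Callable, field: str, page_size: int = 256) -> dict:
    """Count payload values of `field` by walking every page."""
    counts = Counter()
    offset: Optional[int] = None
    while True:
        payloads, offset = scroll_page(offset, page_size)
        for payload in payloads:
            if field in payload:  # points without the field are skipped
                counts[payload[field]] += 1
        if offset is None:  # no further pages
            break
    return dict(counts)


def make_fake_scroll(payloads: list) -> Callable:
    """In-memory stand-in for a paginated scroll, for illustration only."""
    def scroll_page(offset, limit):
        start = offset or 0
        end = start + limit
        next_offset = end if end < len(payloads) else None
        return payloads[start:end], next_offset
    return scroll_page


data = [{"lang": "en"}, {"lang": "fr"}, {"lang": "en"}, {}, {"lang": "en"}]
result = aggregate_counts(make_fake_scroll(data), "lang", page_size=2)
```

Every point crosses the wire once, so the cost grows linearly with collection size, which is why the limitations below flag large collections as potentially slow.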

8. search_by_id — Self-Exclusion

VikingDB's search_by_id returns neighbors excluding the query point itself. Our implementation:

  1. Retrieves the vector of the given ID
  2. Performs a query_points search with limit + 1
  3. Filters out the query ID from results and trims to limit
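
Step 3 amounts to a filter followed by a trim. In the sketch below, hits are plain dicts with `id` and `score` keys; that shape and the name `exclude_self` are assumptions for illustration.

```python
def exclude_self(hits: list, query_id: str, limit: int) -> list:
    """Drop the query point from the results, then keep at most `limit`."""
    return [hit for hit in hits if hit["id"] != query_id][:limit]


# Results of a search issued with limit + 1 (here limit = 2).
raw_hits = [
    {"id": "doc_1", "score": 1.0},  # the query point itself
    {"id": "doc_7", "score": 0.91},
    {"id": "doc_3", "score": 0.88},
]
neighbors = exclude_self(raw_hits, "doc_1", limit=2)
```

Requesting `limit + 1` hits means the caller still receives `limit` neighbors after the query point is removed. If the query point happens not to appear in the results, the trim keeps the output at `limit` anyway.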

Known Limitations

| Feature | Status | Notes |
| --- | --- | --- |
| Path/nested field filters | Not supported | Qdrant payloads are flat; nested access requires flattening at ingest |
| Geo/datetime filters | Not supported | Qdrant supports GeoRadius/DatetimeRange but no DSL mapping yet |
| TTL (auto-expiry) | Not supported | Qdrant has no built-in TTL; would need external cron/scheduler |
| Group-by aggregation | Client-side | Qdrant lacks server-side GROUP BY; large collections may be slow |
| Reranking | Via RRF only | No external reranker integration; RRF fusion handles hybrid ranking |

Type of Change

  • New feature (feat)

Testing

  • Comprehensive unit tests for QdrantCollection (44 tests, ~770 lines)
  • Unit tests for QdrantProject (7 tests, 266 lines)
  • All 119 existing vectordb tests continue to pass

# Run Qdrant tests (requires Qdrant server on localhost:6333)
pytest tests/vectordb/test_qdrant_collection.py tests/vectordb/test_qdrant_project.py -v

# Run all vectordb tests
pytest tests/vectordb/ -v

Usage Example

{
  "vectordb": {
    "backend": "qdrant",
    "name": "context",
    "dimension": 1024,
    "qdrant": {
      "url": "http://localhost:6333"
    }
  }
}
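
A sketch of how such a config could be read into a typed settings object. `QdrantSettings` is a hypothetical stand-in for the PR's `QdrantConfig`; the field names `api_key`, `prefer_grpc`, and `timeout` are assumptions inferred from the "URL, API key, gRPC, timeout" summary and may not match the actual model.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QdrantSettings:
    """Hypothetical stand-in for QdrantConfig; field names are assumed."""
    url: str
    api_key: Optional[str] = None
    prefer_grpc: bool = False
    timeout: Optional[float] = None


def load_qdrant_settings(raw: str) -> QdrantSettings:
    """Parse the JSON config and extract the nested "qdrant" section."""
    vectordb = json.loads(raw)["vectordb"]
    if vectordb.get("backend") != "qdrant":
        raise ValueError("vectordb.backend is not 'qdrant'")
    return QdrantSettings(**vectordb.get("qdrant", {}))


raw_config = """
{
  "vectordb": {
    "backend": "qdrant",
    "name": "context",
    "dimension": 1024,
    "qdrant": {"url": "http://localhost:6333"}
  }
}
"""
settings = load_qdrant_settings(raw_config)
```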

Checklist

  • Code follows project style guidelines
  • Tests added for new functionality (44 + 7 = 51 tests)
  • All existing interfaces preserved (backward compatible)
  • Optional dependency — no impact on existing installations
  • Stable cross-process hashing for sparse vectors
  • Proper filter DSL translation with must_not support
  • search_by_id excludes self from results

CLAassistant commented Feb 20, 2026

CLA assistant check
All committers have signed the CLA.

@mclamee mclamee force-pushed the feature/qdrant-backend-support branch from 58a3b88 to e883acc Compare February 20, 2026 09:44
@ZaynJarvis ZaynJarvis requested a review from kkkwjx07 February 22, 2026 13:02
@ZaynJarvis
Collaborator

ZaynJarvis commented Feb 22, 2026

Looks good. Please help resolve the uv.lock conflicts and ruff lint issues.

We should test this before merging it.

@mclamee mclamee force-pushed the feature/qdrant-backend-support branch from e883acc to 78a351f Compare February 23, 2026 13:46
Add Qdrant as an alternative open-source vector database backend
alongside existing VikingDB/Volcengine backends.

Key changes:
- QdrantCollection: full VectorDBCollection implementation with support
  for dense/sparse hybrid search, scalar search with native order_by,
  full-text keyword search via payload indexes, multimodal search,
  aggregate data with pagination, and proper filter translation
- QdrantProject: VectorDBProject implementation for collection lifecycle
  management with configurable distance metrics and vector dimensions
- QdrantConfig: configuration model with URL, API key, gRPC, and
  timeout settings
- Backend factory: register 'qdrant' backend type in
  viking_vector_index_backend.py
- Auto-create TextIndex for text fields in create_index
- Use FilterSelector for delete_all_data instead of drop/recreate
- Paginate aggregate_data scroll to handle large collections
- Track created indexes properly in has_index

Dependencies:
- qdrant-client >= 1.9.0 (optional extra: `pip install openviking[qdrant]`)

Tests:
- Comprehensive unit tests for QdrantCollection (752 lines)
- Unit tests for QdrantProject (266 lines)
@mclamee mclamee force-pushed the feature/qdrant-backend-support branch from 78a351f to db38e45 Compare February 23, 2026 13:49
@mclamee
Author

mclamee commented Feb 23, 2026

Thanks for the review @ZaynJarvis!

Both issues addressed:

  1. uv.lock conflict — Rebased onto latest main (3d2d05a) and regenerated uv.lock.
  2. ruff lint — Fixed 4 issues (unused imports in qdrant_project.py and test_qdrant_collection.py, import sorting).

All Qdrant tests pass locally (51 tests: 44 collection + 7 project). Happy to help set up a test environment or add integration test instructions if needed.

@MaojiaSheng
Collaborator

@mclamee we will offer a plugin mechanism for vector databases. Would you like to help review it when our code is released?

@kkkwjx07
Collaborator

I think the path and datetime field types need to be supported. You can try converting them to string and float types.
By the way, I'm planning to revise the API code soon, as the integration cost is currently a bit high.

Projects

Status: Backlog

5 participants