A production-style, distributed e-commerce semantic search system built incrementally with a strong focus on correctness, scalability, and clean engineering practices.
The project is intentionally developed in milestones, where each layer is validated independently before moving to the next.
- v0.1 — Data Ingestion & Distributed Storage
- v0.2 — Product Search Infrastructure (BM25)
- v0.3 — Semantic Review Search (FAISS)
- v0.4 — Unified FastAPI Query Gateway
v0.1: Data Ingestion & Distributed Storage
Goal:
Build a reliable, reproducible data ingestion and distributed storage foundation using real-world, large-scale Amazon datasets.
This milestone focuses entirely on infrastructure correctness and data hygiene before introducing any search or retrieval layers.
- Streaming preprocessing pipeline
- scripts/convert_meta_reviews.py
- Two-pass, memory-safe design (no full dataset loading)
- Deterministic reservoir sampling (seeded) for reproducibility
- ~10,000 products
- ~50,000 reviews
- Cleans HTML artifacts, malformed text, whitespace, and missing fields
- Cleaned outputs for downstream systems
- data/out/products_small.csv
- data/out/reviews_small.csv
- data/out/asin_sample_list.txt
- Distributed SQL storage
- 3-node CockroachDB cluster running locally via Docker Compose
- Schema design for products and reviews
- Bulk ingestion using CockroachDB IMPORT
- Verification and validation
- Row counts verified after import
- Cluster health checks
- Replication and range distribution across nodes validated
- Dockerized environment
- Fully reproducible local setup
- No dependency on host Python environment
- Repository hygiene
- Removed virtual environments and large binaries from Git history
- Clean .gitignore and .dockerignore
- Recruiter-grade, lightweight repository
- Raw Amazon datasets (20+ GB total) are intentionally excluded from the repository.
- All preprocessing is streaming to avoid memory pressure.
- No search, embeddings, or APIs are introduced at this stage by design.
- A stable, distributed storage layer ready to support search and retrieval workloads.
- Deterministic sample datasets that can be regenerated at any time.
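The seeded reservoir sampling that makes the sample datasets regenerable can be sketched as follows. This is a minimal illustration of the classic single-pass technique (Algorithm R), not the code in scripts/convert_meta_reviews.py; the function name, the seed value, and the fake ASIN stream are assumptions made for the example.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length.

    Only k items are held in memory at any time, so the input can be
    arbitrarily large. The seed is an illustrative default.
    """
    rng = random.Random(seed)  # private RNG: global random state is untouched
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def fake_asin_stream(n: int) -> Iterator[str]:
    """Hypothetical stand-in for streaming ASINs out of a large raw file."""
    for i in range(n):
        yield f"B{i:09d}"


sample_a = reservoir_sample(fake_asin_stream(100_000), k=10, seed=42)
sample_b = reservoir_sample(fake_asin_stream(100_000), k=10, seed=42)
print(sample_a == sample_b)  # same seed and same input order give the same sample
```

Determinism here depends on both the seed and the order of the input stream, so regenerating the same sample assumes the raw files are read in the same order.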
python scripts/convert_meta_reviews.py
docker compose up -d
cockroach sql --insecure --host=localhost:26257
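The cleaning step that runs during preprocessing (HTML artifacts, whitespace, missing fields) might look roughly like the sketch below. The function name, the regular expressions, and the default value are assumptions for illustration; the real rules in scripts/convert_meta_reviews.py may differ.

```python
import html
import re
from typing import Optional

_TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, good enough for a sketch
_WS_RE = re.compile(r"\s+")       # matches Unicode whitespace, including non-breaking spaces


def clean_text(raw: Optional[str], default: str = "") -> str:
    """Strip HTML tags, decode entities, and collapse whitespace.

    Missing or empty values fall back to `default`.
    """
    if raw is None:
        return default
    text = _TAG_RE.sub(" ", raw)           # drop tags first so decoded "<" is not mistaken for a tag
    text = html.unescape(text)             # &amp; -> &, &nbsp; -> non-breaking space
    text = _WS_RE.sub(" ", text).strip()   # collapse runs of whitespace
    return text or default


print(clean_text("<p>Great&nbsp;phone &amp; <b>fast</b>   charger</p>"))
```

A regex-based tag stripper is a simplification; heavily malformed markup would call for a real HTML parser.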
v0.2: Product Search Infrastructure (BM25)
Goal:
Introduce a robust lexical product search layer using Elasticsearch (BM25) while intentionally deferring APIs and semantic retrieval.
This milestone focuses on validating search relevance, indexing correctness, and integration with distributed storage before adding semantic capabilities.
What’s included
- Elasticsearch integration:
- Official Elasticsearch 8.x image (ARM64-compatible)
- Runs locally via Docker Compose alongside CockroachDB
- Product indexing pipeline:
- scripts/index_products_to_elasticsearch.py
- Bulk indexing of ~10,000 products from products_small.csv
- Index design and relevance tuning:
- Explicit mappings for product fields
- BM25 similarity (default Elasticsearch ranking)
- Field boosting (e.g., title^3) to improve ranking quality
- Verification and validation:
- Index document counts verified
- Search correctness tested via _search
- Ranking behavior inspected using _score
- Infrastructure-first design:
- No FastAPI endpoints added (intentional)
- No semantic embeddings or vector search at this stage
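As an illustration of the explicit mappings and the title^3 boost listed above, the request bodies could be built like this. The field names other than title (asin, description) and the helper names are assumptions; only the title boost and the products index name come from this README.

```python
import json
from typing import Any, Dict


def build_product_mapping() -> Dict[str, Any]:
    """Index-creation body with explicit field types (field names are assumed)."""
    return {
        "mappings": {
            "properties": {
                "asin": {"type": "keyword"},      # exact-match identifier
                "title": {"type": "text"},        # analyzed, scored with BM25
                "description": {"type": "text"},
            }
        }
    }


def build_product_query(text: str, size: int = 10) -> Dict[str, Any]:
    """multi_match query that weights title matches three times as heavily."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title^3", "description"],
            }
        },
    }


body = build_product_query("wireless headphones", size=5)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a POST to /products/_search; each hit's _score then reflects the boosted title field.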
Design decisions
- Lexical search is introduced before semantic retrieval to:
- Establish a strong and explainable baseline
- Validate indexing and ranking independently
- API layer is deferred until both product and review retrieval layers are complete.
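Part of what makes the lexical baseline explainable is that BM25 is a short formula. The sketch below implements the textbook form over pre-tokenized toy documents, using k1=1.2 and b=0.75, which are Elasticsearch's documented defaults. It is an illustration only: Lucene's implementation differs in details such as constant factors and length encoding, so these numbers will not match _score exactly. The toy documents are made up.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with textbook BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["wireless", "mouse"],
    ["wireless", "wireless", "headphones", "bluetooth"],
    ["usb", "cable"],
]
scores = bm25_scores(["wireless", "headphones"], docs)
print(scores)
```

The second document ranks highest because it matches both query terms, including the rarer one; the third matches nothing and scores zero.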
Outcome
- A production-style product search layer capable of fast and relevant keyword-based retrieval.
- Clean separation between distributed storage (CockroachDB) and search infrastructure (Elasticsearch).
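For reference, bulk indexing from the cleaned CSV boils down to streaming rows into the newline-delimited format that the Elasticsearch _bulk endpoint expects: one action line followed by one document line per product. The sketch below builds those payloads with the standard library only; the column names, chunk size, and helper name are assumptions, and it is not the code in scripts/index_products_to_elasticsearch.py.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator


def bulk_payloads(rows: Iterable[Dict[str, str]], chunk_size: int = 500,
                  index: str = "products", id_field: str = "asin") -> Iterator[str]:
    """Yield NDJSON payloads for the _bulk API, chunk_size documents at a time."""
    lines = []
    count = 0
    for row in rows:
        # Action line: index this document under its ASIN.
        lines.append(json.dumps({"index": {"_index": index, "_id": row[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(row))
        count += 1
        if count == chunk_size:
            yield "\n".join(lines) + "\n"  # the bulk body must end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"


# Hypothetical three-row stand-in for products_small.csv.
sample_csv = io.StringIO(
    "asin,title\n"
    "B001,Wireless Mouse\n"
    "B002,USB-C Cable\n"
    "B003,Bluetooth Headphones\n"
)
payloads = list(bulk_payloads(csv.DictReader(sample_csv), chunk_size=2))
print(len(payloads), "payloads")
```

Each payload would be POSTed to /_bulk with the Content-Type application/x-ndjson. In practice, the official Python client's helpers.bulk handles chunking and error reporting.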
How to run (short)
# start services (CockroachDB + Elasticsearch)
docker compose up -d
# index products into Elasticsearch
python scripts/index_products_to_elasticsearch.py
# test a sample query
curl "http://localhost:9200/products/_search?q=wireless"
v0.3: Semantic Review Search (FAISS)
Goal:
Add a semantic retrieval layer for product reviews using sentence-transformer embeddings and a FAISS HNSW index.
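Conceptually, semantic retrieval means ranking review vectors by their similarity to a query vector. The sketch below does this exactly, by brute force, on made-up 3-dimensional vectors; an HNSW index approximates the same ranking without scanning every vector. The vectors and function names are illustrative, and the distance metric the project uses is not stated here. For unit-normalized vectors, L2 distance and cosine similarity produce the same ordering.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], corpus: Dict[int, Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Exact nearest neighbors: score every vector, keep the k most similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
corpus = {
    0: [0.9, 0.1, 0.0],
    1: [0.0, 1.0, 0.0],
    2: [0.7, 0.3, 0.1],
}
neighbors = top_k([1.0, 0.0, 0.0], corpus, k=2)
print([doc_id for doc_id, _ in neighbors])
```

Brute force costs one comparison per stored vector per query, which is the cost an approximate index trades a little recall to avoid.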
What’s included
- scripts/generate_embeddings.py: streaming, memory-safe embedding generation for data/out/reviews_small.csv using sentence-transformers/all-MiniLM-L6-v2.
- scripts/build_faiss_index.py: builds an HNSW FAISS index (IndexHNSWFlat) and writes data/faiss/reviews_hnsw.index and a small JSON config.
- Editable build-time HNSW params at top of file: M (connectivity) and ef_construction (build depth).
- scripts/query_faiss.py: CLI for semantic queries with --k (number of results), --efSearch (runtime efSearch to trade recall against latency), and --asin (optional per-product filtering with oversampling).
- data/faiss/faiss_metadata.db: SQLite mapping faiss_id → asin, reviewText, summary, unixReviewTime.
- scripts/e2e_smoke_test.sh and scripts/e2e_test.py: smoke and programmatic E2E checks (CockroachDB, Elasticsearch, FAISS artifacts, sample queries).
- Manifest & reproducibility notes:
- Model: sentence-transformers/all-MiniLM-L6-v2 (384d)
- Index: IndexHNSWFlat, default M=32, ef_construction=200
- Runtime efSearch recommended: 200 for demos
How to run (short)
# generate embeddings
python scripts/generate_embeddings.py
# build FAISS index
python scripts/build_faiss_index.py
# query
python scripts/query_faiss.py "long battery life" --k 5 --efSearch 200
v0.4: Unified FastAPI Query Gateway
Goal:
Expose a single, clean API endpoint that orchestrates lexical search, semantic retrieval, and authoritative metadata into one unified response.
This milestone integrates all prior layers into a production-style query service.
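The orchestration described in this goal can be sketched independently of any framework. In the example below, the three backends are passed in as plain callables so the flow stays visible: lexical candidates first, then authoritative metadata, then semantic reviews restricted to the candidate products. All names, signatures, and the response shape are assumptions for illustration, not the project's actual FastAPI handler.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class SearchResponse:
    """Hypothetical unified response shape."""
    query: str
    products: List[Dict[str, Any]] = field(default_factory=list)
    reviews: List[Dict[str, Any]] = field(default_factory=list)


def unified_search(
    query: str,
    lexical_search: Callable[[str, int], List[Tuple[str, float]]],
    semantic_search: Callable[[str, int, List[str]], List[Dict[str, Any]]],
    fetch_metadata: Callable[[List[str]], Dict[str, Dict[str, Any]]],
    top_products: int = 5,
    top_reviews: int = 5,
) -> SearchResponse:
    # 1. Lexical candidates: (asin, score) pairs from the BM25 layer.
    hits = lexical_search(query, top_products)
    asins = [asin for asin, _ in hits]
    # 2. Authoritative product fields from the system of record.
    metadata = fetch_metadata(asins)
    products = [{"asin": a, "score": s, **metadata.get(a, {})} for a, s in hits]
    # 3. Semantic reviews, limited to the candidate products.
    reviews = semantic_search(query, top_reviews, asins)
    return SearchResponse(query=query, products=products, reviews=reviews)


# Stub backends standing in for Elasticsearch, FAISS, and CockroachDB.
def stub_lexical(query: str, n: int) -> List[Tuple[str, float]]:
    return [("B001", 2.5), ("B002", 1.1)][:n]


def stub_semantic(query: str, n: int, asins: List[str]) -> List[Dict[str, Any]]:
    return [{"asin": asins[0], "summary": "Battery lasts two days"}][:n]


def stub_metadata(asins: List[str]) -> Dict[str, Dict[str, Any]]:
    return {"B001": {"title": "Phone X"}}


response = unified_search("long battery life", stub_lexical, stub_semantic,
                          stub_metadata, top_products=2, top_reviews=1)
print(response)
```

Injecting the backends as callables also makes the handler easy to test without any of the three services running.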
- FastAPI service
- /search: unified search endpoint
- /health: service health check
- Interactive OpenAPI docs (/docs)
- Hybrid retrieval pipeline
- Elasticsearch (BM25) for product candidate generation
- FAISS (HNSW) for semantic review retrieval
- CockroachDB for authoritative product metadata
- Human-style query support
- Accepts natural language queries (e.g., “I want a phone with long battery life and fast charging”)
- Configurable ranking
- top_products, top_reviews
- faiss_k oversampling
- Runtime efSearch tuning
- Optional filtering of semantic reviews by top-N product ASINs
- Defensive API design
- Strict response validation via Pydantic
- Safe type coercion for real-world data inconsistencies
- Local-first, reproducible setup
- All components run locally via Docker and Python
- Generated artifacts excluded from git
- A fully functional, production-style unified search API
- Clean separation of responsibilities:
- CockroachDB → system of record
- Elasticsearch → lexical retrieval
- FAISS → semantic similarity
- FastAPI → orchestration layer
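The faiss_k oversampling and ASIN filtering mentioned above work together: an approximate index cannot filter by product on its own, so the service requests more neighbors than it needs and discards those belonging to other products. The sketch below shows that post-filtering step against an in-memory SQLite table. The table name, the demo rows, and the function names are assumptions; only the faiss_id, asin, and summary columns come from this README.

```python
import sqlite3
from typing import Any, Dict, List, Set, Tuple


def build_demo_metadata() -> sqlite3.Connection:
    """In-memory stand-in for data/faiss/faiss_metadata.db (table name assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reviews_meta (faiss_id INTEGER PRIMARY KEY, asin TEXT, summary TEXT)"
    )
    conn.executemany(
        "INSERT INTO reviews_meta VALUES (?, ?, ?)",
        [
            (1, "A", "Battery easily lasts two days"),
            (2, "B", "Charges slowly"),
            (3, "B", "Great battery"),
            (4, "A", "Fast charging works well"),
        ],
    )
    return conn


def filter_by_asin(conn: sqlite3.Connection, candidates: List[Tuple[int, float]],
                   allowed_asins: Set[str], k: int) -> List[Dict[str, Any]]:
    """Keep the first k candidates whose review belongs to an allowed product.

    `candidates` is assumed to be (faiss_id, distance) pairs sorted by
    ascending distance, as returned by an oversampled index search.
    """
    results: List[Dict[str, Any]] = []
    for faiss_id, distance in candidates:
        row = conn.execute(
            "SELECT asin, summary FROM reviews_meta WHERE faiss_id = ?", (faiss_id,)
        ).fetchone()
        if row is None or row[0] not in allowed_asins:
            continue  # unknown id, or a review for a different product
        results.append({"faiss_id": faiss_id, "asin": row[0],
                        "summary": row[1], "distance": distance})
        if len(results) == k:
            break
    return results


conn = build_demo_metadata()
# Pretend the index returned faiss_k = 5 candidates for a request with k = 2.
candidates = [(3, 0.10), (1, 0.20), (4, 0.25), (2, 0.40), (99, 0.50)]
filtered = filter_by_asin(conn, candidates, allowed_asins={"A"}, k=2)
print([r["faiss_id"] for r in filtered])
```

If too few candidates survive the filter, the usual remedy is a larger oversampling factor, at the cost of a slower index search.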
This project demonstrates a production-style approach to building a distributed search system by validating each layer independently before integration.
Rather than relying on a single technology, the platform intentionally separates concerns:
- CockroachDB serves as the strongly consistent system of record,
- Elasticsearch handles fast lexical retrieval,
- FAISS enables semantic similarity search,
- FastAPI acts as a lightweight orchestration layer.
The incremental milestone-based design ensures correctness, reproducibility, and extensibility, while keeping each component replaceable as system requirements evolve.
Future milestones build on this foundation to introduce intelligent ranking, retrieval-augmented generation, and personalization.