
@heliamoh
Collaborator

This PR overhauls the Reactome chatbot's retrieval and preprocessing systems. It replaces the existing simple ensemble retriever with a hybrid retriever (vector + BM25 search, fused with Reciprocal Rank Fusion) and adds a multi-step preprocessing pipeline whose independent steps run in parallel.

1. Hybrid Retrieval System Overhaul

1.1 Core Architecture Changes

  • Replaced SelfQueryRetriever with custom HybridRetriever class
  • Added query expansion with parallel multi-source search across all Reactome data subdirectories

1.2 Retrieval Workflow Implementation

The new hybrid system operates on the following pipeline:

  • Parallel Search Execution: Vector + BM25 search across 5 expanded queries per data subdirectory
  • Reciprocal Rank Fusion (RRF): Merges the vector and BM25 result lists from all expanded queries into a single ranking, favoring documents that appear near the top of several independent searches
  • Asynchronous Processing: All retrieval operations use asyncio.to_thread() for non-blocking execution
  • Parallel Subdirectory Processing: Multiple data sources searched simultaneously
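
The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual HybridRetriever: `vector_search`, `bm25_search`, `search_subdirectory`, and `hybrid_search` are hypothetical stand-ins, and the RRF constant `k = 60` is the value commonly used in the literature, assumed here rather than taken from the PR.

```python
import asyncio
from collections import defaultdict

# Hypothetical blocking searches; each returns document IDs ordered best-first.
def vector_search(subdir: str, query: str) -> list[str]:
    return [f"{subdir}/doc_a", f"{subdir}/doc_b", f"{subdir}/doc_c"]

def bm25_search(subdir: str, query: str) -> list[str]:
    return [f"{subdir}/doc_a", f"{subdir}/doc_c", f"{subdir}/doc_d"]

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document as the sum of 1 / (k + rank) over all rankings."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

async def search_subdirectory(subdir: str, queries: list[str]) -> list[str]:
    # One vector search and one BM25 search per expanded query, each pushed
    # to a worker thread so the event loop is not blocked.
    tasks = [
        asyncio.to_thread(search, subdir, query)
        for query in queries
        for search in (vector_search, bm25_search)
    ]
    rankings = await asyncio.gather(*tasks)
    return reciprocal_rank_fusion(list(rankings))

async def hybrid_search(subdirs: list[str], queries: list[str]) -> dict[str, list[str]]:
    # Subdirectories are searched concurrently as well.
    fused = await asyncio.gather(
        *(search_subdirectory(subdir, queries) for subdir in subdirs)
    )
    return dict(zip(subdirs, fused))

if __name__ == "__main__":
    results = asyncio.run(hybrid_search(["pathways", "reactions"], ["q1", "q2"]))
    print(results["pathways"])
```

Because RRF uses only rank positions, the vector and BM25 scores never need to be normalized against each other, which is one reason it suits combining heterogeneous searches.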

2. Preprocessing Pipeline Implementation

2.1 Multi-Step Workflow Architecture

The new preprocessing system implements a fan-out pattern with the following structure:

Sequential Step 1:

  • Query rephrasing with conversation history integration

Parallel Step 2:

  • Safety assessment and reasoning
  • Query expansion for enhanced retrieval
  • Language detection for multilingual support
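
As a rough sketch of this fan-out, the sequential rephrasing step can feed three coroutines that run concurrently under `asyncio.gather`. All function names and return shapes below are hypothetical placeholders for the LLM-backed steps; the PR's actual interfaces may differ.

```python
import asyncio

# Hypothetical stand-ins for the LLM-backed preprocessing steps.
async def rephrase(query: str, history: list[str]) -> str:
    # Sequential step 1: fold the latest history turn into the query.
    return query if not history else f"{query} (context: {history[-1]})"

async def assess_safety(query: str) -> dict:
    return {"safe": True, "reason": "placeholder assessment"}

async def expand_query(query: str) -> list[str]:
    return [query, f"{query} pathway", f"{query} reaction"]

async def detect_language(query: str) -> str:
    return "en"

async def preprocess(query: str, history: list[str]) -> dict:
    rephrased = await rephrase(query, history)
    # Parallel step 2: the three tasks depend only on the rephrased query,
    # so they can be awaited together.
    safety, expansions, language = await asyncio.gather(
        assess_safety(rephrased),
        expand_query(rephrased),
        detect_language(rephrased),
    )
    return {
        "rephrased": rephrased,
        "safety": safety,
        "expansions": expansions,
        "language": language,
    }

if __name__ == "__main__":
    print(asyncio.run(preprocess("What is apoptosis?", [])))
```

With this structure, the latency of step 2 is bounded by its slowest task rather than the sum of all three.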

3. ReactToMe Profile Enhancements

3.1 Conditional Workflow Routing

The ReactToMe profile now routes each question according to the outcome of the safety assessment:

  • Questions are only answered if they are ethical, appropriate, and within Reactome scope
  • Unsafe/out-of-scope questions bypass the main Q&A workflow (no RAG + no external search)
  • Contextual refusal responses generated for inappropriate queries
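
The routing rule can be illustrated with a small sketch. The field names (`safe`, `in_scope`, `reason`) and the functions `route`, `run_qa`, and `respond` are assumptions made for illustration, not the profile's actual implementation.

```python
def route(safety: dict) -> str:
    """Return 'answer' only when the question is both safe and in scope."""
    if safety.get("safe") and safety.get("in_scope"):
        return "answer"
    return "refuse"

def run_qa(question: str) -> str:
    # Hypothetical placeholder for the main Q&A workflow (RAG + external search).
    return f"[answer to: {question}]"

def respond(question: str, safety: dict) -> str:
    if route(safety) == "refuse":
        # The refusal branch skips retrieval and external search entirely
        # and reuses the assessment's reasoning for a contextual message.
        reason = safety.get("reason", "it is outside the supported scope")
        return f"I can't help with that because {reason}."
    return run_qa(question)
```

For example, `respond("What is TP53?", {"safe": True, "in_scope": True})` reaches `run_qa`, while an assessment with `in_scope` set to `False` returns the refusal without touching retrieval.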

Performance Impact

  • Reduced Latency: Parallel processing of independent tasks
  • Enhanced Accuracy: Query expansion and RRF ranking improve result quality
  • Better Resource Utilization: Asynchronous operations and optimized concurrent execution

- Implement parallel execution of safety and scope check, query expansion, and language detection
… expansion and conversation history management
- Replace SelfQueryRetriever with efficient hybrid search (BM25 + vector)
- Add RRF (Reciprocal Rank Fusion) support for query expansion
- Implement parallel processing for improved performance
… expansion and conversation history management
@heliamoh heliamoh requested a review from GFJHogue September 28, 2025 18:38
@heliamoh heliamoh closed this Sep 28, 2025
