@GFJHogue GFJHogue commented Sep 26, 2025

I identified multiple issues in the retrieval system that together cause the inconsistent behaviours noted in #91.

  1. BM25Retrieval relied on default whitespace tokenization, with no additional text processing to normalize case (BM25 case-sensitive retrieval causing failed retrievals #92) or punctuation. I added case normalization and nltk's word tokenizer. This seems to help with failed queries involving gene symbols (e.g. NFAT1).
  2. The old version of langchain-chroma we were using did not properly return Documents from SelfQueryRetriever. This appears to have been a bug that is fixed in the version I've upgraded to here. It caused EnsembleRetriever and MergerRetriever to miss vector search results (e.g. glycolysis).
  3. For reasons I still do not understand, RAG would occasionally claim no results were found for the 1st message, even though an identical query or rephrasing produces valid responses in all other cases. As a workaround, I now insert the user's original message into chat_history for the 1st message only, which strangely seems to work.
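
To illustrate item 1, here is a minimal sketch of why whitespace-only tokenization can miss a gene symbol. It is not the code in this branch: the regex tokenizer is a hypothetical stand-in for nltk's word tokenizer (so the snippet runs without downloading NLTK data), and the document and query strings are made up for the example.

```python
import re

# Hypothetical stand-in for nltk's word tokenizer: keeps alphanumeric runs
# (optionally joined by '-' or '_') and drops surrounding punctuation.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*")


def whitespace_tokenize(text: str) -> list[str]:
    """Old behaviour: split on whitespace only, keeping case and punctuation."""
    return text.split()


def normalized_tokenize(text: str) -> list[str]:
    """New behaviour (sketch): tokenize, then lowercase every token."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


# Made-up example texts, not taken from the study corpus.
doc = "Calcium signalling activates NFAT1, which regulates T-cell genes."
query = "what does nfat1 regulate?"


def overlap(tokenize) -> set[str]:
    """Terms shared by document and query; BM25 can only score shared terms."""
    return set(tokenize(doc)) & set(tokenize(query))


print(overlap(whitespace_tokenize))   # "NFAT1," and "nfat1" do not match
print(overlap(normalized_tokenize))   # both sides reduce to "nfat1"
```

With whitespace splitting the document term is `NFAT1,` (uppercase, trailing comma) and the query term is `nfat1`, so the two share no terms and BM25 has nothing to score. After normalization both become `nfat1`.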

@heliamoh please evaluate this branch and see whether it is an acceptable minimal fix for the study-system.

@GFJHogue GFJHogue self-assigned this Sep 26, 2025
@heliamoh heliamoh merged commit 5137c46 into main Sep 27, 2025
9 checks passed
@GFJHogue GFJHogue mentioned this pull request Sep 30, 2025