Hello maintainers!
I am interested in contributing to JudgeBench by adding a new set of response pairs in the data/ folder, using PT-BR (Brazilian Portuguese) benchmarks.
While reading the paper and exploring the repository, I noticed that a central part of the methodology involves generating multiple responses ('k' responses) for each question and selecting a pair consisting of one correct and one subtly incorrect response. However, I could not find in the JudgeBench codebase the procedure or script responsible for generating and selecting these pairs.
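To make the question concrete, here is a rough sketch of how I currently understand the selection step. It is only my reading of the methodology, not code from the JudgeBench repository: `select_pair` and `is_correct` are hypothetical names, picking at random within each group is my assumption, and the sketch does not try to check that the incorrect response is "subtly" wrong.

```python
import random
from typing import Callable, List, Optional, Tuple


def select_pair(
    responses: List[str],
    is_correct: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[str, str]]:
    """Pick one (correct, incorrect) pair from the k sampled responses.

    Returns None when the k responses are all correct or all incorrect,
    since no pair can be formed for that question.
    """
    rng = rng or random.Random(0)  # fixed seed so the selection is reproducible

    # Split the k responses by the verdict of the (hypothetical) answer checker.
    correct = [r for r in responses if is_correct(r)]
    incorrect = [r for r in responses if not is_correct(r)]

    if not correct or not incorrect:
        return None

    # Assumption: any correct/incorrect combination is acceptable.
    return rng.choice(correct), rng.choice(incorrect)


# Toy example: k = 3 responses to a question whose reference answer is "4".
pair = select_pair(["4", "5", "4"], lambda r: r == "4")
```

If the actual procedure differs (for example, in how correctness is verified or how the pair is chosen among several candidates), that is exactly what I would like to learn.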
My questions are:
- Could you point out where the logic/tool for generating the 'k' responses and selecting the correct/subtly incorrect pair for each question is implemented or described?
- Are there any preprocessing scripts or notebooks, or methodological recommendations for this process, that could be shared?
My intention is to follow the project's methodological standards so that the new PT-BR benchmarks are compatible and valuable to the community.
Thank you in advance for your attention, and congratulations on your work!