Description
_normalize_text() in rubric_based_evaluator.py only does text.lower().strip(). When LLM judge models return rubric verdicts with markdown formatting (bullets, smart quotes, extra whitespace, non-ASCII characters), the exact-match rubric lookup fails silently, producing incorrect scores.
Reproduction
Judge model returns: "• The response correctly identifies the tool"
Expected rubric: "the response correctly identifies the tool"
_normalize_text() produces: "• the response correctly identifies the tool"
Result: No match → score defaults to 0 or lowest rubric
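The failure can be reproduced outside ADK in a few lines. This is a minimal sketch, not the evaluator's actual code: `current_normalize` mirrors the `text.lower().strip()` behaviour described above, and `rubric_map` with its score of 1.0 is a hypothetical stand-in for the real rubric lookup.

```python
def current_normalize(text: str) -> str:
    # Mirrors the behaviour described in this issue: lowercase and strip only.
    return text.lower().strip()


# Hypothetical rubric lookup keyed by normalized rubric text.
rubric_map = {current_normalize("The response correctly identifies the tool"): 1.0}

judge_output = "\u2022 The response correctly identifies the tool"  # leading bullet
normalized = current_normalize(judge_output)

# The bullet survives normalization, so the exact-match lookup misses.
score = rubric_map.get(normalized)
print(repr(normalized), score)
```

The lookup returns `None` here; what the evaluator does with a miss (0 or lowest rubric) is as described in the Result line above.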
Common Patterns That Fail
- Leading bullets: •, *, -
- Smart quotes: “...”, ‘...’
- Non-ASCII: accented characters, em-dashes
- Multi-space: "the  response" vs "the response"
- Embedded newlines (leading and trailing whitespace is already removed by strip())
Suggested Fix
Enhanced normalization:
import re

def _normalize_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    text = re.sub(r'^[\s*•\-]+', '', text)  # Strip leading bullets
    text = re.sub(r'[\s*•\-]+$', '', text)  # Strip trailing bullets
    text = text.encode('ascii', 'ignore').decode()  # Remove non-ASCII
    # Collapse whitespace last: dropping non-ASCII characters can leave double spaces
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()
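One caveat with dropping non-ASCII outright is that smart quotes vanish instead of becoming straight quotes, and accented letters are deleted ("café" becomes "caf"). A possible variant, sketched here under assumptions, folds these to ASCII equivalents first. The name `normalize_text_unicode` and the `_PUNCT_MAP` table are hypothetical, not part of ADK, and rubric keys would need to pass through the same function for lookups to line up.

```python
import re
import unicodedata

# Hypothetical table: map typographic punctuation to ASCII before
# any remaining non-ASCII characters are dropped.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # single smart quotes
    "\u201c": '"', "\u201d": '"',  # double smart quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
})


def normalize_text_unicode(text: object) -> str:
    if not isinstance(text, str):
        return ""
    text = text.translate(_PUNCT_MAP)
    # NFKD splits an accented letter into base letter + combining mark;
    # the ASCII encode step then drops only the mark.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"^[\s*\-]+", "", text)  # leading bullets / whitespace
    text = re.sub(r"[\s*\-]+$", "", text)  # trailing bullets / whitespace
    text = re.sub(r"\s+", " ", text)       # collapse internal whitespace
    return text.lower().strip()
```

For example, `normalize_text_unicode("\u2022 The response correctly identifies the tool")` yields the expected rubric string from the reproduction. Note that a rubric that legitimately starts or ends with `-` or `*` would lose that character.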
Additionally, a substring fallback when exact match fails would prevent silent scoring failures:
# If exact match fails, try substring match (skip empty strings, which match everything)
if normalized_response:
    for rubric_text, score in rubric_map.items():
        if rubric_text and (normalized_response in rubric_text or rubric_text in normalized_response):
            return score
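A substring fallback can also match the wrong rubric when one rubric's text overlaps another's. The sketch below is one way to keep the fallback conservative: it accepts a substring match only when every candidate agrees on the score. `lookup_score` is a hypothetical helper, and `rubric_map` is assumed to map normalized rubric text to a numeric score as in the snippet above.

```python
from typing import Dict, Optional


def lookup_score(normalized_response: str, rubric_map: Dict[str, float]) -> Optional[float]:
    # Exact match first.
    if normalized_response in rubric_map:
        return rubric_map[normalized_response]
    # An empty string is a substring of everything, so never fall back on it.
    if not normalized_response:
        return None
    candidates = [
        score
        for rubric_text, score in rubric_map.items()
        if rubric_text
        and (normalized_response in rubric_text or rubric_text in normalized_response)
    ]
    # Accept the fallback only if all substring candidates share one score.
    if len(set(candidates)) == 1:
        return candidates[0]
    return None  # no match or ambiguous: let the caller log and decide
```

Returning `None` instead of a default score lets the caller log the miss, which addresses the "fails silently" part of this issue.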
Impact
Without this fix, GEPA optimization produces unreliable rubric-based scores (rubric_based_final_response_quality_v1, rubric_based_tool_use_quality_v1), leading to suboptimal prompt evolution.
Environment
- google-adk 2.2.0
- Judge models: gemini-2.5-pro, gemini-3.5-flash