Add position bias detection and Cohen's kappa to llm_judge #3810

Open
joaquinhuigomez wants to merge 1 commit into lm-sys:main from joaquinhuigomez:feature/position-bias-cohens-kappa


@joaquinhuigomez

Summary

  • Adds compute_position_bias(), compute_cohens_kappa(), and interpret_kappa() to compute_agreement.py, working with the standard _pair.jsonl format
  • Extends display_result_pairwise() in show_result.py with a one-line consistency summary (position bias rate + kappa) printed after win rates
  • Adds a display_consistency_metrics() function and a --show-consistency CLI flag for a detailed breakdown (agreement rate, bias direction, and kappa with its Landis & Koch interpretation)
  • Includes 13 unit tests using synthetic judgment data (no API calls)
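
For readers unfamiliar with the metric, here is a minimal sketch of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), with the Landis & Koch bands. It is not the code in this PR: the names `cohens_kappa` and `landis_koch_label`, the input shape (two parallel label lists), and the handling of the degenerate p_e = 1 case are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n == 0:
        raise ValueError("need at least one pair of labels")

    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout, so kappa is
        # undefined; this sketch reports 1.0 by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) agreement band."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical verdicts from the original-order and swapped-order games.
g1 = ["model_1", "model_1", "model_2", "tie"]
g2 = ["model_1", "model_2", "model_2", "tie"]
kappa = cohens_kappa(g1, g2)
print(f"kappa = {kappa:.3f} ({landis_koch_label(kappa)})")
# prints: kappa = 0.636 (substantial)
```

In this example the two label lists agree on 3 of 4 items (p_o = 0.75) and the marginals give p_e = 5/16, so κ = 7/11 ≈ 0.636.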

Test plan

  • All 13 new tests pass (python3 -m pytest tests/test_consistency_metrics.py -v)
  • Existing imports verified intact
  • End-to-end tested with synthetic JSONL data
  • Manual verification with real MT-Bench judgment output

🤖 Generated with Claude Code

Extends compute_agreement.py with compute_position_bias(),
compute_cohens_kappa(), and interpret_kappa() functions that work
with the standard _pair.jsonl output format.
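
As a rough illustration of what position bias detection over pairwise judgments can look like, the sketch below counts how often the verdict changes when the answer order is swapped. It assumes each record carries `g1_winner` (original order) and `g2_winner` (swapped order), both expressed as `model_1`, `model_2`, `tie`, or `error`; the function name `position_bias_summary` and the returned keys are hypothetical and may not match the PR's compute_position_bias().

```python
import json


def position_bias_summary(records):
    """Summarize verdict consistency across the two answer orderings."""
    total = consistent = first_pos = second_pos = 0
    for rec in records:
        g1, g2 = rec.get("g1_winner"), rec.get("g2_winner")
        # Skip records with a missing or failed judgment.
        if g1 in (None, "error") or g2 in (None, "error"):
            continue
        total += 1
        if g1 == g2:
            consistent += 1
        elif g1 == "model_1" and g2 == "model_2":
            # model_1 was shown first in game 1 and model_2 first in game 2,
            # so the judge picked whichever answer came first.
            first_pos += 1
        elif g1 == "model_2" and g2 == "model_1":
            # The judge picked whichever answer came second in both games.
            second_pos += 1
        # Other disagreements (e.g. tie vs. a winner) count as inconsistent
        # but are not attributed to a direction.
    inconsistent = total - consistent
    return {
        "total": total,
        "bias_rate": inconsistent / total if total else None,
        "favors_first": first_pos,
        "favors_second": second_pos,
    }


# Synthetic JSONL lines standing in for a *_pair.jsonl file.
lines = [
    '{"g1_winner": "model_1", "g2_winner": "model_1"}',
    '{"g1_winner": "model_1", "g2_winner": "model_2"}',
    '{"g1_winner": "model_2", "g2_winner": "model_1"}',
    '{"g1_winner": "model_1", "g2_winner": "model_2"}',
    '{"g1_winner": "tie", "g2_winner": "model_1"}',
    '{"g1_winner": "error", "g2_winner": "model_1"}',
]
summary = position_bias_summary(json.loads(line) for line in lines)
print(summary)
```

Here five records are usable, one is consistent, two favor the first position, and one favors the second, giving a bias rate of 0.8.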

Modifies show_result.py to print a one-line consistency summary
at the bottom of pairwise results and adds --show-consistency flag
for detailed metrics (bias rate, direction, kappa, Landis & Koch).
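
A boolean flag such as --show-consistency is typically wired up with argparse's `store_true` action. The sketch below shows one plausible shape; the `build_parser` helper and the argument help text are assumptions, not the actual contents of show_result.py.

```python
import argparse


def build_parser():
    """Build a parser with an opt-in flag for detailed consistency metrics."""
    parser = argparse.ArgumentParser(description="Show pairwise judge results")
    # Assumed here for illustration: path to a *_pair.jsonl judgment file.
    parser.add_argument("--input-file", type=str, default=None)
    # Off by default, so existing output is unchanged unless requested.
    parser.add_argument(
        "--show-consistency",
        action="store_true",
        help="print detailed position bias and kappa metrics",
    )
    return parser


parser = build_parser()
args = parser.parse_args(["--input-file", "judgments_pair.jsonl", "--show-consistency"])
print(args.show_consistency)  # True
```

argparse converts the dashes in the flag name to underscores, so the value is read as `args.show_consistency`.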

Includes 13 unit tests with synthetic data in
tests/test_consistency_metrics.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants