Add position bias detection and Cohen's kappa to llm_judge by joaquinhuigomez · Pull Request #3810 · lm-sys/FastChat

joaquinhuigomez · 2026-03-16T20:29:49Z

Summary

Adds compute_position_bias(), compute_cohens_kappa(), and interpret_kappa() to compute_agreement.py, working with the standard _pair.jsonl format
Extends display_result_pairwise() in show_result.py with a one-line consistency summary (position bias rate + kappa) printed after win rates
Adds display_consistency_metrics() function and --show-consistency CLI flag for detailed breakdown (agreement rate, bias direction, kappa with Landis & Koch interpretation)
Includes 13 unit tests using synthetic judgment data (no API calls)
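
For readers unfamiliar with the metrics named above, here is a minimal sketch of how they can be computed. It is not the code from this PR: the `_sketch` function names are hypothetical, and the `g1_winner` / `g2_winner` field names (with values `model_1`, `model_2`, `tie`, `error`, both already mapped back to model identity) are an assumption about the `_pair.jsonl` records. Each swapped-order judgment pair is treated as two "raters" for Cohen's kappa.

```python
from collections import Counter


def position_bias_sketch(judgments):
    """Rate and direction of verdict flips between original and swapped order.

    Assumes each record has 'g1_winner' (original order) and 'g2_winner'
    (swapped order), both expressed as 'model_1', 'model_2', 'tie' or 'error'.
    """
    valid = [
        (j["g1_winner"], j["g2_winner"])
        for j in judgments
        if "error" not in (j["g1_winner"], j["g2_winner"])
    ]
    n = len(valid)
    if n == 0:
        return {"n": 0, "bias_rate": None, "first_favored": 0, "second_favored": 0}
    inconsistent = [(a, b) for a, b in valid if a != b]
    # model_1 is shown first in game 1, model_2 is shown first in game 2.
    first = sum(1 for a, b in inconsistent if a == "model_1" and b == "model_2")
    second = sum(1 for a, b in inconsistent if a == "model_2" and b == "model_1")
    return {
        "n": n,
        "bias_rate": len(inconsistent) / n,
        "first_favored": first,
        "second_favored": second,
    }


def cohens_kappa_sketch(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa_sketch(kappa):
    """Landis & Koch (1977) bands for kappa."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Synthetic judgments: three consistent pairs, one flip favoring the first slot.
judgments = [
    {"g1_winner": "model_1", "g2_winner": "model_1"},
    {"g1_winner": "model_2", "g2_winner": "model_2"},
    {"g1_winner": "model_1", "g2_winner": "model_2"},
    {"g1_winner": "tie", "g2_winner": "tie"},
]
bias = position_bias_sketch(judgments)
kappa = cohens_kappa_sketch(
    [j["g1_winner"] for j in judgments],
    [j["g2_winner"] for j in judgments],
)
print(bias["bias_rate"], round(kappa, 3), interpret_kappa_sketch(kappa))
```

On this toy input the observed agreement is 3/4 and the chance agreement is 5/16, so kappa is 7/11 (about 0.636), which falls in the "substantial" band.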

Test plan

All 13 new tests pass (python3 -m pytest tests/test_consistency_metrics.py -v)
Existing imports verified intact
End-to-end tested with synthetic JSONL data
Manual verification with real MT-Bench judgment output
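
As an illustration of the synthetic-data approach in the test plan, the sketch below writes a few hand-made judgment records to a temporary JSONL file and reads them back, with no API calls. The record fields, model names, and helper names are assumptions chosen for the example, not taken from the PR's test file.

```python
import json
import os
import tempfile

# Hypothetical synthetic records; field names are assumed, values are made up.
records = [
    {"question_id": 81, "model_1": "model-a", "model_2": "model-b",
     "g1_winner": "model_1", "g2_winner": "model_1", "turn": 1},
    {"question_id": 82, "model_1": "model-a", "model_2": "model-b",
     "g1_winner": "model_1", "g2_winner": "model_2", "turn": 1},
    {"question_id": 83, "model_1": "model-a", "model_2": "model-b",
     "g1_winner": "tie", "g2_winner": "tie", "turn": 1},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic_pair.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

# Fraction of records where both orderings produced the same verdict.
agreement = sum(r["g1_winner"] == r["g2_winner"] for r in loaded) / len(loaded)
print(f"{len(loaded)} records, agreement rate {agreement:.2f}")
```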

🤖 Generated with Claude Code

Extends compute_agreement.py with compute_position_bias(), compute_cohens_kappa(), and interpret_kappa() functions that work with the standard _pair.jsonl output format. Modifies show_result.py to print a one-line consistency summary at the bottom of pairwise results and adds --show-consistency flag for detailed metrics (bias rate, direction, kappa, Landis & Koch). Includes 13 unit tests with synthetic data in tests/test_consistency_metrics.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

joaquinhuigomez wants to merge 1 commit into lm-sys:main from joaquinhuigomez:feature/position-bias-cohens-kappa
