
Conversation


@devcomfort commented Dec 20, 2025

Summary

This PR makes two major contributions: (1) a complete overhaul of the evaluation infrastructure to support efficient multi-model experiments, and (2) 23 new benchmark result files generated with the upgraded pipeline. The core motivation was to enable large-scale testing across diverse LLMs, which led to the integration of LiteLLM and global concurrency controls.


1. System Infrastructure Updates

To support seamless experimentation with multiple models, the following architectural improvements were implemented:

Enhanced Model Connectivity (LiteLLM)

  • Unified API Interface: Integrated LiteLLM to support multiple providers (OpenRouter, OpenAI, Anthropic, etc.) through a single, standardized class.
  • Custom Endpoint Support: Added api_base and api_key configuration to support local vLLM, Ollama, and other OpenAI-compatible custom endpoints.
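
For illustration, a minimal sketch of what such a unified wrapper can look like. The class name LiteLLMAPI, the api_base/api_key parameters, and the use of litellm.acompletion come from this PR; the constructor signature and method body below are assumptions, not the exact implementation:

from typing import Optional

import litellm

class LiteLLMAPI:
    """One async wrapper class for every provider/model string."""

    def __init__(self, model: str, api_base: Optional[str] = None, api_key: Optional[str] = None):
        self.model = model
        self.api_base = api_base  # e.g. a local vLLM or Ollama endpoint
        self.api_key = api_key    # falls back to provider env vars when None

    async def chat(self, messages: list, **kwargs) -> str:
        # litellm routes "openrouter/...", "openai/...", "anthropic/...", etc.
        response = await litellm.acompletion(
            model=self.model,
            messages=messages,
            api_base=self.api_base,
            api_key=self.api_key,
            **kwargs,
        )
        return response.choices[0].message.content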

Performance & Stability Optimization

  • Concurrent Multi-Model Evaluation: Re-engineered run_judge.py to evaluate multiple judge models in parallel using asyncio.gather.
  • Global Concurrency Control: Introduced a Global Semaphore to strictly limit total concurrent requests across all running models, ensuring stability and preventing API rate-limiting.
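
A minimal, runnable sketch of this concurrency model. The global semaphore and asyncio.gather usage mirror the PR description; the function names and the sleep that stands in for the real API call are illustrative:

import asyncio

async def judge_one(pair, judge_model, global_semaphore):
    # Every request, from every judge model, must acquire the same
    # global semaphore slot before it can hit the API.
    async with global_semaphore:
        await asyncio.sleep(0.01)  # stands in for the real judge API call
        return (judge_model, pair)

async def run_all(pairs, judge_models, concurrency_limit=50):
    global_semaphore = asyncio.Semaphore(concurrency_limit)
    tasks = [
        judge_one(pair, judge_model, global_semaphore)
        for judge_model in judge_models
        for pair in pairs
    ]
    # All judge models progress in parallel, but total in-flight
    # requests never exceed concurrency_limit.
    return await asyncio.gather(*tasks)

# Example: asyncio.run(run_all(range(100), ["model-a", "model-b"], concurrency_limit=10))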

Developer Experience (DX) & Reliability

  • Real-time Visualization: Integrated rich.progress for a multi-layered dashboard showing real-time progress for overall and individual model tasks (a sketch follows this list).
  • Pre-run Validation: Added automated environment checks via LiteLLM to verify API configurations before execution.
  • Standardized Documentation: Applied NumPy-style docstrings and added .env.example for better maintainability and easier setup.
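
A rough, runnable sketch of what the rich-based dashboard amounts to; the task names, totals, and the dummy loop are illustrative, while the real script advances the bars as judgments complete:

from rich.progress import Progress

models = ["model-a", "model-b"]  # illustrative judge model names
pairs_per_model = 10             # illustrative workload size

with Progress() as progress:
    overall = progress.add_task("All models", total=len(models) * pairs_per_model)
    per_model = {m: progress.add_task(m, total=pairs_per_model) for m in models}

    # In run_judge.py this happens as each judgment finishes:
    for m in models:
        for _ in range(pairs_per_model):
            progress.advance(per_model[m])
            progress.advance(overall)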

2. Experimental Results (Benchmark Data)

Using the upgraded pipeline, I have contributed 23 new JSONL result files to the outputs/ directory. These files contain Arena-Hard evaluation results for the judgebench dataset.

Included Judge Models & Performance (Arena-Hard)

| Provider  | Judge Model                                         | GPT-4o Baseline | Claude-3.5 Baseline |
|-----------|-----------------------------------------------------|-----------------|---------------------|
| Google    | openrouter/google/gemini-3-flash-preview            | 88.79%          | 87.41%              |
| Minimax   | openrouter/minimax/minimax-m2                       | 87.69%          | 81.69%              |
| Anthropic | openrouter/anthropic/claude-haiku-4.5               | 84.06%          | 73.86%              |
| DeepSeek  | openrouter/deepseek/deepseek-r1-0528                | 82.87%          | 82.83%              |
| OpenAI    | openrouter/openai/gpt-oss-120b                      | 78.00%          | N/A                 |
| Mistral   | openrouter/mistralai/mistral-medium-3.1             | 75.64%          | N/A*                |
| Inception | openrouter/inception/mercury                        | 72.86%          | 63.70%              |
| Z-AI      | openrouter/z-ai/glm-4.6                             | 71.97%          | 64.91%              |
| Inception | openrouter/inception/mercury-coder                  | 71.43%          | 62.59%              |
| OpenAI    | openrouter/openai/gpt-oss-20b                       | 70.86%          | N/A                 |
| Minimax   | openrouter/minimax/minimax-m2.1                     | 68.57%          | 61.85%              |
| Z-AI      | openrouter/z-ai/glm-4.7                             | 59.43%          | 59.26%              |
| Mistral   | openrouter/mistralai/mistral-small-3.1-24b-instruct | 55.49%          | 49.26%              |

*Excluded from Claude evaluations due to slow processing speed (~24 TPS).

Evaluation Scope

  • Results cover evaluations for both GPT-4o and Claude-3.5-Sonnet response models.
  • All judgments follow the standardized Arena-Hard methodology already established in the repository.

How to Run (Multi-Model Evaluation)

Prerequisites: Make sure to configure your API keys in .env (refer to .env.example for guidance).

With this update, you can now evaluate multiple judge models simultaneously by providing a comma-separated list to the --judge_model argument.

Example Command:

python run_judge.py \
  --judge_name arena_hard \
  --judge_model "openrouter/google/gemini-3-flash-preview,openrouter/deepseek/deepseek-r1-0528,openrouter/anthropic/claude-haiku-4.5" \
  --pairs data/dataset=judgebench,response_model=gpt-4o-2024-05-13.jsonl \
  --concurrency_limit 50

Key Features:

  • Parallel Processing: All specified models will be evaluated in parallel, managed by a global concurrency limiter.
  • Visual Feedback: You will see a dedicated progress bar for each model in your terminal.

Verified Items

  • Confirmed stable multi-model execution with flexible concurrency control (from strictly serial to highly parallel) via the global semaphore.
  • Verified that global concurrency limits are strictly enforced.
  • Validated real-time progress tracking and output file integrity (.jsonl).

Final Note

"Through the integration of LiteLLM, I hope to empower other external contributors to easily conduct experiments and explore the performance of any LLM they desire. I believe this infrastructure will lower the barrier for future research and contributions to the JudgeBench project."



Intent:
Add standardized NumPy-style docstrings to all utility modules in the 'utils' directory to improve code readability and maintainability. Also fixed minor type hints and linting errors.

Changes:
- Applied NumPy-style docstrings to all classes and functions in utils/file_operations.py, utils/judges.py, utils/metrics.py, utils/models.py, and utils/prompts.py (a representative example follows the file list below).
- Fixed return type hint for 'compute_final_metrics' in utils/metrics.py (None -> float).
- Resolved linting errors in utils/judges.py by moving 'AutoTokenizer' import and fixing docstring indentation.

Files:
- utils/file_operations.py
- utils/judges.py
- utils/metrics.py
- utils/models.py
- utils/prompts.py
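
For illustration, a NumPy-style docstring in the spirit of these changes. Only the function name and the float return type come from the commit message; the parameter, metric description, and body are assumptions:

def compute_final_metrics(results):
    """Aggregate per-pair judgments into a single score.

    Parameters
    ----------
    results : list of dict
        Parsed judgment records, one per evaluated pair (field names assumed).

    Returns
    -------
    float
        The fraction of pairs judged correctly.
    """
    correct = sum(1 for r in results if r.get("correct"))
    return correct / len(results) if results else 0.0
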
Intent:
Add standardized NumPy-style docstrings to the main execution script 'run_judge.py' to improve code readability and maintainability, ensuring consistency with the utility modules.

Changes:
- Added NumPy-style docstrings to 'judge_pairs' and 'main' functions.
- Fixed return type hint for 'judge_pairs' (None -> List[Dict[str, Any]]).
- Corrected parameter type hint for 'reverse_order' in 'judge_pairs' (int -> bool).

Files:
- run_judge.py

- Replace provider-specific API classes (OpenAI, Anthropic, Gemini, Together, Local) with a unified LiteLLMAPI class.
- Update get_chat_api_from_model factory function to return LiteLLMAPI for all models.
- Switch to litellm.acompletion and litellm.atext_completion for unified API access.
- Generalize error handling in backoff decorators to catch generic exceptions (see the sketch after the file list).

The intent of this refactoring is to reduce code complexity and maintenance overhead by leveraging LiteLLM's ability to interface with multiple LLM providers through a single consistent API. This simplifies the codebase by removing redundant logic for different model providers.

Modified files:
- utils/models.py
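
A sketch of what the generalized retry handling can look like, assuming the common backoff package; the repository's own decorator may be hand-rolled and differ in detail:

import backoff
import litellm

# Catching the broad Exception base class lets one decorator cover whatever
# error types the various provider SDKs raise through litellm.
@backoff.on_exception(backoff.expo, Exception, max_tries=6)
async def acompletion_with_retry(**kwargs):
    return await litellm.acompletion(**kwargs)
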
- Update ChatAPI and LiteLLMAPI to accept optional api_base and api_key in __init__.
- Modify LiteLLMAPI.chat and LiteLLMAPI.complete to use stored api_base/api_key as defaults.
- Update get_chat_api_from_model factory function to pass optional API configuration.

This allows users to configure local model endpoints (e.g., vLLM, Ollama) when the API object is created, rather than passing endpoint details to every chat call (see the sketch below).

Modified files:
- utils/models.py
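
For example, pointing the wrapper at a local vLLM server could look roughly like this; the endpoint URL, model name, and exact constructor signature are illustrative assumptions:

from utils.models import LiteLLMAPI

# A local server exposing an OpenAI-compatible endpoint (URL is a placeholder).
local_judge = LiteLLMAPI(
    model="openai/my-local-model",       # placeholder model identifier
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",                     # many local servers accept any token
)
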
Intent:
Provide a standardized template for API keys to help users configure their local environment for various LLM providers.

Changes:
- Created .env.example with placeholders for OpenRouter, OpenAI, Anthropic, Gemini, Together AI, and Mistral (an illustrative excerpt follows below).

Files:
- .env.example
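
An illustrative excerpt of such a template. The variable names follow LiteLLM's conventional environment variables; the committed .env.example may differ in naming or coverage:

# .env.example (illustrative excerpt)
OPENROUTER_API_KEY=your-openrouter-key
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
TOGETHERAI_API_KEY=your-together-key
MISTRAL_API_KEY=your-mistral-key
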
Intent:
Provide comprehensive evaluation results for the JudgeBench dataset using the newly implemented multi-model parallel evaluation pipeline.

Changes:
- Added 13 JSONL files in the outputs/ directory containing Arena-Hard evaluation results.
- Explicitly included results from the following judge models:
  - Anthropic Claude Haiku 4.5
  - DeepSeek R1
  - Google Gemini 3 Flash Preview
  - Minimax M2
  - Mistral Small 3.1 24B Instruct
  - Z-AI GLM 4.6
  - Mistral Medium 3.1 (Included only for GPT-4o; excluded from other evaluations due to slow processing speed ~24 TPS)
- These results cover evaluations for both Claude-3.5-Sonnet and GPT-4o response models.

Files:
- outputs/*.jsonl

…-model support

Intent:
Improve evaluation stability and efficiency by centralizing concurrency control and supporting concurrent multi-model evaluation with enhanced visual feedback.

Changes:
- Relocated Semaphore to main scope (global_semaphore) to strictly control total concurrency across all concurrent model evaluations, preventing API rate limits.
- Implemented asynchronous multi-model evaluation using asyncio.gather.
- Integrated 'rich.progress' for advanced, multi-layered visual tracking of evaluation status.
- Added pre-evaluation model validation using LiteLLM to ensure required environment variables are set.
- Improved efficiency with resume logic (skipping pairs that already have results; sketched below) and single-read data handling.

Modified files:
- run_judge.py
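
The resume behavior amounts to something like the following sketch; the output-file layout and the pair_id field name are assumptions for illustration:

import json
import os

def load_completed_ids(output_path):
    # On re-runs, pairs that already have a judgment in the existing
    # output file are skipped instead of being re-queried.
    if not os.path.exists(output_path):
        return set()
    with open(output_path) as f:
        return {json.loads(line)["pair_id"] for line in f if line.strip()}

# pairs_to_judge = [p for p in all_pairs if p["pair_id"] not in load_completed_ids(out_file)]
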
…d by rye)

Pins dependency versions and expands package list: adds litellm, httpx, aiohttp, httpcore, tiktoken, transformers==4.57.3, datasets==4.4.2; bumps openai 1.16.2→2.14.0; removes top-level 'anthropic'.
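
For reference, the additions amount to an excerpt along these lines; only the pins quoted in this PR are shown, and the committed requirements.txt is the authoritative, fully pinned list:

# requirements.txt (illustrative excerpt)
openai==2.14.0          # bumped from 1.16.2
transformers==4.57.3
datasets==4.4.2
litellm                 # pinned in the committed file
httpx
aiohttp
httpcore
tiktoken
rich
# top-level 'anthropic' removed
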
@devcomfort marked this pull request as draft December 23, 2025 18:09
@devcomfort marked this pull request as ready for review December 23, 2025 18:10
@devcomfort (Author) commented:

Note on Dependencies Update

This PR introduces a significant update to the evaluation infrastructure, which requires several new dependencies. The requirements.txt file has been updated to reflect these changes.

Key Dependencies Added:

  • litellm: Unified interface for connecting to multiple LLM providers (OpenRouter, OpenAI, Anthropic, etc.)
  • rich: Terminal-based visualization for real-time progress tracking
  • aiohttp, httpx, httpcore: Async HTTP support for concurrent model evaluations
  • tiktoken: Token counting and encoding for LLM models

Important for Installation:

Contributors and users who have previously cloned this repository should update their environment by running:

pip install -r requirements.txt

Or if using rye:

rye sync

Without these dependencies, the enhanced run_judge.py script and the new concurrent multi-model evaluation features will not work correctly.

Why These Dependencies Matter:

The original requirements.txt only specified base packages (openai==1.16.2, anthropic==0.26.1, etc.), which were insufficient for the new async/concurrent evaluation pipeline. This PR's infrastructure improvements depend critically on these additional libraries, particularly litellm for provider abstraction and rich for improved UX.

- Add and update judgment outputs under outputs/ following the naming convention
- Include evaluations for multiple judge models/providers (Anthropic, DeepSeek, Gemini, Minimax, Mistral, GLM, Meta Llama, GPT-4o, etc.)
- Cover response baselines: gpt-4o-2024-05-13 and claude-3-5-sonnet-20240620
- Generated via upgraded LiteLLM pipeline; aligns with PR ScalerLab#8 description
@devcomfort changed the title: Infrastructure Upgrade via LiteLLM & Addition of 13 Arena-Hard Benchmark Results → Infrastructure Upgrade via LiteLLM & Addition of 17 Arena-Hard Benchmark Results (Dec 26, 2025)
@devcomfort changed the title: Infrastructure Upgrade via LiteLLM & Addition of 17 Arena-Hard Benchmark Results → Infrastructure Upgrade via LiteLLM & Addition of 21 Arena-Hard Benchmark Results (Dec 26, 2025)
- Add evaluation results for openrouter/openai/gpt-oss-20b
- Add evaluation results for openrouter/openai/gpt-oss-120b
- Response baseline: gpt-4o-2024-05-13
- Generated via LiteLLM pipeline with arena_hard judge

- Improve code readability for international contributors
- Update all inline comments and documentation
- No functional changes
@devcomfort changed the title: Infrastructure Upgrade via LiteLLM & Addition of 21 Arena-Hard Benchmark Results → Infrastructure Upgrade via LiteLLM & Addition of 23 Arena-Hard Benchmark Results (Dec 31, 2025)