
Conversation


@devcomfort commented Dec 20, 2025

Summary

This PR makes two major contributions: (1) a complete overhaul of the evaluation infrastructure to support efficient multi-model experiments, and (2) 23 new benchmark result files generated with the upgraded pipeline. The core motivation was to enable large-scale testing across diverse LLMs, which led to the integration of LiteLLM and global concurrency controls.


1. System Infrastructure Updates

To support seamless experimentation with multiple models, the following architectural improvements were implemented:

Enhanced Model Connectivity (LiteLLM)

  • Unified API Interface: Integrated LiteLLM to support multiple providers (OpenRouter, OpenAI, Anthropic, etc.) through a single, standardized class.
  • Custom Endpoint Support: Added api_base and api_key configuration to support local vLLM, Ollama, and other OpenAI-compatible custom endpoints.
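
For illustration, a minimal sketch of what such a unified wrapper can look like. The class name LiteLLMAPI, the api_base/api_key parameters, and the use of litellm.acompletion come from this PR; the constructor signature and method body below are assumptions, not the exact implementation:

from typing import Optional

import litellm

class LiteLLMAPI:
    """One async wrapper class for every provider/model string."""

    def __init__(self, model: str, api_base: Optional[str] = None, api_key: Optional[str] = None):
        self.model = model
        self.api_base = api_base  # e.g. a local vLLM or Ollama endpoint
        self.api_key = api_key    # falls back to provider env vars when None

    async def chat(self, messages: list, **kwargs) -> str:
        # litellm routes "openrouter/...", "openai/...", "anthropic/...", etc.
        response = await litellm.acompletion(
            model=self.model,
            messages=messages,
            api_base=self.api_base,
            api_key=self.api_key,
            **kwargs,
        )
        return response.choices[0].message.content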

Performance & Stability Optimization

  • Concurrent Multi-Model Evaluation: Re-engineered run_judge.py to evaluate multiple judge models in parallel using asyncio.gather.
  • Global Concurrency Control: Introduced a Global Semaphore to strictly limit total concurrent requests across all running models, ensuring stability and preventing API rate-limiting.
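
A minimal, runnable sketch of this concurrency model. The global semaphore and asyncio.gather usage mirror the PR description; the function names and the sleep that stands in for the real API call are illustrative:

import asyncio

async def judge_one(pair, judge_model, global_semaphore):
    # Every request, from every judge model, must acquire the same
    # global semaphore slot before it can hit the API.
    async with global_semaphore:
        await asyncio.sleep(0.01)  # stands in for the real judge API call
        return (judge_model, pair)

async def run_all(pairs, judge_models, concurrency_limit=50):
    global_semaphore = asyncio.Semaphore(concurrency_limit)
    tasks = [
        judge_one(pair, judge_model, global_semaphore)
        for judge_model in judge_models
        for pair in pairs
    ]
    # All judge models progress in parallel, but total in-flight
    # requests never exceed concurrency_limit.
    return await asyncio.gather(*tasks)

# Example: asyncio.run(run_all(range(100), ["model-a", "model-b"], concurrency_limit=10))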

Developer Experience (DX) & Reliability

  • Real-time Visualization: Integrated rich.progress for a multi-layered dashboard showing real-time progress for overall and individual model tasks (a sketch follows this list).
  • Pre-run Validation: Added automated environment checks via LiteLLM to verify API configurations before execution.
  • Standardized Documentation: Applied NumPy-style docstrings and added .env.example for better maintainability and easier setup.
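
A rough, runnable sketch of what the rich-based dashboard amounts to; the task names, totals, and the dummy loop are illustrative, while the real script advances the bars as judgments complete:

from rich.progress import Progress

models = ["model-a", "model-b"]  # illustrative judge model names
pairs_per_model = 10             # illustrative workload size

with Progress() as progress:
    overall = progress.add_task("All models", total=len(models) * pairs_per_model)
    per_model = {m: progress.add_task(m, total=pairs_per_model) for m in models}

    # In run_judge.py this happens as each judgment finishes:
    for m in models:
        for _ in range(pairs_per_model):
            progress.advance(per_model[m])
            progress.advance(overall)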

2. Experimental Results (Benchmark Data)

Using the upgraded pipeline, I have contributed 23 new JSONL result files to the outputs/ directory. These files contain Arena-Hard evaluation results for the judgebench dataset.

Included Judge Models & Performance (Arena-Hard)

| Provider  | Judge Model                                         | GPT-4o Baseline | Claude-3.5 Baseline |
|-----------|-----------------------------------------------------|-----------------|---------------------|
| Google    | openrouter/google/gemini-3-flash-preview            | 88.79%          | 87.41%              |
| Minimax   | openrouter/minimax/minimax-m2                       | 87.69%          | 81.69%              |
| Anthropic | openrouter/anthropic/claude-haiku-4.5               | 84.06%          | 73.86%              |
| DeepSeek  | openrouter/deepseek/deepseek-r1-0528                | 82.87%          | 82.83%              |
| OpenAI    | openrouter/openai/gpt-oss-120b                      | 78.00%          | N/A                 |
| Mistral   | openrouter/mistralai/mistral-medium-3.1             | 75.64%          | N/A*                |
| Inception | openrouter/inception/mercury                        | 72.86%          | 63.70%              |
| Z-AI      | openrouter/z-ai/glm-4.6                             | 71.97%          | 64.91%              |
| Inception | openrouter/inception/mercury-coder                  | 71.43%          | 62.59%              |
| OpenAI    | openrouter/openai/gpt-oss-20b                       | 70.86%          | N/A                 |
| Minimax   | openrouter/minimax/minimax-m2.1                     | 68.57%          | 61.85%              |
| Z-AI      | openrouter/z-ai/glm-4.7                             | 59.43%          | 59.26%              |
| Mistral   | openrouter/mistralai/mistral-small-3.1-24b-instruct | 55.49%          | 49.26%              |

*Excluded from Claude evaluations due to slow processing speed (~24 TPS).

Evaluation Scope

  • Results cover evaluations for both GPT-4o and Claude-3.5-Sonnet response models.
  • All judgments follow the standardized Arena-Hard methodology already established in the repository.

How to Run (Multi-Model Evaluation)

Prerequisites: Make sure to configure your API keys in .env (refer to .env.example for guidance).

With this update, you can now evaluate multiple judge models simultaneously by providing a comma-separated list to the --judge_model argument.

Example Command:

python run_judge.py \
  --judge_name arena_hard \
  --judge_model "openrouter/google/gemini-3-flash-preview,openrouter/deepseek/deepseek-r1-0528,openrouter/anthropic/claude-haiku-4.5" \
  --pairs data/dataset=judgebench,response_model=gpt-4o-2024-05-13.jsonl \
  --concurrency_limit 50

Key Features:

  • Parallel Processing: All specified models will be evaluated in parallel, managed by a global concurrency limiter.
  • Visual Feedback: You will see a dedicated progress bar for each model in your terminal.

Verified Items

  • Confirmed stable multi-model execution with flexible concurrency control (from strictly serial to highly parallel) via the global semaphore.
  • Verified that global concurrency limits are strictly enforced.
  • Validated real-time progress tracking and output file integrity (.jsonl).

Final Note

"Through the integration of LiteLLM, I hope to empower other external contributors to easily conduct experiments and explore the performance of any LLM they desire. I believe this infrastructure will lower the barrier for future research and contributions to the JudgeBench project."



Intent:
Add standardized NumPy-style docstrings to all utility modules in the 'utils' directory to improve code readability and maintainability. Also fixed minor type hints and linting errors.

Changes:
- Applied NumPy-style docstrings to all classes and functions in utils/file_operations.py, utils/judges.py, utils/metrics.py, utils/models.py, and utils/prompts.py (a representative example follows the file list below).
- Fixed return type hint for 'compute_final_metrics' in utils/metrics.py (None -> float).
- Resolved linting errors in utils/judges.py by moving 'AutoTokenizer' import and fixing docstring indentation.

Files:
- utils/file_operations.py
- utils/judges.py
- utils/metrics.py
- utils/models.py
- utils/prompts.py
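
For illustration, a NumPy-style docstring in the spirit of these changes. Only the function name and the float return type come from the commit message; the parameter, metric description, and body are assumptions:

def compute_final_metrics(results):
    """Aggregate per-pair judgments into a single score.

    Parameters
    ----------
    results : list of dict
        Parsed judgment records, one per evaluated pair (field names assumed).

    Returns
    -------
    float
        The fraction of pairs judged correctly.
    """
    correct = sum(1 for r in results if r.get("correct"))
    return correct / len(results) if results else 0.0
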
Intent:
Add standardized NumPy-style docstrings to the main execution script 'run_judge.py' to improve code readability and maintainability, ensuring consistency with the utility modules.

Changes:
- Added NumPy-style docstrings to 'judge_pairs' and 'main' functions.
- Fixed return type hint for 'judge_pairs' (None -> List[Dict[str, Any]]).
- Corrected parameter type hint for 'reverse_order' in 'judge_pairs' (int -> bool).

Files:
- run_judge.py

- Replace provider-specific API classes (OpenAI, Anthropic, Gemini, Together, Local) with a unified LiteLLMAPI class.
- Update get_chat_api_from_model factory function to return LiteLLMAPI for all models.
- Switch to litellm.acompletion and litellm.atext_completion for unified API access.
- Generalize error handling in backoff decorators to catch generic exceptions (see the sketch after the file list).

The intent of this refactoring is to reduce code complexity and maintenance overhead by leveraging LiteLLM's ability to interface with multiple LLM providers through a single consistent API. This simplifies the codebase by removing redundant logic for different model providers.

Modified files:
- utils/models.py
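
A sketch of what the generalized retry handling can look like, assuming the common backoff package; the repository's own decorator may be hand-rolled and differ in detail:

import backoff
import litellm

# Catching the broad Exception base class lets one decorator cover whatever
# error types the various provider SDKs raise through litellm.
@backoff.on_exception(backoff.expo, Exception, max_tries=6)
async def acompletion_with_retry(**kwargs):
    return await litellm.acompletion(**kwargs)
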
- Update ChatAPI and LiteLLMAPI to accept optional api_base and api_key in __init__.
- Modify LiteLLMAPI.chat and LiteLLMAPI.complete to use stored api_base/api_key as defaults.
- Update get_chat_api_from_model factory function to pass optional API configuration.

This allows users to configure local model endpoints (e.g., vLLM, Ollama) when the API object is created, rather than passing endpoint details to every chat call (see the sketch below).

Modified files:
- utils/models.py
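
For example, pointing the wrapper at a local vLLM server could look roughly like this; the endpoint URL, model name, and exact constructor signature are illustrative assumptions:

from utils.models import LiteLLMAPI

# A local server exposing an OpenAI-compatible endpoint (URL is a placeholder).
local_judge = LiteLLMAPI(
    model="openai/my-local-model",       # placeholder model identifier
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",                     # many local servers accept any token
)
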
Intent:
Provide a standardized template for API keys to help users configure their local environment for various LLM providers.

Changes:
- Created .env.example with placeholders for OpenRouter, OpenAI, Anthropic, Gemini, Together AI, and Mistral (an illustrative excerpt follows below).

Files:
- .env.example
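
An illustrative excerpt of such a template. The variable names follow LiteLLM's conventional environment variables; the committed .env.example may differ in naming or coverage:

# .env.example (illustrative excerpt)
OPENROUTER_API_KEY=your-openrouter-key
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
TOGETHERAI_API_KEY=your-together-key
MISTRAL_API_KEY=your-mistral-key
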
Intent:
Provide comprehensive evaluation results for the JudgeBench dataset using the newly implemented multi-model parallel evaluation pipeline.

Changes:
- Added 13 JSONL files in the outputs/ directory containing Arena-Hard evaluation results.
- Explicitly included results from the following judge models:
  - Anthropic Claude Haiku 4.5
  - DeepSeek R1
  - Google Gemini 3 Flash Preview
  - Minimax M2
  - Mistral Small 3.1 24B Instruct
  - Z-AI GLM 4.6
  - Mistral Medium 3.1 (Included only for GPT-4o; excluded from other evaluations due to slow processing speed ~24 TPS)
- These results cover evaluations for both Claude-3.5-Sonnet and GPT-4o response models.

Files:
- outputs/*.jsonl

…-model support

Intent:
Improve evaluation stability and efficiency by centralizing concurrency control and supporting concurrent multi-model evaluation with enhanced visual feedback.

Changes:
- Relocated Semaphore to main scope (global_semaphore) to strictly control total concurrency across all concurrent model evaluations, preventing API rate limits.
- Implemented asynchronous multi-model evaluation using asyncio.gather.
- Integrated 'rich.progress' for advanced, multi-layered visual tracking of evaluation status.
- Added pre-evaluation model validation using LiteLLM to ensure required environment variables are set.
- Improved efficiency with resume logic (skipping pairs that already have results; sketched below) and single-read data handling.

Modified files:
- run_judge.py
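
The resume behavior amounts to something like the following sketch; the output-file layout and the pair_id field name are assumptions for illustration:

import json
import os

def load_completed_ids(output_path):
    # On re-runs, pairs that already have a judgment in the existing
    # output file are skipped instead of being re-queried.
    if not os.path.exists(output_path):
        return set()
    with open(output_path) as f:
        return {json.loads(line)["pair_id"] for line in f if line.strip()}

# pairs_to_judge = [p for p in all_pairs if p["pair_id"] not in load_completed_ids(out_file)]
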
…d by rye)

Pins dependency versions and expands package list: adds litellm, httpx, aiohttp, httpcore, tiktoken, transformers==4.57.3, datasets==4.4.2; bumps openai 1.16.2→2.14.0; removes top-level 'anthropic'.
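
For reference, the additions amount to an excerpt along these lines; only the pins quoted in this PR are shown, and the committed requirements.txt is the authoritative, fully pinned list:

# requirements.txt (illustrative excerpt)
openai==2.14.0          # bumped from 1.16.2
transformers==4.57.3
datasets==4.4.2
litellm                 # pinned in the committed file
httpx
aiohttp
httpcore
tiktoken
rich
# top-level 'anthropic' removed
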
@devcomfort marked this pull request as draft December 23, 2025 18:09
@devcomfort marked this pull request as ready for review December 23, 2025 18:10
@devcomfort (Author) commented:

Note on Dependencies Update

This PR introduces a significant update to the evaluation infrastructure, which requires several new dependencies. The requirements.txt file has been updated to reflect these changes.

Key Dependencies Added:

  • litellm: Unified interface for connecting to multiple LLM providers (OpenRouter, OpenAI, Anthropic, etc.)
  • rich: Terminal-based visualization for real-time progress tracking
  • aiohttp, httpx, httpcore: Async HTTP support for concurrent model evaluations
  • tiktoken: Token counting and encoding for LLM models

Important for Installation:

Contributors and users who have previously cloned this repository should update their environment by running:

pip install -r requirements.txt

Or if using rye:

rye sync

Without these dependencies, the enhanced run_judge.py script and the new concurrent multi-model evaluation features will not work correctly.

Why These Dependencies Matter:

The original requirements.txt only specified base packages (openai==1.16.2, anthropic==0.26.1, etc.), which were insufficient for the new async/concurrent evaluation pipeline. This PR's infrastructure improvements depend critically on these additional libraries, particularly litellm for provider abstraction and rich for improved UX.

- Add and update judgment outputs under outputs/ following the naming convention
- Include evaluations for multiple judge models/providers (Anthropic, DeepSeek, Gemini, Minimax, Mistral, GLM, Meta Llama, GPT-4o, etc.)
- Cover response baselines: gpt-4o-2024-05-13 and claude-3-5-sonnet-20240620
- Generated via upgraded LiteLLM pipeline; aligns with PR ScalerLab#8 description
@devcomfort changed the title: Infrastructure Upgrade via LiteLLM & Addition of 13 Arena-Hard Benchmark Results → Infrastructure Upgrade via LiteLLM & Addition of 17 Arena-Hard Benchmark Results (Dec 26, 2025)
@devcomfort changed the title: Infrastructure Upgrade via LiteLLM & Addition of 17 Arena-Hard Benchmark Results → Infrastructure Upgrade via LiteLLM & Addition of 21 Arena-Hard Benchmark Results (Dec 26, 2025)
- Add evaluation results for openrouter/openai/gpt-oss-20b
- Add evaluation results for openrouter/openai/gpt-oss-120b
- Response baseline: gpt-4o-2024-05-13
- Generated via LiteLLM pipeline with arena_hard judge

- Improve code readability for international contributors
- Update all inline comments and documentation
- No functional changes
@devcomfort changed the title: Infrastructure Upgrade via LiteLLM & Addition of 21 Arena-Hard Benchmark Results → Infrastructure Upgrade via LiteLLM & Addition of 23 Arena-Hard Benchmark Results (Dec 31, 2025)