Infrastructure Upgrade via LiteLLM & Addition of 23 Arena-Hard Benchmark Results #8
Conversation
Intent: Add standardized NumPy-style docstrings to all utility modules in the 'utils' directory to improve code readability and maintainability. Also fix minor type hints and linting errors.

Changes:
- Applied NumPy-style docstrings to all classes and functions in utils/file_operations.py, utils/judges.py, utils/metrics.py, utils/models.py, and utils/prompts.py.
- Fixed the return type hint for 'compute_final_metrics' in utils/metrics.py (None -> float).
- Resolved linting errors in utils/judges.py by moving the 'AutoTokenizer' import and fixing docstring indentation.

Files:
- utils/file_operations.py
- utils/judges.py
- utils/metrics.py
- utils/models.py
- utils/prompts.py
Intent: Add standardized NumPy-style docstrings to the main execution script 'run_judge.py' to improve code readability and maintainability, ensuring consistency with the utility modules.

Changes:
- Added NumPy-style docstrings to the 'judge_pairs' and 'main' functions.
- Fixed the return type hint for 'judge_pairs' (None -> List[Dict[str, Any]]).
- Corrected the parameter type hint for 'reverse_order' in 'judge_pairs' (int -> bool).

Files:
- run_judge.py
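For illustration, a NumPy-style docstring following this convention might look like the sketch below. Only the function name, the `reverse_order` hint, and the return type come from this commit; the other parameter is hypothetical and the real signature in run_judge.py may differ.

```python
from typing import Any, Dict, List


def judge_pairs(pairs: List[Dict[str, Any]], reverse_order: bool = False) -> List[Dict[str, Any]]:
    """Judge a list of response pairs and collect the judgments.

    Parameters
    ----------
    pairs : list of dict
        Response pairs to evaluate (hypothetical parameter; the actual
        signature in run_judge.py may differ).
    reverse_order : bool
        Whether the two responses in each pair are judged in reversed order.

    Returns
    -------
    list of dict
        One judgment record per evaluated pair.
    """
    raise NotImplementedError  # sketch only; see run_judge.py for the actual implementation
```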
- Replace provider-specific API classes (OpenAI, Anthropic, Gemini, Together, Local) with a unified LiteLLMAPI class.
- Update the get_chat_api_from_model factory function to return LiteLLMAPI for all models.
- Switch to litellm.acompletion and litellm.atext_completion for unified API access.
- Generalize error handling in the backoff decorators to catch generic exceptions.

The intent of this refactoring is to reduce code complexity and maintenance overhead by leveraging LiteLLM's ability to interface with multiple LLM providers through a single consistent API. This simplifies the codebase by removing redundant logic for different model providers.

Modified files:
- utils/models.py
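A minimal sketch of the unified wrapper this commit describes, assuming the names `LiteLLMAPI` and `get_chat_api_from_model` from the commit; the method signatures, the `model_name` parameter, and the inline retry loop (standing in for the backoff decorators mentioned above) are illustrative, not the repository's actual implementation.

```python
import asyncio

import litellm


class LiteLLMAPI:
    """Unified wrapper replacing the per-provider API classes (sketch)."""

    async def chat(self, model: str, messages: list, max_retries: int = 3, **kwargs) -> str:
        # Generic retry/backoff: catch any exception instead of provider-specific error types.
        for attempt in range(max_retries):
            try:
                response = await litellm.acompletion(model=model, messages=messages, **kwargs)
                return response.choices[0].message.content
            except Exception:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)

    async def complete(self, model: str, prompt: str, **kwargs) -> str:
        # Text-completion path, also routed through LiteLLM.
        response = await litellm.atext_completion(model=model, prompt=prompt, **kwargs)
        return response.choices[0].text


def get_chat_api_from_model(model_name: str) -> LiteLLMAPI:
    # With LiteLLM routing on the model string, every model maps to the same class.
    return LiteLLMAPI()
```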
- Update ChatAPI and LiteLLMAPI to accept an optional api_base and api_key in __init__.
- Modify LiteLLMAPI.chat and LiteLLMAPI.complete to use the stored api_base/api_key as defaults.
- Update the get_chat_api_from_model factory function to pass the optional API configuration.

This allows users to easily configure local model endpoints (e.g., vLLM, Ollama) at the time of API object creation, rather than passing them as arguments to every chat call.

Modified files:
- utils/models.py
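As a hedged usage example: assuming `LiteLLMAPI.__init__` accepts `api_base`/`api_key` as this commit states and exposes a `chat(model=..., messages=...)` method, configuring a local endpoint once at construction time might look like this. The endpoint URL, key, and model string are placeholders.

```python
import asyncio

from utils.models import LiteLLMAPI  # class lives in utils/models.py per this commit


async def main() -> None:
    api = LiteLLMAPI(
        api_base="http://localhost:8000/v1",  # placeholder: local vLLM/Ollama OpenAI-compatible server
        api_key="sk-local-placeholder",       # placeholder key
    )
    # api_base/api_key no longer need to be passed on every call;
    # the values stored at construction time act as defaults.
    reply = await api.chat(
        model="openai/my-local-model",  # placeholder model string routed to the custom endpoint
        messages=[{"role": "user", "content": "ping"}],
    )
    print(reply)


asyncio.run(main())
```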
Intent: Provide a standardized template for API keys to help users configure their local environment for various LLM providers.

Changes:
- Created .env.example with placeholders for OpenRouter, OpenAI, Anthropic, Gemini, Together AI, and Mistral.

Files:
- .env.example
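As an illustration only, such a template might contain entries like the following; the variable names shown are LiteLLM's conventional ones and may differ from the actual .env.example in this PR.

```
# .env.example (illustrative placeholders)
OPENROUTER_API_KEY=your-openrouter-key
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
TOGETHERAI_API_KEY=your-together-key
MISTRAL_API_KEY=your-mistral-key
```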
Intent: Provide comprehensive evaluation results for the JudgeBench dataset using the newly implemented multi-model parallel evaluation pipeline.

Changes:
- Added 13 JSONL files in the outputs/ directory containing Arena-Hard evaluation results.
- Explicitly included results from the following judge models:
  - Anthropic Claude Haiku 4.5
  - DeepSeek R1
  - Google Gemini 3 Flash Preview
  - Minimax M2
  - Mistral Small 3.1 24B Instruct
  - Z-AI GLM 4.6
  - Mistral Medium 3.1 (included only for GPT-4o; excluded from other evaluations due to slow processing speed, ~24 TPS)
- These results cover evaluations for both Claude-3.5-Sonnet and GPT-4o response models.

Files:
- outputs/*.jsonl
…-model support

Intent: Improve evaluation stability and efficiency by centralizing concurrency control and supporting concurrent multi-model evaluation with enhanced visual feedback.

Changes:
- Relocated the Semaphore to main scope (global_semaphore) to strictly control total concurrency across all concurrent model evaluations, preventing API rate limits.
- Implemented asynchronous multi-model evaluation using asyncio.gather.
- Integrated 'rich.progress' for advanced, multi-layered visual tracking of evaluation status.
- Added pre-evaluation model validation using LiteLLM to ensure required environment variables are set.
- Improved efficiency with resume logic (skipping existing results) and single-read data handling.

Modified files:
- run_judge.py
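A condensed sketch of the concurrency pattern this commit describes: a single semaphore in main scope bounding all requests, `asyncio.gather` across models, and `rich.progress` for per-model progress bars. Function names, the semaphore limit, and the placeholder judge call are assumptions rather than the actual run_judge.py code.

```python
import asyncio

from rich.progress import Progress

# One semaphore in main scope bounds total in-flight requests across all models.
global_semaphore = asyncio.Semaphore(50)


async def judge_one(model: str, pair: dict, progress: Progress, task_id) -> dict:
    async with global_semaphore:
        await asyncio.sleep(0)  # placeholder for the actual LiteLLM-backed judge call
    progress.advance(task_id)
    return {"judge_model": model, "pair": pair, "judgment": None}


async def evaluate_models(models: list, pairs: list) -> None:
    with Progress() as progress:
        tasks = []
        for model in models:
            # One progress bar per judge model.
            task_id = progress.add_task(f"[cyan]{model}", total=len(pairs))
            tasks += [judge_one(model, pair, progress, task_id) for pair in pairs]
        # All models run concurrently, but never more than 50 requests in flight.
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(evaluate_models(["judge-a", "judge-b"], [{"prompt": "..."}] * 10))
```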
…d by rye)

Pins dependency versions and expands the package list: adds litellm, httpx, aiohttp, httpcore, tiktoken, transformers==4.57.3, datasets==4.4.2; bumps openai 1.16.2→2.14.0; removes top-level 'anthropic'.
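Based on that list, the updated requirements.txt plausibly contains entries along these lines (versions shown only where this commit message states them; pins for the other additions are not specified here):

```
litellm
httpx
aiohttp
httpcore
tiktoken
transformers==4.57.3
datasets==4.4.2
openai==2.14.0
```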
- Add and update judgment outputs under outputs/ following the naming convention
- Include evaluations for multiple judge models/providers (Anthropic, DeepSeek, Gemini, Minimax, Mistral, GLM, Meta Llama, GPT-4o, etc.)
- Cover response baselines: gpt-4o-2024-05-13 and claude-3-5-sonnet-20240620
- Generated via the upgraded LiteLLM pipeline; aligns with PR ScalerLab#8 description
- Add evaluation results for openrouter/openai/gpt-oss-20b
- Add evaluation results for openrouter/openai/gpt-oss-120b
- Response baseline: gpt-4o-2024-05-13
- Generated via the LiteLLM pipeline with the arena_hard judge
- Improve code readability for international contributors
- Update all inline comments and documentation
- No functional changes
Summary
This PR focuses on two major contributions: (1) a complete overhaul of the evaluation infrastructure to support efficient multi-model experiments, and (2) the provision of 23 new benchmark result files generated through this upgraded pipeline. The core motivation was to enable large-scale testing across diverse LLMs, which led to the integration of LiteLLM and advanced concurrency controls.
1. System Infrastructure Updates
To support seamless experimentation with multiple models, the following architectural improvements were implemented:
Enhanced Model Connectivity (LiteLLM)
api_baseandapi_keyconfiguration to support local vLLM, Ollama, and other OpenAI-compatible custom endpoints.Performance & Stability Optimization
run_judge.pyto evaluate multiple judge models in parallel usingasyncio.gather.Developer Experience (DX) & Reliability
rich.progressfor a multi-layered dashboard showing real-time progress for overall and individual model tasks..env.examplefor better maintainability and easier setup.2. Experimental Results (Benchmark Data)
Using the upgraded pipeline, I have contributed 23 new JSONL result files to the `outputs/` directory. These files contain Arena-Hard evaluation results for the `judgebench` dataset.

Included Judge Models & Performance (Arena-Hard)
- openrouter/google/gemini-3-flash-preview
- openrouter/minimax/minimax-m2
- openrouter/anthropic/claude-haiku-4.5
- openrouter/deepseek/deepseek-r1-0528
- openrouter/openai/gpt-oss-120b
- openrouter/mistralai/mistral-medium-3.1 *
- openrouter/inception/mercury
- openrouter/z-ai/glm-4.6
- openrouter/inception/mercury-coder
- openrouter/openai/gpt-oss-20b
- openrouter/minimax/minimax-m2.1
- openrouter/z-ai/glm-4.7
- openrouter/mistralai/mistral-small-3.1-24b-instruct

*Excluded from Claude evaluations due to slow processing speed (~24 TPS).
Evaluation Scope
- Dataset: judgebench
- Response baselines: gpt-4o-2024-05-13 and claude-3-5-sonnet-20240620
How to Run (Multi-Model Evaluation)
Prerequisites: Make sure to configure your API keys in `.env` (refer to `.env.example` for guidance).

With this update, you can now evaluate multiple judge models simultaneously by providing a comma-separated list to the `--judge_model` argument.

Example Command:
```
python run_judge.py \
  --judge_name arena_hard \
  --judge_model "openrouter/google/gemini-3-flash-preview,openrouter/deepseek/deepseek-r1-0528,openrouter/anthropic/claude-haiku-4.5" \
  --pairs data/dataset=judgebench,response_model=gpt-4o-2024-05-13.jsonl \
  --concurrency_limit 50
```

Key Features:
Verified Items
Final Note
Note on Dependencies Update
This PR introduces a significant update to the evaluation infrastructure, which requires several new dependencies. The `requirements.txt` file has been updated to reflect these changes.

Key Dependencies Added:
- litellm (provider abstraction)
- rich (progress display)
- httpx, aiohttp, httpcore
- tiktoken
- transformers==4.57.3
- datasets==4.4.2
- openai (bumped 1.16.2 → 2.14.0)
Important for Installation:
Contributors and users who have previously cloned this repository should update their environment by running:

```
pip install -r requirements.txt
```

Or, if using rye:

```
rye sync
```

Without these dependencies, the enhanced `run_judge.py` script and the new concurrent multi-model evaluation features will not work correctly.

Why These Dependencies Matter:
The original `requirements.txt` only specified base packages (openai==1.16.2, anthropic==0.26.1, etc.), which were insufficient for the new async/concurrent evaluation pipeline. This PR's infrastructure improvements depend critically on these additional libraries, particularly litellm for provider abstraction and rich for improved UX.