AI model benchmarking CLI — compare LLMs on speed, accuracy, and cost in a single reproducible run.
Point it at any two models and get side-by-side numbers for time-to-first-token, throughput, MMLU accuracy across 57 subjects, and per-token cost. Built for engineers picking an inference provider and researchers comparing model families.
- Performance — time-to-first-token (TTFT), total latency, tokens/sec (streaming)
- Accuracy — keyword-based prompt scoring plus full MMLU evaluation across 57 subjects, with configurable zero-shot or few-shot prompting
- Cost — input/output token pricing and a cost-performance ratio
- Reproducibility — deterministic by default (temperature=0, fixed max_tokens=1024, streaming TTFT capture)
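As a rough illustration of how streaming metrics such as TTFT and tokens/sec can be captured, here is a minimal sketch. It is not the tool's actual implementation: `measure_stream` is a hypothetical helper, it takes any iterable of text chunks rather than a real API response, and it approximates tokens with a whitespace word count.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream of text chunks and report timing metrics.

    `chunks` is any iterable that yields text pieces as they arrive,
    for example the delta content of a streaming chat completion.
    """
    start = clock()
    first_token_at = None
    pieces = []
    for piece in chunks:
        if not piece:
            continue  # ignore empty chunks; they carry no text
        if first_token_at is None:
            first_token_at = clock()  # first non-empty chunk marks TTFT
        pieces.append(piece)
    end = clock()

    text = "".join(pieces)
    # Assumption: whitespace-separated words stand in for tokens here.
    approx_tokens = len(text.split())
    total = end - start
    return {
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "latency_s": total,
        "approx_tokens_per_s": approx_tokens / total if total > 0 else 0.0,
        "text": text,
    }
```

Passing the clock in as a parameter keeps the measurement logic testable offline; in a real run the default `time.perf_counter` would be used and `chunks` would come from the provider's streaming response.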
Primary integration:
- Hyperbolic — managed API for open-weight models (Llama 3 / 3.1, Mixtral 8x7B, and others surfaced through Hyperbolic's API)
Provider adapters live under providers/, so adding new ones (OpenAI-compatible endpoints, Together, Fireworks, Groq, vLLM servers, Ollama, etc.) is a matter of implementing the request/response adapter.
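To give a sense of what a request/response adapter might look like, here is a hedged sketch. The names `ChatRequest`, `ProviderAdapter`, `EchoAdapter`, and `run` are hypothetical and are not taken from the repo; the actual interface under providers/ may differ.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class ChatRequest:
    """Provider-neutral request (hypothetical shape)."""
    model: str
    prompt: str
    system: str = ""
    temperature: float = 0.0  # mirrors the deterministic default
    max_tokens: int = 1024


class ProviderAdapter(Protocol):
    """Anything with a name and a streaming method counts as an adapter."""
    name: str

    def stream(self, request: ChatRequest) -> Iterator[str]:
        ...


class EchoAdapter:
    """Offline stand-in that streams the prompt back word by word."""
    name = "echo"

    def stream(self, request: ChatRequest) -> Iterator[str]:
        for word in request.prompt.split():
            yield word + " "


def run(adapter: ProviderAdapter, request: ChatRequest) -> str:
    """Collect the streamed chunks into the full response text."""
    return "".join(adapter.stream(request)).strip()
```

A real adapter would replace the body of `stream` with a call to the provider's API and yield text chunks as they arrive, so the benchmarking code never needs to know which backend it is talking to.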
A Lilypad adapter ships in the repo from earlier development. Lilypad's hosted Anura testnet is no longer the active product it was when this tool was built, so the Hyperbolic path is the primary, maintained one. The Lilypad adapter is left in the codebase as a reference implementation for plugging in alternative compute backends.
git clone https://github.com/PSkinnerTech/hypercompare.git
cd hypercompare
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install openai python-dotenv requests
cp .env.example .env
chmod +x hypercompare

Minimum .env:
HYPERBOLIC_API_KEY=...
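Compare two models:
./hypercompare meta-llama/Meta-Llama-3-70B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct

Set the provider for each model:
./hypercompare model_a model_b --providers hyperbolic <other>

Custom prompt files use one prompt per line, with optional pipe-delimited keywords for accuracy scoring: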
Who wrote Hamlet? | Shakespeare, William Shakespeare
What is 2 + 2? | 4, four
Summarize the benefits of exercise in 2-3 sentences.
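
The prompt file format above can be parsed and scored roughly as follows. This is a sketch under assumptions rather than the tool's actual scorer: `parse_prompt_line` and `keyword_score` are hypothetical names, and it assumes a response passes when any one keyword appears, case-insensitively.

```python
from typing import List, Optional, Tuple


def parse_prompt_line(line: str) -> Tuple[str, List[str]]:
    """Split 'prompt | kw1, kw2' into (prompt, [keywords])."""
    prompt, sep, keywords = line.partition("|")
    if not sep:
        return prompt.strip(), []  # no pipe: prompt has no keywords
    return prompt.strip(), [k.strip() for k in keywords.split(",") if k.strip()]


def keyword_score(response: str, keywords: List[str]) -> Optional[float]:
    """Return 1.0 if any keyword occurs in the response, 0.0 if none do.

    Returns None when there are no keywords, i.e. the prompt is unscored.
    Assumption: any-match, case-insensitive substring search.
    """
    if not keywords:
        return None
    lowered = response.lower()
    return 1.0 if any(k.lower() in lowered for k in keywords) else 0.0
```

Substring matching is deliberately simple and has a known weakness: a keyword such as 4 also matches 14, so short keywords are best paired with a spelled-out alternative.
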
./hypercompare model_a model_b --prompts your_prompts.txt

--providers PROVIDER PROVIDER   Provider for model_a and model_b
--prompts PATH                  Custom prompt file
--system TEXT                   System prompt applied to all test cases
--temperature FLOAT             Default 0 (deterministic)
--skip-mmlu                     Skip MMLU for a faster run
--n-shots N                     Few-shot examples for MMLU (default 0)
--num-questions N               Questions per MMLU subject (default 5)
--verbose                       Detailed per-request output
Full MMLU support across all 57 subjects, with:
- Configurable zero-shot or few-shot prompting (--n-shots)
- Multi-layered answer extraction (initial-letter, Answer: X patterns, regex fallback) so models that wrap answers in explanations still score correctly
- Per-subject and aggregate accuracy reporting
- Combined cost-vs-accuracy analysis
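
The layered answer extraction described above could look something like the sketch below. The function name `extract_choice` and the exact patterns are assumptions for illustration; the repo's own extraction rules may be stricter or ordered differently.

```python
import re
from typing import Optional

# Layer 1: the response starts with a bare choice letter ("B", "B.", "(C) ...").
_INITIAL = re.compile(r"^\(?([ABCD])\)?(?:[.):,]|\s*$)")
# Layer 2: an explicit "Answer: X" or "answer is X" phrase anywhere in the text.
_ANSWER = re.compile(
    r"(?i:answer)\s*(?:(?i:is)\s*)?[:\-]?\s*\(?([ABCD])\)?(?![A-Za-z])"
)
# Layer 3: last resort, the first standalone capital letter A-D.
_FALLBACK = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter A-D, or None if nothing plausible is found."""
    text = response.strip()
    match = _INITIAL.match(text)
    if match:
        return match.group(1)
    for pattern in (_ANSWER, _FALLBACK):
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```

The fallback layer is intentionally permissive and can misfire, for example on a sentence that begins with the article "A", which is why the more specific layers are tried first.
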
Example impact of few-shot prompting (5 questions, High School Computer Science):
| Model | Zero-Shot | 3-Shot | Δ (points) |
|---|---|---|---|
| Meta-Llama-3.1-8B | 60.0% | 75.0% | +15.0 |
| Meta-Llama-3-70B | 80.0% | 100.0% | +20.0 |
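
For readers unfamiliar with few-shot prompting, here is a minimal sketch of how an n-shot MMLU prompt can be assembled. `format_question` and `build_prompt` are hypothetical helpers, and the layout follows the commonly used MMLU prompt style rather than anything confirmed from this repo.

```python
from typing import List, Optional, Sequence, Tuple

CHOICES = "ABCD"


def format_question(question: str, options: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one question; solved examples include the answer letter."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject: str,
                 shots: List[Tuple[str, Sequence[str], str]],
                 target: Tuple[str, Sequence[str]],
                 n_shots: int = 0) -> str:
    """Prepend up to n_shots solved examples before the target question."""
    name = subject.replace("_", " ")
    blocks = [
        f"The following are multiple choice questions (with answers) about {name}."
    ]
    for question, options, answer in shots[:n_shots]:
        blocks.append(format_question(question, options, answer))
    question, options = target
    blocks.append(format_question(question, options))  # left open for the model
    return "\n\n".join(blocks)
```

With `n_shots=0` the prompt contains only the header and the open question; raising it adds solved examples that show the model the expected answer format.
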
Run targeted MMLU evaluations directly:
hypercompare/mmlu_eval.py model_a model_b \
  --subjects high_school_mathematics high_school_physics \
  --num_questions 5 --n_shots 5 --verbose

List available subjects:
python mmlu_dataset.py --list-subjects

Example output:
============ COMPARISON: Meta-Llama-3-70B (hyperbolic) vs Meta-Llama-3.1-8B (hyperbolic) ============
Speed
TTFT: 245 ms vs 312 ms
Latency: 2.1 s vs 2.8 s
Throughput: 95 tps vs 78 tps
Accuracy
Prompts: 100.0% vs 100.0%
MMLU: 73.2% vs 71.8%
Cost (per 1K tokens)
Input: $0.002 vs $0.001
Output: $0.003 vs $0.002
Cost/perf: 1.0x vs 0.7x
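
The per-1K-token prices in a report like this translate into request cost with simple arithmetic, sketched below. The README does not define how the Cost/perf figure is computed, so `cost_per_accuracy` is only one hypothetical way to relate cost to accuracy, not the tool's formula.

```python
import math


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Dollar cost of one request given per-1K-token prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def cost_per_accuracy(cost: float, accuracy_pct: float) -> float:
    """Hypothetical metric: cost divided by the fraction of correct answers.

    Lower is better; a model that gets nothing right has infinite cost.
    """
    if accuracy_pct <= 0:
        return math.inf
    return cost / (accuracy_pct / 100)
```

For example, a request with 1,000 input tokens at $0.002 per 1K and 500 output tokens at $0.003 per 1K costs $0.002 + $0.0015 = $0.0035.
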
MIT © PSkinnerTech
Thanks to Hyperbolic for the inference API that powers the primary benchmark path.