hypercompare

AI model benchmarking CLI — compare LLMs on speed, accuracy, and cost in a single reproducible run.

Point it at any two models and get side-by-side numbers for time-to-first-token, throughput, MMLU accuracy across 57 subjects, and per-token cost. Built for engineers picking an inference provider and researchers comparing model families.

What it measures

Performance — time-to-first-token (TTFT), total latency, tokens/sec (streaming)
Accuracy — keyword-based prompt scoring plus full MMLU evaluation across 57 subjects, with configurable zero-shot or few-shot prompting
Cost — input/output token pricing and a cost-performance ratio
Reproducibility — deterministic by default (temperature=0, fixed max_tokens=1024, streaming TTFT capture)
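
The latency figures above can be derived from a single pass over a streamed response. The following is a minimal sketch, not the tool's actual code: the names time_stream and StreamTiming are hypothetical, tokens are approximated by whitespace splitting, and throughput is assumed to be output tokens divided by total latency.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamTiming:
    ttft_s: float        # seconds from start until the first non-empty chunk
    total_s: float       # seconds from start until the stream is exhausted
    tokens_per_s: float  # approximate output tokens divided by total_s


def time_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a stream of text chunks and record latency figures.

    Assumption: `chunks` is lazy (e.g. a generator wrapping a streaming
    API call), so the request is effectively issued when iteration starts.
    """
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if chunk and first is None:
            first = clock()  # first visible output marks TTFT
        parts.append(chunk)
    end = clock()

    total = end - start
    tokens = count_tokens("".join(parts))
    return StreamTiming(
        ttft_s=(first if first is not None else end) - start,
        total_s=total,
        tokens_per_s=tokens / total if total > 0 else 0.0,
    )
```

Passing the clock as a parameter keeps the function testable offline; a real run would leave the default and feed it the text deltas from the provider's streaming response.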

Supported providers

Primary integration:

Hyperbolic — managed API for open-weight models (Llama 3 / 3.1, Mixtral 8x7B, and others surfaced through Hyperbolic's API)

Provider adapters live under providers/, so adding new ones (OpenAI-compatible endpoints, Together, Fireworks, Groq, vLLM servers, Ollama, etc.) is a matter of implementing the request/response adapter.
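
To make the adapter idea concrete, here is one possible shape for a request/response adapter. This is a sketch under assumptions: ProviderAdapter, CompletionRequest, EchoAdapter, REGISTRY, and get_adapter are hypothetical names chosen for illustration, and the real interface under providers/ may look different.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Protocol


@dataclass
class CompletionRequest:
    model: str
    prompt: str
    system: Optional[str] = None
    temperature: float = 0.0   # matches the documented default
    max_tokens: int = 1024     # matches the documented default


class ProviderAdapter(Protocol):
    """Hypothetical contract: turn a request into a stream of text chunks."""

    name: str

    def stream(self, request: CompletionRequest) -> Iterator[str]:
        ...


class EchoAdapter:
    """Offline stand-in that streams the prompt back word by word."""

    name = "echo"

    def stream(self, request: CompletionRequest) -> Iterator[str]:
        for word in request.prompt.split():
            yield word + " "


# Hypothetical registry keyed by the provider name given on the command line.
REGISTRY: Dict[str, ProviderAdapter] = {"echo": EchoAdapter()}


def get_adapter(name: str) -> ProviderAdapter:
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None
```

An adapter for an OpenAI-compatible endpoint would implement the same stream method by forwarding the request fields to the provider's chat completion call and yielding each text delta.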

A Lilypad adapter ships in the repo from earlier development. Lilypad's hosted Anura testnet is no longer the active product it was when this tool was built, so the Hyperbolic path is the primary, maintained one. The Lilypad adapter is left in the codebase as a reference implementation for plugging in alternative compute backends.

Quick start

git clone https://github.com/PSkinnerTech/hypercompare.git
cd hypercompare

python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install openai python-dotenv requests

cp .env.example .env
chmod +x hypercompare

Minimum .env:

HYPERBOLIC_API_KEY=...

Usage

Compare two models on Hyperbolic

./hypercompare meta-llama/Meta-Llama-3-70B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct

Compare across providers (if you have a second adapter wired up)

./hypercompare model_a model_b --providers hyperbolic <other>

Custom prompts

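Each line of the prompt file holds one prompt, optionally followed by a pipe (|) and a comma-separated list of keywords used for accuracy scoring: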

Who wrote Hamlet? | Shakespeare, William Shakespeare
What is 2 + 2? | 4, four
Summarize the benefits of exercise in 2-3 sentences.

./hypercompare model_a model_b --prompts your_prompts.txt
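
The prompt-file format above can be parsed and scored roughly as follows. This is a sketch, not the tool's implementation: PromptCase, parse_prompt_line, and keyword_hit are hypothetical names, and it assumes a response counts as correct when it contains any listed keyword, compared case-insensitively. Prompts without keywords are treated as unscored here, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PromptCase:
    prompt: str
    keywords: List[str] = field(default_factory=list)


def parse_prompt_line(line: str) -> PromptCase:
    """Split 'prompt | kw1, kw2' into a prompt and its keyword list."""
    prompt, sep, rest = line.partition("|")
    keywords = [k.strip() for k in rest.split(",") if k.strip()] if sep else []
    return PromptCase(prompt=prompt.strip(), keywords=keywords)


def keyword_hit(case: PromptCase, response: str) -> Optional[bool]:
    """Return True/False for keyword prompts, None when there is nothing to score."""
    if not case.keywords:
        return None
    text = response.lower()
    # Plain substring matching: simple, but "4" would also match "42".
    return any(k.lower() in text for k in case.keywords)


case = parse_prompt_line("Who wrote Hamlet? | Shakespeare, William Shakespeare")
print(case.keywords, keyword_hit(case, "It was Shakespeare."))
```

Substring matching is deliberately crude; short keywords such as single digits may need word-boundary matching to avoid false positives.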

CLI options

--providers PROVIDER PROVIDER   Providers for model_a and model_b, in that order
--prompts PATH                  Custom prompt file
--system TEXT                   System prompt applied to all test cases
--temperature FLOAT             Default 0 (deterministic)
--skip-mmlu                     Skip MMLU for a faster run
--n-shots N                     Few-shot examples for MMLU (default 0)
--num-questions N               Questions per MMLU subject (default 5)
--verbose                       Detailed per-request output

MMLU evaluation

Full MMLU support across all 57 subjects, with:

Configurable zero-shot or few-shot prompting (--n-shots)
Multi-layered answer extraction (initial-letter, Answer: X patterns, regex fallback) so models that wrap answers in explanations still score correctly
Per-subject and aggregate accuracy reporting
Combined cost-vs-accuracy analysis
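
The layered answer extraction listed above could look something like this. It is a sketch under assumptions: the patterns and the name extract_choice are illustrative, not taken from the tool, and the layers are tried in the order the list gives them (leading letter, then an "Answer: X" style phrase, then a loose regex fallback).

```python
import re
from typing import Optional

# Layer 1: the reply starts with a bare choice, e.g. "B", "(C)", "A." or "D)".
_LEADING_RE = re.compile(r"^\s*\(?([A-D])\)?\s*(?:[.:)\-]|$)")

# Layer 2: an explicit phrase such as "Answer: B" or "the answer is (C)".
# Only the phrase is case-insensitive; the choice letter must be uppercase.
_ANSWER_RE = re.compile(r"(?i:answer\s*(?:is|:)?)\s*\(?([A-D])\b")

# Layer 3: first standalone capital A-D anywhere in the text.
# Crude by design: it can mistake the article "A" for a choice.
_FALLBACK_RE = re.compile(r"\b([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter, or None if no layer finds one."""
    for pattern in (_LEADING_RE, _ANSWER_RE, _FALLBACK_RE):
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None


for reply in ["B", "(C) because ...", "The answer is D.", "No idea"]:
    print(repr(reply), "->", extract_choice(reply))
```

Ordering the layers from strictest to loosest means a verbose explanation only reaches the fallback when the model gave no clearer signal.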

Example impact of few-shot prompting (5 questions, High School Computer Science):

Model               Zero-Shot   3-Shot    Δ (pts)
Meta-Llama-3.1-8B   60.0%       75.0%     +15.0
Meta-Llama-3-70B    80.0%       100.0%    +20.0
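
For context on what few-shot means here: a k-shot MMLU prompt prepends k solved questions from the same subject before the question being scored. The sketch below follows the prompt layout used by the original MMLU evaluation code; whether hypercompare uses exactly this template is an assumption, and MMLUItem, format_item, and build_prompt are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCD"


@dataclass
class MMLUItem:
    question: str
    choices: Sequence[str]  # exactly four options
    answer: int             # index into choices


def format_item(item: MMLUItem, include_answer: bool) -> str:
    """Render one question; solved examples carry their answer letter."""
    lines = [item.question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item.choices)]
    lines.append("Answer:" + (f" {LETTERS[item.answer]}" if include_answer else ""))
    return "\n".join(lines)


def build_prompt(subject: str, shots: Sequence[MMLUItem],
                 target: MMLUItem, n_shots: int) -> str:
    """Build a zero-shot (n_shots=0) or few-shot prompt for one question."""
    readable = subject.replace("_", " ")
    header = f"The following are multiple choice questions (with answers) about {readable}."
    blocks = [format_item(shot, True) for shot in shots[:n_shots]]
    blocks.append(format_item(target, False))  # the model completes this one
    return header + "\n\n" + "\n\n".join(blocks)


shot = MMLUItem("1 + 1 = ?", ["1", "2", "3", "4"], 1)
target = MMLUItem("2 + 2 = ?", ["1", "2", "3", "4"], 3)
print(build_prompt("high_school_mathematics", [shot], target, n_shots=1))
```

The solved examples should come from a split that does not overlap the scored questions, otherwise few-shot accuracy is inflated.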

Run targeted MMLU evaluations directly:

python hypercompare/mmlu_eval.py model_a model_b \
  --subjects high_school_mathematics high_school_physics \
  --num_questions 5 --n_shots 5 --verbose

List available subjects:

python mmlu_dataset.py --list-subjects

Example output

============ COMPARISON: Meta-Llama-3-70B (hyperbolic) vs Meta-Llama-3.1-8B (hyperbolic) ============

Speed
  TTFT:        245 ms  vs  312 ms
  Latency:     2.1 s   vs  2.8 s
  Throughput:  95 tps  vs  78 tps

Accuracy
  Prompts:     100.0%  vs  100.0%
  MMLU:         73.2%  vs   71.8%

Cost (per 1K tokens)
  Input:       $0.002  vs  $0.001
  Output:      $0.003  vs  $0.002
  Cost/perf:    1.0x   vs   0.7x
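
The per-1K-token prices in the sample output translate into a request cost with simple arithmetic. The sketch below shows that calculation plus one possible cost-performance figure, cost per correct answer. How hypercompare actually defines its Cost/perf ratio is not documented here, so cost_per_correct is an assumption, and both function names are hypothetical.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Dollar cost of one request given per-1K-token prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def cost_per_correct(total_cost: float, num_correct: int) -> float:
    """Assumed cost-performance metric: dollars spent per correct answer."""
    if num_correct == 0:
        return float("inf")  # nothing correct, so no finite cost per answer
    return total_cost / num_correct


# 500 prompt tokens and 1000 completion tokens at the first model's sample prices.
cost = request_cost(500, 1000, input_price_per_1k=0.002, output_price_per_1k=0.003)
print(f"${cost:.4f} per request")
```

Dividing each model's figure by the first model's would give a relative ratio in the style of the sample output, though the numbers shown there are not reproduced by this sketch.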

License

See the LICENSE file in the repository root.

Acknowledgments

Thanks to Hyperbolic for the inference API that powers the primary benchmark path.
