hypercompare

AI model benchmarking CLI — compare LLMs on speed, accuracy, and cost in a single reproducible run.

Point it at any two models and get side-by-side numbers for time-to-first-token, throughput, MMLU accuracy across 57 subjects, and per-token cost. Built for engineers picking an inference provider and researchers comparing model families.

What it measures

  • Performance — time-to-first-token (TTFT), total latency, tokens/sec (streaming)
  • Accuracy — keyword-based prompt scoring plus full MMLU evaluation across 57 subjects, with configurable zero-shot or few-shot prompting
  • Cost — input/output token pricing and a cost-performance ratio
  • Reproducibility — deterministic by default (temperature=0, fixed max_tokens=1024, streaming TTFT capture)
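As a rough illustration of how the performance numbers above can be derived from a streaming response, here is a minimal sketch. It is not the repo's implementation: `measure_stream` and its `clock` parameter are hypothetical names, and chunks stand in for tokens (a real run would count tokens with the provider's usage data or a tokenizer).

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a stream of text chunks.

    Returns time-to-first-token and total latency in seconds, plus the
    number of non-empty chunks and chunks per second.
    """
    start = clock()
    first: Optional[float] = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first is None:
            first = clock()  # first visible output marks TTFT
        count += 1
    end = clock()

    total = end - start
    return {
        "ttft_s": (first - start) if first is not None else None,
        "latency_s": total,
        "chunks": count,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }
```

Any iterable of strings works as input, so the same function can wrap a provider's streaming iterator or a canned list in tests. Injecting the clock keeps the measurement testable without sleeping.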

Supported providers

Primary integration:

  • Hyperbolic — managed API for open-weight models (Llama 3 / 3.1, Mixtral 8x7B, and others surfaced through Hyperbolic's API)

Provider adapters live under providers/, so adding new ones (OpenAI-compatible endpoints, Together, Fireworks, Groq, vLLM servers, Ollama, etc.) is a matter of implementing the request/response adapter.
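To give a feel for what a request/response adapter might involve, here is a hypothetical sketch. The class and method names (`ProviderAdapter`, `stream`, `complete`, `Completion`, `EchoAdapter`) are assumptions made for illustration and are not taken from the code under providers/; check the existing Hyperbolic adapter for the real contract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class ProviderAdapter(ABC):
    """Hypothetical adapter contract; names are illustrative only."""

    name: str

    @abstractmethod
    def stream(self, model: str, prompt: str, temperature: float = 0.0,
               max_tokens: int = 1024) -> Iterator[str]:
        """Yield text chunks as the backend produces them."""

    def complete(self, model: str, prompt: str, **kwargs) -> Completion:
        """Drain the stream and return the full text with rough usage counts."""
        text = "".join(self.stream(model, prompt, **kwargs))
        # Whitespace word counts are a crude placeholder for real token usage.
        return Completion(text, len(prompt.split()), len(text.split()))


class EchoAdapter(ProviderAdapter):
    """Toy backend that echoes the prompt, used to exercise the interface."""

    name = "echo"

    def stream(self, model, prompt, temperature=0.0, max_tokens=1024):
        for word in prompt.split()[:max_tokens]:
            yield word + " "
```

A real adapter would replace the body of `stream` with a call to the backend's streaming endpoint and report the token usage the backend returns.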

A Lilypad adapter from earlier development still ships in the repo. Lilypad's hosted Anura testnet is no longer the active product it was when this tool was built, so the Hyperbolic path is the primary, maintained one. The Lilypad adapter stays in the codebase as a reference implementation for plugging in alternative compute backends.

Quick start

git clone https://github.com/PSkinnerTech/hypercompare.git
cd hypercompare

python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install openai python-dotenv requests

cp .env.example .env
chmod +x hypercompare

Minimum .env:

HYPERBOLIC_API_KEY=...

Usage

Compare two models on Hyperbolic

./hypercompare meta-llama/Meta-Llama-3-70B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct

Compare across providers (if you have a second adapter wired up)

./hypercompare model_a model_b --providers hyperbolic <other>

Custom prompts

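Each line of the prompt file holds one prompt, optionally followed by a pipe and a comma-separated list of keywords used for accuracy scoring:

Who wrote Hamlet? | Shakespeare, William Shakespeare
What is 2 + 2? | 4, four
Summarize the benefits of exercise in 2-3 sentences.

Pass the file with --prompts:

./hypercompare model_a model_b --prompts your_prompts.txt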
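The prompt-file format can be parsed and scored along these lines. This is a sketch under assumptions, not the tool's actual scorer: the function names are hypothetical, and both the case-insensitive substring match and the treatment of keyword-less prompts as unscored are guesses about reasonable behavior.

```python
from typing import List, Optional, Tuple


def parse_prompt_line(line: str) -> Tuple[str, List[str]]:
    """Split 'prompt | kw1, kw2' into the prompt and its keyword list."""
    prompt, sep, keywords = line.partition("|")
    if not sep:
        return prompt.strip(), []  # no pipe: prompt has no keywords
    return prompt.strip(), [k.strip() for k in keywords.split(",") if k.strip()]


def score_response(response: str, keywords: List[str]) -> Optional[bool]:
    """Return True if any keyword appears in the response.

    Matching is a case-insensitive substring test (an assumption).
    Returns None when there are no keywords, i.e. the prompt is unscored.
    """
    if not keywords:
        return None
    lowered = response.lower()
    return any(k.lower() in lowered for k in keywords)
```

Note that plain substring matching is permissive: the keyword `4` would also match a response containing `14`, so short keywords deserve some care.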

CLI options

--providers PROVIDER PROVIDER   Provider for model_a and model_b
--prompts PATH                  Custom prompt file
--system TEXT                   System prompt applied to all test cases
--temperature FLOAT             Default 0 (deterministic)
--skip-mmlu                     Skip MMLU for a faster run
--n-shots N                     Few-shot examples for MMLU (default 0)
--num-questions N               Questions per MMLU subject (default 5)
--verbose                       Detailed per-request output
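For readers who want to script around the CLI, the options above map naturally onto `argparse`. The following is a sketch reconstructed from the option list only, not the tool's actual parser; in particular, defaulting `--providers` to `hyperbolic` for both models is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; defaults follow the option list above."""
    parser = argparse.ArgumentParser(prog="hypercompare")
    parser.add_argument("model_a")
    parser.add_argument("model_b")
    # Assumed default: both models run on Hyperbolic when the flag is omitted.
    parser.add_argument("--providers", nargs=2, metavar="PROVIDER",
                        default=["hyperbolic", "hyperbolic"])
    parser.add_argument("--prompts", metavar="PATH")
    parser.add_argument("--system", metavar="TEXT")
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--skip-mmlu", action="store_true")
    parser.add_argument("--n-shots", type=int, default=0, metavar="N")
    parser.add_argument("--num-questions", type=int, default=5, metavar="N")
    parser.add_argument("--verbose", action="store_true")
    return parser
```

`argparse` converts hyphenated flags to underscored attributes, so `--n-shots 3` is read back as `args.n_shots`.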

MMLU evaluation

Full MMLU support across all 57 subjects, with:

  • Configurable zero-shot or few-shot prompting (--n-shots)
  • Multi-layered answer extraction (initial-letter, Answer: X patterns, regex fallback) so models that wrap answers in explanations still score correctly
  • Per-subject and aggregate accuracy reporting
  • Combined cost-vs-accuracy analysis
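The layered answer extraction described above could look roughly like the sketch below. The exact patterns and their order are assumptions for illustration, not the evaluator's real regexes; `extract_answer` is a hypothetical name.

```python
import re
from typing import Optional

# Layer 1: the response starts with a bare letter, e.g. "B", "(C)", "D."
_INITIAL = re.compile(r"^\s*\(?([ABCD])\)?\s*(?:[.):]|$)")
# Layer 2: an explicit "Answer: X" or "answer is X" statement
_STATED = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([ABCD])\)?(?![A-Za-z])")
# Layer 3: any standalone capital letter A-D
_STANDALONE = re.compile(r"\b([ABCD])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if nothing plausible is found."""
    match = _INITIAL.match(response)
    if match:
        return match.group(1)
    match = _STATED.search(response)
    if match:
        return match.group(1)
    # Heuristic fallback: models tend to end with their final choice,
    # so take the last standalone letter.
    letters = _STANDALONE.findall(response)
    return letters[-1] if letters else None
```

The first layer deliberately requires punctuation or end-of-string after the letter, so a response that opens with "A binary tree..." is not misread as choosing A.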

Example impact of few-shot prompting (5 questions, High School Computer Science):

Model               Zero-Shot   3-Shot    Δ (pp)
Meta-Llama-3.1-8B   60.0%       75.0%     +15.0
Meta-Llama-3-70B    80.0%       100.0%    +20.0
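To make the zero-shot versus few-shot distinction concrete, here is a sketch of how an n-shot MMLU prompt is commonly assembled. The header wording follows the widely used MMLU prompt style; the record layout (`question`, `choices`, integer `answer` index) and the function names are assumptions, not necessarily what this tool uses.

```python
from typing import Dict, Sequence

LETTERS = "ABCD"


def format_question(q: Dict, include_answer: bool) -> str:
    """Render one question with lettered choices and an answer line."""
    lines = [q["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, q["choices"])]
    # Solved examples show the letter; the target leaves it for the model.
    lines.append(f"Answer: {LETTERS[q['answer']]}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject: str, shots: Sequence[Dict], target: Dict, n_shots: int) -> str:
    """Prefix the target question with up to n_shots solved examples."""
    topic = subject.replace("_", " ")
    header = f"The following are multiple choice questions (with answers) about {topic}."
    blocks = [format_question(s, include_answer=True) for s in shots[:n_shots]]
    blocks.append(format_question(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

With `n_shots=0` the prompt contains only the header and the target question, which corresponds to the zero-shot column.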

Run targeted MMLU evaluations directly:

hypercompare/mmlu_eval.py model_a model_b \
  --subjects high_school_mathematics high_school_physics \
  --num_questions 5 --n_shots 5 --verbose

List available subjects:

python mmlu_dataset.py --list-subjects

Example output

============ COMPARISON: Meta-Llama-3-70B (hyperbolic) vs Meta-Llama-3.1-8B (hyperbolic) ============

Speed
  TTFT:        245 ms  vs  312 ms
  Latency:     2.1 s   vs  2.8 s
  Throughput:  95 tps  vs  78 tps

Accuracy
  Prompts:     100.0%  vs  100.0%
  MMLU:         73.2%  vs   71.8%

Cost (per 1K tokens)
  Input:       $0.002  vs  $0.001
  Output:      $0.003  vs  $0.002
  Cost/perf:    1.0x   vs   0.7x
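The per-1K prices in the example output translate into a per-request cost with simple arithmetic, sketched below. The token counts in the usage note are made up for illustration, and the formula behind the cost/perf ratio is not documented here, so it is left out.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Dollar cost of one request given per-1K-token prices."""
    return (
        (input_tokens / 1000) * input_price_per_1k
        + (output_tokens / 1000) * output_price_per_1k
    )
```

For a hypothetical request with 500 input tokens and 1024 output tokens, the first model's prices ($0.002 in, $0.003 out) give 0.5 × 0.002 + 1.024 × 0.003 = $0.004072, and the second model's prices ($0.001 in, $0.002 out) give $0.002548.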

License

MIT © PSkinnerTech

Acknowledgments

Thanks to Hyperbolic for the inference API that powers the primary benchmark path.
