LLM Benchmark: Compare model performance across specs and libraries on the site #6913

## Idea

Integrate an LLM benchmark feature into the site that lets visitors compare how different LLMs perform at generating plot implementations.

## Concept

For a curated selection of specifications and plotting libraries, run the implementation generation pipeline with several different LLMs (e.g. Claude Opus, Claude Sonnet, GPT-5, Gemini). Each generated implementation is then scored by the existing review pipeline. The aggregated results are surfaced on the site so users can compare models side-by-side.

## Proposed scope

### Benchmark generation
- Define a benchmark set: a representative subset of specs across complexity tiers (simple → complex) and a representative set of libraries (matplotlib, plotly, seaborn, altair, …).
- Extend `impl-generate` / `bulk-generate` to accept a `model` parameter, so each implementation can be tagged with the model that produced it.
- Run the benchmark periodically (e.g. when new models are released, or on a cron schedule) and store results per `(spec_id, library, model)`.
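
A minimal sketch of how the benchmark matrix could be enumerated, assuming results are keyed by `(spec_id, library, model)` as proposed above. `BenchmarkKey`, `build_matrix`, and `pending` are hypothetical names for illustration, not part of the existing pipeline.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkKey:
    """Hypothetical key identifying one benchmark cell."""
    spec_id: str
    library: str
    model: str


def build_matrix(spec_ids, libraries, models):
    """Return every (spec_id, library, model) combination in the benchmark set."""
    return [BenchmarkKey(s, lib, m) for s, lib, m in product(spec_ids, libraries, models)]


def pending(matrix, done):
    """Return the cells that still need a generation run.

    `done` is a set of BenchmarkKey values that already have stored results,
    so a periodic run only has to fill in the gaps (e.g. a newly added model).
    """
    return [key for key in matrix if key not in done]
```

With this shape, adding a model to the list only generates the cells for that model, which keeps a re-run cheap.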

### Review & scoring
- Reuse the existing `impl-review` `quality_score` pipeline so all models are judged by the same rubric.
- Persist additional benchmark metadata: model name/version, generation timestamp, token usage, latency, cost.
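
One possible shape for the persisted record, as a sketch only. The field names (`model_version`, `input_tokens`, `latency_s`, `cost_usd`, and so on) are assumptions derived from the list above, not the existing metadata schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class BenchmarkResult:
    """Hypothetical record for one reviewed implementation."""
    spec_id: str
    library: str
    model: str
    model_version: str
    generated_at: str      # ISO 8601 timestamp in UTC
    quality_score: float   # score produced by the review step
    input_tokens: int
    output_tokens: int
    latency_s: float       # wall-clock generation time in seconds
    cost_usd: float

    def key(self):
        """Return the (spec_id, library, model) tuple the result is stored under."""
        return (self.spec_id, self.library, self.model)


def utc_now_iso():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def to_json(result):
    """Serialize a result to a JSON string with stable key order."""
    return json.dumps(asdict(result), sort_keys=True)
```

Keeping the record flat makes it easy to store either inside the existing metadata files or in a separate directory, whichever the open question below settles on.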

### Site integration
- New `/benchmark` page with:
  - Leaderboard: average `quality_score` per model (overall and per library).
  - Per-spec comparison view: side-by-side rendered plots from each model + their review scores and review comments.
  - Filters by library, spec category/tag, model.
  - Charts comparing models across dimensions (quality, speed, cost, library coverage).
- Link from each plot detail page → "see how other models solved this".
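
The leaderboard aggregation could look roughly like this. It is a sketch that assumes each stored result is a dict with `model`, `library`, and `quality_score` keys; the `leaderboard` function and the row layout are illustrative, not an existing API.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(results):
    """Aggregate review scores into overall and per-library averages.

    Returns a tuple (rows, by_library):
      rows       - one dict per model, sorted by average score, best first
      by_library - mapping of (model, library) to the average score
    """
    overall = defaultdict(list)
    per_library = defaultdict(list)
    for r in results:
        overall[r["model"]].append(r["quality_score"])
        per_library[(r["model"], r["library"])].append(r["quality_score"])

    rows = [
        {"model": model, "avg_quality": mean(scores), "n": len(scores)}
        for model, scores in overall.items()
    ]
    rows.sort(key=lambda row: row["avg_quality"], reverse=True)
    by_library = {key: mean(scores) for key, scores in per_library.items()}
    return rows, by_library
```

Including the sample count `n` in each row lets the page flag models with incomplete library coverage instead of silently averaging over fewer cells.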

## Open questions

- Where to store benchmark runs — extend the existing metadata schema or a separate `benchmarks/` directory?
- How to handle non-determinism: run each model several times and take the best or the median score?
- What cost cap and rate-limit strategy should the benchmark runs use?
- Which models to include in v1?
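
To make the non-determinism and cost-cap questions concrete, here is a hedged sketch of both options. `aggregate_runs`, `CostTracker`, and `BudgetExceeded` are hypothetical names, and the choice between median and best is left open as in the question above.

```python
from statistics import median


def aggregate_runs(scores, strategy="median"):
    """Collapse the scores of repeated runs for one benchmark cell into one value."""
    if not scores:
        raise ValueError("at least one score is required")
    if strategy == "median":
        return median(scores)
    if strategy == "best":
        return max(scores)
    raise ValueError(f"unknown strategy: {strategy}")


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the configured cap."""


class CostTracker:
    """Track cumulative spend for a benchmark run against a hard cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Record a cost, refusing it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"cap of {self.cap_usd} USD would be exceeded by {cost_usd} USD"
            )
        self.spent_usd += cost_usd
```

The median is less sensitive to a single lucky or broken run, while the best score reflects what a model can do with retries; reporting both is also an option.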

## Why

- Gives the site a unique, data-driven angle beyond just a plot gallery.
- Provides a reproducible, library-aware benchmark for plot-generation tasks — useful both for the community and as marketing content.
- Reuses infrastructure that already exists (spec pipeline + review scoring), so most of the heavy lifting is wiring + UI.
