Idea
Integrate an LLM benchmark feature into the site that lets visitors compare how different LLMs perform at generating plot implementations.
Concept
For a curated selection of specifications and plotting libraries, run the implementation generation pipeline with several different LLMs (e.g. Claude Opus, Claude Sonnet, GPT-5, Gemini). Each generated implementation is then scored by the existing review pipeline. The aggregated results are surfaced on the site so users can compare models side by side.
Proposed scope
Benchmark generation
- Define a benchmark set: a representative subset of specs across complexity tiers (simple → complex) and a representative set of libraries (matplotlib, plotly, seaborn, altair, …).
- Extend impl-generate / bulk-generate to accept a model parameter, so each implementation can be tagged with the model that produced it.
- Run the benchmark periodically (e.g. when new models drop or on a cron) and store results per (spec_id, library, model).
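The benchmark set described above amounts to a cross product of specs, libraries, and models. A minimal sketch of how the job list could be enumerated, assuming a hypothetical `BenchmarkJob` record and placeholder spec IDs and model names (none of these exist in the current pipeline):

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkJob:
    """One unit of benchmark work; hypothetical, not an existing pipeline type."""
    spec_id: str
    library: str
    model: str


def build_jobs(spec_ids, libraries, models):
    """Return one job per (spec_id, library, model) combination."""
    return [BenchmarkJob(s, lib, m) for s, lib, m in product(spec_ids, libraries, models)]


# Placeholder values for illustration only.
jobs = build_jobs(
    spec_ids=["scatter-basic", "heatmap-annotated"],
    libraries=["matplotlib", "plotly"],
    models=["model-a", "model-b"],
)
print(len(jobs))
```

Because the record is frozen and hashable, it could also serve directly as the storage key for a result, which keeps re-runs of the same combination easy to detect.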
Review & scoring
- Reuse the existing impl-review quality_score pipeline so all models are judged by the same rubric.
- Persist additional benchmark metadata: model name/version, generation timestamp, token usage, latency, cost.
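One possible shape for the persisted benchmark metadata, sketched as a dataclass serialized to JSON. Every field name here is an assumption derived from the list above (model name/version, timestamp, token usage, latency, cost), and the values are made up for illustration:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkResult:
    """Hypothetical per-run record; field names are illustrative assumptions."""
    spec_id: str
    library: str
    model: str
    model_version: str
    generated_at: str       # ISO 8601 timestamp, UTC
    quality_score: float    # score from the existing review pipeline
    input_tokens: int
    output_tokens: int
    latency_seconds: float
    cost_usd: float


# Made-up example values, not real benchmark data.
result = BenchmarkResult(
    spec_id="scatter-basic",
    library="matplotlib",
    model="model-a",
    model_version="2025-01",
    generated_at="2025-01-01T12:00:00+00:00",
    quality_score=91.0,
    input_tokens=1200,
    output_tokens=800,
    latency_seconds=14.2,
    cost_usd=0.03,
)

record = json.dumps(asdict(result), indent=2)
print(record)
```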
Site integration
- New /benchmark page with:
  - Leaderboard: average quality_score per model (overall and per library).
  - Per-spec comparison view: side-by-side rendered plots from each model + their review scores and review comments.
  - Filters by library, spec category/tag, model.
  - Charts comparing models across dimensions (quality, speed, cost, library coverage).
- Link from each plot detail page → "see how other models solved this".
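The leaderboard numbers could be derived with a simple grouping pass over stored results. A rough sketch, assuming each result is a dict with `model`, `library`, and `quality_score` keys (a hypothetical row shape) and using made-up scores:

```python
from collections import defaultdict
from statistics import mean


def leaderboard(rows):
    """Return (models ranked by overall average, per-(model, library) averages)."""
    overall = defaultdict(list)
    per_library = defaultdict(list)
    for row in rows:
        overall[row["model"]].append(row["quality_score"])
        per_library[(row["model"], row["library"])].append(row["quality_score"])

    overall_avg = {model: mean(scores) for model, scores in overall.items()}
    per_library_avg = {key: mean(scores) for key, scores in per_library.items()}
    # Highest average quality_score first.
    ranked = sorted(overall_avg.items(), key=lambda item: item[1], reverse=True)
    return ranked, per_library_avg


# Made-up scores for illustration only.
rows = [
    {"model": "model-a", "library": "matplotlib", "quality_score": 90.0},
    {"model": "model-a", "library": "plotly", "quality_score": 80.0},
    {"model": "model-b", "library": "matplotlib", "quality_score": 70.0},
    {"model": "model-b", "library": "plotly", "quality_score": 95.0},
]

ranked, per_library_avg = leaderboard(rows)
print(ranked)
```

In this toy data model-a leads overall while model-b leads on plotly, which is the kind of difference the per-library view is meant to expose.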
Open questions
- Where to store benchmark runs: extend the existing metadata schema or a separate benchmarks/ directory?
- How to handle non-determinism (multiple runs per model, take best/median)?
- Cost cap and rate-limit strategy for the benchmark runs.
- Which models to include in v1?
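For the non-determinism question, one option is to run each (spec_id, library, model) combination several times and collapse the scores with a configurable strategy. A minimal sketch, assuming a hypothetical `aggregate_runs` helper and made-up scores; it does not settle which strategy the benchmark should use:

```python
from statistics import median


def aggregate_runs(scores, strategy="median"):
    """Collapse repeated-run scores into a single number."""
    if not scores:
        raise ValueError("at least one run is required")
    if strategy == "median":
        return median(scores)
    if strategy == "best":
        return max(scores)
    raise ValueError(f"unknown strategy: {strategy}")


# Three made-up runs of the same model on the same spec and library.
runs = [72.0, 88.0, 85.0]
median_score = aggregate_runs(runs)
best_score = aggregate_runs(runs, "best")
print(median_score, best_score)
```

The median is less sensitive to a single lucky or unlucky run, while "best" reflects what a user could get by retrying; the choice also affects cost, since more runs per combination multiply the benchmark budget.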
Why
- Gives the site a unique, data-driven angle beyond just a plot gallery.
- Provides a reproducible, library-aware benchmark for plot-generation tasks — useful both for the community and as marketing content.
- Reuses infrastructure that already exists (spec pipeline + review scoring), so most of the heavy lifting is wiring + UI.