LLM Benchmark: Compare model performance across specs and libraries on the site #6913

@MarkusNeusinger

Idea

Integrate an LLM benchmark feature into the site that lets visitors compare how different LLMs perform at generating plot implementations.

Concept

For a curated selection of specifications and plotting libraries, run the implementation generation pipeline with several different LLMs (e.g. Claude Opus, Claude Sonnet, GPT-5, Gemini). Each generated implementation is then scored by the existing review pipeline, and the aggregated results are surfaced on the site so users can compare models side by side.

Proposed scope

Benchmark generation

  • Define a benchmark set: a representative subset of specs across complexity tiers (simple → complex) and a representative set of libraries (matplotlib, plotly, seaborn, altair, …).
  • Extend impl-generate / bulk-generate to accept a model parameter, so each implementation can be tagged with the model that produced it.
  • Run the benchmark periodically (e.g. when new models drop or on a cron) and store results per (spec_id, library, model).
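As a rough illustration of the benchmark set, the job list could be the cross product of specs, libraries, and models, keyed by (spec_id, library, model). This is only a sketch: `BenchmarkJob`, `build_jobs`, the spec ids, and the model names are hypothetical placeholders, not existing pipeline code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkJob:
    """One unit of benchmark work; hypothetical type, not part of the pipeline."""
    spec_id: str
    library: str
    model: str


def build_jobs(spec_ids, libraries, models):
    """Return one job per (spec_id, library, model) combination."""
    return [
        BenchmarkJob(spec_id, library, model)
        for spec_id, library, model in product(spec_ids, libraries, models)
    ]


# Placeholder identifiers, chosen only for illustration.
jobs = build_jobs(
    spec_ids=["scatter-basic", "heatmap-annotated"],
    libraries=["matplotlib", "plotly"],
    models=["model-a", "model-b"],
)
print(len(jobs))
```

Because the job is a frozen dataclass, it is hashable and could double as the storage key for results.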

Review & scoring

  • Reuse the existing impl-review quality_score pipeline so all models are judged by the same rubric.
  • Persist additional benchmark metadata: model name/version, generation timestamp, token usage, latency, cost.
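One possible shape for a persisted benchmark record is sketched below. The field names, units, and the JSON serialization are assumptions made for illustration; the existing metadata schema may look quite different.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class BenchmarkResult:
    """Hypothetical record for one generated and reviewed implementation."""
    spec_id: str
    library: str
    model: str
    model_version: str
    generated_at: str      # ISO 8601 timestamp in UTC
    quality_score: float   # score produced by the review pipeline
    input_tokens: int
    output_tokens: int
    latency_s: float       # wall-clock generation time in seconds
    cost_usd: float


def to_json(result: BenchmarkResult) -> str:
    """Serialize a record with stable key order so diffs stay readable."""
    return json.dumps(asdict(result), sort_keys=True)


# All values below are made up for the example.
example = BenchmarkResult(
    spec_id="scatter-basic",
    library="matplotlib",
    model="model-a",
    model_version="2025-01",
    generated_at=datetime(2025, 1, 1, tzinfo=timezone.utc).isoformat(),
    quality_score=87.0,
    input_tokens=1200,
    output_tokens=800,
    latency_s=12.5,
    cost_usd=0.04,
)
print(to_json(example))
```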

Site integration

  • New /benchmark page with:
    • Leaderboard: average quality_score per model (overall and per library).
    • Per-spec comparison view: side-by-side rendered plots from each model + their review scores and review comments.
    • Filters by library, spec category/tag, model.
    • Charts comparing models across dimensions (quality, speed, cost, library coverage).
  • Link from each plot detail page → "see how other models solved this".
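The leaderboard aggregation (average quality_score per model, overall and per library) could be computed along these lines. This is a minimal sketch that assumes results arrive as plain dicts with `model`, `library`, and `quality_score` keys; the function name and row layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(results):
    """Average quality_score per model, overall and per library.

    Returns rows sorted by overall average, best model first.
    """
    overall = defaultdict(list)
    per_library = defaultdict(lambda: defaultdict(list))
    for r in results:
        overall[r["model"]].append(r["quality_score"])
        per_library[r["model"]][r["library"]].append(r["quality_score"])

    rows = [
        {
            "model": model,
            "overall": mean(scores),
            "by_library": {
                lib: mean(lib_scores)
                for lib, lib_scores in per_library[model].items()
            },
        }
        for model, scores in overall.items()
    ]
    return sorted(rows, key=lambda row: row["overall"], reverse=True)
```

The same rows could feed both the leaderboard table and the per-library filter, so the page would need only one aggregation pass.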

Open questions

  • Where to store benchmark runs — extend the existing metadata schema or a separate benchmarks/ directory?
  • How to handle non-determinism: run each model several times and take the best or the median score?
  • Cost cap and rate-limit strategy for the benchmark runs.
  • Which models to include in v1?
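To make two of these questions concrete, here is a small sketch of how repeated runs might be collapsed into one score and how a cost cap might be checked before each run. Both helpers are hypothetical, and the choice between median and best remains open.

```python
from statistics import median


def aggregate_runs(scores, strategy="median"):
    """Collapse the scores of repeated runs for one (spec, library, model)."""
    if not scores:
        raise ValueError("at least one score is required")
    if strategy == "median":
        return median(scores)
    if strategy == "best":
        return max(scores)
    raise ValueError(f"unknown strategy: {strategy!r}")


def within_budget(spent_usd, next_cost_estimate_usd, cap_usd):
    """Return True if the next run would keep total spend at or under the cap."""
    return spent_usd + next_cost_estimate_usd <= cap_usd
```

Median is less sensitive to a single lucky or unlucky run, while best reflects what a model can do when retries are allowed, so the two would likely produce different rankings.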

Why

  • Gives the site a unique, data-driven angle beyond just a plot gallery.
  • Provides a reproducible, library-aware benchmark for plot-generation tasks — useful both for the community and as marketing content.
  • Reuses infrastructure that already exists (spec pipeline + review scoring), so most of the heavy lifting is wiring + UI.
