
benchmarks: reusable registry, with models, patterns and phases, CI smoke and CodSpeed #735

Open
FBumann wants to merge 88 commits into master from benchmark-suite-charter


FBumann commented May 28, 2026

Overhaul of the internal benchmark suite

Brings benchmarks/ from a hand-wired pytest collection into a
typer-driven suite with cross-version sweeps, master-baseline CodSpeed CI,
and plotly views. Two orthogonal axes: size (scale a model up) and
severity (dial a density pathology at fixed size).

Core pieces

  • Reusable spec registry. from benchmarks import REGISTRY exposes
    11 self-registering model specs (LP / QP / MILP / SOS / piecewise /
    sparse networks / storage / PyPSA carbon) with build(size) + feature
    & phase tags. Adding a model is one file.
  • Density patterns (severity axis). A parallel PATTERNS registry of
    5 specs that hold size fixed and sweep a severity 0–100 dial measuring
    sparsity headroom — the fill a sparsity-aware kernel could drop from
    linopy's dense _term rectangle (0 = the non-pathological shape, no
    fill; 100 = the realistic worst case). Each targets a known hot path:
  • Phase coverage. Five phases per spec: build, matrices,
    lp_write, netcdf round-trip, solver_handoff (highs/gurobi/
    mosek/xpress). Plus PyPSA end-to-end.
  • One CLI. python -m benchmarks --help covers run / smoke / sweep
    / compare / plot / notebook / memory{save,sweep,compare}.
  • Cross-version sweeps. sweep 0.6.7 0.7.0 builds one uv venv per
    version; same flow for memory via memray.Tracker (model
    construction excluded from the peak).
  • Plotly views. plot <snapshots> picks scaling (1), scatter
    (2, default), or sweep heatmap (3+). --facets {phase, model}
    splits across subplots. Auto-detects timing vs memory JSON.
  • Master-baseline CodSpeed + Dependabot loop. CodSpeed (cachegrind,
    instruction-count regressions) runs on push to master plus a manual
    trigger (codspeed.yml), establishing the baseline — kept off every PR
    since the run is ~10–20× slower; regressions surface as master-to-master
    deltas. [benchmarks] pins perf-relevant deps individually (numpy/scipy/
    xarray/pandas/polars/dask/highspy/netcdf4); tooling deps grouped into one
    monthly PR. Each perf-relevant bump → one PR → one attributed CodSpeed
    delta on the next master run.
  • Notebook walkthrough at benchmarks/notebooks/registry_usage.ipynb,
    executed in CI so examples can't rot.
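A minimal sketch of what a self-registering spec registry of this kind could look like. The field names, the `register` helper and the `filter_by` signature are assumptions made for illustration, not the suite's actual code:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical spec: a name, a build(size) callable, and tags."""

    name: str
    build: Callable[[int], object]
    sizes: tuple[int, ...]
    features: frozenset[str] = frozenset()
    phases: frozenset[str] = frozenset({"build"})


REGISTRY: dict[str, ModelSpec] = {}


def register(spec: ModelSpec) -> ModelSpec:
    """Add a spec to the registry, refusing duplicate names."""
    if spec.name in REGISTRY:
        raise ValueError(f"duplicate spec name: {spec.name}")
    REGISTRY[spec.name] = spec
    return spec


def filter_by(*, feature: str | None = None, phase: str | None = None) -> list[ModelSpec]:
    """Return specs carrying the given feature and/or phase tag."""
    return [
        s
        for s in REGISTRY.values()
        if (feature is None or feature in s.features)
        and (phase is None or phase in s.phases)
    ]


# A "model file" would end with one register(...) call like this one.
# The builder here is a stand-in that returns a list, not a linopy.Model.
register(
    ModelSpec(
        name="toy_lp",
        build=lambda n: list(range(n)),
        sizes=(10, 100),
        features=frozenset({"lp"}),
        phases=frozenset({"build", "lp_write"}),
    )
)
```

With this shape, importing a model module is enough to make its spec visible to every phase test that iterates `REGISTRY`.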

Two axes: size and severity

A ModelSpec sweeps n — make the same model bigger. A PatternSpec
sweeps severity 0–100 — hold the model's size fixed and dial up one
operation's sparsity headroom (the removable fill it forces into the
dense _term rectangle). Within a pattern, the operation's own axis
carries the cliff; the broadcast dims are sized only to lift peak memory
clear of the measurement noise floor, never to carry the signal. Both
implement the same BenchSpec protocol, so the phase drivers, CLI,
sweep, memory, and plotting machinery treat them uniformly; test ids
carry the axis ([nodal_balance-severity=100] vs [basic-n=500]). The
split keeps the question sharp: a regression on a model axis is "we got
slower at scale", on a pattern axis it's "a density idiom got worse".

Sample output: 0.6.7 → 0.7.0

Default scatter view — each test at (baseline_time, ratio), top-right
corner = real regressions (slow tests that got slower):

[Figure: 01-timing-scatter]

Same comparison, memory dimension (peak RSS via memray.Tracker):

[Figure: 02-memory-scatter]

Per-phase delta breakdown (compare --facets phase):

[Figure: 03-timing-compare-phase]

Going forward

  • Enable CodSpeed: install the GitHub app on the org, add the repo,
    drop CODSPEED_TOKEN into Actions secrets. Until then the master-only
    codspeed.yml workflow is red, but the per-PR smoke job and notebook
    execution still run.
  • Add a model: drop a file under benchmarks/models/, register it.
    Phase coverage and CLI surface follow automatically.
  • Add a pattern: drop a file under benchmarks/patterns/,
    register_pattern it with a severity → data shape build. It joins
    every phase, sweep, and plot with no other wiring.
  • Add a phase: one test_<phase>.py parametrized off the registry.
  • Bump pins: [benchmarks] is the perf-attribution surface;
    [project.dependencies] stays loose so downstream consumers see no
    change.

FBumann and others added 30 commits May 27, 2026 23:04
…smoke

Refactors the internal benchmark suite around a reusable ModelSpec /
REGISTRY pattern so adding a model is one self-registering file with
metadata (features, applicable phases, sizes, optional deps). Other tests
and scripts can import it via `from benchmarks import REGISTRY`.

New model specs cover gaps in the existing coverage:
- milp: general (non-binary) integers (capacitated facility location)
- qp: continuous quadratic objective (diagonal portfolio)
- sos: SOS1 multi-mode generation (Model.add_sos_constraints)
- piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation)
- masked: sparse-route transportation using mask= on add_variables

SOS and piecewise specs gate their own registration on API availability,
so the suite stays runnable on older linopy.
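One way such an availability gate could be written is a simple attribute check on the model class. The sketch below uses stand-in classes instead of importing linopy, and `api_available` / `maybe_register` are hypothetical helper names:

```python
class OldModel:
    """Stand-in for an older linopy.Model without the SOS API."""

    def add_variables(self):
        pass


class NewModel(OldModel):
    """Stand-in for a newer linopy.Model that has the SOS API."""

    def add_sos_constraints(self):
        pass


def api_available(model_cls: type, method: str) -> bool:
    """True if the class exposes a callable attribute with that name."""
    return callable(getattr(model_cls, method, None))


def maybe_register(registry: dict, name: str, model_cls: type, method: str) -> bool:
    """Register the spec only when the required API exists."""
    if not api_available(model_cls, method):
        return False
    registry[name] = method
    return True
```

Against an old class the spec is silently left out, so collection and the remaining specs are unaffected.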

New phase tests:
- test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/
  to_mosek/to_xpress across applicable models, skipping per-solver when
  the solver isn't installed. Uses stable lp.io wrappers (not the new
  Solver.from_name API) for backward compatibility.
- test_netcdf.py: separate to_netcdf / read_netcdf benchmarks.

CI: new benchmark-smoke.yml runs the suite under --quick
--benchmark-disable on PRs, so refactors that break a model spec get
caught early. Timings stay off CI (~35s smoke locally, no regression
tracking).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per
spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full
timing pass completes in ~2 minutes instead of 20-45.

ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``:

- ``--quick``  → ``size <= quick_threshold``  (CI smoke)
- default      → ``size <= long_threshold``   (medium-cost regression)
- ``--long``   → no cap                       (full sweep)
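The three tiers above reduce to a small predicate. This is a sketch under the assumption that `--long` takes precedence when both flags are passed; the `Thresholds` class and `is_selected` name are illustrative only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    quick_threshold: int
    long_threshold: int


def is_selected(size: int, t: Thresholds, *, quick: bool = False, long: bool = False) -> bool:
    """Decide whether a given size runs in the requested tier."""
    if long:  # assumption: --long wins over --quick
        return True
    if quick:
        return size <= t.quick_threshold
    return size <= t.long_threshold
```

For example, with `Thresholds(quick_threshold=10, long_threshold=100)` a spec with sizes 10, 100 and 1600 runs one size under `--quick`, two by default and all three under `--long`.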

Verified locally:
- --quick: 227 passed / 230 skipped / 35s
- default: 333 passed / 124 skipped / 45s
- --long : 457 passed /   0 skipped / 2m13s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA
  import + example loading dominates the smoke wall-clock; the model
  still runs in default and --long modes.
- Lower every other spec's quick_threshold to its smallest size, so
  --quick exercises one size per model across all phases. The default
  tier (which uses long_threshold) still gives broad regression coverage.

Verified locally:
- --quick:  85 passed / 372 skipped / 18.5s   (was 35s)
- default: 333 passed / 124 skipped / 44.8s   (unchanged)
- --long : 457 passed /   0 skipped / 2m11s   (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
benchmarks/notebooks/registry_usage.py is the canonical walkthrough for
the model registry. Authored in jupytext percent format so it triples
as:

- runnable Python script (CI executes it on every PR)
- notebook in JupyterLab / VSCode with the jupytext extension
- readable doc on GitHub (markdown cells render directly)

Covers: import, lookup by name, iterate, filter_by feature/phase,
parametrize-your-own-pytest pattern, one-off tracemalloc profiling,
and the three CLI tiers.

CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step
right after the pytest smoke — so doc rot fails the build instead of
hiding until someone next opens the file.

README: new "Worked walkthrough" subsection points at the notebook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace jupytext-style ``registry_usage.py`` with a proper
  ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb,
  nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``.
- Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table)
  to ModelSpec. Visible in pytest -v output, in interactive Python, and as
  rich HTML in Jupyter cells.
- Notebook simplified to lean on the new reprs: explicit-attribute prints
  in sections 2-5 replaced by bare expression evaluations.
- README points at the .ipynb and notes the "launch jupyter from repo root"
  convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``python -m benchmarks <command>`` with typer subcommands:

- list / show / filter — introspect the registry
- smoke               — pytest --quick --benchmark-disable (CI)
- run [--long --phase --model --filter --json]
                      — pytest --benchmark-only with knobs
- notebook            — execute the registry-usage notebook
- memory save/compare — replaces the argparse main in memory.py

Modern typer style throughout: Annotated[...] for every parameter,
Literal[...] for the --phase choice, function docstrings for command
help. ``--help`` is auto-generated and is the source of truth — README
and the notebook just point at it instead of duplicating the menu.

CI smoke now calls ``python -m benchmarks smoke`` and
``python -m benchmarks notebook``. memory.py keeps its save/compare
functions but loses the argparse layer. typer added to the [benchmarks]
extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups after checking typer's docs:

- Pin typer to the latest release (==0.26.2) in the [benchmarks] extra,
  so the CLI's behaviour is reproducible across dev / CI / contributor
  machines.

- Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument
  to the idiomatic ``typer.Context`` + ``context_settings`` pattern
  (allow_extra_args, ignore_unknown_options). With the old style, any
  trailing ``--flag`` would be parsed as an unknown option and rejected;
  with ctx.args, ``python -m benchmarks run --long -- --tb=short -x``
  actually works.

Other patterns already match typer's recommended style: Annotated[...],
Literal for choice params, docstrings for command help, sub-apps via
add_typer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layers of pinning for stable measurement:

- ``[benchmarks]`` extra in pyproject pins the test infra exactly
  (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4,
  nbconvert, typer). Loose enough that the sweep workflow can install
  varying linopy versions on top.

- ``benchmarks/requirements.lock`` is the full transitive resolution
  (numpy, scipy, pandas, xarray, plus everything else). Generated via
  ``uv pip compile --no-emit-package linopy`` so the lockfile pins the
  *environment around linopy* without pinning linopy itself — that lets
  the same lockfile work for both current-tip regression runs and
  cross-version sweeps.

README clarifies that the lockfile gives consistency over time on the
same machine, not absolute reproducibility across machines (CPU / cache
/ memory bandwidth still matter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv
per version with uv, installs the benchmark infra (lockfile by default,
or the [benchmarks] pinned subset with --no-use-lock) plus the target
linopy in a single resolution pass, and runs the suite. Snapshots land
in ``<output-dir>/linopy-<version>.json``.

Useful for bootstrapping a perf history against published linopy
releases. The current benchmark code runs against each linopy version
(constant measurement layer); the ``_API_AVAILABLE`` gates on sos /
piecewise specs make older linopy versions skip those phases gracefully.

Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end
in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark
run). Plain releases (0.4.0) and pip specs (git+https://...) both work
via the ``_linopy_install_spec`` helper.
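The helper's body is not shown here, so the following is only a guess at the idea: treat a bare version number as a pinned release and pass anything else through as a pip requirement. The function name, the regex and the example URL are all assumptions:

```python
import re

# Deliberately simple: digits separated by dots, e.g. "0.4.0".
_PLAIN_VERSION = re.compile(r"\d+(\.\d+)*")


def linopy_install_spec(target: str) -> str:
    """Map a sweep target to something an installer can resolve."""
    if _PLAIN_VERSION.fullmatch(target):
        return f"linopy=={target}"
    # Anything else (VCS URL, local path, full requirement) is passed through.
    return target
```

So `linopy_install_spec("0.4.0")` gives a pinned requirement, while a `git+https://` URL is returned unchanged.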

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README previously duplicated content from three sources:
- the notebook (models table, registry-usage code blocks)
- ``--help`` (quick-reference command list)
- a stale memory.py invocation (since replaced by ``memory save/compare``)

After the consolidation each surface has a clear single job:

- README: 1-paragraph what, setup (uv sync / lockfile), size-tier table
  (architectural), pointers to the notebook + ``--help``, metrics blurb.
- ``notebooks/registry_usage.ipynb``: the walkthrough — registry import,
  lookup / iterate / filter, parametrize your own pytest, profiling.
- ``python -m benchmarks --help``: command reference, autogenerated by
  typer from docstrings / Annotated[..., Option(...)] declarations.

Drops ~140 lines from the README; nothing actually disappears — it just
lives in the one place that owns it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pypsa removed from the [benchmarks] pinned set, from the sweep
``--no-use-lock`` install list, and from the lockfile. The
``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip``
so collection no longer fails without pypsa; ``pypsa_scigrid`` already
had ``requires=("pypsa",)`` so its phase tests skip gracefully.
Install pypsa separately when you want those benchmarks.

Notebook (registry_usage.ipynb) rewritten as a proper operator guide:

- Architecture overview + per-phase measurement table up front.
- Registry walkthrough (lookup / iterate / filter) kept as the spine.
- Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check).
- ``Running`` section now embeds ``--help`` output live via a
  ``show_help()`` helper that shells out to ``python -m benchmarks ...
  --help``. The doc stays in sync with the typer implementation
  automatically — change a flag in cli.py, re-run the notebook,
  documentation updates.
- New sections cover timing snapshots, memory snapshots, the
  cross-version sweep, and lockfile regeneration.

README gains an explicit "pypsa is optional" note in setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough

Mirrors ``run``'s filter knobs and applies them to every version's
pytest invocation. Also switches to the ``typer.Context`` +
``context_settings`` pattern so trailing args after ``--`` are
forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use).

    python -m benchmarks sweep 0.6.7 --phase build --model basic
    python -m benchmarks sweep 0.6.7 -- --tb=short -x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks compare a.json b.json [-- --columns=...]``
shells out to ``pytest-benchmark compare`` so the whole suite stays
under one entry point. Accepts any number of snapshots; first is the
baseline.

When called with no arguments — or with paths that don't exist — it
prints a copy-paste-ready list of snapshots found under
``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version
runs). If nothing's saved yet, points at the ``run --json`` flow.

For memory snapshots use ``memory compare`` instead — different
format, different tool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-paste)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ame)

pytest-benchmark's own default emits 10 columns side-by-side, which is
unreadable for any non-trivial comparison. Wrapper now prepends
``--columns=median,iqr --sort=name`` so the table is two stats wide
and the (baseline, candidate) pair of each test sits together
alphabetically.

Defaults are only applied when the user hasn't already set the flag,
so trailing pass-through overrides still work.
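A sketch of "apply a default only if the user has not set that flag", assuming flags arrive either as `--flag=value` or as `--flag value`. The `with_defaults` name and the `DEFAULTS` dict are illustrative, not the wrapper's real code:

```python
DEFAULTS = {"--columns": "median,iqr", "--sort": "name"}


def with_defaults(user_args: list[str], defaults: dict[str, str] = DEFAULTS) -> list[str]:
    """Prepend each default flag unless the user already passed it."""
    # Collect flag names, ignoring any "=value" suffix.
    given = {a.split("=", 1)[0] for a in user_args if a.startswith("--")}
    prepend = [f"{flag}={value}" for flag, value in defaults.items() if flag not in given]
    return prepend + list(user_args)
```

Because defaults are prepended and skipped when present, a user-supplied `--sort=min` is the only `--sort` the underlying tool ever sees.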

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arg split

Two fixes for the ``compare`` UX surfaced by the cross-version sweep:

- Default to ``--group-by=fullname`` so each test gets its own mini
  table showing (baseline, candidate) side-by-side with the
  parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for
  regressions in the median column. Combined with the existing
  ``--columns=median,iqr --sort=name`` defaults, the output goes from
  10-columns-wide-on-one-line to a focused two-column per-test view.

- Switch ``compare`` away from a positional ``list[Path]`` argument and
  parse ``ctx.args`` by hand: typer's positional list was greedily
  grabbing trailing ``--group-by=fullname`` etc. (and the ``--``
  separator didn't escape it either). Now arg-splitting is explicit:
  anything starting with ``-`` is pytest-benchmark pass-through,
  everything else is a snapshot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tried switching to the canonical typer pattern (``--`` separator for
pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args``
still greedily ate the trailing options. There's no clean typer/click
idiom for "list-typed positional + pass-through" — workarounds are
manual splitting, bounding the positional count, or named flags.

Manual splitting is the most pragmatic: snapshots come first, once we
see any flag-like token the rest is forwarded to pytest-benchmark.
That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in
SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``,
and the value-taking flags whose value doesn't start with ``-``.
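The manual split described above can be sketched as follows. Dropping a leading bare `--` is an assumption about how the separator is handled, and `split_compare_args` is a hypothetical name:

```python
def split_compare_args(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv into (snapshot paths, pass-through options).

    Everything before the first token starting with "-" is a snapshot;
    that token and everything after it is forwarded untouched.
    """
    for i, tok in enumerate(argv):
        if tok.startswith("-"):
            rest = argv[i:]
            if rest and rest[0] == "--":  # assumption: drop a bare separator
                rest = rest[1:]
            return argv[:i], rest
    return list(argv), []
```

Value-taking flags survive because once the first flag is seen nothing after it is reinterpreted as a snapshot, so `--csv out.csv` keeps its value.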

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three opinionated interactive HTML views over pytest-benchmark JSONs,
auto-picked from snapshot count or set explicitly via ``--view``:

- **compare** (2 snapshots) — horizontal bar chart of per-test median
  delta, sorted by magnitude, green→red colormap. The "did this PR
  regress anything?" picture in one glance, vs pytest-benchmark's
  60-individual-SVGs which are useless for that workflow.
- **sweep** (3+ snapshots) — heatmap of median ratio relative to the
  first snapshot, rows = tests, columns = labels. Pairs with the
  ``sweep`` subcommand.
- **scaling** (1 snapshot) — log-log median vs ``n`` for
  size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by
  phase. Shows whether linopy's complexity scales as expected.

plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is
lazy-imported inside ``plot`` so the rest of the suite stays usable
without it (with a clear error if a user tries ``plot`` and it's
missing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er absolutes

- New ``benchmarks/plotting.py`` module owns the three views
  (``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a
  ``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports
  ``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the
  view functions so importing the module without plotly works.

- ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto``
  so values render inside each bar / cell.

- Hover info upgraded:
  - compare hover shows the per-test median of *both* snapshots
    (formatted to 4 significant figures) in addition to the delta %.
  - sweep hover shows the absolute median (s) alongside the ratio, via
    a customdata + hovertemplate plumbed through ``update_traces``.

scaling view already shows the absolute median on hover by virtue of
being a line chart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ride

For microbenchmarks the lowest observed time is closest to the "true"
cost — background noise (GC, context switches, thermal throttling) can
only slow things down. pytest-benchmark's own ``--sort`` default is
``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark
and Alexandrescu's "Speed is found in the minds of people" all argue
similarly.

Changes:

- ``plot`` defaults to ``--metric min`` (was median). Accepts
  ``--metric median|mean|max`` to override. The metric drives the bar
  values, heatmap ratios, scaling-curve y-axis, and the hover labels.
- ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in
  ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg.
- ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was
  median,iqr / name). The auto-ratios next to each ``min`` flag
  regressions in the same readable form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For a suite where test costs span six orders of magnitude (knapsack
microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta
overweights cheap tests: a 100% regression on a 1 µs test ranks above
a 1% regression on a 2.4 s test, but the absolute impacts are 1 µs vs
24 ms.

Changes:

- Default sort is now ``absolute`` (``b - a`` in seconds). Bar values
  are the time delta with SI-prefix formatting on the x-axis (24 ms,
  240 µs, etc.). Big actual-time impacts float to the bottom.
- ``--sort relative`` keeps the old percent behaviour.
- Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless
  of which one drives the sort, so you can read off whichever lens.
- ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform
  signature but ignore it (no two-snapshot diff there).
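To make the ordering difference concrete, here is a small worked example with two made-up tests: a 1 µs test that doubles and a 2.4 s test that slows by 1%. The test names and the `deltas` helper are illustrative:

```python
# (test name, baseline seconds, candidate seconds)
rows = [
    ("knapsack_build", 1e-6, 2e-6),   # +100%, +1 µs
    ("pypsa_carbon", 2.4, 2.424),     # +1%,   +24 ms
]


def deltas(name: str, a: float, b: float) -> dict:
    """Compute both lenses for one (baseline, candidate) pair."""
    return {"test": name, "delta_abs": b - a, "delta_pct": 100 * (b - a) / a}


table = [deltas(*r) for r in rows]

# Relative sort ranks the microbenchmark first; absolute sort ranks the
# slow test first.
by_relative = sorted(table, key=lambda r: r["delta_pct"], reverse=True)
by_absolute = sorted(table, key=lambda r: r["delta_abs"], reverse=True)
```

Keeping both columns in the row is what lets the hover show either lens regardless of the sort key.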

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compare bar chart forces a choice between sorting by relative %
or absolute Δ. Both have blind spots: pure-relative makes microbenchmark
noise look catastrophic, pure-absolute hides real algorithmic regressions
on fast paths.

The two-axis scatter resolves the tension visually. Per test:

- x = baseline time (log scale)
- y = candidate / baseline ratio
- colour = absolute Δ

A point is a real regression worth chasing only when it sits in the
top-right — slow tests that got slower. Top-left (high ratio, tiny
absolute) reads as microbenchmark noise; bottom-right (high absolute,
ratio ≈ 1) was already slow and didn't change. A dashed reference line
at ``y=1`` makes "no change" trivial to see.

The scatter view is never auto-picked (compare still wins for 2
snapshots); pass ``--view scatter`` explicitly to get it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-axis scatter now scales beyond a single baseline-vs-candidate
pair. With N>=3 inputs the first is still the baseline (reference); each
subsequent snapshot becomes one animation frame. Use the slider / play
button to scrub through versions and watch tests drift across releases.

Implementation:

- First snapshot is the baseline. Skipped from the frame set (would
  trivially be y=1 everywhere).
- Each subsequent snapshot contributes points at (baseline_time,
  ratio, Δ) per overlapping test. ``animation_frame="version"`` does
  the per-frame slicing; ``category_orders`` preserves input order in
  the slider so the timeline reads left→right.
- ``range_x`` / ``range_y`` are pinned to the global min/max so the
  camera doesn't jump between frames.
- 2 inputs still produces a static scatter (no animation overhead).

Considered ``facet_col`` but it gets cramped past ~4 versions — the
slider scales to arbitrary length.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… |Δ|

Two small but high-value tweaks for the multi-snapshot scatter:

- The baseline snapshot now contributes its own animation frame where
  every point sits at ratio=1, Δ=0. Gives the animation a "before
  anything happened" anchor: hit play and watch points drift from the
  baseline horizon outward. Previously the first frame was the first
  candidate, which made the visual feel as if it started mid-story.

- ``range_color`` is pinned to the 95th-percentile absolute Δ
  (±p95). One huge outlier no longer drags the colour scale and
  flattens everyone else to white; outliers saturate at the bound,
  the rest of the distribution stays readable. Colour-bar label notes
  ``Δ (s, p95-clipped)`` so the convention is explicit.
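A stdlib-only sketch of the p95 clipping idea. The real code presumably uses pandas or numpy; the linear-interpolation percentile and the `1e-9` fallback for an all-zero input are assumptions:

```python
import math


def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty sequence")
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def symmetric_color_range(deltas: list[float], q: float = 95.0) -> tuple[float, float]:
    """Colour bounds of ±(q-th percentile of |delta|), centred on zero."""
    bound = percentile([abs(d) for d in deltas], q)
    if bound == 0:
        bound = 1e-9  # avoid a degenerate zero-width range
    return (-bound, bound)
```

Values beyond the bound saturate at the end colours, so a single large outlier no longer sets the scale for every other point.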

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no change" line sits at y=1, but with asymmetric data (e.g. some
2x regression, no symmetric speedup), it landed near the bottom of the
visible range and improvements got squeezed near the floor.

Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 -
max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the
window is symmetric around 1.0, so regressions above and improvements
below are equally readable regardless of the data skew.
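The formula above, written out as a function (the name `symmetric_range_y` is illustrative):

```python
def symmetric_range_y(ratios: list[float]) -> tuple[float, float]:
    """Window centred on 1.0 that still covers the full min/max of the data."""
    y_lo, y_hi = min(ratios), max(ratios)
    max_dist = max(abs(1 - y_lo), abs(y_hi - 1))
    return (1 - max_dist, 1 + max_dist)
```

For ratios between 0.9 and 2.0 the larger distance from 1 is 1.0, so the window becomes `(0.0, 2.0)` and the `y=1` line sits in the middle.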

The colour scale keeps the p95-clipped centred-at-0 treatment from the
previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h warn

Three safe fixes from a code review of benchmarks/plotting.py:

- Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with
  the floor bumped from 400 to 500. At 25+ tests the y-axis labels
  were colliding; now they breathe.
- plot_scaling reads ``params.size`` (the cleanly-stored int from
  parametrize) and only falls back to the id regex if absent. The
  ``model`` name still needs the regex because pytest-benchmark
  serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a
  full params switch isn't possible here — but the size path is now
  robust to test-id rename.
- plot_compare surfaces the mismatch between snapshots: prints a
  stderr line with the test counts only in A / only in B / common,
  and embeds the same as a subtitle in the figure. Silent intersection
  was the worst-case footgun.
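The "params first, id regex as fallback" lookup for the size could look roughly like this. It assumes each benchmark entry is a dict with `params` and `name` keys, as in pytest-benchmark's JSON output; the helper name and the regex are illustrative:

```python
from __future__ import annotations

import re

# Matches the size in ids such as "test_build[basic-n=500]".
_ID_SIZE_RE = re.compile(r"-n=(\d+)\]")


def extract_size(bench: dict) -> int | None:
    """Prefer the stored integer param; fall back to parsing the test id."""
    params = bench.get("params") or {}
    size = params.get("size")
    if isinstance(size, int):
        return size
    match = _ID_SIZE_RE.search(bench.get("name", ""))
    return int(match.group(1)) if match else None
```

Only the fallback path depends on the id grammar, so renaming test ids does not break entries that carry `params.size`.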

Skipped (per review note): the default-view swap for 3+ snapshots
(sweep → scatter) is a judgement call left for the user. Default
output filename change (clobber on each run) also skipped — they want
to decide whether per-view filenames are worth the API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…view>.html

Two coupled changes setting up the notebook-embedding path:

- ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling``
  in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure``
  instead of writing to disk + returning a count. The CLI does the
  ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)``
  is exported as a helper so the CLI still emits a "N points → path"
  status line.

  This unblocks rendering plots directly in jupyter — call
  ``plot_compare(...)`` and Jupyter's display hook renders the Figure
  inline.

- Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html``
  (was ``benchmark-plot.html`` in cwd). Matches where snapshots already
  land (and is gitignored), and the per-view filename means consecutive
  runs of different views don't clobber each other.

Bonus: two ``numpy_array or fallback`` bugs in scatter (``df.abs().max()
or 1e-9``) and the new ``n_points`` helper (``trace.x or trace.z``) —
both triggered ``ValueError: The truth value of an array with more
than one element is ambiguous``. Replaced with explicit ``is None``
checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ospection

The ``n_points(fig)`` helper added in the previous commit walked
``fig.data`` traces and called ``len(trace.x)`` to recover the test
count. That's backwards — the count is sitting right there in the
source DataFrame at render time, no need to reach into the rendered
plot.

Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for
compare / sweep / scaling; ``df["test"].nunique()`` for scatter
(rows are per-(test, version) so the raw len double-counts).

n_points helper dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two refinements to the end-to-end plotting section:

- tqdm wraps the subprocess loop that generates the two snapshots.
  Each ``--quick --phase build`` run takes ~10 s; tqdm makes the
  ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget
  vs console bar based on context.

- Plots are now rendered via ``python -m benchmarks plot --view <name>``
  rather than direct ``plot_compare`` / ``plot_scatter`` imports.
  A small ``cli_plot(view, snapshots)`` helper runs the subprocess,
  reads the generated HTML, and inlines it via ``IPython.display.HTML``.
  Demonstrates the actual user-facing CLI path inside the notebook
  rather than the internal API.

Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting
overhead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

FBumann commented Jun 3, 2026

@FabianHofmann Do you think adding something like CodSpeed to the codebase would be a nice addition? I'm not sure, but we might just try it out...

FBumann requested a review from FabianHofmann June 3, 2026 12:50
@FabianHofmann

> @FabianHofmann Do you think adding something like CodSpeed to the codebase would be a nice addition? I'm not sure, but we might just try it out...

Trying it out sounds good. @lkstrp what do you think? I don't know if we should enable it for the whole organization or only for linopy.


FBumann commented Jun 4, 2026

I'm looking into sparsity a bit more. I need this benchmarking stuff to reliably improve memory usage while avoiding regressions.

FBumann and others added 15 commits June 4, 2026 22:11
Patterns are fragments of realistic modelling code (a nodal-balance
groupby-sum, a KVL sparse contraction) that reproduce the dense-`_term`
materialisation hot spots — measured the same way models are (time +
peak memory, through the same phases), but parametrised by `severity`
(0-100, "how pathological the data shape is") instead of `size`.

Unify models and patterns under one `BenchSpec` contract:

- `ModelSpec` (axis "n", sweeps size) and `PatternSpec` (axis
  "severity", sweeps severity) both build a `linopy.Model` and expose
  `sweep`/`axis`/`description`; harness reads the contract, not the type.
- One generic `iter_params`/`param_ids` over `all_specs()` (models +
  patterns); test-id grammar generalised to `-<axis>=<int>` with an
  `axis` column carried through snapshots and plots. `severity` is an
  int so it rides the same grammar/Int64/plot machinery as size.
- Patterns ride the existing phase drivers (build/matrices/lp_write/
  netcdf/solver) — no dedicated driver — so the build-vs-export contrast
  (does the bloat reach the matrix/LP file or collapse?) falls out.
  `memory.run_phase` iterates all specs through one path.
- `--quick` keeps severities up to the midpoint so smoke exercises real
  pathology, not just the benign endpoint.

Two patterns to start: `nodal_balance` (#745 groupby padding, a live
cliff in peak memory) and `kvl_cycles` (#748 sparse `@` densification,
flat today — a sparse-aware kernel bends it down).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- memory: add an opt-in `pipeline` phase (build→matrices→lp_write in one
  tracker) reporting the end-to-end peak / OOM ceiling — the number that
  can't be recovered from the isolated per-phase marginals. Kept out of
  the default set (it re-runs those phases) and requested via
  `--phase pipeline`; CLI validates against a new `ALL_MEMORY_PHASES`.
- cli: `list` now shows models and patterns with a `--kind {all,models,
  patterns}` selector (default both); `list --details` gets a patterns
  table; `show <name>` resolves patterns too. Closes the discoverability
  gap left by patterns sharing the phase drivers (no dedicated test file).
- docs: walkthrough gains a "Patterns" section (how to run them via the
  shared drivers + `-k severity`) and a note on the `pipeline` ceiling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lper

Trim the spec-selection surface: `--model`, `--kind`, and the proposed
`--spec` were all "which specs" flags overlapping with `--filter`/`-k`.
Collapse to two composable selectors on the run/measure commands —
`--phase` (stage) and `--filter`/`-k` (specs by name or id substring:
`nodal_balance` one spec, `severity` patterns, `n=` models):

- run/sweep: drop `--model` (== `--filter <name>`).
- memory save: replace `--kind` with `--filter` (substring on the
  `<name>-<axis>=<value>` key; also gains single-spec selection, which
  memory previously couldn't do).
- list: keeps `--kind {all,models,patterns}` — it filters names, where
  the axis tag doesn't appear, so a substring filter can't select a kind.

Also factor the test-id fragment `f"{name}-{axis}={value}"` into
`snapshot.spec_param_id()` — the one source of truth shared by
`param_ids`, the memory grid ids, the solver-handoff ids, the memory
`--filter` key, and `synth_test_id`, kept in lock-step with the parse regex.
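
A minimal sketch of the id grammar and its round trip. The f-string fragment is the one quoted above; the regex and `parse_param_id` are hypothetical stand-ins for the suite's parse regex, which is not shown here:

```python
import re

# Hypothetical parse regex for the `<name>-<axis>=<value>` fragment.
_PARAM_ID_RE = re.compile(r"^(?P<name>.+)-(?P<axis>[A-Za-z_]+)=(?P<value>\d+)$")


def spec_param_id(name: str, axis: str, value: int) -> str:
    """Format the test-id fragment shared by all id producers."""
    return f"{name}-{axis}={value}"


def parse_param_id(param_id: str) -> tuple[str, str, int]:
    """Inverse of spec_param_id; raises ValueError on a malformed id."""
    match = _PARAM_ID_RE.match(param_id)
    if match is None:
        raise ValueError(f"not a spec param id: {param_id!r}")
    return match["name"], match["axis"], int(match["value"])


ids = [
    spec_param_id("nodal_balance", "severity", 50),
    spec_param_id("storage", "n", 1000),
]
# Substring selection in the spirit of `--filter`/`-k`.
patterns_only = [i for i in ids if "severity" in i]
models_only = [i for i in ids if "n=" in i]
```

Keeping the formatter and the parser side by side is what lets a substring such as `severity` or `n=` select a whole kind of spec.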

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cy guard

Running mypy across the whole suite (CI excludes benchmarks/, so this is a
manual check) surfaced 9 errors, all from two roots:

- BenchSpec declared settable members, but ModelSpec/PatternSpec are frozen
  dataclasses (read-only attrs), so they didn't structurally satisfy the
  Protocol — which cascaded through every function typed on BenchSpec
  (all_specs, maybe_skip, _measurements). Declare the Protocol members as
  read-only `@property` to match. Clears 8 of 9.
- synth_test_id passed `model: str | None` to spec_param_id (expects str);
  `all(<bools>)` doesn't narrow. Narrow explicitly. Clears the 9th.
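
The first root can be reproduced with a trimmed sketch. The two-member protocols and the `describe` helper are hypothetical; the real `BenchSpec` carries more members:

```python
from dataclasses import dataclass
from typing import Protocol


class BenchSpecSettable(Protocol):
    # Plain attribute declarations are treated as read-write members.
    name: str
    axis: str


class BenchSpecReadOnly(Protocol):
    # Properties only promise read access, which a frozen dataclass provides.
    @property
    def name(self) -> str: ...

    @property
    def axis(self) -> str: ...


@dataclass(frozen=True)
class ModelSpec:
    name: str
    axis: str = "n"


def describe(spec: BenchSpecReadOnly) -> str:
    return f"{spec.name} sweeps {spec.axis}"


spec = ModelSpec("storage")
print(describe(spec))
```

Typing `describe` on `BenchSpecSettable` instead would make mypy reject the `ModelSpec` argument, because a frozen dataclass only offers read-only attributes. At runtime both variants behave the same.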

Also fix the `plot` command's plotly guard: `from benchmarks.plotting import
RENDERERS` always succeeds (plotly is imported lazily inside each renderer), so
a plotly-less user hit a raw ModuleNotFoundError instead of the friendly
message. Check `importlib.util.find_spec("plotly")` instead. Plus refresh two
stale `plot` help strings (`-n=<size>` → `-<axis>=<value>`; scaling is now
axis-aware, not always log-log).
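
A minimal sketch of the guard, assuming a hypothetical `require_plotly` helper and message wording (the suite's actual text is not shown here):

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_plotly() -> None:
    if not module_available("plotly"):
        # Hypothetical message; exits cleanly instead of a raw ModuleNotFoundError.
        raise SystemExit("the `plot` command needs plotly; please install it first")
```

Because the check never imports the renderer module, it stays accurate even when plotly is imported lazily inside each renderer.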

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two density idioms that both the PyPSA and flixopt scans converged on —
neither covered by the existing nodal_balance (#745) / kvl_cycles (#748):

- merge_balance (#749): merging sub-expressions of different `_term`
  widths along a shared dim pads every block to the widest, leaving the
  narrow ones mostly fill — the documented SciGRID build peak. PyPSA
  `merge(gen+storage+lines, join="outer")`; flixopt bus balance
  `sum([flows])`. severity dials the widest block (verified _term
  3 -> 102 -> 200 across 30 blocks).

- flow_sum: `.sum(dim)` folds the summed dim's whole size into `_term`.
  PyPSA `(p*w).sum()` CO2/operational limits; flixopt
  `.sum(['time','cluster'])`. severity dials the folded-dim size
  (verified _term 2 -> 100 -> 200).

Both ride the shared phase drivers like the other patterns; mypy-clean.
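
The padding cost in merge_balance can be estimated with a little arithmetic. This sketch assumes a hypothetical block mix (29 narrow blocks of 3 terms plus one block at the dialled width); the real pattern's mix may differ:

```python
def padded_fill(term_widths: list[int]) -> tuple[int, int, float]:
    """Return (dense_slots, real_terms, fill_fraction) when every block
    is padded to the widest `_term` width."""
    dense = len(term_widths) * max(term_widths)
    real = sum(term_widths)
    return dense, real, (dense - real) / dense


# Hypothetical mix: widen a single block across the severity range.
for widest in (3, 102, 200):
    dense, real, fill = padded_fill([3] * 29 + [widest])
    print(f"widest={widest:>3}  dense={dense:>5}  real={real:>4}  fill={fill:.3f}")
```

Under this assumption, the fill fraction is zero when all blocks share one width and grows as one block widens, which is the headroom a sparsity-aware merge could drop.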

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…coupling)

Close the intertemporal-coupling op gap, split correctly across the two axes:

- storage MODEL (size): a fleet of SoC recursions
  `soc[t] - decay*soc[t-1] - eff*charge + discharge/eff == 0` via
  `soc.shift(time=1)`. The only model exercising the `.shift()`/`.isel()`
  intertemporal ops (PyPSA SoC, flixopt `charge_state.isel` recursion). It's
  a model, not a pattern: bidiagonal → ~4 terms/row regardless of horizon or
  unit count (verified `_term`=4 flat across size), so it scales with size,
  no benign→worst dial.

- rolling PATTERN (severity): the *windowed* form does have a dial.
  `status.rolling(K).sum()` (min up/down time, windowed limits) builds K
  terms per row, so window width is the severity (verified `_term`
  1 -> 84 -> 168 across the horizon).

The pair is a clean illustration of the size-vs-severity split: storage's
`_term` is flat (size axis), rolling's climbs (severity axis).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`cumsum` folds a growing window into `_term` (running totals — cumulative
energy, a rolling budget): `(1*x).cumsum("t")`, verified `_term` 2 -> 100
-> 200 (triangular).

It's benchmarked as its own op rather than folded into `rolling` even
though linopy currently implements `cumsum` via `rolling(window=full_dim)`
(expressions.py): that delegation is an implementation coincidence, not a
contract, and `cumsum` is a natural de-densification target (a prefix sum
need not materialise the triangle). So it plays the same instrument role
as `kvl_cycles` — flat/redundant today, but the thing that would show a
dedicated cumsum kernel land. Benchmark the public op, not its wiring.
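
A rough count of the triangle's headroom, assuming the dense form stores `n` term slots for each of `n` rows while row `t` only needs `t + 1` of them (a simplification of the `_term` layout):

```python
def cumsum_fill(n: int) -> tuple[int, int, float]:
    """Return (dense_slots, real_terms, fill_fraction) for a length-n cumsum."""
    dense = n * n              # full rectangle: n rows, n term slots each
    real = n * (n + 1) // 2    # triangle: row t holds t + 1 real terms
    return dense, real, (dense - real) / dense


dense, real, fill = cumsum_fill(200)
print(dense, real, fill)  # 40000 20100 0.4975
```

Under that assumption roughly half of the rectangle is fill at full width.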

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `benchmarks/` suite was excluded from `mypy .`, so its type-cleanliness
was only enforced on a manual run. Activate it. The catch: the old
`'benchmark/*'` pattern is a regex whose `/*` (zero-or-more slash) matches
the `benchmark` prefix of `benchmarks/` too — so simply dropping
`'benchmarks/*'` wouldn't have worked. Replace both with `'^benchmark/'`,
which requires the slash and so excludes only the legacy singular
`benchmark/` (not mypy-clean) while checking `benchmarks/`.

Verified `mypy .` is clean (101 files) with benchmarks/ included and
benchmark/ still skipped.
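
The regex behaviour can be checked in isolation. This sketch assumes the exclude pattern is applied as a regex search over the relative path, and the two file names are hypothetical:

```python
import re


def excluded_by(pattern: str, paths: list[str]) -> list[str]:
    """Paths that a regex-search style exclude pattern would drop."""
    return [p for p in paths if re.search(pattern, p)]


paths = ["benchmark/old_suite.py", "benchmarks/cli/run.py"]

for pattern in ("benchmark/*", "^benchmark/"):
    print(f"{pattern!r} excludes {excluded_by(pattern, paths)}")
```

The first pattern matches both paths because `/*` also accepts zero slashes; the anchored pattern only matches the singular directory.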

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- pypsa: skip on environmental example-data fetch failure (network /
  example-API drift) in build_pypsa_scigrid and the carbon fixture, but
  leave n.optimize.create_model() unguarded so a genuine linopy
  regression surfaces rather than being swallowed into a skip.
- CodSpeed job: continue-on-error — it's red until CODSPEED_TOKEN is set
  on the org, and a perpetually-failing missing-secret check shouldn't
  block every PR.
- Trim verbose comments in the [benchmarks] extra.

Dep-group migration deliberately NOT done here: moving dev/docs/benchmarks
from extras to PEP 735 [dependency-groups] is a project-wide change (they're
all extras today) and belongs in its own focused PR, not the benchmark PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 933-line cli.py was a god-file whose ``# --- section ---`` banners
signalled it wanted to be separate modules. Split one module per command
group, all registering onto a shared ``app`` so the flat command surface
and ``--help`` order stay identical:

  _base.py        app/memory_app, shared types + helpers
  introspect.py   list / show / filter
  run.py          smoke / run / notebook
  sweep.py        sweep
  compare.py      compare
  plot.py         plot
  memory.py       memory save / sweep / compare

Also tidy two over-narrated spots:
- sweep.py: drop a stale comment that claimed ``cwd`` was pinned to the
  repo root when the code runs ``cwd=import_dir``; trim the isolation
  blocks the function docstring already covers in full.
- plotting.py: tighten the facet-layout comment.

CLI surface, ruff, and mypy all verified unchanged.
Move the three suite-internal unit tests (test_bench, test_sweep,
test_memory_id_alignment) out of the benchmarks/ root into a dedicated
benchmarks/_tests/ subdirectory, separating them from the benchmark
driver modules. Fix the repo-root computation in test_memory_id_alignment
(parents[1] -> parents[2]) now that the file sits one level deeper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Condense the prose comment blocks in benchmark-smoke.yml, .gitignore,
and the [benchmarks] extra to one or two lines each — keep the why,
drop the restatement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dial)

flow_sum's .sum(dim) fold produces all-real terms with zero fill, so its
severity is just a sub-dim size knob with no sparsity headroom — it
measures nothing the size axis doesn't. Keep cumsum, whose triangular
fold is a genuine de-densification target. PATTERNS is now 5 specs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann marked this pull request as draft June 5, 2026 08:12
@FBumann
Collaborator Author

FBumann commented Jun 5, 2026

@FabianHofmann I'm preparing this to cover different sparsity patterns, so that sparsity improvements become traceable.

…e floor

Separate each pattern's dimensions into the op's own axis (carries the
severity cliff) and the broadcast/volume axis (sets amplitude only,
shape-preserving). Shrink volume dims that were inflating cost on an axis
the op never touches, and size each pattern so its sev-100 peak lands in
a comfortably measurable band rather than near the memray noise floor:

  rolling  unit 200->8, time 168->1000   0.8 -> 138 MiB (linear)
  cumsum   row 200->64                   0   -> 45  MiB (quadratic)
  nodal    time 24->8                    1.2 -> 32  MiB (linear)
  merge    row 200->128                  0.8 -> 24  MiB (linear)
  kvl      time 3->168                   126 MiB flat

kvl stays flat across severity by design (today's @ densifies to
n_branch regardless of C sparsity); time=168 lifts the flat level so the
always-paid densification is visible — the headroom a sparse kernel
would reclaim, at a realistic weekly horizon.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann marked this pull request as ready for review June 5, 2026 10:39
The y-axis read "min" (looks like minutes) and the x-axis read a bare
"severity". Label the y-axis "<unit> (<metric>)" with metric spelled out
(min->minimum, max->maximum), and drive the x-axis label + log-scale
choice from a single _AXIS_DISPLAY table keyed by axis name, replacing
the scattered `x_label == "n"` checks. A third axis is now one row, not a
new branch; the mixed-axis case falls back to a string, so the key type
stays plain str.
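
A sketch of the table-driven lookup. The entries, label strings, and helper names are assumptions for illustration; only the shape (one row per axis, string fallback) follows the description above:

```python
# Hypothetical contents: axis name -> (x-axis label, use log scale).
_AXIS_DISPLAY: dict[str, tuple[str, bool]] = {
    "n": ("size (n)", True),
    "severity": ("severity (0-100)", False),
}

# Metric abbreviations spelled out for the y-axis label.
_METRIC_NAMES = {"min": "minimum", "max": "maximum"}


def x_axis(axis: str) -> tuple[str, bool]:
    # Unknown or mixed axes fall back to the raw string on a linear scale.
    return _AXIS_DISPLAY.get(axis, (axis, False))


def y_label(unit: str, metric: str) -> str:
    return f"{unit} ({_METRIC_NAMES.get(metric, metric)})"
```

Adding a third axis then means adding one dictionary row rather than another `if` branch in the renderer.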

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann
Collaborator Author

FBumann commented Jun 5, 2026

@FabianHofmann I think I will just merge this on my fork and try out CodSpeed there for now.

FBumann and others added 3 commits June 5, 2026 13:14
Patterns aren't models, so the shared spec-name field/column was
mis-named. Rename it to "spec" throughout: the load_long_df / bench
column, the `--facets spec` CLI option (FacetBy), the
`to_snapshot(spec=)` / `synth_test_id(spec=)` kwargs, and docs/tests.
The scaling-plot legend now reads "spec" rather than "model".

Purely internal: test-id strings are unchanged, so existing snapshots
still parse — only the in-memory column and the kwarg/option names move.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cachegrind CodSpeed job shared benchmark-smoke.yml's trigger, so it
ran on every PR despite being ~10–20× slower and only useful as a
master-to-master baseline comparison. Move it to codspeed.yml triggered
on push-to-master + workflow_dispatch (manual), and leave the cheap
smoke job (--benchmark-disable) running on every PR as before.

Regressions now surface as master-to-master deltas; ad-hoc branch checks
go through the manual "Run workflow" button.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann changed the title benchmarks: reusable registry, new model types, new phases, CI smoke benchmarks: reusable registry, with models, patterns and phases, CI smoke and Codspeed Jun 5, 2026
@FBumann
Collaborator Author

FBumann commented Jun 5, 2026

@FabianHofmann The genuine outcome can be seen on a PR where I try to improve memory issues (not ready at all yet):
fluxopt#25
