benchmarks: reusable registry, with models, patterns and phases, CI smoke and Codspeed by FBumann · Pull Request #735 · PyPSA/linopy

FBumann · 2026-05-28T06:37:51Z

Overhaul of the internal benchmark suite

Brings benchmarks/ from a hand-wired pytest collection into a
typer-driven suite with cross-version sweeps, master-baseline CodSpeed CI,
and plotly views. Two orthogonal axes: size (scale a model up) and
severity (dial a density pathology at fixed size).

Core pieces

- Reusable spec registry. from benchmarks import REGISTRY exposes
  11 self-registering model specs (LP / QP / MILP / SOS / piecewise /
  sparse networks / storage / PyPSA carbon) with build(size) + feature
  & phase tags. Adding a model is one file.
- Density patterns (severity axis). A parallel PATTERNS registry of
  5 specs that hold size fixed and sweep a severity 0–100 dial measuring
  sparsity headroom: the fill a sparsity-aware kernel could drop from
  linopy's dense _term rectangle (0 = the non-pathological shape, no
  fill; 100 = the realistic worst case). Each targets a known hot path:
  - nodal_balance: groupby().sum() skew (groupby-sum pads every group to the largest group size #745): 0 uniform → 100 one
    hub holds the rest
  - kvl_cycles: sparse @ (@/dot against a sparse matrix densifies the result to full _term #748): today's dot densifies to
    nterm=n_branch regardless of cycle sparsity, so memory is flat
    across severity; a latent sentinel that lights up when a sparse @
    lands
  - merge_balance: ragged merge (merge of ragged expressions is the peak allocation in PyPSA model builds #749): one wide block pads every
    block to its width
  - rolling: rolling(window).sum() window width (1 → full horizon)
  - cumsum: cumsum triangular term growth
- Phase coverage. Five phases per spec: build, matrices,
  lp_write, netcdf round-trip, solver_handoff (highs/gurobi/mosek/xpress).
  Plus PyPSA end-to-end.
- One CLI. python -m benchmarks --help covers run / smoke / sweep
  / compare / plot / notebook / memory{save,sweep,compare}.
- Cross-version sweeps. sweep 0.6.7 0.7.0 builds one uv venv per
  version; same flow for memory via memray.Tracker (model
  construction excluded from the peak).
- Plotly views. plot <snapshots> picks scaling (1), scatter
  (2, default), or sweep heatmap (3+). --facets {phase, model}
  splits across subplots. Auto-detects timing vs memory JSON.
- Master-baseline CodSpeed + Dependabot loop. CodSpeed (cachegrind,
  instruction-count regressions) runs on push to master plus a manual
  trigger (codspeed.yml), establishing the baseline. It is kept off every PR
  since the run is ~10–20× slower; regressions surface as master-to-master
  deltas. [benchmarks] pins perf-relevant deps individually (numpy/scipy/
  xarray/pandas/polars/dask/highspy/netcdf4); tooling deps are grouped into one
  monthly PR. Each perf-relevant bump → one PR → one attributed CodSpeed
  delta on the next master run.
- Notebook walkthrough at benchmarks/notebooks/registry_usage.ipynb,
  executed in CI so examples can't rot.

Two axes: size and severity

A ModelSpec sweeps n — make the same model bigger. A PatternSpec
sweeps severity 0–100 — hold the model's size fixed and dial up one
operation's sparsity headroom (the removable fill it forces into the
dense _term rectangle). Within a pattern, the operation's own axis
carries the cliff; the broadcast dims are sized only to lift peak memory
clear of the measurement noise floor, never to carry the signal. Both
implement the same BenchSpec protocol, so the phase drivers, CLI,
sweep, memory, and plotting machinery treat them uniformly; test ids
carry the axis ([nodal_balance-severity=100] vs [basic-n=500]). The
split keeps the question sharp: a regression on a model axis is "we got
slower at scale", on a pattern axis it's "a density idiom got worse".
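
A minimal sketch of how one protocol can cover both axes. This is an illustration under assumptions, not the PR's code: the field set is reduced to name, build, sweep and axis, build returns a placeholder instead of a linopy.Model, and the param_ids helper here is a simplified stand-in. The protocol members are declared as read-only properties so that frozen dataclasses satisfy it structurally.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class BenchSpec(Protocol):
    # Read-only members: frozen dataclasses expose read-only attributes,
    # so the protocol must not demand settable ones.
    @property
    def name(self) -> str: ...
    @property
    def axis(self) -> str: ...
    @property
    def sweep(self) -> tuple: ...
    @property
    def build(self) -> Callable[[int], object]: ...


@dataclass(frozen=True)
class ModelSpec:
    name: str
    build: Callable[[int], object]  # size -> model (placeholder type here)
    sweep: tuple
    axis: str = "n"


@dataclass(frozen=True)
class PatternSpec:
    name: str
    build: Callable[[int], object]  # severity -> model (placeholder type here)
    sweep: tuple = (0, 50, 100)  # assumed severity grid, for illustration only
    axis: str = "severity"


def param_ids(spec: BenchSpec) -> list:
    """Build test-id fragments that carry the axis name."""
    return [f"{spec.name}-{spec.axis}={value}" for value in spec.sweep]


basic = ModelSpec("basic", build=lambda n: {"n": n}, sweep=(10, 500))
nodal = PatternSpec("nodal_balance", build=lambda s: {"severity": s})
print(param_ids(basic))
print(param_ids(nodal))
```

Because the harness only reads the shared attributes, the same id helper serves both kinds of spec without checking the concrete type.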

Sample output: 0.6.7 → 0.7.0

Default scatter view — each test at (baseline_time, ratio), top-right
corner = real regressions (slow tests that got slower):

Same comparison, memory dimension (peak RSS via memray.Tracker):

Per-phase delta breakdown (compare --facets phase):

Going forward

- Enable CodSpeed: install the GitHub app on the org, add the repo,
  drop CODSPEED_TOKEN into Actions secrets. Until then the master-only
  codspeed.yml workflow is red, but the per-PR smoke job and notebook
  execution still run.
- Add a model: drop a file under benchmarks/models/, register it.
  Phase coverage and CLI surface follow automatically.
- Add a pattern: drop a file under benchmarks/patterns/,
  register_pattern it with a severity → data shape build. It joins
  every phase, sweep, and plot with no other wiring.
- Add a phase: one test_<phase>.py parametrized off the registry.
- Bump pins: [benchmarks] is the perf-attribution surface;
  [project.dependencies] stays loose so downstream consumers see no
  change.

…smoke

Refactors the internal benchmark suite around a reusable ModelSpec / REGISTRY pattern so adding a model is one self-registering file with metadata (features, applicable phases, sizes, optional deps). Other tests and scripts can import it via `from benchmarks import REGISTRY`.

New model specs cover gaps in the existing coverage:
- milp: general (non-binary) integers (capacitated facility location)
- qp: continuous quadratic objective (diagonal portfolio)
- sos: SOS1 multi-mode generation (Model.add_sos_constraints)
- piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation)
- masked: sparse-route transportation using mask= on add_variables

SOS and piecewise specs gate their own registration on API availability, so the suite stays runnable on older linopy.

New phase tests:
- test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/to_mosek/to_xpress across applicable models, skipping per-solver when the solver isn't installed. Uses stable lp.io wrappers (not the new Solver.from_name API) for backward compatibility.
- test_netcdf.py: separate to_netcdf / read_netcdf benchmarks.

CI: new benchmark-smoke.yml runs the suite under --quick --benchmark-disable on PRs, so refactors that break a model spec get caught early. Timings stay off CI (~35s smoke locally, no regression tracking).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
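
The self-registration and API gating described in this commit could look roughly like the sketch below. It is an assumption-laden illustration, not the suite's code: the register and filter_by signatures, the ModelSpec fields, and the OldModel stand-in (used instead of importing linopy) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    features: frozenset
    phases: frozenset
    sizes: tuple


REGISTRY: dict = {}


def register(spec: ModelSpec, *, available: bool = True) -> None:
    """Add a spec to REGISTRY unless its API gate is closed."""
    if not available:
        return  # e.g. the installed linopy lacks the API this spec needs
    if spec.name in REGISTRY:
        raise ValueError(f"duplicate spec name: {spec.name}")
    REGISTRY[spec.name] = spec


def filter_by(*, feature=None, phase=None) -> list:
    """Return the specs matching the given feature and/or phase."""
    return [
        spec
        for spec in REGISTRY.values()
        if (feature is None or feature in spec.features)
        and (phase is None or phase in spec.phases)
    ]


class OldModel:
    """Hypothetical stand-in for an older linopy.Model without the SOS API."""


register(ModelSpec("qp", frozenset({"quadratic"}), frozenset({"build", "lp_write"}), (10, 100)))
# The gate is evaluated at registration time, so on an old API the spec is skipped.
register(
    ModelSpec("sos", frozenset({"sos"}), frozenset({"build"}), (10,)),
    available=hasattr(OldModel, "add_sos_constraints"),
)
print(sorted(REGISTRY))
```

Gating at registration keeps the phase tests free of version checks: a spec that is not registered is never parametrized.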

The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full timing pass completes in ~2 minutes instead of 20-45.

ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``:
- ``--quick`` → ``size <= quick_threshold`` (CI smoke)
- default → ``size <= long_threshold`` (medium-cost regression)
- ``--long`` → no cap (full sweep)

Verified locally:
- --quick: 227 passed / 230 skipped / 35s
- default: 333 passed / 124 skipped / 45s
- --long : 457 passed / 0 skipped / 2m13s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
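
The three tiers reduce to one comparison per mode. A small sketch, assuming a hypothetical is_selected helper and example thresholds (the real suite attaches the thresholds to ModelSpec and skips through pytest):

```python
def is_selected(size: int, *, quick_threshold: int, long_threshold: int, mode: str = "default") -> bool:
    """Decide whether a size runs in the given tier."""
    if mode == "quick":
        return size <= quick_threshold
    if mode == "default":
        return size <= long_threshold
    if mode == "long":
        return True  # no cap: full sweep
    raise ValueError(f"unknown mode: {mode!r}")


# Example thresholds for illustration only.
sizes = (10, 100, 1600)
for mode in ("quick", "default", "long"):
    kept = [s for s in sizes if is_selected(s, quick_threshold=10, long_threshold=100, mode=mode)]
    print(mode, kept)
```

With this shape, a quick_threshold of 0 selects no size at all, which is one way to keep an expensive spec out of the smoke tier.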

- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA import + example loading dominates the smoke wall-clock; the model still runs in default and --long modes. - Lower every other spec's quick_threshold to its smallest size, so --quick exercises one size per model across all phases. The default tier (which uses long_threshold) still gives broad regression coverage. Verified locally: - --quick: 85 passed / 372 skipped / 18.5s (was 35s) - default: 333 passed / 124 skipped / 44.8s (unchanged) - --long : 457 passed / 0 skipped / 2m11s (unchanged) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benchmarks/notebooks/registry_usage.py is the canonical walkthrough for the model registry. Authored in jupytext percent format so it triples as: - runnable Python script (CI executes it on every PR) - notebook in JupyterLab / VSCode with the jupytext extension - readable doc on GitHub (markdown cells render directly) Covers: import, lookup by name, iterate, filter_by feature/phase, parametrize-your-own-pytest pattern, one-off tracemalloc profiling, and the three CLI tiers. CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step right after the pytest smoke — so doc rot fails the build instead of hiding until someone next opens the file. README: new "Worked walkthrough" subsection points at the notebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Replace jupytext-style ``registry_usage.py`` with a proper ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb, nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``. - Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table) to ModelSpec. Visible in pytest -v output, in interactive Python, and as rich HTML in Jupyter cells. - Notebook simplified to lean on the new reprs: explicit-attribute prints in sections 2-5 replaced by bare expression evaluations. - README points at the .ipynb and notes the "launch jupyter from repo root" convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds ``python -m benchmarks <command>`` with typer subcommands:
- list / show / filter: introspect the registry
- smoke: pytest --quick --benchmark-disable (CI)
- run [--long --phase --model --filter --json]: pytest --benchmark-only with knobs
- notebook: execute the registry-usage notebook
- memory save/compare: replaces the argparse main in memory.py

Modern typer style throughout: Annotated[...] for every parameter, Literal[...] for the --phase choice, function docstrings for command help. ``--help`` is auto-generated and is the source of truth; README and the notebook just point at it instead of duplicating the menu.

CI smoke now calls ``python -m benchmarks smoke`` and ``python -m benchmarks notebook``. memory.py keeps its save/compare functions but loses the argparse layer. typer added to the [benchmarks] extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two cleanups after checking typer's docs: - Pin typer to the latest release (==0.26.2) in the [benchmarks] extra, so the CLI's behaviour is reproducible across dev / CI / contributor machines. - Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument to the idiomatic ``typer.Context`` + ``context_settings`` pattern (allow_extra_args, ignore_unknown_options). With the old style, any trailing ``--flag`` would be parsed as an unknown option and rejected; with ctx.args, ``python -m benchmarks run --long -- --tb=short -x`` actually works. Other patterns already match typer's recommended style: Annotated[...], Literal for choice params, docstrings for command help, sub-apps via add_typer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two layers of pinning for stable measurement: - ``[benchmarks]`` extra in pyproject pins the test infra exactly (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4, nbconvert, typer). Loose enough that the sweep workflow can install varying linopy versions on top. - ``benchmarks/requirements.lock`` is the full transitive resolution (numpy, scipy, pandas, xarray, plus everything else). Generated via ``uv pip compile --no-emit-package linopy`` so the lockfile pins the *environment around linopy* without pinning linopy itself — that lets the same lockfile work for both current-tip regression runs and cross-version sweeps. README clarifies that the lockfile gives consistency over time on the same machine, not absolute reproducibility across machines (CPU / cache / memory bandwidth still matter). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv per version with uv, installs the benchmark infra (lockfile by default, or the [benchmarks] pinned subset with --no-use-lock) plus the target linopy in a single resolution pass, and runs the suite. Snapshots land in ``<output-dir>/linopy-<version>.json``. Useful for bootstrapping a perf history against published linopy releases. The current benchmark code runs against each linopy version (constant measurement layer); the ``_API_AVAILABLE`` gates on sos / piecewise specs make older linopy versions skip those phases gracefully. Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark run). Plain releases (0.4.0) and pip specs (git+https://...) both work via the ``_linopy_install_spec`` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The README previously duplicated content from three sources: - the notebook (models table, registry-usage code blocks) - ``--help`` (quick-reference command list) - a stale memory.py invocation (since replaced by ``memory save/compare``) After the consolidation each surface has a clear single job: - README: 1-paragraph what, setup (uv sync / lockfile), size-tier table (architectural), pointers to the notebook + ``--help``, metrics blurb. - ``notebooks/registry_usage.ipynb``: the walkthrough — registry import, lookup / iterate / filter, parametrize your own pytest, profiling. - ``python -m benchmarks --help``: command reference, autogenerated by typer from docstrings / Annotated[..., Option(...)] declarations. Drops ~140 lines from the README; nothing actually disappears — it just lives in the one place that owns it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pypsa removed from the [benchmarks] pinned set, from the sweep ``--no-use-lock`` install list, and from the lockfile. The ``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip`` so collection no longer fails without pypsa; ``pypsa_scigrid`` already had ``requires=("pypsa",)`` so its phase tests skip gracefully. Install pypsa separately when you want those benchmarks. Notebook (registry_usage.ipynb) rewritten as a proper operator guide: - Architecture overview + per-phase measurement table up front. - Registry walkthrough (lookup / iterate / filter) kept as the spine. - Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check). - ``Running`` section now embeds ``--help`` output live via a ``show_help()`` helper that shells out to ``python -m benchmarks ... --help``. The doc stays in sync with the typer implementation automatically — change a flag in cli.py, re-run the notebook, documentation updates. - New sections cover timing snapshots, memory snapshots, the cross-version sweep, and lockfile regeneration. README gains an explicit "pypsa is optional" note in setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rough

Mirrors ``run``'s filter knobs and applies them to every version's pytest invocation. Also switches to the ``typer.Context`` + ``context_settings`` pattern so trailing args after ``--`` are forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use).

    python -m benchmarks sweep 0.6.7 --phase build --model basic
    python -m benchmarks sweep 0.6.7 -- --tb=short -x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``python -m benchmarks compare a.json b.json [-- --columns=...]`` shells out to ``pytest-benchmark compare`` so the whole suite stays under one entry point. Accepts any number of snapshots; first is the baseline. When called with no arguments — or with paths that don't exist — it prints a copy-paste-ready list of snapshots found under ``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version runs). If nothing's saved yet, points at the ``run --json`` flow. For memory snapshots use ``memory compare`` instead — different format, different tool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-paste) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ame) pytest-benchmark's own default emits 10 columns side-by-side, which is unreadable for any non-trivial comparison. Wrapper now prepends ``--columns=median,iqr --sort=name`` so the table is two stats wide and the (baseline, candidate) pair of each test sits together alphabetically. Defaults are only applied when the user hasn't already set the flag, so trailing pass-through overrides still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
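
The "defaults only when the user hasn't set the flag" rule can be sketched as follows. The with_defaults helper and the DEFAULTS table are hypothetical names for illustration; the flag values are the ones this commit message lists.

```python
DEFAULTS = {"--columns": "median,iqr", "--sort": "name"}


def with_defaults(user_args: list) -> list:
    """Prepend each default flag unless the user already passed that flag."""
    # Accept both "--flag=value" and "--flag value" spellings.
    present = {arg.split("=", 1)[0] for arg in user_args if arg.startswith("--")}
    prepend = [f"{flag}={value}" for flag, value in DEFAULTS.items() if flag not in present]
    return prepend + list(user_args)


print(with_defaults([]))
print(with_defaults(["--sort=mean"]))
```

Prepending rather than appending keeps the user's own flags last on the command line, so an explicit override is never shadowed by a default.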

…arg split Two fixes for the ``compare`` UX surfaced by the cross-version sweep: - Default to ``--group-by=fullname`` so each test gets its own mini table showing (baseline, candidate) side-by-side with the parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for regressions in the median column. Combined with the existing ``--columns=median,iqr --sort=name`` defaults, the output goes from 10-columns-wide-on-one-line to a focused two-column per-test view. - Switch ``compare`` away from a positional ``list[Path]`` argument and parse ``ctx.args`` by hand: typer's positional list was greedily grabbing trailing ``--group-by=fullname`` etc. (and the ``--`` separator didn't escape it either). Now arg-splitting is explicit: anything starting with ``-`` is pytest-benchmark pass-through, everything else is a snapshot path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tried switching to the canonical typer pattern (``--`` separator for pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args`` still greedily ate the trailing options. There's no clean typer/click idiom for "list-typed positional + pass-through" — workarounds are manual splitting, bounding the positional count, or named flags. Manual splitting is the most pragmatic: snapshots come first, once we see any flag-like token the rest is forwarded to pytest-benchmark. That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``, and the value-taking flags whose value doesn't start with ``-``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
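
A sketch of the manual split described above: snapshot paths first, then everything from the first flag-like token onward is pass-through. The function name is hypothetical, and dropping a leading bare ``--`` separator is an assumption of this sketch, not something the commit states.

```python
def split_compare_args(args: list) -> tuple:
    """Split CLI tokens into (snapshot paths, pass-through options)."""
    for index, token in enumerate(args):
        if token.startswith("-"):
            snapshots, passthrough = list(args[:index]), list(args[index:])
            # Assumption: a bare "--" separator is consumed, not forwarded.
            if passthrough and passthrough[0] == "--":
                passthrough = passthrough[1:]
            return snapshots, passthrough
    return list(args), []


snaps, extra = split_compare_args(["a.json", "b.json", "--group-by=fullname", "--csv", "out.csv"])
print(snaps, extra)
```

Because the split happens once at the first flag, a value such as ``out.csv`` that does not start with ``-`` still travels with its flag instead of being mistaken for a snapshot.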

Three opinionated interactive HTML views over pytest-benchmark JSONs, auto-picked from snapshot count or set explicitly via ``--view``: - **compare** (2 snapshots) — horizontal bar chart of per-test median delta, sorted by magnitude, green→red colormap. The "did this PR regress anything?" picture in one glance, vs pytest-benchmark's 60-individual-SVGs which are useless for that workflow. - **sweep** (3+ snapshots) — heatmap of median ratio relative to the first snapshot, rows = tests, columns = labels. Pairs with the ``sweep`` subcommand. - **scaling** (1 snapshot) — log-log median vs ``n`` for size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by phase. Shows whether linopy's complexity scales as expected. plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is lazy-imported inside ``plot`` so the rest of the suite stays usable without it (with a clear error if a user tries ``plot`` and it's missing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er absolutes - New ``benchmarks/plotting.py`` module owns the three views (``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a ``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports ``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the view functions so importing the module without plotly works. - ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto`` so values render inside each bar / cell. - Hover info upgraded: - compare hover shows the per-test median of *both* snapshots (formatted to 4 significant figures) in addition to the delta %. - sweep hover shows the absolute median (s) alongside the ratio, via a customdata + hovertemplate plumbed through ``update_traces``. scaling view already shows the absolute median on hover by virtue of being a line chart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ride

For microbenchmarks the lowest observed time is closest to the "true" cost: background noise (GC, context switches, thermal throttling) can only slow things down. pytest-benchmark's own ``--sort`` default is ``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark and Alexandrescu's "Speed is found in the minds of people" all argue similarly.

Changes:
- ``plot`` defaults to ``--metric min`` (was median). Accepts ``--metric median|mean|max`` to override. The metric drives the bar values, heatmap ratios, scaling-curve y-axis, and the hover labels.
- ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg.
- ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was median,iqr / name). The auto-ratios next to each ``min`` flag regressions in the same readable form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For a suite where test costs span six orders of magnitude (knapsack microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta overweights cheap tests: a 100% regression on a 1µs test ranks above a 1% regression on a 2.4 s test, but the absolute impacts are 1µs vs 24ms.

Changes:
- Default sort is now ``absolute`` (``b - a`` in seconds). Bar values are the time delta with SI-prefix formatting on the x-axis (24 ms, 240 µs, etc.). Big actual-time impacts float to the bottom.
- ``--sort relative`` keeps the old percent behaviour.
- Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless of which one drives the sort, so you can read off whichever lens.
- ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform signature but ignore it (no two-snapshot diff there).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
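
The ranking flip is easy to reproduce with the numbers from this commit message: a 1 µs test that doubles and a 2.4 s test that slows by 1%. The test names and the deltas helper below are hypothetical; only the arithmetic is the point.

```python
def deltas(baseline_s: float, candidate_s: float) -> tuple:
    """Return (absolute delta in seconds, relative delta in percent)."""
    delta_abs = candidate_s - baseline_s
    delta_pct = 100.0 * delta_abs / baseline_s
    return delta_abs, delta_pct


# (baseline, candidate) timings in seconds; names are illustrative.
tests = {
    "knapsack_build": (1e-6, 2e-6),  # +100 %, +1 µs
    "pypsa_carbon": (2.4, 2.424),    # +1 %,  +24 ms
}

by_relative = sorted(tests, key=lambda t: deltas(*tests[t])[1], reverse=True)
by_absolute = sorted(tests, key=lambda t: deltas(*tests[t])[0], reverse=True)
print("relative:", by_relative)
print("absolute:", by_absolute)
```

Sorting by percent puts the microsecond test first; sorting by seconds puts the PyPSA build first, which is the ordering the new default aims for.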

The compare bar chart forces a choice between sorting by relative % or absolute Δ. Both have blind spots: pure-relative makes microbenchmark noise look catastrophic, pure-absolute hides real algorithmic regressions on fast paths. The two-axis scatter resolves the tension visually.

Per test:
- x = baseline time (log scale)
- y = candidate / baseline ratio
- colour = absolute Δ

A point is a real regression worth chasing only when it sits in the top-right: slow tests that got slower. Top-left (high ratio, tiny absolute) reads as microbenchmark noise; bottom-right (high absolute, ratio ≈ 1) was already slow and didn't change. A dashed reference line at ``y=1`` makes "no change" trivial to see.

The scatter view is never auto-picked (compare still wins for 2 snapshots); pass ``--view scatter`` explicitly to get it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The two-axis scatter now scales beyond a single baseline-vs-candidate pair. With N>=3 inputs the first is still the baseline (reference); each subsequent snapshot becomes one animation frame. Use the slider / play button to scrub through versions and watch tests drift across releases. Implementation: - First snapshot is the baseline. Skipped from the frame set (would trivially be y=1 everywhere). - Each subsequent snapshot contributes points at (baseline_time, ratio, Δ) per overlapping test. ``animation_frame="version"`` does the per-frame slicing; ``category_orders`` preserves input order in the slider so the timeline reads left→right. - ``range_x`` / ``range_y`` are pinned to the global min/max so the camera doesn't jump between frames. - 2 inputs still produces a static scatter (no animation overhead). Considered ``facet_col`` but it gets cramped past ~4 versions — the slider scales to arbitrary length. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… |Δ| Two small but high-value tweaks for the multi-snapshot scatter: - The baseline snapshot now contributes its own animation frame where every point sits at ratio=1, Δ=0. Gives the animation a "before anything happened" anchor: hit play and watch points drift from the baseline horizon outward. Previously the first frame was the first candidate, which made the visual feel as if it started mid-story. - ``range_color`` is pinned to the 95th-percentile absolute Δ (±p95). One huge outlier no longer drags the colour scale and flattens everyone else to white; outliers saturate at the bound, the rest of the distribution stays readable. Colour-bar label notes ``Δ (s, p95-clipped)`` so the convention is explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "no change" line sits at y=1, but with asymmetric data (e.g. some 2x regression, no symmetric speedup), it landed near the bottom of the visible range and improvements got squeezed near the floor. Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 - max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the window is symmetric around 1.0, so regressions above and improvements below are equally readable regardless of the data skew. The colour scale keeps the p95-clipped centred-at-0 treatment from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
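
The symmetric window follows directly from the two formulas in this commit message. A minimal sketch with a hypothetical function name, taking the plotted ratios as a plain list:

```python
def symmetric_range_y(ratios: list) -> list:
    """Y-range centred on 1.0 that still covers the full min/max of the data."""
    y_lo, y_hi = min(ratios), max(ratios)
    max_dist = max(abs(1 - y_lo), abs(y_hi - 1))
    return [1 - max_dist, 1 + max_dist]


# One 2x regression and a small speedup: the window becomes [0, 2].
print(symmetric_range_y([0.9, 1.0, 2.0]))
```

The side with the larger excursion sets the half-width, so nothing is clipped and the "no change" line sits in the middle of the plot.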

…h warn Three safe fixes from a code review of benchmarks/plotting.py: - Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with the floor bumped from 400 to 500. At 25+ tests the y-axis labels were colliding; now they breathe. - plot_scaling reads ``params.size`` (the cleanly-stored int from parametrize) and only falls back to the id regex if absent. The ``model`` name still needs the regex because pytest-benchmark serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a full params switch isn't possible here — but the size path is now robust to test-id rename. - plot_compare surfaces the mismatch between snapshots: prints a stderr line with the test counts only in A / only in B / common, and embeds the same as a subtitle in the figure. Silent intersection was the worst-case footgun. Skipped (per review note): the default-view swap for 3+ snapshots (sweep → scatter) is a judgement call left for the user. Default output filename change (clobber on each run) also skipped — they want to decide whether per-view filenames are worth the API change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
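
The "params first, id regex as fallback" lookup for plot_scaling might be sketched like this. It assumes each benchmark entry is a dict with a ``name`` and an optional ``params`` mapping; the helper name and the exact regex are hypothetical.

```python
import re
from typing import Optional

# Fallback: recover the size from a test id such as "test_build[basic-n=500]".
_ID_SIZE = re.compile(r"-n=(\d+)\]")


def size_of(bench: dict) -> Optional[int]:
    """Prefer the stored integer parameter; fall back to parsing the test id."""
    size = (bench.get("params") or {}).get("size")
    if isinstance(size, int):
        return size
    match = _ID_SIZE.search(bench.get("name", ""))
    return int(match.group(1)) if match else None


entry = {
    "name": "test_build[basic-n=500]",
    "params": {"size": 500, "model": "UNSERIALIZABLE[ModelSpec(...)]"},
}
print(size_of(entry))
```

Reading the stored integer first means a later rename of the test-id grammar only affects snapshots that lack the parameter.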

…view>.html Two coupled changes setting up the notebook-embedding path: - ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure`` instead of writing to disk + returning a count. The CLI does the ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)`` is exported as a helper so the CLI still emits a "N points → path" status line. This unblocks rendering plots directly in jupyter — call ``plot_compare(...)`` and Jupyter's display hook renders the Figure inline. - Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html`` (was ``benchmark-plot.html`` in cwd). Matches where snapshots already land (and is gitignored), and the per-view filename means consecutive runs of different views don't clobber each other. Bonus: two ``numpy_array or fallback`` bugs in scatter (``df.abs().max() or 1e-9``) and the new ``n_points`` helper (``trace.x or trace.z``) — both triggered ``ValueError: The truth value of an array with more than one element is ambiguous``. Replaced with explicit ``is None`` checks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ospection The ``n_points(fig)`` helper added in the previous commit walked ``fig.data`` traces and called ``len(trace.x)`` to recover the test count. That's backwards — the count is sitting right there in the source DataFrame at render time, no need to reach into the rendered plot. Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for compare / sweep / scaling; ``df["test"].nunique()`` for scatter (rows are per-(test, version) so the raw len double-counts). n_points helper dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two refinements to the end-to-end plotting section: - tqdm wraps the subprocess loop that generates the two snapshots. Each ``--quick --phase build`` run takes ~10 s; tqdm makes the ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget vs console bar based on context. - Plots are now rendered via ``python -m benchmarks plot --view <name>`` rather than direct ``plot_compare`` / ``plot_scatter`` imports. A small ``cli_plot(view, snapshots)`` helper runs the subprocess, reads the generated HTML, and inlines it via ``IPython.display.HTML``. Demonstrates the actual user-facing CLI path inside the notebook rather than the internal API. Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting overhead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

FBumann · 2026-06-03T12:49:59Z

@FabianHofmann Do you think adding something like CodSpeed to the codebase would be a nice addition? I'm not sure, but we might just try it out...

FabianHofmann · 2026-06-03T12:51:52Z

> @FabianHofmann Do you think adding something like CodSpeed to the codebase would be a nice addition? I'm not sure, but we might just try it out...

Trying it out sounds good. @lkstrp what do you think? I don't know whether we should enable it for the whole organization or only for linopy.

FBumann · 2026-06-04T18:01:54Z

I'm looking into sparsity a bit more. I need this benchmarking suite to reliably improve memory usage while avoiding regressions.

Patterns are fragments of realistic modelling code (a nodal-balance groupby-sum, a KVL sparse contraction) that reproduce the dense-`_term` materialisation hot spots — measured the same way models are (time + peak memory, through the same phases), but parametrised by `severity` (0-100, "how pathological the data shape is") instead of `size`. Unify models and patterns under one `BenchSpec` contract: - `ModelSpec` (axis "n", sweeps size) and `PatternSpec` (axis "severity", sweeps severity) both build a `linopy.Model` and expose `sweep`/`axis`/`description`; harness reads the contract, not the type. - One generic `iter_params`/`param_ids` over `all_specs()` (models + patterns); test-id grammar generalised to `-<axis>=<int>` with an `axis` column carried through snapshots and plots. `severity` is an int so it rides the same grammar/Int64/plot machinery as size. - Patterns ride the existing phase drivers (build/matrices/lp_write/ netcdf/solver) — no dedicated driver — so the build-vs-export contrast (does the bloat reach the matrix/LP file or collapse?) falls out. `memory.run_phase` iterates all specs through one path. - `--quick` keeps severities up to the midpoint so smoke exercises real pathology, not just the benign endpoint. Two patterns to start: `nodal_balance` (#745 groupby padding, a live cliff in peak memory) and `kvl_cycles` (#748 sparse `@` densification, flat today — a sparse-aware kernel bends it down). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- memory: add an opt-in `pipeline` phase (build→matrices→lp_write in one tracker) reporting the end-to-end peak / OOM ceiling — the number that can't be recovered from the isolated per-phase marginals. Kept out of the default set (it re-runs those phases) and requested via `--phase pipeline`; CLI validates against a new `ALL_MEMORY_PHASES`. - cli: `list` now shows models and patterns with a `--kind {all,models, patterns}` selector (default both); `list --details` gets a patterns table; `show <name>` resolves patterns too. Closes the discoverability gap left by patterns sharing the phase drivers (no dedicated test file). - docs: walkthrough gains a "Patterns" section (how to run them via the shared drivers + `-k severity`) and a note on the `pipeline` ceiling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lper Trim the spec-selection surface: `--model`, `--kind`, and the proposed `--spec` were all "which specs" flags overlapping with `--filter`/`-k`. Collapse to two composable selectors on the run/measure commands — `--phase` (stage) and `--filter`/`-k` (specs by name or id substring: `nodal_balance` one spec, `severity` patterns, `n=` models): - run/sweep: drop `--model` (== `--filter <name>`). - memory save: replace `--kind` with `--filter` (substring on the `<name>-<axis>=<value>` key; also gains single-spec selection, which memory previously couldn't do). - list: keeps `--kind {all,models,patterns}` — it filters names, where the axis tag doesn't appear, so a substring filter can't select a kind. Also factor the test-id fragment `f"{name}-{axis}={value}"` into `snapshot.spec_param_id()` — the one source of truth shared by `param_ids`, the memory grid ids, the solver-handoff ids, the memory `--filter` key, and `synth_test_id`, kept in lock-step with the parse regex. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
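
Keeping the id formatter and the parse regex side by side makes the "lock-step" property easy to check with a round trip. The f-string is the one quoted in this commit message; the regex, the parse_param_id helper and the bracketed id layout are assumptions for illustration.

```python
import re


def spec_param_id(name: str, axis: str, value: int) -> str:
    """Single source of truth for the '<name>-<axis>=<value>' fragment."""
    return f"{name}-{axis}={value}"


# Must accept exactly what spec_param_id emits (inside pytest's [...] id).
_PARAM_ID = re.compile(r"\[(?P<name>\w+)-(?P<axis>[a-z_]+)=(?P<value>\d+)\]")


def parse_param_id(test_id: str) -> tuple:
    """Inverse of spec_param_id for ids like 'test_build[basic-n=500]'."""
    match = _PARAM_ID.search(test_id)
    if match is None:
        raise ValueError(f"not a spec id: {test_id!r}")
    return match["name"], match["axis"], int(match["value"])


test_id = f"test_build[{spec_param_id('nodal_balance', 'severity', 100)}]"
print(parse_param_id(test_id))
```

The same substring also explains the selector examples above: filtering on ``severity`` matches only pattern ids, and filtering on ``n=`` matches only model ids.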

…cy guard

Running mypy across the whole suite (CI excludes benchmarks/, so this is a manual check) surfaced 9 errors, all from two roots:

- BenchSpec declared settable members, but ModelSpec/PatternSpec are frozen dataclasses (read-only attrs), so they didn't structurally satisfy the Protocol, which cascaded through every function typed on BenchSpec (all_specs, maybe_skip, _measurements). Declare the Protocol members as read-only @property to match. Clears 8 of 9.
- synth_test_id passed `model: str | None` to spec_param_id (expects str); `all(<bools>)` doesn't narrow. Narrow explicitly. Clears the 9th.

Also fix the `plot` command's plotly guard: `from benchmarks.plotting import RENDERERS` always succeeds (plotly is imported lazily inside each renderer), so a plotly-less user hit a raw ModuleNotFoundError instead of the friendly message. Check `importlib.util.find_spec("plotly")` instead.

Plus refresh two stale `plot` help strings (`-n=<size>` → `-<axis>=<value>`; scaling is now axis-aware, not always log-log).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
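The guard described above can be sketched with the standard library alone. `require_module` and its message are hypothetical, and `SystemExit` stands in for whatever exit mechanism the CLI actually uses.

```python
import importlib.util


def require_module(module: str, hint: str) -> None:
    """Fail early with a readable message if an optional dependency is absent.

    find_spec looks the module up without importing it, so the check works
    even when the real import is deferred to a later code path.
    """
    if importlib.util.find_spec(module) is None:
        raise SystemExit(f"{module} is not installed. {hint}")


# At the top of a hypothetical `plot` command:
# require_module("plotly", "Install the plotting dependencies first.")
```

Probing an unrelated import (such as a module that only imports plotly lazily) says nothing about whether plotly itself is present, which is why the check targets the dependency by name.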

Two density idioms that both the PyPSA and flixopt scans converged on, neither covered by the existing nodal_balance (#745) / kvl_cycles (#748):

- merge_balance (#749): merging sub-expressions of different `_term` widths along a shared dim pads every block to the widest, leaving the narrow ones mostly fill. This is the documented SciGRID build peak. PyPSA `merge(gen+storage+lines, join="outer")`; flixopt bus balance `sum([flows])`. severity dials the widest block (verified _term 3 -> 102 -> 200 across 30 blocks).
- flow_sum: `.sum(dim)` folds the summed dim's whole size into `_term`. PyPSA `(p*w).sum()` CO2/operational limits; flixopt `.sum(['time','cluster'])`. severity dials the folded-dim size (verified _term 2 -> 100 -> 200).

Both ride the shared phase drivers like the other patterns; mypy-clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
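To make the padding cost of a ragged merge concrete, the sketch below counts allocated versus real term slots. It assumes 29 blocks stay at width 3 while one block is widened to 3, 102, or 200, which is one reading of the verified widths above; it models the arithmetic only and does not call linopy.

```python
def padded_fill(widths: list[int]) -> tuple[int, float]:
    """Return (padded width, fraction of allocated slots that are fill).

    Every block is padded to the widest block, so the allocation is
    widest * n_blocks while only sum(widths) slots carry real terms.
    """
    widest = max(widths)
    allocated = widest * len(widths)
    real = sum(widths)
    return widest, 1 - real / allocated


N_BLOCKS = 30
NARROW = 3

for wide in (3, 102, 200):
    widths = [NARROW] * (N_BLOCKS - 1) + [wide]
    term_width, fill = padded_fill(widths)
    print(f"widest={wide:>3}  _term={term_width:>3}  fill={fill:.1%}")
```

With a single block of width 200 among 29 blocks of width 3, more than 95% of the allocated slots are padding under these assumptions.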

…coupling)

Close the intertemporal-coupling op gap, split correctly across the two axes:

- storage MODEL (size): a fleet of SoC recursions `soc[t] - decay*soc[t-1] - eff*charge + discharge/eff == 0` via `soc.shift(time=1)`. The only model exercising the `.shift()`/`.isel()` intertemporal ops (PyPSA SoC, flixopt `charge_state.isel` recursion). It's a model, not a pattern: bidiagonal → ~4 terms/row regardless of horizon or unit count (verified `_term`=4 flat across size), so it scales with size, no benign→worst dial.
- rolling PATTERN (severity): the *windowed* form does have a dial. `status.rolling(K).sum()` (min up/down time, windowed limits) builds K terms per row, so window width is the severity (verified `_term` 1 -> 84 -> 168 across the horizon).

The pair is a clean illustration of the size-vs-severity split: storage's `_term` is flat (size axis), rolling's climbs (severity axis).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
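The contrast between the two can be checked by listing the terms each row would need. This is a plain-Python counting sketch, not linopy code; the coefficient values and function names are hypothetical.

```python
Term = tuple[str, int, float]  # (variable name, time index, coefficient)


def soc_row(t: int, decay: float = 0.99, eff: float = 0.9) -> list[Term]:
    """Terms of soc[t] - decay*soc[t-1] - eff*charge[t] + discharge[t]/eff."""
    terms: list[Term] = [
        ("soc", t, 1.0),
        ("charge", t, -eff),
        ("discharge", t, 1.0 / eff),
    ]
    if t > 0:  # the first step has no predecessor
        terms.append(("soc", t - 1, -decay))
    return terms


def rolling_row(t: int, window: int) -> list[Term]:
    """Terms of status.rolling(window).sum() at step t (trailing window)."""
    start = max(0, t - window + 1)
    return [("status", s, 1.0) for s in range(start, t + 1)]


def max_terms(rows: list[list[Term]]) -> int:
    """Width a dense term rectangle would need to hold every row."""
    return max(len(row) for row in rows)


HORIZON = 168
soc_width = max_terms([soc_row(t) for t in range(HORIZON)])
rolling_widths = {
    k: max_terms([rolling_row(t, k) for t in range(HORIZON)]) for k in (1, 84, 168)
}
```

The recursion needs four terms per row however long the horizon is, while the rolling sum needs as many terms as the window is wide, which is why only the latter has a severity dial.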

`cumsum` folds a growing window into `_term` (running totals — cumulative energy, a rolling budget): `(1*x).cumsum("t")`, verified `_term` 2 -> 100 -> 200 (triangular). It's benchmarked as its own op rather than folded into `rolling` even though linopy currently implements `cumsum` via `rolling(window=full_dim)` (expressions.py): that delegation is an implementation coincidence, not a contract, and `cumsum` is a natural de-densification target (a prefix sum need not materialise the triangle). So it plays the same instrument role as `kvl_cycles` — flat/redundant today, but the thing that would show a dedicated cumsum kernel land. Benchmark the public op, not its wiring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
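A short count of what the triangle costs, assuming a horizon of $T$ steps, a dense term rectangle of width $T$, and row $t$ carrying $t$ real terms (an idealisation of the behaviour described above, not a measurement):

```latex
\begin{aligned}
\text{allocated slots} &= T \cdot T = T^{2}, \\
\text{real terms} &= \sum_{t=1}^{T} t = \frac{T(T+1)}{2}, \\
\text{fill fraction} &= 1 - \frac{T(T+1)/2}{T^{2}} = \frac{T-1}{2T}
  \;\xrightarrow{\,T \to \infty\,}\; \frac{1}{2}.
\end{aligned}
```

For $T = 200$ this gives a fill fraction of $199/400 \approx 0.50$, and both the rectangle and the triangle grow quadratically in $T$.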

The `benchmarks/` suite was excluded from `mypy .`, so its type-cleanliness was only enforced on a manual run. Activate it. The catch: the old `'benchmark/*'` pattern is a regex whose `/*` (zero-or-more slash) matches the `benchmark` prefix of `benchmarks/` too — so simply dropping `'benchmarks/*'` wouldn't have worked. Replace both with `'^benchmark/'`, which requires the slash and so excludes only the legacy singular `benchmark/` (not mypy-clean) while checking `benchmarks/`. Verified `mypy .` is clean (101 files) with benchmarks/ included and benchmark/ still skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
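The regex pitfall is easy to reproduce with Python's `re` module. The sketch assumes the exclude pattern is applied with search semantics against a relative path; the file names are made up.

```python
import re


def is_excluded(pattern: str, path: str) -> bool:
    """True if the exclude regex matches anywhere in the path."""
    return re.search(pattern, path) is not None


OLD = "benchmark/*"   # `/*` means zero or more slashes, so the slash is optional
NEW = "^benchmark/"   # anchored, and the slash is required

for path in ("benchmark/old_script.py", "benchmarks/cli/run.py"):
    print(f"{path:<26} old={is_excluded(OLD, path)!s:<5} new={is_excluded(NEW, path)}")
```

The old pattern matches the `benchmark` prefix of `benchmarks/`, so both directories were skipped; the anchored pattern skips only the singular `benchmark/`.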

- pypsa: skip on environmental example-data fetch failure (network / example-API drift) in build_pypsa_scigrid and the carbon fixture, but leave n.optimize.create_model() unguarded so a genuine linopy regression surfaces rather than being swallowed into a skip.
- CodSpeed job: continue-on-error. It's red until CODSPEED_TOKEN is set on the org, and a perpetually-failing missing-secret check shouldn't block every PR.
- Trim verbose comments in the [benchmarks] extra.

Dep-group migration deliberately NOT done here: moving dev/docs/benchmarks from extras to PEP 735 [dependency-groups] is a project-wide change (they're all extras today) and belongs in its own focused PR, not the benchmark PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
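The skip-versus-fail split can be sketched as a narrow `try` around the fetch only. The helper names and the exception set are assumptions, and `unittest.SkipTest` is used here as a standard-library stand-in for the test framework's skip call.

```python
import unittest
from typing import Callable, TypeVar

Network = TypeVar("Network")
Model = TypeVar("Model")


def build_or_skip(
    fetch: Callable[[], Network],
    create_model: Callable[[Network], Model],
) -> Model:
    """Skip when example data cannot be fetched; never hide a build failure."""
    try:
        network = fetch()
    except OSError as exc:  # network errors; assumed set of "environmental" failures
        raise unittest.SkipTest(f"example data unavailable: {exc}") from exc
    # Deliberately outside the try block: an exception here is a real
    # regression and must fail the benchmark instead of being skipped.
    return create_model(network)
```

Keeping the guarded region to the single fetch call is the design point: widening it to cover model creation would turn genuine failures into silent skips.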

…at/benchmark-patterns

The 933-line cli.py was a god-file whose ``# --- section ---`` banners signalled it wanted to be separate modules. Split one module per command group, all registering onto a shared ``app`` so the flat command surface and ``--help`` order stay identical:

    _base.py       app/memory_app, shared types + helpers
    introspect.py  list / show / filter
    run.py         smoke / run / notebook
    sweep.py       sweep
    compare.py     compare
    plot.py        plot
    memory.py      memory save / sweep / compare

Also tidy two over-narrated spots:

- sweep.py: drop a stale comment that claimed ``cwd`` was pinned to the repo root when the code runs ``cwd=import_dir``; trim the isolation blocks the function docstring already covers in full.
- plotting.py: tighten the facet-layout comment.

CLI surface, ruff, and mypy all verified unchanged.

Move the three suite-internal unit tests (test_bench, test_sweep, test_memory_id_alignment) out of the benchmarks/ root into a dedicated benchmarks/_tests/ subdirectory, separating them from the benchmark driver modules. Fix the repo-root computation in test_memory_id_alignment (parents[1] -> parents[2]) now that the file sits one level deeper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Condense the prose comment blocks in benchmark-smoke.yml, .gitignore, and the [benchmarks] extra to one or two lines each — keep the why, drop the restatement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…dial) flow_sum's .sum(dim) fold produces all-real terms with zero fill, so its severity is just a sub-dim size knob with no sparsity headroom — it measures nothing the size axis doesn't. Keep cumsum, whose triangular fold is a genuine de-densification target. PATTERNS is now 5 specs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FBumann · 2026-06-05T08:13:31Z

@FabianHofmann I'm preparing this to cover different sparsity patterns, so that sparsity improvements become traceable.

…e floor

Separate each pattern's dimensions into the op's own axis (carries the severity cliff) and the broadcast/volume axis (sets amplitude only, shape-preserving). Shrink volume dims that were inflating cost on an axis the op never touches, and size each pattern so its sev-100 peak lands in a comfortably measurable band rather than near the memray noise floor:

    rolling  unit 200->8, time 168->1000  0.8 -> 138 MiB (linear)
    cumsum   row 200->64                  0   -> 45 MiB  (quadratic)
    nodal    time 24->8                   1.2 -> 32 MiB  (linear)
    merge    row 200->128                 0.8 -> 24 MiB  (linear)
    kvl      time 3->168                  126 MiB flat

kvl stays flat across severity by design (today's @ densifies to n_branch regardless of C sparsity); time=168 lifts the flat level so the always-paid densification is visible: the headroom a sparse kernel would reclaim, at a realistic weekly horizon.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The y-axis read "min" (looks like minutes) and the x-axis read a bare "severity". Label the y-axis "<unit> (<metric>)" with metric spelled out (min->minimum, max->maximum), and drive the x-axis label + log-scale choice from a single _AXIS_DISPLAY table keyed by axis name, replacing the scattered `x_label == "n"` checks. A third axis is now one row, not a new branch; the mixed-axis case falls back to a string, so the key type stays plain str. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
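A table-driven version of that labelling might look like the sketch below. Only the `_AXIS_DISPLAY` name and the min/max spelling come from the commit message; the label strings, the choice of which axis is log-scaled, and the mixed-axis fallback text are assumptions.

```python
# axis name -> (x-axis label, use log scale); values are illustrative
_AXIS_DISPLAY: dict[str, tuple[str, bool]] = {
    "n": ("size (n)", True),
    "severity": ("severity (0-100)", False),
}

_METRIC_NAMES = {"min": "minimum", "max": "maximum"}


def x_axis(axes: set[str]) -> tuple[str, bool]:
    """Label and log-scale flag for the x-axis of a scaling plot."""
    if len(axes) == 1:
        (axis,) = axes
        # Unknown axes fall back to their bare name on a linear scale.
        return _AXIS_DISPLAY.get(axis, (axis, False))
    return "value", False  # mixed axes share one generic linear axis


def y_axis(unit: str, metric: str) -> str:
    """Spell the metric out so "min" is not read as minutes."""
    return f"{unit} ({_METRIC_NAMES.get(metric, metric)})"
```

Supporting a third axis then means adding one row to `_AXIS_DISPLAY`; neither function needs a new branch.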

FBumann · 2026-06-05T10:54:00Z

@FabianHofmann I think I will just merge this on my fork and try out CodSpeed there for now.

Patterns aren't models, so the shared spec-name field/column was mis-named. Rename it to "spec" throughout: the load_long_df / bench column, the `--facets spec` CLI option (FacetBy), the `to_snapshot(spec=)` / `synth_test_id(spec=)` kwargs, and docs/tests. The scaling-plot legend now reads "spec" rather than "model". Purely internal: test-id strings are unchanged, so existing snapshots still parse — only the in-memory column and the kwarg/option names move. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

The cachegrind CodSpeed job shared benchmark-smoke.yml's trigger, so it ran on every PR despite being ~10–20× slower and only useful as a master-to-master baseline comparison. Move it to codspeed.yml triggered on push-to-master + workflow_dispatch (manual), and leave the cheap smoke job (--benchmark-disable) running on every PR as before. Regressions now surface as master-to-master deltas; ad-hoc branch checks go through the manual "Run workflow" button. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FBumann · 2026-06-05T13:21:19Z

@FabianHofmann The genuine outcome can be seen on a PR where I try to improve memory issues (not ready at all yet):
fluxopt#25

FBumann and others added 30 commits May 27, 2026 23:04

docs: update benchmark readme

8c908af

benchmarks: compare lists snapshots as relative paths (easier to copy…

83bdeda

…-paste) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

FBumann requested a review from FabianHofmann June 3, 2026 12:50

FBumann and others added 15 commits June 4, 2026 22:11

Merge branch 'master' into benchmark-suite-charter

827a947

Merge remote-tracking branch 'origin/benchmark-suite-charter' into fe…

919e766

…at/benchmark-patterns

FBumann marked this pull request as draft June 5, 2026 08:12

FBumann marked this pull request as ready for review June 5, 2026 10:39

FBumann and others added 3 commits June 5, 2026 13:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

ac3a6d1

for more information, see https://pre-commit.ci

FBumann changed the title ~~benchmarks: reusable registry, new model types, new phases, CI smoke~~ benchmarks: reusable registry, with models, patterns and phases, CI smoke and Codspeed Jun 5, 2026

FBumann wants to merge 88 commits into master from benchmark-suite-charter
