test: add summary-cache pytest-benchmark suite with CI regression gate#120
test: add summary-cache pytest-benchmark suite with CI regression gate#120clean6378-max-it wants to merge 4 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 8 minutes and 33 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a summary-cache pytest-benchmark suite, records baseline timings, adds a regression-checking script, and wires the benchmarks into CI with artifact upload and ignore rules. ChangesSummary Cache Benchmark Validation
Sequence Diagram(s)sequenceDiagram
participant BenchmarksJob as benchmarks job
participant Pytest as pytest
participant RegressionScript as scripts/check_benchmark_regression.py
participant ArtifactUpload as actions/upload-artifact
BenchmarksJob->>Pytest: run tests/benchmarks/ with --benchmark-only
Pytest-->>BenchmarksJob: benchmark-results.json
BenchmarksJob->>RegressionScript: compare benchmark-results.json to benchmarks/baselines.json
RegressionScript-->>BenchmarksJob: exit code 0, 1, or 2
BenchmarksJob->>ArtifactUpload: upload benchmark-results.json with if: always()
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/check_benchmark_regression.py (1)
95-99: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winA missing benchmark result silently passes the gate.
When a baseline benchmark has no current result, the script only prints a
WARNand continues. If a gated benchmark is renamed, removed, or fails to run, the regression gate will report success and the performance coverage silently disappears. Consider treating a missing gated benchmark as a failure (or a separate exit code) so coverage gaps are surfaced.♻️ Possible approach
failures: list[str] = [] + missing: list[str] = [] for name, base in baseline_means.items(): cur = flat.get(name) if cur is None: - print(f"WARN: no current result for baseline {name!r}; skipping") + print(f"MISSING: no current result for baseline {name!r}") + missing.append(name) continueThen fail (or warn-as-error) when
missingis non-empty, per your policy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` around lines 95 - 99, The benchmark gating in check_benchmark_regression.py currently skips missing current results in the baseline loop, which can let renamed, removed, or failing benchmarks pass unnoticed. Update the logic around the baseline_means/flat comparison so missing gated benchmarks are collected and treated as a failure (or a distinct non-zero exit condition) instead of only printing a warning. Make the behavior explicit in the final exit path of the script so coverage gaps are surfaced alongside the existing regression checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/baselines.json`:
- Around line 1-15: The benchmark baselines in the summary-cache group are still
using Windows-derived values with 1.5x slack, but the regression gate runs on
ubuntu-latest. Refresh the values from a green ubuntu-latest benchmark artifact
and keep the _note consistent with the source platform; if platform-matched
baselines are not yet available, make the gate non-blocking until they are.
Focus on the baselines.json entries under groups.summary-cache so the
check_benchmark_regression.py comparisons are against the same OS.
In `@pyproject.toml`:
- Around line 39-44: The pytest configuration currently applies --benchmark-skip
globally in [tool.pytest.ini_options], which breaks the regular integration
pytest run because pytest-benchmark is not installed there. Update the test
setup so the benchmark plugin is available wherever this addopts setting is
used, or move the --benchmark-skip option to only the benchmark-specific job;
check the pytest configuration and the workflows that invoke python -m pytest,
especially the benchmark-related markers and options.
---
Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 95-99: The benchmark gating in check_benchmark_regression.py
currently skips missing current results in the baseline loop, which can let
renamed, removed, or failing benchmarks pass unnoticed. Update the logic around
the baseline_means/flat comparison so missing gated benchmarks are collected and
treated as a failure (or a distinct non-zero exit condition) instead of only
printing a warning. Make the behavior explicit in the final exit path of the
script so coverage gaps are surfaced alongside the existing regression checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1cc8514e-6127-42e8-95a6-134382f2ed25
📒 Files selected for processing (7)
.github/workflows/tests.yml.gitignorebenchmarks/baselines.jsonpyproject.tomlscripts/check_benchmark_regression.pytests/benchmarks/conftest.pytests/benchmarks/test_summary_cache_bench.py
|
@CodeRabbit, review this PR in more depth |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== .github/workflows/tests.yml ==="
cat -n .github/workflows/tests.yml
echo ""
echo "=== pyproject.toml ==="
cat -n pyproject.toml
echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat -n scripts/check_benchmark_regression.py
echo ""
echo "=== tests/benchmarks/conftest.py ==="
cat -n tests/benchmarks/conftest.py
echo ""
echo "=== tests/benchmarks/test_summary_cache_bench.py ==="
cat -n tests/benchmarks/test_summary_cache_bench.py
echo ""
echo "=== benchmarks/baselines.json ==="
cat -n benchmarks/baselines.json
echo ""
echo "=== .gitignore (last 20 lines) ==="
tail -20 .gitignore
echo ""
echo "=== requirements-lock.txt ==="
cat -n requirements-lock.txt
echo ""
echo "=== services/summary_cache.py ==="
cat -n services/summary_cache.py
echo ""
echo "=== Directory tree of tests/benchmarks/ ==="
tree tests/benchmarks/ 2>/dev/null || find tests/benchmarks/ -type f
echo ""
echo "=== Directory tree of scripts/ ==="
tree scripts/ 2>/dev/null || find scripts/ -type f
echo ""
echo "=== benchmarks/ directory ==="
tree benchmarks/ 2>/dev/null || find benchmarks/ -type fLength of output: 44058
Deep Review: PR
|
| Severity | Issue | File |
|---|---|---|
| 🔴 Critical | Baseline name format may mismatch pytest-benchmark output | baselines.json, check_benchmark_regression.py |
| 🔴 Critical | Baselines are future-dated and appear manually fabricated | baselines.json |
| 🔴 Critical | benchmarks job has no needs: [unittest] |
tests.yml |
| 🟡 Medium | Missing addopts = "--benchmark-disable" |
pyproject.toml |
| 🟡 Medium | Fragile sys.path manipulation |
conftest.py |
| 🟡 Medium | pytest-benchmark not pinned in lock file |
tests.yml, requirements-lock.txt |
| 🟡 Medium | stale_fingerprint relies on implicit None assumption |
conftest.py |
| 🔵 Minor | Missing __init__.py in tests/benchmarks/ |
— |
| 🔵 Minor | Silent skip of non-dict group values | check_benchmark_regression.py |
| 🔵 Minor | Redundant str() cast |
check_benchmark_regression.py |
| 🔵 Minor | Missing blank line after imports | test_summary_cache_bench.py |
| 🔵 Minor | No unit tests for the regression gate script | — |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 17-19: The normalize_benchmark_name helper currently uses
rsplit("::", 1), which can incorrectly strip parts of parametrized benchmark
names when parameter values include "::". Update normalize_benchmark_name to
split on the first "::" only so pytest node prefixes are removed without
truncating names like test_x[param::v], and keep the behavior consistent for
both short and full benchmark names.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c745c7f7-791c-4802-9b4a-d7dc19639498
📒 Files selected for processing (8)
.github/workflows/tests.ymlbenchmarks/baselines.jsonpyproject.tomlscripts/check_benchmark_regression.pytests/benchmarks/__init__.pytests/benchmarks/conftest.pytests/benchmarks/test_summary_cache_bench.pytests/test_check_benchmark_regression.py
✅ Files skipped from review due to trivial changes (1)
- benchmarks/baselines.json
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/benchmarks/test_summary_cache_bench.py
- .github/workflows/tests.yml
| # `pytest tests/` — to avoid re-collecting unittest.TestCase classes above. | ||
| run: python -m pytest tests/test_api_search.py tests/test_api_workspaces.py tests/test_api_export.py tests/test_pdf_export.py tests/test_search_helpers.py -v --tb=short | ||
| # -o addopts= avoids inheriting benchmark-only options from pyproject.toml. | ||
| run: python -m pytest tests/test_api_search.py tests/test_api_workspaces.py tests/test_api_export.py tests/test_pdf_export.py tests/test_search_helpers.py -v --tb=short -o addopts= |
There was a problem hiding this comment.
tests/test_check_benchmark_regression.py (15 tests for the regression gate) is not run in CI. unittest discover won't collect pytest functions, and the integration pytest step lists only API/search files. Add this file to the pytest step or run pytest tests/test_check_benchmark_regression.py -o addopts= so gate logic is covered on every matrix run.
| # colorama via click) are omitted — pip still installs them on Windows when needed. | ||
| blinker==1.9.0 # via flask | ||
| click==8.4.1 # via flask | ||
| click==8.4.2 # via flask |
There was a problem hiding this comment.
Unrelated click==8.4.1 → 8.4.2 bump is out of scope for a benchmarks-only PR. Revert unless there's a documented dependency reason; otherwise it adds review nois
|
|
||
| [tool.pytest.ini_options] | ||
| pythonpath = ["."] | ||
| addopts = "--benchmark-disable" |
There was a problem hiding this comment.
PR description says --benchmark-skip but config uses --benchmark-disable. With disable, benchmark tests still run once on a bare pytest (functional smoke, ~0.4s). Either align the docs or switch to --benchmark-skip if the goal is zero benchmark execution during normal pytest.
| [tool.pytest.ini_options] | ||
| pythonpath = ["."] | ||
| addopts = "--benchmark-disable" | ||
| testpaths = ["tests"] |
There was a problem hiding this comment.
New [tool.pytest.ini_options] with addopts = "--benchmark-disable" requires pytest-benchmark installed for any default pytest invocation. CI handles this; document in PR/issue that devs need pip install -e ".[dev]" (or pytest-benchmark) before running pytest without -o addopts=.
| @pytest.mark.benchmark(group="summary-cache") | ||
| def test_summary_cache_miss( | ||
| benchmark, | ||
| summary_cache_dir: Path, | ||
| workspace_fingerprint: dict[str, Any], | ||
| stale_fingerprint: dict[str, Any], | ||
| sample_projects: list[dict[str, Any]], | ||
| ) -> None: | ||
| set_cached_projects(workspace_fingerprint, sample_projects, []) | ||
| benchmark(get_cached_projects, stale_fingerprint) |
There was a problem hiding this comment.
test_summary_cache_hit and test_summary_cache_miss measure nearly identical paths (~63µs on CI); both baselines are 6.3e-05. Consider merging into one "cache lookup" benchmark with hit/miss as subcases, or document that miss means fingerprint-mismatch only (not rebuild cost).
| @pytest.fixture | ||
| def summary_cache_dir(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path: | ||
| """Redirect summary-cache files to an isolated temp directory. | ||
|
|
||
| Patches ``CACHE_DIR`` (also used by tab-summary paths via ``_tab_summaries_path``) | ||
| plus the projects/composer-map file constants used by current benchmarks. | ||
| """ | ||
| cache_dir = tmp_path / "cache" | ||
| cache_dir.mkdir() | ||
| monkeypatch.setattr(summary_cache, "CACHE_DIR", cache_dir) | ||
| monkeypatch.setattr(summary_cache, "PROJECTS_CACHE_FILE", cache_dir / "projects.json") | ||
| monkeypatch.setattr( | ||
| summary_cache, | ||
| "COMPOSER_MAP_CACHE_FILE", | ||
| cache_dir / "composer-id-to-ws.json", | ||
| ) |
There was a problem hiding this comment.
summary_cache_dir patches PROJECTS_CACHE_FILE and COMPOSER_MAP_CACHE_FILE but not tab-summary paths (_tab_summaries_path uses CACHE_DIR prefix only). Fine for current scope; add a comment that tab-summary benchmarks belong in issue #8's unified suite.
| { | ||
| "_note": "Gated means from ubuntu-latest CI benchmark-results.json (PR #120, run 28123677675). Refresh: pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmark-results.json -o addopts=", |
There was a problem hiding this comment.
_note refresh command is truncated (-o addopts= with no value). Complete the example: -o addopts= --benchmark-json=benchmark-results.json.
| - name: Run summary-cache benchmarks | ||
| run: > | ||
| python -m pytest tests/benchmarks/ | ||
| --benchmark-only | ||
| --benchmark-json=benchmark-results.json | ||
| --benchmark-columns=min,max,mean,stddev,rounds | ||
| -o addopts= |
There was a problem hiding this comment.
Consider pinning --benchmark-min-rounds (e.g. 5) to reduce CI timing variance near the 20% threshold, especially for sub-millisecond hit/miss benchmarks.
| if base == 0: | ||
| print(f"WARN: baseline for {name!r} is zero; skipping ratio check") | ||
| continue | ||
| ratio = cur / base | ||
| tag = "FAIL" if ratio > threshold else "ok" | ||
| print(f"[{tag}] {name}: {cur:.6f}s vs {base:.6f}s ({ratio:.2f}x)") | ||
| if ratio > threshold: | ||
| failures.append(name) | ||
|
|
There was a problem hiding this comment.
New benchmarks without baselines only WARN; gated missing benchmarks correctly FAIL (fixed in 83bb248). Consider also failing when all current results are faster than baseline by >50% (stale baseline drift), or document a baseline-refresh workflow in benchmarks/baselines.json.
Closes #115
Summary
Adds a
pytest-benchmarksuite forservices/summary_cache.pywith synthetic fixtures and a CI regression gate. No production code changes.tests/benchmarks/test_summary_cache_bench.pymeasures cache hit, cache miss, fingerprint cost (10/50/200 workspace entries), and round-trip write+read latencytests/benchmarks/conftest.pybuilds synthetic workspace trees and isolates cache files to temp dirs (no real Cursor storage)benchmarks/baselines.jsonrecords initial gated means (1.5× slack from local reference run; refresh from ubuntu-latest CI artifact after first green job)scripts/check_benchmark_regression.pyfails CI when any benchmark mean exceeds baseline by >20%pyproject.toml: addpytest-benchmarkdev dep; default--benchmark-skipso normalpyteststays fast.github/workflows/tests.yml: newbenchmarksjob on ubuntu-latestSprint item #7 (June Week 4, Wednesday). Sequences before Thursday #8 (unified perf benchmark suite).
Test plan
pytest tests/benchmarks/test_summary_cache_bench.py --benchmark-only -o addopts=python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.jsonpytest tests/benchmarks/test_summary_cache_bench.py -o addopts=(benchmarks skipped via--benchmark-skip)python -m unittest discover tests -qbenchmarksCI job is green on ubuntu-latestSummary by CodeRabbit