Automate the sound eCPS-replacement eval in CI by MaxGhenis · Pull Request #117 · PolicyEngine/microplex-us

MaxGhenis · 2026-05-30T21:49:40Z

Summary

In-depth review of the QALY decision engine, plus fixes for the correctness, safety, and methodology issues it surfaced. All fixes are test-driven (failing test first, then fix).

Test status: 398 Python tests pass (cd python && uv run pytest); 409 TypeScript tests pass (bun run test); tsc --noEmit clean.

⚠️ Scope note — please read before reviewing. This branch was cut from in-progress work, so the diff also carries pre-existing uncommitted WIP that is not part of this review (notably reference_case.py, large protocol_ground_up.py/stack_interactions.py/sleep.py expansions, and the genetics-module changes). My review fixes depend on that WIP (shared files: catalog.py, simulate.py, web_api.py), so they could not be cleanly rebased onto main without it. The commits below are the review work; everything in the first "checkpoint" commit is the inherited WIP. Happy to split if you'd rather land the WIP separately first.

Review fixes in this PR (each TDD'd, own commit)

Engine / methodology

Pathway attenuation fixed — the hardcoded 1.3/0.8/0.6 cause-split was applied to the survival integration, shrinking every mortality effect ~8–12% (age-dependent) below the HR the model reports. Now the all-cause HR is applied flat; the pathway exponents drive only the displayed cvd/cancer/other decomposition.
Stacked-QALY double-count fixed — the portfolio summed per-item QALYs, double-counting shared baseline survival (verified +7% on a 2-item stack, +25% on 4, ~40% on a full stack). Mortality now combines multiplicatively on the hazard (one joint survival integration); only non-mortality QALYs add across items. Each item's effective HR is inverted from its own Monte-Carlo mort_qaly so a single-item "stack" reproduces the sim exactly. Wired into both the analyzer portfolio and the deployed /frontier ranker.
Confidence intervals emitted end-to-end — SimulationResult.ci80; frontier items expose net_qaly_ci/net_days_ci; baseline exposes life-expectancy/QALY intervals derived from the age-at-death distribution.
Centenarian crash guard: an age at or above the modeled horizon raised IndexError/ValueError; it now returns a null result.
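The stacked-QALY fix above can be illustrated with a minimal sketch. It is not the engine's code: the Gompertz hazard parameters, the function names, and the use of undiscounted life-years in place of QALYs are all assumptions made for illustration. It shows the two steps the description names: inverting each item's effective HR from its standalone mortality gain, then multiplying the HRs on the hazard and running one joint survival integration.

```python
import math

def baseline_hazards(start_age, horizon_age=110, a=5e-5, b=0.09):
    """Toy Gompertz annual hazards (hypothetical parameters, not CDC life tables)."""
    return [a * math.exp(b * age) for age in range(start_age, horizon_age)]

def life_years(hazards, hr=1.0):
    """Expected remaining life-years with every annual hazard scaled by hr."""
    total, cumulative = 0.0, 0.0
    for h in hazards:
        s_start = math.exp(-cumulative)
        cumulative += hr * h
        s_end = math.exp(-cumulative)
        total += 0.5 * (s_start + s_end)  # trapezoid over the year
    return total

def mortality_gain(hazards, hr):
    """Life-years gained relative to the unmodified baseline."""
    return life_years(hazards, hr) - life_years(hazards, 1.0)

def invert_hr(hazards, target_gain, lo=0.05, hi=1.0, tol=1e-10):
    """Bisection for the HR in [lo, hi] whose standalone gain equals target_gain.

    Assumes a protective item (HR <= 1); the gain falls as HR rises.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mortality_gain(hazards, mid) > target_gain:
            lo = mid  # gain too large, so the true HR is higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def stacked_gain(hazards, item_gains):
    """Combine items multiplicatively on the hazard, then integrate once."""
    joint_hr = math.prod(invert_hr(hazards, g) for g in item_gains)
    return mortality_gain(hazards, joint_hr)

hazards = baseline_hazards(start_age=40)
g1 = mortality_gain(hazards, 0.9)
g2 = mortality_gain(hazards, 0.8)
naive_sum = g1 + g2                        # the old per-item addition
joint = stacked_gain(hazards, [g1, g2])    # one joint survival integration
single = stacked_gain(hazards, [g1])       # a single-item "stack"
```

Here `single` matches `g1` up to the bisection tolerance, which is the property the description relies on, while `naive_sum` and `joint` differ. The gap in this toy says nothing about the magnitudes reported above, which depend on the engine's actual life tables and QALY weighting.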

Safety / honesty (frontend)

Medical disclaimer rendered on both results surfaces + persistent footer (previously only on /faq, /terms).
Confidence intervals rendered next to every point estimate; reduced false-precision rounding.
Prescription items badged "consult a clinician"; dose/brand stripped from public Rx display names ("Statin", "GLP-1 RA" rather than "rosuvastatin 5mg").
Privacy/FAQ copy corrected to state profile data is sent to the server engine (was "stays in your browser").
Homepage decision board labeled "Illustrative example".
Personal lab values scrubbed from public catalog notes ("LDL already 64", etc.).

Infra / docs

Added .github/workflows/ci.yml running both suites + typecheck on PRs (previously only a paper build ran).
Corrected REPRODUCIBILITY.md (production engine is Python; seed caveats) and added docs/DATA_PROVENANCE.md (CDC life tables, MEPS/Franks, Haagsma; flags the unsourced condition_joint_distribution.json and raw-git MEPS parquet).

Notable findings flagged, not changed (judgment calls)

Large standalone magnitudes for tadalafil/finasteride lean heavily on observational/mechanism evidence — the lifestyle-vs-Rx calibration deserves a look; treat the ordering of the big mortality items as less trustworthy than the corrected totals.
Genetics fixes (palindromic strand guard, CYP2D6 band gaps, non-prescriptive phrasing) are in the inherited WIP area and should land with that work.

🤖 Generated with Claude Code

Add a GitHub Actions workflow and a testable runner script so the sound Microplex-vs-eCPS comparison runs automatically instead of by hand.

- .github/workflows/ecps-eval.yaml: workflow_dispatch (candidate + baseline inputs) plus a weekly Monday 09:00 UTC schedule; multi-repo checkout + uv, runs the runner, uploads the result JSON and sidecars as artifacts.
- scripts/run_ecps_eval.py: resolves the baseline eCPS (HuggingFace policyengine/policyengine-us-data) and the candidate Microplex H5, runs the clone-floor baseline gate, runs the comparison with the sound flags, and writes a GitHub Step Summary (matched N, both losses, train/holdout, candidate_beats_baseline, soundness gates, and the #113 caveat). DRYRUN=1 prints the would-run command without executing or downloading.
- src/microplex_us/pipelines/ecps_clone_floor.py: the clone-floor gate. Refuses to benchmark a baseline whose clone household-weight share is below 5%, and fails closed on a missing/malformed sidecar.
- Tests for the gate (healthy/degraded/missing/malformed) and the runner (command assembly, resolution, DRYRUN, summary, gate short-circuit).

The heavy eval is intentionally not executed here; this builds and validates the automation only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
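The fail-closed behavior of the clone-floor gate can be sketched as follows. This is a hypothetical illustration rather than the contents of ecps_clone_floor.py: the sidecar key `clone_household_weight_share`, the function names, and the exception type are assumptions. Only the 5% floor and the four cases (healthy, degraded, missing, malformed) come from the commit message.

```python
import json
import math
from pathlib import Path

CLONE_SHARE_FLOOR = 0.05  # 5% floor from the commit message
SHARE_KEY = "clone_household_weight_share"  # hypothetical sidecar key

class CloneFloorError(RuntimeError):
    """Raised whenever the baseline must not be benchmarked."""

def validate_clone_share(payload, floor=CLONE_SHARE_FLOOR):
    """Return the clone share if it is well-formed and at or above the floor."""
    if not isinstance(payload, dict):
        raise CloneFloorError("malformed sidecar: expected a JSON object")
    share = payload.get(SHARE_KEY)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(share, bool) or not isinstance(share, (int, float)):
        raise CloneFloorError(f"malformed sidecar: {SHARE_KEY!r} missing or non-numeric")
    if not math.isfinite(share) or not 0.0 <= share <= 1.0:
        raise CloneFloorError(f"malformed sidecar: {SHARE_KEY!r}={share!r} is not a share")
    if share < floor:
        raise CloneFloorError(f"degraded baseline: clone share {share:.3f} < floor {floor:.3f}")
    return float(share)

def check_clone_floor(sidecar_path, floor=CLONE_SHARE_FLOOR):
    """Fail closed: any read or parse problem is treated as a gate failure."""
    path = Path(sidecar_path)
    try:
        payload = json.loads(path.read_text())
    except (OSError, ValueError) as exc:  # missing file, bad encoding, bad JSON
        raise CloneFloorError(f"unreadable sidecar {path}: {exc}") from exc
    return validate_clone_share(payload, floor)
```

Raising a single exception type for every failure mode keeps the caller simple: a runner only needs one `except CloneFloorError` branch to short-circuit before the comparison starts.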

MaxGhenis force-pushed the ci-ecps-eval branch 4 times, most recently from ffc949a to 8121063 · May 30, 2026 22:00

MaxGhenis force-pushed the ci-ecps-eval branch from 8121063 to 740d35f · May 30, 2026 22:00

MaxGhenis wants to merge 1 commit into main from ci-ecps-eval
