Automate the sound eCPS-replacement eval in CI #117

Draft: MaxGhenis wants to merge 1 commit into main from ci-ecps-eval

@MaxGhenis
Contributor

Summary

In-depth review of the QALY decision engine, plus fixes for the correctness, safety, and methodology issues it surfaced. All fixes are test-driven (failing test first, then fix).

Test status: 398 Python tests pass (cd python && uv run pytest); 409 TypeScript tests pass (bun run test); tsc --noEmit clean.

⚠️ Scope note — please read before reviewing. This branch was cut from in-progress work, so the diff also carries pre-existing uncommitted WIP that is not part of this review (notably reference_case.py, large protocol_ground_up.py/stack_interactions.py/sleep.py expansions, and the genetics-module changes). My review fixes depend on that WIP (shared files: catalog.py, simulate.py, web_api.py), so they could not be cleanly rebased onto main without it. The commits below are the review work; everything in the first "checkpoint" commit is the inherited WIP. Happy to split if you'd rather land the WIP separately first.

Review fixes in this PR (each TDD'd, own commit)

Engine / methodology

  • Pathway attenuation fixed — the hardcoded 1.3/0.8/0.6 cause-split was applied to the survival integration, shrinking every mortality effect ~8–12% (age-dependent) below the HR the model reports. Now the all-cause HR is applied flat; the pathway exponents drive only the displayed cvd/cancer/other decomposition.
  • Stacked-QALY double-count fixed — the portfolio summed per-item QALYs, double-counting shared baseline survival (verified +7% on a 2-item stack, +25% on 4, ~40% on a full stack). Mortality now combines multiplicatively on the hazard (one joint survival integration); only non-mortality QALYs add across items. Each item's effective HR is inverted from its own Monte-Carlo mort_qaly so a single-item "stack" reproduces the sim exactly. Wired into both the analyzer portfolio and the deployed /frontier ranker.
  • Confidence intervals emitted end-to-end: SimulationResult.ci80; frontier items expose net_qaly_ci/net_days_ci; baseline exposes life-expectancy/QALY intervals derived from the age-at-death distribution.
  • Centenarian crash guard — age ≥ modeled horizon returned IndexError/ValueError; now returns a null result.
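The stacked-QALY fix above can be sketched as follows. This is a minimal illustration, not the engine's code: the Gompertz baseline hazard, the function names, and the use of undiscounted life-years as a stand-in for mortality QALYs are all assumptions. It shows the three steps the bullet describes: invert each item's effective HR from its own mortality gain, multiply the HRs on the hazard, and run one joint survival integration, adding only non-mortality gains across items.

```python
import math


def remaining_life_years(start_age, hazard_ratio=1.0, horizon=110):
    """Integrate survival under a toy Gompertz hazard scaled by one all-cause HR."""
    survival, years = 1.0, 0.0
    for age in range(start_age, horizon):
        # Assumed toy baseline hazard; the real engine uses life tables.
        annual_hazard = hazard_ratio * 1e-4 * math.exp(0.085 * age)
        next_survival = survival * math.exp(-annual_hazard)
        years += 0.5 * (survival + next_survival)  # trapezoid rule over one year
        survival = next_survival
    return years


def mortality_gain(start_age, hazard_ratio):
    """Life-years gained relative to the unmodified baseline (HR = 1)."""
    return remaining_life_years(start_age, hazard_ratio) - remaining_life_years(start_age)


def effective_hr(start_age, target_gain, lo=0.05, hi=5.0, iterations=60):
    """Bisect for the HR whose gain equals target_gain (gain falls as HR rises)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mortality_gain(start_age, mid) > target_gain:
            lo = mid  # gain too large, so the HR must be higher
        else:
            hi = mid
    return 0.5 * (lo + hi)


def stacked_gain(start_age, item_mortality_gains, non_mortality_gains=()):
    """Combine HRs multiplicatively, integrate once, then add non-mortality gains."""
    combined_hr = math.prod(effective_hr(start_age, g) for g in item_mortality_gains)
    return mortality_gain(start_age, combined_hr) + sum(non_mortality_gains)
```

Because each HR is inverted from the item's own gain, `stacked_gain(40, [g])` returns `g` up to bisection tolerance, which mirrors the "single-item stack reproduces the sim" property described above. The size and direction of the gap between summed and joint totals depend on the baseline hazard, discounting, and quality weights, so this toy should not be used to reproduce the percentages quoted in the bullet.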

Safety / honesty (frontend)

  • Medical disclaimer rendered on both results surfaces + persistent footer (previously only on /faq, /terms).
  • Confidence intervals rendered next to every point estimate; reduced false-precision rounding.
  • Prescription items badged "consult a clinician"; dose/brand stripped from public Rx display names ("Statin", "GLP-1 RA" rather than "rosuvastatin 5mg").
  • Privacy/FAQ copy corrected to state profile data is sent to the server engine (was "stays in your browser").
  • Homepage decision board labeled "Illustrative example".
  • Personal lab values scrubbed from public catalog notes ("LDL already 64", etc.).
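The intervals rendered next to each point estimate come from the engine side. As a hedged sketch of how an 80% interval could be derived from Monte Carlo draws, the helper below takes the 10th and 90th percentiles. The function name `ci80` echoes the field named in this PR, but the percentile choice and the implementation are assumptions, not the engine's confirmed method.

```python
import statistics


def ci80(draws):
    """Return the (10th, 90th) percentiles of the draws: a central 80% interval."""
    # quantiles(n=10) yields the nine decile cut points; keep the outer two.
    deciles = statistics.quantiles(draws, n=10, method="inclusive")
    return deciles[0], deciles[-1]


def format_estimate(point, interval, digits=2):
    """Render a point estimate with its interval, e.g. for a results row."""
    low, high = interval
    return f"{point:.{digits}f} (80% CI {low:.{digits}f} to {high:.{digits}f})"
```

For example, `ci80(list(range(101)))` gives `(10.0, 90.0)`, and `format_estimate(0.5, (0.25, 0.75))` gives `"0.50 (80% CI 0.25 to 0.75)"`.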

Infra / docs

  • Added .github/workflows/ci.yml running both suites + typecheck on PRs (previously only a paper build ran).
  • Corrected REPRODUCIBILITY.md (production engine is Python; seed caveats) and added docs/DATA_PROVENANCE.md (CDC life tables, MEPS/Franks, Haagsma; flags the unsourced condition_joint_distribution.json and raw-git MEPS parquet).

Notable findings flagged, not changed (judgment calls)

  • Large standalone magnitudes for tadalafil/finasteride lean heavily on observational/mechanism evidence — the lifestyle-vs-Rx calibration deserves a look; treat the ordering of the big mortality items as less trustworthy than the corrected totals.
  • Genetics fixes (palindromic strand guard, CYP2D6 band gaps, non-prescriptive phrasing) are in the inherited WIP area and should land with that work.

🤖 Generated with Claude Code

@MaxGhenis force-pushed the ci-ecps-eval branch 4 times, most recently from ffc949a to 8121063 (May 30, 2026 22:00)
Add a GitHub Actions workflow and a testable runner script so the sound
Microplex-vs-eCPS comparison runs automatically instead of by hand.

- .github/workflows/ecps-eval.yaml: workflow_dispatch (candidate + baseline
  inputs) plus a weekly Monday 09:00 UTC schedule; multi-repo checkout + uv,
  runs the runner, uploads the result JSON and sidecars as artifacts.
- scripts/run_ecps_eval.py: resolves the baseline eCPS (HuggingFace
  policyengine/policyengine-us-data) and the candidate Microplex H5, runs the
  clone-floor baseline gate, runs the comparison with the sound flags, and
  writes a GitHub Step Summary (matched N, both losses, train/holdout,
  candidate_beats_baseline, soundness gates, and the #113 caveat). DRYRUN=1
  prints the would-run command without executing or downloading.
- src/microplex_us/pipelines/ecps_clone_floor.py: the clone-floor gate. Refuses
  to benchmark a baseline whose clone household-weight share is below 5%, and
  fails closed on a missing/malformed sidecar.
- Tests for the gate (healthy/degraded/missing/malformed) and the runner
  (command assembly, resolution, DRYRUN, summary, gate short-circuit).
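The clone-floor gate described above might look roughly like the sketch below. It is not the contents of ecps_clone_floor.py: the sidecar is assumed to be a JSON object, and the key `clone_household_weight_share`, the exception name, and the function name are hypothetical. Only the 5% floor and the fail-closed behavior come from the commit message.

```python
import json
import math
from pathlib import Path

CLONE_SHARE_FLOOR = 0.05  # 5% floor, as stated in the commit message


class CloneFloorError(RuntimeError):
    """Raised when the baseline must not be benchmarked."""


def check_clone_floor(sidecar_path, floor=CLONE_SHARE_FLOOR):
    """Return the clone share if healthy; raise on degraded, missing, or malformed."""
    path = Path(sidecar_path)
    try:
        payload = json.loads(path.read_text())
    except (OSError, ValueError) as exc:  # missing file or invalid JSON
        raise CloneFloorError(f"unreadable sidecar {path}: {exc}") from exc

    # Hypothetical key name; fail closed if it is absent or not a usable number.
    share = payload.get("clone_household_weight_share") if isinstance(payload, dict) else None
    if isinstance(share, bool) or not isinstance(share, (int, float)):
        raise CloneFloorError(f"malformed sidecar {path}: no numeric clone share")
    if not math.isfinite(share) or not 0.0 <= share <= 1.0:
        raise CloneFloorError(f"malformed sidecar {path}: share {share!r} out of range")

    if share < floor:
        raise CloneFloorError(f"clone share {share:.3f} is below the floor {floor:.3f}")
    return float(share)
```

Raising on every unexpected input, rather than returning a default, is what makes the gate fail closed: a caller that forgets to handle the exception stops instead of benchmarking a degraded baseline.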

The heavy eval is intentionally not executed here; this builds and validates the
automation only.
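A DRYRUN switch of the kind the runner describes can be sketched as below. The module name `hypothetical_compare` and its flags are placeholders for illustration, not the real command that scripts/run_ecps_eval.py assembles; only the `DRYRUN=1` behavior (print the command, execute nothing) is taken from the commit message.

```python
import os
import shlex
import subprocess
import sys


def build_command(candidate, baseline):
    """Assemble the comparison command as an argument list."""
    # Hypothetical module and flag names, for illustration only.
    return [
        sys.executable, "-m", "hypothetical_compare",
        "--candidate", candidate,
        "--baseline", baseline,
    ]


def run_eval(candidate, baseline, environ=None):
    """Run the comparison, or only print it when DRYRUN=1 is set."""
    environ = os.environ if environ is None else environ
    command = build_command(candidate, baseline)
    if environ.get("DRYRUN") == "1":
        # Print the shell-quoted command and return before any side effects.
        print("DRYRUN would run:", shlex.join(command))
        return None
    return subprocess.run(command, check=True).returncode
```

Keeping command assembly in its own function lets tests assert on the argument list without executing anything, which fits the command-assembly and DRYRUN tests listed in the commit message.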

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>