Automate the sound eCPS-replacement eval in CI #117
Draft
MaxGhenis wants to merge 1 commit into
ffc949a to 8121063
Add a GitHub Actions workflow and a testable runner script so the sound Microplex-vs-eCPS comparison runs automatically instead of by hand.

- .github/workflows/ecps-eval.yaml: workflow_dispatch (candidate + baseline inputs) plus a weekly Monday 09:00 UTC schedule; multi-repo checkout + uv, runs the runner, uploads the result JSON and sidecars as artifacts.
- scripts/run_ecps_eval.py: resolves the baseline eCPS (HuggingFace policyengine/policyengine-us-data) and the candidate Microplex H5, runs the clone-floor baseline gate, runs the comparison with the sound flags, and writes a GitHub Step Summary (matched N, both losses, train/holdout, candidate_beats_baseline, soundness gates, and the #113 caveat). DRYRUN=1 prints the would-run command without executing or downloading.
- src/microplex_us/pipelines/ecps_clone_floor.py: the clone-floor gate. Refuses to benchmark a baseline whose clone household-weight share is below 5%, and fails closed on a missing/malformed sidecar.
- Tests for the gate (healthy/degraded/missing/malformed) and the runner (command assembly, resolution, DRYRUN, summary, gate short-circuit).

The heavy eval is intentionally not executed here; this builds and validates the automation only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
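To illustrate the fail-closed behavior described for the clone-floor gate, here is a minimal sketch. It is not the code in ecps_clone_floor.py. The sidecar format (a JSON object), the field name `clone_household_weight_share`, and the names `check_clone_floor` and `CloneFloorError` are all assumptions made for illustration. Only the 5% floor and the four cases (healthy, degraded, missing, malformed) come from the commit message.

```python
import json
import math
from pathlib import Path

CLONE_SHARE_FLOOR = 0.05  # the 5% floor named in the commit message


class CloneFloorError(RuntimeError):
    """Raised whenever the baseline must not be benchmarked (hypothetical name)."""


def check_clone_floor(sidecar_path, floor=CLONE_SHARE_FLOOR):
    """Return the clone household-weight share if it clears the floor.

    Assumption: the sidecar is a JSON object with a numeric
    "clone_household_weight_share" field in [0, 1]. Anything else
    (missing file, bad JSON, wrong type, NaN) raises, so the gate
    fails closed instead of silently passing.
    """
    path = Path(sidecar_path)
    try:
        payload = json.loads(path.read_text())
    except FileNotFoundError as exc:
        raise CloneFloorError(f"missing sidecar: {path}") from exc
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        raise CloneFloorError(f"malformed sidecar: {path}") from exc

    share = payload.get("clone_household_weight_share") if isinstance(payload, dict) else None
    is_number = isinstance(share, (int, float)) and not isinstance(share, bool)
    if not is_number or not math.isfinite(share) or not 0.0 <= share <= 1.0:
        raise CloneFloorError(f"malformed sidecar: no usable clone share in {path}")

    if share < floor:
        raise CloneFloorError(
            f"degraded baseline: clone share {share:.3f} is below floor {floor:.3f}"
        )
    return float(share)
```

A runner could call this before assembling the comparison command and exit non-zero on `CloneFloorError`, which would match the "gate short-circuit" case listed in the tests.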
Summary
In-depth review of the QALY decision engine, plus fixes for the correctness, safety, and methodology issues it surfaced. All fixes are test-driven (failing test first, then fix).
Test status: 398 Python tests pass (cd python && uv run pytest); 409 TypeScript tests pass (bun run test); tsc --noEmit clean.
Review fixes in this PR (each TDD'd, own commit)
Engine / methodology
- [...] mort_qaly so a single-item "stack" reproduces the sim exactly. Wired into both the analyzer portfolio and the deployed /frontier ranker.
- [...] SimulationResult.ci80; frontier items expose net_qaly_ci/net_days_ci; baseline exposes life-expectancy/QALY intervals derived from the age-at-death distribution.
- [...] IndexError/ValueError; now returns a null result.
Safety / honesty (frontend)
Infra / docs
- [...] .github/workflows/ci.yml running both suites + typecheck on PRs (previously only a paper build ran).
- [...] REPRODUCIBILITY.md (production engine is Python; seed caveats) and added docs/DATA_PROVENANCE.md (CDC life tables, MEPS/Franks, Haagsma; flags the unsourced condition_joint_distribution.json and raw-git MEPS parquet).
Notable findings flagged, not changed (judgment calls)
🤖 Generated with Claude Code