feat: add Python ↔ Rust parity harness#53
Conversation
nikhilwoodruff
left a comment
There was a problem hiding this comment.
Think this is an important feature to include in CI, but this isn't ready yet. What would make it:
- Compare against records sampled from the FRS
- Just compare the produced microdata outputs from both (would massively simplify this)
- Explicitly don't silent fail
| summary. Exits non-zero when any diff exceeds the configured tolerance. | ||
|
|
||
| Designed to surface drift introduced by Rust ports of Python variables. Uses | ||
| synthetic households so it has no FRS data dependency. If |
There was a problem hiding this comment.
Nope, needs to fail on this. Can you tighten your Claude setup to avoid silent failures
| "couple_no_kids_40k_25k", | ||
| { | ||
| "people": { | ||
| "p1": _person(35, employment_income=40_000), |
There was a problem hiding this comment.
We need far more complex households for this- I would sample from the FRS
|
|
||
| def run_rust(situation: dict, year: int) -> dict[str, float]: | ||
| """Run the Rust engine and extract per-variable totals.""" | ||
| sim = RustSimulation.from_situation(situation, year=year) |
There was a problem hiding this comment.
Does this depend on another PR?
Adds `scripts/parity.py`, which runs a fixed set of synthetic households through both the Python `policyengine-uk` package and the Rust `policyengine_uk_compiled` wrapper, diffs key tax / benefit / net-income outputs cell-for-cell, and prints a summary. Skips Python comparison gracefully when the Python package isn't installed. Wired into CI as a non-failing smoke step so it surfaces drift on every PR without breaking on the divergences that already exist (currently up to £3,276 on couple-with-children scenarios). Tolerance can be tightened once those gaps close. Stacked on top of #52 (Simulation.from_situation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9a75e02 to
b45e8e1
Compare
Adds `policyengine_uk_compiled.yaml_tests` — a runner that mirrors the
format used by `policyengine_uk/tests/policy/` so cases can be ported one
at a time.
The runner accepts either single-person flat input
(`input: { employment_income: 50000 }`) or full-situation input
(`input: { people: ..., benunits: ..., households: ... }`), supports
absolute and relative error margins, and writes outputs against the Rust
microdata column names (`baseline_income_tax`, `baseline_universal_credit`,
`baseline_net_income`, etc.).
This PR ships:
- The runner module with CLI: `python -m policyengine_uk_compiled.yaml_tests tests/policy`
- 11 hand-written YAML cases under `tests/policy/` covering income tax,
employee NI, and Child Benefit (single + multi-person)
- A pytest module that auto-discovers and parametrizes the YAML cases
- 21 unit tests for the runner itself (input mapping, tolerance, parsing)
- pyyaml added to the package's runtime dependencies
Stacked on #53 (parity harness) which is itself stacked on #52
(Simulation.from_situation). Future PRs port more of the 196 Python YAML
tests that already exist in `policyengine_uk/tests/policy/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the existing old-SP scaling pattern for the new-SP cohort: - If `person.state_pension > 0`: pass through, scaled by `(new_state_pension_weekly / baseline_new_sp_weekly)` for reform correctness - Else: fall back to `new_state_pension_weekly × 52` Previously the new-SP branch always returned the full parameter rate × 52, ignoring any recorded amount. This over-stated SP for partial- year claimants and broke parity for the pensioner_couple synthetic scenario in PR #53's parity harness (£946 diff). Implementation: - Plumb `baseline_new_sp_weekly` through `Simulation`, `calculate_benunit`, `calculate_state_pension`, and `person_state_pension`, parallel to the existing `baseline_old_sp_weekly` field - 3 new Rust unit tests (recorded-amount preserved, fallback to param when no record, recorded amount scales under reform) Parity-harness impact (synthetic pensioner_couple scenario): state_pension rust=23,000 py=23,000 diff=£0 (was £946) household_net_income diff=£-41 (was £905) Stacked on #58. Closes #59 (filed today as a follow-up to PR #53). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `policyengine_uk_compiled.yaml_tests` — a runner that mirrors the
format used by `policyengine_uk/tests/policy/` so cases can be ported one
at a time.
The runner accepts either single-person flat input
(`input: { employment_income: 50000 }`) or full-situation input
(`input: { people: ..., benunits: ..., households: ... }`), supports
absolute and relative error margins, and writes outputs against the Rust
microdata column names (`baseline_income_tax`, `baseline_universal_credit`,
`baseline_net_income`, etc.).
This PR ships:
- The runner module with CLI: `python -m policyengine_uk_compiled.yaml_tests tests/policy`
- 11 hand-written YAML cases under `tests/policy/` covering income tax,
employee NI, and Child Benefit (single + multi-person)
- A pytest module that auto-discovers and parametrizes the YAML cases
- 21 unit tests for the runner itself (input mapping, tolerance, parsing)
- pyyaml added to the package's runtime dependencies
Stacked on #53 (parity harness) which is itself stacked on #52
(Simulation.from_situation). Future PRs port more of the 196 Python YAML
tests that already exist in `policyengine_uk/tests/policy/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Reworked to address @nikhilwoodruff's review — ready for another look. All three requested changes are in:
Other changes:
Comparison mode is weighted aggregate (mean + p10/p50/p90), because the two engines emit different household counts/ids (Python ~53.5k vs Rust ~16.8k) so cell-for-cell alignment isn't reliable — documented in the script/changelog. Validated against the real local FRS: it surfaces genuine divergences (notably the UC/benefits gap) and exits 1 as intended. 16 hermetic unit tests (divergence→fail, within-tolerance→pass, missing-column→raise, data-unavailable→exit 0) pass. This PR has been rebased onto |
# Conflicts: # .github/workflows/test.yml
Summary
Closes #48. Adds
scripts/parity.py, which runs a fixed set of synthetic households through bothpolicyengine_uk(Python) andpolicyengine_uk_compiled(Rust), diffs key outputs cell-for-cell, and prints a summary. Stacked on #52 (usesSimulation.from_situation).What's included
scripts/parity.py— single-file harness with:--year,--tolerance(default £1),--no-failpolicyengine-ukisn't installed (prints a Rust-only smoke output instead)Tests (
interfaces/python/tests/test_parity_harness.py, 15 cases):ScenarioResultdiff computation (incl. NaN handling)CI: new non-failing parity-smoke step runs on every PR. It's intentionally non-failing because there are real, currently-existing divergences (see below) — the harness is a measurement tool, not a unit test. Tolerance tightens as gaps close.
Divergences this surfaces today
Running locally on a clean checkout:
household_net_income(something subtle in the deduction stack)state_pensionformula or default values)These are the gaps the project should chase next; this PR makes them measurable on every commit.
Verified locally
cargo test: 135 passingpytest interfaces/python/tests: 37 passing (22 from feat: add Simulation.from_situation for situation-JSON input #52 + 15 new)python scripts/parity.py --no-fail: completes, prints report, exits 0python scripts/parity.py: completes, exits 1 (current-state divergences exceed default £1 tolerance — exactly as designed)Stacking
Based on
vahid/from-situation(#52). Once #52 merges, GitHub will auto-rebase this PR onto main. The dependency isSimulation.from_situation, used inrun_rust.Test plan
cargo test,pytest, parity smoke runs without errors)python scripts/parity.pyprints the expected reportpolicyengine-uknot installed🤖 Generated with Claude Code