feat: add Python ↔ Rust parity harness by vahid-ahmadi · Pull Request #53 · PolicyEngine/policyengine-uk-rust

vahid-ahmadi · 2026-04-30T13:42:12Z

Summary

Closes #48. Adds scripts/parity.py, which runs a fixed set of synthetic households through both policyengine_uk (Python) and policyengine_uk_compiled (Rust), diffs key outputs cell-for-cell, and prints a summary. Stacked on #52 (uses Simulation.from_situation).

$ python scripts/parity.py
=== Rust vs Python parity report ===

-- single_£50,000 --
  income_tax                rust=     7,486  py=     7,486  diff=         0
  ni_employee               rust=     2,994  py=     2,994  diff=        -0
  household_net_income      rust=    39,520  py=    39,361  diff=       159 *
  → max |diff|: 159.05
...

What's included

scripts/parity.py — single-file harness with:

11 synthetic scenarios (single at six income levels, couple, couple+kids, lone parent, pensioner couple, Scotland resident)
9 variables compared (income tax, employee NI, employer NI, UC, child benefit, state pension, pension credit, housing benefit, household net income)
CLI: --year, --tolerance (default £1), --no-fail
Skips Python comparison gracefully if policyengine-uk isn't installed (prints a Rust-only smoke output instead)
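
The CLI surface listed above could be wired up roughly as follows. This is a minimal sketch rather than the script's actual code: the function names `build_parser` and `exit_code`, the default year, and the help text are assumptions. Only the three flags and the £1 default tolerance come from the description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the three flags listed above."""
    parser = argparse.ArgumentParser(description="Rust vs Python parity report")
    # The default year is an assumption; the PR description does not state it.
    parser.add_argument("--year", type=int, default=2025)
    # The £1 default tolerance is taken from the PR description.
    parser.add_argument("--tolerance", type=float, default=1.0)
    parser.add_argument(
        "--no-fail",
        action="store_true",
        help="always exit 0, even when a diff exceeds the tolerance",
    )
    return parser


def exit_code(max_abs_diff: float, tolerance: float, no_fail: bool) -> int:
    """Return 1 when the largest diff exceeds the tolerance, unless --no-fail is set."""
    if no_fail:
        return 0
    return 1 if max_abs_diff > tolerance else 0
```

With the £159.05 divergence from the sample report, `exit_code(159.05, 1.0, False)` gives 1 and `exit_code(159.05, 1.0, True)` gives 0, which is consistent with the two exit codes reported under "Verified locally".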

Tests (interfaces/python/tests/test_parity_harness.py, 15 cases):

Period substitution
Scenario builder produces required entities and substitutes the year
ScenarioResult diff computation (incl. NaN handling)
End-to-end CLI invocation through the Rust binary
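
To illustrate the diff computation those tests cover, here is a hedged sketch. The name `ScenarioResult` appears in the PR, but its fields, the `rust - py` sign convention (inferred from the sample report), and the NaN policy shown here (NaN diffs are ignored when taking the maximum) are assumptions, not the harness's confirmed behaviour.

```python
import math
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """Hypothetical shape: per-variable outputs from each engine for one scenario."""

    name: str
    rust: dict[str, float]
    py: dict[str, float]

    def diffs(self) -> dict[str, float]:
        # rust - py for every variable in the Rust output; a variable
        # missing on the Python side yields NaN.
        return {
            var: value - self.py.get(var, math.nan)
            for var, value in self.rust.items()
        }

    def max_abs_diff(self) -> float:
        # Assumed NaN policy: skip NaN diffs when taking the maximum.
        finite = [abs(d) for d in self.diffs().values() if not math.isnan(d)]
        return max(finite, default=0.0)
```

Skipping NaN keeps the report readable, but it also means a variable that fails on one side cannot push the maximum over the tolerance.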

CI: new non-failing parity-smoke step runs on every PR. It's intentionally non-failing because there are real, currently-existing divergences (see below) — the harness is a measurement tool, not a unit test. Tolerance tightens as gaps close.

Divergences this surfaces today

Running locally on a clean checkout:

| scenario | max \|diff\| (£) | likely cause |
| --- | --- | --- |
| single_£0 … single_£150k | 159 | base divergence on `household_net_income` (something subtle in the deduction stack) |
| couple_no_kids_40k_25k | 159 | same base diff |
| couple_2kids_30k_15k | 3,276 | UC entitlement gap: Rust returns £0 vs Python's larger value |
| lone_parent_2kids_18k | 2,722 | UC differs by £554, plus the £159 base |
| pensioner_couple | 946 | state pension £946 high (issue is in the `state_pension` formula or default values) |
| scotland_single_45k | 159 | base diff carries through Scottish tax bands fine |

These are the gaps the project should chase next; this PR makes them measurable on every commit.

Verified locally

cargo test: 135 passing
pytest interfaces/python/tests: 37 passing (22 from #52 + 15 new)
python scripts/parity.py --no-fail: completes, prints report, exits 0
python scripts/parity.py: completes, exits 1 (current-state divergences exceed default £1 tolerance — exactly as designed)

Stacking

Based on vahid/from-situation (#52). Once #52 merges and its branch is deleted, GitHub will automatically retarget this PR to main. The dependency is Simulation.from_situation, used in run_rust.

Test plan

CI passes (cargo test, pytest, parity smoke runs without errors)
Local: python scripts/parity.py prints the expected report
Local: harness skips Python comparison cleanly when policyengine-uk not installed
Numbers in the table reproduce the divergences listed above (so future PRs can see them shrink)

🤖 Generated with Claude Code

nikhilwoodruff

Think this is an important feature to include in CI, but this isn't ready yet. What would make it ready:

Compare against records sampled from the FRS
Just compare the produced microdata outputs from both (would massively simplify this)
Explicitly don't silent fail

nikhilwoodruff · 2026-05-13T09:36:27Z

+summary. Exits non-zero when any diff exceeds the configured tolerance.
+
+Designed to surface drift introduced by Rust ports of Python variables. Uses
+synthetic households so it has no FRS data dependency. If


Nope, needs to fail on this. Can you tighten your Claude setup to avoid silent failures?

nikhilwoodruff · 2026-05-13T09:36:50Z

+        "couple_no_kids_40k_25k",
+        {
+            "people": {
+                "p1": _person(35, employment_income=40_000),


We need far more complex households for this; I would sample from the FRS.

nikhilwoodruff · 2026-05-13T09:37:31Z

+
+def run_rust(situation: dict, year: int) -> dict[str, float]:
+    """Run the Rust engine and extract per-variable totals."""
+    sim = RustSimulation.from_situation(situation, year=year)


Does this depend on another PR?

Adds `scripts/parity.py`, which runs a fixed set of synthetic households through both the Python `policyengine-uk` package and the Rust `policyengine_uk_compiled` wrapper, diffs key tax / benefit / net-income outputs cell-for-cell, and prints a summary. Skips Python comparison gracefully when the Python package isn't installed.

Wired into CI as a non-failing smoke step so it surfaces drift on every PR without breaking on the divergences that already exist (currently up to £3,276 on couple-with-children scenarios). Tolerance can be tightened once those gaps close.

Stacked on top of #52 (Simulation.from_situation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `policyengine_uk_compiled.yaml_tests`, a runner that mirrors the format used by `policyengine_uk/tests/policy/` so cases can be ported one at a time. The runner accepts either single-person flat input (`input: { employment_income: 50000 }`) or full-situation input (`input: { people: ..., benunits: ..., households: ... }`), supports absolute and relative error margins, and writes outputs against the Rust microdata column names (`baseline_income_tax`, `baseline_universal_credit`, `baseline_net_income`, etc.).

This PR ships:
- The runner module with CLI: `python -m policyengine_uk_compiled.yaml_tests tests/policy`
- 11 hand-written YAML cases under `tests/policy/` covering income tax, employee NI, and Child Benefit (single + multi-person)
- A pytest module that auto-discovers and parametrizes the YAML cases
- 21 unit tests for the runner itself (input mapping, tolerance, parsing)
- pyyaml added to the package's runtime dependencies

Stacked on #53 (parity harness) which is itself stacked on #52 (Simulation.from_situation). Future PRs port more of the 196 Python YAML tests that already exist in `policyengine_uk/tests/policy/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
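
The absolute and relative error margins mentioned in the runner description can be illustrated with a small check like the one below. This is a sketch under stated assumptions: the function name `within_margin`, the parameter names, the rule that every supplied margin must hold, and the exact-match default when no margin is given are all illustrative and not taken from the runner's source.

```python
from typing import Optional


def within_margin(
    actual: float,
    expected: float,
    absolute_error_margin: Optional[float] = None,
    relative_error_margin: Optional[float] = None,
) -> bool:
    """Return True when `actual` is close enough to `expected` (hypothetical helper)."""
    diff = abs(actual - expected)
    if absolute_error_margin is None and relative_error_margin is None:
        # Assumption: with no margin configured, require an exact match.
        absolute_error_margin = 0.0
    ok = True
    if absolute_error_margin is not None:
        ok = ok and diff <= absolute_error_margin
    if relative_error_margin is not None:
        # The relative margin is measured against the expected value.
        ok = ok and diff <= abs(relative_error_margin * expected)
    return ok
```

An absolute margin suits pound amounts that should agree to the nearest pound, while a relative margin scales with the size of the expected value.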

Mirrors the existing old-SP scaling pattern for the new-SP cohort:
- If `person.state_pension > 0`: pass through, scaled by `(new_state_pension_weekly / baseline_new_sp_weekly)` for reform correctness
- Else: fall back to `new_state_pension_weekly × 52`

Previously the new-SP branch always returned the full parameter rate × 52, ignoring any recorded amount. This over-stated SP for partial-year claimants and broke parity for the pensioner_couple synthetic scenario in PR #53's parity harness (£946 diff).

Implementation:
- Plumb `baseline_new_sp_weekly` through `Simulation`, `calculate_benunit`, `calculate_state_pension`, and `person_state_pension`, parallel to the existing `baseline_old_sp_weekly` field
- 3 new Rust unit tests (recorded-amount preserved, fallback to param when no record, recorded amount scales under reform)

Parity-harness impact (synthetic pensioner_couple scenario):
state_pension rust=23,000 py=23,000 diff=£0 (was £946)
household_net_income diff=£-41 (was £905)

Stacked on #58. Closes #59 (filed today as a follow-up to PR #53).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
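
The new-SP rule described in this commit message can be sketched in Python as below. The real change is in Rust; this is only an illustration of the two branches. The function name matches the one mentioned in the message, but the signature is an assumption and the weekly rates used in the usage note are round illustrative numbers, not actual parameter values.

```python
def person_state_pension(
    recorded_state_pension: float,
    new_state_pension_weekly: float,
    baseline_new_sp_weekly: float,
) -> float:
    """Annual new-SP amount under the rule above (Python sketch of the Rust logic)."""
    if recorded_state_pension > 0:
        # Pass the recorded amount through, scaled by the reform/baseline ratio
        # so that a reform to the weekly rate still moves the result.
        return recorded_state_pension * (
            new_state_pension_weekly / baseline_new_sp_weekly
        )
    # No recorded amount: fall back to the full weekly rate for 52 weeks.
    return new_state_pension_weekly * 52
```

With an illustrative baseline of 200 per week, a recorded amount of 8,000 is preserved when the reform rate equals the baseline, becomes 12,000 when the reform rate is 300, and a person with no recorded amount falls back to 200 × 52 = 10,400. These three cases correspond to the three Rust unit tests named in the message.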

vahid-ahmadi · 2026-05-29T12:17:02Z

Reworked to address @nikhilwoodruff's review — ready for another look.

All three requested changes are in:

FRS records, not synthetic households — the harness now runs the full FRS through both engines for the year.
Compares produced microdata outputs — dropped the bespoke per-scenario/per-variable mapping; it now compares the household-level microdata (hbai_household_net_income vs Rust baseline_net_income, plus total tax and total benefits).
No silent failure — exits non-zero on any divergence beyond tolerance. No --no-fail default; a missing column or a per-variable error is a hard error, never NaN-skipped. The only non-failing exit when data is missing is an explicit, loud "FRS data unavailable — SKIPPING (this is NOT a pass)" path for CI.
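
The failure policy in the three points above might look like the following sketch. It is not the harness's code: `MissingColumnError`, `require_column`, `parity_exit_code`, and the skip-message wording are hypothetical, tables are plain dicts of lists, and an unweighted mean stands in for the real comparison to keep the example short. The column pair is the one named in the comment.

```python
from typing import Mapping, Optional, Sequence

# Column pair named in the comment: Python column first, Rust column second.
PAIRS = [("hbai_household_net_income", "baseline_net_income")]


class MissingColumnError(KeyError):
    """Raised when an expected output column is absent (hypothetical name)."""


def require_column(table: Mapping[str, Sequence[float]], column: str) -> Sequence[float]:
    # A missing column is a hard error, not a NaN that gets skipped.
    if column not in table:
        raise MissingColumnError(column)
    return table[column]


def parity_exit_code(
    python_out: Optional[Mapping[str, Sequence[float]]],
    rust_out: Optional[Mapping[str, Sequence[float]]],
    pairs: Sequence[tuple[str, str]],
    tolerance: float,
) -> int:
    if python_out is None or rust_out is None:
        # The only non-failing path without data: state loudly that nothing was checked.
        print("SKIPPING: FRS data unavailable (this is NOT a pass)")
        return 0
    for py_col, rust_col in pairs:
        py_vals = require_column(python_out, py_col)
        rust_vals = require_column(rust_out, rust_col)
        py_mean = sum(py_vals) / len(py_vals)
        rust_mean = sum(rust_vals) / len(rust_vals)
        if abs(rust_mean - py_mean) > tolerance:
            return 1  # any divergence beyond tolerance fails the run
    return 0
```

The design point is that every outcome is explicit: a pass, a fail, a raised error, or a skip that announces itself.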

Other changes:

Removed the from_situation dependency (PR #52, feat: add Simulation.from_situation for situation-JSON input) entirely: the harness no longer needs it, so this PR now stands alone.
Folds in the fix from #61 (fix: parity harness compares against hbai_household_net_income): net income is compared via hbai_household_net_income, which excludes the indirect/transaction taxes the Rust baseline omits.

Comparison mode is weighted aggregate (mean + p10/p50/p90), because the two engines emit different household counts/ids (Python ~53.5k vs Rust ~16.8k) so cell-for-cell alignment isn't reliable — documented in the script/changelog.
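
A weighted aggregate comparison of this kind can be sketched with the standard library alone. The function names and the percentile definition used here (the smallest value whose cumulative weight share reaches the target) are assumptions; the harness may use a different estimator or a numerical library.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Mean of `values` where each entry counts `weight` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_quantile(
    values: Sequence[float], weights: Sequence[float], q: float
) -> float:
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    threshold = q * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]


def summarise(values: Sequence[float], weights: Sequence[float]) -> dict[str, float]:
    """Weighted mean plus p10/p50/p90, the statistics named in the comment."""
    return {
        "mean": weighted_mean(values, weights),
        "p10": weighted_quantile(values, weights, 0.10),
        "p50": weighted_quantile(values, weights, 0.50),
        "p90": weighted_quantile(values, weights, 0.90),
    }
```

Because each engine's output is reduced to the same four statistics, the two summaries can be compared even when the engines emit different numbers of household rows.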

Validated against the real local FRS: it surfaces genuine divergences (notably the UC/benefits gap) and exits 1 as intended. 16 hermetic unit tests (divergence→fail, within-tolerance→pass, missing-column→raise, data-unavailable→exit 0) pass. This PR has been rebased onto main and is mergeable independently of the rest of the stack.

# Conflicts: # .github/workflows/test.yml

nikhilwoodruff requested changes May 13, 2026

vahid-ahmadi force-pushed the vahid/parity-harness branch from 9a75e02 to b45e8e1 Compare May 29, 2026 09:17

vahid-ahmadi changed the base branch from vahid/from-situation to main May 29, 2026 09:19

vahid-ahmadi mentioned this pull request May 29, 2026

feat: add Simulation.from_situation for situation-JSON input #52

Open

3 tasks

vahid-ahmadi added 2 commits June 1, 2026 14:04

Merge remote-tracking branch 'origin/main' into vahid/parity-harness

1070988

Merge remote-tracking branch 'origin/main' into vahid/parity-harness

81b33b1

vahid-ahmadi merged commit 7894c1b into main Jun 1, 2026

vahid-ahmadi merged 3 commits into main from vahid/parity-harness
