feat: add Python ↔ Rust parity harness #53

Merged
vahid-ahmadi merged 3 commits into main from vahid/parity-harness on Jun 1, 2026
@vahid-ahmadi
Contributor

Summary

Closes #48. Adds scripts/parity.py, which runs a fixed set of synthetic households through both policyengine_uk (Python) and policyengine_uk_compiled (Rust), diffs key outputs cell-for-cell, and prints a summary. Stacked on #52 (uses Simulation.from_situation).

$ python scripts/parity.py
=== Rust vs Python parity report ===

-- single_£50,000 --
  income_tax                rust=     7,486  py=     7,486  diff=         0
  ni_employee               rust=     2,994  py=     2,994  diff=        -0
  household_net_income      rust=    39,520  py=    39,361  diff=       159 *
  → max |diff|: 159.05
...

What's included

scripts/parity.py — single-file harness with:

  • 11 synthetic scenarios (single at six income levels, couple, couple+kids, lone parent, pensioner couple, Scotland resident)
  • 9 variables compared (income tax, employee NI, employer NI, UC, child benefit, state pension, pension credit, housing benefit, household net income)
  • CLI: --year, --tolerance (default £1), --no-fail
  • Skips Python comparison gracefully if policyengine-uk isn't installed (prints a Rust-only smoke output instead)
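A minimal sketch of how the three CLI flags listed above could map onto an exit code. The function names and the default year are assumptions for illustration; only the flag names and the £1 default tolerance come from the description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in the PR description (sketch only)."""
    parser = argparse.ArgumentParser(description="Rust vs Python parity report (sketch)")
    # The default year is an assumption; the description does not state it.
    parser.add_argument("--year", type=int, default=2025)
    # Default tolerance of £1, as stated in the description.
    parser.add_argument("--tolerance", type=float, default=1.0)
    parser.add_argument("--no-fail", action="store_true")
    return parser


def exit_code(max_abs_diff: float, tolerance: float, no_fail: bool) -> int:
    """Return 1 when the largest divergence exceeds tolerance, unless --no-fail is set."""
    if no_fail:
        return 0
    return 1 if max_abs_diff > tolerance else 0


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--no-fail"])
print(exit_code(159.05, args.tolerance, args.no_fail))
```

With the £159.05 maximum divergence from the sample report, this sketch would exit 1 under the default tolerance and 0 under --no-fail, which matches the behaviour listed under "Verified locally".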

Tests (interfaces/python/tests/test_parity_harness.py, 15 cases):

  • Period substitution
  • Scenario builder produces required entities and substitutes the year
  • ScenarioResult diff computation (incl. NaN handling)
  • End-to-end CLI invocation through the Rust binary
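The diff computation exercised by these tests might look roughly like the sketch below. The class name `VariableDiff`, its fields, and the choice to treat NaN as a divergence are assumptions; the PR only says the real `ScenarioResult` handles NaN, not how.

```python
import math
from dataclasses import dataclass


@dataclass
class VariableDiff:
    """One compared variable (hypothetical stand-in for a row of ScenarioResult)."""

    name: str
    rust: float
    py: float

    @property
    def diff(self) -> float:
        # Sign convention assumed from the sample report: rust minus py.
        return self.rust - self.py

    def exceeds(self, tolerance: float) -> bool:
        # Assumption: a NaN on either side counts as a divergence, not a pass.
        if math.isnan(self.rust) or math.isnan(self.py):
            return True
        return abs(self.diff) > tolerance


def format_row(d: VariableDiff, tolerance: float) -> str:
    """Render one report line, flagging rows beyond tolerance with '*'."""
    flag = " *" if d.exceeds(tolerance) else ""
    return f"  {d.name:<25} rust={d.rust:>10,.0f}  py={d.py:>10,.0f}  diff={d.diff:>10,.0f}{flag}"


print(format_row(VariableDiff("household_net_income", 39520.0, 39361.0), 1.0))
```

Formatting a small negative difference with zero decimal places prints `-0` in Python, which may be where the `diff= -0` on the ni_employee row of the sample output comes from.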

CI: new non-failing parity-smoke step runs on every PR. It's intentionally non-failing because there are real, currently-existing divergences (see below) — the harness is a measurement tool, not a unit test. Tolerance tightens as gaps close.

Divergences this surfaces today

Running locally on a clean checkout:

| scenario | max \|diff\| (£) | likely cause |
|---|---|---|
| single_£0 … single_£150k | 159 | base divergence on household_net_income (something subtle in the deduction stack) |
| couple_no_kids_40k_25k | 159 | same base diff |
| couple_2kids_30k_15k | 3,276 | UC entitlement gap: Rust returns £0 vs Python's larger value |
| lone_parent_2kids_18k | 2,722 | UC differs by £554, plus the £159 base |
| pensioner_couple | 946 | state pension £946 high (issue is in the state_pension formula or default values) |
| scotland_single_45k | 159 | base diff carries through Scottish tax bands fine |

These are the gaps the project should chase next; this PR makes them measurable on every commit.

Verified locally

  • cargo test: 135 passing
  • pytest interfaces/python/tests: 37 passing (22 from #52, "feat: add Simulation.from_situation for situation-JSON input", plus 15 new)
  • python scripts/parity.py --no-fail: completes, prints report, exits 0
  • python scripts/parity.py: completes, exits 1 (current-state divergences exceed default £1 tolerance — exactly as designed)

Stacking

Based on vahid/from-situation (#52). Once #52 merges and its branch is deleted, GitHub should retarget this PR to main. The dependency is Simulation.from_situation, used in run_rust.

Test plan

  • CI passes (cargo test, pytest, parity smoke runs without errors)
  • Local: python scripts/parity.py prints the expected report
  • Local: harness skips Python comparison cleanly when policyengine-uk not installed
  • Numbers in the table reproduce the divergences listed above (so future PRs can see them shrink)

🤖 Generated with Claude Code

Contributor

@nikhilwoodruff nikhilwoodruff left a comment

Think this is an important feature to include in CI, but this isn't ready yet. What would make it ready:

  • Compare against records sampled from the FRS
  • Just compare the produced microdata outputs from both (would massively simplify this)
  • Explicitly don't fail silently

Comment thread scripts/parity.py Outdated
summary. Exits non-zero when any diff exceeds the configured tolerance.

Designed to surface drift introduced by Rust ports of Python variables. Uses
synthetic households so it has no FRS data dependency. If

Nope, needs to fail on this. Can you tighten your Claude setup to avoid silent failures?

Comment thread scripts/parity.py Outdated
    "couple_no_kids_40k_25k",
    {
        "people": {
            "p1": _person(35, employment_income=40_000),

We need far more complex households for this; I would sample from the FRS.

Comment thread scripts/parity.py Outdated

def run_rust(situation: dict, year: int) -> dict[str, float]:
    """Run the Rust engine and extract per-variable totals."""
    sim = RustSimulation.from_situation(situation, year=year)

Does this depend on another PR?

Adds `scripts/parity.py`, which runs a fixed set of synthetic households
through both the Python `policyengine-uk` package and the Rust
`policyengine_uk_compiled` wrapper, diffs key tax / benefit / net-income
outputs cell-for-cell, and prints a summary. Skips Python comparison
gracefully when the Python package isn't installed.

Wired into CI as a non-failing smoke step so it surfaces drift on every PR
without breaking on the divergences that already exist (currently up to
£3,276 on couple-with-children scenarios). Tolerance can be tightened
once those gaps close.

Stacked on top of #52 (Simulation.from_situation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi force-pushed the vahid/parity-harness branch from 9a75e02 to b45e8e1 on May 29, 2026 09:17
vahid-ahmadi added a commit that referenced this pull request May 29, 2026
Adds `policyengine_uk_compiled.yaml_tests` — a runner that mirrors the
format used by `policyengine_uk/tests/policy/` so cases can be ported one
at a time.

The runner accepts either single-person flat input
(`input: { employment_income: 50000 }`) or full-situation input
(`input: { people: ..., benunits: ..., households: ... }`), supports
absolute and relative error margins, and writes outputs against the Rust
microdata column names (`baseline_income_tax`, `baseline_universal_credit`,
`baseline_net_income`, etc.).
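As a rough illustration of what "absolute and relative error margins" could mean for a single expected value, here is a sketch. The function name, parameter names, and the rule that either margin is sufficient are assumptions, not the runner's confirmed behaviour.

```python
from typing import Optional


def within_margin(
    actual: float,
    expected: float,
    abs_margin: float = 0.0,
    rel_margin: Optional[float] = None,
) -> bool:
    """Return True when actual is close enough to expected.

    Assumption: a case passes if it satisfies the absolute margin, or,
    when a relative margin is given, the relative margin.
    """
    error = abs(actual - expected)
    if error <= abs_margin:
        return True
    # A relative margin is undefined against an expected value of zero.
    if rel_margin is not None and expected != 0:
        return error <= rel_margin * abs(expected)
    return False


print(within_margin(7486.4, 7486, abs_margin=1))
print(within_margin(7500, 7486, rel_margin=0.01))
```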

This PR ships:
- The runner module with CLI: `python -m policyengine_uk_compiled.yaml_tests tests/policy`
- 11 hand-written YAML cases under `tests/policy/` covering income tax,
  employee NI, and Child Benefit (single + multi-person)
- A pytest module that auto-discovers and parametrizes the YAML cases
- 21 unit tests for the runner itself (input mapping, tolerance, parsing)
- pyyaml added to the package's runtime dependencies

Stacked on #53 (parity harness) which is itself stacked on #52
(Simulation.from_situation). Future PRs port more of the 196 Python YAML
tests that already exist in `policyengine_uk/tests/policy/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vahid-ahmadi added a commit that referenced this pull request May 29, 2026
Mirrors the existing old-SP scaling pattern for the new-SP cohort:
- If `person.state_pension > 0`: pass through, scaled by
  `(new_state_pension_weekly / baseline_new_sp_weekly)` for reform
  correctness
- Else: fall back to `new_state_pension_weekly × 52`

Previously the new-SP branch always returned the full parameter rate
× 52, ignoring any recorded amount. This over-stated SP for partial-
year claimants and broke parity for the pensioner_couple synthetic
scenario in PR #53's parity harness (£946 diff).
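The rule described above can be sketched in Python for readability, although the actual change is in the Rust engine. The function signature is hypothetical and the weekly rate of 200.0 is an illustrative number, not a real parameter value.

```python
def person_state_pension(
    recorded: float,
    new_sp_weekly: float,
    baseline_new_sp_weekly: float,
) -> float:
    """Annual new-State-Pension amount for one person (sketch of the described rule)."""
    if recorded > 0:
        # Pass the recorded amount through, scaled by the reform/baseline ratio
        # so that a reform to the weekly rate still moves the result.
        return recorded * (new_sp_weekly / baseline_new_sp_weekly)
    # No recorded amount: fall back to the full parameter rate for 52 weeks.
    return new_sp_weekly * 52


# Illustrative values only (200.0 per week is not a real rate).
print(person_state_pension(11_500.0, 200.0, 200.0))  # recorded amount preserved
print(person_state_pension(0.0, 200.0, 200.0))       # fallback to the parameter rate
print(person_state_pension(10_000.0, 220.0, 200.0))  # recorded amount scaled under reform
```

The three calls mirror the three Rust unit tests the commit message lists: recorded amount preserved, fallback to the parameter when no record exists, and recorded amount scaling under reform.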

Implementation:
- Plumb `baseline_new_sp_weekly` through `Simulation`,
  `calculate_benunit`, `calculate_state_pension`, and
  `person_state_pension`, parallel to the existing
  `baseline_old_sp_weekly` field
- 3 new Rust unit tests (recorded-amount preserved, fallback to param
  when no record, recorded amount scales under reform)

Parity-harness impact (synthetic pensioner_couple scenario):
  state_pension     rust=23,000 py=23,000 diff=£0       (was £946)
  household_net_income           diff=£-41              (was £905)

Stacked on #58. Closes #59 (filed today as a follow-up to PR #53).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi changed the base branch from vahid/from-situation to main May 29, 2026 09:19
@vahid-ahmadi
Contributor Author

Reworked to address @nikhilwoodruff's review — ready for another look.

All three requested changes are in:

  1. FRS records, not synthetic households — the harness now runs the full FRS through both engines for the year.
  2. Compares produced microdata outputs — dropped the bespoke per-scenario/per-variable mapping; it now compares the household-level microdata (hbai_household_net_income vs Rust baseline_net_income, plus total tax and total benefits).
  3. No silent failure — exits non-zero on any divergence beyond tolerance. No --no-fail default; a missing column or a per-variable error is a hard error, never NaN-skipped. The only non-failing exit when data is missing is an explicit, loud "FRS data unavailable — SKIPPING (this is NOT a pass)" path for CI.
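The failure policy in point 3 could be sketched as follows. All names here are hypothetical, and the skip message is modeled on the one quoted above rather than copied from the script.

```python
import sys


class MissingColumnError(KeyError):
    """Raised when an expected microdata column is absent (hypothetical name)."""


def require_column(table: dict, name: str) -> list:
    """Return a column, or fail loudly; never substitute NaN for a missing column."""
    if name not in table:
        raise MissingColumnError(f"expected column {name!r} not found in microdata")
    return table[name]


def run_check(frs_available: bool, max_abs_diff: float, tolerance: float) -> int:
    """Return the process exit code for one parity run."""
    if not frs_available:
        # The only non-failing path without data: an explicit, loud skip.
        print("FRS data unavailable: SKIPPING (this is NOT a pass)", file=sys.stderr)
        return 0
    # Any divergence beyond tolerance is a hard failure.
    return 1 if max_abs_diff > tolerance else 0


print(run_check(True, 159.05, 1.0))
```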

Other changes:

Comparison mode is weighted aggregate (mean + p10/p50/p90), because the two engines emit different household counts/ids (Python ~53.5k vs Rust ~16.8k) so cell-for-cell alignment isn't reliable — documented in the script/changelog.
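A weighted aggregate comparison of this kind can be sketched in pure Python. The helper names are hypothetical, and the quantile definition used here (smallest value whose cumulative weight reaches the target share) is one of several reasonable choices; the script may use a different one.

```python
def weighted_mean(values: list, weights: list) -> float:
    """Mean of values, with each household counted by its survey weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_quantile(values: list, weights: list, q: float) -> float:
    """Smallest value whose cumulative weight reaches share q of the total weight."""
    pairs = sorted(zip(values, weights))
    threshold = q * sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]


def summarise(values: list, weights: list) -> dict:
    """Mean plus p10/p50/p90, the statistics named in the comment above."""
    return {
        "mean": weighted_mean(values, weights),
        "p10": weighted_quantile(values, weights, 0.10),
        "p50": weighted_quantile(values, weights, 0.50),
        "p90": weighted_quantile(values, weights, 0.90),
    }


# Toy data: three households, the last carrying twice the weight.
print(summarise([10.0, 20.0, 30.0], [1.0, 1.0, 2.0]))
```

Because each engine is summarised on its own rows and weights, the two summaries can be compared even when household counts and ids differ, at the cost of no longer pinpointing which households diverge.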

Validated against the real local FRS: it surfaces genuine divergences (notably the UC/benefits gap) and exits 1 as intended. 16 hermetic unit tests (divergence→fail, within-tolerance→pass, missing-column→raise, data-unavailable→exit 0) pass. This PR has been rebased onto main and is mergeable independently of the rest of the stack.

@vahid-ahmadi vahid-ahmadi merged commit 7894c1b into main Jun 1, 2026
Development

Successfully merging this pull request may close these issues.

Python ↔ Rust parity harness: cell-for-cell diff on shared FRS slice