Add PUF support clone for MP vs eCPS replacement

Reviewed plan: /Users/maxghenis/mp-ecps-tracking/mp-puf-support-clone-plan-20260601.md

Cycle status: reviewed by eCPS-fidelity and Microplex implementation-risk subagents; all actionable findings were incorporated and final reviews reported no actionable findings.

# MP eCPS PUF Support Clone Plan

Date: 2026-06-01

## Objective

Make the eCPS-shaped Microplex replacement use the same core PUF-support idea as latest eCPS: keep the original CPS/ASEC support and append a PUF-imputed support copy so calibration can choose between CPS-reported and PUF-imputed tax/income values by assigning weights. This is more important than tax-unit-preservation cleanup because the current Gate-1 MP dataset has only one support surface for overlapping PUF/CPS variables such as wages, self-employment income, dividends, interest, pensions, and capital gains.

The production loss gate remains the sound MP vs eCPS comparison against latest local eCPS, with matched household count, symmetric refit, and holdout targets. The implementation is acceptable only if it moves that loss in the right direction without breaking export-column parity or checkpoint resumability.

## How eCPS Does It

Relevant files:

- [policyengine_us_data/calibration/puf_impute.py](/Users/maxghenis/PolicyEngine/policyengine-us-data/policyengine_us_data/calibration/puf_impute.py:1)
- [policyengine_us_data/datasets/cps/extended_cps.py](/Users/maxghenis/PolicyEngine/policyengine-us-data/policyengine_us_data/datasets/cps/extended_cps.py:91)

The eCPS implementation is `puf_clone_dataset(...)` in `policyengine-us-data`.

1. eCPS loads the CPS dataset through `Microsimulation(dataset=self.cps)` and converts all arrays into `{variable: {period: array}}`.
2. It assigns detailed geography using `assign_geography_within_state_county(...)`. Before cloning it writes `spm_unit_spm_threshold` into the CPS data dict; geography is passed separately to `puf_clone_dataset(...)` and then added to the doubled output.
3. It calls `puf_clone_dataset(...)` with the CPS data dict, assigned geography, PUF dataset, and CPS H5 path.
4. `puf_clone_dataset` runs PUF QRF imputation (`_run_qrf_imputation`) when a PUF dataset is available. `_run_qrf_imputation` returns two prediction maps that `puf_clone_dataset` uses while constructing the doubled data:
   - `y_full`: full PUF-imputed values for the PUF half.
   - `y_override`: values for the special variables eCPS wants on both halves.
   `puf_clone_dataset` itself returns the doubled data dict.
5. It doubles every array:
   - Ordinary variables: `[CPS values, CPS values]`.
   - ID variables: `[ids, ids + max(ids)]`.
   - Weight variables: `[weights, 0 * weights]`.
   - Variables in `IMPUTED_VARIABLES`: `[CPS values, PUF-imputed values]`.
   - Variables already present in CPS and in `OVERRIDDEN_IMPUTED_VARIABLES`: `[PUF-imputed override values, PUF-imputed override values]`.
   - `IMPUTED_VARIABLES` absent from CPS: `[zeros, PUF-imputed values]`, even if the variable also appears in the override list.
   - `weeks_unemployed`: `[CPS values, separately imputed PUF-half weeks]`.
6. The QRF training path stratified-subsamples PUF to about 20,000 records, preserves the top 0.5 percent by AGI, batches imputed variables, and predicts from demographic predictors calculated by PolicyEngine.
7. eCPS predicts on the person surface, then maps non-person variables back to their PolicyEngine entity with `value_from_first_person`. This matters for tax-unit, SPM-unit, household, family, or marital-unit variables whose values must be assigned consistently across persons.
8. It saves the doubled dataset, then calibration can place positive final weights on either original CPS rows or PUF-imputed clone rows. The PUF half is stored with zero weight; calibration/reweighting paths must still initialize optimization weights in a way that can activate zero-stored-weight support rows.
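The doubling rules in step 5 can be sketched as a single per-variable function. This is only an illustration of the precedence described above, not the eCPS source: the set contents, the function name `double_variable`, and the variable names `some_puf_only_var` and `overridden_var` are hypothetical, and the separate `weeks_unemployed` path is left out.

```python
import numpy as np

# Hypothetical stand-ins; the real lists live in policyengine-us-data.
ID_VARIABLES = {"person_id", "household_id"}
WEIGHT_VARIABLES = {"household_weight"}
IMPUTED = {"employment_income", "some_puf_only_var"}
OVERRIDDEN = {"overridden_var"}


def double_variable(name, cps, y_full, y_override):
    """Return the [original half, clone half] array for one variable."""
    values = cps.get(name)
    if values is None:
        # Imputed variable absent from CPS: zeros on the original half.
        clone = np.asarray(y_full[name])
        return np.concatenate([np.zeros_like(clone), clone])
    values = np.asarray(values)
    if name in ID_VARIABLES:
        # Shift clone IDs past the largest original ID.
        return np.concatenate([values, values + values.max()])
    if name in WEIGHT_VARIABLES:
        # Clone half is stored with zero weight.
        return np.concatenate([values, 0 * values])
    if name in OVERRIDDEN and name in y_override:
        # Present in CPS and overridden: PUF-imputed values on both halves.
        override = np.asarray(y_override[name])
        return np.concatenate([override, override])
    if name in IMPUTED and name in y_full:
        # CPS-reported on the original half, PUF-imputed on the clone half.
        return np.concatenate([values, np.asarray(y_full[name])])
    # Ordinary variable: CPS values on both halves.
    return np.concatenate([values, values])
```

Calling this once per variable over the union of CPS keys and `y_full` keys would produce a doubled data dict with the same shape as the list above describes.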

The current latest eCPS H5 reflects this. It has about twice the raw CPS households, and the second half has positive weights after calibration, meaning the optimizer did use PUF clone support.

There are two relevant PolicyEngine-US-data pathways:

- Public `ExtendedCPS.generate()` loads CPS and calls `puf_clone_dataset()` directly. It does not run ACS/SIPP/SCF imputation first.
- The newer unified calibration path has a source-imputation stage before PUF cloning. That is a separate pipeline design, not the public eCPS generation path.

## Current MP State

Relevant file:

- [src/microplex_us/pipelines/us.py](/Users/maxghenis/.codex-worktrees/microplex-us-ecps-puf-support-clone-20260601/src/microplex_us/pipelines/us.py:4941)

Current MP donor integration:

1. Starts with a single CPS/ASEC spine frame (`current = seed_data.copy()`).
2. For each donor source, prepares donor seed data and identifies:
   - `shared_vars`: variables observed in both scaffold and donor, used as conditioning features.
   - `donor_only_vars`: authoritative donor variables absent from the scaffold.
   - `donor_override_vars`: shared authoritative donor variables listed in `donor_imputer_authoritative_override_variables`.
3. It imputes donor target blocks into the same `current` frame in place.
4. For the Gate-1 MP run, overlapping PUF/CPS variables were mostly not represented as alternatives. Evidence from artifact checks: scaffold and seed values for `wage_income` and `self_employment_income` were identical, while eCPS has a clone half where these can differ.

There is partial infrastructure but not the production feature:

- Export contracts already know `person_is_puf_clone` can exist as an optional eCPS-internal flag.
- Source-weight diagnostics already mention `puf_support_household_weight_share` and `irs_soi_puf_support_clone`.
- Donor override machinery exists, but overriding one frame is not the same as appending original plus PUF-imputed support.

## Implementation Design

Add an explicit, opt-in MP PUF support clone mode for the eCPS-shaped pipeline.

### Configuration

Add config fields to `USMicroplexBuildConfig`:

- `puf_support_clone_enabled: bool = False`
- `puf_support_clone_source_prefixes: tuple[str, ...] = ("irs_soi_puf",)`
- `puf_support_clone_zero_initial_weight: bool = True`
- `puf_support_clone_original_flag_column: str = "person_is_puf_clone"`

The eCPS replacement checkpoint runner should enable this for CPS + PUF + SIPP + SCF no-ACS builds. Other MP builds stay unchanged until intentionally opted in.

Also wire the fields explicitly through the checkpoint CLI and rebuild defaults:

- Add CLI flags or an eCPS-shaped default that turns clone mode on for the Gate-1 CPS/PUF/SIPP/SCF no-ACS path.
- Keep the default off for generic MP builds.
- Persist the effective clone settings in build metadata so resumed calibration and comparison artifacts can tell whether the support clone was active.
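As a rough sketch of the configuration surface, the four fields could be grouped and serialized as shown here. The standalone `PUFSupportCloneSettings` class and the `clone_settings_metadata` helper are hypothetical; the plan places the fields directly on `USMicroplexBuildConfig`, whose actual definition is not reproduced.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PUFSupportCloneSettings:
    """Hypothetical container mirroring the proposed config fields."""

    puf_support_clone_enabled: bool = False
    puf_support_clone_source_prefixes: tuple[str, ...] = ("irs_soi_puf",)
    puf_support_clone_zero_initial_weight: bool = True
    puf_support_clone_original_flag_column: str = "person_is_puf_clone"


def clone_settings_metadata(settings):
    """Return a JSON-friendly dict of the effective clone settings."""
    payload = asdict(settings)
    # Tuples are stored as lists so the payload round-trips through JSON.
    payload["puf_support_clone_source_prefixes"] = list(
        payload["puf_support_clone_source_prefixes"]
    )
    return payload
```

Writing this payload into build metadata is one way a resumed calibration could tell whether the support clone was active without access to the original config object.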

Guardrail:

- `_integrate_donor_sources` runs before synthesis. PUF support cloning there is safe only for `synthesis_backend="seed"`, because seed synthesis preserves support rows. If clone mode is enabled with bootstrap/synthesizer backends, fail early with a clear error until a post-synthesis clone path is implemented.
- Clone mode also requires a CPS/ASEC-shaped scaffold and exactly the expected PUF donor source in donor inputs. Fail clearly if a PUF-prefixed source is selected as scaffold, no PUF donor matches `puf_support_clone_source_prefixes`, or multiple ambiguous PUF donors match without explicit configuration.
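A minimal sketch of these guardrails, assuming sources are identified by name strings. The function name `validate_clone_mode`, its parameters, and the example source names are hypothetical, and the CPS/ASEC-shape check is reduced to rejecting a PUF-prefixed scaffold because the plan does not specify how shape is detected.

```python
def validate_clone_mode(
    *,
    enabled,
    synthesis_backend,
    scaffold_source,
    donor_sources,
    prefixes=("irs_soi_puf",),
):
    """Return the single PUF donor name, or None when clone mode is off."""
    if not enabled:
        return None
    if synthesis_backend != "seed":
        raise ValueError(
            "PUF support clone requires synthesis_backend='seed'; "
            f"got {synthesis_backend!r}"
        )
    prefix_tuple = tuple(prefixes)
    if scaffold_source.startswith(prefix_tuple):
        raise ValueError(
            f"scaffold {scaffold_source!r} is PUF-prefixed; expected CPS/ASEC"
        )
    matches = [name for name in donor_sources if name.startswith(prefix_tuple)]
    if not matches:
        raise ValueError(f"no donor source matches prefixes {prefix_tuple}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous PUF donors: {matches}")
    return matches[0]
```

Running this check before any donor is processed keeps the failure early and independent of how far the build has progressed.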

### Row Construction

When `_integrate_donor_sources` reaches the PUF donor and clone mode is enabled:

1. Save `original_current = current.copy()`.
2. Build `clone_current = original_current.copy()`.
3. Offset entity IDs in `clone_current` so original and clone rows do not collide:
   - `person_id`
   - `household_id`
   - `tax_unit_id`
   - `family_id`
   - `spm_unit_id`
   - `marital_unit_id`
   - membership/reference columns such as `person_household_id`, `person_tax_unit_id`, `person_spm_unit_id`, or any other column containing `_id` that points into the cloned support.
   Offset should be deterministic and safe for numeric IDs. If string IDs appear, use a stable suffix strategy. The mapping must be consistent across entity IDs and reference columns so clone people point to clone households/tax units/SPM units, not original entities.
4. Set all stored clone weight columns to zero by default:
   - `weight`
   - `household_weight` if present
   - `hh_weight` if present
   - any other column containing `_weight`, including `person_weight`, `tax_unit_weight`, and `spm_unit_weight` if present.
5. Add `person_is_puf_clone = 0.0` to original rows and `1.0` to clone rows.
6. Run existing PUF donor imputation on `clone_current` only, not on `original_current`.
7. Align columns before concatenation:
   - Original rows get their original CPS value for ordinary overlap variables.
   - Original rows get the PUF-imputed value for any pre-clone overlap variable included in MP's explicit eCPS-style both-halves override list.
   - Original rows get `0.0` for donor-only PUF target variables that did not exist before cloning.
   - Clone rows get the imputed PUF value for included variables.
   - No included target column may become NaN from a missing half.
8. Concatenate `[original_current, imputed_clone_current]` and immediately `reset_index(drop=True)`. Duplicate original/clone indexes would break later person-native donor assignments that rely on pandas index alignment.

This mirrors eCPS’s support structure: original CPS support remains available and PUF-imputed support is added as zero-initial-weight rows that calibration can activate.
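The row-construction steps above could look roughly like the following pandas sketch. It assumes numeric IDs, takes the PUF imputation as a hypothetical callable `impute_puf` that returns the clone frame with imputed columns, uses a `max + 1` offset as an illustrative choice, and omits the both-halves override step. The function name and the `entity_of_id_column` mapping are not part of the existing code.

```python
import pandas as pd


def build_support_clone(
    current, impute_puf, entity_of_id_column, flag_column="person_is_puf_clone"
):
    """Return [original rows, PUF-imputed clone rows] with a clean index."""
    original = current.copy()
    clone = original.copy()

    # One offset per entity, shared by the entity ID and its reference
    # columns, so clone people point at clone households and tax units.
    offsets = {}
    for column, entity in entity_of_id_column.items():
        if column in original.columns:
            candidate = int(original[column].max()) + 1
            offsets[entity] = max(offsets.get(entity, 0), candidate)
    for column, entity in entity_of_id_column.items():
        if column in clone.columns:
            clone[column] = clone[column] + offsets[entity]

    # Stored clone weights start at zero.
    for column in clone.columns:
        if column == "weight" or "_weight" in column:
            clone[column] = 0.0

    original[flag_column] = 0.0
    clone[flag_column] = 1.0

    # Impute on the clone half only.
    clone = impute_puf(clone)

    # Donor-only targets did not exist before cloning: zero on the original half.
    for column in clone.columns.difference(original.columns):
        original[column] = 0.0

    combined = pd.concat([original, clone[original.columns]])
    combined = combined.reset_index(drop=True)
    if combined.isna().any().any():
        raise ValueError("clone alignment produced NaN values")
    return combined
```

Passing `{"person_id": "person", "household_id": "household", "person_household_id": "household"}` as `entity_of_id_column` would apply the same offset to `household_id` and `person_household_id`.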

### PUF Variable Surface

For the clone half, PUF should impute both donor-only variables and overlapping variables where PUF is authoritative enough to provide alternative support. The implementation should not depend only on one small handpicked list; eCPS imputes a broad tax surface, and MP needs either parity with that broad surface where the variables are data-owned or an explicit exclusion table for every missing variable.

Initial variable selection:

- Start from the existing PUF donor target selection:
  - donor-only authoritative variables.
  - shared authoritative variables already listed in `donor_imputer_authoritative_override_variables`.
- Add a PUF-clone-specific default overlap list modeled on the full eCPS `IMPUTED_VARIABLES` surface, constrained to variables present in both MP current and PUF donor seed and allowed by source authority. This includes the major income variables below plus deductions, credits, QBI inputs, HSA/student loan/charitable variables, and qualified-income flags where MP has a data-owned column:
  - `employment_income`
  - `self_employment_income`
  - `social_security`
  - `taxable_pension_income`
  - `taxable_interest_income`
  - `tax_exempt_interest_income`
  - `qualified_dividend_income`
  - `non_qualified_dividend_income`
  - `long_term_capital_gains`
  - `short_term_capital_gains`
  - `rental_income`
  - `farm_income`
  - `partnership_s_corp_income`
  - `taxable_ira_distributions`
  - `taxable_unemployment_compensation`
  - `traditional_ira_contributions`
  - `self_employed_health_insurance_ald`
  - `self_employed_pension_contribution_ald`
  - `wage_income`, `dividend_income`, and `capital_gains` only if they are present as MP scaffold aliases or support fields mapped to the corresponding PolicyEngine variables above.

Also implement an explicit MP equivalent of eCPS's `OVERRIDDEN_IMPUTED_VARIABLES` category for variables that should be PUF-imputed on both halves. This should be a named list with tests and comments. The default should mirror eCPS only where the variable is data-owned rather than formula-owned under the pinned PolicyEngine-US export contract, and only apply the both-halves override to variables that are already present in the pre-clone frame. If a variable is donor-only/absent before cloning, it should be added as `[zero original half, PUF-imputed clone half]` unless MP explicitly documents a different reason.

Add a generated or checked `puf_support_clone_variable_surface` sidecar/table with:

- eCPS `IMPUTED_VARIABLES`
- eCPS `OVERRIDDEN_IMPUTED_VARIABLES`
- eCPS special PUF-half variables outside those lists, currently `weeks_unemployed`
- MP-supported included variables
- MP-excluded variables
- exclusion reason (`formula_owned`, `missing_policyengine_us_variable`, `missing_puf_source_column`, `unsupported_entity_mapping`, or `intentional_divergence`)

The release path should not silently omit eCPS PUF variables.
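One way to enforce the no-silent-omission rule is a small checker over the variable surface. This is a sketch only: the function name `variable_surface_problems` and its input shapes are hypothetical, while the allowed reasons are the ones listed above.

```python
ALLOWED_EXCLUSION_REASONS = {
    "formula_owned",
    "missing_policyengine_us_variable",
    "missing_puf_source_column",
    "unsupported_entity_mapping",
    "intentional_divergence",
}


def variable_surface_problems(ecps_variables, mp_included, mp_excluded):
    """List problems with the MP clone variable surface.

    ecps_variables: iterable of eCPS PUF variable names.
    mp_included: iterable of variables MP imputes on the clone half.
    mp_excluded: dict mapping excluded variable name to exclusion reason.
    """
    included = set(mp_included)
    problems = []
    for name in sorted(set(ecps_variables)):
        if name in included and name in mp_excluded:
            problems.append(f"{name}: both included and excluded")
        elif name in included:
            continue
        elif name not in mp_excluded:
            problems.append(f"{name}: silently omitted")
        elif mp_excluded[name] not in ALLOWED_EXCLUSION_REASONS:
            problems.append(f"{name}: unknown reason {mp_excluded[name]!r}")
    return problems
```

An empty result means every eCPS variable is either included or excluded with a recognized reason, which is the condition the release path needs.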

Entity mapping requirement:

- Resolve each PUF-clone target variable's PolicyEngine entity under the pinned `policyengine-us`.
- If the donor imputation engine generates at person level for a non-person variable, map values back to the correct entity consistently before assignment, matching eCPS's `value_from_first_person` pattern.
- If MP's block engine already projects a block to the target entity, record that in diagnostics and test it so tax-unit/SPM/household variables do not get person-shape semantics accidentally.
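The first-person projection can be illustrated with a pandas groupby. This is an analogue of the pattern, not the eCPS `value_from_first_person` implementation; the function name `project_from_first_person` and the column names in the usage are hypothetical. Note that `GroupBy.first()` takes the first non-null value per group.

```python
import pandas as pd


def project_from_first_person(persons, entity_id_column, value_column):
    """Collapse a person-level prediction to one value per entity.

    Returns (entity_values, person_values): a Series indexed by entity ID,
    and the same values broadcast back to every person in the entity.
    """
    grouped = persons.groupby(entity_id_column, sort=False)[value_column]
    entity_values = grouped.first()
    person_values = persons[entity_id_column].map(entity_values)
    return entity_values, person_values
```

Broadcasting the entity value back to persons makes it easy to test that no two members of the same tax unit or household carry different values for an entity-level variable.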

Do not blindly import eCPS variable semantics. MP should use its own source manifests and PolicyEngine export contract, with eCPS as the reference for the support-clone architecture.

### Ordering

Required ordering for immediate eCPS replacement:

1. CPS scaffold.
2. PUF support clone immediately after CPS, creating the original CPS half and a zero-stored-weight PUF-imputed support half.
3. Downstream non-PUF donor imputations over the expanded support universe (SIPP/SCF now; ACS donor later if enabled). This lets SIPP/SCF fill benefits/assets on both CPS-original and PUF-alternative rows.
4. PE table materialization.
5. Microsim target materialization.
6. Calibration.

This deliberately treats PUF as the first support-expanding donor: the original multispine rather than a late overwrite. It is not a literal copy of either PolicyEngine-US-data pathway. Public eCPS does CPS -> PUF clone and has no SIPP/SCF stage; the unified calibration path does source imputation before PUF clone. MP's architecture should use the eCPS support-clone mechanism but place it where it is most useful for our donor pipeline.

Donor ordering must be observable, not only behavioral:

- When clone mode is enabled, `_integrate_donor_sources` must actively split donor inputs into PUF-clone donors matching `puf_support_clone_source_prefixes` and non-PUF donors, process the PUF-clone donors first, then process later donors over the expanded support.
- Do not rely on provider order, even though the current default provider bundle happens to list CPS then PUF then SIPP/SCF.
- Validate that the scaffold is CPS/ASEC-shaped before splitting. The current scaffold chooser is score-based, so clone mode must not assume PUF will always be a donor unless it checks.
- Return `processed_donor_source_order` and `puf_clone_source_order` in the donor-integration result.
- Persist those fields in metadata/sidecars alongside `fusion_plan.source_names`, because the input source order can differ from the actual processing order once PUF-first support expansion is active.
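The active split can be sketched as a stable partition of donor names. The function name `order_donors_for_clone_mode` and the example source names are hypothetical; the two returned keys are the fields named above.

```python
def order_donors_for_clone_mode(donor_names, prefixes=("irs_soi_puf",)):
    """Partition donors so PUF-clone donors run first, preserving input order."""
    prefix_tuple = tuple(prefixes)
    puf_donors = [n for n in donor_names if n.startswith(prefix_tuple)]
    other_donors = [n for n in donor_names if not n.startswith(prefix_tuple)]
    return {
        "puf_clone_source_order": puf_donors,
        "processed_donor_source_order": puf_donors + other_donors,
    }
```

Because the partition is explicit, a deliberately misordered input such as `["sipp", "irs_soi_puf_2024", "scf"]` still yields PUF first, and the recorded order documents the difference from the input order.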

### Checkpoints and Sidecars

This must preserve restartability:

- Post-imputation checkpoint should contain the doubled person and household support after PUF clone.
- Post-microsim checkpoint should work unchanged.
- Clone metadata must either be persisted in checkpoint metadata or be reconstructable from checkpoint tables using `person_is_puf_clone`.
- Recalibration-from-checkpoint must preserve/report clone diagnostics instead of losing them when source frames are unavailable.
- Add a small `puf_support_clone_summary.json` sidecar and include the main figures in existing source-weight diagnostics:
  - enabled flag
  - donor source name
  - original row count
  - clone row count
  - final row count
  - original weight sum
  - clone initial weight sum
  - clone variable count
  - clone overlap variable count
  - clone source order
  - variable-surface parity/exclusion counts
  - non-person variable entity mapping mode
  - whether optimization initialized the clone support with activatable nonzero internal weights even though stored source weights are zero
  - final clone household count, activated household count, clone household weight sum, and clone household weight share after calibration

Existing source-weight diagnostics currently assume donor rows contribute no separate support weight; update them to compute PUF support activation from `person_is_puf_clone` after calibration.

Household-level PUF clone diagnostics:

- Keep `person_is_puf_clone` on the person table.
- Derive clone households by grouping persons by `household_id`; a clone household should have all members flagged as `person_is_puf_clone == 1`.
- Join those clone household IDs to the calibrated household table to compute clone household count, activated count, weight sum, and weight share.
- If a household has mixed clone flags, report it as a diagnostic error because ID offsetting or household construction has broken clone integrity.
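A minimal sketch of the household-level derivation, assuming a person table with `household_id` and `person_is_puf_clone` and a calibrated household table with `household_id` and a weight column. The function name, the `household_weight` default, and the returned key names are hypothetical.

```python
import pandas as pd


def clone_household_diagnostics(
    persons,
    households,
    weight_column="household_weight",
    flag_column="person_is_puf_clone",
):
    """Derive clone household activation figures from person-level flags."""
    flags = persons.groupby("household_id")[flag_column].agg(["min", "max"])
    # A household whose members disagree on the flag has broken clone integrity.
    mixed_ids = flags.index[flags["min"] != flags["max"]].tolist()
    clone_ids = flags.index[(flags["min"] == 1.0) & (flags["max"] == 1.0)]

    is_clone = households["household_id"].isin(clone_ids)
    clone_weights = households.loc[is_clone, weight_column]
    total_weight = float(households[weight_column].sum())
    clone_weight = float(clone_weights.sum())
    return {
        "clone_household_count": int(is_clone.sum()),
        "activated_clone_household_count": int((clone_weights > 0).sum()),
        "clone_household_weight_sum": clone_weight,
        "clone_household_weight_share": (
            clone_weight / total_weight if total_weight > 0 else 0.0
        ),
        "mixed_flag_household_ids": mixed_ids,
    }
```

A non-empty `mixed_flag_household_ids` list is the condition to surface as a diagnostic error.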

Calibration activation:

- Stored source weights should mirror eCPS: clone support rows start at zero.
- Calibration internals must nevertheless be able to activate clone rows. PE L0 already floors zero initial weights; for other backends, either add a clone-aware internal weight floor/warm start in `calibrate_policyengine_tables` or restrict clone mode to proven backends.
- Add a focused test where a zero-stored-weight clone row receives positive calibrated household weight under the configured eCPS replacement backend.
- Household-budget selection runs before final calibration and can discard zero-stored-weight clone households. Clone mode must either fail or disable selection when `policyengine_selection_household_budget` is set, or prove that the selector also uses a clone-aware positive internal floor and keeps clone support eligible.
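To illustrate why a positive internal floor matters, the sketch below initializes log-space optimization weights from stored weights. It is not the PE L0 code or `calibrate_policyengine_tables`; the function name and the floor value of `1e-3` are assumptions chosen only for illustration.

```python
import numpy as np


def initial_log_weights(stored_weights, floor=1e-3):
    """Return finite log-weights, flooring zero stored weights.

    Without the floor, log(0) is -inf and a multiplicative or log-space
    update can never move a clone row away from zero.
    """
    weights = np.asarray(stored_weights, dtype=float)
    return np.log(np.maximum(weights, floor))
```

The stored table keeps exact zeros for clone rows; only the optimizer's starting point is floored, so the data on disk still mirrors eCPS support semantics.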

Imputation ablations:

- The donor-imputation ablation runner calls `_integrate_donor_sources` directly for variant scoring. Disable PUF support clone inside ablation variant configs unless the ablation is specifically testing support expansion; otherwise the evidence path will compare different support universes.

### Validation Gates

Pre-calibration validation:

- If clone mode is enabled, `synthesis_backend` is `seed` until a post-synthesis clone path exists.
- Export-column contract still passes from the post-imputation tables before calibration.
- Dataset row counts are doubled only for the PUF clone path.
- `person_is_puf_clone` exists on the post-imputation tables and remains optional, not required, in the export contract.
- Original rows preserve CPS values for ordinary overlapping variables.
- Variables in the explicit PUF-both-halves override list are PUF-imputed on both halves, matching eCPS behavior where we intentionally adopt it.
- Variables absent before cloning but included in PUF clone support have zero original-half values and PUF-imputed clone-half values, unless an explicit MP exclusion/divergence entry explains otherwise.
- `weeks_unemployed` is either handled like eCPS, with original-half CPS values and PUF-half imputed values tied to unemployment compensation support, or has an explicit exclusion/divergence reason.
- Clone rows differ from original rows for at least high-signal PUF overlap variables (`self_employment_income`, `employment_income` or `wage_income`, `taxable_interest_income`, `qualified_dividend_income`, `capital_gains`) when donor support exists.
- Clone rows start with zero values in every stored weight column.
- The calibration/refit initializer can activate clone rows; exact zero stored source weights must not become `log(0)` or permanent zero final weights.
- Household-budget selection is disabled for clone mode unless it has its own clone-aware activation/eligibility test.
- Entity ID uniqueness holds across original and clone rows, and clone reference columns point to clone entities.
- PUF-first ordering is enforced even if the caller provides donors in a deliberately misordered list.
- Concatenated original+clone rows have a clean RangeIndex before downstream non-PUF donor imputation runs.
- Clone mode fails clearly if the scaffold is PUF-prefixed, not CPS/ASEC-shaped, or no unambiguous PUF donor is available.
- Non-person PUF clone variables preserve entity consistency after assignment.
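A few of the structural gates above can be checked mechanically on the doubled person table. This sketch covers only equal halves, ID uniqueness, zero stored clone weights, and a clean index; the function name and failure messages are hypothetical, and the value-level gates would need the pre-clone frame as well.

```python
import pandas as pd


def precalibration_clone_failures(
    persons, flag_column="person_is_puf_clone", id_column="person_id"
):
    """Return a list of failed structural gates; an empty list means all passed."""
    if flag_column not in persons.columns:
        return [f"missing {flag_column}"]
    failures = []
    clone = persons[persons[flag_column] == 1.0]
    original = persons[persons[flag_column] == 0.0]
    if len(clone) != len(original):
        failures.append("original and clone halves differ in row count")
    if not persons[id_column].is_unique:
        failures.append(f"{id_column} is not unique across halves")
    for column in persons.columns:
        if column == "weight" or "_weight" in column:
            if (clone[column] != 0).any():
                failures.append(f"clone rows have nonzero {column}")
    if not persons.index.equals(pd.RangeIndex(len(persons))):
        failures.append("index is not a clean RangeIndex")
    return failures
```

Returning a list rather than raising on the first problem lets the sidecar report every failed gate in one pass.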

Post-calibration validation:

- Some clone rows should be activated if the PUF support is useful, but this is diagnostic, not a hard gate.
- Sound MP vs latest local eCPS comparison must improve. The release gate is still loss, not structural preferences.

## Tests

Unit tests:

1. A small synthetic CPS + PUF donor case where clone mode:
   - doubles rows,
   - offsets IDs,
   - sets clone weights to zero,
   - preserves original overlapping CPS values,
   - imputes different PUF overlap values into clone rows.
2. A regression test that non-PUF donor integration still behaves unchanged with clone mode off.
3. A source-order test: when clone mode is enabled, PUF donor support expansion happens before SIPP/SCF-style donors, and those later donors see both original and PUF clone rows.
4. A sidecar/diagnostic test for PUF support clone summary fields.
5. A variable-surface test comparing MP's included/excluded PUF clone variables to eCPS `IMPUTED_VARIABLES` and `OVERRIDDEN_IMPUTED_VARIABLES`, requiring a reason for each exclusion.
6. A non-person entity mapping test for at least one tax-unit or household PUF clone target.
7. A seed-backend guard test: clone mode fails clearly for non-seed synthesis backends until a post-synthesis path exists.
8. A source-weight diagnostics test that final PUF support activation is nonzero when clone rows receive positive calibrated weights.
9. A checkpoint/recalibration test that `person_is_puf_clone`, doubled IDs/reference columns, zero pre-calibration clone weights, and clone metadata survive save/load and are reconstructed in recalibration diagnostics from loaded tables rather than source-frame metadata.
10. An ablation test that clone mode is disabled for ordinary donor-imputation holdout variants.
11. A calibration activation test for the chosen eCPS replacement backend proving that zero-stored-weight clone rows can receive positive calibrated weight.
12. A downstream donor test where a non-PUF donor runs after clone expansion and imputes a person-native variable on both original and clone rows without index misalignment.
13. A scaffold/donor validation test for PUF-as-scaffold, missing PUF donor, and ambiguous PUF donor cases.
14. A household clone diagnostics test deriving clone household activation from person flags and detecting mixed-flag households.
15. A household-selection guard test: clone mode fails or disables selection when `policyengine_selection_household_budget` is set, unless a selector-specific clone activation test exists.

Focused integration tests:

1. Resume from the Gate-1 post-imputation checkpoint, or rebuild only as much as necessary.
2. Run export-column check on the PUF-cloned post-imputation tables.
3. Run sound MP vs latest local eCPS comparison.

## Execution Plan

1. Implement config and helper methods in `src/microplex_us/pipelines/us.py`.
2. Wire clone mode into the eCPS checkpoint runner/config and checkpoint CLI.
3. Add tests in `tests/pipelines/test_us.py` and any artifact/checkpoint tests needed.
4. Run focused tests.
5. Open a PR and link the issue.
6. Rebuild or resume Gate-1 with PUF support clone enabled.
7. Run the sound comparison against latest local eCPS as primary and HF eCPS as diagnostic.
8. If loss still exceeds eCPS, use corrected diagnostics from PR #139 to pick the next single production fix.

## Open Questions for Review

1. Which eCPS `OVERRIDDEN_IMPUTED_VARIABLES` should MP adopt as both-halves PUF-imputed variables under the pinned PolicyEngine-US export contract?
2. Should the stored clone weight be exactly zero like eCPS while the optimization initializer uses a positive internal start? My current recommendation is yes: stored data mirrors eCPS support semantics, optimizer internals avoid `log(0)` and can activate rows.
3. Should this mode become default for eCPS-shaped builds immediately? My recommendation is yes for the eCPS replacement pipeline, no for unrelated MP builds until scored.

