Add PUF support clone for MP vs eCPS replacement #140

@MaxGhenis

Reviewed plan: /Users/maxghenis/mp-ecps-tracking/mp-puf-support-clone-plan-20260601.md

Cycle status: reviewed by eCPS-fidelity and Microplex implementation-risk subagents; all actionable findings were incorporated and final reviews reported no actionable findings.

MP eCPS PUF Support Clone Plan

Date: 2026-06-01

Objective

Make the eCPS-shaped Microplex replacement use the same core PUF-support idea as the latest eCPS: keep the original CPS/ASEC support and append a PUF-imputed support copy, so that calibration can choose between CPS-reported and PUF-imputed tax/income values through the weights it assigns. This matters more than tax-unit-preservation cleanup because the current Gate-1 MP dataset has only one support surface for overlapping PUF/CPS variables such as wages, self-employment income, dividends, interest, pensions, and capital gains.

The production loss gate remains the sound MP vs eCPS comparison against latest local eCPS, with matched household count, symmetric refit, and holdout targets. The implementation is acceptable only if it moves that loss in the right direction without breaking export-column parity or checkpoint resumability.

How eCPS Does It

Relevant files:

The eCPS implementation is puf_clone_dataset(...) in policyengine-us-data.

  1. eCPS loads the CPS dataset through Microsimulation(dataset=self.cps) and converts all arrays into {variable: {period: array}}.
  2. It assigns detailed geography using assign_geography_within_state_county(...). Before cloning it writes spm_unit_spm_threshold into the CPS data dict; geography is passed separately to puf_clone_dataset(...) and then added to the doubled output.
  3. It calls puf_clone_dataset(...) with the CPS data dict, assigned geography, PUF dataset, and CPS H5 path.
  4. puf_clone_dataset runs PUF QRF imputation (_run_qrf_imputation) when a PUF dataset is available. _run_qrf_imputation returns two prediction maps that puf_clone_dataset uses while constructing the doubled data:
    • y_full: full PUF-imputed values for the PUF half.
    • y_override: values for the special variables eCPS wants on both halves.
      puf_clone_dataset itself returns the doubled data dict.
  5. It doubles every array:
    • Ordinary variables: [CPS values, CPS values].
    • ID variables: [ids, ids + max(ids)].
    • Weight variables: [weights, 0 * weights].
    • Variables in IMPUTED_VARIABLES: [CPS values, PUF-imputed values].
    • Variables already present in CPS and in OVERRIDDEN_IMPUTED_VARIABLES: [PUF-imputed override values, PUF-imputed override values].
    • IMPUTED_VARIABLES absent from CPS: [zeros, PUF-imputed values], even if the variable also appears in the override list.
    • weeks_unemployed: [CPS values, separately imputed PUF-half weeks].
  6. The QRF training path stratified-subsamples PUF to about 20,000 records, preserves the top 0.5 percent by AGI, batches imputed variables, and predicts from demographic predictors calculated by PolicyEngine.
  7. eCPS predicts on the person surface, then maps non-person variables back to their PolicyEngine entity with value_from_first_person. This matters for tax-unit, SPM-unit, household, family, or marital-unit variables whose values must be assigned consistently across persons.
  8. It saves the doubled dataset, then calibration can place positive final weights on either original CPS rows or PUF-imputed clone rows. The PUF half is stored with zero weight; calibration/reweighting paths must still initialize optimization weights in a way that can activate zero-stored-weight support rows.
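
The doubling rules in step 5 can be illustrated with a small NumPy sketch. This is not the eCPS code: the function name `double_variable` and its `kind` labels are hypothetical, and the input arrays are made up for illustration.

```python
import numpy as np


def double_variable(cps_values, kind="ordinary", puf_values=None):
    """Double one variable's array following the rules listed in step 5.

    `kind` is a label invented for this sketch, not an eCPS identifier.
    """
    cps_values = np.asarray(cps_values)
    if kind == "id":
        # Clone IDs are the original IDs shifted by the largest original ID.
        return np.concatenate([cps_values, cps_values + cps_values.max()])
    if kind == "weight":
        # The PUF half is stored with zero weight.
        return np.concatenate([cps_values, 0 * cps_values])
    if kind == "imputed":
        # CPS values on the first half, PUF-imputed values on the second.
        return np.concatenate([cps_values, np.asarray(puf_values)])
    if kind == "overridden":
        # PUF-imputed override values on both halves.
        override = np.asarray(puf_values)
        return np.concatenate([override, override])
    # Ordinary variables repeat the CPS values.
    return np.concatenate([cps_values, cps_values])


ids = double_variable([1, 2, 3], kind="id")
weights = double_variable([10.0, 20.0, 5.0], kind="weight")
wages = double_variable(
    [100.0, 0.0, 50.0], kind="imputed", puf_values=[120.0, 30.0, 0.0]
)
```

With these toy inputs, the second half of `wages` can differ from the first half row by row, which is the alternative support that calibration can later weight up or down.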

The current latest eCPS H5 reflects this. It has about twice the raw CPS households, and the second half has positive weights after calibration, meaning the optimizer did use PUF clone support.

There are two relevant PolicyEngine-US-data pathways:

  • Public ExtendedCPS.generate() loads CPS and calls puf_clone_dataset() directly. It does not run ACS/SIPP/SCF imputation first.
  • The newer unified calibration path has a source-imputation stage before PUF cloning. That is a separate pipeline design, not the public eCPS generation path.

Current MP State

Relevant file:

Current MP donor integration:

  1. Starts with a single CPS/ASEC spine frame (current = seed_data.copy()).
  2. For each donor source, prepares donor seed data and identifies:
    • shared_vars: variables observed in both scaffold and donor, used as conditioning features.
    • donor_only_vars: authoritative donor variables absent from the scaffold.
    • donor_override_vars: shared authoritative donor variables listed in donor_imputer_authoritative_override_variables.
  3. It imputes donor target blocks into the same current frame in place.
  4. For the Gate-1 MP run, overlapping PUF/CPS variables were mostly not represented as alternatives. Evidence from artifact checks: scaffold and seed values for wage_income and self_employment_income were identical, while eCPS has a clone half where these can differ.

There is partial infrastructure but not the production feature:

  • Export contracts already know person_is_puf_clone can exist as an optional eCPS-internal flag.
  • Source-weight diagnostics already mention puf_support_household_weight_share and irs_soi_puf_support_clone.
  • Donor override machinery exists, but overriding one frame is not the same as appending original plus PUF-imputed support.

Implementation Design

Add an explicit, opt-in MP PUF support clone mode for the eCPS-shaped pipeline.

Configuration

Add config fields to USMicroplexBuildConfig:

  • puf_support_clone_enabled: bool = False
  • puf_support_clone_source_prefixes: tuple[str, ...] = ("irs_soi_puf",)
  • puf_support_clone_zero_initial_weight: bool = True
  • puf_support_clone_original_flag_column: str = "person_is_puf_clone"

The eCPS replacement checkpoint runner should enable this for CPS + PUF + SIPP + SCF no-ACS builds. Other MP builds stay unchanged until intentionally opted in.

Also wire the fields explicitly through the checkpoint CLI and rebuild defaults:

  • Add CLI flags or an eCPS-shaped default that turns clone mode on for the Gate-1 CPS/PUF/SIPP/SCF no-ACS path.
  • Keep the default off for generic MP builds.
  • Persist the effective clone settings in build metadata so resumed calibration and comparison artifacts can tell whether the support clone was active.

Guardrail:

  • _integrate_donor_sources runs before synthesis. PUF support cloning there is only safe for synthesis_backend="seed" because seed synthesis preserves support rows. If clone mode is enabled with bootstrap/synthesizer backends, fail early with a clear error until a post-synthesis clone path is implemented.
  • Clone mode also requires a CPS/ASEC-shaped scaffold and exactly the expected PUF donor source in donor inputs. Fail clearly if a PUF-prefixed source is selected as scaffold, no PUF donor matches puf_support_clone_source_prefixes, or multiple ambiguous PUF donors match without explicit configuration.
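
A minimal sketch of the config fields and guardrails, under stated assumptions. `PufSupportCloneConfig` is a hypothetical stand-in for the relevant part of USMicroplexBuildConfig, `validate_clone_mode` is an illustrative helper name, and the scaffold check here only rejects PUF-prefixed scaffolds; how a scaffold is recognized as CPS/ASEC-shaped is left out. The sketch also rejects every multi-match case, whereas the plan allows multiple matches when they are explicitly configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PufSupportCloneConfig:
    """Hypothetical stand-in for the clone fields on USMicroplexBuildConfig."""

    puf_support_clone_enabled: bool = False
    puf_support_clone_source_prefixes: tuple[str, ...] = ("irs_soi_puf",)
    puf_support_clone_zero_initial_weight: bool = True
    puf_support_clone_original_flag_column: str = "person_is_puf_clone"
    synthesis_backend: str = "seed"


def validate_clone_mode(config, scaffold_name, donor_names):
    """Return the single matching PUF donor name, or raise a clear error."""
    if not config.puf_support_clone_enabled:
        return None
    prefixes = config.puf_support_clone_source_prefixes
    if config.synthesis_backend != "seed":
        raise ValueError(
            "PUF support clone requires synthesis_backend='seed' "
            f"(got {config.synthesis_backend!r})"
        )
    if scaffold_name.startswith(prefixes):
        raise ValueError(f"scaffold {scaffold_name!r} is a PUF source")
    matches = [name for name in donor_names if name.startswith(prefixes)]
    if not matches:
        raise ValueError(f"no donor matches PUF prefixes {prefixes}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous PUF donors: {matches}")
    return matches[0]


config = PufSupportCloneConfig(puf_support_clone_enabled=True)
puf_donor = validate_clone_mode(config, "cps_asec", ["sipp", "irs_soi_puf", "scf"])
```

The source names `cps_asec`, `sipp`, and `scf` in the usage lines are placeholders, not confirmed provider names.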

Row Construction

When _integrate_donor_sources reaches the PUF donor and clone mode is enabled:

  1. Save original_current = current.copy().
  2. Build clone_current = original_current.copy().
  3. Offset entity IDs in clone_current so original and clone rows do not collide:
    • person_id
    • household_id
    • tax_unit_id
    • family_id
    • spm_unit_id
    • marital_unit_id
    • membership/reference columns such as person_household_id, person_tax_unit_id, person_spm_unit_id, or any other column containing _id that points into the cloned support.
      Offset should be deterministic and safe for numeric IDs. If string IDs appear, use a stable suffix strategy. The mapping must be consistent across entity IDs and reference columns so clone people point to clone households/tax units/SPM units, not original entities.
  4. Set all stored clone weight columns to zero by default:
    • weight
    • household_weight if present
    • hh_weight if present
    • any other column containing _weight, including person_weight, tax_unit_weight, and spm_unit_weight if present.
  5. Add person_is_puf_clone = 0.0 to original rows and 1.0 to clone rows.
  6. Run existing PUF donor imputation on clone_current only, not on original_current.
  7. Align columns before concatenation:
    • Original rows get their original CPS value for ordinary overlap variables.
    • Original rows get the PUF-imputed value for any pre-clone overlap variable included in MP's explicit eCPS-style both-halves override list.
    • Original rows get 0.0 for donor-only PUF target variables that did not exist before cloning.
    • Clone rows get the imputed PUF value for included variables.
    • No included target column may become NaN from a missing half.
  8. Concatenate [original_current, imputed_clone_current] and immediately reset_index(drop=True). Duplicate original/clone indexes would break later person-native donor assignments that rely on pandas index alignment.

This mirrors eCPS’s support structure: original CPS support remains available and PUF-imputed support is added as zero-initial-weight rows that calibration can activate.
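
The row-construction steps can be sketched with pandas as follows. This is an illustration under assumptions, not the MP implementation: `build_puf_support_clone` and `ENTITY_ID_COLUMNS` are hypothetical names, `impute_puf` is a placeholder callable standing in for the existing PUF donor imputation, only numeric IDs are handled, and the both-halves override list from step 7 is omitted.

```python
import pandas as pd

# Hypothetical mapping from each entity to its ID column and reference columns.
ENTITY_ID_COLUMNS = {
    "person": ["person_id"],
    "household": ["household_id", "person_household_id"],
    "tax_unit": ["tax_unit_id", "person_tax_unit_id"],
}


def build_puf_support_clone(current, impute_puf, flag="person_is_puf_clone"):
    original = current.copy()
    clone = original.copy()
    # One offset per entity, shared by the ID column and its reference columns.
    for columns in ENTITY_ID_COLUMNS.values():
        present = [c for c in columns if c in clone.columns]
        if not present:
            continue
        offset = int(max(original[c].max() for c in present)) + 1
        for column in present:
            clone[column] = original[column] + offset
    # Zero every stored weight column on the clone half.
    for column in clone.columns:
        if column == "weight" or "_weight" in column:
            clone[column] = 0.0
    original[flag] = 0.0
    clone[flag] = 1.0
    # PUF imputation touches the clone half only.
    clone = impute_puf(clone)
    # Donor-only targets that did not exist before cloning are zero on the original half.
    for column in clone.columns.difference(original.columns):
        original[column] = 0.0
    combined = pd.concat([original, clone[original.columns]])
    # A clean RangeIndex keeps later index-aligned donor assignments safe.
    return combined.reset_index(drop=True)


def fake_impute(frame):
    """Toy imputer for the example: changes one overlap column, adds one donor-only column."""
    out = frame.copy()
    out["employment_income"] = out["employment_income"] * 2
    out["long_term_capital_gains"] = 5.0
    return out


cps = pd.DataFrame({
    "person_id": [1, 2, 3],
    "household_id": [1, 1, 2],
    "weight": [100.0, 100.0, 50.0],
    "employment_income": [10.0, 20.0, 30.0],
})
combined = build_puf_support_clone(cps, fake_impute)
```

The sketch offsets by `max + 1` rather than `max` so that an ID of zero cannot collide with an existing ID; the plan itself only requires that the offset be deterministic and safe.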

PUF Variable Surface

For the clone half, PUF should impute both donor-only variables and overlapping variables where PUF is authoritative enough to provide alternative support. The implementation should not depend only on one small handpicked list; eCPS imputes a broad tax surface, and MP needs either parity with that broad surface where the variables are data-owned or an explicit exclusion table for every missing variable.

Initial variable selection:

  • Start from the existing PUF donor target selection:
    • donor-only authoritative variables.
    • shared authoritative variables already listed in donor_imputer_authoritative_override_variables.
  • Add a PUF-clone-specific default overlap list modeled on the full eCPS IMPUTED_VARIABLES surface, constrained to variables present in both MP current and PUF donor seed and allowed by source authority. This includes the major income variables below plus deductions, credits, QBI inputs, HSA/student loan/charitable variables, and qualified-income flags where MP has a data-owned column:
    • employment_income
    • self_employment_income
    • social_security
    • taxable_pension_income
    • taxable_interest_income
    • tax_exempt_interest_income
    • qualified_dividend_income
    • non_qualified_dividend_income
    • long_term_capital_gains
    • short_term_capital_gains
    • rental_income
    • farm_income
    • partnership_s_corp_income
    • taxable_ira_distributions
    • taxable_unemployment_compensation
    • traditional_ira_contributions
    • self_employed_health_insurance_ald
    • self_employed_pension_contribution_ald
    • wage_income, dividend_income, and capital_gains only if they are present as MP scaffold aliases or support fields mapped to the corresponding PolicyEngine variables above.

Also implement an explicit MP equivalent of eCPS's OVERRIDDEN_IMPUTED_VARIABLES category for variables that should be PUF-imputed on both halves. This should be a named list with tests and comments. The default should mirror eCPS only where the variable is data-owned rather than formula-owned under the pinned PolicyEngine-US export contract, and only apply the both-halves override to variables that are already present in the pre-clone frame. If a variable is donor-only/absent before cloning, it should be added as [zero original half, PUF-imputed clone half] unless MP explicitly documents a different reason.

Add a generated or checked puf_support_clone_variable_surface sidecar/table with:

  • eCPS IMPUTED_VARIABLES
  • eCPS OVERRIDDEN_IMPUTED_VARIABLES
  • eCPS special PUF-half variables outside those lists, currently weeks_unemployed
  • MP-supported included variables
  • MP-excluded variables
  • exclusion reason (formula_owned, missing_policyengine_us_variable, missing_puf_source_column, unsupported_entity_mapping, or intentional_divergence)

The release path should not silently omit eCPS PUF variables.
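
One way to enforce that rule is to build the surface table from the eCPS lists and fail when any variable has neither an inclusion nor a valid exclusion reason. The sketch below assumes a hypothetical helper `build_variable_surface`; the variable names passed in are illustrative and do not represent decisions made by this plan.

```python
ALLOWED_REASONS = {
    "formula_owned",
    "missing_policyengine_us_variable",
    "missing_puf_source_column",
    "unsupported_entity_mapping",
    "intentional_divergence",
}


def build_variable_surface(ecps_variables, mp_included, mp_exclusions):
    """Return one row per eCPS PUF variable, or raise if any is silently omitted."""
    rows, unexplained = [], []
    for name in sorted(set(ecps_variables)):
        if name in mp_included:
            rows.append({"variable": name, "status": "included", "reason": None})
        elif mp_exclusions.get(name) in ALLOWED_REASONS:
            rows.append(
                {"variable": name, "status": "excluded", "reason": mp_exclusions[name]}
            )
        else:
            unexplained.append(name)
    if unexplained:
        raise ValueError(f"eCPS PUF variables without an MP decision: {unexplained}")
    return rows


# Illustrative inputs only; "example_formula_variable" is a made-up name.
surface = build_variable_surface(
    ecps_variables=["employment_income", "rental_income", "example_formula_variable"],
    mp_included={"employment_income", "rental_income"},
    mp_exclusions={"example_formula_variable": "formula_owned"},
)
```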

Entity mapping requirement:

  • Resolve each PUF-clone target variable's PolicyEngine entity under the pinned policyengine-us.
  • If the donor imputation engine generates at person level for a non-person variable, map values back to the correct entity consistently before assignment, matching eCPS's value_from_first_person pattern.
  • If MP's block engine already projects a block to the target entity, record that in diagnostics and test it so tax-unit/SPM/household variables do not get person-shape semantics accidentally.
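
As a rough pandas analogue of the first-person mapping described above (an assumption about the pattern, not a copy of eCPS's value_from_first_person), a person-level prediction can be reduced to one value per entity and then broadcast back to members. The function name and inputs are hypothetical.

```python
import pandas as pd


def first_person_value_by_entity(person_values, person_entity_ids):
    """Take each entity's first-person value and broadcast it to all members."""
    frame = pd.DataFrame(
        {"entity_id": list(person_entity_ids), "value": list(person_values)}
    )
    # Note: GroupBy.first() returns the first non-null value in each group.
    entity_values = frame.groupby("entity_id", sort=False)["value"].first()
    person_broadcast = frame["entity_id"].map(entity_values)
    return entity_values, person_broadcast


# Two people in tax unit 10 with conflicting person-level predictions, one in 11.
entity_values, person_broadcast = first_person_value_by_entity(
    [100.0, 999.0, 50.0], [10, 10, 11]
)
```

Here the second person's conflicting prediction is discarded, so every member of tax unit 10 carries the same value after assignment.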

Do not blindly import eCPS variable semantics. MP should use its own source manifests and PolicyEngine export contract, with eCPS as the reference for the support-clone architecture.

Ordering

Required ordering for immediate eCPS replacement:

  1. CPS scaffold.
  2. PUF support clone immediately after CPS, creating the original CPS half and a zero-stored-weight PUF-imputed support half.
  3. Downstream non-PUF donor imputations over the expanded support universe (SIPP/SCF now; ACS donor later if enabled). This lets SIPP/SCF fill benefits/assets on both CPS-original and PUF-alternative rows.
  4. PE table materialization.
  5. Microsim target materialization.
  6. Calibration.

This deliberately treats PUF as the first support-expanding donor: the original multispine rather than a late overwrite. It is not a literal copy of either PolicyEngine-US-data pathway. Public eCPS does CPS -> PUF clone and has no SIPP/SCF stage; the unified calibration path does source imputation before PUF clone. MP's architecture should use the eCPS support-clone mechanism but place it where it is most useful for our donor pipeline.

Donor ordering must be observable, not only behavioral:

  • When clone mode is enabled, _integrate_donor_sources must actively split donor inputs into PUF-clone donors matching puf_support_clone_source_prefixes and non-PUF donors, process the PUF-clone donors first, then process later donors over the expanded support.
  • Do not rely on provider order, even though the current default provider bundle happens to list CPS then PUF then SIPP/SCF.
  • Validate that the scaffold is CPS/ASEC-shaped before splitting. The current scaffold chooser is score-based, so clone mode must not assume PUF will always be a donor unless it checks.
  • Return processed_donor_source_order and puf_clone_source_order in the donor-integration result.
  • Persist those fields in metadata/sidecars alongside fusion_plan.source_names, because the input source order can differ from the actual processing order once PUF-first support expansion is active.
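
The active split can be sketched as a stable partition on the configured prefixes. `order_donors` is a hypothetical helper, and the donor names in the usage lines are placeholders; only the two returned field names come from the plan.

```python
def order_donors(donor_names, puf_prefixes=("irs_soi_puf",)):
    """Put PUF-clone donors first, keeping the caller's order within each group."""
    prefixes = tuple(puf_prefixes)
    puf_donors = [name for name in donor_names if name.startswith(prefixes)]
    other_donors = [name for name in donor_names if not name.startswith(prefixes)]
    return {
        "puf_clone_source_order": puf_donors,
        "processed_donor_source_order": puf_donors + other_donors,
    }


# A deliberately misordered donor list still yields PUF-first processing.
order = order_donors(["sipp", "irs_soi_puf", "scf"])
```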

Checkpoints and Sidecars

This must preserve restartability:

  • Post-imputation checkpoint should contain the doubled person and household support after PUF clone.
  • Post-microsim checkpoint should work unchanged.
  • Clone metadata must either be persisted in checkpoint metadata or be reconstructable from checkpoint tables using person_is_puf_clone.
  • Recalibration-from-checkpoint must preserve/report clone diagnostics instead of losing them when source frames are unavailable.
  • Add a small puf_support_clone_summary.json sidecar and include the main figures in existing source-weight diagnostics:
    • enabled flag
    • donor source name
    • original row count
    • clone row count
    • final row count
    • original weight sum
    • clone initial weight sum
    • clone variable count
    • clone overlap variable count
    • clone source order
    • variable-surface parity/exclusion counts
    • non-person variable entity mapping mode
    • whether optimization initialized the clone support with activatable nonzero internal weights even though stored source weights are zero
    • final clone household count, activated household count, clone household weight sum, and clone household weight share after calibration
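
A minimal sketch of writing the sidecar, covering only a subset of the fields above. The JSON key names and the `write_clone_summary` helper are assumptions for illustration; the plan does not fix a schema.

```python
import json
from pathlib import Path


def write_clone_summary(path, enabled, donor_source, original_rows, clone_rows,
                        original_weight_sum, clone_initial_weight_sum):
    """Write a small JSON summary of the clone and return it as a dict."""
    summary = {
        "enabled": enabled,
        "donor_source_name": donor_source,
        "original_row_count": original_rows,
        "clone_row_count": clone_rows,
        "final_row_count": original_rows + clone_rows,
        "original_weight_sum": original_weight_sum,
        "clone_initial_weight_sum": clone_initial_weight_sum,
    }
    # Sorted keys keep the sidecar stable across reruns and easy to diff.
    Path(path).write_text(json.dumps(summary, indent=2, sort_keys=True))
    return summary
```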

Existing source-weight diagnostics currently assume donor rows contribute no separate support weight; update them to compute PUF support activation from person_is_puf_clone after calibration.

Household-level PUF clone diagnostics:

  • Keep person_is_puf_clone on the person table.
  • Derive clone households by grouping persons by household_id; a clone household should have all members flagged as person_is_puf_clone == 1.
  • Join those clone household IDs to the calibrated household table to compute clone household count, activated count, weight sum, and weight share.
  • If a household has mixed clone flags, report it as a diagnostic error because ID offsetting or household construction has broken clone integrity.
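
These household-level diagnostics can be sketched with a groupby on the person table. The function name and the `household_weight` column name on the household table are assumptions; the data is a toy example.

```python
import pandas as pd


def clone_household_diagnostics(persons, households, flag="person_is_puf_clone"):
    """Derive clone households from person flags and summarize calibrated weights."""
    flags = persons.groupby("household_id")[flag].agg(["min", "max"])
    # A household whose members disagree on the flag has broken clone integrity.
    mixed = flags.index[flags["min"] != flags["max"]].tolist()
    clone_ids = flags.index[flags["min"] == 1.0]
    is_clone = households["household_id"].isin(clone_ids)
    clone_weight = households.loc[is_clone, "household_weight"]
    total = float(households["household_weight"].sum())
    return {
        "clone_household_count": int(is_clone.sum()),
        "activated_clone_household_count": int((clone_weight > 0).sum()),
        "clone_household_weight_sum": float(clone_weight.sum()),
        "clone_household_weight_share": float(clone_weight.sum()) / total if total > 0 else 0.0,
        "mixed_flag_household_ids": mixed,
    }


persons = pd.DataFrame({
    "household_id": [1, 1, 2, 2, 3, 3],
    "person_is_puf_clone": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})
households = pd.DataFrame({
    "household_id": [1, 2, 3],
    "household_weight": [60.0, 40.0, 0.0],
})
diagnostics = clone_household_diagnostics(persons, households)
```

In this toy data, household 3 mixes flags and is reported as an integrity error rather than counted as a clone household.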

Calibration activation:

  • Stored source weights should mirror eCPS: clone support rows start at zero.
  • Calibration internals must nevertheless be able to activate clone rows. PE L0 already floors zero initial weights; for other backends, either calibrate_policyengine_tables needs a clone-aware internal weight floor/warm start, or clone mode must be restricted to proven backends.
  • Add a focused test where a zero-stored-weight clone row receives positive calibrated household weight under the configured eCPS replacement backend.
  • Household-budget selection runs before final calibration and can discard zero-stored-weight clone households. When policyengine_selection_household_budget is set, clone mode must either disable selection, fail clearly, or prove that the selector also uses a clone-aware positive internal floor and keeps clone support eligible.
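
The internal floor can be illustrated for an optimizer that works in log-weight space. This is a sketch only: the floor value and the function name are assumptions, and it does not describe how PE L0 or any other backend actually initializes weights.

```python
import numpy as np


def initial_log_weights(stored_weights, floor=1e-3):
    """Floor stored weights before taking logs so zero-weight rows stay activatable."""
    weights = np.asarray(stored_weights, dtype=float)
    return np.log(np.maximum(weights, floor))


stored = np.array([1500.0, 800.0, 0.0, 0.0])  # last two rows are clone rows
log_w = initial_log_weights(stored)
```

The stored weights are left untouched, so the saved dataset still mirrors eCPS's zero-weight clone half while the optimizer starts from finite values.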

Imputation ablations:

  • The donor-imputation ablation runner calls _integrate_donor_sources directly for variant scoring. Disable PUF support clone inside ablation variant configs unless the ablation is specifically testing support expansion; otherwise the evidence path will compare different support universes.

Validation Gates

Pre-calibration validation:

  • If clone mode is enabled, synthesis_backend must be seed until a post-synthesis clone path exists.
  • Export-column contract still passes from the post-imputation tables before calibration.
  • Dataset row counts are doubled only for the PUF clone path.
  • person_is_puf_clone exists in the cloned output and is treated by the export contract as optional, not required.
  • Original rows preserve CPS values for ordinary overlapping variables.
  • Variables in the explicit PUF-both-halves override list are PUF-imputed on both halves, matching eCPS behavior where we intentionally adopt it.
  • Variables absent before cloning but included in PUF clone support have zero original-half values and PUF-imputed clone-half values, unless an explicit MP exclusion/divergence entry explains otherwise.
  • weeks_unemployed is either handled like eCPS, with original-half CPS values and PUF-half imputed values tied to unemployment compensation support, or has an explicit exclusion/divergence reason.
  • Clone rows differ from original rows for at least high-signal PUF overlap variables (self_employment_income, employment_income or wage_income, taxable_interest_income, qualified_dividend_income, capital_gains) when donor support exists.
  • Clone rows start with zero values in every stored weight column.
  • The calibration/refit initializer can activate clone rows; exact zero stored source weights must not become log(0) or permanent zero final weights.
  • Household-budget selection is disabled for clone mode unless it has its own clone-aware activation/eligibility test.
  • Entity ID uniqueness holds across original and clone rows, and clone reference columns point to clone entities.
  • PUF-first ordering is enforced even if the caller provides donors in a deliberately misordered list.
  • Concatenated original+clone rows have a clean RangeIndex before downstream non-PUF donor imputation runs.
  • Clone mode fails clearly if the scaffold is PUF-prefixed, not CPS/ASEC-shaped, or no unambiguous PUF donor is available.
  • Non-person PUF clone variables preserve entity consistency after assignment.
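
A few of the structural gates above can be sketched as a single check over the post-imputation person table. `check_clone_invariants` is a hypothetical helper that covers only row doubling, ID uniqueness, the clean RangeIndex, and zero stored clone weights.

```python
import pandas as pd


def check_clone_invariants(persons, id_column="person_id", flag="person_is_puf_clone"):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    if flag not in persons.columns:
        return [f"missing {flag} column"]
    failures = []
    clone = persons[persons[flag] == 1.0]
    if 2 * len(clone) != len(persons):
        failures.append("row count is not doubled")
    if not persons[id_column].is_unique:
        failures.append(f"{id_column} is not unique across halves")
    if not persons.index.equals(pd.RangeIndex(len(persons))):
        failures.append("index is not a clean RangeIndex")
    for column in persons.columns:
        if (column == "weight" or "_weight" in column) and (clone[column] != 0).any():
            failures.append(f"clone rows have nonzero stored {column}")
    return failures


good = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "person_is_puf_clone": [0.0, 0.0, 1.0, 1.0],
    "weight": [5.0, 7.0, 0.0, 0.0],
})
# Same data, but with the duplicate index left by a concat without reset_index
# and one clone row carrying a stored weight.
bad = pd.DataFrame(
    {
        "person_id": [1, 2, 3, 4],
        "person_is_puf_clone": [0.0, 0.0, 1.0, 1.0],
        "weight": [5.0, 7.0, 1.0, 0.0],
    },
    index=[0, 1, 0, 1],
)
good_failures = check_clone_invariants(good)
bad_failures = check_clone_invariants(bad)
```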

Post-calibration validation:

  • Some clone rows should be activated if the PUF support is useful, but this is diagnostic, not a hard gate.
  • The sound MP vs latest local eCPS comparison must improve. The release gate is still loss, not structural preferences.

Tests

Unit tests:

  1. A small synthetic CPS + PUF donor case where clone mode:
    • doubles rows,
    • offsets IDs,
    • sets clone weights to zero,
    • preserves original overlapping CPS values,
    • imputes different PUF overlap values into clone rows.
  2. A regression test that non-PUF donor integration still behaves unchanged with clone mode off.
  3. A source-order test: when clone mode is enabled, PUF donor support expansion happens before SIPP/SCF-style donors, and those later donors see both original and PUF clone rows.
  4. A sidecar/diagnostic test for PUF support clone summary fields.
  5. A variable-surface test comparing MP's included/excluded PUF clone variables to eCPS IMPUTED_VARIABLES and OVERRIDDEN_IMPUTED_VARIABLES, requiring a reason for each exclusion.
  6. A non-person entity mapping test for at least one tax-unit or household PUF clone target.
  7. A seed-backend guard test: clone mode fails clearly for non-seed synthesis backends until a post-synthesis path exists.
  8. A source-weight diagnostics test that final PUF support activation is nonzero when clone rows receive positive calibrated weights.
  9. A checkpoint/recalibration test that person_is_puf_clone, doubled IDs/reference columns, zero pre-calibration clone weights, and clone metadata survive save/load and are reconstructed in recalibration diagnostics from loaded tables rather than source-frame metadata.
  10. An ablation test that clone mode is disabled for ordinary donor-imputation holdout variants.
  11. A calibration activation test for the chosen eCPS replacement backend proving that zero-stored-weight clone rows can receive positive calibrated weight.
  12. A downstream donor test where a non-PUF donor runs after clone expansion and imputes a person-native variable on both original and clone rows without index misalignment.
  13. A scaffold/donor validation test for PUF-as-scaffold, missing PUF donor, and ambiguous PUF donor cases.
  14. A household clone diagnostics test deriving clone household activation from person flags and detecting mixed-flag households.
  15. A household-selection guard test: clone mode fails or disables selection when policyengine_selection_household_budget is set, unless a selector-specific clone activation test exists.

Focused integration tests:

  1. Resume from the Gate-1 post-imputation checkpoint, or rebuild only as much as necessary.
  2. Run export-column check on the PUF-cloned post-imputation tables.
  3. Run sound MP vs latest local eCPS comparison.

Execution Plan

  1. Implement config and helper methods in src/microplex_us/pipelines/us.py.
  2. Wire clone mode into the eCPS checkpoint runner/config and checkpoint CLI.
  3. Add tests in tests/pipelines/test_us.py and any artifact/checkpoint tests needed.
  4. Run focused tests.
  5. Open a PR and link the issue.
  6. Rebuild or resume Gate-1 with PUF support clone enabled.
  7. Run the sound comparison against latest local eCPS as primary and HF eCPS as diagnostic.
  8. If loss still exceeds eCPS, use the corrected diagnostics from PR #139 ("Harden MP vs eCPS comparison diagnostics") to pick the next single production fix.

Open Questions for Review

  1. Which eCPS OVERRIDDEN_IMPUTED_VARIABLES should MP adopt as both-halves PUF-imputed variables under the pinned PolicyEngine-US export contract?
  2. Should the stored clone weight be exactly zero like eCPS while the optimization initializer uses a positive internal start? My current recommendation is yes: stored data mirrors eCPS support semantics, optimizer internals avoid log(0) and can activate rows.
  3. Should this mode become default for eCPS-shaped builds immediately? My recommendation is yes for the eCPS replacement pipeline, no for unrelated MP builds until scored.

Metadata

Labels: enhancement
