Cycle status: reviewed by eCPS-fidelity and Microplex implementation-risk subagents; all actionable findings were incorporated and final reviews reported no actionable findings.
MP eCPS PUF Support Clone Plan
Date: 2026-06-01
Objective
Make the eCPS-shaped Microplex replacement use the same core PUF-support idea as latest eCPS: keep the original CPS/ASEC support and append a PUF-imputed support copy so calibration can choose between CPS-reported and PUF-imputed tax/income values by assigning weights. This is more important than tax-unit-preservation cleanup because the current Gate-1 MP dataset has only one support surface for overlapping PUF/CPS variables such as wages, self-employment income, dividends, interest, pensions, and capital gains.
The production loss gate remains the sound MP vs eCPS comparison against latest local eCPS, with matched household count, symmetric refit, and holdout targets. The implementation is acceptable only if it moves that loss in the right direction without breaking export-column parity or checkpoint resumability.
How eCPS Does It
The eCPS implementation is puf_clone_dataset(...) in policyengine-us-data.
eCPS loads the CPS dataset through Microsimulation(dataset=self.cps) and converts all arrays into {variable: {period: array}}.
It assigns detailed geography using assign_geography_within_state_county(...). Before cloning it writes spm_unit_spm_threshold into the CPS data dict; geography is passed separately to puf_clone_dataset(...) and then added to the doubled output.
It calls puf_clone_dataset(...) with the CPS data dict, assigned geography, PUF dataset, and CPS H5 path.
puf_clone_dataset runs PUF QRF imputation (_run_qrf_imputation) when a PUF dataset is available. _run_qrf_imputation returns two prediction maps that puf_clone_dataset uses while constructing the doubled data:
y_full: full PUF-imputed values for the PUF half.
y_override: values for the special variables eCPS wants on both halves. puf_clone_dataset itself returns the doubled data dict.
It doubles every array:
Ordinary variables: [CPS values, CPS values].
ID variables: [ids, ids + max(ids)].
Weight variables: [weights, 0 * weights].
Variables in IMPUTED_VARIABLES: [CPS values, PUF-imputed values].
Variables already present in CPS and in OVERRIDDEN_IMPUTED_VARIABLES: [PUF-imputed override values, PUF-imputed override values].
IMPUTED_VARIABLES absent from CPS: [zeros, PUF-imputed values], even if the variable also appears in the override list.
weeks_unemployed: [CPS values, separately imputed PUF-half weeks].
The QRF training path stratified-subsamples PUF to about 20,000 records, preserves the top 0.5 percent by AGI, batches imputed variables, and predicts from demographic predictors calculated by PolicyEngine.
eCPS predicts on the person surface, then maps non-person variables back to their PolicyEngine entity with value_from_first_person. This matters for tax-unit, SPM-unit, household, family, or marital-unit variables whose values must be assigned consistently across persons.
It saves the doubled dataset, then calibration can place positive final weights on either original CPS rows or PUF-imputed clone rows. The PUF half is stored with zero weight; calibration/reweighting paths must still initialize optimization weights in a way that can activate zero-stored-weight support rows.
The current latest eCPS H5 reflects this. It has about twice the raw CPS households, and the second half has positive weights after calibration, meaning the optimizer did use PUF clone support.
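The doubling rules listed above can be sketched with NumPy. This is a minimal illustration, not the eCPS source: `double_variable` and its `kind` labels are hypothetical names introduced here, and the real `puf_clone_dataset` works on `{variable: {period: array}}` dicts rather than bare arrays.

```python
import numpy as np


def double_variable(kind, cps_values=None, puf_values=None, override_values=None):
    """Return [original half, clone half] for one variable (hypothetical helper)."""
    if kind == "ordinary":
        halves = (cps_values, cps_values)
    elif kind == "id":
        # Clone IDs are shifted by the largest original ID.
        halves = (cps_values, cps_values + cps_values.max())
    elif kind == "weight":
        # The clone half is stored with zero weight.
        halves = (cps_values, 0 * cps_values)
    elif kind == "imputed":
        halves = (cps_values, puf_values)
    elif kind == "overridden":
        halves = (override_values, override_values)
    elif kind == "imputed_absent_from_cps":
        halves = (np.zeros_like(puf_values), puf_values)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return np.concatenate(halves)


ids = double_variable("id", np.array([1, 2, 3]))
weights = double_variable("weight", np.array([10.0, 20.0, 30.0]))
wages = double_variable(
    "imputed",
    np.array([50.0, 0.0, 80.0]),
    puf_values=np.array([55.0, 5.0, 70.0]),
)
```

Because the clone half carries zero stored weight, the doubled dataset reproduces the original CPS totals until calibration assigns weight to clone rows.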
There are two relevant PolicyEngine-US-data pathways:
Public ExtendedCPS.generate() loads CPS and calls puf_clone_dataset() directly. It does not run ACS/SIPP/SCF imputation first.
The newer unified calibration path has a source-imputation stage before PUF cloning. That is a separate pipeline design, not the public eCPS generation path.
Current MP State
Current MP donor integration:
Starts with a single CPS/ASEC spine frame (current = seed_data.copy()).
For each donor source, prepares donor seed data and identifies:
shared_vars: variables observed in both scaffold and donor, used as conditioning features.
donor_only_vars: authoritative donor variables absent from the scaffold.
donor_override_vars: shared authoritative donor variables listed in donor_imputer_authoritative_override_variables.
It imputes donor target blocks into the same current frame in place.
For the Gate-1 MP run, overlapping PUF/CPS variables were mostly not represented as alternatives. Evidence from artifact checks: scaffold and seed values for wage_income and self_employment_income were identical, while eCPS has a clone half where these can differ.
There is partial infrastructure but not the production feature:
Export contracts already know person_is_puf_clone can exist as an optional eCPS-internal flag.
Source-weight diagnostics already mention puf_support_household_weight_share and irs_soi_puf_support_clone.
Donor override machinery exists, but overriding one frame is not the same as appending original plus PUF-imputed support.
Implementation Design
Add an explicit, opt-in MP PUF support clone mode for the eCPS-shaped pipeline.
Configuration
Add config fields to USMicroplexBuildConfig:
puf_support_clone_enabled: bool = False
puf_support_clone_source_prefixes: tuple[str, ...] = ("irs_soi_puf",)
puf_support_clone_zero_initial_weight: bool = True
puf_support_clone_original_flag_column: str = "person_is_puf_clone"
The eCPS replacement checkpoint runner should enable this for CPS + PUF + SIPP + SCF no-ACS builds. Other MP builds stay unchanged until intentionally opted in.
Also wire the fields explicitly through the checkpoint CLI and rebuild defaults:
Add CLI flags or an eCPS-shaped default that turns clone mode on for the Gate-1 CPS/PUF/SIPP/SCF no-ACS path.
Keep the default off for generic MP builds.
Persist the effective clone settings in build metadata so resumed calibration and comparison artifacts can tell whether the support clone was active.
Guardrail:
_integrate_donor_sources runs before synthesis. PUF support cloning there is only safe for synthesis_backend="seed" because seed synthesis preserves support rows. If clone mode is enabled with bootstrap/synthesizer backends, fail early with a clear error until a post-synthesis clone path is implemented.
Clone mode also requires a CPS/ASEC-shaped scaffold and exactly the expected PUF donor source in donor inputs. Fail clearly if a PUF-prefixed source is selected as scaffold, no PUF donor matches puf_support_clone_source_prefixes, or multiple ambiguous PUF donors match without explicit configuration.
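A minimal sketch of these guardrails follows. `CloneSettings` and `validate_clone_mode` are hypothetical stand-ins for the relevant build-config fields and the early check; the source names in the usage lines are illustrative, and the positive "CPS/ASEC-shaped" scaffold check is left out because the plan does not define how it is detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneSettings:
    """Hypothetical subset of the build config relevant to clone mode."""

    puf_support_clone_enabled: bool = False
    puf_support_clone_source_prefixes: tuple = ("irs_soi_puf",)
    synthesis_backend: str = "seed"


def validate_clone_mode(settings, scaffold_name, donor_names):
    """Return the single PUF donor to clone from, or None when clone mode is off."""
    if not settings.puf_support_clone_enabled:
        return None
    if settings.synthesis_backend != "seed":
        raise ValueError(
            "PUF support clone requires synthesis_backend='seed' until a "
            "post-synthesis clone path exists"
        )
    prefixes = tuple(settings.puf_support_clone_source_prefixes)
    if scaffold_name.startswith(prefixes):
        raise ValueError(f"scaffold {scaffold_name!r} is PUF-prefixed")
    matches = [name for name in donor_names if name.startswith(prefixes)]
    if not matches:
        raise ValueError("no PUF donor matches puf_support_clone_source_prefixes")
    if len(matches) > 1:
        raise ValueError(f"ambiguous PUF donors: {matches}")
    return matches[0]


settings = CloneSettings(puf_support_clone_enabled=True)
puf_donor = validate_clone_mode(
    settings, "cps_asec", ["irs_soi_puf_2024", "sipp", "scf"]
)
```

Raising before any donor work starts keeps a misconfigured build from producing a half-cloned checkpoint.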
Row Construction
When _integrate_donor_sources reaches the PUF donor and clone mode is enabled:
Save original_current = current.copy().
Build clone_current = original_current.copy().
Offset entity IDs in clone_current so original and clone rows do not collide:
person_id
household_id
tax_unit_id
family_id
spm_unit_id
marital_unit_id
membership/reference columns such as person_household_id, person_tax_unit_id, person_spm_unit_id, or any other column containing _id that points into the cloned support.
Offset should be deterministic and safe for numeric IDs. If string IDs appear, use a stable suffix strategy. The mapping must be consistent across entity IDs and reference columns so clone people point to clone households/tax units/SPM units, not original entities.
Set all stored clone weight columns to zero by default:
weight
household_weight if present
hh_weight if present
any other column containing _weight, including person_weight, tax_unit_weight, and spm_unit_weight if present.
Add person_is_puf_clone = 0.0 to original rows and 1.0 to clone rows.
Run existing PUF donor imputation on clone_current only, not on original_current.
Align columns before concatenation:
Original rows get their original CPS value for ordinary overlap variables.
Original rows get the PUF-imputed value for any pre-clone overlap variable included in MP's explicit eCPS-style both-halves override list.
Original rows get 0.0 for donor-only PUF target variables that did not exist before cloning.
Clone rows get the imputed PUF value for included variables.
No included target column may become NaN from a missing half.
Concatenate [original_current, imputed_clone_current] and immediately reset_index(drop=True). Duplicate original/clone indexes would break later person-native donor assignments that rely on pandas index alignment.
This mirrors eCPS’s support structure: original CPS support remains available and PUF-imputed support is added as zero-initial-weight rows that calibration can activate.
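The row-construction steps above can be sketched in pandas. This is an illustration under stated assumptions, not the MP implementation: `build_puf_support_clone` and the `impute_puf` callable are hypothetical, IDs are assumed numeric, a single shared offset is used so reference columns stay consistent with entity IDs, and the both-halves override list is omitted.

```python
import pandas as pd


def build_puf_support_clone(current, impute_puf, flag_column="person_is_puf_clone"):
    """Return original rows plus a zero-weight, PUF-imputed clone of them."""
    original = current.copy()
    clone = original.copy()

    id_cols = [c for c in clone.columns if "_id" in c]
    weight_cols = [c for c in clone.columns if c == "weight" or "_weight" in c]

    if id_cols:
        # One shared offset keeps person -> household/tax-unit references aligned.
        offset = int(original[id_cols].to_numpy().max()) + 1
        clone[id_cols] = clone[id_cols] + offset
    if weight_cols:
        clone[weight_cols] = 0.0

    original[flag_column] = 0.0
    clone[flag_column] = 1.0

    # Imputation runs on the clone half only.
    imputed_clone = impute_puf(clone)

    # Donor-only targets that did not exist before cloning get 0.0 on the original half.
    for col in imputed_clone.columns.difference(original.columns):
        original[col] = 0.0

    combined = pd.concat([original, imputed_clone], sort=False)
    return combined.reset_index(drop=True)


current = pd.DataFrame(
    {
        "person_id": [1, 2],
        "household_id": [1, 1],
        "weight": [100.0, 100.0],
        "employment_income": [50.0, 0.0],
    }
)


def fake_impute(frame):
    # Stand-in for the PUF donor imputation; values are arbitrary.
    return frame.assign(
        employment_income=frame["employment_income"] + 7.0,
        partnership_s_corp_income=3.0,
    )


combined = build_puf_support_clone(current, fake_impute)
```

Resetting the index right after concatenation gives later donors a clean RangeIndex, so index-aligned assignments cannot write clone values onto original rows.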
PUF Variable Surface
For the clone half, PUF should impute both donor-only variables and overlapping variables where PUF is authoritative enough to provide alternative support. The implementation should not depend only on one small handpicked list; eCPS imputes a broad tax surface, and MP needs either parity with that broad surface where the variables are data-owned or an explicit exclusion table for every missing variable.
Initial variable selection:
Start from the existing PUF donor target selection:
donor-only authoritative variables.
shared authoritative variables already listed in donor_imputer_authoritative_override_variables.
Add a PUF-clone-specific default overlap list modeled on the full eCPS IMPUTED_VARIABLES surface, constrained to variables present in both MP current and PUF donor seed and allowed by source authority. This includes the major income variables below plus deductions, credits, QBI inputs, HSA/student loan/charitable variables, and qualified-income flags where MP has a data-owned column:
employment_income
self_employment_income
social_security
taxable_pension_income
taxable_interest_income
tax_exempt_interest_income
qualified_dividend_income
non_qualified_dividend_income
long_term_capital_gains
short_term_capital_gains
rental_income
farm_income
partnership_s_corp_income
taxable_ira_distributions
taxable_unemployment_compensation
traditional_ira_contributions
self_employed_health_insurance_ald
self_employed_pension_contribution_ald
wage_income, dividend_income, and capital_gains only if they are present as MP scaffold aliases or support fields mapped to the corresponding PolicyEngine variables above.
Also implement an explicit MP equivalent of eCPS's OVERRIDDEN_IMPUTED_VARIABLES category for variables that should be PUF-imputed on both halves. This should be a named list with tests and comments. The default should mirror eCPS only where the variable is data-owned rather than formula-owned under the pinned PolicyEngine-US export contract, and only apply the both-halves override to variables that are already present in the pre-clone frame. If a variable is donor-only/absent before cloning, it should be added as [zero original half, PUF-imputed clone half] unless MP explicitly documents a different reason.
Add a generated or checked puf_support_clone_variable_surface sidecar/table with:
eCPS IMPUTED_VARIABLES
eCPS OVERRIDDEN_IMPUTED_VARIABLES
eCPS special PUF-half variables outside those lists, currently weeks_unemployed
MP-supported included variables
MP-excluded variables
exclusion reason (formula_owned, missing_policyengine_us_variable, missing_puf_source_column, unsupported_entity_mapping, or intentional_divergence)
The release path should not silently omit eCPS PUF variables.
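One way to enforce the no-silent-omission rule is to build the surface table from the eCPS reference lists and fail when a variable is neither included nor excluded with a recognized reason. The sketch below is hypothetical: `build_variable_surface` is an illustrative name, and the example classification of variables is arbitrary rather than a recommendation.

```python
ALLOWED_EXCLUSION_REASONS = {
    "formula_owned",
    "missing_policyengine_us_variable",
    "missing_puf_source_column",
    "unsupported_entity_mapping",
    "intentional_divergence",
}


def build_variable_surface(ecps_variables, mp_included, mp_exclusions):
    """Return one row per eCPS PUF variable; raise if any is silently omitted."""
    rows = []
    for name in sorted(set(ecps_variables)):
        if name in mp_included:
            rows.append({"variable": name, "status": "included", "reason": None})
            continue
        reason = mp_exclusions.get(name)
        if reason not in ALLOWED_EXCLUSION_REASONS:
            raise ValueError(
                f"{name} is neither included nor excluded with a valid reason"
            )
        rows.append({"variable": name, "status": "excluded", "reason": reason})
    return rows


surface = build_variable_surface(
    ecps_variables=["employment_income", "taxable_interest_income", "weeks_unemployed"],
    mp_included={"employment_income", "taxable_interest_income"},
    mp_exclusions={"weeks_unemployed": "intentional_divergence"},
)
```

The same rows can feed both the sidecar and the variable-surface test, so the table and the gate cannot drift apart.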
Entity mapping requirement:
Resolve each PUF-clone target variable's PolicyEngine entity under the pinned policyengine-us.
If the donor imputation engine generates at person level for a non-person variable, map values back to the correct entity consistently before assignment, matching eCPS's value_from_first_person pattern.
If MP's block engine already projects a block to the target entity, record that in diagnostics and test it so tax-unit/SPM/household variables do not get person-shape semantics accidentally.
Do not blindly import eCPS variable semantics. MP should use its own source manifests and PolicyEngine export contract, with eCPS as the reference for the support-clone architecture.
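The person-to-entity mapping can be illustrated with a pandas group transform. This is a sketch of the first-person pattern described above, not eCPS's `value_from_first_person` itself: `broadcast_first_person_value` is a hypothetical helper, and "first person" is assumed to mean first in row order within each entity.

```python
import pandas as pd


def broadcast_first_person_value(person_values, entity_ids):
    """Give every person in an entity the value predicted for its first person."""
    return person_values.groupby(entity_ids, sort=False).transform("first")


persons = pd.DataFrame(
    {
        "tax_unit_id": [1, 1, 2],
        # Person-level predictions for a tax-unit variable; rows 0 and 1 disagree.
        "predicted": [10.0, 99.0, 5.0],
    }
)
persons["tax_unit_value"] = broadcast_first_person_value(
    persons["predicted"], persons["tax_unit_id"]
)
```

After the transform, every member of a tax unit carries one agreed value, which is what an entity-level export column needs.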
Ordering
Required ordering for immediate eCPS replacement:
CPS scaffold.
PUF support clone immediately after CPS, creating the original CPS half and a zero-stored-weight PUF-imputed support half.
Downstream non-PUF donor imputations over the expanded support universe (SIPP/SCF now; ACS donor later if enabled). This lets SIPP/SCF fill benefits/assets on both CPS-original and PUF-alternative rows.
PE table materialization.
Microsim target materialization.
Calibration.
This deliberately treats PUF as the first support-expanding donor: the original multispine rather than a late overwrite. It is not a literal copy of either PolicyEngine-US-data pathway. Public eCPS does CPS -> PUF clone and has no SIPP/SCF stage; the unified calibration path does source imputation before PUF clone. MP's architecture should use the eCPS support-clone mechanism but place it where it is most useful for our donor pipeline.
Donor ordering must be observable, not only behavioral:
When clone mode is enabled, _integrate_donor_sources must actively split donor inputs into PUF-clone donors matching puf_support_clone_source_prefixes and non-PUF donors, process the PUF-clone donors first, then process later donors over the expanded support.
Do not rely on provider order, even though the current default provider bundle happens to list CPS then PUF then SIPP/SCF.
Validate that the scaffold is CPS/ASEC-shaped before splitting. The current scaffold chooser is score-based, so clone mode must not assume PUF will always be a donor unless it checks.
Return processed_donor_source_order and puf_clone_source_order in the donor-integration result.
Persist those fields in metadata/sidecars alongside fusion_plan.source_names, because the input source order can differ from the actual processing order once PUF-first support expansion is active.
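A small sketch of the active split, assuming donors are identified by name: `order_donors_for_clone` is a hypothetical helper, and the donor names in the usage lines are illustrative. The two returned keys match the field names proposed above.

```python
def order_donors_for_clone(donor_names, clone_prefixes):
    """Put PUF-clone donors first, keeping the relative order within each group."""
    prefixes = tuple(clone_prefixes)
    puf_donors = [name for name in donor_names if name.startswith(prefixes)]
    other_donors = [name for name in donor_names if not name.startswith(prefixes)]
    return {
        "processed_donor_source_order": puf_donors + other_donors,
        "puf_clone_source_order": puf_donors,
    }


# A deliberately misordered provider list still yields PUF-first processing.
result = order_donors_for_clone(["sipp", "irs_soi_puf_2024", "scf"], ("irs_soi_puf",))
```

Returning the effective order alongside the input order makes any reordering visible in metadata instead of implicit in behavior.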
Checkpoints and Sidecars
This must preserve restartability:
Post-imputation checkpoint should contain the doubled person and household support after PUF clone.
Post-microsim checkpoint should work unchanged.
Clone metadata must either be persisted in checkpoint metadata or be reconstructable from checkpoint tables using person_is_puf_clone.
Recalibration-from-checkpoint must preserve/report clone diagnostics instead of losing them when source frames are unavailable.
Add a small puf_support_clone_summary.json sidecar and include the main figures in existing source-weight diagnostics:
enabled flag
donor source name
original row count
clone row count
final row count
original weight sum
clone initial weight sum
clone variable count
clone overlap variable count
clone source order
variable-surface parity/exclusion counts
non-person variable entity mapping mode
whether optimization initialized the clone support with activatable nonzero internal weights even though stored source weights are zero
final clone household count, activated household count, clone household weight sum, and clone household weight share after calibration
Existing source-weight diagnostics currently assume donor rows contribute no separate support weight; update them to compute PUF support activation from person_is_puf_clone after calibration.
Household-level PUF clone diagnostics:
Keep person_is_puf_clone on the person table.
Derive clone households by grouping persons by household_id; a clone household should have all members flagged as person_is_puf_clone == 1.
Join those clone household IDs to the calibrated household table to compute clone household count, activated count, weight sum, and weight share.
If a household has mixed clone flags, report it as a diagnostic error because ID offsetting or household construction has broken clone integrity.
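The household-level derivation can be sketched as follows. The function name, the output keys, and the `household_weight` column are assumptions for illustration; raising `ValueError` is one possible way to surface the mixed-flag diagnostic error.

```python
import pandas as pd


def clone_household_diagnostics(persons, households, weight_column="household_weight"):
    """Derive clone-household activation figures from person-level flags."""
    flags = persons.groupby("household_id")["person_is_puf_clone"].agg(["min", "max"])
    mixed = flags.index[flags["min"] != flags["max"]].tolist()
    if mixed:
        raise ValueError(f"households with mixed clone flags: {mixed}")

    clone_ids = flags.index[flags["max"] == 1]
    is_clone = households["household_id"].isin(clone_ids)
    weights = households[weight_column]
    clone_weights = weights[is_clone]
    total = float(weights.sum())
    return {
        "clone_household_count": int(is_clone.sum()),
        "activated_clone_household_count": int((clone_weights > 0).sum()),
        "clone_household_weight_sum": float(clone_weights.sum()),
        "clone_household_weight_share": (
            float(clone_weights.sum()) / total if total > 0 else 0.0
        ),
    }


persons = pd.DataFrame(
    {
        "household_id": [1, 1, 4, 4, 5],
        "person_is_puf_clone": [0.0, 0.0, 1.0, 1.0, 1.0],
    }
)
households = pd.DataFrame(
    {"household_id": [1, 4, 5], "household_weight": [60.0, 40.0, 0.0]}
)
summary = clone_household_diagnostics(persons, households)
```

Because the figures come from the loaded tables alone, the same function works during recalibration from a checkpoint when source frames are unavailable.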
Calibration activation:
Stored source weights should mirror eCPS: clone support rows start at zero.
Calibration internals must nevertheless be able to activate clone rows. PE L0 already floors zero initial weights; other backends need either a clone-aware internal weight floor/warm start in calibrate_policyengine_tables or clone mode must be restricted to proven backends.
Add a focused test where a zero-stored-weight clone row receives positive calibrated household weight under the configured eCPS replacement backend.
Household-budget selection runs before final calibration and can discard zero-stored-weight clone households. Clone mode must either disable/fail with policyengine_selection_household_budget or prove the selector also uses a clone-aware positive internal floor and keeps clone support eligible.
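The separation between stored weights and optimizer starting weights can be illustrated with a floor applied only inside the initializer. This is a hypothetical sketch: `initial_log_weights` and the floor value are assumptions, and it does not describe how any existing backend implements its own floor.

```python
import numpy as np


def initial_log_weights(stored_weights, floor=1e-3):
    """Return finite log-space starting weights without modifying stored weights."""
    weights = np.asarray(stored_weights, dtype=float)
    # Flooring here keeps zero-stored-weight clone rows away from log(0).
    return np.log(np.maximum(weights, floor))


stored = np.array([120.0, 80.0, 0.0, 0.0])  # last two rows are clone support
start = initial_log_weights(stored)
```

The stored array keeps exact zeros on the clone half, mirroring eCPS, while every row starts the optimization from a finite value that a multiplicative update can move.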
Imputation ablations:
The donor-imputation ablation runner calls _integrate_donor_sources directly for variant scoring. Disable PUF support clone inside ablation variant configs unless the ablation is specifically testing support expansion; otherwise the evidence path will compare different support universes.
Validation Gates
Pre-calibration validation:
If clone mode is enabled, synthesis_backend is seed until a post-synthesis clone path exists.
Export-column contract still passes from the post-imputation tables before calibration.
Dataset row counts are doubled only for the PUF clone path.
person_is_puf_clone exists on the post-imputation person table, and the export contract treats it as optional, not required.
Original rows preserve CPS values for ordinary overlapping variables.
Variables in the explicit PUF-both-halves override list are PUF-imputed on both halves, matching eCPS behavior where we intentionally adopt it.
Variables absent before cloning but included in PUF clone support have zero original-half values and PUF-imputed clone-half values, unless an explicit MP exclusion/divergence entry explains otherwise.
weeks_unemployed is either handled like eCPS, with original-half CPS values and PUF-half imputed values tied to unemployment compensation support, or has an explicit exclusion/divergence reason.
Clone rows differ from original rows for at least high-signal PUF overlap variables (self_employment_income, employment_income or wage_income, taxable_interest_income, qualified_dividend_income, capital_gains) when donor support exists.
Clone rows start with zero values in every stored weight column.
The calibration/refit initializer can activate clone rows; exact zero stored source weights must not become log(0) or permanent zero final weights.
Household-budget selection is disabled for clone mode unless it has its own clone-aware activation/eligibility test.
Entity ID uniqueness holds across original and clone rows, and clone reference columns point to clone entities.
PUF-first ordering is enforced even if the caller provides donors in a deliberately misordered list.
Concatenated original+clone rows have a clean RangeIndex before downstream non-PUF donor imputation runs.
Clone mode fails clearly if the scaffold is PUF-prefixed, not CPS/ASEC-shaped, or no unambiguous PUF donor is available.
Non-person PUF clone variables preserve entity consistency after assignment.
Post-calibration validation:
Some clone rows should be activated if the PUF support is useful, but this is diagnostic, not a hard gate.
Sound MP vs latest local eCPS comparison must improve. The release gate is still loss, not structural preferences.
Tests
Unit tests:
A small synthetic CPS + PUF donor case where clone mode:
doubles rows,
offsets IDs,
sets clone weights to zero,
preserves original overlapping CPS values,
imputes different PUF overlap values into clone rows.
A regression test that non-PUF donor integration still behaves unchanged with clone mode off.
A source-order test: when clone mode is enabled, PUF donor support expansion happens before SIPP/SCF-style donors, and those later donors see both original and PUF clone rows.
A sidecar/diagnostic test for PUF support clone summary fields.
A variable-surface test comparing MP's included/excluded PUF clone variables to eCPS IMPUTED_VARIABLES and OVERRIDDEN_IMPUTED_VARIABLES, requiring a reason for each exclusion.
A non-person entity mapping test for at least one tax-unit or household PUF clone target.
A seed-backend guard test: clone mode fails clearly for non-seed synthesis backends until a post-synthesis path exists.
A source-weight diagnostics test that final PUF support activation is nonzero when clone rows receive positive calibrated weights.
A checkpoint/recalibration test that person_is_puf_clone, doubled IDs/reference columns, zero pre-calibration clone weights, and clone metadata survive save/load and are reconstructed in recalibration diagnostics from loaded tables rather than source-frame metadata.
An ablation test that clone mode is disabled for ordinary donor-imputation holdout variants.
A calibration activation test for the chosen eCPS replacement backend proving that zero-stored-weight clone rows can receive positive calibrated weight.
A downstream donor test where a non-PUF donor runs after clone expansion and imputes a person-native variable on both original and clone rows without index misalignment.
A scaffold/donor validation test for PUF-as-scaffold, missing PUF donor, and ambiguous PUF donor cases.
A household clone diagnostics test deriving clone household activation from person flags and detecting mixed-flag households.
A household-selection guard test: clone mode fails or disables selection when policyengine_selection_household_budget is set, unless a selector-specific clone activation test exists.
Focused integration tests:
Resume from the Gate-1 post-imputation or rebuild only as much as necessary.
Run export-column check on the PUF-cloned post-imputation tables.
Run sound MP vs latest local eCPS comparison.
Execution Plan
Implement config and helper methods in src/microplex_us/pipelines/us.py.
Wire clone mode into the eCPS checkpoint runner/config and checkpoint CLI.
Add tests in tests/pipelines/test_us.py and any artifact/checkpoint tests needed.
Run focused tests.
Open a PR and link the issue.
Rebuild or resume Gate-1 with PUF support clone enabled.
Run the sound comparison against latest local eCPS as primary and HF eCPS as diagnostic.
Open Questions for Review
Which eCPS OVERRIDDEN_IMPUTED_VARIABLES should MP adopt as both-halves PUF-imputed variables under the pinned PolicyEngine-US export contract?
Should the stored clone weight be exactly zero like eCPS while the optimization initializer uses a positive internal start? My current recommendation is yes: stored data mirrors eCPS support semantics, optimizer internals avoid log(0) and can activate rows.
Should this mode become default for eCPS-shaped builds immediately? My recommendation is yes for the eCPS replacement pipeline, no for unrelated MP builds until scored.
Reviewed plan: /Users/maxghenis/mp-ecps-tracking/mp-puf-support-clone-plan-20260601.md