Fix SPM and family overfragmentation in PE export #81

@MaxGhenis

Context

After PR #79, Microplex tax-unit construction is close to the eCPS structural reference, but the sound eCPS replacement comparison still shows candidate losses roughly 20 times the eCPS losses on both refit and holdout. The next structural mismatch is SPM/family fragmentation in the PolicyEngine export.

Fresh matched-N comparison artifact after PR #79:

/Users/maxghenis/CosilicoAI/microplex-us/artifacts/small_asec_acs100k_household_coherent_20260529/sound_ecps_replacement_comparison/sound_ecps_replacement_comparison.json

Key result:

  • candidate refit loss: 3.7243506116660963
  • eCPS refit loss: 0.1726525197190867
  • candidate holdout loss: 0.5319674877579285
  • eCPS holdout loss: 0.02754433292858976

The tax-unit structure is now reasonable:

  • candidate matched tax units: 54,034 on 41,314 households (1.308/HH)
  • eCPS tax units: 55,264 on 41,314 households (1.338/HH)

But SPM/family structures remain overfragmented:

  • candidate matched SPM units: 65,905 on 41,314 households (1.595/HH)
  • eCPS SPM units: 43,134 on 41,314 households (1.044/HH)
  • candidate matched families: 65,905 (1.595/HH)
  • eCPS families: 46,222 (1.119/HH)
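
As a quick arithmetic check, the per-household ratios above can be reproduced from the raw counts. This is only a sketch: the dictionary keys and the helper name are illustrative and do not come from the Microplex codebase.

```python
# Counts copied from the matched-N comparison above (41,314 households).
HOUSEHOLDS = 41_314

counts = {
    "candidate_tax_units": 54_034,
    "ecps_tax_units": 55_264,
    "candidate_spm_units": 65_905,
    "ecps_spm_units": 43_134,
    "candidate_families": 65_905,
    "ecps_families": 46_222,
}


def per_household(n_units, n_households=HOUSEHOLDS):
    """Average number of units per household."""
    return n_units / n_households


ratios = {name: round(per_household(n), 3) for name, n in counts.items()}

# Absolute gap between candidate and eCPS unit counts.
excess_spm_units = counts["candidate_spm_units"] - counts["ecps_spm_units"]
excess_families = counts["candidate_families"] - counts["ecps_families"]

for name, ratio in ratios.items():
    print(f"{name}: {ratio}/HH")
print("excess SPM units:", excess_spm_units)
print("excess families:", excess_families)
```

The gap is concentrated in SPM units and families; the tax-unit ratios differ only in the second decimal place.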

Current root cause:

  • _assign_family_and_spm_units uses the same fallback split for family and SPM.
  • It puts relationship {0, 1, 2} in one primary family/SPM unit and assigns every other person to their own family/SPM unit.
  • The calibrated source parquet already has family_id and spm_unit_id with the same inflated count (159,331 on 100,000 households), so simply preserving those columns is not enough.
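
The fallback described above can be illustrated with a minimal sketch. It is not the actual body of `_assign_family_and_spm_units`; the record layout (`household_id`, `person_id`, `relationship`) and the function name `fallback_unit_ids` are assumptions made for illustration.

```python
# Relationship codes that the fallback keeps together, as stated in the issue.
PRIMARY_RELATIONSHIPS = {0, 1, 2}


def fallback_unit_ids(persons):
    """Return one unit label per person under primary-plus-singletons.

    `persons` is a list of dicts with hypothetical keys
    household_id, person_id, and relationship.
    """
    unit_ids = []
    for person in persons:
        if person["relationship"] in PRIMARY_RELATIONSHIPS:
            # All primary members of a household share one unit.
            unit_ids.append((person["household_id"], "primary"))
        else:
            # Everyone else becomes a unit of their own.
            unit_ids.append(
                (person["household_id"], "single", person["person_id"])
            )
    return unit_ids


# Toy data: a five-person household and a one-person household.
example_persons = [
    {"household_id": 1, "person_id": 1, "relationship": 0},
    {"household_id": 1, "person_id": 2, "relationship": 1},
    {"household_id": 1, "person_id": 3, "relationship": 2},
    {"household_id": 1, "person_id": 4, "relationship": 3},
    {"household_id": 1, "person_id": 5, "relationship": 4},
    {"household_id": 2, "person_id": 6, "relationship": 0},
]

example_units = fallback_unit_ids(example_persons)
n_example_units = len(set(example_units))
print(n_example_units, "units on 2 households")
```

In the toy data, the two people with relationship codes outside {0, 1, 2} each add a unit to household 1, which is the mechanism that inflates both family and SPM counts when the same labels are used for both.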

Current mitigation

PR #80 changes only the SPM fallback, assigning one SPM unit per household while keeping the current family split unchanged.

Lightweight structural probe with PR #80 logic:

/Users/maxghenis/CosilicoAI/microplex-us/artifacts/small_asec_acs100k_household_spm_20260529/structural_probe.json

  • households: 100,000
  • persons: 245,714
  • tax units: 130,980 (1.3098/HH)
  • SPM units: 100,000 (1.0/HH)
  • families: 159,331 (1.59331/HH), intentionally unchanged
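
A minimal sketch of the household-coherent SPM fallback and the ratio the probe reports, assuming person records that carry a `household_id` key. The function names are hypothetical and the code is not taken from PR #80.

```python
def household_coherent_spm_ids(persons):
    """Assign every person to a single SPM unit keyed by household."""
    return [person["household_id"] for person in persons]


def units_per_household(persons, unit_ids):
    """Number of distinct units divided by number of distinct households."""
    households = {person["household_id"] for person in persons}
    return len(set(unit_ids)) / len(households)


# Toy data: three people in household 1, two in household 2.
probe_persons = [
    {"household_id": 1, "person_id": 1},
    {"household_id": 1, "person_id": 2},
    {"household_id": 1, "person_id": 3},
    {"household_id": 2, "person_id": 4},
    {"household_id": 2, "person_id": 5},
]

probe_spm_ids = household_coherent_spm_ids(probe_persons)
probe_ratio = units_per_household(probe_persons, probe_spm_ids)
print(f"SPM units per household: {probe_ratio}")
```

By construction this rule always yields exactly one SPM unit per household, which matches the 100,000 SPM units on 100,000 households in the probe.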

Desired direction

Do not hard-code eCPS structure as the target. Use Microplex architecture to construct coherent relational units from source relationships and donor structure:

  • SPM units should be household-coherent unless richer SPM relationship detail supports a split.
  • Family units need a separate, data-driven rule; the current primary-plus-singletons family rule is likely too fragmented.
  • Preserve diagnostics in sidecars/gates, not in the model H5.
  • Add structural gates for SPM/family units per household and singleton-unit shares, calibrated against source distributions and policy semantics rather than only eCPS.
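
One possible shape for such a structural gate is sketched below. The function names and the threshold values in the example are placeholders; per the point above, real thresholds would be calibrated against source distributions and policy semantics.

```python
from collections import Counter


def unit_structure_stats(household_ids, unit_ids):
    """Compute units per household and the share of single-person units.

    Both arguments are person-level sequences of equal length.
    """
    n_households = len(set(household_ids))
    unit_sizes = Counter(unit_ids)  # persons per unit
    n_units = len(unit_sizes)
    n_singletons = sum(1 for size in unit_sizes.values() if size == 1)
    return {
        "units_per_household": n_units / n_households,
        "singleton_unit_share": n_singletons / n_units,
    }


def check_gate(stats, max_units_per_household, max_singleton_share):
    """Return a list of human-readable failures (empty means the gate passes)."""
    failures = []
    if stats["units_per_household"] > max_units_per_household:
        failures.append("units_per_household above threshold")
    if stats["singleton_unit_share"] > max_singleton_share:
        failures.append("singleton_unit_share above threshold")
    return failures


# Toy data: household 1 has units a (2 persons) and b (1 person);
# household 2 has unit c (2 persons).
gate_household_ids = [1, 1, 1, 2, 2]
gate_unit_ids = ["a", "a", "b", "c", "c"]

gate_stats = unit_structure_stats(gate_household_ids, gate_unit_ids)
# Placeholder thresholds, chosen only to demonstrate a failing gate.
gate_failures = check_gate(
    gate_stats, max_units_per_household=1.2, max_singleton_share=0.5
)
print(gate_stats, gate_failures)
```

Returning a list of failures rather than raising makes it easy to write the result to a sidecar while still letting a caller fail the export when the list is non-empty.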

Acceptance criteria

  • Small ASEC+ACS100k PE export has plausible SPM/family unit counts and no impossible entity memberships.
  • A sidecar reports household/person/tax-unit/SPM/family/marital counts, per-household ratios, singleton shares, and cross-household ID violations.
  • The sound matched-N symmetric-refit eCPS comparison is rerun after SPM/family changes.
  • The comparison report breaks out SNAP, Census/SPM poverty, IRS filing-status-sensitive cells, and protected target families so we can tell whether the structural change improved the intended surfaces.
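
The cross-household ID violation check from the sidecar criterion could look roughly like this. It is a sketch over plain person-level sequences; the function name is hypothetical, and the real export presumably works on tabular data.

```python
from collections import defaultdict


def cross_household_violations(household_ids, unit_ids):
    """Return unit IDs whose members span more than one household.

    Both arguments are person-level sequences of equal length. A tax,
    SPM, or family unit that appears here has an impossible membership.
    """
    households_by_unit = defaultdict(set)
    for household_id, unit_id in zip(household_ids, unit_ids):
        households_by_unit[unit_id].add(household_id)
    return sorted(
        unit_id
        for unit_id, households in households_by_unit.items()
        if len(households) > 1
    )


# Toy data: unit 10 has members in households 1 and 2, which is a violation.
sidecar_household_ids = [1, 1, 2, 2]
sidecar_unit_ids = [10, 10, 10, 11]

violations = cross_household_violations(sidecar_household_ids, sidecar_unit_ids)
print("cross-household violations:", violations)
```

Running the same check once per entity type (tax unit, SPM unit, family) would give the sidecar one violation list per entity.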
