Wire geography GEOID exports (G5)#130

Closed
MaxGhenis wants to merge 1 commit into main from g5-geoids

@MaxGhenis
Contributor

Summary

Makes microplex-us export the three Census GEOID leaves that the enhanced-CPS (eCPS) baseline exports, so that an MP dataset carries the same geography columns and can serve as a drop-in eCPS replacement:

  • block_geoid (15-digit Census block GEOID, string)
  • tract_geoid (11-digit tract GEOID, string)
  • congressional_district_geoid (integer SSDD GEOID)

Production gap, not just a missing allowlist entry

The frozen eCPS export contract (src/microplex_us/pipelines/ecps_export_contract.json) lists all three in its required set (246 cols), but the MP export only emitted state_fips/county_fips. MP already had the block-assignment machinery (geography.py: BlockGeography, derive_geographies) and the real population-weighted block crosswalk locally, but the production pipeline (_build_policyengine_households) never assigned block_geoid to households. So this was a production gap: the leaves had to be produced, not merely allowlisted.

What changed

geography.py

  • assign_household_block_geography() — draws a real 15-digit block_geoid per household from the population-weighted crosswalk, partitioned by the most specific geography available: CPS county_fips (disclosed) -> existing congressional_district_geoid (eCPS-style) -> state_fips. Then derives tract_geoid = block_geoid[:11] and the integer congressional_district_geoid from the block's crosswalk cd_id. Unresolved households (no valid state) get empty strings / CD 0 (the PE-US defaults) — geoids are never fabricated.
  • cd_id_to_congressional_district_geoid() — converts crosswalk "<abbr>-<dist>" cd_ids to the PE-US integer SSDD geoid (state_fips * 100 + district). At-large states use district 1, which matches the eCPS calibration-target CD universe exactly.
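
As a rough illustration of the fallback chain and tract derivation described above, here is a minimal sketch. The function names `choose_partition` and `derive_tract` are hypothetical and are not taken from the PR's code; the sketch only assumes that a usable county is a 5-digit string and that CD 0 means "unassigned", as stated above.

```python
from typing import Optional, Tuple, Union


def choose_partition(
    state_fips: Optional[int],
    county_fips: Optional[str] = None,
    cd_geoid: Optional[int] = None,
) -> Optional[Tuple[str, Union[str, int]]]:
    """Return (level, key) for the most specific usable geography, or None."""
    # A disclosed county is assumed to be a 5-digit string such as "06037".
    if county_fips and len(county_fips) == 5 and county_fips.isdigit():
        return ("county", county_fips)
    # CD 0 is treated as "unassigned", so only a non-zero CD is usable.
    if cd_geoid:
        return ("cd", cd_geoid)
    if state_fips:
        return ("state", state_fips)
    # Unresolved: the caller keeps "" / 0 instead of inventing a GEOID.
    return None


def derive_tract(block_geoid: str) -> str:
    """The tract GEOID is the first 11 digits of the 15-digit block GEOID."""
    return block_geoid[:11]


print(choose_partition(6, "06037"))      # ('county', '06037')
print(choose_partition(6, None, 601))    # ('cd', 601)
print(choose_partition(6))               # ('state', 6)
print(derive_tract("060371234561234"))   # 06037123456
```

The block GEOID in the last call is a made-up value used only to show the slicing.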

pipelines/us.py

  • _build_policyengine_households() now calls a new _assign_household_block_geography() helper, gated by a policyengine_assign_block_geography config flag (default on). Skipped cleanly when the crosswalk is unavailable (CI without the data).

policyengine/us.py

  • Adds the three leaves to SAFE_POLICYENGINE_US_EXPORT_VARIABLES. All three are storable INPUT variables (no formula) in the pinned policyengine-us, so they pass the computed-export guard and serialize as block/tract strings + CD int64.

Encoding correctness (verified against eCPS)

The crosswalk cd_id is "<state_abbr>-<district>" (e.g. "CA-01", at-large "DC-AL"); PE-US and the eCPS targets store congressional_district_geoid as integer SSDD. With at-large districts encoded as 1, the crosswalk's 436-CD universe is an exact match to the eCPS calibration-target CD universe (policy_data.db: 436 CDs, e.g. CA-01 -> 601, NC-01 -> 3701, DC-AL -> 1101, WY-AL -> 5601), with zero CDs missing on either side. A test asserts this equality.
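
The SSDD encoding can be sketched in a few lines. This is an assumption-laden illustration, not the PR's `cd_id_to_congressional_district_geoid()`: the name `cd_id_to_geoid` is hypothetical, and `STATE_FIPS` is a deliberately partial table covering only the states used in the examples above.

```python
# Partial abbreviation -> state FIPS table (illustrative subset only).
STATE_FIPS = {"AK": 2, "CA": 6, "DC": 11, "NC": 37, "WY": 56}


def cd_id_to_geoid(cd_id: str) -> int:
    """Convert "<state_abbr>-<district>" to the integer SSDD GEOID."""
    abbr, _, dist = cd_id.partition("-")
    # At-large districts are encoded as district 1.
    district = 1 if dist == "AL" else int(dist)
    return STATE_FIPS[abbr] * 100 + district


for cd_id in ["CA-01", "NC-01", "DC-AL", "WY-AL"]:
    print(cd_id, "->", cd_id_to_geoid(cd_id))
# CA-01 -> 601
# NC-01 -> 3701
# DC-AL -> 1101
# WY-AL -> 5601
```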

Tests

tests/policyengine/test_geoid_export.py (18 tests) + tests/pipelines/test_us.py (2 tests added):

  • tract_geoid == block_geoid[:11] exactly (the load-bearing invariant)
  • valid GEOID lengths: block=15, tract=11, CD=3-4 digit SSDD
  • CD lookups resolve and cd // 100 == state_fips
  • at-large encoding (DC -> 1101)
  • county / CD / state partition fallback chain (county-resolved households land in their CPS county; suppressed county falls back to state)
  • all three in the allowlist; not excluded by the computed-export/forbidden guard
  • end-to-end H5 export carries all three with correct dtypes
  • crosswalk CD universe == eCPS calibration-target CD universe
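
The per-row invariants in the list above could be checked with something like the following sketch. `geoid_violations` is a hypothetical helper, not one of the PR's tests, and the GEOID values are made up for illustration.

```python
def geoid_violations(
    block_geoid: str, tract_geoid: str, cd_geoid: int, state_fips: int
) -> list:
    """Return a list of human-readable invariant violations (empty if none)."""
    problems = []
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        problems.append("block_geoid must be 15 digits")
    if len(tract_geoid) != 11 or not tract_geoid.isdigit():
        problems.append("tract_geoid must be 11 digits")
    # The load-bearing invariant: the tract is a prefix of the block.
    if tract_geoid != block_geoid[:11]:
        problems.append("tract_geoid != block_geoid[:11]")
    # SSDD is 3 digits for single-digit state FIPS, 4 digits otherwise.
    if not 3 <= len(str(cd_geoid)) <= 4:
        problems.append("CD GEOID must be a 3-4 digit SSDD integer")
    if cd_geoid // 100 != state_fips:
        problems.append("cd // 100 != state_fips")
    return problems


print(geoid_violations("060371234561234", "06037123456", 601, 6))
# []
print(geoid_violations("060371234561234", "06037123457", 601, 6))
# ['tract_geoid != block_geoid[:11]']
```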

Existing geography / block-assignment / block-synthesis / export-column-gate suites pass unchanged.

🤖 Generated with Claude Code

Make microplex-us export the three Census GEOID leaves the enhanced-CPS
baseline exports: block_geoid, tract_geoid, and congressional_district_geoid.

The eCPS export contract (ecps_export_contract.json) lists all three in its
`required` set, but the MP export only emitted state_fips/county_fips. MP had
the block-assignment machinery (geography.py: BlockGeography, derive_geographies)
and the real population-weighted block crosswalk, but the production pipeline
never assigned block_geoid to households, so this was a production gap, not just
a missing allowlist entry.

Changes:
- geography.py:
  - assign_household_block_geography(): draws a real 15-digit block_geoid per
    household from the population-weighted crosswalk, partitioned by the most
    specific geography available (CPS county -> existing CD -> state), then
    derives tract_geoid = block_geoid[:11] and the integer
    congressional_district_geoid from the block's crosswalk cd_id.
  - cd_id_to_congressional_district_geoid(): converts crosswalk "<abbr>-<dist>"
    cd_ids to the PE-US integer SSDD geoid (state_fips*100 + district). At-large
    states use district 1, which matches the enhanced-CPS calibration-target CD
    universe exactly (436 CDs, verified against policy_data.db).
- pipelines/us.py: _build_policyengine_households() now assigns block geography
  via a new _assign_household_block_geography() helper, gated by the
  policyengine_assign_block_geography config flag (default on) and skipped
  cleanly when the crosswalk is unavailable. Geoids are never fabricated.
- policyengine/us.py: add block_geoid, tract_geoid, congressional_district_geoid
  to SAFE_POLICYENGINE_US_EXPORT_VARIABLES. All three are storable INPUT
  variables (no formula) in the pinned policyengine-us, so they survive the
  computed-export guard and serialize as block/tract strings + CD int64.

Tests (tests/policyengine/test_geoid_export.py, tests/pipelines/test_us.py):
assert tract_geoid == block_geoid[:11] exactly, valid GEOID lengths
(block=15, tract=11, cd=3-4 digit SSDD), CD lookups resolve and are
state-consistent, at-large encoding, county/CD/state partition fallback,
allowlist membership, the end-to-end H5 export carries all three, and the
crosswalk CD universe equals the enhanced-CPS calibration-target CD universe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxGhenis
Contributor Author

Closing in favor of #129 (codex) after a head-to-head correctness bakeoff on the same three geoid columns (block_geoid, tract_geoid, congressional_district_geoid). Best-of-per-gap dedup — keeping the more eCPS-faithful implementation regardless of source.

Decisive finding (measured on 8,000 real CPS households): this PR mis-locates ~80% of county-disclosed households to the wrong county. Root cause: the county normalizer here zero-pads the raw 3-digit CPS GTCO fragment (e.g. 37 -> "00037", a nonexistent county) instead of combining it as eCPS does (state*1000 + county). The household then silently falls through to the state-level partition. Only 19.6% of county-disclosed households land in their true county here, vs 100% in #129.
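
The difference between the two normalizations is easy to see in isolation. The sketch below is a reconstruction from the description above, not the code from either PR; both function names are hypothetical.

```python
def zero_padded_county(gtco: int) -> str:
    """The failure mode: pad the 3-digit county fragment alone to 5 digits."""
    return str(gtco).zfill(5)


def combined_county(state_fips: int, gtco: int) -> str:
    """The eCPS-style combination: state * 1000 + county, as 5 digits."""
    return f"{state_fips * 1000 + gtco:05d}"


# California (state FIPS 6), county fragment 37 (Los Angeles County).
print(zero_padded_county(37))   # 00037  (no such county)
print(combined_county(6, 37))   # 06037
```

Because "00037" matches no county in the crosswalk, a lookup keyed on it finds no blocks, which is consistent with the silent fall-through to the state-level partition described above.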

  • Partition fidelity: eCPS (assign_geography_within_state_county, called from extended_cps.py:112) partitions county → state → national and derives CD from the sampled block; there is no CD partition step. #129 matches (county → state). This PR added a spurious CD step (county → CD → state).
  • Why my 18 tests didn't catch it: they fed pre-combined 5-digit counties ("06037", ...), not the 3-digit GTCO fragments the real pipeline supplies, so they gave false confidence on exactly the dimension that fails.
  • Design: #129 reuses the existing tested BlockGeography.load_assigner/assign/materialize; this PR reimplemented 6 private sampling functions (duplication).
  • Both produce real 15-digit Census geoids from the real 5.77M-block crosswalk, both satisfy tract == block_geoid[:11], and both encode the eCPS 436-CD universe exactly, so the tie-breaker is partition fidelity, where #129 wins outright.

Two good things from this PR are being carried over onto #129 before it merges:

  1. test_cd_universe_matches_enhanced_cps_targets (the executing 436-CD parity assertion vs policy_data.db); #129 lacks a live-target CD guard.
  2. A duplicate-index-robustness fix for _attach_household_census_geographies (latent bug surfaced during the bakeoff).

MaxGhenis closed this Jun 1, 2026
MaxGhenis added a commit that referenced this pull request Jun 1, 2026
…attach

Carry-overs from reconciling PR #129 against the closed PR #130:

- _congressional_district_geoid_from_cd_id: also accept the raw Census at-large
  forms (AL/ZZ tokens and district 0/98) and normalize them to district 01,
  matching eCPS's policyengine-us-data db/create_initial_strata.py. Verified the
  encoder reproduces the eCPS 436-CD calibration universe exactly on the real
  block crosswalk (AK=201, WY=5601, DC=1101).

- _attach_household_census_geographies: collapse to a fresh RangeIndex up front.
  The block write-back via .loc[row_index] previously raised
  "ValueError: cannot reindex on an axis with duplicate labels" on a non-unique
  household-frame index. The caller consumes the result via merge on the
  household_id column, not the index, so this is safe.

- Add tests/pipelines/test_geoid_cd_encoding.py: encoder contract (multi-district
  SSDD, at-large=01, raw-form hardening, invalid inputs), the duplicate-index
  regression, and a live 436-CD universe parity check (skips without the
  crosswalk parquet, e.g. in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
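
The duplicate-index fix described in the commit message above can be illustrated with a small pandas sketch. The frame and GEOID values are made up, and this is not the body of `_attach_household_census_geographies`; it only shows why collapsing to a fresh RangeIndex makes label-based write-back unambiguous while `household_id` stays available for the later merge.

```python
import pandas as pd

# A household frame whose index is non-unique, e.g. left over from a concat.
households = pd.DataFrame(
    {"household_id": [10, 11, 12], "state_fips": [6, 6, 37]},
    index=[0, 0, 1],
)
print(households.index.is_unique)  # False

# Collapse to a fresh RangeIndex so every row label is unique.
households = households.reset_index(drop=True)
households["block_geoid"] = ""

# Label-based write-back now targets exactly one row per label.
row_index = [0, 2]
households.loc[row_index, "block_geoid"] = ["060371234561234", "370010201001000"]

print(households.index.is_unique)  # True
print(households[["household_id", "block_geoid"]])
```

Dropping the old index is only safe when nothing downstream relies on it, which the commit message states is the case here because the caller merges on the household_id column.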
MaxGhenis added a commit that referenced this pull request Jun 1, 2026
* Add CPS block geography exports

* Harden geoid CD encoding and fix duplicate-index in census-geography attach

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>