Make microplex-us export the three Census GEOID leaves the enhanced-CPS
baseline exports: block_geoid, tract_geoid, and congressional_district_geoid.
The eCPS export contract (ecps_export_contract.json) lists all three in its
`required` set, but the MP export only emitted state_fips/county_fips. MP had
the block-assignment machinery (geography.py: BlockGeography, derive_geographies)
and the real population-weighted block crosswalk, but the production pipeline
never assigned block_geoid to households, so this was a production gap, not just
a missing allowlist entry.
Changes:
- geography.py:
  - assign_household_block_geography(): draws a real 15-digit block_geoid per
    household from the population-weighted crosswalk, partitioned by the most
    specific geography available (CPS county -> existing CD -> state), then
    derives tract_geoid = block_geoid[:11] and the integer
    congressional_district_geoid from the block's crosswalk cd_id.
  - cd_id_to_congressional_district_geoid(): converts crosswalk "<abbr>-<dist>"
    cd_ids to the PE-US integer SSDD geoid (state_fips*100 + district). At-large
    states use district 1, which matches the enhanced-CPS calibration-target CD
    universe exactly (436 CDs, verified against policy_data.db).
- pipelines/us.py: _build_policyengine_households() now assigns block geography
via a new _assign_household_block_geography() helper, gated by the
policyengine_assign_block_geography config flag (default on) and skipped
cleanly when the crosswalk is unavailable. Geoids are never fabricated.
- policyengine/us.py: add block_geoid, tract_geoid, congressional_district_geoid
to SAFE_POLICYENGINE_US_EXPORT_VARIABLES. All three are storable INPUT
variables (no formula) in the pinned policyengine-us, so they survive the
computed-export guard and serialize as block/tract strings + CD int64.
Tests (tests/policyengine/test_geoid_export.py, tests/pipelines/test_us.py):
assert tract_geoid == block_geoid[:11] exactly, valid GEOID lengths
(block=15, tract=11, cd=3-4 digit SSDD), CD lookups resolve and are
state-consistent, at-large encoding, county/CD/state partition fallback,
allowlist membership, the end-to-end H5 export carries all three, and the
crosswalk CD universe equals the enhanced-CPS calibration-target CD universe.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
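The partition fallback described in the commit message (county, then CD, then state) can be sketched as below. This is a minimal illustration, not the code in geography.py: the `draw_block` helper, the crosswalk column names (`county_fips`, `cd_geoid`, `population`), and the toy rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy crosswalk with hypothetical column names and made-up rows: one row per block.
crosswalk = pd.DataFrame(
    {
        "block_geoid": ["060010001001000", "060010001001001", "060750002002000"],
        "state_fips": [6, 6, 6],
        "county_fips": ["06001", "06001", "06075"],
        "cd_geoid": [612, 612, 611],  # illustrative SSDD values only
        "population": [30.0, 70.0, 50.0],
    }
)


def draw_block(household: dict, crosswalk: pd.DataFrame, rng: np.random.Generator) -> str:
    """Draw a block GEOID from the most specific partition that has any population."""
    for key in ("county_fips", "cd_geoid", "state_fips"):
        value = household.get(key)
        if value is None:
            continue  # this geography is not disclosed for the household
        pool = crosswalk[crosswalk[key] == value]
        total = pool["population"].sum()
        if total > 0:
            # Population-weighted draw within the matching partition.
            probs = (pool["population"] / total).to_numpy()
            return str(rng.choice(pool["block_geoid"].to_numpy(), p=probs))
    return ""  # unresolved: leave empty rather than fabricate a GEOID


rng = np.random.default_rng(0)
block = draw_block({"county_fips": "06075", "state_fips": 6}, crosswalk, rng)
tract = block[:11]  # the tract GEOID is the first 11 characters of the block GEOID
```

Because the tract is sliced from the drawn block, the two columns cannot disagree, which is why the commit treats `tract_geoid == block_geoid[:11]` as an exact invariant.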
Closing in favor of #129 (codex) after a head-to-head correctness bakeoff on the same three geoid columns ( Decisive finding (measured on 8,000 real CPS households): this PR mis-locates ~80% of county-disclosed households to the wrong county. Root cause: the county normalizer here zero-pads the raw 3-digit CPS
Two good things from this PR are being carried over onto #129 before it merges:
* Add CPS block geography exports
* Harden geoid CD encoding and fix duplicate-index in census-geography attach

Carry-overs from reconciling PR #129 against the closed PR #130:
- _congressional_district_geoid_from_cd_id: also accept the raw Census at-large
  forms (AL/ZZ tokens and district 0/98) and normalize them to district 01,
  matching eCPS's policyengine-us-data db/create_initial_strata.py. Verified the
  encoder reproduces the eCPS 436-CD calibration universe exactly on the real
  block crosswalk (AK=201, WY=5601, DC=1101).
- _attach_household_census_geographies: collapse to a fresh RangeIndex up front.
  The block write-back via .loc[row_index] previously raised "ValueError: cannot
  reindex on an axis with duplicate labels" on a non-unique household-frame
  index. The caller consumes the result via merge on the household_id column,
  not the index, so this is safe.
- Add tests/pipelines/test_geoid_cd_encoding.py: encoder contract (multi-district
  SSDD, at-large=01, raw-form hardening, invalid inputs), the duplicate-index
  regression, and a live 436-CD universe parity check (skips without the
  crosswalk parquet, e.g. in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
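The hardened encoding rule in the first carry-over can be sketched as follows. This is an illustration under stated assumptions, not the body of `_congressional_district_geoid_from_cd_id`: the function name `encode_cd_geoid` is hypothetical, and the `STATE_FIPS` table is a five-entry subset chosen to cover the examples quoted on this page.

```python
# Illustrative subset only; a real encoder needs the full state table.
STATE_FIPS = {"AK": 2, "CA": 6, "DC": 11, "NC": 37, "WY": 56}

# Raw at-large spellings named in the commit message, all normalized to district 1.
AT_LARGE_TOKENS = {"AL", "ZZ"}
AT_LARGE_NUMBERS = {0, 98}


def encode_cd_geoid(cd_id: str) -> int:
    """Convert '<abbr>-<district>' to the integer SSDD form: state_fips * 100 + district."""
    abbr, sep, district = cd_id.strip().upper().partition("-")
    if not sep or abbr not in STATE_FIPS:
        raise ValueError(f"unrecognized cd_id: {cd_id!r}")
    if district in AT_LARGE_TOKENS:
        number = 1
    elif district.isdecimal():
        number = int(district)
        if number in AT_LARGE_NUMBERS:
            number = 1
    else:
        raise ValueError(f"unrecognized district in cd_id: {cd_id!r}")
    return STATE_FIPS[abbr] * 100 + number


print(encode_cd_geoid("CA-01"))  # 601
print(encode_cd_geoid("DC-AL"))  # 1101
print(encode_cd_geoid("AK-00"))  # 201
```

Raising on unknown input, rather than returning a default, keeps a malformed `cd_id` from silently becoming a plausible-looking district.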
Summary
Makes microplex-us export the three Census GEOID leaves the enhanced-CPS (eCPS) baseline exports, so an MP dataset carries the same geography columns as a drop-in eCPS replacement:
- `block_geoid` (15-digit Census block GEOID, string)
- `tract_geoid` (11-digit tract GEOID, string)
- `congressional_district_geoid` (integer SSDD GEOID)

Production gap, not just a missing allowlist entry
The frozen eCPS export contract (`src/microplex_us/pipelines/ecps_export_contract.json`) lists all three in its `required` set (246 cols), but the MP export only emitted `state_fips`/`county_fips`. MP already had the block-assignment machinery (`geography.py`: `BlockGeography`, `derive_geographies`) and the real population-weighted block crosswalk locally, but the production pipeline (`_build_policyengine_households`) never assigned `block_geoid` to households. So this was a production gap: the leaves had to be produced, not merely allowlisted.

What changed
- `geography.py`
  - `assign_household_block_geography()`: draws a real 15-digit `block_geoid` per household from the population-weighted crosswalk, partitioned by the most specific geography available: CPS `county_fips` (disclosed) -> existing `congressional_district_geoid` (eCPS-style) -> `state_fips`. It then derives `tract_geoid = block_geoid[:11]` and the integer `congressional_district_geoid` from the block's crosswalk `cd_id`. Unresolved households (no valid state) get empty strings / CD 0 (the PE-US defaults); geoids are never fabricated.
  - `cd_id_to_congressional_district_geoid()`: converts crosswalk `"<abbr>-<dist>"` cd_ids to the PE-US integer SSDD geoid (`state_fips * 100 + district`). At-large states use district `1`, which matches the eCPS calibration-target CD universe exactly.
- `pipelines/us.py`: `_build_policyengine_households()` now calls a new `_assign_household_block_geography()` helper, gated by a `policyengine_assign_block_geography` config flag (default on). It is skipped cleanly when the crosswalk is unavailable (CI without the data).
- `policyengine/us.py`: adds `block_geoid`, `tract_geoid`, and `congressional_district_geoid` to `SAFE_POLICYENGINE_US_EXPORT_VARIABLES`. All three are storable INPUT variables (no formula) in the pinned policyengine-us, so they pass the computed-export guard and serialize as block/tract strings + CD `int64`.

Encoding correctness (verified against eCPS)
The crosswalk `cd_id` is `"<state_abbr>-<district>"` (e.g. `"CA-01"`, at-large `"DC-AL"`); PE-US and the eCPS targets store `congressional_district_geoid` as integer `SSDD`. With at-large districts encoded as `1`, the crosswalk's 436-CD universe is an exact match to the eCPS calibration-target CD universe (`policy_data.db`: 436 CDs, e.g. CA-01 -> 601, NC-01 -> 3701, DC-AL -> 1101, WY-AL -> 5601), with zero CDs missing on either side. A test asserts this equality.

Tests
`tests/policyengine/test_geoid_export.py` (18) + `tests/pipelines/test_us.py` (2 added):
- `tract_geoid == block_geoid[:11]` exactly (the load-bearing invariant)
- `cd // 100 == state_fips`

Existing geography / block-assignment / block-synthesis / export-column-gate suites pass unchanged.
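The invariants named in this section can be checked on any exported household frame with a few vectorized comparisons. The sketch below is an assumption-laden illustration, not the contents of the test files: `check_geoid_invariants` is a hypothetical helper, and the sample frames are made up.

```python
import pandas as pd


def check_geoid_invariants(households: pd.DataFrame) -> list[str]:
    """Return a description of each violated invariant among resolved households."""
    problems = []
    # Unresolved households carry an empty block_geoid and are skipped here.
    resolved = households[households["block_geoid"] != ""]
    if not (resolved["block_geoid"].str.len() == 15).all():
        problems.append("block_geoid is not 15 characters")
    if not (resolved["tract_geoid"] == resolved["block_geoid"].str[:11]).all():
        problems.append("tract_geoid != block_geoid[:11]")
    if not (resolved["congressional_district_geoid"] // 100 == resolved["state_fips"]).all():
        problems.append("cd // 100 != state_fips")
    return problems


good = pd.DataFrame(
    {
        "state_fips": [6, 56, 0],
        "block_geoid": ["060750002002000", "560210001001000", ""],
        "tract_geoid": ["06075000200", "56021000100", ""],
        "congressional_district_geoid": [611, 5601, 0],
    }
)
bad = good.assign(congressional_district_geoid=[611, 601, 0])  # Wyoming row gets a California CD

print(check_geoid_invariants(good))  # []
print(check_geoid_invariants(bad))   # ['cd // 100 != state_fips']
```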
🤖 Generated with Claude Code