
Add CBWSDID covariate balancing to StackedDiD (Ustyuzhanin 2026)#534

Merged
igerber merged 1 commit into main from feature/cbwsdid-stacked-balance on Jun 7, 2026

@igerber igerber commented Jun 7, 2026

Summary

  • Implement CBWSDID — Covariate-Balanced Weighted Stacked Difference-in-Differences (Ustyuzhanin 2026, arXiv:2604.02293) — as a covariate-balancing path on the existing StackedDiD, not a new estimator class. New constructor parameter balance="entropy" plus fit(..., covariates=[...]) add a within-sub-experiment design stage: entropy balancing (Hainmueller 2012) reweights the clean controls toward the treated cohort's covariate means (read at the last pre-treatment period t=a-1-anticipation, so design weights use only pre-treatment information). The resulting nonnegative design weights b_sa compose with the Wing et al. (2024) corrective weights via the effective control mass into the final stacked weights W_sa = b_sa·(N^D_a/N^D_Ω)/(Ñ^C_a/Ñ^C_Ω), injected at the single composed_weights point so the existing WLS + cluster-robust vcov produce the estimate and conditional-on-weights SEs.
  • This is control-only reweighting, so it estimates untreated trends under conditional parallel trends while preserving the trimmed-aggregate-ATT estimand (the treated cohorts and their shares are unchanged). A naive b_sa·Q_aggregate multiply is not equivalent and would bias the estimand under non-uniform weights — W_sa is computed directly from the effective control mass; a dedicated test guards this.
  • New module diff_diff/balancing.py provides the entropy-balancing solver (convex-dual damped Newton with an L-BFGS fallback).
  • Scope (v1): balance="entropy" requires weighting="aggregate" and balanced event windows; population/sample_share/survey_design=, ragged/unbalanced windows, matching-based balancing, and the repeated 0→1/1→0 episode extension are out of scope and fail closed (NotImplementedError/ValueError). Infeasible cohorts (treated covariate profile outside the clean-control hull) fail closed with a clear, cohort-named error rather than silently dropping a cohort (which would shift the estimand). Deferrals tracked in TODO.md.
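As a rough illustration of the design stage summarized above, the sketch below entropy-balances a set of control units toward a treated mean for a single covariate. It is not the solver in diff_diff/balancing.py: the function name `entropy_balance_1d` and its arguments are hypothetical, it handles one covariate only, it assumes strictly positive base weights, and it uses a plain damped Newton iteration on the convex dual with no L-BFGS fallback.

```python
import math


def entropy_balance_1d(x_control, target_mean, base_weights=None,
                       tol=1e-10, max_iter=100):
    """Hypothetical single-covariate entropy balancing (illustration only).

    Returns nonnegative weights summing to 1 whose weighted mean of
    x_control equals target_mean, obtained by minimizing the log-sum-exp
    dual of the KL-divergence problem.
    """
    n = len(x_control)
    q = [1.0 / n] * n if base_weights is None else list(base_weights)
    # Fail closed when the target lies outside the control range (infeasible).
    if not (min(x_control) < target_mean < max(x_control)):
        raise ValueError("target mean is outside the control hull; infeasible")

    z = [x - target_mean for x in x_control]  # covariate centered at the target

    def log_terms(lam):
        return [math.log(qi) + lam * zi for qi, zi in zip(q, z)]

    def dual(lam):
        # Dual objective log(sum_i q_i * exp(lam * z_i)), computed stably.
        a = log_terms(lam)
        m = max(a)
        return m + math.log(sum(math.exp(ai - m) for ai in a))

    def weights(lam):
        a = log_terms(lam)
        m = max(a)
        e = [math.exp(ai - m) for ai in a]
        s = sum(e)
        return [ei / s for ei in e]

    lam = 0.0
    for _ in range(max_iter):
        w = weights(lam)
        grad = sum(wi * zi for wi, zi in zip(w, z))  # remaining mean gap
        if abs(grad) < tol:
            break
        hess = sum(wi * zi * zi for wi, zi in zip(w, z)) - grad ** 2
        step = -grad / hess
        # Damping: halve the step until the dual objective stops increasing.
        t = 1.0
        f0 = dual(lam)
        while dual(lam + t * step) > f0 + 1e-12 and t > 1e-8:
            t *= 0.5
        lam += t * step
    return weights(lam)


x_control = [1.0, 2.0, 3.0, 4.0]
w = entropy_balance_1d(x_control, target_mean=3.0)
balanced_mean = sum(wi * xi for wi, xi in zip(w, x_control))
```

The returned weights sum to 1 here; how the real solver normalizes b_sa is not stated in this thread, so treat the scale as an assumption of the sketch. The range check mirrors the fail-closed behavior for a treated profile outside the clean-control hull.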

Methodology references (required if estimator / math changes)

  • Method name(s): CBWSDID covariate balancing for StackedDiD (entropy balancing within sub-experiments, composed with Wing-et-al. corrective stacked weights)
  • Paper / source link(s): Ustyuzhanin (2026) arXiv:2604.02293 (in-repo review docs/methodology/papers/ustyuzhanin-2026-review.md); Hainmueller (2012) Political Analysis 20(1) 25–46; Wing, Freedman & Hollingsworth (2024) NBER WP 32054
  • Any intentional deviations from the source (and why): v1 scope limits (aggregate weighting only, balanced windows only, weighting-based balancing only, no bootstrap re-estimation) are documented in docs/methodology/REGISTRY.md under the StackedDiD "Covariate balancing (CBWSDID)" subsection with - **Note:** labels, and the unit-count vs observation-count corrector convention off balanced panels is the documented (deferred) choice.

Validation

  • Tests added/updated: tests/test_balancing.py (entropy solver: exact moment balance, base weights, collinearity, infeasibility); tests/test_stacked_did.py::TestStackedDiDCovariateBalance (guards, cross-validation, constraint satisfaction, constant-covariate reduction, covariate-scale invariance, default-aggregate mode, ragged-window rejection, diagnostics); tests/test_methodology_stacked_did.py (closed-form DID^CBWSDID_e paper-formula anchor; effective-mass-is-load-bearing on a varying-control-count panel; estimand-integrity under heterogeneity; cross-language parity vs the R cbwsdid package); tests/test_guides.py::TestLLMsFullStackedDiDCoverage (signature pin).
  • Backtest / simulation / notebook evidence (if applicable): closed-form paper-formula anchor (R-independent, ~1e-8) and cross-language parity against R cbwsdid (refinement.method="weightit", method="ebal") at ~3e-7 on estimates and ~0.3% on SEs (golden fixture benchmarks/data/cbwsdid_golden.json, regenerate via benchmarks/R/generate_cbwsdid_golden.R). Full test suite: 7658 passed.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

github-actions Bot commented Jun 7, 2026

Overall Assessment

⚠️ Needs changes — one unmitigated P1 edge-case/methodology guard issue.

Executive Summary

  • Affected methods: StackedDiD(balance="entropy") CBWSDID path and the new entropy_balance() solver.
  • The main CBWSDID weighting design is documented in REGISTRY.md and aligns with the in-repo Ustyuzhanin review/source identity.
  • Documented v1 limits are not defects: aggregate weighting only, balanced windows only, no survey design, no matching/repeated-treatment extension, conditional-on-weights SEs.
  • P1: the balanced-window fail-closed guard is incomplete; it checks row counts only and can miss zero-row controls or duplicate/missing event-time combinations.
  • I could not run tests in this sandbox because importing numpy failed with ModuleNotFoundError.

Methodology

Finding M1 — P1

Severity: P1
Location: diff_diff/stacked_did.py:L1330-L1345, diff_diff/stacked_did.py:L1347-L1369; registry contract at docs/methodology/REGISTRY.md:L1559-L1563
Impact: balance="entropy" is documented to fail closed on ragged/unbalanced event windows, but the implementation only checks sub.groupby(unit).size() == expected_rows. This misses units that are eligible clean controls but have zero observed rows in the sub-experiment, because treated_units/control_units are derived from stacked_df after filtering. It also lets a unit pass if it has the right row count but a duplicated event time and a missing event time. That can violate the documented balanced-window assumption before CBWSDID weights are computed.
Concrete fix: Validate exact event-time coverage, not just counts. For every treated and clean-control unit expected from the sub-experiment membership, require exactly one row for each expected event time. Raise a cohort-named ValueError for zero-row units, missing event times, or duplicate (unit, _event_time) rows before calling entropy_balance().
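A minimal sketch of the stricter guard described in this finding, assuming the sub-experiment is available as plain (unit, event_time) pairs. The helper name `check_balanced_window` and its arguments are hypothetical and do not correspond to the code in diff_diff/stacked_did.py.

```python
from collections import Counter


def check_balanced_window(rows, expected_units, expected_event_times, cohort):
    """Hypothetical strict guard for one sub-experiment (illustration only).

    rows: iterable of (unit, event_time) pairs observed after filtering.
    expected_units: treated and clean-control units taken from cohort
        membership, not from the filtered rows, so zero-row units are visible.
    """
    counts = Counter(rows)  # (unit, event_time) -> number of rows
    expected_times = set(expected_event_times)

    for unit in expected_units:
        unit_counts = {t: c for (u, t), c in counts.items() if u == unit}
        if not unit_counts:
            raise ValueError(
                f"cohort {cohort}: unit {unit!r} has zero rows in the event window"
            )
        duplicated = sorted(t for t, c in unit_counts.items() if c > 1)
        if duplicated:
            raise ValueError(
                f"cohort {cohort}: unit {unit!r} has duplicate rows at event times {duplicated}"
            )
        missing = sorted(expected_times - set(unit_counts))
        if missing:
            raise ValueError(
                f"cohort {cohort}: unit {unit!r} is missing event times {missing}"
            )


window = [-2, -1, 0, 1]
# u1 is balanced; u2 has the right row count but repeats -1 and lacks 0.
rows = [("u1", t) for t in window] + [("u2", -2), ("u2", -1), ("u2", -1), ("u2", 1)]

# A count-only check is satisfied by u2.
rows_per_unit = Counter(unit for unit, _ in rows)
count_only_passes = rows_per_unit["u2"] == len(window)

# The coverage check rejects it with a cohort-named error.
try:
    check_balanced_window(rows, ["u1", "u2"], window, cohort=2005)
except ValueError as err:
    message = str(err)
```

Passing the expected units in from cohort membership, rather than deriving them from the filtered rows, is what makes the zero-row case detectable.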

Finding M2 — P3 Informational

Severity: P3
Location: docs/methodology/REGISTRY.md:L1555-L1557, TODO.md:L77
Impact: v1 scope limits are documented with Note labels and tracked in TODO.md: aggregate weighting only, balanced event windows only, no survey_design=, no matching/repeated-treatment extension, and conditional-on-weights inference. These are not defects under the review rules.
Concrete fix: None required.

Code Quality

Finding C1 — P2

Severity: P2
Location: diff_diff/stacked_did.py:L356-L387
Impact: covariates is typed/documented as list[str], but a common accidental call like covariates="age" is treated as an iterable of characters by list(dict.fromkeys(covariates)), producing confusing missing-column errors.
Concrete fix: Add an explicit guard: either reject str with a clear TypeError/ValueError, or coerce covariates="age" to ["age"]. Add a small API test.
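The failure mode and one possible guard can be sketched as follows. The helper name `normalize_covariates` is hypothetical, and the sketch takes the "reject with TypeError" option rather than the coercion option.

```python
def normalize_covariates(covariates):
    """Hypothetical helper: reject a bare string, then de-duplicate in order."""
    if isinstance(covariates, str):
        raise TypeError(
            f"covariates must be a list of column names, got the string "
            f"{covariates!r}; did you mean [{covariates!r}]?"
        )
    return list(dict.fromkeys(covariates))


# Without the guard, a string is silently split into characters.
split_chars = list(dict.fromkeys("age"))  # ['a', 'g', 'e']

# With the guard, lists are de-duplicated and strings are rejected.
cleaned = normalize_covariates(["age", "income", "age"])  # ['age', 'income']
```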

Performance

No unmitigated findings. The per-cohort entropy balancing loop is expected for this v1 design.

Maintainability

No unmitigated findings. The new balancing code is isolated and the estimator parameter is threaded through get_params(), set_params(), wrapper, and results.

Tech Debt

Documented CBWSDID deferrals are properly tracked in TODO.md:L77 and REGISTRY.md:L1555-L1557; they are P3 informational.

Security

No findings. The added benchmark files and R script do not show secrets or credential material.

Documentation/Tests

Finding T1 — P2

Severity: P2
Location: tests/test_stacked_did.py:L1873-L1887, diff_diff/stacked_did.py:L1330-L1343
Impact: Tests cover a ragged panel with a single missing row, but not the two cases the current guard misses: an eligible control with zero rows in the window, and a duplicate/missing event-time pattern that still has the expected row count.
Concrete fix: Add regression tests for both cases and assert a cohort-specific ValueError.

Path to Approval

  1. Strengthen balance="entropy" balanced-window validation to require exact expected event-time coverage for every treated and clean-control unit, including zero-row eligible controls and duplicate (unit, event_time) rows.
  2. Add tests for zero-row eligible controls and duplicate/missing event-time combinations that currently bypass the row-count-only guard.

@igerber igerber force-pushed the feature/cbwsdid-stacked-balance branch from 82722ff to f65f02e on June 7, 2026 10:58

github-actions Bot commented Jun 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: f65f02e50ebc5f663e21fe5ad2cc1130d37e09c5


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Affected methods: StackedDiD(balance="entropy") CBWSDID path and entropy_balance().
  • Prior P1 is resolved: balanced-window validation now checks zero-row eligible units, wrong counts, and duplicate (unit, event_time) rows.
  • Prior P2 API issue is resolved: covariates="x" now raises a clear TypeError.
  • Methodology matches the registry’s CBWSDID contract: control-only entropy balancing, effective-control-mass composition, unchanged treated shares, and conditional-on-weights clustered inference.
  • Documented v1 limitations are properly tracked in REGISTRY.md and TODO.md; these are P3 informational, not blockers.
  • I could not run tests in this sandbox: pytest and numpy are not installed.

Methodology

Finding M1 — P3 Informational

Severity: P3
Location: docs/methodology/REGISTRY.md:L1547-L1563, TODO.md:L77, diff_diff/stacked_did.py:L1290-L1451
Impact: The implemented CBWSDID path follows the documented absorbing-design scope: entropy balancing within each clean-control sub-experiment, final W_sa via effective control mass, treated weights unchanged, and conditional-on-weights clustered inference. This aligns with the cited paper’s design-stage plus corrective-weight framing (arXiv:2604.02293). Scope limits such as aggregate weighting only, balanced windows only, no survey_design=, and no matching/repeated-treatment extension are documented with Note labels and tracked in TODO.md, so they are not defects under the review rules.
Concrete fix: None required.

Prior Finding M1 — Resolved

Severity: Resolved prior P1
Location: diff_diff/stacked_did.py:L1339-L1364, tests at tests/test_stacked_did.py:L1899-L1933
Impact: The prior row-count-only balanced-window guard has been strengthened. It now catches eligible units with zero rows, wrong row counts, and duplicate (unit, _event_time) rows before balancing.
Concrete fix: None required.

Code Quality

Prior Finding C1 — Resolved

Severity: Resolved prior P2
Location: diff_diff/stacked_did.py:L357-L361, test at tests/test_stacked_did.py:L1935-L1938
Impact: Passing covariates="x" no longer gets interpreted as character covariates; it raises a clear TypeError.
Concrete fix: None required.

Performance

No findings. The per-cohort entropy-balancing loop is expected for this v1 implementation.

Maintainability

No findings. The new balance parameter is propagated through constructor validation, get_params(), set_params(), stacked_did(), results metadata, and guide signature tests: diff_diff/stacked_did.py:L1458-L1585, diff_diff/stacked_did_results.py:L110-L119, tests/test_guides.py:L740-L768.

Tech Debt

Finding TD1 — P3 Informational

Severity: P3
Location: TODO.md:L77, docs/methodology/REGISTRY.md:L1555-L1557
Impact: Deferred CBWSDID work is explicitly tracked: unbalanced/ragged count convention, non-aggregate weighting, survey design, matching, and repeated-treatment episodes. These are deferrable and documented.
Concrete fix: None required.

Security

No findings. The new benchmark R script and data fixtures do not expose secrets or credential material.

Documentation/Tests

Prior Finding T1 — Resolved

Severity: Resolved prior P2
Location: tests/test_stacked_did.py:L1912-L1933
Impact: Regression tests now cover both previously missing cases: zero-row eligible controls and duplicate/missing event-time coverage with the expected row count.
Concrete fix: None required.

Verification Note — P3 Informational

Severity: P3
Location: local review environment
Impact: Targeted tests could not be executed because pytest is unavailable and importing numpy fails with ModuleNotFoundError.
Concrete fix: None for the PR; run the added CBWSDID test subset in CI or a dev environment with project test dependencies installed.

@igerber igerber added the ready-for-ci label (Triggers CI test workflows) Jun 7, 2026
Implement Covariate-Balanced Weighted Stacked Difference-in-Differences
(Ustyuzhanin 2026, arXiv:2604.02293) as a covariate-balancing path on the
existing StackedDiD estimator rather than a new class. A new constructor
parameter balance="entropy" plus fit(..., covariates=[...]) add a within-
sub-experiment design stage: entropy balancing (Hainmueller 2012) reweights
the clean controls toward the treated cohort's covariate means (read at the
last pre-treatment period t=a-1-anticipation, so design weights use only
pre-treatment information), and the resulting nonnegative design weights b_sa
compose with the Wing et al. (2024) corrective weights via the EFFECTIVE
control mass into the final stacked weights:

    W_sa = b_sa * (N^D_a/N^D_Omega) / (Ntilde^C_a/Ntilde^C_Omega)   (controls)
    W_sa = 1                                                        (treated)

injected at the single composed_weights point so the existing WLS +
cluster-robust vcov produce the estimate and conditional-on-weights SEs. This
is control-only reweighting, so it estimates untreated trends under conditional
parallel trends while preserving the trimmed-aggregate-ATT estimand (it reduces
exactly to weighted stacked DID at b_sa=1). A naive b_sa*Q_aggregate multiply
is NOT equivalent — it would aggregate control means with the wrong cohort
weights — so W_sa is computed directly from the effective control mass.

New module diff_diff/balancing.py provides the entropy-balancing solver
(convex-dual damped Newton with L-BFGS fallback). Scope (v1): balance="entropy"
requires weighting="aggregate"; balance + population/sample_share/survey_design=
raise NotImplementedError; matching-based balancing and the repeated-treatment
extension are out of scope. Infeasible cohorts fail closed with a clear,
cohort-named error rather than silently dropping a cohort.

Validation: closed-form DID^CBWSDID_e paper-formula anchor (R-independent, 1e-8),
an effective-mass-is-load-bearing test on a varying-control-count panel, the
estimand-integrity-under-heterogeneity test (recovers the treated-average ATT
where plain StackedDiD is biased), and cross-language parity against the R
cbwsdid package (weightit/ebal) at ~3e-7 (golden fixture committed). Docs,
REGISTRY, references, guides, and CHANGELOG updated; llms-full signature pinned
in test_guides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the feature/cbwsdid-stacked-balance branch from f65f02e to 7a90da5 on June 7, 2026 13:13

github-actions Bot commented Jun 7, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 7a90da565928c5cc4a4830b8d8f77c62b0adf2b1


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Affected methods: StackedDiD(balance="entropy") CBWSDID path and new entropy_balance().
  • Prior P1 remains resolved: balanced-window validation now catches zero-row eligible units, wrong row counts, and duplicate (unit, event_time) rows.
  • Prior P2 remains resolved: covariates="x" raises a clear TypeError.
  • Methodology aligns with the registry: control-only entropy balancing, effective-control-mass composition, unchanged treated shares, and conditional-on-weights clustered inference.
  • Documented v1 limits are properly mitigated by REGISTRY.md / TODO.md: aggregate weighting only, balanced windows only, no survey design, no matching/repeated-treatment extension.
  • I could not run tests because this sandbox lacks numpy, pandas, scipy, and pytest.

Methodology

Finding M1 — P3 Informational

Severity: P3
Location: docs/methodology/REGISTRY.md:L1541-L1563, diff_diff/stacked_did.py:L1290-L1451
Impact: The implementation follows the documented CBWSDID contract: entropy balancing computes control-only b_sa, final control weights use W_sa = b_sa * (N^D_a/N^D_Omega)/(Ntilde^C_a/Ntilde^C_Omega), treated weights stay unchanged, and inference conditions on the composed weights.
Concrete fix: None required.
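The composition in this finding can be illustrated with a small sketch. It assumes the effective control mass Ntilde^C_a is the sum of the design weights b_sa over the clean controls of sub-experiment a, and that the naive alternative uses raw control counts in the Wing et al. corrector; both are readings of this thread, not confirmed details of the implementation. The function name and the toy numbers are hypothetical.

```python
def composed_control_weights(b, n_treated):
    """Hypothetical composition of design and corrective weights.

    b: {cohort: [b_sa for each clean control in that sub-experiment]}
    n_treated: {cohort: N^D_a}
    Assumes Ntilde^C_a = sum of b_sa within cohort a (effective control mass).
    """
    n_d_total = sum(n_treated.values())
    mass = {a: sum(ws) for a, ws in b.items()}
    mass_total = sum(mass.values())
    out = {}
    for a, ws in b.items():
        corrector = (n_treated[a] / n_d_total) / (mass[a] / mass_total)
        out[a] = [w * corrector for w in ws]
    return out


def control_share(weights, cohort):
    """Share of total control weight carried by one cohort."""
    total = sum(sum(ws) for ws in weights.values())
    return sum(weights[cohort]) / total


# Toy panel: two cohorts with different control counts and non-uniform b_sa.
b = {2005: [0.5, 0.3, 0.2], 2008: [0.1, 0.9]}
n_treated = {2005: 10, 2008: 30}

w_eff = composed_control_weights(b, n_treated)
eff_share_2005 = control_share(w_eff, 2005)  # matches treated share 10/40

# Naive alternative: multiply b_sa by a corrector built from raw control counts.
n_c = {a: len(ws) for a, ws in b.items()}
n_c_total = sum(n_c.values())
n_d_total = sum(n_treated.values())
w_naive = {
    a: [w * (n_treated[a] / n_d_total) / (n_c[a] / n_c_total) for w in ws]
    for a, ws in b.items()
}
naive_share_2005 = control_share(w_naive, 2005)  # drifts from 10/40
```

Under these assumptions the effective-mass form gives each cohort's controls the same share of total control weight as its treated share, while the count-based multiply does not once control counts vary across cohorts.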

Prior Finding M1 — Resolved

Severity: Resolved prior P1
Location: diff_diff/stacked_did.py:L1339-L1374, tests/test_stacked_did.py:L1899-L1933
Impact: The balanced-window guard now checks exact eligible-unit coverage, including zero-row eligible units and duplicate (unit, _event_time) rows.
Concrete fix: None required.

Code Quality

Prior Finding C1 — Resolved

Severity: Resolved prior P2
Location: diff_diff/stacked_did.py:L357-L361, tests/test_stacked_did.py:L1935-L1938
Impact: String covariates are rejected before being interpreted as iterable column names.
Concrete fix: None required.

Performance

No findings. The per-sub-experiment entropy-balancing loop is expected for the documented v1 scope.

Maintainability

No findings. balance is validated, included in get_params() / set_params(), threaded through stacked_did(), stored on results, and documented in guides/API surfaces.

Tech Debt

Finding TD1 — P3 Informational

Severity: P3
Location: TODO.md:L77, docs/methodology/REGISTRY.md:L1555-L1557
Impact: Deferred CBWSDID work is explicitly tracked: ragged-window count convention, non-aggregate weighting, survey design, matching, and repeated-treatment episodes.
Concrete fix: None required.

Security

No findings. The new benchmark data/script do not expose credentials or sensitive material.

Documentation/Tests

Prior Finding T1 — Resolved

Severity: Resolved prior P2
Location: tests/test_stacked_did.py:L1912-L1933
Impact: Regression tests cover the previously missed balanced-window failure modes.
Concrete fix: None required.

Finding T2 — P3 Informational

Severity: P3
Location: local review environment
Impact: I could not execute the targeted tests; required test/runtime dependencies are unavailable in this sandbox.
Concrete fix: Run the CBWSDID test subset in CI or a dev environment with project dependencies installed.

@igerber igerber merged commit 98d30ac into main Jun 7, 2026
26 checks passed
@igerber igerber deleted the feature/cbwsdid-stacked-balance branch June 7, 2026 16:12