igerber · Jun 7, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- **`StackedDiD` covariate balancing (CBWSDID; Ustyuzhanin 2026, arXiv:2604.02293).** The new constructor parameter `balance="entropy"`, together with `fit(..., covariates=[...])`, adds a within-sub-experiment design stage: entropy balancing (Hainmueller 2012) reweights the clean controls toward the treated cohort's covariate means (read at the last pre-treatment period), and the resulting design weights `b_sa` compose with the Wing et al. (2024) corrective weights via the effective control mass into the final stacked weights `W_sa`. This is **control-only reweighting**, so it estimates untreated trends under *conditional* parallel trends while preserving the trimmed-aggregate-ATT estimand (at `b_sa=1` it reduces to the paper's unit-count weighted stacked DID, equal to `StackedDiD(weighting="aggregate")` on balanced event windows). Inference reuses the existing conditional-on-weights cluster-robust path. Scope: requires `weighting="aggregate"` and **balanced event windows** (ragged windows raise an error, because the unit-count vs observation-count convention is unresolved off balanced panels). `population`, `sample_share`, `survey_design=`, matching-based balancing, and the repeated-treatment extension are not supported (each raises `NotImplementedError`). Infeasible cohorts fail closed with a clear error. New `diff_diff/balancing.py` (entropy-balancing solver). Estimand validated end-to-end against the closed-form CBWSDID formula (`tests/test_methodology_stacked_did.py`).
 - **`SyntheticControl` conformal inference (Chernozhukov, Wüthrich & Zhu 2021, *JASA* 116(536)).** Three opt-in `SyntheticControlResults` methods give valid p-values for the post-period effect trajectory and pointwise confidence intervals — what the in-space placebo / Firpo-Possebom test-inversion paths cannot. Unlike the Firpo path (which re-ranks the cross-unit placebo gaps), the conformal layer fits its **own** time-permutation-invariant constrained-LS synthetic-control proxy (CWZ §2.3 eqs 3–4 — simplex weights on raw outcomes over **all** periods under the null, no `V`-matrix, no intercept) and permutes residuals **over time** for the single treated unit (CWZ's exactness theory requires a time-symmetric proxy, which the headline ADH `V`-matrix fit is not). **`conformal_test(effect, q=1, scheme="moving_block", n_iid=10000, seed=None)`** computes the joint sharp-null permutation p-value (eqs 1–2) of `S_q(û) = ((1/√T*)·Σ_{t>T0}|û_t|^q)^{1/q}` (`q ∈ {1, 2, ∞}`); the proxy is fit once and only residuals are permuted (footnote 7). **`conformal_confidence_intervals(alpha=0.1, scheme="moving_block", bounds=None, n_grid=100, seed=None)`** returns pointwise per-period CIs by test inversion (Algorithm 1 — each period `t` uses `Z = (pre-periods, t)` with the other post-periods dropped, a clean `T*=1` test). **`conformal_average_effect(alpha=0.1, scheme="moving_block", bounds=None, n_grid=200, seed=None)`** returns a CI for the average post-period effect by collapsing the panel into non-overlapping `T*`-blocks and permuting the block residuals (Appendix A.1). Permutation schemes: `"moving_block"` (`Π_→` cyclic shifts, valid under serial dependence — the default) and `"iid"` (`Π_all`, sampled, finer p-values); both include the identity so the p-value floor is `1/|Π|` (no extra `+1`). Fail-closed handling for `<1` donor / unpickled result / non-finite panel / non-converged grid points (treated as indeterminate, not rejected) / grid-limited / empty / unbounded sets; a single donor and `T*≥T0` warn. Surfaced under `conformal_inference` / `get_conformal_grid_df()` and `DiagnosticReport`'s `estimator_native_diagnostics`; the analytical `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay NaN throughout. Core in the new `diff_diff/conformal.py` (reuses the Frank-Wolfe simplex solver). *Deferred:* one-sided variants (§7), covariates folded into the proxy, and the AR/innovation-permutation path (Lemmas 5–7).
 
 ### Changed

diff --git a/README.md b/README.md
@@ -112,7 +112,7 @@ Full guide: `diff_diff.get_llm_guide("practitioner")`.
 - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html) - triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
 - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html) - Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves
 - [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html) - de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲ = 0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲ > 0`, continuous-near-d̲ or mass-point), with a multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release - repeated cross-sections rejected by the validator. Alias `HAD`.
-- [StackedDiD](https://diff-diff.readthedocs.io/en/stable/api/stacked_did.html) - Wing, Freedman & Hollingsworth (2024) stacked DiD with Q-weights and sub-experiments
+- [StackedDiD](https://diff-diff.readthedocs.io/en/stable/api/stacked_did.html) - Wing, Freedman & Hollingsworth (2024) stacked DiD with Q-weights and sub-experiments; optional covariate balancing (Ustyuzhanin 2026)
 - [EfficientDiD](https://diff-diff.readthedocs.io/en/stable/api/efficient_did.html) - Chen, Sant'Anna & Xie (2025) efficient DiD with optimal weighting for tighter SEs
 - [TROP](https://diff-diff.readthedocs.io/en/stable/api/trop.html) - Triply Robust Panel estimator (Athey et al. 2025) with nuclear norm factor adjustment
 - [StaggeredTripleDifference](https://diff-diff.readthedocs.io/en/stable/api/staggered.html#staggeredtripledifference) - Ortiz-Villavicencio & Sant'Anna (2025) staggered DDD with group-time ATT

diff --git a/TODO.md b/TODO.md
@@ -74,6 +74,7 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Issue | Location | PR | Priority |
 |-------|----------|----|----------|
+| CBWSDID covariate balancing (`StackedDiD(balance="entropy")`) v1 supports only balanced event windows + `weighting="aggregate"`; unbalanced/ragged panels fail closed (the unit-count vs observation-count corrector convention is unresolved off balanced panels). Matching-based balancing and the repeated `0→1`/`1→0` episode extension are also deferred (out-of-scope guards raise). Documented in REGISTRY.md StackedDiD "Covariate balancing (CBWSDID)" Notes. | `stacked_did.py`, `balancing.py`, `docs/methodology/REGISTRY.md` | follow-up | Low |
 | `SyntheticControl` cv: `in_space_placebo()` / `leave_one_out()` report a cv refit excluded for STRUCTURAL infeasibility (donor-indistinguishable re-aggregated window) with the generic `status="failed"` — same machine-readable status as a genuine inner-solver non-convergence. The failure warnings now distinguish the two causes (and the correct remediation) under cv, and `in_time_placebo()` already splits structural→`"infeasible"` vs `"failed"`, but in-space/LOO do not yet emit a separate machine-readable status/reason-code. Thread a reason code from `_outer_solve_V_cv()`/`_placebo_fit_unit()` and add an `"infeasible"` status + count to the in-space/LOO outputs (mirror the in-time split). | `synthetic_control.py`, `synthetic_control_results.py` | follow-up | Low |
 | dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low |
 | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium |

diff --git a/benchmarks/R/generate_cbwsdid_golden.R b/benchmarks/R/generate_cbwsdid_golden.R
@@ -0,0 +1,52 @@
+#!/usr/bin/env Rscript
+# Generate the cross-language golden fixture for StackedDiD's covariate-balancing
+# (CBWSDID) path against the reference R package `cbwsdid` (Ustyuzhanin 2026).
+#
+# Unlike generate_stacked_did_golden.R (which operates on a PRE-stacked CSV so the
+# R side is independent of Python stacking logic), `cbwsdid` does its OWN stacking
+# + balancing, so this harness hands it the raw panel and dumps the dynamic
+# event-study ATTs. The Python side (StackedDiD(balance="entropy", ...)) reproduces
+# them via its independent entropy-balancing solver + effective-mass W_sa.
+#
+# Refinement: refinement.method="weightit", method="ebal" = entropy balancing
+# (Hainmueller 2012) on covs.formula=~x, matching StackedDiD(balance="entropy",
+# covariates=["x"]). Install: remotes::install_github("vadvu/cbwsdid").
+#
+# Usage: Rscript benchmarks/R/generate_cbwsdid_golden.R
+
+suppressMessages({
+  library(cbwsdid)
+  library(jsonlite)
+})
+
+# Input and output paths are relative to the repository root.
+panel_csv <- "benchmarks/data/cbwsdid_balance_panel.csv"
+out_json <- "benchmarks/data/cbwsdid_golden.json"
+
+df <- read.csv(panel_csv)
+
+m <- cbwsdid(
+  data = df, y = "y", d = "d", id = c("unit", "time"),
+  kappa = c(-2, 2), design = "absorbing", post_path = "stable",
+  refinement.method = "weightit", covs.formula = ~x,
+  refinement.args = list(method = "ebal"), pooled = TRUE
+)
+qoi <- cbwsdid_qoi(m, type = "dynamic")
+
+golden <- list(
+  meta = list(
+    package = "cbwsdid",
+    R_version = R.version.string,
+    panel = panel_csv,
+    estimator = "cbwsdid(design='absorbing', refinement.method='weightit', method='ebal', covs.formula=~x)",
+    kappa = c(-2L, 2L)
+  ),
+  dynamic = list(
+    event_time = as.integer(qoi$et),
+    estimate = as.numeric(qoi$estimate),
+    std_error = as.numeric(qoi$std.error)
+  )
+)
+write_json(golden, out_json, auto_unbox = TRUE, digits = 15, pretty = TRUE)
+cat("wrote", out_json, "\n")
+print(data.frame(et = qoi$et, estimate = qoi$estimate, se = qoi$std.error))
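
The harness above and the changelog entry both lean on entropy balancing (Hainmueller 2012) as the design stage. As a rough illustration only, here is a minimal pure-Python sketch for a single covariate, matching the one-covariate setup `covs.formula = ~x`. It is not the solver in `diff_diff/balancing.py` nor the one in `cbwsdid`: the function name `entropy_balance_1d` is hypothetical, and bisection on the dual is an assumption chosen for simplicity. The weights have the exponential-tilting form `w_i ∝ exp(λ·(x_i − target))`, and `λ` is chosen so the weighted control mean hits the treated mean.

```python
import math


def _tilted_weights(lam, z):
    """Normalized weights w_i proportional to exp(lam * z_i), shifted for stability."""
    shift = max(lam * zi for zi in z)
    e = [math.exp(lam * zi - shift) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def _mean_gap(lam, z):
    """Weighted mean of the centered covariate; zero exactly at balance."""
    return sum(w * zi for w, zi in zip(_tilted_weights(lam, z), z))


def entropy_balance_1d(x_controls, target_mean, n_iter=200):
    """Hypothetical helper: weights on controls (summing to 1) whose
    weighted mean of x equals target_mean, with minimal entropy distance
    from uniform weights."""
    # Balance is only feasible if the target lies strictly inside the
    # range of the control covariate; otherwise fail closed.
    if not (min(x_controls) < target_mean < max(x_controls)):
        raise ValueError("infeasible: target mean outside control covariate range")
    z = [x - target_mean for x in x_controls]
    # The gap is monotone increasing in lam, so bracket the root and bisect.
    lo, hi = -1.0, 1.0
    while _mean_gap(lo, z) > 0:
        lo *= 2.0
    while _mean_gap(hi, z) < 0:
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if _mean_gap(mid, z) > 0:
            hi = mid
        else:
            lo = mid
    return _tilted_weights(0.5 * (lo + hi), z)


controls_x = [0.0, 1.0, 2.0, 3.0]   # illustrative control covariate values
treated_mean = 2.0                  # illustrative treated-cohort mean
weights = entropy_balance_1d(controls_x, treated_mean)
balanced_mean = sum(w * x for w, x in zip(weights, controls_x))
```

With several covariates the same dual becomes a multivariate convex problem, which is where a Newton-type solver would normally replace bisection.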