This document tracks the progress of reviewing each estimator's implementation against the Methodology Registry and academic references. It ensures that implementations are correct, consistent, and well-documented.
For the methodology registry with academic foundations and key equations, see docs/methodology/REGISTRY.md.
Each estimator in diff-diff should be periodically reviewed to ensure:
- Correctness: Implementation matches the academic paper's equations
- Reference alignment: Behavior matches reference implementations (R packages, Stata commands)
- Edge case handling: Documented edge cases are handled correctly
- Standard errors: SE formulas match the documented approach
A Complete entry has a documented review pass against the primary academic source captured in this file. The minimum content is:
- A "Corrections Made" block listing every implementation fix the review uncovered, or `(None — implementation verified correct)`.
- An explicit statement of deviations from the reference implementation, or `(None)`. Format varies: some entries use a dedicated "Deviations" / "Deviations from R" block, others surface deviations inline in "Corrections Made" or "Outstanding Concerns".
- Verification evidence: a "Verified Components" checklist, an "Edge Cases Verified" enumeration, an "R Comparison Results" table, or some combination of these.
The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries.
In Progress entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures; others have only the REGISTRY entry and unit tests (e.g., PlaceboTests). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
Not Started entries have neither a tracker walk-through nor a REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands.
| Estimator | Module | R / Stata Reference | Status | Last Review |
|---|---|---|---|---|
| DifferenceInDifferences | estimators.py | fixest::feols() | Complete | 2026-01-24 |
| MultiPeriodDiD | estimators.py | fixest::feols() | Complete | 2026-02-02 |
| TwoWayFixedEffects | twfe.py | fixest::feols() | Complete | 2026-02-08 |
| Estimator | Module | R / Stata Reference | Status | Last Review |
|---|---|---|---|---|
| CallawaySantAnna | staggered.py | did::att_gt() | Complete | 2026-01-24 |
| SunAbraham | sun_abraham.py | fixest::sunab() | Complete | 2026-02-15 |
| StackedDiD | stacked_did.py | stacked-did-weights (Wing-Freedman-Hollingsworth code) | Complete | 2026-02-19 |
| ImputationDiD | imputation.py | didimputation | Complete | 2026-06-06 |
| TwoStageDiD | two_stage.py | did2s | In Progress | — |
| WooldridgeDiD (ETWFE) | wooldridge.py | etwfe (R) / jwdid (Stata) | Complete | 2026-05-22 |
| EfficientDiD | efficient_did.py | (no canonical R package) | Complete | 2026-06-01 |
| Estimator | Module | R / Stata Reference | Status | Last Review |
|---|---|---|---|---|
| ContinuousDiD | continuous_did.py | contdid v0.1.0 | Complete | 2026-05-20 |
| ChaisemartinDHaultfoeuille (DCDH) | chaisemartin_dhaultfoeuille.py | DIDmultiplegtDYN | Complete | 2026-05-21 |
| HeterogeneousAdoptionDiD (HAD) | had.py, had_pretests.py | chaisemartin::did_had (Credible-Answers/did_had v2.0.0); nprobust for bandwidth | Complete | 2026-05-20 |
| TROP | trop.py, trop_local.py, trop_global.py | (forthcoming; paper-author reference implementation) | Complete (paper method="local") | 2026-05-24 |
| Estimator | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| TripleDifference | triple_diff.py | triplediff::ddd() | Complete | 2026-02-18 |
| StaggeredTripleDifference | staggered_triple_diff.py | triplediff::ddd(panel=TRUE) + agg_ddd() | Complete | 2026-05-30 |
| Estimator | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| SyntheticDiD | synthetic_did.py | synthdid::synthdid_estimate() | Complete | 2026-04-23 |
| Tool | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| BaconDecomposition | bacon.py | bacondecomp::bacon() | Complete | 2026-05-16 |
| HonestDiD | honest_did.py | HonestDiD package | Complete | 2026-04-01 |
| PreTrendsPower | pretrends.py | pretrends package | Complete | 2026-05-19 |
| PowerAnalysis | power.py | pwr / DeclareDesign | Complete | 2026-05-31 |
| PlaceboTests | diagnostics.py | (no canonical reference) | In Progress | — |
| Feature | Module | Reference | Status | Last Review |
|---|---|---|---|---|
| ConleySpatialHAC | conley.py, linalg.py | conleyreg (R) / acreg (Stata) | Complete | 2026-05-26 |
| Survey Data Support | survey.py, bootstrap_utils.py | survey package (R) | In Progress | — |
Status legend (matches the contract in § What "Complete" means in this tracker above):
- Not Started: No REGISTRY.md entry yet. Reserved for future surfaces; this tracker currently carries no Not Started rows.
- In Progress: REGISTRY.md entry and unit-test coverage exist, but no formal walk-through has been captured in this document yet. The band is wide — see each entry's "Documentation in place" / "Outstanding for promotion" sub-sections for specifics.
- Complete: A documented review pass against the primary academic source is captured here (minimum: Corrections Made, Deviations or `(None)`, and Verified Components / Edge Cases Verified / R Comparison Results in some form).
| Field | Value |
|---|---|
| Module | estimators.py |
| Primary Reference | Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-01-24 |
Verified Components:
- ATT formula: Double-difference of cell means matches regression interaction coefficient
- R comparison: ATT matches `fixest::feols()` within 1e-3 tolerance
- R comparison: SE (HC1 robust) matches within 5%
- R comparison: P-value matches within 0.01
- R comparison: Confidence intervals overlap
- R comparison: Cluster-robust SE matches within 10%
- R comparison: Fixed effects (absorb) matches `feols(...|unit)` within 1%
- Wild bootstrap inference (Rademacher, Mammen, Webb weights)
- Formula interface (`y ~ treated * post`)
- All REGISTRY.md edge cases tested
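The first checklist item, that the double difference of cell means equals the regression interaction coefficient, can be checked numerically. The following is a minimal sketch on simulated data using plain NumPy rather than the diff-diff API; the data-generating values and variable names are illustrative assumptions.

```python
import numpy as np

# Simulated 2x2 design (all numbers here are illustrative assumptions).
rng = np.random.default_rng(0)
n = 400
treated = rng.integers(0, 2, size=n)
post = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * treated + 0.3 * post + 2.0 * treated * post + rng.normal(size=n)

def cell_mean(d, p):
    """Mean outcome in the (treated=d, post=p) cell."""
    return y[(treated == d) & (post == p)].mean()

# Double difference of the four cell means.
att_cells = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Saturated OLS: intercept, treated, post, treated x post.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
att_ols = beta[3]

print(att_cells, att_ols)  # the two numbers agree up to floating-point error
```

Because the regression is saturated in the two binary indicators, the equality is algebraic and does not depend on the simulated noise.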
Test Coverage:
- 51 methodology verification tests in `tests/test_methodology_did.py`
- Existing unit-test coverage in `tests/test_estimators.py` (`TestDifferenceInDifferences` class plus shared estimator-API classes)
- R benchmark tests (skip if R not available)
R Comparison Results:
- ATT matches within 1e-3 (R JSON truncation limits precision)
- HC1 SE matches within 5%
- Cluster-robust SE matches within 10%
- Fixed effects results match within 1%
Corrections Made:
- (None — implementation verified correct)
Outstanding Concerns:
- R comparison precision limited by JSON output truncation (4 decimal places)
- Consider improving R script to output full precision for tighter tolerances
Edge Cases Verified:
- Empty cells: Produces rank deficiency warning (expected behavior)
- Singleton clusters: Included in variance estimation, contribute via residuals (corrected REGISTRY.md)
- Rank deficiency: All three modes (warn/error/silent) working
- Non-binary treatment/time: Raises ValueError as expected
- No variation in treatment/time: Raises ValueError as expected
- Missing values: Raises ValueError as expected
Deviations from R's fixest::feols(): (None — point estimates and SEs match within
documented tolerances; cluster-robust and absorbed-FE behavior verified.)
| Field | Value |
|---|---|
| Module | estimators.py |
| Primary Reference | Freyaldenhoven et al. (2021), Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-02-02 |
Verified Components:
- Full event-study specification: treatment × period interactions for ALL non-reference periods (pre and post)
- Reference period coefficient is zero (normalized by omission from design matrix)
- Default reference period is last pre-period (e=-1 convention, matches fixest/did)
- Pre-period coefficients available for parallel trends assessment
- Average ATT computed from post-treatment effects only, with covariance-aware SE
- Returns PeriodEffect objects with confidence intervals for all periods
- Supports balanced and unbalanced panels
- NaN inference: t_stat/p_value/CI use NaN when SE is non-finite or zero
- R-style NA propagation: avg_att is NaN if any post-period effect is unidentified
- Rank-deficient design matrix: warns and sets NaN for dropped coefficients (R-style)
- Staggered adoption detection warning (via `unit` parameter)
- Treatment reversal detection warning
- Time-varying D_it detection warning (advises creating ever-treated indicator)
- Single pre-period warning (ATT valid but pre-trends assessment unavailable)
- Post-period reference_period raises ValueError (would bias avg_att)
- HonestDiD/PreTrendsPower integration uses interaction sub-VCV (not full regression VCV)
- All REGISTRY.md edge cases tested
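The covariance-aware average ATT and the NaN rules in the checklist above can be illustrated with a short sketch. The helper names `average_post_att` and `nan_guarded_t_stat` are hypothetical and are not the library's functions; the sketch assumes an equal-weighted average over the post-period interaction coefficients and their sub-VCV.

```python
import math
import numpy as np

def average_post_att(effects, vcv):
    """Hypothetical helper: equal-weighted mean of post-period effects.

    The SE uses the full sub-VCV (w' V w), so covariances between period
    effects are included. Any NaN effect makes the average NaN (R-style).
    """
    effects = np.asarray(effects, dtype=float)
    vcv = np.asarray(vcv, dtype=float)
    k = effects.size
    if k == 0 or np.isnan(effects).any():
        return math.nan, math.nan
    w = np.full(k, 1.0 / k)
    att = float(w @ effects)
    var = float(w @ vcv @ w)
    se = math.sqrt(var) if math.isfinite(var) and var >= 0 else math.nan
    return att, se

def nan_guarded_t_stat(att, se):
    """Hypothetical helper: NaN when the SE is non-finite or zero."""
    if not math.isfinite(se) or se == 0.0:
        return math.nan
    return att / se

att, se = average_post_att([1.0, 2.0], [[0.04, 0.01], [0.01, 0.09]])
att_bad, se_bad = average_post_att([1.0, math.nan], [[0.04, 0.0], [0.0, 0.09]])
```

In the first call the variance is 0.25 × (0.04 + 0.09 + 2 × 0.01) = 0.0375; ignoring the off-diagonal term would understate it.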
Test Coverage:
- 50 tests across `TestMultiPeriodDiD` and `TestMultiPeriodDiDEventStudy` in `tests/test_estimators.py`
- 18 new event-study specification tests added in PR #125
Corrections Made:
- PR #125 (2026-02-02): Transformed from post-period-only estimator into full event-study specification with pre-period coefficients. Reference period default changed from first pre-period to last pre-period (e=-1 convention). HonestDiD/PreTrendsPower VCV extraction fixed to use interaction sub-VCV instead of full regression VCV.
Outstanding Concerns:
- R comparison benchmark via `benchmarks/R/benchmark_multiperiod.R` using `fixest::feols(outcome ~ treated * time_f | unit)`. ATT diff < 1e-11, SE diff 0.0%, period-effects correlation 1.0. Validated at small (200 units) and 1k scales.
- Endpoint binning for distant event times not yet implemented.
- FutureWarning for reference_period default change should eventually be removed once the transition is complete.
Deviations from R's fixest::feols():
- Default SE is HC1, not cluster-robust at unit level (the `fixest` default for panel data). Cluster-robust available via `cluster` parameter but not the default.
- Reference period default is last pre-period (e=-1 convention, matches `fixest`/`did`); prior Python releases used first pre-period and the change is gated by a `FutureWarning` until the deprecation window closes.
| Field | Value |
|---|---|
| Module | twfe.py |
| Primary Reference | Wooldridge (2010), Ch. 10 |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-02-08 |
Verified Components:
- Within-transformation algebra: `y_it - ȳ_i - ȳ_t + ȳ` matches hand calculation (rtol=1e-12)
- ATT matches manual demeaned OLS (rtol=1e-10)
- ATT matches `DifferenceInDifferences` on 2-period data (rtol=1e-10)
- Covariates are also within-transformed (sum to zero within unit/time groups)
- R comparison: ATT matches `fixest::feols(y ~ treated:post | unit + post, cluster=~unit)` (rtol<0.1%)
- R comparison: Cluster-robust SE match (rtol<1%)
- R comparison: P-value match (atol<0.01)
- R comparison: CI bounds match (rtol<1%)
- R comparison: ATT and SE match with covariate (same tolerances)
- Edge case: Staggered treatment triggers `UserWarning`
- Edge case: Auto-clusters at unit level (SE matches explicit `cluster="unit"`)
- Edge case: DF adjustment for absorbed FE matches manual `solve_ols()` with `df_adjustment`
- Edge case: Covariate collinear with interaction raises `ValueError` ("cannot be identified")
- Edge case: Covariate collinearity warns but ATT remains finite
- Edge case: `rank_deficient_action="error"` raises `ValueError`
- Edge case: `rank_deficient_action="silent"` emits no warnings
- Edge case: Unbalanced panel produces valid results (finite ATT, positive SE)
- Edge case: Missing unit column raises `ValueError`
- Integration: `decompose()` returns `BaconDecompositionResults`
- SE: Cluster-robust SE >= HC1 SE
- SE: VCoV positive semi-definite
- Wild bootstrap: Valid inference (finite SE, p-value in [0,1])
- Wild bootstrap: All weight types (rademacher, mammen, webb) produce valid inference
- Wild bootstrap: `inference="wild_bootstrap"` routes correctly
- Params: `get_params()` returns all inherited parameters
- Params: `set_params()` modifies attributes
- Results: `summary()` contains "ATT"
- Results: `to_dict()` contains att, se, t_stat, p_value, n_obs
- Results: residuals + fitted = demeaned outcome (not raw)
- Edge case: Multi-period time emits UserWarning advising binary post indicator
- Edge case: Non-{0,1} binary time emits UserWarning (ATT still correct)
- Edge case: ATT invariant to time encoding ({0,1} vs {2020,2021} produces identical results)
Key Implementation Detail:
The interaction term D_i × Post_t must be within-transformed (demeaned) alongside the outcome,
consistent with the Frisch-Waugh-Lovell (FWL) theorem: all regressors and the outcome must be
projected out of the fixed effects space. R's fixest::feols() does this automatically when
variables appear to the left of the | separator.
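The FWL point can be made concrete with a small noiseless panel. This is a sketch in plain NumPy, not the code in `twfe.py`; the panel dimensions, fixed-effect values, and helper names (`within`, `slope`) are illustrative assumptions, and the two-way demeaning formula used is the balanced-panel one.

```python
import numpy as np

# Balanced panel: 20 units x 4 periods, half the units treated, post = last 2 periods.
n_units, n_periods, tau = 20, 4, 2.0
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
treated = (unit < n_units // 2).astype(float)
post = (period >= 2).astype(float)

x = treated * post                       # interaction D_i x Post_t
y = 0.1 * unit + 0.5 * period + tau * x  # noiseless outcome with unit and time effects

def within(v):
    """Two-way within-transformation for a balanced panel: v_it - v_i. - v_.t + v_.."""
    m = v.reshape(n_units, n_periods)
    out = m - m.mean(axis=1, keepdims=True) - m.mean(axis=0, keepdims=True) + m.mean()
    return out.ravel()

def slope(xvar, yvar):
    """OLS slope of yvar on xvar with an intercept."""
    X = np.column_stack([np.ones_like(xvar), xvar])
    beta, *_ = np.linalg.lstsq(X, yvar, rcond=None)
    return beta[1]

att_demeaned = slope(within(x), within(y))  # interaction demeaned, as FWL requires
att_raw = slope(x, within(y))               # raw interaction against demeaned outcome

print(att_demeaned, att_raw)
```

In this particular design the demeaned regression recovers `tau`, while the raw-interaction regression returns `tau / 3`, in line with the attenuation the review found for multi-period panels.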
Corrections Made:
- Bug fix: interaction term must be within-transformed (found during review). The previous implementation used raw (un-demeaned) `D_i × Post_t` in the demeaned regression. This gave correct results only for 2-period panels where `post == period`. For multi-period panels (e.g., 4 periods with binary `post`), the raw interaction had incorrect correlation with demeaned Y, producing ATT approximately 1/3 of the true value. Fixed by applying the same within-transformation to the interaction term before regression. This matches R's `fixest::feols()` behavior. (`twfe.py` lines 99-113)
Outstanding Concerns:
- Multi-period `time` parameter: Multi-period time values (e.g., 1,2,3,4) produce `treated × period_number` instead of `treated × post_indicator`, which is not the standard D_it treatment indicator. A `UserWarning` is emitted when `time` has >2 unique values. For binary time with non-{0,1} values (e.g., {2020, 2021}), the ATT is mathematically correct (the within-transformation absorbs the scaling), but a warning recommends 0/1 encoding for clarity. Users with multi-period data should create a binary `post` column.
- Staggered treatment warning: The warning only fires when `time` has >2 unique values (i.e., actual period numbers). With binary `time="post"`, all treated units appear to start treatment at `time=1`, making staggering undetectable. Users with staggered designs should use `decompose()` or `CallawaySantAnna` directly for proper diagnostics.
Deviations from R's fixest::feols(): (None — point estimates, cluster-robust SEs,
CI bounds, and absorbed-FE results all match within documented tolerances on both bare
and covariate-adjusted specifications.)
| Field | Value |
|---|---|
| Module | staggered.py |
| Primary Reference | Callaway & Sant'Anna (2021) |
| R Reference | did::att_gt() |
| Status | Complete |
| Last Review | 2026-01-24 |
Verified Components:
- ATT(g,t) basic formula (hand-calculated exact match)
- Doubly robust estimator
- IPW estimator
- Outcome regression
- Base period selection (varying/universal)
- Anticipation parameter handling
- Simple/event-study/group aggregation
- Analytical SE with weight influence function
- Bootstrap SE (Rademacher/Mammen/Webb)
- Control group composition (never_treated/not_yet_treated)
- All documented edge cases from REGISTRY.md
Test Coverage:
- 61 methodology verification tests in `tests/test_methodology_callaway.py`
- Existing unit-test coverage in `tests/test_staggered.py`
- R benchmark tests (skip if R not available)
R Comparison Results:
- Overall ATT matches within 20% (difference due to dynamic effects in generated data)
- Post-treatment ATT(g,t) values match within 20%
- Pre-treatment effects may differ due to base_period handling differences
Corrections Made:
- (None — implementation verified correct)
Outstanding Concerns:
- R comparison shows ~20% difference in overall ATT with generated data
- Likely due to differences in how dynamic effects are handled in data generation
- Individual ATT(g,t) values match closely for post-treatment periods
- Further investigation recommended with real-world data
- Pre-treatment ATT(g,t) may differ from R due to base_period="varying" semantics
- Python uses t-1 as base for pre-treatment
- R's behavior requires verification
Deviations from R's did::att_gt():
- NaN for invalid inference: When SE is non-finite or zero, Python returns NaN for t_stat/p_value rather than potentially erroring. This is a defensive enhancement.
Alignment with R's did::att_gt() (as of v2.1.5):
- Webb weights: Webb's 6-point distribution with values ±√(3/2), ±1, ±√(1/2) uses equal probabilities (1/6 each) matching R's `did` package. This gives E[w]=0, Var(w)=1.0, consistent with other bootstrap weight distributions.

  Verification: Our implementation matches the well-established `fwildclusterboot` R package (C++ source: wildboottest.cpp). The implementation uses `sqrt(1.5)`, `1`, `sqrt(0.5)` (and negatives) with equal 1/6 probabilities, identical to our values.

  Note on documentation discrepancy: Some documentation (e.g., fwildclusterboot vignette) describes Webb weights as "±1.5, ±1, ±0.5". This appears to be a simplification for readability. The actual implementations use ±√1.5, ±1, ±√0.5 which provides the required unit variance (Var(w) = 1.0).
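The moment claims for the Webb distribution are easy to check directly. The sketch below is a standalone NumPy calculation, not the library's bootstrap code; `draw_webb` is a hypothetical helper name.

```python
import numpy as np

# Webb 6-point support with the square-root values, each with probability 1/6.
WEBB_VALUES = np.array([
    -np.sqrt(1.5), -1.0, -np.sqrt(0.5),
    np.sqrt(0.5), 1.0, np.sqrt(1.5),
])

def draw_webb(size, rng):
    """Hypothetical helper: i.i.d. Webb weights with equal probabilities."""
    return rng.choice(WEBB_VALUES, size=size)

# Exact moments under equal probabilities.
webb_mean = WEBB_VALUES.mean()
webb_var = (WEBB_VALUES ** 2).mean() - webb_mean ** 2

# The simplified "±1.5, ±1, ±0.5" reading does not have unit variance.
naive = np.array([-1.5, -1.0, -0.5, 0.5, 1.0, 1.5])
naive_var = (naive ** 2).mean() - naive.mean() ** 2

print(webb_mean, webb_var, naive_var)
```

The square-root values give mean 0 and variance 1, whereas the un-rooted values give variance 7/6, which is why the simplified description cannot be the one the implementations use.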
| Field | Value |
|---|---|
| Module | sun_abraham.py |
| Primary Reference | Sun & Abraham (2021) |
| R Reference | fixest::sunab() |
| Status | Complete |
| Last Review | 2026-02-15 |
Verified Components:
- Saturated TWFE regression with cohort × relative-time interactions
- Within-transformation for unit and time fixed effects
- Interaction-weighted event study effects (δ̂_e = Σ_g ŵ_{g,e} × δ̂_{g,e})
- IW weights match event-time sample shares (n_{g,e} / Σ_g n_{g,e})
- Overall ATT as weighted average of post-treatment effects
- Delta method SE for aggregated effects (Var = w' Σ w)
- Cluster-robust SEs at unit level
- Reference period normalized to zero (e=-1 excluded from design matrix)
- R comparison: ATT matches `fixest::sunab()` within machine precision (<1e-11)
- R comparison: SE matches within 0.3% (small scale) / 0.1% (1k scale)
- R comparison: Event study effects correlation = 1.000000
- R comparison: Event study max diff < 1e-11
- Bootstrap inference (pairs bootstrap)
- Rank deficiency handling (warn/error/silent)
- All REGISTRY.md edge cases tested
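The interaction-weighted aggregation and its delta-method SE from the checklist above reduce to a weighted sum and a quadratic form. The following sketch uses made-up cohort effects, counts, and a placeholder diagonal VCV for a single event time; none of the numbers come from the benchmarks.

```python
import numpy as np

# Cohort-specific effects at one event time e, with per-event-time observation counts n_{g,e}.
delta_ge = np.array([1.0, 2.0, 4.0])   # illustrative cohort effects
n_ge = np.array([50, 30, 20])          # illustrative observation counts
vcv = np.diag([0.04, 0.09, 0.16])      # placeholder VCV of the cohort effects

# IW weights are event-time sample shares: n_{g,e} / sum_g n_{g,e}.
w = n_ge / n_ge.sum()

# Aggregated effect and delta-method variance Var = w' Sigma w.
delta_e = float(w @ delta_ge)
var_e = float(w @ vcv @ w)
se_e = float(np.sqrt(var_e))

print(delta_e, se_e)
```

With weights (0.5, 0.3, 0.2) the aggregate is 1.9 and the variance is 0.0245. In a real fit the VCV would come from the cluster-robust covariance of the saturated regression and would generally have non-zero off-diagonal entries.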
Test Coverage:
- Combined methodology + unit tests in `tests/test_sun_abraham.py` (the methodology verification block grew incrementally from the original 7 review tests as edge cases were added)
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator sunab`
R Comparison Results:
- Overall ATT matches within machine precision (diff < 1e-11 at both scales)
- Cluster-robust SE matches within 0.3% (well within 1% threshold)
- Event study effects match perfectly (correlation 1.0, max diff < 1e-11)
- Validated at small (200 units) and 1k (1000 units) scales
Corrections Made:
- DF adjustment for absorbed FE (`sun_abraham.py`, `_fit_saturated_regression()`): Added `df_adjustment = n_units + n_times - 1` to `LinearRegression.fit()` to account for absorbed unit and time fixed effects in degrees of freedom. Unlike TWFE (which uses `-2` plus an explicit intercept column), SunAbraham's saturated regression has no intercept, so all absorbed df must come from the adjustment. Affects t-distribution DoF for cohort-level p-values/CIs (slightly larger p-values, slightly wider CIs) but does NOT change VCV or SE values.
- NaN return for no post-treatment effects (`sun_abraham.py`, `_compute_overall_att()`): Changed return from `(0.0, 0.0)` to `(np.nan, np.nan)` when no post-treatment effects exist. All downstream inference fields (t_stat, p_value, conf_int) correctly propagate NaN via existing guards in `fit()`.
- Deprecation warnings for unused parameters (`sun_abraham.py`, `fit()`): Added `FutureWarning` for `min_pre_periods` and `min_post_periods` parameters that are accepted but never used (no-op). These will be removed in a future version.
- Removed event-time truncation at [-20, 20] (`sun_abraham.py`): Removed the hardcoded cap `max(min(...), -20)` / `min(max(...), 20)` to match R's `fixest::sunab()` which has no such limit. All available relative times are now estimated.
- Warning for variance fallback path (`sun_abraham.py`, `_compute_overall_att()`): Added `UserWarning` when the full weight vector cannot be constructed and a simplified variance (ignoring covariances between periods) is used as fallback.
- IW weights use event-time sample shares (`sun_abraham.py`, `_compute_iw_effects()`): Changed IW weights from `n_g / Σ_g n_g` (cohort sizes) to `n_{g,e} / Σ_g n_{g,e}` (per-event-time observation counts) to match the REGISTRY.md formula. For balanced panels these are identical; for unbalanced panels the new formula correctly reflects actual sample composition at each event-time. Added unbalanced panel test.
- Normalize `np.inf` never-treated encoding (`sun_abraham.py`, `fit()`): `first_treat=np.inf` (documented as valid for never-treated) was included in `treatment_groups` and `_rel_time` via `> 0` checks, producing `-inf` event times. Fixed by normalizing `np.inf` to `0` immediately after computing `_never_treated`. Same fix applied to `staggered.py` (CallawaySantAnna).
Outstanding Concerns:
- Inference distribution: Cohort-level p-values use t-distribution (via `LinearRegression.get_inference()`), while aggregated event study and overall ATT p-values use normal distribution (via `compute_p_value()`). This is asymptotically equivalent and standard for delta-method-aggregated quantities. R's fixest uses t-distribution at all levels, so aggregated p-values may differ slightly for small samples; this is a documented deviation.
Deviations from R's fixest::sunab():
- NaN for no post-treatment effects: Python returns `(NaN, NaN)` for overall ATT/SE when no post-treatment effects exist. R would error.
- Normal distribution for aggregated inference: Aggregated p-values use normal distribution (asymptotically equivalent). R uses t-distribution.
| Field | Value |
|---|---|
| Module | stacked_did.py |
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
| R Reference | stacked-did-weights (create_sub_exp() + compute_weights()) |
| Status | Complete |
| Last Review | 2026-02-19 |
Verified Components:
- IC1 trimming: `a - kappa_pre >= T_min AND a + kappa_post <= T_max` (matches R reference)
- IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
- Sub-experiment construction: treated + clean controls within `[a - kappa_pre, a + kappa_post]`
- Q-weights aggregate: treated Q=1, control `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)` per (event_time, sub_exp); matches R `compute_weights()`
- Q-weights population: `Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C)` (Table 1, Row 2)
- Q-weights sample_share: `Q_a = ((N_a^D + N_a^C)/(N^D+N^C)) / (N_a^C / N^C)` (Table 1, Row 3)
- WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
- Event study regression: `Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U` (Eq. 3)
- Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
- Delta-method SE for overall ATT: `SE = sqrt(ones' @ sub_vcv @ ones) / K`
- Cluster-robust SEs at unit level (default) and unit×sub-experiment level
- Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
- Rank deficiency handling (warn/error/silent via `solve_ols()`)
- Never-treated encoding: both `first_treat=0` and `first_treat=inf` handled
- R comparison: ATT matches within machine precision (diff < 2.1e-11)
- R comparison: SE matches within machine precision (diff < 4.0e-10)
- R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
- `safe_inference()` used for all inference fields
- All REGISTRY.md edge cases tested
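Two items in the checklist above, the aggregate Q-weight formula for controls and the sqrt(w) route to weighted least squares, can be sketched together. The function name `control_q_weight` and all data are illustrative assumptions; this is not the code in `stacked_did.py`.

```python
import numpy as np

def control_q_weight(sub_treat_n, stack_treat_n, sub_control_n, stack_control_n):
    """Hypothetical helper: aggregate Q-weight for controls in one (event_time, sub_exp) cell.

    Treated observations keep Q = 1; controls are rescaled so that the
    sub-experiment's share of controls matches its share of treated observations.
    """
    return (sub_treat_n / stack_treat_n) / (sub_control_n / stack_control_n)

# Example: the cell holds 10 of 40 treated obs and 30 of 60 control obs.
q_example = control_q_weight(10, 40, 30, 60)  # (0.25) / (0.5)

# WLS two ways on simulated data with arbitrary positive weights.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
q = rng.uniform(0.5, 2.0, size=n)

# (a) Weighted normal equations: (X' W X) beta = X' W y.
beta_normal = np.linalg.solve(X.T @ (q[:, None] * X), X.T @ (q * y))

# (b) OLS after scaling every row of X and y by sqrt(weight).
sw = np.sqrt(q)
beta_sqrt, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)

print(q_example, beta_normal, beta_sqrt)
```

Both routes minimize the same weighted sum of squared residuals, so the coefficient vectors agree up to floating-point error.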
Test Coverage:
- `tests/test_stacked_did.py`: 10 test classes (basic, trimming, Q-weights, clean-control, clustering, stacked-data shape, edge cases, sklearn interface, results methods, validation)
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator stacked`
R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):
| Metric | Python | R | Diff |
|---|---|---|---|
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | 3.7e-11 |
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | 3.0e-11 |
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | 4.5e-11 |
| Stacked obs | 1600 | 1600 | exact |
| Sub-experiments | 3 | 3 | exact |
Corrections Made:
- IC1 lower bound and time window aligned with R reference (`stacked_did.py`, `_trim_adoption_events()` and `_build_sub_experiment()`): The paper text specifies time window `[a - kappa_pre - 1, a + kappa_post]` (including an extra pre-period), but the R reference implementation by co-author Hollingsworth uses `[a - kappa_pre, a + kappa_post]`. The extra period had no event-study dummy, altering the baseline regression. Fixed to match R: removed `-1` from both IC1 check (`a - kappa_pre >= T_min`) and time window start. Discrepancy documented in `docs/methodology/papers/wing-2024-review.md` Gaps section.
- Q-weight computation: event-time-specific for aggregate weighting (`stacked_did.py`, `_compute_q_weights()`): Changed aggregate Q-weights from unit counts per sub-experiment to observation counts per (event_time, sub_exp), matching R reference `compute_weights()`. For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for varying observation density. Population/sample_share retain unit-count formulas (paper notation).
- Anticipation parameter: reference period and dummies (`stacked_did.py`, `fit()`): Reference period now shifts to `e = -1 - anticipation`. Event-time dummies cover the full window `[-kappa_pre - anticipation, ..., kappa_post]`. Post-treatment effects include anticipation periods. Consistent with ImputationDiD, TwoStageDiD, SunAbraham.
- Group aggregation removed (`stacked_did.py`): `aggregate="group"` and `aggregate="all"` removed. The pooled stacked regression cannot produce cohort-specific effects without cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.
- n_sub_experiments metadata (`stacked_did.py`, `fit()`): Now tracks actual built sub-experiments, not all events in omega_kappa. Warns if any sub-experiments are empty after data filtering.
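The IC1 rule and time window adopted in the first correction can be written out in a few lines. This is a sketch with hypothetical function names (`trim_adoption_events`, `sub_experiment_window`); it follows the R-aligned bounds described above and is not the library's `_trim_adoption_events()`.

```python
def trim_adoption_events(adoption_times, t_min, t_max, kappa_pre, kappa_post):
    """Hypothetical helper: keep adoption events whose full window fits in the panel.

    IC1 (R-aligned form): a - kappa_pre >= t_min and a + kappa_post <= t_max.
    """
    return [
        a
        for a in sorted(set(adoption_times))
        if a - kappa_pre >= t_min and a + kappa_post <= t_max
    ]

def sub_experiment_window(a, kappa_pre, kappa_post):
    """Hypothetical helper: calendar periods in [a - kappa_pre, a + kappa_post]."""
    return list(range(a - kappa_pre, a + kappa_post + 1))

# Panel observed in periods 1..8 with kappa_pre = kappa_post = 2.
kept = trim_adoption_events([2, 3, 4, 5, 6, 7], t_min=1, t_max=8, kappa_pre=2, kappa_post=2)
window = sub_experiment_window(3, kappa_pre=2, kappa_post=2)

print(kept)    # [3, 4, 5, 6]
print(window)  # [1, 2, 3, 4, 5]
```

Under the paper-text window with the extra pre-period, the event at `a = 3` would need period 0 and would be trimmed, so the choice of lower bound changes which sub-experiments exist.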
Outstanding Concerns:
- Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
- Anticipation not validated against R (R reference doesn't test anticipation > 0)
Deviations from R's stacked-did-weights:
- NaN for invalid inference: Python returns NaN for t_stat/p_value/conf_int when SE is non-finite or zero. R would propagate through `fixest::feols()` error handling.
| Field | Value |
|---|---|
| Module | imputation.py, imputation_bootstrap.py |
| Primary Reference | Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting Event-Study Designs: Robust and Efficient Estimation. Review of Economic Studies, 91(6), 3253–3285. |
| R Reference | didimputation (Kyle Butts, v0.5.0) |
| Status | Complete |
| Last Review | 2026-06-06 |
Verified Components (tests/test_methodology_imputation.py, paper-equation-numbered):
- Theorem 1 / 2 (eqs. 4-5): 3-step imputation (Step 1 fit on Ω₀ only; impute Ŷ(0); weighted aggregate) recovers constant and heterogeneous-by-horizon ATTs; perturbing a treated outcome shifts the overall ATT by exactly δ/N₁ (proving treated obs never feed Step 1). Tests: `TestB2024Theorem2Imputation`
- Theorem 3 / eqs. 6-8 (conservative clustered variance + unit-clustered Equation 8): finite/positive SEs; the unit-clustered `τ̃_g = Σ_i a·b / Σ_i a²` aggregator hand-verified; NaN-`τ̂` co-group observation is a variance no-op. Tests: `TestB2024Theorem3Variance`, `TestB2024Eq8AuxiliaryAggregator`
- Proposition 5 (p. 3266): no never-treated units ⇒ horizons `K ≥ H̄ = max(E_i)−min(E_i)` are NaN + warning; never-treated present ⇒ identified. Tests: `TestB2024Proposition5NoNeverTreated`
- Test 1 / eq. 9 + Proposition 9 (p. 3273-4): robust pre-trend test on Ω₀ only (does not reject under parallel trends, rejects under a violation); the treatment-effect estimate is independent of the pre-trend request. Tests: `TestB2024Proposition9Test1`
- R parity vs `didimputation::did_imputation`: overall + event-study ATT and SEs match on a fixed-seed staggered panel (observed |Δ| ~1e-7 ATT / ~1e-10 SE; the tests assert ATT `abs=1e-6`, SE `abs=1e-7` for cross-platform robustness). Tests: `TestImputationDiDParityR` (goldens: `benchmarks/data/didimputation_golden.json`, generator: `benchmarks/R/generate_didimputation_golden.R`)
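The three imputation steps and the δ/N₁ perturbation property in the first item can be sketched on a tiny noiseless panel. This is a NumPy illustration under assumed cohort timing and fixed effects, using dummy-variable least squares for Step 1; it is not the iterative fixed-effects routine in `imputation.py`.

```python
import numpy as np

# 6 units x 5 periods; two never-treated units, two cohorts adopting at t=3 and t=4.
n_units, n_periods, tau = 6, 5, 2.0
first_treat = np.array([np.inf, np.inf, 3, 3, 4, 4])
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
is_treated = period >= first_treat[unit]
y = 0.3 * unit + 0.7 * period + tau * is_treated  # noiseless outcome

def fe_design(u, t):
    """Unit and time dummies (all levels kept; lstsq handles the collinearity)."""
    A = np.zeros((u.size, n_units + n_periods))
    rows = np.arange(u.size)
    A[rows, u] = 1.0
    A[rows, n_units + t] = 1.0
    return A

def imputation_att(outcome):
    # Step 1: fit unit + time effects on untreated observations only.
    A0 = fe_design(unit[~is_treated], period[~is_treated])
    beta, *_ = np.linalg.lstsq(A0, outcome[~is_treated], rcond=None)
    # Step 2: impute the untreated potential outcome for treated observations.
    A1 = fe_design(unit[is_treated], period[is_treated])
    y0_hat = A1 @ beta
    # Step 3: average the observation-level effects (equal weights).
    return float(np.mean(outcome[is_treated] - y0_hat))

att = imputation_att(y)

# Perturb one treated outcome by delta: the ATT moves by delta / N1.
delta = 0.6
n1 = int(is_treated.sum())
y_shift = y.copy()
y_shift[np.flatnonzero(is_treated)[0]] += delta
att_shift = imputation_att(y_shift)

print(att, att_shift - att, delta / n1)
```

The shift equals δ/N₁ because the perturbed observation enters only Step 3; the Step 1 fit never sees treated outcomes.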
Corrections Made (PR-B, surfaced by the walk-through + R parity):
- Theorem 3 untreated `v_it` weights: exact projection. The FE-only path used the balanced two-way closed form `-(w_i/n0_i + w_t/n0_t − w/N₀)`, which is wrong for the (always-unbalanced) Ω₀ in staggered designs, biasing every covariate-free analytical SE downward (~27% on the parity panel). Replaced with the exact two-way-FE projection `-A₀(A₀'A₀)⁻¹A₁'w` (the same path the covariate case uses), and fixed the FE design to keep all unit dummies (the prior drop-first-unit-without-intercept design projected onto a space one rank short, a further ~1.6% bias). SEs now match `didimputation` (observed ~1e-10; test contract `abs=1e-7`).
- Auxiliary model (Equation 8): unit-clustered form. `_compute_auxiliary_residuals_treated` used the observation-level mean `Σ v·τ̂ / Σ v`; corrected to the paper's unit-clustered `Σ_i a·b / Σ_i a²`. Coincides with the old form under uniform within-group weights; differs for survey/heterogeneity estimands and the coarser `cohort` partition. NaN-safe masking of zero-weight rows.
- Untreated-residual hardening. `_compute_residuals_untreated` now preserves NaN for missing FE (symmetric with the treated path) instead of a silent `fillna(0.0)` that could mask a rank-condition logic error (provably inert on valid data).
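The exact-projection weights in the first correction can be checked against the imputation estimate itself: with treated weights `w` and untreated weights `v₀ = -A₀(A₀'A₀)⁻¹A₁'w`, the estimator is the linear form `w'Y₁ + v₀'Y₀`. The sketch below is an assumption-laden illustration, not the library's code path: it keeps every unit and time dummy and therefore substitutes a pseudo-inverse (`np.linalg.pinv`) for the inverse, and the panel and noise level are made up.

```python
import numpy as np

# Same style of toy staggered panel: 6 units x 5 periods, two never-treated units.
n_units, n_periods = 6, 5
first_treat = np.array([np.inf, np.inf, 3, 3, 4, 4])
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
is_treated = period >= first_treat[unit]

rng = np.random.default_rng(2)
y = 0.3 * unit + 0.7 * period + 2.0 * is_treated + rng.normal(scale=0.1, size=unit.size)

def fe_design(u, t):
    """Unit and time dummies with all levels kept (rank-deficient by one)."""
    A = np.zeros((u.size, n_units + n_periods))
    rows = np.arange(u.size)
    A[rows, u] = 1.0
    A[rows, n_units + t] = 1.0
    return A

A0 = fe_design(unit[~is_treated], period[~is_treated])  # untreated observations
A1 = fe_design(unit[is_treated], period[is_treated])    # treated observations
w = np.full(A1.shape[0], 1.0 / A1.shape[0])             # equal weights on treated obs

# Untreated weights via the projection; pinv stands in for the inverse (assumption).
v0 = -A0 @ np.linalg.pinv(A0.T @ A0) @ A1.T @ w

# Estimator as a linear form in outcomes.
att_weights = float(w @ y[is_treated] + v0 @ y[~is_treated])

# Direct imputation for comparison.
beta, *_ = np.linalg.lstsq(A0, y[~is_treated], rcond=None)
att_impute = float(np.mean(y[is_treated] - A1 @ beta))

print(att_weights, att_impute, v0.sum())
```

The two estimates coincide, and the untreated weights sum to -1, offsetting the treated weights, which sum to 1. A balanced-panel closed form would not reproduce these weights here because the untreated sample Ω₀ is unbalanced by construction.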
Deviations from the reference / library extensions (see REGISTRY.md ## ImputationDiD):
- Deviation from R: `didimputation` computes the Equation 8 aggregator (`Σ v²τ̂/Σ v²`) at the cohort×event-time partition only (no partition control); at that partition it equals the unit-clustered Equation 8 = diff-diff's default `aux_partition="cohort_horizon"`. diff-diff additionally offers `aux_partition="cohort"` / `"horizon"` (validated by hand-calc, no R analogue).
- Multiplier bootstrap on the Theorem-3 influence function (library extension, not in the paper).
- Survey-design TSL variance on the influence function (library extension).
- NaN inference for undefined statistics; Proposition-5 refuse-to-estimate (NaN + warning).
- Leave-one-out variance refinement (Supplementary Appendix A.9) not implemented (finite-sample refinement; tracked as future work).
R Comparison Results (didimputation v0.5.0, fixed-seed panel, benchmarks/data/didimputation_golden.json):
| Quantity | Python | R | \|Δ\| |
|----------|--------|---|-----|
| Overall ATT | 2.04566790 | 2.04566803 | 1.3e-7 |
| Overall SE | 0.02087938 | 0.02087938 | 9.3e-11 |
| Event-study ATT (max) | — | — | 1.6e-7 |
| Event-study SE (max) | — | — | 1.7e-10 |
(Point-estimate Δ is iterative-FE convergence level; SE Δ is machine precision. These are reference-platform observations; the parity tests assert ATT abs=1e-6, SE abs=1e-7 for cross-platform robustness.)
| Field | Value |
|---|---|
| Module | two_stage.py, two_stage_bootstrap.py |
| Primary Reference | Gardner (2022), Two-stage differences in differences, arXiv:2207.05943 |
| R Reference | did2s |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## TwoStageDiD` (Stage 1 unit+time FE on untreated, Stage 2 OLS on residualized outcomes, GMM sandwich variance per Newey-McFadden Theorem 6.1)
- Implementation: 76 unit tests in `tests/test_two_stage.py` (matches ImputationDiD point estimates, R `did2s` global `(D'D)^{-1}` variance, always-treated unit exclusion, multiplier bootstrap)
- Documented R alignment: uses global `(D'D)^{-1}` matching `did2s` (not paper Eq. 6)
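The Stage 1 / Stage 2 description in the REGISTRY bullet above can be illustrated on a toy panel. The sketch below is an assumption-laden illustration, not the code in `two_stage.py`: the function name `two_stage_att`, the noiseless data, and the alternating-update solver for the fixed effects are all invented here, and it reproduces only the point estimate (no GMM sandwich variance).

```python
from statistics import mean

# Toy balanced panel: 4 units x 4 periods, noiseless outcomes (hypothetical data).
# Units 0-1 are never treated; units 2-3 are treated from period 2 onward.
UNIT_FE = [0.0, 1.0, 2.0, 3.0]
TIME_FE = [0.0, 0.5, 1.0, 1.5]
TRUE_EFFECT = 2.0

rows = []  # each row is (unit, period, treated, outcome)
for i, a in enumerate(UNIT_FE):
    for t, b in enumerate(TIME_FE):
        d = 1 if (i >= 2 and t >= 2) else 0
        rows.append((i, t, d, a + b + TRUE_EFFECT * d))


def two_stage_att(rows, n_iter=200):
    """Two-stage point estimate: FE from untreated rows, then OLS on residuals."""
    untreated = [r for r in rows if r[2] == 0]
    units = sorted({r[0] for r in rows})
    periods = sorted({r[1] for r in rows})
    alpha = {i: 0.0 for i in units}
    gamma = {t: 0.0 for t in periods}
    # Stage 1: alternating least-squares updates for unit and time effects,
    # using untreated observations only.
    for _ in range(n_iter):
        for i in units:
            alpha[i] = mean(y - gamma[t] for (u, t, _, y) in untreated if u == i)
        for t in periods:
            gamma[t] = mean(y - alpha[u] for (u, p, _, y) in untreated if p == t)
    # Stage 2: no-intercept OLS of the residualized outcome on the treatment dummy.
    num = sum(d * (y - alpha[i] - gamma[t]) for (i, t, d, y) in rows)
    den = sum(d * d for (_, _, d, _) in rows)
    return num / den


att_hat = two_stage_att(rows)
```

Every unit in the toy panel has at least one untreated period, which is why no always-treated exclusion step appears in the sketch.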
Outstanding for promotion:
- Dedicated `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components walk-through
- R parity benchmark fixture against `did2s` (none on file)
- Documented deviation: Newey-McFadden Theorem 6.1 sandwich vs paper's Eq. 6 (already noted in REGISTRY but not formalized in this tracker)
- "Corrections Made" listing
| Field | Value |
|---|---|
| Module | wooldridge.py, wooldridge_results.py |
| Primary Reference | Wooldridge (2025), Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators, Empirical Economics 69(5), 2545–2587 |
| Companion Reference | Wooldridge (2023), Simple approaches to nonlinear difference-in-differences with panel data, Econometrics Journal 26(3) (nonlinear extensions for logit/Poisson paths) |
| R Reference | etwfe (McDermott 2023); Stata jwdid (Rios-Avila 2021) |
| Status | Complete |
| Last Review | 2026-05-22 |
Verified Components:
- Theorem 3.1 (Mundlak ≡ TWFE): equivalence under non-singularity Eq. 3.3; see `tests/test_methodology_wooldridge.py::TestW2025Theorem31MundlakTWFEEquivalence`
- Proposition 5.1 / 5.2 (Imputation ≡ POLS five-way chain): `TestW2025Proposition51ImputationPOLSEquivalence`
- Section 6 / Eqs. 6.1-6.5 event-study: `TestW2025Section6EventStudy`
- Section 7 aggregation paths (Eqs. 7.2-7.4 + 7.6): opt-in `weights="cohort_share"` on `aggregate()` recovers paper Eq. 7.4 simple-overall and Eq. 7.6 event-time hand-calc forms; see `TestW2025Section7AggregationPaths`
- Section 8 / Eq. 8.1 heterogeneous cohort-specific trends: `cohort_trends=True` adds `dg_i · t` interactions; recovers `tau` under heterogeneous-trends DGP; see `TestW2025Section8HeterogeneousTrends`
- Section 10 unbalanced panels + time-varying covariates (Eq. 10.1-10.6): `TestW2025Section10UnbalancedPanels`
Test Coverage:
- `tests/test_methodology_wooldridge.py`: 10 test classes (6 paper-equation-numbered Theorem/Proposition/Section walk-throughs + `TestW2025LibraryDeviations` consolidating 5 surviving deviations + `TestWooldridgeParityR` vcov_type R-parity from PR #483 + `TestWooldridgeParityRPoisson` / `TestWooldridgeParityRLogit` surface tests with log-link goldens; numerical R-parity for nonlinear paths deferred per TODO row)
- `tests/test_wooldridge.py`: unit-level test suite covering OLS / logit / Poisson + four aggregations + survey support + vcov_type variants + cluster/bootstrap interactions
- `benchmarks/R/generate_wooldridge_golden.R`: clubSandwich + sandwich + etwfe goldens at `benchmarks/data/wooldridge_golden.json`
Corrections Made:
- PR #484 (PR-A): Added primary-source review for Wooldridge (2025) at `docs/methodology/papers/wooldridge-2025-review.md` (771 lines). Documented the cohort-share aggregation deviation (Eqs. 7.2-7.4 simple-overall AND Eq. 7.6 event-time) and the Section 8 heterogeneous-trends gap. REGISTRY § Aggregations Note + TODO row 95 extended to cover both paths.
- PR-B (this PR): Closed two paper gaps documented in PR-A:
  - Opt-in cohort-share aggregation weighting via `aggregate(weights="cohort_share")` on `WooldridgeDiDResults` (paper Eq. 7.4 simple-overall + Eq. 7.6 event-time). Default stays `weights="cell"` for `jwdid_estat` back-compat.
  - Heterogeneous cohort trends via `WooldridgeDiD(cohort_trends=True)` (paper Eq. 8.1; OLS path only; auto-routes to full-dummy mode regardless of `vcov_type` to keep math closure verified against existing R-parity goldens).
  - Extended R goldens to include `etwfe(family="poisson")` and `etwfe(family="logit")` log-link coefficients (surface tests in Python; numerical response-scale parity deferred to follow-up).
Deviations from the paper / from R / library extensions: See REGISTRY.md ## WooldridgeDiD (ETWFE) → ### Deviations from the paper / from R / library extensions block for the consolidated list (HC1 finite-sample factor, QMLE sandwich (n-1)/(n-k) term, nonlinear-vs-fixest direct QMLE, logit cohort+time additive dummies, anticipation + aggregation, cell-count default with opt-in cohort-share).
Outstanding Concerns:
- Stata `jwdid` golden values (TODO "Stata `jwdid` golden value tests" row): Stata-side parity infrastructure deferred until Stata install is available; R `etwfe` side covered in PR-B Stage D.
- Response-scale APE / log-link bridge for Poisson + logit R parity (new TODO row added in PR-B): direct cell-level numerical parity between diff-diff's response-scale ATT and R `etwfe` log-link coefficients requires either `emfx()`-based APE extraction on the R side or link-function inversion with baseline-mean adjustment.
- QMLE sandwich Stata-parity `qmle` weight type (TODO row 94): diff-diff's `(G/(G-1)) × ((n-1)/(n-k))` is conservative vs Stata's `G/(G-1)` only; awaiting Stata golden values to confirm material difference.
- Repeated cross-sections (paper p. 2581 → Deb et al. 2024): not in 2025 paper's main body; future PR.
- Treatment exit / non-absorbing treatment (2023 paper Section 7.2 sketch): not in 2025 paper; future PR.
- `cohort_trends` polynomial extension (`"quadratic"`, `"cubic"`): PR-B ships binary `True` / `False` for linear `dg_i · t`; forward-extensibility deferred.
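The size of the QMLE finite-sample gap noted in the concerns above can be gauged with simple arithmetic. The sketch below assumes the factor multiplies the variance estimate (so the SE scales by its square root); the function names and the sample sizes are hypothetical and chosen only for illustration.

```python
def cluster_factor(n_clusters: int) -> float:
    """Cluster-count adjustment G / (G - 1)."""
    return n_clusters / (n_clusters - 1)


def qmle_factor_with_dof(n_clusters: int, n_obs: int, n_params: int) -> float:
    """(G / (G - 1)) * ((n - 1) / (n - k)), the form this tracker attributes to diff-diff."""
    return cluster_factor(n_clusters) * (n_obs - 1) / (n_obs - n_params)


# Hypothetical design sizes, for illustration only.
G, n, k = 50, 1000, 20

# The ratio of the two factors reduces to (n - 1) / (n - k), which is >= 1
# whenever k >= 1, so the extra term can only inflate the variance.
ratio = qmle_factor_with_dof(G, n, k) / cluster_factor(G)
```

With these sizes the ratio is about 1.019, a roughly 1% difference on the SE scale; the gap grows as `k` approaches `n`, which is the case the pending Stata goldens would need to cover.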
| Field | Value |
|---|---|
| Module | efficient_did.py, efficient_did_bootstrap.py, efficient_did_covariates.py, efficient_did_weights.py |
| Primary Reference | Chen, Sant'Anna & Xie (2025), Efficient Difference-in-Differences and Event Study Estimators |
| R Reference | (no canonical R package; paper compares against did / DIDmultiplegt / BJS / Gardner / Wooldridge as benchmarks rather than providing a reference implementation) |
| Status | Complete |
| Last Review | 2026-06-01 |
Documentation in place:
- REGISTRY.md section: `## EfficientDiD` (full Theorem 4.1 EIF, sieve-based propensity-ratio and outcome-regression estimation with AIC/BIC, kernel-smoothed conditional covariance, Hausman pretest for PT-All vs PT-Post, survey support)
- Paper review on file: `docs/methodology/papers/chen-santanna-xie-2025-review.md` (PR-A, 2026-05-31): faithful paper-sourced transcription of arXiv:2506.17729v1 (assumptions S/O/NA/PT-Post/PT-All; Theorem 3.1/3.2 EIFs + Corollaries 3.1/3.2; §4 sieve/kernel DR estimation; Theorem 4.1 SEs; Theorem A.1 Hausman; HRS Table 6 anchor)
- Tests: `tests/test_efficient_did.py` (unit/API), `tests/test_efficient_did_validation.py` (HRS Table 6 + Compustat MC), and `tests/test_methodology_efficient_did.py` (PR-B paper-equation Verified Components)
Corrections Made (PR-B source validation): (None — implementation verified correct.) The PR-B walk-through traced each paper result against the source and found the no-covariate path (multi-baseline efficiency recovered via the g'=g same-cohort pairs), the generated outcome (Eq 3.9), the optimal weights (Eq 3.5/3.13), the conditional covariance Ω*(X) (Eq 3.12), the analytical SE (Theorem 4.1), the cohort-size event-study aggregation (with the (G_g − π_g) WIF correction), and the Hausman covariance direction (Eq A.2, restricted − efficient) all correct.
Implementation change (deliberate, decided with the maintainer — eliminates a deviation rather than fixing a bug):
- Covariate outcome regression upgraded from linear OLS to a polynomial sieve with AIC/BIC order selection, using the same basis family as the propensity-ratio sieve. The sieve grows with no fixed order ceiling: the order bound is `floor(n_pos^{1/5})` over the positive-weight support `n_pos` (raw group size when unweighted; zero-weight survey rows are inert for order selection), further bounded by `n_basis < n_pos`. Because C.1's rate is on the sieve dimension `p_K = comb(K+d,d)` rather than the polynomial degree (the two differ once `d > 1`), this satisfies Assumption C.1's uniform-consistency / `o_p(n^{-1/2})` product-rate conditions for the low-dimensional covariate settings typical of DiD. The doubly-robust covariate path therefore attains the semiparametric efficiency bound asymptotically under the paper's nonparametric-nuisance specification (Section 4 / Theorem 4.1), not only when the conditional mean is linear. Degree 1 reproduces the prior linear OLS outcome regression; `sieve_k_max=1` forces all covariate-path sieves to degree 1 (it recovers the linear outcome component but also degree-1-constrains the propensity sieves, so it does not reproduce the exact pre-PR estimator). Removing the hard `K≤5` cap also updates the two pre-existing propensity-ratio / inverse-propensity sieves (a no-op for groups under ~3,125 units). Implemented in `diff_diff/efficient_did_covariates.py::estimate_outcome_regression`.
- Extracted `_hausman_quadratic_form` (behaviour-preserving) so the Theorem A.1 statistic and effective-rank DOF logic are unit-testable in isolation.
Verified Components (tests/test_methodology_efficient_did.py, paper-equation-numbered):
- Inverse-covariance optimal weights `1'Ω*⁻¹/(1'Ω*⁻¹1)` (Eq 3.5 / 3.13) + the min-variance property + the singular-Ω* pseudoinverse path.
- No-covariate generated outcome (Eq 3.9): `g'=g` telescopes to the per-baseline DiD (Eq 3.3); `g'=∞` to the period-1 long-difference.
- No-covariate efficient ATT = `weights @ generated_outcomes` (Eq 3.13 / §4.1), rebuilt independently from within-group sample means/covariances.
- PT-Post just-identified reduction = Callaway-Sant'Anna single-baseline (Corollary 3.1 single-date exact at 1e-9; Corollary 3.2 staggered).
- Analytical SE = `sqrt(mean(EIF²)/n)` (Theorem 4.1).
- Hausman statistic (Theorem A.1 / Eq A.2): `H = Δ'V⁺Δ`, `V = aCov(ẼS) − aCov(ÊS)` (restricted − efficient); `df = |E|` (well-conditioned) and `df = effective_rank < |E|` (rank-deficient safeguard); covariance-direction guard.
- Covariate sieve outcome regression: recovers a nonlinear-in-X conditional mean (K≥2), reproduces linear OLS on linear data (K=1), and beats a forced-linear working model under a nonlinear-nuisance conditional-PT DGP.
- Empirical anchor: HRS Table 6 (`tests/test_efficient_did_validation.py::TestHRSReplication`) on the paper's data (a derived openICPSR 116186 subset) matches all six ATT(g,t) + ES(0)/ES(1)/ES(2) + ES_avg to < 0.03 published SE; the Compustat MC confirms unbiasedness, the CS efficiency gain, coverage, and SE calibration.
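Two of the formulas in the checklist above, the inverse-covariance weights and the analytical SE, can be illustrated in a few lines. This is a sketch under stated assumptions, not the library's implementation: it uses a small hand-rolled linear solver, a hypothetical nonsingular 2×2 covariance matrix, and a made-up mean-zero influence function; the singular-Ω* pseudoinverse path is not covered. For symmetric Ω the column form `Ω⁻¹1 / (1'Ω⁻¹1)` used here equals the row form quoted in the checklist.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def optimal_weights(omega):
    """w = Omega^{-1} 1 / (1' Omega^{-1} 1) for a nonsingular symmetric Omega."""
    x = solve(omega, [1.0] * len(omega))
    total = sum(x)
    return [xi / total for xi in x]


def quad_form(w, omega):
    """Variance w' Omega w of the weighted combination."""
    k = len(w)
    return sum(w[i] * omega[i][j] * w[j] for i in range(k) for j in range(k))


def analytical_se(eif):
    """sqrt(mean(EIF^2) / n), assuming the influence function is mean-zero."""
    n = len(eif)
    return math.sqrt(sum(v * v for v in eif) / n / n)


omega = [[1.0, 0.2], [0.2, 4.0]]  # hypothetical covariance of two baseline estimates
w = optimal_weights(omega)
se = analytical_se([1.0, -1.0, 1.0, -1.0])  # hypothetical EIF values
```

The min-variance property shows up as `quad_form(w, omega)` being smaller than the variance under any other weights that sum to one, for example equal weights.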
Deviations (ratified; each carries a recognized REGISTRY label):
- `overall_att` is the cohort-size-weighted post-treatment average (Callaway-Sant'Anna convention), not the paper's uniform `ES_avg` (Eq 2.3); `ES_avg` is recoverable from the event-study output.
- The multiplier bootstrap re-aggregates with fixed cohort-size weights (matching the CS bootstrap); the analytical path carries the `(G_g − π_g)` WIF weight-estimation correction.
- The Hausman χ² uses the effective rank of `V` as degrees of freedom (a finite-sample safeguard equal to `|E|` when `V` is well-conditioned), rather than a fixed `|E|`.
- `vcov_type` is permanently narrow to `{"hc1"}` (the IF-based variance has no single design matrix for analytical-sandwich families); the polynomial sieve basis and the Silverman kernel bandwidth are working-model choices the paper leaves open.
- SEs are i.i.d.-asymptotics (`sqrt(mean(EIF²)/n)` or multiplier bootstrap); cluster/survey EIF variants are documented library extensions beyond the paper's stated scope.
| Field | Value |
|---|---|
| Module | continuous_did.py, continuous_did_bspline.py, continuous_did_results.py |
| Primary Reference | Callaway, Goodman-Bacon & Sant'Anna (2024), Difference-in-Differences with a Continuous Treatment, NBER WP 32117 |
| R Reference | contdid v0.1.0 (CRAN) — two parity surfaces at relative tolerance: (a) scalar overall ATT parity with raw R cont_did / pte_default output at < 0.01 (1%) on all 6 benchmarks; scalar overall ACRT parity with raw R cont_did at < 0.01 (1%) on benchmarks 4-5; (b) harmonized boundary-knot-normalized curve parity with R-side ATT(d)/ACRT(d) reconstructed under Boundary.knots = range(treated_doses) (matching the library) at < 0.01 max ATT(d) and < 0.02 max ACRT(d) on benchmarks 1-3 via the benchmark harness (_run_r_contdid rebuilds the R-side basis under Boundary.knots = range(treated_doses) at tests/test_methodology_continuous_did.py:333-367; _compare_with_r orchestrates the Python-vs-R comparison at :395-459); benchmark 6 is event-study, scalar overall_att only (binarized ATT, no curve comparison and no ACRT in event-study mode). Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw contdid curves use range(dvals) instead of range(dose). NOT bit-exact (atol=1e-8) like HAD because of the boundary-knots deviation documented below. See tests/test_methodology_continuous_did.py::TestRBenchmark |
| Status | Complete |
| Last Review | 2026-05-20 |
Verified Components:
- PT and SPT identification (CGBS 2024 Assumptions 1-2): two-level parallel trends with explicit untreated-and-doses conditioning; estimands `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, `ATT^{loc}`, `ATT^{glob}`, `ACRT^{glob}` defined in `docs/methodology/continuous-did.md` § 4 + REGISTRY `## ContinuousDiD` Identification block. Hand-calc coverage: `tests/test_methodology_continuous_did.py::TestLinearDoseResponse` (4 tests at `atol=1e-10` / `atol=1e-6` on no-noise linear DGP; locks the `ATT^{glob}` binarization formula `E[ΔY | D > 0] − E[ΔY | D = 0]`, the `ACRT^{glob}` plug-in average, and the `ATT(d) = 2d`, `ACRT(d) = 2` closed forms).
- B-spline basis matching `splines2::bSpline` (cubic and linear degrees, `num_knots=0` default; global boundary knots from the training-dose range, NOT per-cell): `tests/test_methodology_continuous_did.py::TestQuadraticWithCubicBasis::test_quadratic_recovery` recovers `ATT(d) = d²` at `atol=1e-6` via a degree-3 basis (cubic spline can represent quadratic exactly). The matching basis algorithm lives in `diff_diff/continuous_did_bspline.py` (216 LoC); the boundary-knots deviation from R `contdid` is documented in the Deviations block below.
- Multi-period (g,t) cell iteration with base period selection: `TestMultiPeriodAggregation::test_multiple_groups` and `test_gt_cell_count` exercise the cohort iteration on 2-cohort staggered panels; cell counts agree with the R `ptetools`-style convention. Scalar parity with raw R `cont_did` at 1% relative further locks the staggered-aggregation surface via `TestRBenchmark::test_benchmark_4_staggered_dose` and `test_benchmark_5_not_yet_treated` (both assert overall ATT AND overall ACRT at `< 0.01`).
- Dose-response (`aggregate="dose"`) and event-study (`aggregate="eventstudy"`) aggregation with group-proportional weights (`n_treated/n_total` per group, divided among post-treatment cells; matches R `ptetools` convention). Two R-side surfaces are exercised: (a) scalar `overall_att` via `TestRBenchmark::test_benchmark_1_basic_cubic` / `_2_linear` / `_3_interior_knots` / `_4_staggered_dose` / `_5_not_yet_treated` (dose mode) and `_6_event_study` (event-study mode: binarized ATT only; benchmark 6 validates the event-study code path through the scalar surface, NOT per-horizon `event_study_effects`); (b) harmonized boundary-knot-normalized ATT(d) / ACRT(d) curves on benchmarks 1-3 via the benchmark harness: `_run_r_contdid` at `tests/test_methodology_continuous_did.py:333-367` rebuilds the R-side basis under `Boundary.knots = range(treated_doses)` (raw `contdid` curves use `range(dvals)`, so this is reconstructed-basis parity not raw-package parity), and `_compare_with_r` orchestrates the comparison at `:395-459`. Per-benchmark tolerances: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 1-3 additionally assert max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02` via the helper; benchmarks 4-5 assert overall ACRT at `< 0.01` inline. Per-horizon `event_study_effects` estimates and inference are exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528` (no R cross-language comparison on the per-horizon surface). The benchmarks are skipped via `_check_r_contdid()` if R / `contdid` is not installed; they use R's `dvals` for exact evaluation-grid alignment between Python and R outputs (boundary knots are harmonized separately under surface (b); see the `_run_r_contdid` helper's `Boundary.knots = range(treated_doses)` block at `tests/test_methodology_continuous_did.py:333-367`).
- Multiplier bootstrap for inference (PSU-level multiplier weights on the survey path per Phase 6): implementation in `diff_diff/continuous_did.py`; bootstrap SE invariant on rank-deficient cells locked in `TestEdgeCasesMethodology::test_all_same_dose` (verifies `dose_response_att.se` is finite on a heterogeneous-outcome / identical-dose DGP); 80 unit tests in `tests/test_continuous_did.py` exercise the rest of the bootstrap path.
- Analytical SEs via influence functions (NOT delta method; corrected post-v3.0.0, see Corrections Made): IF-based variance with `safe_inference()` joint-NaN consistency on all six estimand fields (`overall_att`, `overall_acrt`, dose-response, event-study).
- Survey support: weighted B-spline OLS, Taylor-series linearization (TSL) on influence functions, bootstrap + survey via PSU-level multiplier weights (Phase 3 + Phase 6). Boxed in REGISTRY `## ContinuousDiD` → Implementation Checklist → "Survey design support (Phase 3)" item.
- `+inf` → `0` never-treated recoding with `UserWarning` reporting the affected row count (axis-E silent-coercion fix per Phase 2 audit): the R-style convention of `first_treat = +inf` is normalized internally but no longer absorbed silently. Any negative `first_treat` value (including `-inf`) raises `ValueError` with the affected row count. Locked in `tests/test_continuous_did.py`.
- Zero-`first_treat` rows with nonzero `dose` force-zeroed with `UserWarning` reporting the affected row count (axis-E silent-coercion fix per Phase 2 audit): never-treated cells must have `D=0` for internal consistency; the previous silent zeroing is now signaled. Locked in `tests/test_continuous_did.py`.
- `bspline_derivative_design_matrix` derivative-construction failure warning (Phase 2 axis-C #12 silent-failures audit fix): aggregates failed basis indices into a single `UserWarning` naming them, instead of swallowing a `ValueError` raised by `scipy.interpolate.BSpline` and leaving silently zeroed derivative columns. Both ACRT point estimates AND analytical/bootstrap inference read the same `dPsi` matrix (`continuous_did.py:1026-1046` and the bootstrap ACRT path at `continuous_did.py:1524-1561`), so both are biased on partial-derivative failure; the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled because derivatives are mathematically zero there. Locked in `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3 tests: `test_single_dose_is_silent`, `test_valueerror_from_bspline_emits_aggregate_warning`, `test_clean_knots_emit_no_warning`); source-level aggregate-warning block at `diff_diff/continuous_did_bspline.py:150-187`.
- Edge cases: all-same-dose (rank-deficient design, recovers only intercept = `ATT^{glob}`, ACRT = 0 everywhere), single-treated-unit (insufficient for OLS, raises `ValueError` "No valid"), discrete-treatment (detected and warned, saturated regression deferred), rank-deficiency per cell (cell skipped under `rank_deficient_action="silent"` / `"warn"`), balanced-panel-required (matches R `contdid` v0.1.0). Locked in `TestEdgeCasesMethodology` (2 methodology tests) + rank-deficient unit tests in `test_continuous_did.py`.
- Anticipation-aware not-yet-treated control mask: when `anticipation > 0`, the not-yet-treated control mask uses `G > t + anticipation` (not just `G > t`) to exclude cohorts in the anticipation window from controls. When `anticipation=0` (default), behavior is unchanged. CHANGELOG `[3.0.x]`-era fix; locked in `test_continuous_did.py`.
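The anticipation-aware control rule in the last item above reduces to a one-line mask. The sketch below is illustrative only: the function name and the list-based representation of cohorts are assumptions made here, and never-treated units are left out to keep the rule in focus.

```python
def not_yet_treated_controls(cohorts, period, anticipation=0):
    """Return indices of units usable as not-yet-treated controls at `period`.

    `cohorts[i]` is unit i's first treated period G. A unit qualifies only if
    G > period + anticipation, so cohorts inside the anticipation window
    are excluded from the control pool.
    """
    return [i for i, g in enumerate(cohorts) if g > period + anticipation]


cohorts = [3, 4, 5, 6]  # hypothetical first-treatment periods for four units

controls_default = not_yet_treated_controls(cohorts, period=3)
controls_anticipation = not_yet_treated_controls(cohorts, period=3, anticipation=1)
```

With `anticipation=1`, the cohort first treated in period 4 drops out of the period-3 control pool because it may already be responding to upcoming treatment.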
Test Coverage:
- 15 methodology tests in `tests/test_methodology_continuous_did.py` (5 classes: 4 + 1 + 2 + 2 + 6); the R-benchmark class (6 tests) skips via the `_check_r_contdid()` guard if R / `contdid` v0.1.0 is not installed.
- 80 core unit tests in `tests/test_continuous_did.py` (1,530 LoC) covering the B-spline basis (`TestBSplineBasis`, `TestBSplineDerivativeDegenerateBasis`), bootstrap, IF-based analytical SE, anticipation, rank-deficient cells, dose grid, dvals/grid validation, `+inf` recoding, zero-dose coercion, and result-class field contracts. Survey-design coverage is NOT in this file; it lives in the dedicated survey suites (next bullet).
- ContinuousDiD survey-design tests: `tests/test_survey_phase3.py::TestContinuousDiDSurvey` (`tests/test_survey_phase3.py:653-705` analytical SE + bootstrap; `:1368-1407` event-study aggregation + survey-design rejection paths) and `tests/test_survey_phase6.py` (`:1230-1244` replicate-weight + n_bootstrap rejection; `:1548-1610` positive-weight-gate cell skipping).
- R cross-language coverage at relative tolerance (NOT bit-exact; see Deviations § "Boundary knots") on 6 benchmark configurations across two surfaces: (a) scalar parity with raw R `cont_did` / `pte_default`: all 6 assert overall ATT at `< 0.01` (1%); benchmarks 4-5 also assert overall ACRT at `< 0.01` inline; benchmark 6 is event-study mode with scalar `overall_att` only (binarized ATT, no per-horizon and no ACRT comparison). Per-horizon `event_study_effects` is exercised by Python-side tests at `tests/test_continuous_did.py:557-690` and `:1500-1528`. (b) harmonized boundary-knot-normalized curve parity with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness (`_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`; `_compare_with_r` orchestrates at `:395-459`): max ATT(d) at `< 0.01` and max ACRT(d) at `< 0.02`. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`.
- Documentation: `docs/methodology/continuous-did.md` (14,885 bytes theory note covering PT vs SPT, estimands, B-spline OLS, multiplier bootstrap).
Corrections Made:
- SE method correction (early v3.0.x): ContinuousDiD originally computed SEs via delta method; corrected to influence-function-based variance. See CHANGELOG entries "Fix ContinuousDiD SE method: influence function, not delta method" + "Fix methodology doc: influence functions, not delta method for ContinuousDiD SEs".
- Anticipation-aware control mask (CHANGELOG `[3.0.x]`-era): not-yet-treated control mask now uses `G > t + anticipation` instead of `G > t`, excluding cohorts in the anticipation window from controls.
- Phase 2 silent-failures audit fixes (axis-C + axis-E):
  - axis-C #12: `bspline_derivative_design_matrix` no longer silently swallows a `ValueError` raised by `scipy.interpolate.BSpline`; aggregates failed basis indices into a single `UserWarning`. Both ACRT point estimates AND analytical/bootstrap inference are affected when this fires.
  - axis-E (silent coercion): `+inf` → `0` never-treated recoding now emits `UserWarning` with affected row count; negative `first_treat` (including `-inf`) raises `ValueError`. Zero-`first_treat` rows with nonzero `dose` force-zeroed now also emit `UserWarning`.
- Bread normalization, fweight TSL scaling, weighted-mass IF linearization (CHANGELOG): three pieces of the IF-based variance machinery on the survey path corrected to match the analytical identities. Replicate-IF variance score scaling also fixed for EfficientDiD / TripleDifference / ContinuousDiD as part of the same sweep.
- Tracker-promotion consolidation (this PR, 2026-05-20): formal Deviations block added to REGISTRY `## ContinuousDiD` consolidating the boundary-knots deviation, the `bspline_derivative` warning, and the two axis-E silent-coercion warnings into a single labeled surface. The original Edge Cases / Notes entries remain in place; Deviations is an additional canonical surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)" labels).
Deviations from the paper / from R / library extensions:
- Deviation from R (boundary knots use `range(dose)` not `range(dvals)`): knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. The evaluation grid is clamped to training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose-grid extremes. Scope caveat: R cross-language coverage therefore runs at relative tolerance bands across two surfaces, NOT bit-exact (`atol=1e-8`) like HAD. `contdid` and ContinuousDiD cannot bit-match on aggregated dose-response or ACRT curves because they use different knot placement; the agreement band reflects the boundary-knot divergence rather than algorithmic drift. (a) Scalar parity with raw R `cont_did` / `pte_default` at 1% relative on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only). (b) Harmonized boundary-knot-normalized curve parity with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness (`_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`; `_compare_with_r` orchestrates at `:395-459`): max ATT(d) at 1% and max ACRT(d) at 2%. The slightly wider 2% ACRT(d)-curve tolerance on benchmarks 1-3 reflects the tighter coupling between basis derivative numerics and the boundary-knot choice; benchmarks 4-5 use overall scalars (`overall_acrt`) where the boundary effect averages down to 1%. Library extension toward methodological soundness (avoids extrapolation).
- Library extension (`bspline_derivative_design_matrix` derivative-failure warning): the per-basis derivative loop previously swallowed a `ValueError` raised by `scipy.interpolate.BSpline`, leaving affected derivative-matrix columns silently zero. It now aggregates the failed basis indices into a single `UserWarning` naming them. Both ACRT point estimates and analytical/bootstrap inference read the same `dPsi` matrix, so both are biased when this fires; the warning wording makes that explicit. The all-identical-knot degenerate case (single dose value) remains silently handled (mathematically-zero derivatives are correct there). Phase 2 axis-C #12 silent-failures audit fix. No R correspondence; `contdid` v0.1.0 does not implement an equivalent warning.
- Library extension (`+inf` → `0` never-treated recoding warns): the R-style convention of coding never-treated units as `first_treat=+inf` is still accepted and normalized to `first_treat=0` internally, but the estimator now emits a `UserWarning` reporting the row count so the silent recategorization is surfaced. Only `+inf` is recoded (matching the R convention). Any negative `first_treat` value (including `-inf`) raises `ValueError` with the row count, since such units would otherwise silently fall out of both the treated (`g > 0`) and never-treated (`g == 0`) masks. Pass `0` directly for never-treated units to avoid the warning. Library extension toward stricter safety; matches the broader Phase 2 axis-E silent-coercion convention. No R correspondence; `contdid` v0.1.0 silently absorbs `+inf` without a signal.
- Library extension (zero-`first_treat` rows with nonzero `dose` force-zeroed with warning): never-treated cells must have `D=0` for internal consistency in the dose-response. The estimator now emits a `UserWarning` with the affected row count before the zeroing, so unintended nonzero doses on never-treated rows are no longer absorbed without a signal. Library extension toward stricter safety with no R correspondence: `contdid` v0.1.0 has the same `first_treat = 0` → `D = 0` invariant requirement but silently coerces without a warning; same axis-E silent-coercion lineage as #3.
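The `first_treat` normalization contract described in the deviations above (warn on `+inf`, raise on negatives) can be sketched as follows. This is a minimal stand-alone illustration; the function name `normalize_first_treat`, the message wording, and the list-based input are assumptions made here, not the estimator's actual code.

```python
import math
import warnings


def normalize_first_treat(first_treat):
    """Recode +inf to 0 (never-treated) with a warning; reject negative values."""
    # Negative values (including -inf) would match neither the treated (g > 0)
    # nor the never-treated (g == 0) mask, so they are rejected up front.
    n_negative = sum(1 for g in first_treat if g < 0)
    if n_negative:
        raise ValueError(f"{n_negative} row(s) have negative first_treat")
    # After the check above, any infinite value must be +inf.
    n_inf = sum(1 for g in first_treat if math.isinf(g))
    if n_inf:
        warnings.warn(
            f"{n_inf} row(s) with first_treat=+inf recoded to 0 (never-treated)",
            UserWarning,
        )
    return [0 if math.isinf(g) else g for g in first_treat]
```

Reporting the affected row count in both the warning and the error lets a caller distinguish a deliberate R-style encoding from a data-cleaning mistake.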
Outstanding Concerns:
- Covariate support (deferred): `covariates=` kwarg is not implemented; matches R `contdid` v0.1.0 which also has no covariate support. Tracked as a future-work row in TODO.md (Low priority).
- Discrete-treatment saturated regression (deferred): when `dose` is detected as integer-valued, the estimator currently warns; the saturated regression approach (one coefficient per discrete dose level instead of B-spline basis) is not implemented. Tracked as a future-work row.
- Lowest-dose-as-control (Remark 3.1, deferred): CGBS 2024 Remark 3.1 outlines using the lowest non-zero dose as the comparison group when `P(D=0) = 0`. Not implemented; the estimator requires `P(D=0) > 0` (never-treated controls present). Tracked as a future-work row.
These three are feature deferrals (paper-supported extensions that the library has chosen not to implement yet), not tracker blockers — REGISTRY ## ContinuousDiD → Implementation Checklist already marks them as [ ] deferred (the "Covariate support", "Discrete treatment saturated regression", and "Lowest-dose-as-control (Remark 3.1)" items). They mirror the same "future work" status that the ChaisemartinDHaultfoeuille and TROP tracker rows carry for analogous optional extensions.
| Field | Value |
|---|---|
| Module | chaisemartin_dhaultfoeuille.py, chaisemartin_dhaultfoeuille_bootstrap.py, chaisemartin_dhaultfoeuille_results.py |
| Primary References | (a) de Chaisemartin & D'Haultfœuille (2020), Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects, AER 110(9), 2964-2996. (b) de Chaisemartin & D'Haultfœuille (2022, revised July 2023), Difference-in-Differences Estimators of Intertemporal Treatment Effects, NBER WP 29873 — Web Appendix Section 3.7.3 for cohort-recentered plug-in variance. (Matches docs/methodology/REGISTRY.md ## ChaisemartinDHaultfoeuille § "Primary sources". The Knau et al. 2026 universal-rollout paper, while authored by overlapping authors, is the primary source for HeterogeneousAdoptionDiD and is treated as adjacent context for DCDH, not a primary reference — see "Outstanding Concerns" below.) |
| R Reference | DIDmultiplegtDYN |
| Status | Complete |
| Last Review | 2026-05-21 |
Verified Components:
- AER 2020 Theorem 3: per-period `DID_{+,t}` / `DID_{-,t}` plus aggregate `DID_M`, `DID_+`, `DID_-`; see `tests/test_methodology_chaisemartin_dhaultfoeuille.py::TestMethodologyWorkedExample` (hand-calculable 4-group example: `DID_M = 2.5`, `DID_+ = 2.0`, `DID_- = 3.0` exact); paper review at `docs/methodology/papers/dechaisemartin-dhaultfoeuille-2020-review.md`.
- AER 2020 single-lag placebo `DID_M^pl`: same Theorem 3 logic applied to `Y_{g,t-1} - Y_{g,t-2}` on 3-period cells; covered by the worked example class and `tests/test_chaisemartin_dhaultfoeuille.py::TestA11Handling` for the placebo Assumption 11 zero-retention path.
- AER 2020 Theorem 1 TWFE-weights diagnostic: negative-weight detection on binary treatment via `twfe_diagnostic=True` and standalone `twowayfeweights()`; see `tests/test_methodology_chaisemartin_dhaultfoeuille.py::TestTWFEDiagnostic` + `tests/test_chaisemartin_dhaultfoeuille.py::TestTwowayFeweightsHelper`; binary-only contract documented (non-binary inputs trigger `UserWarning` from `fit()` and `ValueError` from the standalone helper).
- NBER WP 29873 dynamic event study `DID_l` (Equation 3 / 5 of the dynamic paper); see `tests/test_methodology_chaisemartin_dhaultfoeuille.py::TestCohortRecenteringCritical` + `TestLargeNRecovery`; paper review at `docs/methodology/papers/dechaisemartin-dhaultfoeuille-2022-review.md`. The `TestLargeNRecovery` class verifies that the multi-horizon estimator recovers the true ATT at large G.
- NBER WP 29873 dynamic placebos `DID^{pl}_l`, normalized `DID^n_l`, cost-benefit `delta` (Lemma 4); see `tests/test_chaisemartin_dhaultfoeuille.py::TestMultiHorizonPlacebos`, `TestNormalizedEffects`, `TestCostBenefitDelta`, `TestSupTBands` (simultaneous confidence bands).
- NBER WP 29873 Web Appendix Section 3.7.3 cohort-recentered plug-in variance: locked by `tests/test_methodology_chaisemartin_dhaultfoeuille.py::TestCohortRecenteringCritical::test_cohort_recentering_not_grand_mean` (constructs a designed DGP where cohort recentering and grand-mean recentering produce materially different SE and asserts they diverge; guards against silent regression to a single-mean centering).
- R `DIDmultiplegtDYN` parity at documented tolerance bands: `tests/test_chaisemartin_dhaultfoeuille_parity.py` (26 tests). Tolerance class constants: `POINT_RTOL = 1e-4` (pure-direction point estimates), `MIXED_POINT_RTOL = 0.025` (2.5% on mixed-direction panels), `PURE_DIRECTION_SE_RTOL = 0.05` (5% on pure-direction SE after the Round 2 full-IF fix), `SE_RTOL = 0.10` (10% on multi-horizon SE), and `se_rtol=0.15` on the `joiners_only_long_multi_horizon` L_max=5 scenario where cell-count-weighting compounds across horizons. Deviations from R are documented in the consolidated REGISTRY Deviations block (D2 period-vs-cohort + D4 SE-normalization explain the residual SE gap).
- Phase 3 covariate adjustment (`DID^X`): Web Appendix Section 1.2 residualization-style adjustment with first-stage OLS on first-differenced covariates with time FEs, restricted to not-yet-treated observations; see `tests/test_chaisemartin_dhaultfoeuille.py::TestCovariateAdjustment`.
- Phase 3 group-specific linear trends (`DID^{fd}`): Web Appendix Section 1.3 / Lemma 6, Z_mat first-differencing; see `tests/test_chaisemartin_dhaultfoeuille.py::TestLinearTrends`.
- Phase 3 state-set-specific trends: Web Appendix Section 1.4 control-pool restriction; see `tests/test_chaisemartin_dhaultfoeuille.py::TestStateSetTrends`.
- Phase 3 heterogeneity testing: Web Appendix Section 1.5 / Lemma 7 saturated-OLS test for treatment-effect heterogeneity along an interaction variable; see `tests/test_chaisemartin_dhaultfoeuille.py::TestHeterogeneityTesting`.
- Design-2 switch-in/switch-out descriptive wrapper (Web Appendix Section 1.6); see `tests/test_chaisemartin_dhaultfoeuille.py::TestDesign2`.
- `by_path` per-path event-study disaggregation: `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathGates` / `TestByPathBehavior` / `TestByPathEdgeCases` / `TestByPathBootstrap` / `TestByPathPlacebo` / `TestByPathSupTBands` / `TestByPathControls` / `TestByPathTrendsLinear` (~8 classes covering API gates, point-estimate path, bootstrap composition, placebo composition, sup-t bands, covariates, and linear trends).
- HonestDiD (Rambachan-Roth 2023) integration on placebo + event study surface; see `tests/test_chaisemartin_dhaultfoeuille.py::TestHonestDiDIntegration`.
- Non-binary (ordinal or continuous) treatment: paper Section 2 of the dynamic companion defines treatment as a general `D_{g,t}`; binary `{0, 1}` is a special case; see `tests/test_chaisemartin_dhaultfoeuille.py::TestNonBinaryTreatment`.
- Survey design support: Taylor-series linearization + replicate weights + Hall-Mammen wild PSU bootstrap; see `tests/test_survey_dcdh.py` (Binder TSL on the main ATT, DID^X, heterogeneity, TWFE diagnostic, and HonestDiD surfaces), `tests/test_survey_dcdh_replicate_psu.py` (Rao-Wu rescaled replicate weights for BRR/Fay/JK1/JKn/SDR), and three cell-period coverage suites (`tests/test_dcdh_cell_period_coverage.py`, `tests/test_dcdh_bootstrap_cell_period_coverage.py`, `tests/test_dcdh_heterogeneity_cell_period_coverage.py`). The cell-period allocator's per-cell IF expansion is what enables within-group-varying PSU.
- Two primary-source DCDH paper reviews on file:
docs/methodology/papers/dechaisemartin-dhaultfoeuille-2020-review.md(2020 AER) anddocs/methodology/papers/dechaisemartin-dhaultfoeuille-2022-review.md(2022/2023 NBER WP 29873). The adjacentdocs/methodology/papers/dechaisemartin-2026-review.mdis on disk as the primary source forHeterogeneousAdoptionDiD(HAD's universal-rollout case) and is referenced from DCDH as context only; it is not DCDH primary-source coverage.
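
The cohort-recentering lock listed above can be illustrated with a minimal, self-contained sketch. It centers a toy influence-function vector two ways (by cohort mean and by a single grand mean) and compares the resulting plug-in standard errors. The helper names, the toy data, and the simplified `sigma-hat / sqrt(n)` plug-in are illustrative assumptions, not the library's internals.

```python
from collections import defaultdict
from math import sqrt


def grand_mean_center(values):
    """Subtract one overall mean from every influence-function value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def cohort_center(values, cohorts):
    """Subtract each value's own cohort mean instead of the overall mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, c in zip(values, cohorts):
        totals[c] += v
        counts[c] += 1
    return [v - totals[c] / counts[c] for v, c in zip(values, cohorts)]


def toy_plug_in_se(centered):
    """Toy plug-in (assumption): sigma-hat / sqrt(n), sigma-hat^2 = mean square."""
    n = len(centered)
    sigma_hat = sqrt(sum(u * u for u in centered) / n)
    return sigma_hat / sqrt(n)


# Hypothetical per-group influence values for two cohorts with different levels.
if_values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
cohorts = ["a", "a", "a", "b", "b", "b"]

se_grand = toy_plug_in_se(grand_mean_center(if_values))
se_cohort = toy_plug_in_se(cohort_center(if_values, cohorts))

# Degenerate case: every group is its own cohort, so each value equals its mean.
singleton = cohort_center(if_values, cohorts=list(range(len(if_values))))

print(f"grand-mean SE: {se_grand:.4f}, cohort SE: {se_cohort:.4f}")
print("singleton-cohort centered values:", singleton)
```

On this toy input the two centerings give clearly different SEs, which is the kind of divergence a regression test can assert on. The singleton case shows why an all-zero centered vector carries no variance information.
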
Test Coverage:
- 12 methodology tests in `tests/test_methodology_chaisemartin_dhaultfoeuille.py` (4 classes: `TestMethodologyWorkedExample`, `TestCohortRecenteringCritical`, `TestTWFEDiagnostic`, `TestLargeNRecovery`).
- 26 R-parity tests in `tests/test_chaisemartin_dhaultfoeuille_parity.py` against `DIDmultiplegtDYN`.
- 352 unit tests in `tests/test_chaisemartin_dhaultfoeuille.py` covering Phase 1 + Phase 2 + Phase 3 + survey-design + by-path + HonestDiD surfaces (37 test classes).
- Survey-specific: `tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, plus three dCDH cell-period coverage suites (`test_dcdh_cell_period_coverage.py`, `test_dcdh_bootstrap_cell_period_coverage.py`, `test_dcdh_heterogeneity_cell_period_coverage.py`).
- Two primary-source DCDH paper reviews on disk: 2020 AER and 2022/2023 NBER WP 29873 (see Verified Components above). The adjacent `docs/methodology/papers/dechaisemartin-2026-review.md` is on disk but is HAD's primary source, not DCDH's — it does not count toward DCDH primary-source coverage.
Corrections Made:
- Round 2 full-IF fix (pre-promotion): never-switching groups now participate in the variance via stable-control roles under the full `Lambda^G_{g,l=1}` influence function. The `n_groups_dropped_never_switching` results field is retained for backwards compatibility but no longer represents an actual exclusion. After this fix, SE parity vs R on pure-direction scenarios narrowed from ~18% to ~3% (documented in REGISTRY `## ChaisemartinDHaultfoeuille` § "Note (deviation from R DIDmultiplegtDYN):" on period-vs-cohort control sets).
- PR-A consolidation (PR #478, 2026-05-21): REGISTRY `## ChaisemartinDHaultfoeuille` reframed to clarify that the library's equal-cell weighting is a documented deviation from BOTH the AER 2020 Equation 3 (`N_{d,d',t} = sum_g N_{g,t}` observation sums) AND R `DIDmultiplegtDYN` (cell-size weighting); the prior framing called the Python contract "paper-literal", which was incorrect against the main-text formula. The period-vs-cohort Note was tightened to use the AER 2020's "transition-state notation" language. `docs/references.rst:199`, `docs/methodology/REGISTRY.md:488`, code docstrings, and the new paper review file headers all align on the `(2022, revised July 2023)` revision string for NBER WP 29873.
- PR-B tracker-promotion consolidation (this PR): formal `### Deviations from the paper / from R / library extensions` block added to REGISTRY `## ChaisemartinDHaultfoeuille` consolidating 7 documented deviations into a single AI-review-recognized labeled surface per CLAUDE.md "Documenting Deviations" labels. The original scattered `**Note (deviation from R...):**` entries remain in place — the new Deviations block is an additional canonical surface for AI-review consumption.
Deviations from the paper / from R / library extensions:
(Cross-references the consolidated REGISTRY Deviations block — see docs/methodology/REGISTRY.md ## ChaisemartinDHaultfoeuille § Deviations for the same 7 entries with full mechanical detail. Listed here in summary form.)
- Equal-cell weighting (deviation from BOTH paper Equation 3 AND R `DIDmultiplegtDYN`). Library: each `(g,t)` cell contributes once regardless of within-cell observation count. AER 2020 Equation 3 defines `N_{d,d',t} = sum_g N_{g,t}` (observation-sum weighting); R weights by cell size. Agreement is exact on one-observation-per-cell inputs (the parity test generator). Phase 2 estimands (`DID_l`, `DID^{pl}_l`, `DID^n_l`, delta cost-benefit) inherit the same contract. Locked by `tests/test_chaisemartin_dhaultfoeuille.py::test_cell_count_weighting_unbalanced_input` (in `TestDropLargerLower`).
- Period-based vs cohort-based stable controls (deviation from R `DIDmultiplegtDYN`). Python: `stable_0(t)` is any cell with `D_{g,t-1} = D_{g,t} = 0` regardless of baseline `D_{g,1}` (matches AER 2020 Theorem 3 transition-state notation `N_{0,0,t}` and `N_{1,1,t}` literally). R: cohort-based control sets additionally require `D_{g,1}` to match the side. Agreement exact on pure-direction panels; ~1% point-estimate divergence on mixed-direction panels where joiners' post-switch cells could serve as leavers' controls (or vice versa). After the Round 2 full-IF fix, SE parity gap on pure-direction scenarios narrowed from ~18% to ~3%.
- Balanced-baseline panel required + terminal missingness retained (deviation from R `DIDmultiplegtDYN`) — one composite deviation with four enforcement paths: (a) groups missing the first global period raise `ValueError`; (b) groups with interior period gaps are dropped with `UserWarning`; (c) groups with terminal missingness (observed at baseline but missing one or more later periods) are retained and contribute from their observed periods only; (d) cell-period allocator paths (Binder TSL with within-group-varying PSU, Rao-Wu replicate ATT, cell-level wild PSU bootstrap) emit a targeted `ValueError` when cohort recentering would leak nonzero centered IF mass onto cells with no positive-weight observations. R accepts unbalanced panels with documented missing-treatment-before-first-switch handling. The four paths share a single underlying contract — "the panel must be balanced at baseline; terminal missingness is the only allowed unbalance; downstream variance machinery refuses to silently leak IF mass past the cell-period boundary".
- SE normalization `N_l` vs R `G` (~4% smaller analytical SE). Python implements the dynamic paper's Section 3.7.3 plug-in formula verbatim: `SE = sigma-hat / sqrt(N_l)` where `N_l` is the number of eligible switcher groups at horizon `l`. R normalizes the influence function by `G` (total number of groups including never-switchers and stable controls). Both converge to the same asymptotic variance as `G → ∞`. In finite samples Python's tighter SE remains conservative (paper formula is already an upper bound on the true variance via Jensen's inequality under Assumption 8). Gap is deterministic on identical data and ~3.5-5.1% across horizons and scenarios.
- Singleton-cohort degeneracy → NaN with warning (deviation from R `DIDmultiplegtDYN`). When every variance-eligible group forms its own `(D_{g,1}, F_g, S_g)` cohort, cohort recentering collapses the centered IF vector to all zeros and the estimator returns `overall_se = NaN` with `UserWarning`. R returns a non-zero SE on degenerate small-panel cases via small-sample sandwich machinery that Python does not implement. Both responses are valid for a degenerate case; Python's `NaN` + warning is the safer default. Bootstrap inherits the same degeneracy.
- `<50%` switcher warning at far horizons (library extension). When fewer than 50% of the `l=1` switchers contribute at a far horizon `l`, `fit()` emits a `UserWarning`. The dynamic paper (NBER WP 29873) recommends not reporting such horizons (Favara-Imbs application, footnote 14). Library convention is to warn but compute; not present in R.
- DID^X covariate first-stage equal-cell weights (deviation from R `DIDmultiplegtDYN`). Phase 3 covariate adjustment (`controls=[...]`) residualizes outcomes via per-baseline first-stage OLS on first-differenced covariates with time FEs. Python uses equal cell weights consistent with the Phase 1 cell-count convention (Deviation #1); R weights by `N_{gt}`. On one-observation-per-cell panels results are identical. When baseline-specific first stages fail (`n_obs = 0` or `n_obs < n_params`), both Python and R drop the affected strata.
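
The equal-cell weighting deviation at the top of this list can be sketched on toy data. The snippet compares an average in which every `(g, t)` cell counts once against an average weighted by the number of observations in each cell. The function names and inputs are hypothetical, and the averages stand in for the full DID estimator only as an illustration of the weighting contract.

```python
def cell_means(cells):
    """Collapse each (group, period) cell to the mean of its observations."""
    return {key: sum(obs) / len(obs) for key, obs in cells.items()}


def equal_cell_average(cells):
    """Each cell contributes once, regardless of how many rows it holds."""
    means = cell_means(cells)
    return sum(means.values()) / len(means)


def observation_weighted_average(cells):
    """Each cell contributes in proportion to its observation count."""
    total_rows = sum(len(obs) for obs in cells.values())
    return sum(sum(obs) for obs in cells.values()) / total_rows


# Hypothetical inputs: keys are (group, period), values are outcome rows.
unbalanced = {("g1", 2): [1.0], ("g2", 2): [3.0, 3.0, 3.0]}
one_per_cell = {("g1", 2): [1.0], ("g2", 2): [3.0]}

print(equal_cell_average(unbalanced))            # 2.0
print(observation_weighted_average(unbalanced))  # 2.5
print(equal_cell_average(one_per_cell))          # 2.0
print(observation_weighted_average(one_per_cell))  # 2.0
```

The two conventions diverge only when cells hold different numbers of rows, which is consistent with the exact agreement reported on one-observation-per-cell inputs.
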
Outstanding Concerns:
- User-supplied `cluster=<col>` — currently `cluster=None` is the only supported value; passing any non-`None` value raises `NotImplementedError` at construction time (and the same gate fires from `set_params`). Reserved for a future phase. Auto-cluster at the group level is the default; `survey_design.psu` is the auto-cluster surface for hierarchical sampling, and PSU-within-group-constant regimes (including the default `psu=group` auto-inject and strictly-coarser PSU with within-group constancy) route through the legacy group-level allocator with bit-identical SE.
- 2024 NBER WP revision re-review — the disk PDF is the "March 2022, revised July 2023" version; the 2024 revision was not re-reviewed for this PR. `docs/references.rst:199`, `docs/methodology/REGISTRY.md:488`, and the paper review file headers all align on the July 2023 revision string. If the 2024 revision introduces methodological changes, a re-review may be queued as a TODO follow-up post-promotion.
- Methodology-test-file coverage of the universal-rollout extension (de Chaisemartin et al. 2026) is out of scope for DCDH. The 2026 paper's primary contribution is the `HeterogeneousAdoptionDiD` (HAD) estimator (already promoted in PR #473) and the universal-rollout case where no unit remains untreated. DCDH covers the reversible / mixed-direction designs from the 2020 AER and the dynamic event study from the 2022/2023 NBER WP 29873; the 2026 paper's universal-rollout contributions are documented in the companion review at `docs/methodology/papers/dechaisemartin-2026-review.md` without a dedicated DCDH methodology test file.
| Field | Value |
|---|---|
| Module | had.py, had_pretests.py |
| Primary Reference | de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026), Difference-in-Differences Estimators When No Unit Remains Untreated, arXiv:2405.04465v6 |
| R Reference | chaisemartin::did_had (Credible-Answers/did_had v2.0.0, SHA edc09197) — R-parity-locked at atol=1e-8 on 3 DGPs × 5 method combos via tests/test_did_had_parity.py; nprobust (Calonico-Cattaneo-Farrell) v0.5.0 used as auxiliary reference for bandwidth selection only (machine-precision port at atol=1e-14) |
| Status | Complete |
| Last Review | 2026-05-20 |
Verified Components:
- Eq. 3 / Theorem 1 (Design 1' WAS identification: `WAS = [E(ΔY) − lim_{d↓0} E(ΔY | D ≤ d)] / E(D)`, the boundary-subtracted form; the library estimates the boundary intercept via bias-corrected local linear and computes `att = (mean(ΔY) − τ_bc) / mean(D)`) — `tests/test_methodology_had.py::TestHADTheorem1Design1Prime` (7 tests including MC recovery on the simple `ΔY = β·D + ε` DGP, MC recovery on a NONZERO-BOUNDARY-INTERCEPT DGP `ΔY = c + β·D + ε` with `c != 0` to exercise the `mean(ΔY) − τ_bc` subtraction explicitly, and N(0,1) coverage at `n_replicates=200, G=1000`)
- Eq. 7 (local-linear with bias-corrected CI) — covered by `tests/test_bias_corrected_lprobust.py` (44 tests, hand-derived R reference at `atol=1e-12`) and `tests/test_nprobust_port.py` (~46 tests, machine-precision port at `atol=1e-14`)
- Eq. 11 / Theorem 3 (`WAS_{d_lower}` under Assumption 6, mass-point path) — `tests/test_methodology_had.py::TestHADTheorem3MassPoint` (5 tests including Wald-IV closed-form equivalence at `atol=1e-9`)
- Theorem 4 (QUG null test, limit law `T_λ = (λ + E_1) / E_2` under Exp(1)/Exp(1)) — `tests/test_methodology_had.py::TestHADTheorem4QUG` (6 tests; MC distributional match against closed-form `F(t) = t/(1+t)` at KS-stat ≤ 0.05, n_draws=5000)
- Eq. 29 / Theorem 7 (Yatchew-HR linearity test, paper-literal `σ²_diff = 1/(2G)` normalization) — `tests/test_methodology_had.py::TestHADTheorem7YatchewHR` (6 tests; standard-normal limit, normalization lock, both `null="linearity"` and `null="mean_independence"` modes)
- Eq. 18 joint Stute pre-trends + homogeneity (sum-of-CvMs + shared-η Mammen wild bootstrap; both mean-independence and linearity nulls) — `tests/test_methodology_had.py::TestHADJointStute` (5 tests). Coverage scope: H0 fail-to-reject on `joint_pretrends_test` (mean-independence) and `joint_homogeneity_test` (linearity); H1 rejection demonstrated on `joint_homogeneity_test` via a nonlinear DGP. Out of scope for the new methodology file: the `trends_lin=True` linear-trend-detrended variant is SHIPPED in the library (R-parity locked against `DIDHAD::did_had(..., trends_lin=TRUE)` v2.0.0; see REGISTRY § "Note (Phase 4 — Eq 17 / Eq 18 linear-trend detrending shipped)" and `tests/test_did_had_parity.py`) but its methodology-walk-through tests are NOT duplicated in `test_methodology_had.py`. Pierce-Schott NUMERICAL replication against the published p=0.51 anchor on the LBD-restricted panel is the waived item (REGISTRY Deviations Note #3).
- R parity (`chaisemartin::did_had`) at `atol=1e-8` on 3 DGPs × 5 method combos (bit-exact, `rtol=0`) — `tests/test_did_had_parity.py::TestPointSEParity` + `TestYatchewParity` (5 direct parity tests; YatchewTest closed-form parity at `atol=1e-10`)
- `nprobust` (Calonico-Cattaneo-Farrell) port at machine precision (`atol=1e-14`) — `tests/test_nprobust_port.py` (7 classes spanning kernel constants, QR-based `(X'X)^{-1}`, three-stage MSE-DPI bandwidth, clustered variance, weighted local-linear, single-eval-point parity)
- Bandwidth selector (CCF MSE-DPI) at 1% tolerance — `tests/test_bandwidth_selector.py` (8 classes covering public-API wrapper, stage diagnostics)
- Survey support: pweight + strata/PSU/FPC via TSL on the continuous and mass-point paths; PSU-level Mammen wild bootstrap on the Stute family; closed-form weighted variance components on Yatchew (Phase 4.5 A/B/C; QUG-under-survey permanently deferred per Phase 4.5 C0)
- Tutorials T21 (`docs/tutorials/21_had_pretest_workflow.ipynb`, 17 drift tests) + T22 (`docs/tutorials/22_had_survey_design.ipynb`, 32 drift tests across groups A-G); plus T20 (`docs/tutorials/20_had_brand_campaign.ipynb`, 14 drift tests)
- Assumption 5/6 non-testability documented in `HeterogeneousAdoptionDiD` class docstring + `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes blocks; reinforced by a fit-time `UserWarning` emitted from the outer `HeterogeneousAdoptionDiD.fit()` dispatch on the overall and event-study paths when the resolved design is Design 1 family (search `diff_diff/had.py` for "---- Assumption 5/6 warning on Design 1 paths ----")
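
The boundary-subtracted form in the Eq. 3 / Theorem 1 entry above can be checked by hand on a noise-free linear DGP. In this sketch the true intercept `c` stands in for the bias-corrected local-linear estimate `τ_bc`, which is an oracle assumption made purely for illustration. The function name `was_estimate` and the dose grid are hypothetical.

```python
def was_estimate(delta_y, dose, boundary_intercept):
    """att = (mean(ΔY) − boundary intercept) / mean(D)."""
    mean_dy = sum(delta_y) / len(delta_y)
    mean_d = sum(dose) / len(dose)
    return (mean_dy - boundary_intercept) / mean_d


# Noise-free DGP: ΔY = c + β·D with a nonzero boundary intercept c.
beta, c = 2.0, 0.5
dose = [0.1 * k for k in range(1, 11)]
delta_y = [c + beta * d for d in dose]

# Oracle intercept (assumption): replaces the local-linear estimate τ_bc.
att = was_estimate(delta_y, dose, boundary_intercept=c)

# Skipping the subtraction attributes the intercept to the dose effect.
att_without_subtraction = was_estimate(delta_y, dose, boundary_intercept=0.0)

print(f"with subtraction: {att:.6f}")
print(f"without subtraction: {att_without_subtraction:.6f}")
```

With the subtraction the ratio recovers `β`; without it the estimate is inflated by `c / mean(D)`, which is the failure mode a nonzero-intercept DGP is designed to expose.
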
Test Coverage:
- 36 methodology tests in `tests/test_methodology_had.py` (3 are `@pytest.mark.slow` + gated by `ci_params.bootstrap(...)`: Theorem 1 N(0,1) coverage at `n_reps=200/min_n=25`, Theorem 4 QUG limit-law KS at `n_draws=5000/min_n=200`, and Theorem 7 Yatchew-HR standard-normal KS at `n_reps=200/min_n=25` — each carries an n-conditional tolerance band per `feedback_bootstrap_drift_tests_need_backend_tolerance`) (this PR)
- ~1,137 implementation-detail tests across `tests/test_had.py`, `tests/test_had_pretests.py`, `tests/test_had_mc.py`, `tests/test_had_dual_knob_deprecation.py`
- 5 R-direct parity tests at `atol=1e-8` in `tests/test_did_had_parity.py`
- ~46 + ~44 nprobust port + bias-corrected port tests
- ~45 bandwidth selector tests
- 17 + 32 tutorial drift tests (T21 + T22), plus 14 T20 drift tests
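
The Theorem 4 limit-law KS check mentioned in this list can be sketched with the standard library alone. The snippet simulates the ratio of two independent Exp(1) draws, which is the `λ = 0` case of `T_λ = (λ + E_1) / E_2`, and measures its Kolmogorov-Smirnov distance from `F(t) = t/(1+t)`. The helper names and the seed are illustrative assumptions; this is not the test in `tests/test_methodology_had.py`.

```python
import random


def ks_statistic(sample, cdf):
    """Two-sided Kolmogorov-Smirnov distance between a sample and a CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # Compare the CDF against the empirical step just after and before x.
        distance = max(distance, (i + 1) / n - f, f - i / n)
    return distance


def simulate_exp_ratio(n_draws, seed):
    """Draw E_1 / E_2 with E_1, E_2 independent Exp(1) (the λ = 0 case)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0) / rng.expovariate(1.0) for _ in range(n_draws)]


draws = simulate_exp_ratio(n_draws=5000, seed=0)
ks = ks_statistic(draws, cdf=lambda t: t / (1.0 + t))
print(f"KS statistic: {ks:.4f}")
```

At 5000 draws the sampling noise in the KS statistic is on the order of 0.02, so a 0.05 threshold leaves room for backend-level drift while still catching a wrong limit law.
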
Corrections Made:
- Phase 4.5 B sup-t bootstrap (PR #432, 2026-05-14): introduced the gated simultaneous-band bootstrap on the weighted event-study path with the explicit `cband=True` + `aggregate="event_study"` + `weights= or survey_design=` gate.
- Phase 4.5 C survey support for linearity family (PR #432): PSU-level Mammen wild bootstrap for Stute + closed-form weighted variance for Yatchew. Replaced an earlier `NotImplementedError` stub.
- HAD survey-design API consolidation (PR #439, 2026-05-15): unified `survey_design=` kwarg across all 8 HAD surfaces; `survey=` / `weights=` become deprecated aliases for one minor cycle.
- Tracker-promotion docstring hardening (this PR, 2026-05-20): added explicit "Non-testable assumptions (paper Section 3.1.2)" Notes block to the `HeterogeneousAdoptionDiD` class docstring + "Scope (what this test does NOT cover)" clauses to `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections. Boxed the REGISTRY HAD Implementation Checklist closures for Phase-4 items (Pierce-Schott Figure 2 + Table 1 coverage waivers, Assumption 5/6 non-testability docs, staggered-timing fail-closed `ValueError`).
Deviations from the paper / from R / library extensions:
- Equal-weighting on the continuous path (paper does not prescribe a unit-weighting scheme; library uses per-unit `w_g = 1` matching `_nprobust_port.lprobust`'s default, NOT cell-size weights). Locked in `tests/test_methodology_had.py::TestHADDeviations::test_equal_weighting_is_per_row_not_per_dose_cell` (probes the deviation via selective low-dose-region replication on a nonlinear DGP: per-row equal weighting predicts the att shifts; cell-size weighting predicts invariance).
- Sup-t bootstrap gating — runs only when `aggregate="event_study"` AND `(weights= or survey_design= supplied)` AND `cband=True`. Unweighted event-study bit-exactly preserves pre-Phase 4.5 B output. Locked in `TestHADDeviations::test_sup_t_bootstrap_skipped_*`.
- Pierce-Schott Figure 2 replication waived — R parity at `atol=1e-8` is a stronger anchor; paper Section 5.2 self-acknowledges NP estimators are too noisy on LBD-restricted PNTR data. See REGISTRY Deviations § "Pierce-Schott (2016) Figure 2 replication harness deferred" for the full scope-caveat statement.
- Table 1 coverage-rate reproduction waived — same R-parity-is-stronger rationale; R parity locks point estimate + SE + CI bounds bit-exactly, coverage-rate MC would re-verify the CCF asymptotic coverage already pinned. Paper Table 1 (89% / 93% / 95% under-coverage at G=100 / 500 / 2500) documents the asymptotic gap that BOTH R and Python inherit.
- Staggered-timing fail-closed `ValueError` at `diff_diff/had.py:1511` (paper prescribes "Warn"; library raises). Library extension toward stricter safety — `UserWarning` would let the silent-misuse bug class through. Locked in `TestHADDeviations::test_staggered_timing_fail_closed_value_error`.
- Eq. 18 linear-trend-detrended joint Stute SHIPPED (PR #389) and R-parity-locked against `DIDHAD::did_had(..., trends_lin=TRUE)` v2.0.0 in `tests/test_did_had_parity.py` (3 DGPs × 5 method combos at `atol=1e-8`). The `tests/test_methodology_had.py::TestHADJointStute` walkthrough deliberately covers only the un-detrended mean-independence and linearity variants (no coverage duplication with the R-parity surface). The Pierce-Schott (2016) NUMERICAL replication against the published p=0.51 anchor on the LBD-restricted PNTR panel is what's waived (Deviations Note #3).
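
The sup-t gating rule in this list reduces to a three-way conjunction, sketched below as a standalone predicate. The function `sup_t_bootstrap_enabled` is hypothetical and only mirrors the documented condition; the argument names follow the keywords quoted in the entry, and how the library evaluates the gate internally is not shown here.

```python
def sup_t_bootstrap_enabled(aggregate, cband, weights=None, survey_design=None):
    """Return True only when all three documented gate conditions hold."""
    # "Weighted" means either a weights input or a survey design was supplied.
    weighted = weights is not None or survey_design is not None
    return aggregate == "event_study" and bool(cband) and weighted


# Unweighted event study: the gate stays closed, so output is unchanged.
print(sup_t_bootstrap_enabled("event_study", cband=True))
# Weighted event study with bands requested: the gate opens.
print(sup_t_bootstrap_enabled("event_study", cband=True, weights=[1.0, 2.0]))
# Weighted but not an event-study aggregate: the gate stays closed.
print(sup_t_bootstrap_enabled("overall", cband=True, weights=[1.0, 2.0]))
```

Keeping the gate as a single pure predicate makes the skipped-bootstrap cases easy to enumerate in tests.
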
Outstanding Concerns:
- Module split (`had.py` ~4593 LoC, `had_pretests.py` ~4951 LoC) — tracked in TODO.md as tech debt, not a methodology gap.
- Bandwidth selector multi-eval, cross-horizon covariance on joint event-study — tracked as Phase follow-ups in TODO.md.
- Replicate-weight designs (BRR / Fay / JK1 / JKn / SDR) on HAD continuous path remain `NotImplementedError` (Phase 4.5 D follow-up).
- `covariates=` kwarg with Theorem 6 multivariate-covariate extension not implemented; currently a Python `TypeError` (kwarg absent from the `fit()` signature). Adding an explicit `**kwargs`-trap with `NotImplementedError` and a Theorem 6 pointer is tracked as a Low-priority follow-up in TODO.md.
| Field | Value |
|---|---|
| Module | trop.py, trop_local.py, trop_results.py (paper-aligned local method); trop_global.py (library-side efficiency adaptation — see "Scope" below) |
| Primary Reference | Athey, Imbens, Qu & Viviano (2025), Triply Robust Panel Estimators, arXiv:2508.21536 |
| R Reference | Paper-author reference implementation (not yet released as CRAN package) |
| Status | Complete (paper method="local", version-pinned to arXiv v2 — see Version Pinning below) |
| Last Review | 2026-05-24 |
Version Pinning: This methodology promotion is anchored on arXiv:2508.21536v2 (the version covered by the paper review on file at docs/methodology/papers/athey-2025-review.md). The current arXiv version is v3 (submitted 2026-02-09). A formal v2→v3 source delta-check against the v3 PDF has NOT been performed for any of the sections this PR promotes (Eqs. 2-3, Algorithms 1-3, Section 2.2, Section 5.2-5.3, Section 6.1-6.2, Theorem 5.1, Corollary 1, Appendix Theorem 8.1). Action item: before the next paper-author reference implementation or substantive v3 release, refresh the paper review against the most recent arXiv version and re-validate the verified-component checklist; until then the promotion stays v2-anchored.
Scope: This methodology promotion covers the paper-aligned method="local" path (paper Algorithm 2: per-(i, t) estimation with observation-specific weights). The library also exposes method="global", documented in REGISTRY.md as a "computationally efficient adaptation using the (1-W) masking principle from Eq. 2" — a library-side adaptation, NOT the paper's full Algorithm 2 estimator. Defensive coverage of the global method lives in tests/test_trop.py::TestTROPGlobalMethod (704 lines, ~30 tests for the global-method-specific surface) and is not duplicated in the methodology walk-through. Methodology promotion of method="global" as a primary surface would require either (a) a paper-side derivation of the global adaptation's equivalence to Algorithm 2 under specific conditions, or (b) a separate library-extension methodology review; both are deferred.
Verified Components:
- Eq. 2 weighted nuclear-norm-penalised L estimation: proximal-gradient inner solver (soft-threshold SVD) converges to `prox_{λ/2}(R)` under uniform weights; plain (non-accelerated) prox-gradient objective `f(L) + λ‖L‖_*` is non-increasing across iterations of a toy loop (this verifies the prox + gradient ingredient, NOT the shipped accelerated FISTA outer loop — Nesterov momentum gives the faster `O(1/k^2)` rate but does not guarantee per-step monotonicity); the shipped weighted-prox solver on non-uniform weights produces a final objective that is at-or-below initialisation (final-vs-initial check via `_weighted_nuclear_norm_solve`, NOT per-iteration monotonicity — the accelerated FISTA loop is allowed to have transient per-step increases) and reduces total singular-value mass below the residual. (Eq. 10 balancing representation / decomposition is the paper's identity built from these ingredients per Section 5.2; direct numerical reconstruction is out of scope — see "Outstanding Concerns".)
- Eq. 3 per-(i, t) weights: unit distance excludes the target period (`1{u ≠ t}` mask in the kernel) and uses only periods where both units are untreated (`(1 - W_iu)(1 - W_ju)` mask).
- Eqs. 4-5 + Algorithm 1 LOOCV: `Q(λ)` sums squared pseudo-treatment effects over ALL control observations where `D_js = 0` (including pre-treatment cells of eventually-treated units, paper Eq. 2 control set); two-stage coordinate-descent cycling (footnote 2) returns a tuple of values from the input grids.
- Corollary 1 (paper p. 23) — single-draw sanity checks consistent with the three unbiasedness conditions, not a repeated-MC mean-bias study: each of the three balance conditions (a) unit balance, (b) time balance, (c) `B = 0` is exercised on a targeted DGP that makes one condition trivially hold while keeping the others sub-optimal. The assertion in each case is a single-realisation `|att - τ| < 3 * se` band using the estimator's own bootstrap SE — this is a smoke check, NOT a repeated-draw Monte Carlo bias study of the paper's conditional-unbiasedness statement under fixed weights. A stronger MC bias study at fixed λ values is deferred (would multiply test runtime by ~30x for marginal additional evidence given the existing 3-σ band already catches order-of-magnitude bias regressions).
- Theorem 5.1 (paper p. 23) — simulation sanity check, not a direct theorem lock: the paper's bias bound `|E[τ_hat - τ | L]| <= ||Δ_u|| · ||Δ_t|| · ||B||_*` is stated for FIXED, non-data-dependent weights. The library's TROP fit uses data-dependent LOOCV-tuned λ values, so the direct conditional bias bound is not tested here. Instead, the methodology test verifies the bound's empirical realisation: TROP RMSE strictly below DID RMSE under a confounded factor DGP with `true τ = 0` (calibration measurement: TROP/DID RMSE ratio ≈ 0.34 at `factor_strength = 1.0`). The direct fixed-weight bound test is deferred — would require exposing oracle Γ / Λ / B from a paper-aligned DGP and computing each component of the bound from instrumented internals.
- Section 2.2 special-case reductions: DiD benchmark sanity check (not a direct algebraic-equivalence proof) — on a no-interactive-FE multi-period panel (additive unit + time effects only, no factor structure), TROP with `λ_nn = ∞` + uniform weights produces an ATT within 0.5 of `DifferenceInDifferences` fitted as `outcome ~ treat * post_flag` (basic 2×2 design with `[const, D, T, D×T]`, extended to repeated observations within each treat×post cell). This is empirical numerical agreement on a friendly DGP, NOT a proof of the paper Section 2.2 algebraic reduction (which would require either a true 2-period block-assignment panel where the basic-DiD comparator is the algebraic target, or a comparison against `TwoWayFixedEffects` — both deferred). Matrix Completion code path exercised, not equivalence-checked — TROP with uniform weights + finite `λ_nn` engages the nuclear-norm prox solver (`effective_rank > 0`) and recovers ATT better than the DiD-style baseline on a factor-confounded DGP; this verifies the code path activates but does NOT prove equivalence with an independent MC reference implementation (which would require either an external MC port or a hand-written reference solver). SC / SDID reductions deferred — see "Outstanding Concerns".
- Eq. 13 + Algorithm 2 per-(i, t) estimation: `treatment_effects` dict contains one finite `τ_hat_it` per treated cell; the aggregate ATT equals the unweighted mean of per-cell effects (Eq. 1). Tests cover block adoption with a constant treatment effect; absorbing-state staggered adoption and heterogeneous per-cell effects (paper Remark 6.1) are SUPPORTED by the code path but not directly verified in this methodology surface. Section 6.1 non-absorbing / on-off / switching assignment patterns are explicitly OUT OF SCOPE — the implementation rejects non-absorbing D-matrices via `trop_local.py` absorbing-state validation, and the methodology test enforces the rejection contract via `TestTROPDeviations::test_event_style_d_rejected_with_value_error` (event-style D being one specific non-absorbing pattern; the same absorbing-state validator catches all 1→0 transitions). Cross-coverage of the staggered-cohort fit path is `tests/test_methodology_trop.py::TestTROPAlgorithm1LOOCV::test_control_set_includes_pretreat_of_eventually_treated`.
- Algorithm 3 stratified pairs bootstrap: under an unbalanced (3 treated, 17 control) panel, the stratified sampler reliably produces ≥ 67% successful bootstrap draws and a positive finite SE.
- Section 3 / Eq. 6 semi-synthetic factor DGP: five recovery tests verify limiting-case uniform weights, unit-weight bias reduction, time-weight bias reduction, factor-model bias reduction with effective_rank > 0, and null-DGP recovery centred near zero.
- safe_inference contract: confidence interval uses the t-distribution with df = max(1, n_treated_obs - 1), consistent with p_value (matches REGISTRY `## TROP` "Inference CI distribution" note, post safe_inference migration).
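
The two masks in the Eq. 3 entry above can be made concrete with a small sketch. It assumes a root-mean-square distance over the retained periods and an exponential decay `exp(−λ_unit · distance)`; both functional forms are assumptions for illustration and should be checked against the paper's Eq. 3. The helper names and toy series are hypothetical and do not reproduce `trop_local.py`.

```python
from math import exp, sqrt


def unit_distance(y_i, y_j, w_i, w_j, target_period):
    """RMS outcome gap over periods u != t where both units are untreated."""
    squared_gap, kept = 0.0, 0
    for u, (a, b) in enumerate(zip(y_i, y_j)):
        if u == target_period:
            continue  # 1{u != t} mask: the target period is excluded
        if w_i[u] == 1 or w_j[u] == 1:
            continue  # (1 - W_iu)(1 - W_ju) mask: both must be untreated
        squared_gap += (a - b) ** 2
        kept += 1
    # No shared untreated period: treat the donor as infinitely far (assumption).
    return sqrt(squared_gap / kept) if kept else float("inf")


def unit_weight(distance, lambda_unit):
    """Unnormalised exponential weight; infinite distance maps to zero."""
    if distance == float("inf"):
        return 0.0
    return exp(-lambda_unit * distance)


# Toy series: unit i is treated in the last period, donor j never is.
y_i, w_i = [1.0, 2.0, 3.0, 10.0], [0, 0, 0, 1]
y_j, w_j = [1.0, 2.0, 5.0, 0.0], [0, 0, 0, 0]

d_target_3 = unit_distance(y_i, y_j, w_i, w_j, target_period=3)
d_target_2 = unit_distance(y_i, y_j, w_i, w_j, target_period=2)
print(d_target_3, unit_weight(d_target_3, lambda_unit=0.5))
print(d_target_2, unit_weight(d_target_2, lambda_unit=0.5))
```

With target period 2, the only large gap (period 2) is masked as the target and period 3 is masked as treated, so the distance drops to zero and the weight is exactly one.
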
Test Coverage:
- 36 methodology tests (10 classes) in `tests/test_methodology_trop.py`.
- Defensive guards (107 tests in `tests/test_trop.py`): D-matrix absorbing-state validation, silent-warning audit, FISTA convergence warnings, bootstrap-failure-rate proportional warning, bootstrap NaN-SE propagation, module-split smoke tests.
Deviations from paper:
- Gap #5 (weight normalisation): paper Section 5 (p. 20) states weights sum to one (`1ᵀω = 1ᵀθ = 1`), but Eq. 3 (p. 7) writes unnormalised exponential weights. The shipped implementation matches Eq. 2 (unnormalised). Locked by `tests/test_methodology_trop.py::TestTROPDeviations::test_unnormalized_weights_match_eq2`.
- Gap #9 (unbalanced panels): paper assumes a balanced `N × T` panel; the library accepts unbalanced panels with missing control / pre-treatment cells. The methodology test exercises 10% random drops on the control + pre-treatment subset (TROP fit completes and returns finite ATT). Three additional unbalanced-panel regressions are in `tests/test_trop.py::TestPR110FeedbackRound8`. Missing-treated-cell handling and thinner pre-period donor support are NOT directly covered by a dedicated regression — those edge cases lean on the absorbing-state monotonicity validation in `trop_local.py` and the validator-side tests in `tests/test_trop.py::TestDMatrixValidation`. Locked by `TestTROPDeviations::test_unbalanced_panels_supported`.
- Rank selection: the library follows the paper's implicit rank-selection via nuclear-norm soft-thresholding (paper review Gap #8). `TROP.__init__` does NOT expose a discrete `rank_selection` parameter; effective rank is reported via `TROPResults.effective_rank` (sum of singular values divided by largest) as a diagnostic, not as a user-selectable mode. Earlier REGISTRY prose mentioning "cv / ic / elbow" rank-selection methods was an overclaim — corrected in this PR. Locked by `TestTROPDeviations::test_rank_selection_is_implicit_via_nuclear_norm`.
- `λ_nn = ∞` → 1e10 internal conversion: results metadata stores the original `inf` value (REGISTRY `## TROP` "λ_nn=∞ implementation" edge-case note) while computations use 1e10 as a numerical sentinel. Locked by `TestTROPDeviations::test_lambda_nn_inf_stored_unchanged`.
- Bootstrap proportional 5% failure-rate warning and FISTA / outer-loop convergence warnings: defensive surfaces introduced under the Phase 2 silent-failure audit (REGISTRY `## TROP` "Bootstrap minimum" and "alternating-minimization convergence" notes). Covered in `tests/test_trop.py::TestTROPBootstrapFailureRateGuard` and `TestTROPConvergenceWarnings`.
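
The rank-selection and `λ_nn = ∞` entries above interact in a way that a few lines of arithmetic make visible. The sketch soft-thresholds a hypothetical list of singular values, computes the effective-rank diagnostic as the sum of singular values divided by the largest, and swaps an infinite penalty for the 1e10 sentinel while keeping `inf` in the stored metadata. All helper names are hypothetical, and the threshold scaling used by the shipped solver (the source mentions `prox_{λ/2}`) is deliberately not modelled.

```python
import math

LAMBDA_NN_SENTINEL = 1e10  # numerical stand-in for an infinite penalty


def resolve_lambda_nn(lambda_nn):
    """Replace an infinite penalty with the finite sentinel for computation."""
    return LAMBDA_NN_SENTINEL if math.isinf(lambda_nn) else lambda_nn


def soft_threshold(singular_values, threshold):
    """Shrink every singular value toward zero, clipping at zero."""
    return [max(s - threshold, 0.0) for s in singular_values]


def effective_rank(singular_values):
    """Sum of singular values divided by the largest (0.0 if all are zero)."""
    top = max(singular_values, default=0.0)
    return sum(singular_values) / top if top > 0 else 0.0


singular_values = [4.0, 2.0, 0.5]  # hypothetical spectrum of a residual matrix

finite = soft_threshold(singular_values, resolve_lambda_nn(1.0))
infinite = soft_threshold(singular_values, resolve_lambda_nn(math.inf))

# Metadata keeps the user's original value, not the sentinel.
metadata = {"lambda_nn": math.inf}

print(finite, effective_rank(finite))
print(infinite, effective_rank(infinite))
print(metadata)
```

A finite penalty leaves a positive effective rank, while the sentinel zeroes the whole spectrum, so the low-rank component vanishes without any explicit rank argument.
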
Outstanding Concerns:
- Equation 14 covariate extension (paper Section 6.2): the library does NOT implement covariates. `TROP.fit()` has no `covariates` keyword argument, so the Theorem 8.1 (paper Appendix, pp. 36-37) covariate-triple-robustness result is also out of scope. Non-support is locked by `TestTROPDeviations::test_covariates_not_supported`. Deferred until use cases motivate the X threading through `trop_local.py` / `trop_global.py` / LOOCV / bootstrap.
- SC / SDID special-case reductions (paper Section 2.2 third bullet): the paper claims TROP reduces to SC and SDID under `λ_nn = ∞` and "specific choices of unit and time weights", but the exact `(ω, θ)` maps are not provided in the paper text. The library does not expose an SC- or SDID-matching weight setter (only the Eq. 3 `λ_unit` / `λ_time` decay rates). Cross-language anchor against `SyntheticDiD` is deferred until paper-author code clarifies the weight map.
- Rust backend survey scope: the Rust accelerated paths remain pweight-only; full survey-design (strata, PSU, FPC via Rao-Wu) falls back to the Python bootstrap loop. Cross-backend parity is covered by `tests/test_trop.py` defensive surfaces.
- Eq. 10 balancing decomposition (paper Section 5.2 + Eq. 10): numerical reconstruction of the four-term identity `Y_NT_hat = L_NT + θ·(Y_pre_N - L_pre_N) + ω·(Y_T_co - L_T_co) - Σ θ_t ω_i (Y_it_co - L_it_co)` requires the internal per-(i, t) weight vectors `θ_s^{i,t}` / `ω_j^{i,t}`, which are not exposed on the public TROP API. The Eq. 2 ingredients that the Eq. 10 derivation builds on (soft-threshold SVD, plain prox-gradient monotonicity — NOT the shipped accelerated FISTA outer loop, which uses Nesterov momentum and does not guarantee per-step monotonicity — weighted-prox solver) are independently verified in `TestTROPNuclearNormProx`. Direct Eq. 10 lock deferred — would require exposing the internal weight vectors as a results-side diagnostic or instrumenting the test against the solver's intermediate state.
R Parity: Deferred until the paper-author reference implementation is released ("forthcoming" per the paper). Tracker entry will be reopened when it lands.
| Field | Value |
|---|---|
| Module | triple_diff.py |
| Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025), Better Understanding Triple Differences Estimators, arXiv:2505.09942 |
| R Reference | triplediff::ddd() (v0.2.1, CRAN) |
| Status | Complete |
| Last Review | 2026-02-18 |
Verified Components:
- ATT matches R `triplediff::ddd()` for all 3 methods (DR, RA, IPW): <0.001% relative difference
- SE matches R `triplediff::ddd()` for all 3 methods: <0.001% relative difference
- With-covariates ATT matches R: <0.001% relative difference
- With-covariates SE matches R: <0.001% relative difference
- Verified across all 4 DGP types from `gen_dgp_2periods()` (different model misspecification scenarios)
- Influence function-based SE: `SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)`
- Three-DiD decomposition: `DDD = DiD_3 + DiD_2 - DiD_1` matching R's approach
- `safe_inference()` used for all inference fields (t_stat, p_value, conf_int)
Test Coverage:
- 45 methodology tests in `tests/test_methodology_triple_diff.py`
Corrections Made:
- Complete rewrite of estimation methods (was naive cell-mean approach, now three-DiD decomposition). The original implementation computed DDD directly from 8 cell means with a naive cell-variance SE. Replaced with R's decomposition into three pairwise DiD comparisons (subgroup j vs reference subgroup 4), each using DR/IPW/RA methodology from Callaway & Sant'Anna. This fixed:
  - DR SE: was off by >100% (naive cell variance vs influence function)
  - IPW SE: was off by >200% (incorrect cell-probability-ratio weights)
  - With-covariates ATT: was off by >1000% for all methods (incorrect cell-by-cell regression)
- Influence function SE replaces naive cell variance for all methods: `SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)` where `w_j = n / n_j` and `IF_j` is the per-observation influence function for pairwise DiD j.
- Propensity score estimation now runs per pairwise comparison (P(subgroup=4|X) within the {j, 4} subset) instead of global P(G=1|X).
- Outcome regression now fits separate OLS per subgroup-time cell within each pairwise comparison, matching R's `compute_outcome_regression_rc()`.
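The influence-function SE formula quoted above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `ddd_influence_se` is a hypothetical name, each `if_j` is assumed to be a length-`n` array (zero for observations outside pairwise comparison j), and `n_j` is assumed to be the number of observations in that comparison.

```python
import numpy as np

def ddd_influence_se(if_3, if_2, if_1, n_3, n_2, n_1):
    """SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n), with w_j = n / n_j."""
    if_3, if_2, if_1 = (np.asarray(a, dtype=float) for a in (if_3, if_2, if_1))
    n = if_3.shape[0]
    # Rescale each pairwise influence function to the full-sample scale.
    w3, w2, w1 = n / n_3, n / n_2, n / n_1
    combined = w3 * if_3 + w2 * if_2 - w1 * if_1
    return float(np.std(combined, ddof=1) / np.sqrt(n))

rng = np.random.default_rng(1)
n = 200
se = ddd_influence_se(rng.normal(size=n), rng.normal(size=n), rng.normal(size=n),
                      n_3=120, n_2=110, n_1=100)
print(se)
```

The sign pattern mirrors the `DDD = DiD_3 + DiD_2 - DiD_1` decomposition, so the combined influence function is that of the DDD estimate itself.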
Outstanding Concerns:
- Panel mode (`panel=TRUE`) with differenced outcomes not yet implemented (see Deviations).
Deviations from R's triplediff::ddd():
- Repeated cross-section mode only: Implementation uses `panel=FALSE`. Panel mode with differenced outcomes is not yet implemented; users with balanced panel data and time-invariant covariates should compute first differences manually before fitting.
R Comparison Results (panel=FALSE, n=500 per DGP):
| DGP | Method | Covariates | ATT Diff | SE Diff |
|---|---|---|---|---|
| 1 | DR | No | <0.001% | <0.001% |
| 1 | DR | Yes | <0.001% | <0.001% |
| 1 | REG | No | <0.001% | <0.001% |
| 1 | REG | Yes | <0.001% | <0.001% |
| 1 | IPW | No | <0.001% | <0.001% |
| 1 | IPW | Yes | <0.001% | <0.001% |
| 2-4 | All | Both | <0.001% | <0.001% |
| Field | Value |
|---|---|
| Module | staggered_triple_diff.py, staggered_triple_diff_results.py |
| Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025) — same paper as TripleDifference, staggered case |
| R Reference | triplediff::ddd(panel=TRUE) + agg_ddd() (per benchmarks/R/benchmark_staggered_triplediff.R) |
| Status | Complete |
| Last Review | 2026-05-30 |
Documentation in place:
- Paper review: `docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md` (full-paper, equal-depth, arXiv:2505.09942v3; shared primary source with TripleDifference), PR #499
- REGISTRY.md section: `## StaggeredTripleDifference` (per-cohort comparisons against three sub-groups, DR/RA/IPW per component, GMM-optimal closed-form inverse-variance weighting, event-study via CS mixin, IF-based SEs, multiplier bootstrap for simultaneous bands, survey support)
- `tests/test_methodology_staggered_triple_diff.py`: R cross-validation (group-time ATT/SE, both control groups) + paper-equation-anchored Verified Components (below)
- Dedicated unit-test suite: `tests/test_staggered_triple_diff.py` (full coverage of DR/RA/IPW paths, both control-group modes, GMM weighting, event-study aggregation, edge cases)
- Survey-specific: `tests/test_survey_staggered_ddd.py` (incl. the Eq-4.14 overall under survey weighting)
Verified Components (validated against the paper + R):
- Identification (Theorem 4.1 / Eq. 4.5): RA = IPW = DR coincide without covariates.
- Three-term DDD decomposition (Eq. 4.1): post-treatment ATT(g,t) recover a known constant effect.
- GMM combination (Eqs. 4.11-4.12): optimal weights sum to one; a single comparison group reduces to `w=[1]`.
- Event study (Eq. 4.13): ES(e) equals the eligible-treated cohort-share-weighted average of ATT(g, g+e).
- Overall (Eq. 4.14 / Cor. 4.2): opt-in `overall_att_es` = unweighted mean of post-treatment ES(e), cross-validated against R `agg_ddd(type="eventstudy")$overall.att` / `overall.se`.
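The GMM combination properties listed above (weights sum to one, a single comparison group gives `w=[1]`) follow from the standard closed form `w = Ω⁻¹1 / (1'Ω⁻¹1)`. The sketch below assumes `omega` is the covariance matrix of the comparison-group-specific estimates; the function name `gmm_optimal_weights` is hypothetical and this is not taken from the library source.

```python
import numpy as np

def gmm_optimal_weights(omega):
    """Closed-form weights w = Omega^{-1} 1 / (1' Omega^{-1} 1)."""
    omega = np.atleast_2d(np.asarray(omega, dtype=float))
    ones = np.ones(omega.shape[0])
    raw = np.linalg.solve(omega, ones)  # Omega^{-1} 1 without forming the inverse
    return raw / (ones @ raw)

# With a diagonal covariance this is plain inverse-variance weighting.
print(gmm_optimal_weights(np.diag([1.0, 3.0])))
# A single comparison group collapses to w = [1].
print(gmm_optimal_weights([[2.5]]))
```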
R Comparison Results (benchmarks/R/benchmark_staggered_triplediff.R; triplediff::ddd(panel=TRUE) + agg_ddd(); CSV fixtures gitignored / regenerated on-the-fly, JSON golden committed):
| Quantity | Tolerance | Observed agreement |
|---|---|---|
| Group-time ATT(g,t) | rtol 0.1% | exact |
| Group-time SE(g,t) | rtol 1% | matches |
| Event-study ES(e) | rtol 25% | within (per-e eligible-treated weighting deviation) |
| Overall ATT, Eq. 4.14 (`overall_att_es`) | rtol 10% | ≤5% (weighting deviation averages out in the mean) |
| Overall SE, Eq. 4.14 (`overall_se_es`) | rtol 3% | ≤0.5% |
The paper-equation-anchored Verified Components above are deterministic and run without R.
The R cross-validation in this table runs only when local R + triplediff are available
(it skips otherwise — the fixtures are gitignored); making those fixtures deterministic in
CI and extending covariate-adjusted R parity are tracked follow-ups in TODO.md.
Documented deviations (verified non-masking; REGISTRY ## StaggeredTripleDifference): comparison-cohort admissibility (matches R triplediff, base-period/anticipation-aware; paper uses g_c > max(g,t)); aggregation weights P(S=g,Q=1) (matches paper Eq. 4.13 where G_i is defined only for Q=1, not R's P(S=g)) — drives the 25% aggregation tolerance; per-cohort group-effect WIF (conservative vs R wif=NULL); default overall_att is the CS-simple post-treatment average (paper Eq. 4.14 available opt-in as overall_att_es); cluster-robust analytical SEs accepted-but-deferred (multiplier bootstrap provides unit-level clustering).
| Field | Value |
|---|---|
| Module | synthetic_did.py |
| Primary Reference | Arkhangelsky et al. (2021) |
| R Reference | synthdid::synthdid_estimate() |
| Status | Complete |
| Last Review | 2026-04-23 |
Verified Components:
- Frank-Wolfe on the collapsed (N_co × T_pre) problem (Algorithm 1 of Arkhangelsky et al. 2021), matching R's `synthdid::fw.step()`
- Unit weights: Frank-Wolfe with two-pass sparsification, matching R's `synthdid::sc.weight.fw()` and `sparsify_function()`
- Time weights: Frank-Wolfe on collapsed form, matching R's `fw.step()`
- Auto-computed `zeta_omega`/`zeta_lambda` from data noise level `N_tr × σ²` (Appendix D), matching R's default behavior
- Pairs-bootstrap refit per Algorithm 2 step 2, warm-started from fit-time ω/λ via the new `init_weights=` kwargs on `compute_sdid_unit_weights` / `compute_time_weights`, matching R's `bootstrap_sample` which rebinds `attr(estimate, "opts")` per `update.omega=TRUE` / `update.lambda=TRUE`
- Placebo variance (library default) and jackknife variance methods
- Same-library validation: placebo-SE tracking vs. bootstrap-SE, AER §6.3 Monte Carlo truth
- All REGISTRY.md SyntheticDiD edge cases tested
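For readers unfamiliar with the Frank-Wolfe step referenced above, here is a generic sketch of one step with exact line search for a ridge-penalized least-squares problem on the simplex. It is written from the textbook algorithm under the assumption that the objective is `||A x - b||² + eta * ||x||²`; `frank_wolfe_step` and `objective` are hypothetical names, and the library's and R's exact scaling conventions may differ.

```python
import numpy as np

def objective(A, x, b, eta):
    r = A @ x - b
    return float(r @ r + eta * (x @ x))

def frank_wolfe_step(A, x, b, eta):
    """One Frank-Wolfe step toward the best simplex vertex, with exact line search."""
    Ax = A @ x
    half_grad = A.T @ (Ax - b) + eta * x       # half the gradient of the objective
    i = int(np.argmin(half_grad))              # vertex e_i minimizing the linearization
    d = -x.copy()
    d[i] += 1.0                                # search direction e_i - x
    if np.allclose(d, 0.0):
        return x                               # already at that vertex
    d_err = A[:, i] - Ax                       # equals A @ d
    denom = d_err @ d_err + eta * (d @ d)
    if denom <= 0.0:
        return x                               # flat direction (only possible when eta == 0)
    step = -(half_grad @ d) / denom            # unconstrained minimizer along d
    return x + np.clip(step, 0.0, 1.0) * d     # clip keeps x on the simplex

rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 5)), rng.normal(size=8)
x0 = np.full(5, 0.2)
x = x0.copy()
for _ in range(200):
    x = frank_wolfe_step(A, x, b, eta=0.1)
print(x.round(4), objective(A, x, b, 0.1))
```

Because each update is a convex combination of the current point and a vertex, the iterate never leaves the simplex, which is why no projection step is needed.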
Test Coverage:
- 157 methodology tests in `tests/test_methodology_sdid.py`
Corrections Made:
- Time weights: Frank-Wolfe on collapsed form (was heuristic inverse-distance). Replaced ad-hoc inverse-distance weighting with the Frank-Wolfe algorithm operating on the collapsed (N_co x T_pre) problem as specified in Algorithm 1 of Arkhangelsky et al. (2021), matching R's `synthdid::fw.step()`.
- Unit weights: Frank-Wolfe with two-pass sparsification (was projected gradient descent with wrong penalty). Replaced projected gradient descent (which used an incorrect penalty formulation) with Frank-Wolfe optimization followed by two-pass sparsification, matching R's `synthdid::sc.weight.fw()` and `sparsify_function()`.
- Auto-computed regularization from data noise level (was `lambda_reg=0.0`, `zeta=1.0`). Regularization parameters `zeta_omega` and `zeta_lambda` are now computed automatically from the data noise level (`N_tr * sigma^2`) as specified in Appendix D of Arkhangelsky et al. (2021), matching R's default behavior.
- Bootstrap SE is paper-faithful refit (Algorithm 2 step 2), matching R's default `synthdid::vcov(method="bootstrap")` including its warm-start shape. On each pairs-bootstrap draw, ω and λ are re-estimated via Frank-Wolfe on the resampled panel using the fit-time normalized-scale zeta. The Frank-Wolfe first pass is warm-started from the fit-time ω (renormalized over the resampled controls via `_sum_normalize`) and the fit-time λ (unchanged), matching R's `bootstrap_sample` which rebinds `attr(estimate, "opts")` so those weights serve as the FW initialization per `update.omega=TRUE` / `update.lambda=TRUE`. (Historical note: an earlier release shipped a fixed-weight shortcut here that matched neither the paper nor R's default vcov; that path was removed in PR #351 along with its R-parity fixture, which had also been mis-anchored. The same PR added the warm-start plumbing to `compute_sdid_unit_weights` / `compute_time_weights` via new `init_weights=` kwargs.)
- Default `variance_method` changed to `"placebo"`: intentional deviation from R's default (R's `synthdid::vcov()` defaults to `"bootstrap"`). The library default is placebo for two reasons: (a) placebo is unconditionally available on pweight-only survey designs, whereas refit bootstrap rejects every survey design in this release; (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in refit bootstrap. See REGISTRY.md §SyntheticDiD `Note (default variance_method deviation from R)` for details.
- Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and `zeta_lambda`. The old parameters had unclear semantics and did not correspond to the paper's notation. The new parameters directly match the paper and R package naming conventions. `lambda_reg` and `zeta` are deprecated with warnings and will be removed in a future release.
Outstanding Concerns:
- Cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` is desirable to bolster the methodology contract. Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; cross-language anchor tracked in TODO.md. The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path.
Deviations from R's synthdid::synthdid_estimate():
- Default `variance_method` is `"placebo"` (R defaults to `"bootstrap"`). Rationale: (a) placebo is unconditionally available on pweight-only survey designs, whereas refit bootstrap rejects every survey design in this release; (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in refit bootstrap. Documented in REGISTRY.md §SyntheticDiD `Note (default variance_method deviation from R)`.
- Parameter names: `zeta_omega`/`zeta_lambda` (matching the paper's notation); R uses `eta.omega`/`eta.lambda`. The deprecated Python aliases `lambda_reg`/`zeta` from prior releases emit `DeprecationWarning` and will be removed in a future release.
| Field | Value |
|---|---|
| Module | bacon.py |
| Primary Reference | Goodman-Bacon (2021), Difference-in-differences with variation in treatment timing, J. Econometrics 225(2), 254-277 |
| R Reference | bacondecomp::bacon() |
| Status | Complete |
| Last Review | 2026-05-16 |
Verified Components:
- Theorem 1 decomposition identity: `β̂^DD = Σ s · β̂^{2x2}` at `atol=1e-10` (hand-calculable + noisy DGPs)
- Weight sum-to-1: `Σ s = 1.0` at `atol=1e-10` under `weights="exact"`
- Three comparison types correctly classified: `treated_vs_never`, `earlier_vs_later`, `later_vs_earlier`
- Eq. 7 hand-checked: `V̂_{kU}^D = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k)` (via weight-ratio test, `atol=1e-10`)
- Eq. 8 hand-checked: `V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ) · (1-D̄_k)/(1-D̄_ℓ)`
- Eq. 9 hand-checked: `V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k-D̄_ℓ)/D̄_k`
- Eq. 10b 2x2 estimator value: hand-calculable panel → β̂_{kU}^{2x2} = ATT exactly
- Always-treated remap to U (paper footnote 11): `first_treat <= min(time)` (excluding never-treated sentinels `0` and `np.inf`) units auto-remapped via internal column, user's data preserved, count exposed on result
- `weights="exact"` is the default (PR-B 2026-05-16); `weights="approximate"` retained as opt-in
- Unbalanced panel: accepted with `UserWarning` (paper assumes balanced; library extension)
- No untreated group: `s_{kU}` terms drop, weights renormalize, sum-to-1 still holds
- Single timing group with U: only `treated_vs_never` comparisons
- Survey design composes cleanly with exact mode and warn+remap
- R `bacondecomp::bacon()` parity at `atol=1e-6`: 3 fixtures (`uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`); TWFE coefficient + weights-sum match across all 3 fixtures; per-component estimate + weight parity locked on the 2 non-remap fixtures and on the 6 timing-vs-timing rows of `always_treated_remapped` (carve-out narrowed to U-bucket rows only); R→Python U-bucket fold-back asserted by a dedicated `test_always_treated_remapped_fold_back_matches_r` test that aggregates R's split `Later vs Always Treated` + `Treated vs Untreated` rows per cohort and compares to Python's single `treated_vs_never` cell at `atol=1e-6`. See `benchmarks/data/r_bacondecomp_golden.json` + `TestBaconParityR`.
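The three within-subsample treatment variances quoted in the Eq. 7-9 bullets above are simple enough to hand-check. The sketch below transcribes them directly; the function names are hypothetical, `n_share` stands for the relative sample share `n_{kU}` or `n_{kℓ}`, and `d_k` / `d_l` stand for the treated-time shares `D̄_k` / `D̄_ℓ` with group k treated earlier (so `d_k > d_l`).

```python
def var_treated_vs_untreated(n_share, d_k):
    """Eq. 7: n_kU (1 - n_kU) * D_k (1 - D_k)."""
    return n_share * (1.0 - n_share) * d_k * (1.0 - d_k)

def var_earlier_as_treated(n_share, d_k, d_l):
    """Eq. 8: n_kl (1 - n_kl) * (D_k - D_l)/(1 - D_l) * (1 - D_k)/(1 - D_l)."""
    return (n_share * (1.0 - n_share)
            * (d_k - d_l) / (1.0 - d_l)
            * (1.0 - d_k) / (1.0 - d_l))

def var_later_as_treated(n_share, d_k, d_l):
    """Eq. 9: n_kl (1 - n_kl) * D_l / D_k * (D_k - D_l) / D_k."""
    return n_share * (1.0 - n_share) * (d_l / d_k) * (d_k - d_l) / d_k

# Equal-sized groups, treated for 75% and 25% of periods respectively.
print(var_treated_vs_untreated(0.5, 0.5))
print(var_earlier_as_treated(0.5, 0.75, 0.25))
print(var_later_as_treated(0.5, 0.75, 0.25))
```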
Test Coverage:
- 34 methodology tests in `tests/test_methodology_bacon.py` across 6 classes, all active, including the 4 R-parity tests (3 aggregate/per-component + 1 always-treated fold-back; goldens committed at `benchmarks/data/r_bacondecomp_golden.json`)
- 32 existing tests in `tests/test_bacon.py` (basic decomposition, weight properties, weights-parameter API, TWFE integration, visualization, balanced-panel warnings, edge cases)
R Comparison Results:
- Validated at `atol=1e-6` against `bacondecomp::bacon()` (version 0.1.1, R 4.5.2). Goldens at `benchmarks/data/r_bacondecomp_golden.json`; generator at `benchmarks/R/generate_bacon_golden.R`. Three DGP fixtures:
  - `uniform_3groups_with_never_treated`: 9 components covering all three comparison types, with full per-component parity (estimate + weight at `atol=1e-6`).
  - `two_groups_no_never_treated`: 2 components, timing-only decomposition, with full per-component parity.
  - `always_treated_remapped`: TWFE coefficient + weights-sum match at `atol=1e-6`; the 6 timing-vs-timing rows (between cohorts 3/4/5) also satisfy direct per-component parity at `atol=1e-6` (carve-out narrowed to U-bucket rows only). The U-bucket breakdown diverges by convention (Python's paper-footnote-11 U-remap vs R's distinct `Later vs Always Treated` cohort decomposition); the aggregate is invariant to the re-bucketing per Theorem 1, and the R→Python fold-back is pinned by `test_always_treated_remapped_fold_back_matches_r`, which aggregates R's split `Later vs Always Treated` + `Treated vs Untreated` rows per cohort and compares to Python's single `treated_vs_never` cell.
Corrections Made:
- Theorem 1 exact-weights rewrite (`bacon.py:_recompute_exact_weights`, lines ~740-880). The previous "exact" mode implementation did not actually compute Eqs. 7-9 / 10e-g: it was missing the `(1 - n_kU)` factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced a decomposition error of ~0.3% (0.007 absolute) against TWFE on a 3-cohort + never-treated DGP. Rewrote the function to compute the exact numerators of Eqs. 10e/f/g (with proper Eqs. 7-9 variances) and let the post-hoc normalization handle the `V̂^D` denominator (Theorem 1 identity guarantees `V̂^D = Σ numerators`). Now matches TWFE at `atol=1e-10`. The existing `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the contract.
- Default `weights` flipped from `"approximate"` to `"exact"` at three entry points: `BaconDecomposition.__init__()` (`bacon.py:397`), `bacon_decompose()` convenience function (`bacon.py:1064`), `TwoWayFixedEffects.decompose()` (`twfe.py:684`). The paper-faithful Theorem 1 weights are now the default; the simplified approximate path remains opt-in via explicit `weights="approximate"`. `diff_diff/diagnostic_report.py:1740` (production diagnostic surface) was updated to pass explicit `weights="exact"`.
- Always-treated warn+remap via internal column (`bacon.py:fit()`, lines ~487-525). Paper footnote 11 puts units with `t_i < 1` in `U`, but `bacon.py` previously only mapped `first_treat ∈ {0, np.inf}` into U. Added detection using ordered-time logic on the time axis (`first_treat <= min(time)` while excluding the never-treated sentinels `0` and `np.inf`) with `UserWarning` and automatic remap via an internal column (`__bacon_first_treat_internal__`), preserving the user's `first_treat` column unchanged. Detection handles event-time-encoded panels (`time ∈ [-2,..,3]`) correctly; the `0` sentinel restriction applies only to `first_treat`. Count exposed via new `BaconDecompositionResults.n_always_treated_remapped` field.
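The always-treated detection rule described above (`first_treat <= min(time)`, excluding the `0` and `np.inf` never-treated sentinels) can be sketched in a few lines. This is an illustration of the stated rule only; `always_treated_cohorts` is a hypothetical helper, not the code in `bacon.py`, and it works on plain Python sequences instead of a DataFrame.

```python
import math

def always_treated_cohorts(first_treat_values, time_values):
    """Return the first_treat cohorts that the inclusive rule would fold into U."""
    t_min = min(time_values)
    flagged = set()
    for g in set(first_treat_values):
        # 0 and +inf mark never-treated units and are never remapped here.
        is_never_treated = g == 0 or math.isinf(g)
        if not is_never_treated and g <= t_min:
            flagged.add(g)
    return flagged

# Calendar-style panel: periods 1..6, cohort 1 is treated from the first period on.
print(always_treated_cohorts([1, 3, 5, 0, float("inf")], range(1, 7)))
# Event-time-encoded panel: periods -2..3; first_treat == 0 is still the sentinel.
print(always_treated_cohorts([-3, 0, 2], range(-2, 4)))
```

The second call shows why the sentinel check applies to `first_treat` only: a period labelled `0` on the time axis is an ordinary period.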
Deviations from R's bacondecomp::bacon() and from the paper:
- First-period boundary extension on always-treated remap (library convention, deviation from paper footnote 11 strict rule and from R): Goodman-Bacon (2021) footnote 11 uses strict `t_i < 1` for the always-treated bucket (units treated before the first observable period). The library applies the inclusive `first_treat <= min(time)` rule, additionally folding units treated at the first observable period (`first_treat == min(time)`) into `U`. Rationale: such units have no untreated cell in-panel and cannot contribute as a treated cohort, so folding them into U mirrors the always-treated handling rather than dropping them silently. R `bacondecomp::bacon()` does NOT apply this boundary fold-back: it keeps `first_treat == min(time)` cohorts in their own bucket and emits `Later vs Always Treated` comparisons. When `min(time) > 1` (no first-period-treated cohorts) the library rule reduces to the paper's strict rule. Documented in REGISTRY `**Deviation (first-period boundary extension on always-treated remap)**`.
- Unbalanced panel acceptance (library extension): R errors on unbalanced panels; Python emits a `UserWarning` and decomposes. The paper's Appendix A proof assumes balanced panels, so the decomposition on unbalanced panels only approximates Theorem 1.
- Approximate weight mode (Python-only optimization): `weights="approximate"` is a library-only fast path with simplified variance computation, not present in R. Users who want Python-R numerical parity should pass `weights="exact"` (the new default).
- NaN for invalid inference fields not applicable: the decomposition is deterministic; there are no SE/p-value fields on the comparison output. The `decomposition_error` field is a finite float (zero in well-conditioned cases).
| Field | Value |
|---|---|
| Module | honest_did.py |
| Primary Reference | Rambachan & Roth (2023), A More Credible Approach to Parallel Trends, RES 90(5), 2555-2591 |
| R Reference | HonestDiD package |
| Status | Complete |
| Last Review | 2026-04-01 |
Verified Components:
- Delta^SD: second-difference constraints [1,-2,1] with delta_0=0 boundary handling
- Delta^SD: T+Tbar-1 constraint rows (bridge constraint at t=0)
- Delta^RM: constrains first differences (not levels), union of polyhedra per Lemma 2.2
- Identified set LP: pins delta_pre = beta_pre via equality constraints (Equations 5-6)
- M=0 for Delta^SD: linear extrapolation gives finite point-identified bounds
- Mbar=0 for Delta^RM: point identification (all post first-diffs = 0)
- Optimal FLCI for Delta^SD: folded normal cv_alpha, Nelder-Mead over pre-period weights
- Sensitivity grid: bounds computed for each M in grid, breakdown value via binary search
- Survey variance (RM, M=0 smoothness): t-distribution critical values from df_survey
- Survey variance (M>0 smoothness): optimal FLCI uses asymptotic normal only; df_survey=0 → NaN
- CallawaySantAnna integration: universal base period, reference period filtering
- Three-period analytical case matches paper Section 2.3
- ARP hybrid for Delta^RM: infrastructure implemented, moment inequality transformation needs calibration
- R comparison: pending (benchmark scripts need updating)
Test Coverage:
- Comprehensive unit-test coverage in `tests/test_honest_did.py` (15 test classes spanning DeltaSD/DeltaRM/DeltaSDRM bounds, FLCI, ARP infrastructure, CS integration, edge cases), all passing
- 27 methodology verification tests in `tests/test_methodology_honest_did.py`
- R benchmark tests (pending)
- Paper review on file: `docs/methodology/papers/rambachan-roth-2023-review.md`
Corrections Made:
- DeltaRM: first differences, not levels (`honest_did.py`, `_construct_constraints_rm_component`): The paper's Delta^RM constrains `|delta_{t+1} - delta_t|` (consecutive first differences) bounded by Mbar × max pre-treatment first difference. The code constrained `|delta_post|` (absolute levels) bounded by Mbar × max `|beta_pre|`. Completely rewritten using union-of-polyhedra decomposition per Lemma 2.2.
- LP pins delta_pre = beta_pre (`honest_did.py`, `_solve_bounds_lp`): The paper's identified set LP (Equations 5-6) fixes pre-treatment violations to the observed pre-treatment coefficients. The code had no equality constraint, so delta_pre was unconstrained. For Delta^SD(M=0), this made the LP unbounded. Added A_eq/b_eq equality constraints.
- DeltaSD constraint matrix: delta_0=0 boundary (`honest_did.py`, `_construct_A_sd`): The code built second-difference matrices treating [delta_{-T},...,delta_{-1},delta_1,...,delta_{Tbar}] as consecutive, missing delta_0=0 at the boundary. Three boundary rows were wrong:
  - t=-1: `d_{-2} - 2*d_{-1} + 0` (uses delta_0=0)
  - t=0: `d_{-1} + d_1` (bridge constraint, was missing)
  - t=1: `0 - 2*d_1 + d_2` (uses delta_0=0)

  Now produces T+Tbar-1 rows (was T+Tbar-2).
- Optimal FLCI for Delta^SD (`honest_did.py`, `_compute_optimal_flci`): Replaced naive FLCI (`lb - z*se, ub + z*se`) with the paper's optimal FLCI (Section 4.1): jointly optimizes affine estimator direction v and half-length chi using folded normal critical values cv_alpha(bias/se). Significantly narrower CIs.
- REGISTRY.md equations (`docs/methodology/REGISTRY.md`): DeltaSD equation was first differences (should be second differences). DeltaRM equation was absolute levels (should be first differences). Both corrected with full formulations.
- Performance (`honest_did.py`): Sensitivity grid reduced from ~9 minutes to 0.1 seconds via: Newton's method for cv_alpha (5 iterations vs 100), centrosymmetric bias LP (1 solve vs 2), M=0 short-circuit, looser Nelder-Mead tolerances.
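The delta_0=0 boundary fix above is easiest to see by building the second-difference rows on the full vector that includes the delta_0 slot, then dropping that column. The sketch below is an illustration under that reading; `second_difference_matrix` is a hypothetical name and not the library's `_construct_A_sd`.

```python
import numpy as np

def second_difference_matrix(n_pre, n_post):
    """Rows [1, -2, 1] over (d_{-T},...,d_{-1}, d_0, d_1,...,d_{Tbar}), with d_0 = 0 dropped."""
    full_len = n_pre + 1 + n_post              # includes the delta_0 slot
    if full_len < 3:
        raise ValueError("need at least three consecutive periods including t=0")
    rows = []
    for start in range(full_len - 2):
        row = np.zeros(full_len)
        row[start:start + 3] = [1.0, -2.0, 1.0]
        rows.append(row)
    A_full = np.vstack(rows)
    # delta_0 is fixed at zero, so its column contributes nothing and is removed.
    return np.delete(A_full, n_pre, axis=1)

A_sd = second_difference_matrix(n_pre=2, n_post=2)
print(A_sd)
print(A_sd.shape)  # (T + Tbar - 1, T + Tbar)
```

For T=2 and Tbar=2 the three rows are exactly the t=-1, t=0 (bridge), and t=1 rows listed above.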
Outstanding Concerns:
- Delta^RM CI: uses naive FLCI (conservative) instead of the paper's ARP conditional/hybrid confidence sets. ARP infrastructure exists but moment inequality transformation needs calibration. Tracked in TODO.md.
- R benchmark comparison not yet run (Python benchmark needs API update)
- Combined method uses single M for both SD and RM (DeltaSDRM dataclass has separate M/Mbar)
Deviations from R's HonestDiD:
- Deviation from R: Delta^RM CIs use naive FLCI (`lb - z*se, ub + z*se`) instead of ARP conditional/hybrid. Conservative (wider CIs, valid coverage). ARP deferred.
- Note: Delta^SD optimal FLCI matches the paper's Section 4.1 methodology: first-difference reparameterization, slope weights with sum(w)=sum_j j*l_j constraint (Eq. 17), bias LP in fd-space, folded normal (or folded non-central t for survey df). Nelder-Mead optimizer vs R's custom solver may produce numerical differences at tolerance level.
- Note: `method="combined"` (Delta^SDRM) uses naive FLCI on the intersection of SD and RM bounds. The paper proves FLCI is not consistent for Delta^SDRM (Proposition 4.2). A runtime UserWarning is emitted. Use `method="smoothness"` or `method="relative_magnitude"` separately for paper-supported inference.
- Note (deviation from R): Python warns (doesn't error) when CallawaySantAnna results use `base_period != "universal"`. R's HonestDiD requires universal base period.
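The folded-normal critical value `cv_alpha(bias/se)` mentioned above is the `1 - alpha` quantile of `|N(b, 1)|`. A minimal standard-library sketch follows; it uses plain bisection for clarity (the entry above says the library uses Newton's method), and `folded_normal_cv` is a hypothetical name.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def folded_normal_cv(bias_over_se, alpha=0.05, tol=1e-10):
    """Smallest c with P(|Z + b| <= c) >= 1 - alpha for Z ~ N(0, 1)."""
    b = abs(bias_over_se)
    target = 1.0 - alpha

    def coverage(c):
        return _STD_NORMAL.cdf(c - b) - _STD_NORMAL.cdf(-c - b)

    lo, hi = 0.0, b + 10.0                     # coverage(hi) is numerically 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(folded_normal_cv(0.0))   # reduces to the usual two-sided normal critical value
print(folded_normal_cv(1.0))   # grows with the worst-case bias
```

With zero bias the value collapses to the familiar two-sided z critical value, and for large bias it approaches `b` plus the one-sided critical value.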
| Field | Value |
|---|---|
| Module | pretrends.py |
| Primary Reference | Roth (2022), Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends, AER:I 4(3), 305-322 |
| R Reference | pretrends package |
| Status | Complete |
| Last Review | 2026-05-19 |
Documentation in place:
- REGISTRY.md section: `## PreTrendsPower`, an NIS-framed audit per Roth (2022) Section II.A-B with full equation blocks for both NIS and Wald forms; paper-supported alternative + γ-unit MDV + full-Σ_22 routing all locked.
- Paper review on file: `docs/methodology/papers/roth-2022-review.md` (added 2026-05-17 via PR #463).
- Implementation: `tests/test_pretrends.py` (67 tests: point-estimator, MDV, power curve, sensitivity, plus the PR-A R18 silent-failure regression and the PR-B custom-weight persistence regression) + event-study coverage in `tests/test_pretrends_event_study.py` (27 tests).
- Dedicated `tests/test_methodology_pretrends.py` (added 2026-05-18 in PR-B Step 7; PR-C 2026-05-19 activated `TestPretrendsParityR` with 4 concrete tests): Roth (2022) Section II.A-B paper-equation-numbered Verified Components walk-through (8 classes covering NIS box probability, Wald-vs-NIS, Propositions 1-4 simulation parity, linear-units γ-scale, custom-weight persistence, CS/SA full-VCV, helper API, R parity at commit `122731d082`).
- R parity goldens: `benchmarks/data/r_pretrends_golden.json` generated by `benchmarks/R/generate_pretrends_golden.R` against `jonathandroth/pretrends` commit `122731d082` (package version 0.1.0); 4 fixtures (regular K=3, irregular K=3 `[-5,-3,-1]`, anticipation-shifted K=4, K=1 closed form) × NIS power + γ_p MDV at `atol=1e-4`.
Verified Components:
- NIS box probability implemented via `scipy.stats.multivariate_normal.cdf` (Roth Section II.A-B primary form)
- Wald noncentral-χ² form retained as paper-supported alternative (Propositions 1+3+4 all apply: convex ellipsoid acceptance region)
- Both forms produce form-consistent MDV via doubling + brentq bisection with 1000-cap non-convergence fallback
- Non-bootstrap CS adapter consumes full `event_study_vcov` sub-block (not diag)
- Non-bootstrap SA adapter consumes full `event_study_vcov` sub-block (W-matrix construction `event_study_vcov = W @ vcov_cohort @ W.T` added to `SunAbrahamResults`)
- Bootstrap CS/SA and replicate-weight survey paths fall through to `diag(ses^2)` (analytical VCV cleared to prevent mixing with bootstrap/replicate SE overrides)
- `_get_violation_weights('linear')` honors actual pre-period relative-time labels via `fit()` threading → reported MDV is in Roth's γ units on irregular and anticipation-shifted grids. For `MultiPeriodDiDResults`, supported label types are numeric (`int`/`float`/`np.int64`) and `pandas.Period` / `pandas.Timestamp` / `np.datetime64`; genuinely non-numeric labels (string period IDs, unranked categoricals) emit an explicit `UserWarning` and fall through to the legacy count-based normalized direction (MDV is NOT in γ units in that case; re-fit with numeric labels)
- `PreTrendsPowerResults` persists fitted `violation_weights` + `pretest_form` + `nis_box_probability`; `power_at(M)` works for all four violation types on fresh fits
- Helper API (`compute_pretrends_power`, `compute_mdv`) accepts `violation_weights` and `pretest_form`; closes the PR-A R18 helper/class API gap
- Summary, `to_dict`, `to_dataframe` dispatch on `pretest_form` (NIS prints box probability; Wald prints noncentrality)
- R `pretrends` parity at commit `122731d082` (PR-C, 2026-05-19): 4 fixtures × NIS power + γ_p MDV at `atol=1e-4`; `tests/test_methodology_pretrends.py::TestPretrendsParityR` active
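To make the NIS box probability concrete, the sketch below computes it for the special case of independent pre-period coefficients (diagonal Σ), where the box probability factorizes into one-dimensional normal probabilities. This is a simplification for illustration only: the entry above states that the library evaluates the full multivariate normal CDF, and `nis_power_independent` is a hypothetical name.

```python
from statistics import NormalDist

def nis_power_independent(delta, se, alpha=0.05):
    """Return (power, box_probability) for independent pre-period coefficients.

    box_probability = P(every |beta_hat_k| <= z_{1-alpha/2} * se_k) when
    beta_hat_k ~ N(delta_k, se_k^2); power = 1 - box_probability.
    """
    std = NormalDist()
    crit = std.inv_cdf(1.0 - alpha / 2.0)
    box_prob = 1.0
    for d, s in zip(delta, se):
        shift = d / s
        box_prob *= std.cdf(crit - shift) - std.cdf(-crit - shift)
    return 1.0 - box_prob, box_prob

# No violation: the pretest rejects only at its nominal per-coefficient size.
print(nis_power_independent([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
# A linear violation with slope 0.5 over relative times -3, -2, -1.
print(nis_power_independent([-1.5, -1.0, -0.5], [1.0, 1.0, 1.0]))
```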
| Field | Value |
|---|---|
| Module | power.py |
| Primary References | Bloom (1995) — normal MDE multiplier; Burlig, Preonas & Woerman (2020) — panel-DiD variance (equicorrelated special case of Eq. 2) |
| R Reference | pwr::pwr.norm.test (analytical, normal-based — not pwr.t.test); Stata pcpanel (Burlig panel); DeclareDesign (simulation) |
| Status | Complete |
| Last Review | 2026-05-31 |
Verified components:
- MDE multiplier `M = z_{1-α/2 (or 1-α)} + z_{1-κ}` is the normal (Bloom 1995) multiplier; reproduces Bloom Table 1 (2.49 @ one-sided .05/.80, 2.93, 2.17).
- The unified equicorrelated SE `√(σ²(1/n_T+1/n_C)(1/m+1/r)(1−ρ))` (Burlig Eq. 2 equicorrelated special case): the panel path (T>2) and the 2×2 path (the m=r=1 case `√(2σ²(1/n_T+1/n_C)(1−ρ))`, reducing to Bloom Eq. 1's DiD analog at ρ=0) are validated by closed-form assertions, a literal-equicorrelated Monte-Carlo check, and base-R `qnorm` parity (incl. a 2×2 ρ>0 fixture).
- Allocation factor `f(1−f)` (50/50-optimal) and the exact two-tailed normal power function confirmed.
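The two formulas above combine into an MDE as multiplier times standard error. The sketch below transcribes them using only the standard library; the function names are hypothetical, and it reads `1-κ` as the target power, which is an assumption about the notation.

```python
from math import sqrt
from statistics import NormalDist

def mde_multiplier(alpha=0.05, power=0.80, two_sided=True):
    """Normal multiplier M = z_{1-alpha/2} (or z_{1-alpha}) + z_{power}."""
    z = NormalDist().inv_cdf
    tail = alpha / 2.0 if two_sided else alpha
    return z(1.0 - tail) + z(power)

def equicorrelated_se(sigma, n_treated, n_control, n_pre, n_post, rho):
    """sqrt(sigma^2 (1/n_T + 1/n_C) (1/m + 1/r) (1 - rho)), with m pre and r post periods."""
    return sqrt(sigma**2
                * (1.0 / n_treated + 1.0 / n_control)
                * (1.0 / n_pre + 1.0 / n_post)
                * (1.0 - rho))

m_one_sided = mde_multiplier(alpha=0.05, power=0.80, two_sided=False)
print(round(m_one_sided, 2))  # 2.49, the one-sided .05/.80 value cited above

se_2x2 = equicorrelated_se(sigma=1.0, n_treated=50, n_control=50, n_pre=1, n_post=1, rho=0.3)
print(mde_multiplier() * se_2x2)  # two-sided MDE for a 2x2 design
```

Raising `rho` shrinks the standard error here, which is the direction the corrected panel variance is described as having.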
Corrections made (PR-B):
- Panel variance switched from the Moulton `(1+(T−1)ρ)/T` factor (wrong period-scaling, ~4× too small at ρ=0, m=r=5, and wrong ρ-sign) to the Burlig Eq. 2 equicorrelated `(1/m+1/r)(1−ρ)` form, in which within-unit correlation lowers the MDE. The two existing direction tests (`test_icc_effect`, `test_extreme_icc`) were inverted; tutorial `06_power_analysis.ipynb` was corrected. Input guards added for all designs (validated before the 2×2-vs-panel router): `n_pre≥1`, `n_post≥1`, `ρ ∈ [−1/(T−1), 1)`; the `(1−ρ)` factor also applies at T=2 (the m=r=1 case, Burlig footnote 11), so ρ is not silently ignored there.
- REGISTRY equation block rewritten (z not t; corrected SE / sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source).
Deviations (documented in REGISTRY ## PowerAnalysis):
- Critical values use the normal (z) distribution (Bloom 1995), a large-sample approximation to Burlig Eq. 1's t, labelled `**Deviation from R:**`.
- Only the equicorrelated special case of Burlig Eq. 2 is implemented (single ρ); the fully general SCR form (independent ψ^B/ψ^A/ψ^X) is not.
Tests: tests/test_methodology_power.py (Bloom Table 1; 2×2 + panel closed forms; Monte-Carlo; round-trip; validation guards; R parity) + tests/test_power.py. R goldens at benchmarks/data/r_power_golden.json (generator benchmarks/R/generate_power_golden.R).
| Field | Value |
|---|---|
| Module | diagnostics.py |
| Primary Reference | None canonical (general permutation / leave-one-out diagnostic) |
| R Reference | None canonical |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## PlaceboTests` (NaN-inference edge cases for `permutation_test` and `leave_one_out_test`)
- Implementation: tests embedded in `tests/test_diagnostics.py`
Outstanding for promotion:
- Decide whether this surface warrants a standalone methodology review or whether the brief Verified Components walk-through + NaN-inference deviation log should live as a sub-section under each per-estimator diagnostic block instead
- If kept standalone: brief Verified Components block + Deviations block for the NaN-inference convention
These are not estimators but variance/inference plumbing used across many estimators. They warrant their own methodology reviews because the implementation details (kernel choice, weight rescaling, df adjustment) are independently citable.
| Field | Value |
|---|---|
| Module | conley.py, linalg.py (_validate_vcov_args, kernel construction) |
| Primary Reference | Conley (1999), GMM Estimation with Cross-Sectional Dependence, J. Econometrics 92(1), 1-45 |
| Secondary References | Andrews (1991) HAC theory; Colella, Lalive, Sakalli & Thoenig (2019) for the Stata acreg parallel; Düsterhöft (2021) conleyreg (CRAN) parity target |
| Status | Complete |
| Last Review | 2026-05-26 |
Verified Components:
- Eq. 4.2 cross-sectional sandwich (pairwise-distance specialization) `Var(β) = (X'X)^{-1} (Σ_{i,j} K(d_ij/h) X_i ε_i ε_j X_j') (X'X)^{-1}`. diff-diff implements the real-valued / pairwise-distance form (Conley 1999 Eq. 4.2 plus the "pairwise products at a given distance" remark on page 19); Eq. 3.13 is the lattice-indexed form reserved for grid coordinates. `tests/test_methodology_conley.py::TestConleyEquation42`.
- Eq. 4.2 limits: tiny-cutoff → HC0 diagonal; huge-cutoff under uniform → rank-1 correlated limit; K(0) = 1 diagonal contribution always exact. `TestConleyHC0AndRank1Reductions`.
- Andrews (1991) HAC lag truncation `(1 - |t-s|/(L+1))` for `0 < |t-s| ≤ L`, matching `conleyreg::time_dist.cpp`; lag 0 excluded to avoid double-counting. `TestConleyAndrewsLagTruncation`.
- Haversine convention: Earth radius 6371.01 km matches `conleyreg::haversine_dist`; 1° lat = 111.195 km at equator. `TestConleyHaversineConvention`.
- Phase 2 panel block-decomposed sandwich `XeeX = XeeX_spatial + XeeX_serial` matches `conleyreg::time_dist.cpp` at `atol=1e-12`; cluster time-invariance constraint validated on panel path. `TestConleyBlockDecomposition`.
- Wave A #120 sparse k-d-tree path produces a meat that matches the dense path to `atol=1e-10` (the tolerance absorbs chord-projection roundoff on haversine). `TestConleySparseDenseEquivalence`.
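The Eq. 4.2 sandwich and its tiny-cutoff limit can be illustrated in a few lines of NumPy. This is a sketch under stated assumptions rather than the code in `conley.py`: the names `bartlett` and `conley_vcov` are hypothetical, and the kernel is taken to be the 1-D radial Bartlett form `K(u) = max(0, 1 - u)`.

```python
import numpy as np


def bartlett(u):
    """1-D radial Bartlett kernel K(u) = max(0, 1 - u), so K(0) = 1."""
    return np.clip(1.0 - u, 0.0, None)


def conley_vcov(X, resid, dist, cutoff):
    """Sandwich (X'X)^{-1} M (X'X)^{-1}, M = sum_{i,j} K(d_ij/h) X_i e_i e_j X_j'.

    X: (n, k) design; resid: (n,) residuals;
    dist: (n, n) pairwise distances; cutoff: bandwidth h in the same units.
    """
    K = bartlett(dist / cutoff)
    scores = X * resid[:, None]      # row i holds X_i * e_i
    meat = scores.T @ K @ scores
    bread = np.linalg.inv(X.T @ X)
    return bread @ meat @ bread


rng = np.random.default_rng(0)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = rng.normal(size=n)
coords = rng.uniform(size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# With a tiny cutoff only the K(0) = 1 diagonal survives, which is HC0.
bread = np.linalg.inv(X.T @ X)
hc0 = bread @ (X.T * e**2) @ X @ bread
print(np.allclose(conley_vcov(X, e, dist, cutoff=1e-9), hc0))
```

The dense `(n, n)` kernel matrix is the simplest way to show the formula; it is not meant to reflect how the sparse path avoids materializing it.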
Test Coverage:
- `tests/test_methodology_conley.py` (~1600 LoC; 10 paper-anchored / R-parity / deviations classes; 60 tests including 5 `@pytest.mark.slow`)
- `tests/test_conley_vcov.py` (~3100 LoC remaining after extraction; 11 classes; defensive surface: input validation, NaN/inf guards, dispatch-level validity, estimator-level integration smoke tests, set_params atomicity, sparse-path activation thresholds + density-gate fallback)
R Comparison Results (R conleyreg v0.1.9, Düsterhöft 2021):
| Fixture | Surface | Tolerance | Test |
|---|---|---|---|
| `small_haversine` | Cross-sectional | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_parity_small_haversine` |
| `dense_haversine` | Cross-sectional | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_parity_dense_haversine` |
| `lat_lon_realistic` | Cross-sectional | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_parity_lat_lon_realistic` |
| `panel_haversine_lag1` | Panel (Phase 2) | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_parity_panel_haversine_lag1` |
| `panel_haversine_lag2` | Panel (Phase 2) | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_parity_panel_haversine_lag2` |
| `panel_lat_lon_realistic_lag1` | Panel (Phase 2) | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_parity_panel_lat_lon_realistic_lag1` |
| Sparse k-d-tree (forced) on cross-sectional fixtures | Cross-sectional | atol=1e-6, rtol=1e-6 | `TestConleyParityR::test_sparse_forced_matches_r_cross_sectional` |
| Internal block-decomposition cross-check | Phase 2 algebraic identity vs conleyreg::time_dist.cpp | atol=1e-12 | `TestConleyBlockDecomposition::test_panel_matches_block_decomposed_reference` |
| Time-asymmetric kernel literal-matching (spatial kernel does NOT propagate to temporal component) | Phase 2 contract | atol=1e-10 | `TestConleyParityR::test_time_asymmetric_kernel_matches_r_literal` |
Goldens at benchmarks/data/r_conleyreg_conley_golden.json; generator at benchmarks/R/generate_conley_golden.R (Earth radius 6371.01 km matches conleyreg::haversine_dist).
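The haversine convention can be checked by hand with a short standard-library sketch. The function name `haversine_km` is hypothetical and this is not the library's implementation; only the 6371.01 km radius comes from the text above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.01  # radius quoted in this entry


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp against roundoff pushing sqrt(a) marginally above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# One degree of latitude starting at the equator, about 111.195 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 3))
```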
Corrections Made:
- PR (this PR): Closed Outstanding-for-promotion items via the new methodology test file (`tests/test_methodology_conley.py`), the inline R-parity summary table above, the consolidated deviations-area classes (`TestConleyLibraryExtensions` + `TestConleyDeviationsFromR` + `TestConleyDeferrals`), and the Phase 5 reframing.
- Cited-paper correction (this PR): Stripped Bertanha-Imbens 2014 citation across 16 sites (`linalg.py` × 8, `conley.py` × 1, `llms-full.txt` × 2, `REGISTRY.md` × 4, `spillover.rst` × 1). NBER w20773 is External Validity in Fuzzy Regression Discontinuity Designs (JBES 2020), unrelated to weighted spatial-HAC. Replaced with: "weighted spatial-HAC under probability sampling is an open methodological question; no canonical extension of Conley (1999) exists for the combination." At REGISTRY sites the replacement is wrapped in the canonical `**Note (open methodological question):**` label per CLAUDE.md "Documenting Deviations (AI Review Compatibility)".
Deviations from the paper / from R / library extensions (consolidated; see REGISTRY ## ConleySpatialHAC § Note (deviation / source specialization) for full text):
- 1-D radial Bartlett vs. paper's 2-D separable Eq. 3.14 PSD-guaranteed form: practitioner specialization matching R `conleyreg`, Stata `acreg`, Hsiang (2010); not formally PSD-guaranteed. Indefiniteness guard at `-1e-12` applies to both spatial and cluster kernels (REGISTRY L3609). `TestConleyDeviationsFromR::test_1d_radial_bartlett_vs_2d_separable_eq314` + `TestConleyLibraryExtensions::test_indefiniteness_guard_fires_on_negative_eigenvalue`.
- Combined spatial + cluster product kernel (Wave A #119, library extension; no R correspondence; two limit-fixture anchors): `K_total[i, j] = K_space(d_ij/h) · 1{c_i = c_j}`. Cluster time-invariance contract enforced on panel path. Anchor 1 (all-unique-clusters → HC0 diagonal): `TestConleyLibraryExtensions::test_combined_spatial_cluster_kernel_wave_a_119_all_unique`. Anchor 2 (huge-cutoff + uniform → CR1 without Liang-Zeger correction): `TestConleyLibraryExtensions::test_combined_spatial_cluster_kernel_wave_a_119_huge_cutoff`.
- Callable `conley_metric` validation (Wave A #123, library extension; no R correspondence): 6 invariants checked (`(n,n)` shape, finite, non-negative, symmetric to `atol=1e-10`, zero diagonal, float64-castable). `TestConleyLibraryExtensions::test_callable_metric_validation_wave_a_123`.
- Sparse k-d-tree fast path (Wave A #120, library extension; auto-activates at `n > 5,000` with Bartlett + haversine/euclidean; density gate at 30% falls back to dense). `TestConleyLibraryExtensions::test_sparse_kd_tree_activation_wave_a_120` (activation contract); `TestConleySparseDenseEquivalence` (numerical equivalence).
- Time-label normalization via `np.unique(return_inverse=True)` (REGISTRY L3592-3603): diff-diff normalizes time labels to dense panel-period codes before lag computation; R `conleyreg` uses raw time values literally. diff-diff is the more robust default on non-dense encodings. `TestConleyDeviationsFromR::test_time_label_normalization_deviation_from_r`.
- Time-asymmetric kernel (matches R `conleyreg` literal; library extension would be follow-up): R's `kernel` argument controls the spatial component only; the temporal kernel is unconditionally Bartlett. diff-diff matches this asymmetry exactly. Parity contract in `TestConleyParityR::test_time_asymmetric_kernel_matches_r_literal`; deferral in `TestConleyDeviationsFromR::test_independent_temporal_kernel_deferred`.
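The practical effect of the time-label normalization deviation is easiest to see on a non-dense encoding. The sketch below is a hedged illustration, not the library's code: `bartlett_lag_weights` is a hypothetical helper, and the same-unit restriction on serial pairs is omitted for brevity.

```python
import numpy as np


def bartlett_lag_weights(time_codes, lag_cutoff):
    """Pairwise weights 1 - |t-s|/(L+1) for 0 < |t-s| <= L, else 0.

    Lag 0 gets weight 0 here because it is excluded from the serial term.
    """
    lag = np.abs(time_codes[:, None] - time_codes[None, :])
    weights = 1.0 - lag / (lag_cutoff + 1.0)
    return np.where((lag > 0) & (lag <= lag_cutoff), weights, 0.0)


raw_times = np.array([2000, 2005, 2010, 2005])  # five-year survey waves
# Dense period codes 0..P-1 from np.unique(return_inverse=True).
_, dense_codes = np.unique(raw_times, return_inverse=True)

w_dense = bartlett_lag_weights(dense_codes, lag_cutoff=1)
w_raw = bartlett_lag_weights(raw_times, lag_cutoff=1)

print(dense_codes)   # adjacent waves become consecutive codes
print(w_dense[0, 1]) # adjacent waves receive the lag-1 weight
print(w_raw.sum())   # raw labels are 5 apart, so no pair falls within lag 1
```

With raw labels, a `lag_cutoff` of 1 never links adjacent waves; with dense codes it does, which is the behaviour the deviation note describes.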
Outstanding Concerns:
- Spillover-conley dependency: resolved. SpilloverDiD ships Conley + survey via Wave E.1/E.2/E.3 (PR #468 / #474 / #482, stratified-Conley sandwich on PSU totals with within-PSU serial Bartlett HAC for `lag_cutoff > 0`); TwoStageDiD ships Wave E.3 parity (PR #485, `fdf2cebc`).
- Generic LinearRegression / DiD / MPD / TWFE + Conley + survey_design: deferred. Two fail-closed contracts assert the unsupported combination: the estimator-level gate (DiD / MPD / TWFE) lives in `diff_diff/conley.py::_validate_conley_estimator_inputs` (`TestConleyDeferrals::test_did_mpd_twfe_survey_design_not_implemented`); the generic LinearRegression gate lives in `diff_diff/linalg.py::LinearRegression.fit` (`TestConleyDeferrals::test_linear_regression_survey_design_not_implemented`). The `survey_design=` surface is `LinearRegression` only; `compute_robust_vcov` does not accept `survey_design=`.
- Generic LinearRegression / compute_robust_vcov + Conley + `weights=`: deferred for any `weight_type` (`pweight`/`aweight`/`fweight`). Weighted Conley is not implemented on the generic linalg surface. The `pweight`/`survey_design` subset additionally reflects an open methodological question: no canonical extension of Conley (1999) exists for weighted spatial-HAC under probability sampling. `TestConleyDeferrals::test_conley_plus_weights_not_implemented`.
- SyntheticDiD + Conley: raises `TypeError`. SyntheticDiD uses bootstrap / jackknife / placebo variance, not the analytical sandwich. `TestConleyDeferrals::test_synthetic_did_conley_typeerror`.
- Wild bootstrap + Conley: `NotImplementedError`. Wild bootstrap is a separate inference path that does not consume the analytical Conley sandwich (REGISTRY L3547). `TestConleyDeferrals::test_wild_bootstrap_conley_not_implemented`.
- No default bandwidth: Conley 1999 doesn't propose a plug-in selector; REGISTRY recommends a `(50, 100, 200, 500)` km sensitivity grid mirroring Conley 1999 § 5. `tests/test_conley_vcov.py::TestConleyValidatorHelpers::test_missing_cutoff_raises` enforces the fail-closed contract.
- DiagnosticReport routing for `SpilloverDiDResults(vcov_type="conley", survey_design=)` is queued for a follow-up Wave F PR; `_APPLICABILITY` / `_PT_METHOD` registration is required before the new combination can be claimed consumable downstream.
| Field | Value |
|---|---|
| Module | survey.py, bootstrap_utils.py (plus per-estimator hooks) |
| Primary References | Binder (1983) for TSL variance; Lumley (2004) for the R survey package; Solon, Haider & Wooldridge (2015) for the "when to weight" framework |
| R Reference | survey R package |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md sub-sections (under `## Survey Data Support`): Weighted Estimation, TSL Variance, Weight Type Effects on Inference, Absorbed FE with Survey Weights, Survey Degrees of Freedom, Survey Aggregation (`aggregate_survey`), Survey-Aware Bootstrap (Phase 6), Replicate Weight Variance (Phase 6), DEFF Diagnostics (Phase 6), Subpopulation Analysis (Phase 6), Survey DGP (`generate_survey_did_data`)
- Theory document: `docs/methodology/survey-theory.md`, the full Binder-Lumley derivation of design-based variance for modern DiD estimators, including influence-function machinery
- 13 dedicated `tests/test_survey*.py` files: `test_survey.py`, `test_survey_dcdh.py`, `test_survey_dcdh_replicate_psu.py`, `test_survey_estimator_validation.py`, `test_survey_phase3.py`, `test_survey_phase4.py`, `test_survey_phase5.py`, `test_survey_phase6.py`, `test_survey_phase7a.py`, `test_survey_phase8.py`, `test_survey_r_crossvalidation.py`, `test_survey_real_data.py`, `test_survey_staggered_ddd.py`
- Per-estimator survey hooks documented in the REGISTRY sections of every estimator that supports survey design (DiD/TWFE/MultiPeriodDiD, CS, SunAbraham, StackedDiD, ImputationDiD, TwoStageDiD, WooldridgeDiD, EfficientDiD, ContinuousDiD, DCDH, HAD, TripleDifference, StaggeredTripleDifference, TROP, SyntheticDiD). Scope is estimators; survey-capable diagnostics (e.g., `BaconDecomposition` Phase 3, `HonestDiD` survey-df handling) are tracked in their own sections.
Outstanding for promotion:
- Dedicated `tests/test_methodology_survey.py` (or split between TSL and replicate-weight surfaces) with Binder-equation-numbered Verified Components walk-through
- R parity benchmark against `survey::svyglm` / `survey::svycontrast` for the linear DiD case (`tests/test_survey_r_crossvalidation.py` exists; needs to be wired into a documented "Reference results" table here)
- Document deviations: PSU-level Hall-Mammen wild clustering as the bootstrap path when survey design is present (vs. R `survey`'s default analytical TSL); strata-vs-no-strata bit-equality not achievable due to RNG-path divergence between the per-stratum numpy loop and the batched `generate_survey_multiplier_weights_batch` call (see `docs/methodology/REGISTRY.md` HAD Stute survey-bootstrap section, "Distributional parity, NOT bit-exact" note, for the documented impossibility: distributional parity holds at large B, exact agreement at `atol=1e-10` does not)
- Consolidated "Outstanding cross-estimator gaps" enumerating which estimators still raise `NotImplementedError` on which survey-design combinations (e.g., Conley + survey, SyntheticDiD + Conley, HAD replicate weights on Stute family)
For each estimator, complete the following steps:
- Read primary academic source - Review the key paper(s) cited in REGISTRY.md and write a `docs/methodology/papers/<name>-review.md` review if one doesn't exist
- Compare key equations - Verify implementation matches equations in REGISTRY.md
- Run benchmark against reference implementation - Execute `benchmarks/run_benchmarks.py --estimator <name>` if available; otherwise generate fixtures and document parity tolerances
- Verify edge case handling - Check behavior matches REGISTRY.md documentation
- Check standard error formula - Confirm SE computation matches reference (analytical, bootstrap, cluster-robust, survey-aware)
- Write dedicated methodology test file - `tests/test_methodology_<name>.py` with paper-equation-numbered assertions that correspond 1:1 to the Verified Components list
- Document deviations - Add notes explaining intentional differences with rationale, using one of the REGISTRY.md labels (`- **Note:**`, `- **Deviation from R:**`, `**Note (deviation from R):**`)
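A methodology test file of the kind described in the checklist might be laid out roughly as follows. Everything here is a hypothetical sketch: the file name, the class names, and the toy `did_2x2` function stand in for a real estimator and its paper equations, and only the one-class-per-Verified-Component layout is the point.

```python
# tests/test_methodology_example.py (hypothetical file name)
from statistics import mean


def did_2x2(y_treat_pre, y_treat_post, y_ctrl_pre, y_ctrl_post):
    """Toy stand-in for the estimator under review: the 2x2 DiD contrast."""
    return (mean(y_treat_post) - mean(y_treat_pre)) - (
        mean(y_ctrl_post) - mean(y_ctrl_pre)
    )


class TestExampleEquation1:
    """Verified Component: 'Eq. 1 point estimate' (hypothetical label)."""

    def test_matches_hand_computed_value(self):
        # Treated moves 1 -> 4 (+3); control moves 2 -> 3 (+1); DiD = 2.
        assert did_2x2([1, 1], [4, 4], [2, 2], [3, 3]) == 2


class TestExampleParallelShift:
    """Verified Component: 'Zero effect under a common shift' (hypothetical label)."""

    def test_zero_effect_under_common_shift(self):
        # Both groups shift by +5, so the contrast cancels.
        assert did_2x2([0, 2], [5, 7], [1, 3], [6, 8]) == 0
```

Naming each class after the equation or component it anchors keeps the test file and the Verified Components list easy to cross-check line by line.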
- After completing a review: Update status to "Complete" and add date, populate Verified Components / Corrections Made / Deviations sections
- When making corrections: Document what was fixed in the "Corrections Made" section with file path and line number
- When identifying issues: Add to "Outstanding Concerns" for future investigation
- When deviating from reference: Document the deviation and rationale; cross-reference the REGISTRY.md `Note (deviation from R)` block
- When promoting from In Progress to Complete: Replace the "Documentation in place" / "Outstanding for promotion" pair with the full Verified Components / Corrections Made / Deviations structure used by Complete entries
- When adding a new estimator to the library: Add a row to the appropriate Status Summary table marked In Progress and a stub section under the matching category in Detailed Review Notes (Documentation in place / Outstanding for promotion) — same PR that introduces the estimator. New surfaces enter as In Progress because they ship with a REGISTRY.md entry and unit tests by definition.
When our implementation intentionally differs from the reference implementation, document:
- What differs: Specific behavior or formula that differs
- Why: Rationale (e.g., "defensive enhancement", "bug in R package", "follows updated paper")
- Impact: Whether results differ in practice
- Cross-reference: Update REGISTRY.md edge cases section using one of the recognized labels
Example:
**Deviation (2025-01-15)**: CallawaySantAnna returns NaN for t_stat when SE is non-finite,
whereas R's `did::att_gt` would error. This is a defensive enhancement that provides
more graceful handling of edge cases while still signaling invalid inference to users.
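The behaviour described in that example deviation can be sketched in a few lines. The helper name `safe_t_stat` is hypothetical and this is not the library's code; the extra guard against a non-positive SE is an added assumption, since the example only mentions non-finite SEs.

```python
import math


def safe_t_stat(estimate, se):
    """Return estimate / se, or NaN when the SE cannot support inference."""
    # Non-finite SE follows the example; the se <= 0 guard is an added assumption.
    if not math.isfinite(se) or se <= 0:
        return math.nan
    return estimate / se


print(safe_t_stat(2.0, 0.5))       # ordinary case
print(safe_t_stat(2.0, math.nan))  # NaN signals invalid inference instead of raising
```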
Promotion priority for the In Progress entries, ordered by what's blocked on substantive review work (top of list = needs review next) vs. consolidation pass (bottom of list = mostly tracker walk-through):
Substantive-review-blocked (still missing a methodology test file / R parity and a paper review):
- PlaceboTests: decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
- TwoStageDiD: the remaining half of the imputation pair (ImputationDiD is now Complete, validated against `didimputation`). Needs a Gardner (2022) paper review, `tests/test_methodology_two_stage.py`, and an R parity fixture against `did2s`.
Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):
- Survey Data Support — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
- REGISTRY.md — Academic foundations and key equations
- docs/methodology/papers/ — Per-paper retrospective reviews (Athey 2025, Butts 2021/2023, Clarke 2017, Colella et al. 2019, Conley 1999, de Chaisemartin 2026, Rambachan-Roth 2023, Wooldridge 2023)
- docs/methodology/continuous-did.md — ContinuousDiD theory note
- docs/methodology/survey-theory.md — Design-based variance estimation for modern DiD estimators
- docs/methodology/REPORTING.md — Reporting conventions across estimators
- ROADMAP.md — Feature roadmap
- TODO.md — Technical debt tracking, including deferred methodology items from code reviews
- CLAUDE.md — Development guidelines