This document records the methodology choices embedded in
BusinessReport and DiagnosticReport — the convenience layer that
produces plain-English stakeholder narratives from any diff-diff result.
Methodology for estimators lives in REGISTRY.md. This file is the
single source for reporting-layer decisions; REGISTRY.md cross-links
here rather than duplicating content.
- diff_diff/business_report.py: BusinessReport, BusinessContext.
- diff_diff/diagnostic_report.py: DiagnosticReport, DiagnosticReportResults.
Both modules dispatch by type(results).__name__ lookup to avoid
circular imports across the 16 result classes. They do no estimator
fitting and do not re-derive any variance from raw data; every effect,
SE, p-value, CI, and sensitivity bound is either read from the fitted
result or produced by an existing diff-diff utility
(compute_honest_did, HonestDiD.sensitivity, bacon_decompose,
check_parallel_trends, compute_pretrends_power). When the caller
passes the raw panel + column kwargs, DiagnosticReport may call
those utilities on the supplied data (2x2 PT via
check_parallel_trends, Goodman-Bacon decomposition via
bacon_decompose, and the EfficientDiD Hausman PT-All vs PT-Post
pretest via EfficientDiD.hausman_pretest).
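The name-based dispatch described above can be illustrated with a minimal sketch. The registry name _HANDLERS, the handler functions, and the stand-in result classes (including their attribute names) are hypothetical and exist only for this example; the real modules may organise the lookup differently.

```python
# Stand-in result classes. In the library these live in estimator modules
# that the reporting layer avoids importing (hypothetical attributes).
class DiDResults:
    att = -0.0214


class MultiPeriodDiDResults:
    avg_att = 0.05


def _describe_did(results):
    return {"aggregation": "did_or_twfe", "value": results.att}


def _describe_event_study(results):
    return {"aggregation": "event_study", "value": results.avg_att}


def _describe_unknown(results):
    return {"aggregation": "unknown", "value": None}


# Keys are class NAMES (strings), so no result class has to be imported
# by the module that owns this table.
_HANDLERS = {
    "DiDResults": _describe_did,
    "MultiPeriodDiDResults": _describe_event_study,
}


def describe(results):
    """Look up a handler by the result's class name, with a safe fallback."""
    handler = _HANDLERS.get(type(results).__name__, _describe_unknown)
    return handler(results)
```

Keying on the class name trades static type checking for import independence: a renamed result class silently falls through to the fallback handler unless a test guards against it.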
The design_effect section of DiagnosticReport.to_dict() is a
read-only surface: it echoes survey_metadata.design_effect and
effective_n from the fitted result along with a band_label enum
classifying the deviation from 1. The enum values are:
- "improves_precision" for deff < 0.95 (effective N is LARGER than nominal N, a precision-improving design);
- "trivial" for 0.95 <= deff < 1.05 (effectively no effect on inference);
- "slightly_reduces" for 1.05 <= deff < 2;
- "materially_reduces" for 2 <= deff < 5;
- "large_warning" for deff >= 5;
- None when deff is missing or non-finite.
The section does not call compute_deff_diagnostics (that helper
needs per-fit internals the result objects do not expose).

The report layer does compose a few
cross-period summary statistics from per-period inputs already
produced by the estimator — specifically the joint-Wald / Bonferroni
pre-trends p-value from pre-period event-study coefficients (see
_pt_event_study), the MDV-to-ATT ratio for power-tier selection,
and the heterogeneity dispersion block (CV / range / sign-
consistency over post-treatment group / event-study / group-time
effects, pre-period and reference-marker rows excluded). These are
reporting-layer aggregations of inputs already in the result object,
not new inference.
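As an illustration of how little the reporting layer needs for such an aggregation, here is a sketch of a Bonferroni-style joint p-value built only from per-period p-values. The formula min(1, k * min p) is the textbook Bonferroni adjustment; whether _pt_event_study uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def bonferroni_joint_p(pre_period_p_values):
    """Bonferroni-adjusted joint p-value from per-period p-values.

    Non-finite or missing entries are dropped; returns None if nothing
    usable remains (an assumption made for this sketch).
    """
    usable = [
        p for p in pre_period_p_values
        if p is not None and math.isfinite(p)
    ]
    if not usable:
        return None
    k = len(usable)
    return min(1.0, k * min(usable))
```

Because it ignores the correlation between pre-period coefficients, this adjustment is conservative, which is consistent with its role as a fallback for a joint Wald test that needs the covariance matrix.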
The BusinessReport and DiagnosticReport schemas both carry a
top-level target_parameter block that names what scalar the
headline number actually represents. The 16 result classes have
meaningfully different estimands — a stakeholder reading
overall_att = -0.0214 on a Callaway-Sant'Anna fit cannot tell
whether that is the simple-weighted average across ATT(g,t)
cells, an event-study-weighted aggregate, or a group-weighted
aggregate. Baker et al. (2025) Step 2 is "Define the target
parameter"; BR/DR does that work for the user.
Schema shape:
"target_parameter": {
  "name": "overall ATT (cohort-size-weighted average of ATT(g,t))",
  "definition": "A cohort-size-weighted average of group-time ATTs ...",
  "aggregation": "simple",
  "headline_attribute": "overall_att",
  "reference": "Callaway & Sant'Anna (2021); REGISTRY.md Sec. CallawaySantAnna"
}

Field semantics:
- name: short stakeholder-facing name. Rendered verbatim in BR's summary paragraph and DR's overall-interpretation paragraph. Always non-empty.
- definition: plain-English description of what the scalar is and how it is aggregated. Rendered in BR's and DR's full-report markdown (under "## Target Parameter") but omitted from the summary paragraph so stakeholder prose stays within the 6-10-sentence target.
- aggregation: machine-readable tag that dispatching agents can branch on. Complete enumeration per estimator:
  - "did_or_twfe" (DiDResults / TwoWayFixedEffects both route here; a neutral tag, ambiguous at the result-class level until estimator provenance is persisted)
  - "event_study" (MultiPeriodDiDResults)
  - "simple" (CallawaySantAnna / Imputation / TwoStage / Wooldridge)
  - "iw" (SunAbraham)
  - "stacked" (StackedDiD)
  - "pt_all_combined" / "pt_post_single_baseline" (EfficientDiD, branched on pt_assumption)
  - "dose_overall" (ContinuousDiD)
  - "ddd" / "staggered_ddd" (TripleDifference / StaggeredTripleDiff)
  - dCDH dynamic branches follow the exact overall_att contract: "M" / "M_x" / "M_fd" / "M_x_fd" for L_max=None; "DID_1" / "DID_1_x" / "DID_1_fd" / "DID_1_x_fd" for L_max=1; "delta" / "delta_x" for L_max>=2 without trend suppression; and "no_scalar_headline" when trends_linear=True AND L_max>=2 (the scalar is intentionally NaN).
  - "synthetic" (SyntheticDiD) / "factor_model" (TROP) / "twfe" (BaconDecomposition read-out) / "unknown" (default fallback).
- headline_attribute: the raw result attribute the scalar comes from ("overall_att" / "att" / "avg_att" / "twfe_estimate"), OR None when aggregation == "no_scalar_headline" (the dCDH trends_linear=True, L_max>=2 branch where overall_att is intentionally NaN by design). Agents dispatching on this field must handle None by inspecting headline.reason (BR) / headline_metric.reason (DR), which distinguishes two subcases:
  - Populated-surface subcase (per-horizon linear_trends_effects dict is non-empty): reason directs callers to results.linear_trends_effects[l] for per-horizon cumulated level effects.
  - Empty-surface subcase (linear_trends_effects is None because no horizons survived estimation): reason names the empty state explicitly and directs callers toward re-fit remediation (larger L_max or trends_linear=False) rather than a nonexistent dict. The dCDH native estimand label is also branched: on this subcase _estimand_label() returns DID^{fd}_l (no cumulated level effects survived estimation) (or DID^{X,fd}_l (...) when covariates are active).

  Different result classes use different attribute names; agents that want to re-read the raw value can dispatch on headline_attribute.
- reference: one-line citation pointer to the canonical paper and the REGISTRY.md section.
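A consuming agent might handle the headline_attribute contract along the following lines. This is a sketch under stated assumptions: the function read_headline is hypothetical, the report is taken to be the plain dict returned by to_dict(), and the "reason" text in the usage example is invented for illustration.

```python
def read_headline(report, results):
    """Return (value, reason) for the headline scalar.

    value is None when the schema declares no scalar headline; in that
    case reason carries the explanation from the report, if present.
    """
    attr = report["target_parameter"]["headline_attribute"]
    if attr is None:
        # BR stores the explanation under "headline",
        # DR under "headline_metric".
        block = report.get("headline") or report.get("headline_metric") or {}
        return None, block.get("reason")
    # Re-read the raw value from whichever attribute the schema names.
    return getattr(results, attr), None
```

Usage with stand-in objects:

```python
from types import SimpleNamespace

fit = SimpleNamespace(overall_att=-0.0214)
scalar_report = {"target_parameter": {"headline_attribute": "overall_att"}}
print(read_headline(scalar_report, fit))

no_scalar_report = {
    "target_parameter": {"headline_attribute": None},
    "headline": {"reason": "illustrative reason text"},
}
print(read_headline(no_scalar_report, fit))
```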
Per-estimator dispatch lives in
diff_diff/_reporting_helpers.py::describe_target_parameter. Each
branch is sourced from the corresponding estimator's section in
REGISTRY.md; new result classes must add an explicit branch (the
exhaustiveness test TestTargetParameterCoversEveryResultClass
locks this in).
A few branches read fit-time config from the result object:
- EfficientDiDResults.pt_assumption: "all" (over-identified combined) vs "post" (just-identified single-baseline) branches aggregation between "pt_all_combined" and "pt_post_single_baseline".
- StackedDiDResults.clean_control: "never_treated" / "strict" / "not_yet_treated" varies the definition clause describing which units qualify as controls.
- ChaisemartinDHaultfoeuilleResults.L_max + covariate_residuals + linear_trends_effects: branches the dCDH estimand tag per the exact overall_att contract in chaisemartin_dhaultfoeuille.py:2602-2634 and chaisemartin_dhaultfoeuille.py:2828-2834:
  - L_max=None → DID_M (Phase 1 per-period aggregate; aggregation="M").
  - L_max=1 → DID_1 (single-horizon per-group estimand, Equation 3 of the dynamic companion paper; aggregation="DID_1").
  - L_max>=2 → cost-benefit delta (Lemma 4 cross-horizon aggregate; aggregation="delta").
  - trends_linear=True AND L_max>=2 → overall_att is intentionally NaN (no scalar aggregate; per-horizon level effects live on results.linear_trends_effects[l]). aggregation="no_scalar_headline" and headline_attribute is None.

  Covariates (has_controls) and/or linear trends (has_trends, when L_max < 2) add _x / _fd / _x_fd suffixes to the aggregation tag and the corresponding ^X / ^{fd} / ^{X,fd} superscripts to the name (e.g. DID^X_1, delta^X, DID^{fd}_M), matching the result class's own _estimand_label() helper at chaisemartin_dhaultfoeuille_results.py:454-490.
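The dCDH branching rules can be condensed into a small function. This is a sketch of the tag logic as this document states it, not the library's implementation; the function name and signature are hypothetical, and has_trends stands in for the trends_linear=True configuration.

```python
def dcdh_aggregation_tag(l_max, has_controls=False, has_trends=False):
    """Derive the dCDH aggregation tag from fit-time config (illustrative)."""
    if l_max is not None and l_max < 1:
        raise ValueError("l_max must be None or a positive integer")

    # Linear trends with two or more horizons: no scalar aggregate exists.
    if has_trends and l_max is not None and l_max >= 2:
        return "no_scalar_headline"

    if l_max is None:
        base = "M"
    elif l_max == 1:
        base = "DID_1"
    else:
        base = "delta"

    suffix = ""
    if has_controls:
        suffix += "_x"
    # Only reachable when l_max is None or 1, because the l_max >= 2 case
    # with trends returned above.
    if has_trends:
        suffix += "_fd"
    return base + suffix
```

Checking the trend-suppression case first keeps the suffix logic simple: "delta" can only ever gain the "_x" suffix, which matches the enumerated tags.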
A few branches emit a fixed tag regardless of fit-time config —
notably CallawaySantAnna, ImputationDiD, TwoStageDiD, and
WooldridgeDiD. For these estimators the overall_att
(or att / avg_att) scalar is ALWAYS the simple weighted
aggregation; the fit-time aggregate kwarg populates additional
horizon / group tables on the result object but does not change
the headline scalar. Disambiguating those tables in prose is
tracked under BR/DR gap #9 (per-cohort narrative rendering).
ContinuousDiDResults emits a single "dose_overall" tag with a
disjunctive definition (ATT^loc under PT; ATT^glob under
SPT) because the PT-vs-SPT regime is a user-level assumption, not
a library setting.
- Note: No hard pass/fail gates. DiagnosticReport does not produce a traffic-light verdict. Severity is conveyed through natural-language phrasing ("robust", "fragile", "material share"). This is an explicit deviation from the strategy document's Gap 4 ("traffic-light assessment (green/yellow/red)"); the choice is motivated by the well-known risk of naive thresholds producing false confidence. A ConservativeThresholds opt-in layer remains available as a future addition if practitioner demand materialises.
- Note: Placebo battery is opt-in (run_placebo=False by default). run_all_placebo_tests on a typical panel (500 permutations times one DiD fit per permutation) adds tens of seconds of latency, which would be surprising as the default on a convenience wrapper. The schema reserves the "placebo" key; it is always rendered with {"status": "skipped", "reason": "..."} in MVP so agents parsing the schema see a stable shape.
- Note: DiagnosticReport does not call check_parallel_trends on event-study or staggered result objects. check_parallel_trends in diff_diff/utils.py assumes a single binary treatment with universal pre-periods; for staggered and event-study designs, DR reads the pre-period event-study coefficients directly and constructs a joint Wald statistic (or Bonferroni fallback when vcov is missing). This mirrors the guidance in practitioner._parallel_trends_step(staggered=True).
- Note: Survey-design threading for fit-faithful Bacon replay. DiagnosticReport(survey_design=...) and BusinessReport(survey_design=...) accept the original SurveyDesign object and forward it to bacon_decompose(survey_design=...) so the Goodman-Bacon decomposition is computed under the same design as the weighted estimate. When survey_metadata is set but survey_design is not supplied, Bacon skips with an explicit reason rather than replaying an unweighted decomposition for a design that differs from the weighted estimate; users can alternatively pass precomputed={'bacon': ...} with a survey-aware result.

  The simple 2x2 parallel-trends helper (utils.check_parallel_trends) has no survey-aware variant. On a survey-backed DiDResults the check is skipped unconditionally, regardless of whether survey_design is supplied, because the helper cannot consume the design even when it is available. Users must pass precomputed={'parallel_trends': ...} with a survey-aware pretest result to opt in. Event-study PT on staggered estimators is unaffected: it reads the weighted pre-period coefficients directly off the fitted result and uses the finite-df reference described below, so no second replay is needed.
- Note: Survey finite-df PT policy. When the fitted result carries a finite survey_metadata.df_survey, _pt_event_study computes F = W / k (numerator df = k pre-period coefficients) against an F(k, df_survey) reference distribution rather than chi-square(k). The design-based SE already reflects the effective sample size, so the chi-square reference would systematically over-reject under the finite-sample correction the SE captures. The schema surfaces the survey branch via the method suffix _survey (e.g., joint_wald_survey, joint_wald_event_study_survey) and exposes the denominator df as df_denom, so BR / DR prose can flag the finite-sample correction rather than silently presenting a chi-square-style result. Non-finite df_survey (NaN / inf / non-positive) falls back to the chi-square path.
- Note: Estimator-native validation surfaces are surfaced rather than duplicated.
  - SyntheticDiDResults routes parallel-trends to pre_treatment_fit (the RMSE of the synthetic-control fit on the pre-period), and routes sensitivity to in_time_placebo() + sensitivity_to_zeta_omega().
  - TROPResults surfaces factor-model diagnostics (effective_rank, loocv_score, selected lambda_*) under estimator_native_diagnostics.
  - SyntheticControlResults routes parallel-trends to the scm_fit analogue (pre_rmspe, verdict design_enforced_pt) and surfaces pre_rmspe, donor-weight concentration, the in-space placebo permutation p-value, the ADH-2015 leave-one-out (leave_one_out) and in-time placebo (in_time_placebo) blocks, the Firpo-Possebom (2018) test-inversion confidence_set, and the Chernozhukov-Wüthrich-Zhu (2021) conformal_inference block (joint / pointwise / average) under estimator_native_diagnostics. Each is populated only when the caller has already run the corresponding opt-in method (DR never triggers a refit loop implicitly; otherwise a status="not_run" stub), and it omits HonestDiD-style sensitivity (significance IS the placebo).
  - EfficientDiDResults PT runs through EfficientDiD.hausman_pretest (the estimator's native PT-All vs PT-Post check).
- Note: Pre-trends verdict is a three-bin heuristic, not a field convention. DR maps the joint p-value as follows:
  - joint_p >= 0.30 → no_detected_violation.
  - 0.05 <= joint_p < 0.30 → some_evidence_against.
  - joint_p < 0.05 → clear_violation.

  These thresholds are diff-diff heuristics. The 0.30 upper bound draws on equivalence-testing intuition (Rambachan & Roth 2023 discuss the limitations of pre-tests). The no_detected_violation label deliberately avoids "parallel trends hold" language: the test did not detect a violation, but pre-trends tests are commonly underpowered. See the power-aware phrasing rule below.
- Note: Power-aware phrasing for no_detected_violation. DR calls compute_pretrends_power(results, violation_type='linear', alpha=alpha, target_power=0.80) for the estimator families that ship a compute_pretrends_power adapter: MultiPeriodDiDResults, CallawaySantAnnaResults, and SunAbrahamResults (see _APPLICABILITY["pretrends_power"] in diff_diff/diagnostic_report.py). Other staggered families with event-study output (ImputationDiDResults, TwoStageDiDResults, StackedDiDResults, EfficientDiDResults, StaggeredTripleDiffResults, WooldridgeDiDResults, ChaisemartinDHaultfoeuilleResults) do not yet have a power adapter and therefore render the no_detected_violation tier as underpowered with the fallback reason recorded in schema["pre_trends"]["power_reason"] (plain-English explanation) while schema["pre_trends"]["power_status"] carries the machine-readable enum ("ran" / "skipped" / "error" / "not_applicable").

  BusinessReport then reads mdv_share_of_att = max_abs_pre_violation / abs(att) and selects a tier. The numerator is the level-scale max pre-period violation under the MDV, computed as mdv * max(|violation_weights|), NOT the raw mdv scalar. Post PR-B Step 4, raw mdv for violation_type='linear' is in Roth's γ units (a slope on relative time), so comparing it directly to a level-scale |att| would mix units on irregular pre-period grids and mis-tier the result. The level-scale quantity is exposed via the new PreTrendsPowerResults.max_abs_pre_violation property and the DiagnosticReport.pretrends_power block schema field of the same name. Tier thresholds:
  - < 0.25 → well_powered: "the test has 80% power to detect a violation of magnitude M, which is only X% of the estimated effect; if a material pre-trend existed, this test would likely have caught it."
  - >= 0.25 and < 1.0 → moderately_powered: "the test is informative but not definitive; see the sensitivity analysis below for bounded-violation guarantees."
  - >= 1.0 → underpowered: "the test has limited power; a non-rejection does not prove the assumption. See the HonestDiD sensitivity analysis below for a more reliable signal."
  - Power analysis not runnable → fall back to underpowered phrasing; the fallback reason is recorded in schema["pre_trends"]["power_reason"] (plain-English explanation; power_status carries the enum).

  Rationale: always-hedging phrasing under-sells well-designed studies; always-confident phrasing over-sells underpowered ones. The library already ships compute_pretrends_power(), so using it is the honest default rather than hedging every non-violation.
- Note: Pre-period covariance routing for staggered-estimator power. As of the PR-B PreTrendsPower implementation audit (Roth 2022), compute_pretrends_power() consumes the full event_study_vcov sub-block when it is available: non-bootstrap CS fits (staggered_results.py populates the matrix) and non-bootstrap SA fits (sun_abraham.py builds it via W @ vcov_cohort @ W.T). The PreTrendsPowerResults.covariance_source field records the actual extraction path ("full_pre_period_vcov" vs "diag_fallback"), and the DiagnosticReport.pretrends_power block surfaces that label unchanged. There are two paths through the report layer with different downgrade semantics:
  - New fits (post-PR-B, PreTrendsPowerResults.covariance_source is populated): DiagnosticReport reads the persisted label directly. Non-bootstrap CS / SA fits report "full_pre_period_vcov" and are NOT downgraded; bootstrap / replicate-weight paths report "diag_fallback" and also pass through unchanged (no "available but unused" concern: the estimator did its best with what was available).
  - Legacy serialized results (pre-PR-B, no covariance_source field on the object): the report layer falls back to type-based inference in _infer_cov_source(source_fit). For event-study result types (CS / SA / etc.) with populated event_study_vcov, the legacy-ambiguous case still emits the conservative "diag_fallback_available_full_vcov_unused" sentinel and the well_powered → moderately_powered downgrade still applies, because without the persisted provenance we cannot rule out that the stored power was computed from diag(ses^2) under PR-A semantics. For MultiPeriodDiDResults without interaction_indices, the legacy fallback reports "diag_fallback" (a genuine fallback, not the "available but unused" case, so no downgrade applies).

  Remaining "diag_fallback" cases on new fits (bootstrap / replicate-weight CS and SA, plus ImputationDiD / Stacked / EfficientDiD / TwoStageDiD) pass through unchanged because nothing better is available on those result types yet.
- Note: Unit-translation policy. BusinessReport does not arithmetically translate log-points to percents or level effects to log-points. The estimate is rendered in the scale the estimator produced; outcome_unit="log_points" emits an informational caveat. The policy avoids guessing the underlying model (no estimator in the library currently exports both log and level coefficients), which would be unsafe in the presence of non-linear link functions (Poisson QMLE, logit).
- Note: Single-knob alpha with preserved-native-CI fallback. BusinessReport exposes only alpha (defaults to results.alpha); there is no separate significance_threshold parameter. When the requested alpha matches the fit's native level, it drives both the CI level ((1 - alpha) * 100% interval) and the phrasing tier threshold ("statistically significant at the (1 - alpha) * 100% level"). When the requested alpha differs from the fit's native level (e.g., the user asks for alpha=0.10 on a result fit with alpha=0.05), BusinessReport does NOT recompute the CI at the requested level, because the stored CI is the only quantile the underlying estimator supplied (bootstrap distributions and finite-df analytical variances are not always retained on the result). Instead, the schema preserves the fit's native CI (with its original level), uses the requested alpha only for the significance-phrasing threshold, and emits an alpha_override_preserved caveat describing the mismatch. This is the conservative choice: it avoids silently recomputing CIs under assumptions the estimator may not support.
- Note: Schema stability policy for the AI-legible to_dict() surface. New top-level keys count as additive (no version bump); new values in any status enum count as breaking (agents doing exhaustive pattern match will break on unknown enums); renames and removals count as breaking. The BUSINESS_REPORT_SCHEMA_VERSION and DIAGNOSTIC_REPORT_SCHEMA_VERSION constants bump independently. The v3.2 CHANGELOG marks both schemas experimental so users do not anchor tooling on them prematurely; a formal deprecation policy will land within two subsequent PRs.
- Note: Schema version 2.0 (both BR and DR). The BR/DR gap #6 target-parameter PR adds the headline.status / headline_metric.status value "no_scalar_by_design" (used for the dCDH trends_linear=True, L_max>=2 configuration where overall_att is intentionally NaN). Per the stability policy above, new enum values are breaking changes, so BUSINESS_REPORT_SCHEMA_VERSION and DIAGNOSTIC_REPORT_SCHEMA_VERSION bumped from "1.0" to "2.0". The schemas remain marked experimental, so the formal deprecation policy does not yet apply.
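The pre-trends notes above (three-bin verdict, survey finite-df reference, power tiers, and the legacy covariance downgrade) can be read as a handful of small decision rules. The sketch below restates them in code; all function names are hypothetical, the handling of non-finite inputs is an assumption, and no p-value is computed here since that would need a statistics library.

```python
import math


def pretrends_verdict(joint_p):
    """Three-bin heuristic on the joint pre-trends p-value."""
    if joint_p is None or not math.isfinite(joint_p):
        return None  # assumption: no verdict without a usable p-value
    if joint_p < 0.05:
        return "clear_violation"
    if joint_p < 0.30:
        return "some_evidence_against"
    return "no_detected_violation"


def wald_reference(wald, k, df_survey=None):
    """Choose the reference distribution for a joint Wald statistic W."""
    finite_df = (
        df_survey is not None and math.isfinite(df_survey) and df_survey > 0
    )
    if finite_df:
        # Survey branch: F = W / k against F(k, df_survey).
        return {"statistic": wald / k, "reference": "F",
                "df_num": k, "df_denom": df_survey}
    # NaN / inf / non-positive df falls back to chi-square(k).
    return {"statistic": wald, "reference": "chi2",
            "df_num": k, "df_denom": None}


def power_tier(mdv, violation_weights, att,
               cov_source="full_pre_period_vcov"):
    """Return (tier, mdv_share_of_att) for a no_detected_violation result."""
    runnable = (
        len(violation_weights) > 0
        and math.isfinite(mdv)
        and math.isfinite(att)
        and att != 0
    )
    if not runnable:
        return "underpowered", None  # fallback phrasing tier

    # Level-scale violation, not the raw slope-scale mdv.
    max_abs_pre_violation = mdv * max(abs(w) for w in violation_weights)
    share = max_abs_pre_violation / abs(att)

    if share < 0.25:
        tier = "well_powered"
    elif share < 1.0:
        tier = "moderately_powered"
    else:
        tier = "underpowered"

    # Legacy-ambiguous provenance: be conservative by one tier.
    if (cov_source == "diag_fallback_available_full_vcov_unused"
            and tier == "well_powered"):
        tier = "moderately_powered"
    return tier, share
```

For example, with mdv=0.5 on a linear violation whose weights are the relative times [-3, -2, -1] and an ATT of 10, the level-scale violation is 1.5, the share is 0.15, and the tier is well_powered unless the covariance source is the legacy sentinel.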
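The single-knob alpha policy from the notes above can likewise be sketched as a small resolver. The function name, argument names, and returned keys are hypothetical; only the alpha_override_preserved caveat name comes from this document.

```python
import math


def resolve_alpha(native_alpha, native_ci, p_value, requested_alpha=None):
    """Apply the preserved-native-CI policy (illustrative only)."""
    # With no explicit request, alpha defaults to the fit's native level.
    alpha = native_alpha if requested_alpha is None else requested_alpha
    overridden = not math.isclose(alpha, native_alpha)
    return {
        # The stored CI is never recomputed at the requested level.
        "ci": native_ci,
        "ci_alpha": native_alpha,
        # The requested alpha only drives the significance phrasing.
        "significance_alpha": alpha,
        "is_significant": p_value < alpha,
        "caveats": ["alpha_override_preserved"] if overridden else [],
    }
```

Under this sketch, a fit with native alpha 0.05 and p-value 0.07 is reported as not significant by default; asking for alpha=0.10 flips the significance phrasing and adds the caveat, while the interval shown stays the native 95% one.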
The phrasing rules follow the guidance in:
- Baker, A. C., Callaway, B., Cunningham, S., Goodman-Bacon, A., & Sant'Anna, P. H. C. (2025). Difference-in-Differences Designs: A Practitioner's Guide. (The 8-step workflow enforced through diff_diff/practitioner.py.)
- Rambachan, A., & Roth, J. (2023). A More Credible Approach to Parallel Trends. Review of Economic Studies. (HonestDiD sensitivity; the pre-test power caveat directly shaped the three-tier power phrasing.)
- Roth, J. (2022). Pretest with Caution: Event-study Estimates after Testing for Parallel Trends. American Economic Review: Insights. (Motivates the power-aware phrasing tiers.)