Inference: Typically want to cluster at the unit level to allow for correlation over time within units, but clustered standard errors are unreliable with a small number of treated or control groups. Common alternatives:

Conley-Taber CIs

Wild cluster bootstrap

Randomization inference

“Extra” things like propensity score weighting and doubly robust estimation
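As a rough illustration of the randomization inference option listed above, the sketch below reassigns treatment across units in every possible way and asks how often a placebo assignment produces a DD estimate at least as large as the observed one. Everything in it is an assumption made for illustration: the data are simulated, and the names `delta_y`, `dd`, and `p_value` are hypothetical rather than part of any package.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n_units, n_treated, true_effect = 20, 3, 3.0  # hypothetical design: few treated units

# Simulated unit-level changes (post-period mean minus pre-period mean).
treated = np.zeros(n_units, dtype=bool)
treated[:n_treated] = True
delta_y = 1.0 + true_effect * treated + rng.normal(0.0, 0.5, n_units)


def dd(delta, mask):
    """2x2 DD: mean change among 'treated' units minus mean change among the rest."""
    return delta[mask].mean() - delta[~mask].mean()


observed = dd(delta_y, treated)

# Enumerate every way of labelling n_treated units as treated (exact, not sampled).
placebo = []
for idx in combinations(range(n_units), n_treated):
    mask = np.zeros(n_units, dtype=bool)
    mask[list(idx)] = True
    placebo.append(dd(delta_y, mask))
placebo = np.array(placebo)

# Two-sided p-value: share of assignments at least as extreme as the observed one.
p_value = np.mean(np.abs(placebo) >= np.abs(observed) - 1e-12)
print(f"observed DD = {observed:.3f}, randomization p-value = {p_value:.4f}")
```

Full enumeration is feasible here only because there are 1,140 possible assignments; with more units one would typically draw a random sample of assignments instead.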

DD and TWFE?

TWFE (two-way fixed effects) is just shorthand for a common regression specification

Fixed effects for each unit and each time period, \(\gamma_{i}\) and \(\gamma_{t}\)

TWFE and 2x2 DD identical with homogeneous effects and common treatment timing

Otherwise…TWFE is biased and inconsistent for ATT
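The equivalence under common treatment timing can be checked numerically. The sketch below assumes a balanced simulated panel in which half of the units are treated from the same period onward; the helper `twfe` is a hypothetical name and uses two-way demeaning, which matches the dummy-variable regression only for balanced panels.

```python
import numpy as np


def twfe(y, d):
    """TWFE coefficient on d for a balanced (units x periods) panel via two-way demeaning."""
    def demean(a):
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    y_t, d_t = demean(y), demean(d)
    return (d_t * y_t).sum() / (d_t ** 2).sum()


rng = np.random.default_rng(42)
n_units, n_periods, start = 40, 6, 3        # hypothetical design
is_treated = np.arange(n_units) < n_units // 2
is_post = np.arange(n_periods) >= start

# Treatment indicator: treated group, post period (common timing for all treated units).
d = np.outer(is_treated, is_post).astype(float)

# Outcome with unit effects, time effects, a homogeneous effect of 2, and noise.
y = (rng.normal(size=(n_units, 1))
     + rng.normal(size=(1, n_periods))
     + 2.0 * d
     + rng.normal(scale=0.5, size=(n_units, n_periods)))

# 2x2 DD from the four group-by-period cell means.
dd_2x2 = ((y[is_treated][:, is_post].mean() - y[is_treated][:, ~is_post].mean())
          - (y[~is_treated][:, is_post].mean() - y[~is_treated][:, ~is_post].mean()))

delta_twfe = twfe(y, d)
print(f"TWFE = {delta_twfe:.6f}, 2x2 DD = {dd_2x2:.6f}")
```

The two numbers agree up to floating-point error, whatever the noise draw, because with one treatment date the TWFE coefficient is algebraically the difference in cell means.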

Consider standard TWFE specification with a single treatment coefficient, \[y_{it} = \alpha + \delta D_{it} + \gamma_{i} + \gamma_{t} + \varepsilon_{it}.\] We can decompose \(\hat{\delta}\) into three things:

\[\hat{\delta}_{twfe} = \text{VWATT} + \text{VWPT} - \Delta \text{ATT}\]

A variance-weighted ATT

A variance-weighted violation of parallel trends

The change in treatment effects over time within already-treated units (heterogeneous effects over time)

Intuition

Problems come from heterogeneous effects and staggered treatment timing

OLS is a weighted average of all 2x2 DD groups

Weights are a function of the size of each subsample, the relative shares of treated and control units, and the timing of treatment

Units treated in the middle of the sample period receive larger weights

Best case: Variance-weighted ATT

Prior-treated units act as controls for late-treated units, so differential timing alone can introduce bias when effects evolve over time, even if every unit follows the same effect path

Heterogeneity and differential timing introduce “contamination” via negative weights assigned to some underlying 2x2 DDs

Does it really matter?

Definitely! But how much?

Large treatment effects for early-treated units could reverse the sign of the final estimate
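A small noise-free simulation gives a sense of how much it can matter. The design below is purely hypothetical: two equally sized groups, one treated from period 2 and one from period 6, no never-treated units, and an effect that grows by one unit for every period of exposure, so every true effect is positive. The helper `twfe` is again an assumed name and relies on the panel being balanced.

```python
import numpy as np


def twfe(y, d):
    """TWFE coefficient on d for a balanced (units x periods) panel via two-way demeaning."""
    def demean(a):
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    y_t, d_t = demean(y), demean(d)
    return (d_t * y_t).sum() / (d_t ** 2).sum()


n_per_group, n_periods = 5, 10
first_treated = np.repeat([2, 6], n_per_group)   # early group, then late group
t = np.arange(n_periods)

# Periods of exposure, counting the first treated period as 1 (0 before treatment).
exposure = np.clip(t[None, :] - first_treated[:, None] + 1, 0, None)
d = (exposure > 0).astype(float)
effect = 1.0 * exposure                          # hypothetical dynamic treatment effect

# Outcome: unit effects + common time trend + treatment effect (no noise).
rng = np.random.default_rng(1)
unit_fe = rng.normal(size=(len(first_treated), 1))
time_fe = 0.5 * t[None, :]
y = unit_fe + time_fe + effect

delta_twfe = twfe(y, d)
true_att = effect[d == 1].mean()                 # average effect over treated observations
print(f"TWFE estimate = {delta_twfe:.4f}, true ATT = {true_att:.4f}")
```

In this design the TWFE coefficient is about -0.17 while the true ATT is about 3.83. The early group's effect keeps growing while it serves as the control for the late group, and that comparison enters the TWFE average with enough weight to flip the sign.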