Inference: Typically want to cluster at the unit level to allow for correlation over time within units, but this is unreliable with a small number of treated or control groups. Alternatives:
Conley-Taber CIs
Wild cluster bootstrap
Randomization inference
“Extra” things like propensity score weighting and doubly robust estimation
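As a rough illustration of the randomization-inference option listed above, the sketch below collapses each unit to a single post-minus-pre change (so within-unit correlation is handled by construction) and then enumerates every way of assigning the same number of units to treatment. The function names and the toy numbers are hypothetical, chosen only for illustration.

```python
from itertools import combinations

import numpy as np


def dd_estimate(change, treated):
    """Difference in mean post-minus-pre change, treated minus control."""
    change = np.asarray(change, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    return change[treated].mean() - change[~treated].mean()


def randomization_pvalue(change, treated):
    """Exact two-sided p-value from enumerating all placebo assignments."""
    change = np.asarray(change, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    n_units, n_treated = len(change), int(treated.sum())
    observed = dd_estimate(change, treated)

    as_extreme = 0
    total = 0
    for idx in combinations(range(n_units), n_treated):
        placebo = np.zeros(n_units, dtype=bool)
        placebo[list(idx)] = True
        # Small tolerance so the actual assignment counts as "at least as extreme"
        if abs(dd_estimate(change, placebo)) >= abs(observed) - 1e-12:
            as_extreme += 1
        total += 1
    return observed, as_extreme / total


# Hypothetical unit-level changes: 2 treated units, 6 controls
change = [2.1, 1.9, 0.2, -0.1, 0.0, 0.3, 0.1, -0.2]
treated = [True, True, False, False, False, False, False, False]

observed, p_value = randomization_pvalue(change, treated)
print(f"DD estimate: {observed:.3f}, randomization p-value: {p_value:.4f}")
```

With only 28 possible assignments here, the smallest attainable p-value is 1/28, which is a reminder that randomization inference cannot produce arbitrarily small p-values when there are few units.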
DD and TWFE?
Just a shorthand for a common regression specification
Fixed effects for each unit and each time period, \(\gamma_{i}\) and \(\gamma_{t}\)
TWFE and 2x2 DD identical with homogeneous effects and common treatment timing
Otherwise…TWFE is biased and inconsistent for ATT
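The equivalence under common timing can be checked numerically. The sketch below simulates a balanced panel (all parameter values are arbitrary assumptions), computes the 2x2 DD from the four group means, and compares it with the TWFE coefficient from a dummy-variable regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods, first_treated_period = 40, 6, 3

# Balanced panel in long format
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

treated_unit = unit < n_units // 2      # half the units are ever treated
post = time >= first_treated_period     # same treatment date for all of them
D = (treated_unit & post).astype(float)

# Unit effects, a common time trend, a homogeneous effect of 2.0, and noise
unit_effect = rng.normal(size=n_units)[unit]
y = unit_effect + 0.5 * time + 2.0 * D + rng.normal(scale=0.5, size=unit.size)

# 2x2 DD from the four cell means
dd = (
    (y[treated_unit & post].mean() - y[treated_unit & ~post].mean())
    - (y[~treated_unit & post].mean() - y[~treated_unit & ~post].mean())
)

# TWFE: regress y on D, a full set of unit dummies, and time dummies
# (one time dummy dropped to avoid collinearity with the unit dummies)
X = np.column_stack([D, np.eye(n_units)[unit], np.eye(n_periods)[time][:, 1:]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
twfe = coef[0]

print(f"2x2 DD: {dd:.6f}  TWFE: {twfe:.6f}")
```

The two numbers agree up to floating-point error because, in a balanced panel with a single treatment date, the two-way demeaned treatment dummy reduces to the treated-by-post interaction.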
Consider standard TWFE specification with a single treatment coefficient, \[y_{it} = \alpha + \delta D_{it} + \gamma_{i} + \gamma_{t} + \varepsilon_{it}.\] We can decompose \(\hat{\delta}\) into three things:
\[\hat{\delta}_{twfe} = \text{VWATT} + \text{VWPT} - \Delta \text{ATT}\]
A variance-weighted ATT
Violation of parallel trends
Heterogeneous effects over time
Intuition
Problems come from heterogeneous effects and staggered treatment timing
OLS is a weighted average of all 2x2 DD groups
Weights are function of size of subsamples, size of treatment/control units, and timing of treatment
Units treated in middle of sample receive larger weights
Best case: Variance-weighted ATT
Prior-treated units act as controls for late-treated units, so differential timing alone can introduce bias
Heterogeneity and differential timing introduce “contamination” via negative weights assigned to some underlying 2x2 DDs
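One ingredient of these weights is the variance of the treatment dummy over the sample window, \(\bar{D}(1-\bar{D})\), where \(\bar{D}\) is the share of periods a group spends treated. A minimal sketch, assuming a 10-period panel (an arbitrary choice), shows why groups treated mid-sample are weighted most heavily:

```python
import numpy as np

n_periods = 10
start = np.arange(1, n_periods)                # first treated period, 1..9
share_treated = (n_periods - start) / n_periods
treatment_variance = share_treated * (1 - share_treated)

for s, v in zip(start, treatment_variance):
    print(f"treated from period {s}: variance term = {v:.2f}")

best_start = int(start[np.argmax(treatment_variance)])
print("largest variance term when treatment starts in period", best_start)
```

The full weights also depend on group sizes, so this isolates only the timing component.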
Does it really matter?
Definitely! But how much?
Large treatment effects for early-treated units could reverse the sign of the final estimate
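A small noise-free simulation illustrates the sign reversal. This is a sketch under assumed values: two equally sized groups with no never-treated units, an early group whose effect grows by 1 per period of exposure, and a late group with a constant effect of 0.5. Parallel trends holds exactly, and every treatment effect is positive.

```python
import numpy as np

n_per_group, n_periods = 10, 6
early_start, late_start = 1, 4          # first treated period for each group

n_units = 2 * n_per_group
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

is_early = unit < n_per_group
start = np.where(is_early, early_start, late_start)
D = (time >= start).astype(float)

# Early group: effect grows with exposure (1, 2, 3, ...); late group: flat 0.5
exposure = time - start + 1
effect = np.where(is_early, 1.0 * exposure, 0.5) * D

# Additive unit and time effects, no noise, so parallel trends holds exactly
y = 0.3 * unit + 0.2 * time + effect

true_att = effect[D == 1].mean()

# TWFE regression with unit dummies and time dummies (one time dummy dropped)
X = np.column_stack([D, np.eye(n_units)[unit], np.eye(n_periods)[time][:, 1:]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
twfe = coef[0]

print(f"true ATT: {true_att:.3f}  TWFE estimate: {twfe:.3f}")
```

In this configuration the TWFE coefficient is negative (about -0.67) even though the true ATT is about 2.29, because the early group's still-growing effect is differenced out when it serves as the control for the late group.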