Regression Discontinuity: Part I

Ian McCarthy | Emory University

Outline for Today

What is RD?
Implementing an RD strategy
Practice with Simulated Data
“Fuzzy” RD

What is Regression Discontinuity?

Intuition

Regression discontinuity (RD)
Basic idea: Observations are identical just above/below some exogenous threshold
Some motivation from Causal Inference: The Mixtape
Highly relevant in “rule-based” world…
- School eligibility based on age cutoffs
- Program participation based on discrete income thresholds
- Performance scores rounded to nearest integer

Required elements

Score
Cutoff
Treatment

Types of RD

Sharp regression discontinuity: those above the threshold guaranteed to participate
Fuzzy regression discontinuity: those above the threshold are eligible but may not participate

Sharp RD

\[W_{i} = 1(x_{i}>c) = \begin{cases} 1 & \text{if} & x_{i}>c \\ 0 & \text{if} & x_{i}<c \end{cases}\]

\(x\) is “forcing variable”
\(c\) is the threshold value or cutoff point

Sharp RD Scatterplot

R Code

n=1000
rd.dat <- tibble(
  X = runif(n,0,2),
  W = (X>1),
  Y = 0.5 + 2*X + 4*W + -2.5*W*X + rnorm(n,0,.5)
)
plot1 <- rd.dat %>% ggplot(aes(x=X,y=Y)) + 
  geom_point() + theme_bw() +
  geom_vline(aes(xintercept=1),linetype='dashed') +
  scale_x_continuous(
    breaks = c(.5, 1.5),
    label = c("Untreated", "Treated")
  ) +
  xlab("Running Variable") + ylab("Outcome")
plot1

Sharp RD Linear Predictions

R Code

plot2 <- plot1 +
  geom_smooth(method = 'lm', data = (rd.dat %>% filter(W == TRUE)) ) +
  geom_smooth(method = 'lm', data = (rd.dat %>% filter(W == FALSE)) )
plot2

Sharp RD Linear Predictions

R Code

plot3 <- plot2 +
  stat_smooth(method = 'lm', data = (rd.dat %>% filter(W == TRUE)), fullrange = TRUE, linetype = 'dashed') +
  stat_smooth(method = 'lm', data = (rd.dat %>% filter(W == FALSE)), fullrange = TRUE, linetype = 'dashed')
plot3

Why bandwidth matters

Mean difference around threshold of 0.2, 3.99 - 2.23 = 1.76
Mean overall difference, 3.77 - 1.48 = 2.29

More generally

Running variable may affect outcome directly
Focusing on area around cutoff does two things:

Controls for running variable
“Controls” for unobserved things correlated with running variable and outcome

Animations!

RD in five steps

Start with raw data on both sides of the cutoff
Bin observations along the running variable
Collapse to bin averages to remove noise
Zoom in on the bins just around the cutoff
The jump at the cutoff is the treatment effect

Implementing RD

Ultimate Goal

Goal is to estimate \(E[Y_{1}|X=c] - E[Y_{0}|X=c]\)

Trim to reasonable window around threshold (“bandwidth”), \(X \in [c-h, c+h]\)
Transform running variable, \(\tilde{X}=X-c\)
Estimate regressions within that window

Starting point: linear, same slope

\[y = \alpha + \delta D + \beta \tilde{X} + \varepsilon\]

\(D = 1(X > c)\) is the treatment indicator
\(\delta\) is the RD estimate (the jump at the cutoff)
Forces the slope on \(\tilde{X}\) to be the same on both sides

Allowing differential slopes

\[y = \alpha + \delta D + \beta \tilde{X} + \gamma D\tilde{X} + \varepsilon\]

Now the slope can differ above and below the cutoff
\(\gamma\) captures the difference in slopes
More flexible, but uses more degrees of freedom in small bandwidths

Beyond linear

Can add polynomials in \(\tilde{X}\) and interactions with \(D\)
In practice, local linear (the previous two slides) is the standard
Higher-order polynomials are generally discouraged (more on this later)

Key RD Steps

Show clear graphical evidence of a change around the discontinuity (bin scatter)
Balance above/below threshold (use baseline covariates as outcomes)
Manipulation tests
RD estimates
Sensitivity and robustness:
- Bandwidths
- Order of polynomial
- Inclusion of covariates

1. Graphical evidence

Before presenting RD estimates, any good RD approach first highlights the discontinuity with a simple graph. We can do so by plotting the average outcomes within bins of the forcing variable (i.e., binned averages), \[\bar{Y}_{k} = \frac{1}{N_{k}}\sum_{i=1}^{N} Y_{i} \times 1(b_{k} < X_{i} \leq b_{k+1}).\]

The binned averages helps to remove noise in the graph and can provide a cleaner look at the data. Just make sure that no bin includes observations above and below the cutoff!

R
Python
Output

library(rdrobust)
rd.result <- rdplot(rd.dat$Y, rd.dat$X, 
                    c=1, 
                    title="RD Plot with Binned Average", 
                    x.label="Running Variable", 
                    y.label="Outcome",
                    hide = TRUE)
bin.avg <- as_tibble(rd.result$vars_bins)
plot.bin <- bin.avg %>% ggplot(aes(x=rdplot_mean_x,y=rdplot_mean_y)) + 
  geom_point() + theme_bw() +
  geom_vline(aes(xintercept=1),linetype='dashed') +
  scale_x_continuous(
    breaks = c(.5, 1.5),
    label = c("Untreated", "Treated")
  ) +
  xlab("Running Variable") + ylab("Outcome")

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from rdrobust import rdplot

# rd_dat is a pandas DataFrame with columns "Y" and "X"

# Run rdplot but suppress its own figure and grab the binned data
rd_result = rdplot(
    y=rd_dat["Y"].values,
    x=rd_dat["X"].values,
    c=1,
    title="RD Plot with Binned Average",
    x_label="Running Variable",
    y_label="Outcome",
    hide=True  # don't show the built-in plot
)

# vars_bins is already a pandas DataFrame in the Python implementation
bin_avg = rd_result.vars_bins

# Recreate the custom plot
sns.set_style("whitegrid")
fig, ax = plt.subplots()

sns.scatterplot(
    data=bin_avg,
    x="rdplot_mean_x",
    y="rdplot_mean_y",
    ax=ax
)

ax.axvline(x=1, linestyle="--", color="black")

ax.set_xticks([0.5, 1.5])
ax.set_xticklabels(["Untreated", "Treated"])

ax.set_xlabel("Running Variable")
ax.set_ylabel("Outcome")
ax.set_title("RD Plot with Binned Average")

sns.despine()
plt.tight_layout()
plt.show()

Selecting “bin” width

How many bins should we use? Too few and we obscure the discontinuity; too many and the graph is noisy.

Bin width: formal tests

Dummy variable approach: Create dummies for each bin, regress outcome on the dummies (\(R^{2}_{r}\)). Double the number of bins (\(R^{2}_{u}\)). F-test: \[F = \frac{R^{2}_{u}-R^{2}_{r}}{1-R^{2}_{u}}\times \frac{n-K-1}{K}\]
Interaction approach: Include interactions between bin dummies and the running variable, then joint F-test on the interaction terms

If significant, we have too few bins and should narrow the width.

2. Balance

If RD is an appropriate design, passing the cutoff should only affect treatment and outcome of interest
How do we test for this?
- Covariate balance
- Placebo tests of other outcomes (e.g., t-1 outcomes against treatment at time t)

3. Manipulation tests

Individuals should not be able to precisely manipulate running variable to enter into treatment
Sometimes discussed as “bunching”
Test for differences in density to left and right of cutoffs (rddensity)
Permutation tests proposed in Ganong and Jager (2017)

What if there is bunching?

Gerard, Rokkanen, and Rothe (2020) suggest partial identification allowing for bunching
Can also be used as a robustness check (rdbounds)
Assumption: bunching only moves people in one direction

4. RD Estimation

Start with the “default” options

Local linear regression
Optimal bandwidth
Uniform kernel

Selecting bandwidth

The bandwidth is a “tuning parameter”

High \(h\) means high bias but lower variance (use more of the data, closer to OLS)
Low \(h\) means low bias but higher variance (use less data, more focused around discontinuity)

Represent bias-variance tradeoff with the mean-square error, \[MSE(h) = E[(\hat{\tau}_{h} - \tau_{RD})^2]=\left(E[\hat{\tau}_{h} - \tau_{RD}] \right)^2 + V(\hat{\tau}_{h}).\]

Selecting bandwidth

In the RD case, we have two different mean-square error terms:

“From above”, \(MSE_{+}(h) = E[(\hat{\mu}_{+}(c,h) - E[Y_{1i}|X_{i}=c])^2]\)
“From below”, \(MSE_{-}(h) = E[(\hat{\mu}_{-}(c,h) - E[Y_{0i}|X_{i}=c])^2]\)

Goal is to find \(h\) that minimizes these values, but we don’t know the true \(E[Y_{1}|X=c]\) and \(E[Y_{0}|X=c]\). So we have two approaches:

Use cross-validation to choose \(h\)
Explicitly solve for optimal bandwidth

Cross-validation

Essentially a series of “leave-one-out” estimates:

Pick an \(h\)
Run regression, leaving out observation \(i\). If \(i\) is to the left of the threshold, we estimate regression for observations within \(X_{i}-h\), and conversely \(X_{i}+h\) if \(i\) is to the right of the threshold.
Predicted \(\hat{Y}_{i}\) at \(X_{i}\) (out of sample prediction for the left out observation)
Do this for all \(i\), and form \(CV(h)=\frac{1}{N}\sum (Y_{i} - \hat{Y}_{i})^2\)

Select \(h\) with lowest \(CV(h)\) value.

RD with Simulated Data

ols <- lm(Y~X+W, data=rd.dat)

rd.dat3 <- rd.dat %>%
  mutate(x_dev = X-1) %>%
  filter( (X>0.8 & X <1.2) )
rd <- lm(Y~x_dev + W, data=rd.dat3)

import pandas as pd
import statsmodels.formula.api as smf

# OLS on full sample
ols = smf.ols("Y ~ X + W", data=rd_dat).fit()

# Local linear regression around cutoff X = 1
rd_dat3 = (
    rd_dat
    .assign(x_dev=lambda df: df["X"] - 1)
    .query("X > 0.8 and X < 1.2")
)

rd = smf.ols("Y ~ x_dev + W", data=rd_dat3).fit()

True effect: 1.5
Standard linear regression with same slopes: 1.56
RD (linear with same slopes): 1.81

Estimates with RD Packages

R
Python
Output

rd.y <- rd.dat$Y
rd.x <- rd.dat$X
rd.est <- rdrobust(y=rd.y, x=rd.x, c=1)
summary(rd.est)

from rdrobust import rdrobust

# rd_dat is a pandas DataFrame with columns "Y" and "X"
rd_y = rd_dat["Y"].values
rd_x = rd_dat["X"].values

rd_est = rdrobust(y=rd_y, x=rd_x, c=1)

# Print a summary (rdrobust returns a custom object with its own __str__)
print(rd_est)

Sharp RD estimates using local polynomial regression.

Number of Obs.                 1000
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  490          510
Eff. Number of Obs.             152          166
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.320        0.320
BW bias (b)                   0.497        0.497
rho (h/b)                     0.644        0.644
Unique Obs.                     490          510

=====================================================================
                   Point    Robust Inference
                Estimate         z     P>|z|      [ 95% C.I. ]       
---------------------------------------------------------------------
     RD Effect     1.739    12.381     0.000     [1.475 , 2.030]     
=====================================================================

Cattaneo et al. (2020) argue:

Report conventional point estimate
Report robust confidence interval

5. Robustness and sensitivity

Results should be stable across reasonable specification choices:

Bandwidths: try half and double the optimal bandwidth. If the estimate swings wildly, the result is fragile.
Kernels: uniform (equal weight) vs. triangular (more weight near cutoff). Should tell similar stories.
Polynomials: linear vs. quadratic. If you need a cubic to find the effect, worry.
Covariates: adding pre-treatment controls shouldn’t change the estimate much (they should be balanced at the cutoff). But they can improve precision.

Pitfalls of polynomials

Assign too much weight to points away from the cutoff
Results highly sensitive to degree of polynomial
Narrow confidence intervals (over-rejection of the null)

For more discussion, see this World Bank Blog post

Fuzzy RD

The Idea

“Fuzzy” just means that assignment isn’t guaranteed based on the running variable. For example, maybe students are much more likely to get a scholarship past some threshold SAT score, but it remains possible for students below the threshold to still get the scholarship.

Discontinuity reflects a jump in the probability of treatment
Other RD assumptions still required (namely, can’t manipulate running variable around the threshold)

Fuzzy RD is IV

In practice, fuzzy RD is employed as an instrumental variables estimator

Difference in outcomes among those above and below the discontinuity divided by the difference in treatment probabilities for those above and below the discontinuity, \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | x_{i}\geq c] - E[Y_{i} | x_{i}< c]}{E[D_{i} | x_{i}\geq c] - E[D_{i} | x_{i}<c]}\]
Indicator for \(x_{i}\geq c\) is an instrument for treatment status, \(D_{i}\).
Implemented with rdrobust and fuzzy=t option