Instrumental Variables (IV) is a way to identify causal effects using variation in treatment participation that comes from an exogenous variable, one related to the outcome only through its effect on treatment.
Two reasons to consider IV:
Either problem is sometimes loosely referred to as endogeneity
Consider simple regression equation: \[y = \beta x + \varepsilon (x),\] where \(\varepsilon(x)\) reflects the dependence between our observed variable and the error term.
Simple OLS will yield \[\frac{dy}{dx} = \beta + \frac{d\varepsilon}{dx} \neq \beta\]
The regression we want to do, \[y_{i} = \alpha + \delta D_{i} + \gamma A_{i} + \epsilon_{i},\] where \(D_{i}\) is treatment (think of schooling for now) and \(A_{i}\) is something like ability.
\(A_{i}\) is unobserved, so instead we run \[y_{i} = \alpha + \beta D_{i} + \epsilon_{i}\]
From this “short” regression, we don’t actually estimate \(\delta\). Instead, we get an estimate of \[\beta = \delta + \lambda_{ds}\gamma \neq \delta,\] where \(\lambda_{ds}\) is the coefficient of a regression of \(A_{i}\) on \(D_{i}\).
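A quick simulated check of this omitted variable bias formula (a sketch; the DGP and every parameter value below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100_000

# Invented long-regression DGP: y = alpha + delta*D + gamma*A + eps
alpha, delta, gamma = 1.0, 2.0, 1.5
A = rng.normal(0, 1, n)                  # unobserved ability
D = 0.8 * A + rng.normal(0, 1, n)        # schooling depends on ability
y = alpha + delta * D + gamma * A + rng.normal(0, 1, n)

# Short-regression coefficient on D
beta = np.cov(y, D)[0, 1] / np.var(D, ddof=1)
# lambda_ds: coefficient from regressing A on D
lam = np.cov(A, D)[0, 1] / np.var(D, ddof=1)

# beta matches delta + lambda_ds * gamma up to sampling noise
print(beta, delta + lam * gamma)
```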
IV will recover the “long” regression without observing underlying ability, IF our instrument satisfies all of the necessary assumptions.
We want to estimate \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0]\]
With instrument \(Z_{i}\) that satisfies relevant assumptions, we can estimate this as \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}\]
In words, this is the effect of the instrument on the outcome (the “reduced form”) divided by the effect of the instrument on treatment (the “first stage”)
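A minimal simulation of this Wald ratio, assuming a binary instrument; the DGP and all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Invented DGP: binary instrument z shifts take-up of binary treatment d;
# the true effect of d on y is 2.0, confounded by unobserved "ability"
z = rng.integers(0, 2, n)
ability = rng.normal(0, 1, n)
d = (0.8 * z + ability + rng.normal(0, 1, n) > 0.5).astype(int)
y = 1.0 + 2.0 * d + ability + rng.normal(0, 1, n)

reduced_form = y[z == 1].mean() - y[z == 0].mean()  # effect of z on y
first_stage = d[z == 1].mean() - d[z == 0].mean()   # effect of z on d
wald = reduced_form / first_stage
print(round(wald, 2))  # close to the true effect of 2.0
```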
Recall “long” regression: \(Y=\alpha + \delta S + \gamma A + \epsilon\).
\[\begin{align} COV(Y,Z) & = E[YZ] - E[Y] E[Z] \\ & = E[(\alpha + \delta S + \gamma A + \epsilon)\times Z] - E[\alpha + \delta S + \gamma A + \epsilon]E[Z] \\ & = \alpha E[Z] + \delta E[SZ] + \gamma E[AZ] + E[\epsilon Z] \\ & \hspace{.2in} - \alpha E[Z] - \delta E[S]E[Z] - \gamma E[A] E[Z] - E[\epsilon]E[Z] \\ & = \delta (E[SZ] - E[S] E[Z]) + \gamma (E[AZ] - E[A] E[Z]) \\ & \hspace{.2in} + E[\epsilon Z] - E[\epsilon] E[Z] \\ & = \delta C(S,Z) + \gamma C(A,Z) + C(\epsilon, Z) \end{align}\]
Working from \(COV(Y,Z) = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon,Z)\), we find
\[\delta = \frac{COV(Y,Z)}{COV(S,Z)}\]
if \(COV(A,Z)=COV(\epsilon, Z)=0\)
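The covariance ratio can be checked numerically. The sketch below uses an invented DGP in which \(Z\) is independent of both \(A\) and \(\epsilon\), so the condition above holds by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Invented long-regression DGP: y = alpha + delta*S + gamma*A + eps,
# with instrument Z correlated with S but not with A or eps
delta, gamma = 0.8, 1.5
Z = rng.normal(0, 1, n)
A = rng.normal(0, 1, n)                      # unobserved, Cov(A, Z) = 0
S = 0.6 * Z + 0.7 * A + rng.normal(0, 1, n)
y = 0.2 + delta * S + gamma * A + rng.normal(0, 1, n)

iv_est = np.cov(y, Z)[0, 1] / np.cov(S, Z)[0, 1]   # COV(Y,Z)/COV(S,Z)
ols_est = np.cov(y, S)[0, 1] / np.var(S, ddof=1)   # naive OLS slope
print(iv_est, ols_est)  # IV is near delta = 0.8; OLS is biased upward
```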
Easy to think of in terms of a randomized controlled trial…
| Measure | Offered Seat | Not Offered Seat | Difference |
|---|---|---|---|
| Score | -0.003 | -0.358 | 0.355 |
| % Enrolled | 0.787 | 0.046 | 0.741 |
| Effect | | | 0.48 |
Angrist et al., 2012. “Who Benefits from KIPP?” Journal of Policy Analysis and Management.
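The 0.48 effect in the table is just the Wald ratio of the two differences:

```python
# Wald ratio from the KIPP table above
reduced_form = -0.003 - (-0.358)  # score: offered minus not offered
first_stage = 0.787 - 0.046       # enrollment: offered minus not offered
effect = reduced_form / first_stage
print(round(effect, 2))  # 0.48
```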
Think of IV as two-steps:
Interested in estimating \(\delta\) from \(y_{i} = \alpha + \beta x_{i} + \delta D_{i} + \varepsilon_{i}\), but \(D_{i}\) is endogenous (no pure “selection on observables”).
Step 1: With instrument \(Z_{i}\), we can regress \(D_{i}\) on \(Z_{i}\) and \(x_{i}\): \[D_{i} = \lambda + \theta Z_{i} + \kappa x_{i} + \nu,\] and form prediction \(\hat{D}_{i}\).
Step 2: Regress \(y_{i}\) on \(x_{i}\) and \(\hat{D}_{i}\): \[y_{i} = \alpha + \beta x_{i} + \delta \hat{D}_{i} + \xi_{i}\]
Recall our first-stage, \(S=\theta Z + \varepsilon\), where \(\hat{\theta}=\frac{C(Z,S)}{V(Z)}\), or \(\hat{\theta}V(Z) = C(Z,S)\). Then:
\[\begin{align} \hat{\delta} & = \frac{COV(Y,Z)}{COV(S,Z)} \\ & = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}C(S,Z)} = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}^{2}V(Z)} \\ & = \frac{C(\hat{\theta}Z,Y)}{V(\hat{\theta}Z)} = \frac{C(\hat{S},Y)}{V(\hat{S})} \end{align}\]
But in practice, DON’T do this in two steps. Why?
Because the standard errors are wrong: they don’t account for the noise in the first-stage prediction, \(\hat{D}_{i}\). The appropriate correction is built into most modern statistical packages.
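A sketch of why, in plain NumPy (invented DGP): the by-hand second stage computes its standard error from the wrong residuals, \(y - \hat{\delta}\hat{D}\), while the correct 2SLS standard error uses the structural residuals \(y - \hat{\delta}D\) with the actual treatment:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Invented DGP: d is endogenous (shares u with y), z is a valid instrument
z = rng.normal(0, 1, n)
u = rng.normal(0, 1, n)
d = 0.5 * z + u + rng.normal(0, 1, n)
y = 2.0 * d + u + rng.normal(0, 1, n)

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), d])

# Step 1: regress d on z, form fitted values d_hat
d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]
Xhat = np.column_stack([np.ones(n), d_hat])

# Step 2: regress y on d_hat; the point estimate is the 2SLS estimate
beta = np.linalg.lstsq(Xhat, y, rcond=None)[0]

# Naive SE from second-stage residuals (what a by-hand two-step reports)
resid_naive = y - Xhat @ beta
# Correct 2SLS SE from the structural residuals, using actual d
resid_2sls = y - X @ beta

bread = np.linalg.inv(Xhat.T @ Xhat)[1, 1]
se_naive = np.sqrt(resid_naive @ resid_naive / (n - 2) * bread)
se_2sls = np.sqrt(resid_2sls @ resid_2sls / (n - 2) * bread)
print(se_naive, se_2sls)  # the by-hand SE is off (too large in this DGP)
```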
Assumptions 1 and 2 are sometimes grouped into an “only through” condition, or discussed together as instrument validity
Conley et al. (2010) and “plausible exogeneity”: the union-of-confidence-intervals approach
van Kippersluis and Rietveld (2018), “Beyond Plausibly Exogenous”
The relevance condition just says that your instrument is correlated with the endogenous variable, but what about the strength of that correlation?
Recall our schooling and wages equation, \[y = \beta S + \epsilon.\] Bias in IV can be represented as:
\[Bias_{IV} \approx \frac{Cov(S, \epsilon)}{V(S)} \frac{1}{F+1} = Bias_{OLS} \frac{1}{F+1}\]
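A small Monte Carlo illustrating this (a sketch with an invented DGP): with several weak instruments, the 2SLS estimate is pulled toward the biased OLS answer; with a strong first stage it sits near the truth.

```python
import numpy as np

rng = np.random.default_rng(7)

def tsls_median(pi, n=200, k=20, reps=1000):
    """Median 2SLS estimate with k instruments of strength pi each.

    Invented DGP: true effect is 1.0; s is endogenous through u.
    """
    est = np.empty(reps)
    for r in range(reps):
        Z = rng.normal(0, 1, (n, k))
        u = rng.normal(0, 1, n)
        s = Z @ np.full(k, pi) + u + rng.normal(0, 1, n)
        y = 1.0 * s + u + rng.normal(0, 1, n)
        s_hat = Z @ np.linalg.lstsq(Z, s, rcond=None)[0]  # first-stage fit
        est[r] = (s_hat @ y) / (s_hat @ s)                # 2SLS coefficient
    return np.median(est)

strong = tsls_median(pi=0.5)
weak = tsls_median(pi=0.05)
print(strong)  # strong first stage: near the true 1.0
print(weak)    # weak first stage: pulled toward the OLS bias
```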
Single endogenous variable
Many endogenous variables
Recall that the true treatment effect is 5.25
import numpy as np
import pandas as pd
# --- data ---
n = 5000
b_true = 5.25
rng = np.random.default_rng(123) # for reproducibility
iv_dat = pd.DataFrame({
    "z": rng.normal(0, 2, n),
    "eps": rng.normal(0, 1, n),
})
iv_dat["d"] = (
    iv_dat["z"] + 1.5 * iv_dat["eps"] + rng.normal(0, 1, n) > 0.25
).astype(int)
iv_dat["y"] = 2.5 + b_true * iv_dat["d"] + iv_dat["eps"] + rng.normal(0, 0.5, n)
# --- OLS: lm(y ~ d) ---
import statsmodels.formula.api as smf
ols = smf.ols("y ~ d", data=iv_dat).fit()
print(ols.summary())
# --- IV/2SLS: feols(y ~ 1 | d ~ z) ---
# (y on intercept, endogenous d, instrument z)
try:
    from linearmodels.iv import IV2SLS
    iv = IV2SLS.from_formula("y ~ 1 + [d ~ z]", data=iv_dat).fit(cov_type="robust")
    print(iv.summary)
except ImportError:
    # Fallback: manual 2SLS using statsmodels
    # (point estimate is right, but second-stage SEs are not corrected)
    import statsmodels.api as sm
    first = sm.OLS(iv_dat["d"], sm.add_constant(iv_dat["z"])).fit()
    iv_dat["d_hat"] = first.fittedvalues
    second = sm.OLS(iv_dat["y"], sm.add_constant(iv_dat["d_hat"])).fit()
    print("\nFirst stage (d ~ z):")
    print(first.summary())
    print("\nSecond stage (y ~ d_hat):")
    print(second.summary())
Call:
lm(formula = y ~ d, data = iv.dat)
Residuals:
Min 1Q Median 3Q Max
-3.5988 -0.7071 0.0141 0.6875 3.9679
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.09456 0.01995 105.0 <2e-16 ***
dTRUE 6.13007 0.02925 209.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.032 on 4998 degrees of freedom
Multiple R-squared: 0.8978, Adjusted R-squared: 0.8978
F-statistic: 4.392e+04 on 1 and 4998 DF, p-value: < 2.2e-16
TSLS estimation - Dep. Var.: y
Endo. : d
Instr. : z
Second stage: Dep. Var.: y
Observations: 5,000
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50530 0.029548 84.7877 < 2.2e-16 ***
fit_dTRUE 5.24675 0.053608 97.8716 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.12156 Adj. R2: 0.231381
F-test (1st stage), dTRUE: stat = 2,715.24, p < 2.2e-16, on 1 and 4,998 DoF.
Wu-Hausman: stat = 549.86, p < 2.2e-16, on 1 and 4,997 DoF.
Call:
lm(formula = y ~ z, data = iv.dat)
Residuals:
Min 1Q Median 3Q Max
-8.0911 -2.1728 -0.0365 2.1631 9.1191
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.94776 0.04001 123.66 <2e-16 ***
z 0.77845 0.02006 38.81 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.829 on 4998 degrees of freedom
Multiple R-squared: 0.2315, Adjusted R-squared: 0.2314
F-statistic: 1506 on 1 and 4998 DF, p-value: < 2.2e-16
Call:
lm(formula = d ~ z, data = iv.dat)
Residuals:
Min 1Q Median 3Q Max
-1.03085 -0.32969 -0.01796 0.33273 1.11780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.465519 0.005679 81.97 <2e-16 ***
z 0.148368 0.002847 52.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4016 on 4998 degrees of freedom
Multiple R-squared: 0.352, Adjusted R-squared: 0.3519
F-statistic: 2715 on 1 and 4998 DF, p-value: < 2.2e-16
Call:
lm(formula = y ~ d.hat, data = iv.dat)
Residuals:
Min 1Q Median 3Q Max
-8.0911 -2.1728 -0.0365 2.1631 9.1191
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50530 0.07452 33.62 <2e-16 ***
d.hat 5.24675 0.13521 38.81 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.829 on 4998 degrees of freedom
Multiple R-squared: 0.2315, Adjusted R-squared: 0.2314
F-statistic: 1506 on 1 and 4998 DF, p-value: < 2.2e-16
Assumption: Denote the effect of our instrument on treatment for individual \(i\) by \(\pi_{1i}\). Monotonicity states that \(\pi_{1i} \geq 0 \text{ } \forall i\) or \(\pi_{1i} \leq 0 \text{ } \forall i\); the instrument (weakly) moves everyone’s treatment in the same direction.
\[\delta_{IV} = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}=E[Y_{i}(1) - Y_{i}(0) | \text{complier}]\]
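A simulation sketch of this LATE result (groups, shares, and effect sizes all invented): with heterogeneous effects, the Wald ratio recovers the complier effect, not the population average effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Invented potential-outcomes setup: always-takers, compliers, never-takers
group = rng.choice(["always", "complier", "never"], size=n, p=[0.2, 0.5, 0.3])
z = rng.integers(0, 2, n)
# Treatment: always-takers take d=1 regardless, never-takers d=0,
# compliers follow the instrument (monotonicity holds by construction)
d = np.where(group == "always", 1, np.where(group == "never", 0, z))

# Heterogeneous treatment effects by group (complier effect is 2.0)
effect = np.where(group == "always", 1.0, np.where(group == "never", 3.0, 2.0))
y0 = rng.normal(0, 1, n)
y = y0 + effect * d

wald = (y[z == 1].mean() - y[z == 0].mean()) / (
    d[z == 1].mean() - d[z == 0].mean()
)
print(round(wald, 2))  # near the complier effect of 2.0, not the average
```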