## R Code

| | Mean | SD | Histogram |
|---|---|---|---|
| Sales per Capita | 95.15 | 41.13 | ▂▅▅▇▄▁ |
| Real Price | 3.40 | 1.64 | ▇▇▂▃▂▂▁ |
| Nominal Price | 2.68 | 2.24 | ▇▄▁▂▂▁▁ |

Ian McCarthy | Emory University

- Data from CDC Tax Burden on Tobacco
- Visit the Tobacco GitHub repository for other info
- Supplement with CPI data, also in GitHub repo.

We’re interested in cigarette prices and sales, so let’s focus our summaries on those two variables.

| | Mean | SD | Histogram |
|---|---|---|---|
| Sales per Capita | 95.15 | 41.13 | ▂▅▅▇▄▁ |
| Real Price | 3.40 | 1.64 | ▇▇▂▃▂▂▁ |
| Nominal Price | 2.68 | 2.24 | ▇▄▁▂▂▁▁ |

Instrumental Variables (IV) is a way to identify causal effects using variation in treatment participation that is due to an *exogenous* variable that is only related to the outcome through treatment.

Two reasons to consider IV:

- Selection on unobservables
- Reverse causation

Either problem is sometimes loosely referred to as *endogeneity*.

Consider the simple regression equation: \[y = \beta x + \varepsilon (x),\] where \(\varepsilon(x)\) reflects the dependence between our observed variable and the error term.

Simple OLS will yield \[\frac{dy}{dx} = \beta + \frac{d\varepsilon}{dx} \neq \beta\]
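A quick numeric sketch of this bias (Python, with made-up parameters): if the error depends on \(x\) with \(d\varepsilon/dx = 0.5\), OLS recovers \(\beta + 0.5\) rather than \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0  # true coefficient (assumed for the simulation)

x = rng.normal(size=n)
eps = 0.5 * x + rng.normal(size=n)  # error depends on x: d(eps)/dx = 0.5
y = beta * x + eps

# OLS slope = Cov(x, y) / Var(x)
beta_ols = np.cov(x, y)[0, 1] / np.var(x)
print(beta_ols)  # ≈ beta + 0.5 = 2.5, not 2.0
```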

The regression we want to run is \[y_{i} = \alpha + \delta D_{i} + \gamma A_{i} + \epsilon_{i},\] where \(D_{i}\) is treatment (think of schooling for now) and \(A_{i}\) is something like ability.

\(A_{i}\) is unobserved, so instead we run \[y_{i} = \alpha + \beta D_{i} + \epsilon_{i}\]

From this “short” regression, we don’t actually estimate \(\delta\). Instead, we get an estimate of \[\beta = \delta + \lambda_{ds}\gamma \neq \delta,\] where \(\lambda_{ds}\) is the coefficient of a regression of \(A_{i}\) on \(D_{i}\).
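A simulation (Python, purely illustrative parameter values) confirming this omitted-variable bias formula: the short-regression coefficient matches \(\delta + \lambda_{ds}\gamma\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
delta, gamma = 1.0, 2.0  # true effects (assumed)

A = rng.normal(size=n)                 # unobserved ability
D = 0.7 * A + rng.normal(size=n)       # ability raises schooling
y = delta * D + gamma * A + rng.normal(size=n)

# "Short" regression: y on D alone
beta_short = np.cov(D, y)[0, 1] / np.var(D)

# Auxiliary regression of A on D gives lambda_ds
lam_ds = np.cov(D, A)[0, 1] / np.var(D)

print(beta_short, delta + lam_ds * gamma)  # the two should (nearly) match
```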

IV will recover the “long” regression without observing underlying ability

*IF* our IV satisfies all of the necessary assumptions.

We want to estimate \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0]\]

With an instrument \(Z_{i}\) that satisfies the relevant assumptions, we can estimate this as \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}\]

In words, this is the effect of the instrument on the outcome (the “reduced form”) divided by the effect of the instrument on treatment (the “first stage”).
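A sketch of this Wald estimator with a simulated binary instrument (Python; the data-generating parameters are made up). Unobserved ability confounds \(D\) and \(Y\), but the ratio of the reduced form to the first stage still recovers the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
delta = 1.0  # true treatment effect (assumed)

Z = rng.integers(0, 2, size=n)   # binary instrument, independent of ability
A = rng.normal(size=n)           # unobserved ability: confounds D and Y
D = (1.0 * Z + 0.8 * A + rng.normal(size=n) > 0.5).astype(float)
Y = delta * D + 2.0 * A + rng.normal(size=n)

# Wald estimator: reduced form / first stage
reduced_form = Y[Z == 1].mean() - Y[Z == 0].mean()
first_stage = D[Z == 1].mean() - D[Z == 0].mean()
wald = reduced_form / first_stage
print(wald)  # close to delta = 1.0 in this constant-effects setup
```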

Recall “long” regression: \(Y=\alpha + \delta S + \gamma A + \epsilon\).

\[\begin{align} COV(Y,Z) & = E[YZ] - E[Y] E[Z] \\ & = E[(\alpha + \delta S + \gamma A + \epsilon)\times Z] - E[(\alpha + \delta S + \gamma A + \epsilon)]E[Z] \\ & = \alpha E[Z] + \delta E[SZ] + \gamma E[AZ] + E[\epsilon Z] \\ & \hspace{.2in} - \alpha E[Z] - \delta E[S]E[Z] - \gamma E[A] E[Z] - E[\epsilon]E[Z] \\ & = \delta (E[SZ] - E[S] E[Z]) + \gamma (E[AZ] - E[A] E[Z]) \\ & \hspace{.2in} + E[\epsilon Z] - E[\epsilon] E[Z] \\ & = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon, Z) \end{align}\]

Working from \(COV(Y,Z) = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon,Z)\), we find

\[\delta = \frac{COV(Y,Z)}{COV(S,Z)}\]

if \(COV(A,Z)=COV(\epsilon, Z)=0\)

This is easy to think of in terms of a randomized controlled trial…

| Measure | Offered Seat | Not Offered Seat | Difference |
|---|---|---|---|
| Score | -0.003 | -0.358 | 0.355 |
| % Enrolled | 0.787 | 0.046 | 0.741 |
| Effect | | | 0.48 |

Angrist *et al.*, 2012. “Who Benefits from KIPP?” *Journal of Policy Analysis and Management*.
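As a sanity check on the table (Python, numbers taken straight from it): the “Effect” row is just the Wald ratio, reduced form over first stage.

```python
# Differences from the KIPP table above
reduced_form = -0.003 - (-0.358)   # score difference, offered vs. not offered
first_stage = 0.787 - 0.046        # enrollment difference

print(round(reduced_form / first_stage, 2))  # 0.48
```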

Think of IV as two-steps:

- Isolate variation due to the instrument only (not due to endogenous stuff)
- Estimate effect on outcome using only this source of variation

Interested in estimating \(\delta\) from \(y_{i} = \alpha + \beta x_{i} + \delta D_{i} + \varepsilon_{i}\), but \(D_{i}\) is endogenous (no pure “selection on observables”).

**Step 1:** With instrument \(Z_{i}\), we can regress \(D_{i}\) on \(Z_{i}\) and \(x_{i}\): \[D_{i} = \lambda + \theta Z_{i} + \kappa x_{i} + \nu,\] and form prediction \(\hat{D}_{i}\).

**Step 2:** Regress \(y_{i}\) on \(x_{i}\) and \(\hat{D}_{i}\): \[y_{i} = \alpha + \beta x_{i} + \delta \hat{D}_{i} + \xi_{i}\]
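The two steps can be sketched by hand (Python with numpy least squares; all parameter values are made up for the simulation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
delta = 1.5  # true treatment effect (assumed)

x = rng.normal(size=n)   # exogenous covariate
Z = rng.normal(size=n)   # instrument
A = rng.normal(size=n)   # unobserved confounder
D = 0.6 * Z + 0.9 * A + 0.3 * x + rng.normal(size=n)
y = 2.0 + 0.5 * x + delta * D + 1.2 * A + rng.normal(size=n)

# Step 1: regress D on (1, Z, x) and form the prediction D_hat
X1 = np.column_stack([np.ones(n), Z, x])
coef1, *_ = np.linalg.lstsq(X1, D, rcond=None)
D_hat = X1 @ coef1

# Step 2: regress y on (1, x, D_hat); the D_hat coefficient estimates delta
X2 = np.column_stack([np.ones(n), x, D_hat])
coef2, *_ = np.linalg.lstsq(X2, y, rcond=None)
print(coef2[2])  # close to delta = 1.5
```

(Point estimates only; as noted below, the standard errors from this manual second stage would be wrong.)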

Recall \(\hat{\theta}=\frac{C(Z,S)}{V(Z)}\), or \(\hat{\theta}V(Z) = C(S,Z)\). Then:

\[\begin{align} \hat{\delta} & = \frac{COV(Y,Z)}{COV(S,Z)} \\ & = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}C(S,Z)} = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}^{2}V(Z)} \\ & = \frac{C(\hat{\theta}Z,Y)}{V(\hat{\theta}Z)} = \frac{C(\hat{S},Y)}{V(\hat{S})} \end{align}\]

But in practice, *DON’T* do this in two steps. Why?

Because the standard errors are wrong…they don’t account for the noise in the first-stage prediction, \(\hat{D}_{i}\). The appropriate fix is built into most modern stats programs.

We’ll talk about this next class!