CDC Data on Smoking and Cigarette Prices

Ian McCarthy | Emory University

Smoking and Cigarette Pricing Data

The Data

Summary stats

We’re interested in cigarette prices and sales, so let’s focus our summaries on those two variables

R Code
sum.vars <- cig.data %>% select('Sales per Capita' = sales_per_capita, 'Real Price' = price_cpi, 'Nominal Price'=cost_per_pack)

datasummary(All(sum.vars) ~ Mean + SD + Histogram, data=sum.vars)
Mean SD Histogram
Sales per Capita 95.15 41.13 ▂▅▅▇▄▁
Real Price 3.40 1.64 ▇▇▂▃▂▂▁
Nominal Price 2.68 2.24 ▇▄▁▂▂▁▁

Cigarette Sales

R Code
cig.data %>% 
  ggplot(aes(x=Year,y=sales_per_capita)) + 
  stat_summary(fun.y="mean",geom="line") +
  labs(
    x="Year",
    y="Packs per Capita",
    title="Cigarette Sales"
  ) + theme_bw() +
  scale_x_continuous(breaks=seq(1970, 2020, 5))

Cigarette Prices

R Code
cig.data %>% 
  ggplot(aes(x=Year,y=price_cpi)) + 
  stat_summary(fun.y="mean",geom="line") +
  labs(
    x="Year",
    y="Price per Pack ($)",
    title="Cigarette Prices in 2010 Real Dollars"
  ) + theme_bw() +
  scale_x_continuous(breaks=seq(1970, 2020, 5))

Introduction to Instrumental Variables

What is instrumental variables

Instrumental Variables (IV) is a way to identify causal effects using variation in treatment particpation that is due to an exogenous variable that is only related to the outcome through treatment.

Why bother with IV?

Two reasons to consider IV:

  1. Selection on unobservables
  2. Reverse causation

Either problem is sometimes loosely referred to as endogeneity

Simple example

Consider simple regression equation: \[y = \beta x + \varepsilon (x),\] where \(\varepsilon(x)\) reflects the dependence between our observed variable and the error term.

Simple OLS will yield \[\frac{dy}{dx} = \beta + \frac{d\varepsilon}{dx} \neq \beta\]

What does IV do?

  • The regression we want to do, \[y_{i} = \alpha + \delta D_{i} + \gamma A_{i} + \epsilon_{i},\] where \(D_{i}\) is treatment (think of schooling for now) and \(A_{i}\) is something like ability.

  • \(A_{i}\) is unobserved, so instead we run \[y_{i} = \alpha + \beta D_{i} + \epsilon_{i}\]

  • From this “short” regression, we don’t actually estimate \(\delta\). Instead, we get an estimate of \[\beta = \delta + \lambda_{ds}\gamma \neq \delta,\] where \(\lambda_{ds}\) is the coefficient of a regression of \(A_{i}\) on \(D_{i}\).

Intuition

IV will recover the “long” regression without observing underlying ability

IF our IV satisfies all of the necessary assumptions.

More formally

  • We want to estimate \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0]\]

  • With instrument \(Z_{i}\) that satisfies relevant assumptions, we can estimate this as \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0] = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}\]

  • In words, this is effect of the instrument on the outcome (“reduced form”) divided by the effect of the instrument on treatment (“first stage”)

Derivation

Recall “long” regression: \(Y=\alpha + \delta S + \gamma A + \epsilon\).

\[\begin{align} COV(Y,Z) & = E[YZ] - E[Y] E[Z] \\ & = E[(\alpha + \delta S + \gamma A + \epsilon)\times Z] - E[\alpha + \delta S + \gamma A + \epsilon)]E[Z] \\ & = \alpha E[Z] + \delta E[SZ] + \gamma E[AZ] + E[\epsilon Z] \\ & \hspace{.2in} - \alpha E[Z] - \delta E[S]E[Z] - \gamma E[A] E[Z] - E[\epsilon]E[Z] \\ & = \delta (E[SZ] - E[S] E[Z]) + \gamma (E[AZ] - E[A] E[Z]) \\ & \hspace{.2in} + E[\epsilon Z] - E[\epsilon] E[Z] \\ & = \delta C(S,Z) + \gamma C(A,Z) + C(\epsilon, Z) \end{align}\]

Derivation

Working from \(COV(Y,Z) = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon,Z)\), we find

\[\delta = \frac{COV(Y,Z)}{COV(S,Z)}\]

if \(COV(A,Z)=COV(\epsilon, Z)=0\)

IVs in practice

Easy to think of in terms of randomized controlled trial…

Measure Offered Seat Not Offered Seat Difference
Score -0.003 -0.358 0.355
% Enrolled 0.787 0.046 0.741
Effect 0.48

Angrist et al., 2012. “Who Benefits from KIPP?” Journal of Policy Analysis and Management.

What is IV really doing

Think of IV as two-steps:

  1. Isolate variation due to the instrument only (not due to endogenous stuff)
  2. Estimate effect on outcome using only this source of variation

In regression terms

Interested in estimating \(\delta\) from \(y_{i} = \alpha + \beta x_{i} + \delta D_{i} + \varepsilon_{i}\), but \(D_{i}\) is endogenous (no pure “selection on observables”).

  • Step 1: With instrument \(Z_{i}\), we can regress \(D_{i}\) on \(Z_{i}\) and \(x_{i}\): \[D_{i} = \lambda + \theta Z_{i} + \kappa x_{i} + \nu,\] and form prediction \(\hat{D}_{i}\).

  • Step 2: Regress \(y_{i}\) on \(x_{i}\) and \(\hat{D}_{i}\): \[y_{i} = \alpha + \beta x_{i} + \delta \hat{D}_{i} + \xi_{i}\]

Derivation

Recall \(\hat{\theta}=\frac{C(Z,S)}{V(Z)}\), or \(\hat{\theta}V(Z) = C(Y,Z)\). Then:

\[\begin{align} \hat{\delta} & = \frac{COV(Y,Z)}{COV(S,Z)} \\ & = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}C(S,Z)} = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}^{2}V(Z)} \\ & = \frac{C(\hat{\theta}Z,Y)}{V(\hat{\theta}Z)} = \frac{C(\hat{S},Y)}{V(\hat{S})} \end{align}\]

In regression terms

But in practice, DON’T do this in two steps. Why?

Because standard errors are wrong…not accounting for noise in prediction, \(\hat{D}_{i}\). The appropriate fix is built into most modern stats programs.

How to do IV in practice

We’ll talk about this next class!