CDC Data on Smoking and Cigarette Prices

Ian McCarthy | Emory University

Smoking and Cigarette Pricing Data

The Data

Data from CDC Tax Burden on Tobacco
See the Tobacco GitHub repository for more information.
Supplemented with CPI data, also in the GitHub repo.
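
A minimal sketch of how a CPI series can turn nominal prices into real prices. Everything here is an assumption for illustration: the toy data frames, the `index` column name, and all numeric values are made up and are not the CDC or CPI data. The 2010 base year matches the real-price figure in these slides.

```r
# Toy inputs: all values below are illustrative, not actual CDC or CPI data
cig.toy <- data.frame(
  Year = c(1990, 2000, 2010),
  cost_per_pack = c(1.50, 3.00, 5.50)
)
cpi.toy <- data.frame(
  Year = c(1990, 2000, 2010),
  index = c(130, 170, 220)   # hypothetical price index values
)

# Index value in the base year (2010 dollars)
base.index <- cpi.toy$index[cpi.toy$Year == 2010]

# Merge on Year and rescale nominal prices to base-year dollars
cig.toy <- merge(cig.toy, cpi.toy, by = "Year")
cig.toy$price_cpi <- cig.toy$cost_per_pack * (base.index / cig.toy$index)
cig.toy
```

By construction, the real and nominal prices coincide in the base year.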

Summary stats

We’re interested in cigarette prices and sales, so let’s focus our summaries on those two variables

R Code

library(tidyverse)     # dplyr verbs and ggplot2
library(modelsummary)  # datasummary()

sum.vars <- cig.data %>% select('Sales per Capita' = sales_per_capita, 'Real Price' = price_cpi, 'Nominal Price' = cost_per_pack)

datasummary(All(sum.vars) ~ Mean + SD + Histogram, data=sum.vars)

	Mean	SD	Histogram
Sales per Capita	95.15	41.13	▂▅▅▇▄▁
Real Price	3.40	1.64	▇▇▂▃▂▂▁
Nominal Price	2.68	2.24	▇▄▁▂▂▁▁

Cigarette Sales

R Code

cig.data %>% 
  ggplot(aes(x=Year,y=sales_per_capita)) + 
  stat_summary(fun="mean",geom="line") +
  labs(
    x="Year",
    y="Packs per Capita",
    title="Cigarette Sales"
  ) + theme_bw() +
  scale_x_continuous(breaks=seq(1970, 2020, 5))

Cigarette Prices

R Code

cig.data %>% 
  ggplot(aes(x=Year,y=price_cpi)) + 
  stat_summary(fun="mean",geom="line") +
  labs(
    x="Year",
    y="Price per Pack ($)",
    title="Cigarette Prices in 2010 Real Dollars"
  ) + theme_bw() +
  scale_x_continuous(breaks=seq(1970, 2020, 5))

Introduction to Instrumental Variables

What is instrumental variables?

Instrumental Variables (IV) is a way to identify causal effects using variation in treatment participation that is due to an exogenous variable that is only related to the outcome through treatment.

Why bother with IV?

Two reasons to consider IV:

Selection on unobservables
Reverse causation

Either problem is sometimes loosely referred to as endogeneity

Simple example

Consider a simple regression equation: \[y = \beta x + \varepsilon (x),\] where \(\varepsilon(x)\) reflects the dependence between our observed variable and the error term.

Simple OLS will yield \[\frac{dy}{dx} = \beta + \frac{d\varepsilon}{dx} \neq \beta\]

What does IV do?

The regression we want to do, \[y_{i} = \alpha + \delta D_{i} + \gamma A_{i} + \epsilon_{i},\] where \(D_{i}\) is treatment (think of schooling for now) and \(A_{i}\) is something like ability.
\(A_{i}\) is unobserved, so instead we run \[y_{i} = \alpha + \beta D_{i} + \epsilon_{i}\]
From this “short” regression, we don’t actually estimate \(\delta\). Instead, we get an estimate of \[\beta = \delta + \lambda_{ds}\gamma \neq \delta,\] where \(\lambda_{ds}\) is the coefficient of a regression of \(A_{i}\) on \(D_{i}\).
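
A minimal simulation sketch of the short-regression bias. The data-generating process and all parameter values are arbitrary assumptions for illustration; in real data \(A_{i}\) is unobserved, so the long and auxiliary regressions are infeasible.

```r
set.seed(1)
n <- 5000

# Simulated data (all parameter values are arbitrary choices for illustration)
A <- rnorm(n)                        # "ability", unobserved in practice
D <- 0.8 * A + rnorm(n)              # treatment depends on ability
y <- 1 + 2 * D + 1.5 * A + rnorm(n)  # long regression: delta = 2, gamma = 1.5

long  <- lm(y ~ D + A)   # infeasible regression (requires A)
short <- lm(y ~ D)       # feasible regression (omits A)
aux   <- lm(A ~ D)       # regression of A on D gives lambda

delta.hat  <- coef(long)[["D"]]
gamma.hat  <- coef(long)[["A"]]
lambda.hat <- coef(aux)[["D"]]
beta.hat   <- coef(short)[["D"]]

# The short coefficient equals delta + lambda * gamma in the sample
c(short = beta.hat, implied = delta.hat + lambda.hat * gamma.hat)
```

The two numbers agree up to floating-point error, and both sit above the long-regression estimate of \(\delta\) because ability and treatment are positively related in this simulation.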

Intuition

IV will recover the “long” regression without observing underlying ability

IF our IV satisfies all of the necessary assumptions.

More formally

We want to estimate the causal effect of treatment, \(\delta\). The raw comparison \[E[Y_{i} | D_{i}=1] - E[Y_{i} | D_{i}=0]\] recovers it only if \(D_{i}\) is as good as randomly assigned.
With instrument \(Z_{i}\) that satisfies relevant assumptions, we can instead estimate \[\delta = \frac{E[Y_{i} | Z_{i}=1] - E[Y_{i} | Z_{i}=0]}{E[D_{i} | Z_{i}=1] - E[D_{i} | Z_{i}=0]}\]
In words, this is the effect of the instrument on the outcome (“reduced form”) divided by the effect of the instrument on treatment (“first stage”)
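
A minimal simulation sketch of the reduced form over first stage ratio with a binary instrument. The data-generating process, the constant treatment effect of 2, and all other parameter values are assumptions chosen for illustration.

```r
set.seed(2)
n <- 100000

# Simulated data (all parameter values are arbitrary choices for illustration)
Z <- rbinom(n, 1, 0.5)                          # binary instrument, randomly assigned
A <- rnorm(n)                                   # unobserved confounder
D <- as.numeric(1.5 * Z + A + rnorm(n) > 0.75)  # binary treatment, shifted by Z and A
Y <- 1 + 2 * D + 1.5 * A + rnorm(n)             # constant treatment effect of 2

# Raw comparison by treatment status (contaminated by A)
naive <- mean(Y[D == 1]) - mean(Y[D == 0])

# Reduced form and first stage: comparisons by instrument status
reduced.form <- mean(Y[Z == 1]) - mean(Y[Z == 0])
first.stage  <- mean(D[Z == 1]) - mean(D[Z == 0])

wald <- reduced.form / first.stage
c(naive = naive, wald = wald)
```

In this setup the ratio lands near the true effect, while the raw comparison by treatment status is pushed upward by the unobserved confounder.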

Derivation

Recall “long” regression, now writing the treatment as schooling \(S\): \(Y=\alpha + \delta S + \gamma A + \epsilon\).

\[\begin{align} COV(Y,Z) & = E[YZ] - E[Y] E[Z] \\ & = E[(\alpha + \delta S + \gamma A + \epsilon)\times Z] - E[\alpha + \delta S + \gamma A + \epsilon]E[Z] \\ & = \alpha E[Z] + \delta E[SZ] + \gamma E[AZ] + E[\epsilon Z] \\ & \hspace{.2in} - \alpha E[Z] - \delta E[S]E[Z] - \gamma E[A] E[Z] - E[\epsilon]E[Z] \\ & = \delta (E[SZ] - E[S] E[Z]) + \gamma (E[AZ] - E[A] E[Z]) \\ & \hspace{.2in} + E[\epsilon Z] - E[\epsilon] E[Z] \\ & = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon, Z) \end{align}\]

Derivation

Working from \(COV(Y,Z) = \delta COV(S,Z) + \gamma COV(A,Z) + COV(\epsilon,Z)\), we find

\[\delta = \frac{COV(Y,Z)}{COV(S,Z)}\]

if \(COV(A,Z)=COV(\epsilon, Z)=0\) and \(COV(S,Z) \neq 0\)

IVs in practice

Easy to think of in terms of a randomized controlled trial…

Measure	Offered Seat	Not Offered Seat	Difference
Score	-0.003	-0.358	0.355
% Enrolled	0.787	0.046	0.741
Effect			0.48

Angrist et al., 2012. “Who Benefits from KIPP?” Journal of Policy Analysis and Management.
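
The effect in the table is the reduced form (difference in scores) divided by the first stage (difference in enrollment), using only the numbers reported above:

\[\frac{-0.003 - (-0.358)}{0.787 - 0.046} = \frac{0.355}{0.741} \approx 0.48\]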

What is IV really doing

Think of IV as two steps:

Isolate variation due to the instrument only (not due to endogenous stuff)
Estimate effect on outcome using only this source of variation

In regression terms

Interested in estimating \(\delta\) from \(y_{i} = \alpha + \beta x_{i} + \delta D_{i} + \varepsilon_{i}\), but \(D_{i}\) is endogenous (no pure “selection on observables”).

Step 1: With instrument \(Z_{i}\), we can regress \(D_{i}\) on \(Z_{i}\) and \(x_{i}\): \[D_{i} = \lambda + \theta Z_{i} + \kappa x_{i} + \nu_{i},\] and form prediction \(\hat{D}_{i}\).
Step 2: Regress \(y_{i}\) on \(x_{i}\) and \(\hat{D}_{i}\): \[y_{i} = \alpha + \beta x_{i} + \delta \hat{D}_{i} + \xi_{i}\]
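
A minimal sketch of the two steps on simulated data, for intuition about the point estimate only (the second-step standard errors are not valid). The data-generating process and all parameter values are assumptions for illustration.

```r
set.seed(3)
n <- 5000

# Simulated data (all parameter values are arbitrary choices for illustration)
x <- rnorm(n)                                  # observed control
Z <- rnorm(n)                                  # instrument
A <- rnorm(n)                                  # unobserved confounder
D <- 0.5 + 1.0 * Z + 0.5 * x + A + rnorm(n)    # endogenous treatment
y <- 1 + 0.7 * x + 2 * D + 1.5 * A + rnorm(n)  # true delta = 2

# Step 1: regress D on Z and x, then form the prediction
first.stage <- lm(D ~ Z + x)
D.hat <- fitted(first.stage)

# Step 2: regress y on x and the predicted treatment
second.stage <- lm(y ~ x + D.hat)

ols.est      <- coef(lm(y ~ x + D))[["D"]]
two.step.est <- coef(second.stage)[["D.hat"]]
c(ols = ols.est, two.step = two.step.est)
```

In this simulation OLS overstates \(\delta\) because \(A\) raises both treatment and the outcome, while the two-step estimate is close to the true value of 2.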

Derivation

Recall our first stage, \(S=\theta Z + \varepsilon\), where \(\hat{\theta}=\frac{C(Z,S)}{V(Z)}\), or \(\hat{\theta}V(Z) = C(Z,S)\). Then:

\[\begin{align} \hat{\delta} & = \frac{COV(Y,Z)}{COV(S,Z)} \\ & = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}C(S,Z)} = \frac{\hat{\theta}C(Y,Z)}{\hat{\theta}^{2}V(Z)} \\ & = \frac{C(\hat{\theta}Z,Y)}{V(\hat{\theta}Z)} = \frac{C(\hat{S},Y)}{V(\hat{S})} \end{align}\]
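
A quick numerical check of the first and last lines of this derivation, using sample covariances. The simulated data and parameter values are assumptions for illustration.

```r
set.seed(4)
n <- 1000

# Simulated data (all parameter values are arbitrary choices for illustration)
Z <- rnorm(n)
A <- rnorm(n)
S <- 1.0 * Z + A + rnorm(n)
Y <- 2 * S + 1.5 * A + rnorm(n)

theta.hat <- cov(Z, S) / var(Z)   # first-stage slope
S.hat <- theta.hat * Z            # predicted schooling (up to a constant)

ratio     <- cov(Y, Z) / cov(S, Z)        # first line of the derivation
fitted.sl <- cov(S.hat, Y) / var(S.hat)   # last line of the derivation

c(ratio = ratio, fitted = fitted.sl)
```

The two expressions agree up to floating-point error, and the same number is the slope from regressing `Y` on `S.hat`.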

In regression terms

But in practice, DON’T do this in two steps. Why?

Because the second-step standard errors are wrong: they do not account for the noise in the prediction, \(\hat{D}_{i}\). The appropriate fix is built into most modern stats programs.
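
A minimal base R sketch of why the second-step standard error is off, assuming homoskedastic errors, a single instrument, and no controls. The simulated data and parameter values are assumptions for illustration. The corrected version keeps the same coefficients but builds the residuals from the actual \(D_{i}\) instead of \(\hat{D}_{i}\).

```r
set.seed(5)
n <- 2000

# Simulated data (all parameter values are arbitrary choices for illustration)
Z <- rnorm(n)
A <- rnorm(n)
D <- 1.0 * Z + A + rnorm(n)
y <- 1 + 2 * D + 1.5 * A + rnorm(n)

D.hat  <- fitted(lm(D ~ Z))
second <- lm(y ~ D.hat)
b      <- coef(second)

# Naive standard error: residuals are built from D.hat
se.naive <- summary(second)$coefficients["D.hat", "Std. Error"]

# Corrected standard error: same coefficients, residuals use the actual D
resid.iv <- y - b[["(Intercept)"]] - b[["D.hat"]] * D
sigma2   <- sum(resid.iv^2) / (n - 2)
X.hat    <- cbind(1, D.hat)
se.iv    <- sqrt(diag(sigma2 * solve(crossprod(X.hat))))[2]

c(naive = se.naive, corrected = unname(se.iv))
```

The point estimate is identical in both cases; only the residual variance, and therefore the standard error, changes.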

How to do IV in practice

We’ll talk about this next class!