class: center, middle, inverse, title-slide .title[ # Module 0: Getting Started ] .subtitle[ ## Part 2: Introduction to Causal Inference ] .author[ ### Ian McCarthy | Emory University ] .date[ ### Econ 470 & HLTH 470 ] --- <!-- Adjust some CSS code for font size and maintain R code font size --> <style type="text/css"> .remark-slide-content { font-size: 30px; padding: 1em 2em 1em 2em; } .remark-code { font-size: 15px; } .remark-inline-code { font-size: 20px; } </style> <!-- Set R options for how code chunks are displayed and load packages --> # Why causal inference? <img src="00-2-causal-inference_files/figure-html/dartmouth-output-1.png" style="display: block; margin: auto;" /> --- # Why causal inference? Another example: **What price should we charge for a night in a hotel?** -- .pull-left[ **Machine Learning** - Focuses on prediction - High prices are strongly correlated with higher sales - Increase prices to attract more people? ] .pull-right[ **Causal Inference** - Focuses on **counterfactuals** - What would sales look like if prices were higher? ] --- # Goal of Causal Inference - **Goal:** Estimate effect of some policy or program - Key building block for causal inference is the idea of **potential outcomes** --- # Some notation **Treatment** `\(D_{i}\)` `$$D_{i}=\begin{cases} 1 \text{ with treatment} \\ 0 \text{ without treatment} \end{cases}$$` --- # Some notation **Potential outcomes** - `\(Y_{1i}\)` is the potential outcome for unit `\(i\)` with treatment - `\(Y_{0i}\)` is the potential outcome for unit `\(i\)` without treatment --- # Some notation **Observed outcome** `$$Y_{i}=Y_{1i} \times D_{i} + Y_{0i} \times (1-D_{i})$$` or `$$Y_{i}=\begin{cases} Y_{1i} \text{ if } D_{i}=1 \\ Y_{0i} \text{ if } D_{i}=0 \end{cases}$$` .footnote[ Assumes **SUTVA** (stable unit treatment value assumption)...no interference across units ] --- # Example of "Potential Outcomes" .pull-left[ ![:scale 420px](pics/EmoryPicture.jpg) `\(Y_{1}\)`= <span>$</span>75,000 ] .pull-right[ ![:scale 370px](pics/UNTPicture.jpg) `\(Y_{0}\)`= <span>$</span>60,000 ] --- # Example of "Potential Outcomes" .pull-left[ ![:scale 420px](pics/EmoryPicture.jpg) `\(Y_{1}\)`= <span>$</span>75,000 ] .pull-right[ ![:scale 370px](pics/UNTPicture.jpg) `\(Y_{0}\)`= <span>$</span>60,000 ] Earnings due to Emory = `\(Y_{1}-Y_{0}\)` = <span>$</span>15,000 --- # Example of "Potential Outcomes" .pull-left[ ![:scale 420px](pics/EmoryPicture.jpg) `\(Y_{1}\)`= <span>$</span>75,000 ] .pull-right[ ![:scale 370px](pics/UNTPicture.jpg) `\(Y_{0}\)`= ? ] --- # Example of "Potential Outcomes" .pull-left[ ![:scale 420px](pics/EmoryPicture.jpg) `\(Y_{1}\)`= <span>$</span>75,000 ] .pull-right[ ![:scale 370px](pics/UNTPicture.jpg) `\(Y_{0}\)`= ? ] Earnings due to Emory = `\(Y_{1}-Y_{0}\)` = ? --- # Do we ever observe the potential outcomes? .center[ ![:scale 700px](https://media.giphy.com/media/zZeCRfPyXi9UI/giphy.gif) ] -- Without a time machine...not possible to get *individual* effects. --- # Fundamental Problem of Causal Inference - We don't observe the counterfactual outcome...what would have happened if a treated unit was actually untreated. - *ALL* attempts at causal inference represent some attempt at estimating the counterfactual outcome. We need an estimate for `\(Y_{0}\)` among those that were treated, and vice versa for `\(Y_{1}\)`. --- class: inverse, center, middle name: ate # Average Treatment Effects <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # Different treatment effects Tend to focus on **averages**<sup>1</sup>: - **ATE**: `\(\delta_{ATE} = E[ Y_{1} - Y_{0}]\)` - **ATT**: `\(\delta_{ATT} = E[ Y_{1} - Y_{0} | D=1]\)` - **ATU**: `\(\delta_{ATU} = E[ Y_{1} - Y_{0} | D=0]\)` .footnote[<sup>1</sup> or similar measures such as medians or quantiles] --- # Average Treatment Effects - **Estimand**: `$$\delta_{ATE} = E[Y_{1} - Y_{0}] = E[Y | D=1] - E[Y | D=0]$$` - **Estimate**: `$$\hat{\delta}_{ATE} = \frac{1}{N_{1}} \sum_{D_{i}=1} Y_{i} - \frac{1}{N_{0}} \sum_{D_{i}=0} Y_{i},$$` where `\(N_{1}\)` is number of treated and `\(N_{0}\)` is number untreated (control) - With random assignment and equal groups, inference/hypothesis testing with standard two-sample t-test <!-- New Section --> --- class: inverse, center, middle name: selection # Selection Bias <html><div style='float:left'></div><hr color='#EB811B' size=1px width=1055px></html> --- # Selection bias - Assume (for simplicity) constant effects, `\(Y_{1i}=Y_{0i} + \delta\)` - Since we don't observe `\(Y_{0}\)` and `\(Y_{1}\)`, we have to use the observed outcomes, `\(Y_{i}\)` `$$\begin{align} E[Y_{i} | D_{i}=1] &- E[Y_{i} | D_{i}=0] \\ =& E[Y_{1i} | D_{i}=1] - E[Y_{0i} | D_{i}=0] \\ =& \delta + E[Y_{0i} | D_{i}=1] - E[Y_{0i} | D_{i}=0] \\ =& \text{ATE } + \text{ Selection Bias} \end{align}$$` --- # Selection bias - Selection bias means `\(E[Y_{0i} | D_{i}=1] - E[Y_{0i} | D_{i}=0] \neq 0\)` - In words, the potential outcome without treatment, `\(Y_{0i}\)`, is different between those that ultimately did and did not receive treatment. - e.g., treated group was going to be better on average even without treatment (higher wages, healthier, etc.) --- # Selection bias - How to "remove" selection bias? - How about random assignment? - In this case, treatment assignment doesn't tell us anything about `\(Y_{0i}\)` `$$E[Y_{0i}|D_{i}=1] = E[Y_{0i}|D_{i}=0],$$` such that `$$E[Y_{i}|D_{i}=1] - E[Y_{i} | D_{i}=0] = \delta_{ATE} = \delta_{ATT} = \delta_{ATU}$$` --- # Selection bias - Without random assignment, there's a high probability that `$$E[Y_{0i}|D_{i}=1] \neq E[Y_{0i}|D_{i}=0]$$` - i.e., outcomes without treatment are different for the treated group --- # Omitted variables bias - In a regression setting, selection bias is the same problem as omitted variables bias (OVB) - Quick review: Goal of OLS is to find `\(\hat{\beta}\)` to "best fit" the linear equation `\(y_{i} = \alpha + x_{i} \beta + \epsilon_{i}\)` --- # Regression review `$$\begin{align} \min_{\beta} & \sum_{i=1}^{N} \left(y_{i} - \alpha - x_{i} \beta\right)^{2} = \min_{\beta} \sum_{i=1}^{N} \left(y_{i} - (\bar{y} - \bar{x}\beta) - x_{i} \beta\right)^{2}\\ 0 &= \sum_{i=1}^{N} \left(y_{i} - \bar{y} - (x_{i} - \bar{x})\hat{\beta} \right)(x_{i} - \bar{x}) \\ 0 &= \sum_{i=1}^{N} (y_{i} - \bar{y})(x_{i} - \bar{x}) - \hat{\beta} \sum_{i=1}^{N}(x_{i} - \bar{x})^{2} \\ \hat{\beta} &= \frac{\sum_{i=1}^{N} (y_{i} - \bar{y})(x_{i} - \bar{x})}{\sum_{i=1}^{N} (x_{i} - \bar{x})^{2}} = \frac{Cov(y,x)}{Var(x)} \end{align}$$` --- # Omitted variables bias - Interested in estimate of the effect of schooling on wages `$$Y_{i} = \alpha + \beta s_{i} + \gamma A_{i} + \epsilon_{i}$$` - But we don't observe ability, `\(A_{i}\)`, so we estimate `$$Y_{i} = \alpha + \beta s_{i} + u_{i}$$` - What is our estimate of `\(\beta\)` from this regression? --- # Omitted variables bias `$$\begin{align} \hat{\beta} &= \frac{Cov(Y_{i}, s_{i})}{Var(s_{i})} \\ &= \frac{Cov(\alpha + \beta s_{i} + \gamma A_{i} + \epsilon_{i}, s_{i})}{Var(s_{i})} \\ &= \frac{\beta Cov(s_{i}, s_{i}) + \gamma Cov(A_{i},s_{i}) + Cov(\epsilon_{i}, s_{i})}{Var(s_{i})}\\ &= \beta \frac{Var(s_{i})}{Var(s_{i})} + \gamma \frac{Cov(A_{i},s_{i})}{Var(s_{i})} + 0\\ &= \beta + \gamma \times \theta_{as} \end{align}$$` --- # Removing selection bias without RCT - The field of causal inference is all about different strategies to remove selection bias - The first strategy (really, assumption) in this class: **selection on observables** or **conditional indpendence** --- # Intuition - Example: Does having health insurance, `\(D_{i}=1\)`, improve your health relative to someone without health insurance, `\(D_{i}=0\)`? - `\(Y_{1i}\)` denotes health with insurance, and `\(Y_{0i}\)` health without insurance (these are **potential** outcomes) - In raw data, `\([Y_{i} | D_{i}=1] > E[Y_{i} | D_{i}=0]\)`, but is that causal? --- # Intuition Some assumptions: - `\(Y_{0i}=\alpha + \eta_{i}\)` - `\(Y_{1i} - Y_{0i} = \delta\)` - There is some set of "controls", `\(x_{i}\)`, such that `\(\eta_{i} = \beta x_{i} + u_{i}\)` and `\(E[u_{i} | x_{i}]=0\)` (conditional independence assumption, or CIA) -- `$$\begin{align} Y_{i} &= Y_{1i} \times D_{i} + Y_{0i} \times (1-D_{i}) \\ &= \delta D_{i} + Y_{0i} D_{i} + Y_{0i} - Y_{0i} D_{i} \\ &= \delta D_{i} + \alpha + \eta_{i} \\ &= \delta D_{i} + \alpha + \beta x_{i} + u_{i} \end{align}$$` --- # ATEs versus regression coefficients - Estimating the regression equation, `$$Y_{i} = \alpha + \delta D_{i} + \beta x_{i} + u_{i}$$` provides a causal estimate of the effect of `\(D_{i}\)` on `\(Y_{i}\)` - But what does that really mean? --- # ATEs vs regression coefficients - *Ceteris paribus* ("with other conditions remaining the same"), a change in `\(D_{i}\)` will lead to a change in `\(Y_{i}\)` in the amount of `\(\hat{\delta}\)` - But is *ceteris paribus* informative about policy? --- # ATEs vs regression coefficients - `\(Y_{1i} = Y_{0i} + \delta_{i} D_{i}\)` (allows for heterogeneous effects) - `\(Y_{i} = \alpha + \beta D_{i} + \gamma X_{i} + \epsilon_{i}\)`, with `\(Y_{0i}, Y_{1i} \perp\!\!\!\perp D_{i} | X_{i}\)` - Aronow and Samii, 2016, show that: `$$\hat{\beta} \rightarrow_{p} \frac{E[w_{i} \delta_{i}]}{E[w_{i}]},$$` where `\(w_{i} = (D_{i} - E[D_{i} | X_{i}])^{2}\)` --- # ATEs vs regression coefficients - Simplify to ATT and ATU - `\(Y_{1i} = Y_{0i} + \delta_{ATT} D_{i} + \delta_{ATU} (1-D_{i})\)` - `\(Y_{i} = \alpha + \beta D_{i} + \gamma X_{i} + \epsilon_{i}\)`, with `\(Y_{0i}, Y_{1i} \perp\!\!\!\perp D_{i} | X_{i}\)` -- `$$\begin{align} \beta = & \frac{P(D_{i}=1) \times \pi (X_{i} | D_{i}=1) \times (1- \pi (X_{i} | D_{i}=1))}{\sum_{j=0,1} P(D_{i}=j) \times \pi (X_{i} | D_{i}=j) \times (1- \pi (X_{i} | D_{i}=j))} \delta_{ATU} \\ & + \frac{P(D_{i}=0) \times \pi (X_{i} | D_{i}=0) \times (1- \pi (X_{i} | D_{i}=0))}{\sum_{j=0,1} P(D_{i}=j) \times \pi (X_{i} | D_{i}=j) \times (1- \pi (X_{i} | D_{i}=j))} \delta_{ATT} \end{align}$$` --- # ATEs vs regression coefficients What does this mean? - OLS puts more weight on observations with treatment `\(D_{i}\)` "unexplained" by `\(X_{i}\)` - "Reverse" weighting such that the proportion of treated units are used to weight the ATU while the proportion of untreated units enter the weights of the ATT - This is *an* average effect, but probably not the average we want