Chapter 36 Endogeneity

Refresher

A general model framework

\[ \mathbf{Y = X \beta + \epsilon} \]

where

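  • \(\mathbf{Y}\) is the \(n \times 1\) vector of outcomes

  • \(\mathbf{X}\) is the \(n \times k\) matrix of regressors

  • \(\beta\) is the \(k \times 1\) vector of coefficients

  • \(\epsilon\) is the \(n \times 1\) vector of errors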

Then, OLS estimates of coefficients are

\[ \begin{aligned} \hat{\beta}_{OLS} &= (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{Y}) \\ &= (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'(\mathbf{X \beta + \epsilon})) \\ &= (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}'\mathbf{X}) \beta + (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}'\mathbf{\epsilon}) \\ &= \beta + (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}'\mathbf{\epsilon}) \end{aligned} \]

To have unbiased estimates, the second term, \((\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}'\mathbf{\epsilon})\), must be zero in expectation.
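The decomposition can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes a single regressor with no intercept (so that \(\mathbf{X}'\mathbf{X}\) and \(\mathbf{X}'\epsilon\) reduce to plain sums), and the true coefficient, sample size, and seed are arbitrary choices made for illustration.

```python
import random

random.seed(0)

n = 1_000
beta = 2.0  # true coefficient (assumed for illustration)

# One regressor and no intercept, so X'X, X'Y and X'eps are plain sums.
x = [random.gauss(0, 1) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + ei for xi, ei in zip(x, eps)]

xtx = sum(xi * xi for xi in x)                # X'X
xty = sum(xi * yi for xi, yi in zip(x, y))    # X'Y
xte = sum(xi * ei for xi, ei in zip(x, eps))  # X'eps

beta_hat = xty / xtx        # (X'X)^{-1} X'Y
sampling_error = xte / xtx  # (X'X)^{-1} X'eps

print(f"beta_hat              = {beta_hat:.6f}")
print(f"beta + sampling error = {beta + sampling_error:.6f}")
```

The two printed numbers agree up to floating-point rounding. In real data \(\epsilon\) is unobserved, so the second term cannot be computed, which is why we need assumptions about it.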

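There are 2 conditions for this term to vanish:

  1. \(E(\epsilon) = 0\) (This is easy: including an intercept absorbs any nonzero mean of the error)
  2. \(Cov(\mathbf{X}, \epsilon) = 0\) (This is the hard part)

Strictly speaking, unbiasedness requires the stronger condition \(E(\epsilon |\mathbf{X}) = 0\), which implies both of the above; the two conditions as listed are what consistency requires.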
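To see why the first condition is the easy one, consider a small simulation. This is a sketch under assumed values (true slope 2, an error with mean 3 that is independent of \(X\), and a regressor with mean 1); none of these numbers come from the text.

```python
import random

random.seed(1)

n = 20_000
beta = 2.0  # true slope (assumed for illustration)

# x has mean 1; the error has mean 3 but is independent of x.
x = [random.gauss(1, 1) for _ in range(n)]
eps = [3.0 + random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + ei for xi, ei in zip(x, eps)]


def mean(values):
    return sum(values) / len(values)


# Regression through the origin: sum(x*y) / sum(x*x).
slope_no_intercept = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Regression with an intercept: cov(x, y) / var(x) on demeaned data.
mx, my = mean(x), mean(y)
slope_with_intercept = (
    sum((a - mx) * (b - my) for a, b in zip(x, y))
    / sum((a - mx) ** 2 for a in x)
)
intercept = my - slope_with_intercept * mx

print(f"slope without intercept: {slope_no_intercept:.3f}")
print(f"slope with intercept:    {slope_with_intercept:.3f}")
print(f"estimated intercept:     {intercept:.3f}")
```

With an intercept, the slope should land close to 2 and the intercept close to 3, because the intercept soaks up the nonzero mean of the error. Without it, the slope is pushed away from 2. No such simple fix exists when the error is correlated with \(X\).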

Usually, the problem stems from omitted variables bias, but we only care about an omitted variable when both of the following hold:

  1. The omitted variable correlates with the variables we care about (\(X\)). If it does not correlate with \(X\), we don’t care; random assignment drives this correlation to 0.
  2. The omitted variable correlates with the outcome (dependent variable).
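Both conditions can be illustrated with a simulated omitted variable. The sketch below assumes a data-generating process \(y = \beta x + \gamma z + e\) with \(x = \rho z + v\), where \(z\) is left out of the regression; the function names and parameter values are hypothetical and chosen only for illustration.

```python
import random

random.seed(2)


def slope(x, y):
    """OLS slope of y on x (with an intercept): cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def short_regression_slope(rho, n=20_000, beta=1.0, gamma=2.0):
    """Slope of y on x when z is omitted; rho sets how much x depends on z."""
    z = [random.gauss(0, 1) for _ in range(n)]
    x = [rho * zi + random.gauss(0, 1) for zi in z]
    y = [beta * xi + gamma * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]
    return slope(x, y)


correlated = short_regression_slope(rho=0.8)    # z correlates with x
uncorrelated = short_regression_slope(rho=0.0)  # z independent of x

print(f"z correlated with x:   slope = {correlated:.3f} (true beta = 1.0)")
print(f"z uncorrelated with x: slope = {uncorrelated:.3f} (true beta = 1.0)")
```

When \(z\) is correlated with \(x\), the large-sample slope is \(\beta + \gamma \, Cov(x, z)/Var(x) = 1 + 2(0.8)/1.64 \approx 1.98\) under these assumed values, so the estimate is far from 1. When \(\rho = 0\), omitting \(z\) only adds noise and the slope stays close to 1.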

There are more types of endogeneity listed below.

Types of endogeneity

  1. Endogenous Treatment

    • Omitted Variables Bias

      • Motivation
      • Ability/talent
      • Self-selection
    • Feedback Effect (Simultaneity): also known as bidirectionality

    • Reverse Causality: subtly different from simultaneity. Technically, the two variables affect each other sequentially, but over a coarse enough time frame (e.g., monthly or yearly data), our coefficient will be biased just as under simultaneity.

    • Measurement Error

  2. Endogenous Sample Selection
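As one illustration from this list, the feedback effect (simultaneity) can be simulated directly. The sketch below assumes a two-equation system \(y = \beta x + u\) and \(x = \gamma y + v\) with independent standard normal shocks; the system and the values \(\beta = \gamma = 0.5\) are assumptions made for the example, not taken from the text.

```python
import random

random.seed(3)

n = 20_000
beta, gamma = 0.5, 0.5  # assumed structural coefficients

xs, ys = [], []
denom = 1 - beta * gamma
for _ in range(n):
    u = random.gauss(0, 1)  # shock to the y equation
    v = random.gauss(0, 1)  # shock to the x equation
    # Reduced form: solve y = beta*x + u and x = gamma*y + v jointly.
    xs.append((gamma * u + v) / denom)
    ys.append((beta * v + u) / denom)

mx, my = sum(xs) / n, sum(ys) / n
ols_slope = (
    sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    / sum((a - mx) ** 2 for a in xs)
)

print(f"true beta = {beta}, OLS slope of y on x = {ols_slope:.3f}")
```

Because \(x\) depends on \(u\) through \(y\), we have \(Cov(x, u) \neq 0\), and the OLS slope settles near \((\gamma + \beta)/(\gamma^2 + 1) = 0.8\) under these assumed values rather than at the true \(\beta = 0.5\).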

To deal with this problem, we have a toolbox (introduced earlier, in Chapter 18).

Using control variables in regression is a “selection on observables” identification strategy.

In other words, if you believe you have an omitted variable and you can measure it, including it in the regression model solves your problem. Such variables, which are not of direct interest, are called control variables in your model.
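A small simulation shows the effect of adding a measured control. This is a sketch under an assumed data-generating process (\(y = x + 2z + e\), with \(x = 0.8z + v\)); the helper name `demean` and all parameter values are hypothetical. The two-regressor fit solves the 2×2 normal equations on demeaned data by hand, so only the standard library is needed.

```python
import random

random.seed(4)

n = 20_000
beta, gamma, rho = 1.0, 2.0, 0.8  # assumed values for illustration

z = [random.gauss(0, 1) for _ in range(n)]         # measurable confounder
x = [rho * zi + random.gauss(0, 1) for zi in z]    # variable of interest
y = [beta * xi + gamma * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]


def demean(values):
    m = sum(values) / len(values)
    return [val - m for val in values]


xd, zd, yd = demean(x), demean(z), demean(y)  # demeaning handles the intercept

sxx = sum(a * a for a in xd)
szz = sum(a * a for a in zd)
sxz = sum(a * b for a, b in zip(xd, zd))
sxy = sum(a * b for a, b in zip(xd, yd))
szy = sum(a * b for a, b in zip(zd, yd))

# Short regression: y on x only (z omitted).
b_short = sxy / sxx

# Long regression: y on x and z, solved from the 2x2 normal equations.
det = sxx * szz - sxz ** 2
b_x = (sxy * szz - szy * sxz) / det
b_z = (szy * sxx - sxy * sxz) / det

print(f"without control: coefficient on x = {b_short:.3f}")
print(f"with control z:  coefficient on x = {b_x:.3f}, on z = {b_z:.3f}")
```

Once \(z\) enters the model, the coefficient on \(x\) should return to roughly its true value of 1, while the short regression overstates it. This only works because \(z\) is observed here.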

However, this is rarely the case, because the problem is usually that we do not have measurements of the omitted variable. Hence, we need more elaborate methods.

Before we get to methods that deal with bias arising from omitted variables, we consider cases where we do have measurements of a variable, but those measurements contain error (and hence introduce bias).