30.4 Testing Assumptions

\[ Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon \]

where

  • \(X_1\) are exogenous variables

  • \(X_2\) are endogenous variables

  • \(Z\) are instrumental variables

If \(Z\) satisfies the relevance condition, then \(Cov(Z, X_2) \neq 0\).

This condition matters because it is what allows us to estimate \(\beta_2\):

\[ \beta_2 = \frac{Cov(Z,Y)}{Cov(Z, X_2)} \]

If \(Z\) satisfies the exogeneity condition, \(E[Z\epsilon] = 0\). This can be achieved when:

  • \(Z\) has no direct effect on \(Y\) except through \(X_2\)

  • \(Z\) is uncorrelated with any omitted variable

If we only want to know the effect of \(Z\) on \(Y\) (the reduced form), the coefficient of \(Z\) is

\[ \rho = \frac{Cov(Y, Z)}{Var(Z)} \]

and, by the exclusion restriction assumption, this effect operates only through \(X_2\).

We can also consistently estimate the effect of \(Z\) on \(X_2\) (the first stage), where the coefficient of \(Z\) is

\[ \pi = \frac{Cov(X_2, Z)}{Var(Z)} \]

and the IV estimate is

\[ \beta_2 = \frac{Cov(Y,Z)}{Cov(X_2, Z)} = \frac{\rho}{\pi} \]
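
To see this identity in action, here is a minimal simulation (made-up data; the parameter values 0.8 and 0.5 are illustrative) that computes the reduced-form and first-stage coefficients by hand and compares their ratio with the 2SLS estimate:

library(AER)  # ivreg()

set.seed(1)
n  <- 10000
z  <- rnorm(n)                    # instrument
u  <- rnorm(n)                    # unobserved confounder
x2 <- 0.8 * z + u + rnorm(n)      # endogenous regressor, true pi = 0.8
y  <- 0.5 * x2 + u + rnorm(n)     # outcome, true beta_2 = 0.5

rho_hat <- cov(y, z) / var(z)     # reduced form
pi_hat  <- cov(x2, z) / var(z)    # first stage
rho_hat / pi_hat                  # ratio estimate, approx. 0.5
cov(y, z) / cov(x2, z)            # the same number
coef(ivreg(y ~ x2 | z))["x2"]     # matches the 2SLS estimate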

30.4.1 Relevance Assumption

  • Weak instruments: instruments that explain little of the variation in the endogenous regressor

    • The coefficient estimate of the endogenous variable will be imprecise and biased.
    • For cases where weak instruments are unavoidable, M. J. Moreira (2003) proposes the conditional likelihood ratio test for robust inference. This test is considered approximately optimal for weak instrument scenarios (D. W. Andrews, Moreira, and Stock 2008; D. W. Andrews and Marmer 2008).
  • Rule of thumb:

    • Compute the first-stage F-statistic; a common rule of thumb is that it should exceed 10, though this threshold is now discouraged by Lee et al. (2022).

    • Use linearHypothesis() to jointly test only the instrument coefficients (see the sketch below).
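
A minimal sketch of this joint test, assuming a hypothetical data frame dat with endogenous x2, exogenous control x1, and instruments z1 and z2:

library(car)  # linearHypothesis()

# Hypothetical first stage: endogenous x2 on control x1 and instruments z1, z2
first_stage <- lm(x2 ~ x1 + z1 + z2, data = dat)

# Joint F-test of the excluded instruments only
linearHypothesis(first_stage, c("z1 = 0", "z2 = 0"))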

First-Stage F-Test

In the context of a two-stage least squares (2SLS) setup where you are estimating the equation:

\[ Y = X \beta + \epsilon \]

and \(X\) is endogenous, you typically estimate a first-stage regression of:

\[ X = Z \pi + u \]

where \(Z\) is the instrument.

The first-stage F-test evaluates the joint significance of the instruments in this first stage:

\[ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/ (n - k - 1)} \]

where:

  • \(SSR_r\) is the sum of squared residuals from the restricted model (no instruments, just the constant).

  • \(SSR_{ur}\) is the sum of squared residuals from the unrestricted model (with instruments).

  • \(q\) is the number of instruments excluded from the main equation.

  • \(n\) is the number of observations.

  • \(k\) is the number of explanatory variables excluding the instruments.
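
To make the formula concrete, here is a direct computation in the simple setup above (hypothetical data with two excluded instruments z1 and z2), checked against anova():

# Restricted first stage: constant only (instruments excluded)
fs_r  <- lm(x2 ~ 1, data = dat)
# Unrestricted first stage: with the instruments
fs_ur <- lm(x2 ~ z1 + z2, data = dat)

ssr_r  <- sum(resid(fs_r)^2)
ssr_ur <- sum(resid(fs_ur)^2)
q      <- 2                       # number of excluded instruments

F_stat <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df.residual(fs_ur))
F_stat
anova(fs_r, fs_ur)                # reports the same F-statistic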

Cragg-Donald Test

The Cragg-Donald statistic is essentially the same as the Wald statistic of the joint significance of the instruments in the first stage, and it’s used specifically when you have multiple endogenous regressors. It’s calculated as:

\[ CD = n \times (R_{ur}^2 - R_r^2) \]

where:

  • \(R_{ur}^2\) and \(R_r^2\) are the R-squared values from the unrestricted and restricted models respectively.

  • \(n\) is the number of observations.

For one endogenous variable, the Cragg-Donald test results should align closely with those from Stock and Yogo. The Anderson canonical correlation test, a likelihood ratio test, also works under similar conditions, contrasting with Cragg-Donald’s Wald statistic approach. Both are valid with one endogenous variable and at least one instrument.

Stock-Yogo Weak IV Test

The Stock-Yogo test does not directly compute a statistic like the F-test or Cragg-Donald, but rather uses pre-computed critical values to assess the strength of instruments. It often uses the eigenvalues derived from the concentration matrix:

\[ S = \frac{1}{n} (Z' X) (X'Z) \]

where \(Z\) is the matrix of instruments and \(X\) is the matrix of endogenous regressors.

Stock and Yogo provide critical values for different scenarios (bias, size distortion) for a given number of instruments and endogenous regressors, based on the smallest eigenvalue of \(S\). The test compares these eigenvalues against critical values that correspond to thresholds of permissible bias or size distortion in a 2SLS estimator.

  • Critical Values and Test Conditions: The critical values derived by Stock and Yogo depend on the level of acceptable bias, the number of endogenous regressors, and the number of instruments. For example, with a 5% maximum acceptable bias, one endogenous variable, and three instruments, the critical value for a sufficient first-stage F-statistic is 13.91. Note that this framework requires at least two overidentifying degrees of freedom.

Comparison

First-Stage F-Test

  • Description: Evaluates the joint significance of the instruments in the first stage.
  • Focus: Predictive power of the instruments for the endogenous variable.
  • Usage: The simplest and most direct test; widely used, especially with a single endogenous variable. Rule of thumb: F < 10 suggests weak instruments.

Cragg-Donald Test

  • Description: Wald statistic for the joint significance of the instruments.
  • Focus: Joint strength of multiple instruments with multiple endogenous variables.
  • Usage: More appropriate in complex IV setups with multiple endogenous variables; the statistic is compared against critical values to assess instrument strength.

Stock-Yogo Weak IV Test

  • Description: Compares a test statistic to pre-computed critical values.
  • Focus: Minimizing size distortions and bias from weak instruments.
  • Usage: Theoretical evaluation of instrument strength, ensuring the reliability of 2SLS estimates against specific thresholds of bias or size distortion.

All of the tests above (Stock-Yogo, Cragg-Donald, Anderson canonical correlation) assume that the errors are independently and identically distributed. If this assumption is violated, the Kleibergen-Paap test is robust to violations of the iid assumption and can be applied even with a single endogenous variable and instrument, provided the model is properly identified (Baum and Schaffer 2021).

30.4.1.1 Cragg-Donald

(Cragg and Donald 1993)

Similar to the first-stage F-statistic

library(cragg)
library(AER) # for the dataset
data("WeakInstrument")

cragg_donald(
    # control variables
    X = ~ 1, 
    # endogenous variables
    D = ~ x, 
    # instrument variables 
    Z = ~ z, 
    data = WeakInstrument
)
#> Cragg-Donald test for weak instruments:
#> 
#>      Data:                        WeakInstrument 
#>      Controls:                    ~1 
#>      Treatments:                  ~x 
#>      Instruments:                 ~z 
#> 
#>      Cragg-Donald Statistic:        4.566136 
#>      Df:                                 198

A large Cragg-Donald statistic suggests strong instruments, which is not the case here. To judge the statistic against a critical value, we turn to Stock-Yogo.

30.4.1.2 Stock-Yogo

J. H. Stock and Yogo (2002) set the critical values such that the bias is less than 10% (the default).

\(H_0:\) Instruments are weak

\(H_1:\) Instruments are not weak

library(cragg)
library(AER) # for the dataset
data("WeakInstrument")
stock_yogo_test(
    # control variables
    X = ~ 1,
    # endogenous variables
    D = ~ x,
    # instrument variables
    Z = ~ z,
    size_bias = "bias",
    data = WeakInstrument
)

The Cragg-Donald statistic must exceed the chosen critical value for the instruments to be considered strong.

30.4.2 Exogeneity Assumption

The local average treatment effect (LATE) is defined as:

\[ \text{LATE} = \frac{\text{reduced form}}{\text{first stage}} = \frac{\rho}{\phi} \]

This implies that the reduced form (\(\rho\)) is the product of the first stage (\(\phi\)) and LATE:

\[ \rho = \phi \times \text{LATE} \]

Thus, if the first stage (\(\phi\)) is 0, the reduced form (\(\rho\)) should also be 0.
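
The interactive simulation below, a self-contained Shiny sketch, makes this relationship tangible: it lets you vary the true effect of \(X\) on \(Y\), the first-stage strength, and a direct effect of \(Z\) on \(Y\), and plots the first-stage, reduced-form, IV, and ratio estimates side by side with their confidence intervals.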

# Load necessary libraries
library(shiny)
library(AER)  # for ivreg
library(ggplot2)  # for visualization

# Function to simulate the dataset
simulate_iv_data <- function(n, beta, phi, direct_effect) {
  Z <- rnorm(n)
  epsilon_x <- rnorm(n)
  epsilon_y <- rnorm(n)
  X <- phi * Z + epsilon_x
  Y <- beta * X + direct_effect * Z + epsilon_y
  data <- data.frame(Y = Y, X = X, Z = Z)
  return(data)
}

# Function to run the simulations and calculate the effects
run_simulation <- function(n, beta, phi, direct_effect) {
  # Simulate the data
  simulated_data <- simulate_iv_data(n, beta, phi, direct_effect)
  
  # Estimate first-stage effect (phi)
  first_stage <- lm(X ~ Z, data = simulated_data)
  phi <- coef(first_stage)["Z"]
  phi_ci <- confint(first_stage)["Z", ]
  
  # Estimate reduced-form effect (rho)
  reduced_form <- lm(Y ~ Z, data = simulated_data)
  rho <- coef(reduced_form)["Z"]
  rho_ci <- confint(reduced_form)["Z", ]
  
  # Estimate LATE using IV regression
  iv_model <- ivreg(Y ~ X | Z, data = simulated_data)
  iv_late <- coef(iv_model)["X"]
  iv_late_ci <- confint(iv_model)["X", ]
  
  # Calculate LATE as the ratio of reduced-form and first-stage coefficients
  calculated_late <- rho / phi
  # Delta-method approximation: convert the 95% CI half-widths back into
  # standard errors before combining them (covariance term ignored)
  se_rho <- (rho_ci[2] - rho) / 1.96
  se_phi <- (phi_ci[2] - phi) / 1.96
  calculated_late_se <- sqrt(se_rho^2 / phi^2 + (rho * se_phi / phi^2)^2)
  calculated_late_ci <- c(calculated_late - 1.96 * calculated_late_se, 
                          calculated_late + 1.96 * calculated_late_se)
  
  # Return a list of results
  list(phi = phi, 
       phi_ci = phi_ci,
       rho = rho, 
       rho_ci = rho_ci,
       direct_effect = direct_effect,
       direct_effect_ci = c(direct_effect, direct_effect),  # Placeholder for direct effect CI
       iv_late = iv_late, 
       iv_late_ci = iv_late_ci,
       calculated_late = calculated_late, 
       calculated_late_ci = calculated_late_ci,
       true_effect = beta,
       true_effect_ci = c(beta, beta))  # Placeholder for true effect CI
}

# Define UI for the sliders
ui <- fluidPage(
  titlePanel("IV Model Simulation"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("beta", "True Effect of X on Y (beta):", min = 0, max = 1.0, value = 0.5, step = 0.1),
      sliderInput("phi", "First Stage Effect (phi):", min = 0, max = 1.0, value = 0.7, step = 0.1),
      sliderInput("direct_effect", "Direct Effect of Z on Y:", min = -0.5, max = 0.5, value = 0, step = 0.1)
    ),
    mainPanel(
      plotOutput("dotPlot")
    )
  )
)

# Define server logic to run the simulation and generate the plot
server <- function(input, output) {
  output$dotPlot <- renderPlot({
    # Run simulation
    results <- run_simulation(n = 1000, beta = input$beta, phi = input$phi, direct_effect = input$direct_effect)
    
    # Prepare data for plotting
    plot_data <- data.frame(
      Effect = c("First Stage (phi)", "Reduced Form (rho)", "Direct Effect", "LATE (Ratio)", "LATE (IV)", "True Effect"),
      Value = c(results$phi, results$rho, results$direct_effect, results$calculated_late, results$iv_late, results$true_effect),
      CI_Lower = c(results$phi_ci[1], results$rho_ci[1], results$direct_effect_ci[1], results$calculated_late_ci[1], results$iv_late_ci[1], results$true_effect_ci[1]),
      CI_Upper = c(results$phi_ci[2], results$rho_ci[2], results$direct_effect_ci[2], results$calculated_late_ci[2], results$iv_late_ci[2], results$true_effect_ci[2])
    )
    
    # Create dot plot with confidence intervals
    ggplot(plot_data, aes(x = Effect, y = Value)) +
      geom_point(size = 3) +
      geom_errorbar(aes(ymin = CI_Lower, ymax = CI_Upper), width = 0.2) +
      labs(title = "IV Model Effects",
           y = "Coefficient Value") +
      coord_cartesian(ylim = c(-1, 1)) +  # Limits the y-axis to -1 to 1 but allows CI beyond
      theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  })
}

# Run the application 
shinyApp(ui = ui, server = server)

A statistically significant reduced form estimate without a corresponding first stage indicates an issue, suggesting an alternative channel linking instruments to outcomes or a direct effect of the IV on the outcome.

  • No Direct Effect: When the direct effect is 0 and the first stage is 0, the reduced form is 0.
    • Note: Extremely rare cases with multiple additional paths that perfectly cancel each other out can also produce this result, but testing for all possible paths is impractical.
  • With Direct Effect: When there is a direct effect of the IV on the outcome, the reduced form can be significantly different from 0, even if the first stage is 0.
    • This violates the exogeneity assumption, as the IV should only affect the outcome through the treatment variable.

To test the validity of the exogeneity assumption, we can use a sanity test:

  • Identify groups for which the effects of instruments on the treatment variable are small and not significantly different from 0. The reduced form estimate for these groups should also be 0. These “no-first-stage samples” provide evidence of whether the exogeneity assumption is violated.
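
A minimal sketch of this check, assuming a hypothetical data frame dat with a grouping variable group flagging a subsample where the instrument should not move the treatment:

# Subsample where the instrument is not expected to move the treatment
no_fs <- subset(dat, group == "no_first_stage")

summary(lm(x2 ~ z, data = no_fs))  # first stage: expect z coefficient near 0
summary(lm(y ~ z, data = no_fs))   # reduced form: should also be near 0
# A significant reduced form here suggests the instrument reaches the
# outcome through a channel other than the treatment.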

30.4.2.1 Overid Tests

  • Wald test and Hausman test for exogeneity of \(X\) assuming \(Z\) is exogenous

    • The Wald test is typically preferred, since the Hausman test additionally assumes homoskedasticity (see below).
  • Sargan (for 2SLS) is a simpler version of Hansen’s J test (for IV-GMM)

  • Modified J test (i.e., regularized jackknife IV): can handle weak instruments and small samples (Carrasco and Doukali 2022); the same paper also proposes a regularized F-test of the relevance assumption that is robust to heteroskedasticity.

  • New advances: endogeneity-robust inference in finite samples and sensitivity analysis of inference (Kiviet 2020)

These tests can provide evidence for the validity of the over-identifying restrictions, but that evidence is neither sufficient nor necessary for the validity of the moment conditions; i.e., this assumption cannot truly be tested (Deaton 2010; Parente and Silva 2012).

  • The over-identifying restrictions can still be valid even when the instruments are correlated with the error terms, but in that case what you are estimating is no longer the parameter of interest.

  • Rejection of the over-identifying restrictions can also be the result of parameter heterogeneity (J. D. Angrist, Graddy, and Imbens 2000)

Why do overid tests carry little information?

  • Overidentifying restrictions can be valid even when the instruments are invalid

    • Whenever instruments share the same motivation and are on the same scale, the estimated parameters of interest will be very close (Parente and Silva 2012, 316)
  • Overidentifying restrictions can be invalid even when each instrument is valid

    • When the effect of your parameter of interest is heterogeneous (e.g., two groups with two different true effects), your first instrument may be correlated with the variable of interest only for the first group, and your second instrument only for the second group; each instrument is valid, and using either one alone still identifies that group's parameter of interest. However, if you use both together, what you estimate is a mixture of the two groups' effects, so the overidentifying restrictions are invalid (no single parameter makes the model's errors orthogonal to both instruments). This may seem confusing at first: if each subset of overidentifying restrictions is valid, shouldn't the full set be valid too? That interpretation is flawed, because the residual's orthogonality to the instruments depends on the chosen set of instruments; the restrictions tested when using two sets of instruments together are not the union of the restrictions tested when using each set separately (Parente and Silva 2012, 316). The simulation below illustrates this point.
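
The following simulation is a sketch with made-up parameters: each instrument is valid for its own group, and each just-identified IV recovers that group's effect, yet the overidentification test on the pooled model rejects because the two groups have different true effects.

library(AER)

set.seed(2)
n  <- 5000
g  <- rep(0:1, each = n / 2)          # two groups with different true effects
z1 <- rnorm(n)
z2 <- rnorm(n)
u  <- rnorm(n)

# z1 moves x only in group 0; z2 only in group 1 (each instrument is valid)
x  <- (1 - g) * z1 + g * z2 + u + rnorm(n)
y  <- ifelse(g == 0, 0.2, 1.0) * x + u + rnorm(n)
dat <- data.frame(y, x, z1, z2, g)

# Each just-identified IV recovers its own group's effect
coef(ivreg(y ~ x | z1, data = subset(dat, g == 0)))["x"]  # approx. 0.2
coef(ivreg(y ~ x | z2, data = subset(dat, g == 1)))["x"]  # approx. 1.0

# Pooled, over-identified model: the Sargan test rejects under heterogeneity
summary(ivreg(y ~ x | z1 + z2, data = dat), diagnostics = TRUE)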

These tests of overidentifying restrictions should be used to check whether different instruments identify the same parameter of interest, not to check their validity (J. A. Hausman 1983; Parente and Silva 2012).

30.4.2.1.1 Wald Test

Assuming that \(Z\) is exogenous (a valid instrument), we want to know whether \(X_2\) is exogenous

1st stage:

\[ X_2 = \hat{\alpha} Z + \hat{\epsilon} \]

2nd stage:

\[ Y = \delta_0 X_1 + \delta_1 X_2 + \delta_2 \hat{\epsilon} + u \]

where

  • \(\hat{\epsilon}\) denotes the residuals from the 1st stage

The Wald test of exogeneity tests the hypotheses

\[ H_0: \delta_2 = 0 \\ H_1: \delta_2 \neq 0 \]

If you have more than one endogenous variable and more than one instrument, \(\delta_2\) is a vector of coefficients on the residuals from all of the first-stage equations, and the null hypothesis is that these coefficients are jointly equal to 0.

If you reject this null hypothesis, it means that \(X_2\) is endogenous and instrumenting for it is warranted; if you fail to reject, \(X_2\) can be treated as exogenous.

Keep in mind that an insignificant test statistic may simply mean we do not have enough information to reject the null.

When you have a valid instrument \(Z\), the coefficient estimate of \(X_2\) is consistent whether \(X_2\) is endogenous or exogenous. But if \(X_2\) is exogenous, 2SLS is inefficient relative to OLS (i.e., it has larger standard errors).

Intuition:

\(\hat{\epsilon}\) captures the supposedly endogenous part of \(X_2\). When we regress \(Y\) on \(\hat{\epsilon}\) (alongside the other regressors) and find that its coefficient is not different from 0, the exogenous part of \(X_2\) explains the impact on \(Y\) well, leaving no evidence of an endogenous part.
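
A control-function sketch of this procedure (hypothetical variable names; in practice, second-stage standard errors should account for the generated regressor \(\hat{\epsilon}\)):

library(car)  # linearHypothesis()

# 1st stage: regress the endogenous variable on the instrument, keep residuals
fs      <- lm(x2 ~ x1 + z, data = dat)
dat$eps <- resid(fs)

# 2nd stage: add the first-stage residuals to the structural equation
cf <- lm(y ~ x1 + x2 + eps, data = dat)

# Wald test of H0: delta_2 = 0 (X2 is exogenous)
linearHypothesis(cf, "eps = 0")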

30.4.2.1.2 Hausman’s Test

Similar to the Wald test, and identical to it under homoskedasticity (i.e., homogeneity of variances). Because it relies on this assumption, it is used less often than the Wald test.

30.4.2.1.3 Hansen’s J
  • (L. P. Hansen 1982)

  • J-test (over-identifying restrictions test): test whether additional instruments are exogenous

    • Can only be applied in cases where you have more instruments than endogenous variables
      • \(dim(Z) > dim(X_2)\)
    • Assume at least one instrument within \(Z\) is exogenous

Procedure IV-GMM:

  1. Obtain the residuals of the 2SLS estimation
  2. Regress the residuals on all instruments and exogenous variables.
  3. Test the joint hypothesis that all coefficients of the residuals across instruments are 0 (i.e., this is true when instruments are exogenous).
    1. Compute \(J = mF\) where \(m\) is the number of instruments and \(F\) is the \(F\)-statistic from that regression (you can use linearHypothesis() again).

    2. If your exogeneity assumption is true, then \(J \sim \chi^2_{m-k}\) where \(k\) is the number of endogenous variables.

  4. If you reject this hypothesis, it can be that
    1. The first set of instruments is invalid

    2. The second set of instruments is invalid

    3. Both sets of instruments are invalid

Note: This procedure is only valid when the residuals are homoskedastic.

For a heteroskedasticity-robust \(J\)-statistic, see (Carrasco and Doukali 2022; H. Li et al. 2022)
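
A sketch of the procedure above on hypothetical data with two instruments and one endogenous regressor, so that \(m = 2\), \(k = 1\), and \(J \sim \chi^2_1\) under the null:

library(AER)
library(car)

# Step 1: residuals from the 2SLS estimation
iv      <- ivreg(y ~ x2 | z1 + z2, data = dat)
dat$res <- resid(iv)

# Step 2: regress the residuals on all instruments (and exogenous variables)
aux <- lm(res ~ z1 + z2, data = dat)

# Step 3: J = m * F, compared to a chi-square with m - k degrees of freedom
m <- 2
k <- 1
F_val <- linearHypothesis(aux, c("z1 = 0", "z2 = 0"))$F[2]
J     <- m * F_val
pchisq(J, df = m - k, lower.tail = FALSE)  # p-value of the J test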

30.4.2.1.4 Sargan Test

(Sargan 1958)

Similar to Hansen’s J, but it assumes homoskedasticity

  • Be careful when the sample is not collected exogenously. With a choice-based sampling design, the sampling weights must be taken into account to obtain consistent estimates. However, even if sampling weights are applied, these tests remain unsuitable because the iid assumption on the errors is already violated; hence, the tests are invalid in this case (Pitt 2011).

  • If the design features heteroskedasticity, the Sargan test is invalid (Pitt 2011).
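
In R, a convenient way to obtain the Sargan statistic (along with the weak-instruments F and Wu-Hausman diagnostics) is summary() with diagnostics = TRUE on an AER::ivreg model; a sketch on hypothetical data:

library(AER)

# Hypothetical over-identified model: one endogenous regressor, two instruments
iv <- ivreg(y ~ x2 | z1 + z2, data = dat)
summary(iv, diagnostics = TRUE)  # see the "Sargan" row of the diagnostics table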

References

Andrews, Donald WK, and Vadim Marmer. 2008. “Exactly Distribution-Free Inference in Instrumental Variables Regression with Possibly Weak Instruments.” Journal of Econometrics 142 (1): 183–200.
Andrews, Donald WK, Marcelo J Moreira, and James H Stock. 2008. “Efficient Two-Sided Nonsimilar Invariant Tests in IV Regression with Weak Instruments.” Journal of Econometrics 146 (2): 241–54.
Angrist, Joshua D, Kathryn Graddy, and Guido W Imbens. 2000. “The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish.” The Review of Economic Studies 67 (3): 499–527.
Baum, Christopher, and Mark Schaffer. 2021. “IVREG2H: Stata Module to Perform Instrumental Variables Estimation Using Heteroskedasticity-Based Instruments.”
Carrasco, Marine, and Mohamed Doukali. 2022. “Testing Overidentifying Restrictions with Many Instruments and Heteroscedasticity Using Regularised Jackknife IV.” The Econometrics Journal 25 (1): 71–97.
Cragg, John G, and Stephen G Donald. 1993. “Testing Identifiability and Specification in Instrumental Variable Models.” Econometric Theory 9 (2): 222–40.
Deaton, Angus. 2010. “Instruments, Randomization, and Learning about Development.” Journal of Economic Literature 48 (2): 424–55.
Hansen, Lars Peter. 1982. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica: Journal of the Econometric Society, 1029–54.
Hausman, Jerry A. 1983. “Specification and Estimation of Simultaneous Equation Models.” Handbook of Econometrics 1: 391–448.
Kiviet, Jan F. 2020. “Testing the Impossible: Identifying Exclusion Restrictions.” Journal of Econometrics 218 (2): 294–316.
Lee, David S, Justin McCrary, Marcelo J Moreira, and Jack Porter. 2022. “Valid t-Ratio Inference for IV.” American Economic Review 112 (10): 3260–90.
Li, Hanqing, Xiaohui Liu, Yuting Chen, and Yawen Fan. 2022. “Testing for Serial Correlation in Autoregressive Exogenous Models with Possible GARCH Errors.” Entropy 24 (8): 1076.
Moreira, Marcelo J. 2003. “A Conditional Likelihood Ratio Test for Structural Models.” Econometrica 71 (4): 1027–48.
Parente, Paulo MDC, and JMC Santos Silva. 2012. “A Cautionary Note on Tests of Overidentifying Restrictions.” Economics Letters 115 (2): 314–17.
Pitt, Mark M. 2011. “Overidentification Tests and Causality: A Second Response to Roodman and Morduch.” Accessed at http://www.brown.edu/research/projects/pitt.
Sargan, John D. 1958. “The Estimation of Economic Relationships Using Instrumental Variables.” Econometrica: Journal of the Econometric Society, 393–415.
Stock, James H, and Motohiro Yogo. 2002. “Testing for Weak Instruments in Linear IV Regression.” Technical Working Paper 284. Cambridge, MA: National Bureau of Economic Research.