30.7 Control Function

Also known as two-stage residual inclusion

Resources:

  • Binary outcome and binary endogenous variable application (E. Tchetgen Tchetgen 2014)

    • In rare events: we use a logistic model in the 2nd stage

    • In non-rare events: use risk ratio regression in the 2nd stage

  • Application in marketing for consumer choice model (Petrin and Train 2010)

Notes

  • This approach is better suited for models with nonadditive errors (e.g., discrete choice models), or binary endogenous model, binary response variable, etc.

\[ Y = g(X) + U \\ X = \pi(Z) + V \\ E(U |Z,V) = E(U|V) \\ E(V|Z) = 0 \]

Under control function approach,

\[ E(Y|Z,V) = g(X) + E(U|Z,V) \\ = g(X) + E(U|V) \\ = g(X) + h(V) \]

where \(h(V)\) is the control function that models the endogeneity

  1. Linear in parameters
  1. Linear Endogenous Variables:
    • The control function function approach is identical to the usual 2SLS estimator
  2. Nonlinear Endogenous Variables:
    • The control function is different from the 2SLS estimator
  1. Nonlinear in parameters:
    • The CF function is superior than the 2SLS estimator

30.7.1 Simulation

library(fixest)
library(tidyverse)
library(modelsummary)

# Set the seed for reproducibility
set.seed(123)
n = 10000
# Generate the exogenous variable from a normal distribution
exogenous <- rnorm(n, mean = 5, sd = 1)

# Generate the omitted variable as a function of the exogenous variable
omitted <- rnorm(n, mean = 2, sd = 1)

# Generate the endogenous variable as a function of the omitted variable and the exogenous variable
endogenous <- 5 * omitted + 2 * exogenous + rnorm(n, mean = 0, sd = 1)

# nonlinear endogenous variable
endogenous_nonlinear <- 5 * omitted^2 + 2 * exogenous + rnorm(100, mean = 0, sd = 1)

unrelated <- rexp(n, rate = 1)

# Generate the response variable as a function of the endogenous variable and the omitted variable
response <- 4 +  3 * endogenous + 6 * omitted + rnorm(n, mean = 0, sd = 1)

response_nonlinear <- 4 +  3 * endogenous_nonlinear + 6 * omitted + rnorm(n, mean = 0, sd = 1)

response_nonlinear_para <- 4 +  3 * endogenous ^ 2 + 6 * omitted + rnorm(n, mean = 0, sd = 1)


# Combine the variables into a data frame
my_data <-
    data.frame(
        exogenous,
        omitted,
        endogenous,
        response,
        unrelated,
        response,
        response_nonlinear,
        response_nonlinear_para
    )

# View the first few rows of the data frame
# head(my_data)

wo_omitted <- feols(response ~ endogenous + sw0(unrelated), data = my_data)
w_omitted  <- feols(response ~ endogenous + omitted + unrelated, data = my_data)


# ivreg::ivreg(response ~ endogenous + unrelated | exogenous, data = my_data)
iv <- feols(response ~ 1 + sw0(unrelated) | endogenous ~ exogenous, data = my_data)

etable(
    wo_omitted,
    w_omitted,
    iv, 
    digits = 2
    # vcov = list("each", "iid", "hetero")
)
#>                   wo_omitted.1   wo_omitted.2      w_omitted           iv.1
#> Dependent Var.:       response       response       response       response
#>                                                                            
#> Constant        -3.9*** (0.10) -4.0*** (0.10)  4.0*** (0.05) 15.7*** (0.59)
#> endogenous      4.0*** (0.005) 4.0*** (0.005) 3.0*** (0.004)  3.0*** (0.03)
#> unrelated                         0.03 (0.03)  0.002 (0.010)               
#> omitted                                        6.0*** (0.02)               
#> _______________ ______________ ______________ ______________ ______________
#> S.E. type                  IID            IID            IID            IID
#> Observations            10,000         10,000         10,000         10,000
#> R2                     0.98566        0.98567        0.99803        0.92608
#> Adj. R2                0.98566        0.98566        0.99803        0.92607
#> 
#>                           iv.2
#> Dependent Var.:       response
#>                               
#> Constant        15.6*** (0.59)
#> endogenous       3.0*** (0.03)
#> unrelated         0.10. (0.06)
#> omitted                       
#> _______________ ______________
#> S.E. type                  IID
#> Observations            10,000
#> R2                     0.92610
#> Adj. R2                0.92608
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear in parameter and linear in endogenous variable

# manual
# 2SLS
first_stage = lm(endogenous ~ exogenous, data = my_data)
new_data = cbind(my_data, new_endogenous = predict(first_stage, my_data))
second_stage = lm(response ~ new_endogenous, data = new_data)
summary(second_stage)
#> 
#> Call:
#> lm(formula = response ~ new_endogenous, data = new_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -77.683 -14.374  -0.107  14.289  78.274 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     15.6743     2.0819   7.529 5.57e-14 ***
#> new_endogenous   3.0142     0.1039  29.025  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 21.26 on 9998 degrees of freedom
#> Multiple R-squared:  0.07771,    Adjusted R-squared:  0.07762 
#> F-statistic: 842.4 on 1 and 9998 DF,  p-value: < 2.2e-16

new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response ~ endogenous + residual, data = new_data_cf)
summary(second_stage_cf)
#> 
#> Call:
#> lm(formula = response ~ endogenous + residual, data = new_data_cf)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.360 -1.016  0.003  1.023  5.201 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 15.674265   0.149350   105.0   <2e-16 ***
#> endogenous   3.014202   0.007450   404.6   <2e-16 ***
#> residual     1.140920   0.008027   142.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.525 on 9997 degrees of freedom
#> Multiple R-squared:  0.9953, Adjusted R-squared:  0.9953 
#> F-statistic: 1.048e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

modelsummary(list(second_stage, second_stage_cf))
 (1)   (2)
(Intercept) 15.674 15.674
(2.082) (0.149)
new_endogenous 3.014
(0.104)
endogenous 3.014
(0.007)
residual 1.141
(0.008)
Num.Obs. 10000 10000
R2 0.078 0.995
R2 Adj. 0.078 0.995
AIC 89520.9 36826.8
BIC 89542.5 36855.6
Log.Lik. −44757.438 −18409.377
F 842.424 1048263.304
RMSE 21.26 1.53

Nonlinear in endogenous variable

# 2SLS
first_stage = lm(endogenous_nonlinear ~ exogenous, data = my_data)

new_data = cbind(my_data, new_endogenous_nonlinear = predict(first_stage, my_data))
second_stage = lm(response_nonlinear ~ new_endogenous_nonlinear, data = new_data)
summary(second_stage)
#> 
#> Call:
#> lm(formula = response_nonlinear ~ new_endogenous_nonlinear, data = new_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -94.43 -52.10 -15.29  36.50 446.08 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               15.3390    11.8175   1.298    0.194    
#> new_endogenous_nonlinear   3.0174     0.3376   8.938   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 69.51 on 9998 degrees of freedom
#> Multiple R-squared:  0.007927,   Adjusted R-squared:  0.007828 
#> F-statistic: 79.89 on 1 and 9998 DF,  p-value: < 2.2e-16

new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response_nonlinear ~ endogenous_nonlinear + residual, data = new_data_cf)
summary(second_stage_cf)
#> 
#> Call:
#> lm(formula = response_nonlinear ~ endogenous_nonlinear + residual, 
#>     data = new_data_cf)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -17.5437  -0.8348   0.4614   1.4424   4.8154 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          15.33904    0.38459   39.88   <2e-16 ***
#> endogenous_nonlinear  3.01737    0.01099  274.64   <2e-16 ***
#> residual              0.24919    0.01104   22.58   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.262 on 9997 degrees of freedom
#> Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
#> F-statistic: 4.753e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

modelsummary(list(second_stage, second_stage_cf))
 (1)   (2)
(Intercept) 15.339 15.339
(11.817) (0.385)
new_endogenous_nonlinear 3.017
(0.338)
endogenous_nonlinear 3.017
(0.011)
residual 0.249
(0.011)
Num.Obs. 10000 10000
R2 0.008 0.999
R2 Adj. 0.008 0.999
AIC 113211.6 44709.6
BIC 113233.2 44738.4
Log.Lik. −56602.782 −22350.801
F 79.887 4752573.052
RMSE 69.50 2.26

Nonlinear in parameters

# 2SLS
first_stage = lm(endogenous ~ exogenous, data = my_data)

new_data = cbind(my_data, new_endogenous = predict(first_stage, my_data))
second_stage = lm(response_nonlinear_para ~ new_endogenous, data = new_data)
summary(second_stage)
#> 
#> Call:
#> lm(formula = response_nonlinear_para ~ new_endogenous, data = new_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1536.5  -452.4   -80.7   368.4  3780.9 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    -1089.943     61.706  -17.66   <2e-16 ***
#> new_endogenous   119.829      3.078   38.93   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 630.2 on 9998 degrees of freedom
#> Multiple R-squared:  0.1316, Adjusted R-squared:  0.1316 
#> F-statistic:  1516 on 1 and 9998 DF,  p-value: < 2.2e-16

new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response_nonlinear_para ~ endogenous_nonlinear + residual, data = new_data_cf)
summary(second_stage_cf)
#> 
#> Call:
#> lm(formula = response_nonlinear_para ~ endogenous_nonlinear + 
#>     residual, data = new_data_cf)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -961.00 -139.32  -16.02  135.57 1403.62 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          678.1593     9.9177   68.38   <2e-16 ***
#> endogenous_nonlinear  17.7884     0.2759   64.46   <2e-16 ***
#> residual              52.5016     1.1552   45.45   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 231.9 on 9997 degrees of freedom
#> Multiple R-squared:  0.8824, Adjusted R-squared:  0.8824 
#> F-statistic: 3.751e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

modelsummary(list(second_stage, second_stage_cf))
 (1)   (2)
(Intercept) −1089.943 678.159
(61.706) (9.918)
new_endogenous 119.829
(3.078)
endogenous_nonlinear 17.788
(0.276)
residual 52.502
(1.155)
Num.Obs. 10000 10000
R2 0.132 0.882
R2 Adj. 0.132 0.882
AIC 157302.4 137311.3
BIC 157324.1 137340.1
Log.Lik. −78648.225 −68651.628
F 1515.642 37505.777
RMSE 630.10 231.88

References

Petrin, Amil, and Kenneth Train. 2010. “A Control Function Approach to Endogeneity in Consumer Choice Models.” Journal of Marketing Research 47 (1): 3–13.
Tchetgen Tchetgen, Eric. 2014. “A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome.” Epidemiologic Methods 3 (1): 107–12.