33.7 Control Function

Also known as two-stage residual inclusion

Resources:

Binary outcome and binary endogenous variable application (E. Tchetgen Tchetgen 2014)
- In rare events: we use a logistic model in the 2nd stage
- In non-rare events: use risk ratio regression in the 2nd stage
Application in marketing for consumer choice model (Petrin and Train 2010)

Notes

This approach is better suited for models with nonadditive errors (e.g., discrete choice models), or binary endogenous model, binary response variable, etc.

$Y = g(X) + U \\ X = \pi(Z) + V \\ E(U |Z,V) = E(U|V) \\ E(V|Z) = 0$

Under control function approach,

$E(Y|Z,V) = g(X) + E(U|Z,V) \\ = g(X) + E(U|V) \\ = g(X) + h(V)$

where $h(V)$ is the control function that models the endogeneity

Linear in parameters

Linear Endogenous Variables:
- The control function function approach is identical to the usual 2SLS estimator
Nonlinear Endogenous Variables:
- The control function is different from the 2SLS estimator
Nonlinear in parameters:
- The CF function is superior than the 2SLS estimator

33.7.1 Simulation

library(fixest)
library(tidyverse)
library(modelsummary)

# Set the seed for reproducibility
set.seed(123)
n = 10000
# Generate the exogenous variable from a normal distribution
exogenous <- rnorm(n, mean = 5, sd = 1)

# Generate the omitted variable as a function of the exogenous variable
omitted <- rnorm(n, mean = 2, sd = 1)

# Generate the endogenous variable as a function of the omitted variable and the exogenous variable
endogenous <- 5 * omitted + 2 * exogenous + rnorm(n, mean = 0, sd = 1)

# nonlinear endogenous variable
endogenous_nonlinear <- 5 * omitted^2 + 2 * exogenous + rnorm(100, mean = 0, sd = 1)

unrelated <- rexp(n, rate = 1)

# Generate the response variable as a function of the endogenous variable and the omitted variable
response <- 4 +  3 * endogenous + 6 * omitted + rnorm(n, mean = 0, sd = 1)

response_nonlinear <- 4 +  3 * endogenous_nonlinear + 6 * omitted + rnorm(n, mean = 0, sd = 1)

response_nonlinear_para <- 4 +  3 * endogenous ^ 2 + 6 * omitted + rnorm(n, mean = 0, sd = 1)


# Combine the variables into a data frame
my_data <-
    data.frame(
        exogenous,
        omitted,
        endogenous,
        response,
        unrelated,
        response,
        response_nonlinear,
        response_nonlinear_para
    )

# View the first few rows of the data frame
# head(my_data)

wo_omitted <- feols(response ~ endogenous + sw0(unrelated), data = my_data)
w_omitted  <- feols(response ~ endogenous + omitted + unrelated, data = my_data)


# ivreg::ivreg(response ~ endogenous + unrelated | exogenous, data = my_data)
iv <- feols(response ~ 1 + sw0(unrelated) | endogenous ~ exogenous, data = my_data)

etable(
    wo_omitted,
    w_omitted,
    iv, 
    digits = 2
    # vcov = list("each", "iid", "hetero")
)
#>                   wo_omitted.1   wo_omitted.2      w_omitted           iv.1
#> Dependent Var.:       response       response       response       response
#>                                                                            
#> Constant        -3.9*** (0.10) -4.0*** (0.10)  4.0*** (0.05) 15.7*** (0.59)
#> endogenous      4.0*** (0.005) 4.0*** (0.005) 3.0*** (0.004)  3.0*** (0.03)
#> unrelated                         0.03 (0.03)  0.002 (0.010)               
#> omitted                                        6.0*** (0.02)               
#> _______________ ______________ ______________ ______________ ______________
#> S.E. type                  IID            IID            IID            IID
#> Observations            10,000         10,000         10,000         10,000
#> R2                     0.98566        0.98567        0.99803        0.92608
#> Adj. R2                0.98566        0.98566        0.99803        0.92607
#> 
#>                           iv.2
#> Dependent Var.:       response
#>                               
#> Constant        15.6*** (0.59)
#> endogenous       3.0*** (0.03)
#> unrelated         0.10. (0.06)
#> omitted                       
#> _______________ ______________
#> S.E. type                  IID
#> Observations            10,000
#> R2                     0.92610
#> Adj. R2                0.92608
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear in parameter and linear in endogenous variable

# manual
# 2SLS
first_stage = lm(endogenous ~ exogenous, data = my_data)
new_data = cbind(my_data, new_endogenous = predict(first_stage, my_data))
second_stage = lm(response ~ new_endogenous, data = new_data)
summary(second_stage)
#> 
#> Call:
#> lm(formula = response ~ new_endogenous, data = new_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -77.683 -14.374  -0.107  14.289  78.274 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     15.6743     2.0819   7.529 5.57e-14 ***
#> new_endogenous   3.0142     0.1039  29.025  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 21.26 on 9998 degrees of freedom
#> Multiple R-squared:  0.07771,    Adjusted R-squared:  0.07762 
#> F-statistic: 842.4 on 1 and 9998 DF,  p-value: < 2.2e-16

new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response ~ endogenous + residual, data = new_data_cf)
summary(second_stage_cf)
#> 
#> Call:
#> lm(formula = response ~ endogenous + residual, data = new_data_cf)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.360 -1.016  0.003  1.023  5.201 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 15.674265   0.149350   105.0   <2e-16 ***
#> endogenous   3.014202   0.007450   404.6   <2e-16 ***
#> residual     1.140920   0.008027   142.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.525 on 9997 degrees of freedom
#> Multiple R-squared:  0.9953, Adjusted R-squared:  0.9953 
#> F-statistic: 1.048e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

modelsummary(list(second_stage, second_stage_cf))

	(1)	(2)
(Intercept)	15.674	15.674
	(2.082)	(0.149)
new_endogenous	3.014
	(0.104)
endogenous		3.014
		(0.007)
residual		1.141
		(0.008)
Num.Obs.	10000	10000
R2	0.078	0.995
R2 Adj.	0.078	0.995
AIC	89520.9	36826.8
BIC	89542.5	36855.6
Log.Lik.	−44757.438	−18409.377
F	842.424	1048263.304
RMSE	21.26	1.53

Nonlinear in endogenous variable

# 2SLS
first_stage = lm(endogenous_nonlinear ~ exogenous, data = my_data)

new_data = cbind(my_data, new_endogenous_nonlinear = predict(first_stage, my_data))
second_stage = lm(response_nonlinear ~ new_endogenous_nonlinear, data = new_data)
summary(second_stage)
#> 
#> Call:
#> lm(formula = response_nonlinear ~ new_endogenous_nonlinear, data = new_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -94.43 -52.10 -15.29  36.50 446.08 
#> 
#> Coefficients:
#>                          Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               15.3390    11.8175   1.298    0.194    
#> new_endogenous_nonlinear   3.0174     0.3376   8.938   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 69.51 on 9998 degrees of freedom
#> Multiple R-squared:  0.007927,   Adjusted R-squared:  0.007828 
#> F-statistic: 79.89 on 1 and 9998 DF,  p-value: < 2.2e-16

new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response_nonlinear ~ endogenous_nonlinear + residual, data = new_data_cf)
summary(second_stage_cf)
#> 
#> Call:
#> lm(formula = response_nonlinear ~ endogenous_nonlinear + residual, 
#>     data = new_data_cf)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -17.5437  -0.8348   0.4614   1.4424   4.8154 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          15.33904    0.38459   39.88   <2e-16 ***
#> endogenous_nonlinear  3.01737    0.01099  274.64   <2e-16 ***
#> residual              0.24919    0.01104   22.58   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.262 on 9997 degrees of freedom
#> Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
#> F-statistic: 4.753e+06 on 2 and 9997 DF,  p-value: < 2.2e-16

modelsummary(list(second_stage, second_stage_cf))

	(1)	(2)
(Intercept)	15.339	15.339
	(11.817)	(0.385)
new_endogenous_nonlinear	3.017
	(0.338)
endogenous_nonlinear		3.017
		(0.011)
residual		0.249
		(0.011)
Num.Obs.	10000	10000
R2	0.008	0.999
R2 Adj.	0.008	0.999
AIC	113211.6	44709.6
BIC	113233.2	44738.4
Log.Lik.	−56602.782	−22350.801
F	79.887	4752573.052
RMSE	69.50	2.26

Nonlinear in parameters

# 2SLS
first_stage = lm(endogenous ~ exogenous, data = my_data)

new_data = cbind(my_data, new_endogenous = predict(first_stage, my_data))
second_stage = lm(response_nonlinear_para ~ new_endogenous, data = new_data)
summary(second_stage)
#> 
#> Call:
#> lm(formula = response_nonlinear_para ~ new_endogenous, data = new_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1536.5  -452.4   -80.7   368.4  3780.9 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    -1089.943     61.706  -17.66   <2e-16 ***
#> new_endogenous   119.829      3.078   38.93   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 630.2 on 9998 degrees of freedom
#> Multiple R-squared:  0.1316, Adjusted R-squared:  0.1316 
#> F-statistic:  1516 on 1 and 9998 DF,  p-value: < 2.2e-16

new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response_nonlinear_para ~ endogenous_nonlinear + residual, data = new_data_cf)
summary(second_stage_cf)
#> 
#> Call:
#> lm(formula = response_nonlinear_para ~ endogenous_nonlinear + 
#>     residual, data = new_data_cf)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -961.00 -139.32  -16.02  135.57 1403.62 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          678.1593     9.9177   68.38   <2e-16 ***
#> endogenous_nonlinear  17.7884     0.2759   64.46   <2e-16 ***
#> residual              52.5016     1.1552   45.45   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 231.9 on 9997 degrees of freedom
#> Multiple R-squared:  0.8824, Adjusted R-squared:  0.8824 
#> F-statistic: 3.751e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

modelsummary(list(second_stage, second_stage_cf))

	(1)	(2)
(Intercept)	−1089.943	678.159
	(61.706)	(9.918)
new_endogenous	119.829
	(3.078)
endogenous_nonlinear		17.788
		(0.276)
residual		52.502
		(1.155)
Num.Obs.	10000	10000
R2	0.132	0.882
R2 Adj.	0.132	0.882
AIC	157302.4	137311.3
BIC	157324.1	137340.1
Log.Lik.	−78648.225	−68651.628
F	1515.642	37505.777
RMSE	630.10	231.88

References

Petrin, Amil, and Kenneth Train. 2010. “A Control Function Approach to Endogeneity in Consumer Choice Models.” Journal of Marketing Research 47 (1): 3–13.

Tchetgen Tchetgen, Eric. 2014. “A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome.” Epidemiologic Methods 3 (1): 107–12.