30.7 Control Function
Also known as two-stage residual inclusion
Resources:
Binary outcome and binary endogenous variable application (E. Tchetgen Tchetgen 2014)
In rare events: we use a logistic model in the 2nd stage
In non-rare events: use risk ratio regression in the 2nd stage
Application in marketing for consumer choice model (Petrin and Train 2010)
Notes
- This approach is better suited for models with nonadditive errors (e.g., discrete choice models), or binary endogenous model, binary response variable, etc.
\[ Y = g(X) + U \\ X = \pi(Z) + V \\ E(U |Z,V) = E(U|V) \\ E(V|Z) = 0 \]
Under control function approach,
\[ E(Y|Z,V) = g(X) + E(U|Z,V) \\ = g(X) + E(U|V) \\ = g(X) + h(V) \]
where \(h(V)\) is the control function that models the endogeneity
- Linear in parameters
- Linear Endogenous Variables:
- The control function function approach is identical to the usual 2SLS estimator
- Nonlinear Endogenous Variables:
- The control function is different from the 2SLS estimator
- Nonlinear in parameters:
- The CF function is superior than the 2SLS estimator
30.7.1 Simulation
library(fixest)
library(tidyverse)
library(modelsummary)
# Set the seed for reproducibility
set.seed(123)
n = 10000
# Generate the exogenous variable from a normal distribution
exogenous <- rnorm(n, mean = 5, sd = 1)
# Generate the omitted variable as a function of the exogenous variable
omitted <- rnorm(n, mean = 2, sd = 1)
# Generate the endogenous variable as a function of the omitted variable and the exogenous variable
endogenous <- 5 * omitted + 2 * exogenous + rnorm(n, mean = 0, sd = 1)
# nonlinear endogenous variable
endogenous_nonlinear <- 5 * omitted^2 + 2 * exogenous + rnorm(100, mean = 0, sd = 1)
unrelated <- rexp(n, rate = 1)
# Generate the response variable as a function of the endogenous variable and the omitted variable
response <- 4 + 3 * endogenous + 6 * omitted + rnorm(n, mean = 0, sd = 1)
response_nonlinear <- 4 + 3 * endogenous_nonlinear + 6 * omitted + rnorm(n, mean = 0, sd = 1)
response_nonlinear_para <- 4 + 3 * endogenous ^ 2 + 6 * omitted + rnorm(n, mean = 0, sd = 1)
# Combine the variables into a data frame
my_data <-
data.frame(
exogenous,
omitted,
endogenous,
response,
unrelated,
response,
response_nonlinear,
response_nonlinear_para
)
# View the first few rows of the data frame
# head(my_data)
wo_omitted <- feols(response ~ endogenous + sw0(unrelated), data = my_data)
w_omitted <- feols(response ~ endogenous + omitted + unrelated, data = my_data)
# ivreg::ivreg(response ~ endogenous + unrelated | exogenous, data = my_data)
iv <- feols(response ~ 1 + sw0(unrelated) | endogenous ~ exogenous, data = my_data)
etable(
wo_omitted,
w_omitted,
iv,
digits = 2
# vcov = list("each", "iid", "hetero")
)
#> wo_omitted.1 wo_omitted.2 w_omitted iv.1
#> Dependent Var.: response response response response
#>
#> Constant -3.9*** (0.10) -4.0*** (0.10) 4.0*** (0.05) 15.7*** (0.59)
#> endogenous 4.0*** (0.005) 4.0*** (0.005) 3.0*** (0.004) 3.0*** (0.03)
#> unrelated 0.03 (0.03) 0.002 (0.010)
#> omitted 6.0*** (0.02)
#> _______________ ______________ ______________ ______________ ______________
#> S.E. type IID IID IID IID
#> Observations 10,000 10,000 10,000 10,000
#> R2 0.98566 0.98567 0.99803 0.92608
#> Adj. R2 0.98566 0.98566 0.99803 0.92607
#>
#> iv.2
#> Dependent Var.: response
#>
#> Constant 15.6*** (0.59)
#> endogenous 3.0*** (0.03)
#> unrelated 0.10. (0.06)
#> omitted
#> _______________ ______________
#> S.E. type IID
#> Observations 10,000
#> R2 0.92610
#> Adj. R2 0.92608
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Linear in parameter and linear in endogenous variable
# manual
# 2SLS
first_stage = lm(endogenous ~ exogenous, data = my_data)
new_data = cbind(my_data, new_endogenous = predict(first_stage, my_data))
second_stage = lm(response ~ new_endogenous, data = new_data)
summary(second_stage)
#>
#> Call:
#> lm(formula = response ~ new_endogenous, data = new_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -77.683 -14.374 -0.107 14.289 78.274
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 15.6743 2.0819 7.529 5.57e-14 ***
#> new_endogenous 3.0142 0.1039 29.025 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 21.26 on 9998 degrees of freedom
#> Multiple R-squared: 0.07771, Adjusted R-squared: 0.07762
#> F-statistic: 842.4 on 1 and 9998 DF, p-value: < 2.2e-16
new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response ~ endogenous + residual, data = new_data_cf)
summary(second_stage_cf)
#>
#> Call:
#> lm(formula = response ~ endogenous + residual, data = new_data_cf)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.360 -1.016 0.003 1.023 5.201
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 15.674265 0.149350 105.0 <2e-16 ***
#> endogenous 3.014202 0.007450 404.6 <2e-16 ***
#> residual 1.140920 0.008027 142.1 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.525 on 9997 degrees of freedom
#> Multiple R-squared: 0.9953, Adjusted R-squared: 0.9953
#> F-statistic: 1.048e+06 on 2 and 9997 DF, p-value: < 2.2e-16
modelsummary(list(second_stage, second_stage_cf))
(1) | (2) | |
---|---|---|
(Intercept) | 15.674 | 15.674 |
(2.082) | (0.149) | |
new_endogenous | 3.014 | |
(0.104) | ||
endogenous | 3.014 | |
(0.007) | ||
residual | 1.141 | |
(0.008) | ||
Num.Obs. | 10000 | 10000 |
R2 | 0.078 | 0.995 |
R2 Adj. | 0.078 | 0.995 |
AIC | 89520.9 | 36826.8 |
BIC | 89542.5 | 36855.6 |
Log.Lik. | −44757.438 | −18409.377 |
F | 842.424 | 1048263.304 |
RMSE | 21.26 | 1.53 |
Nonlinear in endogenous variable
# 2SLS
first_stage = lm(endogenous_nonlinear ~ exogenous, data = my_data)
new_data = cbind(my_data, new_endogenous_nonlinear = predict(first_stage, my_data))
second_stage = lm(response_nonlinear ~ new_endogenous_nonlinear, data = new_data)
summary(second_stage)
#>
#> Call:
#> lm(formula = response_nonlinear ~ new_endogenous_nonlinear, data = new_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -94.43 -52.10 -15.29 36.50 446.08
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 15.3390 11.8175 1.298 0.194
#> new_endogenous_nonlinear 3.0174 0.3376 8.938 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 69.51 on 9998 degrees of freedom
#> Multiple R-squared: 0.007927, Adjusted R-squared: 0.007828
#> F-statistic: 79.89 on 1 and 9998 DF, p-value: < 2.2e-16
new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response_nonlinear ~ endogenous_nonlinear + residual, data = new_data_cf)
summary(second_stage_cf)
#>
#> Call:
#> lm(formula = response_nonlinear ~ endogenous_nonlinear + residual,
#> data = new_data_cf)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.5437 -0.8348 0.4614 1.4424 4.8154
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 15.33904 0.38459 39.88 <2e-16 ***
#> endogenous_nonlinear 3.01737 0.01099 274.64 <2e-16 ***
#> residual 0.24919 0.01104 22.58 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.262 on 9997 degrees of freedom
#> Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989
#> F-statistic: 4.753e+06 on 2 and 9997 DF, p-value: < 2.2e-16
modelsummary(list(second_stage, second_stage_cf))
(1) | (2) | |
---|---|---|
(Intercept) | 15.339 | 15.339 |
(11.817) | (0.385) | |
new_endogenous_nonlinear | 3.017 | |
(0.338) | ||
endogenous_nonlinear | 3.017 | |
(0.011) | ||
residual | 0.249 | |
(0.011) | ||
Num.Obs. | 10000 | 10000 |
R2 | 0.008 | 0.999 |
R2 Adj. | 0.008 | 0.999 |
AIC | 113211.6 | 44709.6 |
BIC | 113233.2 | 44738.4 |
Log.Lik. | −56602.782 | −22350.801 |
F | 79.887 | 4752573.052 |
RMSE | 69.50 | 2.26 |
Nonlinear in parameters
# 2SLS
first_stage = lm(endogenous ~ exogenous, data = my_data)
new_data = cbind(my_data, new_endogenous = predict(first_stage, my_data))
second_stage = lm(response_nonlinear_para ~ new_endogenous, data = new_data)
summary(second_stage)
#>
#> Call:
#> lm(formula = response_nonlinear_para ~ new_endogenous, data = new_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1536.5 -452.4 -80.7 368.4 3780.9
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1089.943 61.706 -17.66 <2e-16 ***
#> new_endogenous 119.829 3.078 38.93 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 630.2 on 9998 degrees of freedom
#> Multiple R-squared: 0.1316, Adjusted R-squared: 0.1316
#> F-statistic: 1516 on 1 and 9998 DF, p-value: < 2.2e-16
new_data_cf = cbind(my_data, residual = resid(first_stage))
second_stage_cf = lm(response_nonlinear_para ~ endogenous_nonlinear + residual, data = new_data_cf)
summary(second_stage_cf)
#>
#> Call:
#> lm(formula = response_nonlinear_para ~ endogenous_nonlinear +
#> residual, data = new_data_cf)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -961.00 -139.32 -16.02 135.57 1403.62
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 678.1593 9.9177 68.38 <2e-16 ***
#> endogenous_nonlinear 17.7884 0.2759 64.46 <2e-16 ***
#> residual 52.5016 1.1552 45.45 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 231.9 on 9997 degrees of freedom
#> Multiple R-squared: 0.8824, Adjusted R-squared: 0.8824
#> F-statistic: 3.751e+04 on 2 and 9997 DF, p-value: < 2.2e-16
modelsummary(list(second_stage, second_stage_cf))
(1) | (2) | |
---|---|---|
(Intercept) | −1089.943 | 678.159 |
(61.706) | (9.918) | |
new_endogenous | 119.829 | |
(3.078) | ||
endogenous_nonlinear | 17.788 | |
(0.276) | ||
residual | 52.502 | |
(1.155) | ||
Num.Obs. | 10000 | 10000 |
R2 | 0.132 | 0.882 |
R2 Adj. | 0.132 | 0.882 |
AIC | 157302.4 | 137311.3 |
BIC | 157324.1 | 137340.1 |
Log.Lik. | −78648.225 | −68651.628 |
F | 1515.642 | 37505.777 |
RMSE | 630.10 | 231.88 |
References
Petrin, Amil, and Kenneth Train. 2010. “A Control Function Approach to Endogeneity in Consumer Choice Models.” Journal of Marketing Research 47 (1): 3–13.
Tchetgen Tchetgen, Eric. 2014. “A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome.” Epidemiologic Methods 3 (1): 107–12.