16.2 Endogenous Sample Selection

sample selection or self-selection problem

the omitted variable is how people were selected into the sample

Some disciplines consider nonresponse bias and selection bias as sample selection.

  • When unobservable factors that affect who is in the sample are independent of unobservable factors that affect the outcome, the sample selection is not endogenous. Hence, the sample selection is ignorable and estimator that ignores sample selection is still consistent.
  • when the unobservable factors that affect who is included in the sample are correlated with the unobservable factors that affect the outcome, the sample selection is endogenous and not ignorable, because estimators that ignore endogenous sample selection are not consistent (we don’t know which part of the observable outcome is related to the causal relationship and which part is due to different people were selected for the treatment and control groups).

To combat Sample selection, we can

  • Randomization: participants are randomly selected into treatment and control.
  • Instruments that determine the treatment status (i.e., treatment vs. control) but not the outcome (Y)
  • Functional form of the selection and outcome processes: originated from (Heckman 1976), later on generalize by (Amemiya 1984)

We have our main model

\[ \mathbf{y^* = xb + \epsilon} \]

However, the pattern of missingness (i.e., censored) is related to the unobserved (latent) process:

\[ \mathbf{z^* = w \gamma + u} \]

and

\[ z_i = \begin{cases} 1& \text{if } z_i^*>0 \\ 0&\text{if } z_i^*\le0\\ \end{cases} \]

Equivalently, \(z_i = 1\) (\(y_i\) is observed) when

\[ u_i \ge -w_i \gamma \]

Hence, the probability of observed \(y_i\) is

\[ \begin{aligned} P(u_i \ge -w_i \gamma) &= 1 - \Phi(-w_i \gamma) \\ &= \Phi(w_i \gamma) & \text{symmetry of the standard normal distribution} \end{aligned} \]

We will assume

  • the error term of the selection \(\mathbf{u \sim N(0,I)}\)
  • \(Var(u_i) = 1\) for identification purposes

Visually, \(P(u_i \ge -w_i \gamma)\) is the shaded area.

x = seq(-3, 3, length = 200)
y = dnorm(x, mean = 0, sd = 1)
plot(x,
     y,
     type = "l",
     main = bquote("Probabibility distribution of" ~ u[i]))
x = seq(0.3, 3, length = 100)
y = dnorm(x, mean = 0, sd = 1)
polygon(c(0.3, x, 3), c(0, y, 0), col = "gray")
text(1, 0.1, bquote(1 - Phi ~ (-w[i] ~ gamma)))
arrows(-0.5, 0.1, 0.3, 0, length = .15)
text(-0.5, 0.12, bquote(-w[i] ~ gamma))
legend(
  "topright",
  "Gray = Prob of Observed",
  pch = 1,
  title = "legend",
  inset = .02
)

Hence in our observed model, we see

\[\begin{equation} y_i = x_i\beta + \epsilon_i \text{when $z_i=1$} \end{equation}\]

and the joint distribution of the selection model (\(u_i\)), and the observed equation (\(\epsilon_i\)) as

\[ \left[ \begin{array} {c} u \\ \epsilon \\ \end{array} \right] \sim^{iid}N \left( \left[ \begin{array} {c} 0 \\ 0 \\ \end{array} \right], \left[ \begin{array} {cc} 1 & \rho \\ \rho & \sigma^2_{\epsilon} \\ \end{array} \right] \right) \]

The relation between the observed and selection models:

\[ \begin{aligned} E(y_i | y_i \text{ observed}) &= E(y_i| z^*>0) \\ &= E(y_i| -w_i \gamma) \\ &= \mathbf{x}_i \beta + E(\epsilon_i | u_i > -w_i \gamma) \\ &= \mathbf{x}_i \beta + \rho \sigma_\epsilon \frac{\phi(w_i \gamma)}{\Phi(w_i \gamma)} \end{aligned} \]

where \(\frac{\phi(w_i \gamma)}{\Phi(w_i \gamma)}\) is the Inverse Mills Ratio. and \(\rho \sigma_\epsilon \frac{\phi(w_i \gamma)}{\Phi(w_i \gamma)} \ge 0\)

Great visualization of special cases of correlation patterns amongst data and errors by professor Rob Hick

Note:

(Bareinboim, Tian, and Pearl 2014) is an excellent summary of cases that we can still do causal inference in case of selection bias. I’ll try to summarize their idea here:

Let X be an action, Y be an outcome, and S be a binary indicator of entry into the data pool where (S = 1 = in the sample, S = 0 =out of sample) and Q be the conditional distribution \(Q = P(y|x)\).

Usually we want to understand , but because of S, we only have \(P(y, x|S = 1)\). Hence, we’d like to recover \(P(y|x)\) from \(P(y, x|S = 1)\)

  • If both X and Y affect S, we can’t unbiasedly estimate \(P(y|x)\)

In the case of Omitted variable bias (U) and sample selection bias (S), you have unblocked extraneous “flow” of information between X and Y, which causes spurious correlation for X and Y. Traditionally, we would recover Q by parametric assumption of

  1. the data generating process (e.g., Heckman 2-step)
  2. type of data-generating model (e..g, treatment-dependent or outcome-dependent)
  3. selection’s probability \(P(S = 1|P a_s)\) with non-parametrically based causal graphical models, the authors proposed more robust way to model misspecification regardless of the type of data-generating model, and do not require selection’s probability. Hence, you can recover Q
  • without external data
  • with external data
  • causal effects with the Selection-backdoor criterion

16.2.1 Tobit-2

also known as Heckman’s standard sample selection model
Assumption: joint normality of the errors

Data here is taken from (Mroz 1987)’s paper.

We want to estimate the log(wage) for married women, with education, experience, experience squared, and a dummy variable for living in a big city. But we can only observe the wage for women who are working, which means a lot of married women in 1975 who were out of the labor force are unaccounted for. Hence, an OLS estimate of the wage equation would be bias due to sample selection. Since we have data on non-participants (i.e., those who are not working for pay), we can correct for the selection process.

The Tobit-2 estimates are consistent

16.2.1.1 Example 1

library(sampleSelection)
## Loading required package: maxLik
## Loading required package: miscTools
## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data("Mroz87") #1975 data on married women’s pay and labor-force participation from the Panel Study of Income Dynamics (PSID)
head(Mroz87)
##   lfp hours kids5 kids618 age educ   wage repwage hushrs husage huseduc huswage
## 1   1  1610     1       0  32   12 3.3540    2.65   2708     34      12  4.0288
## 2   1  1656     0       2  30   12 1.3889    2.65   2310     30       9  8.4416
## 3   1  1980     1       3  35   12 4.5455    4.04   3072     40      12  3.5807
## 4   1   456     0       3  34   12 1.0965    3.25   1920     53      10  3.5417
## 5   1  1568     1       2  31   14 4.5918    3.60   2000     32      12 10.0000
## 6   1  2032     0       0  54   12 4.7421    4.70   1040     57      11  6.7106
##   faminc    mtr motheduc fatheduc unem city exper  nwifeinc wifecoll huscoll
## 1  16310 0.7215       12        7  5.0    0    14 10.910060    FALSE   FALSE
## 2  21800 0.6615        7        7 11.0    1     5 19.499981    FALSE   FALSE
## 3  21040 0.6915       12        7  5.0    0    15 12.039910    FALSE   FALSE
## 4   7300 0.7815        7        7  5.0    0     6  6.799996    FALSE   FALSE
## 5  27300 0.6215       12       14  9.5    1     7 20.100058     TRUE   FALSE
## 6  19495 0.6915       14        7  7.5    1    33  9.859054    FALSE   FALSE
Mroz87 = Mroz87 %>%
        mutate(kids = kids5+kids618)

library(nnet)
library(ggplot2)
library(reshape2)

2-stage Heckman’s model:

  1. probit equation estimates the selection process (who is in the labor force?)
  2. the results from 1st stage are used to construct a variable that captures the selection effect in the wage equation. This correction variable is called the inverse Mills ratio.
# OLS: log wage regression on LF participants only
ols1 = lm(log(wage) ~ educ + exper + I( exper^2 ) + city, data=subset(Mroz87, lfp==1))
# Heckman's Two-step estimation with LFP selection equation
heck1 = heckit( lfp ~ age + I( age^2 ) + kids + huswage + educ, # the selection process, lfp = 1 if the woman is participating in the labor force 
                 log(wage) ~ educ + exper + I( exper^2 ) + city, data=Mroz87 )

Use only variables that affect the selection process in the selection equation. Technically, the selection equation and the equation of interest could have the same set of regressors. But it is not recommended because we should only use variables (or at least one) in the selection equation that affect the selection process, but not the wage process (i.e., instruments). Here, variable kids fulfill that role: women with kids may be more likely to stay home, but working moms with kids would not have their wages change.

Alternatively,

# ML estimation of selection model
ml1 = selection( lfp ~ age + I( age^2 ) + kids + huswage + educ,
                    log(wage) ~ educ + exper + I( exper^2 ) + city, data=Mroz87 ) 
library("stargazer")
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library("Mediana")
library("plm")
## 
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
# function to calculate corrected SEs for regression 
cse = function(reg) {
  rob = sqrt(diag(vcovHC(reg, type = "HC1")))
  return(rob)
}

# stargazer table
stargazer(ols1, heck1, ml1,    
          se=list(cse(ols1),NULL,NULL), 
          title="Married women's wage regressions", type="text", 
          df=FALSE, digits=4, selection.equation = T)
## 
## Married women's wage regressions
## ==============================================================
##                                Dependent variable:            
##                     ------------------------------------------
##                     log(wage)                lfp              
##                        OLS         Heckman        selection   
##                                   selection                   
##                        (1)           (2)             (3)      
## --------------------------------------------------------------
## age                               0.1861***       0.1842***   
##                                   (0.0652)        (0.0658)    
##                                                               
## I(age2)                          -0.0024***      -0.0024***   
##                                   (0.0008)        (0.0008)    
##                                                               
## kids                             -0.1496***      -0.1488***   
##                                   (0.0383)        (0.0385)    
##                                                               
## huswage                          -0.0430***      -0.0434***   
##                                   (0.0122)        (0.0123)    
##                                                               
## educ                0.1057***     0.1250***       0.1256***   
##                      (0.0130)     (0.0228)        (0.0229)    
##                                                               
## exper               0.0411***                                 
##                      (0.0154)                                 
##                                                               
## I(exper2)            -0.0008*                                 
##                      (0.0004)                                 
##                                                               
## city                  0.0542                                  
##                      (0.0653)                                 
##                                                               
## Constant            -0.5308***   -4.1815***      -4.1484***   
##                      (0.2032)     (1.4024)        (1.4109)    
##                                                               
## --------------------------------------------------------------
## Observations           428           753             753      
## R2                    0.1581       0.1582                     
## Adjusted R2           0.1501       0.1482                     
## Log Likelihood                                    -914.0777   
## rho                                0.0830      0.0505 (0.2317)
## Inverse Mills Ratio            0.0551 (0.2099)                
## Residual Std. Error   0.6667                                  
## F Statistic         19.8561***                                
## ==============================================================
## Note:                              *p<0.1; **p<0.05; ***p<0.01

Rho is an estimate of the correlation of the errors between the selection and wage equations. In the lower panel, the estimated coefficient on the inverse Mills ratio is given for the Heckman model. The fact that it is not statistically different from zero is consistent with the idea that selection bias was not a serious problem in this case.

If the estimated coefficient of the inverse Mills ratio in the Heckman model is not statistically different from zero, then selection bias was not a serious problem.

16.2.1.2 Example 2

This code is from R package sampleSelection

set.seed(0)
library("sampleSelection")
library("mvtnorm")
eps <- rmvnorm(500, c(0,0), matrix(c(1,-0.7,-0.7,1), 2, 2)) # bivariate normal disturbances
xs <- runif(500)# uniformly distributed explanatory variable (vectors of explanatory variables for the selection )
ys <- xs + eps[,1] > 0 # probit data generating process
xo <- runif(500) # vectors of explanatory variables for outcome equation 
yoX <- xo + eps[,2] # latent outcome
yo <- yoX*(ys > 0) # observable outcome
# true intercepts = 0 and our true slopes = 1
# xs and xo are independent. Hence, exclusion restriction is fulfilled
summary( selection(ys~xs, yo ~xo))
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 5 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -712.3163 
## 500 observations (172 censored and 328 observed)
## 6 free parameters (df = 494)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.2228     0.1081  -2.061   0.0399 *  
## xs            1.3377     0.2014   6.642 8.18e-11 ***
## Outcome equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0002265  0.1294178  -0.002    0.999    
## xo           0.7299070  0.1635925   4.462 1.01e-05 ***
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma   0.9190     0.0574  16.009  < 2e-16 ***
## rho    -0.5392     0.1521  -3.544 0.000431 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

without the exclusion restriction, we generate yo using xs instead of xo.

yoX <- xs + eps[,2]
yo <- yoX*(ys > 0)
summary(selection(ys ~ xs, yo ~ xs))
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 14 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -712.8298 
## 500 observations (172 censored and 328 observed)
## 6 free parameters (df = 494)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1984     0.1114  -1.781   0.0756 .  
## xs            1.2907     0.2085   6.191 1.25e-09 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.5499     0.5644  -0.974  0.33038   
## xs            1.3987     0.4482   3.120  0.00191 **
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma  0.85091    0.05352  15.899   <2e-16 ***
## rho   -0.13226    0.72684  -0.182    0.856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

We can see that our estimates are still unbiased but standard errors are substantially larger. The exclusion restriction (i.e., independent information about the selection process) has a certain identifying power that we desire. Hence, it’s better to have different set of variable for the selection process from the interested equation. Without the exclusion restriction, we solely rely on the functional form identification.

16.2.2 Tobit-5

Also known as the switching regression model
Condition: There is at least one variable in X in the selection process not included in the observed process. Used when there are separate models for participants, and non-participants.

set.seed(0)
vc <- diag(3)
vc[lower.tri(vc)] <- c(0.9, 0.5, 0.1)
vc[upper.tri(vc)] <- vc[lower.tri(vc)]
eps <- rmvnorm(500, c(0,0,0), vc) # 3 disturbance vectors by a 3-dimensional normal distribution
xs <- runif(500) # uniformly distributed on [0, 1]
ys <- xs + eps[,1] > 0
xo1 <- runif(500) # uniformly distributed on [0, 1]
yo1 <- xo1 + eps[,2]
xo2 <- runif(500) # uniformly distributed on [0, 1]
yo2 <- xo2 + eps[,3]

exclusion restriction is fulfilled when x’s are independent.

summary(selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2))) # one selection equation and a list of two outcome equations
## --------------------------------------------
## Tobit 5 model (switching regression model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 11 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -895.8201 
## 500 observations: 172 selection 1 (FALSE) and 328 selection 2 (TRUE)
## 10 free parameters (df = 490)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1550     0.1051  -1.474    0.141    
## xs            1.1408     0.1785   6.390 3.86e-10 ***
## Outcome equation 1:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02708    0.16395   0.165    0.869    
## xo1          0.83959    0.14968   5.609  3.4e-08 ***
## Outcome equation 2:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1583     0.1885   0.840    0.401    
## xo2           0.8375     0.1707   4.908 1.26e-06 ***
##    Error terms:
##        Estimate Std. Error t value Pr(>|t|)    
## sigma1  0.93191    0.09211  10.118   <2e-16 ***
## sigma2  0.90697    0.04434  20.455   <2e-16 ***
## rho1    0.88988    0.05353  16.623   <2e-16 ***
## rho2    0.17695    0.33139   0.534    0.594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

All the estimates are close to the true values.

Example of functional form misspecification

set.seed(5)
eps <- rmvnorm(1000, rep(0, 3), vc)
eps <- eps^2 - 1 # subtract 1 in order to get the mean zero disturbances
xs <- runif(1000, -1, 0) # interval [−1, 0] to get an asymmetric distribution over observed choices
ys <- xs + eps[,1] > 0
xo1 <- runif(1000)
yo1 <- xo1 + eps[,2]
xo2 <- runif(1000)
yo2 <- xo2 + eps[,3]
summary(selection(ys~xs, list(yo1 ~ xo1, yo2 ~ xo2), iterlim=20))
## Warning in sqrt(diag(vc)): NaNs produced

## Warning in sqrt(diag(vc)): NaNs produced
## Warning in sqrt(diag(vcov(object, part = "full"))): NaNs produced
## --------------------------------------------
## Tobit 5 model (switching regression model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 4 iterations
## Return code 3: Last step could not find a value above the current.
## Boundary of parameter space?  
## Consider switching to a more robust optimisation method temporarily.
## Log-Likelihood: -1665.936 
## 1000 observations: 760 selection 1 (FALSE) and 240 selection 2 (TRUE)
## 10 free parameters (df = 990)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.53698    0.05808  -9.245  < 2e-16 ***
## xs           0.31268    0.09395   3.328 0.000906 ***
## Outcome equation 1:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.70679    0.03573  -19.78   <2e-16 ***
## xo1          0.91603    0.05626   16.28   <2e-16 ***
## Outcome equation 2:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.1446         NA      NA       NA  
## xo2           1.1196     0.5014   2.233   0.0258 *
##    Error terms:
##        Estimate Std. Error t value Pr(>|t|)    
## sigma1  0.67770    0.01760   38.50   <2e-16 ***
## sigma2  2.31432    0.07615   30.39   <2e-16 ***
## rho1   -0.97137         NA      NA       NA    
## rho2    0.17039         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Although we still have an exclusion restriction (xo1 and xo2 are independent), we now have problems with the intercepts (i.e., they are statistically significantly different from the true values zero), and convergence problems.

If we don’t have the exclusion restriction, we will have a larger variance of xs

set.seed(6)
xs <- runif(1000, -1, 1)
ys <- xs + eps[,1] > 0
yo1 <- xs + eps[,2]
yo2 <- xs + eps[,3]
summary(tmp <- selection(ys~xs, list(yo1 ~ xs, yo2 ~ xs), iterlim=20))
## --------------------------------------------
## Tobit 5 model (switching regression model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 16 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -1936.431 
## 1000 observations: 626 selection 1 (FALSE) and 374 selection 2 (TRUE)
## 10 free parameters (df = 990)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3528     0.0424  -8.321 2.86e-16 ***
## xs            0.8354     0.0756  11.050  < 2e-16 ***
## Outcome equation 1:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.55448    0.06339  -8.748   <2e-16 ***
## xs           0.81764    0.06048  13.519   <2e-16 ***
## Outcome equation 2:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.6457     0.4994   1.293    0.196
## xs            0.3520     0.3197   1.101    0.271
##    Error terms:
##        Estimate Std. Error t value Pr(>|t|)    
## sigma1  0.59187    0.01853  31.935   <2e-16 ***
## sigma2  1.97257    0.07228  27.289   <2e-16 ***
## rho1    0.15568    0.15914   0.978    0.328    
## rho2   -0.01541    0.23370  -0.066    0.947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Usually it will not converge. Even if it does, the results may be seriously biased.

Note

The log-likelihood function of the models might not be globally concave. Hence, it might not converge, or converge to a local maximum. To combat this, we can use

16.2.2.0.1 Pattern-Mixture Models
  • compared to the Heckman’s model where it assumes the value of the missing data is predetermined, pattern-mixture models assume missingness affect the distribution of variable of interest (e.g., Y)
  • To read more, you can check NCSU, stefvanbuuren.