6.3 Probit model
The probit model also has a binary outcome as its dependent variable. There is a latent (unobserved) random variable, \(Y_i^*\), that defines the structure of the estimation problem \(Y_i=\begin{Bmatrix} 1, & Y_i^* \geq 0\\ 0, & Y_i^* < 0 \end{Bmatrix},\)
where \(Y_i^*={\bf{x}}_i^{\top}\beta+\mu_i\), \(\mu_i\stackrel{iid} {\sim}N(0,1)\). Then,
\[\begin{align} P[Y_i=1]&=P[Y_i^*\geq 0]\\ &=P[\mathbf{x}_i^{\top}\beta+\mu_i\geq 0]\\ &=P[\mu_i\geq -\mathbf{x}_i^{\top}\beta]\\ &=1-P[\mu_i < -\mathbf{x}_i^{\top}\beta]\\ &=P[\mu_i < \mathbf{x}_i^{\top}\beta]\\ &=\Phi(\mathbf{x}_i^{\top}\beta), \end{align}\]
where the second-to-last equality follows from the symmetry of the standard normal distribution at 0, and \(\Phi\) denotes its cumulative distribution function. In addition, observe that the previous calculations do not change if we multiply \(Y_i^*\) by a positive constant; this implies identification issues regarding scale. Intuitively, this is because we observe only 0's and 1's, which are driven by the unobserved latent variable \(Y_i^*\). This issue is also present in the logit model, and it is why we set the variance of \(\mu_i\) equal to 1.
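This scale invariance is easy to check numerically; a minimal sketch in R, where xb and c0 are arbitrary illustrative values:

xb <- 0.7 # Hypothetical value of x_i' beta
c0 <- 3 # Any positive constant
pnorm(xb) # P(Y = 1) when the error standard deviation is 1
pnorm(c0 * xb, sd = c0) # Same probability after scaling Y* by c0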
Albert and Chib (1993) implemented data augmentation (Tanner and Wong 1987) to apply a Gibbs sampling algorithm in this model. Augmenting the model with \(Y_i^*\), the likelihood contribution from observation \(i\) is \(p(y_i|y_i^*)=1_{y_i=0}1_{y_i^*< 0}+1_{y_i=1}1_{y_i^*\geq 0}\), where \(1_A\) is an indicator function that takes the value of 1 when condition \(A\) is satisfied, and 0 otherwise.
The posterior distribution is \(\pi(\beta,{\bf{Y}}^*|{\bf{y}},{\bf{X}})\propto\prod_{i=1}^N\left[1_{y_i=0}1_{y_i^*< 0}+1_{y_i=1}1_{y_i^*\geq 0}\right] \times N_N({\bf{Y}}^*|{\bf{X}}\beta,{\bf{I}}_N)\times N_K(\beta|\beta_0,{\bf{B}}_0)\) when we take a normal prior, \(\beta\sim N(\beta_0,{\bf{B}}_0)\).
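For reference, the log kernel of this augmented posterior can be coded directly; a minimal sketch, where logpost is a hypothetical helper and the prior density is evaluated with the mvtnorm package:

logpost <- function(beta, ystar, y, X, b0, B0) {
  # Indicator terms: y_i = 1 requires y_i* >= 0, y_i = 0 requires y_i* < 0
  if (!all(ifelse(y == 1, ystar >= 0, ystar < 0))) return(-Inf)
  sum(dnorm(ystar, mean = X %*% beta, sd = 1, log = TRUE)) + # log N(Y*|X beta, I)
    mvtnorm::dmvnorm(beta, mean = b0, sigma = B0, log = TRUE) # log N(beta|beta_0, B_0)
}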
The conditional posterior distribution of the latent variable is
\[\begin{align} Y_i^*|\beta,{\bf{y}},{\bf{X}}&\sim\begin{Bmatrix} TN_{[0,\infty)}({\bf{x}}_i^{\top}\beta,1), & y_i= 1\\ TN_{(-\infty,0)}({\bf{x}}_i^{\top}\beta,1), & y_i= 0 \\ \end{Bmatrix}, \end{align}\]
where \(TN_A\) denotes a normal density truncated to the interval \(A\).
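Draws from these truncated normals can be generated with the inverse-CDF method; a minimal sketch, where rtnormProbit is a hypothetical helper and mu = -0.5 is an illustrative value:

rtnormProbit <- function(n, mu, lower, upper) {
  # Map uniform draws through the normal CDF restricted to (lower, upper)
  u <- runif(n, pnorm(lower, mu, 1), pnorm(upper, mu, 1))
  qnorm(u, mu, 1)
}
rtnormProbit(3, mu = -0.5, lower = 0, upper = Inf) # Case y_i = 1
rtnormProbit(3, mu = -0.5, lower = -Inf, upper = 0) # Case y_i = 0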
The conditional posterior distribution of the location parameters is
\[\begin{align} \beta|{\bf{Y}}^*, {\bf{X}} & \sim N(\beta_n,\bf{B}_n), \end{align}\]
where \({\bf{B}}_n = ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}\), and \(\beta_n= {\bf{B}}_n({\bf{B}}_0^{-1}\beta_0 + {\bf{X}}^{\top}{\bf{Y}}^*)\).
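These two conditional posteriors are all the Gibbs sampler requires (programming it is Exercise 1). A minimal sketch under the normal prior above; gibbsProbit is an illustrative helper, not the bayesm implementation used in the application below:

gibbsProbit <- function(y, X, b0, B0, S = 1000) {
  K <- ncol(X); n <- nrow(X)
  B0i <- solve(B0) # Prior precision
  Bn <- solve(B0i + crossprod(X)) # B_n, fixed across iterations
  cBn <- t(chol(Bn)) # Cholesky factor used to draw from N(beta_n, B_n)
  beta <- rep(0, K); betaDraws <- matrix(NA_real_, S, K)
  for (s in 1:S) {
    mu <- as.vector(X %*% beta)
    p0 <- pnorm(0, mu, 1) # P(Y_i* < 0 | beta)
    u <- runif(n, ifelse(y == 1, p0, 0), ifelse(y == 1, 1, p0))
    ystar <- qnorm(u, mu, 1) # Inverse-CDF draws from the truncated normals
    bn <- Bn %*% (B0i %*% b0 + crossprod(X, ystar)) # beta_n
    beta <- as.vector(bn + cBn %*% rnorm(K)) # Draw beta | Y*
    betaDraws[s, ] <- beta
  }
  betaDraws
}

With the prior and data of the application below, gibbsProbit(y, X, rep(0, 10), diag(10), S = 10000) should produce draws comparable to those from bayesm::rbprobitGibbs.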
Application: Determinants of hospitalization in Medellín
We use the dataset named 2HealthMed.csv, which is in the folder DataApp (see Table 13.3 for details) of our GitHub repository (https://github.com/besmarter/BSTApp), and was used by Ramírez Hassan, Cardona Jiménez, and Cadavid Montoya (2013). Our dependent variable is a binary indicator equal to 1 if an individual was hospitalized in 2007, and 0 otherwise.
The specification of the model is \[\begin{align} \text{Hosp}_i&=\beta_1+\beta_2\text{SHI}_i+\beta_3\text{Female}_i+\beta_4\text{Age}_i+\beta_5\text{Age}_i^2+\beta_6\text{Est2}_i+\beta_7\text{Est3}_i\\ &+\beta_8\text{Fair}_i+\beta_9\text{Good}_i+\beta_{10}\text{Excellent}_i, \end{align}\]
where SHI is a binary variable equal to 1 if the individual is in a subsidized health care program and 0 otherwise, Female is an indicator of gender, Age is age in years, Est2 and Est3 are indicators of socio-economic status (Est1, the lowest, is the reference), and Fair, Good, and Excellent are indicators of self-perceived health status (Bad is the reference).
Let’s set \(\beta_0={\bf{0}}_{10}\) and \({\bf{B}}_0={\bf{I}}_{10}\), with the number of iterations, burn-in, and thinning parameters equal to 10000, 1000, and 1, respectively.
mydata <- read.csv("DataApplications/2HealthMed.csv", header = TRUE, sep = ",")
attach(mydata)
str(mydata)
## 'data.frame': 12975 obs. of 22 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MedVisPrev : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MedVisPrevOr: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hosp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SHI : int 1 1 1 1 0 0 1 1 0 0 ...
## $ Female : int 0 1 1 1 0 1 0 1 0 1 ...
## $ Age : int 7 39 23 15 8 54 64 40 6 7 ...
## $ Age2 : int 49 1521 529 225 64 2916 4096 1600 36 49 ...
## $ FemaleAge : int 0 39 23 15 0 54 0 40 0 7 ...
## $ Est1 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Est2 : int 0 1 1 1 0 1 1 1 0 0 ...
## $ Est3 : int 0 0 0 0 1 0 0 0 1 1 ...
## $ Bad : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Fair : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Good : int 1 1 1 1 0 0 0 1 1 1 ...
## $ Excellent : int 0 0 0 0 1 1 0 0 0 0 ...
## $ NoEd : int 1 0 0 0 1 0 0 0 1 1 ...
## $ PriEd : int 0 0 0 0 0 1 1 1 0 0 ...
## $ HighEd : int 0 1 1 1 0 0 0 0 0 0 ...
## $ VocEd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UnivEd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PTL : num 0.43 0 0 0 0 0.06 0 0.38 0 1 ...
K <- 10 # Number of regressors
b0 <- rep(0, K) # Prior mean
B0i <- diag(K) # Prior precision (inverse of covariance)
Prior <- list(betabar = b0, A = B0i) # Prior list
y <- Hosp # Dependent variable
X <- cbind(1, SHI, Female, Age, Age2, Est2, Est3, Fair, Good, Excellent) # Regressors
Data <- list(y = y, X = X) # Data list
Mcmc <- list(R = 10000, keep = 1, nprint = 0) # MCMC parameters
RegProb <- bayesm::rbprobitGibbs(Data = Data, Prior = Prior, Mcmc = Mcmc) # Inference using the bayesm package
##
## Starting Gibbs Sampler for Binary Probit Model
## with 12975 observations
## Table of y Values
## y
## 0 1
## 12571 404
##
## Prior Parms:
## betabar
## [1] 0 0 0 0 0 0 0 0 0 0
## A
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 0 1 0 0 0 0
## [7,] 0 0 0 0 0 0 1 0 0 0
## [8,] 0 0 0 0 0 0 0 1 0 0
## [9,] 0 0 0 0 0 0 0 0 1 0
## [10,] 0 0 0 0 0 0 0 0 0 1
##
## MCMC parms:
## R= 10000 keep= 1 nprint= 0
##
PostPar <- coda::mcmc(RegProb$betadraw) # Posterior draws
colnames(PostPar) <- c("Cte", "SHI", "Female", "Age", "Age2", "Est2",
                       "Est3", "Fair", "Good", "Excellent") # Names
summary(PostPar) # Posterior summary
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Cte -9.378e-01 1.363e-01 1.363e-03 3.601e-03
## SHI -6.933e-03 5.868e-02 5.868e-04 2.193e-03
## Female 1.266e-01 4.895e-02 4.895e-04 1.797e-03
## Age -1.533e-04 3.625e-03 3.625e-05 1.199e-04
## Age2 4.245e-05 4.354e-05 4.354e-07 1.318e-06
## Est2 -8.793e-02 5.231e-02 5.231e-04 1.805e-03
## Est3 -4.495e-02 8.050e-02 8.050e-04 2.751e-03
## Fair -4.937e-01 1.133e-01 1.133e-03 2.069e-03
## Good -1.204e+00 1.121e-01 1.121e-03 2.312e-03
## Excellent -1.056e+00 1.339e-01 1.339e-03 3.523e-03
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Cte -1.208e+00 -1.030e+00 -9.361e-01 -8.448e-01 -0.6733154
## SHI -1.199e-01 -4.581e-02 -7.643e-03 3.107e-02 0.1121491
## Female 3.131e-02 9.345e-02 1.271e-01 1.597e-01 0.2212118
## Age -7.196e-03 -2.584e-03 -1.551e-04 2.287e-03 0.0070899
## Age2 -4.363e-05 1.317e-05 4.217e-05 7.254e-05 0.0001262
## Est2 -1.910e-01 -1.235e-01 -8.748e-02 -5.274e-02 0.0147171
## Est3 -2.026e-01 -9.924e-02 -4.423e-02 8.618e-03 0.1118982
## Fair -7.137e-01 -5.700e-01 -4.946e-01 -4.186e-01 -0.2690228
## Good -1.421e+00 -1.279e+00 -1.206e+00 -1.130e+00 -0.9812539
## Excellent -1.322e+00 -1.146e+00 -1.056e+00 -9.661e-01 -0.7899088
Our results suggest that Female and health status are relevant variables for hospitalization, as their 95% credible intervals do not cross 0. Women have a higher probability of being hospitalized than men, and people with a bad self-perception of their health status also have a higher probability of being hospitalized (the coefficients on Fair, Good, and Excellent are negative relative to the Bad reference). We get the same results programming a Gibbs sampling algorithm (see Exercise 1) and using our GUI. We also see that there are posterior convergence issues (see Exercise 2).
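A first hint of these issues is already visible above: the time-series standard errors are several times larger than the naive ones, which points to autocorrelated draws. Standard coda diagnostics applied to the posterior chain make this concrete; a minimal sketch of checks one might run:

coda::effectiveSize(PostPar) # Effective sample size per parameter
coda::geweke.diag(PostPar) # Geweke test: early vs. late chain means
coda::autocorr.diag(PostPar) # Autocorrelation of the draws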