## 6.3 Probit model

The probit model also has a binary outcome as its dependent variable. There is a latent (unobserved) random variable, $$Y_i^*$$, that defines the structure of the estimation problem $$Y_i=\begin{Bmatrix} 1, & Y_i^* \geq 0\\ 0, & Y_i^* < 0 \end{Bmatrix},$$

where $$Y_i^*={\bf{x}}_i^{\top}\beta+\mu_i$$, $$\mu_i\stackrel{iid} {\sim}N(0,1)$$. Then,

\begin{align} P[Y_i=1]&=P[Y_i^*\geq 0]\\ &=P[\mathbf{x}_i^{\top}\beta+\mu_i\geq 0]\\ &=P[\mu_i\geq -\mathbf{x}_i^{\top}\beta]\\ &=1-P[\mu_i < -\mathbf{x}_i^{\top}\beta]\\ &=P[\mu_i < \mathbf{x}_i^{\top}\beta], \end{align}

where the last equality follows from the symmetry of the standard normal distribution around 0, so that $$P[Y_i=1]=\Phi({\bf{x}}_i^{\top}\beta)$$, where $$\Phi$$ is the standard normal cumulative distribution function. In addition, observe that the previous calculations do not change if we multiply $$Y_i^*$$ by a positive constant, which implies an identification issue regarding scale. Intuitively, this is because we only observe 0's or 1's that are driven by the unobserved latent variable $$Y_i^*$$. This issue is also present in the logit model, and it is why we set the variance of $$\mu_i$$ equal to 1.
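To make the scale-identification point concrete, the following sketch (simulated data with hypothetical coefficients) checks that multiplying the latent index by a positive constant, while scaling the error standard deviation by the same constant, leaves the implied success probabilities unchanged, so $$(\beta,1)$$ and $$(k\beta,k)$$ are observationally equivalent:

```r
# Scale identification in the probit model: a small simulated check.
set.seed(10)
n <- 1000
x <- rnorm(n)                    # hypothetical regressor
beta <- c(0.5, 1.2)              # hypothetical coefficients
index <- beta[1] + beta[2] * x   # latent index x_i' beta
p1 <- pnorm(index)               # P[Y_i = 1] when Var(mu_i) = 1
k <- 3                           # any positive constant
p2 <- pnorm(k * index, sd = k)   # scaling Y_i* by k scales index and error sd
all.equal(p1, p2)                # TRUE: the observed probabilities coincide
```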

Albert and Chib (1993) implemented data augmentation (Tanner and Wong 1987) to apply a Gibbs sampling algorithm in this model. Augmenting this model with $$Y_i^*$$, the likelihood contribution from observation $$i$$ is $$p(y_i|y_i^*)=1_{y_i=0}1_{y_i^*< 0}+1_{y_i=1}1_{y_i^*\geq 0}$$, where $$1_A$$ is an indicator function that takes the value of 1 when condition $$A$$ is satisfied.

The posterior distribution is $$\pi(\beta,{\bf{Y}}^*|{\bf{y}},{\bf{X}})\propto\prod_{i=1}^N\left[1_{y_i=0}1_{y_i^*< 0}+1_{y_i=1}1_{y_i^*\geq 0}\right] \times N_N({\bf{Y}}^*|{\bf{X}}\beta,{\bf{I}}_N)\times N_K(\beta|\beta_0,{\bf{B}}_0)$$, where we take a normal prior, $$\beta\sim N(\beta_0,{\bf{B}}_0)$$.

The conditional posterior distribution of the latent variable is

\begin{align} Y_i^*|\beta,{\bf{y}},{\bf{X}}&\sim\begin{Bmatrix} TN_{[0,\infty)}({\bf{x}}_i^{\top}\beta,1), & y_i= 1\\ TN_{(-\infty,0)}({\bf{x}}_i^{\top}\beta,1), & y_i= 0 \\ \end{Bmatrix}, \end{align}

where $$TN_A$$ denotes a truncated normal density in the interval $$A$$.
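Draws from these truncated normal conditionals can be obtained with the inverse-CDF method (packages such as truncnorm offer alternatives). The following is a minimal sketch, where `m` stands in for a hypothetical linear index $${\bf{x}}_i^{\top}\beta$$:

```r
# Sample from N(m, 1) truncated to (lower, upper) via the inverse-CDF method.
rtnorm01 <- function(m, lower, upper) {
  u <- runif(1, pnorm(lower, mean = m), pnorm(upper, mean = m))
  qnorm(u, mean = m)
}
set.seed(10)
m <- 0.3                     # hypothetical value of x_i' beta
y1 <- rtnorm01(m, 0, Inf)    # latent draw when y_i = 1: support [0, Inf)
y0 <- rtnorm01(m, -Inf, 0)   # latent draw when y_i = 0: support (-Inf, 0)
```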

The conditional posterior distribution of the location parameters is

\begin{align} \beta|{\bf{Y}}^*, {\bf{X}} & \sim N(\beta_n,\bf{B}_n), \end{align}

where $${\bf{B}}_n = ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}$$, and $$\beta_n= {\bf{B}}_n({\bf{B}}_0^{-1}\beta_0 + {\bf{X}}^{\top}{\bf{Y}}^*)$$.
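Given a set of latent draws, the update of $$\beta$$ is a standard draw from a multivariate normal. A minimal sketch on simulated data (all values hypothetical), using `MASS::mvrnorm` for the multivariate normal draw:

```r
# One beta update given latent draws, on simulated data.
set.seed(10)
n <- 100; K <- 2
X <- cbind(1, rnorm(n))              # design matrix with intercept
Ystar <- rnorm(n, X %*% c(0.5, 1))   # stand-in latent draws Y*
b0 <- rep(0, K); B0 <- diag(K)       # prior mean and covariance
Bn <- solve(solve(B0) + t(X) %*% X)                # posterior covariance B_n
bn <- Bn %*% (solve(B0) %*% b0 + t(X) %*% Ystar)   # posterior mean beta_n
beta_draw <- MASS::mvrnorm(1, drop(bn), Bn)        # draw beta | Y*, X
```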

**Application: Determinants of hospitalization in Medellín**

We use the dataset named 2HealthMed.csv, which is in the folder DataApp (see Table 13.3 for details) in our GitHub repository (https://github.com/besmarter/BSTApp) and was used by Ramírez Hassan, Cardona Jiménez, and Cadavid Montoya (2013). Our dependent variable is a binary indicator with a value equal to 1 if an individual was hospitalized in 2007, and 0 otherwise.

The specification of the model is \begin{align} \text{Hosp}_i&=\beta_1+\beta_2\text{SHI}_i+\beta_3\text{Female}_i+\beta_4\text{Age}_i+\beta_5\text{Age}_i^2+\beta_6\text{Est2}_i+\beta_7\text{Est3}_i\\ &+\beta_8\text{Fair}_i+\beta_9\text{Good}_i+\beta_{10}\text{Excellent}_i, \end{align}

where SHI is a binary variable equal to 1 if the individual is in a subsidized health care program and 0 otherwise; Female is an indicator of gender; Age is measured in years; Est2 and Est3 are indicators of socio-economic status, with Est1, the lowest, as the reference; and Fair, Good, and Excellent are indicators of self-perceived health status, with Bad as the reference.

Let’s set $$\beta_0={\bf{0}}_{10}$$, $${\bf{B}}_0={\bf{I}}_{10}$$, and the number of iterations, burn-in, and thinning parameters equal to 10,000, 1,000, and 1, respectively.

mydata <- read.csv("DataApplications/2HealthMed.csv", header = T, sep = ",")
attach(mydata)
## The following objects are masked from mydata (pos = 3):
##
##     Age, Age2
str(mydata)
## 'data.frame':    12975 obs. of  22 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MedVisPrev  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MedVisPrevOr: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Hosp        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SHI         : int  1 1 1 1 0 0 1 1 0 0 ...
##  $ Female      : int  0 1 1 1 0 1 0 1 0 1 ...
##  $ Age         : int  7 39 23 15 8 54 64 40 6 7 ...
##  $ Age2        : int  49 1521 529 225 64 2916 4096 1600 36 49 ...
##  $ FemaleAge   : int  0 39 23 15 0 54 0 40 0 7 ...
##  $ Est1        : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Est2        : int  0 1 1 1 0 1 1 1 0 0 ...
##  $ Est3        : int  0 0 0 0 1 0 0 0 1 1 ...
##  $ Bad         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fair        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Good        : int  1 1 1 1 0 0 0 1 1 1 ...
##  $ Excellent   : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ NoEd        : int  1 0 0 0 1 0 0 0 1 1 ...
##  $ PriEd       : int  0 0 0 0 0 1 1 1 0 0 ...
##  $ HighEd      : int  0 1 1 1 0 0 0 0 0 0 ...
##  $ VocEd       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UnivEd      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PTL         : num  0.43 0 0 0 0 0.06 0 0.38 0 1 ...
K <- 10 # Number of regressors
b0 <- rep(0, K) # Prior mean
B0i <- diag(K) # Prior precision (inverse of covariance)
Prior <- list(betabar = b0, A = B0i) # Prior list
y <- Hosp # Dependent variable
X <- cbind(1, SHI, Female, Age, Age2, Est2, Est3, Fair, Good, Excellent) # Regressors
Data <- list(y = y, X = X) # Data list
Mcmc <- list(R = 10000, keep = 1, nprint = 0) # MCMC parameters
RegProb <- bayesm::rbprobitGibbs(Data = Data, Prior = Prior, Mcmc = Mcmc) # Inference using bayesm package
##
## Starting Gibbs Sampler for Binary Probit Model
##    with  12975  observations
## Table of y Values
## y
##     0     1
## 12571   404
##
## Prior Parms:
## betabar
##  [1] 0 0 0 0 0 0 0 0 0 0
## A
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    0    0    0    0    0    0    0    0     0
##  [2,]    0    1    0    0    0    0    0    0    0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0
##  [4,]    0    0    0    1    0    0    0    0    0     0
##  [5,]    0    0    0    0    1    0    0    0    0     0
##  [6,]    0    0    0    0    0    1    0    0    0     0
##  [7,]    0    0    0    0    0    0    1    0    0     0
##  [8,]    0    0    0    0    0    0    0    1    0     0
##  [9,]    0    0    0    0    0    0    0    0    1     0
## [10,]    0    0    0    0    0    0    0    0    0     1
##
## MCMC parms:
## R=  10000  keep=  1  nprint=  0
## 
PostPar <- coda::mcmc(RegProb$betadraw) # Posterior draws
colnames(PostPar) <- c("Cte", "SHI", "Female", "Age", "Age2", "Est2",
"Est3", "Fair", "Good", "Excellent") # Names
summary(PostPar) # Posterior summary
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##                 Mean        SD  Naive SE Time-series SE
## Cte       -9.378e-01 1.363e-01 1.363e-03      3.601e-03
## SHI       -6.933e-03 5.868e-02 5.868e-04      2.193e-03
## Female     1.266e-01 4.895e-02 4.895e-04      1.797e-03
## Age       -1.533e-04 3.625e-03 3.625e-05      1.199e-04
## Age2       4.245e-05 4.354e-05 4.354e-07      1.318e-06
## Est2      -8.793e-02 5.231e-02 5.231e-04      1.805e-03
## Est3      -4.495e-02 8.050e-02 8.050e-04      2.751e-03
## Fair      -4.937e-01 1.133e-01 1.133e-03      2.069e-03
## Good      -1.204e+00 1.121e-01 1.121e-03      2.312e-03
## Excellent -1.056e+00 1.339e-01 1.339e-03      3.523e-03
##
## 2. Quantiles for each variable:
##
##                 2.5%        25%        50%        75%      97.5%
## Cte       -1.208e+00 -1.030e+00 -9.361e-01 -8.448e-01 -0.6733154
## SHI       -1.199e-01 -4.581e-02 -7.643e-03  3.107e-02  0.1121491
## Female     3.131e-02  9.345e-02  1.271e-01  1.597e-01  0.2212118
## Age       -7.196e-03 -2.584e-03 -1.551e-04  2.287e-03  0.0070899
## Age2      -4.363e-05  1.317e-05  4.217e-05  7.254e-05  0.0001262
## Est2      -1.910e-01 -1.235e-01 -8.748e-02 -5.274e-02  0.0147171
## Est3      -2.026e-01 -9.924e-02 -4.423e-02  8.618e-03  0.1118982
## Fair      -7.137e-01 -5.700e-01 -4.946e-01 -4.186e-01 -0.2690228
## Good      -1.421e+00 -1.279e+00 -1.206e+00 -1.130e+00 -0.9812539
## Excellent -1.322e+00 -1.146e+00 -1.056e+00 -9.661e-01 -0.7899088

It seems from our results that Female and the health status indicators are relevant variables for hospitalization, as their 95% credible intervals do not cross 0. Women have a higher probability of being hospitalized than men, and people with a bad self-perception of their health status also have a higher probability of being hospitalized. We get the same results programming a Gibbs sampler algorithm (see Exercise 1) and using our GUI. We also see that there are posterior convergence issues (see Exercise 2).
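One symptom of such convergence issues is autocorrelation in the chain, which deflates the effective sample size well below the nominal number of draws. A minimal sketch on a simulated AR(1) chain illustrates the diagnostics; the same `coda` calls can be applied to the posterior draws of the model above:

```r
# Effective sample size and Geweke diagnostic on a simulated AR(1) chain.
set.seed(10)
chain <- coda::mcmc(as.numeric(arima.sim(list(ar = 0.9), n = 10000)))
coda::effectiveSize(chain)   # far fewer effective draws than 10,000
coda::geweke.diag(chain)     # z-score comparing early vs late segment means
```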

### References

Albert, James H., and Siddhartha Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88 (422): 669–79.
Ramírez Hassan, A., J. Cardona Jiménez, and R. Cadavid Montoya. 2013. “The Impact of Subsidized Health Insurance on the Poor in Colombia: Evaluating the Case of Medellín.” Economia Aplicada 17 (4): 543–56.
Tanner, M. A., and W. H. Wong. 1987. “The Calculation of Posterior Distributions by Data Augmentation.” Journal of the American Statistical Association 82 (398): 528–40.