6.3 Probit model
The probit model also has a binary outcome as its dependent variable. There is a latent (unobserved) random variable, \(Y_i^*\), that defines the structure of the estimation problem \(Y_i=\begin{Bmatrix} 1, & Y_i^* \geq 0\\ 0, & Y_i^* < 0 \end{Bmatrix},\)
where \(Y_i^*={\bf{x}}_i^{\top}\beta+\mu_i\), \(\mu_i\stackrel{iid} {\sim}N(0,1)\). Then,
\[\begin{align} P[Y_i=1]&=P[Y_i^*\geq 0]\\ &=P[\mathbf{x}_i^{\top}\beta+\mu_i\geq 0]\\ &=P[\mu_i\geq -\mathbf{x}_i^{\top}\beta]\\ &=1-P[\mu_i < -\mathbf{x}_i^{\top}\beta]\\ &=P[\mu_i < \mathbf{x}_i^{\top}\beta]\\ &=\Phi(\mathbf{x}_i^{\top}\beta), \end{align}\]
where the second-to-last equality follows from the symmetry of the standard normal distribution at 0, and \(\Phi\) denotes its cumulative distribution function. In addition, observe that the previous calculations do not change if we multiply \(Y_i^*\) by a positive constant; this implies identification issues regarding scale. Intuitively, this is because we observe only 0's and 1's, which are driven by the unobserved latent variable \(Y_i^*\). This issue is also present in the logit model, and it is why we set the variance of \(\mu_i\) equal to 1.
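This scale invariance is easy to check numerically; a minimal sketch in R, where xb and c0 are arbitrary illustrative values:

xb <- 0.7 # Hypothetical value of x_i' beta
c0 <- 3 # Any positive constant
pnorm(xb) # P(Y = 1) when the error standard deviation is 1
pnorm(c0 * xb, sd = c0) # Same probability after scaling Y* by c0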
Albert and Chib (1993) implemented data augmentation (Tanner and Wong 1987) to apply a Gibbs sampling algorithm in this model. Augmenting the model with \(Y_i^*\), the likelihood contribution from observation \(i\) is \(p(y_i|y_i^*)=1_{y_i=0}1_{y_i^*< 0}+1_{y_i=1}1_{y_i^*\geq 0}\), where \(1_A\) is an indicator function that takes the value of 1 when condition \(A\) is satisfied, and 0 otherwise.
The posterior distribution is \(\pi(\beta,{\bf{Y}}^*|{\bf{y}},{\bf{X}})\propto\prod_{i=1}^N\left[1_{y_i=0}1_{y_i^*< 0}+1_{y_i=1}1_{y_i^*\geq 0}\right] \times N_N({\bf{Y}}^*|{\bf{X}}\beta,{\bf{I}}_N)\times N_K(\beta|\beta_0,{\bf{B}}_0)\) when we take a normal prior, \(\beta\sim N(\beta_0,{\bf{B}}_0)\).
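For reference, the log kernel of this augmented posterior can be coded directly; a minimal sketch, where logpost is a hypothetical helper and the prior density is evaluated with the mvtnorm package:

logpost <- function(beta, ystar, y, X, b0, B0) {
  # Indicator terms: y_i = 1 requires y_i* >= 0, y_i = 0 requires y_i* < 0
  if (!all(ifelse(y == 1, ystar >= 0, ystar < 0))) return(-Inf)
  sum(dnorm(ystar, mean = X %*% beta, sd = 1, log = TRUE)) + # log N(Y*|X beta, I)
    mvtnorm::dmvnorm(beta, mean = b0, sigma = B0, log = TRUE) # log N(beta|beta_0, B_0)
}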
The conditional posterior distribution of the latent variable is
\[\begin{align} Y_i^*|\beta,{\bf{y}},{\bf{X}}&\sim\begin{Bmatrix} TN_{[0,\infty)}({\bf{x}}_i^{\top}\beta,1), & y_i= 1\\ TN_{(-\infty,0)}({\bf{x}}_i^{\top}\beta,1), & y_i= 0 \\ \end{Bmatrix}, \end{align}\]
where \(TN_A\) denotes a normal density truncated to the interval \(A\).
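Draws from these truncated normals can be generated with the inverse-CDF method; a minimal sketch, where rtnormProbit is a hypothetical helper and mu = -0.5 is an illustrative value:

rtnormProbit <- function(n, mu, lower, upper) {
  # Map uniform draws through the normal CDF restricted to (lower, upper)
  u <- runif(n, pnorm(lower, mu, 1), pnorm(upper, mu, 1))
  qnorm(u, mu, 1)
}
rtnormProbit(3, mu = -0.5, lower = 0, upper = Inf) # Case y_i = 1
rtnormProbit(3, mu = -0.5, lower = -Inf, upper = 0) # Case y_i = 0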
The conditional posterior distribution of the location parameters is
\[\begin{align} \beta|{\bf{Y}}^*, {\bf{X}} & \sim N(\beta_n,\bf{B}_n), \end{align}\]
where \({\bf{B}}_n = ({\bf{B}}_0^{-1} + {\bf{X}}^{\top}{\bf{X}})^{-1}\), and \(\beta_n= {\bf{B}}_n({\bf{B}}_0^{-1}\beta_0 + {\bf{X}}^{\top}{\bf{Y}}^*)\).
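These two conditional posteriors are all the Gibbs sampler requires (programming it is Exercise 1). A minimal sketch under the normal prior above; gibbsProbit is an illustrative helper, not the bayesm implementation used in the application below:

gibbsProbit <- function(y, X, b0, B0, S = 1000) {
  K <- ncol(X); n <- nrow(X)
  B0i <- solve(B0) # Prior precision
  Bn <- solve(B0i + crossprod(X)) # B_n, fixed across iterations
  cBn <- t(chol(Bn)) # Cholesky factor used to draw from N(beta_n, B_n)
  beta <- rep(0, K); betaDraws <- matrix(NA_real_, S, K)
  for (s in 1:S) {
    mu <- as.vector(X %*% beta)
    p0 <- pnorm(0, mu, 1) # P(Y_i* < 0 | beta)
    u <- runif(n, ifelse(y == 1, p0, 0), ifelse(y == 1, 1, p0))
    ystar <- qnorm(u, mu, 1) # Inverse-CDF draws from the truncated normals
    bn <- Bn %*% (B0i %*% b0 + crossprod(X, ystar)) # beta_n
    beta <- as.vector(bn + cBn %*% rnorm(K)) # Draw beta | Y*
    betaDraws[s, ] <- beta
  }
  betaDraws
}

With the prior and data of the application below, gibbsProbit(y, X, rep(0, 10), diag(10), S = 10000) should produce draws comparable to those from bayesm::rbprobitGibbs.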
Application: Determinants of hospitalization in Medellín
We use the dataset named 2HealthMed.csv, which is in the folder DataApp (see Table 13.3 for details) of our GitHub repository (https://github.com/besmarter/BSTApp), and was used by Ramírez Hassan, Cardona Jiménez, and Cadavid Montoya (2013). Our dependent variable is a binary indicator equal to 1 if an individual was hospitalized in 2007, and 0 otherwise.
The specification of the model is \[\begin{align} \text{Hosp}_i&=\beta_1+\beta_2\text{SHI}_i+\beta_3\text{Female}_i+\beta_4\text{Age}_i+\beta_5\text{Age}_i^2+\beta_6\text{Est2}_i+\beta_7\text{Est3}_i\\ &+\beta_8\text{Fair}_i+\beta_9\text{Good}_i+\beta_{10}\text{Excellent}_i, \end{align}\]
where SHI is a binary variable equal to 1 if the individual is in a subsidized health care program and 0 otherwise, Female is an indicator of gender, Age is age in years, Est2 and Est3 are indicators of socio-economic status (Est1, the lowest, is the reference), and Fair, Good, and Excellent are indicators of self-perceived health status (Bad is the reference).
Let’s set \(\beta_0={\bf{0}}_{10}\) and \({\bf{B}}_0={\bf{I}}_{10}\), with the number of iterations, burn-in, and thinning parameters equal to 10000, 1000, and 1, respectively.
mydata <- read.csv("DataApplications/2HealthMed.csv", header = TRUE, sep = ",")
attach(mydata)
str(mydata)
## 'data.frame': 12975 obs. of 22 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MedVisPrev : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MedVisPrevOr: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hosp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SHI : int 1 1 1 1 0 0 1 1 0 0 ...
## $ Female : int 0 1 1 1 0 1 0 1 0 1 ...
## $ Age : int 7 39 23 15 8 54 64 40 6 7 ...
## $ Age2 : int 49 1521 529 225 64 2916 4096 1600 36 49 ...
## $ FemaleAge : int 0 39 23 15 0 54 0 40 0 7 ...
## $ Est1 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Est2 : int 0 1 1 1 0 1 1 1 0 0 ...
## $ Est3 : int 0 0 0 0 1 0 0 0 1 1 ...
## $ Bad : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Fair : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Good : int 1 1 1 1 0 0 0 1 1 1 ...
## $ Excellent : int 0 0 0 0 1 1 0 0 0 0 ...
## $ NoEd : int 1 0 0 0 1 0 0 0 1 1 ...
## $ PriEd : int 0 0 0 0 0 1 1 1 0 0 ...
## $ HighEd : int 0 1 1 1 0 0 0 0 0 0 ...
## $ VocEd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UnivEd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PTL : num 0.43 0 0 0 0 0.06 0 0.38 0 1 ...
K <- 10 # Number of regressors
b0 <- rep(0, K) # Prior mean
B0i <- diag(K) # Prior precision (inverse of covariance)
Prior <- list(betabar = b0, A = B0i) # Prior list
y <- Hosp # Dependent variable
X <- cbind(1, SHI, Female, Age, Age2, Est2, Est3, Fair, Good, Excellent) # Regressors
Data <- list(y = y, X = X) # Data list
Mcmc <- list(R = 10000, keep = 1, nprint = 0) # MCMC parameters
RegProb <- bayesm::rbprobitGibbs(Data = Data, Prior = Prior, Mcmc = Mcmc) # Inference using the bayesm package
##
## Starting Gibbs Sampler for Binary Probit Model
## with 12975 observations
## Table of y Values
## y
## 0 1
## 12571 404
##
## Prior Parms:
## betabar
## [1] 0 0 0 0 0 0 0 0 0 0
## A
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 0 1 0 0 0 0
## [7,] 0 0 0 0 0 0 1 0 0 0
## [8,] 0 0 0 0 0 0 0 1 0 0
## [9,] 0 0 0 0 0 0 0 0 1 0
## [10,] 0 0 0 0 0 0 0 0 0 1
##
## MCMC parms:
## R= 10000 keep= 1 nprint= 0
##
PostPar <- coda::mcmc(RegProb$betadraw) # Posterior draws
colnames(PostPar) <- c("Cte", "SHI", "Female", "Age", "Age2", "Est2",
                       "Est3", "Fair", "Good", "Excellent") # Names
summary(PostPar) # Posterior summary
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Cte -9.378e-01 1.363e-01 1.363e-03 3.601e-03
## SHI -6.933e-03 5.868e-02 5.868e-04 2.193e-03
## Female 1.266e-01 4.895e-02 4.895e-04 1.797e-03
## Age -1.533e-04 3.625e-03 3.625e-05 1.199e-04
## Age2 4.245e-05 4.354e-05 4.354e-07 1.318e-06
## Est2 -8.793e-02 5.231e-02 5.231e-04 1.805e-03
## Est3 -4.495e-02 8.050e-02 8.050e-04 2.751e-03
## Fair -4.937e-01 1.133e-01 1.133e-03 2.069e-03
## Good -1.204e+00 1.121e-01 1.121e-03 2.312e-03
## Excellent -1.056e+00 1.339e-01 1.339e-03 3.523e-03
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Cte -1.208e+00 -1.030e+00 -9.361e-01 -8.448e-01 -0.6733154
## SHI -1.199e-01 -4.581e-02 -7.643e-03 3.107e-02 0.1121491
## Female 3.131e-02 9.345e-02 1.271e-01 1.597e-01 0.2212118
## Age -7.196e-03 -2.584e-03 -1.551e-04 2.287e-03 0.0070899
## Age2 -4.363e-05 1.317e-05 4.217e-05 7.254e-05 0.0001262
## Est2 -1.910e-01 -1.235e-01 -8.748e-02 -5.274e-02 0.0147171
## Est3 -2.026e-01 -9.924e-02 -4.423e-02 8.618e-03 0.1118982
## Fair -7.137e-01 -5.700e-01 -4.946e-01 -4.186e-01 -0.2690228
## Good -1.421e+00 -1.279e+00 -1.206e+00 -1.130e+00 -0.9812539
## Excellent -1.322e+00 -1.146e+00 -1.056e+00 -9.661e-01 -0.7899088
Our results suggest that Female and health status are relevant variables for hospitalization, as their 95% credible intervals do not cross 0. Women have a higher probability of being hospitalized than men, and people with a bad self-perception of their health status also have a higher probability of being hospitalized (the coefficients on Fair, Good, and Excellent are negative relative to the Bad reference). We get the same results programming a Gibbs sampling algorithm (see Exercise 1) and using our GUI. We also see that there are posterior convergence issues (see Exercise 2).
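A first hint of these issues is already visible above: the time-series standard errors are several times larger than the naive ones, which points to autocorrelated draws. Standard coda diagnostics applied to the posterior chain make this concrete; a minimal sketch of checks one might run:

coda::effectiveSize(PostPar) # Effective sample size per parameter
coda::geweke.diag(PostPar) # Geweke test: early vs. late chain means
coda::autocorr.diag(PostPar) # Autocorrelation of the draws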