7.2 Probit Regression

Probit regression is a type of Generalized Linear Models used for binary outcome variables. Unlike logistic regression, which uses the logit function, probit regression assumes that the probability of success is determined by an underlying normally distributed latent variable.

7.2.1 Probit Model

Let Yi be a binary response variable:

Yi={1,if success occurs0,otherwise

We assume that Yi follows a Bernoulli distribution:

YiBernoulli(pi),where pi=P(Yi=1|xi)

Instead of the logit function in logistic regression:

logit(pi)=log(pi1pi)=xiβ

Probit regression uses the inverse standard normal CDF:

Φ1(pi)=xiθ

where:

  • Φ() is the CDF of the standard normal distribution.

  • xi is the vector of predictors.

  • θ is the vector of regression coefficients.

Thus, the probability of success is:

pi=P(Yi=1|xi)=Φ(xiθ)

where:

Φ(z)=z12πet2/2dt


7.2.2 Application: Probit Regression

# Load necessary library
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

# Simulate data
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
latent <- 0.5 * x1 + 0.7 * x2 + rnorm(n)  # Linear combination
y <- ifelse(latent > 0, 1, 0)  # Binary outcome

# Create dataframe
data <- data.frame(y, x1, x2)

# Fit Probit model
probit_model <-
    glm(y ~ x1 + x2, family = binomial(link = "probit"), data = data)
summary(probit_model)
#> 
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(link = "probit"), 
#>     data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3740  -0.8663  -0.2318   0.8684   2.6666  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.09781    0.04499  -2.174   0.0297 *  
#> x1           0.43838    0.04891   8.963   <2e-16 ***
#> x2           0.75538    0.05306  14.235   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1385.1  on 999  degrees of freedom
#> Residual deviance: 1045.3  on 997  degrees of freedom
#> AIC: 1051.3
#> 
#> Number of Fisher Scoring iterations: 5

# Fit Logit model
logit_model <-
    glm(y ~ x1 + x2, family = binomial(link = "logit"), data = data)
summary(logit_model)
#> 
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(link = "logit"), 
#>     data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3048  -0.8571  -0.2805   0.8632   2.5335  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.16562    0.07600  -2.179   0.0293 *  
#> x1           0.73234    0.08507   8.608   <2e-16 ***
#> x2           1.25220    0.09486  13.201   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1385.1  on 999  degrees of freedom
#> Residual deviance: 1048.4  on 997  degrees of freedom
#> AIC: 1054.4
#> 
#> Number of Fisher Scoring iterations: 4

# Compare Coefficients
coef_comparison <- data.frame(
    Variable = names(coef(probit_model)),
    Probit_Coef = coef(probit_model),
    Logit_Coef = coef(logit_model),
    Logit_Probit_Ratio = coef(logit_model) / coef(probit_model)
)

print(coef_comparison)
#>                Variable Probit_Coef Logit_Coef Logit_Probit_Ratio
#> (Intercept) (Intercept) -0.09780689 -0.1656216           1.693353
#> x1                   x1  0.43837627  0.7323392           1.670572
#> x2                   x2  0.75538259  1.2522008           1.657704

# Compute predicted probabilities
data$probit_pred <- predict(probit_model, type = "response")
data$logit_pred <- predict(logit_model, type = "response")

# Plot Probit vs Logit predictions
ggplot(data, aes(x = probit_pred, y = logit_pred)) +
    geom_point(alpha = 0.5) +
    geom_abline(slope = 1,
                intercept = 0,
                col = "red") +
    labs(title = "Comparison of Predicted Probabilities",
         x = "Probit Predictions", y = "Logit Predictions")


# Classification Accuracy
threshold <- 0.5
data$probit_class <- ifelse(data$probit_pred > threshold, 1, 0)
data$logit_class <- ifelse(data$logit_pred > threshold, 1, 0)

probit_acc <- mean(data$probit_class == data$y)
logit_acc <- mean(data$logit_class == data$y)

print(paste("Probit Accuracy:", round(probit_acc, 4)))
#> [1] "Probit Accuracy: 0.71"
print(paste("Logit Accuracy:", round(logit_acc, 4)))
#> [1] "Logit Accuracy: 0.71"