7.2 Probit Regression

Probit regression is a type of Generalized Linear Models used for binary outcome variables. Unlike logistic regression, which uses the logit function, probit regression assumes that the probability of success is determined by an underlying normally distributed latent variable.

7.2.1 Probit Model

Let $Y_i$ be a binary response variable:

$Y_i = \begin{cases} 1, & \text{if success occurs} \\ 0, & \text{otherwise} \end{cases}$

We assume that $Y_i$ follows a Bernoulli distribution:

$Y_i \sim \text{Bernoulli}(p_i), \quad \text{where } p_i = P(Y_i = 1 | \mathbf{x_i})$

Instead of the logit function in logistic regression:

$\text{logit}(p_i) = \log\left( \frac{p_i}{1 - p_i} \right) = \mathbf{x_i'\beta}$

Probit regression uses the inverse standard normal CDF:

$\Phi^{-1}(p_i) = \mathbf{x_i'\theta}$

where:

$\Phi(\cdot)$ is the CDF of the standard normal distribution.
$\mathbf{x_i}$ is the vector of predictors.
$\theta$ is the vector of regression coefficients.

Thus, the probability of success is:

$p_i = P(Y_i = 1 | \mathbf{x_i}) = \Phi(\mathbf{x_i'\theta})$

where:

$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt$

7.2.2 Application: Probit Regression

# Load necessary library
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

# Simulate data
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
latent <- 0.5 * x1 + 0.7 * x2 + rnorm(n)  # Linear combination
y <- ifelse(latent > 0, 1, 0)  # Binary outcome

# Create dataframe
data <- data.frame(y, x1, x2)

# Fit Probit model
probit_model <-
    glm(y ~ x1 + x2, family = binomial(link = "probit"), data = data)
summary(probit_model)
#> 
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(link = "probit"), 
#>     data = data)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.09781    0.04499  -2.174   0.0297 *  
#> x1           0.43838    0.04891   8.963   <2e-16 ***
#> x2           0.75538    0.05306  14.235   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1385.1  on 999  degrees of freedom
#> Residual deviance: 1045.3  on 997  degrees of freedom
#> AIC: 1051.3
#> 
#> Number of Fisher Scoring iterations: 5

# Fit Logit model
logit_model <-
    glm(y ~ x1 + x2, family = binomial(link = "logit"), data = data)
summary(logit_model)
#> 
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(link = "logit"), 
#>     data = data)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.16562    0.07600  -2.179   0.0293 *  
#> x1           0.73234    0.08507   8.608   <2e-16 ***
#> x2           1.25220    0.09486  13.201   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1385.1  on 999  degrees of freedom
#> Residual deviance: 1048.4  on 997  degrees of freedom
#> AIC: 1054.4
#> 
#> Number of Fisher Scoring iterations: 4

# Compare Coefficients
coef_comparison <- data.frame(
    Variable = names(coef(probit_model)),
    Probit_Coef = coef(probit_model),
    Logit_Coef = coef(logit_model),
    Logit_Probit_Ratio = coef(logit_model) / coef(probit_model)
)

print(coef_comparison)
#>                Variable Probit_Coef Logit_Coef Logit_Probit_Ratio
#> (Intercept) (Intercept) -0.09780689 -0.1656216           1.693353
#> x1                   x1  0.43837627  0.7323392           1.670572
#> x2                   x2  0.75538259  1.2522008           1.657704

# Compute predicted probabilities
data$probit_pred <- predict(probit_model, type = "response")
data$logit_pred <- predict(logit_model, type = "response")

# Plot Probit vs Logit predictions
ggplot(data, aes(x = probit_pred, y = logit_pred)) +
    geom_point(alpha = 0.5) +
    geom_abline(slope = 1,
                intercept = 0,
                col = "red") +
    labs(title = "Comparison of Predicted Probabilities",
         x = "Probit Predictions", y = "Logit Predictions")


# Classification Accuracy
threshold <- 0.5
data$probit_class <- ifelse(data$probit_pred > threshold, 1, 0)
data$logit_class <- ifelse(data$logit_pred > threshold, 1, 0)

probit_acc <- mean(data$probit_class == data$y)
logit_acc <- mean(data$logit_class == data$y)

print(paste("Probit Accuracy:", round(probit_acc, 4)))
#> [1] "Probit Accuracy: 0.71"
print(paste("Logit Accuracy:", round(logit_acc, 4)))
#> [1] "Logit Accuracy: 0.71"