7.2 Probit Regression
Probit regression is a type of Generalized Linear Models used for binary outcome variables. Unlike logistic regression, which uses the logit function, probit regression assumes that the probability of success is determined by an underlying normally distributed latent variable.
7.2.1 Probit Model
Let Yi be a binary response variable:
Yi={1,if success occurs0,otherwise
We assume that Yi follows a Bernoulli distribution:
Yi∼Bernoulli(pi),where pi=P(Yi=1|xi)
Instead of the logit function in logistic regression:
logit(pi)=log(pi1−pi)=x′iβ
Probit regression uses the inverse standard normal CDF:
Φ−1(pi)=x′iθ
where:
Φ(⋅) is the CDF of the standard normal distribution.
xi is the vector of predictors.
θ is the vector of regression coefficients.
Thus, the probability of success is:
pi=P(Yi=1|xi)=Φ(x′iθ)
where:
Φ(z)=∫z−∞1√2πe−t2/2dt
7.2.2 Application: Probit Regression
# Load necessary library
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Simulate data
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
latent <- 0.5 * x1 + 0.7 * x2 + rnorm(n) # Linear combination
y <- ifelse(latent > 0, 1, 0) # Binary outcome
# Create dataframe
data <- data.frame(y, x1, x2)
# Fit Probit model
probit_model <-
glm(y ~ x1 + x2, family = binomial(link = "probit"), data = data)
summary(probit_model)
#>
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(link = "probit"),
#> data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3740 -0.8663 -0.2318 0.8684 2.6666
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.09781 0.04499 -2.174 0.0297 *
#> x1 0.43838 0.04891 8.963 <2e-16 ***
#> x2 0.75538 0.05306 14.235 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1385.1 on 999 degrees of freedom
#> Residual deviance: 1045.3 on 997 degrees of freedom
#> AIC: 1051.3
#>
#> Number of Fisher Scoring iterations: 5
# Fit Logit model
logit_model <-
glm(y ~ x1 + x2, family = binomial(link = "logit"), data = data)
summary(logit_model)
#>
#> Call:
#> glm(formula = y ~ x1 + x2, family = binomial(link = "logit"),
#> data = data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3048 -0.8571 -0.2805 0.8632 2.5335
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.16562 0.07600 -2.179 0.0293 *
#> x1 0.73234 0.08507 8.608 <2e-16 ***
#> x2 1.25220 0.09486 13.201 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1385.1 on 999 degrees of freedom
#> Residual deviance: 1048.4 on 997 degrees of freedom
#> AIC: 1054.4
#>
#> Number of Fisher Scoring iterations: 4
# Compare Coefficients
coef_comparison <- data.frame(
Variable = names(coef(probit_model)),
Probit_Coef = coef(probit_model),
Logit_Coef = coef(logit_model),
Logit_Probit_Ratio = coef(logit_model) / coef(probit_model)
)
print(coef_comparison)
#> Variable Probit_Coef Logit_Coef Logit_Probit_Ratio
#> (Intercept) (Intercept) -0.09780689 -0.1656216 1.693353
#> x1 x1 0.43837627 0.7323392 1.670572
#> x2 x2 0.75538259 1.2522008 1.657704
# Compute predicted probabilities
data$probit_pred <- predict(probit_model, type = "response")
data$logit_pred <- predict(logit_model, type = "response")
# Plot Probit vs Logit predictions
ggplot(data, aes(x = probit_pred, y = logit_pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1,
intercept = 0,
col = "red") +
labs(title = "Comparison of Predicted Probabilities",
x = "Probit Predictions", y = "Logit Predictions")
# Classification Accuracy
threshold <- 0.5
data$probit_class <- ifelse(data$probit_pred > threshold, 1, 0)
data$logit_class <- ifelse(data$logit_pred > threshold, 1, 0)
probit_acc <- mean(data$probit_class == data$y)
logit_acc <- mean(data$logit_class == data$y)
print(paste("Probit Accuracy:", round(probit_acc, 4)))
#> [1] "Probit Accuracy: 0.71"
print(paste("Logit Accuracy:", round(logit_acc, 4)))
#> [1] "Logit Accuracy: 0.71"