7.1 Logistic Regression
Logistic regression is a widely used Generalized Linear Model (GLM) designed for modeling binary response variables. It is particularly useful in applications such as credit scoring, medical diagnosis, and customer churn prediction.
7.1.1 Logistic Model
Given a set of predictor variables $\mathbf{x}_i$, the probability of a positive outcome (e.g., success, event occurring) is modeled as:
$$p_i = f(\mathbf{x}_i; \beta) = \frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}$$
where:
- $p_i = E[Y_i]$ is the probability of success for observation $i$.
- $\mathbf{x}_i$ is the vector of predictor variables.
- $\beta$ is the vector of model coefficients.
7.1.1.1 Logit Transformation
The logistic function can be rewritten in terms of the log-odds, also known as the logit function:
$$\text{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i'\beta$$
where:
- $\frac{p_i}{1 - p_i}$ represents the odds of success (the ratio of the probability of success to the probability of failure).
- The logit function ensures linearity in the parameters, which aligns with the GLM framework.
Thus, logistic regression belongs to the family of Generalized Linear Models because a function of the mean response (logit) is linear in the predictors.
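As a quick numerical illustration (a minimal sketch using base R; the example values of the linear predictor are arbitrary), `plogis()` computes the logistic (inverse-logit) transformation and `qlogis()` computes the logit, so applying one after the other recovers the original linear predictor:
# Logistic (inverse-logit) and logit transformations in base R
eta <- c(-2, 0, 1.5)     # arbitrary example values of the linear predictor
p <- plogis(eta)         # p = exp(eta) / (1 + exp(eta))
log_odds <- qlogis(p)    # logit(p) = log(p / (1 - p)), recovers eta
cbind(eta, p, log_odds)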
7.1.2 Likelihood Function
Since Yi follows a Bernoulli distribution with probability pi, the likelihood function for n independent observations is:
$$L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1 - p_i)^{1 - Y_i}$$
By substituting the logistic function for pi:
$$p_i = \frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}, \qquad 1 - p_i = \frac{1}{1 + \exp(\mathbf{x}_i'\beta)}$$
we obtain:
$$L(\beta) = \prod_{i=1}^{n}\left(\frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}\right)^{Y_i}\left(\frac{1}{1 + \exp(\mathbf{x}_i'\beta)}\right)^{1 - Y_i}$$
Taking the natural logarithm of the likelihood function gives the log-likelihood function:
$$Q(\beta) = \log L(\beta) = \sum_{i=1}^{n} Y_i \mathbf{x}_i'\beta - \sum_{i=1}^{n}\log\left(1 + \exp(\mathbf{x}_i'\beta)\right)$$
Since this function is concave, we can maximize it numerically using iterative optimization techniques, such as:
- Newton-Raphson Method
- Fisher Scoring Algorithm
These methods allow us to obtain the Maximum Likelihood Estimates (MLEs) of the parameters, $\hat{\beta}$.
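To make the estimation step concrete, the following is a minimal sketch (not a prescribed implementation; the simulated data and starting values are illustrative assumptions) that maximizes $Q(\beta)$ directly with `optim()` and compares the result to `glm()`:
# Maximize the logistic log-likelihood numerically and compare with glm()
set.seed(123)                               # illustrative simulated data
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
X <- cbind(1, x)                            # design matrix: intercept and slope
neg_loglik <- function(beta) {
  eta <- X %*% beta
  -(sum(y * eta) - sum(log(1 + exp(eta))))  # negative of Q(beta)
}
fit <- optim(c(0, 0), neg_loglik, method = "BFGS", hessian = TRUE)
fit$par                                     # numerical MLEs of beta_0 and beta_1
coef(glm(y ~ x, family = binomial))         # should agree closely
solve(fit$hessian)                          # approximate variance-covariance of beta-hat
Because the log-likelihood is concave, the numerical optimum coincides (up to convergence tolerance) with the estimates produced by the Fisher scoring iterations that `glm()` runs internally.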
Under standard regularity conditions, the MLEs of logistic regression parameters are asymptotically normal:
$$\hat{\beta} \dot{\sim} AN\left(\beta, [\mathbf{I}(\beta)]^{-1}\right)$$
where:
- $\mathbf{I}(\beta)$ is the Fisher Information Matrix, which determines the variance-covariance structure of $\hat{\beta}$.
7.1.3 Fisher Information Matrix
The Fisher Information Matrix quantifies the amount of information that an observable random variable carries about the unknown parameter β. It is crucial in estimating the variance-covariance matrix of the estimated coefficients in logistic regression.
Mathematically, the Fisher Information Matrix is defined as:
$$\mathbf{I}(\beta) = E\left[\frac{\partial \log L(\beta)}{\partial \beta}\, \frac{\partial \log L(\beta)}{\partial \beta'}\right]$$
which expands to:
$$\mathbf{I}(\beta) = E\left[\left(\frac{\partial \log L(\beta)}{\partial \beta_i}\, \frac{\partial \log L(\beta)}{\partial \beta_j}\right)_{ij}\right]$$
Under regularity conditions, the Fisher Information Matrix is equivalent to the negative expected Hessian matrix:
$$\mathbf{I}(\beta) = -E\left[\frac{\partial^2 \log L(\beta)}{\partial \beta\, \partial \beta'}\right]$$
which further expands to:
$$\mathbf{I}(\beta) = -E\left[\left(\frac{\partial^2 \log L(\beta)}{\partial \beta_i\, \partial \beta_j}\right)_{ij}\right]$$
This representation is particularly useful because it allows us to compute the Fisher Information Matrix directly from the Hessian of the log-likelihood function.
Example: Fisher Information Matrix in Logistic Regression
Consider a simple logistic regression model with one predictor:
$$\mathbf{x}_i'\beta = \beta_0 + \beta_1 x_i$$
From the log-likelihood function, the second-order partial derivatives are:
$$
\begin{aligned}
-\frac{\partial^2 \ln L(\beta)}{\partial \beta_0^2} &= \sum_{i=1}^{n}\frac{\exp(\mathbf{x}_i'\beta)}{1+\exp(\mathbf{x}_i'\beta)} - \left[\frac{\exp(\mathbf{x}_i'\beta)}{1+\exp(\mathbf{x}_i'\beta)}\right]^2 = \sum_{i=1}^{n} p_i(1-p_i) && \text{(intercept)} \\
-\frac{\partial^2 \ln L(\beta)}{\partial \beta_1^2} &= \sum_{i=1}^{n} x_i^2\,\frac{\exp(\mathbf{x}_i'\beta)}{1+\exp(\mathbf{x}_i'\beta)} - \left[x_i\,\frac{\exp(\mathbf{x}_i'\beta)}{1+\exp(\mathbf{x}_i'\beta)}\right]^2 = \sum_{i=1}^{n} x_i^2\, p_i(1-p_i) && \text{(slope)} \\
-\frac{\partial^2 \ln L(\beta)}{\partial \beta_0\, \partial \beta_1} &= \sum_{i=1}^{n} x_i\,\frac{\exp(\mathbf{x}_i'\beta)}{1+\exp(\mathbf{x}_i'\beta)} - x_i\left[\frac{\exp(\mathbf{x}_i'\beta)}{1+\exp(\mathbf{x}_i'\beta)}\right]^2 = \sum_{i=1}^{n} x_i\, p_i(1-p_i) && \text{(cross-derivative)}
\end{aligned}
$$
Combining these elements, the Fisher Information Matrix for the logistic regression model is:
$$\mathbf{I}(\beta) = \begin{bmatrix} \sum_{i=1}^{n} p_i(1-p_i) & \sum_{i=1}^{n} x_i\, p_i(1-p_i) \\ \sum_{i=1}^{n} x_i\, p_i(1-p_i) & \sum_{i=1}^{n} x_i^2\, p_i(1-p_i) \end{bmatrix}$$
where:
- $p_i = \frac{\exp(\mathbf{x}_i'\beta)}{1 + \exp(\mathbf{x}_i'\beta)}$ represents the predicted probability.
- $p_i(1 - p_i)$ is the variance of the Bernoulli response variable.
- Inverting $\mathbf{I}(\beta)$ yields the variances of $\hat{\beta}_0$ and $\hat{\beta}_1$ (the diagonal elements) and their covariance (the off-diagonal elements).
The inverse of the Fisher Information Matrix provides the variance-covariance matrix of the estimated coefficients:
$$\text{Var}(\hat{\beta}) = [\mathbf{I}(\hat{\beta})]^{-1}$$
This matrix is essential for:
- Estimating standard errors of the logistic regression coefficients.
- Constructing confidence intervals for $\beta$.
- Performing hypothesis tests (e.g., Wald Test).
# Load necessary library
library(stats)
# Simulated dataset
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
# Fit logistic regression model
model <- glm(y ~ x, family = binomial)
# Extract the estimated variance-covariance matrix of the coefficients
# (the inverse of the Fisher Information Matrix)
vcov_beta <- summary(model)$cov.unscaled
# Display the variance-covariance matrix
print(vcov_beta)
#> (Intercept) x
#> (Intercept) 0.05718171 0.01564322
#> x 0.01564322 0.10302992
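To connect this output to the derivation above, $\mathbf{I}(\hat{\beta})$ can also be assembled directly from the fitted probabilities as $\mathbf{X}'\mathbf{W}\mathbf{X}$, where $\mathbf{W}$ is the diagonal matrix of Bernoulli variances $p_i(1 - p_i)$; the following sketch (variable names are illustrative) inverts it to recover the matrix printed above:
# Reconstruct the Fisher Information Matrix from the fitted model
X_mat <- model.matrix(model)        # design matrix: intercept and x
p_hat <- fitted(model)              # p_i, the fitted probabilities
W <- diag(p_hat * (1 - p_hat))      # Bernoulli variances on the diagonal
I_hat <- t(X_mat) %*% W %*% X_mat   # I(beta-hat) = X' W X
solve(I_hat)                        # matches summary(model)$cov.unscaled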
7.1.4 Inference in Logistic Regression
Once we estimate the model parameters $\hat{\beta}$ using Maximum Likelihood Estimation, we can conduct inference to assess the significance of predictors, construct confidence intervals, and perform hypothesis tests. The two most common inference approaches in logistic regression are:
- the Likelihood Ratio Test, and
- the Wald Test.
Both tests rely on the asymptotic normality of MLEs and the properties of the Fisher Information Matrix.
7.1.4.1 Likelihood Ratio Test
The Likelihood Ratio Test compares two models:
- Restricted Model: A simpler model where some parameters are constrained to specific values.
- Unrestricted Model: The full model without constraints.
To test a hypothesis about a subset of parameters, partition the parameter vector as $\beta = (\beta_1, \beta_2)$, where $\beta_1$ is the subset of interest and $\beta_2$ collects the nuisance parameters, which are left unspecified.
Hypothesis Setup:
$$H_0: \beta_1 = \beta_{1,0}$$
where $\beta_{1,0}$ is a specified value (often zero). Let:
- $\hat{\beta}_{2,0}$ be the MLE of $\beta_2$ under the constraint $\beta_1 = \beta_{1,0}$.
- $\hat{\beta}_1, \hat{\beta}_2$ be the MLEs under the full model.
The likelihood ratio test statistic is:
$$-2\log\Lambda = -2\left[\log L(\beta_{1,0}, \hat{\beta}_{2,0}) - \log L(\hat{\beta}_1, \hat{\beta}_2)\right]$$
where:
- The first term is the log-likelihood of the restricted model.
- The second term is the log-likelihood of the unrestricted model.
Under the null hypothesis:
$$-2\log\Lambda \sim \chi^2_{\upsilon}$$
where $\upsilon$ is the number of restricted parameters. We reject $H_0$ if:
$$-2\log\Lambda > \chi^2_{\upsilon, 1-\alpha}$$
Interpretation: If the likelihood ratio test statistic is large, this suggests that the restricted model (under H0) fits significantly worse than the full model, leading us to reject the null hypothesis.
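For concreteness, here is a small self-contained sketch (the simulated data are illustrative) that computes the likelihood ratio statistic by hand, comparing an intercept-only (restricted) model with a one-predictor (full) model:
# Likelihood ratio test by hand: restricted vs. unrestricted model
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
fit_restricted <- glm(y ~ 1, family = binomial)  # beta_1 constrained to zero
fit_full       <- glm(y ~ x, family = binomial)  # unrestricted model
lr_stat <- as.numeric(-2 * (logLik(fit_restricted) - logLik(fit_full)))
lr_stat                                          # -2 log Lambda
pchisq(lr_stat, df = 1, lower.tail = FALSE)      # p-value, 1 restricted parameter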
7.1.4.2 Wald Test
The Wald test is based on the asymptotic normality of MLEs:
$$\hat{\beta} \sim AN\left(\beta, [\mathbf{I}(\beta)]^{-1}\right)$$
We test:
$$H_0: \mathbf{L}\beta = 0$$
where $\mathbf{L}$ is a $q \times p$ matrix with $q$ linearly independent rows (often used to test several coefficients simultaneously). The Wald test statistic is:
$$W = (\mathbf{L}\hat{\beta})'\left(\mathbf{L}[\mathbf{I}(\hat{\beta})]^{-1}\mathbf{L}'\right)^{-1}(\mathbf{L}\hat{\beta})$$
Under H0:
$$W \sim \chi^2_q$$
Interpretation: If W is large, the null hypothesis is rejected, suggesting that at least one of the tested coefficients is significantly different from zero.
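As a hedged sketch (reusing the same illustrative simulated data), the Wald statistic for the single restriction $H_0: \beta_1 = 0$ can be computed by hand with $\mathbf{L} = (0 \ \ 1)$:
# Wald test by hand for H0: beta_1 = 0, i.e., L = (0, 1)
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
model <- glm(y ~ x, family = binomial)
L <- matrix(c(0, 1), nrow = 1)       # q = 1 restriction, p = 2 parameters
beta_hat <- coef(model)
V <- vcov(model)                     # estimate of [I(beta-hat)]^{-1}
W <- t(L %*% beta_hat) %*% solve(L %*% V %*% t(L)) %*% (L %*% beta_hat)
pchisq(as.numeric(W), df = nrow(L), lower.tail = FALSE)
For a single coefficient, $W$ is the square of the z value reported by `summary()`, so the two approaches give the same p-value.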
Comparing Likelihood Ratio and Wald Tests
| Test | Best Used When... |
|---|---|
| Likelihood Ratio Test | More accurate in small samples, providing better control of error rates. Recommended when sample sizes are small. |
| Wald Test | Easier to compute but may be inaccurate in small samples. Recommended when computational efficiency is a priority. |
# Load necessary library
library(stats)
# Simulate some binary outcome data
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
# Fit logistic regression model
model <- glm(y ~ x, family = binomial)
# Display model summary (includes Wald tests)
summary(model)
#>
#> Call:
#> glm(formula = y ~ x, family = binomial)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6226 -0.9385 0.5287 0.8333 1.4656
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.7223 0.2391 3.020 0.002524 **
#> x 1.2271 0.3210 3.823 0.000132 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 128.21 on 99 degrees of freedom
#> Residual deviance: 108.29 on 98 degrees of freedom
#> AIC: 112.29
#>
#> Number of Fisher Scoring iterations: 4
# Perform likelihood ratio test using anova()
anova(model, test="Chisq")
#> Analysis of Deviance Table
#>
#> Model: binomial, link: logit
#>
#> Response: y
#>
#> Terms added sequentially (first to last)
#>
#>
#> Df Deviance Resid. Df Resid. Dev Pr(>Chi)
#> NULL 99 128.21
#> x 1 19.913 98 108.29 8.105e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
7.1.4.3 Confidence Intervals for Coefficients
A 95% confidence interval for a logistic regression coefficient $\beta_i$ is given by:
$$\hat{\beta}_i \pm 1.96\, \hat{s}_{ii}$$
where:
- $\hat{\beta}_i$ is the estimated coefficient.
- $\hat{s}_{ii}$ is its standard error (the square root of the $i$-th diagonal element of $[\mathbf{I}(\hat{\beta})]^{-1}$).
This confidence interval provides a range of plausible values for $\beta_i$. If the interval does not include zero, we conclude that $\beta_i$ is statistically significant.
- For large sample sizes, the Likelihood Ratio Test and Wald Test yield similar results.
- For small sample sizes, the Likelihood Ratio Test is preferred because the Wald test can be less reliable.
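In R, both interval types are available for a fitted `glm` object; a brief sketch using the `model` object fitted in the code above:
# Two flavors of 95% confidence intervals for the coefficients
confint(model)           # profile-likelihood intervals (invert the likelihood ratio test)
confint.default(model)   # Wald intervals: estimate +/- 1.96 * standard error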
7.1.4.4 Interpretation of Logistic Regression Coefficients
For a single predictor variable, the logistic regression model is:
$$\text{logit}(\hat{p}_i) = \log\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
where:
- $\hat{p}_i$ is the predicted probability of success at $x_i$.
- $\hat{\beta}_1$ represents the change in the log odds for a one-unit increase in $x$.
Interpreting β1 in Terms of Odds
When the predictor variable increases by one unit, the logit of the probability changes by $\hat{\beta}_1$:
$$\text{logit}(\hat{p}_{x_i+1}) = \hat{\beta}_0 + \hat{\beta}_1(x_i + 1) = \text{logit}(\hat{p}_{x_i}) + \hat{\beta}_1$$
Thus, the difference in log odds is:
$$\text{logit}(\hat{p}_{x_i+1}) - \text{logit}(\hat{p}_{x_i}) = \log\left(\text{odds}(\hat{p}_{x_i+1})\right) - \log\left(\text{odds}(\hat{p}_{x_i})\right) = \log\left(\frac{\text{odds}(\hat{p}_{x_i+1})}{\text{odds}(\hat{p}_{x_i})}\right) = \hat{\beta}_1$$
Exponentiating both sides:
$$\exp(\hat{\beta}_1) = \frac{\text{odds}(\hat{p}_{x_i+1})}{\text{odds}(\hat{p}_{x_i})}$$
This quantity, $\exp(\hat{\beta}_1)$, is the odds ratio, which quantifies the effect of a one-unit increase in $x$ on the odds of success.
Generalization: Odds Ratio for Any Change in x
For a difference of $c$ units in the predictor $x$, the estimated odds ratio is:
$$\exp(c\hat{\beta}_1)$$
For multiple predictors, $\exp(\hat{\beta}_k)$ represents the odds ratio for $x_k$, holding all other variables constant.
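As a short sketch using the `model` object fitted earlier (the two-unit change is an arbitrary illustrative choice):
# Odds ratios for a one-unit and a two-unit increase in x
beta1 <- coef(model)["x"]
exp(beta1)        # odds ratio for a one-unit increase
exp(2 * beta1)    # odds ratio for a two-unit (c = 2) increase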
7.1.4.5 Inference on the Mean Response
For a given set of predictor values $\mathbf{x}_h = (1, x_{h1}, \ldots, x_{h,p-1})'$, the estimated mean response (probability of success) is:
$$\hat{p}_h = \frac{\exp(\mathbf{x}_h'\hat{\beta})}{1 + \exp(\mathbf{x}_h'\hat{\beta})}$$
The variance of the estimated linear predictor $\mathbf{x}_h'\hat{\beta}$ (the logit of $\hat{p}_h$) is:
$$s^2(\mathbf{x}_h'\hat{\beta}) = \mathbf{x}_h'[\mathbf{I}(\hat{\beta})]^{-1}\mathbf{x}_h$$
where:
- $[\mathbf{I}(\hat{\beta})]^{-1}$ is the variance-covariance matrix of $\hat{\beta}$.
- $s^2(\mathbf{x}_h'\hat{\beta})$ quantifies the uncertainty in $\hat{p}_h$: a confidence interval is typically constructed on the logit scale and then transformed back to the probability scale (see the sketch after the code output below).
In many applications, logistic regression is used for classification, where we predict whether an observation belongs to category 0 or 1. A commonly used decision rule is:
- Assign $y = 1$ if $\hat{p}_h \ge \tau$.
- Assign $y = 0$ if $\hat{p}_h < \tau$.
where $\tau$ is a chosen cutoff threshold (typically $\tau = 0.5$); a classification example also appears in the sketch after the code output below.
# Load necessary library
library(stats)
# Simulated dataset
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
# Fit logistic regression model
model <- glm(y ~ x, family = binomial)
# Display model summary
summary(model)
#>
#> Call:
#> glm(formula = y ~ x, family = binomial)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6226 -0.9385 0.5287 0.8333 1.4656
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.7223 0.2391 3.020 0.002524 **
#> x 1.2271 0.3210 3.823 0.000132 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 128.21 on 99 degrees of freedom
#> Residual deviance: 108.29 on 98 degrees of freedom
#> AIC: 112.29
#>
#> Number of Fisher Scoring iterations: 4
# Extract coefficients and standard errors
coef_estimates <- coef(summary(model))
beta_hat <- coef_estimates[, 1] # Estimated coefficients
se_beta <- coef_estimates[, 2] # Standard errors
# Compute 95% confidence intervals for coefficients
conf_intervals <- cbind(
beta_hat - 1.96 * se_beta,
beta_hat + 1.96 * se_beta
)
# Compute Odds Ratios
odds_ratios <- exp(beta_hat)
# Display results
print("Confidence Intervals for Coefficients:")
#> [1] "Confidence Intervals for Coefficients:"
print(conf_intervals)
#> [,1] [,2]
#> (Intercept) 0.2535704 1.190948
#> x 0.5979658 1.856218
print("Odds Ratios:")
#> [1] "Odds Ratios:"
print(odds_ratios)
#> (Intercept) x
#> 2.059080 3.411295
# Predict probability for a new observation (e.g., x = 1)
new_x <- data.frame(x = 1)
predicted_prob <- predict(model, newdata = new_x, type = "response")
print("Predicted Probability for x = 1:")
#> [1] "Predicted Probability for x = 1:"
print(predicted_prob)
#> 1
#> 0.8753759
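Building on the fitted `model` above, the following hedged sketch (the cutoff $\tau = 0.5$ and the new value $x_h = 1$ are illustrative choices) shows (i) a 95% confidence interval for the mean response constructed on the logit scale and back-transformed, and (ii) the classification rule:
# (i) CI for the mean response at x = 1: build on the logit scale, then back-transform
pred_link <- predict(model, newdata = data.frame(x = 1),
                     type = "link", se.fit = TRUE)
logit_ci <- pred_link$fit + c(-1.96, 1.96) * pred_link$se.fit
plogis(logit_ci)                                 # 95% CI for p_h at x = 1
# (ii) Classification with cutoff tau = 0.5
tau <- 0.5
predicted_class <- ifelse(fitted(model) >= tau, 1, 0)
table(Predicted = predicted_class, Actual = y)   # simple confusion matrix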
7.1.5 Application: Logistic Regression
In this section, we demonstrate the application of logistic regression using simulated data. We explore model fitting, inference, residual analysis, and goodness-of-fit testing.
1. Load Required Libraries
library(kableExtra)
library(dplyr)
library(pscl)
library(ggplot2)
library(faraway)
library(nnet)
library(agridat)
library(nlstools)
2. Data Generation
We generate a dataset where the predictor variable X follows a uniform distribution:
$$x \sim \text{Unif}(-0.5, 2.5)$$
The linear predictor is given by:
$$\eta = 0.5 + 0.75x$$
Passing η into the inverse-logit function, we obtain:
$$p = \frac{\exp(\eta)}{1 + \exp(\eta)}$$
which ensures that $p \in [0, 1]$. We then generate the binary response variable:
$$y \sim \text{Bernoulli}(p)$$
set.seed(23) # Set seed for reproducibility
x <- runif(1000, min = -0.5, max = 2.5) # Generate X values
eta1 <- 0.5 + 0.75 * x # Compute linear predictor
p <- exp(eta1) / (1 + exp(eta1)) # Compute probabilities
y <- rbinom(1000, 1, p) # Generate binary response
BinData <- data.frame(X = x, Y = y) # Create data frame
3. Model Fitting
We fit a logistic regression model to the simulated data:
$$\text{logit}(p) = \beta_0 + \beta_1 X$$
Logistic_Model <- glm(formula = Y ~ X,
                      family = binomial,  # specifies the response distribution
                      data = BinData)
summary(Logistic_Model) # Model summary
#>
#> Call:
#> glm(formula = Y ~ X, family = binomial, data = BinData)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.2317 0.4153 0.5574 0.7922 1.1469
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.46205 0.10201 4.530 5.91e-06 ***
#> X 0.78527 0.09296 8.447 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1106.7 on 999 degrees of freedom
#> Residual deviance: 1027.4 on 998 degrees of freedom
#> AIC: 1031.4
#>
#> Number of Fisher Scoring iterations: 4
nlstools::confint2(Logistic_Model) # Confidence intervals
#> 2.5 % 97.5 %
#> (Intercept) 0.2618709 0.6622204
#> X 0.6028433 0.9676934
# Compute odds ratios
OddsRatio <- coef(Logistic_Model) %>% exp
OddsRatio
#> (Intercept) X
#> 1.587318 2.192995
Interpretation of the Odds Ratio
When $x = 0$, the estimated odds of success are 1.59 (the exponentiated intercept).
For each one-unit increase in $x$, the odds of success are multiplied by 2.19, i.e., an increase of about 119.3%.
4. Deviance Test
We assess the model’s significance using the deviance test, which compares:
- $H_0$: No predictors are related to the response (intercept-only model).
- $H_1$: At least one predictor is related to the response.
The test statistic is:
$$D = \text{Null Deviance} - \text{Residual Deviance}$$
which, under $H_0$, follows a $\chi^2$ distribution with degrees of freedom equal to the number of predictors tested (here, 1).
Test_Dev <- Logistic_Model$null.deviance - Logistic_Model$deviance
p_val_dev <- 1 - pchisq(q = Test_Dev, df = 1)
p_val_dev
#> [1] 0
Conclusion:
Since the p-value is approximately 0, we reject H0, confirming that X is significantly related to Y.
5. Residual Analysis
We compute deviance residuals and plot them against X.
Logistic_Resids <- residuals(Logistic_Model, type = "deviance")
plot(
y = Logistic_Resids,
x = BinData$X,
xlab = 'X',
ylab = 'Deviance Residuals'
)
This plot is not very informative. A more insightful approach is binned residual plots.
6. Binned Residual Plot
We group residuals into bins based on predicted values.
plot_bin <- function(Y,
X,
bins = 100,
return.DF = FALSE) {
Y_Name <- deparse(substitute(Y))
X_Name <- deparse(substitute(X))
Binned_Plot <- data.frame(Plot_Y = Y, Plot_X = X)
Binned_Plot$bin <-
cut(Binned_Plot$Plot_X, breaks = bins) %>% as.numeric
Binned_Plot_summary <- Binned_Plot %>%
group_by(bin) %>%
summarise(
Y_ave = mean(Plot_Y),
X_ave = mean(Plot_X),
Count = n()
) %>% as.data.frame
plot(
y = Binned_Plot_summary$Y_ave,
x = Binned_Plot_summary$X_ave,
ylab = Y_Name,
xlab = X_Name
)
if (return.DF)
return(Binned_Plot_summary)
}
plot_bin(Y = Logistic_Resids, X = BinData$X, bins = 100)
We also examine predicted values vs residuals:
Logistic_Predictions <- predict(Logistic_Model, type = "response")
plot_bin(Y = Logistic_Resids, X = Logistic_Predictions, bins = 100)
Finally, we compare predicted probabilities to actual outcomes:
NumBins <- 10
Binned_Data <- plot_bin(
Y = BinData$Y,
X = Logistic_Predictions,
bins = NumBins,
return.DF = TRUE
)
Binned_Data
#> bin Y_ave X_ave Count
#> 1 1 0.5833333 0.5382095 72
#> 2 2 0.5200000 0.5795887 75
#> 3 3 0.6567164 0.6156540 67
#> 4 4 0.7014925 0.6579674 67
#> 5 5 0.6373626 0.6984765 91
#> 6 6 0.7500000 0.7373341 72
#> 7 7 0.7096774 0.7786747 93
#> 8 8 0.8503937 0.8203819 127
#> 9 9 0.8947368 0.8601232 133
#> 10 10 0.8916256 0.9004734 203
abline(0, 1, lty = 2, col = 'blue')
7. Model Goodness-of-Fit: Hosmer-Lemeshow Test
The Hosmer-Lemeshow test evaluates whether the model fits the data well. The test statistic is:
$$X^2_{HL} = \sum_{j=1}^{J}\frac{(y_j - m_j\hat{p}_j)^2}{m_j\hat{p}_j(1 - \hat{p}_j)}$$
where:
- $y_j$ is the observed number of successes in bin $j$.
- $m_j$ is the number of observations in bin $j$.
- $\hat{p}_j$ is the average predicted probability in bin $j$.
Under $H_0$ (the model fits the data adequately), approximately:
$$X^2_{HL} \sim \chi^2_{J-1}$$
HL_BinVals <- (Binned_Data$Count * Binned_Data$Y_ave -
Binned_Data$Count * Binned_Data$X_ave) ^ 2 /
(Binned_Data$Count * Binned_Data$X_ave * (1 - Binned_Data$X_ave))
HLpval <- pchisq(q = sum(HL_BinVals),
df = NumBins - 1,
lower.tail = FALSE)
HLpval
#> [1] 0.4150004
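For comparison, the `ResourceSelection` package (assuming it is installed) provides a ready-made Hosmer-Lemeshow test; note that its default degrees of freedom are $g - 2$ and its binning can differ from the by-hand version above, so the p-values need not match exactly:
# Hosmer-Lemeshow test via the ResourceSelection package
# install.packages("ResourceSelection")   # if not already installed
library(ResourceSelection)
hoslem.test(BinData$Y, Logistic_Predictions, g = 10)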
Conclusion:
Since the p-value is 0.415 (well above 0.05), we fail to reject $H_0$.
This indicates that the model fits the data adequately.