7.1 Logistic Regression

Logistic regression is a widely used Generalized Linear Model designed for modeling binary response variables. It is particularly useful in applications such as credit scoring, medical diagnosis, and customer churn prediction.

7.1.1 Logistic Model

Given a vector of predictor variables $x_i$, the probability of a positive outcome (e.g., success, event occurring) is modeled as:

$$p_i = f(x_i; \beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$

where:

  • $p_i = E[Y_i]$ is the probability of success for observation $i$.
  • $x_i$ is the vector of predictor variables.
  • $\beta$ is the vector of model coefficients.
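
As a quick numerical check, the sketch below evaluates this logistic function directly and via R's built-in inverse-logit plogis(); the coefficient vector and predictor value are made up purely for illustration.

# Hypothetical coefficients and predictor value, for illustration only
beta <- c(0.5, 1.2)          # (intercept, slope)
x_i  <- c(1, 0.8)            # design vector: 1 for the intercept, then x = 0.8

eta <- sum(x_i * beta)                  # linear predictor x_i' beta
p_manual  <- exp(eta) / (1 + exp(eta))  # logistic function written out
p_builtin <- plogis(eta)                # same quantity via the built-in inverse-logit

c(manual = p_manual, builtin = p_builtin)  # the two values agree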

7.1.1.1 Logit Transformation

The logistic function can be rewritten in terms of the log-odds, also known as the logit function:

$$\text{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = x_i'\beta$$

where:

  • $\frac{p_i}{1-p_i}$ represents the odds of success (the ratio of the probability of success to the probability of failure).
  • The logit function ensures linearity in the parameters, which aligns with the GLM framework.

Thus, logistic regression belongs to the family of Generalized Linear Models because a function of the mean response (logit) is linear in the predictors.
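
In R, qlogis() computes the logit and plogis() computes its inverse; a quick check with an arbitrary probability confirms the relationship above.

p <- 0.73               # an arbitrary probability, for illustration
qlogis(p)               # logit: log(p / (1 - p))
log(p / (1 - p))        # the same value, written out
plogis(qlogis(p))       # the inverse-logit recovers p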

7.1.2 Likelihood Function

Since $Y_i$ follows a Bernoulli distribution with success probability $p_i$, the likelihood function for $n$ independent observations is:

$$L(p_i) = \prod_{i=1}^{n} p_i^{Y_i} (1-p_i)^{1-Y_i}$$

By substituting the logistic function for $p_i$:

$$p_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}, \qquad 1-p_i = \frac{1}{1+\exp(x_i'\beta)}$$

we obtain:

$$L(\beta) = \prod_{i=1}^{n} \left(\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\right)^{Y_i} \left(\frac{1}{1+\exp(x_i'\beta)}\right)^{1-Y_i}$$

Taking the natural logarithm of the likelihood function gives the log-likelihood function:

$$Q(\beta) = \log L(\beta) = \sum_{i=1}^{n} Y_i x_i'\beta - \sum_{i=1}^{n} \log\left(1+\exp(x_i'\beta)\right)$$

Since this function is concave, we can maximize it numerically using iterative optimization techniques, such as:

  • Newton-Raphson Method
  • Fisher Scoring Algorithm

These methods allow us to obtain the Maximum Likelihood Estimates (MLEs) of the parameters, $\hat{\beta}$.
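
As a rough sketch of how these iterations work, the code below implements Newton-Raphson for the logistic log-likelihood on hypothetical simulated data (the same simulation scheme used in the examples later in this section). For the canonical logit link, Fisher scoring and Newton-Raphson coincide, so the result should match coef(glm(...)).

# Hypothetical simulated data, for illustration only
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))

X <- cbind(1, x)        # design matrix with an intercept column
beta <- c(0, 0)         # starting values

for (iter in 1:25) {
  p <- as.vector(plogis(X %*% beta))        # current fitted probabilities
  score   <- t(X) %*% (y - p)               # gradient of the log-likelihood
  hessian <- -t(X) %*% (X * (p * (1 - p)))  # Hessian: -X' W X
  step    <- solve(hessian, score)          # Newton-Raphson step
  beta    <- beta - as.vector(step)
  if (max(abs(step)) < 1e-10) break         # stop when the update is negligible
}

cbind(newton = beta, glm = coef(glm(y ~ x, family = binomial)))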

Under standard regularity conditions, the MLEs of logistic regression parameters are asymptotically normal:

$$\hat{\beta} \dot{\sim} AN\left(\beta, [I(\beta)]^{-1}\right)$$

where:

  • $I(\beta)$ is the Fisher Information Matrix, which determines the variance-covariance structure of $\hat{\beta}$.

7.1.3 Fisher Information Matrix

The Fisher Information Matrix quantifies the amount of information that an observable random variable carries about the unknown parameter $\beta$. It is crucial in estimating the variance-covariance matrix of the estimated coefficients in logistic regression.

Mathematically, the Fisher Information Matrix is defined as:

$$I(\beta) = E\left[\frac{\partial \log L(\beta)}{\partial \beta} \frac{\partial \log L(\beta)}{\partial \beta'}\right]$$

which expands to:

$$I(\beta) = E\left[\left(\frac{\partial \log L(\beta)}{\partial \beta_i} \cdot \frac{\partial \log L(\beta)}{\partial \beta_j}\right)_{ij}\right]$$

Under regularity conditions, the Fisher Information Matrix is equivalent to the negative expected Hessian matrix:

$$I(\beta) = -E\left[\frac{\partial^2 \log L(\beta)}{\partial \beta \, \partial \beta'}\right]$$

which further expands to:

$$I(\beta) = -E\left[\left(\frac{\partial^2 \log L(\beta)}{\partial \beta_i \, \partial \beta_j}\right)_{ij}\right]$$

This representation is particularly useful because it allows us to compute the Fisher Information Matrix directly from the Hessian of the log-likelihood function.


Example: Fisher Information Matrix in Logistic Regression

Consider a simple logistic regression model with one predictor:

$$x_i'\beta = \beta_0 + \beta_1 x_i$$

From the log-likelihood function, the second-order partial derivatives are:

$$
\begin{aligned}
\frac{\partial^2 \ln L(\beta)}{\partial \beta_0^2}
&= -\sum_{i=1}^{n}\left\{\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)} - \left[\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\right]^2\right\}
= -\sum_{i=1}^{n} p_i(1-p_i) && \text{(Intercept)} \\
\frac{\partial^2 \ln L(\beta)}{\partial \beta_1^2}
&= -\sum_{i=1}^{n}\left\{x_i^2\,\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)} - \left[x_i\,\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\right]^2\right\}
= -\sum_{i=1}^{n} x_i^2\, p_i(1-p_i) && \text{(Slope)} \\
\frac{\partial^2 \ln L(\beta)}{\partial \beta_0 \,\partial \beta_1}
&= -\sum_{i=1}^{n}\left\{x_i\,\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)} - x_i\left[\frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\right]^2\right\}
= -\sum_{i=1}^{n} x_i\, p_i(1-p_i) && \text{(Cross-derivative)}
\end{aligned}
$$

Combining these elements, the Fisher Information Matrix for the logistic regression model is:

$$I(\beta) = \begin{bmatrix} \sum_{i=1}^{n} p_i(1-p_i) & \sum_{i=1}^{n} x_i\, p_i(1-p_i) \\ \sum_{i=1}^{n} x_i\, p_i(1-p_i) & \sum_{i=1}^{n} x_i^2\, p_i(1-p_i) \end{bmatrix}$$

where:

  • $p_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}$ is the predicted probability of success.
  • $p_i(1-p_i)$ is the variance of the Bernoulli response variable.
  • The diagonal elements of $[I(\beta)]^{-1}$ give the variances of the estimated coefficients.
  • The off-diagonal element of $[I(\beta)]^{-1}$ gives the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$.

The inverse of the Fisher Information Matrix provides the variance-covariance matrix of the estimated coefficients:

$$\text{Var}(\hat{\beta}) = [I(\hat{\beta})]^{-1}$$

This matrix is essential for:

  • Estimating standard errors of the logistic regression coefficients.
  • Constructing confidence intervals for $\beta$.
  • Performing hypothesis tests (e.g., Wald Test).

# Load necessary library
library(stats)

# Simulated dataset
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))

# Fit logistic regression model
model <- glm(y ~ x, family = binomial)

# Extract the estimated variance-covariance matrix of the coefficients,
# which is the inverse of the Fisher Information Matrix
vcov_beta <- summary(model)$cov.unscaled

# Display the variance-covariance matrix
# (the Fisher Information Matrix itself is solve(vcov_beta))
print(vcov_beta)
#>             (Intercept)          x
#> (Intercept)  0.05718171 0.01564322
#> x            0.01564322 0.10302992
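
As a cross-check, the same quantities can be built by hand: for logistic regression the Fisher Information is $I(\hat{\beta}) = X'WX$ with $W = \text{diag}(\hat{p}_i(1-\hat{p}_i))$, and its inverse should reproduce the variance-covariance matrix printed above (up to numerical error). This is a minimal sketch reusing the model object fitted above.

# Build I(beta_hat) = X' W X directly from the fitted model
X_mat <- model.matrix(model)              # design matrix (intercept and x)
p_hat <- fitted(model)                    # fitted probabilities p_i
w     <- p_hat * (1 - p_hat)              # Bernoulli variances
fisher_manual <- t(X_mat) %*% (X_mat * w) # X' W X

fisher_manual                             # the Fisher Information Matrix
solve(fisher_manual)                      # matches summary(model)$cov.unscaled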

7.1.4 Inference in Logistic Regression

Once we estimate the model parameters $\hat{\beta}$ using Maximum Likelihood Estimation, we can conduct inference to assess the significance of predictors, construct confidence intervals, and perform hypothesis tests. The two most common inference approaches in logistic regression are:

  1. Likelihood Ratio Test
  2. Wald Statistics

These tests rely on the asymptotic normality of MLEs and the properties of the Fisher Information Matrix.


7.1.4.1 Likelihood Ratio Test

The Likelihood Ratio Test compares two models:

  • Restricted Model: A simpler model where some parameters are constrained to specific values.
  • Unrestricted Model: The full model without constraints.

To test a hypothesis about a subset of parameters $\beta_1$, we leave the remaining parameters $\beta_2$ (nuisance parameters) unspecified.

Hypothesis Setup:

$$H_0: \beta_1 = \beta_{1,0}$$

where $\beta_{1,0}$ is a specified value (often zero). Let:

  • $\hat{\beta}_{2,0}$ be the MLE of $\beta_2$ under the constraint $\beta_1 = \beta_{1,0}$.
  • $\hat{\beta}_1, \hat{\beta}_2$ be the MLEs under the full model.

The likelihood ratio test statistic is:

$$-2\log\Lambda = -2\left[\log L(\beta_{1,0}, \hat{\beta}_{2,0}) - \log L(\hat{\beta}_1, \hat{\beta}_2)\right]$$

where:

  • The first term is the log-likelihood of the restricted model.
  • The second term is the log-likelihood of the unrestricted model.

Under the null hypothesis:

$$-2\log\Lambda \sim \chi^2_{\upsilon}$$

where $\upsilon$ is the number of restricted parameters. We reject $H_0$ if:

$$-2\log\Lambda > \chi^2_{\upsilon, 1-\alpha}$$

Interpretation: If the likelihood ratio test statistic is large, this suggests that the restricted model (under H0) fits significantly worse than the full model, leading us to reject the null hypothesis.
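
As a concrete sketch (reusing the simulated x and y from the Fisher-information example above), the statistic can be computed directly from logLik() by fitting the restricted (intercept-only) and unrestricted models; the result should agree with the anova(..., test = "Chisq") output shown later.

# Likelihood ratio test by hand on the simulated data from above
model_full <- glm(y ~ x, family = binomial)   # unrestricted model
model_null <- glm(y ~ 1, family = binomial)   # restricted (intercept-only) model

LR_stat <- -2 * (as.numeric(logLik(model_null)) - as.numeric(logLik(model_full)))
p_value <- pchisq(LR_stat, df = 1, lower.tail = FALSE)  # one restricted parameter

c(statistic = LR_stat, p.value = p_value)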


7.1.4.2 Wald Test

The Wald test is based on the asymptotic normality of MLEs:

$$\hat{\beta} \sim AN\left(\beta, [I(\beta)]^{-1}\right)$$

We test:

$$H_0: L\beta = 0$$

where $L$ is a $q \times p$ matrix with $q$ linearly independent rows (often used to test multiple coefficients simultaneously). The Wald test statistic is:

$$W = (L\hat{\beta})'\left(L[I(\hat{\beta})]^{-1}L'\right)^{-1}(L\hat{\beta})$$

Under $H_0$:

$$W \sim \chi^2_q$$

Interpretation: If W is large, the null hypothesis is rejected, suggesting that at least one of the tested coefficients is significantly different from zero.
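
A minimal sketch of the Wald statistic for the single restriction $H_0: \beta_1 = 0$ on the same simulated fit is given below; here $L$ is just the $1 \times 2$ matrix selecting the slope, and vcov() supplies $[I(\hat{\beta})]^{-1}$. The statistic equals the square of the z value reported by summary().

# Wald test by hand for H0: beta_1 = 0, reusing the simulated data from above
model <- glm(y ~ x, family = binomial)

L_mat <- matrix(c(0, 1), nrow = 1)    # selects the slope coefficient
b_hat <- coef(model)
V_hat <- vcov(model)                  # [I(beta_hat)]^{-1}

W <- t(L_mat %*% b_hat) %*% solve(L_mat %*% V_hat %*% t(L_mat)) %*% (L_mat %*% b_hat)
p_value <- pchisq(as.numeric(W), df = nrow(L_mat), lower.tail = FALSE)

c(W = as.numeric(W), p.value = p_value)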


Comparing Likelihood Ratio and Wald Tests

  • Likelihood Ratio Test: more accurate in small samples, providing better control of error rates; recommended when sample sizes are small.
  • Wald Test: easier to compute, but may be inaccurate in small samples; recommended when computational efficiency is a priority.

# Load necessary library
library(stats)

# Simulate some binary outcome data
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))

# Fit logistic regression model
model <- glm(y ~ x, family = binomial)

# Display model summary (includes Wald tests)
summary(model)
#> 
#> Call:
#> glm(formula = y ~ x, family = binomial)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6226  -0.9385   0.5287   0.8333   1.4656  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   0.7223     0.2391   3.020 0.002524 ** 
#> x             1.2271     0.3210   3.823 0.000132 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 128.21  on 99  degrees of freedom
#> Residual deviance: 108.29  on 98  degrees of freedom
#> AIC: 112.29
#> 
#> Number of Fisher Scoring iterations: 4

# Perform likelihood ratio test using anova()
anova(model, test="Chisq")
#> Analysis of Deviance Table
#> 
#> Model: binomial, link: logit
#> 
#> Response: y
#> 
#> Terms added sequentially (first to last)
#> 
#> 
#>      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
#> NULL                    99     128.21              
#> x     1   19.913        98     108.29 8.105e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.1.4.3 Confidence Intervals for Coefficients

A 95% confidence interval for a logistic regression coefficient $\beta_i$ is given by:

$$\hat{\beta}_i \pm 1.96\, \hat{s}_{ii}$$

where:

  • $\hat{\beta}_i$ is the estimated coefficient.
  • $\hat{s}_{ii}$ is its standard error (the square root of the $i$-th diagonal element of $[I(\hat{\beta})]^{-1}$).

This confidence interval provides a range of plausible values for $\beta_i$. If the interval does not include zero, we conclude that $\beta_i$ is statistically significant at the 5% level.
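
In R, confint.default() returns exactly these Wald intervals, while confint() applied to a glm profiles the likelihood instead (on older R versions this dispatches to MASS::confint.glm); comparing the two on the simulated fit from above is a quick sanity check.

# Wald vs profile-likelihood intervals on the simulated fit from above
confint.default(model)   # Wald: beta_hat +/- 1.96 * SE
confint(model)           # profile-likelihood intervals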


7.1.4.4 Interpretation of Logistic Regression Coefficients

For a single predictor variable, the logistic regression model is:

$$\text{logit}(\hat{p}_i) = \log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

where:

  • $\hat{p}_i$ is the predicted probability of success at $x_i$.
  • $\hat{\beta}_1$ is the change in the log odds for a one-unit increase in $x$.

Interpreting $\beta_1$ in Terms of Odds

When the predictor variable increases by one unit, the logit of the probability changes by $\hat{\beta}_1$:

$$\text{logit}(\hat{p}_{x_i+1}) = \hat{\beta}_0 + \hat{\beta}_1(x_i+1) = \text{logit}(\hat{p}_{x_i}) + \hat{\beta}_1$$

Thus, the difference in log odds is:

$$\text{logit}(\hat{p}_{x_i+1}) - \text{logit}(\hat{p}_{x_i}) = \log\left(\text{odds}(\hat{p}_{x_i+1})\right) - \log\left(\text{odds}(\hat{p}_{x_i})\right) = \log\left(\frac{\text{odds}(\hat{p}_{x_i+1})}{\text{odds}(\hat{p}_{x_i})}\right) = \hat{\beta}_1$$

Exponentiating both sides:

$$\exp(\hat{\beta}_1) = \frac{\text{odds}(\hat{p}_{x_i+1})}{\text{odds}(\hat{p}_{x_i})}$$

This quantity, $\exp(\hat{\beta}_1)$, is the odds ratio, which quantifies the effect of a one-unit increase in $x$ on the odds of success.


Generalization: Odds Ratio for Any Change in x

For a difference of $c$ units in the predictor $x$, the estimated odds ratio is:

$$\exp(c\hat{\beta}_1)$$

For multiple predictors, $\exp(\hat{\beta}_k)$ represents the odds ratio for $x_k$, holding all other variables constant.
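
For instance, on the simulated fit from above, the estimated odds ratio for a hypothetical two-unit increase in x is simply exp(2 * beta_1):

# Odds ratio for a c-unit change in x (c = 2 chosen arbitrarily for illustration)
c_units   <- 2
beta1_hat <- coef(model)["x"]
exp(c_units * beta1_hat)   # multiplicative change in the odds for a 2-unit increase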


7.1.4.5 Inference on the Mean Response

For a given set of predictor values $x_h = (1, x_{h1}, \ldots, x_{h,p-1})'$, the estimated mean response (probability of success) is:

$$\hat{p}_h = \frac{\exp(x_h'\hat{\beta})}{1+\exp(x_h'\hat{\beta})}$$

The estimated variance of the linear predictor $x_h'\hat{\beta}$ (i.e., on the logit scale) is:

$$s^2(\hat{p}_h) = x_h'[I(\hat{\beta})]^{-1}x_h$$

where:

  • $[I(\hat{\beta})]^{-1}$ is the variance-covariance matrix of $\hat{\beta}$.
  • $s^2(\hat{p}_h)$ quantifies the uncertainty in $\hat{p}_h$; confidence intervals for the mean response are typically constructed on the logit scale and then transformed back to the probability scale.

In many applications, logistic regression is used for classification, where we predict whether an observation belongs to category 0 or 1. A commonly used decision rule is:

  • Assign $y = 1$ if $\hat{p}_h \ge \tau$
  • Assign $y = 0$ if $\hat{p}_h < \tau$

where $\tau$ is a chosen cutoff threshold (typically $\tau = 0.5$).
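
A minimal sketch of both ideas on the simulated fit from above: predict() with se.fit = TRUE returns the standard error of $x_h'\hat{\beta}$ on the link (logit) scale, which can be converted into an approximate interval for $\hat{p}_h$, and a cutoff of $\tau = 0.5$ turns fitted probabilities into class labels. The evaluation point $x_h = 1$ is hypothetical.

# Inference on the mean response at a hypothetical point x_h = 1
new_x <- data.frame(x = 1)
pred_link <- predict(model, newdata = new_x, type = "link", se.fit = TRUE)

# 95% interval on the logit scale, transformed back to the probability scale
ci_logit <- pred_link$fit + c(-1.96, 1.96) * pred_link$se.fit
plogis(ci_logit)                    # approximate CI for p_h

# Classification with cutoff tau = 0.5
tau   <- 0.5
y_hat <- as.numeric(fitted(model) >= tau)
table(Predicted = y_hat, Observed = y)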


# Load necessary library
library(stats)

# Simulated dataset
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))

# Fit logistic regression model
model <- glm(y ~ x, family = binomial)

# Display model summary
summary(model)
#> 
#> Call:
#> glm(formula = y ~ x, family = binomial)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6226  -0.9385   0.5287   0.8333   1.4656  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   0.7223     0.2391   3.020 0.002524 ** 
#> x             1.2271     0.3210   3.823 0.000132 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 128.21  on 99  degrees of freedom
#> Residual deviance: 108.29  on 98  degrees of freedom
#> AIC: 112.29
#> 
#> Number of Fisher Scoring iterations: 4

# Extract coefficients and standard errors
coef_estimates <- coef(summary(model))
beta_hat <- coef_estimates[, 1]   # Estimated coefficients
se_beta  <- coef_estimates[, 2]   # Standard errors

# Compute 95% confidence intervals for coefficients
conf_intervals <- cbind(
  beta_hat - 1.96 * se_beta, 
  beta_hat + 1.96 * se_beta
)

# Compute Odds Ratios
odds_ratios <- exp(beta_hat)

# Display results
print("Confidence Intervals for Coefficients:")
#> [1] "Confidence Intervals for Coefficients:"
print(conf_intervals)
#>                  [,1]     [,2]
#> (Intercept) 0.2535704 1.190948
#> x           0.5979658 1.856218

print("Odds Ratios:")
#> [1] "Odds Ratios:"
print(odds_ratios)
#> (Intercept)           x 
#>    2.059080    3.411295

# Predict probability for a new observation (e.g., x = 1)
new_x <- data.frame(x = 1)
predicted_prob <- predict(model, newdata = new_x, type = "response")

print("Predicted Probability for x = 1:")
#> [1] "Predicted Probability for x = 1:"
print(predicted_prob)
#>         1 
#> 0.8753759

7.1.5 Application: Logistic Regression

In this section, we demonstrate the application of logistic regression using simulated data. We explore model fitting, inference, residual analysis, and goodness-of-fit testing.

1. Load Required Libraries

library(kableExtra)
library(dplyr)
library(pscl)
library(ggplot2)
library(faraway)
library(nnet)
library(agridat)
library(nlstools)

2. Data Generation

We generate a dataset where the predictor variable X follows a uniform distribution:

$$x \sim \text{Uniform}(-0.5, 2.5)$$

The linear predictor is given by:

$$\eta = 0.5 + 0.75x$$

Passing $\eta$ through the inverse-logit function, we obtain:

$$p = \frac{\exp(\eta)}{1+\exp(\eta)}$$

which ensures that $p \in [0, 1]$. We then generate the binary response variable:

$$y \sim \text{Bernoulli}(p)$$

set.seed(23) # Set seed for reproducibility
x <- runif(1000, min = -0.5, max = 2.5)  # Generate X values
eta1 <- 0.5 + 0.75 * x                   # Compute linear predictor
p <- exp(eta1) / (1 + exp(eta1))          # Compute probabilities
y <- rbinom(1000, 1, p)                   # Generate binary response
BinData <- data.frame(X = x, Y = y)       # Create data frame

3. Model Fitting

We fit a logistic regression model to the simulated data:

$$\text{logit}(p) = \beta_0 + \beta_1 X$$

Logistic_Model <- glm(formula = Y ~ X,
                      # Specifies the response distribution
                      family = binomial,
                      data = BinData)

summary(Logistic_Model) # Model summary
#> 
#> Call:
#> glm(formula = Y ~ X, family = binomial, data = BinData)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.2317   0.4153   0.5574   0.7922   1.1469  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  0.46205    0.10201   4.530 5.91e-06 ***
#> X            0.78527    0.09296   8.447  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1106.7  on 999  degrees of freedom
#> Residual deviance: 1027.4  on 998  degrees of freedom
#> AIC: 1031.4
#> 
#> Number of Fisher Scoring iterations: 4
nlstools::confint2(Logistic_Model) # Confidence intervals
#>                 2.5 %    97.5 %
#> (Intercept) 0.2618709 0.6622204
#> X           0.6028433 0.9676934

# Compute odds ratios
OddsRatio <- coef(Logistic_Model) %>% exp
OddsRatio
#> (Intercept)           X 
#>    1.587318    2.192995

Interpretation of the Odds Ratio

  • When $x = 0$, the estimated odds of success are 1.59.

  • For each one-unit increase in $x$, the estimated odds of success are multiplied by 2.19, i.e., they increase by about 119.3%.

4. Deviance Test

We assess the model’s significance using the deviance test, which compares:

  • $H_0$: No predictors are related to the response (intercept-only model).

  • $H_1$: At least one predictor is related to the response.

The test statistic is:

$$D = \text{Null Deviance} - \text{Residual Deviance}$$

Under $H_0$, $D$ follows an approximate $\chi^2$ distribution with degrees of freedom equal to the number of predictors being tested (here, 1).

Test_Dev <- Logistic_Model$null.deviance - Logistic_Model$deviance
p_val_dev <- 1 - pchisq(q = Test_Dev, df = 1)
p_val_dev
#> [1] 0

Conclusion:

Since the p-value is approximately 0, we reject $H_0$, indicating that $X$ is significantly related to $Y$.

5. Residual Analysis

We compute deviance residuals and plot them against X.

Logistic_Resids <- residuals(Logistic_Model, type = "deviance")

plot(
    y = Logistic_Resids,
    x = BinData$X,
    xlab = 'X',
    ylab = 'Deviance Residuals'
)

This plot is not very informative on its own. A more insightful approach is to use binned residual plots.

6. Binned Residual Plot

We group the residuals into bins, first by the predictor and then by the predicted values.

plot_bin <- function(Y,
                     X,
                     bins = 100,
                     return.DF = FALSE) {
  Y_Name <- deparse(substitute(Y))
  X_Name <- deparse(substitute(X))
  
  Binned_Plot <- data.frame(Plot_Y = Y, Plot_X = X)
  Binned_Plot$bin <-
    cut(Binned_Plot$Plot_X, breaks = bins) %>% as.numeric
  
  Binned_Plot_summary <- Binned_Plot %>%
    group_by(bin) %>%
    summarise(
      Y_ave = mean(Plot_Y),
      X_ave = mean(Plot_X),
      Count = n()
    ) %>% as.data.frame
  
  plot(
    y = Binned_Plot_summary$Y_ave,
    x = Binned_Plot_summary$X_ave,
    ylab = Y_Name,
    xlab = X_Name
  )
  
  if (return.DF)
    return(Binned_Plot_summary)
}

plot_bin(Y = Logistic_Resids, X = BinData$X, bins = 100)

We also examine predicted values vs residuals:

Logistic_Predictions <- predict(Logistic_Model, type = "response")
plot_bin(Y = Logistic_Resids, X = Logistic_Predictions, bins = 100)

Finally, we compare predicted probabilities to actual outcomes:

NumBins <- 10
Binned_Data <- plot_bin(
    Y = BinData$Y,
    X = Logistic_Predictions,
    bins = NumBins,
    return.DF = TRUE
)
Binned_Data
#>    bin     Y_ave     X_ave Count
#> 1    1 0.5833333 0.5382095    72
#> 2    2 0.5200000 0.5795887    75
#> 3    3 0.6567164 0.6156540    67
#> 4    4 0.7014925 0.6579674    67
#> 5    5 0.6373626 0.6984765    91
#> 6    6 0.7500000 0.7373341    72
#> 7    7 0.7096774 0.7786747    93
#> 8    8 0.8503937 0.8203819   127
#> 9    9 0.8947368 0.8601232   133
#> 10  10 0.8916256 0.9004734   203
abline(0, 1, lty = 2, col = 'blue')

7. Model Goodness-of-Fit: Hosmer-Lemeshow Test

The Hosmer-Lemeshow test evaluates whether the model fits the data well. The test statistic is:

$$X^2_{HL} = \sum_{j=1}^{J} \frac{(y_j - m_j\hat{p}_j)^2}{m_j\hat{p}_j(1-\hat{p}_j)}$$

where:

  • $y_j$ is the observed number of successes in bin $j$.

  • $m_j$ is the number of observations in bin $j$.

  • $\hat{p}_j$ is the average predicted probability in bin $j$.

Under $H_0$, we assume:

$$X^2_{HL} \sim \chi^2_{J-1}$$

HL_BinVals <- (Binned_Data$Count * Binned_Data$Y_ave - 
               Binned_Data$Count * Binned_Data$X_ave) ^ 2 /   
               (Binned_Data$Count * Binned_Data$X_ave * (1 - Binned_Data$X_ave))

HLpval <- pchisq(q = sum(HL_BinVals),
                 df = NumBins - 1,
                 lower.tail = FALSE)
HLpval
#> [1] 0.4150004

Conclusion:

  • Since the p-value is 0.415, we fail to reject $H_0$.

  • This indicates no evidence of lack of fit; the model appears to fit the data adequately.
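
If the ResourceSelection package is installed, its hoslem.test() implements the same test with automatic binning and can serve as a cross-check; its p-value will differ somewhat from the hand-rolled version above because the bins are formed differently.

# Cross-check with a packaged implementation (requires the ResourceSelection package)
library(ResourceSelection)
hoslem.test(BinData$Y, fitted(Logistic_Model), g = 10)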