7.3 Binomial Regression

In previous sections, we introduced binomial regression models, including logistic and probit regression, and discussed their theoretical foundations. Now we apply these methods to real-world data using the esoph dataset, which examines the relationship between esophageal cancer and potential risk factors such as alcohol consumption, tobacco use, and age group.

7.3.1 Dataset Overview

The esoph dataset consists of:

  • Successes (ncases): The number of individuals diagnosed with esophageal cancer.

  • Failures (ncontrols): The number of individuals in the control group (without cancer).

  • Predictors:

    • agegp: Age group of individuals.

    • alcgp: Alcohol consumption category.

    • tobgp: Tobacco consumption category.

Before fitting our models, let’s inspect the dataset and visualize some key relationships.

# Load and inspect the dataset
data("esoph")
head(esoph, n = 3)
#>   agegp     alcgp    tobgp ncases ncontrols
#> 1 25-34 0-39g/day 0-9g/day      0        40
#> 2 25-34 0-39g/day    10-19      0        10
#> 3 25-34 0-39g/day    20-29      0         6

# Visualizing the proportion of cancer cases by alcohol consumption
plot(
  esoph$ncases / (esoph$ncases + esoph$ncontrols) ~ esoph$alcgp,
  ylab = "Proportion of Cancer Cases",
  xlab = "Alcohol Consumption Group",
  main = "Esophageal Cancer Data"
)


# The predictors are stored as *ordered* factors; convert them to plain
# (unordered) factors so that glm() uses treatment contrasts, giving
# dummy-coded coefficients relative to a baseline level
esoph$agegp <- factor(esoph$agegp, ordered = FALSE)
esoph$alcgp <- factor(esoph$alcgp, ordered = FALSE)
esoph$tobgp <- factor(esoph$tobgp, ordered = FALSE)

7.3.2 Apply Logistic Model

We first fit a logistic regression model. For grouped binomial data, the response is supplied as a two-column matrix of successes (ncases) and failures (ncontrols) via cbind(); this is equivalent to modeling the proportion of cases in each group, weighted by group size.

# Logistic regression using alcohol consumption as a predictor
model <-
    glm(cbind(ncases, ncontrols) ~ alcgp,
        data = esoph,
        family = binomial)

# Summary of the model
summary(model)
#> 
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ alcgp, family = binomial, 
#>     data = esoph)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -4.0759  -1.2037  -0.0183   1.0928   3.7336  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  -2.5885     0.1925 -13.444  < 2e-16 ***
#> alcgp40-79    1.2712     0.2323   5.472 4.46e-08 ***
#> alcgp80-119   2.0545     0.2611   7.868 3.59e-15 ***
#> alcgp120+     3.3042     0.3237  10.209  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 367.95  on 87  degrees of freedom
#> Residual deviance: 221.46  on 84  degrees of freedom
#> AIC: 344.51
#> 
#> Number of Fisher Scoring iterations: 5

Interpretation

  • The coefficients are log-odds of esophageal cancer for each alcohol consumption group relative to the baseline group (0-39 g/day). For example, the estimate of 3.3042 for alcgp120+ means the odds of cancer in that group are exp(3.3042) ≈ 27 times the baseline odds.

  • The p-values indicate whether each consumption level's effect differs significantly from the baseline.

Model Diagnostics

# Convert coefficients to odds ratios
exp(coefficients(model))
#> (Intercept)  alcgp40-79 alcgp80-119   alcgp120+ 
#>  0.07512953  3.56527094  7.80261593 27.22570533
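To read these effects on the probability scale, we can compute the fitted probability of being a case in each alcohol group. A minimal sketch using predict(); given the coefficients above, these come out to roughly 0.07, 0.21, 0.37, and 0.67:

# Fitted case probabilities per alcohol group; with a single categorical
# predictor these are just the inverse-logit of the linear predictor
new_groups <- data.frame(alcgp = levels(esoph$alcgp))
predict(model, newdata = new_groups, type = "response")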

# Model goodness-of-fit measures
deviance(model) / df.residual(model)  # Near 1 indicates adequate fit; well above 1 suggests lack of fit or overdispersion
#> [1] 2.63638
model$aic  # Lower AIC is preferable for model comparison
#> [1] 344.5109
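The ratio of 2.64 is well above 1, hinting at lack of fit or overdispersion. One common check, sketched below, refits the model with a quasibinomial family, which estimates the dispersion parameter instead of fixing it at 1:

# Quasibinomial refit: same coefficient estimates, but the dispersion
# parameter is estimated (from Pearson residuals) rather than fixed at 1
quasi_model <- glm(cbind(ncases, ncontrols) ~ alcgp,
                   data = esoph,
                   family = quasibinomial)
summary(quasi_model)$dispersion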

To improve our model, we include age group (agegp) as an additional predictor.

# Logistic regression with alcohol consumption and age
better_model <- glm(
    cbind(ncases, ncontrols) ~ agegp + alcgp,
    data = esoph,
    family = binomial
)

# Summary of the improved model
summary(better_model)
#> 
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp, family = binomial, 
#>     data = esoph)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.2395  -0.7186  -0.2324   0.7930   3.3538  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  -6.1472     1.0419  -5.900 3.63e-09 ***
#> agegp35-44    1.6311     1.0800   1.510 0.130973    
#> agegp45-54    3.4258     1.0389   3.297 0.000976 ***
#> agegp55-64    3.9435     1.0346   3.811 0.000138 ***
#> agegp65-74    4.3568     1.0413   4.184 2.87e-05 ***
#> agegp75+      4.4242     1.0914   4.054 5.04e-05 ***
#> alcgp40-79    1.4343     0.2448   5.859 4.64e-09 ***
#> alcgp80-119   2.0071     0.2776   7.230 4.84e-13 ***
#> alcgp120+     3.6800     0.3763   9.778  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 367.95  on 87  degrees of freedom
#> Residual deviance: 105.88  on 79  degrees of freedom
#> AIC: 238.94
#> 
#> Number of Fisher Scoring iterations: 6

# Model evaluation
better_model$aic # Lower AIC is better
#> [1] 238.9361

# Convert coefficients to odds ratios, shown as a data frame
data.frame(`Odds Ratios` = exp(coefficients(better_model)))
#>              Odds.Ratios
#> (Intercept)  0.002139482
#> agegp35-44   5.109601844
#> agegp45-54  30.748594216
#> agegp55-64  51.596634690
#> agegp65-74  78.005283850
#> agegp75+    83.448437749
#> alcgp40-79   4.196747169
#> alcgp80-119  7.441782227
#> alcgp120+   39.646885126
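Odds ratios are more informative with uncertainty attached. A quick sketch using confint() and exponentiating the resulting bounds:

# 95% confidence intervals on the odds-ratio scale
# (confint() on a glm profiles the likelihood, so it may take a moment)
exp(confint(better_model))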

# Compare nested models with a likelihood ratio test: under the null,
# the drop in deviance is chi-squared distributed with df equal to the
# difference in residual degrees of freedom
pchisq(
    q = model$deviance - better_model$deviance,
    df = model$df.residual - better_model$df.residual,
    lower.tail = FALSE
)
#> [1] 2.713923e-23
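The same test is available through anova(), which lays out the deviance table directly:

# Equivalent likelihood ratio test via anova(); the reported deviance
# difference and p-value match the manual pchisq() computation above
anova(model, better_model, test = "Chisq")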

Key Takeaways

  • AIC Reduction: The AIC drops from 344.5 to 238.9, indicating that adding age as a predictor substantially improves the model.

  • Likelihood Ratio Test: This test compares the two nested models; the very small p-value (≈ 2.7e-23) shows the improvement is statistically significant.

7.3.3 Apply Probit Model

As discussed earlier, the probit model is an alternative to logistic regression that uses the standard normal cumulative distribution function as the link instead of the logistic function.

# Probit regression model
Prob_better_model <- glm(
    cbind(ncases, ncontrols) ~ agegp + alcgp,
    data = esoph,
    family = binomial(link = probit)
)

# Summary of the probit model
summary(Prob_better_model)
#> 
#> Call:
#> glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp, family = binomial(link = probit), 
#>     data = esoph)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.1325  -0.6877  -0.1661   0.7654   3.3258  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  -3.3741     0.4922  -6.855 7.13e-12 ***
#> agegp35-44    0.8562     0.5081   1.685 0.092003 .  
#> agegp45-54    1.7829     0.4904   3.636 0.000277 ***
#> agegp55-64    2.1034     0.4876   4.314 1.61e-05 ***
#> agegp65-74    2.3374     0.4930   4.741 2.13e-06 ***
#> agegp75+      2.3694     0.5275   4.491 7.08e-06 ***
#> alcgp40-79    0.8080     0.1330   6.076 1.23e-09 ***
#> alcgp80-119   1.1399     0.1558   7.318 2.52e-13 ***
#> alcgp120+     2.1204     0.2060  10.295  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 367.95  on 87  degrees of freedom
#> Residual deviance: 104.48  on 79  degrees of freedom
#> AIC: 237.53
#> 
#> Number of Fisher Scoring iterations: 6

Why Consider a Probit Model?

  • Like logistic regression, probit regression models probabilities, but it assumes a standard normal distribution for the latent error term rather than a logistic one.

  • While the raw coefficients are on a different scale (and cannot be exponentiated into odds ratios), model comparisons can still be made using AIC, as the sketch below shows.
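To make the comparison concrete, a minimal sketch putting all three fits side by side; the coefficient ratio illustrates the familiar rule of thumb that logit coefficients run roughly 1.6-1.8 times their probit counterparts:

# AIC comparison across all three fitted models (lower is better)
AIC(model, better_model, Prob_better_model)

# Ratio of logit to probit coefficients; values around 1.7-1.9 here
# reflect the larger standard deviation of the logistic latent error
# (pi / sqrt(3) ~ 1.81 versus 1 for the standard normal)
coef(better_model) / coef(Prob_better_model)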