8.4 Information Criteria for Model Selection

Information Criteria are statistical tools used to compare competing models by balancing model fit (likelihood) with model complexity (number of parameters).
They help in identifying the most parsimonious model that adequately explains the data without overfitting.

The three most commonly used criteria are the Akaike Information Criterion (AIC), the corrected AIC (AICc), and the Bayesian Information Criterion (BIC), each discussed below.


8.4.1 Akaike Information Criterion

The Akaike Information Criterion is derived from the Kullback-Leibler divergence, which measures the difference between the true data-generating process and the fitted model.

AIC Formula:

$$AIC = -2\,l(\hat{\theta}, \hat{\beta}) + 2q$$

where:

  • $l(\hat{\theta}, \hat{\beta})$: The maximized log-likelihood of the model, evaluated at the estimates $\hat{\theta}$ (variance components) and $\hat{\beta}$ (fixed effects).
  • q: The effective number of parameters, including:
    • The number of fixed effects.
    • The number of variance-covariance parameters (random effects).
    • Parameters constrained to boundary values (e.g., variances estimated as zero) are excluded from this count.

Key Points About AIC

  • Model Selection Rule:
    • Lower AIC indicates a better model.
    • Occasionally, software may report AIC as $l(\hat{\theta}, \hat{\beta}) - q$, in which case higher AIC is better (rare).
  • Comparing Random Effects Models:
    • Not recommended when comparing models with different random effects because it’s difficult to accurately count the effective number of parameters.
  • Sample Size Considerations:
    • Requires large sample sizes for reliable comparisons.
    • In small samples, AIC tends to favor more complex models due to insufficient penalty for model complexity.
  • Potential Bias:
    • Can be negatively biased (i.e., favoring overly complex models) when the sample size is small relative to the number of parameters.

When to Use AIC:

  • Comparing models with the same random effects structure but different fixed effects.
  • Selecting covariance structures in mixed models when the sample size is large.
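As a quick illustration, AIC can be computed by hand from a model's maximized log-likelihood and checked against the value R reports. The sketch below uses the sleepstudy data shipped with lme4 purely as a stand-in example; the parameter count q is taken from the df attribute of the log-likelihood, which is one common (but not the only) way of counting effective parameters in a mixed model.

# A minimal sketch: computing AIC by hand and checking it against AIC()
library(lme4)

fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

ll <- logLik(fit)          # maximized log-likelihood l(theta-hat, beta-hat)
q  <- attr(ll, "df")       # parameter count as reported by the fit
aic_manual <- -2 * as.numeric(ll) + 2 * q

aic_manual                 # manual value
AIC(fit)                   # built-in value; the two should agree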

8.4.2 Corrected AIC

The Corrected AIC (AICc) addresses the bias in AIC for small sample sizes. It was developed by Hurvich and Tsai (1989).

AICc Formula:

$$AIC_c = AIC + \frac{2q(q+1)}{n - q - 1}$$

where:

  • n: The sample size.
  • q: The number of estimated parameters.

Key Points About AICc

  • Small Sample Correction:
    • Provides a stronger penalty for model complexity when the sample size is small.
  • Applicability:
    • Valid when comparing models with fixed covariance structures.
    • Not recommended for models with general covariance structures due to difficulties in bias correction.
  • Model Selection Rule:
    • Lower AICc indicates a better model.

When to Use AICc:

  • Small sample sizes (when the ratio $n/q$ is low).
  • Models with fixed or simple covariance structures.
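Continuing the illustration above, the correction can be computed directly from AIC, the parameter count $q$, and the sample size $n$. In the sketch below $n$ is taken as the number of observations via nobs(), which MuMIn::AICc() also uses by default, so the two values are expected to agree; whether observations or subjects is the right notion of sample size in a mixed model is itself a judgment call.

# A minimal sketch: applying the small-sample correction by hand
library(lme4)
library(MuMIn)

fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

q <- attr(logLik(fit), "df")   # number of estimated parameters
n <- nobs(fit)                 # sample size (here: number of observations)

aicc_manual <- AIC(fit) + (2 * q * (q + 1)) / (n - q - 1)

aicc_manual                    # manual value
AICc(fit)                      # MuMIn's value, for comparison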

8.4.3 Bayesian Information Criterion

The Bayesian Information Criterion is derived from a Bayesian framework and incorporates a stronger penalty for model complexity compared to AIC.

BIC Formula

$$BIC = -2\,l(\hat{\theta}, \hat{\beta}) + q\log(n)$$

where:

  • n: The number of observations.
  • q: The number of effective parameters.
  • $l(\hat{\theta}, \hat{\beta})$: The maximized log-likelihood.

Key Points About BIC

  • Model Selection Rule:
    • Lower BIC indicates a better model.
  • Stronger Penalty:
    • The penalty term $q\log(n)$ grows with sample size, leading BIC to favor simpler models more than AIC.
  • Applicability to MLE and REML:
    • BIC can be used with both MLE and REML, but:
      • Use MLE when comparing models with different fixed effects.
      • Use REML when comparing models with different random effects (same fixed effects).
  • Consistency:
    • BIC is consistent: as the sample size increases, it selects the true model with probability approaching 1 (provided the true model is among the candidates).

When to Use BIC:

  • Large sample sizes where model simplicity is prioritized.
  • Model selection for hypothesis testing (due to its connection to Bayesian inference).
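As with AIC, BIC can be verified by hand; the only change is the penalty term. The short sketch below (again on the illustrative sleepstudy fit) also shows when the BIC penalty overtakes the AIC penalty: $q\log(n) > 2q$ as soon as $\log(n) > 2$, i.e. roughly $n > 8$.

# A minimal sketch: computing BIC by hand and comparing the two penalties
library(lme4)

fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

ll <- logLik(fit)
q  <- attr(ll, "df")
n  <- nobs(fit)

bic_manual <- -2 * as.numeric(ll) + q * log(n)

bic_manual                     # manual value
BIC(fit)                       # built-in value; the two should agree

c(aic_penalty = 2 * q, bic_penalty = q * log(n))   # BIC penalizes more once n > exp(2)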

Comparison of AIC, AICc, and BIC

| Criterion | Formula | Penalty Term | Best For | Model Selection Rule |
|-----------|---------|--------------|----------|-----------------------|
| AIC | $-2l + 2q$ | $2q$ | General model comparison (large $n$) | Lower is better |
| AICc | $AIC + \frac{2q(q+1)}{n-q-1}$ | Adjusted for small samples | Small sample sizes, simple random effects | Lower is better |
| BIC | $-2l + q\log(n)$ | $q\log(n)$ | Large samples, model selection in hypothesis testing | Lower is better |

Key Takeaways

  1. AIC is suitable for large datasets and general model comparisons but may favor overly complex models in small samples.
  2. AICc corrects AIC’s bias in small sample sizes.
  3. BIC favors simpler models, especially as the sample size increases, making it suitable for hypothesis testing and situations where parsimony is essential.
  4. Use MLE for comparing models with different fixed effects, and REML when comparing models with different random effects (same fixed effects); see the sketch after this list.
  5. When comparing random effects structures, AIC and BIC may not be reliable due to difficulty in counting effective parameters accurately.
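To make takeaway 4 concrete, the hedged sketch below refits the same illustrative lme4 model under ML and REML. The two objective functions differ, so the resulting AIC values are not on the same scale; when candidate models differ in their fixed effects, all of them should be fit with REML = FALSE before comparing criteria.

# A minimal sketch: ML vs. REML when comparing fixed-effects specifications
library(lme4)

fit_ml   <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)
fit_reml <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = TRUE)

# Same model, different objective functions: these AIC values are not comparable
AIC(fit_ml)
AIC(fit_reml)

# Comparing fixed-effects specifications: both candidates are fit by ML
fit_ml_null <- lmer(Reaction ~ 1 + (1 | Subject), data = sleepstudy, REML = FALSE)
c(with_days = AIC(fit_ml), intercept_only = AIC(fit_ml_null))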

8.4.4 Practical Example with Linear Mixed Models

Consider the Linear Mixed Model:

$$Y_{ij} = \begin{cases} \beta_0 + b_{1i} + (\beta_1 + b_{2i}) t_{ij} + \epsilon_{ij} & \text{(L)} \\ \beta_0 + b_{1i} + (\beta_2 + b_{2i}) t_{ij} + \epsilon_{ij} & \text{(H)} \\ \beta_0 + b_{1i} + (\beta_3 + b_{2i}) t_{ij} + \epsilon_{ij} & \text{(C)} \end{cases}$$

where:

  • $i = 1, \dots, N$ (subjects)
  • $j = 1, \dots, n_i$ (repeated measures at time $t_{ij}$)

$$Y_i \mid b_i \sim N(X_i \beta + \mathbf{1} b_i, \sigma^2 I), \qquad b_i \sim N(0, d_{11})$$

We aim to estimate:

  • Fixed effects: $\beta$
  • Variance components: $\sigma^2$, $d_{11}$
  • Random effects: Predict $b_i$

When comparing models (e.g., different random slopes or covariance structures), we can compute:

  • AIC: Penalizes model complexity with $2q$.
  • BIC: Stronger penalty via $q\log(n)$, favoring simpler models.
  • AICc: Adjusted AIC for small sample sizes.
# Practical Example with Linear Mixed Models in R


# Load required libraries
library(lme4)     # For fitting linear mixed-effects models
library(MuMIn)    # For calculating AICc
library(dplyr)    # For data manipulation

# Set seed for reproducibility
set.seed(123)

# Simulate Data
N <- 50             # Number of subjects
n_i <- 5            # Number of repeated measures per subject
t_ij <- rep(1:n_i, N)  # Time points

# Treatment groups (L, H, C)
treatment <- rep(c("L", "H", "C"), length.out = N)
group <- factor(rep(treatment, each = n_i))

# Simulate random effects
b1_i <- rnorm(N, mean = 0, sd = 2)    # Random intercepts
b2_i <- rnorm(N, mean = 0, sd = 1)    # Random slopes

# Fixed effects
beta_0 <- 5
beta_1 <- 0.5
beta_2 <- 1
beta_3 <- 1.5

# Generate response variable Y based on the specified model
Y <- numeric(N * n_i)
subject_id <- rep(1:N, each = n_i)

for (i in 1:N) {
    for (j in 1:n_i) {
        idx <- (i - 1) * n_i + j
        time <- t_ij[idx]
        
        # Treatment-specific model
        if (group[idx] == "L") {
            Y[idx] <-
                beta_0 + b1_i[i] + (beta_1 + b2_i[i]) * time + rnorm(1, 0, 1)
        } else if (group[idx] == "H") {
            Y[idx] <-
                beta_0 + b1_i[i] + (beta_2 + b2_i[i]) * time + rnorm(1, 0, 1)
        } else {
            Y[idx] <-
                beta_0 + b1_i[i] + (beta_3 + b2_i[i]) * time + rnorm(1, 0, 1)
        }
    }
}

# Combine into a data frame
data <- data.frame(
    Y = Y,
    time = t_ij,
    group = group,
    subject = factor(subject_id)
)

# Fit Linear Mixed Models
# Model 1: Random Intercepts Only
model1 <-
    lmer(Y ~ time * group + (1 | subject), data = data, REML = FALSE)

# Model 2: Random Intercepts and Random Slopes
model2 <-
    lmer(Y ~ time * group + (1 + time | subject),
         data = data,
         REML = FALSE)

# Model 3: Simpler Model (No Interaction)
model3 <-
    lmer(Y ~ time + group + (1 | subject), data = data, REML = FALSE)

# Extract Information Criteria
results <- data.frame(
    Model = c(
        "Random Intercepts",
        "Random Intercepts + Slopes",
        "No Interaction"
    ),
    AIC = c(AIC(model1), AIC(model2), AIC(model3)),
    BIC = c(BIC(model1), BIC(model2), BIC(model3)),
    AICc = c(AICc(model1), AICc(model2), AICc(model3))
)

# Display the results
print(results)
#>                        Model       AIC      BIC      AICc
#> 1          Random Intercepts 1129.2064 1157.378 1129.8039
#> 2 Random Intercepts + Slopes  974.5514 1009.766  975.4719
#> 3             No Interaction 1164.2797 1185.408 1164.6253

Interpretation of Results:

  • Model 2 (with random intercepts and slopes) has the lowest AIC, BIC, and AICc, indicating the best fit among the models.

  • Model 1 (random intercepts only) performs worse, suggesting that allowing random slopes improves model fit.

  • Model 3 (simpler fixed effects without interaction) has the highest AIC/BIC/AICc, indicating poor fit compared to Models 1 and 2.

Model Selection Criteria:

| Criterion | Best Model | Reason |
|-----------|------------|--------|
| AIC | Model 2 | Best trade-off between fit and complexity |
| BIC | Model 2 | Stronger penalty for complexity, still favored |
| AICc | Model 2 | Adjusted for small samples, Model 2 still best |
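As a final step, and as an application of takeaway 4, the selected model can be refit with REML = TRUE once the comparison is done, since REML generally yields less biased estimates of the variance components for reporting. The short sketch below continues with the simulated data object from the example above.

# Refit the selected model with REML for reporting variance components
model2_reml <-
    lmer(Y ~ time * group + (1 + time | subject),
         data = data, REML = TRUE)

summary(model2_reml)    # fixed effects and variance components for reporting
VarCorr(model2_reml)    # variance-covariance of the random effects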

References

Hurvich, Clifford M, and Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76 (2): 297–307.