8.4 Information Criteria for Model Selection

Information Criteria are statistical tools used to compare competing models by balancing model fit (likelihood) with model complexity (number of parameters).
They help in identifying the most parsimonious model that adequately explains the data without overfitting.

The three most commonly used criteria are the Akaike Information Criterion (AIC), the corrected AIC (AICc), and the Bayesian Information Criterion (BIC), each discussed below.


8.4.1 Akaike Information Criterion

The Akaike Information Criterion is derived from the Kullback-Leibler divergence, which measures the difference between the true data-generating process and the fitted model.

AIC Formula:

$$AIC = -2\,l(\hat{\theta}, \hat{\beta}) + 2q$$

where:

  • $l(\hat{\theta}, \hat{\beta})$: The maximized log-likelihood of the model, evaluated at the estimates $\hat{\theta}$ (variance components) and $\hat{\beta}$ (fixed effects).
  • q: The effective number of parameters, including:
    • The number of fixed effects.
    • The number of variance-covariance parameters (random effects).
    • Parameters constrained to boundary values (e.g., variances estimated as zero) are excluded from this count.

Key Points About AIC

  • Model Selection Rule:
    • Lower AIC indicates a better model.
    • Occasionally, software may report AIC as $l(\hat{\theta}, \hat{\beta}) - q$, in which case higher AIC is better (rare).
  • Comparing Random Effects Models:
    • Not recommended when comparing models with different random effects because it’s difficult to accurately count the effective number of parameters.
  • Sample Size Considerations:
    • Requires large sample sizes for reliable comparisons.
    • In small samples, AIC tends to favor more complex models due to insufficient penalty for model complexity.
  • Potential Bias:
    • Can be negatively biased (i.e., favoring overly complex models) when the sample size is small relative to the number of parameters.

When to Use AIC:

  • Comparing models with the same random effects structure but different fixed effects.
  • Selecting covariance structures in mixed models when the sample size is large.
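As a quick illustration, AIC can be computed by hand from a model's maximized log-likelihood and checked against the value R reports. The sketch below uses the sleepstudy data shipped with lme4 purely as a stand-in example; the parameter count q is taken from the df attribute of the log-likelihood, which is one common (but not the only) way of counting effective parameters in a mixed model.

# A minimal sketch: computing AIC by hand and checking it against AIC()
library(lme4)

fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

ll <- logLik(fit)          # maximized log-likelihood l(theta-hat, beta-hat)
q  <- attr(ll, "df")       # parameter count as reported by the fit
aic_manual <- -2 * as.numeric(ll) + 2 * q

aic_manual                 # manual value
AIC(fit)                   # built-in value; the two should agree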

8.4.2 Corrected AIC

The Corrected AIC (AICc) addresses the bias in AIC for small sample sizes. It was developed by Hurvich and Tsai (1989).

AICc Formula:

$$AIC_c = AIC + \frac{2q(q+1)}{n - q - 1}$$

where:

  • n: The sample size.
  • q: The number of estimated parameters.

Key Points About AICc

  • Small Sample Correction:
    • Provides a stronger penalty for model complexity when the sample size is small.
  • Applicability:
    • Valid when comparing models with fixed covariance structures.
    • Not recommended for models with general covariance structures due to difficulties in bias correction.
  • Model Selection Rule:
    • Lower AICc indicates a better model.

When to Use AICc:

  • Small sample sizes (when the ratio $n/q$ is low).
  • Models with fixed or simple covariance structures.
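Continuing the illustration above, the correction can be computed directly from AIC, the parameter count $q$, and the sample size $n$. In the sketch below $n$ is taken as the number of observations via nobs(), which MuMIn::AICc() also uses by default, so the two values are expected to agree; whether observations or subjects is the right notion of sample size in a mixed model is itself a judgment call.

# A minimal sketch: applying the small-sample correction by hand
library(lme4)
library(MuMIn)

fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

q <- attr(logLik(fit), "df")   # number of estimated parameters
n <- nobs(fit)                 # sample size (here: number of observations)

aicc_manual <- AIC(fit) + (2 * q * (q + 1)) / (n - q - 1)

aicc_manual                    # manual value
AICc(fit)                      # MuMIn's value, for comparison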

8.4.3 Bayesian Information Criterion

The Bayesian Information Criterion is derived from a Bayesian framework and incorporates a stronger penalty for model complexity compared to AIC.

BIC Formula

$$BIC = -2\,l(\hat{\theta}, \hat{\beta}) + q\log(n)$$

where:

  • n: The number of observations.
  • q: The number of effective parameters.
  • $l(\hat{\theta}, \hat{\beta})$: The maximized log-likelihood.

Key Points About BIC

  • Model Selection Rule:
    • Lower BIC indicates a better model.
  • Stronger Penalty:
    • The penalty term $q\log(n)$ grows with sample size, leading BIC to favor simpler models more than AIC.
  • Applicability to MLE and REML:
    • BIC can be used with both MLE and REML, but:
      • Use MLE when comparing models with different fixed effects.
      • Use REML when comparing models with different random effects (same fixed effects).
  • Consistency:
    • BIC is consistent: as the sample size increases, it selects the true model with probability approaching 1 (provided the true model is among the candidates).

When to Use BIC:

  • Large sample sizes where model simplicity is prioritized.
  • Model selection for hypothesis testing (due to its connection to Bayesian inference).
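As with AIC, BIC can be verified by hand; the only change is the penalty term. The short sketch below (again on the illustrative sleepstudy fit) also shows when the BIC penalty overtakes the AIC penalty: $q\log(n) > 2q$ as soon as $\log(n) > 2$, i.e. roughly $n > 8$.

# A minimal sketch: computing BIC by hand and comparing the two penalties
library(lme4)

fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

ll <- logLik(fit)
q  <- attr(ll, "df")
n  <- nobs(fit)

bic_manual <- -2 * as.numeric(ll) + q * log(n)

bic_manual                     # manual value
BIC(fit)                       # built-in value; the two should agree

c(aic_penalty = 2 * q, bic_penalty = q * log(n))   # BIC penalizes more once n > exp(2)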

Comparison of AIC, AICc, and BIC

| Criterion | Formula | Penalty Term | Best For | Model Selection Rule |
|-----------|---------|--------------|----------|-----------------------|
| AIC | $-2l + 2q$ | $2q$ | General model comparison (large $n$) | Lower is better |
| AICc | $AIC + \frac{2q(q+1)}{n-q-1}$ | Adjusted for small samples | Small sample sizes, simple random effects | Lower is better |
| BIC | $-2l + q\log(n)$ | $q\log(n)$ | Large samples, model selection in hypothesis testing | Lower is better |

Key Takeaways

  1. AIC is suitable for large datasets and general model comparisons but may favor overly complex models in small samples.
  2. AICc corrects AIC’s bias in small sample sizes.
  3. BIC favors simpler models, especially as the sample size increases, making it suitable for hypothesis testing and situations where parsimony is essential.
  4. Use MLE for comparing models with different fixed effects, and REML when comparing models with different random effects (same fixed effects); see the sketch after this list.
  5. When comparing random effects structures, AIC and BIC may not be reliable due to difficulty in counting effective parameters accurately.
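To make takeaway 4 concrete, the hedged sketch below refits the same illustrative lme4 model under ML and REML. The two objective functions differ, so the resulting AIC values are not on the same scale; when candidate models differ in their fixed effects, all of them should be fit with REML = FALSE before comparing criteria.

# A minimal sketch: ML vs. REML when comparing fixed-effects specifications
library(lme4)

fit_ml   <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)
fit_reml <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = TRUE)

# Same model, different objective functions: these AIC values are not comparable
AIC(fit_ml)
AIC(fit_reml)

# Comparing fixed-effects specifications: both candidates are fit by ML
fit_ml_null <- lmer(Reaction ~ 1 + (1 | Subject), data = sleepstudy, REML = FALSE)
c(with_days = AIC(fit_ml), intercept_only = AIC(fit_ml_null))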

8.4.4 Practical Example with Linear Mixed Models

Consider the Linear Mixed Model:

$$Y_{ij} = \begin{cases} \beta_0 + b_{1i} + (\beta_1 + b_{2i}) t_{ij} + \epsilon_{ij} & \text{(L)} \\ \beta_0 + b_{1i} + (\beta_2 + b_{2i}) t_{ij} + \epsilon_{ij} & \text{(H)} \\ \beta_0 + b_{1i} + (\beta_3 + b_{2i}) t_{ij} + \epsilon_{ij} & \text{(C)} \end{cases}$$

where:

  • $i = 1, \dots, N$ (subjects)
  • $j = 1, \dots, n_i$ (repeated measures at time $t_{ij}$)

$$Y_i \mid b_i \sim N(X_i \beta + \mathbf{1} b_i, \sigma^2 I), \qquad b_i \sim N(0, d_{11})$$

We aim to estimate:

  • Fixed effects: $\beta$
  • Variance components: $\sigma^2$, $d_{11}$
  • Random effects: Predict $b_i$

When comparing models (e.g., different random slopes or covariance structures), we can compute:

  • AIC: Penalizes model complexity with $2q$.
  • BIC: Stronger penalty via $q\log(n)$, favoring simpler models.
  • AICc: Adjusted AIC for small sample sizes.
# Practical Example with Linear Mixed Models in R


# Load required libraries
library(lme4)     # For fitting linear mixed-effects models
library(MuMIn)    # For calculating AICc
library(dplyr)    # For data manipulation

# Set seed for reproducibility
set.seed(123)

# Simulate Data
N <- 50             # Number of subjects
n_i <- 5            # Number of repeated measures per subject
t_ij <- rep(1:n_i, N)  # Time points

# Treatment groups (L, H, C)
treatment <- rep(c("L", "H", "C"), length.out = N)
group <- factor(rep(treatment, each = n_i))

# Simulate random effects
b1_i <- rnorm(N, mean = 0, sd = 2)    # Random intercepts
b2_i <- rnorm(N, mean = 0, sd = 1)    # Random slopes

# Fixed effects
beta_0 <- 5
beta_1 <- 0.5
beta_2 <- 1
beta_3 <- 1.5

# Generate response variable Y based on the specified model
Y <- numeric(N * n_i)
subject_id <- rep(1:N, each = n_i)

for (i in 1:N) {
    for (j in 1:n_i) {
        idx <- (i - 1) * n_i + j
        time <- t_ij[idx]
        
        # Treatment-specific model
        if (group[idx] == "L") {
            Y[idx] <-
                beta_0 + b1_i[i] + (beta_1 + b2_i[i]) * time + rnorm(1, 0, 1)
        } else if (group[idx] == "H") {
            Y[idx] <-
                beta_0 + b1_i[i] + (beta_2 + b2_i[i]) * time + rnorm(1, 0, 1)
        } else {
            Y[idx] <-
                beta_0 + b1_i[i] + (beta_3 + b2_i[i]) * time + rnorm(1, 0, 1)
        }
    }
}

# Combine into a data frame
data <- data.frame(
    Y = Y,
    time = t_ij,
    group = group,
    subject = factor(subject_id)
)

# Fit Linear Mixed Models
# Model 1: Random Intercepts Only
model1 <-
    lmer(Y ~ time * group + (1 | subject), data = data, REML = FALSE)

# Model 2: Random Intercepts and Random Slopes
model2 <-
    lmer(Y ~ time * group + (1 + time | subject),
         data = data,
         REML = FALSE)

# Model 3: Simpler Model (No Interaction)
model3 <-
    lmer(Y ~ time + group + (1 | subject), data = data, REML = FALSE)

# Extract Information Criteria
results <- data.frame(
    Model = c(
        "Random Intercepts",
        "Random Intercepts + Slopes",
        "No Interaction"
    ),
    AIC = c(AIC(model1), AIC(model2), AIC(model3)),
    BIC = c(BIC(model1), BIC(model2), BIC(model3)),
    AICc = c(AICc(model1), AICc(model2), AICc(model3))
)

# Display the results
print(results)
#>                        Model       AIC      BIC      AICc
#> 1          Random Intercepts 1129.2064 1157.378 1129.8039
#> 2 Random Intercepts + Slopes  974.5514 1009.766  975.4719
#> 3             No Interaction 1164.2797 1185.408 1164.6253

Interpretation of Results:

  • Model 2 (with random intercepts and slopes) has the lowest AIC, BIC, and AICc, indicating the best fit among the models.

  • Model 1 (random intercepts only) performs worse, suggesting that allowing random slopes improves model fit.

  • Model 3 (simpler fixed effects without interaction) has the highest AIC/BIC/AICc, indicating poor fit compared to Models 1 and 2.

Model Selection Criteria:

| Criterion | Best Model | Reason |
|-----------|------------|--------|
| AIC | Model 2 | Best trade-off between fit and complexity |
| BIC | Model 2 | Stronger penalty for complexity, still favored |
| AICc | Model 2 | Adjusted for small samples, Model 2 still best |
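As a final step, and as an application of takeaway 4, the selected model can be refit with REML = TRUE once the comparison is done, since REML generally yields less biased estimates of the variance components for reporting. The short sketch below continues with the simulated data object from the example above.

# Refit the selected model with REML for reporting variance components
model2_reml <-
    lmer(Y ~ time * group + (1 + time | subject),
         data = data, REML = TRUE)

summary(model2_reml)    # fixed effects and variance components for reporting
VarCorr(model2_reml)    # variance-covariance of the random effects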

References

Hurvich, Clifford M, and Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76 (2): 297–307.