15.3 Embedded Methods (Integrated into Model Training)
15.3.1 Regularization-Based Selection
- Lasso (L1 Regularization): Shrinks some coefficients to zero, performing automatic selection.
- Ridge (L2 Regularization): Shrinks coefficients but does not eliminate variables.
- Elastic Net: Combines L1 and L2 penalties, keeping the lasso's ability to drop variables while handling groups of correlated predictors more stably (illustrated in the sketch below).
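As a brief sketch of how these penalties behave, the glmnet package fits all three through a single alpha argument: alpha = 1 gives the lasso, alpha = 0 gives ridge, and intermediate values give the elastic net. The simulated data here are purely illustrative.
# Illustrative sketch with glmnet: alpha = 1 (lasso), alpha = 0 (ridge), 0 < alpha < 1 (elastic net)
library(glmnet)
set.seed(123)
X <- matrix(rnorm(100 * 10), nrow = 100)          # 100 observations, 10 predictors
y <- 5 + rowSums(X[, 1:3]) + rnorm(100, sd = 2)   # only the first 3 predictors matter
lasso_fit <- cv.glmnet(X, y, alpha = 1)           # lasso: some coefficients shrink exactly to zero
coef(lasso_fit, s = "lambda.min")
enet_fit <- cv.glmnet(X, y, alpha = 0.5)          # elastic net: mix of L1 and L2 penalties
coef(enet_fit, s = "lambda.min")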
15.3.2 Tree-Based Feature Importance
Decision trees and ensemble methods (Random Forests, Gradient Boosting) rank features based on their contribution to predictions.
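As a sketch, importance scores can be extracted from a fitted ensemble and used to rank or drop predictors; the randomForest package is one common choice, and gradient boosting packages expose similar measures.
# Illustrative sketch: ranking predictors by random forest importance
library(randomForest)
set.seed(123)
X <- matrix(rnorm(100 * 10), nrow = 100)
colnames(X) <- paste0("V", 1:10)
y <- 5 + rowSums(X[, 1:3]) + rnorm(100, sd = 2)
rf_fit <- randomForest(x = X, y = y, importance = TRUE)
importance(rf_fit)   # %IncMSE and IncNodePurity for each predictor
varImpPlot(rf_fit)   # visual ranking of the importance scores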
15.3.3 Genetic Algorithms
Genetic Algorithms (GA) are inspired by the principles of natural selection and genetics. They are metaheuristic optimization techniques that iteratively evolve a population of solutions to find an optimal or near-optimal subset of predictors for regression or classification tasks.
Theoretical Foundation
- Objective: Select a subset of predictors that optimizes a predefined fitness function (e.g., $R^2$, AIC, BIC, or prediction error).
- Key Concepts:
  - Population: A collection of candidate solutions (subsets of predictors).
  - Chromosome: Represents a solution as a binary vector (e.g., 1 if a variable is included, 0 otherwise); a toy illustration follows this list.
  - Fitness Function: Evaluates the quality of a solution based on the selected variables.
  - Crossover: Combines two solutions to create a new solution.
  - Mutation: Randomly alters a solution to maintain diversity in the population.
  - Selection: Chooses solutions with higher fitness to create the next generation.
- Advantages:
  - Explores a wide solution space effectively.
  - Escapes local optima by introducing randomness.
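The binary encoding and the crossover and mutation operators can be illustrated in a few lines of base R. This is a hypothetical toy sketch, independent of any GA library; the cut point and mutation rate are arbitrary choices.
set.seed(1)
p <- 10                              # number of candidate predictors
parent1 <- rbinom(p, 1, 0.5)         # chromosome: 1 = variable included, 0 = excluded
parent2 <- rbinom(p, 1, 0.5)
# Single-point crossover: splice the two parents at a random cut point
cut_point <- sample(1:(p - 1), 1)
offspring <- c(parent1[1:cut_point], parent2[(cut_point + 1):p])
# Mutation: flip each bit with a small probability to maintain diversity
flip <- runif(p) < 0.05
offspring[flip] <- 1 - offspring[flip]
offspring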
Steps in Genetic Algorithms
- Initialization: Generate an initial population of candidate solutions at random.
- Evaluation: Compute the fitness function for each solution in the population.
- Selection: Choose solutions with higher fitness values to be parents for the next generation.
- Crossover: Combine pairs of parent solutions to create offspring.
- Mutation: Randomly modify some solutions to introduce variability.
- Replacement: Replace the old population with the new generation.
- Stopping Criteria: Terminate the algorithm after a fixed number of generations or when the improvement in fitness falls below a threshold.
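The example below implements these steps with the GA package in R on simulated data, encoding each candidate subset as a binary chromosome and using the negative AIC of the corresponding linear model as the fitness function.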
# Install and load the GA package
if (!requireNamespace("GA", quietly = TRUE)) {
  install.packages("GA")
}
library(GA)
# Simulated data
set.seed(123)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
# Only first 3 variables are relevant
y <- 5 + rowSums(X[, 1:3]) + rnorm(n, sd = 2)
# Convert to a data frame
data <- as.data.frame(X)
data$y <- y
# Define the fitness function: ga() maximizes fitness, so we return
# negative AIC (maximizing -AIC is equivalent to minimizing AIC)
fitness_function <- function(binary_vector) {
  selected_vars <- which(binary_vector == 1)
  if (length(selected_vars) == 0) {
    return(-Inf)  # Penalize empty subsets
  }
  # Regress y on the selected predictors (y is the last column of `data`)
  model <- lm(y ~ ., data = data[, c(selected_vars, ncol(data))])
  -AIC(model)
}
# Run the genetic algorithm
set.seed(123)
ga_result <- ga(
  type    = "binary",           # Binary encoding for variable selection
  fitness = fitness_function,
  nBits   = p,                  # Number of predictors
  popSize = 50,                 # Population size
  maxiter = 100,                # Maximum number of generations
  run     = 10,                 # Stop if no improvement in 10 generations
  seed    = 123
)
# Extract the best solution
best_solution <- ga_result@solution[1, ]
selected_vars <- which(best_solution == 1)
cat("Selected Variables (Column Indices):\n", selected_vars, "\n")
#> Selected Variables (Column Indices):
#> 1 2 3 4
# Fit the final model with selected variables
final_model <- lm(y ~ ., data = data[, c(selected_vars, ncol(data))])
cat("Final Model Summary:\n")
#> Final Model Summary:
print(summary(final_model))
#>
#> Call:
#> lm(formula = y ~ ., data = data[, c(selected_vars, ncol(data))])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.4013 -1.3823 0.0151 1.0796 5.1537
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.2726 0.2101 25.096 < 2e-16 ***
#> V1 1.1477 0.2290 5.012 2.49e-06 ***
#> V2 0.9469 0.2144 4.416 2.66e-05 ***
#> V3 0.6864 0.2199 3.121 0.00239 **
#> V4 0.3881 0.1997 1.943 0.05496 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.058 on 95 degrees of freedom
#> Multiple R-squared: 0.3574, Adjusted R-squared: 0.3304
#> F-statistic: 13.21 on 4 and 95 DF, p-value: 1.352e-08
Interpretation
Selected Variables:
- The genetic algorithm identifies the subset of predictors that maximizes the fitness function, here the negative AIC, which is equivalent to minimizing the AIC. In this simulation only the first three predictors are truly relevant; the retained fourth variable (V4) is not significant at the 5% level, a reminder that AIC-based selection can keep marginal variables.
Model Performance:
- The final model can be evaluated using metrics such as adjusted $R^2$, prediction error, or cross-validation (see the sketch after this list).
Convergence:
- The algorithm evolves towards better solutions over generations, as indicated by improvements in the best fitness value.
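For example, a simple k-fold cross-validation of the selected model can be run on the `data` and `selected_vars` objects created above. This is a minimal sketch; the fold assignment and the use of mean squared error are illustrative choices.
# Sketch: 5-fold cross-validation of the GA-selected model
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))   # random fold assignment
cv_mse <- sapply(1:k, function(fold) {
  train <- data[folds != fold, c(selected_vars, ncol(data))]
  test  <- data[folds == fold, c(selected_vars, ncol(data))]
  fit   <- lm(y ~ ., data = train)
  mean((test$y - predict(fit, newdata = test))^2)    # held-out MSE for this fold
})
mean(cv_mse)   # cross-validated mean squared error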
Advantages:
- Can handle high-dimensional datasets and complex fitness functions.
- Avoids getting trapped in local optima by introducing randomness.
- Flexible: can optimize any user-defined fitness function.
Limitations:
- Computationally intensive for large datasets or many predictors.
- Requires careful tuning of hyperparameters (e.g., population size, mutation rate).
- May not guarantee a globally optimal solution.
Practical Considerations
When to use Genetic Algorithms?
- For complex variable selection problems where traditional methods (e.g., stepwise selection) are insufficient.
- When the fitness function is non-linear or involves interactions among predictors.
Tuning Tips:
- Adjust population size and mutation rate based on dataset size and complexity.
- Use parallel computing to speed up the evaluation of fitness functions, as in the sketch below.
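For instance, the GA package can evaluate the fitness of a population in parallel through the `parallel` argument of `ga()`. This is a sketch: it relies on the parallel/doParallel backends being installed, and depending on the platform, objects referenced inside the fitness function may need to be available to the worker processes.
# Sketch: re-run the earlier GA with parallel fitness evaluation
ga_parallel <- ga(
  type     = "binary",
  fitness  = fitness_function,
  nBits    = p,
  popSize  = 100,          # larger populations benefit most from parallel evaluation
  maxiter  = 200,
  run      = 10,
  parallel = TRUE,         # TRUE enables parallel evaluation; an integer sets the number of cores
  seed     = 123
)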