15.1 Filter Methods (Statistical Criteria, Model-Agnostic)
15.1.1 Information Criteria-Based Selection
15.1.1.1 Mallows’s Cp Statistic
The $C_p$ statistic (Mallows 1973, 1995) is a criterion used to evaluate the predictive ability of a fitted model. It balances model complexity against goodness-of-fit.
For a model with $p$ parameters, let $\hat{Y}_{ip}$ be the predicted value of $Y_i$. The total standardized mean square error of prediction is:

$$\Gamma_p = \frac{\sum_{i=1}^{n} E\left[\hat{Y}_{ip} - E(Y_i)\right]^2}{\sigma^2}$$

Expanding $\Gamma_p$:

$$\Gamma_p = \frac{\sum_{i=1}^{n} \left[E(\hat{Y}_{ip}) - E(Y_i)\right]^2 + \sum_{i=1}^{n} \text{Var}(\hat{Y}_{ip})}{\sigma^2}$$
- The first term in the numerator represents the squared bias.
- The second term represents the prediction variance.
Key Insights
Bias-Variance Tradeoff:
- The bias decreases as more variables are added to the model.
- If the full model ($p = P$) is assumed to be the true model, $E(\hat{Y}_{ip}) - E(Y_i) = 0$, implying no bias.
- The prediction variance increases as more variables are added: $\sum_{i=1}^{n} \text{Var}(\hat{Y}_{ip}) = p\sigma^2$.
- Therefore, the optimal model balances bias and variance by minimizing $\Gamma_p$.
Estimating $\Gamma_p$: Since $\Gamma_p$ depends on unknown parameters (e.g., $\beta$, $\sigma^2$), we use an estimate:

$$C_p = \frac{SSE_p}{\hat{\sigma}^2} - (n - 2p)$$

- $SSE_p$: Sum of squared errors for the model with $p$ parameters.
- $\hat{\sigma}^2$: Mean squared error (MSE) of the full model containing all $P - 1$ predictors.
Properties of $C_p$:
- As more variables are added, $SSE_p$ decreases, but the penalty term $2p$ increases.
- When there is no bias, $E(C_p) \approx p$. Hence, good models have $C_p$ values close to $p$.
Model Selection Criteria:
- Prediction-focused models: Consider models with $C_p \le p$.
- Parameter estimation-focused models: Consider models with $C_p \le 2p - (P - 1)$ to avoid excess bias.
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + rnorm(n, sd=2)
# Full model and candidate models
full_model <- lm(y ~ x1 + x2 + x3)
model_1 <- lm(y ~ x1)
model_2 <- lm(y ~ x1 + x2)
# Function to compute Mallows's Cp for a candidate model,
# given an estimate of sigma^2 from the full model
calculate_cp <- function(model, full_model_mse, n) {
  sse <- sum(residuals(model)^2)        # SSE_p of the candidate model
  p <- length(coefficients(model))      # parameters, incl. intercept
  cp <- (sse / full_model_mse) - (n - 2 * p)
  return(cp)
}
# Estimate of sigma^2 from the full model (plug-in estimate SSE/n)
full_model_mse <- mean(residuals(full_model)^2)
# Cp values for each candidate model
cp_1 <- calculate_cp(model_1, full_model_mse, n)
cp_2 <- calculate_cp(model_2, full_model_mse, n)
# Display results
cat("C_p values:\n")
#> C_p values:
cat("Model 1 (y ~ x1):", round(cp_1, 2), "\n")
#> Model 1 (y ~ x1): 83.64
cat("Model 2 (y ~ x1 + x2):", round(cp_2, 2), "\n")
#> Model 2 (y ~ x1 + x2): 6.27
For Mallows’s $C_p$ criterion, lower values are preferred. Specifically:
Ideal Value: When the model fits well and is approximately unbiased, $C_p$ should be close to $p$, the number of parameters (the predictors plus the intercept).
Model Comparison: Among competing models, you generally prefer the one with the smallest $C_p$, as long as it is close to $p$.
Overfitting Indicator: A $C_p$ noticeably below $p$ usually reflects sampling variability and can accompany an overfit model.
Underfitting Indicator: A $C_p$ much higher than $p$ suggests the model is underfitting the data and missing important predictors.
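In practice, $C_p$ rarely needs to be computed by hand. As a minimal sketch (assuming the leaps package is available), regsubsets() runs a best-subset search and reports Mallows’s $C_p$ for the best model of each size:
# Best-subset search reporting Mallows's Cp (sketch; assumes the leaps package)
library(leaps)
subsets <- regsubsets(y ~ x1 + x2 + x3, data = data.frame(y, x1, x2, x3))
summary(subsets)$cp   # compare each value against p, the number of parameters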
15.1.1.2 Akaike Information Criterion (AIC)
The Akaike Information Criterion (AIC) is a widely used model selection metric that evaluates the tradeoff between model fit and complexity. It was introduced by Hirotugu Akaike and is rooted in information theory, measuring the relative quality of statistical models for a given dataset.
For a model with $p$ parameters, the AIC is given by:

$$AIC = n \ln\left(\frac{SSE_p}{n}\right) + 2p$$
Where:
$n$ is the number of observations.
$SSE_p$ is the sum of squared errors for a model with $p$ parameters.
Key Insights
- Components of AIC:
- The first term, $n \ln(SSE_p/n)$: Reflects the goodness-of-fit of the model. It decreases as $SSE_p$ decreases, meaning the model better explains the data.
- The second term, $2p$: Represents a penalty for model complexity. It increases with the number of parameters to discourage overfitting.
- Model Selection Principle:
- Smaller AIC values indicate a better balance between fit and complexity.
- Adding parameters generally reduces $SSE_p$ but increases the penalty term $2p$. If AIC increases when a parameter is added, that parameter is likely unnecessary.
- Tradeoff:
- AIC emphasizes the tradeoff between:
- Precision of fit: Reducing the error in explaining the data.
- Parsimony: Avoiding unnecessary parameters to maintain model simplicity.
- Comparative Criterion:
- AIC does not provide an absolute measure of model quality; instead, it compares relative performance. The model with the lowest AIC is preferred.
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + rnorm(n, sd=2)
# Candidate models
model_1 <- lm(y ~ x1)
model_2 <- lm(y ~ x1 + x2)
model_3 <- lm(y ~ x1 + x2 + x3)
# Function to manually compute AIC
calculate_aic <- function(model, n) {
  sse <- sum(residuals(model)^2)       # SSE_p
  p <- length(coefficients(model))     # parameters, incl. intercept
  aic <- n * log(sse / n) + 2 * p
  return(aic)
}
# Calculate AIC for each model
aic_1 <- calculate_aic(model_1, n)
aic_2 <- calculate_aic(model_2, n)
aic_3 <- calculate_aic(model_3, n)
# Display results
cat("AIC values:\n")
#> AIC values:
cat("Model 1 (y ~ x1):", round(aic_1, 2), "\n")
#> Model 1 (y ~ x1): 207.17
cat("Model 2 (y ~ x1 + x2):", round(aic_2, 2), "\n")
#> Model 2 (y ~ x1 + x2): 150.87
cat("Model 3 (y ~ x1 + x2 + x3):", round(aic_3, 2), "\n")
#> Model 3 (y ~ x1 + x2 + x3): 152.59
Interpretation
Compare the AIC values across models:
A smaller AIC indicates a model with a better balance between fit and complexity.
If the AIC increases when moving from one model to another (e.g., from Model 2 to Model 3), the additional parameter(s) in the larger model may not be justified.
Advantages:
Simple to compute and widely applicable.
Penalizes overfitting, encouraging parsimonious models.
Limitations:
Assumes the model errors are normally distributed and independent.
Does not evaluate absolute model fit, only relative performance.
Sensitive to sample size; for smaller samples, consider using the corrected version, AICc.
Corrected AIC (AICc)
For small sample sizes ($n/p \le 40$), the corrected AIC, $AIC_c$, adjusts for the sample size:

$$AIC_c = AIC + \frac{2p(p + 1)}{n - p - 1}$$

This adjustment imposes a heavier penalty on additional parameters when $n$ is small, guarding against overfitting; as $n$ grows, $AIC_c$ converges to the ordinary AIC.
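As a small sketch building on the calculate_aic() helper above (the helper name calculate_aicc is ours), the correction can be added directly:
# Corrected AIC (AICc) built on the manual AIC helper (illustrative)
calculate_aicc <- function(model, n) {
  p <- length(coefficients(model))   # parameters, incl. intercept
  calculate_aic(model, n) + (2 * p * (p + 1)) / (n - p - 1)
}
calculate_aicc(model_2, n)   # compare with aic_2 computed earlier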
15.1.1.3 Bayesian Information Criterion (BIC)
The Bayesian Information Criterion (BIC), also known as Schwarz Criterion, is another popular metric for model selection. It extends the concept of AIC by introducing a stronger penalty for model complexity, particularly when the number of observations is large. BIC is grounded in Bayesian probability theory and provides a framework for selecting the most plausible model among a set of candidates.
For a model with $p$ parameters, the BIC is defined as:

$$BIC = n \ln\left(\frac{SSE_p}{n}\right) + p \ln(n)$$

Where:
$n$ is the number of observations.
$SSE_p$ is the sum of squared errors for the model with $p$ parameters.
$p$ is the number of parameters, including the intercept.
Key Insights
- Components of BIC:
- The first term, $n \ln(SSE_p/n)$: Measures the goodness-of-fit, similar to AIC. It decreases as $SSE_p$ decreases, indicating a better fit to the data.
- The second term, $p \ln(n)$: Penalizes model complexity. Unlike AIC’s penalty of $2p$, the penalty in BIC increases with $\ln(n)$, making it more sensitive to the number of observations.
- Model Selection Principle:
- Smaller BIC values indicate a better model.
- Adding parameters reduces $SSE_p$, but the penalty term $p \ln(n)$ grows more rapidly than AIC’s $2p$ for large $n$. This makes BIC more conservative than AIC in selecting models with additional parameters.
- Tradeoff:
- Like AIC, BIC balances:
- Precision of fit: Capturing the underlying structure in the data.
- Parsimony: Avoiding overfitting by discouraging unnecessary parameters.
- BIC tends to favor simpler models compared to AIC, particularly when n is large.
- Comparative Criterion:
- BIC, like AIC, is used to compare models. The model with the smallest BIC is preferred.
# Function to manually compute BIC
calculate_bic <- function(model, n) {
  sse <- sum(residuals(model)^2)       # SSE_p
  p <- length(coefficients(model))     # parameters, incl. intercept
  bic <- n * log(sse / n) + p * log(n)
  return(bic)
}
# Calculate BIC for each model
bic_1 <- calculate_bic(model_1, n)
bic_2 <- calculate_bic(model_2, n)
bic_3 <- calculate_bic(model_3, n)
# Display results
cat("BIC values:\n")
#> BIC values:
cat("Model 1 (y ~ x1):", round(bic_1, 2), "\n")
#> Model 1 (y ~ x1): 212.38
cat("Model 2 (y ~ x1 + x2):", round(bic_2, 2), "\n")
#> Model 2 (y ~ x1 + x2): 158.68
cat("Model 3 (y ~ x1 + x2 + x3):", round(bic_3, 2), "\n")
#> Model 3 (y ~ x1 + x2 + x3): 163.01
Interpretation
Compare the BIC values across models:
Smaller BIC values suggest a better model.
If BIC increases when moving to a larger model, the added complexity may not justify the reduction in $SSE_p$.
Comparison with AIC
Criterion | Penalty Term | Sensitivity to Sample Size | Preferred Model Selection |
---|---|---|---|
AIC | $2p$ | Less sensitive | More parameters |
BIC | $p \ln(n)$ | More sensitive | Simpler models |
BIC generally prefers simpler models than AIC, especially when n is large.
In smaller datasets, AIC may perform better for prediction, since BIC’s stronger $\ln(n)$ penalty can lead to underfitting.
Advantages:
Strong penalty for complexity makes it robust against overfitting.
Incorporates sample size explicitly, favoring simpler models as $n$ grows.
Easy to compute and interpret.
Limitations:
Assumes model errors are normally distributed and independent.
May underfit in smaller datasets where the penalty term dominates.
Like AIC, BIC is not an absolute measure of model quality but a relative one.
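R also ships built-in AIC() and BIC() functions. As a caveat, they are computed from the full model log-likelihood (for lm, the Gaussian log-likelihood, with the error variance counted as a parameter), so their absolute values differ from the SSE-based formulas above by an additive constant; the ranking of models fit to the same response is unchanged. A quick check on the models above:
# Built-in criteria: absolute values differ from the manual versions,
# but the ordering of models fit to the same data is the same
AIC(model_1, model_2, model_3)
BIC(model_1, model_2, model_3)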
15.1.1.4 Hannan-Quinn Criterion (HQC)
The Hannan-Quinn Criterion (HQC) is a statistical metric for model selection, similar to AIC and BIC. It evaluates the tradeoff between model fit and complexity, offering a middle ground between the conservative penalty of BIC and the less stringent penalty of AIC. HQC is especially useful in time-series modeling and situations where large datasets are involved.
The HQC for a model with $p$ parameters is defined as:

$$HQC = n \ln\left(\frac{SSE_p}{n}\right) + 2p \ln(\ln(n))$$

Where:
$n$: Number of observations.
$SSE_p$: Sum of squared errors for the model with $p$ predictors.
$p$: Number of parameters, including the intercept.
Key Insights
- Components:
- The first term, $n \ln(SSE_p/n)$: Measures the goodness-of-fit, similar to AIC and BIC. Smaller SSE indicates a better fit.
- The second term, $2p \ln(\ln(n))$: Penalizes model complexity. The penalty grows logarithmically with the sample size, similar to BIC but less severe.
- Model Selection Principle:
- Smaller HQC values indicate a better balance between model fit and complexity.
- Models with lower HQC are preferred.
- Penalty Comparison:
- HQC’s penalty lies between that of AIC and BIC:
- AIC: $2p$ (less conservative, favors complex models).
- BIC: $p \ln(n)$ (more conservative, favors simpler models).
- HQC: $2p \ln(\ln(n))$ (balances AIC and BIC).
- Use Case:
- HQC is particularly suited for large datasets or time-series models where overfitting is a concern, but BIC may overly penalize model complexity.
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + x3 + rnorm(n, sd=2)
# Prepare models
data <- data.frame(y, x1, x2, x3, x4)
model_1 <- lm(y ~ x1, data = data)
model_2 <- lm(y ~ x1 + x2, data = data)
model_3 <- lm(y ~ x1 + x2 + x3, data = data)
# Function to calculate HQC
calculate_hqc <- function(model, n) {
  sse <- sum(residuals(model)^2)       # SSE_p
  p <- length(coefficients(model))     # parameters, incl. intercept
  hqc <- n * log(sse / n) + 2 * p * log(log(n))
  return(hqc)
}
# Calculate HQC for each model
hqc_1 <- calculate_hqc(model_1, n)
hqc_2 <- calculate_hqc(model_2, n)
hqc_3 <- calculate_hqc(model_3, n)
# Display results
cat("HQC values:\n")
#> HQC values:
cat("Model 1 (y ~ x1):", round(hqc_1, 2), "\n")
#> Model 1 (y ~ x1): 226.86
cat("Model 2 (y ~ x1 + x2):", round(hqc_2, 2), "\n")
#> Model 2 (y ~ x1 + x2): 156.44
cat("Model 3 (y ~ x1 + x2 + x3):", round(hqc_3, 2), "\n")
#> Model 3 (y ~ x1 + x2 + x3): 141.62
Interpretation
Comparing HQC Values:
Smaller HQC values indicate a better balance between goodness-of-fit and parsimony.
Select the model with the smallest HQC.
Tradeoffs:
HQC balances fit and complexity more conservatively than AIC but less so than BIC.
It is particularly useful when overfitting is a concern but avoiding overly simplistic models is also important.
Comparison with Other Criteria
Criterion | Penalty Term | Sensitivity to Sample Size | Preferred Model Selection |
---|---|---|---|
AIC | $2p$ | Low | Favors more complex models |
BIC | $p \ln(n)$ | High | Favors simpler models |
HQC | $2p \ln(\ln(n))$ | Moderate | Balances fit and parsimony |
Advantages:
Less sensitive to sample size than BIC, avoiding excessive penalization for large datasets.
Provides a balanced approach to model selection, reducing the risk of overfitting while avoiding overly simplistic models.
Particularly useful in time-series analysis.
Limitations:
Like AIC and BIC, assumes model errors are normally distributed and independent.
HQC is not as widely implemented in statistical software, requiring custom calculations.
Practical Considerations
When to use HQC?
When both AIC and BIC provide conflicting recommendations.
For large datasets or time-series models where BIC may overly penalize complexity.
When to use AIC or BIC?
AIC for smaller datasets or when the goal is prediction.
BIC for large datasets or when parsimony is critical.
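Because the manual helpers above share the same interface, the three criteria can be tabulated side by side for the candidate models fit in this subsection (a sketch reusing calculate_aic(), calculate_bic(), and calculate_hqc() defined earlier):
# Side-by-side comparison of AIC, BIC, and HQC for the candidate models
criteria <- data.frame(
  model = c("y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + x3"),
  AIC = sapply(list(model_1, model_2, model_3), calculate_aic, n = n),
  BIC = sapply(list(model_1, model_2, model_3), calculate_bic, n = n),
  HQC = sapply(list(model_1, model_2, model_3), calculate_hqc, n = n)
)
print(criteria)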
15.1.1.5 Minimum Description Length (MDL)
The Minimum Description Length (MDL) principle is a model selection framework rooted in information theory. It balances model complexity and goodness-of-fit by seeking the model that minimizes the total length of encoding the data and the model itself. MDL is a generalization of other model selection criteria like AIC and BIC but offers a more theoretical foundation.
Theoretical Foundation
MDL is based on the idea that the best model is the one that compresses the data most efficiently. It represents a tradeoff between:
- Model Complexity: The cost of describing the model, including the number of parameters.
- Data Fit: The cost of describing the data given the model.
The total description length is expressed as:
$$L(M, D) = L(M) + L(D \mid M)$$

Where:
$L(M)$: The length of encoding the model (complexity of the model).
$L(D \mid M)$: The length of encoding the data given the model (fit to the data).
Key Insights
- Model Complexity ($L(M)$):
- More complex models require longer descriptions, as they involve more parameters.
- Simpler models are favored unless the added complexity significantly improves the fit.
- Data Fit ($L(D \mid M)$):
- Measures how well the model explains the data.
- Poorly fitting models require more bits to describe the residual error.
- Tradeoff:
- MDL balances these two components, selecting the model that minimizes the total description length.
Connection to Other Criteria
MDL is closely related to BIC. In fact, the BIC criterion can be derived as an approximation of MDL for certain statistical models:
$$BIC = n \ln(SSE_p/n) + p \ln(n)$$
However, MDL is more flexible and does not rely on specific assumptions about the error distribution.
Practical Use Cases
Time-Series Modeling: MDL is particularly effective for selecting models in time-series data, where overfitting is common.
Machine Learning: MDL is used in regularization techniques and decision tree pruning to prevent overfitting.
Signal Processing: In applications such as compression and coding, MDL directly guides optimal model selection.
# Simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 5 + 3*x1 - 2*x2 + x3 + rnorm(n, sd=2)
# Prepare models
data <- data.frame(y, x1, x2, x3)
model_1 <- lm(y ~ x1, data = data)
model_2 <- lm(y ~ x1 + x2, data = data)
model_3 <- lm(y ~ x1 + x2 + x3, data = data)
# Function to calculate MDL
calculate_mdl <- function(model, n) {
  sse <- sum(residuals(model)^2)          # cost of encoding the data given the model
  p <- length(coefficients(model))        # cost of encoding the model parameters
  mdl <- p * log(n) + n * log(sse / n)    # BIC-style approximation of the description length
  return(mdl)
}
# Calculate MDL for each model
mdl_1 <- calculate_mdl(model_1, n)
mdl_2 <- calculate_mdl(model_2, n)
mdl_3 <- calculate_mdl(model_3, n)
# Display results
cat("MDL values:\n")
#> MDL values:
cat("Model 1 (y ~ x1):", round(mdl_1, 2), "\n")
#> Model 1 (y ~ x1): 219.87
cat("Model 2 (y ~ x1 + x2):", round(mdl_2, 2), "\n")
#> Model 2 (y ~ x1 + x2): 173.42
cat("Model 3 (y ~ x1 + x2 + x3):", round(mdl_3, 2), "\n")
#> Model 3 (y ~ x1 + x2 + x3): 163.01
Interpretation
Choosing the Best Model:
- The model with the smallest MDL value is preferred, as it achieves the best tradeoff between fit and complexity.
Practical Implications:
- MDL discourages overfitting by penalizing complex models that do not significantly improve data fit.
Advantages:
Theoretically grounded in information theory.
Offers a natural framework for balancing complexity and fit.
Flexible and can be applied across various modeling frameworks.
Limitations:
Computationally intensive, especially for non-linear models.
Requires careful formulation of $L(M)$ and $L(D | M)$ for non-standard models.
Less common in standard statistical software compared to AIC or BIC.
15.1.1.6 Prediction Error Sum of Squares (PRESS)
The Prediction Error Sum of Squares (PRESS) statistic measures the predictive ability of a model by evaluating how well it performs on data not used in fitting the model. PRESS is particularly useful for assessing model validity and identifying overfitting.
The PRESS statistic for a model with p parameters is defined as:
$$PRESS_p = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_{i(i)}\right)^2$$

Where:
$\hat{Y}_{i(i)}$ is the prediction of the $i$-th response when the $i$-th observation is omitted during model fitting.
$Y_i$ is the observed response for the $i$-th observation.
Key Insights
- Leave-One-Out Cross-Validation (LOOCV):
- PRESS is computed by excluding each observation one at a time and predicting its response using the remaining data.
- This process evaluates the model’s generalizability and reduces overfitting.
- Model Selection Principle:
- Smaller values of $PRESS_p$ indicate better predictive performance.
- A small $PRESS_p$ suggests that the model captures the underlying structure of the data without overfitting.
- Computational Complexity:
- Computing $\hat{Y}_{i(i)}$ by refitting the model $n$ times can be computationally intensive for models with large $p$ or datasets with many observations.
- For linear models, the identity $\hat{Y}_{i(i)} = Y_i - e_i/(1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ the leverage, gives PRESS from a single fit.
# Function to compute PRESS
calculate_press <- function(model) {
  res <- residuals(model)               # ordinary residuals e_i
  h <- lm.influence(model)$hat          # leverage values h_ii
  press <- sum((res / (1 - h))^2)       # PRESS residuals e_i / (1 - h_ii), exact for lm
  return(press)
}
# Calculate PRESS for each model
press_1 <- calculate_press(model_1)
press_2 <- calculate_press(model_2)
press_3 <- calculate_press(model_3)
# Display results
cat("PRESS values:\n")
#> PRESS values:
cat("Model 1 (y ~ x1):", round(press_1, 2), "\n")
#> Model 1 (y ~ x1): 854.36
cat("Model 2 (y ~ x1 + x2):", round(press_2, 2), "\n")
#> Model 2 (y ~ x1 + x2): 524.56
cat("Model 3 (y ~ x1 + x2 + x3):", round(press_3, 2), "\n")
#> Model 3 (y ~ x1 + x2 + x3): 460
Interpretation
Compare the PRESS values across models:
Models with smaller PRESS values are preferred as they exhibit better predictive ability.
A large PRESS value indicates potential overfitting or poor model generalizability.
Advantages:
Provides an unbiased measure of predictive performance.
Helps identify overfitting by simulating the model’s performance on unseen data.
Limitations:
Computationally intensive for large datasets or models with many predictors.
Sensitive to influential observations; high-leverage points can disproportionately affect results.
Alternative Approaches
To address the computational challenges of PRESS, alternative methods can be employed:
Leverage-based computation: As shown in the example, for linear models the leverage values give $\hat{Y}_{i(i)}$ exactly from a single fit, with no refitting.
K-Fold Cross-Validation: Dividing the dataset into $k$ folds reduces the computational burden relative to LOOCV while still providing robust estimates (a sketch follows below).
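As a rough sketch of the k-fold alternative (base R only; the fold assignment and variable names below are ours), a 5-fold cross-validated analogue of PRESS for model_3 could look like this:
# 5-fold cross-validated sum of squared prediction errors (sketch)
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))    # random fold labels
cv_sse <- 0
for (fold in 1:k) {
  train <- data[folds != fold, ]
  test  <- data[folds == fold, ]
  fit   <- lm(y ~ x1 + x2 + x3, data = train)         # refit on the other k - 1 folds
  cv_sse <- cv_sse + sum((test$y - predict(fit, newdata = test))^2)
}
cv_sse   # comparable in spirit to PRESS for model_3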
15.1.2 Univariate Selection Methods
Univariate selection methods evaluate individual variables in isolation to determine their relationship with the target variable. These methods are often categorized under filter methods, as they do not involve any predictive model but instead rely on statistical significance and information-theoretic measures.
Univariate selection is particularly useful for:
Preprocessing large datasets by eliminating irrelevant features.
Reducing dimensionality before applying more complex feature selection techniques.
Improving interpretability by identifying the most relevant features.
The two main categories of univariate selection methods are:
- Statistical Tests: Evaluate the significance of relationships between individual features and the target variable.
- Information-Theoretic Measures: Assess the dependency between variables based on information gain and mutual information.
15.1.2.1 Statistical Tests
Statistical tests assess the significance of relationships between individual predictors and the target variable. The choice of test depends on the type of data:
Test | Used For | Example Use Case |
---|---|---|
Chi-Square Test | Categorical predictors vs. Categorical target | Checking if gender affects purchase behavior |
ANOVA (Analysis of Variance) | Continuous predictors vs. Categorical target | Testing if different income groups have varying spending habits |
Correlation Coefficients | Continuous predictors vs. Continuous target | Measuring the relationship between advertising spend and sales |
Check out Descriptive Statistics and Basic Statistical Inference for more details.
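The following sketch illustrates the three tests on small simulated examples (the variables and data below are invented purely for illustration):
# Illustrative univariate tests on simulated data
set.seed(123)
# Chi-square test: categorical predictor vs. categorical target
gender   <- sample(c("F", "M"), 200, replace = TRUE)
purchase <- sample(c("yes", "no"), 200, replace = TRUE)
chisq.test(table(gender, purchase))
# ANOVA: continuous variable compared across groups of a categorical variable
income_group <- sample(c("low", "mid", "high"), 200, replace = TRUE)
spending     <- rnorm(200, mean = 100, sd = 15)
summary(aov(spending ~ income_group))
# Correlation test: continuous predictor vs. continuous target
ad_spend <- runif(200, 0, 50)
sales    <- 3 * ad_spend + rnorm(200, sd = 10)
cor.test(ad_spend, sales)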
15.1.2.2 Information-Theoretic Measures
Information-theoretic measures assess variable relevance based on how much information they provide about the target.
15.1.2.2.1 Information Gain
Information Gain (IG) measures the reduction in uncertainty about the target variable when a predictor is known. It is used extensively in decision trees.
Formula:

$$IG = H(Y) - H(Y \mid X)$$

Where:
$H(Y)$ = Entropy of the target variable
$H(Y \mid X)$ = Conditional entropy of the target given the predictor
A higher IG indicates a more informative variable.
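A minimal sketch of computing information gain with the infotheo package (the variables below are constructed for illustration; discretize() can be applied to continuous features first):
# Information gain of a predictor X about a target Y (sketch, using infotheo)
library(infotheo)
set.seed(123)
X <- sample(1:3, 200, replace = TRUE)
Y <- ifelse(runif(200) < 0.7, X, sample(1:3, 200, replace = TRUE))   # Y partially depends on X
entropy(Y) - condentropy(Y, X)   # IG = H(Y) - H(Y | X)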
15.1.2.2.2 Mutual Information
Mutual Information (MI) quantifies how much knowing one variable reduces uncertainty about another. Unlike correlation, it captures both linear and non-linear relationships.
Pros: Captures non-linear relationships, robust to outliers.
Cons: More computationally intensive than correlation.
Formula:

$$MI(X, Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}$$

Where:
$P(x, y)$ = Joint probability distribution of $X$ and $Y$.
$P(x)$, $P(y)$ = Marginal probability distributions.
# Load Library
library(infotheo)
# Compute Mutual Information Between Two Features
set.seed(123)
X <- sample(1:5, 100, replace=TRUE)
Y <- sample(1:5, 100, replace=TRUE)
mutinformation(X, Y)
#> [1] 0.06852247
Since `X` and `Y` are independently sampled, we expect them to have no mutual dependence. The MI value should be close to 0, indicating that knowing `X` provides almost no information about `Y`, and vice versa. If the MI value is noticeably greater than 0, it is likely due to random fluctuations, especially in small samples.
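For contrast, a quick sketch with a constructed dependent pair (the variable Z below is ours) should yield a noticeably larger MI:
# MI for a dependent pair: Z copies X about 80% of the time
set.seed(123)
Z <- ifelse(runif(100) < 0.8, X, sample(1:5, 100, replace = TRUE))
mutinformation(X, Z)   # expected to be well above the independent case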
Method | Type | Suitable For | Pros | Cons |
---|---|---|---|---|
Chi-Square Test | Statistical Test | Categorical vs. Categorical | Simple, interpretable | Requires large sample sizes |
ANOVA | Statistical Test | Continuous vs. Categorical | Handles multiple groups | Assumes normality |
Correlation | Statistical Test | Continuous vs. Continuous | Easy to compute | Only captures linear relations |
Information Gain | Information-Based | Any Variable Type | Good for decision trees | Requires computation of entropy |
Mutual Information | Information-Based | Any Variable Type | Captures non-linear dependencies | More computationally expensive |
15.1.3 Correlation-Based Feature Selection
Evaluates features based on their correlation with the target and redundancy with other features. Check out Descriptive Statistics and Basic Statistical Inference for more details.
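As a rough sketch of this idea (base R plus caret; the cutoff of 0.75 and all variable names are arbitrary choices for illustration), one can rank features by their absolute correlation with the target and flag mutually redundant features with caret::findCorrelation():
# Correlation-based screening (sketch)
library(caret)
set.seed(123)
f1 <- rnorm(200)
f2 <- f1 + rnorm(200, sd = 0.1)              # f2 is nearly redundant with f1
f3 <- rnorm(200)
target <- 2 * f1 - f3 + rnorm(200)
feats  <- data.frame(f1, f2, f3)
# Relevance: absolute correlation of each feature with the target
sort(abs(cor(feats, target)[, 1]), decreasing = TRUE)
# Redundancy: indices of features that are highly correlated with another feature
findCorrelation(cor(feats), cutoff = 0.75)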
15.1.4 Variance Thresholding
Variance Thresholding is a simple yet effective filter method used for feature selection. It removes features with low variance, assuming that low-variance features contribute little to model prediction. This technique is particularly useful when:
- Handling high-dimensional datasets where many features contain little useful information.
- Reducing computational complexity by removing uninformative features.
- Avoiding overfitting by eliminating features that are nearly constant across samples.
This method is most effective when dealing with binary features (e.g., categorical variables encoded as 0s and 1s) and numerical features with low variance.
Variance measures the spread of a feature’s values. A feature with low variance contains nearly the same value for all observations, making it less useful for predictive modeling.
For a feature X, variance is calculated as:
$$\text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

where:
$X_i$ is an individual observation,
$\bar{X}$ is the mean of $X$,
$n$ is the number of observations.
Example: Features with Low and High Variance
Feature | Sample Values | Variance |
---|---|---|
Low Variance | 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 | 0.09 |
High Variance | 3, 7, 1, 9, 2, 8, 6, 4, 5, 0 | 8.25 |
15.1.4.1 Identifying Low-Variance Features
A variance threshold is set to remove features whose variance falls below a chosen level. A threshold of 0 (the usual default) removes only features that take a single constant value across all samples.
# Load necessary library
library(caret)
# Generate synthetic dataset
set.seed(123)
data <- data.frame(
Feature1 = c(rep(0, 50), rep(1, 50)), # Low variance
Feature2 = rnorm(100, mean=10, sd=1), # High variance
Feature3 = runif(100, min=5, max=15), # Moderate variance
Feature4 = c(rep(3, 95), rep(4, 5)) # Almost constant
)
# Compute Variance of Features
variances <- apply(data, 2, stats::var)
print(variances)
#> Feature1 Feature2 Feature3 Feature4
#> 0.2525253 0.8332328 8.6631461 0.0479798
# Set threshold and remove low-variance features
threshold <- 0.1
selected_features <- names(variances[variances > threshold])
filtered_data <- data[, selected_features]
print(selected_features) # Remaining features after filtering
#> [1] "Feature1" "Feature2" "Feature3"
15.1.4.2 Handling Binary Categorical Features
For binary features (0/1 values), variance is computed as $\text{Var}(X) = p(1 - p)$, where $p$ is the proportion of ones.
If $p \approx 0$ or $p \approx 1$, variance is low, meaning the feature is almost constant.
If $p = 0.5$, variance is at its maximum, meaning an equal split of 0s and 1s.
# Binary feature dataset
binary_data <- data.frame(
Feature_A = c(rep(0, 98), rep(1, 2)), # Low variance (almost all 0s)
Feature_B = sample(0:1, 100, replace=TRUE) # Higher variance
)
# Compute variance for binary features
binary_variances <- apply(binary_data, 2, stats::var)
print(binary_variances)
#> Feature_A Feature_B
#> 0.01979798 0.24757576
# Apply threshold (removing features with variance < 0.01)
threshold <- 0.01
filtered_binary <- binary_data[, binary_variances > threshold]
print(colnames(filtered_binary))
#> [1] "Feature_A" "Feature_B"
Aspect | Pros | Cons |
---|---|---|
Efficiency | Fast and computationally cheap | May remove useful features |
Interpretability | Simple to understand and implement | Ignores correlation with target |
Applicability | Works well for removing constant or near-constant features | Not useful for detecting redundant but high-variance features |
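For a packaged alternative to the manual thresholding above, caret’s nearZeroVar() flags predictors that are constant or nearly constant, using both the frequency ratio of the two most common values and the percentage of unique values; a quick sketch on the synthetic data built earlier:
# Flag near-zero-variance predictors with caret (sketch)
nzv <- nearZeroVar(data, saveMetrics = TRUE)
print(nzv)                         # freqRatio, percentUnique, zeroVar, nzv per feature
data[, !nzv$nzv, drop = FALSE]     # keep only features not flagged as near-zero variance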