12.2 Categorical Variables
Transforming categorical variables into numerical representations is essential for machine learning models and statistical analysis. The key objectives include:
- Converting categorical data into a format suitable for numerical models.
- Improving model interpretability and performance.
- Handling high-cardinality categorical variables efficiently.
There are multiple ways to transform categorical variables, each with its advantages and use cases. The choice depends on factors like cardinality, ordinality, and model type.
12.2.1 One-Hot Encoding (Dummy Variables)
Creates binary indicator variables for each category.
Formula:
For a categorical variable with k unique values, create k binary columns:
$$x'_i = \begin{cases} 1 & \text{if } x_i = \text{category} \\ 0 & \text{otherwise} \end{cases}$$
When to Use:
- Low-cardinality categorical variables (e.g., "Red", "Blue", "Green").
- Tree-based models (e.g., Random Forest, XGBoost).
- Linear regression models (dummy variables prevent information loss).
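For illustration, a minimal sketch in base R on the built-in iris data; dropping the intercept in model.matrix() yields one 0/1 indicator column per level:

# One-hot encode the Species factor with base R's model.matrix();
# "- 1" removes the intercept so every level gets its own indicator column.
data(iris)
onehot <- model.matrix(~ Species - 1, data = iris)
head(onehot)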
12.2.2 Label Encoding
Assigns integer values to categories.
Formula:
If a categorical variable has k unique values:
$$\text{Category} \rightarrow \text{Integer}$$
Example:
| Category | Encoded Value |
|---|---|
| Red | 1 |
| Blue | 2 |
| Green | 3 |
When to Use:
- Ordinal categorical variables (e.g., "Low", "Medium", "High").
- Neural networks, where the integer codes serve as indices into embedding layers (rather than one-hot vectors).
- Memory-efficient encoding for high-cardinality features.
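For illustration, a minimal sketch in base R: coerce the factor to its underlying integer codes (by default the ordering is alphabetical, so set the levels explicitly if a specific mapping is required):

# Label encode Species by taking the factor's integer codes
data(iris)
species_label <- as.integer(factor(iris$Species))
head(data.frame(Species = iris$Species, encoded = species_label))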
12.2.3 Feature Hashing (Hash Encoding)
Maps categories to a fixed number of hash bins, reducing memory usage.
When to Use:
- High-cardinality categorical variables (e.g., user IDs, URLs).
- Scenarios where an exact category match isn't needed.
- Sparse models (e.g., text data in NLP).
library(text2vec)
library(Matrix)
data(iris)
# Convert the 'Species' factor to character tokens
tokens <- word_tokenizer(as.character(iris$Species))
# Create an iterator over tokens
it <- itoken(tokens, progressbar = FALSE)
# Define the hash_vectorizer with a specified hash size (8 in this case)
vectorizer <- hash_vectorizer(hash_size = 8)
# Create a Document-Term Matrix (DTM) using the hashed features
hashed_dtm <- create_dtm(it, vectorizer)
# Inspect the first few rows of the hashed feature matrix
head(hashed_dtm)
#> 6 x 8 sparse Matrix of class "dgCMatrix"
#>
#> 1 . . . . . . 1 .
#> 2 . . . . . . 1 .
#> 3 . . . . . . 1 .
#> 4 . . . . . . 1 .
#> 5 . . . . . . 1 .
#> 6 . . . . . . 1 .
- word_tokenizer(): Splits the character vector into tokens. Since iris$Species is already a categorical variable with values like "setosa", "versicolor", and "virginica", each value becomes a single token.
- itoken(): Creates an iterator over the tokens.
- hash_vectorizer(): Sets up a hashing vectorizer that maps tokens into a sparse feature space with hash_size = 8 bins (the hash size must be a power of two; here 8 = 2^3).
- create_dtm(): Builds the document-term matrix, which in this case is analogous to a feature matrix with one row per observation.
12.2.4 Binary Encoding
Converts categories to binary representations and distributes them across multiple columns.
Example:
For four categories (“A”, “B”, “C”, “D”):
| Category | Binary Code | Encoded Columns |
|---|---|---|
| A | 00 | 0, 0 |
| B | 01 | 0, 1 |
| C | 10 | 1, 0 |
| D | 11 | 1, 1 |
When to Use:
- High-cardinality categorical features (uses less memory than one-hot encoding).
- Tree-based models (preserves some ordinal information).
library(mltools)
library(data.table)
# Convert the Species column to a data.table and perform one-hot encoding
binary_encoded <- one_hot(as.data.table(iris[, "Species"]))
head(binary_encoded)
#> V1_setosa V1_versicolor V1_virginica
#> 1: 1 0 0
#> 2: 1 0 0
#> 3: 1 0 0
#> 4: 1 0 0
#> 5: 1 0 0
#> 6: 1 0 0
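Note that mltools::one_hot() above produces one-hot indicator columns rather than true binary codes. A minimal sketch of binary encoding proper, using a hypothetical binary_encode() helper that writes each level's integer code in base 2, could look like this:

# Hypothetical helper: binary-encode a categorical vector.
# Each level's integer code (1, 2, 3, ...) is split into bit columns,
# with bit_1 holding the least significant bit.
binary_encode <- function(x) {
  codes <- as.integer(factor(x))                    # integer code per level
  n_bits <- ceiling(log2(max(codes) + 1))           # bit columns needed
  bits <- sapply(codes, function(v) as.integer(intToBits(v))[1:n_bits])
  matrix(bits, ncol = n_bits, byrow = TRUE,
         dimnames = list(NULL, paste0("bit_", seq_len(n_bits))))
}

head(binary_encode(iris$Species))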
12.2.5 Base-N Encoding (Generalized Binary Encoding)
Generalizes Binary Encoding from base 2 to an arbitrary base N.
When to Use:
- Similar to Binary Encoding, but allows for greater flexibility.
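For illustration, a minimal sketch of the same idea in base N (here N = 3), again with a hypothetical helper; each level's integer code is written out as base-N digits, one column per digit:

# Hypothetical helper: base-N encode a categorical vector.
base_n_encode <- function(x, base = 3) {
  codes <- as.integer(factor(x)) - 1                      # 0-based integer codes
  n_digits <- max(1, ceiling(log(max(codes) + 1, base)))  # digit columns needed
  digits <- sapply(codes, function(v) (v %/% base^(0:(n_digits - 1))) %% base)
  matrix(digits, ncol = n_digits, byrow = TRUE,
         dimnames = list(NULL, paste0("digit_", seq_len(n_digits))))
}

# Hypothetical 5-level feature, encoded with two base-3 digit columns
x <- c("A", "B", "C", "D", "E", "A", "C")
base_n_encode(x, base = 3)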
12.2.6 Frequency Encoding
Replaces each category with its frequency (proportion) in the dataset.
Formula:
$$x'_i = \frac{\text{count}(x_i)}{\text{total count}}$$
When to Use:
- High-cardinality categorical variables.
- Feature engineering for boosting algorithms (e.g., LightGBM).
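For illustration, a minimal sketch in base R, replacing each level of iris$Species with its relative frequency in the data:

# Frequency-encode Species: each level is replaced by its proportion
data(iris)
freq <- table(iris$Species) / nrow(iris)
iris$Species_freq <- as.numeric(freq[as.character(iris$Species)])
head(iris[, c("Species", "Species_freq")])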
12.2.7 Target Encoding (Mean Encoding)
Encodes categories using the mean of the target variable.
Formula:
$$x'_i = E[Y \mid X = x_i]$$
When to Use:
- Predictive models with categorical features strongly correlated with the target.
- High-cardinality categorical variables.
Risk: Can lead to data leakage (use cross-validation).
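For illustration, a minimal sketch of target (mean) encoding with dplyr, treating cyl in the built-in mtcars data as a categorical feature and the binary am column as the target. In practice the category means should be computed on training folds only (or via cross-validation) to avoid the leakage noted above.

library(dplyr)
# Replace each cyl category with the mean of the target within that category
target_encoded <- mtcars %>%
  group_by(cyl) %>%
  mutate(cyl_target_enc = mean(am)) %>%
  ungroup()
head(target_encoded[, c("cyl", "am", "cyl_target_enc")])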
12.2.8 Ordinal Encoding
Maps categories to ordered integer values based on logical ranking.
Example:
| Category | Ordinal Encoding |
|---|---|
| Low | 1 |
| Medium | 2 |
| High | 3 |
When to Use:
- Ordinal variables with meaningful order (e.g., satisfaction ratings).
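For illustration, a minimal sketch in base R with hypothetical satisfaction ratings: declare the level order explicitly with an ordered factor, then take the underlying integer codes.

# Ordinal-encode ratings by fixing the level order before extracting codes
satisfaction <- c("Low", "High", "Medium", "Low")  # hypothetical ratings
ord <- factor(satisfaction, levels = c("Low", "Medium", "High"), ordered = TRUE)
as.integer(ord)  # Low = 1, Medium = 2, High = 3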
12.2.9 Weight of Evidence Encoding
Concept:
WoE is a method to convert categorical data into numerical values that capture the strength of the relationship between a feature (or category) and a binary outcome (like default vs. non-default).
The Formula:
$$\text{WoE} = \log\left(\frac{P(X_i \mid Y = 1)}{P(X_i \mid Y = 0)}\right)$$
- $P(X_i \mid Y = 1)$: The probability (or proportion) of observing category $X_i$ given the positive outcome (e.g., a "good" credit event).
- $P(X_i \mid Y = 0)$: The probability of observing category $X_i$ given the negative outcome (e.g., a "bad" credit event).
- Logarithm: Taking the log of the ratio gives a symmetric scale where:
  - A positive WoE indicates the category is more associated with the positive outcome.
  - A negative WoE indicates the category is more associated with the negative outcome.
When and Why to Use WoE Encoding?
Logistic Regression in Credit Scoring:
Logistic regression models predict probabilities in terms of log-odds. WoE encoding aligns well with this because it essentially expresses how the odds of the positive outcome change with different categories. This is why it's popular in credit scoring models.
Interpretability:
The WoE transformation makes it easier to understand and interpret the relationship between each category of a variable and the outcome. Each category’s WoE value tells you whether it increases or decreases the odds of a particular event (e.g., default).
Imagine you have a feature “Employment Status” with categories “Employed” and “Unemployed”:
Calculate Proportions:
- $P(\text{Employed} \mid Y = 1) = 0.8$ (80% of good credit cases are employed)
- $P(\text{Employed} \mid Y = 0) = 0.4$ (40% of bad credit cases are employed)
Compute WoE for "Employed":
$$\text{WoE}_{\text{Employed}} = \log\left(\frac{0.8}{0.4}\right) = \log(2) \approx 0.693$$
A positive value indicates that being employed increases the odds of a good credit outcome.
Repeat for "Unemployed". Suppose:
- $P(\text{Unemployed} \mid Y = 1) = 0.2$
- $P(\text{Unemployed} \mid Y = 0) = 0.6$
$$\text{WoE}_{\text{Unemployed}} = \log\left(\frac{0.2}{0.6}\right) = \log\left(\tfrac{1}{3}\right) \approx -1.099$$
A negative value indicates that being unemployed is associated with a higher likelihood of a bad credit outcome.
Why is WoE Valuable?
Linear Relationship:
When you plug these WoE values into a logistic regression, the model essentially adds these values linearly, which fits nicely with how logistic regression models the log-odds.
Stability & Handling of Missing Values:
WoE can also help smooth out fluctuations in categorical data, especially when there are many levels or some levels with few observations.
Regulatory Acceptance:
In industries like banking, WoE is widely accepted because of its clear interpretability, which is crucial for compliance and transparency in credit risk modeling.
# Load required packages
library(dplyr)
library(knitr)
# Create a sample dataset
# We assume 100 good credit cases and 100 bad credit cases
# Good credit: 80 "Employed" and 20 "Unemployed"
# Bad credit: 40 "Employed" and 60 "Unemployed"
data <- data.frame(
employment_status = c(rep("Employed", 80), rep("Unemployed", 20),
rep("Employed", 40), rep("Unemployed", 60)),
credit = c(rep(1, 100), rep(0, 100))
)
# Calculate counts for each category
woe_table <- data %>%
group_by(employment_status) %>%
summarise(
good = sum(credit == 1),
bad = sum(credit == 0)
) %>%
# Calculate the distribution for good and bad credit cases
mutate(
dist_good = good / sum(good),
dist_bad = bad / sum(bad),
WoE = log(dist_good / dist_bad)
)
# Print the WoE table
kable(woe_table)
| employment_status | good | bad | dist_good | dist_bad | WoE |
|---|---|---|---|---|---|
| Employed | 80 | 40 | 0.8 | 0.4 | 0.6931472 |
| Unemployed | 20 | 60 | 0.2 | 0.6 | -1.0986123 |
# Merge the WoE values into the original data
data_woe <- data %>%
left_join(woe_table %>% dplyr::select(employment_status, WoE), by = "employment_status")
head(data_woe)
#> employment_status credit WoE
#> 1 Employed 1 0.6931472
#> 2 Employed 1 0.6931472
#> 3 Employed 1 0.6931472
#> 4 Employed 1 0.6931472
#> 5 Employed 1 0.6931472
#> 6 Employed 1 0.6931472
# Fit a logistic regression model using WoE as predictor
model <- glm(credit ~ WoE, data = data_woe, family = binomial)
# Summary of the model
summary(model)
#>
#> Call:
#> glm(formula = credit ~ WoE, family = binomial, data = data_woe)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.023e-12 1.552e-01 0.000 1
#> WoE 1.000e+00 1.801e-01 5.552 2.83e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 277.26 on 199 degrees of freedom
#> Residual deviance: 242.74 on 198 degrees of freedom
#> AIC: 246.74
#>
#> Number of Fisher Scoring iterations: 4
When you fit a logistic regression using the WoE-encoded variable, the model is essentially:
$$\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1 \cdot \text{WoE}$$
Here, WoE represents the Weight of Evidence value for a given category.
Log Odds Change:
$\beta_1$ indicates how much the log odds of a good credit outcome change for a one-unit increase in WoE. For example, if $\beta_1 = 0.5$, then a one-unit increase in WoE is associated with an increase of 0.5 in the log odds of having a good credit outcome.
Odds Ratio:
If you exponentiate $\beta_1$, you get the odds ratio. For instance, if $\beta_1 = 0.5$, then $\exp(0.5) \approx 1.65$. This means that for each one-unit increase in WoE, the odds of having a good credit outcome are multiplied by about 1.65.
Why is This Meaningful?
Direct Link to the Data:
The WoE value itself is a transformation of the original categorical variable that reflects the ratio of the proportions of good to bad outcomes for that category. By using WoE, you're directly incorporating this information into the model.
Interpretability:
The interpretation becomes intuitive:
- A positive WoE indicates that the category is more associated with a good outcome.
- A negative WoE indicates that the category is more associated with a bad outcome.
Thus, if $\beta_1$ is positive, it suggests that as the category moves to one with a higher WoE (i.e., more favorable for a good outcome), the likelihood of a good outcome increases.
12.2.10 Helmert Encoding
Compares each category against the mean of previous categories.
When to Use:
- ANOVA models and categorical regression.
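For illustration, a minimal sketch using base R's built-in Helmert contrasts (contr.helmert()), applied to iris$Species; each contrast column compares a level with the mean of the levels before it.

# Attach Helmert contrasts to the factor and expand them with model.matrix()
data(iris)
species <- iris$Species
contrasts(species) <- contr.helmert(nlevels(species))
helmert_cols <- model.matrix(~ species)[, -1]  # drop the intercept column
head(helmert_cols)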
12.2.11 Probability Ratio Encoding
Encodes each category using the ratio of target probabilities, $P(Y = 1 \mid x_i) \, / \, P(Y = 0 \mid x_i)$.
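For illustration, a minimal sketch with dplyr, treating cyl in mtcars as the categorical feature and am as a binary target; each category is summarized by its ratio of target probabilities.

library(dplyr)
# Probability ratio per category: P(Y = 1 | category) / P(Y = 0 | category)
prob_ratio <- mtcars %>%
  group_by(cyl) %>%
  summarise(p1 = mean(am == 1), p0 = mean(am == 0)) %>%
  mutate(prob_ratio = p1 / p0)
prob_ratio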
12.2.12 Backward Difference Encoding
Compares the mean of the target for each category with the mean for the immediately preceding category (adjacent-level contrasts).
12.2.13 Leave-One-Out Encoding
Similar to target encoding, but excludes the current observation to avoid bias.
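For illustration, a minimal sketch with dplyr (same mtcars columns as above): for each row, the category mean of the target is computed while excluding that row itself.

library(dplyr)
# Leave-one-out mean of the target within each cyl category
loo_encoded <- mtcars %>%
  group_by(cyl) %>%
  mutate(cyl_loo = (sum(am) - am) / (n() - 1)) %>%
  ungroup()
head(loo_encoded[, c("cyl", "am", "cyl_loo")])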
12.2.14 Choosing the Right Encoding Method
| Encoding Method | Best for Low Cardinality | Best for High Cardinality | Handles Ordinality | Suitable for Tree Models | Suitable for Linear Models |
|---|---|---|---|---|---|
| One-Hot Encoding | ✅ Yes | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| Label Encoding | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Target Encoding | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Frequency Encoding | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
| Binary Encoding | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |