13 Variable Transformation
See also the `trafo` package vignette.
13.1 Continuous Variables
Purposes:

- To change the scale of the variables
- To transform a skewed distribution toward a normal distribution
13.1.2 Min-max scaling
\[ x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]
This transformation depends on the minimum and maximum values, which makes it sensitive to outliers. It is best used when the values lie in a known, fixed interval.
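A minimal sketch in base R, using the built-in `cars` data (the helper name `minmax` is illustrative, not from a package):

```r
# Min-max scaling: map values into [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- minmax(cars$dist)
range(scaled)
```

Note that new data falling outside the original min/max would be scaled outside [0, 1], which is one reason the transformation is sensitive to outliers.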
13.1.3 Square Root/Cube Root
Use when variables have positive skewness or residuals show positive heteroskedasticity. Typical cases:

- Frequency-count variables
- Data with many 0s or extremely small values
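A sketch with the built-in `cars` data. One detail worth noting: in R, `x^(1/3)` returns `NaN` for negative values, so the cube root is usually computed via the sign trick shown below:

```r
# Square root for positively skewed, non-negative data
sq_dist <- sqrt(cars$dist)

# Cube root that also handles negative values:
# sign(x) * abs(x)^(1/3), since (-8)^(1/3) is NaN in R
cube_root <- function(x) sign(x) * abs(x)^(1/3)
cube_dist <- cube_root(cars$dist)

hist(sq_dist)
```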
13.1.4 Logarithmic
- Variables have positively skewed distribution
| Formula | Use case |
|---|---|
| \(x_i' = \log(x_i)\) | cannot handle 0 because \(\log(0) = -\infty\) |
| \(x_i' = \log(x_i + 1)\) | variables with 0s |
| \(x_i' = \log(x_i + c)\) | variables with 0s; requires choosing a constant \(c\) |
| \(x_i' = \frac{x_i}{|x_i|}\log|x_i|\) | variables with negative values |
| \(x_i'^\lambda = \log(x_i + \sqrt{x_i^2 + \lambda})\) | generalized log transformation |
For the general case of \(\log(x_i + c)\), choosing a constant is rather tricky.
The choice of the constant is critically important, especially when you want to do inference. It can dramatically change your model fit (see (Ekwaru and Veugelers 2018) for the independent variable case).
However, assuming that your 0s do not arise from:

- Censoring
- Measurement errors (stemming from measurement tools)

we can proceed with choosing \(c\) (it is okay if your 0s represent really small values).
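A sketch of the \(\log(x_i + c)\) approach on toy data. Setting \(c\) to half the smallest positive value is one common heuristic, used here purely as an illustration; the text above only says the choice is tricky, not which rule to use:

```r
x <- c(0, 1, 5, 20, 100)       # toy data containing a zero

log1 <- log(x + 1)             # log(x + 1): maps 0 to 0

# c = half the smallest positive value (one common heuristic, an assumption)
c_half <- min(x[x > 0]) / 2
logc <- log(x + c_half)
```

Because the fitted model (and especially elasticity-style interpretations) can change with \(c\), it is worth re-running the analysis under a few values of \(c\) as a sensitivity check.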
13.1.7 Inverse/Reciprocal
- Variables have a platykurtic distribution
- Data are positively skewed
- Ratio data
# reciprocal transformation of the built-in cars data
plot(1 / cars$dist)
13.1.11 Lambert W x F Transformation
Implemented in the `LambertW` package.

- Uses moments to normalize data
- Usually compared against the Box-Cox and Yeo-Johnson transformations
- Can handle skewness and heavy tails
# Gaussianize returns a matrix; convert to a vector for plotting
l_dist <- LambertW::Gaussianize(cars$dist)
hist(as.numeric(l_dist))
13.1.12 Inverse Hyperbolic Sine (IHS) transformation
Proposed by N. L. Johnson (1949). It can be applied to all real numbers, including zero and negative values.
\[ \begin{aligned} f(x,\theta) &= \frac{\sinh^{-1} (\theta x)}{\theta} \\ &= \frac{\log(\theta x + (\theta^2 x^2 + 1)^{1/2})}{\theta} \end{aligned} \]
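In base R, the IHS with \(\theta = 1\) is simply `asinh()`; a sketch of the general form above (the function name `ihs` is illustrative):

```r
# Inverse hyperbolic sine: asinh(theta * x) / theta
# equals log(theta*x + sqrt(theta^2 * x^2 + 1)) / theta
ihs <- function(x, theta = 1) asinh(theta * x) / theta

ihs(c(-10, 0, 10))   # handles negative values and zero
```

Unlike \(\log(x + c)\), no constant has to be chosen for zeros, and the transformation is antisymmetric: \(f(-x, \theta) = -f(x, \theta)\).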
13.1.13 Box-Cox Transformation
\[ y^\lambda = \beta x+ \epsilon \]
to fix non-linearity in the error terms. It works well for \(\lambda\) in \((-3, 3)\) (i.e., small transformations). It can also be applied to independent variables:
\[ x_i'^\lambda = \begin{cases} \frac{x_i^\lambda-1}{\lambda} & \text{if } \lambda \neq 0\\ \log(x_i) & \text{if } \lambda = 0 \end{cases} \]
And the two-parameter version is
\[ x_i' (\lambda_1, \lambda_2) = \begin{cases} \frac{(x_i + \lambda_2)^{\lambda_1}-1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \log(x_i + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases} \]
More advanced extensions of the Box-Cox transformation also exist.
# estimate lambda over a grid (the boxcox call defining `bc` was not
# shown in the original chunk; this call is an assumed reconstruction)
bc <- MASS::boxcox(speed ~ dist, data = cars, lambda = seq(-3, 3, 0.1))

# best lambda
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
#> [1] 1.242424

# model with the best lambda
mod_lambda <- lm(speed^best_lambda ~ dist, data = cars)
plot(mod_lambda)
13.1.14 Yeo-Johnson Transformation
Similar to the Box-Cox transformation (for non-negative values it equals the Box-Cox transformation applied to \(x_i + 1\)), but it also accommodates zero and negative values:
\[ x_i'^\lambda = \begin{cases} \frac{(x_i+1)^\lambda -1}{\lambda} & \text{if } \lambda \neq0, x_i \ge 0 \\ \log(x_i + 1) & \text{if } \lambda = 0, x_i \ge 0 \\ \frac{-[(-x_i+1)^{2-\lambda}-1]}{2 - \lambda} & \text{if } \lambda \neq 2, x_i <0 \\ -\log(-x_i + 1) & \text{if } \lambda = 2, x_i <0 \end{cases} \]
data(cars)
yj_speed <- bestNormalize::yeojohnson(cars$speed)
hist(yj_speed$x.t)   # x.t holds the transformed values
13.2 Categorical Variables
Purposes:

- To transform categorical variables into continuous ones (for machine learning models), e.g., encoding/embedding in text mining

Approaches:
- One-hot encoding
- Label encoding
- Feature hashing
- Binary encoding
- Base N encoding
- Frequency encoding
- Target encoding
- Ordinal encoding
- Helmert encoding
- Mean encoding
- Weight of evidence encoding
- Probability ratio encoding
- Backward difference encoding
- Leave one out encoding
- James-Stein encoding
- M-estimator encoding
- Thermometer encoding
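As a sketch of the first approach, one-hot (dummy) encoding can be done in base R with `model.matrix()`; here on the built-in `iris` data, with the `- 1` dropping the intercept so every level gets its own indicator column:

```r
# One-hot encoding: one 0/1 indicator column per level of Species
onehot <- model.matrix(~ Species - 1, data = iris)

head(onehot)
colnames(onehot)
```

Each row has exactly one 1 across the three Species columns. For regression models, the default treatment contrasts (keeping the intercept and dropping one level) are usually preferred to avoid perfect collinearity.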