13 Variable Transformation
See the `trafo` package vignette for an overview of available transformations.
13.1 Continuous Variables
Purposes:

- To change the scale of a variable
- To transform a skewed distribution toward a normal distribution
13.1.2 Min-max scaling
\[ x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]
- Depends on the minimum and maximum values, which makes it sensitive to outliers.
- Best used when the values lie in a fixed interval.
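A minimal sketch in base R, using the built-in `cars` data; the `min_max` helper is ours, not from a package:

# min-max scaling: maps values onto the [0, 1] interval
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

mm_dist <- min_max(cars$dist)
range(mm_dist) # 0 and 1 by construction
hist(mm_dist)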
13.1.3 Square Root/Cube Root
Use when:

- Variables have positive skewness or residuals exhibit positive heteroskedasticity
- The variable is a frequency count
- Data have many 0s or extremely small values
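A minimal sketch using the built-in `cars` data (the variable choice is ours, for illustration):

# square root compresses large values, pulling in a long right tail
sqrt(cars$dist) %>% hist()

# cube root is also defined at 0; for negatives use sign(x) * abs(x)^(1/3)
(cars$dist)^(1/3) %>% hist()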
13.1.4 Logarithmic
- Variables have a positively skewed distribution
Formula | Use case |
---|---|
\(x_i' = \log(x_i)\) | does not work with 0s because \(\log(0) = -\infty\) |
\(x_i' = \log(x_i + 1)\) | variables with 0s |
\(x_i' = \log(x_i + c)\) | general shift by a constant \(c\) (choosing \(c\) is discussed below) |
\(x_i' = \frac{x_i}{|x_i|}\log|x_i|\) | variables with negative values |
\(x_i'^\lambda = \log(x_i + \sqrt{x_i^2 + \lambda})\) | generalized log transformation |
For the general case of \(\log(x_i + c)\), choosing the constant \(c\) is rather tricky. The choice is critically important, especially when you want to do inference, because it can dramatically change your model fit (see Ekwaru and Veugelers (2018) for the independent-variable case).
J. Chen and Roth (2023) show that in causal inference problems, \(\log\)-transforming outcomes with a meaningful 0 is problematic, but there are solutions for each identification approach (e.g., difference-in-differences, instrumental variables).
However, assuming that your 0s do not arise from

- Censoring
- Measurement error (stemming from measurement tools)

we can proceed with choosing \(c\) (it's okay if your 0s represent genuinely small values).
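A minimal sketch of the \(\log(x_i + 1)\) case, using base R's `log1p()` on the built-in `cars` data:

# log1p(x) computes log(x + 1) accurately for small x and handles 0s
log1p(cars$dist) %>% hist()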
13.1.7 Inverse/Reciprocal
- Variables have a platykurtic distribution
- Data are positively skewed
- Ratio data
# reciprocal transformation of the distance variable
plot(1 / cars$dist)
13.1.10 Arcsinh
- Proportion variables (0-1)
# raw distribution for comparison: cars$dist %>% MASS::truehist()
as_dist <- bestNormalize::arcsinh_x(cars$dist)
as_dist
#> Standardized asinh(x) Transformation with 50 nonmissing obs.:
#> Relevant statistics:
#> - mean (before standardization) = 4.230843
#> - sd (before standardization) = 0.7710887
as_dist$x.t %>% hist()
\[ \operatorname{arcsinh}(Y) = \log(\sqrt{1 + Y^2} + Y) \]
Paper | Interpretation |
---|---|
Azoulay, Fons-Rosen, and Zivin (2019) | Elasticity |
Faber and Gaubert (2019) | Percentage |
Hjort and Poulsen (2019) | Percentage |
M. S. Johnson (2020) | Percentage |
Beerli et al. (2021) | Percentage |
Norris, Pecenco, and Weaver (2021) | Percentage |
Berkouwer and Dean (2022) | Percentage |
Cabral, Cui, and Dworsky (2022) | Elasticity |
Carranza et al. (2022) | Percentage |
Mirenda, Mocetti, and Rizzica (2022) | Percentage |
For a simple regression model \(Y = \beta X\):

- When both \(Y\) and \(X\) are transformed, the coefficient estimate represents an elasticity: the percentage change in \(Y\) for a 1% change in \(X\).
- When only \(Y\) is transformed and \(X\) is in raw form, the coefficient estimate (times 100) approximates the percentage change in \(Y\) for a one-unit change in \(X\).
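A minimal sketch of the second case, regressing arcsinh-transformed distance on raw speed in the built-in `cars` data (the variable pairing is ours, for illustration only):

# coef * 100 approximates the percent change in dist per one-unit
# increase in speed (the approximation is best for small coefficients)
fit_ihs <- lm(asinh(dist) ~ speed, data = cars)
coef(fit_ihs)["speed"] * 100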
13.1.11 Lambert W x F Transformation
See the `LambertW` package.

- Uses moments to normalize data
- Usually needs to be compared with the Box-Cox and Yeo-Johnson transformations
- Can handle skewness and heavy-tailed distributions
# Gaussianize() estimates a Lambert W transformation that makes the
# input approximately Gaussian
l_dist <- LambertW::Gaussianize(cars$dist)
l_dist %>% hist()
13.1.12 Inverse Hyperbolic Sine (IHS) transformation
Proposed by N. L. Johnson (1949).

Can be applied to all real numbers, including 0 and negative values.
\[ \begin{aligned} f(x,\theta) &= \frac{\sinh^{-1} (\theta x)}{\theta} \\ &= \frac{\log(\theta x + (\theta^2 x^2 + 1)^{1/2})}{\theta} \end{aligned} \]
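A minimal sketch implementing the formula above directly; the `ihs` helper is ours, not from a package (with \(\theta = 1\) it reduces to base R's `asinh()`):

# inverse hyperbolic sine with scale parameter theta
ihs <- function(x, theta = 1) asinh(theta * x) / theta

all.equal(ihs(cars$dist), asinh(cars$dist)) # TRUE when theta = 1
ihs(cars$dist, theta = 2) %>% hist()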
13.1.13 Box-Cox Transformation
\[ y^\lambda = \beta x + \epsilon \]

- Used to fix non-linearity in the error terms
- Works well for \(\lambda \in (-3, 3)\) (i.e., small transformations)

It can also be applied to independent variables:
\[ x_i'^\lambda = \begin{cases} \frac{x_i^\lambda-1}{\lambda} & \text{if } \lambda \neq 0\\ \log(x_i) & \text{if } \lambda = 0 \end{cases} \]
And the two-parameter version is
\[ x_i' (\lambda_1, \lambda_2) = \begin{cases} \frac{(x_i + \lambda_2)^{\lambda_1}-1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \log(x_i + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases} \]
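A minimal sketch applying the two-parameter version directly; the `boxcox_2p` helper is ours, not from a package (\(\lambda_2\) shifts the data so that \(x_i + \lambda_2 > 0\)):

# two-parameter Box-Cox transform, applied element-wise
boxcox_2p <- function(x, lambda1, lambda2 = 0) {
  stopifnot(all(x + lambda2 > 0))
  if (lambda1 != 0) ((x + lambda2)^lambda1 - 1) / lambda1
  else log(x + lambda2)
}

boxcox_2p(cars$dist, lambda1 = 0.5) %>% hist()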
More advanced extensions of the Box-Cox transformation are also available.
# profile the log-likelihood over a grid of lambda values
# (the original omitted this call; MASS::boxcox with a lambda grid over
# (-3, 3) is an assumption consistent with the reported optimum)
bc <- MASS::boxcox(speed ~ dist, data = cars, lambda = seq(-3, 3, 1/10))

# best lambda
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
#> [1] 1.242424

# model with best lambda
mod_lambda <- lm(I(speed^best_lambda) ~ dist, data = cars)
plot(mod_lambda)
13.1.14 Yeo-Johnson Transformation
Similar to the Box-Cox transformation (for non-negative values, it is the Box-Cox transformation applied to \(x_i + 1\)), but it also allows for negative values:
\[ x_i'^\lambda = \begin{cases} \frac{(x_i+1)^\lambda -1}{\lambda} & \text{if } \lambda \neq0, x_i \ge 0 \\ \log(x_i + 1) & \text{if } \lambda = 0, x_i \ge 0 \\ \frac{-[(-x_i+1)^{2-\lambda}-1]}{2 - \lambda} & \text{if } \lambda \neq 2, x_i <0 \\ -\log(-x_i + 1) & \text{if } \lambda = 2, x_i <0 \end{cases} \]
data(cars)
yj_speed <- bestNormalize::yeojohnson(cars$speed)
yj_speed$x.t %>% hist()
13.2 Categorical Variables
Purposes:

- To transform categorical variables into continuous ones (for machine learning models), e.g., encoding/embedding in text mining
Approaches (a one-hot encoding sketch follows the list):

- One-hot encoding
- Label encoding
- Feature hashing
- Binary encoding
- Base-N encoding
- Frequency encoding
- Target encoding
- Ordinal encoding
- Helmert encoding
- Mean encoding
- Weight of evidence encoding
- Probability ratio encoding
- Backward difference encoding
- Leave-one-out encoding
- James-Stein encoding
- M-estimator encoding
- Thermometer encoding
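As a concrete illustration of the first approach, here is a minimal sketch of one-hot encoding in base R via `model.matrix()`; the `iris` data and the `onehot` name are ours, chosen for illustration:

# model.matrix() expands a factor into indicator (one-hot) columns;
# "~ Species - 1" removes the intercept so every level gets its own column
onehot <- model.matrix(~ Species - 1, data = iris)
head(onehot)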