13.1 Continuous Variables

Purposes:

  • To change the scale of the variables

  • To transform a skewed distribution into an approximately normal one

13.1.1 Standardization

\[ x_i' = \frac{x_i - \bar{x}}{s} \]

Useful when the variable contains a few large values, since it rescales relative to the mean and standard deviation rather than the range.
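
A minimal sketch with base R's scale(), which centers by the mean and divides by the sample standard deviation:

data(cars)
z_speed <- as.numeric(scale(cars$speed)) # (x - mean(x)) / sd(x)
mean(z_speed)                            # approximately 0
sd(z_speed)                              # 1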

13.1.2 Min-max scaling

\[ x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]

Because it depends on the minimum and maximum values, min-max scaling is sensitive to outliers. It is best used when values lie in a fixed interval.
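
Base R has no built-in min-max scaler; a minimal sketch with a small helper function (minmax() is just an illustrative name):

minmax <- function(x) (x - min(x)) / (max(x) - min(x))
mm_speed <- minmax(cars$speed)
range(mm_speed) # rescaled to [0, 1]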

13.1.3 Square Root/Cube Root

  • When variables have positive skewness or residuals show positive heteroskedasticity

  • Frequency (count) variables

  • Data with many zeros or extremely small values (see the sketch below)
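
A quick sketch on the cars data (the pipe assumes magrittr is already loaded, as in the chunks below):

sqrt(cars$dist) %>% hist()  # square root
cars$dist^(1/3) %>% hist()  # cube root; use sign(x) * abs(x)^(1/3) if x can be negative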

13.1.4 Logarithmic

  • Variables have a positively skewed distribution

Common variants and when to use them:

  • \(x_i' = \log(x_i)\): cannot handle zeros, because \(\log(0) = -\infty\)
  • \(x_i' = \log(x_i + 1)\): variables with zeros
  • \(x_i' = \log(x_i + c)\): variables with zeros, for a chosen constant \(c\)
  • \(x_i' = \frac{x_i}{|x_i|}\log|x_i|\): variables with negative values
  • \(x_i' = \log(x_i + \sqrt{x_i^2 + \lambda})\): generalized log transformation

For the general case of \(\log(x_i + c)\), choosing a constant is rather tricky.

The choice of the constant is critically important, especially when you want to do inference, because it can dramatically change your model fit (see Ekwaru and Veugelers 2018 for the independent-variable case).

However, assuming that your zeros do not arise from

  • Censoring

  • Measurement error (stemming from measurement tools)

you can proceed with choosing \(c\) (it is fine if your zeros represent genuinely small values); the sketch after the next code chunk illustrates how the choice of \(c\) affects the transformed values.

data(cars)
cars$speed %>% head()
#> [1] 4 4 7 7 8 9

log(cars$speed) %>% head()
#> [1] 1.386294 1.386294 1.945910 1.945910 2.079442 2.197225

# log(x+1)
log1p(cars$speed) %>% head()
#> [1] 1.609438 1.609438 2.079442 2.079442 2.197225 2.302585
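
To see why the choice of \(c\) matters, a small sketch on toy data containing zeros (the constants 0.5 and 5 are purely illustrative):

x <- c(0, 0, 1, 2, 5, 10, 50)  # toy data with zeros
log(x + 0.5)                   # small constant: zeros map to strongly negative values
log(x + 5)                     # larger constant: compresses differences among small values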

13.1.5 Exponential

  • Negatively skewed data

  • Underlying logarithmic trend (e.g., survival, decay); see the sketch below
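
A minimal sketch on simulated data (the variable and parameters are made up for illustration):

set.seed(1)
x <- 10 - rexp(1000)   # toy negatively skewed variable
hist(x)
hist(exp(x))           # stretching the right side reduces the negative skew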

13.1.6 Power

  • Variables have a negatively skewed distribution (see the sketch below)
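
A minimal sketch on simulated data (rbeta(1000, 5, 1) is just an arbitrary left-skewed variable):

set.seed(1)
x <- rbeta(1000, 5, 1)
hist(x)      # negatively skewed
hist(x^2)    # powers greater than 1 reduce the negative skew
hist(x^3)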

13.1.7 Inverse/Reciprocal

  • Variables have a platykurtic distribution

  • Data are positively skewed

  • Ratio data

data(cars)
head(cars$dist)
#> [1]  2 10  4 22 16 10
plot(cars$dist)

plot(1/(cars$dist))

13.1.8 Hyperbolic arcsine

  • Variables with a positively skewed distribution (see the sketch below)
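
Base R provides asinh(), so no extra package is needed for a quick look:

asinh(cars$dist) %>% hist()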

13.1.9 Ordered Quantile Norm

\[ x_i' = \Phi^{-1} \left( \frac{\text{rank}(x_i) - 1/2}{\text{length}(x)} \right) \]

ord_dist <- bestNormalize::orderNorm(cars$dist)
ord_dist
#> orderNorm Transformation with 50 nonmissing obs and ties
#>  - 35 unique values 
#>  - Original quantiles:
#>   0%  25%  50%  75% 100% 
#>    2   26   36   56  120
ord_dist$x.t %>% hist()

13.1.10 Arcsinh

  • Proportion variables (ranging from 0 to 1)
cars$dist %>% hist()

# cars$dist %>% MASS::truehist()

as_dist <- bestNormalize::arcsinh_x(cars$dist)
as_dist
#> Standardized asinh(x) Transformation with 50 nonmissing obs.:
#>  Relevant statistics:
#>  - mean (before standardization) = 4.230843 
#>  - sd (before standardization) = 0.7710887
as_dist$x.t %>% hist()

13.1.11 Lambert W x F Transformation

Implemented in the LambertW package:

data(cars)
head(cars$dist)
#> [1]  2 10  4 22 16 10
cars$dist %>% hist()



l_dist <- LambertW::Gaussianize(cars$dist)
# Gaussianize() returns a matrix; hist() plots its values directly
l_dist %>% hist()

13.1.12 Inverse Hyperbolic Sine (IHS) transformation

\[ \begin{aligned} f(x,\theta) &= \frac{\sinh^{-1} (\theta x)}{\theta} \\ &= \frac{\log(\theta x + (\theta^2 x^2 + 1)^{1/2})}{\theta} \end{aligned} \]
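
A minimal sketch of the formula above; ihs() is a hypothetical helper and \(\theta\) must be chosen by the user (with \(\theta = 1\) it reduces to plain asinh()):

ihs <- function(x, theta = 1) asinh(theta * x) / theta
ihs(cars$dist, theta = 1) %>% hist()
# equivalently: log(theta * x + sqrt(theta^2 * x^2 + 1)) / theta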

13.1.13 Box-Cox Transformation

\[ y^\lambda = \beta x+ \epsilon \]

Transforming the dependent variable in this way helps fix non-linearity and non-normality in the error terms.

It tends to work well for \(\lambda\) in \((-3, 3)\) (i.e., a relatively mild transformation).

The transformation can also be applied to independent variables:

\[ x_i'^\lambda = \begin{cases} \frac{x_i^\lambda-1}{\lambda} & \text{if } \lambda \neq 0\\ \log(x_i) & \text{if } \lambda = 0 \end{cases} \]

And the two-parameter version is

\[ x_i' (\lambda_1, \lambda_2) = \begin{cases} \frac{(x_i + \lambda_2)^{\lambda_1}-1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \log(x_i + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases} \]

More advanced implementations are available in R:

library(MASS)
data(cars)
mod <- lm(cars$speed ~ cars$dist, data = cars)
# check residuals
plot(mod)


bc <- boxcox(mod, lambda = seq(-3, 3))


# best lambda
bc$x[which(bc$y == max(bc$y))]
#> [1] 1.242424

# model with best lambda
mod_lambda = lm(cars$speed ^ (bc$x[which(bc$y == max(bc$y))]) ~ cars$dist, 
                data = cars)
plot(mod_lambda)


# 2-parameter version
two_bc = geoR::boxcoxfit(cars$speed)
two_bc
#> Fitted parameters:
#>    lambda      beta   sigmasq 
#>  1.028798 15.253008 31.935297 
#> 
#> Convergence code returned by optim: 0
plot(two_bc)



# bestNormalize
bc_dist <- bestNormalize::boxcox(cars$dist)
bc_dist
#> Standardized Box Cox Transformation with 50 nonmissing obs.:
#>  Estimated statistics:
#>  - lambda = 0.4950628 
#>  - mean (before standardization) = 10.35636 
#>  - sd (before standardization) = 3.978036
bc_dist$x.t %>% hist()

13.1.14 Yeo-Johnson Transformation

Similar to the Box-Cox transformation (for non-negative values it coincides with the Box-Cox transformation of \(x_i + 1\)), but it also accommodates negative values:

\[ x_i'^\lambda = \begin{cases} \frac{(x_i+1)^\lambda -1}{\lambda} & \text{if } \lambda \neq0, x_i \ge 0 \\ \log(x_i + 1) & \text{if } \lambda = 0, x_i \ge 0 \\ \frac{-[(-x_i+1)^{2-\lambda}-1]}{2 - \lambda} & \text{if } \lambda \neq 2, x_i <0 \\ -\log(-x_i + 1) & \text{if } \lambda = 2, x_i <0 \end{cases} \]

data(cars)
yj_speed <- bestNormalize::yeojohnson(cars$speed)
yj_speed$x.t %>% hist()

13.1.15 RankGauss

  • Turn values into ranks, then map the ranks to quantiles of the standard normal distribution (see the sketch below).
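
A minimal sketch of the idea (rank_gauss() is a hypothetical helper; bestNormalize::orderNorm() shown earlier implements essentially the same mapping):

rank_gauss <- function(x) qnorm((rank(x) - 0.5) / length(x))
rank_gauss(cars$dist) %>% hist()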

13.1.16 Summary

Automatically choose the best normalization method (using the bestNormalize package):

bestdist <- bestNormalize::bestNormalize(cars$dist)
bestdist$x.t %>% hist()


boxplot(log10(bestdist$oos_preds), yaxt = "n")

# axis(2, at = log10(c(.1, .5, 1, 2, 5, 10)), 
#      labels = c(.1, .5, 1, 2, 5, 10))

References

Bartlett, Maurice S. 1947. “The Use of Transformations.” Biometrics 3 (1): 39–52.
Bickel, Peter J, and Kjell A Doksum. 1981. “An Analysis of Transformations Revisited.” Journal of the American Statistical Association 76 (374): 296–311.
Box, George EP, and David R Cox. 1981. An Analysis of Transformations Revisited, Rebutted. University of Wisconsin-Madison. Mathematics Research Center.
Ekwaru, John Paul, and Paul J Veugelers. 2018. “The Overlooked Importance of Constants Added in Log Transformation of Independent Variables with Zero Values: A Proposed Approach for Determining an Optimal Constant.” Statistics in Biopharmaceutical Research 10 (1): 26–29.
Johnson, N. L. 1949. “Systems of Frequency Curves Generated by Methods of Translation.” Biometrika 36 (1/2): 149. https://doi.org/10.2307/2332539.
Manly, Bryan FJ. 1976. “Exponential Data Transformations.” Journal of the Royal Statistical Society Series D: The Statistician 25 (1): 37–42.