13.1 Continuous Variables

Purposes:

  • To change the scale of the variables

  • To transform a skewed distribution toward a normal distribution

13.1.1 Standardization

\[ x_i' = \frac{x_i - \bar{x}}{s} \]

Use standardization when the variable has a few large values.
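A minimal sketch of standardization in base R, using the built-in `cars` data. The helper name `standardize` is hypothetical; base R's `scale()` should give the same values.

```r
# Standardize: subtract the sample mean, divide by the sample standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

z <- standardize(cars$dist)

# The result has mean 0 and standard deviation 1 (up to floating-point error)
c(mean = mean(z), sd = sd(z))

# scale() returns a one-column matrix, so convert it before comparing
all.equal(z, as.numeric(scale(cars$dist)))
```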

13.1.2 Min-max scaling

\[ x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]

Min-max scaling depends on the minimum and maximum values, which makes it sensitive to outliers.

It is best used when the values lie in a fixed interval.
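As a rough illustration, the sketch below rescales a vector to [0, 1] by subtracting the minimum and dividing by the range, then shows how one outlier squeezes the other values. The helper `min_max` and the toy vectors are assumptions made for the example.

```r
# Rescale to [0, 1]: subtract the minimum, divide by the range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(2, 4, 6, 8, 10)
min_max(x)

# A single outlier compresses the remaining values toward 0
x_out <- c(x, 1000)
min_max(x_out)
```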

13.1.3 Square Root/Cube Root

  • When variables have positive skewness or residuals have positive heteroskedasticity.

  • Frequency count variables

  • Data have many 0 or extremely small values.
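A small sketch of both roots on simulated count data (the Poisson sample and the helper `cube_root` are assumptions for illustration). Unlike the logarithm, the square root is defined at 0, and a sign-preserving cube root also accepts negative values.

```r
set.seed(1)
counts <- rpois(200, lambda = 2)   # simulated frequency counts, zeros included

sqrt_counts <- sqrt(counts)        # defined at 0, unlike log()

# Cube root that keeps the sign, so negative inputs are allowed
cube_root <- function(x) sign(x) * abs(x)^(1 / 3)
cube_root(c(-8, 0, 27))
```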

13.1.4 Logarithmic

  • Variables have a positively skewed distribution

| Formula | In case |
|---|---|
| \(x_i' = \log(x_i)\) | Does not work with zeros, because log(0) = -Inf |
| \(x_i' = \log(x_i + 1)\) | Variables with 0 |
| \(x_i' = \log(x_i + c)\) | General case with a constant \(c\) |
| \(x_i' = \frac{x_i}{\lvert x_i \rvert}\log \lvert x_i \rvert\) | Variables with negative values |
| \(x_i'^\lambda = \log(x_i + \sqrt{x_i^2 + \lambda})\) | Generalized log transformation |
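The variants listed above could be written in base R roughly as follows. The helper names `signed_log` and `glog` are hypothetical, and returning 0 at \(x_i = 0\) in the sign-preserving version is an assumption, since the formula itself is undefined there.

```r
# Sign-preserving log: sign(x) * log|x|; set to 0 at x = 0 (an assumption)
signed_log <- function(x) ifelse(x == 0, 0, sign(x) * log(abs(x)))

# Generalized log with tuning constant lambda
glog <- function(x, lambda = 1) log(x + sqrt(x^2 + lambda))

x <- c(-100, -1, 0, 1, 100)
log1p(c(0, 1, 100))   # log(x + 1): defined at zero
signed_log(x)
glog(x)               # with lambda = 1 this equals asinh(x)
```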

For the general case of \(\log(x_i + c)\), choosing a constant is rather tricky.

The choice of the constant is critically important, especially when you want to do inference. It can dramatically change your model fit (see (Ekwaru and Veugelers 2018) for the independent variable case).
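To see how much the constant can matter, here is a sketch on simulated data (the data-generating process and the candidate constants are assumptions, not taken from the cited paper). The same outcome is regressed on \(\log(x_i + c)\) for several values of \(c\), and the estimated slopes are compared.

```r
set.seed(42)
n <- 500
x <- rpois(n, lambda = 3)        # simulated regressor that contains zeros
y <- 1 + 0.5 * x + rnorm(n)      # simulated outcome

# Slope on log(x + c) for several hypothetical choices of c
consts <- c(0.01, 0.1, 1)
slopes <- sapply(consts, function(c0) coef(lm(y ~ log(x + c0)))[[2]])
setNames(round(slopes, 3), paste0("c = ", consts))
```

The slopes differ across constants because small values of \(c\) push the zeros far out on the log scale, where they act as high-leverage points.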

J. Chen and Roth (2023) show that in causal inference problems, a \(\log\) transformation of values with a meaningful 0 is problematic. But there are solutions for each approach (e.g., DID, IV).

However, assuming that your 0s do not arise from

  • Censoring

  • Measurement errors (stemming from measurement tools)

we can proceed to choose c (it’s okay if your 0s represent really small values).

library(magrittr)  # provides the %>% pipe used below
data(cars)
cars$speed %>% head()
#> [1] 4 4 7 7 8 9

log(cars$speed) %>% head()
#> [1] 1.386294 1.386294 1.945910 1.945910 2.079442 2.197225

# log(x+1)
log1p(cars$speed) %>% head()
#> [1] 1.609438 1.609438 2.079442 2.079442 2.197225 2.302585

13.1.5 Exponential

  • Negatively skewed data

  • Underlying logarithmic trend (e.g., survival, decay)

13.1.6 Power

  • Variables have negatively skewed distribution
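A quick check on simulated data (the Beta sample and the moment-based `skewness` helper are assumptions for illustration) suggests how raising a negatively skewed variable to a power can pull its skewness toward zero.

```r
# Simple moment-based sample skewness (hypothetical helper)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(123)
x <- rbeta(1000, shape1 = 5, shape2 = 1)   # simulated, negatively skewed on (0, 1)

skewness(x)     # clearly negative
skewness(x^3)   # closer to zero after the power transformation
```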

13.1.7 Inverse/Reciprocal

  • Variables have platykurtic distribution

  • Data are positively skewed

  • Ratio data

data(cars)
head(cars$dist)
#> [1]  2 10  4 22 16 10
plot(cars$dist)

plot(1/(cars$dist))

13.1.8 Hyperbolic arcsine

  • Variables with positively skewed distribution

13.1.9 Ordered Quantile Norm

\[ x_i' = \Phi^{-1} \left( \frac{\operatorname{rank}(x_i) - 1/2}{\operatorname{length}(x)} \right) \]
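The formula can be reproduced directly in base R, as sketched below. This is only a hand-rolled version for intuition; the package implementation may differ in details such as tie handling or out-of-sample values.

```r
# Ordered quantile normalization of cars$dist, following the formula above
x <- cars$dist
x_orq <- qnorm((rank(x) - 0.5) / length(x))   # rank() averages ties by default

# The transformation is monotone, so the rank correlation with x is 1
cor(x, x_orq, method = "spearman")
```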

ord_dist <- bestNormalize::orderNorm(cars$dist)
ord_dist
#> orderNorm Transformation with 50 nonmissing obs and ties
#>  - 35 unique values 
#>  - Original quantiles:
#>   0%  25%  50%  75% 100% 
#>    2   26   36   56  120
ord_dist$x.t %>% hist()

13.1.10 Arcsinh

  • Proportion variable (0-1)
cars$dist %>% hist()

# cars$dist %>% MASS::truehist()

as_dist <- bestNormalize::arcsinh_x(cars$dist)
as_dist
#> Standardized asinh(x) Transformation with 50 nonmissing obs.:
#>  Relevant statistics:
#>  - mean (before standardization) = 4.230843 
#>  - sd (before standardization) = 0.7710887
as_dist$x.t %>% hist()

\[ \operatorname{arcsinh}(Y) = \log(\sqrt{1 + Y^2} + Y) \]
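As a sanity check, base R's `asinh()` can be compared against this closed form. The sketch also shows that for large values the transformation is close to \(\log(2Y)\), which is why it behaves like a logarithm while remaining defined at 0. The test values are arbitrary.

```r
y <- c(0, 0.5, 10, 1000)

# asinh() matches the closed form log(sqrt(1 + y^2) + y)
all.equal(asinh(y), log(sqrt(1 + y^2) + y))

# For large y the gap between asinh(y) and log(2 * y) is negligible
asinh(1000) - log(2 * 1000)
```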

| Paper | Interpretation |
|---|---|
| Azoulay, Fons-Rosen, and Zivin (2019) | Elasticity |
| Faber and Gaubert (2019) | Percentage |
| Hjort and Poulsen (2019) | Percentage |
| M. S. Johnson (2020) | Percentage |
| Beerli et al. (2021) | Percentage |
| Norris, Pecenco, and Weaver (2021) | Percentage |
| Berkouwer and Dean (2022) | Percentage |
| Cabral, Cui, and Dworsky (2022) | Elasticity |
| Carranza et al. (2022) | Percentage |
| Mirenda, Mocetti, and Rizzica (2022) | Percentage |

For a simple regression model, \(Y = \beta X\)

When both \(Y\) and \(X\) are transformed, the coefficient estimate represents elasticity, indicating the percentage change in \(Y\) for a 1% change in \(X\).

When only \(Y\) is transformed and \(X\) is in raw form, the coefficient estimate represents the percentage change in \(Y\) for a one-unit change in \(X\).
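A simulated sketch of the elasticity case, with both variables arcsinh-transformed (the data-generating process, including the true elasticity of 0.5, is an assumption chosen for illustration). The values are deliberately large, because the log-like interpretation relies on arcsinh being close to \(\log(2Y)\).

```r
set.seed(2024)
n <- 500
x <- exp(rnorm(n, mean = 5, sd = 1))         # large positive values
y <- 10 * x^0.5 * exp(rnorm(n, sd = 0.1))    # simulated elasticity of 0.5

fit <- lm(asinh(y) ~ asinh(x))
coef(fit)[["asinh(x)"]]   # should be close to the simulated elasticity
```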

13.1.11 Lambert W x F Transformation

Available in the LambertW package.

data(cars)
head(cars$dist)
#> [1]  2 10  4 22 16 10
cars$dist %>% hist()



l_dist <- LambertW::Gaussianize(cars$dist)
# small fix
l_dist %>% hist()

13.1.12 Inverse Hyperbolic Sine (IHS) transformation

\[ \begin{aligned} f(x,\theta) &= \frac{\sinh^{-1} (\theta x)}{\theta} \\ &= \frac{\log(\theta x + (\theta^2 x^2 + 1)^{1/2})}{\theta} \end{aligned} \]
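A direct base R translation of this definition might look like the following, assuming \(\theta > 0\) (the helper name `ihs` is hypothetical). With \(\theta = 1\) it reduces to `asinh()`, and for very small \(\theta\) it is nearly linear in \(x\).

```r
# IHS with scale parameter theta (assumed > 0)
ihs <- function(x, theta = 1) asinh(theta * x) / theta

x <- c(-50, -1, 0, 1, 50)
ihs(x, theta = 1)      # identical to asinh(x)
ihs(x, theta = 1e-4)   # nearly equal to x itself for small theta
```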

13.1.13 Box-Cox Transformation

\[ y^\lambda = \beta x+ \epsilon \]

Used to fix non-linearity in the error terms.

Works well for \(\lambda\) between -3 and 3 (i.e., small transformations).

It can also be applied to independent variables:

\[ x_i'^\lambda = \begin{cases} \frac{x_i^\lambda-1}{\lambda} & \text{if } \lambda \neq 0\\ \log(x_i) & \text{if } \lambda = 0 \end{cases} \]

And the two-parameter version is

\[ x_i' (\lambda_1, \lambda_2) = \begin{cases} \frac{(x_i + \lambda_2)^{\lambda_1}-1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \log(x_i + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases} \]
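Both versions can be sketched with one base R helper, assuming the usual denominator \(\lambda_1\) in the power branch and \(x_i + \lambda_2 > 0\) throughout. The name `box_cox` is hypothetical; setting `lambda2 = 0` gives the one-parameter version.

```r
# Two-parameter Box-Cox; lambda2 = 0 gives the one-parameter version
box_cox <- function(x, lambda1, lambda2 = 0) {
  stopifnot(all(x + lambda2 > 0))   # the transformation needs positive inputs
  if (lambda1 == 0) {
    log(x + lambda2)
  } else {
    ((x + lambda2)^lambda1 - 1) / lambda1
  }
}

x <- c(1, 2, 5, 10)
box_cox(x, lambda1 = 0)                      # log(x)
box_cox(x, lambda1 = 0.5)                    # square-root-like
box_cox(c(0, x), lambda1 = 0, lambda2 = 1)   # the shift makes zeros admissible
```

As \(\lambda_1\) approaches 0, the power branch converges to the log branch, which is why the two cases fit together smoothly.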

More advanced variants are discussed in Manly (1976), Bickel and Doksum (1981), and Box and Cox (1981).

library(MASS)
data(cars)
mod <- lm(speed ~ dist, data = cars)
# check residuals
plot(mod)

# profile log-likelihood over a grid of lambda values
bc <- boxcox(mod, lambda = seq(-3, 3))

# best lambda (maximizes the profile log-likelihood)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
#> [1] 1.242424

# model with best lambda
mod_lambda <- lm(speed^best_lambda ~ dist, data = cars)
plot(mod_lambda)


# geoR::boxcoxfit (set lambda2 = TRUE to also estimate the shift parameter)
two_bc = geoR::boxcoxfit(cars$speed)
two_bc
#> Fitted parameters:
#>    lambda      beta   sigmasq 
#>  1.028798 15.253008 31.935297 
#> 
#> Convergence code returned by optim: 0
plot(two_bc)



# bestNormalize
bc_dist <- bestNormalize::boxcox(cars$dist)
bc_dist
#> Standardized Box Cox Transformation with 50 nonmissing obs.:
#>  Estimated statistics:
#>  - lambda = 0.4950628 
#>  - mean (before standardization) = 10.35636 
#>  - sd (before standardization) = 3.978036
bc_dist$x.t %>% hist()

13.1.14 Yeo-Johnson Transformation

Similar to the Box-Cox transformation (when \(\lambda = 1\)), but allows for negative values

\[ x_i'^\lambda = \begin{cases} \frac{(x_i+1)^\lambda -1}{\lambda} & \text{if } \lambda \neq0, x_i \ge 0 \\ \log(x_i + 1) & \text{if } \lambda = 0, x_i \ge 0 \\ \frac{-[(-x_i+1)^{2-\lambda}-1]}{2 - \lambda} & \text{if } \lambda \neq 2, x_i <0 \\ -\log(-x_i + 1) & \text{if } \lambda = 2, x_i <0 \end{cases} \]
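The four cases translate into base R roughly as follows. This is a hand-written sketch for intuition (the name `yeo_johnson` is hypothetical), not the implementation used by any package.

```r
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative values: Box-Cox applied to x + 1
  out[pos] <- if (lambda != 0) {
    ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    log(x[pos] + 1)
  }
  # Negative values: Box-Cox with power 2 - lambda applied to -x + 1, sign flipped
  out[!pos] <- if (lambda != 2) {
    -((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  } else {
    -log(-x[!pos] + 1)
  }
  out
}

x <- c(-3, -1, 0, 1, 3)
yeo_johnson(x, lambda = 1)   # lambda = 1 leaves the data unchanged
yeo_johnson(x, lambda = 0)   # log(x + 1) for x >= 0, a power branch for x < 0
```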

data(cars)
yj_speed <- bestNormalize::yeojohnson(cars$speed)
yj_speed$x.t %>% hist()

13.1.15 RankGauss

  • Turn values into ranks, then ranks to values under normal distribution.

13.1.16 Summary

Automatically choose the best method to normalize the data (using the bestNormalize package)

bestdist <- bestNormalize::bestNormalize(cars$dist)
bestdist$x.t %>% hist()


boxplot(log10(bestdist$oos_preds), yaxt = "n")

# axis(2, at = log10(c(.1, .5, 1, 2, 5, 10)), 
#      labels = c(.1, .5, 1, 2, 5, 10))

References

Azoulay, Pierre, Christian Fons-Rosen, and Joshua S Graff Zivin. 2019. “Does Science Advance One Funeral at a Time?” American Economic Review 109 (8): 2889–2920.
Bartlett, Maurice S. 1947. “The Use of Transformations.” Biometrics 3 (1): 39–52.
Beerli, Andreas, Jan Ruffner, Michael Siegenthaler, and Giovanni Peri. 2021. “The Abolition of Immigration Restrictions and the Performance of Firms and Workers: Evidence from Switzerland.” American Economic Review 111 (3): 976–1012.
Berkouwer, Susanna B, and Joshua T Dean. 2022. “Credit, Attention, and Externalities in the Adoption of Energy Efficient Technologies by Low-Income Households.” American Economic Review 112 (10): 3291–3330.
Bickel, Peter J, and Kjell A Doksum. 1981. “An Analysis of Transformations Revisited.” Journal of the American Statistical Association 76 (374): 296–311.
Box, George EP, and David R Cox. 1981. An Analysis of Transformations Revisited, Rebutted. University of Wisconsin-Madison. Mathematics Research Center.
Cabral, Marika, Can Cui, and Michael Dworsky. 2022. “The Demand for Insurance and Rationale for a Mandate: Evidence from Workers’ Compensation Insurance.” American Economic Review 112 (5): 1621–68.
Carranza, Eliana, Robert Garlick, Kate Orkin, and Neil Rankin. 2022. “Job Search and Hiring with Limited Information about Workseekers’ Skills.” American Economic Review 112 (11): 3547–83.
Chen, Jiafeng, and Jonathan Roth. 2023. “Logs with Zeros? Some Problems and Solutions.” The Quarterly Journal of Economics, qjad054.
Ekwaru, John Paul, and Paul J Veugelers. 2018. “The Overlooked Importance of Constants Added in Log Transformation of Independent Variables with Zero Values: A Proposed Approach for Determining an Optimal Constant.” Statistics in Biopharmaceutical Research 10 (1): 26–29.
Faber, Benjamin, and Cecile Gaubert. 2019. “Tourism and Economic Development: Evidence from Mexico’s Coastline.” American Economic Review 109 (6): 2245–93.
Hjort, Jonas, and Jonas Poulsen. 2019. “The Arrival of Fast Internet and Employment in Africa.” American Economic Review 109 (3): 1032–79.
Johnson, Matthew S. 2020. “Regulation by Shaming: Deterrence Effects of Publicizing Violations of Workplace Safety and Health Laws.” American Economic Review 110 (6): 1866–1904.
Johnson, N. L. 1949. “Systems of Frequency Curves Generated by Methods of Translation.” Biometrika 36 (1/2): 149. https://doi.org/10.2307/2332539.
Manly, Bryan FJ. 1976. “Exponential Data Transformations.” Journal of the Royal Statistical Society Series D: The Statistician 25 (1): 37–42.
Mirenda, Litterio, Sauro Mocetti, and Lucia Rizzica. 2022. “The Economic Effects of Mafia: Firm Level Evidence.” American Economic Review 112 (8): 2748–73.
Norris, Samuel, Matthew Pecenco, and Jeffrey Weaver. 2021. “The Effects of Parental and Sibling Incarceration: Evidence from Ohio.” American Economic Review 111 (9): 2926–63.