3.4 Transform a numeric variable
You can use any R arithmetic function to transform a continuous variable. For example, suppose you want to create a variable that is the square of another variable (e.g., to use a quadratic polynomial in a linear regression).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 2.0 6.0 8.4 10.0 69.0 15
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 4 36 157 100 4761 15
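As a minimal sketch of the base R approach, the code below squares a variable and summarizes it before and after. The data frame toy_dat, its made-up values, and the new variable name Yrs_From_Dx_sq are assumptions for illustration only; they are not the dataset that produced the output above.

```r
# Hypothetical toy data (made-up values, one missing)
toy_dat <- data.frame(Yrs_From_Dx = c(0, 2, 6, 10, NA))

# Create the squared variable with the ^ operator
toy_dat$Yrs_From_Dx_sq <- toy_dat$Yrs_From_Dx^2

# Compare the original and transformed variables
summary(toy_dat$Yrs_From_Dx)
summary(toy_dat$Yrs_From_Dx_sq)
```

Missing values stay missing after the transformation, so the NA count is the same in both summaries.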
The tidyverse code to transform a variable uses the mutate() function.
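A minimal sketch of squaring a variable with mutate() might look like the following. It assumes the dplyr package is installed; the data frame toy_dat, its values, and the name Yrs_From_Dx_sq are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (made-up values, one missing)
toy_dat <- data.frame(Yrs_From_Dx = c(0, 2, 6, 10, NA))

# mutate() adds the new column and returns the modified data frame,
# so assign the result back to keep the change
toy_dat <- toy_dat %>%
  mutate(Yrs_From_Dx_sq = Yrs_From_Dx^2)

toy_dat
```

Inside mutate() you refer to variables by name, without the toy_dat$ prefix.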
Another common transformation is the log-transformation (usually the natural logarithm) for continuous variables that are skewed to the right. Many variables that represent time or size have a highly skewed distribution. When using a log-transformation, always check for zeros and negative numbers prior to transformation: log(0) is undefined and R will return a value of -Inf; log(-1) (or log of any negative number) is also undefined and will return a value of NaN along with a warning. You do not want a transformation that turns legitimate values into infinite or missing values.
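One way to run this check is to count the zeros and negative values before transforming. The sketch below uses a hypothetical vector x with made-up values.

```r
# Hypothetical variable (made-up values, one missing)
x <- c(0, 0.5, 3, 12, NA)

# Count zeros and negative values, ignoring missing values
n_zero <- sum(x == 0, na.rm = TRUE)
n_neg  <- sum(x < 0, na.rm = TRUE)
n_zero  # 1
n_neg   # 0

# What R returns for problematic inputs
log(0)                     # -Inf
suppressWarnings(log(-1))  # NaN (the warning is suppressed here)
```

If either count is greater than zero, log(x) alone will produce -Inf or NaN values.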
If the original variable x is always positive then you can use log(x) directly. If x has some 0 values, then add a small number before transforming. Start with log(x + 1), but sometimes a larger or smaller number works better. Decide by looking at a histogram after the transformation. If the bar on the far left is far away from the other bars, consider adding a different number; for example, log(x + 10) or log(x + 0.1). If x has many zero values, or any negative values, a log-transformation might not be the best choice.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 2.0 6.0 8.4 10.0 69.0 15
mydat$log_Yrs_From_Dx <- log(mydat$Yrs_From_Dx + 1)
# NOTE: log() is the natural log
hist(mydat$log_Yrs_From_Dx) # More symmetric, bar at the left not far away
In this case log(x + 1) worked, but in other cases you may need a different number.
# If you had added too small a number,
# the bar to the left would be too far away from the rest of the data
hist(log(mydat$Yrs_From_Dx + 0.1))
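To see numerically why the added number matters, the sketch below computes, for several candidate numbers, the distance on the log scale between the zeros and the smallest positive value. This is only a rough companion to the histogram, not a replacement for it; the vector x and the candidate values in offsets are made up for illustration.

```r
# Hypothetical variable with a zero (made-up values)
x <- c(0, 1, 2, 6, 10, 69)

# Candidate numbers to add before taking the log
offsets <- c(0.1, 1, 10)

# Smallest positive value of x
min_pos <- min(x[x > 0])

# Gap on the log scale between the zeros and the smallest positive value
gap <- log(min_pos + offsets) - log(0 + offsets)
names(gap) <- offsets
round(gap, 2)
```

The gap is largest for 0.1 and shrinks as the added number grows, which is why too small a number leaves the left-most bar isolated from the rest of the data.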
The tidyverse code for a log-transformation:
# Very skewed
mydat_tibble %>%
  ggplot(aes(x = Yrs_From_Dx)) +
  geom_histogram(bins = 10, color = "black", fill = "white") +
  labs(y = "Frequency", x = "Years from Diagnosis")
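The transformation itself can then be done with mutate() and plotted the same way. The following is a sketch under stated assumptions, not necessarily the original code: it uses a hypothetical data frame toy_dat with made-up values in place of mydat_tibble, and assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (made-up values, one missing)
toy_dat <- data.frame(Yrs_From_Dx = c(0, 1, 2, 2, 6, 10, 25, 69, NA))

# Add 1 before taking the natural log because the variable contains zeros
toy_dat <- toy_dat %>%
  mutate(log_Yrs_From_Dx = log(Yrs_From_Dx + 1))

# Histogram of the transformed variable (missing values dropped first)
p <- toy_dat %>%
  filter(!is.na(log_Yrs_From_Dx)) %>%
  ggplot(aes(x = log_Yrs_From_Dx)) +
  geom_histogram(bins = 10, color = "black", fill = "white") +
  labs(y = "Frequency", x = "log(Years from Diagnosis + 1)")
p
```

Base R also provides log1p(x), which computes log(1 + x) and could be used in place of log(x + 1).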