3.4 Transform a numeric variable
You can use any R arithmetic function to transform a continuous variable. For example, suppose you want to create a variable that is the square of another variable (e.g., to use a quadratic polynomial in a linear regression).
mydat$Yrs_From_Dx_2 <- mydat$Yrs_From_Dx^2
# Before vs. after
summary(mydat$Yrs_From_Dx)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 2.0 6.0 8.4 10.0 69.0 15
summary(mydat$Yrs_From_Dx_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 4 36 157 100 4761 15
plot(mydat$Yrs_From_Dx, mydat$Yrs_From_Dx_2)
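To illustrate the regression use mentioned above, here is a minimal sketch with simulated data. The data frame `sim` and its columns `x`, `y`, and `x_2` are hypothetical and are not part of `mydat`. The sketch shows that creating the squared variable first should give the same fit as squaring inside the model formula with `I()`.

```r
# Hypothetical simulated data (not from mydat), for illustration only
set.seed(1)
sim <- data.frame(x = runif(100, 0, 10))
sim$y <- 1 + 2 * sim$x - 0.3 * sim$x^2 + rnorm(100)

# Create the squared term as a new variable, as done above for Yrs_From_Dx
sim$x_2 <- sim$x^2

# Fit a quadratic polynomial using the new variable
fit1 <- lm(y ~ x + x_2, data = sim)

# Alternative: square x inside the formula using I()
fit2 <- lm(y ~ x + I(x^2), data = sim)

# Compare the coefficient estimates from the two approaches
all.equal(unname(coef(fit1)), unname(coef(fit2)))
```

Creating the variable explicitly keeps it available for summaries and plots, while `I()` avoids adding a column to the dataset. Which is more convenient depends on your workflow.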
The tidyverse code to transform a variable uses the mutate() function.
mydat_tibble <- mydat_tibble %>%
  mutate(Yrs_From_Dx_2 = Yrs_From_Dx^2)
Another common transformation is the log-transformation (usually the natural logarithm) for continuous variables that are skewed to the right. Many variables that represent time or size have a highly skewed distribution. When using a log-transformation, always check for zeros and negative numbers prior to transformation: log(0) is undefined and R will return a value of -Inf, and log(-1) (or log of any negative number) is also undefined and will return a value of NaN along with a warning. You do not want a transformation that turns legitimate values into infinite or missing values.
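One way to carry out this check is sketched below. The vector `x` is a hypothetical example, not a variable from `mydat`, and the guard at the end is just one possible approach.

```r
# Hypothetical example vector (not from mydat)
x <- c(0, 0.5, 3, 12, NA, -2)

# Count problem values before transforming
sum(x == 0, na.rm = TRUE)  # number of zeros
sum(x < 0, na.rm = TRUE)   # number of negative values

# What happens if you transform anyway
log(0)                     # -Inf
suppressWarnings(log(-2))  # NaN (R issues a warning unless it is suppressed)

# A simple guard: only transform if all non-missing values are positive
if (all(x > 0, na.rm = TRUE)) {
  log_x <- log(x)
} else {
  message("x has zero or negative values; log(x) would create -Inf or NaN")
}
```

Note that `na.rm = TRUE` is needed here because comparisons involving `NA` return `NA`.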
If the original variable x is always positive then you can use log(x) directly. If x has some 0 values, then add a small number before transforming. Start with log(x + 1), but sometimes a larger or smaller number works better. Decide by looking at a histogram after the transformation. If the bar on the far left is far away from the other bars, consider adding a different number; for example, log(x + 10) or log(x + 0.1). If x has many zero values, or any negative values, a log-transformation might not be the best choice.
hist(mydat$Yrs_From_Dx) # Very skewed
summary(mydat$Yrs_From_Dx) # There are zeros, so add 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 2.0 6.0 8.4 10.0 69.0 15
mydat$log_Yrs_From_Dx <- log(mydat$Yrs_From_Dx + 1)
# NOTE: log() is the natural log
hist(mydat$log_Yrs_From_Dx) # More symmetric, bar at the left not far away
In this case log(x + 1) worked, but in other cases you may need a different number.
# If you had added too large a number, still too skewed
hist(log(mydat$Yrs_From_Dx + 10))
# If you had added too small a number,
# the bar to the left would be too far away from the rest of the data
hist(log(mydat$Yrs_From_Dx + 0.1))
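As a supplement to the histograms (not a replacement for them), you could also compare candidate constants numerically. The sketch below uses simulated data and a hand-written skewness helper; `yrs`, `skew()`, and `constants` are hypothetical names introduced for illustration and do not come from `mydat` or from any package. A skewness near 0 suggests a roughly symmetric distribution.

```r
# Hypothetical simulated right-skewed variable with some zeros (not from mydat)
set.seed(42)
yrs <- round(rexp(500, rate = 1 / 8))

# Hypothetical helper: sample skewness (third standardized moment)
skew <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Skewness before transformation
skew(yrs)

# Skewness after log(x + k) for several candidate constants k
constants <- c(0.1, 1, 10)
sapply(constants, function(k) skew(log(yrs + k)))

# Side note: base R's log1p(x) computes log(1 + x)
all.equal(log(yrs + 1), log1p(yrs))
```

A single number cannot show an isolated bar at the far left, so still look at the histogram before settling on a constant.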
The tidyverse code for a log-transformation:
# Very skewed
mydat_tibble %>%
  ggplot(aes(x = Yrs_From_Dx)) +
  geom_histogram(bins = 10, color="black", fill="white") +
  labs(y = "Frequency", x = "Years from Diagnosis")
summary(mydat$Yrs_From_Dx)
# There are zeros, so add 1
mydat_tibble <- mydat_tibble %>%
  mutate(log_Yrs_From_Dx = log(Yrs_From_Dx + 1))
# NOTE: log() is the natural log
# More symmetric, bar at the left not far away
mydat_tibble %>%
  ggplot(aes(x = log_Yrs_From_Dx)) +
  geom_histogram(bins = 10, color="black", fill="white") +
  labs(y = "Frequency", x = "log(Years from Diagnosis + 1)")