6.11 Box-Cox Transformations

6.11.1 Review: transformations

We’ve previously discussed transforming variables as a way to address possible violations of the assumptions and conditions for inference, particularly regression inference.

Transforming variables can’t help you with non-independent observations, of course. But it can help with meeting other conditions: linearity, constant variance, Normality of errors. Sometimes it can even help address outliers – points that look like outliers before a transformation may not be as big a deal afterwards. If all goes well, a transformation may be able to fix multiple issues at once!

But this leaves us with the question: what transformation should we use?

Thus far, the answer has been pretty much “try stuff and see.” And, I guess, “don’t do anything super weird without a good reason.” Both of these are still valid, and good advice for statistical analysis in general. But sometimes a little more guidance is nice.

There are some mnemonics like the “ladder of powers” and “Tukey’s circle” for suggesting which transformations to try. I personally have never found them all that useful, but if you do, great!

What we’re about to explore is another tool for suggesting transformations to use, and what’s cool is that this one has a mathematical basis. We won’t go into what that basis is, but it’s nice to know there is one. This tool is called the Box-Cox transformation.

6.11.2 Candidate transformations for Box-Cox

There are many possible Box-Cox transformations, but they all share some specific characteristics.

First of all, Box-Cox transformation is about transforming \(y\), the response variable. If you are doing a multiple regression and there’s one particular predictor that’s weird, Box-Cox isn’t necessarily the right tool for the job – that’s one reason why checking added-variable plots is so helpful.

Second: Box-Cox transformations are power transformations. There are some mathematical details, but the key element of a Box-Cox transformation is raising the \(y\) values to some power. We call this power \(\lambda\); “lambda” is the Greek letter “L.”

A Box-Cox transformation with \(\lambda = 2\) is equivalent to squaring \(y\); \(\lambda = 1/2\) would be the square root of \(y\); and so on. The one exception to this rule is \(\lambda = 0\). If \(\lambda=0\), we don’t raise \(y\) to the 0 power; instead we take the natural log of \(y\). So you can see that basically all your favorite transformations are Box-Cox transformations!
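
The “mathematical details” can be sketched in a few lines of R. This is a minimal sketch of the usual textbook form of the transformation, not the code any particular package uses; the function name box_cox_transform is made up for illustration. In this form the transformed value is \((y^\lambda - 1)/\lambda\) when \(\lambda \neq 0\), and \(\log(y)\) when \(\lambda = 0\). The subtracting and dividing just shift and rescale, so the shape of the transformation still comes from the power.

```r
# Hypothetical helper (not from any package): the usual textbook form of
# the Box-Cox transformation. It assumes every value of y is positive.
box_cox_transform <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) {
    log(y)                      # the special case: natural log
  } else {
    (y^lambda - 1) / lambda     # a shifted, rescaled power of y
  }
}

y <- c(1, 4, 9, 16)
box_cox_transform(y, 0.5)   # built from sqrt(y)
box_cox_transform(y, 0)     # same as log(y)
box_cox_transform(y, -1)    # built from 1/y
```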

6.11.3 The Box-Cox plot

R has some functions for trying out various Box-Cox transformations, such as boxCox() from the car package. The really clever thing about these functions is this: they actually assess how useful each transformation would be, mathematically.

How exactly that works is a topic for another course. The short version is that you get a mathematical “score” that describes how well the data points meet the conditions after you apply the transformation. Box-Cox’s primary goal is actually to obtain constant variance of the residuals, but as we’ve noted before, this often results in improving linearity and Normality as well.

This “score” is a mathematical quantity called the log-likelihood, which is why we represent the Box-Cox parameter with the letter L. I’ll chat with you about it one on one if you want to know more, but likelihood is definitely outside the scope of this course.

The usual way to look at this is to create a plot. Let’s take the mtcars data as an example, and look at the gas mileage of different cars (as the response) vs. their engine displacement (as the predictor). A scatterplot of the raw data shows something of a curved relationship:

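library(tidyverse)  # provides ggplot2 and the %>% pipe

# Scatterplot of gas mileage against engine displacement
mtcars %>%
  ggplot() +
  geom_point(aes(x = disp, y = mpg))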

…which becomes even more accentuated when we look at a plot of residuals vs. fitted values:

# Fit the simple regression, then plot residuals vs. fitted values
mpg_lm = lm(mpg ~ disp, data = mtcars)
mpg_lm %>% plot(which = 1)

So we ask R to make us a Box-Cox plot. It looks like this:
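
The call that produces the plot is not shown here. A minimal sketch, assuming the same boxCox() function from the car package that appears later in this section, applied to the same regression of mpg on disp:

```r
library(car)  # assumed source of boxCox(), as used later in this section

# Refit the model so this chunk runs on its own
mpg_lm <- lm(mpg ~ disp, data = mtcars)

# Draw the Box-Cox plot: log-likelihood against lambda
boxCox(mpg_lm)
```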


This is a bit new and different, so let’s unpack.

The y axis here is the “log-likelihood” – that’s the measure of how well the data points fit the conditions that I mentioned earlier, and which I will not define for you now. Don’t worry about the actual numbers on that axis. You’re just interested in where the curve is higher or lower.

The x axis is \(\lambda\) – the parameter of the Box-Cox transformation. And the curve shows the relationship between them: at each value of \(\lambda\), the height of this curve is the log-likelihood when you apply a Box-Cox transformation with that \(\lambda\). The \(\lambda\)s where the curve is highest correspond to the transformations that work the best.

Now it is important to note that Box-Cox is not a precision instrument. That’s why R shows you those dashed lines: the center vertical line shows the actual maximum, but the two lines on either side of it show you a range where the log-likelihood is almost as good. So any transformation in that range (or even really near it) will work more or less equally well.
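
If you’d rather read the numbers than eyeball the plot, here is a minimal sketch. It assumes the boxcox() function from the MASS package (which car’s boxCox() is modeled on), called with plotit = FALSE so that it returns the curve instead of drawing it. The cutoff is the usual 95% confidence rule that the dashed lines are based on; the names bc, best_lambda, and good_range are just for illustration.

```r
# Assumption: MASS::boxcox() is used here to get the curve as numbers
mpg_lm <- lm(mpg ~ disp, data = mtcars)

# Evaluate the log-likelihood on a grid of lambda values without plotting
bc <- MASS::boxcox(mpg_lm, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# The lambda where the curve is highest
best_lambda <- bc$x[which.max(bc$y)]

# Lambdas whose log-likelihood is within the 95% cutoff of the maximum
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
good_range <- range(bc$x[bc$y >= cutoff])

best_lambda
good_range
```

Treat the endpoints of good_range as rough guides rather than hard limits, in the same spirit as the plot.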

You should fall back on thinking about context, and the “don’t do anything weird without a reason” rule: just because the “best” \(\lambda\) is -0.19 or whatever doesn’t mean you should use the transformation \(y^{-0.19}\). Look for sensible transformations that are in or near the range of good values.

What do we see in this example? Well, 0 is in that good range; remember, the Box-Cox for \(\lambda=0\) is the log transformation, not \(y^0\). So we could take a log. -1 is also pretty close, so we could consider taking \(y^{-1}\), which is \(1/y\). \(1/\sqrt{y}\) would also work – that’s the transformation for \(\lambda = -0.5\). All things considered, I’d probably go for \(1/y\), because that’s actually directly interpretable in context: you just have to think of gas mileage as “gallons per mile” instead of “miles per gallon.” It’s maybe not quite as effective as the log, but if my goal is to explain gas mileage (not just predict it) then keeping things interpretable is probably worth it.
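
Whichever candidate you lean toward, it’s worth refitting and checking the residual plot again. A minimal sketch, assuming the same mtcars regression as above; the model names log_lm and inv_lm are just for illustration:

```r
# Refit with two of the candidate transformations of the response
log_lm <- lm(log(mpg) ~ disp, data = mtcars)     # lambda = 0
inv_lm <- lm(I(1 / mpg) ~ disp, data = mtcars)   # lambda = -1, gallons per mile

# Residuals vs. fitted values for each: look for less curvature than before
plot(log_lm, which = 1)
plot(inv_lm, which = 1)
```

Note the I() wrapper in the second formula: inside a model formula, it tells R to treat 1 / mpg as ordinary arithmetic.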

One final note is to keep an eye out for \(\lambda=1\). That corresponds to \(y^1\)…which is to say, \(y\). If doing nothing seems to work about as well as any other transformation, then you should strongly consider doing nothing – your results will be more interpretable. For example, here’s some simulated data that already has a linear relationship with constant variance:
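
The code that generated the simulated data is not shown here. As a stand-in, here is a minimal sketch that builds a data set with a linear relationship and constant variance. The names strong_dat, x, and y match the Box-Cox call below; the seed, sample size, and coefficients are made up for illustration.

```r
set.seed(42)  # arbitrary seed so the example is reproducible

n <- 100
x <- runif(n, min = 0, max = 10)

# Linear in x with constant-variance Normal errors; the intercept keeps
# y positive, which Box-Cox requires
y <- 20 + 2 * x + rnorm(n, mean = 0, sd = 1)

strong_dat <- data.frame(x = x, y = y)

plot(y ~ x, data = strong_dat)
```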

And here’s what happens when we make a Box-Cox plot:

library(car)  # provides boxCox()
boxCox(lm(y ~ x, data = strong_dat))

Because 1 is in the range, we know that there’s no power transformation of \(y\) that would make a real improvement in how well we meet the conditions.

Response moment: Why isn’t the Box-Cox transformation for \(\lambda=0\) raising \(y\) to the 0 power?