Transformations

We have already seen with the Trees data how a well-chosen transformation can be very effective in bringing data into line with the assumptions of a linear model. The most effective way of choosing one is to consider the science of the process which generated the data, to see if a natural transformation emerges. Consider the Rodent data below, where a plot on the original scale shows a rather odd pattern, with most of the data bunched up at the left-hand side. Mass is presumably approximately proportional to volume, which scales as the cube of length. Speed is a linear measurement (roughly related to the length of the animal’s stride), and so we might expect a geometrical relationship of some kind, with speed proportional to a power of mass, to apply again. The log transformation produces an additive, linear model from a multiplicative one. The plots show that transforming both mass and speed in this way is very effective.
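
To make the step from a multiplicative to an additive model explicit, here is a short sketch of the algebra. The symbols \(\alpha\), \(\beta\) and \(\varepsilon\) are notation introduced here for illustration only. Suppose speed follows a power law in mass with a multiplicative error:

\[
\text{Speed} = \alpha \, \text{Mass}^{\beta} \, \varepsilon .
\]

Taking logs of both sides gives

\[
\log(\text{Speed}) = \log\alpha + \beta \log(\text{Mass}) + \log\varepsilon ,
\]

which is a simple linear regression of log(Speed) on log(Mass) with intercept \(\log\alpha\), slope \(\beta\) and additive error \(\log\varepsilon\). Under the idealised geometric argument, with mass proportional to the cube of a length and speed proportional to that length, \(\beta\) would be 1/3. This is an idealisation, and the slope is to be estimated from the data rather than assumed.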

When scientific guidance is not available, we can simply seek a transformation which makes the assumptions of the model more appropriate. The choice of transformation is one which involves experience and judgement.

Mass and speed of quadrupedal rodents

In an investigation of the relationship between mass and speed in animals, Garland (1983) collected information from published articles on these two variables for a large number of different species. These measurements are given below for a variety of four-footed rodents. (The common names of the species are taken from Corbet & Hill (1986).) Notice that the measurements are not all recorded to the same level of accuracy, since the results have been collated from the work of a number of different scientists.

North American rodent data

Species                                      Mass (kg)   Speed (ms\(^{-1}\))
North American Porcupine                     9           3.2
Woodchuck                                    4           16
Long-clawed ground squirrel                  0.6         36
Long-tailed souslik                          0.6         20
Eastern grey squirrel                        0.55        27
European souslik                             0.5         18
European red squirrel and Persian squirrel   0.4         20
Belding’s ground squirrel                    0.3         13
Rat                                          0.25        9.7
American red squirrel                        0.22        15
Golden Hamster                               0.11        9
Eastern American chipmunk                    0.1         17
Chisel-toothed kangaroo rat                  0.05600     21
Meadow vole                                  0.05000     11
Least chipmunk                               0.04500     16
Merriam’s kangaroo rat                       0.03500     32
Fawn hopping mouse                           0.03500     14
Pine mouse                                   0.030000    6.8
Deer mouse                                   0.030000    9.1
White footed mouse                           0.02500     11
Woodland jumping mouse                       0.02500     8.6
North American meadow jumping mouse          0.01800     8.9
House mouse                                  0.01600     13

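library(ggplot2)    # qplot()
library(gridExtra)  # grid.arrange()

load("week5/Lecture10/rodent.rda")  # provides the data frame `rodent`
a <- qplot(Mass, Speed, data = rodent)            # original scale
b <- qplot(log(Mass), log(Speed), data = rodent)  # both variables logged
grid.arrange(a, b, ncol = 2)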

The plot on the left-hand side shows the data on the original scale and the plot on the right-hand side shows log-transformed speed and mass.

# Untransformed model fitted to all species
# (kept for reference; stat_smooth() below refits and draws this line in red)
coef.lm1a <- as.numeric(coef(lm(Speed ~ Mass, data = rodent)))

# Untransformed model excluding the North American Porcupine (row 1)
coef.lm2a <- coef(lm(Speed ~ Mass, data = rodent[-1, ]))

a <- qplot(rodent$Mass, rodent$Speed, main = "Mass and Speed of Rodents",
           xlab = "Mass", ylab = "Speed") +
  # Label the porcupine and circle its point in red
  annotate("text", x = 6, y = 1.1, label = "North American Porcupine",
           colour = "red") +
  annotate("point", x = rodent$Mass[1], y = rodent$Speed[1],
           colour = "red", shape = 21, size = 5) +
  # Red line: all species; blue line: porcupine excluded
  stat_smooth(method = "lm", se = FALSE, col = "red") +
  geom_abline(intercept = coef.lm2a[1], slope = coef.lm2a[2], color = "blue")

# Log-log model fitted to all species
# (kept for reference; stat_smooth() below refits and draws this line in red)
coef.lm1 <- as.numeric(coef(lm(log(Speed) ~ log(Mass), data = rodent)))

# Log-log model excluding the North American Porcupine (row 1)
coef.lm2 <- coef(lm(log(Speed) ~ log(Mass), data = rodent[-1, ]))

b <- qplot(log(rodent$Mass), log(rodent$Speed),
           main = "log Mass and Speed of Rodents",
           xlab = "log(Mass)", ylab = "log(Speed)") +
  # Label the porcupine and circle its point in red
  annotate("text", x = -0.3, y = 1.1, label = "North American Porcupine",
           colour = "red") +
  annotate("point", x = log(rodent$Mass[1]), y = log(rodent$Speed[1]),
           colour = "red", shape = 21, size = 5) +
  # Red line: all species; blue line: porcupine excluded
  stat_smooth(method = "lm", se = FALSE, col = "red") +
  geom_abline(intercept = coef.lm2[1], slope = coef.lm2[2], color = "blue")

grid.arrange(a, b, ncol = 2)

The plot of Speed against Mass shows an odd pattern, with no obvious linear relationship between the two variables. After taking logs of both variables, the linear relationship is much clearer. However, there is still one point, at the bottom right, that looks odd.

The blue fitted line was estimated excluding the North American Porcupine and the red fitted line was estimated including it. Hopefully you agree that the model fitted to the log-transformed data, excluding the North American Porcupine, provides the most reasonable fit (although we should check the model assumptions at this point).
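
As a rough illustration of what checking the assumptions might look like, the sketch below refits the two log-log models and draws two basic residual plots for the fit without the porcupine. It is one possible approach rather than a prescribed procedure. The data are typed in from the table above so that the code runs without the rodent.rda file; the object names rodent_tab, fit_all and fit_drop are hypothetical and introduced here only for illustration.

```r
# Data typed in from the table above. `rodent_tab` is a hypothetical name,
# chosen so that any `rodent` object loaded from rodent.rda is left untouched.
rodent_tab <- data.frame(
  Mass  = c(9, 4, 0.6, 0.6, 0.55, 0.5, 0.4, 0.3, 0.25, 0.22, 0.11, 0.1,
            0.056, 0.05, 0.045, 0.035, 0.035, 0.03, 0.03, 0.025, 0.025,
            0.018, 0.016),
  Speed = c(3.2, 16, 36, 20, 27, 18, 20, 13, 9.7, 15, 9, 17,
            21, 11, 16, 32, 14, 6.8, 9.1, 11, 8.6, 8.9, 13)
)

# Log-log fits with and without the North American Porcupine (row 1)
fit_all  <- lm(log(Speed) ~ log(Mass), data = rodent_tab)
fit_drop <- lm(log(Speed) ~ log(Mass), data = rodent_tab[-1, ])

# Compare the estimated intercepts and slopes of the two fits
print(rbind(all = coef(fit_all), without_porcupine = coef(fit_drop)))

# Two basic residual checks for the fit without the porcupine
op <- par(mfrow = c(1, 2))
plot(fitted(fit_drop), resid(fit_drop),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
qqnorm(resid(fit_drop), main = "Normal Q-Q plot of residuals")
qqline(resid(fit_drop))
par(op)
```

In the residuals against fitted values plot we would hope to see a patternless scatter around zero with roughly constant spread, and in the Q-Q plot we would hope to see the points lying close to the reference line. With only 22 observations, some irregularity is to be expected.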