Examples
Tree volume
The dataset records the volume (cubic feet), diameter (inches, measured at 54 inches above the ground) and height (feet) of a sample of 31 black cherry trees in the Allegheny National Forest, Pennsylvania. The data were collected to estimate the volume of a tree (and therefore its timber yield) from its height and diameter. A starting point for estimating volume from these data is the geometric formula for a cylinder:
\[\mathrm{volume} = \pi \times \left(\frac{\mathrm{diameter}}{2}\right)^2 \times \mathrm{height}\]
You can download these data from Moodle under TREES.csv.
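A minimal sketch of loading the data in R. The column names in TREES.csv are not shown in these notes, so the sketch assumes the file matches R's built-in `trees` data frame (columns `Girth`, `Height` and `Volume`, where `Girth` holds the diameter in inches). The commented `read.csv` line is a hypothetical alternative for the downloaded file.

```r
# Assumption: the built-in `trees` data frame matches TREES.csv.
# trees <- read.csv("TREES.csv")  # hypothetical: use this for the downloaded file
data(trees)

str(trees)      # structure: number of rows and the three numeric columns
summary(trees)  # five-number summary and mean of each variable
```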
Exploratory Plots
We can start by exploring the relationship between the two predictors and the response in two separate plots.
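The two exploratory plots could be drawn along these lines. This is a sketch that assumes the built-in `trees` data frame, in which `Girth` is the diameter; the axis labels are illustrative.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches

# Two panels side by side: the response against each predictor
op <- par(mfrow = c(1, 2))
plot(Volume ~ Girth, data = trees,
     xlab = "Diameter (inches)", ylab = "Volume (cubic feet)")
plot(Volume ~ Height, data = trees,
     xlab = "Height (feet)", ylab = "Volume (cubic feet)")
par(op)  # restore the previous graphics settings
```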
Suggested Model
Apart from the suggestion of a slight curvature in the plot of volume versus diameter, the scatterplots indicate that a multiple linear regression model with volume as a response and diameter and height as explanatory variables may be appropriate. This model is shown below (along with the residual plots produced after fitting the model).
\[\mathrm{volume}_i = \beta_0+\beta_1\mathrm{diameter}_i+\gamma\mathrm{height}_i+\epsilon_i\]
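A sketch of fitting this model with `lm()`, assuming the built-in `trees` data frame (with `Girth` as the diameter). The object name `fit_raw` is chosen here for illustration.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches

# Multiple linear regression on the original (untransformed) scale
fit_raw <- lm(Volume ~ Girth + Height, data = trees)

coef(fit_raw)     # estimates of beta_0, beta_1 (diameter) and gamma (height)
summary(fit_raw)  # standard errors, t tests and R-squared
```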
Residual Plots from this multiple linear regression model
While the initial scatterplots looked reasonable, the residuals versus fitted values plot highlights some evidence of curvature. This effect is not very marked, but there is a suggestion that the residuals tend to be positive, negative and then positive again, as we move from left to right in this plot.
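One way to draw the residuals versus fitted values plot is sketched below, again assuming the built-in `trees` data frame. The lowess smoother is an optional addition, included only to make any curvature easier to see.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches
fit_raw <- lm(Volume ~ Girth + Height, data = trees)

# Residuals against fitted values, with a zero reference line
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Optional smoother: a U-shaped curve suggests unmodelled curvature
lines(lowess(fitted(fit_raw), resid(fit_raw)), col = "red")
```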
This curvature (and the underlying geometric model) suggests that a log transformation is appropriate for these data. (The log transformation turns a multiplicative model into an additive, linear one.)
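To make the link explicit, taking natural logs of both sides of the cylinder formula gives an additive expression:

\[\log(\mathrm{volume}) = \log\left(\frac{\pi}{4}\right) + 2\log(\mathrm{diameter}) + \log(\mathrm{height})\]

So if the trees were exact cylinders, the coefficients of log diameter and log height would be 2 and 1. Measuring diameter in inches and height in feet changes only the constant term.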
Linear model with a natural log transformation
\[\log(\mathrm{volume}_i) = \beta_0+\beta_1 \log(\mathrm{diameter}_i)+\gamma \log(\mathrm{height}_i)+\epsilon_i.\]
We now take a log transformation of all the variables and again plot the response against the two predictors.
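The log-scale plots could be produced as follows. This is a sketch assuming the built-in `trees` data frame; `log()` in R is the natural log.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches

# Response against each predictor, all on the natural log scale
op <- par(mfrow = c(1, 2))
plot(log(Volume) ~ log(Girth), data = trees,
     xlab = "log(diameter)", ylab = "log(volume)")
plot(log(Volume) ~ log(Height), data = trees,
     xlab = "log(height)", ylab = "log(volume)")
par(op)  # restore the previous graphics settings
```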
Residual plots from linear model with log transformation:
When we move to the log scale, evidence of curvature in the residual plot disappears. However, the strongest argument for the use of the log transformation in this example is the underlying geometric model outlined earlier.
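A sketch of the corresponding residual plot for the log-scale model, assuming the built-in `trees` data frame. The name `fit_log` is illustrative.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches
fit_log <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)

# Residuals against fitted values on the log scale
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted values (log scale)", ylab = "Residuals")
abline(h = 0, lty = 2)

# Optional smoother: should stay close to zero if the curvature has gone
lines(lowess(fitted(fit_log), resid(fit_log)), col = "red")
```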
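The principal issue with these data is how volume should be predicted from diameter and height. The fitted log-scale model is summarised in the R output below, where Girth is the variable holding the diameter measurements.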
##
## Call:
## lm(formula = log(Volume) ~ log(Height) + log(Girth), data = trees)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.168561 -0.048488  0.002431  0.063637  0.129223
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.63162    0.79979  -8.292 5.06e-09 ***
## log(Height)  1.11712    0.20444   5.464 7.81e-06 ***
## log(Girth)   1.98265    0.07501  26.432  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
## F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
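The summary above can be reproduced with the call shown in the output. The sketch assumes that `trees` is R's built-in data frame; `confint()` is an optional extra that gives interval estimates for the coefficients.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches

# Same formula as in the Call line of the output
fit_log <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)

summary(fit_log)  # coefficient table, residual standard error, R-squared
confint(fit_log)  # 95% confidence intervals for the coefficients
```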
A useful summary is provided by \(R^2\) and \(R^2\) (adj). For the model with both explanatory variables logged, \(R^2\) = 97.8% and \(R^2\) (adj) = 97.6%. Therefore, about 98% of the variability in log volume is explained by its dependence on log diameter and log height; \(R^2\) (adj) is the same measure with a penalty for the number of explanatory variables.
For every one-unit increase in log diameter, log volume increases by 1.98 on average, holding height fixed. Similarly, for every one-unit increase in log height, log volume increases by 1.12 on average, holding diameter fixed. These estimates are close to the values 2 and 1 implied by the cylinder formula.
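Because both sides of the model are logged, the coefficients can also be read as multiplicative effects on the original scale: multiplying diameter by a factor k multiplies the expected volume by roughly k raised to the power of the diameter coefficient. The sketch below works this out for a 10% increase, using the estimates copied from the output; the variable names are illustrative.

```r
# Estimates copied from the coefficient table
b_diameter <- 1.98265  # log(Girth)
b_height   <- 1.11712  # log(Height)

# Multiplicative change in volume for a 10% increase in one predictor,
# with the other predictor held fixed
ratio_diameter <- 1.10^b_diameter
ratio_height   <- 1.10^b_height

# Express the changes as percentages
round(100 * (c(diameter = ratio_diameter, height = ratio_height) - 1), 1)
```

Under this fitted model, a 10% increase in diameter corresponds to roughly a 21% increase in volume, and a 10% increase in height to roughly an 11% increase.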
Always interpret regression coefficients on the scale on which the variables entered the model. Here the variables have been log transformed, so the coefficients should be interpreted on the log scale.
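Predictions from this model are also made on the log scale and need to be back-transformed to give a volume in cubic feet. The sketch below assumes the built-in `trees` data frame, and the tree in `new_tree` is a hypothetical example rather than one of the sampled trees. Exponentiating a predicted log volume estimates the median rather than the mean volume, so treat the result as an approximation.

```r
data(trees)  # assumption: built-in data frame, Girth = diameter in inches
fit_log <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)

# Hypothetical tree: diameter 12 inches, height 75 feet
new_tree <- data.frame(Girth = 12, Height = 75)

# Point prediction and 95% prediction interval on the log scale
pred_log <- predict(fit_log, newdata = new_tree, interval = "prediction")

# Back-transform to cubic feet
pred_volume <- exp(pred_log)
pred_volume
```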