Tree volume

Consider the trees data we have previously analysed.

trees<-read.csv("week8/Lecture15/TREES.csv")

We want to find the variable(s) that is (are) 'most' significant, following the steps outlined above.

We first fit the full model

\[E(Y) = \alpha+\beta x_1+\gamma x_2\] where \(Y\) denotes log (volume), \(x_1\) denotes log (diameter), \(x_2\) denotes log (height).
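On the original scale, this log-log model corresponds to a multiplicative (power-law) relationship. As a rough sketch, ignoring the error term and writing \(V\), \(D\) and \(H\) (symbols introduced here for illustration only) for volume, diameter and height:

\[\log V = \alpha + \beta \log D + \gamma \log H \quad\Longleftrightarrow\quad V = e^{\alpha} D^{\beta} H^{\gamma}\]

If a tree trunk were shaped roughly like a cylinder or a cone, its volume would be proportional to \(D^2 H\), so values of \(\beta\) near 2 and \(\gamma\) near 1 would be plausible.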

trees.llm=lm(log(Volume)~log(Diameter)+log(Height),data=trees)
summary(trees.llm)
## 
## Call:
## lm(formula = log(Volume) ~ log(Diameter) + log(Height), data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168561 -0.048488  0.002431  0.063637  0.129223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.63162    0.79979  -8.292 5.06e-09 ***
## log(Diameter)  1.98265    0.07501  26.432  < 2e-16 ***
## log(Height)    1.11712    0.20444   5.464 7.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared:  0.9777, Adjusted R-squared:  0.9761 
## F-statistic: 613.2 on 2 and 28 DF,  p-value: < 2.2e-16
confint(trees.llm)
##                   2.5 %    97.5 %
## (Intercept)   -8.269912 -4.993322
## log(Diameter)  1.828998  2.136302
## log(Height)    0.698353  1.535894

The 95% C.I.s for \(\beta\) and \(\gamma\) are (1.83, 2.14) and (0.70, 1.54) respectively. Neither interval contains zero, so these C.I.s lead us to retain the full model, with terms for both log (diameter) and log (height). The \(R^2\) and \(R^2\)(adj) for the full model are 97.8% and 97.6% respectively. To verify our initial analysis we should look at the smaller models (i.e. those with only one variable).
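The intervals reported by confint() can be reproduced by hand as estimate ± t-quantile × standard error, using the residual degrees of freedom. The sketch below is illustrative only: it assumes that R's built-in `trees` data set (with its `Girth` column renamed to `Diameter`) can stand in for TREES.csv, and the names `trees_demo`, `fit` and `ci_manual` are introduced here and are not part of the original analysis.

```r
# Assumption: the built-in datasets::trees stands in for TREES.csv,
# with Girth renamed to Diameter to match the column names used above.
trees_demo <- datasets::trees
names(trees_demo)[names(trees_demo) == "Girth"] <- "Diameter"

fit <- lm(log(Volume) ~ log(Diameter) + log(Height), data = trees_demo)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% critical value of the t distribution on the residual d.f.
tcrit <- qt(0.975, df = df.residual(fit))

# Manual 95% confidence intervals: estimate -/+ tcrit * standard error
ci_manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci_manual

# Should agree with the built-in calculation
confint(fit)
```

An interval that excludes zero corresponds to a two-sided t-test that is significant at the 5% level, which is why both terms are retained here.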

The regression output from the models fitted with each explanatory variable individually is displayed below.

Regression Analysis: log volume versus log diameter

summary(lm(log(Volume)~log(Diameter),data=trees))
## 
## Call:
## lm(formula = log(Volume) ~ log(Diameter), data = trees)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.205999 -0.068702  0.001011  0.072585  0.247963 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.35332    0.23066  -10.20 4.18e-11 ***
## log(Diameter)  2.19997    0.08983   24.49  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.115 on 29 degrees of freedom
## Multiple R-squared:  0.9539, Adjusted R-squared:  0.9523 
## F-statistic: 599.7 on 1 and 29 DF,  p-value: < 2.2e-16

Regression Analysis: log volume versus log height

summary(lm(log(Volume)~log(Height),data=trees))
## 
## Call:
## lm(formula = log(Volume) ~ log(Height), data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65691 -0.27917 -0.08039  0.42193  0.61252 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.9587     3.7553  -3.717 0.000857 ***
## log(Height)   3.9821     0.8677   4.589 7.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4074 on 29 degrees of freedom
## Multiple R-squared:  0.4207, Adjusted R-squared:  0.4008 
## F-statistic: 21.06 on 1 and 29 DF,  p-value: 7.928e-05

For the models with log (diameter) alone and log (height) alone the \(R^2\) values are 95.4% and 42.1% respectively. Although log (height) is significantly related to log (volume), on its own it explains far less of the variation than log (diameter), and adding it to the model that already contains log (diameter) raises \(R^2\) only from 95.4% to 97.8%. We might therefore consider a simpler model with log (diameter) alone, but based on the above analysis we will keep both variables as predictors.
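One way to put the three fits side by side is to collect their \(R^2\) values and to compare the diameter-only model with the full model using a partial F-test. This is a minimal sketch rather than part of the original analysis: it again assumes that the built-in `trees` data (with `Girth` renamed to `Diameter`) can stand in for TREES.csv, and the object names are hypothetical.

```r
# Assumption: the built-in datasets::trees stands in for TREES.csv,
# with Girth renamed to Diameter to match the column names used above.
trees_demo <- datasets::trees
names(trees_demo)[names(trees_demo) == "Girth"] <- "Diameter"

fit_full   <- lm(log(Volume) ~ log(Diameter) + log(Height), data = trees_demo)
fit_diam   <- lm(log(Volume) ~ log(Diameter), data = trees_demo)
fit_height <- lm(log(Volume) ~ log(Height), data = trees_demo)

# R-squared (as a percentage) for each of the three models
models <- list(full = fit_full, diameter = fit_diam, height = fit_height)
r2 <- sapply(models, function(m) summary(m)$r.squared)
round(100 * r2, 1)

# Partial F-test: does adding log(Height) improve on the diameter-only model?
model_cmp <- anova(fit_diam, fit_full)
model_cmp
```

Because only one term is added, the F-statistic in this comparison equals the square of the t-value for log (height) in the full model, so the test leads to the same conclusion as the coefficient table.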