12 Multiple Regression

In this short chapter, we present an example of linear regression with multiple explanatory variables.

12.0.1 Example One: cystfibr

Consider the data in cystfibr in the library ISwR. Our treatment of this data set follows the treatment given in Dalgaard’s book. Suppose we wish to model lung capacity in terms of the other variables, based on this data. Let’s start by plotting pemax (the measure of lung capacity) against each of the other variables:

library(ggplot2)
library(tidyr)  # for gather()
cystfibr <- ISwR::cystfibr
longCyst <- gather(cystfibr, key = variable, value = value, -pemax)
ggplot(longCyst, aes(x = value, y = pemax)) + geom_point() +
  facet_wrap(facets = ~variable, scales = "free_x")

At first glance, pemax appears correlated with age, weight and height. These variables are probably also strongly correlated with one another, since this is data dealing with children and adolescents. Let’s check:

cor(cystfibr$age, cystfibr$weight)
## [1] 0.9058675
cor(cystfibr$age, cystfibr$height)
## [1] 0.926052
cor(cystfibr$height, cystfibr$weight)
## [1] 0.9206953

So, yes, they are highly correlated. We are going to run multiple regression on all of the variables at once, and see what we get:

my.mod <- lm(pemax~., data = cystfibr)
summary(my.mod)
## 
## Call:
## lm(formula = pemax ~ ., data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.338 -11.532   1.081  13.386  33.405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.0582   225.8912   0.779    0.448
## age          -2.5420     4.8017  -0.529    0.604
## sex          -3.7368    15.4598  -0.242    0.812
## height       -0.4463     0.9034  -0.494    0.628
## weight        2.9928     2.0080   1.490    0.157
## bmp          -1.7449     1.1552  -1.510    0.152
## fev1          1.0807     1.0809   1.000    0.333
## rv            0.1970     0.1962   1.004    0.331
## frc          -0.3084     0.4924  -0.626    0.540
## tlc           0.1886     0.4997   0.377    0.711
## 
## Residual standard error: 25.47 on 15 degrees of freedom
## Multiple R-squared:  0.6373, Adjusted R-squared:  0.4197 
## F-statistic: 2.929 on 9 and 15 DF,  p-value: 0.03195

A couple of things to notice: first, the overall p-value is 0.03195, which indicates that the variables taken together do have statistically significant predictive value for pemax. However, looking at the list, no single variable seems necessary. The reason is that each variable’s p-value tests what happens when we remove only that variable. So, for example, we can remove weight without upsetting the model too much. Of course, since height is strongly correlated with weight, it may just be that height will serve as a proxy for weight in the event that weight is removed!
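These single-term deletion tests can also be obtained directly with drop1(), which reports an F test for removing each variable on its own; the resulting p-values match the t tests in the summary table above.

```r
# F tests for dropping each variable individually from the full model;
# these match the t tests reported by summary(my.mod)
drop1(my.mod, test = "F")
```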

Let’s look at an anova table:

anova(my.mod)
## Analysis of Variance Table
## 
## Response: pemax
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## age        1 10098.5 10098.5 15.5661 0.001296 **
## sex        1   955.4   955.4  1.4727 0.243680   
## height     1   155.0   155.0  0.2389 0.632089   
## weight     1   632.3   632.3  0.9747 0.339170   
## bmp        1  2862.2  2862.2  4.4119 0.053010 . 
## fev1       1  1549.1  1549.1  2.3878 0.143120   
## rv         1   561.9   561.9  0.8662 0.366757   
## frc        1   194.6   194.6  0.2999 0.592007   
## tlc        1    92.4    92.4  0.1424 0.711160   
## Residuals 15  9731.2   648.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We get a different result, because in this case the table tells us the significance of adding the variables one at a time, in the order listed. For example, age is significant all by itself. Once age is added, sex, height, etc. are not significant. Note that bmp is close to significant. However, we are performing 9 tests, so we should expect to see a spuriously small p-value relatively often.
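Since we are effectively performing nine tests at once, one rough safeguard is to adjust the p-values for multiple comparisons. A sketch using a Bonferroni correction on the p-values from the table above:

```r
# sequential p-values copied from the anova table above
pvals <- c(0.001296, 0.243680, 0.632089, 0.339170, 0.053010,
           0.143120, 0.366757, 0.592007, 0.711160)
p.adjust(pvals, method = "bonferroni")
```

After adjustment, only the age p-value remains below 0.05; bmp does not.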

One way to test this further is to see whether it is permissible to remove all of the remaining variables once age is included. We do that using anova again.

my.mod2 <- lm(pemax~age, data = cystfibr)
anova(my.mod2, my.mod) #Only use when models are nested! R doesn't check
## Analysis of Variance Table
## 
## Model 1: pemax ~ age
## Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + 
##     tlc
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     23 16734.2                           
## 2     15  9731.2  8    7002.9 1.3493 0.2936

As you can see, the p-value is 0.2936, which is not significant. Therefore, we can reasonably conclude that once age is included in the model, that is about as good as we can do. Note: this does NOT mean that age is necessarily the best predictor of pemax. It was included in the final model largely because it was listed first in the data set!

Let’s see what happens if we change the order:

my.mod <- lm(pemax~height+age+weight+sex+bmp+fev1+rv+frc+tlc, data = cystfibr)
summary(my.mod)
## 
## Call:
## lm(formula = pemax ~ height + age + weight + sex + bmp + fev1 + 
##     rv + frc + tlc, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.338 -11.532   1.081  13.386  33.405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.0582   225.8912   0.779    0.448
## height       -0.4463     0.9034  -0.494    0.628
## age          -2.5420     4.8017  -0.529    0.604
## weight        2.9928     2.0080   1.490    0.157
## sex          -3.7368    15.4598  -0.242    0.812
## bmp          -1.7449     1.1552  -1.510    0.152
## fev1          1.0807     1.0809   1.000    0.333
## rv            0.1970     0.1962   1.004    0.331
## frc          -0.3084     0.4924  -0.626    0.540
## tlc           0.1886     0.4997   0.377    0.711
## 
## Residual standard error: 25.47 on 15 degrees of freedom
## Multiple R-squared:  0.6373, Adjusted R-squared:  0.4197 
## F-statistic: 2.929 on 9 and 15 DF,  p-value: 0.03195

Note that the overall p-value is the same; the full model is unchanged, only the order of its terms differs.

anova(my.mod)
## Analysis of Variance Table
## 
## Response: pemax
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## height     1 9634.6  9634.6 14.8511 0.001562 **
## age        1  646.2   646.2  0.9960 0.334098   
## weight     1  769.6   769.6  1.1863 0.293271   
## sex        1  790.8   790.8  1.2189 0.286964   
## bmp        1 2862.2  2862.2  4.4119 0.053010 . 
## fev1       1 1549.1  1549.1  2.3878 0.143120   
## rv         1  561.9   561.9  0.8662 0.366757   
## frc        1  194.6   194.6  0.2999 0.592007   
## tlc        1   92.4    92.4  0.1424 0.711160   
## Residuals 15 9731.2   648.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, we see that height is significant, and nothing else. The code below again compares my.mod2 (pemax ~ age) against the full model; carrying out the analogous comparison with lm(pemax ~ height, data = cystfibr) as the reduced model also yields a non-significant p-value, so the same conclusion holds for height.

anova(my.mod2, my.mod) #Only use when models are nested!
## Analysis of Variance Table
## 
## Model 1: pemax ~ age
## Model 2: pemax ~ height + age + weight + sex + bmp + fev1 + rv + frc + 
##     tlc
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     23 16734.2                           
## 2     15  9731.2  8    7002.9 1.3493 0.2936

And again, we can conclude that once we have added height, there is not much to be gained by adding other variables. We leave it to the reader to test whether weight follows the same pattern as height and age.

Before continuing, let’s examine a couple of other ways we could have proceeded.

my.mod <- lm(pemax~., data = cystfibr)
summary(my.mod)
## 
## Call:
## lm(formula = pemax ~ ., data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.338 -11.532   1.081  13.386  33.405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.0582   225.8912   0.779    0.448
## age          -2.5420     4.8017  -0.529    0.604
## sex          -3.7368    15.4598  -0.242    0.812
## height       -0.4463     0.9034  -0.494    0.628
## weight        2.9928     2.0080   1.490    0.157
## bmp          -1.7449     1.1552  -1.510    0.152
## fev1          1.0807     1.0809   1.000    0.333
## rv            0.1970     0.1962   1.004    0.331
## frc          -0.3084     0.4924  -0.626    0.540
## tlc           0.1886     0.4997   0.377    0.711
## 
## Residual standard error: 25.47 on 15 degrees of freedom
## Multiple R-squared:  0.6373, Adjusted R-squared:  0.4197 
## F-statistic: 2.929 on 9 and 15 DF,  p-value: 0.03195

We start eliminating variables that are not statistically significant in the model, beginning with the other lung function data and continuing for as long as removal is permissible. The order chosen for removal is: tlc, frc, rv, fev1.

summary(lm(pemax~height+age+weight+sex+bmp+fev1+rv+frc, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ height + age + weight + sex + bmp + fev1 + 
##     rv + frc, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.072 -10.067   0.113  13.526  36.990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 221.8055   185.4350   1.196   0.2491  
## height       -0.5428     0.8428  -0.644   0.5286  
## age          -3.1346     4.4144  -0.710   0.4879  
## weight        3.3157     1.7672   1.876   0.0790 .
## sex          -4.6933    14.8363  -0.316   0.7558  
## bmp          -1.9403     1.0047  -1.931   0.0714 .
## fev1          1.0183     1.0392   0.980   0.3417  
## rv            0.1857     0.1887   0.984   0.3396  
## frc          -0.2605     0.4628  -0.563   0.5813  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.78 on 16 degrees of freedom
## Multiple R-squared:  0.6339, Adjusted R-squared:  0.4508 
## F-statistic: 3.463 on 8 and 16 DF,  p-value: 0.01649
summary(lm(pemax~height+age+weight+sex+bmp+fev1+rv, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ height + age + weight + sex + bmp + fev1 + 
##     rv, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.425 -12.391   3.834  14.797  36.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 166.71822  154.31294   1.080   0.2951  
## height       -0.40981    0.79257  -0.517   0.6118  
## age          -1.81783    3.66773  -0.496   0.6265  
## weight        2.87386    1.55120   1.853   0.0814 .
## sex           0.10239   11.89990   0.009   0.9932  
## bmp          -1.94971    0.98415  -1.981   0.0640 .
## fev1          1.41526    0.74788   1.892   0.0756 .
## rv            0.09567    0.09798   0.976   0.3425  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.28 on 17 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.4729 
## F-statistic: 4.076 on 7 and 17 DF,  p-value: 0.008452
summary(lm(pemax~height+age+weight+sex+bmp+fev1, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ height + age + weight + sex + bmp + fev1, 
##     data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.238  -7.403  -0.081  15.534  36.028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 260.6313   120.5215   2.163   0.0443 *
## height       -0.6067     0.7655  -0.793   0.4384  
## age          -2.9062     3.4898  -0.833   0.4159  
## weight        3.3463     1.4719   2.273   0.0355 *
## sex          -1.2115    11.8083  -0.103   0.9194  
## bmp          -2.3042     0.9136  -2.522   0.0213 *
## fev1          1.0274     0.6329   1.623   0.1219  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.24 on 18 degrees of freedom
## Multiple R-squared:  0.6057, Adjusted R-squared:  0.4743 
## F-statistic: 4.608 on 6 and 18 DF,  p-value: 0.00529
summary(lm(pemax~height+age+weight+sex+bmp, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ height + age + weight + sex + bmp, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.194  -9.412  -2.425   9.157  40.112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 280.4482   124.9556   2.244   0.0369 *
## height       -0.6853     0.7962  -0.861   0.4001  
## age          -3.0750     3.6352  -0.846   0.4081  
## weight        3.5546     1.5281   2.326   0.0312 *
## sex         -11.5281    10.3720  -1.111   0.2802  
## bmp          -1.9613     0.9263  -2.117   0.0476 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.27 on 19 degrees of freedom
## Multiple R-squared:  0.548,  Adjusted R-squared:  0.429 
## F-statistic: 4.606 on 5 and 19 DF,  p-value: 0.006388

When removing variables in a stepwise manner such as above, it is important to consider adding variables back into the model. Since we have removed all of the lung function variables one at a time, let’s see whether the group of lung function variables is significant as a whole. This serves as a test of whether we need to consider adding variables back into the model.

mod1 <- lm(pemax~., data = cystfibr)
mod2 <- lm(pemax~height+age+weight+sex+bmp, data = cystfibr)
anova(mod2, mod1)
## Analysis of Variance Table
## 
## Model 1: pemax ~ height + age + weight + sex + bmp
## Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + 
##     tlc
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     19 12129.2                           
## 2     15  9731.2  4      2398 0.9241 0.4758

Since the \(p\)-value is 0.4758, this is an indication that the group of variables is not significant as a whole, and we are at least somewhat justified in removing the group.

Returning to variable selection, we see that age, height and sex don’t seem important, so let’s start removing.

summary(lm(pemax~height+weight+sex+bmp, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ height + weight + sex + bmp, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.202 -11.699  -1.845  10.322  38.744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 251.3973   119.2859   2.108   0.0479 *
## height       -0.8128     0.7762  -1.047   0.3075  
## weight        2.6947     1.1329   2.379   0.0275 *
## sex         -11.5458    10.2979  -1.121   0.2755  
## bmp          -1.4881     0.7330  -2.030   0.0558 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.09 on 20 degrees of freedom
## Multiple R-squared:  0.5309, Adjusted R-squared:  0.4371 
## F-statistic:  5.66 on 4 and 20 DF,  p-value: 0.003253
summary(lm(pemax~weight+sex+bmp, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ weight + sex + bmp, data = cystfibr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47.05 -11.65   4.23  13.75  45.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 132.9306    37.9132   3.506 0.002101 ** 
## weight        1.5810     0.3910   4.044 0.000585 ***
## sex         -11.7140    10.3203  -1.135 0.269147    
## bmp          -1.0140     0.5777  -1.755 0.093826 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.14 on 21 degrees of freedom
## Multiple R-squared:  0.5052, Adjusted R-squared:  0.4345 
## F-statistic: 7.148 on 3 and 21 DF,  p-value: 0.001732
summary(lm(pemax~weight+bmp, data = cystfibr))
## 
## Call:
## lm(formula = pemax ~ weight + bmp, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.924 -13.399   4.361  16.642  48.404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 124.8297    37.4786   3.331 0.003033 ** 
## weight        1.6403     0.3900   4.206 0.000365 ***
## bmp          -1.0054     0.5814  -1.729 0.097797 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.31 on 22 degrees of freedom
## Multiple R-squared:  0.4749, Adjusted R-squared:  0.4271 
## F-statistic: 9.947 on 2 and 22 DF,  p-value: 0.0008374

Again, we have removed a group of related variables, so let’s see whether the group as a whole is significant. If it were, we would need to consider adding variables back into our model.

mod3 <- lm(pemax~weight+bmp, data = cystfibr)
anova(mod3, mod2)
## Analysis of Variance Table
## 
## Model 1: pemax ~ weight + bmp
## Model 2: pemax ~ height + age + weight + sex + bmp
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     22 14090                           
## 2     19 12129  3    1961.3 1.0241 0.4041

This indicates that we are justified in removing the age, height and sex variables. And now, bmp also doesn’t seem important. We are left with modeling pemax on weight alone, similar to the single-variable models we had before.
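For completeness, the final single-variable model can be fit as follows (output not shown):

```r
# fit the model we arrived at by backward elimination
finalMod <- lm(pemax ~ weight, data = cystfibr)
summary(finalMod)
```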

There is a built-in R function that does variable selection using the Akaike Information Criterion (AIC). This is a different criterion than the \(p\)-value criterion that we were using above, and you typically get different models. Recall that mod1 is the full model.

aicMod <- step(mod1, direction = "both", trace = FALSE)
summary(aicMod)
## 
## Call:
## lm(formula = pemax ~ weight + bmp + fev1 + rv, data = cystfibr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.77 -11.74   4.33  15.66  35.07 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.94669   53.27673   1.200 0.244057    
## weight       1.74891    0.38063   4.595 0.000175 ***
## bmp         -1.37724    0.56534  -2.436 0.024322 *  
## fev1         1.54770    0.57761   2.679 0.014410 *  
## rv           0.12572    0.08315   1.512 0.146178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.75 on 20 degrees of freedom
## Multiple R-squared:  0.6141, Adjusted R-squared:  0.5369 
## F-statistic: 7.957 on 4 and 20 DF,  p-value: 0.000523

As is typical, the step procedure leaves more variables in the model than our \(p\)-value criterion did. In this case, we would model pemax on weight, bmp, fev1 and rv.
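If we want to compare the two candidate models directly on the AIC scale, AIC() accepts several fitted models at once, and lower values are preferred; unlike the anova comparison, the models need not be nested:

```r
# compare the step-selected model with the weight-only model by AIC
AIC(aicMod, lm(pemax ~ weight, data = cystfibr))
```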

Returning to the model of pemax on height, let’s take a look at the residuals.

my.mod <- lm(pemax~height, data = cystfibr)
plot(pemax~height, data = cystfibr)
abline(my.mod)

plot(my.mod)

It is hard to say, but there definitely could be a U-shape in the Residuals vs Fitted plot, which suggests there might be some lurking variables that we have not yet taken into account. One possibility is that pemax actually depends on both height and height squared, rather than on height alone. Let’s see what happens when we fit that model:

my.mod2 <- lm(pemax~height + I(height^2), data = cystfibr)
summary(my.mod2)
## 
## Call:
## lm(formula = pemax ~ height + I(height^2), data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.411 -14.932  -2.288  12.787  44.933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 615.36248  240.95580   2.554   0.0181 *
## height       -8.08324    3.32052  -2.434   0.0235 *
## I(height^2)   0.03064    0.01126   2.721   0.0125 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.18 on 22 degrees of freedom
## Multiple R-squared:  0.5205, Adjusted R-squared:  0.4769 
## F-statistic: 11.94 on 2 and 22 DF,  p-value: 0.0003081

Notice that in this case, both height and height^2 are significant.
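An equivalent way to specify polynomial terms is with poly(). With raw = TRUE this reproduces the model above; the default (orthogonal polynomials) gives the same fitted values with decorrelated coefficients:

```r
# same fitted values as pemax ~ height + I(height^2)
lm(pemax ~ poly(height, 2, raw = TRUE), data = cystfibr)
```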

Aside Let’s pause and think a moment about what we are doing. If we assume that we are just exploring the data, then the process that has led us to modeling pemax as a function of height and height^2 is marginal. Considering that pemax seems to have larger variance for taller children, it is not hard to imagine that a linear relationship could look quadratic. We also have no a priori reason to think that pemax might depend on height^2. So, think of this as an illustration of what can be done and not necessarily an ideal analysis of the data. A couple of reasonable next steps would be to (1) contact the experts to see whether there is any reason that height^2 might be relevant, and (2) collect more data and see whether the relationship is replicated or was an artifact of this data set. End of Aside

Let’s check the residuals:

plot(my.mod2)

Looks pretty good, except we seem to be violating the constant variance assumption. In particular, the variance seems to be higher for taller children (who tend to have higher pemax values) than for shorter children. Let’s re-examine the plot of pemax versus height:

ggplot(cystfibr, aes(x = height, y = pemax)) + geom_point()

This plot also seems to show that the variance of pemax depends on the height of the child.
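One common remedy when the spread of the response grows with its level is to transform the response. We do not pursue this here, but here is a sketch (it foreshadows the log transformations in the next example):

```r
# model the log of pemax instead, then re-check the residual plot
logMod <- lm(log(pemax) ~ height, data = cystfibr)
plot(logMod, which = 1)
```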

12.0.2 Example Two: Secher data

Let’s examine the secher data. Here, we have the birth weight of 107 babies, together with two ultrasound measurements of the fetus: the biparietal diameter (bpd) and the abdominal diameter (ad).

secher <- ISwR::secher
secher$no <- NULL
longSecher <- gather(data = secher, key = variable, value = value, -bwt)
ggplot(longSecher, aes(x = value, y = bwt)) + geom_point() +
  facet_wrap(facets = ~variable, scales = "free_x")

A brief glance at the plots (especially bwt versus bpd) suggests that the variance is not constant. Let’s apply a log transformation to all three variables, and plot again.

longSecher$value <- log(longSecher$value)
longSecher$bwt <- log(longSecher$bwt)
ggplot(longSecher, aes(x = value, y = bwt)) + geom_point() +
  facet_wrap(facets = ~variable, scales = "free_x")

As you can see from the plots, the data seem more linear, and the variance seems constant across scales. Now, we model log(bwt) on the logs of the other two variables.

my.mod <- lm(I(log(bwt))~I(log(bpd)) + I(log(ad)), data = secher)
summary(my.mod)
## 
## Call:
## lm(formula = I(log(bwt)) ~ I(log(bpd)) + I(log(ad)), data = secher)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35074 -0.06741 -0.00792  0.05750  0.36360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.8615     0.6617  -8.859 2.36e-14 ***
## I(log(bpd))   1.5519     0.2294   6.764 8.09e-10 ***
## I(log(ad))    1.4667     0.1467   9.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1068 on 104 degrees of freedom
## Multiple R-squared:  0.8583, Adjusted R-squared:  0.8556 
## F-statistic: 314.9 on 2 and 104 DF,  p-value: < 2.2e-16
plot(my.mod, which = c(1,3,5))

Looking at the residuals, there seems to be one possible outlier, which is data point 101.

secher[101,]
##      bwt bpd ad
## 101 1250  64 71
summary(secher)
##       bwt            bpd               ad       
##  Min.   :1150   Min.   : 64.00   Min.   : 71.0  
##  1st Qu.:2400   1st Qu.: 88.00   1st Qu.: 96.0  
##  Median :2800   Median : 91.00   Median :103.0  
##  Mean   :2739   Mean   : 89.48   Mean   :101.7  
##  3rd Qu.:3125   3rd Qu.: 93.00   3rd Qu.:108.0  
##  Max.   :4850   Max.   :100.00   Max.   :133.0

We see that the high leverage point is the minimum for both diameters, and almost the minimum for birth weight. This could be an indication that our model doesn’t hold well for the smallest babies. Referring back to our original plots of the log transformed data, it is definitely plausible that the model might break down near log(bpd) = 4.3. We certainly have very little data in that range, so our model would be somewhat suspect in that range even if the fit were a little bit better. As it is, I would be hesitant to apply the model to predict infant birth weight when log(bpd) is less than 4.3, pending further verification.
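We can check which observations have the highest leverage numerically by inspecting the hat values of the fit:

```r
h <- hatvalues(my.mod)
which.max(h)                    # index of the highest-leverage observation
sort(h, decreasing = TRUE)[1:3] # the three largest hat values
```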

For comparison’s sake, let’s see what happens if we only use ad or bpd.

my.mod2 <- lm(I(log(bwt))~I(log(ad)), data = secher)
summary(my.mod2)
## 
## Call:
## lm(formula = I(log(bwt)) ~ I(log(ad)), data = secher)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58560 -0.06609  0.00184  0.07479  0.48435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.4446     0.5103  -4.791 5.49e-06 ***
## I(log(ad))    2.2365     0.1105  20.238  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1275 on 105 degrees of freedom
## Multiple R-squared:  0.7959, Adjusted R-squared:  0.794 
## F-statistic: 409.6 on 1 and 105 DF,  p-value: < 2.2e-16
plot(my.mod2, which = c(1,3,5))

predict(my.mod, newdata = data.frame(bpd = 64, ad = 71))
##       1 
## 6.84476
exp(6.84476)
## [1] 938.9479
predict(my.mod2, newdata = data.frame(ad = 71))
##      1 
## 7.0889
exp(7.0889)
## [1] 1198.589

Note that abdominal diameter alone remains a reasonable predictor of birth weight even in this extreme case (predicting about 1199 g against the actual 1250 g), while the model that also includes biparietal diameter is very far off (about 939 g). Again, we have very little data in this range, so this is all highly speculative.
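Prediction intervals make the extrapolation risk explicit. On the log scale they can be computed with predict() and then exponentiated back to grams; a sketch for the ad-only model:

```r
pred <- predict(my.mod2, newdata = data.frame(ad = 71),
                interval = "prediction")
exp(pred)  # point prediction and interval for birth weight, in grams
```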

12.1 Exercises

  1. Consider the data set tlc in the ISwR library. Model the variable tlc inside the data set tlc on the other variables. Include plots and examine the residuals.

  2. Load the data set MRex2.csv from http://stat.slu.edu/~speegle/_book_data/MRex2.csv. This data set contains the following variables:
    • G = The total number of games played in the AL or NL each year.
    • HR = The number of home runs hit per game.
    • R = The number of runs scored per game.
    • SB = The number of stolen bases per game.
    • CS = The number of times caught stealing per game.
    • IBB = The number of intentional bases on balls per game.
    • X2B = The number of doubles hit per game.
    • AB = The number of at bats per game.
    • PSB = The percentage of time a base stealer was successful.
    • yearID = the year in question.
      Load the data frame and complete the following tasks:
    1. Compute the correlation coefficient for each pair of variables. Comment on pairs that are highly correlated.
    2. Create a scatterplot of Runs scored per game versus each of the other 8 variables.
    3. Create a linear model of the number of runs scored per game on all of the other variables except yearID.
    4. CS is considered a bad thing to happen, and leads to fewer runs being scored. Why is the coefficient associated with CS positive?
    5. Are SB, CS and PSB individually significant?
    6. Are SB, CS and PSB jointly significant?
    7. Intelligently remove variables and create a model that describes the response.
    8. Examine the residuals, and evaluate your model.