Weight changes in Herring Gulls

The data in the below table and figure are part of a large sample collected by the Department of Zoology at the University of Glasgow in a study of the weight changes in herring gulls throughout the year. Some birds were caught in June (coded as month 1) and others in December (month 2). Since weight is dependent on the size of the bird this information is recorded in the form of the head and bill length, HAB (in mm), the distance from the back of the head to the tip of the bill.

ggplot(data=hab, aes(x=HAB.length,y= Weight, colour = Month, shape= Month))+ 
  geom_point(size = 3, alpha = .8) +
  geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
     theme(plot.background = element_rect(
          fill = "transparent",
          colour = NA,
          size = 1),    legend.position = c(0.15,0.85),
        legend.background = element_rect(fill="transparent", size=0.5, linetype="solid")) +
  guides(color = guide_legend(label.position = "right") )

A model with two separate lines appears to be plausible. We will now fit the separate lines model and examine it and conclude whether to keep two separate lines or reduce to parallel lines or even further to a single line. The summary numbers for the data are: $\begin{aligned} \bar{y}_{1.}& = 819.2857 \quad \bar{y}_{2.}=896\\ \bar{x}_{1.}& = 112.5 \quad \bar{x}_{2.}=113.1\\ S_{y_1y_1} & = 120642.9 \quad S_{y_2y_2}=105340.0\\ S_{x_1x_1} & = 333.5 \quad S_{x_2x_2} = 78.9\\ S_{x_1y_1}& = 5100 \quad S_{x_2y_2}=2404\\ n_1& = 14 \quad n_2 =10\\ p& = 4\end{aligned}$

We first fit the model with different slopes,

$E(Y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})$

We already have the formulae to do this: $\begin{aligned} \hat{\alpha_1}& = \bar{y}_{1.}\\ \hat{\alpha_2}& = \bar{y}_{2.}\\ \hat{\beta_1} & = \frac{S_{x_1y_1}}{S_{x_1x_1}}\\ \hat{\beta_2} & = \frac{S_{x_2y_2}}{S_{x_2x_2}}\\ RSS& = S_{y_1y_1}-\frac{(S_{x_1y_1})^2}{S_{x_1x_1}}+S_{y_2y_2}-\frac{(S_{x_2y_2})^2}{S_{x_2x_2}} = RSS_1+RSS_2\end{aligned}$

A 95% C.I. for $(\beta_1-\beta_2)$ is

$\hat{\beta_1}-\hat{\beta_2} \pm t(n-p; 0.975) \sqrt{\frac{RSS}{n-p}\left(\frac{1}{S_{x_1x_1}}+\frac{1}{S_{x_2x_2}}\right)}$

Using the summary statistics for the “different lines” model we have $\begin{aligned} \hat{\beta_1} & = 15.2924\\ \\ \hat{\beta_2} & = 30.4689\\ \\ RSS& = 42651.896+32092.65 = 74744.5\end{aligned}$

A C.I. for the difference in slopes is

$\hat{\beta_1}-\hat{\beta_2} \pm t(n_1+n_2-4; 0.975)\sqrt{\left(\frac{RSS_1+RSS_2}{n_1+n_2-4}\right)\left(\frac{1}{S_{x_1x_1}}+\frac{1}{S_{x_2x_2}}\right)}$

$-15.1765 \pm 2.086 \sqrt{\frac{74744.5}{20}\left(\frac{1}{333.5}+\frac{1}{78.9}\right)}$

i.e. $(15.17 \pm 26.2973)$ or $(-41.47,11.12)$ .

In R, the model above (that allows two completely separate linear regression lines for a response of weight, one for each month) can be achieved by fitting a linear model with a continuous covariate for HAB length, a factor covariate for group and a covariate that allows an interaction between HAB and month. The R code and analysis of variance table are provided below:

hab <- read.csv("week7/Lecture14/hab.csv")
hab.separate <- lm(Weight~HAB.length*Month,data=hab)
print(anova(hab.separate))

## Analysis of Variance Table
## 
## Response: Weight
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## HAB.length        1 145746  145746 38.9985 4.248e-06 ***
## Month             1  25126   25126  6.7231   0.01739 *  
## HAB.length:Month  1  14696   14696  3.9324   0.06127 .  
## Residuals        20  74745    3737                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The term HAB.length:Month is the interaction term. If this has a small p-value (i.e. typically $<$ 0.05) then it suggests that the model with completely separate regression lines is the most appropriate i.e. there is a statistically significant interaction. If this p-value is not $<$ 0.05 then we should consider the parallel lines model.

Since the confidence interval contains zero (although only just!), we cannot reject the parallel lines model. It therefore seems that a separate slope is not required for each month.

The parallel lines model produces $\begin{aligned} \hat{\beta} & = 18.1959\\ \\ RSS& = 89440.6\end{aligned}$

A C.I. for the vertical separation between the regression lines is

$\hat{\alpha}_1-\hat{\alpha}_2+\hat{\beta}(\bar{x}_{2.}-\bar{x}_{1.}) \pm t(n-p,0.975) \sqrt{\left(\frac{RSS}{n-p}\right)\left(\frac{1}{n_1}+\frac{1}{n_2}+\frac{(\bar{x}_{2.}-\bar{x}_{1.})^2}{S_{x_1x_1}+S_{x_2x_2}}\right)}$

$-66.0 \pm 2.08 \sqrt{\frac{89440.6}{21}\left(\frac{1}{10}+\frac{1}{14}+\frac{0.6^2}{333.5+78.9}\right)}$

i.e. $(-66.0 \pm 51.5)$ or (-117.5, -14.5).

This model can be fitted in R by removing the interaction term from the linear model i.e. we fit a model that only has a continuous covariate of HAB length and a factor covariate of month:

hab.parallel <- lm(Weight~HAB.length+Month,data=hab)
summary(hab.parallel)

## 
## Call:
## lm(formula = Weight ~ HAB.length + Month, data = hab)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.384  -46.786   -3.306   36.083  109.840 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1293.553    361.884  -3.574  0.00179 ** 
## HAB.length     18.196      3.214   5.662 1.28e-05 ***
## Month          65.797     27.090   2.429  0.02421 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.26 on 21 degrees of freedom
## Multiple R-squared:  0.6564, Adjusted R-squared:  0.6237 
## F-statistic: 20.06 on 2 and 21 DF,  p-value: 1.344e-05

If the p-value for the factor (which is Month here) is small (typically $<$ 0.05) then we conclude that the parallel lines model is the most appropriate. However, if the p-value is $>$ 0.05 it suggests that there is insufficient evidence of a difference between the two regression lines and one line may be appropriate i.e. the relationship is the same regardless of the group.

Since the confidence interval does not contain zero, the parallel lines model appears to be appropriate. It therefore seems that December weights are systematically higher than June weights i.e. group 2 weights are higher than group 1 weights since the interval is entirely negative. It is highly likely that the December weights are higher by between 14.5 and 117.5 grams.

We could have also considered the model selection using the anova function and compare the two separate, two parallel and single line. and as the p-value < 0.05 we do not have evidencew that the interaction term is significant, therefore we drop the interaction term and use the simpler parallel lines model.

hab.single <- lm(Weight~HAB.length,data=hab)
print(anova(hab.separate,hab.parallel,hab.single))

## Analysis of Variance Table
## 
## Model 1: Weight ~ HAB.length * Month
## Model 2: Weight ~ HAB.length + Month
## Model 3: Weight ~ HAB.length
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     20  74745                              
## 2     21  89441 -1    -14696 3.9324 0.06127 .
## 3     22 114566 -1    -25126 6.7231 0.01739 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1