Weight changes in Herring Gulls
The data in the below table and figure are part of a large sample collected by the Department of Zoology at the University of Glasgow in a study of the weight changes in herring gulls throughout the year. Some birds were caught in June (coded as month 1) and others in December (month 2). Since weight is dependent on the size of the bird this information is recorded in the form of the head and bill length, HAB (in mm), the distance from the back of the head to the tip of the bill.
ggplot(data=hab, aes(x=HAB.length,y= Weight, colour = Month, shape= Month))+
geom_point(size = 3, alpha = .8) +
geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
theme(plot.background = element_rect(
fill = "transparent",
colour = NA,
size = 1), legend.position = c(0.15,0.85),
legend.background = element_rect(fill="transparent", size=0.5, linetype="solid")) +
guides(color = guide_legend(label.position = "right") )
A model with two separate lines appears to be plausible. We will now fit the separate lines model and examine it and conclude whether to keep two separate lines or reduce to parallel lines or even further to a single line. The summary numbers for the data are: \[\begin{aligned} \bar{y}_{1.}& = 819.2857 \quad \bar{y}_{2.}=896\\ \bar{x}_{1.}& = 112.5 \quad \bar{x}_{2.}=113.1\\ S_{y_1y_1} & = 120642.9 \quad S_{y_2y_2}=105340.0\\ S_{x_1x_1} & = 333.5 \quad S_{x_2x_2} = 78.9\\ S_{x_1y_1}& = 5100 \quad S_{x_2y_2}=2404\\ n_1& = 14 \quad n_2 =10\\ p& = 4\end{aligned}\]
We first fit the model with different slopes,
\[E(Y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})\]
We already have the formulae to do this: \[\begin{aligned} \hat{\alpha_1}& = \bar{y}_{1.}\\ \hat{\alpha_2}& = \bar{y}_{2.}\\ \hat{\beta_1} & = \frac{S_{x_1y_1}}{S_{x_1x_1}}\\ \hat{\beta_2} & = \frac{S_{x_2y_2}}{S_{x_2x_2}}\\ RSS& = S_{y_1y_1}-\frac{(S_{x_1y_1})^2}{S_{x_1x_1}}+S_{y_2y_2}-\frac{(S_{x_2y_2})^2}{S_{x_2x_2}} = RSS_1+RSS_2\end{aligned}\]
A 95% C.I. for \((\beta_1-\beta_2)\) is
\[\hat{\beta_1}-\hat{\beta_2} \pm t(n-p; 0.975) \sqrt{\frac{RSS}{n-p}\left(\frac{1}{S_{x_1x_1}}+\frac{1}{S_{x_2x_2}}\right)}\]
Using the summary statistics for the “different lines” model we have \[\begin{aligned} \hat{\beta_1} & = 15.2924\\ \\ \hat{\beta_2} & = 30.4689\\ \\ RSS& = 42651.896+32092.65 = 74744.5\end{aligned}\]
A C.I. for the difference in slopes is
\[\hat{\beta_1}-\hat{\beta_2} \pm t(n_1+n_2-4; 0.975)\sqrt{\left(\frac{RSS_1+RSS_2}{n_1+n_2-4}\right)\left(\frac{1}{S_{x_1x_1}}+\frac{1}{S_{x_2x_2}}\right)}\]
\[-15.1765 \pm 2.086 \sqrt{\frac{74744.5}{20}\left(\frac{1}{333.5}+\frac{1}{78.9}\right)}\]
i.e. \((15.17 \pm 26.2973)\) or \((-41.47,11.12)\).
In R
, the model above (that allows two completely separate linear
regression lines for a response of weight, one for each month) can be
achieved by fitting a linear model with a continuous covariate for HAB
length, a factor covariate for group and a covariate that allows an
interaction between HAB and month. The R
code and analysis of variance
table are provided below:
hab <- read.csv("week7/Lecture14/hab.csv")
hab.separate <- lm(Weight~HAB.length*Month,data=hab)
print(anova(hab.separate))
## Analysis of Variance Table
##
## Response: Weight
## Df Sum Sq Mean Sq F value Pr(>F)
## HAB.length 1 145746 145746 38.9985 4.248e-06 ***
## Month 1 25126 25126 6.7231 0.01739 *
## HAB.length:Month 1 14696 14696 3.9324 0.06127 .
## Residuals 20 74745 3737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The term HAB.length:Month
is the interaction term. If this has a small
p-value (i.e. typically \(<\) 0.05) then it suggests that the model with
completely separate regression lines is the most appropriate i.e. there
is a statistically significant interaction. If this p-value is not \(<\)
0.05 then we should consider the parallel lines model.
Since the confidence interval contains zero (although only just!), we cannot reject the parallel lines model. It therefore seems that a separate slope is not required for each month.
The parallel lines model produces \[\begin{aligned} \hat{\beta} & = 18.1959\\ \\ RSS& = 89440.6\end{aligned}\]
A C.I. for the vertical separation between the regression lines is
\[\hat{\alpha}_1-\hat{\alpha}_2+\hat{\beta}(\bar{x}_{2.}-\bar{x}_{1.}) \pm t(n-p,0.975) \sqrt{\left(\frac{RSS}{n-p}\right)\left(\frac{1}{n_1}+\frac{1}{n_2}+\frac{(\bar{x}_{2.}-\bar{x}_{1.})^2}{S_{x_1x_1}+S_{x_2x_2}}\right)}\]
\[-66.0 \pm 2.08 \sqrt{\frac{89440.6}{21}\left(\frac{1}{10}+\frac{1}{14}+\frac{0.6^2}{333.5+78.9}\right)}\]
i.e. \((-66.0 \pm 51.5)\) or (-117.5, -14.5).
This model can be fitted in R
by removing the interaction term from
the linear model i.e. we fit a model that only has a continuous
covariate of HAB length and a factor covariate of month:
##
## Call:
## lm(formula = Weight ~ HAB.length + Month, data = hab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.384 -46.786 -3.306 36.083 109.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1293.553 361.884 -3.574 0.00179 **
## HAB.length 18.196 3.214 5.662 1.28e-05 ***
## Month 65.797 27.090 2.429 0.02421 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.26 on 21 degrees of freedom
## Multiple R-squared: 0.6564, Adjusted R-squared: 0.6237
## F-statistic: 20.06 on 2 and 21 DF, p-value: 1.344e-05
If the p-value for the factor (which is Month here) is small (typically \(<\) 0.05) then we conclude that the parallel lines model is the most appropriate. However, if the p-value is \(>\) 0.05 it suggests that there is insufficient evidence of a difference between the two regression lines and one line may be appropriate i.e. the relationship is the same regardless of the group.
Since the confidence interval does not contain zero, the parallel lines model appears to be appropriate. It therefore seems that December weights are systematically higher than June weights i.e. group 2 weights are higher than group 1 weights since the interval is entirely negative. It is highly likely that the December weights are higher by between 14.5 and 117.5 grams.
We could have also considered the model selection using the anova
function and compare the two separate, two parallel and single line. and as the p-value < 0.05 we do not have evidencew that the interaction term is significant, therefore we drop the interaction term and use the simpler parallel lines model.
## Analysis of Variance Table
##
## Model 1: Weight ~ HAB.length * Month
## Model 2: Weight ~ HAB.length + Month
## Model 3: Weight ~ HAB.length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 20 74745
## 2 21 89441 -1 -14696 3.9324 0.06127 .
## 3 22 114566 -1 -25126 6.7231 0.01739 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1