Trout
When the behaviour of a group of trout is studied, some fish are observed to become dominant and others to become subordinate. Dominant fish (coded as group 1) have freedom of movement whereas subordinate fish (coded as group 2) tend to congregate in the periphery of the waterway to avoid crossing the path of the dominant fish, Metcalfe (1986).
Data on energy expenditure and ration of food obtained were collected as part of a laboratory experiment for 20 trout, and are plotted below. Both variables are measured in units of calories per kilo-calorie per trout per day.
trout <- read.csv("week7/Lecture14/TROUT.csv")
trout$Group=factor(trout$Group)
ggplot(data=trout, aes(y=Energy,x= Ration, group=Group, color=Group,shape= Group)) +
geom_point(size = 3, alpha = .8) +
geom_smooth(method="lm", fill=NA, fullrange=TRUE)+
theme(plot.background = element_rect(
fill = "transparent",
colour = NA,
size = 1), legend.position = c(0.15,0.85),
legend.background = element_rect(fill="transparent", size=0.5,
linetype="solid")) +
scale_shape_discrete(breaks=c("1", "2"),
labels=c( "Dominant","Subordinate")) +
scale_color_discrete(breaks=c("1", "2"),
labels=c( "Dominant","Subordinate"))
Based on the scatterplot, it appears that a model with parallel lines to be an acceptable description of the data. We will now fit the parallel lines model and examine it. Summary statistics are provided below and have been calculated from the data.
\[\bar{x}_{1.} = 79.644, \, S_{x_1y_1} = 1772.273, \, \bar{y}_{1.} = 55.665, \, S_{x_1x_1} = 2193.056\]
\[\bar{x}_{2.} = 62.829, \, S_{x_2y_2} = 1787.014, \, \bar{y}_{2.} = 55.678, \, S_{x_2x_2} = 1327.703\]
\[\sum_{i=1}^2\sum_{j=1}^{n_i}y_{ij}^2 = 67166.63, \, n_1=n_2=10\]
Calculate a 95% confidence interval that can be used to assess whether the parallel line model should be reduced to a single line.
\(\newline\) Based on the summary statistics
\[\begin{aligned} n_1& = n_2=10\\ \hat{\beta} & = \frac{S_{x_1y_1}+S_{x_2y_2}}{S_{x_1x_1}+S_{x_2x_2}} = 1.010943\\ \hat{\alpha_1}& = \bar{y}_{1.}=55.665\\ \hat{\alpha_2}& = \bar{y}_{2.}=55.678\\ \\ b &=\left( \begin{array}{c} 1 \\ -1 \\ 62.829- 79.644 \\ \end{array} \right)=\left( \begin{array}{c} 1 \\ -1 \\ -16.815 \\ \end{array} \right)\\ RSS& = \sum_i\sum_jy_{ij}^2-n_1\bar{y}_{1.}^2-n_2\bar{y}_{2.}^2-\frac{(S_{x_1y_1}+S_{x_2y_2})^2}{S_{x_1x_1}+S_{x_2x_2}}\\ & = 67166.63-10 \times 55.665^2-10 \times 55.678^2-\frac{(1772.273+ 1787.014)^2}{(2193.056+ 1327.703)}\\ & = 1582.074\end{aligned}\]
\[\begin{aligned} \mathbf{b}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{b} & = \left( \begin{array}{ccc} 1 & -1 & -16.815 \\ \end{array} \right)\left( \begin{array}{ccc} \frac{1}{10} & 0 & 0 \\ 0 & \frac{1}{10} & 0 \\ 0 & 0 & \frac{1}{5180.314} \\ \end{array} \right)\left( \begin{array}{c} 1 \\ -1 \\ -16.815 \\ \end{array} \right)\\ \\ & = \frac{1}{10}+\frac{1}{10}+\frac{(-16.815)^2}{3520.759}\\ \\ & = 0.2803\end{aligned}\]
The 95% C.I. is therefore
\[\hat{\alpha}_1-\hat{\alpha}_2+\hat{\beta}(\bar{x}_{2.}-\bar{x}_{1.}) \pm t(n-p,0.975) \sqrt{\left(\frac{RSS}{n-p}\right)\left(\frac{1}{n_1}+\frac{1}{n_2}+\frac{(\bar{x}_{2.}-\bar{x}_{1.})^2}{S_{x_1x_1}+S_{x_2x_2}}\right)}\]
\[55.665-55.678+1.0109(-16.815) \pm t(17;0.975)\sqrt{\frac{1582.074}{17}0.2803}\]
\[-17.0113 \pm 10.7757\] \[(-27.79, -6.24)\]
The above interval does not contain zero so we should keep the parallel line models. This interval in entirely negative. Recall this is an interval estimate for the intercept term for group 1 (dominant) minus group 2 (subordinate).
Therefore the intercept term for regression line for group 1 (dominant) is significantly lower than that of group 2 (subordinate). This suggests that the regression line for group 1 (dominant) lies significantly lower than that of group 2 (subordinate).
This suggests that for any given amount of food obtained, x, dominant fish on average expend less energy than subordinate fish.
Now, we see how we can use R output to come to the same conclusion.
trout.lm<-lm(Energy~Ration+Group,data=trout)
trout.lm.single<-lm(Energy~Ration,data=trout)
summary(trout.lm)
##
## Call:
## lm(formula = Energy ~ Ration + Group, data = trout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.130 -5.139 -0.870 2.199 25.622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.8506 13.3031 -1.868 0.07910 .
## Ration 1.0109 0.1626 6.218 9.36e-06 ***
## Group2 17.0120 5.1075 3.331 0.00396 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.647 on 17 degrees of freedom
## Multiple R-squared: 0.6946, Adjusted R-squared: 0.6587
## F-statistic: 19.33 on 2 and 17 DF, p-value: 4.182e-05
The first indicator that we should keep the two parallel lines is the fact that the term Group
is significant in the summary of the output of model trout.lm<-lm(Energy~Ration+Group,data=trout)
. In particular the p-value is 0.00396 which is less than 0.05.
The model fitted using R
is specifed slightly different to that defined in the previous lecture.
95% Confidence interval for parallel lines model
Recall how we specified the parallel regression model in order to obtain a 95% Confidence interval for parallel lines model.
\(\newline\) Data: \((y_{ij}, x_{ij}); \quad i=1,2; \quad j=1, \dots,n_i\).
\(\newline\) Model : \(E(Y_{ij}) = \alpha_i+\beta(x_{ij}-\bar{x}_{i.})\)
Paralle lines model in R
Compare this to the specific model we have specifed in R
.
\(\newline\) Data: \((y_{j}, x_{j},\mbox{group}_j); \quad j=1, \dots,n_1+n_2\).
\(\newline\) Model : \(E(Y_{j}) = a+b x_{j} + c I_{\mbox{group}_j}\)
\(\newline\)
where
\[
I_{group_j}=
\begin{cases}
1 \hspace{0.5cm} \mbox{if } \mbox{group}_j=2 \\
0 \hspace{0.5cm} \mbox{if } \mbox{group}_j=1
\end{cases}
\]
From the R
output
\[\begin{eqnarray*} \hat{a} &=& -24.8506\\ \hat{b} &=& 1.0109\\ \hat{c} &=& 17.0120 \end{eqnarray*}\]
and from there, we can deduce that \(\hat{\beta} = 1.0109\) and the intercept term in relation to the dominant group (1)is -24.8506 and the subordinate group (2) intercept term for the is -24.9506+17.0120=-7.9386. Therefore
\[\begin{eqnarray*} \hat{\alpha_1} &=& -24.8506+ 1.0109 \times 79.644 = 55.662\\ \hat{\alpha_2} &=& -7.9386+ 1.0109 \times 62.829 = 55.575\\ \hat{\beta} &=& 1.0109 \end{eqnarray*}\]
Notice that the difference between the two groups from the R
output is 17.0120. The coding Group2
tells us that group 2 is signficantly higher than group 1 (since the p-value < 0.05). Therefore, accounting for Ration
group 2 fish on average expend more energy that group 1 fish.
We can obtain the same conclusion using the anova
function. In this case we have two competing models. Model 1 is the model including Ration and Group and Model 2 is the model including only Ration. When given a sequence of models in this way, anova
tests the models against one another. The null model is the simplier of the two (Model 2). Given the p-value of 0.003958, we have evidence to reject this model.
## Analysis of Variance Table
##
## Model 1: Energy ~ Ration + Group
## Model 2: Energy ~ Ration
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 1582.1
## 2 18 2614.5 -1 -1032.5 11.094 0.003958 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s now plot the data again. This time, we will include 95% confidence interval bands for each group.
fit<-predict(trout.lm, interval="confidence")
trout.fit<-data.frame(trout,fit)
ggplot(data=trout.fit, aes(x=Ration,y= Energy, colour = Group, shape= Group))+
geom_point(size = 3, alpha = .8) +
geom_line(aes(x=Ration,y=fit),size=1.5)+
geom_ribbon(aes(ymin = lwr, ymax = upr), linetype = "dotted", alpha = I(0.2)) + #For CI
theme(plot.background = element_rect(
fill = "transparent",
colour = NA,
size = 1), legend.position = c(0.15,0.85),
legend.background = element_rect(fill="transparent", size=0.5, linetype="solid")) +
guides(color = guide_legend(label.position = "right") )
Notice the two CI’s do not overlap over the whole range. So our conclusion would be to keep both lines.