Example
As an illustrative example we will use the protein in pregnancy data, which we have used before to demonstrate some of these assumptions.
Protein in Pregnancy
Recall that these data were collected to investigate whether the level of protein in expectant mothers changes throughout pregnancy. Observations were taken on 19 healthy women, each at a different stage of pregnancy (gestation). We have already fitted a simple linear regression describing the relationship between the mothers’ protein levels and gestation, and estimated the parameters of this model, as shown below.
#Read in the data (you can find the data csv file on Moodle)
protein<-read.csv("week5/Lecture9/PROTEIN.csv",header=TRUE)
#Fit a simple linear regression
protein.lm<-lm(Protein~Gestation,data=protein)
#Print summary of model
summary(protein.lm)
#Print the ANOVA table
anova(protein.lm)
##
## Call:
## lm(formula = Protein ~ Gestation, data = protein)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.16853 -0.08720 -0.01009 0.08578 0.20422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.201738 0.083363 2.420 0.027 *
## Gestation 0.022844 0.003295 6.934 2.42e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1151 on 17 degrees of freedom
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7234
## F-statistic: 48.08 on 1 and 17 DF, p-value: 2.416e-06
## Analysis of Variance Table
##
## Response: Protein
## Df Sum Sq Mean Sq F value Pr(>F)
## Gestation 1 0.63667 0.63667 48.076 2.416e-06 ***
## Residuals 17 0.22513 0.01324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Data: \((y_i,x_i), \quad i=1,\dots,n\)
\(y_i\), protein level of mother \(i\)
\(x_i\), gestation (in weeks) for mother \(i\)
Model: \(\mathrm{E}(y_i)=0.2017 + 0.0228x_i\)
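The fitted model can be used to compute the expected protein level at a given gestation. The following is a minimal sketch that copies the coefficient estimates from the summary output rather than reading the data file; the helper name `expected_protein` and the gestation values of 20, 25 and 30 weeks are assumptions chosen purely for illustration, and such values are only meaningful if they lie within the range of gestation observed in the data.

```r
# Coefficient estimates copied from the summary() output above
intercept <- 0.201738
slope <- 0.022844

# Hypothetical helper (not part of the original notes):
# expected protein level at a given gestation in weeks
expected_protein <- function(weeks) {
  intercept + slope * weeks
}

# Illustrative gestation values, assumed to be within the observed range
weeks <- c(20, 25, 30)
fitted_values <- expected_protein(weeks)
print(data.frame(Gestation = weeks, ExpectedProtein = fitted_values))
```

With the model object available, `predict(protein.lm, newdata = data.frame(Gestation = c(20, 25, 30)))` should give the same values up to rounding of the coefficients.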
#Plot the response variable protein against explanatory variable gestation
#with fitted line.
library(ggplot2)
ggplot(protein, aes(x = Gestation, y = Protein)) +
  geom_point(size = 3.2, alpha = 0.4, col = "blue") +
  ggtitle("Protein in Pregnancy") +
  #linewidth replaces the size argument for lines, deprecated in ggplot2 3.4.0
  geom_smooth(method = "lm", fullrange = TRUE, color = "black", linewidth = 2, se = FALSE)
In addition, the ANOVA table is
Component | Degrees of freedom (df) | Sum of squares (SS) | Mean square (MS) | F value |
Model | 1 | 0.63667 | 0.63667 | 48.076 |
Residual | 17 | 0.22513 | 0.01324 | |
Total | 18 | 0.8618 | | |
and so we can calculate
\[\begin{aligned} R^2&=1-\frac{RSS}{TSS}\\ &= 1 - \frac{0.22513}{0.8618}\\ &= 0.7388\\ \\ \\ r&=\sqrt{0.7388}\\ &=0.8595. \end{aligned}\]
Hence the correlation between gestation and protein is 0.86 (we take the positive square root because the fitted slope is positive), and gestation explains 73.9% of the variation in protein level in these healthy pregnant women. On this basis, you may argue that the fitted model provides an adequate fit to the data.
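The arithmetic above can be checked directly in R. This is a small sketch that uses only the sums of squares and degrees of freedom printed in the ANOVA output, so it runs without the data file; the variable names are assumptions introduced for illustration.

```r
# Sums of squares and degrees of freedom copied from the ANOVA output above
model_ss <- 0.63667
resid_ss <- 0.22513
model_df <- 1
resid_df <- 17

# Total sum of squares (TSS) is the sum of the model and residual components
total_ss <- model_ss + resid_ss

# Coefficient of determination: R^2 = 1 - RSS / TSS
r_squared <- 1 - resid_ss / total_ss

# Correlation: positive root, because the fitted slope is positive
r <- sqrt(r_squared)

# F statistic: ratio of the model mean square to the residual mean square
f_value <- (model_ss / model_df) / (resid_ss / resid_df)

print(round(c(TSS = total_ss, R2 = r_squared, r = r, F = f_value), 4))
```

If the fitted model and data are loaded, `summary(protein.lm)$r.squared` and `cor(protein$Gestation, protein$Protein)` should agree with these values up to rounding.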