Example

As an illustrative example, we will use the protein in pregnancy data, which we have used before to demonstrate some of these assumptions.

Protein in Pregnancy

Recall that these data were collected to investigate whether the level of protein in expectant mothers changes throughout pregnancy. Observations were taken on 19 healthy women, each at a different stage of pregnancy (gestation). We have already fitted a simple linear regression describing the relationship between the mothers’ protein levels and gestation length, and estimated the parameters of this model, as shown below.

#Read in the data (you can find the data csv file on Moodle)
protein<-read.csv("week5/Lecture9/PROTEIN.csv",header=TRUE)

#Fit a simple linear regression
protein.lm<-lm(Protein~Gestation,data=protein)

#Print summary of model
summary(protein.lm)
## 
## Call:
## lm(formula = Protein ~ Gestation, data = protein)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16853 -0.08720 -0.01009  0.08578  0.20422 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.201738   0.083363   2.420    0.027 *  
## Gestation   0.022844   0.003295   6.934 2.42e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1151 on 17 degrees of freedom
## Multiple R-squared:  0.7388, Adjusted R-squared:  0.7234 
## F-statistic: 48.08 on 1 and 17 DF,  p-value: 2.416e-06
#Print ANOVA table
anova(protein.lm)
## Analysis of Variance Table
## 
## Response: Protein
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Gestation  1 0.63667 0.63667  48.076 2.416e-06 ***
## Residuals 17 0.22513 0.01324                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Data: \((y_i,x_i), \quad i=1,\dots,n\)

\(y_i\), protein level of mother \(i\)

\(x_i\), gestation (in weeks) for mother \(i\)

Model: \(\mathrm{E}(y_i)=0.2017 + 0.0228x_i\)
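To make the fitted equation concrete, here is a minimal sketch that evaluates it at a single gestation value. The coefficients are copied from the summary output above; the helper name `expected_protein` and the choice of 20 weeks are illustrative assumptions only, and any value used in practice should lie within the range of gestations actually observed.

```r
# Coefficient estimates copied from the summary() output above.
b0 <- 0.201738  # intercept
b1 <- 0.022844  # slope: change in expected protein per extra week of gestation

# Hypothetical helper (not part of the original analysis):
# expected protein level at a given gestation in weeks.
expected_protein <- function(gestation_weeks) {
  b0 + b1 * gestation_weeks
}

# 20 weeks is an arbitrary value chosen for illustration.
expected_protein(20)  # about 0.659

# With the fitted model object available, an equivalent call would be:
# predict(protein.lm, newdata = data.frame(Gestation = 20))
```

The slope is read as the change in expected protein level per additional week of gestation, so two women whose gestations differ by one week are expected to differ in protein level by about 0.023.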

#Plot the response variable protein against explanatory variable gestation 
#with fitted line.
library(ggplot2)
ggplot(protein, aes(x = Gestation, y = Protein)) +
  geom_point(size = 3.2, alpha = 0.4, col = "blue") +
  ggtitle("Protein in Pregnancy") +
  #linewidth needs ggplot2 >= 3.4.0; use size = 2 on older versions
  geom_smooth(method = "lm", fullrange = TRUE, color = "black", linewidth = 2, se = FALSE)

In addition, the ANOVA table is

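| Component | Degrees of freedom (df) | Sum of squares (SS) | Mean squares (MS) | F value |
|-----------|-------------------------|---------------------|-------------------|---------|
| Model     | 1                       | 0.6367              | 0.63667           | 48.076  |
| Residual  | 17                      | 0.2251              | 0.01324           |         |
| Total     | 18                      | 0.8618              |                   |         |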

and so we can calculate

\[\begin{aligned} R^2&=1-\frac{RSS}{TSS}\\ &= 1 - \frac{0.22513}{0.8618}\\ &= 0.7388\\ \\ \\ r&=\sqrt{0.7388}\\ &=0.8595. \end{aligned}\]

Because the estimated slope is positive, we take the positive square root. Hence the correlation between gestation and protein is 0.86, and gestation explains 73.9% of the variation in protein level among these healthy pregnant women. On this basis, you may argue that the fitted model provides an adequate fit to the data.
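The hand calculation of \(R^2\) and \(r\) can be checked in R. The sketch below is self-contained: it uses only the sums of squares and degrees of freedom printed in the ANOVA output, and the variable names are illustrative assumptions rather than objects from the original analysis.

```r
# Sums of squares and degrees of freedom copied from the anova() output above.
ss_model <- 0.63667  # model (Gestation) sum of squares
ss_resid <- 0.22513  # residual sum of squares (RSS)
df_model <- 1
df_resid <- 17

# The total sum of squares (TSS) is the sum of the two components.
ss_total <- ss_model + ss_resid

# Mean squares and the F statistic.
ms_model <- ss_model / df_model
ms_resid <- ss_resid / df_resid
f_value  <- ms_model / ms_resid

# R-squared, and the correlation as its positive square root
# (positive because the estimated slope is positive).
r_squared <- 1 - ss_resid / ss_total
r <- sqrt(r_squared)

round(c(TSS = ss_total, F = f_value, R2 = r_squared, r = r), 4)

# With the fitted objects available, the same quantities could be read off as:
# summary(protein.lm)$r.squared
# cor(protein$Gestation, protein$Protein)
```

The rounded values should agree with the table and the summary output: a total sum of squares of 0.8618, an F value of about 48.08, \(R^2 = 0.7388\) and \(r = 0.8595\).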