7.3 Predicting a regression line

To fit a regression line in R, we use the command lm(), where lm stands for ‘linear model’. lm() takes a formula as its first argument. As we’ve seen before in R, a formula is indicated by a tilde (~). The tilde says ‘predict what’s on the left-hand side of the tilde using what’s on the right-hand side’. So the outcome variable goes on the left-hand side and the exposure variable goes on the right-hand side. Because the formula, not the data, is the first (leftmost) argument, we cannot pipe the data in directly. Instead, we use a full stop (.) to indicate where the piped data should go.

#--- Fit a regression line and store the result
mod1 <- bab9 %>% lm(bweight ~ gestwks, data = .)

#--- Same code, no pipe
#mod1 <- lm(bweight ~ gestwks, data = bab9)

#--- Get a summary of the regression
summary(mod1)
## 
## Call:
## lm(formula = bweight ~ gestwks, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1810   -285     -7    283   1248 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4865.25     290.08   -16.8   <2e-16 ***
## gestwks       206.64       7.48    27.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 441 on 639 degrees of freedom
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.543 
## F-statistic:  762 on 1 and 639 DF,  p-value: <2e-16

#--- Get confidence intervals
confint(mod1)
##             2.5 % 97.5 %
## (Intercept) -5435  -4296
## gestwks       192    221
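
To see where these intervals come from, the sketch below rebuilds them by hand from the estimates and standard errors printed in the summary above. It assumes the usual t-based interval (estimate plus or minus a t quantile times the standard error) on the 639 residual degrees of freedom. The object names (est, se, df_resid, t_crit, ci) are illustrative and not part of the practical.

```r
#--- Rebuild the 95% confidence intervals from the summary output
# Estimates and standard errors copied from the Coefficients table
est <- c("(Intercept)" = -4865.25, gestwks = 206.64)
se  <- c("(Intercept)" = 290.08,   gestwks = 7.48)
df_resid <- 639                      # residual degrees of freedom

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper limits: estimate -/+ critical value x standard error
ci <- cbind("2.5 %"  = est - t_crit * se,
            "97.5 %" = est + t_crit * se)
round(ci)
```

Rounded to whole numbers, the limits should agree with the confint() output above.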

Exercise 16.2: What are the estimated values of the two parameters? Write down the regression equation in the form used in the lecture notes or this practical, using the above estimates in the equation.

The summary contains the relevant information. In the Coefficients table, the Estimate column gives the estimated values of \(A\), the intercept (-4865.25), and of \(B\), the slope for gestwks (206.64), so the fitted regression equation is bweight = -4865.25 + 206.64 × gestwks. Confidence intervals can be obtained with the confint() command. Note that a t-test is also conducted automatically for each coefficient. It tests the null hypothesis that the coefficient is equal to zero, and the p-value quantifies the strength of evidence against this hypothesis.
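
The t value and p-value columns can be reproduced from the estimates and standard errors alone. The following is a minimal sketch using the numbers printed in the summary; the object names (est, se, df_resid, t_val, p_val) are illustrative and not part of the practical.

```r
#--- Rebuild the t statistics and p-values from the summary output
# Estimates and standard errors copied from the Coefficients table
est <- c("(Intercept)" = -4865.25, gestwks = 206.64)
se  <- c("(Intercept)" = 290.08,   gestwks = 7.48)
df_resid <- 639                      # residual degrees of freedom

# t statistic: how many standard errors the estimate lies from zero
t_val <- est / se
round(t_val, 1)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)

# Both are below the smallest value that summary() prints
p_val < 2e-16
```

A t statistic this far from zero means the data are very hard to reconcile with a true slope of zero, which is why the summary reports the p-value only as <2e-16.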

Exercise 16.3: What are the standard errors of the two parameter estimates? How strong is the evidence that there exists a linear association between the two variables?

We can use this model to predict birthweight from gestational age with R’s predict() command. Called without new data, predict() returns the fitted value for each observation used to fit the model. These values lie on the black line that we graphed earlier in the practical.

#--- Generate predicted values and observe the first six
bab9$predicted <- predict(mod1)
head(bab9$predicted)
## [1] 2933 3225 2516 3254 3066 2958
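
The sketch below checks that predict() simply evaluates the regression equation, and shows how to predict at gestational ages of our choosing. Because bab9 is not available outside the practical, it assumes a small simulated stand-in data frame (sim) with the same two column names. The names sim, mod_sim and by_hand are hypothetical, and the simulated numbers are not the practical's data.

```r
#--- Check predict() against the regression equation
# ASSUMPTION: simulated stand-in for bab9, not the real data
set.seed(1)
sim <- data.frame(gestwks = runif(50, min = 30, max = 42))
sim$bweight <- -4865 + 207 * sim$gestwks + rnorm(50, sd = 440)

mod_sim <- lm(bweight ~ gestwks, data = sim)

# Fitted values by hand: intercept + slope x gestwks
by_hand <- coef(mod_sim)[["(Intercept)"]] +
  coef(mod_sim)[["gestwks"]] * sim$gestwks

# Should be TRUE: predict() returns the same values
all.equal(unname(predict(mod_sim)), by_hand)

# Predict for new gestational ages via the newdata argument
predict(mod_sim, newdata = data.frame(gestwks = c(36, 40)))
```

With the estimates from mod1, the same arithmetic gives a predicted birthweight of -4865.25 + 206.64 × 40 = 3400.35 at 40 weeks of gestation.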