2 Examples

2.1 Example 1: Simple linear regression

The admissions committee of a comprehensive state university selected at random the records of 200 second-semester freshmen. The results, first-semester college GPA and SAT scores, are stored in the data frame GRADES. The admissions committee wants to study the linear relationship between first-semester college grade point average (gpa) and scholastic aptitude test (sat) scores.


(a) i. Create a scatterplot of the data to investigate the relationship between gpa and sat scores.

We can use the function plot(). Which variables do we want on the x- and y-axes?

plot(gpa ~ sat, data = GRADES, xlab = "SAT score", ylab = "GPA")
  ii. Look at your plot. What trends do you see?


(b) Obtain the least squares estimates for β0 and β1, and state the estimated regression function using

  i. Summation notation

We can store the predictor and the response in the vectors x and Y. These can then be used in the formulas for β0 and β1 as given in the textbook.
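For reference, the summation formulas that the R code in this part evaluates can be written as follows. The notation b_0 and b_1 for the estimates is taken from the variable names in the code; the textbook may use different symbols.

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{x}
```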

Y <- GRADES$gpa
x <- GRADES$sat
b1 <- sum((x - mean(x)) * (Y - mean(Y)))/sum((x - mean(x))^2)
b0 <- mean(Y) - b1 * mean(x)
c(b0, b1)
[1] -1.19206381  0.00309427
  ii. The R function lm(), to verify your answer in (b) i.

The lm() function needs a model formula of the form response ~ predictor and the data frame in which R should look for those variables.

> model.lm <- lm(gpa ~ sat, data = GRADES)
> coef(model.lm)
(Intercept)         sat
-1.19206381  0.00309427

The estimated regression function is therefore:
\hat{Y}_i = -1.1921 + 0.0031 x_i
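The agreement between the summation formulas and lm() is not specific to GRADES. As a minimal sketch, the code below repeats both calculations on simulated data and checks that they match. The data frame sim and the values used to generate it are made up for illustration and are not the GRADES data.

```r
# Simulated stand-in for GRADES; these values are invented for illustration only
set.seed(1)
sat <- round(runif(200, min = 800, max = 1400))
gpa <- -1.2 + 0.003 * sat + rnorm(200, sd = 0.4)
sim <- data.frame(sat = sat, gpa = gpa)

# Least squares estimates from the summation formulas
x <- sim$sat
Y <- sim$gpa
b1 <- sum((x - mean(x)) * (Y - mean(Y))) / sum((x - mean(x))^2)
b0 <- mean(Y) - b1 * mean(x)

# The same estimates from lm()
fit <- lm(gpa ~ sat, data = sim)

# TRUE when the two approaches agree up to floating-point tolerance
all.equal(c(b0, b1), unname(coef(fit)))
```

unname() drops the labels "(Intercept)" and "sat" so that only the numeric values are compared.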


(c) What is the correct interpretation of the regression function?


(d) Use R to calculate the point estimate of the change in the mean GPA when the SAT score increases by 50 points:

> b1*50

[1] 0.1547135
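Another way to see why multiplying the slope by 50 gives the answer is to compare fitted values at two SAT scores that are 50 points apart. The sketch below does this with predict() on simulated data. The data frame sim and the scores 1000 and 1050 are arbitrary choices for illustration, not values from GRADES.

```r
# Simulated stand-in for GRADES; these values are invented for illustration only
set.seed(2)
sim <- data.frame(sat = round(runif(200, min = 800, max = 1400)))
sim$gpa <- -1.2 + 0.003 * sim$sat + rnorm(200, sd = 0.4)
fit <- lm(gpa ~ sat, data = sim)

# Fitted mean GPA at two SAT scores that are 50 points apart
new_scores <- data.frame(sat = c(1000, 1050))
fitted_means <- predict(fit, newdata = new_scores)

# The difference in fitted means equals 50 times the estimated slope
diff(fitted_means)
50 * coef(fit)["sat"]
```

Because the fitted line is straight, the same difference is obtained for any pair of scores 50 points apart.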

2.2 Example 2: Multiple linear regression

In part (b) ii of Example 1, the function lm() was used to find estimates of \beta_0 and \beta_1 for a simple linear regression model. To fit a multiple linear regression model with lm(), list the predictors on the right side of the tilde (~) operator, separated by +. The data frame HSWRESTLER contains the body fat measurements of 78 high school wrestlers. Try to create a multiple linear regression model for regressing hwfat (hydrostatic fat, the response variable) onto abs (abdominal fat) and triceps (tricep fat).

The R code below stores the multiple linear regression model for regressing hwfat (hydrostatic fat) onto abs (abdominal fat) and triceps (tricep fat). The estimated coefficients for \beta_0, \beta_1, and \beta_2 determine the plane of best fit for the given values.

# Variable names are lowercase, matching the description of the data above
hsw.lm <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
coef(summary(hsw.lm))  # lm coefficients

The estimated coefficients for \beta_0, \beta_1, and \beta_2 are

\beta_0:

\beta_1 = \beta_{ABS}:

\beta_2 = \beta_{TRICEPS}:

Interpret what each coefficient means:

A one unit increase in abdominal fat, with tricep fat held fixed, is associated with an increase of roughly ____ in mean hydrostatic fat. Similarly, a one unit increase in tricep fat, with abdominal fat held fixed, is associated with an increase of roughly ____ in mean hydrostatic fat.
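The interpretation of each slope as the change in the fitted mean when one predictor goes up by one unit and the other predictor stays the same can be checked numerically. The sketch below uses simulated data, so the data frame sim, the values used to generate it, and the evaluation points are assumptions for illustration and do not reproduce the HSWRESTLER estimates.

```r
# Simulated stand-in for HSWRESTLER; these values are invented for illustration only
set.seed(3)
n <- 78
sim <- data.frame(abs = runif(n, min = 5, max = 40),
                  triceps = runif(n, min = 5, max = 35))
sim$hwfat <- 1 + 0.3 * sim$abs + 0.4 * sim$triceps + rnorm(n, sd = 2)

# Predictors go on the right side of ~, separated by +
fit2 <- lm(hwfat ~ abs + triceps, data = sim)
coef(fit2)

# Raise abs by one unit while holding triceps fixed at 15
at <- data.frame(abs = c(20, 21), triceps = c(15, 15))
change_abs <- diff(predict(fit2, newdata = at))

# The change in the fitted mean equals the estimated coefficient of abs
all.equal(unname(change_abs), unname(coef(fit2)["abs"]))
```

Swapping the roles of abs and triceps in the data frame at gives the corresponding check for the triceps coefficient.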