2 Examples

2.1 Example 1: Simple linear regression

The admissions committee of a comprehensive state university selected at random the records of 200 second-semester freshmen. The results, first-semester college GPA and SAT scores, are stored in the data frame GRADES. The admissions committee wants to study the linear relationship between first-semester college grade point average (gpa) and scholastic aptitude test (sat) scores.


(a) i. Create a scatterplot of the data to investigate the relationship between gpa and sat scores.

We can use the function plot(). Which variables do we want on the x- and y-axes?

plot(gpa ~ sat, data = GRADES, xlab = "SAT score", ylab = "GPA")

  ii. Look at your plot. What trends do you see?


(b) Obtain the least squares estimates for \(\beta_0\) and \(\beta_1\), and state the estimated regression function using

  i. Summation notation

We can store the predictor (sat) and the response (gpa) in the vectors x and Y. These can then be used in the formulas for \(\beta_0\) and \(\beta_1\) as given in the textbook.
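For reference, the least squares estimates in summation notation take the standard form below. The textbook's notation may differ slightly (the hats and the index \(n\) are assumptions about its conventions), but these are the expressions the R code evaluates term by term.

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\,\bar{x}
\]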

Y <- GRADES$gpa
x <- GRADES$sat
b1 <- sum((x - mean(x)) * (Y - mean(Y)))/sum((x - mean(x))^2)
b0 <- mean(Y) - b1 * mean(x)
c(b0, b1)
[1] -1.19206381  0.00309427

  ii. Use the R function lm() to verify your answer in (b) i.

The lm() function needs a model formula of the form response ~ predictor and the data frame that contains those variables.

> model.lm <- lm(gpa ~ sat, data = GRADES)
> coef(model.lm)
(Intercept)         sat
-1.19206381  0.00309427

The estimated regression function is therefore:
\(\hat{Y}_i = -1.1921 + 0.0031x_i\)
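The agreement between the two approaches can also be checked programmatically instead of by eye. The sketch below is only an illustration: it uses R's built-in `cars` data frame as a stand-in for GRADES (an assumption, so that it runs without any course package), with `speed` as the predictor and `dist` as the response.

```r
# Stand-in data: the built-in `cars` data frame (speed, dist) replaces GRADES,
# so this runs without loading any course package.
x <- cars$speed
Y <- cars$dist

# Least squares estimates from the summation formulas
b1 <- sum((x - mean(x)) * (Y - mean(Y))) / sum((x - mean(x))^2)
b0 <- mean(Y) - b1 * mean(x)

# The same estimates from lm()
fit <- lm(dist ~ speed, data = cars)

# unname() drops the "(Intercept)" and "speed" labels before comparing
agree <- isTRUE(all.equal(c(b0, b1), unname(coef(fit))))
agree
```

Here all.equal() compares the values up to a small numerical tolerance, which is usually a safer check than == for floating-point results.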


(c) What is the correct interpretation of the regression function?


(d) Use R to calculate the point estimate of the change in the mean GPA when the SAT score increases by 50 points:

> b1*50

[1] 0.1547135
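Multiplying the slope by the size of the increase works because the fitted line changes by the same amount per unit of x, whatever the starting value. A minimal sketch of that idea follows, again using the built-in `cars` data as a stand-in for GRADES; the increase of 5 units and the starting values 10 and 15 are arbitrary choices for illustration.

```r
# Stand-in data: built-in `cars`; `speed` plays the role of sat, `dist` of gpa.
fit <- lm(dist ~ speed, data = cars)
slope <- unname(coef(fit)["speed"])

# Predicted means at two predictor values that are 5 units apart
# (10 and 15 are arbitrary illustration values)
new_x <- data.frame(speed = c(10, 15))
pred <- predict(fit, newdata = new_x)

# Change in the predicted mean, computed two ways
change_from_predictions <- unname(pred[2] - pred[1])
change_from_slope <- slope * 5

all.equal(change_from_predictions, change_from_slope)
```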

2.2 Example 2: Multiple linear regression

In (b) ii of Example 1, the function lm() was used to estimate \(\beta_0\) and \(\beta_1\) for a simple linear regression model. To fit a multiple linear regression model with lm(), list the predictors on the right side of the tilde (~) operator, separated by +. The data frame HSWRESTLER contains the body fat measurements of 78 high school wrestlers. Try to create a multiple linear regression model for regressing hwfat (hydrostatic fat, the response variable) onto abs (abdominal fat) and triceps (tricep fat).

The R code below stores the multiple linear regression model for regressing hwfat (hydrostatic fat) onto abs (abdominal fat) and triceps (tricep fat). The estimated coefficients for \(\beta_0\), \(\beta_1\), and \(\beta_2\) determine the plane of best fit for the given values.

hsw.lm <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)  # variable names as written in the text; R is case-sensitive
coef(summary(hsw.lm))  # lm coefficients

The estimated coefficients for \(\beta_0\), \(\beta_1\), and \(\beta_2\) are

\(\beta_0\):

\(\beta_1 = \beta_{ABS}\):

\(\beta_2 = \beta_{TRICEPS}\):

Interpret what each coefficient means:

A one-unit increase in abdominal fat, with tricep fat held fixed, is associated with a roughly ____ increase in mean hydrostatic fat. Similarly, a one-unit increase in tricep fat, with abdominal fat held fixed, is associated with a roughly ____ increase in mean hydrostatic fat.
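The "held fixed" reading of a coefficient can be checked numerically. The sketch below is an illustration only: it uses the built-in `mtcars` data as a stand-in for HSWRESTLER (an assumption, so that it runs without the course data), with `mpg`, `wt`, and `hp` in the roles of hwfat, abs, and triceps. The two comparison rows are hypothetical values chosen for the example.

```r
# Stand-in data: built-in `mtcars`; mpg, wt and hp replace hwfat, abs and triceps.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# coef(summary(fit)) is a matrix; its "Estimate" column holds the coefficients
est <- coef(summary(fit))[, "Estimate"]
est

# Two hypothetical cars that differ by one unit in wt and share the same hp
# (wt = 3, 4 and hp = 110 are arbitrary illustration values)
two_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
pred <- predict(fit, newdata = two_cars)

# The difference in predicted response equals the wt coefficient
all.equal(unname(pred[2] - pred[1]), unname(est["wt"]))
```

The same comparison with hp changed by one unit and wt held fixed would recover the hp coefficient.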