2 Examples
2.1 Example 1: Simple linear regression
The admissions committee of a comprehensive state university selected at random the records of 200 second-semester freshmen. The results, first-semester college GPA and SAT scores, are stored in the data frame GRADES. The admissions committee wants to study the linear relationship between first-semester college grade point average (gpa) and scholastic aptitude test (sat) scores.
(a) i. Create a scatterplot of the data to investigate the relationship between gpa and sat scores.
We can use the function plot(). Which variables do we want on the x and y axes?
plot(gpa ~ sat, data = GRADES, xlab = "SAT score", ylab = "GPA")
- Look at your plot. What trends do you see?
(b) Obtain the least squares estimates for \(\beta_0\) and \(\beta_1\), and state the estimated regression function using
- Summation notation
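We can define x and Y as our variables: x holds the SAT scores (the predictor) and Y holds the GPAs (the response). These can then be used in the formulas for the least squares estimates of \(\beta_0\) and \(\beta_1\) as given in the textbook.
Y <- GRADES$gpa  # response: first-semester GPA
x <- GRADES$sat  # predictor: SAT score
b1 <- sum((x - mean(x)) * (Y - mean(Y)))/sum((x - mean(x))^2)  # slope
b0 <- mean(Y) - b1 * mean(x)  # intercept
c(b0, b1)
[1] -1.19206381 0.00309427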
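For reference, the code above evaluates the usual least squares formulas. They are written here in summation notation, with \(b_0\) and \(b_1\) denoting the estimates of \(\beta_0\) and \(\beta_1\); the textbook's own notation may differ slightly.

```latex
\[
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{x}
\]
```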
- Using the R function lm() to verify your answer in (b) i.
The lm() function needs a model formula, with the response on the left of the tilde (~) and the predictor on the right, and the data frame that contains those variables.
> model.lm <- lm(gpa ~ sat, data = GRADES)
> coef(model.lm)
(Intercept)         sat
-1.19206381  0.00309427
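The agreement between the summation formulas and lm() does not depend on the GRADES data. The sketch below checks it on simulated values, so it runs without the course data set. The names sat_sim and gpa_sim, the seed, and the generating coefficients are all made up for illustration.

```r
# Simulated stand-in for GRADES; every value here is hypothetical.
set.seed(1)
sat_sim <- round(runif(200, min = 800, max = 1500))
gpa_sim <- -1.2 + 0.003 * sat_sim + rnorm(200, sd = 0.4)

# Least squares estimates from the summation formulas
b1_sum <- sum((sat_sim - mean(sat_sim)) * (gpa_sim - mean(gpa_sim))) /
  sum((sat_sim - mean(sat_sim))^2)
b0_sum <- mean(gpa_sim) - b1_sum * mean(sat_sim)

# The same estimates from lm(), with the names dropped for comparison
fit_sim <- lm(gpa_sim ~ sat_sim)
b_lm <- unname(coef(fit_sim))

# TRUE if the two approaches agree up to numerical tolerance
agree <- isTRUE(all.equal(c(b0_sum, b1_sum), b_lm))
print(agree)
```

Using all.equal() rather than == avoids spurious mismatches caused by floating-point rounding.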
The estimated regression function is therefore:
\(\hat{Y}_i = -1.1921 + 0.0031x_i\)
(c) What is the correct interpretation of the regression function?
(d) Use R to calculate the point estimate of the change in the mean GPA when the SAT score increases by 50 points:
> b1*50
[1] 0.1547135
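The same quantity can be obtained as the difference between two fitted means whose SAT scores are 50 points apart, because the intercept cancels in the subtraction. The following is a minimal sketch on simulated data. The data frame dat, the seed, and the two SAT scores 1000 and 1050 are assumptions chosen for illustration, not values from GRADES.

```r
# Simulated data with the same variable names as GRADES; values are hypothetical.
set.seed(2)
dat <- data.frame(sat = round(runif(200, min = 800, max = 1500)))
dat$gpa <- -1.2 + 0.003 * dat$sat + rnorm(200, sd = 0.4)
fit <- lm(gpa ~ sat, data = dat)

# Approach 1: slope estimate times 50
change_slope <- unname(coef(fit)["sat"]) * 50

# Approach 2: difference in fitted mean GPA at two SAT scores 50 points apart
new_scores <- data.frame(sat = c(1000, 1050))
fitted_means <- predict(fit, newdata = new_scores)
change_pred <- unname(fitted_means[2] - fitted_means[1])

# TRUE if both approaches give the same change
same_change <- isTRUE(all.equal(change_slope, change_pred))
print(c(change_slope, change_pred))
```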
2.2 Example 2: Multiple linear regression
In (b) ii, the function lm() was used to find estimates for \(\beta_0\) and \(\beta_1\) for a simple linear regression model. To fit a multiple linear regression model with lm(), list the predictors on the right side of the tilde (~) operator, separated by +. The data frame HSWRESTLER contains the body fat measurements of 78 high school wrestlers. Try to create a multiple linear regression model for regressing hwfat (hydrostatic fat, the response variable) onto abs (abdominal fat) and triceps (tricep fat).
The R code below stores the multiple linear regression model for regressing hwfat (hydrostatic fat) onto abs (abdominal fat) and triceps (tricep fat). The estimated coefficients for \(\beta_0\), \(\beta_1\), and \(\beta_2\) determine the plane of best fit for the given values.
hsw.lm <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
coef(summary(hsw.lm)) # lm coefficients
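To make the output of coef(summary()) easier to read, the sketch below fits a two-predictor model to simulated data and inspects the resulting matrix. The data frame sim, the variables y, x1, and x2, and the generating coefficients are hypothetical stand-ins, not the HSWRESTLER data.

```r
# Simulated data with two predictors; every value here is hypothetical.
set.seed(3)
n <- 78
sim <- data.frame(x1 = rnorm(n, mean = 15, sd = 5),
                  x2 = rnorm(n, mean = 12, sd = 4))
sim$y <- 3 + 0.3 * sim$x1 + 0.4 * sim$x2 + rnorm(n, sd = 2)

# Predictors are listed on the right of ~, separated by +
fit2 <- lm(y ~ x1 + x2, data = sim)
coef_table <- coef(summary(fit2))

# One row per coefficient; columns hold the estimate, its standard error,
# the t statistic, and the p-value
print(dimnames(coef_table))

# The "Estimate" column holds the same values as coef(fit2)
estimates <- coef_table[, "Estimate"]
print(estimates)
```

The first row corresponds to \(\beta_0\) and the remaining rows follow the order in which the predictors appear in the formula.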
The estimated coefficients for \(\beta_0\), \(\beta_1\), and \(\beta_2\) are
\(\beta_0\):
\(\beta_1 = \beta_{ABS}\):
\(\beta_2 = \beta_{TRICEPS}\):
Interpret what each coefficient means:
A one-unit increase in abdominal fat, holding tricep fat fixed, is associated with an increase of roughly ____ in mean hydrostatic fat. Similarly, a one-unit increase in tricep fat, holding abdominal fat fixed, is associated with an increase of roughly ____ in mean hydrostatic fat.
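Each slope in a multiple regression describes the change in the fitted mean response for a one-unit increase in that predictor while the other predictor stays at the same value. The sketch below illustrates this on simulated data. The names sim, y, x1, x2 and the two hypothetical cases are assumptions for illustration only.

```r
# Simulated data with two predictors; every value here is hypothetical.
set.seed(4)
n <- 78
sim <- data.frame(x1 = rnorm(n, mean = 15, sd = 5),
                  x2 = rnorm(n, mean = 12, sd = 4))
sim$y <- 3 + 0.3 * sim$x1 + 0.4 * sim$x2 + rnorm(n, sd = 2)
fit2 <- lm(y ~ x1 + x2, data = sim)

# Two hypothetical cases: x1 differs by one unit, x2 is held at the same value
cases <- data.frame(x1 = c(20, 21), x2 = c(10, 10))
fitted_means <- predict(fit2, newdata = cases)

# The change in the fitted mean equals the estimated coefficient of x1
change_in_mean <- unname(fitted_means[2] - fitted_means[1])
slope_x1 <- unname(coef(fit2)["x1"])
print(c(change_in_mean, slope_x1))
```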