Chapter 5 Linear Regression (FQA, OSCM)
5.1 Simple Linear Regression
Simple linear regression allows us to model a continuous variable \(Y\) as a function of a single \(X\) variable. The assumed underlying relationship is:
\[Y = \beta_{0} + \beta_{1}X + \epsilon\]
Using the method of least squares, we can estimate \(\beta_{0}\) and \(\beta_{1}\) from our sample data.
Below we use the lm() function to model the Salary variable as a function of the Age variable from data. To view the coefficient estimates and other diagnostic information about our model, we apply the summary() function to modelSimple:
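Reconstructing the call from the Call: line in the output below, the fitting code looks like this (data is the chapter's employee data set):

```r
# Fit a simple linear regression of Salary on Age
modelSimple <- lm(Salary ~ Age, data = data)

# View coefficient estimates and other diagnostics
summary(modelSimple)
```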
##
## Call:
## lm(formula = Salary ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103272 -21766 2428 23138 90680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67133.70 4455.17 15.07 <2e-16 ***
## Age 2026.83 98.07 20.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32630 on 918 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.3175, Adjusted R-squared: 0.3168
## F-statistic: 427.1 on 1 and 918 DF, p-value: < 2.2e-16
We can extract our estimates of the coefficients (\(\beta_{0}\) and \(\beta_{1}\)) with the coef() function.
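For example:

```r
# Extract the estimated intercept and slope
coef(modelSimple)
```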
## (Intercept) Age
## 67133.703 2026.828
The confint() function provides a 95% confidence interval for the intercept (\(\beta_{0}\)) and slope (\(\beta_{1}\)).
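For example:

```r
# 95% confidence intervals for the intercept and slope
confint(modelSimple)
```

By default confint() uses a 95% level; other levels can be requested with the level argument, e.g. confint(modelSimple, level = 0.99).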
## 2.5 % 97.5 %
## (Intercept) 58390.197 75877.208
## Age 1834.364 2219.292
We can visualize our estimated regression line with the abline() function. First we need to create a scatterplot of Salary and Age using plot(). If we then run the abline() function on our model, it will plot the estimated regression line on top of the scatterplot.
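A minimal sketch of these two steps, assuming the Salary and Age columns live in data:

```r
# Scatterplot of Salary against Age
plot(Salary ~ Age, data = data, xlab = "Age", ylab = "Salary")

# Overlay the fitted regression line from modelSimple
abline(modelSimple)
```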
5.2 Multiple Linear Regression
Multiple linear regression allows us to model a continuous variable \(Y\) as a function of multiple \(X\) variables. The assumed underlying relationship is:
\[Y = \beta_{0} + \beta_{1}X_{1} +\beta_{2}X_{2} + ... + \beta_{k}X_{k} + \epsilon\]
We use the same functions as the previous section, but now specify multiple \(X\) variables in our call to lm().
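Reconstructing the call from the output below:

```r
# Fit a multiple linear regression with four predictors
modelMultiple <- lm(Salary ~ Age + Rating + Gender + Degree, data = data)
summary(modelMultiple)
```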
##
## Call:
## lm(formula = Salary ~ Age + Rating + Gender + Degree, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64403 -16227 352 15917 70513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9481.98 4358.57 2.175 0.029850 *
## Age 2006.08 70.08 28.627 < 2e-16 ***
## Rating 5181.07 401.49 12.905 < 2e-16 ***
## GenderMale 8220.11 1532.33 5.364 1.03e-07 ***
## DegreeBachelor's 23588.25 2452.07 9.620 < 2e-16 ***
## DegreeHigh School -9477.56 2444.09 -3.878 0.000113 ***
## DegreeMaster's 31211.02 2437.29 12.806 < 2e-16 ***
## DegreePh.D 44253.05 2434.40 18.178 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23220 on 912 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.6568, Adjusted R-squared: 0.6541
## F-statistic: 249.3 on 7 and 912 DF, p-value: < 2.2e-16
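The coefficient estimates and confidence intervals shown below come from the same coef() and confint() calls as before, now applied to modelMultiple:

```r
# Point estimates and 95% confidence intervals for all coefficients
coef(modelMultiple)
confint(modelMultiple)
```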
## (Intercept) Age Rating GenderMale
## 9481.982 2006.083 5181.073 8220.111
## DegreeBachelor's DegreeHigh School DegreeMaster's DegreePh.D
## 23588.253 -9477.556 31211.018 44253.050
## 2.5 % 97.5 %
## (Intercept) 927.9903 18035.974
## Age 1868.5508 2143.615
## Rating 4393.1220 5969.023
## GenderMale 5212.7994 11227.422
## DegreeBachelor's 18775.8963 28400.609
## DegreeHigh School -14274.2509 -4680.860
## DegreeMaster's 26427.6608 35994.376
## DegreePh.D 39475.3712 49030.729
5.3 Diagnostics
To test the normality assumption of our multiple linear regression model, we can look at a histogram and boxplot of the model’s residuals:
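A minimal sketch of these two plots, using the residuals() extractor:

```r
res <- residuals(modelMultiple)

# Histogram of residuals: roughly bell-shaped if normality holds
hist(res, main = "Histogram of Residuals")

# Boxplot of residuals: roughly symmetric around zero if normality holds
boxplot(res, main = "Boxplot of Residuals")
```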
To check for linearity and constant variance, we can create a scatterplot of the residuals versus the fitted values of the model:
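One way to produce this plot, with a horizontal reference line at zero added for readability:

```r
# Residuals vs. fitted values: look for random scatter around zero
plot(fitted(modelMultiple), residuals(modelMultiple),
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 2)
```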
Because we have nested models (modelSimple is a subset of modelMultiple), we can run an F-test using the anova() command:
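For example:

```r
# Partial F-test comparing the nested models
anova(modelSimple, modelMultiple)
```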
## Analysis of Variance Table
##
## Model 1: Salary ~ Age
## Model 2: Salary ~ Age + Rating + Gender + Degree
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 918 9.7755e+11
## 2 912 4.9163e+11 6 4.8592e+11 150.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value suggests that the additional variables in modelMultiple (Rating, Gender, and Degree) do provide predictive ability.
We can use the AIC() function to compare the AIC of the models:
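For example:

```r
# Lower AIC indicates the preferred model
AIC(modelSimple, modelMultiple)
```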
## df AIC
## modelSimple 3 21738.07
## modelMultiple 9 21117.74
5.4 Prediction
To apply our models to new observations, we can use the predict() function. Suppose we want to predict the salary of a female employee who is 50 years old, has a rating of eight, and holds a Master's degree. We first store this observation in a new data frame, then pass it to predict() through the newdata argument.
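A sketch of this, with the Gender and Degree values coded to match the factor levels implied by the model output above ("Female" is the baseline Gender level, and "Master's" is one of the Degree levels):

```r
# New observation: 50-year-old female, rating of 8, Master's degree
newData <- data.frame(Age = 50, Rating = 8,
                      Gender = "Female", Degree = "Master's")

# Prediction from the simple model (uses only Age)
predict(modelSimple, newdata = newData)

# Prediction from the multiple model (uses all four predictors)
predict(modelMultiple, newdata = newData)
```

The first value printed below is the simple model's prediction; the second is the multiple model's.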
## 1
## 168475.1
## 1
## 182445.7