Chapter 5 Linear Regression (FQA, OSCM)

5.1 Simple Linear Regression

Simple linear regression allows us to model a continuous variable \(Y\) as a function of a single \(X\) variable. The assumed underlying relationship is:

\[Y = \beta_{0} + \beta_{1}X + \epsilon\]

Using the method of least squares, we can estimate \(\beta_{0}\) and \(\beta_{1}\) from our sample data.

Below we use the lm() function to model the Salary variable as a function of the Age variable from the data frame data, saving the fitted model as modelSimple.
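A minimal sketch of that call, assuming our data frame is named data (matching the Call line in the output below):

modelSimple <- lm(Salary ~ Age, data = data)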

To view the coefficient estimates and other diagnostic information about our model, we apply the summary() function to modelSimple:
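summary(modelSimple)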

## 
## Call:
## lm(formula = Salary ~ Age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103272  -21766    2428   23138   90680 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 67133.70    4455.17   15.07   <2e-16 ***
## Age          2026.83      98.07   20.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32630 on 918 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.3175, Adjusted R-squared:  0.3168 
## F-statistic: 427.1 on 1 and 918 DF,  p-value: < 2.2e-16

We can extract our estimates of the coefficients (\(\beta_{0}\) and \(\beta_{1}\)) with the coef() function.
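coef(modelSimple)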

## (Intercept)         Age 
##   67133.703    2026.828

By default, the confint() function provides a 95% confidence interval for each coefficient: here the intercept (\(\beta_{0}\)) and the slope (\(\beta_{1}\)).
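confint(modelSimple)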

##                 2.5 %    97.5 %
## (Intercept) 58390.197 75877.208
## Age          1834.364  2219.292

We can visualize our estimated regression line with the abline() function. First we create a scatterplot of Salary against Age using plot(). Passing our model to abline() then draws the fitted regression line on top of the scatterplot.
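A sketch of those two steps (the axis labels are optional additions):

# Scatterplot of the raw data
plot(Salary ~ Age, data = data, xlab = "Age", ylab = "Salary")
# Overlay the fitted regression line from modelSimple
abline(modelSimple)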

5.2 Multiple Linear Regression

Multiple linear regression allows us to model a continuous variable \(Y\) as a function of multiple \(X\) variables. The assumed underlying relationship is:

\[Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{k}X_{k} + \epsilon\]

We use the same functions as in the previous section, but now specify multiple \(X\) variables in our call to lm(), saving the result as modelMultiple.
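modelMultiple <- lm(Salary ~ Age + Rating + Gender + Degree, data = data)
summary(modelMultiple)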

## 
## Call:
## lm(formula = Salary ~ Age + Rating + Gender + Degree, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64403 -16227    352  15917  70513 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9481.98    4358.57   2.175 0.029850 *  
## Age                2006.08      70.08  28.627  < 2e-16 ***
## Rating             5181.07     401.49  12.905  < 2e-16 ***
## GenderMale         8220.11    1532.33   5.364 1.03e-07 ***
## DegreeBachelor's  23588.25    2452.07   9.620  < 2e-16 ***
## DegreeHigh School -9477.56    2444.09  -3.878 0.000113 ***
## DegreeMaster's    31211.02    2437.29  12.806  < 2e-16 ***
## DegreePh.D        44253.05    2434.40  18.178  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23220 on 912 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.6568, Adjusted R-squared:  0.6541 
## F-statistic: 249.3 on 7 and 912 DF,  p-value: < 2.2e-16
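As before, coef() extracts the coefficient estimates:

coef(modelMultiple)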
##       (Intercept)               Age            Rating        GenderMale 
##          9481.982          2006.083          5181.073          8220.111 
##  DegreeBachelor's DegreeHigh School    DegreeMaster's        DegreePh.D 
##         23588.253         -9477.556         31211.018         44253.050
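Likewise, confint() provides a 95% confidence interval for each coefficient:

confint(modelMultiple)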
##                         2.5 %    97.5 %
## (Intercept)          927.9903 18035.974
## Age                 1868.5508  2143.615
## Rating              4393.1220  5969.023
## GenderMale          5212.7994 11227.422
## DegreeBachelor's   18775.8963 28400.609
## DegreeHigh School -14274.2509 -4680.860
## DegreeMaster's     26427.6608 35994.376
## DegreePh.D         39475.3712 49030.729

5.3 Diagnostics

To check the normality assumption of our multiple linear regression model, we can look at a histogram and a boxplot of the model’s residuals:
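A sketch of those plots (the titles are optional additions):

res <- residuals(modelMultiple)
# Roughly symmetric and bell-shaped if normality holds
hist(res, main = "Histogram of Residuals")
# Centered near zero with few extreme outliers if normality holds
boxplot(res, main = "Boxplot of Residuals")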

To check for linearity and constant variance, we can create a scatterplot of the residuals versus the fitted values of the model:
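For example:

# A random, even scatter around zero supports linearity and constant variance
plot(fitted(modelMultiple), residuals(modelMultiple),
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0)  # reference line at zero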

Because we have nested models (the predictors in modelSimple are a subset of those in modelMultiple), we can run a partial F-test using the anova() function:
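anova(modelSimple, modelMultiple)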

## Analysis of Variance Table
## 
## Model 1: Salary ~ Age
## Model 2: Salary ~ Age + Rating + Gender + Degree
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    918 9.7755e+11                                   
## 2    912 4.9163e+11  6 4.8592e+11 150.23 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value suggests that the additional variables in modelMultiple (Rating, Gender, and Degree) provide significant explanatory power beyond Age alone.

We can use the AIC() function to compare the AIC of the models:
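AIC(modelSimple, modelMultiple)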

##               df      AIC
## modelSimple    3 21738.07
## modelMultiple  9 21117.74
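Lower AIC values indicate a better trade-off between fit and model complexity, so this comparison also favors modelMultiple.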

5.4 Prediction

To apply our models to new observations, we can use the predict() function. Suppose we want to predict the salary of a female employee who is 50 years old, has a rating of 8, and has a Master’s degree.

After storing this new observation in a data frame, we can pass it to predict() with each model.
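A sketch of this, assuming we store the observation in a data frame (the name newEmployee is arbitrary, and the levels "Female" and "Master's" are inferred from the coefficient names GenderMale and DegreeMaster's in the summary output):

newEmployee <- data.frame(Age = 50, Rating = 8,
                          Gender = "Female", Degree = "Master's")

# Prediction from the simple model (uses Age only)
predict(modelSimple, newdata = newEmployee)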

##        1 
## 168475.1
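The multiple regression model, which also accounts for Rating, Gender, and Degree, predicts a higher salary:

predict(modelMultiple, newdata = newEmployee)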
##        1 
## 182445.7