4 Multiple linear regression

Multiple regression extends simple two-variable regression to the case that still has one response but possibly many predictors. The method is motivated by scenarios where many variables may be simultaneously connected to an output.

To illustrate the modeling process, we will use the loans_full_schema dataset (from the openintro package), which has information on loans from the peer-to-peer lender Lending Club. The dataset includes terms of the loan as well as information about the borrower. The outcome/response variable we would like to better understand is the interest rate assigned to the loan. For instance, all other characteristics held constant, does it matter how much debt someone already has? Does it matter if their income has been verified? Multiple regression will help us answer these and other questions. Let’s start by loading the data and saving it with a shorter name:

library(openintro)
loans <- loans_full_schema

The dataset includes results from 10,000 loans. To see the descriptions of the variables, type ?loans_full_schema The first 5 rows are in table 4.1.

Table 4.1: First five rows of the loans dataset.
emp_title emp_length state homeownership annual_income verified_income debt_to_income annual_income_joint verification_income_joint debt_to_income_joint delinq_2y months_since_last_delinq earliest_credit_line inquiries_last_12m total_credit_lines open_credit_lines total_credit_limit total_credit_utilized num_collections_last_12m num_historical_failed_to_pay months_since_90d_late current_accounts_delinq total_collection_amount_ever current_installment_accounts accounts_opened_24m months_since_last_credit_inquiry num_satisfactory_accounts num_accounts_120d_past_due num_accounts_30d_past_due num_active_debit_accounts total_debit_limit num_total_cc_accounts num_open_cc_accounts num_cc_carrying_balance num_mort_accounts account_never_delinq_percent tax_liens public_record_bankrupt loan_purpose application_type loan_amount term interest_rate installment grade sub_grade issue_month loan_status initial_listing_status disbursement_method balance paid_total paid_principal paid_interest paid_late_fees
global config engineer 3 NJ MORTGAGE 90000 Verified 18.01 NA NA 0 38 2001 6 28 10 70795 38767 0 0 38 0 1250 2 5 5 10 0 0 2 11100 14 8 6 1 92.9 0 0 moving individual 28000 60 14.07 652.53 C C3 Mar-2018 Current whole Cash 27015.86 1999.33 984.14 1015.19 0
warehouse office clerk 10 HI RENT 40000 Not Verified 5.04 NA NA 0 NA 1996 1 30 14 28800 4321 0 1 NA 0 0 0 11 8 14 0 0 3 16500 24 14 4 0 100.0 0 1 debt_consolidation individual 5000 36 12.61 167.54 C C1 Feb-2018 Current whole Cash 4651.37 499.12 348.63 150.49 0
assembly 3 WI RENT 40000 Source Verified 21.15 NA NA 0 28 2006 4 31 10 24193 16000 0 0 28 0 432 1 13 7 10 0 0 3 4300 14 8 6 0 93.5 0 0 other individual 2000 36 17.09 71.40 D D1 Feb-2018 Current fractional Cash 1824.63 281.80 175.37 106.43 0
customer service 1 PA RENT 30000 Not Verified 10.16 NA NA 0 NA 2007 0 4 4 25400 4997 0 1 NA 0 0 1 1 15 4 0 0 2 19400 3 3 2 0 100.0 1 0 debt_consolidation individual 21600 36 6.72 664.19 A A3 Jan-2018 Current whole Cash 18853.26 3312.89 2746.74 566.15 0
security supervisor 10 CA RENT 35000 Verified 57.96 57000 Verified 37.66 0 NA 2008 7 22 16 69839 52722 0 0 NA 0 0 1 6 4 16 0 0 10 32700 20 15 13 0 100.0 0 0 credit_card joint 23000 36 14.07 786.87 C C3 Mar-2018 Current whole Cash 21430.15 2324.65 1569.85 754.80 0

4.1 Multiple LS model

Consider a sample of size \(n\) with one response variable, \(Y\), and \(k\) explanatory variables, \(X_1, X_2, \dots, X_k.\) Denote the values of the response variable by \((Y_1, Y_2, \dots, Y_n)\) and the values of the \(k\) explanatory variables by \[\begin{eqnarray} (X_{11}, X_{21}, X_{31}, \dots, X_{n1})\\ (X_{12}, X_{22}, X_{32}, \dots, X_{n2})\\ \vdots\\ (X_{1k}, X_{2k}, X_{3k}, \dots, X_{nk}) \end{eqnarray}\]

A linear model that relates \(Y\) with the predictors \(X_1, X_2, \dots, X_k\) has the form \[\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2X_2 + \cdots + \hat{b}_kX_k\] We can write, for each case in the dataset: \[\hat{Y}_i = \hat{a} + \hat{b}_1 X_{i1} + \hat{b}_2X_{i2} + \cdots + \hat{b}_kX_{ik}, \quad i=1,2,\cdots,n.\] Under the least squares criterion, the coefficients \(\hat{a}\), \(\hat{b}_1\), \(\hat{b}_2\), \(\dots\), \(\hat{b}_k\) are the values that minimize the sum of the squares of the residuals, that is, they minimize \[\sum_{i=1}^n(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n\left(Y_i - ({a} + {b}_1 X_{i1} + {b}_2X_{i2} + \cdots + {b}_kX_{ik})\right)^2.\] The solution to this optimization problem is better written using matrix notation and we omit the formulas for the \(\hat{b}\)’s here. However, it’s important to know that the formula for each \(\hat{b}\) depends on the \(X_{ij}\)’s and \(Y_i\)’s (similar to the one for simple linear regression). Once \(\hat{b}_1, \hat{b}_2, \dots, \hat{b}_k\) are calculated, \(\hat{a}\) is calculated as \[\hat{a} = \overline{Y} - \hat{b}_1 \overline{X}_1 - \hat{b}_2 \overline{X}_2 - \cdots - \hat{b}_k \overline{X}_k,\] where \(\bar{X}_j\) is the mean of the variable \(X_j\), \(j=1,2,\dots,k.\) This means that, like in the simple linear regression case, the LS model passes through the point whose coordinates are the averages of all the variables.

In R, a (least squares) multiple linear regression model is computed with the same lm function that we used to compute simple linear regressions. We only need to add more predictors.

For instance, a model that uses annual_income and debt_to_income as predictors of interest_rate can be constructed by running:

lm(interest_rate ~ annual_income + debt_to_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income, 
##     data = loans)
## 
## Coefficients:
##    (Intercept)   annual_income  debt_to_income  
##      1.206e+01      -5.841e-06       4.264e-02

We can write the model as \[\widehat{interest\_rate} = 12.06 - 0.00000584\times annual\_income + 0.0426 \times debt\_to\_income.\]

The interpretation of the slopes is a little different than in regression with one predictor. Since the calculation of each slope uses information from all the variables, the slope of a variable in multiple regression will not be the same as when that variable was the only predictor. To illustrate this, let’s find the LS model when annual_income is the only predictor and the LS model when debt_to_income is the only predictor:

lm(interest_rate ~ annual_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ annual_income, data = loans)
## 
## Coefficients:
##   (Intercept)  annual_income  
##     1.304e+01     -7.693e-06
lm(interest_rate ~ debt_to_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ debt_to_income, data = loans)
## 
## Coefficients:
##    (Intercept)  debt_to_income  
##       11.51145         0.04718

The interpretations of the slopes in the single-predictor models are:

  • An average increase of $1 in annual income is associated with an average decrease of 0.00000769% in loan interest rate. We can also say (for better description) that an average increase of $10,000 in annual income is associated with an average decrease of 0.0769% in loan interest rate.

  • An average increase of 1% in the debt-to-income ratio is associated with an average increase of 0.047% in loan interest rate. Alternatively, we can say that an average increase of 10% in the debt-to-income ratio is associated with an average increase of 0.47% in loan interest rate.

The interpretation of the intercepts should only be made if they are not an extrapolation. Since there are people with no income and people with a 0 debt-to-income ratio, we can interpret the intercept:

  • Borrowers with no income have an average interest rate of 13%.

  • Borrowers with 0 debt-to-income ratio have an average interest rate of 11.5%.

Now let’s turn back to the model that has both variables together. We can interpret the slopes as follows:

  • Holding debt-to-income ratio constant, an average increase of $10,000 in annual income is associated with an average decrease of 0.0584% in loan interest rate. “Holding the debt-to-income ratio constant” can be interpreted as “For borrowers with similar debt-to-income ratio, an average increase…” We can also say that “After taking debt-to-income ratio into account, an average increase…”

  • Holding annual income constant, an average increase of $10% in debt-to-income ratio is associated with an average increase of 0.426% in loan interest rate. Here “Holding the annual income constant” can be interpreted as “For borrowers with similar annual income, an average increase…” We can also say that “After taking annual income into account, an average increase…”

4.2 Categorical predictors

Let’s start by fitting a (LS) linear regression model for interest rate with a single predictor indicating whether or not a person has a bankruptcy in their record. First, let’s create such variable:

loans$bankruptcy <- if_else(loans$public_record_bankrupt > 0, "yes", "no")
lm(interest_rate ~ bankruptcy, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ bankruptcy, data = loans)
## 
## Coefficients:
##   (Intercept)  bankruptcyyes  
##       12.3380         0.7368

The model is given by \[\widehat{interest\_rate} = 12.34 + 0.74\times bankruptcy.\]

The bankruptcy variable takes one of two values: “yes” when the borrower has a bankruptcy in their history and “no” otherwise. A slope of 0.74 means that the model predicts a 0.74% higher interest rate for those borrowers with a bankruptcy in their record (see Section 3.2 for a review of the interpretation for categorical predictor variables.)

Now let’s fit a model using a 3-level categorical variable, such as verified_income. We can ask R which are the levels of the variable verified_income using the function levels:

levels(loans$verified_income)
## [1] ""                "Not Verified"    "Source Verified" "Verified"

There is an empty level ““, which we should eliminate. Let’s redefine the levels to exclude the empty level:

loans$verified_income <- factor(loans$verified_income, levels = c("Not Verified", "Source Verified", "Verified"))
lm(interest_rate ~ verified_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ verified_income, data = loans)
## 
## Coefficients:
##                    (Intercept)  verified_incomeSource Verified  
##                         11.099                           1.416  
##        verified_incomeVerified  
##                          3.254

We can ask R again which are the 3 levels of the variable verified_income:

levels(loans$verified_income)
## [1] "Not Verified"    "Source Verified" "Verified"

The first level listed, “Not verified”, will be the reference level in a linear regression model. We can then interpret the slopes as comparisons with the reference level. For example, borrowers who had their income source verified (but not the amount), had an interest rate that was on average 1.4% higher than those who didn’t have their income source verified. Borrowers who had their income amount verified, had an interest rate that was on average 3.25% higher than those who didn’t have their income verified.

The higher interest rate for borrowers who have verified their income source or amount is surprising. Intuitively, we’d think that a loan would look less risky if the borrower’s income has been verified. However, note that the situation may be more complex, and there may be confounding variables that we didn’t account for. For example, perhaps lender require borrowers with poor credit to verify their income. That is, verifying income in our data set might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan. For this reason, the borrower could be deemed higher risk, resulting in a higher interest rate. (What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

Now let’s mix categorical and numerical predictors (I know you’ve been waiting for this!). The LS model that has debt_to_income and verified_income as predictors is found by running:

lm(interest_rate ~ debt_to_income + verified_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ debt_to_income + verified_income, 
##     data = loans)
## 
## Coefficients:
##                    (Intercept)                  debt_to_income  
##                       10.34179                         0.03909  
## verified_incomeSource Verified         verified_incomeVerified  
##                        1.49071                         3.10048

The model is:

\[\begin{eqnarray} \widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times verified\_income(Source Verified) \\ &+& 3.10\times verified\_income(Verified). \end{eqnarray}\]

The interpretation of the coefficients are:

  • (Intercept) Borrowers with a 0 debt-to-income ratio and non-verified income, have on average an interest rate of 10.34%.

  • Regardless of type of verification, an average 1% increase in debt-to-income ratio is associated with an average increase of 0.039% in the loan’s interest rate.

  • Holding debt-to-income ratio constant, borrowers who have the source of their income verified tend to have an interest rate that is higher than those who have no verification by an average of 1.49%.

  • Holding debt-to-income ratio constant, borrowers who have the value of their income verified tend to have an interest rate that is higher than those who have no verification by an average of 3.10%.

Notice that this model produces three parallel lines for the relationship between interest_rate and debt_to_income, one for each verification type:

\[\begin{eqnarray} \mbox{Not Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 0 + 3.10\times 0 \\ &=& 10.34 + 0.039\times debt\_to\_income. \\ \mbox{Source Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 1 + 3.10\times 0 \\ &=& 11.83 + 0.039\times debt\_to\_income. \\ \mbox{Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 0 + 3.10\times 1 \\ &=& 13.44 + 0.039\times debt\_to\_income. \end{eqnarray}\]

To plot the parallel lines, we can use the layer geom_parallel_slopes() from the moderndive package (first be sure to install the package).

library(moderndive)
ggplot(loans, aes(x = debt_to_income, y = interest_rate, color = verified_income)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE)

Finally, let’s use a larger number of predictors. The LS model that has annual_income, debt_to_income, bankruptcy, and verified_income as predictors is found by running:

lm(interest_rate ~ annual_income + debt_to_income + bankruptcy + verified_income, data = loans)
## 
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income + 
##     bankruptcy + verified_income, data = loans)
## 
## Coefficients:
##                    (Intercept)                   annual_income  
##                      1.084e+01                      -6.334e-06  
##                 debt_to_income                   bankruptcyyes  
##                      3.406e-02                       6.382e-01  
## verified_incomeSource Verified         verified_incomeVerified  
##                      1.530e+00                       3.125e+00

The model is:

\[\begin{eqnarray} \widehat{interest\_rate} &=& 10.84 - 0.00000633\times annual\_income + 0.034\times debt\_to\_income\\ &+& 0.638\times bankruptcy + 1.53\times verified\_income(Source Verified) \\ &+& 3.125\times verified\_income(Verified). \end{eqnarray}\]

The interpretation of a few of the coefficients are:

  • (Intercept) Borrowers with no annual income, a 0 debt_to_income, no bankruptcy, and non-verified income, have on average an interest rate of 10.84%.

  • Holding the other variables in the model constant, an average $10,000 increase in annual income is associated with an average decrease in interest rate of 0.063%.

  • Holding the other variables in the model constant, borrowers who have had a bankruptcy tend to have an interest rate that is higher than those who have no bankruptcy by an average of 0.64%.