4 Multiple linear regression
Multiple regression extends simple two-variable regression to the case that still has one response but possibly many predictors. The method is motivated by scenarios where many variables may be simultaneously connected to an output.
To illustrate the modeling process, we will use the loans_full_schema
dataset (from the openintro
package), which has information on loans from the peer-to-peer lender Lending Club. The dataset includes terms of the loan as well as information about the borrower. The outcome/response variable we would like to better understand is the
interest rate assigned to the loan. For instance, all other characteristics held constant, does it matter
how much debt someone already has? Does it matter if their income has been verified? Multiple
regression will help us answer these and other questions. Let’s start by loading the data and saving it with a shorter name:
The dataset includes results from 10,000 loans. To see the descriptions of the variables, type ?loans_full_schema
The first 5 rows are in table 4.1.
emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | delinq_2y | months_since_last_delinq | earliest_credit_line | inquiries_last_12m | total_credit_lines | open_credit_lines | total_credit_limit | total_credit_utilized | num_collections_last_12m | num_historical_failed_to_pay | months_since_90d_late | current_accounts_delinq | total_collection_amount_ever | current_installment_accounts | accounts_opened_24m | months_since_last_credit_inquiry | num_satisfactory_accounts | num_accounts_120d_past_due | num_accounts_30d_past_due | num_active_debit_accounts | total_debit_limit | num_total_cc_accounts | num_open_cc_accounts | num_cc_carrying_balance | num_mort_accounts | account_never_delinq_percent | tax_liens | public_record_bankrupt | loan_purpose | application_type | loan_amount | term | interest_rate | installment | grade | sub_grade | issue_month | loan_status | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
global config engineer | 3 | NJ | MORTGAGE | 90000 | Verified | 18.01 | NA | NA | 0 | 38 | 2001 | 6 | 28 | 10 | 70795 | 38767 | 0 | 0 | 38 | 0 | 1250 | 2 | 5 | 5 | 10 | 0 | 0 | 2 | 11100 | 14 | 8 | 6 | 1 | 92.9 | 0 | 0 | moving | individual | 28000 | 60 | 14.07 | 652.53 | C | C3 | Mar-2018 | Current | whole | Cash | 27015.86 | 1999.33 | 984.14 | 1015.19 | 0 | |
warehouse office clerk | 10 | HI | RENT | 40000 | Not Verified | 5.04 | NA | NA | 0 | NA | 1996 | 1 | 30 | 14 | 28800 | 4321 | 0 | 1 | NA | 0 | 0 | 0 | 11 | 8 | 14 | 0 | 0 | 3 | 16500 | 24 | 14 | 4 | 0 | 100.0 | 0 | 1 | debt_consolidation | individual | 5000 | 36 | 12.61 | 167.54 | C | C1 | Feb-2018 | Current | whole | Cash | 4651.37 | 499.12 | 348.63 | 150.49 | 0 | |
assembly | 3 | WI | RENT | 40000 | Source Verified | 21.15 | NA | NA | 0 | 28 | 2006 | 4 | 31 | 10 | 24193 | 16000 | 0 | 0 | 28 | 0 | 432 | 1 | 13 | 7 | 10 | 0 | 0 | 3 | 4300 | 14 | 8 | 6 | 0 | 93.5 | 0 | 0 | other | individual | 2000 | 36 | 17.09 | 71.40 | D | D1 | Feb-2018 | Current | fractional | Cash | 1824.63 | 281.80 | 175.37 | 106.43 | 0 | |
customer service | 1 | PA | RENT | 30000 | Not Verified | 10.16 | NA | NA | 0 | NA | 2007 | 0 | 4 | 4 | 25400 | 4997 | 0 | 1 | NA | 0 | 0 | 1 | 1 | 15 | 4 | 0 | 0 | 2 | 19400 | 3 | 3 | 2 | 0 | 100.0 | 1 | 0 | debt_consolidation | individual | 21600 | 36 | 6.72 | 664.19 | A | A3 | Jan-2018 | Current | whole | Cash | 18853.26 | 3312.89 | 2746.74 | 566.15 | 0 | |
security supervisor | 10 | CA | RENT | 35000 | Verified | 57.96 | 57000 | Verified | 37.66 | 0 | NA | 2008 | 7 | 22 | 16 | 69839 | 52722 | 0 | 0 | NA | 0 | 0 | 1 | 6 | 4 | 16 | 0 | 0 | 10 | 32700 | 20 | 15 | 13 | 0 | 100.0 | 0 | 0 | credit_card | joint | 23000 | 36 | 14.07 | 786.87 | C | C3 | Mar-2018 | Current | whole | Cash | 21430.15 | 2324.65 | 1569.85 | 754.80 | 0 |
4.1 Multiple LS model
Consider a sample of size \(n\) with one response variable, \(Y\), and \(k\) explanatory variables, \(X_1, X_2, \dots, X_k.\) Denote the values of the response variable by \((Y_1, Y_2, \dots, Y_n)\) and the values of the \(k\) explanatory variables by \[\begin{eqnarray} (X_{11}, X_{21}, X_{31}, \dots, X_{n1})\\ (X_{12}, X_{22}, X_{32}, \dots, X_{n2})\\ \vdots\\ (X_{1k}, X_{2k}, X_{3k}, \dots, X_{nk}) \end{eqnarray}\]
A linear model that relates \(Y\) with the predictors \(X_1, X_2, \dots, X_k\) has the form \[\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2X_2 + \cdots + \hat{b}_kX_k\] We can write, for each case in the dataset: \[\hat{Y}_i = \hat{a} + \hat{b}_1 X_{i1} + \hat{b}_2X_{i2} + \cdots + \hat{b}_kX_{ik}, \quad i=1,2,\cdots,n.\] Under the least squares criterion, the coefficients \(\hat{a}\), \(\hat{b}_1\), \(\hat{b}_2\), \(\dots\), \(\hat{b}_k\) are the values that minimize the sum of the squares of the residuals, that is, they minimize \[\sum_{i=1}^n(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n\left(Y_i - ({a} + {b}_1 X_{i1} + {b}_2X_{i2} + \cdots + {b}_kX_{ik})\right)^2.\] The solution to this optimization problem is better written using matrix notation and we omit the formulas for the \(\hat{b}\)’s here. However, it’s important to know that the formula for each \(\hat{b}\) depends on the \(X_{ij}\)’s and \(Y_i\)’s (similar to the one for simple linear regression). Once \(\hat{b}_1, \hat{b}_2, \dots, \hat{b}_k\) are calculated, \(\hat{a}\) is calculated as \[\hat{a} = \overline{Y} - \hat{b}_1 \overline{X}_1 - \hat{b}_2 \overline{X}_2 - \cdots - \hat{b}_k \overline{X}_k,\] where \(\bar{X}_j\) is the mean of the variable \(X_j\), \(j=1,2,\dots,k.\) This means that, like in the simple linear regression case, the LS model passes through the point whose coordinates are the averages of all the variables.
In R, a (least squares) multiple linear regression model is computed with the same lm
function that we used to compute simple linear regressions. We only need to add more predictors.
For instance, a model that uses annual_income
and debt_to_income
as predictors of interest_rate
can be constructed by running:
##
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income,
## data = loans)
##
## Coefficients:
## (Intercept) annual_income debt_to_income
## 1.206e+01 -5.841e-06 4.264e-02
We can write the model as \[\widehat{interest\_rate} = 12.06 - 0.00000584\times annual\_income + 0.0426 \times debt\_to\_income.\]
The interpretation of the slopes is a little different than in regression with one predictor. Since the calculation of each slope uses information from all the variables, the slope of a variable in multiple regression will not be the same as when that variable was the only predictor. To illustrate this, let’s find the LS model when annual_income
is the only predictor and the LS model when debt_to_income
is the only predictor:
##
## Call:
## lm(formula = interest_rate ~ annual_income, data = loans)
##
## Coefficients:
## (Intercept) annual_income
## 1.304e+01 -7.693e-06
##
## Call:
## lm(formula = interest_rate ~ debt_to_income, data = loans)
##
## Coefficients:
## (Intercept) debt_to_income
## 11.51145 0.04718
The interpretations of the slopes in the single-predictor models are:
An average increase of $1 in annual income is associated with an average decrease of 0.00000769% in loan interest rate. We can also say (for better description) that an average increase of $10,000 in annual income is associated with an average decrease of 0.0769% in loan interest rate.
An average increase of 1% in the debt-to-income ratio is associated with an average increase of 0.047% in loan interest rate. Alternatively, we can say that an average increase of 10% in the debt-to-income ratio is associated with an average increase of 0.47% in loan interest rate.
The interpretation of the intercepts should only be made if they are not an extrapolation. Since there are people with no income and people with a 0 debt-to-income ratio, we can interpret the intercept:
Borrowers with no income have an average interest rate of 13%.
Borrowers with 0 debt-to-income ratio have an average interest rate of 11.5%.
Now let’s turn back to the model that has both variables together. We can interpret the slopes as follows:
Holding debt-to-income ratio constant, an average increase of $10,000 in annual income is associated with an average decrease of 0.0584% in loan interest rate. “Holding the debt-to-income ratio constant” can be interpreted as “For borrowers with similar debt-to-income ratio, an average increase…” We can also say that “After taking debt-to-income ratio into account, an average increase…”
Holding annual income constant, an average increase of $10% in debt-to-income ratio is associated with an average increase of 0.426% in loan interest rate. Here “Holding the annual income constant” can be interpreted as “For borrowers with similar annual income, an average increase…” We can also say that “After taking annual income into account, an average increase…”
4.2 Categorical predictors
Let’s start by fitting a (LS) linear regression model for interest rate with a single predictor indicating whether or not a person has a bankruptcy in their record. First, let’s create such variable:
loans$bankruptcy <- if_else(loans$public_record_bankrupt > 0, "yes", "no")
lm(interest_rate ~ bankruptcy, data = loans)
##
## Call:
## lm(formula = interest_rate ~ bankruptcy, data = loans)
##
## Coefficients:
## (Intercept) bankruptcyyes
## 12.3380 0.7368
The model is given by \[\widehat{interest\_rate} = 12.34 + 0.74\times bankruptcy.\]
The bankruptcy variable takes one of two values: “yes” when the borrower has a bankruptcy in their history and “no” otherwise. A slope of 0.74 means that the model predicts a 0.74% higher interest rate for those borrowers with a bankruptcy in their record (see Section 3.2 for a review of the interpretation for categorical predictor variables.)
Now let’s fit a model using a 3-level categorical variable, such as verified_income
. We can ask R which are the levels of the variable verified_income
using the function levels
:
## [1] "" "Not Verified" "Source Verified" "Verified"
There is an empty level ““, which we should eliminate. Let’s redefine the levels to exclude the empty level:
loans$verified_income <- factor(loans$verified_income, levels = c("Not Verified", "Source Verified", "Verified"))
##
## Call:
## lm(formula = interest_rate ~ verified_income, data = loans)
##
## Coefficients:
## (Intercept) verified_incomeSource Verified
## 11.099 1.416
## verified_incomeVerified
## 3.254
We can ask R again which are the 3 levels of the variable verified_income
:
## [1] "Not Verified" "Source Verified" "Verified"
The first level listed, “Not verified”, will be the reference level in a linear regression model. We can then interpret the slopes as comparisons with the reference level. For example, borrowers who had their income source verified (but not the amount), had an interest rate that was on average 1.4% higher than those who didn’t have their income source verified. Borrowers who had their income amount verified, had an interest rate that was on average 3.25% higher than those who didn’t have their income verified.
The higher interest rate for borrowers who have verified their income source or amount is surprising. Intuitively, we’d think that a loan would look less risky if the borrower’s income has been verified. However, note that the situation may be more complex, and there may be confounding variables that we didn’t account for. For example, perhaps lender require borrowers with poor credit to verify their income. That is, verifying income in our data set might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan. For this reason, the borrower could be deemed higher risk, resulting in a higher interest rate. (What other confounding variables might explain this counter-intuitive relationship suggested by the model?)
Now let’s mix categorical and numerical predictors (I know you’ve been waiting for this!). The LS model that has debt_to_income
and verified_income
as predictors is found by running:
##
## Call:
## lm(formula = interest_rate ~ debt_to_income + verified_income,
## data = loans)
##
## Coefficients:
## (Intercept) debt_to_income
## 10.34179 0.03909
## verified_incomeSource Verified verified_incomeVerified
## 1.49071 3.10048
The model is:
\[\begin{eqnarray} \widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times verified\_income(Source Verified) \\ &+& 3.10\times verified\_income(Verified). \end{eqnarray}\]
The interpretation of the coefficients are:
(Intercept) Borrowers with a 0 debt-to-income ratio and non-verified income, have on average an interest rate of 10.34%.
Regardless of type of verification, an average 1% increase in debt-to-income ratio is associated with an average increase of 0.039% in the loan’s interest rate.
Holding debt-to-income ratio constant, borrowers who have the source of their income verified tend to have an interest rate that is higher than those who have no verification by an average of 1.49%.
Holding debt-to-income ratio constant, borrowers who have the value of their income verified tend to have an interest rate that is higher than those who have no verification by an average of 3.10%.
Notice that this model produces three parallel lines for the relationship between interest_rate
and debt_to_income
, one for each verification type:
\[\begin{eqnarray} \mbox{Not Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 0 + 3.10\times 0 \\ &=& 10.34 + 0.039\times debt\_to\_income. \\ \mbox{Source Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 1 + 3.10\times 0 \\ &=& 11.83 + 0.039\times debt\_to\_income. \\ \mbox{Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 0 + 3.10\times 1 \\ &=& 13.44 + 0.039\times debt\_to\_income. \end{eqnarray}\]
To plot the parallel lines, we can use the layer geom_parallel_slopes()
from the moderndive
package (first be sure to install the package).
library(moderndive)
ggplot(loans, aes(x = debt_to_income, y = interest_rate, color = verified_income)) +
geom_point() +
geom_parallel_slopes(se = FALSE)
Finally, let’s use a larger number of predictors. The LS model that has annual_income
, debt_to_income
, bankruptcy
, and verified_income
as predictors is found by running:
##
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income +
## bankruptcy + verified_income, data = loans)
##
## Coefficients:
## (Intercept) annual_income
## 1.084e+01 -6.334e-06
## debt_to_income bankruptcyyes
## 3.406e-02 6.382e-01
## verified_incomeSource Verified verified_incomeVerified
## 1.530e+00 3.125e+00
The model is:
\[\begin{eqnarray} \widehat{interest\_rate} &=& 10.84 - 0.00000633\times annual\_income + 0.034\times debt\_to\_income\\ &+& 0.638\times bankruptcy + 1.53\times verified\_income(Source Verified) \\ &+& 3.125\times verified\_income(Verified). \end{eqnarray}\]
The interpretation of a few of the coefficients are:
(Intercept) Borrowers with no annual income, a 0 debt_to_income, no bankruptcy, and non-verified income, have on average an interest rate of 10.84%.
Holding the other variables in the model constant, an average $10,000 increase in annual income is associated with an average decrease in interest rate of 0.063%.
Holding the other variables in the model constant, borrowers who have had a bankruptcy tend to have an interest rate that is higher than those who have no bankruptcy by an average of 0.64%.