4 Multiple linear regression

Multiple regression extends simple two-variable regression to the case that still has one response but possibly many predictors. The method is motivated by scenarios where many variables may be simultaneously connected to an output.

To illustrate the modeling process, we will use the loans_full_schema dataset (from the openintro package), which has information on loans from the peer-to-peer lender Lending Club. The dataset includes terms of the loan as well as information about the borrower. The outcome/response variable we would like to better understand is the interest rate assigned to the loan. For instance, all other characteristics held constant, does it matter how much debt someone already has? Does it matter if their income has been verified? Multiple regression will help us answer these and other questions. Let’s start by loading the data and saving it with a shorter name:

library(openintro)
loans <- loans_full_schema

The dataset includes results from 10,000 loans. To see the descriptions of the variables, type ?loans_full_schema The first 5 rows are in table 4.1.

Table 4.1: First five rows of the loans dataset.
emp_title	emp_length	state	homeownership	annual_income	verified_income	debt_to_income	annual_income_joint	verification_income_joint	debt_to_income_joint	months_since_last_delinq	earliest_credit_line	inquiries_last_12m	total_credit_lines	open_credit_lines	total_credit_limit	total_credit_utilized	num_historical_failed_to_pay	months_since_90d_late	total_collection_amount_ever	current_installment_accounts	accounts_opened_24m	months_since_last_credit_inquiry	num_satisfactory_accounts	num_active_debit_accounts	total_debit_limit	num_total_cc_accounts	num_open_cc_accounts	num_cc_carrying_balance	num_mort_accounts	account_never_delinq_percent	tax_liens	public_record_bankrupt	loan_purpose	application_type	loan_amount	term	interest_rate	installment	grade	sub_grade	issue_month	loan_status	initial_listing_status	disbursement_method	balance	paid_total	paid_principal	paid_interest
global config engineer	3	NJ	MORTGAGE	90000	Verified	18.01	NA		NA	38	2001	6	28	10	70795	38767	0	38	1250	2	5	5	10	2	11100	14	8	6	1	92.9	0	0	moving	individual	28000	60	14.07	652.53	C	C3	Mar-2018	Current	whole	Cash	27015.86	1999.33	984.14	1015.19
warehouse office clerk	10	HI	RENT	40000	Not Verified	5.04	NA		NA	NA	1996	1	30	14	28800	4321	1	NA	0	0	11	8	14	3	16500	24	14	4	0	100.0	0	1	debt_consolidation	individual	5000	36	12.61	167.54	C	C1	Feb-2018	Current	whole	Cash	4651.37	499.12	348.63	150.49
assembly	3	WI	RENT	40000	Source Verified	21.15	NA		NA	28	2006	4	31	10	24193	16000	0	28	432	1	13	7	10	3	4300	14	8	6	0	93.5	0	0	other	individual	2000	36	17.09	71.40	D	D1	Feb-2018	Current	fractional	Cash	1824.63	281.80	175.37	106.43
customer service	1	PA	RENT	30000	Not Verified	10.16	NA		NA	NA	2007	0	4	4	25400	4997	1	NA	0	1	1	15	4	2	19400	3	3	2	0	100.0	1	0	debt_consolidation	individual	21600	36	6.72	664.19	A	A3	Jan-2018	Current	whole	Cash	18853.26	3312.89	2746.74	566.15
security supervisor	10	CA	RENT	35000	Verified	57.96	57000	Verified	37.66	NA	2008	7	22	16	69839	52722	0	NA	0	1	6	4	16	10	32700	20	15	13	0	100.0	0	0	credit_card	joint	23000	36	14.07	786.87	C	C3	Mar-2018	Current	whole	Cash	21430.15	2324.65	1569.85	754.80

4.1 Multiple LS model

Consider a sample of size $n$ with one response variable, $Y$ , and $k$ explanatory variables, $X_1, X_2, \dots, X_k.$ Denote the values of the response variable by $(Y_1, Y_2, \dots, Y_n)$ and the values of the $k$ explanatory variables by $\begin{eqnarray} (X_{11}, X_{21}, X_{31}, \dots, X_{n1})\\ (X_{12}, X_{22}, X_{32}, \dots, X_{n2})\\ \vdots\\ (X_{1k}, X_{2k}, X_{3k}, \dots, X_{nk}) \end{eqnarray}$

A linear model that relates $Y$ with the predictors $X_1, X_2, \dots, X_k$ has the form $\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2X_2 + \cdots + \hat{b}_kX_k$ We can write, for each case in the dataset: $\hat{Y}_i = \hat{a} + \hat{b}_1 X_{i1} + \hat{b}_2X_{i2} + \cdots + \hat{b}_kX_{ik}, \quad i=1,2,\cdots,n.$ Under the least squares criterion, the coefficients $\hat{a}$ , $\hat{b}_1$ , $\hat{b}_2$ , $\dots$ , $\hat{b}_k$ are the values that minimize the sum of the squares of the residuals, that is, they minimize $\sum_{i=1}^n(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n\left(Y_i - ({a} + {b}_1 X_{i1} + {b}_2X_{i2} + \cdots + {b}_kX_{ik})\right)^2.$ The solution to this optimization problem is better written using matrix notation and we omit the formulas for the $\hat{b}$ ’s here. However, it’s important to know that the formula for each $\hat{b}$ depends on the $X_{ij}$ ’s and $Y_i$ ’s (similar to the one for simple linear regression). Once $\hat{b}_1, \hat{b}_2, \dots, \hat{b}_k$ are calculated, $\hat{a}$ is calculated as $\hat{a} = \overline{Y} - \hat{b}_1 \overline{X}_1 - \hat{b}_2 \overline{X}_2 - \cdots - \hat{b}_k \overline{X}_k,$ where $\bar{X}_j$ is the mean of the variable $X_j$ , $j=1,2,\dots,k.$ This means that, like in the simple linear regression case, the LS model passes through the point whose coordinates are the averages of all the variables.

In R, a (least squares) multiple linear regression model is computed with the same lm function that we used to compute simple linear regressions. We only need to add more predictors.

For instance, a model that uses annual_income and debt_to_income as predictors of interest_rate can be constructed by running:

lm(interest_rate ~ annual_income + debt_to_income, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income, 
##     data = loans)
## 
## Coefficients:
##    (Intercept)   annual_income  debt_to_income  
##      1.206e+01      -5.841e-06       4.264e-02

We can write the model as $\widehat{interest\_rate} = 12.06 - 0.00000584\times annual\_income + 0.0426 \times debt\_to\_income.$

The interpretation of the slopes is a little different than in regression with one predictor. Since the calculation of each slope uses information from all the variables, the slope of a variable in multiple regression will not be the same as when that variable was the only predictor. To illustrate this, let’s find the LS model when annual_income is the only predictor and the LS model when debt_to_income is the only predictor:

lm(interest_rate ~ annual_income, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ annual_income, data = loans)
## 
## Coefficients:
##   (Intercept)  annual_income  
##     1.304e+01     -7.693e-06

lm(interest_rate ~ debt_to_income, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ debt_to_income, data = loans)
## 
## Coefficients:
##    (Intercept)  debt_to_income  
##       11.51145         0.04718

The interpretations of the slopes in the single-predictor models are:

An average increase of $1 in annual income is associated with an average decrease of 0.00000769% in loan interest rate. We can also say (for better description) that an average increase of $10,000 in annual income is associated with an average decrease of 0.0769% in loan interest rate.
An average increase of 1% in the debt-to-income ratio is associated with an average increase of 0.047% in loan interest rate. Alternatively, we can say that an average increase of 10% in the debt-to-income ratio is associated with an average increase of 0.47% in loan interest rate.

The interpretation of the intercepts should only be made if they are not an extrapolation. Since there are people with no income and people with a 0 debt-to-income ratio, we can interpret the intercept:

Borrowers with no income have an average interest rate of 13%.
Borrowers with 0 debt-to-income ratio have an average interest rate of 11.5%.

Now let’s turn back to the model that has both variables together. We can interpret the slopes as follows:

Holding debt-to-income ratio constant, an average increase of $10,000 in annual income is associated with an average decrease of 0.0584% in loan interest rate. “Holding the debt-to-income ratio constant” can be interpreted as “For borrowers with similar debt-to-income ratio, an average increase…” We can also say that “After taking debt-to-income ratio into account, an average increase…”
Holding annual income constant, an average increase of $10% in debt-to-income ratio is associated with an average increase of 0.426% in loan interest rate. Here “Holding the annual income constant” can be interpreted as “For borrowers with similar annual income, an average increase…” We can also say that “After taking annual income into account, an average increase…”

4.2 Categorical predictors

Let’s start by fitting a (LS) linear regression model for interest rate with a single predictor indicating whether or not a person has a bankruptcy in their record. First, let’s create such variable:

loans$bankruptcy <- if_else(loans$public_record_bankrupt > 0, "yes", "no")
lm(interest_rate ~ bankruptcy, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ bankruptcy, data = loans)
## 
## Coefficients:
##   (Intercept)  bankruptcyyes  
##       12.3380         0.7368

The model is given by $\widehat{interest\_rate} = 12.34 + 0.74\times bankruptcy.$

The bankruptcy variable takes one of two values: “yes” when the borrower has a bankruptcy in their history and “no” otherwise. A slope of 0.74 means that the model predicts a 0.74% higher interest rate for those borrowers with a bankruptcy in their record (see Section 3.2 for a review of the interpretation for categorical predictor variables.)

Now let’s fit a model using a 3-level categorical variable, such as verified_income. We can ask R which are the levels of the variable verified_income using the function levels:

levels(loans$verified_income)

## [1] ""                "Not Verified"    "Source Verified" "Verified"

There is an empty level ““, which we should eliminate. Let’s redefine the levels to exclude the empty level:

loans$verified_income <- factor(loans$verified_income, levels = c("Not Verified", "Source Verified", "Verified"))

lm(interest_rate ~ verified_income, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ verified_income, data = loans)
## 
## Coefficients:
##                    (Intercept)  verified_incomeSource Verified  
##                         11.099                           1.416  
##        verified_incomeVerified  
##                          3.254

We can ask R again which are the 3 levels of the variable verified_income:

levels(loans$verified_income)

## [1] "Not Verified"    "Source Verified" "Verified"

The first level listed, “Not verified”, will be the reference level in a linear regression model. We can then interpret the slopes as comparisons with the reference level. For example, borrowers who had their income source verified (but not the amount), had an interest rate that was on average 1.4% higher than those who didn’t have their income source verified. Borrowers who had their income amount verified, had an interest rate that was on average 3.25% higher than those who didn’t have their income verified.

The higher interest rate for borrowers who have verified their income source or amount is surprising. Intuitively, we’d think that a loan would look less risky if the borrower’s income has been verified. However, note that the situation may be more complex, and there may be confounding variables that we didn’t account for. For example, perhaps lender require borrowers with poor credit to verify their income. That is, verifying income in our data set might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan. For this reason, the borrower could be deemed higher risk, resulting in a higher interest rate. (What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

Now let’s mix categorical and numerical predictors (I know you’ve been waiting for this!). The LS model that has debt_to_income and verified_income as predictors is found by running:

lm(interest_rate ~ debt_to_income + verified_income, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ debt_to_income + verified_income, 
##     data = loans)
## 
## Coefficients:
##                    (Intercept)                  debt_to_income  
##                       10.34179                         0.03909  
## verified_incomeSource Verified         verified_incomeVerified  
##                        1.49071                         3.10048

The model is:

$\begin{eqnarray} \widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times verified\_income(Source Verified) \\ &+& 3.10\times verified\_income(Verified). \end{eqnarray}$

The interpretation of the coefficients are:

(Intercept) Borrowers with a 0 debt-to-income ratio and non-verified income, have on average an interest rate of 10.34%.
Regardless of type of verification, an average 1% increase in debt-to-income ratio is associated with an average increase of 0.039% in the loan’s interest rate.
Holding debt-to-income ratio constant, borrowers who have the source of their income verified tend to have an interest rate that is higher than those who have no verification by an average of 1.49%.
Holding debt-to-income ratio constant, borrowers who have the value of their income verified tend to have an interest rate that is higher than those who have no verification by an average of 3.10%.

Notice that this model produces three parallel lines for the relationship between interest_rate and debt_to_income, one for each verification type:

$\begin{eqnarray} \mbox{Not Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 0 + 3.10\times 0 \\ &=& 10.34 + 0.039\times debt\_to\_income. \\ \mbox{Source Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 1 + 3.10\times 0 \\ &=& 11.83 + 0.039\times debt\_to\_income. \\ \mbox{Verified}:\widehat{interest\_rate} &=& 10.34 + 0.039\times debt\_to\_income + 1.49\times 0 + 3.10\times 1 \\ &=& 13.44 + 0.039\times debt\_to\_income. \end{eqnarray}$

To plot the parallel lines, we can use the layer geom_parallel_slopes() from the moderndive package (first be sure to install the package).

library(moderndive)
ggplot(loans, aes(x = debt_to_income, y = interest_rate, color = verified_income)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE)

Finally, let’s use a larger number of predictors. The LS model that has annual_income, debt_to_income, bankruptcy, and verified_income as predictors is found by running:

lm(interest_rate ~ annual_income + debt_to_income + bankruptcy + verified_income, data = loans)

## 
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income + 
##     bankruptcy + verified_income, data = loans)
## 
## Coefficients:
##                    (Intercept)                   annual_income  
##                      1.084e+01                      -6.334e-06  
##                 debt_to_income                   bankruptcyyes  
##                      3.406e-02                       6.382e-01  
## verified_incomeSource Verified         verified_incomeVerified  
##                      1.530e+00                       3.125e+00

The model is:

$\begin{eqnarray} \widehat{interest\_rate} &=& 10.84 - 0.00000633\times annual\_income + 0.034\times debt\_to\_income\\ &+& 0.638\times bankruptcy + 1.53\times verified\_income(Source Verified) \\ &+& 3.125\times verified\_income(Verified). \end{eqnarray}$

The interpretation of a few of the coefficients are:

(Intercept) Borrowers with no annual income, a 0 debt_to_income, no bankruptcy, and non-verified income, have on average an interest rate of 10.84%.
Holding the other variables in the model constant, an average $10,000 increase in annual income is associated with an average decrease in interest rate of 0.063%.
Holding the other variables in the model constant, borrowers who have had a bankruptcy tend to have an interest rate that is higher than those who have no bankruptcy by an average of 0.64%.