5 More on LS regression
5.1 A word on statistical inference for regression
You may have noticed that when we apply the summary function to a linear model, we get a column labeled “Std. Error”, one labeled “t value”, and one labeled “Pr(>|t|)”. The column “Pr(>|t|)” contains probability values, called p-values, which are used to carry out hypothesis tests. (See section 2.6 for an introduction to hypothesis tests.)
For example, let’s take another look at the linear model for interest_rate that uses four variables as predictors (an introduction to the loans dataset is in chapter 4):
m <- lm(interest_rate ~ annual_income + debt_to_income + bankruptcy + verified_income, data = loans)
summary(m)
##
## Call:
## lm(formula = interest_rate ~ annual_income + debt_to_income +
## bankruptcy + verified_income, data = loans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8306 -3.5383 -0.7035 2.6924 19.8852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.084e+01 1.228e-01 88.274 < 2e-16 ***
## annual_income -6.334e-06 7.549e-07 -8.392 < 2e-16 ***
## debt_to_income 3.406e-02 3.287e-03 10.362 < 2e-16 ***
## bankruptcy 6.382e-01 1.468e-01 4.347 1.39e-05 ***
## verified_incomeSource Verified 1.530e+00 1.097e-01 13.947 < 2e-16 ***
## verified_incomeVerified 3.125e+00 1.292e-01 24.196 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.788 on 9970 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.08251, Adjusted R-squared: 0.08205
## F-statistic: 179.3 on 5 and 9970 DF, p-value: < 2.2e-16
Recall that an LS model with \(k\) predictors has the form
\[\hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2 X_2 + \cdots + \hat{b}_k X_k.\]
Usually, the values of the variables \(X_1\), \(X_2\), …, \(X_k\), and \(Y\) are a sample from a larger population or underlying process. In that case, we can think that there exist “true” population values for the slopes and intercept, and that \(\hat{a}\), \(\hat{b}_1\), …, \(\hat{b}_k\) are estimates of these true values. In the case of the loans data, we can think of the 10,000 loans in the dataset as a sample from a much larger population of borrowers, and that there exists a “true” linear model that represents the trend in the population.
Therefore, when constructing a linear model from a sample, in addition to interpreting the slopes and \(R^2\), it is also desirable to assess whether there’s evidence that the “true” slope for each variable is not zero. To understand why this matters, let’s think about what it would mean for a slope to be zero. We would interpret it as a change in \(X\) being associated with no change in \(Y\); that is, \(X\) and \(Y\) are likely unrelated. Very rarely, though, will we encounter a fitted slope that is exactly equal to 0. However, the “true” population slope could still be zero even if the sample slope is not. Let’s call the population slopes \(b_1\), \(b_2\), …, \(b_k\), and the population intercept \(a\) (that is, we take off the hats). We would like to assess, based on the data, the following competing hypotheses for each population slope in a linear model:
\(H_0\): \(b_j = 0\)
\(H_a\): \(b_j \neq 0,\)
for each \(j = 1, 2, \dots, k.\)
To carry out this assessment, we need to gain insight into the values that \(\hat{b}_j\) can possibly take when a sample of size \(n\) is obtained from a population or process. The distribution of all possible values of \(\hat{b}_j\) is called the sampling distribution of \(\hat{b}_j\). Under the assumption that the population parameter \(b_j\) is zero, the distribution of all possible values of \(\hat{b}_j\) is called the null distribution of \(\hat{b}_j\). The sampling and null distributions of an estimator are what we use to make statistical inferences about a population. We can take two approaches to inference, a theoretical one or a simulation-based one, which we will develop in more detail after we cover probability. The following is a brief description of the elements in a regression table that are related to statistical inference:
The standard error of \(\hat{b}_j\) quantifies how much variation in the fitted slope one would expect between different samples. It is defined as the standard deviation of the sampling distribution of \(\hat{b}_j\). The standard error values play a key role in inferring the value of the unknown population slope.
The “t value” for a slope \(\hat{b}_j\) is a test statistic relating to the hypothesis test:
\(H_0\): \(b_j = 0\)
\(H_a\): \(b_j \neq 0.\)
Its specific value is given by \(t=\frac{\hat{b}_j}{SE(\hat{b}_j)},\) where \(SE(\hat{b}_j)\) is the estimated standard error of \(\hat{b}_j.\) The larger the t value is in absolute value, the stronger the evidence for \(H_a\).
The p-value for a slope \(\hat{b}_j\) is a probability value relating to the hypothesis test stated above. More specifically, a p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true. We can intuitively think of the p-value as quantifying how “extreme” the observed fitted slope is in a “hypothesized universe” where there is no relationship between the explanatory and response variables. A small p-value is evidence in favor of \(H_a\). Traditionally, p-values that are smaller than 0.05 are considered acceptable evidence for \(H_a\) and would result in the rejection of \(H_0\). In other words, one would reject the hypothesis that there is no relationship between the explanatory and response variables in favor of the hypothesis that there is. That is to say, the evidence suggests there is a significant relationship.
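As a quick check, the t values and p-values in the regression table can be reproduced from the estimates and standard errors. The following is a sketch, assuming m is the model fitted above:
co <- summary(m)$coefficients
co[, "Estimate"] / co[, "Std. Error"]                # reproduces the "t value" column
2 * pt(-abs(co[, "t value"]), df = df.residual(m))   # reproduces the "Pr(>|t|)" column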
In our previous regression example, the p-values for all the slopes are extremely small, which means that for each slope, we can reject the hypothesis that the population slope is zero. In that case, we would say that the association between an explanatory variable and the response (when taking all other variables into account) is statistically significant. It should be noted, though, that statistical significance does not imply practical significance. The practical significance of an estimate depends on what associations are meaningful for the problem at hand. For example, in the model for interest_rate with four predictors, the slope for bankruptcy is 0.64, which means that, holding all other variables constant, having a bankruptcy in their history is associated with an average increase in interest rate of 0.64 percentage points. The p-value for bankruptcy indicates that this association is statistically significant (that is, there’s evidence that the population slope for bankruptcy is not zero). However, an increase of 0.64 percentage points is less than 1%, which may have very little practical meaning. We would say then that bankruptcy has a small (or tiny, in this case) effect on interest rate, even though it is statistically significant.
A caveat here is that the results of hypothesis tests for the slopes are only reliable if certain conditions for inference for LS regression are met, which we introduce in the next section.
5.2 Regression diagnostics
To perform statistical inference for LS regression, some conditions need to be satisfied. These conditions must be met for the assumed underlying mathematical and probability theory to hold true, and they can be checked by inspecting the residuals of the model (see definition 3.3). The four conditions needed for inference for regression are listed below.
“Straight enough” relationship between variables. This means that the relationship between each predictor and the response variable should be “straight enough”. In other words, there should be no obvious bend in their scatterplots.
Independence of the residuals. This means that the errors of the linear model vary in a random fashion, not a systematic one. In other words, the different observations in our data must be independent of one another.
Constant variance of the residuals. This means that the variability of the errors of the linear model remains approximately constant for different values of the predictors. In other words, the value and spread of the residuals should not depend on the values of the explanatory variables.
Normality of the residuals. Specifically, the distribution of the residuals is “bell-shaped” and centered at 0. Visually, this corresponds to a histogram that is fairly symmetric and bell-shaped, with a peak at 0. We will look at Normal distributions in more detail in chapter 8.
When we use the lm function to generate a linear model, several pieces of information are stored in the model object, and one of them is the residuals. We can retrieve them by running m$residuals, where “m” is the name we gave to the model. The following code shows the first 10 residuals for the last model we constructed:
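The original call isn’t shown here, but output of this form can be produced with, for example:
head(m$residuals, 10)   # first 10 residuals of the model m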
## 1 2 3 4 5 6
## 0.05922272 1.21124562 4.25042360 -4.27827534 -1.64970812 -4.12693040
## 7 8 9 10
## 0.63327093 -0.23725014 0.38670471 -4.58626564
Further, the plot function, when applied to a linear model, produces 4 plots that help with the diagnostics of the conditions above. For our previous model, the diagnostic plots are given by:
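The plots themselves are not reproduced here; a call along the following lines generates them (the 2-by-2 layout is just a display choice):
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(m)                # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage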
The diagnostic plots show residuals in four different ways. Let’s take a look at the first plot:
- Residuals vs Fitted
This plot shows whether the residuals have non-linear patterns. If there is a non-linear relationship between the predictor variables and the outcome variable that the model doesn’t capture, the pattern will show up in this plot. If you find residuals equally spread around a horizontal line, with no distinct patterns, that is a good indication that you don’t have non-linear relationships.
In our example above, there seems to be a pattern in the residuals, indicating that there is likely a better model that captures the relationship between interest_rate and the 4 predictors in the model.
- Normal Q-Q plot
This plot shows whether the residuals can be considered normally distributed. If the points follow a straight line, it is an indication that the residuals are normally distributed. However, it’s usually the case that there’s some deviation from the line; only Q-Q plots that deviate in an obvious and consistent way from the line should be flagged. Alternatively, one may plot a histogram of the residuals to check for normality.
In our example, the points don’t follow a straight line, which indicates that the residuals are not normal. Let’s check this with a histogram:
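A histogram of the residuals can be produced with, for example (the number of bins is an arbitrary choice):
hist(m$residuals, breaks = 30, main = "Histogram of residuals", xlab = "Residual")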
The histogram shows that the residuals are slightly right-skewed, which explains the deviation from the straight line in the Q-Q plot.
- Scale-Location plot
It’s also called the Spread-Location plot. This plot shows whether the residuals are spread equally along the ranges of the predictors; this is how you can check the assumption of constant variance. It’s a good sign if you see a horizontal line with equally (randomly) spread points. In our example, the residuals seem to have approximately constant variability.
- Residuals vs Leverage
This plot helps us find influential cases, if any. Not all outliers are influential in linear regression analysis. Even if an observation has extreme values, it might not be influential in determining the regression line: the results wouldn’t change much whether we include it in the analysis or not. Such cases follow the trend of the majority of the data and don’t really matter; they are not influential. On the other hand, some cases can be very influential even if their values appear to be within a reasonable range. They can go against the trend set by the majority of cases and can alter the results if we exclude them from the analysis.
Unlike in the other plots, patterns are not what we look for here. Instead, we watch for outlying values at the upper right or lower right corners; those are the places where cases can be influential against a regression line. Look for cases outside of the dashed lines, which mark Cook’s distance. When cases lie beyond a Cook’s distance of 1, they are likely influential to the regression results, and the results would likely change if we excluded them.
In our example, there are no influential outliers (no points beyond a Cook’s distance of 1).
Several of the aspects mentioned in these plots will be addressed again after we have covered more theory.
Independence condition
The four diagnostic plots don’t give us a clear way to assess whether the errors are independent. We can use the acf (autocorrelation) function to help with this task. The acf function calculates Pearson correlations of a variable with its lagged values and returns a plot of these autocorrelation values along with a confidence band for them. If autocorrelations are high (beyond the confidence band), it is an indication that the values provided are not independent.
For our example, the autocorrelations (lagged correlations) of the residuals are given by:
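The plot is not reproduced here; it can be generated with:
acf(m$residuals)   # autocorrelations of the residuals, with a confidence band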
Notice that all autocorrelations are within the blue band, supporting the idea that the residuals may be independent. An important remark is that low autocorrelations are not “proof” that the residuals are independent, but observing high autocorrelations is enough evidence that they are not! Therefore, the acf graph is a tool to help us catch instances of non-independence, rather than to “prove” independence.
Conclusion for the loans example:
For the model interest_rate ~ annual_income + debt_to_income + bankruptcy + verified_income, we must take the results of the hypothesis tests for the slopes with caution, because some conditions for inference seem to be violated. There is research showing that the tests are robust to some violations, for example of the normality condition. Sometimes, applying transformations to variables in the model can resolve some of the bad diagnostics.
5.3 Model selection
The best model is not always the most complicated. Sometimes including variables that are not evidently important can actually reduce the accuracy of predictions. In this section, we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important. It’s common (at least in the statistical world) to refer to models that have undergone such variable pruning as parsimonious.
In practice, the model that includes all available explanatory variables of interest is often referred to as the full model. The full model may not be the best model, and if it isn’t, we want to identify a smaller model that is preferable.
The adjusted \(R^2\) measures the strength of a model fit, and it is a useful tool for evaluating which predictors are adding value to the model, where adding value means they are (likely) improving the accuracy in predicting future outcomes.
Two common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection. These techniques are often referred to as stepwise model selection strategies, because they add or delete one variable at a time as they “step” through the candidate predictors.
5.3.1 Backward elimination
Backward elimination starts with the model that includes all potential predictor variables. Then variables are eliminated one-at-a-time from the model until we cannot improve the adjusted \(R^2\). The strategy within each elimination step is to eliminate the variable that leads to the largest improvement in adjusted \(R^2\).
To fit an LS model using all predictors available in a dataset, one can use the syntax lm(y ~ .), which means: fit a model that has y as the response and all other variables in the dataset as predictors. Let’s return to the loans dataset and consider the following 15 predictors (instead of only 4):
“homeownership”, “annual_income”, “bankruptcy”, “verified_income”, “debt_to_income”, “delinq_2y”, “months_since_last_delinq”, “earliest_credit_line”, “inquiries_last_12m”, “total_credit_lines”, “open_credit_lines”, “total_credit_limit”, “total_credit_utilized”, “term”, “current_installment_accounts”.
To make coding easier, we can define a new dataset with only these variables plus interest_rate. Let’s call this dataset loans2:
loans2 <- loans[, c("homeownership", "annual_income", "bankruptcy", "verified_income", "debt_to_income", "delinq_2y", "months_since_last_delinq", "earliest_credit_line", "inquiries_last_12m", "total_credit_lines", "open_credit_lines", "total_credit_limit", "total_credit_utilized", "term", "current_installment_accounts", "interest_rate")]
The code above selects, by name, the columns of the loans dataset that will be part of the new dataset.
Now we can obtain the full model for interest_rate by running:
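A call along these lines fits it (the object name full is our choice here, so that later snippets can refer to the fitted model):
full <- lm(interest_rate ~ ., data = loans2)   # all other columns of loans2 as predictors
summary(full)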
##
## Call:
## lm(formula = interest_rate ~ ., data = loans2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6630 -3.0612 -0.6911 2.3471 20.1936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.272e+01 1.856e+01 -4.996 6.07e-07 ***
## homeownershipOWN 3.837e-01 2.085e-01 1.840 0.0658 .
## homeownershipRENT 4.005e-01 1.659e-01 2.414 0.0158 *
## annual_income -1.815e-06 1.115e-06 -1.628 0.1036
## bankruptcy 5.549e-01 2.338e-01 2.373 0.0177 *
## verified_incomeSource Verified 8.683e-01 1.492e-01 5.819 6.37e-09 ***
## verified_incomeVerified 2.150e+00 1.788e-01 12.023 < 2e-16 ***
## debt_to_income 5.278e-02 5.262e-03 10.031 < 2e-16 ***
## delinq_2y 3.414e-01 8.205e-02 4.160 3.24e-05 ***
## months_since_last_delinq -1.891e-02 3.647e-03 -5.186 2.25e-07 ***
## earliest_credit_line 4.919e-02 9.262e-03 5.310 1.15e-07 ***
## inquiries_last_12m 2.472e-01 2.709e-02 9.126 < 2e-16 ***
## total_credit_lines -4.065e-02 8.777e-03 -4.632 3.73e-06 ***
## open_credit_lines 3.619e-03 1.785e-02 0.203 0.8393
## total_credit_limit -4.373e-06 4.729e-07 -9.248 < 2e-16 ***
## total_credit_utilized 6.950e-06 1.683e-06 4.128 3.72e-05 ***
## term 1.605e-01 5.970e-03 26.879 < 2e-16 ***
## current_installment_accounts 3.416e-02 2.733e-02 1.250 0.2114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 4313 degrees of freedom
## (5669 observations deleted due to missingness)
## Multiple R-squared: 0.2569, Adjusted R-squared: 0.254
## F-statistic: 87.71 on 17 and 4313 DF, p-value: < 2.2e-16
The values of \(R^2\) and \(R^2_{adj}\) are:
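Assuming the full model was stored as full, as in the sketch above, these values can be extracted from the model summary:
summary(full)$r.squared       # multiple R-squared
summary(full)$adj.r.squared   # adjusted R-squared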
## [1] 0.2568993
## [1] 0.2539703
Their values are quite similar, with \(R^2\) being just slightly higher than \(R^2_{adj}\) (recall that \(R^2_{adj}\) is never larger than \(R^2\)). Now we can experiment with trying to reduce this model to one with 14 predictors instead of 15. To decide which predictor should be eliminated, we can exclude a predictor and check how that exclusion changes the value of \(R^2_{adj}\). We record the changes in \(R^2_{adj}\) caused by excluding each variable; the variable whose elimination causes the largest increase in \(R^2_{adj}\) should be the first to be excluded. For example, if we eliminate open_credit_lines from the model, then \(R^2_{adj}\) becomes:
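The exact call used to obtain the value below isn’t shown, but one way to do this is to refit the model without that variable using update (again assuming the full model is stored as full):
m_minus <- update(full, . ~ . - open_credit_lines)   # drop open_credit_lines and refit
summary(m_minus)$adj.r.squared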
## [1] 0.2541361
Notice that this value of \(R^2_{adj}\) is larger than the one for the full model. This means that eliminating open_credit_lines is not much of a loss to the model. So we check the changes in \(R^2_{adj}\) for all the other variables and eliminate the one that causes \(R^2_{adj}\) to increase the most. Once we have a model with 14 variables, we repeat the same procedure to check whether one more variable can be eliminated, and so on. I won’t make you “watch” this entire process, but it ends when eliminating a variable no longer increases \(R^2_{adj}\).
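A whole elimination pass can be sketched as a loop that refits the model once per candidate variable and records the resulting adjusted \(R^2\) (a sketch only, assuming the full model is stored as full):
# One backward-elimination pass based on adjusted R^2: refit the model with each
# predictor removed in turn and record the adjusted R^2 of the reduced model.
vars <- attr(terms(full), "term.labels")
adj_r2 <- sapply(vars, function(v) {
  fit_minus <- update(full, as.formula(paste(". ~ . -", v)))
  summary(fit_minus)$adj.r.squared
})
sort(adj_r2, decreasing = TRUE)   # the top variable is the first candidate for removal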
This process is called backward elimination because we begin with a full model and then proceed to eliminate variables that don’t contribute much to the model.
This process can be used with any other measure of fit; it is not exclusive to \(R^2_{adj}\). For instance, we could have used the p-values instead and started by eliminating the variable with the largest p-value, then re-fitting the model, then eliminating the variable with the largest p-value in the re-fitted model, and so on, until all p-values are “small enough”.
One of the most popular measures used in model selection is the AIC, which stands for Akaike Information Criterion. Its formal definition is beyond the scope of this course, but it can be used in a similar way to \(R^2_{adj}\), with the difference that a lower AIC is better (in contrast to a higher \(R^2_{adj}\) being better). The value of the AIC is highly dependent on the data, and it is not bounded by 1 like \(R^2_{adj}\). The default implementation in R for model selection is the step function, which uses the AIC as its criterion for eliminating or adding variables. That is, at each step, we eliminate the variable whose removal decreases the AIC the most.
When running the step function, we need to ensure that the dataset comprised of all predictors and the response variable has no missing values (otherwise the function returns an error). We can accomplish this by applying na.omit to the dataset, as shown below:
full2 <- lm(interest_rate ~., data = na.omit(loans2))
reduced <- step(full2, direction = "backward")
## Start: AIC=12545.55
## interest_rate ~ homeownership + annual_income + bankruptcy +
## verified_income + debt_to_income + delinq_2y + months_since_last_delinq +
## earliest_credit_line + inquiries_last_12m + total_credit_lines +
## open_credit_lines + total_credit_limit + total_credit_utilized +
## term + current_installment_accounts
##
## Df Sum of Sq RSS AIC
## - open_credit_lines 1 0.7 77803 12544
## - current_installment_accounts 1 28.2 77831 12545
## <none> 77802 12546
## - annual_income 1 47.8 77850 12546
## - homeownership 2 125.2 77928 12548
## - bankruptcy 1 101.6 77904 12549
## - total_credit_utilized 1 307.5 78110 12561
## - delinq_2y 1 312.2 78115 12561
## - total_credit_lines 1 387.0 78189 12565
## - months_since_last_delinq 1 485.2 78288 12570
## - earliest_credit_line 1 508.7 78311 12572
## - inquiries_last_12m 1 1502.4 79305 12626
## - total_credit_limit 1 1542.8 79345 12629
## - debt_to_income 1 1815.1 79618 12643
## - verified_income 2 2608.5 80411 12684
## - term 1 13032.8 90835 13214
##
## Step: AIC=12543.59
## interest_rate ~ homeownership + annual_income + bankruptcy +
## verified_income + debt_to_income + delinq_2y + months_since_last_delinq +
## earliest_credit_line + inquiries_last_12m + total_credit_lines +
## total_credit_limit + total_credit_utilized + term + current_installment_accounts
##
## Df Sum of Sq RSS AIC
## - current_installment_accounts 1 35.3 77839 12544
## <none> 77803 12544
## - annual_income 1 47.5 77851 12544
## - homeownership 2 125.9 77929 12547
## - bankruptcy 1 102.1 77905 12547
## - total_credit_utilized 1 307.3 78110 12559
## - delinq_2y 1 311.5 78115 12559
## - months_since_last_delinq 1 486.2 78289 12569
## - earliest_credit_line 1 508.5 78312 12570
## - total_credit_lines 1 608.8 78412 12575
## - inquiries_last_12m 1 1515.6 79319 12625
## - total_credit_limit 1 1545.0 79348 12627
## - debt_to_income 1 1834.6 79638 12642
## - verified_income 2 2610.4 80414 12682
## - term 1 13032.6 90836 13212
##
## Step: AIC=12543.56
## interest_rate ~ homeownership + annual_income + bankruptcy +
## verified_income + debt_to_income + delinq_2y + months_since_last_delinq +
## earliest_credit_line + inquiries_last_12m + total_credit_lines +
## total_credit_limit + total_credit_utilized + term
##
## Df Sum of Sq RSS AIC
## <none> 77839 12544
## - annual_income 1 53.3 77892 12544
## - homeownership 2 131.5 77970 12547
## - bankruptcy 1 102.4 77941 12547
## - delinq_2y 1 310.1 78149 12559
## - total_credit_utilized 1 443.7 78282 12566
## - months_since_last_delinq 1 486.4 78325 12568
## - earliest_credit_line 1 559.5 78398 12573
## - total_credit_lines 1 573.7 78412 12573
## - inquiries_last_12m 1 1506.0 79345 12625
## - total_credit_limit 1 1571.1 79410 12628
## - debt_to_income 1 1871.0 79710 12644
## - verified_income 2 2599.6 80438 12682
## - term 1 13099.6 90938 13215
Notice that the step function took the following steps:
- First, it eliminated open_credit_lines, because this reduced the AIC the most (from 12545.55 to 12544).
- Then it re-fit the model without open_credit_lines and looked for the variable whose removal decreased the AIC the most. That variable was current_installment_accounts, which was then eliminated.
- Then it re-fit the previous model without current_installment_accounts and looked for another variable whose removal would decrease the AIC. No such variable was found, so the procedure stopped.
The final reduced model was:
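Its summary, shown below, can be printed with:
summary(reduced)   # the model returned by backward elimination with step()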
##
## Call:
## lm(formula = interest_rate ~ homeownership + annual_income +
## bankruptcy + verified_income + debt_to_income + delinq_2y +
## months_since_last_delinq + earliest_credit_line + inquiries_last_12m +
## total_credit_lines + total_credit_limit + total_credit_utilized +
## term, data = na.omit(loans2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9938 -3.0748 -0.6752 2.3622 20.1760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.644e+01 1.836e+01 -5.252 1.58e-07 ***
## homeownershipOWN 3.851e-01 2.083e-01 1.849 0.0646 .
## homeownershipRENT 4.133e-01 1.657e-01 2.495 0.0126 *
## annual_income -1.911e-06 1.112e-06 -1.718 0.0858 .
## bankruptcy 5.568e-01 2.337e-01 2.382 0.0173 *
## verified_incomeSource Verified 8.693e-01 1.492e-01 5.827 6.07e-09 ***
## verified_incomeVerified 2.146e+00 1.788e-01 12.002 < 2e-16 ***
## debt_to_income 5.331e-02 5.234e-03 10.184 < 2e-16 ***
## delinq_2y 3.399e-01 8.197e-02 4.146 3.45e-05 ***
## months_since_last_delinq -1.893e-02 3.646e-03 -5.193 2.17e-07 ***
## earliest_credit_line 5.104e-02 9.166e-03 5.569 2.72e-08 ***
## inquiries_last_12m 2.468e-01 2.701e-02 9.137 < 2e-16 ***
## total_credit_lines -3.707e-02 6.573e-03 -5.639 1.82e-08 ***
## total_credit_limit -4.399e-06 4.713e-07 -9.333 < 2e-16 ***
## total_credit_utilized 7.760e-06 1.565e-06 4.959 7.35e-07 ***
## term 1.608e-01 5.965e-03 26.948 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 4315 degrees of freedom
## Multiple R-squared: 0.2566, Adjusted R-squared: 0.254
## F-statistic: 99.27 on 15 and 4315 DF, p-value: < 2.2e-16
Notice that not all variables in the final model are statistically significant. If this is an important aspect of a desirable final model, then you could proceed to further eliminate variables.
5.3.2 Forward selection
The forward selection strategy is the reverse of the backward elimination technique. Instead of eliminating variables one-at-a-time, we add variables one-at-a-time until we cannot find any variables that improve the model (as measured by \(R^2_{adj}\), or AIC, or p-value, etc).
Backward elimination and forward selection sometimes arrive at different final models. For the previous example, a forward selection strategy gives:
min.model <- lm(interest_rate ~ 1, data = na.omit(loans2))
full_formula <- formula(full2)
reduced_f <- step(min.model, direction = "forward", scope = full_formula)
## Start: AIC=13797.53
## interest_rate ~ 1
##
## Df Sum of Sq RSS AIC
## + term 1 13089.7 91610 13221
## + verified_income 2 5030.8 99669 13588
## + debt_to_income 1 2476.2 102224 13696
## + total_credit_limit 1 1785.0 102915 13725
## + inquiries_last_12m 1 1400.8 103299 13741
## + earliest_credit_line 1 1351.9 103348 13743
## + annual_income 1 1332.6 103367 13744
## + months_since_last_delinq 1 702.7 103997 13770
## + delinq_2y 1 585.8 104114 13775
## + homeownership 2 519.2 104181 13780
## + current_installment_accounts 1 335.2 104365 13786
## + total_credit_lines 1 225.3 104474 13790
## + bankruptcy 1 163.6 104536 13793
## <none> 104700 13798
## + total_credit_utilized 1 29.1 104671 13798
## + open_credit_lines 1 10.3 104689 13799
##
## Step: AIC=13221.1
## interest_rate ~ term
##
## Df Sum of Sq RSS AIC
## + verified_income 2 3209.4 88401 13071
## + total_credit_limit 1 2950.5 88660 13081
## + debt_to_income 1 2374.4 89236 13109
## + annual_income 1 1657.7 89952 13144
## + earliest_credit_line 1 1567.2 90043 13148
## + homeownership 2 1267.8 90342 13165
## + inquiries_last_12m 1 1198.7 90411 13166
## + months_since_last_delinq 1 947.7 90662 13178
## + delinq_2y 1 833.1 90777 13184
## + total_credit_lines 1 560.9 91049 13196
## + bankruptcy 1 209.7 91400 13213
## + current_installment_accounts 1 116.2 91494 13218
## + open_credit_lines 1 97.1 91513 13218
## <none> 91610 13221
## + total_credit_utilized 1 13.6 91596 13222
##
## Step: AIC=13070.65
## interest_rate ~ term + verified_income
##
## Df Sum of Sq RSS AIC
## + total_credit_limit 1 3555.2 84845 12895
## + debt_to_income 1 1857.7 86543 12981
## + annual_income 1 1728.7 86672 12987
## + homeownership 2 1579.6 86821 12997
## + earliest_credit_line 1 1521.3 86879 12998
## + inquiries_last_12m 1 958.3 87442 13025
## + months_since_last_delinq 1 844.2 87556 13031
## + delinq_2y 1 755.4 87645 13036
## + total_credit_lines 1 554.2 87846 13045
## + bankruptcy 1 190.3 88210 13063
## + open_credit_lines 1 118.1 88283 13067
## + current_installment_accounts 1 116.6 88284 13067
## + total_credit_utilized 1 42.8 88358 13071
## <none> 88401 13071
##
## Step: AIC=12894.87
## interest_rate ~ term + verified_income + total_credit_limit
##
## Df Sum of Sq RSS AIC
## + debt_to_income 1 2085.47 82760 12789
## + inquiries_last_12m 1 1413.29 83432 12824
## + months_since_last_delinq 1 1084.38 83761 12841
## + delinq_2y 1 938.62 83907 12849
## + earliest_credit_line 1 798.09 84047 12856
## + total_credit_utilized 1 685.40 84160 12862
## + current_installment_accounts 1 393.87 84452 12877
## + annual_income 1 154.60 84691 12889
## + homeownership 2 175.57 84670 12890
## + bankruptcy 1 84.60 84761 12892
## <none> 84845 12895
## + total_credit_lines 1 27.61 84818 12896
## + open_credit_lines 1 25.52 84820 12896
##
## Step: AIC=12789.09
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income
##
## Df Sum of Sq RSS AIC
## + inquiries_last_12m 1 1361.15 81399 12719
## + months_since_last_delinq 1 1206.32 81554 12728
## + delinq_2y 1 1034.24 81726 12737
## + earliest_credit_line 1 1011.24 81749 12738
## + total_credit_utilized 1 254.95 82505 12778
## + total_credit_lines 1 217.38 82543 12780
## + homeownership 2 230.48 82530 12781
## + current_installment_accounts 1 125.47 82635 12784
## + bankruptcy 1 89.19 82671 12786
## <none> 82760 12789
## + open_credit_lines 1 24.84 82735 12790
## + annual_income 1 2.16 82758 12791
##
## Step: AIC=12719.26
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m
##
## Df Sum of Sq RSS AIC
## + months_since_last_delinq 1 1290.66 80108 12652
## + delinq_2y 1 1140.83 80258 12660
## + earliest_credit_line 1 846.83 80552 12676
## + total_credit_lines 1 501.13 80898 12694
## + homeownership 2 296.07 81103 12708
## + total_credit_utilized 1 183.05 81216 12712
## + open_credit_lines 1 145.30 81254 12714
## + current_installment_accounts 1 80.53 81318 12717
## <none> 81399 12719
## + bankruptcy 1 36.15 81363 12719
## + annual_income 1 3.50 81395 12721
##
## Step: AIC=12652.04
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq
##
## Df Sum of Sq RSS AIC
## + earliest_credit_line 1 895.49 79213 12605
## + total_credit_lines 1 523.25 79585 12626
## + delinq_2y 1 268.72 79839 12640
## + homeownership 2 299.46 79809 12640
## + total_credit_utilized 1 232.97 79875 12641
## + open_credit_lines 1 147.22 79961 12646
## + current_installment_accounts 1 94.65 80014 12649
## + bankruptcy 1 87.56 80021 12649
## <none> 80108 12652
## + annual_income 1 3.53 80105 12654
##
## Step: AIC=12605.35
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line
##
## Df Sum of Sq RSS AIC
## + total_credit_lines 1 313.680 78899 12590
## + delinq_2y 1 302.273 78910 12591
## + total_credit_utilized 1 220.503 78992 12595
## + homeownership 2 198.544 79014 12598
## + bankruptcy 1 105.084 79108 12602
## + open_credit_lines 1 89.743 79123 12602
## + current_installment_accounts 1 45.991 79167 12605
## <none> 79213 12605
## + annual_income 1 0.405 79212 12607
##
## Step: AIC=12590.17
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines
##
## Df Sum of Sq RSS AIC
## + total_credit_utilized 1 466.54 78433 12566
## + delinq_2y 1 314.66 78584 12575
## + current_installment_accounts 1 198.94 78700 12581
## + homeownership 2 203.19 78696 12583
## + bankruptcy 1 105.19 78794 12586
## <none> 78899 12590
## + open_credit_lines 1 13.73 78885 12591
## + annual_income 1 0.00 78899 12592
##
## Step: AIC=12566.48
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines + total_credit_utilized
##
## Df Sum of Sq RSS AIC
## + delinq_2y 1 323.29 78109 12551
## + bankruptcy 1 112.87 78320 12562
## + homeownership 2 117.90 78315 12564
## + current_installment_accounts 1 44.24 78388 12566
## + annual_income 1 36.86 78396 12566
## <none> 78433 12566
## + open_credit_lines 1 7.46 78425 12568
##
## Step: AIC=12550.59
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines + total_credit_utilized +
## delinq_2y
##
## Df Sum of Sq RSS AIC
## + bankruptcy 1 98.536 78011 12547
## + homeownership 2 115.496 77994 12548
## + current_installment_accounts 1 46.171 78063 12550
## + annual_income 1 40.279 78069 12550
## <none> 78109 12551
## + open_credit_lines 1 12.083 78097 12552
##
## Step: AIC=12547.13
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines + total_credit_utilized +
## delinq_2y + bankruptcy
##
## Df Sum of Sq RSS AIC
## + homeownership 2 118.878 77892 12544
## + current_installment_accounts 1 45.917 77965 12547
## + annual_income 1 40.587 77970 12547
## <none> 78011 12547
## + open_credit_lines 1 10.451 78000 12548
##
## Step: AIC=12544.52
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines + total_credit_utilized +
## delinq_2y + bankruptcy + homeownership
##
## Df Sum of Sq RSS AIC
## + annual_income 1 53.253 77839 12544
## + current_installment_accounts 1 41.077 77851 12544
## <none> 77892 12544
## + open_credit_lines 1 7.734 77884 12546
##
## Step: AIC=12543.56
## interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines + total_credit_utilized +
## delinq_2y + bankruptcy + homeownership + annual_income
##
## Df Sum of Sq RSS AIC
## <none> 77839 12544
## + current_installment_accounts 1 35.335 77803 12544
## + open_credit_lines 1 7.896 77831 12545
The final model in forward selection is:
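As before, its summary can be printed with:
summary(reduced_f)   # the model returned by forward selection with step()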
##
## Call:
## lm(formula = interest_rate ~ term + verified_income + total_credit_limit +
## debt_to_income + inquiries_last_12m + months_since_last_delinq +
## earliest_credit_line + total_credit_lines + total_credit_utilized +
## delinq_2y + bankruptcy + homeownership + annual_income, data = na.omit(loans2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9938 -3.0748 -0.6752 2.3622 20.1760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.644e+01 1.836e+01 -5.252 1.58e-07 ***
## term 1.608e-01 5.965e-03 26.948 < 2e-16 ***
## verified_incomeSource Verified 8.693e-01 1.492e-01 5.827 6.07e-09 ***
## verified_incomeVerified 2.146e+00 1.788e-01 12.002 < 2e-16 ***
## total_credit_limit -4.399e-06 4.713e-07 -9.333 < 2e-16 ***
## debt_to_income 5.331e-02 5.234e-03 10.184 < 2e-16 ***
## inquiries_last_12m 2.468e-01 2.701e-02 9.137 < 2e-16 ***
## months_since_last_delinq -1.893e-02 3.646e-03 -5.193 2.17e-07 ***
## earliest_credit_line 5.104e-02 9.166e-03 5.569 2.72e-08 ***
## total_credit_lines -3.707e-02 6.573e-03 -5.639 1.82e-08 ***
## total_credit_utilized 7.760e-06 1.565e-06 4.959 7.35e-07 ***
## delinq_2y 3.399e-01 8.197e-02 4.146 3.45e-05 ***
## bankruptcy 5.568e-01 2.337e-01 2.382 0.0173 *
## homeownershipOWN 3.851e-01 2.083e-01 1.849 0.0646 .
## homeownershipRENT 4.133e-01 1.657e-01 2.495 0.0126 *
## annual_income -1.911e-06 1.112e-06 -1.718 0.0858 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 4315 degrees of freedom
## Multiple R-squared: 0.2566, Adjusted R-squared: 0.254
## F-statistic: 99.27 on 15 and 4315 DF, p-value: < 2.2e-16
Notice that in this case the backward and forward selections arrived at the same final model.
A test statistic is a statistic computed from the sample that is used for hypothesis testing.