Chapter 6 Logistic Regression (FQA)
For this section we will be using the loans data set, which contains information about loan applications. The outcome variable is default, which indicates whether each loan defaulted (default = 1) or was repaid (default = 0).
Logistic regression allows us to model a binary (0/1) variable \(Y\) as a function of one or more \(X\) variables. The assumed underlying relationship is:
\[P(Y = 1) = \frac{1}{1 + e^{-(\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{k}X_{k})}}\]
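The fraction on the right-hand side is the inverse logit (sigmoid) function applied to an ordinary linear predictor, so it always returns a value between 0 and 1. As a minimal sketch (with made-up coefficient values, purely for illustration), we can apply the transformation directly; base R's plogis() computes the same function.
invLogit <- function(z) 1 / (1 + exp(-z))   # inverse logit: maps any real number into (0, 1)

b0 <- -2; b1 <- 0.5      # hypothetical values of beta_0 and beta_1
x1 <- 3                  # hypothetical value of X_1

invLogit(b0 + b1 * x1)   # P(Y = 1) implied by the model for this observation
plogis(b0 + b1 * x1)     # same result using base R's built-in logistic CDF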
We use the glm() function to fit a logistic regression in R. The syntax is similar to that of the lm() function, but we need to include the additional argument family = binomial.
modelLog <- glm(default ~ purpose + int.rate + installment + log.annual.inc + dti + fico,
data = loans,
family = binomial)
As with linear regression, the summary() function provides detailed information about our model.
##
## Call:
## glm(formula = default ~ purpose + int.rate + installment + log.annual.inc +
## dti + fico, family = binomial, data = loans)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2136 -0.6389 -0.5216 -0.3748 2.6753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.6986873 1.1960945 7.273 3.53e-13 ***
## purposecredit_card -0.5300076 0.1075344 -4.929 8.28e-07 ***
## purposedebt_consolidation -0.3679648 0.0757552 -4.857 1.19e-06 ***
## purposeeducational 0.1105924 0.1496645 0.739 0.4599
## purposehome_improvement 0.1257786 0.1243317 1.012 0.3117
## purposemajor_purchase -0.3810833 0.1642754 -2.320 0.0204 *
## purposesmall_business 0.5572451 0.1151029 4.841 1.29e-06 ***
## int.rate 3.4872802 1.7282796 2.018 0.0436 *
## installment 0.0011184 0.0001726 6.480 9.19e-11 ***
## log.annual.inc -0.2855770 0.0538629 -5.302 1.15e-07 ***
## dti 0.0059689 0.0043349 1.377 0.1685
## fico -0.0112846 0.0012846 -8.784 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8424.0 on 9577 degrees of freedom
## Residual deviance: 8003.4 on 9566 degrees of freedom
## AIC: 8027.4
##
## Number of Fisher Scoring iterations: 5
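If we want to work with the estimates themselves rather than read them off the printed summary (for example, to compute a prediction by hand, as we do below), we can extract them from the fitted model object. A quick sketch:
coef(modelLog)            # named vector of estimated coefficients
coef(summary(modelLog))   # coefficient table (estimate, std. error, z value, p-value) as a matrix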
We can also use predict() to apply our model to new observations, but for logistic regression we need to add the argument type = "response" so that predict() returns predicted probabilities rather than values on the log-odds (linear predictor) scale.
newData <- data.frame(purpose = "home_improvement", int.rate = 0.10, installment = 400,
log.annual.inc = 11, dti = 14.5, fico = 730)
predict(modelLog, newData, type = "response")
## 1
## 0.1581578
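As a sanity check that ties predict() back to the formula at the start of the section, we can reproduce this probability by hand from the fitted coefficients. The sketch below uses the coefficient names shown in the summary output; because purpose = "home_improvement", only the purposehome_improvement dummy enters the linear predictor, and plogis() applies the inverse-logit transformation.
b <- coef(modelLog)   # named vector of fitted coefficients

linPred <- b["(Intercept)"] +
  b["purposehome_improvement"] +   # dummy for purpose = "home_improvement"
  b["int.rate"] * 0.10 +
  b["installment"] * 400 +
  b["log.annual.inc"] * 11 +
  b["dti"] * 14.5 +
  b["fico"] * 730

plogis(linPred)   # should match the 0.1581578 returned by predict() above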