Chapter 12 Logistic Regression
12.1 Motivation
In previous weeks, we have been focusing on regression problems, where the response variable is quantitative. Here we will see how we can work with qualitative response variables, which leads to classification problems.
Qualitative variables take values in an unordered set. For example:
patient status \(\in\) \(\{\)dead, alive\(\}\)
weather \(\in\) \(\{\) rainy, cloudy, sunny \(\}\)
Take the case where we are trying to predict a patient’s outcome. We might have access to a number of variables (quantitative or qualitative) that indicate the patient’s health status, e.g. height, weight, age, gender, heart rate, consciousness.
Quantitative variables: height, weight, age, heart rate
Qualitative variables: gender, consciousness
The task here is to build a function that takes features \(X\) (e.g. height, age) as input and predicts the value for the response \(Y\) (i.e. dead or alive). While predicting the response itself is valuable, we often want to estimate the probability that \(X\) maps to one of the potential categories.
For example, if we are looking at insurance fraud, estimating the probability that a claim is fraudulent is often more valuable to an insurer than the classification of the claim as fraudulent or not. The same applies if we are looking at health outcomes, for example whether someone is likely to develop cancer in the future. Even in such life-threatening cases, estimating probabilities, and consequently the risk associated with a patient (ideally over time), is more valuable than simply asserting their future state.
Example: The Default data set
Here we will look at the Default dataset that contains information on credit card balance and income for 10,000 people, and also whether they have defaulted and whether they are a student or not.
library("ISLR2")
data(Default)
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
It’s good practice to check for missing data; here we check for missing values in the default variable:
print(paste("Missing data:", sum(is.na(Default$default)),sep=" ",collapse=""))
## [1] "Missing data: 0"
table(Default$default,dnn="Default")
## Default
## No Yes
## 9667 333
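As a quick check of the class balance, we can also compute the proportion of defaulters directly; given the table above, this should be 333/10000:
mean(Default$default == "Yes") # proportion of customers who defaulted
## [1] 0.0333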
We now produce a pairs plot to look at the relationship between all variables in our dataset:
pairs(Default)
The pairs plot isn’t particularly interesting, but we note that balance and income are the only two quantitative predictors. Let’s produce a scatter plot of balance vs income and colour the points by the default variable.
library(scales) #loads the alpha function to add transparency to colours
plot(Default$balance, Default$income, col=alpha(c("red","blue")[Default$default],0.6), pch=3, xlab="Balance", ylab="Income")
There seems to be a pattern emerging where users with a high balance are more likely to default. Let’s inspect this using boxplots:
boxplot(balance~default, data=Default)
boxplot(income~default,data=Default)
It seems that balance is likely to be a good predictor of whether someone will default or not, but income is not.
Let’s take default as our response variable \(Y\) and recode it as:
\[ Y = \begin{cases} 0 & \text{if No},\\ 1 & \text{if Yes}. \end{cases} \]
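In R, this recoding is a simple logical comparison (a minimal sketch; the model below constructs the same 0/1 variable directly inside the formula):
y <- as.numeric(Default$default == "Yes") # 1 if Yes (default), 0 if No
table(y)
## y
##    0    1
## 9667  333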
Note that by treating the response \(Y\) as a quantitative variable, we can apply the linear regression model (\(E(Y|X=x)=\beta_0+\beta_1x\)):
lm.fit = lm(as.numeric(default == "Yes") ~ balance, data=Default)
summary(lm.fit)
##
## Call:
## lm(formula = as.numeric(default == "Yes") ~ balance, data = Default)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23533 -0.06939 -0.02628 0.02004 0.99046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.075191959 0.003354360 -22.42 <0.0000000000000002 ***
## balance 0.000129872 0.000003475 37.37 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1681 on 9998 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1225
## F-statistic: 1397 on 1 and 9998 DF, p-value: < 0.00000000000000022
Let’s explore it further by producing a scatter plot of balance vs default and adding the regression line to the plot.
plot(Default$balance, as.numeric(Default$default=="Yes"),col="red",xlab="balance",ylab="default")
abline(lm.fit, col="blue", lwd = 2)
par(mfrow=c(1,2))
plot(y=resid(lm.fit), x=fitted(lm.fit))
qqnorm(resid(lm.fit))
par(mfrow=c(1,1))
This doesn’t seem like a particularly good model…
12.2 Simple Logistic Regression
Note that in the Default data example, \(E(Y|X=x)=P(Y=1|X=x)\), which means the linear regression model is trying to predict the probability of defaulting. Unfortunately, for balances close to zero we predict a negative probability of defaulting; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of defaulting, regardless of credit card balance, must fall between 0 and 1.
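We can see this directly by predicting from the linear fit at a balance of zero and at a hypothetical balance of £20,000 (well outside the observed range); using the coefficients above, the predictions are roughly -0.075 and 2.52, i.e. outside \([0,1]\):
# Linear model predictions fall outside [0, 1] at extreme balances
predict(lm.fit, newdata = data.frame(balance = c(0, 20000)))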
To avoid the inadequacies of the linear model fit on a binary response, we must model the probability of our response using a function that gives outputs between 0 and 1 for all values of \(X\). In logistic regression, we use the logistic function to transform/map the outputs to \((0,1)\).
Writing \(p(X)=P(Y=1|X)\), we define the logistic function \[ p(X)=\frac{\exp(\beta_0+\beta_1X)}{1+\exp(\beta_0+\beta_1X)} \] This transformation guarantees that \(p(X)\in(0,1)\) for any \(\beta_0\), \(\beta_1\), and \(X\). Note that \[ \ln\frac{p(X)}{1-p(X)}=\beta_0+\beta_1X \] and we can modify this to take any number of predictor variables. This transformation is called the logit transformation.
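As a quick numerical check (with arbitrary, made-up values for \(\beta_0\) and \(\beta_1\)), the logistic function stays strictly inside \((0,1)\) however extreme its input:
# Logistic function: maps any real number to (0, 1)
logistic <- function(x, b0, b1) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
range(logistic(seq(-10, 10, by = 0.5), b0 = -2, b1 = 0.5)) # both endpoints lie in (0, 1)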
The standard approach for producing estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) of the regression coefficients in the logistic regression model is maximum likelihood estimation. One of the advantages of this method is that maximum likelihood estimators have a number of nice theoretical properties which can be exploited to derive confidence intervals for the regression coefficients, perform hypothesis tests, and so on. In R, all this goes on under the hood when we use the glm function to fit the logistic regression model. When using the glm function, we need to set the argument family="binomial" to indicate that we want to fit a logistic regression model, and not some other kind of generalised linear model.
Now let’s fit a logistic regression model to the Default dataset using the glm function in R with family set to binomial.
glm.fit = glm(as.numeric(Default$default=="Yes") ~ balance, data = Default, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = as.numeric(Default$default == "Yes") ~ balance,
## family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.6513306 0.3611574 -29.49 <0.0000000000000002 ***
## balance 0.0054989 0.0002204 24.95 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
First, note that the fitted values in our regression are now between \(0\) and \(1\).
summary(glm.fit$fitted.values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000237 0.0003346 0.0021888 0.0333000 0.0142324 0.9810081
And a plot of our model output looks far more sensible. In the plot below, we have the raw data in red, the fitted logistic curve in blue, and the fitted values in black.
plot(Default$balance, as.numeric(Default$default=="Yes"),col="red",xlab="balance",ylab="default")
points(glm.fit$data$balance,glm.fit$fitted.values, col = "black", pch = 4)
curve(predict(glm.fit,data.frame(balance = x),type="response"),col="blue",lwd=2,add=TRUE)
From the logistic regression summary, we have estimates for \(\beta_0\) and \(\beta_1\) and we can estimate the probability of default for an individual given their balance. Say someone has a balance of \(£500\), then
\[ \hat{p}(X)=\frac{\exp(\hat{\beta_0}+\hat{\beta_1}X)}{1+\exp(\hat{\beta_0}+\hat{\beta_1}X)} \] and \(\hat{p}\) equals
exp(sum(glm.fit$coefficients*c(1,500)))/(1+exp(sum(glm.fit$coefficients*c(1,500))))
## [1] 0.0003699132
Repeat the above experiment with different values for the balance variable. What do you see?
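One convenient way to do this (equivalent to the manual calculation above) is to use predict() with type = "response" over a grid of balance values:
# Estimated default probabilities for a few illustrative balances
balances <- c(500, 1000, 1500, 2000, 2500)
predict(glm.fit, newdata = data.frame(balance = balances), type = "response")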
12.3 Logistic regression with several variables
We can extend the logit transformation to accommodate any number of predictor variables. Simply write:
\[ \ln\frac{p(X)}{1-p(X)}=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p \] and \[ p(X)=\frac{\exp(\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p)}{1+\exp(\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p)} \]
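For instance, on the Default data we could use balance, income, and the student indicator together as predictors (a sketch only; glm.fit2 is just an illustrative name and we do not analyse this model further here):
# Logistic regression with several predictors
glm.fit2 <- glm(as.numeric(default == "Yes") ~ balance + income + student, data = Default, family = "binomial")
summary(glm.fit2)$coefficients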
This can even be extended to Generalised Additive Models, which allow non-linear relationships to be considered in the model: \[ \ln\frac{p(X)}{1-p(X)}=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p) \]
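As a sketch of this additive extension (assuming the mgcv package is available; gam() and the smooth constructor s() come from that package, not from anything used above):
library(mgcv)
# Additive logistic model with smooth functions of balance and income
gam.fit <- gam(as.numeric(default == "Yes") ~ s(balance) + s(income), family = binomial, data = Default)
summary(gam.fit)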
12.4 Interpreting coefficients
The interpretation of the coefficients in the logistic regression isn’t as simple as in the linear regression scenario since we are predicting the probability of an outcome rather than the outcome itself.
If \(\beta_1=0\), then there is no relationship between \(Y\) and \(X\). If \(\beta_1>0\), then as \(X\) gets larger, so does the probability that \(Y=1\). If \(\beta_1<0\), the opposite happens: as \(X\) gets larger, the probability that \(Y=1\) gets smaller.
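Concretely, the logit form above says that increasing \(X\) by one unit changes the log-odds \(\ln\frac{p(X)}{1-p(X)}\) by exactly \(\beta_1\). We can verify this with the fitted model (the balance values 1000 and 1001 are chosen purely for illustration):
# Log-odds at balance = 1000 and balance = 1001 differ by exactly the balance coefficient
p1 <- predict(glm.fit, data.frame(balance = 1000), type = "response")
p2 <- predict(glm.fit, data.frame(balance = 1001), type = "response")
log(p2 / (1 - p2)) - log(p1 / (1 - p1)) # should match coef(glm.fit)["balance"]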
12.5 Significance
We also want to investigate whether the coefficient values are significant. That is, we want to know whether the coefficients are statistically different from zero.
In linear regression, we use a \(t\)-test to check significance for coefficients. In logistic regression, we use a \(z\)-test (normal test) instead. Let’s look at the summary of the regression:
summary(glm.fit) #default vs balance
##
## Call:
## glm(formula = as.numeric(Default$default == "Yes") ~ balance,
## family = "binomial", data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.6513306 0.3611574 -29.49 <0.0000000000000002 ***
## balance 0.0054989 0.0002204 24.95 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
In this case, we see that balance is likely to be a good predictor for default, and the \(z\)-test indicates that we should include this variable in our model. As we saw before, the higher the balance, the more likely someone is to default.
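The same asymptotic normality that underlies the \(z\)-test can also be used to obtain confidence intervals for the coefficients; for example, confint.default() gives Wald-type intervals (estimate \(\pm\) 1.96 standard errors):
# Wald-type 95% confidence intervals for the coefficients
confint.default(glm.fit)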