3.6 Logistic regression (optional)
Sometimes we want to predict a binary dependent variable, i.e., a variable that can take on only two values, based on a number of continuous or categorical independent variables. For example, say we’re interested in testing whether the probability that a listing is a gem depends on the price and the room type of the listing.
We cannot use ANOVA or linear regression here because the dependent variable is a binary variable and hence not normally distributed. Another problem with ANOVA or linear regression is that they may predict values that are not possible (e.g., our predicted value may be 5, but only 0 and 1 make sense for this dependent variable). Therefore, we will use logistic regression. Logistic regression first transforms the dependent variable \(\textrm{Y}\) with the logit transformation. The logit transformation takes the natural logarithm of the odds that the dependent variable is equal to 1:
\(\textrm{odds} = \dfrac{P(Y = 1)}{P(Y = 0)} = \dfrac{P(Y = 1)}{1 - P(Y = 1)}\)
and then \(\textrm{logit}(P(Y = 1)) = \textrm{ln}(\dfrac{P(Y = 1)}{1 - P(Y = 1)})\).
This makes sure the quantity we model can take on any value and is not constrained to be either 0 or 1.
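To see what the logit transformation does numerically, here is a small standalone sketch (the probabilities are made up, purely for illustration): we take a few values of \(P(Y = 1)\), compute the corresponding odds, and check that taking the natural logarithm of those odds matches R’s built-in logit function qlogis():
p <- c(0.10, 0.25, 0.50, 0.90) # some example values of P(Y = 1)
odds <- p / (1 - p) # odds = P(Y = 1) / (1 - P(Y = 1))
log(odds) # the logit: the natural logarithm of the odds
qlogis(p) # R's built-in logit function gives the same values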
Let’s carry out the logistic regression:
logistic.model <- glm(gem ~ price * room_type, data = airbnb, family = "binomial") # the family = "binomial" argument tells R to treat the dependent variable as a 0 / 1 variable
summary(logistic.model) # ask for the regression output
##
## Call:
## glm(formula = gem ~ price * room_type, family = "binomial", data = airbnb)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4909 -0.3819 -0.3756 -0.3660 2.5307
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5270702 0.0570207 -44.318 <2e-16 ***
## price -0.0007185 0.0004030 -1.783 0.0746 .
## room_typePrivate room -0.0334362 0.1072091 -0.312 0.7551
## room_typeShared room 1.0986318 1.1278159 0.974 0.3300
## price:room_typePrivate room -0.0011336 0.0012941 -0.876 0.3810
## price:room_typeShared room -0.0562610 0.0381846 -1.473 0.1406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8663.8 on 17650 degrees of freedom
## Residual deviance: 8648.6 on 17645 degrees of freedom
## AIC: 8660.6
##
## Number of Fisher Scoring iterations: 7
We see that the only marginally significant predictor of whether or not a listing is a gem is the price of the listing. You could report this as follows: “Controlling for room type and the interaction between price and room type, there was a marginally significant negative relationship between price and the probability of a listing being a gem (\(\beta\) = -0.0007185, z = -1.783, p = 0.075).”
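If you prefer to extract these numbers from the fitted model rather than reading them off the printout, the coefficient table of a glm is a matrix that can be indexed by row name; a minimal sketch:
coef(summary(logistic.model))["price", ] # estimate, standard error, z value and p value for price
round(coef(summary(logistic.model)), 4) # or the full coefficient table, rounded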
The interpretation of the regression coefficients in logistic regression is different from that in linear regression:
summary(logistic.model)
##
## Call:
## glm(formula = gem ~ price * room_type, family = "binomial", data = airbnb)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4909 -0.3819 -0.3756 -0.3660 2.5307
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5270702 0.0570207 -44.318 <2e-16 ***
## price -0.0007185 0.0004030 -1.783 0.0746 .
## room_typePrivate room -0.0334362 0.1072091 -0.312 0.7551
## room_typeShared room 1.0986318 1.1278159 0.974 0.3300
## price:room_typePrivate room -0.0011336 0.0012941 -0.876 0.3810
## price:room_typeShared room -0.0562610 0.0381846 -1.473 0.1406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8663.8 on 17650 degrees of freedom
## Residual deviance: 8648.6 on 17645 degrees of freedom
## AIC: 8660.6
##
## Number of Fisher Scoring iterations: 7
The regression coefficient of price is -0.0007185. This means that a one unit increase in price leads to a 0.0007185 decrease in the log odds of gem being equal to 1 (i.e., of a listing being a gem). For example:
\(\textrm{logit}(P(Y = 1 | \textrm{price} = 60)) = \textrm{logit}(P(Y = 1 | \textrm{price} = 59)) - 0.0007185\)
\(\Leftrightarrow \textrm{logit}(P(Y = 1 | \textrm{price} = 60)) - \textrm{logit}(P(Y = 1 | \textrm{price} = 59)) = -0.0007185\)
\(\Leftrightarrow \textrm{ln}(\textrm{odds}(Y = 1 | \textrm{price} = 60)) - \textrm{ln}(\textrm{odds}(Y = 1 | \textrm{price} = 59)) = -0.0007185\)
\(\Leftrightarrow \textrm{ln}\left(\dfrac{\textrm{odds}(Y = 1 | \textrm{price} = 60)}{\textrm{odds}(Y = 1 | \textrm{price} = 59)}\right) = -0.0007185\)
\(\Leftrightarrow \dfrac{\textrm{odds}(Y = 1 | \textrm{price} = 60)}{\textrm{odds}(Y = 1 | \textrm{price} = 59)} = e^{-0.0007185}\)
\(\Leftrightarrow \textrm{odds}(Y = 1 | \textrm{price} = 60) = e^{-0.0007185} \times \textrm{odds}(Y = 1 | \textrm{price} = 59)\)
Thus the regression coefficient in a logistic regression should be interpreted in terms of the relative increase in the odds of the dependent variable being equal to 1, for every unit increase in the predictor, controlling for all other factors in our model. In this case, the odds of a listing being a gem are multiplied by \(e^{-0.0007185}\) = 0.999, i.e., decreased by 0.1 percent, for every one unit increase in price. In other words, more expensive listings are less likely to be gems. In the specific example above, the odds of being a gem of a listing priced 60 are \(e^{-0.0007185}\) = 0.999 times the odds of being a gem of a listing priced 59; over a 5 unit difference, the effect compounds, so a listing priced 65 has \(e^{(-0.0007185 \times 5)}\) = 0.996 times the odds of being a gem of a listing priced 60.
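In R, these multiplicative effects can be obtained for all predictors at once by exponentiating the coefficients; a short sketch based on the model fitted above:
exp(coef(logistic.model)) # odds ratios: the factor by which the odds are multiplied per unit increase
# e.g., exp(-0.0007185) is roughly 0.999, the 0.1 percent decrease per unit of price discussed above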
3.6.1 Measuring the fit of a logistic regression: percentage correctly classified
Our model uses the listing’s price and room type to predict whether the listing is a gem or not. When making predictions, it’s natural to ask ourselves whether our predictions are any good. In other words, using price and room type, how often do we correctly predict whether a listing is a gem or not? To get an idea of the quality of our predictions, we can take the price and the room type of the listings in our dataset, predict whether the listings are gems, and compare our predictions to the actual gem status of the listings. Let’s first make the predictions:
airbnb <- airbnb %>%
  mutate(prediction = predict(logistic.model, airbnb))
# Create a new column called prediction in the airbnb data frame and store in it the prediction,
# based on logistic.model, for the airbnb data
# Have a look at those predictions:
head(airbnb$prediction)
## 1 2 3 4 5 6
## -4.790232 -4.448355 -4.049498 -4.619293 -4.106477 -4.847211
# Compare it with the observations:
head(airbnb$gem)
## [1] no gem no gem no gem no gem no gem no gem
## Levels: no gem gem
Do you see the problem? The predictions are logits, i.e., logarithms of odds that the listings are gems, but the observations simply tell us whether a listing is a gem or not. For a meaningful comparison between predictions and observations, we need to transform the logits into a decision: gem or no gem. It’s easy to transform logits into odds by taking the exponential of the logit. The relationship between odds and probabilities is the following:
\(\textrm{odds} = \dfrac{P(Y = 1)}{P(Y = 0)} = \dfrac{P(Y = 1)}{1 - P(Y = 1)}\)
\(\Leftrightarrow \dfrac{\textrm{odds}}{1 + \dfrac{P(Y = 1)}{1 - P(Y = 1)}} = \dfrac{\dfrac{P(Y = 1)}{1 - P(Y = 1)}}{1 + \dfrac{P(Y = 1)}{1 - P(Y = 1)}}\)
\(\Leftrightarrow \dfrac{\textrm{odds}}{1 + \textrm{odds}} = \dfrac{\dfrac{P(Y = 1)}{1 - P(Y = 1)}}{\dfrac{1 - P(Y = 1) + P(Y = 1)}{1 - P(Y = 1)}}\)
\(\Leftrightarrow \dfrac{\textrm{odds}}{1 + \textrm{odds}} = \dfrac{P(Y = 1)}{1 - P(Y = 1) + P(Y = 1)}\)
\(\Leftrightarrow \dfrac{\textrm{odds}}{1 + \textrm{odds}} = P(Y = 1)\)
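In other words, the probability is the inverse logit of the logit. As a quick sanity check (with a made-up logit value, purely for illustration), the manual odds-to-probability computation agrees with R’s built-in inverse logit plogis():
logit <- -2.5 # an arbitrary logit value, for illustration only
odds <- exp(logit) # logit -> odds
odds / (1 + odds) # odds -> probability, using the formula we just derived
plogis(logit) # R's built-in inverse logit gives the same probability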
Now let’s calculate, for each listing, the probability that the listing is a gem:
airbnb <- airbnb %>%
  mutate(prediction.logit = predict(logistic.model, airbnb),
         prediction.odds = exp(prediction.logit),
         prediction.probability = prediction.odds / (1 + prediction.odds))
# Have a look at the predicted probabilities:
head(airbnb$prediction.probability)
## 1 2 3 4 5 6
## 0.008242034 0.011562542 0.017132489 0.009763496 0.016198950 0.007789092
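As an aside, we could have obtained these probabilities in a single step: for models fitted with glm, predict() accepts a type = "response" argument that carries out the logit-to-probability transformation for us, so the line below should reproduce the numbers above.
head(predict(logistic.model, airbnb, type = "response")) # predicted probabilities, no manual transformation needed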
The first few numbers are very low probabilities, but there are some higher probabilities as well, and all predictions lie between 0 and 1, as they should. We now need to decide which probability is high enough for us to predict that a listing is a gem. An obvious choice is a probability of 0.5: a probability higher than 0.5 means we predict it’s more likely that a listing is a gem than that it is not. Let’s convert probabilities into predictions:
airbnb <- airbnb %>%
  mutate(prediction = case_when(prediction.probability <= .50 ~ "no gem",
                                prediction.probability > .50 ~ "gem"))
# Have a look at the predictions:
head(airbnb$prediction)
## [1] "no gem" "no gem" "no gem" "no gem" "no gem" "no gem"
A final step is to compare predictions with observations:
table(airbnb$prediction, airbnb$gem)
##
##           no gem    gem
##   no gem   16471   1180
Normally, we’d see a 2x2 table, but we see a table with one predicted value in the rows and two observed values in the columns. This is because all predicted probabilities are below .50 and therefore we always predict no gem. Let’s lower the threshold for predicting that a listing is a gem:
airbnb <- airbnb %>%
  mutate(prediction = case_when(prediction.probability <= .07 ~ "no gem",
                                prediction.probability > .07 ~ "gem"),
         prediction = factor(prediction, levels = c("no gem", "gem"))) # make sure no gem is the first level of our factor
# Have a look at the table:
table(airbnb$prediction, airbnb$gem)
##
##           no gem    gem
##   no gem   11240    854
##   gem       5231    326
We can see that with this decision rule (predict gem whenever the predicted probability of gem is higher than 7%), we get 11240 + 326 predictions correct and 5231 + 854 predictions wrong, which is a hit rate of (11240 + 326) / (11240 + 326 + 5231 + 854) = 65.5%.
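Rather than typing the cell counts by hand, we can also let R compute the hit rate from the confusion table; a small sketch using the objects created above (the diagonal holds the correct classifications because prediction and gem have their levels in the same order):
confusion <- table(airbnb$prediction, airbnb$gem) # rows: predictions, columns: observations
sum(diag(confusion)) / sum(confusion) # proportion on the diagonal, i.e., the proportion correctly classified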