Chapter 15 Logistic Regression

Hello! Today, we’ll be discussing what logistic regression is and how we can use this model for machine learning. Let us begin by loading some packages and importing our data, which focus on voting in the 2020 election.

library(tidyverse)
library(broom)
library(datasets)
library(gglm)

survey_data <- read_csv("data/survey_2020-10_kaiser_voting_31118013.csv")

If you check out the columns of this survey, you’ll find a range of potentially interesting variables. The two we’ll focus on today are whether a respondent intends to vote for Biden (survey_data$bidenvoter) or for Trump (survey_data$trumpvoter).

15.1 Data Cleaning

Let us begin by cleaning up our data. For our model, we’ll consider the following variables: party affiliation (this one should be the most significant and least surprising: Democrats will vote for Biden and Republicans will vote for Trump), job loss, children in the household, income, and attitudes towards abortion (more specifically, whether Roe v. Wade should be overturned).

#table(survey_data$vote2020)
survey_data$vote2020[survey_data$vote2020 == "Else"] <- NA
survey_data$vote2020[survey_data$vote2020 == "Undecided voter"] <- NA

#table(survey_data$bidenvoter)
#table(survey_data$trumpvoter)

#table(survey_data$party) #party affiliation
survey_data$party[survey_data$party == "Don't Know"] <- NA
survey_data$party[survey_data$party == "Or what?"] <- NA
survey_data$party[survey_data$party == "Refused"] <- NA
survey_data$party <- as.factor(survey_data$party)

#table(survey_data$inclosstotal) #did someone lose a job or hours
survey_data$inclosstotal[survey_data$inclosstotal == "Don't Know"] <- NA
survey_data$inclosstotal[survey_data$inclosstotal == "Don't know"] <- NA
survey_data$inclosstotal[survey_data$inclosstotal == "Refused"] <- NA
survey_data$inclosstotal <- as.factor(survey_data$inclosstotal)

#table(survey_data$child) #children in the household?
survey_data$child[survey_data$child == "Refused"] <- NA
survey_data$child <- as.factor(survey_data$child)

#table(survey_data$income) #income
survey_data$income[survey_data$income == "Refused"] <- NA
survey_data$income[survey_data$income == "Don't Know"] <- NA
survey_data$income <- factor(survey_data$income, levels = c("Less than $20,000", 
                                                            "$20,000 to less than $30,000", 
                                                            "$30,000 to less than $40,000",
                                                            "$40,000 to less than $50,000", 
                                                            "$50,000 to less than $75,000", 
                                                            "$75,000 to less than $90,000",
                                                            "$90,000 to less than $100,000", 
                                                            "$100,000 or more"))
survey_data$income <- as.numeric(survey_data$income)

#table(survey_data$q11) #abortion; should Roe v. Wade be overturned?
survey_data$q11[survey_data$q11 == "Refused"] <- NA
survey_data$q11[survey_data$q11 == "Don't Know"] <- NA
survey_data$q11[survey_data$q11 == "Don't know"] <- NA
survey_data$q11 <- as.factor(survey_data$q11)
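
As a side note, the repeated recoding above can also be written more compactly with dplyr. Here is a sketch that does the same thing for the four factor variables (missing_codes is a hypothetical helper vector collecting the codes we recoded above):

missing_codes <- c("Don't Know", "Don't know", "Refused", "Or what?")
survey_data <- survey_data %>%
  mutate(across(c(party, inclosstotal, child, q11),
                ~ as.factor(replace(.x, .x %in% missing_codes, NA))))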

In addition to cleaning our variables, let’s select the columns we need and make sure that our dependent variables are stored as binaries (TRUE/FALSE) rather than as text.

selected_variables <- survey_data %>% 
  dplyr::select(c("vote2020", "bidenvoter", "trumpvoter", "party", "inclosstotal", "child", "income", "q11"))
colnames(selected_variables)[5] <- "jobloss" #let's rename these two variables so they are more intuitive
colnames(selected_variables)[8] <- "abortion"

#bidenvoter and trumpvoter are character variables ("Bidenvoter"/"Else" and "Trumpvoter"/"Else"), so let's turn them into logical (TRUE/FALSE) variables
selected_variables$bidenv <- ifelse(selected_variables$bidenvoter == "Bidenvoter", TRUE, FALSE)
selected_variables$trumpv <- ifelse(selected_variables$trumpvoter == "Trumpvoter", TRUE, FALSE)
selected_variables$party <- relevel(selected_variables$party, "Independent")

For the purposes of this tutorial, we are going to focus on the participants who answered all of the aforementioned questions. This means we will lose some data (i.e., the participants who did not answer one or more of these variables).

logistic_data <- selected_variables[complete.cases(selected_variables), ]
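
If you are curious how many respondents were dropped in this step, you can compare the number of rows before and after:

nrow(selected_variables) - nrow(logistic_data) #number of respondents removed because of missing values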

If you are interested in advanced solutions for imputing or filling missing data, one package I have come across recently is mice (check out the documentation here and this slightly advanced tutorial here).
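
As a very rough sketch (not run here, and with settings chosen arbitrarily), multiple imputation with mice might look something like the commented-out code below: mice() builds several imputed copies of the data and complete() extracts one of them.

#library(mice)
#imputed <- mice(selected_variables, m = 5, seed = 123) #create 5 imputed copies of the data
#completed <- complete(imputed, 1) #extract the first imputed dataset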

For other options to deal with missing data, I encourage you to check out Selva Prabakaran’s tutorial.

What does this data frame look like?

head(logistic_data)
## # A tibble: 6 x 10
##   vote2020    bidenvo~1 trump~2 party jobloss child income abort~3 bidenv trumpv
##   <chr>       <chr>     <chr>   <fct> <fct>   <fct>  <dbl> <fct>   <lgl>  <lgl> 
## 1 Biden voter Bidenvot~ Else    Inde~ No      Yes        8 No, do~ TRUE   FALSE 
## 2 Biden voter Bidenvot~ Else    Inde~ No      No         5 No, do~ TRUE   FALSE 
## 3 Biden voter Bidenvot~ Else    Demo~ Yes     No         4 No, do~ TRUE   FALSE 
## 4 Trump voter Else      Trumpv~ Repu~ No      No         8 Yes, o~ FALSE  TRUE  
## 5 Biden voter Bidenvot~ Else    Inde~ No      No         3 No, do~ TRUE   FALSE 
## 6 Biden voter Bidenvot~ Else    Demo~ No      No         8 No, do~ TRUE   FALSE 
## # ... with abbreviated variable names 1: bidenvoter, 2: trumpvoter, 3: abortion

Huzzah! This data frame (logistic_data) has 2 labels that we want the computer to predict (bidenv and trumpv). To do this, we will use 5 features (in this case, variables): party affiliation, whether the respondent lost a job or work hours, whether there are children in the household, income, and opinions about abortion. Features are an essential concept in machine learning. Simply put, a feature is one of the variables you put into a machine learning algorithm (you may recognize the term “feature” from quanteda::dfm, which stands for “document-feature matrix”). Machine learning discourse often focuses on “features” (the input) and “labels” (the output), so it’s worth becoming familiar with the language.

Let us now proceed with our logistic regression.

15.2 Logistic Regression

A logistic regression is a useful model when your dependent variable is binary, meaning that it can take only two values (e.g., “Do you intend to vote or not? Do you have COVID-19 or not?”).

For this tutorial, we will use logistic regression to estimate two models: (1) whether a person will vote for Biden or not and (2) whether a person will vote for Trump or not. We can use glm() to create these models. But, rather than using the Gaussian (normal) distribution, we will use the binomial distribution by changing the family argument.
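
Under the hood, the binomial family (with its default logit link) models the log-odds of the outcome as a linear combination of the predictors:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

Here, p is the probability that the outcome is TRUE (for example, the probability that a respondent is a Biden voter). glm() estimates the coefficients on this log-odds scale, which is why we will later use predict() with type = "response" to convert predictions back into probabilities.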

15.2.1 Creating Models

b_logit <- glm(bidenv ~ party + abortion + jobloss + child + income, data = logistic_data, family = "binomial")
summary(b_logit)
## 
## Call:
## glm(formula = bidenv ~ party + abortion + jobloss + child + income, 
##     family = "binomial", data = logistic_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0149  -0.1152   0.1539   0.1752   3.2008  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.965804   0.397286   2.431   0.0151 *  
## partyDemocrat          3.047042   0.444175   6.860 6.89e-12 ***
## partyRepublican       -3.313814   0.344844  -9.610  < 2e-16 ***
## abortionYes, overturn -3.087941   0.372446  -8.291  < 2e-16 ***
## joblossYes             0.105518   0.285927   0.369   0.7121    
## childYes               0.007559   0.332979   0.023   0.9819    
## income                 0.051966   0.060120   0.864   0.3874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1074.5  on 791  degrees of freedom
## Residual deviance:  362.1  on 785  degrees of freedom
## AIC: 376.1
## 
## Number of Fisher Scoring iterations: 6
anova(b_logit, test = 'Chisq') #%>% tidy()
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: bidenv
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                       791    1074.48             
## party     2   609.78       789     464.70   <2e-16 ***
## abortion  1   101.75       788     362.95   <2e-16 ***
## jobloss   1     0.09       787     362.86   0.7673    
## child     1     0.02       786     362.84   0.8906    
## income    1     0.75       785     362.10   0.3878    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can get the output of this logistic regression in one of two ways. The first is to use summary(), as we traditionally have. If we are interested in the cumulative statistical significance of a categorical variable, it may also be useful to use anova().
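
Since we loaded broom at the start of this chapter, a third option is tidy(), which returns the coefficient table as a data frame. Setting exponentiate = TRUE converts the log-odds coefficients into odds ratios (values above 1 increase the odds of being a Biden voter, values below 1 decrease them), which many people find easier to interpret:

tidy(b_logit, exponentiate = TRUE, conf.int = TRUE)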

t_logit <- glm(trumpv ~ party + abortion + jobloss + child + income, data = logistic_data, family = "binomial")
summary(t_logit)
## 
## Call:
## glm(formula = trumpv ~ party + abortion + jobloss + child + income, 
##     family = "binomial", data = logistic_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2008  -0.1752  -0.1539   0.1152   3.0149  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -0.965804   0.397286  -2.431   0.0151 *  
## partyDemocrat         -3.047042   0.444175  -6.860 6.89e-12 ***
## partyRepublican        3.313814   0.344844   9.610  < 2e-16 ***
## abortionYes, overturn  3.087941   0.372446   8.291  < 2e-16 ***
## joblossYes            -0.105518   0.285927  -0.369   0.7121    
## childYes              -0.007559   0.332979  -0.023   0.9819    
## income                -0.051966   0.060120  -0.864   0.3874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1074.5  on 791  degrees of freedom
## Residual deviance:  362.1  on 785  degrees of freedom
## AIC: 376.1
## 
## Number of Fisher Scoring iterations: 6
anova(t_logit, test = 'Chisq') #%>% tidy()
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: trumpv
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                       791    1074.48             
## party     2   609.78       789     464.70   <2e-16 ***
## abortion  1   101.75       788     362.95   <2e-16 ***
## jobloss   1     0.09       787     362.86   0.7673    
## child     1     0.02       786     362.84   0.8906    
## income    1     0.75       785     362.10   0.3878    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

15.2.2 Model Evaluation

When it comes to logistic regressions, R-squares are not appropriate because the output of a logistic regression is a proportion (likelihood of one of the binaries). However, there are other fit statistics that can help you evaluate how good your logit regression model is.
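
For example, one commonly reported alternative is McFadden’s pseudo-R-squared, which compares the deviance of the fitted model against the deviance of an intercept-only (null) model; values closer to 1 indicate a better fit. A quick sketch using the deviances stored in the fitted glm object:

with(b_logit, 1 - deviance / null.deviance) #McFadden's pseudo R-squared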

One particularly useful tool for comparing models is, in fact, the anova() function! Let’s try this now by making a logistic model for likelihood of voting for Trump, but without the party variable.

t_logit_nopty <- glm(trumpv ~ abortion + jobloss + child + income, data = logistic_data, family = "binomial")
#summary(t_logit_nopty)

anova(t_logit_nopty, t_logit) #%>% tidy()
## Analysis of Deviance Table
## 
## Model 1: trumpv ~ abortion + jobloss + child + income
## Model 2: trumpv ~ party + abortion + jobloss + child + income
##   Resid. Df Resid. Dev Df Deviance
## 1       787     731.41            
## 2       785     362.10  2   369.31

A couple of things to note from these results. First, the drop in residual deviance is very large (369.31 on 2 degrees of freedom); if you add test = "Chisq" to the anova() call, you will see that this difference is statistically significant, which suggests that one model is substantially better than the other. We can confirm this with other diagnostics, like the AIC (Akaike information criterion) and BIC (Bayesian information criterion), which are both tools for choosing optimal models. You can learn more about model selection tools (or “criteria”) here.

Interpreting AIC and BIC results is easy: the smaller the number, the better the fit. Therefore, when comparing two models, you want to choose the one with the smaller value.

AIC(t_logit)
## [1] 376.0954
AIC(t_logit_nopty)
## [1] 741.4095
BIC(t_logit)
## [1] 408.8173
BIC(t_logit_nopty)
## [1] 764.7823

Using both the AIC and the BIC, we can see that t_logit is the better model. Sometimes, like here, AIC and BIC will point to the same model. But sometimes, the two criteria will suggest different optimal models. It is worth getting a better sense of how these criteria select the “best” model. In general, AIC will tend to prefer larger models (models with more variables) than BIC, because BIC applies a heavier penalty to overly complex models with many parameters that do not provide much information. As mentioned in this Stack Exchange thread, some folks prefer to use the BIC when trying to select an optimal lag for a time series model. But it is good practice to report the results of both the AIC and BIC for transparency (as a note, this information is often provided in the appendix of a paper, as opposed to the main text).
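
For reference, the two criteria differ only in how heavily they penalize the number of estimated parameters k (with n observations and maximized likelihood):

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}), \qquad \mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$$

Because ln(n) exceeds 2 once n is larger than about 7, the BIC penalty is almost always the stricter one.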

Want to learn more about using anova() to compare regression models? YaRrr has a great tutorial on that.

15.3 Logistic Regression for ML

But how do we use logistic regression for machine learning? To do this, we need a way to check how well the model we have created predicts observations it has not yet seen.

To use this logistic regression model for basic machine learning, we will use the popular caret package. caret stands for Classification And REgression Training, and it is extremely popular among programmers who use R for machine learning. We’ll be using caret more when we get to the supervised machine learning class, so if you want to get a head start on this content, check out this (long) tutorial by caret author and RStudio software engineer Max Kuhn. He also has a shorter vignette on caret, which you can find here.

#install.packages("caret") 
#caret is a pretty big package with lots of dependencies, so it may take some time to load
library(caret)

15.3.1 Training & Test Sets

Currently, we have one dataset with labels (in our case, “labels” means two variables: a variable for whether someone voted for Biden or not and a variable for whether someone voted for Trump or not). To do supervised machine learning with this dataset, we’ll have to split the data in two. The first part will be used to train a logistic regression model (like we did above). The second part will be used to test the accuracy of the logistic regression model. The second partition allows us to ask: does the logistic regression model we are creating accurately predict our variables of interest?

In caret, there is a useful function called createDataPartition(), which will create an index you can use to split your data. You can also designate the proportion using the p argument in createDataPartition(). A common rule of thumb is to split data so that 70 or 80% of the data is used for training and 30 or 20% is used for testing. With larger datasets, you can partition your data further, which is useful if you need to train multiple models, as this person said they did.

Another common reason to have more than two partitions (“splits”) is when you want to tweak your parameters as you model. When doing this, it is useful to build a “validation set” by splitting your training set into a training partition (the part you use to train the original model) and a validation partition, which you can use to fine-tune your model. The reason you do not use your test set for this is that you want the test set to remain “pure”, that is, untouched by the model. Remember, you want the test set to test the quality of the model, so it should not be used until you are ready to evaluate the final model. A sketch of such a three-way split appears below.
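
As a rough sketch (with hypothetical object names, and not used in the rest of this tutorial), a three-way split could be built by calling createDataPartition() twice: once to hold out the test set and once to carve a validation set out of what remains.

trainval_index <- createDataPartition(y = logistic_data$bidenvoter, p = 0.8, list = FALSE)
train_val <- logistic_data[trainval_index, ] #80% of the data for training and validation
holdout_test <- logistic_data[-trainval_index, ] #20% held back for the final test
train_index <- createDataPartition(y = train_val$bidenvoter, p = 0.75, list = FALSE)
training <- train_val[train_index, ] #0.75 * 0.8 = 60% of the original data
validation <- train_val[-train_index, ] #the remaining 20%, used for fine-tuning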

For now, given our sample size (and time), let’s just split our data into two partitions: a training set (80%) and a test set (20%).

split <- createDataPartition(y = logistic_data$bidenvoter, p = 0.8, list = FALSE)
#?createDataPartition #learn more about this function

training_set <- logistic_data[split,] #we will use this data to train the model
test_set <- logistic_data[-split,] #we will use this data to test the model
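
One thing to keep in mind: createDataPartition() samples rows at random, so your split (and therefore the exact numbers in the output below) will differ slightly from run to run. If you want the partition to be reproducible, set a seed before creating it, for example:

set.seed(1234) #any number works; run this line before calling createDataPartition()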

Learn more about data partitions here.

15.3.2 Modeling

Now that we have our partitions, let’s construct our model!

t_logit_trained <- glm(trumpv ~ party + abortion + jobloss + child + income, data = training_set, family = "binomial")
summary(t_logit_trained)
## 
## Call:
## glm(formula = trumpv ~ party + abortion + jobloss + child + income, 
##     family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1820  -0.1928  -0.1722   0.1211   2.8942  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.12213    0.46411  -2.418   0.0156 *  
## partyDemocrat         -2.82005    0.45209  -6.238 4.44e-10 ***
## partyRepublican        3.50363    0.40660   8.617  < 2e-16 ***
## abortionYes, overturn  2.93653    0.38908   7.547 4.44e-14 ***
## joblossYes             0.03112    0.31988   0.097   0.9225    
## childYes              -0.14416    0.37434  -0.385   0.7001    
## income                -0.03272    0.06838  -0.478   0.6323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 861.49  on 634  degrees of freedom
## Residual deviance: 294.62  on 628  degrees of freedom
## AIC: 308.62
## 
## Number of Fisher Scoring iterations: 6

If you compare this to the results of our first model (summary(t_logit)), you will find that the results are pretty similar (though not exactly the same, obviously, since you are using only a part of the data).

15.3.3 Testing the model

Now that we have our model, we are ready to test it. We’ll do this using the base R function predict(). We give predict() two things: a glm model (in our case, t_logit_trained) and, through the newdata argument, the new dataset we want to apply this model to. Because we are using a logistic regression, we will also modify another argument, type. The default, type = "link", returns predictions on the scale of the linear predictor (the log-odds), so we will change this to type = "response" to make sure we get predicted probabilities for our binary dependent variable.

test_predict <- predict(t_logit_trained, 
                        newdata = test_set,
                        type = "response")
head(test_predict)
##          1          2          3          4          5          6 
## 0.99367113 0.01361791 0.01621013 0.89574985 0.01843492 0.01618461

When reading the head() of this new vector of predicted values, you may notice that we do not have a clear “yes” or “no” response. Instead, what we have are probabilities: the predicted likelihood that someone will vote for Trump.

This is different from our test_set$trumpv variable (as we can see below). To make comparisons, therefore, we should establish a decision rule. The one we’ll use in this tutorial is: if the model predicts that a person has an over 60% chance of voting for Trump, we will assign that prediction TRUE. If not, we will assign it FALSE.

data.frame(test_predict, test_set$trumpv) %>% head(10)
##    test_predict test_set.trumpv
## 1    0.99367113            TRUE
## 2    0.01361791           FALSE
## 3    0.01621013           FALSE
## 4    0.89574985           FALSE
## 5    0.01843492           FALSE
## 6    0.01618461           FALSE
## 7    0.89465336            TRUE
## 8    0.01517500           FALSE
## 9    0.01674025           FALSE
## 10   0.99461252            TRUE
test_predict_binary <- ifelse(test_predict > 0.6, TRUE, FALSE)
data.frame(test_predict_binary, test_set$trumpv) %>% head(10)
##    test_predict_binary test_set.trumpv
## 1                 TRUE            TRUE
## 2                FALSE           FALSE
## 3                FALSE           FALSE
## 4                 TRUE           FALSE
## 5                FALSE           FALSE
## 6                FALSE           FALSE
## 7                 TRUE            TRUE
## 8                FALSE           FALSE
## 9                FALSE           FALSE
## 10                TRUE            TRUE

As you can already see from these results, our predictions are not perfect. With the 4th observation, the model predicts that the participant will vote for Trump (a predicted probability of about 0.90). However, that participant said they did not plan to vote for Trump.

Given this, how can we assess our model to know whether it is really a “good” model? Are there a lot of folks like the participant in row 4, or are they an outlier? Let’s use table() to compare our predicted and true results.

comparison <- table(test_predict_binary, test_set$trumpv)
sum(diag(comparison))/sum(comparison)
## [1] 0.9171975

Based on this table, we can count the false positives (situations where the model thought the participant would vote for Trump when that person said otherwise) and the false negatives (situations where the model thought the participant would not vote for Trump, but the participant said they would); together, these errors make up roughly 8% of the test set. Because we did not set a seed before partitioning, the exact counts will shift a little from run to run. Our 4th observation falls into the first category (a false positive).

These results mean that the model is accurate about 92% of the time.
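
As a side note, caret also ships with confusionMatrix(), which prints the full confusion matrix along with accuracy, sensitivity, specificity, and several other classification metrics. A quick sketch (both vectors are converted into factors with the same levels first):

confusionMatrix(factor(test_predict_binary),
                factor(test_set$trumpv),
                positive = "TRUE") #treat a Trump voter as the "positive" class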

15.4 Bonus: ROC & AUC

Another useful way to evaluate a model is to look at the “ROC curve”. ROC stands for receiver operating characteristic, and ROC curves are common when using a machine learning algorithm to classify binary observations.

A related statistic is the AUC, which stands for “Area Under [the ROC] Curve”. The higher the AUC, the more accurate our classifier. AUC scores generally range from 0.5 (a classifier that is no better than random chance) to 1 (a perfectly accurate classifier).

To construct our ROC curve, we’ll use the ROCR package, which is a great package for visualizing the performance of classifiers. Learn more about this package here. For our AUC metric, we’ll use the Metrics package, which contains many functions for evaluating supervised machine learning algorithms. Check out the documentation for this package here.

#install.packages("Metrics")
#install.packages("ROCR")
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(ROCR)

The first thing we’ll want to do is create a prediction object using prediction() from the ROCR package. This is the package’s way of wrangling data into the format used by its other functions. Then, we’ll use performance() to create a performance object, which we can use to plot our ROC curve. In addition to our prediction object (pr), our performance() call will include two more arguments: measure and x.measure, which explicitly state the evaluation measures to use. Check out the help page (?performance) for a list of additional performance measures.

pr <- ROCR::prediction(test_predict, test_set$trumpv)

perf <- ROCR::performance(pr,
                          measure = "tpr", #true positive rate
                          x.measure = "fpr") #false positive rate
plot(perf)

As you can see, when we plot() this performance object, the measure argument becomes the y-axis and the x.measure argument becomes the x-axis. The better the model, the more closely your “curve” will hug the top-left corner of the plot. A random guess (roughly 50% accuracy) would instead appear as a straight diagonal line from the bottom-left corner to the top-right corner. Thus, your goal as a machine learning programmer is to create a model with a larger “area under the curve.”

To get the actual AUC metric, we can use the auc() function in Metrics, like below.

Metrics::auc(test_set$trumpv,test_predict)
## [1] 0.9618729

That’s a great AUC!
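
If you would rather stay within ROCR, the same quantity can be extracted from a performance object by setting measure = "auc"; the value is stored in the object’s y.values slot and should match the Metrics::auc() result above.

auc_perf <- ROCR::performance(pr, measure = "auc")
auc_perf@y.values[[1]] #area under the ROC curve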

Want more practice with logit regressions? Check out these tutorials