8 Classification

Classifying outcomes into 1’s and 0’s is a classic problem in data science. So many decisions and outcomes can be classified into ‘yes, something happened’ or ‘no, that thing didn’t happen.’ That something could be whether someone bought an item, whether a person defaulted on a credit card, whether the stock market went up, or even whether a tumor is cancerous. These are all 1’s and 0’s!

Luckily, there are lots of methods to figure out what affects whether something is a 1 or a 0, as well as to predict whether something will be a 1 or a 0. We’ll focus on two for this lesson: logistic regression and the k-nearest neighbors algorithm (kNN). We focus on these two because they do the same task through very different means. Logistic regression is a parametric statistical method that extends linear regression (and thus has assumptions that should be met). kNN is a non-parametric algorithm and is therefore free from assumptions about the relationship between the target and the features.

The differences in the nature of these models bring up a fundamental question: how do we compare them? If both work on a given data set, how do we choose one over the other? Thus, we’ll dive into the world of first training models on one set of data and then testing their accuracy on a different set of data. Under this paradigm we choose the model that does the best job of explaining unseen data!

In this lesson we’re going to take a dataset, do all the preprocessing necessary to fit both logistic regression and kNN models, and then compare them on a test set of data. What you’ll see is that 90% of our work is data wrangling and preprocessing; fitting the models themselves is a comically simple couple of lines. Let’s dig in.

8.1 Importing and exploring our data

Today we’re going to be working with a dataset on whether people ‘churn’ from a telecommunications company. Churn is a fancy way of saying someone leaves or quits a service. This is obviously important to companies, as they want to know who is likely to leave so they can do their best to retain those customers or identify faults in their current business model.

8.1.1 Loading packages

Let’s load our packages. Note - caret is the major machine learning package for R and is fantastic! Be sure to install it if you haven’t done so already. You’ll also need the ISLR package associated with the textbook, as well as RANN and e1071.
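
If you need to install any of these first, a one-time call along these lines should do it (just a sketch; run it once in your console rather than leaving it in a script):

# install.packages(c('tidyverse', 'ISLR', 'caret', 'RANN', 'e1071'))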

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.0.2
library(ISLR)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5

8.1.2 And our data

This data set has a bunch of features that we’re going to chop out, so let’s do that right away and then call our slimmed-down dataset ts.

telco <- read_csv("https://docs.google.com/spreadsheets/d/1DZbq89b7IPXXzi_fjmECATtTm4670N_janeqxYLzQv8/gviz/tq?tqx=out:csv")

ts <- telco %>% 
  select(-ID, -online_security, -online_backup, -tech_support, -streaming_tv, -streaming_movies, -multi_line, -device_protection)

8.1.3 Quick look at the variables

So our target is churn… a bunch of yes and no values that we want to predict. We then have a mix of datatypes in our features with both numeric columns and character columns.

glimpse(ts)
## Rows: 7,043
## Columns: 13
## $ gender          <chr> "Female", "Male", "Male", "Male", "Female", "Female...
## $ senior          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ partner         <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Y...
## $ dependents      <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "N...
## $ tenure          <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49,...
## $ phone           <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No"...
## $ internet        <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber o...
## $ contract        <chr> "Month-to-month", "One year", "Month-to-month", "On...
## $ paperless_bill  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"...
## $ payment_method  <chr> "Electronic check", "Mailed check", "Mailed check",...
## $ monthly_charges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29...
## $ total_charges   <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 19...
## $ churn           <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", ...

Looking at a summary we see that we have some NA values in total_charges that we’ll have to deal with.

summary(ts$total_charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    18.8   401.4  1397.5  2283.3  3794.7  8684.8      11

Also, many of our character columns have multiple levels, so we’ll have to apply the one hot encoding that we learned about in our preprocessing lesson.

unique(ts$internet) # for example
## [1] "DSL"         "Fiber optic" "No"
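
If you want a quick overview of how many levels each character column has (just a convenience check, assuming a reasonably recent version of dplyr), something like this works:

ts %>% 
  summarise(across(where(is.character), n_distinct))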

8.2 Preprocessing

8.2.1 Getting dummy variables - one hot encoding

The first step we’re going to take is to get our dummy variables. This needs to be done for both our training and test data (which we’ll split later), and unlike imputation it’s fine to do it all at once before splitting. So let’s knock this out now.

First, let’s create the dummy object that we’ll use for the conversion. We’ll use the dummyVars() function from caret. Note how the formula is churn ~ . This tells it not to convert churn, as that’s our target, but to look at all the other features (that’s what the period does) and convert them if needed. I’ve also specified fullRank = TRUE so that it automatically drops one factor level from each feature, which we want; the dropped level becomes the reference level.

ts_dummy <- dummyVars(churn ~ ., data = ts, fullRank = TRUE)

But go run ts_dummy in your console… it’s not actually our data. It’s an object that contains a set of ‘instructions’ on how to convert all the columns to dummy variables. To actually convert our data we need to apply this object to our data frame.

We can do this with the predict() function, applying our ts_dummy object to our ts data frame to encode it. Note that you also need to convert the result back to a data frame using data.frame().

ts <- predict(ts_dummy, newdata = ts)
ts <- data.frame(ts)

Taking a look at our encoded data frame, we see that features that contained multiple levels have been broken out into multiple binary columns. But notice that our target, churn, is missing.

glimpse(ts)
## Rows: 7,043
## Columns: 16
## $ genderMale                            <dbl> 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,...
## $ senior                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ partnerYes                            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ dependentsYes                         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,...
## $ tenure                                <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 2...
## $ phoneYes                              <dbl> 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,...
## $ internetFiber.optic                   <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ internetNo                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ contractOne.year                      <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ contractTwo.year                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ paperless_billYes                     <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,...
## $ payment_methodCredit.card..automatic. <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ payment_methodElectronic.check        <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,...
## $ payment_methodMailed.check            <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ monthly_charges                       <dbl> 29.85, 56.95, 53.85, 42.30, 7...
## $ total_charges                         <dbl> 29.85, 1889.50, 108.15, 1840....

We need to grab churn from our original data and join it back to our ts dataframe. Remember the function cbind() will bind together any columns you feed it.

churn_vals <- telco %>% 
  select(churn) # go back to original data and get just churn as dummyVars drops it!

ts <- cbind(ts, churn_vals) # attach it back

8.2.2 Dealing with nuances in model preferences

Lots of models don’t like binary variables encoded as text (e.g. ‘yes’ vs. ‘no’). Let’s use an ifelse() to convert all ‘yes’ values in churn to 1 and everything else to a 0. We’ll also convert it to a factor straight away as that’s frequently necessary for classification models.

ts$churn <-  factor(ifelse(ts$churn == 'Yes', 1, 0))

8.2.3 Dealing with nuances in my preferences

When we one-hot encoded, dummyVars() took the various levels found in each column and used them to create the new column names. The problem is that those level values contained spaces, which were turned into periods in the column names. I can’t deal with a mix of periods and underscores in my column names.

See! Gross.

colnames(ts)
##  [1] "genderMale"                           
##  [2] "senior"                               
##  [3] "partnerYes"                           
##  [4] "dependentsYes"                        
##  [5] "tenure"                               
##  [6] "phoneYes"                             
##  [7] "internetFiber.optic"                  
##  [8] "internetNo"                           
##  [9] "contractOne.year"                     
## [10] "contractTwo.year"                     
## [11] "paperless_billYes"                    
## [12] "payment_methodCredit.card..automatic."
## [13] "payment_methodElectronic.check"       
## [14] "payment_methodMailed.check"           
## [15] "monthly_charges"                      
## [16] "total_charges"                        
## [17] "churn"

Let’s use some regex to replace all periods with underscores. Remember that the period is a special character, so you need to escape it with two backslashes: ‘\\.’. I’m also going to add a ‘+’ to the end of my search pattern, as ‘+’ tells it to look for one or more of the preceding symbol. This way it’ll grab the double period too!
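
Before renaming everything, we can sanity-check the pattern on one of the offending names (just an illustration):

str_replace_all('payment_methodCredit.card..automatic.', '\\.+', '_')
## [1] "payment_methodCredit_card_automatic_"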

colnames(ts) <- colnames(ts) %>% str_replace_all('\\.+', '_')
colnames(ts) # check!
##  [1] "genderMale"                          
##  [2] "senior"                              
##  [3] "partnerYes"                          
##  [4] "dependentsYes"                       
##  [5] "tenure"                              
##  [6] "phoneYes"                            
##  [7] "internetFiber_optic"                 
##  [8] "internetNo"                          
##  [9] "contractOne_year"                    
## [10] "contractTwo_year"                    
## [11] "paperless_billYes"                   
## [12] "payment_methodCredit_card_automatic_"
## [13] "payment_methodElectronic_check"      
## [14] "payment_methodMailed_check"          
## [15] "monthly_charges"                     
## [16] "total_charges"                       
## [17] "churn"

OK, now let’s just remove that hanging ‘_’ on the one column name. Since ‘_’ isn’t a special character we don’t need the ‘\\’. But we can add a ‘$’ at the end, which means to only match at the end of the string. We’ll replace with nothing, as we just want it removed.

colnames(ts) <- colnames(ts) %>% str_replace_all('_$', '')
colnames(ts) # check!
##  [1] "genderMale"                          "senior"                             
##  [3] "partnerYes"                          "dependentsYes"                      
##  [5] "tenure"                              "phoneYes"                           
##  [7] "internetFiber_optic"                 "internetNo"                         
##  [9] "contractOne_year"                    "contractTwo_year"                   
## [11] "paperless_billYes"                   "payment_methodCredit_card_automatic"
## [13] "payment_methodElectronic_check"      "payment_methodMailed_check"         
## [15] "monthly_charges"                     "total_charges"                      
## [17] "churn"

8.3 Splitting into training and test sets

We’re going to have a whole week on how to split your data, but here we’re going to do a quick basic version of it. We’re going to do what’s called an 80/20 train/test split, where we take 80% of the data to train our model with and then save the other 20% for testing. What’s a bit annoying is that we also need to split our data into our target and features. This’ll leave us with four different chunks of data:

  1. Training set of features
  2. Training target
  3. Test set of features
  4. Test target

The caret package has a function that makes this really easy. You can feed createDataPartition() your target and your split ratio, and it’ll give you back a set of row numbers that you can use to randomly select 80% of the data. Let’s see it in action.

ts_split <- createDataPartition(ts$churn, p = 0.8, list = FALSE)
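
One side note: createDataPartition() samples randomly, so your split (and the exact numbers below) will vary from run to run. If you ever want a reproducible split, set a seed right before partitioning; a quick sketch, where the seed value is arbitrary:

# set.seed(123)  # any fixed integer; makes the random partition repeatable
# ts_split <- createDataPartition(ts$churn, p = 0.8, list = FALSE)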

So what’s in this ts_split object? Looking at the first 10 values, we see a continuous index on the left, but under Resample1 some row numbers are randomly skipped… those are the 20% that weren’t selected!

head(ts_split, 10)
##       Resample1
##  [1,]         2
##  [2,]         4
##  [3,]         6
##  [4,]         8
##  [5,]         9
##  [6,]        10
##  [7,]        11
##  [8,]        12
##  [9,]        14
## [10,]        15

We can use this to slice out our training and test sets. Let’s first go and get our training features. So we want to select all rows listed in ts_split and then all columns that are not churn. Slicing the rows is easy as we just put ts_split on the row side of our square brackets.

But how do we select all columns except churn? names(ts) will give us a vector of column names from our data frame. The %in% operator lets us check whether each of those names is in a list we provide… in this case c('churn'). The trick is that we can then use the ! operator, which means ‘not’. So we’re telling R to give us all the names that are not in c('churn'). Go play with the individual code chunks by entering just names(ts) or expansions on that into your console to see what each piece does. For now, here’s how you slice out just our training features using the above (a quick illustration of the pieces follows the code).

features_train <- ts[ ts_split, !(names(ts) %in% c('churn'))] 
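
If you want to see what each piece of that column selector returns on its own, try these in your console (output omitted; each is just a logical vector the length of your column names):

names(ts)                      # all the column names
names(ts) %in% c('churn')      # TRUE only in the position where the name is 'churn'
!(names(ts) %in% c('churn'))   # flipped: TRUE for every column except churn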

Let’s get our test set of features. We want the same columns as above, so that part doesn’t change, but this time we want the rows that are not listed in ts_split. We can drop the ts_split rows using the - operator!

features_test  <- ts[-ts_split, !(names(ts) %in% c('churn'))]

Grabbing our targets is easier as we can just call that one column directly.

target_train <- ts[ ts_split, "churn"]
target_test <- ts[-ts_split, "churn"]

8.4 Centering and scaling

Last step before fitting models! We need to center and scale our continuous features. We also need to impute any missing values. Now, we learned how to do this manually in our preprocessing lesson, but the wonderful caret package actually has a nice function to do this on all our variables at once (don’t hate me for making you learn the hard way).

There is one critical thing to know, though… you should preprocess your training set on its own, separate from your test set. The goal of the train/test paradigm is to have your test set act as a totally new set of observations with no knowledge of what came before. If you center/scale/impute everything at once you’ll be ‘feeding forward’ information to your test set. This is called data leakage, and it can make your model perform really well on test data but terribly once it encounters truly new data.

We’ll use the preProcess() function to do our preprocessing. What’s great is that we can tell it everything we want it to do at once within the method = argument. In this case we’ll tell it to center, scale, and even use kNN to impute missing values.

preprocess_object <- preProcess(features_train, 
                                  method = c('center', 'scale', 'knnImpute'))
preprocess_object # We can look at the object and it'll tell us what it did
## Created from 5627 samples and 16 variables
## 
## Pre-processing:
##   - centered (16)
##   - ignored (0)
##   - 5 nearest neighbor imputation (16)
##   - scaled (16)

Now we can use predict() to apply that preprocess object to our data (just like we used predict() for creating our dummy variables).

features_train <- predict(preprocess_object, newdata = features_train)
features_test <- predict(preprocess_object, newdata = features_test)

A quick histogram shows that tenure has been scaled

hist(features_train$tenure)
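
If you’d like a numeric check as well, the training-set mean of a centered and scaled column should be roughly 0 and its standard deviation roughly 1 (output omitted):

mean(features_train$tenure)  # should be ~0 after centering
sd(features_train$tenure)    # should be ~1 after scaling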

8.5 Fitting our models

Now we can finally fit our models! We’ll set a naive baseline first, then fit our kNN model, and then our logistic regression model.

8.5.1 Fitting a naive model: Making a baseline

Fitting a baseline ‘model’ is a commonly overlooked step of the data science process. There are many fancy ways to try and predict an outcome, but too frequently people don’t step back and ask ‘how well would I do using the simplest possible model?’ What that simple model is exactly can vary a lot by the problem. In the case of classification, as we’re doing here, it’s simply asking ‘what would my error be if I predicted that every observation would be the majority class?’ In other words, if you just predicted everything to be the majority class (here, a ‘0’ for ‘didn’t churn’), how badly would you do?

A baseline model like this is key as it lets you gauge the value of your model itself. Are you doing better than essentially no model at all? So let’s make that ‘naive’ model.

To start, let’s ask how many churn vs. non-churn observations we have in our test dataset.

summary(as.factor(target_test))
##    0    1 
## 1034  373

Okay, so we have 1034 individuals that didn’t churn in our test dataset. Let’s calculate the error rate if we predicted that every one of these test observations was going to be a 0.

In this case since we already coded our ‘churn’ individuals to be 1, we can simply sum up the number of times we wrongly predict 0 (as that’s our naive prediction of the majority class).

sum(ifelse(target_test == '1', 1, 0))
## [1] 373

And if we divide that value by the length of target_test, we get the naive error rate.

fp <- sum(ifelse(target_test == '1', 1,0))
n <- length(target_test)
fp/n
## [1] 0.2651031

So we can see that if we simply said that nobody churns, we’d have an error rate of ~27%. Keep this number in mind for later.
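
As a shortcut, because target_test == '1' returns TRUE/FALSE values that R counts as 1/0, the same naive error rate can be computed in a single line:

mean(target_test == '1')
## [1] 0.2651031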

8.5.2 Fitting a kNN model

Let’s start by fitting our kNN model with caret’s knn3() function. The syntax is simple: you specify your features first, then your target, then how many nearest neighbors (k) you want it to use.

knn_fit <- knn3(features_train, target_train, k = 5)
knn_fit # just check it
## 5-nearest neighbor model
## Training set outcome distribution:
## 
##    0    1 
## 4140 1496

This output just tells us the distribution of non-churns (0) and churns (1) in the training data. But we already know that. We want to see how well it does on new data, in this case our test set. So let’s use our trusty predict() function to predict the classes of our test data using our fitted model, knn_fit. Note that specifying type = 'class' returns the predictions as 0’s and 1’s rather than as probabilities.

knn_pred <- predict(knn_fit, features_test, type = 'class' )

Our knn_pred is just a series of predicted values for churn in our test set.

head(knn_pred, 10)
##  [1] 1 1 0 0 0 0 1 0 0 0
## Levels: 0 1
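
As an aside, predict() on a knn3 fit can also return the underlying class probabilities (the proportion of the k neighbors in each class) if you use type = 'prob' instead; a quick look, output omitted:

knn_prob <- predict(knn_fit, features_test, type = 'prob')
head(knn_prob)  # one column per class (0 and 1); each row sums to 1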

As we want to compare these to the known values which are in target_test we need to join them together. cbind() makes quick work of this.

predictions <- cbind(data.frame(target_test, knn_pred))
summary(predictions)
##  target_test knn_pred
##  0:1034      0:1082  
##  1: 373      1: 325

We can see that kNN predicted a similar number of churn and non-churn values as the true data, but we’re not sure how many of those are true positives/negatives or false positives/negatives. We’ll calculate actual error rates later so we can assess model quality, but first let’s fit our logistic regression model.

8.5.3 Fitting a logistic regression model

I hate to break it to you, but base R’s regression models are easier to fit if your target and features are in the same data frame. So let’s reverse all that splitting of targets and features in our training set.

First we’ll join our target and features back together into a new data frame.

full_train <- cbind(features_train, target_train)
glimpse(full_train)
## Rows: 5,636
## Columns: 17
## $ genderMale                          <dbl> 0.9844188, 0.9844188, -1.015647...
## $ senior                              <dbl> -0.4425003, -0.4425003, -0.4425...
## $ partnerYes                          <dbl> -0.9599055, -0.9599055, -0.9599...
## $ dependentsYes                       <dbl> -0.6560339, -0.6560339, -0.6560...
## $ tenure                              <dbl> 0.07753003, 0.52699251, -0.9848...
## $ phoneYes                            <dbl> 0.3264935, -3.0623046, 0.326493...
## $ internetFiber_optic                 <dbl> -0.8733655, -0.8733655, 1.14479...
## $ internetNo                          <dbl> -0.5291376, -0.5291376, -0.5291...
## $ contractOne_year                    <dbl> 1.9441326, 1.9441326, -0.514276...
## $ contractTwo_year                    <dbl> -0.5592668, -0.5592668, -0.5592...
## $ paperless_billYes                   <dbl> -1.1954316, -1.1954316, 0.83636...
## $ payment_methodCredit_card_automatic <dbl> -0.5222652, -0.5222652, -0.5222...
## $ payment_methodElectronic_check      <dbl> -0.7122253, -0.7122253, 1.40380...
## $ payment_methodMailed_check          <dbl> 1.8224973, -0.5486003, -0.54860...
## $ monthly_charges                     <dbl> -0.249689946, -0.738776600, 1.1...
## $ total_charges                       <dbl> -0.1605271, -0.1822675, -0.6372...
## $ target_train                        <fct> 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0...

But that would be too easy… it renamed our churn column, so let’s use rename() to rename it back to churn

full_train <- full_train %>%
  rename(churn = target_train)

Now we can fit our model just like we fit linear regression models. The only difference is that we use the function glm(), as this is a generalized linear model, and we have to specify the family, in this case ‘binomial’.

log_train <- glm(churn ~ ., family = 'binomial', data = full_train)

Let’s take a quick look at a summary of our model. Notice all the p-values, standard errors, etc. This is one of the big differences between kNN and logistic regression: logistic regression gives you insight into which features are important and what they do, while kNN doesn’t. So if you’re trying to make inferences about a process, logistic regression is a lot better (assuming it’s appropriate).

summary(log_train)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = full_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7150  -0.6962  -0.2871   0.8032   3.6312  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -1.799995   0.065961 -27.289  < 2e-16 ***
## genderMale                          -0.022274   0.035812  -0.622 0.533967    
## senior                               0.099344   0.034218   2.903 0.003693 ** 
## partnerYes                          -0.002261   0.042645  -0.053 0.957718    
## dependentsYes                       -0.057035   0.045101  -1.265 0.206009    
## tenure                              -1.638645   0.173167  -9.463  < 2e-16 ***
## phoneYes                            -0.186364   0.048194  -3.867 0.000110 ***
## internetFiber_optic                  0.407420   0.073030   5.579 2.42e-08 ***
## internetNo                          -0.208470   0.086349  -2.414 0.015766 *  
## contractOne_year                    -0.326999   0.048228  -6.780 1.20e-11 ***
## contractTwo_year                    -0.770766   0.087121  -8.847  < 2e-16 ***
## paperless_billYes                    0.178123   0.040315   4.418 9.95e-06 ***
## payment_methodCredit_card_automatic -0.002661   0.051922  -0.051 0.959129    
## payment_methodElectronic_check       0.168713   0.049508   3.408 0.000655 ***
## payment_methodMailed_check          -0.022158   0.053878  -0.411 0.680879    
## monthly_charges                      0.065307   0.134226   0.487 0.626578    
## total_charges                        0.923522   0.179752   5.138 2.78e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6522.7  on 5635  degrees of freedom
## Residual deviance: 4729.3  on 5619  degrees of freedom
## AIC: 4763.3
## 
## Number of Fisher Scoring iterations: 6
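
One optional extra (not needed for the rest of the lesson): the coefficients above are on the log-odds scale, so exponentiating them gives odds ratios, which many people find easier to interpret. Keep in mind our features were centered and scaled, so these are odds ratios per one standard deviation change in each feature:

exp(coef(log_train))  # values > 1 increase the odds of churning, values < 1 decrease them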

Alright, now let’s predict our test targets using our test features. Use type = 'response' so R gives you back the probabilities.

log_pred <- predict(log_train, newdata = features_test, type = 'response')
head(log_pred) # just look at the first few values... note the 0-1 scale.
##          1          3          5          7         13         16 
## 0.65259243 0.39340062 0.71052555 0.45189363 0.11027067 0.04882795

We can convert these probabilities to classes. In this case anything greater than or equal to 0.5 is a 1 and everything else a 0.

log_pred <- ifelse(log_pred >= 0.5, 1, 0)

Finally add those to our predictions data frame. Convert to a factor along the way so that it knows they’re distinct classes.

predictions$log_pred <- factor(log_pred)
summary(predictions)
##  target_test knn_pred log_pred
##  0:1034      0:1082   0:1086  
##  1: 373      1: 325   1: 321

8.5.4 Fitting a linear discriminant analysis (LDA) model

NOTE I don’t have tests or quizzes on this anymore, but the content is here if you’re interested in another common classification model.

We’re going to quickly fit one last model using linear discriminant analysis (LDA). I’ll let you read about it in the book.

library(MASS) # Load mass - install if needed
lda_train <- lda(churn ~ ., data = full_train) # fit model
lda_pred <- predict(lda_train, newdata = features_test, type = 'response') # predict
predictions$lda_pred <- lda_pred$class # add predictions to data frame

How did it do?

summary(predictions)
##  target_test knn_pred log_pred lda_pred
##  0:1034      0:1082   0:1086   0:1096  
##  1: 373      1: 325   1: 321   1: 311

8.6 Calculating error rate

So we have three different models that have each made predictions for our test data. How do we compare these models? We can look at how often they predict the wrong class to calculate the error rate. The formula is as follows. \[ \frac{1}{n} \sum_{i = 1}^{n} I(y_i \ne \hat{y}_i)\] In other words, count how many times our predicted y does not equal the true y and divide by the number of observations. We can do this using some simple logical comparisons. Remember that in R you can do things like this to check if something is TRUE or FALSE:

5 == 5
## [1] TRUE
'yes' == 'yes'
## [1] TRUE
7 != 8
## [1] TRUE
7 != 7
## [1] FALSE
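
Also note that R treats TRUE as 1 and FALSE as 0, so taking the mean of a vector of such comparisons gives the proportion of mismatches directly, which is exactly the error rate formula above. A tiny made-up example:

y_true <- c(0, 1, 0, 1)    # hypothetical known classes
y_pred <- c(0, 1, 1, 1)    # hypothetical predictions
mean(y_true != y_pred)     # 1 mismatch out of 4 observations = 0.25
## [1] 0.25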

8.6.1 Calculating error rates for our data

So we can do the same thing comparing our columns of known and predicted values. The following code uses an ifelse() to compare every value in target_test against its corresponding prediction in knn_pred. If they’re not equal it assigns a 1 (it’s an error); if they are equal it assigns a 0 (it was predicted correctly).

predictions$knn_error <- ifelse(predictions$target_test != predictions$knn_pred, 1, 0)
summary(predictions)
##  target_test knn_pred log_pred lda_pred   knn_error     
##  0:1034      0:1082   0:1086   0:1096   Min.   :0.0000  
##  1: 373      1: 325   1: 321   1: 311   1st Qu.:0.0000  
##                                         Median :0.0000  
##                                         Mean   :0.2217  
##                                         3rd Qu.:0.0000  
##                                         Max.   :1.0000

Awesome, so our mean value of knn_error is our error rate! Let’s do it for the other two models.

predictions$log_error <- ifelse(predictions$target_test != predictions$log_pred, 1, 0)
predictions$lda_error <- ifelse(predictions$target_test != predictions$lda_pred, 1, 0)
summary(predictions)
##  target_test knn_pred log_pred lda_pred   knn_error        log_error     
##  0:1034      0:1082   0:1086   0:1096   Min.   :0.0000   Min.   :0.0000  
##  1: 373      1: 325   1: 321   1: 311   1st Qu.:0.0000   1st Qu.:0.0000  
##                                         Median :0.0000   Median :0.0000  
##                                         Mean   :0.2217   Mean   :0.1791  
##                                         3rd Qu.:0.0000   3rd Qu.:0.0000  
##                                         Max.   :1.0000   Max.   :1.0000  
##    lda_error     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1919  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

So, it looks like our logistic regression model had the lowest error rate, and it has a lower error rate than our naive model, which is great. Still, this can be a bit confusing at first if you just look at the number of 1 vs. 0 values for each model, as kNN predicted a similar ratio to the true data. But remember that what matters is how many times those predictions were incorrect! This is commonly examined through a confusion matrix.

8.6.2 Making a confusion matrix

Let’s use the caret function confusionMatrix() to calculate this. Just provide the predicted values first and the true values second. We’ll store it as an object knn_conf and then check out just the $table attribute.

knn_conf <- confusionMatrix(predictions$knn_pred, predictions$target_test)
knn_conf$table
##           Reference
## Prediction   0   1
##          0 902 180
##          1 132 193
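
The object returned by confusionMatrix() holds more than the table; for example, overall accuracy (which is just 1 minus the error rate we calculated) is stored in the $overall element:

knn_conf$overall['Accuracy']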

Compare to logistic regression

log_conf <- confusionMatrix(predictions$log_pred, predictions$target_test)
log_conf$table
##           Reference
## Prediction   0   1
##          0 934 152
##          1 100 221

So now it’s a lot easier to see why logistic regression did better. Compared to kNN, it has more true negatives (the 0 - 0 cell, where the model correctly predicts the individual didn’t churn) and more true positives (the 1 - 1 cell, where the model correctly predicts the individual did churn), and fewer false positives (predicting churn when they didn’t) and false negatives (predicting they didn’t churn when they did).

8.7 Tuning k

One issue we have here is that we’re using a fairly arbitrary value of k for our kNN algorithm. It might be fitting poorly because it needs to consider more or fewer neighbors (a bigger or smaller k). Many kNN implementations default to k = 5, but there isn’t any hard and fast rule.

8.7.1 Choosing k based on sample size

A general rule of thumb is to choose using \(k = \sqrt{n}\).

k_form <- sqrt(length(target_test)) #length(target_test) gives us number of observations n
k_form
## [1] 37.51

You always want a k that’s odd so you don’t have ties, so let’s round down to 37

knn_fit_37 <- knn3(features_train, target_train, k = 37)
knn_pred_37 <- predict(knn_fit_37, features_test, type = 'class' )
predictions$knn_pred_37 <- knn_pred_37
predictions$knn_error_37 <- ifelse(predictions$target_test != predictions$knn_pred_37, 1, 0)
summary(predictions)
##  target_test knn_pred log_pred lda_pred   knn_error        log_error     
##  0:1034      0:1082   0:1086   0:1096   Min.   :0.0000   Min.   :0.0000  
##  1: 373      1: 325   1: 321   1: 311   1st Qu.:0.0000   1st Qu.:0.0000  
##                                         Median :0.0000   Median :0.0000  
##                                         Mean   :0.2217   Mean   :0.1791  
##                                         3rd Qu.:0.0000   3rd Qu.:0.0000  
##                                         Max.   :1.0000   Max.   :1.0000  
##    lda_error      knn_pred_37  knn_error_37   
##  Min.   :0.0000   0:1104      Min.   :0.0000  
##  1st Qu.:0.0000   1: 303      1st Qu.:0.0000  
##  Median :0.0000               Median :0.0000  
##  Mean   :0.1919               Mean   :0.1962  
##  3rd Qu.:0.0000               3rd Qu.:0.0000  
##  Max.   :1.0000               Max.   :1.0000

8.7.2 Testing a range of k values

You can also use brute force to test models across a range of k values. Here’s some code that generates a sequence of k values, fits a model for each one, and stores each k and its error in a data frame. We’ll learn about for loops later, so don’t worry about this code for now.

k_seq <- seq(from = 1, to = 51, by = 2)

error_df <- data.frame(matrix(ncol = 2, nrow= 0))
x <- c('error', 'k')
colnames(error_df) <- x

for(i in k_seq) {
  knn_fit <- knn3(features_train, target_train, k = i)
  knn_pred <- predict(knn_fit, features_test, type = 'class')
  error <- mean(ifelse(predictions$target_test != knn_pred, 1, 0))
  error_df[i,'error'] <- error
  error_df[i, 'k'] <- i
}
error_df <- drop_na(error_df) # drop the NA rows left at the skipped (even) row indices

Looking at the resulting data frame, we have an error value for each k in our sequence.

head(error_df)
##       error  k
## 1 0.2643923  1
## 2 0.2231699  3
## 3 0.2217484  5
## 4 0.2210377  7
## 5 0.2139303  9
## 6 0.2125089 11

Which we can plot easily

ggplot(error_df,
       aes( x = k, y = error)) +
  geom_line() +
  geom_point() +
  labs(x = 'k', y = 'Error')

So we can see that our error steadily decreases until it hits k = 31 and then levels off. Not far off from the k of 37 we calculated earlier, and overall much better than the error associated with leaving k at 5!
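
Rather than eyeballing the plot, you can also pull the best k straight out of the data frame (a quick one-liner):

error_df[which.min(error_df$error), ]  # row with the lowest test error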

Let’s just refit this model and add it back to our predictions data frame.

knn_fit_31 <- knn3(features_train, target_train, k = 31)
knn_pred_31 <- predict(knn_fit_31, features_test, type = 'class' )
predictions$knn_pred_31 <- knn_pred_31
predictions$knn_error_31 <- ifelse(predictions$target_test != predictions$knn_pred_31, 1, 0)
summary(predictions)
##  target_test knn_pred log_pred lda_pred   knn_error        log_error     
##  0:1034      0:1082   0:1086   0:1096   Min.   :0.0000   Min.   :0.0000  
##  1: 373      1: 325   1: 321   1: 311   1st Qu.:0.0000   1st Qu.:0.0000  
##                                         Median :0.0000   Median :0.0000  
##                                         Mean   :0.2217   Mean   :0.1791  
##                                         3rd Qu.:0.0000   3rd Qu.:0.0000  
##                                         Max.   :1.0000   Max.   :1.0000  
##    lda_error      knn_pred_37  knn_error_37    knn_pred_31  knn_error_31   
##  Min.   :0.0000   0:1104      Min.   :0.0000   0:1105      Min.   :0.0000  
##  1st Qu.:0.0000   1: 303      1st Qu.:0.0000   1: 302      1st Qu.:0.0000  
##  Median :0.0000               Median :0.0000               Median :0.0000  
##  Mean   :0.1919               Mean   :0.1962               Mean   :0.1926  
##  3rd Qu.:0.0000               3rd Qu.:0.0000               3rd Qu.:0.0000  
##  Max.   :1.0000               Max.   :1.0000               Max.   :1.0000

8.8 What I’m not teaching you

I’m indirectly getting at the idea of class imbalance when talking about naive models. When doing classification problems you can’t simply look at the error rate of a model and say ‘it’s doing better than random chance.’ Most people assume random chance is 50:50, but that’s only true in cases like coin flips where you actually have that ratio. Here, randomly picking out one person and asking whether they’ll stay or churn is more like a 70:30 chance. So it’s critical to consider the baseline of the data in some way to know if you’re doing well.

I’m also not getting into the variety of other metrics used to assess performance on classification problems. Metrics such as AUC, F1, recall, and precision all consider how well you’re doing on specific classes or a balance of them. They exist and are very valuable, but are overlooked here for the sake of time.

8.9 Wrapping up

Here we learned about three different ways to classify data using R. We found that despite the diversity of methods, all of them wound up doing pretty much the same on our test set. Which one is best? Well, it depends on your needs and your question. There will be datasets where one might fail miserably, so it’s good to have the others in your pocket. There will also be times when you’re interested in the process that generated the 1’s vs. 0’s, and thus you’ll specifically want logistic regression, as it can provide insight into which features are having what effect. It all comes down to thinking about the question and the pros and cons of the tools in your toolkit!