Chapter 7 Bayes Classifier

In a classification setting, the prediction problem can generally be formulated as follows: given the observed values of the predictor variables, which class is the observation most likely to belong to?

Assume a classification problem with a target variable Y with two classes, POS and NEG, and one binary predictor variable X1. A training set with n observations can be used to estimate the class-conditional probabilities
P(X1 = 0 | Y=POS),
P(X1 = 1 | Y=POS),
P(X1 = 0 | Y=NEG) and
P(X1 = 1 | Y=NEG).

Table 1
Simple Example Training Data Set

flextable(df_train) %>% align(align = "center", part="all")

In the example in Table 1,

P(X1 = 0 | Y=POS) = 2/6
P(X1 = 1 | Y=POS) = 4/6
P(X1 = 0 | Y=NEG) = 3/4
P(X1 = 1 | Y=NEG) = 1/4

Given a new observation of the predictor variable, it is intuitively clear what the best prediction is for the Y-variable, based on the information in the training set.
If X1=1, then it is more probable that the observation belongs to the POS class, because P(X1=1 | Y=POS) is higher than P(X1=1 | Y=NEG).

This way of deciding uses P(X1=1 | Y=POS) and P(X1=1 | Y=NEG) to predict the class.
In a formal sense these are not the correct probabilities on which to base this decision. What is actually needed to predict the class based on the observation X1=1 are the probabilities
P(Y=POS | X1=1), which is not the same as P(X1 = 1 | Y=POS), and
P(Y=NEG | X1=1).
Based on the information in Table 1 these probabilities are:
P(Y=POS | X1=1) = 4/5
P(Y=NEG | X1=1) = 1/5
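
Both sets of probabilities can be read off a cross table of the training data. The sketch below rebuilds a data set with the counts implied by the probabilities above (6 POS and 4 NEG observations). The data frame `df_example` and its row order are assumptions made for illustration; it is not the original `df_train`.

```r
# Hypothetical reconstruction of the counts behind Table 1:
# POS: 2 x (X1 = 0), 4 x (X1 = 1); NEG: 3 x (X1 = 0), 1 x (X1 = 1)
df_example <- data.frame(
  X1 = c(0, 0, 1, 1, 1, 1, 0, 0, 0, 1),
  Y  = factor(c(rep("POS", 6), rep("NEG", 4)), levels = c("POS", "NEG"))
)

counts <- table(X1 = df_example$X1, Y = df_example$Y)

# P(X1 | Y): normalise within each class (column)
p_x_given_y <- prop.table(counts, margin = 2)

# P(Y | X1): normalise within each value of X1 (row)
p_y_given_x <- prop.table(counts, margin = 1)

p_x_given_y["1", "POS"]  # 4/6
p_y_given_x["1", "POS"]  # 4/5
```

The only difference between the two tables is the margin that is used for normalisation, which is exactly the difference between P(X1 | Y) and P(Y | X1).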

In this simple example, with just one predictor, the needed probabilities can be easily calculated. But in real-world problems there are usually many predictors and the situation becomes more complicated.

The formal way to calculate P(Y=POS | X1=1), which can be generalized to more complex problems, is Bayes' rule:

\[P(Y=POS | X1=1) = \frac{P(X1=1\ and\ Y=POS)}{P(X1=1)} =\]

\[\frac{P(Y=POS)\ *\ P(X1=1 | Y=POS)}{P(X1=1)} = \frac{6/10 * 4/6}{5/10} = \frac{4}{5}\]

For this example using this formula is not necessary, but in more complicated cases it is.
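
The same calculation can be written out in R. This is a minimal sketch using the numbers from Table 1; the variable names are chosen here for illustration. The denominator is obtained with the law of total probability, which gives the same 5/10 as counting directly.

```r
prior_pos <- 6 / 10   # P(Y = POS)
prior_neg <- 4 / 10   # P(Y = NEG)
lik_pos   <- 4 / 6    # P(X1 = 1 | Y = POS)
lik_neg   <- 1 / 4    # P(X1 = 1 | Y = NEG)

# P(X1 = 1) via the law of total probability
evidence <- prior_pos * lik_pos + prior_neg * lik_neg

# Bayes' rule for both classes
posterior_pos <- prior_pos * lik_pos / evidence   # 4/5
posterior_neg <- prior_neg * lik_neg / evidence   # 1/5
```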

Table 2
Example with Two Features

flextable(df_train_2) %>% align(align = "center", part="all")

Assume that for a new observation X1 = 0 and X2 = 1.
The probabilities to calculate in order to make a prediction for the Y-variable are:
P(Y = NEG | X1 = 0 and X2 = 1) and
P(Y = POS | X1 = 0 and X2 = 1).

Using the same formula as in the first example:
\(P(Y=NEG | X1=0\ and\ X2=1) = \frac{P(Y=NEG)*P(X1=0\ and\ X2=1|Y=NEG)}{P(X1=0\ and\ X2=1)}\)

\(P(Y=POS | X1=0\ and\ X2=1) = \frac{P(Y=POS)*P(X1=0\ and\ X2=1|Y=POS)}{P(X1=0\ and\ X2=1)}\)

The denominators of both expressions are the same. The numerator of the first expression can be split up further by assuming that X1 and X2 are conditionally independent given the class, which means that within a class the value X1 takes on does not depend on the value X2 takes on, and vice versa. Although in a real-world problem this assumption is violated most of the time, using it to calculate the numerator leads to

P(Y = NEG) * P(X1 = 0 and X2 = 1 | Y = NEG) =
= P(Y = NEG) * P(X1 = 0 | Y = NEG) * P(X2 = 1 |Y = NEG)
Using the training data, the probabilities in this expression can easily be calculated.

P(Y = NEG) * P(X1 = 0 | Y = NEG) * P(X2 = 1 |Y = NEG) =
= 4/10 * 3/4 * 2/4

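Under the same assumption, the numerator for the POS class is P(Y = POS) * P(X1 = 0 | Y = POS) * P(X2 = 1 | Y = POS).
The denominator P(X1 = 0 and X2 = 1) is identical for both classes, so it is not needed to decide which class is more probable. When the posterior probabilities themselves are required, the denominator equals the sum of the two numerators.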
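
The comparison of the two numerators can be sketched in R. The NEG values are the ones used above. The POS values are assumptions: P(X1 = 0 | Y = POS) = 2/6 is taken from Table 1 on the assumption that Table 2 extends it, and P(X2 = 1 | Y = POS) = 3/6 is a purely hypothetical placeholder, because that value is not given in the text.

```r
# Priors and class-conditional probabilities for the observation X1 = 0, X2 = 1
prior <- c(NEG = 4 / 10, POS = 6 / 10)
p_x1  <- c(NEG = 3 / 4,  POS = 2 / 6)   # P(X1 = 0 | Y); POS value assumed from Table 1
p_x2  <- c(NEG = 2 / 4,  POS = 3 / 6)   # P(X2 = 1 | Y); POS value is hypothetical

# Naive Bayes numerators: prior times the product of the per-feature probabilities
numerator <- prior * p_x1 * p_x2

# The shared denominator is the sum of the numerators
posterior <- numerator / sum(numerator)

# Predicted class: the one with the largest numerator (or posterior)
predicted <- names(which.max(posterior))
```

Because both numerators are divided by the same number, `which.max(numerator)` would give the same prediction as `which.max(posterior)`.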

In a real-world example the number of features is of course far larger, but the principles are the same. The method is named Naive Bayes because it uses Bayes' rule, and it is naive because it assumes that the features are conditionally independent given the class, an assumption that is rarely true in practice.

7.1 A real-world example, text mining: SMS spam detection

The Naive Bayes model can be used for spam detection. From a training collection of SMS text messages that are labeled Spam or Ham, a set of used words can be extracted.
After a couple of data cleaning operations (see below), the number of times each word occurs in the text messages in the training set is counted. Words with a frequency equal to or higher than a chosen threshold, e.g. five, are collected in a dictionary. Only these words are used to distinguish Spam from Ham.
As the next step, each word is converted into a binary variable whose length equals the number of messages in the training set, with value 1 in cell j if the word occurs one or more times in the jth text message and 0 otherwise. These binary variables are converted into factor variables, which can be used to generate a Naive Bayes model that distinguishes Spam from Ham.

This example uses a dataset from Kaggle.

Read the file with the SMS messages as it is published on Kaggle and view the first six messages.

Table 3
Head of the SMS Messages Data Set

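#packages used in this section
library(readr)      #read_csv()
library(magrittr)   #the %>% pipe
library(flextable)  #flextable(), autofit()
library(tm)         #text mining functions
library(e1071)      #naiveBayes(); assumed to be the e1071 implementation
library(caret)      #confusionMatrix()

sms_raw <- read_csv("data/kaggle_sms/sms_spam.csv")

#transform sms_raw$type into a factor variable
sms_raw$type <- factor(sms_raw$type, levels = c("ham", "spam"))

flextable(head(sms_raw)) %>% autofit()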

The R package tm (tm stands for text mining) comes with a couple of helpful functions for text mining. In order to use these functions, the data to be investigated must be in a so-called Corpus of text documents.
The first step in this analysis is to convert the set of SMS messages into such a corpus. First the text messages are transformed into a vector source, which is an R object that interprets every element of a vector as a text document.

Table 4
First Five Elements of SMS message in a Vector Source

sms_vector_source <- VectorSource(sms_raw$text)
sms_vector_source[1:5]
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                            
[2] "Ok lar... Joking wif u oni..."                                                                                                                              
[3] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "U dun say so early hor... U c already then say..."                                                                                                          
[5] "Nah I don't think he goes to usf, he lives around here though"                                                                                              

As a next step, create a Corpus with all the SMS messages as text documents. This is in fact a list of lists: every list contains the text of a document and metadata about that document.

sms_corpus <- Corpus(sms_vector_source)

Now it is time to perform a couple of text cleaning steps. The tm package has functions for these actions.
(1) Because it is assumed that capitalisation will not be used to distinguish Ham from Spam, replace all uppercase letters by lowercase letters.
(2) Remove numbers from the text messages. If it is assumed that numbers in the messages can help to differentiate between Spam and Ham, a more advanced way to deal with numbers is required.
(3) Lots of short words - like and, or, if, on, to etc. - are not useful to distinguish Ham from Spam; they can be removed from the texts before generating a model.
(4) Remove punctuation.
(5) Remove unnecessary spaces.

Table 5
Example of Cleaned Text Message

#transform to lower case text messages
#the tm::tm_map function applies a function on a corpus object
corpus_clean <- tm_map(sms_corpus, tolower) %>% 
#remove numbers
  tm_map(removeNumbers) %>% 
#remove stopwords
  tm_map(removeWords, stopwords()) %>% 
#remove punctuation
  tm_map(removePunctuation) %>% 
#remove additional spaces
  tm_map(stripWhitespace)

corpus_clean[1]$content
[1] "go jurong point crazy available bugis n great world la e buffet cine got amore wat"
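
To make explicit what the five cleaning steps do to a single message, they can be sketched with base R string functions. This is an illustration only, not how tm implements its transformations; the function `clean_text()` and the three-word stop word list are hypothetical and much smaller than the list returned by `stopwords()`.

```r
# Apply the five cleaning steps to a character vector of messages
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                   # (1) lower case
  x <- gsub("[0-9]+", "", x)                        # (2) remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # (3) remove stop words
  x <- gsub("[[:punct:]]+", "", x)                  # (4) remove punctuation
  x <- gsub("\\s+", " ", x)                         # (5) collapse repeated spaces
  trimws(x)
}

small_stop_words <- c("in", "a", "to")  # hypothetical, for illustration only

clean_text("Free entry in 2 a wkly comp to win", small_stop_words)
# "free entry wkly comp win"
```

The word boundaries (`\\b`) in the stop word pattern make sure that only whole words are removed, so the "in" inside "win" is left alone.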

To assess the Naive Bayes model, the data are split into a training set and a test set, e.g. 70% of the messages in the training set and 30% in the test set.

set.seed(20210332)
train <- sample(seq_along(corpus_clean),
                size = floor(.7*length(corpus_clean)),
                replace=FALSE)

sms_raw_train <- sms_raw[train,]
sms_raw_test <- sms_raw[-train,]

corpus_train <- corpus_clean[train]
corpus_test <- corpus_clean[-train]

sum(train)
[1] 10914129

The next step is to construct a Document Term Matrix (DTM) for the training set. This is a matrix in which rows correspond to documents and columns correspond to the terms in the documents. The cells contain the number of times a term occurs in a document.

Table 6
Document Term Matrix

dtm_train <- DocumentTermMatrix(corpus_train)
inspect(dtm_train)
<<DocumentTermMatrix (documents: 3901, terms: 6529)>>
Non-/sparse entries: 29853/25439776
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   call can free get got just know ltgt now will
  1219    0   0    0   0   0    0    0    1   0    0
  126     0   0    0   0   0    0    0    1   0    0
  1344    0   0    0   0   0    0    0    0   0    0
  2901    0   0    0   1   0    0    0    0   0    0
  3164    1   0    0   0   0    2    0    0   0    1
  3223    0   0    0   1   0    0    0    3   0    2
  3278    0   0    0   0   0    0    0   18   0    0
  3442    0   0    0   0   0    0    1    0   0    0
  3515    0   0    0   1   0    0    0    0   0   11
  797     0   1    0   0   0    0    0    1   0    0
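
The structure of a Document Term Matrix can be illustrated on a few toy messages with base R. This sketch builds a small dense matrix by hand; `DocumentTermMatrix()` stores the same information in a sparse format and applies its own tokenisation rules, so it is not a drop-in replacement. The three messages are made up for illustration.

```r
# Three already cleaned toy messages (hypothetical)
docs <- c("free call now", "call me now now", "ok see you")

# Split every message into its words
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary: every distinct word in the collection
vocab <- sort(unique(unlist(tokens)))

# Count per message how often each vocabulary word occurs;
# vapply() returns terms x documents, so transpose to documents x terms
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm
```

Each row of `dtm` is one message and each column one term, for example `dtm[2, "now"]` is 2 because "now" occurs twice in the second message.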

Not every word in the corpus is useful to distinguish Ham from Spam. A word must occur a minimum number of times to be useful, so a threshold must be chosen, e.g. 5. Note that findFreqTerms() applies this threshold to the total number of times a term occurs across all training messages, not to the number of messages in which it appears.
First construct a vector of the words with a frequency of at least 5.

Table 7
First 10 Terms with Frequency at Least 5

frequent_terms_5 <- findFreqTerms(dtm_train, lowfreq=5)
frequent_terms_5[1:10]
 [1] "awaiting"   "call"       "collect"    "collection" "currently" 
 [6] "just"       "message"    "exam"       "march"      "take"      

The Naive Bayes model uses as features not the number of times a term appears in a message, but only whether a term appears in a message.
It is possible to construct a Binary DTM in which the cells indicate whether a document contains the term (cell value = 1) or not (cell value = 0). It is this Binary DTM that is used in the Naive Bayes model. The Binary DTM is constructed for the words with a frequency of at least 5.

Table 8
Binary Document Term Matrix for Training Data

#Binary DTM for training data
dtm_train_bin <- DocumentTermMatrix(
                      corpus_train,
                      control=list(weighting=weightBin,
                                   dictionary=frequent_terms_5))

#Binary DTM for test data; needed to assess the model
dtm_test_bin <- DocumentTermMatrix(
                      corpus_test,
                      control=list(weighting=weightBin,
                                   dictionary=frequent_terms_5))
inspect(dtm_train_bin)
<<DocumentTermMatrix (documents: 3901, terms: 1148)>>
Non-/sparse entries: 21934/4456414
Sparsity           : 100%
Maximal term length: 19
Weighting          : binary (bin)
Sample             :
      Terms
Docs   call can free get got just know like now will
  1344    0   0    0   0   0    0    0    0   0    0
  1442    0   1    0   0   0    0    1    0   0    0
  1725    0   0    0   0   1    0    0    1   0    0
  2446    0   0    0   0   0    0    1    0   0    0
  3164    1   0    0   0   0    1    0    0   0    1
  3167    1   0    0   1   0    0    0    0   0    0
  3223    0   0    0   1   0    0    0    1   0    1
  3442    0   0    0   0   0    0    1    1   0    0
  3515    0   0    0   1   0    0    0    1   0    1
  797     0   1    0   0   0    0    0    0   0    0
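
The two operations behind the Binary DTM, keeping only the frequent terms and replacing counts by 0/1 indicators, can be sketched on a small dense count matrix. The matrix `counts` and the threshold of 3 are hypothetical values chosen to keep the example small.

```r
# Toy count matrix: 4 messages (rows) and 3 terms (columns)
counts <- matrix(
  c(1, 0, 2,
    0, 0, 1,
    3, 0, 0,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("call", "free", "now"))
)

# Keep the terms whose total count over all messages reaches the threshold
min_freq <- 3
frequent <- colnames(counts)[colSums(counts) >= min_freq]

# Binary indicator: 1 if the term occurs at least once in the message, else 0
binary <- (counts[, frequent, drop = FALSE] > 0) * 1L

frequent   # "call" "now"
binary
```

The term "free" is dropped because it occurs only once in total, and the count 3 for "call" in the third message becomes 1 in the binary matrix.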

The columns in the Binary DTM must be transformed into factor variables to use them in a Naive Bayes model.
Then the Binary DTM is ready to generate a Naive Bayes model.

#first use as.matrix() to convert the DTM from its sparse list representation into a regular matrix
dtm_train_bin_matrix <- as.matrix(dtm_train_bin)
dtm_test_bin_matrix <- as.matrix(dtm_test_bin)

#convert the 0/1 columns into categorical values;
#note that apply() returns a character matrix of "0" and "1", not factor columns
dtm_train_bin_matrix <- apply(dtm_train_bin_matrix, 2, factor)
dtm_test_bin_matrix <- apply(dtm_test_bin_matrix, 2, factor)

#generate model
nb_model <- naiveBayes(x=dtm_train_bin_matrix,
                       y=sms_raw_train$type)

summary(nb_model)
          Length Class  Mode     
apriori      2   table  numeric  
tables    1148   -none- list     
levels       2   -none- character
isnumeric 1148   -none- logical  
call         3   -none- call     
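
What such a model stores and how it predicts can be sketched from scratch on toy data: the a priori class probabilities, one conditional probability per word and class, and a prediction that multiplies them as in the first part of this chapter. This is a simplified illustration under stated assumptions, not the implementation behind `naiveBayes()`; the data, the names `p_present` and `predict_toy()`, and the absence of any smoothing are choices made for this sketch.

```r
# Toy training data: rows are messages, columns are binary word indicators
x <- matrix(
  c(0, 0,
    0, 1,
    1, 0,
    0, 0,
    1, 1,
    1, 0,
    0, 1,
    1, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("free", "call"))
)
y <- factor(rep(c("ham", "spam"), each = 4))

# A priori class probabilities P(Y)
class_counts <- table(y)
prior <- setNames(as.numeric(class_counts) / length(y), names(class_counts))

# P(word present | class): one row per word, one column per class
p_present <- sapply(levels(y), function(cl) colMeans(x[y == cl, , drop = FALSE]))

# Posterior class probabilities for one new message (named 0/1 vector)
predict_toy <- function(new_x) {
  likelihood <- apply(p_present, 2, function(p) prod(ifelse(new_x == 1, p, 1 - p)))
  numerator <- prior * likelihood
  numerator / sum(numerator)
}

predict_toy(c(free = 1, call = 1))   # ham 0.1, spam 0.9
```

With 1148 words instead of two, the product contains 1148 factors per class, which is why practical implementations usually work with sums of logarithms instead of products.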

Assessing the model:
(1) Use the model to make predictions on the test data
(2) Assess the model using a confusion matrix

preds <- predict(nb_model, dtm_test_bin_matrix)

cf <- table(preds, sms_raw_test$type)

cf
      
preds   ham spam
  ham  1445   32
  spam    7  189

As can be seen in the Confusion Matrix, the model makes a good distinction between Spam and Ham. Only 7 of the 1452 Ham messages (0.5%) are classified as Spam while 32 of the 221 Spam messages (14.5%) are classified as Ham.

The caret::confusionMatrix() function gives a lot of metrics which can be used to assess a classification model. Which metric is most applicable depends on the context of the problem in question.

confusionMatrix(preds, sms_raw_test$type)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    7  189
                                          
               Accuracy : 0.9767          
                 95% CI : (0.9683, 0.9834)
    No Information Rate : 0.8679          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8932          
                                          
 Mcnemar's Test P-Value : 0.0001215       
                                          
            Sensitivity : 0.9952          
            Specificity : 0.8552          
         Pos Pred Value : 0.9783          
         Neg Pred Value : 0.9643          
             Prevalence : 0.8679          
         Detection Rate : 0.8637          
   Detection Prevalence : 0.8828          
      Balanced Accuracy : 0.9252          
                                          
       'Positive' Class : ham
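
The main metrics in this output can be reproduced by hand from the four cells of the confusion matrix. The sketch below re-enters the counts shown above; because ham is the 'Positive' class, Sensitivity is the share of Ham messages that is classified correctly and Specificity the share of Spam messages that is classified correctly.

```r
# Confusion matrix as shown above: rows are predictions, columns the true class
cf_manual <- matrix(
  c(1445, 7, 32, 189), nrow = 2,
  dimnames = list(preds = c("ham", "spam"), actual = c("ham", "spam"))
)

accuracy    <- sum(diag(cf_manual)) / sum(cf_manual)
sensitivity <- cf_manual["ham", "ham"]   / sum(cf_manual[, "ham"])    # true ham found
specificity <- cf_manual["spam", "spam"] / sum(cf_manual[, "spam"])   # true spam found
ppv         <- cf_manual["ham", "ham"]   / sum(cf_manual["ham", ])    # Pos Pred Value
npv         <- cf_manual["spam", "spam"] / sum(cf_manual["spam", ])   # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced), 4)
```

For a spam filter, the 7 Ham messages that end up in the spam folder and the 32 Spam messages that reach the inbox have different costs, which is why a single metric such as Accuracy may not be the most relevant one.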