# Chapter 8 Naive Bayes

## 8.1 A thought problem

A police officer has a breathalyzer that falsely indicates drunkenness in 5% of the cases in which the driver is sober. However, the breathalyzer never fails to detect a truly drunk person. Suppose that on a given evening 1 in 1,000 drivers is driving with alcohol over the legal limit. A traffic checkpoint is set up, drivers are selected at random, and the selected drivers are required to take a breathalyzer test.

Assume that a particular driver is found to be over the legal limit for alcohol according to the breathalyzer. Assume nothing else about the driver. What is the probability that the driver really is over the limit?

Many people answer that the probability is as high as 0.95, but the correct probability is about 0.02. How can the proper probability that the person is really drunk be estimated? This calls for Bayes’ theorem.

Bayes’ theorem is a formula that describes how to update the prior probability of an event when additional evidence is made available. Prior probabilities from a Bayesian perspective are based on the known likelihoods from historical data. In the example of random checks of drivers, the prior probability that the driver is drunk is 1/1000, or 0.001.

To estimate the probability that a selected driver is actually drunk, the result of the breathalyzer test can be used to update the prior. The revised probability is called the posterior probability, and it is obtained with Bayes’ theorem. The goal is to find the probability that the driver is drunk given that the breathalyzer indicated he or she is drunk, which can be represented as:

• p(drunk|POS), where “POS” means that the breathalyzer indicates that the driver is drunk

Bayes’ theorem tells us that:

• p(drunk|POS) = [p(POS|drunk) × p(drunk)] / p(POS)
• where p(POS) = p(POS|drunk) × p(drunk) + p(POS|sober) × p(sober)

We have the following data:

• p(drunk) = 0.001
• p(sober) = 0.999
• p(POS|drunk) = 1.00 (the breathalyzer is 100% accurate if the person is actually drunk)
• p(POS|sober) = 0.05 (the breathalyzer mistakenly reports that a sober driver is drunk 5% of the time)

Given the data and a positive indication on the breathalyzer test for a randomly selected driver, what is the probability that the person is drunk?

• The numerator of Bayes’ formula = p(POS|drunk) × p(drunk) = 1.00 × 0.001 = 0.001
• The denominator of Bayes’ formula = p(POS|drunk) × p(drunk) + p(POS|sober) × p(sober) = 1.00 × 0.001 + 0.05 × 0.999 = 0.001 + 0.04995 = 0.05095

Substituting the numerator and denominator into Bayes’ theorem yields:

• p(drunk|POS) = 0.001 / 0.05095 = 0.0196
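
The breathalyzer calculation can be reproduced in a few lines of Python. This is a minimal sketch: the function name `posterior` and its parameter names are chosen here for illustration and do not come from any library.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Return P(condition | positive test) using Bayes' theorem."""
    numerator = p_pos_given_true * prior
    # Total probability of a positive result across both groups.
    evidence = numerator + p_pos_given_false * (1.0 - prior)
    return numerator / evidence


# Values from the checkpoint example: 1 in 1,000 drivers is drunk,
# the test always detects a drunk driver, and 5% of sober drivers test positive.
p_drunk_given_pos = posterior(prior=0.001, p_pos_given_true=1.0, p_pos_given_false=0.05)
print(round(p_drunk_given_pos, 4))  # 0.0196
```

Changing `prior` shows how strongly the answer depends on the base rate: the rarer drunk driving is among the drivers tested, the more the false positives dominate.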

The framework of Bayes theorem can be applied to a supervised analytics problem.

## 8.2 Bayes Theorem applied to predictive analytics

Consider a case of a cable service provider that has over two million subscribers. The company decides to perform a test market to predict whether current customers will subscribe to a new service.

The test involved sending an offer to a random sample of 1,000 current customers. This can be cast as a Bayesian model. Using the results of the test, the company would like a predictive model to use for the rest of its customers.

The test results were that 400 customers bought the new service, so the prior probability of purchase was 0.40. This is illustrated in Figure 8.1 which has a grid representing the 1,000 customers in the test.

If this prior probability is applied to the entire subscriber base of over two million, then the company would expect more than 800,000 positive responses. The process of contacting customers via mail, email, and telephone to offer the new service had a cost, so the company wanted to know whether the contacting process could be made more efficient. That is, was there a way of increasing the positive response rate?

It turned out that the company had data on the gender and age (young or old) of its subscribers, so this information was available for those in the test market. Using gender, the probability of purchase can be refined: there were 600 female customers in the test, 300 of whom subscribed to the new service, and 400 male customers, 100 of whom subscribed. The probability was further refined using age, with results shown in Figure 8.2.

By simply counting the number of customers in each shaded area, the posterior probabilities of each segment could be calculated.

• prob(Subscribing | male, young) = 50 / (50 + 180) = 0.217
• prob(Subscribing | male, old) = 50 / (50 + 120) = 0.294
• prob(Subscribing | female, young) = 180 / (180 + 150) = 0.545
• prob(Subscribing | female, old) = 120 / (120 + 150) = 0.444

So, this small example shows that the Bayes model can be used for predicting the classification of new observations. To classify a new case, find all of the observations in the sample with exactly the same descriptive characteristics. With this set of observations, count the number of positive and negative outcomes and apply the counting scheme discussed above.
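
The counting scheme can be sketched in Python as a lookup over segment counts. The dictionary layout and the function name `prob_subscribe` are illustrative choices; the counts are those used in the segment calculations above.

```python
# (gender, age) -> (subscribers, non-subscribers) in the 1,000-customer test
counts = {
    ("male", "young"): (50, 180),
    ("male", "old"): (50, 120),
    ("female", "young"): (180, 150),
    ("female", "old"): (120, 150),
}


def prob_subscribe(gender, age):
    """Posterior probability of subscribing for customers matching the segment exactly."""
    yes, no = counts[(gender, age)]
    return yes / (yes + no)


for segment in counts:
    print(segment, round(prob_subscribe(*segment), 3))
```

A new customer is classified by looking up the segment with exactly the same characteristics, which is why every segment needs enough observations of its own.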

This approach does not work if there are many predictors. Many practical predictive modeling problems have many predictors. So, the Bayesian idea works in theory, but not always in practice.

As a more practical example, assume you want to predict a binary target (true or false) using 15 binary predictors. Assume that you need at least 50 observations in each of the resulting cells to make a reasonable estimate of the proportion of true versus false values of the target. The minimum number of observations you would need is 50 × 2¹⁵ = 1,638,400. Even this may not be enough, because the observations may not be spread uniformly across the cells and many of the cells will have too few observations.
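
The sample-size arithmetic can be checked directly: 15 binary predictors define 2 to the power 15 distinct cells, and each cell needs 50 observations. The variable names below are illustrative.

```python
n_predictors = 15
min_per_cell = 50

# Each binary predictor doubles the number of distinct predictor combinations.
n_cells = 2 ** n_predictors
min_observations = min_per_cell * n_cells

print(f"{n_cells:,} cells, {min_observations:,} observations")
# 32,768 cells, 1,638,400 observations
```

Adding a single extra binary predictor doubles the requirement, which is why the exact-matching approach breaks down quickly.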

The “solution” to this problem is to use the Naïve Bayes model. The word solution is in quotes because the problem is not really solved. Instead, an approximation, which works well in many practical situations, is used. The approximation is based on the assumption that the predictor variables operate independently of one another. That is, naive Bayes assumes that the presence of a specific feature is unrelated to the presence of any other feature. If the predictors operate independently, then the joint probabilities of multiple variables can be simply estimated as the product of the individual probabilities.
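
To see what the approximation does, the sketch below applies the independence assumption to the cable test-market counts from this section and compares the result with the exact count-based posterior for young female customers. The function name and data layout are illustrative, not part of any library.

```python
# (gender, age) -> (subscribers, non-subscribers), from the test market
counts = {
    ("male", "young"): (50, 180),
    ("male", "old"): (50, 120),
    ("female", "young"): (180, 150),
    ("female", "old"): (120, 150),
}
n_total = sum(yes + no for yes, no in counts.values())


def naive_bayes_subscribe(gender, age):
    """P(subscribe | gender, age), treating gender and age as independent within each class."""
    scores = []
    for idx in (0, 1):  # 0 = subscribers, 1 = non-subscribers
        class_total = sum(c[idx] for c in counts.values())
        p_class = class_total / n_total
        p_gender = sum(c[idx] for (g, _), c in counts.items() if g == gender) / class_total
        p_age = sum(c[idx] for (_, a), c in counts.items() if a == age) / class_total
        scores.append(p_class * p_gender * p_age)
    # Normalize so the two class scores sum to one.
    return scores[0] / (scores[0] + scores[1])


exact = 180 / (180 + 150)
approx = naive_bayes_subscribe("female", "young")
print(round(exact, 3), round(approx, 3))  # 0.545 0.511
```

The two values differ because gender and age are not exactly independent in these data, yet both put the segment above 0.50, so the classification is the same.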

## 8.3 Illustration of Naïve Bayes with a “toy” data set

The small data set consists of 14 observations with the target variable “play tennis” and weather characteristics thought to affect the decision to play or not play. The observations are shown in Table 8.1.

Table 8.1: The tennis data set.

| Observation | Play.Tennis | Outlook | Temperature | Humidity | Wind |
|---|---|---|---|---|---|
| 1 | No | Sunny | Hot | High | Weak |
| 2 | No | Sunny | Hot | High | Strong |
| 3 | Yes | Overcast | Hot | High | Weak |
| 4 | Yes | Rain | Mild | High | Weak |
| 5 | Yes | Rain | Cool | Normal | Weak |
| 6 | No | Rain | Cool | Normal | Strong |
| 7 | Yes | Overcast | Cool | Normal | Strong |
| 8 | No | Sunny | Mild | High | Weak |
| 9 | Yes | Sunny | Cool | Normal | Weak |
| 10 | Yes | Rain | Mild | Normal | Weak |
| 11 | Yes | Sunny | Mild | Normal | Strong |
| 12 | Yes | Overcast | Mild | High | Strong |
| 13 | Yes | Overcast | Hot | Normal | Weak |
| 14 | No | Rain | Mild | High | Strong |

Using the data in Table 8.1, the following probabilities were calculated:

The probabilities shown in Figure 8.3 were obtained simply by counting. For example, to obtain the conditional probability of Sunny given not playing, note that five observations were for not playing. Of those five observations, three had a Sunny outlook, so the conditional probability is 3/5 = 0.60. Similar calculations were done for each of the probabilities in the table.
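
The counting can be sketched in Python using the rows of Table 8.1. The tuple layout, the column index constant, and the function name `cond_prob` are illustrative choices.

```python
from collections import Counter

# Table 8.1 as (play, outlook, temperature, humidity, wind)
rows = [
    ("No", "Sunny", "Hot", "High", "Weak"),
    ("No", "Sunny", "Hot", "High", "Strong"),
    ("Yes", "Overcast", "Hot", "High", "Weak"),
    ("Yes", "Rain", "Mild", "High", "Weak"),
    ("Yes", "Rain", "Cool", "Normal", "Weak"),
    ("No", "Rain", "Cool", "Normal", "Strong"),
    ("Yes", "Overcast", "Cool", "Normal", "Strong"),
    ("No", "Sunny", "Mild", "High", "Weak"),
    ("Yes", "Sunny", "Cool", "Normal", "Weak"),
    ("Yes", "Rain", "Mild", "Normal", "Weak"),
    ("Yes", "Sunny", "Mild", "Normal", "Strong"),
    ("Yes", "Overcast", "Mild", "High", "Strong"),
    ("Yes", "Overcast", "Hot", "Normal", "Weak"),
    ("No", "Rain", "Mild", "High", "Strong"),
]

class_counts = Counter(r[0] for r in rows)  # 9 "Yes" days and 5 "No" days
OUTLOOK = 1  # column index of Outlook in each row


def cond_prob(column, value, play):
    """P(predictor = value | Play.Tennis = play), estimated by counting."""
    matches = sum(1 for r in rows if r[0] == play and r[column] == value)
    return matches / class_counts[play]


print(cond_prob(OUTLOOK, "Sunny", "No"))             # 0.6
print(round(cond_prob(OUTLOOK, "Sunny", "Yes"), 3))  # 0.222
```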

To obtain probabilities of playing versus not playing for Outlook = Sunny, Temperature = Mild, Humidity = High, and Wind = Strong, the following calculations were made using the naive Bayes model:

• The value for playing tennis:
• Prob(Outlook = Sunny given Playing tennis = Yes) = 0.222 times
• Prob(Temperature = Mild given Playing tennis = Yes) = 0.444 times
• Prob(Humidity = High given Playing tennis = Yes) = 0.333 times
• Prob(Wind = Strong given Playing tennis = Yes) = 0.333 times
• Prob(Playing tennis = Yes) = 0.643
• which equals 0.222 × 0.444 × 0.333 × 0.333 × 0.643 = 0.0071

• The value for not playing tennis:
• Prob(Outlook = Sunny given Playing tennis = No) = 0.600 times
• Prob(Temperature = Mild given Playing tennis = No) = 0.400 times
• Prob(Humidity = High given Playing tennis = No) = 0.800 times
• Prob(Wind = Strong given Playing tennis = No) = 0.600 times
• Prob(Playing tennis = No) = 0.357
• which equals 0.600 × 0.400 × 0.800 × 0.600 × 0.357 = 0.0411

• The probability of playing = 0.0071 / (0.0071 + 0.0411) ≈ 0.146 (computed from the unrounded values)
• Since the probability of playing is less than 0.50, the prediction is “Not play”
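
The same chain of multiplications can be sketched in Python from the rows of Table 8.1. The function name `naive_bayes_play` is illustrative, and no smoothing is applied. Results are rounded for display, so the last digit can differ slightly from hand calculations that use rounded inputs.

```python
# Table 8.1 as (play, outlook, temperature, humidity, wind)
rows = [
    ("No", "Sunny", "Hot", "High", "Weak"),
    ("No", "Sunny", "Hot", "High", "Strong"),
    ("Yes", "Overcast", "Hot", "High", "Weak"),
    ("Yes", "Rain", "Mild", "High", "Weak"),
    ("Yes", "Rain", "Cool", "Normal", "Weak"),
    ("No", "Rain", "Cool", "Normal", "Strong"),
    ("Yes", "Overcast", "Cool", "Normal", "Strong"),
    ("No", "Sunny", "Mild", "High", "Weak"),
    ("Yes", "Sunny", "Cool", "Normal", "Weak"),
    ("Yes", "Rain", "Mild", "Normal", "Weak"),
    ("Yes", "Sunny", "Mild", "Normal", "Strong"),
    ("Yes", "Overcast", "Mild", "High", "Strong"),
    ("Yes", "Overcast", "Hot", "Normal", "Weak"),
    ("No", "Rain", "Mild", "High", "Strong"),
]


def naive_bayes_play(outlook, temperature, humidity, wind):
    """Return (P(play = Yes | predictors), unnormalized class scores)."""
    query = (outlook, temperature, humidity, wind)
    scores = {}
    for play in ("Yes", "No"):
        subset = [r for r in rows if r[0] == play]
        score = len(subset) / len(rows)  # prior probability of the class
        for col, value in enumerate(query, start=1):
            # Conditional probability of this predictor value within the class
            score *= sum(1 for r in subset if r[col] == value) / len(subset)
        scores[play] = score
    return scores["Yes"] / (scores["Yes"] + scores["No"]), scores


p_yes, scores = naive_bayes_play("Sunny", "Mild", "High", "Strong")
print({k: round(v, 4) for k, v in scores.items()})  # {'Yes': 0.0071, 'No': 0.0411}
print(round(p_yes, 3))                               # 0.146
```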

Similar calculations were completed for each of the 14 observations with a summary of the predictions in Table 8.2. Thirteen of the 14 predictions were correct using the Naïve Bayes model.

Table 8.2: Prediction accuracy with tennis data set using naive Bayes.

| Observation | Play.Tennis | Probability | Prediction |
|---|---|---|---|
| 1 | No | 0.2046 | No |
| 2 | No | 0.0790 | No |
| 3 | Yes | 0.9993 | Yes |
| 4 | Yes | 0.5365 | Yes |
| 5 | Yes | 0.9328 | Yes |
| 6 | No | 0.8224 | Yes* |
| 7 | Yes | 0.9995 | Yes |
| 8 | No | 0.3397 | No |
| 9 | Yes | 0.8606 | Yes |
| 10 | Yes | 0.9025 | Yes |
| 11 | Yes | 0.5784 | Yes |
| 12 | Yes | 0.9993 | Yes |
| 13 | Yes | 0.9996 | Yes |
| 14 | No | 0.2784 | No |

Note: * indicates a prediction error.

## 8.4 The assumption of conditional independence

Referring to Figure 8.4, what is the probability of getting a three on a roll of the die, “red” on the spinner, and heads on a flip of a coin? Since the three experiments are independent, the probability is simply 1/6 × 1/4 × 1/2 = 1/48 = 0.0208.

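This is what naïve Bayes analysis assumes about the effects of the predictors on the target class in a supervised model: within each target class, the predictors are treated as independent of one another, so their conditional probabilities can simply be multiplied.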
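
The product rule for independent events can be verified with exact fractions. The sketch assumes a fair six-sided die, a spinner with four equally likely colors (implied by the 1/4 in the text), and a fair coin.

```python
from fractions import Fraction

p_three = Fraction(1, 6)  # a three on a fair six-sided die
p_red = Fraction(1, 4)    # "red" on a spinner with four equally likely colors (assumed)
p_heads = Fraction(1, 2)  # heads on a fair coin

# Independence lets the joint probability be written as a simple product.
p_joint = p_three * p_red * p_heads
print(p_joint, round(float(p_joint), 4))  # 1/48 0.0208
```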

## 8.5 Naïve Bayes with continuous predictors

For simplicity, the previous examples had only categorical predictors, but naïve Bayes can also be used with continuous predictors. There are two approaches. A simple solution is to discretize the continuous variables into a few categories. However, the choice of categories is somewhat subjective. For instance, in categorizing temperature, one person may select 80 degrees as the cutoff at which temperature is considered “High,” whereas another person (from the tropics!) may select 90 degrees as the border between “Medium” and “High.” Discretizing also loses information, because all values within a category are treated alike. Still, it can be a quick way to get going with naïve Bayes classification.
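
The effect of the cutoff choice can be illustrated with a small binning helper. The function name `discretize`, the sample temperatures, and the Low/Medium border of 65 degrees are assumptions made for this sketch; only the 80 and 90 degree cutoffs come from the text.

```python
from bisect import bisect_right


def discretize(value, cutoffs, labels):
    """Map a numeric value to a label; a value equal to a cutoff falls in the higher category."""
    return labels[bisect_right(cutoffs, value)]


labels = ["Low", "Medium", "High"]
temps = [62, 78, 85, 93]  # hypothetical temperatures in degrees

# Two analysts agree on the Low/Medium border (assumed 65) but not on Medium/High.
print([discretize(t, [65, 80], labels) for t in temps])
# ['Low', 'Medium', 'High', 'High']
print([discretize(t, [65, 90], labels) for t in temps])
# ['Low', 'Medium', 'Medium', 'High']
```

The 85-degree day changes category depending on the analyst, which would change the counts that feed the naive Bayes probabilities.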

Another method is to represent continuous variables with a probability density function. Typically, the normal (Gaussian) distribution is used, which is convenient since a continuous variable can then be represented using just its mean and standard deviation. Some software implementations of naïve Bayes offer a choice of other distributions, e.g., Poisson.

The way this works is demonstrated in Figure 8.5. Consider a continuous variable, V, that is a predictor of a categorical variable Y, which is either True or False. Observations on V in the data sample are grouped according to their Y values. The mean and standard deviation of each group are computed and used to form the two normal density functions shown in Figure 8.5. The conditional values Prob(V | Y = False) and Prob(V | Y = True), which are needed for the naïve Bayes model, are then read from the density functions at the observed value of V. This method assumes that the normal distribution usefully represents the variable V.
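
A minimal sketch of this procedure follows. The observations of V are hypothetical values invented for illustration, and the function names are not taken from any library; the density values stand in for the conditional probabilities in the naive Bayes product.

```python
import math
from statistics import mean, stdev


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical observations of V, grouped by the value of the target Y
v_by_class = {
    True: [6.1, 5.8, 7.0, 6.4, 6.6],
    False: [3.9, 4.5, 5.1, 4.2, 4.8],
}

# One (mean, standard deviation) pair per class defines each density curve.
params = {y: (mean(vs), stdev(vs)) for y, vs in v_by_class.items()}


def density(v, y):
    """Density of V at v under the normal curve fitted to class y."""
    mu, sigma = params[y]
    return normal_pdf(v, mu, sigma)


v_new = 5.5
print(density(v_new, True), density(v_new, False))
```

For a new observation, the two density values are multiplied by the class priors (and by the terms for any other predictors) in the same way as the categorical conditional probabilities.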

## 8.6 Laplace Smoothing

The naïve Bayes algorithm can have a problem in certain situations, especially with small sample sizes. The problem arises when a particular level of a predictor never occurs together with one of the target classes in the training data. In this case, the estimated conditional probability is zero and, since the conditional probabilities are multiplied in a chain, the posterior probability of that class is zero for every observation that includes the level. (This was actually the case in the tennis example illustrated earlier: among the days on which tennis was not played, the Overcast level of Outlook never occurred.)

To avoid this, a Laplace smoother is used. There are several ways to incorporate the smoother, the simplest being to add one to the count for every combination of predictor level and target class.
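
The add-one approach can be sketched with the Outlook counts for the "No" days in the tennis data (Sunny 3, Overcast 0, Rain 2). The function name `smoothed_prob` and the `alpha` parameter are illustrative; adding one to each of the three level counts raises the class total by three, which is why the denominator grows by the number of levels.

```python
def smoothed_prob(count, class_total, n_levels, alpha=1):
    """Add-alpha estimate of P(level | class); alpha = 1 gives Laplace smoothing."""
    return (count + alpha) / (class_total + alpha * n_levels)


# Outlook counts among the five days on which tennis was not played
no_counts = {"Sunny": 3, "Overcast": 0, "Rain": 2}
total_no = sum(no_counts.values())

for level, count in no_counts.items():
    raw = count / total_no
    smooth = smoothed_prob(count, total_no, n_levels=len(no_counts))
    print(level, raw, smooth)
# Sunny 0.6 0.5
# Overcast 0.0 0.125
# Rain 0.4 0.375
```

The smoothed probabilities still sum to one, and the Overcast level no longer forces the "No" score to zero.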

## 8.7 Example using naïve Bayes with churn data

The churn data set was analyzed using naive Bayes in KNIME. The KNIME workflow is shown in Figure 8.6. The same preprocessing of the churn data was included and SMOTE was used to balance the target values in the training data.

Node descriptions for the workflow in Figure 8.6 are in Table 8.3.

Table 8.3: Node descriptions for naive Bayes with churn data.

| Node | Label | Description |
|---|---|---|
| 2 | Math Formula | Compute square root of total charges. |
| 3 | Partitioning | Stratified sampling on Churn; 70/30 split. |
| 4 | SMOTE | Oversample minority cases. |
| 5 | Naïve Bayes Learner | Run naïve Bayes on training data. |
| 6 | Naïve Bayes Predictor | Use the naïve Bayes model to predict test data. |
| 7 | ROC Curve | Create ROC curve and AUC. |
| 8 | Scorer | Calculate performance metrics and confusion matrix. |

The evaluation metrics for naïve Bayes are shown in Table 8.4. For comparison, the metrics from a basic decision tree and three ensemble models are also shown. The naïve Bayes model performed comparably on the area under the ROC curve: its AUC (0.820) was greater than that of the decision tree (0.803) but lower than those of the ensemble models (0.837 to 0.846). Interestingly, naïve Bayes traded specificity (the lowest, 0.655) for sensitivity (the highest, 0.831), and its overall accuracy (0.702) was the lowest of the five models. Overall, while naïve Bayes is a contender for classification, it does not perform as well as more complex models.

Table 8.4: Comparative performance of naïve Bayes with ensemble models.

| Model | ROC AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| Naïve Bayes | 0.820 | 0.702 | 0.831 | 0.655 |
| Decision tree | 0.803 | 0.745 | 0.754 | 0.742 |
| Random forest | 0.837 | 0.758 | 0.770 | 0.754 |
| GBT | 0.846 | 0.758 | 0.807 | 0.740 |
| XGBoost | 0.846 | 0.752 | 0.783 | 0.740 |

## 8.8 Spam detection using naïve Bayes

Email has provided a convenient mode of communication that is used throughout the world by millions of people for business and personal messages. The huge number of unsolicited commercial messages most people receive daily, however, is at best an annoyance and at worst a means for deception or even criminal activity. The proliferation and variety of these unsolicited email messages, now called spam or junk mail, led to the development of software programs to detect and screen out such emails. These spam filters sift through email messages to separate the “ham” from the “spam.” The challenge in designing spam filters is to make the algorithm selective enough to identify spam while not flagging legitimate messages. It has been estimated that about 45% of global email traffic is spam.

Naïve Bayes has been used as the machine learning engine for spam filters because of its simplicity, speed, and accuracy. Many enhancements of the basic naïve Bayes model have been made to improve its performance, and other algorithms, such as k-nearest neighbors and support vector machines, have also been used.

A data set of 5,572 messages labeled as spam or ham was downloaded and analyzed using KNIME. Example messages in the data set are in Table 8.5:

Table 8.5: Examples of spam and ham email messages.

| Class | Message |
|---|---|
| ham | What you doing? how are you? |
| ham | Siva is in hostel aha:-. |
| spam | Sunshine Quiz! Win a super Sony DVD recorder if you can name the capital of Australia? Text MQUIZ to 82277. |
| spam | PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S.I.M. points. Call 08718738001 Identifier Code: 49557 Expires 26/11/04 |

The text file was converted into a file consisting of a bag of words (see note 1 at the end of the chapter). The KNIME workflow is in Figure 8.7, and node descriptions are in Table 8.6. The workflow created a table with 5,572 rows (one for each message) and 12,230 columns with indicators for terms. (Textual analysis is covered in more detail in the chapter on Text Analytics.)

Table 8.6: Node descriptions for naive Bayes with SPAM / HAM data workflow.

| Node | Label | Description |
|---|---|---|
| 2 | Strings to Document | Convert observations to documents. |
| 3 | Punctuation Erasure | Remove all punctuation from the documents. |
| 4 | N Chars Filter | Remove all terms with fewer than 3 characters. |
| 5 | Number Filter | Remove all terms that consist of numbers. |
| 6 | Case Converter | Convert all terms to lower case. |
| 7 | Stop Word Filter | Remove stop words using built-in list. |
| 8 | Bag Of Words Creator | Change each document to individual terms, one term in each row; creates a table with 90,102 rows. |
| 9 | Document Vector | Create a vector with one row per document and columns with binary indicator for presence of term (0 or 1). |
| 10 | Category To Class | Add the appropriate string class variable (either spam or ham). |
| 11 | Number to String | Convert the numbers in the bag of words to strings. |
| 12 | Table Writer | Write bag of words to file SPAMHAM.table. |
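
The preprocessing steps in Table 8.6 can be approximated in plain Python. This is only a sketch of the idea, not the KNIME implementation: the stop-word list is a tiny hypothetical stand-in for KNIME's built-in list, and the function name `tokenize` is illustrative. The two messages are taken from Table 8.5.

```python
import re

# A tiny illustrative stop-word list; KNIME's built-in list is far longer.
STOP_WORDS = {"the", "you", "are", "how", "what", "for", "can", "and"}


def tokenize(message):
    """Apply the filtering steps in the order listed in Table 8.6."""
    text = re.sub(r"[^\w\s]", " ", message)           # punctuation erasure
    terms = [t for t in text.split() if len(t) >= 3]  # drop terms under 3 characters
    terms = [t for t in terms if not t.isdigit()]     # drop purely numeric terms
    terms = [t.lower() for t in terms]                # convert to lower case
    return [t for t in terms if t not in STOP_WORDS]  # stop-word filter


messages = [
    "What you doing? how are you?",
    "Sunshine Quiz! Win a super Sony DVD recorder if you can name "
    "the capital of Australia? Text MQUIZ to 82277.",
]

docs = [tokenize(m) for m in messages]
vocabulary = sorted({term for doc in docs for term in doc})

# Document vectors: one row per message, one 0/1 indicator column per term.
vectors = [[int(term in doc) for term in vocabulary] for doc in docs]

print(docs[0])          # ['doing']
print(len(vocabulary))  # 13
```

Each row of `vectors` corresponds to one row of the document-vector table, and each column is a binary predictor for the naive Bayes learner.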

The bag of words data from the preprocessing step was submitted to naive Bayes in KNIME (Figure 8.8). Descriptions of each node used to run the naïve Bayes model are in Table 8.7.

Table 8.7: Descriptions of nodes in the naïve Bayes workflow.

| Node | Label | Description |
|---|---|---|
| 2 | Column Filter | Remove column “Document.” |
| 3 | Partitioning | Create 70/30 split into training and test partitions. |
| 4 | Naïve Bayes Learner | Run naïve Bayes on training data. |
| 5 | Naïve Bayes Predictor | Use the naïve Bayes model to predict test data. |
| 6 | Scorer | Calculate performance metrics and confusion matrix. |

The results of the naïve Bayes analysis of the spam data set show quite good accuracy (over 98%). As shown in the confusion matrix created for the test data (Table 8.8), no actual ham messages were misclassified as spam, while 24 spam messages were classified as ham. This is a good result, since mistakenly filtering out legitimate messages is more serious than mistakenly classifying spam as legitimate.

Table 8.8: Confusion matrix on test data for naïve Bayes classifications of spam vs. ham.

| Actual | Predicted ham | Predicted spam | Totals |
|---|---|---|---|
| Ham | 1,444 | 0 | 1,444 |
| Spam | 24 | 204 | 228 |
| Totals | 1,468 | 204 | 1,672 |
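
The summary metrics follow directly from the four cells of the confusion matrix. The sketch below treats spam as the positive class, which is a labeling assumption made here; the variable names are illustrative.

```python
# Counts from Table 8.8, with spam treated as the positive class (an assumption)
tn, fp = 1444, 0   # actual ham: predicted ham, predicted spam
fn, tp = 24, 204   # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)  # share of spam that was caught
specificity = tn / (tn + fp)  # share of ham that was passed through

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
# accuracy=0.9856 sensitivity=0.8947 specificity=1.0000
```

The perfect specificity reflects the absence of ham messages flagged as spam, while the 24 missed spam messages account for the sensitivity of about 0.89.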

## 8.9 Comments on naïve Bayes

When used as a predictive algorithm, naïve Bayes works quite well. In fact, it works surprisingly well given the strong assumption of predictor independence and the simplicity of the model itself compared with other predictive algorithms. The algorithm has been found to work especially well when there is a large number of predictor variables. This may be because dependence among predictors is less likely. It also works when there are missing values.

One weakness is that probability estimates for combinations of predictors are not always very accurate. For example, with a binary target of True and False, the estimated probabilities of True and False may be off, but the classification may still be correct.

Compared with a full Bayesian analysis, naïve Bayes uses all observations in a data set, not just those with matching predictor values. This makes the calculations practical in more situations, including those with smaller sample sizes.

The key assumption (again) is that the predictor variables are independent. This assumption does not always hold. It is especially a problem when predictor variables are highly correlated with one another.

The analysis of the spam/ham data showed that naïve Bayes easily handled a table with many more columns than rows, which some supervised models cannot do.

### References

Karimovich, Khamidov Sherzod Jaloldin ugli, Ganiev Salim, and Olimov Iskandar Salimbayevich. 2020a. “An Empirical Study of the Naïve Bayes Classifier.” In Proceedings of the 22nd International Conference on Machine Learning, 625–32.
———. 2020b. “Analysis of Machine Learning Methods for Filtering Spam Messages in Email Services.” In Proceedings of the 22nd International Conference on Machine Learning, 625–32.
Kuhn, Max, and Kjell Johnson. 2016. Applied Predictive Modeling. 2nd ed. New York: Springer.
Ma, Kunihito Yamamori, Thae Ma, and Aye Thida. 2020. “A Comparative Approach to Naïve Bayes Classifier and Support Vector Machine for Email Spam Classification.” In IEEE 9th Global Conference on Consumer Electronics.
“Play Tennis: Simple Dataset with Decisions about Playing Tennis.” n.d. https://www.coursera.org/learn/machine-learning-under-the-hood.
“SMS Spam Collection Data Set, UCI Machine Learning Repository.” n.d. https://archive.ics.uci.edu/ml/datasets/sms+spam+collection.
“Spam Statistics and Facts.” n.d. https://www.spamlaws.com/spam-stats.html.

1. The “bag of words” model simply counts the number of occurrences of known words in a document and does not consider the order or syntax of the words. It is therefore a simplified representation of text.