Chapter 8 Naive Bayes
8.1 A thought problem
A police officer has a breathalyzer which indicates false drunkenness in 5% of the cases in which the driver is sober. However, the breathalyzers never fail to detect a truly drunk person. Suppose on a given evening 1 in 1,000 drivers are driving with alcohol over the legal limit. A traffic checkpoint stop is set up, drivers are selected at random, and the selected drivers are required to take a breathalyzer test.
Assume that a particular driver is found to be over the legal limit for alcohol according to the breathalyzer. Assume nothing else about the driver. What is the probability that the driver really is over the limit?
Many people have answered that the probability is as high as 0.95, but the correct probability is about 0.02. How can the proper probability that the person is really drunk be estimated? This calls for Bayes' theorem.
Bayes’ theorem is a formula that describes how to update the prior probability of an event when additional evidence is made available. Prior probabilities from a Bayesian perspective are based on the known likelihoods from historical data. In the example of random checks of drivers, the prior is 1/1000 or .001 that the driver is drunk.
To estimate the probability of identifying a drunk driver, the results of the breathalyzer can be used to update the probability estimate. The revised probability is called the posterior probability. To determine the posterior probability, Bayes theorem can be used. The goal is to find the probability that the driver is drunk given that the breathalyzer indicated he/she is drunk, which can be represented as:
- p(drunk|POS), where "POS" means that the breathalyzer indicates that the driver is drunk
- p(drunk|POS) = [p(POS|drunk) X p(drunk)] / p(POS)
- where p(POS) = p(POS|drunk) X p(drunk) + p(POS|sober) X p(sober)
- p(drunk) = 0.001
- p(sober) = 0.999
- p(POS|drunk) = 1.00 (the breathalyzer is 100% accurate if the person is actually drunk)
- p(POS|sober) = 0.05 (the breathalyzer mistakenly reports a sober driver as drunk 5% of the time)
Given the data and a positive indication on the breathalyzer test for a randomly selected driver, what is the probability that the person is drunk?
- The numerator of Bayes' formula = [p(POS|drunk) X p(drunk)] = [1.00 X 0.001] = 0.001
- The denominator of Bayes' formula = p(POS|drunk) X p(drunk) + p(POS|sober) X p(sober) = 1.0 X 0.001 + 0.05 X 0.999 = 0.001 + 0.04995 = 0.05095

Substituting the numerator and denominator into Bayes' theorem yields:
- p(drunk|POS) = 0.001 / 0.05095 = 0.0196
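The arithmetic above can be reproduced in a few lines of Python, a minimal sketch using the probabilities given in the text:

```python
# Bayes' theorem for the breathalyzer example.
# "POS" = positive breathalyzer reading; values are from the text.
p_drunk = 0.001            # prior: 1 in 1,000 drivers is over the limit
p_sober = 1 - p_drunk
p_pos_given_drunk = 1.00   # the test never misses a truly drunk driver
p_pos_given_sober = 0.05   # 5% false-positive rate on sober drivers

# Total probability of a positive test (the denominator)
p_pos = p_pos_given_drunk * p_drunk + p_pos_given_sober * p_sober

# Posterior: probability the driver is drunk given a positive test
p_drunk_given_pos = (p_pos_given_drunk * p_drunk) / p_pos
print(round(p_drunk_given_pos, 4))  # 0.0196
```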
The framework of Bayes theorem can be applied to a supervised analytics problem.
8.2 Bayes Theorem applied to predictive analytics

Consider a case of a cable service provider that has over two million subscribers. The company decides to perform a test market to predict whether current customers will subscribe to a new service.
The test involved sending an offer to a random sample of 1,000 current customers. This can be cast as a Bayesian model. Using the results of the test, the company would like a predictive model to use for the rest of its customers.
The test results were that 400 customers bought the new service, so the prior probability of purchase was 0.40. This is illustrated in Figure 8.1 which has a grid representing the 1,000 customers in the test.
If this prior probability is applied to the entire subscriber base, then the company would expect to have 400,000 positive responses. The process of contacting customers via mail, email, and telephone to offer the new service had a cost, so the company wanted to know if there was a way to make the contacting process more efficient. That is, was there a way of increasing the positive response rate?
It turned out that the company had data on the gender and age (young or old) of its subscribers, and thus this information was available on those in the test market. Using gender, the probability of purchase can be refined. It turns out that there were 600 female customers in the test, 300 of whom subscribed to the new service, and 400 males, 100 of whom subscribed. The probabilities were further refined using age, with results shown in Figure 8.2.
By simply counting the number of customers in each shaded area, the posterior probabilities of each segment could be calculated.
- prob (Subscribing | male, young) = 50/ (50+180) = .217
- prob (Subscribing | male, old) = 50/ (50+120) = .294
- prob (Subscribing |female, young) = 180/ (180 + 150) = .545
- prob (Subscribing | female, old) = 120/ (120+150) = .444
So, this small example shows that the Bayes model can be used for predicting the classification of new observations. To classify a new case, find all of the observations in the sample with exactly the same descriptive characteristics. With this set of observations, count the number of positive and negative outcomes and apply the counting scheme discussed above.
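The counting scheme can be sketched in Python; the segment counts below are the test-market figures described above:

```python
# Posterior purchase probabilities by counting, using the test-market
# counts from the text: (gender, age) -> (buyers, non-buyers)
segments = {
    ("male", "young"):   (50, 180),
    ("male", "old"):     (50, 120),
    ("female", "young"): (180, 150),
    ("female", "old"):   (120, 150),
}

probs = {}
for (gender, age), (pos, neg) in segments.items():
    probs[(gender, age)] = pos / (pos + neg)
    print(f"P(subscribe | {gender}, {age}) = {probs[(gender, age)]:.3f}")
```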
This approach does not work if there are many predictors. Many practical predictive modeling problems have many predictors. So, the Bayesian idea works in theory, but not always in practice.
As a more practical example, assume you want to predict a binary target class with true or false as the outcomes using 15 binary predictors. Assume that you need at least 50 observations in each one of the resulting cells to make a reasonable estimate of the true versus false values of the binary target. The very minimum number of observations you would need is 50 x 2^15 = 1,638,400 observations. Even this may not be enough because the distribution of observations may not be uniform, and many of the cells will have too few observations.
The “solution” to this problem is to use the Naïve Bayes model. The word solution is in quotes because the problem is not really solved. Instead, an approximation, which works well in many practical situations, is used. The approximation is based on the assumption that the predictor variables operate independently of one another. That is, naive Bayes assumes that the presence of a specific feature is unrelated to the presence of any other feature. If the predictors operate independently, then the joint probabilities of multiple variables can be simply estimated as the product of the individual probabilities.
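The cell-count arithmetic and the independence shortcut can both be illustrated with a short Python sketch; the three conditional probabilities are hypothetical values chosen for illustration:

```python
# Minimum sample size for the full-table approach with 15 binary predictors
print(50 * 2**15)  # 1638400 observations needed

# Under the naive independence assumption, the joint conditional
# probability of several predictor values factors into a product.
# Hypothetical conditional probabilities for three binary predictors:
p_given_class = [0.8, 0.3, 0.6]   # P(x_i | class) for each predictor

joint = 1.0
for p in p_given_class:
    joint *= p     # p(x1, x2, x3 | class) = product of the p(x_i | class)
print(round(joint, 3))  # 0.144
```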
8.3 Illustration of Naïve Bayes with a “toy” data set
The small data set consists of 14 observations with the target variable “play tennis” and weather characteristics thought to affect the decision to play or not play. (“Play Tennis: Simple Dataset with Decisions about Playing Tennis,” n.d.) The observations are shown in Table 8.1.
Using the data in Table 8.1, the following probabilities were calculated:
The calculations shown in Figure 8.3 were obtained simply by counting. For example, to obtain the conditional probability of Sunny given Not playing, note that five observations were for Not playing. Of those five observations, three had a Sunny outlook, so the conditional probability is 3/5 = .60. Similar calculations were done for each of the probabilities in the table.
To obtain probabilities of playing versus not playing for Outlook = Sunny, Temperature = Mild, Humidity = High, and Wind = Strong, the following calculations were made using the naive Bayes model:
- The value for playing tennis:
- Prob(Outlook=Sunny Given Playing tennis = Yes) = 0.222 times
- Prob(Temperature=Mild Given Playing tennis = Yes) = 0.444 times
- Prob(Humidity=High Given Playing tennis = Yes) = 0.333 times
- Prob(Wind=Strong Given Playing tennis = Yes) = 0.333 times
- Prob(Playing tennis = Yes) = 0.643
- which equals 0.222 X 0.444 X 0.333 X 0.333 X 0.643 = 0.0071
- The value for not playing tennis:
- Prob(Outlook=Sunny Given Playing tennis = No) X
- Prob(Temperature=Mild Given Playing tennis = No) X
- Prob(Humidity=High Given Playing tennis = No) X
- Prob(Wind=Strong Given Playing tennis = No) X
- Prob(Playing tennis = No)
- which equals 0.600 X 0.400 X 0.800 X 0.600 X 0.357 = 0.0412
- The probability of playing = 0.0071 / (0.0071 + 0.0412) = 0.147
- Since the probability of playing is less than 0.50, the prediction is “Not play”
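The tennis calculation can be verified with a short Python sketch using the conditional probabilities listed above:

```python
# Naive Bayes scores for Outlook=Sunny, Temperature=Mild, Humidity=High,
# Wind=Strong, using the conditional probabilities from Figure 8.3.
p_yes = 0.222 * 0.444 * 0.333 * 0.333 * 0.643   # playing
p_no  = 0.600 * 0.400 * 0.800 * 0.600 * 0.357   # not playing

# Normalize so the two scores sum to one
p_play = p_yes / (p_yes + p_no)
print(round(p_play, 2))   # roughly 0.15, so predict "Not play"
```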
Similar calculations were completed for each of the 14 observations with a summary of the predictions in Table 8.2. Thirteen of the 14 predictions were correct using the Naïve Bayes model.
Note: * indicates prediction error.
8.4 The assumption of conditional independence
Referring to Figure 8.4, what is the probability of getting a three on a roll of the die, "red" on the spinner, and heads on a flip of a coin? Since the three experiments are independent, the probability is simply 1/6 X 1/4 X 1/2 = 1/48 = .0208.
This is what naïve Bayes analysis assumes about the effects of the predictors on the target class in a supervised model.
8.5 Naïve Bayes with continuous predictors
For simplicity, the previous examples only had categorical predictors, but naïve Bayes can be used with continuous predictors. There are two approaches that can be used for continuous predictors. A simple solution is to discretize the continuous variables into a few categories. However, doing so is sometimes subjective. For instance, in categorizing temperature, someone may select 80 degrees as the cutoff at which temperature can be considered "High," whereas another person (from the tropics!) may choose 90 degrees as the border between "Medium" and "High." Discretization also causes an obvious loss of information. But it can still be used as a quick way to get going before applying naive Bayes classification.
Another method is to represent continuous variables with a probability density function. Typically, the normal or Gaussian distribution is used, but some software programs can use other distributions. The normal distribution is convenient since a continuous variable can be represented using just its mean and standard deviation. Some software implementations of naïve Bayes offer the choice of other distribution functions, e.g., Poisson.
The way this works is demonstrated in Figure 8.5. Consider a continuous variable, V, that is a predictor of a categorical variable Y which is either True or False. Observations on V in the data sample are grouped according to the Y values. The means and standard deviations of each group are computed and used to form the two normal density functions shown in Figure 8.5. The conditional probabilities Prob(V|Target = False) and Prob(V|Target = True), which are needed for the naïve Bayes model, are then obtained from the density functions. This method assumes that the normal distribution usefully represents the variable V.
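A minimal Python sketch of this method follows; the group means, standard deviations, and the new observation are hypothetical values chosen for illustration:

```python
import math

def normal_pdf(x, mean, sd):
    """Gaussian density used as the class-conditional likelihood."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Hypothetical sample statistics for predictor V, grouped by the target Y:
mean_true, sd_true = 70.0, 5.0      # observations where Y = True
mean_false, sd_false = 55.0, 8.0    # observations where Y = False

v = 66.0                            # a new observation on V
like_true = normal_pdf(v, mean_true, sd_true)
like_false = normal_pdf(v, mean_false, sd_false)

# These likelihoods play the role of Prob(V|Target) in the naive Bayes chain
print(like_true, like_false)
```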
8.6 Laplace Smoothing
The naïve Bayes algorithm can have a problem in certain situations, especially with small sample sizes. The problem occurs if a particular value of a predictor never occurs with one of the target levels. In that case the conditional probability becomes zero, and since the conditional probabilities are multiplied in a chain, every posterior probability involving that level becomes zero. (This was actually the case in the tennis example illustrated earlier: for the condition not playing tennis, the Overcast level of Outlook never occurred.)
To avoid this, a Laplace Smoother (Kuhn and Johnson 2016) is used. There are several ways to incorporate the smoother with the simplest being to add one to every count in the combination of predictor values.
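A short Python sketch of add-one smoothing, using the Outlook counts among the five not-playing observations in the tennis example (the Sunny and Overcast counts come from the text; the Rain count follows from the five not-playing observations):

```python
# Add-one (Laplace) smoothing of conditional probability estimates.
# Outlook counts among the "not playing" observations; Overcast never
# occurs, so its raw estimate would be zero.
counts = {"Sunny": 3, "Overcast": 0, "Rain": 2}

total = sum(counts.values())
k = len(counts)                      # number of levels of the predictor

raw = {level: c / total for level, c in counts.items()}
smoothed = {level: (c + 1) / (total + k) for level, c in counts.items()}

print(raw["Overcast"])       # 0.0, which zeroes out the whole product
print(smoothed["Overcast"])  # 0.125, small but nonzero
```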
8.7 Example using naïve Bayes with churn data
The churn data set was analyzed using naive Bayes in KNIME. The KNIME workflow is shown in Figure 8.6. The same preprocessing of the churn data was included and SMOTE was used to balance the target values in the training data.
|1||File Reader||Read the file ChurnData.csv.|
|2||Math Formula||Compute square root of total charges|
|3||Partitioning||Stratified sampling on Churn; 70/30 split.|
|4||SMOTE||Oversample minority cases|
|5||Naïve Bayes Learning||Run naïve Bayes on training data.|
|6||Naïve Bayes Predictor||Use the naïve Bayes model to predict test data.|
|7||ROC Curve||Create ROC curve and AUC.|
|8||Scorer||Calculate performance metrics and confusion matrix.|
The evaluation metrics results for naïve Bayes are shown in Table 8.4. For comparison, the metrics from a basic decision tree as well as three ensemble models are also shown. The naïve Bayes model performed comparably. Its area under the ROC curve was greater than that of the decision tree but lower than that of the ensemble models. Interestingly, naïve Bayes traded specificity (lower) for sensitivity (higher). Overall, however, while naïve Bayes is a contender for classification, it does not perform as well as more complex models.
8.8 Spam detection using naïve Bayes
Email has provided a convenient mode of communication that is used throughout the world by millions of people for business and personal messages. However, the huge number of unsolicited commercial messages most people receive daily is at best an annoyance and at worst a means for deception or even criminal activity. The proliferation and variety of these unsolicited email messages, now called spam or junk mail, led to the development of software programs to detect and screen out such emails. Spam filters have been developed to sift through email messages to separate the "ham" from the "spam." The challenge in designing spam filters is to make the algorithm selective enough to identify spam while not flagging legitimate messages. It has been estimated that about 45% of global e-mail traffic is spam. ("Spam Statistics and Facts," n.d.)
Naïve Bayes has been used as the machine learning engine for spam filters because of its simplicity, speed, and accuracy. Many enhancements of the basic naïve Bayes model have been made to improve its performance, and other algorithms have been used as well, such as k-nearest neighbors and support vector machines. (Karimovich and Salimbayevich 2020b) (Ma and Thida 2020)
A data set of 5,556 messages labeled as spam or ham email messages (“SMS Spam Collection Data Set, UCI Machine Learning Repository,” n.d.) was downloaded and analyzed using KNIME. Example messages in the data set are in Table 8.5:
|ham||What you doing? how are you?|
|ham||Siva is in hostel aha:-.|
|spam||Sunshine Quiz! Win a super Sony DVD recorder if you can name the capital of Australia? Text MQUIZ to 82277.|
|spam||PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S.I.M. points. Call 08718738001 Identifier Code: 49557 Expires 26/11/04|
The text file was converted into a file consisting of a bag of words. The KNIME workflow is in Figure 8.7 and node descriptions are in Table 8.6. The workflow created a table with 5,572 rows (one for each message) and 12,230 columns with indicators for terms. (Textual analysis will be covered in more detail in the chapter on Text Analytics.)
|1||Excel Reader||Read the Excel file SPAMHAM.xlsx.|
|2||Strings to Document||Convert observations to documents.|
|3||Punctuation Erasure||Remove all punctuation from the documents.|
|4||N Chars Filter||Remove all terms with fewer than 3 characters.|
|5||Number Filter||Remove all terms that consist of numbers.|
|6||Case Converter||Convert all terms to lower case.|
|7||Stop Word Filter||Remove stop words using built-in list.|
|8||Bag Of Words Creator||Change each document to individual terms, one term in each row; creates a table with 90,102 rows.|
|9||Document Vector||Create a vector with one row per document and columns with binary indicator for presence of term (0 or 1).|
|10||Category To Class||Adds the appropriate string class variable (either spam or ham).|
|11||Number to String||Convert the numbers in the bag of words to strings.|
|12||Table Writer||Write bag of words to file SPAMHAM.table.|
|1||Table Reader||Read the file SPAMHAM.table.|
|2||Column Filter||Remove column “Document.”|
|3||Partitioning||Create 70/30 split into training and test partitions.|
|4||Naïve Bayes Learner||Run naïve Bayes on training data.|
|5||Naïve Bayes Predictor||Use the naïve Bayes model to predict test data.|
|6||Scorer||Calculate performance metrics and confusion matrix.|
The results of the naïve Bayes analysis of the spam data set show quite good accuracy (over 98%). As shown in the confusion matrix created for the test data (Table 8.8), zero actual ham messages were misclassified as spam, while 24 spam messages were classified as ham. This is a good result since mistakes in filtering out legitimate messages are more serious than mistakenly classifying spam as legitimate.
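The essential logic of a bag-of-words naïve Bayes spam filter can be sketched in a few lines of Python. The four training messages below are invented for illustration (this is not the SMS data set), and the sketch omits the preprocessing steps in the KNIME workflow:

```python
# A minimal bag-of-words naive Bayes spam filter with add-one smoothing.
import math
from collections import Counter, defaultdict

train = [
    ("spam", "win a free prize now"),
    ("spam", "free quiz win cash"),
    ("ham",  "how are you doing"),
    ("ham",  "are you in the hostel"),
]

class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)   # per-class term frequencies
vocab = set()
for label, text in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def score(label, text):
    # log P(label) + sum of log P(word | label), Laplace-smoothed
    total = sum(word_counts[label].values())
    s = math.log(class_counts[label] / len(train))
    for word in text.split():
        s += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return s

msg = "win a free dvd"
pred = max(class_counts, key=lambda label: score(label, msg))
print(pred)  # spam
```

Working in log probabilities avoids the numerical underflow that multiplying many small probabilities would cause on long messages.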
Note: The "bag of words" model simply counts the number of occurrences of known words in a document and does not consider the order or syntax of the words. It is therefore a simplified representation of text.