12.1 Naive Bayes Models

To illustrate the methods used in this chapter, the OkCupid data will be used in conjunction with the naive Bayes classification model. This model uses Bayes' formula to predict class probabilities.

\[\begin{align} Pr[Class|Predictors] &= \frac{Pr[Class]\times Pr[Predictors|Class]}{Pr[Predictors]} \notag \\ &= \frac{Prior\:\times\:Likelihood}{Evidence} \notag \end{align}\]

In English:

Given our predictor data, what is the probability of each class?

There are three elements to this equation. The prior is the general probability of each class (e.g., the rate of STEM profiles) that can either be estimated from the training data or determined using expert knowledge. The likelihood measures the relative probability of the observed predictor data occurring for each class. The evidence is a normalization factor that can be computed from the prior and likelihood. Most of the computations occur when determining the likelihood. For example, if there are multiple numeric predictors, a multivariate probability distribution can be used for the computations. However, this is very difficult to compute outside of a bivariate normal distribution or may require an abundance of training set data. The determination of the likelihood becomes even more complex when the features are a combination of numeric and qualitative values. The naive aspect of this model is due to a very stringent assumption: the predictors are assumed to be independent. This enables the joint likelihood to be computed as a product of individual class-specific values.
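Written out for \(p\) predictors \(x_1, \ldots, x_p\) (notation introduced here only for illustration), the independence assumption replaces the joint likelihood with a product of one-predictor terms:

\[\begin{align} Pr[Predictors|Class] &= Pr[x_1, \ldots, x_p|Class] \notag \\ &= \prod_{j=1}^{p} Pr[x_j|Class] \notag \end{align}\]

The second equality holds only under the assumed independence of the predictors within each class; each factor can then be estimated separately.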

For example, suppose a naive Bayes model with two predictors: the number of punctuation marks in the essays as well as the stated religion. To start, consider the religion predictor (which has 10 possible values). For this predictor, a cross-tabulation is made between the values and the outcome and the probability of each religion value, within each class, is computed. Figure 12.1(a) shows the class-specific probabilities. Suppose an individual listed their religion as Atheist. The computed probability of being an atheist is 0.213 for STEM profiles and 0.103 otherwise.
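The cross-tabulation step can be sketched in a few lines of Python. This is a minimal illustration that uses a small hypothetical set of (class, religion) pairs, not the OkCupid data; the function name `class_conditional_probs` is also an assumption made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (class label, stated religion) pairs.
# These rows are illustrative only and are NOT the OkCupid data.
rows = [
    ("stem", "atheism"), ("stem", "atheism"), ("stem", "agnosticism"),
    ("stem", "christianity"),
    ("other", "atheism"), ("other", "christianity"), ("other", "christianity"),
    ("other", "agnosticism"), ("other", "catholicism"), ("other", "christianity"),
]

def class_conditional_probs(pairs):
    """Return {class: {value: Pr[value | class]}} from (class, value) pairs."""
    counts = defaultdict(Counter)
    for label, value in pairs:
        counts[label][value] += 1
    return {
        label: {value: n / sum(ctr.values()) for value, n in ctr.items()}
        for label, ctr in counts.items()
    }

probs = class_conditional_probs(rows)
print(probs["stem"]["atheism"])   # 2 of the 4 toy STEM rows
print(probs["other"]["atheism"])  # 1 of the 6 toy non-STEM rows
```

The resulting nested dictionary acts as a look-up table: predicting for a new profile only requires reading off the entry for its religion value within each class.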

For the continuous predictor, the number of punctuation marks, the distribution of the predictor is computed separately for each class. One way to accomplish this is by binning the predictor and using the histogram frequencies to estimate the probabilities. A better approach is to compute a nonparametric density function on these data (with a \(\log_{10}\) transformation) (Wand and Jones 1994). Figure 12.1(b) shows the density curves for the two classes. There is a slight trend where the non-STEM profiles tend to use less punctuation. Suppose that an individual used about 150 punctuation marks in their essay (indicated by the horizontal black line). To compute the relative density for each class, the corresponding heights of the density lines for each class are used. Here, the density for a STEM profile was 0.697 and was 0.558 for non-STEMs.
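As a rough sketch of the density approach, the code below fits a Gaussian kernel density estimate per class on the \(\log_{10}\) scale and reads off the height of each curve at 150 punctuation marks. The kernel, the bandwidth of 0.25, and the punctuation counts are all assumptions chosen for illustration; the text does not state which estimator settings were used, and the numbers here will not reproduce the values in Figure 12.1(b).

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` with Gaussian kernels."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density

# Hypothetical punctuation counts per class; NOT the OkCupid data.
counts = {
    "stem": [40, 95, 120, 150, 210, 300, 420],
    "other": [15, 30, 60, 90, 110, 160, 250],
}

# One density per class, fit on the log10 scale as described in the text.
# The bandwidth is an arbitrary illustrative choice.
densities = {
    label: gaussian_kde([math.log10(c) for c in values], bandwidth=0.25)
    for label, values in counts.items()
}

x_new = math.log10(150)  # a profile with about 150 punctuation marks
heights = {label: f(x_new) for label, f in densities.items()}
print(heights)
```

The two heights play the same role as the class-specific probabilities for a qualitative predictor: they are the per-class values that enter the likelihood product.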

Figure 12.1: Visualizations of the conditional distributions for a continuous and a discrete predictor for the OkCupid data.

Now that the statistics have been calculated for each predictor and class value, the overall prediction can be made. First, the values for both predictors are multiplied together to form the likelihood statistic for each class. Table 12.1 shows the details for the computations. Both religion and the punctuation mark data were more consistent with the profile being STEM; the ratio of likelihood values for the classes is \(0.148 \div 0.057 = 2.59\), indicating that the profile is more likely to correspond to a STEM career (based on the training data). However, Bayes’ Rule also includes a term for the prior information, which is the overall rate at which we would find STEM profiles in the population of interest⁸⁸. In these data, the STEM rate was about 18.5%, which is most likely higher than in other parts of the country. To get the final prediction, the prior and likelihood are multiplied together and their values are normalized to sum to one, giving the posterior probabilities (which are just the class probability predictions). In the end, the probability that this particular profile is associated with a STEM job is only 37%; our prior information about the outcome was extreme enough to contradict what the observed data were indicating.

Table 12.1: Values used in the naive Bayes model computations.
Class	Religion (predictor value)	Punctuation (predictor value)	Likelihood	Prior	Posterior
STEM	0.213	0.697	0.148	0.185	0.37
other	0.103	0.558	0.057	0.815	0.63
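The arithmetic behind Table 12.1 can be checked with a short script. The per-predictor values and priors are copied from the table; the variable names are chosen for this sketch only.

```python
# Class-specific values copied from Table 12.1.
religion = {"stem": 0.213, "other": 0.103}
punctuation = {"stem": 0.697, "other": 0.558}
prior = {"stem": 0.185, "other": 0.815}

# Naive Bayes likelihood: product of the per-predictor values for each class.
likelihood = {k: religion[k] * punctuation[k] for k in prior}

# Multiply by the prior, then divide by the evidence (the sum over classes)
# so that the posterior probabilities sum to one.
unnormalized = {k: prior[k] * likelihood[k] for k in prior}
evidence = sum(unnormalized.values())
posterior = {k: v / evidence for k, v in unnormalized.items()}

for k in prior:
    print(k, round(likelihood[k], 3), round(posterior[k], 2))
```

Rounded to the precision shown in the table, this gives likelihoods of 0.148 and 0.057 and posterior probabilities of 0.37 and 0.63.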

Considering each predictor separately makes the computations much faster and these models can be easily deployed since the information needed to compute the likelihood can be stored in efficient look-up tables. However, the assumption of independent predictors is not plausible for many data sets. As an example, the number of punctuation marks has strong rank-correlations with the number of words (correlation: 0.847), the number of commas (0.694), and so on. Even so, in many circumstances the model can produce good predictions while disregarding correlation among predictors.

One other relevant aspect of this model is how qualitative predictors are handled. For example, religion was not decomposed into indicator variables for this analysis. Alternatively, this predictor could be decomposed into a complete set of 10 binary indicators, one for each religion response, and these could be used in the model instead. However, care must be taken that the new binary variables are still treated as qualitative; otherwise a continuous density function (similar to the analysis in Figure 12.1(b)) is used to represent a predictor with only two possible values. When the naive Bayes model used all of the feature sets described in Section 5.6, roughly 110 variables, the cross-validated area under the ROC curve was computed to be 0.798. When the qualitative variables were converted to separate indicators, the AUC value was 0.799. The difference in performance was negligible, although the model with the individual indicator variables took 1.8-fold longer to compute due to the number of predictors increasing from 110 to 298.

This model can have some performance benefit from reducing the number of predictors since less noise is injected into the probability calculations. There is an ancillary benefit to using fewer features: the independence assumption tends to produce highly pathological class probability distributions as the number of predictors approaches the number of training set points. In these cases, the distributions of the class probabilities become somewhat bimodal; very few predicted points fall in the middle of the range (say greater than 20% and less than 80%). These U-shaped distributions can be difficult to explain to consumers of the model since most incorrectly predicted samples have a high certainty of being true (despite the opposite being the case). Reducing the number of predictors can mitigate this issue.
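One mechanism that contributes to this behavior can be illustrated with a small simulation. Because the likelihood is a product over predictors, the log-odds of the posterior is a sum of per-predictor log-likelihood ratios, and that sum tends to grow in magnitude as predictors are added. The sketch below is a simplification under stated assumptions: each predictor's log-likelihood ratio is drawn independently from a normal distribution with mean 0 and standard deviation 0.5 (a hypothetical choice), and the function name `middle_fraction` is invented for this example. It does not model the training set size or the OkCupid predictors.

```python
import math
import random

def middle_fraction(n_predictors, n_samples=2000, seed=0):
    """Fraction of simulated posteriors that fall strictly between 0.2 and 0.8."""
    rng = random.Random(seed)
    prior_log_odds = math.log(0.185 / 0.815)  # STEM prior quoted in the text
    inside = 0
    for _ in range(n_samples):
        # Hypothetical assumption: each predictor contributes an independent
        # log-likelihood ratio drawn from Normal(mean=0, sd=0.5).
        log_lr = sum(rng.gauss(0.0, 0.5) for _ in range(n_predictors))
        posterior = 1.0 / (1.0 + math.exp(-(prior_log_odds + log_lr)))
        if 0.2 < posterior < 0.8:
            inside += 1
    return inside / n_samples

for p in (2, 20, 200):
    print(p, middle_fraction(p))
```

Under these assumptions, the share of predictions in the middle of the range shrinks as the number of predictors grows, which is consistent with the U-shaped distributions described above.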

This model will be used to demonstrate global search methods using the OkCupid data.

⁸⁸ Recall that the prior probability was also discussed in Section 3.2.2 when the prevalence was used in some calculations.