12 Tutorial 3: Conditional Probability

12.0.1 The inspiration for this assignment

This tutorial and the lab that accompanies it were adapted from a STAT 13 lab given as part of a 2004 offering of that course at UCLA. The original lab is linked below, though the file could disappear at any time: the instructor on whose webspace it is hosted has not been at UCLA for a long time, and they are not listed on the Statistics department’s faculty page.

http://www.stat.ucla.edu/~cocteau/teaching/stat13/labs/lab3.pdf

12.0.2 Our data

The dataset we will be working with is about spam email. We will not be working with the messages themselves, but rather a table of information about characteristics of over 4,500 email messages collected by researchers at Hewlett-Packard (HP) Labs. Some of these messages were spam and some were not. The original data is available at the link below.

https://archive.ics.uci.edu/ml/datasets/spambase

Our dataset differs from the original dataset hosted by UCI in some significant ways because of the cleaning process that it underwent to prepare it for this lab assignment. That process and its aims will be detailed in a subsequent tutorial.

Let’s have a peek at our data.
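The listing below was presumably produced with R’s str() function; a minimal sketch, assuming our dataframe is named email2 (the name used later in this tutorial):

str(email2)  # one row per message: 48 logical word columns plus spam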

## 'data.frame':    4601 obs. of  49 variables:
##  $ make      : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ address   : logi  TRUE TRUE FALSE FALSE FALSE FALSE ...
##  $ all       : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
##  $ 3d        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ our       : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ over      : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ remove    : logi  FALSE TRUE TRUE TRUE TRUE FALSE ...
##  $ internet  : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ order     : logi  FALSE FALSE TRUE TRUE TRUE FALSE ...
##  $ mail      : logi  FALSE TRUE TRUE TRUE TRUE FALSE ...
##  $ receive   : logi  FALSE TRUE TRUE TRUE TRUE FALSE ...
##  $ will      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ people    : logi  FALSE TRUE TRUE TRUE TRUE FALSE ...
##  $ report    : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
##  $ addresses : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ free      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ business  : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ email     : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
##  $ you       : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ credit    : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ your      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ font      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ 000       : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ money     : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ hp        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ hpl       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ george    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ 650       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ lab       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ labs      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ telnet    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ 857       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ data      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ 415       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ 85        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ technology: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ 1999      : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
##  $ parts     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pm        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ direct    : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ cs        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ meeting   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ original  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ project   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ re        : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ edu       : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ table     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ conference: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ spam      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

We see that we have 4,601 observations in this dataset. Each observation represents one email message.

There are 49 variables in this dataset. The first 48 variables represent whether or not a word was present in a message. For example, if the value of will is TRUE for an observation, this means that the word “will” was present in that email message. If the value of business is FALSE for an observation, this means that the word “business” was not present in that message.

The 49th variable, spam, represents whether or not an observation is a spam message. A value of TRUE means it is a spam message and a value of FALSE means that it is not a spam message. How many spam messages and how many real messages are in this dataset?
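We can count them with table(), adding margin totals with addmargins() (a function discussed later in this tutorial); a sketch, where the dnn argument labels the table:

addmargins(table(email2$spam, dnn = "spam"))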

## spam
## FALSE  TRUE   Sum 
##  2788  1813  4601

Notice that the table above looks a little bit different from ones we’ve made in the past. By now you’ve probably developed strong enough analytical skills to understand why, so I will leave that up to you. This tutorial will contain more examples of tables like this.

12.0.3 Which words did each message contain?

Since each observation consists of 49 different logical variables, we need to come up with a creative way of browsing this data. One preliminary question worth asking is, which words in the dataset were present in the 25th message?

An easy way we could try to answer this is by using bracketing. To get the 25th row of our dataset, all we have to type into our console is email2[25, ]. This will return the 25th row with the values of every column. We’ll also add [1:48] on the end because, for now, we only care about which words were present in the message, not whether it was spam.
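Putting that together, the command looks like this:

email2[25, ][1:48]  # row 25, then just the 48 word columns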

##     make address   all    3d   our  over remove internet order  mail
## 25 FALSE   FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE FALSE FALSE
##    receive  will people report addresses  free business email   you credit
## 25   FALSE FALSE  FALSE  FALSE     FALSE FALSE    FALSE FALSE FALSE  FALSE
##     your  font   000 money    hp   hpl george   650   lab  labs telnet
## 25 FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE  FALSE
##      857  data   415    85 technology  1999 parts    pm direct    cs
## 25 FALSE FALSE FALSE FALSE      FALSE FALSE FALSE FALSE  FALSE FALSE
##    meeting original project   re   edu table conference
## 25   FALSE    FALSE   FALSE TRUE FALSE FALSE      FALSE

This output is not very easy to interpret. Sure, it prints out every word in the dataset along with whether or not that word was present in the 25th message. But it’s hard to tell at a glance which words were present and which weren’t, and it’s even hard to distinguish the variable names from their truth values, since both print as text. There has to be a better way! And indeed there is: through functional programming.

Below is a function called email_words() that I wrote to answer this question more effectively. If you’ve learned about control flow statements (if/else statements) in a Computer Science class before, the flow statement in this function may look redundant to you. This is due to a certain idiosyncrasy of the R language that’s well beyond the scope of this discussion. But if you ever find yourself trying to use flow statements together with for loops in R, this is something you’ll need to read up on to get them to work together.
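The function runs in the background and isn’t shown in the rendered tutorial, but a minimal sketch of what email_words() might look like follows. The loop-plus-if structure and the explicit comparison are my assumptions, not the author’s exact code:

email_words <- function(row) {
  present <- character(0)
  # loop over the 48 word columns, keeping the names of the TRUE ones
  for (word in names(email2)[1:48]) {
    if (email2[row, word] == TRUE) {  # the seemingly redundant comparison
      present <- c(present, word)
    }
  }
  present
}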

Anyway, let’s try using this function and see what it does.
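Assuming the sketch above, the call would be:

email_words(25)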

## [1] "re"

It looks like the only word that was present in the 25th observation was "re". This is likely the automatically generated “Re:” heading that email clients use to indicate that a message is a reply to an earlier message.

Let’s try using this function one more time for the 250th message and see if we can get a more interesting result.
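Again, under the same assumption:

email_words(250)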

##  [1] "all"     "our"     "remove"  "order"   "mail"    "receive" "free"   
##  [8] "email"   "you"     "hp"      "1999"    "direct"

This observation contained many more words than the 25th observation.

12.0.4 Filtering and re-ordering data with dplyr

It’s nice that we’ve figured out a quick way of checking to see which words were present in each of the 4,601 messages, but it would be more interesting to see which words are most popular in real messages and which are most popular in spam messages. We can do that by filtering and then re-ordering our data.

First, we’re going to create two new dataframes. One will contain only spam messages and the other will contain only real messages.
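The code for this step isn’t shown in the rendered tutorial, but a sketch using dplyr’s filter() function, consistent with this section’s heading, would be:

library(dplyr)

spam_messages <- filter(email2, spam == TRUE)   # keep only the spam
real_messages <- filter(email2, spam == FALSE)  # keep only the real messages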

This is only the first step. By now we ought to know without looking that spam_messages and real_messages look just like email2, because all we did above was split email2 into two objects based on whether the value of spam was TRUE or FALSE for each observation. Both of these smaller datasets have all 49 variables, but only a fraction of the original observations.

Since we want to rank the counts of each of the words in spam_messages and real_messages, we need to generate a summary dataframe which contains this information. We will do that below using a familiar process: first we generate separate summaries for the spam messages and the real messages, then we paste these summaries together side by side using cbind(). Finally, we’ll have a look at our work so far using head().
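A hedged sketch of that process (colSums() on logical columns counts the TRUE values; the object names ending in _summary and word_counts are my own inventions):

spam_summary <- data.frame(word = names(spam_messages)[1:48],
                           count = colSums(spam_messages[1:48]),
                           row.names = NULL)
real_summary <- data.frame(word = names(real_messages)[1:48],
                           count = colSums(real_messages[1:48]),
                           row.names = NULL)

word_counts <- cbind(spam = spam_summary, real = real_summary)
head(word_counts)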

##   spam.word spam.count real.word real.count
## 1      make        641      make        412
## 2   address        625   address        273
## 3       all       1115       all        773
## 4        3d         39        3d          8
## 5       our       1134       our        614
## 6      over        681      over        318

We are making good progress. Now we have a table which shows the frequencies of each of the words in messages that are real and in messages that are spam. But we need to reorder this information by count to get what we’re really looking for.
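With dplyr’s arrange(), the reordering might look like this (same assumptions as above):

spam_ranked <- arrange(spam_summary, desc(count))  # biggest spam counts first
real_ranked <- arrange(real_summary, desc(count))  # biggest real counts first
head(cbind(spam = spam_ranked, real = real_ranked), 10)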

##    spam.word spam.count real.word real.count
## 1        you       1608       you       1619
## 2       your       1466      will       1182
## 3       will       1143        hp       1040
## 4        our       1134      your        957
## 5        all       1115        re        824
## 6       free        989       hpl        784
## 7       mail        827       all        773
## 8     remove        764    george        772
## 9   business        697      1999        728
## 10     email        688       our        614

It looks like there is some overlap between the 10 most popular words in real and spam messages. For example, the most popular word for both types of messages is "you"! But there are some differences. For instance, the word free is one of the top spam words and the word hp is one of the top real words.

Another question worth asking is which words have the biggest difference in frequency between spam and real messages. That is, if we subtract each word’s count in real messages from its count in spam messages, which words have the biggest differences?
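Continuing the sketch, one way to compute and rank those differences:

word_counts$spam_minus_real <- word_counts$spam.count - word_counts$real.count
head(arrange(word_counts, desc(spam_minus_real))[c("spam.word", "spam_minus_real")], 5)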

##   spam.word spam_minus_real
## 1      free             737
## 2    remove             721
## 3     money             627
## 4       000             525
## 5       our             520

The spam words with the biggest differences in frequency are shown above. It’s not hard to imagine why these would appear more frequently in spam messages, especially words like “free” and “money”.
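The other end of the same ranking:

head(arrange(word_counts, spam_minus_real)[c("real.word", "spam_minus_real")], 5)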

##   real.word spam_minus_real
## 1        hp            -990
## 2    george            -764
## 3       hpl            -757
## 4      1999            -627
## 5      labs            -433

The real words with the biggest differences in frequency are shown above. Their differences have a negative sign because these words appeared in far more real messages than spam messages. It’s a bit trickier to imagine why this would be so, but remember where our data came from: the messages were collected at HP Labs, so words like “hp”, “hpl”, and “labs” are exactly what we’d expect in that company’s legitimate mail.

12.0.5 Using proportion tables for conditional probability calculations

So far we’ve only looked at which words were most common in spam and real messages and compared their frequencies. This is not the same thing as calculating or comparing the probabilities of events related to these words.

Now that we’ve covered the concept of conditional probability in class, this would be an appropriate time to use it. But before we proceed, recall the definition of conditional probability.

Conditional probability: the probability that one event occurs given that some other event has happened. (Defined mathematically below.)

  • P(A|B) = P(A and B) / P(B)

It is also possible to calculate conditional probabilities using Bayes’ formula, but we will be using the above definition of conditional probability instead because it’s easier to work with on our programming tasks.

Recall that P(A and B) is the probability that events A and B occur simultaneously. The event that both occur is also known as the intersection of A and B.

Now we’re going to have a look at a table which can be used to represent marginal probabilities (defined in a moment) and the probabilities of two events.
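A sketch of how such a table can be built, using prop.table() together with the addmargins() and dnn details the author mentions below; the object name basic_table comes from the text that follows:

basic_table <- addmargins(prop.table(table(email2$credit, email2$spam,
                                           dnn = c("credit", "spam"))))
basic_table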

##        spam
## credit       FALSE       TRUE        Sum
##   FALSE 0.59574006 0.31210606 0.90784612
##   TRUE  0.01021517 0.08193871 0.09215388
##   Sum   0.60595523 0.39404477 1.00000000

This table shows two variables: spam and credit. There are four events in this table, one for each TRUE and FALSE value of the spam and credit variables. (By the way, when you make a table like this, you should use the addmargins() function and the dnn parameter. If you don’t, your table will be very difficult to read.)

The marginal probabilities of each event are given in the bottom row and the rightmost column. For example, the probability that spam is TRUE is 0.39404477 and the probability that credit is FALSE is 0.90784612.

The probabilities of intersection are given by cells which represent the simultaneous occurrence of two different events. For example, the probability that credit is TRUE and spam is TRUE at the same time is 0.08193871.

basic_table is a table object, so we can extract elements from it using bracketing, just like we sometimes do with dataframes. For instance, I can extract and print the intersectional probability I identified above by bracketing with the numbers of the row and column where that probability is located: the first index selects the row and the second selects the column. This is something that you’ll need to remember in order to complete your lab.
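For the probability identified above, that’s row 2 (credit is TRUE) and column 2 (spam is TRUE):

basic_table[2, 2]  # P(credit and spam)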

## [1] 0.08193871

Now let’s try to do something interesting with this table. Let’s answer this question: What is the probability that a message is spam given that it contains the word ‘credit’?

First, let’s make sure we understand the question in terms of probability. A big clue here is the use of the word ‘given’. This means that we’re trying to calculate this kind of probability: P(A|B). But which event is which? Recall that P(A|B) in plain English is “the probability of event A given event B”. This means that for the question we’re trying to answer, event A is spam == TRUE and event B is credit == TRUE. We need to use this formula: P(A|B) = P(A and B) / P(B).

We’re going to find the answer in a programmatic way.
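A sketch of the calculation: print the table for reference, then divide the intersection cell by the marginal probability that credit is TRUE (row 2, Sum column):

basic_table
basic_table[2, 2] / basic_table[2, 3]  # P(spam and credit) / P(credit)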

##        spam
## credit       FALSE       TRUE        Sum
##   FALSE 0.59574006 0.31210606 0.90784612
##   TRUE  0.01021517 0.08193871 0.09215388
##   Sum   0.60595523 0.39404477 1.00000000
## [1] 0.8891509

This is a pretty interesting finding: if a message in our dataset contains the word ‘credit’, there is an approximately 89% chance that this message is spam. This strongly suggests that blocking all messages that contain this word is a good idea.

Now what about the probability that a message is spam given that the message does not contain the word ‘credit’? You may be tempted to blurt out that it’s the complement of the result above, but you’d be incorrect. We need to use the conditional probability formula again in order to get the correct answer.
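This time the conditioning event is credit == FALSE, so we work with row 1 of the table:

basic_table
basic_table[1, 2] / basic_table[1, 3]  # P(spam and no credit) / P(no credit)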

##        spam
## credit       FALSE       TRUE        Sum
##   FALSE 0.59574006 0.31210606 0.90784612
##   TRUE  0.01021517 0.08193871 0.09215388
##   Sum   0.60595523 0.39404477 1.00000000
## [1] 0.3437874

This probability is more than triple the complement of the last result (1 - 0.889 ≈ 0.111), but the two findings are consistent: a message that contains the word ‘credit’ is far more likely to be spam than a message that doesn’t.

12.0.6 Using functional programming to rapidly calculate many conditional probabilities

Now we’re going to narrow our focus to one conditional probability for each word in our dataset: the probability that a message is spam given that it contains a certain word.

In the background I’ve written a function called p_spam_given_word() which can calculate this probability for any of the words in our dataset. Below we see a familiar result.
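The implementation isn’t shown, but a hypothetical sketch that would reproduce the result below looks like this:

p_spam_given_word <- function(word) {
  tbl <- addmargins(prop.table(table(email2[[word]], email2$spam)))
  tbl[2, 2] / tbl[2, 3]  # P(spam and word) / P(word)
}

p_spam_given_word("credit")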

## [1] 0.8891509

By itself this function isn’t very powerful, although it significantly simplifies the process of calculating individual conditional probabilities for each of the words in our dataset. This function becomes much more powerful when we use it together with other tools we’ve learned how to use in previous labs and tutorials.

Suppose we wanted to calculate, for each word in our dataset, the probability that a message is spam given that it contains that word. We could run p_spam_given_word() once for each of the 48 words, and that would give us each of the individual probabilities we need; it should only require one or two lines of code if we really know what we’re doing.
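One way to do that in a single line, assuming the sketch above, is with sapply() (the name all_probs is my own):

all_probs <- sapply(names(email2)[1:48], p_spam_given_word)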

Then, if we’re really clever, we can store this information in a way that will allow us to run a slightly deeper analysis on it. For instance, we could divide the words up by their probabilities into groups that are at high, medium, and low risk of being spam given that each one is present in a message.
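The summary below was presumably produced along these lines:

summary(all_probs)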

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.006757 0.090839 0.468764 0.426424 0.701489 0.946716

Since these probabilities range from about 0% to about 100%, a good way to group them would be to create one group for words with probabilities from 0%-33%, another for 33%-66%, and another for 66%-100%.
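A hedged sketch of that grouping using cut() (output not shown in the tutorial):

risk <- cut(all_probs, breaks = c(0, 1/3, 2/3, 1),
            labels = c("low", "medium", "high"))
table(risk)  # how many words land in each risk group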

And finally, if we’re really, really smart, we ought to be able to take this dataset and summarize it visually because that would make our work truly meaningful since we’re so great at creating slick visualizations now.
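[A plot summarizing these spam-risk groups appeared here in the original tutorial.]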

We can definitely reproduce the plot above! But it will take some research and experimentation…