11  Some Probability Rules

11.1 Multiplication rule

  • Recall the multiplication rule: \[\begin{align*} \text{P}(A \cap B) & = \text{P}(A|B)\text{P}(B)\\ & = \text{P}(B|A)\text{P}(A) \end{align*}\]
  • The multiplication rule says that you should think “multiply” when you see “and”. However, be careful about what you are multiplying: to find a joint probability you need a marginal (i.e., unconditional) probability and an appropriate conditional probability.
  • The multiplication rule is useful in situations where conditional probabilities are easier to obtain directly than joint probabilities.
  • The multiplication rule extends naturally to more than two events (though the notation gets messy). For three events, we have \[ \text{P}(A_1 \cap A_2 \cap A_3) = \text{P}(A_1)\text{P}(A_2|A_1)\text{P}(A_3|A_1\cap A_2) \]
  • And in general, \[ \text{P}(A_1\cap A_2 \cap A_3 \cap A_4 \cap \cdots) = \text{P}(A_1)\text{P}(A_2|A_1)\text{P}(A_3|A_1\cap A_2)\text{P}(A_4|A_1\cap A_2 \cap A_3)\cdots \]
  • The multiplication rule is useful for computing probabilities of events that can be broken down into component “stages” where conditional probabilities at each stage are readily available. At each stage, condition on the information about all previous stages.

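To make the stage-by-stage conditioning concrete, here is a minimal Python sketch (a hypothetical card-dealing illustration, not one of this section’s examples): it applies the multiplication rule to compute the probability that the first three cards dealt from a well-shuffled 52-card deck are all hearts.

```python
# Multiplication rule, stage by stage (hypothetical card-dealing example):
# P(first three cards are all hearts), conditioning on all previous stages.

p1 = 13 / 52   # P(card 1 is a heart)
p2 = 12 / 51   # P(card 2 is a heart | card 1 is a heart)
p3 = 11 / 50   # P(card 3 is a heart | cards 1 and 2 are hearts)

print(p1 * p2 * p3)   # about 0.0129
```
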
Example 11.1 The birthday problem concerns the probability that at least two people in a group of \(n\) people have the same birthday. Ignore multiple births and February 29 and assume that the other 365 days are all equally likely.¹

  1. If \(n=30\), which range do you think contains the probability that at least two people share a birthday: 0-20%, 20-40%, 40-60%, 60-80%, or 80-100%? How large do you think \(n\) needs to be in order for the probability that at least two people share a birthday to be larger than 0.5? Just make guesses before proceeding to calculations.




  2. Now consider \(n=3\) people, labeled 1, 2, and 3. What is the probability that persons 1 and 2 have different birthdays?




  3. What is the probability that persons 1, 2, and 3 all have different birthdays given that persons 1 and 2 have different birthdays?




  4. What is the probability that persons 1, 2, and 3 all have different birthdays?




  5. When \(n = 3\), what is the probability that at least two people share a birthday?




  6. For \(n=30\), find the probability that at least two people have the same birthday. (See the computational sketch following these questions.)




  7. Write a clearly worded sentence interpreting the probability in the previous part as a long run relative frequency.




  8. When \(n=100\) the probability is about 0.9999997. If you are in a group of 100 people and no one shares your birthday, should you be surprised?




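A minimal computational sketch, assuming the uniform-birthdays model above, to check parts 5, 6, and 8 (and the guesses in part 1) by applying the multiplication rule person by person:

```python
# Birthday problem: P(at least two of n people share a birthday).
# Multiply conditional probabilities person by person, then complement.

def p_match(n):
    p_all_distinct = 1.0
    for k in range(n):
        # P(person k+1 differs from the previous k people | all distinct so far)
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct

print(p_match(3))     # about 0.0082    (part 5)
print(p_match(30))    # about 0.7063    (part 6)
print(p_match(100))   # about 0.9999997 (part 8)

# Smallest n for which the probability exceeds 0.5 (compare with part 1)
n = 1
while p_match(n) <= 0.5:
    n += 1
print(n)              # 23
```
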
11.2 Law of total probability

Example 11.2 Each question on a multiple choice test has four options. You know with certainty the correct answers to 70% of the questions. For 20% of the questions, you can eliminate two of the incorrect choices with certainty, but you guess at random among the remaining two options. For the remaining 10% of questions, you have no idea and guess one of the four options at random.

Randomly select a question from this test. What is the probability that you answer the question correctly?

  1. Construct an appropriate two-way table and use it to find the probability of interest.




  2. For any given question on the exam, your probability of answering it correctly is either 1, 0.5, or 0.25, depending on whether you know it, can eliminate two choices, or are just guessing. How does your probability of correctly answering a randomly selected question relate to these three values? Which of the three values (1, 0.5, or 0.25) is the overall probability closest to, and why?




  • Law of total probability. If \(C_1,\ldots, C_k\) are disjoint with \(C_1\cup \cdots \cup C_k=\Omega\), then \[\begin{align*} \text{P}(A) & = \sum_{i=1}^k \text{P}(A \cap C_i)\\ & = \sum_{i=1}^k \text{P}(A|C_i) \text{P}(C_i) \end{align*}\]
  • The events \(C_1, \ldots, C_k\), which represent the “cases”, form a partition of the sample space; each outcome \(\omega\in\Omega\) lies in exactly one of the \(C_i\).
  • The law of total probability says that we can interpret the unconditional probability \(\text{P}(A)\) as a probability-weighted average of the case-by-case conditional probabilities \(\text{P}(A|C_i)\) where the weights \(\text{P}(C_i)\) represent the probability of encountering each case.

Example 11.3 Imagine a light that flashes every few seconds.² The light randomly flashes green with probability 0.75 and red with probability 0.25, independently from flash to flash.

  1. Write down a sequence of G’s (for green) and R’s (for red) to predict the colors for the next 40 flashes of this light. Before you read on, please take a minute to think about how you would generate such a sequence yourself.




  2. Most people produce a sequence that has 30 G’s and 10 R’s, or close to those proportions, because they are trying to generate a sequence for which each outcome has a 75% chance for G and a 25% chance for R. That is, they use a strategy in which they predict G with probability 0.75, and R with probability 0.25. How well does this strategy do? Compute the probability of correctly predicting any single item in the sequence using this strategy.




  3. Describe a better strategy. (Hint: can you find a strategy for which the probability of correctly predicting any single flash is 0.75? A simulation sketch follows these questions.)




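A minimal Python simulation sketch comparing the two strategies in parts 2 and 3 (the seed and number of flashes are arbitrary choices):

```python
import random

# Example 11.3: compare two strategies for predicting each flash.
random.seed(42)
n_flashes = 100_000
flashes = ["G" if random.random() < 0.75 else "R" for _ in range(n_flashes)]

# Strategy 1: "probability matching" -- predict G with probability 0.75
match_preds = ["G" if random.random() < 0.75 else "R" for _ in range(n_flashes)]
p1 = sum(p == f for p, f in zip(match_preds, flashes)) / n_flashes

# Strategy 2: always predict the more likely color, G
p2 = sum(f == "G" for f in flashes) / n_flashes

print(p1)   # about 0.625 = (0.75)(0.75) + (0.25)(0.25)
print(p2)   # about 0.75
```
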
11.3 Conditioning on the first step

  • Conditioning and using the law of total probability is an effective strategy for solving many problems, even when the problem doesn’t seem to involve conditioning.
  • For example, when a problem involves iterations or steps it is often useful to condition on the result of the first step.

Example 11.4 You and your friend are playing the “lookaway challenge”.

The game consists of possibly multiple rounds. In the first round, you point in one of four directions: up, down, left or right. At the exact same time, your friend also looks in one of those four directions. If your friend looks in the same direction you’re pointing, you win! Otherwise, you switch roles and the game continues to the next round — now your friend points in a direction and you try to look away. As long as no one wins, you keep switching off who points and who looks. The game ends, and the current “pointer” wins, whenever the “looker” looks in the same direction as the pointer.

Suppose that each player is equally likely to point/look in each of the four directions, independently from round to round. What is the probability that the player who starts as the pointer (you) wins the game?

  1. Why might you expect the probability to not be equal to 0.5?




  2. If you start as the pointer, what is the probability that you win in the first round?




  3. If \(p\) denotes the probability that the player who starts as the pointer wins the game, what is the probability that the player who starts as the looker wins the game? (Note: \(p\) is the probability that the person who starts as pointer wins the whole game, not necessarily in the first round.)




  4. Condition on the result of the first round and set up an equation to solve for \(p\). (A simulation sketch follows these questions.)




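A minimal Python simulation sketch of the game, useful for checking the answer produced by the equation in part 4 (the seed and number of games are arbitrary choices):

```python
import random

# Lookaway challenge: estimate the probability that the player who
# starts as the pointer wins the game.
random.seed(42)

def starting_pointer_wins():
    starter_is_pointer = True
    while True:
        point = random.randrange(4)    # pointer's direction
        look = random.randrange(4)     # looker's direction
        if point == look:
            return starter_is_pointer  # the current pointer wins
        starter_is_pointer = not starter_is_pointer   # switch roles

n_games = 100_000
p_hat = sum(starting_pointer_wins() for _ in range(n_games)) / n_games
print(p_hat)   # about 0.571; compare with the solution of your equation
```
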
11.4 Bayes’ rule

Example 11.5 Continuing Example 11.2. Each question on a multiple choice test has four options. You know with certainty the correct answers to 70% of the questions. For 20% of the questions, you can eliminate two of the incorrect choices with certainty, but you guess at random among the remaining two options. For the remaining 10% of questions, you have no idea and guess one of the four options at random.

Randomly select a question from this test, and suppose you answered the question correctly.

  1. Given that you answered the question correctly, what is the probability that you knew its answer with certainty?




  2. Given that you answered the question correctly, what is the probability that you had eliminated two of the options (and guessed among the remaining two)?




  3. Given that you answered the question correctly, what is the probability that you just guessed its answer from the four options?




  4. How do these probabilities compare to the “prior” probabilities?




  • Bayes’ rule describes how to update uncertainty in light of new information, evidence, or data.
  • Bayes’ rule for events specifies how a prior probability \(\text{P}(H)\) of event \(H\) is updated in response to the evidence \(E\) to obtain the posterior probability \(\text{P}(H|E)\). \[ \text{P}(H|E) = \frac{\text{P}(E|H)\text{P}(H)}{\text{P}(E)} \]
    • Event \(H\) represents a particular hypothesis (or model or case)
    • Event \(E\) represents observed evidence (or data or information)
    • \(\text{P}(H)\) is the unconditional or prior probability of \(H\) (prior to observing evidence \(E\))
    • \(\text{P}(H|E)\) is the conditional or posterior probability of \(H\) after observing evidence \(E\).
    • \(\text{P}(E|H)\) is the likelihood of evidence \(E\) given hypothesis (or model or case) \(H\)
  • Bayes’ rule is often used when there are multiple hypotheses or cases. Suppose \(H_1,\ldots, H_k\) is a collection of distinct hypotheses which together account for all possibilities, and \(E\) is any event (evidence).
  • Combining Bayes’ rule with the law of total probability, \[\begin{align*} \text{P}(H_j|E) & = \frac{\text{P}(E|H_j)\text{P}(H_j)}{\text{P}(E)} = \frac{\text{P}(E|H_j)\text{P}(H_j)}{\sum_{i=1}^k \text{P}(E|H_i) \text{P}(H_i)}\\ \text{P}(H_j|E) & \propto \text{P}(E|H_j)\text{P}(H_j) \end{align*}\]
  • The symbol \(\propto\) is read “is proportional to”. The relative ratios of the posterior probabilities of different hypotheses are determined by the product of the prior probabilities and the likelihoods, \(\text{P}(E|H_j)\text{P}(H_j)\). The marginal probability of the evidence, \(\text{P}(E)\), in the denominator simply normalizes the numerators to ensure that the updated (conditional on the evidence) probabilities sum to 1 over all the distinct hypotheses.
  • In short, Bayes’ rule says \[ \textbf{posterior} \propto \textbf{prior} \times \textbf{likelihood} \]
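
A minimal Python sketch of this recipe applied to Example 11.5 (the hypothesis labels are our own):

```python
# posterior ∝ prior × likelihood, for Example 11.5.
# Hypotheses: how you handled the question. Evidence: you answered correctly.

priors = {"know": 0.70, "eliminate two": 0.20, "guess": 0.10}       # P(H)
likelihoods = {"know": 1.00, "eliminate two": 0.50, "guess": 0.25}  # P(E|H)

products = {h: likelihoods[h] * priors[h] for h in priors}
p_evidence = sum(products.values())   # P(E), by the law of total probability

posteriors = {h: products[h] / p_evidence for h in products}
print(posteriors)   # know: 0.848..., eliminate two: 0.121..., guess: 0.030...
```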

Example 11.6 Suppose that you’ve written a computer program with three sections (call them A, B, and C). When you run the program, you find that it has one (and only one) bug. Your subjective probabilities are that there’s a 50% chance that the bug is in section A, compared to 30% for section B and 20% for section C. So, you decide to look in section A to try to find the bug. But you know that you don’t always find a bug even when you look in the correct section. Let’s say that if the bug is really in the section that you look in, there’s an 80% chance that you’ll find it and so a 20% chance that you’ll miss it. Suppose that you look in section A and you do not find the bug. How does this change your probabilities for where the bug is?

  1. Given that you do not find the bug when you look in section A, what is your guess for the posterior probability that the bug is really in section A? Section B? Section C?




  2. Translate this problem into the Bayes’ rule framework. What are the hypotheses? What are the prior probabilities? What is the evidence? What is the likelihood of the evidence under each hypothesis?




  3. Use Bayes’ rule to compute your posterior probabilities for where the bug is given that you look in section A and you do not find the bug.




  4. Now after searching section A and not finding the bug, you decide to search section B. Unfortunately, you don’t find the bug in section B either. Use Bayes’ rule to compute your posterior probabilities for where the bug is. (Hint: what should the prior probabilities be in this part?)




  5. Now after searching sections A and B and not finding the bug, you decide to search section C. Unfortunately, you don’t find the bug there either. Without doing any calculations, what do you think are your posterior probabilities?




  • Bayesian analysis is often an iterative process.
  • Posterior probabilities are updated after observing some information or data. These probabilities can then be used as prior probabilities before observing new data.
  • Posterior probabilities can be sequentially updated as new data becomes available, with the posterior probabilities after the previous stage serving as the prior probabilities for the next stage.
  • The final posterior probabilities depend only upon the cumulative data. It doesn’t matter whether we sequentially update the posterior after each new piece of data or only once after all the data are available; the final posterior probabilities will be the same either way. The final posterior probabilities are also unaffected by the order in which the data are observed.
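
A minimal Python sketch of this sequential updating applied to Example 11.6 (the helper function and labels are our own):

```python
# Sequential Bayesian updating for Example 11.6: after each unsuccessful
# search, the posterior becomes the prior for the next search.

def update(prior, likelihood):
    # posterior ∝ prior × likelihood, normalized to sum to 1
    products = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(products.values())
    return {h: products[h] / total for h in products}

probs = {"A": 0.5, "B": 0.3, "C": 0.2}   # priors for the bug's location

for section in ["A", "B", "C"]:
    # If the bug is in the searched section, P(not found) = 0.2; else 1.
    likelihood = {h: 0.2 if h == section else 1.0 for h in probs}
    probs = update(probs, likelihood)
    print(f"after searching {section} without success: {probs}")

# After all three unsuccessful searches the probabilities return to
# (0.5, 0.3, 0.2): each section has now been fruitlessly searched once,
# so the prior ratios are restored.
```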

11.5 Conditional versus unconditional probabilities

Example 11.7 Consider a group of 5 people: Harry, Bella, Frodo, Anakin, and Katniss. Suppose each of their names is written on a slip of paper and the 5 slips of paper are placed into a hat. The papers are mixed up and 2 are pulled out, one after the other without replacement.

  1. What is the probability that Harry is the first name selected?




  2. What is the probability that Harry is the second name selected? (A simulation sketch follows these questions.)




  3. If you were asked question (2) before question (1), would your answer change? Should it?




  4. If Bella is the first name selected, what is the probability that Harry is the second name selected?




  5. If Harry is the first name selected, what is the probability that Harry is the second name selected?




  6. How is the probability that Harry is the second name selected related to the probabilities in the two previous parts?




  7. If Bella is the second name selected, what is the probability that Harry was the first name selected?




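A minimal Python simulation sketch for checking part 2 (the seed and number of repetitions are arbitrary choices):

```python
import random

# Example 11.7: draw two names without replacement; estimate the
# unconditional probability that Harry is the second name selected.
random.seed(42)
names = ["Harry", "Bella", "Frodo", "Anakin", "Katniss"]

n_reps = 100_000
harry_second = sum(random.sample(names, 2)[1] == "Harry" for _ in range(n_reps))
print(harry_second / n_reps)   # about 1/5: absent information about the first
                               # draw, each name is equally likely to be second
```
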
  • Be careful to distinguish between conditional and unconditional probabilities.
  • A conditional probability reflects “new” information about the outcome of the random phenomenon. In the absence of such information, we must continue to account for all the possibilities. When computing probabilities, be sure to reflect only the information that is actually known.

  1. Which isn’t quite true. However, a non-uniform distribution of birthdays only increases the probability that at least two people have the same birthday. To see why, consider an extreme case, such as everyone being born in September.↩︎

  2. Thanks to Allan Rossman for this example.↩︎