# Chapter 11 Simulate for Probabilistic thinking

*Probability*— of our textbook, which we also examine in the previous section.

**The reading below is required,**Whitlock and Schluter (2020) is not.

Motivating scenarios: We want to build an intuitive feel for probability that we can build from simulation.

**Learning goals: By the end of this chapter you should be able to**

- Develop simulations to solidify our probabilistic intuition.

## 11.1 Why Simulate?

A lot of statistics is wrapping our heads around probabilistic thinking. To me the key here is to get good intuition for probabilistic thinking. Mathematical formulae (like those introduced last chapter) are essential for rapid computation and help some people understand the concepts. But I personally understand things better when I see them, so I often simulate.

I simulate to:

- Make sure my math is right.

- Build intuition for probability problems.

- Solve probability questions that do not have a mathematical solution.

So, how do we simulate? Last Chapter we introduced some simulation that was built into the seeing theory app, but that app will not store our output or keep a record of our code.

More naturally, we can code probabilistic events. We can use a bunch of `R`

functions to simulate, depending on our goals. Today, we’ll the `sample()`

function, which we saw when we considering sampling and uncertainty in our estimates.

**Our goal is to use simulation to reinforce concepts in probability and to develop an intuition for probabilistic outcomes. This should solidify our mathematical understanding of the rules of probability developed in the previous chapter.**

## 11.2 Simulating exclusive events with the `sample()`

function in `R`

While I love the seeing theory app, it is both limiting and does not store our outoput or keep a record of our code. Rather we use the `sample()`

function in `R`

.

Let’s simulate the simple our coin-tossing example introduced in the previous chapter to get started. Remember that of sample space for a single coin flip is

- Coin lands heads side up

- Coin lands tails side up

To flip a single coin with a 40% chance of landing heads up, we type:

`sample(x = c("Heads","Tails"), size = 1, replace = TRUE, prob = c(.4,.6))`

`## [1] "Tails"`

To flip a sample of five coins, each with a 45% chance of landing tails up, we type:

`sample(x = c("Heads","Tails"), size = 5, replace = TRUE, prob = c(.55,.45))`

`## [1] "Heads" "Tails" "Heads" "Tails" "Heads"`

If we do not give `R`

a value for `prob`

(e.g. `x = c("Heads","Tails"), size = 5, replace = TRUE)`

) it assumes that each outcome is equally probable.

**Keeping track of and visualizing simulated proportions from a single sample in **`R`

`R`

Now let us move on from our coin flipping example to our ball dropping examples from seeing theory. Remember that potential outcomes are

- The ball can fall through the orange bin (A), with probability P(A) = 3/6.

- The ball can fall through the green bin (B), with probability P(B) = 1/6.

- The ball can fall through the blue bin (C) with probability P(C) = 2/6.

We can set these probabilities in `R`

by assigning each to a variable, as follows

```
<- 3/6 # aka 1/2
P_A <- 1/6 # aka 1/6
P_B <- 2/6 # aka 1/3 P_C
```

We can then write a signle line of R code, to select ten balls with replacement

`sample(c("A", "B", "C"), size = 10, replace = TRUE, prob = c(P_A, P_B, P_C))`

`## [1] "B" "C" "C" "C" "A" "B" "B" "C" "C" "C"`

The code above produces a vector, but we most often want data in tibbles. Let’s do this, but for funzies, lets make a sample of five hundred balls, with probabilities equal to those in Figure above. To do so, we simply make a tibble and assign our vector to a column

```
<- 500
n_balls <- tibble(balls = sample(x = c("A","B","C"),
ball_dist_exclusive size = n_balls,
replace = TRUE,
prob = c(P_A, P_B, P_C)))
```

**Let’s have a look at the result**

**Let’s summarize the results**

There are a few tidyverse tricks we can use to find proportions.

- We can do this for all outcomes at once by

- First
`group_by`

the outcome… e.g.`group_by(balls, .drop = FALSE)`

, where`.drop=FALSE`

tells R we want to know if there are zero of some outcome. And

- Use the
`n()`

function inside of`summarise()`

.

```
%>%
ball_dist_exclusive group_by(balls, .drop = FALSE) %>%
::summarise(count = n(), proportion = count / n_balls) dplyr
```

```
## # A tibble: 3 × 3
## balls count proportion
## <chr> <int> <dbl>
## 1 A 251 0.502
## 2 B 70 0.14
## 3 C 179 0.358
```

Or we can count these ourselves by.

1.`sum`

ing the number of times balls fall through A, B, and C inside of the `summarise()`

function, without grouping by anything.

```
%>%
ball_dist_exclusive ::summarise(n_A = sum(balls == "A") ,
dplyrn_B = sum(balls == "B"),
n_C = sum(balls == "C"))
```

```
## # A tibble: 1 × 3
## n_A n_B n_C
## <int> <int> <int>
## 1 251 70 179
```

**Let’s make a nice plot**

```
<- c(A = "#EDA158", B = "#62CAA7", C = "#98C5EB")
ball_colors ggplot(ball_dist_exclusive, aes(x = balls, fill = balls)) +
geom_bar(show.legend = FALSE)+
scale_fill_manual(values = ball_colors)+
labs(title = "One simulation (n = 500)")
```

**Simulations to find the proportions of OR and NOT**

Say we wanted to count the number of balls that fell through A, B.

Remember that the proportion of a sample that is \(a\) OR \(b\), (aka \(Pr_(a \text{ or } b)\) is the proportion that are \(a\) (\(Pr_a\)) plus the proportion that are \(b\) (\(Pr_b\)) minus the probability of \(A\) and \(B\), \(Pr(AB)\) – which equals zero when events are mutually exclusive. In `R`

we find this as

```
%>%
ball_dist_exclusive summarise(Pr_A = mean(balls == "A"),
Pr_B = mean(balls == "B"),
Pr_AorB = Pr_A + Pr_B)
```

```
## # A tibble: 1 × 3
## Pr_A Pr_B Pr_AorB
## <dbl> <dbl> <dbl>
## 1 0.502 0.14 0.642
```

Alternatively, we could simply use the OR `|`

operator

```
%>%
ball_dist_exclusive summarise(Pr_AorB = mean(balls == "A" | balls == "B"))
```

```
## # A tibble: 1 × 1
## Pr_AorB
## <dbl>
## 1 0.642
```

Because A, B and C make up all of sample space, we could find the proportion A or B as one minus the proportion of C.

```
%>%
ball_dist_exclusive ::summarise(p_AorB = 1- mean(balls == "C") ) dplyr
```

```
## # A tibble: 1 × 1
## p_AorB
## <dbl>
## 1 0.642
```

Note that all these ways of finding \(Pr(A\text{ or }B)\) are all equivalent, and are close to our expected value of \(1/2+1/6 = 0.667\)

## 11.3 **Simulating proportions for many samples in **`R`

`R`

We can also use the `sample()`

function to simulate a **sampling distribution.** So without any math tricks.

There are a bunch of ways to do this but my favorite recipe is:

- Make a sample of size
`sample_size`

\(\times\)`n_reps`

.

- Assign the first
`sample_size`

outcomes to the first replicate, the second`sample_size`

outcomes to the second replicate etc…

- Summarize the output.

Here I show these steps

**1. Make a sample of size sample_size \(\times\) n_reps.**

```
<- 1000
n_reps <- tibble(balls = sample(x = c("A","B","C"),
ball_sampling_dist_exclusive size = n_balls * n_reps, # sample 500 balls 1000 times
replace = TRUE,
prob = c(P_A, P_B, P_C)))
```

**2. Assign the first sample_size outcomes to the first replicate, the second sample_size outcomes to the second replicate etc…**

```
<- ball_sampling_dist_exclusive %>%
ball_sampling_dist_exclusive mutate(replicate = rep(1:n_reps, each = n_balls))
```

**Let’s have a look at the result** looking at only the first 2 replicates:

**3. Summarize the output.**

```
<- ball_sampling_dist_exclusive %>%
ball_sampling_dist_exclusive group_by(replicate, balls, .drop=FALSE) %>% # make sure to keep zeros
::summarise(count = n(),.groups = "drop") # count number of balls of each color each replicate dplyr
```

**Let’s have a look at the result**

**MAKE A PLOT**

```
ggplot(ball_sampling_dist_exclusive, aes(x = count, fill = balls)) +
geom_density(alpha = .8, show.legend = FALSE, adjust =1.5)+
scale_fill_manual(values = ball_colors)+
geom_vline(xintercept = n_balls * c(P_A, P_B, P_C), lty = 2, color = ball_colors)+
annotate(x = n_balls * c(P_A, P_B, P_C), y = c(.02,.02,.02),
geom = "text",label = c("A","B","C"), size = 6 )+
labs(title = "Many simulations")
```

**SUMMARIZE UNCERTAINTY**

We can estimate the variability in a random estimate from this population with a sample of size 500 (the standard error) as the standard deviation of the density plots, in Fig. 11.1.

We do so as follows:

```
%>%
ball_sampling_dist_exclusive group_by(balls) %>%
::summarise(se = sd(count)) dplyr
```

```
## # A tibble: 3 × 2
## balls se
## <chr> <dbl>
## 1 A 10.7
## 2 B 8.31
## 3 C 9.96
```

## 11.4 Simulating Non-Exclusive events

We just examined a case in which a ball could not fall through both A and B, so all options where mutually exclusive.

Recall that this need not be true. Let us revisit a case when falling through A and B are not exclusive.

### 11.4.1 Simulating independent events

Recal that events are **independent** if the the occurrence of one event provides no information about the other. Let us consider the case of in which \(A\) and \(B\) are independent (below).

We can simulate independent events as two different columns in a tibble, as follows.

```
<- 1/3
p_A <- 1/3
p_B <- tibble(A = sample(c("A","not A"),
nonexclusive1 size = n_balls,
replace = TRUE,
prob = c(p_A, 1 - p_A)),
B = sample(c("B","not B"),
size = n_balls,
replace = TRUE,
prob = c(p_B, 1 - p_B)))
```

**Summary counts**

```
%>%
nonexclusive1 group_by(A,B)%>%
::summarise(count = n(), .groups = "drop") %>%# lose all groupings after sumarising
dplyrmutate(expected_count = 500 * c(1/3 * 1/3, 2/3 * 1/3, 1/3 * 2/3, 2/3 * 2/3),
observed_prop = count / sum(count),
expected_prop = expected_count /500
)
```

```
## # A tibble: 4 × 6
## A B count expected_count observed_prop
## <chr> <chr> <int> <dbl> <dbl>
## 1 A B 52 55.6 0.104
## 2 A not B 117 111. 0.234
## 3 not A B 104 111. 0.208
## 4 not A not B 227 222. 0.454
## # … with 1 more variable: expected_prop <dbl>
```

**A plot**

```
ggplot(nonexclusive1, aes(x =A, fill = B))+
geom_bar(position = "dodge")+
scale_fill_manual(values = c("black","grey"))+
theme_light()
```

#### The general addition rule

Remember that the proportion of a sample that is \(a\) OR \(b\), (aka \(Pr_(a \text{ or } b)\) is the proportion that are \(a\) (\(Pr_a\)) plus the proportion that are \(b\) (\(Pr_b\)) minus the probability of \(A\) and \(B\), \(Pr(AB)\) – which does not equal zero for non-exclusive events. Let’s make sure this works with some light `R`

coding!

```
%>%
nonexclusive1 summarise(Pr_A = mean(A == "A"),
Pr_B = mean(B == "B"),
Pr_AorB = mean(A == "A" | B == "B"),
# lets verify the general addition principle by
# summing and subtracting double counts
Pr_AandB = mean(A == "A" & B == "B"),
Pr_AorBmath = Pr_A + Pr_B - Pr_AandB
)
```

```
## # A tibble: 1 × 5
## Pr_A Pr_B Pr_AorB Pr_AandB Pr_AorBmath
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.338 0.312 0.546 0.104 0.546
```

### 11.4.2 Simulating Non-independece

Recall that events are **non-independent** if their conditional probabilities differ from their unconditional probabilities. Let us revisit our example of non-independence from the previous chapter:

The probability of A, P(A) = \(\frac{1}{3}\).

The probability of A conditional on B, P(A|B) = \(\frac{1}{4}\).

The probability of B, P(B) = \(\frac{2}{3}\).

The probability of B conditional on A, P(B|A) = \(\frac{1}{2}\).

So, how to simulate conditional probabilities? The easiest way to do this is to:

- Simulate one variable first.

- Simulate another variable next, with appropriate conditional probabilities.

So let’s do this, simulating A first

```
<- 1/3
p_A <- tibble(A = sample(c("A","not A"), size = n_balls,
nonindep_sim replace = TRUE, prob = c(p_A, 1 - p_A)))
```

Now we simulate \(B\). But first we need to know

- P(B|A) which we know to be \(\frac{1}{2}\), and

- P(B|
**not A**), which we don’t know. We saw how we could calculate that value in the previous chapter. Recall that p(B|**not A**) = 3/4.

We also use one new`R`

trick – the`case_when()`

function. The fomr of this is

`case_when(<thing meets some criteria> ~ <output something>,`

`<thing meets different criteria> ~ <output somthing else>)`

We can have as many cases as we want there! This is a lot like the `ifelse()`

function that you may have seen elsewhere, but is easier and safer to use.

```
<- 1/2
p_B_given_A <- 3/4
p_B_given_notA <- nonindep_sim %>%
nonindep_sim group_by(A) %>%
mutate(B = case_when(A == "A" ~ sample(c("B","not B"), size = n(),
replace = TRUE, prob = c(p_B_given_A , 1 - p_B_given_A )),
!= "A" ~ sample(c("B","not B"), size = n(),
A replace = TRUE, prob = c(p_B_given_notA , 1 - p_B_given_notA))
%>%
)) ungroup()
```

Let’s see what we got!!

#### Making sure simulations worked

It is always worth doing some checks to make sure our simulation isn’t too off base – let’s make sure that

about 2/3 of all balls went through B (taking the mean of a logical to find the proportion).

```
%>%
nonindep_sim summarise(total_prop_B = mean(B=="B"))
```

```
## # A tibble: 1 × 1
## total_prop_B
## <dbl>
## 1 0.688
```

about 1/2 of As went through B and 3/4 of **not A** went through B.

```
%>%
nonindep_sim group_by(A) %>%
summarise(conditional_prop_B = mean(B=="B"))
```

```
## # A tibble: 2 × 2
## A conditional_prop_B
## <chr> <dbl>
## 1 A 0.539
## 2 not A 0.761
```

When doing these checks, remember that sampling error is smallest with large sample sizes, so trying something out with a large sample size will help you better see signal.

#### 11.4.2.1 Recovering Bayes’ theorem from our simulation results

In the previous chapter we introduced Bayes’ theorem \(P(A|B) = \frac{P(B|A) P(A)}{P(B)}\) allows us to flip conditional probabilities. I think its easier to understand this as a simple proportion. Consider

\(Pr(A|B)\) as the proportion of outcomes A out of all cases when we see B,

\(Pr(B|A) \times Pr(A)\) as the proportion of times we observe B and A.

\(Pr(B)\) as the proportion of times we observe B.

That is, the proportion of times we get A when we have B is simply the number of times we see A and B divided by the number of times we see A.

So we can find the proportion of balls that

- Fell through A and B,

- Fell through neither A nor B,

- Fell through A not B,

- Fell through B not A,

```
%>%
nonindep_sim group_by(A, B, .drop = FALSE) %>%
summarise(prop = n() / n_balls)
```

```
## `summarise()` has grouped output by 'A'. You can
## override using the `.groups` argument.
```

```
## # A tibble: 4 × 3
## # Groups: A [2]
## A B prop
## <chr> <chr> <dbl>
## 1 A B 0.178
## 2 A not B 0.152
## 3 not A B 0.51
## 4 not A not B 0.16
```

We can also find, for example, the proportion of balls that

- Fell through A conditional on
**not**falling through B.

```
%>%
nonindep_sim filter(B == "not B") %>%
summarise(prop_A_given_notB = mean(A == "A"))
```

```
## # A tibble: 1 × 1
## prop_A_given_notB
## <dbl>
## 1 0.487
```

This value of approximately 1/2 lines up with a visual estimate that about half of the space not taken up by B in Figure 10.5 is covered by A.

Tips for conditional proportions.

To learn the proportion of events with outcome *a* given outcome *b*, we divide the proportion of outcomes with both *a* and *b* by the proportion of outcomes all outcomes with a and b by the proportion of outcomes with outcome *b*.
\[prop(a|b) = \frac{prop(a\text{ and }b)}{prop(b)}\]

## 11.5 A biologically inspired simulation

Let us revisit our flu vaccine example to think more biologically about Bayes’ theorem. First, we lay out some rough probabilities:

Every year, a little more than half of the population gets a flu vaccine, let us say P(No Vaccine) = 0.58.

About 30% of people who do not get the vaccine get the flu, P(Flu|No Vaccine) = 0.30.

The probability of getting the flu diminishes by about one-fourth among the vaccinated, P(Flu|Vaccine) = 0.075.

So you are way less likely to get the flu if you get the flu vaccine, but what is the probability that a random person who was vaccinated P(Vaccine|Flu)? We churned through the math last chapter - let’s see if simulation helps us think. To do so we simulate from these conditional probabilities, count, and divide the number of people that had the flu and the vaccine by the total number of people who had the flu.

```
<- 0.42
p_Vaccine <- 0.075
p_Flu_given_Vaccine <- 0.300
p_Flu_given_NoVaccine <- 10000000
n_inds
<- tibble(vaccine = sample(c("Vaccinated","Not Vaccinated"),
flu_sim prob =c(p_Vaccine, 1- p_Vaccine),
size = n_inds,
replace = TRUE )) %>%
group_by(vaccine) %>%
mutate(flu = case_when(vaccine == "Vaccinated" ~ sample(c("Flu","No Flu"),
size = n(),
replace = TRUE,
prob = c(p_Flu_given_Vaccine, 1 - p_Flu_given_Vaccine)),
!= "Vaccinated" ~ sample(c("Flu","No Flu"),
vaccine size = n(),
replace = TRUE,
prob = c(p_Flu_given_NoVaccine, 1 - p_Flu_given_NoVaccine))
%>%
)) ungroup()
```

**Let’s browse the first 1000 values**

**Let’s find all the proportion of each combo**

```
<- flu_sim %>%
flu_sum group_by(vaccine, flu, .drop = FALSE) %>%
summarise(sim_prop = n() / n_inds, .groups = "drop")
```

**Compare to predictions**

Recal that we fund predicitons from math:

\[\begin{equation} \begin{split} P(\text{Vaccine|Flu}) &= P(\text{Vaccine and Flu}) / P(\text{Flu}) \\ P(\text{Vaccine|Flu}) &= \tfrac{P(\text{Flu|Vaccine}) \times P(\text{Vaccine})}{P(\text{Vaccine}) \times P(\text{Flu}|\text{Vaccine}) + P(\text{No Vaccine}) \times P(\text{Flu}|\text{No Vaccine})}\\ P(\text{Vaccine|Flu}) &= 0.0315 / 0.2055 \\ P(\text{Vaccine|Flu}) &=0.1533 \end{split} \end{equation}\]

So, while the vaccinated make up 42% of the population, they only make up 15% of people who got the flu.

```
<- tibble(vaccine = c("Not Vaccinated", "Not Vaccinated", "Vaccinated", "Vaccinated"),
precitions flu = c("Flu", "No Flu", "Flu", "No Flu"),
math_prob = c((1-p_Vaccine) * (p_Flu_given_NoVaccine) ,
1-p_Vaccine) * (1- p_Flu_given_NoVaccine),
(* (p_Flu_given_Vaccine),
(p_Vaccine) * (1-p_Flu_given_Vaccine)))
(p_Vaccine)
full_join(flu_sum, precitions)
```

```
## # A tibble: 4 × 4
## vaccine flu sim_prop math_prob
## <chr> <chr> <dbl> <dbl>
## 1 Not Vaccinated Flu 0.174 0.174
## 2 Not Vaccinated No Flu 0.406 0.406
## 3 Vaccinated Flu 0.0315 0.0315
## 4 Vaccinated No Flu 0.388 0.388
```

**Now let’s check our flipping of conditional probabilities (aka Bayes’ theorem)**

```
%>%
flu_sum filter(flu == "Flu") %>% # only looking at people with flu
mutate(prop_given_flu = sim_prop / sum(sim_prop))
```

```
## # A tibble: 2 × 4
## vaccine flu sim_prop prop_given_flu
## <chr> <chr> <dbl> <dbl>
## 1 Not Vaccinated Flu 0.174 0.847
## 2 Vaccinated Flu 0.0315 0.153
```

This is basically what we got by math (recall \(P(\text{Vaccine|Flu}) =0.1533\) from the previous chapter).

## 11.6 How to do probability – Math or simulation?

In the previous chapter we saw that we could work through a bunch of proababilistic thinking mathematically. In fact – everything we cover here could be discovered by math. This holds not only for these chapters, but for pretty much all of stats.

But for me, simulation provides us with a good intuition, every problem we cover this term could be exactly solved by simulation (while math requires numerous assumptions), and some situations can uniquely solve problems that do not have nice mathematical answers. I often wonder – “Why do we even use math?”

The major advantages of math over simulation are:

**Math gives the same answers every time.**Unlike random simulation, math is not influenced by chance, so (for example) a sampling error on our computer mean that the sampling distribution from simulation will differ slightly each time, while math based sampling distributions do not change by chance.**Math is fast.**Simulations can take a bunch of computer time and energy (it has to do the same thing a bunch of times). With math, the computer quickly run some numbers through a calculator. The computational burdens we’ll face this term are pretty minimal, but with more complex stats this can actually matter.**Math provides insight.**Simulations are great for helping us build feelings, intuition and even probabilities. But clear mathematical equations allow us to precisely see the relationship between components of our model.

### References

*Perspectives in Biology and Medicine*59 (2): 147–55.

*The Analysis of Biological Data*. Third Edition.