Chapter 11 Simulate for Probabilistic thinking
Motivating scenarios: We want to build an intuitive feel for probability that we can build from simulation.
Learning goals: By the end of this chapter you should be able to
- Develop simulations to solidify our probabilistic intuition.
11.1 Why Simulate?
A lot of statistics is wrapping our heads around probabilistic thinking. To me the key here is to get good intuition for probabilistic thinking. Mathematical formulae (like those introduced last chapter) are essential for rapid computation and help some people understand the concepts. But I personally understand things better when I see them, so I often simulate.
I simulate to:
- Make sure my math is right.
- Build intuition for probability problems.
- Solve probability questions that do not have a mathematical solution.
So, how do we simulate? Last Chapter we introduced some simulation that was built into the seeing theory app, but that app will not store our output or keep a record of our code.
More naturally, we can code probabilistic events. We can use a bunch of R
functions to simulate, depending on our goals. Today, we’ll the sample()
function, which we saw when we considering sampling and uncertainty in our estimates.
Our goal is to use simulation to reinforce concepts in probability and to develop an intuition for probabilistic outcomes. This should solidify our mathematical understanding of the rules of probability developed in the previous chapter.
11.2 Simulating exclusive events with the sample()
function in R
While I love the seeing theory app, it is both limiting and does not store our outoput or keep a record of our code. Rather we use the sample()
function in R
.
Let’s simulate the simple our coin-tossing example introduced in the previous chapter to get started. Remember that of sample space for a single coin flip is
- Coin lands heads side up
- Coin lands tails side up
To flip a single coin with a 40% chance of landing heads up, we type:
sample(x = c("Heads","Tails"), size = 1, replace = TRUE, prob = c(.4,.6))
## [1] "Tails"
To flip a sample of five coins, each with a 45% chance of landing tails up, we type:
sample(x = c("Heads","Tails"), size = 5, replace = TRUE, prob = c(.55,.45))
## [1] "Heads" "Tails" "Heads" "Tails" "Heads"
If we do not give R
a value for prob
(e.g. x = c("Heads","Tails"), size = 5, replace = TRUE)
) it assumes that each outcome is equally probable.
Keeping track of and visualizing simulated proportions from a single sample in R
Now let us move on from our coin flipping example to our ball dropping examples from seeing theory. Remember that potential outcomes are
- The ball can fall through the orange bin (A), with probability P(A) = 3/6.
- The ball can fall through the green bin (B), with probability P(B) = 1/6.
- The ball can fall through the blue bin (C) with probability P(C) = 2/6.
We can set these probabilities in R
by assigning each to a variable, as follows
<- 3/6 # aka 1/2
P_A <- 1/6 # aka 1/6
P_B <- 2/6 # aka 1/3 P_C
We can then write a signle line of R code, to select ten balls with replacement
sample(c("A", "B", "C"), size = 10, replace = TRUE, prob = c(P_A, P_B, P_C))
## [1] "B" "C" "C" "C" "A" "B" "B" "C" "C" "C"
The code above produces a vector, but we most often want data in tibbles. Let’s do this, but for funzies, lets make a sample of five hundred balls, with probabilities equal to those in Figure above. To do so, we simply make a tibble and assign our vector to a column
<- 500
n_balls <- tibble(balls = sample(x = c("A","B","C"),
ball_dist_exclusive size = n_balls,
replace = TRUE,
prob = c(P_A, P_B, P_C)))
Let’s have a look at the result
Let’s summarize the results
There are a few tidyverse tricks we can use to find proportions.
- We can do this for all outcomes at once by
- First
group_by
the outcome… e.g.group_by(balls, .drop = FALSE)
, where.drop=FALSE
tells R we want to know if there are zero of some outcome. And
- Use the
n()
function inside ofsummarise()
.
%>%
ball_dist_exclusive group_by(balls, .drop = FALSE) %>%
::summarise(count = n(), proportion = count / n_balls) dplyr
## # A tibble: 3 × 3
## balls count proportion
## <chr> <int> <dbl>
## 1 A 251 0.502
## 2 B 70 0.14
## 3 C 179 0.358
Or we can count these ourselves by.
1.sum
ing the number of times balls fall through A, B, and C inside of the summarise()
function, without grouping by anything.
%>%
ball_dist_exclusive ::summarise(n_A = sum(balls == "A") ,
dplyrn_B = sum(balls == "B"),
n_C = sum(balls == "C"))
## # A tibble: 1 × 3
## n_A n_B n_C
## <int> <int> <int>
## 1 251 70 179
Let’s make a nice plot
<- c(A = "#EDA158", B = "#62CAA7", C = "#98C5EB")
ball_colors ggplot(ball_dist_exclusive, aes(x = balls, fill = balls)) +
geom_bar(show.legend = FALSE)+
scale_fill_manual(values = ball_colors)+
labs(title = "One simulation (n = 500)")
Simulations to find the proportions of OR and NOT
Say we wanted to count the number of balls that fell through A, B.
Remember that the proportion of a sample that is \(a\) OR \(b\), (aka \(Pr_(a \text{ or } b)\) is the proportion that are \(a\) (\(Pr_a\)) plus the proportion that are \(b\) (\(Pr_b\)) minus the probability of \(A\) and \(B\), \(Pr(AB)\) – which equals zero when events are mutually exclusive. In R
we find this as
%>%
ball_dist_exclusive summarise(Pr_A = mean(balls == "A"),
Pr_B = mean(balls == "B"),
Pr_AorB = Pr_A + Pr_B)
## # A tibble: 1 × 3
## Pr_A Pr_B Pr_AorB
## <dbl> <dbl> <dbl>
## 1 0.502 0.14 0.642
Alternatively, we could simply use the OR |
operator
%>%
ball_dist_exclusive summarise(Pr_AorB = mean(balls == "A" | balls == "B"))
## # A tibble: 1 × 1
## Pr_AorB
## <dbl>
## 1 0.642
Because A, B and C make up all of sample space, we could find the proportion A or B as one minus the proportion of C.
%>%
ball_dist_exclusive ::summarise(p_AorB = 1- mean(balls == "C") ) dplyr
## # A tibble: 1 × 1
## p_AorB
## <dbl>
## 1 0.642
Note that all these ways of finding \(Pr(A\text{ or }B)\) are all equivalent, and are close to our expected value of \(1/2+1/6 = 0.667\)
11.3 Simulating proportions for many samples in R
We can also use the sample()
function to simulate a sampling distribution. So without any math tricks.
There are a bunch of ways to do this but my favorite recipe is:
- Make a sample of size
sample_size
\(\times\)n_reps
.
- Assign the first
sample_size
outcomes to the first replicate, the secondsample_size
outcomes to the second replicate etc…
- Summarize the output.
Here I show these steps
1. Make a sample of size sample_size
\(\times\) n_reps
.
<- 1000
n_reps <- tibble(balls = sample(x = c("A","B","C"),
ball_sampling_dist_exclusive size = n_balls * n_reps, # sample 500 balls 1000 times
replace = TRUE,
prob = c(P_A, P_B, P_C)))
2. Assign the first sample_size
outcomes to the first replicate, the second sample_size
outcomes to the second replicate etc…
<- ball_sampling_dist_exclusive %>%
ball_sampling_dist_exclusive mutate(replicate = rep(1:n_reps, each = n_balls))
Let’s have a look at the result looking at only the first 2 replicates:
3. Summarize the output.
<- ball_sampling_dist_exclusive %>%
ball_sampling_dist_exclusive group_by(replicate, balls, .drop=FALSE) %>% # make sure to keep zeros
::summarise(count = n(),.groups = "drop") # count number of balls of each color each replicate dplyr
Let’s have a look at the result
MAKE A PLOT
ggplot(ball_sampling_dist_exclusive, aes(x = count, fill = balls)) +
geom_density(alpha = .8, show.legend = FALSE, adjust =1.5)+
scale_fill_manual(values = ball_colors)+
geom_vline(xintercept = n_balls * c(P_A, P_B, P_C), lty = 2, color = ball_colors)+
annotate(x = n_balls * c(P_A, P_B, P_C), y = c(.02,.02,.02),
geom = "text",label = c("A","B","C"), size = 6 )+
labs(title = "Many simulations")
SUMMARIZE UNCERTAINTY
We can estimate the variability in a random estimate from this population with a sample of size 500 (the standard error) as the standard deviation of the density plots, in Fig. 11.1.
We do so as follows:
%>%
ball_sampling_dist_exclusive group_by(balls) %>%
::summarise(se = sd(count)) dplyr
## # A tibble: 3 × 2
## balls se
## <chr> <dbl>
## 1 A 10.7
## 2 B 8.31
## 3 C 9.96
11.4 Simulating Non-Exclusive events
We just examined a case in which a ball could not fall through both A and B, so all options where mutually exclusive.
Recall that this need not be true. Let us revisit a case when falling through A and B are not exclusive.
11.4.1 Simulating independent events
Recal that events are independent if the the occurrence of one event provides no information about the other. Let us consider the case of in which \(A\) and \(B\) are independent (below).
We can simulate independent events as two different columns in a tibble, as follows.
<- 1/3
p_A <- 1/3
p_B <- tibble(A = sample(c("A","not A"),
nonexclusive1 size = n_balls,
replace = TRUE,
prob = c(p_A, 1 - p_A)),
B = sample(c("B","not B"),
size = n_balls,
replace = TRUE,
prob = c(p_B, 1 - p_B)))
Summary counts
%>%
nonexclusive1 group_by(A,B)%>%
::summarise(count = n(), .groups = "drop") %>%# lose all groupings after sumarising
dplyrmutate(expected_count = 500 * c(1/3 * 1/3, 2/3 * 1/3, 1/3 * 2/3, 2/3 * 2/3),
observed_prop = count / sum(count),
expected_prop = expected_count /500
)
## # A tibble: 4 × 6
## A B count expected_count observed_prop
## <chr> <chr> <int> <dbl> <dbl>
## 1 A B 52 55.6 0.104
## 2 A not B 117 111. 0.234
## 3 not A B 104 111. 0.208
## 4 not A not B 227 222. 0.454
## # … with 1 more variable: expected_prop <dbl>
A plot
ggplot(nonexclusive1, aes(x =A, fill = B))+
geom_bar(position = "dodge")+
scale_fill_manual(values = c("black","grey"))+
theme_light()
The general addition rule
Remember that the proportion of a sample that is \(a\) OR \(b\), (aka \(Pr_(a \text{ or } b)\) is the proportion that are \(a\) (\(Pr_a\)) plus the proportion that are \(b\) (\(Pr_b\)) minus the probability of \(A\) and \(B\), \(Pr(AB)\) – which does not equal zero for non-exclusive events. Let’s make sure this works with some light R
coding!
%>%
nonexclusive1 summarise(Pr_A = mean(A == "A"),
Pr_B = mean(B == "B"),
Pr_AorB = mean(A == "A" | B == "B"),
# lets verify the general addition principle by
# summing and subtracting double counts
Pr_AandB = mean(A == "A" & B == "B"),
Pr_AorBmath = Pr_A + Pr_B - Pr_AandB
)
## # A tibble: 1 × 5
## Pr_A Pr_B Pr_AorB Pr_AandB Pr_AorBmath
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.338 0.312 0.546 0.104 0.546
11.4.2 Simulating Non-independece
Recall that events are non-independent if their conditional probabilities differ from their unconditional probabilities. Let us revisit our example of non-independence from the previous chapter:
The probability of A, P(A) = \(\frac{1}{3}\).
The probability of A conditional on B, P(A|B) = \(\frac{1}{4}\).
The probability of B, P(B) = \(\frac{2}{3}\).
The probability of B conditional on A, P(B|A) = \(\frac{1}{2}\).
So, how to simulate conditional probabilities? The easiest way to do this is to:
- Simulate one variable first.
- Simulate another variable next, with appropriate conditional probabilities.
So let’s do this, simulating A first
<- 1/3
p_A <- tibble(A = sample(c("A","not A"), size = n_balls,
nonindep_sim replace = TRUE, prob = c(p_A, 1 - p_A)))
Now we simulate \(B\). But first we need to know
- P(B|A) which we know to be \(\frac{1}{2}\), and
- P(B| not A), which we don’t know. We saw how we could calculate that value in the previous chapter. Recall that p(B| not A) = 3/4.
We also use one newR
trick – thecase_when()
function. The fomr of this is
case_when(<thing meets some criteria> ~ <output something>,
<thing meets different criteria> ~ <output somthing else>)
We can have as many cases as we want there! This is a lot like the ifelse()
function that you may have seen elsewhere, but is easier and safer to use.
<- 1/2
p_B_given_A <- 3/4
p_B_given_notA <- nonindep_sim %>%
nonindep_sim group_by(A) %>%
mutate(B = case_when(A == "A" ~ sample(c("B","not B"), size = n(),
replace = TRUE, prob = c(p_B_given_A , 1 - p_B_given_A )),
!= "A" ~ sample(c("B","not B"), size = n(),
A replace = TRUE, prob = c(p_B_given_notA , 1 - p_B_given_notA))
%>%
)) ungroup()
Let’s see what we got!!
Making sure simulations worked
It is always worth doing some checks to make sure our simulation isn’t too off base – let’s make sure that
about 2/3 of all balls went through B (taking the mean of a logical to find the proportion).
%>%
nonindep_sim summarise(total_prop_B = mean(B=="B"))
## # A tibble: 1 × 1
## total_prop_B
## <dbl>
## 1 0.688
about 1/2 of As went through B and 3/4 of not A went through B.
%>%
nonindep_sim group_by(A) %>%
summarise(conditional_prop_B = mean(B=="B"))
## # A tibble: 2 × 2
## A conditional_prop_B
## <chr> <dbl>
## 1 A 0.539
## 2 not A 0.761
When doing these checks, remember that sampling error is smallest with large sample sizes, so trying something out with a large sample size will help you better see signal.
11.4.2.1 Recovering Bayes’ theorem from our simulation results
In the previous chapter we introduced Bayes’ theorem \(P(A|B) = \frac{P(B|A) P(A)}{P(B)}\) allows us to flip conditional probabilities. I think its easier to understand this as a simple proportion. Consider
\(Pr(A|B)\) as the proportion of outcomes A out of all cases when we see B,
\(Pr(B|A) \times Pr(A)\) as the proportion of times we observe B and A.
\(Pr(B)\) as the proportion of times we observe B.
That is, the proportion of times we get A when we have B is simply the number of times we see A and B divided by the number of times we see A.
So we can find the proportion of balls that
- Fell through A and B,
- Fell through neither A nor B,
- Fell through A not B,
- Fell through B not A,
%>%
nonindep_sim group_by(A, B, .drop = FALSE) %>%
summarise(prop = n() / n_balls)
## `summarise()` has grouped output by 'A'. You can
## override using the `.groups` argument.
## # A tibble: 4 × 3
## # Groups: A [2]
## A B prop
## <chr> <chr> <dbl>
## 1 A B 0.178
## 2 A not B 0.152
## 3 not A B 0.51
## 4 not A not B 0.16
We can also find, for example, the proportion of balls that
- Fell through A conditional on not falling through B.
%>%
nonindep_sim filter(B == "not B") %>%
summarise(prop_A_given_notB = mean(A == "A"))
## # A tibble: 1 × 1
## prop_A_given_notB
## <dbl>
## 1 0.487
This value of approximately 1/2 lines up with a visual estimate that about half of the space not taken up by B in Figure 10.5 is covered by A.
Tips for conditional proportions.
To learn the proportion of events with outcome a given outcome b, we divide the proportion of outcomes with both a and b by the proportion of outcomes all outcomes with a and b by the proportion of outcomes with outcome b.
\[prop(a|b) = \frac{prop(a\text{ and }b)}{prop(b)}\]
11.5 A biologically inspired simulation
Let us revisit our flu vaccine example to think more biologically about Bayes’ theorem. First, we lay out some rough probabilities:
Every year, a little more than half of the population gets a flu vaccine, let us say P(No Vaccine) = 0.58.
About 30% of people who do not get the vaccine get the flu, P(Flu|No Vaccine) = 0.30.
The probability of getting the flu diminishes by about one-fourth among the vaccinated, P(Flu|Vaccine) = 0.075.
So you are way less likely to get the flu if you get the flu vaccine, but what is the probability that a random person who was vaccinated P(Vaccine|Flu)? We churned through the math last chapter - let’s see if simulation helps us think. To do so we simulate from these conditional probabilities, count, and divide the number of people that had the flu and the vaccine by the total number of people who had the flu.
<- 0.42
p_Vaccine <- 0.075
p_Flu_given_Vaccine <- 0.300
p_Flu_given_NoVaccine <- 10000000
n_inds
<- tibble(vaccine = sample(c("Vaccinated","Not Vaccinated"),
flu_sim prob =c(p_Vaccine, 1- p_Vaccine),
size = n_inds,
replace = TRUE )) %>%
group_by(vaccine) %>%
mutate(flu = case_when(vaccine == "Vaccinated" ~ sample(c("Flu","No Flu"),
size = n(),
replace = TRUE,
prob = c(p_Flu_given_Vaccine, 1 - p_Flu_given_Vaccine)),
!= "Vaccinated" ~ sample(c("Flu","No Flu"),
vaccine size = n(),
replace = TRUE,
prob = c(p_Flu_given_NoVaccine, 1 - p_Flu_given_NoVaccine))
%>%
)) ungroup()
Let’s browse the first 1000 values
Let’s find all the proportion of each combo
<- flu_sim %>%
flu_sum group_by(vaccine, flu, .drop = FALSE) %>%
summarise(sim_prop = n() / n_inds, .groups = "drop")
Compare to predictions
Recal that we fund predicitons from math:
\[\begin{equation} \begin{split} P(\text{Vaccine|Flu}) &= P(\text{Vaccine and Flu}) / P(\text{Flu}) \\ P(\text{Vaccine|Flu}) &= \tfrac{P(\text{Flu|Vaccine}) \times P(\text{Vaccine})}{P(\text{Vaccine}) \times P(\text{Flu}|\text{Vaccine}) + P(\text{No Vaccine}) \times P(\text{Flu}|\text{No Vaccine})}\\ P(\text{Vaccine|Flu}) &= 0.0315 / 0.2055 \\ P(\text{Vaccine|Flu}) &=0.1533 \end{split} \end{equation}\]
So, while the vaccinated make up 42% of the population, they only make up 15% of people who got the flu.
<- tibble(vaccine = c("Not Vaccinated", "Not Vaccinated", "Vaccinated", "Vaccinated"),
precitions flu = c("Flu", "No Flu", "Flu", "No Flu"),
math_prob = c((1-p_Vaccine) * (p_Flu_given_NoVaccine) ,
1-p_Vaccine) * (1- p_Flu_given_NoVaccine),
(* (p_Flu_given_Vaccine),
(p_Vaccine) * (1-p_Flu_given_Vaccine)))
(p_Vaccine)
full_join(flu_sum, precitions)
## # A tibble: 4 × 4
## vaccine flu sim_prop math_prob
## <chr> <chr> <dbl> <dbl>
## 1 Not Vaccinated Flu 0.174 0.174
## 2 Not Vaccinated No Flu 0.406 0.406
## 3 Vaccinated Flu 0.0315 0.0315
## 4 Vaccinated No Flu 0.388 0.388
Now let’s check our flipping of conditional probabilities (aka Bayes’ theorem)
%>%
flu_sum filter(flu == "Flu") %>% # only looking at people with flu
mutate(prop_given_flu = sim_prop / sum(sim_prop))
## # A tibble: 2 × 4
## vaccine flu sim_prop prop_given_flu
## <chr> <chr> <dbl> <dbl>
## 1 Not Vaccinated Flu 0.174 0.847
## 2 Vaccinated Flu 0.0315 0.153
This is basically what we got by math (recall \(P(\text{Vaccine|Flu}) =0.1533\) from the previous chapter).
11.6 How to do probability – Math or simulation?
In the previous chapter we saw that we could work through a bunch of proababilistic thinking mathematically. In fact – everything we cover here could be discovered by math. This holds not only for these chapters, but for pretty much all of stats.
But for me, simulation provides us with a good intuition, every problem we cover this term could be exactly solved by simulation (while math requires numerous assumptions), and some situations can uniquely solve problems that do not have nice mathematical answers. I often wonder – “Why do we even use math?”
The major advantages of math over simulation are:
Math gives the same answers every time. Unlike random simulation, math is not influenced by chance, so (for example) a sampling error on our computer mean that the sampling distribution from simulation will differ slightly each time, while math based sampling distributions do not change by chance.
Math is fast. Simulations can take a bunch of computer time and energy (it has to do the same thing a bunch of times). With math, the computer quickly run some numbers through a calculator. The computational burdens we’ll face this term are pretty minimal, but with more complex stats this can actually matter.
Math provides insight. Simulations are great for helping us build feelings, intuition and even probabilities. But clear mathematical equations allow us to precisely see the relationship between components of our model.