10 Sampling and Simulation
10.1 Monte Carlo simulations
In 5-card poker, the probability of being dealt a flush is about \(1/508 \approx 0.00197\). It is a good exercise to derive this number analytically using basic combinatorics. With a computer, we can approximate it by Monte Carlo simulation (justified by the Law of Large Numbers): we repeatedly generate a poker hand and check whether it is a flush. Since the probability we are estimating is quite small, the number of Monte Carlo replicates has to be quite large to yield a decent approximation. We use some code from a probability course at Duke University to generate a poker hand and check whether it is a flush.
source("https://www2.stat.duke.edu/courses/Spring12/sta104.1/Homework/poker_simulation.R")
B = 1e5 # number of MC replicates
number_flush = 0
for (b in 1:B) {
  x = deal_hand() # generate a poker hand
  number_flush = number_flush + (what_hand(x) == "Flush")
}
cat(sprintf("fraction of flush hands (out of %i): %f", B, number_flush/B))
fraction of flush hands (out of 100000): 0.002030
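For comparison, the exact probability can be derived with basic combinatorics. The short computation below counts the flushes that are not straight flushes; whether what_hand() also excludes straight flushes from its "Flush" category is an assumption here.
n_hands = choose(52, 5) # 2,598,960 possible 5-card hands
n_flush = 4*(choose(13, 5) - 10) # 5,108 flushes that are not straight flushes
n_flush/n_hands # about 0.00197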
10.2 Monte Carlo integration
Simulations, again via the Law of Large Numbers, allow for the computation of integrals, since integrals can be interpreted as expectations. This avenue offers some advantages over numerical integration such as ease of implementation (although code for numerical integration is readily available) and lower computational complexity in higher dimensions. We perform some numerical experiments in dimension one where a Monte Carlo approach is typically less desirable, but where we can more easily quantify its accuracy compared to numerical integration.
f = function(x){sqrt(x)} # function to be integrated
exact_integral = 2/3 # integral over [0, 1]
num_integral = function() {
  integrate(f, 0, 1)$value
} # numerical integration
mc_integral = function() {
  x = runif(1e6)
  return(mean(f(x)))
} # MC integration with 1e6 replicates
First, we look at the accuracy.
num_error = num_integral() - exact_integral
num_error
[1] 7.483435e-08
S = 1e2
mc_error = numeric(S)
for (s in 1:S) {
  mc_error[s] = mc_integral() - exact_integral
}
summary(mc_error)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.234e-04 -2.157e-04 -1.993e-05 -1.365e-05 1.359e-04 7.905e-04
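These errors are consistent with the theoretical Monte Carlo standard error \(\sqrt{\mathrm{Var}(\sqrt{U})/10^6}\), where \(U\) is uniform on \([0, 1]\), since \(\mathrm{Var}(\sqrt{U}) = 1/2 - (2/3)^2 = 1/18\). The quick check below (not part of the original code) compares this with the empirical standard deviation.
sqrt((1/18)/1e6) # theoretical standard error, about 2.4e-04
sd(mc_error) # empirical standard deviation, of the same order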
Next, we look at computational complexity.
require(benchr)
summary(benchmark(num_integral(), mc_integral(), times = 100, progress = FALSE), digits = 3, row.names = FALSE)
Time units : microseconds
           expr n.eval    min lw.qu  median    mean   up.qu   max   total relative
 num_integral()    100   11.9    14    43.8    36.1    56.1   105    3610        1
  mc_integral()    100 8820.0  9000  9510.0 10200.0 10700.0 38400 1020000      217
In dimension one, at least for this simple function, numerical integration is much superior in both accuracy and speed. This is as expected. In higher dimensions, however, and in fact already at dimension 4 or 5, numerical integration may not even be a contender as it typically is not computationally viable.
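To give a sense of how the Monte Carlo approach carries over to higher dimensions, here is a minimal sketch in dimension 5; the integrand and the number of replicates are illustrative choices, not taken from the course code. By Fubini's theorem, the exact value of this particular integral is \((2/3)^5\), so the error is easy to check.
d = 5 # dimension
n = 1e5 # number of MC replicates
x = matrix(runif(n*d), n, d) # n points uniform in [0, 1]^5
mc_estimate = mean(apply(x, 1, function(u) prod(sqrt(u)))) # MC estimate of the integral of prod(sqrt(x))
mc_estimate - (2/3)^d # exact integral is (2/3)^5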
10.3 Rejection sampling
Here is a simple example of rejection sampling, where the goal is to sample uniformly at random from the unit disc. The approach is based on sampling the unit square (which contains the unit disc) uniformly at random (which is straightforward) and only keeping those draws that fall in the disc.
require(plotrix)
plot(0, 0, type = "n", xlim = c(-1,1), ylim = c(-1,1), xlab = "", ylab = "", asp = 1)
draw.circle(0, 0, 1, lty = 2)
rect(-1, -1, 1, 1, lty = 3)
legend("topleft", c("accepted", "rejected"), col = c("green", "red"), pch = 16, xpd = TRUE, inset = c(0, -0.15))
n = 100 # desired sample size
t = 0 # number of samples accepted so far
while (t < n) {
  y = runif(2, -1, 1) # propose a point uniform in the square
  color = "red"
  if (y[1]^2 + y[2]^2 <= 1) {
    t = t + 1
    color = "green"
  }
  points(y[1], y[2], pch = 16, col = color)
}
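Since the proposals are uniform in the square, the acceptance probability is the ratio of areas, \(\pi/4 \approx 0.785\), so on average about \(4/\pi \approx 1.27\) proposals are needed per accepted point. The following variant (without plotting; the variable names are ours) estimates this rate, and incidentally \(\pi\), empirically.
n_prop = 1e5 # number of proposals
y = matrix(runif(2*n_prop, -1, 1), n_prop, 2) # proposals uniform in the square
accept = rowSums(y^2) <= 1 # TRUE when the proposal falls in the unit disc
mean(accept) # about pi/4
4*mean(accept) # about pi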
10.4 Markov chain Monte Carlo
For illustration, we consider the setting of species co-occurrence, important in ecology, where a presence-absence matrix is available and we want to draw uniformly at random from the set of binary matrices of the same size and with the same row and column totals. This is one of the random models considered in that field, meant to represent a situation where the various species have populated the various geographical sites "at random". The Markov chain proceeds by swapping \(2 \times 2\) submatrices, replacing the identity pattern with its mirror image and vice versa, whenever possible.
require(cooccur)
data(finches) # 13 species over 17 sites
rownames(finches) # species (corresponding to rows)
[1] "Geospiza magnirostris" "Geospiza fortis"
[3] "Geospiza fuliginosa" "Geospiza difficilis"
[5] "Geospiza scandens" "Geospiza conirostris"
[7] "Camarhynchus psittacula" "Camarhynchus pauper"
[9] "Camarhynchus parvulus" "Platyspiza crassirostris"
[11] "Cactospiza pallida" "Cactospiza heliobates"
[13] "Certhidea olivacea"
colnames(finches) # sites (corresponding to columns)
[1] "Seymour" "Baltra" "Isabella" "Fernandina"
[5] "Santiago" "Rabida" "Pinzon" "Santa.Cruz"
[9] "Santa.Fe" "San.Cristobal" "Espanola" "Floreana"
[13] "Genovesa" "Marchena" "Pinta" "Darwin"
[17] "Wolf"
write.table(finches, row = FALSE, col = FALSE)
0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0
0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1
1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
0 0 1 1 1 1 1 1 1 0 0 1 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0
0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
B = 1e4 # number of steps in the Markov chain
I = diag(2)
J = matrix(c(0, 1, 1, 0), 2, 2)
X = as.matrix(finches)
rownames(X) = NULL
colnames(X) = NULL
for (b in 1:B) {
  rows = sample(1:nrow(X), 2)
  cols = sample(1:ncol(X), 2)
  if (sum(X[rows, cols] == I) == 4) X[rows, cols] = J
  if (sum(X[rows, cols] == J) == 4) X[rows, cols] = I
}
write.table(X, row = FALSE, col = FALSE)
0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 0 0
1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0
0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 0 0 1 1 1 0 1 0 0 1 1 0
1 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0
0 0 1 1 1 0 1 0 0 0 0 0 0 1 1 0 0
0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
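As a sanity check, each swap leaves the row and column totals unchanged, which we can verify directly (a quick check, not part of the original code).
all(rowSums(X) == rowSums(as.matrix(finches))) # row totals preserved
all(colSums(X) == colSums(as.matrix(finches))) # column totals preserved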
This code produces a single, approximately uniform random binary matrix (among those with the same row and column totals). Statistical inference typically requires drawing multiple realizations.
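Here is a minimal sketch of how several realizations might be collected by continuing the chain and recording its state at regular intervals; the helper do_swaps() and the spacing of 1e3 attempted swaps between recorded states are illustrative choices, not prescribed here.
do_swaps = function(X, nsteps) { # attempt nsteps random 2x2 swaps on X
  for (b in 1:nsteps) {
    rows = sample(1:nrow(X), 2)
    cols = sample(1:ncol(X), 2)
    if (sum(X[rows, cols] == I) == 4) X[rows, cols] = J
    if (sum(X[rows, cols] == J) == 4) X[rows, cols] = I
  }
  X
}
R = 100 # number of realizations to collect
draws = vector("list", R)
for (r in 1:R) {
  X = do_swaps(X, 1e3) # keep every 1e3-th state of the chain
  draws[[r]] = X
}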