## 2.5 Introduction to simulation

A **probability model** consists of a sample space, associated events and random variables (or stochastic processes), and a probability measure which specifies how to assign probabilities to events. Given a probability model, we can **simulate** outcomes, occurrences of events, values of random variables, according to the specifications of the probability measure.

Recall from Section 1.2.1 that probabilities can be interpreted as long run relative frequencies. Therefore we can approximate \(\textrm{P}(A)\), the probability of event \(A\), by simulating the random phenomenon a large number of times and computing the relative frequency of \(A\)

\[ \textrm{P}(A) \approx \frac{\text{number of repetitions on which $A$ occurs}}{\text{number of repetitions}} \quad \text{for a large number of simulated repetitions} \]

While we generally use technology to conduct large scale simulations, it is helpful to first consider how we might conduct a simulation by hand using physical objects like coins, dice, cards, or spinners.

### 2.5.1 Tactile simulation: Boxes and spinners

Many random phenomena can be represented in terms of a **“box model”**.

- Imagine a box containing “tickets” with labels. Examples include:
- Fair coin flip. 2 tickets: 1 labeled H and 1 labeled T
- 90% free throw shooter. 10 tickets: 9 labeled “make” and 1 labeled “miss”
- Card shuffling. 52 cards: each card with a pair of labels (face value, suit).
- Random digit dialing. 10 tickets: labeled 0 through 9 (corresponding to single digits).

- The tickets are shuffled in the box, some number are drawn out — either
*with replacement or without replacement*of the tickets before the next draw^{27} - In some cases, the order in which the tickets are drawn matters; in other cases the order is irrelevant. For example,
- Dealing a 5 card poker hand: Select 5 cards without replacement, order does not matter
- Random digit dialing: Select 10 cards with replacement (for a phone number with area code), order matters (e.g., 805-555-1234 is a different outcome than 805-555-4321)

- Then something is done with the tickets, typically to measure random variables of interest. For example, you might flip a coin 10 times (by drawing from the H/T box 10 times with replacement) and count the number of H.

If the draws are made with replacement, we can think of a single “spinner” instead of a box, spun multiple times. For example:

- Fair coin flip. Spinner with half of the area corresponding to H and half T
- 90% free throw shooter. Spinner with 90% of the area correspoding to “make” and 10% “miss”.
- Random digit dialing. Spinner marked with digits 0-9, possibly with some digits more likely than others. spun multiple times. Depending on what regions you are trying to sample, you might have one spinner to generate area code, one to generate the next three digits, and one to generate the last four digits.

**Example 2.28**Let \(X\) be the sum of two rolls of a fair four-sided die, and let \(Y\) be the larger of the two rolls (or the common value if a tie). Set up a box model and explain how you would use it to simulate a single realization of \((X, Y)\). Could you use a spinner instead?

*Solution *
to Example 2.28

Box with four tickets, labeled 1, 2, 3, 4. Draw two tickets with replacement. Let \(X\) be the sum of the two numbers drawn and \(Y\) the larger of the two numbers drawn.

Also, you could set up a spinner with 4 sectors, corresponding to 1, 2, 3, 4, each with 25% of the total area; see Figure 2.5. Spin the spinner twice. Let \(X\) be the sum of the two numbers spun and \(Y\) the larger of the two numbers spun.

The spinner in Figure 2.5 simulates the individual die rolls. We will see in Figure 2.7 below a separate spinner for generating \((X, Y)\) pairs directly.

A simulation involves simulating the probability model for a large number of repetitions and using the results to investigate properties of interest. It is important to distinguish between what entails (1) one repetition of the simulation and its output, and (2) the simulation itself and output from many repetitions.

Continuing the dice rolling example, each repetition involves drawing two tickets with replacement from the box (or two spins of the spinner), resulting in an outcome \(\omega\), and then measuring \(X(\omega)\) and \(Y(\omega)\). We could then repeat this process say 10000 times for a simulation whose output is 10000 \((X, Y)\) pairs simulated according to the probability model.

The following plots illustrate a summary of the output for the dice rolling simulation, conducted by hand, in a class of about 35 students, with each student performing two repetitions of the simulation (two pairs of rolls.) While this simulation only has roughly 70 repetitions, a larger scale simulation follows the same process just with more repetitions.

NEED PLOT HERE

After running say 10000 repetitions of the dice rolling simulation, we could use the results to answer questions like

- What proportion of repetitions yielded \(X=6\)? (Count the number of repetitions for which \(X=6\) and divide by 10000.) This proportion is an approximation of \(\textrm{P}(X=6)\), the probability that the sum of two rolls of a fair four-sided die is 6.
- What was the average value of \(X\)? (Sum the 10000 values of \(X\) and divide by 10000.) This average is an approximation of the
*expected value*\(\textrm{E}(X)\). (See Section**??**.) - What was the average value of \(XY\)? (For each repetition, find the product \(XY\). Then sum the 10000 simulated values of \(XY\) and divide by 10000.) This average is an approximation of \(\textrm{E}(XY)\). (See Section
**??**.) - What proportion of repetitions yielded \(X=6\) and \(Y=3\)? (Count the number of repetitions for which both \(X=6\) and \(Y=3\) and divide by 10000.) This proportion approximates \(\textrm{P}(X=6, Y=3)\).
- Among the repetions for which \(X=6\), what proportion yielded \(Y=3\)? (Count the number of repetitions for which both \(X=6\) and \(Y=3\) and divide by the number of repetitions for which \(X=6\).) This proportion approximates the
*conditional probability*\(\textrm{P}(Y=3|X=6)\). (See Section 3.1.)

### 2.5.2 Technology simulation: Symbulate

**Note: some of the plots and tables in this and the following sections are not displayed properly in the text. See the accompanying Jupyter notebooks for a better representation of Symbulate output.**

We now introduce a technology simulation using the Python package Symbulate. The syntax of Symbulate mirrors the “language of probability” in that the primary objects in Symbulate are the same as the primary components of a probability model: probability spaces, random variables, events. Once these components are specified, Symbulate allows users to simulate many times from the probability model and summarize those results.

This section contains a brief introduction to Symbulate; many more examples can be found throughout the text or in the Symbulate documentation. The following command imports Symbulate during a Python session.

We continue the example where \(X\) is the sum of two rolls of a fair four-sided die, and let \(Y\) is the larger of the two rolls (or the common value if a tie). The following Symbulate code defines a probability space^{28} `P`

for simulating the 16 equally likely ordered pairs of rolls via a box model.

The above code tells Symbulate to draw 2 tickets, with replacement^{29}, from a box with tickets labeled
1, 2, 3, and 4. Each simulated outcome consists of an ordered pair of rolls. The `sim`

command simulates realizations of probability space outcomes (or events, random variables, or stochastic processes).

`## <symbulate.results.Results object at 0x00000000187089E8>`

Symbulate objects have methods such as `sim`

, `tabulate`

, which can be chained together; in this way a line of code can be loosely read left to right as a sentence. For example, the following represents “From the probability space `P`

simulating 10000 outcomes and tabulate the results.”

{(4, 1): 643, (4, 3): 643, (3, 3): 601, (1, 3): 641, (2, 3): 614, (2, 1): 620, (4, 4): 658, (2, 4): 636, (2, 2): 626, (3, 2): 616, (4, 2): 646, (1, 1): 564, (3, 4): 655, (1, 4): 600, (3, 1): 565, (1, 2): 672}

A Symbulate `RV`

is specified by the probability space on which it is defined and the mapping function which defines it. Recall that \(X\) is the sum of the two dice rolls and \(Y\) is the larger (max).

The above code simply defines the random variables. Since a random variable \(X\) is a function, any `RV`

can be called as a function^{30} to return its value \(X(\omega)\) for a particular outcome \(\omega\) in the probability space.

```
## Warning: Calling an RV as a function simply applies the function that defines the RV to the input, regardless of whether that input is a possible outcome in the underlying probability space.
## Warning: Calling an RV as a function simply applies the function that defines the RV to the input, regardless of whether that input is a possible outcome in the underlying probability space.
## 5 3
```

The following commands simulate 100 values of the random variable `Y`

and store the results as `y`

. Note that, consistent with standard notation, the random variable itself is denoted with an uppercase letter `Y`

, while the realized values of it are denoted with a lowercase letter `y`

.

`## <symbulate.results.RVResults object at 0x0000000018723860>`

Values and their frequencies can be summarized using `tabulate`

.

`## {4: 46, 3: 27, 2: 20, 1: 7}`

By default, `tabulate`

returns frequencies (counts). Adding the argument `normalize = True`

returns relative frequencies (proportions).

`## {4: 0.46, 3: 0.27, 2: 0.2, 1: 0.07}`

We can plot the 100 individual simulated values of \(Y\) in a rug plot.

The rug plot emphasizes that realizations of the random varible \(Y\) are numbers along a number line. However, the rug plot does not adequately summarize the relative frequencies. Instead, calling `.plot()`

produces^{31} an *impulse plot* which displays the simulated values and their relative frequencies.

Now we simulate 10,000 values of the random variable `Y`

and summarize the simulation output.

`## {4: 4332, 2: 1894, 3: 3115, 1: 659}`

Events can also be defined and simulated. For programming reasons, events are enclosed in parentheses `()`

rather than braces \(\{\}\). A realization of an event is `True`

if the event occurs for the simulated outcome, or `False`

if not.

`## {False: 7526, True: 2474}`

For logical equality use a double equal sign `==`

. For example, `(Y == 3)`

represents the event \(\{Y=3\}\).

`## {False: 6868, True: 3132}`

The following is a summary for \(X\)

`## {5: 2516, 4: 1903, 7: 1232, 6: 1881, 3: 1289, 8: 596, 2: 583}`

We can simulate pairs of \(X\) and \(Y\) using^{32} `&`

. We store the simulation output as `xy`

to emphasize that `xy`

contains pairs of values.

`## <symbulate.results.RVResults object at 0x0000000025687C18>`

Pairs of values can also be tabulated.

`## {(5, 4): 1235, (3, 2): 1246, (6, 3): 605, (5, 3): 1301, (6, 4): 1213, (2, 1): 637, (7, 4): 1266, (4, 3): 1224, (8, 4): 634, (4, 2): 639}`

`## {(5, 4): 0.1235, (3, 2): 0.1246, (6, 3): 0.0605, (5, 3): 0.1301, (6, 4): 0.1213, (2, 1): 0.0637, (7, 4): 0.1266, (4, 3): 0.1224, (8, 4): 0.0634, (4, 2): 0.0639}`

Individual pairs can be plotting in a scatter plot, which is a two-dimensional analog of a rug plot.

The values can be “jittered” slightly so that points do not coincide.

The two-dimensional analog of an impulse plot is a tile plot. For two discrete variables, the `'tile'`

plot type produces a tile plot (a.k.a. heat map) where rectangles represent the simulated pairs with their relative frequencies visualized on a color scale.

We can add the impulse plot for each of each \(X\) and \(Y\) in the margins of the tile plot using the `'marginals'`

argument.

Notice that the tile plot suggests a tactile method for simulating \((X, Y)\) pairs directly. Namely, pairs generated using the spinner in Figure 2.7 below will follow the same pattern as in the above tile plot. (The spinner in Figure 2.7 represents the *joint distribution* of \((X, Y)\).)

In the code above we used a box model to specify the probability space correspoding to pairs of rolls of a four-sided die. When tickets are equally likely and sampled with replacement, a Discrete Uniform model can also be used. Think of a `DiscreteUniform(a, b)`

probability space corresponding to a spinner with sectors of equal area labeled with integer values from `a`

to `b`

(inclusive). For example, the spinner in Figure 2.5 corresponds to `DiscreteUniform(1, 4)`

.

In the following, \(U\) represents the result of a single roll of a fair four-sided die. The default mapping function in `RV`

is the identity, so `U = RV(P)`

represents^{33} \(U(\omega) = \omega\).

For two rolls the probability space corresponds to spinning the DiscreteUniform spinner two times, which is coded^{34} as `DiscreteUniform(a = 1, b = 4) ** 2`

. The first line below has the same effect^{35} as `BoxModel([1, 2, 3, 4], size = 2, replace = True)`

.

### 2.5.3 Approximating probabilities - margin of error

Recall Section 1.2.1. The probability of an event can be approximated by simulating the random phenomenon a large number of times and computing the relative frequency of the event. However, while after enough repetitions we expect the simulated relative frequency to be *close to* the true probability, there probably won’t be an exact match. Therefore, in addition to reporting the approximate probability, we should also provide a margin of error which indicates how close we think our simulated relative frequency is to the true probability.

Continuing the dice example, suppose we want to estimate \(\textrm{P}(X=6)\), the probability that the sum of two rolls of a fair four-sided is six. The true probability is \(3/16=0.1875\). We will now carry out an analysis similar to the coin flipping example in Section 1.2.1 to investigate simulation margin of error and how it is influenced by the number of simulated values used to compute the relative frequency.

We will perform a “meta-simulation”. The process is as follows

- Simulate two rolls of a fair four-sided die. Compute the sum (\(X\)) and see if it is equal to 6.
- Repeat step 1 \(n\) times to generate \(n\) simulated values of the sum (\(X\)). Compute the relative frequency of sixes: count the number of the \(n\) simulated values equal to 6 and divide by \(n\). Denote this relative frequency \(\hat{p}\).
- Repeat step 2 a large number of times, recording the relative frequency \(\hat{p}\) for each set of \(n\) values.

Be sure to distinguish between steps 2 and 3. A simulation will typically involve just steps 1 and 2, resulting in a single relative frequency based on \(n\) simulated values. Step 3 is the “meta” step; we see how this relative frequency varies from simulation to simulation to help us in determing an appropriate margin of error. The important quantity in this analysis is \(n\), the *number of simulated values used to compute relative frequency* in a single simulation. We wish to see how \(n\) impacts margin of error. The number of simulations in step 3 just needs to be “large” enough to provide a clear picture of how the relative frequency varies from simulation to simulation. The more the relative frequency varies from simulation to simulation, the larger the margin of error needs to be.

In the meta-simulation, the main quantity of interest is the relative frequency, which will vary from simulation to simulation. We can combine steps 1 and 2 to put the meta-simulation in the framework of the simulations from earlier in this section. Namely, we can code the meta-simulation as a simulation in which

- A sample space outcome represents \(n\) values of the sum of two fair-four sided dice
- The main random variable of interest is the proportion of the \(n\) values which are equal to 6.

Let’s first consider \(n=100\). The following Symbulate code defines the probability space corresponding to 100 values of the sum of two-fair four sided dice. Notice the use of `apply`

which functions much in the same way^{36} as `RV`

.

`## <symbulate.results.Results object at 0x0000000022F4C588>`

In the code above

`DiscreteUniform(1, 4) ** 2`

simulates two rolls of a fair four-sided die`.apply(sum)`

computes the sum of the two rolls`** n`

repeats the process`n`

times to generate a set of`n`

independent values, each value representing the sum of two rolls of a fair four-sided die`P.sim(5)`

simulates 5 sets, each set consisting of`n`

sums

Now we define the random variable which takes as an input a set of \(n\) sums and returns the proportion of the \(n\) sums which are equal to six.

`## <symbulate.results.RVResults object at 0x0000000026D05780>`

In the code above

`phat`

is an`RV`

defined on the probability space`P`

. Recall that an outcome of`P`

is a set of`n`

sums (and each sum is the sum of two rolls of a fair four-sided die).- The function that defines the
`RV`

is`count.eq(6)`

, which counts the number of values in the set that are equal to 6. We then^{37}divide by`n`

, the total number of values in the set, to get the relative frequency. (Remember that a transformation of a random variable is also a random variable.) `phat.sim(5)`

generates 5 simulated values of the relative frequency`phat`

. Each simulated value of`phat`

is the relative frequency of sixes in`n`

sums of two rolls of a fair four-sided die.

Now we simulate and summarize a large number of values of `phat`

. We’ll simulate 100 values for illustration. Be sure not to confuse 100 with `n`

. Remmeber, the important quantity is `n`

, the number of simulated values used in computing each relative frequency.

We see that the 100 relative frequencies are roughly centered around the true probability 0.1875, but there is variability in the relative frequencies from simulation to simulation. From the range of values, we see that most relative frequencies are within about 0.07 from the true probability 0.1875.

Now we repeat the analysis, but with \(n=10000\). In this case, each relative frequency is computed based on 10000 independent values, each value representing a sum of two rolls of a fair four-sided die. As before, we simulate 100 relative frequencies.

```
n = 10000
P = (DiscreteUniform(1, 4) ** 2).apply(sum) ** n
phat = RV(P, count_eq(6)) / n
plt.figure()
phat.sim(100).plot()
plt.show()
```

Again we see that the 100 relative frequencies are roughly centered around the true probability 0.1875, but there is less variability in the relative frequencies from simulation to simulation for \(n=10000\) than for \(n=100\). From the range of values, we see that most relative frequencies are within about 0.007 from the true probability 0.1875. That is, the larger the number (\(n\)) of values used in the computation of relative frequency, the smaller the margin of error. As in Section 1.2.1 it appears that when we increase \(n\) by a factor of 100 (from 100 to 10000) we achieve an extra decimal place in precision. This is indeed the case in general.

- Remember: in any simulation the resulting probabilities are approximate.
- The margin of error between an actual probability and a simulated relative frequency is roughly on the order \(1/\sqrt{n}\), where \(n\) is the number of simulated values used to calculate the relative frequency
- More precisely, if \(\hat{p}\) represents the simulated relative frequency, we estimate with 95% confidence
^{38}that the interval with endpoints \[ \hat{p}\pm 2 \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \] contains the actual probability. - Warning: alternative methods are necessary when the actual probability being estimated is close to 0 or to 1.

### 2.5.4 Brief summary of Symbulate commands

Many scenarios require only a few lines of Symbulate code to set up, run, analyze, and visualize. Table comprises the requisite Symbulate commands for a wide variety of situations.

Command | Function |
---|---|

`BoxModel` |
Define a box model |

`ProbabilitySpace` |
Define a custom probability space |

`RV` |
Define random variables, vectors, or processes |

`RandomProcess` |
Define a discrete or continuous time stochastic process |

`apply` |
Apply transformations |

`[]` (brackets) |
Access a component of a random vector, or a random process at a particular time |

`*` (and `**` ) |
Define independent probability spaces or distributions |

`AssumeIndependent` |
Assume random variables or processes are independent |

`|` (vertical bar) |
Condition on events |

`&` |
Join multiple random variables into a random vector |

`sim` |
Simulate outcomes, events, and random variables, vectors, and processes |

`tabulate` |
Tabulate simulated values |

`plot` |
Plot simulated values |

`filter` (and relatives) |
Create subsets of simulated values (`filter_eq` , `filter_lt` , etc) |

`count` (and relatives) |
Count simulated values which satisfy some critera (`count_eq` , `count_lt` , etc) |

Statistical summaries | `mean` , `median` , `sd` , `var` , `quantile` , `corr` , `cov` , etc. |

Common models | See Chapters ?? and ?? |

While no previous experience with Python is required, it is also possible to incorporate Python programming with Symbulate code. In particular, Python functions or loops can be used: to define or transform random variables or stochastic processes, or to investigate the effects of changing parameter values. Also, while many common plots are built in with the Symbulate `plot`

function, the Matplotlib package can be used to create or customize plots. Some of these features will be illustrated in the next section and many more examples are found throughout the text.

“With replacement” always implies replacement at a uniformly random point in the box. Think of “with replacement” as “with replacement and reshuffling” before the next draw.↩

Technically, a probability space is a triple \((\Omega, \mathcal{F}, \textrm{P})\). We primarily view a Symbulate probability space as a description of the probability model rather than an explicit specification of \(\Omega\). For example, we define a

`BoxModel`

instead of creating a set with all possible outcomes. We tend to represent a probability space with`P`

, even though this is a slight abuse of notation.↩The default argument for

`replace`

is`True`

, so we could have just written`BoxModel([1, 2, 3, 4], size = 2)`

.↩The warning you get when you call a

`RV`

as a function just means that Symbulate is not going to check for you that the inputs to the function you used to define the`RV`

actually match up with the outcomes of the probability space`P`

.↩For discrete random variables

`'impulse'`

is the default plot type. Like`.tabulate()`

, the`.plot()`

method also has a`normalize`

argument; the default is`normalize=True`

.↩Technically

`&`

joins two`RV`

s together to form a random*vector*. While we often intepret Symbulate`RV`

as random variable, it really functions as random vector.↩For technical reasons, Symbulate will not plot simulated outcomes from a

`ProbabilitySpace`

, only simulated realizations of an`RV`

. Essentially, as mentioned in earlier sections, probability space outcomes can be*anything*— not necessarily numbers — and so there are too many potential situations and plots for Symbulate to handle with a single`plot`

command. In contrast, simulated realizations of`RV`

are numbers (or vectors of numbers) which facilitates plotting. Plotting simulated outcomes from a probability space`P`

with numerical outcomes can be achieved by first defining`X = RV(P)`

and then simulating and plotting values of`X`

. That is, replace`P.sim(1000).plot()`

with`RV(P).sim(1000).plot()`

. However, this only works if the outcomes of`P`

are numerical.↩For now you can interpret

`DiscreteUniform(a = 1, b = 4)`

as a spinner with four quarters labeled 1, 2, 3, 4, and`** 2`

as “spin the spinner twice”. In Python,`**`

represents exponentiation; e.g.,`2 ** 5 = 32`

. So`DiscreteUniform(a = 1, b = 4) ** 2`

is equivalent to`DiscreteUniform(a = 1, b = 4) * DiscreteUniform(a = 1, b = 4)`

. Future sections will reveal why the product`*`

notation is natural for independent spins of spinners.↩`BoxModel`

assumes equally likely tickets by default, but there are options for non-equally likely cases.↩One difference between

`RV`

and`apply`

:`apply`

preserves the type of the input object. That is, if`apply`

is applied to a`ProbabilitySpace`

then the output will be a`ProbabilitySpace`

; if`apply`

is applied to an`RV`

then the output will be an`RV`

. In contrast,`RV`

always creates and`RV`

.↩Unfortunately, for techincal reasons,

`RV(P, count_eq(6) / n)`

will not work. It is possible to divide by`n`

within`RV`

if we define a custom function`def rel_freq_six(x): return x.count_eq(6) / n`

and then define`RV(P, ref_freq_six)`

.↩We will see the rationale behind this formula later in the class. The factor 2 comes from the fact that for a Normal distribution, about 95% of values are within 2 standard deviations of the mean. Technically, the factor 2 corresponds to 95% confidence only when a single probability is estimated. If multiple probabilities are estimated simultaneously, then alternative methods should be used, e.g., increasing the factor 2 using a Bonferroni correction. For example, a multiple of 5 rather than 2 produces very conservative error bounds.↩