2.5 Introduction to simulation

A probability model consists of a sample space, associated events and random variables (or stochastic processes), and a probability measure which specifies how to assign probabilities to events. Given a probability model, we can simulate outcomes, occurrences of events, values of random variables, according to the specifications of the probability measure.

Recall from Section 1.2.1 that probabilities can be interpreted as long run relative frequencies. Therefore we can approximate \(\textrm{P}(A)\), the probability of event \(A\), by simulating the random phenomenon a large number of times and computing the relative frequency of \(A\)

\[ \textrm{P}(A) \approx \frac{\text{number of repetitions on which $A$ occurs}}{\text{number of repetitions}} \quad \text{for a large number of simulated repetitions} \]

While we generally use technology to conduct large scale simulations, it is helpful to first consider how we might conduct a simulation by hand using physical objects like coins, dice, cards, or spinners.

2.5.1 Tactile simulation: Boxes and spinners

Many random phenomena can be represented in terms of a “box model”.

  • Imagine a box containing “tickets” with labels. Examples include:
    • Fair coin flip. 2 tickets: 1 labeled H and 1 labeled T
    • 90% free throw shooter. 10 tickets: 9 labeled “make” and 1 labeled “miss”
    • Card shuffling. 52 cards: each card with a pair of labels (face value, suit).
    • Random digit dialing. 10 tickets: labeled 0 through 9 (corresponding to single digits).
  • The tickets are shuffled in the box, some number are drawn out — either with replacement or without replacement of the tickets before the next draw27
  • In some cases, the order in which the tickets are drawn matters; in other cases the order is irrelevant. For example,
    • Dealing a 5 card poker hand: Select 5 cards without replacement, order does not matter
    • Random digit dialing: Select 10 cards with replacement (for a phone number with area code), order matters (e.g., 805-555-1234 is a different outcome than 805-555-4321)
  • Then something is done with the tickets, typically to measure random variables of interest. For example, you might flip a coin 10 times (by drawing from the H/T box 10 times with replacement) and count the number of H.

If the draws are made with replacement, we can think of a single “spinner” instead of a box, spun multiple times. For example:

  • Fair coin flip. Spinner with half of the area corresponding to H and half T
  • 90% free throw shooter. Spinner with 90% of the area correspoding to “make” and 10% “miss”.
  • Random digit dialing. Spinner marked with digits 0-9, possibly with some digits more likely than others. spun multiple times. Depending on what regions you are trying to sample, you might have one spinner to generate area code, one to generate the next three digits, and one to generate the last four digits.
Example 2.28 Let \(X\) be the sum of two rolls of a fair four-sided die, and let \(Y\) be the larger of the two rolls (or the common value if a tie). Set up a box model and explain how you would use it to simulate a single realization of \((X, Y)\). Could you use a spinner instead?

Solution to Example 2.28

Box with four tickets, labeled 1, 2, 3, 4. Draw two tickets with replacement. Let \(X\) be the sum of the two numbers drawn and \(Y\) the larger of the two numbers drawn.

Also, you could set up a spinner with 4 sectors, corresponding to 1, 2, 3, 4, each with 25% of the total area; see Figure 2.5. Spin the spinner twice. Let \(X\) be the sum of the two numbers spun and \(Y\) the larger of the two numbers spun.

Spinner corresponding to a single roll of a fair four-sided die.

Figure 2.5: Spinner corresponding to a single roll of a fair four-sided die.

The spinner in Figure 2.5 simulates the individual die rolls. We will see in Figure 2.7 below a separate spinner for generating \((X, Y)\) pairs directly.

A simulation involves simulating the probability model for a large number of repetitions and using the results to investigate properties of interest. It is important to distinguish between what entails (1) one repetition of the simulation and its output, and (2) the simulation itself and output from many repetitions.

Continuing the dice rolling example, each repetition involves drawing two tickets with replacement from the box (or two spins of the spinner), resulting in an outcome \(\omega\), and then measuring \(X(\omega)\) and \(Y(\omega)\). We could then repeat this process say 10000 times for a simulation whose output is 10000 \((X, Y)\) pairs simulated according to the probability model.

The following plots illustrate a summary of the output for the dice rolling simulation, conducted by hand, in a class of about 35 students, with each student performing two repetitions of the simulation (two pairs of rolls.) While this simulation only has roughly 70 repetitions, a larger scale simulation follows the same process just with more repetitions.

NEED PLOT HERE

After running say 10000 repetitions of the dice rolling simulation, we could use the results to answer questions like

  • What proportion of repetitions yielded \(X=6\)? (Count the number of repetitions for which \(X=6\) and divide by 10000.) This proportion is an approximation of \(\textrm{P}(X=6)\), the probability that the sum of two rolls of a fair four-sided die is 6.
  • What was the average value of \(X\)? (Sum the 10000 values of \(X\) and divide by 10000.) This average is an approximation of the expected value \(\textrm{E}(X)\). (See Section ??.)
  • What was the average value of \(XY\)? (For each repetition, find the product \(XY\). Then sum the 10000 simulated values of \(XY\) and divide by 10000.) This average is an approximation of \(\textrm{E}(XY)\). (See Section ??.)
  • What proportion of repetitions yielded \(X=6\) and \(Y=3\)? (Count the number of repetitions for which both \(X=6\) and \(Y=3\) and divide by 10000.) This proportion approximates \(\textrm{P}(X=6, Y=3)\).
  • Among the repetions for which \(X=6\), what proportion yielded \(Y=3\)? (Count the number of repetitions for which both \(X=6\) and \(Y=3\) and divide by the number of repetitions for which \(X=6\).) This proportion approximates the conditional probability \(\textrm{P}(Y=3|X=6)\). (See Section 3.1.)

2.5.2 Technology simulation: Symbulate

Note: some of the plots and tables in this and the following sections are not displayed properly in the text. See the accompanying Jupyter notebooks for a better representation of Symbulate output.

We now introduce a technology simulation using the Python package Symbulate. The syntax of Symbulate mirrors the “language of probability” in that the primary objects in Symbulate are the same as the primary components of a probability model: probability spaces, random variables, events. Once these components are specified, Symbulate allows users to simulate many times from the probability model and summarize those results.

This section contains a brief introduction to Symbulate; many more examples can be found throughout the text or in the Symbulate documentation. The following command imports Symbulate during a Python session.

We continue the example where \(X\) is the sum of two rolls of a fair four-sided die, and let \(Y\) is the larger of the two rolls (or the common value if a tie). The following Symbulate code defines a probability space28 P for simulating the 16 equally likely ordered pairs of rolls via a box model.

The above code tells Symbulate to draw 2 tickets, with replacement29, from a box with tickets labeled 1, 2, 3, and 4. Each simulated outcome consists of an ordered pair of rolls. The sim command simulates realizations of probability space outcomes (or events, random variables, or stochastic processes).

## <symbulate.results.Results object at 0x00000000187089E8>

Symbulate objects have methods such as sim, tabulate, which can be chained together; in this way a line of code can be loosely read left to right as a sentence. For example, the following represents “From the probability space P simulating 10000 outcomes and tabulate the results.”

{(4, 1): 643, (4, 3): 643, (3, 3): 601, (1, 3): 641, (2, 3): 614, (2, 1): 620, (4, 4): 658, (2, 4): 636, (2, 2): 626, (3, 2): 616, (4, 2): 646, (1, 1): 564, (3, 4): 655, (1, 4): 600, (3, 1): 565, (1, 2): 672}

A Symbulate RV is specified by the probability space on which it is defined and the mapping function which defines it. Recall that \(X\) is the sum of the two dice rolls and \(Y\) is the larger (max).

The above code simply defines the random variables. Since a random variable \(X\) is a function, any RV can be called as a function30 to return its value \(X(\omega)\) for a particular outcome \(\omega\) in the probability space.

## Warning: Calling an RV as a function simply applies the function that defines the RV to the input, regardless of whether that input is a possible outcome in the underlying probability space.
## Warning: Calling an RV as a function simply applies the function that defines the RV to the input, regardless of whether that input is a possible outcome in the underlying probability space.
## 5 3

The following commands simulate 100 values of the random variable Y and store the results as y. Note that, consistent with standard notation, the random variable itself is denoted with an uppercase letter Y, while the realized values of it are denoted with a lowercase letter y.

## <symbulate.results.RVResults object at 0x0000000018723860>

Values and their frequencies can be summarized using tabulate.

## {4: 46, 3: 27, 2: 20, 1: 7}

By default, tabulate returns frequencies (counts). Adding the argument normalize = True returns relative frequencies (proportions).

## {4: 0.46, 3: 0.27, 2: 0.2, 1: 0.07}

We can plot the 100 individual simulated values of \(Y\) in a rug plot.

The rug plot emphasizes that realizations of the random varible \(Y\) are numbers along a number line. However, the rug plot does not adequately summarize the relative frequencies. Instead, calling .plot() produces31 an impulse plot which displays the simulated values and their relative frequencies.

Now we simulate 10,000 values of the random variable Y and summarize the simulation output.

## {4: 4332, 2: 1894, 3: 3115, 1: 659}
Simulation-based approximation distribution of \(Y\), the larger (or common value if a tie) of two rolls of a fair four-sided die.

Figure 2.6: Simulation-based approximation distribution of \(Y\), the larger (or common value if a tie) of two rolls of a fair four-sided die.

Events can also be defined and simulated. For programming reasons, events are enclosed in parentheses () rather than braces \(\{\}\). A realization of an event is True if the event occurs for the simulated outcome, or False if not.

## {False: 7526, True: 2474}

For logical equality use a double equal sign ==. For example, (Y == 3) represents the event \(\{Y=3\}\).

## {False: 6868, True: 3132}

The following is a summary for \(X\)

## {5: 2516, 4: 1903, 7: 1232, 6: 1881, 3: 1289, 8: 596, 2: 583}

We can simulate pairs of \(X\) and \(Y\) using32 &. We store the simulation output as xy to emphasize that xy contains pairs of values.

## <symbulate.results.RVResults object at 0x0000000025687C18>

Pairs of values can also be tabulated.

## {(5, 4): 1235, (3, 2): 1246, (6, 3): 605, (5, 3): 1301, (6, 4): 1213, (2, 1): 637, (7, 4): 1266, (4, 3): 1224, (8, 4): 634, (4, 2): 639}
## {(5, 4): 0.1235, (3, 2): 0.1246, (6, 3): 0.0605, (5, 3): 0.1301, (6, 4): 0.1213, (2, 1): 0.0637, (7, 4): 0.1266, (4, 3): 0.1224, (8, 4): 0.0634, (4, 2): 0.0639}

Individual pairs can be plotting in a scatter plot, which is a two-dimensional analog of a rug plot.

The values can be “jittered” slightly so that points do not coincide.

The two-dimensional analog of an impulse plot is a tile plot. For two discrete variables, the 'tile' plot type produces a tile plot (a.k.a. heat map) where rectangles represent the simulated pairs with their relative frequencies visualized on a color scale.

We can add the impulse plot for each of each \(X\) and \(Y\) in the margins of the tile plot using the 'marginals' argument.

Notice that the tile plot suggests a tactile method for simulating \((X, Y)\) pairs directly. Namely, pairs generated using the spinner in Figure 2.7 below will follow the same pattern as in the above tile plot. (The spinner in Figure 2.7 represents the joint distribution of \((X, Y)\).)

Spinner which generates \((X, Y)\) pairs, where \(X\) is the sum and \(Y\) is the larger (or common value if a tie) of two rolls of a fair four-sided die.

Figure 2.7: Spinner which generates \((X, Y)\) pairs, where \(X\) is the sum and \(Y\) is the larger (or common value if a tie) of two rolls of a fair four-sided die.

In the code above we used a box model to specify the probability space correspoding to pairs of rolls of a four-sided die. When tickets are equally likely and sampled with replacement, a Discrete Uniform model can also be used. Think of a DiscreteUniform(a, b) probability space corresponding to a spinner with sectors of equal area labeled with integer values from a to b (inclusive). For example, the spinner in Figure 2.5 corresponds to DiscreteUniform(1, 4).

In the following, \(U\) represents the result of a single roll of a fair four-sided die. The default mapping function in RV is the identity, so U = RV(P) represents33 \(U(\omega) = \omega\).

For two rolls the probability space corresponds to spinning the DiscreteUniform spinner two times, which is coded34 as DiscreteUniform(a = 1, b = 4) ** 2. The first line below has the same effect35 as BoxModel([1, 2, 3, 4], size = 2, replace = True).

2.5.3 Approximating probabilities - margin of error

Recall Section 1.2.1. The probability of an event can be approximated by simulating the random phenomenon a large number of times and computing the relative frequency of the event. However, while after enough repetitions we expect the simulated relative frequency to be close to the true probability, there probably won’t be an exact match. Therefore, in addition to reporting the approximate probability, we should also provide a margin of error which indicates how close we think our simulated relative frequency is to the true probability.

Continuing the dice example, suppose we want to estimate \(\textrm{P}(X=6)\), the probability that the sum of two rolls of a fair four-sided is six. The true probability is \(3/16=0.1875\). We will now carry out an analysis similar to the coin flipping example in Section 1.2.1 to investigate simulation margin of error and how it is influenced by the number of simulated values used to compute the relative frequency.

We will perform a “meta-simulation”. The process is as follows

  1. Simulate two rolls of a fair four-sided die. Compute the sum (\(X\)) and see if it is equal to 6.
  2. Repeat step 1 \(n\) times to generate \(n\) simulated values of the sum (\(X\)). Compute the relative frequency of sixes: count the number of the \(n\) simulated values equal to 6 and divide by \(n\). Denote this relative frequency \(\hat{p}\).
  3. Repeat step 2 a large number of times, recording the relative frequency \(\hat{p}\) for each set of \(n\) values.

Be sure to distinguish between steps 2 and 3. A simulation will typically involve just steps 1 and 2, resulting in a single relative frequency based on \(n\) simulated values. Step 3 is the “meta” step; we see how this relative frequency varies from simulation to simulation to help us in determing an appropriate margin of error. The important quantity in this analysis is \(n\), the number of simulated values used to compute relative frequency in a single simulation. We wish to see how \(n\) impacts margin of error. The number of simulations in step 3 just needs to be “large” enough to provide a clear picture of how the relative frequency varies from simulation to simulation. The more the relative frequency varies from simulation to simulation, the larger the margin of error needs to be.

In the meta-simulation, the main quantity of interest is the relative frequency, which will vary from simulation to simulation. We can combine steps 1 and 2 to put the meta-simulation in the framework of the simulations from earlier in this section. Namely, we can code the meta-simulation as a simulation in which

  • A sample space outcome represents \(n\) values of the sum of two fair-four sided dice
  • The main random variable of interest is the proportion of the \(n\) values which are equal to 6.

Let’s first consider \(n=100\). The following Symbulate code defines the probability space corresponding to 100 values of the sum of two-fair four sided dice. Notice the use of apply which functions much in the same way36 as RV.

## <symbulate.results.Results object at 0x0000000022F4C588>

In the code above

  • DiscreteUniform(1, 4) ** 2 simulates two rolls of a fair four-sided die
  • .apply(sum) computes the sum of the two rolls
  • ** n repeats the process n times to generate a set of n independent values, each value representing the sum of two rolls of a fair four-sided die
  • P.sim(5) simulates 5 sets, each set consisting of n sums

Now we define the random variable which takes as an input a set of \(n\) sums and returns the proportion of the \(n\) sums which are equal to six.

## <symbulate.results.RVResults object at 0x0000000026D05780>

In the code above

  • phat is an RV defined on the probability space P. Recall that an outcome of P is a set of n sums (and each sum is the sum of two rolls of a fair four-sided die).
  • The function that defines the RV is count.eq(6), which counts the number of values in the set that are equal to 6. We then37 divide by n, the total number of values in the set, to get the relative frequency. (Remember that a transformation of a random variable is also a random variable.)
  • phat.sim(5) generates 5 simulated values of the relative frequency phat. Each simulated value of phat is the relative frequency of sixes in n sums of two rolls of a fair four-sided die.

Now we simulate and summarize a large number of values of phat. We’ll simulate 100 values for illustration. Be sure not to confuse 100 with n. Remmeber, the important quantity is n, the number of simulated values used in computing each relative frequency.

We see that the 100 relative frequencies are roughly centered around the true probability 0.1875, but there is variability in the relative frequencies from simulation to simulation. From the range of values, we see that most relative frequencies are within about 0.07 from the true probability 0.1875.

Now we repeat the analysis, but with \(n=10000\). In this case, each relative frequency is computed based on 10000 independent values, each value representing a sum of two rolls of a fair four-sided die. As before, we simulate 100 relative frequencies.

Again we see that the 100 relative frequencies are roughly centered around the true probability 0.1875, but there is less variability in the relative frequencies from simulation to simulation for \(n=10000\) than for \(n=100\). From the range of values, we see that most relative frequencies are within about 0.007 from the true probability 0.1875. That is, the larger the number (\(n\)) of values used in the computation of relative frequency, the smaller the margin of error. As in Section 1.2.1 it appears that when we increase \(n\) by a factor of 100 (from 100 to 10000) we achieve an extra decimal place in precision. This is indeed the case in general.

  • Remember: in any simulation the resulting probabilities are approximate.
  • The margin of error between an actual probability and a simulated relative frequency is roughly on the order \(1/\sqrt{n}\), where \(n\) is the number of simulated values used to calculate the relative frequency
  • More precisely, if \(\hat{p}\) represents the simulated relative frequency, we estimate with 95% confidence38 that the interval with endpoints \[ \hat{p}\pm 2 \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \] contains the actual probability.
  • Warning: alternative methods are necessary when the actual probability being estimated is close to 0 or to 1.

2.5.4 Brief summary of Symbulate commands

Many scenarios require only a few lines of Symbulate code to set up, run, analyze, and visualize. Table comprises the requisite Symbulate commands for a wide variety of situations.

Command Function
BoxModel Define a box model
ProbabilitySpace Define a custom probability space
RV Define random variables, vectors, or processes
RandomProcess Define a discrete or continuous time stochastic process
apply Apply transformations
[] (brackets) Access a component of a random vector, or a random process at a particular time
* (and **) Define independent probability spaces or distributions
AssumeIndependent Assume random variables or processes are independent
| (vertical bar) Condition on events
& Join multiple random variables into a random vector
sim Simulate outcomes, events, and random variables, vectors, and processes
tabulate Tabulate simulated values
plot Plot simulated values
filter (and relatives) Create subsets of simulated values (filter_eq, filter_lt, etc)
count (and relatives) Count simulated values which satisfy some critera (count_eq, count_lt, etc)
Statistical summaries mean, median, sd, var, quantile, corr, cov, etc.
Common models See Chapters ?? and ??

While no previous experience with Python is required, it is also possible to incorporate Python programming with Symbulate code. In particular, Python functions or loops can be used: to define or transform random variables or stochastic processes, or to investigate the effects of changing parameter values. Also, while many common plots are built in with the Symbulate plot function, the Matplotlib package can be used to create or customize plots. Some of these features will be illustrated in the next section and many more examples are found throughout the text.


  1. “With replacement” always implies replacement at a uniformly random point in the box. Think of “with replacement” as “with replacement and reshuffling” before the next draw.

  2. Technically, a probability space is a triple \((\Omega, \mathcal{F}, \textrm{P})\). We primarily view a Symbulate probability space as a description of the probability model rather than an explicit specification of \(\Omega\). For example, we define a BoxModel instead of creating a set with all possible outcomes. We tend to represent a probability space with P, even though this is a slight abuse of notation.

  3. The default argument for replace is True, so we could have just written BoxModel([1, 2, 3, 4], size = 2).

  4. The warning you get when you call a RV as a function just means that Symbulate is not going to check for you that the inputs to the function you used to define the RV actually match up with the outcomes of the probability space P.

  5. For discrete random variables 'impulse' is the default plot type. Like .tabulate(), the .plot() method also has a normalize argument; the default is normalize=True.

  6. Technically & joins two RVs together to form a random vector. While we often intepret Symbulate RV as random variable, it really functions as random vector.

  7. For technical reasons, Symbulate will not plot simulated outcomes from a ProbabilitySpace, only simulated realizations of an RV. Essentially, as mentioned in earlier sections, probability space outcomes can be anything — not necessarily numbers — and so there are too many potential situations and plots for Symbulate to handle with a single plot command. In contrast, simulated realizations of RV are numbers (or vectors of numbers) which facilitates plotting. Plotting simulated outcomes from a probability space P with numerical outcomes can be achieved by first defining X = RV(P) and then simulating and plotting values of X. That is, replace P.sim(1000).plot() with RV(P).sim(1000).plot(). However, this only works if the outcomes of P are numerical.

  8. For now you can interpretDiscreteUniform(a = 1, b = 4) as a spinner with four quarters labeled 1, 2, 3, 4, and ** 2 as “spin the spinner twice”. In Python, ** represents exponentiation; e.g., 2 ** 5 = 32. So DiscreteUniform(a = 1, b = 4) ** 2 is equivalent to DiscreteUniform(a = 1, b = 4) * DiscreteUniform(a = 1, b = 4). Future sections will reveal why the product * notation is natural for independent spins of spinners.

  9. BoxModel assumes equally likely tickets by default, but there are options for non-equally likely cases.

  10. One difference between RV and apply: apply preserves the type of the input object. That is, if apply is applied to a ProbabilitySpace then the output will be a ProbabilitySpace; if apply is applied to an RV then the output will be an RV. In contrast, RV always creates and RV.

  11. Unfortunately, for techincal reasons, RV(P, count_eq(6) / n) will not work. It is possible to divide by n within RV if we define a custom function def rel_freq_six(x): return x.count_eq(6) / n and then define RV(P, ref_freq_six).

  12. We will see the rationale behind this formula later in the class. The factor 2 comes from the fact that for a Normal distribution, about 95% of values are within 2 standard deviations of the mean. Technically, the factor 2 corresponds to 95% confidence only when a single probability is estimated. If multiple probabilities are estimated simultaneously, then alternative methods should be used, e.g., increasing the factor 2 using a Bonferroni correction. For example, a multiple of 5 rather than 2 produces very conservative error bounds.