20 Simulation

20.1 Generating Random Numbers

Watch a video of this section

Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs.

R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R

  • rnorm: generate random Normal variates with a given mean and standard deviation
  • dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
  • pnorm: evaluate the cumulative distribution function for a Normal distribution
  • rpois: generate random Poisson variates with a given rate

For each probability distribution there are typically four functions available that start with a “r”, “d”, “p”, and “q”. The “r” function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a

  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function (inverse cumulative distribution)

If you’re only interested in simulating random numbers, then you will likely only need the “r” functions and not the others. However, if you intend to simulate from arbitrary probability distributions using something like rejection sampling, then you will need the other functions too.

Probably the most common probability distribution to work with the is the Normal distribution (also known as the Gaussian). Working with the Normal distributions requires using these four functions

Here we simulate standard Normal random numbers with mean 0 and standard deviation 1.

We can modify the default parameters to simulate numbers with mean 20 and standard deviation 2.

If you wanted to know what was the probability of a random Normal variable of being less than, say, 2, you could use the pnorm() function to do that calculation.

You never know when that calculation will come in handy.

20.2 Setting the random number seed

When simulating any random numbers it is essential to set the random number seed. Setting the random number seed with set.seed() ensures reproducibility of the sequence of random numbers.

For example, I can generate 5 Normal random numbers with rnorm().

Note that if I call rnorm() again I will of course get a different set of 5 random numbers.

If I want to reproduce the original set of random numbers, I can just reset the seed with set.seed().

In general, you should always set the random number seed when conducting a simulation! Otherwise, you will not be able to reconstruct the exact numbers that you produced in an analysis.

It is possible to generate random numbers from other probability distributions like the Poisson. The Poisson distribution is commonly used to model data that come in the form of counts.

20.3 Simulating a Linear Model

Watch a video of this section

Simulating random numbers is useful but sometimes we want to simulate values that come from a specific model. For that we need to specify the model and then simulate from it using the functions described above.

Suppose we want to simulate from the following linear model

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where \(\varepsilon\sim\mathcal{N}(0,2^2)\). Assume \(x\sim\mathcal{N}(0,1^2)\), \(\beta_0=0.5\) and \(\beta_1=2\). The variable x might represent an important predictor of the outcome y. Here’s how we could do that in R.

We can plot the results of the model simulation.

What if we wanted to simulate a predictor variable x that is binary instead of having a Normal distribution. We can use the rbinom() function to simulate binary random variables.

Then we can procede with the rest of the model as before.

We can also simulate from generalized linear model where the errors are no longer from a Normal distribution but come from some other distribution. For examples, suppose we want to simulate from a Poisson log-linear model where

\[ Y \sim Poisson(\mu) \]

\[ \log \mu = \beta_0 + \beta_1 x \]

and \(\beta_0=0.5\) and \(\beta_1=0.3\). We need to use the rpois() function for this

Now we need to compute the log mean of the model and then exponentiate it to get the mean to pass to rpois().

You can build arbitrarily complex models like this by simulating more predictors or making transformations of those predictors (e.g. squaring, log transformations, etc.).

20.4 Random Sampling

Watch a video of this section

The sample() function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions of numbers.

To sample more complicated things, such as rows from a data frame or a list, you can sample the indices into an object rather than the elements of the object itself.

Here’s how you can sample rows from a data frame.

Now we just need to create the index vector indexing the rows of the data frame and sample directly from that index vector.

Other more complex objects can be sampled in this way, as long as there’s a way to index the sub-elements of the object.

20.5 Summary

  • Drawing samples from specific probability distributions can be done with “r” functions
  • Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
  • The sample() function can be used to draw random samples from arbitrary vectors
  • Setting the random number generator seed via set.seed() is critical for reproducibility