Chapter 9 Limit Theorems and Conditional Expectation

“Conditioning is the soul of Statistics” - Joe Blitzstein



These two topics aren’t necessarily directly related, but both are vital concepts in Statistics. Limit Theorems discuss long-run random variable behavior, and are extremely useful in applied Statistics. Conditional Expectation can be a very tricky and subtle concept; we’ve seen how important it is to ‘think conditionally,’ and we now apply this paradigm to expectation.

Law of Large Numbers

‘Limit Theorems,’ as the name implies, are simply results that help us deal with random variables as we take a limit. The first limit theorem that we will discuss, the Law of Large Numbers (often abbreviated LLN), is an intuitive result; it essentially says that the sample mean of a random variable will eventually approach the true mean of that random variable as we take more and more draws. We will formalize this further.

Consider i.i.d. random variables \(X_1, X_2, ..., X_n\). Let the mean of each random variable be \(\mu\) (i.e., \(E(X_1) = \mu\), \(E(X_2) = \mu\), etc.). We define the sample mean \(\bar{X}_n\) to be:

\[\bar{X}_n = \frac{X_1 + ... + X_n}{n}\]

So, the \(n\) in the subscript of \(\bar{X}_n\) just determines how many random variables we sum (we then divide by \(n\), because we want to take a mean).

Take a second to think about this definition. We are essentially taking \(n\) draws from a distribution (you can think of each \(X\) term as a draw from a specific distribution) and dividing the total of these draws by \(n\), or the number of draws. It’s key to realize that the sample mean \(\bar{X}_n\) is itself random. Remember, a function of random variables is still a random variable, and here, the sample mean is most certainly a function (specifically, the ‘mean’ function) of other random variables. It makes sense that this sample mean will fluctuate, because the components that make it up (the \(X\) terms) are themselves random.

Now, on to the actual results. We have two different versions of the LLN:

Strong Law of Large Numbers: The sample mean \(\bar{X}_n\) goes to the true mean \(\mu\) as \(n \rightarrow \infty\) with probability 1. This is a formal way of saying that the sample mean will definitely approach the true mean.

Weak Law of Large Numbers: For all \(\epsilon > 0\), \(P(|\bar{X}_n - \mu| > \epsilon) \rightarrow 0\) as \(n \rightarrow \infty\). This is a formal way of saying that the probability that \(\bar{X}_n\) is more than \(\epsilon\) away from \(\mu\) goes to 0 as \(n\) grows; the idea is that we can imagine \(\epsilon\) being very small (so \(\bar{X}_n\) must be very close to, and essentially equal to, \(\mu\)).

We won’t consider the proofs of these limit theorems here (you can prove the Weak Law using ‘Chebyshev’s Inequality,’ which you can read about on William Chen and Professor Blitzstein’s cheatsheet). You can see how the ‘Strong’ law is ‘stronger’ than the ‘Weak’ law: we say that the sample mean will converge to the true mean with probability 1, not just that it becomes increasingly likely to be within \(\epsilon\) of the true mean.
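For the curious, here is a rough sketch of how the Chebyshev argument for the Weak Law goes. It assumes that each \(X_j\) has a finite variance \(\sigma^2\) (an extra assumption made only for this sketch), and it uses the fact that \(Var(\bar{X}_n) = \frac{\sigma^2}{n}\), which is derived in the Central Limit Theorem section of this chapter. Chebyshev’s Inequality bounds the probability of being far from the mean by the variance divided by the squared distance:

\[P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \rightarrow 0 \text{ as } n \rightarrow \infty\]

For any fixed \(\epsilon > 0\), the bound on the right shrinks to 0 as \(n\) grows, which is exactly the statement of the Weak Law.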

Anyways, let’s think about the Strong version: that the sample mean will definitely go to the true mean. This is an intuitive result. Imagine if we were flipping a fair coin over and over and keeping track of the ‘running mean’ of the number of heads (i.e., the average number of heads from the first two flips, then the average number of heads from the first three flips, etc.). It makes sense that the running mean might be far from the true mean of \(1/2\) in the earlier stages: maybe we get a lot of tails in the first group of flips. However, we can envision that, as we continue to flip more and more coins, the running mean should even out and approach the true mean of \(1/2\).
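The running mean described above can be sketched in a few lines of R. This is only an illustration: the seed, the number of flips and the variable names (`flips`, `running_mean`) are arbitrary choices for this sketch.

```r
#arbitrary seed, only so that the sketch is reproducible
set.seed(110)

#flip a fair coin many times (1 = heads, 0 = tails)
n_flips = 10000
flips = rbinom(n_flips, 1, 1/2)

#running mean: proportion of heads among the first k flips, for each k
running_mean = cumsum(flips) / seq_len(n_flips)

#look at the running mean after 10, 100, 1000 and 10000 flips
running_mean[c(10, 100, 1000, 10000)]
```

The early values can be noticeably far from \(1/2\), while the later values should settle close to it.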

This ‘coin example’ is just another way of stating the process of taking a sample mean and letting \(n\) grow. You can further explore the LLN with our Shiny app; reference this tutorial video for more.


Central Limit Theorem

We’ll call this the CLT for short. This is the second limit theorem that we will discuss, and it also deals with the long-run behavior of the sample mean as \(n\) grows. We’ve referenced this result in this book, and you’ve likely heard it in an introductory Statistics context. In general, the colloquial result is that ‘everything becomes Normal eventually.’ We can, of course, formalize this a bit more.

Consider i.i.d. random variables \(X_1, X_2, ..., X_n\), each with mean \(\mu\) and variance \(\sigma^2\) (don’t get tripped up here: we’ve only seen \(\mu\) and \(\sigma^2\) used as a mean and variance in the context of a Normal distribution, but we can call any mean \(\mu\) and any variance \(\sigma^2\); it doesn’t necessarily have to be a Normal distribution! For example, if we had that each \(X\) was distributed \(Unif(0, 1)\), then \(\mu = 1/2\) and \(\sigma^2 = 1/12\)).

Again, define \(\bar{X}_n\) as the ‘sample mean’ of \(X\). We can write this out as:

\[\bar{X}_n = \frac{X_1 + ... + X_n}{n}\]

As discussed above, we know that the sample mean is itself a random variable, and we know that it approaches the true mean in the long run by the LLN. However, are we able to nail down a specific distribution for this random variable as we approach the long run, not just the value that it converges to? A good place to start is to find the mean and variance; these parameters won’t tell us what the distribution is, but they will be useful once we determine the distribution. To find the expectation and variance, we can just ‘brute force’ our calculations (i.e., just go ahead and do it!). First, for the expectation, we take the expectation of both sides:

\[E(\bar{X}_n) = E\big(\frac{X_1 + ... + X_n}{n}\big)\]

By linearity:

\[= E\big(\frac{X_1}{n}\big) + ... + E\big(\frac{X_n}{n}\big)\]

Since \(n\) is a known constant, we can factor it out of the expectation:

\[= \frac{1}{n}E(X_1) + ... + \frac{1}{n}E(X_n)\]

Now, we are left with the expectation, or mean, of each \(X\). Do we know these values? Well, recall above that, by the set-up of the problem, each \(X\) is a random variable with mean \(\mu\). That is, the expectation of each \(X\) is \(\mu\). We get:

\[= \frac{\mu}{n} + ... + \frac{\mu}{n}\]

We have \(n\) of these terms, so they sum to:

\[= n \cdot \frac{\mu}{n} = \mu\]

So, we get that \(E(\bar{X}_n) = \mu\). Think about this result: it says that the average of the sample mean is equal to \(\mu\), where \(\mu\) is the average of each of the random variables that make up the sample mean. This is intuitive; the sample mean should have an average of \(\mu\) (you could say that \(\bar{X}_n\) is unbiased for \(\mu\), since it has expectation \(\mu\); this is a concept that you will explore more in a more applied Statistics context). Let’s now turn to the Variance:

\[Var(\bar{X}_n) = Var\big(\frac{X_1 + ... + X_n}{n}\big)\]

We know that the \(X\) terms are independent, so the variance of the sum is the sum of the variances:

\[= Var\big(\frac{X_1}{n}\big) + ... + Var\big(\frac{X_n}{n}\big)\]

Since \(n\) is a constant, we factor it out (remembering to square it):

\[= \frac{1}{n^2}Var(X_1) + ... + \frac{1}{n^2}Var(X_n)\]

Do we know the variance of each \(X\) term? In the set-up of the problem, it was defined as \(\sigma^2\), so we can simply plug in \(\sigma^2\) for each variance:

\[= \frac{\sigma^2}{n^2} + ... + \frac{\sigma^2}{n^2}\]

We have \(n\) of these terms (since we have \(n\) random variables and thus \(n\) variances) so this simplifies to:

\[= \frac{\sigma^2}{n}\]

Consider this result for a moment. First, consider when \(n = 1\). In this case, the sample mean is just \(X_1\) (since, by definition, we would have \(\frac{X_1}{1} = X_1\)). The variance that we calculated above, \(\frac{\sigma^2}{n}\), comes out to \(\sigma^2\) when \(n = 1\), which makes sense, since this is just the variance of \(X_1\). Next, consider what happens to this variance as \(n\) grows; it gets smaller and smaller, since \(n\) is in the denominator. Does this make sense? As \(n\) grows, we are essentially adding up more and more random variables in our sample mean calculation. It makes sense, then, that the overall sample mean will have less variance; among other things, adding up more random variables means that the effect of ‘outlier’ random variables is lessened (i.e., if we observe an extremely large value for \(X_1\), it is mitigated by the sheer number of random variables).

So, we found that the sample mean has mean \(\mu\) and variance \(\frac{\sigma^2}{n}\), where \(\mu\) is the mean of each underlying random variable, \(\sigma^2\) is the variance of each underlying random variable, and \(n\) is the total number of random variables. Now that we have the parameters, we are ready for the main result of the CLT.
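We can check these two parameters with a quick simulation. This is a sketch under some arbitrary choices: the underlying draws are \(Unif(0, 1)\) (so \(\mu = 1/2\) and \(\sigma^2 = 1/12\)), and the seed, the number of simulations and the values of \(n\) are picked only for illustration.

```r
#arbitrary seed, only so that the sketch is reproducible
set.seed(110)
sims = 10000

#Unif(0, 1) draws have mean 1/2 and variance 1/12
ns = c(1, 10, 100)
results = t(sapply(ns, function(n) {
  #each row of the matrix is one sample of size n; rowMeans gives the sample means
  xbar = rowMeans(matrix(runif(sims * n), nrow = sims, ncol = n))
  c(n = n, mean = mean(xbar), var = var(xbar), theory_var = (1/12) / n)
}))

#the 'mean' column should be near 1/2, and 'var' should be near 'theory_var'
results
```

The simulated mean stays near \(1/2\) for every \(n\), while the simulated variance shrinks in step with \(\frac{\sigma^2}{n}\).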

The CLT states that, for large \(n\), the distribution of the sample mean approaches a Normal distribution. This is an extremely powerful result, because it holds no matter what the distribution of the underlying random variables (i.e., the \(X\)’s) is, as long as the mean \(\mu\) and variance \(\sigma^2\) are finite. We know that Normal random variables are governed by the mean and variance (i.e., these are the two parameters), and we already found the mean and variance of \(\bar{X}_n\), so we can say:

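\[\bar{X}_n \rightarrow^D N(\mu, \frac{\sigma^2}{n})\]

Where \(\rightarrow^D\) means ‘converges in distribution’; it’s implied here that this convergence takes place as \(n\), or the number of underlying random variables, grows. Since the right-hand side still depends on \(n\), this is best read as an approximation for large \(n\); the precise statement is that the standardized sample mean \(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}\) converges in distribution to \(N(0, 1)\).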

Think about this distribution as \(n\) gets extremely large. The mean, \(\mu\), will be unaffected, but the variance will be close to 0, so the distribution will essentially be a constant (specifically, the constant \(\mu\) with no variance). This makes sense: if we take an extremely high number of draws from a distribution, we should get that this sample mean is at the true mean, with very little variance. It’s also the result we saw from the LLN, which said that the sample mean approaches a constant: as \(n\) grows here, we approach a variance of 0, which essentially means we have a constant (since constants have variance 0). The CLT just describes the distribution ‘on the way’ to the LLN convergence.
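To see the Normal shape emerge from a decidedly non-Normal distribution, here is a small simulation sketch. The choices are arbitrary: the underlying draws are \(Expo(1)\) (which is skewed, with mean 1 and variance 1), \(n = 100\), and the variable names are just for this illustration. We standardize the sample means and compare a few empirical probabilities with the standard Normal CDF.

```r
#arbitrary seed, only so that the sketch is reproducible
set.seed(110)
sims = 10000
n = 100

#each row is one sample of n Expo(1) draws; rowMeans gives the sample means
xbar = rowMeans(matrix(rexp(sims * n, rate = 1), nrow = sims, ncol = n))

#standardize using mu = 1 and sigma^2 / n = 1 / n
z = (xbar - 1) / sqrt(1 / n)

#compare P(Z <= q) from the simulation with the standard Normal CDF
cutoffs = c(-1, 0, 1)
empirical = sapply(cutoffs, function(q) mean(z <= q))
rbind(empirical = empirical, normal = pnorm(cutoffs))
```

The two rows should be close, though not identical: for finite \(n\) the Normal distribution is only an approximation, and the skew of the Exponential has not completely washed out.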

Hopefully this brings some clarity to the statement “everything becomes Normal”: taking the sum of a large number of i.i.d. random variables (we worked with the sample mean here, but the sample mean is just the sum divided by a constant \(n\)), regardless of the underlying distribution of the random variables, yields an approximately Normal distribution. You can further explore the CLT with our Shiny app; reference this tutorial video for more.


Conditional Expectation

Let’s walk through what the idea of ‘Conditional Expectation’ means intuitively and provide some (hopefully) illuminating examples. Recall that \(P(A|B)\), where \(A\) and \(B\) are events, gives the probability of event \(A\) occurring given that event \(B\) occurred (that is, a conditional probability). We can analogously define, then, \(E(X|A)\), where \(X\) is a random variable and \(A\) is an event, as the expectation of \(X\) given that \(A\) occurred.

This shouldn’t be too big a step. For example, let’s define \(X\) as a random variable that counts the number of heads in 2 flips of a fair coin. Let \(A\) be the event that the first coin flip shows tails. Then \(E(X|A)\) is .5. Why? Well, conditioning on \(A\) occurring, we know that the first flip was tails, so we have one more flip that could possibly be heads. The expectation of the number of heads on this one single flip is .5, so we have 0 + .5 = .5.

This past example is relatively easy, but we can formalize this concept with something that is analogous to the Law of Total Probability (LOTP). With \(X\) as a random variable and \(A\) as an event:

\[E(X) = E(X|A)P(A) + E(X|A^c)P(A^c)\]

Similar to LOTP, this is called the Law of Total Expectation, or LOTE for short. This makes sense; we’re splitting apart the two outcomes for \(A\) (either \(A\) occurs or it does not occur), taking the expectation of \(X\) in both states and weighting each expectation by the probability that we’re in that state. It’s the same as LOTP, but for expectation.

In the example above, where \(X\) was the number of heads in 2 fair flips and \(A\) is the event ‘tails on the first flip,’ we can show that this equation holds. \(E(X)\) is 1, by the story of a Binomial random variable. \(P(A) = P(A^c) = .5\), since each flip has a .5 probability of coming up heads. \(E(X|A)\) is .5, as we argued above. \(E(X|A^c)\) is 1.5, since if \(A^c\) occurs then we know we already have 1 heads and we have 1 more flip, which has expectation .5 heads, so we get 1 + .5 = 1.5. Putting it all together:

\[.5 \cdot .5 + .5 \cdot 1.5 = 1\]

Which is the same as \(E(X)\), and thus our simple LOTE sanity check holds.
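The same sanity check can be run empirically. The sketch below simulates the two flips many times and estimates each piece of LOTE from the simulated data; the seed and the names (`flip1`, `flip2`, `lote`) are arbitrary choices for this illustration.

```r
#arbitrary seed, only so that the sketch is reproducible
set.seed(110)
sims = 10000

#two fair coin flips (1 = heads, 0 = tails)
flip1 = rbinom(sims, 1, 1/2)
flip2 = rbinom(sims, 1, 1/2)

#X counts heads; A is the event 'tails on the first flip'
X = flip1 + flip2
A = (flip1 == 0)

#estimate each piece of LOTE from the simulations
lote = mean(X[A]) * mean(A) + mean(X[!A]) * mean(!A)

#should get roughly .5, 1.5, 1 and 1
mean(X[A]); mean(X[!A]); mean(X); lote
```

Note that `lote` matches `mean(X)` up to floating point error, not just approximately: splitting a sample average into two groups and weighting by the group proportions is the sample version of LOTE.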

This example demonstrated conditional expectation given an event. Things get a little bit trickier when you think about conditional expectation given a random variable. The best way to frame this topic is to realize that when you are taking an expectation, you are making a prediction of what value the random variable will take on. You are finding the average value of that random variable.

Let’s say \(X\) is a random variable, and we want to find \(E(X|X)\). What this is saying, in words, is the best prediction of the random variable \(X\) given that we know the random variable \(X\). In this case, then, we are trying to predict \(X\) and we know \(X\). So, our best prediction of \(X\) is \(X\), since we know \(X\). That is, \(E(X|X) = X\) (a bit tricky, right?).

There’s a bit of a notational snag here, and you might see a similar expression written in a different form. For example, we might see \(E(X|X = x)\). This is asking for the expectation of the random variable \(X\), given that we know the random variable \(X\) crystallizes to the value \(x\) (e.g., a standard normal takes on a value 0; in this case, \(X\) is the standard normal and \(0\) is \(x\)). Since we know that \(X\) took on the constant value \(x\), we know that \(E(X|X = x) = x\).

The difference between these two, \(E(X|X)\) and \(E(X|X = x)\), is simply that in the second case we specify that \(X\) takes on \(x\), and in the first case we don’t. Usually, the second term is considered more ‘long-hand’ notation (often, people consider them to mean the same thing).

One more extension of this concept is \(E(h(X)|X)\), where \(h\) is some function (maybe \(h(y) = y^2\)). In this case, you can probably guess what the answer is: \(E(h(X)|X) = h(X)\). Why is that? Well, we are assuming that we know \(X\), and we are trying to get a best guess for \(h(X)\), so all we do is plug in \(X\), which we know, into our function \(h\). In the long-hand notation, we might have \(E(h(X)|X = x) = h(x)\), just like we saw above (if we consider \(X\) specifically crystallizing to \(x\)).

Now we’ll move a step further and think about conditional expectation for different random variables; that is, \(E(Y|X)\). Again, this means that we want the best prediction for \(Y\) given that we know the random variable \(X\). A good way to frame this is to think about how the distribution of \(Y\) changes conditionally; maybe \(Y\) has a different distribution if we know the value of \(X\), and we then have to find the expectation of this distribution.

Let’s consider an example. Say that \(Y \sim Bin(X, .5)\), and let \(X\) also be a random variable such that \(X \sim DUnif(1,10)\) (here, \(DUnif\) stands for the Discrete Uniform distribution, or a Uniform that can only take on integer values: recall that the usual Uniform distribution is continuous and thus the support includes more than just integers. For this specific example, then, \(X\) takes on an integer from 1 to 10. You can think of the overall structure as flipping a coin - the Binomial random variable - and counting heads, where the number of total flips, or \(X\), is random). Let’s say that we’re interested in \(E(Y|X)\). Well, we know that since \(Y\) is a Binomial, its expectation is the number of trials times the probability of success on each trial. Of course, the number of trials, \(X\), is random, but if we condition on it, then we can say that we know the value. So, in this case, since \(X\) is ‘known,’ we can say \(E(Y|X) = \frac{X}{2}\). The long hand, if we actually knew what value \(X\) crystallizes to, would be \(E(Y|X = x) = \frac{x}{2}\), which is maybe a bit more intuitive because it shows how we are conditioning on \(X\) mapping to a specific value.

It’s important to remember that in this case, when we are considering \(E(Y|X)\), we are still left with a random variable. We saw that \(E(Y|X) = \frac{X}{2}\), and recall that \(X\) is random, so we haven’t actually given an answer that is constant. More on this later…
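One way to see \(E(Y|X = x) = \frac{x}{2}\) in action is to simulate many \((X, Y)\) pairs and average \(Y\) separately for each observed value of \(X\). The sketch below does this; the seed, the number of simulations and the name `cond_means` are arbitrary choices for this illustration.

```r
#arbitrary seed, only so that the sketch is reproducible
set.seed(110)
sims = 100000

#random number of flips, then the number of heads given that many flips
X = sample(1:10, sims, replace = TRUE)
Y = rbinom(sims, X, 1/2)

#average Y within each value of X; this estimates E(Y|X = x) for x = 1, ..., 10
cond_means = tapply(Y, X, mean)

#the two rows should be close
rbind(simulated = cond_means, theory = (1:10) / 2)
```

We get ten different conditional means, one for each value of \(X\), which is exactly why \(E(Y|X)\) is a random variable: which of these ten values we end up with depends on the random \(X\).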

Finally, if \(X\) and \(Y\) are independent, then the conditional expectation simplifies. You can probably guess, but in the case of independence, \(E(Y|X) = E(Y)\). That’s because knowing the r.v. \(X\) does not give any information that is helpful in predicting \(Y\), so the conditional expectation is equal to the marginal expectation. This is analogous to how \(P(A|B) = P(A)\) if events \(A\) and \(B\) are independent.

Adam and Eve

Let’s talk now about two very useful results of conditional expectation. We’ll present them, and then walk through an example to show how they can be useful. First is Adam’s Law. For random variables \(X\) and \(Y\):

\[E(Y) = E\big(E(Y|X)\big)\]

This is also commonly called the Law of Iterated Expectation or the Tower Property.

Before we delve deeper into this, think about what the two sides are giving. The left side is the classic expectation that we are used to, which returns a constant (the average of \(Y\)). The inner part of the second side, \(E(Y|X)\), is itself a random variable (remember the example above with a random number of flips). Then, we apply another expectation, so we have the expectation of a random variable on the right side, just like the expectation of the random variable \(Y\) on the left side. It’s like we have two ‘levels’ of randomness, \(X\) and \(Y\), so we need two expectation ‘operators’ to aggregate out the randomness and get a constant. The first \(E()\) aggregates out the randomness from \(Y\), where we keep \(X\) fixed, and the second \(E()\) aggregates out the randomness of \(X\) (‘aggregate’ is a good word here, especially if we think about expectation in terms of LOTUS: summing/integrating a function times a PMF/PDF. You can even think about this in terms of a double integral: first, we take the integral with respect to one variable while we hold the other variable constant. When the dust clears from the first integral, we integrate over the remaining variable). So, mechanically at least, this makes sense. Now, we can turn to an example where conditional expectation and Adam’s Law are useful:

Click here to watch this video in your browser.
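For discrete random variables, the ‘aggregating’ idea behind Adam’s Law can be written out directly. This is a sketch rather than a full proof: it assumes \(X\) and \(Y\) are both discrete and that \(E(Y)\) is finite, so that the order of the two sums can be swapped.

\[E\big(E(Y|X)\big) = \sum_x E(Y|X = x)P(X = x) = \sum_x \sum_y y P(Y = y|X = x)P(X = x)\]

\[= \sum_y y \sum_x P(Y = y, X = x) = \sum_y y P(Y = y) = E(Y)\]

The inner sum over \(y\) is the first expectation (with \(X\) held at \(x\)), and the outer sum over \(x\) is the second expectation; the last line uses LOTP to collapse the joint PMF into the marginal PMF of \(Y\).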

Now, let’s move on and discuss Eve’s Law. The formula for Eve’s Law is given by:

\[Var(Y) = E\big(Var(Y|X)\big) + Var\big(E(Y|X)\big)\]

The name comes from the fact that we have an \(E\) on the RHS, followed by \(V\)’s, followed by another \(E\). More on the intuition on this later. You’re probably thinking ‘where the heck do these come from, and why is this useful?’ Let’s do an example that will hopefully satiate these concerns.

Example 9.1

Recall the example we did earlier in the chapter. Let \(Y\) be Binomial, and let \(X \sim DUnif(1,10)\), or a random integer from 1 to 10. Conditional on \(X\), we know the distribution of \(Y\); that is, \(Y|X \sim Bin(X,.5)\). We found that the conditional expectation, \(E(Y|X)\), is \(\frac{X}{2}\), but we saw that the answer contains \(X\) and thus is still random.

Now let’s think about if we want to find \(E(Y)\), or the average value \(Y\) takes on. There are two levels of randomness here, essentially. First, you randomly select how many times you flip the coin (since we have a Binomial with probability parameter .5, this is essentially flipping a coin and counting heads), and then you actually have to flip the coin that specified number of times. We know using Adam’s law that:

\[E(Y) = E\big(E(Y|X)\big)\]

And we know that \(E(Y|X) = \frac{X}{2}\), so we are left with:

\[E(Y) = E(\frac{X}{2}) = \frac{E(X)}{2}\]

We know that \(E(X) = 5.5\) (just think about what the average should be if you draw a random integer from 1 to 10), so we are left with \(E(Y) = \frac{5.5}{2} = 2.75\).
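If you want to check the arithmetic, the outer expectation is just a weighted sum over the ten equally likely values of \(X\). A minimal sketch (the names `x`, `p_x`, `E_X` and `E_Y` are just for this illustration):

```r
#support and PMF of X ~ DUnif(1, 10)
x = 1:10
p_x = rep(1/10, 10)

#E(X): each value weighted by its probability
E_X = sum(x * p_x)

#Adam's law: E(Y) = E(E(Y|X)) = E(X / 2), computed with LOTUS
E_Y = sum((x / 2) * p_x)

#should get 5.5 and 2.75
E_X; E_Y
```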

Let’s think about what we did here, since the implications are very helpful in illuminating what’s going on in Adam’s law (as discussed in the video example above). We wanted to find the expectation of \(Y\). However, this was a little tricky, since \(Y\) depends largely on another random variable, \(X\). So, we broke up our expectation of \(Y\) into \(E\big(E(Y|X)\big)\), and first we took the expectation of \(Y\) conditional on \(X\) (first level of expectation to undo the first level of randomness, \(Y\), while we held the second level of randomness, \(X\), constant by conditioning on it) and then we took the expectation with respect to \(X\) (second level of expectation to undo the second level of randomness). Basically, first we pretended \(X\) was fixed, and then we took another expectation.

Sounds tricky, but, as mentioned above, think about it as if there are two random parts: the number of coin flips, and then the actual flips of the coin. We took the expectation of the actual flips and pretended we knew how many coin flips there would be (thinking conditionally) and then when we had this expectation in terms of a random number of coin flips, we took the expectation of this random variable. It seems strange, but it helps when \(Y\) depends on something random (\(X\)). We fix \(X\) first and go from there.

Now let’s compute \(Var(Y)\). We know from Eve’s Law:

\[Var(Y) = E\big(Var(Y|X)\big) + Var\big(E(Y|X)\big)\]

Let’s look at the first term. Recall the conditional distribution: \(Y|X \sim Bin(X,.5)\), so we know \(Var(Y|X) = \frac{X}{4}\). We then take the expectation of this, which becomes \(\frac{5.5}{4} = 1.375\).

Then we have the second term. We found before that \(E(Y|X) = \frac{X}{2}\), and then we apply the Variance operator to get \(Var(\frac{X}{2}) = \frac{1}{4}Var(X)\). The variance of a \(DUnif(a,b)\) is given by \(\frac{(b - a + 1)^2 - 1}{12}\) (not going to prove this here, since we don’t often work with this distribution) so we are left with, because \(a = 1\) and \(b = 10\), \(\frac{(10 - 1 + 1)^2 - 1}{12} = 8.25\) for \(Var(X)\). Putting it all together:

\[1.375 + 8.25/4 = 3.4375 \approx 3.44\]
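Both terms of Eve’s Law can also be computed exactly in R, and the total can be checked against a direct calculation that does not use Eve’s Law at all. The sketch below builds the marginal PMF of \(Y\) with LOTP and then computes \(E(Y^2) - (E(Y))^2\); all of the variable names are just for this illustration.

```r
#support and PMF of X ~ DUnif(1, 10)
x = 1:10
p_x = rep(1/10, 10)

#Var(X) = E(X^2) - (E(X))^2
Var_X = sum(x^2 * p_x) - sum(x * p_x)^2

#the two terms of Eve's law
within_var = sum((x / 4) * p_x)  #E(Var(Y|X)) = E(X / 4)
between_var = Var_X / 4          #Var(E(Y|X)) = Var(X / 2)
eve = within_var + between_var

#direct check: marginal PMF of Y by LOTP, P(Y = k) = sum over x of P(Y = k|X = x)P(X = x)
y = 0:10
p_y = sapply(y, function(k) sum(dbinom(k, x, 1/2) * p_x))
direct = sum(y^2 * p_y) - sum(y * p_y)^2

#both should be 3.4375
eve; direct
```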

Let’s confirm this with a simulation in R. We’ll simulate \(X\), then \(Y\) based on \(X\), then see if the mean and variance of \(Y\) match our results.

```r
sims = 1000

#generate r.v.'s
X = sample(1:10, sims, replace = TRUE)

#generate Y based on X
Y = sapply(X, function(x) rbinom(1, x, 1/2))

#should get 2.75 and 3.44
mean(Y); var(Y)
## [1] 2.661
## [1] 3.335414
```

The specific numerical answer is not super important, since it doesn’t really help for intuition. What’s important is understanding the steps we took to get here. We can also now pause and actually think about intuition for Eve’s Law. Remember the random variable that we’re finding the variance of: the number of heads for a random number of coin flips. Think about how we would find this variance. There are two sources of variability: the number of coins that we will flip is random, and then the outcome of these flips is also random. You can think of the first term, the \(E\big(Var(Y|X)\big)\), as the variance of the actual flips for the experiment. In words, this term is asking for the average variance for the Binomial random variable, which is \(Y|X\). The second term, \(Var\big(E(Y|X)\big)\), marks the variance from the other source: the random number of flips. This is asking for the variance of the average of the Binomial; that is, how much the Binomial changes across different scenarios (i.e., when we flip the coin 3 times, flip the coin 9 times, etc.).

In a more general case, you can think of the \(E\big(Var(Y|X)\big)\) term as the ‘in-group variance.’ Here, it captures the variance of a specific Binomial. You can think of the other term, \(Var\big(E(Y|X)\big)\), as the variance between groups, or ‘inter-group variance.’ That is, the 4 flip Binomial is different from the 7 flip Binomial, and this term captures this Variance.
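We can watch this split happen in simulated data by grouping the draws of \(Y\) by the value of \(X\). This is a sketch with arbitrary choices of seed, simulation size and variable names: the ‘in-group’ piece is the average of the ten within-group variances, and the ‘between-group’ piece is the spread of the ten group means around the overall mean (each weighted by how often the group occurs).

```r
#arbitrary seed, only so that the sketch is reproducible
set.seed(110)
sims = 100000

#random number of flips, then the number of heads
X = sample(1:10, sims, replace = TRUE)
Y = rbinom(sims, X, 1/2)

#variance and mean of Y within each group (each value of X)
group_var = tapply(Y, X, var)
group_mean = tapply(Y, X, mean)

#proportion of simulations that landed in each group
group_weight = as.numeric(table(X)) / sims

#in-group variance (should be near 1.375) and between-group variance (near 2.0625)
within_var = sum(group_weight * group_var)
between_var = sum(group_weight * (group_mean - mean(Y))^2)

#the sum of the two pieces should be very close to the overall variance of Y
c(within = within_var, between = between_var,
  total = within_var + between_var, var_Y = var(Y))
```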

This exercise was just to develop some intuition on why Eve’s law holds. It’s pretty convenient that we can just add up these in-group and between-group variances to get the total variance, but we won’t prove it here. What’s probably more important is realizing when to use Adam and Eve. Hopefully the above example will provide some clarity: they are useful when we have a conditional distribution, and more generally when there is something in the problem that we wish we knew.




You observe a sequence of \(n\) normal random variables. The first is a standard normal: \(X_1 \sim N(0, 1)\). Then, the second random variable in the sequence has variance 1 and mean of the first random variable, so \(X_2|X_1 \sim N(X_1, 1)\). In general, \(X_j|X_{j - 1} \sim N(X_{j - 1}, 1)\).

  1. Find \(E(X_n)\) for \(n \geq 2\).

  2. Find \(Var(X_n)\).


Let \(X \sim N(0, 1)\) and \(Y|X \sim N(X, 1)\). Find \(Cov(X, Y)\).


(With help from Matt Goldberg)

You need to design a way to split $100 randomly among three people. ‘Random’ here means symmetrical: if \(X_i\) is the amount that the \(i^{th}\) person receives, then \(X_1,X_2\) and \(X_3\) must be i.i.d. Also, the support of \(X_1\) must be from 0 to 100; otherwise, we could simply assign each person a constant $33.33.

  1. Consider the following scheme to randomly split the money: you draw a random value from 0 to 100 and give that amount to the first person, then you draw a random amount from 0 to the amount of money left ($100 minus what you gave to the first person) and give that to the second person, etc. Show why this scheme violates the ‘symmetrical’ property that we want.

  2. Consider the following scheme: generate three values, one for each person, from a \(Unif(0, 1)\) r.v. Then, normalize the values (divide each value by the sum of the three values) and assign each person the corresponding proportional value out of $100 (i.e., if the first person has a normalized value of .4, give him $40). Show why this scheme results in the correct expectation for each person (i.e., the expectation satisfies the property of symmetry, and each person expects $33.33).


Let \(X \sim N(0, 1)\) and \(Y = |X|\). Find \(Corr(X, Y)\).


CJ has \(X \sim Pois(\lambda)\) chores to run, and will spend \(M \sim Pois(\lambda)\) time at each chore. Time spent at each chore is independent of the number of chores and the time spent at other chores. Let \(Y\) be the total amount of time spent doing chores. Find \(E(Y)\).


Brandon is a cell. During his life cycle, he has \(Pois(\lambda)\) offspring before he dies.

  1. Each of his descendants have i.i.d. \(Pois(\lambda)\) offspring distributions (i.e., like Brandon they independently have \(Pois(\lambda)\) children before they die). Let \(X_n\) be the size of the \(n^{th}\) generation, and let Brandon be the \(0^{th}\) generation (so \(X_0 = 1\), since Brandon is just 1 cell, and \(X_1\) is the number of children that Brandon has, or the first generation). Find \(E(X_n)\).

  2. Discuss for what values of \(\lambda\) the generation mean goes to infinity as \(n\) grows, and for what values the generation mean goes to 0.


Brandon is a cell. During his life cycle, he has \(Pois(\lambda)\) offspring before he dies. Each of his descendants have i.i.d. \(Pois(\lambda)\) offspring distributions (i.e., like Brandon they independently have \(Pois(\lambda)\) children before they die). Let \(X_n\) be the size of the \(n^{th}\) generation, and let Brandon be the \(0^{th}\) generation (so \(X_0 = 1\), since Brandon is just 1 cell, and \(X_1\) is the number of children that Brandon has, or the first generation).

Find \(Var(X_n)\). You can do this by finding a general form for \(Var(X_n)\) in terms of \(Var(X_{n - 1})\), and then using this equation to write out some of the first variances in the sequence (i.e., \(Var(X_1)\), \(Var(X_2)\), etc.). From here, you can guess at the general pattern of the sequence and see what \(Var(X_n)\) will be. Of course, this ‘guess’ is not a formal proof; you could prove this rigorously using induction, but this book is not focused on induction, and the ‘guess’ will suffice!


Each year, the top (Men and Women) collegiate basketball teams in the NCAA square off in a massive, single elimination tournament. The tournament is colloquially known as “March Madness,” and, despite recent expansion to include ‘play-in’ games, it can be thought of as a 64-team tournament. Teams are ‘seeded’ (i.e., assigned a seed, 1 to 16) based on their performance during the season. Lower seed values are better (i.e., 1 is the best, 16 is the worst) and the teams are paired in the first round based on seeds (i.e., they are paired such that the seeds total to 17: each 1 seed plays a 16 seed, each 10 seed plays a 7 seed, etc.).

Of late, the UConn Huskies have had highly successful tournaments. The Men’s Team have won championships in 2011 and 2014 (as well as previously in 1999 and 2004) and the Women’s Team, considered by many the greatest collegiate team in the nation (across all sports), won 4 straight championships from 2013 to 2016, as well as 7 championships from 1995 to 2010.

Of course, in this tournament, it does not make sense to assume that teams are equal; in fact, they are seeded based on their ability. Consider this common model to assess the win probability for a random match up. Let \(a\) be the seed of the first team, and \(b\) be the seed of the second team, and let \(p\) be the probability that the first team wins. We can model \(p\) with a Beta prior such that \(p \sim Beta(b, a)\). Based on this prior, find the probability that the first team wins, and, based on this probability, explain why this is a reasonable choice for a prior (i.e., consider how the probability changes as \(a\) changes relative to \(b\)).


Let \(X\) and \(Y\) be i.i.d. \(N(0, 1)\) and \(Z\) be a random variable such that it takes on the value \(X\) (the value that \(X\) crystallizes to) or the value \(Y\) (the value that \(Y\) crystallizes to) with equal probabilities (recall we saw a similar structure in Chapter 7, where we showed that the vector \((X, Y, Z)\) is not Multivariate Normal). Find \(Cov(X, Z)\).


A stoplight in town toggles from red to green (no yellow). The times for the ‘toggles’ (switching from the current color to the other color) are distributed according to a Poisson process with rate parameter \(\lambda\). If you drive through the stoplight at a random time during the day, what is your expected wait time at the light?


Dollar bills are the base currency in the United States. Bills are used widely in 6 denominations: $1, $5, $10, $20, $50, $100 (the $2 still exists, but is not widely used). Imagine that you randomly select one of these denominational values from $5 to $100 (e.g., $10) and withdraw it from your bank. On average, how many withdrawals must you make to withdraw at least $15?

BH Problems

The problems in this section are taken from Blitzstein and Hwang (2014). The questions are reproduced here, and the analytical solutions are freely available online. Here, we will only consider empirical solutions: answers/approximations to these problems using simulations in R.

BH 9.10

A coin with probability \(p\) of Heads is flipped repeatedly. For (a) and (b), suppose that \(p\) is a known constant, with \(0<p<1\).

  1. What is the expected number of flips until the pattern HT is observed?

  2. What is the expected number of flips until the pattern HH is observed?

  3. Now suppose that \(p\) is unknown, and that we use a Beta(\(a,b\)) prior to reflect our uncertainty about \(p\) (where \(a\) and \(b\) are known constants and are greater than 2). In terms of \(a\) and \(b\), find the corresponding answers to (a) and (b) in this setting.
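
As a rough empirical check of parts (a) and (b), here is a minimal simulation sketch. It assumes a fair coin (\(p = 1/2\) is our choice for illustration), and the helper name `flips_until` is hypothetical rather than anything from BH. For a fair coin the analytical answers are 4 flips for HT and 6 flips for HH, so the two sample means should land close to those values.

```r
set.seed(110)

# Flip a coin with P(Heads) = p until the two-flip `pattern` appears,
# then return the total number of flips used.
flips_until <- function(pattern, p) {
  prev <- sample(c("H", "T"), 1, prob = c(p, 1 - p))
  n_flips <- 1
  repeat {
    cur <- sample(c("H", "T"), 1, prob = c(p, 1 - p))
    n_flips <- n_flips + 1
    if (prev == pattern[1] && cur == pattern[2]) return(n_flips)
    prev <- cur
  }
}

p <- 0.5       # assumed value, for illustration only
sims <- 10000  # number of simulated experiments

ht <- replicate(sims, flips_until(c("H", "T"), p))
hh <- replicate(sims, flips_until(c("H", "H"), p))

# empirical expected waiting times for HT and HH
mean(ht)
mean(hh)
```

Changing `p` lets us see how the gap between the two patterns grows or shrinks as the coin becomes less fair.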

BH 9.13

Let \(X_1,X_2\) be i.i.d., and let \(\bar{X}= \frac{1}{2}(X_1+X_2)\) be the sample mean. In many statistics problems, it is useful or important to obtain a conditional expectation given \(\bar{X}\). As an example of this, find \(E(w_1X_1+w_2X_2 | \bar{X})\), where \(w_1,w_2\) are constants with \(w_1+w_2=1\).

BH 9.15

Consider a group of \(n\) roommate pairs at a college (so there are \(2n\) students). Each of these \(2n\) students independently decides randomly whether to take a certain course, with probability \(p\) of success (where “success” is defined as taking the course).

Let \(N\) be the number of students among these \(2n\) who take the course, and let \(X\) be the number of roommate pairs where both roommates in the pair take the course. Find \(E(X)\) and \(E(X|N)\).
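
One way to sanity check an analytical answer here is the simulation sketch below. The values \(n = 10\) and \(p = 0.4\) are assumptions made for illustration, as are the variable names. Each row of the matrix is one simulated group of students; the first \(n\) columns hold one roommate from each pair and the last \(n\) columns hold the other.

```r
set.seed(110)

n <- 10        # number of roommate pairs (assumed)
p <- 0.4       # probability a student takes the course (assumed)
sims <- 50000

# 1 = takes the course, 0 = does not; columns j and n + j are roommates
takes <- matrix(rbinom(sims * 2 * n, 1, p), nrow = sims)

N <- rowSums(takes)                                      # students taking the course
X <- rowSums(takes[, 1:n] * takes[, (n + 1):(2 * n)])    # pairs where both take it

# empirical E(X)
mean(X)

# empirical E(X | N = k) for each observed value k of N
cond_means <- tapply(X, N, mean)
cond_means
```

The entries of `cond_means` can be compared against a candidate formula for \(E(X|N)\) evaluated at each value of \(N\).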

BH 9.16

Show that \(E( (Y - E(Y|X))^2|X) = E(Y^2|X) - (E(Y|X))^2,\) so these two expressions for \(Var(Y|X)\) agree.

BH 9.22

Let \(X\) and \(Y\) be random variables with finite variances, and let \(W=Y - E(Y|X)\). This is a residual: the difference between the true value of \(Y\) and the predicted value of \(Y\) based on \(X\).

  1. Compute \(E(W)\) and \(E(W|X)\).
  2. Compute \(Var(W)\), for the case that \(W|X \sim N(0,X^2)\) with \(X \sim N(0,1)\).
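
A short simulation sketch of the setup in the second part, useful for checking an analytical answer. It draws \(X\) first and then draws \(W\) with standard deviation \(|X|\), which is what \(W|X \sim N(0, X^2)\) means; the number of simulations is an arbitrary choice.

```r
set.seed(110)

sims <- 100000
X <- rnorm(sims)

# given X = x, W is Normal with mean 0 and standard deviation |x|
W <- rnorm(sims, mean = 0, sd = abs(X))

# empirical E(W) and Var(W)
mean(W)
var(W)
```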

BH 9.23

One of two identical-looking coins is picked from a hat randomly, where one coin has probability \(p_1\) of Heads and the other has probability \(p_2\) of Heads. Let \(X\) be the number of Heads after flipping the chosen coin \(n\) times. Find the mean and variance of \(X\).
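
Here is a minimal simulation sketch for this mixture. The values \(n = 10\), \(p_1 = 0.3\) and \(p_2 = 0.8\) are assumptions picked for illustration. We first pick a coin for each simulated experiment and then draw the number of Heads from the matching Binomial.

```r
set.seed(110)

n <- 10      # flips per experiment (assumed)
p1 <- 0.3    # P(Heads) for the first coin (assumed)
p2 <- 0.8    # P(Heads) for the second coin (assumed)
sims <- 100000

# choose one of the two coins with equal probability in each experiment
coin_prob <- sample(c(p1, p2), sims, replace = TRUE)

# number of Heads in n flips of the chosen coin
X <- rbinom(sims, n, coin_prob)

mean(X)
var(X)
```

Notice that the sample variance is much larger than the variance of either Binomial alone; the extra spread comes from not knowing which coin was picked.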

BH 9.30

Emails arrive one at a time in an inbox. Let \(T_n\) be the time at which the \(n^{th}\) email arrives (measured on a continuous scale from some starting point in time). Suppose that the waiting times between emails are i.i.d. Expo(\(\lambda\)), i.e., \(T_1, T_2 - T_1, T_3 - T_2,...\) are i.i.d. Expo(\(\lambda\)).

Each email is non-spam with probability \(p\), and spam with probability \(q=1-p\) (independently of the other emails and of the waiting times). Let \(X\) be the time at which the first non-spam email arrives (so \(X\) is a continuous r.v., with \(X = T_1\) if the 1st email is non-spam, \(X = T_2\) if the 1st email is spam but the 2nd one isn’t, etc.).

  1. Find the mean and variance of \(X\).

  2. Find the MGF of \(X\). What famous distribution does this imply that \(X\) has (be sure to state its parameter values)?
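
The sketch below simulates \(X\) directly, assuming \(\lambda = 2\) and \(p = 0.25\) (values chosen for illustration). The index of the first non-spam email is a First Success r.v., and the arrival time of the \(k^{th}\) email is a sum of \(k\) i.i.d. Expo(\(\lambda\)) waiting times, which we draw in one step as a Gamma(\(k, \lambda\)).

```r
set.seed(110)

lambda <- 2   # email arrival rate (assumed)
p <- 0.25     # probability an email is non-spam (assumed)
sims <- 100000

# index of the first non-spam email: rgeom counts failures, so add 1
G <- rgeom(sims, p) + 1

# arrival time of email number G: a sum of G i.i.d. Expo(lambda) waiting times
X <- rgamma(sims, shape = G, rate = lambda)

mean(X)
var(X)

# compare a few empirical quantiles with those of an Expo(lambda * p)
probs <- c(0.25, 0.5, 0.75)
rbind(empirical = quantile(X, probs), expo = qexp(probs, rate = lambda * p))
```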

BH 9.33

Judit plays in a total of \(N \sim Geom(s)\) chess tournaments in her career. Suppose that in each tournament she has probability \(p\) of winning the tournament, independently. Let \(T\) be the number of tournaments she wins in her career.

  1. Find the mean and variance of \(T\).

  2. Find the MGF of \(T\). What is the name of this distribution (with its parameters)?
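
A minimal simulation sketch, assuming \(s = 0.2\) and \(p = 0.5\) for illustration. R's `rgeom` counts the number of failures before the first success (so it starts at 0), which matches the Geometric convention used in BH. The variable name `T_wins` is ours; we avoid `T` because it is shorthand for `TRUE` in R.

```r
set.seed(110)

s <- 0.2    # Geometric parameter for the number of tournaments (assumed)
p <- 0.5    # probability of winning each tournament (assumed)
sims <- 100000

# number of tournaments played in each simulated career
N <- rgeom(sims, s)

# number of wins: Binomial(N, p) given N (a career with N = 0 has 0 wins)
T_wins <- rbinom(sims, N, p)

mean(T_wins)
var(T_wins)

# empirical PMF at the first few values, to compare with a candidate distribution
table(T_wins)[1:5] / sims
```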

BH 9.36

A certain stock has low volatility on some days and high volatility on other days. Suppose that the probability of a low volatility day is \(p\) and of a high volatility day is \(q=1-p\), and that on low volatility days the percent change in the stock price is \(N(0,\sigma^2_1)\), while on high volatility days the percent change is \(N(0,\sigma^2_2)\), with \(\sigma_1 < \sigma_2\).

Let \(X\) be the percent change of the stock on a certain day. The distribution is said to be a mixture of two Normal distributions, and a convenient way to represent \(X\) is as \(X=I_1X_1 + I_2X_2\) where \(I_1\) is the indicator r.v. of having a low volatility day, \(I_2=1-I_1\), \(X_j \sim N(0,\sigma^2_j)\), and \(I_1,X_1,X_2\) are independent.

  1. Find \(Var(X)\) in two ways: using Eve’s law, and by calculating \(Cov(I_1X_1 + I_2X_2, I_1X_1 + I_2X_2)\) directly.

  2. Recall from Chapter 6 that the kurtosis of an r.v. \(Y\) with mean \(\mu\) and standard deviation \(\sigma\) is defined by \[Kurt(Y) = \frac{E(Y-\mu)^4}{\sigma^4}-3.\] Find the kurtosis of \(X\) (in terms of \(p,q,\sigma^2_1,\sigma^2_2\), fully simplified). The result will show that even though the kurtosis of any Normal distribution is 0, the kurtosis of \(X\) is positive and in fact can be very large depending on the parameter values.
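
The representation \(X = I_1X_1 + I_2X_2\) translates almost line by line into a simulation. In this sketch, \(p = 0.7\), \(\sigma_1 = 1\) and \(\sigma_2 = 3\) are assumed values for illustration. We estimate the variance and the kurtosis from the simulated draws.

```r
set.seed(110)

p <- 0.7        # probability of a low volatility day (assumed)
sigma1 <- 1     # low volatility standard deviation (assumed)
sigma2 <- 3     # high volatility standard deviation (assumed)
sims <- 1000000

I1 <- rbinom(sims, 1, p)      # indicator of a low volatility day
X1 <- rnorm(sims, 0, sigma1)
X2 <- rnorm(sims, 0, sigma2)

X <- I1 * X1 + (1 - I1) * X2

# empirical variance
var(X)

# empirical kurtosis, following the definition above
kurt <- mean((X - mean(X))^4) / var(X)^2 - 3
kurt
```

Even with both components Normal, the estimated kurtosis comes out well above 0, in line with the claim at the end of the problem.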

BH 9.43

Empirically, it is known that 49% of children born in the U.S. are girls (and 51% are boys). Let \(N\) be the number of children who will be born in the U.S. in March of next year, and assume that \(N\) is a Pois(\(\lambda\)) random variable, where \(\lambda\) is known. Assume that births are independent (e.g., don’t worry about identical twins).

Let \(X\) be the number of girls who will be born in the U.S. in March of next year, and let \(Y\) be the number of boys who will be born then.

  1. Find the joint distribution of \(X\) and \(Y\). (Give the joint PMF.)

  2. Find \(E(N|X)\) and \(E(N^2|X)\).
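
A quick simulation sketch for this setup, assuming \(\lambda = 100\) (a small value chosen so the simulation runs quickly, not a realistic birth count). We draw \(N\), split it into girls and boys, and then look at the marginal behavior of \(X\) and at the sample correlation between \(X\) and \(Y\).

```r
set.seed(110)

lambda <- 100   # Poisson mean for the total number of births (assumed)
sims <- 100000

N <- rpois(sims, lambda)
X <- rbinom(sims, N, 0.49)   # girls: each birth is a girl with probability 0.49
Y <- N - X                   # boys

# for a Poisson r.v., the mean and variance agree
mean(X)
var(X)

# sample correlation between the number of girls and the number of boys
cor(X, Y)

# empirical E(N | X = x) for a few values of x
tapply(N, X, mean)[c("45", "49", "53")]
```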

BH 9.44

Let \(X_1,X_2,X_3\) be independent with \(X_i \sim Expo(\lambda_i)\) (so with possibly different rates). Recall from Chapter 7 that \[P(X_1 < X_2) = \frac{\lambda_1}{\lambda_1 + \lambda_2}.\]

  1. Find \(E(X_1 + X_2 + X_3 | X_1 > 1, X_2 > 2, X_3 > 3)\) in terms of \(\lambda_1,\lambda_2,\lambda_3\).

  2. Find \(P\left(X_1 = \min(X_1,X_2,X_3)\right)\), the probability that the first of the three Exponentials is the smallest.

  3. For the case \(\lambda_1 = \lambda_2 = \lambda_3 = 1\), find the PDF of \(\max(X_1,X_2,X_3)\). Is this one of the important distributions we have studied?
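
Here is a minimal sketch that estimates the probability in the second part by simulation. The rates \(\lambda_1 = 1\), \(\lambda_2 = 2\), \(\lambda_3 = 3\) are assumed values for illustration. (The conditioning event in the first part is extremely rare for these rates, so a direct simulation of that part is not attempted here.)

```r
set.seed(110)

lambdas <- c(1, 2, 3)   # assumed rates for X1, X2, X3
sims <- 100000

X1 <- rexp(sims, rate = lambdas[1])
X2 <- rexp(sims, rate = lambdas[2])
X3 <- rexp(sims, rate = lambdas[3])

# proportion of simulations in which X1 is the smallest of the three
p_hat <- mean(X1 < pmin(X2, X3))
p_hat
```

Swapping in other rates shows how the estimate responds as \(\lambda_1\) grows relative to the other two.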

BH 9.45

A task is randomly assigned to one of two people (with probability 1/2 for each person). If assigned to the first person, the task takes an Expo(\(\lambda_1\)) length of time to complete (measured in hours), while if assigned to the second person it takes an Expo(\(\lambda_2\)) length of time to complete (independent of how long the first person would have taken). Let \(T\) be the time taken to complete the task.

  1. Find the mean and variance of \(T\).

  2. Suppose instead that the task is assigned to both people, and let \(X\) be the time taken to complete it (by whoever completes it first, with the two people working independently). It is observed that after \(24\) hours, the task has not yet been completed. Conditional on this information, what is the expected value of \(X\)?

BH 9.47

A certain genetic characteristic is of interest. It can be measured numerically. Let \(X_1\) and \(X_2\) be the values of the genetic characteristic for two twin boys. If they are identical twins, then \(X_1=X_2\) and \(X_1\) has mean \(0\) and variance \(\sigma^2\); if they are fraternal twins, then \(X_1\) and \(X_2\) have mean \(0\), variance \(\sigma^2\), and correlation \(\rho\). The probability that the twins are identical is \(1/2\). Find Cov(\(X_1,X_2\)) in terms of \(\rho,\sigma^2.\)

BH 9.48

The Mass Cash lottery randomly chooses 5 of the numbers from \(1,2,...,35\) each day (without repetitions within the choice of 5 numbers). Suppose that we want to know how long it will take until all numbers have been chosen. Let \(a_j\) be the average number of additional days needed if we are missing \(j\) numbers (so \(a_{0}=0\) and \(a_{35}\) is the average number of days needed to collect all 35 numbers). Find a recursive formula for the \(a_j\).
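
The recursion asked for here can be checked numerically against a simulation of \(a_{35}\). The sketch below is one way to run that simulation; the helper name `days_to_collect` and the number of replications are our own choices.

```r
set.seed(110)

# Simulate daily drawings of `per_day` distinct numbers out of `total`
# until every number has appeared at least once; return the number of days.
days_to_collect <- function(total = 35, per_day = 5) {
  seen <- rep(FALSE, total)
  days <- 0
  while (!all(seen)) {
    seen[sample(total, per_day)] <- TRUE   # sample() draws without replacement
    days <- days + 1
  }
  days
}

days <- replicate(5000, days_to_collect())

# empirical estimate of a_35
mean(days)
```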

BH 10.17

Let \(X_1, X_2, ...\) be i.i.d. positive random variables with mean 2. Let \(Y_1, Y_2, ...\) be i.i.d. positive random variables with mean 3. Show that \[\frac{X_1+X_2+ \dots + X_n}{Y_1+Y_2 + \dots +Y_n} \to \frac{2}{3}\] with probability 1. Does it matter whether the \(X_i\) are independent of the \(Y_j\)?
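
To see the convergence numerically, the sketch below tries one independent case and one dependent case. The specific distributions are assumptions: the problem only fixes the means, so we take the \(X_i\) to be Exponential with mean 2, and the \(Y_j\) to be either independent Exponentials with mean 3 or the dependent choice \(Y_j = 1.5X_j\) (which is also positive with mean 3).

```r
set.seed(110)

n <- 100000

X <- rexp(n, rate = 1 / 2)       # mean 2 (assumed distribution)
Y_indep <- rexp(n, rate = 1 / 3) # mean 3, independent of X (assumed distribution)
Y_dep <- 1.5 * X                 # mean 3, completely determined by X

ratio_indep <- sum(X) / sum(Y_indep)
ratio_dep <- sum(X) / sum(Y_dep)

c(independent = ratio_indep, dependent = ratio_dep, target = 2 / 3)
```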

BH 10.18

Let \(U_1, U_2, \dots, U_{60}\) be i.i.d. Unif(0,1) and \(X = U_1 + U_2 + \dots + U_{60}\).

  1. Which important distribution is the distribution of \(X\) very close to? Specify what the parameters are, and state which theorem justifies your choice.

  2. Give a simple but accurate approximation for \(P(X >17)\). Justify briefly.
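
A short simulation sketch for this problem. It builds many copies of \(X\) as row sums of Uniform draws, then compares the empirical mean, variance and \(P(X > 17)\) with a Normal that has the same mean and variance as \(X\) (each \(U_j\) has mean \(1/2\) and variance \(1/12\)). The number of simulations is an arbitrary choice.

```r
set.seed(110)

sims <- 20000

# each row holds 60 i.i.d. Unif(0, 1) draws; the row sum is one draw of X
U <- matrix(runif(sims * 60), ncol = 60)
X <- rowSums(U)

mean(X)   # compare with 60 * (1 / 2)
var(X)    # compare with 60 * (1 / 12)

# empirical P(X > 17) next to the Normal approximation
p_empirical <- mean(X > 17)
p_normal <- pnorm(17, mean = 30, sd = sqrt(5), lower.tail = FALSE)
c(empirical = p_empirical, normal = p_normal)
```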

BH 10.19

Let \(V_n \sim \chi^2_n\) and \(T_n \sim t_n\) for all positive integers \(n\).

  1. Find numbers \(a_n\) and \(b_n\) such that \(a_n(V_n - b_n)\) converges in distribution to \(N(0,1)\).

  2. Show that \(T^2_n/(n+T^2_n)\) has a Beta distribution (without using calculus).

BH 10.20

Let \(T_1, T_2, ...\) be i.i.d. Student-\(t\) r.v.s with \(m \geq 3\) degrees of freedom. Find constants \(a_n\) and \(b_n\) (in terms of \(m\) and \(n\)) such that \(a_n(T_1 + T_2 + \dots + T_n - b_n)\) converges to \(N(0,1)\) in distribution as \(n \to \infty\).

BH 10.21

  1. Let \(Y = e^X\), with \(X \sim Expo(3)\). Find the mean and variance of \(Y\).

  2. For \(Y_1,\dots,Y_n\) i.i.d. with the same distribution as \(Y\) from (a), what is the approximate distribution of the sample mean \(\bar{Y}_n = \frac{1}{n} \sum_{j=1}^n Y_j\) when \(n\) is large?

BH 10.22

  1. Explain why the \(Pois(n)\) distribution is approximately Normal if \(n\) is a large positive integer (specifying what the parameters of the Normal are).

  2. Stirling’s formula is an amazingly accurate approximation for factorials:

\[ n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n,\] where in fact the ratio of the two sides goes to 1 as \(n \to \infty\). Use (a) to give a quick heuristic derivation of Stirling’s formula by using a Normal approximation to the probability that a Pois(\(n\)) r.v. is \(n\), with the continuity correction: first write \(P(N=n) = P(n-\frac{1}{2} < N < n + \frac{1}{2})\), where \(N \sim Pois(n)\).
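
No simulation is needed to see how good Stirling's formula is; we can compare both sides directly. The sketch below does this for a handful of values of \(n\) (our choice), and also compares the exact \(P(N = n)\) for \(N \sim Pois(n)\) with the continuity-corrected Normal approximation that the heuristic derivation uses.

```r
n <- c(1, 5, 10, 50, 100)

# ratio of n! to Stirling's approximation; should approach 1 as n grows
stirling <- sqrt(2 * pi * n) * (n / exp(1))^n
ratio <- factorial(n) / stirling
round(ratio, 5)

# exact P(N = n) for N ~ Pois(n) ...
exact <- dpois(n, n)

# ... and the Normal(n, n) approximation to P(n - 1/2 < N < n + 1/2)
normal_approx <- pnorm(n + 0.5, mean = n, sd = sqrt(n)) -
  pnorm(n - 0.5, mean = n, sd = sqrt(n))

round(rbind(exact, normal_approx), 5)
```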

BH 10.23

  1. Consider i.i.d. Pois(\(\lambda\)) r.v.s \(X_1,X_2,\dots\). The MGF of \(X_j\) is \(M(t) = e^{\lambda(e^t-1)}\). Find the MGF \(M_n(t)\) of the sample mean \(\bar{X}_n= \frac{1}{n} \sum_{j=1}^n X_j\).

  2. Find the limit of \(M_n(t)\) as \(n \to \infty\). (You can do this with almost no calculation using a relevant theorem; or you can use (a) and the fact that \(e^x \approx 1 + x\) if \(x\) is very small.)

BH 10.31

Let \(X\) and \(Y\) be independent standard Normal r.v.s and let \(R^2=X^2 + Y^2\) (where \(R>0\) is the distance from \((X,Y)\) to the origin).

  1. The distribution of \(R^2\) is an example of three of the important distributions we have seen (in ‘Probability!’ we have only learned about two of these distributions, so you only need to mention two). State which three of these distributions \(R^2\) is an instance of, specifying the parameter values.

  2. Find the PDF of \(R\).

  3. Find \(P(X>2Y+3)\) in terms of the standard Normal CDF \(\Phi\).

  4. Compute \(\textrm{Cov}(R^2,X)\). Are \(R^2\) and \(X\) independent?

BH 10.32

Let \(Z_1,...,Z_n \sim N(0,1)\) be i.i.d.

  1. As a function of \(Z_1\), create an Expo(\(1\)) r.v. \(X\) (your answer can also involve the standard Normal CDF \(\Phi\)).

  2. (This part is omitted, since we haven’t yet covered the relevant material.)

  3. Let \(X_1 = 3 Z_1 - 2 Z_2\) and \(X_2 = 4Z_1 + 6Z_2\). Determine whether \(X_1\) and \(X_2\) are independent (be sure to mention which results you’re using).

BH 10.33

Let \(X_1, X_2, \dots\) be i.i.d. positive r.v.s. with mean \(\mu\), and let \(W_n = \frac{X_1}{X_1+\dots + X_n}.\)

  1. Find \(E(W_n).\)

  2. What random variable does \(nW_n\) converge to (with probability \(1\)) as \(n \to \infty\)?

  3. For the case that \(X_j \sim Expo(\lambda)\), find the distribution of \(W_n\), preferably without using calculus. (If it is one of the named distributions, state its name and specify the parameters; otherwise, give the PDF.)
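
A minimal simulation sketch for the Exponential case, assuming \(\lambda = 1\) and \(n = 5\) for the first part (both values are our choice). It estimates \(E(W_n)\), and then, for one long sequence, puts \(nW_n\) next to \(X_1/\mu\) (here \(\mu = 1/\lambda\)) so the two can be compared when \(n\) is large.

```r
set.seed(110)

lambda <- 1    # rate of the Exponentials (assumed)
n <- 5         # number of terms in the denominator (assumed)
sims <- 50000

# each row is one draw of (X_1, ..., X_n)
X <- matrix(rexp(sims * n, rate = lambda), nrow = sims)
W <- X[, 1] / rowSums(X)

# empirical E(W_n)
mean(W)

# a single long sequence, to look at n * W_n for large n
big_n <- 10000
x <- rexp(big_n, rate = lambda)
nW <- big_n * x[1] / sum(x)
c(nW = nW, X1_over_mu = x[1] * lambda)
```

A histogram of `W` (for example, `hist(W)`) is a useful visual check on a candidate distribution for the third part.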


Blitzstein, J. K., and J. Hwang. 2014. Introduction to Probability. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.