Chapter 1 Probability Review

“Never tell me the odds!” -Han Solo






Motivation




Although this book is about Stochastic Processes, techniques from Probability provide the crucial underpinnings for the different concepts that we will learn. The major (suggested) prerequisite for this book is some knowledge of Probability, which we will review in this chapter. Naturally, as this chapter is a review, one should not expect fully fleshed out explanations; see this book on Probability for more extensive discussions. This chapter will read more like a casual refresher.




1.1 Basic Probability and Counting


We start with the concept of ‘naive probability,’ which is simply defined as:

\[P(A) = \frac{\text{number of favorable outcomes}}{\text{total number of outcomes}}\]

Where \(P(A)\) means, in English, ‘the probability of event \(A\) occurring.’ For example, if we flip a fair, two-sided coin, \(P(Heads)\), or the probability of the event of the coin landing heads, is simply \(\frac{1}{2}\) (there is one favorable outcome, Heads, and two total outcomes, Heads and Tails). We often call this a marginal probability (it involves only a single event).


We can find the probability of a single event occurring (i.e., the probability that event \(A\) occurs), or we can find the probability of multiple events occurring together; the latter is the intersection of events. We write \(A \cap B\) for the intersection of events \(A\) and \(B\), or the event where both occur. For example, if we flip a fair, two-sided coin and roll a fair, six-sided die, letting \(A\) be the event that the coin lands Heads and \(B\) be the event that the die lands a 1, then the event \(A \cap B\) is the coin landing Heads and the die landing 1.

Similarly, if the intersection of two events, at least colloquially, means “and” (the coin landed heads and the die landed \(1\)), then the union of two events means “or.” For our example of events \(A\) and \(B\), the union of \(A\) and \(B\), written as \(A \cup B\), is simply the event that \(A\) takes place, \(B\) takes place or both \(A\) and \(B\) take place (i.e., at least one takes place). If you were to draw a Venn Diagram of events \(A\) and \(B\), the intersection would be the overlapping part of the Venn Diagram and the union would be the entire area enclosed by the circles (including the intersection).


If two events are independent, their joint probability (or probability of their intersection) is just the product of their marginal probabilities. To observe this, we can use the example of events \(A\) and \(B\) above:

\[P(A \cap B) = P(A) P(B) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}\]

If two events are not independent, or dependent (i.e., knowing the outcome of one event changes the probability of the other event occurring, whereas for independent events learning one outcome changes nothing), then we need to employ conditional probability. We may write \(P(A|B)\) which, in English, means ‘the probability of \(A\) occurring conditional on event \(B\) occurring’ (i.e., the probability of \(A\) occurring given that \(B\) occurred). To find the probability of the intersection of dependent events, we write:

\[P(A \cap B) = P(A) P(B|A) = P(B) P(A|B)\]

Note the symmetry here (we can either condition on \(A\) or on \(B\)). Also note that this equation still holds for independent events; in that case, \(P(B|A)\) simplifies to \(P(B)\) (knowing that \(A\) occurs does not change the probability of \(B\) occurring if the events are independent) and \(P(A|B)\) simplifies to \(P(A)\).
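If you would like to see the product rule in action, here is a minimal simulation sketch of the coin-and-die example (the seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

coin = rng.integers(0, 2, n)  # 1 = Heads, 0 = Tails
die = rng.integers(1, 7, n)   # fair six-sided die

p_a = np.mean(coin == 1)                     # P(A): coin lands Heads
p_b = np.mean(die == 1)                      # P(B): die lands 1
p_both = np.mean((coin == 1) & (die == 1))   # P(A and B)

print(p_both, p_a * p_b)  # both close to 1/12 ~ 0.0833
```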


We can also use conditioning to get the Law of Total Probability (LOTP). This is:

\[P(A) = P(A|B)P(B) + P(A|B^c) P(B^c)\]

Where \(B^c\) is, in English, ‘\(B\) complement,’ and simply is the event that \(B\) did not occur.

LOTP finds \(P(A)\) by considering two states of the world: when \(B\) occurs and when \(B^c\) occurs (\(B\) does not occur). It then takes the probability of \(A\) occurring in both of these worlds with \(P(A|B)\) and \(P(A|B^c)\) and weights these probabilities by the probability of the worlds actually taking place (\(P(B)\) and \(P(B^c)\)). Essentially, LOTP is a weighted average.
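As a minimal numeric sketch of this weighted average (the probabilities below are invented purely for illustration):

```python
# Hypothetical numbers, chosen only to illustrate LOTP.
p_b = 0.3           # P(B)
p_a_given_b = 0.8   # P(A | B)
p_a_given_bc = 0.4  # P(A | B^c)

# Weight each 'world' by the probability of that world occurring.
p_a = p_a_given_b * p_b + p_a_given_bc * (1 - p_b)
print(p_a)  # 0.8 * 0.3 + 0.4 * 0.7 = 0.52
```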


Conditioning also provides us with the infamous Bayes’ Rule:

\[P(A|B) = \frac{P(A)P(B|A)}{P(B)}\]

This equation can be derived by taking our above equation, \(P(A) P(B|A) = P(B) P(A|B)\), and dividing both sides by \(P(B)\). It can be extremely useful when you need \(P(A|B)\) but only have \(P(B|A)\) and is also the foundation for an entire branch of statistics (a discussion for a later time).
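Continuing with the same invented numbers from the LOTP sketch, here is how Bayes’ Rule ‘flips’ the conditioning (LOTP supplies the denominator):

```python
# Same hypothetical numbers as the LOTP sketch above.
p_b = 0.3
p_a_given_b = 0.8
p_a_given_bc = 0.4

# LOTP gives the denominator P(A).
p_a = p_a_given_b * p_b + p_a_given_bc * (1 - p_b)

# Bayes' Rule flips the conditioning: P(B | A) = P(A | B) P(B) / P(A).
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # 0.24 / 0.52 ~ 0.462
```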


Throughout statistics we often find it useful to count combinations or permutations of different things. Among the simplest counting techniques is the multiplication rule, which says that, for a process with \(r\) distinct steps, where the steps have \(n_1, n_2, ..., n_r\) choices (i.e., the \(i^{th}\) step has \(n_i\) choices), the total number of outcomes is \(n_1 \cdot n_2 \cdot ... \cdot n_r = \prod_{i = 1}^r n_i\) (where two outcomes are only considered the same if the same choice was made at each of the \(r\) steps).

The multiplication rule allows us to easily count the number of ways to ‘line things up.’ For example, say we wanted to count the number of total ways to order the letters A, B and C. We could easily list out the ways by hand…


  1. ABC
  2. ACB
  3. BCA
  4. BAC
  5. CBA
  6. CAB


But this would quickly get extremely tiresome for more letters (can you write out all of the orderings of \(ABCDEF\)?). Instead, we can consider the multiplication rule.

In the above example, we have three ‘spots’ that each letter could go in (the first spot, the second spot or the third spot). There are 3 choices for which letter goes in the first spot (naturally, A, B or C), 2 choices for which letter goes in the second spot (only 2 choices because one letter has already been chosen for the first spot!) and 1 choice for which letter goes in the last spot (whichever letter is left). Building this tree (indeed, it can be helpful to visualize the process branching out at each step) allows us to use the multiplication rule to multiply the number of choices at each step and get the total number of outcomes: \(3 \cdot 2 \cdot 1 = 6\). More succinctly written, this is \(3!\), or, in English, “3 factorial,” where:

\[n! = n \cdot (n - 1) \cdot ... \cdot 1 = \prod_{i = 1}^{n} i\]

Finally, we can use the factorial to construct the binomial coefficient:

\[{n \choose k} = \frac{n!}{(n - k)! k!}\]

This counts the number of ways that \(k\) objects can be chosen from a population of \(n\) objects. For example, we may want to count the number of ways to choose 2 letters from the letters A, B and C. We can start by easily listing them out…


  1. AB
  2. AC
  3. BC


…or we can employ the binomial coefficient and write \({3 \choose 2} = \frac{3!}{(3 - 2)! 2!} = \frac{3!}{2!} = 3\). For a more detailed description on why the binomial coefficient works, see here.
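Python’s standard library can double-check both counts for us; a small sketch using itertools and math:

```python
import math
from itertools import combinations, permutations

letters = ["A", "B", "C"]

# 3! orderings via the multiplication rule.
print(len(list(permutations(letters))), math.factorial(3))  # 6 6

# '3 choose 2' ways to pick 2 of the 3 letters.
print(len(list(combinations(letters, 2))), math.comb(3, 2))  # 3 3
```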




1.2 Random Variables


As with many statistics topics, random variables will be used quite frequently throughout this book. A random variable \(X\) is, formally, a function that maps a sample space onto the real line.

Random variables can be discrete or continuous; in the former case, the support of the random variable (all possible values that the random variable can crystallize to) is discrete (something like \(\{0, 1, 2, ...\}\) or \(\{1, 2, 3, 4\}\)) and in the latter case the support is continuous (something like all numbers between \(0\) and \(1\), or all positive numbers).

Random variables are governed by Probability Mass Functions (PMFs) in the discrete case and Probability Density Functions (PDFs) in the continuous case. The PMF is simply defined by \(P(X = x)\), or the function that gives the probability that the random variable \(X\) crystallizes to the constant \(x\) (naturally, this is defined for the support of \(X\), or all values that \(X\) can take on). The PDF of \(X\), generally notated \(f(x)\), is a density function and therefore does not return probabilities; it has the property that:

\[P(c \leq X \leq d) = \int_c^d f(x)dx\]

Or, integrating the PDF from \(c\) to \(d\) gives the probability that \(X\) crystallizes in the range \((c, d)\). We can then write:

\[\int_a^b f(x) dx = 1\]

Where \(a\) is the lower bound of the support of \(X\) and \(b\) is the upper bound. We can always use \(-\infty\) and \(\infty\) instead of \(a\) and \(b\), as this will naturally be inclusive of the entire support of \(X\) (the PDF of \(X\) is simply zero for values not in the support). The intuition is that the probability of \(X\) crystallizing in its support is 1 (it cannot take on any values outside of its support). This connection to integration is why the ‘Density’ in ‘Probability Density Function’ is aptly used.
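Here is a numeric sketch of this ‘integrates to 1’ property, using an Exponential PDF with an arbitrarily chosen rate (a simple Riemann sum stands in for the integral, and 30 stands in for \(\infty\)):

```python
import numpy as np

lam = 1.5  # arbitrary rate for this sketch
dx = 0.0001
x = np.arange(0, 30, dx)  # the support starts at 0; 30 is 'far enough' for Expo(1.5)
pdf = lam * np.exp(-lam * x)

print((pdf * dx).sum())  # ~ 1.0 (Riemann sum approximating the integral)
```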


Both discrete and continuous random variables have Cumulative Distribution Functions (CDFs), which are simply defined as \(P(X \leq x)\), or the probability that the random variable \(X\) crystallizes to a value less than or equal to some constant \(x\). For PMFs, this can be written as:

\[P(X \leq x) = \sum_{i = a}^x P(X = i)\]

Where \(a\) is the lower bound of the support of \(X\). This simply sums up the probability that \(X\) is any value less than or equal to the constant \(x\). For PDFs, we can use our integration structure and write:

\[P(X \leq x) = \int_a^x f(t) \, dt\]

Where, again, \(a\) is the lower bound of the support of \(X\) (we integrate over a dummy variable \(t\), since \(x\) now appears as a limit of integration).

We often write the CDF as \(F(x) = P(X \leq x)\) (as opposed to the lowercase \(f(x)\) for the PDF) and, as we saw in the last equation, can integrate the PDF to find the CDF:

\[F(b) - F(a) = \int_a^b f(x) dx\]
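A minimal sketch of this relationship for an Exponential random variable, whose closed-form CDF is \(F(x) = 1 - e^{-\lambda x}\) (the rate and endpoints below are arbitrary):

```python
import numpy as np

lam = 1.5
F = lambda x: 1 - np.exp(-lam * x)  # the known closed-form Expo CDF

a, b, dx = 0.5, 2.0, 0.0001
x = np.arange(a, b, dx)
riemann = (lam * np.exp(-lam * x) * dx).sum()  # approximates the integral of the PDF

print(F(b) - F(a), riemann)  # both ~ 0.4226
```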

PMFs, PDFs and CDFs are naturally very useful when working with random variables, but they also define distributions: if two random variables have the same PDF, they have the same distribution, and if they have different PDFs, they have different distributions. Distributions are distinct from random variables; while a random variable is the actual ‘function that maps a sample space onto the real line’ (or, intuitively, the ‘random number generator’), a distribution describes how the random variable ‘acts’ (it defines how the random variable maps onto the real line). One good analogy is to think of the distribution as the recipe to make a hamburger, and the random variable as the actual hamburger; you can make many hamburgers from the same recipe!


Random variables are governed by parameters. For example, a random variable with a normal distribution has two parameters: a mean and a variance, usually denoted \(\mu\) and \(\sigma^2\). Different random variables take different parameters (of course, random variables need not have exactly two parameters; they can have more or fewer).

Random variables also have expectations, which are simply the mean (average) of the random variable. We denote this as \(E(X)\), where \(X\) is the random variable. For a discrete random variable, we have:

\[E(X) = \sum_x P(X = x) \cdot x\]

Which is essentially a weighted average (weighted by the probabilities \(P(X = x)\) of all of the values \(x\) that the random variable \(X\) can take on). Extending this to the continuous case, we have:

\[E(X) = \int_{-\infty}^{\infty} x \cdot f(x)dx\]

Similarly, we can use LoTUS, or the ‘Law of the Unconscious Statistician,’ to find the expectation of a function, defined here as \(g\), of the random variable \(X\) (the function \(g\) could be as simple as \(g(X) = X^2\)). We have in the continuous case (the same holds for the discrete case, simply with the PMF instead of the PDF):

\[E(g(X)) = \int_{-\infty}^{\infty} g(x) \cdot f(x)dx\]

The intuition here is that we are just taking a weighted average of all the values that \(g(X)\) takes on. After all, a function of a random variable is a random variable; the random variable \(X\) crystallizes to some value, and then we plug it into the function \(g\) to transform it to another value (then multiply by the probability or density of \(X\) at that value).
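A tiny sketch of LoTUS in the discrete case, using \(g(X) = X^2\) and a fair six-sided die:

```python
# LOTUS for a fair six-sided die: E(X^2) = sum of x^2 * P(X = x).
support = range(1, 7)
pmf = {x: 1 / 6 for x in support}

e_x2 = sum(x**2 * pmf[x] for x in support)
print(e_x2)  # 91/6 ~ 15.1667
```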

Expectations also have some valuable properties. For example, for a random variable \(X\) and a constant \(a\), we have:

\[E(aX) = aE(X)\]

That is, the constant ‘factors out’ of the expectation. We also have Linearity of Expectation, or, for any random variables \(X\) and \(Y\):

\[E(X + Y) = E(X) + E(Y)\]

Even if \(X\) and \(Y\) are dependent.
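A quick simulation sketch of both facts, with \(Y\) deliberately made dependent on \(X\) (the distribution and the constant are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x**2  # strongly dependent on x

print(np.mean(2.5 * x), 2.5 * x.mean())     # E(aX) = a * E(X)
print(np.mean(x + y), x.mean() + y.mean())  # linearity holds despite dependence
```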


We can use the expectation to get the variance of a random variable, which is another important concept. This is denoted \(Var(X)\), and we have the following equation:

\[Var(X) = E(X^2) - \big(E(X)\big)^2\]

Recall that we can find \(E(X^2)\) using LoTUS.
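Continuing the fair-die sketch from above, we can compute its variance directly from this formula:

```python
# Variance of a fair die via Var(X) = E(X^2) - (E(X))^2.
support = range(1, 7)
e_x = sum(x / 6 for x in support)      # 3.5
e_x2 = sum(x**2 / 6 for x in support)  # 91/6

print(e_x2 - e_x**2)  # 35/12 ~ 2.9167
```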


One key property for variance is that, for any constant \(a\) and any random variable \(X\):

\[Var(aX) = a^2 Var(X)\]

We can see this easily by plugging \(a\) into the equation above and using the fact that \(E(aX) = aE(X)\). Further, for two independent random variables \(X\) and \(Y\), we have:

\[Var(X + Y) = Var(X) + Var(Y)\]

Or, the variance of the sum of random variables is the sum of the variances of the individual random variables (we will see more on the case where the random variables are dependent later).
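A simulation sketch of both variance properties (the distributions, constants, and seed below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(scale=2.0, size=n)      # arbitrary distribution choices
y = rng.normal(loc=1.0, scale=3.0, size=n)  # independent of x

a = 4.0
print(np.var(a * x), a**2 * np.var(x))       # Var(aX) = a^2 Var(X)
print(np.var(x + y), np.var(x) + np.var(y))  # variances add under independence
```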



1.3 Moments


We can define the \(k^{th}\) moment of a random variable as \(E(X^k)\). Recall that using LoTUS we simply have (in the continuous case):

\[E(X^k) = \int_{-\infty}^{\infty} x^k f(x)dx\]

The first moment, when \(k = 1\), gives us the expectation \(E(X)\) (just plug \(k = 1\) into \(E(X^k)\) to get \(E(X)\)), and the second moment, \(E(X^2)\), minus the first moment squared gives us the variance of \(X\).


We can find the moments of \(X\) using the above integration, but this may get tedious. It is often more effective to use the Moment Generating Function (MGF), which is written as \(M_X(t)\) (the \(t\) is a dummy variable, which we will see more of in a second, and \(X\) is the relevant random variable) and is given as:

\[M_X(t) = E(e^{tX})\]

Note that the expectation here is taken over \(X\); again, \(t\) is just a dummy variable, which we will see in a moment. Note that, to calculate \(E(e^{tX})\), we can use LoTUS. For a continuous random variable:

\[E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x)dx\]

Where \(f(x)\) is the PDF of \(X\) (this would work for a discrete random variable as well; we use this equation but sum over the PMF instead of integrating over the PDF). Note that after this integration is completed, the MGF will be a function of \(t\), as \(x\) will have been integrated out.


You can use the Moment Generating Function to, well, generate moments! Specifically, we have:

\[M_X^{\prime}(0) = E(X)\] \[M_X^{\prime \prime}(0) = E(X^2)\] \[...\]

The first equation, in English, means: the first derivative of the MGF w.r.t. (with respect to) \(t\) with \(0\) plugged in for \(t\) gives the first moment of \(X\). The second equation, in English, means: the second derivative of the MGF w.r.t. \(t\) with \(0\) plugged in for \(t\) gives the second moment of \(X\). In general (as implied by the “\(...\)” above), the \(k^{th}\) derivative of the MGF w.r.t. \(t\) with \(0\) plugged in for \(t\) gives the \(k^{th}\) moment of \(X\).

This process can seem complicated at first; see a worked out example with the Exponential distribution here.
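As a numeric companion, here is a Monte Carlo sketch for the Exponential case, whose closed-form MGF is \(M_X(t) = \frac{\lambda}{\lambda - t}\) for \(t < \lambda\) (the rate, the choice of \(t\), and the difference step are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1 / lam, size=2_000_000)  # NumPy parameterizes by scale = 1/lam

mgf = lambda t: np.mean(np.exp(t * x))  # Monte Carlo estimate of E(e^{tX})

t = 0.5  # any t < lam works
print(mgf(t), lam / (lam - t))  # estimate vs. closed-form MGF

# A central difference at t = 0 approximates M'(0) = E(X) = 1/lam = 0.5.
h = 1e-3
print((mgf(h) - mgf(-h)) / (2 * h))
```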


Finally, MGFs have a couple of useful properties. First, MGFs determine distributions: if two random variables have the same MGF, then they have the same distribution. Secondly, the MGF of the sum of independent random variables is the product of each random variable’s MGF. For example, if \(X\) and \(Y\) are independent, the MGF of \(X + Y\) is \(M_X(t)M_Y(t)\).
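A quick simulation sketch of the product property, reusing the Exponential setup from above (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t, n = 2.0, 0.5, 2_000_000
x = rng.exponential(1 / lam, n)
y = rng.exponential(1 / lam, n)  # independent of x

lhs = np.mean(np.exp(t * (x + y)))                     # MGF of X + Y at t
rhs = np.mean(np.exp(t * x)) * np.mean(np.exp(t * y))  # product of the two MGFs
print(lhs, rhs)  # both ~ (lam / (lam - t))^2 ~ 1.778
```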



1.4 Famous Distributions


There are many widely used distributions that frequently surface in Statistics. We can provide a brief refresher here on some of the most prominent. The last page of this cheatsheet, developed by William Chen and Joe Blitzstein, is a helpful tool as well (as is the entire cheatsheet, for other probability topics).



We can start with the discrete distributions:


Bernoulli: \(X \sim Bern(p)\). Takes on the value \(1\) with probability \(p\) or the value \(0\) with probability \(1-p\). Because \(X\) can only take on two values, the PMF is quite simple (we’ve already defined it):

\[P(X = 1) = p\] \[P(X = 0) = 1 - p\]

The expected value is then simply \(E(X) = p\) and the variance is \(Var(X) = p(1-p)\).


Binomial: \(X \sim Bin(n, p)\). Counts the sum of \(n\) independent Bernoullis, each with probability \(p\) of success. The PMF is given by:

\[P(X = x) = {n \choose x} p^x (1 - p)^{n - x}\]

Where we can think of \(p^x\) giving the probability of \(x\) successes, \((1 - p)^{n - x}\) giving the probability of \(n - x\) failures, and \({n \choose x}\) counting the number of ways that \(x\) trials can succeed in a total of \(n\) trials (we choose \(x\) of the \(n\) trials to be successes).

The expected value is \(E(X) = np\), which is intuitive because a Binomial is the sum of \(n\) Bernoullis, each with expected value \(p\). The variance is \(Var(X) = np(1 - p)\), which we can again see is just the variance of a single Bernoulli times \(n\) (remember, the individual Bernoulli random variables are independent).
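A simulation sketch of this ‘sum of Bernoullis’ view (the parameters are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 10, 0.3, 1_000_000

# A Binomial is a sum of n independent Bernoulli(p) trials.
bern = rng.random((reps, n)) < p
x = bern.sum(axis=1)

print(x.mean(), n * p)           # E(X) = np = 3.0
print(x.var(), n * p * (1 - p))  # Var(X) = np(1-p) = 2.1
```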


Geometric: \(X \sim Geom(p)\). Counts the number of failures before the first success among independent, binary trials, each with probability \(p\) of success. The PMF is given by:

\[P(X = x) = (1 - p)^x p\]

As \((1 - p)^x\) gives the probability of \(x\) failures, with the \(p\) giving the probability of the success (the next trial).

The expected value is \(E(X) = \frac{1-p}{p}\). Intuitively, as the probability \(p\) of success for each trial gets larger, the expected value gets smaller; this makes sense because we expect fewer failures (likewise, as \(p\) gets smaller, the expected value gets larger, because we expect more failures!). The variance is given by \(Var(X) = \frac{1-p}{p^2}\).
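A simulation sketch, with one caveat: NumPy’s geometric counts trials (support \(1, 2, ...\)) rather than failures, so we subtract 1 to match this book’s convention:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25

# NumPy's geometric counts *trials* until the first success (support 1, 2, ...);
# subtract 1 to count *failures* (support 0, 1, ...) as in this book.
x = rng.geometric(p, size=1_000_000) - 1

print(x.mean(), (1 - p) / p)    # E(X) = 3.0
print(x.var(), (1 - p) / p**2)  # Var(X) = 12.0
```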


Poisson: \(X \sim Pois(\lambda)\). This description is a little more vague, but the Poisson is often used to count the number of occurrences for some discrete event. The PMF is given by:

\[P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}\]

There’s not much we can intuit from the PMF. The expected value and variance are both \(E(X) = Var(X) = \lambda\). Although the Poisson may seem boring or irrelevant, it has a whole host of applications in Statistics, as we will soon see.
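A quick simulation sketch of the mean-variance equality (the rate is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0
x = rng.poisson(lam, size=1_000_000)

print(x.mean(), x.var())  # both ~ lam = 4.0
```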



Next are the continuous distributions.


Uniform: \(X \sim Unif(a, b)\). Draws a ‘random number’ from \(a\) to \(b\) (when we say ‘random number,’ we mean that the probability of drawing a number from an interval is proportional to its length; that is, if we cut the interval into 10 equally sized pieces, each piece has an equal probability of containing the selected number). The PDF is given by:

\[f(x) = \frac{1}{b - a}, \quad a \leq x \leq b\]

Notice that the PDF is constant w.r.t. \(x\) (it does not include \(x\)); that is because this is a ‘random draw,’ so the density of the distribution does not change as we move along the interval.

The expected value is \(E(X) = \frac{a + b}{2}\), or the average of the interval endpoints, and the variance is \(Var(X) = \frac{(b - a)^2}{12}\).
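A simulation sketch confirming both formulas (the endpoints are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 8.0
x = rng.uniform(a, b, size=1_000_000)

print(x.mean(), (a + b) / 2)     # ~ 5.0
print(x.var(), (b - a)**2 / 12)  # ~ 3.0
```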


Exponential: \(X \sim Expo(\lambda)\). Essentially the continuous analog of the Geometric: it models the (continuous) waiting time until some event occurs. The PDF is given by:

\[f(x) = \lambda e^{-\lambda x}, \quad x > 0\]

The expected value is given by \(E(X) = \frac{1}{\lambda}\) (so we might say that the process has an arrival rate of \(\lambda\)) and the variance is given by \(Var(X) = \frac{1}{\lambda^2}\).

The Exponential distribution has the memoryless property. Intuitively, this states that, no matter how long you have been waiting for the arrival of something (something that has an Exponential wait time, that is), your remaining wait time going forward has the same Exponential distribution. Put more succinctly:

\[P(X > k + n | X > k) = P(X > n)\]

The Exponential distribution is the only continuous distribution with this property; its discrete analog, the Geometric distribution, is the only discrete distribution with this property.
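A simulation sketch of memorylessness (the rate and the wait times \(k\) and \(n\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, k, n = 0.5, 2.0, 3.0
x = rng.exponential(scale=1 / lam, size=2_000_000)

lhs = np.mean(x[x > k] > k + n)  # P(X > k + n | X > k)
rhs = np.mean(x > n)             # P(X > n)
print(lhs, rhs)                  # both ~ e^(-lam * n) ~ 0.223
```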


Normal: \(X \sim N(\mu, \sigma^2)\). The most famous distribution of all, the Normal (often called the Gaussian or, colloquially at least, the Bell Curve) has a mean parameter \(\mu\) and a variance parameter \(\sigma^2\) (so \(E(X) = \mu\) and \(Var(X) = \sigma^2\)). The PDF is given by:

\[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\]

The Normal distribution is so famous because it shows up so frequently in real life, especially in physical characteristics like height, weight, shoe size, etc. This is thanks in part to the Central Limit Theorem, which essentially states that the distribution of a sample mean (i.e., the mean of a sample of independent realizations from a random variable) approaches a Normal distribution as the sample size grows large.

You have likely also heard of the ‘68-95-99.7 rule,’ which is used as a simple heuristic when working with the Normal; it means that 68% of the density of the distribution is within \(\sigma\) of \(\mu\) (one standard deviation of the mean), 95% of the density is within \(2\sigma\) of \(\mu\) and 99.7% of the density is within \(3\sigma\) of \(\mu\).
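A quick simulation sketch of the rule (the mean and standard deviation below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0  # arbitrary parameters for this sketch
x = rng.normal(mu, sigma, size=1_000_000)

for k in (1, 2, 3):
    # Fraction of draws within k standard deviations of the mean.
    print(k, np.mean(np.abs(x - mu) < k * sigma))  # ~ 0.68, 0.95, 0.997
```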

Finally, the Normal has the lovely property that a linear combination of independent Normal random variables is itself a Normal random variable; that is, if \(X\) and \(Y\) are independent Normals and \(a\) and \(b\) are constants, then if we have:

\[Z = aX + bY\]

We know that \(Z\) has a Normal distribution (yes, even if \(a = b = 0\), in which case \(Z\) is a ‘degenerate’ \(N(0, 0)\) distribution, with a mean of \(0\) and \(0\) variance).




1.5 Miscellaneous



We can wrap up with a couple of extra important topics that we may find useful in our study.



1.5.1 Covariance and Correlation


For two random variables \(X\) and \(Y\), we define the covariance as:

\[Cov(X, Y) = E(XY) - E(X)E(Y)\]

Intuitively, if the covariance is positive, the two random variables ‘tend to vary’ together, as when \(X\) is large (\(X\) is above its mean \(E(X)\)) and \(Y\) is large (\(Y\) is above its mean \(E(Y)\)), we will have \(E(XY) > E(X)E(Y)\) (and vice versa for when \(X\) and \(Y\) are ‘small’).

Covariance has a number of useful properties. For any random variable \(X\) and any constant \(c\), we have:

\[Cov(X, c) = 0\]

As a random variable does not vary with a constant (we can also prove this by plugging \(c\) in for \(Y\) in our equation above and using the fact that \(E(cX) = cE(X)\)). Next, we have:

\[Cov(X, X) = Var(X)\]

Or, a random variable’s covariance with itself is just its variance (we can also see this by plugging \(X\) in for \(Y\) above; we just get the equation for variance!). For the covariance of a sum of random variables, we have:

\[Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)\]

Which we could again prove by plugging in \(Y + Z\) for \(Y\) in our original covariance equation. These last two equations allow us to find the variance of the sum of random variables:

\[Var(X + Y) = Cov(X + Y, X + Y) = Cov(X, X) + Cov(Y, Y) + 2Cov(X, Y)\] \[= Var(X) + Var(Y) + 2Cov(X, Y)\]

Note that earlier we reviewed how \(Var(X + Y) = Var(X) + Var(Y)\) if \(X\) and \(Y\) are independent, which checks out with this equation (\(Cov(X, Y) = 0\) if \(X\) and \(Y\) are independent). This equation, though, gives us a more general solution and works even if \(X\) and \(Y\) are dependent (in which case the covariance term may be nonzero).
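A simulation sketch of the general identity, with \(Y\) deliberately built to depend on \(X\) (the construction is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)  # deliberately dependent on x

cov = np.cov(x, y, ddof=0)[0, 1]
print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov)  # match
```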


We can convert from covariance to correlation easily:

\[\rho = \frac{Cov(X, Y)}{\sigma_x \sigma_y}\]

Where \(\sigma_x\) is the standard deviation (square root of the variance) of \(X\), and likewise for \(\sigma_y\). Unlike covariance, which is unbounded, correlation must lie between \(-1\) and \(1\). This neat property gives us a sense of magnitude (i.e., we know that a correlation of .95 is pretty high, as it is close to the upper bound, whereas we don’t know if a covariance of 482.1 is high or low in context), which is why correlation is so popular in everyday use.
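A quick sketch computing the correlation ‘by hand’ from the covariance and comparing it to NumPy’s built-in (the dependence structure is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = 0.8 * x + rng.normal(size=1_000_000)  # correlated with x by construction

rho = np.cov(x, y, ddof=0)[0, 1] / (np.std(x) * np.std(y))
print(rho, np.corrcoef(x, y)[0, 1])  # same number via two routes, ~ 0.62
```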



1.5.2 Joint and Conditional Distributions


We know that, in the discrete case, a ‘marginal’ PMF is given by \(P(X = x)\). For two random variables \(X\) and \(Y\), we define the Joint PMF as:

\[P(X = x \cap Y = y)\]

Harkening back to our introductory principles of probability, we find that, if \(X\) and \(Y\) are independent (that is, observing the outcome of \(X\) does not affect the distribution of \(Y\)), we have:

\[P(X = x \cap Y = y) = P(X = x) P(Y = y)\]

Or the product of the two marginal PMFs. If \(X\) and \(Y\) are dependent, we have:

\[P(X = x \cap Y = y) = P(X = x) P(Y = y | X = x) = P(Y = y) P(X = x | Y = y)\]

This employs the Conditional PMF, \(P(Y = y|X= x)\), which is, as you have probably reasoned, the probability that the random variable \(Y\) takes on the value \(y\) given that the random variable \(X\) takes on the value \(x\). The same principles apply to continuous random variables, except with PDFs instead of PMFs.


Note that, to go from a joint PDF (or PMF) to a marginal PDF (or PMF), we need to integrate out one of the variables (or sum out, if we are working with the PMF). For example, to get from the joint PDF \(f(x, y)\) to the marginal PDF \(f(x)\), we simply do:

\[f(x) = \int_{-\infty}^{\infty} f(x, y)dy\]

That is, we integrate w.r.t. \(y\), which integrates out the \(y\) variable and leaves us with a function of \(x\) (the PDF of the random variable \(X\)).
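In the discrete analog, ‘integrating out’ becomes ‘summing out’; here is a small sketch with a made-up joint PMF:

```python
import numpy as np

# A made-up joint PMF for discrete X (rows) and Y (columns).
joint = np.array([[0.10, 0.20],
                  [0.05, 0.25],
                  [0.30, 0.10]])
print(joint.sum())  # valid PMF: sums to 1

# 'Summing out' y (the discrete analog of integrating out y)
# leaves the marginal PMF of X.
print(joint.sum(axis=1))  # [0.30, 0.30, 0.40]
```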



1.5.3 Adam and Eve


Often called the ‘Law of Iterated Expectation,’ this states that, for two random variables \(X\) and \(Y\), we have:

\[E(Y) = E\big(E(Y|X)\big)\]

That is, we take the expectation of \(Y\) while holding \(X\) constant (‘conditioning’ on \(X\)), which is the \(E(Y|X)\) part, and then we take the expectation of what we are left with (no longer holding \(X\) constant), which is the outer expectation.

This concept may be confusing at first glance; Example 9.1 in this chapter walks through a simple application.
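In the meantime, here is a simulation sketch of Adam’s Law on a toy hierarchical model (the model is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy hierarchical model: X ~ Unif(0, 1), then Y | X ~ Bern(X).
x = rng.random(n)
y = rng.random(n) < x

# Here E(Y | X) = X, so E(E(Y|X)) = E(X); both sides match E(Y) ~ 0.5.
print(y.mean(), x.mean())
```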