## 2.7 Distributions of random variables

Random variables can potentially take many different values, usually with some values more likely than others. In the previous chapter, we used simulation to investigate the pattern of variability of random variables. Plots and summary statistics like the ones encountered in the previous chapter summarize *distributions*.

In general, a simulation takes the following steps.

- **Set up.** Define a probability space, and related random variables and events. (Remember: random variables are often defined via transformations of other random variables.)
- **Simulate.** Simulate, according to the probability measure, outcomes, occurrences of events, and values of random variables.
- **Summarize.** Summarize simulation output in plots (impulse plot, histogram, scatter plot, etc.) and summary statistics (relative frequencies, mean, standard deviation, correlation, etc.) to describe *distributions*.

We use different plots and summary statistics depending on the number and types of the random variables under investigation. Commonly encountered random variables can be classified as discrete or
continuous (or a mixture of the two^{51}).

- A **discrete** RV can take on only countably many isolated points on a number line. These are often counting type variables.
- A **continuous** RV can take any value within some interval. These are often measurement type variables.

The **(probability) distribution** of a RV specifies the possible values
of the RV and a way of determining corresponding probabilities. A distribution is determined by:

- The underlying probability measure \(\textrm{P}\), which represents all the assumptions about the random phenomenon
- The random variable \(X\) itself, that is, the function which maps sample space outcomes to numbers.

We saw some graphical representations of distributions in the previous section. We will see a few ways of specifying a distribution.

- A *table* of possible values and corresponding probabilities (for discrete random variables)
- A *probability mass function* (for discrete RVs) or a *probability density function* (for continuous RVs), which maps possible values \(x\) to their respective probability (for discrete) or density (for continuous)
- A *cumulative distribution function*, which provides all the percentiles
- By *name, including values of relevant parameters*, e.g., “Exponential(1)”, “Normal(500, 100)”, “Binomial(5, 0.3)”. Some probabilistic situations are so common that the corresponding distributions have special names. Always be sure to specify values of relevant parameters (e.g., “Normal(500, 100) distribution” rather than just “Normal distribution”). Note that different named distributions have different parameters. For example, the parameters 0 and 1 mean something different for the Uniform(0, 1) than for the Normal(0, 1) distribution.

The distribution of a RV specifies the long run pattern of variation of values of the random variable over many repetitions of the underlying random phenomenon. The distribution of a RV can be approximated by simulating an outcome of the random process, observing the value of the RV for that outcome, repeating this process many times, and computing related relative frequencies.
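This simulate-and-summarize recipe can be sketched in plain Python. The example below is a minimal sketch (using the number of heads in three flips of a fair coin as an assumed random variable); the relative frequencies approximate the distribution.

```python
import random
from collections import Counter

random.seed(1)

def num_heads():
    # one repetition of the random phenomenon: flip a fair coin 3 times,
    # and observe the value of the RV X = number of heads
    return sum(random.choice([0, 1]) for _ in range(3))

# repeat many times and summarize with relative frequencies
sims = [num_heads() for _ in range(100_000)]
rel_freq = {x: count / len(sims) for x, count in sorted(Counter(sims).items())}
# rel_freq approximates the distribution of X
```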

**Example 2.32 **
Consider the probability space corresponding to a sequence of 3 flips of a fair coin. Let \(X\) be the number of heads flipped, \(Y\) the number of tails, and \(Z\) the length of the longest streak of heads in a row. (Note that a streak of length 1 is still a streak, so \(Z(HTH)=1\).)

- Find the distribution of \(X\).
- Identify two methods for simulating a value of \(X\).
- Are \(X\) and \(Y\) the same random variable?
- Find \(\textrm{P}(X=Y)\).
- Do \(X\) and \(Y\) have the same distribution? What does this mean?
- Suppose the coin is not fair. Would \(X\) and \(Y\) have the same distribution?
- Are \(X\) and \(Z\) the same random variable?
- Find \(\textrm{P}(X=Z)\).
- Do \(X\) and \(Z\) have the same distribution?

*Solution* to Example 2.32

Outcome (\(\omega\)) | \(X\) | \(Y\) | \(Z\) |
---|---|---|---|
HHH | 3 | 0 | 3 |
HHT | 2 | 1 | 2 |
HTH | 2 | 1 | 1 |
THH | 2 | 1 | 2 |
HTT | 1 | 2 | 1 |
THT | 1 | 2 | 1 |
TTH | 1 | 2 | 1 |
TTT | 0 | 3 | 0 |

- The above table lists the sample space of 8 outcomes and the corresponding values of \(X, Y, Z\). Assuming equally likely outcomes, each outcome has probability \(1/8\). The distribution of \(X\): \(X\) is equal to 0, 1, 2, 3 with respective probability 1/8, 3/8, 3/8, 1/8. For example, \(\textrm{P}(X=1)=\textrm{P}(\{HTT, THT, TTH\})=3/8\).
- Simulate from the probability space: flip a fair coin 3 times and let \(X\) be the number of heads; repeat many times. Simulate from the distribution: construct a spinner corresponding to the distribution of \(X\), which lands on 0, 1, 2, 3 with respective probability 1/8, 3/8, 3/8, 1/8, and spin once; repeat many times.
- No, \(X\) and \(Y\) are not the same random variable; they measure different things. For example, for the outcome HHH, \(X(HHH)=3\) but \(Y(HHH)=0\). Remember that random variables are functions, and for two functions to be the same, they have to return the same output for any given input.
- In an odd number of flips, there are no outcomes which satisfy the event \(\{X=Y\}\), so \(\textrm{P}(X=Y)=0\).
- Yes, \(X\) and \(Y\) have the same distribution. \(Y\) also takes values 0, 1, 2, 3, with respective probability 1/8, 3/8, 3/8, 1/8. While the values of \(X\) and \(Y\) would not be equal for any particular outcome or repetition of the coin flipping simulation, in the long run (over many repetitions) the pattern of variability of values would be the same for \(X\) and \(Y\).
- No, if the coin were not fair \(X\) and \(Y\) would have different distributions. If the coin were biased in favor of landing on heads, then an outcome like HHH would be more likely than TTT and so \(\textrm{P}(X=3)\) would be greater than \(\textrm{P}(X=0)\). But for the same coin, \(\textrm{P}(Y=3)\) would be less than \(\textrm{P}(Y=0)\). So \(\textrm{P}(X=3)\) would not equal \(\textrm{P}(Y=3)\). \(X\) would tend to take larger values, while \(Y\) would tend to take smaller values.
- No, \(X\) and \(Z\) are measuring different things. In particular, \(X(HTH)=2\) but \(Z(HTH)=1\).
- The only outcome that does not satisfy \(\{X=Z\}\) is HTH; that is, \(\{X=Z\}^c = \{HTH\}\) and so \(\textrm{P}(X=Z)=7/8=0.875\).
- No. Both \(X\) and \(Z\) take values in \(\{0, 1, 2, 3\}\) but \(\textrm{P}(X=1)=3/8\) while \(\textrm{P}(Z=1)=4/8\). So while \(X=Z\) in about 88% of repetitions, the overall pattern of variability of \(X\) is different from that of \(Z\).

```
from symbulate import *
import matplotlib.pyplot as plt

P = BoxModel([0, 1], size = 3)
X = RV(P, sum) # number of Heads
Y = 3 - X # number of Tails
plt.figure()
(X & Y).sim(10000).plot(['tile', 'marginal'])
plt.show()
```
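The two simulation methods from the solution can also be sketched in plain Python (a hypothetical alternative to the Symbulate code above, using the standard library's `random` module):

```python
import random

random.seed(1)
n = 100_000

# Method 1: simulate from the probability space (3 coin flips, count heads)
from_space = [sum(random.choice([0, 1]) for _ in range(3)) for _ in range(n)]

# Method 2: simulate from the distribution directly, like a spinner that
# lands on 0, 1, 2, 3 with respective probability 1/8, 3/8, 3/8, 1/8
from_dist = random.choices([0, 1, 2, 3], weights=[1, 3, 3, 1], k=n)

# both methods produce the same long-run pattern of relative frequencies
freq_space = [from_space.count(x) / n for x in range(4)]
freq_dist = [from_dist.count(x) / n for x in range(4)]
```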

**Do NOT confuse a random variable with its distribution.** The RV is the numerical quantity being measured (e.g., SAT score). The distribution is the long run pattern of variation of many observed values of the RV. A distribution, like a spinner, is a blueprint for simulating values of the random variable. But a distribution is not the random variable itself. (“The map is not the territory.”)

Two random variables can have the same (long run) distribution, even if the values of the two random variables are never equal on any particular repetition (outcome). If \(X\) and \(Y\) have the same distribution, then the spinner used to simulate \(X\) values can also be used to simulate \(Y\) values; in the long run the patterns would be the same.

At the other extreme, two random variables \(X\) and \(Y\) are the same RV only if for every outcome of the random phenomenon the resulting values of \(X\) and \(Y\) are the same. That is, \(X\) and \(Y\) are the same RV only if they are the same *function*: \(X(\omega)=Y(\omega)\) for all \(\omega\in\Omega\). It is possible to have two random variables for which \(\textrm{P}(X=Y)\) is large, but \(X\) and \(Y\) have different distributions.

Many commonly encountered distributions have special names. For example, the distribution of \(X\), the number of heads in 3 flips of a fair coin (and also of \(Y\), the number of tails) in Example 2.32 is called the “Binomial(3, 0.5)” distribution. The random variable in each of the following situations has a Binomial(3, 0.5) distribution.

- \(W\) is the number of Heads in three flips of a fair coin
- \(X\) is the number of Tails in three flips of a fair coin
- \(Y\) is the number of even numbers rolled in three rolls of a fair six-sided die
- \(Z\) is the number of boys in a random sample of three births (assuming boys and girls are equally likely
^{52})

Each of these situations involves a different sample space of outcomes (coins, dice, births) with a random variable which counts according to different criteria (Heads, Tails, evens, boys). These examples illustrate that knowledge that a random variable has a specific distribution (e.g. Binomial(3, 0.5)) does not necessarily convey any information about the underlying outcomes or random variable (function) being measured.

The scenarios involving \(X, Y, Z\) illustrated that two random variables do not have to be defined on the same sample space in order to determine if they have the same distribution. This is in contrast to computing quantities like \(\textrm{P}(X=Y)\): \(\{X=Y\}\) is an event which cannot be investigated unless \(X\) and \(Y\) are defined for the same outcomes. For example, you could not estimate the probability that a student has the same score on both SAT Math and Reading exams unless you measured pairs of scores for each student in a sample. However, you could collect SAT Math scores for one set of students to estimate the marginal distribution of Math scores, while collecting Reading scores for a separate set of students to estimate the marginal distribution of Reading scores.

A random variable can be defined explicitly as a function on a probability space, or implicitly through its distribution. The distribution of an RV is often assumed or specified directly, without mention of the underlying probability space or the function defining the random variable. For example, a problem might state “let \(Y\) have a Binomial(3, 0.5) distribution” or “let \(Y\) have an Exponential(1) distribution”. But remember, such statements do not necessarily convey any information about the underlying sample space outcomes or random variable (function) being measured. In Symbulate the `RV` command can also be used to define a RV implicitly via its distribution. A definition like `X = RV(Binomial(3, 0.5))` effectively defines a random variable `X` on an unspecified probability space via an unspecified function.

**Example 2.33 **
Consider the probability space corresponding to two spins of the Uniform(0, 1) spinner and let \(U_1\) be the result of the first spin and \(U_2\) the result of the second. For each of the following pairs of random variables, determine whether or not they have the same distribution as each other.

- \(U_1\) and \(U_2\)
- \(U_1\) and \(1-U_1\)
- \(U_1\) and \(1+U_1\)
- \(U_1\) and \(U_1^2\)
- \(U_1+U_2\) and \(2U_1\)
- \(U_1\) and \(1-U_2\)
- Is the joint distribution of \((U_1, 1-U_1)\) the same as the joint distribution of \((U_1, 1 - U_2)\)?

*Solution* to Example 2.33

- Yes, each has a Uniform(0, 1) distribution.
- Yes, each has a Uniform(0, 1) distribution. For \(u\in[0, 1]\), \(1-u\in[0, 1]\), so \(U_1\) and \(1-U_1\) have the same possible values, and as discussed in Section 2.6.4 a linear rescaling does not change the shape of the distribution.
- No, the two variables do not have the same possible values. The shapes would be similar though; \(1+U_1\) has a Uniform(1, 2) distribution.
- No, a non-linear rescaling generally changes the shape of the distribution. For example, \(\textrm{P}(U_1\le0.49) = 0.49\), but \(\textrm{P}(U_1^2 \le 0.49) = \textrm{P}(U_1 \le 0.7) = 0.7\). Squaring a number in [0, 1] makes the number even smaller, so the distribution of \(U_1^2\) places higher density on smaller values than \(U_1\) does.
- No, we saw in Section 2.6.6 that \(U_1+U_2\) has a triangular shaped distribution on (0, 2), while \(2U_1\) has a Uniform(0, 2) distribution. Do not confuse a RV with its distribution. Just because \(U_1\) and \(U_2\) have the same distribution, you cannot replace \(U_2\) with \(U_1\) in transformations. The random variable \(U_1+U_2\) is not the same random variable as \(2U_1\); spinning a spinner twice and adding the spins will not necessarily produce the same value as spinning a spinner once and multiplying the value by 2.
- Yes, just like \(U_1\) and \(1-U_1\) have the same distribution.
- No. The marginal distributions are the same, but the joint distribution of \((U_1, 1-U_1)\) places all density along a line, while the joint density of \((U_1, 1-U_2)\) is distributed over the whole two-dimensional region \([0, 1]\times[0,1]\).
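The contrast between \(U_1+U_2\) and \(2U_1\) can be checked by simulation. The sketch below uses plain Python (an assumed alternative to the book's Symbulate code) and compares a probability that differs between the triangular and uniform shapes:

```python
import random

random.seed(1)
n = 100_000

sum_spins = [random.random() + random.random() for _ in range(n)]  # U1 + U2
double_spin = [2 * random.random() for _ in range(n)]              # 2 * U1

# same range (0, 2), different shapes: for the triangular distribution of
# U1 + U2, P(U1 + U2 <= 0.5) = (0.5**2) / 2 = 0.125, while for the
# Uniform(0, 2) distribution of 2*U1, P(2*U1 <= 0.5) = 0.25
p_sum = sum(v <= 0.5 for v in sum_spins) / n
p_double = sum(v <= 0.5 for v in double_spin) / n
```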

### 2.7.1 Discrete random variables: probability mass functions

Discrete random variables take at most countably many possible values (e.g. \(0, 1, 2, \ldots\)). They are often counting variables (e.g. \(X\) is the number of Heads in 10 coin flips). We have seen in several examples that the distribution of a discrete RV can be specified via a table listing the possible values of \(x\) and the corresponding probability \(\textrm{P}(X=x)\). Always be sure to specify the possible values of \(X\). The countable set of possible values of a discrete RV \(X\), \(\{x: \textrm{P}(X=x)>0\}\), is called its **support**.

In some cases, \(\textrm{P}(X=x)\) can be written explicitly as a function of \(x\). For example, if \(X\) is the number of heads in three flips of a fair coin, then \(\textrm{P}(X=x)\) can be written in terms of the following formula^{53}.
\[
\textrm{P}(X =x) = \frac{3!}{x!(3-x)!}(0.5)^{x}(0.5)^{3-x}, \qquad x = 0, 1, 2, 3.
\]
For example^{54}, \(\textrm{P}(X=2) = \frac{3!}{2!(3-2)!}(0.5)^{2}(0.5)^{3-2}=3/8\). The above formula is an example of a *probability mass function*.
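The formula above translates directly into code. This is a minimal sketch of the pmf of the number of heads in three flips of a fair coin (the helper name `binom_pmf` is hypothetical):

```python
from math import factorial

def binom_pmf(x, n=3, p=0.5):
    # P(X = x) = n! / (x! (n - x)!) * p^x * (1 - p)^(n - x)
    return factorial(n) / (factorial(x) * factorial(n - x)) * p**x * (1 - p)**(n - x)

# the compact formula reproduces the table of the distribution of X
pmf = {x: binom_pmf(x) for x in range(4)}
# pmf[2] = 3/8, matching P(X = 2) computed by hand
```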

**Definition 2.6 **
The **probability mass function (pmf)** (a.k.a. density (pdf)^{55}) of a *discrete* RV \(X\), defined on a probability space with probability measure \(\textrm{P}\), is a function \(p_X:\mathbb{R}\mapsto[0,1]\) which specifies each possible value of the RV and the probability that the RV takes that particular value: \(p_X(x)=\textrm{P}(X=x)\) for each possible value of \(x\).

The axioms of probability imply that a valid pmf must satisfy \[\begin{align*} p_X(x) & \ge 0 \quad \text{for all $x$}\\ p_X(x) & >0 \quad \text{for at most countably many $x$ (the possible values, i.e., support)}\\ \sum_x p_X(x) & = 1 \end{align*}\]

We have seen that a distribution of a discrete RV can be represented in a table, with a corresponding spinner. Think of a pmf as providing a compact formula for constructing the table/spinner.

**Example 2.34 **
A discrete random variable \(X\) follows Benford’s law if its pmf satisfies
\[
p_X(x) =
\begin{cases}
\log_{10}(1+\frac{1}{x}), & x = 1, 2, \ldots, 9,\\
0, & \text{otherwise}
\end{cases}
\]
Suppose \(X\) is a random variable defined on a probability space with probability measure \(\textrm{P}\) that follows Benford’s law.

- Construct a table specifying the distribution of \(X\), and the corresponding spinner.
- Find \(\textrm{P}(X \ge 3)\)

*Solution* to Example 2.34

- The table and spinner below specify the distribution of \(X\).
- \(\textrm{P}(X \ge 3) = 1 - \textrm{P}(X <3) = 1 - (0.301 + 0.176) = 0.523\).

\(x\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
\(\textrm{P}(X=x)\) | 0.301 | 0.176 | 0.125 | 0.097 | 0.079 | 0.067 | 0.058 | 0.051 | 0.046 |
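The Benford pmf, and probabilities obtained by summing it over values of interest, can be sketched in a few lines of Python (the helper name `benford_pmf` is hypothetical):

```python
from math import log10

def benford_pmf(x):
    # Benford's law: p(x) = log10(1 + 1/x) for x = 1, 2, ..., 9; 0 otherwise
    return log10(1 + 1 / x) if x in range(1, 10) else 0.0

# a valid pmf: nonnegative, and sums to 1 over the support
total = sum(benford_pmf(x) for x in range(1, 10))

# P(X >= 3): sum the pmf over the values of interest
p_at_least_3 = sum(benford_pmf(x) for x in range(3, 10))
# about 0.523, matching 1 - (0.301 + 0.176)
```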

A pmf is often specified by one or more parameters, such as “Binomial(3, 0.5)” or “Poisson(2.1)”. We will study some common discrete distributions in Chapter **??**.

The pmf of a discrete random variable provides the probability of “equal to” events: \(\textrm{P}(X = x)\). Probabilities for other general events, e.g., \(\textrm{P}(X \le x)\) can be obtained by summing the pmf over the range of values of interest.

### 2.7.2 Continuous random variables: probability density functions

The continuous analog of a pmf is a *probability density function*. However, while pmfs and pdfs play analogous roles, they are different in one fundamental way, namely, a pmf outputs probabilities directly, while a pdf does not. We will see that a pmf of a discrete random variable can be summed to find probabilities of related events, while a pdf of a continuous random variable must be *integrated* to find probabilities of related events.

Values of a continuous random variable are often displayed in a **histogram** which displays the frequencies of values falling in interval “bins”. The vertical axis of the histogram is on the density scale, so that *areas* of the bars correspond to relative frequencies.

We have seen examples (e.g., Figures 2.11, 2.15) where the shape of the histogram can be approximated by a smooth curve. This curve represents an idealized model of what the histogram would look like if infinitely many values were simulated and the histogram bins were infinitesimally small. The curve, called the pdf, represents “relative likelihood” as a function of possible values of \(X\). Just as area represents relative frequency in a histogram, area under a pdf represents probability.

**Definition 2.7 **
The **probability density function (pdf)** (a.k.a. density) of a *continuous* RV \(X\), defined on a probability space with probability measure \(\textrm{P}\), is a function \(f_X:\mathbb{R}\mapsto[0,\infty)\) which satisfies
\[\begin{align*}
1 & =\int_{-\infty}^\infty f_X(x) dx\\
\textrm{P}(a \le X \le b) & =\int_a^b f_X(x) dx, \qquad \text{for all } -\infty \le a \le b \le \infty
\end{align*}\]

For a continuous random variable \(X\), the probability that \(X\) takes a value in the interval \([a, b]\) is the *area under the pdf over the region* \([a,b]\).

A pdf will assign zero probability to intervals where the density is 0. A pdf is usually defined for all real values, but is often nonzero only for some subset of values, the possible values of the random variable. For example, if \(X\) only takes positive values, we write the pdf as \[ f_X(x) = \begin{cases} \text{some function of $x$}, & x>0\\ 0, & \text{otherwise.} \end{cases} \]

**Example 2.35 **
Suppose that \(X\) has a Uniform(\(a\), \(b\)) distribution.

- Sketch a plot of the pdf. Remember: the pdf is defined for all real values of \(x\).
- Specify the functional form of the pdf. Remember: the pdf is defined for all real values of \(x\).
- Find \(\textrm{P}(X\le x)\). Consider three cases: \(x<a\), \(a\le x \le b\), \(x>b\).
- Why does the formula from the previous part make sense?
- If \(U\) has a Uniform(0, 1) distribution, what is the distribution of \(X= a + (b-a)U\)?
- Donny Don’t says: since \(X\) and \(U\) are related by \(X= a + (b-a)U\) then \(f_X\) and \(f_U\) are also related by \(f_X = a + (b-a)f_U\). Do you agree? Explain.

```
from symbulate import *
import matplotlib.pyplot as plt

P = Uniform(0, 1)
X = RV(P)
plt.figure()
X.sim(100000).plot(bins=1000)
Uniform(0, 1).plot()
plt.show()
```

*Solution* to Example 2.35

- We have seen that Uniform distributions are the continuous analog of “equally likely”. The pdf should be some positive constant on the interval \([a, b]\) and 0 outside of that interval.
- The constant \(c\) is determined by setting the total area under the curve equal to 1: \(\int_{-\infty}^\infty f_X(x)dx = \int_a^b c dx = 1\), (\(f_X(x)= 0\) outside \([a, b]\)). Since the density is constant, the area under the curve is just the area of a rectangle with base \(b-a\) and height \(c\), \(c(b-a)\). Setting the area equal to 1 we get \(c =1/(b-a)\). So the pdf is \[ f_X(x) = \begin{cases} \frac{1}{b-a}, & a\le x\le b,\\ 0, & \text{otherwise.} \end{cases} \]
- For \(x<a\), \(\textrm{P}(X\le x) = 0\). For \(x>b\), \(\textrm{P}(X\le x)=1\). For \(a\le x \le b\), \(\textrm{P}(X\le x) = \int_a^x \frac{1}{b-a}\, du\), the area of a rectangle with base \(x-a\) and height \(\frac{1}{b-a}\), so \(\textrm{P}(X\le x) = \frac{x-a}{b-a}\). That is, \[ \textrm{P}(X\le x) = \begin{cases} 0, & x<a,\\ \frac{x-a}{b-a}, & a\le x\le b,\\ 1, & x>b. \end{cases} \]
- We have seen that for Uniform probability measures, probability is a ratio of sizes. In this case, where \(X\) takes a value along a continuous number line, size is measured by length. So the above says that the probability that \(X\) lies in a certain interval subset of \([a, b]\) is the length of the interval subset divided by the total length of the interval.

- The possible values of \(X= a + (b-a)U\) lie in the interval \([a, b]\), and we have seen that a linear rescaling does not change the shape of a distribution, so \(X\) has a Uniform(\(a\), \(b\)) distribution. More precisely, for \(a\le x\le b\), \(\textrm{P}(X\le x) = \textrm{P}(a+(b-a)U\le x) = \textrm{P}(U \le \frac{x-a}{b-a}) = \frac{x-a}{b-a}\) since for Uniform(0, 1), \(\textrm{P}(U \le u)=\frac{u-0}{1-0} = u\) if \(0\le u\le 1\), and \(0\le \frac{x-a}{b-a}\le 1\) if \(a\le x\le b\). This says that we can simulate values from any Uniform(\(a\), \(b\)) distribution by using a Uniform(0, 1) spinner and rescaling.
- No. Donny is confusing a random variable with its distribution. RVs are functions that return numbers, so arithmetic operations make sense. But distributions don’t work the same way. You can add the SAT Math and Reading scores of a student; you cannot add the distribution of SAT Math scores for a collection of students to the distribution of SAT Reading scores for another collection of students. If you tried to do what Donny suggests, since \(f_U(u)=1\) for \(0<u<1\) and \(f_U(u) = 0\) otherwise, you might get something like \(f_X(x) = a+(b-a)(1) = b\) or \(f_X(x) = a+(b-a)(0)=a\). But neither is the proper density. And that calculation ignores the fact that \(f_X\) and \(f_U\) are nonzero for different sets of values. Basically, there’s no way to make this work without running into nonsense.
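The rescaling result above suggests a practical recipe: simulate Uniform(0, 1) values and rescale. A minimal sketch in plain Python, with assumed endpoints \(a=2\), \(b=5\), checked against the cdf formula \(\textrm{P}(X\le x) = \frac{x-a}{b-a}\):

```python
import random

random.seed(1)
a, b = 2.0, 5.0
n = 100_000

# simulate Uniform(a, b) by rescaling Uniform(0, 1) spins: X = a + (b - a) * U
x_vals = [a + (b - a) * random.random() for _ in range(n)]

# compare the relative frequency with P(X <= x) = (x - a) / (b - a)
x = 3.5
p_est = sum(v <= x for v in x_vals) / n
p_formula = (x - a) / (b - a)
```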

In the previous example, we specified the general shape of the pdf (constant on \([a, b]\) and 0 otherwise), then found the constant that made the total area under the curve equal to 1.
In general, a pdf is often defined only up to some multiplicative constant \(c\), for example \(f_X(x) = c\times(\text{some function of x})\), or \(f_X(x) \propto \text{some function of x}\). The constant \(c\) does not affect the shape of the distribution as a function of \(x\), only the scale on the density (vertical) axis. The absolute scaling on the density axis is somewhat irrelevant; it is whatever it needs to be to provide the proper *area*. In particular, the total area under the pdf must be 1. The scaling constant is determined by the requirement that \(\int_{-\infty}^\infty f_X(x) dx = 1\). Given a specific pdf, the generic bounds \((-\infty, \infty)\) should be replaced by the range of possible values, that is, those values for which \(f_X(x)>0\).

What’s more important about the pdf is *relative* heights. For example, if \(f_X(\tilde{x})= 2f_X(x)\) then \(X\) is “twice as likely to be near \(\tilde{x}\) as to be near \(x\)” in a sense we will make precise below.

**Example 2.36 **
Consider a random variable \(U\) that has a Uniform(0, 1) distribution.

- Identify \(f_U\), the pdf of \(U\).
- Evaluate \(f_U(0.5)\). Is \(f_U(0.5)\) equal to \(\textrm{P}(U=0.5)\)? Is \(f_U(0.5)\) equal to the probability of anything?
- Compute \(\textrm{P}(U=0.5)\). Explain why this makes sense.
- Compute the probability that \(U\) rounded to two decimal places is 0.5.

*Solution* to Example 2.36

- \(f_U(u) = 1, 0<u<1\) (and \(f_U(u)=0\) otherwise).
- \(f_U(0.5)=1\). But this can’t be \(\textrm{P}(U=0.5)\); if \(\textrm{P}(U=0.5)\) were 1, then the value 0.5 would occur on 100% of repetitions. The density (height) at a particular point is not the probability of anything. Probabilities are determined by integrating the density.
- \(\textrm{P}(U=0.5)=0\). The “area” under the curve for the region \(\{0.5\}\) is just a line, which has area 0: integrating, \(\int_{0.5}^{0.5}1\,dx=0\). More on this point below.
- \(\textrm{P}(0.495 \le U < 0.505)=0.01\). Rounding \(U\) to two decimal places will return 0.5 if \(0.495\le U<0.505\). The integral \(\int_{0.495}^{0.505}(1)dx\) corresponds to the area of a rectangle with base \(0.505-0.495\) and height 1, so the area is 0.01.
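The contrast between "equals 0.5 exactly" and "rounds to 0.5" can be checked by simulation (a plain-Python sketch; the tolerance reflects simulation error):

```python
import random

random.seed(1)
n = 1_000_000

u_vals = [random.random() for _ in range(n)]

# P(U = 0.5 exactly) is 0, but "U rounds to 0.50" is the interval
# 0.495 <= U < 0.505, which has probability 0.01
p_round = sum(0.495 <= u < 0.505 for u in u_vals) / n
```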

Plugging a value into the pdf of a continuous random variable does *not* provide a probability. The pdf itself does not provide probabilities directly; instead a pdf must be integrated to find probabilities.

**The probability that a continuous RV \(X\) equals any particular value is 0.** That is, if \(X\) is continuous then \(\textrm{P}(X=x)=0\) for all \(x\). Therefore, for a *continuous* RV^{56}, \(\textrm{P}(X\le x) = \textrm{P}(X<x)\), etc. A continuous RV can take uncountably many distinct values. Simulating values of a continuous RV corresponds to an idealized spinner with an infinitely precise needle which can land on any value in a continuous scale.

In the Uniform(0, 1) case, \(0.500000000\ldots\) is different from \(0.500000001\ldots\), which is different from \(0.500000000000001\ldots\), etc. Consider the spinner in Figure 2.1. The spinner in the picture is only labeled in 100 increments of 0.01 each; when we spin, the probability that the needle lands closest to the 0.5 tick mark is 0.01. But if the spinner were labeled in 1000 increments of 0.001, the probability of landing closest to the 0.5 tick mark would be 0.001. And with four decimal places of precision, the probability is 0.0001. And so on. The more precisely we mark the axis, the smaller the probability that the spinner lands closest to the 0.5 tick mark. The Uniform(0, 1) density represents what happens in the limit as the spinner becomes infinitely precise. The probability of landing closest to the 0.5 tick mark gets smaller and smaller, eventually becoming 0 in the limit.

A density is an idealized mathematical model. In practical applications, there is some acceptable degree of precision, and events like “\(X\), rounded to 4 decimal places, equals 0.5” correspond to intervals that do have positive probability. For continuous random variables, it doesn’t really make sense to talk about the probability that the random value equals a particular value. However, we can consider the probability that a random variable is *close to* a particular value.

To emphasize: The density \(f_X(x)\) at value \(x\) is *not* a probability. Rather, the density \(f_X(x)\) at value \(x\) is related to the probability that the RV \(X\) takes a value “close to \(x\)” in the following sense^{57}.
\[
\textrm{P}\left(x-\frac{\epsilon}{2} \le X \le x+\frac{\epsilon}{2}\right) \approx f_X(x)\epsilon, \qquad \text{for small $\epsilon$}
\]
The quantity \(\epsilon\) is a small number that represents the desired degree of precision. For example, rounding to two decimal places corresponds to \(\epsilon=0.01\).

Technically, any particular \(x\) occurs with probability 0, so it doesn’t really make sense to say that some values are more likely than others. However, a RV \(X\) is more likely to take values *close to* those values that have greater density. As we said previously, what’s important about a pdf is *relative* heights. For example, if \(f_X(\tilde{x})= 2f_X(x)\) then \(X\) is roughly “twice as likely to be near \(\tilde{x}\) as to be near \(x\)” in the above sense.
\[
\frac{f_X(\tilde{x})}{f_X(x)} = \frac{f_X(\tilde{x})\epsilon}{f_X(x)\epsilon} \approx \frac{\textrm{P}\left(\tilde{x}-\frac{\epsilon}{2} \le X \le \tilde{x}+\frac{\epsilon}{2}\right)}{\textrm{P}\left(x-\frac{\epsilon}{2} \le X \le x+\frac{\epsilon}{2}\right)}
\]

**Example 2.37 **
Let \(X\) be a random variable with the “Exponential(1)” distribution, illustrated by the smooth curve in Figure 2.11. Then the pdf \(f_X\) is

\[ f_X(x) = \begin{cases} e^{-x}, & x>0,\\ 0, & \text{otherwise.} \end{cases} \]

- Verify that \(f_X\) is a valid pdf.
- Find \(\textrm{P}(X\le 1)\).
- Find \(\textrm{P}(1 \le X< 2.5)\).
- Compute \(\textrm{P}(X = 1)\).
- Approximate the probability that \(X\) rounded to two decimal places is 1.

*Solution* to Example 2.37

- Just need to check that the pdf integrates to 1: \(\int_0^\infty e^{-x}dx = 1\).
- \(\textrm{P}(X\le 1) = \int_0^1 e^{-x}dx = 1-e^{-1}\approx 0.632\). Recall the corresponding spinner in Figure 2.12; 63.2% of the area corresponds to \([0, 1]\).
- \(\textrm{P}(1 \le X< 2.5) = \int_1^{2.5} e^{-x}dx = e^{-1}-e^{-2.5}\approx 0.286\). See the illustration below.
- \(\textrm{P}(X = 1)=0\), since \(X\) is continuous.
- \(\textrm{P}(0.995<X<1.005)\approx f_X(1)(0.01)=e^{-1}(0.01)\approx 0.00367879\). See the illustration below. This provides a pretty good approximation of the true integral^{58}: \(\int_{0.995}^{1.005} e^{-x}dx = e^{-0.995}-e^{-1.005}\approx 0.00367881\).
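The computations in the solution can be checked directly (a plain-Python sketch of the Exponential(1) probabilities above):

```python
from math import exp

# probabilities for the Exponential(1) distribution, f(x) = e^{-x} for x > 0
p_le_1 = 1 - exp(-1)               # P(X <= 1), about 0.632
p_1_to_2_5 = exp(-1) - exp(-2.5)   # P(1 <= X < 2.5), about 0.286
exact = exp(-0.995) - exp(-1.005)  # P(0.995 < X < 1.005), the true integral
approx = exp(-1) * 0.01            # f_X(1) * epsilon, with epsilon = 0.01
# exact and approx agree to about 7 decimal places
```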

### 2.7.3 Cumulative distribution functions

While pmfs and pdfs play analogous roles for discrete and continuous random variables, respectively, they do behave differently; pmfs provide probabilities directly, but pdfs do not. It is convenient to have one object that describes a distribution in the same way, regardless of the type of variable, and which returns probabilities directly. This object is called the *cumulative distribution function (cdf)*. While the definition might seem strange at first, you have probably already had experience with cumulative distribution functions.

**Example 2.38 **
According to data on students who took the SAT in 2018-2019, 1400 was the 94th percentile of SAT scores, while 1000 was the 40th percentile. How do you interpret these percentiles?

*Solution* to Example 2.38

Interpretation: 94% of SAT takers score at or below 1400, and 6% of SAT takers score greater than 1400. Similarly, 40% of SAT takers score at or below 1000, and 60% of SAT takers score greater than 1000.

Roughly, the value \(x\) is the \(p\)th percentile of a distribution of a random variable \(X\) if \(p\) percent of values of the variable are less than or equal to \(x\): \(\textrm{P}(X\le x) = p\). The *cumulative distribution function (cdf)* of a RV fills in the blank
for any given \(x\): \(x\) is the (blank) percentile. That is, for an input \(x\), the cdf outputs \(\textrm{P}(X\le x)\).

**Definition 2.8 **
The **cumulative distribution function (cdf)** (of a random variable \(X\) defined on a probability space with probability measure \(\textrm{P}\))
is the function, \(F_X: \mathbb{R}\mapsto[0,1]\), defined by
\(F_X(x) = \textrm{P}(X\le x)\). A cdf is defined for all real numbers \(x\)
regardless of whether \(x\) is a possible value of the RV \(X\).
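For a discrete RV, the definition amounts to accumulating pmf values. A minimal sketch (the helper name `make_cdf` is hypothetical), using the distribution of \(X\) from Example 2.32:

```python
def make_cdf(pmf):
    # build F_X(x) = P(X <= x) from a discrete pmf given as {value: probability}
    def cdf(x):
        return sum(p for v, p in pmf.items() if v <= x)
    return cdf

# cdf of X, the number of heads in 3 flips of a fair coin
F = make_cdf({0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8})

# F is defined for every real x, not just the possible values of X:
vals = [F(-2), F(1), F(1.7), F(10)]
```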

**Example 2.39 **
Continuing Example 2.38, let \(X\) be the SAT score of a randomly selected
student (from this cohort), and let \(F_X\) be the cdf of \(X\). Evaluate the cdf for each of the following. For the purposes of this exercise, interpret these quantities in terms of actual SAT scores, which take values in 400, 410, 420, \(\ldots\), 1590, 1600.

- \(F_X(1400)\)
- \(F_X(1405)\)
- \(F_X(1000)\)
- \(F_X(1003.7)\)
- \(F_X(-3.1)\)
- \(F_X(390)\)
- \(F_X(399.5)\)
- \(F_X(1600)\)
- \(F_X(1610)\)
- \(F_X(2307.4)\)
- \(F_X(1400)-F_X(1000)\)

*Solution* to Example 2.39

- \(F_X(1400)=0.94\). We are told that \(\textrm{P}(X \le 1400) = 0.94\).
- \(F_X(1405) = 0.94\). In terms of real SAT scores, \(\textrm{P}(X \le 1405) = \textrm{P}(X\le 1400)\).
- \(F_X(1000) = 0.40\). We are told that \(\textrm{P}(X \le 1000) = 0.40\).
- \(F_X(1003.7) = 0.40\). In terms of real SAT scores, \(\textrm{P}(X \le 1003.7) = \textrm{P}(X\le 1000)\).
- \(F_X(-3.1)=0\). The smallest possible score is 400.
- \(F_X(390)=0\). The smallest possible score is 400.
- \(F_X(399.5)= 0\). The smallest possible score is 400.
- \(F_X(1600) = 1\). The largest possible score is 1600, so 100% of students score no more than 1600.
- \(F_X(1610) = 1\). The largest possible score is 1600.
- \(F_X(2307.4) = 1\). The largest possible score is 1600.
- \(0.54 = F_X(1400)-F_X(1000)=\textrm{P}(X\le 1400) - \textrm{P}(X \le 1000) = \textrm{P}(1000 < X \le 1400)\). 54% of SAT takers score greater than 1000 but at most 1400.

To understand a cdf, imagine a spinner for a particular distribution. Suppose a “second hand” starts at the smallest possible value (“12:00”) and sweeps clockwise around the spinner. The second hand sweeps out area as it goes; when the second hand is pointing at \(x\), the area that it has swept through represents \(\textrm{P}(X\le x)\). The cdf records the values of \(F_X(x) = \textrm{P}(X\le x)\) as the second hand moves along and points to different values of \(x\).

While a cdf is defined the same way for both discrete and continuous random variables, it is probably best understood in terms of continuous random variables. Remember that for a continuous random variable, \(\textrm{P}(X\le x)\) is the area under the density curve over the interval \((-\infty, x]\) (remember the density might be 0 for some values in this range). Imagine plotting a density curve and adding a vertical line at \(x\); \(\textrm{P}(X\le x)\) is the area under the curve to the left of this line. The cdf is constructed by moving the vertical line from left to right, from smaller to larger values of \(x\), and recording the area under the curve to the left of the line, \(F_X(x) = \textrm{P}(X\le x)\), as \(x\) varies.

**Example 2.40 **
Let \(X\) be a random variable with the “Exponential(1)” distribution, illustrated by the smooth curve in Figure 2.11. Then the pdf \(f_X\) of \(X\) is

\[ f_X(x) = \begin{cases} e^{-x}, & x>0,\\ 0, & \text{otherwise.} \end{cases} \]

- Find the cdf of \(X\), and sketch a plot of it.
- Evaluate and interpret \(F_X(1)\). How does this appear in Figure 2.11?
- Evaluate and interpret \(F_X(2)-F_X(1)\). How does this appear in Figure 2.11?
- Evaluate and interpret \(F_X(2)\). How does this appear in Figure 2.11?
- Find \(\textrm{P}(1 < X < 2.5)\) without integrating again.
- Suppose we had been given the cdf instead of the pdf. How could we find the pdf?

*Solution* to Example 2.40

- \(F_X(x)=0\) for \(x\le 0\). For \(x>0\) we integrate the density^{59} \[ F_X(x) = \textrm{P}(X \le x) = \int_0^x e^{-u} du = 1 - e^{-x} \] So the cdf of \(X\) is \[ F_X(x) = \begin{cases} 1 - e^{-x}, & x>0,\\ 0, & \text{otherwise.} \end{cases} \]
- \(F_X(1)=\textrm{P}(X\le 1) = 1-e^{-1}\approx 0.632\). This is represented in Figure 2.11 by the area of the region from 0 to 1, 63.2%.
- \[ F_X(2)- F_X(1)=\textrm{P}(X\le 2) - \textrm{P}(X \le 1) =\textrm{P}(1<X\le 2)= (1-e^{-2})-(1-e^{-1})=e^{-1}-e^{-2}\approx 0.233 \] This is represented in Figure 2.11 by the area of the region from 1 to 2, 23.3%.
- \(F_X(2)=\textrm{P}(X\le 2) = 1-e^{-2}\approx 0.865\). This is represented in Figure 2.11 by the area of the region from 0 to 2, 63.2%+23.3% = 86.5%.
- \[ \textrm{P}(1 < X < 2.5) = \textrm{P}(X\le 2.5) - \textrm{P}(X \le 1) = (1-e^{-2.5})-(1-e^{-1})=e^{-1}-e^{-2.5}\approx 0.286 \]
- Since the cdf is obtained by integrating the pdf, the pdf is obtained by differentiating the cdf. Differentiate the cdf \(F_X(x)=1-e^{-x},\ x>0\) with respect to its argument \(x\) to obtain the pdf \(f_X(x) = e^{-x},\ x>0\).
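These calculations can be checked with a simulation in the spirit of the previous chapter. Below is a minimal sketch in plain Python (the helper name `exp1_cdf` is ours, not from the text), comparing relative frequencies of simulated Exponential(1) values to probabilities computed from the cdf.

```python
import math
import random

random.seed(1)

def exp1_cdf(x):
    """cdf of the Exponential(1) distribution: F(x) = 1 - e^(-x) for x > 0."""
    return 1 - math.exp(-x) if x > 0 else 0.0

# Compare probabilities from the cdf to relative frequencies from simulation
n = 100_000
sims = [random.expovariate(1) for _ in range(n)]

print(exp1_cdf(1))                        # F(1) = 1 - e^(-1), about 0.632
print(sum(x <= 1 for x in sims) / n)      # empirical P(X <= 1)
print(exp1_cdf(2) - exp1_cdf(1))          # P(1 < X <= 2), about 0.233
print(sum(1 < x <= 2 for x in sims) / n)  # empirical P(1 < X <= 2)
```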

For continuous random variables, think of the cdf as a “generic integral”. Rather than integrating from scratch to find \(\textrm{P}(X < 1)\), \(\textrm{P}(X < 2)\), \(\textrm{P}(1 < X< 2)\), etc, the integral is computed once for a generic \(x\) and then evaluated to find probabilities for specific values of \(x\), \(F_X(1)\), \(F_X(2)\), \(F_X(2)-F_X(1)\), etc.

For any random variable \(X\) with cdf \(F_X\) \[ F_X(b) - F_X(a) = \textrm{P}(a<X \le b) \] Note that whether the inequalities in the above event are strict (\(<\)) or not (\(\le\)) matters for discrete random variables, but not for continuous.

**Example 2.41 **
Let \(X\) be the number of heads in 3 flips of a fair coin.

- Find the cdf of \(X\) and sketch a plot of it.
- Let \(Y\) be the number of tails in 3 flips of a fair coin. Find the cdf of \(Y\).

*Solution* to Example 2.41

- \(X\) takes values 0, 1, 2, 3, with respective probabilities 1/8, 3/8, 3/8, 1/8. We sum these probabilities to find the cdf. For example, \(F_X(0) = \textrm{P}(X\le 0) = 1/8\), \(F_X(1)=\textrm{P}(X\le 1) = \textrm{P}(X=0) + \textrm{P}(X=1) = 1/8+3/8 = 0.5\). Remember that \(F_X\) is defined for any value of \(x\), for example \(F_X(1.5) = \textrm{P}(X\le 1.5)= \textrm{P}(X=0) + \textrm{P}(X=1)=0.5\). The cdf is a step function, which is flat for impossible values of \(x\) and jumps at possible values \(x\) with the jump size at \(x\) equal to the value of the pmf at \(x\). \[ F_X(x) = \begin{cases} 0, & x<0,\\ 1/8, & 0\le x<1,\\ 4/8, & 1\le x<2,\\ 7/8, & 2\le x<3,\\ 1, & x\ge 3. \end{cases} \]
- The cdf describes a distribution. Since \(X\) and \(Y\) have the same distribution, they will have the same cdf. The only difference would be labeling; we would call the cdf of \(Y\), \(F_Y\), and the argument of this function would typically (but not necessarily) be denoted \(y\).
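As a quick check, the pmf and cdf of \(X\) can be built by enumerating the eight equally likely outcomes. A sketch in plain Python (the names `pmf` and `cdf` are ours):

```python
from fractions import Fraction
import itertools

# pmf of X = number of heads in 3 flips, by enumerating the 8 outcomes
pmf = {k: Fraction(0) for k in range(4)}
for flips in itertools.product("HT", repeat=3):
    pmf[flips.count("H")] += Fraction(1, 8)

def cdf(x):
    """F_X(x) = P(X <= x), defined for every real x (a step function)."""
    return sum(p for k, p in pmf.items() if k <= x)

print(cdf(1))    # 1/8 + 3/8 = 1/2
print(cdf(1.5))  # same value: the cdf is flat between possible values
print(cdf(-1))   # 0
print(cdf(3))    # 1
```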

A few properties of cdfs:

- A cdf is defined for all values of \(x\), regardless of whether \(x\) is a possible value of the RV.
- A cdf is a non-decreasing function^{60}: if \(x \le \tilde{x}\) then \(F_X(x)\le F_X(\tilde{x})\).
- A cdf approaches 0 as the input approaches \(-\infty\): \(\lim_{x\to-\infty}F_X(x) = 0\)
- A cdf approaches 1 as the input approaches \(\infty\): \(\lim_{x\to\infty}F_X(x) = 1\)
- The cdf of a *discrete* random variable is a step function.
    - The steps occur at the possible values of the random variable.
    - The height of a particular step corresponds to the probability of that value, given by the pmf.
- The cdf of a *continuous* random variable is a continuous function.
    - The cdf of a *continuous* random variable is obtained by integrating the pdf, so
    - the pdf of a *continuous* random variable is obtained by differentiating the cdf \[ F_X' = f_X \qquad \text{if $X$ is continuous} \]
- For any random variable \(X\) with cdf \(F_X\) \[ F_X(b) - F_X(a) = \textrm{P}(a<X \le b) \] Whether the inequalities in the above event are strict (\(<\)) or not (\(\le\)) matters for discrete random variables, but not for continuous.
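These properties can be spot-checked numerically for a particular cdf. A sketch in plain Python, using the Exponential(1) cdf from Example 2.40 as a concrete test case:

```python
import math

def F(x):
    # Exponential(1) cdf as a concrete test case
    return 1 - math.exp(-x) if x > 0 else 0.0

xs = [x / 10 for x in range(-50, 51)]
values = [F(x) for x in xs]

# Non-decreasing: F(x) <= F(x') whenever x <= x'
assert all(a <= b for a, b in zip(values, values[1:]))
# Limits: F approaches 0 and 1 in the tails
print(F(-100.0), F(100.0))
```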

One advantage to using cdfs is that they are defined the same way (\(F_X(x) = \textrm{P}(X\le x)\)) for both continuous and discrete random variables. So results stated in terms of cdfs apply for both discrete and continuous random variables. This is a little more convenient than having two versions of every definition/result/proof: a statement for discrete RVs in terms of pmfs and a separate statement for continuous RVs in terms of pdfs. The following definition is an example.

**Definition 2.9 **
Random variables \(X\) and \(Y\) **have the same distribution** if their cdfs are the same, that is, if \(F_X(u) = F_Y(u)\) for all^{61} \(u\in\mathbb{R}\).

That is, two random variables have the same distribution if all the percentiles are the same. While we generally think of two discrete random variables having the same distribution if they have the same pmf, and two continuous random variables having the same distribution if they have the same pdf, the above definition provides a consistent criterion for any two random variables to have the same distribution, regardless of type.

### 2.7.4 Distributions of transformations

Recall that a function of a random variable is also a random variable. If \(X\) is a random variable, then \(Y=g(X)\) is also a random variable and so it has a probability distribution. Unless \(g\) represents a linear rescaling, a transformation will change the shape of the distribution. So the question is: what is the distribution of \(g(X)\)? We’ll focus on transformations of continuous random variables, in which case the key to answering the question is to work with cdfs.

**Example 2.42 **
Let \(U\) be a random variable with a Uniform(0, 1) distribution, and let \(X=-\log(1-U)\). The goal is to find the pdf of \(X\).

- Identify the possible values of \(X\).
- Find \(F_X(1)\).
- Find the cdf \(F_X(x)\).
- Find the pdf \(f_X(x)\).

*Solution* to Example 2.42

- As always, first determine the range of possible values. When \(u=0\), \(-\log(1-u)=0\), and as \(u\) approaches 1, \(-\log(1-u)\) approaches \(\infty\); see the picture of the function below. So \(X\) takes values in \([0, \infty)\).
- \[ F_X(1) = \textrm{P}(X \le 1) = \textrm{P}(-\log(1-U)\le 1) = \textrm{P}(U\le 1-e^{-1}) \] The above follows since \(-\log(1-u)\le 1\) if and only if \(u\le 1-e^{-1}\); see the picture below. Now since \(U\) has a Uniform(0, 1) distribution, \(\textrm{P}(U\le u)=u\) for \(0<u<1\). The value \(1-e^{-1}\) is just a number between 0 and 1, so \(\textrm{P}(U\le 1-e^{-1}) = 1-e^{-1}\). Therefore \(F_X(1)=1-e^{-1}\approx 0.632\). (This is represented in Figure 2.11 by the area of the region from 0 to 1, 63.2%.)
- As suggested in the paragraph above, the key to finding the pdf is to work with cdfs. We basically repeat the calculation in the previous step, but for a generic \(x\) instead of 1. Consider \(0\le x<\infty\); we wish to find the cdf evaluated at \(x\). \[ F_X(x) = \textrm{P}(X \le x) = \textrm{P}(-\log(1-U)\le x) = \textrm{P}(U \le 1-e^{-x}) \] The above follows since, for \(0<x<\infty\), \(-\log(1-u)\le x\) if and only if \(u\le 1-e^{-x}\); see the picture below (illustrated for \(x=1\)). Now since \(U\) has a Uniform(0, 1) distribution, \(\textrm{P}(U\le u)=u\) for \(0<u<1\). For a fixed \(0<x<\infty\), the value \(1-e^{-x}\) is just a number between 0 and 1, so \(\textrm{P}(U\le 1-e^{-x}) = 1-e^{-x}\). Therefore \(F_X(x)=1-e^{-x}, 0<x<\infty\).
- Differentiate the cdf with respect to \(x\) to find the pdf. \[ f_X(x) = F_X'(x) =\frac{d}{dx}(1-e^{-x}) = e^{-x}, \qquad 0<x<\infty \] Thus we see that \(X\) has the pdf in Example 2.37.
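A simulation can corroborate the derivation. A sketch in plain Python (our own, using `random.random()` as the Uniform(0, 1) spinner):

```python
import math
import random

random.seed(2)

# Simulate X = -log(1 - U) with U ~ Uniform(0, 1) and compare the
# empirical cdf at x = 1 to the derived value F_X(1) = 1 - e^(-1)
n = 100_000
xs = [-math.log(1 - random.random()) for _ in range(n)]

print(1 - math.exp(-1))             # about 0.632
print(sum(x <= 1 for x in xs) / n)  # empirical P(X <= 1)
```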

If \(X\) is a continuous random variable whose distribution is known, the **cdf method** can be used to find the pdf of \(Y=g(X)\)

- Determine the possible values of \(Y\). Let \(y\) represent a generic possible value of \(Y\).
- The cdf of \(Y\) is \(F_Y(y) = \textrm{P}(Y\le y) = \textrm{P}(g(X) \le y)\).
- Rearrange \(\{g(X) \le y\}\) to get an event involving \(X\). Warning: it is not always \(\{X \le g^{-1}(y)\}\). Sketching a picture of the function \(g\) helps.
- Use what is known about the cdf of \(X\), \(F_X\), to get an expression for the cdf of \(Y\) which involves \(F_X\) and some transformation of the value \(y\).
- Differentiate the expression for \(F_Y(y)\) with respect to \(y\), and use what is known about \(F'_X = f_X\), to obtain the pdf of \(Y\). You will typically need to apply the chain rule when differentiating.
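To illustrate the recipe numerically, here is a sketch in plain Python for a hypothetical transformation not taken from the text: \(Y=\sqrt{X}\) with \(X\) Exponential(1). Since \(\sqrt{x}\) is increasing, \(\{\sqrt{X}\le y\}=\{X\le y^2\}\), so \(F_Y(y)=F_X(y^2)=1-e^{-y^2}\), and the chain rule gives \(f_Y(y)=2ye^{-y^2}\). The code checks the analytic pdf against a numerical derivative of the cdf.

```python
import math

def F_X(x):
    # cdf of X ~ Exponential(1)
    return 1 - math.exp(-x) if x > 0 else 0.0

def F_Y(y):
    # Steps 2-4: for y > 0, {sqrt(X) <= y} = {X <= y^2}, so F_Y(y) = F_X(y^2)
    return F_X(y ** 2) if y > 0 else 0.0

def f_Y(y):
    # Step 5: differentiate; the chain rule gives f_X(y^2) * 2y = 2y e^(-y^2)
    return 2 * y * math.exp(-y ** 2)

# Check the analytic pdf against a numerical derivative of the cdf
y, h = 0.7, 1e-6
numeric = (F_Y(y + h) - F_Y(y - h)) / (2 * h)
print(f_Y(y), numeric)  # the two should agree closely
```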

**Example 2.43 **
Let \(X\) be a random variable with a Uniform(-1, 1) distribution and let \(Y=X^2\).

- Sketch the pdf of \(Y\).
- Run a simulation to approximate the pdf of \(Y\).
- Use the cdf method to find the pdf of \(Y\). Is the pdf consistent with your simulation results?

*Solution* to Example 2.43

- First the possible values: since \(-1< X<1\) we have \(0<Y<1\). The idea for the sketch is that squaring a number less than 1 in absolute value returns a smaller number. So the transformation “pushes values towards 0”, making the density higher near 0. Consider the intervals \([0, 0.1]\) and \([0.9, 1]\) on the original scale; both intervals have probability 0.05 under the Uniform distribution. On the squared scale, these intervals correspond to \([0, 0.01]\) and \([0.81, 1]\) respectively. So the 0.05 probability is “squished” into \([0, 0.01]\), resulting in a greater height, while it is “spread out” over \([0.81, 1]\), resulting in a smaller height. Remember: probability is represented by area.
- See the simulation results below.
- Fix \(0<y<1\). We will do the calculation in terms of a generic \(y\), but it often helps to think of \(y\) as a particular number first, like we did in Example 2.42.

\[ F_Y(y) = \textrm{P}(Y\le y) = \textrm{P}(X^2\le y) = \textrm{P}(-\sqrt{y}\le X\le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) \] Note that the event of interest is *not* just \(\{X\le \sqrt{y}\}\); see the picture below. From here we can either (1) use the cdf of \(X\) and then differentiate, or (2) differentiate and then use the pdf of \(X\). We’ll illustrate both.

- (1) Using the Uniform(-1, 1) cdf: the interval \([-\sqrt{y}, \sqrt{y}]\) has length \(2\sqrt{y}\), and the total length of \([-1, 1]\) is 2, so we have \[ F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) = \frac{2\sqrt{y}}{2} = \sqrt{y} \] Now differentiate with respect to the argument \(y\) to obtain \(f_Y(y) = \frac{1}{2\sqrt{y}}, 0<y<1\).
- (2) Differentiating with respect to \(y\) and using the chain rule, we obtain \[\begin{align*} F_Y(y) & = F_X(\sqrt{y}) - F_X(-\sqrt{y})\\ \text{(Differentiate the above)}\qquad f_Y(y) & = f_X(\sqrt{y})\frac{1}{2\sqrt{y}} - f_X(-\sqrt{y})\left(-\frac{1}{2\sqrt{y}}\right) = \frac{1}{2\sqrt{y}}\left(f_X(\sqrt{y})+f_X(-\sqrt{y})\right) \end{align*}\] Since \(X\) has a Uniform(-1, 1) distribution, its pdf is \(f_X(x) = 1/2, -1<x<1\). But for \(0<y<1\), \(\sqrt{y}\) and \(-\sqrt{y}\) are just numbers in \([-1, 1]\), so \(f_X(\sqrt{y})=1/2\) and \(f_X(-\sqrt{y})=1/2\). Therefore, \(f_Y(y) = \frac{1}{2\sqrt{y}}, 0<y<1\). The histogram of simulated values seems consistent with this shape.
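A simulation of \(Y=X^2\) can be sketched in plain Python (our own minimal version, not the text's code), comparing the empirical cdf of \(Y\) to the derived \(F_Y(y)=\sqrt{y}\):

```python
import math
import random

random.seed(3)

# Simulate X ~ Uniform(-1, 1) and Y = X^2; compare the empirical cdf of Y
# to the derived F_Y(y) = sqrt(y), 0 < y < 1
n = 100_000
ys = [random.uniform(-1, 1) ** 2 for _ in range(n)]

for y in [0.01, 0.25, 0.81]:
    print(math.sqrt(y), sum(v <= y for v in ys) / n)
```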

### 2.7.5 Quantile functions and “universality of the uniform”

Recall that the cdf fills in the following blank
for any given \(x\): \(x\) is the (blank) percentile.
The *quantile function* (essentially the inverse cdf^{62}) fills in the
following blank for a given \(p\): the \(p\)th percentile is (blank).

**Example 2.44 **
Let \(X\) have an Exponential(1) distribution. Recall that the cdf is \(F_X(x) = 1-e^{-x}, x>0\).

- Find the median (i.e., 50th percentile). How does this appear in Figure 2.11?
- Find the 63.2th percentile. How does this appear in Figure 2.11?
- Find the 86.5th percentile. How does this appear in Figure 2.11?
- Specify a function that returns the \(p\)th percentile for \(p\in[0,1]\). How does this function relate to the spinner in Figure 2.11?

*Solution* to Example 2.44

- Set \(0.5 = \textrm{P}(X\le x) = 1-e^{-x}\) and solve to find \(x=\log(2)\approx0.693\). This is represented by the halfway point in the spinner in Figure 2.11.
- Set \(0.632 = \textrm{P}(X\le x) = 1-e^{-x}\) and solve to find \(x\approx 1\). We see that the value 1 in Figure 2.11 has an area below it of 63.2%.
- Set \(0.865 = \textrm{P}(X\le x) = 1-e^{-x}\) and solve to find \(x\approx 2\). We see that the value 2 in Figure 2.11 has an area below it of 86.5%.
- Set \(p = \textrm{P}(X\le x) = 1-e^{-x}\) and solve for \(x\) to find \(x=-\log(1-p)\). So given the area \(p\) of a region starting from 0 in the spinner in Figure 2.11, \(-\log(1-p)\) labels the corresponding value on the axis.
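The quantile function found above is easy to code and check. A sketch in plain Python (the name `exp1_quantile` is ours):

```python
import math

def exp1_quantile(p):
    """Quantile function of Exponential(1): solve p = 1 - e^(-x) for x."""
    return -math.log(1 - p)

print(exp1_quantile(0.5))    # log(2), about 0.693: the median
print(exp1_quantile(0.632))  # about 1
print(exp1_quantile(0.865))  # about 2
```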

Recall how the spinner in Figure 2.11 was constructed. We started with the Uniform(0, 1) spinner with equally spaced increments, and applied the transformation \(-\log(1-u)\), which “stretched” the intervals corresponding to higher probability and “shrunk” the intervals corresponding to lower probability. The distribution that we ended up with was the Exponential(1) distribution with cdf \(1-e^{-x}, x>0\). Notice now that the transformation \(-\log(1-u)\) corresponds to the quantile function of an Exponential(1) distribution.

But the previous example tells us how to go backwards. Starting from a cdf, we can construct the corresponding spinner by finding the quantile function, essentially the inverse cdf, and applying it to the equally spaced values on the Uniform(0, 1) spinner. The quantile function will stretch/shrink the intervals just right to correspond to the probabilities given by the cdf.

This is the idea behind “universality of the uniform”. Basically, we can always start with a spinner that is “equally likely” to land in (0, 1) and suitably stretch/shrink the axis around the spinner to construct a spinner corresponding to any distribution of interest. The Uniform(0, 1) spinner returns a value in (0, 1); we obtain the corresponding percentile, that is, the value of the variable, from the quantile function.

**Universality of the Uniform.** Let \(F\) be a cdf and \(Q\) its corresponding quantile function. Let \(U\) have a Uniform(0, 1) distribution and define the random variable \(X=Q(U)\). Then the cdf of \(X\) is \(F\).

In the above, \(U\) represents the result of the spin on the [0, 1] scale, and \(Q(U)\) is the corresponding value on the stretched/shrunk scale.

We’ll only prove the result assuming \(F\) is a continuous, strictly increasing function, so that the quantile function is just the inverse of \(F\), \(Q(p) = F^{-1}(p)\). \[ \textrm{P}(X \le x) = \textrm{P}(F^{-1}(U)\le x) = \textrm{P}(U\le F(x)) = F(x) \] The last step follows since \(F(x)\) is just a number in [0, 1] and \(\textrm{P}(U\le u) = u\) for \(0\le u\le 1\) since \(U\) has a Uniform(0, 1) distribution.
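The result can be illustrated in plain Python. As a sketch (our own, using the distribution from Example 2.43, with cdf \(F(y)=\sqrt{y}\) on (0, 1) and quantile function \(Q(p)=p^2\)): applying \(Q\) to Uniform(0, 1) values should produce values whose empirical cdf matches \(F\).

```python
import math
import random

random.seed(4)

def Q(p):
    # Quantile function of the distribution with cdf F(y) = sqrt(y)
    # on (0, 1) (the distribution of Y in Example 2.43)
    return p ** 2

# Universality of the uniform: X = Q(U) with U ~ Uniform(0, 1)
# should have cdf F(x) = sqrt(x)
n = 100_000
xs = [Q(random.random()) for _ in range(n)]

for x in [0.25, 0.5]:
    print(math.sqrt(x), sum(v <= x for v in xs) / n)  # cdf vs. simulation
```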

### 2.7.6 Joint distributions

Most interesting problems involve two or more^{63} random variables defined on the same probability space. In these situations, we can consider how the variables vary together, or jointly, and study their relationship. The *joint distribution* of random variables \(X\) and \(Y\) (defined on the same probability space) is a probability distribution on \((x, y)\) *pairs*. In this context, the distribution of one of the variables alone is called a *marginal distribution*.

Think of a joint distribution of two discrete RVs as a spinner that returns pairs of values, like the one in Figure 2.7. A spinner is harder to conceptualize for two continuous RVs, but you can think of a “globe” like the one discussed in Section 2.6.8. However, in many common situations we will see that there are simpler alternatives to the globe method for simulating two continuous RVs.

There are joint analogs of pmfs (for two discrete RVs), pdfs (for two continuous RVs), and cdfs (for any RVs).

**Definition 2.10 **
The **joint probability mass function (pmf)** of two *discrete* random variables \((X,Y)\) defined on a probability space with probability measure \(\textrm{P}\) is the function \(p_{X,Y}:\mathbb{R}^2\mapsto[0,1]\) defined by
\[
p_{X,Y}(x,y) = \textrm{P}(X= x, Y= y) \qquad \text{ for all } x,y
\]

The axioms of probability imply that a valid joint pmf must satisfy \[\begin{align*} p_{X,Y}(x,y) & \ge 0 \quad \text{for all $x, y$}\\ p_{X,Y}(x,y) & >0 \quad \text{for at most countably many $(x,y)$ pairs (the possible values, i.e., support)}\\ \sum_x \sum_y p_{X,Y}(x,y) & = 1 \end{align*}\]

The *marginal pmfs* are determined by the joint pmf by the law of total probability. If we imagine a plot with blocks whose heights represent the joint probabilities, the marginal probability of a particular value of one variable can be obtained by “stacking” all the blocks corresponding to that value. In terms of a two-way table, a marginal distribution can be obtained by “collapsing” the table by summing across the rows or columns.

\[\begin{align*} p_X(x) & = \sum_y p_{X,Y}(x,y) \\ p_Y(y) & = \sum_x p_{X,Y}(x,y) \end{align*}\]

**Example 2.45 **
Recall the coin flipping problem from Example 2.20 and Section 2.6.3 regarding a sequence of 4 fair coin flips. Let \(Z\) be the number of flips immediately following H, and let \(Y\) be the number of flips immediately following H that result in H.

Table 2.1 displays the 16 possible outcomes in the sample space along with the value of \(X, Y, Z\) for each outcome. Note that \(Y\) is only defined for 14 of the outcomes, so when determining the joint distribution of \((Y, Z)\) we only consider those outcomes for which both random variables are defined.

- Find the joint pmf of \(Y\) and \(Z\).
- Find the marginal pmf of \(Y\).
- Find the marginal pmf of \(Z\).

*Solution* to Example 2.45

- The joint pmf of \(Y\) and \(Z\) follows from Table 2.1 and is given by the following table. For example, \(p_{Y,Z}(1,2) = \textrm{P}(Y=1, Z=2) = \textrm{P}(\{HHTH, HTHH, HHTT, THHT\})=4/14\).

\(p_{Y, Z}(y, z)\)

| \(y\) \ \(z\) | 1 | 2 | 3 |
|---|-----:|-----:|-----:|
| 0 | 5/14 | 1/14 | 0 |
| 1 | 1/14 | 4/14 | 0 |
| 2 | 0 | 1/14 | 1/14 |
| 3 | 0 | 0 | 1/14 |

The marginal pmf of \(Y\) is found by computing the row sums (collapsing the \(z\) values).

| \(y\) | \(p_Y(y)\) |
|----:|------:|
| 0 | 6/14 |
| 1 | 5/14 |
| 2 | 2/14 |
| 3 | 1/14 |

The marginal pmf of \(Z\) is found by computing the column sums (collapsing the \(y\) values).

| \(z\) | 1 | 2 | 3 |
|---|-----:|-----:|-----:|
| \(p_Z(z)\) | 6/14 | 6/14 | 2/14 |
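The joint and marginal pmfs above can be verified by enumerating the 16 outcomes directly. A sketch in plain Python (our own, not the text's code):

```python
from fractions import Fraction
import itertools

# Enumerate the 16 equally likely outcomes of 4 fair coin flips.
# Z = number of flips immediately following H;
# Y = number of those flips that result in H.
counts = {}
for flips in itertools.product("HT", repeat=4):
    followers = [flips[i + 1] for i in range(3) if flips[i] == "H"]
    z = len(followers)
    if z == 0:
        continue  # Y is undefined for these 2 outcomes; 14 remain
    y = followers.count("H")
    counts[(y, z)] = counts.get((y, z), 0) + 1

pmf = {yz: Fraction(c, 14) for yz, c in counts.items()}
print(pmf[(1, 2)])        # 4/14, printed in lowest terms as 2/7
print(sum(pmf.values()))  # 1
```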

**Definition 2.11 **
The **joint probability density function (pdf)** of two *continuous* random variables \((X,Y)\) defined on a probability space with probability measure \(\textrm{P}\) is the function \(f_{X,Y}:\mathbb{R}^2\mapsto[0,\infty)\) which satisfies
\[
\textrm{P}(X\le x, Y\le y) = \int_{-\infty}^x\int_{-\infty}^y f_{X,Y}(u,v)\, dudv \qquad \text{ for all } x,y
\]

A valid joint pdf must satisfy
\[\begin{align*}
f_{X,Y}(x,y) & \ge 0\\
\int_{-\infty}^\infty\int_{-\infty}^\infty f_{X,Y}(x,y)\, dx dy & = 1
\end{align*}\]
Pairs of two continuous RVs will take values in the \((x, y)\) plane. A joint pdf is a surface with height \(f_{X,Y}(x,y)\) at \((x, y)\). The probability that the \((X,Y)\) pair of RVs lies in the region \(A\) is the *volume* under the pdf surface over the region \(A\)
\[
\textrm{P}[(X,Y)\in A] = \iint\limits_{A} f_{X,Y}(x,y)\, dx dy
\]

The marginal pdfs can be obtained from the joint pdf by the law of total probability. In the discrete case, to find the marginal probability that \(X\) is equal to \(x\), sum the joint pmf \(p_{X, Y}(x, y)\) over all possible \(y\) values. The continuous analog is to integrate the joint pdf \(f_{X,Y}(x,y)\) over all possible \(y\) values to find the marginal density of \(X\) at \(x\). This can be thought of as “stacking” or “collapsing” the joint pdf.

\[\begin{align*} f_X(x) & = \int_{-\infty}^\infty f_{X,Y}(x,y) dy \\ f_Y(y) & = \int_{-\infty}^\infty f_{X,Y}(x,y) dx \end{align*}\]

The marginal distribution of \(X\) is a distribution on \(x\) values only. For example, the pdf of \(X\) is a function of \(x\) only (and not \(y\)). (Similarly the pdf of \(Y\) is a function of \(y\) only and not \(x\).)

In general the marginal distributions do not determine the joint distribution, unless the RVs are independent. In terms of a table: you can get the totals from the interior cells, but in general you can’t get the interior cells from the totals.


**Example 2.46 **
Recall the example in Section 2.6.6. Consider the probability space corresponding to two spins of the Uniform(0, 1) spinner and let \(X\) be the sum of the two spins and \(Y\) the larger.

- Identify the possible \((X, Y)\) pairs.
- Find the joint pdf of \((X, Y)\).
- Suggest an expression for the marginal pdf of \(Y\).
- Derive the marginal pdf of \(Y\).
- Suggest an expression for the marginal pdf of \(X\).
- Derive the marginal pdf of \(X\).

*Solution* to Example 2.46

- Recall the discussion in Section 2.6.6. While marginally \(X\) takes values in (0, 2) and marginally \(Y\) takes values in (0, 1), not every pair in \((0,2)\times(0,1)\) is possible. Rather, the possible values of \((X, Y)\) lie in \(\{(x, y): 0<x<2, 0<y<1, x/2<y<x\}\).
- We saw in Section 2.6.6 that the joint density of \((X, Y)\) is constant over the range of possible values. The region \(\{(x, y): 0<x<2, 0<y<1, x/2<y<x\}\) is a triangle with area 1/2. The joint pdf will be a constant height floating above this triangle. The volume under the density surface is the volume of this triangular “wedge”. If the constant height is 2, then the volume under the surface will be 1. Therefore, the joint pdf of \((X, Y)\) is \[ f_{X, Y}(x, y) = 2, \qquad 0<x<2,\; 0<y<1,\; x/2<y<x. \]
- Marginally \(Y\) takes values in (0, 1). From the plots in Section 2.6.6 we see that the marginal density of \(Y\) is 0 at 0 and increases as \(y\) increases to 1, and it appears that the density is a linear function of \(y\). So we guess a function of the form \(f_Y(y) = cy, 0<y<1\). To find \(c\) remember that the total area under the density must be 1. Since the area under the density curve is the area of a triangle with base 1 and height (at \(y=1\)) of \(c\), we must have \(c=2\) in order for the area under the curve to be 1. So we guess that the marginal pdf of \(Y\) is \(f_Y(y) = 2y, 0<y<1\).

- Fix a particular \(y\) in (0, 1). Find the marginal pdf of \(Y\) at \(y\) by “collapsing” the joint pdf over the corresponding \(x\) values. Note that given \(y\), we must have \(y<x<2y\) (otherwise the density is 0). This determines the bounds of integration when integrating out the \(x\) variable over the “horizontal strip” at \(y\). \[ f_Y(y) = \int_{y}^{2y} 2\, dx = 2x \Big\rvert_{y}^{2y} = 2y \] Therefore the marginal pdf of \(Y\) is \[ f_Y(y) = 2y, \quad 0<y<1 \] Notice that this is a function of \(y\) only, and is consistent with our guess in the previous part.
- Marginally \(X\) takes values in (0, 2). From the plots in Section 2.6.6 we see that the marginal density of \(X\) will be largest at 1 and will decrease symmetrically as \(x\) goes to 0 or 2. It also appears that the density is made up of two linear pieces, one with positive slope for \(0<x<1\) and one with negative slope for \(1<x<2\), that connect at \(x=1\). The positive slope piece is 0 at \(x=0\) and the negative slope piece is 0 at \(x=2\). So we guess a function of the form \[ f_X(x) = \begin{cases} cx, & 0\le x\le 1,\\ c(2-x), & 1\le x\le 2 \end{cases} \] When \(x=1\), both pieces have a height of \(c\). To find \(c\) we apply the principle that the total area under the density curve must be 1. The total area under the above curve is the area of a triangle with base 2 and height \(c\) (at \(x=1\)), so \(c\) must be 1 in order for the area to be 1.
- We find the marginal pdf at \(x\) by integrating out the \(y\) variable over the “vertical strip” at \(x\). But looking at the vertical strips in the joint density plot, we see that there will be two general cases: \(0<x<1\) and \(1<x<2\). Fix a particular \(x\) in (0, 1). Find the marginal pdf of \(X\) at \(x\) by “collapsing” the joint pdf over the corresponding \(y\) values. Note that given \(0<x<1\), we must have \(x/2<y<x\) (otherwise the density is 0). This determines the bounds of integration when integrating out the \(y\) variable over the vertical strip for \(0<x<1\) \[ f_X(x) = \int_{x/2}^{x} 2\, dy = 2y \Big\rvert_{x/2}^{x} = x \] Now fix a particular \(x\) in (1, 2); in this case we must have \(x/2<y<1\) (otherwise the density is 0). This determines the bounds of integration when integrating out the \(y\) variable over the vertical strip for \(1<x<2\) \[ f_X(x) = \int_{x/2}^{1} 2\, dy = 2y \Big\rvert_{x/2}^{1} = 2-x \] Therefore the marginal pdf of \(X\) is \[ f_X(x) = \begin{cases} x, & 0\le x\le 1,\\ 2-x, & 1\le x\le 2 \end{cases} \] Notice that this is a function of \(x\) only, and is consistent with our guess in the previous part.
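A simulation can corroborate both marginal pdfs. A sketch in plain Python (our own minimal version), checking two probabilities implied by \(f_Y(y)=2y\) and the triangular \(f_X\):

```python
import random

random.seed(5)

# Two spins of the Uniform(0, 1) spinner; X = sum, Y = larger
n = 100_000
pairs = [(random.random(), random.random()) for _ in range(n)]
xs = [u1 + u2 for u1, u2 in pairs]
ys = [max(u1, u2) for u1, u2 in pairs]

# P(Y <= 0.5) = area under f_Y(y) = 2y from 0 to 0.5 = 0.25
print(sum(y <= 0.5 for y in ys) / n)
# P(X <= 1) = area under f_X(x) = x from 0 to 1 = 0.5
print(sum(x <= 1 for x in xs) / n)
```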

**Definition 2.12 **
The **joint cdf** of random variables \((X,Y)\) defined on a probability space with probability measure \(\textrm{P}\) is the function \(F_{X,Y}:\mathbb{R}^2\mapsto[0,1]\) defined by
\[
F_{X,Y}(x,y) = \textrm{P}(X\le x, Y\le y) \qquad \text{ for all } x,y
\]


### 2.7.7 Mixed discrete and continuous random variables

LATER

There is another type of weird random variable which has a “singular” distribution, like the Cantor distribution, but we’re counting these random variables as not commonly encountered.↩

Which isn’t quite true.↩

This is an example of a Binomial probability mass function. We will see the details behind this formula in Section **??**.↩

Recall factorial notation: if \(k\) is a positive integer, then \(k!=k(k-1)(k-2)\cdots (2)(1)\), e.g., \(5! = 5(4)(3)(2)(1) = 120\). Remember, \(0!=1\) by definition.↩

We use “pmf” for discrete distributions and reserve “pdf” for continuous probability density functions. The terms “pdf” and “density” are sometimes used in both discrete and continuous situations even though the objects the terms represent differ between the two situations (probability versus density). In particular, in R the `d` commands (`dbinom`, `dnorm`, etc) are used for both discrete and continuous distributions. In Symbulate, you can use `.pmf()` for discrete distributions.↩

The same is *not* true for discrete random variables. For example, if \(X\) is the number of heads in three flips of a fair coin then \(\textrm{P}(X<1)= \textrm{P}(X=0)=1/8\) but \(\textrm{P}(X \le 1)=\textrm{P}(X=0)+\textrm{P}(X=1) = 4/8\).↩

This is true because an integral is really just a sum of the areas of many rectangles with narrow bases. Over a small interval of values surrounding \(x\), the density shouldn’t change that much, so we can estimate the area under the curve by the area of a rectangle with height \(f_X(x)\) and base equal to the length of the small interval of interest.↩

Reporting so many decimal places is unnecessary, and provides a false sense of precision. All of these idealized mathematical models are at best approximately true in practice. However, we provide the extra decimal places here to compare the approximation with the “exact” calculation.↩

Here \(x\) represents a particular value of interest, so we use a different dummy variable, \(u\), in the integrand.↩

This follows from the subset rule, since if \(x\le \tilde{x}\) then \(\{X\le x\}\subseteq\{X\le \tilde{x}\}\).↩

Note that \(u\) just represents a dummy variable, the argument of the two functions. While we generally think of \(x\) as the argument of \(F_X\), that is just a convenient labeling. Here we are checking for equality of two *functions*, so we need to use the same input for both. That is, something like “\(F_X(x) = F_Y(y)\)” makes no sense because \(x\) and \(y\) represent different inputs.↩

If the cdf is a continuous function, then the quantile function is the inverse cdf. But the inverse of a cdf might not exist, if the cdf has jumps or flat spots. In particular, the inverse cdf does not exist for discrete RVs. So in general, the quantile function corresponding to cdf \(F\) is defined as \(Q(p) = \inf\{u:F(u)\ge p\}\).↩

We mostly focus on the case of two random variables, but analogous definitions and concepts apply for more than two (though the notation can get a bit messier).↩