## 2.8 Expected Values

The probabilities that make up a cdf can be interpreted as long run relative frequencies. A distribution can also be described by certain summary characteristics, like its long run average value.

**Example 2.47 (Matching problem)** Recall the “matching problem” from Example 2.25.
A set of \(n\) cards labeled \(1, 2, \ldots, n\) is placed in \(n\)
boxes labeled \(1, 2, \ldots, n\), with exactly one card in each box. Let
\(Y\) be the number of cards (out of \(n\)) which match the number of the
box in which they are placed. We are interested in the distribution of \(Y\).

Consider the case \(n=4\). We can consider as the sample space the possible ways in which the cards can be distributed to the four boxes. For example, 3214 represents that card 3 is returned (wrongly) to the first box, card 2 is returned (correctly) to the second box, etc. So the sample space consists of the following 24 outcomes^{64}, which we will assume are equally likely.

```
1234 1243 1324 1342 1423 1432 2134 2143
2314 2341 2413 2431 3124 3142 3214 3241
3412 3421 4123 4132 4213 4231 4312 4321
```

- Find the pmf of \(Y\).
- How could you use simulation to approximate the long run average value of \(Y\)?
- How could you simplify the calculation in the previous part? In other words, based on the theoretical distribution of \(Y\), what would be the corresponding mathematical formula for the theoretical long run average value of \(Y\), a.k.a., the *expected value* of \(Y\), \(\textrm{E}(Y)\)?
- Is \(\textrm{E}(Y)\) the “value that we would expect” on a *single* repetition of the phenomenon? Is it the “most likely” value of the RV? If not, explain in what sense \(\textrm{E}(Y)\) is “expected.”

*Solution* to Example 2.47

- Recall the solution to Example 2.25.

| \(y\) | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| \(p_Y(y)\) | 9/24 | 8/24 | 6/24 | 1/24 |

- We estimate “long run” quantities by performing a simulation for many repetitions. We could either simulate an outcome from the probability space and then measure \(Y\), or we could simulate \(Y\) values directly from its distribution. In either case, simulate many values of \(Y\) and compute the average as usual: sum all the values and divide by the number of values. The table below displays the results of a simulation with 24,000 repetitions.
- Each simulated value of \(Y\) is either 0, 1, 2, or 4; these are the only possible values. So when summing all the simulated values, each value in the sum is 0, 1, 2, or 4, e.g. \(2+0+1+0+0+4+2+2+1+\cdots\). Since each of the possible values is repeated multiple times in the sum, we just need to count the number of times each possible value appears in the sum. In other words, to compute the average of the simulated values, we just need to multiply each possible value by its relative frequency, and then sum. Since in the long run, the relative frequencies approximate the theoretical probabilities, the theoretical long run average should be \[ 0\times 9/24 + 1 \times 8/24 + 2 \times 6/24 + 4 \times 1/24 = 1 \]
- We wouldn’t necessarily expect to see 1 match in a single repetition; there is natural variability. Moreover, the most likely value of \(Y\) is 0. The expected value of 1 is what we expect *on average in the long run*. If the matching scenario were repeated many times, then the long run average number of matches would be (close to) 1.
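The pmf and the probability-weighted average in this solution can be checked by brute-force enumeration. The following is a minimal sketch that uses only the Python standard library rather than Symbulate. The names `boxes`, `outcomes`, `pmf`, and `expected_value` are illustrative choices, not part of the original example.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

n = 4
boxes = range(1, n + 1)  # box labels 1, ..., n

def count_matches(cards):
    """Number of positions where the card label equals the box label."""
    return sum(card == box for card, box in zip(cards, boxes))

# All n! equally likely ways to place the cards in the boxes
outcomes = list(permutations(boxes))
counts = Counter(count_matches(cards) for cards in outcomes)

# pmf of Y: number of outcomes with y matches divided by the number of outcomes
pmf = {y: Fraction(c, len(outcomes)) for y, c in sorted(counts.items())}

# Expected value: probability-weighted average of the possible values
expected_value = sum(y * p for y, p in pmf.items())

for y, p in pmf.items():
    print(y, p)  # fractions are printed in lowest terms
print("E(Y) =", expected_value)
```

Exact fractions are used so that the result can be compared with the table directly, without rounding error.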

The following code simulates the matching problem. The main programming aspect is to write the `count_matches` function, which takes as an input a sequence of card labels and returns as an output the number of matches. This function can then be used to define a `RV` (just as we use built-in functions like `sum` and `max`).

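```
# These imports are not in the original listing; they assume Symbulate is installed
from symbulate import *
import matplotlib.pyplot as plt

n = 4
labels = list(range(n))  # list of card (and box) labels [0, ..., n-1]

# define a function which counts number of matches:
# position i is a match when the card placed in box i has label i
def count_matches(x):
    count = 0
    for i in range(n):
        if x[i] == labels[i]:
            count += 1
    return count

# place the n cards in the n boxes, without replacement
P = BoxModel(labels, size=n, replace=False)
Y = RV(P, count_matches)

y = Y.sim(24000)  # simulate 24,000 values of Y

plt.figure()
y.plot()  # plot the simulated relative frequencies
plt.show()

print(y.mean())  # average of the simulated values
```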

`## 0.991625`
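As a cross-check that does not depend on Symbulate, the sketch below simulates the matching problem with the standard library and computes the average two ways: the “usual” way (sum, then divide) and as a sum of each value times its relative frequency. The seed and the names `simulate_matches`, `usual_average`, and `weighted_average` are assumptions made for illustration.

```python
import random
from collections import Counter

random.seed(2024)  # arbitrary seed so the run is reproducible

n = 4
n_reps = 24000
labels = list(range(n))  # card (and box) labels 0, ..., n-1

def simulate_matches():
    """Shuffle the cards into the boxes and count the matches."""
    cards = labels[:]
    random.shuffle(cards)
    return sum(card == box for card, box in zip(cards, labels))

values = [simulate_matches() for _ in range(n_reps)]

# "Usual" average: sum all simulated values, divide by the number of values
usual_average = sum(values) / n_reps

# Same average, computed as the sum of (value x relative frequency)
rel_freq = {y: c / n_reps for y, c in Counter(values).items()}
weighted_average = sum(y * f for y, f in rel_freq.items())

print(sorted(rel_freq.items()))
print(usual_average, weighted_average)
```

The two averages agree up to floating-point rounding, and the relative frequencies should be close to 9/24, 8/24, 6/24, and 1/24.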

**Definition 2.13**
The **expected value** (a.k.a. *expectation*, a.k.a. *mean*) of a random variable \(X\) defined on a probability space with measure \(\textrm{P}\) is a number denoted \(\textrm{E}(X)\) representing the probability-weighted average value of \(X\). Expected value is defined as

\[\begin{align*} & \text{Discrete $X$ with pmf $p_X$:} & \textrm{E}(X) & = \sum_x x p_X(x)\\ & \text{Continuous $X$ with pdf $f_X$:} & \textrm{E}(X) & =\int_{-\infty}^\infty x f_X(x) dx \end{align*}\]

Keep in mind that \(\textrm{E}(X)\) represents a single number. The expected value is the “balance point” (center of gravity) of a distribution. Imagine the impulse plot/histogram is constructed by stacking blocks on a board. Then \(\textrm{E}(X)\) represents where you would need to place a stand underneath the board so that it doesn’t tip to one side.

**Example 2.48**
Let \(X\) be a random variable which has the Exponential(1) distribution.

- Use simulation to approximate \(\textrm{E}(X)\).
- Donny Dont says \(\textrm{E}(X) = \int_0^\infty e^{-x}dx = 1\). Do you agree?
- Compute \(\textrm{E}(X)\). Compare with your simulation.
- Find the median value of \(X\). Is the median less than, greater than, or equal to the mean? Why does this make sense?

*Solution* to Example 2.48

`## 1.0111109252528008`

- See the simulation results above. The mean of the simulated values is around 1.
- Donny got the correct value, but that’s just coincidence. His method is wrong. Donny integrated \(\int_{-\infty}^\infty f_X(x)dx\) which will always be 1. To get the expected value, you need to find the probability weighted average value of \(x\): \(\int_{-\infty}^\infty x f_X(x)dx\). Don’t forget the \(x\).
- Since \(X\) is continuous, with pdf \(f_X(x) = e^{-x}\), we integrate \[ \textrm{E}(X) = \int_0^\infty x e^{-x}dx = 1 \]
- The cdf is \(1-e^{-x}\), so setting the cdf to 0.5 and solving for \(x\) yields the median: \(0.5=1-e^{-x}\) implies \(x=\log(2)\approx 0.693\). The mean is greater than the median because the large values in the right tail “pull the mean up”.
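The answers in this solution can be checked numerically. The sketch below is one possible check using only the standard library: it simulates Exponential(1) values with `random.expovariate`, and it approximates Donny's integral and the correct integral with a simple midpoint rule. The helper `midpoint_integral`, the truncation point `upper`, and the seed are assumptions introduced here.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed so the run is reproducible
n_reps = 100_000

# Simulate Exponential(1) values, then compute the average and the median
x = [random.expovariate(1.0) for _ in range(n_reps)]
sim_mean = statistics.fmean(x)
sim_median = statistics.median(x)

def midpoint_integral(g, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

upper = 50.0  # truncation point; the area beyond it is negligible here

# Donny's integral: integrates the pdf alone, so it is the total probability
total_prob = midpoint_integral(lambda t: math.exp(-t), 0.0, upper)

# Correct integral: weight each x by the pdf, then integrate
mean_integral = midpoint_integral(lambda t: t * math.exp(-t), 0.0, upper)

print("simulated mean:", sim_mean)
print("simulated median:", sim_median, "vs log(2) =", math.log(2))
print("integral of f:", total_prob, " integral of x f:", mean_integral)
```

Both integrals are close to 1 for the Exponential(1) distribution, which is why Donny's answer matched by coincidence. For a different rate parameter, the integral of the pdf would still be 1 but the mean would not.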

The probability of an event \(A\), \(\textrm{P}(A)\), is defined by the underlying probability measure \(\textrm{P}\). However, \(\textrm{P}(A)\) can be interpreted as a long run relative frequency and can be approximated via a simulation consisting of many repetitions of the random phenomenon. Similarly, the expected value of a random variable \(X\) is defined by the probability-weighted average according to the underlying probability measure. But the expected value can also be interpreted as the **long-run average value**, and so can be approximated via simulation. The fact that the long run average is equal to the probability-weighted average is known as the *law of large numbers*.

Figure (matching problem, \(n=4\)): The plot on the left displays the running average of simulated values of \(Y\), the number of matches, as the number of simulated values increases (from 1 to 100). The plot on the right displays a running average path for each of five separate simulations. We can see that when the number of simulated values is small, the averages vary a great deal from simulation to simulation. But as the number of simulated values increases, the average for each simulation starts to settle down to 1, the theoretical expected value.
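Running averages like those described for the matching problem can be computed with a few lines of standard-library Python. This is a sketch under assumed names (`simulate_matches`, `running_averages`) and arbitrary seeds. It prints a few checkpoints along each of five paths rather than drawing a plot.

```python
import random
from itertools import accumulate

def simulate_matches(n, rng):
    """Shuffle n cards into n boxes and count the matches."""
    cards = list(range(n))
    rng.shuffle(cards)
    return sum(card == box for box, card in enumerate(cards))

def running_averages(n_values, n=4, seed=None):
    """Running average of the first k simulated values, for k = 1, ..., n_values."""
    rng = random.Random(seed)
    values = [simulate_matches(n, rng) for _ in range(n_values)]
    return [total / k for k, total in enumerate(accumulate(values), start=1)]

# Five separate simulations, each with its own (arbitrary) seed
paths = [running_averages(10_000, seed=s) for s in range(5)]

for path in paths:
    # Running average after 10, 100, 1000, and 10000 simulated values
    print([round(path[k - 1], 3) for k in (10, 100, 1000, 10_000)])
```

The early checkpoints typically differ noticeably from path to path, while the later ones cluster near 1.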

From a simulation perspective, you can read the symbol \(\textrm{E}(\cdot)\) as

- Simulate lots of values of what’s inside \((\cdot)\)
- Compute the average. (This is a “usual” average; just sum all the simulated values and divide by the number of simulated values.)

**Example 2.49**
Continuing Example 2.32. Consider the probability space corresponding to a sequence of three
flips of a fair coin. Let \(X\) be the number of heads, \(Y\) the number of
tails, and \(Z\) the length of the longest streak of heads.

- Compute \(\textrm{E}(X)\).
- Find and interpret \(\textrm{P}(X=\textrm{E}(X))\).
- Is \(\textrm{E}(X)\) the “value that we would expect” on a repetition of the process? If not, explain in what sense \(\textrm{E}(X)\) is “expected.”
- Without doing any calculations, find \(\textrm{E}(Y)\). Explain.
- Without doing any calculations determine if \(\textrm{E}(Z)\) will be greater than, less than, or equal to \(\textrm{E}(X)\). Confirm your guess by computing \(\textrm{E}(Z)\).
- Suppose now that the coin is biased and has a probability of 0.6 of landing on heads. Would the sample space change? Would the RVs change? Would the probability measure \(\textrm{P}\) change? Would the distribution of \(X\) change? Would the expected value \(\textrm{E}(X)\) change?

*Solution* to Example 2.49

- \(\textrm{E}(X)=0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5\).
- \(\textrm{P}(X=\textrm{E}(X))=\textrm{P}(X = 1.5) = 0\). Remember, \(\textrm{E}(X)\) is just a number.
- \(\textrm{E}(X)\) is not the “value that we would expect” on a repetition of the process; it’s not even possible to observe an \(X\) of 1.5 on a single repetition. You can’t flip a coin 3 times and observe 1.5 heads in the 3 flips. But over many sets of 3 fair coin flips, we expect 1.5 heads per set on average.
- Since \(X\) and \(Y\) have the same distribution, \(\textrm{E}(Y) =\textrm{E}(X) = 1.5\). If \(X\) and \(Y\) have the same long run pattern of variability, then they will have the same long run average value.
- Since for each outcome \(X\ge Z\) we have \(\textrm{E}(X) \ge \textrm{E}(Z)\). \(\textrm{E}(Z)=0(1/8) + 1(4/8) + 2(2/8) + 3(1/8) = 11/8\).
- The sample space and the RVs would not change. But the probability measure \(\textrm{P}\) would change and so would the distribution of \(X\) and \(\textrm{E}(X)\).
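The calculations in this solution can be reproduced by enumerating the eight outcomes. In the sketch below the sample space and the random variables are ordinary Python functions that stay fixed, and only the probability of heads changes between the fair and biased coin. The function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def num_heads(flips):
    return flips.count("H")

def num_tails(flips):
    return flips.count("T")

def longest_head_streak(flips):
    """Length of the longest run of consecutive heads."""
    best = current = 0
    for flip in flips:
        current = current + 1 if flip == "H" else 0
        best = max(best, current)
    return best

def expected_value(rv, p_heads):
    """Probability-weighted average of rv over all sequences of three flips."""
    total = 0
    for flips in product("HT", repeat=3):
        prob = 1
        for flip in flips:
            prob *= p_heads if flip == "H" else 1 - p_heads
        total += rv(flips) * prob
    return total

fair = Fraction(1, 2)
biased = Fraction(3, 5)  # probability 0.6 of heads

print("fair:   E(X), E(Y), E(Z) =",
      expected_value(num_heads, fair),
      expected_value(num_tails, fair),
      expected_value(longest_head_streak, fair))
print("biased: E(X) =", expected_value(num_heads, biased))
```

Changing `p_heads` changes the probability measure, and with it the expected values, while `num_heads`, `num_tails`, and `longest_head_streak` are unchanged.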

If two random variables \(X\) and \(Y\) have the same distribution (i.e., same long run pattern of variation) then they have the same expected value (i.e., same long run average value).

But the converse is not true: \(\textrm{E}(X) = \textrm{E}(Y)\) does NOT imply that \(X\) and \(Y\) have the same distribution. Expected value is just one summary characteristic of a distribution, i.e., the average. But there can be much more to the pattern of variation, i.e., the distribution. The following plot illustrates just a few different distributions that all have expected value 0.5. (Don’t worry about what these distributions are for now; we’ll encounter them later. Just know that they represent different patterns of variability.)

If \(X\le Y\) — that is, if \(X(\omega)\le Y(\omega)\) for all^{65}
\(\omega\in\Omega\) — then \(\textrm{E}(X)\le\textrm{E}(Y)\). That is, if for every outcome the value of \(X\) is no bigger than the value of \(Y\), then the long run average value of \(X\) can’t be any bigger than the long run average value of \(Y\).

The distribution of a RV and its expected value depend on the probability measure \(\textrm{P}\). If the probability measure changes (e.g., from representing a fair coin to a biased coin) then distributions and expected values of RVs will change. Remember that \(\textrm{P}\) represents a probability measure that incorporates all the underlying assumptions about the random phenomenon (the symbol \(\textrm{P}\) is more than just shorthand for “probability”). In the same way, the symbol \(\textrm{E}\) is more than just shorthand for “expected value”. Rather \(\textrm{E}\) represents the probability-weighted/long run average value according to all the underlying assumptions of the random phenomenon as specified by the probability measure \(\textrm{P}\). In fact, a more appropriate symbol might be \(\textrm{E}_{\textrm{P}}\) to emphasize the dependence on the probability measure. We will only use such notation if multiple probability measures are being considered on the same sample space, e.g., \(\textrm{E}_{\textrm{P}}\) represents the expected value according to probability measure \(\textrm{P}\) (e.g., fair coin), while \(\textrm{E}_{\textrm{Q}}\) represents expected value according to probability measure \(\textrm{Q}\) (e.g., biased coin).

### 2.8.1 “Law of the unconscious statistician”

A distribution is the complete picture of the long run pattern of variability of a random variable. An expected value is just one particular characteristic of a distribution, namely, the long run average value. We can often compute expected values without first finding the entire distribution.

**Example 2.50**
Consider the probability space corresponding to a sequence of four flips of a fair coin. Suppose that for each flip that lands on H Harry wins 1 from Tom, while for each flip that lands on T Harry loses 1 to Tom. Let \(X\) represent Harry’s net winnings after the four flips (starting with a net winnings of 0).

- Find the distribution of \(X\).
- Compute \(\textrm{E}(X)\).
- Let \(Y=X^2\). Find the distribution of \(Y\).
- Compute \(\textrm{E}(Y)\).
- How could we have computed \(\textrm{E}(X^2)\) without first finding the distribution of \(Y=X^2\)?
- Is \(\textrm{E}(X^2)\) equal to \((\textrm{E}(X))^2\)?

*Solution* to Example 2.50

- Recall the sample space of 16 equally likely outcomes

| \(x\) | -4 | -2 | 0 | 2 | 4 |
|---|---|---|---|---|---|
| \(p_X(x)\) | 1/16 | 4/16 | 6/16 | 4/16 | 1/16 |

- \(\textrm{E}(X) = (-4)(1/16) + (-2)(4/16) + (0)(6/16)+(2)(4/16) + (4)(1/16) = 0\).
- The possible values of \(Y\) are 0, 4, 16. For example, \(p_Y(4) = \textrm{P}(Y=4) = \textrm{P}(X=-2)+\textrm{P}(X=2) = 8/16\).

| \(y\) | 0 | 4 | 16 |
|---|---|---|---|
| \(p_Y(y)\) | 6/16 | 8/16 | 2/16 |

- \(\textrm{E}(Y) = (0)(6/16)+(4)(8/16)+(16)(2/16) = 4\).
- \(X^2\) takes value \((-4)^2\) with probability 1/16, \((-2)^2\) with probability 4/16, …, \((4)^2\) with probability 1/16. \(\textrm{E}(X^2) = (-4)^2(1/16) + (-2)^2(4/16) + (0)^2(6/16)+(2)^2(4/16) + (4)^2(1/16) = 4\). Finding the distribution of \(Y\) and then finding the expected value basically just groups some of the terms in the calculation in the previous sentence together.
- NO! \(\textrm{E}(X^2) = 4\), but \((\textrm{E}(X))^2=0^2 = 0\).
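The two routes to \(\textrm{E}(X^2)\) in this solution can be compared by enumerating the 16 outcomes. The sketch below builds the pmf of \(X\) and of \(Y=X^2\), then computes the expected value of \(Y\) both from its own pmf and by weighting \(x^2\) with the pmf of \(X\). All variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=4))  # 16 equally likely sequences
prob = Fraction(1, len(outcomes))

def net_winnings(flips):
    """Harry wins 1 for each H and loses 1 for each T."""
    return flips.count("H") - flips.count("T")

pmf_x = defaultdict(Fraction)  # pmf of X
pmf_y = defaultdict(Fraction)  # pmf of Y = X^2
for flips in outcomes:
    x = net_winnings(flips)
    pmf_x[x] += prob
    pmf_y[x ** 2] += prob

e_x = sum(x * p for x, p in pmf_x.items())

# "Long way": use the distribution of Y
e_y_long = sum(y * p for y, p in pmf_y.items())

# Shortcut: weight the transformed values x^2 by the pmf of X
e_y_shortcut = sum(x ** 2 * p for x, p in pmf_x.items())

print("E(X) =", e_x)
print("E(X^2) =", e_y_long, "(long way),", e_y_shortcut, "(shortcut)")
print("(E(X))^2 =", e_x ** 2)
```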

The calculation of \(\textrm{E}(X^2)\) of part 5 of the previous exercise probably seemed pretty natural, and only required working with the distribution of \(X\) rather than first finding the distribution of a transformed random variable. It’s so natural, we could probably do it without thinking; this is the idea behind the following.

**Definition 2.14 (Law of the unconscious statistician (LOTUS))**
The **“law of the unconscious statistician” (LOTUS)** says that the expected value of a transformed RV can be found without finding the distribution of the transformed RV, simply by applying the probability weights of the original RV to the transformed values.

\[\begin{align*} & \text{Discrete $X$ with pmf $p_X$:} & \textrm{E}[g(X)] & = \sum_x g(x) p_X(x)\\ & \text{Continuous $X$ with pdf $f_X$:} & \textrm{E}[g(X)] & =\int_{-\infty}^\infty g(x) f_X(x) dx \end{align*}\]

The left-hand side of LOTUS, \(\textrm{E}[g(X)]\), represents finding the expected value the “long way”: define \(Y=g(X)\), find the distribution of \(Y\) (e.g., using the cdf method in Section 2.7.4), then use the definition of expected value to compute \(\textrm{E}(Y)\). LOTUS says we don’t have to first find the distribution of \(Y=g(X)\) to find \(\textrm{E}[g(X)]\); rather, we simply apply the transformation \(g\) to each possible value \(x\) of \(X\) and then apply the corresponding weight for \(x\) to \(g(x)\).

LOTUS is much more useful for continuous random variables.

**Example 2.51**
Let \(X\) be a random variable with a Uniform(-1, 1) distribution and let \(Y=X^2\).
Recall that in Example 2.43 we found the pdf of \(Y\): \(f_Y(y) = \frac{1}{2\sqrt{y}}, 0<y<1\).

- Find \(\textrm{E}(X^2)\) using the distribution of \(Y\) and the definition of expected value. Remember: if we did not have the distribution of \(Y\), we would first have to derive it as in Example 2.43.
- Donny Dont says: I can just use LOTUS and replace \(x\) with \(x^2\), so \(\textrm{E}(X^2)\) is \(\int_{-\infty}^\infty x^2 f_X(x^2) dx\). Do you agree?
- Find \(\textrm{E}(X^2)\) using LOTUS.
- Is \(\textrm{E}(X^2)\) equal to \((\textrm{E}(X))^2\)?

*Solution* to Example 2.51

- Notice that if we have the distribution of \(Y\), we compute expected value according to the definition, weighting \(y\) values by the pdf of \(Y\), \(f_Y(y)\), and then integrating over possible \(y\) values. \[ \textrm{E}(X^2) = \textrm{E}(Y) = \int_{-\infty}^\infty y f_Y(y) dy = \int_0^1 y\left(\frac{1}{2\sqrt{y}}\right)dy = 1/3 \]
- No, \(x^2\) is multiplied by the density at \(x\), \(f_X(x)\), not at \(x^2\). Think of the discrete random variable in Example 2.50. There we had \(x^2p_X(x)\), e.g., \((-4)^2(1/16)\). If we had squared the \(x\) values inside the pmf, we would have \(p_X((-4)^2)=p_X(16)=0\).

- Now we just work with the distribution of \(X\), but average the squared values \(x^2\), instead of the \(x\) values. Notice that the integral below is a \(dx\) integral. Using LOTUS \[ \textrm{E}(X^2) = \int_{-\infty}^\infty x^2 f_X(x) dx = \int_{-1}^1 x^2 (1/2)dx = 1/3 \]
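- NO! Since the Uniform(-1, 1) distribution is symmetric about 0, \(\textrm{E}(X) = 0\) and so \((\textrm{E}(X))^2 = 0\), but \(\textrm{E}(X^2) = 1/3\).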

`## 0.33394171479725826`
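The three ways of getting \(\textrm{E}(X^2)\) in Example 2.51 (the definition applied to \(Y\), LOTUS applied to \(X\), and simulation) can be compared numerically. This is a standard-library sketch. The midpoint-rule helper `midpoint_integral`, the function names `f_X` and `f_Y`, and the seed are assumptions made here for illustration.

```python
import math
import random
import statistics

def midpoint_integral(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def f_X(x):
    """pdf of X ~ Uniform(-1, 1)."""
    return 0.5 if -1 < x < 1 else 0.0

def f_Y(y):
    """pdf of Y = X^2, as stated in Example 2.51."""
    return 1 / (2 * math.sqrt(y)) if 0 < y < 1 else 0.0

# "Long way": weight y values by the pdf of Y (a dy integral)
long_way = midpoint_integral(lambda y: y * f_Y(y), 0.0, 1.0)

# LOTUS: weight x^2 values by the pdf of X (a dx integral)
lotus = midpoint_integral(lambda x: x ** 2 * f_X(x), -1.0, 1.0)

# Simulation: simulate X, square each value, then average
random.seed(7)  # arbitrary seed so the run is reproducible
xs = [random.uniform(-1, 1) for _ in range(100_000)]
e_x2_sim = statistics.fmean(x ** 2 for x in xs)
e_x_sim_squared = statistics.fmean(xs) ** 2

print("long way:", long_way, " LOTUS:", lotus, " simulation:", e_x2_sim)
print("(average of x)^2:", e_x_sim_squared)
```

All three approximations of \(\textrm{E}(X^2)\) should be close to 1/3, while the square of the simulated average of \(X\) should be close to 0.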

### 2.8.2 Linearity of expected value

**Example 2.52** Recall Example 2.46 and Section 2.6.6. Consider the probability space corresponding to two spins of the Uniform(0, 1) spinner and let \(X\) be the sum of the two spins.

- Find \(\textrm{E}(X)\) using the pdf of \(X\) (see Example 2.46).
- Find \(\textrm{E}(X)\) using linearity of expected value.
  Which way is easier?

*Solution* to Example 2.52

- Since the pdf of \(X\) is symmetric about \(x=1\) we should have \(\textrm{E}(X)=1\), which integrating confirms. \[ \textrm{E}(X) = \int_{-\infty}^\infty x\,f_X(x)\, dx = \int_0^1 x (x)dx + \int_1^2 x (2-x) dx = 1/3 + 2/3 = 1. \]
- If \(U_1\) and \(U_2\) are the results of the two spins, each with a Uniform(0, 1) distribution and expected value 0.5, then \(\textrm{E}(X) = \textrm{E}(U_1+U_2) = \textrm{E}(U_1) + \textrm{E}(U_2) = 0.5 + 0.5 = 1\).
Linearity just involves adding two numbers and is much easier. The pdf method involves first finding the pdf of \(X\) and then integrating.
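A simulation can illustrate the linearity calculation. The sketch below, using only the standard library with an arbitrary seed and illustrative names, simulates pairs of Uniform(0, 1) spins and compares the average of the sums with the sum of the averages.

```python
import random
import statistics

random.seed(3)  # arbitrary seed so the run is reproducible
n_reps = 100_000

# Simulate the two spins for each repetition
u1 = [random.random() for _ in range(n_reps)]
u2 = [random.random() for _ in range(n_reps)]

# X is the sum of the two spins
x = [a + b for a, b in zip(u1, u2)]

mean_of_sum = statistics.fmean(x)                            # approximates E(U1 + U2)
sum_of_means = statistics.fmean(u1) + statistics.fmean(u2)   # approximates E(U1) + E(U2)

print(mean_of_sum, sum_of_means)
```

The two quantities agree up to floating-point rounding for any set of simulated values, since averaging a sum term by term is the same arithmetic as summing the averages. Both should be close to 1.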

There are 4 cards that could potentially go in box 1, then 3 cards that could potentially go in box 2, 2 in box 3, and 1 left for box 4. This results in \(4\times3\times2\times1=4! = 24\) possible outcomes.↩

This is sufficient, but technically not necessary. If \(X\le Y\) *almost surely* (that is, if \(\textrm{P}(X\le Y)=1\)), then \(\textrm{E}(X)\le \textrm{E}(Y)\).↩