5.4 Probability inequalities

The distribution of a random variable is a complete description of its pattern of variability. Knowing the distribution of a random variable allows you to compute the probability of any event involving it, but the full distribution is often unavailable or difficult to obtain. However, certain summary characteristics, like the mean or standard deviation, might be available. What can we say about a distribution based on only information about its mean or standard deviation?

5.4.1 Markov’s inequality

Example 5.18 According to 2019 data from the U.S. Census Bureau, the mean[1] annual income for U.S. households is about $100,000. Suppose that you know nothing else about the distribution of income, other than that income can’t be negative. What can you say about the percentage of households with incomes of at least $1 million?

  1. Can 100% of households have income of at least $1 million?
  2. Can 50% of households have income of at least $1 million?
  3. What is the largest possible percentage of households with incomes of at least $1 million?

Solution to Example 5.18.

  1. No. If 100% of households have incomes of at least $1 million, then the mean must be at least $1 million; even if every household had an income of exactly $1 million, the mean would be $1 million.
  2. No, if 50% of households have incomes of at least 1 million, then the mean must be at least 500,000. Even if 50% of households have incomes of exactly 1 million and the rest have incomes of 0 (the smallest possible value), the mean would be 500,000.
  3. The idea is that if too many households have incomes above $1 million, then the average can’t be $100,000. Classify each household as either having an income of at least $1 million or not. The mean will be smallest in the extreme case where each household has an income of either $1 million or 0; allowing other values would only pull the mean up. In this extreme scenario, let \(p\) be the proportion of households with an income of $1 million. Then the mean is \(1000000p + 0(1-p) = 1000000p\), and setting this equal to 100000 yields \(p=0.1\). Therefore, given a mean of $100,000, it is theoretically possible for 10% of households to have incomes of at least $1 million. But 10% is the maximum possible percentage; if more than 10% of households had incomes of at least $1 million, then the mean would be strictly greater than $100,000. Knowing only that the mean is $100,000, all we can say is that between 0% and 10% of households have incomes of at least $1 million.
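The arithmetic behind this extreme scenario can be checked in a few lines. This is a minimal sketch, not part of the example itself; the function name `extreme_mean` is ours, for illustration only.

```python
# Mean of the extreme two-point income distribution from Example 5.18:
# proportion p of households at $1,000,000 and the rest at $0.
# (The function name `extreme_mean` is ours, for illustration only.)
def extreme_mean(p, high=1_000_000, low=0):
    """Mean of a distribution with proportion p at `high` and 1 - p at `low`."""
    return p * high + (1 - p) * low

# Solve 1000000 * p = 100000 for the largest possible proportion p.
p = 100_000 / 1_000_000
print(p)                # 0.1
print(extreme_mean(p))  # 100000.0
```

Any proportion above 0.1 would push the mean of even this most favorable distribution above $100,000.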

The scenario corresponding to 10% in the previous example is hypothetical, and in reality much less than 10% of U.S. households have incomes of at least $1 million. (Only about 10% of households have incomes above $200,000, so the percentage of households with incomes above $1 million is much smaller than 10%.) However, the scenario is theoretically possible so we must account for it. We can’t do any better based on knowing just the mean alone without any additional information about the distribution of incomes.

The previous example illustrates Markov’s inequality.

Theorem 5.1 (Markov's inequality) For any random variable \(X\) and any constant \(c>0\), \[ \textrm{P}(|X|\ge c) \le \frac{\textrm{E}(|X|)}{c}. \] In particular, if \(\textrm{P}(X\ge0)=1\) then \(\textrm{P}(X \ge c) \le \textrm{E}(X)/c\).

The idea behind Markov’s inequality is that large values pull the mean up, so for a fixed value of the mean there is a limit on the probability that the random variable takes large values. The proof uses the same strategy as Example 5.18. Each value of \(|X|\) is either at least \(c\) or not. Consider the extreme situation where each value of \(|X|\) is either 0 or \(c\). That is, define the random variable \(Y=c\textrm{I}\{|X|\ge c\}\). Then \(|X|\ge Y\): if \(|X|\ge c\) then \(Y=c\le |X|\), and if \(|X|<c\) then \(Y=0\le |X|\). Therefore \(\textrm{E}(|X|)\ge \textrm{E}(Y)\), and \[ \textrm{E}(|X|)\ge\textrm{E}(Y) = \textrm{E}(c\textrm{I}\{|X|\ge c\}) = c\textrm{P}(|X| \ge c). \] Divide both sides by \(c>0\) to get the result.
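One way to see the inequality in action: applied to the empirical distribution of any sample, Markov’s inequality holds exactly, so the proportion of sample values with \(|x|\ge c\) can never exceed the sample mean of \(|x|\) divided by \(c\). A minimal simulation sketch (our own, using a standard normal for \(X\)):

```python
# Empirical check of Markov's inequality: for any sample of |X| values,
# the proportion at least c never exceeds (sample mean of |X|) / c.
import random

random.seed(1)
xs = [abs(random.gauss(0, 1)) for _ in range(100_000)]  # |X| for X ~ Normal(0, 1)
mean_abs = sum(xs) / len(xs)  # close to E|X| = sqrt(2/pi) ~ 0.80

for c in [0.5, 1, 2, 3]:
    tail = sum(x >= c for x in xs) / len(xs)
    assert tail <= mean_abs / c  # Markov's inequality, applied to the sample
    print(f"c = {c}: P(|X| >= c) ~ {tail:.4f} <= bound {mean_abs / c:.4f}")
```

For small \(c\) the bound can even exceed 1, which illustrates how crude it is; it only becomes informative for \(c\) larger than \(\textrm{E}(|X|)\).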

Think of \(c\) as a large value in the measurement units of the random variable, so Markov’s inequality provides a very crude upper bound on the probability that \(X\) takes extreme values in the absolute sense. Probabilities like \(\textrm{P}(|X| \ge c)\) are called “tail probabilities” because they depend on the “tail” of the distribution which describes the pattern of variability for extreme values.

We can also express Markov’s inequality in relative terms. If \(\textrm{P}(X\ge 0)=1\), then for any constant \(k>0\), \[ \textrm{P}(X \ge k \textrm{E}(X)) \le \frac{\textrm{E}(X)}{k\textrm{E}(X)} = \frac{1}{k}. \] Think of \(k\) as a multiplier: what is the probability that the random variable takes a value at least \(k\) times larger than its mean? Markov’s inequality says that at most \(1/k\) of values are at least \(k\) times greater than the mean. For example, at most 1/2 of values are at least 2 times as large as the mean, at most 1/3 of values are at least 3 times as large as the mean, and so on.
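The relative form can be illustrated the same way: for any nonnegative sample, at most \(1/k\) of values are at least \(k\) times the sample mean. A small sketch with hypothetical "incomes" drawn from an exponential distribution (our choice, purely for illustration):

```python
# Relative form of Markov's inequality on a sample: at most 1/k of
# nonnegative values are at least k times the sample mean.
# (These "incomes" are hypothetical, drawn from an exponential distribution.)
import random

random.seed(2)
incomes = [random.expovariate(1 / 50_000) for _ in range(100_000)]
mean = sum(incomes) / len(incomes)

for k in [2, 3, 5]:
    frac = sum(x >= k * mean for x in incomes) / len(incomes)
    assert frac <= 1 / k  # guaranteed by Markov's inequality
    print(f"fraction >= {k}x mean: {frac:.4f} (bound 1/{k} = {1/k:.4f})")
```

For this particular distribution the actual fractions are roughly \(e^{-k}\), far below \(1/k\), previewing the point made below: the bound works for every distribution but is rarely tight for any one of them.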

Example 5.19 Suppose \(X\) is a random variable with an Exponential(1) distribution. What does Markov’s inequality say about \(\textrm{P}(X > 5)\)? How does this compare to the true probability?

Solution to Example 5.19.


The mean of an Exponential(1) distribution is 1, so Markov’s inequality says \[ \textrm{P}(X\ge 5) \le \frac{\textrm{E}(X)}{5} = \frac{1}{5} = 0.2. \] The true probability is \(e^{-5}\approx 0.0067\), about 30 times smaller than the bound provided by Markov’s inequality. Markov’s inequality only uses the fact that the mean is 1; it provides a bound that works for any distribution with a mean of 1.
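The comparison takes one line each to compute:

```python
# Example 5.19: Markov's bound vs. the exact tail of Exponential(1).
import math

markov_bound = 1 / 5      # E(X)/c with E(X) = 1, c = 5
true_prob = math.exp(-5)  # P(X >= 5) = e^{-5} for Exponential(1)

print(markov_bound)                        # 0.2
print(round(true_prob, 4))                 # 0.0067
print(round(markov_bound / true_prob, 1))  # 29.7: the bound is ~30x the truth
```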

The upper bound provided by Markov’s inequality often grossly overestimates the tail probability. However, without further information, we cannot rule out the extreme but theoretically possible case in which the tail probability \(\textrm{P}(|X|\ge c)\) is equal to the upper bound \(\textrm{E}(|X|)/c\). Markov’s inequality provides a bound that works for any distribution with a given mean, but it is not guaranteed to work well for any particular distribution.

If the upper bound is so bad, how is Markov’s inequality useful? Think of reading a news article involving some numerical variable. The article might mention the mean, but have you ever read a news article that mentions the standard deviation? At best, you might get a range of “typical” values, maybe a percentile or two, or a graph if you’re really lucky. But in many situations, the mean might be all that is available, and Markov’s inequality at least tells you something about the distribution based on the mean alone (even if it doesn’t tell you much).

5.4.2 Chebyshev’s inequality

Markov’s inequality only relies on the mean, but it provides very rough bounds on tail probabilities. If we have more information, then we can do better. In particular, if we also know the standard deviation then we can put better bounds on the probability that a random variable takes a value far from its mean.

Theorem 5.2 (Chebyshev's inequality) For any random variable \(X\) and any constant \(c>0\)

\[ \textrm{P}\left(|X-\textrm{E}(X)|\ge c\right)\le \frac{\textrm{Var}(X)}{c^2}. \]

Equivalently, for any constant \(z>0\), \[ \textrm{P}\left(\frac{|X-\textrm{E}(X)|}{\textrm{SD}(X)}\ge z\right)\le \frac{1}{z^2}. \]

The first version of Chebyshev’s inequality bounds the probability that a random variable is at least \(c\) measurement units away from its mean. The proof is an application of Markov’s inequality to the squared deviation random variable \(|X-\textrm{E}(X)|^2\): \[ \textrm{P}\left(|X-\textrm{E}(X)|\ge c\right) = \textrm{P}\left(|X-\textrm{E}(X)|^2\ge c^2\right) \le \frac{\textrm{E}\left(|X-\textrm{E}(X)|^2\right)}{c^2}= \frac{\textrm{Var}(X)}{c^2}. \]
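As with Markov’s inequality, Chebyshev’s inequality holds exactly when applied to the empirical distribution of a sample (with the variance computed by dividing by \(n\)). A quick simulation sketch, using a Uniform(0, 10) sample of our own choosing:

```python
# Empirical check of Chebyshev's inequality: for any sample, the proportion
# of values at least c away from the sample mean is at most
# (sample variance) / c^2, where the variance divides by n.
import random

random.seed(3)
xs = [random.uniform(0, 10) for _ in range(100_000)]
mu = sum(xs) / len(xs)
var = sum((x - mu) ** 2 for x in xs) / len(xs)  # ~ 10^2 / 12 for Uniform(0, 10)

for c in [1, 2, 3, 4]:
    tail = sum(abs(x - mu) >= c for x in xs) / len(xs)
    assert tail <= var / c**2  # Chebyshev's inequality, applied to the sample
    print(f"c = {c}: {tail:.4f} <= {var / c**2:.4f}")
```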

Example 5.20 Continuing Example 5.18, suppose again that the mean annual income for U.S. households is about $100,000. Now assume the standard deviation of income is about $230,000. What can you say about the percentage of households with incomes of at least $1 million?

Solution to Example 5.20.

We first need to relate the probability of interest to one of the form[2] in Chebyshev’s inequality. An income of $1 million is 900,000 dollars above the mean of 100,000.

\[ \textrm{P}\left(X \ge 1000000\right) = \textrm{P}\left(X-100000\ge 900000\right) \le \textrm{P}\left(|X-100000|\ge 900000\right). \] Now use Chebyshev’s inequality with \(c=900000\), \(\textrm{E}(X) = 100000\), and \(\textrm{Var}(X) = 230000^2\). \[ \textrm{P}\left(X \ge 1000000\right) \le \textrm{P}\left(|X-100000|\ge 900000\right)\le \frac{230000^2}{900000^2} = 0.065. \]

Alternatively, a value of 1000000 is \(z=(1000000 - 100000) / 230000 = 3.91\) SDs above the mean, so the probability that the standardized random variable is at least 3.91 is less than \(1/3.91^2\approx 0.065\).
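The two equivalent calculations, in code (values from the example):

```python
# Example 5.20: the Chebyshev bound computed in measurement units and in SDs.
mu, sigma = 100_000, 230_000
c = 1_000_000 - mu            # 900000: distance from the mean to $1 million

bound_units = sigma**2 / c**2  # Var(X) / c^2
z = c / sigma                  # about 3.91 SDs above the mean
bound_sds = 1 / z**2           # 1 / z^2, the same bound

print(round(bound_units, 3))   # 0.065
print(round(z, 2))             # 3.91
print(round(bound_sds, 3))     # 0.065
```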

With information about the mean and standard deviation, we can say that at most 6.5% of households have incomes of at least $1 million. This is still probably a vast overestimate, but it does improve on the bound of 10% from Markov’s inequality.

The bound in Markov’s inequality is on the order of \(1/c\) and the bound in Chebyshev’s inequality is on the order of \(1/c^2\). Therefore, Chebyshev’s inequality usually provides a tighter bound, but you need to know the standard deviation in order to use it.

The second version of Chebyshev’s inequality follows by taking \(c = z\textrm{SD}(X)\) in the first version. The second version bounds the probability that a random variable is at least \(z\) standard deviations away from its mean. Chebyshev’s inequality says that for any distribution, the probability that the random variable takes a value within \(z\) SDs of its mean is at least \(1 - 1 / z^2\). For any distribution,

  • (\(z = 2\).) At least 75% of values fall within 2 standard deviations of the mean.
  • (\(z = 3\).) At least 88.8% of values fall within 3 standard deviations of the mean.
  • (\(z = 4\).) At least 93.75% of values fall within 4 standard deviations of the mean.
  • (\(z = 5\).) At least 96% of values fall within 5 standard deviations of the mean.
  • (\(z = 6\).) At least 97.2% of values fall within 6 standard deviations of the mean.
  • and so on, for different values of \(z\).
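The percentages in the list above are just \(100(1 - 1/z^2)\), which can be computed directly:

```python
# The "at least 1 - 1/z^2 of values within z SDs" list, computed directly.
coverage = {z: 1 - 1 / z**2 for z in [2, 3, 4, 5, 6]}
for z, frac in coverage.items():
    print(f"z = {z}: at least {100 * frac:.2f}% of values within {z} SDs of the mean")
```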

This universal “empirical rule” works for any distribution, but will tend to be very conservative when applied to any particular distribution.

In short, Chebyshev’s inequality says that if a value is more than a few standard deviations away from the mean then it is a fairly extreme value, regardless of the shape of the distribution.

Example 5.21 Let \(X\) be a random variable with an Exponential(1) distribution. What does Chebyshev’s inequality say about \(\textrm{P}(X > 5)\)? How does this compare to the bound from Markov’s inequality? To the true probability?

Solution to Example 5.21.


Both the mean and the standard deviation of an Exponential(1) distribution are 1. A value of 5 is \((5-1)/1=4\) standard deviations above the mean. Chebyshev’s inequality says that the probability that a value is at least 4 SDs away from the mean is at most \(1/4^2 = 0.0625\). This bound is about 3 times smaller than 0.2, the bound from Markov’s inequality. It’s still not close to the true probability of \(e^{-5}\approx 0.0067\), but at least it’s an improvement over Markov’s inequality.
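Putting the three numbers side by side:

```python
# Example 5.21: true tail probability vs. the two bounds for Exponential(1).
import math

true_prob = math.exp(-5)  # P(X >= 5) = e^{-5}
markov = 1 / 5            # E(X)/c with E(X) = 1, c = 5
chebyshev = 1 / 4**2      # value 5 is z = 4 SDs above the mean (SD = 1)

print(round(true_prob, 4), markov, chebyshev)  # 0.0067 0.2 0.0625
assert true_prob < chebyshev < markov
```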

Another situation where bounds like Markov’s or Chebyshev’s inequality are useful is in proofs. Many theorems in probability consider what happens in the long run. For example, various results say certain probabilities approach 0 in the long run. (The law of large numbers, which we will see later, is of this form.) To prove such theorems, you don’t necessarily need to compute the probabilities to show they approach 0. It is enough to show that some rough upper bound on the probability converges to 0.


  [1] The mean is closer to $98,000, but we round to simplify a little. It is often more appropriate to consider median income rather than mean income; the median annual income for U.S. households in 2019 was about $69,000.

  [2] The form we have stated bounds the probability that \(X\) is far away from its mean in either direction. There are also one-sided versions of Chebyshev’s inequality that bound the probability that \(X\) is far above (or below) its mean. For example, the one-sided bound \(\textrm{P}(X-\textrm{E}(X)\ge c)\le \textrm{Var}(X)/(\textrm{Var}(X)+c^2)\) yields \(230000^2/(230000^2+900000^2)\approx 0.061\) in this example, a slight improvement on 0.065.