Useful probability inequalities

The following two probability inequalities are useful for showing consistency in probability.

Proposition 3.2 (Markov's inequality) For any rv \(X,\)

\[\begin{align*} \mathbb{P}(|X|\geq k)\leq \frac{\mathbb{E}[X^2]}{k^2}, \quad k> 0. \end{align*}\]

Proof (Proof of Proposition 3.2). Assume that \(X\) is a continuous rv and let \(f\) be its pdf. We compute the second-order moment of \(X\):

\[\begin{align*} \mathbb{E}[X^2] &=\int_{-\infty}^{\infty} x^2 f(x)\,\mathrm{d}x\\ &=\int_{(-\infty,-k]\cup[k,\infty)} x^2 f(x)\,\mathrm{d}x+\int_{(-k,k)} x^2 f(x)\,\mathrm{d}x \\ &\geq\int_{(-\infty,-k]\cup[k,\infty)} x^2 f(x)\,\mathrm{d}x \\ &\geq k^2 \int_{(-\infty,-k]\cup[k,\infty)} f(x)\,\mathrm{d}x\\ &=k^2\mathbb{P}(|X|\geq k), \end{align*}\]

which is equivalent to

\[\begin{align*} \mathbb{P}(|X|\geq k)\leq \frac{\mathbb{E}[X^2]}{k^2}. \end{align*}\]

The proof for a discrete rv is analogous.
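
As a quick sanity check of the discrete case, the following minimal sketch evaluates both sides of Proposition 3.2 exactly for a fair six-sided die. The die and the helper names `second_moment`, `tail_prob`, and `markov_bound` are assumptions made here for illustration; they are not part of the original text.

```python
from fractions import Fraction

# Hypothetical example (not from the text): X is the outcome of a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def second_moment(pmf):
    """E[X^2] for a discrete rv given as {value: probability}."""
    return sum(p * x**2 for x, p in pmf.items())

def tail_prob(pmf, k):
    """P(|X| >= k) for a discrete rv given as {value: probability}."""
    return sum(p for x, p in pmf.items() if abs(x) >= k)

def markov_bound(pmf, k):
    """Right-hand side of Proposition 3.2: E[X^2] / k^2."""
    return second_moment(pmf) / Fraction(k) ** 2

# Compare the exact tail probability with the bound for a few values of k.
for k in (2, 4, 6):
    print(k, tail_prob(pmf, k), markov_bound(pmf, k))
```

Exact fractions are used so that the comparison is free of rounding error. For small \(k\) the bound exceeds one and is therefore uninformative.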

Proposition 3.3 (Chebyshev's inequality) Let \(X\) be a rv with \(\mathbb{E}[X]=\mu\) and \(\mathbb{V}\mathrm{ar}[X]=\sigma^2<\infty.\) Then,

\[\begin{align*} \mathbb{P}\left(|X-\mu|\geq k\sigma\right)\leq \frac{1}{k^2}, \quad k>0. \end{align*}\]

Proof (Proof of Proposition 3.3). This inequality follows from Markov’s inequality. Indeed, taking

\[\begin{align*} X'=X-\mu, \quad k'=k\sigma, \end{align*}\]

and replacing \(X\) by \(X'\) and \(k\) by \(k'\) in Markov’s inequality, we get, since \(\mathbb{E}[X'^2]=\mathbb{E}[(X-\mu)^2]=\sigma^2,\)

\[\begin{align*} \mathbb{P}(|X-\mu|\geq k\sigma)\leq\frac{\sigma^2}{k^2\sigma^2}=\frac{1}{k^2}. \end{align*}\]

Chebyshev’s inequality is useful for obtaining probability bounds about \(\hat{\theta}\) when its probability distribution is unknown. We only need to know the expectation and variance of \(\hat{\theta}.\) Indeed, if \(\hat{\theta}\) is unbiased with standard deviation \(\sigma_{\hat{\theta}},\) taking \(k=2\) in Chebyshev’s inequality gives

\[\begin{align*} \mathbb{P}\big(|\hat{\theta}-\theta|\leq 2\sigma_{\hat{\theta}}\big)\geq 1-\frac{1}{4}=0.75, \end{align*}\]

which means that at least \(75\%\) of the realized values of \(\hat{\theta}\) fall within the interval \([\theta-2\sigma_{\hat{\theta}},\theta+2\sigma_{\hat{\theta}}].\) However, if we additionally know that the distribution of the estimator \(\hat{\theta}\) is normal, \(\hat{\theta}\sim\mathcal{N}(\theta,\sigma_{\hat{\theta}}^2),\) then we obtain the much more precise result

\[\begin{align*} \mathbb{P}\big(|\hat{\theta}-\theta|\leq 2\sigma_{\hat{\theta}}\big)\approx 0.95>0.75. \end{align*}\]

The fact that the true probability, \(\approx 0.95,\) is in this case substantially larger than the lower bound given by Chebyshev’s inequality, \(0.75,\) is reasonable: Chebyshev’s inequality does not employ any knowledge about the distribution of \(\hat{\theta}.\) Thus, the more information is available about the true distribution of \(\hat{\theta},\) the more precise the resulting probability statements.
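
The comparison between the distribution-free bound and the normal case can be reproduced numerically. The sketch below is illustrative only: the function names are hypothetical, and the normal probability is obtained from the standard identity \(\mathbb{P}(|Z|\leq k)=\mathrm{erf}(k/\sqrt{2})\) for \(Z\sim\mathcal{N}(0,1).\)

```python
import math

def chebyshev_lower_bound(k):
    """Lower bound 1 - 1/k^2 for P(|X - mu| <= k*sigma), valid for any
    distribution with finite variance."""
    return 1.0 - 1.0 / k**2

def normal_coverage(k):
    """Exact P(|Z| <= k) for Z ~ N(0, 1), via the error function."""
    return math.erf(k / math.sqrt(2.0))

# The bound is always below the normal coverage; the gap varies with k.
for k in (1.5, 2.0, 3.0):
    print(f"k={k}: Chebyshev >= {chebyshev_lower_bound(k):.4f}, "
          f"normal = {normal_coverage(k):.4f}")
```

For \(k=2\) this gives the values \(0.75\) and \(\approx 0.95\) discussed above.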

Example 3.36 From previous experience, it is known that the time \(X\) (in minutes) that a periodic check of a machine requires is distributed as \(\Gamma(3,2).\) A new worker spends \(19\) minutes checking that machine. Is this time coherent with the previous experience?

We know that the mean and the variance of a \(\Gamma(\alpha,\beta)\) distribution are given by

\[\begin{align*} \mu=\alpha\beta=3\times 2=6,\quad\sigma^2=\alpha\beta^2=3\times 2^2=12\to\sigma\approx 3.46. \end{align*}\]

Then, the difference between the checking time of the new worker and \(\mu\) is \(19-6=13.\) To see whether this difference is large or small, or, in other words, to see whether the checking time of this new worker is in line with previous checking times, we would want to know the probability

\[\begin{align*} \mathbb{P}(|X-\mu|\geq 13). \end{align*}\]

For that, we take \(k\sigma=13\) in Chebyshev’s inequality, so \(k=13/\sigma\approx 13/3.46\approx 3.76,\) and applying the inequality we readily get

\[\begin{align*} \mathbb{P}(|X-\mu|\geq 13)\leq \frac{1}{k^2}\approx\frac{1}{3.76^2}\approx 0.071. \end{align*}\]

Since this probability bound is very small, the checking time is not coherent with the previous experience. Two things may have happened: either the new worker has faced a more complicated inspection or he/she is slower than the rest.
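
Since the distribution is fully known in this example, the Chebyshev bound can be compared with the exact probability. The sketch below is an illustration rather than part of the original solution. It assumes the shape–scale parametrization implied by \(\mu=\alpha\beta,\) and it uses the closed-form survival function of a gamma with integer shape (the Erlang formula), which is an ingredient added here. The function name `gamma_sf_integer_shape` is hypothetical.

```python
import math

alpha, beta = 3, 2.0               # shape and scale, as in Example 3.36
mu = alpha * beta                  # mean: 6
sigma = math.sqrt(alpha * beta**2) # standard deviation: sqrt(12)
deviation = 19 - mu                # observed deviation from the mean: 13

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 with k = deviation / sigma.
k = deviation / sigma
chebyshev_bound = 1.0 / k**2

def gamma_sf_integer_shape(x, shape, scale):
    """P(X >= x) for X ~ Gamma(shape, scale) with integer shape (Erlang)."""
    if x <= 0:
        return 1.0
    z = x / scale
    return math.exp(-z) * sum(z**j / math.factorial(j) for j in range(shape))

# P(|X - mu| >= 13) = P(X >= 19) + P(X <= -7); the last term is 0 as X > 0.
exact = gamma_sf_integer_shape(mu + deviation, alpha, beta)

print(f"Chebyshev bound: {chebyshev_bound:.4f}")
print(f"Exact probability: {exact:.4f}")
```

The exact probability is considerably smaller than the bound, which reinforces the conclusion of the example.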

Rao–Blackwell’s Theorem

Rao–Blackwell’s Theorem provides an effective way of reducing the variance of an unbiased estimator using a sufficient statistic \(T.\) This process is sometimes known as Rao–Blackwellization and results in a new estimator whose MSE is never larger than that of the original one.

Theorem 3.9 (Rao–Blackwell’s Theorem) Let \(T\) be a sufficient statistic for \(\theta.\) Let \(\hat{\theta}\) be an unbiased estimator of \(\theta.\) Then, the estimator

\[\begin{align*} \hat{\theta}':=\mathbb{E}\big[\hat{\theta}|T\big] \end{align*}\]

satisfies:

  1. \(\hat{\theta}'\) does not depend on \(\theta,\) so it is a valid estimator.
  2. \(\mathbb{E}\big[\hat{\theta}'\big]=\theta,\) \(\forall \theta\in \Theta.\)
  3. \(\mathbb{V}\mathrm{ar}\big[\hat{\theta}'\big]\leq \mathbb{V}\mathrm{ar}\big[\hat{\theta}\big],\) \(\forall\theta\in\Theta.\)

In addition, \(\mathbb{V}\mathrm{ar}\big[\hat{\theta}'\big]= \mathbb{V}\mathrm{ar}\big[\hat{\theta}\big]\) if and only if \(\mathbb{P}\big(\hat{\theta}'=\hat{\theta}\big)=1,\) \(\forall \theta \in \Theta.\)

Observe that the new estimator \(\hat{\theta}'\) depends on the sample only through the sufficient statistic \(T;\) in particular, when \(T\) is minimal sufficient, it depends on the sample only through the minimal sufficient statistic.

Example 3.37 Let \((X_1,\ldots,X_n)\) be a srs of \(X\sim \mathrm{Pois}(\lambda),\) that is, with pmf

\[\begin{align*} p(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}, \quad x=0,1,\ldots, \ \lambda>0. \end{align*}\]

Consider the parameter \(\theta=p(0;\lambda)=e^{-\lambda}.\) Let us perform a Rao–Blackwellization.

First, we need an unbiased estimator, for example:

\[\begin{align*} \hat{\theta}=\begin{cases} 1 & \text{if} \ X_1=0,\\ 0 & \text{if} \ X_1\neq 0, \end{cases} \end{align*}\]

This estimator is unbiased, since


\[\begin{align*} \mathbb{E}\big[\hat{\theta}\big]=1\times\mathbb{P}(X_1=0)+ 0\times\mathbb{P}(X_1\neq 0)=\mathbb{P}(X_1=0)=e^{-\lambda}. \end{align*}\]

Now we need a sufficient statistic. Writing the pmf of the Poisson in the form of the exponential family, we obtain

\[\begin{align*} p(x;\lambda)=e^{-\lambda}\frac{e^{x\log{\lambda}}}{x!} \end{align*}\]

and therefore \(T(X_1,\ldots,X_n)=\sum_{i=1}^n X_i\) is sufficient for \(\lambda\) and also for \(e^{-\lambda}.\)

Then, we can Rao–Blackwellize \(\hat{\theta}\):

\[\begin{align*} \hat{\theta}':=&\;\mathbb{E}\big[\hat{\theta}|T\big]\\ =&\;1\times\mathbb{P}\left(X_1=0\Bigg|\sum_{i=1}^n X_i=t\right)+0\times\mathbb{P}\left(X_1\neq 0\Bigg|\sum_{i=1}^n X_i=t\right) \\ =&\;\frac{\mathbb{P}(X_1=0,\sum_{i=1}^n X_i=t)}{\mathbb{P}(\sum_{i=1}^n X_i=t)}\\ =&\;\frac{\mathbb{P}(X_1=0)\mathbb{P}(\sum_{i=2}^n X_i=t)}{\mathbb{P}(\sum_{i=1}^n X_i=t)}. \end{align*}\]

Now, it holds that, if \(X_i\sim \mathrm{Pois}(\lambda),\) \(i=1,\ldots,n,\) are independent, then (see Exercise 1.20)

\[\begin{align*} \sum_{i=1}^n X_i \sim \mathrm{Pois}(n\lambda). \end{align*}\]

In particular, \(\sum_{i=2}^n X_i\sim\mathrm{Pois}((n-1)\lambda).\) Therefore, for \(\sum_{i=1}^n X_i=t,\)


\[\begin{align*} \hat{\theta}'=\frac{e^{-\lambda}[(n-1)\lambda]^t e^{-(n-1)\lambda}/t!}{(n\lambda)^t e^{-n\lambda}/t!}=\left(\frac{n-1}{n}\right)^t. \end{align*}\]

Then, we have obtained the estimator

\[\begin{align*} \hat{\theta}'=\mathbb{E}\big[\hat{\theta}|T\big]=\left(\frac{n-1}{n}\right)^T, \end{align*}\]

which is unbiased, and whose variance is smaller than that of \(\hat{\theta}.\) Indeed,

\[\begin{align*} \mathbb{E}\big[\hat{\theta}'\big]&=\sum_{t=0}^{\infty} \left(\frac{n-1}{n}\right)^t \frac{e^{-n\lambda}(n\lambda)^t}{t!}\\ &=e^{-n\lambda}\sum_{t=0}^{\infty} \frac{(n-1)^t \lambda^t}{t!}\\ &=e^{-n\lambda}e^{(n-1)\lambda}\\ &=e^{-\lambda}=\theta. \end{align*}\]

Therefore, \(\hat{\theta}'\) is unbiased. Next we compute its variance, starting with the second-order moment:

\[\begin{align*} \mathbb{E}\big[\hat{\theta}'^2\big] &=\sum_{t=0}^{\infty}\left(\frac{n-1}{n}\right)^{2t} \frac{e^{-n\lambda}(n\lambda)^t}{t!}\\ &=e^{-n\lambda}\sum_{t=0}^{\infty}\left(\frac{(n-1)^2\lambda}{n}\right)^t\frac{1}{t!} \\ &=e^{-n\lambda}e^{(n-1)^2\lambda/n}\\ &=e^{-2\lambda+\lambda/n}. \end{align*}\]

Then, the variance is

\[\begin{align*} \mathbb{V}\mathrm{ar}\big[\hat{\theta}'\big]=\mathbb{E}\big[\hat{\theta}'^2\big]-\mathbb{E}^2\big[\hat{\theta}'\big]=e^{-2\lambda+\lambda/n}-e^{-2\lambda}=e^{-2\lambda}(e^{\lambda/n}-1). \end{align*}\]
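
The closed-form mean and variance of \(\hat{\theta}'=((n-1)/n)^T,\) \(T\sim\mathrm{Pois}(n\lambda),\) can be checked by summing the corresponding series numerically. The following is a minimal sketch under assumed values of \(n\) and \(\lambda;\) the function names and the truncation point `t_max` are hypothetical choices made for illustration.

```python
import math

def poisson_pmf(t, rate):
    """P(T = t) for T ~ Pois(rate), computed on the log scale for stability."""
    return math.exp(-rate + t * math.log(rate) - math.lgamma(t + 1))

def rb_moments(n, lam, t_max=500):
    """Mean and variance of ((n-1)/n)^T with T ~ Pois(n*lam), obtained by
    truncating the defining series at t_max terms."""
    r = (n - 1) / n
    mean = sum(r**t * poisson_pmf(t, n * lam) for t in range(t_max))
    second = sum(r**(2 * t) * poisson_pmf(t, n * lam) for t in range(t_max))
    return mean, second - mean**2

n, lam = 10, 1.5  # assumed sample size and rate, for illustration only
mean_rb, var_rb = rb_moments(n, lam)

# Closed-form expressions derived in the text.
mean_closed = math.exp(-lam)
var_closed = math.exp(-2 * lam) * (math.exp(lam / n) - 1)

print(mean_rb, mean_closed)
print(var_rb, var_closed)
```

The truncation is harmless here because the Poisson probabilities beyond `t_max` are negligible for moderate values of \(n\lambda.\)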

We calculate the variance of \(\hat{\theta}\) for comparison:

\[\begin{align*} \mathbb{E}\big[\hat{\theta}^2\big]=1\times\mathbb{P}(X_1=0)=e^{-\lambda}. \end{align*}\]

As a consequence,

\[\begin{align*} \mathbb{V}\mathrm{ar}\big[\hat{\theta}\big]=e^{-\lambda}-e^{-2\lambda}=e^{-\lambda}(1-e^{-\lambda}). \end{align*}\]

Therefore, the ratio of the two variances is


\[\begin{align*} \frac{\mathbb{V}\mathrm{ar}\big[\hat{\theta}'\big]}{\mathbb{V}\mathrm{ar}\big[\hat{\theta}\big]}=\frac{e^{-2\lambda}(e^{\lambda/n}-1)}{e^{-\lambda}(1-e^{-\lambda})}=\frac{e^{\lambda/n}-1}{e^{\lambda}-1}<1, \quad \forall n\geq 2. \end{align*}\]
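
The variance reduction can also be observed by simulation. The sketch below is illustrative only: the sample size, rate, number of replicates, and seed are assumed values, and the Poisson sampler uses Knuth's multiplication method, a standard technique brought in here because Python's standard library has no Poisson generator. It draws repeated samples and compares the empirical mean and variance of \(\hat{\theta}\) and \(\hat{\theta}'.\)

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Pois(lam) value with Knuth's multiplication method
    (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def simulate(n, lam, reps, seed=0):
    """Simulated values of the naive estimator 1{X_1 = 0} and of its
    Rao-Blackwellized version ((n-1)/n)^T, with T the sample total."""
    rng = random.Random(seed)
    naive, rb = [], []
    for _ in range(reps):
        sample = [sample_poisson(lam, rng) for _ in range(n)]
        naive.append(1.0 if sample[0] == 0 else 0.0)
        rb.append(((n - 1) / n) ** sum(sample))
    return naive, rb

n, lam = 5, 1.0  # assumed sample size and rate, for illustration only
naive, rb = simulate(n, lam, reps=20000)

print("target e^{-lambda}:", math.exp(-lam))
print("naive mean, var   :", statistics.mean(naive), statistics.variance(naive))
print("RB mean, var      :", statistics.mean(rb), statistics.variance(rb))
print("theoretical ratio :", (math.exp(lam / n) - 1) / (math.exp(lam) - 1))
```

Both empirical means should be close to \(e^{-\lambda},\) while the empirical variance of \(\hat{\theta}'\) should be noticeably smaller, in line with the ratio derived above.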