3 Mathematical expectation
Upon completion of this chapter, you should be able to:
- understand the concept and definition of mathematical expectation.
- compute the expectations of a random variable, functions of a random variable and linear functions of a random variable.
- compute the variance and other higher moments of a random variable.
- derive the moment-generating function of a random variable and linear functions of a random variable.
- find the moments of a random variable from the moment-generating function.
- state and use Tchebysheff’s inequality.
3.1 Mathematical expectation
Because random variables are random, knowing the outcome on any one realisation of the random process is not possible. Instead, we can talk about what we might expect to happen, or what might happen on average.
This is the idea of mathematical expectation. In more familiar terms, the mathematical expectation of a random variable is the mean of its probability distribution. Mathematical expectation goes far beyond just computing means, but we begin here because the idea of a mean is easily understood.
The definition looks different in detail for discrete and continuous random variables, but the intention is the same.
Definition 3.1 (Expectation) The expectation or expected value (or mean) of a random variable X is defined as
- \text{E}(X) = \sum_{x\in R_X} x\, p_X(x) for a discrete random variable X with pmf p_X(x);
- \text{E}(X) = \int_{-\infty}^{\infty} x\, f_X(x)\, dx for a continuous random variable X with pdf f_X(x).
Often we write \mu = \text{E}(X), or \mu_X to distinguish between random variables.
Effectively, \text{E}(X) is a weighted average of the points in R_X, the weights being the probabilities in the discrete case and the probability densities in the continuous case.
Example 3.1 (Expectation for discrete variables) Consider the discrete random variable U with probability function
p_U(u) = \begin{cases} (u^2 + 1)/5 & \text{for $u = -1, 0, 1$};\\ 0 & \text{elsewhere}. \end{cases}
The expected value of U is, by definition,
\text{E}(U) = \sum_{u = -1, 0, 1} u\, p_U(u) = \left(-1 \times \frac{(-1)^2 + 1}{5}\right) + \left(0 \times \frac{0^2 + 1}{5}\right) + \left(1 \times \frac{1^2 + 1}{5}\right) = -\frac{2}{5} + 0 + \frac{2}{5} = 0.
The expected value of U is 0.
Example 3.2 (Expectation for continuous variables) Consider a continuous random variable X with pdf
f_X(x) = \begin{cases} x/4 & \text{for $1 < x < 3$};\\ 0 & \text{elsewhere}. \end{cases}
The expected value of X is, by definition,
\text{E}(X) = \int_{-\infty}^{\infty} x\, f_X(x)\, dx = \int_1^3 x (x/4)\, dx = \frac{x^3}{12} \Big|_1^3 = 13/6.
The expected value of X is 13/6.
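As a quick check of this calculation (a minimal sketch in R; the function name fX is ours, not from the text), the integral can be evaluated numerically with integrate():

```r
# Numerical check of E(X) for the pdf f_X(x) = x/4 on (1, 3); exact value is 13/6
fX <- function(x) x / 4                                # pdf on its support (1, 3)
EX <- integrate(function(x) x * fX(x), lower = 1, upper = 3)$value
EX                                                     # 2.1667, i.e. 13/6
```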
Example 3.3 (Expectation for a coin toss) Consider tossing a coin once and counting the number of tails.
Let this random variable be T.
The probability function is
p_T(t) = \begin{cases} 0.5 & \text{for $t = 0$ or $t = 1$};\\ 0 & \text{otherwise}. \end{cases}
The expected value of T is, by definition,
\text{E}(T) = \sum_{t = 0}^{1} t\, p_T(t) = \Pr(T = 0) \times 0 + \Pr(T = 1) \times 1 = (0.5 \times 0) + (0.5 \times 1) = 0.5.
Of course, 0.5 tails can never actually be observed in practice on one toss.
But it would be silly to round up (or down) and say that the expected number of tails on one toss of a coin is one (or zero).
The expected value of 0.5 simply means that over a large number of repeats of this random process, we expect a tail to occur in half of those repeats.
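A small simulation illustrates this long-run interpretation (a sketch in R; the number of repeats, 100000, is an arbitrary choice):

```r
# Simulate many tosses of a fair coin; the average number of tails per toss
# should be close to E(T) = 0.5
set.seed(1)                                    # for reproducibility
tails <- rbinom(100000, size = 1, prob = 0.5)  # 1 = tail, 0 = head
mean(tails)                                    # close to 0.5
```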
Example 3.4 (Mean not defined) Consider the distribution of Z, with the probability density function f_Z(z) = \begin{cases} z^{-2} & \text{for $z \ge 1$};\\ 0 & \text{elsewhere}, \end{cases} as in Fig. 3.1. The expected value of Z is \text{E}[Z] = \int_1^\infty z\, \frac{1}{z^2}\, dz = \int_1^\infty \frac{1}{z}\, dz = \log z \,\Big|_1^\infty. However, \lim_{z\to\infty} \log z \to \infty, so the integral does not converge. The expected value of Z is therefore not defined.

FIGURE 3.1: The probability density function for the random variable Z. The mean is not defined.
3.2 Expectation of a function of a random variable
While the mean can be expressed in terms of mathematical expectation, mathematical expectation is a more general concept.
Let X be a discrete random variable with a probability function pX(x), or a continuous random variable with pdf fX(x). Also assume g(X) is a real-valued function of X. We can then define the expected value of g(X).
Definition 3.2 (Expectation for function of a random variable) The expected value of some function g(⋅) of a random variable X is:
- \text{E}(g(X)) = \sum_{x\in R_X} g(x)\, p_X(x) for a discrete random variable X with pmf p_X(x);
- \text{E}(g(X)) = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx for a continuous random variable X with pdf f_X(x).
If Y = g(X), write \mu_Y = \text{E}(Y) = \text{E}(g(X)). Importantly, the expectation operator is a linear operator, as stated below.
Theorem 3.1 (Expectation properties) For any random variable X and constants a and b, \text{E}(aX + b) = a\,\text{E}(X) + b.
Proof. Assume X is a discrete random variable with probability function p_X(x). By Def. 3.2 with g(X) = aX + b, \text{E}(aX + b) = \sum_x (ax + b)\, p_X(x) = a\sum_x x\, p_X(x) + b\sum_x p_X(x) = a\,\text{E}(X) + b, using that \sum_x p_X(x) = 1. (The proof in the continuous case replaces the probability function with a pdf, and the summations with integrals.)
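The result is easy to check by simulation. Here is a minimal R sketch for a fair die roll (values 1 to 6, each with probability 1/6), with the arbitrarily chosen constants a = 2 and b = 3:

```r
# Empirical check of E(aX + b) = a E(X) + b for a die roll, with a = 2, b = 3
set.seed(1)
x <- sample(1:6, size = 100000, replace = TRUE)  # simulated die rolls
mean(2 * x + 3)                                  # left-hand side, approximately 10
2 * mean(x) + 3                                  # right-hand side, approximately 10
```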
3.3 Measures of location
Having introduced the idea of mathematical expectation, other descriptions of a random variable can now be introduced.
Two general features of a distribution are the location or centre, and the variability. We have already met the mean (or expected value) as one measure of location.
3.3.1 The mean
The most important measure of location is the mean \mu = \text{E}(X), which has already been defined (Sect. 3.1). The mean is the balance point of the distribution: the expected deviation of the random variable from \mu is zero. In other words, \text{E}(X - \mu) = 0.
While the mean is commonly used, it is not always a suitable measure of centre: the balance point can be unduly affected by outliers and extreme skewness. We have also seen a situation where the mean is undefined (Example 3.4).
3.3.2 The mode and median
Other measures of centre may sometimes be more appropriate. The mode and median are two common alternatives. The mean, when it is defined, is more useful in theoretical work because of its nicer mathematical properties.
Definition 3.3 (Mode) The mode(s) is the value(s) x \in R_X at which:
- p_X(x) attains its maximum (in the discrete case); or
- f_X(x) attains its maximum (in the continuous case).
The mode is not necessarily unique.
Definition 3.4 (Median) The median of a random variable X (or of a distribution) is a number ν such that
\Pr(X \le \nu) \ge \frac{1}{2} \quad\text{and}\quad \Pr(X \ge \nu) \ge \frac{1}{2}.
For X discrete, the median is not necessarily unique.
Example 3.6 (Mode and median for discrete distributions) If X is a discrete random variable with probability function defined in Table 3.1 (and shown in Fig. 3.2), the distribution is bimodal (i.e., has two modes), with modes at x=3 and x=4.
The median, say \nu, is any value in the range 3 < \nu < 4, since any number \nu in this range satisfies \Pr(X \le \nu) = 0.5 and \Pr(X \ge \nu) = 0.5. The median is not unique.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Pr(X = x) | 1/128 | 7/128 | 21/128 | 35/128 | 35/128 | 21/128 | 7/128 | 1/128

FIGURE 3.2: Finding the median, when it is not unique
Example 3.7 (Mode and median for discrete distributions) Let X be a discrete random variable with the probability function shown in Table 3.2 (and Fig. 3.3). The mode of this distribution is at x = 3. The median is \nu = 3 since this is the only value of X that satisfies Def. 3.4.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6
Pr(X = x) | 1/64 | 6/64 | 15/64 | 20/64 | 15/64 | 6/64 | 1/64

FIGURE 3.3: The probability function and distribution function for X
3.4 Measures of variability
Apart from the mean, the most important description of a random variable is the variability: quantifying how the values of the random variable are dispersed. The most important measure of variability is the variance.
3.4.1 The variance and standard deviation
The variance of a random variable can be defined through the expectation of a function of that random variable. (More correctly, we should say “the variance of the distribution of the random variable”, but the shorter language is very common.)
The variance is a measure of the variability of a random variable. A small variance means the observations are nearly the same (i.e., small variation); a large variance means they are quite different.
Definition 3.5 (Variance) The variance of a random variable X (or of the distribution of X) is \text{var}(X) = \text{E}\big((X - \mu)^2\big) where \mu = \text{E}(X). The variance of X is commonly denoted by \sigma^2, or \sigma^2_X if distinguishing among variables is needed.
The variance is the expected value of the squared distance of the values of the random variable from the mean, weighted by the probability function. The unit of measurement for the variance is the original unit of measurement squared. That is, if X is measured in metres, the variance of X is in metres^2.
Describing the variability in the original units is often more natural; this is achieved by taking the square root of the variance.
Definition 3.6 (Standard deviation) The standard deviation of a random variable X is defined as the positive square root of the variance (denoted by \sigma); i.e., \text{sd}(X) = \sigma = +\sqrt{\text{var}(X)}.
In practice, the standard deviation is more popular than the variance for describing variability. In theoretical work, however, the variance is easier to work with than the standard deviation (which involves a square root), and the variance, rather than the standard deviation, features in many results in theoretical statistics.
Example 3.8 (Variance for a die toss) Suppose a fair die is tossed, and X denotes the number of points showing.
Then Pr(X=x)=1/6 for x=1,2,3,4,5,6 and
\mu = \text{E}(X) = \sum_{S} x \Pr(X = x) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 7/2.
The variance of X is then
\sigma^2 = \text{var}(X) = \sum_x (x - \mu)^2 \Pr(X = x) = \frac{1}{6}\left[ \left(1 - \tfrac{7}{2}\right)^2 + \left(2 - \tfrac{7}{2}\right)^2 + \cdots + \left(6 - \tfrac{7}{2}\right)^2 \right] = \frac{70}{24}.
The standard deviation is then \sigma = \sqrt{70/24} \approx 1.71.
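The same calculation can be done directly from the pmf in R (a minimal sketch; the vector names are ours):

```r
# Exact mean, variance and standard deviation of a fair die from its pmf
x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)               # 3.5
sigma2 <- sum((x - mu)^2 * p)  # 70/24 = 2.9167
sqrt(sigma2)                   # about 1.71
```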
An important result is known as the computational formula for variance.
Theorem 3.2 (Computational formula for variance) For any random variable X, \text{var}(X) = \text{E}(X^2) - [\text{E}(X)]^2.
Proof. Let E(X)=μ, then (using the properties of expectation in Theorem 3.1):
\text{var}(X) = \text{E}\big((X - \mu)^2\big) = \text{E}(X^2 - 2X\mu + \mu^2) = \text{E}(X^2) - \text{E}(2X\mu) + \text{E}(\mu^2) = \text{E}(X^2) - 2\mu\,\text{E}(X) + \mu^2 = \text{E}(X^2) - 2\mu^2 + \mu^2 = \text{E}(X^2) - \mu^2 = \text{E}(X^2) - (\text{E}(X))^2.
This formula is often easier to use to compute var(X) than using the definition directly.
Example 3.9 (Variance for a die toss) Consider Example 3.8 again.
Then
\text{E}(X^2) = \sum_{S} x^2 \Pr(X = x) = \frac{1}{6}\left[1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2\right] = 91/6,
and so \text{var}(X) = 91/6 - (7/2)^2 = 70/24, as before.
Example 3.10 (Variance using computational formula) Consider the continuous random variable X with pdf f_X(x) = \begin{cases} 3x(2 - x)/4 & \text{for $0 < x < 2$};\\ 0 & \text{elsewhere}. \end{cases} The variance of X can be computed in two ways: using \text{var}(X) = \text{E}\big((X - \mu)^2\big), or using the computational formula. The expected value of X is \text{E}(X) = \int_0^2 x \cdot \frac{3x(2 - x)}{4}\, dx = 1. To use the computational formula, \text{E}(X^2) = \int_0^2 x^2 \cdot \frac{3x(2 - x)}{4}\, dx = 6/5, and so \text{var}(X) = \text{E}(X^2) - (\text{E}(X))^2 = 6/5 - 1 = 1/5.
Using the definition,
\text{var}(X) = \text{E}\big((X - \text{E}(X))^2\big) = \text{E}\big((X - 1)^2\big) = \int_0^2 (x - 1)^2 \cdot 3x(2 - x)/4\, dx = 1/5.
Both methods give the same answer of course, and both methods require initial computation of E(X). The computational formula requires less work.
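Both routes are easy to verify numerically (a sketch in R, assuming the pdf of Example 3.10):

```r
# Variance of X with pdf 3x(2 - x)/4 on (0, 2), computed both ways
fX  <- function(x) 3 * x * (2 - x) / 4
EX  <- integrate(function(x) x   * fX(x), 0, 2)$value    # E(X)   = 1
EX2 <- integrate(function(x) x^2 * fX(x), 0, 2)$value    # E(X^2) = 6/5
EX2 - EX^2                                               # computational formula: 0.2
integrate(function(x) (x - EX)^2 * fX(x), 0, 2)$value    # definition: 0.2
```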
The variance represents the expected value of the squared distance of the values of the random variable from the mean. The variance is never negative, and is only zero when all the values of the random variable are identical (that is, there is no variation).
If most of the probability lies near the mean, the dispersion will be small; if the probability is spread out over a considerable range the dispersion will be large.
Example 3.11 (Infinite variance) In Example 3.4, \text{E}(Z) was not defined. For that reason, the variance of Z is also undefined, since computing the variance relies on having a finite value for \text{E}(Z).
Theorem 3.3 (Variance properties) For any random variable X and constants a and b, \text{var}(aX + b) = a^2\,\text{var}(X).
Proof. Using the computational formula for the variance:
\text{var}(aX + b) = \text{E}[(aX + b)^2] - [\text{E}(aX + b)]^2 = \text{E}[a^2X^2 + 2abX + b^2] - (a\mu + b)^2 = a^2\text{E}[X^2] + 2ab\mu + b^2 - (a^2\mu^2 + 2ab\mu + b^2) = a^2\text{E}[X^2] - a^2\mu^2 = a^2\,\text{var}(X).
The special case a=0 is instructive: var(b)=0 when b is constant; that is, a constant has zero variation.
Example 3.12 (Variance of a function of a random variable) Consider the random variable Y=4−2X where E(X)=1 and var(X)=3.
Then:
\text{E}(Y) = \text{E}(4 - 2X) = 4 - 2\,\text{E}(X) = 2; \qquad \text{var}(Y) = \text{var}(4 - 2X) = (-2)^2\,\text{var}(X) = 12.
3.4.2 The mean absolute deviation
We have already seen that the variance and standard deviation are suitable measures of dispersion. Another alternative is the following.
Definition 3.7 (Mean absolute deviation) The mean absolute deviation (MAD) is defined as:
- \text{E}(|X - \mu|) = \sum_x |x - \mu|\, p_X(x) for a discrete random variable X;
- \text{E}(|X - \mu|) = \int_{-\infty}^{\infty} |x - \mu|\, f_X(x)\, dx for a continuous random variable X.
Example 3.13 (MAD of die toss) Consider the fair die described in Example 3.8.
Then \mu = \text{E}(X) = 7/2, and thus the MAD of X is
\text{E}(|X - \mu|) = \sum_x |x - \mu| \Pr(X = x) = \frac{1}{6}\left[ \left|1 - \tfrac{7}{2}\right| + \left|2 - \tfrac{7}{2}\right| + \cdots + \left|6 - \tfrac{7}{2}\right| \right] = 1.5.
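The same value is easily reproduced from the pmf (a minimal R sketch):

```r
# Mean absolute deviation of a fair die
x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)           # 3.5
sum(abs(x - mu) * p)       # 1.5
```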
Like the median, the MAD is difficult to use in theoretical work (due to the absolute value). The most-used measure of dispersion is the variance (or standard deviation).
3.5 Higher moments
The idea of a mean and variance are generalised in the following definitions.
Definition 3.8 (Raw moments) The rth moment about the origin, or rth raw moment, of a random variable X (where r is a positive integer) is defined as:
- \mu'_r = \text{E}(X^r) = \sum_x x^r\, p_X(x) for the discrete case;
- \mu'_r = \text{E}(X^r) = \int_{-\infty}^{\infty} x^r f_X(x)\, dx for the continuous case.
Definition 3.9 (Central moments) The rth central moment, or rth moment about the mean (where r is a positive integer), is defined as
- \mu_r = \text{E}\big((X - \mu)^r\big) = \sum_x (x - \mu)^r\, p_X(x) in the discrete case;
- \mu_r = \text{E}\big((X - \mu)^r\big) = \int_{-\infty}^{\infty} (x - \mu)^r f_X(x)\, dx in the continuous case.
From these definitions:
- the mean \mu'_1 = \mu is the first moment about the origin, or first raw moment;
- \mu'_2 = \text{E}(X^2) is the second moment about the origin, or second raw moment; and
- the variance \mu_2 = \sigma^2 is the second moment about the mean, or second central moment.
Higher moments also exist that describe other features of a random variable. One important higher moment is related to skewness.
Definition 3.10 (Skewness) The skewness of a distribution is defined as
\begin{equation}
\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}.
\tag{3.1}
\end{equation}
If \gamma_1 > 0 we say the distribution is positively (or right) skewed: it is ‘stretched’ in the positive direction. Similarly, if \gamma_1 < 0 we say the distribution is negatively (or left) skewed: it is ‘stretched’ in the negative direction.
Definition 3.11 (Symmetry) The distribution of X is said to be symmetric (about its mean \mu) if, for all x,
- \Pr(X = \mu + x) = \Pr(X = \mu - x) in the discrete case;
- f_X(\mu + x) = f_X(\mu - x) in the continuous case.
For a symmetric distribution, the mean (when it exists) is also a median of the distribution.
For a symmetric distribution, the odd central moments are zero (Exercise 3.10). This suggests that odd central moments can be used to measure the asymmetry of a distribution (as in Def. 3.10).
Example 3.14 (Skewness) Consider the random variable X in Example 3.10, where f_X(x) = 3x(2 - x)/4 for 0 < x < 2. In that example, we found \text{E}(X) = \mu'_1 = 1 and \text{E}(X^2) = \mu'_2 = 6/5, so that \mu_2 = 1/5. Then, \mu_3 = \int_0^2 (x - 1)^3 \cdot 3x(2 - x)/4\, dx = 0, so that the skewness in Eq. (3.1) is zero. This is expected, since the distribution is symmetric (Fig. 3.4).

FIGURE 3.4: The probability density function for X
Instead of the central moments themselves, expressions unaffected by a linear transformation of the type Y = aX + b (for constants a and b) are convenient. The ratio (\mu_r)^p/(\mu_p)^r is such an expression, and the simplest form is a function of \mu_3^2/\mu_2^3.
Example 3.15 (Skewness) Consider the random variable Y with pmf p_Y(y) = \begin{cases} 0.2 & \text{for $y = 5$};\\ 0.3 & \text{for $y = 6$};\\ 0.5 & \text{for $y = 7$};\\ 0 & \text{elsewhere}. \end{cases} Then \mu'_1 = \text{E}(Y) = (5\times 0.2) + (6\times 0.3) + (7\times 0.5) = 6.3. Likewise, \mu_2 = \sum_y (y - 6.3)^2\, p_Y(y) = 0.61 and \mu_3 = \sum_y (y - 6.3)^3\, p_Y(y) = -0.276. Hence, the skewness is \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{-0.276}{0.61^{3/2}} \approx -0.58, so the distribution has negative (left) skewness: most of the probability is at the larger values of y, with a tail towards the smaller values.
Another ratio gives a measure of kurtosis, which measures the heaviness of the tails in a distribution.
Definition 3.12 (Kurtosis) The kurtosis of a distribution is defined as
\frac{\mu_4}{\mu_2^2}.
The excess kurtosis of a distribution is defined as
\gamma_2 = \frac{\mu_4}{\mu_2^2} - 3.
Excess kurtosis is commonly used, and sometimes is just called ‘kurtosis’.
Large values of kurtosis correspond to a greater proportion of the distribution in the tails. The excess kurtosis is defined so that the normal, or bell-shaped, distribution (Sect. 5.3), which has a kurtosis of 3, has an excess kurtosis of zero. Then:
- distributions with negative excess kurtosis are called platykurtic. These distributions have fewer, or less extreme, observations in the tails compared to the normal distribution (‘thinner tails’). Examples include the Bernoulli distribution.
- distributions with positive excess kurtosis are called leptokurtic. These distributions have more, or more extreme, observations in the tails compared to the normal distribution (‘fatter tails’). Examples include the exponential and Poisson distributions.
- distributions with zero excess kurtosis are called mesokurtic. The normal distribution is the obvious example.
Example 3.16 (Uses of skewness and kurtosis) Monypenny and Middleton (1998b) and Monypenny and Middleton (1998a) use the skewness and kurtosis to analyse wind gusts at Sydney airport.
Example 3.17 (Uses of skewness and kurtosis) Galagedera, Henry, and Silvapulle (2002) used higher moments in a capital asset pricing model for Australian stock returns.
Example 3.18 (Skewness and kurtosis) Consider the discrete random variable U from Example 3.1.
The raw moments are
\mu'_r = \text{E}(U^r) = \sum_{u = -1, 0, 1} u^r\, \frac{u^2 + 1}{5} = (-1)^r\, \frac{(-1)^2 + 1}{5} + 0^r\, \frac{0^2 + 1}{5} + 1^r\, \frac{1^2 + 1}{5} = \frac{2(-1)^r}{5} + 0 + \frac{2}{5} = \frac{2}{5}\left[(-1)^r + 1\right]
for the rth raw moment.
Then,
\text{E}(U) = \mu'_1 = \frac{2}{5}\left[(-1)^1 + 1\right] = 0; \quad \text{E}(U^2) = \mu'_2 = \frac{2}{5}\left[(-1)^2 + 1\right] = 4/5; \quad \text{E}(U^3) = \mu'_3 = \frac{2}{5}\left[(-1)^3 + 1\right] = 0; \quad \text{E}(U^4) = \mu'_4 = \frac{2}{5}\left[(-1)^4 + 1\right] = 4/5.
Since \text{E}(U) = 0, the rth central and raw moments are the same: \mu'_r = \mu_r.
Notice that once the initial computations to find μ′r are complete, the evaluation of any raw moment is simple.
The skewness is \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{0}{(4/5)^{3/2}} = 0, as expected for a symmetric distribution. The excess kurtosis is \gamma_2 = \frac{\mu_4}{\mu_2^2} - 3 = \frac{4/5}{(4/5)^2} - 3 = -\frac{7}{4}, so the distribution is platykurtic.
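These moment calculations are easily checked from the pmf (a minimal R sketch; the variable names are ours):

```r
# Central moments, skewness and excess kurtosis of U from its pmf
u  <- c(-1, 0, 1)
p  <- (u^2 + 1) / 5        # probabilities 2/5, 1/5, 2/5
mu <- sum(u * p)           # 0
m2 <- sum((u - mu)^2 * p)  # 4/5
m3 <- sum((u - mu)^3 * p)  # 0
m4 <- sum((u - mu)^4 * p)  # 4/5
m3 / m2^(3/2)              # skewness gamma_1: 0
m4 / m2^2 - 3              # excess kurtosis gamma_2: -1.75
```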
By themselves, the mean, variance, skewness and kurtosis do not completely describe a distribution; many different distributions can be found having a given mean, variance, skewness and kurtosis. However, in general, all the moments of a distribution together define the distribution. This leads to the idea of a moment-generating function.
3.6 Moment-generating functions
3.6.1 Definition
So far, the distribution of a random variable has been described using a probability function or a distribution function. Sometimes, however, working with a different representation is useful (for example, see Sect. 7.4).
In this section, the moment-generating function is used to represent the distribution of the probabilities of a random variable. As the name suggests, this function can be used to generate any moment of a distribution. Other uses of the moment-generating function are seen later (see Sect. 7.4).
Definition 3.13 (Moment-generating function (mgf)) The moment-generating function (or mgf) MX(t) of the random variable X is defined as:
- M_X(t) = \text{E}(\exp(tX)) = \sum_x \exp(tx)\, p_X(x) in the discrete case;
- M_X(t) = \text{E}(\exp(tX)) = \int_{-\infty}^{\infty} \exp(tx)\, f_X(x)\, dx in the continuous case.
The mgf is defined as an infinite series or an infinite integral. Such an expression may not always exist (that is, converge to a finite value) for all values of t, so the mgf may not be defined for all values of t. Note that the mgf always exists for t=0; in fact MX(0)=1.
Provided the mgf is defined for some values of t other than zero, it uniquely defines a probability distribution, and we can use it to generate the moments of the distribution as described in Theorem 3.4.
Moment-generating functions are closely related to Laplace transforms.
Example 3.19 (Moment-generating function) Consider the random variable Y with pdf
f_Y(y) = \begin{cases} \exp(-y) & \text{for $y > 0$};\\ 0 & \text{elsewhere}. \end{cases}
The mgf is
M_Y(t) = \text{E}[\exp(tY)] = \int_0^\infty \exp(ty)\exp(-y)\, dy = \int_0^\infty \exp\{y(t - 1)\}\, dy = \frac{1}{t - 1}\exp\{y(t - 1)\}\Big|_{y = 0}^{y = \infty} = (1 - t)^{-1}
provided t - 1 < 0; that is, for t < 1.
If t \ge 1, the integral does not converge.
For example, if t = 2,
\frac{1}{2 - 1}\exp(y)\Big|_{y = 0}^{y = \infty} = \lim_{y\to\infty}\exp(y) - \exp(0),
which does not converge.
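The closed form can be checked numerically at any valid t (a sketch in R; the choice t = 0.5 is arbitrary, and must satisfy t < 1):

```r
# Check M_Y(t) = E[exp(tY)] = 1/(1 - t) at t = 0.5 for the pdf exp(-y), y > 0
t <- 0.5
integrate(function(y) exp(t * y) * exp(-y), lower = 0, upper = Inf)$value  # about 2
1 / (1 - t)                                                                # exactly 2
```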
Example 3.20 (Mgf for die rolls) Consider the pf of X, the outcome of tossing a fair die (Example 3.8).
The mgf of X is
M_X(t) = \text{E}(\exp(tX)) = \sum_{x = 1}^{6} \exp(tx)\, p_X(x) = \frac{1}{6}\left(e^{t} + e^{2t} + e^{3t} + e^{4t} + e^{5t} + e^{6t}\right),
which exists for all values of t.
3.6.2 Using the mgf to generate moments
Replacing exp(xt) by its series expansion (App. A) in the definition of the mgf gives
M_X(t) = \sum_x \left(1 + xt + \frac{x^2t^2}{2!} + \dots\right) \Pr(X = x) = 1 + \mu'_1 t + \mu'_2 \frac{t^2}{2!} + \mu'_3 \frac{t^3}{3!} + \dots
Then, the rth moment of the distribution about the origin is seen to be the coefficient of t^r/r! in the series expansion of M_X(t). Alternatively, differentiating the mgf (shown here for the discrete case) gives
\frac{d M_X(t)}{dt} = \sum_x x e^{xt} \Pr(X = x), \qquad \frac{d^2 M_X(t)}{dt^2} = \sum_x x^2 e^{xt} \Pr(X = x),
and so, for each positive integer r,
\frac{d^r M_X(t)}{dt^r} = \sum_x x^r e^{xt} \Pr(X = x).
On setting t = 0, d M_X(t)/dt\,\big|_{t = 0} = \text{E}(X) and d^2 M_X(t)/dt^2\,\big|_{t = 0} = \text{E}(X^2), and in general, for each positive integer r,
\frac{d^r M_X(t)}{dt^r}\,\Big|_{t = 0} = \text{E}(X^r).
(Sometimes, d^r M_X(t)/dt^r\,\big|_{t = 0} is written as M_X^{(r)}(0) for brevity.)
This result is summarised in the following theorem.
Theorem 3.4 (Moments) The rth moment \mu'_r of the distribution of the random variable X about the origin is given by either
- the coefficient of t^r/r! (for r = 1, 2, 3, \dots) in the power series expansion of M_X(t); or
- \mu'_r = \frac{d^r M_X(t)}{dt^r}\,\Big|_{t = 0}, where M_X(t) is the mgf of X.
Example 3.21 (Mean and variance from a mgf) Continuing Example 3.19, the mean and variance of Y can be found from the mgf. To find the mean, first find \frac{d}{dt} M_Y(t) = (1 - t)^{-2}. Setting t = 0 gives the mean as \text{E}(Y) = 1. Likewise, \frac{d^2}{dt^2} M_Y(t) = 2(1 - t)^{-3}. Setting t = 0 gives \text{E}(Y^2) = 2. The variance is therefore \text{var}[Y] = 2 - 1^2 = 1.
Once the moment-generating function has been computed, raw moments can be computed using \text{E}(Y^r) = \mu'_r = \frac{d^r}{dt^r} M_Y(t)\,\Big|_{t = 0}.
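The differentiation can even be delegated to R's symbolic derivative function D() (a minimal sketch, again using the mgf of Example 3.19):

```r
# Raw moments of Y from its mgf M_Y(t) = (1 - t)^(-1), by symbolic differentiation
M  <- expression((1 - t)^(-1))
M1 <- D(M, "t")            # first derivative of the mgf
M2 <- D(M1, "t")           # second derivative
eval(M1, list(t = 0))      # E(Y)   = 1
eval(M2, list(t = 0))      # E(Y^2) = 2, so var(Y) = 2 - 1^2 = 1
```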
3.6.3 Some useful results
The moment-generating function can be used to derive the distribution of a function of a random variable (see Sect. 7.4). The following theorems are valuable for this task.
Theorem 3.5 (Mgf of linear combinations) If the random variable X has mgf M_X(t) and Y = aX + b where a and b are constants, then the mgf of Y is M_Y(t) = \text{E}(\exp\{t(aX + b)\}) = \exp(bt)\, M_X(at).
Theorem 3.6 (Mgf of independent rvs) If X_1, X_2, \dots, X_n are n independent random variables, where X_i has mgf M_{X_i}(t), then the mgf of Y = X_1 + X_2 + \cdots + X_n is M_Y(t) = \prod_{i = 1}^{n} M_{X_i}(t).
Proof. The proofs are left as an exercise.
Note that in the special case of Theorem 3.6 where all the random variables are independently and identically distributed, we have M_Y(t) = [M_{X_1}(t)]^n.
Example 3.22 (Mgf of linear combinations) Consider the random variable X with pf
p_X(x) = 2(1/3)^x \quad\text{for $x = 1, 2, 3, \dots$}
and zero elsewhere.
The mgf of X is
M_X(t) = \sum_{x:\, p(x) > 0} \exp(tx)\, p_X(x) = \sum_{x = 1}^{\infty} \exp(tx)\, 2(1/3)^x = 2\sum_{x = 1}^{\infty} \left(\exp(t)/3\right)^x = 2\left\{ \frac{\exp(t)}{3} + \left(\frac{\exp(t)}{3}\right)^2 + \left(\frac{\exp(t)}{3}\right)^3 + \dots \right\} = \frac{2\exp(t)}{3 - \exp(t)}
where \sum_{y = 1}^{\infty} a^y = a/(1 - a) for a < 1 has been used (App. A); here a = \exp(t)/3.
Next consider finding the mgf of Y=(X−2)/3.
From Theorem 3.5 with a=1/3 and b=−2/3,
M_Y(t) = \exp(-2t/3)\, M_X(t/3) = \frac{2\exp(-t/3)}{3 - \exp(t/3)}.
In practice, rather than identify a and b and remember Theorem 3.5, problems like this are best solved directly from the definition of the mgf:
M_Y(t) = \text{E}(\exp(tY)) = \text{E}\big(\exp\{t(X - 2)/3\}\big) = \text{E}\big(\exp\{tX/3 - 2t/3\}\big) = \exp(-2t/3)\, M_X(t/3) = \frac{2\exp(-t/3)}{3 - \exp(t/3)}.
3.6.4 Determining the distribution from the mgf
The mgf (if it exists) completely determines the distribution of a random variable; hence, given an mgf, deducing the probability function should be possible. For some distributions, the pdf cannot be written in closed form (so the pdf can only be evaluated numerically), but the mgf is relatively simple to write down.
In the discrete case, the mgf is defined as
M_X(t) = \text{E}(\exp(tX)) = \sum_x e^{tx}\, p_X(x)
for X discrete with pf pX(x).
This can be expressed as
M_X(t) = \exp(tx_1)\, p_X(x_1) + \exp(tx_2)\, p_X(x_2) + \dots = \exp(tx_1)\Pr(X = x_1) + \exp(tx_2)\Pr(X = x_2) + \dots
and so the probability function of X can be deduced from the mgf.
Example 3.23 (Distribution from the mgf) Suppose a discrete random variable D has the mgf
M_D(t) = \frac{1}{3}\exp(2t) + \frac{1}{6}\exp(3t) + \frac{1}{12}\exp(6t) + \frac{5}{12}\exp(7t).
Then, by the definition of the mgf in the discrete case given above, the coefficient of t in each exponential indicates a value of D, and the coefficient multiplying the exponential indicates the probability of that value of D:
\begin{align*}
M_D(t)
&= \overbrace{\frac{1}{3} \exp(2t)}^{D = 2} + \overbrace{\frac{1}{6}\exp(3t)}^{D = 3} +
\overbrace{\frac{1}{12}\exp(6t)}^{D = 6} + \overbrace{\frac{5}{12}\exp(7t)}^{D = 7}\\
&= \Pr(D = 2)\exp(2t) + \Pr(D = 3)\exp(3t) + \\
& \quad \Pr(D = 6)\exp(6t) + \Pr(D = 7)\exp(7t).
\end{align*}
So the pf is
p_D(d) =
\begin{cases}
1/3 & \text{for $d=2$}\\
1/6 & \text{for $d=3$}\\
1/12 & \text{for $d=6$}\\
5/12 & \text{for $d=7$}\\
0 & \text{otherwise}
\end{cases}
(Of course, it is easy to check by computing the mgf for D from the pf found above; you should get the original mgf.)
Sometimes, using the results in App. A can be helpful.
Example 3.24 (Distribution from the mgf) Consider the mgf
M_X(t) = \frac{\exp(t)}{3 - 2\exp(t)}.
To find the corresponding probability function, one approach is to write the mgf as
M_X(t) = \frac{\exp(t)/3}{1 - 2\exp(t)/3}.
This is the sum of a geometric series (Eq. (A.5)):
a + ar + ar^2 + \ldots + ar^{n - 1}
\rightarrow \frac{a}{1 - r} \text{ as $n \rightarrow \infty$},
where a = \exp(t)/3 and r = 2\exp(t)/3.
Hence the mgf can be expressed as
\frac{1}{3}\exp(t) +
\frac{1}{3}\left(\frac{2}{3}\right) \exp(2t) +
\frac{1}{3}\left(\frac{2}{3}\right)^2 \exp(3t) + \dots
so that the probability function can be deduced as
\begin{align*}
\Pr(X = 1) &= \frac{1}{3};\\
\Pr(X = 2) &= \frac{1}{3}\left(\frac{2}{3}\right);\\
\Pr(X = 3) &= \frac{1}{3}\left(\frac{2}{3}\right)^2,
\end{align*}
or, in general,
p_X(x) = \frac{1}{3}\left( \frac{2}{3}\right)^{x - 1}\quad\text{for $x = 1, 2, 3, \dots$}.
(Later, this will be identified as a geometric distribution.)
The case of a continuous random variable is more involved.
Suppose the continuous random variable X has the mgf M_X(t).
Then the probability density function is (see Abramowitz and Stegun (1964), 26.1.10)
\begin{equation}
f_X(x) =
\frac{1}{2\pi} \int_{-\infty}^{\infty} M_X(it) \exp(-itx)\, dt,
\tag{3.3}
\end{equation}
where i = \sqrt{-1}.
Example 3.25 (Distribution from the mgf) Consider the mgf for a continuous random variable X such that M_X(t) = \exp(t^2/2), defined for all t\in\mathbb{R} (with x\in\mathbb{R}).
Then, M_X(it) = \exp\left( (it)^2/2 \right) = \exp(-t^2/2).
Using Eq. (3.3), the pdf is:
\begin{align*}
f_X(x)
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-t^2/2) \exp(-itx)\, dt,\\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-t^2/2)\left[ \cos(-tx) + i\sin(-tx)\right]\, dt,
\end{align*}
since x\in\mathbb{R} and t\in\mathbb{R}.
Extracting just the real components:
\begin{align*}
f_X(x)
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-t^2/2) \cos(-tx) \, dt\\
&= \frac{1}{2\pi} \left( \sqrt{2\pi} \exp( -x^2/2 ) \right)
= \frac{1}{ \sqrt{2\pi} } \exp( -x^2/2 ),
\end{align*}
which will later be identified as a normal distribution with a mean of zero, and standard deviation of one.
In practice, using Eq. (3.3) can become tedious or intractable for producing a closed form expression for the pdf. However, Eq. (3.3) has been used to compute numerical values of the probability density function. For example, Dunn and Smyth (2008) used Eq. (3.3) to evaluate the Tweedie distributions (Dunn and Smyth 2001), for which (in general) the pdf has no closed form, but does have a simple mgf.
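For the mgf of Example 3.25 the numerical inversion is straightforward, since only the real part of the integrand, exp(-t^2/2) cos(tx), contributes (a minimal R sketch; x = 1 is an arbitrary evaluation point):

```r
# Numerical inversion of M_X(t) = exp(t^2/2) using Eq. (3.3), real part only
x  <- 1
fx <- integrate(function(t) exp(-t^2 / 2) * cos(t * x) / (2 * pi),
                lower = -Inf, upper = Inf)$value
fx          # about 0.2420
dnorm(x)    # standard normal density at x = 1, for comparison
```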
3.7 Tchebysheff’s inequality
Tchebysheff’s inequality applies to any probability distribution, and is sometimes useful in theoretical work or to provide bounds on probabilities.
Theorem 3.7 (Tchebysheff's theorem) Let X be a random variable with finite mean \mu and variance \sigma^2.
Then for any positive k,
\begin{equation}
\Pr\big(|X - \mu| \geq k\sigma \big)\leq \frac{1}{k^2}
\tag{3.4}
\end{equation}
or, equivalently
\begin{equation}
\Pr\big(|X - \mu| < k\sigma \big)\geq 1 - \frac{1}{k^2}.
\end{equation}
Proof. The proof for the continuous case only is given.
Let X be continuous with pdf f(x).
For any c > 0,
\begin{align*}
\sigma^2
& = \int^\infty_{-\infty} (x - \mu )^2f(x)\,dx\\
& = \int^{\mu -\sqrt{c}}_{-\infty} (x - \mu )^2f(x)\, dx +
\int^{\mu + \sqrt{c}}_{\mu-\sqrt{c}}(x - \mu )^2f(x)\,dx +
\int^\infty_{\mu + \sqrt{c}}(x - \mu)^2f(x)\,dx\\
& \geq \int^{\mu -\sqrt{c}}_{-\infty} (x - \mu )^2f(x)\,dx +
\int^\infty_{\mu + \sqrt{c}}(x - \mu )^2f(x)\,dx,
\end{align*}
since the second integral is non-negative.
Now (x - \mu )^2 \geq c if x \leq \mu -\sqrt{c} or x\geq \mu + \sqrt{c}.
So in both the remaining integrals above, replace (x - \mu )^2 by c without altering the direction of the inequality:
\begin{align*}
\sigma^2
&\geq c \int^{\mu -\sqrt{c}}_{-\infty} f(x)\,dx + c\int^\infty_{\mu + \sqrt{c}}f(x)\,dx\\
&= c\,\Pr(X \leq \mu - \sqrt{c}\,) + c\,\Pr(X \geq \mu + \sqrt{c}\,)\\
&= c\,\Pr(|X - \mu| \geq \sqrt{c}\,).
\end{align*}
Putting \sqrt{c} = k\sigma, Eq. (3.4) is obtained.
Given the probability function or pdf of a random variable X, \text{E}(X) and \text{var}(X) can be found, but the converse is not true. That is, from knowledge of \text{E}(X) and \text{var}(X) alone we cannot reconstruct the probability distribution of X, and hence cannot compute probabilities such as \Pr(|X - \mu| \geq k\sigma). Nonetheless, using Tchebysheff’s inequality we can find a useful bound on the probability either outside or inside \mu \pm k\sigma.
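A quick simulation shows that the bound, while valid for any distribution, can be quite loose (a sketch in R using an exponential distribution with rate 1, for which \mu = \sigma = 1; the choice k = 2 is arbitrary):

```r
# Empirical check of Tchebysheff's inequality with k = 2: bound is 1/k^2 = 0.25
set.seed(1)
y <- rexp(100000, rate = 1)     # mean 1, standard deviation 1
mean(abs(y - 1) >= 2 * 1)       # about 0.05, well below the bound of 0.25
```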
3.8 Exercises
Selected answers appear in Sect. D.3.
Exercise 3.1 The pdf for the random variable Y is defined as f_Y(y) = \begin{cases} 2y + k & \text{for $1\le y \le 2$};\\ 0 & \text{elsewhere}. \end{cases}
- Find the value of k.
- Plot the pdf of Y.
- Compute \text{E}(Y).
- Compute \text{var}(Y).
- Compute \Pr(Y > 1.5).
Exercise 3.2 The pmf for the random variable D is defined as p_D(d) = \begin{cases} 1/2 & \text{for $d = 1$};\\ 1/4 & \text{for $d = 2$};\\ k & \text{for $d = 3$};\\ 0 & \text{otherwise}, \end{cases} for a constant k.
- Compute the mean and variance of D.
- Find the mgf for D.
- Compute the mean and variance of D from the mgf.
- Compute \Pr(D < 3).
Exercise 3.3 (This Exercise follows Ex. 2.9.) Five people, including you and a friend, line up at random. The random variable X denotes the number of people between yourself and your friend.
Use the probability function of X found in Ex. 2.9, and find the mean number of people between you and your friend. Simulate this in R to confirm your answer.
Exercise 3.4 The mgf of the discrete random variable Z is M_Z(t) = [0.3\exp(t) + 0.7]^2.
- Compute the mean and variance of Z.
- Find the pf of Z.
Exercise 3.5 The mgf of G is M_G(t) = (1 - \beta t)^{-\alpha} (where \alpha and \beta are constants). Find the mean and variance of G.
Exercise 3.6 Suppose that the pdf of X is f_X(x) = \begin{cases} 2(1 - x) & \text{for $0 < x < 1$};\\ 0 & \text{otherwise}. \end{cases}
- Find the rth raw moment of X.
- Find \text{E}\big[(X + 3)^2\big] using the previous answer.
- Find the variance of X.
Exercise 3.7 Consider the pdf f_Y(y) = \frac{2}{y^2}\qquad y\ge 2.
- Show that the mean of the distribution is not defined.
- Show that the variance does not exist.
- Plot the probability density function over a suitable range.
- Plot the distribution function over a suitable range.
- Determine the median of the distribution.
- Determine the interquartile range of the distribution.
(The interquartile range is a measure of spread, and is calculated as the difference between the third quartile and the first quartile. The first quartile is the value below which 25% of the data lie; the third quartile is the value below which 75% of the data lie.)
- Find \Pr(Y > 4 \mid Y > 3).
Exercise 3.8 The Cauchy distribution has the pdf
\begin{equation}
f_X(x) = \frac{1}{\pi(1 + x^2)}\quad\text{for $x\in\mathbb{R}$}.
\tag{3.5}
\end{equation}
- Use R to draw the probability density function.
- Compute the distribution function for X. Again, use R to draw the function.
- Show that the mean of the Cauchy distribution is not defined.
- Find the variance and the mode of the Cauchy distribution.
Exercise 3.9 The exponential distribution has the pdf f_Y(y) = \frac{1}{\lambda}\exp( -y/\lambda) (for \lambda > 0) for y > 0 and is zero elsewhere.
- Determine the moment-generating function of Y.
- Use the moment-generating function to compute the mean and variance of the exponential distribution.
Exercise 3.10 Prove that for a continuous random variable X which has a distribution that is symmetric about 0 then M_X(t) = M_{-X}(t). Hence prove that for such a random variable, all odd moments about the origin are zero.
Exercise 3.11 (This exercise follows from Exercise 2.7.)
In a study modelling waiting times at a hospital (Khadem et al. 2008), patients are classified into one of three categories:
- Red: Critically ill or injured patients.
- Yellow: Moderately ill or injured patients.
- Green: Minimally injured or uninjured patients.
For ‘Yellow’ patients, the service times of doctors are modelled using a triangular distribution, with a minimum of 3.5 minutes, a maximum of 30.5 minutes, and a mode of 5 minutes.
- Compute the mean of the service times.
- Compute the variance of the service times.
Exercise 3.12 The characteristic function of a random variable X, denoted \varphi(t), is defined as \varphi_X(t) = \text{E}[\exp(i t X)], where i = \sqrt{-1}. Unlike the mgf, the characteristic function is always defined, so is sometimes preferred over the mgf.
- Show that M_X(t) = \varphi_X(-it).
- Show that the mean of a random variable X is given by -i\varphi'(0) (where, as before, the notation means to compute the derivative of \varphi(t) with respect to t, and evaluate at t = 0).
Exercise 3.13 You will find Sect. A.2 useful.
- Write down the first three terms and general term of the expansion of (1 - a)^{-1}.
- Write down the first three terms and general term of the expansion of \text{E}\left[(1 - tX)^{-1}\right].
- Suppose R_X(t) = \text{E}\left [(1 - tX)^{-1} \right], called the geometric generating function of X. Suppose the random variable Y has a uniform distribution on (0, 1); i.e., f_Y(y) = 1 for 0 < y < 1. Determine the geometric generating function of Y from the definition of the expected value. Your answer will involve a term \log(1 - t).
- Using the answer in Part 3, expand the term \log(1 - t) by writing in terms of the infinite series.
- Equate the two series expansions in Part 2 and Part 4 to determine an expression for \text{E}(Y^n), n = 1, 2, 3,\dots.
Exercise 3.14 Suppose the random variable X is defined as f_X(x) = k (3x^2 + 4)\quad\text{for $-c < x < c$}, and is zero elsewhere. Solve for c and k if \text{var}(X) = 28/15. (Hint: Make sure you use the properties of the given probability distribution before embarking on complicated expressions!)
Exercise 3.15 Benford’s law (also see Exercise 2.15) describes the distribution of the leading digits of numbers that span many orders of magnitude (e.g., lengths of rivers) as p_D(d) = \log_{10}\left( \frac{d + 1}{d} \right) \quad\text{for $d\in\{1, 2, \dots, 9\}$}. Find the mean of D (i.e., the mean leading digit).
Exercise 3.16 The von Mises distribution is used to model angular data. The probability function is p_Y(y) = k \exp\{ \lambda\cos(y - \mu) \} for 0\le y < 2\pi, 0 \le \mu < 2\pi where \mu is the mean, and with \lambda > 0.
- Show that the constant k is a function of \lambda only.
- Find the median of the distribution.
- Using R, numerically integrate to find the value k when \mu = \pi/2 and \lambda = 1.
- The distribution function has no closed form. Use R to plot the distribution function for \mu = \pi/2 with \lambda = 1.
Exercise 3.17 The inverse Gaussian distribution has the pdf f_Y(y) = \frac{1}{\sqrt{2\pi y^3\phi}} \exp\left\{ -\frac{1}{2\phi} \frac{(y - \mu)^2}{y\mu^2}\right\} for y > 0, \mu > 0 and \phi > 0.
- Plot the distribution for \mu = 1 for various values of \phi; comment.
- The mgf is M_Y(t) = \exp\left\{ \frac{\lambda}{\mu} \left( 1 - \sqrt{1 - \frac{2\mu^2 t}{\lambda}} \right) \right\}. Use the mgf to deduce the mean and variance of the inverse Gaussian distribution.
Exercise 3.18 Consider the random variable W such that f_W(w) = \frac{c}{w^3}\quad\text{for $w > c$.}
- Find the value of c.
- Find \text{E}[W].
- Find \text{var}[W].
Exercise 3.19 The Pareto distribution has the distribution function F_X(x) = \begin{cases} 1 - \left(\frac{k}{x}\right)^\alpha & \text{for $x > k$}\\ 0 & \text{elsewhere}, \end{cases} for \alpha > 0 and parameter k.
- What values of k are possible?
- Find the density function for the Pareto distribution.
- Compute the mean and variance for the Pareto distribution.
- Find the mode of the Pareto distribution.
- Plot the distribution for \alpha = 3 and k = 3.
- For \alpha = 3 and k = 3, compute \Pr(X > 4 \mid X < 5).
- The Pareto distribution is often used to model incomes. For example, the “80–20 rule” states that 20% of people receive 80% of all income (and, further, that 20% of the highest-earning 20% receive 80% of that 80%). Find the value of \alpha for which this rule holds.
Exercise 3.20 A mixture distribution is a mixture of two or more univariate distributions. For example, the heights of all adults may follow a mixture distribution: one normal distribution for adult females, and another for adult males. For a set of probability functions p^{(i)}_X(x) for i = 1, 2, \dots n and a set of weights w_i such that \sum w_i = 1 and w_i \ge 0 for all i, the mixture distribution f_X(x) is f_X(x) = \sum_{i = 1}^n w_i p^{(i)}_X(x).
- Compute the distribution function for f_X(x).
- Compute the mean and variance of f_X(x).
- Consider the case where p^{(i)}_X(x) has a normal distribution for i = 1, 2, 3, where the means are -1, 2, and 4 respectively, and the variances are 1, 1 and 4 respectively. Plot the probability density function of f_X(x) for various instructive values of the weights.
- Suppose heights of female adults have a normal distribution with mean 163 cm and a standard deviation of 5 cm, and adult males have heights with a mean of 175 cm with standard deviation of 7 cm, and constitute 48 % of the population (Australian Bureau of Statistics 1995). Deduce and plot the probability density of heights of adult Australians.
Exercise 3.21 Consider the distribution such that p_X(x) = \begin{cases} \frac{1}{K} & \text{for $x = 1$};\\ \frac{1}{x(x - 1)} & \text{for $x = 2, 3, \dots, K$};\\ 0 & \text{elsewhere} \end{cases} for K > 2.
- Find the mean and variance of X (as well as possible).
- Plot the distribution for various values of K.
- For K = 6, determine the mgf.
Exercise 3.22 The random variable V has the pmf f_V(v) = (1 - p)^{v - 1} p\quad\text{for $v = 1, 2, \dots$}, and zero elsewhere.
- Show that this is a valid pmf.
- Find \text{E}[V].
- Find \text{var}[V].
Exercise 3.23 Two dice are rolled. Deduce the pmf for the absolute difference between the two numbers that appear uppermost.
Exercise 3.24 The random variable Y has the pmf p_Y(y) = \frac{e^{-\lambda}\lambda^y}{y!}\quad\text{for $y = 0, 1, 2, \dots$}, where \lambda > 0. Find the mgf of Y, and hence show that \text{E}(Y) = \text{var}(Y).
Exercise 3.25 In practice, some distributions cannot be written in closed form, but can be given by writing their moment-generating function. To evaluate the density then requires an infinite summation or an infinite integral. Given a moment-generating function M(t), the probability density function can be reconstructed numerically from the integral using the inversion formula in Eq. (3.3).
The evaluation of the integral generally requires advanced numerical techniques. In this question, we just consider the exponential distribution as a simple example to demonstrate the use of the inversion formula.
- Write down the expression in Eq. (3.3) in the case of the exponential distribution, for which M_X(t) = \lambda/(\lambda - t) for t < \lambda.
- Only the real part of the integral is needed. Extract the real parts of this expression, and simplify the integrand. (The integrand is the expression to be integrated.)
- Plot the integrand from the last part from t = -50 to t = 2 in the case \lambda = 2 and x = 1.
Exercise 3.26 The density function for the random variable X is given as f(x) = x e^{-x} \quad\text{for $x > 0$}.
- Determine the moment-generating function (mgf) of X.
- Use the mgf to verify that \text{E}(X) = \text{var}(X).
- Suppose that Y = 1 - X. Determine \text{E}(Y).
Exercise 3.27 The Gumbel distribution has the cumulative distribution function
\begin{equation}
F(x; \mu, \sigma) = \exp\left[ -\exp\left( -\frac{x - \mu}{\sigma}\right)\right]
\tag{3.6}
\end{equation}
(for \mu > 0 and \sigma > 0) and is often used to model extreme values (such as the distribution of the maximum river height).
- Deduce the probability function for the Gumbel distribution.
- Plot the Gumbel distribution for a variety of parameters.
- The maximum daily precipitation (in mm) in Oslo, Norway, is well-modelled using a Gumbel distribution with \mu = 2.6 and \sigma = 1.86. Draw this distribution, and explain what it means.
Exercise 3.28 The density function for the random variable X is given as f(x) = x e^{-x} \quad\text{for $x > 0$}.
- Determine \text{E}(X).
- Verify that \text{E}(X) = \text{var}(X).
Exercise 3.29 (This exercise follows from Ex. 2.13.) To detect disease in a population through a blood test, usually every individual is tested. If the disease is uncommon, however, an alternative method is often more efficient.
In the alternative method (called a pooled test), blood from n individuals is combined, and one test is conducted. If the test returns a negative result, then none of the n people have the disease; if the test returns a positive result, all n individuals are then tested individually to identify which individual(s) have the disease.
Suppose a disease occurs in an unknown proportion p of people. Let X be the number of tests to be performed for a group of n individuals.
- What is the expected number of tests needed in a group of n people using the pooled method?
- What is the variance of the number of tests needed in a group of n people using the pooled method?
- Explain what happens to the mean and variance as p \to 1 and as p \to 0, and how these results make sense in the context of the question.
- If pooling was not used with a group of n people, the number of tests would be n: one for each person. Deduce an expression for the value of p for which the expected number of tests using the pooled approach exceeds the non-pooled approach.
- Produce a well-labelled plot showing the expected number of tests that are saved by using the pooled method when p = 0.1 for values of n from 2 to 10, and comment on what this shows practically.
- Suppose a test costs $15. What is the expected cost-saving for using the pooled-testing method with n = 4 and p = 0.1, if 200 people must be tested?
Exercise 3.30 Consider rolling a fair, six-sided die. The ‘running total’ is the total of all the numbers rolled on the die.
- Find the probability mass function for R, the number of rolls needed to obtain a running total of 3 or more.
- Find the expected number of rolls until the running total reaches 3 or more.