1.1 Probability review
1.1.1 Random variables
A triple $(\Omega,\mathcal{A},\mathbb{P})$ is called a probability space. $\Omega$ represents the sample space, the set of all possible individual outcomes of a random experiment. $\mathcal{A}$ is a $\sigma$-field, a class of subsets of $\Omega$ that is closed under complementation and countable unions, and such that $\Omega\in\mathcal{A}$. $\mathcal{A}$ represents the collection of possible events (combinations of individual outcomes) that are assigned a probability by the probability measure $\mathbb{P}$. A random variable is a map $X:\Omega\longrightarrow\mathbb{R}$ such that $X^{-1}((-\infty,x])=\{\omega\in\Omega:X(\omega)\leq x\}\in\mathcal{A}$, $\forall x\in\mathbb{R}$ (the set $X^{-1}((-\infty,x])$ of possible outcomes of $X$ is said to be measurable).
1.1.2 Cumulative distribution and probability density functions
The cumulative distribution function (cdf) of a random variable $X$ is $F(x):=\mathbb{P}[X\leq x]$. When an independent and identically distributed (iid) sample $X_1,\ldots,X_n$ is given, the cdf can be estimated by the empirical distribution function (ecdf)
$$F_n(x)=\frac{1}{n}\sum_{i=1}^{n}1_{\{X_i\leq x\}},$$
where $1_A:=\begin{cases}1, & A\ \text{is true},\\ 0, & A\ \text{is false},\end{cases}$ is an indicator function.
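As a minimal numerical sketch (assuming Python with NumPy and SciPy, which the text does not prescribe; the $\mathcal{N}(0,1)$ sample is an arbitrary choice), the ecdf can be computed directly from its definition and compared with the true cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(size=100)                 # iid sample X_1, ..., X_n from N(0, 1)

def ecdf(sample, x):
    """F_n(x) = (1/n) * sum_i 1{X_i <= x}, evaluated at each point of x."""
    sample = np.asarray(sample)
    return np.mean(sample[:, None] <= np.atleast_1d(x), axis=0)

x_grid = np.linspace(-3, 3, 7)
print(ecdf(sample, x_grid))                   # empirical cdf F_n on a grid
print(stats.norm.cdf(x_grid))                 # true cdf F, for comparison
```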
Continuous random variables are characterized by either the cdf $F$ or the probability density function (pdf) $f=F'$, the latter representing the infinitesimal relative probability of $X$ per unit of length. We write $X\sim F$ (or $X\sim f$) to denote that $X$ has a cdf $F$ (or a pdf $f$). If two random variables $X$ and $Y$ have the same distribution, we write $X\stackrel{d}{=}Y$.
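A quick sanity check of $f=F'$ (again a sketch assuming SciPy; the $\mathcal{N}(0,1)$ choice is illustrative): a central finite difference of the cdf recovers the pdf.

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 601)
h = 1e-5
# Central finite difference of the cdf approximates its derivative, the pdf
f_approx = (stats.norm.cdf(x + h) - stats.norm.cdf(x - h)) / (2 * h)
print(np.max(np.abs(f_approx - stats.norm.pdf(x))))   # approximately 0
```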
1.1.3 Expectation
The expectation operator is constructed using the Riemann–Stieltjes “$\mathrm{d}F(x)$” integral. Hence, for $X\sim F$, the expectation of $g(X)$ is
$$\mathbb{E}[g(X)]:=\int g(x)\,\mathrm{d}F(x)=\begin{cases}\int g(x)f(x)\,\mathrm{d}x, & \text{if } X \text{ is continuous},\\ \sum_{\{x\in\mathbb{R}:\,\mathbb{P}[X=x]>0\}}g(x)\,\mathbb{P}[X=x], & \text{if } X \text{ is discrete}.\end{cases}$$
Unless otherwise stated, the integration limits of any integral are $\mathbb{R}$ or $\mathbb{R}^p$. The variance operator is defined as $\mathrm{Var}[X]:=\mathbb{E}[(X-\mathbb{E}[X])^2]$.
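The continuous branch of the expectation can be checked numerically (a sketch assuming SciPy; $g(x)=x^2$ and $X\sim\mathrm{Exp}(1)$ are arbitrary illustrative choices, for which $\mathbb{E}[g(X)]=2$):

```python
import numpy as np
from scipy import stats, integrate

g = lambda x: x**2
f = stats.expon.pdf                        # pdf of X ~ Exp(1)

# E[g(X)] = int g(x) f(x) dx by numerical integration ...
integral, _ = integrate.quad(lambda x: g(x) * f(x), 0, np.inf)

# ... and by a Monte Carlo average over a large iid sample
rng = np.random.default_rng(1)
mc = np.mean(g(rng.exponential(size=100_000)))

print(integral, mc)                        # both approximately E[X^2] = 2
```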
1.1.4 Random vectors, marginals, and conditionals
We employ boldface to denote vectors (assumed to be column vectors) and matrices. A $p$-random vector is a map $\mathbf{X}:\Omega\longrightarrow\mathbb{R}^p$, $\mathbf{X}(\omega):=(X_1(\omega),\ldots,X_p(\omega))'$, such that each $X_i$ is a random variable. The (joint) cdf of $\mathbf{X}$ is $F(\mathbf{x}):=\mathbb{P}[\mathbf{X}\leq\mathbf{x}]:=\mathbb{P}[X_1\leq x_1,\ldots,X_p\leq x_p]$ and, if $\mathbf{X}$ is continuous, its (joint) pdf is $f:=\frac{\partial^p}{\partial x_1\cdots\partial x_p}F$.
The marginals of $F$ and $f$ are the cdfs and pdfs of $X_i$, $i=1,\ldots,p$, respectively. They are defined as
$$F_{X_i}(x_i):=\mathbb{P}[X_i\leq x_i]=F(\infty,\ldots,\infty,x_i,\infty,\ldots,\infty),$$
$$f_{X_i}(x_i):=\frac{\partial}{\partial x_i}F_{X_i}(x_i)=\int_{\mathbb{R}^{p-1}}f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-i},$$
where $\mathbf{x}_{-i}:=(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_p)'$. The definitions can be extended analogously to the marginals of the cdf and pdf of different subsets of $\mathbf{X}$.
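For instance, the marginal pdf of $X_1$ in a bivariate normal can be obtained by numerically integrating the joint pdf over $x_2$ (a sketch assuming SciPy; the correlation $\rho=0.75$ simply mirrors Figure 1.1):

```python
import numpy as np
from scipy import stats, integrate

rho = 0.75
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

def marginal_x1(x1):
    """f_{X1}(x1) = int f(x1, x2) dx2, by numerical integration."""
    return integrate.quad(lambda x2: joint.pdf([x1, x2]), -np.inf, np.inf)[0]

x1 = -2.0
print(marginal_x1(x1))            # marginal obtained by integrating out x2
print(stats.norm.pdf(x1))         # known N(0, 1) marginal of X1, for comparison
```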
The conditional cdf and pdf of $X_1|(X_2,\ldots,X_p)$ are defined, respectively, as
$$F_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1):=\mathbb{P}[X_1\leq x_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}],\quad f_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1):=\frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.$$
The conditional expectation of $Y|X$ is the following random variable:
$$\mathbb{E}[Y|X]:=\int y\,\mathrm{d}F_{Y|X}(y|X).$$
The conditional variance of $Y|X$ is defined as
$$\mathrm{Var}[Y|X]:=\mathbb{E}[(Y-\mathbb{E}[Y|X])^2|X]=\mathbb{E}[Y^2|X]-\mathbb{E}[Y|X]^2.$$
Proposition 1.1 (Laws of total expectation and variance) Let $X$ and $Y$ be two random variables.
- Total expectation: if $\mathbb{E}[|Y|]<\infty$, then $\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}[Y|X]]$.
- Total variance: if $\mathbb{E}[Y^2]<\infty$, then $\mathrm{Var}[Y]=\mathbb{E}[\mathrm{Var}[Y|X]]+\mathrm{Var}[\mathbb{E}[Y|X]]$.
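Both laws can be checked by simulation. The sketch below (assuming NumPy) uses an illustrative hierarchical model not taken from the text: $X\sim\mathrm{Exp}(1)$ and $Y|X=x\sim\mathcal{N}(x,1)$, for which $\mathbb{E}[Y|X]=X$ and $\mathrm{Var}[Y|X]=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(size=n)              # X ~ Exp(1)
y = rng.normal(loc=x, scale=1.0)         # Y | X = x ~ N(x, 1)

# Law of total expectation: E[Y] = E[E[Y|X]] = E[X]
print(y.mean(), x.mean())                # both approximately 1

# Law of total variance: Var[Y] = E[Var[Y|X]] + Var[E[Y|X]] = 1 + Var[X]
print(y.var(), 1.0 + x.var())            # both approximately 2
```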
Exercise 1.1 Prove the law of total variance from the law of total expectation.
Figure 1.1 graphically summarizes the concepts of joint, marginal, and conditional distributions within the context of a 2-dimensional normal.

Figure 1.1: Visualization of the joint pdf (in blue), marginal pdfs (green), conditional pdf of $X_2|X_1=x_1$ (orange), expectation (red point), and conditional expectation $\mathbb{E}[X_2|X_1=x_1]$ (orange point) of a 2-dimensional normal. The conditioning point of $X_1$ is $x_1=-2$. Note the different scales of the densities, as they have to integrate to one over different supports. Note how the conditional density (upper orange curve) is not the joint pdf $f(x_1,x_2)$ (lower orange curve) with $x_1=-2$, but its rescaling by $\frac{1}{f_{X_1}(x_1)}$. The parameters of the 2-dimensional normal are $\mu_1=\mu_2=0$, $\sigma_1=\sigma_2=1$, and $\rho=0.75$ (see Exercise 1.9). 500 observations sampled from the distribution are shown in black.
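The rescaling highlighted in the caption can be verified numerically (a sketch assuming SciPy, with the same parameters $\mu_1=\mu_2=0$, $\sigma_1=\sigma_2=1$, $\rho=0.75$, and conditioning point $x_1=-2$):

```python
import numpy as np
from scipy import stats

rho, x1 = 0.75, -2.0
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

x2 = np.linspace(-4, 2, 5)
# Conditional pdf as the rescaled joint pdf: f(x1, x2) / f_{X1}(x1)
cond_rescaled = joint.pdf(np.column_stack([np.full_like(x2, x1), x2])) / stats.norm.pdf(x1)
# Known closed form for this bivariate normal: X2 | X1 = x1 ~ N(rho * x1, 1 - rho^2)
cond_closed = stats.norm.pdf(x2, loc=rho * x1, scale=np.sqrt(1 - rho**2))
print(np.allclose(cond_rescaled, cond_closed))   # True
```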
Exercise 1.2 Consider the random vector $(X,Y)$ with joint pdf
$$f(x,y)=\begin{cases}ye^{-axy}, & x>0,\ y\in(0,b),\\ 0, & \text{else}.\end{cases}$$
- Determine the value of $b>0$ that makes $f$ a valid pdf.
- Compute $\mathbb{E}[X]$ and $\mathbb{E}[Y]$.
- Verify the law of total expectation.
- Verify the law of total variance.
Exercise 1.3 Consider the continuous random vector $(X_1,X_2)$ with joint pdf given by
$$f(x_1,x_2)=\begin{cases}2, & 0<x_1<x_2<1,\\ 0, & \text{else}.\end{cases}$$
- Check that $f$ is a proper pdf.
- Obtain the joint cdf of $(X_1,X_2)$.
- Obtain the marginal pdfs of $X_1$ and $X_2$.
- Obtain the marginal cdfs of $X_1$ and $X_2$.
- Obtain the conditional pdfs of $X_1|X_2=x_2$ and $X_2|X_1=x_1$.
1.1.5 Variance-covariance matrix
For two random variables $X_1$ and $X_2$, the covariance between them is defined as
$$\mathrm{Cov}[X_1,X_2]:=\mathbb{E}[(X_1-\mathbb{E}[X_1])(X_2-\mathbb{E}[X_2])]=\mathbb{E}[X_1X_2]-\mathbb{E}[X_1]\mathbb{E}[X_2],$$
and the correlation between them, as
$$\mathrm{Cor}[X_1,X_2]:=\frac{\mathrm{Cov}[X_1,X_2]}{\sqrt{\mathrm{Var}[X_1]\mathrm{Var}[X_2]}}.$$
The variance and the covariance are extended to a random vector $\mathbf{X}=(X_1,\ldots,X_p)'$ by means of the so-called variance-covariance matrix:
$$\mathrm{Var}[\mathbf{X}]:=\mathbb{E}[(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{X}-\mathbb{E}[\mathbf{X}])']=\mathbb{E}[\mathbf{X}\mathbf{X}']-\mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]'=\begin{pmatrix}\mathrm{Var}[X_1] & \mathrm{Cov}[X_1,X_2] & \cdots & \mathrm{Cov}[X_1,X_p]\\ \mathrm{Cov}[X_2,X_1] & \mathrm{Var}[X_2] & \cdots & \mathrm{Cov}[X_2,X_p]\\ \vdots & \vdots & \ddots & \vdots\\ \mathrm{Cov}[X_p,X_1] & \mathrm{Cov}[X_p,X_2] & \cdots & \mathrm{Var}[X_p]\end{pmatrix},$$
where $\mathbb{E}[\mathbf{X}]:=(\mathbb{E}[X_1],\ldots,\mathbb{E}[X_p])'$ is just the componentwise expectation. As in the univariate case, the expectation is a linear operator, which now means that
$$\mathbb{E}[\mathbf{A}\mathbf{X}+\mathbf{b}]=\mathbf{A}\mathbb{E}[\mathbf{X}]+\mathbf{b},\quad\text{for a }q\times p\text{ matrix }\mathbf{A}\text{ and }\mathbf{b}\in\mathbb{R}^q.\tag{1.2}$$
It follows from (1.2) that
$$\mathrm{Var}[\mathbf{A}\mathbf{X}+\mathbf{b}]=\mathbf{A}\mathrm{Var}[\mathbf{X}]\mathbf{A}',\quad\text{for a }q\times p\text{ matrix }\mathbf{A}\text{ and }\mathbf{b}\in\mathbb{R}^q.\tag{1.3}$$
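Equation (1.3) can be checked with a quick simulation (a sketch assuming NumPy; the $2\times 3$ matrix $\mathbf{A}$, the offset $\mathbf{b}$, and the covariance matrix are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])        # Var[X], a 3 x 3 covariance matrix
A = rng.normal(size=(2, 3))                # q x p matrix A
b = np.array([1.0, -1.0])                  # b in R^q

# Sample n observations of X and transform them into AX + b (stored as rows)
x = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)
y = x @ A.T + b

print(np.cov(y, rowvar=False))             # sample Var[AX + b]
print(A @ Sigma @ A.T)                     # A Var[X] A', approximately equal
```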
1.1.6 Inequalities
We conclude this section by reviewing some useful probabilistic inequalities.
Proposition 1.2 (Markov's inequality) Let $X$ be a nonnegative random variable with $\mathbb{E}[X]<\infty$. Then
$$\mathbb{P}[X\geq t]\leq\frac{\mathbb{E}[X]}{t},\quad\forall t>0.$$
Proposition 1.3 (Chebyshev's inequality) Let $X$ be a random variable with $\mu=\mathbb{E}[X]$ and $\sigma^2=\mathrm{Var}[X]<\infty$. Then
$$\mathbb{P}[|X-\mu|\geq t]\leq\frac{\sigma^2}{t^2},\quad\forall t>0.$$
Exercise 1.4 Prove Markov's inequality using $X=X1_{\{X\geq t\}}+X1_{\{X<t\}}$.
Exercise 1.5 Prove Chebyshev’s inequality using Markov’s.
Remark. Chebyshev's inequality gives a quick and handy way of computing confidence intervals for the values of any random variable $X$ with finite variance:
$$\mathbb{P}[X\in(\mu-t\sigma,\mu+t\sigma)]\geq 1-\frac{1}{t^2},\quad\forall t>0.\tag{1.4}$$
That is, for any $t>0$, the interval $(\mu-t\sigma,\mu+t\sigma)$ has, at least, a probability $1-1/t^2$ of containing a random realization of $X$. The intervals are conservative, but extremely general. The table below gives the guaranteed coverage probability $1-1/t^2$ for common values of $t$.
| $t$ | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| Guaranteed coverage $1-1/t^2$ | 0.75 | 0.8889 | 0.9375 | 0.96 | 0.9722 |
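How conservative (1.4) is depends on the distribution. The sketch below (assuming SciPy; the $\mathcal{N}(0,1)$ and $\mathrm{Exp}(1)$ choices are illustrative) compares the guaranteed coverage with the actual coverage of $(\mu-t\sigma,\mu+t\sigma)$:

```python
from scipy import stats

for t in [2, 3, 4, 5, 6]:
    guaranteed = 1 - 1 / t**2                                 # Chebyshev bound
    normal = stats.norm.cdf(t) - stats.norm.cdf(-t)           # N(0, 1): mu = 0, sigma = 1
    expon = stats.expon.cdf(1 + t) - stats.expon.cdf(1 - t)   # Exp(1): mu = sigma = 1
    print(t, round(guaranteed, 4), round(normal, 4), round(expon, 4))
```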
Exercise 1.6 Prove (1.4) from Chebyshev’s inequality.
Proposition 1.4 (Cauchy–Schwarz inequality) Let $X$ and $Y$ be such that $\mathbb{E}[X^2]<\infty$ and $\mathbb{E}[Y^2]<\infty$. Then
$$|\mathbb{E}[XY]|\leq\sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}.$$
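A small Monte Carlo illustration (a sketch assuming NumPy; the dependent normal pair is an arbitrary choice) compares the sample analogues of both sides of the inequality:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)        # X and Y deliberately correlated

lhs = abs(np.mean(x * y))                     # sample analogue of |E[XY]|
rhs = np.sqrt(np.mean(x**2) * np.mean(y**2))  # sample analogue of sqrt(E[X^2] E[Y^2])
print(lhs, rhs, lhs <= rhs)                   # the inequality holds
```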
Exercise 1.7 Prove the Cauchy–Schwarz inequality by “pulling a rabbit out of a hat”: consider the polynomial $p(t)=\mathbb{E}[(tX+Y)^2]=At^2+2Bt+C\geq 0$, $\forall t\in\mathbb{R}$.
Exercise 1.8 Does $\mathbb{E}[|XY|]\leq\sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}$ hold? Observe that, due to the next proposition, $|\mathbb{E}[XY]|\leq\mathbb{E}[|XY|]$.
Proposition 1.5 (Jensen's inequality) If $g$ is a convex function, then
$$g(\mathbb{E}[X])\leq\mathbb{E}[g(X)].$$
Example 1.1 Jensen's inequality has interesting derivations. For example:
- Take $h=-g$. Then $h$ is a concave function and $h(\mathbb{E}[X])\geq\mathbb{E}[h(X)]$.
- Take $g(x)=x^r$ for $r\geq1$, which is a convex function on $[0,\infty)$. Then, for a nonnegative random variable $X$, $\mathbb{E}[X]^r\leq\mathbb{E}[X^r]$. If $0<r<1$, then $g(x)=x^r$ is concave on $[0,\infty)$ and $\mathbb{E}[X]^r\geq\mathbb{E}[X^r]$.
- The previous results hold considering $g(x)=|x|^r$. In particular, $|\mathbb{E}[X]|^r\leq\mathbb{E}[|X|^r]$ for $r\geq1$.
- Consider $0<r\leq s$. Then $g(x)=x^{s/r}$ is convex (since $s/r\geq1$) and $g(\mathbb{E}[|X|^r])\leq\mathbb{E}[g(|X|^r)]=\mathbb{E}[|X|^s]$. As a consequence, $\mathbb{E}[|X|^s]<\infty\implies\mathbb{E}[|X|^r]<\infty$ for $0\leq r\leq s$.
- The exponential (logarithm) function is convex (concave). Consequently, $\exp(\mathbb{E}[X])\leq\mathbb{E}[\exp(X)]$ and $\log(\mathbb{E}[|X|])\geq\mathbb{E}[\log(|X|)]$; the sketch below checks both numerically.
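A numerical check of the last bullet (a sketch assuming NumPy; $X\sim\mathcal{N}(0,1)$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)

# exp(E[X]) <= E[exp(X)]: approximately exp(0) = 1 versus exp(1/2) ~ 1.65
print(np.exp(x.mean()), np.mean(np.exp(x)))
# log(E[|X|]) >= E[log(|X|)]
print(np.log(np.mean(np.abs(x))), np.mean(np.log(np.abs(x))))
```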