1.1 Basic probability review
A triple (Ω,A,P) is called a probability space. Ω represents the sample space, the set of all possible individual outcomes of a random experiment. A is a σ-algebra, a class of subsets of Ω that contains Ω and is closed under complementation and countable unions. A represents the collection of possible events (combinations of individual outcomes) that are assigned a probability by the probability measure P. A random variable is a map X:Ω⟶R such that {ω∈Ω:X(ω)≤x}∈A for all x∈R (i.e., the set is measurable).
The cumulative distribution function (cdf) of a random variable X is F(x):=P[X≤x]. When an independent and identically distributed (iid) sample X1,…,Xn is given, the cdf can be estimated by the empirical distribution function (ecdf)
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{X_i \le x\}},$$
where $1_A := \begin{cases} 1, & A\ \text{is true},\\ 0, & A\ \text{is false},\end{cases}$ is an indicator function. Continuous random variables are characterized either by the cdf F or by the probability density function (pdf) f=F′, which represents the infinitesimal relative probability of X per unit of length. We write X∼F (or X∼f) to denote that X has a cdf F (or a pdf f). If two random variables X and Y have the same distribution, we write $X \stackrel{d}{=} Y$.
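As a quick illustration (a minimal sketch, not part of the original notes; Python and the simulated normal sample are assumptions), the ecdf can be evaluated on a grid as follows:

```python
import numpy as np

rng = np.random.default_rng(42)
x_sample = rng.normal(size=100)  # iid sample X_1, ..., X_n (illustrative choice)

def ecdf(sample, x):
    """Evaluate F_n(x) = (1/n) * sum_i 1{X_i <= x} at the points in x."""
    sample = np.asarray(sample)
    x = np.atleast_1d(x)
    return np.mean(sample[:, None] <= x[None, :], axis=0)

print(ecdf(x_sample, np.linspace(-3, 3, 7)))  # increases from near 0 to near 1
```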
The expectation operator is constructed using the Lebesgue–Stieltjes “dF(x)” integral. Hence, for X∼F, the expectation of g(X) is
$$\mathbb{E}[g(X)] := \int g(x)\,\mathrm{d}F(x) = \begin{cases} \int g(x) f(x)\,\mathrm{d}x, & X\ \text{continuous},\\ \sum_{\{i:\,\mathbb{P}[X=x_i]>0\}} g(x_i)\,\mathbb{P}[X=x_i], & X\ \text{discrete}. \end{cases}$$
Unless otherwise stated, the integration limits of any integral are $\mathbb{R}$ or $\mathbb{R}^p$. The variance operator is defined as $\mathrm{Var}[X] := \mathbb{E}[(X - \mathbb{E}[X])^2]$.
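For instance (an illustrative check that is not part of the original text, with Python assumed), sample means approximate these operators: for X ∼ N(0,1) and g(x) = x², both E[g(X)] and Var[X] equal 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10**6)           # X ~ N(0, 1)

print(np.mean(x**2))                 # Monte Carlo estimate of E[g(X)] for g(x) = x^2 (= 1)
print(np.mean((x - np.mean(x))**2))  # estimate of Var[X] = E[(X - E[X])^2] (= 1)
```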
We employ boldface to denote vectors (assumed to be column matrices) and matrices. A p-random vector is a map $\mathbf{X}:\Omega \longrightarrow \mathbb{R}^p$, $\mathbf{X}(\omega) := (X_1(\omega),\ldots,X_p(\omega))$, such that each $X_i$ is a random variable. The (joint) cdf of $\mathbf{X}$ is $F(\mathbf{x}) := \mathbb{P}[\mathbf{X} \le \mathbf{x}] := \mathbb{P}[X_1 \le x_1, \ldots, X_p \le x_p]$ and, if $\mathbf{X}$ is continuous, its (joint) pdf is $f := \frac{\partial^p F}{\partial x_1 \cdots \partial x_p}$. The marginals of $F$ and $f$ are the cdf and pdf of $X_i$, $i=1,\ldots,p$, respectively. They are defined as:
$$F_{X_i}(x_i) := \mathbb{P}[X_i \le x_i] = \lim_{x_j \to +\infty,\, j \ne i} F(\mathbf{x}), \qquad f_{X_i}(x_i) := \frac{\partial}{\partial x_i} F_{X_i}(x_i) = \int_{\mathbb{R}^{p-1}} f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-i},$$
where $\mathbf{x}_{-i} := (x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_p)$. The definitions extend analogously to the marginal cdfs and pdfs of other subsets of $\mathbf{X}$.
The conditional cdf and pdf of $X_1 | (X_2,\ldots,X_p)$ are defined, respectively, as
$$F_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1) := \mathbb{P}[X_1 \le x_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}], \qquad f_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1) := \frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.$$
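As a worked illustration (an example added here, not taken from the original text), let p = 2 and consider the joint pdf $f(x_1,x_2) = x_1 + x_2$ on $[0,1]^2$. The previous definitions give

$$f_{X_2}(x_2) = \int_0^1 (x_1 + x_2)\,\mathrm{d}x_1 = \tfrac{1}{2} + x_2, \qquad f_{X_1|X_2=x_2}(x_1) = \frac{x_1 + x_2}{\tfrac{1}{2} + x_2}, \quad x_1, x_2 \in [0,1].$$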
The conditional expectation of Y|X is the following random variable²:
$$\mathbb{E}[Y|X] := \int y\,\mathrm{d}F_{Y|X}(y|X).$$
The conditional variance of Y|X is defined as
$$\mathrm{Var}[Y|X] := \mathbb{E}[(Y - \mathbb{E}[Y|X])^2|X] = \mathbb{E}[Y^2|X] - \mathbb{E}[Y|X]^2.$$
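Continuing the illustrative joint pdf $f(x_1,x_2) = x_1 + x_2$ on $[0,1]^2$ introduced above (again, not from the original text), the conditional expectation of $X_1$ given $X_2$ is

$$\mathbb{E}[X_1|X_2] = \int_0^1 x_1\, \frac{x_1 + X_2}{\tfrac{1}{2} + X_2}\,\mathrm{d}x_1 = \frac{\tfrac{1}{3} + \tfrac{1}{2}X_2}{\tfrac{1}{2} + X_2},$$

which is a function of the random variable $X_2$ and hence itself a random variable, not a number; the conditional variance follows analogously.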
Proposition 1.1 (Laws of total expectation and variance) Let X and Y be two random variables.
- Total expectation: if E[|Y|]<∞, then E[Y]=E[E[Y|X]].
- Total variance: if E[Y²]<∞, then Var[Y]=E[Var[Y|X]]+Var[E[Y|X]].
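Both laws can be checked quickly by simulation. The sketch below is illustrative and not part of the original text: the model X ∼ Unif(0,1), Y|X ∼ N(X, (1+X)²) is an assumption chosen so that E[Y] = E[X] = 1/2 and Var[Y] = E[(1+X)²] + Var[X] = 7/3 + 1/12 ≈ 2.417.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.uniform(size=n)             # X ~ Unif(0, 1)
y = rng.normal(loc=x, scale=1 + x)  # Y | X = x ~ N(x, (1 + x)^2)

print(np.mean(y))  # ~ 0.5, matching E[Y] = E[E[Y|X]] = E[X]
print(np.var(y))   # ~ 2.417, matching E[Var[Y|X]] + Var[E[Y|X]]
```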
Exercise 1.1 Prove the law of total variance from the law of total expectation.
We conclude with some useful inequalities.
Proposition 1.2 (Markov's inequality) Let X be a non-negative random variable with E[X]<∞. Then
$$\mathbb{P}[X > t] \le \frac{\mathbb{E}[X]}{t}, \quad \forall t > 0.$$
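For a concrete check (an illustrative example, not in the original text), take X ∼ Exp(1), so that E[X] = 1 and the exact tail probability can be compared with the bound:

$$\mathbb{P}[X > t] = e^{-t} \le \frac{1}{t} = \frac{\mathbb{E}[X]}{t}, \quad \forall t > 0,$$

since $t \le e^t$ for every $t > 0$; the bound is only informative for $t > 1$.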
Proposition 1.3 (Chebyshev's inequality) Let $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{Var}[X]$. Then
$$\mathbb{P}[|X - \mu| \ge t] \le \frac{\sigma^2}{t^2}, \quad \forall t > 0.$$
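The following sketch (Python assumed; the Exp(1) sample is an illustrative choice, not from the original text) compares the empirical tail probability with Chebyshev's bound for a few values of t:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=10**6)  # X ~ Exp(1): mu = 1, sigma^2 = 1
mu, sigma2 = 1.0, 1.0

for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= t)  # estimate of P[|X - mu| >= t]
    print(t, empirical, sigma2 / t**2)        # empirical tail vs. Chebyshev's bound
```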
Exercise 1.2 Prove Markov’s inequality using $X = X 1_{\{X > t\}} + X 1_{\{X \le t\}}$. Then prove Chebyshev’s inequality using Markov’s. Hint: use the random variable $(X - \mathbb{E}[X])^2$.
Proposition 1.4 (Cauchy–Schwarz inequality) Let X and Y be such that $\mathbb{E}[X^2] < \infty$ and $\mathbb{E}[Y^2] < \infty$. Then $\mathbb{E}[|XY|] \le \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}$.
Proposition 1.5 (Jensen's inequality) If g is a convex function, then g(E[X])≤E[g(X)].
Example 1.1 Jensen’s inequality has several interesting consequences. For instance:
- Take h=−g. Then h is a concave function and we have that h(E[X])≥E[h(X)].
- Take $g(x)=x^r$ for $r \ge 1$ and $X$ non-negative. Then $\mathbb{E}[X]^r \le \mathbb{E}[X^r]$. If $0<r<1$, then $g$ is concave and $\mathbb{E}[X]^r \ge \mathbb{E}[X^r]$.
- Consider $0 \le r \le s$. Then $g(x)=x^{r/s}$ is concave on $[0,\infty)$ and, by the concave version of Jensen's inequality, $g(\mathbb{E}[|X|^s]) \ge \mathbb{E}[g(|X|^s)] = \mathbb{E}[|X|^r]$. As a consequence, $\mathbb{E}[|X|^s] < \infty \implies \mathbb{E}[|X|^r] < \infty$ for $0 \le r \le s$: finite moments of higher order imply finite moments of lower order.
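For a concrete numerical check of these derivations (an added illustration, not in the original text), take X ∼ Unif(0,1):

$$\mathbb{E}[X]^2 = \tfrac{1}{4} \le \tfrac{1}{3} = \mathbb{E}[X^2], \qquad \sqrt{\mathbb{E}[X]} = \tfrac{1}{\sqrt{2}} \approx 0.707 \ge \tfrac{2}{3} = \mathbb{E}[\sqrt{X}],$$

in agreement with the convexity of $x^2$ (r = 2) and the concavity of $\sqrt{x}$ (r = 1/2).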
² Recall that the X-part is random!