1.1 Basic probability review

A triple $(\Omega, \mathcal{A}, \mathbb{P})$ is called a probability space. $\Omega$ represents the sample space, the set of all possible individual outcomes of a random experiment. $\mathcal{A}$ is a σ-algebra, a class of subsets of $\Omega$ that is closed under complementation and countable unions, and such that $\Omega \in \mathcal{A}$. $\mathcal{A}$ represents the collection of possible events (combinations of individual outcomes) that are assigned a probability by the probability measure $\mathbb{P}$. A random variable is a map $X : \Omega \rightarrow \mathbb{R}$ such that $\{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{A}$ for all $x \in \mathbb{R}$ (i.e., the set is measurable, so it can be assigned a probability).

The cumulative distribution function (cdf) of a random variable $X$ is $F(x) := \mathbb{P}[X \leq x]$. When an independent and identically distributed (iid) sample $X_1, \ldots, X_n$ is given, the cdf can be estimated by the empirical distribution function (ecdf)

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \leq x\}},$$

where $1_A := \begin{cases} 1, & A \text{ is true}, \\ 0, & A \text{ is false} \end{cases}$ is an indicator function. Continuous random variables are characterized either by the cdf $F$ or by the probability density function (pdf) $f = F'$, which represents the infinitesimal relative probability of $X$ per unit of length. We write $X \sim F$ (or $X \sim f$) to denote that $X$ has cdf $F$ (or pdf $f$). If two random variables $X$ and $Y$ have the same distribution, we write $X \stackrel{d}{=} Y$.
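To make the ecdf concrete, the following minimal Python sketch (numpy and scipy assumed available; the $\mathcal{N}(0,1)$ sample and the evaluation grid are arbitrary choices for illustration) evaluates $F_n$ directly from the definition above and compares it with the true cdf.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
x_sample = rng.normal(size=100)          # iid N(0, 1) sample (arbitrary choice)

def ecdf(sample, x):
    """Evaluate F_n(x) = (1/n) * sum_i 1{X_i <= x} at each point in x."""
    sample = np.asarray(sample)
    return np.mean(sample[:, None] <= np.atleast_1d(x), axis=0)

grid = np.linspace(-3, 3, 7)
print(np.column_stack([grid, ecdf(x_sample, grid), norm.cdf(grid)]))  # F_n vs. true cdf
```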

The expectation operator is constructed using the Lebesgue–Stieltjes integral “$\mathrm{d}F(x)$”. Hence, for $X \sim F$, the expectation of $g(X)$ is

$$\mathbb{E}[g(X)] := \int g(x) \,\mathrm{d}F(x) = \begin{cases} \int g(x) f(x) \,\mathrm{d}x, & X \text{ continuous}, \\ \sum_{\{i \,:\, \mathbb{P}[X = x_i] > 0\}} g(x_i)\, \mathbb{P}[X = x_i], & X \text{ discrete}. \end{cases}$$

Unless otherwise stated, the integration limits of any integral are $\mathbb{R}$ or $\mathbb{R}^p$. The variance operator is defined as $\mathrm{Var}[X] := \mathbb{E}[(X - \mathbb{E}[X])^2]$.
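The two cases of the expectation formula can be approximated numerically. A minimal sketch, with the arbitrary choices $g(x) = x^2$, $X \sim \mathrm{Exp}(1)$ (so $\mathbb{E}[X^2] = 2$) for the continuous case and $X \sim \mathrm{Pois}(3)$ (so $\mathbb{E}[X^2] = 12$) for the discrete case:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

g = lambda x: x**2

# Continuous case: X ~ Exp(1), E[X^2] = Var + mean^2 = 1 + 1 = 2
cont, _ = quad(lambda x: g(x) * stats.expon.pdf(x), 0, np.inf)

# Discrete case: X ~ Poisson(3), E[X^2] = Var + mean^2 = 3 + 9 = 12
k = np.arange(0, 60)                          # truncate the (infinite) support
disc = np.sum(g(k) * stats.poisson.pmf(k, mu=3))

print(cont, disc)                             # ≈ 2.0 and ≈ 12.0
```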

We employ bold face to denote vectors (assumed to be column matrices) and matrices. A $p$-random vector is a map $\mathbf{X} : \Omega \rightarrow \mathbb{R}^p$, $\mathbf{X}(\omega) := (X_1(\omega), \ldots, X_p(\omega))$, such that each $X_i$ is a random variable. The (joint) cdf of $\mathbf{X}$ is $F(\mathbf{x}) := \mathbb{P}[\mathbf{X} \leq \mathbf{x}] := \mathbb{P}[X_1 \leq x_1, \ldots, X_p \leq x_p]$ and, if $\mathbf{X}$ is continuous, its (joint) pdf is $f := \frac{\partial^p F}{\partial x_1 \cdots \partial x_p}$. The marginals of $F$ and $f$ are the cdf and pdf of $X_i$, $i = 1, \ldots, p$, respectively. They are defined as:

$$F_{X_i}(x_i) := \mathbb{P}[X_i \leq x_i], \qquad f_{X_i}(x_i) := \frac{\mathrm{d}}{\mathrm{d}x_i} F_{X_i}(x_i) = \int_{\mathbb{R}^{p-1}} f(\mathbf{x}) \,\mathrm{d}\mathbf{x}_{-i},$$

where $\mathbf{x}_{-i} := (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p)$. The definitions can be extended analogously to the marginals of the cdf and pdf of different subsets of $\mathbf{X}$.
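As an illustration of the marginalization formula, the sketch below recovers $f_{X_1}$ for $p = 2$ by numerically integrating a bivariate normal joint pdf over $x_2$; the correlation value and the evaluation point are arbitrary choices.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.5                                    # arbitrary correlation
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

x1 = 0.7                                     # point at which to evaluate the marginal
x2_grid = np.linspace(-8, 8, 2001)           # grid over which x2 is integrated out
dx2 = x2_grid[1] - x2_grid[0]
f_joint = joint.pdf(np.column_stack([np.full_like(x2_grid, x1), x2_grid]))

marginal = np.sum(f_joint) * dx2             # Riemann sum for ∫ f(x1, x2) dx2
print(marginal, norm.pdf(x1))                # both ≈ the N(0, 1) density at x1
```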

The conditional cdf and pdf of $X_1 \,|\, (X_2, \ldots, X_p)$ are defined, respectively, as

$$F_{X_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}}(x_1) := \mathbb{P}[X_1 \leq x_1 \,|\, \mathbf{X}_{-1} = \mathbf{x}_{-1}], \qquad f_{X_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}}(x_1) := \frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.$$
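As a concrete check of the ratio definition, take $p = 2$ and a bivariate normal with standard marginals, for which the conditional distribution $X_1 \,|\, X_2 = x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ is known in closed form. A minimal sketch (arbitrary $\rho$ and evaluation point):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.6
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

x1, x2 = 0.3, -1.2                           # arbitrary evaluation point
ratio = joint.pdf([x1, x2]) / norm.pdf(x2)   # f(x1, x2) / f_{X2}(x2)

# Known conditional distribution: X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2)
closed_form = norm.pdf(x1, loc=rho * x2, scale=np.sqrt(1 - rho**2))
print(ratio, closed_form)                    # the two values should coincide
```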

The conditional expectation of $Y | X$ is the following random variable¹

$$\mathbb{E}[Y | X] := \int y \,\mathrm{d}F_{Y|X}(y \,|\, X).$$

The conditional variance of $Y | X$ is defined as

$$\mathrm{Var}[Y | X] := \mathbb{E}[(Y - \mathbb{E}[Y | X])^2 \,|\, X] = \mathbb{E}[Y^2 | X] - \mathbb{E}[Y | X]^2.$$
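To stress that $\mathbb{E}[Y|X]$ and $\mathrm{Var}[Y|X]$ are random variables (functions of $X$; see the footnote), the sketch below uses a hypothetical model chosen only for illustration, $X \sim \mathcal{U}(0, 1)$ and $Y \,|\, X = x \sim \mathcal{N}(2x, 1 + x^2)$: the conditional mean and variance change with each realization of $X$, and fixing $X = x_0$ recovers them by simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model (illustration only): X ~ U(0, 1), Y | X = x ~ N(2x, 1 + x^2)
cond_mean = lambda x: 2 * x          # E[Y | X = x]
cond_var = lambda x: 1 + x**2        # Var[Y | X = x]

# E[Y|X] and Var[Y|X] are random: they vary with each realization of X
x = rng.uniform(size=5)
print(cond_mean(x))                  # five different realizations of E[Y|X]

# Fixing X = x0 and simulating Y recovers the conditional mean and variance
x0 = 0.4
y = rng.normal(loc=cond_mean(x0), scale=np.sqrt(cond_var(x0)), size=100_000)
print(y.mean(), cond_mean(x0))       # ≈ 0.8
print(y.var(), cond_var(x0))         # ≈ 1.16
```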

Proposition 1.1 (Laws of total expectation and variance) Let $X$ and $Y$ be two random variables.

  • Total expectation: if $\mathbb{E}[|Y|] < \infty$, then $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]]$.
  • Total variance: if $\mathbb{E}[Y^2] < \infty$, then $\mathrm{Var}[Y] = \mathbb{E}[\mathrm{Var}[Y|X]] + \mathrm{Var}[\mathbb{E}[Y|X]]$.
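Both identities can be checked by simulation. A minimal sketch under the same hypothetical model as above ($X \sim \mathcal{U}(0, 1)$, $Y \,|\, X \sim \mathcal{N}(2X, 1 + X^2)$), for which $\mathbb{E}[\mathbb{E}[Y|X]] = 1$ and $\mathbb{E}[\mathrm{Var}[Y|X]] + \mathrm{Var}[\mathbb{E}[Y|X]] = 4/3 + 1/3 = 5/3$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# X ~ U(0, 1), Y | X ~ N(2X, 1 + X^2)  (same hypothetical model as above)
x = rng.uniform(size=n)
y = rng.normal(loc=2 * x, scale=np.sqrt(1 + x**2))

# Law of total expectation: E[Y] = E[E[Y|X]] = E[2X] = 1
print(y.mean())                                       # ≈ 1

# Law of total variance: Var[Y] = E[Var[Y|X]] + Var[E[Y|X]] = 4/3 + 1/3
print(y.var(), np.mean(1 + x**2) + np.var(2 * x))     # both ≈ 1.667
```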

Exercise 1.1 Prove the law of total variance from the law of total expectation.

We conclude with some useful inequalities.

Proposition 1.2 (Markov's inequality) Let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then

$$\mathbb{P}[X > t] \leq \frac{\mathbb{E}[X]}{t}, \quad \forall t > 0.$$
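A quick numerical sanity check of the bound, with the arbitrary choice $X \sim \mathrm{Exp}(1)$ (so $\mathbb{E}[X] = 1$ and $\mathbb{P}[X > t] = e^{-t}$):

```python
import numpy as np
from scipy import stats

t = np.array([0.5, 1.0, 2.0, 4.0])

# X ~ Exp(1): exact tail probability vs. Markov bound E[X] / t
lhs = stats.expon.sf(t)        # P[X > t] = exp(-t)
rhs = 1.0 / t                  # Markov bound with E[X] = 1
print(np.all(lhs <= rhs))      # True: the bound holds (loosely, for large t)
```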

Proposition 1.3 (Chebyshev's inequality) Let $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{Var}[X]$. Then

$$\mathbb{P}[|X - \mu| \geq t] \leq \frac{\sigma^2}{t^2}, \quad \forall t > 0.$$
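The analogous check for Chebyshev's inequality, with the arbitrary choice $X \sim \Gamma(2, 1)$, for which $\mu = 2$ and $\sigma^2 = 2$:

```python
import numpy as np
from scipy import stats

mu, sigma2 = 2.0, 2.0          # mean and variance of Gamma(shape=2, scale=1)
t = np.array([1.0, 2.0, 3.0, 5.0])

# Exact P[|X - mu| >= t] = P[X <= mu - t] + P[X >= mu + t]
lhs = stats.gamma.cdf(mu - t, a=2) + stats.gamma.sf(mu + t, a=2)
rhs = sigma2 / t**2            # Chebyshev bound
print(np.all(lhs <= rhs))      # True
```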

Exercise 1.2 Prove Markov's inequality using $X = X 1_{\{X > t\}} + X 1_{\{X \leq t\}}$. Then prove Chebyshev's inequality using Markov's. Hint: use the random variable $(X - \mathbb{E}[X])^2$.

Proposition 1.4 (Cauchy–Schwarz inequality) Let $X$ and $Y$ be such that $\mathbb{E}[X^2] < \infty$ and $\mathbb{E}[Y^2] < \infty$. Then $\mathbb{E}[|XY|] \leq \sqrt{\mathbb{E}[X^2]\, \mathbb{E}[Y^2]}$.
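A simulation check of the inequality with a hypothetical correlated pair (the construction $Y = X + \varepsilon$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Hypothetical correlated pair: X ~ N(0, 1), Y = X + independent N(0, 1) noise
x = rng.normal(size=n)
y = x + rng.normal(size=n)

lhs = np.mean(np.abs(x * y))                      # E[|XY|]
rhs = np.sqrt(np.mean(x**2) * np.mean(y**2))      # sqrt(E[X^2] E[Y^2])
print(lhs, rhs, lhs <= rhs)                       # the bound holds
```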

Proposition 1.5 (Jensen's inequality) If $g$ is a convex function, then $g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]$.

Example 1.1 Jensen's inequality has interesting consequences. For example:

  • Take $h = -g$. Then $h$ is a concave function and we have that $h(\mathbb{E}[X]) \geq \mathbb{E}[h(X)]$.
  • Take $g(x) = x^r$ for $r \geq 1$ (and $X \geq 0$). Then $\mathbb{E}[X]^r \leq \mathbb{E}[X^r]$. If $0 < r < 1$, then $x^r$ is concave and $\mathbb{E}[X]^r \geq \mathbb{E}[X^r]$.
  • Consider $0 < r \leq s$. Then $g(x) = x^{r/s}$ is concave and $g(\mathbb{E}[|X|^s]) \geq \mathbb{E}[g(|X|^s)] = \mathbb{E}[|X|^r]$. As a consequence, $\mathbb{E}[|X|^s] < \infty$ implies $\mathbb{E}[|X|^r] < \infty$ for $0 < r \leq s$: finite moments of higher order imply finite moments of lower order.
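A small numerical check of the last point (the Student's $t_5$ distribution and the exponents $r = 1.5$, $s = 3$ are arbitrary choices with the required finite moments):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_t(df=5, size=1_000_000)    # arbitrary choice with finite moments up to order < 5

# Third point above: E[|X|^r] <= (E[|X|^s])^{r/s} for 0 < r <= s
r, s = 1.5, 3.0
lhs = np.mean(np.abs(x)**r)
rhs = np.mean(np.abs(x)**s)**(r / s)
print(lhs, rhs, lhs <= rhs)                 # the bound holds
```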

  1. Recall that the $X$-part is random!