## 1.1 Basic probability review

A triple $$(\Omega,\mathcal{A},\mathbb{P})$$ is called a probability space. $$\Omega$$ represents the sample space, the set of all possible individual outcomes of a random experiment. $$\mathcal{A}$$ is a $$\sigma$$-algebra, a class of subsets of $$\Omega$$ that is closed under complementation and countable unions, and such that $$\Omega\in\mathcal{A}$$. $$\mathcal{A}$$ represents the collection of possible events (combinations of individual outcomes) that are assigned a probability by the probability measure $$\mathbb{P}$$. A random variable is a map $$X:\Omega\longrightarrow\mathbb{R}$$ such that $$\{\omega\in\Omega:X(\omega)\leq x\}\in\mathcal{A}$$ for all $$x\in\mathbb{R}$$ (i.e., the set is measurable).

The cumulative distribution function (cdf) of a random variable $$X$$ is $$F(x):=\mathbb{P}[X\leq x]$$. When an independent and identically distributed (iid) sample $$X_1,\ldots,X_n$$ is given, the cdf can be estimated by the empirical distribution function (ecdf)

\begin{align} F_n(x)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i\leq x\}}, \tag{1.1} \end{align}

where $$\mathbb{1}_A:=\begin{cases}1,&A\text{ is true},\\0,&A\text{ is false}\end{cases}$$ is an indicator function. A continuous random variable is characterized by either its cdf $$F$$ or its probability density function (pdf) $$f=F'$$, which represents the infinitesimal relative probability of $$X$$ per unit of length. We write $$X\sim F$$ (or $$X\sim f$$) to denote that $$X$$ has cdf $$F$$ (or pdf $$f$$). If two random variables $$X$$ and $$Y$$ have the same distribution, we write $$X\stackrel{d}{=}Y$$.
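The ecdf in (1.1) is just a sample average of indicators, so it is straightforward to compute. A minimal sketch in Python (the standard normal sample and the seed are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)  # an iid N(0, 1) sample, assumed for illustration

def ecdf(sample, t):
    # F_n(t) = (1/n) * sum_i 1{X_i <= t}: an average of indicators
    return np.mean(sample <= t)

# F_n(t) approximates the true cdf; here, Phi(0) = 0.5
print(ecdf(x, 0.0))
```

Note that `sample <= t` produces a boolean array whose mean is exactly the proportion of observations at or below `t`.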

The expectation operator is constructed using the Lebesgue–Stieltjes “$$\,\mathrm{d}F(x)$$” integral. Hence, for $$X\sim F$$, the expectation of $$g(X)$$ is

\begin{align*} \mathbb{E}[g(X)]:=&\,\int g(x)\,\mathrm{d}F(x)\\ =&\, \begin{cases} \int g(x)f(x)\,\mathrm{d}x,&X\text{ continuous,}\\\sum_{\{i:\mathbb{P}[X=x_i]>0\}} g(x_i)\mathbb{P}[X=x_i],&X\text{ discrete.} \end{cases} \end{align*}

Unless otherwise stated, integrals are taken over $$\mathbb{R}$$ or $$\mathbb{R}^p$$. The variance operator is defined as $$\mathbb{V}\mathrm{ar}[X]:=\mathbb{E}[(X-\mathbb{E}[X])^2]$$.
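For a continuous $$X$$, the expectation reduces to the integral $$\int g(x)f(x)\,\mathrm{d}x$$, which can be approximated numerically. A sketch assuming a standard normal density, a truncated grid, and the composite trapezoidal rule (all illustrative choices):

```python
import numpy as np

def f(t):
    # standard normal pdf, the assumed example density
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

# fine grid truncating R to [-8, 8]; the neglected tail mass is ~1e-15
grid = np.linspace(-8, 8, 200_001)

def expect(g):
    # E[g(X)] = integral of g(x) f(x) dx, via the trapezoidal rule
    y = g(grid) * f(grid)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2)

mean = expect(lambda t: t)               # E[X] = 0 for the standard normal
var = expect(lambda t: t**2) - mean**2   # Var[X] = E[X^2] - E[X]^2 = 1
```

For discrete $$X$$, the analogous computation is the weighted sum over the support points.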

We employ boldface to denote vectors (assumed to be column vectors) and matrices. A $$p$$-random vector is a map $$\mathbf{X}:\Omega\longrightarrow\mathbb{R}^p$$, $$\mathbf{X}(\omega):=(X_1(\omega),\ldots,X_p(\omega))$$, such that each $$X_i$$ is a random variable. The (joint) cdf of $$\mathbf{X}$$ is $$F(\mathbf{x}):=\mathbb{P}[\mathbf{X}\leq \mathbf{x}]:=\mathbb{P}[X_1\leq x_1,\ldots,X_p\leq x_p]$$ and, if $$\mathbf{X}$$ is continuous, its (joint) pdf is $$f:=\frac{\partial^p}{\partial x_1\cdots\partial x_p}F$$. The marginals of $$F$$ and $$f$$ are the cdf and pdf of $$X_i$$, $$i=1,\ldots,p$$, respectively. They are defined as:

\begin{align*} F_{X_i}(x_i)&:=\mathbb{P}[X_i\leq x_i]=\lim_{x_j\to\infty,\,j\neq i} F(\mathbf{x}),\\ f_{X_i}(x_i)&:=\frac{\partial}{\partial x_i}F_{X_i}(x_i)=\int_{\mathbb{R}^{p-1}} f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-i}, \end{align*}

where $$\mathbf{x}_{-i}:=(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_p)$$. The definitions extend analogously to the marginal cdfs and pdfs of other subsets of $$\mathbf{X}$$.
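The marginalization of a joint pdf can be checked numerically. A sketch assuming a standard bivariate normal with correlation $$\rho$$, whose marginals are standard normal; the correlation value, grid, and truncation are illustrative:

```python
import numpy as np

rho = 0.5  # assumed correlation of the example bivariate normal

def joint_pdf(x1, x2):
    # standard bivariate normal density with correlation rho
    z = (x1**2 - 2 * rho * x1 * x2 + x2**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

x2_grid = np.linspace(-10, 10, 40_001)  # truncated grid for x2

def marginal_pdf(x1):
    # f_{X1}(x1) = integral of f(x1, x2) dx2, via the trapezoidal rule
    y = joint_pdf(x1, x2_grid)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x2_grid)) / 2)

# the marginal should match the standard normal pdf
print(marginal_pdf(0.7), np.exp(-0.7**2 / 2) / np.sqrt(2 * np.pi))
```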

The conditional cdf and pdf of $$X_1\vert(X_2,\ldots,X_p)$$ are defined, respectively, as

\begin{align*} F_{X_1\vert \mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1)&:=\mathbb{P}[X_1\leq x_1\vert \mathbf{X}_{-1}=\mathbf{x}_{-1}],\\ f_{X_1\vert \mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1)&:=\frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}. \end{align*}
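The ratio definition of the conditional pdf can be verified against a closed form. For a standard bivariate normal with correlation $$\rho$$ (an assumed example), $$X_1\vert X_2=x_2$$ is distributed as $$\mathcal{N}(\rho x_2,1-\rho^2)$$, so the ratio $$f(x_1,x_2)/f_{X_2}(x_2)$$ must reproduce that normal density:

```python
import math

rho = 0.6  # assumed correlation of the example bivariate normal

def joint_pdf(x1, x2):
    # standard bivariate normal density with correlation rho
    z = (x1**2 - 2 * rho * x1 * x2 + x2**2) / (1 - rho**2)
    return math.exp(-z / 2) / (2 * math.pi * math.sqrt(1 - rho**2))

def norm_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) / sigma)**2 / 2) / (sigma * math.sqrt(2 * math.pi))

def cond_pdf(x1, x2):
    # f_{X1 | X2 = x2}(x1) = f(x1, x2) / f_{X2}(x2); f_{X2} is standard normal
    return joint_pdf(x1, x2) / norm_pdf(x2)

# closed form: X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2)
x1, x2 = 0.3, -1.2
print(cond_pdf(x1, x2), norm_pdf(x1, rho * x2, math.sqrt(1 - rho**2)))
```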

The conditional expectation of $$Y\vert X$$ is the following random variable1

\begin{align*} \mathbb{E}[Y\vert X]:=\int y \,\mathrm{d}F_{Y\vert X}(y\vert X). \end{align*}

The conditional variance of $$Y|X$$ is defined as

\begin{align*} \mathbb{V}\mathrm{ar}[Y\vert X]:=\mathbb{E}[(Y-\mathbb{E}[Y\vert X])^2\vert X]=\mathbb{E}[Y^2\vert X]-\mathbb{E}[Y\vert X]^2. \end{align*}

Proposition 1.1 (Laws of total expectation and variance) Let $$X$$ and $$Y$$ be two random variables.

• Total expectation: if $$\mathbb{E}[|Y|]<\infty$$, then $$\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}[Y\vert X]]$$.
• Total variance: if $$\mathbb{E}[Y^2]<\infty$$, then $$\mathbb{V}\mathrm{ar}[Y]=\mathbb{E}[\mathbb{V}\mathrm{ar}[Y\vert X]]+\mathbb{V}\mathrm{ar}[\mathbb{E}[Y\vert X]]$$.
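Both laws can be illustrated by Monte Carlo. A sketch assuming $$X\sim\mathcal{U}(0,1)$$ and $$Y\vert X\sim\mathcal{N}(X,1)$$, so that $$\mathbb{E}[Y]=\mathbb{E}[X]=1/2$$ and $$\mathbb{V}\mathrm{ar}[Y]=\mathbb{E}[1]+\mathbb{V}\mathrm{ar}[X]=1+1/12$$ (the model and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(0, 1, n)         # X ~ Unif(0, 1)
y = x + rng.normal(size=n)       # Y | X ~ N(X, 1)

# law of total expectation: E[Y] = E[E[Y|X]] = E[X] = 1/2
# law of total variance: Var[Y] = E[Var[Y|X]] + Var[E[Y|X]] = 1 + 1/12
print(y.mean(), y.var())
```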

Exercise 1.1 Prove the law of total variance from the law of total expectation.

We conclude with some useful inequalities.

Proposition 1.2 (Markov’s inequality) Let $$X$$ be a non-negative random variable with $$\mathbb{E}[X]<\infty$$. Then

\begin{align*} \mathbb{P}[X>t]\leq\frac{\mathbb{E}[X]}{t}, \quad\forall t>0. \end{align*}

Proposition 1.3 (Chebyshev’s inequality) Let $$X$$ be a random variable with $$\mu=\mathbb{E}[X]$$ and $$\sigma^2=\mathbb{V}\mathrm{ar}[X]<\infty$$. Then

\begin{align*} \mathbb{P}[|X-\mu|\geq t]\leq\frac{\sigma^2}{t^2},\quad \forall t>0. \end{align*}
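Both inequalities can be checked exactly for an $$\mathrm{Exp}(1)$$ random variable (an assumed example), for which $$\mathbb{E}[X]=\mathbb{V}\mathrm{ar}[X]=1$$ and $$\mathbb{P}[X>t]=e^{-t}$$:

```python
import math

# X ~ Exp(1): E[X] = Var[X] = 1 and P[X > t] = exp(-t)
def tail(t):
    # P[X > t]
    return math.exp(-t)

def central_tail(t):
    # P[|X - 1| >= t] = P[X >= 1 + t] + P[X <= 1 - t], using X >= 0
    p = math.exp(-(1 + t))
    if t < 1:
        p += 1 - math.exp(-(1 - t))
    return p

for t in [0.5, 1.0, 2.0, 5.0]:
    assert tail(t) <= 1 / t             # Markov: P[X > t] <= E[X] / t
    assert central_tail(t) <= 1 / t**2  # Chebyshev: P[|X - mu| >= t] <= sigma^2 / t^2
```

The bounds are loose here (e.g., at $$t=2$$, Markov gives $$0.5$$ against the true $$e^{-2}\approx 0.135$$), which is typical: both inequalities trade sharpness for generality.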

Exercise 1.2 Prove Markov’s inequality using $$X=X\mathbb{1}_{\{X>t\}}+X\mathbb{1}_{\{X\leq t\}}$$. Then prove Chebyshev’s inequality using Markov’s. Hint: use the random variable $$(X-\mathbb{E}[X])^2$$.
Proposition 1.4 (Cauchy–Schwarz inequality) Let $$X$$ and $$Y$$ be such that $$\mathbb{E}[X^2]<\infty$$ and $$\mathbb{E}[Y^2]<\infty$$. Then $$\mathbb{E}[|XY|]\leq\sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}$$.
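The Cauchy–Schwarz inequality also holds for the empirical measure of any sample, so a simulation check can never fail. A sketch with an illustrative correlated pair (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # correlated with x by construction

# empirical versions of E[|XY|] and sqrt(E[X^2] E[Y^2])
lhs = np.mean(np.abs(x * y))
rhs = np.sqrt(np.mean(x**2) * np.mean(y**2))
assert lhs <= rhs  # Cauchy-Schwarz, applied to the empirical measure
```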
Proposition 1.5 (Jensen’s inequality) If $$g$$ is a convex function, then $$g(\mathbb{E}[X])\leq\mathbb{E}[g(X)]$$.

Example 1.1 Jensen’s inequality has interesting consequences. For example:

• Take $$h=-g$$. Then $$h$$ is a concave function and we have that $$h(\mathbb{E}[X])\geq\mathbb{E}[h(X)]$$.
• Take $$g(x)=x^r$$ for $$r\geq 1$$ and $$X$$ non-negative. Then $$g$$ is convex on $$[0,\infty)$$ and $$\mathbb{E}[X]^r\leq \mathbb{E}[X^r]$$. If $$0<r<1$$, then $$x\mapsto x^r$$ is concave on $$[0,\infty)$$ and $$\mathbb{E}[X]^r\geq \mathbb{E}[X^r]$$.
• Consider $$0<r\leq s$$. Then $$g(x)=x^{r/s}$$ is concave on $$[0,\infty)$$ and $$\mathbb{E}[|X|^r]=\mathbb{E}[g(|X|^s)]\leq g(\mathbb{E}[|X|^s])=\mathbb{E}[|X|^s]^{r/s}$$. As a consequence, $$\mathbb{E}[|X|^s]<\infty\implies\mathbb{E}[|X|^r]<\infty$$ for $$0<r\leq s$$: finite moments of higher order imply finite moments of lower order.
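These consequences can be verified on any sample, since Jensen’s inequality also holds for the empirical measure. A sketch with an illustrative non-negative sample (the distribution, moment orders, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.abs(rng.normal(size=50_000))  # a non-negative sample

# Jensen with the convex g(x) = x^2: E[X]^2 <= E[X^2]
assert np.mean(x)**2 <= np.mean(x**2)

# moment ordering: E[|X|^r]^(1/r) is nondecreasing in r (here r = 1 <= s = 3)
r, s = 1.0, 3.0
assert np.mean(x**r)**(1 / r) <= np.mean(x**s)**(1 / s)
```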

1. Recall that the $$X$$-part is random!