1.1 Basic probability review
A triple (Ω,A,P) is called a probability space. Ω represents the sample space, the set of all possible individual outcomes of a random experiment. A is a σ-algebra, a class of subsets of Ω that contains Ω and is closed under complementation and countable unions. A represents the collection of possible events (combinations of individual outcomes) that are assigned a probability by the probability measure P. A random variable is a map X:Ω⟶R such that {ω∈Ω:X(ω)≤x}∈A for all x∈R (i.e., the set is measurable).
The cumulative distribution function (cdf) of a random variable X is F(x):=P[X≤x]. When an independent and identically distributed (iid) sample X1,…,Xn is given, the cdf can be estimated by the empirical distribution function (ecdf)
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{X_i \le x\}},$$
where $1_A := \begin{cases} 1, & A\ \text{is true},\\ 0, & A\ \text{is false},\end{cases}$ is an indicator function. Continuous random variables are characterized either by the cdf F or by the probability density function (pdf) f=F′, which represents the infinitesimal relative probability of X per unit of length. We write X∼F (or X∼f) to denote that X has a cdf F (or a pdf f). If two random variables X and Y have the same distribution, we write $X \stackrel{d}{=} Y$.
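As a quick illustration (a minimal sketch, not part of the original notes; Python and the simulated normal sample are assumptions), the ecdf can be evaluated on a grid as follows:

```python
import numpy as np

rng = np.random.default_rng(42)
x_sample = rng.normal(size=100)  # iid sample X_1, ..., X_n (illustrative choice)

def ecdf(sample, x):
    """Evaluate F_n(x) = (1/n) * sum_i 1{X_i <= x} at the points in x."""
    sample = np.asarray(sample)
    x = np.atleast_1d(x)
    return np.mean(sample[:, None] <= x[None, :], axis=0)

print(ecdf(x_sample, np.linspace(-3, 3, 7)))  # increases from near 0 to near 1
```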
The expectation operator is constructed using the Lebesgue–Stieltjes “dF(x)” integral. Hence, for X∼F, the expectation of g(X) is
$$\mathbb{E}[g(X)] := \int g(x)\,\mathrm{d}F(x) = \begin{cases} \int g(x) f(x)\,\mathrm{d}x, & X\ \text{continuous},\\ \sum_{\{i:\,\mathbb{P}[X=x_i]>0\}} g(x_i)\,\mathbb{P}[X=x_i], & X\ \text{discrete}. \end{cases}$$
Unless otherwise stated, the integration limits of any integral are $\mathbb{R}$ or $\mathbb{R}^p$. The variance operator is defined as $\mathrm{Var}[X] := \mathbb{E}[(X - \mathbb{E}[X])^2]$.
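For instance (an illustrative check that is not part of the original text, with Python assumed), sample means approximate these operators: for X ∼ N(0,1) and g(x) = x², both E[g(X)] and Var[X] equal 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10**6)           # X ~ N(0, 1)

print(np.mean(x**2))                 # Monte Carlo estimate of E[g(X)] for g(x) = x^2 (= 1)
print(np.mean((x - np.mean(x))**2))  # estimate of Var[X] = E[(X - E[X])^2] (= 1)
```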
We employ boldface to denote vectors (assumed to be column matrices) and matrices. A p-random vector is a map $\mathbf{X}:\Omega \longrightarrow \mathbb{R}^p$, $\mathbf{X}(\omega) := (X_1(\omega),\ldots,X_p(\omega))$, such that each $X_i$ is a random variable. The (joint) cdf of $\mathbf{X}$ is $F(\mathbf{x}) := \mathbb{P}[\mathbf{X} \le \mathbf{x}] := \mathbb{P}[X_1 \le x_1, \ldots, X_p \le x_p]$ and, if $\mathbf{X}$ is continuous, its (joint) pdf is $f := \frac{\partial^p F}{\partial x_1 \cdots \partial x_p}$. The marginals of $F$ and $f$ are the cdf and pdf of $X_i$, $i=1,\ldots,p$, respectively. They are defined as:
$$F_{X_i}(x_i) := \mathbb{P}[X_i \le x_i] = \lim_{x_j \to +\infty,\, j \ne i} F(\mathbf{x}), \qquad f_{X_i}(x_i) := \frac{\partial}{\partial x_i} F_{X_i}(x_i) = \int_{\mathbb{R}^{p-1}} f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-i},$$
where $\mathbf{x}_{-i} := (x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_p)$. The definitions extend analogously to the marginal cdfs and pdfs of other subsets of $\mathbf{X}$.
The conditional cdf and pdf of $X_1 | (X_2,\ldots,X_p)$ are defined, respectively, as
$$F_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1) := \mathbb{P}[X_1 \le x_1 | \mathbf{X}_{-1} = \mathbf{x}_{-1}], \qquad f_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1) := \frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.$$
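As a worked illustration (an example added here, not taken from the original text), let p = 2 and consider the joint pdf $f(x_1,x_2) = x_1 + x_2$ on $[0,1]^2$. The previous definitions give

$$f_{X_2}(x_2) = \int_0^1 (x_1 + x_2)\,\mathrm{d}x_1 = \tfrac{1}{2} + x_2, \qquad f_{X_1|X_2=x_2}(x_1) = \frac{x_1 + x_2}{\tfrac{1}{2} + x_2}, \quad x_1, x_2 \in [0,1].$$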
The conditional expectation of Y|X is the following random variable²:
$$\mathbb{E}[Y|X] := \int y\,\mathrm{d}F_{Y|X}(y|X).$$
The conditional variance of Y|X is defined as
$$\mathrm{Var}[Y|X] := \mathbb{E}[(Y - \mathbb{E}[Y|X])^2|X] = \mathbb{E}[Y^2|X] - \mathbb{E}[Y|X]^2.$$
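Continuing the illustrative joint pdf $f(x_1,x_2) = x_1 + x_2$ on $[0,1]^2$ introduced above (again, not from the original text), the conditional expectation of $X_1$ given $X_2$ is

$$\mathbb{E}[X_1|X_2] = \int_0^1 x_1\, \frac{x_1 + X_2}{\tfrac{1}{2} + X_2}\,\mathrm{d}x_1 = \frac{\tfrac{1}{3} + \tfrac{1}{2}X_2}{\tfrac{1}{2} + X_2},$$

which is a function of the random variable $X_2$ and hence itself a random variable, not a number; the conditional variance follows analogously.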
Proposition 1.1 (Laws of total expectation and variance) Let X and Y be two random variables.
- Total expectation: if E[|Y|]<∞, then E[Y]=E[E[Y|X]].
- Total variance: if E[Y²]<∞, then Var[Y]=E[Var[Y|X]]+Var[E[Y|X]].
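Both laws can be checked quickly by simulation. The sketch below is illustrative and not part of the original text: the model X ∼ Unif(0,1), Y|X ∼ N(X, (1+X)²) is an assumption chosen so that E[Y] = E[X] = 1/2 and Var[Y] = E[(1+X)²] + Var[X] = 7/3 + 1/12 ≈ 2.417.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.uniform(size=n)             # X ~ Unif(0, 1)
y = rng.normal(loc=x, scale=1 + x)  # Y | X = x ~ N(x, (1 + x)^2)

print(np.mean(y))  # ~ 0.5, matching E[Y] = E[E[Y|X]] = E[X]
print(np.var(y))   # ~ 2.417, matching E[Var[Y|X]] + Var[E[Y|X]]
```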
Exercise 1.1 Prove the law of total variance from the law of total expectation.
We conclude with some useful inequalities.
Proposition 1.2 (Markov's inequality) Let X be a non-negative random variable with E[X]<∞. Then
$$\mathbb{P}[X > t] \le \frac{\mathbb{E}[X]}{t}, \quad \forall t > 0.$$
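For a concrete check (an illustrative example, not in the original text), take X ∼ Exp(1), so that E[X] = 1 and the exact tail probability can be compared with the bound:

$$\mathbb{P}[X > t] = e^{-t} \le \frac{1}{t} = \frac{\mathbb{E}[X]}{t}, \quad \forall t > 0,$$

since $t \le e^t$ for every $t > 0$; the bound is only informative for $t > 1$.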
Proposition 1.3 (Chebyshev's inequality) Let $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{Var}[X]$. Then
$$\mathbb{P}[|X - \mu| \ge t] \le \frac{\sigma^2}{t^2}, \quad \forall t > 0.$$
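The following sketch (Python assumed; the Exp(1) sample is an illustrative choice, not from the original text) compares the empirical tail probability with Chebyshev's bound for a few values of t:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=10**6)  # X ~ Exp(1): mu = 1, sigma^2 = 1
mu, sigma2 = 1.0, 1.0

for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= t)  # estimate of P[|X - mu| >= t]
    print(t, empirical, sigma2 / t**2)        # empirical tail vs. Chebyshev's bound
```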
Exercise 1.2 Prove Markov’s inequality using $X = X 1_{\{X > t\}} + X 1_{\{X \le t\}}$. Then prove Chebyshev’s inequality using Markov’s. Hint: use the random variable $(X - \mathbb{E}[X])^2$.
Proposition 1.4 (Cauchy–Schwarz inequality) Let X and Y be such that $\mathbb{E}[X^2] < \infty$ and $\mathbb{E}[Y^2] < \infty$. Then $\mathbb{E}[|XY|] \le \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}$.
Proposition 1.5 (Jensen's inequality) If g is a convex function, then g(E[X])≤E[g(X)].
Example 1.1 Jensen’s inequality has several interesting consequences. For instance:
- Take h=−g. Then h is a concave function and we have that h(E[X])≥E[h(X)].
- Take $g(x)=x^r$ for $r \ge 1$ and $X$ non-negative. Then $\mathbb{E}[X]^r \le \mathbb{E}[X^r]$. If $0<r<1$, then $g$ is concave and $\mathbb{E}[X]^r \ge \mathbb{E}[X^r]$.
- Consider $0 \le r \le s$. Then $g(x)=x^{r/s}$ is concave on $[0,\infty)$ and, by the concave version of Jensen's inequality, $g(\mathbb{E}[|X|^s]) \ge \mathbb{E}[g(|X|^s)] = \mathbb{E}[|X|^r]$. As a consequence, $\mathbb{E}[|X|^s] < \infty \implies \mathbb{E}[|X|^r] < \infty$ for $0 \le r \le s$: finite moments of higher order imply finite moments of lower order.
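For a concrete numerical check of these derivations (an added illustration, not in the original text), take X ∼ Unif(0,1):

$$\mathbb{E}[X]^2 = \tfrac{1}{4} \le \tfrac{1}{3} = \mathbb{E}[X^2], \qquad \sqrt{\mathbb{E}[X]} = \tfrac{1}{\sqrt{2}} \approx 0.707 \ge \tfrac{2}{3} = \mathbb{E}[\sqrt{X}],$$

in agreement with the convexity of $x^2$ (r = 2) and the concavity of $\sqrt{x}$ (r = 1/2).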
² Recall that the X-part is random!