1.1 Basic probability review
A triple \((\Omega,\mathcal{A},\mathbb{P})\) is called a probability space. \(\Omega\) represents the sample space, the set of all possible individual outcomes of a random experiment. \(\mathcal{A}\) is a \(\sigma\)-algebra, a class of subsets of \(\Omega\) that is closed under complementation and countable unions, and such that \(\Omega\in\mathcal{A}.\) \(\mathcal{A}\) represents the collection of possible events (combinations of individual outcomes) that are assigned a probability by the probability measure \(\mathbb{P}.\) A random variable is a map \(X:\Omega\longrightarrow\mathbb{R}\) such that \(\{\omega\in\Omega:X(\omega)\leq x\}\in\mathcal{A}\) for all \(x\in\mathbb{R}\) (the set is measurable).
The cumulative distribution function (cdf) of a random variable \(X\) is \(F(x):=\mathbb{P}[X\leq x].\) When an independent and identically distributed (iid) sample \(X_1,\ldots,X_n\) is given, the cdf can be estimated by the empirical cumulative distribution function (ecdf)
\[\begin{align} F_n(x)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i\leq x\}}, \tag{1.1} \end{align}\]
where \(\mathbb{1}_A:=\begin{cases}1,&A\text{ is true},\\0,&A\text{ is false}\end{cases}\) is an indicator function. Continuous random variables are characterized either by the cdf \(F\) or by the probability density function (pdf) \(f=F',\) which represents the infinitesimal relative probability of \(X\) per unit of length. We write \(X\sim F\) (or \(X\sim f\)) to denote that \(X\) has a cdf \(F\) (or a pdf \(f\)). If two random variables \(X\) and \(Y\) have the same distribution, we write \(X\stackrel{d}{=}Y.\)
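As a quick illustration of (1.1), the ecdf can be evaluated directly from a sample. The following is a minimal sketch in Python; the standard normal sample and the helper `ecdf` are purely illustrative choices, not part of the text:

```python
# Minimal sketch: evaluating the ecdf (1.1) of an iid sample.
import numpy as np

rng = np.random.default_rng(42)
x_sample = rng.normal(size=100)  # illustrative iid sample X_1, ..., X_n

def ecdf(sample, x):
    """F_n(x) = (1/n) * sum_i 1{X_i <= x}, evaluated at one or several points x."""
    sample = np.asarray(sample)
    x = np.atleast_1d(x)
    return np.mean(sample[None, :] <= x[:, None], axis=1)

# For a N(0, 1) sample these values approximate the standard normal cdf
print(ecdf(x_sample, [-1.0, 0.0, 1.0]))
```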
The expectation operator is constructed using the Lebesgue–Stieltjes “\(\,\mathrm{d}F(x)\)” integral. Hence, for \(X\sim F,\) the expectation of \(g(X)\) is
\[\begin{align*} \mathbb{E}[g(X)]:=&\,\int g(x)\,\mathrm{d}F(x)\\ =&\, \begin{cases} \int g(x)f(x)\,\mathrm{d}x,&X\text{ continuous,}\\\sum_{\{i:\mathbb{P}[X=x_i]>0\}} g(x_i)\mathbb{P}[X=x_i],&X\text{ discrete.} \end{cases} \end{align*}\]
Unless otherwise stated, the integration limits of any integral are \(\mathbb{R}\) or \(\mathbb{R}^p.\) The variance operator is defined as \(\mathbb{V}\mathrm{ar}[X]:=\mathbb{E}[(X-\mathbb{E}[X])^2].\)
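As a numerical illustration, \(\mathbb{E}[g(X)]\) can be approximated both by computing \(\int g(x)f(x)\,\mathrm{d}x\) and by a Monte Carlo average. The sketch below assumes, purely for illustration, that \(X\sim\mathcal{N}(0,1)\) and \(g(x)=x^2,\) so both approximations should be close to \(\mathbb{V}\mathrm{ar}[X]=1:\)

```python
# Minimal sketch: E[g(X)] by numerical integration and by Monte Carlo.
import numpy as np
from scipy import integrate, stats

g = lambda x: x**2          # illustrative g
f = stats.norm.pdf          # pdf of the illustrative X ~ N(0, 1)

# E[g(X)] = ∫ g(x) f(x) dx over R
expectation, _ = integrate.quad(lambda x: g(x) * f(x), -np.inf, np.inf)

# Monte Carlo approximation: average of g over a large iid sample
rng = np.random.default_rng(1)
mc = np.mean(g(rng.normal(size=100_000)))

print(expectation, mc)  # both close to 1
```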
We employ bold face to denote vectors (assumed to be column matrices) and matrices. A \(p\)-random vector is a map \(\mathbf{X}:\Omega\longrightarrow\mathbb{R}^p,\) \(\mathbf{X}(\omega):=(X_1(\omega),\ldots,X_p(\omega)),\) such that each \(X_i\) is a random variable. The (joint) cdf of \(\mathbf{X}\) is \(F(\mathbf{x}):=\mathbb{P}[\mathbf{X}\leq \mathbf{x}]:=\mathbb{P}[X_1\leq x_1,\ldots,X_p\leq x_p]\) and, if \(\mathbf{X}\) is continuous, its (joint) pdf is \(f:=\frac{\partial^p}{\partial x_1\cdots\partial x_p}F.\) The marginals of \(F\) and \(f\) are the cdf and pdf of \(X_i,\) \(i=1,\ldots,p,\) respectively. They are defined as:
\[\begin{align*} F_{X_i}(x_i)&:=\mathbb{P}[X_i\leq x_i]=\lim_{x_j\to+\infty,\, j\neq i} F(\mathbf{x}),\\ f_{X_i}(x_i)&:=\frac{\mathrm{d}}{\mathrm{d}x_i}F_{X_i}(x_i)=\int_{\mathbb{R}^{p-1}} f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-i}, \end{align*}\]
where \(\mathbf{x}_{-i}:=(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_p).\) These definitions extend analogously to the marginal cdfs and pdfs of other subsets of \(\mathbf{X}.\)
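The marginal pdf can be obtained by numerically integrating the joint pdf over the remaining variables. The sketch below assumes, only for illustration, a bivariate normal joint pdf (\(p=2,\) correlation \(\rho=0.5\)), whose known \(\mathcal{N}(0,1)\) marginal serves as a benchmark:

```python
# Minimal sketch: f_{X_1}(x_1) = ∫ f(x_1, x_2) dx_2 for an illustrative joint pdf.
import numpy as np
from scipy import integrate, stats

rho = 0.5
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

def marginal_pdf_x1(x1):
    """Integrate the joint pdf over x_2 to obtain the marginal pdf of X_1."""
    value, _ = integrate.quad(lambda x2: joint.pdf([x1, x2]), -np.inf, np.inf)
    return value

# Compare with the exact N(0, 1) marginal
print(marginal_pdf_x1(0.7), stats.norm.pdf(0.7))
```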
The conditional cdf and pdf of \(X_1\vert(X_2,\ldots,X_p)\) are defined, respectively, as
\[\begin{align*} F_{X_1\vert \mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1)&:=\mathbb{P}[X_1\leq x_1\vert \mathbf{X}_{-1}=\mathbf{x}_{-1}],\\ f_{X_1\vert \mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1)&:=\frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}. \end{align*}\]
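The ratio definition of the conditional pdf can also be inspected numerically. Continuing with the illustrative bivariate normal from above (an assumption made only for this sketch), the code builds \(f_{X_1\vert X_2=x_2}\) as \(f(x_1,x_2)/f_{X_2}(x_2)\) and checks that it integrates to one and matches the known \(\mathcal{N}(\rho x_2,1-\rho^2)\) conditional density:

```python
# Minimal sketch: conditional pdf as the ratio f(x_1, x_2) / f_{X_2}(x_2).
import numpy as np
from scipy import integrate, stats

rho, x2 = 0.5, 1.0
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

def conditional_pdf(x1):
    """f_{X_1 | X_2 = x_2}(x_1) = f(x_1, x_2) / f_{X_2}(x_2); here f_{X_2} is N(0, 1)."""
    return joint.pdf([x1, x2]) / stats.norm.pdf(x2)

total, _ = integrate.quad(conditional_pdf, -np.inf, np.inf)
print(total)  # ≈ 1, as a density should
print(conditional_pdf(0.3),
      stats.norm.pdf(0.3, loc=rho * x2, scale=np.sqrt(1 - rho**2)))  # closed form
```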
The conditional expectation of \(Y\vert X\) is the following random variable (recall that the \(X\)-part is random!):
\[\begin{align*} \mathbb{E}[Y\vert X]:=\int y \,\mathrm{d}F_{Y\vert X}(y\vert X). \end{align*}\]
The conditional variance of \(Y|X\) is defined as
\[\begin{align*} \mathbb{V}\mathrm{ar}[Y\vert X]:=\mathbb{E}[(Y-\mathbb{E}[Y\vert X])^2\vert X]=\mathbb{E}[Y^2\vert X]-\mathbb{E}[Y\vert X]^2. \end{align*}\]
Proposition 1.1 (Laws of total expectation and variance) Let \(X\) and \(Y\) be two random variables.
- Total expectation: if \(\mathbb{E}[|Y|]<\infty,\) then \(\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}[Y\vert X]].\)
- Total variance: if \(\mathbb{E}[Y^2]<\infty,\) then \(\mathbb{V}\mathrm{ar}[Y]=\mathbb{E}[\mathbb{V}\mathrm{ar}[Y\vert X]]+\mathbb{V}\mathrm{ar}[\mathbb{E}[Y\vert X]].\)
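Both laws can be checked by simulation. The sketch below assumes, only for illustration, the hierarchical model \(X\sim\mathcal{N}(0,1)\) and \(Y\vert X\sim\mathcal{N}(X,1),\) for which \(\mathbb{E}[Y\vert X]=X\) and \(\mathbb{V}\mathrm{ar}[Y\vert X]=1:\)

```python
# Minimal sketch: Monte Carlo check of the laws of total expectation and variance.
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
x = rng.normal(size=n)                 # X ~ N(0, 1)
y = rng.normal(loc=x, scale=1.0)       # Y | X = x ~ N(x, 1)

# Total expectation: E[Y] = E[E[Y | X]] = E[X] = 0
print(np.mean(y), np.mean(x))

# Total variance: Var[Y] = E[Var[Y | X]] + Var[E[Y | X]] = 1 + Var[X] = 2
print(np.var(y), 1.0 + np.var(x))
```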
Exercise 1.1 Prove the law of total variance from the law of total expectation.
We conclude with some useful inequalities.
Proposition 1.2 (Markov's inequality) Let \(X\) be a non-negative random variable with \(\mathbb{E}[X]<\infty.\) Then
\[\begin{align*} \mathbb{P}[X>t]\leq\frac{\mathbb{E}[X]}{t}, \quad\forall t>0. \end{align*}\]
Proposition 1.3 (Chebyshev's inequality) Let \(X\) be a random variable with \(\mu:=\mathbb{E}[X]\) and \(\sigma^2:=\mathbb{V}\mathrm{ar}[X]<\infty.\) Then
\[\begin{align*} \mathbb{P}[|X-\mu|\geq t]\leq\frac{\sigma^2}{t^2},\quad \forall t>0. \end{align*}\]
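Both inequalities are easy to inspect by simulation. The sketch below assumes, for illustration, \(X\sim\mathrm{Exp}(1),\) for which \(\mathbb{E}[X]=\mathbb{V}\mathrm{ar}[X]=1:\)

```python
# Minimal sketch: empirical check of Markov's and Chebyshev's bounds.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # illustrative X ~ Exp(1)
mu, sigma2 = 1.0, 1.0                           # E[X] and Var[X] for Exp(1)

for t in (1.0, 2.0, 4.0):
    markov_lhs = np.mean(x > t)                    # P[X > t]
    chebyshev_lhs = np.mean(np.abs(x - mu) >= t)   # P[|X - mu| >= t]
    print(t, markov_lhs, "<=", mu / t, "|", chebyshev_lhs, "<=", sigma2 / t**2)
```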
Exercise 1.2 Prove Markov’s inequality using \(X=X\mathbb{1}_{\{X>t\}}+X\mathbb{1}_{\{X\leq t\}}.\) Then prove Chebyshev’s inequality using Markov’s. Hint: use the random variable \((X-\mathbb{E}[X])^2.\)
Proposition 1.4 (Cauchy–Schwarz inequality) Let \(X\) and \(Y\) be such that \(\mathbb{E}[X^2]<\infty\) and \(\mathbb{E}[Y^2]<\infty.\) Then \(\mathbb{E}[|XY|]\leq\sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}.\)
Proposition 1.5 (Jensen's inequality) If \(g\) is a convex function, then \(g(\mathbb{E}[X])\leq\mathbb{E}[g(X)].\)
Example 1.1 Jensen’s inequality has interesting consequences. For example:
- Take \(h=-g.\) Then \(h\) is a concave function and we have that \(h(\mathbb{E}[X])\geq\mathbb{E}[h(X)].\)
- Take \(g(x)=x^r\) for \(r\geq 1\) and \(X\geq0.\) Then \(\mathbb{E}[X]^r\leq \mathbb{E}[X^r].\) If \(0<r<1,\) then \(g\) is concave and \(\mathbb{E}[X]^r\geq \mathbb{E}[X^r].\)
- Consider \(0<r\leq s.\) Then \(g(x)=x^{r/s}\) is concave and \(g(\mathbb{E}[|X|^s])\geq \mathbb{E}[g(|X|^s)]=\mathbb{E}[|X|^r].\) As a consequence, \(\mathbb{E}[|X|^s]<\infty\implies\mathbb{E}[|X|^r]<\infty\) for \(0<r\leq s.\) Finite moments of higher order imply finite moments of lower order.
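A quick numerical illustration of the first two points, again assuming \(X\sim\mathrm{Exp}(1)\) purely for concreteness:

```python
# Minimal sketch: Jensen's inequality for a convex and a concave transformation.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)  # illustrative X ~ Exp(1), X >= 0

print(np.mean(x)**2, "<=", np.mean(x**2))               # g(x) = x^2 is convex
print(np.sqrt(np.mean(x)), ">=", np.mean(np.sqrt(x)))   # h(x) = sqrt(x) is concave
```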