1.3 General notation and background
We use capital letters to denote random variables, such as X, and lowercase letters, such as x, to denote deterministic values. For example, P[X=x] means “the probability that the random variable X takes the particular value x”. In predictive modeling we are concerned with the prediction or explanation of a response Y from a set of predictors X_1,\ldots,X_p. Both Y and X_1,\ldots,X_p are random variables, but we use them in a different way: our interest lies in predicting or explaining Y from X_1,\ldots,X_p. The response Y is also known as the dependent variable, and X_1,\ldots,X_p are sometimes referred to as independent variables, covariates, or explanatory variables. We will not use this terminology.
The cumulative distribution function (cdf) of a random variable X is F(x):=\mathbb{P}[X\leq x], a function that completely characterizes the randomness of X. Continuous random variables are also characterized by the probability density function (pdf) f(x)=F'(x),⁹ which represents the infinitesimal relative probability of X per unit of length. Discrete random variables, on the other hand, are characterized by the probability mass function \mathbb{P}[X=x]. We write X\sim F (or X\sim f if X is continuous) to denote that X has the cdf F (or the pdf f). If two random variables X and Y have the same distribution, we write X\stackrel{d}{=}Y.
For a random variable X\sim F, the expectation of g(X) is defined as¹⁰
\begin{align*}
\mathbb{E}[g(X)]:=\int g(x)\,\mathrm{d}F(x):=\begin{cases}
\int g(x)f(x)\,\mathrm{d}x, & \text{if } X \text{ is continuous},\\
\sum_{\{x\in\mathbb{R}:\,\mathbb{P}[X=x]>0\}}g(x)\mathbb{P}[X=x], & \text{if } X \text{ is discrete}.
\end{cases}
\end{align*}
The sign “:=” emphasizes that the left-hand side (LHS) of the equality is defined for the first time as the right-hand side (RHS). Unless otherwise stated, the integration limits of any integral are \mathbb{R} or \mathbb{R}^p. The variance is defined as \mathrm{Var}[X]:=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-\mathbb{E}[X]^2.
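Both branches of the definition can be evaluated numerically. Below is a minimal Python sketch (the choices of g, the Exp(1) and Poisson(3) distributions, and the NumPy/SciPy dependency are illustrative assumptions, not something fixed by the text):

```python
import numpy as np
from scipy import integrate, stats

# Continuous case: X ~ Exp(1) with pdf f(x) = e^{-x} on (0, inf), g(x) = x^2.
# E[g(X)] = ∫ g(x) f(x) dx; the exact value is 2.
f = lambda x: np.exp(-x)
Eg, _ = integrate.quad(lambda x: x**2 * f(x), 0, np.inf)
print(Eg)  # ≈ 2.0

# Discrete case: X ~ Poisson(3), g(x) = x.
# E[g(X)] = Σ g(x) P[X = x]; the sum is truncated where the pmf is negligible.
x = np.arange(0, 100)
print(np.sum(x * stats.poisson.pmf(x, mu=3)))  # ≈ 3.0

# Variance via the shortcut formula: Var[X] = E[X^2] - E[X]^2 = 2 - 1 = 1.
EX, _ = integrate.quad(lambda x: x * f(x), 0, np.inf)
print(Eg - EX**2)  # ≈ 1.0
```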
We employ boldface to denote vectors (assumed to be column matrices, although sometimes written in row layout), like \mathbf{a}, and matrices, like \mathbf{A}. We denote by \mathbf{A}' the transpose of \mathbf{A}. Boldfaced capitals will be used simultaneously to denote matrices and random vectors \mathbf{X}=(X_1,\ldots,X_p), which are collections of random variables X_1,\ldots,X_p. The (joint) cdf of \mathbf{X} is¹¹
\begin{align*}
F(\mathbf{x}):=\mathbb{P}[\mathbf{X}\leq\mathbf{x}]:=\mathbb{P}[X_1\leq x_1,\ldots,X_p\leq x_p]
\end{align*}
and, if \mathbf{X} is continuous, its (joint) pdf is f:=\frac{\partial^p}{\partial x_1\cdots\partial x_p}F.
The marginals of F and f are the cdf and pdf of X_j, j=1,\ldots,p, respectively. They are defined as
\begin{align*}
F_{X_j}(x_j)&:=\mathbb{P}[X_j\leq x_j]=F(\infty,\ldots,\infty,x_j,\infty,\ldots,\infty),\\
f_{X_j}(x_j)&:=\frac{\partial}{\partial x_j}F_{X_j}(x_j)=\int_{\mathbb{R}^{p-1}}f(\mathbf{x})\,\mathrm{d}\mathbf{x}_{-j},
\end{align*}
where \mathbf{x}_{-j}:=(x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_p). The definitions can be extended analogously to the marginals of the cdf and pdf of different subsets of \mathbf{X}.
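The marginalization integral can also be checked numerically. A minimal Python sketch (the joint pdf f(x_1,x_2)=4x_1x_2 on [0,1]^2 is an arbitrary example with known marginal f_{X_1}(x_1)=2x_1):

```python
import numpy as np
from scipy import integrate

# An example joint pdf on [0, 1]^2: f(x1, x2) = 4 * x1 * x2.
f = lambda x1, x2: 4 * x1 * x2

# Marginal pdf of X1: integrate the joint pdf over x2 (the support is [0, 1]).
def f_X1(x1):
    val, _ = integrate.quad(lambda x2: f(x1, x2), 0, 1)
    return val

print(f_X1(0.5), 2 * 0.5)  # both ≈ 1.0, since f_{X1}(x1) = 2 * x1
```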
The conditional cdf and pdf of X_1|(X_2,\ldots,X_p) are defined, respectively, as
\begin{align*}
F_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1)&:=\mathbb{P}[X_1\leq x_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}],\\
f_{X_1|\mathbf{X}_{-1}=\mathbf{x}_{-1}}(x_1)&:=\frac{f(\mathbf{x})}{f_{\mathbf{X}_{-1}}(\mathbf{x}_{-1})}.
\end{align*}
The conditional expectation of Y|X is the following random variable¹²
\begin{align*}
\mathbb{E}[Y|X]:=\int y\,\mathrm{d}F_{Y|X}(y|X).
\end{align*}
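The footnote's distinction between the random \mathbb{E}[Y|X] and the deterministic \mathbb{E}[Y|X=x] can be made tangible by simulation. A minimal Python sketch (the model X\sim\mathcal{N}(0,1), Y|X=x\sim\mathcal{N}(x,1) is an illustrative assumption, under which \mathbb{E}[Y|X]=X):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
X = rng.normal(0, 1, n)
Y = rng.normal(X, 1)  # Y | X = x ~ N(x, 1), hence E[Y | X] = X, a random variable

# E[Y | X = x] at the fixed value x = 1 is a number (here, exactly 1).
# Approximate it by averaging Y over observations with X close to x.
x, h = 1.0, 0.05
print(Y[np.abs(X - x) < h].mean())  # ≈ 1.0
```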
For two random variables X_1 and X_2, the covariance between them is defined as
\begin{align*}
\mathrm{Cov}[X_1,X_2]:=\mathbb{E}[(X_1-\mathbb{E}[X_1])(X_2-\mathbb{E}[X_2])]=\mathbb{E}[X_1X_2]-\mathbb{E}[X_1]\mathbb{E}[X_2],
\end{align*}
and the correlation between them is defined as
\begin{align*}
\mathrm{Cor}[X_1,X_2]:=\frac{\mathrm{Cov}[X_1,X_2]}{\sqrt{\mathrm{Var}[X_1]\mathrm{Var}[X_2]}}.
\end{align*}
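The sample versions of these two quantities are readily available in NumPy, which gives a quick check of the definitions (a sketch; the bivariate normal with correlation 0.75 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
# Sample from a bivariate normal with unit variances and Cov[X1, X2] = 0.75.
Sigma = np.array([[1.0, 0.75], [0.75, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=50_000)

print(np.cov(X, rowvar=False))       # ≈ Sigma (sample covariance matrix)
print(np.corrcoef(X, rowvar=False))  # off-diagonal entries ≈ 0.75
```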
The variance and the covariance are extended to a random vector \mathbf{X}=(X_1,\ldots,X_p)' by means of the so-called variance-covariance matrix:
\begin{align*}
\mathrm{Var}[\mathbf{X}]&:=\mathbb{E}[(\mathbf{X}-\mathbb{E}[\mathbf{X}])(\mathbf{X}-\mathbb{E}[\mathbf{X}])']=\mathbb{E}[\mathbf{X}\mathbf{X}']-\mathbb{E}[\mathbf{X}]\mathbb{E}[\mathbf{X}]'\\
&=\begin{pmatrix}
\mathrm{Var}[X_1] & \mathrm{Cov}[X_1,X_2] & \cdots & \mathrm{Cov}[X_1,X_p]\\
\mathrm{Cov}[X_2,X_1] & \mathrm{Var}[X_2] & \cdots & \mathrm{Cov}[X_2,X_p]\\
\vdots & \vdots & \ddots & \vdots\\
\mathrm{Cov}[X_p,X_1] & \mathrm{Cov}[X_p,X_2] & \cdots & \mathrm{Var}[X_p]
\end{pmatrix},
\end{align*}
where \mathbb{E}[\mathbf{X}]:=(\mathbb{E}[X_1],\ldots,\mathbb{E}[X_p])' is just the componentwise expectation. As in the univariate case, the expectation is a linear operator, which now means that
\begin{align}
\mathbb{E}[\mathbf{A}\mathbf{X}+\mathbf{b}]=\mathbf{A}\mathbb{E}[\mathbf{X}]+\mathbf{b},\quad\text{for a }q\times p\text{ matrix }\mathbf{A}\text{ and }\mathbf{b}\in\mathbb{R}^q.\tag{1.2}
\end{align}
It follows from (1.2) that
\begin{align}
\mathrm{Var}[\mathbf{A}\mathbf{X}+\mathbf{b}]=\mathbf{A}\mathrm{Var}[\mathbf{X}]\mathbf{A}',\quad\text{for a }q\times p\text{ matrix }\mathbf{A}\text{ and }\mathbf{b}\in\mathbb{R}^q.\tag{1.3}
\end{align}
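Properties (1.2) and (1.3) are simple to verify by simulation; note that they hold for any distribution of \mathbf{X}, not only the normal. A minimal Python sketch (the matrix \mathbf{A}, the vector \mathbf{b}, and the exponential components of \mathbf{X} are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
p, q, n = 3, 2, 200_000
A = rng.normal(size=(q, p))
b = rng.normal(size=q)

# A non-normal X with known componentwise means; rows of X are iid draws.
X = rng.exponential(scale=[1.0, 2.0, 0.5], size=(n, p))
Y = X @ A.T + b  # rows are realizations of A X + b

print(Y.mean(axis=0))           # ≈ A E[X] + b, by (1.2)
print(A @ X.mean(axis=0) + b)
print(np.cov(Y, rowvar=False))  # ≈ A Var[X] A', by (1.3)
print(A @ np.cov(X, rowvar=False) @ A.T)
```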
The p-dimensional normal of mean \boldsymbol{\mu}\in\mathbb{R}^p and covariance matrix \boldsymbol{\Sigma} (a p\times p symmetric and positive definite matrix) is denoted by \mathcal{N}_{p}(\boldsymbol{\mu},\boldsymbol{\Sigma}) and is the generalization to p random variables of the usual normal distribution. Its (joint) pdf is given by
\begin{align*} \phi(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}):=\frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})},\quad \mathbf{x}\in\mathbb{R}^p. \end{align*}
The p-dimensional normal has a nice linear property that stems from (1.2) and (1.3):
\begin{align} \mathbf{A}\mathcal{N}_p(\boldsymbol\mu,\boldsymbol\Sigma)+\mathbf{b}\stackrel{d}{=}\mathcal{N}_q(\mathbf{A}\boldsymbol\mu+\mathbf{b},\mathbf{A}\boldsymbol\Sigma\mathbf{A}').\tag{1.4} \end{align}
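Property (1.4) is also the standard recipe for simulating from \mathcal{N}_p(\boldsymbol\mu,\boldsymbol\Sigma): take \mathbf{Z}\sim\mathcal{N}_p(\mathbf{0},\mathbf{I}) and any \mathbf{A} with \mathbf{A}\mathbf{A}'=\boldsymbol\Sigma (e.g., the Cholesky factor), so that \mathbf{A}\mathbf{Z}+\boldsymbol\mu\sim\mathcal{N}_p(\boldsymbol\mu,\boldsymbol\Sigma). A minimal Python sketch (the particular \boldsymbol\mu and \boldsymbol\Sigma are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

# Lower-triangular A with A A' = Sigma, via the Cholesky decomposition.
A = np.linalg.cholesky(Sigma)
Z = rng.standard_normal(size=(100_000, 2))  # rows ~ N_2(0, I)
X = Z @ A.T + mu                            # rows ~ N_2(mu, Sigma), by (1.4)

print(X.mean(axis=0))           # ≈ mu
print(np.cov(X, rowvar=False))  # ≈ Sigma
```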
Notice that when p=1, with \boldsymbol{\mu}=\mu and \boldsymbol{\Sigma}=\sigma^2, the pdf of the usual normal \mathcal{N}(\mu,\sigma^2) is recovered:¹³
\begin{align*} \phi(x;\mu,\sigma^2):=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \end{align*}
When p=2, the pdf is expressed in terms of \boldsymbol{\mu}=(\mu_1,\mu_2)' and \boldsymbol{\Sigma}=(\sigma_1^2,\rho\sigma_1\sigma_2;\rho\sigma_1\sigma_2,\sigma_2^2), for \mu_1,\mu_2\in\mathbb{R}, \sigma_1,\sigma_2>0, and -1<\rho<1:
\begin{align} &\phi(x_1,x_2;\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\rho):=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\tag{1.5}\\ &\;\times\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2}+\frac{(x_2-\mu_2)^2}{\sigma_2^2}-\frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right\}.\nonumber \end{align}
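Expression (1.5) can be coded directly and cross-checked against an independent implementation. A minimal Python sketch (scipy.stats.multivariate_normal is used only as a reference; the evaluation point and parameters are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

def phi2(x1, x2, mu1, mu2, sig1_sq, sig2_sq, rho):
    """Bivariate normal pdf, coded exactly as in (1.5)."""
    s1, s2 = np.sqrt(sig1_sq), np.sqrt(sig2_sq)
    q = ((x1 - mu1)**2 / sig1_sq + (x2 - mu2)**2 / sig2_sq
         - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2))
    return np.exp(-q / (2 * (1 - rho**2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

mu1, mu2, s1_sq, s2_sq, rho = 0.0, 0.0, 1.0, 1.0, 0.75
Sigma = [[s1_sq, rho], [rho, s2_sq]]  # off-diagonal is rho * s1 * s2 = rho here
print(phi2(0.5, -0.25, mu1, mu2, s1_sq, s2_sq, rho))
print(multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf([0.5, -0.25]))  # same
```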
The surface defined by (1.5) can be regarded as a 3-dimensional bell. In addition, it serves to provide concrete examples of the functions introduced above (a numerical check of several of these identities is sketched after the examples):
Joint pdf:
\begin{align*} f(x_1,x_2)=\phi(x_1,x_2;\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\rho). \end{align*}
Marginal pdfs:
\begin{align*} f_{X_1}(x_1)=\int \phi(x_1,t_2;\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\rho)\,\mathrm{d}t_2=\phi(x_1;\mu_1,\sigma_1^2) \end{align*}
and f_{X_2}(x_2)=\phi(x_2;\mu_2,\sigma_2^2). Hence X_1\sim\mathcal{N}\left(\mu_1,\sigma_1^2\right) and X_2\sim\mathcal{N}\left(\mu_2,\sigma_2^2\right).
Conditional pdfs:
\begin{align*} f_{X_1| X_2=x_2}(x_1)=&\frac{f(x_1,x_2)}{f_{X_2}(x_2)}=\phi\left(x_1;\mu_1+\rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),(1-\rho^2)\sigma_1^2\right),\\ f_{X_2| X_1=x_1}(x_2)=&\phi\left(x_2;\mu_2+\rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1),(1-\rho^2)\sigma_2^2\right). \end{align*}
Hence
\begin{align*} X_1&| X_2=x_2\sim\mathcal{N}\left(\mu_1+\rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),(1-\rho^2)\sigma_1^2\right),\\ X_2&| X_1=x_1\sim\mathcal{N}\left(\mu_2+\rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1),(1-\rho^2)\sigma_2^2\right). \end{align*}
Conditional expectations:
\begin{align*} \mathbb{E}[X_1|X_2=x_2]&=\mu_1+\rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),\\ \mathbb{E}[X_2|X_1=x_1]&=\mu_2+\rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1). \end{align*}
Joint cdf:
\begin{align*} \int_{-\infty}^{x_2}\int_{-\infty}^{x_1}\phi(t_1,t_2;\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\rho)\,\mathrm{d}t_1\,\mathrm{d}t_2. \end{align*}
Marginal cdfs: \int_{-\infty}^{x_1}\phi(t;\mu_1,\sigma_1^2)\,\mathrm{d}t=:\Phi(x_1;\mu_1,\sigma_1^2) and analogously \Phi(x_2;\mu_2,\sigma_2^2).
Conditional cdfs:
\begin{align*} \int_{-\infty}^{x_1}\phi\left(t;\mu_1+\rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),(1-\rho^2)\sigma_1^2\right)\,\mathrm{d}t=\Phi\left(x_1;\mu_1+\rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2),(1-\rho^2)\sigma_1^2\right) \end{align*}
and analogously \Phi\left(x_2;\mu_2+\rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1),(1-\rho^2)\sigma_2^2\right).
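The marginal and conditional identities listed above can be verified numerically. A minimal Python sketch (the parameters match those of Figure 1.4; scipy is an assumed dependency):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm, multivariate_normal

mu1, mu2, s1, s2, rho = 0.0, 0.0, 1.0, 1.0, 0.75
joint = multivariate_normal(mean=[mu1, mu2],
                            cov=[[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])

# Marginal pdf: integrating the joint over t2 recovers phi(x1; mu1, s1^2).
x1 = 0.3
m, _ = integrate.quad(lambda t2: joint.pdf([x1, t2]), -np.inf, np.inf)
print(m, norm.pdf(x1, loc=mu1, scale=s1))  # equal

# Conditional pdf at X2 = x2: the ratio joint / marginal equals the shifted normal.
x2 = -2.0
cond = joint.pdf([x1, x2]) / norm.pdf(x2, loc=mu2, scale=s2)
mean_c = mu1 + rho * (s1 / s2) * (x2 - mu2)
sd_c = np.sqrt(1 - rho**2) * s1
print(cond, norm.pdf(x1, loc=mean_c, scale=sd_c))  # equal
```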
Figure 1.4 graphically summarizes the concepts of joint, marginal, and conditional distributions within the context of a 2-dimensional normal.

Figure 1.4: Visualization of the joint pdf (in blue), marginal pdfs (green), conditional pdf of X_2|X_1=x_1 (orange), expectation (red point), and conditional expectation \mathbb{E}\lbrack X_2|X_1=x_1\rbrack (orange point) of a 2-dimensional normal. The conditioning point of X_1 is x_1=-2. Note the different scales of the densities, as each has to integrate to one over a different support. Note how the conditional density (upper orange curve) is not the joint pdf f(x_1,x_2) (lower orange curve) with x_1=-2, but a rescaling of that curve by \frac{1}{f_{X_1}(x_1)}. The parameters of the 2-dimensional normal are \mu_1=\mu_2=0, \sigma_1=\sigma_2=1, and \rho=0.75. 500 observations sampled from the distribution are shown in black.
Finally, in the predictive models we will consider an independent and identically distributed (iid) sample of the response and the predictors. We use the following notation: Y_i is the i-th observation of the response Y and X_{ij} represents the i-th observation of the j-th predictor X_j. Thus we will deal with samples of the form \{(X_{i1},\ldots,X_{ip},Y_i)\}_{i=1}^n.
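In code, such a sample is conveniently stored as an n\times p matrix of predictors plus a response vector of length n. A minimal Python sketch (the linear generating model is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))  # X[i, j] holds the observation X_{ij}
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)  # Y[i] holds Y_i
sample = np.column_stack([X, Y])  # rows are (X_{i1}, ..., X_{ip}, Y_i)
print(sample.shape)  # (100, 4)
```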
9. Respectively, F(x)=\int_{-\infty}^{x}f(t)\,\mathrm{d}t.
10. The precise mathematical meaning of “\mathrm{d}F_X(x)” is given by the Riemann–Stieltjes integral.
11. Understood as the probability that (X_1\leq x_1) and \ldots and (X_p\leq x_p).
12. Recall that the X-part of \mathbb{E}[Y|X] is random. However, \mathbb{E}[Y|X=x] is deterministic.
13. If \mu=0 and \sigma=1 (standard normal), then the pdf and cdf are simply denoted by \phi and \Phi, without extra parameters.