# A First Course on Statistical Inference

*Isabel Molina Peralta and Eduardo García Portugués*

*2018-11-21, v0.8*

# Chapter 1 Preliminaries

## 1.1 Random experiments

**Definition 1.1 (Random experiment) **A *random experiment* is an experiment with the following properties:

- its outcome is impossible to predict;
- if the experiment is repeated under the same conditions, the outcome may be different;
- the set of possible outcomes is known in advance.

The following concepts are associated with a random experiment:

- The set of possible outcomes of the experiment is termed the *sample space* and is denoted by \(\Omega\).
- The individual outcomes of the experiment are called *sample outcomes*, *realizations*, or *elements*, and are denoted by \(\omega\in\Omega\).
- An *event* \(A\) is a subset of \(\Omega\). Once the experiment has been performed, it is said that \(A\) “happened” if the outcome of the experiment, \(\omega\), belongs to \(A\).

**Example 1.1 **The next experiments are random experiments:

- Tossing a coin. The sample space is \(\Omega=\{``\mathrm{heads}",``\mathrm{tails}"\}\). Some events are: \(\emptyset\), \(\{``\mathrm{heads}"\}\), \(\{``\mathrm{tails}"\}\), \(\Omega\).
- Counting the number of car accidents within an hour in Spain. The sample space is \(\Omega=\mathbb{N}\cup\{0\}\).
- Randomly selecting a Spanish woman between \(20\) and \(40\) years old and measuring her weight (in kg). The sample space is \(\Omega=[m,\infty)\), where \(m\) is the minimum weight.

## 1.2 Probability definitions

A probability function is defined as a mapping of subsets (events) of the sample space \(\Omega\) to elements in \([0,1]\). Therefore, it is convenient to count on a “good” structure for these subsets, which will provide “good” properties to the probability function.

**Definition 1.2 (\(\sigma\)-algebra) **A \(\sigma\)-algebra \(\mathcal{A}\) over a set \(\Omega\) is a collection of subsets of \(\Omega\) with the following properties:

- \(\emptyset\in \mathcal{A}\);
- If \(A\in\mathcal{A}\), then \(A^c\in \mathcal{A}\), where \(A^c\) is the complement of \(A\);
- If \(A_1,A_2,\ldots\in\mathcal{A}\), then \(\cup_{n=1}^{\infty} A_n\in \mathcal{A}\).
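When \(\Omega\) is finite, the three properties of Definition 1.2 can be checked by brute force, since countable unions reduce to finite unions. The following is a minimal sketch under that assumption; the function name `is_sigma_algebra` is hypothetical and only meant as an illustration.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the properties of Definition 1.2 for a FINITE set omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    # Every member must be a subset of omega.
    if any(not a <= omega for a in sets):
        return False
    # The empty set must belong to the collection.
    if frozenset() not in sets:
        return False
    # Closed under complements.
    if any(omega - a not in sets for a in sets):
        return False
    # Closed under pairwise unions (enough for a finite collection).
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {"heads", "tails"}
power_set = [set(), {"heads"}, {"tails"}, {"heads", "tails"}]
print(is_sigma_algebra(omega, power_set))                  # True
print(is_sigma_algebra(omega, [set(), {"heads"}, omega]))  # False
```

The second collection fails because it contains \(\{``\mathrm{heads}"\}\) but not its complement.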

The following are two commonly employed \(\sigma\)-algebras.

**Definition 1.3 (Discrete \(\sigma\)-algebra)** Let \(\Omega\) be a set. The *discrete \(\sigma\)-algebra* of \(\Omega\) is the *power set* \(\mathcal{P}(\Omega)=\{A:A\subset \Omega\}\), that is, the collection of all subsets of \(\Omega\).

**Definition 1.4 (Borel \(\sigma\)-algebra)** Let \(\Omega=\mathbb{R}\) and consider the collection of intervals \[ I=\{(-\infty,a]: \ a\in \mathbb{R}\}. \] The *Borel \(\sigma\)-algebra*, denoted by \(\mathcal{B}\), is defined as the smallest \(\sigma\)-algebra that contains \(I\).

*Remark.* The smallest \(\sigma\)-algebra coincides with the intersection of all \(\sigma\)-algebras containing \(I\).

*Remark.* The Borel \(\sigma\)-algebra \(\mathcal{B}\) contains all the complements, countable intersections, and countable unions of elements of \(I\). In particular, \(\mathcal{B}\) contains all kinds of intervals and isolated points of \(\mathbb{R}\). However, \(\mathcal{B}\) is not \(\mathcal{P}(\mathbb{R})\) (indeed, \(\mathcal{B}\varsubsetneq\mathcal{P}(\mathbb{R})\)). For example:

- \((a,\infty)\in\mathcal{B}\), since \((a,\infty)=(-\infty,a]^c\), and \((-\infty,a]\in\mathcal{B}\).
- \((a,b]\in\mathcal{B}\), \(\forall a<b\), since \((a,b]=(-\infty,b]\cap (a,\infty)\), where \((-\infty,b]\in\mathcal{B}\) and \((a,\infty)\in\mathcal{B}\).
- \(\{a\}\in\mathcal{B}\), \(\forall a\in\mathbb{R}\), since \(\{a\}=\bigcap_{n=1}^{\infty}\left(a-\tfrac{1}{n},a\right]\), which belongs to \(\mathcal{B}\).

When the sample space \(\Omega\) is continuous and it is not \(\mathbb{R}\), but a subset of \(\mathbb{R}\), we need to define a \(\sigma\)-algebra over the subsets of \(\Omega\).

**Definition 1.5 (Restricted Borel \(\sigma\)-algebra)** Let \(A\subset \mathbb{R}\). The *Borel \(\sigma\)-algebra restricted to \(A\)* is defined as \[ \mathcal{B}_{A}=\{B\cap A: \ B\in\mathcal{B}\}. \]

**Definition 1.6 (Measurable space)** The pair \((\Omega,\mathcal{A})\), where \(\Omega\) is a sample space and \(\mathcal{A}\) is a \(\sigma\)-algebra over \(\Omega\), is referred to as a *measurable space*.

**Example 1.2 **The measurable space for the experiment a described in Example 1.1 is \[
\Omega=\{``\mathrm{heads}", ``\mathrm{tails}"\}, \quad
\mathcal{A}=\{\emptyset,\{``\mathrm{heads}"\},\{``\mathrm{tails}"\},\Omega\}.
\] The sample space for experiment b is \(\Omega=\mathbb{N}^+\), where \(\mathbb{N}^+=\mathbb{N}\cup \{0\}\). Taking the \(\sigma\)-algebra \(\mathcal{P}(\Omega)\), then \((\Omega, \mathcal{P}(\Omega))\) is a measurable space.

For experiment c, in which the sample space is \(\Omega=[m,\infty)\subset\mathbb{R}\), *an* adequate \(\sigma\)-algebra is the Borel \(\sigma\)-algebra restricted to \(\Omega\), \(\mathcal{B}_{[m,\infty)}\).

A probability function maps an element of the \(\sigma\)-algebra to a real number in the interval \([0,1]\). Thus, probability functions are defined on measurable spaces.

**Example 1.3** The following tables show the relative frequencies of the outcomes of the random experiments of Example 1.1 when those experiments are repeated \(n\) times.

Tossing a coin \(n\) times. Table 1.1 and Figure 1.1 show that the relative frequencies of both “heads” and “tails” converge to \(0.5\).

Table 1.1: Relative frequencies of “heads” and “tails” for \(n\) random experiments.

| \(n\) | Heads | Tails |
|---|---|---|
| 10 | 0.600 | 0.400 |
| 20 | 0.400 | 0.600 |
| 30 | 0.467 | 0.533 |
| 100 | 0.500 | 0.500 |
| 1000 | 0.495 | 0.505 |

Counting the number of car accidents for \(n\) independent hours in Spain (simulated data). Table 1.2 and Figure 1.2 show the convergence of the relative frequencies of the experiment.

Table 1.2: Relative frequencies of car accidents in Spain for \(n\) hours.

| \(n\) | \(0\) | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) | \(6\) | \(7\) | \(8\) | \(\geq 9\) |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.000 | 0.000 | 0.300 | 0.300 | 0.100 | 0.100 | 0.100 | 0.000 | 0.000 | 0.100 |
| 20 | 0.000 | 0.000 | 0.200 | 0.200 | 0.100 | 0.100 | 0.200 | 0.100 | 0.000 | 0.100 |
| 30 | 0.000 | 0.033 | 0.267 | 0.133 | 0.100 | 0.100 | 0.233 | 0.067 | 0.000 | 0.067 |
| 100 | 0.030 | 0.040 | 0.260 | 0.140 | 0.160 | 0.110 | 0.170 | 0.050 | 0.020 | 0.020 |
| 1000 | 0.021 | 0.078 | 0.145 | 0.192 | 0.200 | 0.150 | 0.114 | 0.064 | 0.023 | 0.013 |
| 10000 | 0.018 | 0.074 | 0.149 | 0.193 | 0.194 | 0.159 | 0.106 | 0.057 | 0.028 | 0.022 |

Randomly selecting \(n\) Spanish women between \(20\) and \(40\) years old and measuring their weight (in kg; simulated data). Again, Table 1.3 and Figure 1.3 show the convergence of the relative frequencies of the weight intervals.

Table 1.3: Relative frequencies of weight intervals for \(n\) measured women.

| \(n\) | \([0, 35)\) | \([35, 45)\) | \([45, 55)\) | \([55, 65)\) | \([65, \infty)\) |
|---|---|---|---|---|---|
| 10 | 0.000 | 0.000 | 0.700 | 0.300 | 0.000 |
| 20 | 0.000 | 0.100 | 0.700 | 0.200 | 0.000 |
| 30 | 0.000 | 0.067 | 0.767 | 0.167 | 0.000 |
| 100 | 0.000 | 0.220 | 0.670 | 0.110 | 0.000 |
| 1000 | 0.003 | 0.200 | 0.690 | 0.107 | 0.000 |
| 5000 | 0.003 | 0.207 | 0.676 | 0.113 | 0.001 |

**Definition 1.7 (Frequentist definition of probability) **The *frequentist definition of probability* of an event \(A\) is the limit of the relative frequency of that event when the number of repetitions of the experiment tends to infinity.

If the experiment is repeated \(n\) times, and \(n_A\) is the number of repetitions in which \(A\) happens, then the probability of \(A\) is \[ \mathbb{P}(A)=\lim_{n\rightarrow \infty} \frac{n_A}{n}\,. \]
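The stabilization of \(n_A/n\) can be illustrated by simulation. The sketch below assumes a fair coin simulated with Python's pseudo-random generator; the function name and the seed are arbitrary choices for the illustration, and the printed values will not reproduce those of Table 1.1.

```python
import random

def relative_frequency(n, seed=0):
    """Relative frequency n_A / n of A = {"heads"} in n simulated fair tosses."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n_heads = sum(rng.random() < 0.5 for _ in range(n))
    return n_heads / n

for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(n))
```

As \(n\) grows, the printed relative frequencies should settle close to \(0.5\), in line with the frequentist definition.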

The *Laplace definition of probability* can be employed for experiments that have a finite number of possible outcomes, and whose results are equally likely.

**Definition 1.8 (Laplace definition of probability) **The *Laplace definition of probability* of an event \(A\) is the proportion of favourable outcomes to \(A\), that is, \[
\mathbb{P}(A)=\frac{\# A}{\#\Omega},
\] where \(\#\Omega\) is the number of possible outcomes of the experiment and \(\# A\) is the number of outcomes in \(A\).
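As a small illustration of the Laplace definition, the sketch below counts favourable outcomes for a hypothetical fair six-sided die (an experiment chosen only for this example) and returns the exact ratio \(\# A/\#\Omega\).

```python
from fractions import Fraction

def laplace_probability(event, omega):
    """#A / #Omega for a finite sample space with equally likely outcomes."""
    omega = set(omega)
    event = set(event) & omega  # only outcomes of omega can be favourable
    return Fraction(len(event), len(omega))

omega = range(1, 7)   # hypothetical fair six-sided die
even = {2, 4, 6}      # event A = "the result is even"
print(laplace_probability(even, omega))  # 1/2
```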

Lastly, the *Kolmogorov axiomatic definition of probability* does not establish the probability as a *unique* function, as the previous probability definitions do, but presents three axioms that must be satisfied by any probability function.

**Definition 1.9 (Kolmogorov definition of probability)** Let \((\Omega,\mathcal{A})\) be a measurable space. A *probability function* is a mapping \(\mathbb{P}:\mathcal{A}\rightarrow \mathbb{R}\) that satisfies the following axioms:

- (*Non-negativity*) \(\forall A\in\mathcal{A}\), \(\mathbb{P}(A)\geq 0\);
- (*Unitarity*) \(\mathbb{P}(\Omega)=1\);
- (*\(\sigma\)-additivity*) For any sequence \(A_1,A_2,\ldots\) of disjoint events (\(A_i\cap A_j=\emptyset\), \(i\neq j\)) of \(\mathcal{A}\), it holds that \[ \mathbb{P}\left(\bigcup_{n=1}^{\infty} A_n\right)=\sum_{n=1}^{\infty} \mathbb{P}(A_n). \]

Observe that the \(\sigma\)-additivity property is well-defined: since \(\mathcal{A}\) is a \(\sigma\)-algebra, then the countable union belongs to \(\mathcal{A}\) also, and therefore the probability function takes as argument a proper element from \(\mathcal{A}\).

**Example 1.4** Consider the experiment a described in Example 1.1 with the measurable space \((\Omega,\mathcal{A})\), where \[ \Omega=\{``\mathrm{heads}",``\mathrm{tails}"\}, \quad \mathcal{A}=\{\emptyset,\{``\mathrm{heads}"\},\{``\mathrm{tails}"\},\Omega\}. \] A probability function is \(\mathbb{P}_1:\mathcal{A}\rightarrow[0,1]\), defined as \[ \mathbb{P}_1(\emptyset)=0, \ \mathbb{P}_1(\{``\mathrm{heads}"\})=\mathbb{P}_1(\{``\mathrm{tails}"\})=1/2, \ \mathbb{P}_1(\Omega)=1. \] It is straightforward to check that \(\mathbb{P}_1\) satisfies the three definitions of probability. Consider now \(\mathbb{P}_2:\mathcal{A}\rightarrow[0,1]\) defined as \[ \mathbb{P}_2(\emptyset)=0, \ \mathbb{P}_2(\{``\mathrm{heads}"\})=p<1/2, \ \mathbb{P}_2(\{``\mathrm{tails}"\})=1-p, \ \mathbb{P}_2(\Omega)=1. \] If the coin is fair, then \(\mathbb{P}_2\) satisfies neither the frequentist definition nor the Laplace definition, since under \(\mathbb{P}_2\) the outcomes are not equally likely. However, it does satisfy the Kolmogorov axiomatic definition.

**Example 1.5 **We can define a probability function for the experiment b of Example 1.1, with the measurable space \((\Omega,\mathcal{P}(\Omega))\), in the following way:

- For the individual outcomes, the probability is defined as \[ \begin{array}{lllll} &\mathbb{P}(\{0\}) =0.018, &\mathbb{P}(\{1\}) =0.074, &\mathbb{P}(\{2\}) =0.149,\\ &\mathbb{P}(\{3\}) =0.193, &\mathbb{P}(\{4\}) =0.194, &\mathbb{P}(\{5\}) =0.159, \\ &\mathbb{P}(\{6\}) =0.106, &\mathbb{P}(\{7\}) =0.057, &\mathbb{P}(\{8\}) =0.028, \\ &\mathbb{P}(\{9\}) =0.022, &\mathbb{P}(\emptyset) =0, &\mathbb{P}(\{i\}) =0,\ \forall i> 9. \end{array} \]

- For subsets of \(\Omega\) with more than one element, the probability is defined as the sum of the probabilities of the individual outcomes belonging to the subset. That is, if \(A=\{a_1,\ldots,a_n\}\), with \(a_i\in \Omega\), then the probability of \(A\) is \[ \mathbb{P}(A)=\sum_{i=1}^n \mathbb{P}(\{a_i\}). \]
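The probability function of Example 1.5 can be sketched in a few lines. The code below stores the probabilities of the individual outcomes in a dictionary (the names `pmf` and `prob` are hypothetical) and checks unitarity and additivity numerically, up to floating-point rounding.

```python
import math

# Probabilities of the individual outcomes, copied from Example 1.5;
# outcomes larger than 9 have probability zero and are omitted.
pmf = {0: 0.018, 1: 0.074, 2: 0.149, 3: 0.193, 4: 0.194,
       5: 0.159, 6: 0.106, 7: 0.057, 8: 0.028, 9: 0.022}

def prob(event):
    """P(A) as the sum of the probabilities of the outcomes in A."""
    return sum(pmf.get(a, 0.0) for a in event)

# Unitarity: the probabilities of all the outcomes add up to one.
print(math.isclose(prob(range(10)), 1.0))
# Additivity for the disjoint events {0, 1} and {5}.
print(math.isclose(prob({0, 1, 5}), prob({0, 1}) + prob({5})))
# Probability of observing 2, 3, or 4 accidents.
print(prob({2, 3, 4}))
```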

## 1.3 Random variables

**Definition 1.10 (Probability space)** A *probability space* is the triple \((\Omega,\mathcal{A}, \mathbb{P})\), where \(\mathbb{P}\) is a probability function defined on the measurable space \((\Omega,\mathcal{A})\).

The *random variable* concept makes it possible to transform the sample space \(\Omega\) of a random experiment into a set with good mathematical properties, such as the real numbers.

**Definition 1.11 (Measurable function)** Let \((\Omega,\mathcal{A})\) be a measurable space and let \(\mathcal{B}\) be the Borel \(\sigma\)-algebra over \(\mathbb{R}\). A *random variable* (r.v.) is a mapping \(X:\Omega\rightarrow \mathbb{R}\) that is *measurable*, that is, that verifies \[ \forall B\in\mathcal{B}, \quad X^{-1}(B)\in\mathcal{A}, \] with \(X^{-1}(B)=\{\omega\in\Omega: X(\omega)\in B\}\).

In this way, a r.v. transforms a measurable space \((\Omega,\mathcal{A})\) into another measurable space with a more convenient mathematical framework, \((\mathbb{R},\mathcal{B})\), whose \(\sigma\)-algebra is the one generated by the class of intervals. The measurability condition of the r.v. makes it possible to transfer the probability of any subset \(A\in\mathcal{A}\) to the probability of a subset \(B\in\mathcal{B}\), where \(B\) is precisely the image of \(A\) through the r.v. \(X\). The concept of r.v. reveals itself as a key translator: it transfers the randomness “produced” by a random experiment with sample space \(\Omega\) to the mathematically friendly \((\mathbb{R},\mathcal{B})\).

**Example 1.6 **For the experiments given in Example 1.1, the next are random variables:

A possible r.v. for the measurable space \[ \Omega=\{``\mathrm{heads}",``\mathrm{tails}"\}, \quad \mathcal{A}=\{\emptyset,\{``\mathrm{heads}"\},\{``\mathrm{tails}"\},\Omega\}, \] is \[ X(\omega)=\left\{\begin{array}{ll} 1 & \mathrm{if}\ \omega=``\mathrm{heads}",\\ 0 & \mathrm{if}\ \omega=``\mathrm{tails}". \end{array}\right. \] We can call this random variable “Number of heads when tossing a coin”. Indeed, this is a r.v., since for any subset \(B\in\mathcal{B}\), it holds:

- If \(0,1\in B\), then \(X^{-1}(B)=\Omega\in \mathcal{A}\).
- If \(0\in B\) but \(1\notin B\), then \(X^{-1}(B)=\{``\mathrm{tails}"\}\in \mathcal{A}\).
- If \(1\in B\) but \(0\notin B\), then \(X^{-1}(B)=\{``\mathrm{heads}"\}\in \mathcal{A}\).
- If \(0,1\notin B\), then \(X^{-1}(B)=\emptyset\in \mathcal{A}\).

For the measurable space \((\Omega,\mathcal{P}(\Omega))\), where \(\Omega=\mathbb{N}\cup \{0\}\), since the sample space is already contained in \(\mathbb{R}\), an adequate r.v. is \(X_1(\omega)=\omega\). Indeed, \(X_1\) is a r.v., because for \(B\in\mathcal{B}\), the set \[ X_1^{-1}(B)=\{\omega\in \mathbb{N}\cup\{0\}: \ X_1(\omega)=\omega\in B\} \] is the set of natural numbers (including zero) that belong to \(B\). But any set of natural numbers (including zero) belongs to \(\mathcal{P}(\Omega)\), as this \(\sigma\)-algebra contains all the subsets of \(\mathbb{N}\cup\{0\}\). Therefore, \(X_1=\)“Number of car accidents within an hour in Spain” is a r.v. Another possible r.v. is the one that indicates whether there is at least one car accident, and is given by \[ X_2(\omega)=\left\{\begin{array}{ll} 1 & \mathrm{if}\ \omega\in \mathbb{N}, \\ 0 & \mathrm{if} \ \omega=0. \end{array}\right. \] Indeed, \(X_2\) is a r.v., since for \(B\in\mathcal{B}\):

- If \(0,1\in B\), then \(X_2^{-1}(B)=\Omega\in \mathcal{P}(\Omega)\).
- If \(1\in B\) but \(0\notin B\), then \(X_2^{-1}(B)=\mathbb{N}\in \mathcal{P}(\Omega)\).
- If \(0\in B\) but \(1\notin B\), then \(X_2^{-1}(B)=\{0\}\in \mathcal{P}(\Omega)\).
- If \(0,1\notin B\), then \(X_2^{-1}(B)=\emptyset\in \mathcal{P}(\Omega)\).

As in the previous case, for the measurable space \((\Omega,\mathcal{B}_{\Omega})\), where \(\Omega=[m,\infty)\), a r.v. is \(X_1(\omega)=\omega\), since for \(B\in \mathcal{B}\), we have \[ X_1^{-1}(B)=\{\omega\in[m,\infty): \ X_1(\omega)=\omega\in B\}=[m,\infty)\cap B\in \mathcal{B}_{\Omega}. \] Therefore, \(X_1=\)“Weight of a Spanish woman between 20 and 40 years old” is a r.v. Another possible r.v. is \[ X_2(\omega)=\left\{\begin{array}{ll} 1 & \mathrm{if}\ \omega \geq 65,\\ 0 & \mathrm{if}\ \omega<65, \end{array}\right. \] which corresponds to the concept “Weight of at least 65 kilograms”.

The induced probability of a r.v. is the probability function defined over subsets of \(\mathbb{R}\) that preserves the probabilities of the original events of \(\Omega\).

**Definition 1.12 (Induced probability of a r.v.)** Let \(\mathcal{B}\) be the Borel \(\sigma\)-algebra over \(\mathbb{R}\). The *induced probability* of the r.v. \(X\) is the function \(\mathbb{P}_X:\mathcal{B}\rightarrow \mathbb{R}\) defined as \[ \mathbb{P}_X(B)=\mathbb{P}(X^{-1}(B)), \ \forall B\in \mathcal{B}. \]

**Example 1.7 **Consider the probability function \(\mathbb{P}_1\) defined in Example 1.4 and the r.v. \(X\) defined in Example 1.6 a. Let \(B\in\mathcal{B}\). The induced probability of \(X\) is described in the following way:

- If \(0,1\in B\), then \(\mathbb{P}_{1X}(B)=\mathbb{P}_1(X^{-1}(B))=\mathbb{P}_1(\Omega)=1\).
- If \(0\in B\) but \(1\notin B\), then \(\mathbb{P}_{1X}(B)=\mathbb{P}_1(X^{-1}(B))=\mathbb{P}_1(\{``\mathrm{tails}"\})=1/2\).
- If \(1\in B\) but \(0\notin B\), then \(\mathbb{P}_{1X}(B)=\mathbb{P}_1(X^{-1}(B))=\mathbb{P}_1(\{``\mathrm{heads}"\})=1/2\).
- If \(0,1\notin B\), then \(\mathbb{P}_{1X}(B)=\mathbb{P}_1(X^{-1}(B))=\mathbb{P}_1(\emptyset)=0\).

Therefore, the probability induced by \(X\) is \[ \mathbb{P}_{1X}(B)=\left\{\begin{array}{ll} 0 & \mathrm{if}\ 0,1\notin B,\\ 1/2 & \text{if exactly one of $0$ and $1$ is in $B$},\\ 1 & \mathrm{if}\ 0,1\in B. \end{array}\right. \] In particular, we have the following probabilities:

- \(\mathbb{P}_{1X}(\{0\})=\mathbb{P}_1(X=0)=1/2\).
- \(\mathbb{P}_{1X}((-\infty,0])=\mathbb{P}_1(X\leq 0)=1/2\).
- \(\mathbb{P}_{1X}((0,1])=\mathbb{P}_1(0< X\leq 1)=1/2\).
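The computation of Example 1.7 can be mimicked by building the preimage \(X^{-1}(B)\) explicitly. In the sketch below, a Borel set \(B\) is represented by a membership test (a Python function), which is an assumption made only to keep the illustration short; the names `P1`, `X`, and `induced_probability` are hypothetical.

```python
from fractions import Fraction

# P_1 from Example 1.4, given on the individual outcomes.
P1 = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
# The r.v. X = "Number of heads when tossing a coin".
X = {"heads": 1, "tails": 0}

def induced_probability(in_B):
    """P_X(B) = P_1(X^{-1}(B)), with B given by the membership test in_B."""
    preimage = [w for w in X if in_B(X[w])]
    return sum((P1[w] for w in preimage), Fraction(0))

print(induced_probability(lambda x: x == 0))      # P_X({0}) = 1/2
print(induced_probability(lambda x: x <= 0))      # P_X((-inf, 0]) = 1/2
print(induced_probability(lambda x: 0 < x <= 1))  # P_X((0, 1]) = 1/2
```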

**Example 1.8** For the probability function \(\mathbb{P}\) defined in Example 1.5 and the r.v. \(X_1\) defined in b of Example 1.6, the induced probability by the r.v. \(X_1\) is described in the following way. Let \(B\in\mathcal{B}\) be such that \(\mathbb{N}^+\cap B=\{a_1,a_2,\ldots,a_p\}\). Then: \[ \mathbb{P}_{X_1}(B)=\mathbb{P}(X_1^{-1}(B))=\mathbb{P}(\mathbb{N}^+\cap B)=\mathbb{P}(\{a_1,a_2,\ldots,a_p\})=\sum_{i=1}^p \mathbb{P}(\{a_i\}). \]

**Definition 1.13 (Cumulative distribution function)** The *cumulative distribution function* (c.d.f.) of a r.v. \(X\) is the function \(F_X:\mathbb{R}\rightarrow \mathbb{R}\) defined as \[ F_X(x)=\mathbb{P}_X((-\infty,x]), \ \forall x\in \mathbb{R}. \]

**Proposition 1.1 (Properties of the c.d.f.) **

- \(\forall x\in\mathbb{R}, F_X(x)\in[0,1]\).
- The c.d.f. is monotonically non-decreasing, that is, \[ x<y \implies F_X(x)\leq F_X(y). \]
- Let \(a,b\in\mathbb{R}\) such that \(a<b\). Then, \[ \mathbb{P}(a<X\leq b)=F_X(b)-F_X(a). \]
- \(F_X(-\infty)\triangleq\lim_{x\rightarrow -\infty} F_X(x)=0\) and \(F_X(+\infty)\triangleq\lim_{x\rightarrow +\infty} F_X(x)=1\).
- \(F_X\) is right-continuous.
- The set of points where \(F_X\) is discontinuous is finite or countable.
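Some of these properties can be checked numerically on a concrete c.d.f. The sketch below uses the probabilities of Example 1.5 for the r.v. \(X_1\) of Example 1.8; the names `pmf` and `cdf` are hypothetical, and the checks hold up to floating-point rounding.

```python
import math

# Probabilities of the individual outcomes, copied from Example 1.5.
pmf = {0: 0.018, 1: 0.074, 2: 0.149, 3: 0.193, 4: 0.194,
       5: 0.159, 6: 0.106, 7: 0.057, 8: 0.028, 9: 0.022}

def cdf(x):
    """F_X(x) = P(X <= x) for the discrete r.v. with the p.m.f. above."""
    return sum(p for k, p in pmf.items() if k <= x)

# Monotonicity on a grid of points from -2 to 12.
grid = [i / 2 for i in range(-4, 25)]
print(all(cdf(a) <= cdf(b) for a, b in zip(grid, grid[1:])))

# P(2 < X <= 5) = F_X(5) - F_X(2).
direct = sum(p for k, p in pmf.items() if 2 < k <= 5)
print(math.isclose(cdf(5) - cdf(2), direct))

# Right-continuity at a jump: F_X(3 + eps) equals F_X(3), unlike F_X(3 - eps).
print(cdf(3 + 1e-9) == cdf(3), cdf(3 - 1e-9) < cdf(3))
```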

**Definition 1.14 (Discrete r.v.)** A r.v. is *discrete* if its range (or image set) \(R_X=\{x\in \mathbb{R}: x=X(\omega)\ \text{for some} \ \omega\in \Omega\}\) is finite or countable.

**Example 1.9** Among the r.v.’s defined in Example 1.6, the ones from parts a and b are discrete, and so is \(X_2\) from part c.

In the case of discrete random variables, we can define a function that gives the probabilities of the individual points of its range \(R_X\).

**Definition 1.15 (Probability mass function)** The *probability mass function* (p.m.f.) of a discrete r.v. \(X\) is the function \(p_X:\mathbb{R}\rightarrow \mathbb{R}\) such that \[ p_X(x)=\mathbb{P}_X(\{x\}), \ \forall x\in \mathbb{R}. \] The notation \(\mathbb{P}(X=x)=p_X(x)\) is also often employed.

With the terminology “distribution of a discrete r.v. \(X\)” we refer to either the probability function induced by \(X\), \(\mathbb{P}_X\), the c.d.f. \(F_X\), or the p.m.f. \(p_X\). The motivation for this abuse of notation is that any of these functions determines the random behavior of \(X\).

**Definition 1.16 (Continuous r.v.)** A *continuous* r.v. (also called *absolutely continuous*) is one that satisfies the following conditions:

- The distribution function \(F_X\) is always continuous.
- The distribution function \(F_X\) is differentiable and its derivative is continuous except in at most a countable set of points (or almost everywhere continuous).

We define the set of points with anomalous derivative as \[ S=\{x\in \mathbb{R}: \ \text{either $F_X'(x)$ does not exist or $F_X'$ is not continuous at $x$}\}. \] Then, for a continuous r.v., \(S\) is a countable set.

**Example 1.10** The r.v. \(X_1=\)“Weight of a Spanish woman between 20 and 40 years old”, defined in part c of Example 1.6, is continuous.

**Definition 1.17 (Probability density function) **The *probability density function* (p.d.f.) of \(X\) is a function \(f_X:\mathbb{R}\rightarrow \mathbb{R}\) defined as \[
f_X(x)=\left\{\begin{array}{ll}
0 & \mathrm{if} \ x\in S,\\
F_X'(x) & \mathrm{if}\ x\notin S.
\end{array}\right.
\] Sometimes the p.d.f. is simply referred to as the *density function*.

The probability density function is so called because, for any \(x\in\mathbb{R}\), it measures the density of probability in an infinitesimal interval centered at \(x\).

As in the discrete case, a continuous r.v. \(X\) is determined by either the induced probability function \(\mathbb{P}_X\), the c.d.f. \(F_X\), or the p.d.f. \(f_X\). Therefore, whenever we refer to the “distribution of a r.v.”, we may be referring to any of these functions.

**Definition 1.18 (Moment generating function)** The *moment generating function* (m.g.f.) of a r.v. \(X\) is the function \[ M_{X}(s)=\mathbb{E}[e^{sX}], \] for \(s\in(-h,h)\subset \mathbb{R}\), \(h>0\), such that the expectation exists. If the expectation does not exist in any neighborhood of zero, then we say that the m.g.f. does *not exist*.

**Example 1.11 (Gamma m.g.f.) **A r.v. \(X\) follows a gamma distribution with *shape* \(\alpha\) and *scale* \(\beta\), denoted as \(X\sim\Gamma(\alpha,\beta)\), if its p.d.f. belongs to the next parametric class of densities: \[
f_{X}(x;\alpha,\beta)=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta},
\ 0<x<\infty, \ \alpha>0,\ \beta>0,
\] where \(\Gamma(\alpha)\) is the *gamma function*, defined as \[
\Gamma(\alpha)=\int_{0}^{\infty} x^{\alpha-1}e^{-x}\,\mathrm{d}x.
\] The gamma function satisfies the following properties:

- \(\Gamma(1)=1\).
- \(\Gamma(\alpha)=(\alpha-1)\Gamma(\alpha-1)\), for any \(\alpha>1\).
- \(\Gamma(n)=(n-1)!\), for any \(n\in\mathbb{N}\).

Let us compute the m.g.f. of \(X\):

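\[\begin{align*} M_{X}(s)&=\mathbb{E}[e^{sX}]\\ &=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\int_{0}^{\infty} e^{sx} x^{\alpha-1}e^{-x/\beta}\,\mathrm{d}x\\ &=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\int_{0}^{\infty} x^{\alpha-1}e^{-(1/\beta -s)x}\,\mathrm{d}x. \end{align*}\]

The *kernel* of a p.d.f. is the main part of the density once the constants are removed. Observe that, if \(s<1/\beta\), in the integrand we have the kernel of a gamma p.d.f. with shape \(\alpha\) and scale \((1/\beta-s)^{-1}=\beta/(1-s\beta)\). If \(s\geq 1/\beta\), then the integral is \(\infty\). However, if \(s<1/\beta\), then the integral is finite and, multiplying and dividing by the adequate constants, we obtain the m.g.f.:

\[\begin{align*} M_{X}(s)&=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,\Gamma(\alpha)\left(\frac{\beta}{1-s\beta}\right)^{\alpha}\int_{0}^{\infty} \frac{1}{\Gamma(\alpha)\left(\beta/(1-s\beta)\right)^{\alpha}}x^{\alpha-1}e^{-(1/\beta -s)x}\,\mathrm{d}x\\ &=\frac{1}{(1-s\beta)^{\alpha}}, \quad s<1/\beta, \end{align*}\]

since the last integrand is a gamma p.d.f. and therefore integrates to one.

Observe that, since \(\beta>0\), the m.g.f. exists in the interval \((-h,h)\) with \(h=1/\beta\).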
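The closed form \((1-s\beta)^{-\alpha}\), \(s<1/\beta\), for the gamma m.g.f. (the expression that also appears in the solution to Exercise 1.17) can be sanity-checked by numerical integration. The sketch below uses a plain midpoint rule on a truncated interval; the parameter values, the truncation point `upper`, and the number of steps are arbitrary choices, and the rule is only adequate here because \(\alpha\geq 1\) keeps the integrand bounded at zero.

```python
import math

def gamma_mgf_numeric(s, alpha, beta, upper=200.0, steps=200_000):
    """Midpoint-rule approximation of E[exp(sX)] for X ~ Gamma(alpha, beta)."""
    h = upper / steps
    const = 1.0 / (math.gamma(alpha) * beta ** alpha)
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += x ** (alpha - 1) * math.exp(-(1.0 / beta - s) * x)
    return const * total * h

def gamma_mgf_closed(s, alpha, beta):
    """Closed form (1 - s * beta)^(-alpha), valid for s < 1 / beta."""
    return (1.0 - s * beta) ** (-alpha)

alpha, beta, s = 2.0, 1.5, 0.2  # illustrative values with s < 1 / beta
print(gamma_mgf_numeric(s, alpha, beta), gamma_mgf_closed(s, alpha, beta))
```

The two printed values should agree to several decimal places.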

**Example 1.12 (Binomial m.g.f.)** A binomial r.v. measures the number of successes in \(n\) identical and independent realizations of a random experiment with two possible outcomes, “success” or “failure”, with “success” probability \(p\). The p.m.f. of such a r.v. is \[
\mathbb{P}(X=x)=\left(\begin{array}{c}n \\ x \end{array}\right) p^x
(1-p)^{n-x}, \ x=0,1,\ldots,n.
\] Let us compute its m.g.f.: \[
M_{X}(s)=\sum_{x=0}^{n} e^{sx}\left(\begin{array}{c}n \\ x \end{array}\right) p^x
(1-p)^{n-x}=\sum_{x=0}^{n} \left(\begin{array}{c}n \\ x \end{array}\right) (pe^s)^x
(1-p)^{n-x}.
\] By Newton’s binomial formula, we know that \[
(a+b)^n =\sum_{x=0}^{n} \left(\begin{array}{c}n \\ x \end{array}\right) a^x
b^{n-x},
\] and therefore, taking \(a=pe^s\) and \(b=1-p\), the m.g.f. is \[
M_{X}(s)=(pe^s+1-p)^n, \ \forall s\in\mathbb{R}.
\]
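The closed form of the binomial m.g.f. can be compared with the defining sum over the finite range \(x=0,1,\ldots,n\). The following sketch does so for arbitrary illustrative values of \(n\), \(p\), and \(s\); the function names are hypothetical.

```python
import math

def binomial_mgf_sum(s, n, p):
    """E[exp(sX)] computed directly from the binomial p.m.f."""
    return sum(
        math.exp(s * x) * math.comb(n, x) * p ** x * (1 - p) ** (n - x)
        for x in range(n + 1)
    )

def binomial_mgf_closed(s, n, p):
    """Closed form (p * e^s + 1 - p)^n."""
    return (p * math.exp(s) + 1 - p) ** n

n, p = 10, 0.3  # illustrative values
for s in (-1.0, 0.0, 0.5):
    print(s, binomial_mgf_sum(s, n, p), binomial_mgf_closed(s, n, p))
```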

The following theorem says that, as its name indicates, the m.g.f. of a r.v. \(X\) indeed generates the raw moments of \(X\) (that is, the moments centered about \(0\)).

**Theorem 1.1 (Moment generation by the m.g.f.)**Let \(X\) be a r.v. with m.g.f. \(M_{X}(s)\). Then, \[ \mathbb{E}[X^n]=M_{X}^{(n)}(0), \] where \(M_{X}^{(n)}(0)\) denotes the \(n\)-th derivative of the m.g.f. with respect to \(s\), evaluated at \(s=0\).
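Theorem 1.1 can be illustrated numerically by approximating the derivatives of a known m.g.f. at \(s=0\). The sketch below applies central finite differences (an approximation chosen for simplicity, not an exact differentiation) to the binomial m.g.f. of Example 1.12 and compares the results with the well-known binomial mean \(np\) and variance \(np(1-p)\); the step `h` and the parameter values are arbitrary.

```python
import math

def binomial_mgf(s, n, p):
    """Binomial m.g.f. from Example 1.12."""
    return (p * math.exp(s) + 1 - p) ** n

def first_two_moments(mgf, h=1e-4):
    """Central finite-difference approximations of M'(0) and M''(0)."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2
    return m1, m2

n, p = 10, 0.3  # illustrative values
m1, m2 = first_two_moments(lambda s: binomial_mgf(s, n, p))
print(m1, n * p)                      # both close to E[X]
print(m2 - m1 ** 2, n * p * (1 - p))  # both close to Var[X]
```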

**Theorem 1.2 (Uniqueness of the m.g.f.)** Let \(X\) and \(Y\) be two r.v.’s with m.g.f.’s \(M_{X}(s)\) and \(M_{Y}(s)\), respectively. If \[ M_{X}(s)=M_{Y}(s), \ \forall s\in (-h,h), \] for some \(h>0\), then the distributions of \(X\) and \(Y\) are the same.

Therefore, the distribution of a r.v. is determined from its m.g.f.

**Proposition 1.2 (Properties of the m.g.f.) ** The m.g.f. satisfies the following properties:

- Let \(X\) be a r.v. with m.g.f. \(M_{X}\). Define the r.v. \[ Y=aX+b, \ a,b\in \mathbb{R}, \ a\neq 0. \] Then, the m.g.f. of the new r.v. \(Y\) is \[ M_{Y}(s)=e^{sb}M_{X}(as). \]
- Let \(X_1,\ldots,X_n\) be independent r.v.’s with m.g.f.’s \(M_1,\ldots,M_n\), respectively. Define the r.v. \[ Y=X_1+\cdots +X_n. \] Then, its m.g.f. is given by \[ M_{Y}(s)=\prod_{i=1}^n M_{i}(s). \]

*Proof of Proposition 1.2.* In order to see i, observe that \[\begin{align*} M_{Y}(s)&=\mathbb{E}[e^{sY}]=\mathbb{E}[e^{s(aX+b)}]=\mathbb{E}[e^{saX}e^{sb}]\\ &=e^{sb}\mathbb{E}[e^{saX}]=e^{sb}M_{X}(as). \end{align*}\] To check ii, consider \[\begin{align*} M_{Y}(s)&=\mathbb{E}[e^{sY}]=\mathbb{E}\left[e^{s\sum_{i=1}^n X_i}\right]=\mathbb{E}\left[e^{sX_1}\cdots e^{sX_n}\right]\\ &=\mathbb{E}[e^{sX_1}]\cdots \mathbb{E}[e^{sX_n}]=\prod_{i=1}^n M_i(s), \end{align*}\] where the factorization of the expectation in the last line follows from the independence of \(X_1,\ldots,X_n\).

The next theorem is very useful for proving properties of random variables by employing a simple-to-use function such as the m.g.f. For example, the Central Limit Theorem, a cornerstone result in statistical inference, can be easily proved using Theorem 1.3.

**Theorem 1.3 (Convergence of the m.g.f.)**Assume that \(X_n\), \(n=1,2,3,\ldots\) is a sequence of r.v.’s with m.g.f.’s \(M_{X_n}(s)\), \(n=1,2,3,\ldots\). Assume in addition that for all \(s\in(-h,h)\), \(h>0\), it holds \[ \lim_{n\rightarrow \infty} M_{X_n}(s)=M_{X}(s), \] where \(M_{X}(s)\) is a m.g.f. Then there exists a unique c.d.f. \(F_{X}\) whose moments are determined by \(M_{X}(s)\), and for all \(x\) where \(F_{X}(x)\) is continuous, it holds \[ \lim_{n\rightarrow \infty} F_{X_n}(x)=F_{X}(x). \]

That is, the *convergence of m.g.f.’s implies the convergence of c.d.f.’s*.
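A classical instance of this kind of convergence, used here only as an illustration, is the Poisson limit of the binomial: if \(X_n\) is binomial with \(n\) trials and success probability \(\lambda/n\), its m.g.f. \((1+\lambda(e^s-1)/n)^n\) converges to \(e^{-(1-e^s)\lambda}\), the m.g.f. of a \(\mathrm{Pois}(\lambda)\) computed in Exercise 1.15. The sketch below checks this numerically for arbitrary values of \(\lambda\) and \(s\).

```python
import math

def binomial_mgf(s, n, p):
    """Binomial m.g.f. from Example 1.12."""
    return (p * math.exp(s) + 1 - p) ** n

def poisson_mgf(s, lam):
    """Poisson m.g.f. from Exercise 1.15."""
    return math.exp(-(1 - math.exp(s)) * lam)

lam, s = 2.0, 0.5  # illustrative values
for n in (10, 100, 1000, 10000):
    print(n, binomial_mgf(s, n, lam / n), poisson_mgf(s, lam))
```

The first column of m.g.f. values should approach the constant second column as \(n\) grows.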

## Exercises

**Exercise 1.1**Consider the discrete sample space \(\Omega=\{a,b,c,d\}\) and the mapping \(X: \Omega\rightarrow \mathbb{R}\) such that \(X(a)=X(b)=0\) and \(X(c)=X(d)=1\). Consider the \(\sigma\)-algebra generated by the sets \(\{a\}\) and \(\{c,d\}\). Prove that \(X\) is a r.v. for this \(\sigma\)-algebra.

**Exercise 1.2 **Consider the \(\sigma\)-algebra generated by the subsets \(\{a\}\) and \(\{c\}\) for \(\Omega=\{a,b,c,d\}\).

- Prove that the mapping in Exercise 1.1 is not a r.v. for this \(\sigma\)-algebra.
- Define a mapping that is a r.v. for this \(\sigma\)-algebra.

**Exercise 1.3** Consider the experiment consisting in tossing a coin twice.

- Provide the sample space.
- For the \(\sigma\)-algebra \(\mathcal{P}(\Omega)\), consider the r.v. \(X=\) “Number of heads in two tosses”. Provide the range and the probability function induced by \(X\).

**Exercise 1.4**For the experiment of Exercise 1.3, consider the r.v. \(Y=\) “Difference between the number of heads and the number of tails”. Obtain its range and its induced probability function.

**Exercise 1.5** A die is rolled twice. Obtain the sample space and the p.m.f. of the following r.v.’s:

- \(X_1=\) “Sum of the resulting numbers”.
- \(X_2=\) “Absolute value of difference of the resulting numbers”.
- \(X_3=\) “Maximum of the resulting numbers”.
- \(X_4=\) “Minimum of the resulting numbers”.

**Exercise 1.6** Let \(X\) be the r.v. defined as the number of rolls of a die until a five is obtained. Compute:

- The p.m.f. of \(X\).
- The c.d.f. of \(X\).
- \(\mathbb{P}(X<15)\) and \(\mathbb{P}(10<X\leq 20)\).

**Exercise 1.7** Assume that the lifespan (in hours) of a fluorescent tube is represented by the continuous r.v. \(X\) with p.d.f. \[
f(x)=\left\{ \begin{array}{ll} c/x^2 & x>100,\\ 0 & x\leq 100.
\end{array}\right.
\] Compute:

- The value of \(c\).
- The c.d.f. of \(X\).
- The probability that a tube lasts more than \(500\) hours.

**Exercise 1.8 **When storing excess flour in bags of \(100\) kg, a random error \(X\) is made in the measurement of the weight of the bags. The p.d.f. of the error \(X\) is given by \[
f(x)=\left\{ \begin{array}{ll} k(1-x^2) & -1<x<1,\\ 0 & \text{otherwise}.
\end{array}\right.
\]

- Compute the probability that a bag weighs more than \(99.5\) kg.
- What is the percentage of bags with a weight between \(99.8\) and \(100.2\) kg?

**Exercise 1.9** Consider the experiment consisting in tossing a coin several times.

- Compute the average number of heads obtained if the coin is tossed five times.
- Compute the same average when the coin is tossed \(n\) times.

**Exercise 1.10**A r.v. only takes the values \(1\) and \(3\) with a non-zero probability. If its expectation is \(8/3\), find the probabilities of these two values.

**Exercise 1.11** The random number of received calls in a call center during a time interval of \(h\) minutes, \(X_h\), has p.m.f. \[
\mathbb{P}(X_h=n)=\frac{(5h)^n}{n!}e^{-5h}, \ n=0,1,2,\ldots.
\]

- Find the average number of calls received in half an hour.
- What is the expected time that has to pass until an average of \(100\) calls is received?

**Exercise 1.12 **Let \(X\) be a r.v. following a Poisson distribution with expectation \(\lambda\), denoted \(X\sim \mathrm{Pois}(\lambda)\), whose p.m.f. is \[
\mathbb{P}(X=x)=p(x;\lambda)=\frac{\lambda^x}{x!}e^{-\lambda}, \ x=0,1,2,\ldots.
\] Compute the expectation and variance of the new r.v. \[
Y=\left\{ \begin{array}{ll}
1 & \mathrm{if} \ X=0,\\
0 & \mathrm{if} \ X\neq 0.
\end{array}\right.
\]

**Solution**. First, we obtain \(\mathbb{P}(X=0)=p(0;\lambda)=e^{-\lambda}\). Then, the expectation of \(Y\) is given by \[
\mathbb{E}[Y]=1\mathbb{P}(X=0)+0\mathbb{P}(X\neq 0)=\mathbb{P}(X=0)=e^{-\lambda}.
\] For the variance, we employ the formula \[
\mathbb{V}\mathrm{ar}[Y]=\mathbb{E}[Y^2]-\mathbb{E}^2[Y].
\] Then, the second-order moment is given by \[
\mathbb{E}[Y^2]=1^2\mathbb{P}(X=0)+0^2\mathbb{P}(X\neq 0)=e^{-\lambda}.
\] Replacing in the above equation, we finally get \[
\mathbb{V}\mathrm{ar}[Y]=e^{-\lambda}-e^{-2\lambda}=e^{-\lambda}(1-e^{-\lambda}).
\]

**Exercise 1.13 **Consider a r.v. with density function \[
f(x)=\left\{ \begin{array}{ll} 6x(1-x) & 0<x<1,\\ 0 & \mathrm{otherwise.}
\end{array}\right.
\] Compute:

- \(\mathbb{E}[X]\) and \(\mathbb{V}\mathrm{ar}[X]\).
- \(\mathbb{P}\left(|X-\mathbb{E}[X]|<\sqrt{\mathbb{V}\mathrm{ar}[X]}\right)\).

**Exercise 1.14 **Find the m.g.f.’s of the r.v.’s whose p.m.f.’s are defined below and, using them, obtain the corresponding expectation and variance.

- \(\mathbb{P}(X=1)=p\), \(\mathbb{P}(X=0)=1-p\), where \(p\in(0,1)\).
- \(\mathbb{P}(X=n)=(1-p)^n p\), \(n=0,1,2,\ldots\), where \(p\in(0,1)\).

**Exercise 1.15**Compute the m.g.f. of \(X\sim \mathrm{Pois}(\lambda)\).

**Solution**. The p.m.f. is given in Exercise 1.12. The m.g.f. is \[\begin{align*} M_X(s)&=\mathbb{E}[e^{sX}]=\sum_{x=0}^{\infty} e^{sx} p(x;\lambda)\\ &=\sum_{x=0}^{\infty} e^{sx} \frac{\lambda^x e^{-\lambda}}{x!}=e^{-\lambda}\sum_{x=0}^{\infty} \frac{(e^s\lambda)^x}{x!}\\ &=e^{-\lambda}e^{e^s\lambda}=e^{-(1-e^s)\lambda}, \end{align*}\]

for all \(s\in\mathbb{R}\).

**Exercise 1.16 (Poisson additive property) **Let \(X_1,\ldots,X_n\) be a sequence of i.i.d. r.v.’s distributed as \(\mathrm{Pois}(\lambda)\). Prove that \[
Y=\sum_{i=1}^n X_i\sim \mathrm{Pois}(n\lambda).
\]

**Solution**. This is easily proved by using Theorem 1.2. Thanks to ii in Proposition 1.2, we have that \[
M_{Y}(s)=\left[M_{X_i}(s)\right]^n=e^{-(1-e^s)n\lambda},
\] which is the m.g.f. of a \(\mathrm{Pois}(n\lambda)\). Because of the uniqueness ensured by Theorem 1.2, we conclude that \(Y\sim \mathrm{Pois}(n\lambda)\).
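As a complement to the m.g.f. argument, the additive property can be checked numerically by convolving Poisson p.m.f.’s. The sketch below is only a sanity check under arbitrary values of \(\lambda\) and \(n\); the helper names are hypothetical. The truncation at `max_value` does not affect the compared values, since a sum of non-negative integers equal to \(y\) only involves terms not larger than \(y\).

```python
import math

def poisson_pmf(x, lam):
    """Poisson p.m.f. from Exercise 1.12."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def pmf_of_sum(lam, n, max_value):
    """p.m.f. of X_1 + ... + X_n on 0..max_value, by repeated convolution."""
    base = [poisson_pmf(x, lam) for x in range(max_value + 1)]
    total = base[:]
    for _ in range(n - 1):
        total = [
            sum(total[k] * base[y - k] for k in range(y + 1))
            for y in range(max_value + 1)
        ]
    return total

lam, n, max_value = 0.7, 3, 15  # illustrative values
conv = pmf_of_sum(lam, n, max_value)
direct = [poisson_pmf(y, n * lam) for y in range(max_value + 1)]
print(max(abs(a - b) for a, b in zip(conv, direct)))  # close to zero
```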

**Exercise 1.17 (Gamma additive property) **Prove the additive property of the gamma distribution, that is, prove that if \(X_i\), \(i=1,\ldots,n\) are independent r.v.’s with respective distributions \(\Gamma(\alpha_i,\beta)\), \(i=1,\ldots,n\), then \[
\sum_{i=1}^n X_i\sim \Gamma\left(\sum_{i=1}^n\alpha_i,\beta\right).
\]

**Solution**. The m.g.f. of \(Y=\sum_{i=1}^n X_i\) is \[
M_{Y}(s)=\mathbb{E}[e^{Ys}]=\prod_{i=1}^n M_{X_i}(s)=\frac{1}{(1-s\beta)^{\sum_{i=1}^n
\alpha_i}}, \ s<1/\beta,
\] which is the m.g.f. of a gamma r.v. with shape \(\sum_{i=1}^n \alpha_i\) and scale \(\beta\).

**Exercise 1.18**Consider the random vector \((X,Y)\) with joint p.d.f. \[ f(x, y)=e^{-x}, \quad x>0,\ 0<y<x. \] Compute \(\mathbb{E}[X]\), \(\mathbb{E}[Y]\), and \(\mathbb{C}\mathrm{ov}[X,Y]\).