4 Ordinary Least Squares: The Classical Linear Regression Model

4.1 Finite-Sample Properties

Notation:

  • \(Y_i\): dependent variable.

  • \(X_{ik}:\) \(k\)th independent variable (or regressor) with \(k=1,\dots,K \,\).
    Can be stochastic or deterministic.

  • \(\varepsilon_i\): stochastic error term

  • \(i\): indexes the \(i\)th individual with \(i=1,\dots,n\), where \(n\) is the sample size

Assumption 1.1: Linearity

\[ Y_i = \sum_{k=1}^K\beta_k X_{ik}+\varepsilon_i, \quad i=1,\dots,n \,. \]

Usually, a constant (or intercept) is included; in this case, \(X_{i1}=1\) for all \(i\). In the following we will always assume that a constant is included in the linear model, unless otherwise stated. A special case is the so-called simple linear model, defined as

\[ Y_i = \beta_1+\beta_2 X_i +\varepsilon_i, \quad i=1,\dots,n \,. \]

Often it is convenient to write the model with \(K\) regressors using matrix notation

\[ Y_i = \mathbf{X}_i'\boldsymbol{\beta} +\varepsilon_i, \quad i=1,\dots,n \,, \]

where \(\mathbf{X}_i=(X_{i1},\dots,X_{iK})'\) and \(\boldsymbol{\beta}=(\beta_1,\dots,\beta_K)'\). Stacking all individual rows \(i\) leads to

\[ \underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times K)}{\mathbf{X}}\underset{(K\times 1)}{\boldsymbol{\beta}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}} \, , \]

where

\[ \mathbf{Y} = \left(\begin{matrix}Y_1\\ \vdots\\Y_n\end{matrix}\right),\quad \mathbf{X} = \left( \begin{matrix} X_{11} & \dots & X_{1K} \\ \vdots & \ddots & \vdots \\ X_{n1} &\dots&X_{nK}\\ \end{matrix}\right),\quad\text{and}\quad \boldsymbol{\varepsilon}=\left(\begin{matrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{matrix}\right). \]

Assumption 1.1 by itself does not really impose any restrictions because we can always simply define \[\varepsilon_i = Y_i - \sum_{k=1}^K\beta_k X_{ik}.\] Assumption 1.1 becomes restrictive only once we impose additional assumptions on \(\varepsilon_i\). We begin our analysis of the linear model under the framework of the so-called classic assumptions.
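To make the stacked notation concrete, here is a minimal simulation sketch in Python/NumPy (the sample size, coefficient values, and error distribution are illustrative assumptions, not part of the model): it builds \(\mathbf{X}\) with a constant in the first column and generates \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\).

```python
# Minimal simulation sketch (hypothetical values): the stacked linear model.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3                                   # sample size and number of regressors
beta = np.array([1.0, 2.0, -0.5])               # "true" coefficients (assumed for the example)

X = np.column_stack([np.ones(n),                # X_{i1} = 1: the constant regressor
                     rng.normal(size=(n, K - 1))])
eps = rng.normal(size=n)                        # stochastic error term
Y = X @ beta + eps                              # Assumption 1.1 in stacked form

print(X.shape, Y.shape)                         # (100, 3) (100,)
```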


Assumption 1.2: Strict Exogeneity

\[\mathbb{E}(\varepsilon_i \mid \mathbf{X}) = 0\]

or equivalently stated for the vector \(\boldsymbol{\varepsilon}\)

\[\mathbb{E}(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}.\]

Notice that in the presence of a constant regressor, setting the expectation to zero is a normalization.

Some Implications of Strict Exogeneity:

  • The law of iterated expectations implies that \(\mathbb{E}(\varepsilon_i \mid \mathbf{X}_i) = 0\), which implies that \[\mathbb{E}(Y_i \mid \mathbf{X}_i) = \mathbf{X}_i'\boldsymbol{\beta}.\] In other words, the conditional expectation of \(Y_i\) given \(\mathbf{X}_i\) is a linear function of \(\mathbf{X}_i\).

  • Also by the law of iterated expectations, the unconditional mean of the error term is zero:

\[ \mathbb{E}(\varepsilon_i) = 0\quad(i=1,\dots,n) \]

  • Generally, two random variables \(X\) and \(Y\) are said to be orthogonal if their cross moment is zero: \(\mathbb{E}(XY)=0\). Under strict exogeneity, the regressors are orthogonal to the error term for all observations, i.e., \[\begin{align*} \mathbb{E}(X_{jk}\varepsilon_i) = 0\quad(i,j=1,\dots,n; k=1,\dots,K) \end{align*}\]
  • Since the mean of the error term is zero (\(\mathbb{E}(\varepsilon_i)=0\) for all \(i\)), the orthogonality property (\(\mathbb{E}(X_{jk}\varepsilon_i)=0\), for all \(i,j,k\)) is equivalent to a zero-correlation property, i.e., \[\begin{align*} \mathrm{Cov}(\varepsilon_i,X_{jk}) = 0,\quad i,j=1,\dots,n;\; k=1,\dots,K \tag{4.1} \end{align*}\] Therefore, strict exogeneity requires the regressors to be uncorrelated with the current (\(i=j\)), the past (\(i<j\)), and the future (\(i>j\)) error terms. Especially in time series contexts, this is usually considered too strong an assumption (see the numerical sketch below).
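The following short Monte Carlo sketch illustrates the zero-correlation property (4.1) in a setting where strict exogeneity holds by construction, because the errors are drawn independently of the regressors (all distributional choices are illustrative assumptions):

```python
# Illustrative sketch: with errors drawn independently of the regressors,
# the sample covariance between each regressor and the error term is ~ 0,
# matching the zero-correlation property (4.1).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.normal(size=(n, 2))          # two non-constant regressors
eps = rng.normal(size=n)             # drawn independently of X, so E(eps | X) = 0

for k in range(X.shape[1]):
    print(f"sample Cov(eps, X_{k+1}): {np.cov(eps, X[:, k])[0, 1]: .4f}")   # ~ 0
```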


Assumption 1.3: Rank Condition

\[P(rank(\mathbf{X})=K) = 1\]

This assumption demands that the event of one regressor being linearly dependent on the others occurs with probability zero. It also implies that \(n\geq K\). The assumption is very strong: for instance, with iid data and a dummy regressor there is a positive probability that the dummy column is constant (e.g., all zeros), in which case \(\mathbf{X}\) does not have full column rank. For our finite-sample properties, we will always condition on \(\mathbf{X}\), and we can only condition on realizations of \(\mathbf{X}\) for which \(rank(\mathbf{X})=K\).

Assumption 1.4: Spherical Error

\[ \begin{aligned} \mathbb{E}(\varepsilon_i^2|\mathbf{X}) &= \sigma^2>0\\ \mathbb{E}(\varepsilon_i\varepsilon_j|\mathbf{X}) &= 0,\quad\quad i\neq j. \end{aligned} \]

Or more compactly written as,

\[ \mathbb{E}(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mid \mathbf{X}) = \sigma^2 I_n,\quad\quad \sigma^2>0. \]

Thus, we assume that, for a given realization of \(\mathbf{X}\), the error process is uncorrelated (\(\mathbb{E}(\varepsilon_i\varepsilon_j\mid \mathbf{X})=0\), for all \(i\neq j\)) and homoskedastic (same \(\sigma^2\), for all \(i\)). This assumption is stronger than necessary and is easy to relax to allow for heteroskedasticity.

4.1.1 The Algebra of Least Squares

The OLS estimator \(\mathbf{b}\) is defined as the minimizer of a specific loss function termed the sum of squared residuals

\[ SSR(\mathbf{b}^\ast) = \sum_{i=1}^n(Y_i-\mathbf{X}_i'\mathbf{b}^\ast)^2\;=\;(\mathbf{Y}-\mathbf{X}\mathbf{b}^\ast)'(\mathbf{Y}-\mathbf{X}\mathbf{b}^\ast). \]

I.e., we have

\[ \mathbf{b}:=\arg\min_{\mathbf{b}^\ast\in\mathbb{R}^K}SSR(\mathbf{b}^\ast). \] We can easily minimize \(SSR(\mathbf{b}^\ast)\) in closed form:

\[ \begin{aligned} SSR(\mathbf{b}^\ast) &= (\mathbf{Y}-\mathbf{X}\mathbf{b}^\ast)'(\mathbf{Y}-\mathbf{X}\mathbf{b}^\ast)\\ &= \mathbf{Y}'\mathbf{Y}-(\mathbf{X}\mathbf{b}^{\ast})'\mathbf{Y}-\mathbf{Y}'\mathbf{X}\mathbf{b}^{\ast}+\mathbf{b}^{\ast'}\mathbf{X}'\mathbf{X}\mathbf{b}^{\ast}\\ &= \mathbf{Y}'\mathbf{Y}-2\mathbf{Y}'\mathbf{X}\mathbf{b}^{\ast}+\mathbf{b}^{\ast'}\mathbf{X}'\mathbf{X}\mathbf{b}^{\ast}\\[2ex] \Rightarrow\quad\frac{d}{d\mathbf{b}^{\ast}}SSR(\mathbf{b}^{\ast}) &= -2\mathbf{X}'\mathbf{Y}+2\mathbf{X}'\mathbf{X}\mathbf{b}^{\ast} \end{aligned} \]

Setting the first derivative to zero yields the so-called normal equations

\[ \mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{X}'\mathbf{Y}, \]

which lead to the OLS estimator

\[ \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, \]

where \((\mathbf{X}'\mathbf{X})^{-1}\) exists (a.s.) because of our full rank assumption (Assumption 1.3).
Often it is useful to express \(\mathbf{b}\) (and similar other estimators) in sample moment notation:

\[\mathbf{b}=\mathbf{S}_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{S}_{\mathbf{X}\mathbf{Y}},\]

where \(\mathbf{S}_{\mathbf{X}\mathbf{X}}=n^{-1}\mathbf{X}'\mathbf{X}=n^{-1}\sum_i\mathbf{X}_i\mathbf{X}_i'\) and \(\mathbf{S}_{\mathbf{X}\mathbf{Y}}=n^{-1}\mathbf{X}'\mathbf{Y}=n^{-1}\sum_i\mathbf{X}_iY_i\). This notation is more convenient for developing our large sample results.
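The formulas above translate directly into code. The sketch below (on simulated data; the numerical values are assumptions of the example) computes \(\mathbf{b}\) in three equivalent ways: by solving the normal equations \(\mathbf{X}'\mathbf{X}\mathbf{b}=\mathbf{X}'\mathbf{Y}\), by a least-squares solver (numerically preferable to forming the explicit inverse), and via the sample moments \(\mathbf{S}_{\mathbf{X}\mathbf{X}}\) and \(\mathbf{S}_{\mathbf{X}\mathbf{Y}}\).

```python
# Sketch: three equivalent ways to compute the OLS estimator b.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])                      # assumed true coefficients
Y = X @ beta + rng.normal(size=n)

b_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)        # solve X'X b = X'Y
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)        # least-squares solver
Sxx = X.T @ X / n                                      # sample moment S_XX
SxY = X.T @ Y / n                                      # sample moment S_XY
b_moments = np.linalg.solve(Sxx, SxY)

print(np.allclose(b_normal_eq, b_lstsq), np.allclose(b_normal_eq, b_moments))  # True True
```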

Some quantities of interest:

  • The (OLS) fitted value: \(\hat{Y}_i=\mathbf{X}_i'\mathbf{b}\)
    In matrix notation: \(\hat{\mathbf{Y}}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{P}\mathbf{Y}\)

  • The (OLS) residual: \(\hat{\varepsilon}_i=Y_i-\hat{Y}_i\)
    In matrix notation: \(\hat{\boldsymbol{\varepsilon}} = \mathbf{Y}-\hat{\mathbf{Y}} = \left(\mathbf{I}_n-\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right)\mathbf{Y} = \mathbf{M}\mathbf{Y}\),

where \(\mathbf{P}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is a so-called orthogonal projection matrix that projects any vector onto the column space spanned by \(\mathbf{X}\), and \(\mathbf{M}=\mathbf{I}_n-\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the associated orthogonal projection matrix that projects any vector onto the space orthogonal to that spanned by \(\mathbf{X}\). Projection matrices have some nice properties, listed in the following lemma.

Lemma 3.1.1 (Orthogonal Projection matrices) For \(\mathbf{P}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) and \(\mathbf{M}=\mathbf{I}_n-\mathbf{P}\) with \(\mathbf{X}\) being of full rank it holds:

  • \(\mathbf{P}\) and \(\mathbf{M}\) are symmetric and idempotent, i.e.: \[\mathbf{P}\mathbf{P}=\mathbf{P}\quad\text{ and }\quad \mathbf{M}\mathbf{M}=\mathbf{M}.\]

  • Further properties: \[\mathbf{X}'\mathbf{P}=\mathbf{X}',\quad \mathbf{X}'\mathbf{M}=\mathbf{0},\quad\text{ and }\quad \mathbf{P}\mathbf{M}=\mathbf{0}.\]

Proofs follow directly from the definitions of \(\mathbf{P}\) and \(\mathbf{M}\).
Using these results we obtain the following proposition on the OLS residuals and OLS fitted values.

Proposition 3.1.2 (OLS residuals) For the OLS residuals and the OLS fitted values it holds that

\[ \begin{aligned} \mathbf{X}'\hat{\boldsymbol{\varepsilon}} &= \mathbf{0}, \quad\text{and}\\ \mathbf{Y}'\mathbf{Y} &= \hat{\mathbf{Y}}'\hat{\mathbf{Y}}+\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}. \end{aligned} \]

Proof. The first result can be shown as follows: \[\begin{align*} \mathbf{X}'\hat{\boldsymbol{\varepsilon}} &= \mathbf{X}'\mathbf{M}\mathbf{Y}\quad\text{(By Def. of $\mathbf{M}$)}\\ &= \mathbf{0}\mathbf{Y}\quad\text{(By Lemma 3.1.1 part (ii))}\\ &= \underset{(K\times 1)}{\mathbf{0}} \end{align*}\] The second result follows from:

\[\begin{align*} \mathbf{Y}'\mathbf{Y} &= (\mathbf{P}\mathbf{Y}+\mathbf{M}\mathbf{Y})'(\mathbf{P}\mathbf{Y}+\mathbf{M}\mathbf{Y})\quad\text{(By Def.~of $\mathbf{P}$ and $\mathbf{M}$)}\\ &= (\mathbf{Y}'\mathbf{P}'+\mathbf{Y}'\mathbf{M}')(\mathbf{P}\mathbf{Y}+\mathbf{M}\mathbf{Y})\\ &= \mathbf{Y}'\mathbf{P}'\mathbf{P}\mathbf{Y}+\mathbf{Y}'\mathbf{M}'\mathbf{M}\mathbf{Y}+\mathbf{0}\quad\text{(By Lemma 3.1.1 part (ii))}\\ &= \hat{\mathbf{Y}}'\hat{\mathbf{Y}}+\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \end{align*}\]


The vector of residuals \(\hat{\boldsymbol{\varepsilon}}\) has only \(n-K\) so-called degrees of freedom. The vector loses \(K\) degrees of freedom, since it has to satisfy the \(K\) linear restrictions (\(\mathbf{X}'\hat{\boldsymbol{\varepsilon}}=\mathbf{0}\)). In particular, in the case with intercept we have that \(\sum_{i=1}^n\hat{\varepsilon}_i=0\).
This loss of \(K\) degrees of freedom also appears in the definition of the unbiased variance estimator

\[ \begin{aligned} s^2 = \frac{1}{n-K}\sum_{i=1}^n\hat{\varepsilon}_i^2. \end{aligned} \]
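A small numerical check of Lemma 3.1.1, Proposition 3.1.2, and the degrees-of-freedom correction in \(s^2\) (simulated data; forming \(\mathbf{P}\) and \(\mathbf{M}\) explicitly is only advisable for small \(n\)):

```python
# Sketch: verify the projection-matrix properties and the Y'Y decomposition numerically.
import numpy as np

rng = np.random.default_rng(3)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)        # P = X (X'X)^{-1} X'
M = np.eye(n) - P
Y_hat = P @ Y                                # fitted values
e_hat = M @ Y                                # residuals

print(np.allclose(P @ P, P), np.allclose(M @ M, M))           # idempotency
print(np.allclose(X.T @ M, 0), np.allclose(X.T @ e_hat, 0))   # X'M = 0 and X'e_hat = 0
print(np.allclose(Y @ Y, Y_hat @ Y_hat + e_hat @ e_hat))      # Y'Y = Yhat'Yhat + e'e
s2 = e_hat @ e_hat / (n - K)                                  # unbiased variance estimator
print(s2)
```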

4.1.2 Coefficient of determination

The total sample variance of the dependent variable \(\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2\), where \(\bar{Y}=\frac{1}{n}\sum_{i=1}^nY_i\), can be decomposed as follows:

Proposition 3.1.3 (Variance decomposition) For the OLS regression of the linear model with intercept it holds that

\[ \begin{aligned} \underset{\text{total variance}}{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2} = \underset{\text{explained variance}}{\sum_{i=1}^n\left(\hat{Y}_i-\bar{\hat{Y}}\right)^2}+\underset{\text{unexplained variance}}{\sum_{i=1}^n\hat{\varepsilon}_i^2 } \end{aligned} \]

Proof.

  • As a consequence of Prop. 3.1.2 we have for regressions with intercept: \(\sum_{i=1}^n\hat{\varepsilon}_i=0\). Hence, from \(Y_i=\hat{Y}_i+\hat{\varepsilon}_i\) it follows that \[\begin{aligned} \frac{1}{n}\sum_{i=1}^n Y_i &= \frac{1}{n}\sum_{i=1}^n \hat{Y}_i+\frac{1}{n}\sum_{i=1}^n \hat{\varepsilon}_i \\ \bar{Y} &= \bar{\hat{Y}}+0. \end{aligned}\]

  • From Prop. 3.1.2 we know that:

\[\begin{align*} \mathbf{Y}'\mathbf{Y} &= \hat{\mathbf{Y}}'\hat{\mathbf{Y}}+\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \\ \mathbf{Y}'\mathbf{Y} -n\bar{Y}^2 &= \hat{\mathbf{Y}}'\hat{\mathbf{Y}}-n\bar{Y}^2+\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \\ \mathbf{Y}'\mathbf{Y}-n\bar{Y}^2 &= \hat{\mathbf{Y}}'\hat{\mathbf{Y}}-n\bar{\hat{Y}}^2+\hat{\boldsymbol{\varepsilon}}' \hat{\boldsymbol{\varepsilon}}\quad\text{(By our result above.)} \\ \sum_{i=1}^n Y_i^2-n\bar{Y}^2 &= \sum_{i=1}^n\hat{Y}_i^2-n\bar{\hat{Y}}^2+\sum_{i=1}^n\hat{\varepsilon}_i^2 \\ \sum_{i=1}^n (Y_i-\bar{Y})^2 &= \sum_{i=1}^n (\hat{Y}_i-\bar{\hat{Y}})^2+\sum_{i=1}^n \hat{\varepsilon}_i^2 \end{align*}\]
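A quick numerical verification of this variance decomposition on simulated data (the data-generating choices are assumptions; the regression includes an intercept, as required):

```python
# Sketch: total variance = explained variance + unexplained variance.
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
Y_hat, e_hat = X @ b, Y - X @ b

tss = np.sum((Y - Y.mean()) ** 2)          # total variance
ess = np.sum((Y_hat - Y_hat.mean()) ** 2)  # explained variance
rss = np.sum(e_hat ** 2)                   # unexplained variance
print(np.isclose(tss, ess + rss))          # True
```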


The larger the proportion of the explained variance, the better is the fit of the model. This motivates the definition of the so-called \(R^2\) coefficient of determination:

\[ R^2=\frac{\sum_{i=1}^n\left(\hat{Y}_i-\bar{\hat{Y}}\right)^2}{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2}\;=\;1-\frac{\sum_{i=1}^n\hat{\varepsilon}_i^2}{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2} \]

Obviously, we have that \(0\leq R^2\leq 1\). The closer \(R^2\) is to \(1\), the better the fit of the model to the observed data. However, a high (or low) \(R^2\) by itself does not validate (or falsify) the estimated model. Any relation (i.e., model assumption) needs a plausible explanation from relevant economic theory.

The most often criticized disadvantage of the \(R^2\) is that additional regressors (relevant or not) will never decrease the \(R^2\).

Proposition 3.1.4 (\(R^2\) increase)

Let \(R^2_1\) and \(R^2_2\) result from

\[ \begin{aligned} \mathbf{Y} &= \mathbf{X}_1\mathbf{b}_{11}+\hat{\boldsymbol{\varepsilon}}_1 \quad\text{and}\\ \mathbf{Y} &= \mathbf{X}_1\mathbf{b}_{21}+\mathbf{X}_2\mathbf{b}_{22}+\hat{\boldsymbol{\varepsilon}}_2. \end{aligned} \]

It then holds that \(R^2_2\geq R^2_1\).

Proof. Consider the sum of squared residuals,

\[\begin{align*} S(\mathbf{\mathfrak{b}}_{21},\mathbf{\mathfrak{b}}_{22})=(\mathbf{Y}-\mathbf{X}_1\mathbf{\mathfrak{b}}_{21}-\mathbf{X}_2\mathbf{\mathfrak{b}}_{22})'(\mathbf{Y}-\mathbf{X}_1\mathbf{\mathfrak{b}}_{21}-\mathbf{X}_2\mathbf{\mathfrak{b}}_{22}) \end{align*}\] By definition, this sum is minimized by the OLS estimators \(\mathbf{b}_{21}\) and \(\mathbf{b}_{22}\), i.e., \(S(\mathbf{b}_{21},\mathbf{b}_{22})\leq S(\mathbf{\mathfrak{b}}_{21},\mathbf{\mathfrak{b}}_{22})\). Consequently,

\[\begin{align*} \hat{\boldsymbol{\varepsilon}}_{2}'\hat{\boldsymbol{\varepsilon}}_{2}=S(\mathbf{b}_{21},\mathbf{b}_{22})\leq S(\mathbf{b}_{11},0)=\hat{\boldsymbol{\varepsilon}}_{1}'\hat{\boldsymbol{\varepsilon}}_{1} \end{align*}\] which implies the statement:

\[\begin{align*} R_2^2=1-\frac{\hat{\boldsymbol{\varepsilon}}_{2}'\hat{\boldsymbol{\varepsilon}}_{2}}{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2}\geq 1-\frac{\hat{\boldsymbol{\varepsilon}}_{1}'\hat{\boldsymbol{\varepsilon}}_{1}}{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2}=R_1^2 \end{align*}\]


Because of this, the \(R^2\) cannot be used as a criterion for model selection. Possible solutions are given by penalized criteria such as the so-called adjusted \(R^2\), defined as

\[ \begin{aligned} \overline{R}^2 &= 1-\frac{ \frac{1}{n-K} \sum_{i=1}^n \hat{\varepsilon}_i^2}{ \frac{1}{n-1} \sum_{i=1}^n \left(Y_i-\bar{Y}\right)^2} \\ &= 1-\frac{n-1}{n-K}\left(1-R^2\right) \\ &= 1-\frac{n-1}{n-K}+\frac{n-1}{n-K}R^2+\frac{K-1}{n-K}R^2-\frac{K-1}{n-K}R^2 \\ &= 1-\frac{n-1}{n-K}+R^2+\frac{K-1}{n-K}R^2 \\ &= -\frac{K-1}{n-K}+R^2+\frac{K-1}{n-K}R^2 \\ &= R^2-\frac{K-1}{n-K}\left(1-R^2\right) \leq R^2 \end{aligned} \]

The adjustment is in terms of degrees of freedom.
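The sketch below illustrates both points on simulated data (the added second regressor is pure noise by construction, an assumption of the example): the plain \(R^2\) never decreases when a regressor is added, while the adjusted \(\overline{R}^2\) can decrease.

```python
# Sketch: R^2 never falls when adding a regressor; adjusted R^2 can fall.
import numpy as np

rng = np.random.default_rng(5)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)
X2 = np.column_stack([X1, rng.normal(size=n)])      # add an irrelevant regressor

def r2_and_adj(X, Y):
    n, K = X.shape
    e = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    tss = np.sum((Y - Y.mean()) ** 2)
    r2 = 1 - e @ e / tss
    return r2, 1 - (n - 1) / (n - K) * (1 - r2)     # (R^2, adjusted R^2)

print(r2_and_adj(X1, Y))   # (R2_1, adj R2_1)
print(r2_and_adj(X2, Y))   # R2_2 >= R2_1, but the adjusted R^2 may be smaller
```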

4.1.3 Finite-Sample Properties of OLS

Notice that, by contrast to (the true but unknown) parameter vector \(\boldsymbol{\beta}\), \(\mathbf{b}\) is a stochastic quantity, since it depends on \(\boldsymbol{\varepsilon}\) through \(\mathbf{Y}\). The stochastic difference \(\mathbf{b}-\boldsymbol{\beta}\) is termed the sampling error:

\[ \begin{aligned} \mathbf{b}-\boldsymbol{\beta} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}-\boldsymbol{\beta}\\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})-\boldsymbol{\beta}\\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}+(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}-\boldsymbol{\beta}\\ &= \boldsymbol{\beta}+(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}-\boldsymbol{\beta}\\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon} \,. \end{aligned} \]
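Since \(\boldsymbol{\beta}\) and \(\boldsymbol{\varepsilon}\) are known in a simulation (but not in practice), this identity can be checked numerically; the data-generating values below are illustrative assumptions.

```python
# Tiny check of the sampling-error identity b - beta = (X'X)^{-1} X' eps.
import numpy as np

rng = np.random.default_rng(9)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(size=n)
Y = X @ beta + eps

b = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(b - beta, np.linalg.solve(X.T @ X, X.T @ eps)))  # True
```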

The distribution of \(\mathbf{b}\) depends (among other things) on the sample size \(n\), although this is not made explicit by our notation. In this section, we focus on the case of a fixed, finite sample size \(n\).

Theorem 4.1 (Finite Sample Properties) The OLS estimator \(\mathbf{b}\)

  • is an unbiased estimator: \(\mathbb{E}(\mathbf{b}|\mathbf{X})=\boldsymbol{\beta}\)

  • has variance: \(\mathbb{V}(\mathbf{b}|\mathbf{X})=\sigma^2(\mathbf{X}' \mathbf{X})^{-1}\)

  • (Gauss-Markov Theorem) is efficient in the class of all linear unbiased estimators. That is, for any unbiased estimator \(\tilde{\mathbf{b}}\) that is linear in \(\mathbf{Y}\), we have: \(\mathbb{V}(\tilde{\mathbf{b}}|\mathbf{X}) \geq \mathbb{V}(\mathbf{b} | \mathbf{X})\) in the matrix sense.

While parts (ii) and (iii) need all of the classical Assumptions 1.1-1.4, part (i) needs only Assumptions 1.1-1.3.

Note that, by saying: “\(\mathbb{V}(\tilde{\mathbf{b}}|\mathbf{X}) \geq \mathbb{V}(\mathbf{b} | \mathbf{X})\) in the matrix sense”, we mean that \(\mathbb{V}(\tilde{\mathbf{b}}|\mathbf{X}) - \mathbb{V}(\mathbf{b} | \mathbf{X}) = \mathbf{D}\), where \(\mathbf{D}\) is a positive semidefinite \(K\times K\) matrix, i.e., \(\mathbf{a}'\mathbf{D}\mathbf{a}\geq 0\) for any \(K\)-dimensional vector \(\mathbf{a}\). Observe that this implies that \(\mathbb{V}(\tilde{b}_k|\mathbf{X}) \geq \mathbb{V}(b_k | \mathbf{X})\) for any \(k=1,\dots,K\).
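A Monte Carlo sketch of parts (i) and (ii) of Theorem 4.1: holding the design \(\mathbf{X}\) fixed and redrawing \(\boldsymbol{\varepsilon}\) many times, the average of \(\mathbf{b}\) should be close to \(\boldsymbol{\beta}\) and the sample covariance of the draws close to \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\). All data-generating values below are illustrative assumptions.

```python
# Monte Carlo sketch: unbiasedness and the conditional variance formula.
import numpy as np

rng = np.random.default_rng(6)
n, R, sigma = 50, 20_000, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design
beta = np.array([1.0, 2.0, -0.5])

XtX_inv = np.linalg.inv(X.T @ X)
draws = np.empty((R, beta.size))
for r in range(R):
    Y = X @ beta + sigma * rng.normal(size=n)                # redraw only eps
    draws[r] = XtX_inv @ X.T @ Y                             # OLS for this draw

print(draws.mean(axis=0))                     # ~ beta (unbiasedness)
print(np.round(np.cov(draws, rowvar=False), 4))
print(np.round(sigma**2 * XtX_inv, 4))        # ~ the Monte Carlo covariance above
```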


4.2 Asymptotics under the Classic Regression Model

In this section we prove that the OLS estimators \(\mathbf{b}\) and \(s^2\), applied to the classic regression model (defined by Assumptions 1.1 to 1.4), are consistent estimators as \(n\to\infty\). Even better, we can show that it is possible to drop the unrealistic normality assumption (Assumption 1.5) and still use the usual test statistics as long as the sample size \(n\) is large. However, before we can formally state the asymptotic properties, we first need to adjust the rank assumption (Assumption 1.3) such that the full column rank of \(\mathbf{X}\) is guaranteed for the limiting case \(n\to\infty\), too. Second, we will assume that the sample \((Y_i,\mathbf{X}_i)\) is iid, which allows us to apply Kolmogorov’s strong LLN and Lindeberg-Levy’s CLT and simplifies some of the technical arguments.

Assumption 1.3\(^\ast\): \(\mathbb{E}(\mathbf{X}_i\mathbf{X}_i')=\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}},\)
such that the \((K\times K)\) matrix \(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\) has full rank \(K\) (i.e., is nonsingular).

Assumption 1.5\(^\ast\): The sample \((\mathbf{X}_i,\varepsilon_i)\), equivalently \((Y_i,\mathbf{X}_i)\), is iid for all \(i=1,\dots,n\), with existing and finite first, second, third, and fourth moments.

Note that existence and finiteness of the first two moments of \(\mathbf{X}_i\) is actually already implied by Assumption 1.3\(^\ast\).

Under Assumptions 1.1, 1.2, 1.3\(^\ast\), 1.4, and 1.5\(^\ast\) we can show the following results.

Proposition 3.1.8 (Consistency of \(\mathbf{S}_{\mathbf{X}\mathbf{X}}^{-1}\))

\[ \begin{aligned} \left( \frac{1}{n} \mathbf{X}'\mathbf{X} \right)^{-1} = \mathbf{S}_{\mathbf{X}\mathbf{X}}^{-1} \quad \to_{P} \quad\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1} \end{aligned} \]

Proof. 1st Part: Let us define \(\bar{Z}_{kl}\) as \[[\mathbf{S}_{\mathbf{X}\mathbf{X}}]_{kl} =\frac{1}{n}\sum_{i=1}^n\underbrace{X_{ik}X_{il}}_{Z_{i,kl}}=\bar{Z}_{kl}.\]

From: \[\begin{array}{ll} \mathbb{E}[Z_{i,kl}]=[\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}]_{kl}&\quad\text{(By Assumption 1.3}^\ast)\\ \text{and}&\\ Z_{i,kl}\quad\text{is iid and has four moments} &\quad\text{(By Assumption 1.5}^\ast) \end{array}\] it follows by Kolmogorov’s law of large numbers that \[\begin{aligned} \bar{Z}_{kl}\to_{P} \left[\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\right]_{kl},\quad\text{for any}\quad 1\leq k,l\leq K. \end{aligned}\] Consequently, \(\mathbf{S}_{\mathbf{X}\mathbf{X}}\to_{P}\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\) element-wise.
2nd Part: By the Continuous Mapping Theorem we have that also \[\begin{aligned} \left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\to_{P}\left(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\right)^{-1}. \end{aligned}\]


Proposition 3.1.9 (Consistency of \(\mathbf{b}\)) \[\mathbf{b}\to_{P}\boldsymbol{\beta}\]

Proof. We show the equivalent result that \(\mathbf{b}-\boldsymbol{\beta}\to_{P} \mathbf{0}\).
Remember: \[\begin{aligned} \mathbf{b}-\boldsymbol{\beta} &=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\\ &=(n^{-1}\mathbf{X}'\mathbf{X})^{-1}\frac{1}{n}\mathbf{X}'\boldsymbol{\varepsilon}\\ &=\left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\;\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\end{aligned}\] From Proposition 3.1.8: \(\left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\to_{P} \left(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\right)^{-1}\).
Let us focus on element-by-element asymptotics of \(\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\):
Define \[\begin{aligned} \frac{1}{n}\sum_{i=1}^n\underbrace{X_{ik}\varepsilon_i}_{Z_{ik}}=\bar{Z}_{n,k}.\end{aligned}\] From \[\begin{array}{ll} \mathbb{E}[Z_{ik}]=\mathbb{E}[X_{ik}\varepsilon_i]=0&\quad\text{(By Str. Exog. Ass 1.2)}\\ \text{and}&\\ Z_{ik}\quad\text{is iid and has four moments} &\quad\text{(By Assumption 1.5$^\ast$)} \end{array}\] it follows by Kolmogorov’s law of large numbers that \[\begin{aligned} \bar{Z}_{n,k}=\frac{1}{n}\sum_{i=1}^nX_{ik}\varepsilon_i&\to_{P} 0\quad\text{for any}\quad 1\leq k\leq K.\end{aligned} \] Consequently, also \[\begin{aligned} \frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i&\to_{P}\underset{(K\times 1)}{\mathbf{0}} \quad(\text{element-wise}).\end{aligned}\] Final step: From \[ \left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\to_{P}\left(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\right)^{-1}\] and \[\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\to_{P}\mathbf{0}\] it follows by Slutsky’s Theorem that \[\begin{align*} \mathbf{b}-\boldsymbol{\beta} &=\left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\;\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\to_{P} \mathbf{0}. \end{align*}\]
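A simulation sketch of Propositions 3.1.8 and 3.1.9 (the iid design with a constant plus standard normal regressors is an assumption; for it, \(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}=\mathbf{I}_3\)): as \(n\) grows, both \(\mathbf{S}_{\mathbf{X}\mathbf{X}}\) and \(\mathbf{b}\) approach their limits.

```python
# Sketch: S_XX -> Sigma_XX and b -> beta as n grows (iid simulated data).
import numpy as np

rng = np.random.default_rng(7)
beta = np.array([1.0, 2.0, -0.5])
Sigma_XX = np.eye(3)             # E(X_i X_i') for this particular design

for n in [100, 10_000, 1_000_000]:
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = X @ beta + rng.normal(size=n)
    Sxx = X.T @ X / n
    b = np.linalg.solve(Sxx, X.T @ Y / n)
    # both maximal deviations should shrink toward zero as n increases
    print(n, np.abs(Sxx - Sigma_XX).max(), np.abs(b - beta).max())
```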


Furthermore, we can show that the appropriately scaled (by \(\sqrt{n}\)) sampling error \(\mathbf{b}-\boldsymbol{\beta}\) of the OLS estimator is asymptotically normal distributed.

Proposition 3.1.10 (Sampling error limiting normality) \[\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})\overset{d}\longrightarrow N(\mathbf{0},\sigma^2 \boldsymbol{\Sigma}^{-1}_{\mathbf{X}\mathbf{X}}).\]

In order to show Proposition 3.1.10, we will need to use the so-called Cramér Wold Device on multivariate convergence in distribution:

Cramér Wold Device: Let \(\mathbf{Z}_n,\mathbf{Z}\in\mathbb{R}^K\), then
\[\mathbf{Z}_n\overset{d}\longrightarrow \mathbf{Z} \quad \text{if and only if} \quad \boldsymbol{\lambda}'\mathbf{Z}_n\overset{d}\longrightarrow \boldsymbol{\lambda}'\mathbf{Z}\] for any \(\boldsymbol{\lambda}\in\mathbb{R}^K\).

The Cramér Wold Device is needed, since convergence in distribution element-by-element \(Z_{k,n}\overset{d}\longrightarrow z_k\) for \(k=1,\dots,K\), does not imply multivariate convergence in distribution \(\mathbf{Z}_n\overset{d}\longrightarrow \mathbf{Z}\).

Proof. Let’s start with some rearrangements: \[\begin{aligned} \mathbf{b}-\boldsymbol{\beta} &=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\\ &=(n^{-1}\mathbf{X}'\mathbf{X})^{-1}\frac{1}{n}\mathbf{X}'\boldsymbol{\varepsilon}\\ &=\left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\;\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\\ \Leftrightarrow\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})&=\left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\;\left(\sqrt{n}\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\right)\end{aligned}\]

From Proposition 3.1.8, we already know that \[\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}=\mathbf{S}_{\mathbf{X}\mathbf{X}}^{-1}\quad\overset{p}\longrightarrow \quad\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1}.\]

What happens with \[\begin{aligned} \sqrt{n}\underbrace{\frac{1}{n}\sum_{i=1}^n\overbrace{\mathbf{X}_i\varepsilon_i}^{\mathbf{Z}_i}}_{\bar{\mathbf{Z}}_n}=\sqrt{n}\,\bar{\mathbf{Z}}_n\quad ?\end{aligned}\] In the following we show that \(\sqrt{n}\,\bar{\mathbf{Z}}_n\overset{d}\longrightarrow N(\mathbf{0},\sigma^2\,\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}})\) using the Cramér Wold Device:
1st Moment: \[\begin{aligned} \mathbb{E}(\boldsymbol{\lambda}'\mathbf{Z}_i)&= \boldsymbol{\lambda}'\;\underset{\text{(By Str. Exog. Ass 1.2)}}{\underbrace{\left(\begin{matrix}\mathbb{E}(X_{i1}\varepsilon_i)\\\vdots\\\mathbb{E}(X_{iK}\varepsilon_i)\end{matrix}\right)}_{\mathbf{0}}}=\boldsymbol{\lambda}'\mathbf{0}=0,\end{aligned}\]

for any \(\boldsymbol{\lambda}\in\mathbb{R}^{K}\) and for all \(i=1,2,\dots\)
2nd Moment: \[\begin{aligned} \mathbb{V}(\boldsymbol{\lambda}'\mathbf{Z}_i) &=\boldsymbol{\lambda}'\mathbb{V}(\mathbf{Z}_i)\boldsymbol{\lambda}\\ &=\boldsymbol{\lambda}'\mathbb{E}(\varepsilon_i^2\mathbf{X}_i\mathbf{X}_i')\boldsymbol{\lambda}\\ &=\boldsymbol{\lambda}'\mathbb{E}(\mathbb{E}(\varepsilon_i^2\mathbf{X}_i\mathbf{X}_i'|\mathbf{X}))\boldsymbol{\lambda}\\ &=\boldsymbol{\lambda}'\mathbb{E}(\mathbf{X}_i\mathbf{X}_i'\underset{\text{(Ass 1.4)}}{\underbrace{\mathbb{E}(\varepsilon_i^2|\mathbf{X})}_{=\sigma^2}})\boldsymbol{\lambda}\\ &=\boldsymbol{\lambda}'\sigma^2\underset{\text{(Ass $1.3^\ast$)}}{\underbrace{\mathbb{E}(\mathbf{X}_i\mathbf{X}_i')}_{\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}}}\boldsymbol{\lambda}=\sigma^2\boldsymbol{\lambda}'\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\boldsymbol{\lambda},\end{aligned}\] for any \(\boldsymbol{\lambda}\in\mathbb{R}^{K}\) and for all \(i=1,2,\dots\)
From \(\mathbb{E}(\boldsymbol{\lambda}'\mathbf{Z}_i)=0\), \(\mathbb{V}(\boldsymbol{\lambda}'\mathbf{Z}_i)=\sigma^2\boldsymbol{\lambda}'\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\boldsymbol{\lambda}\), and \(\mathbf{Z}_i=(\mathbf{X}_i\varepsilon_i)\) being iid (Ass \(1.5^\ast\)), it follows by the Lindeberg-Levy’s CLT and the Cramér Wold Device that \[\begin{aligned} \sqrt{n}\boldsymbol{\lambda}'\bar{\mathbf{Z}}_n&\overset{d}\longrightarrow N(0,\sigma^2\boldsymbol{\lambda}'\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}\boldsymbol{\lambda})\quad\text{(By Lindeberg-Levy's CLT)}\\ \Leftrightarrow \underbrace{\sqrt{n}\bar{\mathbf{Z}}_n}_{=\sqrt{n}\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i}&\overset{d}\longrightarrow N(\mathbf{0},\sigma^2\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}})\quad\text{(Cramér Wold Device)}\end{aligned}\]

Now, we can conclude the proof:
From \(\mathbf{S}_{\mathbf{X}\mathbf{X}}^{-1}\overset{p}\longrightarrow \boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1}\) (by Proposition 3.1.8) and
\(\sqrt{n}\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\overset{d}\longrightarrow N(\mathbf{0},\sigma^2\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}})\) it follows by Slutsky’s Theorem that

\[\begin{align*} \underbrace{\left(\mathbf{S}_{\mathbf{X}\mathbf{X}}\right)^{-1}\;\left(\sqrt{n}\frac{1}{n}\sum_{i=1}^n\mathbf{X}_i\varepsilon_i\right)}_{\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})}\overset{d}\longrightarrow N\left(\mathbf{0},\underbrace{(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1})\,(\sigma^2\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}})\,(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1})'}_{\sigma^2\boldsymbol{\Sigma}^{-1}_{\mathbf{X}\mathbf{X}}}\right) \end{align*}\]
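A Monte Carlo sketch of Proposition 3.1.10 (again with an iid design of a constant plus standard normal regressors, so that \(\boldsymbol{\Sigma}_{\mathbf{X}\mathbf{X}}=\mathbf{I}_3\); these choices are assumptions of the example): across replications, the sample covariance of \(\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})\) should be close to \(\sigma^2\boldsymbol{\Sigma}^{-1}_{\mathbf{X}\mathbf{X}}\).

```python
# Monte Carlo sketch: covariance of sqrt(n)(b - beta) vs. sigma^2 Sigma_XX^{-1}.
import numpy as np

rng = np.random.default_rng(8)
n, R, sigma = 400, 5_000, 1.0
beta = np.array([1.0, 2.0, -0.5])
Sigma_XX = np.eye(3)                        # E(X_i X_i') for this design

draws = np.empty((R, beta.size))
for r in range(R):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = X @ beta + sigma * rng.normal(size=n)
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    draws[r] = np.sqrt(n) * (b - beta)      # scaled sampling error

print(np.round(np.cov(draws, rowvar=False), 2))        # ~ sigma^2 * Sigma_XX^{-1}
print(np.round(sigma**2 * np.linalg.inv(Sigma_XX), 2))
```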