A.2 Least squares and maximum likelihood estimation
Least squares has played a prominent role in linear models. In a certain sense, this is strange: after all, it is a purely geometrical argument for fitting a plane to a cloud of points and, therefore, it does not seem to rely on any statistical grounds for estimating the unknown parameters \boldsymbol{\beta}.
However, as we will see, least squares estimation is equivalent to maximum likelihood estimation under the assumptions of the model seen in Section 2.3. Therefore, maximum likelihood estimation, the best-known statistical estimation method, is behind least squares whenever the assumptions of the model hold.
First, recall that given the sample \{(\mathbf{X}_i,Y_i)\}_{i=1}^n, due to the assumptions introduced in Section 2.3, we have that:
\begin{align*} Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim \mathcal{N}(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip},\sigma^2), \end{align*}
with Y_1,\ldots,Y_n being independent conditionally on the sample of predictors. Equivalently, in compact matrix form (recall the notation behind (2.6)):
\begin{align*} \mathbf{Y}|\mathbf{X}\sim\mathcal{N}_n(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}). \end{align*}
From these two equations we can obtain the log-likelihood function of Y_1,\ldots,Y_n, conditionally on \mathbf{X}_1,\ldots,\mathbf{X}_n, as
\begin{align} \ell(\boldsymbol{\beta})=\log\left(\phi(\mathbf{Y};\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\right)=\sum_{i=1}^n\log\left(\phi(Y_i;(\mathbf{X}\boldsymbol{\beta})_i,\sigma)\right).\tag{A.3} \end{align} Maximization of (A.3) with respect to \boldsymbol{\beta} gives the maximum likelihood estimator \hat{\boldsymbol{\beta}}_\mathrm{ML}.
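To make the identity in (A.3) concrete, the following minimal sketch (in Python, with simulated data; the variable names are illustrative and not part of the text) checks numerically that the log-density of \mathcal{N}_n(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}) evaluated at \mathbf{Y} coincides with the sum of the univariate normal log-densities.

```python
# Sanity check of (A.3): joint log-density with covariance sigma^2 * I
# equals the sum of univariate normal log-densities (simulated data).
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(42)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -0.5])
sigma = 1.3
Y = X @ beta + rng.normal(scale=sigma, size=n)

mean = X @ beta
ll_joint = multivariate_normal.logpdf(Y, mean=mean, cov=sigma**2 * np.eye(n))
ll_sum = norm.logpdf(Y, loc=mean, scale=sigma).sum()
print(np.isclose(ll_joint, ll_sum))  # True
```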
Now we are ready to show the next result.
Theorem A.1 Under the assumptions i–iv in Section 2.3, the maximum likelihood estimate of \boldsymbol{\beta} is the least squares estimate (2.7):
\begin{align*} \hat{\boldsymbol{\beta}}_\mathrm{ML}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\ell(\boldsymbol{\beta})=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \end{align*}
Proof. Expanding the first equality in (A.3) gives
\begin{align*} \ell(\boldsymbol{\beta})=-\log\left((2\pi)^{n/2}\sigma^n\right)-\frac{1}{2\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \end{align*}
In order to differentiate with respect to \boldsymbol{\beta}, we use that, for a matrix \mathbf{A} and two vector-valued functions f and g:
\begin{align*} \frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{A}\text{ and } \frac{\partial f(\mathbf{x})'g(\mathbf{x})}{\partial \mathbf{x}}=f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}+g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}. \end{align*}
Then, differentiating with respect to \boldsymbol{\beta} and equating to zero gives
\begin{align*} \frac{1}{\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'\mathbf{X}=\frac{1}{\sigma^2}(\mathbf{Y}'\mathbf{X}-\boldsymbol{\beta}'\mathbf{X}'\mathbf{X})=0. \end{align*}
This means that optimizing \ell does not require knowledge of \sigma^2! This is a very convenient fact that allows us to solve the above equation: transposing it gives the normal equations \mathbf{X}'\mathbf{X}\boldsymbol{\beta}=\mathbf{X}'\mathbf{Y}, which yield
\begin{align*} \hat{\boldsymbol{\beta}}_\mathrm{ML}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \end{align*}
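The closed form can be verified numerically: the sketch below (in Python, with simulated data; neg_log_lik and the other names are illustrative, not from the text) maximizes (A.3) in \boldsymbol{\beta} and checks that the result matches (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, for an arbitrary value of \sigma^2 plugged into the objective.

```python
# Maximizing the log-likelihood (A.3) over beta recovers (X'X)^{-1} X'Y,
# whatever value of sigma is used in the objective (simulated data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=1.3, size=n)

def neg_log_lik(beta, sigma=1.0):
    # Negative of (A.3) up to an additive constant; sigma only rescales
    # the objective and does not change the maximizer
    resid = Y - X @ beta
    return 0.5 / sigma**2 * resid @ resid

beta_ml = minimize(neg_log_lik, x0=np.zeros(p + 1)).x
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)  # (X'X)^{-1} X'Y via the normal equations
print(np.allclose(beta_ml, beta_ls, atol=1e-5))  # True
```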
A final comment on the benefits of relying on maximum likelihood estimation follows.
Maximum likelihood estimation is asymptotically efficient when estimating the unknown parameters of a model. This is a very appealing property: it means that, when the sample size n is large, no other regular estimator can attain a smaller asymptotic mean squared error than the maximum likelihood estimator.
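As a quick illustration of this efficiency in the simplest possible setting (an intercept-only model with Gaussian errors; the simulation below is a sketch added for illustration, not part of the original text), the sample mean, which is the maximum likelihood estimator of \beta_0, attains a smaller mean squared error than a competing estimator such as the sample median, with the ratio of their errors approaching \pi/2.

```python
# Monte Carlo comparison of the MLE of beta_0 (sample mean) against the
# sample median in the intercept-only model with Gaussian errors.
import numpy as np

rng = np.random.default_rng(1)
beta0, sigma, n, M = 1.0, 1.0, 500, 5000  # M Monte Carlo replicates of size n
Y = beta0 + rng.normal(scale=sigma, size=(M, n))

mse_mean = np.mean((Y.mean(axis=1) - beta0) ** 2)
mse_median = np.mean((np.median(Y, axis=1) - beta0) ** 2)
print(mse_mean, mse_median, mse_median / mse_mean)  # ratio close to pi/2 (about 1.57)
```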