A.2 Least squares and maximum likelihood estimation

Least squares has had a prominent role in linear models. In a certain sense, this is strange: after all, it is a purely geometric argument for fitting a plane to a cloud of points, and therefore it does not seem to rely on any statistical grounds for estimating the unknown parameters \(\boldsymbol{\beta}.\)

However, as we will see, least squares estimation is equivalent to maximum likelihood estimation under the assumptions of the model seen in Section 2.3¹. So maximum likelihood estimation, the best-known statistical estimation method, is behind least squares if the assumptions of the model hold.

First, recall that given the sample \(\{(\mathbf{X}_i,Y_i)\}_{i=1}^n,\) due to the assumptions introduced in Section 2.3, we have that:

\[\begin{align*} Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim \mathcal{N}(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip},\sigma^2), \end{align*}\]

with \(Y_1,\ldots,Y_n\) being independent conditionally on the sample of predictors. Equivalently, in compact matrix form (recall the notation behind (2.6)):

\[\begin{align*} \mathbf{Y}|\mathbf{X}\sim\mathcal{N}_n(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}). \end{align*}\]
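To make the setting concrete, the following minimal sketch (in Python, with assumed toy values for \(n,\) \(\boldsymbol{\beta},\) and \(\sigma\)) simulates a sample from this conditional model.

```python
# Minimal sketch (assumed toy values): simulate a sample from
# Y | X ~ N_n(X beta, sigma^2 I) with p = 2 predictors plus an intercept.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 2
beta = np.array([0.5, -1.0, 2.0])  # (beta_0, beta_1, beta_2); assumed values
sigma = 1.5                        # assumed error standard deviation

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
# Y_i = x_i' beta + eps_i with eps_i ~ N(0, sigma^2), independent
Y = X @ beta + rng.normal(scale=sigma, size=n)
```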

From these two equations we can obtain the log-likelihood function of \(Y_1,\ldots,Y_n\) conditionally² on \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) as

\[\begin{align} \ell(\boldsymbol{\beta})=\log\left(\phi(\mathbf{Y};\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I})\right)=\sum_{i=1}^n\log\left(\phi(Y_i;(\mathbf{X}\boldsymbol{\beta})_i,\sigma)\right).\tag{A.3} \end{align}\] Maximization of (A.3) with respect to \(\boldsymbol{\beta}\) gives the maximum likelihood estimator \(\hat{\boldsymbol{\beta}}_\mathrm{ML}.\)
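As a check on the two equivalent forms in (A.3), the following sketch (continuing the previous snippet; the helper `loglik` is introduced here purely for illustration) evaluates the log-likelihood both through the multivariate normal density and as a sum of univariate normal log-densities.

```python
# Sketch: evaluate (A.3) in its two equivalent forms (multivariate vs. sum of
# univariate normal log-densities), reusing X, Y, beta, sigma from above.
import numpy as np
from scipy.stats import multivariate_normal, norm

def loglik(beta_vec, X, Y, sigma):
    """Conditional log-likelihood of beta, with sigma treated as known, as in (A.3)."""
    return norm.logpdf(Y, loc=X @ beta_vec, scale=sigma).sum()

ll_sum = loglik(beta, X, Y, sigma)
ll_mvn = multivariate_normal.logpdf(Y, mean=X @ beta, cov=sigma**2 * np.eye(len(Y)))
print(np.isclose(ll_sum, ll_mvn))  # True: both forms of (A.3) agree
```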

Now we are ready to show the next result.

Theorem A.1 Under the assumptions i–iv in Section 2.3, the maximum likelihood estimate of \(\boldsymbol{\beta}\) is the least squares estimate (2.7):

\[\begin{align*} \hat{\boldsymbol{\beta}}_\mathrm{ML}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\ell(\boldsymbol{\beta})=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \end{align*}\]

Proof. Expanding the first equality in (A.3) gives³

\[\begin{align*} \ell(\boldsymbol{\beta})=-\log\left((2\pi)^{n/2}\sigma^n\right)-\frac{1}{2\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \end{align*}\]

In order to differentiate with respect to \(\boldsymbol{\beta},\) we use that, for a matrix \(\mathbf{A}\) and two vector-valued functions \(f\) and \(g\):

\[\begin{align*} \frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{A}\text{ and } \frac{\partial f(\mathbf{x})'g(\mathbf{x})}{\partial \mathbf{x}}=f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}+g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}. \end{align*}\]

Then, differentiating with respect to \(\boldsymbol{\beta}\) and equating to zero gives

\[\begin{align*} \frac{1}{\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'\mathbf{X}=\frac{1}{\sigma^2}(\mathbf{Y}'\mathbf{X}-\boldsymbol{\beta}'\mathbf{X}'\mathbf{X})=0. \end{align*}\]

This means that optimizing \(\ell\) does not require knowledge of \(\sigma^2\)! This very convenient fact allows us to solve the above equation: transposing gives \(\mathbf{X}'\mathbf{X}\boldsymbol{\beta}=\mathbf{X}'\mathbf{Y}\) and, since \(\mathbf{X}'\mathbf{X}\) is assumed to be invertible, this yields

\[\begin{align*} \hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \end{align*}\]
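As a quick numerical illustration (a sketch continuing the previous snippets, not part of the proof), the closed-form estimate can be checked to coincide with a numerical maximizer of (A.3).

```python
# Sketch (continuing the previous snippets): the closed-form estimate
# (X'X)^{-1} X'Y coincides with a numerical maximizer of (A.3).
import numpy as np
from scipy.optimize import minimize

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # least squares via the normal equations

# Maximize the log-likelihood numerically (minimize its negative). The value of
# sigma does not affect the location of the maximum, so any fixed value works.
res = minimize(lambda b: -loglik(b, X, Y, sigma=1.0), x0=np.zeros(X.shape[1]))
print(np.allclose(beta_hat, res.x, atol=1e-4))  # True
```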

A final comment on the benefits of relying on maximum likelihood estimation follows.

Maximum likelihood estimation is, under regularity conditions, asymptotically optimal for estimating the unknown parameters of a model. This is a very appealing property: roughly speaking, when the sample size \(n\) is large, no other well-behaved estimation method is expected to outperform it, where "outperform" is understood in terms of the mean squared error.
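A small Monte Carlo sketch of this property follows (with assumed toy values; least absolute deviations is used merely as an example of an alternative estimator): under normal errors and a moderately large \(n,\) the least squares/ML estimate tends to have the smaller mean squared error.

```python
# Monte Carlo sketch: under normal errors, the ML/least squares estimate tends to
# have smaller MSE than an alternative estimator (least absolute deviations).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, M = 500, 200                    # sample size and number of replicates (assumed)
beta = np.array([0.5, -1.0, 2.0])  # true coefficients (assumed)
mse_ls = mse_lad = 0.0

for _ in range(M):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = X @ beta + rng.normal(scale=1.5, size=n)
    b_ls = np.linalg.solve(X.T @ X, X.T @ Y)
    b_lad = minimize(lambda b: np.abs(Y - X @ b).sum(), x0=b_ls,
                     method="Nelder-Mead").x
    mse_ls += np.sum((b_ls - beta) ** 2) / M
    mse_lad += np.sum((b_lad - beta) ** 2) / M

print(mse_ls, mse_lad)  # mse_ls is expected to be the smaller of the two
```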


  1. Normality is especially important here due to the squares present in the exponential of the normal pdf.↩︎

  2. Since we assume that the randomness is on the response only.↩︎

  3. Recall that \(|\sigma^2\mathbf{I}|^{1/2}=\sigma^{n}.\)↩︎