Maximum likelihood or least squares

Since the likelihood function indicates how likely the observed data are given particular parameter values, the maximum likelihood estimates are the parameter values under which the observed data are most probable. The least squares estimates instead minimise the sum of squared differences between the observed data and the values predicted by the model.

Maximum likelihood can be applied whenever we can specify a probability distribution for the response data \(\boldsymbol{Y}\). Here we were able to specify a multivariate normal distribution under the normality assumption on the residuals \(\boldsymbol{\epsilon}\), and the least squares approach is likewise appropriate under this assumption. In fact, least squares estimation is a special case of maximum likelihood in this specific scenario, the normal linear regression model. Intuitively, maximum likelihood aims to maximise the probability of the data. When we model the data with a normal distribution (a symmetric bell-shaped curve), the probability of a data point is greatest when it lies close to the mean (the peak of the curve), and because the distribution is symmetric about the mean, it does not matter whether the point sits above or below it. Maximising the likelihood is therefore equivalent to minimising the (squared) distance between the data and the mean.
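
This equivalence can be checked numerically. The following is a minimal sketch (simulated data, arbitrary parameter values) in which minimising the normal negative log-likelihood recovers the same coefficients as the closed-form least squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Least squares: minimise the sum of squared residuals.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum likelihood under a normal model: minimise the negative
# log-likelihood jointly over (beta, log sigma).
def neg_log_lik(theta):
    beta, log_sigma = theta[:-1], theta[-1]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

fit = minimize(neg_log_lik, np.zeros(p + 1), method="BFGS")
beta_ml = fit.x[:-1]

print(np.allclose(beta_ls, beta_ml, atol=1e-4))  # True, up to optimiser tolerance
```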

Maximum likelihood can also be used for estimating parameters of non-normal models.
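
For instance (a hypothetical sketch, not an example from the text), the same numerical approach fits a Poisson regression for count data, where no least squares interpretation applies:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))  # counts: a non-normal response

# Poisson negative log-likelihood with a log link (y! term dropped,
# as it does not depend on beta).
def neg_log_lik(beta):
    eta = X @ beta
    return np.sum(np.exp(eta) - y * eta)

beta_ml = minimize(neg_log_lik, np.zeros(2), method="BFGS").x
print(beta_ml)  # close to beta_true
```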

In the following sections, we will look more closely at some of the properties of \(\hat{\beta}\).

Distribution of \(\hat{\beta}\)

\[\hat{\beta}=(X^TX)^{-1}X^TY\]

\[Y|X \sim N(X\beta,\Sigma), \quad \text{where } \Sigma=\sigma^2 I_n\]
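
The closed-form estimator can be evaluated directly. A minimal numpy sketch with simulated data (a linear solve is used rather than an explicit inverse, for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([3.0, -1.0])
Y = X @ beta + rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)
```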

What is the distribution of \(\hat{\beta}\)?

Expectation of \(\hat{\beta}\)

\[\begin{eqnarray*} E(\hat{\beta})&=&E((X^TX)^{-1}X^TY)\\ &=&(X^TX)^{-1}X^TE(Y)\\ &=&(X^TX)^{-1}X^TX\beta\\ &=&\beta \end{eqnarray*}\]
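
This unbiasedness can be checked by simulation (a sketch; the design and parameter values are arbitrary): averaging \(\hat{\beta}\) over many simulated responses recovers \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
beta = np.array([3.0, -1.0])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)  # (X^T X)^{-1} X^T, precomputed

# Draw many responses from the model and re-estimate each time.
estimates = np.array([XtX_inv_Xt @ (X @ beta + rng.normal(size=n))
                      for _ in range(reps)])
print(estimates.mean(axis=0))  # approximately (3.0, -1.0)
```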

Covariance of \(\hat{\beta}\)

\[\begin{eqnarray*} Var(\hat{\beta})&=&Var((X^TX)^{-1}X^TY)\\ &=&Var((X^TX)^{-1}X^T[X\beta+\epsilon])\\ &=&Var((X^TX)^{-1}X^TX\beta+(X^TX)^{-1}X^T\epsilon)\\ &=&Var(\beta+(X^TX)^{-1}X^T\epsilon)\\ &=&Var((X^TX)^{-1}X^T\epsilon)\\ &=&(X^TX)^{-1}X^TVar(\epsilon)X(X^TX)^{-1}\\ &=&(X^TX)^{-1}X^T \Sigma X(X^TX)^{-1}\\ &=&\sigma^2 (X^TX)^{-1}X^T X(X^TX)^{-1}\\ &=&\sigma^2 (X^TX)^{-1} \end{eqnarray*}\]

where the penultimate step uses \(Var(\epsilon)=\Sigma=\sigma^2 I_n\).
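
The empirical covariance of the simulated estimates matches \(\sigma^2(X^TX)^{-1}\). Continuing the simulation sketch above, with \(\sigma = 1\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma = 50, 5000, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([3.0, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = np.array([XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
                      for _ in range(reps)])

print(np.cov(estimates.T))   # empirical covariance of beta_hat
print(sigma**2 * XtX_inv)    # theoretical sigma^2 (X^T X)^{-1}
```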

Distribution of \(\hat{\beta}\)

Since \(\hat{\beta}=(X^TX)^{-1}X^TY\) is a linear transformation of the normally distributed \(Y\), it is itself normally distributed, with the expectation and covariance derived above:

\[\hat{\beta} \sim N(\beta,\sigma^2 (X^TX)^{-1})\]
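
Normality of the individual coordinates can also be checked empirically, for example with a normality test applied to the simulated slope estimates (a sketch using `scipy.stats.normaltest`; data simulated as before):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 50, 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([3.0, -1.0])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)

slopes = np.array([(XtX_inv_Xt @ (X @ beta + rng.normal(size=n)))[1]
                   for _ in range(reps)])

# D'Agostino-Pearson test: a large p-value is consistent with normality.
print(stats.normaltest(slopes).pvalue)
```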

Distribution of \(b^T\hat{\beta}\)

\[b^T\hat{\beta}=b^T(X^TX)^{-1}X^TY\]

\[\hat{\beta} \sim N(\beta,\sigma^2 (X^TX)^{-1})\]

Expectation of \(b^T\hat{\beta}\)

\[\begin{eqnarray*} E(b^T\hat{\beta})&=&b^TE(\hat{\beta})\\ &=&b^T\beta \end{eqnarray*}\]

Covariance of \(b^T\hat{\beta}\)

\[\begin{eqnarray*} Var(b^T\hat{\beta})&=&b^TVar(\hat{\beta})b\\ &=&b^T[\sigma^2(X^TX)^{-1}]b\\ &=&\sigma^2 b^T(X^TX)^{-1}b\\ \end{eqnarray*}\]
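
In practice this variance gives the standard error of any linear combination of the coefficients; a common case is \(b = e_j\), which picks out the \(j\)-th coefficient. A sketch, with \(\sigma^2\) replaced by its usual unbiased estimate:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([3.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)        # unbiased estimate of sigma^2
XtX_inv = np.linalg.inv(X.T @ X)

b = np.array([0.0, 1.0])                    # e_2: selects the slope
se = np.sqrt(sigma2_hat * b @ XtX_inv @ b)  # sqrt(sigma^2 b^T (X^T X)^{-1} b)
print(beta_hat @ b, se)
```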

Distribution of \(b^T\hat{\beta}\)

\[b^T\hat{\beta}=b^T(X^TX)^{-1}X^TY\]

\[b^T\hat{\beta} \sim N(b^T\beta,\sigma^2 b^T(X^TX)^{-1}b)\]

In practice \(\sigma^2\) is unknown and is replaced by the unbiased estimate \(\hat{\sigma}^2=\frac{1}{n-p}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\). Standardising \(b^T\hat{\beta}\) with this estimate yields a \(t\)-distribution with \(n-p\) degrees of freedom:

\[\frac{b^T\hat{\beta} - b^T\beta}{\sqrt{\hat{\sigma}^2 b^T(X^TX)^{-1}b}}\sim t(n-p)\]
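
Putting the pieces together, a \(t\)-based confidence interval for \(b^T\beta\) follows directly. A sketch using `scipy.stats.t`, where \(b\) is a hypothetical covariate vector so the interval is for the mean response at that point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([3.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)        # unbiased estimate of sigma^2
XtX_inv = np.linalg.inv(X.T @ X)

b = np.array([1.0, 0.5])                    # mean response at x = 0.5
est = b @ beta_hat
se = np.sqrt(sigma2_hat * b @ XtX_inv @ b)
t_crit = stats.t.ppf(0.975, df=n - p)       # 95% two-sided critical value
print(est - t_crit * se, est + t_crit * se) # confidence interval for b^T beta
```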