The Method of Least Squares
There are two main methods of parameter estimation, namely least squares and maximum likelihood. We will consider the method of least squares.
We will first work out the least squares estimate of the parameter of a simpler model, namely a linear model through the origin, where we need to estimate just one parameter: the slope. For this problem we have
\(\newline\) Data: \((y_i,x_i), \quad i=1,\dots,n\). \(\newline\) Model: \(y_i =\beta x_i + \epsilon_i\).
\(\newline\) That is, a response variable \(y\), one explanatory variable \(x\), and one unknown regression coefficient \(\beta\). In this case, we want to fit a straight line through the origin with gradient \(\beta\).
Least squares aims to find the value of \(\beta\) for which the straight line is as close to the data as possible. This is done by making the vertical distances between the data points and the chosen line collectively as small as possible.
The vertical distances, shown in red, are the differences between the observed value of the response variable \(y_i\) (for observation \(i\) in \(1,\dots,n\)) and the predicted, or fitted, value \(E(y_i|x_i)\), that is, the projection of \(y_i\) onto the black fitted line. In this case, the fitted line is defined by \(\beta\). Typically we refer to these vertical distances as \(\epsilon_1, \ldots, \epsilon_n\). In other words,
\[\begin{eqnarray*}
\epsilon_i &=& y_i - E(y_i|x_i)\\
&=& y_i - \beta x_i \hspace{0.25cm} \mbox{for } i=1, \dots, n.
\end{eqnarray*}\]
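As a concrete illustration of these residuals, the short Python sketch below computes \(\epsilon_i = y_i - \beta x_i\) for a small hypothetical data set and an arbitrary candidate slope; the numbers are invented purely for demonstration.

```python
# Residuals in the through-origin model y_i = beta * x_i + epsilon_i.
# The data (x, y) and the candidate slope beta are invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

beta = 1.0  # an arbitrary candidate slope, not necessarily the best one

# epsilon_i = y_i - beta * x_i, for i = 1, ..., n
residuals = [yi - beta * xi for xi, yi in zip(x, y)]
print(residuals)
```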
Why least squares?
The method of least squares dates back to the early 1800s. The ideas stemmed from astronomy, where astronomers of the time were interested in measuring planetary movement. They knew that any observations they made were subject to error, giving rise to the term “observational error”, and they wanted to know how accurate their measurements were. They assumed that
- observations were representative of a true value, and that true value was unknown (here, ‘representative’ means that, based on the observations, we could estimate the true unknown value),
- all observations were subject to error,
- observations were symmetrically distributed around a true value and so small errors were more likely than larger errors, and
- the true value was likely to be the value that best fitted the observations.
The first two points above should be familiar to you. We have already seen that we can view a response variable \(y\) as a function of an explanatory variable (or multiple explanatory variables) plus some random error.
The idea of ‘best fit’ was not obvious, and still isn’t. We will discuss ideas of ‘best fit’ in later lectures of this course, and we will also revisit the distributional assumption in point 3 above.
The most straightforward answer to ‘best fit’ is to take the average value. In other words, we could make multiple observations and estimate the true value to be their average. More generally, it was determined in the early 1800s that the most reliable estimates of the true unknown values are the ones that minimise the sum of the squares of the differences between the observations and the estimated true unknown values.
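In fact, the average fits neatly into this framework: if we treat repeated observations \(y_1, \ldots, y_n\) as measurements of a single unknown true value \(\mu\), then minimising the sum of squared differences gives exactly the sample mean. A brief check, setting the derivative of the sum of squares to zero:
\[\begin{eqnarray*}
S(\mu) &=& \sum_{i=1}^n (y_i - \mu)^2, \\
\frac{dS}{d\mu} &=& -2\sum_{i=1}^n (y_i - \mu) = 0 \quad \Rightarrow \quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}.
\end{eqnarray*}\]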
Least squares estimate for simple linear regression through the origin
It does not matter whether a data point (black dot) falls above or below the fitted line. Therefore, in order to gauge the overall distance between the observations and the fitted line, it is conventional to consider \[\epsilon_i^2 = (y_i - \beta x_i)^2 \hspace{0.25cm} \mbox{for } i=1, \dots, n.\]
The method of least squares minimises the sum of the squares of these vertical differences, namely
\[S(\beta) =\sum_{i=1}^n(y_i-\beta x_i)^2\]
where \(S(\beta)\) is referred to as the sum-of-squares function. The vertical differences are illustrated using the red lines in the plot above. We now have the mathematical problem of choosing \(\beta\) to minimise \(S(\beta)\).
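To make the minimisation problem concrete, the sketch below (using the same made-up data as before, and plain Python) evaluates \(S(\beta)\) over a grid of candidate slopes and reports the grid point with the smallest value; this is only a crude numerical stand-in for the exact minimisation described next.

```python
# Sum-of-squares function S(beta) for the through-origin model,
# evaluated over a coarse grid of candidate slopes.
# The data are the same made-up values used above, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

def S(beta):
    """Sum of squared vertical distances between the data and the line y = beta * x."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

# Try beta = 0.00, 0.01, ..., 2.00 and keep the value with the smallest S(beta).
grid = [b / 100 for b in range(201)]
beta_grid = min(grid, key=S)
print(beta_grid, S(beta_grid))
```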
Least squares estimators
The value of \(\beta\) which minimises \(S(\beta)\) is called the least squares estimate of \(\beta\), and is usually denoted by \(\hat{\beta}\) (read as ‘\(\beta\) hat’). Therefore, we can use \(S(\beta)\) to derive, from first principles, the least squares estimator \(\hat{\beta}\).
Residual sum of squares (RSS)
The minimum value of \(S(\beta)\), namely \(S(\hat{\beta})\), is called the residual sum of squares (RSS).
Minimisation of \(S(\beta)\)
\(S(\beta)\) can be minimised either by calculus or algebraically, and the least squares estimate is \[\hat{\beta} = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}.\]
The derivation of the above estimate is left as a tutorial task - please try it! The best fitting line, in the least squares sense, is then given by \[\hat{y}=\hat{\beta}x.\]
The residual sum of squares is \[S(\hat{\beta}) = \sum_{i=1}^n(y_i-\hat{\beta}x_i)^2.\]
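Continuing the illustrative (made-up) example from above, the closed-form estimate \(\hat{\beta}\) and the corresponding residual sum of squares can be computed directly; the grid search shown earlier should agree with \(\hat{\beta}\) to within the grid spacing.

```python
# Closed-form least squares estimate and residual sum of squares (RSS)
# for the through-origin model, using the same made-up data as above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

# beta_hat = sum(x_i * y_i) / sum(x_i^2)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

# RSS = S(beta_hat) = sum of (y_i - beta_hat * x_i)^2
rss = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))

print(beta_hat, rss)
```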