2.2 The normal equations
So we’ve decided we want the vector of residuals to be perpendicular to the column space of \(\boldsymbol{X}\). Every vector in that column space can be written as \(\boldsymbol{Xc}\) for some vector of coefficients \(\boldsymbol{c}\), so in algebra terms this means that for any vector \(\boldsymbol{c}\) we need \(\boldsymbol{e} \perp \boldsymbol{Xc}\). In notation:
\[\begin{aligned} \boldsymbol{Xc} \perp \boldsymbol{e} &\implies \\ \boldsymbol{(Xc)}'(\boldsymbol{y}-\boldsymbol{Xb}) = 0 &\implies \\ \boldsymbol{c}'\boldsymbol{X}'\boldsymbol{y} - \boldsymbol c' \boldsymbol X' \boldsymbol X \boldsymbol b = 0 &\implies\\ \boldsymbol{c}'(\boldsymbol X' \boldsymbol y - \boldsymbol X' \boldsymbol X \boldsymbol b) = 0 \end{aligned}\]
Since this is true for any \(\boldsymbol{c}\) (in particular, for \(\boldsymbol{c} = \boldsymbol X' \boldsymbol y - \boldsymbol X' \boldsymbol X \boldsymbol b\) itself, which would force that vector’s squared length to be 0), it must be true that
\[ \boldsymbol X' \boldsymbol y - \boldsymbol X' \boldsymbol X \boldsymbol b = \boldsymbol 0\]
or in other words:
\[ \boldsymbol X' \boldsymbol X \boldsymbol b = \boldsymbol X' \boldsymbol y\]
These are the normal equations of multiple linear regression. Yes, more than one equation! Remember this is a matrix equation: the dimensions here are (\(k+1\) by \(n\))(\(n\) by \(k+1\))(\(k+1\) by 1) = (\(k+1\) by \(n\))(\(n\) by 1), so each side is a (\(k+1\) by 1) column vector. In other words, this is a system of \(k+1\) equations.
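To make this concrete, here is a minimal NumPy sketch (the dataset and all the variable names are made up purely for illustration): it solves the normal equations for a tiny design matrix and then checks that the residual vector really is (numerically) perpendicular to every column of \(\boldsymbol{X}\).

```python
import numpy as np

# A tiny made-up dataset: n = 5 observations, k = 2 predictors,
# plus a leading column of ones for the intercept (so X is n by k+1).
X = np.array([[1.0,  2.0, 1.0],
              [1.0,  4.0, 3.0],
              [1.0,  6.0, 2.0],
              [1.0,  8.0, 5.0],
              [1.0, 10.0, 4.0]])
y = np.array([3.0, 7.0, 8.0, 14.0, 15.0])

# Solve the normal equations X'X b = X'y for b.
b = np.linalg.solve(X.T @ X, X.T @ y)

# The residuals e = y - Xb should be perpendicular to the column space,
# i.e. X'e should be the zero vector (up to floating-point error).
e = y - X @ b
print(b)
print(X.T @ e)  # all entries are ~0
```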
Recall that in the “simple” (one-predictor) linear regression case, there are two \(b\)’s to optimize: the slope and the intercept. That means you need partial derivatives (oooh) to find the best value for each. In multiple regression, instead of just \(b_0\) and \(b_1\), you have more \(b\)’s to optimize over, but it still comes down to a bunch of partial derivatives.
We’ve just taken a geometric approach to the idea of minimizing the sum of squared residuals, thereby obtaining the normal equations that our best-fit \(\boldsymbol{b}\) must satisfy. You can also take a calculus approach to obtaining these normal equations: it’s a question of minimizing the sum of squared residuals over all possible values of the various \(b\)’s, and any time you are looking for a maximum or minimum, calculus comes in handy!
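If you want to see the calculus route without grinding through the algebra by hand, here is a small SymPy sketch on some made-up one-predictor data (again, the numbers and names are just for illustration): differentiate the sum of squared residuals with respect to \(b_0\) and \(b_1\), set both partial derivatives to zero, and you get the \(k+1=2\) normal equations and their solution.

```python
import sympy as sp

# Made-up one-predictor data.
xs = [2, 4, 6, 8, 10]
ys = [3, 7, 8, 14, 15]

b0, b1 = sp.symbols('b0 b1')

# The sum of squared residuals as a function of the two b's.
S = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Setting both partial derivatives equal to zero gives two equations:
# these are exactly the k+1 = 2 normal equations for this problem.
eq0 = sp.Eq(sp.diff(S, b0), 0)
eq1 = sp.Eq(sp.diff(S, b1), 0)

print(sp.solve([eq0, eq1], [b0, b1]))  # the least-squares intercept and slope
```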
But either way, we end up with this system of equations. How are we going to solve it? Well, in simple regression, this system of equations consists of \(k+1=2\) equations involving two unknowns, \(b_0\) and \(b_1\). So we can solve them by rearranging the first equation into an expression for \(b_0\) in terms of \(b_1\), and then substituting it in the second equation. Which is fine, I guess. But this is not going to be a pleasant prospect when, say, \(k=10\). Or 100.
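Just to show how much bookkeeping the substitution routine takes even when \(k+1=2\), here is a sketch of it on the same made-up one-predictor data, with the matrix-algebra solution printed alongside for comparison.

```python
import numpy as np

# The same made-up one-predictor data.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 7.0, 8.0, 14.0, 15.0])
n = len(x)

# For simple regression the two normal equations are
#   n*b0      + sum(x)*b1    = sum(y)
#   sum(x)*b0 + sum(x**2)*b1 = sum(x*y)
# Rearranging the first gives b0 = ybar - b1*xbar; substituting that
# into the second and solving for b1:
xbar, ybar = x.mean(), y.mean()
b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x ** 2) - n * xbar ** 2)
b0 = ybar - b1 * xbar

# The matrix route gives the same numbers without any hand substitution,
# and it looks identical no matter how many predictors there are.
X = np.column_stack([np.ones(n), x])
print(b0, b1)
print(np.linalg.solve(X.T @ X, X.T @ y))
```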
Instead, let us use the magic of… matrix algebra!