3.1 The fitted regression model
Before jumping in here, it’s a good idea to review the basic idea of least squares (see “SLR: Least Squares” from module 1) and some matrix concepts (“General matrix fax and trix” from module 2).
We’ve been dealing with the theoretical model, but now let’s look at the fitted model. We want to find a good \(\boldsymbol{b}\) (aka \(\hat{\boldsymbol{\beta}}\)). Then our predictions will be
\[\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{b}\] This vector of predictions must fall in the column space of \(\boldsymbol{X}\); the vector \(\boldsymbol{b}\) determines where in that column space it lands. If \(\boldsymbol{X}\) has \(k+1\) linearly independent columns (the intercept column plus \(k\) predictors), this vector is constrained to lie within a \((k+1)\)-dimensional space. But the vector of responses, \(\boldsymbol{y}\), can point anywhere in \(n\)-dimensional space! So the predictions we can get out of a model are more constrained than the actual values, which is why they can’t match the actual values exactly.
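If it helps to see this concretely, here’s a minimal numpy sketch (the design matrix and coefficient vector are made-up toy numbers, not from anything in these notes): computing \(\boldsymbol{X}\boldsymbol{b}\) is exactly the same as taking a linear combination of the columns of \(\boldsymbol{X}\), which is why \(\hat{\boldsymbol{y}}\) is stuck in the column space.

```python
import numpy as np

# Toy design matrix: an intercept column plus k = 2 predictors (n = 5 rows).
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 5.0, 2.0],
    [1.0, 7.0, 1.0],
    [1.0, 9.0, 3.0],
])

# Some candidate coefficient vector b (hypothetical values).
b = np.array([0.5, 1.2, -0.3])

# Predictions are X times b ...
y_hat = X @ b

# ... which is the same thing as a linear combination of the columns of X,
# so y_hat has no choice but to lie in the column space of X.
same = b[0] * X[:, 0] + b[1] * X[:, 1] + b[2] * X[:, 2]
print(np.allclose(y_hat, same))  # True
```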
A note for the linear-algebraically inclined: The vector \(\boldsymbol{y}\) can be said to live in \(\mathbb{R}^n\). The vector \(\hat{\boldsymbol y}\) is also length \(n\), so you can think of it as also living in \(\mathbb{R}^n\)… but it has to lie within a particular \(k+1\)-dimensional subspace of \(\mathbb{R}^n\). Now, that subspace is not necessarily \(\mathbb{R}^{k+1}\), but it’s equivalent, if you’re allowed to do a change of basis and swivel the axes around. So don’t worry about it. If this distinction was not bothering you in the first place, forget I mentioned it.
Restricting \(\boldsymbol{\hat{y}}\) to \(k+1\)-space is a big constraint, but then, that’s kind of the point: we’re trying to predict \(\boldsymbol{y}\) with a simpler (lower-dimensional) model approximation.
Of course, that mismatch creates a vector of residuals: \(\boldsymbol{e} = \boldsymbol{y} - \boldsymbol{Xb}\). Notice there’s still exactly one residual per data point. This hasn’t changed from the old-school, simple linear regression case; all that’s different is how we got \(\hat{\boldsymbol{y}}\), and how we’re writing it.

We’d like this residual vector to be as small as possible, just as always. Previously, we talked about minimizing the sum of the squared residuals. Well, hold on: that sum is the inner product of the residual vector with itself. \[SSE = \boldsymbol{e}^T\boldsymbol{e} = \sum_{i=1}^{n} e_i^2\] We can think of the length of the residual vector as the square root of this quantity. In linear algebra terms, that’s the norm.
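Here’s a quick numerical check of that equivalence, reusing the toy \(\boldsymbol{X}\) and \(\boldsymbol{b}\) from the sketch above plus a made-up response vector \(\boldsymbol{y}\):

```python
import numpy as np

# Same toy design matrix and coefficient vector as before, plus a toy response.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 5.0, 2.0],
    [1.0, 7.0, 1.0],
    [1.0, 9.0, 3.0],
])
b = np.array([0.5, 1.2, -0.3])
y = np.array([3.1, 4.0, 6.2, 8.9, 10.5])

e = y - X @ b                    # residual vector: one entry per data point

sse_inner = e @ e                # e^T e, the inner product of e with itself
sse_sum = np.sum(e**2)           # the familiar sum of squared residuals
norm_e = np.linalg.norm(e)       # the length (norm) of the residual vector

print(np.isclose(sse_inner, sse_sum))          # True
print(np.isclose(norm_e, np.sqrt(sse_inner)))  # True
```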
From linear algebra, we know that if we want to minimize the length (norm) of the residual vector, then we need to project \(\boldsymbol{y}\) orthogonally onto the column space of \(\boldsymbol{X}\). In other words, we want \(\boldsymbol{e}\) to be perpendicular to \(\boldsymbol{Xc}\) for any vector \(\boldsymbol{c}\).
Why does that make sense? Well, we know that the \(\boldsymbol{\hat{y}}\) vector is stuck down in the \(k+1\)-dimensional column space of \(\boldsymbol{X}\), while the \(\boldsymbol{y}\) vector soars majestically in \(n\)-dimensional space. We’d like \(\boldsymbol{\hat{y}}\) to get as close as possible to \(\boldsymbol{y}\), though. So we want to “drop straight down” to \(Col(\boldsymbol X)\) from \(\boldsymbol{y}\). To get any closer to \(\boldsymbol{y}\) from that point would require us to pop out of \(Col(\boldsymbol X)\) into an additional dimension. (Trippy.)
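If you want to watch the projection happen, here’s a sketch that uses numpy’s `np.linalg.lstsq` to get the least-squares \(\boldsymbol{b}\) for the toy data above, then checks that the residual vector really is perpendicular to every column of \(\boldsymbol{X}\) (and hence to \(\boldsymbol{Xc}\) for any \(\boldsymbol{c}\)):

```python
import numpy as np

# Same toy design matrix and response as in the sketches above.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 5.0, 2.0],
    [1.0, 7.0, 1.0],
    [1.0, 9.0, 3.0],
])
y = np.array([3.1, 4.0, 6.2, 8.9, 10.5])

# Least-squares solution: the b that minimizes the norm of y - Xb.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b          # the orthogonal projection of y onto Col(X)
e = y - y_hat          # the residual vector

# e is perpendicular to each column of X, so X^T e is (numerically) zero ...
print(np.allclose(X.T @ e, 0))     # True, up to floating-point error

# ... and therefore perpendicular to Xc for any vector c.
c = np.array([2.0, -1.0, 0.5])     # an arbitrary coefficient vector
print(np.isclose((X @ c) @ e, 0))  # True, up to floating-point error
```

The solver itself isn’t the point here; `lstsq` is just a convenient way to land \(\hat{\boldsymbol{y}}\) at that “drop straight down” spot so we can check the geometry.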