## 3.1 The fitted regression model

Before jumping in here, it’s a good idea to review the basics of least squares (see “SLR: Least Squares” from module 1) and some matrix concepts (“General matrix fax and trix” from module 2).

We’ve been dealing with the theoretical model, but now let’s look at the *fitted* model. We want to find a good \(\boldsymbol{b}\) (aka \(\hat{\boldsymbol{\beta}}\)). Then our predictions will be

\[\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{b}\]

This vector of predictions must fall in the column space of \(\boldsymbol{X}\); the vector \(\boldsymbol{b}\) determines *where* in that column space it lands. If \(\boldsymbol{X}\) has \(k+1\) linearly independent columns (an intercept plus \(k\) predictors), this vector is constrained to lie within a \(k+1\)-dimensional space. But the vector of responses, \(\boldsymbol{y}\), can point *anywhere* in \(n\)-dimensional space! So the predictions we can get out of a model are more *constrained* than the actual values – which is why they generally can’t match the actual values exactly.
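To make the column-space idea concrete, here’s a minimal sketch (with made-up data; the variable names are illustrative, not from the notes) showing that any prediction vector \(\boldsymbol{Xb}\) is just a linear combination of the columns of \(\boldsymbol{X}\):

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, k = 2 predictors plus an intercept.
rng = np.random.default_rng(0)
n, k = 5, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1) design matrix

b = np.array([1.0, 2.0, -0.5])  # some candidate coefficient vector
y_hat = X @ b                   # predictions: a point in Col(X)

# y_hat is exactly a linear combination of the columns of X,
# with the entries of b as the weights:
combo = X[:, 0] * b[0] + X[:, 1] * b[1] + X[:, 2] * b[2]
print(np.allclose(y_hat, combo))  # True
```

No matter what \(\boldsymbol{b}\) we pick, `y_hat` is a length-\(n\) vector, but it can only reach points spanned by the \(k+1\) columns of `X`.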

A note for the linear-algebraically inclined: The vector \(\boldsymbol{y}\) can be said to live in \(\mathbb{R}^n\). The vector \(\hat{\boldsymbol y}\) is also length \(n\), so you can think of it as also living in \(\mathbb{R}^n\)… but it has to lie within a particular \(k+1\)-dimensional subspace of \(\mathbb{R}^n\). Now, that subspace is not literally \(\mathbb{R}^{k+1}\), but it’s equivalent to it, if you’re allowed to do a change of basis and swivel the axes around. So don’t worry about it. If this distinction wasn’t bothering you in the first place, forget I mentioned it.

Restricting \(\boldsymbol{\hat{y}}\) to \(k+1\)-space is a big constraint, but then, that’s kind of the point: we’re trying to predict \(\boldsymbol{y}\) with a *simpler* (lower-dimensional) model approximation.

Of course, the gap between \(\hat{\boldsymbol{y}}\) and \(\boldsymbol{y}\) creates a vector of residuals: \(\boldsymbol{e} = \boldsymbol{y} - \boldsymbol{Xb}\). Notice there’s still exactly one residual per data point; this hasn’t changed from the old-school, simple linear regression case. All that’s different is how we got \(\hat{\boldsymbol{y}}\), and how we’re writing it.

We’d like this residual vector to be as small as possible, just as always. Previously, we talked about minimizing the sum of the squared residuals. Well, hold on: that sum is the *inner product* of the residual vector with itself.
\[SSE = \boldsymbol{e}^T\boldsymbol{e} = \sum_{i=1}^{n} e_i^2\]
We can think of the *length* of the residual vector as the square root of this quantity. In linear algebra terms, that’s the *norm* of \(\boldsymbol{e}\).
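These three ways of writing the same quantity line up exactly in code. A quick sketch, using a made-up residual vector:

```python
import numpy as np

# A hypothetical residual vector, just for illustration.
e = np.array([0.5, -1.2, 0.3, 0.9])

sse_inner = e @ e              # e'e: the inner product of e with itself
sse_sum = np.sum(e**2)         # the familiar sum of squared residuals
length = np.linalg.norm(e)     # the norm (length) of e

print(np.isclose(sse_inner, sse_sum))         # True
print(np.isclose(length, np.sqrt(sse_inner))) # True: norm = sqrt(SSE)
```

Minimizing the SSE and minimizing the norm are the same problem, since the square root is monotone.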

From linear algebra, we know that if we want to minimize the length (norm) of the residual vector, then we need to project \(\boldsymbol{y}\) orthogonally onto the column space of \(\boldsymbol{X}\). In other words, we want \(\boldsymbol{e}\) to be perpendicular to \(\boldsymbol{Xc}\) for any vector \(\boldsymbol{c}\).

Why does that make sense? Well, we know that the \(\boldsymbol{\hat{y}}\) vector is stuck down in the \(k+1\)-dimensional column space of \(\boldsymbol{X}\), while the \(\boldsymbol{y}\) vector soars majestically in \(n\)-dimensional space. We’d like \(\boldsymbol{\hat{y}}\) to get as close as possible to \(\boldsymbol{y}\), though. So we want to “drop straight down” to \(Col(\boldsymbol X)\) from \(\boldsymbol{y}\). To get any closer to \(\boldsymbol{y}\) from that point would require us to pop out of \(Col(\boldsymbol X)\) into an additional dimension. (Trippy.)
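We can check the “drop straight down” picture numerically. A sketch with simulated data (all names and the data-generating setup here are made up for illustration): solve for the least squares \(\boldsymbol{b}\), form the residuals, and confirm they’re perpendicular to every column of \(\boldsymbol{X}\) — and hence to \(\boldsymbol{Xc}\) for any \(\boldsymbol{c}\):

```python
import numpy as np

# Simulated data: n = 50 observations, intercept + 2 predictors.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = 3 + X[:, 1] - 2 * X[:, 2] + rng.normal(size=n)

# Least squares fit: this orthogonally projects y onto Col(X).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat

# Orthogonality check: X'e should be (numerically) the zero vector,
# so e is perpendicular to each column of X, and thus to Xc for any c.
print(np.allclose(X.T @ e, 0))  # True, up to floating-point error
```

If \(\boldsymbol{X}^T\boldsymbol{e}\) weren’t zero, the residual vector would have a component inside \(Col(\boldsymbol X)\), and we could shrink it further by moving \(\hat{\boldsymbol{y}}\) within the column space.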