Chapter 2 Linear Regression by OLS and MLE

2.1 OLS

In this chapter we will dive a bit deeper into the methods outlined in the video "What is Maximum Likelihood Estimation (w/ Regression)". In the video, we touched on one method of linear regression: least squares regression. Recall that the general intuition is that we want to minimize the distance of each point from the line. In other words, we want to minimize the residuals.


You already have a basic intuition for how this works, but let’s see how to employ least squares using some basic programming. We will use data on nest temperature and birth weight to put these intuitions into practice.

Under the hood I have generated some toy data to analyze, rather than using the standard mtcars or iris datasets (as fun as they are).

2.1.1 The data

In this dataset, julian describes the day of the year hatchlings were born, nest_temp describes the temperature of the nest throughout incubation, and birth_weight describes the average mass of a hatchling in grams.

##      julian nest_temp birth_weight
## 1 105.37724  26.90120     53.51261
## 2  91.44068  29.85848     52.49336
## 3 111.25589  25.15303     52.46664
## 4  80.24140  31.77687     51.79504
## 5 105.32226  35.71426     52.38861
## 6 108.98875  36.87243     53.32003

Let’s say we want to see if there is a relationship between julian date and birth weight. We are going to plot date on the \(x\) axis and weight on the \(y\) axis.

There seems to be a significant positive trend between julian date and birth weight. Recall from the video that the basic intuition for fitting a line of best fit through these data is to find the line that minimizes the residual distances for each observation. To start, let’s fit a few different possible lines to the data and see how well each fits under the least squares method. Eyeballing the plot, I would fit a line with a \(y\)-intercept of about \(43\) (I played around with this a bit) and a slope of about \(0.1\) (it looks like a rise of 2 over a run of 20). That sounds mildly reasonable, I think.

Let’s also fit a line with 0 slope and intercept at whatever the mean of our y value is (in this case 52.8).

So let’s look at the residuals for our first line. Let’s make a new data frame to calculate residuals for both of our lines.
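As a sketch, using the six rows printed above, the residual data frame for the two candidate lines (intercept 43 with slope 0.1, and a flat line at the mean birth weight) could be built like this; the full dataset would give different sums:

```r
# First six rows of the toy data, copied from the printout above
dat <- data.frame(
  julian       = c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875),
  birth_weight = c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
)

# Candidate line 1: eyeballed intercept 43, slope 0.1
# Candidate line 2: flat line at the mean birth weight
resids <- data.frame(
  julian      = dat$julian,
  observed    = dat$birth_weight,
  resid_line1 = dat$birth_weight - (43 + 0.1 * dat$julian),
  resid_flat  = dat$birth_weight - mean(dat$birth_weight)
)
resids

# Sum of squared residuals for each line
c(line1 = sum(resids$resid_line1^2), flat = sum(resids$resid_flat^2))
```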

Now we can visualize these residual distances.

2.1.2 The math

Remember, for least squares, our goal is to minimize the total sum of the squares of the residuals.

\[ \sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i)^2 \]

So, let’s see what this sum of squares is for our first line. Simple enough.

## [1] 101.7855

So, what is the sum of the squared residuals for the flat line?

## [1] 128.2292

So the sum of squared residuals is higher for the second line, indicating a worse fit to the data.

This is great, but not surprising. We intentionally chose a relatively good and a relatively bad fitting line. So, how could we define a function that can test any given slope and intercept and determine the values that minimize the sum of the squared residuals? We can take advantage of some of R’s built-in functionality, using what we know about least squares, to minimize around the equation of our line:

\[ y_i = \beta_0 + \beta_1x_i \]

Again, our residuals can then be defined simply as

\[ y_i - \beta_0 - \beta_1x_i \] and our sum of the squared residuals as

\[ \sum_i^n (y_i - \beta_0 - \beta_1x_i)^2 \]

So, now we know our goal is to minimize this function. How do we do that? R has a nice built-in optimization function that can solve this for us. So, what we will do is define our function (the sum of the squared residuals) and then tell R to optimize it.

First, we will make a function that takes as input our observed y values and calculates the sum of squared residuals based on some proposed slope and intercept parameter values.

We then use R’s built-in optim() function to find which values minimize our defined function. We are optimizing both the intercept, which we have defined as our first parameter (\(\beta_0\)), and the slope, our second parameter (\(\beta_1\)).
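A sketch of what that function and the optim() call could look like (leastsq is the name referred to later in the chapter; here I use only the six printed rows, so the optimum will not exactly match the full-data output):

```r
# Sum-of-squared-residuals function: par[1] is the intercept (beta0),
# par[2] is the slope (beta1)
leastsq <- function(par, x, y) {
  sum((y - (par[1] + par[2] * x))^2)
}

# Example data: the six rows printed earlier
dat <- data.frame(
  julian       = c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875),
  birth_weight = c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
)

# Start from the eyeballed guess (intercept 43, slope 0.1) and minimize
fit <- optim(par = c(int = 43, slope = 0.1), fn = leastsq,
             x = dat$julian, y = dat$birth_weight)
round(fit$par, 3)
```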

##    int  slope 
## 46.247  0.066

2.1.3 A tangent on optimizers

I want to give a visual intuition for what we are doing here. I am as of yet unsure if this explanation will end up in the video, so I am including a little written sidebar here. Effectively, all regression problems (and many, many other statistical and machine learning problems) boil down to finding the minimum or maximum of a function. You are iterating across all possible values and finding where the function is minimized. Where statistical packages often differ from one another is in the optimizers they use, or the heuristics they use to choose an optimizer. For our problem, we want to iterate across every slope and intercept to see which combination gives us the minimal sum of squared residuals. Analytically this isn’t horribly difficult. We just need to create every combination of slope and intercept and calculate the residuals.

So, for our data, let’s just say we wanted to determine the residuals for every possible slope and intercept (or at least a big range of values).

I’ll use our guesses of 43 and 0.1 to generate a range of possibilities.

So now what I am going to do is calculate the total squared residuals for every combination of slope and intercept, given our observed x and y values.

So now that we have our squared residuals for every slope and intercept combination, we can plot them as a 3-dimensional surface.
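A sketch of the grid search and surface plot, assuming ranges of candidate values around the eyeballed intercept of 43 and slope of 0.1, and using the six printed rows as the data:

```r
# Ranges of candidate parameter values around the eyeballed guess
intercepts <- seq(35, 55, length.out = 50)
slopes     <- seq(-0.1, 0.3, length.out = 50)

x <- c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875)  # julian
y <- c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)      # birth_weight

# Sum of squared residuals for every intercept/slope combination
ssr <- outer(intercepts, slopes,
             Vectorize(function(b0, b1) sum((y - (b0 + b1 * x))^2)))

# The surface we are minimizing
persp(intercepts, slopes, ssr, theta = 35, phi = 25,
      xlab = "intercept", ylab = "slope", zlab = "SSR")

# Grid cell with the smallest SSR
which(ssr == min(ssr), arr.ind = TRUE)
```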

So this is the function we are minimizing. For very simple models, these functions are easy to optimize. There is a clear valley in our parameter space, and you can imagine it would be fairly simple to arrive at the minimum of this function. For complex models this isn’t always the case. You might imagine a wiggly surface with multiple local minima, or a surface where there just isn’t a very clear valley. Where many statistical tests “fail”, and why many error messages exist, is that this function can’t effectively be optimized. For example, convergence errors, boundary errors, etc. all refer to issues in how the algorithm arrives at a minimum. This will be covered in detail in a future video because I think it’s really fun (hence this weird tangent).

Another related tangent: you can play around with this function to see how different optimization algorithms behave. Many different algorithms exist, and it is important to understand how each behaves. For example, the default optimizer for the lmer() function in R is the “bobyqa” optimizer. Often, when you run into convergence errors with lmer, this can be hurdled by specifying a different optimizer, such as one from optimx or Nelder-Mead. glm and lmer also let you test multiple optimizers on your model to see how they behave. Here is a list of all the optimizer algorithms that can be used (these are roughly the same ones that are used in glm and lmer):

Let’s play a game. Which optimization methods are least like the others?*
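A sketch of how that comparison could be run, looping optim() over the method names listed in ?optim (again on the six printed rows; SANN is stochastic, so its result varies from run to run):

```r
leastsq <- function(par, x, y) sum((y - (par[1] + par[2] * x))^2)

x <- c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875)  # julian
y <- c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)      # birth_weight

# Optimizer methods available in base R's optim()
methods <- c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN")
fits <- lapply(methods, function(m) {
  round(optim(par = c(int = 43, slope = 0.1), fn = leastsq,
              x = x, y = y, method = m)$par, 3)
})
names(fits) <- methods
fits
```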

## $`Nelder-Mead`
##    int  slope 
## 46.247  0.066 
## $BFGS
##    int  slope 
## 46.286  0.066 
## $CG
##    int  slope 
## 15.770  0.372 
## $`L-BFGS-B`
##    int  slope 
## 46.286  0.066 
## $SANN
##    int  slope 
## 11.694  0.413

CG and SANN require some user inputs to work optimally. They are best for large, complex optimization problems, and so are less ideal as general-purpose defaults. The authors of these packages choose default optimization algorithms that maximize both convergence and speed.

2.2 Back to the main stuff

##    int  slope 
## 46.247  0.066

So, our defined function has found that the intercept and slope that minimize the sum of the squared residuals are \(int = 46.247\) and \(slope = 0.066\). Which is astonishingly close to our guess of \(43\) and \(0.1\)! (Admittedly, I did simulate the parameter values…)

Fitting that line to the data, we get

So, let’s compare this to what a common built-in function might find for us. A common tool for calculating regression coefficients using least squares is the lm() function. Let’s see what we get:
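The lm() call itself is one line. A sketch on the six printed rows (the chapter’s printed coefficients come from the full simulated dataset, so these will differ):

```r
dat <- data.frame(
  julian       = c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875),
  birth_weight = c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
)

# Ordinary least squares via R's built-in linear model function
fit_lm <- lm(birth_weight ~ julian, data = dat)
coef(fit_lm)
```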

## (Intercept)      julian 
## 46.28632464  0.06588573

You can see we get roughly equivalent results! As a fun sanity check, here are the simulated parameter values:

## [1] "beta0 = 46.907"
## [1] "beta1 = 0.0745"

2.3 Multiple Parameters

So what if we have a model where we are interested in multiple parameters’ effects on a given output?

What if we were interested in how both julian date AND nest temperature influence birth weight? Well, our model would then be:

birth_weight ~ julian + nest_temp

Remember, our original linear model was defined as

\[ E[y] = \beta_0 + \beta_1x_i \] Recall from the video, this just means that our vector of predicted y values is given by our vector of x values, multiplied by our slope parameter (plus the intercept parameter):

\[ \begin {bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end {bmatrix} = \beta_0 + \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \begin{bmatrix} \beta_1 \end{bmatrix} \] Which corresponds to a system of linear equations:

\[ E[y_1] = \beta_0 + x_1\beta_1 \\ E[y_2] = \beta_0 + x_2\beta_1 \\ \vdots \\ E[y_n] = \beta_0 + x_n\beta_1 \] If we want to extend this model to include multiple parameters, we are including a new vector of \(x\) values, and a new \(\beta\) parameter. So, our new model becomes

\[ E[y_i] = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} \]

In matrix form, this model can be simplified as

\[ E[y] = \textbf{X}\beta \]

Where \(\beta\) is now a vector of all of our parameters we care about estimating and \(\textbf{X}\) is now a matrix consisting of two vectors of our predictors.

\[ E \begin {bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end {bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \]

> Note that the column of 1s is there to provide a constant for the \(\beta_0\) (intercept) parameter.

This corresponds to a new system of linear equations:

\[ E[y_1] = \beta_0 + \beta_1x_{11} + \beta_2x_{21} \\ E[y_2] = \beta_0 + \beta_1x_{12} + \beta_2x_{22} \\ \vdots \\ E[y_n] = \beta_0 + \beta_1x_{1n} + \beta_2x_{2n} \]


If you have experience with linear algebra you have likely seen the derivation for the following equations. If you do not have experience with linear algebra, you probably don’t care. So, I will show the derivation, with minimal explanation. We will start with our basic system of linear equations.

\[ y = \textbf{X}\beta + \epsilon \]

\[ \epsilon = y- \textbf{X} \hat{\beta} \]

The sum of squared residuals is defined as

\[ \sum \epsilon_i^2 = [\epsilon_1,\epsilon_2,...\epsilon_n] \begin {bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end {bmatrix} = \epsilon'\epsilon \]

Recalling that

\[ \epsilon = y - \textbf{X}\hat\beta \]

\[ \epsilon'\epsilon = (y-\textbf{X}\beta)'(y-\textbf{X}\beta) \]

Using these linear algebra terms, the least-squares parameter estimate is the vector \(\beta\) that minimizes the function:

\[ (y-\textbf{X}\beta)'(y-\textbf{X}\beta) \]

From FOIL…

\[ \epsilon'\epsilon = y'y - 2\beta'\textbf{X}'y + \beta'\textbf{X}'\textbf{X}\beta \] Okay, we have derived the least squares objective for multiple regression. Now let’s see if we can use our previous leastsq function to optimize it.

It’s a good idea to remind ourselves that we aren’t just playing around with symbols for fun; these equations represent the model we are trying to apply to our data:

birth_weight ~ julian + nest_temp

So, what does this model look like for our data? First, let’s construct a model matrix, which is a matrix of our x predictors with a column of 1s for the intercept.
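One way to build such a matrix is R’s model.matrix(); a sketch on the six printed rows. (The printout that follows names its predictor columns x and x_1, so the author evidently constructed the matrix slightly differently.)

```r
dat <- data.frame(
  julian    = c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875),
  nest_temp = c(26.90120, 29.85848, 25.15303, 31.77687, 35.71426, 36.87243)
)

# Column of 1s for the intercept, then one column per predictor
X <- model.matrix(~ julian + nest_temp, data = dat)
head(X)
```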

This matrix corresponds to \(\textbf{X}\) in our system of linear equations

##   (Intercept)         x      x_1
## 1           1 105.37724 26.90120
## 2           1  91.44068 29.85848
## 3           1 111.25589 25.15303
## 4           1  80.24140 31.77687
## 5           1 105.32226 35.71426
## 6           1 108.98875 36.87243

Now, we plug the functions we derived above into our least squares function, and use par[1:3] to represent our vector of betas.

Note: %*% represents matrix multiplication
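A sketch of the matrix-form loss and the three-parameter optim() call (leastsq_mat is a hypothetical name; the data are the six printed rows, so the estimates will differ slightly from the full-data output):

```r
# Matrix version of the least-squares loss: par is the vector (b0, b1, b2)
leastsq_mat <- function(par, X, y) {
  sum((y - X %*% par)^2)
}

julian    <- c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875)
nest_temp <- c(26.90120, 29.85848, 25.15303, 31.77687, 35.71426, 36.87243)
y         <- c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
X <- cbind(1, julian, nest_temp)   # model matrix with an intercept column

fit <- optim(par = c(int = 43, julian = 0.1, nest_temp = 0),
             fn = leastsq_mat, X = X, y = y)
round(fit$par, 3)
```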

##       int    julian nest_temp 
##    48.090     0.066    -0.063

And now let’s just double check that FOIL’ing our function above gives the same result.
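A sketch of the expanded form as a function (leastsq_foil is a hypothetical name), checking that it matches the direct sum of squared residuals on the six printed rows:

```r
# Expanded ("FOIL'ed") form of the same loss:
# y'y - 2 b'X'y + b'X'X b
leastsq_foil <- function(par, X, y) {
  drop(t(y) %*% y - 2 * t(par) %*% t(X) %*% y +
         t(par) %*% t(X) %*% X %*% par)
}

julian    <- c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875)
nest_temp <- c(26.90120, 29.85848, 25.15303, 31.77687, 35.71426, 36.87243)
y         <- c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
X <- cbind(1, julian, nest_temp)

# Same value as the direct sum of squared residuals
b <- c(43, 0.1, 0)
c(direct = sum((y - X %*% b)^2), foil = leastsq_foil(b, X, y))
```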

##       int    julian nest_temp 
##    48.090     0.066    -0.063

Now, let’s see how some of R’s built-in functions can solve this problem for us. Using linear algebra notation, R has a convenient function called solve() which…get this…solves systems of linear equations. For us, the closed-form least squares solution is

\[ \hat\beta = (X'X)^{-1}X'y \]

So let’s use solve to…solve it.
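A sketch using solve() on the six printed rows. Note that solve(A, b) solves the system directly, which is numerically preferable to explicitly inverting \(X'X\):

```r
julian    <- c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875)
nest_temp <- c(26.90120, 29.85848, 25.15303, 31.77687, 35.71426, 36.87243)
y         <- c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
X <- cbind("(Intercept)" = 1, julian, nest_temp)

# Closed-form OLS solution: beta-hat solves (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
round(t(beta_hat), 4)
```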

##      (Intercept)      x     x_1
## [1,]     48.0903 0.0665 -0.0627

And finally, let’s see if all of our analytical techniques give similar results to those acquired through the ever-so-simple lm() function:
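The corresponding lm() call, sketched on the six printed rows (the chapter’s comparison uses the full dataset):

```r
dat <- data.frame(
  julian       = c(105.37724, 91.44068, 111.25589, 80.24140, 105.32226, 108.98875),
  nest_temp    = c(26.90120, 29.85848, 25.15303, 31.77687, 35.71426, 36.87243),
  birth_weight = c(53.51261, 52.49336, 52.46664, 51.79504, 52.38861, 53.32003)
)

# Multiple regression with both predictors
fit_lm <- lm(birth_weight ~ julian + nest_temp, data = dat)
round(coef(fit_lm), 3)
```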

## (Intercept)      julian   nest_temp 
##      48.090       0.066      -0.063
##            int     julian  nest_temp
## OLS_1 48.09000 0.06600000 -0.0630000
##       48.09030 0.06650000 -0.0627000
##       48.09032 0.06649805 -0.0627491

Huzzah! We have now seen how least squares works. Next, we can see how it compares and contrasts with maximum likelihood.