Chapter 4 How R solves linear models
With a decent amount of background out of the way, we can now start to dive into the programming of the lm function. I will be referencing the source code explicitly, which at the time of writing can be found here, and the official lm documentation, which can be found here.
4.2 What’s the goal?
Let’s think about our general flow of information here. If we were a programmer wanting to write a similar function, what are the ultimate tasks our function should accomplish? In this case we want to:

1. Receive some input from the user that specifies a linear model.
2. Convert this formula to a design matrix and response vector.
3. Calculate regression coefficients by solving the system of linear equations.
4. Return relevant information to the user.
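That pipeline can be sketched directly in R (a toy example of my own, not the actual lm implementation; the data and variable names are made up):

```r
# A minimal sketch of the lm pipeline, not the real implementation.
df <- data.frame(y = c(1, 2, 3, 5), x = c(0, 1, 2, 3))

# Steps 1-2: turn a formula into a design matrix and response vector
X <- model.matrix(y ~ x, data = df)  # adds an intercept column
y <- df$y

# Step 3: solve the least squares system (lm itself uses a QR decomposition)
beta <- qr.solve(X, y)

# Step 4: return the coefficients
beta  # intercept 0.8, slope 1.3 for this toy data
```

The real function does far more bookkeeping, but every piece of it serves one of these four steps.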
4.2.1 Walking through the function calls
It is important to understand the various arguments that can be passed to the function. For the remainder of this walkthrough we will ignore many of them, but it is still valuable to grasp the full functionality.
We can see the description of the function by typing help(lm). Below is the specification for the function, where we see all available input parameters. Note that ... represents further inputs that are possible but are disregarded unless passed explicitly. (This is the equivalent of Python's **kwargs argument.) I give a little example here:
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)
For many, these arguments will be relatively familiar, for example the data argument, which makes use of core R data types. We then get into arguments that explicitly control how lm solves the least squares problem. For example, supplying a weights vector will push lm to solve the problem via Weighted Least Squares rather than Ordinary Least Squares. The method argument is interesting in that there is only one possibility – "qr". At some point there was likely an option for "SVD" or another solving method, but these have evidently been dropped, while the sole input argument remains. The remaining arguments are relatively boring and just give the user more control over what type of output is returned. For now, we will ignore these.
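As a quick illustration of the weights argument, here is a made-up example showing that weighted least squares generally produces different coefficients than the ordinary fit, because it minimizes a weighted residual sum of squares:

```r
set.seed(1)
df <- data.frame(x = 1:10)
df$y <- 2 * df$x + rnorm(10)

# Heavily up-weight the last observation
w <- c(rep(1, 9), 100)

fit_ols <- lm(y ~ x, data = df)              # ordinary least squares
fit_wls <- lm(y ~ x, data = df, weights = w) # weighted least squares

coef(fit_ols)
coef(fit_wls)  # pulled toward the heavily weighted point
```

Both calls go through the same machinery; internally the weighted case simply rescales the rows of the problem before the QR solve.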
The final thing to note is the recurring message See also Details, which directs us to another R function, lm.fit. This is the function that ultimately does the heavy lifting, so let's look at the surface-level arguments for that.
?lm.fit
lm.fit(x, y, offset = NULL, method = "qr", tol = 1e-7, singular.ok = TRUE, ...)
Here we recognize a few of the arguments passed to lm from before. We now have explicit inputs for x and y rather than a formula. The remaining arguments are inherited from lm. Our method again defaults to qr, though I emphasize this is the only option. We will keep lm.fit in mind, as we will return to it in a moment. For now, let's start combing through the lm source code to take a peek under the hood.
4.2.2 Cleanup work and defensive programming
A major chunk of this code – and of all good code, really – is ensuring the inputs are in a usable format for the downstream operations. This involves checking inputs, throwing useful error messages, and formatting what we can so our data can be used by other functions. Moving forward we will ignore this code, but it is necessary for a well-functioning program.
Note: moving forward I will be referencing the source code. Line numbers represent line numbers in the source documents at the time of writing. They may change…
There are 18 different errors that can be thrown here. Most of them have to do with mis-specifying data types or inputting data with mismatched lengths.
Lines 48-52 define an error if the length of the offset vector is not equal to the total number of observations.
Another major chunk of the code is spent setting object attributes and defining classes of objects.
Lines 72-78 specify the attributes for the matrix of coefficients.
Again, most of the source code is this type of code and therefore, for our purposes, is relatively boring, albeit necessary. We will spend the remainder of our time focusing only on the fun parts of the code.
4.2.3 Specifying the design matrix
The first major objective of our function is to convert the user defined
formula to a design matrix. This is achieved in lines 28-37 of our source code.
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
## need stats:: for non-standard evaluation
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
if (method == "model.frame") return(mf)
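To see what match.call() does in isolation, here is a toy function of my own (not from the lm source):

```r
f <- function(formula, data, weights) {
  # match.call() captures the call that invoked this function,
  # with every argument matched to its full parameter name
  match.call()
}

f(y ~ x, data = df, w)
#> f(formula = y ~ x, data = df, weights = w)
```

Note that match.call() does not evaluate its arguments, so df and w need not even exist when f is called. This is exactly why lm can rewrite its own call into a stats::model.frame call before anything is evaluated.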
This is a relatively dense and fairly convoluted block of code, but it highlights some interesting R functionality: the match.call() function. match.call() takes a function's arguments and returns an object of class call in which all of the arguments are specified by their full character names. That's a bit jargony, so let's go on a side quest. I dive a bit deeper into this odd function here.
So, we see these lines of code are effectively respecifying

lm(y ~ x1 + x2, data = df, weights = w)

as a call to

stats::model.frame(y ~ x1 + x2, data = df, weights = w)

with no additional evaluation or default specification needed.
model.frame gives us a data frame that contains all of the terms in our model. How R converts from a formula to a data.frame is an interesting feat that I will not go into for now, but may attach to the Appendix in the future. For now, we can be happy we have created a data.frame that lays out our fully specified model.
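We can verify this ourselves by calling model.frame directly (a small example of my own):

```r
set.seed(42)
df <- data.frame(y = rnorm(5), x1 = rnorm(5), x2 = rnorm(5))

# model.frame evaluates the formula against the data
mf <- model.frame(y ~ x1 + x2, data = df)

class(mf)          # "data.frame"
names(mf)          # "y" "x1" "x2" -- every term in the model
attr(mf, "terms")  # the terms object lm uses downstream
```

The attached "terms" attribute is what carries the model specification forward to the design-matrix step.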
4.2.4 Calculating the regression coefficients – lm.fit
Lines 67-70 are where a lot of the magic starts to happen. We are now passing our model.frame to our fitting function. lm.fit requires a matrix, so in line 67 we convert the data.frame we made earlier to a model.matrix. At line 68 we pass our data matrix and response vector into lm.fit and store the output in a variable z. Inside lm.fit is where a lot of the heavy lifting for calculating regression coefficients occurs.
We can call lm.fit explicitly if we like by feeding in a design matrix x and a response vector y:

x <- matrix(rnorm(20), ncol = 2)
y <- rnorm(10)
lm.fit(x, y)
## $coefficients
##         x1         x2 
## -0.5140863 -0.3825237 
## 
## $residuals
## -0.5194875 -0.5030902  1.1161387 -0.4240432  0.1522698  1.2613321 ....
So, we see the output of lm.fit is some recognizable stuff, including coefficients, residuals, fitted values, etc. The top layer lm is really just housing a lot of the post-processing and formatting.
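A small comparison (my own example) makes that division of labor concrete: lm.fit returns a plain list, while lm wraps the same numbers in an object of class "lm":

```r
set.seed(7)
df <- data.frame(y = rnorm(10), x = rnorm(10))
X  <- model.matrix(y ~ x, data = df)

z   <- lm.fit(X, df$y)       # the low-level solver: a plain list
fit <- lm(y ~ x, data = df)  # the user-facing wrapper

class(z)    # "list"
class(fit)  # "lm"

# Same coefficients either way -- lm just adds structure around them
all.equal(coef(fit), z$coefficients)
```

This is why summary(), predict(), and friends work on fit but not on z: the methods dispatch on the "lm" class that only the wrapper attaches.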
We then jump down to line 90, where the actual formulation of lm.fit begins. Lines 91-113 are more defensive programming, making sure things are formatted correctly and, most importantly, throwing an error if x is not a matrix in line 93.
The next interesting thing happens at line 114, where we encounter .Call, R's way of calling C code. What we can see here is that z – which will ultimately be our final regression result – is actually being computed in a different language and fed back up to R. The way the .Call syntax works is that the first argument is the C function being called. In this case, the function we are calling is C_dqrls (i.e., QR decomposition for least squares…).
We will now explore the C_dqrls code, which can be found here. Immediately, the syntax is a bit odd. Essentially, all R code is written on top of C, so all objects a normal R programmer is familiar with have a representation in the C language. I will not delve too much into the details, as there are lots of resources exploring how R data types are actually represented in memory, but I will try to cover a bit of that here.
R objects are stored at the C level as a common datatype. All R objects are what are called S-expressions (SEXP), which are pointers to a structure with datatype SEXPREC. What this means is that any time you write a function in C to act on R objects, you must specify the internal structure of the incoming R objects and also the internal structure of the output to be sent back to R. R objects can be represented as a few different SEXPs. Some of the most common are:
REALSXP: numeric vector
INTSXP: integer vector
LGLSXP: logical vector
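From the R side, typeof() reports which of these internal types an object carries:

```r
typeof(1.5)     # "double"  -> stored as a REALSXP
typeof(1L)      # "integer" -> stored as an INTSXP
typeof(TRUE)    # "logical" -> stored as an LGLSXP
typeof(list())  # "list"    -> stored as a VECSXP (a generic vector)
```

Keeping this mapping in mind makes the C source much easier to read: every coercion and type check in C_dqrls corresponds to one of these tags.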
With this in mind, we can look at some of the C_dqrls code. The first thing worth pointing out is actually a comment at lines 35-45:
So immediately, we see this is actually a helper function to replace an old R feature that allowed you to call directly into Fortran code. So, we suspect this code will really just help us get our model to the Fortran routine. Let's look at the initial formulation at line 48 and compare it to our initial call from lm.fit.
We see the inputs here are generic R objects corresponding to our design matrix x, response vector y, tol (tolerance threshold) scalar, and a boolean check parameter which was fed into our function from lm.fit.
Lines 50-51 then define SEXP types for our output data. These are the generic R-compatible pointers.
There is then a bunch more defensive programming and even more opportunities to throw errors…especially if x is not a matrix (line 57).
We then get to line 83, where we assign our explicit R types to those SEXP pointers we declared before. Here, we create an array to hold our output and assign it a pointer. We then convert it to a VECSXP type, which we will be able to send back up to R without any problems. The PROTECT function is a macro that tells R's garbage collector that new objects assigned to this location should be protected, rather than collected from the memory heap.
We then spend lines 86-100 formatting this array to be passed to another programming language. Again, the real work of calculating regression coefficients will take place in a different language. F77_CALL is a wrapper that calls Fortran functions directly from C, which tells us we are calling into the dqrls Fortran routine. We are now ready to move on to our third language, FORTRAN. The source code for the
dqrls function can be found here
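Before diving into the Fortran, we can already mimic what dqrls ultimately computes using R's own qr() interface (a sketch of the idea, not the actual dqrls code): factor X = QR, then solve R b = Q'y.

```r
set.seed(2)
X <- cbind(1, rnorm(10))  # design matrix: intercept plus one predictor
y <- rnorm(10)

qx   <- qr(X)          # QR decomposition, same family of routines lm relies on
beta <- qr.coef(qx, y) # back-solve R %*% beta = t(Q) %*% y

# The coefficients agree with lm.fit's QR-based answer
all.equal(unname(beta), unname(lm.fit(X, y)$coefficients))
```

Solving via QR avoids forming the numerically ill-conditioned cross-product matrix t(X) %*% X, which is the main reason lm prefers it over the textbook normal-equations approach.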
4.4.1 QR decomposition
4.4.1.1 Householder algorithm
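As a flavor of the method, here is a minimal sketch (my own, not R's implementation) of a single Householder reflection, the building block that zeroes out a column below its first entry:

```r
# One Householder step: build a reflector H so that H %*% a
# is zero below its first entry.
a <- c(3, 4, 0)

v    <- a
v[1] <- v[1] + sign(a[1]) * sqrt(sum(a^2))  # reflector direction
H    <- diag(length(a)) - 2 * (v %*% t(v)) / sum(v^2)

H %*% a  # ~ c(-5, 0, 0): the column is now upper triangular
```

A full QR factorization applies one such reflection per column, accumulating the H's into Q while the transformed matrix becomes R.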
4.5 The machine code