Chapter 4 How R solves linear models
With a decent amount of background out of the way, we can now start to dive into the programming of the lm
function. I will be explicitly referencing the source code, which at time of writing this can be found here, and the actual lm
documentation can be found here.
4.1 R
4.2 What’s the goal?
Let’s think about our general flow of information here. If we were a programmer wanting to write a similar function, what are the ultimate tasks our function should accomplish? n this case we want to 1. Receive some input from the user that specifies a linear model. 2. Convert this formula to a design matrix and response vector 3. Calculate regression coefficients by solving the system of linear equations 4. Return relevant information to the user
4.2.1 Walking through the function calls
It is important to understand the various arguments that CAN be passed to the function. For the remainder of this walkthrough we will ignore many of these, but it is still valuable to grasp the full functionality.
We can see the description of the function by typing ?lm
or help(lm)
. Below is the specification for the function where we see all available input parameters. Note the ...
represents further inputs that are possible, but are currently disregarded unless called explicitly. (This is the equivalent of the Python
Args
and Kwargs
arguments). I give a little example here
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
For many, these arguments will be relatively familiar, for example the formula
and data
arguments. These make use of core R
data types. We then get into arguments that explicitly control how lm
solves the least square problem. For example, applying a weights
vector will push lm
to solve the problem via Weighted Least Squares, rather than Ordinary least Squares. The method
argument is interesting in that there is only one possibility – "qr"
. Along the way, there was likely an option for either "SVD"
or another solving method, but these have obviously been deprecated, while the sole input argument remains. The remaining arguments are relatively boring and just give the user more control over what type of output is returned. For now, we will ignore these.
The final thing to note, is the recurring message See also Details
which directs us to another R function, lm.fit
. This is the function that ultimately does the heavy lifting, so lets look at the surface level arguments for that.
?lm.fit
lm.fit (x, y, offset = NULL, method = "qr", tol = 1e-7,
singular.ok = TRUE, ...)
Here we recognize a few of the arguments passed to lm
from before. We see now we have explicit inputs for x
and y
rather than a formula. The remaining arguments are inherited from lm
. Again, our method defaults to qr
although I emphasize again, this is the only option. We will keep lm.fit
in mind as we will return to it in a moment. For now though, lets start combing through the lm
source code to take a peek under the hood,
4.2.2 Cleanup work and defensive programming
A major chunk of this code – and all good code, really – is ensuring the inputs are in a useable format for the downstream operations. This involves checking inputs, throwing useful error codes and formatting what we can to make sure our data is in a format that can be used by other functions. Moving forward we will go ahead and ignore this code, but it is necessary for a well functioning program.
Note: moving forward I will be referencing the source code. Line numbers represent line numbers in the source documents at the time of writing. They may change…
There are 18 different errors that can be thrown here. Most of the have to do with mis-specifying data types or inputtng data with mismatching lengths.
Lines 48-52 define an error if the vector of offsets is not equal to the total number of observations.
if(!is.null(offset)) {
if(length(offset) != NROW(y))
stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
length(offset), NROW(y)), domain = NA)
}
Another major chunk of the code is spent setting object attributes and defining classes of objects.
Lines 72-78 are specifying the attributes for the matrix of coefficients
class(z) <- c(if(is.matrix(y)) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
Again, most of the source code is this type of code and therefore, for our purposes is relatively boring, albeit necessary. We will spend the remainder of our time only focusing on the fun parts of the code.
4.2.3 Specifying the design matrix
The first major objective of our function is to convert the user defined formula
to a design matrix. This is acheived in lines 28-37 of our source code.
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
## need stats:: for non-standard evaluation
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
if (method == "model.frame")
return(mf)
This is a relatively dense and fairly convoluded block of code but highlights some interesting R
functionality, which is the match.call()
function. The match.call()
function takes in a functions arguments and returns a call
class in which all of the arguments are specified by their full character names. That’s a bit jargony, so let’s go on a side-quest. I dive a bit deeper into this odd function here.
So, we see these lines of code are effectively respecifying
lm(y~x1 + x2, data = df, weights = w)
as a call to stats::model.frame
like
::model.frame(y~x1+x2, data = df, weights = w) stats
And no additional evaluation or default specification was needed.
model.frame
gives us a dat frame that contains all of the terms in our model. How R
converts from a formula
to a data.frame
is an interesting feat that I will not go into for now, but may attach it to the Appendix in the future. For now, we can be happy we have created a data.frame
that lays our our fully specified model.
4.2.4 Calculating the regression coefficients – lm.fit
Lines 67-70 are where a lot of magic starts to happen.
x <- model.matrix(mt, mf, contrasts)
z <- if(is.null(w)) lm.fit(x, y, offset = offset,
singular.ok=singular.ok, ...)
else lm.wfit(x, y, w, offset = offset, singular.ok=singular.ok, ...)
}
We are now passing our model.frame
to our lm.fit
function. lm.fit
requires a matrix
so in line 67 we convert our data.frame
we made earlier to a model.matrix
. At line 68 we are passing in our data matrix and response vector into lm.fit
and store the output in a variable z. Inside lm.fit
is where a lot of the heavy lifting for calcuating regression coefficients occurs.
We can call lm.fit
explicitly if we like by feeding in a design matrix x
and a response vector y
.
= matrix(rnorm(20), ncol = 2)
x = rnorm(10)
y
lm.fit(x,y)
## $coefficients
## x1 x2
## -0.5140863 -0.3825237
##
## $residuals
## [1] -0.5194875 -0.5030902 1.1161387 -0.4240432 0.1522698 1.2613321
....
So, we see the output of lm.fit
is som recognizable stuff including coefficients, residuals, fitted values, etc. So, the top layer lm
is really just housing a lot of the post-processing and formatting.
We then jump down to line 90 where the actual formulation of lm.fit
begins. Lines 91-113 are more defensive programming, making sure things are formatting correctly, and most importantly throwing an error if x is not a matrix in line 93.
The next interesting thing happens at line 114.
.Call
is R
’s way of calling C
code. What we can see here is that z
– which will ultimately be our final regression result – is actually being computed in a different language and being fed back up to R
.
The way .Call
syntax works, is that the first argument is the C
function which is being called. In this case, the function we are calling is C_dqrls
. qrls
(ie QR decomposition for least squares…). We will now explore some C
code…
4.3 C
We will now explore the C_dqrls
code which can be found here. Immediately, the syntax is a bit odd. Essentially, all R code is written on top of C
so all objects a normal R
programmer is familiar with have a representation in C
language. I will not delve too much into the details here, as there are lots of resources to explore how R
data types are actually represented in memory, however I will try to cover a bit of that here.
All R
objects are stored at the C
level as a common datatype. All R
objects are what are called S-expressions (SEXP
) which are pointers to a specific structure with datatype SEXPREC
. What this means is that any time you write a function in C
to act on R
objects, you miust specify the internal structure of the incoming R
objects and also specify the internal structure of the output to be sent back to R
.
All R
objects can be represented as a few different SEXP
s. Some of the most common are:
REALSXP
: numeric vectorINTSXP
: integer vectorLGLXSP
: logical vectorVECSXP
: AnR
list
With this in mind, we can look at some of the Cdqrls
code.
The first thing that is worth pointing out is actually a comment at line 35-45:
/* A wrapper to replace
z <- .Fortran("dqrls",
qr = x, n = n, p = p,
y = y, ny = ny,
tol = as.double(tol),
coefficients = mat.or.vec(p, ny),
residuals = y, effects = y, rank = integer(1L),
pivot = 1L:p, qraux = double(p), work = double(2*p),
PACKAGE="base")
with less allocation.
*/
So immediately, we see this is actually a helper function to replace an old R feature that allowed you to call directly to Fortran
code. So, we suspect this code will really just help us get our model to Fortran
…
So, let’s look at our initial formulation at line 48 and compare it to our intial call from lm.fit
We see the input here are generic SEXP
R
objects corresponding to our x
matrix, y
vector, tol
(tolerance threshold) scalar, and a boolean function check parameter which was fed into our function as FALSE
.
Line 50-51 then define a SEXP
type called ans
and SEXP
types for our output data. These are the generic R
compatible pointers.
There is then a bunch more defensive programming and even more opportunities to throw errors…especially is x
is not a matrix (line 57).
We then get to line 83 where we are going to assign our explicit R
types to those SEXP
pointers we assigned before. Here, we create an array to hold our output and assign it a pointer. We then convert it to a VECSXP
type which we will be able to send back up to R
without any problems. The PROTECT
function is a macro tells R
that new objects assigned to this location should be protected, rather than collected from the memory heap.
const char *ansNms[] = {"qr", "coefficients", "residuals", "effects",
"rank", "pivot", "qraux", "tol", "pivoted", ""};
PROTECT(ans = mkNamed(VECSXP, ansNms));
We then spend lines 86-100 formatting this array to be passed to another programming language.
F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
REAL(coefficients), REAL(residuals), REAL(effects),
&rank, INTEGER(pivot), REAL(qraux), work);
Again, the real work of calculating regression coefficients will take place in a different programming language, FORTRAN
. The F77_CALL
is a wrapper that calls directly into FORTRAN
functions from C
. This suggests that we are calling into the FORTRAN
function, dqrls
…fitting.
4.4 Fortran
We are now ready to move on to our third language, FORTRAN. The source code for the dqrls
function can be found here
4.4.1 QR decomposition
4.4.1.1 Householder algorithm
4.5 The machine code
Lol no.