Chapter 4 How R solves linear models

With a decent amount of background out of the way, we can now start to dive into the programming of the lm function. I will be explicitly referencing the source code, which at time of writing this can be found here, and the actual lm documentation can be found here.

4.1 R

4.2 What’s the goal?

Let’s think about our general flow of information here. If we were a programmer wanting to write a similar function, what are the ultimate tasks our function should accomplish? n this case we want to 1. Receive some input from the user that specifies a linear model. 2. Convert this formula to a design matrix and response vector 3. Calculate regression coefficients by solving the system of linear equations 4. Return relevant information to the user

4.2.1 Walking through the function calls

It is important to understand the various arguments that CAN be passed to the function. For the remainder of this walkthrough we will ignore many of these, but it is still valuable to grasp the full functionality.

We can see the description of the function by typing ?lm or help(lm). Below is the specification for the function where we see all available input parameters. Note the ... represents further inputs that are possible, but are currently disregarded unless called explicitly. (This is the equivalent of the Python Args and Kwargs arguments). I give a little example here

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

For many, these arguments will be relatively familiar, for example the formula and data arguments. These make use of core R data types. We then get into arguments that explicitly control how lm solves the least square problem. For example, applying a weights vector will push lm to solve the problem via Weighted Least Squares, rather than Ordinary least Squares. The method argument is interesting in that there is only one possibility – "qr". Along the way, there was likely an option for either "SVD" or another solving method, but these have obviously been deprecated, while the sole input argument remains. The remaining arguments are relatively boring and just give the user more control over what type of output is returned. For now, we will ignore these.

The final thing to note, is the recurring message See also Details which directs us to another R function, lm.fit. This is the function that ultimately does the heavy lifting, so lets look at the surface level arguments for that.

?lm.fit

lm.fit (x, y,    offset = NULL, method = "qr", tol = 1e-7,
       singular.ok = TRUE, ...)

Here we recognize a few of the arguments passed to lm from before. We see now we have explicit inputs for x and y rather than a formula. The remaining arguments are inherited from lm. Again, our method defaults to qr although I emphasize again, this is the only option. We will keep lm.fit in mind as we will return to it in a moment. For now though, lets start combing through the lm source code to take a peek under the hood,

4.2.2 Cleanup work and defensive programming

A major chunk of this code – and all good code, really – is ensuring the inputs are in a useable format for the downstream operations. This involves checking inputs, throwing useful error codes and formatting what we can to make sure our data is in a format that can be used by other functions. Moving forward we will go ahead and ignore this code, but it is necessary for a well functioning program.

Note: moving forward I will be referencing the source code. Line numbers represent line numbers in the source documents at the time of writing. They may change…

There are 18 different errors that can be thrown here. Most of the have to do with mis-specifying data types or inputtng data with mismatching lengths.

Lines 48-52 define an error if the vector of offsets is not equal to the total number of observations.

if(!is.null(offset)) {
        if(length(offset) != NROW(y))
            stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
                          length(offset), NROW(y)), domain = NA)
    }

Another major chunk of the code is spent setting object attributes and defining classes of objects.

Lines 72-78 are specifying the attributes for the matrix of coefficients

class(z) <- c(if(is.matrix(y)) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt

Again, most of the source code is this type of code and therefore, for our purposes is relatively boring, albeit necessary. We will spend the remainder of our time only focusing on the fun parts of the code.

4.2.3 Specifying the design matrix

The first major objective of our function is to convert the user defined formula to a design matrix. This is acheived in lines 28-37 of our source code.

    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
               names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    ## need stats:: for non-standard evaluation
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval(mf, parent.frame())
    if (method == "model.frame")
    return(mf)

This is a relatively dense and fairly convoluded block of code but highlights some interesting R functionality, which is the match.call() function. The match.call() function takes in a functions arguments and returns a call class in which all of the arguments are specified by their full character names. That’s a bit jargony, so let’s go on a side-quest. I dive a bit deeper into this odd function here.

So, we see these lines of code are effectively respecifying

lm(y~x1 + x2, data = df, weights = w)

as a call to stats::model.frame like

stats::model.frame(y~x1+x2, data = df, weights = w)

And no additional evaluation or default specification was needed.

model.frame gives us a dat frame that contains all of the terms in our model. How R converts from a formula to a data.frame is an interesting feat that I will not go into for now, but may attach it to the Appendix in the future. For now, we can be happy we have created a data.frame that lays our our fully specified model.

4.2.4 Calculating the regression coefficients – lm.fit

Lines 67-70 are where a lot of magic starts to happen.

x <- model.matrix(mt, mf, contrasts)
z <- if(is.null(w)) lm.fit(x, y, offset = offset,
                           singular.ok=singular.ok, ...)
else lm.wfit(x, y, w, offset = offset, singular.ok=singular.ok, ...)
    }

We are now passing our model.frame to our lm.fit function. lm.fit requires a matrix so in line 67 we convert our data.frame we made earlier to a model.matrix. At line 68 we are passing in our data matrix and response vector into lm.fit and store the output in a variable z. Inside lm.fit is where a lot of the heavy lifting for calcuating regression coefficients occurs.

We can call lm.fit explicitly if we like by feeding in a design matrix x and a response vector y.

x = matrix(rnorm(20), ncol = 2)
y = rnorm(10)

lm.fit(x,y)

## $coefficients
##         x1         x2 
## -0.5140863 -0.3825237 
## 
## $residuals
##  [1] -0.5194875 -0.5030902  1.1161387 -0.4240432  0.1522698  1.2613321
....

So, we see the output of lm.fit is som recognizable stuff including coefficients, residuals, fitted values, etc. So, the top layer lm is really just housing a lot of the post-processing and formatting.

We then jump down to line 90 where the actual formulation of lm.fit begins. Lines 91-113 are more defensive programming, making sure things are formatting correctly, and most importantly throwing an error if x is not a matrix in line 93.

The next interesting thing happens at line 114.

z <- .Call(C_Cdqrls, x, y, tol, FALSE)

.Call is R’s way of calling C code. What we can see here is that z – which will ultimately be our final regression result – is actually being computed in a different language and being fed back up to R.

The way .Call syntax works, is that the first argument is the C function which is being called. In this case, the function we are calling is C_dqrls. qrls (ie QR decomposition for least squares…). We will now explore some C code…

4.3 C

We will now explore the C_dqrls code which can be found here. Immediately, the syntax is a bit odd. Essentially, all R code is written on top of C so all objects a normal R programmer is familiar with have a representation in C language. I will not delve too much into the details here, as there are lots of resources to explore how R data types are actually represented in memory, however I will try to cover a bit of that here.

All R objects are stored at the C level as a common datatype. All R objects are what are called S-expressions (SEXP) which are pointers to a specific structure with datatype SEXPREC. What this means is that any time you write a function in C to act on R objects, you miust specify the internal structure of the incoming R objects and also specify the internal structure of the output to be sent back to R.

All R objects can be represented as a few different SEXPs. Some of the most common are:

REALSXP : numeric vector
INTSXP : integer vector
LGLXSP : logical vector
VECSXP : An R list

With this in mind, we can look at some of the Cdqrls code.

The first thing that is worth pointing out is actually a comment at line 35-45:

/* A wrapper to replace
    z <- .Fortran("dqrls",
          qr = x, n = n, p = p,
          y = y, ny = ny,
          tol = as.double(tol),
          coefficients = mat.or.vec(p, ny),
          residuals = y, effects = y, rank = integer(1L),
          pivot = 1L:p, qraux = double(p), work = double(2*p),
                  PACKAGE="base")
    with less allocation.
*/

So immediately, we see this is actually a helper function to replace an old R feature that allowed you to call directly to Fortran code. So, we suspect this code will really just help us get our model to Fortran…

So, let’s look at our initial formulation at line 48 and compare it to our intial call from lm.fit

SEXP Cdqrls(SEXP x, SEXP y, SEXP tol, SEXP chk)

We see the input here are generic SEXP R objects corresponding to our x matrix, y vector, tol (tolerance threshold) scalar, and a boolean function check parameter which was fed into our function as FALSE.

Line 50-51 then define a SEXP type called ans and SEXP types for our output data. These are the generic R compatible pointers.

SEXP ans;
SEXP qr, coefficients, residuals, effects, pivot, qraux;

There is then a bunch more defensive programming and even more opportunities to throw errors…especially is x is not a matrix (line 57).

We then get to line 83 where we are going to assign our explicit R types to those SEXP pointers we assigned before. Here, we create an array to hold our output and assign it a pointer. We then convert it to a VECSXP type which we will be able to send back up to R without any problems. The PROTECT function is a macro tells R that new objects assigned to this location should be protected, rather than collected from the memory heap.

const char *ansNms[] = {"qr", "coefficients", "residuals", "effects",
                        "rank", "pivot", "qraux", "tol", "pivoted", ""};
PROTECT(ans = mkNamed(VECSXP, ansNms));

We then spend lines 86-100 formatting this array to be passed to another programming language.

F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
REAL(coefficients), REAL(residuals), REAL(effects),
&rank, INTEGER(pivot), REAL(qraux), work);

Again, the real work of calculating regression coefficients will take place in a different programming language, FORTRAN. The F77_CALL is a wrapper that calls directly into FORTRAN functions from C. This suggests that we are calling into the FORTRAN function, dqrls…fitting.

4.4 Fortran

We are now ready to move on to our third language, FORTRAN. The source code for the dqrls function can be found here

4.4.1 QR decomposition

4.4.1.1 Householder algorithm

4.5 The machine code

Lol no.