# Chapter 4 How R solves linear models

With a decent amount of background out of the way, we can now start to dive into the programming of the `lm` function. I will be explicitly referencing the source code, which at the time of writing can be found here, and the actual `lm` documentation can be found here.

## 4.1 R

## 4.2 What’s the goal?

Let’s think about our general flow of information here. If we were a programmer wanting to write a similar function, what are the ultimate tasks our function should accomplish? In this case we want to:

1. Receive some input from the user that specifies a linear model.
2. Convert this formula to a design matrix and response vector.
3. Calculate regression coefficients by solving the system of linear equations.
4. Return relevant information to the user.
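The steps above can be sketched end-to-end in a few lines of base R (the data frame `df` here is invented purely for illustration):

```r
set.seed(42)
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))

# 1. the user specifies a linear model via a formula
f <- y ~ x1 + x2

# 2. convert the formula to a design matrix and response vector
X <- model.matrix(f, df)
y <- df$y

# 3. solve the least-squares system (lm uses a QR decomposition)
beta <- qr.coef(qr(X), y)

# 4. return relevant information to the user
beta
```

Everything `lm` does beyond this sketch is input validation, bookkeeping, and post-processing, as we will see.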

### 4.2.1 Walking through the function calls

It is important to understand the various arguments that CAN be passed to the function. For the remainder of this walkthrough we will ignore many of these, but it is still valuable to grasp the full functionality.

We can see the description of the function by typing `?lm` or `help(lm)`. Below is the specification for the function, where we see all available input parameters. Note the `...` represents further inputs that are possible, but are disregarded unless passed explicitly. (This is the equivalent of Python’s `*args` and `**kwargs` arguments.) I give a little example here.

```
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
```

For many, these arguments will be relatively familiar, for example the `formula` and `data` arguments, which make use of core `R` data types. We then get into arguments that explicitly control how `lm` solves the least squares problem. For example, supplying a `weights` vector will push `lm` to solve the problem via Weighted Least Squares rather than Ordinary Least Squares. The `method` argument is interesting in that there is only one possibility – `"qr"`. Along the way there was likely an option for `"SVD"` or another solving method, but these have evidently been removed, while the sole input argument remains. The remaining arguments are relatively boring and just give the user more control over what type of output is returned. For now, we will ignore these.
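To see the `weights` argument flip the solver from Ordinary to Weighted Least Squares, here is a small sketch (the data and weight vector are invented for illustration):

```r
set.seed(7)
df <- data.frame(y = rnorm(20), x = rnorm(20))
w  <- runif(20)

fit_ols <- lm(y ~ x, data = df)               # ordinary least squares
fit_wls <- lm(y ~ x, data = df, weights = w)  # weighted least squares

# The weighted fit minimizes sum(w * residuals^2), so the
# coefficients generally differ from the unweighted fit
coef(fit_ols)
coef(fit_wls)
```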

The final thing to note is the recurring message `See also Details`, which directs us to another R function, `lm.fit`. This is the function that ultimately does the heavy lifting, so let’s look at the surface-level arguments for that.

```
?lm.fit
lm.fit(x, y, offset = NULL, method = "qr", tol = 1e-7,
       singular.ok = TRUE, ...)
```

Here we recognize a few of the arguments passed to `lm` from before. We see now we have explicit inputs for `x` and `y` rather than a formula. The remaining arguments are inherited from `lm`. Again, our method defaults to `qr`, although I emphasize again, this is the only option. We will keep `lm.fit` in mind as we will return to it in a moment. For now though, let’s start combing through the `lm` source code to take a peek under the hood.

### 4.2.2 Cleanup work and defensive programming

A major chunk of this code – and all good code, really – is ensuring the inputs are in a usable format for the downstream operations. This involves checking inputs, throwing useful error messages, and formatting what we can to make sure our data is in a format that can be used by other functions. Moving forward we will go ahead and ignore this code, but it is necessary for a well-functioning program.

Note: moving forward I will be referencing the source code. Line numbers represent line numbers in the source documents at the time of writing. They may change…

There are 18 different errors that can be thrown here. Most of them have to do with mis-specifying data types or inputting data with mismatching lengths.

Lines 48-52 throw an error if the length of the offset vector does not equal the total number of observations.

```
if(!is.null(offset)) {
    if(length(offset) != NROW(y))
        stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
                      length(offset), NROW(y)), domain = NA)
}
```
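We can trigger a length mismatch like this ourselves by supplying an offset of the wrong size (toy data invented for illustration). Note that depending on where the mismatch is caught first, the exact message may come from `model.frame`'s own length check rather than these lines:

```r
df <- data.frame(y = rnorm(10), x = rnorm(10))

# 10 observations but only 5 offsets -> an error is thrown
try(lm(y ~ x, data = df, offset = rep(0, 5)))
```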

Another major chunk of the code is spent setting object attributes and defining classes of objects.

Lines 72-78 specify the class and attributes of the returned model object.

```
class(z) <- c(if(is.matrix(y)) "mlm", "lm")
z$na.action <- attr(mf, "na.action")
z$offset <- offset
z$contrasts <- attr(x, "contrasts")
z$xlevels <- .getXlevels(mt, mf)
z$call <- cl
z$terms <- mt
```

Again, most of the source code is this type of code and therefore, for our purposes, is relatively boring, albeit necessary. We will spend the remainder of our time focusing only on the fun parts of the code.

### 4.2.3 Specifying the design matrix

The first major objective of our function is to convert the user-defined `formula` into a design matrix. This is achieved in lines 28-37 of our source code.

```
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
           names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
## need stats:: for non-standard evaluation
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
if (method == "model.frame")
    return(mf)
```

This is a relatively dense and fairly convoluted block of code, but it highlights some interesting `R` functionality, namely the `match.call()` function. The `match.call()` function takes in a function’s arguments and returns a `call` object in which all of the arguments are specified by their full parameter names. That’s a bit jargony, so let’s go on a side-quest. I dive a bit deeper into this odd function here.

So, we see these lines of code are effectively respecifying

`lm(y ~ x1 + x2, data = df, weights = w)`

as a call to `stats::model.frame` like

`stats::model.frame(y ~ x1 + x2, data = df, weights = w)`

and no additional evaluation or default specification was needed.
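A tiny illustration of this matching behavior (the function `show_call` here is made up for demonstration): `match.call()` records every argument under its full parameter name, without evaluating anything.

```r
show_call <- function(formula, data, weights) match.call()

# df and w are never evaluated, so they don't even need to exist
cl <- show_call(y ~ x1 + x2, df, w)
cl
# show_call(formula = y ~ x1 + x2, data = df, weights = w)
```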

`model.frame` gives us a data frame that contains all of the terms in our model. How `R` converts from a `formula` to a `data.frame` is an interesting feat that I will not go into for now, but may attach to the Appendix in the future. For now, we can be happy we have created a `data.frame` that lays out our fully specified model.
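A quick look at what `model.frame` actually returns (toy data invented for illustration): it keeps exactly the variables the formula mentions and drops everything else.

```r
df <- data.frame(y = 1:4, x1 = 5:8, x2 = 9:12, junk = letters[1:4])

# only the variables named in the formula survive; "junk" is dropped
mf <- model.frame(y ~ x1 + x2, data = df)
names(mf)
# [1] "y"  "x1" "x2"
```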

### 4.2.4 Calculating the regression coefficients – lm.fit

Lines 67-70 are where a lot of magic starts to happen.

```
x <- model.matrix(mt, mf, contrasts)
z <- if(is.null(w)) lm.fit(x, y, offset = offset,
                           singular.ok = singular.ok, ...)
     else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, ...)
```

We are now passing our `model.frame` to our `lm.fit` function. `lm.fit` requires a `matrix`, so in line 67 we convert the `data.frame` we made earlier to a `model.matrix`. At line 68 we pass our data matrix and response vector into `lm.fit` and store the output in a variable `z`. Inside `lm.fit` is where a lot of the heavy lifting for calculating regression coefficients occurs.

We can call `lm.fit` explicitly if we like by feeding in a design matrix `x` and a response vector `y`.

```
x = matrix(rnorm(20), ncol = 2)
y = rnorm(10)

lm.fit(x, y)
```

```
## $coefficients
## x1 x2
## -0.5140863 -0.3825237
##
## $residuals
## [1] -0.5194875 -0.5030902 1.1161387 -0.4240432 0.1522698 1.2613321
....
```

So, we see the output of `lm.fit` is some recognizable stuff including coefficients, residuals, fitted values, etc. So, the top layer `lm` is really just housing a lot of the post-processing and formatting.
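We can verify this division of labor directly by comparing the components of the two return values (toy data invented for illustration):

```r
set.seed(3)
df <- data.frame(y = rnorm(10), x = rnorm(10))

fit_top <- lm(y ~ x, data = df)
fit_low <- lm.fit(model.matrix(~ x, df), df$y)

# the numerical work agrees...
all.equal(coef(fit_top), fit_low$coefficients)

# ...while lm adds post-processing components such as the
# call, the terms object, and the model frame
setdiff(names(fit_top), names(fit_low))
```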

We then jump down to line 90 where the actual formulation of `lm.fit` begins. Lines 91-113 are more defensive programming, making sure things are formatted correctly and, most importantly, throwing an error if `x` is not a matrix in line 93.

The next interesting thing happens at line 114.

`.Call` is `R`’s way of calling `C` code. What we can see here is that `z` – which will ultimately be our final regression result – is actually being computed in a different language and fed back up to `R`.

The way `.Call` syntax works is that the first argument is the `C` function being called. In this case, the function we are calling is `C_dqrls`, i.e. QR decomposition for least squares… We will now explore some `C` code…

## 4.3 C

We will now explore the `C_dqrls` code, which can be found here. Immediately, the syntax is a bit odd. Essentially, all R code is written on top of `C`, so all objects a normal `R` programmer is familiar with have a representation in the `C` language. I will not delve too much into the details here, as there are lots of resources exploring how `R` data types are actually represented in memory; however, I will try to cover a bit of that here.

All `R` objects are stored at the `C` level as a common datatype. All `R` objects are what are called S-expressions (`SEXP`), which are pointers to a specific structure with datatype `SEXPREC`. What this means is that any time you write a function in `C` to act on `R` objects, you must specify the internal structure of the incoming `R` objects and also specify the internal structure of the output to be sent back to `R`.

All `R` objects can be represented as a few different `SEXP`s. Some of the most common are:

- `REALSXP`: numeric vector
- `INTSXP`: integer vector
- `LGLSXP`: logical vector
- `VECSXP`: an `R` list
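We can peek at these types from the `R` side with `typeof()`, which reports the `SEXP` type underlying any object:

```r
typeof(1.5)         # "double"  -> REALSXP
typeof(1L)          # "integer" -> INTSXP
typeof(TRUE)        # "logical" -> LGLSXP
typeof(list(1, 2))  # "list"    -> VECSXP
```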

With this in mind, we can look at some of the `C_dqrls` code.

The first thing worth pointing out is actually a comment at lines 35-45:

```
/* A wrapper to replace
    z <- .Fortran("dqrls",
                  qr = x, n = n, p = p,
                  y = y, ny = ny,
                  tol = as.double(tol),
                  coefficients = mat.or.vec(p, ny),
                  residuals = y, effects = y, rank = integer(1L),
                  pivot = 1L:p, qraux = double(p), work = double(2*p),
                  PACKAGE="base")
    with less allocation.
*/
```

So immediately, we see this is actually a wrapper to replace an old R feature that allowed you to call directly into `Fortran` code. So, we suspect this code will really just help us get our model to `Fortran`…

So, let’s look at our initial formulation at line 48 and compare it to our initial call from `lm.fit`. We see the inputs here are generic `SEXP` `R` objects corresponding to our `x` matrix, `y` vector, `tol` (tolerance threshold) scalar, and a boolean check parameter which was fed into our function as `FALSE`.

Lines 50-51 then define a `SEXP` type called `ans` and `SEXP` types for our output data. These are the generic `R`-compatible pointers.

There is then a bunch more defensive programming and even more opportunities to throw errors… especially if `x` is not a matrix (line 57).

We then get to line 83, where we assign our explicit `R` types to those `SEXP` pointers we declared before. Here, we create an array to hold our output and assign it a pointer. We then convert it to a `VECSXP` type which we will be able to send back up to `R` without any problems. The `PROTECT` function is a macro that tells `R`’s garbage collector that new objects assigned to this location should be protected, rather than collected from the memory heap.

```
const char *ansNms[] = {"qr", "coefficients", "residuals", "effects",
"rank", "pivot", "qraux", "tol", "pivoted", ""};
PROTECT(ans = mkNamed(VECSXP, ansNms));
```

We then spend lines 86-100 formatting this array to be passed to another programming language.

```
F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
                REAL(coefficients), REAL(residuals), REAL(effects),
                &rank, INTEGER(pivot), REAL(qraux), work);
```

Again, the real work of calculating regression coefficients will take place in a different programming language, `FORTRAN`. `F77_CALL` is a wrapper macro that calls directly into `FORTRAN` functions from `C`. This tells us that we are calling into the `FORTRAN` function `dqrls`, which performs the actual fitting.

## 4.4 Fortran

We are now ready to move on to our third language, `FORTRAN`. The source code for the `dqrls` function can be found here.
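Before reading the Fortran itself, it helps to see what `dqrls` accomplishes, viewed from the `R` level: decompose the design matrix as x = QR, then solve R b = t(Q) y by back-substitution. A minimal sketch using base R's `qr()` tools (which call the same family of LINPACK/LAPACK routines):

```r
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
y <- rnorm(10)

# QR decomposition of the design matrix: x = Q %*% R
decomp <- qr(x)

# solve R %*% beta = t(Q) %*% y by back-substitution
beta_qr <- qr.coef(decomp, y)

# the same coefficients lm.fit produces
beta_lm <- lm.fit(x, y)$coefficients
all.equal(unname(beta_qr), unname(beta_lm))  # TRUE
```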

### 4.4.1 QR decomposition

#### 4.4.1.1 Householder algorithm

## 4.5 The machine code

Lol no.