## 4.4 Big data considerations

The computation of the least squares estimator

\begin{align} \hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\tag{4.24} \end{align}

involves inverting the $$(p+1)\times(p+1)$$ matrix $$\mathbf{X}'\mathbf{X},$$ where $$\mathbf{X}$$ is an $$n\times(p+1)$$ matrix. The vector to be obtained, $$\hat{\boldsymbol{\beta}},$$ is of size $$p+1.$$ However, computing it directly from (4.24) requires allocating $$\mathcal{O}(np + p^2)$$ elements in memory. When $$n$$ is very large, this can be prohibitive. In addition, for convenience of the statistical analysis, R’s lm returns several objects of the same size as $$\mathbf{X}$$ and $$\mathbf{Y},$$ thus notably increasing the memory usage. For these reasons, alternative approaches for computing $$\hat{\boldsymbol{\beta}}$$ with big data are required.

An approach for computing (4.24) in a memory-friendly way is to split the computation of $$(\mathbf{X}'\mathbf{X})^{-1}$$ and $$\mathbf{X}'\mathbf{Y}$$ by blocks that are storable in memory. A possibility is to update sequentially the estimation of the vector of coefficients. This can be done with the following expression, which relates $$\hat{\boldsymbol{\beta}}$$ with $$\hat{\boldsymbol{\beta}}_{-i},$$ the vector of estimated coefficients without the $$i$$-th datum:

\begin{align} \hat{\boldsymbol{\beta}}=\hat{\boldsymbol{\beta}}_{-i}+(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_{i}\left(Y_{i}-\mathbf{x}_{i}'\hat{\boldsymbol{\beta}}_{-i}\right).\tag{4.25} \end{align}

In (4.25) above, $$\mathbf{x}_{i}'$$ is the $$i$$-th row of the design matrix $$\mathbf{X}.$$ The expression follows from the Sherman–Morrison formula for an invertible matrix $$\mathbf{A}$$ and a vector $$\mathbf{b},$$

\begin{align*} (\mathbf{A}+\mathbf{b}\mathbf{b}')^{-1}=\mathbf{A}^{-1}-\frac{\mathbf{A}^{-1}\mathbf{b}\mathbf{b}'\mathbf{A}^{-1}}{1+\mathbf{b}'\mathbf{A}^{-1}\mathbf{b}}, \end{align*}

and from the equalities

\begin{align*} \mathbf{X}'\mathbf{X}&=\mathbf{X}_{-i}'\mathbf{X}_{-i}+\mathbf{x}_{i}\mathbf{x}'_{i},\\ \mathbf{X}'\mathbf{Y}&=\mathbf{X}_{-i}'\mathbf{Y}_{-i}+\mathbf{x}_{i}Y'_{i}, \end{align*}

where $$\mathbf{X}_{-i}$$ is the $$(n-1)\times (p+1)$$ matrix obtained by removing the $$i$$-th row of $$\mathbf{X}.$$ In (4.25), using again the Sherman–Morrison formula, we can update $$(\mathbf{X}'\mathbf{X})^{-1}$$ easily from $$\left(\mathbf{X}_{-i}'\mathbf{X}_{-i}\right)^{-1}$$:

\begin{align} (\mathbf{X}'\mathbf{X})^{-1}=\left(\mathbf{X}_{-i}'\mathbf{X}_{-i}\right)^{-1}-\frac{\left(\mathbf{X}_{-i}'\mathbf{X}_{-i}\right)^{-1}\mathbf{x}_{i}\mathbf{x}_{i}'\left(\mathbf{X}_{-i}'\mathbf{X}_{-i}\right)^{-1}}{1+\mathbf{x}_{i}'\left(\mathbf{X}_{-i}'\mathbf{X}_{-i}\right)^{-1}\mathbf{x}_{i}}.\tag{4.26} \end{align}

This has the advantage of not requiring to compute $$\mathbf{X}'\mathbf{X}$$ and then to invert it. Instead of that, we work directly with $$\left(\mathbf{X}_{-i}'\mathbf{X}_{-i}\right)^{-1},$$ which was already computed and has size $$(p + 1)\times(p+1).$$

This idea can be iterated and we can compute $$\hat{\boldsymbol{\beta}}$$ by the following iterative procedure:

1. Start from a reduced dataset $$\mathbf{X}_\mathrm{old}\equiv \mathbf{X}_{-i}$$ and $$\mathbf{Y}_\mathrm{old}\equiv \mathbf{Y}_{-i}$$ for which the least squares estimate can be computed. Denote it by $$\hat{\boldsymbol{\beta}}_\mathrm{old}\equiv\hat{\boldsymbol{\beta}}_{-i}.$$
2. Add one of the remaining data points to get $$\hat{\boldsymbol{\beta}}_\mathrm{new}\equiv\hat{\boldsymbol{\beta}}$$ from (4.25) and (4.26).
3. Set $$\hat{\boldsymbol{\beta}}_\mathrm{new}\leftarrow\hat{\boldsymbol{\beta}}_\mathrm{old}$$ and $$\mathbf{X}_\mathrm{new}\leftarrow\mathbf{X}_\mathrm{old}.$$
4. Repeat Steps 2–3 until there are no remaining data points left.
5. Return $$\hat{\boldsymbol{\beta}}\leftarrow \hat{\boldsymbol{\beta}}_\mathrm{new}.$$

The main advantage of this iterative procedure is clear: we do not need to store any vector or matrix with $$n$$ in the dimension – only matrices of size $$p.$$ As a consequence, we do not need to store the data in memory.

A similar iterative approach (yet more sophisticated) is followed by the biglm package. We omit the details here (see Miller (1992)) and just comment the main idea: for computing (4.24), biglm::biglm performs a QR decomposition138 of $$\mathbf{X}$$ that is computed iteratively. Then, instead of computing (4.24), it solves the triangular system

\begin{align*} \mathbf{R}\hat{\boldsymbol{\beta}}=\mathbf{Q}'\mathbf{Y}. \end{align*}

Let’s see how biglm::biglm works in practice.

# Not really "big data", but for the sake of illustration
set.seed(12345)
n <- 1e6
p <- 10
beta <- seq(-1, 1, length.out = p)^5
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
x1[, p] <- 2 * x1[, 1] + rnorm(n, sd = 0.1) # Add some dependence to predictors
x1[, p - 1] <- 2 - x1[, 2] + rnorm(n, sd = 0.5)
y1 <- 1 + x1 %*% beta + rnorm(n)
x2 <- matrix(rnorm(100 * p), nrow = 100, ncol = p)
y2 <- 1 + x2 %*% beta + rnorm(100)
bigData1 <- data.frame("resp" = y1, "pred" = x1)
bigData2 <- data.frame("resp" = y2, "pred" = x2)

# biglm has a very similar syntax to lm -- but the formula interface does not
# work always as expected
# biglm::biglm(formula = resp ~ ., data = bigData1) # Does not work
# biglm::biglm(formula = y ~ x) # Does not work
# biglm::biglm(formula = resp ~ pred.1 + pred.2, data = bigData1) # Does work,
# but not very convenient for a large number of predictors
# Hack for automatic inclusion of all the predictors
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
biglmMod <- biglm::biglm(formula = f, data = bigData1)

# lm's call
lmMod <- lm(formula = resp ~ ., data = bigData1)

# The reduction in size of the resulting object is more than notable
print(object.size(biglmMod), units = "KB")
## 13.1 Kb
print(object.size(lmMod), units = "MB")
## 381.5 Mb

# Summaries
s1 <- summary(biglmMod)
s2 <- summary(lmMod)
s1
## Large data regression model: biglm::biglm(formula = f, data = bigData1)
## Sample size =  1000000
##                Coef    (95%     CI)     SE      p
## (Intercept)  1.0021  0.9939  1.0104 0.0041 0.0000
## pred.1      -0.9733 -1.0133 -0.9333 0.0200 0.0000
## pred.2      -0.2866 -0.2911 -0.2822 0.0022 0.0000
## pred.3      -0.0535 -0.0555 -0.0515 0.0010 0.0000
## pred.4      -0.0041 -0.0061 -0.0021 0.0010 0.0000
## pred.5      -0.0002 -0.0022  0.0018 0.0010 0.8373
## pred.6       0.0003 -0.0017  0.0023 0.0010 0.7771
## pred.7       0.0026  0.0006  0.0046 0.0010 0.0091
## pred.8       0.0521  0.0501  0.0541 0.0010 0.0000
## pred.9       0.2840  0.2800  0.2880 0.0020 0.0000
## pred.10      0.9867  0.9667  1.0067 0.0100 0.0000
s2
##
## Call:
## lm(formula = resp ~ ., data = bigData1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8798 -0.6735 -0.0013  0.6735  4.9060
##
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  1.0021454  0.0041200  243.236  < 2e-16 ***
## pred.1      -0.9732675  0.0199989  -48.666  < 2e-16 ***
## pred.2      -0.2866314  0.0022354 -128.227  < 2e-16 ***
## pred.3      -0.0534834  0.0009997  -53.500  < 2e-16 ***
## pred.4      -0.0040772  0.0009984   -4.084 4.43e-05 ***
## pred.5      -0.0002051  0.0009990   -0.205  0.83731
## pred.6       0.0002828  0.0009989    0.283  0.77706
## pred.7       0.0026085  0.0009996    2.610  0.00907 **
## pred.8       0.0520744  0.0009994   52.105  < 2e-16 ***
## pred.9       0.2840358  0.0019992  142.076  < 2e-16 ***
## pred.10      0.9866851  0.0099876   98.791  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9993 on 999989 degrees of freedom
## Multiple R-squared:  0.5777, Adjusted R-squared:  0.5777
## F-statistic: 1.368e+05 on 10 and 999989 DF,  p-value: < 2.2e-16

# Further information
s1$mat # Coefficients and their inferences ## Coef (95% CI) SE p ## (Intercept) 1.0021454430 0.9939053491 1.010385537 0.0041200470 0.000000e+00 ## pred.1 -0.9732674585 -1.0132653005 -0.933269616 0.0199989210 0.000000e+00 ## pred.2 -0.2866314070 -0.2911021089 -0.282160705 0.0022353509 0.000000e+00 ## pred.3 -0.0534833941 -0.0554827653 -0.051484023 0.0009996856 0.000000e+00 ## pred.4 -0.0040771777 -0.0060739907 -0.002080365 0.0009984065 4.432709e-05 ## pred.5 -0.0002051218 -0.0022030377 0.001792794 0.0009989579 8.373098e-01 ## pred.6 0.0002828388 -0.0017149118 0.002280589 0.0009988753 7.770563e-01 ## pred.7 0.0026085425 0.0006093153 0.004607770 0.0009996136 9.066118e-03 ## pred.8 0.0520743791 0.0500755376 0.054073221 0.0009994208 0.000000e+00 ## pred.9 0.2840358104 0.2800374345 0.288034186 0.0019991879 0.000000e+00 ## pred.10 0.9866850849 0.9667099026 1.006660267 0.0099875911 0.000000e+00 s1$rsq # R^2
## [1] 0.5777074
s1$nullrss # SST (as in Section 2.6) ## [1] 2364861 # Extract coefficients coef(biglmMod) ## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 ## 1.0021454430 -0.9732674585 -0.2866314070 -0.0534833941 -0.0040771777 -0.0002051218 0.0002828388 0.0026085425 ## pred.8 pred.9 pred.10 ## 0.0520743791 0.2840358104 0.9866850849 # Prediction works as usual predict(biglmMod, newdata = bigData2[1:5, ]) ## [,1] ## 1 2.3554732 ## 2 2.5631387 ## 3 2.4546594 ## 4 2.3483083 ## 5 0.6587481 # Must contain a column for the response # predict(biglmMod, newdata = bigData2[1:5, -1]) # Error # Update the model with training data update(biglmMod, moredata = bigData2) ## Large data regression model: biglm::biglm(formula = f, data = bigData1) ## Sample size = 1000100 # AIC and BIC AIC(biglmMod, k = 2) ## [1] 998685.1 AIC(biglmMod, k = log(n)) ## [1] 998815.1 # Features not immediately available for biglm objects: stepwise selection by # stepAIC, residuals, variance of the error, model diagnostics, and vifs # Workaround for obtaining hat(sigma)^2 = SSE / (n - p - 1), SSE = SST * (1 - R^2) (s1$nullrss * (1 - s1$rsq)) / s1$obj$df.resid ## [1] 0.9986741 s2$sigma^2
## [1] 0.9986741

Model selection of biglm models can be done, not by MASS::stepAIC, but with the more advanced leaps package. This is achieved by the leaps::regsubsets function, which returns the best subset of up to (by default) nvmax = 8 predictors among the $$p$$ possible predictors to be included in the model. The function requires the full biglm model to begin the exhaustive139 search . The kind of search can be changed using the method argument and choosing the exhaustive (by default), forward, or backward selection.

# Model selection adapted to big data models
reg <- leaps::regsubsets(biglmMod, nvmax = p, method = "exhaustive")
plot(reg) # Plot best model (top row) to worst model (bottom row)

# Summarize (otherwise regsubsets's output is hard to decipher)
subs <- summary(reg)
subs
## Subset selection object
## 10 Variables  (and intercept)
##         Forced in Forced out
## pred.1      FALSE      FALSE
## pred.2      FALSE      FALSE
## pred.3      FALSE      FALSE
## pred.4      FALSE      FALSE
## pred.5      FALSE      FALSE
## pred.6      FALSE      FALSE
## pred.7      FALSE      FALSE
## pred.8      FALSE      FALSE
## pred.9      FALSE      FALSE
## pred.10     FALSE      FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
##          pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10
## 1  ( 1 ) " "    " "    " "    " "    " "    " "    " "    " "    " "    "*"
## 2  ( 1 ) " "    " "    " "    " "    " "    " "    " "    " "    "*"    "*"
## 3  ( 1 ) " "    "*"    " "    " "    " "    " "    " "    " "    "*"    "*"
## 4  ( 1 ) " "    "*"    "*"    " "    " "    " "    " "    " "    "*"    "*"
## 5  ( 1 ) " "    "*"    "*"    " "    " "    " "    " "    "*"    "*"    "*"
## 6  ( 1 ) "*"    "*"    "*"    " "    " "    " "    " "    "*"    "*"    "*"
## 7  ( 1 ) "*"    "*"    "*"    "*"    " "    " "    " "    "*"    "*"    "*"
## 8  ( 1 ) "*"    "*"    "*"    "*"    " "    " "    "*"    "*"    "*"    "*"
## 9  ( 1 ) "*"    "*"    "*"    "*"    " "    "*"    "*"    "*"    "*"    "*"

# Lots of useful information
str(subs, 1)
## List of 8
##  $which : logi [1:9, 1:11] TRUE TRUE TRUE TRUE TRUE TRUE ... ## ..- attr(*, "dimnames")=List of 2 ##$ rsq   : num [1:9] 0.428 0.567 0.574 0.576 0.577 ...
##  $rss : num [1:9] 1352680 1023080 1006623 1003763 1001051 ... ##$ adjr2 : num [1:9] 0.428 0.567 0.574 0.576 0.577 ...
##  $cp : num [1:9] 354480 24444 7968 5106 2392 ... ##$ bic   : num [1:9] -558604 -837860 -854062 -856894 -859585 ...
##  $outmat: chr [1:9, 1:10] " " " " " " " " ... ## ..- attr(*, "dimnames")=List of 2 ##$ obj   :List of 27
##   ..- attr(*, "class")= chr "regsubsets"
##  - attr(*, "class")= chr "summary.regsubsets"

# Get the model with lowest BIC
subs$which ## (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10 ## 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE ## 2 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE ## 3 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE ## 4 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE ## 5 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE ## 6 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE ## 7 TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE ## 8 TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE ## 9 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE subs$bic
## [1] -558603.7 -837859.9 -854062.3 -856893.8 -859585.3 -861936.5 -861939.3 -861932.3 -861918.6
subs$which[which.min(subs$bic), ]
## (Intercept)      pred.1      pred.2      pred.3      pred.4      pred.5      pred.6      pred.7      pred.8      pred.9
##        TRUE        TRUE        TRUE        TRUE        TRUE       FALSE       FALSE       FALSE        TRUE        TRUE
##     pred.10
##        TRUE

# Show the display in Figure 4.6
subs$which[order(subs$bic), ]
##   (Intercept) pred.1 pred.2 pred.3 pred.4 pred.5 pred.6 pred.7 pred.8 pred.9 pred.10
## 7        TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 6        TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 8        TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE   TRUE   TRUE   TRUE    TRUE
## 9        TRUE   TRUE   TRUE   TRUE   TRUE  FALSE   TRUE   TRUE   TRUE   TRUE    TRUE
## 5        TRUE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE    TRUE
## 4        TRUE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 3        TRUE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 2        TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE    TRUE
## 1        TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE    TRUE

# It also works with ordinary linear models and it is much faster and
# informative than stepAIC
reg <- leaps::regsubsets(resp ~ ., data = bigData1, nvmax = p,
method = "backward")
subs <- summary(reg)
subs$bic ## [1] -558603.7 -837859.9 -854062.3 -856893.8 -859585.3 -861936.5 -861939.3 -861932.3 -861918.6 -861904.8 subs$which[which.min(subs\$bic), ]
## (Intercept)      pred.1      pred.2      pred.3      pred.4      pred.5      pred.6      pred.7      pred.8      pred.9
##        TRUE        TRUE        TRUE        TRUE        TRUE       FALSE       FALSE       FALSE        TRUE        TRUE
##     pred.10
##        TRUE

# Compare it with stepAIC
MASS::stepAIC(lm(resp ~ ., data = bigData1), trace = 0,
direction = "backward", k = log(n))
##
## Call:
## lm(formula = resp ~ pred.1 + pred.2 + pred.3 + pred.4 + pred.8 +
##     pred.9 + pred.10, data = bigData1)
##
## Coefficients:
## (Intercept)       pred.1       pred.2       pred.3       pred.4       pred.8       pred.9      pred.10
##    1.002141    -0.973201    -0.286626    -0.053487    -0.004074     0.052076     0.284038     0.986651

Finally, let’s see an example on how to fit a linear model to a large dataset that does not fit in the RAM of most regular laptops. Imagine that you want to regress a response $$Y$$ into a set of $$p=10$$ predictors and the sample size is $$n=10^8.$$ Merely storing the response and the predictors will take up to $$8.2$$ GB in RAM:

# Size of the response
print(object.size(rnorm(1e6)) * 1e2, units = "GB")
## 0.7 Gb

# Size of the predictors
print(object.size(rnorm(1e6)) * 1e2 * 10, units = "GB")
## 7.5 Gb

In addition to this, if lm was called, it will return the residuals, effects, and fitted.values slots (all vectors of length $$n,$$ hence $$0.7 \times 3=2.1$$ GB more). It will also return the qr decomposition of the design matrix and the model matrix (both are $$n\times (p + 1)$$ matrices, so another $$8.2\times 2=16.4$$ GB more). The final lm object will thus be at the very least, of size $$18.5$$ GB. Clearly, this is not a very memory-friendly procedure.

A possible approach is to split the dataset and perform updates of the model in chunks of reasonable size. The next code provides a template for such approach using biglm and update.

# Linear regression with n = 10^8 and p = 10
n <- 10^8
p <- 10
beta <- seq(-1, 1, length.out = p)^5

# Number of chunks for splitting the dataset
nChunks <- 1e3
nSmall <- n / nChunks

# Simulates reading the first chunk of data
set.seed(12345)
x <- matrix(rnorm(nSmall * p), nrow = nSmall, ncol = p)
x[, p] <- 2 * x[, 1] + rnorm(nSmall, sd = 0.1)
x[, p - 1] <- 2 - x[, 2] + rnorm(nSmall, sd = 0.5)
y <- 1 + x %*% beta + rnorm(nSmall)

# First fit
bigMod <- biglm::biglm(y ~ x, data = data.frame(y, x))

# Update fit
# pb <- txtProgressBar(style = 3)
for (i in 2:nChunks) {

# Simulates reading the i-th chunk of data
set.seed(12345 + i)
x <- matrix(rnorm(nSmall * p), nrow = nSmall, ncol = p)
x[, p] <- 2 * x[, 1] + rnorm(nSmall, sd = 0.1)
x[, p - 1] <- 2 - x[, 2] + rnorm(nSmall, sd = 0.5)
y <- 1 + x %*% beta + rnorm(nSmall)

# Update the fit
bigMod <- update(bigMod, moredata = data.frame(y, x))

# Progress
# setTxtProgressBar(pb = pb, value = i / nChunks)

}

# Final model
summary(bigMod)
## Large data regression model: biglm::biglm(y ~ x, data = data.frame(y, x))
## Sample size =  100000000
##                Coef    (95%     CI)    SE      p
## (Intercept)  1.0003  0.9995  1.0011 4e-04 0.0000
## x1          -1.0015 -1.0055 -0.9975 2e-03 0.0000
## x2          -0.2847 -0.2852 -0.2843 2e-04 0.0000
## x3          -0.0531 -0.0533 -0.0529 1e-04 0.0000
## x4          -0.0041 -0.0043 -0.0039 1e-04 0.0000
## x5           0.0002  0.0000  0.0004 1e-04 0.0760
## x6          -0.0001 -0.0003  0.0001 1e-04 0.2201
## x7           0.0041  0.0039  0.0043 1e-04 0.0000
## x8           0.0529  0.0527  0.0531 1e-04 0.0000
## x9           0.2844  0.2840  0.2848 2e-04 0.0000
## x10          1.0007  0.9987  1.0027 1e-03 0.0000
print(object.size(bigMod), units = "KB")
## 7.8 Kb
The summary of a biglm object yields slightly different significances for the coefficients than those of lm. The reason is that biglm employs $$\mathcal{N}(0,1)$$-approximations for the distributions of the $$t$$-tests instead of the exact $$t_{n-1}$$ distribution. Obviously, if $$n$$ is large, the differences are inappreciable.

### References

Furnival, G. M., and R. W. Wilson. 1974. “Regressions by Leaps and Bounds.” Technometrics 16 (4): 499–511. https://doi.org/10.2307/1271435.
Miller, A. J. 1992. “Algorithm AS 274: Least Squares Routines to Supplement Those of Gentleman.” Journal of the Royal Statistical Society, Series C (Applied Statistics) 41 (2): 458–78. https://doi.org/10.2307/2347583.

1. The QR decomposition of the matrix $$\mathbf{X}$$ of size $$n\times m$$ is $$\mathbf{X}=\mathbf{Q}\mathbf{R}$$ such that $$\mathbf{Q}$$ is an $$n\times n$$ orthogonal matrix and $$\mathbf{R}$$ is an $$n\times m$$ upper triangular matrix. This factorization is commonly used in numerical analysis for solving linear systems.↩︎

2. Not really exhaustive: the method behind it, due to Furnival and Wilson (1974), employs an ingenious branch and bound algorithm to remove most of the non-interesting models.↩︎