## 5.9 Big data considerations

As we saw in Section 5.2.2, fitting a generalized linear model involves fitting a series of linear models. Therefore, all the memory problems that appeared in Section 4.4 are inherited. Worse, computation is now more complicated because:

- Computing the likelihood requires *reading all the data at once*. Unlike in the linear model, updating the fit with a new chunk of data implies refitting with all the data, due to the nonlinearity of the likelihood.
- The IRLS algorithm requires *reading the data as many times as iterations* (see the sketch below).

These two peculiarities are a game-changer for the approach followed in Section 4.4: `biglm::bigglm` needs to have access to the full data while performing the fitting. This can be cumbersome.
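
To see why the second point is a burden, the following chunk sketches one IRLS iteration for a logistic regression computed by streaming over chunks of rows. This is just an illustrative sketch (the function `irls_iteration` and the object `chunks` are made up for the example), not the actual implementation behind `biglm` or `ffbase`. Since the weights depend on the current coefficients, each iteration requires a complete new pass over all the chunks.

```
# Illustrative sketch of one IRLS iteration for logistic regression computed
# over chunked data (not the implementation used by biglm/ffbase)
# chunks: list of data frames with the response in the first column
# beta: current vector of coefficients (intercept first)
irls_iteration <- function(chunks, beta) {
  XtWX <- 0
  XtWz <- 0
  for (chunk in chunks) { # A full pass over the data on *every* iteration
    X <- cbind(1, as.matrix(chunk[, -1])) # Intercept plus predictors
    y <- chunk[, 1]
    eta <- drop(X %*% beta)
    mu <- 1 / (1 + exp(-eta)) # Inverse logit link
    w <- mu * (1 - mu) # IRLS weights
    z <- eta + (y - mu) / w # Working response
    XtWX <- XtWX + crossprod(X * w, X) # Accumulate X'WX
    XtWz <- XtWz + crossprod(X * w, z) # Accumulate X'Wz
  }
  drop(solve(XtWX, XtWz)) # Weighted least squares update of beta
}
# Iterating beta <- irls_iteration(chunks, beta) until convergence reproduces
# the usual IRLS fit, at the price of one full pass over the data per iteration
```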

Fortunately, a neat solution is available using the `ff` and `ffbase` packages, which allow for efficiently “working with data stored in disk that behave (almost) as if they were in RAM”^{183}. The function we will employ is `ffbase::bigglm.ffdf`, which requires an object of the class `ffdf` (`ff`’s data frames).
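
To get a feel for what an `ffdf` object is, the following sketch (assuming `ff` is installed; the object names are made up for the illustration) converts a small data frame into an `ffdf`: the data then lives in files on disk rather than in RAM, while the object can still be manipulated much like a regular data frame.

```
# Quick illustration of ffdf objects (assuming the ff package is installed)
library(ff)
df <- data.frame(x = rnorm(10), y = rnorm(10))
dff <- as.ffdf(df) # An ffdf backed by files on disk, not by RAM
class(dff) # "ffdf"
dff[1:3, ] # Indexing works (almost) as with a regular data frame
filename(dff$x) # Path of the on-disk file backing the first column
```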

The latest version of `ffbase` has a bug in `ffbase::bigglm.ffdf` that is reported in this GitHub issue (https://github.com/edwindj/ffbase/issues/61). Until that issue is solved, you will not be able to apply `ffbase::bigglm.ffdf` with the latest package versions. A temporary workaround is to downgrade the packages `bit`, `ff`, and `ffbase` to the respective versions `1.1-15.2`, `2.2-14.2`, and `0.12.8`. The next chunk of code provides this fix.

```
# To install specific versions of packages
install.packages("versions")
library(versions)
# Install specific package versions. It may take a while to do so, be patient
install.versions(pkgs = c("bit", "ff", "ffbase"),
                 versions = c("1.1-15.2", "2.2-14.2", "0.12.8"))
# After bit's version 1.1-15.2, something is off in the integration with
# ffbase; see issue in https://github.com/edwindj/ffbase/issues/61
# Alternatively, if the binaries for your OS are not available (e.g., for
# Apple M1's processors), then you will need to compile the packages from
# source... and cross your fingers!
urls <- c(
  "https://cran.r-project.org/src/contrib/Archive/bit/bit_1.1-15.2.tar.gz",
  "https://cran.r-project.org/src/contrib/Archive/ff/ff_2.2-14.2.tar.gz",
  "https://cran.r-project.org/src/contrib/Archive/ffbase/ffbase_0.12.8.tar.gz"
)
install.packages(pkgs = urls, repos = NULL, type = "source")
```

After the packages `bit`, `ff`, and `ffbase` have been downgraded (or once the GitHub issue has been fixed), you will be able to run the following code:

```
# Not really "big data", but for the sake of illustration
set.seed(12345)
n <- 1e6
p <- 10
beta <- seq(-1, 1, length.out = p)^5
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
x1[, p] <- 2 * x1[, 1] + rnorm(n, sd = 0.1) # Add some dependence to predictors
x1[, p - 1] <- 2 - x1[, 2] + rnorm(n, sd = 0.5)
y1 <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(1 + x1 %*% beta))))
x2 <- matrix(rnorm(100 * p), nrow = 100, ncol = p)
y2 <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(1 + x2 %*% beta))))
bigData1 <- data.frame("resp" = y1, "pred" = x1)
bigData2 <- data.frame("resp" = y2, "pred" = x2)
# Save files to disk to emulate the situation with big data
write.csv(x = bigData1, file = "bigData1.csv", row.names = FALSE)
write.csv(x = bigData2, file = "bigData2.csv", row.names = FALSE)
# Read files using ff
library(ffbase) # Imports ff
bigData1ff <- read.table.ffdf(file = "bigData1.csv", header = TRUE, sep = ",")
bigData2ff <- read.table.ffdf(file = "bigData2.csv", header = TRUE, sep = ",")
# Recall: bigData1.csv is not copied into RAM
print(object.size(bigData1), units = "MB")
print(object.size(bigData1ff), units = "KB")
# Logistic regression
# Same comments for the formula framework -- this is the hack for automatic
# inclusion of all the predictors
library(biglm)
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
bigglmMod <- bigglm.ffdf(formula = f, data = bigData1ff, family = binomial())
# glm's call
glmMod <- glm(formula = resp ~ ., data = bigData1, family = binomial())
# Compare sizes
print(object.size(bigglmMod), units = "KB")
print(object.size(glmMod), units = "MB")
# Summaries
s1 <- summary(bigglmMod)
s2 <- summary(glmMod)
s1
s2
# Further information
s1$mat # Coefficients and their inferences
s1$rsq # R^2
s1$nullrss # Null deviance
# Extract coefficients
coef(bigglmMod)
# Prediction works as usual
predict(bigglmMod, newdata = bigData2[1:5, ], type = "response")
# predict(bigglmMod, newdata = bigData2[1:5, -1]) # Error
# Update the model with more data
update(bigglmMod, moredata = bigData2)
# AIC and BIC
AIC(bigglmMod, k = 2)
AIC(bigglmMod, k = log(n))
# Delete the files in disk
file.remove(c("bigData1.csv", "bigData2.csv"))
```

Note that this is also a perfectly **valid approach for linear models**; we just need to specify `family = gaussian()` in the call to `bigglm.ffdf`.
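
For example, a call along the following lines (reusing the formula `f` and the `ffdf` object `bigData1ff` created above; the new object name is made up) would fit the associated linear model. Mind that in this illustration the response is binary, so the fit is merely a sketch of the syntax.

```
# Sketch: the same machinery fits a linear model by just switching the family
# (f and bigData1ff are the objects created in the chunk above)
bigLinMod <- bigglm.ffdf(formula = f, data = bigData1ff, family = gaussian())
coef(bigLinMod)
```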

Model selection of `biglm::bigglm` models is not so straightforward. The trick that `leaps::regsubsets` employs for simplifying the model search in linear models (see Section 4.4) does not apply to generalized linear models because of the nonlinearity of the likelihood. However, there is a simple and useful hack: we can do **best subset selection** in the **linear model associated with the last iteration of the IRLS algorithm** and then refine the search by computing the exact BIC/AIC for a set of candidate models^{184}. If we do so, we translate the model selection problem back to the linear case, at the price of the extra overhead of fitting several generalized linear models. Keep in mind that, albeit useful, this approach is a hacky approximation to the task of finding the best subset of predictors.

```
# Model selection adapted to big data generalized linear models
reg <- leaps::regsubsets(bigglmMod, nvmax = p + 1, method = "exhaustive")
# This takes the QR decomposition, which encodes the linear model associated to
# the last iteration of the IRLS algorithm. However, the reported BICs are *not*
# the true BICs of the generalized linear models, but a sufficient
# approximation to obtain a list of candidate models in a fast way
# Get the model with lowest BIC
plot(reg)
subs <- summary(reg)
subs$which
subs$bic
subs$which[which.min(subs$bic), ]
# Let's compute the true BICs for the p models. This implies fitting p bigglm's
bestModels <- list()
for (i in 1:nrow(subs$which)) {
  f <- formula(paste("resp ~", paste(names(which(subs$which[i, -1])),
                                     collapse = " + ")))
  bestModels[[i]] <- bigglm.ffdf(formula = f, data = bigData1ff,
                                 family = binomial(), maxit = 20)
  # Did not converge with the default iteration limit, maxit = 8
}
# The approximate BICs and the true BICs are very similar (in this example)
exactBICs <- sapply(bestModels, AIC, k = log(n))
plot(subs$bic, exactBICs, type = "o", xlab = "Approximate", ylab = "Exact")
# Pearson correlation
cor(subs$bic, exactBICs, method = "pearson")
# Order correlation
cor(subs$bic, exactBICs, method = "spearman")
# Both give the same model selection and same order
subs$which[which.min(subs$bic), ] # Approximate
subs$which[which.min(exactBICs), ] # Exact
```

^{183} The `ff` package implements the `ff` vector and `ffdf` data frame classes. The `ffbase` package provides convenience functions for working with these non-standard classes in a more transparent way.

^{184} Without actually expanding that list, as coming up with the list of candidate models is the most expensive part of best subset selection.

provides convenience functions for working with these non-standard classes in a more transparent way.↩︎Without actually expanding that list, as coming out with this list of candidate models is the most expensive part in best subset selection.↩︎