5.9 Big data considerations

As we saw in Section 5.2.2, fitting a generalized linear model involves fitting a series of linear models. Therefore, all the memory problems that appeared in Section 4.4 are inherited. Worse, computation is now more complicated because:

  1. Computing the likelihood requires reading all the data at once. Unlike in the linear model, updating the model with a new chunk of data implies re-fitting with all the data, due to the nonlinearity of the likelihood.
  2. The IRLS algorithm requires reading the data as many times as there are iterations (see the sketch below).

These two peculiarities are a game-changer for the approach followed in Section 4.4: biglm::bigglm needs to have access to the full data while performing the fitting. This can be cumbersome.
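
To make the second point concrete, here is a minimal sketch of the IRLS iterations for a logistic regression (the function irls_logistic is purely illustrative and is not how glm() or biglm::bigglm are implemented): every iteration needs a full pass over x and y to form the working weights and working response before solving a weighted least squares problem.

# Illustrative IRLS for logistic regression: each iteration reads ALL the data
irls_logistic <- function(x, y, maxit = 25, tol = 1e-8) {
  x <- cbind(1, x) # Add intercept column
  beta <- rep(0, ncol(x))
  for (it in 1:maxit) {
    eta <- drop(x %*% beta)    # Linear predictor (full pass over the data)
    mu <- 1 / (1 + exp(-eta))  # Fitted probabilities
    w <- mu * (1 - mu)         # Working weights
    z <- eta + (y - mu) / w    # Working response
    # Weighted least squares update
    beta_new <- drop(solve(crossprod(x, w * x), crossprod(x, w * z)))
    delta <- max(abs(beta_new - beta))
    beta <- beta_new
    if (delta < tol) break
  }
  beta
}

# Quick sanity check against glm() on a small simulated sample
set.seed(42)
x0 <- matrix(rnorm(200 * 2), ncol = 2)
y0 <- rbinom(200, size = 1, prob = 1 / (1 + exp(-(x0 %*% c(1, -1)))))
irls_logistic(x0, y0)
coef(glm(y0 ~ x0, family = binomial()))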

Fortunately, a neat solution is available through the ff and ffbase packages, which allow for efficiently “working with data stored in disk that behave (almost) as if they were in RAM”183. The function that we will employ is ffbase::bigglm.ffdf, which requires an object of class ffdf (ff’s data frames).

The latest version of ffbase has a bug in ffbase::bigglm.ffdf that is reported in this GitHub issue (https://github.com/edwindj/ffbase/issues/61). Until that issue is solved, you will not be able to apply ffbase::bigglm.ffdf with the latest package versions. A temporary workaround is to downgrade the packages bit, ff, and ffbase to the versions 1.1-15.2, 2.2-14.2, and 0.12.8, respectively. The next chunks of code provide this fix.

# To install specific versions of packages
install.packages("versions")
library(versions)

# Install specific package versions. It may take a while to do so, be patient
install.versions(pkgs = c("bit", "ff", "ffbase"),
                 versions = c("1.1-15.2", "2.2-14.2", "0.12.8"))
# After bit's version 1.1-15.2, something is off in the integration with
# ffbase; see issue in https://github.com/edwindj/ffbase/issues/61

# Alternatively, if the binaries for your OS are not available (e.g., for
# Apple M1's processors), then you will need to compile the packages from
# source... and cross your fingers!
urls <- c(
  "https://cran.r-project.org/src/contrib/Archive/bit/bit_1.1-15.2.tar.gz",
  "https://cran.r-project.org/src/contrib/Archive/ff/ff_2.2-14.2.tar.gz",
  "https://cran.r-project.org/src/contrib/Archive/ffbase/ffbase_0.12.8.tar.gz"
  )
install.packages(pkgs = urls, repos = NULL, type = "source")

Once the packages bit, ff, and ffbase have been downgraded (or once the GitHub issue has been fixed), you will be able to run the following code:

# Not really "big data", but for the sake of illustration
set.seed(12345)
n <- 1e6
p <- 10
beta <- seq(-1, 1, length.out = p)^5
x1 <- matrix(rnorm(n * p), nrow = n, ncol = p)
x1[, p] <- 2 * x1[, 1] + rnorm(n, sd = 0.1) # Add some dependence to predictors
x1[, p - 1] <- 2 - x1[, 2] + rnorm(n, sd = 0.5)
y1 <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(1 + x1 %*% beta))))
x2 <- matrix(rnorm(100 * p), nrow = 100, ncol = p)
y2 <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(1 + x2 %*% beta))))
bigData1 <- data.frame("resp" = y1, "pred" = x1)
bigData2 <- data.frame("resp" = y2, "pred" = x2)

# Save files to disk to emulate the situation with big data
write.csv(x = bigData1, file = "bigData1.csv", row.names = FALSE)
write.csv(x = bigData2, file = "bigData2.csv", row.names = FALSE)

# Read files using ff
library(ffbase) # Imports ff
bigData1ff <- read.table.ffdf(file = "bigData1.csv", header = TRUE, sep = ",")
bigData2ff <- read.table.ffdf(file = "bigData2.csv", header = TRUE, sep = ",")

# Recall: bigData1.csv is not copied into RAM
print(object.size(bigData1), units = "MB")
print(object.size(bigData1ff), units = "KB")

# Logistic regression
# Same comments for the formula framework -- this is the hack for automatic
# inclusion of all the predictors
library(biglm)
f <- formula(paste("resp ~", paste(names(bigData1)[-1], collapse = " + ")))
bigglmMod <- bigglm.ffdf(formula = f, data = bigData1ff, family = binomial())

# glm's call
glmMod <- glm(formula = resp ~ ., data = bigData1, family = binomial())

# Compare sizes
print(object.size(bigglmMod), units = "KB")
print(object.size(glmMod), units = "MB")

# Summaries
s1 <- summary(bigglmMod)
s2 <- summary(glmMod)
s1
s2

# Further information
s1$mat # Coefficients and their inferences
s1$rsq # R^2
s1$nullrss # Null deviance

# Extract coefficients
coef(bigglmMod)

# Prediction works as usual
predict(bigglmMod, newdata = bigData2[1:5, ], type = "response")
# predict(bigglmMod, newdata = bigData2[1:5, -1]) # Error
# (newdata must also contain the response column "resp")

# Update the model with more data
update(bigglmMod, moredata = bigData2)

# AIC and BIC
AIC(bigglmMod, k = 2)      # AIC
AIC(bigglmMod, k = log(n)) # BIC

# Delete the files in disk
file.remove(c("bigData1.csv", "bigData2.csv"))

Note that this is also a perfectly valid approach for linear models; we just need to specify family = gaussian() in the call to bigglm.ffdf.
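
For example, reusing the formula f and the ffdf object bigData1ff from the previous code (merely for illustration, since resp is binary here; the object name bigLmMod is just for this sketch), a linear model could be fitted along these lines:

# Linear model on ffdf data: same machinery, just a different family
bigLmMod <- bigglm.ffdf(formula = f, data = bigData1ff, family = gaussian())
coef(bigLmMod)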

Model selection of biglm::bigglm models is not so straightforward. The trick that leaps::regsubsets employs to simplify the model search in linear models (see Section 4.4) does not apply to generalized linear models because of the nonlinearity of the likelihood. However, there is a simple and useful hack: we can do best subset selection in the linear model associated with the last iteration of the IRLS algorithm and then refine the search by computing the exact BIC/AIC for a set of candidate models184. In doing so, we translate the model selection problem back to the linear case, at the cost of the extra overhead of fitting several generalized linear models. Keep in mind that, albeit useful, this approach is a hacky approximation to the task of finding the best subset of predictors.
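
Recall that, for a candidate model with df estimated coefficients fitted to n observations, the criterion being computed “exactly” is

\[
\mathrm{BIC} = -2\,\ell(\hat{\boldsymbol{\beta}}) + \log(n)\,\mathrm{df},
\]

where \(\ell(\hat{\boldsymbol{\beta}})\) is the maximized log-likelihood (AIC replaces \(\log(n)\) with \(2\)). The constant conventions behind the log-likelihood do not affect comparisons between models, which is why AIC(·, k = log(n)) is used in the code below to obtain BIC-type values.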

# Model selection adapted to big data generalized linear models
reg <- leaps::regsubsets(bigglmMod, nvmax = p + 1, method = "exhaustive")
# This uses the QR decomposition stored in the bigglm object, which encodes the
# linear model associated with the last iteration of the IRLS algorithm.
# However, the reported BICs are *not* the true BICs of the generalized linear
# models, but an approximation that is good enough for obtaining a list of
# candidate models quickly

# Get the model with lowest BIC
plot(reg)
subs <- summary(reg)
subs$which
subs$bic
subs$which[which.min(subs$bic), ]

# Let's compute the true BICs for the p models. This implies fitting p bigglm's
bestModels <- list()
for (i in 1:nrow(subs$which)) {
  f <- formula(paste("resp ~", paste(names(which(subs$which[i, -1])),
                                     collapse = " + ")))
  bestModels[[i]] <- bigglm.ffdf(formula = f, data = bigData1ff,
                                 family = binomial(), maxit = 20)
  # Did not converge with the default iteration limit, maxit = 8
}

# The approximate BICs and the true BICs are very similar (in this example)
exactBICs <- sapply(bestModels, AIC, k = log(n))
plot(subs$bic, exactBICs, type = "o", xlab = "Approximate BIC",
     ylab = "Exact BIC")

# Pearson correlation
cor(subs$bic, exactBICs, method = "pearson")

# Rank (Spearman) correlation
cor(subs$bic, exactBICs, method = "spearman")

# Both give the same model selection and same order
subs$which[which.min(subs$bic), ] # Approximate
subs$which[which.min(exactBICs), ] # Exact

  183. The ff package implements the ff vector and ffdf data frame classes. The ffbase package provides convenience functions for working with these non-standard classes in a more transparent way.

  184. Without actually expanding that list, since coming up with the list of candidate models is the most expensive part of best subset selection.