Chapter 20 Speeding up R
library(microbenchmark) # for measuring how long stuff takes
library(doMC) # do multi-core stuff
library(foreach) # parallelizable for loops
library(tidyverse) # dplyr, ggplot2, etc...
library(faraway) # some examples
library(boot)
library(caret)
library(glmnet)
There is a YouTube Video Lecture for this chapter.
Eventually if you have large enough data sets, an R user eventually writes code that is slow to execute and needs to be sped up. This chapter tries to lay out common problems and bad habits and shows how to correct them. However, the correctness and maintainability of code should take precedence over speed. Too often, misguided attempts to obtain efficient code results in an un-maintainable mess that is no faster that the initial code.
Hadley Wickham has a book aimed at advanced R user that describes many of the finer details about R. One section in the book describes his process for building fast, maintainable software projects and if you have the time, I highly suggest reading the on-line version, Advanced R.
First we need some way of measuring how long our code took to run. For this we will use the package microbenchmark
. The idea is that we want to evaluate two or three expressions that solve a problem.
# The expressions can be as many lines as you'd like, enclosed by { }.
# For a single line, you can skip the enclosing { }.
#
# For evaluating large chunks of code, I like to
# just wrap the code up in a function.
<- runif(1000) # x vector of 1000 numbers between 0 and 1.
x microbenchmark(
sqrt(x), # First expression to compare
^(0.5) # second expression to compare
x%>% print(digits=3) )
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sqrt(x) 1.71 1.82 2.37 1.98 2.61 25.2 100 a
## x^(0.5) 16.73 16.95 20.47 17.10 25.19 75.0 100 b
What microbenchmark
does is run the two expressions a number of times and then
produces the 5-number summary of those times. By running it multiple times (by
default, 100 times), we
account for the randomness associated with a operating system that is also running
at the same time. Being good statisticians, the cld
column stands for compact
letter display and if the letters are different, there is a statistically
significant difference in the timing. If we cause the microbenchmark()
function to
run more times (1000s or 100,000s times), we could eventually end up with the smallest
difference to be statistically significant, but I think we shouldn’t complain about
speed differences if a sample of 100 runs can’t detect the difference.
20.1 Faster for loops?
Often we need to perform some simple action repeatedly. It is natural to write
a for
loop to do the action and we wish to speed the up. In this first case,
we will consider having to do the action millions of times and each chunk of
computation within the for
takes very little time.
Consider frame of 4 columns, and for each of \(n\) rows, we wish to know which column has the largest value.
<- function(n){
make.data <- data.frame(
data Norm = rnorm(n, mean=5, sd=2),
Pois = rpois(n, lambda = 5),
Gamma = rgamma(n, shape = 2, scale = 3),
Exp = rexp(n, rate = 1/5))
return(data)
}
<- make.data(6)
data data
## Norm Pois Gamma Exp
## 1 3.619871 4 4.377997 5.419884
## 2 6.443015 9 4.815838 3.161746
## 3 7.080651 8 22.128405 4.937013
## 4 6.550359 3 4.001545 3.982853
## 5 5.996938 6 3.029410 5.363923
## 6 3.448970 5 12.581530 6.051131
The way that you might first think about solving this problem is to write a for loop and, for each row, figure it out.
<- function( input ){
f1 <- NULL
output for( i in 1:nrow(input) ){
<- which.max( input[i,] )
output[i]
}return(output)
}
We might consider that there are two ways to return a value from a function
(using the return
function and just printing it). In fact, I’ve always heard
that using the return
statement is a touch slower.
<- function( input ){
f2.noReturn <- NULL
output for( i in 1:nrow(input) ){
<- which.max( input[i,] )
output[i]
}
output }
<- make.data(1000)
data microbenchmark(
f1(data),
f2.noReturn(data)
%>% print(digits=3) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(data) 29.1 31.1 35.0 32.2 34.4 53.4 100 a
## f2.noReturn(data) 29.7 31.1 35.9 32.5 36.7 60.6 100 a
In fact, it looks like it is a touch slower, but not massively compared to the run-to-run variability. I prefer to use the return
statement for readability, but if we agree have the last line of code in the function be whatever needs to be returned, readability isn’t strongly effected.
We next consider whether it would be faster to allocate the output vector once we figure out the number of rows needed, or just build it on the fly?
<- function( input ){
f3.AllocOutput <- nrow(input)
n <- rep(NULL, n)
output for( i in 1:nrow(input) ){
<- which.max( input[i,] )
output[i]
}return(output)
}
<- make.data(10000) # Moderately large sample size
data microbenchmark(
f1(data),
f3.AllocOutput(data)
%>% print(digits=3) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(data) 318 329 341 334 344 556 100 a
## f3.AllocOutput(data) 321 329 338 333 341 536 100 a
There isn’t a significant improvement allocating the size of output first. So given this, we shouldn’t feel to bad being lazy and using output <- NULL
to initialize things.
20.2 Vectorizing loops
In general, for
loops in R are very slow and we want to avoid them as much as possible. The apply
family of functions can be quite helpful for applying a function to each row or column of a matrix or data.frame or to each element of a list.
To test this, instead of a for
loop, we will use apply
.
<- function( input ){
f4.apply <- apply(input, 1, which.max) # 1 = apply to rows
output return(output)
}
microbenchmark(
f1(data),
f4.apply(data)
%>% print(digits=3) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(data) 321.1 332.9 344.7 338.7 346.8 556.8 100 b
## f4.apply(data) 19.9 21.6 24.6 22.5 23.8 42.6 100 a
This is the type of speed up that matters. We have a 10-fold speed up in execution time and particularly the maximum time has dropped impressively.
Unfortunately, I have always found the apply
functions a little cumbersome and I prefer to use dplyr
functions for readability. For this example, we’ll compare summarizing each column by calculating the mean.
First for a small sample sized data:
<- make.data(1000)
data microbenchmark(
%>% apply(2, mean),
data %>% summarize_all(mean)
data %>% print(digits=3) )
## Unit: microseconds
## expr min lq mean median uq max neval cld
## data %>% apply(2, mean) 107 119 145 134 155 443 100 a
## data %>% summarize_all(mean) 1300 1347 1526 1443 1635 3113 100 b
Unfortunately dplyr
is a lot slower than apply
in this case. I wonder if the dynamics would change with a larger n
?
<- make.data(1000000)
data microbenchmark(
%>% apply(2, mean),
data %>% summarize_all(mean)
data %>% print(digits=3) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## data %>% apply(2, mean) 58.01 65.23 90.53 67.58 71.06 339.7 100 b
## data %>% summarize_all(mean) 7.76 8.51 8.95 8.91 9.23 12.7 100 a
What just happened? The package dplyr
is designed to work well for large data sets, and utilizes a modified structure, called a tibble
, which provides massive benefits for large tables, but at the small scale, the overhead of converting the data.frame
to a tibble
overwhelms any speed up. But because the small sample case is already fast enough to not be noticeable, we don’t really care about the small n
case.
20.3 Parallel Processing
Most modern computers have multiple computing cores, and can run muliple processes at the same time. Sometimes this means that you can run multiple programs and switch back and forth easily without lag, but we are now interested in using as many cores as possible to get our statistical calculations completed by using muliple processing cores at the same time. This is referred to as running the process “in parallel” and there are many tasks in modern statistical computing that are “embarrassingly easily parallelized.” In particular bootstrapping and cross validation techniques are extremely easy to implement in a parallel fashion.
However, running commands in parallel incurs some overhead cost in set up computation, as well as all the message passing from core to core. For example, to have 5 cores all perform an analysis on a set of data, all 5 cores must have access to the data, and not overwrite any of it. So parallelizing code only makes sense if the individual steps that we pass to each core is of sufficient size that the overhead incurred is substantially less than the time to run the job.
We should think of executing code in parallel as having three major steps: 1. Tell R that there are multiple computing cores available and to set up a useable cluster to which we can pass jobs to. 2. Decide what ‘computational chunk’ should be sent to each core and distribute all necessary data, libraries, etc to each core. 3. Combine the results of each core back into a unified object.
20.4 Parallel Aware Functions
There are many packages that address problems that are “embarrassingly easily parallelized” and they will happily work with multiple cores. Methods that rely on re-sampling certainly fit into this category.
The first step is letting R know it has access to multiple cores. This is quite simple using the doMC
package (aka the do Multi-Core package).
The registration of multiple cores is actually pretty easy.
::registerDoMC(cores = 2) # my laptop only has two cores. doMC
20.4.1 boot::boot
Bootstrapping relies on re-sampling the dataset and calculating test statistics from each re-sample. In R, the most common way to do this is using the package boot
and we just need to tell the boot
function, to use the multiple cores available. (Note, we have to have registered the cores first!)
# make the trees dataset a bit larger,
# just so the overhead cost isn't as large
# compared to the computation of the regression
# coefficients.
for(i in 1:4){
<- rbind(trees, trees)
trees
}
# define the model using the original sample data
<- lm( Volume ~ Girth, data=trees)
model
# define a function that calculates the model parameters
# given an index of which data points to use.
<- function(df, index){
my.fun <- lm( Volume ~ Girth, data= trees[index,] )
model.star $coefficients
model.star
}
# the boot::boot() function has a parallel option we just need
# to switch on
microbenchmark(
serial = boot::boot( trees, my.fun, R=1000 ),
parallel = boot::boot( trees, my.fun, R=1000,
parallel='multicore', ncpus=2 )
%>% print(digits=3) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## serial 682 713 774 753 821 1049 100 a
## parallel 680 727 758 747 776 988 100 a
In this case, we had a bit of a speed up, but not a factor of 2. This is due to the overhead of splitting the job across both cores.
20.4.2 caret::train
The statistical learning package caret
also handles all the work to do cross validation in a parallel computing environment. The functions in caret
have an option allowParallel
which by default is TRUE, which controls if we should use all the cores. Assuming we have already registered the number of cores, then by default caret
will use them all.
library(caret)
# Remember to register how many cores to use!
::registerDoMC(cores = 2) # my laptop only has two cores.
doMC
<- function(parallel){
caret.model <- data.frame(
grid alpha = 1, # 1 => Lasso Regression
lambda = exp(seq(-6, 1, length=50)))
<- trainControl(
ctrl method='repeatedcv', number=5, repeats=4,
preProcOptions = c('center','scale'),
allowParallel = parallel)
<- train(
model ~ . -svi, data=prostate, method='glmnet',
lpsa trControl=ctrl,
tuneGrid=grid,
lambda = grid$lambda )
return(model)
}
# make the dataset a bit larger, just so the overhead
# of moving to multi-core doesn't overwhelm the benefits
# of using two processors. You should never do this
# in a real analysis!
data('prostate', package='faraway')
for( i in 1:7 ){
<- rbind(prostate,prostate)
prostate
}
microbenchmark(
caret.model(parallel = FALSE),
caret.model(parallel = TRUE )
%>% print(digits=3) )
## Unit: seconds
## expr min lq mean median uq max neval cld
## caret.model(parallel = FALSE) 1.82 1.91 1.96 1.95 2.00 2.25 100 b
## caret.model(parallel = TRUE) 1.69 1.74 1.79 1.77 1.82 2.07 100 a
Again, we saw only moderate gains by using both cores, however it didn’t really cost us anything. Because the caret
package by default allows parallel processing, it doesn’t hurt to just load the doMC
package and register the number of cores. Even in just the two core case, it is a good habit to get into so that when you port your code to a huge computer with many cores, the only thing to change is how many cores you have access to.
20.5 Parallelizing for
loops
One of the easiest ways to parallelize a for
loop is using a package called foreach
. This package was created by R-Revolutions company that was subsequently bought by Microsoft. They have a nice introduction to the package here.
We will consider an example that is common in modern statistics. We will examine parallel computing utilizing a bootstrap example where we create bootstrap samples for calculating confidence intervals for regression coefficients.
ggplot(trees, aes(x=Girth, y=Volume)) + geom_point() + geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
<- lm( Volume ~ Girth, data=trees) model
This is how we would do this previously.
# f is a formula
# df is the input data frame
# M is the number of bootstrap iterations
<- function( f, df, M=999){
boot.for <- list()
output for( i in 1:100 ){
# Do stuff
<- lm( f, data=df %>% sample_frac(1, replace=TRUE) )
model.star <- model.star$coefficients
output[[i]]
}
# use rbind to put the list of results together into a data.frame
<- sapply(output, rbind) %>% t() %>% data.frame()
output return(output)
}
We will first ask about how to do the same thing using the function foreach
# f is a formula
# df is the input data frame
# M is the number of bootstrap iterations
<- function(f, df, M=999){
boot.foreach <- foreach( i=1:100 ) %dopar% {
output # Do stuff
<- lm( f, data=df %>% sample_frac(1, replace=TRUE) )
model.star $coefficients
model.star
}
# use rbind to put the list of results together into a data.frame
<- sapply(output, rbind) %>% t() %>% data.frame()
output return(output)
}
Not much has changed in our code. Lets see which is faster.
microbenchmark(
boot.for( Volume~Girth, trees ),
boot.foreach( Volume~Girth, trees )
%>% print(digits=3) )
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## boot.for(Volume ~ Girth, trees) 113 142 154 152 169 205 100 a
## boot.foreach(Volume ~ Girth, trees) 145 161 180 169 178 309 100 b
In this case, the overhead associated with splitting the job across two cores, copying the data over, and then combining the results back together was more than we saved by using both cores. If the nugget of computation within each pass of the for
loop was larger, then it would pay to use both cores.
# massiveTrees has 31000 observations
<- NULL
massiveTrees for( i in 1:1000 ){
<- rbind(massiveTrees, trees)
massiveTrees
}microbenchmark(
boot.for( Volume~Girth, massiveTrees ) ,
boot.foreach( Volume~Girth, massiveTrees )
%>% print(digits=3) )
## Unit: seconds
## expr min lq mean median uq
## boot.for(Volume ~ Girth, massiveTrees) 15.29 16.16 16.70 16.86 17.21
## boot.foreach(Volume ~ Girth, massiveTrees) 6.85 7.68 7.96 7.91 8.22
## max neval cld
## 18.04 100 b
## 8.72 100 a
Because we often generate a bunch of results that we want to see as a data.frame, the foreach
function includes a .combine
option which allows us to specify a function to do the combining.
<- foreach( i=1:1000, .combine=rbind ) %dopar% {
output # Do stuff
<- lm( Volume ~ Girth, data= trees %>% sample_frac(1, replace=TRUE) )
model.star $coefficients %>% t() %>% as.data.frame()
model.star }
It is important to recognize that the data.frame trees
was utilized inside the foreach
loop. So when we called the foreach
loop and distributed the workload across the cores, it was smart enough to distribute the data to each core. However, if there were functions that we utilized inside the for loop that came from a package, we need to tell each core to load the function.
<- foreach( i=1:1000, .combine=rbind, .packages='dplyr' ) %dopar% {
output # Do stuff
<- lm( Volume ~ Girth, data= trees %>% sample_frac(1, replace=TRUE) )
model.star $coefficients %>% t() %>% as.data.frame()
model.star }