11.5 Bonus: Efficient R
Helene Wagner
1. Overview of Bonus Material
a. Goals
This Bonus Material provides some introductory worked examples for:
- Navigating the file system
- Benchmarking file import and export functions
- Profiling an R script
- Creating and executing a Bash R script
- Parallelizing code
b. Data set
We will use the wolf SNP data (Schweizer et al., 2016) from the Week 11 worked example. The genetic data are individual-based, and are input as allele counts (i.e. 0/1/2) for each locus. We are using a randomly sampled subset of 10,000 single nucleotide polymorphism (SNP) markers from the full data set (which contains 42,587 SNPs).
3. Benchmarking file import and export options
See also Chapter 5 in “Efficient R Programming”: https://csgillespie.github.io/efficientR/input-output.html
a. Benchmark methods for importing csv files
The file ‘myFile’ is about 2 MB, which is a reasonable size for comparing the speed of different import and export functions.
Let’s benchmark the function read.csv used in Week 11. We use the function microbenchmark from the package microbenchmark to compare the speed of four different functions that can import a ‘csv’ file.
Note: Here we execute each function only once to save time; typically you would set times = 10 or so. Also, read_csv will print a warning about a missing column name. This is because the first column here contains the row names and does not have a column name. We’ll ignore this here, as we can use the first column as an example to compare how character data are being imported.
x = myFile
microbenchmark::microbenchmark(times = 1, unit = "ms",
read.csv(x), readr::read_csv(x, show_col_types = FALSE), data.table::fread(x),
rio::import(x))
## Unit: milliseconds
## expr min lq mean
## read.csv(x) 1805.1360 1805.1360 1805.1360
## readr::read_csv(x, show_col_types = FALSE) 2907.1049 2907.1049 2907.1049
## data.table::fread(x) 161.0499 161.0499 161.0499
## rio::import(x) 104.1379 104.1379 104.1379
## median uq max neval
## 1805.1360 1805.1360 1805.1360 1
## 2907.1049 2907.1049 2907.1049 1
## 161.0499 161.0499 161.0499 1
## 104.1379 104.1379 104.1379 1
Would it be faster if we first loaded the packages so that we could call the functions directly?
Note: you can list several independent commands on the same line by separating them with a semi-colon ‘;’. Also, the chunk setting message=FALSE here suppresses the warning message from read_csv.
library(readr); library(data.table); library(rio); library(microbenchmark)
microbenchmark(times = 1, unit = "ms",
read.csv(x), read_csv(x, show_col_types = FALSE), fread(x), import(x))
## Unit: milliseconds
## expr min lq mean
## read.csv(x) 1761.73906 1761.73906 1761.73906
## read_csv(x, show_col_types = FALSE) 2498.70910 2498.70910 2498.70910
## fread(x) 86.06117 86.06117 86.06117
## import(x) 85.58287 85.58287 85.58287
## median uq max neval
## 1761.73906 1761.73906 1761.73906 1
## 2498.70910 2498.70910 2498.70910 1
## 86.06117 86.06117 86.06117 1
## 85.58287 85.58287 85.58287 1
Yes, the import was faster when the packages were already loaded.
Overall, fread and import were roughly 10 to 30 times faster than read.csv and read_csv! The two had practically the same speed, which is hardly surprising: for ‘csv’ files, import uses the function fread.
The beauty of import is that it can handle a wide range of file types (and the list keeps growing): csv, xls, xlsx, html, xml, json, feather, R, RData, rda, rds, psv, tsv, sas7bdat, xpt, sav, dta, por, rec, mtp, syd, dbf, arff, dif, fwf, csv.gz, CSVY, fst, mat, ods, yml, as well as Fortran files and clipboard imports (Mac and Windows). It recognizes the file type from the extension and uses an appropriate import function.
Note: there is also a function Import in the car package that similarly aims to provide an easy way to import various file formats. However, car::Import can be very slow (slower than ‘read.csv’), whereas rio::import is fast.
b. Check handling of character data
The functions differ not only in their speed but also in how they handle text data (character or factor?), missing values, etc.
The first column in ‘myFile’ is an ID variable that should be used as row names. Let’s compare what the four methods did with this. The following code determines, for each import method, the class of the first column (IDs), and the class (or classes) of the resulting object.
Note: here we use double square brackets to subset the first column. Strictly speaking, we interpret ‘gen’ as a list of vectors. With a data.frame, we could also access the first column by gen[,1]. However, this would not return what we want for ‘tbl’ or ‘data.table’ objects. Always double check.
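The comparison might look like this (a minimal sketch, assuming the path to the ‘csv’ file is still stored in ‘x’): re-import the file with each function and report the class of the first column and the class(es) of the resulting object.
gen <- read.csv(x); c(class(gen[[1]]), class(gen))
gen <- readr::read_csv(x, show_col_types = FALSE); c(class(gen[[1]]), class(gen))
gen <- data.table::fread(x); c(class(gen[[1]]), class(gen))
gen <- rio::import(x); c(class(gen[[1]]), class(gen))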
## [1] "character" "data.frame"
## [1] "character" "spec_tbl_df" "tbl_df" "tbl" "data.frame"
## [1] "character" "data.table" "data.frame"
## [1] "character" "data.frame"
- The function read.csv used to interpret any text as ‘factor’ (the default before R 4.0); since R 4.0 it defaults to ‘character’, like the other functions, as the output above shows. Always double check!
- All of these functions have optional arguments for specifying how each column should be interpreted (see help files).
- With the functions fread and import, you can set the argument ‘stringsAsFactors = TRUE’ to import all text data as factors.
c. Binary files
Binary files are not readable by users (or other software) but provide an efficient way of storing data. Let’s compare file size and input/output speed between text files (csv) and different types of binary files (RData, rds, feather). We’ll also export the ‘csv’ file so that we have it in the same location.
First we make sure an output folder exists in the R project:
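A minimal sketch of this step (using dir.create; the folder name matches the file paths used below):
# Create the 'output' folder if it does not exist yet:
if(!dir.exists(file.path(here::here(), "output"))) dir.create(file.path(here::here(), "output"))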
gen <- import(myFile)
export(gen, file.path(here::here(), "output", "gen.csv"))
save(gen, file=file.path(here::here(), "output", "gen.RData"))
saveRDS(gen, file=file.path(here::here(), "output", "gen.rds"))
export(gen, file=file.path(here::here(), "output", "gen.feather"))
c(csv=file.size(file.path(here::here(), "output", "gen.csv")),
RData=file.size(file.path(here::here(), "output", "gen.RData")),
rds=file.size(file.path(here::here(), "output", "gen.rds")),
feather=file.size(file.path(here::here(), "output", "gen.feather")))/10^6
## csv RData rds feather
## 2.002491 0.351648 0.351621 3.716274
- The ‘csv’ file is 2 MB (first value).
- The R binary files ‘RData’ and ‘rds’ are much smaller!
- In contrast, the ‘feather’ file (last value) is almost twice as large here as the ‘csv’ file, and more than 10 times larger than ‘rds’!
Let’s benchmark the import again. We can use the function import for all of them. This is so fast that we can actually do it 10 times.
microbenchmark(times = 10, unit = "ms",
csv= import(file.path(here::here(), "output", "gen.csv")),
RData=import(file.path(here::here(), "output", "gen.RData")),
rds=import(file.path(here::here(), "output", "gen.rds")),
feather=import(file.path(here::here(), "output", "gen.feather")))
## Unit: milliseconds
## expr min lq mean median uq max
## csv 73.49528 74.29014 76.27097 74.96548 75.43926 86.79166
## RData 20.95441 20.98436 21.64224 21.45495 22.03525 23.48003
## rds 20.99058 21.85946 22.34382 22.44774 22.57337 24.80483
## feather 5668.14858 5716.64603 5829.49806 5809.81615 5883.97031 6123.39036
## neval cld
## 10 a
## 10 a
## 10 a
## 10 b
Look at the column ‘mean’. Importing the R binary files ‘RData’ and ‘rds’ was more than three times as fast as importing the ‘csv’ file with the underlying function fread (which was already much faster than read.csv). The ‘feather’ import, in contrast, was by far the slowest here.
Here’s my recommendation for saving R objects/data efficiently:
- If the object is not in tabular form: rds (can store any R object)
- If storage space is most important: rds
- If portability with Python is important: feather
- If being able to read text file is important: csv
Note: the developer of feather does not recommend using it for long-term data storage, since its stability with future updates to R or Python can’t be guaranteed: https://github.com/wesm/feather/issues/183
Why ‘rds’ and not ‘RData’? In practice, the main advantage of ‘rds’ is convenience when importing data.
- with ‘readRDS’, you can directly assign the imported data to an object, and thus choose the object name.
- with ‘load’, you have to do this in two steps. When using ‘load’, the loaded object will inherit the name from the object that was stored.
# Let's delete any copy of 'gen' from the workspace:
rm(gen)
# Create object 'myData' in a single step from 'rds' file:
myData <- readRDS(file.path(here::here(), "output", "gen.rds"))
# Two steps when importing 'RData': first, load the stored object:
load(file.path(here::here(), "output", "gen.RData"))
# then assign to the new object 'myData':
myData <- gen
Note that when you use load, the object name is NOT taken from the file name! This means that you may not know what object you are loading if the object and file names are different.
Let’s test this. Here we save the object ‘gen’ in file ‘gen2.RData’, then load it.
# Export 'gen' to a file with a different name 'gen2.RData':
save(gen, file=file.path(here::here(), "output", "gen2.RData"))
rm(gen)
# Load:
load(file.path(here::here(), "output", "gen2.RData"))
# What is the name of the loaded object?
exists("gen")
## [1] TRUE
## [1] FALSE
We see that an object ‘gen’ exists (TRUE), but an object ‘gen2’ does not exist (FALSE). The name of the loaded object is thus ‘gen’.
d. Should you save your R workspace?
When you close RStudio, you may be asked whether you want to save your workspace. What happens when you do this, and should you do so?
- When you save your workspace, all objects from the workspace are saved in one binary file ‘.RData’ (unless you provide a name, like ‘myWorkspace.Rdata’).
- This may result in a large file!
- There are other downsides: you may accidentally overwrite an object. And your code will not be portable because it depends on a copy of the workspace.
- The general recommendation is to NOT save your workspace, but save your dataset and your R scripts (or Notebooks). This means that you can always recreate all the objects needed.
Also, in the vein of reproducible research, do not save multiple copies of your data set. Instead:
- Save the raw data (with the original file name and extension, e.g. if you downloaded it from the internet - this will help identify the source)
- Save all the data manipulation steps (such as checking for errors, excluding rows, recoding variables, etc.) in an R Notebook and document them (what you are doing, why and when).
- Save the ‘clean’ dataset (result from data manipulation), preferably as a binary file (especially if it is large). Keep only one copy (you can always recreate it).
- Save your data analysis in another R Notebook (or multiple) that starts with importing the ‘clean’ dataset. If your code runs quickly, this is sufficient to recreate your results.
- If your code takes a lot of computation time, you may want to export the results (see above).
- Backup your data and scripts (R Notebooks)!
- Use version control for your script (R Notebooks)! See Chapter 0 video ‘Version Control 101’. The simplest is to include version numbers in the file name: ‘myNotebook_v7.Rmd’. A better way is to use e.g. GitHub.
e. Compile your functions
A recommended way of keeping your code tidy is to write functions for anything you will do more than once.
- Collect your functions in a separate file (R script), e.g. myFunctions.R.
- At the beginning of your R Notebook, source the R script with the functions with source("myFunctions.R").
- This will make your code much shorter and thus easier to read.
- If you need to change some code inside your function, you only have to change it in one place, hence there is less risk of mistakes.
- Use some kind of version control for your functions file so that you can go back to an older version, and you always know which is the current version.
To further speed up your code, you can compile your function with ‘cmpfun’:
myFunction <- function() {
sum(rnorm(1000))/1000
}
myFunction.cmp <- compiler::cmpfun(myFunction)
microbenchmark::microbenchmark(myFunction(), myFunction.cmp())
## Unit: microseconds
## expr min lq mean median uq max neval cld
## myFunction() 46.324 47.073 62.45434 47.346 47.7210 1550.557 100 a
## myFunction.cmp() 46.522 47.053 47.53702 47.313 47.5835 67.132 100 a
Question: Which of the following times are most different between the uncompiled and the compiled versions of this simple function?
- min: minimum time across ‘neval’ replicates
- Quartiles: ‘lq’ = lower quartile (25%), ‘median’, ‘uq’ = upper quartile (75%)
- mean: mean time across ‘neval’ replicates
- max: maximum time across ‘neval’ replicates
In this case, compiling mainly reduced the time of the slowest runs (those above the upper quartile, most notably the maximum), which in turn brought down the mean processing time.
4. Profiling your code
a. Named chunks
A simple way to identify parts of your code that may be slow is to:
- Name each chunk in your R Notebook
- Knit the notebook
- Monitor the R Markdown pane while the notebook is knitted: which chunks seem to take a lot of time?
To name a chunk, click on the wheel symbol at the top right of the grey chunk area and enter a one-word name.
Here’s an example of a named chunk: the name ‘myChunkName’ has been added in the curly brackets {r, myChunkName}. You can add a name manually in the same way.
More generally, this is where chunk options are added in the R Notebook. Here’s a long list of chunk options: https://yihui.name/knitr/options/
b. Profiling some lines of code
RStudio has a built-in menu for profiling. Check it out:
- Select the five lines of code below, from dat <- data.frame( until lm(y ~ x, data=dat).
- In RStudio’s menu, click on ‘Profile’ > ‘Profile Selected Line(s)’.
You can achieve the same with a call to the function ‘profvis’ of the ‘profvis’ package:
profvis::profvis({
dat <- data.frame(
x = rnorm(5e5),
y = rnorm(5e5))
mean(dat$x)
with(dat, cor(x, y))
lm(y ~ x, data=dat)
})
The results will be opened in a ‘Profile’ tab. The upper part has two tabs (the lower part is less intuitive to interpret, so we’ll ignore it here):
- Flame Graph: this plots horizontal bars indicating the amount of memory and time used by each line of code, in the original order of the code.
- Data: the code is sorted by resource use, with the most ‘costly’ line at the top of the list. You can click on the triangle before each line to see more detail.
The results may depend on the speed of your computer. Check the sample interval at the bottom of the ‘Profile’ tab: time was estimated by observing every 10 milliseconds what the computer was doing. Some lines must have been too fast to be recorded.
c. Converting an R Notebook to an R script
Unfortunately, we can’t profile an entire R Notebook (as far as I know). However, we can extract the R code as a script file, then profile the script file.
Copy an R Notebook file (Testfile.Rmd) to the downloads folder. This is just so that we have an example of an .Rmd file to extract the code from.
file.copy(from=system.file("extdata", "Testfile.Rmd", package = "LandGenCourse"),
to=file.path(here::here(), "downloads", "Testfile.Rmd"))
## [1] FALSE
Extract the R code from the R Notebook and save it in a script file Testfile.R in the downloads folder. You may adapt the infile and outfile to extract the code from any R Notebook saved in your project.
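One way to do this is with knitr::purl (a sketch; the file paths assume the copy step above):
# Extract the R code from the .Rmd file into a .R script:
knitr::purl(input = file.path(here::here(), "downloads", "Testfile.Rmd"),
            output = file.path(here::here(), "downloads", "Testfile.R"))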
Let’s open the two files and compare them. Note that file.show opens a simple text version, whereas file.edit shows the colored version commonly displayed by the RStudio editor.
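For example (a sketch, assuming both files are in the downloads folder):
file.show(file.path(here::here(), "downloads", "Testfile.Rmd"))
file.edit(file.path(here::here(), "downloads", "Testfile.R"))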
Question: Compare the two files (they should be open, check the tabs of the source pane).
- How are the chunks from the Notebook file divided in the new script file?
- What happened to the titles and text?
- What about the header information?
d. Profiling an R script
We can now use the function source to read and execute the code in the R script, and the function Rprof to start and stop the profiling.
With the function summaryRprof, we get a summary by function. First, let’s check the total time spent executing the code, and the sample interval (in seconds). Note that profiling works like this: as the code is executed, R checks at regular time intervals which function is being executed at that very moment. The summaryRprof function then tabulates the number of intervals, and thus the time used, by function.
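A sketch of these steps (the script name ‘Testfile.R’, the location of the profiling output file, and the object name ‘mySummary’ are assumptions):
Rprof(file.path(here::here(), "output", "Rprof.out"))      # start profiling
source(file.path(here::here(), "downloads", "Testfile.R")) # run the script
Rprof(NULL)                                                # stop profiling
mySummary <- summaryRprof(file.path(here::here(), "output", "Rprof.out"))
mySummary$sampling.time                                    # total time spent (seconds)
mySummary$sample.interval                                  # sampling interval (seconds)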
The attribute $by.total only lists the top-level functions called (not any functions called internally by that top-level function).
Note: The functions are listed by total time, in decreasing order. The two functions eval and source are related to evaluating (running) the code from the sourced R script file.
- total.time: total time spent executing the function (including functions called by it)
- total.pct: (ignore this) percent time spent executing the function (including functions called by it)
- self.time: total time spent executing the function itself (excluding functions called by it)
- self.pct: percent time spent executing the function itself (excluding functions called by it)
If you need to see a summary with more detail, $by.self lists each function used, even internal functions.
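Continuing the hypothetical ‘mySummary’ object from the sketch above:
mySummary$by.total   # sorted by total time, in decreasing order
mySummary$by.self    # more detail: every sampled function, sorted by self time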
If you want to profile memory use rather than processing time, see here: https://developer.r-project.org/memory-profiling.html
5. Creating and executing a Bash R script
a. Run R script directly in the Terminal
Now that we have a stand-alone R script ‘myScript.R’, we can run it from the command line in the terminal (shell). After navigating to the correct folder, type:
Rscript myScript.R
This will source the file and execute the R code.
- Numerical output will be printed in the terminal. Here, a random number should be returned.
- It is best to include code in your R script to export any R objects, data tables or figures that you want to retain. These will be saved in the same folder as the R script (unless you specify file paths).
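For reference, ‘myScript.R’ might contain something like the following (hypothetical content, chosen to match the random-number output mentioned above and the two figures used in the Bash example below):
x <- rnorm(100)   # simulate 100 random numbers
mean(x)           # printed in the terminal when run with Rscript
hist(x)           # figure 1
qqnorm(x)         # figure 2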
b. Create a Bash R script
If you want to run your code on a node or cluster, you may need to take this one step further and include the R code in a Bash script. In a Bash script, you can add bash commands that govern resource use to submit a job to a node or cluster. Bash scripts are the way of giving instructions to the scheduler of the cluster (e.g. SLURM) for how to manage input and output files.
To execute our R script ‘myScript.R’ as a Bash script, we need to add a few lines.
- The ‘shebang’ line #!/bin/bash that tells the computer that this is a Bash script, and where to find Bash. Note that here the hashtag symbol does NOT mean that the line is commented out (this line is Unix code, not R code).
- The line R --slave << EOF that declares the rest of the file (until EOF) as R code.
- The end-of-file marker EOF.
Let’s modify the previous code and write it into a Bash file. As an additional challenge, our code contains two figures, which won’t be written anywhere unless we change the code to write them into a file:
- We create a graphics file ‘my_plot.png’ that is 800 pixels wide and 400 pixels high.
- With par(mfrow=c(1,2)), we specify that the two plots should be plotted side-by-side. Then we create the plots.
- We close the graphics device (png file) with dev.off().
Note: We use single quotes for the file name here, ‘my_plot.png’, as it is nested within a set of double quotes. R treats single and double quotes largely as synonyms, which allows us to nest them either way: "''" or '""'.
myPath <- file.path(here::here(), "output/myBashScript.sh")
fileConn <- file(myPath)
writeLines(c("#!/bin/bash",
"R --slave << EOF",
"x <- rnorm(100)",
"mean(x)",
"png('my_plot.png', height = 400, width = 800)",
"par(mfrow=c(1,2))",
"hist(x)",
"qqnorm(x)",
"dev.off()",
"EOF"), fileConn)
close(fileConn)
file.show(myPath)
c. Executing a Bash R script
On Mac / Unix / Linux, this is straightforward:
- Open a terminal:
  - From the RStudio menu, select ‘Tools’ > ‘Terminal’ > ‘New Terminal’. This will open a Terminal tab in RStudio. Alternatively, you could select ‘Tools’ > ‘Shell’ to open a Shell in a new window outside RStudio.
  - Check the prompt: it should start with the name of your computer, then a colon, then the name of your project folder, then your user name followed by ‘$’.
- Navigate to the Bash file:
  - Enter ls to list the content of the project folder.
  - Enter cd output to change directory to the subfolder ‘output’.
  - Repeat ls to list the content of the ‘output’ folder. The Bash script ‘myBashScript.sh’ should be listed there.
- Execute the Bash file:
  - Enter chmod +x myBashScript.sh to change the file permissions for the script.
  - Enter ./myBashScript.sh to execute the script.
- Find the output:
  - The output from mean(x) is printed in the terminal; it should look something like this: [1] -0.07731751.
  - This is followed by ‘null device’ and the number 1, which tells us that a graphics device has been closed.
  - Enter ls again to list the content of the ‘output’ folder. The graphics file ‘my_plot.png’ should now be listed.
Use R again to open the graphics file (the code here first checks whether the file exists):
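A sketch of this check (assuming the plot was written to the ‘output’ folder, where the Bash script was executed):
myPlot <- file.path(here::here(), "output", "my_plot.png")
if(file.exists(myPlot)) utils::browseURL(myPlot)   # open the file in the default viewer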
d. Moving to a node or cluster?
The example bash file and advice in this section have been provided by Hossam Abdel Moniem, thanks!
Here’s an annotated example of a bash file that contains instructions for submitting a job to a single node (a single machine with multiple/many cores).
Note: A copy of the file ‘BashExample.sh’ should also have been copied into the downloads folder inside your project folder.
## #!/bin/bash
##
## #SBATCH --nodes=1 # Number of Nodes
## #SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
## #SBATCH --mail-user=myEmail@gmail.com # Where to send mail
## #SBATCH --ntasks=1 # Run a single task
## #SBATCH --cpus-per-task=24 # Number of CPU cores per task
## #SBATCH --mem-per-cpu=8000 # allocated memory for the task
## #SBATCH -p nodename # name of cluster node
## #SBATCH --requeue # Allow the job to be requeued
## #SBATCH -e myJob.err # File to which STDERR will be written
## #SBATCH -o myJob.out # File to which STDOUT will be written
## #SBATCH -J myJob # Job name
##
## module load R/MS3.4.1 # call a preinstalled module(program) on the cluster
##
## Rscript connect_calc_25.R # Run the R script in bash
##
Instead of including the R code directly in the bash file, the last line here executes an R script with the Rscript command.
Notice the module load line just above it: on the node, R and any relevant packages need to be pre-installed. Different users may need different configurations (different packages or versions) installed, hence each installation has a name, which needs to be specified in the bash script.
Note: Make sure that all packages that you need (and their dependencies), as well as the package unixtools, have been installed on the node or cluster (i.e., they are part of the installation you will be using). Install unixtools with: install.packages("unixtools", repos = "http://rforge.net/")
Further reading:
- Bash script tutorial: https://ryanstutorials.net/bash-scripting-tutorial/
- Scheduling a job with SLURM commands: https://www.rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/
e. Store your session info and package versions
If you use any R packages, load them at the beginning of your script file with library. Make sure the packages are installed on the system where you will be running the Bash R script.
A big issue with R is that package updates may break your code. At a minimum, save your session information at the end of any project (such as the analyses for a manuscript).
Here we use the function session_info from the devtools package (preferred over the base R function sessionInfo). We store the information as an object, ‘Session’, of class ‘session_info’ that has two list elements:
- platform: Information about R version, your computer’s operating system, etc.
- packages: List of all packages, including their version, installation date and repository (CRAN, Github etc.). All packages that are currently loaded will be marked with an asterisk.
Platform:
## setting value
## version R version 4.3.3 (2024-02-29)
## os macOS Sonoma 14.4.1
## system x86_64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Zurich
## date 2024-04-25
## pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
Packages: here we display only the first six lines, as the list may be long.
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.0)
## ade4 1.7-22 2023-02-06 [1] CRAN (R 4.3.0)
## adegenet 2.1.10 2023-01-26 [1] CRAN (R 4.3.0)
## ape 5.7-1 2023-03-13 [1] CRAN (R 4.3.0)
## arrow 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.0)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
Exporting the session information is a bit tricky because ‘Session’ is not in tabular format, and what we want to export is the formatted output, not the object itself.
Also, we will add a time stamp to the file name. This achieves two goals: avoid overwriting earlier files, and keep a record of the date of the session information.
- capture.output: captures the output of a function (here: devtools::session_info()).
- writeLines: writes the captured output into a file, here a text file.
- Sys.Date: returns the current date. Here we specify the format as “%Y-%m-%d”, i.e., ‘Year-month-day’.
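Putting these together, a minimal sketch (the file name with the time stamp is an assumption):
Session <- devtools::session_info()
writeLines(capture.output(Session),
           file.path(here::here(), "output",
                     paste0("Session_", format(Sys.Date(), "%Y-%m-%d"), ".txt")))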
More advanced ways for handling the problem of package versions to make sure you can run your code in the future without compatibility issues include:
- Package ‘packrat’: bundle all your current package versions. This can become quite large.
- Package ‘checkpoint’: access daily snapshots of CRAN for any given date. Note that checkpoint only covers packages from CRAN, not other repositories like GitHub.
More on the topic: https://timogrossenbacher.ch/2017/07/a-truly-reproducible-r-workflow/
f. Potential Windows issues
Check that Bash is installed and ready to use with RStudio:
- In the RStudio menu, select ‘Tools’ > ‘Global options’ > ‘Terminal’.
- Make sure some form of Bash (e.g., Git Bash) is listed under ‘Shell: New terminals open with:’.
If this does not work, try installing ‘Git for Windows’, which will also install Bash: http://neondataskills.org/setup/setup-git-bash-R
Here’s a detailed multi-part tutorial on running R from the command line, with R scripts and Bash R scripts, with some additional information on Windows:
- Introduction: https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/01-introduction.Rmd
- Batch mode: https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/02-batch-mode.Rmd
- Executing R scripts: https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/03-rscript.Rmd
- Bash R scripts: https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/04-shell-script.Rmd
- Redirection: https://github.com/gastonstat/tutorial-R-noninteractive/blob/master/05-redirection.Rmd
Note that the following issues may create problems when developing Bash R scripts on Windows that you want to run e.g. on a Linux cluster or another Unix-type system:
- File paths are different, and system files are stored in a different place.
- End-of-line symbols are different.
6. Parallelizing code
a. Replace ‘lapply’ by ‘mclapply’ with package ‘parallel’
Note: while this will run on Windows without causing an error, it will only be faster on Mac / Unix.
With the package ‘parallel’, it is really easy to use all cores of your local machine (as long as you are on a Mac / Unix / Linux system). Let’s check the number of cores available:
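A minimal sketch of this check:
parallel::detectCores()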
## [1] 6
Question: How many cores does your machine have?
Note: Here we check whether the operating system is ‘Windows’, in which case we set nCores = 1. This means that we will use all cores on Mac or Linux, but only one core on Windows. This is to avoid problems on Windows machines.
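A sketch of this check (the object name ‘nCores’ follows the text above):
nCores <- ifelse(.Platform$OS.type == "windows", 1, parallel::detectCores())
nCores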
## [1] 6
- Code your analysis with lapply (and related functions).
- Replace lapply by mclapply (and related functions). Use the argument mc.cores=detectCores() to automatically detect the number of cores in your machine.
NOTE: April 2021: package build does not work with multiple cores, changing nCores to 1.
library(parallel)
x <- gen[,-1]
m1 <- lapply(x, mean, na.rm=TRUE)
#m2 <- mclapply(x, mean, na.rm=TRUE, mc.cores=nCores) # Use this line when running the code yourself
m2 <- mclapply(x, mean, na.rm=TRUE, mc.cores=1) # Replace this line with the previous line
Let’s benchmark four ways of calculating the mean of each column in our example data set ‘gen’ with 94 rows and 10,000 columns (SNPs):
- Method 1: The dedicated function colMeans from base R.
- Method 2: A for loop.
- Method 3: Vectorization with lapply.
- Method 4: Multi-core with mclapply.
method1 <- function(x) {colMeans(x, na.rm=TRUE)}
method2 <- function(x) {for(i in 1:ncol(x)) mean(x[,i], na.rm=TRUE)}
method3 <- function(x) {lapply(x, mean, na.rm=TRUE)}
#method4 <- function(x) {mclapply(x, mean, na.rm=TRUE, mc.cores=nCores)} # Use this line when running the code yourself
method4 <- function(x) {mclapply(x, mean, na.rm=TRUE, mc.cores=1)} # Replace this line with the previous line
microbenchmark::microbenchmark(times = 10, unit = "ms",
method1(x), method2(x), method3(x), method4(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## method1(x) 36.08653 37.50303 45.00312 37.97650 38.23344 78.90942 10
## method2(x) 210.20485 212.69901 235.16524 223.33735 266.25324 269.15389 10
## method3(x) 59.26787 68.22158 72.73510 69.96695 72.50248 112.73286 10
## method4(x) 60.19222 61.87228 76.39005 67.44764 96.16702 116.18058 10
## cld
## a
## b
## c
## c
Question: Compare the column ‘mean’.
- Which method was the fastest? Can you explain this?
- Which method was slowest?
- Was mclapply faster than lapply in this example?
Note: Obviously you might only expect to see a gain in speed if nCores > 1.
b. Replace ‘for’ by ‘foreach’ with package ‘doParallel’
On Windows, it is easier to use the package ‘doParallel’ with the function ‘foreach’. A detailed introduction is available in the package vignette ‘Getting Started with doParallel and foreach’.
- Use makeCluster to specify the number of cores to be used. The default is half the number of cores.
- Check the cluster (number of nodes) by printing cl.
- Register the cluster with registerDoParallel. If you omit this step, the code will not use parallel computing.
The following code is commented out to avoid problems when knitting the Notebook. You may uncomment and run it.
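A sketch of the setup described in the list above (the object name ‘cl’ and the number of cores are illustrative; the cluster lines are left commented out, as noted):
library(doParallel)
# cl <- makeCluster(2)       # e.g., request 2 cores (default is half the available cores)
# cl                         # print to check the cluster
# registerDoParallel(cl)     # register the cluster; otherwise %dopar% will not run in parallel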
## Loading required package: foreach
## Loading required package: iterators
Now we adapt the code in two steps:
- The code is a little different with foreach than with for, as we use a pipe-like syntax with %do%, which means: for each value of i, do the following.
- Thus, %do% by itself does not result in parallelisation yet.
- To make this parallel, replace %do% by %dopar%.
m1 <- for(i in 1:ncol(x)) mean(x[,i], na.rm=TRUE)
m2 <- foreach(i = 1:ncol(x)) %do% (mean(x[,i], na.rm=TRUE))
#m3 <- foreach(i = 1:ncol(x)) %dopar% (mean(x[,i], na.rm=TRUE))
Let’s benchmark this again. We’ll only do 3 replicates this time (uncomment before running this code).
#method1 <- function(x) {colMeans(x, na.rm=TRUE)}
#method2 <- function(x) {for(i in 1:ncol(x)) mean(x[,i], na.rm=TRUE)}
#method3 <- function(x) {lapply(x, mean, na.rm=TRUE)}
#method4 <- function(x) {mclapply(x, mean, na.rm=TRUE, mc.cores=nCores)}
#method5 <- function(x) {foreach(i = 1:ncol(x)) %do% (mean(x[,i], na.rm=TRUE))}
#method6 <- function(x) {foreach(i = 1:ncol(x)) %dopar% (mean(x[,i], na.rm=TRUE))}
#microbenchmark::microbenchmark(times = 3, unit = "ms",method1(x), method2(x), method3(x), method4(x), method5(x), method6(x))
Whether parallelisation is faster depends on the type of task and on your system. In this case, both versions that used parallelisation were actually slower than the sequential code (at least on my system).
Of course, this will not always be the case, it may depend on:
- What type and size of computational task you are running.
- How many cores you can use in parallel. The initial ‘cost’ of coordinating the task among cores may not be worth it if you only have two cores to use anyway, but with 20 cores, the gain will be higher.
- Whether all cores are in the same node, or whether you distribute the work among multiple nodes (this will increase the cost of coordination, but potentially also give you access to many more cores).