Chapter 18 R Packages

library(tidyverse)
library(devtools)
library(roxygen2)
library(testthat)
library(usethis)

18.1 Introduction

Dr. Sonderegger’s Video Companion: Video Lecture.

An R package is a documented, consistent format for storing data, functions, documentation, and analyses. We use a consistent format so that other researchers (or we ourselves in six months) know exactly where the raw data should be, where to find any functions that were written, and how the data cleaning process is documented.

In principle, all of these steps could be accomplished with a single data file and a single analysis Rmarkdown file. However, as projects grow in scope, so do the number of data files, the complexity of the data cleaning, and the number of people working with the data. With that complexity, the need to impose order becomes critical.

Even if the project is small, organizing my work into a package structure provides benefits. First, it forces me to keep my data wrangling code organized and encourages me to document any functions I create. Second, by separating the data wrangling code from the analysis, I think more deeply about verification and initial exploration to understand how best to store the data. Finally, because all my subsequent analyses depend on the same tidy data set, I avoid mistakes where the data was cleaned correctly in one analysis but a step was forgotten in another.

I recommend using an R package for any analysis more complicated than a homework assignment: the start-up cost is relatively small, and if the project grows, you'll appreciate having started it in an organized fashion.

18.1.1 Useful packages and books

There are several packages that make life easier.

Package    Description
devtools   Tools by Hadley, for Hadley (and the rest of us).
roxygen2   A coherent documentation syntax.
testthat   Quality assurance tools.
usethis    Automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects.

Hadley Wickham has written a book on R packages that gives a lot more information than I’m giving here. The book is available online.

18.2 Package Structure

18.2.1 Minimal files and directories

File/Directory   Description
DESCRIPTION      A file describing your package. You should edit this at some point.
NAMESPACE        A file that lists all the functions and datasets available to users after loading the package. You should not edit this by hand.
.Rbuildignore    A list of files that shouldn't be included when the package is built.
R/               This directory contains the documentation files for datasets as well as the R code and documentation for the functions you create. I generally recommend one documentation file for each dataset and one file for each function, although several related functions might share a file. This directory can be empty, but it does have to exist.
man/             This contains the documentation (manual) files generated by roxygen2. You should not edit these, as they will be rebuilt from the source code in the R/ directory.

18.2.2 Optional Files and Directories

File/Directory   Description
data-raw/        A directory where we store data files that are not in .RData format. Usually these are .csv or .xls files that have not been processed. Typically I'll have R scripts in this directory that read in the raw data, do whatever data wrangling and cleaning needs to be done, and save the result in the data/ directory. Obnoxiously, the documentation for a dataset does not live in this directory, but rather in the R/ directory.
data/            A directory of datasets saved in R's efficient .RData format. Each file should be an .RData or .Rda file created by save(), containing a single object with the same name as the file. Anything in this directory will be loaded and accessible to the user when the package is loaded. While it isn't necessary for this directory to exist, it often does.
docs/            A directory for Rmarkdown analysis files that are especially time consuming and should not be executed each time the package is built. When I build a package for a data-analysis project, the reports I create go into this directory.
vignettes/       A directory for Rmarkdown files that introduce how to use the package. When a package is built, Rmarkdown files in this directory are rebuilt as well, so I generally don't make a vignette for time-consuming analyses and put those in docs/ instead.
tests/           A directory for code that tests the functions you've written.
inst/            Miscellaneous stuff. In particular, inst/extdata/ is where you might put data that is not in .RData format (Excel files and such) but that you want available to users. Anything in inst/extdata/ will be available to the user via system.file('extdata', 'file.xls', package='MyAwesomePackage'), as sketched below.
src/             A directory where C/C++, Fortran, or other compiled source code is stored.
exec/            A directory for executables you might have created from the source code.
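
As a quick sketch of the inst/extdata/ mechanism (the file and package names here are hypothetical), a user could locate and read a shipped raw file like this:

# Find the installed location of a file that was shipped in inst/extdata/
path <- system.file('extdata', 'file.xls', package = 'MyAwesomePackage')

# Read it with whatever reader fits the format, e.g. readxl for Excel files
dat <- readxl::read_excel(path)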

18.3 Documenting

The man/ directory is where the final documentation lives, but the .Rd format stored there is quite unwieldy to write by hand. To address this, the roxygen2 package uses a more robust and modern syntax and keeps the function documentation with the actual code in the R/ directory. This results in a workflow where we write the documentation in the files in the R/ directory and then run a roxygen2 command that builds the actual documentation files in the man/ directory. To run this in RStudio, use the Build tab and then More -> Document.
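
Equivalently, the same step can be run from the console:

devtools::document()   # rebuilds man/ and NAMESPACE from the roxygen2 comments in R/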

Hadley Wickham has a more complete discussion of package documentation in a vignette for roxygen2. If/when the information in this chapter seems insufficient, that should be your next resource.

The documentation is written in specially marked comments, so documentation lines always start with #'. For both data sets and functions, the first couple of lines give the short title and description.

#' A short title
#' 
#' A longer paragraph that describes the context of the dataset/function and 
#' discusses important aspects that will be necessary for somebody first seeing 
#' the data/function to know about. Any text in these initial paragraphs will be 
#' in the description section of the documentation file.
#' 

18.3.1 Data Documentation

Data set documentation should contain both general information about the context of the data and detailed information about the data columns. The documentation should also include information about where the data came from, if that is available. The title and description are given in the first paragraphs, but the format and source sections need tags to mark where they start.

#' A short title
#' 
#' A longer paragraph that describes the context of the dataset and 
#' discusses important aspects that will be necessary for somebody first seeing 
#' the data to know about. Any text in these initial paragraphs will be 
#' in the description section of the documentation file.
#' 
#' @format A data frame with XXX observations and ZZ columns.
#' \describe{
#'    \item{Column1}{Description of column 1, including units if appropriate}
#'    \item{Column2}{Description of column 2, including units if appropriate}
#' }
"DataSetName"

There are a few other documentation sections that can be filled in. Each is introduced using an @tag notation.

#' @source This describes where the data came from
#' @references If we need to cite some book or journal article.

18.3.2 Documenting Functions

Functions that you want other people to use need to be documented. In particular, we need a general description of what the function does, a list of all function arguments and what they do, and a description of what type of object the function returns. Finally, it is nice to have some examples that demonstrate how the function can be used.

#' Sum two numeric objects.
#'
#' Because this is a very simple function, my explanation is short. These
#' paragraphs should explain everything you need to know.
#' 
#' This is still in the description part of the documentation and it
#' will be until we see something that indicates a new section.
#'
#' @param a A real number
#' @param b A real number
#' @return The sum of \code{a} and \code{b}
#' @examples
#' my.sum(12,5)
#' my.sum(4,-2)
#' @export
my.sum <- function( a, b ){
  return( a + b )
}

Each of the sections is self explanatory except for @export. This tag indicates that the function should be available to any user of the package. If a function is not exported, then it is available only to other functions within the package. This can be convenient when there are helper functions that support the analysis but that you don't want the user to see, because it is too much work to explain why they shouldn't be used directly.
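
As a small sketch (both functions here are made up for illustration), an un-exported helper can quietly support an exported function:

# An internal helper. With no @export tag, users won't see it in the
# package index; it is only callable by other functions in the package.
center <- function(x){ x - mean(x) }

#' Standardize a numeric vector.
#'
#' @param x A numeric vector.
#' @return \code{x} shifted to mean zero and scaled to unit variance.
#' @export
standardize <- function(x){
  center(x) / sd(x)
}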

Other tags that you might use:

  • @seealso allows you to point to other resources:
    • on the web: \url{http://www.r-project.org}
    • in your package: \code{\link{hello}}
    • in another package: \code{\link[package]{function}}
  • @aliases alias_1 alias_2 ... Other topics that, when searched for, will point to this documentation.
  • @author This isn't necessary if the author is the same as the package author.
  • @references A text area to point to journal articles or other literature sources.
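
For instance, continuing the made-up my.sum() example, these tags just drop into the same comment block:

#' @seealso \code{\link[base]{sum}} for summing an entire vector at once
#' @aliases add plus
#' @references Wickham, H. and Bryan, J. R Packages. O'Reilly.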

18.4 Testing

In any package that contains R functions, we need to make sure those functions work correctly. As I write a function, I build test cases that verify the function does exactly what I claim it does. I then want to save all of those simple test cases and automatically run them each time I re-build the package.

Moving from ad hoc testing to formalized unit testing brings substantial improvements to your package and your code for a variety of reasons:

  1. Cleaner functionality. Because unit testing requires you to think about how your code should respond in different instances, you think more clearly about what the appropriate inputs and outputs should be. As a result, you are less likely to write functions that do WAY too much and are difficult to test. Separate, smaller functions are easier to write, easier to test, and ultimately more reliable.

  2. Robust code. With unit testing, it is easier to make changes and feel confident that you haven't broken previously working code. In particular, it lets us capture weird edge cases and make sure they are always tested for.

# To set up your package to use the testthat package run:
usethis::use_testthat()

What this command does is:

  1. Creates a tests/testthat directory in the package.
  2. Adds testthat to the Suggests field in the DESCRIPTION.
  3. Creates a file tests/testthat.R that runs all your tests when R CMD check runs.
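
The generated tests/testthat.R file is short; for a package named TestPackage it looks essentially like this:

library(testthat)
library(TestPackage)

test_check("TestPackage")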

Next, you create .R files in the tests/testthat/ directory named test_XXX.R. In those files, you'll put your test code.

Recently, I needed a truncated distribution that the trunc package didn't support, so I made my own trunc2 package. However, I wanted to be absolutely certain that I was getting the correct answers, so I wrote some unit tests.

# Check to see if I get the same values in Poisson distribution
test_that('Unrestricted Poisson values correct', {
  expect_equal(dpois( 2, lambda=3 ),  dtrunc(2, 'pois', lambda=3) )
  expect_equal(ppois( 2, lambda=3 ),  ptrunc(2, 'pois', lambda=3) )
  expect_equal(qpois( .8, lambda=3 ), qtrunc(.8, 'pois', lambda=3) )
})

# Check to see if I get the same values in Exponential distribution
test_that('Unrestricted Exponential values correct', {
  expect_equal(dexp( 2, rate=3 ),  dtrunc(2, 'exp', rate=3) )
  expect_equal(pexp( 2, rate=3 ),  ptrunc(2, 'exp', rate=3) )
  expect_equal(qexp( .8, rate=3 ), qtrunc(.8, 'exp', rate=3) )
})

The idea is that each test_that() command tests some piece of functionality and each expect_XXX() tests one atomic unit of computation. I then have multiple files, each named test_XXX, where each file groups tests with some organizational rationale.

The expectation functions give you a way to have your function calculate something and compare the result to what you believe the output should be. These functions start with expect_ and throw an error if the expectation is not met. In the table below, a and b represent expressions to be evaluated.

Function            Description
expect_equal(a,b)   Are the two inputs equal (up to numerical tolerance)?
expect_match(a,b)   Does the character string a match the regular expression b?
expect_error(a)     Does expression a cause an error?
expect_is(a,b)      Does the object a have the class listed in character string b?
expect_true(a)      Does a evaluate to TRUE?
expect_false(a)     Does a evaluate to FALSE?

The expect_true and expect_false functions are intended as a catch-all for cases that couldn’t be captured using one of the other expect functions.
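
For example, a property with no dedicated expect_ function can still be checked directly (this toy test is purely illustrative):

test_that('simulated values respect the stated bounds', {
  x <- runif(100, min = 2, max = 7)
  expect_true( all( x >= 2 & x <= 7 ) )
})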

There are a few more expect_XXX functions and you can see more detail in Hadley’s chapter on testing in his R-packages book.

Each test should cover a single unit of functionality, and if the test fails you should immediately know the underlying cause and where to find and fix the issue. Each test name should complete the sentence “Test that …”, so that when the unit tests run and something fails, we know exactly which test failed and what the underlying problem is.
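
Using the my.sum() function from earlier, a well-named test might look like the following; if it ever fails, the report completes the sentence “Test that my.sum is commutative”:

test_that('my.sum is commutative', {
  expect_equal( my.sum(2, 3), my.sum(3, 2) )
})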

Now that we have the testing setup built, the work flow is simple:

  1. Edit/modify your code or test definitions.
  2. Test your package with Ctrl/Cmd + Shift + T or devtools::test(). This causes all of your functions to be re-created (thus capturing any new changes to the functions) and then runs the testing commands.
  3. Repeat until all tests pass and there are no new test cases to implement.

18.5 The DESCRIPTION file

I never write the DESCRIPTION file from scratch; rather, it is generated from a template when the package structure is initially created. It is useful to go into this file and edit it.

Package: MyAwesomePackage
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R: 
    person(given = "First",
           family = "Last",
           role = c("aut", "cre"),
           email = "first.last@example.com",
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: What the package does (one paragraph).
License: What license it uses
Encoding: UTF-8
LazyData: true  

Often you want your package to make use of other packages so that their functionality is available within the functions you write. To do this, you'll add lines to the DESCRIPTION file.

Depends: magrittr     
Imports: dplyr, ggplot2, tidyr
Suggests: lme4

In this example, I've included a dependency on the magrittr package, which defines the %>% operator, while the dplyr, ggplot2, tidyr, and lme4 packages are included in slightly different ways.

Dependency Type   Description
Depends           These packages are required to have been downloaded from CRAN and will be attached to the namespace when your package is loaded. If your package is going to be widely used, keep this list as short as possible to avoid function name clashes.
Imports           These packages are required to be present on the computer, but will not be attached to the namespace. Whenever you want to use one of their functions within one of your own functions, you'll need the PackageName::FunctionName() syntax.
Suggests          These packages are not required. Often these are packages of data that are only used in the examples, the unit tests, or a vignette. These are not loaded/attached by default.

For widely distributed packages, Imports is the preferred way to utilize other packages in your code because it avoids namespace problems. For example, because the MASS and dplyr packages both have a select() function, it is advisable to avoid attaching dplyr, just in case the user has also loaded MASS. However, this choice is annoying because I then have to use the PackageName::FunctionName() syntax within all of the functions in my package.
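
As a sketch of that syntax (the function, data frame, and column names here are all hypothetical), a package function relying on dplyr via Imports would look like:

#' @export
monthly_means <- function(df){
  # dplyr sits in Imports, so every call is qualified rather than attached
  df <- dplyr::group_by(df, Month)
  dplyr::summarise(df, MeanTemp = mean(MaxTemp, na.rm = TRUE))
}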

For a data analysis package, I usually leave the Depends/Imports/Suggests blank and just load whatever analysis packages I need in the RMarkdown files that live in the docs/ directory.

18.6 Sharing your Package

The last step for a package is being able to share it with other people. We can either wrap the package up into a .tar.gz file or save the package to a version control platform like GitHub. For packages that are in a stable form and need to be available via CRAN or Bioconductor, building a .tar.gz file is important. However, when a package is just meant for yourself and your collaborators, I prefer to keep the package on GitHub.

I have several packages available on my GitHub account. The repository https://github.com/dereksonderegger/TestPackage demonstrates a very simple package. We can install it directly using the following:

devtools::install_github('dereksonderegger/TestPackage')

If you choose to share your package with others as a .tar.gz file, create the file using the Build tab and More -> Build Source Package. To install the package, the user will then run the R command

install.packages('TestPackage_0.0.0.1.tar.gz', repos=NULL, type='source')

18.7 An Example Package

I find it easiest to use RStudio to start a new package via File -> New Project ..., then choose to start the project in a new directory, and finally select that we want a new R package.

Alternatively we could use the usethis::create_package() function to build the minimal package.

usethis::create_package('~/GitHub/TestPackage')  # replace the path to where you want it...

Once the package is created:

  1. Put any .csv or .xls data files you have in the data-raw/ sub-directory. For this example, save the file https://raw.githubusercontent.com/dereksonderegger/444/master/data-raw/FlagMaxTemp.csv from the STA 444/5 GitHub page. In the same data-raw directory, create an R script or Rmarkdown file that reads the data in and cleans it up by renaming columns or whatever else is needed. An example R script might look something like this:

    library(tidyverse)
    
    # Read in the data.  Do some cleaning/verification
    MaxTemp <- read.csv('data-raw/FlagMaxTemp.csv') %>%
      gather('DOM', 'MaxTemp', X1:X31) %>%            
      drop_na() %>%
      mutate(DOM  = str_remove(DOM, fixed('X')) ) %>%  
      mutate(Date = lubridate::ymd( paste( Year, Month, DOM )) ) %>%
      select(Date, MaxTemp)
    
    # Save the data frame to the data/ directory as MaxTemp.rda
    usethis::use_data(MaxTemp)
  2. In the R/ directory, create a file MaxTemp.R with the following contents. When the package is built, this will document the dataset.

    #' A time series of daily maximum temperatures in Flagstaff, AZ. 
    #' 
    #' @format a data frame with 10882 observations 
    #' \describe{
    #'   \item{Date}{The date of observation as a POSIX date format.}
    #'   \item{MaxTemp}{Daily maximum temperature in degrees Fahrenheit.}
    #' }
    #' @source \url{www.ncdc.noaa.gov}
    "MaxTemp"
  3. Build the package using the Build tab.

    1. Create the Documentation.
      1. The first time, you'll need to enable Roxygen-style documentation. Do this by clicking the Build tab, then More -> Configure Build Tools. Finally, select the tick-box to build documentation using Roxygen.
      2. Click More and select Document to create the data frame documentation. The shortcut is Ctrl/Cmd + Shift + D.
    2. Install the package.
      1. Click Install and Restart to build the package. The shortcut is Ctrl/Cmd + Shift + B.
  4. Create the docs/ directory and then create an RMarkdown file that does some analysis.

    ---
    title: "My Awesome Analysis"
    author: "Derek Sonderegger"
    date: "9/18/2019"
    output: html_document
    ---
    
    This Rmarkdown file will do the analysis.
    
    ```{r, eval=FALSE}
    library(TestPackage)   # load TestPackage, which includes MaxTemp data frame.
    library(ggplot2)
    
    ggplot(MaxTemp, aes(x=Date, y=MaxTemp)) +
      geom_line()
    ```
    
    We see that the daily max temperature in Flagstaff varies quite a lot.

18.8 Exercises

  1. Build a package that contains a dataset giving weather information at Flagstaff’s Pulliam Airport from 1950 to 2019. I have the data and metadata on my GitHub site; I downloaded the data on 9-19-19 from https://www.ncdc.noaa.gov/cdo-web/search. In the data-raw directory, there are files Pulliam_Airport_Weather_Station.csv and its associated metadata Pulliam_Airport_Weather_Station_Metadata.txt. In the data there are a bunch of columns that contain attribute information about the preceding column; I don’t think those are helpful, or at least the metadata didn’t explain how to interpret them, so remove those. Many of the later columns have values that are exclusively ones or zeros. I believe those indicate whether the weather phenomenon was present that day; presumably a 1 is a yes, but I don’t know that. When I downloaded the data, I asked for “standard” units, so precipitation and snow amounts should be in inches and temperature should be in Fahrenheit. For this package, we only care about a few variables: DATE, PRCP, SNOW, TMAX, and TMIN.
    1. Create a new package named YourNameFlagWeather. In the package, use the usethis::use_data_raw() function to create the data-raw/ directory. Place the data and metadata there.
    2. Also in the data-raw directory, create an R script that reads in the data and does any necessary cleaning. Call your resulting data frame Flagstaff_Weather and save an .rda file to the data/ directory using the command usethis::use_data(Flagstaff_Weather). Keep and document only the DATE, PRCP, SNOW, TMAX, and TMIN variables.
    3. In the R/ directory, create a file Flagstaff_Weather.R that documents where the data came from and what each of the columns means.
    4. Set RStudio to build documentation using Roxygen by clicking the Build tab, then More -> Configure Build Tools and click the box for generating documentation with Roxygen. Select OK and then build the appropriate documentation file by clicking the Build tab, then More -> Document.
    5. Load your package and restart your session of R, again using the Build tab.
    6. Create a new directory in your package called docs/. In that directory create an RMarkdown file that loads your package and uses the weather data to make a few graphs of weather phenomena over time.
    7. Suppose that we decide to change something in the data and need to rebuild the package.
      1. Change the name of one of the columns in your cleaning script.
      2. Re-run the cleaning script and the usethis::use_data command.
      3. Re-install the package using the Build tab and Install and Restart.
      4. Verify that the Flagstaff_Weather object has changed.
      5. Verify that the documentation hasn’t changed yet.
      6. Update the documentation file for the dataset and re-run the documentation routine.
      7. Re-install the package and check that the documentation is now correct.
  2. Recall writing the function FizzBuzz in the chapter on functions. We will add this function to our package and include both documentation and unit tests.
    1. Copy your previously submitted FizzBuzz function into an R file in R/FizzBuzz.R.
    2. Document what the function does, what its arguments are, and what its result should be.
    3. Set your package up for unit testing by running usethis::use_testthat().
    4. Add unit tests for testing that the length of the output is the same as the input n.
    5. Add unit tests that address what should happen if the user inputs a negative, zero, or infinite value for n.
    6. Modify your function so that if the user inputs a negative, zero, or infinite value for n, the function throws an error using the command stop('Error Message'). Tailor the error message to the offending input n. Hint: there is a family of functions is.XXX() that test a variety of conditions; in particular, there is an is.infinite() function.
    7. Update your unit tests and make sure that the unit tests all pass.
  3. Now save the package as one file by building a source package using the Build tab, More -> Build Source Package. This will create a .tar.gz file that you can easily upload to Bblearn.