Chapter 16 R Packages
16.1 Introduction
As usual, I have a YouTube Video Lecture for this chapter.
R packages are a documented and consistent format for storing data, functions, documentation, and analyses. We use a consistent format so that other researchers (or ourselves in six months) know exactly where the raw data should be, where to find any functions that have been written, and how the data cleaning process was documented.
In principle, all of these steps could be accomplished by a single data file and a single analysis Rmarkdown file. However, as projects grow in scope, the number of data files, the complexity of data cleaning, and the number of people working with the data will all grow. With more complexity, the need to impose order becomes critical.
Even if the project is small, organizing my work into a package structure provides a benefit. First, it forces me to keep my data wrangling code organized and encourages documenting any functions I create. Second, by separating the data wrangling code from the analysis, I think more deeply about verification and initial exploration and about how best to store the data. Finally, because all my subsequent analyses depend on the same tidy dataset, I make fewer mistakes where I clean the data correctly in one analysis but forget a step in another.
I recommend using an R package for any analysis more complicated than a homework assignment because the start-up is relatively simple and if the project grows, you’ll appreciate that you started it in an organized fashion.
16.1.1 Useful packages and books
There are several packages that make life easier.
| Package | Description |
|---|---|
| `devtools` | Tools by Hadley, for Hadley (and the rest of us). |
| `roxygen2` | A coherent documentation syntax. |
| `testthat` | Quality assurance tools. |
| `usethis` | Automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects. |
Hadley Wickham has written a book on R packages that gives a lot more information than I’m giving here. The book is available online.
16.2 Package Structure
16.2.1 Minimal files and directories
| File/Directory | Description |
|---|---|
| `DESCRIPTION` | A file describing your package. You should edit this at some point. |
| `NAMESPACE` | A file that lists all the functions and datasets available to users after loading the package. You should not edit this by hand. |
| `.Rbuildignore` | A list of files that shouldn't be included when the package is built. |
| `R/` | This directory contains documentation files for datasets, as well as the R code and documentation for functions you create. I generally recommend one documentation file for each dataset and one file for each function, although if you have several related functions you might keep them in the same file. This directory can be empty, but it does have to exist. |
| `man/` | This contains the documentation (manual) files generated by `roxygen2`. You should not edit these as they will be rebuilt from the source code in the `R/` directory. |
16.2.2 Optional Files and Directories
| File/Directory | Description |
|---|---|
| `data-raw/` | A directory where we store data files that are not in `.RData` format. Usually these are `.csv` or `.xls` files that have not been processed. Typically I'll have R scripts in this directory that read in the raw data, do whatever data wrangling and cleaning needs to be done, and save the result in the `data/` directory. Obnoxiously, the documentation for a dataset does not live in this directory, but rather in the `R/` directory. |
| `data/` | A directory of datasets saved in R's efficient `.RData` format. Each file should be an `.RData` or `.Rda` file created by `save()` containing a single object (with the same name as the file). Anything in this directory will be loaded and accessible to the user when the package is loaded. While it isn't necessary for this directory to exist, it often does. |
| `docs/` | A directory for Rmarkdown analysis files that are especially time consuming and should not be executed each time the package is built. When I build a package for a data-analysis project, the reports I create go into this directory. |
| `vignettes/` | A directory for Rmarkdown files that introduce how to use the package. When a package is built, Rmarkdown files in this directory will also be rebuilt. I generally don't make a `vignettes` directory until I'm certain that I will be sharing the package with a wide audience. |
| `tests/` | A directory for code used to test the functions you've written. |
| `inst/` | Miscellaneous stuff. In particular, `inst/extdata/` is where you might put data that is not in `.RData` format (Excel files and such) but that you want available to users. Anything in `inst/extdata/` will be available to the user via `system.file('file.xls', package='MyAwesomePackage')`. |
| `src/` | A directory where C/C++, Fortran, Python, etc. source code is stored. |
| `exec/` | A directory where executables you might have created from the source code should go. |
16.3 Documenting
The `man/` directory is where the final documentation lives, but the format originally established by R is quite unwieldy. To address this, the `roxygen2` package uses a more robust and modern syntax and keeps the function documentation with the actual code in the `R/` directory. This results in a process where we write the documentation in files in the `R/` directory and then run a `roxygen2` command to build the actual documentation files in the `man/` directory. To run this, use the `Build` tab and then `More -> Document`.

Hadley Wickham has a more complete discussion of package documentation in a vignette for `roxygen2`. If/when the information in this chapter seems insufficient, that should be your next resource.
The documentation information is written as comments, so documentation lines always start with `#'`. For both datasets and functions, the first couple of lines give the short title and description.
```r
#' A short title
#'
#' A longer paragraph that describes the context of the dataset/function and
#' discusses important aspects that will be necessary for somebody first seeing
#' the data/function to know about. Any text in these initial paragraphs will be
#' in the description section of the documentation file.
#'
```
16.3.1 Data Documentation
Data set documentation should contain both general information about the context of the data and detailed information about the data columns. The documentation should also include information about where the data came from, if that is available. The title and description are given in the first paragraphs of the documentation, but the format and source information need explicit tags to indicate where those sections start.
```r
#' A short title
#'
#' A longer paragraph that describes the context of the dataset and
#' discusses important aspects that will be necessary for somebody first seeing
#' the data to know about. Any text in these initial paragraphs will be
#' in the description section of the documentation file.
#'
#' @format A data frame with XXX observations with ZZ columns.
#' \describe{
#'   \item{Column1}{Description of column 1, including units if appropriate}
#'   \item{Column2}{Description of column 2, including units if appropriate}
#' }
"DataSetName"
```
There are a few other documentation sections that can be filled in. They are all introduced with a tag such as `@source` or `@references`.
```r
#' @source This describes where the data came from
#' @references If we need to cite some book or journal article.
```
16.3.2 Documenting Functions
Functions that you want other people to use need to be documented. In particular, we need a general description of what the function does, a list of all function arguments and what they do, and what type of object the function returns. Finally, it is nice to have some examples that demonstrate how the function can be used.
```r
#' Sum two numeric objects.
#'
#' Because this is a very simple function, my explanation is short. These
#' paragraphs should explain everything you need to know.
#'
#' This is still in the description part of the documentation and it
#' will be until we see something that indicates a new section.
#'
#' @param a A real number
#' @param b A real number
#' @return The sum of \code{a} and \code{b}
#' @examples
#' my.sum(12, 5)
#' my.sum(4, -2)
#' @export
my.sum <- function( a, b ){
  return( a + b )
}
```
Each of the sections is self explanatory except for `@export`. The purpose of this tag is to indicate that the function should be available to any user of the package. If a function is not exported, it is available only to other functions within the package. This can be convenient when there are helper functions that support the analysis but that you don't want the user to see, because it would be too much work to explain that they shouldn't be used.
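To make the exported/internal distinction concrete, here is a minimal sketch. The helper `center_values()` and the exported `scale_values()` are hypothetical functions invented for this example: the internal helper omits the `@export` tag and is only visible to other code in the package, while the user-facing function includes it.

```r
# Internal helper: no @export tag, so users of the package never see it directly.
center_values <- function(x){
  x - mean(x, na.rm = TRUE)
}

#' Center and scale a numeric vector.
#'
#' @param x A numeric vector
#' @return The centered and scaled version of \code{x}
#' @export
scale_values <- function(x){
  center_values(x) / sd(x, na.rm = TRUE)
}
```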
Other tags that you might use:

- `@seealso` allows you to point to other resources:
    - on the web: `\url{http://www.r-project.org}`
    - in your package: `\code{\link{hello}}`
    - in another package: `\code{\link[package]{function}}`
- `@aliases alias_1 alias_2 ...` Other topics that, when searched for, will point to this documentation.
- `@author` This isn't necessary if the author is the same as the package author.
- `@references` This is a text area to point to journal articles or other literature sources.
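As a brief illustration, these tags might be attached to the `my.sum()` example from above along the following lines; the alias names and the reference shown are placeholders, not real documentation entries:

```r
#' @seealso \code{\link{sum}}, \url{http://www.r-project.org}
#' @aliases my_sum addition_example
#' @author A. Different Author
#' @references Author, A. (Year). Title of a relevant article or book.
```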
16.4 Testing
In any package that contains R functions, we need to make sure those functions work correctly. As I write a function, I build test cases that verify the function does exactly what I claim it does. I then want to save all of those simple test cases and automatically run them each time I re-build the package.
Moving from ad-hoc testing to formalized unit testing results in substantial improvements to your package and your code for a variety of reasons:

- Cleaner functionality. Because unit testing requires you to think about how your code should respond in different instances, you think more clearly about what the appropriate inputs and outputs should be, and as a result you are less likely to have functions that do WAY too much and are difficult to test. Separate smaller functions are easier to write, easier to test, and ultimately more reliable.
- Robust code. With unit testing, it is easier to make changes and feel confident that you haven't broken previously working code. In particular, it allows us to capture weird edge cases and make sure they are always tested for.
```r
# To set up your package to use the testthat package run:
usethis::use_testthat()
```
What this command does is:

- Creates a `tests/testthat/` directory in the package.
- Adds `testthat` to the `Suggests` field in the `DESCRIPTION` file.
- Creates a file `tests/testthat.R` that runs all your tests when `R CMD check` runs. (A sketch of this file is shown below.)
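For reference, the `tests/testthat.R` file that `usethis` creates is just a short driver script along these lines, shown here assuming the package is named `MyAwesomePackage`:

```r
# tests/testthat.R -- runs every test file in tests/testthat/
library(testthat)
library(MyAwesomePackage)

test_check("MyAwesomePackage")
```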
Next you create `.R` files in the `tests/testthat/` directory named `test_XXX.r`. In those files, you'll put your test code.
Recently, I had to utilize a truncated distribution and the `trunc` package didn't work for the distribution I needed, so I created my own version, the `trunc2` package. However, I wanted to be absolutely certain that I was getting the correct answers, so I wrote some unit tests.
```r
# Check to see if I get the same values in Poisson distribution
test_that('Unrestricted Poisson values correct', {
  expect_equal(dpois( 2, lambda=3 ),  dtrunc(2,  'pois', lambda=3) )
  expect_equal(ppois( 2, lambda=3 ),  ptrunc(2,  'pois', lambda=3) )
  expect_equal(qpois( .8, lambda=3 ), qtrunc(.8, 'pois', lambda=3) )
})

# Check to see if I get the same values in Exponential distribution
test_that('Unrestricted Exponential values correct', {
  expect_equal(dexp( 2, rate=3 ),  dtrunc(2,  'exp', rate=3) )
  expect_equal(pexp( 2, rate=3 ),  ptrunc(2,  'exp', rate=3) )
  expect_equal(qexp( .8, rate=3 ), qtrunc(.8, 'exp', rate=3) )
})
```
The idea is that each `test_that()` command tests some piece of functionality and each `expect_XXX()` call tests some atomic unit of computing. I would then have multiple files, where each file is named `test_XXX` and has some organizational rationale. The expectation functions give you a way to have your function calculate something and compare it to what you think the output should be. These functions start with `expect_` and throw an error if the expectation is not met. In the table below, `a` and `b` represent expressions to be evaluated.
| Function | Description |
|---|---|
| `expect_equal(a, b)` | Are the two inputs equal (up to numerical tolerances)? |
| `expect_match(a, b)` | Does the character string `a` match the regular expression `b`? |
| `expect_error(a)` | Does expression `a` cause an error? |
| `expect_is(a, b)` | Does the object `a` have the class listed in character string `b`? |
| `expect_true(a)` | Does `a` evaluate to `TRUE`? |
| `expect_false(a)` | Does `a` evaluate to `FALSE`? |
The `expect_true()` and `expect_false()` functions are intended as a catch-all for cases that can't be captured using one of the other expect functions. There are a few more `expect_XXX()` functions and you can see more detail in Hadley's chapter on testing in his R Packages book.
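To make a few of the other expectations concrete, here is a minimal sketch of additional tests for the `my.sum()` function defined earlier in the chapter; the specific checks are examples I made up, not tests from the book:

```r
test_that('my.sum handles basic cases', {
  expect_true(  my.sum(2, 3) == 5 )           # catch-all style check
  expect_false( my.sum(2, 3) == 6 )
  expect_is(    my.sum(2L, 3L), 'integer' )   # class of the result
  expect_error( my.sum('a', 3) )              # adding a character errors in R
})
```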
Each test should cover a single unit of functionality and if the test fails, you should easily know the underlying cause and know where/how to find/fix the issue. Each test name should complete the sentence “Test that …” so that when we run the unit testing and something fails, we know exactly which test failed and what the underlying problem is.
Now that we have the testing setup built, the work flow is simple:

1. Edit/modify your code or test definitions.
2. Test your package with `Ctrl/Cmd + Shift + T` or `devtools::test()`. This causes all of your functions to be re-created (thus capturing any new changes to the functions) and then runs the testing commands.
3. Repeat until all tests pass and there are no new test cases to implement.
16.5 The DESCRIPTION file
I never write the DESCRIPTION file from scratch; rather, it is generated from a template when the package structure is initially created. It is useful to go into this file and edit it.
```
Package: MyAwesomePackage
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R: 
    person(given = "First",
           family = "Last",
           role = c("aut", "cre"),
           email = "first.last@example.com",
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: What the package does (one paragraph).
License: What license it uses
Encoding: UTF-8
LazyData: true
```
Often you want your package to make other packages available so that they can be used in any functions you write. To do this, you'll add lines to the DESCRIPTION file.
```
Depends: magrittr
Imports: dplyr, ggplot2, tidyr
Suggests: lme4
```
In this example, I've included a dependency on the `magrittr` package, which defines the `%>%` operator, while the `dplyr`, `ggplot2`, `tidyr`, and `lme4` packages are included in a slightly different manner.
| Package Dependency Type | Description |
|---|---|
| `Depends` | These packages are required to have been downloaded from CRAN and will be attached to the namespace when your package is loaded. If your package is going to be widely used, you want to keep this list as short as possible to avoid function name clashes. |
| `Imports` | These packages are required to be present on the computer, but will not be attached to the namespace. Whenever you want to use one of them in one of your functions, you'll need to use the `PackageName::FunctionName()` syntax. |
| `Suggests` | These packages are not required. Often these are packages of data that are only used in the examples, the unit tests, or in a vignette. These are not loaded/attached by default. |
For widely distributed packages, using `Imports` is the preferred way to utilize other packages in your code because it avoids namespace problems. For example, because the packages `MASS` and `dplyr` both have a `select()` function, it is advisable to avoid depending on `dplyr` just in case the user also has the `MASS` package loaded. However, this choice is annoying because you then have to use the `PackageName::FunctionName()` syntax within all of the functions in your package.
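For instance, a package that lists `dplyr` under `Imports` would call it with an explicit namespace prefix inside its functions. A minimal sketch, where `summarize_weather()` and its column names are made up for illustration, and the `%>%` pipe is available because `magrittr` appears under `Depends` in the example above:

```r
#' Average a weather variable by month.
#'
#' @param data A data frame with columns Month and MaxTemp
#' @return A data frame of monthly mean maximum temperatures
#' @export
summarize_weather <- function(data){
  # dplyr is listed under Imports, so every call uses the dplyr:: prefix
  data %>%
    dplyr::group_by(Month) %>%
    dplyr::summarize(MeanMaxTemp = mean(MaxTemp, na.rm = TRUE))
}
```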
For a data analysis package, I usually leave the Depends/Imports/Suggests fields blank and just load whatever analysis packages I need in the RMarkdown files that live in the `docs/` directory.
16.7 An Example Package
I find it is easiest to use RStudio to start a new package via `File -> New Project...`, then start a project in a new directory, and finally select that we want a new R package.

Alternatively, we could use the `usethis::create_package()` function to build the minimal package.
```r
usethis::create_package('~/GitHub/TestPackage')   # replace the path with where you want it...
```
Once the package is created:
1. Put any `.csv` or `.xls` data files you have in the `data-raw/` sub-directory. For this example, save the file https://raw.githubusercontent.com/dereksonderegger/444/master/data-raw/FlagMaxTemp.csv from the STA 444/5 GitHub page. In the same `data-raw` directory, create an R script or Rmarkdown file that reads the data in and cleans it up by renaming columns, or whatever else is needed. An example R script might look something like this:

    ```r
    library(tidyverse)

    # Read in the data. Do some cleaning/verification
    MaxTemp <- read.csv('data-raw/FlagMaxTemp.csv') %>%
      gather('DOM', 'MaxTemp', X1:X31) %>%
      drop_na() %>%
      mutate(DOM = str_remove(DOM, fixed('X')) ) %>%
      mutate(Date = lubridate::ymd( paste( Year, Month, DOM )) ) %>%
      select(Date, MaxTemp)

    # Save the data frame to the data/ directory as MaxTemp.rda
    usethis::use_data(MaxTemp)
    ```
2. In the `R/` directory, create the file `MaxTemp.R`. When the package is built, this will document the dataset.

    ```r
    #' A time series of daily maximum temperatures in Flagstaff, AZ.
    #'
    #' @format a data frame with 10882 observations
    #' \describe{
    #'   \item{Date}{The date of observation as a POSIX date format.}
    #'   \item{MaxTemp}{Daily maximum temperature in degrees Fahrenheit.}
    #' }
    #' @source \url{www.ncdc.noaa.gov}
    "MaxTemp"
    ```
3. Build the package by going to the `Build` tab.
    a. Create the documentation.
        - The first time, you'll need to enable Roxygen style documentation. Do this by clicking the `Build` tab, then `More -> Configure Build Tools`. Finally select the tick-box to build documentation using Roxygen.
        - Click `More` and select `Document` to create the data frame documentation. The shortcut is `Ctrl/Cmd + Shift + D`.
    b. Install the package.
        - Click `Install and Restart` to build the package. The shortcut is `Ctrl/Cmd + Shift + B`.
4. Create the `docs/` directory and then create an RMarkdown file that does some analysis.

    ````
    ---
    title: "My Awesome Analysis"
    author: "Derek Sonderegger"
    date: "9/18/2019"
    output: html_document
    ---

    This Rmarkdown file will do the analysis.

    ```{r, eval=FALSE}
    library(TestPackage)  # load TestPackage, which includes the MaxTemp data frame.
    library(ggplot2)

    ggplot(MaxTemp, aes(x=Date, y=MaxTemp)) +
      geom_line()
    ```

    We see that the daily max temperature in Flagstaff varies quite a lot.
    ````
16.8 Exercises
1. Build a package that contains a dataset giving weather information at Flagstaff's Pulliam Airport from 1950 to 2019. I have the data and metadata on my GitHub site and I downloaded the data on 9-19-19 from https://www.ncdc.noaa.gov/cdo-web/search. In the `data-raw` directory, there are the files `Pulliam_Airport_Weather_Station.csv` and its associated metadata `Pulliam_Airport_Weather_Station_Metadata.txt`. In the data, there are a bunch of columns that contain attribute information about the preceding column; I don't think those are helpful, or at least the metadata didn't explain how to interpret them, so remove those. Many of the later columns have values that are exclusively ones or zeros. I believe those indicate if the weather phenomenon was present that day. Presumably a `1` is a yes, but I don't know that. When I downloaded the data, I asked for "standard" units, so precipitation and snow amounts should be in inches, and temperature should be in Fahrenheit. For this package, we only care about a few variables: `DATE`, `PRCP`, `SNOW`, `TMAX`, and `TMIN`.
    a. Create a new package named `YourNameFlagWeather`. In the package, use the `usethis::use_data_raw()` function to create the `data-raw/` directory. Place the data and metadata there.
    b. Also in the `data-raw` directory, create an R script that reads in the data and does any necessary cleaning. Call your resulting data frame `Flagstaff_Weather` and save an `.rda` file to the `data/` directory using the command `usethis::use_data(Flagstaff_Weather)`. Keep and document only the variables `DATE`, `PRCP`, `SNOW`, `TMAX`, and `TMIN`.
    c. In the `R/` directory, create a file `Flagstaff_Weather.R` that documents where the data came from and what each of the columns means.
    d. Set RStudio to build documentation using Roxygen by clicking the `Build` tab, then `More -> Configure Build Tools`, and click the box for generating documentation with Roxygen. Select `OK` and then build the appropriate documentation file by clicking the `Build` tab, then `More -> Document`.
    e. Load your package and restart your session of R, again using the `Build` tab.
    f. Create a new directory in your package called `docs/`. In that directory, create an RMarkdown file that loads your package and uses the weather data to make a few graphs of weather phenomena over time.
    g. Suppose that we decided to change something in the data and we need to rebuild the package.
        - Change the name of one of the columns in your cleaning script.
        - Re-run the cleaning script and the `usethis::use_data` command.
        - Re-install the package using the `Build` tab and `Install and Restart`.
        - Verify that the `Flagstaff_Weather` object has changed.
        - Verify that the documentation hasn't changed yet.
        - Update the documentation file for the dataset and re-run the documentation routine.
        - Re-install the package and check that the documentation is now correct.
2. Recall writing the function `FizzBuzz` in the chapter on functions. We will add this function to our package and include both documentation and unit tests.
    a. Copy your previously submitted `FizzBuzz` function into an R file at `R/FizzBuzz.R`. If necessary, modify your code so that the input is a vector of integers and your function returns the associated vector of FizzBuzz responses.
    b. Document what the function does, what its arguments are, and what its result should be using the Roxygen2 notation. Run the package documentation, rebuild your package, and make sure the documentation works.
    c. Set your package up for unit testing by running `usethis::use_testthat()`.
    d. Add unit tests for testing that the length of the output is the same as the input `n`. (A possible skeleton is sketched after this exercise.)
    e. Modify your function so that if the user inputs a negative, zero, or infinite value for `n`, the function throws an error using the command `stop('Error Message')`. Modify the error message appropriately for the input `n`. Hint: there is a family of functions `is.XXX()` which test a variety of conditions. In particular there is an `is.infinite()` function.
    f. Add unit tests that address what should happen if the user inputs a negative, zero, or infinite value. Verify that all your unit tests pass when you cause the unit tests to be run using the build tools.
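As a starting point for parts (d) and (f), a test file might be organized along these lines. This is only a sketch, assuming your `FizzBuzz(n)` takes a single positive integer `n`; it is not a complete solution and the file name is just a suggestion:

```r
# tests/testthat/test_FizzBuzz.R  -- a sketch, not a complete solution
test_that('output length matches the input n', {
  expect_equal( length(FizzBuzz(10)), 10 )
})

test_that('bad inputs throw an error', {
  expect_error( FizzBuzz(-3)  )
  expect_error( FizzBuzz(0)   )
  expect_error( FizzBuzz(Inf) )
})
```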
3. Now save the package as one file by building a source package using the `Build` tab, then `More -> Build Source Package`. This will create a `.tar.gz` file that you can easily upload to Bblearn.
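If you prefer to work from the console, `devtools::build()` should produce the same source `.tar.gz` file; the path below is only a placeholder for wherever your package lives:

```r
# Build a source package (.tar.gz) from the package directory
devtools::build('~/GitHub/TestPackage')
```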