3.5 Data Within a Package

The objective of this section is:

  • Create an R package that contains data (and associated documentation)

Many R packages are designed to manipulate, visualize, and model data so it may be a good idea for you to include some data in your package. The primary reason most developers include data in their package is to demonstrate how to use the functions included in the package with the included data. Creating a package as a means to distribute data is also a method that is gaining popularity. Additionally you may want to include data that your package uses internally, but is not available to somebody who is using your package. When including data in your package consider the fact that your compressed package file should be smaller than 5MB, which is the largest package size that CRAN allows. If your package is larger than 5MB make sure to inform users in the instructions for downloading and installing your package.

3.5.1 Data for Demos

3.5.1.1 Data Objects

Including data in your package is easy thanks to the devtools package. To include datasets in a package, first create the objects that you would like to include in your package inside of the global environment. You can include any R object in a package, not just data frames. Then make sure you’re in your package directory and use the use_data() function, listing each object that you want to include in your package. The names of the objects that you pass as arguments to use_data() will be the names of the objects when a user loads the package, so make sure you like the variable names that you’re using.

You should then document each data object that you’re including in the package. This way package users can use common R help syntax like ?dataset to find out more information about the included data set. You should create one R file called data.R in the R/ directory of your package. You can write the data documentation in the data.R file. Let’s take a look at some documentation examples from the minimap package. First we’ll look at the documentation for a data frame called maple:

#' Production and farm value of maple products in Canada
#'
#' @source Statistics Canada. Table 001-0008 - Production and farm value of
#'  maple products, annual. \url{http://www5.statcan.gc.ca/cansim/}
#' @format A data frame with columns:
#' \describe{
#'  \item{Year}{A value between 1924 and 2015.}
#'  \item{Syrup}{Maple products expressed as syrup, total in thousands of gallons.}
#'  \item{CAD}{Gross value of maple products in thousands of Canadian dollars.}
#'  \item{Region}{Postal code abbreviation for territory or province.}
#' }
#' @examples
#' \dontrun{
#'  maple
#' }
"maple"

Data frames that you include in your package should follow the general schema above where the documentation page has the following attributes:

  • An informative title describing the object.
  • A @source tag describing where the data was found.
  • A @format tag which describes the data in each column of the data frame.
  • And then finally a string with the name of the object.

The minimap package also includes a few vectors. Let’s look at the documentation for mexico_abb:

#' Postal Abbreviations for Mexico
#'
#' @examples
#' \dontrun{
#'  mexico_abb
#' }
"mexico_abb"

You should always include a title for a description of a vector or any other object. If you need to elaborate on the details of a vector you can include a description in the documentation or a @source tag. Just like with data frames the documentation for a vector should end with a string containing the name of the object.

3.5.1.2 Raw Data

A common task for R packages is to take raw data from files and to import them into R objects so that they can be analyzed. You might want to include some sample raw data files so you can show different methods and options for importing the data. To include raw data files in your package you should create a directory under inst/extdata in your R package. If you stored a data file in this directory called response.json in inst/extdata and your package is named mypackage then a user could access the path to this file with system.file("extdata", "response.json", package = "mypackage"). Include that line of code in the documentation to your package so that your users know how to access the raw data file.

3.5.2 Internal Data

Functions in your package may need to have access to data that you don’t want your users to be able to access. For example the swirl package contains translations for menu items into languages other than English, however that data has nothing to do with the purpose of the swirl package and so it’s hidden from the user. To add internal data to your package you can use the use_data() function from devtools, however you must specify the internal = TRUE argument. All of the objects you pass to use_data(..., internal = TRUE) can be referenced by the same name within your R package. All of these objects will be saved to one file called R/sysdata.rda.

3.5.3 Data Packages

There are several packages which were created for the sole purpose of distributing data including janeaustenr, gapminder, babynames, and lego. Using an R package as a means of distributing data has advantages and disadvantages. On one hand the data is extremely easy to load into R, as a user only needs to install and load the package. This can be useful for teaching folks who are new to R and may not be familiar with importing and cleaning data. Data packages also allow you document datasets using roxygen2, which provides a much cleaner and more programmer-friendly kind of code book compared to including a file that describes the data. On the other hand data in a data package is not accessible to people who are not using R, though there’s nothing stopping you from distributing the data in multiple ways.

If you decide to create a data package you should document the process that you used to obtain, clean, and save the data. One approach to doing this is to use the use_data_raw() function from devtools. This will create a directory inside of your package called data_raw. Inside of this directory you should include any raw files that the data objects in your package are derived from. You should also include one or more R scripts which import, clean, and save those data objects in your R package. Theoretically if you needed to update the data package with new data files you should be able to just run these scripts again in order to rebuild your package.

3.5.4 Summary

Including data in a package is useful for showing new users how to use your package, using data internally, and sharing and documenting datasets. The devtools package includes several useful functions to help you add data to your package including use_data() and use_data_raw(). You can document data within your package just like you would document a function.