Appendix: Building a Data Package for GitHub

These are just some notes on building data packages, based mostly on Chapter 14 “External Packages” of r-pkgs.org, which also covers code packages.

For our package, iGIScData, we provided data in two ways:

  • rda files: normal external data that are ready to use as data frames, simple feature (sf) data, and rasters.

  • raw data as CSVs, shapefiles and TIFFs.

rda files

These files need to be prepared from data in R and go in the data folder. The process is made very easy by using usethis::use_data() package:

usethis::use_data()

usethis::use_data() is used to add data as rda files to the data folder. These data can be data frames, simple features, rasters, and I’m sure other things.

I used a script addData.R that I put in the data-raw folder which built the data (usually with read_csv(), st_read(), or raster() and maybe some other processing like mutate, filter, etc.) to create the data set, and then usethis::use_data() to store it in the data folder. Here’s an simple example with just a csv converted directly, and it takes care of storing the result in the data folder as an .rda file:

sierraFeb <- read_csv("data-raw/sierraFeb.csv")
usethis::use_data(sierraFeb)

usethis::create_package()

This creates the package and uses roxygen to document it. I think you just need to run this once, then the devtools::document() does the rest, and can be run again to update it.

devtools::document()

This creates documentation on the data sets, using the file R/data.R, which will need to have lines of code similar to the following to document each data set. Note that the name of the data set goes last, in quotes. The formatting of the field names and descriptions is a bit tricky and doesn’t follow normal R rules. As a result, sometimes my field names don’t exactly match the actual field names. Maybe I’ll get around to changing the original field names with rename. Note that the organization is important, with the title of the data first, a blank line, then a description, etc.:

#' Sierra February climate data
#'
#' Selection from SierraData to only include February data
#'
#' @format A data frame with 82 entries and 7 variables selected and renamed \describe{
#'   \item{STATION_NAME}{Station name}
#'   \item{COUNTY}{County Name}
#'   \item{ELEVATION}{Elevation in meters}
#'   \item{LATITUDE}{Latitude in decimal degrees}
#'   \item{LONGITUDE}{Longitude in decimal degrees}
#'   \item{PRECIPITATION}{February Average Precipitation in mm}
#'   \item{TEMPERATURE}{Febrary Average Temperature in degrees C}
#' }
#' @source \url{https://www.ncdc.noaa.gov/}
"sierraFeb"

Once these are on GitHub, a user can simply install the package with devtools::install_github("iGISc/iGIScData") – to use the iGIScData we created. Then to access the data just like built-in data, the user just needs to load that library with library(iGIScData)

But we also wanted to provide raw CSVs, shapefiles and rasters, in order to demonstrate how to read those.

Raw data

Raw data (e.g. CSVs, shapefiles, and rasters) are simply stored in the inst/extdata folder. Just create those folders and put the files there. Make sure to include all the files (like the multiple files that go with a shapefile). Then, to access the data once the data package is installed, the user just needs to use the system.file() function to provide the path and then use that with the appropriate read function; e.g. for a CSV, something like:

csvPath <- system.file("extdata","TRI_2017_CA.csv")
TRI_2017_CA <- read_csv(csvPath)