14 Building a Package on GitHub for Data and Code

These are just some notes on building packages, based mostly on a couple of great sources that you should use to learn the process:

  • R Packages https://r-pkgs.org/ (Wickham and Bryan (n.d.))

    • Includes a quick run through in chapter 2 for building a package, with a function
    • Chapter 14 “External data” goes over how to build data into the package
    • Various chapter references below refer to this source
  • Happy Git and GitHub for the useR https://happygitwithr.com/ (Bryan (n.d.))

    • All about how to get Git and GitHub working

14.1 Git and GitHub

To host a package on GitHub (https://github.com), you’ll need to learn some Git methods and have at least a free GitHub account. See Happy Git and GitHub for the useR https://happygitwithr.com/ (Bryan (n.d.)) to learn everything you’ll need for this. Some key takeaways:

  • The purpose of Git is version control: keeping track of all changes in your package, including all files in its folders. https://github.com is a place to host package repositories (repos) that use it.

  • You’ll need to use GitHub to create an account and connect with collaborators. And you can use it for setting everything up in your package repository.

  • You’ll need to have the same folders on your local computer, and a method for committing and pushing up changes to GitHub.

    • You can use RStudio for most of this, and you’ll be making all of your edits in it anyway. You’ll see any changes showing up in the Git tab, where you can stage, commit, and push them to your repo.
    • GitHub Desktop can also do commits, pushes and pulls, and is very handy for updating changes to GitHub resulting from any package editing done in RStudio (or from any other changes to files and folders in the package folder, such as external data added to extdata). You might also find GitHub Desktop better for dealing with moving external files and folders manually stored in the extdata folder.

14.2 Some notes on the RStudio process

See Chapter 2 in R Packages (Wickham and Bryan (n.d.)) for a simple process. Some observations:

  • You need devtools and usethis packages.

  • You’ll need to specify a folder with create_package, and you might want to put it in a GitHub folder in Documents. It doesn’t matter where, but it’s useful to keep GitHub things together. Refer to Chapter 2 https://r-pkgs.org/whole-game.html.

  • This will create some essential files and folders:

    • DESCRIPTION: very important for naming and package information, which you’ll be editing carefully
    • NAMESPACE: where functions are declared, generated by roxygen2
    • R: folder where your functions will go.
  • Ultimately, other folders and files are created, and there’s a diagram in https://r-pkgs.org/package-structure-state.html that illustrates the structure.

    • man: documentation on all of your functions and data sets
    • inst: various things go in here, but I’ve just used it for extdata where you can store .csv and other files
    • data: data (as .rda files) that can be called up like co2 in the datasets package that comes with R.
  • You might start by following the process in Chapter 2 for creating a simple package containing a simple function, using various devtools and usethis methods. This chapter also describes editing the DESCRIPTION file, using roxygen2 to create the man file, how to document the function, and getting things ready to post to GitHub.

  • On getting things ready for GitHub, it’s very useful to use check() from time to time to see if there are any errors in the various setup files, like DESCRIPTION, file & folder structure, and documentation.

  • The process can be confusing, but the RStudio methods and the functions provided in devtools and usethis make it easy if you follow the directions, which are spelled out pretty clearly in the R Packages book. We’ll look at creating another function later in this appendix, following the method provided in R Packages.
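As a rough sketch of the first steps above, the following creates a throwaway package skeleton in a temporary folder and checks what usethis generated. The package name demopkg and the temporary location are assumptions for illustration only; in practice you would give create_package() a permanent folder, such as one inside your GitHub folder.

```r
library(usethis)

# Hypothetical package name and a temporary location, used only for illustration
pkg_path <- file.path(tempdir(), "demopkg")

# Create the skeleton without opening a new RStudio session
create_package(pkg_path, rstudio = FALSE, open = FALSE)

# The essential pieces described above should now exist
list.files(pkg_path)
has_files <- file.exists(file.path(pkg_path, c("DESCRIPTION", "NAMESPACE")))
has_r_dir <- dir.exists(file.path(pkg_path, "R"))

# From inside the package project you would then typically run (not run here):
# usethis::use_r("myfunction")   # create R/myfunction.R for a new function
# devtools::document()           # build NAMESPACE and man/ from roxygen comments
# devtools::check()              # look for problems in DESCRIPTION, structure, docs
```

The commented calls at the end are the ones you would repeat as the package grows; check() in particular is worth running before each push.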

14.3 Data

For our package, igisci, we provided data in two ways:

  • raw data as CSVs, shapefiles and TIFFs (useful since rasters don’t seem to be supported in rda)
  • rda files: normal external data that are ready to use as data frames and simple feature (sf) data

14.3.1 Raw data in extdata

Raw data (e.g. CSVs, shapefiles, and rasters) can be simply stored in the inst/extdata folder. Just create those folders and put the files there. Make sure to include all the files (like the multiple files that go with a shapefile). Then, to access the data once the data package is installed, the user just needs to use the system.file() function to provide the path and then use that with the appropriate read function; e.g. for a CSV, something like:

library(readr)
csvPath <- system.file("extdata", "TRI/TRI_2017_CA.csv", package = "igisci")
TRI_2017_CA <- read_csv(csvPath)

… or, using the igisci::ex() function, it can be written as

TRI_2017_CA <- read.csv(ex("TRI/TRI_2017_CA.csv"))

… which isn’t much different from reading data in your RStudio project folder, so it seemed a useful method to include in this book. Adding data to the data package by simply putting them in the extdata folder is very easy, and it also works for data files like TIFFs that aren’t supported as rda files, which we’ll look at next.
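One detail worth checking when you build paths this way: system.file() looks in the base package unless you name a package, and by default it returns an empty string rather than an error when nothing is found. Here is a minimal sketch using the stats package, which ships with R (the missing file name is made up for illustration):

```r
# A file that exists in an installed package: every package has a DESCRIPTION
found <- system.file("DESCRIPTION", package = "stats")
file.exists(found)

# A file that does not exist (hypothetical name): the result is "", not an error
not_found <- system.file("extdata", "no_such_file.csv", package = "stats")
identical(not_found, "")

# mustWork = TRUE turns a silent empty path into an error you can catch early
result <- tryCatch(
  system.file("extdata", "no_such_file.csv", package = "stats", mustWork = TRUE),
  error = function(e) "not found"
)
result
```

If a read function complains that a file doesn't exist, printing the path returned by system.file() is a quick first check.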

14.3.2 Binary data as rda files

For data needed frequently, similar to co2 or other data sets provided with R, you’ll want to create binary data stored as rda files. These files need to be prepared from data in R and go in the data folder. The process is made very easy by using usethis::use_data() to add data as rda files to the data folder. These data can be data frames, simple features, and I’m sure other things.

I used a script addData.R that I put in the inst folder which built the data from some of the same files included in the extdata folder, with usethis::use_data() to store it in the data folder. Here’s a simple example with just a csv converted directly, and it takes care of storing the result in the data folder as an .rda file:

library(readr)
sierraFeb <- read_csv("inst/extdata/sierra/sierraFeb.csv")
usethis::use_data(sierraFeb)
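To see what ends up in the data folder, here is a base R sketch of roughly what use_data() does: it saves the named object to an .rda file, which load() later restores under the same name. This sketch calls save() directly in a temporary folder, with a small made-up data frame, rather than calling use_data() itself.

```r
# Small made-up data frame standing in for a real data set
demoData <- data.frame(station = c("A", "B"), elevation = c(52, 394))

# Write it to <tempdir>/data/demoData.rda, much as use_data() would under data/
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_path <- file.path(data_dir, "demoData.rda")
save(demoData, file = rda_path, compress = "bzip2")

# Loading restores the object under its original name
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
loaded_names
identical(restored$demoData, demoData)
```

Because the object name is stored in the file, the name you give the object before calling use_data() is the name users will see after library(igisci).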

14.3.2.1 devtools::document()

This creates documentation on the data sets, using the file R/data.R, which will need to have lines of code similar to the following to document each data set. Note that the name of the data set goes last, in quotes. The formatting of the field names and descriptions is a bit tricky and doesn’t follow normal R rules. As a result, sometimes my field names don’t exactly match the actual field names. Maybe I’ll get around to changing the original field names with rename. Note that the organization is important, with the title of the data first, a blank line, then a description, etc.:

#' Sierra February climate data
#'
#' Selection from SierraData to only include February data
#'
#' @format A data frame with 82 entries and 7 variables selected and renamed \describe{
#'   \item{STATION_NAME}{Station name}
#'   \item{COUNTY}{County Name}
#'   \item{ELEVATION}{Elevation in meters}
#'   \item{LATITUDE}{Latitude in decimal degrees}
#'   \item{LONGITUDE}{Longitude in decimal degrees}
#'   \item{PRECIPITATION}{February Average Precipitation in mm}
#'   \item{TEMPERATURE}{February Average Temperature in degrees C}
#' }
#' @source \url{https://www.ncdc.noaa.gov/}
"sierraFeb"

Once these are on GitHub, a user can simply install the igisci package we created with devtools::install_github("iGISc/igisci"). Then, to access the data just like built-in data, the user just needs to load the library with library(igisci).

14.3.2.2 Removing data

There may be a better way, but what worked was removing the rda files and the corresponding man files. I also edited the data.R file to remove their documentation entries, so that the man files wouldn’t be created again by devtools::document().
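Here is a sketch of that cleanup in code, run against a mock package folder in a temporary directory so nothing real is deleted. The data set name oldData and the folder name mockpkg are hypothetical; in a real package you would run the file.remove() step from the package root, delete the matching block in R/data.R by hand, and then run devtools::document().

```r
# Mock package folder with a data file and its man page (hypothetical names)
pkg <- file.path(tempdir(), "mockpkg")
dir.create(file.path(pkg, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg, "man"), showWarnings = FALSE)
rda <- file.path(pkg, "data", "oldData.rda")
rd  <- file.path(pkg, "man", "oldData.Rd")
file.create(rda, rd)

# Remove the binary data and its generated documentation
removed <- file.remove(rda, rd)
removed
```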

14.4 Code (functions)

You can build pretty extensive functions in a package, and you’ve been using these from a variety of packages. We’ll just look at the simple example of a function we’ll name ex() for making it easier to read data from the extdata folder of the igisci package. The methods shown here are based on https://r-pkgs.org/ which you should refer to for more thorough documentation.

Here’s the function we’ll create in the package. However, one caveat is that you want to add this after you’ve already installed the package with data, or you’ll get an error.

ex <- function(dta){
  system.file("extdata",dta,package="igisci")
  }

We want to use it in code where we read a file from extdata, keeping that code nearly as compact as reading a file from your workspace. There’s a csv file in the sierra folder of extdata named sierraFebShort.csv that we could then read with:

read.csv(ex("sierra/sierraFebShort.csv"))
##        STATION    COUNTY ELEVATION LATITUDE LONGITUDE PRECIPITATION TEMPERATURE
## 1     OROVILLE     Butte        52    39.52   -121.55           124        10.7
## 2       AUBURN    Placer       394    38.91   -121.08           160         9.7
## 3       SONORA  Tuolumne       511    37.97   -120.39           148         7.7
## 4  PLACERVILLE El Dorado       564    38.70   -120.82           171         9.2
## 5       COLFAX    Placer       725    39.09   -120.95           207         7.3
## 6  NEVADA CITY    Nevada       848    39.25   -121.00           268         6.7
## 7       QUINCY    Plumas      1042    39.94   -120.95           182         4.0
## 8     YOSEMITE  Mariposa      1225    37.75   -119.59           169         5.0
## 9      PORTOLA    Plumas      1478    39.81   -120.47            98         0.5
## 10     TRUCKEE    Nevada      1775    39.33   -120.17           126        -1.1
## 11  BRIDGEPORT      Mono      1972    38.26   -119.23            41        -2.2
## 12  LEE VINING      Mono      2072    37.96   -119.12            72         0.4
## 13       BODIE      Mono      2551    38.21   -119.01            40        -4.4

… and that doesn’t take up much more code real estate than if you had a "sierra/sierraFebShort.csv" in your workspace/RStudio project folder. You just need to remember to also include library(igisci) in your code.

As you can see, this function is already in the igisci package, but how did it get there? Here are the steps for this simple function.

  1. Use usethis::use_r() to create a script to hold your function(s). We’ll just create one function, so we’ll give the script the same name as the function, but you might want to keep a set of related functions together in one script.
usethis::use_r("ex")
  2. Add the function code to the ex.R script created.
ex <- function(dta){
  system.file("extdata",dta,package="igisci")
  }
  3. Create the documentation skeleton while editing the script by using Insert Roxygen Skeleton from the RStudio Code menu.

  4. Then fill in the various elements of the documentation (a title, parameters, what it returns, and examples), which determine what the user sees with ?ex or ?exfiles, similar to this, and save the script. If you have more than one function, the documentation for each goes just above its function definition. See https://r-pkgs.org/man.html.

#' Access external data in package
#'
#' @param dta filename to access from extdata, including folders
#'
#' @return path to the file
#' @export
#'
#' @examples
#' read.csv(ex("sierra/sierraStations.csv"))
ex <- function(dta){
  system.file("extdata",dta,package="igisci")
}
#' Listing contents of extdata from package
#'
#' @return tibble
#' @export
#'
#' @examples
#' exfiles()
exfiles <- function(){
  # dplyr and tibble are called with :: (and should be listed under Imports in
  # DESCRIPTION) rather than attached with library() inside the function
  exfilesDF <- tibble::tribble(~dir,~file,~path,~type)
  for(d in list.dirs(ex(""),recursive=FALSE)){
    # the folder name is the last element of the path
    dsplit <- strsplit(d,"/")
    len <- length(dsplit[[1]])
    dir <- dsplit[[1]][len]
    # patterns are regular expressions, so the dot is escaped and the end anchored
    for(shp in list.files(d,pattern="\\.shp$")){
      exfilesDF <- dplyr::bind_rows(exfilesDF,tibble::tribble(~dir,~file,~path,~type,dir,shp,paste0("ex(\'",dir,"/",shp,"\')"),"shapefile"))
    }
    for(tif in list.files(d,pattern="\\.tif$")){
      exfilesDF <- dplyr::bind_rows(exfilesDF,tibble::tribble(~dir,~file,~path,~type,dir,tif,paste0("ex(\'",dir,"/",tif,"\')"),"TIFF"))
    }
    for(csv in list.files(d,pattern="\\.csv$")){
      exfilesDF <- dplyr::bind_rows(exfilesDF,tibble::tribble(~dir,~file,~path,~type,dir,csv,paste0("ex(\'",dir,"/",csv,"\')"),"CSV"))
    }
    # matches both .xls and .xlsx
    for(xls in list.files(d,pattern="\\.xlsx?$")){
      exfilesDF <- dplyr::bind_rows(exfilesDF,tibble::tribble(~dir,~file,~path,~type,dir,xls,paste0("ex(\'",dir,"/",xls,"\')"),"xls"))
    }
  }
  exfilesDF
}
  5. Use devtools::document() to convert the comments to .Rd files.
devtools::document()
  6. Then preview the documentation with ?ex, and maybe use Check in the Build tab.

  7. Then when you commit these changes and push them to GitHub, the package will contain the function, ready to use as long as the user loads the library. You should see the function as ex.Rd in the man folder of your package, similar to what you see for the data described above.

This example was pretty simple, since the ex() code has no dependencies other than what’s in base R, but for anything more complicated you’ll probably want to test it thoroughly and add more documentation.
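As one possible way to do that testing, here is a minimal sketch using the testthat package, which is an assumption here, not something igisci is known to use. The helper is copied in so the sketch runs outside the package; inside a package you would normally put the test_that() calls in a file created with usethis::use_test("ex") and leave out the copy. The second file name is made up and is only expected not to exist.

```r
library(testthat)

# Copy of the helper from above, so the sketch runs outside the package
ex <- function(dta){
  system.file("extdata", dta, package = "igisci")
}

test_that("ex() returns a path to an existing file", {
  # Skipped unless the igisci package is installed
  skip_if_not_installed("igisci")
  expect_true(file.exists(ex("sierra/sierraFebShort.csv")))
})

test_that("ex() returns an empty string for a file that is not there", {
  # Hypothetical file name that should not exist in extdata
  expect_identical(ex("no_such_folder/no_such_file.csv"), "")
})
```

The second test documents a behavior users might trip over: a mistyped file name gives an empty path, and the error only appears later in the read function.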

References

Bryan, Jenny. n.d. Happy Git and GitHub for the useR. https://happygitwithr.com/.
Wickham, Hadley, and Jenny Bryan. n.d. R Packages. O’Reilly. https://r-pkgs.org.