35 Utilities
One of the many great things about the R ecosystem is that data scientists around the world are tackling the same types of things that you and I might encounter, creating solutions to those problems, and then sharing those solutions as packages that add new functionality to the ecosystem.
This chapter introduces some of the packages that I have found useful.
35.1 {datapasta}
.
Reference page: https://milesmcbain.github.io/datapasta/
datapasta is about reducing resistance associated with copying and pasting data to and from R. It is a response to the realisation that I often found myself using intermediate programs like Sublime to munge text into suitable formats. Addins and functions in datapasta support a wide variety of input and output situations, so it (probably) “just works”. Hopefully tools in this package will remove such intermediate steps and associated frustrations from our data slinging workflows.
We don’t use {datapasta} in this course, but the chances are high that someday soon you’ll need to copy-and-paste data.
35.2 {glue}
·
Reference page: https://glue.tidyverse.org/
Glue offers interpreted string literals that are small, fast, and dependency-free. Glue does this by embedding R expressions in curly braces which are then evaluated and inserted into the argument string.
- You can use a function that you have installed but not loaded through the
library()
function, by putting the name of the package followed by the function you want to use. In this case, we are using theglue()
function from the {glue} package, so our code looks likeglue::glue()
. The variable from our data frame gets put in braces, and the string returned is all within quotation marks.
Here’s a simple example:
glue('BIDA', '302')
## BIDA302
We can also paste together variables, by putting the name inside curly braces {}
:
course <- "BIDA"
course_number <- 405
glue({course}, {course_number})
## BIDA405
We can specify a character to separate the elements we are gluing, with .sep = "<string>"
. In this example it is a space:
glue('_The Mandolorian_',
'is a tv series on the Disney+ streaming service',
'that features Baby Yoda.', .sep = " ")
## _The Mandolorian_ is a tv series on the Disney+ streaming service that features Baby Yoda.
Glue together separate year (as a variable), month, and day values, separated by a hyphen “-”, to create a single date string.
# assign year to variable
this_year <- 2020
# solution
glue({this_year},
'02',
'29',
.sep = "-")
## 2020-02-29
35.2.1 glue_data()
We can glue data in a {dplyr} pipe with the glue_data()
function.
First, let’s filter the {gapminder} data so we have a smaller piece to work with:
## # A tibble: 3 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 2007 80.7 33390141 36319.
## 2 Japan Asia 2007 82.6 127467972 31656.
## 3 New Zealand Oceania 2007 80.2 4115771 25185.
Now we can create multiple strings, one for each of the rows in the gm_3
tibble:
gm_3 |>
glue_data("{country} has a life expectancy of {lifeExp}.")
## Canada has a life expectancy of 80.653.
## Japan has a life expectancy of 82.603.
## New Zealand has a life expectancy of 80.204.
We can also apply functions within the curly braces:
gm_3 |>
glue_data("{country} has a life expectancy of {round(lifeExp, 1)}.")
## Canada has a life expectancy of 80.7.
## Japan has a life expectancy of 82.6.
## New Zealand has a life expectancy of 80.2.
Introducing the starwars
character data table:
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr>
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooine Droid <chr>
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo Droid <chr>
## 4 Dart… 202 136 none white yellow 41.9 male mascu… Tatooine Human <chr>
## 5 Leia… 150 49 brown light brown 19 fema… femin… Alderaan Human <chr>
## 6 Owen… 178 120 brown, gr… light blue 52 male mascu… Tatooine Human <chr>
## # ℹ 2 more variables: vehicles <list>, starships <list>
Create a new variable in the data frame by gluing the variables species
and gender
.
# solution
starwars |>
mutate(species_gender = glue("{species} - {gender}")) |>
select(name, species, gender, species_gender)
## # A tibble: 87 × 4
## name species gender species_gender
## <chr> <chr> <chr> <glue>
## 1 Luke Skywalker Human masculine Human - masculine
## 2 C-3PO Droid masculine Droid - masculine
## 3 R2-D2 Droid masculine Droid - masculine
## 4 Darth Vader Human masculine Human - masculine
## 5 Leia Organa Human feminine Human - feminine
## 6 Owen Lars Human masculine Human - masculine
## 7 Beru Whitesun Lars Human feminine Human - feminine
## 8 R5-D4 Droid masculine Droid - masculine
## 9 Biggs Darklighter Human masculine Human - masculine
## 10 Obi-Wan Kenobi Human masculine Human - masculine
## # ℹ 77 more rows
35.3 {here}
·
Reference page: https://here.r-lib.org/
The goal of the here package is to enable easy file referencing in project-oriented workflows. In contrast to using setwd(), which is fragile and dependent on the way you organize your files, here uses the top-level directory of a project to easily build paths to files.
The here()
package is used as part of the Workflow chapter.
35.4 {janitor}
·
Reference page: https://sfirke.github.io/janitor/
The janitor package is an R package that has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimised for user-friendliness.
The main janitor functions:
perfectly format data frame column names;
isolate partially-duplicate records; and
provide quick tabulations (i.e., frequency tables and crosstabs).
function | action |
---|---|
cleaning | |
clean_names() |
handles problematic variable names |
remove_empty() |
removes any columns that are entirely empty and entire rows that are entirely empty |
excel_numeric_to_date() |
converts Excel date serial numbers to class Date
|
data summaries | |
tabyl() |
takes a vector and returns a frequency table |
adorn_() |
a family of functions for additional summary values |
35.4.1 clean_names()
This is going to demonstrate {janitor} clean_names()
function.
The function
changes the capitalization to lower case
replaces spaces and periods with underscores (“_“)
changes unusual characters (e.g. percent) with text variations
removes non-standard characters (e.g. dollar signs)
# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 7))
names(test_df) <- c(
# snake case
"firstName",
# non-standard characters
"ábc@!*",
# percent and parentheses
"% successful (2009)",
# repeated name, with period separator
"REPEAT.VARIABLE",
"REPEAT.VARIABLE",
# two variables with blank for name
"",
"")
test_df
## firstName ábc@!* % successful (2009) REPEAT.VARIABLE REPEAT.VARIABLE
## 1 NA NA NA NA NA NA NA
Now we apply the clean_names()
function:
clean_df <- test_df |>
janitor::clean_names()
clean_df
## first_name abc percent_successful_2009 repeat_variable repeat_variable_2 x x_2
## 1 NA NA NA NA NA NA NA
Compare the cleaned variable names with those in the original dataframe. What’s changed?
35.4.2 another clean_names()
example
In this example, we read in the valid rows from the “deaths.xlsx” file (part of the [{readxl} package] https://readxl.tidyverse.org/).
deaths <- read_excel("data/deaths.xlsx", skip = 4, n_max = 10)
deaths
## # A tibble: 10 × 6
## Name Profession Age `Has kids` `Date of birth` `Date of death`
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
## 7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19 00:00:00
## 9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25 00:00:00
If we wanted to filter the cases where the individual had children, we have to surround the Has kids
variable name in backticks, since there is a space. We also have to remember to capitalize the “H” (although this is less of an issue if we are using RStudio, as the autofill is not sensitive to case).
deaths |>
filter(`Has kids` == TRUE)
## # A tibble: 7 × 6
## Name Profession Age `Has kids` `Date of birth` `Date of death`
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 7 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
Using the clean_names()
function from {janitor} gives us a solution: it makes the variable names consistently lower-case, and with the spaces replaced by an underscore (“_“).
deaths <- deaths |>
janitor::clean_names()
deaths
## # A tibble: 10 × 6
## name profession age has_kids date_of_birth date_of_death
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
## 7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19 00:00:00
## 9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
## 10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25 00:00:00
This then means that our filter statement does not need backticks around the variable name.
deaths |>
filter(has_kids == TRUE)
## # A tibble: 7 × 6
## name profession age has_kids date_of_birth date_of_death
## <chr> <chr> <dbl> <lgl> <dttm> <dttm>
## 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
## 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
## 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
## 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
## 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
## 6 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24 00:00:00
## 7 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18 00:00:00
35.4.3 remove_empty()
The {janitor} package also has functions to clean up dataframes that have empty rows and columns.
First, we’ll read in an Excel file called “dirty_data.xlsx”.
roster_raw <- readxl::read_excel("data/dirty_data.xlsx") # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
## Rows: 13
## Columns: 11
## $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shi…
## $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce…
## $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher…
## $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "En…
## $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 422…
## $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
## $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No",…
## $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification...9 <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "S…
## $ Certification...10 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Phys…
## $ Certification...11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
roster_raw
## # A tibble: 13 × 11
## `First Name` `Last Name` `Employee Status` Subject `Hire Date` `% Allocated` `Full time?`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Jason Bourne Teacher PE 39690 0.75 Yes
## 2 Jason Bourne Teacher Drafting 39690 0.25 Yes
## 3 Alicia Keys Teacher Music 37118 1 Yes
## 4 Ada Lovelace Teacher <NA> 27515 1 Yes
## 5 Desus Nice Administration Dean 41431 1 Yes
## 6 Chien-Shiung Wu Teacher Physics 11037 0.5 Yes
## 7 Chien-Shiung Wu Teacher Chemistry 11037 0.5 Yes
## 8 <NA> <NA> <NA> <NA> NA NA <NA>
## 9 James Joyce Teacher English 32994 0.5 No
## 10 Hedy Lamarr Teacher Science 27919 0.5 No
## 11 Carlos Boozer Coach Basketball 42221 NA No
## 12 Young Boozer Coach <NA> 34700 NA No
## 13 Micheal Larsen Teacher English 40071 0.8 No
## # ℹ 4 more variables: `do not edit! --->` <lgl>, Certification...9 <chr>, Certification...10 <chr>,
## # Certification...11 <lgl>
As we can see, this table has messy names and a messy structure. In the code below, we’ll use the clean_names()
and remove_empty()
functions from {janitor}.
roster <- roster_raw |>
clean_names() |>
remove_empty(c("rows", "cols"))
roster
## # A tibble: 12 × 9
## first_name last_name employee_status subject hire_date percent_allocated full_time
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Jason Bourne Teacher PE 39690 0.75 Yes
## 2 Jason Bourne Teacher Drafting 39690 0.25 Yes
## 3 Alicia Keys Teacher Music 37118 1 Yes
## 4 Ada Lovelace Teacher <NA> 27515 1 Yes
## 5 Desus Nice Administration Dean 41431 1 Yes
## 6 Chien-Shiung Wu Teacher Physics 11037 0.5 Yes
## 7 Chien-Shiung Wu Teacher Chemistry 11037 0.5 Yes
## 8 James Joyce Teacher English 32994 0.5 No
## 9 Hedy Lamarr Teacher Science 27919 0.5 No
## 10 Carlos Boozer Coach Basketball 42221 NA No
## 11 Young Boozer Coach <NA> 34700 NA No
## 12 Micheal Larsen Teacher English 40071 0.8 No
## # ℹ 2 more variables: certification_9 <chr>, certification_10 <chr>
35.4.4 Other useful {janitor} functions
We can also extend the pipe with other cleaning functions. Note that excel_numeric_to_date()
is another {janitor} function, while coalesce()
is from {dplyr} (and is similar to the SQL function of the same name).
roster <- roster_raw |>
clean_names() |>
remove_empty(c("rows", "cols")) |>
mutate(hire_date = excel_numeric_to_date(hire_date),
cert = coalesce(certification_9, certification_10)) |> # from dplyr
select(-certification_9, -certification_10) # drop unwanted columns
roster
## # A tibble: 12 × 8
## first_name last_name employee_status subject hire_date percent_allocated full_time cert
## <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr>
## 1 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physica…
## 2 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physica…
## 3 Alicia Keys Teacher Music 2001-08-15 1 Yes Instr. …
## 4 Ada Lovelace Teacher <NA> 1975-05-01 1 Yes PENDING
## 5 Desus Nice Administration Dean 2013-06-06 1 Yes PENDING
## 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.5 Yes Science…
## 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.5 Yes Science…
## 8 James Joyce Teacher English 1990-05-01 0.5 No English…
## 9 Hedy Lamarr Teacher Science 1976-06-08 0.5 No PENDING
## 10 Carlos Boozer Coach Basketball 2015-08-05 NA No Physica…
## 11 Young Boozer Coach <NA> 1995-01-01 NA No Politic…
## 12 Micheal Larsen Teacher English 2009-09-15 0.8 No Vocal m…
-
coalesce()
finds the first non-missing value at each position
## [1] 1 2 93 4
What happens if you put y
first in the coalesce()
function?
# solution
coalesce(y, x)
## [1] 1 2 93 94
# the last value in the sequence is pulled from `y` instead of `x`
35.4.5 Duplicate records with get_dupes()
We can find the cases where there are duplicates in the data records. In this case, we are looking for duplicate names:
roster |> get_dupes(first_name, last_name)
## # A tibble: 4 × 9
## first_name last_name dupe_count employee_status subject hire_date percent_allocated full_time
## <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr>
## 1 Chien-Shiung Wu 2 Teacher Physics 1930-03-20 0.5 Yes
## 2 Chien-Shiung Wu 2 Teacher Chemistry 1930-03-20 0.5 Yes
## 3 Jason Bourne 2 Teacher PE 2008-08-30 0.75 Yes
## 4 Jason Bourne 2 Teacher Drafting 2008-08-30 0.25 Yes
## # ℹ 1 more variable: cert <chr>
Find the cases where the subject is duplicated.
# solution
roster |> get_dupes(subject)
## # A tibble: 4 × 9
## subject dupe_count first_name last_name employee_status hire_date percent_allocated full_time
## <chr> <int> <chr> <chr> <chr> <date> <dbl> <chr>
## 1 English 2 James Joyce Teacher 1990-05-01 0.5 No
## 2 English 2 Micheal Larsen Teacher 2009-09-15 0.8 No
## 3 <NA> 2 Ada Lovelace Teacher 1975-05-01 1 Yes
## 4 <NA> 2 Young Boozer Coach 1995-01-01 NA No
## # ℹ 1 more variable: cert <chr>
35.4.6 Simple tables with tabyl
The {janitor} package also offers some simple tabulations with the tabyl()
function.
One variable returns counts and percentages:
roster |>
tabyl(full_time)
## full_time n percent
## No 5 0.4166667
## Yes 7 0.5833333
If we use two variables, it takes the first one as the rows, and the second as the columns:
roster |>
tabyl(full_time, subject)
## full_time Basketball Chemistry Dean Drafting English Music PE Physics Science NA_
## No 1 0 0 0 2 0 0 0 1 1
## Yes 0 1 1 1 0 1 1 1 0 1
And for three variables, it makes separate tables based on the third one, then creates tables like the two-variable version:
roster |>
tabyl(full_time, subject, employee_status)
## $Administration
## full_time Basketball Chemistry Dean Drafting English Music PE Physics Science
## No 0 0 0 0 0 0 0 0 0
## Yes 0 0 1 0 0 0 0 0 0
##
## $Coach
## full_time Basketball Chemistry Dean Drafting English Music PE Physics Science NA_
## No 1 0 0 0 0 0 0 0 0 1
## Yes 0 0 0 0 0 0 0 0 0 0
##
## $Teacher
## full_time Basketball Chemistry Dean Drafting English Music PE Physics Science NA_
## No 0 0 0 0 2 0 0 0 1 0
## Yes 0 1 0 1 0 1 1 1 0 1
There’s also an option to drop levels with zero cases:
roster |>
tabyl(full_time, subject, employee_status, show_missing_levels = FALSE)
## $Administration
## full_time Dean
## Yes 1
##
## $Coach
## full_time Basketball NA_
## No 1 1
##
## $Teacher
## full_time Chemistry Drafting English Music PE Physics Science NA_
## No 0 0 2 0 0 0 1 0
## Yes 1 1 0 1 1 1 0 1
35.4.7 adorn
tables with totals and percentages
You can also add totals and percentages with the adorn_()
functions. This example creates a fancy version of what we had before.
- Note: I found the naming of the totals a bit counter-intuitive.
adorn_totals("row")
creates a new row with the total for that column, whileadorn_totals("col")
is a new column with the row total.
roster |>
tabyl(employee_status, full_time) |>
adorn_totals("col") |>
adorn_percentages() |>
adorn_pct_formatting() |>
adorn_ns() |>
adorn_title("combined")
## employee_status/full_time No Yes Total
## Administration 0.0% (0) 100.0% (1) 100.0% (1)
## Coach 100.0% (2) 0.0% (0) 100.0% (2)
## Teacher 33.3% (3) 66.7% (6) 100.0% (9)
See the {janitor} reference page for additional details on the table functions.
35.5 {waldo}
·
Have you ever needed to compare the contents of two Excel files? The original and the one that you maybe accidentally changed? The {waldo} has functions to help with that comparison.
35.6 But wait, there’s more!
Here’s a twitter thread on packages that people nominated as deserving more attention:
And a list of some other useful packages:
- Garrett Grolemund, “Quick list of useful R packages” (2021-03-18)
-30-