3 More Data Manipulation in R

This chapter extends some of the data cleaning and data wrangling skills from Chapter 2. It uses survey data from the RStudio Learning R Survey. You can access this data, which includes all transformations and cleanings from Chapter 2, here.

You can load the data as follows:

load("data/rsurvey.RData")

3.1 Wide and Long Data

Data typically comes in two formats: wide and long.

A wide dataset has each individual’s data in separate columns. It is typically easy for humans to read. Here is a simple example:

country	avgtemp.1994	avgtemp.1995	avgtemp.1996
Sweden	6	5	7
Denmark	6	8	8
Norway	3	11	7

A long dataset, each variable has its own column, and each observation has its own row. In the wide dataset above, country, year, and temperature are three variables. In a long dataset, it would look like this:

country	year	avgtemp
Sweden	1994	6
Denmark	1994	6
Norway	1994	3
Sweden	1995	5
Denmark	1995	8
Norway	1995	11
Sweden	1996	7
Denmark	1996	8
Norway	1996	7

Long data is often referred to as tabular data and is more machine-readable than human-readable. It is also very similar to Tidy data.

3.2 Pivoting data from wide to long - pivot_longer()

We can easily transform data from wide to long with the pivot_longer() function from the Tidyverse. Here is a simple example:

library(tidyverse)

wide_example %>%
  pivot_longer(avgtemp.1994:avgtemp.1996, names_to = "year", values_to = "avg.temp")

## # A tibble: 9 x 3
##   country year         avg.temp
##   <chr>   <chr>           <dbl>
## 1 Sweden  avgtemp.1994        6
## 2 Sweden  avgtemp.1995        5
## 3 Sweden  avgtemp.1996        7
## 4 Denmark avgtemp.1994        6
## 5 Denmark avgtemp.1995        8
## 6 Denmark avgtemp.1996        8
## 7 Norway  avgtemp.1994        3
## 8 Norway  avgtemp.1995       11
## 9 Norway  avgtemp.1996        7

What does the code mean?

wide_example %>% - use the wide example data and

pivot_longer( - make it long

avgtemp.1994:avgtemp.1996, - the columns we want to change from wide to long

names_to = "year", - take the column names from the wide data and put them into a column called “year”

values_to = "avg.temp") - take the values from the wide data and put them into a column called “avg.temp”

3.2.1 Pivoting survey data

Recall that we had a check-all question in our survey that was a little bit messy. It is a good example of wide data:

rsurvey %>%
  select(use_1:use_11) %>% #select only the check-all questions
  head(n = 8)

##                  use_1                use_2          use_3
## 1 Statistical analysis  Data transformation       Modeling
## 2 Statistical analysis  Data transformation  Visualization
## 3 Statistical analysis  Data transformation  Visualization
## 4  Data transformation                 <NA>           <NA>
## 5 Statistical analysis  Data transformation       Modeling
## 6 Statistical analysis  Data transformation       Modeling
## 7 Statistical analysis  Data transformation       Modeling
## 8 Statistical analysis  Data transformation       Modeling
##            use_4             use_5            use_6
## 1  Visualization  Machine learning  Text processing
## 2           <NA>              <NA>             <NA>
## 3           <NA>              <NA>             <NA>
## 4           <NA>              <NA>             <NA>
## 5  Visualization              <NA>             <NA>
## 6  Visualization  Machine learning  Text processing
## 7  Visualization              <NA>             <NA>
## 8  Visualization  Machine learning  Text processing
##              use_7 use_8 use_9 use_10 use_11
## 1             <NA>  <NA>  <NA>   <NA>   <NA>
## 2             <NA>  <NA>  <NA>   <NA>   <NA>
## 3             <NA>  <NA>  <NA>   <NA>   <NA>
## 4             <NA>  <NA>  <NA>   <NA>   <NA>
## 5             <NA>  <NA>  <NA>   <NA>   <NA>
## 6             <NA>  <NA>  <NA>   <NA>   <NA>
## 7             <NA>  <NA>  <NA>   <NA>   <NA>
## 8  Survey analysis  <NA>  <NA>   <NA>   <NA>

You will notice each column has different data, and there are many NAs. This means some checked 1 or 2 items and therefore have 10 or 11 NAs while others checked more items and have less NAs. This is a perfect example of data that we can pivot from wide to long and better organize it.

rsurvey_use_long <- rsurvey %>%
  rownames_to_column("id") %>%
  select(id, use_1:use_11) %>%
  pivot_longer(use_1:use_11, names_to = "use_number", values_to="use") %>%
  mutate(use = ifelse(use == "", NA, use),
         use = str_trim(use))

rsurvey_use_long %>% head(n=10)

## # A tibble: 10 x 3
##    id    use_number use                 
##    <chr> <chr>      <chr>               
##  1 1     use_1      Statistical analysis
##  2 1     use_2      Data transformation 
##  3 1     use_3      Modeling            
##  4 1     use_4      Visualization       
##  5 1     use_5      Machine learning    
##  6 1     use_6      Text processing     
##  7 1     use_7      <NA>                
##  8 1     use_8      <NA>                
##  9 1     use_9      <NA>                
## 10 1     use_10     <NA>

What does the code mean?

rsurvey_use_long <- - For this example, we will pivot the data into an object called “rsurvey_use_long”

rsurvey %>% - use the rsurvey data and

rownames_to_column("id") %>% - In the wide dataset, each row is an individual, but we do not have a column of individual ids. This function >will make a column of row numbers called “id”. The first row will be >“1”, the second row will be “2”, etc. Now, we have ids for each individual in the dataset. We will use this later/

select(id, use_1:use_11) %>% - keep only the id column and the columns to pivot. We are removing them for this example, but you wouldn’t do this if you wanted to keep the data in the “rsurvey” data object.

pivot_longer( - change (pivot) it from wide to long

use_1:use_11, - select the columns to pivot

names_to = "use_number", - take the multiple column names and put them in a single column called “use_number”

values_to="use") %>% - take the values and put them in a column called “use”

mutate(use = ifelse(use == "", NA, use)) - make any empty text NA (some will be NA and others will just be "", which is treated as a character space)

drop_na(use) - drop any NA values in the “use” column

We could use this data to visualize and analyze, or we could include it back into the same format as the dataset by making it wide, with each column a dichotomous variable for each use, with the values 1 for “checked” and 0 for “not checked”. We will do that in the next section.

3.3 Pivoting data from long to wide - pivot_wider()

We can easily transform data from long to wide with the pivot_wider() function from the Tidyverse. Here is a simple example:

library(tidyverse)

long_example %>%
  pivot_wider(names_from = year, values_from=avgtemp)

## # A tibble: 3 x 4
##   country `1994` `1995` `1996`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 Sweden       6      5      7
## 2 Denmark      6      8      8
## 3 Norway       3     11      7

What does the code mean?

long_example %>% - use the wide example data and

pivot_wide( - make it long

names_from = year, - choose the column from which our names should come from

values_from = avgtemp) - choose the values for each new row

Before we transform the data, we need to note that there is a problem. If we count the unique values in the dataset, we will find there are 158 different options whereas, according to the survey, there should be 11 + an “other” category.

rsurvey_use_long %>%
  distinct(use) %>% # find only unique values
  count() # count unique values

## # A tibble: 1 x 1
##       n
##   <int>
## 1   159

This will cause a big problem if we pivot it. It will cause 158 additional columns to be created. The easiest way to deal with this issue is to use only the original categories and then create an other category. Here is what we can do:

rsurvey_use_long <- rsurvey_use_long %>%
  mutate(use_new = case_when(
    str_detect(use, "Statistical analysis") ~ "Statistical analysis",
    str_detect(use, "Data transformation") ~ "Data transformation",
    str_detect(use, "Modeling") ~ "Modeling",
    str_detect(use, "Visualization") ~ "Visualization",
    str_detect(use, "Machine learning") ~ "Machine learning", 
    str_detect(use, "Text processing") ~ "Text processing",
    str_detect(use, "Genomics") ~ "Genomics",
    str_detect(use, "Medicine") ~ "Medicine",
    str_detect(use, "Survey analysis") ~ "Survey analysis",
    str_detect(use, "Clinical trials") ~ "Clinical trials",
    str_detect(use, "Financial analysis") ~ "Financial analysis",
    is.na(use) ~ "NA",
    TRUE ~ "Other"
  ),
  use_new = ifelse(is.na(use), NA, use_new),
  use_other = ifelse(use_new == "Other", use, NA)
  )
    
rsurvey_use_long %>%
  distinct(use_new) %>% # find only unqiue values
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    13

rsurvey_use_wide <- rsurvey_use_long %>%
  group_by(id) %>%
  count(use_new) %>%
  ungroup %>%
  pivot_wider(names_from = use_new, values_from = n, 
              values_fill = list(n=0)) %>%
  janitor::clean_names()

We can check if this worked if the number of observations in rsurvey is the same as our wide data:

# check if the are the same
# == means "equal to"
count(rsurvey) == count(rsurvey_use_wide)

##         n
## [1,] TRUE

What does the code mean?

rsurvey_use_wide <- - create a new data object (for demonstration / testing purposes) called “rsurvey_use_wide”

rsurvey_use_long %>% - use the data from “rsurvey_use_long” and

group_by(id) %>% - group it by each individual’s id and

add_count(use_new) %>% - count each use per individual - this will create a column of 1’s, indicating they checked the item (recall that we deleted unchecked items)

ungroup %>% - ungroup the data

pivot_wider( - pivot the data wider

names_from = use, - take the names of the new rows from the “use” column

values_from = n, - take the values (all 1s) from the “n” column

values_fill = list(n=0)) %>% - create 0s for any columns that do not have a 1

values_fill allows you to fill empty values

list(n=0) - states to make empty values from n into zeroes

janitor::clean_names() - use the janitor package to make the names lowercase and underlined for ease of typing later

Now that we have demonstrated how to fix the check-all questions, we can simplify the steps by combining pivot_longer() and pivot_wider() in the same script:

rsurvey_check_all <- rsurvey %>%
  rownames_to_column("id") %>%
  select(id, use_1:use_11) %>%
  pivot_longer(use_1:use_11, names_to = "use_number", values_to="use") %>%
  mutate(use = ifelse(use == "", NA, use),
         use = str_trim(use),
         use_new = case_when(
    str_detect(use, "Statistical analysis") ~ "Statistical analysis",
    str_detect(use, "Data transformation") ~ "Data transformation",
    str_detect(use, "Modeling") ~ "Modeling",
    str_detect(use, "Visualization") ~ "Visualization",
    str_detect(use, "Machine learning") ~ "Machine learning", 
    str_detect(use, "Text processing") ~ "Text processing",
    str_detect(use, "Genomics") ~ "Genomics",
    str_detect(use, "Medicine") ~ "Medicine",
    str_detect(use, "Survey analysis") ~ "Survey analysis",
    str_detect(use, "Clinical trials") ~ "Clinical trials",
    str_detect(use, "Financial analysis") ~ "Financial analysis",
    is.na(use) ~ "NA",
    TRUE ~ "Other"
  ),
  use_new = ifelse(is.na(use), NA, use_new),
  use_other = ifelse(use_new == "Other", use, NA)
  ) %>%
  group_by(id) %>%
  count(use_new) %>%
  ungroup %>%
  pivot_wider(names_from = use_new, values_from = n, 
              values_fill = list(n=0)) %>%
  janitor::clean_names()
    
head(rsurvey_check_all, n=10) %>%
  kable()

id	data_transformation	machine_learning	modeling	statistical_analysis	text_processing	visualization	na	medicine	financial_analysis	survey_analysis
1	1	1	1	1	1	1	5	0	0	0
10	1	0	1	1	0	1	7	0	0	0
100	1	0	1	1	1	1	6	0	0	0
1000	1	0	0	1	0	0	8	1	0	0
1001	1	1	1	1	0	1	6	0	0	0
1002	1	0	0	1	0	1	7	0	1	0
1003	1	0	0	0	1	1	8	0	0	0
1004	1	0	1	1	0	1	6	0	0	1
1005	1	0	1	1	0	1	6	0	0	1
1006	1	0	0	1	0	1	8	0	0	0

3.4 Joining multiple data objects

This section will explain the different types of joins you can use in R to connect data objects together. Joins are kinds of merges where you add columns and/or rows from one data frame to another data frame by matching them on some variable or sets of variables.

There are many different types of joins, including left joins, right joins, inner joins, semi joins, full joins, and anti joins. See https://dplyr.tidyverse.org/reference/join.html for more information.

3.4.1 Left Join

Left joins typically connect rows in a second data frame to matching rows in the first data frame.

Here is a simple example.

Main data frame

## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles

Second data frame

band_instruments

## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

We can connect the second data frame to the main (first) data frame by joining on the column they have in common, “name”. This will create a single, merged data frame. Any unmatched data in the second data frame will not be in the merged data frame.

band_members %>% left_join(band_instruments)

## Joining, by = "name"

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

3.4.2 Left Joining our Survey Data

We have created a data frame called “rsurvey_check_all,” which contains all our check-all questions and whether someone checked it (indicated by 1) or did not check it (indicated by 0). We can combined this with our original “rsurvey” data like this:

rsurvey_joined <- rsurvey %>% 
  rownames_to_column("id") %>%
  left_join(rsurvey_check_all)

What does the code mean?

rsurvey_joined <- - save to a new data frame

Note: this new data frame is created for demonstration purposes so you can compare the original to the joined data object. In reality, we would simply overwrite our existing data object like this: rsurvey <-

rsurvey %>% - use the “rsurvey” data and…

rownames_to_column("id") %>% - make a column called “id” that is simply the number of each row. This creates the column we will join by (“rsurvey_check_all” already has this)

left_join(rsurvey_check_all) - join “rsurvey” with “rsurvey_check_all”.

Note: because both data frames have an “id” column, R automatically joins them by this column. You can also specify:

left_join(rsurvey_check_all, by = "id) or

left_join(df_name, by = c("col1" = "col2"))if the columns have different names

To complete the rsurvey cleaning process, let’s also join the “other” column from the check-all data:

rsurvey<- rsurvey %>%
  left_join(rsurvey_use_long %>%
              select(id, use_other) %>%
              drop_na(use_other))

## Joining, by = "id"