3 More Data Manipulation in R

This chapter extends some of the data cleaning and data wrangling skills from Chapter 2. It uses survey data from the RStudio Learning R Survey. You can access this data, which includes all transformations and cleanings from Chapter 2, here.

You can load the data as follows:

3.1 Wide and Long Data

Data typically comes in two formats: wide and long.

A wide dataset has each individual’s data in separate columns. It is typically easy for humans to read. Here is a simple example:

country avgtemp.1994 avgtemp.1995 avgtemp.1996
Sweden 6 5 7
Denmark 6 8 8
Norway 3 11 7

A long dataset, each variable has its own column, and each observation has its own row. In the wide dataset above, country, year, and temperature are three variables. In a long dataset, it would look like this:

country year avgtemp
Sweden 1994 6
Denmark 1994 6
Norway 1994 3
Sweden 1995 5
Denmark 1995 8
Norway 1995 11
Sweden 1996 7
Denmark 1996 8
Norway 1996 7

Long data is often referred to as tabular data and is more machine-readable than human-readable. It is also very similar to Tidy data.

3.2 Pivoting data from wide to long - pivot_longer()

We can easily transform data from wide to long with the pivot_longer() function from the Tidyverse. Here is a simple example:

## # A tibble: 9 x 3
##   country year         avg.temp
##   <chr>   <chr>           <dbl>
## 1 Sweden  avgtemp.1994        6
## 2 Sweden  avgtemp.1995        5
## 3 Sweden  avgtemp.1996        7
## 4 Denmark avgtemp.1994        6
## 5 Denmark avgtemp.1995        8
## 6 Denmark avgtemp.1996        8
## 7 Norway  avgtemp.1994        3
## 8 Norway  avgtemp.1995       11
## 9 Norway  avgtemp.1996        7

What does the code mean?

  • wide_example %>% - use the wide example data and
  • pivot_longer( - make it long
  • avgtemp.1994:avgtemp.1996, - the columns we want to change from wide to long
  • names_to = "year", - take the column names from the wide data and put them into a column called “year”
  • values_to = "avg.temp") - take the values from the wide data and put them into a column called “avg.temp”

3.2.1 Pivoting survey data

Recall that we had a check-all question in our survey that was a little bit messy. It is a good example of wide data:

##                  use_1                use_2          use_3
## 1 Statistical analysis  Data transformation       Modeling
## 2 Statistical analysis  Data transformation  Visualization
## 3 Statistical analysis  Data transformation  Visualization
## 4  Data transformation                 <NA>           <NA>
## 5 Statistical analysis  Data transformation       Modeling
## 6 Statistical analysis  Data transformation       Modeling
## 7 Statistical analysis  Data transformation       Modeling
## 8 Statistical analysis  Data transformation       Modeling
##            use_4             use_5            use_6
## 1  Visualization  Machine learning  Text processing
## 2           <NA>              <NA>             <NA>
## 3           <NA>              <NA>             <NA>
## 4           <NA>              <NA>             <NA>
## 5  Visualization              <NA>             <NA>
## 6  Visualization  Machine learning  Text processing
## 7  Visualization              <NA>             <NA>
## 8  Visualization  Machine learning  Text processing
##              use_7 use_8 use_9 use_10 use_11
## 1             <NA>  <NA>  <NA>   <NA>   <NA>
## 2             <NA>  <NA>  <NA>   <NA>   <NA>
## 3             <NA>  <NA>  <NA>   <NA>   <NA>
## 4             <NA>  <NA>  <NA>   <NA>   <NA>
## 5             <NA>  <NA>  <NA>   <NA>   <NA>
## 6             <NA>  <NA>  <NA>   <NA>   <NA>
## 7             <NA>  <NA>  <NA>   <NA>   <NA>
## 8  Survey analysis  <NA>  <NA>   <NA>   <NA>

You will notice each column has different data, and there are many NAs. This means some checked 1 or 2 items and therefore have 10 or 11 NAs while others checked more items and have less NAs. This is a perfect example of data that we can pivot from wide to long and better organize it.

## # A tibble: 10 x 3
##    id    use_number use                 
##    <chr> <chr>      <chr>               
##  1 1     use_1      Statistical analysis
##  2 1     use_2      Data transformation 
##  3 1     use_3      Modeling            
##  4 1     use_4      Visualization       
##  5 1     use_5      Machine learning    
##  6 1     use_6      Text processing     
##  7 1     use_7      <NA>                
##  8 1     use_8      <NA>                
##  9 1     use_9      <NA>                
## 10 1     use_10     <NA>

What does the code mean?

  • rsurvey_use_long <- - For this example, we will pivot the data into an object called “rsurvey_use_long”
  • rsurvey %>% - use the rsurvey data and
  • rownames_to_column("id") %>% - In the wide dataset, each row is an individual, but we do not have a column of individual ids. This function >will make a column of row numbers called “id”. The first row will be >“1”, the second row will be “2”, etc. Now, we have ids for each individual in the dataset. We will use this later/
  • select(id, use_1:use_11) %>% - keep only the id column and the columns to pivot. We are removing them for this example, but you wouldn’t do this if you wanted to keep the data in the “rsurvey” data object.
  • pivot_longer( - change (pivot) it from wide to long
  • use_1:use_11, - select the columns to pivot
  • names_to = "use_number", - take the multiple column names and put them in a single column called “use_number”
  • values_to="use") %>% - take the values and put them in a column called “use”
  • mutate(use = ifelse(use == "", NA, use)) - make any empty text NA (some will be NA and others will just be "", which is treated as a character space)
  • drop_na(use) - drop any NA values in the “use” column

We could use this data to visualize and analyze, or we could include it back into the same format as the dataset by making it wide, with each column a dichotomous variable for each use, with the values 1 for “checked” and 0 for “not checked”. We will do that in the next section.

3.3 Pivoting data from long to wide - pivot_wider()

We can easily transform data from long to wide with the pivot_wider() function from the Tidyverse. Here is a simple example:

## # A tibble: 3 x 4
##   country `1994` `1995` `1996`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 Sweden       6      5      7
## 2 Denmark      6      8      8
## 3 Norway       3     11      7

What does the code mean?

  • long_example %>% - use the wide example data and
  • pivot_wide( - make it long
  • names_from = year, - choose the column from which our names should come from
  • values_from = avgtemp) - choose the values for each new row

Before we transform the data, we need to note that there is a problem. If we count the unique values in the dataset, we will find there are 158 different options whereas, according to the survey, there should be 11 + an “other” category.

## # A tibble: 1 x 1
##       n
##   <int>
## 1   159

This will cause a big problem if we pivot it. It will cause 158 additional columns to be created. The easiest way to deal with this issue is to use only the original categories and then create an other category. Here is what we can do:

## # A tibble: 1 x 1
##       n
##   <int>
## 1    13

We can check if this worked if the number of observations in rsurvey is the same as our wide data:

##         n
## [1,] TRUE

What does the code mean?

  • rsurvey_use_wide <- - create a new data object (for demonstration / testing purposes) called “rsurvey_use_wide”
  • rsurvey_use_long %>% - use the data from “rsurvey_use_long” and
  • group_by(id) %>% - group it by each individual’s id and
  • add_count(use_new) %>% - count each use per individual - this will create a column of 1’s, indicating they checked the item (recall that we deleted unchecked items)
  • ungroup %>% - ungroup the data
  • pivot_wider( - pivot the data wider
  • names_from = use, - take the names of the new rows from the “use” column
  • values_from = n, - take the values (all 1s) from the “n” column
  • values_fill = list(n=0)) %>% - create 0s for any columns that do not have a 1
    • values_fill allows you to fill empty values
    • list(n=0) - states to make empty values from n into zeroes
  • janitor::clean_names() - use the janitor package to make the names lowercase and underlined for ease of typing later

Now that we have demonstrated how to fix the check-all questions, we can simplify the steps by combining pivot_longer() and pivot_wider() in the same script:

id data_transformation machine_learning modeling statistical_analysis text_processing visualization na medicine financial_analysis survey_analysis clinical_trials genomics other
1 1 1 1 1 1 1 5 0 0 0 0 0 0
10 1 0 1 1 0 1 7 0 0 0 0 0 0
100 1 0 1 1 1 1 6 0 0 0 0 0 0
1000 1 0 0 1 0 0 8 1 0 0 0 0 0
1001 1 1 1 1 0 1 6 0 0 0 0 0 0
1002 1 0 0 1 0 1 7 0 1 0 0 0 0
1003 1 0 0 0 1 1 8 0 0 0 0 0 0
1004 1 0 1 1 0 1 6 0 0 1 0 0 0
1005 1 0 1 1 0 1 6 0 0 1 0 0 0
1006 1 0 0 1 0 1 8 0 0 0 0 0 0

3.4 Joining multiple data objects

This section will explain the different types of joins you can use in R to connect data objects together. Joins are kinds of merges where you add columns and/or rows from one data frame to another data frame by matching them on some variable or sets of variables.

There are many different types of joins, including left joins, right joins, inner joins, semi joins, full joins, and anti joins. See https://dplyr.tidyverse.org/reference/join.html for more information.

3.4.1 Left Join

Left joins typically connect rows in a second data frame to matching rows in the first data frame.

Here is a simple example.

Main data frame

## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles

Second data frame

## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

We can connect the second data frame to the main (first) data frame by joining on the column they have in common, “name”. This will create a single, merged data frame. Any unmatched data in the second data frame will not be in the merged data frame.

## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

3.4.2 Left Joining our Survey Data

We have created a data frame called “rsurvey_check_all,” which contains all our check-all questions and whether someone checked it (indicated by 1) or did not check it (indicated by 0). We can combined this with our original “rsurvey” data like this:

What does the code mean?

  • rsurvey_joined <- - save to a new data frame
    • Note: this new data frame is created for demonstration purposes so you can compare the original to the joined data object. In reality, we would simply overwrite our existing data object like this: rsurvey <-
  • rsurvey %>% - use the “rsurvey” data and…
  • rownames_to_column("id") %>% - make a column called “id” that is simply the number of each row. This creates the column we will join by (“rsurvey_check_all” already has this)
  • left_join(rsurvey_check_all) - join “rsurvey” with “rsurvey_check_all”.
    • Note: because both data frames have an “id” column, R automatically joins them by this column. You can also specify:
      • left_join(rsurvey_check_all, by = "id) or
      • left_join(df_name, by = c("col1" = "col2"))if the columns have different names

To complete the rsurvey cleaning process, let’s also join the “other” column from the check-all data:

## Joining, by = "id"