3 More Data Manipulation in R
This chapter extends some of the data cleaning and data wrangling skills from Chapter 2. It uses survey data from the RStudio Learning R Survey. You can access this data, which includes all transformations and cleanings from Chapter 2, here.
You can load the data as follows:
3.1 Wide and Long Data
Data typically comes in two formats: wide and long.
A wide dataset has each individual’s data in separate columns. It is typically easy for humans to read. Here is a simple example:
country | avgtemp.1994 | avgtemp.1995 | avgtemp.1996 |
---|---|---|---|
Sweden | 6 | 5 | 7 |
Denmark | 6 | 8 | 8 |
Norway | 3 | 11 | 7 |
A long dataset, each variable has its own column, and each observation has its own row. In the wide dataset above, country, year, and temperature are three variables. In a long dataset, it would look like this:
country | year | avgtemp |
---|---|---|
Sweden | 1994 | 6 |
Denmark | 1994 | 6 |
Norway | 1994 | 3 |
Sweden | 1995 | 5 |
Denmark | 1995 | 8 |
Norway | 1995 | 11 |
Sweden | 1996 | 7 |
Denmark | 1996 | 8 |
Norway | 1996 | 7 |
Long data is often referred to as tabular data and is more machine-readable than human-readable. It is also very similar to Tidy data.
3.2 Pivoting data from wide to long - pivot_longer()
We can easily transform data from wide to long with the pivot_longer()
function from the Tidyverse
. Here is a simple example:
library(tidyverse)
wide_example %>%
pivot_longer(avgtemp.1994:avgtemp.1996, names_to = "year", values_to = "avg.temp")
## # A tibble: 9 x 3
## country year avg.temp
## <chr> <chr> <dbl>
## 1 Sweden avgtemp.1994 6
## 2 Sweden avgtemp.1995 5
## 3 Sweden avgtemp.1996 7
## 4 Denmark avgtemp.1994 6
## 5 Denmark avgtemp.1995 8
## 6 Denmark avgtemp.1996 8
## 7 Norway avgtemp.1994 3
## 8 Norway avgtemp.1995 11
## 9 Norway avgtemp.1996 7
What does the code mean?
wide_example %>%
- use the wide example data andpivot_longer(
- make it longavgtemp.1994:avgtemp.1996,
- the columns we want to change from wide to longnames_to = "year",
- take the column names from the wide data and put them into a column called “year”values_to = "avg.temp")
- take the values from the wide data and put them into a column called “avg.temp”
3.2.1 Pivoting survey data
Recall that we had a check-all question in our survey that was a little bit messy. It is a good example of wide data:
## use_1 use_2 use_3
## 1 Statistical analysis Data transformation Modeling
## 2 Statistical analysis Data transformation Visualization
## 3 Statistical analysis Data transformation Visualization
## 4 Data transformation <NA> <NA>
## 5 Statistical analysis Data transformation Modeling
## 6 Statistical analysis Data transformation Modeling
## 7 Statistical analysis Data transformation Modeling
## 8 Statistical analysis Data transformation Modeling
## use_4 use_5 use_6
## 1 Visualization Machine learning Text processing
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 Visualization <NA> <NA>
## 6 Visualization Machine learning Text processing
## 7 Visualization <NA> <NA>
## 8 Visualization Machine learning Text processing
## use_7 use_8 use_9 use_10 use_11
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## 7 <NA> <NA> <NA> <NA> <NA>
## 8 Survey analysis <NA> <NA> <NA> <NA>
You will notice each column has different data, and there are many NA
s. This means some checked 1 or 2 items and therefore have 10 or 11 NA
s while others checked more items and have less NA
s. This is a perfect example of data that we can pivot from wide to long and better organize it.
rsurvey_use_long <- rsurvey %>%
rownames_to_column("id") %>%
select(id, use_1:use_11) %>%
pivot_longer(use_1:use_11, names_to = "use_number", values_to="use") %>%
mutate(use = ifelse(use == "", NA, use),
use = str_trim(use))
rsurvey_use_long %>% head(n=10)
## # A tibble: 10 x 3
## id use_number use
## <chr> <chr> <chr>
## 1 1 use_1 Statistical analysis
## 2 1 use_2 Data transformation
## 3 1 use_3 Modeling
## 4 1 use_4 Visualization
## 5 1 use_5 Machine learning
## 6 1 use_6 Text processing
## 7 1 use_7 <NA>
## 8 1 use_8 <NA>
## 9 1 use_9 <NA>
## 10 1 use_10 <NA>
What does the code mean?
rsurvey_use_long <-
- For this example, we will pivot the data into an object called “rsurvey_use_long”rsurvey %>%
- use the rsurvey data andrownames_to_column("id") %>%
- In the wide dataset, each row is an individual, but we do not have a column of individual ids. This function >will make a column of row numbers called “id”. The first row will be >“1”, the second row will be “2”, etc. Now, we have ids for each individual in the dataset. We will use this later/select(id, use_1:use_11) %>%
- keep only the id column and the columns to pivot. We are removing them for this example, but you wouldn’t do this if you wanted to keep the data in the “rsurvey” data object.pivot_longer(
- change (pivot) it from wide to longuse_1:use_11,
- select the columns to pivotnames_to = "use_number"
, - take the multiple column names and put them in a single column called “use_number”values_to="use") %>%
- take the values and put them in a column called “use”mutate(use = ifelse(use == "", NA, use))
- make any empty textNA
(some will beNA
and others will just be "", which is treated as a character space)drop_na(use)
- drop anyNA
values in the “use” column
We could use this data to visualize and analyze, or we could include it back into the same format as the dataset by making it wide, with each column a dichotomous variable for each use, with the values 1
for “checked” and 0
for “not checked”. We will do that in the next section.
3.3 Pivoting data from long to wide - pivot_wider()
We can easily transform data from long to wide with the pivot_wider()
function from the Tidyverse
. Here is a simple example:
## # A tibble: 3 x 4
## country `1994` `1995` `1996`
## <chr> <dbl> <dbl> <dbl>
## 1 Sweden 6 5 7
## 2 Denmark 6 8 8
## 3 Norway 3 11 7
What does the code mean?
long_example %>%
- use the wide example data andpivot_wide(
- make it longnames_from = year,
- choose the column from which our names should come fromvalues_from = avgtemp)
- choose the values for each new row
Before we transform the data, we need to note that there is a problem. If we count the unique values in the dataset, we will find there are 158 different options whereas, according to the survey, there should be 11 + an “other” category.
## # A tibble: 1 x 1
## n
## <int>
## 1 159
This will cause a big problem if we pivot it. It will cause 158 additional columns to be created. The easiest way to deal with this issue is to use only the original categories and then create an other category. Here is what we can do:
rsurvey_use_long <- rsurvey_use_long %>%
mutate(use_new = case_when(
str_detect(use, "Statistical analysis") ~ "Statistical analysis",
str_detect(use, "Data transformation") ~ "Data transformation",
str_detect(use, "Modeling") ~ "Modeling",
str_detect(use, "Visualization") ~ "Visualization",
str_detect(use, "Machine learning") ~ "Machine learning",
str_detect(use, "Text processing") ~ "Text processing",
str_detect(use, "Genomics") ~ "Genomics",
str_detect(use, "Medicine") ~ "Medicine",
str_detect(use, "Survey analysis") ~ "Survey analysis",
str_detect(use, "Clinical trials") ~ "Clinical trials",
str_detect(use, "Financial analysis") ~ "Financial analysis",
is.na(use) ~ "NA",
TRUE ~ "Other"
),
use_new = ifelse(is.na(use), NA, use_new),
use_other = ifelse(use_new == "Other", use, NA)
)
rsurvey_use_long %>%
distinct(use_new) %>% # find only unqiue values
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 13
rsurvey_use_wide <- rsurvey_use_long %>%
group_by(id) %>%
count(use_new) %>%
ungroup %>%
pivot_wider(names_from = use_new, values_from = n,
values_fill = list(n=0)) %>%
janitor::clean_names()
We can check if this worked if the number of observations in rsurvey is the same as our wide data:
## n
## [1,] TRUE
What does the code mean?
rsurvey_use_wide <-
- create a new data object (for demonstration / testing purposes) called “rsurvey_use_wide”rsurvey_use_long %>%
- use the data from “rsurvey_use_long” andgroup_by(id) %>%
- group it by each individual’s id andadd_count(use_new) %>%
- count each use per individual - this will create a column of 1’s, indicating they checked the item (recall that we deleted unchecked items)ungroup %>%
- ungroup the datapivot_wider(
- pivot the data widernames_from = use,
- take the names of the new rows from the “use” columnvalues_from = n,
- take the values (all 1s) from the “n” columnvalues_fill = list(n=0)) %>%
- create 0s for any columns that do not have a 1
values_fill
allows you to fill empty valueslist(n=0)
- states to make empty values from n into zeroesjanitor::clean_names()
- use the janitor package to make the names lowercase and underlined for ease of typing later
Now that we have demonstrated how to fix the check-all questions, we can simplify the steps by combining pivot_longer()
and pivot_wider()
in the same script:
rsurvey_check_all <- rsurvey %>%
rownames_to_column("id") %>%
select(id, use_1:use_11) %>%
pivot_longer(use_1:use_11, names_to = "use_number", values_to="use") %>%
mutate(use = ifelse(use == "", NA, use),
use = str_trim(use),
use_new = case_when(
str_detect(use, "Statistical analysis") ~ "Statistical analysis",
str_detect(use, "Data transformation") ~ "Data transformation",
str_detect(use, "Modeling") ~ "Modeling",
str_detect(use, "Visualization") ~ "Visualization",
str_detect(use, "Machine learning") ~ "Machine learning",
str_detect(use, "Text processing") ~ "Text processing",
str_detect(use, "Genomics") ~ "Genomics",
str_detect(use, "Medicine") ~ "Medicine",
str_detect(use, "Survey analysis") ~ "Survey analysis",
str_detect(use, "Clinical trials") ~ "Clinical trials",
str_detect(use, "Financial analysis") ~ "Financial analysis",
is.na(use) ~ "NA",
TRUE ~ "Other"
),
use_new = ifelse(is.na(use), NA, use_new),
use_other = ifelse(use_new == "Other", use, NA)
) %>%
group_by(id) %>%
count(use_new) %>%
ungroup %>%
pivot_wider(names_from = use_new, values_from = n,
values_fill = list(n=0)) %>%
janitor::clean_names()
head(rsurvey_check_all, n=10) %>%
kable()
id | data_transformation | machine_learning | modeling | statistical_analysis | text_processing | visualization | na | medicine | financial_analysis | survey_analysis | clinical_trials | genomics | other |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 1 | 0 | 1 | 1 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
100 | 1 | 0 | 1 | 1 | 1 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
1000 | 1 | 0 | 0 | 1 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 |
1001 | 1 | 1 | 1 | 1 | 0 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
1002 | 1 | 0 | 0 | 1 | 0 | 1 | 7 | 0 | 1 | 0 | 0 | 0 | 0 |
1003 | 1 | 0 | 0 | 0 | 1 | 1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
1004 | 1 | 0 | 1 | 1 | 0 | 1 | 6 | 0 | 0 | 1 | 0 | 0 | 0 |
1005 | 1 | 0 | 1 | 1 | 0 | 1 | 6 | 0 | 0 | 1 | 0 | 0 | 0 |
1006 | 1 | 0 | 0 | 1 | 0 | 1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
3.4 Joining multiple data objects
This section will explain the different types of joins you can use in R to connect data objects together. Joins are kinds of merges where you add columns and/or rows from one data frame to another data frame by matching them on some variable or sets of variables.
There are many different types of joins, including left joins, right joins, inner joins, semi joins, full joins, and anti joins. See https://dplyr.tidyverse.org/reference/join.html for more information.
3.4.1 Left Join
Left joins typically connect rows in a second data frame to matching rows in the first data frame.
Here is a simple example.
Main data frame
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
Second data frame
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
We can connect the second data frame to the main (first) data frame by joining on the column they have in common, “name”. This will create a single, merged data frame. Any unmatched data in the second data frame will not be in the merged data frame.
## Joining, by = "name"
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
3.4.2 Left Joining our Survey Data
We have created a data frame called “rsurvey_check_all,” which contains all our check-all questions and whether someone checked it (indicated by 1) or did not check it (indicated by 0). We can combined this with our original “rsurvey” data like this:
What does the code mean?
rsurvey_joined <-
- save to a new data frame
- Note: this new data frame is created for demonstration purposes so you can compare the original to the joined data object. In reality, we would simply overwrite our existing data object like this:
rsurvey <-
rsurvey %>%
- use the “rsurvey” data and…rownames_to_column("id") %>%
- make a column called “id” that is simply the number of each row. This creates the column we will join by (“rsurvey_check_all” already has this)left_join(rsurvey_check_all)
- join “rsurvey” with “rsurvey_check_all”.
- Note: because both data frames have an “id” column,
R
automatically joins them by this column. You can also specify:
left_join(rsurvey_check_all, by = "id)
orleft_join(df_name, by = c("col1" = "col2"))
if the columns have different names
To complete the rsurvey cleaning process, let’s also join the “other” column from the check-all data:
## Joining, by = "id"