21 Data Wrangling with tidyverse

This unit was mostly written by Olha Guley.

Data wrangling is the process of transforming raw data into a format suitable for analysis. It’s an important step in data science, as study data rarely arrives in a format perfectly aligned with the needs of your specific analysis. For example, you might need to filter specific rows or columns, exclude participants who failed attention checks, compute summary statistics for certain groups, create new variables based on existing ones, or reorder rows. Each of these tasks falls under the umbrella of data wrangling. In this section, we will learn basic data wrangling techniques using the tidyverse package.

For this exercise, we will use the Counties Statistics data set, which covers multiple variables spanning demographics, economics, environment, healthcare, and education statistics across countries worldwide. The data set includes observations for each country over 15 years (from 2000 to 2015), which gives us many possibilities for data wrangling.

Let’s go ahead and load the tidyverse package and the data. The data set is available via Daniel Nettle’s website using the URL shown below.

library(tidyverse)
countries_data <- read_csv("https://www.danielnettle.eu/wp-content/uploads/2024/12/countries_statistics.csv")

Let’s take a quick look at the data set:

head(countries_data)
## # A tibble: 6 × 17
##   Country  Year Continent `Least Developed`
##   <chr>   <dbl> <chr>     <lgl>            
## 1 Albania  2000 Europe    FALSE            
## 2 Albania  2001 Europe    FALSE            
## 3 Albania  2002 Europe    FALSE            
## 4 Albania  2003 Europe    FALSE            
## 5 Albania  2004 Europe    FALSE            
## 6 Albania  2005 Europe    FALSE            
## # ℹ 13 more variables: `Life Expectancy` <dbl>,
## #   Population <dbl>, `CO2 emissions` <dbl>,
## #   `Health expenditure` <dbl>,
## #   `Electric power consumption` <dbl>,
## #   `Forest area` <dbl>, `GDP per capita` <dbl>,
## #   `Individuals using the Internet` <dbl>,
## #   `Military expenditure` <dbl>, …

21.1 Making new variables with mutate()

We covered this in an earlier unit, but repeat here for completeness. You make a new variable from the existing ones with the function mutate(). For example, the data set includes GDP per capita, and also the number of people in the population. You might want to have a variable representing the total GDP of the country, which is the first of those numbers multiplied by the second. You would make that variable as follows:

countries_data <- countries_data %>% 
  mutate(`GDP total` = Population * `GDP per capita`)

21.2 Recoding variables with case_when()

When you have categorical variables, sometimes you want to adjust their categorization. For example, here we have a variable Continent, in which North America and South America are considered as separate continents. But perhaps you want to have a different variable, stating whether the country is in the Americas or elsewhere. You could make the variable you want from the current one using case_when() in conjunction with mutate(), as follows:

countries_data_revised <- countries_data %>%
  mutate(Americas = case_when(Continent == "Africa" ~ "Elsewhere",
                              Continent == "Asia" ~ "Elsewhere",
                              Continent == "Europe" ~ "Elsewhere",
                              Continent == "North America" ~ "Americas",
                              Continent == "Oceania" ~ "Elsewhere",
                              Continent == "South America" ~ "Americas"))

21.3 Using the group_by() function for more detailed data summaries

Throughout this course, we have frequently used the summarize() function to compute summaries of our data, such as means and standard deviations. Often, you will want to calculate descriptive statistics separately for each condition, group, or demographic variable (e.g., genders or regions). The group_by() function is helpful in these cases:

summary_d <- countries_data %>%
  group_by(Continent) %>%
  summarize(
   mean_population = mean(Population, na.rm = TRUE),
   mean_beer_consumption_pp = mean(`Beer consumption per capita`, na.rm = TRUE)
  )
  
summary_d
## # A tibble: 6 × 3
##   Continent     mean_population mean_beer_consumption_pp
##   <chr>                   <dbl>                    <dbl>
## 1 Africa              28169437.                    1.16 
## 2 Asia               120914074.                    0.516
## 3 Europe              19153988.                    3.95 
## 4 North America       44604449.                    2.39 
## 5 Oceania             12757605.                    4.18 
## 6 South America       37976184.                    2.89

21.4 The filter() function

Suppose we are only interested in analyzing data from the least developed countries. We can filter these observations using the filter() function, which subsets rows based on specified conditions:

d <- countries_data %>%
  filter(`Least Developed` == "TRUE")
head(d)
## # A tibble: 6 × 18
##   Country  Year Continent `Least Developed`
##   <chr>   <dbl> <chr>     <lgl>            
## 1 Angola   2000 Africa    TRUE             
## 2 Angola   2001 Africa    TRUE             
## 3 Angola   2002 Africa    TRUE             
## 4 Angola   2003 Africa    TRUE             
## 5 Angola   2004 Africa    TRUE             
## 6 Angola   2005 Africa    TRUE             
## # ℹ 14 more variables: `Life Expectancy` <dbl>,
## #   Population <dbl>, `CO2 emissions` <dbl>,
## #   `Health expenditure` <dbl>,
## #   `Electric power consumption` <dbl>,
## #   `Forest area` <dbl>, `GDP per capita` <dbl>,
## #   `Individuals using the Internet` <dbl>,
## #   `Military expenditure` <dbl>, …

We can also filter rows based on specific numeric criteria:

b <- countries_data %>%
  filter(`GDP per capita` < 1000, `Beer consumption per capita` > 0.2)

We can also use this function to remove the rows with missing values from our data set, either for a specific variable, or for any variable, as in the example below.

filtered_data <- countries_data %>% filter(!if_any(everything(), is.na))

Note that if_any() checks if there are any values in the data set that meet the condition, in this case the condition of being missing (is.na()). Hence, !if_any(everything(), is.na) keeps only rows where no NA values are found in any variable, given that ! means ‘not’.

21.5 Reshaping data: long to wide and wide to long

Let’s take a second look at the data set:

head(countries_data, 30)
## # A tibble: 30 × 18
##    Country  Year Continent `Least Developed`
##    <chr>   <dbl> <chr>     <lgl>            
##  1 Albania  2000 Europe    FALSE            
##  2 Albania  2001 Europe    FALSE            
##  3 Albania  2002 Europe    FALSE            
##  4 Albania  2003 Europe    FALSE            
##  5 Albania  2004 Europe    FALSE            
##  6 Albania  2005 Europe    FALSE            
##  7 Albania  2006 Europe    FALSE            
##  8 Albania  2007 Europe    FALSE            
##  9 Albania  2008 Europe    FALSE            
## 10 Albania  2009 Europe    FALSE            
## # ℹ 20 more rows
## # ℹ 14 more variables: `Life Expectancy` <dbl>,
## #   Population <dbl>, `CO2 emissions` <dbl>,
## #   `Health expenditure` <dbl>,
## #   `Electric power consumption` <dbl>,
## #   `Forest area` <dbl>, `GDP per capita` <dbl>,
## #   `Individuals using the Internet` <dbl>, …

We can say that this data is in the long format, because each row represents a single observation for a specific country and year.

Imagine, you wanted a new data set, where one row represents one country, and the year-related data will be split in different columns instead of the rows. We can do this transformation using R function pivot_wider().

wide_data <- countries_data %>%
  pivot_wider(
    names_from = Year,
  #we will exclude from the new columns the variables that do not change in the dataset
    values_from = -c(Country, Year, Continent)
  )

head(wide_data)
## # A tibble: 6 × 242
##   Country   Continent     `Least Developed_2000`
##   <chr>     <chr>         <lgl>                 
## 1 Albania   Europe        FALSE                 
## 2 Algeria   Africa        FALSE                 
## 3 Angola    Africa        TRUE                  
## 4 Argentina South America FALSE                 
## 5 Armenia   Asia          FALSE                 
## 6 Australia Oceania       FALSE                 
## # ℹ 239 more variables: `Least Developed_2001` <lgl>,
## #   `Least Developed_2002` <lgl>,
## #   `Least Developed_2003` <lgl>,
## #   `Least Developed_2004` <lgl>,
## #   `Least Developed_2005` <lgl>,
## #   `Least Developed_2006` <lgl>,
## #   `Least Developed_2007` <lgl>, …

Let’s transform it back to the long format with pivot_longer() function. You are likely to use this type of transformation a lot, since many survey and behavioral experiments platforms produce wide data sets that are not suitable for the analysis.

long_data <- wide_data %>%
  pivot_longer(
    #Let's keep the country and continent as identifiers
    cols = -c(Country, Continent), 
     #Let's split column names back into variable name and year
    names_to = c(".value", "Year"),
    #Let's split names at the underscore 
    names_sep = "_" 
  )

head(long_data)
## # A tibble: 6 × 18
##   Country Continent Year  `Least Developed`
##   <chr>   <chr>     <chr> <lgl>            
## 1 Albania Europe    2000  FALSE            
## 2 Albania Europe    2001  FALSE            
## 3 Albania Europe    2002  FALSE            
## 4 Albania Europe    2003  FALSE            
## 5 Albania Europe    2004  FALSE            
## 6 Albania Europe    2005  FALSE            
## # ℹ 14 more variables: `Life Expectancy` <dbl>,
## #   Population <dbl>, `CO2 emissions` <dbl>,
## #   `Health expenditure` <dbl>,
## #   `Electric power consumption` <dbl>,
## #   `Forest area` <dbl>, `GDP per capita` <dbl>,
## #   `Individuals using the Internet` <dbl>,
## #   `Military expenditure` <dbl>, …

21.6 Merging several data sets together

Suppose you conducted several trials with the same participants and have separate data sets for each trial. However, you want to perform one analysis on all the data you collected across your trials. To be able to analyze the combined data, you can merge the data sets in R using left_join() or right_join() function. These functions require a key, that is, a variable that appears in both, so that the correct rows can be matched with the correct rows. In the example below, the combination of the variables Country and Year serves as the key.

Let’s see how it works:

x <- d %>% select(Country, Year,`Life Expectancy`, `Population`)
y <- d %>% select(Country, Year,`CO2 emissions`, `GDP per capita`)

merged_d <- x %>% 
  left_join(y,by = c("Country", "Year"))  

head(merged_d)
## # A tibble: 6 × 6
##   Country  Year `Life Expectancy` Population
##   <chr>   <dbl>             <dbl>      <dbl>
## 1 Angola   2000              46.5   16395477
## 2 Angola   2001              47.1   16945753
## 3 Angola   2002              47.7   17519418
## 4 Angola   2003              48.4   18121477
## 5 Angola   2004              49.3   18758138
## 6 Angola   2005              50.2   19433604
## # ℹ 2 more variables: `CO2 emissions` <dbl>,
## #   `GDP per capita` <dbl>

The difference between left_join() and right_join() is which of the two data frames you are joining provides the rows. left_join() will give you all the rows that are in the first data frame, with added variables from the second. right_join() gives you all the rows from the second data frame, with added variables from the first. full_join() would give you a data frame with all the rows from both data frames.

21.7 Summary

In this section, we covered the following data wrangling features:

  • Making new variables with mutate(), in conjunction with case_when() where appropriate.

  • Grouping and summarizing using group_by() and summarize() for descriptive statistics.

  • Filtering and subsetting rows based on conditions (filter()), removing missing values, or selecting specific numeric ranges.

  • Reshaping data from long format to wide with pivot_wider() and transforming wide-format data to long with pivot_longer().

  • Merging multiple data sets using left_join().