21 Data Wrangling with tidyverse
This unit was mostly written by Olha Guley.
Data wrangling is the process of transforming raw data into a format suitable for analysis. It’s an important step in data science, as study data rarely arrives in a format perfectly aligned with the needs of your specific analysis. For example, you might need to filter specific rows or columns, exclude participants who failed attention checks, compute summary statistics for certain groups, create new variables based on existing ones, or reorder rows. Each of these tasks falls under the umbrella of data wrangling. In this section, we will learn basic data wrangling techniques using the tidyverse package.
For this exercise, we will use the Counties Statistics data set, which covers multiple variables spanning demographics, economics, environment, healthcare, and education statistics across countries worldwide. The data set includes observations for each country over 15 years (from 2000 to 2015), which gives us many possibilities for data wrangling.
Let’s go ahead and load the tidyverse
package and the data. The data set is available via Daniel Nettle’s website using the URL shown below.
library(tidyverse)
countries_data <- read_csv("https://www.danielnettle.eu/wp-content/uploads/2024/12/countries_statistics.csv")
Let’s take a quick look at the data set:
## # A tibble: 6 × 17
## Country Year Continent `Least Developed`
## <chr> <dbl> <chr> <lgl>
## 1 Albania 2000 Europe FALSE
## 2 Albania 2001 Europe FALSE
## 3 Albania 2002 Europe FALSE
## 4 Albania 2003 Europe FALSE
## 5 Albania 2004 Europe FALSE
## 6 Albania 2005 Europe FALSE
## # ℹ 13 more variables: `Life Expectancy` <dbl>,
## # Population <dbl>, `CO2 emissions` <dbl>,
## # `Health expenditure` <dbl>,
## # `Electric power consumption` <dbl>,
## # `Forest area` <dbl>, `GDP per capita` <dbl>,
## # `Individuals using the Internet` <dbl>,
## # `Military expenditure` <dbl>, …
21.1 Making new variables with mutate()
We covered this in an earlier unit, but repeat here for completeness. You make a new variable from the existing ones with the function mutate()
. For example, the data set includes GDP per capita, and also the number of people in the population. You might want to have a variable representing the total GDP of the country, which is the first of those numbers multiplied by the second. You would make that variable as follows:
21.2 Recoding variables with case_when()
When you have categorical variables, sometimes you want to adjust their categorization. For example, here we have a variable Continent
, in which North America and South America are considered as separate continents. But perhaps you want to have a different variable, stating whether the country is in the Americas or elsewhere. You could make the variable you want from the current one using case_when()
in conjunction with mutate()
, as follows:
countries_data_revised <- countries_data %>%
mutate(Americas = case_when(Continent == "Africa" ~ "Elsewhere",
Continent == "Asia" ~ "Elsewhere",
Continent == "Europe" ~ "Elsewhere",
Continent == "North America" ~ "Americas",
Continent == "Oceania" ~ "Elsewhere",
Continent == "South America" ~ "Americas"))
21.3 Using the group_by() function for more detailed data summaries
Throughout this course, we have frequently used the summarize()
function to compute summaries of our data, such as means and standard deviations. Often, you will want to calculate descriptive statistics separately for each condition, group, or demographic variable (e.g., genders or regions). The group_by()
function is helpful in these cases:
summary_d <- countries_data %>%
group_by(Continent) %>%
summarize(
mean_population = mean(Population, na.rm = TRUE),
mean_beer_consumption_pp = mean(`Beer consumption per capita`, na.rm = TRUE)
)
summary_d
## # A tibble: 6 × 3
## Continent mean_population mean_beer_consumption_pp
## <chr> <dbl> <dbl>
## 1 Africa 28169437. 1.16
## 2 Asia 120914074. 0.516
## 3 Europe 19153988. 3.95
## 4 North America 44604449. 2.39
## 5 Oceania 12757605. 4.18
## 6 South America 37976184. 2.89
21.4 The filter() function
Suppose we are only interested in analyzing data from the least developed countries. We can filter these observations using the filter()
function, which subsets rows based on specified conditions:
## # A tibble: 6 × 18
## Country Year Continent `Least Developed`
## <chr> <dbl> <chr> <lgl>
## 1 Angola 2000 Africa TRUE
## 2 Angola 2001 Africa TRUE
## 3 Angola 2002 Africa TRUE
## 4 Angola 2003 Africa TRUE
## 5 Angola 2004 Africa TRUE
## 6 Angola 2005 Africa TRUE
## # ℹ 14 more variables: `Life Expectancy` <dbl>,
## # Population <dbl>, `CO2 emissions` <dbl>,
## # `Health expenditure` <dbl>,
## # `Electric power consumption` <dbl>,
## # `Forest area` <dbl>, `GDP per capita` <dbl>,
## # `Individuals using the Internet` <dbl>,
## # `Military expenditure` <dbl>, …
We can also filter rows based on specific numeric criteria:
We can also use this function to remove the rows with missing values from our data set, either for a specific variable, or for any variable, as in the example below.
Note that if_any()
checks if there are any values in the data set that meet the condition, in this case the condition of being missing (is.na()
). Hence, !if_any(everything(), is.na)
keeps only rows where no NA values are found in any variable, given that !
means ‘not’.
21.5 Reshaping data: long to wide and wide to long
Let’s take a second look at the data set:
## # A tibble: 30 × 18
## Country Year Continent `Least Developed`
## <chr> <dbl> <chr> <lgl>
## 1 Albania 2000 Europe FALSE
## 2 Albania 2001 Europe FALSE
## 3 Albania 2002 Europe FALSE
## 4 Albania 2003 Europe FALSE
## 5 Albania 2004 Europe FALSE
## 6 Albania 2005 Europe FALSE
## 7 Albania 2006 Europe FALSE
## 8 Albania 2007 Europe FALSE
## 9 Albania 2008 Europe FALSE
## 10 Albania 2009 Europe FALSE
## # ℹ 20 more rows
## # ℹ 14 more variables: `Life Expectancy` <dbl>,
## # Population <dbl>, `CO2 emissions` <dbl>,
## # `Health expenditure` <dbl>,
## # `Electric power consumption` <dbl>,
## # `Forest area` <dbl>, `GDP per capita` <dbl>,
## # `Individuals using the Internet` <dbl>, …
We can say that this data is in the long format, because each row represents a single observation for a specific country and year.
Imagine, you wanted a new data set, where one row represents one country, and the year-related data will be split in different columns instead of the rows. We can do this transformation using R function pivot_wider()
.
wide_data <- countries_data %>%
pivot_wider(
names_from = Year,
#we will exclude from the new columns the variables that do not change in the dataset
values_from = -c(Country, Year, Continent)
)
head(wide_data)
## # A tibble: 6 × 242
## Country Continent `Least Developed_2000`
## <chr> <chr> <lgl>
## 1 Albania Europe FALSE
## 2 Algeria Africa FALSE
## 3 Angola Africa TRUE
## 4 Argentina South America FALSE
## 5 Armenia Asia FALSE
## 6 Australia Oceania FALSE
## # ℹ 239 more variables: `Least Developed_2001` <lgl>,
## # `Least Developed_2002` <lgl>,
## # `Least Developed_2003` <lgl>,
## # `Least Developed_2004` <lgl>,
## # `Least Developed_2005` <lgl>,
## # `Least Developed_2006` <lgl>,
## # `Least Developed_2007` <lgl>, …
Let’s transform it back to the long format with pivot_longer()
function. You are likely to use this type of transformation a lot, since many survey and behavioral experiments platforms produce wide data sets that are not suitable for the analysis.
long_data <- wide_data %>%
pivot_longer(
#Let's keep the country and continent as identifiers
cols = -c(Country, Continent),
#Let's split column names back into variable name and year
names_to = c(".value", "Year"),
#Let's split names at the underscore
names_sep = "_"
)
head(long_data)
## # A tibble: 6 × 18
## Country Continent Year `Least Developed`
## <chr> <chr> <chr> <lgl>
## 1 Albania Europe 2000 FALSE
## 2 Albania Europe 2001 FALSE
## 3 Albania Europe 2002 FALSE
## 4 Albania Europe 2003 FALSE
## 5 Albania Europe 2004 FALSE
## 6 Albania Europe 2005 FALSE
## # ℹ 14 more variables: `Life Expectancy` <dbl>,
## # Population <dbl>, `CO2 emissions` <dbl>,
## # `Health expenditure` <dbl>,
## # `Electric power consumption` <dbl>,
## # `Forest area` <dbl>, `GDP per capita` <dbl>,
## # `Individuals using the Internet` <dbl>,
## # `Military expenditure` <dbl>, …
21.6 Merging several data sets together
Suppose you conducted several trials with the same participants and have separate data sets for each trial. However, you want to perform one analysis on all the data you collected across your trials. To be able to analyze the combined data, you can merge the data sets in R using left_join()
or right_join()
function. These functions require a key, that is, a variable that appears in both, so that the correct rows can be matched with the correct rows. In the example below, the combination of the variables Country
and Year
serves as the key.
Let’s see how it works:
x <- d %>% select(Country, Year,`Life Expectancy`, `Population`)
y <- d %>% select(Country, Year,`CO2 emissions`, `GDP per capita`)
merged_d <- x %>%
left_join(y,by = c("Country", "Year"))
head(merged_d)
## # A tibble: 6 × 6
## Country Year `Life Expectancy` Population
## <chr> <dbl> <dbl> <dbl>
## 1 Angola 2000 46.5 16395477
## 2 Angola 2001 47.1 16945753
## 3 Angola 2002 47.7 17519418
## 4 Angola 2003 48.4 18121477
## 5 Angola 2004 49.3 18758138
## 6 Angola 2005 50.2 19433604
## # ℹ 2 more variables: `CO2 emissions` <dbl>,
## # `GDP per capita` <dbl>
The difference between left_join()
and right_join()
is which of the two data frames you are joining provides the rows. left_join()
will give you all the rows that are in the first data frame, with added variables from the second. right_join()
gives you all the rows from the second data frame, with added variables from the first. full_join()
would give you a data frame with all the rows from both data frames.
21.7 Summary
In this section, we covered the following data wrangling features:
Making new variables with
mutate()
, in conjunction withcase_when()
where appropriate.Grouping and summarizing using
group_by()
andsummarize()
for descriptive statistics.Filtering and subsetting rows based on conditions (
filter()
), removing missing values, or selecting specific numeric ranges.Reshaping data from long format to wide with
pivot_wider()
and transforming wide-format data to long withpivot_longer()
.Merging multiple data sets using
left_join()
.