Chapter 5 Data Manipulation

5.1 Introduction

Data manipulation is a fundamental step in data analysis: it transforms raw datasets into formats suitable for analysis and visualization. This chapter covers key techniques for manipulating data in R, including selecting columns, filtering rows, adding and removing columns, sorting, summarizing, reshaping, and joining datasets. We’ll use the dplyr and tidyr packages from the tidyverse, which provide consistent, readable functions for these tasks.

Data manipulation involves modifying, organizing, or restructuring datasets. Common goals include:

  • Selecting specific columns or rows.

  • Filtering out unnecessary data.

  • Reshaping data between wide and long formats.

  • Aggregating or summarizing data.

In scientific research, long-format data, in which each row holds a single observation of one variable, is typically preferred for analysis and visualization because it aligns well with the statistical modeling and plotting tools available in R and Python. Long-format data is also easier to process in group-wise analyses.
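
The examples in this chapter operate on a data frame named data, which is not defined in this section. The following is a minimal sketch of such a dataset, assuming only the columns the examples refer to (ID, Name, Age, Group, Score). The values are invented for illustration.

```r
library(dplyr)

# Hypothetical example data: the column names match those used in this
# chapter's examples, but every value here is made up.
data <- tibble::tibble(
  ID    = 1:6,
  Name  = c("Ana", "Ben", "Cara", "Dev", "Eli", "Fay"),
  Age   = c(25, 34, 47, 52, 29, 41),
  Group = c("A", "B", "A", "B", "A", "B"),
  Score = c(88, 92, 75, 64, 81, 97)
)

print(data)
```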

5.2 Selecting Columns

Selecting specific columns helps reduce dataset complexity.

5.2.1 Using select() from dplyr

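library(dplyr)
# `data` is assumed to be a data frame with columns such as ID, Name, Age, Group, and Score
selected_data <- data %>% select(Name, Age)
print(selected_data)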

5.2.2 Selecting Columns by Pattern

Use the helpers starts_with(), ends_with(), or contains() inside select() to match column names by prefix, suffix, or substring.

pattern_data <- data %>% select(starts_with("S"))  
print(pattern_data)  
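
The other two helpers work the same way. As a small sketch, assuming a hypothetical data frame named survey whose column names were chosen to show each helper:

```r
library(dplyr)

# Hypothetical data; the column names are assumptions for illustration
survey <- tibble::tibble(
  ID         = 1:3,
  Score_math = c(80, 90, 70),
  Score_read = c(85, 75, 95),
  Age_years  = c(25, 34, 47)
)

# Columns whose names end with "_math"
math_cols <- survey %>% select(ends_with("_math"))

# Columns whose names contain "Score" anywhere
score_cols <- survey %>% select(contains("Score"))

print(names(math_cols))
print(names(score_cols))
```

Note that these helpers ignore case by default, so contains("score") would match the same columns.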

5.3 Filtering Rows

Filtering rows extracts subsets of data based on conditions.

5.3.1 Using filter() from dplyr

filtered_data <- data %>% filter(Age > 30)  
print(filtered_data)  

5.3.2 Combining Conditions

Use logical operators like & (and), | (or), and ! (not).

combined_filter <- data %>% filter(Age > 30 & Group == "A")  
print(combined_filter)  
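
The | and ! operators can be used in the same way. A short sketch, assuming a hypothetical data frame named people with invented values:

```r
library(dplyr)

# Hypothetical data for illustration
people <- tibble::tibble(
  ID    = 1:4,
  Age   = c(25, 34, 47, 52),
  Group = c("A", "B", "A", "B")
)

# OR: keep rows where at least one condition holds
either <- people %>% filter(Age > 50 | Group == "A")

# NOT: keep rows where Group is anything other than "A"
not_a <- people %>% filter(!(Group == "A"))

print(either)
print(not_a)
```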

5.4 Adding or Modifying Columns

5.4.1 Adding New Columns with mutate()

mutated_data <- data %>% mutate(AgeGroup = ifelse(Age > 40, "Senior", "Junior"))  
print(mutated_data)  
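
ifelse() handles two categories. When more are needed, dplyr's case_when() is one option. The sketch below assumes hypothetical cutoffs and labels that are not part of the example above:

```r
library(dplyr)

# Hypothetical data for illustration
people <- tibble::tibble(
  ID  = 1:3,
  Age = c(25, 34, 47)
)

# Conditions are checked in order; the first one that is TRUE wins.
# The cutoffs (30, 40) and the labels are assumptions for this sketch.
banded <- people %>%
  mutate(AgeBand = case_when(
    Age < 30  ~ "Young",
    Age <= 40 ~ "Middle",
    TRUE      ~ "Senior"
  ))

print(banded)
```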

5.4.2 Transforming Existing Columns

transformed_data <- data %>% mutate(Score = Score / 100)  
print(transformed_data)  

5.5 Removing Columns or Rows

5.5.1 Removing Columns

removed_columns <- data %>% select(-Score)  
print(removed_columns)  

5.5.2 Removing Rows

removed_rows <- data %>% filter(ID != 3)  
print(removed_rows)  

5.6 Sorting Data

5.6.1 Using arrange()

sorted_data <- data %>% arrange(desc(Age))  
print(sorted_data)  
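
arrange() also accepts several columns, sorting by the first and breaking ties with the next. A minimal sketch, assuming a hypothetical data frame named people:

```r
library(dplyr)

# Hypothetical data for illustration
people <- tibble::tibble(
  ID    = 1:4,
  Age   = c(25, 34, 47, 52),
  Group = c("A", "B", "A", "B")
)

# Sort by Group (ascending), then by Age (descending) within each group
sorted_people <- people %>% arrange(Group, desc(Age))
print(sorted_people)
```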

5.7 Summarizing Data

Summarizing collapses a dataset into one or more statistics, such as the mean, median, or count.

5.7.1 Using summarize()

summary_data <- data %>% summarize(AverageScore = mean(Score), MaxAge = max(Age))  
print(summary_data)  
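
The median and count can be computed in the same call. The sketch below assumes a hypothetical data frame named scores that contains one missing value, and uses na.rm = TRUE so the missing value does not turn the results into NA:

```r
library(dplyr)

# Hypothetical data with one missing Score
scores <- tibble::tibble(Score = c(80, 90, NA, 70))

summary_stats <- scores %>%
  summarize(
    MeanScore   = mean(Score, na.rm = TRUE),    # mean of the non-missing values
    MedianScore = median(Score, na.rm = TRUE),  # median of the non-missing values
    Count       = n()                           # number of rows, including the NA row
  )

print(summary_stats)
```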

5.8 Grouping and Aggregation

Use group_by() with summarize() to calculate group-wise summaries.

grouped_summary <- data %>%  
  group_by(Group) %>%  
  summarize(AverageScore = mean(Score), Count = n())  
print(grouped_summary)  

5.9 Reshaping Data

Reshaping involves converting data between wide and long formats.

5.9.1 Long to Wide Format

Use pivot_wider() from tidyr.

library(tidyr)  
long_data <- tibble::tibble(  
  ID = rep(1:3, each = 2),  
  Variable = c("Height", "Weight", "Height", "Weight", "Height", "Weight"),  
  Value = c(160, 60, 170, 70, 180, 80)  
)  

wide_data <- long_data %>% pivot_wider(names_from = Variable, values_from = Value)  
print(wide_data)  

5.9.2 Wide to Long Format

Use pivot_longer() to convert wide-format data to long format, which is more suitable for analysis and modeling in scientific research.

long_format <- wide_data %>% pivot_longer(cols = Height:Weight, names_to = "Variable", values_to = "Value")  
print(long_format)  

5.9.3 Why Long Format Is Important in Research

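  • Many modeling functions in R (e.g., aov() for ANOVA and lm() for regression) expect one observation per row, so repeated measurements usually need to be in long format.

  • ggplot2 maps each aesthetic (axis, color, facet) to a single column, so plotting several variables together generally requires long format.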

  • Long format allows easier group-wise operations and comparisons.

Wide format is primarily useful for human-readable summaries or when data needs to be shared as tables. However, it often complicates analysis and visualization tasks.
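
As a small illustration of the group-wise point, the long-format table from Section 5.9.1 can be summarized per variable with a single group_by() call. This is a sketch rather than a required workflow, and the output name variable_means is chosen here for illustration:

```r
library(dplyr)

# Same long-format data as in Section 5.9.1
long_data <- tibble::tibble(
  ID       = rep(1:3, each = 2),
  Variable = c("Height", "Weight", "Height", "Weight", "Height", "Weight"),
  Value    = c(160, 60, 170, 70, 180, 80)
)

# One grouped summary covers every variable in the table
variable_means <- long_data %>%
  group_by(Variable) %>%
  summarize(Mean = mean(Value))

print(variable_means)
```

In wide format, the same result would require naming each column (Height, Weight) explicitly in the summary.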

5.10 Removing Duplicates

5.10.1 Using distinct()

# Keep only the first row for each Group value; .keep_all = TRUE retains the other columns
distinct_data <- data %>% distinct(Group, .keep_all = TRUE)
print(distinct_data)  
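
Called without column names, distinct() drops only rows that are identical across every column. The sketch below contrasts the two behaviors, assuming a hypothetical data frame named records:

```r
library(dplyr)

# Hypothetical data: rows 2 and 3 are exact duplicates
records <- tibble::tibble(
  ID    = c(1, 2, 2, 3),
  Group = c("A", "B", "B", "A")
)

# Remove rows that are duplicated across all columns
unique_rows <- records %>% distinct()

# Keep the first row seen for each Group value
first_per_group <- records %>% distinct(Group, .keep_all = TRUE)

print(unique_rows)
print(first_per_group)
```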

5.11 Joining Datasets

Combining datasets is common in multi-source analyses.

5.11.1 Using left_join()

additional_data <- tibble::tibble(  
  Group = c("A", "B"),  
  Category = c("Alpha", "Beta")  
)  

# Keep every row of data and add Category wherever Group matches
joined_data <- data %>% left_join(additional_data, by = "Group")
print(joined_data)  
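
left_join() keeps rows that have no match and fills the new columns with NA, whereas inner_join() drops them. A minimal sketch of the difference, assuming hypothetical tables named people and lookup:

```r
library(dplyr)

# Hypothetical data: Group "C" has no entry in the lookup table
people <- tibble::tibble(
  ID    = 1:3,
  Group = c("A", "B", "C")
)
lookup <- tibble::tibble(
  Group    = c("A", "B"),
  Category = c("Alpha", "Beta")
)

# All rows of people are kept; Category is NA for Group "C"
left_result <- people %>% left_join(lookup, by = "Group")

# Only rows with a match in both tables are kept
inner_result <- people %>% inner_join(lookup, by = "Group")

print(left_result)
print(inner_result)
```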

5.12 Summary

In this chapter, we explored essential data manipulation techniques in R. While both long and wide formats have their use cases, long-format data is usually the better choice for scientific research because it simplifies statistical analysis and visualization. By mastering the tools and techniques outlined here, you’ll be well-equipped to handle diverse data manipulation challenges in your projects.