Chapter 5 Data Manipulation

5.1 Introduction

Data manipulation is a fundamental step in data analysis, transforming raw datasets into formats suitable for analysis and visualization. This chapter explores key techniques for manipulating data in R, including selecting, removing, and reshaping data. We’ll use the popular dplyr and tidyr packages from the tidyverse ecosystem, which provide user-friendly functions for common tasks.

Data manipulation involves modifying, organizing, or restructuring datasets. Common goals include:

Selecting specific columns or rows.
Filtering out unnecessary data.
Reshaping data between wide and long formats.
Aggregating or summarizing data.

In scientific research, long-format data is typically preferred for analysis and visualization because most modeling and plotting functions in R and Python expect one observation per row, with separate columns identifying the variable or group. The same layout also simplifies group-wise analysis and plotting.
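The examples in the following sections operate on a data frame named data with the columns ID, Name, Age, Score, and Group. The chapter does not show how that data frame is built, so the sketch below creates a small hypothetical one for experimentation; every value in it is made up for illustration.

```r
library(dplyr)

# Hypothetical example data: the values are invented for illustration only.
data <- tibble::tibble(
  ID = 1:6,
  Name = c("Ana", "Ben", "Cara", "Dev", "Elle", "Finn"),
  Age = c(25, 34, 45, 52, 29, 41),
  Score = c(88, 92, 75, 64, 81, 92),
  Group = c("A", "B", "A", "B", "A", "C")
)

print(data)
```

Group "C" is included on purpose so that later operations, such as joins, have a row without a matching partner.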

5.2 Selecting Columns

Selecting only the columns you need makes a dataset smaller and easier to inspect.

5.2.1 Using `select()` from `dplyr`

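library(dplyr)  # provides select(), filter(), mutate(), and the %>% pipe
# Keep only the Name and Age columns of data (assumed to be an existing data frame)
selected_data <- data %>% select(Name, Age)
print(selected_data)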

5.2.2 Selecting Columns by Pattern

Inside select(), the helpers starts_with(), ends_with(), and contains() match column names by prefix, suffix, or substring.

pattern_data <- data %>% select(starts_with("S"))  
print(pattern_data)
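As a minimal sketch of all three helpers side by side, the block below uses a hypothetical data frame called scores; its column names are assumptions chosen so that each helper matches something different. Matching ignores case by default.

```r
library(dplyr)

# Hypothetical data frame; the column names are assumptions for illustration.
scores <- tibble::tibble(
  ID = 1:3,
  Score_Math = c(90, 75, 82),
  Score_Reading = c(85, 80, 78),
  Age_Years = c(25, 34, 45)
)

# Columns whose names begin with "Score": Score_Math, Score_Reading
starts_score <- scores %>% select(starts_with("Score"))

# Columns whose names end with "Years": Age_Years
ends_years <- scores %>% select(ends_with("Years"))

# Columns whose names contain "Read" anywhere: Score_Reading
has_read <- scores %>% select(contains("Read"))

print(names(starts_score))
print(names(ends_years))
print(names(has_read))
```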

5.3 Filtering Rows

Filtering rows extracts subsets of data based on conditions.

5.3.1 Using `filter()` from `dplyr`

filtered_data <- data %>% filter(Age > 30)  
print(filtered_data)

5.3.2 Combining Conditions

Use logical operators like & (and), | (or), and ! (not).

combined_filter <- data %>% filter(Age > 30 & Group == "A")  
print(combined_filter)
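The other operators work the same way. The sketch below uses a small hypothetical data frame called people (an assumption, not part of the chapter's data) to show | and !, and also that separating conditions with a comma inside filter() is equivalent to &.

```r
library(dplyr)

# Hypothetical data frame for illustration.
people <- tibble::tibble(
  Name = c("Ana", "Ben", "Cara", "Dev"),
  Age = c(25, 34, 45, 52),
  Group = c("A", "B", "A", "B")
)

# OR: rows where either condition holds (Ana, Cara, Dev)
either <- people %>% filter(Age > 50 | Group == "A")

# NOT: rows where the condition is false (Ben, Dev)
not_a <- people %>% filter(!(Group == "A"))

# Comma-separated conditions are combined with AND (Cara)
both <- people %>% filter(Age > 30, Group == "A")

print(either)
print(not_a)
print(both)
```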

5.4 Adding or Modifying Columns

5.4.1 Adding New Columns with `mutate()`

# Label each row "Senior" when Age is over 40 and "Junior" otherwise
mutated_data <- data %>% mutate(AgeGroup = ifelse(Age > 40, "Senior", "Junior"))
print(mutated_data)

5.4.2 Transforming Existing Columns

# Overwrite Score with a proportion (assumes Score is recorded on a 0-100 scale)
transformed_data <- data %>% mutate(Score = Score / 100)
print(transformed_data)

5.5 Removing Columns or Rows

5.5.1 Removing Columns

# A minus sign drops the named column and keeps all the others
removed_columns <- data %>% select(-Score)
print(removed_columns)

5.5.2 Removing Rows

# Keep every row except the one whose ID is 3
removed_rows <- data %>% filter(ID != 3)
print(removed_rows)

5.6 Sorting Data

5.6.1 Using `arrange()`

sorted_data <- data %>% arrange(desc(Age))  
print(sorted_data)
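arrange() also accepts several columns, using later ones to break ties in earlier ones. A minimal sketch, assuming a hypothetical data frame called people:

```r
library(dplyr)

# Hypothetical data frame for illustration.
people <- tibble::tibble(
  Name = c("Ana", "Ben", "Cara", "Dev"),
  Age = c(25, 34, 45, 52),
  Group = c("A", "B", "A", "B")
)

# Sort by Group (ascending), then by Age (descending) within each group
sorted_people <- people %>% arrange(Group, desc(Age))
print(sorted_people)
# Resulting Name order: Cara, Ana, Dev, Ben
```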

5.7 Summarizing Data

Summarizing collapses many rows into a single row of statistics such as the mean, median, or count.

5.7.1 Using `summarize()`

summary_data <- data %>% summarize(AverageScore = mean(Score), MaxAge = max(Age))  
print(summary_data)
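The median and the row count can be computed in the same call. The sketch below uses a hypothetical data frame called results that contains one missing Score, to show that mean() and median() return NA unless na.rm = TRUE is set, while n() counts every row.

```r
library(dplyr)

# Hypothetical data with one missing score.
results <- tibble::tibble(
  Score = c(88, 92, NA, 64),
  Age = c(25, 34, 45, 52)
)

summary_stats <- results %>%
  summarize(
    MeanScore = mean(Score, na.rm = TRUE),     # mean of 88, 92, 64
    MedianScore = median(Score, na.rm = TRUE), # middle value: 88
    Count = n()                                # counts all 4 rows, including the NA
  )

print(summary_stats)
```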

5.8 Grouping and Aggregation

Use group_by() with summarize() to calculate group-wise summaries; summarize() then returns one row per group instead of one row overall.

grouped_summary <- data %>%  
  group_by(Group) %>%  
  summarize(AverageScore = mean(Score), Count = n())  
print(grouped_summary)
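To make the result concrete, here is the same pattern applied to a small hypothetical data frame called trial, where the group means can be checked by hand.

```r
library(dplyr)

# Hypothetical data: two observations in group A, three in group B.
trial <- tibble::tibble(
  Group = c("A", "A", "B", "B", "B"),
  Score = c(80, 90, 60, 70, 80)
)

by_group <- trial %>%
  group_by(Group) %>%
  summarize(AverageScore = mean(Score), Count = n())

print(by_group)
# Group A: AverageScore 85, Count 2
# Group B: AverageScore 70, Count 3
```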

5.9 Reshaping Data

Reshaping involves converting data between wide and long formats.

5.9.1 Long to Wide Format

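Use pivot_wider() from tidyr. The names_from argument gives the column whose values become the new column names, and values_from gives the column whose values fill them.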

library(tidyr)  # provides pivot_wider() and pivot_longer()

# Long layout: one row per ID and measurement
long_data <- tibble::tibble(
  ID = rep(1:3, each = 2),
  Variable = rep(c("Height", "Weight"), times = 3),
  Value = c(160, 60, 170, 70, 180, 80)
)

wide_data <- long_data %>% pivot_wider(names_from = Variable, values_from = Value)  
print(wide_data)

5.9.2 Wide to Long Format

Use pivot_longer() to convert wide-format data back to long format. The cols argument selects the columns to stack, names_to names the new column that holds the old column names, and values_to names the column that holds their values.

long_format <- wide_data %>% pivot_longer(cols = Height:Weight, names_to = "Variable", values_to = "Value")  
print(long_format)

5.9.3 Why Long Format Is Important in Research

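Many modeling functions in R (e.g., aov() for ANOVA and lm() for regression) take a formula and a data frame with one row per observation, so repeated measurements need to be stacked in long format.
ggplot2 maps each aesthetic (x, y, color) to a single column, so plotting several variables or groups together generally calls for long-format data.
Group-wise operations and comparisons are simpler because one grouping column covers every measurement.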

Wide format is primarily useful for human-readable summaries or when data needs to be shared as tables. However, it often complicates analysis and visualization tasks.
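As a sketch of the group-wise point, the block below reshapes a hypothetical wide table (called wide here, with the same Height and Weight values used earlier) and then summarizes every measurement with a single group_by(), rather than writing one expression per column.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per measurement.
wide <- tibble::tibble(
  ID = 1:3,
  Height = c(160, 170, 180),
  Weight = c(60, 70, 80)
)

# Stack Height and Weight into Variable/Value pairs
long <- wide %>%
  pivot_longer(cols = c(Height, Weight), names_to = "Variable", values_to = "Value")

# One grouped summary handles all measurements at once
variable_means <- long %>%
  group_by(Variable) %>%
  summarize(Mean = mean(Value))

print(variable_means)
# Height: 170, Weight: 70
```

Adding a third measurement column to wide would require no change to the summary step, only to the cols selection.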

5.10 Removing Duplicates

5.10.1 Using `distinct()`

# Keep the first row for each value of Group; .keep_all = TRUE retains the remaining columns
distinct_data <- data %>% distinct(Group, .keep_all = TRUE)
print(distinct_data)
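What distinct() removes depends on its arguments. The sketch below compares three forms on a hypothetical data frame called records, which contains one fully duplicated row.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 are identical.
records <- tibble::tibble(
  ID = c(1, 1, 2, 3),
  Group = c("A", "A", "A", "B"),
  Score = c(88, 88, 92, 64)
)

# No arguments: drop only rows that are duplicated across all columns (3 rows remain)
exact_unique <- records %>% distinct()

# One row per Group, keeping all columns from the first occurrence (2 rows remain)
first_per_group <- records %>% distinct(Group, .keep_all = TRUE)

# Without .keep_all, only the Group column is returned
groups_only <- records %>% distinct(Group)

print(exact_unique)
print(first_per_group)
print(groups_only)
```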

5.11 Joining Datasets

Combining datasets is common in multi-source analyses.

5.11.1 Using `left_join()`

additional_data <- tibble::tibble(  
  Group = c("A", "B"),  
  Category = c("Alpha", "Beta")  
)  

# Keep every row of data; rows whose Group has no match in additional_data get NA for Category
joined_data <- data %>% left_join(additional_data, by = "Group")
print(joined_data)
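To illustrate how unmatched rows are handled, the sketch below joins two hypothetical tables (measurements and labels, names chosen for this example) and contrasts left_join() with inner_join(), which keeps only the rows that have a match in both tables.

```r
library(dplyr)

# Hypothetical tables: Group "C" has no entry in labels.
measurements <- tibble::tibble(
  ID = 1:3,
  Group = c("A", "B", "C")
)
labels <- tibble::tibble(
  Group = c("A", "B"),
  Category = c("Alpha", "Beta")
)

# left_join keeps all 3 rows; Category is NA for Group "C"
left <- measurements %>% left_join(labels, by = "Group")

# inner_join keeps only the 2 rows with a match in both tables
inner <- measurements %>% inner_join(labels, by = "Group")

print(left)
print(inner)
```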

5.12 Summary

In this chapter, we explored essential data manipulation techniques in R: selecting and filtering, adding and removing columns, sorting, summarizing, grouping, reshaping, removing duplicates, and joining. While both long and wide formats have their use cases, long-format data is usually the better choice for statistical analysis and visualization in scientific research. By mastering the tools and techniques outlined here, you’ll be well-equipped to handle diverse data manipulation challenges in your projects.