Chapter 3 Data Manipulation

3.1 Introduction to Tidyverse

The tidyverse is a collection of R packages that provides a set of tools for data manipulation, exploration, visualization, and modeling.

3.1.1 Core Tidyverse

Some of the key tidyverse packages commonly used in data analysis include:

  • ggplot2 : A powerful and flexible package for creating graphics and data visualizations. It follows the grammar of graphics philosophy, allowing users to build complex plots layer by layer.

  • dplyr : A package for data manipulation tasks such as filtering, selecting, arranging, grouping, and summarizing data.

  • tidyr : A package for reshaping and tidying data.

  • readr : A package for reading and writing structured text files, including CSV, TSV, and fixed-width format files.

  • purrr : A package for functional programming in R.

  • tibble : A modern alternative to traditional data frames, providing cleaner printing and stricter subsetting.

  • forcats : A package for working with categorical variables (factors) in R.

  • lubridate : A package for working with dates and times. It provides functions to parse, manipulate, and work with date-time objects efficiently.

3.1.2 Install and Load Package

Install all the packages in the tidyverse by running:

install.packages('tidyverse')

Load the core tidyverse by running:

library(tidyverse)

3.2 Pipe Operator

The pipe operator (%>%) comes from the magrittr package and is re-exported by dplyr, so it is available whenever dplyr or tidyverse is loaded.

The pipe passes the output of one operation to the next as its first argument, so a sequence of steps can be written in the order in which it is executed. This improves readability and reduces the need for intermediate variables or deeply nested function calls.

Example of using the pipe (%>%) for data manipulation:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Sample dataset
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# Calculate the total score for each student and filter students with a total score above 160
students_filtered <- students %>%
  mutate(total_score = math_score + science_score) %>%  # Add a new column for total score
  filter(total_score > 160)  # Filter students with a total score above 160

print(students_filtered)
##    name math_score science_score total_score
## 1 Alice         85            88         173
## 2   Bob         90            82         172
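
To make the role of the pipe concrete, the following sketch runs the same two steps twice: once as nested function calls and once with %>%. It reuses the students data from the example above; the object names nested and piped are illustrative only.

```r
library(dplyr)

# Same sample data as in the example above
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# Without the pipe: calls are nested and must be read from the inside out
nested <- filter(
  mutate(students, total_score = math_score + science_score),
  total_score > 160
)

# With the pipe: each step receives the previous result as its first argument
piped <- students %>%
  mutate(total_score = math_score + science_score) %>%
  filter(total_score > 160)

# Compare the two results
identical(nested, piped)
```

Both forms call the same functions with the same arguments, so the comparison should return TRUE. The piped version simply lists the steps in the order they run.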

3.3 Data Manipulation

Data manipulation refers to the process of transforming, cleaning, and reorganizing data to make it more suitable for analysis or presentation.

Data manipulation typically involves tasks such as filtering, sorting, aggregating, merging, and restructuring data.

We will use dplyr, one of the core tidyverse packages, for data manipulation. Install (if required) and load the package by running:

install.packages('dplyr')
library(dplyr)

Note: The dplyr package (and all other core packages) will be loaded automatically if you load tidyverse by using the code:

library(tidyverse)
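
The "install if required" step can be sketched as follows. This is one common pattern rather than the only way to do it: requireNamespace() checks whether a package is available without attaching it, and install.packages() runs only when that check fails.

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package for the current session
library(dplyr)
```

Written this way, the script can be re-run without reinstalling the package each time.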

3.3.1 Data Manipulation Task

Here’s an overview of some key functions in dplyr for data manipulation:

  • filter() : Selects rows of a dataframe that meet specified conditions.
  • select() : Selects columns of a dataframe by name.
  • arrange() : Arranges rows of a dataframe based on one or more variables.
  • mutate() : Adds new variables to a dataframe, or modifies existing ones, based on transformations of existing variables.
  • group_by() : Groups a dataframe by one or more variables.
  • summarize() : Computes summary statistics for each group in a dataframe.
  • rename() : Renames variables in a dataframe.
  • distinct() : Returns unique rows in a dataframe.
  • slice() : Selects rows by position.
  • pull() : Extracts a single variable from a dataframe as a vector.
  • bind_cols() : Binds multiple dataframes by column.
  • bind_rows() : Binds multiple dataframes by row.

3.3.2 Data Manipulation in R

Before demonstrating these data manipulation tasks in R, we describe the data to be used: the built-in iris dataset. Here is a summary of the data:

data("iris")
iris %>% summary()
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The iris dataset consists of 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four variables are numeric variables, while the Species variable is a categorical variable.
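
These properties can be checked directly. The sketch below uses only base R functions on the built-in iris data; the variable name column_classes is illustrative.

```r
data("iris")

# Number of rows and columns
dim(iris)

# Class of each variable: four numeric columns and one factor
column_classes <- sapply(iris, class)
column_classes

# Categories (levels) of the Species factor
levels(iris$Species)
```

In R, a categorical variable is stored as a factor, which is why Species is summarized by counts per level rather than by quartiles.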

Examples of data manipulation in R:

# Filter rows where Sepal.Length > 6
filtered_iris <- iris %>% filter(Sepal.Length > 6)
filtered_iris  %>% head(2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
# Select specific columns
selected_columns <- iris %>% select(Sepal.Length, Sepal.Width, Species)
selected_columns %>% head(2)
##   Sepal.Length Sepal.Width Species
## 1          5.1         3.5  setosa
## 2          4.9         3.0  setosa
# Arrange rows by Sepal.Length in descending order
arranged_iris <- iris %>% arrange(desc(Sepal.Length))
arranged_iris %>% head(2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          7.9         3.8          6.4         2.0 virginica
## 2          7.7         3.8          6.7         2.2 virginica
# Add a new column Sepal.Area
mutated_iris <- iris %>%  mutate(Sepal.Area = Sepal.Length * Sepal.Width)
mutated_iris %>% head(2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2  setosa      17.85
## 2          4.9         3.0          1.4         0.2  setosa      14.70
# Rename columns
renamed_iris <- iris %>% rename(SL = Sepal.Length, SW = Sepal.Width)
renamed_iris  %>% head(2)
##    SL  SW Petal.Length Petal.Width Species
## 1 5.1 3.5          1.4         0.2  setosa
## 2 4.9 3.0          1.4         0.2  setosa
# Get distinct rows based on Sepal.Length and Sepal.Width
distinct_iris <- iris %>% distinct(Sepal.Length, Sepal.Width)
distinct_iris  %>% head(2)
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
# Select the first 10 rows
sliced_iris <- iris %>% slice(1:10)
sliced_iris  %>% head(2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
# Extract Sepal.Length as a vector
sepal_length_vector <- pull(iris, Sepal.Length)
sepal_length_vector
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
# Bind a new column of random values to the iris dataset (values differ on each run)
iris_with_random <- bind_cols(iris, random_column = runif(nrow(iris)))
iris_with_random  %>% head(2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species random_column
## 1          5.1         3.5          1.4         0.2  setosa     0.4354568
## 2          4.9         3.0          1.4         0.2  setosa     0.3686970
# Bind rows of iris to itself (the result has 300 rows)
duplicated_iris <- bind_rows(iris, iris)
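
The examples above do not cover group_by() and summarize() from the list in Section 3.3.1. A minimal sketch of how the two are typically combined follows; the output column names n and mean_sepal_length are chosen here for illustration.

```r
library(dplyr)

# Group by species, then compute one summary row per group
species_summary <- iris %>%
  group_by(Species) %>%
  summarize(
    n = n(),                                # number of rows in each group
    mean_sepal_length = mean(Sepal.Length)  # average sepal length per group
  )

species_summary
```

The result has one row per species, each with n equal to 50, matching the species counts in the summary of the iris data shown earlier.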