Chapter 3 Data Manipulation
3.1 Introduction to Tidyverse
The tidyverse is a collection of R packages that provides a set of tools for data manipulation, exploration, visualization, and modeling.
3.1.1 Core Tidyverse
Some of the key tidyverse packages that you are likely to use in data analysis include:
ggplot2: A powerful and flexible package for creating graphics and data visualizations. It follows the grammar of graphics philosophy, allowing users to build complex plots layer by layer.
dplyr: A package for data manipulation tasks such as filtering, selecting, arranging, grouping, and summarizing data.
tidyr: A package for reshaping and tidying data.
readr: A package for reading and writing structured text files, including CSV, TSV, and fixed-width format files.
purrr: A package for functional programming in R.
tibble: A tidy alternative to traditional data frames, providing better printing, subsetting, and handling of missing values.
forcats: A package for working with categorical variables (factors) in R.
lubridate: A package for working with dates and times. It provides functions to parse, manipulate, and work with date-time objects efficiently.
3.2 Pipe Operator
The pipe operator (%>%) is a feature of the magrittr package, which is commonly used alongside dplyr and other tidyverse packages.
The pipe operator (%>%) simplifies code organization by allowing actions to be written in the order they are executed. It improves readability by passing the output of one operation as the input to the next, which reduces the need for intermediate variables and keeps the data manipulation workflow clear.
Example of using the pipe (%>%) for data manipulation:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Sample dataset
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
math_score = c(85, 90, 75),
science_score = c(88, 82, 79)
)
# Calculate the total score for each student and filter students with a total score above 160
students_filtered <- students %>%
  mutate(total_score = math_score + science_score) %>% # Add a new column for total score
  filter(total_score > 160) # Filter students with a total score above 160
print(students_filtered)
## name math_score science_score total_score
## 1 Alice 85 88 173
## 2 Bob 90 82 172
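To make the benefit of the pipe concrete, the sketch below writes the same computation three ways: as nested function calls, with the magrittr pipe (%>%), and with R's native pipe (|>). The native pipe is not mentioned in the text above and is assumed to be available, which requires R 4.1.0 or later. The object names nested, piped, and native are illustrative only.

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# 1. Nested calls: must be read from the inside out
nested <- filter(
  mutate(students, total_score = math_score + science_score),
  total_score > 160
)

# 2. magrittr pipe: reads top to bottom, in execution order
piped <- students %>%
  mutate(total_score = math_score + science_score) %>%
  filter(total_score > 160)

# 3. Native pipe (assumes R >= 4.1.0); no extra package needed for the operator
native <- students |>
  mutate(total_score = math_score + science_score) |>
  filter(total_score > 160)

# All three approaches should give the same data frame
identical(nested, piped)
identical(piped, native)
```

The piped versions avoid both the deep nesting of the first form and the temporary variables that a step-by-step version would need.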
3.3 Data Manipulation
Data manipulation refers to the process of transforming, cleaning, and reorganizing data to make it more suitable for analysis or presentation.
Data manipulation typically involves tasks such as filtering, sorting, aggregating, merging, and restructuring data.
We will use dplyr, one of the core tidyverse packages, for data manipulation. Install the package (if required) and load it by running:
install.packages('dplyr')
library(dplyr)
Note: The dplyr package (and all other core packages) is loaded automatically when you load tidyverse with:
library(tidyverse)
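The "install if required" step can be made explicit so that a script does not reinstall the package on every run. The following is a minimal sketch, assuming a CRAN mirror is already configured for the session; it uses base R's requireNamespace() to check whether dplyr is available before calling install.packages().

```r
# Install dplyr only when it is not already available, then load it.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)
```

The same pattern works for tidyverse by replacing the package name.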
3.3.1 Data Manipulation Task
Here is an overview of some key dplyr functions for data manipulation:
filter(): Selects rows of a dataframe that meet specified conditions.
select(): Selects columns of a dataframe by name.
arrange(): Arranges rows of a dataframe based on one or more variables.
mutate(): Adds new variables to a dataframe, or modifies existing ones, based on transformations of existing variables.
group_by(): Groups a dataframe by one or more variables.
summarize(): Computes summary statistics for each group in a dataframe.
rename(): Renames variables in a dataframe.
distinct(): Returns unique rows in a dataframe.
slice(): Selects rows by position.
pull(): Extracts a single variable from a dataframe as a vector.
bind_cols(): Binds multiple dataframes by column.
bind_rows(): Binds multiple dataframes by row.
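These functions are designed to be chained with the pipe, so that each step receives the result of the previous one. The sketch below combines three of the verbs listed above on R's built-in iris dataset, which the next section describes. The object name iris_pipeline and the particular choice of steps are illustrative assumptions.

```r
library(dplyr)

# Chain several verbs in one pipeline:
# keep flowers with long sepals, keep three columns,
# then sort by sepal width from largest to smallest.
iris_pipeline <- iris %>%
  filter(Sepal.Length > 6) %>%
  select(Sepal.Length, Sepal.Width, Species) %>%
  arrange(desc(Sepal.Width))

head(iris_pipeline, 3)
```

Because every verb takes a dataframe as its first argument and returns a dataframe, the steps can be reordered or extended without introducing intermediate objects.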
3.3.2 Data Manipulation in R
Before demonstrating data manipulation tasks in R, we describe the data to be used: the built-in iris dataset. Here is a summary of the data:
data("iris")
iris %>% summary()
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The iris dataset consists of 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four variables are numeric variables, while the Species variable is a categorical variable.
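The variable types described above can be checked directly. This is a small sketch using base R's sapply() and levels() together with dplyr's glimpse(); the name column_classes is introduced only for illustration.

```r
library(dplyr)

# Class of each column: four numeric columns and one factor
column_classes <- sapply(iris, class)
column_classes

# Categories of the Species variable
levels(iris$Species)

# Compact overview of the dimensions and column types
glimpse(iris)
```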
Examples of data manipulation in R:
# Filter rows where Sepal.Length > 6
filtered_iris <- iris %>% filter(Sepal.Length > 6)
filtered_iris %>% head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
# Select specific columns
selected_columns <- iris %>% select(Sepal.Length, Sepal.Width, Species)
selected_columns %>% head(2)
## Sepal.Length Sepal.Width Species
## 1 5.1 3.5 setosa
## 2 4.9 3.0 setosa
# Arrange rows by Sepal.Length in descending order
arranged_iris <- iris %>% arrange(desc(Sepal.Length))
arranged_iris %>% head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.9 3.8 6.4 2.0 virginica
## 2 7.7 3.8 6.7 2.2 virginica
# Add a new column Sepal.Area
mutated_iris <- iris %>% mutate(Sepal.Area = Sepal.Length * Sepal.Width)
mutated_iris %>% head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
# Rename columns
renamed_iris <- iris %>% rename(SL = Sepal.Length, SW = Sepal.Width)
renamed_iris %>% head(2)
## SL SW Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
# Get distinct rows based on Sepal.Length and Sepal.Width
distinct_iris <- iris %>% distinct(Sepal.Length, Sepal.Width)
distinct_iris %>% head(2)
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
# Select the first 10 rows
sliced_iris <- iris %>% slice(1:10)
sliced_iris %>% head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
# Extract Sepal.Length as a vector
sepal_length_vector <- pull(iris, Sepal.Length)
sepal_length_vector
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
# Bind the new column to the iris dataset
iris_with_random <- bind_cols(iris, random_column = runif(nrow(iris)))
iris_with_random %>% head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species random_column
## 1 5.1 3.5 1.4 0.2 setosa 0.4354568
## 2 4.9 3.0 1.4 0.2 setosa 0.3686970
# Bind rows of iris to itself
duplicated_iris <- bind_rows(iris, iris)
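The function overview in Section 3.3.1 also lists group_by() and summarize(), which the examples above do not exercise. The following sketch shows one way they might be combined on iris to compute per-species statistics. The summary column names (n, mean_sepal_length, max_petal_width) and the object name species_summary are illustrative choices, not part of the dataset.

```r
library(dplyr)

# Group the rows by species, then compute one summary row per group
species_summary <- iris %>%
  group_by(Species) %>%
  summarize(
    n = n(),                                 # number of rows in each group
    mean_sepal_length = mean(Sepal.Length),  # average sepal length per species
    max_petal_width = max(Petal.Width)       # largest petal width per species
  )

species_summary
```

The result has one row per level of Species, so a 150-row dataframe is reduced to three rows.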