Chapter 3 Data Manipulation

3.1 Introduction to Tidyverse

Tidyverse is a collection of R packages that provides a set of tools for data manipulation, exploration, visualization, and modeling.

3.1.1 Core Tidyverse

Some of the key tidyverse packages that you are likely to use in data analysis include:

  • ggplot2 : A powerful and flexible package for creating graphics and data visualizations. It follows the grammar of graphics philosophy, allowing users to build complex plots layer by layer.

  • dplyr : A package for data manipulation tasks such as filtering, selecting, arranging, grouping, and summarizing data.

  • tidyr : A package for reshaping and tidying data.

  • readr : A package for reading and writing structured text files, including CSV, TSV, and fixed-width format files.

  • purrr : A package for functional programming in R.

  • tibble : A tidy alternative to traditional data frames, providing cleaner printing and stricter subsetting.

  • forcats : A package for working with categorical variables (factors) in R.

  • lubridate : A package for working with dates and times. It provides functions to parse, manipulate, and work with date-time objects efficiently.

3.1.2 Install and Load Package

Install all the packages in the tidyverse by running :

install.packages('tidyverse')

Load the core tidyverse by running :

library(tidyverse)
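
If a script may run on a machine where tidyverse is not installed yet, the install step can be guarded so that it only runs when needed. The following is a minimal sketch, assuming a CRAN mirror is configured and an internet connection is available:

```r
# Install tidyverse only if it is not already available
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

library(tidyverse)

# List the packages that belong to the tidyverse
tidyverse_packages()
```

The check with requireNamespace() avoids reinstalling the packages every time the script is run.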

3.2 Pipe Operator

The pipe operator (%>%) comes from the magrittr package. dplyr and several other tidyverse packages re-export it, so it is available as soon as one of them is loaded.

The pipe passes the result of the expression on its left as the first argument of the function on its right, so a sequence of operations can be written in the order in which it runs. This makes code easier to read and reduces the need for intermediate variables or deeply nested calls.
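
The way the pipe hands its left-hand side to the next function can be checked with a few base R functions. This is a small sketch; the vector x is made up for illustration:

```r
library(magrittr)  # provides %>%; it is also available after library(dplyr)

x <- c(1, 4, 9, 16)

# x %>% f() is the same as f(x)
identical(x %>% sum(), sum(x))
## [1] TRUE

# x %>% f(y) is the same as f(x, y): the left-hand side becomes the first argument
identical(x %>% head(2), head(x, 2))
## [1] TRUE

# A chain reads from left to right instead of from the inside out
x %>% sqrt() %>% mean() %>% round(1)
## [1] 2.5
```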

Example of using the pipe (%>%) for data manipulation :

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Sample dataset
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# Calculate the total score for each student and filter students with a total score above 160
students_filtered <- students %>%
  mutate(total_score = math_score + science_score) %>%  # Add a new column for total score
  filter(total_score > 160)  # Filter students with a total score above 160

print(students_filtered)
##    name math_score science_score total_score
## 1 Alice         85            88         173
## 2   Bob         90            82         172
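
For comparison, the same result can be obtained without the pipe, either with an intermediate variable or with nested calls. The sketch below recreates the students data frame so that it runs on its own; the names students_total, with_intermediate, nested, and piped are chosen for illustration only:

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# Version 1: store the intermediate result in its own variable
students_total <- mutate(students, total_score = math_score + science_score)
with_intermediate <- filter(students_total, total_score > 160)

# Version 2: nested calls, read from the inside out
nested <- filter(
  mutate(students, total_score = math_score + science_score),
  total_score > 160
)

# Version 3: the pipe, read from top to bottom
piped <- students %>%
  mutate(total_score = math_score + science_score) %>%
  filter(total_score > 160)

# All three versions run the same operations on the same data
identical(with_intermediate, nested)
identical(nested, piped)
```

The three versions differ only in how the code is laid out. The piped version avoids both the extra variable and the nesting.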

3.3 Data Manipulation

Data manipulation refers to the process of transforming, cleaning, and reorganizing data to make it more suitable for analysis or presentation.

Data manipulation typically involves tasks such as filtering, sorting, aggregating, merging, and restructuring data.

We will use dplyr, one of the core tidyverse packages, for data manipulation. Install (if required) and load the package by running :

install.packages('dplyr')
library(dplyr)

Note: The dplyr package (and all other core packages) will be loaded automatically if you load tidyverse by using the code :

library(tidyverse)

3.3.1 Data Manipulation Task

Here’s an overview of some key functions in dplyr for data manipulation :

  • filter() : Selects rows of a dataframe that meet specified conditions.
  • select() : Selects columns of a dataframe by name.
  • arrange() : Arranges rows of a dataframe based on one or more variables.
  • mutate() : Adds new variables to a dataframe, or modifies existing ones, based on transformations of existing variables.
  • group_by() : Groups a dataframe by one or more variables.
  • summarize() : Computes summary statistics for each group in a dataframe.
  • rename() : Renames variables in a dataframe.
  • distinct() : Returns unique rows in a dataframe.
  • slice() : Selects rows by position.
  • pull() : Extracts a single variable from a dataframe as a vector.
  • bind_cols() : Binds multiple dataframes by column.
  • bind_rows() : Binds multiple dataframes by row.
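
The functions listed above can be tried on a small data frame. The following is a sketch on hypothetical data: it extends the students example with a made-up class column and a fourth student, and the names class_summary and new_student are chosen for illustration.

```r
library(dplyr)

# Hypothetical data, extending the students example with a class column
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Dina"),
  class = c("A", "B", "A", "B"),
  math_score = c(85, 90, 75, 60),
  science_score = c(88, 82, 79, 70)
)

# Row and column operations
students %>% filter(math_score >= 80)     # rows that meet a condition
students %>% select(name, math_score)     # columns by name
students %>% arrange(desc(math_score))    # sort rows, highest score first
students %>% rename(math = math_score)    # new_name = old_name
students %>% slice(1:2)                   # rows by position
students %>% distinct(class)              # unique values of class
students %>% pull(name)                   # one column as a vector

# Add a column, then compute summary statistics for each group
class_summary <- students %>%
  mutate(total_score = math_score + science_score) %>%
  group_by(class) %>%
  summarize(mean_total = mean(total_score), n_students = n())

print(class_summary)

# Combine data frames
new_student <- data.frame(
  name = "Evan", class = "A", math_score = 70, science_score = 65
)
bind_rows(students, new_student)            # stack rows, matching columns by name
bind_cols(students, data.frame(id = 1:4))   # add columns, matching rows by position
```

None of these functions changes students itself. Each call returns a new data frame (or a vector, in the case of pull()), which has to be assigned to a name if it is needed later.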

3.3.2 Data Manipulation in R

In progress.