Chapter 2 Introduction to Tidyverse

The tidyverse is a collection of R packages that provides a set of tools for data manipulation, exploration, visualization, and modeling.

2.1 Core Tidyverse

Some of the key tidyverse packages that you are likely to use in data analysis include:

  • ggplot2 : A powerful and flexible package for creating graphics and data visualizations. It follows the grammar of graphics philosophy, allowing users to build complex plots layer by layer.

  • dplyr : A package for data manipulation tasks such as filtering, selecting, arranging, grouping, and summarizing data.

  • tidyr : A package for reshaping and tidying data, for example converting between wide and long layouts.

  • readr : A package for reading and writing structured text files, including CSV, TSV, and fixed-width format files.

  • purrr : A package for functional programming in R, providing tools for applying functions over lists and vectors.

  • tibble : A tidy alternative to traditional data frames, providing cleaner printing and stricter subsetting, and never silently changing column names or types.

  • forcats : A package for working with categorical variables (factors) in R.

  • lubridate : A package for working with dates and times. It provides functions to parse, manipulate, and work with date-time objects efficiently.

2.2 Install and Load Packages

Install all the packages in the tidyverse by running:

install.packages('tidyverse')

Load the core tidyverse by running:

library(tidyverse)
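
In a script that may run on a machine where the tidyverse is not yet installed, it can be convenient to install only when needed. The following is a minimal sketch of that idea. Note that ensure_installed() is a hypothetical helper written for this illustration, not a tidyverse function, and its installer argument is an assumption added so the behaviour can be tried without downloading anything.

```r
# Hypothetical helper (not part of the tidyverse): install a package only
# when it is not already available, then report whether it can be loaded.
ensure_installed <- function(pkg, installer = utils::install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    installer(pkg)  # only reached when the package is missing
  }
  # TRUE if the package is available after the attempt, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

# Typical use (commented out so that sourcing this sketch has no side effects):
# ensure_installed("tidyverse")
# library(tidyverse)
```

Installation only has to happen once per machine, whereas library(tidyverse) has to be called in every new R session.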

2.3 Pipe Operator

The pipe operator (%>%) comes from the magrittr package and is re-exported by dplyr and other tidyverse packages, so it is available as soon as one of them is loaded.

The pipe operator (%>%) lets you write a sequence of operations in the order in which they are carried out. It passes the output of one operation to the next as its first argument, so x %>% f(y) is equivalent to f(x, y). This makes code easier to read, reduces the need for intermediate variables, and keeps the data manipulation workflow clear.
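
The way the pipe hands a value to the next function can be seen in a small sketch using only base R functions. The vector x and the variable names here are made up for illustration.

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)

# Nested calls are read from the inside out: sqrt() first, then sum()
nested <- sum(sqrt(x))

# Piped calls are read left to right; x %>% sqrt() is the same as sqrt(x)
piped <- x %>% sqrt() %>% sum()

# The piped value becomes the FIRST argument, so this is round(3.14159, 2)
rounded <- 3.14159 %>% round(2)

identical(nested, piped)  # both are 6
```

Both versions compute the same result; the piped version simply lists the steps in the order they happen.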

Example of using the pipe (%>%) for data manipulation:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Sample dataset
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# Calculate the total score for each student and filter students with a total score above 160
students_filtered <- students %>%
  mutate(total_score = math_score + science_score) %>%  # Add a new column for total score
  filter(total_score > 160)  # Filter students with a total score above 160

print(students_filtered)
##    name math_score science_score total_score
## 1 Alice         85            88         173
## 2   Bob         90            82         172
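
For comparison, the same result can be produced without the pipe. The sketch below repeats the sample data so that it runs on its own; the names with_total, students_stepwise, and students_nested are chosen here for illustration.

```r
library(dplyr)

# Same sample dataset as in the example above
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  math_score = c(85, 90, 75),
  science_score = c(88, 82, 79)
)

# Step-by-step version: needs an intermediate variable
with_total <- mutate(students, total_score = math_score + science_score)
students_stepwise <- filter(with_total, total_score > 160)

# Nested version: no intermediate variable, but it reads from the inside out
students_nested <- filter(
  mutate(students, total_score = math_score + science_score),
  total_score > 160
)

# Both approaches give the same rows (Alice and Bob)
all.equal(students_stepwise, students_nested)
```

The piped version avoids both the extra variable of the step-by-step form and the inside-out reading order of the nested form.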