Chapter 10 Advanced dataframe manipulation

Figure 10.1: Make your dataframes dance for you

In this chapter we’ll cover some more advanced functions and procedures for manipulating dataframes.

# Exam data
exam <- data.frame(
  id = 1:5,
  q1 = c(1, 5, 2, 3, 2),
  q2 = c(8, 10, 9, 8, 7),
  q3 = c(3, 7, 4, 6, 4))

# Demographic data
demographics <- data.frame(
  id = 1:5,
  sex = c("f", "m", "f", "f", "m"),
  age = c(25, 22, 24, 19, 23))

# Combine exam and demographics
combined <- merge(x = exam, 
              y = demographics, 
              by = "id")

# Mean q1 score for each sex
# (formula passed unnamed so the call works across R versions)
aggregate(q1 ~ sex, 
          data = combined, 
          FUN = mean)
##   sex  q1
## 1   f 2.0
## 2   m 3.5


# Median q3 score for each sex, but only for those
#   older than 20
aggregate(q3 ~ sex, 
          data = combined,
          subset = age > 20,
          FUN = median)
##   sex  q3
## 1   f 3.5
## 2   m 5.5


# Many summary statistics by sex using dplyr!
library(dplyr)
combined %>% group_by(sex) %>%
  summarise(
    q1.mean = mean(q1),
    q2.mean = mean(q2),
    q3.mean = mean(q3),
    age.mean = mean(age),
    N = n())
## # A tibble: 2 × 6
##   sex   q1.mean q2.mean q3.mean age.mean     N
##   <chr>   <dbl>   <dbl>   <dbl>    <dbl> <int>
## 1 f         2      8.33    4.33     22.7     3
## 2 m         3.5    8.5     5.5      22.5     2

In Chapter 6, you learned how to calculate statistics on subsets of data using indexing. However, you may have noticed that indexing is not very intuitive and not terribly efficient. If you want to calculate statistics for many different subsets of data (e.g., the mean birth rate for every country), you’d have to write a new indexing command for each subset, which could take forever. Thankfully, R has some great built-in functions like aggregate() that allow you to easily apply a function (like mean()) to a dependent variable (like birth rate) for each level of one or more independent variables (like country) with just a few lines of code.
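To make the contrast concrete, here is a minimal sketch that computes the same group means both ways. It rebuilds a small copy of the exam data so it runs on its own. The over.21 column and the object names (mean.f, mean.m, q1.by.sex, q1.by.sex.age) are hypothetical additions made up for this illustration, not part of the original example.

```r
# Self-contained copy of the relevant columns from the merged data
combined <- data.frame(
  id  = 1:5,
  q1  = c(1, 5, 2, 3, 2),
  sex = c("f", "m", "f", "f", "m"),
  age = c(25, 22, 24, 19, 23))

# Indexing approach: one separate command for every subset
mean.f <- mean(combined$q1[combined$sex == "f"])
mean.m <- mean(combined$q1[combined$sex == "m"])

# aggregate() approach: a single call covers every level of sex
q1.by.sex <- aggregate(q1 ~ sex, 
                       data = combined, 
                       FUN = mean)

# Two independent variables at once: sex and a hypothetical
#   over.21 flag created just for this sketch
combined$over.21 <- combined$age > 21
q1.by.sex.age <- aggregate(q1 ~ sex + over.21, 
                           data = combined, 
                           FUN = mean)
```

With only two levels of sex, the indexing version is still manageable, but it grows by one line for every additional group, while the aggregate() call stays the same length. Note that aggregate() only returns rows for combinations that actually occur in the data.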