Chapter 10 Advanced dataframe manipulation
In this chapter we’ll cover some more advanced functions and procedures for manipulating dataframes.
# Exam data
exam <- data.frame(
  id = 1:5,
  q1 = c(1, 5, 2, 3, 2),
  q2 = c(8, 10, 9, 8, 7),
  q3 = c(3, 7, 4, 6, 4))

# Demographic data
demographics <- data.frame(
  id = 1:5,
  sex = c("f", "m", "f", "f", "m"),
  age = c(25, 22, 24, 19, 23))

# Combine exam and demographics
combined <- merge(x = exam,
                  y = demographics,
                  by = "id")

# Mean q1 score for each sex
# (the formula is passed unnamed so the call works
#  in both older and newer versions of R)
aggregate(q1 ~ sex,
          data = combined,
          FUN = mean)
##   sex  q1
## 1   f 2.0
## 2   m 3.5

# Median q3 score for each sex, but only for those
# older than 20
aggregate(q3 ~ sex,
          data = combined,
          subset = age > 20,
          FUN = median)
##   sex  q3
## 1   f 3.5
## 2   m 5.5

# Many summary statistics by sex using dplyr!
library(dplyr)
combined %>% group_by(sex) %>%
  summarise(
    q1.mean = mean(q1),
    q2.mean = mean(q2),
    q3.mean = mean(q3),
    age.mean = mean(age),
    N = n())
## # A tibble: 2 x 6
##      sex q1.mean q2.mean q3.mean age.mean     N
##   <fctr>   <dbl>   <dbl>   <dbl>    <dbl> <int>
## 1      f     2.0     8.3     4.3       23     3
## 2      m     3.5     8.5     5.5       22     2
In Chapter 6, you learned how to calculate statistics on subsets of data using indexing. However, you may have noticed that indexing is neither very intuitive nor very efficient. If you want to calculate statistics for many different subsets of data (e.g., mean birth rate for every country), you’d have to write a new indexing command for each subset, which could take forever. Thankfully, R has some great built-in functions like aggregate() that allow you to easily apply functions (like mean()) to a dependent variable (like birth rate) for each level of one or more independent variables (like country) with just a few lines of code.
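To make the contrast concrete, here is a minimal sketch that computes the same group means both ways. It rebuilds the q1 and sex columns from the exam example at the start of the chapter so that it runs on its own. The object names mean.f, mean.m, and by.sex are just illustrative choices, not anything R requires.

```r
# Same toy data as the chapter's opening example, rebuilt here
# so this block is self-contained
combined <- data.frame(
  id = 1:5,
  sex = c("f", "m", "f", "f", "m"),
  q1 = c(1, 5, 2, 3, 2))

# Indexing approach: one hand-written command per subset
mean.f <- mean(combined$q1[combined$sex == "f"])
mean.m <- mean(combined$q1[combined$sex == "m"])

# aggregate() approach: a single call covers every level of sex
# (by.sex is an illustrative name)
by.sex <- aggregate(q1 ~ sex,
                    data = combined,
                    FUN = mean)
by.sex
##   sex  q1
## 1   f 2.0
## 2   m 3.5
```

With only two groups the difference is small, but the indexing approach needs one more line for every extra group, while the aggregate() call stays the same no matter how many levels the independent variable has.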