5 Data Manipulation

In this section, we discuss the basics of data manipulation using the library dplyr.

5.1 Data manipulation

See also the introduction to dplyr vignette for more information. Let’s consider the movies data set which we load in the following way.

## Warning: package 'dplyr' was built under R version 3.4.2
movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData", as.is = TRUE)

The data set movies consists of 100,000 observations of the variables MovieID, Title, Genres, UserID, Rating and Timestamp. These observations were selected from 746 consecutive users selected with a random start point from MovieLens.
The data set movies consists of 100,000 observations of the variables MovieID, Title, Genres, UserID, Rating and Timestamp. This data is from a much larger data set, MovieLens, freely available from GroupLens Research. The movies data set contains 746 consecutive users extracted from MovieLens, beginning at a random starting point.

Let’s begin by converting the data fame to a tibble.

movies <- tbl_df(movies)

Now, we can just type movies to get an overview of what is contained in the tibble movies. I like using tibbles because if I accidentally (or on purpose) type the name of a tibble, the output is presented in a usable form. I will use the word “data frame” below, even though I will typically be working with tibbles.

We will be interested in the following commands:

  1. select() forms a new data frame with selected columns.
  2. arrange() forms a new data frame with row(s) arranged in a specified order.
  3. filter() forms a new data frame consisting of rows that satisfy certain filtering conditions.
  4. mutate() and transmute() allow you to create new columns out of a data frame. mutate adds to the data frame, and transmute creates a new data frame.
  5. summarize() summarizes a data frame into a single row.
  6. distinct() collapses the identical observations into a single one.
  7. group_by() groups the data to perform tasks by groups.

Let’s see how to use some of these with our data frame movies. You should try all of these commands to see what they do. (Be sure to use tbl_df first so you can get a good look at the output!)

  1. Create a new data frame with the columns MovieID and Rating. select(movies, Rating, MovieID).
  2. Order the ratings by the time at which they were reviewed. arrange(movies, Timestamp)
  3. Find all 5 star ratings. filter(movies, Rating == 5)
  4. Find the mean of all Ratings in the data frame. summarize(movies, mean(Rating))
  5. Form a data frame consisting of the unique User ID’s. distinct(movies, UserID)
  6. We will save mutate() and transmute() for a different data set.
  7. group_by() really only works well when combined with other commands.
  8. head isn’t a dplyr command, but can be used to view a specified number of rows of a data frame.

The utility of these commands is more obvious when we start to combine them. The key tool for combining the commands is the pipe operator, %>%. The pipe operator allows you to feed the result of one expression as input to a function. For example, x %>% f(y) %>% g(z) becomes g(f(x,y),z).

Example Find the mean4 rating of Toy Story, which has MovieID 1.

filter(movies, MovieID == 1) %>%
  summarize(mean(Rating))
## # A tibble: 1 x 1
##   `mean(Rating)`
##            <dbl>
## 1       3.875862

The filter() command creates a data frame that consists solely of the observations of Toy Story, and summarize() computes the mean rating.

Example Create a data frame consisting of the observations associated with movies that have an average rating of 5 stars. That is, each rating of the movie was 5 stars.

In order to do this, we will use the group_by() function to find the mean rating of each movie.

movies %>% 
  group_by(Title) %>%
  summarize(Rating = mean(Rating))
## # A tibble: 6,200 x 2
##                                                      Title   Rating
##                                                      <chr>    <dbl>
##  1 ...All the Marbles (a.k.a. The California Dolls) (1981) 3.500000
##  2                                 ...And God Spoke (1993) 4.000000
##  3                           ...And Justice for All (1979) 3.600000
##  4                          *batteries not included (1987) 3.125000
##  5                                               10 (1979) 3.500000
##  6                                 10 Items or Less (2006) 3.000000
##  7                              10 Rillington Place (1971) 4.250000
##  8                       10 Things I Hate About You (1999) 3.450000
##  9                                   10 to Midnight (1983) 2.500000
## 10                                   101 Dalmatians (1996) 3.186047
## # ... with 6,190 more rows

Now, we could sort those in decreasing order to see ones that have a mean rating of 5.

movies %>% 
  group_by(Title) %>%
  summarize(Rating = mean(Rating)) %>%
  arrange(desc(Rating))
## # A tibble: 6,200 x 2
##                                                   Title Rating
##                                                   <chr>  <dbl>
##  1                                         49 Up (2005)      5
##  2                   5,000 Fingers of Dr. T, The (1953)      5
##  3                                     Anastasia (1956)      5
##  4                  Ay, Carmela! (¡Ay, Carmela!) (1990)      5
##  5    Ballad of Narayama, The (Narayama bushiko) (1983)      5
##  6                              Band of the Hand (1986)      5
##  7                                      Beerfest (2006)      5
##  8                                Big Clock, The (1948)      5
##  9 Boss of It All, The (Direktøren for det hele) (2006)      5
## 10                                          Boys (1996)      5
## # ... with 6,190 more rows

Now, we can filter out those whose mean rating is 5. Note, it is not necessary to sort first.

movies %>% 
  group_by(Title) %>%
  summarize(Rating = mean(Rating)) %>%
  filter(Rating == 5)
## # A tibble: 113 x 2
##                                                   Title Rating
##                                                   <chr>  <dbl>
##  1                                         49 Up (2005)      5
##  2                   5,000 Fingers of Dr. T, The (1953)      5
##  3                                     Anastasia (1956)      5
##  4                  Ay, Carmela! (¡Ay, Carmela!) (1990)      5
##  5    Ballad of Narayama, The (Narayama bushiko) (1983)      5
##  6                              Band of the Hand (1986)      5
##  7                                      Beerfest (2006)      5
##  8                                Big Clock, The (1948)      5
##  9 Boss of It All, The (Direktøren for det hele) (2006)      5
## 10                                          Boys (1996)      5
## # ... with 103 more rows

And we see that there are 113 movies which have mean rating of 5.

Example Find the lowest rated movie.

movies %>%
  group_by(Title) %>%
  summarize(mr = mean(Rating)) %>%
  arrange(mr)
## # A tibble: 6,200 x 2
##                                            Title    mr
##                                            <chr> <dbl>
##  1   Avventura, L (a.k.a. Adventure, The) (1960)   0.5
##  2                         Big Combo, The (1955)   0.5
##  3                             BloodRayne (2005)   0.5
##  4                       Bratz: The Movie (2007)   0.5
##  5  Care Bears Movie II: A New Generation (1986)   0.5
##  6                                  Chaos (2001)   0.5
##  7 Chinese Ghost Story, A (Sinnui yauwan) (1987)   0.5
##  8                             Devil Doll (1964)   0.5
##  9             Diary of a Mad Black Woman (2005)   0.5
## 10              Gate II: Trespassers, The (1990)   0.5
## # ... with 6,190 more rows

Yikes! 0.5 out of 5 stars!

To see which movie that received 5 stars has the most ratings, we would do:

movies %>%
  group_by(Title) %>%
  summarize(mr = mean(Rating), numRating = n()) %>%
  arrange(desc(mr), desc(numRating))
## # A tibble: 6,200 x 3
##                                       Title    mr numRating
##                                       <chr> <dbl>     <int>
##  1                     Lady Eve, The (1941)     5         4
##  2                           Frances (1982)     5         3
##  3                              Boys (1996)     5         2
##  4                   Denise Calls Up (1995)     5         2
##  5                        Manny & Lo (1996)     5         2
##  6              Pier, The (La Jetée) (1962)     5         2
##  7 Tin Drum, The (Blechtrommel, Die) (1979)     5         2
##  8      Tinker, Tailor, Soldier, Spy (1979)     5         2
##  9               Trouble in Paradise (1932)     5         2
## 10                             49 Up (2005)     5         1
## # ... with 6,190 more rows

And, we see that The Lady Eve had 4 ratings. Never heard of it. If you are interested, IMDB says “A trio of classy card sharps targets the socially awkward heir to brewery millions for his money, until one of them falls in love with him.” Out of curiosity, what is The Lady Eve’s movie ID.

movies %>%
  filter(Title == "Lady Eve, The (1941)") %>%
  select(MovieID) %>%
  distinct()
## # A tibble: 1 x 1
##   MovieID
##     <int>
## 1    2935

Next, out of movies with a lot of ratings, which has the highest rating? Well, we need to decide what a lot of ratings means. Let’s see how many ratings some of the movies had.

movies %>%
  group_by(Title) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 6,200 x 2
##                                                           Title count
##                                                           <chr> <int>
##  1                             Silence of the Lambs, The (1991)   368
##  2                                          Pulp Fiction (1994)   359
##  3                                          Forrest Gump (1994)   349
##  4                             Shawshank Redemption, The (1994)   344
##  5                                         Jurassic Park (1993)   338
##  6 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)   323
##  7                                         Fugitive, The (1993)   305
##  8                            Terminator 2: Judgment Day (1991)   296
##  9                         Independence Day (a.k.a. ID4) (1996)   295
## 10                                             Apollo 13 (1995)   291
## # ... with 6,190 more rows

If we want a list of the 5 most rated movies, we could do:

movies %>%
  group_by(Title) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  head(n = 5)
## # A tibble: 5 x 2
##                              Title count
##                              <chr> <int>
## 1 Silence of the Lambs, The (1991)   368
## 2              Pulp Fiction (1994)   359
## 3              Forrest Gump (1994)   349
## 4 Shawshank Redemption, The (1994)   344
## 5             Jurassic Park (1993)   338

So, it seems like 200 ratings could classify as “a lot”. Let’s see which movie with at least 200 ratings has the highest mean rating.

movies %>%
  group_by(Title) %>%
  summarize(count = n(), meanRating = mean(Rating)) %>%
  filter(count > 200) %>%
  arrange(desc(meanRating))
## # A tibble: 40 x 3
##                                                                          Title
##                                                                          <chr>
##  1                                            Shawshank Redemption, The (1994)
##  2                                                       Godfather, The (1972)
##  3                                                      Schindlers List (1993)
##  4                                                  Usual Suspects, The (1995)
##  5                                            Silence of the Lambs, The (1991)
##  6                                                          Matrix, The (1999)
##  7                                                      American Beauty (1999)
##  8                Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)
##  9 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (19
## 10                                                     Sixth Sense, The (1999)
## # ... with 30 more rows, and 2 more variables: count <int>,
## #   meanRating <dbl>

I have to say, that’s not a bad list of movies. If we want to see all of the movies in this list, we can pipe to print, and specify the number of movies we wish to see. (If you definitely want to see all of the movies, you can pipe to print(n = nrow(.)))

movies %>%
  group_by(Title) %>%
  summarize(count = n(), meanRating = mean(Rating)) %>%
  filter(count > 200) %>%
  arrange(desc(meanRating)) %>%
  select(Title, meanRating) %>% 
  print(n = 40)
## # A tibble: 40 x 2
##                                                                          Title
##                                                                          <chr>
##  1                                            Shawshank Redemption, The (1994)
##  2                                                       Godfather, The (1972)
##  3                                                      Schindlers List (1993)
##  4                                                  Usual Suspects, The (1995)
##  5                                            Silence of the Lambs, The (1991)
##  6                                                          Matrix, The (1999)
##  7                                                      American Beauty (1999)
##  8                Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)
##  9 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (19
## 10                                                     Sixth Sense, The (1999)
## 11                       Star Wars: Episode V - The Empire Strikes Back (1980)
## 12                                                                Fargo (1996)
## 13                                                        Fugitive, The (1993)
## 14                                                         Pulp Fiction (1994)
## 15                                                  Saving Private Ryan (1998)
## 16                                                           Braveheart (1995)
## 17                                                 Seven (a.k.a. Se7en) (1995)
## 18                           Star Wars: Episode VI - Return of the Jedi (1983)
## 19                                                         Forrest Gump (1994)
## 20                                                            Apollo 13 (1995)
## 21                                           Terminator 2: Judgment Day (1991)
## 22                                          12 Monkeys (Twelve Monkeys) (1995)
## 23                                                            Toy Story (1995)
## 24                                                            Rock, The (1996)
## 25                                                       Lion King, The (1994)
## 26                                                   Back to the Future (1985)
## 27                                                   Dances with Wolves (1990)
## 28                                           E.T. the Extra-Terrestrial (1982)
## 29                                                              Aladdin (1992)
## 30                                                 Beauty and the Beast (1991)
## 31                                                        Jurassic Park (1993)
## 32                                           Die Hard: With a Vengeance (1995)
## 33                                                                Speed (1994)
## 34                                        Independence Day (a.k.a. ID4) (1996)
## 35                                                            True Lies (1994)
## 36                                                  Mission: Impossible (1996)
## 37                                                               Batman (1989)
## 38                                                             Stargate (1994)
## 39                                                              Twister (1996)
## 40                                           Ace Ventura: Pet Detective (1994)
## # ... with 1 more variables: meanRating <dbl>

You just got your summer movie watching list. You’re welcome.

Now, why don’t you try to figure out which genre had the highest rating, which movie was rated the most, and the 10 worst rated movies that were rated at least 100 times. The solutions are printed out below, so that you can check your answers.

## # A tibble: 644 x 2
##                                  Genres GenreRating
##                                   <chr>       <dbl>
##  1 Comedy|Drama|Fantasy|Musical|Romance    5.000000
##  2             Animation|Fantasy|Horror    4.666667
##  3     Animation|Comedy|Fantasy|Musical    4.500000
##  4                      Animation|Drama    4.500000
##  5                  Comedy|Drama|Sci-Fi    4.500000
##  6 Crime|Drama|Fantasy|Film-Noir|Horror    4.500000
##  7                      Crime|Drama|War    4.500000
##  8                  Documentary|Fantasy    4.500000
##  9         Drama|Fantasy|Horror|Romance    4.500000
## 10                 Drama|Horror|Romance    4.500000
## # ... with 634 more rows

What kind of genre is Comedy|Drama|Fantasy|Musical|Romance? That is crazy. What movie could that be?

movies %>%
  filter(Genres == "Comedy|Drama|Fantasy|Musical|Romance") %>%
  select(Title) %>%
  distinct()
## # A tibble: 1 x 1
##                                    Title
##                                    <chr>
## 1 Saddest Music in the World, The (2003)
## # A tibble: 6,200 x 2
##                                                           Title numRating
##                                                           <chr>     <int>
##  1                             Silence of the Lambs, The (1991)       368
##  2                                          Pulp Fiction (1994)       359
##  3                                          Forrest Gump (1994)       349
##  4                             Shawshank Redemption, The (1994)       344
##  5                                         Jurassic Park (1993)       338
##  6 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)       323
##  7                                         Fugitive, The (1993)       305
##  8                            Terminator 2: Judgment Day (1991)       296
##  9                         Independence Day (a.k.a. ID4) (1996)       295
## 10                                             Apollo 13 (1995)       291
## # ... with 6,190 more rows
## # A tibble: 195 x 3
##                                    Title MeanRating numRating
##                                    <chr>      <dbl>     <int>
##  1                    Judge Dredd (1995)   2.415888       107
##  2       Honey, I Shrunk the Kids (1989)   2.642857       105
##  3                     Waterworld (1995)   2.716667       180
##  4 Ace Ventura: When Nature Calls (1995)   2.725806       124
##  5       Blair Witch Project, The (1999)   2.732000       125
##  6                     Armageddon (1998)   2.767442       129
##  7                 Batman Forever (1995)   2.885787       197
##  8           Addams Family Values (1993)   2.943478       115
##  9                Charlies Angels (2000)   2.950000       100
## 10                 Demolition Man (1993)   3.015873       126
## # ... with 185 more rows

5.2 Return to Batting exercise from Chapter Two

When we first considered the Batting data set in the Lahman library, we ignored the stint variable. When a player played for more than one team in a season, the statistics of the player for each team is listed on separate lines. So, in chapter two, when you were asked to find the player that hit the most triples in one season, you should have added up all of the triples that each player hit for the various teams. We didn’t have the tools to do that at the time, but now we can.

Let’s begin by creating a data frame that has players’ ID, year ID and the number of triples that they hit in that year.

library(Lahman)
Batting %>%
  group_by(playerID, yearID) %>%
  summarize(X3B = sum(X3B))
## # A tibble: 95,250 x 3
## # Groups:   playerID [?]
##     playerID yearID   X3B
##        <chr>  <int> <int>
##  1 aardsda01   2004     0
##  2 aardsda01   2006     0
##  3 aardsda01   2007     0
##  4 aardsda01   2008     0
##  5 aardsda01   2009     0
##  6 aardsda01   2010     0
##  7 aardsda01   2012     0
##  8 aardsda01   2013     0
##  9 aardsda01   2015     0
## 10 aaronha01   1954     6
## # ... with 95,240 more rows

Now, we just need to find the largest number of triples.

Batting %>%
  group_by(playerID, yearID) %>%
  summarize(X3B = sum(X3B)) %>%
  arrange(desc(X3B))
## # A tibble: 95,250 x 3
## # Groups:   playerID [18,915]
##     playerID yearID   X3B
##        <chr>  <int> <int>
##  1 wilsoch01   1912    36
##  2   orrda01   1886    31
##  3 reitzhe01   1894    31
##  4 werdepe01   1893    29
##  5 davisha01   1897    28
##  6 williji01   1899    28
##  7 davisge01   1893    27
##  8 thompsa01   1894    27
##  9 crawfsa01   1914    26
## 10 cuyleki01   1925    26
## # ... with 95,240 more rows

And, we see that the answer is still John Owen Wilson from 1912, just as it would have been without taking into account stints.

5.3 Exercises

  1. Consider the movieLensData.
    1. What is the movie with the highest mean rating that has been rated at least 30 times?
    2. Which genre has been rated the most? (For the purpose of this, consider Comedy and Comedy|Romance completely different genres, for example.)
    3. Which movie in the genre Comedy|Romance that has been rated at least 75 times has the lowest mean rating? Which has the highest mean rating?
    4. Which movie that has a mean rating of 4 or higher has been rated the most times?
    5. Which user gave the highest mean ratings?
  2. Consider the Batting data set in the Lahman library. This gives the batting statistics of 101,332 players who have played baseball from 1871 through 2015. Answer the following questions.
    1. Which player has been hit-by-pitch the most number of times?
    2. How many doubles were hit in 1871?
    3. Which team has the most number of home runs?
    4. Which player who has played in at least 500 games has scored the most number of runs per game?
    5. Which player has the most lifetime at bats without ever having hit a home run?
    6. Which player who was active in 2015 has the most lifetime at bats without ever having hit a home run?
  3. More questions on the Batting data set. Make sure to take into account stint in these problems.
    1. Verify that Curtis Granderson hit the most triples in a single season since 1960.
    2. In which season did the major league leader in triples have the fewest triples?
    3. In which season was there the biggest difference between the major league leader in stolen bases (SB) and the player with the second most stolen bases?
  4. Consider the Pitching data in the Lahman data set.
    1. Which pitcher has won (W) the most number of games?
    2. Which pitcher has lost (L) the most number of games?
    3. Which pitcher has hit the most opponents with a pitch (HBP)?
    4. Which year had the most number of complete games (CG)?
    5. Among pitchers who have won at least 100 games, which has the highest winning percentage? (Winning percentage is wins divided by wins + losses.)
    6. Among pitchers who have struck out at least 500 batters, which has the highest strikeout to walk ratio? (Strikeout to walk ratio is SO/BB.)
    7. List the pitchers for the St Louis Cardinals (SLN) in 2006 with at least 30 recorded outs (IPouts), sorted by ERA from lowest to highest.
     
  5. Complete the Data Manipulation in R with dplyr datacamp tutorial. Note that only the first set of exercises is free.

  6. Complete the Cleaning Data in R datacamp tutorial. This course uses the tidyr package, which is not covered in the textbook. However, it is a good complement to the dplyr material in this chapter.

5.4 Solutions to Two Data Wrangling Exercises

1(a).

movies %>%
  group_by(Title) %>%
  summarize(meanRating = mean(Rating), numRating = n()) %>%
  filter(numRating >= 30) %>%
  arrange(desc(meanRating))
## # A tibble: 874 x 3
##                                     Title meanRating numRating
##                                     <chr>      <dbl>     <int>
##  1       Shawshank Redemption, The (1994)   4.438953       344
##  2                  Godfather, The (1972)   4.425439       228
##  3 Wallace & Gromit: A Close Shave (1995)   4.415663        83
##  4                 Schindlers List (1993)   4.396491       285
##  5             Usual Suspects, The (1995)   4.393281       253
##  6         Godfather: Part II, The (1974)   4.385135       148
##  7              Lawrence of Arabia (1962)   4.371622        74
##  8                       Big Night (1996)   4.363636        33
##  9    City of God (Cidade de Deus) (2002)   4.340909        44
## 10                      Sting, The (1973)   4.337079        89
## # ... with 864 more rows

1(b).

movies %>%
  group_by(Genres) %>%
  summarize(numRatings = n()) %>%
  arrange(desc(numRatings))
## # A tibble: 644 x 2
##                       Genres numRatings
##                        <chr>      <int>
##  1                     Drama       8141
##  2                    Comedy       7801
##  3            Comedy|Romance       3983
##  4              Comedy|Drama       3550
##  5      Comedy|Drama|Romance       2942
##  6             Drama|Romance       2775
##  7   Action|Adventure|Sci-Fi       2604
##  8 Action|Adventure|Thriller       1763
##  9            Drama|Thriller       1627
## 10               Crime|Drama       1566
## # ... with 634 more rows

  1. Rating should almost certainly have been reclassified as an ordered factor, and it is not clear that taking the mean is really a valid thing to do.