Chapter 7 dplyr
Hello! In this tutorial, we’ll learn more about dplyr
a tidyverse
package that helps you manipulate data when doing data science. To work through this material, we’ll use the crowdtangle_barbenheimer_2023.csv
dataset, which includes Facebook posts about the Barbenheimer meme.
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0 v purrr 0.3.5
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.3 v forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
sm_data <- read_csv("data/crowdtangle_barbenheimer_2023.csv") |> clean_names() # make sure this file is in your working directory!
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 16794 Columns: 41
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (18): Page Name, User Name, Page Category, Page Admin Top Country, Page...
## dbl (15): Facebook Id, Followers at Posting, Likes, Comments, Shares, Love,...
## num (2): Total Interactions, Total Interactions (weighted — Likes 1x Sha...
## lgl (3): Sponsor Id, Sponsor Name, Sponsor Category
## dttm (1): Page Created
## date (1): Post Created Date
## time (1): Post Created Time
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [16,794 x 41] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ page_name : chr [1:16794] "Ghost Killer Entertainment" "Jetpack Cave" "Babe" "Jetpack Cave" ...
## $ user_name : chr [1:16794] "GhostKillerEntertainment" "JetpackCave" "babedotnet" "JetpackCave" ...
## $ facebook_id : num [1:16794] 1e+14 1e+14 1e+14 1e+14 1e+14 ...
## $ page_category : chr [1:16794] "RECORD_LABEL" "ENTERTAINMENT_SITE" "MEDIA_NEWS_COMPANY" "ENTERTAINMENT_SITE" ...
## $ page_admin_top_country : chr [1:16794] "US" "MX" "GB" "MX" ...
## $ page_description : chr [1:16794] "Ghost Killer Entertainment:\nRecord Label / Music Promotion Company\nEst. 2014\n" "¡Hola! Bienvenidos a Jetpack Cave <U+0001F31F> ¡un espacio donde encontraras las novedades del mundo geek! <U+0001F9B8>" "For girls who don't give a fuck\n\n" "¡Hola! Bienvenidos a Jetpack Cave <U+0001F31F> ¡un espacio donde encontraras las novedades del mundo geek! <U+0001F9B8>" ...
## $ page_created : POSIXct[1:16794], format: "2017-05-09 05:02:02" "2019-06-07 02:31:45" ...
## $ likes_at_posting : chr [1:16794] "686064" "402302" "1323914" "409135" ...
## $ followers_at_posting : num [1:16794] 882069 413868 1389225 420993 292210 ...
## $ post_created : chr [1:16794] "2023-07-22 10:37:12 CDT" "2023-07-18 11:10:12 CDT" "2023-07-26 06:50:02 CDT" "2023-07-25 12:00:07 CDT" ...
## $ post_created_date : Date[1:16794], format: "2023-07-22" "2023-07-18" ...
## $ post_created_time : 'hms' num [1:16794] 10:37:12 11:10:12 06:50:02 12:00:07 ...
## ..- attr(*, "units")= chr "secs"
## $ type : chr [1:16794] "Photo" "Photo" "Photo" "Photo" ...
## $ total_interactions : num [1:16794] 195095 145913 112162 103261 89916 ...
## $ likes : num [1:16794] 54601 24812 44516 11170 62449 ...
## $ comments : num [1:16794] 3308 3743 852 1381 301 ...
## $ shares : num [1:16794] 41801 12416 9333 16146 5536 ...
## $ love : num [1:16794] 17270 36791 3121 8382 1236 ...
## $ wow : num [1:16794] 320 461 272 315 146 97 456 56 196 14 ...
## $ haha : num [1:16794] 77094 66519 53840 65550 20115 ...
## $ sad : num [1:16794] 20 49 10 7 13 10 5 21 8 2 ...
## $ angry : num [1:16794] 5 16 4 6 0 8 1 5 3 1 ...
## $ care : num [1:16794] 676 1106 214 304 120 ...
## $ video_share_status : chr [1:16794] NA NA NA NA ...
## $ is_video_owner : chr [1:16794] "-" "-" "-" "-" ...
## $ post_views : num [1:16794] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_views : num [1:16794] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_views_for_all_crossposts : num [1:16794] 0 0 0 0 0 0 0 0 0 0 ...
## $ video_length : chr [1:16794] "N/A" "N/A" "N/A" "N/A" ...
## $ url : chr [1:16794] "https://www.facebook.com/100044381711160/posts/829033098586073" "https://www.facebook.com/100063550571635/posts/806119938183018" "https://www.facebook.com/100064803423556/posts/665983275571781" "https://www.facebook.com/100063550571635/posts/810761124385566" ...
## $ message : chr [1:16794] "Check out our Emo spotify playlist! http://bit.ly/EmoNeverDies:=:https://open.spotify.com/playlist/3czha2tGePoOcVI4xwa3j3" "¡QUIERE SER KEN! <U+0001FA77> Cillian Murphy (Oppenheimer) comentó que esta abierto a interpretar a Ken en una "| __truncated__ "The true Barbenheimer experience" "¡BARBENHEIMER ES REAL! En un cine de Estados Unidos durante una función de 'Oppenheimer' hubo un fallo en el pr"| __truncated__ ...
## $ link : chr [1:16794] "https://www.facebook.com/photo.php?fbid=829033051919411&set=a.723874642435253&type=3" "https://www.facebook.com/photo.php?fbid=806119904849688&set=a.448295013965514&type=3" "https://www.facebook.com/photo.php?fbid=665983258905116&set=a.598827208954055&type=3" "https://www.facebook.com/photo.php?fbid=810761107718901&set=a.448295013965514&type=3" ...
## $ final_link : chr [1:16794] NA NA NA NA ...
## $ image_text : chr [1:16794] "sunny jay real estate @cleaming_ this was the original barbenheimer The Black Parade Hannah Montana ROMANEE LAN"| __truncated__ "¿Qué si interpretaría a Claro,por Claro, por Ken en 'Barbie 2'? ACK JEAVE CAVE ETPACK supuesto. Murphy Cilian L"| __truncated__ "Britt BrittRivera Rivera @kindamoviesnob My sister just saw Oppenheimer and something went wrong and half the s"| __truncated__ "= <U+65E5> JETPACK ETPACK CAVE EVANS BLACKETT MS Eso ya compensó el precio del boleto." ...
## $ link_text : chr [1:16794] NA NA NA NA ...
## $ description : chr [1:16794] NA NA NA NA ...
## $ sponsor_id : logi [1:16794] NA NA NA NA NA NA ...
## $ sponsor_name : logi [1:16794] NA NA NA NA NA NA ...
## $ sponsor_category : logi [1:16794] NA NA NA NA NA NA ...
## $ total_interactions_weighted_likes_1x_shares_1x_comments_1x_love_1x_wow_1x_haha_1x_sad_1x_angry_1x_care_1x: num [1:16794] 195095 145913 112162 103261 89916 ...
## $ overperforming_score : num [1:16794] 63.99 7.49 455.94 6.03 48.73 ...
## - attr(*, "spec")=
## .. cols(
## .. `Page Name` = col_character(),
## .. `User Name` = col_character(),
## .. `Facebook Id` = col_double(),
## .. `Page Category` = col_character(),
## .. `Page Admin Top Country` = col_character(),
## .. `Page Description` = col_character(),
## .. `Page Created` = col_datetime(format = ""),
## .. `Likes at Posting` = col_character(),
## .. `Followers at Posting` = col_double(),
## .. `Post Created` = col_character(),
## .. `Post Created Date` = col_date(format = ""),
## .. `Post Created Time` = col_time(format = ""),
## .. Type = col_character(),
## .. `Total Interactions` = col_number(),
## .. Likes = col_double(),
## .. Comments = col_double(),
## .. Shares = col_double(),
## .. Love = col_double(),
## .. Wow = col_double(),
## .. Haha = col_double(),
## .. Sad = col_double(),
## .. Angry = col_double(),
## .. Care = col_double(),
## .. `Video Share Status` = col_character(),
## .. `Is Video Owner?` = col_character(),
## .. `Post Views` = col_double(),
## .. `Total Views` = col_double(),
## .. `Total Views For All Crossposts` = col_double(),
## .. `Video Length` = col_character(),
## .. URL = col_character(),
## .. Message = col_character(),
## .. Link = col_character(),
## .. `Final Link` = col_character(),
## .. `Image Text` = col_character(),
## .. `Link Text` = col_character(),
## .. Description = col_character(),
## .. `Sponsor Id` = col_logical(),
## .. `Sponsor Name` = col_logical(),
## .. `Sponsor Category` = col_logical(),
## .. `Total Interactions (weighted — Likes 1x Shares 1x Comments 1x Love 1x Wow 1x Haha 1x Sad 1x Angry 1x Care 1x )` = col_number(),
## .. `Overperforming Score` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
7.1 filter()
We use filter()
to subset the dataset based on information in a column. For example, if we wanted to only see the responses from Texas, we could filter by the column name (i.e., the variable) stcode
. To use filter()
this way, we will need two arguments: the dataset you are using (in our case, covid_po
) and the logical condition for the filter (in this case, when page_admin_top_country
is "US"
).
## Rows: 6,725
## Columns: 41
## $ page_name <chr> ~
## $ user_name <chr> ~
## $ facebook_id <dbl> ~
## $ page_category <chr> ~
## $ page_admin_top_country <chr> ~
## $ page_description <chr> ~
## $ page_created <dttm> ~
## $ likes_at_posting <chr> ~
## $ followers_at_posting <dbl> ~
## $ post_created <chr> ~
## $ post_created_date <date> ~
## $ post_created_time <time> ~
## $ type <chr> ~
## $ total_interactions <dbl> ~
## $ likes <dbl> ~
## $ comments <dbl> ~
## $ shares <dbl> ~
## $ love <dbl> ~
## $ wow <dbl> ~
## $ haha <dbl> ~
## $ sad <dbl> ~
## $ angry <dbl> ~
## $ care <dbl> ~
## $ video_share_status <chr> ~
## $ is_video_owner <chr> ~
## $ post_views <dbl> ~
## $ total_views <dbl> ~
## $ total_views_for_all_crossposts <dbl> ~
## $ video_length <chr> ~
## $ url <chr> ~
## $ message <chr> ~
## $ link <chr> ~
## $ final_link <chr> ~
## $ image_text <chr> ~
## $ link_text <chr> ~
## $ description <chr> ~
## $ sponsor_id <lgl> ~
## $ sponsor_name <lgl> ~
## $ sponsor_category <lgl> ~
## $ total_interactions_weighted_likes_1x_shares_1x_comments_1x_love_1x_wow_1x_haha_1x_sad_1x_angry_1x_care_1x <dbl> ~
## $ overperforming_score <dbl> ~
We could also do this with numerical information. For example, maybe we only want to look at posts from U.S. Facebook pages with at least 15 like
reactions. We can do this by filter
ing from the likes
.
## [1] 2005
7.2 select()
In some instances, you not only want to filter by variable information, but you also may want to focus on a specific number of variables. This is especially common when you are ingesting/bringing in external data (e.g., secondary data or data from quadratics). For our dataset, let’s select
the following variables: the user name, the page category, the date it was posted, the message type, the message, the possible Facebook reactions (likes, love, wow, haha, sad, angry, care), and the post type
sm_data_select <- sm_data %>%
select(user_name, page_category, post_created, type, message, likes, love, wow, haha, sad, angry, care)
glimpse(sm_data_select)
## Rows: 16,794
## Columns: 12
## $ user_name <chr> "GhostKillerEntertainment", "JetpackCave", "babedotnet",~
## $ page_category <chr> "RECORD_LABEL", "ENTERTAINMENT_SITE", "MEDIA_NEWS_COMPAN~
## $ post_created <chr> "2023-07-22 10:37:12 CDT", "2023-07-18 11:10:12 CDT", "2~
## $ type <chr> "Photo", "Photo", "Photo", "Photo", "Photo", "Photo", "P~
## $ message <chr> "Check out our Emo spotify playlist! http://bit.ly/EmoNe~
## $ likes <dbl> 54601, 24812, 44516, 11170, 62449, 33888, 35480, 25347, ~
## $ love <dbl> 17270, 36791, 3121, 8382, 1236, 20070, 7335, 12286, 3482~
## $ wow <dbl> 320, 461, 272, 315, 146, 97, 456, 56, 196, 14, 229, 13, ~
## $ haha <dbl> 77094, 66519, 53840, 65550, 20115, 8758, 26719, 18575, 1~
## $ sad <dbl> 20, 49, 10, 7, 13, 10, 5, 21, 8, 2, 19, 0, 8, 3, 13, 2, ~
## $ angry <dbl> 5, 16, 4, 6, 0, 8, 1, 5, 3, 1, 6, 1, 2, 0, 9, 0, 11, 254~
## $ care <dbl> 676, 1106, 214, 304, 120, 543, 251, 408, 464, 512, 240, ~
7.3 rename()
Want to rename a variable? You can use the function rename()
to change a column. The way rename()
work is by taking two arguments, the dataset you hope to use and the name change with the following format, new_name = old_name
. You can even include more than one variable! Let’s see that in action below:
## [1] "user_name" "page_category" "post_created" "type"
## [5] "message" "likes" "love" "wow"
## [9] "haha" "sad" "angry" "care"
sm_data_rename <- sm_data_select %>%
rename(datetime = post_created,
like = likes)
colnames(sm_data_rename)
## [1] "user_name" "page_category" "datetime" "type"
## [5] "message" "like" "love" "wow"
## [9] "haha" "sad" "angry" "care"
Notice how post_created
and likes
has now been changed to post_created
and like
.
7.4 arrange()
The function arrange()
allows you to reorganize the rows based on the information in one variable. (in this case, we will use the like
variable, which we just renamed above,)
## # A tibble: 6 x 12
## user_name page_~1 datet~2 type message like love wow haha sad angry
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 GhostKiller~ RECORD~ 2023-0~ Photo "Check~ 54601 17270 320 77094 20 5
## 2 JetpackCave ENTERT~ 2023-0~ Photo "¡QUIE~ 24812 36791 461 66519 49 16
## 3 babedotnet MEDIA_~ 2023-0~ Photo "The t~ 44516 3121 272 53840 10 4
## 4 JetpackCave ENTERT~ 2023-0~ Photo "¡BARB~ 11170 8382 315 65550 7 6
## 5 ReviewMovie~ MOVIE_~ 2023-0~ Photo "Barbe~ 62449 1236 146 20115 13 0
## 6 thegoodfilms EDU_SI~ 2023-0~ Photo "Barbe~ 33888 20070 97 8758 10 8
## # ... with 1 more variable: care <dbl>, and abbreviated variable names
## # 1: page_category, 2: datetime
## # A tibble: 16,794 x 12
## user_name page_~1 datet~2 type message like love wow haha sad angry
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ReviewMovi~ MOVIE_~ 2023-0~ Photo "Barbe~ 62449 1236 146 20115 13 0
## 2 GhostKille~ RECORD~ 2023-0~ Photo "Check~ 54601 17270 320 77094 20 5
## 3 babedotnet MEDIA_~ 2023-0~ Photo "The t~ 44516 3121 272 53840 10 4
## 4 amBiSocial ACTIVI~ 2023-0~ Photo "#barb~ 35480 7335 456 26719 5 1
## 5 skittodigi~ TELECOM 2023-0~ Photo "barbi~ 34470 1171 39 3675 13 9
## 6 thegoodfil~ EDU_SI~ 2023-0~ Photo "Barbe~ 33888 20070 97 8758 10 8
## 7 CiberCubaN~ NEWS_S~ 2023-0~ Link "La pe~ 29325 3590 1959 886 102 254
## 8 JustAshott~ MOVIE 2023-0~ Photo "5 day~ 25347 12286 56 18575 21 5
## 9 SomosComic~ MEDIA_~ 2023-0~ Photo "Las p~ 25259 34825 196 139 8 3
## 10 PhillipChu~ BLOGGER 2023-0~ Photo "Sobre~ 25098 15071 229 612 19 6
## # ... with 16,784 more rows, 1 more variable: care <dbl>, and abbreviated
## # variable names 1: page_category, 2: datetime
# using arrange(variable) will order the numbers from smallest to largest
# adding a "-" in front of the variable reverses the order, so this will cause the rows to go from the largest to the smallest
Notice here that the results of arrange
are not saved. In order to do that, you need to assign it to a variable (either itself, or a new variable)
7.5 mutate()
You can use mutate()
to create new variables. Below, we add to our data frame sm_data_rename
by creating a variable to add all the reactions (this is different from “engagement” as comments do not count in this reaction metric.) To do this, we use numbers.
sm_data_rename <- sm_data_rename %>%
mutate(all_reactions = like + love + wow + haha + sad + angry + care) #modify sep = " " to change the separator!
#sm_data_rename$all_reactions #to see the variable
To use mutate()
, we use an “equation” using the =
sign. On the left side, we will write the name of the new variable (in this case, all_reactions
). On the right side, we add all the variables we want to add by (variable1 + variable2 + … + variableN).
Another thing that might be worth learning is how to use the paste()
function, which is how you combine strings. For example, maybe you want to combine the page category information with the post type. You can use paste()
within mutate()
to create this variable.
7.6 group_by() and summarize()
Finally, let’s learn about group_by()
and summarize()
, which is used to summarize numeric variables by categorial ones. For example, say we wanted to see whether videos get more likes than regular text posts. To check this, we can group by the type
of the post, and then summarize the average of likes. Let’s try this now.
sm_data_rename %>%
group_by(type) %>% #group by the type of content
summarize(avg_likes = mean(like)) %>% #create a new variable, avg_likes, by getting the mean for each type
arrange(-avg_likes) #arrange the data so the avg_likes is high-to-low
## # A tibble: 8 x 2
## type avg_likes
## <chr> <dbl>
## 1 Photo 205.
## 2 Native Video 92.7
## 3 Status 84.1
## 4 Link 67.5
## 5 Video 34.3
## 6 Live Video Complete 18.0
## 7 YouTube 11.3
## 8 Live Video Scheduled 3.13
In this dataset, photos get the most likes (by far), followed by Native Videos, and Text posts (“Status”)
Learn more about group_by and summarize here.