Chapter 11 Data Wrangling
Hello! In this tutorial, we will go over some data wrangling strategies when dealing with two or more datasets. Let’s begin by loading in the appropriate packages (mainly, tidyverse
).
Alrighty! Let’s now download our data. For this analysis, we will be looking at two datasets: (1) the Barbenheimer Facebook posts, which we’ve worked with before, and (2) stock prices for Mattel, the company that owns the Barbie franchise (we have not worked with this dataset before). The second dataset is gathered from Yahoo Finance.
data_fb <- read_csv("data/crowdtangle_barbenheimer_2023.csv") |> clean_names()
data_mattel <- read_csv("data/mattel_stock_2023.csv") |> clean_names()
#don't forget to make sure these files are in your working directory!
#View(data_fb)
#View(data_mattel)
Unsurprisingly, we will have to do some cleaning in order to merge these datasets. To do this, we’ll have to learn about a new data type, dates, and learn more about a data type we’ve worked with, strings.
11.1 More Data Types
11.1.1 Dates
So, “datetimes” are variable-types that include date and time information (in R, you can construct both date and datetime variables). In R, they follow the following structure: YYYY-MM-DD HH:MM:SS
. Base R calls this type of data a POSIXct
. Dates only, like the ones we’ll use here, arecalled a Date
type.
## Date[1:16794], format: "2023-07-22" "2023-07-18" "2023-07-26" "2023-07-25" "2023-07-25" "2023-07-05" "2023-07-27" "2023-07-15" "2023-07-19" "2023-07-24" "2023-07-19" ...
One nice thing about datetime variables is that you can subtract date variables from one another (but you can’t do other operations, which makes sense because you can’t divide last week by tomorrow).
## [1] "2023-07-03"
## [1] "2023-07-07"
## Time difference of 4 days
In this context, the data were ingested as a Date
format, but in your datasets, you may have to coerce a variable into a Date
type. You can do this using as.Date()
, which takes two arguments: the variable, and a format type. We will often use "%m/%d/%Y"
, representing:
data_fb$date <- as.Date(data_fb$post_created_date, format = "%m/%d/%Y")
data_mattel$date <- as.Date(data_mattel$date, format = "%m/%d/%Y")
#head(data_fb$date)
#head(data_mattel$date)
When plotting datetimes and dates, ggplot2
will also organize the axis nicely.
ggplot(data_mattel, aes(x = date, y = close)) +
geom_line() +
labs(title = "Closing Price of Mattel Stock, Daily", x = "Date", y = "Stock Value") +
theme_minimal()
Now, let’s take a brief detour to talk about strings.
11.1.2 Strings
While we’ve used the Barbenheimer dataset (data_fb
) before, this chapter is a good opportunity to give you a primer on dealing with strings/characters. As I’ve mentioned, strings and characters are types of data that scholars use to work with text data. Computers are not great at understanding language (in fact, without human involvement, they don’t understand language), but they are great for repetitive pattern recognition. For that reason, it can be very easy to identify parts of a string (like a keyword or consistent phrase), replace it, or search for it.
For the next few lines, let’s focus on the message
variable in the data_fb
dataset.
11.1.2.1 With base R
## [1] "Check out our Emo spotify playlist! http://bit.ly/EmoNeverDies:=:https://open.spotify.com/playlist/3czha2tGePoOcVI4xwa3j3"
## [2] "<U+00A1>QUIERE SER KEN! <U+0001FA77> Cillian Murphy (Oppenheimer) coment<U+00F3> que esta abierto a interpretar a Ken en una futura secuela de 'Barbie' <U+0001FA77>"
## [3] "The true Barbenheimer experience"
## [4] "<U+00A1>BARBENHEIMER ES REAL! En un cine de Estados Unidos durante una funci<U+00F3>n de 'Oppenheimer' hubo un fallo en el proyector y en los <U+00FA>ltimos 20 minutos la mitad de la pantalla se volvi<U+00F3> rosa <U+0001F606> El Barbie x Oppenheimer se volvi<U+00F3> canon."
## [5] "Barbenheimer <U+0E2A><U+0E21><U+0E31><U+0E22><U+0E15><U+0E2D><U+0E19><U+0E2D><U+0E22><U+0E39><U+0E48> Gotham <U+0001F605>"
## [6] "Barbenheimer 2023. <U+0001F91D><U+0001F3FC> <U+0001F3A8>: @justralphy"
## [7] "#barbenheimer | Join amBi - Your Bi Social Community for the best memes and community around! <U+0001F60E><U+0001FAF6>"
## [8] "5 days!"
## [9] "Las puntuaciones en Rotten Tomatoes de 'Barbenheimer'."
## [10] "Barbie + Oppenheimer= Barbenheimer (Taylor<U+2019>s version)"
As you can see, this variable contains information about headlines for each article. Notice how some of them mention Barbie and Oppenheimer separately? If we wanted to see the proportion of messages that mention Barbie, we can use the grepl()
function in base R. This is great for searching individual keywords or substrings (substrings are parts of a string–“Barbie” would be a substring in the longer string, which is a message). Let’s see how this is used.
## [1] TRUE
Notice that grepl()
returns a binary: TRUE
if that substring (“China”) exists and false if it does not. From this, we can create a binary variable for whether an article mentions China in the headline.
With news articles, we can be reasonably sure that the word “Barbie” will be capitalized. But sometimes (like on social media), people do not adhere to “standard writing norms.” If you want your search to be case insensitive, you can use the ignore.case()
argument.
data_fb$barbie_messages <- grepl("barbie", data_fb$message, ignore.case = TRUE)
head(data_fb$barbie_messages)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE
Once we have this variable, we can use it to subset out all the articles that mention China in its headline.
## [1] 8390
As we can see, there are 8390 Facebook posts mentioning Barbie (out of the 16794 posts)
nrow(barbie_messages)/nrow(data_fb) #this would tell you the proportion of posts that mention Barbie
## [1] 0.4995832
11.1.2.2 With tidyverse
We can do a similar thing in tidyverse
through the stringr
package. This package has additional functions that make it especially useful for people (like me) who deal a lot with strings and characters.
Unlike grepl()
, str_detect()
does not have a ignore.case
argument to make something case insensitive. But you can use a regular expression (?i)
in front of a word to search for it in a case-insensitive form.
So, if we wanted to search for all references to Ken or ken, this is what our code would look like.
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE
But what if you don’t want just Ken? Maybe you want to see how many messages mention Barbie OR Ken. To do so, you would include a |
pipe representing “or”.
## [1] "Barbenheimer' ganhar<U+00E1> todos os pr<U+00EA>mios Artista: ThatTallGinger"
## [2] "<U+0E02><U+0E2D><U+0E07><U+0E41><U+0E17><U+0E23><U+0E48>"
## [3] "Warner Bros.' \"Barbie\" and Universal's \"Oppenheimer\" open this weekend. Dubbed Barbenheimer, the two films are expected to dominate the box office."
## [4] "Thursday box office preview numbers for #Barbenheimer <U+0001F4C8>"
## [5] "The National Association of Theatre Owners is projecting that more than 200,000 moviegoers will attend same-day viewings of \"Barbie\" and \"Oppenheimer.\""
## [6] "#theoffice #barbenheimer"
## [7] "Yo"
## [8] "Barbie (2023) Dire<U+00E7><U+00E3>o: Greta Gerwig . #Barbie #BarbieFilme #Barbieheimer #Barbenheimer #BesouroAzul #BlueBeetle"
## [9] "Catching up with #barbenheimer memes... here's how few of our projects would look in barbie colors #pinkcar #pinkcars"
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
Searching for two words together (AND, rather than OR) is a little harder because “Trump and Biden” and “Biden and Trump” would be considered two different strings. But you can use the |
to include two searches: one where Barbie comes first and one where Barbie comes second. The characters you see in the middle, .*
represents a wildcard, or all the possible things that could be in between these words. By writing your string this way (using regular expressions), we are able to produce modular searches.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Now, let’s bring it all together in the tidyverse. First, we’ll create a new variables using mutate()
that contains the binary output of str_detect()
. Then, in the next line, we will filter()
out just the headlines that mention Barbie.
data_fb_barbie <- data_fb %>%
mutate(barbie = str_detect(message, "(?i)barbie"),
date = as.Date(post_created_date)) %>%
filter(barbie == TRUE)
nrow(data_fb_barbie)
## [1] 8390
Of our ~16,700 Facebook posts, 8,390 messages mention Barbie
11.2 Wrangling & Merging Datasets
Alright, now that we’ve cleaned our date variables, let’s move onto the wrangling.
## [1] "2023-07-18" "2023-07-25" "2023-07-24" "2023-07-19" "2023-07-23" "2023-07-25"
## [1] "2023-07-03" "2023-07-05" "2023-07-06" "2023-07-07" "2023-07-10" "2023-07-11"
If you View(data_mattel)
, you might notice that the data is already very tidy time-wise. Each row is a temporal observation (one day). But the data_fb_barbie
dataset is not this way because it is at the message level. For this reason, we will need to wrangle it: we’ll do this by creating a count of the daily number of public Facebook posts that mentioned Barbie.
A function you have already learned, table()
is great for this!
count_fb_barbie <- table(data_fb_barbie$date) %>%
as.data.frame() #make sure to convert it into a data frame
When turned into a data frame, a table
will automatically produce a factor (what variable you are counting by) and a numeric (the count). Let’s rename those now, and convert the date back into a Date
variable.
colnames(count_fb_barbie)[1] <- "date"
colnames(count_fb_barbie)[2] <- "fb_count"
count_fb_barbie$date <- as.Date(count_fb_barbie$date)
There’s also a lot of variables in data_mattel we won’t use, so let’s use select()
to focus on our variables of interest
We are now ready to merge our datasets!
11.2.1 merge()
One way you can merge your datasets is using the merge()
function in base R. This function has two required arguments with additional arguments like by
and all
to help you be more specific about how you want the data to be merged: merge(<first dataset>, <second dataset>)
. The first dataset (on the left) is considered “x” and the second dataset (on the right) is considered “y”.
With merged()
the dataset looks for a shared variable between the two datasets and combines the columns of the datasets along the shared variable. So, for example, if you merge count_fb_barbie
and count_mattel
by date, you will get a dataset containing the information in both by day.
In this case, “date” is the only variable they both share, so the merge()
function will default to this. But in the future, if you are working with datasets that may share multiple variable names, the by
argument becomes very important to tell the computer which variable to merge by.
The last argument, all
is also very important. Setting all
to true means you want to include the rows in both datasets, even if they do not appear in the other.
## date fb_count close
## 1 2023-07-02 11 NA
## 2 2023-07-03 13 20.04
## 3 2023-07-04 11 NA
## 4 2023-07-05 24 20.00
## 5 2023-07-06 19 19.91
## 6 2023-07-07 27 20.48
Notice that some of the close
variables (from count_mattel
, the stock data) is missing because the stock market is closed over the weekends.
If you wanted to merge and keep only the rows found in one of the two datasets, you can use the all.x
or all.y
arguments, like below.
## [1] 30
## [1] 34
Note that the merged datasets using the all.x
-argument and all.y
-argument are slightly different. That’s because in the first line, rows in count_fb_barbie
but not in count_mattel
are excluded and in the second row, it is the opposite (Use View()
to see for yourself).
11.2.2 tidyverse
You can also do this using the join functions from the dplyr
package.
left_join()
and right_join()
work very similar to the all.x
and all.y
argument in the base R merge()
function:
## [1] 30
## Joining, by = "date"
## [1] 34
#just like merge(), if you don't indicate a `by` variable, the code will return a message telling you which one it used
There is also inner_join()
, which only keeps rows when BOTH of the datasets have that row (this is the one we’ll use for plotting). This is akin to using all = FALSE
.
join_data_complete <- inner_join(count_fb_barbie, count_mattel, by = "date")
nrow(join_data_complete)
## [1] 20
Notice that this data frame is shorter than the other two because the date had to appear in both datasets to be included.
If we wanted to include all the rows in both (similar to the all = TRUE
), we would use full_join
().
## [1] 44
Notice that this data frame has the most rows.
11.3 Misc
11.3.1 Merging with an Index
As I mentioned previously, the data frames we used start/stop at different points.
## [1] "2023-07-02" "2023-07-03" "2023-07-04" "2023-07-05" "2023-07-06" "2023-07-07"
## [1] "2023-07-26" "2023-07-27" "2023-07-28" "2023-07-29" "2023-07-30" "2023-07-31"
## [1] "2023-07-03" "2023-07-05" "2023-07-06" "2023-07-07" "2023-07-10" "2023-07-11"
## [1] "2023-08-11" "2023-08-14" "2023-08-15" "2023-08-16" "2023-08-17" "2023-08-18"
Sometimes we want to clean the ends before merging them (or at least, I like to do this). If I’m working with multiple data frames, this is how I like to do it.
First, I’ll create an index:
Then, I use the index to merge all the additional data frames
11.3.2 rbind
Sometimes, you want to “merge” by row rather than by column. In R, you can do this using the rbind()
function (rbind
stands for row bind). To do this, though, you need to have two data frames with the same number of columns. For example, imagine the Barbie dataset came in two parts:
How do you combine them? Using rbind()
!
## [1] 8390
## [1] 8390
Note that this is the same number of rows as the original data_fb_barbie
dataset!
11.3.3 Additional Resources
- “What if I want to combine more than two datasets?” I recommend the
reduce()
function frompurrr
(which is also atidyverse
package). You can learn more about it here.
- “I wanted to learn more about the
%in%
operator!” That’s great! We’ll use it more once we get to the NLP section, but here’s a great tutorial on how to use in a general context