Chapter 7 Data Merging
Hello! In this tutorial, we will go over some data wrangling strategies when dealing with two or more datasets. Let’s begin by loading in the appropriate packages (mainly, tidyverse
).
options(scipen=999) #use this to remove scientific notation
library(plyr)
library(tidyverse)
Alrighty! Let’s now download our data. For this analysis, we will be looking at two datasets: (1) a list of NYT articles about COVID (not the actual stories, but headlines, URLs, and meta-data) and (2) U.S. COVID-19 data (cases, deaths).
<- read_csv("data/covid_cases.csv") #don't forget to make sure these files are in your working directory!
data_covid <- read_csv("data/covid_nyt_stories.csv") data_nyt
Let’s first use str()
and head()
to see what’s in the data.
str(data_covid)
## spec_tbl_df [385 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ iso_code : chr [1:385] "USA" "USA" "USA" "USA" ...
## $ date : chr [1:385] "1/22/2020" "1/23/2020" "1/24/2020" "1/25/2020" ...
## $ total_cases : num [1:385] 1 1 2 2 5 5 5 6 6 8 ...
## $ new_cases : num [1:385] NA 0 1 0 3 0 0 1 0 2 ...
## $ total_deaths : num [1:385] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths : num [1:385] NA NA NA NA NA NA NA NA NA NA ...
## $ total_cases_per_million: num [1:385] 0.003 0.003 0.006 0.006 0.015 0.015 0.015 0.018 0.018 0.024 ...
## $ new_cases_per_million : num [1:385] NA 0 0.003 0 0.009 0 0 0.003 0 0.006 ...
## - attr(*, "spec")=
## .. cols(
## .. iso_code = col_character(),
## .. date = col_character(),
## .. total_cases = col_double(),
## .. new_cases = col_double(),
## .. total_deaths = col_double(),
## .. new_deaths = col_double(),
## .. total_cases_per_million = col_double(),
## .. new_cases_per_million = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(data_nyt)
## spec_tbl_df [28,350 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ stories_id : num [1:28350] 1490114687 1491992217 1496159739 1497814035 1498707611 ...
## $ publish_date : POSIXct[1:28350], format: "2020-01-08 21:48:29" "2020-01-10 20:41:55" ...
## $ title : chr [1:28350] "China Identifies New Virus Causing Pneumonia-Like Illness" "China Reports First Death From New Virus" "Japan Confirms First Case of New Chinese Coronavirus" "Three U.S. Airports to Screen Passengers for Chinese Coronavirus" ...
## $ url : chr [1:28350] "https://www.nytimes.com/2020/01/08/health/china-pneumonia-outbreak-virus.html" "https://www.nytimes.com/2020/01/10/world/asia/china-virus-wuhan-death.html" "https://www.nytimes.com/2020/01/15/world/asia/coronavirus-japan-china.html" "https://www.nytimes.com/2020/01/17/health/china-coronavirus-airport-screening.html" ...
## $ language : chr [1:28350] "en" "en" "en" "en" ...
## $ ap_syndicated: logi [1:28350] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ themes : chr [1:28350] NA NA NA NA ...
## $ media_id : num [1:28350] 1 1 1 1 1 1 1 1 1 1 ...
## $ media_name : chr [1:28350] "New York Times" "New York Times" "New York Times" "New York Times" ...
## $ media_url : chr [1:28350] "http://nytimes.com" "http://nytimes.com" "http://nytimes.com" "http://nytimes.com" ...
## - attr(*, "spec")=
## .. cols(
## .. stories_id = col_double(),
## .. publish_date = col_datetime(format = ""),
## .. title = col_character(),
## .. url = col_character(),
## .. language = col_character(),
## .. ap_syndicated = col_logical(),
## .. themes = col_character(),
## .. media_id = col_double(),
## .. media_name = col_character(),
## .. media_url = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(data_covid$date)
## [1] "1/22/2020" "1/23/2020" "1/24/2020" "1/25/2020" "1/26/2020" "1/27/2020"
head(data_nyt$publish_date)
## [1] "2020-01-08 21:48:29 UTC" "2020-01-10 20:41:55 UTC"
## [3] "2020-01-15 22:30:04 UTC" "2020-01-17 14:08:42 UTC"
## [5] "2020-01-18 22:08:02 UTC" "2020-01-20 05:59:15 UTC"
str(data_nyt$publish_date)
## POSIXct[1:28350], format: "2020-01-08 21:48:29" "2020-01-10 20:41:55" "2020-01-15 22:30:04" ...
Of the 8 variables in the COVID dataset and the 10 variables in the NYT dataset, it’s worth noting that one variable they kind of share in common is time. So, it might be interesting to construct a plot of the number of cases and number of articles about COVID daily. To do this, we’ll have to merge these two datasets.
But these datasets are in slightly different structures. data_covid
is daily data, whereas the data_nyt
dataset is at the article-level with both date and timestamp information included in the publish_date
variable. So, before we merge, we need to learn more about how to wrangle both datasets.
7.1 More Data Types
7.1.1 Datetime
So “datetimes” are variable-types that include date and time information (in R, you can construct both date and datetime variables). In R, they follow the following structure: YYYY-MM-DD HH:MM:SS
. Base R calls this type of data a POSIXct
.
head(data_nyt$publish_date)
## [1] "2020-01-08 21:48:29 UTC" "2020-01-10 20:41:55 UTC"
## [3] "2020-01-15 22:30:04 UTC" "2020-01-17 14:08:42 UTC"
## [5] "2020-01-18 22:08:02 UTC" "2020-01-20 05:59:15 UTC"
str(data_nyt$publish_date)
## POSIXct[1:28350], format: "2020-01-08 21:48:29" "2020-01-10 20:41:55" "2020-01-15 22:30:04" ...
One nice thing about datetime variables is that you can subtract datetime variables from one another (but you can’t do other operations, which makes sense because you can’t divide last week by tomorrow).
$publish_date[1] data_nyt
## [1] "2020-01-08 21:48:29 UTC"
$publish_date[2] data_nyt
## [1] "2020-01-10 20:41:55 UTC"
$publish_date[2] - data_nyt$publish_date[1] data_nyt
## Time difference of 1.953773 days
#data_nyt$publish_date[2] * data_nyt$publish_date[1] this line will not work
7.1.2 Dates
In the data_covid
dataset, we have a Date variable, but it’s not actually a date (you’ll see this if you use str()
)
str(data_covid$date)
## chr [1:385] "1/22/2020" "1/23/2020" "1/24/2020" "1/25/2020" "1/26/2020" ...
So, we’ll need to coerce this variable into a date variable. We can do so using the as.Date()
function.
If your character/string is already structured like this: "2020-01-10"
, then as.Date() will interpret this easily. In most cases, though, we often have to tell R what the format of the Date (or datetime) structure is. We can do this using the format()
argument in the as.Date()
function (here, we will use "%m/%d/%Y"
, representing:
$date <- as.Date(data_covid$date, format = "%m/%d/%Y")
data_covidstr(data_covid$date)
## Date[1:385], format: "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" "2020-01-26" ...
head(data_covid$date)
## [1] "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" "2020-01-26"
## [6] "2020-01-27"
$publish_date <- as.Date(data_nyt$publish_date)
data_nytstr(data_nyt$publish_date)
## Date[1:28350], format: "2020-01-08" "2020-01-10" "2020-01-15" "2020-01-17" "2020-01-18" ...
When plotting datetimes and dates, ggplot2
will also organize the axis nicely.
ggplot(data_covid, aes(x = date, y = new_deaths)) +
geom_line() +
labs(title = "Number of new COVID-19 Deaths, Daily", x = "Date", y = "Number of deaths") +
theme_minimal()
## Warning: Removed 38 row(s) containing missing values (geom_path).
Now, let’s take a very brief detour to talk about strings.
7.1.3 Strings
The data_nyt
dataset is a good opportunity to give you a primer on dealing with strings/characters. As I’ve mentioned, strings and characters are types of data that scholars use to work with text data. Computers are not great at understanding language (in fact, without human involvement, they don’t understand language), but they are great for repetitive pattern recognition. For that reason, it can be very easy to identify parts of a string (like a keyword or consisent phrase), replace it, or search y it.
For the next few lines, let’s focus on the title
variable in the data_nyt
dataset.
7.1.3.1 With base R
head(data_nyt$title, n = 10)
## [1] "China Identifies New Virus Causing Pneumonia-Like Illness"
## [2] "China Reports First Death From New Virus"
## [3] "Japan Confirms First Case of New Chinese Coronavirus"
## [4] "Three U.S. Airports to Screen Passengers for Chinese Coronavirus"
## [5] "China Reports 17 New Cases of Mysterious Virus"
## [6] "What We Know About China’s New Coronavirus"
## [7] "Your Tuesday Briefing"
## [8] "Virginia, Davos, Coco Gauff: Your Monday Evening Briefing"
## [9] "Your Tuesday Briefing"
## [10] "Fears of China’s Coronavirus Prompt Australia to Screen Flights"
As you can see, this variable contains information about headlines for each article. Notice how some of them mention China? If we wanted to see the proportion of articles that mention China in its headline, we can use the grepl()
function in base R. This is great for searching individual keywords or substrings (substrings are parts of a string–“China” would be a substring in the longer string, which is a headline). Let’s see how this is used.
grepl("China", data_nyt$title[1]) #note that this returns a TRUE
## [1] TRUE
Notice that grepl()
returns a binary: TRUE
if that substring (“China”) exists and false if it does not. From this, we can create a binary variable for whether an article mentions China in the headline.
$china <- grepl("China", data_nyt$title) data_nyt
With news articles, we can be reasonably sure that the word “China” will be capitalized. But sometimes (like on social media), people do not adhere to “standard writing norms.” If you want your search to be case insensitive, you can use the ignore.case()
argument.
$china <- grepl("china", data_nyt$title, ignore.case = TRUE)
data_nythead(data_nyt$china)
## [1] TRUE TRUE FALSE FALSE TRUE TRUE
Once we have this variable, we can use it to subset out all the articles that mention China in its headline.
<- subset(data_nyt, china == TRUE)
china_headlines nrow(china_headlines)
## [1] 692
As we can see, there are 692 articles mentioning China in its headline.
7.1.4 With tidyverse
We can do a similar thing in tidyverse
through the stringr
package. This package has additional functions that make it especially useful for people (like me) who deal a lot with strings and characters.
Unlike grepl()
, str_detect()
does not have a ignore.case
argument to make something case insensitive. But you can use a regular expression (?i)
in front of a word to search for it in a case-insensitive form.
So, if we wanted to search for all references to Trump or trump, this is what our code would look like.
str_detect(data_nyt$title[7975:7981], "(?i)trump")
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
What if you wanted to search by multiple keywords. Maybe you want to see how many headlines mention Trump OR Biden. To do so, you would include a |
pipe representing “or”.
str_detect(data_nyt$title[19310:19315], "(?i)trump|(?i)biden")
## [1] TRUE TRUE TRUE TRUE TRUE FALSE
$title[19310:19315] data_nyt
## [1] "What President Trump Has Said About Wearing Masks"
## [2] "Biden Pushes a Jobs Plan and Tears Into Trump's Covid Response in Michigan - The New York Times"
## [3] "Ruth Bader Ginsburg Updates: Biden Says Election Winner Should Pick Nominee - The New York Times"
## [4] "The World Shudders as President Trump Tests Positive for Covid-19"
## [5] "Trump’s Positive Coronavirus Test Upends Campaign in Final Stretch"
## [6] "Federal Report Warns of Financial Havoc From Climate Change - The New York Times"
Searching for two words together is a little harder because “Trump and Biden” and “Biden and Trump” would be considered two different strings. But you can use the |
to include two searches: one where Trump comes first and one whree Trump comes second. The characters you see in the middle, .*
represents a wildcard, or all the possible things that could be in between these words. By writing your string this way (using regular expressions), we are able to produce modular searches.
str_detect(data_nyt$title[19310:19315], "(?i)trump.*(?i)biden|(?i)biden.*(?i)trump")
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
Now, let’s bring it all together in the tidyverse. First, we’ll create a new variables using mutate()
that contains the binary output of str_detect()
. Then, in the next line, we will filter()
out just the headlines that mention one of the two presidents.
<- data_nyt %>%
president_covid mutate(presidents = str_detect(title, "(?i)trump|(?i)biden")) %>%
filter(presidents == TRUE)
nrow(president_covid)
## [1] 3566
Of our ~28,000 news stories, 3,500 headlines reference Trump or Biden by name.
7.2 Wrangling & Merging Datasets
Alright, now that we’ve cleaned our date variables, let’s move onto the wrangling.
head(data_covid$date)
## [1] "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" "2020-01-26"
## [6] "2020-01-27"
head(data_nyt$publish_date)
## [1] "2020-01-08" "2020-01-10" "2020-01-15" "2020-01-17" "2020-01-18"
## [6] "2020-01-20"
If you View(data_covid)
, you might notice that the data is already very tidy time-wise. Each row is a temporal observation (one day). But the data_nyt
dataset is not this way because it is at the article level. For this reason, we will need to wrangle it: we’ll do this by creating a count of the number of articles that NYT posted on each day.
A function you have already learned, table()
is great for this!
<- table(data_nyt$publish_date) %>%
data_nyt_count as.data.frame() #make sure to convert it into a data frame
When turned into a data frame, a table
will automatically produce a factor (what variable you are counting by) and a numeric (the count). Let’s rename those now, and convert the date back into a Date
variable.
colnames(data_nyt_count)[1] <- "date"
colnames(data_nyt_count)[2] <- "nyt_count"
$date <- as.Date(data_nyt_count$date) data_nyt_count
We are now ready to merge our datasets!
7.2.1 merge()
One way you can merge your datasets is using the merge()
function in base R. This function has two required arguments with additional arguments like by
and all
to help you be more specific about how you want the data to be merged: merge(<first dataset>, <second dataset>)
. The first dataset (on the left) is considered “x” and the second dataset (on the right) is considered “y”.
With merged()
the dataset looks for a shared variable between the two datasets and combines the columns of the datasets along the shared variable. So, for example, if you merge data_nyt_count
and data_covid
by date, you will get a dataset containing the information in both by day.
In this case, “date” is the only variable they both share, so the merge()
function will default to this. But in the future, if you are working with datasets that may share multiple variable names, the by
argument becomes very important to tell the computer which variable to merge by.
The last argument, all
is also very important. Setting all
to true means you want to include the rows in both datasets, even if they do not appear in the other. So, for example, data_covid
begins on January 22, 2021 while data_nyt_count
begins on January 8, 2021. If you set all to TRUE
, the merged dataset will include the rows from January 8-22 but mark them NA
for the data_covid
variables (see below)
<- merge(data_nyt_count, data_covid, by = "date", all = TRUE)
merged_dat head(merged_dat)
## date nyt_count iso_code total_cases new_cases total_deaths new_deaths
## 1 2020-01-08 1 <NA> NA NA NA NA
## 2 2020-01-09 1 <NA> NA NA NA NA
## 3 2020-01-10 1 <NA> NA NA NA NA
## 4 2020-01-11 1 <NA> NA NA NA NA
## 5 2020-01-15 1 <NA> NA NA NA NA
## 6 2020-01-16 1 <NA> NA NA NA NA
## total_cases_per_million new_cases_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
If you wanted to merge and keep only the rows found in one of the two datasets, you can use the all.x
or all.y
arguments, like below.
<- merge(data_nyt_count, data_covid, all.x = TRUE)
m_dat_left nrow(m_dat_left)
## [1] 387
<- merge(data_nyt_count, data_covid, all.y = TRUE)
m_dat_right nrow(m_dat_right)
## [1] 385
Note that the merged datasets using the all.x
-argument and all.y
-argument are slightly different. That’s because in the first line, rows in data_covid
but not in data_nyt_count
are excluded and in the second row, it is the opposite (Use View()
to see for yourself).
7.2.2 tidyverse
You can also do this using the join functions from the dplyr
package.
left_join()
and right_join()
work very similar to the all.x
and all.y
argument in the base R merge()
function:
<- left_join(data_nyt_count, data_covid, by = "date")
join_dat_left nrow(join_dat_left)
## [1] 387
<- right_join(data_nyt_count, data_covid) join_dat_right
## Joining, by = "date"
nrow(join_dat_right)
## [1] 385
#just like merge(), if you don't indicate a `by` variable, the code will return a message telling you which one it used
There is also inner_join()
, which only keeps rows when BOTH of the datasets have that row (this is the one we’ll use for plotting). This is akin to using all = FALSE
.
<- inner_join(data_nyt_count, data_covid, by = "date")
join_data_complete nrow(join_data_complete)
## [1] 376
Notice that this data frame is shorter than the other two because the date had to appear in both datasets to be included.
If we wanted to include all the rows in both (similar to the all = TRUE
), we would use full_join
().
<- full_join(data_nyt_count, data_covid, by = "date")
join_data_all nrow(join_data_all)
## [1] 396
Notice that this data frame has the most rows.
7.3 Seeing the results
Now that we have a merged dataset, let’s put together a figure!
Let’s plot our results! Below, I construct two separate plots (you’re welcome to try combing the two, but news articles and COVID cases operate on different scales of magnitude, meaning that you will have a wonky Y-axis)
<- ggplot(join_data_complete, aes(x = date)) +
plot_covid geom_line(aes(y = new_cases), size = 1) +
labs(title = "Number of n cases, daily", x = "", y = "COVID Cases") +
theme_minimal()
<- ggplot(join_data_complete, aes(x = date)) +
plot_news geom_line(aes(y = nyt_count), size = 1) +
labs(title = "Number of NYT Stories about COVID, daily", x = "Date", y = "# of Stories") +
theme_minimal()
Then, I can use grid.arrange()
from gridExtra
to effectively facet the two together.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(plot_covid, plot_news, nrow = 2)
## Warning: Removed 1 row(s) containing missing values (geom_path).
(Note: you may have gotten a warning or note that one row was removed. This is because the new_case
variable from the data_covid
dataset is NA
for the date of January 22, 2021). You could remove this row before doing this analysis. I would not recommend using complete.cases()
in this instance, though, because the death columns (total_deaths
and new_deaths
) don’t start until February 29, 2021.
7.4 Misc
7.4.1 Merging with an Index
As I mentioned previously, the data frames we used start/stop at different points.
head(data_nyt_count$date)
## [1] "2020-01-08" "2020-01-09" "2020-01-10" "2020-01-11" "2020-01-15"
## [6] "2020-01-16"
tail(data_nyt_count$date)
## [1] "2021-01-26" "2021-01-27" "2021-01-28" "2021-01-29" "2021-01-30"
## [6] "2021-01-31"
head(data_covid$date)
## [1] "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" "2020-01-26"
## [6] "2020-01-27"
tail(data_covid$date)
## [1] "2021-02-04" "2021-02-05" "2021-02-06" "2021-02-07" "2021-02-08"
## [6] "2021-02-09"
Sometimes we want to clean the ends before merging them (or at least, I like to do this). If I’m working with multiple data frames, this is how I like to do it.
First, I’ll create an index:
<- data.frame(date = seq.Date(as.Date("2020-01-23"), as.Date("2021-01-31"), by = "day")) data_index
Then, I use the index to merge all the additional data frames
<- left_join(data_index, data_covid, by = "date") %>% left_join(data_nyt_count, by = "date") full_data
7.4.2 rbind
Sometimes, you want to “merge” by row rather than by column. In R, you can do this using the rbind()
function (rbind
stands for row bind). To do this, though, you need to have two data frames with the same number of columns. For example, imagine the covid dataset came in two parts:
<- data_covid[1:200,]
data_covid_setA <- data_covid[201:385,] data_covid_setB
How do you combine them? Using rbind()
!
<- rbind(data_covid_setA, data_covid_setB) data_covid_fullset
Note that this is the same number of rows as the original data_covid
dataset!
7.4.3 Additional Resources
- “What if I want to combine more than two datasets?” I recommend the
reduce()
function frompurrr
(which is also atidyverse
package): https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list
- “I wanted to learn more about the
%in%
operator!” That’s great! We’ll use it more once we get to the NLP section, but here’s a great tutorial on how to use in a general context: https://www.marsja.se/how-to-use-in-in-r/