Chapter 6 Exercises
6.1 Regular Expressions
1. Write a regex for Swedish mobile numbers. Test it with str_detect("+46 71-738 25 33", "[insert your regex here]").
Solution.
#1
str_detect("+46 71-738 25 33", "\\+46 [0-9]{2}\\-[0-9]{3} [0-9]{2} [0-9]{2}")
- Remember the vector of heights?
  a. How can you extract the meters using a negative look-behind?
  b. Bring it into numeric format (i.e., so that your_solution == c(1.3, 2.01, 3.1) yields TRUE) using regexes and stringr commands.
Solution.

heights <- c("1m30cm", "2m01cm", "3m10cm")

#a
meters <- str_extract(heights, "(?<!m)[0-9]")

#b
for_test <- str_replace(heights, "(?<=[0-9])m", "\\.") %>% 
  str_replace("cm", "") %>% 
  as.numeric()

for_test == c(1.3, 2.01, 3.1)
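An alternative (not asked for in the exercise) is to capture metres and centimetres separately with str_match() and combine them arithmetically:

parts <- str_match(heights, "([0-9]+)m([0-9]+)cm")  # full match, metres, centimetres
as.numeric(parts[, 2]) + as.numeric(parts[, 3]) / 100
# 1.30 2.01 3.10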
- Find all Mercedes in the mtcars data set.
Solution.

mtcars %>% 
  rownames_to_column("model") %>% 
  filter(str_detect(model, "Merc"))
- Take the IMDb file and split the Genre column into different columns (hint: look at the tidyr::separate() function). How would you do it if Genre were a vector, using str_split_fixed()?
Solution.

imdb <- read_csv("imdb2006-2016.csv")

imdb %>% 
  separate(Genre, sep = ",", into = c("genre_1", "genre_2", "genre_3"))

imdb$Genre %>% 
  str_split_fixed(pattern = ",", 3)
6.2 rvest
- Download all movies from the IMDb Top250 best movies of all time. Put them in a tibble with the columns rank (in numeric format), title, url (to the IMDb entry), and rating (in numeric format).
Solution.

imdb_top250 <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")

tibble(
  rank = imdb_top250 %>% 
    html_elements(".titleColumn") %>% 
    html_text2() %>% 
    str_extract("^[0-9]+(?=\\.)") %>% 
    parse_integer(),
  title = imdb_top250 %>% 
    html_elements(".titleColumn a") %>% 
    html_text2(),
  url = imdb_top250 %>% 
    html_elements(".titleColumn a") %>% 
    html_attr("href") %>% 
    str_c("https://www.imdb.com", .),
  rating = imdb_top250 %>% 
    html_elements("strong") %>% 
    html_text() %>% 
    parse_double()
)
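If you assign the tibble to an object (top250 is a name introduced here, not part of the solution above), a quick sanity check is that it contains 250 rows and that the ranks run from 1 to 250:

# assuming the tibble from above was assigned to `top250`
stopifnot(nrow(top250) == 250)
stopifnot(identical(top250$rank, 1:250))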
- Scrape the table on “https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/” which contains the total cases and deaths for each State, the table on “https://usafacts.org/visualizations/covid-vaccine-tracker-states” that contains the vaccination progress, and the table on “https://simple.wikipedia.org/wiki/List_of_U.S._states” to get more data on U.S. States (hint: use janitor::clean_names() because the column names are a bit off, and take away the first row with slice()). Build a regression model that tries to predict the number of COVID cases or deaths (normalized by the total number of inhabitants, of course). Include all predictors you deem relevant (e.g., vaccination progress, Republican/Democratic senator – note that some States have both, population density – inhabitants/sqkm, etc.). You will need some regular expressions here.
Solution.

cov_cases <- read_html("https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/") %>% 
  html_table() %>% 
  pluck(2) %>% 
  clean_names() %>% 
  mutate(across(2:5, ~str_remove_all(.x, "[^0-9]") %>% 
                  as.numeric()))

vaccination_stuff <- read_html("https://usafacts.org/visualizations/covid-vaccine-tracker-states") %>% 
  html_table() %>% 
  pluck(1) %>% 
  clean_names() %>% 
  mutate(across(2:6, ~str_remove_all(.x, "[^0-9]") %>% 
                  as.numeric()))

demographic_stuff <- read_html("https://simple.wikipedia.org/wiki/List_of_U.S._states") %>% 
  html_table(fill = TRUE) %>% 
  pluck(1) %>% 
  clean_names() %>% 
  slice(2:51) %>% 
  mutate(state = str_remove(name_postal_abbs_1, "\\[C\\]$")) %>% 
  select(state, population = population_b_3, area = total_area_4_2) %>% 
  mutate(across(2:3, ~str_remove_all(.x, "[^0-9]") %>% 
                  as.numeric()))

# senators_clean_names: one row per senator with state and party,
# assumed to come from the scraping code earlier in the chapter
majority <- senators_clean_names %>% 
  count(party, state) %>% 
  group_by(state) %>% 
  filter(n == max(n)) %>% 
  add_count(state) %>% 
  mutate(nn = case_when(nn == 2 ~ NA_integer_,
                        TRUE ~ nn)) %>% 
  drop_na() %>% 
  select(state, party)

final_tbl <- senators_clean_names %>% 
  left_join(demographic_stuff) %>% 
  left_join(vaccination_stuff) %>% 
  left_join(cov_cases) %>% 
  select(-party) %>% 
  left_join(majority) %>% 
  distinct(state, .keep_all = TRUE) %>% 
  mutate(percentage_deaths = deaths/population,
         pop_density = population/area)

lm(percentage_deaths ~ party + percent_of_pop_fully_vaccinated + pop_density, data = final_tbl) %>% 
  summary()
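If you want the coefficients as a tibble rather than printed output (e.g., for plotting), broom::tidy() can be applied to the fitted model; a minimal sketch:

library(broom)

lm(percentage_deaths ~ party + percent_of_pop_fully_vaccinated + pop_density, data = final_tbl) %>% 
  tidy(conf.int = TRUE)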
- Write a while loop that accomplishes the same things as the ones above (i.e., scrapes wg-gesucht offerings) yet does so using html_session().
  - Extend it and scrape only the listings that are still available. Can you determine whether they are available without making an unnecessary call – and if not, make as few calls as possible?
  - Think about how you can store the listings and their contents in a structured manner. Bonus points for doing it in a function (a usage sketch follows the solution below).
  - Are there any social scientific questions you could ask with those data?
Solution.

library(rvest)
library(lubridate)
library(magrittr)  # for %<>%

# a -- if an apartment is rented out, the picture element doesn't contain a link anymore
output_list <- vector(mode = "list", length = 10L)

fix_date <- function(date_vec){
  proper_dates <- str_extract(date_vec, "[0-9]{2}.[0-1][0-9].[2][0][0-3][0-9]") %>% 
    parse_date(format = "%d.%m.%Y") %>% 
    .[!is.na(.)]
  today <- date_vec[str_detect(date_vec, "Minuten|Stunde")] %>% 
    str_replace(".+", today() %>% as.character()) %>% 
    ymd()
  days_ago <- date_vec[str_detect(date_vec, "Tag")] %>% 
    str_replace(., ".+",
                (today() - days(str_extract(., "[1-4](?= Tag)") %>% 
                                  as.numeric())) %>%  # subtract the number of days mentioned in the string
                  as.character()) %>% 
    ymd()
  c(today, days_ago, proper_dates)
}

i <- 0
date <- today()
end_date <- today() - months(1)
wg_session <- session("https://www.wg-gesucht.de/1-zimmer-wohnungen-in-Leipzig.77.1.1.0.html")

while (date >= end_date) {
  i <- i + 1
  page <- read_html(wg_session)
  
  output_list[[i]] <- page %>% 
    html_elements(".truncate_title a") %>% 
    html_attrs_dfr() %>%  # helper (not in rvest itself) that turns the nodes' attributes into a tibble
    filter(class == "detailansicht") %>% 
    select(link = href, title = .text) %>% 
    mutate(title = title %>% str_squish(),
           link = url_absolute(link, base = "https://www.wg-gesucht.de/"))
  
  output_list[[i]]$date <- page %>% 
    html_elements("span:nth-child(2)") %>% 
    html_text2() %>% 
    .[str_detect(., "^Online")] %>% 
    fix_date()
  
  date <- output_list[[i]]$date %>% tail(1)
  
  output_list[[i]] %<>% 
    filter(link %in% (page %>% 
                        html_elements(".card_image a") %>% 
                        html_attr("href") %>% 
                        url_absolute(base = "https://www.wg-gesucht.de/")))
  
  if (i == 1) wg_session %<>% session_follow_link(css = "#main_column li:nth-child(15) a")
  if (i > 1) wg_session %<>% session_follow_link(css = "#main_column li:nth-child(16) a")
}

listing_list <- output_list %>% bind_rows()

# b
scrape_listing <- function(link) {
  page <- read_html(link)
  list(
    costs = page %>% 
      html_table() %>% 
      pluck(1),
    details = page %>% 
      html_elements(":nth-child(12) .col-xs-12") %>% 
      html_text2(),
    apartment_description = page %>% 
      html_elements("#freitext_0_content") %>% 
      html_text2(),
    location_description = page %>% 
      html_elements("#freitext_1_content") %>% 
      html_text2(),
    number = page %>% 
      html_elements(".col-md-4 :nth-child(2) .col-md-12") %>% 
      html_text2()
  )
}
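To actually store the listings' contents, the function can be mapped over the links collected in part (a); the sketch below assumes listing_list from above exists and adds a short pause between requests to go easy on the server:

library(purrr)

listings_data <- listing_list$link %>% 
  head(3) %>%           # just a few listings while testing
  map(~{
    Sys.sleep(2)        # pause between requests
    scrape_listing(.x)
  })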
- Go to Wikipedia and search for an article you find interesting using R.
Solution.

wiki_session <- session("https://en.m.wikipedia.org/wiki/Main_Page")
wiki_startpage <- wiki_session %>% read_html()

search <- html_form_set(wiki_session %>% html_form() %>% pluck(1),
                        search = "R")

search_r <- wiki_session %>% 
  session_submit(search) %>% 
  read_html()

search_r %>% html_elements("p") %>% html_text2() # --> worked
6.3 APIs
- Explore the API of the New York Times. Scrape the following:
- bestsellers hardcover fiction in June 2019 (store them in separate tibbles in a list).
- articles from the 6th to the 10th of January 2021 dealing with the Capitol.
Solution.

library(httr)
library(jsonlite)

#a
map(c("02", "09", "16", "23", "30"), ~{
  modify_url(
    url = str_c("https://api.nytimes.com/svc/books/v3/lists/2019-06-", .x, "/hardcover-fiction.json"),
    query = list(`api-key` = Sys.getenv("nyt_api_key"))
  ) %>% 
    GET() %>% 
    content(as = "text") %>% 
    fromJSON()
})

#b
modify_url(
  url = "http://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "capitol",
               begin_date = "20210106",
               end_date = "20210110",
               `api-key` = Sys.getenv("nyt_api_key"))
) %>% 
  GET() %>% 
  content(as = "text") %>% 
  fromJSON()
- Search the API database on The Programmable Web and choose one. Send some GET() requests (a sketch of the generic request pattern follows).
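Which API you pick is up to you, but the request pattern is always the same: build the URL, send the GET() request, check the status, and parse the body. A minimal sketch against httpbin.org, a public echo service (the endpoint and query parameters are placeholders, not part of the original exercise):

library(httr)
library(jsonlite)

resp <- GET("https://httpbin.org/get",
            query = list(q = "some search term", format = "json"))

status_code(resp)        # 200 means the request went through

content(resp, as = "text") %>% 
  fromJSON()             # parse the JSON body into an R list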
- Play around with rtweet. You can use, for instance, this list.
  - Get followers of 10 members. Automate the process. Get all of them (hint: look at the default parameters of the function).
  - Get the authors' bios – could you infer their party affiliation from their bios?
  - Think about the recent climate summit. Come up with three keywords/hashtags that might describe it. Please send them to me by email. Acquire the tweets related to this event (i.e., the ones that contained your keyword). Plot their frequency over time; what patterns do you see?
  - Stream all tweets that were sent from Munich (hint: lookup_coords("munich, germany")).
  - Analyze the liking behavior of the politicians; look at 10 politicians' 100 latest favorited tweets – do they seem to have a preference for members of the same party?
Solution.

library(rtweet)

pol_list <- lists_members(
  list_id = "1050737868606455810"
)

#a
followers <- pol_list$screen_name[1:10] %>% 
  map(~get_followers(.x, n = lookup_users(.x) %>% pull(followers_count), retryonratelimit = TRUE))

#b
pol_list %>% filter(description %>% str_detect(regex("afd|cdu|csu|fdp|b.?90|linke", ignore_case = TRUE)))

#c
keywords <- c("glasgow", "cop26", "globalwarming")

cop_tweets <- map(keywords, search_tweets, n = 6000, include_rts = FALSE)

cop_tweets %>% 
  bind_rows() %>% 
  ggplot() +
  geom_histogram(aes(x = created_at))
# --> pattern: we need more data

#d
tweets_muc <- stream_tweets(timeout = 60, geocode = lookup_coords("munich, germany"))

#e
favs <- get_favorites(pol_list$screen_name[1:10], n = 100)
- Name four shortcomings of Twitter data and the implications those may have for research. How could one address the shortcomings you mentioned and which data would you wish for to do so?
6.4 Text Mining
- Read in some Twitter data. Bring them into tidytext format. Stem them.
Solution.

library(tidyverse)
library(tidytext)

tweets <- read_csv("data/twitter-set.csv") %>% 
  mutate(id = as.character(id))

tidy_tweets <- tweets %>% 
  unnest_tokens(output = token, input = posting)

library(SnowballC)

tidy_stemmed <- tidy_tweets %>% 
  mutate(stemmed = wordStem(token, language = "de"))
- What are the 10 terms the respective parties are using the most (before and after stopword removal)? Visualize your results using ggplot2 and geom_col(). Are there any differences? Should you remove further terms? Plus, do the results change once you remove retweets?
Solution.

library(stopwords)

party_colors <- c("AFD" = "#009ee0",
                  "CSU" = "#000000",
                  "DIE LINKE" = "#d837f0",
                  "FDP" = "#feed01",
                  "DIE GRUENEN" = "#1faf12",
                  "SPD" = "#e2001a",
                  "FREIE WAEHLER" = "#e58609")

tidy_stemmed %>% 
  anti_join(get_stopwords(language = "de"), by = c("token" = "word")) %>% 
  group_by(party) %>% 
  count(stemmed) %>% 
  slice_max(n, n = 10) %>% 
  ggplot() +
  geom_col(aes(reorder(stemmed, n), n, fill = party), show.legend = FALSE) +
  labs(x = NULL, y = "n") +
  facet_wrap(~party, ncol = 2, scales = "free") +
  scale_fill_manual(values = party_colors) +
  coord_flip()
Solution.

tweets %>% 
  filter(!str_detect(posting, regex("^rt|^retweeted", ignore_case = TRUE))) %>% 
  unnest_tokens(output = token, input = posting) %>% 
  anti_join(get_stopwords(language = "de"), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token, language = "de")) %>% 
  filter(!stemmed %in% c("t.co", "https", "dass", "f0")) %>% 
  group_by(party) %>% 
  count(stemmed) %>% 
  slice_max(n, n = 10) %>% 
  ggplot() +
  geom_col(aes(reorder(stemmed, n), n, fill = party), show.legend = FALSE) +
  labs(x = NULL, y = "n") +
  facet_wrap(~party, ncol = 2, scales = "free") +
  scale_fill_manual(values = party_colors) +
  coord_flip()
# preparation for the SOTU exercises below: a tidy, stemmed corpus of the addresses
library(tidyverse)
library(sotu)
library(tidytext)
library(SnowballC)

sotu_clean <- sotu_meta %>% 
  bind_cols(sotu_text) %>% 
  rename(content = `...6`) %>% 
  distinct(content, .keep_all = TRUE) %>% 
  unnest_tokens(output = token, input = content) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  filter(!str_detect(token, "[:digit:]")) %>% 
  mutate(token = wordStem(token, language = "en")) %>% 
  group_by(token) %>% 
  filter(n() > 5) %>% 
  rename(word = token)
- Analyze the SOTU addresses over time using the “bing” dictionary.
- What are the results? Compare them with the “AFINN” analyses from the script.
Solution.

sotu_afinn <- get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  group_by(year) %>% 
  summarize(sentiment = mean(value))

sotu_bing <- get_sentiments("bing") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  mutate(sentiment = case_when(sentiment == "negative" ~ -1,
                               sentiment == "positive" ~ 1)) %>% 
  group_by(year) %>% 
  summarize(sentiment = mean(sentiment))

ggplot(sotu_bing) +
  geom_line(aes(x = year, y = sentiment))

sotu_afinn %>% 
  ggplot() +
  geom_line(aes(x = year, y = sentiment))
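To compare the two dictionaries directly, one option is to stack the yearly scores and map the dictionary onto colour; keep in mind that AFINN values run from -5 to 5 while the recoded bing scores run from -1 to 1, so mainly the trends are comparable:

bind_rows(sotu_afinn %>% mutate(dictionary = "afinn"),
          sotu_bing %>% mutate(dictionary = "bing")) %>% 
  ggplot() +
  geom_line(aes(x = year, y = sentiment, color = dictionary))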
- Which words contribute to them? Again, compare the results with the “AFINN” ones.
Solution.

sotu_afinn_contribution <- get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  group_by(word) %>% 
  summarize(occurences = n(),
            contribution = sum(value)) %>% 
  mutate(type = "afinn")

sotu_bing_contribution <- get_sentiments("bing") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  mutate(sentiment = case_when(sentiment == "negative" ~ -1,
                               sentiment == "positive" ~ 1)) %>% 
  group_by(word) %>% 
  summarize(occurences = n(),
            contribution = sum(sentiment)) %>% 
  mutate(type = "bing")

bind_rows(sotu_afinn_contribution %>% slice_max(contribution, n = 10),
          sotu_afinn_contribution %>% slice_min(contribution, n = 10),
          sotu_bing_contribution %>% slice_max(contribution, n = 10),
          sotu_bing_contribution %>% slice_min(contribution, n = 10)) %>% 
  mutate(type = as_factor(type),
         word = reorder_within(word, contribution, type)) %>% 
  ggplot(aes(contribution, word, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL) +
  scale_y_reordered() +
  facet_wrap(~ type, scales = "free_y")
- Calculate Intercoder Reliability (i.e., Krippendorff’s Alpha) of both dictionaries.
Solution.

library(caret)

sotus <- sotu_afinn %>% 
  mutate(sentiment_afinn = case_when(sentiment > 0 ~ "positive",
                                     sentiment < 0 ~ "negative",
                                     TRUE ~ NA_character_) %>% 
           factor(levels = c("positive", "negative"))) %>% 
  select(-sentiment) %>% 
  full_join(sotu_bing %>% 
              mutate(sentiment_bing = case_when(sentiment > 0 ~ "positive",
                                                sentiment < 0 ~ "negative",
                                                TRUE ~ NA_character_) %>% 
                       factor(levels = c("positive", "negative"))) %>% 
              select(-sentiment),
            by = "year") %>% 
  drop_na()

# one row per coder (dictionary), one column per unit (address year)
mat <- matrix(0, nrow = 2, ncol = nrow(sotus))
mat[1, ] <- sotus$sentiment_afinn
mat[2, ] <- sotus$sentiment_bing

irr::kripp.alpha(mat, method = "nominal")
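As a simpler companion figure to Krippendorff's Alpha, the raw share of addresses on which the two dictionaries agree can be computed directly:

mean(sotus$sentiment_afinn == sotus$sentiment_bing)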
- Use “AFINN” and “bing” to classify the IMDb reviews.
  - Evaluate the classification quality using caret::confusionMatrix(). Any differences?
Solution.

imdb_reviews <- read_csv("data/imdb_reviews.csv")

imdb_reviews_afinn <- imdb_reviews %>% 
  rowid_to_column("doc") %>% 
  unnest_tokens(token, text) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token)) %>% 
  inner_join(get_sentiments("afinn") %>% mutate(stemmed = wordStem(word))) %>% 
  group_by(doc) %>% 
  summarize(sentiment = mean(value)) %>% 
  mutate(sentiment_afinn = case_when(sentiment > 0 ~ "positive",
                                     TRUE ~ "negative")) %>% 
  select(doc, sentiment_afinn)

imdb_reviews_bing <- imdb_reviews %>% 
  select(-sentiment) %>% 
  rowid_to_column("doc") %>% 
  unnest_tokens(token, text) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token)) %>% 
  inner_join(get_sentiments("bing") %>% mutate(stemmed = wordStem(word))) %>% 
  mutate(sentiment = case_when(sentiment == "negative" ~ -1,
                               sentiment == "positive" ~ 1)) %>% 
  group_by(doc) %>% 
  summarize(sentiment = mean(sentiment)) %>% 
  mutate(sentiment_bing = case_when(sentiment > 0 ~ "positive",
                                    TRUE ~ "negative")) %>% 
  select(doc, sentiment_bing)

reviews <- imdb_reviews %>% 
  rowid_to_column("doc") %>% 
  select(-text)

afinn_bing <- inner_join(reviews, imdb_reviews_afinn, by = "doc") %>% 
  inner_join(imdb_reviews_bing, by = "doc") %>% 
  mutate(across(starts_with("sentiment"), ~factor(.x, levels = c("positive", "negative"))))

conf_matrix_afinn <- confusionMatrix(data = afinn_bing$sentiment, reference = afinn_bing$sentiment_afinn)
conf_matrix_bing <- confusionMatrix(data = afinn_bing$sentiment, reference = afinn_bing$sentiment_bing)
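The overall accuracy and Cohen's Kappa of each dictionary can then be pulled out of the confusionMatrix objects:

conf_matrix_afinn$overall[c("Accuracy", "Kappa")]
conf_matrix_bing$overall[c("Accuracy", "Kappa")]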
- How could you remodel the “NRC” dictionary to compare its results?
Solution.

get_sentiments("nrc") %>% 
  mutate(sentiment_numeric = case_when(sentiment == "positive" ~ 1,
                                       sentiment == "negative" ~ -1,
                                       TRUE ~ NA_real_)) %>% 
  drop_na()
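The recoded dictionary can then be dropped into the same pipeline as AFINN and bing above; a sketch (nrc_numeric and sotu_nrc are names introduced here):

nrc_numeric <- get_sentiments("nrc") %>% 
  mutate(sentiment_numeric = case_when(sentiment == "positive" ~ 1,
                                       sentiment == "negative" ~ -1,
                                       TRUE ~ NA_real_)) %>% 
  drop_na()

sotu_nrc <- nrc_numeric %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  group_by(year) %>% 
  summarize(sentiment = mean(sentiment_numeric))

ggplot(sotu_nrc) +
  geom_line(aes(x = year, y = sentiment))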
- Search for the accuracy scores that other, more cutting-edge approaches have achieved on the IMDb data set.
Solution.
https://paperswithcode.com/sota/sentiment-analysis-on-imdb
- Use the Twitter data from exercise 1.
  - Plot the volume of Tweets over time.
Solution.

tweets %>% 
  count(date) %>% 
  ggplot() +
  geom_line(aes(date, n))
  - When there are spikes in volume, this usually hints at some underlying event. Determine the event using a TF-IDF-based approach. Thereby, you treat each day as one document and calculate the TF-IDF for the respective terms. Use the 5 dates with the highest Tweet volume before October 1. If volume remains high for multiple days, this can be treated as one event. Plot your results (put the TF-IDF value on the x-axis, the word on the y-axis, and use facets for the respective events). Can you determine what drove the spikes?
Solution.

library(lubridate)

dates <- tweets %>% 
  mutate(date = case_when(date == ymd("2018-09-27") ~ ymd("2018-09-26"),
                          TRUE ~ date)) %>% 
  count(date) %>% 
  filter(date < ymd("2018-10-01")) %>% 
  slice_max(n, n = 5) %>% 
  pull(date)

tweets_tf_idf <- tweets %>% 
  mutate(date = case_when(date == ymd("2018-09-27") ~ ymd("2018-09-26"),
                          TRUE ~ date),
         document = case_when(date %in% dates ~ date %>% as.character(),
                              TRUE ~ "other")) %>% 
  unnest_tokens(word, posting) %>% 
  count(document, word) %>% 
  bind_tf_idf(word, document, n)

tweets_tf_idf %>% 
  group_by(document) %>% 
  slice_max(tf_idf, n = 5) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, tf_idf, document)) %>% 
  ggplot() +
  geom_col(aes(x = tf_idf, y = word)) +
  scale_y_reordered() +
  facet_wrap(~document, scales = "free")