Chapter 6 Exercises

6.1 Regular Expressions

  1. Write a regex for Swedish mobile numbers. Test it with str_detect("+46 71-738 25 33", "[insert your regex here]").
Solution. Click to expand!
#1
str_detect("+46 71-738 25 33", "\\+46 [0-9]{2}\\-[0-9]{3} [0-9]{2} [0-9]{2}")
  2. Remember the vector of heights?
    a. How can you extract the meters using a negative lookbehind?
    b. Bring it into numeric format (i.e., so that your_solution == c(1.3, 2.01, 3.1) yields TRUE) using regexes and stringr commands.
Solution. Click to expand!
heights <- c("1m30cm", "2m01cm", "3m10cm")

#a
meters <- str_extract(heights, "(?<!m)[0-9]")

#b
for_test <- str_replace(heights, "(?<=[0-9])m", "\\.") %>% 
  str_replace("cm", "") %>% 
  as.numeric() 

for_test == c(1.3, 2.01, 3.1)
  3. Find all Mercedes in the mtcars data set.
Solution. Click to expand!
mtcars %>% 
  rownames_to_column("model") %>% 
  filter(str_detect(model, "Merc"))
  4. Take the IMDb file and split the Genre column into different columns (hint: look at the tidyr::separate() function). How would you do it if Genre were a vector, using str_split_fixed()?
Solution. Click to expand!
imdb <- read_csv("imdb2006-2016.csv")

imdb %>% 
  separate(Genre, sep = ",", into = c("genre_1", "genre_2", "genre_3"))

imdb$Genre %>% 
  str_split_fixed(pattern = ",", 3)

6.2 rvest

  1. Download all movies from the IMDb Top 250 best movies of all time. Put them in a tibble with the columns rank (numeric), title, url (the link to the IMDb entry), and rating (numeric).
Solution. Click to expand!
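# assumes library(rvest) and library(tidyverse) are loaded
# note: the CSS selectors reflect the IMDb chart layout at the time of writing and may need updating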
imdb_top250 <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")

tibble(
  rank = imdb_top250 %>% 
    html_elements(".titleColumn") %>% 
    html_text2() %>% 
    str_extract("^[0-9]+(?=\\.)") %>% 
    parse_integer(),
  title = imdb_top250 %>% 
    html_elements(".titleColumn a") %>% 
    html_text2(),
  url = imdb_top250 %>% 
    html_elements(".titleColumn a") %>% 
    html_attr("href") %>% 
    str_c("https://www.imdb.com", .),
  rating = imdb_top250 %>% 
    html_elements("strong") %>% 
    html_text() %>% 
    parse_double()
)
  2. Scrape the table on “https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/”, which contains the total cases and deaths for each state; the table on “https://usafacts.org/visualizations/covid-vaccine-tracker-states”, which contains the vaccination progress; and the table on “https://simple.wikipedia.org/wiki/List_of_U.S._states” to get more data on U.S. states (hint: use janitor::clean_names() because the column names are a bit off, and remove the first row with slice()). Build a regression model that tries to predict the number of COVID cases or deaths (normalized by the total number of inhabitants, of course). Include all predictors you deem relevant (e.g., vaccination progress, Republican/Democratic senators – note that some states have both, population density – i.e., inhabitants per square kilometer, etc.). You will need some regular expressions here.
Solution. Click to expand!
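# assumes library(rvest), library(tidyverse), and library(janitor) are loaded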
cov_cases <- read_html("https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/") %>%
  html_table() %>% 
  pluck(2) %>% 
  clean_names() %>% 
  mutate(across(2:5, ~str_remove_all(.x, "[^0-9]") %>% 
                  as.numeric()))

vaccination_stuff <- read_html("https://usafacts.org/visualizations/covid-vaccine-tracker-states") %>%
  html_table() %>% 
  pluck(1) %>% 
  clean_names() %>% 
  mutate(across(2:6, ~str_remove_all(.x, "[^0-9]") %>% 
                  as.numeric()))

demographic_stuff <- read_html("https://simple.wikipedia.org/wiki/List_of_U.S._states") %>% 
  html_table(fill = TRUE) %>% 
  pluck(1) %>% 
  clean_names() %>% 
  slice(2:51) %>% 
  mutate(state = str_remove(name_postal_abbs_1, "\\[C\\]$")) %>% 
  select(state, population = population_b_3, area = total_area_4_2) %>% 
  mutate(across(2:3, ~str_remove_all(.x, "[^0-9]") %>% 
                  as.numeric()))

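# senators_clean_names is not built in this solution; it is assumed to be a cleaned
# table of U.S. senators with (at least) a state and a party column. A hedged sketch
# of how it could be assembled (the table index and column names may need adjusting):
senators_clean_names <- read_html("https://en.wikipedia.org/wiki/List_of_current_United_States_senators") %>% 
  html_table(fill = TRUE) %>% 
  pluck(5) %>%             # pick the table that lists the senators; the position may differ
  clean_names() %>% 
  select(state, party)     # adjust to the actual column names returned by clean_names()
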
majority <- senators_clean_names %>% 
  count(party, state) %>% 
  group_by(state) %>% 
  filter(n == max(n)) %>% 
  add_count(state) %>% 
  mutate(nn = case_when(nn == 2 ~ NA_integer_,
                        TRUE ~ nn)) %>% 
  drop_na() %>% 
  select(state, party) 

final_tbl <- senators_clean_names %>% 
  left_join(demographic_stuff) %>% 
  left_join(vaccination_stuff) %>% 
  left_join(cov_cases) %>% 
  select(-party) %>% 
  left_join(majority) %>% 
  distinct(state, .keep_all = TRUE) %>% 
  mutate(percentage_deaths = deaths/population,
         pop_density = population/area)

lm(percentage_deaths ~ party + percent_of_pop_fully_vaccinated + pop_density, data = final_tbl) %>% 
  summary()
  3. Write a while loop that accomplishes the same thing as the loops above (i.e., scrapes the wg-gesucht listings), yet does so using html_session() (now called session()).
    a. Extend it to scrape only the listings that are still available. Can you determine whether they are available without making an unnecessary call – and if not, can you make as few calls as possible?
    b. Think about how you can store the listings and their contents in a structured manner. Bonus points for doing it in a function.
    c. Are there any social scientific questions you could ask with those data?
Solution. Click to expand!
# a -- if an apartment is rented out, the picture element doesn't contain a link anymore
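# assumes library(rvest), library(tidyverse), library(lubridate), and library(magrittr) (for %<>%) are loaded;
# html_attrs_dfr() is assumed to be a helper defined earlier that returns the elements'
# attributes (plus their text as .text) in a tibble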
output_list <- vector(mode = "list", length = 10L)

fix_date <- function(date_vec){
  # listings that carry a proper date ("dd.mm.yyyy")
  proper_dates <- str_extract(date_vec, "[0-9]{2}\\.[0-1][0-9]\\.20[0-3][0-9]") %>% 
    parse_date(format = "%d.%m.%Y") %>% 
    .[!is.na(.)]
  # listings posted minutes or hours ago ("Minuten"/"Stunden") --> today
  today <- date_vec[str_detect(date_vec, "Minuten|Stunde")] %>% 
    str_replace(".+", today() %>% as.character()) %>% 
    ymd()
  # listings posted "x Tage" (days) ago --> subtract that number of days
  days_ago <- date_vec[str_detect(date_vec, "Tag")] %>% 
    str_replace(., 
                ".+", 
                (today() - days(str_extract(., "[0-9]+(?= Tag)") %>% 
                                  as.numeric())) %>% 
                  as.character()) %>% 
    ymd()
  c(today, days_ago, proper_dates)
}

i <- 0
date <- today()
end_date <- today() - months(1)
wg_session <- session("https://www.wg-gesucht.de/1-zimmer-wohnungen-in-Leipzig.77.1.1.0.html") 

while (date >= end_date) {
  i <- i + 1
  page <- read_html(wg_session)

  output_list[[i]] <- page %>% 
    html_elements(".truncate_title a") %>% 
    html_attrs_dfr() %>% 
    filter(class == "detailansicht") %>% 
    select(link = href, title = .text) %>% 
    mutate(title = title %>% str_squish(),
           link = url_absolute(link, base = "https://www.wg-gesucht.de/"))

  output_list[[i]]$date <- page %>% 
    html_elements("span:nth-child(2)") %>% 
    html_text2() %>% 
    .[str_detect(., "^Online")] %>% 
    fix_date()
  
  date <- output_list[[i]]$date %>% tail(1)
  
  # links that still lead somewhere -- i.e., listings that are still available
  available_links <- page %>% 
    html_elements(".card_image a") %>% 
    html_attr("href") %>% 
    url_absolute(base = "https://www.wg-gesucht.de/")
  
  output_list[[i]] <- output_list[[i]] %>% 
    filter(link %in% available_links)
    
  if (i == 1) wg_session %<>% session_follow_link(css = "#main_column li:nth-child(15) a")
  if (i > 1) wg_session %<>% session_follow_link(css = "#main_column li:nth-child(16) a")
}

listing_list <- output_list %>% bind_rows()

# b 
scrape_listing <- function(link) {
  page <- read_html(link)
  # return the pieces as a named list
  list(
    costs = page %>% 
      html_table() %>% 
      pluck(1),
    details = page %>% 
      html_elements(":nth-child(12) .col-xs-12") %>% 
      html_text2(),
    apartment_description = page %>% 
      html_elements("#freitext_0_content") %>% 
      html_text2(),
    location_description = page %>% 
      html_elements("#freitext_1_content") %>% 
      html_text2(),
    number = page %>% 
      html_elements(".col-md-4 :nth-child(2) .col-md-12") %>% 
      html_text2()
  )
}
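
A hedged usage sketch for applying scrape_listing() to the links gathered in part a (assuming the tidyverse is loaded and listing_list exists; throttle the requests to be polite):

listings_content <- listing_list$link %>% 
  head(5) %>%                         # only a handful of calls for illustration
  map(~{
    Sys.sleep(2)                      # pause between requests
    scrape_listing(.x)
  }) %>% 
  set_names(head(listing_list$title, 5))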
  4. Go to Wikipedia and search for an article you find interesting using R.
Solution. Click to expand!
wiki_session <- session("https://en.m.wikipedia.org/wiki/Main_Page")
wiki_startpage <- wiki_session %>% read_html()

search <- html_form_set(wiki_session %>% html_form() %>% pluck(1),
                        search = "R")

search_r <- wiki_session %>% 
  session_submit(search) %>% 
  read_html()
  
search_r %>% html_elements("p") %>% html_text2() # --> worked

6.3 APIs

  1. Explore the API of the New York Times. Scrape the following:
    a. the hardcover fiction bestseller lists from June 2019 (store them in separate tibbles in a list).
    b. articles from the 6th to the 10th of January 2021 dealing with the Capitol.
Solution. Click to expand!
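# assumes library(httr) (modify_url, GET, content) and library(jsonlite) (fromJSON) are loaded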
#a 
map(c("02", "09", "16", "23", "30"), ~{
  modify_url(
    url = str_c("https://api.nytimes.com/svc/books/v3/lists/2019-06-", .x, "/hardcover-fiction.json"),
    query = list(`api-key` = Sys.getenv("nyt_api_key"))
  )  %>% 
    GET() %>% 
    content(as = "text") %>%
    fromJSON()
    }
)


#b
modify_url(
  url = "http://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "capitol",
               pub = "20210106",
               end_date = "20210110",
               `api-key` = Sys.getenv("nyt_api_key"))
) %>% 
  GET() %>% 
  content(as = "text") %>%
  fromJSON()
  2. Search the API database on The Programmable Web and choose one. Send some GET() requests (a minimal sketch follows below).

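A minimal, hedged sketch, using the freely accessible Open Notify API (no key required) as a stand-in for whichever API you choose:

library(httr)
library(jsonlite)

# who is currently in space? -- a simple, unauthenticated GET request
resp <- GET("http://api.open-notify.org/astros.json")
fromJSON(content(resp, as = "text"))
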
  3. Play around with rtweet. You can use, for instance, this list.

    a. Get the followers of 10 members. Automate the process. Get all of them (hint: look at the default parameters of the function).
    b. Get the authors’ bios – could you infer their party affiliation from their bios?
    c. Think about the recent climate summit. Come up with three keywords/hashtags that might describe it. Please send them to me by email. Acquire the tweets related to this event (i.e., the ones that contain your keyword). Plot their frequency over time – what patterns do you see?
    d. Stream all tweets that were sent from Munich (hint: lookup_coords("munich, germany")).
    e. Analyze the liking behavior of the politicians: look at 10 politicians’ 100 latest favorited tweets – do they seem to have a preference for members of the same party?
Solution. Click to expand!
library(rtweet)
pol_list <- lists_members(
  list_id = "1050737868606455810"
)
#a 
followers <- pol_list$screen_name[1:10] %>% 
  map(~get_followers(.x, n = lookup_users(.x) %>% pull(followers_count), retryonratelimit = TRUE))

#b 
pol_list %>% filter(description %>% str_detect(regex("afd|cdu|csu|fdp|b.?90|linke", ignore_case = TRUE)))

#c
keywords <- c("glasgow", "cop26", "globalwarming")

cop_tweets <- map(keywords, search_tweets, n = 6000, include_rts = FALSE)

cop_tweets %>% bind_rows() %>% 
  ggplot() +
  geom_histogram(aes(x = created_at))

# --> pattern: we need more data

#d 
tweets_muc <- stream_tweets(timeout = 60, geocode = lookup_coords("munich, germany"))

#e
favs <- get_favorites(pol_list$screen_name[1:10], n = 100)
  4. Name four shortcomings of Twitter data and the implications those may have for research. How could one address the shortcomings you mentioned, and which data would you wish for to do so?

6.4 Text Mining

  1. Read in some Twitter data. Bring them into tidytext format. Stem them.
Solution. Click to expand!
library(tidyverse)
library(tidytext)
tweets <- read_csv("data/twitter-set.csv") %>% 
  mutate(id = as.character(id)) 

tidy_tweets <- tweets %>% 
  unnest_tokens(output = token, input = posting)

library(SnowballC)
tidy_stemmed <- tidy_tweets %>% 
  mutate(stemmed = wordStem(token, language = "de"))
  2. What are the 10 terms the respective parties are using the most (before and after stopword removal)? Visualize your results using ggplot2 and geom_col(). Are there any differences? Should you remove further terms? Plus, do the results change once you remove retweets?
Solution. Click to expand!
library(stopwords)
party_colors <- c("AFD" = "#009ee0", 
                  "CSU" = "#000000", 
                  "DIE LINKE" = "#d837f0", 
                  "FDP" = "#feed01", 
                  "DIE GRUENEN" = "#1faf12", 
                  "SPD" = "#e2001a", 
                  "FREIE WAEHLER" = "#e58609")
tidy_stemmed %>% 
  anti_join(get_stopwords(language = "de"), by = c("token" = "word")) %>% 
  group_by(party) %>% 
  count(stemmed) %>% 
  slice_max(n, n = 10) %>% 
  ggplot() +
  geom_col(aes(reorder(stemmed, n), n, fill = party), show.legend = FALSE) +
  labs(x = NULL, y = "n") +
  facet_wrap(~party, ncol = 2, scales = "free") +
  scale_fill_manual(values = party_colors) +
  coord_flip() 
Solution. Click to expand!
tweets %>% 
  filter(!str_detect(posting, regex("^rt|^retweeted", ignore_case = TRUE))) %>% 
  unnest_tokens(output = token, input = posting) %>% 
  anti_join(get_stopwords(language = "de"), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token, language = "de")) %>% 
  filter(!stemmed %in% c("t.co", "https", "dass", "f0")) %>% 
  group_by(party) %>% 
  count(stemmed) %>% 
  slice_max(n, n = 10) %>% 
  ggplot() +
  geom_col(aes(reorder(stemmed, n), n, fill = party), show.legend = FALSE) +
  labs(x = NULL, y = "n") +
  facet_wrap(~party, ncol = 2, scales = "free") +
  scale_fill_manual(values = party_colors) +
  coord_flip()

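# preprocessing of the SOTU addresses (as in the script); sotu_clean is reused in the following exercises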
library(tidyverse)
library(sotu)
library(tidytext)
library(SnowballC)
sotu_clean <- sotu_meta %>% 
  bind_cols(sotu_text) %>% 
  rename(content = `...6`) %>% 
  distinct(content, .keep_all = TRUE) %>% 
  unnest_tokens(output = token, input = content) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  filter(!str_detect(token, "[:digit:]")) %>% 
  mutate(token = wordStem(token, language = "en")) %>% 
  group_by(token) %>% 
  filter(n() > 5) %>% 
  rename(word = token)
  3. Analyze the SOTU addresses over time using the “bing” dictionary.
    a. What are the results? Compare them with the “AFINN” analyses from the script.
Solution. Click to expand!
sotu_afinn <- get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  group_by(year) %>% 
  summarize(sentiment = mean(value))

sotu_bing <- get_sentiments("bing") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  mutate(sentiment = case_when(sentiment == "negative" ~ -1,
                               sentiment == "positive" ~ 1)) %>% 
  group_by(year) %>% 
  summarize(sentiment = mean(sentiment))

ggplot(sotu_bing) +
  geom_line(aes(x = year, y = sentiment))

sotu_afinn %>% 
  ggplot() +
  geom_line(aes(x = year, y = sentiment))
    b. Which words contribute to them? Again, compare the results with the “AFINN” ones.
Solution. Click to expand!
sotu_afinn_contribution <- get_sentiments("afinn") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>%
  group_by(word) %>%
  summarize(occurrences = n(),
            contribution = sum(value)) %>% 
  mutate(type = "afinn") 

sotu_bing_contribution <- get_sentiments("bing") %>% 
  mutate(word = wordStem(word, language = "en")) %>% 
  inner_join(sotu_clean) %>% 
  mutate(sentiment = case_when(sentiment == "negative" ~ -1,
                               sentiment == "positive" ~ 1)) %>%
  group_by(word) %>%
  summarize(occurrences = n(),
            contribution = sum(sentiment)) %>% 
  mutate(type = "bing")

bind_rows(sotu_afinn_contribution %>% slice_max(contribution, n = 10),
          sotu_afinn_contribution %>% slice_min(contribution, n = 10),
          sotu_bing_contribution %>% slice_max(contribution, n = 10),
          sotu_bing_contribution %>% slice_min(contribution, n = 10)) %>%
  mutate(type = as_factor(type),
         word = reorder_within(word, contribution, type)) %>%
  ggplot(aes(contribution, word, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL) +
  scale_y_reordered() +
  facet_wrap(~ type, scales = "free_y")
    c. Calculate Intercoder Reliability (i.e., Krippendorff’s Alpha) of both dictionaries.
Solution. Click to expand!
library(caret)
sotus <- sotu_afinn %>% mutate(sentiment_afinn = case_when(sentiment > 0 ~ "positive",
                                                         sentiment < 0 ~ "negative",
                                                         TRUE ~ NA_character_) %>% 
                        factor(levels = c("positive", "negative"))) %>% 
  select(-sentiment) %>% 
  full_join(sotu_bing %>% 
              mutate(sentiment_bing = case_when(sentiment > 0 ~ "positive",
                                                sentiment < 0 ~ "negative",
                                                TRUE ~ NA_character_) %>% 
                        factor(levels = c("positive", "negative"))) %>% 
              select(-sentiment), 
            by = "year") %>% 
  drop_na() 

# kripp.alpha() expects a raters x units matrix; assigning the factors stores their integer codes
mat <- matrix(0, nrow = 2, ncol = nrow(sotus))
mat[1, ] <- sotus$sentiment_afinn
mat[2, ] <- sotus$sentiment_bing

irr::kripp.alpha(mat, method = "nominal")
Krippendorff’s Alpha is 0. This means that the two dictionaries do not agree beyond what would be expected by chance. This is due to the fact that virtually every SOTU address gets rated “positive.” Krippendorff’s Alpha requires a certain balance across the categories, though (which you will find in the IMDb data set). More on that in this ResearchGate discussion: https://www.researchgate.net/post/Why_is_reliability_so_low_when_percentage_of_agreement_is_high
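A quick toy illustration (hypothetical data, not the SOTU ratings): when two raters assign “positive” to 99 out of 100 units, Alpha ends up at roughly 0 despite near-perfect raw agreement, because there is almost no variation for the coefficient to work with.

demo <- matrix(1, nrow = 2, ncol = 100)     # both raters say "positive" (1) everywhere...
demo[1, 1] <- 2                             # ...except for a single "negative" (2) rating
irr::kripp.alpha(demo, method = "nominal")  # alpha comes out at (about) 0, agreement is 99 %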
  4. Use “AFINN” and “bing” to classify the IMDb reviews.
    a. Evaluate the classification quality using caret::confusionMatrix(). Any differences?
Solution. Click to expand!
imdb_reviews <- read_csv("data/imdb_reviews.csv")

imdb_reviews_afinn <- imdb_reviews %>% 
  rowid_to_column("doc") %>% 
  unnest_tokens(token, text) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token)) %>% 
  inner_join(get_sentiments("afinn") %>% mutate(stemmed = wordStem(word))) %>% 
  group_by(doc) %>% 
  summarize(sentiment = mean(value)) %>% 
  mutate(sentiment_afinn = case_when(sentiment > 0 ~ "positive",
                                     TRUE ~ "negative")) %>% 
  select(doc, sentiment_afinn)

imdb_reviews_bing <- imdb_reviews %>% 
  select(-sentiment) %>% 
  rowid_to_column("doc") %>% 
  unnest_tokens(token, text) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token)) %>% 
  inner_join(get_sentiments("bing") %>% mutate(stemmed = wordStem(word))) %>% 
  mutate(sentiment = case_when(sentiment == "negative" ~ -1,
                               sentiment == "positive" ~ 1)) %>% 
  group_by(doc) %>% 
  summarize(sentiment = mean(sentiment)) %>% 
  mutate(sentiment_bing = case_when(sentiment > 0 ~ "positive",
                                     TRUE ~ "negative")) %>% 
  select(doc, sentiment_bing)

reviews <- imdb_reviews %>% 
  rowid_to_column("doc") %>% 
  select(-text)

afinn_bing <- inner_join(reviews, imdb_reviews_afinn, by = "doc") %>% 
  inner_join(imdb_reviews_bing, by = "doc") %>% 
  mutate(across(starts_with("sentiment"), ~factor(.x, levels = c("positive", "negative"))))

# data = the dictionary classification, reference = the actual IMDb label
conf_matrix_afinn <- confusionMatrix(data = afinn_bing$sentiment_afinn, reference = afinn_bing$sentiment)
conf_matrix_bing <- confusionMatrix(data = afinn_bing$sentiment_bing, reference = afinn_bing$sentiment)
    b. How could you remodel the “NRC” dictionary to compare its results?
Solution. Click to expand!
get_sentiments("nrc") %>% mutate(sentiment_numeric = case_when(sentiment == "positive" ~ 1,
                                                               sentiment == "negative" ~ -1,
                                                               TRUE ~ NA_real_)) %>% 
  drop_na()
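
A hedged sketch of how the remodeled dictionary could be plugged into the same pipeline as “AFINN” and “bing” above (assuming the objects from the previous exercise are still around):

nrc_pos_neg <- get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  mutate(sentiment_numeric = if_else(sentiment == "positive", 1, -1),
         stemmed = wordStem(word))

imdb_reviews_nrc <- imdb_reviews %>% 
  select(-sentiment) %>%                 # drop the true label to avoid a name clash
  rowid_to_column("doc") %>% 
  unnest_tokens(token, text) %>% 
  anti_join(get_stopwords(), by = c("token" = "word")) %>% 
  mutate(stemmed = wordStem(token)) %>% 
  inner_join(nrc_pos_neg, by = "stemmed") %>% 
  group_by(doc) %>% 
  summarize(sentiment_nrc = if_else(mean(sentiment_numeric) > 0, "positive", "negative"))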
    c. Search for the accuracy scores other, more cutting-edge approaches have achieved on the IMDb data set.
Solution. Click to expand!
https://paperswithcode.com/sota/sentiment-analysis-on-imdb
  5. Use the Twitter data from exercise 1.
    a. Plot the volume of tweets over time.
Solution. Click to expand!
tweets %>% 
  count(date) %>% 
  ggplot() +
  geom_line(aes(date, n))
    b. When there are spikes in volume, this usually hints at some underlying event. Determine the event using a TF-IDF-based approach: treat each day as one document and calculate the TF-IDF for the respective terms. Use the 5 dates with the highest tweet volume before October 1. If the volume remains high for multiple days, treat them as one event. Plot your results (put the TF-IDF value on the x-axis, the word on the y-axis, and use facets for the respective events). Can you determine what drove the spikes?
Solution. Click to expand!
library(lubridate)
dates <- tweets %>% 
  mutate(date = case_when(date == ymd("2018-09-27") ~ ymd("2018-09-26"),
                          TRUE ~ date)) %>% 
  count(date) %>% 
  filter(date < ymd("2018-10-01")) %>% 
  slice_max(n, n = 5) %>% 
  pull(date)

tweets_tf_idf <- tweets %>% 
  mutate(date = case_when(date == ymd("2018-09-27") ~ ymd("2018-09-26"),
                          TRUE ~ date),
         document = case_when(date %in% dates ~ date %>% as.character(),
                              TRUE ~ "other")) %>% 
  unnest_tokens(word, posting) %>% 
  count(document, word) %>% 
  bind_tf_idf(word, document, n)

tweets_tf_idf %>% 
  group_by(document) %>% 
  slice_max(tf_idf, n = 5) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, tf_idf, document)) %>% 
  ggplot() + 
  geom_col(aes(x = tf_idf, y = word)) +
  scale_y_reordered() +
  facet_wrap(~document, scales = "free")