16.2 Lab: Using Google ML APIs

In this lab we’ll use Twitter data to illustrate different Google ML APIs (translation, sentiment coding, syntax analysis, image analysis). The lab provides a quick overview rather than a deep dive into the individual APIs.

16.2.1 Software

  • googleLanguageR: Interact with the Google language APIs listed below (a short sketch follows after this list)
    • See the vignette for an overview
    • Google Natural Language API
      • Entity analysis (finds named entities, their types, salience, mentions with properties, and metadata)
      • Syntax analysis (e.g., identify nouns)
      • Sentiment analysis (provides sentiment scores)
      • Content classification (assigns texts to predefined categories)
    • Google Cloud Speech-to-Text API
    • Google Cloud Text-to-Speech API
    • Google Cloud Translation API
  • googleCloudVisionR: Interact with the Google Vision API
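
To give a rough sense of how these services are accessed from R, here is a minimal sketch (not run; it assumes a working authentication as set up further below, and the example strings and file names are made up):

# Natural Language API: one function, different nlp_type values
# gl_nlp("Angela Merkel spoke in Berlin.",
#        nlp_type = "analyzeEntities")  # or "analyzeSyntax", "analyzeSentiment", "classifyText"

# Translation API
# gl_translate("Guten Morgen", target = "en")

# Speech-to-Text and Text-to-Speech APIs
# gl_speech("recording.wav")    # transcribe an audio file
# gl_talk("Hello world")        # synthesize speech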

16.2.2 Install & load packages

#library(pacman)
#p_load(rtweet, tidyverse, tidytext, foreach, ggwordcloud, googleCloudVisionR, googleLanguageR)

# Load packages
library(rtweet)
library(tidyverse)
library(tidytext)
library(foreach)
library(ggwordcloud)
library(googleCloudVisionR)
library(googleLanguageR) # can be used to access the language APIs

16.2.3 Twitter: Authenticate & load data

In order to get access to the Twitter API you need to create an app and generate the corresponding API keys on the Twitter developer platform. See the slides and the lab on Twitter starting with [X-Twitter’s APIs] (we didn’t go through them). Here we’ll merely download a few tweets to explore the Google ML APIs.

  • In case you don’t have a Twitter developer account, you can also download the data further below!
library(rtweet)

# Authenticate with your API keys (replace the placeholders with your own keys)
token <- create_token(
  consumer_key = "#######################",
  consumer_secret = "#######################",
  access_token = "#######################",
  access_secret = "#######################")


get_token()
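
Note that create_token() is deprecated in newer versions of rtweet (≥ 1.0). If authentication fails there, an app-based setup along the following lines may be needed (a hedged sketch; the bearer token is a placeholder):

# rtweet >= 1.0: authenticate with a bearer token instead of create_token()
# auth <- rtweet_app(bearer_token = "#######################")
# auth_as(auth)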

We’ll work with tweets by Alice Weidel (AfD) and Martin Schulz (SPD). The tweets themselves are text data that we can analyze using the Google language APIs.

# Load most recent tweets
tweets_alice <- get_timeline("Alice_Weidel", n = 25)
tweets_martin <- get_timeline("MartinSchulz", n = 25)
# Bind data rowwise
data_tweets <- bind_rows(tweets_alice, tweets_martin)
nrow(data_tweets)
save(data_tweets, 
     file = "./data & material/data_tweets.RData")

If you can’t authenticate with Twitter, download the file data_tweets.RData from the material folder, store it in your working directory, and load it into R with the command below:

# Remember to adapt the path!
load("./data & material/data_tweets.RData")

16.2.4 Google: Authenticate

Remember the instructions for setting up your Google research credits and Google API access.

Fill in the quotation marks with the path to the JSON file you created and read it in with gl_auth().

#gl_auth("your_JSON_file.json")
gl_auth("./keys/css-seminar-2021-a1e75382ae2c.json")

16.2.5 Translation API

We can use the Cloud Translation API to translate the tweets from German to English (other target languages can of course be chosen as well; check the language codes under the following link and replace the string “en” in the target argument: https://developers.google.com/admin-sdk/directory/v1/languages).

data_translation <- gl_translate(data_tweets$text, target ="en")
# data_tweets <- bind_cols(data_tweets, data_translation)
head(data_translation)

# Add translation to twitter data
data_tweets <- bind_cols(data_tweets, 
                         data_translation %>% select(translatedText))

16.2.6 NLP API: Sentiment

Next, we use the Cloud Natural Language API to analyze the sentiment of each tweet. We choose nlp_type = "analyzeSentiment", which returns sentiment both for whole documents and for single sentences. The chunk returns a list object in which the element documentSentiment contains the document-level sentiment scores, stored in the score variable. The variable score indicates the sentiment level from -1 (negative) to +1 (positive).

# Analyze sentiment
data_sentiment <- gl_nlp(data_tweets$translatedText, 
                         nlp_type = "analyzeSentiment")

# Show first lines of sentiment scores
head(data_sentiment$documentSentiment)


# Add sentiment to twitter data
data_tweets <- bind_cols(data_tweets, 
                         data_sentiment$documentSentiment %>% select(score))

# Return statements with min/max scores
data_tweets %>%
  select(translatedText, score) %>%
  filter(score == min(score) | score == max(score))
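
A simple numeric summary, e.g., the average sentiment per account, can be computed directly from the merged data (a small sketch using the columns created above):

# Average sentiment score per account
data_tweets %>%
  group_by(screen_name) %>%
  summarise(mean_score = mean(score, na.rm = TRUE),
            n_tweets = n())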

We can also plot the data:

# Plot data 
library(ggridges)
data_tweets %>%
  ggplot(aes(x = score,
             y = screen_name)) +
  geom_density_ridges(aes(fill = screen_name),
                      alpha = 0.5) +
  scale_fill_manual(name = "Name",
                    values = c("blue", "red")) +
  xlab("Sentiment score") +
  ylab("Screen names") +
  theme(panel.background = element_rect(fill = NA),
        legend.position = "top",
        panel.grid.major = element_line(colour = "black",
                                        linetype = 9,
                                        size = 0.3),
        panel.ontop = FALSE,
        axis.title = element_text(size = 16))

16.2.7 NLP API: Syntax

The Google NLP API also allows for analyzing syntax. This is extremely helpful as sometimes we may want to isolate certain parts of a sentence. Below we extract the nouns and subsequently plot them using a wordcloud.

data_syntax <- gl_nlp(data_tweets$translatedText, nlp_type = "analyzeSyntax")

# Access the syntax analysis for the single tokens ("words")
# Check out the tokens of the first tweet
# head(data_syntax[["tokens"]][[1]][, 1:5])

# Add dataframe with tokens syntax to original dataframe
data_tweets$syntax_tokens <- data_syntax[["tokens"]]



# Keep only the nouns
data_tweets <- data_tweets %>% 
  mutate(tokens_nouns = map(syntax_tokens, 
                            ~ filter(., tag == "NOUN"))) 

head(data_tweets %>% 
  select(screen_name, syntax_tokens, tokens_nouns))
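
The token data frames contain more than just nouns. For instance, we can count how often each part-of-speech tag occurs per account (a quick sketch based on the tag column used above):

# Frequency of part-of-speech tags per account
data_tweets %>%
  select(screen_name, syntax_tokens) %>%
  unnest(syntax_tokens) %>%
  dplyr::count(screen_name, tag, sort = TRUE)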

# Create the data for the plot
data_plot <- data_tweets %>% 
  # Only keep the content variable within each tokens data frame
  mutate(tokens_nouns = map(tokens_nouns, 
                            ~ select(., content))) %>% 
  # Expand the list column so that each token gets its own row
  unnest(tokens_nouns) %>%
  # Delete hashtags from the strings
  mutate(content = gsub("# ", "", content)) %>%
  # Unnest into single words
  select(screen_name, content) %>%
  unnest_tokens(output = word, input = content) %>%
  # Remove stop words, count words per account and drop URL artifacts
  anti_join(stop_words) %>%
  group_by(screen_name) %>%
  dplyr::count(word) %>%
  filter(word != "https") %>%
  filter(word != "t.co") %>%
  arrange(desc(n), screen_name) %>%
  filter(n > 3)

# Visualize in a word cloud
data_plot %>%
  ggplot(aes(label = word, 
             size = n,
             color = screen_name)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()

16.2.8 Analyzing images

In addition, tweets may contain images that we can analyze using the Google Vision API. Each tweet either comes with an image (and the corresponding link in media_url) or not.

How many of the tweets in the data (of the two politicians) contain images?

table(is.na(data_tweets$media_url))

Who uses more images?

table(data_tweets$screen_name, 
      is.na(data_tweets$media_url))

Below we download those images and store them in a directory (we should always store the data we analyze locally).
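
Before downloading, make sure the target directory exists (a minimal sketch; the path mirrors the one used in the download loop below):

# Create the target directory for the images if it does not exist yet
dir.create("./data/twitter_images", recursive = TRUE, showWarnings = FALSE)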

# Filter: status_id & tweets with pictures
tweets_media <- data_tweets %>% 
  filter(!is.na(media_url)) %>% 
  select(screen_name, user_id, status_id, media_url)

# Loop through tweets with pictures, download them and name them according to status_id
foreach(i = unlist(tweets_media$media_url), # Url to image
        k = tweets_media$user_id, # Account ID
        y = tweets_media$screen_name, # Account name
        j = tweets_media$status_id) %do% # Tweet ID
  curl::curl_download(i, destfile = paste0("./data/twitter_images/", y, "_", k, "_", j, ".jpg"))

Next we try to recognize text in the images (we do so directly via the image URLs rather than via the downloaded files):

# "API returned: Cloud Vision API has not been used in project 1167376142 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/vision.googleapis.com/overview?project=1167376142 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry."

# Recognize text entities 
tweets_media$image_entities <- foreach(l = tweets_media$media_url) %do% 
gcv_get_image_annotations(
  imagePaths = l,
  feature = "TEXT_DETECTION",
  maxNumResults = 7
)
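
Before collapsing the results we can peek at the returned annotations, e.g., how many text elements were detected per image (the description column is the one used below; treat the exact structure of the return value as something to verify):

# Number of detected text elements per image
sapply(tweets_media$image_entities, NROW)

# Inspect the annotations for the first image
# head(tweets_media$image_entities[[1]])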

Then we collapse the list of words for each image into a single string:

# Collapse list of words into sentences
tweets_media$image_entities <- foreach(m = 1:length(tweets_media$image_entities)) %do%
paste(tweets_media$image_entities[[m]][["description"]], collapse = " ")

# Create string vector from list
tweets_media$image_entities <- unlist(tweets_media$image_entities)

And we add it to the original tweet dataset:

data_tweets <- left_join(data_tweets, 
                         tweets_media %>% select(-media_url),
                         by = c("user_id", "status_id"))

We also try to recognize objects in those images:

tweets_media$image_objects <- foreach(l = tweets_media$media_url) %do% 
  gcv_get_image_annotations(
    imagePaths = l,
    feature = "LABEL_DETECTION",
    maxNumResults = 20
  )

Then we turn the list of objects into a string variable:

tweets_media$image_objects <- foreach(m = 1:length(tweets_media$image_objects)) %do%
  paste(tweets_media$image_objects[[m]][["description"]], collapse = " ")
tweets_media$image_objects <- unlist(tweets_media$image_objects)

Then we join the scraped data with the original tweet data:

data_tweets <- left_join(data_tweets, 
                         tweets_media %>% select(-media_url, image_objects, user_id, status_id),
                         by = c("user_id", "status_id"))
head(data_tweets %>% 
       select(translatedText, image_objects))
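
As a final step we can, for example, compare which object labels the API detected for the two accounts (a small sketch that tokenizes the image_objects strings created above):

# Count detected object labels per account
data_tweets %>%
  filter(!is.na(image_objects)) %>%
  select(screen_name, image_objects) %>%
  unnest_tokens(output = label, input = image_objects) %>%
  dplyr::count(screen_name, label, sort = TRUE) %>%
  head(10)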

16.2.9 References