16.2 Lab: Using Google ML APIs
In this lab we’ll use Twitter data to illustrate the use of different Google ML APIs (translation, sentiment coding, syntax analysis, image analysis). The lab therefore provides a quick overview rather than a deep dive into the individual APIs.
16.2.1 Software
googleLanguageR
: Interact with the Google Natural Language API. See the vignette for an overview.
  - Google Natural Language API (all four features below are accessed via gl_nlp(); see the sketch after this list)
    - Entity analysis (i.e., finds named entities, their types, salience, mentions + properties, metadata)
    - Syntax (i.e., syntax analysis, e.g., identifying nouns)
    - Sentiment (i.e., provides sentiment scores)
    - Content classification (i.e., classifies content into categories)
  - Google Cloud Speech-to-Text API
  - Google Cloud Text-to-Speech API
  - Google Cloud Translation API

googleCloudVisionR
: Interact with the Google Vision API. See the vignette for an overview.
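As a rough sketch of how the Natural Language features listed above are reached: they all go through a single function, gl_nlp(), whose nlp_type argument switches between the analyses. The argument values below follow the googleLanguageR documentation and are worth double-checking in the vignette.

# Sketch only (see the gl_nlp() help page / vignette for the exact values):
# gl_nlp(text, nlp_type = "analyzeEntities")   # entity analysis
# gl_nlp(text, nlp_type = "analyzeSyntax")     # syntax analysis
# gl_nlp(text, nlp_type = "analyzeSentiment")  # sentiment scores
# gl_nlp(text, nlp_type = "classifyText")      # content classification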
16.2.2 Install & load packages
# Alternative: install & load all packages in one step with pacman (uncomment to use)
# library(pacman)
# p_load(rtweet, tidyverse, tidytext, foreach, ggwordcloud, googleCloudVisionR, googleLanguageR)
# Load packages
library(rtweet)
library(tidyverse)
library(tidytext)
library(foreach)
library(ggwordcloud)
library(googleCloudVisionR)
library(googleLanguageR) # can be used to access the language APIs
16.2.3 Twitter: Authenticate & load data
In order to get access to the Twitter API you need to create an app and generate the corresponding API keys on the Twitter developer platform. See the slides and the lab on Twitter starting with [X-Twitter’s APIs] (we didn’t go through them). Here we merely download a few tweets to explore the Google ML APIs.
- In case you don’t have a Twitter developer account you can also download the data further below!
library(rtweet)

# Authenticate via web browser
token <- create_token(
  consumer_key    = "#######################",
  consumer_secret = "#######################",
  access_token    = "#######################",
  access_secret   = "#######################")

get_token()
We’ll work with tweets by Alice Weidel (AfD) and tweets by Martin Schulz (SPD). The tweets themselves are text data that we can analyze using the Google Natural Language API.
# Load most recent tweets
tweets_alice  <- get_timeline("Alice_Weidel", n = 25)
tweets_martin <- get_timeline("MartinSchulz", n = 25)

# Bind data rowwise
data_tweets <- bind_rows(tweets_alice, tweets_martin)
nrow(data_tweets)

save(data_tweets,
     file = "./data & material/data_tweets.RData")
If you can’t authenticate with Twitter, download the file data_tweets.RData from the material folder, store it in your working directory, and load it into R with the command below:
# Remember to adapt the path!
load("./data & material/data_tweets.RData")
16.2.4 Google: Authenticate
Remember the instructions for setting up your Google research credits and Google API access.
Fill in the quotation marks with the path to the JSON key file you created and read in the file with gl_auth().
#gl_auth("your_JSON_file.json")
gl_auth("./keys/css-seminar-2021-a1e75382ae2c.json")
16.2.5 Translation API
We can use the Cloud Translation API to translate the tweets from German to English (other languages can of course be chosen as well; check the language codes under the following link and replace the string "en" in the target argument): https://developers.google.com/admin-sdk/directory/v1/languages.
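For instance, using the language code "fr" instead of "en" in the call below would return French translations. A commented sketch (left commented out to avoid unnecessary API calls):

# data_translation_fr <- gl_translate(data_tweets$text, target = "fr")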
data_translation <- gl_translate(data_tweets$text, target = "en")
# data_tweets <- bind_cols(data_tweets, data_translation)
head(data_translation)

# Add translation to twitter data
data_tweets <- bind_cols(data_tweets,
                         data_translation %>% select(translatedText))
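The translation result should also contain a detectedSourceLanguage column (as documented for gl_translate()); assuming that, we can quickly check which source languages the API detected:

# Which source languages did the API detect? (sketch; column name as in the
# gl_translate() documentation)
table(data_translation$detectedSourceLanguage)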
16.2.6 NLP API: Sentiment
Next, we use the Cloud Natural Language API to analyze the sentiment of each tweet. We choose nlp_type = "analyzeSentiment". The chunk returns a list object whose element documentSentiment contains the sentiment scores for the single documents (here: tweets), stored in the score variable. The score ranges from -1 (negative) to +1 (positive).
# Analyze sentiment
data_sentiment <- gl_nlp(data_tweets$translatedText,
                         nlp_type = "analyzeSentiment")

# Show first lines of sentiment scores
head(data_sentiment$documentSentiment)

# Add sentiment to twitter data
data_tweets <- bind_cols(data_tweets,
                         data_sentiment$documentSentiment %>% select(score))
# Return statements with min/max scores
data_tweets %>%
  select(translatedText, score) %>%
  filter(score == min(score) | score == max(score))
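Before plotting, a simple summary is often useful. Assuming the score column was added to data_tweets as above, we can compare the average sentiment per account:

# Average sentiment per account (sketch; relies on the score column added above)
data_tweets %>%
  group_by(screen_name) %>%
  summarize(mean_score = mean(score, na.rm = TRUE),
            n_tweets = n())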
We can also try to plot the data:
# Plot data
library(ggridges)

data_tweets %>%
  ggplot(aes(x = score,
             y = screen_name)) +
  geom_density_ridges(aes(fill = screen_name),
                      alpha = 0.5) +
  scale_fill_manual(name = "Name",
                    values = c("blue", "red")) +
  xlab("Sentiment score") +
  ylab("Screen names") +
  theme(panel.background = element_rect(fill = NA),
        legend.position = "top",
        panel.grid.major = element_line(colour = "black",
                                        linetype = 9,
                                        size = 0.3),
        panel.ontop = FALSE,
        axis.title = element_text(size = 16))
16.2.7 NLP API: Syntax
The Google NLP API also allows us to analyze syntax. This is helpful when we want to isolate certain parts of a sentence. Below we extract the nouns and subsequently plot them in a word cloud.
# Analyze syntax
data_syntax <- gl_nlp(data_tweets$translatedText, nlp_type = "analyzeSyntax")

# Access the syntax analysis for the single tokens ("words")
# Check out the first document
# head(data_syntax[["tokens"]][[1]][, 1:5])

# Add dataframe with token syntax to original dataframe
data_tweets$syntax_tokens <- data_syntax[["tokens"]]

# Keep nouns only
data_tweets <- data_tweets %>%
  mutate(tokens_nouns = map(syntax_tokens,
                            ~ filter(., tag == "NOUN")))

head(data_tweets %>%
       select(screen_name, syntax_tokens, tokens_nouns))
# Create the data for the plot
data_plot <- data_tweets %>%
  # Only keep the content variable in the token dataframes
  mutate(tokens_nouns = map(tokens_nouns,
                            ~ select(., content))) %>%
  # Unnest the token dataframes (one row per token)
  unnest(tokens_nouns) %>%
  # Delete hashtag symbols from the strings
  mutate(content = gsub("# ", "", content)) %>%
  # Unnest tokens into single words
  select(screen_name, content) %>%
  unnest_tokens(output = word, input = content) %>%
  # Remove stopwords and count words per account
  anti_join(stop_words) %>%
  group_by(screen_name) %>%
  dplyr::count(word) %>%
  filter(word != "https") %>%
  filter(word != "t.co") %>%
  arrange(desc(n), screen_name) %>%
  filter(n > 3)
# Visualize in a word cloud
data_plot %>%
  ggplot(aes(label = word,
             size = n,
             color = screen_name)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()
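Instead of (or in addition to) the word cloud, we can inspect the most frequent nouns per account as a plain table. A sketch that reuses the data_plot object created above:

# Top nouns per account (sketch based on data_plot)
data_plot %>%
  group_by(screen_name) %>%
  slice_max(n, n = 5) %>%
  arrange(screen_name, desc(n))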
16.2.8 Analyzing images
In addition, tweets may contain images that we can analyze using the Google Vision API. Each tweet either comes with an image (and the corresponding link in media_url) or it does not.
How many of the tweets in the data (of the two politicians) contain images?
table(is.na(data_tweets$media_url))
Who uses more images?
table(data_tweets$screen_name,
      is.na(data_tweets$media_url))
Below we download those images and store them in a directory (we should always store the data we analyze locally).
# Filter: status_id & tweets with pictures
tweets_media <- data_tweets %>%
  filter(!is.na(media_url)) %>%
  select(screen_name, user_id, status_id, media_url)
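One practical point before running the loop below: curl_download() will typically fail if the target folder does not exist, so it is safer to create it first (the path mirrors the one used in the loop):

# Create the target folder for the images if it does not exist yet
dir.create("./data/twitter_images", recursive = TRUE, showWarnings = FALSE)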
# Loop through tweets with pictures, download them and name them according to status_id
foreach(i = unlist(tweets_media$media_url),  # URL of the image
        k = tweets_media$user_id,            # Account ID
        y = tweets_media$screen_name,        # Account name
        j = tweets_media$status_id) %do%     # Tweet ID
  curl::curl_download(i, destfile = paste0("./data/twitter_images/", y, "_", k, "_", j, ".jpg"))
Next we try to recognize text in the images (we pass the image URLs directly to the API rather than the downloaded files):
# Note: if the API returns an error saying the Cloud Vision API "has not been used in project ... before or it is disabled",
# enable it under https://console.developers.google.com/apis/api/vision.googleapis.com/overview and retry after a few minutes.
# Recognize text entities
tweets_media$image_entities <- foreach(l = tweets_media$media_url) %do%
  gcv_get_image_annotations(
    imagePaths = l,
    feature = "TEXT_DETECTION",
    maxNumResults = 7
  )
Then we collapse the list of words for every image into a single string:
# Collapse list of words into sentences
tweets_media$image_entities <- foreach(m = 1:length(tweets_media$image_entities)) %do%
  paste(tweets_media$image_entities[[m]][["description"]], collapse = " ")

# Create string vector from list
tweets_media$image_entities <- unlist(tweets_media$image_entities)
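A quick look at the result (nothing new here, we simply inspect the string variable we just created):

# Inspect the recognized text for the first few images
head(tweets_media$image_entities)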
And we add it to the original tweet dataset:
data_tweets <- left_join(data_tweets,
                         tweets_media %>% select(-media_url),
                         by = c("user_id", "status_id"))
We also try to recognize objects in those images:
tweets_media$image_objects <- foreach(l = tweets_media$media_url) %do%
  gcv_get_image_annotations(
    imagePaths = l,
    feature = "LABEL_DETECTION",
    maxNumResults = 20
  )
Then we turn the list of objects into a string variable:
tweets_media$image_objects <- foreach(m = 1:length(tweets_media$image_objects)) %do%
  paste(tweets_media$image_objects[[m]][["description"]], collapse = " ")

tweets_media$image_objects <- unlist(tweets_media$image_objects)
Then we join the scraped data with the original tweet data:
data_tweets <- left_join(data_tweets,
                         tweets_media %>% select(-media_url, image_objects, user_id, status_id),
                         by = c("user_id", "status_id"))

head(data_tweets %>%
       select(translatedText, image_objects))
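As a final check we can look at which labels the Vision API assigned most often across all images. A sketch that reuses the image_objects string variable created above:

# Most frequent image labels across all tweets with pictures (sketch)
data_tweets %>%
  filter(!is.na(image_objects)) %>%
  select(image_objects) %>%
  unnest_tokens(output = word, input = image_objects) %>%
  count(word, sort = TRUE) %>%
  head(10)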