8.5 Lab: Classifying tweets & accounts
In this lab we'll use Twitter data to illustrate the use of different Google APIs. Using tweets by Alice Weidel as an example (among them a tweet with an image), we translate the text, analyze its sentiment, and show that we can classify different parts of a tweet, namely the text and the image.
8.5.1 Software
googleLanguageR
: Interact with the Google Natural Language API, the Cloud Speech APIs, and the Cloud Translation API. See the vignette for an overview.
  - Google Natural Language API
    - Entity analysis (i.e., finds named entities, types, salience, mentions + properties, metadata)
    - Syntax (i.e., syntax analysis, e.g., identify nouns)
    - Sentiment (i.e., provides sentiment scores)
    - Content classification (i.e., classifies content into categories)
  - Google Cloud Speech-to-Text API
  - Google Cloud Text-to-Speech API
  - Google Cloud Translation API

googleCloudVisionR
: Interact with the Google Vision API. See the vignette for an overview.
8.5.2 Load packages
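Below is a guess at the packages needed for this lab, based on the functions used further down (install any that are missing first):

# Packages used throughout this lab
library(tidyverse)          # data wrangling & ggplot2
library(rtweet)             # access the Twitter API
library(googleLanguageR)    # Google Natural Language & Translation APIs
library(googleCloudVisionR) # Google Vision API
library(tidytext)           # unnest_tokens() & stop_words
library(ggwordcloud)        # geom_text_wordcloud()
library(ggridges)           # geom_density_ridges()
library(foreach)            # %do% loops for downloading/annotating images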
8.5.3 Twitter: Authenticate & load data
In order to get access to the Twitter API you need to create an app and generate the corresponding API keys on the Twitter developer platform (see the slides and the lab on Twitter, starting with X-Twitter's APIs, which we did not cover in class). Here we merely download a few tweets to explore the Google ML APIs.
library(rtweet)
# authenticate via web browser
token <- create_token(
consumer_key = "#######################",
consumer_secret = "#######################",
access_token = "#######################",
access_secret = "#######################")
get_token()
We'll work with tweets by Alice Weidel (AfD) and tweets by Martin Schulz (SPD). The tweets themselves are text data that we can analyze using the Google Natural Language API.
# Load most recent tweets
tweets_alice <- get_timeline("Alice_Weidel", n = 25)
tweets_martin <- get_timeline("MartinSchulz", n = 25)
# Bind data rowwise
data_tweets <- bind_rows(tweets_alice, tweets_martin)
nrow(data_tweets)
save(data_tweets,
file = "C:/Users/Paul/Google Drive/2-Teaching/2021 Computational Social Science/2021_computational_social_science/data/data_tweets.RData")
If you can't authenticate with Twitter, download the file data_tweets.RData from the material folder and load it into R with the command below:
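A minimal sketch, assuming you placed data_tweets.RData in your working directory (adjust the path if you stored it elsewhere):

# Load the previously downloaded tweets
load("data_tweets.RData")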
8.5.4 Google: Authenticate
The following Markdown file can be used if you followed all the steps described in the Google Docs document.
Fill in the quotation marks with the path to the JSON key file you created and read it in with gl_auth().
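A minimal sketch (the path is a placeholder that you need to replace with the location of your own service-account key):

# Authenticate with the Google Cloud service-account key
library(googleLanguageR)
gl_auth("path/to/your-service-account-key.json")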
8.5.5 Translation API
We can use the Cloud Translation API to translate the tweets from German to English (other languages can of course be chosen as well; check the language codes under the following link and replace the language code in the target argument accordingly): https://developers.google.com/admin-sdk/directory/v1/languages.
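A minimal sketch of the translation step, assuming the tweet text sits in the text column returned by rtweet; gl_translate() auto-detects the source language and returns a tibble whose translatedText column we add to the tweet data:

# Translate the tweets to English
translation <- gl_translate(data_tweets$text, target = "en")
# Store the translated text alongside the original tweets
data_tweets$translatedText <- translation$translatedText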
8.5.6 NLP API: Sentiment
Next, we use the Cloud Natural Language API to analyze the sentiment of each tweet. We choose nlp_type = "analyzeSentiment". The call returns a list object in which the element documentSentiment contains the document-level sentiment of each tweet in the score variable, which ranges from -1 (negative) to +1 (positive).
# Analyze sentiment
data_sentiment <- gl_nlp(data_tweets$translatedText,
nlp_type = "analyzeSentiment")
# Show first lines of sentiment scores
head(data_sentiment$documentSentiment)
# Add sentiment to twitter data
data_tweets <- bind_cols(data_tweets,
data_sentiment$documentSentiment %>% select(score))
# Return statements with min/max scores
data_tweets %>%
select(translatedText, score) %>%
filter(score == min(score) | score == max(score))
We can also plot the data:
# Plot data
library(ggridges)
data_tweets %>%
ggplot(aes(x = score,
y = screen_name)) +
geom_density_ridges(aes(fill = screen_name),
alpha=0.5) +
scale_fill_manual(name = "Name",
values = c("blue", "red")) +
xlab("Sentiment score") +
ylab("Screen names") +
theme(panel.background = element_rect(fill = NA),
legend.position = "top",
panel.grid.major = element_line(colour = "black",
linetype = 9,
size = 0.3),
panel.ontop = FALSE,
axis.title = element_text(size = 16))
8.5.7 NLP API: Syntax
The Google NLP API also allows for analyzing syntax. This is extremely helpful because sometimes we may want to isolate certain parts of a sentence. Below we extract the nouns and subsequently plot them in a word cloud. The function returns various information, including a data frame of tokens with their part-of-speech tags.
data_syntax <- gl_nlp(data_tweets$translatedText, nlp_type = "analyzeSyntax")
# Access the syntax analysis for the individual tokens ("words")
# of the first tweet (uncomment to inspect)
# head(data_syntax[["tokens"]][[1]][,1:5])
# Add dataframe with tokens syntax to original dataframe
data_tweets$syntax_tokens <- data_syntax[["tokens"]]
# Filter out nouns only
data_tweets <- data_tweets %>%
mutate(tokens_nouns = map(syntax_tokens,
~ filter(., tag == "NOUN")))
head(data_tweets %>%
select(screen_name, syntax_tokens, tokens_nouns))
# Collapse strings
data_plot <- data_tweets %>%
  # Only keep the content variable within each token data frame
  mutate(tokens_nouns = map(tokens_nouns,
                            ~ select(., content))) %>%
  # Unnest the token data frames into rows
  unnest(tokens_nouns) %>%
  # Delete hashtag symbols from the strings
  mutate(content = gsub("# ", "", content)) %>%
  # Tokenize the content into single words
  select(screen_name, content) %>%
  unnest_tokens(output = word, input = content) %>%
  # Remove stop words and count words per account
  anti_join(stop_words) %>%
  group_by(screen_name) %>%
  dplyr::count(word) %>%
  filter(word != "https") %>%
  filter(word != "t.co") %>%
  arrange(desc(n), screen_name) %>%
  filter(n > 3)
# Visualize in a word cloud
data_plot %>%
ggplot(aes(label = word,
size = n,
color = screen_name)) +
geom_text_wordcloud() +
scale_size_area(max_size = 20) +
theme_minimal()
8.5.8 Analyzing images
In addition, tweets may contain images that we can analyze using the Google Vision API. Some tweets come with an image (and the corresponding link in the media_url variable), others do not.
How many of the downloaded tweets (of the two politicians) contain images?
Who uses more images?
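A minimal sketch for answering these questions, assuming (as in the download code below) that media_url is NA for tweets without an image:

# Count tweets with images per account
data_tweets %>%
  group_by(screen_name) %>%
  summarize(n_tweets = n(),
            n_with_image = sum(!is.na(media_url)))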
Below we load those images and store them in a directory.
# Filter: status_id & tweets with pictures
tweets_media <- data_tweets %>% filter(!is.na(media_url)) %>%
select(screen_name, user_id, status_id, media_url)
# Loop through tweets with pictures, download them and name them according to status_id
foreach(i = unlist(tweets_media$media_url), # Url to image
k = tweets_media$user_id, # Account ID
y = tweets_media$screen_name, # Account name
j = tweets_media$status_id) %do% # Tweet ID
curl::curl_download(i, destfile = paste0("./data/twitter_images/", y, "_", k, "_", j, ".jpg"))
Next we try to recognize text in the images (we pass the image URLs directly to the API rather than downloading the files first):
# "API returned: Cloud Vision API has not been used in project 1167376142 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/vision.googleapis.com/overview?project=1167376142 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry."
# Recognize text entities
tweets_media$image_entities <- foreach(l = tweets_media$media_url) %do%
gcv_get_image_annotations(
imagePaths = l,
feature = "TEXT_DETECTION",
maxNumResults = 7
)
Then we collapse the list of detected words for every image into a single string.
tweets_media$image_entities <- foreach(m = 1:length(tweets_media$image_entities)) %do%
paste(tweets_media$image_entities[[m]][["description"]], collapse = " ")
tweets_media$image_entities <- unlist(tweets_media$image_entities)
And we add it to the original tweet dataset:
data_tweets <- left_join(data_tweets,
                         tweets_media %>% select(user_id, status_id, image_entities),
                         by = c("user_id", "status_id"))
We also try to recognize objects in those images:
tweets_media$image_objects <- foreach(l = tweets_media$media_url) %do%
gcv_get_image_annotations(
imagePaths = l,
feature = "LABEL_DETECTION",
maxNumResults = 20
)
Then we turn the list of objects into a string variable:
tweets_media$image_objects <- foreach(m = 1:length(tweets_media$image_objects)) %do%
paste(tweets_media$image_objects[[m]][["description"]], collapse = " ")
tweets_media$image_objects <- unlist(tweets_media$image_objects)
Then we join the scraped data with the original tweet data:
data_tweets <- left_join(data_tweets,
                         tweets_media %>% select(user_id, status_id, image_objects),
                         by = c("user_id", "status_id"))
head(data_tweets %>%
select(translatedText, image_objects))