Text Mining for Social Scientists
If you are reading this script, you are either participating in one of my courses on digital methods for the social sciences or at least interested in the topic. If you have any questions or remarks regarding this script, hit me up at firstname.lastname@example.org.
This script will introduce you to the quantitative analysis of text using R. Over the last decades, more and more text has become readily available. Think, for example, of social networking platforms, online fora, Google Books, newspaper articles, YouTube's automatically generated subtitles, or the increasing digitization of administrative records. Social scientists, of course, have decades of experience analyzing such material, yet they used to be constrained by data availability; hence, their data sets were much smaller and could be processed by humans. To make the most of the newly available data sets mentioned above and to repurpose them for social scientific research, we need tools from the information and computer sciences. Some are fairly old, such as basic dictionary-based sentiment analysis, whose precursors were introduced in the 1960s; others are as recent as the early 2000s (LDA) or even 2013 (word2vec).
Specifically, this script will cover the pre-processing of text and the implementation of supervised and unsupervised approaches to text. At the end, I will briefly touch upon word embeddings and how social scientists can use them for inquiry.
The following chapters draw heavily on packages from the tidyverse (Wickham et al. 2019), tidytext (Silge and Robinson 2016), and tidymodels (Kuhn and Wickham 2020), as well as the two excellent books “Text Mining with R: A Tidy Approach” (Silge and Robinson 2017) and “Supervised Machine Learning for Text Analysis in R” (Hvitfeldt and Silge 2022). Examples and inspiration are often drawn from blog posts and/or other tutorials, and I will give credit wherever credit is due. Moreover, you will find further readings at the end of each section as well as exercises at the end of the respective chapter.
You have already made it here, but let me briefly dwell on why you should hone your R skills.
You may wonder why it makes sense for you to learn R. Many people with far brighter minds and a better way with words than I have thought and written about this topic. I will only touch upon the points that I find most interesting from a research perspective (I would consider myself a computational social scientist of sorts) and then link to resources that point out how R can contribute to your research, career development, pursuit of happiness, or whatever else matters to you, depending on your background and your goals.
In science, we are facing what has famously been termed the “replication crisis”. One way to overcome it is maximum transparency (Munafò et al. 2017). Here, R can help: it allows researchers to simply publish their code, which makes it easy for their colleagues to comprehend exactly what sorts of analyses have been performed. Moreover, publishing (R) code makes it much easier to replicate the actual analyses with different data and thereby assess the transportability and generalizability of the results.
While licenses for applications such as Stata and IBM SPSS are costly for schools and companies, R comes at no cost¹. Moreover, the community of useRs (as R users call themselves) is constantly developing further packages that extend R’s functionality for free. You can also write your own packages or functions if you want to extend R’s functionality, and then publish them to contribute to the community. Hence, there will never be any need to pay for an R extension or update.
Data scientist has been called the “sexiest job of the 21st century”, and R is one of the sharpest tools in the data scientist’s shed. The following graphs stem from a blog entry by the statistician Robert Muenchen.
Moreover, the demand for R has been growing steadily over the last couple of years.
And if your goal is to stay in academia, the tendency appears to be just as clear:
Learning R – and programming in general – is tough. More often than not, things will not go the way you want them to go. Mostly, this is due to minor typos or the fact that R is case-sensitive. However, don’t fret. Only practice makes perfect. It is perfectly normal to not comprehend error messages. The following video illustrates this:
If questions arise that a Google search cannot answer, I am always only one email away, and I will probably just turn to Google right away, too, to figure something out for you.
¹ RStudio charges companies from the private sector yet is free for educational institutions and private users.