Graded Task in R (III)

For this task, we’ll once again work with the “State of the Union Addresses” from US presidents since the 1790s. Source: The American Presidency Project, accessed via the quanteda corpus package.

Please note that I’ve slightly adapted the corpus: It now only contains speeches by Democratic or Republican presidents (N = 193). I have also already preprocessed the corpus and conducted a topic model to estimate a total of K = 30 topics, thereby including data$party, here the party each president was part of (either Democratic or Republican), as an independent variable that may influence the prevalence of topics.

The code for estimating the stm looks like this:

library("stm")
model <- stm(documents = dfm_stm$documents,
             vocab = dfm_stm$vocab, 
             K = 30,
             prevalence = ~ party,
             data = data,
             verbose = FALSE)

Important: Please do not preprocess the corpus or estimate the topic model yourself. Here, I simply want you to load the result of my already existing analysis (you do not need to know anything about the exact preprocessing steps that were used). You’ll find the corresponding working environment titled Graded Task III.RDATA in OLAT (via: Materials / Data for R).

The working environment contains the following objects:

  • data: the original dataset containing “State of the Union Addresses” from US presidents since the 1790s and some meta data (e.g., when each speech was given or by whom)
  • dfm: the document-feature-matrix for the texts in data$texts after preprocessing
  • model: the stm topic model I ran
  • manual_validation: A manually coded gold standard for one of the topics (you’ll only need to use this in task 3.5, so ignore it for now)
load("Graded Task III.RDATA")

Graded Task 3.1

Writing the corresponding R code, please save the top 20 terms describing each topic (according to FREX weighting) in a new object called top_terms.

Which of these topics include the term “inflation” among their top 20 terms? Please automatically retrieve the number of these topics (e.g., “Topic 1”, or “Topic 2”, or “Topic 3 and Topic 4” etc.)

Graded Task 3.2

Especially topics 5, 12 and 29 seem to touch upon financial and economic issues.

Writing the corresponding R code, please calculate the overall number of speeches in which topics 5, 12 and 29 are the main topic according to the Rank-1 metric. In short: How many speeches seem to focus on the economy/financial issues, as indicated by the prevalence of topics 5, 12, and 29 as the main topic in these speeches?

Please note: Your answer should be single number - the number of speeches where either topic 5, 12, or 29 are the main topic.

Graded Task 3.3

Writing the corresponding R code, please plot the effects of party affiliation on the prevalence of the economy-focused topics 5, 12, and 29.

In short, the graph should indicate where there is a significant correlation between a president’s party affiliation (Democratic vs. Republican) on the prevalence with which a president discusses economic issues in a speech (similar to the one below, but the graphs do not have to match exactly):

Do Democrats more often discuss the economy in their speeches than Republicans (or vice versa)?

Hint: Have a look at the STM vignette if you are uncertain about how to plot the influence of categorical independent variables on topic prevalence.

Graded Task 3.4

Writing the corresponding R code, retrieve the document-topic matrix of the stm model model. Based on the matrix: Retrieve the doc_ids of those 5 speeches that have the highest conditional probability for the topic Iraq_War/Terrorism (Topic 19) according to the document-topic matrix.

Graded Task 3.5

Writing the corresponding R code, please validate your results.

First, assign each speech a single main topic based on the Rank-1 value.

Next, assume that we manually coded 15 speeches for whether or not they actually deal with the Iraq_War/Terrorism (Topic 19) as their main topic. The results of this manual coding - i.e., the manual gold standard - are saved in the object manual_validation.

This data frame contains the following vectors:

  • doc_id: the name of the speech that was coded
  • text: the text of the speech that was coded
  • manual: the manual coding - a binary variable indicating whether the Iraq War/Terrorism (Topic 19) is the main topic of the speech (1) or not (0) according to the manual coding.

Using both the automated classification of main topics and the manually coded gold standard, calculate the precision, recall, and F1 value for your automated content analysis correctly classifying speeches as having the Iraq_War/Terrorism as their main topic.

In one sentence, please summarize: Do you think that your automated content analysis is “good enough” for automatically measuring which speeches discuss the Iraq_War/Terrorism?

Graded Task 3.6

Writing the corresponding R code, please count: Which president(s) discuss the “war on terror” by using this term in at least one speech?