Day 6: Word Embeddings

In the following script, we will first train some word embeddings from scratch and perform basic analyses on them. Since training your own embeddings is computationally costly, pretrained models are often used instead. We will showcase how to work with them in the second part, using a large model that was trained on a Google News corpus.

6.1 Training your own embeddings

For the training process, we will use the word2vec algorithm (Mikolov et al. 2013) and data on elected politicians’ tweets, which are kindly provided by Chris Bail. The word2vec() function takes as arguments a vector of documents (in our case, each document is a tweet) and some parameters. These are subject to tuning¹, but for our basic application, I just go with an arbitrarily chosen set.

library(word2vec)
library(fs)
library(tidyverse) 
library(lsa)

set.seed(1234)

load(url("https://cbail.github.io/Elected_Official_Tweets.Rdata")) 

embeddings_tweets <- word2vec(elected_official_tweets$text %>% 
                                str_to_lower() %>% 
                                str_remove_all("[:punct:]"), # lower-case and strip punctuation
                              dim = 100,     # dimensionality of the embedding vectors
                              iter = 20,     # number of training iterations
                              threads = 16L) # number of CPU threads to use

# write.word2vec(embeddings_tweets, "embeddings_tweets.bin") # save model

# model <- read.word2vec("embeddings_tweets.bin") # read in model

embedding_mat <- as.matrix(embeddings_tweets)
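
To get a feel for what we have just trained, we can take a quick look at the embedding matrix we extracted above: it has one row per vocabulary term and one column per embedding dimension.

dim(embedding_mat) # number of vocabulary terms x number of dimensions
head(rownames(embedding_mat), 10) # a few of the terms the model learned vectors for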

We can get the vectors of individual terms by using the predict() function with type = "embedding". If we want to do calculations with them, we simply extract the vectors and then perform the calculation. We can then hand our new vector to predict() and ask it for the names of the vectors that are closest in terms of cosine similarity. However, bear in mind that this might not work out so well given the small size of our corpus. At least, Obama makes it into the top 10 here.

trump <- predict(embeddings_tweets, c("trump"), type = "embedding")

wv <- predict(embeddings_tweets, newdata = c("trump", "republican", "democrat"), type = "embedding")
wv <- wv["trump", ] - wv["republican", ] + wv["democrat", ]

predict(embeddings_tweets, newdata = wv, type = "nearest", top_n = 10)
##               term similarity rank
## 1            trump  0.9783953    1
## 2              yet  0.7690675    2
## 3   unconscionable  0.7677177    3
## 4          routine  0.7601476    4
## 5         democrat  0.7522761    5
## 6            obama  0.7514847    6
## 7             fact  0.7479585    7
## 8             gone  0.7367740    8
## 9  mickmulvaneyomb  0.7267088    9
## 10     replacement  0.7263388   10
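
As a quick sanity check, we can also ask predict() directly for the terms closest to a single word. Since the corpus is small and training is stochastic, your nearest neighbors will likely differ from run to run:

# terms with the highest cosine similarity to "trump" in our embedding space
predict(embeddings_tweets, newdata = c("trump"), type = "nearest", top_n = 5)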

We can also create new axes by taking the difference between two words and then project other words onto these axes using cosine similarity. For this endeavor, we first normalize all our vectors to unit length. Moreover, I use multiple “seed words” for each end of the axis. Finally, we take the average of the axes that result from subtracting the seed-word pairs. This is equivalent to how Kozlowski, Taddy, and Evans (2019) construct their “class” axes.

# define function for normalizing vector
normalize_vec <- function(x) {x / sqrt(sum(x^2))}

# define function for getting an axis from … to … -- both ends can consist of
# multiple seed terms, but left_terms and right_terms need to be of the same
# length; the resulting pairwise axes are then averaged
get_frame_normal <- function(model, left_terms, right_terms){
  stopifnot(length(left_terms) == length(right_terms))
  
  axes <- vector(mode = "list", length = length(left_terms))
  
  # one axis per seed-word pair: normalized right vector minus normalized left vector
  for (i in seq_along(left_terms)){
    right_vec <- predict(model, newdata = right_terms[[i]], type = "embedding") %>% normalize_vec()
    left_vec <- predict(model, newdata = left_terms[[i]], type = "embedding") %>% normalize_vec()
    axes[[i]] <- as.numeric(right_vec - left_vec)
  }
  
  # average the pairwise axes into a single axis
  reduce(axes, `+`) / length(axes)
}


l_r_frame <- get_frame_normal(embeddings_tweets, 
                              left_terms = c("democrat", "democratic", "democrats"), 
                              right_terms = c("republican", "republican", "republicans"))

trump <- predict(embeddings_tweets, newdata = c("trump"), type = "embedding") 
clinton <- predict(embeddings_tweets, newdata = c("clinton"), type = "embedding")
cruz <- predict(embeddings_tweets, newdata = c("cruz"), type = "embedding") 
obama <- predict(embeddings_tweets, newdata = c("obama"), type = "embedding")

cosine(l_r_frame,
       trump %>% normalize_vec() %>% as.numeric())
##            [,1]
## [1,] 0.04409644
cosine(l_r_frame,
       clinton %>% normalize_vec() %>% as.numeric())
##             [,1]
## [1,] -0.06530029
cosine(l_r_frame,
       cruz %>% normalize_vec() %>% as.numeric())
##             [,1]
## [1,] -0.05621158
cosine(l_r_frame,
       obama %>% normalize_vec() %>% as.numeric())
##              [,1]
## [1,] -0.004143448
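
To compare several politicians at once, we can wrap this projection into a small helper and collect the results in a tibble. The following is just a sketch that reuses the objects defined above; the helper project_on_axis() is my own name, not part of any package:

# helper (not from a package): project a single term onto an axis via cosine similarity
project_on_axis <- function(model, term, axis){
  vec <- predict(model, newdata = term, type = "embedding") %>% 
    normalize_vec() %>% 
    as.numeric()
  as.numeric(cosine(axis, vec))
}

tibble(politician = c("trump", "clinton", "cruz", "obama")) %>% 
  mutate(left_right = map_dbl(politician, ~project_on_axis(embeddings_tweets, .x, l_r_frame)))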

6.2 Using pre-trained models

We can also use pre-trained models such as the one you can download from Google. The model is very big (~4 GB), hence I load it from my own hard drive rather than hosting it online.

google_news <- word2vec::read.word2vec("/Users/felixlennert/Downloads/GoogleNews-vectors-negative300.bin", normalize = TRUE)

wv <- predict(google_news, newdata = c("king", "man", "woman"), type = "embedding")
wv <- wv["king", ] - wv["man", ] + wv["woman", ]
predict(google_news, newdata = wv, type = "nearest", top_n = 3)
##      term similarity rank
## 1    king  0.9481843    1
## 2   queen  0.8948160    2
## 3 monarch  0.8344159    3
## gender bias
female_job <- predict(google_news, newdata = c("doctor", "man", "woman"), type = "embedding")
jobs <- female_job["doctor", ] - female_job["man", ] + female_job["woman", ]
predict(google_news, newdata = jobs, type = "nearest", top_n = 3)
##           term similarity rank
## 1 gynecologist  0.9468427    1
## 2        nurse  0.9047574    2
## 3      doctors  0.9043502    3
male_female <- female_job["woman", ] - female_job["man", ]

lsa::cosine(male_female, 
       predict(google_news, newdata = c("professor"), type = "embedding") %>% as.numeric()) 
##            [,1]
## [1,] 0.05357555
cosine(male_female, 
       predict(google_news, newdata = c("locksmith"), type = "embedding") %>% as.numeric())
##              [,1]
## [1,] -0.004585093
cosine(male_female, 
       predict(google_news, newdata = c("nurse"), type = "embedding") %>% as.numeric())
##           [,1]
## [1,] 0.2730476
cosine(male_female, 
       predict(google_news, newdata = c("waitress"), type = "embedding") %>% as.numeric())
##           [,1]
## [1,] 0.2437929
cosine(male_female, 
       predict(google_news, newdata = c("waiter"), type = "embedding") %>% as.numeric())
##               [,1]
## [1,] -0.0007116955
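
The same logic scales to a whole set of occupations at once. Reusing the project_on_axis() helper sketched in the previous section (again just a sketch; the list of occupations is purely illustrative):

# score several occupations on the woman-minus-man axis in one go;
# higher values = closer to "woman", lower values = closer to "man"
tibble(occupation = c("professor", "locksmith", "nurse", "waitress", "waiter")) %>% 
  mutate(gender_score = map_dbl(occupation, ~project_on_axis(google_news, .x, male_female)))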

Also, let’s try our left–right axis again:

left_right <- predict(google_news, newdata = c("republican", "democrat"), type = "embedding")
left_right_axis <- left_right["republican", ] - left_right["democrat", ]

cosine(left_right_axis, 
       predict(google_news, newdata = c("trump"), type = "embedding") %>% as.numeric()) 
##            [,1]
## [1,] -0.1284781
cosine(left_right_axis, 
       predict(google_news, newdata = c("clinton"), type = "embedding") %>% as.numeric()) 
##             [,1]
## [1,] -0.08412658
cosine(left_right_axis, 
       predict(google_news, newdata = c("obama"), type = "embedding") %>% as.numeric()) 
##             [,1]
## [1,] -0.08368384
cosine(left_right_axis, 
       predict(google_news, newdata = c("cruz"), type = "embedding") %>% as.numeric()) 
##             [,1]
## [1,] -0.04617176
cosine(left_right_axis, 
       predict(google_news, newdata = c("prolife"), type = "embedding") %>% as.numeric()) 
##            [,1]
## [1,] 0.07656146
cosine(left_right_axis, 
       predict(google_news, newdata = c("prochoice"), type = "embedding") %>% as.numeric()) 
##             [,1]
## [1,] -0.03026095

This doesn’t work so well for the politicians, and for Trump in particular. However, when it comes to occupations and their male–female associations, the model does pick up some real-world patterns (see Garg et al. (2018) for more on this).

References

Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–44. https://doi.org/10.1073/pnas.1720347115.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings.” American Sociological Review 84 (5): 905–49. https://doi.org/10.1177/0003122419877135.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” arXiv:1310.4546 [cs, stat], October. https://arxiv.org/abs/1310.4546.
Stoltz, Dustin S., and Marshall A. Taylor. 2021. “Cultural Cartography with Word Embeddings.” Poetics, May. https://doi.org/10.1016/j.poetic.2021.101567.

  1. In the real world, you would probably do this tuning using a set of real-world analogies that you want the model to perform well on.↩︎