Day 6 Word Embeddings
In the following script, we will first train some word embeddings from scratch and perform basic analyses on them. Since training your own embeddings is computationally costly, pre-trained models exist; we showcase how to use them in the second part, using a large model trained on a Google News corpus.
6.1 Training your own embeddings
For the training process, we will use the word2vec algorithm (Mikolov et al. 2013) and data on elected politicians’ tweets, which are kindly provided by Chris Bail. The word2vec() function takes a vector of documents (in our case, each document is a tweet) and a number of parameters. These are subject to tuning⁸, but for our basic application, I just go with an arbitrarily chosen set.
library(word2vec)
library(fs)
library(tidyverse)
library(lsa)
set.seed(1234)
load(url("https://cbail.github.io/Elected_Official_Tweets.Rdata"))
embeddings_tweets <- word2vec(elected_official_tweets$text %>% 
                                str_to_lower() %>% 
                                str_remove_all("[:punct:]"), 
                              dim = 100, 
                              iter = 20,
                              threads = 16L)

# write.word2vec(embeddings_tweets, "embeddings_tweets.bin") # save model
# model <- read.word2vec("embeddings_tweets.bin") # read in model

embedding_mat <- as.matrix(embeddings_tweets)
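To get a sense of the object we just created, we can peek at the embedding matrix: rows are the vocabulary terms, columns are the 100 dimensions we specified (this assumes “trump” made it into the vocabulary, which it should, given the corpus):

dim(embedding_mat)          # vocabulary size x 100
embedding_mat["trump", 1:5] # first five dimensions of the vector for "trump"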
We can get the vectors of individual terms by using the predict function with type = "embedding". If we want to do calculations with them, we simply extract the vectors and perform the calculation. We can then provide the predict() function with our new vector and ask it for the names of the vectors that are closest in terms of cosine similarity. However, bear in mind that these things might not work out so well given the limited size of our corpus. At least, Obama makes the top 10 here.
<- predict(embeddings_tweets, c("trump"), type = "embedding")
trump
<- predict(embeddings_tweets, newdata = c("trump", "republican", "democrat"), type = "embedding")
wv <- wv["trump", ] - wv["republican", ] + wv["democrat", ]
wv
predict(embeddings_tweets, newdata = wv, type = "nearest", top_n = 10)
## term similarity rank
## 1 trump 0.9783953 1
## 2 yet 0.7690675 2
## 3 unconscionable 0.7677177 3
## 4 routine 0.7601476 4
## 5 democrat 0.7522761 5
## 6 obama 0.7514847 6
## 7 fact 0.7479585 7
## 8 gone 0.7367740 8
## 9 mickmulvaneyomb 0.7267088 9
## 10 replacement 0.7263388 10
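As an aside, if you are only after pairwise similarities between terms rather than nearest neighbors, the word2vec package also provides word2vec_similarity(), which compares two embedding matrices directly (a small illustration, output not shown):

word2vec_similarity(
  predict(embeddings_tweets, newdata = c("trump"), type = "embedding"),
  predict(embeddings_tweets, newdata = c("obama", "clinton"), type = "embedding"),
  type = "cosine")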
We can also create new axes by taking the difference between two words and then project other words onto these axes using cosine similarity. For this endeavor, we first normalize all our vectors to unit length. Moreover, I use multiple “seed words” for each end of the axis. Finally, we take the average of the axes that result from subtracting the seed words. This is equivalent to how Kozlowski, Taddy, and Evans (2019) construct their “class” axes.
# define function for normalizing a vector to unit length
normalize_vec <- function(x) {x / sqrt(sum(x^2))}

# define function for getting an axis from … to … -- left_terms and right_terms
# can also contain multiple terms, but they need to be of the same length; 
# the resulting axes will then be averaged
get_frame_normal <- function(model, left_terms, right_terms){
  right_vec <- vector(mode = "list", length = length(right_terms))
  left_vec <- vector(mode = "list", length = length(left_terms))
  
  for (i in seq_along(left_terms)){
    right_vec[[i]] <- predict(model, newdata = right_terms[[i]], type = "embedding") %>% normalize_vec()
    left_vec[[i]] <- predict(model, newdata = left_terms[[i]], type = "embedding") %>% normalize_vec()
  }
  
  output <- map2(right_vec, left_vec, ~.x - .y) %>% 
    pluck(1)
  rownames(output) <- NULL
  
  if (nrow(output) > 1){
    return(map_dbl(array_tree(output, nrow(output)), mean))
  } else {
    return(output[1, ])
  }
}
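As a quick sanity check: after normalization, a vector should have an L2 norm (length) of 1.

sqrt(sum(normalize_vec(rnorm(100))^2)) # should return 1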
l_r_frame <- get_frame_normal(embeddings_tweets, 
                              left_terms = c("democrat", "democratic", "democrats"),
                              right_terms = c("republican", "republican", "republicans"))

trump <- predict(embeddings_tweets, newdata = c("trump"), type = "embedding")
clinton <- predict(embeddings_tweets, newdata = c("clinton"), type = "embedding")
cruz <- predict(embeddings_tweets, newdata = c("cruz"), type = "embedding")
obama <- predict(embeddings_tweets, newdata = c("obama"), type = "embedding")

cosine(l_r_frame, 
       trump %>% normalize_vec() %>% as.numeric())
## [,1]
## [1,] 0.04409644
cosine(l_r_frame, 
       clinton %>% normalize_vec() %>% as.numeric())
## [,1]
## [1,] -0.06530029
cosine(l_r_frame, 
       cruz %>% normalize_vec() %>% as.numeric())
## [,1]
## [1,] -0.05621158
cosine(l_r_frame, 
       obama %>% normalize_vec() %>% as.numeric())
## [,1]
## [1,] -0.004143448
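To compare the four politicians at a glance, we can also collect their projections in one tibble (a small sketch reusing the objects created above, output omitted):

tibble(politician = c("trump", "clinton", "cruz", "obama"),
       projection = map_dbl(list(trump, clinton, cruz, obama),
                            ~as.numeric(cosine(l_r_frame, .x %>% normalize_vec() %>% as.numeric()))))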
6.2 Using pre-trained models
We can also use pre-trained models such as the one you can download from Google. The model is very big (~4 GB), hence I load it from my own hard drive rather than storing it online.
google_news <- word2vec::read.word2vec("/Users/felixlennert/Downloads/GoogleNews-vectors-negative300.bin", normalize = TRUE)

wv <- predict(google_news, newdata = c("king", "man", "woman"), type = "embedding")
wv <- wv["king", ] - wv["man", ] + wv["woman", ]

predict(google_news, newdata = wv, type = "nearest", top_n = 3)
## term similarity rank
## 1 king 0.9481843 1
## 2 queen 0.8948160 2
## 3 monarch 0.8344159 3
## gender bias
female_job <- predict(google_news, newdata = c("doctor", "man", "woman"), type = "embedding")
jobs <- female_job["doctor", ] - female_job["man", ] + female_job["woman", ]

predict(google_news, newdata = jobs, type = "nearest", top_n = 3)
## term similarity rank
## 1 gynecologist 0.9468427 1
## 2 nurse 0.9047574 2
## 3 doctors 0.9043502 3
<- female_job["woman", ] - female_job["man", ]
male_female
::cosine(male_female,
lsapredict(google_news, newdata = c("professor"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] 0.05357555
cosine(male_female,
predict(google_news, newdata = c("locksmith"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.004585093
cosine(male_female,
predict(google_news, newdata = c("nurse"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] 0.2730476
cosine(male_female,
predict(google_news, newdata = c("waitress"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] 0.2437929
cosine(male_female,
predict(google_news, newdata = c("waiter"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.0007116955
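We can also project a whole set of occupation terms onto the male–female axis in one go; the occupations below are just an illustrative choice (output omitted):

occupations <- c("nurse", "engineer", "teacher", "carpenter", "librarian")
tibble(occupation = occupations,
       projection = map_dbl(occupations,
                            ~as.numeric(cosine(male_female, 
                                               predict(google_news, newdata = .x, type = "embedding") %>% as.numeric()))))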
Also, let’s try our left–right thing again:
<- predict(google_news, newdata = c("republican", "democrat"), type = "embedding")
left_right <- left_right["republican", ] - left_right["democrat", ]
left_right_axis
cosine(left_right_axis,
predict(google_news, newdata = c("trump"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.1284781
cosine(left_right_axis,
predict(google_news, newdata = c("clinton"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.08412658
cosine(left_right_axis,
predict(google_news, newdata = c("obama"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.08368384
cosine(left_right_axis,
predict(google_news, newdata = c("cruz"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.04617176
cosine(left_right_axis,
predict(google_news, newdata = c("prolife"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] 0.07656146
cosine(left_right_axis,
predict(google_news, newdata = c("prochoice"), type = "embedding") %>% as.numeric())
## [,1]
## [1,] -0.03026095
This doesn’t work so well for the politicians, and for Trump in particular. However, when it comes to jobs and their male–female associations, the model does pick up some real-world patterns (see Garg et al. (2018) for more on this).
6.3 Further links
This is just a quick demonstration of what you can do with word embeddings. In case you want to use your embeddings as new features for your supervised machine learning classifier, have a look at ?textrecipes::step_word_embeddings(); a rough sketch follows below. You may want to use pre-trained models for such tasks.
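The following is only a sketch of how this could look with textrecipes (not run here); the outcome column party is a placeholder, and feeding in our own tweet embeddings is just one possible choice; pre-trained vectors would plug in the same way.

library(textrecipes)

# textrecipes expects the embeddings as a tibble: first column = tokens, 
# remaining columns = numeric dimensions
colnames(embedding_mat) <- paste0("dim_", seq_len(ncol(embedding_mat)))
embedding_tbl <- as_tibble(embedding_mat, rownames = "tokens")

rec <- recipe(party ~ text, data = elected_official_tweets) %>% # `party` is a placeholder outcome
  step_tokenize(text) %>% 
  step_word_embeddings(text, embeddings = embedding_tbl)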
You can also train embeddings on multiple corpora and identify their different biases. You may want to have a look at Stoltz and Taylor (2021) before going down this road.
- See the word2vec vignette for more information
- The first of a series of blog posts on word embeddings
- An approachable lecture by Richard Socher, one of the creators of GloVe
⁸ In the real world, you would probably do this using a set of real-world analogies that you want the model to perform well on.