Chapter 8 Named Entity Recognition

8.1 What is Named Entity Recognition?

Named Entity Recognition (NER) is the task of finding the named entities in a text, such as people, places, and organizations, and classifying each one by type.

We often want to connect this to other information about the entities, such as their location, population, or other attributes.

8.2 A simple dictionary-based approach

The simplest approach is to use a dictionary of entities and search for them in the text. We'll do this first to extract the names of countries from a Wikipedia article.

library(tidyverse)
library(rvest)
library(glue)

First, we’ll download the text of a Wikipedia article.

Here, I’m using the history of the UK, as this country had lots of “interactions” with other countries, but feel free to use any article you like.

We first download the HTML of the page:

en_wiki_text <- read_html("https://en.wikipedia.org/wiki/History_of_the_United_Kingdom")

Then, we extract the text from the paragraphs of the page, and collapse them into a single string.

en_wiki_text <- en_wiki_text |>
  html_elements("p") |>
  html_text2() |>
  str_flatten()

Next, we’ll load a list of country names. A handy one is included in the gapminder package as country_codes, but you can get a list like this from any source you like.

library(gapminder)
country_codes |> head(10)
## # A tibble: 10 × 3
##    country     iso_alpha iso_num
##    <chr>       <chr>       <int>
##  1 Afghanistan AFG             4
##  2 Albania     ALB             8
##  3 Algeria     DZA            12
##  4 Angola      AGO            24
##  5 Argentina   ARG            32
##  6 Armenia     ARM            51
##  7 Aruba       ABW           533
##  8 Australia   AUS            36
##  9 Austria     AUT            40
## 10 Azerbaijan  AZE            31

We’ll use the English names of the countries and the stringr package, specifically the str_count() function, to count how many times each country is mentioned in the text.

For this simple approach, we won’t bother to tokenize the text or do any other preprocessing.

If you’re doing this for real, you’ll want to do a lot more preprocessing and hunt down the weird edge cases that don’t work, but as a demonstration, this will get us 90% of the way there.

country_counts <- country_codes |>
  mutate(times_mentioned = str_count(en_wiki_text, country)) |>
  filter(times_mentioned > 0)

country_counts |>
  arrange(desc(times_mentioned)) |>
  head(10)
## # A tibble: 10 × 4
##    country        iso_alpha iso_num times_mentioned
##    <chr>          <chr>       <int>           <int>
##  1 France         FRA           250              45
##  2 Ireland        IRL           372              40
##  3 Germany        DEU           276              26
##  4 United Kingdom GBR           826              25
##  5 United States  USA           840              25
##  6 India          IND           356              21
##  7 Russia         RUS           643              17
##  8 Canada         CAN           124              11
##  9 Australia      AUS            36               8
## 10 Iraq           IRQ           368               7
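
One refinement worth knowing about, though we won’t pursue it here, is that str_count() treats its pattern as a regular expression and happily matches inside longer words, so "India" also counts occurrences of "Indian". A small sketch of a fix is to wrap each name in word boundaries; the counts it produces will differ slightly from the table above.

# A quick refinement (not part of the analysis above): wrap each name in word
# boundaries so that, for example, "India" no longer also matches inside "Indian".
country_codes |>
  mutate(times_mentioned = str_count(en_wiki_text,
                                     regex(str_c("\\b", country, "\\b")))) |>
  filter(times_mentioned > 0) |>
  arrange(desc(times_mentioned)) |>
  head(10)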

8.3 Basic Machine Learning approaches

The dictionary-based approach is simple, but it has real limitations. Most obviously, you need a dictionary of every entity you want to find, which gets complicated if you want, say, all the people mentioned in a text: the list of possible people is open-ended and constantly changing.

As a more flexible approach, we can use machine learning to learn to recognize entities. There are plenty of pre-built tools for you to use, and for this example, we’ll use the nametagger package, which works decently well for English.

library(nametagger)

First, we need to download a model for the language we’re working with.

dir.create("models")
nametagger_model <- nametagger_download_model("english-conll-140408", model_dir = "models")

Next, we’ll use the model to predict the entities in the text. Splitting the text on spaces, this returns a data frame with one row per token and the entity label the model predicted for it.

place_count_nametagger <- predict(nametagger_model, en_wiki_text, split = " ")
place_count_nametagger |> head(10)
##    doc_id sentence_id term_id    term entity
## 1       1           1       1     The      O
## 2       1           1       1 history      O
## 3       1           1       1      of      O
## 4       1           1       1     the      O
## 5       1           1       1  United      O
## 6       1           1       1 Kingdom      O
## 7       1           1       1   began      O
## 8       1           1       1      in      O
## 9       1           1       1     the      O
## 10      1           1       1   early      O

Here, we can see the model’s predictions. In these first few tokens everything is labelled O, which simply means the token is not part of a named entity; each entity type gets its own label.

The types of entities will depend on the model you’re using, but for this model, we have the following types:

place_count_nametagger |> count(entity)
##   entity     n
## 1  B-LOC   369
## 2 B-MISC   321
## 3  B-ORG   336
## 4  B-PER    61
## 5      O 20080

Guessing that B-PER marks a person, we can filter the data frame to just those tokens and count how often each one appears.

This is more flexible than the dictionary-based approach, as we don’t need a list of all the entities we want to find in advance, but it’s still not perfect.

place_count_nametagger |>
  filter(entity == "B-PER") |>
  count(term) |>
  arrange(desc(n)) |>
  head(20)
##           term  n
## 1        Royal 12
## 2        David 10
## 3         John  8
## 4      William  5
## 5       Robert  4
## 6       Wilson  3
## 7       Gordon  2
## 8          von  2
## 9         Alex  1
## 10       Brown  1
## 11        Bush  1
## 12 Christopher  1
## 13      Daniel  1
## 14        Karl  1
## 15       Louis  1
## 16     Michael  1
## 17    Ministry  1
## 18    Mitchell  1
## 19        Otto  1
## 20        Paul  1

This sort of output is often used for exploratory work, to get a sense of what’s in a text, but it’s far from perfect.

Another disadvantage of this approach is that it often requires re-assembling the text: the model splits the text into tokens and predicts an entity label for each token separately.

You can see this in the person counts above, where first names like "David" and "William" and surnames like "Brown" and "Wilson" appear as separate entries rather than full names. The same applies to places. Here are the most common location tokens:

place_count_nametagger |>
  filter(entity == "B-LOC") |>
  count(term) |>
  arrange(desc(n)) |>
  head(20)
##          term  n
## 1     Britain 98
## 2      France 32
## 3          UK 26
## 4     Ireland 22
## 5     England 19
## 6      George 17
## 7     Germany 15
## 8      London 14
## 9    Scotland 14
## 10      India  8
## 11      Wales  7
## 12     Battle  6
## 13     Berlin  6
## 14     Europe  6
## 15     Africa  5
## 16      Spain  5
## 17       U.S.  5
## 18     Union.  5
## 19     Canada  4
## 20 Wellington  4

I’m sure you can see the issue: "George" is not a place, "Battle" is a fragment of a longer name, and "Union." has kept its trailing punctuation.
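
If you do need full multi-word names, one way to re-assemble them is to group consecutive entity tokens back together. The following is only a sketch: it assumes the model marks the start of each entity with a "B-" prefix and any continuation tokens with an "I-" prefix, which is the usual CoNLL convention.

# Sketch: stitch consecutive entity tokens back into multi-word names.
# Assumes the usual CoNLL convention of "B-" (begin) and "I-" (inside) tags.
place_count_nametagger |>
  filter(entity != "O") |>
  group_by(doc_id, sentence_id) |>
  # a new entity starts at every "B-" tag
  mutate(entity_id = cumsum(str_starts(entity, "B-"))) |>
  group_by(doc_id, sentence_id, entity_id) |>
  summarise(name = str_flatten(term, collapse = " "),
            type = first(entity),
            .groups = "drop") |>
  count(type, name, sort = TRUE)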

8.4 spaCy

The cutting-edge tool for named entity recognition is spaCy, which is a Python library. In fact, most of the best tools for text analysis are in Python right now, but many of them have R packages that let you use them from R.

However, getting them to work can be a real pain, so we’re not going to do it in class. For your own research, you’ll want to follow these steps, which are described in greater detail in the spacyr documentation:

  1. Install Python 3
  2. Install miniconda
  3. Install the spacyr package for R
  4. Install the reticulate package for R
  5. Run spacy_install()
  6. Run spacy_initialize()
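
In code, that setup looks roughly like the sketch below; the package and model names are the standard ones, but check the spacyr documentation for the details on your platform.

# A rough sketch of the one-time setup (see the spacyr docs for specifics).
install.packages(c("spacyr", "reticulate"))
library(spacyr)
spacy_install()                              # sets up a Python environment with spaCy
spacy_download_langmodel("en_core_web_sm")   # downloads the small English model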

Having done all that, here is a demo of how to use Spacy to extract entities from text.

library(spacyr)
spacy_initialize(model = "en_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

spaCy lets you load a number of languages and models, and we’ll use the small English model, which is the fastest but least accurate.

It supports a huge number of languages, and you can find the full list in the spaCy documentation at https://spacy.io/models.

spaCy has a ton of features, but here we’ll just use the spacy_extract_entity() function, which returns a data frame with the entities and their locations in the text.

place_count_spacy <- spacy_extract_entity(en_wiki_text)
place_count_spacy |> head(10)
##    doc_id                         text ent_type start_id length
## 1   text1           the United Kingdom      GPE        4      3
## 2   text1 the early eighteenth century     DATE        9      4
## 3   text1          the Treaty of Union      ORG       14      4
## 4   text1           the United Kingdom      GPE       26      3
## 5   text1                         1707     DATE       37      1
## 6   text1                      England      GPE       46      1
## 7   text1                Great Britain      GPE       64      2
## 8   text1                 Simon Schama   PERSON       69      2
## 9   text1                     European     NORP      104      1
## 10  text1       the Kingdom of Ireland      GPE      111      4

Running this, we get a huge number of entities, and the model has predicted a type for each one, which is useful for filtering.

Each language model has a different set of entity types, and a little experimentation and filtering will help you find the ones you want.

place_count_spacy |>
  count(ent_type)
##       ent_type   n
## 1     CARDINAL 164
## 2         DATE 517
## 3        EVENT  49
## 4          FAC   8
## 5          GPE 592
## 6     LANGUAGE   8
## 7          LAW  21
## 8          LOC  76
## 9        MONEY  34
## 10        NORP 338
## 11     ORDINAL  46
## 12         ORG 369
## 13     PERCENT  91
## 14      PERSON 265
## 15     PRODUCT   7
## 16    QUANTITY   2
## 17        TIME  11
## 18 WORK_OF_ART   9

For example, we can filter the entities to only include locations, and then count the number of times each location is mentioned.

place_count_spacy |>
  filter(ent_type == "GPE") |>
  count(text) |>
  arrange(desc(n)) |>
  head(10)
##                 text   n
## 1            Britain 116
## 2             France  44
## 3                 UK  32
## 4  the United States  22
## 5            Germany  21
## 6            England  20
## 7           Scotland  20
## 8            Ireland  19
## 9             London  19
## 10             India  13
place_count_spacy |>
  filter(ent_type == "PERSON") |>
  count(text) |>
  arrange(desc(n)) |>
  head(10)
##                  text n
## 1             Johnson 8
## 2             Asquith 6
## 3        Lloyd George 5
## 4               Truss 5
## 5             Baldwin 4
## 6              Brexit 4
## 7       David Cameron 4
## 8  David Lloyd George 4
## 9              Hitler 4
## 10    Stanley Baldwin 4

None of these methods is perfect: here, for example, the model thinks Brexit is a person. Still, this is about as good as an off-the-shelf solution gets.
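
As a small taste of connecting entities back to structured data, which is the point made at the start of the chapter, here is a sketch that joins the spaCy place counts to the gapminder country table. The str_remove() cleanup rule is an assumption of mine and will miss plenty of variants ("Britain" and "UK", for instance, stay unmatched).

# Sketch: attach ISO codes to the places spaCy found.
# The leading-"the" cleanup is a rough assumption, not a general solution.
place_count_spacy |>
  filter(ent_type == "GPE") |>
  mutate(text = str_remove(text, "^the ")) |>
  count(text, sort = TRUE) |>
  inner_join(country_codes, by = c("text" = "country"))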

8.5 Classwork: A dictionary-based approach to finding countries in news articles

A great source of place names in different countries is Natural Earth, which has a number of different data sets for different levels of detail.

There is a great R package for working with this data, called rnaturalearth, which you can install with install.packages("rnaturalearth"). You’ll also need to install the rnaturalearthdata package, which contains the data.

library(rnaturalearth)
## Support for Spatial objects (`sp`) will be deprecated in {rnaturalearth} and will be removed in a future release of the package. Please use `sf` objects with {rnaturalearth}. For example: `ne_download(returnclass = 'sf')`
library(rnaturalearthdata)
## 
## Attaching package: 'rnaturalearthdata'
## The following object is masked from 'package:rnaturalearth':
## 
##     countries110

This contains a ton of detail, but for now we just need the names of the countries in a language of your choice. I’m picking out the Italian names (it’s a cool language), which are in the name_it column.

countries <- ne_countries() |> as_tibble()
## Warning: The `returnclass` argument of `ne_download()` sp as of rnaturalearth 1.0.0.
## ℹ Please use `sf` objects with {rnaturalearth}, support for Spatial objects (sp) will be removed
##   in a future release of the package.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
colnames(countries)
##   [1] "featurecla" "scalerank"  "labelrank"  "sovereignt" "sov_a3"    
##   [6] "adm0_dif"   "level"      "type"       "tlc"        "admin"     
##  [11] "adm0_a3"    "geou_dif"   "geounit"    "gu_a3"      "su_dif"    
##  [16] "subunit"    "su_a3"      "brk_diff"   "name"       "name_long" 
##  [21] "brk_a3"     "brk_name"   "brk_group"  "abbrev"     "postal"    
##  [26] "formal_en"  "formal_fr"  "name_ciawf" "note_adm0"  "note_brk"  
##  [31] "name_sort"  "name_alt"   "mapcolor7"  "mapcolor8"  "mapcolor9" 
##  [36] "mapcolor13" "pop_est"    "pop_rank"   "pop_year"   "gdp_md"    
##  [41] "gdp_year"   "economy"    "income_grp" "fips_10"    "iso_a2"    
##  [46] "iso_a2_eh"  "iso_a3"     "iso_a3_eh"  "iso_n3"     "iso_n3_eh" 
##  [51] "un_a3"      "wb_a2"      "wb_a3"      "woe_id"     "woe_id_eh" 
##  [56] "woe_note"   "adm0_iso"   "adm0_diff"  "adm0_tlc"   "adm0_a3_us"
##  [61] "adm0_a3_fr" "adm0_a3_ru" "adm0_a3_es" "adm0_a3_cn" "adm0_a3_tw"
##  [66] "adm0_a3_in" "adm0_a3_np" "adm0_a3_pk" "adm0_a3_de" "adm0_a3_gb"
##  [71] "adm0_a3_br" "adm0_a3_il" "adm0_a3_ps" "adm0_a3_sa" "adm0_a3_eg"
##  [76] "adm0_a3_ma" "adm0_a3_pt" "adm0_a3_ar" "adm0_a3_jp" "adm0_a3_ko"
##  [81] "adm0_a3_vn" "adm0_a3_tr" "adm0_a3_id" "adm0_a3_pl" "adm0_a3_gr"
##  [86] "adm0_a3_it" "adm0_a3_nl" "adm0_a3_se" "adm0_a3_bd" "adm0_a3_ua"
##  [91] "adm0_a3_un" "adm0_a3_wb" "continent"  "region_un"  "subregion" 
##  [96] "region_wb"  "name_len"   "long_len"   "abbrev_len" "tiny"      
## [101] "homepart"   "min_zoom"   "min_label"  "max_label"  "label_x"   
## [106] "label_y"    "ne_id"      "wikidataid" "name_ar"    "name_bn"   
## [111] "name_de"    "name_en"    "name_es"    "name_fa"    "name_fr"   
## [116] "name_el"    "name_he"    "name_hi"    "name_hu"    "name_id"   
## [121] "name_it"    "name_ja"    "name_ko"    "name_nl"    "name_pl"   
## [126] "name_pt"    "name_ru"    "name_sv"    "name_tr"    "name_uk"   
## [131] "name_ur"    "name_vi"    "name_zh"    "name_zht"   "fclass_iso"
## [136] "tlc_diff"   "fclass_tlc" "fclass_us"  "fclass_fr"  "fclass_ru" 
## [141] "fclass_es"  "fclass_cn"  "fclass_tw"  "fclass_in"  "fclass_np" 
## [146] "fclass_pk"  "fclass_de"  "fclass_gb"  "fclass_br"  "fclass_il" 
## [151] "fclass_ps"  "fclass_sa"  "fclass_eg"  "fclass_ma"  "fclass_pt" 
## [156] "fclass_ar"  "fclass_jp"  "fclass_ko"  "fclass_vn"  "fclass_tr" 
## [161] "fclass_id"  "fclass_pl"  "fclass_gr"  "fclass_it"  "fclass_nl" 
## [166] "fclass_se"  "fclass_bd"  "fclass_ua"
it_countries <- countries |> select(name_it)
it_countries |> sample_n(10)
## # A tibble: 10 × 1
##    name_it    
##    <chr>      
##  1 Regno Unito
##  2 Cile       
##  3 Iran       
##  4 Liberia    
##  5 Laos       
##  6 Porto Rico 
##  7 Oman       
##  8 Giamaica   
##  9 Eritrea    
## 10 Tunisia

One of the great things about Natural Earth is that you can download the data from different national points of view (for example, how disputed territories are named and assigned), which can save you lots of trouble depending on where you publish.

The different points of view are listed here:

https://www.naturalearthdata.com/downloads/10m-cultural-vectors/

You next need some text to analyze. I went to the Wayback Machine and found an archived page of the Italian newspaper La Repubblica from 2019, but you can do whatever makes you happy.

For this classwork, you can now find all the locations mentioned in your text; one possible approach is sketched after the download code below.

repubblica <- read_html("https://web.archive.org/web/20191228042534mp_/https://www.repubblica.it/solidarieta/?ref=RHHD-M")
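
If you get stuck, here is one possible sketch. The "p" selector and the use of fixed() matching are assumptions on my part; inspect the page and adjust as needed.

# One possible sketch for the classwork (selector and matching are assumptions).
repubblica_text <- repubblica |>
  html_elements("p") |>
  html_text2() |>
  str_flatten(collapse = " ")

it_countries |>
  filter(!is.na(name_it)) |>
  mutate(times_mentioned = str_count(repubblica_text, fixed(name_it))) |>
  filter(times_mentioned > 0) |>
  arrange(desc(times_mentioned))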

8.6 Combining Entity Extraction and Other Methods

Just having a list of people or places is kind of interesting, but the real power of named entity recognition is when you combine it with other methods. To give you a small idea of how this could be done, we’re going to look at how we could combine named entity recognition with sentiment analysis.

For this example, we’ll use the UN’s State of the World’s Indigenous Peoples report, which you can find here: https://www.un.org/development/desa/indigenouspeoples/publications/state-of-the-worlds-indigenous-peoples.html

This is a long report, with lots of information about different countries in very different contexts, so it’s a great place to look for interesting patterns.

The first thing we’ll do is use the pdftools package to download the PDF and extract all the text from it.

library(pdftools)
un_report <- pdf_text(pdf = "https://www.un.org/development/desa/indigenouspeoples/wp-content/uploads/sites/19/2021/03/State-of-Worlds-Indigenous-Peoples-Vol-V-Final.pdf")

pdf_text() returns one string per page, so we then collapse the whole thing into a single string, as this will make it easier to work with.

un_report <- un_report |> str_flatten(collapse = " ")
un_report |> write_file("un_report.txt")

Next, we’ll load a list of countries and their names in different languages. For this example, we’ll use the Natural Earth names, here using the names from the CIA World Factbook.

When choosing these dictionaries, you’ll want to carefully check for alternate names and do some editing if necessary.

Compare the following:

countries <- ne_countries() |> as_tibble()
countries |>
  select(name_ciawf, name_en, formal_en, name_de) |>
  filter(str_detect(name_en, "(China|United Kingdom|America|Taiwan)"))
## # A tibble: 4 × 4
##   name_ciawf     name_en                    formal_en                    name_de
##   <chr>          <chr>                      <chr>                        <chr>  
## 1 United States  United States of America   United States of America     Verein…
## 2 China          People's Republic of China People's Republic of China   Volksr…
## 3 Taiwan         Taiwan                     <NA>                         Republ…
## 4 United Kingdom United Kingdom             United Kingdom of Great Bri… Verein…
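
One way to act on that advice is to build a long lookup of name variants per country, so that the plain English name, the CIA World Factbook name, and the formal name all map back to the same ISO code. Here is a sketch; which columns are worth keeping is a judgement call.

# Sketch: one row per (country, name variant), for matching against text.
name_variants <- countries |>
  select(iso_a3, name_en, name_ciawf, formal_en) |>
  pivot_longer(-iso_a3, names_to = "source", values_to = "name") |>
  filter(!is.na(name)) |>
  distinct(iso_a3, name)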

As a test, let’s use the same method as the beginning of this class to look for countries in the text.

country_codes |>
  mutate(ct = str_count(un_report, country)) |>
  filter(ct > 0) |>
  arrange(desc(ct)) |>
  head(10)
## # A tibble: 10 × 4
##    country     iso_alpha iso_num    ct
##    <chr>       <chr>       <int> <int>
##  1 Colombia    COL           170    47
##  2 Kenya       KEN           404    47
##  3 Philippines PHL           608    46
##  4 Brazil      BRA            76    36
##  5 Indonesia   IDN           360    33
##  6 India       IND           356    30
##  7 Canada      CAN           124    28
##  8 Ecuador     ECU           218    28
##  9 Peru        PER           604    27
## 10 Suriname    SUR           740    24

However, we also want to combine this with some sentiment analysis!

A simple way to operationalize this is to split the text into sentences, and then calculate the average sentiment of the sentences that mention a country.

We’ll start by splitting the text into sentences, using the tokenize_sentences() function in the tokenizers package, just as we did for the sentiment analysis class.

library(tokenizers)

un_df <-
  un_report |>
  tokenize_sentences(simplify = TRUE) |>
  as_tibble() |>
  rename(sentence = value)

un_df |> head(10)
## # A tibble: 10 × 1
##    sentence                                                                     
##    <chr>                                                                        
##  1 5th Volume                                               State of the       …
##  2 The Indigenous Peoples and Development Branch/ Secretariat of the Permanent …
##  3 The thematic chapters were written by Mattias Åhrén, Cathal Doyle, Jérémie G…
##  4 Special acknowledge- ment also goes to the editor, Terri Lore, as well as th…
##  5 ST/ESA/375   Department of Economic and Social Affairs Division for Inclusiv…
##  6 The Department works in three main interlinked areas: (i) it compiles, gener…
##  7 Note        The views expressed in the present publication do not necessaril…
##  8 The designations employed and the presentation of the        material in thi…
##  9 The designations of country groups in the text and the        tables are int…
## 10 Mention of the names of firms and commer-        cial products does not impl…

Next, we’ll load the AFINN sentiment dictionary, a word list that assigns each word an integer score from -5 (very negative) to +5 (very positive). It’s available through the tidytext and textdata packages, but you can also download it from here:

http://corpustext.com/reference/sentiment_afinn.html

library(tidytext)
library(textdata)

sentiment_dict <- get_sentiments("afinn")
sentiment_dict
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows

Eventually, we’ll want to calculate the average sentiment of each country, but it’s usually a good idea to start with a single country, to make sure everything is working. Colombia was the most mentioned country in the report, so we’ll start with that.

country_to_look_at <- "Colombia"

Next, we’ll filter the sentences to only include those that mention the country we’re looking at, and then combine them into a single string.

sents_with_country <-
  un_df |>
  filter(str_detect(sentence, country_to_look_at)) |>
  pull(sentence) |>
  str_flatten(collapse = " ")

The sentiment dictionary is all lower case, so we should also convert our sentences to lower case.

sents_with_country <- sents_with_country |> str_to_lower()

We can then use this dictionary to pull out all the sentiment-bearing words that appear in this group of sentences.

In this case, we want to make sure none of the dictionary words are accidentally treated as regular expressions, so we wrap them in fixed() to match each word as a literal string.

avg_sentiment <- sentiment_dict |>
  filter(str_detect(sents_with_country, fixed(word)))
avg_sentiment |> head(10)
## # A tibble: 10 × 2
##    word      value
##    <chr>     <dbl>
##  1 abuse        -3
##  2 abuses       -3
##  3 active        1
##  4 agree         1
##  5 agreement     1
##  6 alone        -2
##  7 anger        -3
##  8 anti         -1
##  9 ass          -4
## 10 attack       -1

Now that we have our list of words, we can use summarise() to take the mean of their scores, a rough but serviceable way to calculate sentiment. Note that because we used str_detect(), each dictionary word counts at most once, no matter how often it appears in the sentences.

avg_sentiment <- avg_sentiment |>
  summarise(sentiment = mean(value))
avg_sentiment
## # A tibble: 1 × 1
##   sentiment
##       <dbl>
## 1    -0.347

Now, we need to prepare to put this in a for-loop, so we should add a second column with the country. Obviously, at this point, it will be just one country.

avg_sentiment <- avg_sentiment |>
  mutate(country = country_to_look_at)
avg_sentiment
## # A tibble: 1 × 2
##   sentiment country 
##       <dbl> <chr>   
## 1    -0.347 Colombia

Now, we create an empty data frame to collect the rows we generate in the for loop.

sentiment_by_country <- tibble()

And we put the code we wrote earlier into a loop, which goes through every country and adds a row to that tibble using rbind(). I added a print statement just for fun.

for (country_to_look_at in country_codes$country) {
  print(country_to_look_at) # This line is new
  sents_with_country <-
    un_df |>
    filter(str_detect(sentence, country_to_look_at)) |>
    pull(sentence) |>
    str_flatten(collapse = " ")

  avg_sentiment <- sentiment_dict |>
    filter(str_detect(sents_with_country, fixed(word))) |> 
    summarise(sentiment = mean(value)) |>
    mutate(country = country_to_look_at)

  sentiment_by_country <- rbind(sentiment_by_country, avg_sentiment) # This line is new
}
## [1] "Afghanistan"
## [1] "Albania"
## [1] "Algeria"
## [1] "Angola"
## [1] "Argentina"
## [1] "Armenia"
## [1] "Aruba"
## [1] "Australia"
## [1] "Austria"
## [1] "Azerbaijan"
## [1] "Bahamas"
## [1] "Bahrain"
## [1] "Bangladesh"
## [1] "Barbados"
## [1] "Belarus"
## [1] "Belgium"
## [1] "Belize"
## [1] "Benin"
## [1] "Bhutan"
## [1] "Bolivia"
## [1] "Bosnia and Herzegovina"
## [1] "Botswana"
## [1] "Brazil"
## [1] "Brunei"
## [1] "Bulgaria"
## [1] "Burkina Faso"
## [1] "Burundi"
## [1] "Cambodia"
## [1] "Cameroon"
## [1] "Canada"
## [1] "Cape Verde"
## [1] "Central African Republic"
## [1] "Chad"
## [1] "Chile"
## [1] "China"
## [1] "Colombia"
## [1] "Comoros"
## [1] "Congo, Dem. Rep."
## [1] "Congo, Rep."
## [1] "Costa Rica"
## [1] "Cote d'Ivoire"
## [1] "Croatia"
## [1] "Cuba"
## [1] "Cyprus"
## [1] "Czech Republic"
## [1] "Denmark"
## [1] "Djibouti"
## [1] "Dominican Republic"
## [1] "Ecuador"
## [1] "Egypt"
## [1] "El Salvador"
## [1] "Equatorial Guinea"
## [1] "Eritrea"
## [1] "Estonia"
## [1] "Ethiopia"
## [1] "Fiji"
## [1] "Finland"
## [1] "France"
## [1] "French Guiana"
## [1] "French Polynesia"
## [1] "Gabon"
## [1] "Gambia"
## [1] "Georgia"
## [1] "Germany"
## [1] "Ghana"
## [1] "Greece"
## [1] "Grenada"
## [1] "Guadeloupe"
## [1] "Guatemala"
## [1] "Guinea"
## [1] "Guinea-Bissau"
## [1] "Guyana"
## [1] "Haiti"
## [1] "Honduras"
## [1] "Hong Kong, China"
## [1] "Hungary"
## [1] "Iceland"
## [1] "India"
## [1] "Indonesia"
## [1] "Iran"
## [1] "Iraq"
## [1] "Ireland"
## [1] "Israel"
## [1] "Italy"
## [1] "Jamaica"
## [1] "Japan"
## [1] "Jordan"
## [1] "Kazakhstan"
## [1] "Kenya"
## [1] "Korea, Dem. Rep."
## [1] "Korea, Rep."
## [1] "Kuwait"
## [1] "Latvia"
## [1] "Lebanon"
## [1] "Lesotho"
## [1] "Liberia"
## [1] "Libya"
## [1] "Lithuania"
## [1] "Luxembourg"
## [1] "Macao, China"
## [1] "Madagascar"
## [1] "Malawi"
## [1] "Malaysia"
## [1] "Maldives"
## [1] "Mali"
## [1] "Malta"
## [1] "Martinique"
## [1] "Mauritania"
## [1] "Mauritius"
## [1] "Mexico"
## [1] "Micronesia, Fed. Sts."
## [1] "Moldova"
## [1] "Mongolia"
## [1] "Montenegro"
## [1] "Morocco"
## [1] "Mozambique"
## [1] "Myanmar"
## [1] "Namibia"
## [1] "Nepal"
## [1] "Netherlands"
## [1] "Netherlands Antilles"
## [1] "New Caledonia"
## [1] "New Zealand"
## [1] "Nicaragua"
## [1] "Niger"
## [1] "Nigeria"
## [1] "Norway"
## [1] "Oman"
## [1] "Pakistan"
## [1] "Panama"
## [1] "Papua New Guinea"
## [1] "Paraguay"
## [1] "Peru"
## [1] "Philippines"
## [1] "Poland"
## [1] "Portugal"
## [1] "Puerto Rico"
## [1] "Qatar"
## [1] "Reunion"
## [1] "Romania"
## [1] "Russia"
## [1] "Rwanda"
## [1] "Samoa"
## [1] "Sao Tome and Principe"
## [1] "Saudi Arabia"
## [1] "Senegal"
## [1] "Serbia"
## [1] "Sierra Leone"
## [1] "Singapore"
## [1] "Slovak Republic"
## [1] "Slovenia"
## [1] "Solomon Islands"
## [1] "Somalia"
## [1] "South Africa"
## [1] "Spain"
## [1] "Sri Lanka"
## [1] "Sudan"
## [1] "Suriname"
## [1] "Swaziland"
## [1] "Sweden"
## [1] "Switzerland"
## [1] "Syria"
## [1] "Taiwan"
## [1] "Tajikistan"
## [1] "Tanzania"
## [1] "Thailand"
## [1] "Timor-Leste"
## [1] "Togo"
## [1] "Tonga"
## [1] "Trinidad and Tobago"
## [1] "Tunisia"
## [1] "Turkey"
## [1] "Turkmenistan"
## [1] "Uganda"
## [1] "Ukraine"
## [1] "United Arab Emirates"
## [1] "United Kingdom"
## [1] "United States"
## [1] "Uruguay"
## [1] "Uzbekistan"
## [1] "Vanuatu"
## [1] "Venezuela"
## [1] "Vietnam"
## [1] "West Bank and Gaza"
## [1] "Yemen, Rep."
## [1] "Zambia"
## [1] "Zimbabwe"

With that done, we can see the most negatively discussed countries in the report:

sentiment_by_country |>
  filter(!is.na(sentiment)) |>
  arrange(sentiment) |>
  head(10)
## # A tibble: 10 × 2
##    sentiment country             
##        <dbl> <chr>               
##  1    -1.33  Zimbabwe            
##  2    -1.29  United Arab Emirates
##  3    -1.09  Botswana            
##  4    -1.04  Niger               
##  5    -1.04  Nigeria             
##  6    -0.92  United States       
##  7    -0.909 Venezuela           
##  8    -0.8   Greece              
##  9    -0.793 Namibia             
## 10    -0.769 Honduras

And also the most positive:

sentiment_by_country |>
  filter(!is.na(sentiment)) |>
  arrange(sentiment) |>
  tail(10)
## # A tibble: 10 × 2
##    sentiment country            
##        <dbl> <chr>              
##  1         2 Tonga              
##  2         2 Trinidad and Tobago
##  3         2 Tunisia            
##  4         2 Turkey             
##  5         2 Turkmenistan       
##  6         2 Ukraine            
##  7         2 Uzbekistan         
##  8         2 Vanuatu            
##  9         2 West Bank and Gaza 
## 10         2 Yemen, Rep.

8.7 Classwork: Brainstorm ways to improve on this method.

There’s something off about the positive ones (notice how many countries score exactly 2). How can we fix them?