Section 1 Session 2

1.1 “Sentiment Analysis”

Let me first read in a common dataset for all our sentiment-an examples.

rm(list=ls())  # clear workspace

# IBM Q3 2016 analyst call transcript
x = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/International%20Business%20Machines%20(IBM)%20Q3%202016%20Results%20-%20Earnings%20Call%20Transcript.txt')

length(x)  # how many docs in this corpus?
## [1] 219

1.1.1 Using qdap for sentiment analysis

qdap, or the Quantitative Discourse Analysis Package, offers a relatively rigorous sentiment-An approach. However, it is limited to a standard dictionary (from Princeton).

Our main use of sentiment-An in this session shall be with the tidytext library. However, for completeness' sake, let's quickly do qdap first.

suppressPackageStartupMessages({
# loading required libraries
if(!require(qdap)) {install.packages("qdap")} # ensure java is up to date!
if(!require(textdata)) {install.packages("textdata")}  
library(qdap)
library(textdata)
})  

Use the polarity() func from qdap and store its output in various variables. Try ?qdap::polarity for more detail.

# apply polarity() func from qdap to compute sentiment polarity
t1 = Sys.time()   # set timer
  pol = qdap::polarity(x)         # Calculate the polarity from qdap dictionary
## Warning in qdap::polarity(x): 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.
  wc = pol$all[,2]                  # Word Count in each doc
  val = pol$all[,3]                 # average polarity score
  p  = pol$all[,4]                  # Positive words info
  n  = pol$all[,5]                  # Negative Words info  
print(round(Sys.time() - t1, 3))  # how much time did the above take?
## Time difference of 2.578 secs

View output variables.

head(pol$all)  # see struc of obj 'pol' in Envmt pane first
##   all wc   polarity pos.words neg.words
## 1 all  6  0.0000000         -         -
## 2 all  5  0.0000000         -         -
## 3 all  3  0.0000000         -         -
## 4 all  1  0.0000000         -         -
## 5 all  6 -0.4082483         -      vice
## 6 all  9 -0.3333333         -      vice
##                                                               text.var
## 1              International Business Machines Corporation. (NYSE:IBM)
## 2                             Q3 2016 Results Earnings Conference Call
## 3                                        October 17, 2016, 05:00 PM ET
## 4                                                           Executives
## 5                  Patricia Murphy - Vice President-Investor Relations
## 6 Martin Schroeter - Senior Vice President and Chief Financial Officer

Summarize polarity scores for the corpus

head(pol$group)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all             219       10275    0.1032166   0.2454903          0.4204508

What positive words were there in the corpus?

positive_words = unique(setdiff(unlist(p),"-"))  # Positive words list, do ?dplyr::setdiff
print(positive_words)       # Print all the positive words found in the corpus
##  [1] "thank"           "like"            "welcome"         "available"      
##  [5] "reform"          "reconciliation"  "stability"       "strong"         
##  [9] "top"             "consistent"      "led"             "solid"          
## [13] "modest"          "good"            "progress"        "best"           
## [17] "well"            "innovation"      "success"         "helping"        
## [21] "leverage"        "contribution"    "easy"            "intelligence"   
## [25] "better"          "trusted"         "support"         "smart"          
## [29] "work"            "improvements"    "improved"        "gains"          
## [33] "meaningful"      "effective"       "benefit"         "free"           
## [37] "improvement"     "gain"            "capability"      "correct"        
## [41] "wins"            "improve"         "compliant"       "efficient"      
## [45] "trusting"        "right"           "great"           "protect"        
## [49] "leading"         "empower"         "faster"          "important"      
## [53] "secure"          "agile"           "significant"     "easier"         
## [57] "advanced"        "stable"          "supporting"      "win"            
## [61] "flexible"        "guidance"        "innovative"      "supports"       
## [65] "pretty"          "successful"      "straightforward" "appreciate"     
## [69] "helped"          "enough"          "terrific"        "won"            
## [73] "confident"       "improving"       "positive"        "accurate"       
## [77] "overtakes"       "reasonable"      "consistently"    "wonder"         
## [81] "regard"          "appeal"          "benefits"        "sustainability" 
## [85] "tougher"         "interesting"     "freed"           "freedom"        
## [89] "sustainable"     "appreciated"     "trust"           "excited"

We can use a wordcloud, bar chart, or other display aids on the list above, as required.
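
For instance, here is a quick hedged sketch (the wordcloud package is assumed installed; 'p' is the pos.words column extracted from pol$all above):

require(wordcloud)
pos_freq = table(unlist(p))                   # count how often each positive word occurs
pos_freq = pos_freq[names(pos_freq) != "-"]   # drop the '-' placeholder entries
wordcloud(names(pos_freq), as.numeric(pos_freq), min.freq = 1)   # words sized by frequency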

What negative words were there in the corpus?

negative_words = unique(setdiff(unlist(n),"-"))  # Negative words list
print(negative_words)       # Print all neg words
##  [1] "vice"          "volatility"    "cloud"         "decline"      
##  [5] "drones"        "critical"      "risk"          "cancer"       
##  [9] "uncertain"     "weaker"        "gross"         "set up"       
## [13] "chronic"       "risks"         "latency"       "object"       
## [17] "declining"     "drag"          "debt"          "declines"     
## [21] "oppose"        "bump"          "negative"      "indiscernible"
## [25] "volatile"      "errors"        "inaccuracies"

While qdap is nice and easy, it has its limitations.

E.g., it uses an inbuilt wordlist that cannot be customized for particular contexts.

Another is that it does exact matching and misses minor variations in sentiment words. It suffers from the old problems of synonymy (i.e., multiple words that have similar meanings) and polysemy (words that have more than one meaning).

While these are long-standing problems that cannot be eliminated within the bag-of-words approach, more flexibility would at least allow substantial mitigation.

1.1.2 Sentiment-An with Tidytext

There are 3 inbuilt sentiment dictionaries as of now in tidytext, with a fourth under development.

To briefly see what these are, return to the slides.

The nice thing about tidytext is that while there are 3 inbuilt sentiment dictionaries in different output formats for convenience, we can also create and customize our own wordlists as needed.

suppressPackageStartupMessages({
require(tidytext)
require(tidyr)
require(dplyr)
})
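
To make that concrete, here is a minimal, hedged sketch of a custom wordlist (the words below are purely illustrative, not an official lexicon); it plugs into the same tidy workflow as the inbuilt dictionaries.

custom_lex = tibble(word      = c("headwind", "churn", "tailwind", "uptick"),
                    sentiment = c("negative", "negative", "positive", "positive"))

tibble(text = "despite the churn, we saw a clear uptick this quarter") %>%
  unnest_tokens(word, text) %>%             # tokenize to one word per row
  inner_join(custom_lex, by = "word") %>%   # keep only words in our custom lexicon
  count(sentiment)                          # tally the custom sentiment labels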

The tidytext package contains all three sentiment lexicons in the sentiments dataset.

# over 27k words sentiment-ized and scored across the 3 lexicons
sentiments = read.csv("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/sentiments.csv")    

head(sentiments, 10)
##     X        word sentiment lexicon score
## 1   1      abacus     trust     nrc    NA
## 2   2     abandon      fear     nrc    NA
## 3   3     abandon  negative     nrc    NA
## 4   4     abandon   sadness     nrc    NA
## 5   5   abandoned     anger     nrc    NA
## 6   6   abandoned      fear     nrc    NA
## 7   7   abandoned  negative     nrc    NA
## 8   8   abandoned   sadness     nrc    NA
## 9   9 abandonment     anger     nrc    NA
## 10 10 abandonment      fear     nrc    NA

But the inbuilt dataset is now incomplete as one lexicon (‘NRC’) was withdrawn after objections from the lexicon’s creators. Strictly for academic purposes only, I’m using a saved version of that lexicon.

# sentiments = read.csv("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/sentiments.csv")

sentiments %>% 
          filter(lexicon == "AFINN") %>% 
          head()
##       X       word sentiment lexicon score
## 1 20690    abandon      <NA>   AFINN    -2
## 2 20691  abandoned      <NA>   AFINN    -2
## 3 20692   abandons      <NA>   AFINN    -2
## 4 20693   abducted      <NA>   AFINN    -2
## 5 20694  abduction      <NA>   AFINN    -2
## 6 20695 abductions      <NA>   AFINN    -2

Let’s start simple, with Bing.

textdf = tibble(text = x)   # convert to a tibble; data_frame() is deprecated in favour of tibble()
bing = get_sentiments("bing")   # put all of the bing sentiment dict into object 'bing'

head(bing, 10)     # view bing object
## # A tibble: 10 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative

Which docs are most positive and negative in the corpus?

senti.bing = textdf %>%
      mutate(linenumber = seq(1:nrow(textdf))) %>%   # build line num variable
      ungroup() %>%
      unnest_tokens(word, text) %>%
      inner_join(get_sentiments("bing")) %>%
      count(sentiment, index = linenumber %/% 1, sort = FALSE) %>%
      # mutate(index = row_number()) %>%
      mutate(method = "bing")    # creates a column with method name
## Joining, by = "word"
head(senti.bing, 15)
## # A tibble: 15 × 4
##    sentiment index     n method
##    <chr>     <dbl> <int> <chr> 
##  1 negative      5     1 bing  
##  2 negative      6     1 bing  
##  3 negative     19     2 bing  
##  4 negative     25     1 bing  
##  5 negative     26     2 bing  
##  6 negative     27     1 bing  
##  7 negative     28     3 bing  
##  8 negative     29     1 bing  
##  9 negative     31     1 bing  
## 10 negative     32     1 bing  
## 11 negative     34     5 bing  
## 12 negative     35     1 bing  
## 13 negative     36     3 bing  
## 14 negative     38     3 bing  
## 15 negative     40     1 bing

Now let’s see the distribution of positive and negative sentiment within documents across the corpus.

Note the use of the spread() function, which takes the extra rows pertaining to each index (doc) and turns them into extra columns. Do ?tidyr::spread for more info.

bing_df = data.frame(senti.bing %>% spread(sentiment, n, fill = 0))

head(bing_df)
##   index method negative positive
## 1     5   bing        1        0
## 2     6   bing        1        0
## 3    19   bing        2        3
## 4    20   bing        0        1
## 5    21   bing        0        2
## 6    22   bing        0        1
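
Aside: spread() still works but has since been superseded by pivot_wider() in newer tidyr. A hedged equivalent (assuming tidyr >= 1.1):

bing_df2 = senti.bing %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%  # sentiment rows -> columns
  data.frame()
head(bing_df2)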

Can we combine the negative and positive columns by, say, subtracting the negative from the positive score and thereby computing a polarity score for each line?

Yes, see code below.

bing_pol = bing_df %>% 
  mutate(polarity = (positive - negative)) %>%   #create variable polarity = pos - neg
  arrange(desc(polarity))    # sort by polarity

bing_pol %>%  head()
##   index method negative positive polarity
## 1   212   bing        0        9        9
## 2   110   bing        0        8        8
## 3   154   bing        1        9        8
## 4   160   bing        0        8        8
## 5   129   bing        1        7        6
## 6    45   bing        0        5        5

Now for some quick visualization of the distribution of sentiment across the analyst call. See code below.

require(ggplot2)
## Loading required package: ggplot2
# plotting running sentiment distribution across the analyst call
ggplot(bing_pol, 
       aes(index, polarity)) +
       geom_bar(stat = "identity", show.legend = FALSE) +
      labs(title = "Sentiment in IBM analyst call corpus",
             x = "doc",  
             y = "Sentiment")

Another quick visualization. We want to see which words contributed most to positive or negative sentiment in the corpus, using the bing lexicon.

So first we create a count of bing sentiment words that occur a lot in the corpus.

bing_word_counts <- textdf %>%
                    unnest_tokens(word, text) %>%
                    inner_join(bing) %>%
                    count(word, sentiment, sort = TRUE) %>%
                    ungroup()
## Joining, by = "word"
bing_word_counts %>% head(., 10)
## # A tibble: 10 × 3
##    word      sentiment     n
##    <chr>     <chr>     <int>
##  1 cloud     negative     60
##  2 thank     positive     22
##  3 good      positive     21
##  4 well      positive     19
##  5 free      positive     17
##  6 like      positive     14
##  7 strong    positive     12
##  8 right     positive     11
##  9 gross     negative      8
## 10 important positive      8

Now ggplot it and see.

bing_word_counts %>%
  filter(n > 3) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment")

And then there’re wordclouds.

require(wordcloud)
## Loading required package: wordcloud
## Loading required package: RColorBrewer
suppressWarnings({
  # build wordcloud of commonest tokens
  textdf %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(word) %>%
    with(wordcloud(word, n, max.words = 100))
})
## Joining, by = "word"
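
A hedged extension of the same idea: a comparison cloud contrasting positive vs negative bing words (assumes the reshape2 package is installed; acast() pivots the tidy counts into the matrix that comparison.cloud() expects).

require(reshape2)
bing_word_counts %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%   # words x {negative, positive} matrix
  comparison.cloud(colors = c("red", "darkgreen"), max.words = 100)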

1.1.3 Sentiment-An with AFINN

# get AFINN first
AFINN <- sentiments %>% filter(lexicon == "AFINN") # get_sentiments("AFINN")
head(AFINN, 10)
##        X       word sentiment lexicon score
## 1  20690    abandon      <NA>   AFINN    -2
## 2  20691  abandoned      <NA>   AFINN    -2
## 3  20692   abandons      <NA>   AFINN    -2
## 4  20693   abducted      <NA>   AFINN    -2
## 5  20694  abduction      <NA>   AFINN    -2
## 6  20695 abductions      <NA>   AFINN    -2
## 7  20696      abhor      <NA>   AFINN    -3
## 8  20697   abhorred      <NA>   AFINN    -3
## 9  20698  abhorrent      <NA>   AFINN    -3
## 10 20699     abhors      <NA>   AFINN    -3
# inner join AFINN words and scores with text tokens from corpus
senti.afinn = textdf %>%
      mutate(linenumber = row_number()) %>%
      ungroup() %>%
      unnest_tokens(word, text) %>%
      inner_join(AFINN) %>%    # returns only intersection of wordlists and all columns
      group_by(index = linenumber %/% 1) %>% 
      summarise(sentiment = sum(score)) %>% 
      mutate(method = "afinn")
## Joining, by = "word"
senti.afinn %>% head(., 10)
## # A tibble: 10 × 3
##    index sentiment method
##    <dbl>     <int> <chr> 
##  1    19         6 afinn 
##  2    20         1 afinn 
##  3    21         0 afinn 
##  4    22         1 afinn 
##  5    25        11 afinn 
##  6    26         7 afinn 
##  7    27         6 afinn 
##  8    28        14 afinn 
##  9    29         9 afinn 
## 10    30         6 afinn
data.frame(senti.afinn) %>% head()
##   index sentiment method
## 1    19         6  afinn
## 2    20         1  afinn
## 3    21         0  afinn
## 4    22         1  afinn
## 5    25        11  afinn
## 6    26         7  afinn
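
Before moving on to bigrams, a quick hedged side-by-side: both lexicons now give one net score per doc, so we can eyeball how far they agree (assumes the senti.afinn and bing_pol objects built above, and ggplot2 loaded earlier).

bind_rows(senti.afinn %>% select(index, sentiment, method),                 # afinn net score per doc
          bing_pol %>% transmute(index, sentiment = polarity, method)) %>%  # bing net score per doc
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ method, ncol = 1, scales = "free_y")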

Can we combine sentiment-An with bigrams?

E.g., suppose you want to list bigrams whose first word is a sentiment word. See the code below.

# first construct and split bigrams into word1 and word2
ibm_bigrams_separated <- textdf %>%
                unnest_tokens(bigram, text, 
                token = "ngrams", n = 2) %>%
                separate(bigram, c("word1", "word2"), sep = " ")

ibm_bigrams_separated %>% head(., 10)
## # A tibble: 10 × 2
##    word1         word2      
##    <chr>         <chr>      
##  1 international business   
##  2 business      machines   
##  3 machines      corporation
##  4 corporation   nyse:ibm   
##  5 q3            2016       
##  6 2016          results    
##  7 results       earnings   
##  8 earnings      conference 
##  9 conference    call       
## 10 october       17

Next, inner join with AFINN

# examine the most frequent bigrams whose first word is a sentiment word
senti_bigrams <- ibm_bigrams_separated %>%
  
              # word1 from bigrams and word from AFINN
              inner_join(AFINN, by = c(word1 = "word")) %>% ungroup()

senti_bigrams %>% head(.,10)
## # A tibble: 10 × 6
##    word1      word2        X sentiment lexicon score
##    <chr>      <chr>    <int> <chr>     <chr>   <int>
##  1 thank      you      22911 <NA>      AFINN       2
##  2 like       to       22127 <NA>      AFINN       2
##  3 welcome    you      23108 <NA>      AFINN       2
##  4 prepared   remarks  22444 <NA>      AFINN       1
##  5 certain    comments 21055 <NA>      AFINN       1
##  6 litigation reform   22133 <NA>      AFINN      -1
##  7 certain    non      21055 <NA>      AFINN       1
##  8 thanks     patricia 22913 <NA>      AFINN       2
##  9 share      as       22691 <NA>      AFINN       1
## 10 like       banking  22127 <NA>      AFINN       2

Filter out stopwords and then do inner join with AFINN.

# what if we want sentiment associated with proper words, not stopwords?
senti_bigrams_filtered = ibm_bigrams_separated %>%
                        filter(!word1 %in% stop_words$word) %>%
                        filter(!word2 %in% stop_words$word) %>%
                        inner_join(AFINN, by = c(word1 = "word")) %>%     # word1 is from bigrams and word from AFINN
                        ungroup()

senti_bigrams_filtered %>% head(.,10)
## # A tibble: 10 × 6
##    word1      word2           X sentiment lexicon score
##    <chr>      <chr>       <int> <chr>     <chr>   <int>
##  1 prepared   remarks     22444 <NA>      AFINN       1
##  2 litigation reform      22133 <NA>      AFINN      -1
##  3 strong     growth      22835 <NA>      AFINN       2
##  4 strong     growth      22835 <NA>      AFINN       2
##  5 solid      recurring   22753 <NA>      AFINN       2
##  6 solution   software    22755 <NA>      AFINN       1
##  7 strong     performance 22835 <NA>      AFINN       2
##  8 strong     revenue     22835 <NA>      AFINN       2
##  9 strong     growth      22835 <NA>      AFINN       2
## 10 easy       hybrid      21497 <NA>      AFINN       1

And so on.
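
One classic extension (a hedged sketch): look for sentiment words preceded by a negator such as "not", whose contribution a plain unigram join would get exactly backwards.

ibm_bigrams_separated %>%
  filter(word1 == "not") %>%                     # bigrams starting with the negator "not"
  inner_join(AFINN, by = c(word2 = "word")) %>%  # the second word carries the sentiment score
  count(word2, score, sort = TRUE) %>%
  head(10)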

Heading finally to the NRC dictionary.

1.1.4 Sentiment-An with the NRC dictionary

# view nrc dict structure
nrc = sentiments %>% filter(lexicon == "nrc") # get_sentiments("nrc")
nrc %>% head(.,10)
##     X        word sentiment lexicon score
## 1   1      abacus     trust     nrc    NA
## 2   2     abandon      fear     nrc    NA
## 3   3     abandon  negative     nrc    NA
## 4   4     abandon   sadness     nrc    NA
## 5   5   abandoned     anger     nrc    NA
## 6   6   abandoned      fear     nrc    NA
## 7   7   abandoned  negative     nrc    NA
## 8   8   abandoned   sadness     nrc    NA
## 9   9 abandonment     anger     nrc    NA
## 10 10 abandonment      fear     nrc    NA
nrc1 = nrc[, c(2,3)]; head(nrc1)
##        word sentiment
## 1    abacus     trust
## 2   abandon      fear
## 3   abandon  negative
## 4   abandon   sadness
## 5 abandoned     anger
## 6 abandoned      fear
# use NRC lexicon now 
senti.nrc = textdf %>%
      mutate(linenumber = row_number()) %>%
      ungroup() %>%
      unnest_tokens(word, text) %>%
      inner_join(nrc1) %>%
      count(sentiment, index = linenumber %/% 1, sort = FALSE) %>%  # %/% gives quotient
      mutate(method = "nrc")
## Joining, by = "word"
senti.nrc %>% head()
## # A tibble: 6 × 4
##   sentiment index     n method
##   <chr>     <dbl> <int> <chr> 
## 1 anger        13     1 nrc   
## 2 anger        25     1 nrc   
## 3 anger        31     1 nrc   
## 4 anger        38     1 nrc   
## 5 anger        40     1 nrc   
## 6 anger        42     1 nrc
# make a neat table out of the 8 emotion dimensions
a = data.frame(senti.nrc %>% spread(sentiment, n, fill = 0))
head(a)
##   index method anger anticipation disgust fear joy negative positive sadness
## 1     1    nrc     0            0       0    0   0        0        1       0
## 2     5    nrc     0            0       0    0   0        1        1       0
## 3     6    nrc     0            0       0    0   0        1        2       0
## 4    13    nrc     1            0       1    1   0        1        0       1
## 5    19    nrc     0            0       0    0   0        2        3       0
## 6    20    nrc     0            3       0    0   0        0        1       0
##   surprise trust
## 1        0     1
## 2        0     1
## 3        0     2
## 4        0     1
## 5        0     3
## 6        0     1
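
A quick hedged summary of the table above: total word counts per NRC emotion across the whole corpus.

a %>% select(-index, -method) %>% colSums() %>% sort(decreasing = TRUE)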

Suppose you want to see what joyful words most occurred in the corpus.

ibm_joy = textdf %>%
      unnest_tokens(word, text) %>%
      inner_join(nrc) %>%
      filter(sentiment == "joy") %>%
      count(word, sort = TRUE)
## Joining, by = "word"
ibm_joy %>% head()
## # A tibble: 6 × 2
##   word            n
##   <chr>       <int>
## 1 cash           25
## 2 income         24
## 3 good           21
## 4 kind           10
## 5 improvement     8
## 6 grow            7

1.2 === Exercise ===

Q: Can we functionize common workflows in tidytext sentiment-an?

Say I ask you to write a func that * [1] takes an input corpus, * [2] applies a sentiment lexicon to it, * [3] outputs a DF with a sentiment score for each document and * [4] produces an associated visualization?

Q: Can we run such a func on the Nokia dataset? Potential homework?

Well, here below is my attempt at writing a func for the afinn lexicon. You can write your own funcs for different lexica and they will yield different outputs.

Behold.

Afinn_Sentimt <- function(corpus0){

    textdf = tibble(text = corpus0) # make DF out of corpus first

    # build DF with doc index, sentimt & score
    a0 = textdf %>%
        mutate(linenumber = seq(1:nrow(textdf))) %>%   # build line num variable
        ungroup() %>% unnest_tokens(word, text) %>%
        inner_join(get_sentiments("afinn")) %>% rename(sentiment = value) 

    # for wordcloud output
    a1 = a0 %>%  group_by(word) %>%
        summarise(senti_score = sum(sentiment), n = n()) %>%
        with(wordcloud(word, n, max.words = 200)) # plots wordcloud

    # for dataset output
    a2 = a0 %>% 
        count(sentiment, index = linenumber %/% 1, sort = FALSE) %>%
        mutate(method = "afinn") %>%   # creates a column with method name
        spread(sentiment, n, fill = 0)  

return(a2) } # func ends

Time to test the func we wrote above.

nokia = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')

require(stringr)
## Loading required package: stringr
nokia  =  str_replace_all(nokia, "<.*?>", "") # get rid of html junk 

# test drive the func
system.time({a3 = Afinn_Sentimt(nokia)})
## Joining, by = "word"

##    user  system elapsed 
##   0.359   0.008   0.309
a3
## # A tibble: 115 × 11
##    index method  `-4`  `-3`  `-2`  `-1`   `1`   `2`   `3`   `4`   `5`
##    <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1 afinn      0     1     0     1     9     4     7     2     0
##  2     2 afinn      0     0     2     1     5     5     9     0     0
##  3     3 afinn      0     0     2     0     1     4     4     5     0
##  4     4 afinn      0     0     0     0     0     0     2     0     0
##  5     5 afinn      0     0     2     1     3     8     9     0     0
##  6     6 afinn      0     0     6     7     4     3     5     0     0
##  7     7 afinn      0     0     3     0     0     0     2     0     0
##  8     8 afinn      0     0     1     0     5     4    12     1     0
##  9     9 afinn      0     0     1     0     1     1     2     0     0
## 10    10 afinn      0     0     0     0     0     0     1     0     1
## # … with 105 more rows
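
A hedged follow-up you may find handy: collapsing a3's per-score counts into one net AFINN score per review (score times count, summed; assumes tidyr >= 1.0 for pivot_longer()).

a3 %>%
  pivot_longer(-c(index, method), names_to = "score", values_to = "n") %>%  # back to long form
  mutate(score = as.numeric(score)) %>%
  group_by(index) %>%
  summarise(net_sentiment = sum(score * n)) %>%   # weighted sum of AFINN scores per doc
  head()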

Chalo, dassall for now. Back to the slides.

Sudhir

1.3 “Sentiment-An with sentimentr”

Let me first read in a common dataset for all our sentiment-an examples.

# IBM Q3 2016 analyst call transcript
x = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/International%20Business%20Machines%20(IBM)%20Q3%202016%20Results%20-%20Earnings%20Call%20Transcript.txt')

textdf = as.data.frame(x)
head(textdf)
##                                                                      x
## 1              International Business Machines Corporation. (NYSE:IBM)
## 2                             Q3 2016 Results Earnings Conference Call
## 3                                        October 17, 2016, 05:00 PM ET
## 4                                                           Executives
## 5                  Patricia Murphy - Vice President-Investor Relations
## 6 Martin Schroeter - Senior Vice President and Chief Financial Officer

First let us load the required libraries

suppressPackageStartupMessages({
    # loading required libraries
    if(!require(sentimentr)) { install.packages("sentimentr")} # install sentimentr if not already present
    library(sentimentr)
})

Let's evaluate the sentiment in those sentences.

# Break each document into sentences. Then average sentiment of all sents in a doc.
out <- sentiment_by(get_sentences(textdf))

head(out, 10)
##     element_id word_count         sd ave_sentiment
##  1:          1          6 0.08838835   -0.06821079
##  2:          2          5         NA    0.11180340
##  3:          3          3         NA    0.00000000
##  4:          4          1         NA    0.00000000
##  5:          5          6         NA   -0.20412415
##  6:          6          9         NA   -0.16666667
##  7:          7          1         NA    0.00000000
##  8:          8          3         NA    0.00000000
##  9:          9          4         NA    0.00000000
## 10:         10          4         NA    0.00000000

Let's plot the sentiment output for better visualization.

plot(out)

If there are other variables which we can use to group documents together and evaluate sentiment across the corpus, we can do so by passing a "by" value to the sentiment_by() function.

By default it groups by the element_id. To illustrate this, we will use the presidential_debates_2012 dataset that ships with sentimentr.

out <- with(
    presidential_debates_2012, 
    sentiment_by(
        get_sentences(dialogue), 
        list(person)
    )
)

out
##       person word_count        sd ave_sentiment
## 1:     OBAMA      18319 0.2488916    0.10204767
## 2:    ROMNEY      19924 0.2444782    0.09611597
## 3:   CROWLEY       1672 0.2181662    0.19455290
## 4:    LEHRER        765 0.2973360    0.15473364
## 5:  QUESTION        583 0.1756778    0.03197751
## 6: SCHIEFFER       1445 0.2345187    0.08843478
plot(out)

Plotting by sentiment value at the sentence level (uncombine() undoes the grouping done by sentiment_by()).

plot(uncombine(out))

1.4 “Valence Shifters for Sentiment-An”

1.4.1 Sentiment analysis with sentimentr

Loading required libraries

rm(list=ls())

suppressPackageStartupMessages({

  if (!require(sentimentr)) {install.packages("sentimentr")} # install sentimentr if not already present
  library(sentimentr)
  library(dplyr)
  library(ggplot2)   # for visualizing sentimt variation

  })

1.4.1.1 Sentiment scoring using valence shifters

sentimentr uses valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) to compute its score. These shifters change a sentence's meaning but are not taken into account by the usual dictionary-based packages.

Behold a simple example.

sent1 <- "I like to learn new things"
sent2 <- "I not so much like to learn new things"
 
sentiment(sent1)
##    element_id sentence_id word_count sentiment
## 1:          1           1          6 0.8573214
sentiment(sent2)
##    element_id sentence_id word_count sentiment
## 1:          1           1          9 0.3933333

1.4.2 Valence Shifters in action

Consider a list of simple-ish sentences and how sentimentr treats the various valence shifters contained therein.

To see what valence shifters are available in the lexicon package, run the following:

require(lexicon)
## Loading required package: lexicon
## 
## Attaching package: 'lexicon'
## The following object is masked from 'package:sentimentr':
## 
##     available_data
hash_valence_shifters
##               x y
##   1: absolutely 2
##   2:      acute 2
##   3:    acutely 2
##   4:      ain't 1
##   5:       aint 1
##  ---             
## 136:    whereas 4
## 137:      won't 1
## 138:       wont 1
## 139:   wouldn't 1
## 140:    wouldnt 1
hash_valence_shifters[y==1][1:8]  # to see top 8  negators
##           x y
## 1:    ain't 1
## 2:     aint 1
## 3:   aren't 1
## 4:    arent 1
## 5:    can't 1
## 6:   cannot 1
## 7:     cant 1
## 8: couldn't 1
hash_valence_shifters[y==2][1:8]  # to see amplifiers wala list
##               x y
## 1:   absolutely 2
## 2:        acute 2
## 3:      acutely 2
## 4:      certain 2
## 5:    certainly 2
## 6:     colossal 2
## 7:   colossally 2
## 8: considerably 2
hash_valence_shifters[y==3][1:8]  # for de-amplifiers list
##             x y
## 1:     almost 3
## 2:     barely 3
## 3:    faintly 3
## 4:        few 3
## 5:     hardly 3
## 6: incredibly 3
## 7:    kind of 3
## 8:      kinda 3
hash_valence_shifters[y==4]   # for adversative conjunctions
##                   x y
## 1:         although 4
## 2:              but 4
## 3: despite all that 4
## 4: despite all this 4
## 5:     despite that 4
## 6:     despite this 4
## 7:          however 4
## 8:  that being said 4
## 9:          whereas 4
sentiment("I like it.")  # simple sent
##    element_id sentence_id word_count sentiment
## 1:          1           1          3 0.2886751
sentiment("I don't like it.")  # negation
##    element_id sentence_id word_count sentiment
## 1:          1           1          4     -0.25
sentiment("I hate it.")   # simple sent
##    element_id sentence_id word_count  sentiment
## 1:          1           1          3 -0.4330127
sentiment("I don't hate it.")  # negation
##    element_id sentence_id word_count sentiment
## 1:          1           1          4     0.375
sentiment("But I don't hate it.")  # negation with adverserial conjunction
##    element_id sentence_id word_count sentiment
## 1:          1           1          5 0.7546729

1.4.2.1 De-amplifiers or downtoners

I’d expect de-amplifiers like 'may', 'somewhat', 'sort of', etc. to dampen the sentiment score of a sentence. Let’s see whether and to what extent that happens.

sentiment("I not like it.")   # another negation
##    element_id sentence_id word_count sentiment
## 1:          1           1          4     -0.25
sentiment("I may not like it.")  # negation with a de-amplifier
##    element_id sentence_id word_count  sentiment
## 1:          1           1          5 -0.2236068
sentiment("I almost don't like it.")  # negation with a de-amplifier
##    element_id sentence_id word_count   sentiment
## 1:          1           1          5 -0.04472136
sentiment("I hardly like it.")    # de-amplifier or downtoner
##    element_id sentence_id word_count sentiment
## 1:          1           1          4      0.05
sentiment("I sort of like it.")   # more de-amplifier or downtoner
##    element_id sentence_id word_count sentiment
## 1:          1           1          5         0
sentiment("I somewhat like it.")  # more de-amplifier or downtoner
##    element_id sentence_id word_count sentiment
## 1:          1           1          4      0.05

1.4.2.2 Amplifiers or intensifiers

Amplifiers like ‘really’, ‘absolutely’, ‘surely’ etc. will intensify the valence in a sentence.

sentiment("I really like it.")   # amplifier or intensifier
##    element_id sentence_id word_count sentiment
## 1:          1           1          4      0.45
sentiment("I never like it.")   # negation + amplifier
##    element_id sentence_id word_count sentiment
## 1:          1           1          4     -0.25
sentiment("I sure like it.")   # straight amplifier
##    element_id sentence_id word_count sentiment
## 1:          1           1          4      0.45

1.4.2.3 Repeat negations

sentiment("I'm not happy.")    # single negation
##    element_id sentence_id word_count  sentiment
## 1:          1           1          3 -0.4330127
sentiment("I'm not unhappy.")  # double negation
##    element_id sentence_id word_count sentiment
## 1:          1           1          3 0.4330127
sentiment("I don't feel not unhappy.")   # triple negation!
##    element_id sentence_id word_count  sentiment
## 1:          1           1          5 -0.3354102

1.4.2.4 Adversative conjunctions

sentiment("But I don't like it.")  # Adversative conjunction + negation
##    element_id sentence_id word_count  sentiment
## 1:          1           1          5 -0.5031153
sentiment("I didn't like it but now I like it.")   # adversative conjunction
##    element_id sentence_id word_count sentiment
## 1:          1           1          9 0.3748333
sentiment("I didn't like it and now I like it.")   # non-adversative conjunction
##    element_id sentence_id word_count sentiment
## 1:          1           1          9         0

So what all did we see above?

1.4.3 Analyzing groups of sentences

Sometimes we want to analyze sentences as a group rather than one at a time.

We can use the sentiment_by() function to group sentences according to some criterion.

It averages the sentiment across the sentences in each group and outputs an aggregated value.

sent3 <- "I hate reading. But I love comic books."

sentiment(sent3)
##    element_id sentence_id word_count  sentiment
## 1:          1           1          3 -0.3752777
## 2:          1           2          5  0.7546729
sentiment_by(sent3)
##    element_id word_count        sd ave_sentiment
## 1:          1          8 0.7989957     0.1896976

1.4.4 Extracting words from score calculation

We can extract the words which were used for polarity calculation and also the polarity they contributed.

sent3   # view sent3 again
## [1] "I hate reading. But I love comic books."
words <- extract_sentiment_terms(sent3)

attributes(extract_sentiment_terms(sent3))$elements
##    element_id sentence_id   words polarity
## 1:          1           1       i     0.00
## 2:          1           1    hate    -0.75
## 3:          1           1 reading     0.10
## 4:          1           2     but     0.00
## 5:          1           2       i     0.00
## 6:          1           2    love     0.75
## 7:          1           2   comic     0.00
## 8:          1           2   books     0.00

1.4.5 Applying to Nokia dataset

How about we apply this to a sizeable dataset and not just a few small sentences here and there, eh?

The plan next is to apply the above to the Nokia Lumia reviews dataset.

nokia = readLines('https://github.com/sudhir-voleti/sample-data-sets/raw/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')

review1 = nokia[1]   # try first review 
sentiment_by(review1)  # get aggregate polarity for the whole review
##    element_id word_count       sd ave_sentiment
## 1:          1        423 0.372013     0.2655574
# get polarity by indiv sentence
a0.df = review1 %>% 
    get_sentences() %>% # inbuilt sentence-parsing
    sentiment()    # sentiment for each sentence parsed
a0.df
##     element_id sentence_id word_count   sentiment
##  1:          1           1         23 -0.13032151
##  2:          1           2         55  0.37080992
##  3:          1           3         23  1.07384923
##  4:          1           4         40  0.27669930
##  5:          1           5         24  0.52051657
##  6:          1           6         39  0.25620505
##  7:          1           7         84  0.29786742
##  8:          1           8         37  0.68225580
##  9:          1           9         56  0.02004459
## 10:          1          10         19 -0.24776899
## 11:          1          11         15  0.15491933
## 12:          1          12          8 -0.08838835
# get aggreg polarity for the whole corpus
system.time({ nokia_senti = sentiment_by(nokia) })  # 0.39 secs
##    user  system elapsed 
##   0.602   0.000   0.380
nokia_senti[1:15,]    # view first 15 rows
##     element_id word_count        sd ave_sentiment
##  1:          1        423 0.3720130    0.26555736
##  2:          2        517 0.3470145    0.38225896
##  3:          3        323 0.2560640    0.25550177
##  4:          4        101 0.1999131    0.36400051
##  5:          5        354 0.3845797    0.35866699
##  6:          6        458 0.2798667    0.07156705
##  7:          7         64 0.2149912    0.03425437
##  8:          8        336 0.3872232    0.34279928
##  9:          9        163 0.2720118    0.06086719
## 10:         10         57 0.4243636    0.35988667
## 11:         11        480 0.2032872    0.25488575
## 12:         12        254 0.3346242    0.14056631
## 13:         13        390 0.2290323    0.47240608
## 14:         14        248 0.2289725    0.25562258
## 15:         15         30 0.2801002    0.28385364
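
A quick hedged check on the output above: which reviews come out most positive and most negative overall?

nokia_senti %>% arrange(desc(ave_sentiment)) %>% head(3)   # 3 most positive reviews
nokia_senti %>% arrange(ave_sentiment) %>% head(3)         # 3 most negative reviews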

1.4.6 Visualizing Sentiment over Doc length

How might sentiment levels vary over the length of a document? Is there an upward trend? A downward trend? No trend? How to know?

What better way, in fact, than to visualize it and see for ourselves?

I am using ggplot2 below; the code is straightforward to follow.

require(ggplot2)
p <- ggplot(a0.df, aes(x = sentence_id, y = sentiment)) +    # sentence position vs sentiment
  geom_smooth(col = "blue", se = FALSE) +                    # loess-smoothed sentiment trend
  geom_hline(yintercept = 0) +                               # neutral-sentiment reference line
  geom_smooth(method = "lm", formula = y ~ x, col = "red", se = FALSE)   # linear trend line
p
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

1.5 === Exercise time. ===

Your task: Find the highest and lowest valence words respectively in the highest and lowest sentiment bearing sentences in each of Nokia’s 120 reviews.

How would you approach this problem?

Based on what we’ve seen so far, I’d suggest the following.

  • Write a small unit func to extract the highest and lowest valence sents for a doc. Remember, a unit func takes in one row as its input (see the sketch after this list).

  • In the unit func, write code to identify the top and bottom valence-laden words and their senti_score in each of the sents extracted.

  • Now write a wrapper func that takes the corpus as input and loops over each doc in the corpus.

  • Pre-define a DF with as many rows as there are docs in the corpus. Populate this DF with output from the unit func in each turn of the loop.

  • Check the output not just for one corpus (say, nokia) but for others also, to detect anomalies and errors for correction.
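
Here is a rough, hedged sketch of such a unit func, using only the sentimentr calls we saw above (function and column names are illustrative; your own version may well differ).

unit_extreme_senti <- function(doc){   # takes one document (a single character string)

  sents    = get_sentences(doc)        # parse the doc into sentences
  senti_df = sentiment(sents)          # sentence-level sentiment scores
  hi = which.max(senti_df$sentiment)   # id of the most positive sentence
  lo = which.min(senti_df$sentiment)   # id of the most negative sentence

  # words and the polarity each contributed, sentence by sentence
  terms = attributes(extract_sentiment_terms(sents))$elements

  hi_word = terms %>% filter(sentence_id == hi) %>% arrange(desc(polarity)) %>% slice(1)
  lo_word = terms %>% filter(sentence_id == lo) %>% arrange(polarity) %>% slice(1)

  data.frame(hi_sentence = sents[[1]][hi], hi_word = hi_word$words, hi_word_score = hi_word$polarity,
             lo_sentence = sents[[1]][lo], lo_word = lo_word$words, lo_word_score = lo_word$polarity,
             stringsAsFactors = FALSE)
  } # unit func ends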

Closing with that for now.

Ciao

Sudhir

1.6 “Cluster An funcs applied to text data”

1.6.1 Introduction

Previously, we saw: * (i) what cluster-an is, * (ii) what cluster-an does.

Now we will see: * (iii) how to find the optimal number of clusters, * (iv) how to interpret the results via tables and/or displays.

Time now to extend that primer wala learning to actual text, then. Some Qs to think about: * [1] What might cluster-an on text data mean? * [2] On what text data objects would you apply it? * [3] What kind of result displays would be most meaningful for interpretation purposes? * [4] Etc.

In what follows, I present a short markdown guide on applying cluster-an principles to text. In the process, I also demonstrate, on a small scale, the use of the source() func, which enables us to read previously created funcs directly into our workspace, and the use of piping to chain functions together into one seamless workflow.

Open in new tab this page: https://github.com/sudhir-voleti/code-chunks/blob/master/cba%20tidytext%20funcs%20for%20git%20upload.R

Behold.

## setup
rm(list=ls())    # clear workspace

# sourcing in our session 1 funcs directly into R
suppressPackageStartupMessages({

  if (!require(tidyverse)) {install.packages("tidyverse")}; library(tidyverse)
  
  source("https://raw.githubusercontent.com/sudhir-voleti/code-chunks/master/cba%20tidytext%20funcs%20for%20git%20upload.R")
  
  })

Notice how the use of source() enabled us to fill our workspace with yesterday's funcs. Consider the possibilities that entails for expanding scope, sharing workflows, and collaborating on complex projects across teams using GitHub's full functionality …

OK, back to cluster-an. First things first: how many clusters are there in the solution? We'll use scree plots to find that out. The func below applies to DTMs.

Another thing about the test data we'll be using here: it is primary data from a 2008 web survey of 5000+ regular shoppers at a mid-sized regional US supermarket chain.

The retailer was launching a store brand line of ice-creams and the survey Q being analyzed asked respondents: > “Which flavors of ice cream do you prefer? Please be as specific as possible”.

Respondents gave varying answers. Some skipped the Q. Below, let us load the corpus & inspect it.

ice_cream = readLines("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/Ice-cream-dataset.txt")

str(ice_cream)
##  chr [1:5165] "Vanilla, chocolate, cookies and cream, seasonal variations" ...
head(ice_cream, 10)
##  [1] "Vanilla, chocolate, cookies and cream, seasonal variations"                                                                                
##  [2] ""                                                                                                                                          
##  [3] "vanilla & chocolate"                                                                                                                       
##  [4] "Chocolate/peanut butter swirl; Moose Tracks"                                                                                               
##  [5] "chocolate chip cookie dough"                                                                                                               
##  [6] "Chocolate, cookies n cream, butter pecan, vanilla fudge.  "                                                                                
##  [7] "all of them my varied family member will eat anything.  A really good rich vanilla is the most important becase that goes with everything."
##  [8] "Chocolate chip cookie dough!!!!, cinnamon, vanilla bean, cake flavored"                                                                    
##  [9] ""                                                                                                                                          
## [10] ""

5000+ rows in the dataset. Some empty. The kind of responses you can see above.

Now I define a func to build scree-plots for cluster-an.

1.6.2 Func 1: Build scree plot to find optimal #clusters

build_kmeans_scree <- function(mydata)  # rows are units, colms are basis variables
{ # Determine number of clusters
set.seed(seed = 0000)   # set seed for reproducible work
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))  # wss is within group sum of squares

for (i in 2:15) wss[i] <- sum(      # checking model fit for 2 to 15 clusters
                            kmeans(mydata,  centers = i)$withinss)  # note use of kmeans() func

plot(1:15, wss, type="b",   # scree plot
         xlab="Number of Clusters",
         ylab="Within groups sum of squares")

 } # func ends

Time now to test-drive the above func on the ice cream corpus. Forward, ahoy!

# pipe together sourced funcs into a single workflow. :)
system.time({ 
  dtm_ice_cream = ice_cream %>% 
                    text.clean() %>% 
                      dtm_build() 
  })  # 1.33 secs
## Joining, by = "word"
##    user  system elapsed 
##   0.428   0.016   0.909
dim(dtm_ice_cream)    # [1] 2208  922
## [1] 2208  922
system.time({ build_kmeans_scree(dtm_ice_cream) })   # 7.53 secs. 

##    user  system elapsed 
##   2.524   0.072   2.590

Again, above, I want to point out the use of the piping operator %>% to chain together distinct funcs.

From the scree plot, we identify the optimal k. Let us say k=8 clusters is optimal (illustrative).

Next, below, we divide the DTM into cluster-wise pieces and display each for interpretation purposes.

1.6.3 Func 2: Build display aids to view text-based clusters

display.clusters <- function(dtm, k)  # k=optimal num of clusters
{ 

  # K-Means Cluster Analysis
  fit <- kmeans(dtm, k) # k cluster solution

 for (i1 in 1:max(fit$cluster)){ 
  # windows()
    dtm_cluster = dtm[(fit$cluster == i1),] 
    distill.cog(dtm_cluster)    } # i1 loop ends

 }  # func ends

 # test driving the func
 system.time({ display.clusters(dtm_ice_cream, 3) }) 

##    user  system elapsed 
##   1.583   0.016   1.592
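
To interpret the clusters beyond the cog displays, the hedged sketch below lists each cluster's most frequent terms directly (it assumes dtm_ice_cream behaves like a numeric matrix, which the kmeans() calls above already require).

set.seed(0)
fit = kmeans(dtm_ice_cream, 3)          # same illustrative k = 3 as in the test-drive above
for (i1 in 1:3){
  clust_dtm = as.matrix(dtm_ice_cream[(fit$cluster == i1), ])   # docs in cluster i1
  cat("\nCluster", i1, "- top 10 terms:\n")
  print(sort(colSums(clust_dtm), decreasing = TRUE)[1:10])      # most frequent terms
  }  # i1 loop ends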

OK. Dassit for this markdown. Back to the slides, class.

Sudhir