Text Analytics
2022-06-18
Section 1 Session 2
1.1 “Sentiment Analysis”
Let me first read in a common dataset for all our sentiment-an examples.
rm(list=ls()) # clear workspace
# IBM Q3 2016 analyst call transcript
x = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/International%20Business%20Machines%20(IBM)%20Q3%202016%20Results%20-%20Earnings%20Call%20Transcript.txt')
length(x) # how many docs in this corpus?
## [1] 219
1.1.1 Using qdap for sentiment analysis
qdap, or the Qualitative Discourse Analysis Package, offers a relatively rigorous sentiment-An approach. However, it is limited to a standard dictionary (from Princeton).
Our main use of sentiment-An in this session shall be with the tidytext library. However, for completeness' sake, let's quickly do qdap first.
suppressPackageStartupMessages({
# loading required libraries
if(!require(qdap)) {install.packages("qdap")} # ensure java is up to date!
if(!require(textdata)) {install.packages("textdata")}
library(qdap)
library(textdata)
})
Use the polarity() func from qdap and store its output into various variables. Try ?qdap::polarity for more detail.
# apply polarity() func from qdap to compute sentiment polarity
t1 = Sys.time() # set timer
pol = qdap::polarity(x) # Calculate the polarity from qdap dictionary
## Warning in qdap::polarity(x):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
wc = pol$all[,2] # Word Count in each doc
val = pol$all[,3] # average polarity score
p = pol$all[,4] # Positive words info
n = pol$all[,5] # Negative Words info
print(round(Sys.time() - t1, 3)) # how much time did the above take?
## Time difference of 2.578 secs
View output variables.
head(pol$all) # see struc of obj 'pol' in Envmt pane first
## all wc polarity pos.words neg.words
## 1 all 6 0.0000000 - -
## 2 all 5 0.0000000 - -
## 3 all 3 0.0000000 - -
## 4 all 1 0.0000000 - -
## 5 all 6 -0.4082483 - vice
## 6 all 9 -0.3333333 - vice
## text.var
## 1 International Business Machines Corporation. (NYSE:IBM)
## 2 Q3 2016 Results Earnings Conference Call
## 3 October 17, 2016, 05:00 PM ET
## 4 Executives
## 5 Patricia Murphy - Vice President-Investor Relations
## 6 Martin Schroeter - Senior Vice President and Chief Financial Officer
Summarize polarity scores for the corpus
head(pol$group)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 219 10275 0.1032166 0.2454903 0.4204508
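Note, in passing, that the stan.mean.polarity column is simply ave.polarity divided by sd.polarity. A one-line sanity check (my own addition, using the pol object from above):
with(pol$group, ave.polarity / sd.polarity) # ~0.42, matches stan.mean.polarity above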
What positive words were there in the corpus?
positive_words = unique(setdiff(unlist(p),"-")) # Positive words list, do ?dplyr::setdiff
print(positive_words) # Print all the positive words found in the corpus
## [1] "thank" "like" "welcome" "available"
## [5] "reform" "reconciliation" "stability" "strong"
## [9] "top" "consistent" "led" "solid"
## [13] "modest" "good" "progress" "best"
## [17] "well" "innovation" "success" "helping"
## [21] "leverage" "contribution" "easy" "intelligence"
## [25] "better" "trusted" "support" "smart"
## [29] "work" "improvements" "improved" "gains"
## [33] "meaningful" "effective" "benefit" "free"
## [37] "improvement" "gain" "capability" "correct"
## [41] "wins" "improve" "compliant" "efficient"
## [45] "trusting" "right" "great" "protect"
## [49] "leading" "empower" "faster" "important"
## [53] "secure" "agile" "significant" "easier"
## [57] "advanced" "stable" "supporting" "win"
## [61] "flexible" "guidance" "innovative" "supports"
## [65] "pretty" "successful" "straightforward" "appreciate"
## [69] "helped" "enough" "terrific" "won"
## [73] "confident" "improving" "positive" "accurate"
## [77] "overtakes" "reasonable" "consistently" "wonder"
## [81] "regard" "appeal" "benefits" "sustainability"
## [85] "tougher" "interesting" "freed" "freedom"
## [89] "sustainable" "appreciated" "trust" "excited"
We can use a wordcloud, barchart, or other display aids as required on the list above.
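For instance, here is a minimal sketch of a frequency barplot of the positive words (my own illustrative addition; it assumes 'p' from the polarity() output above is still in the workspace):
# count how often each positive word occurs across the corpus
pos_counts = table(unlist(p)[unlist(p) != "-"])
barplot(sort(pos_counts, decreasing = TRUE)[1:20], # top 20 positive words
        las = 2, cex.names = 0.7, main = "Top positive words (qdap)")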
What negative words were there in the corpus?
negative_words = unique(setdiff(unlist(n),"-")) # Negative words list
print(negative_words) # Print all neg words
## [1] "vice" "volatility" "cloud" "decline"
## [5] "drones" "critical" "risk" "cancer"
## [9] "uncertain" "weaker" "gross" "set up"
## [13] "chronic" "risks" "latency" "object"
## [17] "declining" "drag" "debt" "declines"
## [21] "oppose" "bump" "negative" "indiscernible"
## [25] "volatile" "errors" "inaccuracies"
While qdap is nice and easy, it has its limitations.
E.g., it uses an inbuilt wordlist that cannot be customized for particular contexts.
Another is that it does exact matching and so misses minor variations in sentiment words. It also suffers from the old problems of synonymy (i.e., multiple words that have similar meanings) and polysemy (words that have more than one meaning).
While these are long-standing problems that cannot be eliminated in the bag-of-words approach, more flexibility would at least enable substantial mitigation.
1.1.2 Sentiment-An with Tidytext
There are 3 inbuilt sentiment dictionaries as of now in tidytext, with a fourth under development.
To briefly see what these are, return to the slides.
The nice thing about tidytext is that while there are 3 inbuilt sentiment dictionaries in different output formats for convenience, we can also create and customize our own wordlists as needed.
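To make that point concrete, here is a minimal sketch of a custom, domain-specific lexicon (my own illustrative wordlist, assuming the corpus 'x' read in above; note how 'cloud', negative in the standard lexicons, can be treated as positive in an IBM earnings-call context):
suppressPackageStartupMessages({library(dplyr); library(tibble); library(tidytext)})
my_lexicon = tibble(word = c("cloud", "cognitive", "decline", "headwinds"),
                    sentiment = c("positive", "positive", "negative", "negative"))
tibble(text = x) %>%
  unnest_tokens(word, text) %>%
  inner_join(my_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)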
suppressPackageStartupMessages({
require(tidytext)
require(tidyr)
require(dplyr)
})
The tidytext package contains all three sentiment lexicons in the sentiments dataset.
# over 27k words sentiment-ized and scored across the 3 lexicons
sentiments = read.csv("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/sentiments.csv")
head(sentiments, 10)
## X word sentiment lexicon score
## 1 1 abacus trust nrc NA
## 2 2 abandon fear nrc NA
## 3 3 abandon negative nrc NA
## 4 4 abandon sadness nrc NA
## 5 5 abandoned anger nrc NA
## 6 6 abandoned fear nrc NA
## 7 7 abandoned negative nrc NA
## 8 8 abandoned sadness nrc NA
## 9 9 abandonment anger nrc NA
## 10 10 abandonment fear nrc NA
But the inbuilt dataset is now incomplete as one lexicon (‘NRC’) was withdrawn after objections from the lexicon’s creators. Strictly for academic purposes only, I’m using a saved version of that lexicon.
# sentiments = read.csv("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/sentiments.csv")
sentiments %>%
  filter(lexicon == "AFINN") %>%
  head()
## X word sentiment lexicon score
## 1 20690 abandon <NA> AFINN -2
## 2 20691 abandoned <NA> AFINN -2
## 3 20692 abandons <NA> AFINN -2
## 4 20693 abducted <NA> AFINN -2
## 5 20694 abduction <NA> AFINN -2
## 6 20695 abductions <NA> AFINN -2
Let’s start simple, with Bing.
textdf = data_frame(text = x) # convert to data frame
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
= get_sentiments("bing") # put all of the bing sentiment dict into object 'bing'
bing
head(bing, 10) # view bing object
## # A tibble: 10 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
Which docs are most positive and negative in the corpus?
senti.bing = textdf %>%
  mutate(linenumber = seq(1:nrow(textdf))) %>% # build line num variable
  ungroup() %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, index = linenumber %/% 1, sort = FALSE) %>%
  # mutate(index = row_number()) %>%
  mutate(method = "bing") # creates a column with method name
## Joining, by = "word"
head(senti.bing, 15)
## # A tibble: 15 × 4
## sentiment index n method
## <chr> <dbl> <int> <chr>
## 1 negative 5 1 bing
## 2 negative 6 1 bing
## 3 negative 19 2 bing
## 4 negative 25 1 bing
## 5 negative 26 2 bing
## 6 negative 27 1 bing
## 7 negative 28 3 bing
## 8 negative 29 1 bing
## 9 negative 31 1 bing
## 10 negative 32 1 bing
## 11 negative 34 5 bing
## 12 negative 35 1 bing
## 13 negative 36 3 bing
## 14 negative 38 3 bing
## 15 negative 40 1 bing
Now let’s see the distribution of positive and negative sentiment within documents across the corpus.
Note the use of the spread() function to combine the extra rows pertaining to each index (doc) into extra columns. Do ?tidyr::spread for more info.
bing_df = data.frame(senti.bing %>% spread(sentiment, n, fill = 0))
head(bing_df)
## index method negative positive
## 1 5 bing 1 0
## 2 6 bing 1 0
## 3 19 bing 2 3
## 4 20 bing 0 1
## 5 21 bing 0 2
## 6 22 bing 0 1
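An aside I'll add here: in current tidyr, spread() is superseded by pivot_wider(). A hedged equivalent of the step above (same result, newer idiom):
bing_df2 = senti.bing %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  data.frame()
head(bing_df2)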
Can we combine the negative and positive rows, by say, subtracting negative from positive score and thereby computing some polarity score for each line?
Yes, see code below.
bing_pol = bing_df %>%
  mutate(polarity = (positive - negative)) %>% # create variable polarity = pos - neg
  arrange(desc(polarity)) # sort by polarity

bing_pol %>% head()
## index method negative positive polarity
## 1 212 bing 0 9 9
## 2 110 bing 0 8 8
## 3 154 bing 1 9 8
## 4 160 bing 0 8 8
## 5 129 bing 1 7 6
## 6 45 bing 0 5 5
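The table above shows the most positive docs; sorting ascending instead (a one-liner I'm adding, not run above) would surface the most negative ones:
bing_pol %>% arrange(polarity) %>% head() # most negative docs in the corpus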
Now for some quick visualization of the distribution of sentiment across the analyst call. See code below.
require(ggplot2)
## Loading required package: ggplot2
# plotting running sentiment distribution across the analyst call
ggplot(bing_pol,
aes(index, polarity)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(title = "Sentiment in IBM analyst call corpus",
x = "doc",
y = "Sentiment")
Another quick visualization. We want to see which words contributed most to positive or neg sentiment in the corpus using the bing lexicon.
So first we create a count of bing sentiment words that occur a lot in the corpus.
bing_word_counts <- textdf %>%
  unnest_tokens(word, text) %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>% head(., 10)
## # A tibble: 10 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 cloud negative 60
## 2 thank positive 22
## 3 good positive 21
## 4 well positive 19
## 5 free positive 17
## 6 like positive 14
## 7 strong positive 12
## 8 right positive 11
## 9 gross negative 8
## 10 important positive 8
Now ggplot it and see.
bing_word_counts %>%
  filter(n > 3) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment")
And then there’re wordclouds.
require(wordcloud)
## Loading required package: wordcloud
## Loading required package: RColorBrewer
suppressWarnings({
  # build wordcloud of commonest tokens
  textdf %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(word) %>%
    with(wordcloud(word, n, max.words = 100))
})
## Joining, by = "word"
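A related display I'll sketch here (my own addition; it assumes the reshape2 package for the acast() reshaping step) is a comparison cloud contrasting positive vs negative bing words:
suppressPackageStartupMessages(library(reshape2))
bing_word_counts %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "darkgreen"), max.words = 100)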
1.1.3 Sentiment-An with AFINN
# get AFINN first
AFINN <- sentiments %>% filter(lexicon == "AFINN") # get_sentiments("AFINN")
head(AFINN, 10)
## X word sentiment lexicon score
## 1 20690 abandon <NA> AFINN -2
## 2 20691 abandoned <NA> AFINN -2
## 3 20692 abandons <NA> AFINN -2
## 4 20693 abducted <NA> AFINN -2
## 5 20694 abduction <NA> AFINN -2
## 6 20695 abductions <NA> AFINN -2
## 7 20696 abhor <NA> AFINN -3
## 8 20697 abhorred <NA> AFINN -3
## 9 20698 abhorrent <NA> AFINN -3
## 10 20699 abhors <NA> AFINN -3
# inner join AFINN words and scores with text tokens from corpus
senti.afinn = textdf %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  inner_join(AFINN) %>% # returns only intersection of wordlists and all columns
  group_by(index = linenumber %/% 1) %>%
  summarise(sentiment = sum(score)) %>%
  mutate(method = "afinn")
## Joining, by = "word"
senti.afinn %>% head(., 10)
## # A tibble: 10 × 3
## index sentiment method
## <dbl> <int> <chr>
## 1 19 6 afinn
## 2 20 1 afinn
## 3 21 0 afinn
## 4 22 1 afinn
## 5 25 11 afinn
## 6 26 7 afinn
## 7 27 6 afinn
## 8 28 14 afinn
## 9 29 9 afinn
## 10 30 6 afinn
data.frame(senti.afinn) %>% head()
## index sentiment method
## 1 19 6 afinn
## 2 20 1 afinn
## 3 21 0 afinn
## 4 22 1 afinn
## 5 25 11 afinn
## 6 26 7 afinn
Can we combine sentiment-An with bigrams?
E.g., Suppose you want to list bigrams where first word is a sentiment word. See code below.
# first construct and split bigrams into word1 and word2
ibm_bigrams_separated <- textdf %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

ibm_bigrams_separated %>% head(., 10)
## # A tibble: 10 × 2
## word1 word2
## <chr> <chr>
## 1 international business
## 2 business machines
## 3 machines corporation
## 4 corporation nyse:ibm
## 5 q3 2016
## 6 2016 results
## 7 results earnings
## 8 earnings conference
## 9 conference call
## 10 october 17
Next, inner join with AFINN
# examine the most frequent bigrams whose first word is a sentiment word
senti_bigrams <- ibm_bigrams_separated %>%
  # word1 from bigrams and word from AFINN
  inner_join(AFINN, by = c(word1 = "word")) %>%
  ungroup()

senti_bigrams %>% head(., 10)
## # A tibble: 10 × 6
## word1 word2 X sentiment lexicon score
## <chr> <chr> <int> <chr> <chr> <int>
## 1 thank you 22911 <NA> AFINN 2
## 2 like to 22127 <NA> AFINN 2
## 3 welcome you 23108 <NA> AFINN 2
## 4 prepared remarks 22444 <NA> AFINN 1
## 5 certain comments 21055 <NA> AFINN 1
## 6 litigation reform 22133 <NA> AFINN -1
## 7 certain non 21055 <NA> AFINN 1
## 8 thanks patricia 22913 <NA> AFINN 2
## 9 share as 22691 <NA> AFINN 1
## 10 like banking 22127 <NA> AFINN 2
Filter out stopwords and then do inner join with AFINN.
# what if we want sentiment associated with proper words, not stopwords?
senti_bigrams_filtered = ibm_bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  inner_join(AFINN, by = c(word1 = "word")) %>% # word1 is from bigrams and word from AFINN
  ungroup()

senti_bigrams_filtered %>% head(., 10)
## # A tibble: 10 × 6
## word1 word2 X sentiment lexicon score
## <chr> <chr> <int> <chr> <chr> <int>
## 1 prepared remarks 22444 <NA> AFINN 1
## 2 litigation reform 22133 <NA> AFINN -1
## 3 strong growth 22835 <NA> AFINN 2
## 4 strong growth 22835 <NA> AFINN 2
## 5 solid recurring 22753 <NA> AFINN 2
## 6 solution software 22755 <NA> AFINN 1
## 7 strong performance 22835 <NA> AFINN 2
## 8 strong revenue 22835 <NA> AFINN 2
## 9 strong growth 22835 <NA> AFINN 2
## 10 easy hybrid 21497 <NA> AFINN 1
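One more bigram trick worth sketching here (my own addition, in the same spirit as the above): check how often a sentiment word is preceded by "not", since plain unigram scoring misses such negations.
# sentiment words whose preceding word is "not"
not_words = ibm_bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE)
head(not_words)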
And so on.
Heading finally to the NRC dictionary.
1.1.4 Sentiment-An with the NRC dictionary
# view nrc dict structure
nrc = sentiments %>% filter(lexicon == "nrc") # get_sentiments("nrc")
nrc %>% head(., 10)
## X word sentiment lexicon score
## 1 1 abacus trust nrc NA
## 2 2 abandon fear nrc NA
## 3 3 abandon negative nrc NA
## 4 4 abandon sadness nrc NA
## 5 5 abandoned anger nrc NA
## 6 6 abandoned fear nrc NA
## 7 7 abandoned negative nrc NA
## 8 8 abandoned sadness nrc NA
## 9 9 abandonment anger nrc NA
## 10 10 abandonment fear nrc NA
nrc1 = nrc[, c(2,3)]; head(nrc1)
## word sentiment
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
# use NRC lexicon now
senti.nrc = textdf %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  inner_join(nrc1) %>%
  count(sentiment, index = linenumber %/% 1, sort = FALSE) %>% # %/% gives quotient
  mutate(method = "nrc")
## Joining, by = "word"
senti.nrc %>% head()
## # A tibble: 6 × 4
## sentiment index n method
## <chr> <dbl> <int> <chr>
## 1 anger 13 1 nrc
## 2 anger 25 1 nrc
## 3 anger 31 1 nrc
## 4 anger 38 1 nrc
## 5 anger 40 1 nrc
## 6 anger 42 1 nrc
# make a neat table out of the 8 emotion dimensions
a = data.frame(senti.nrc %>% spread(sentiment, n, fill = 0))
head(a)
## index method anger anticipation disgust fear joy negative positive sadness
## 1 1 nrc 0 0 0 0 0 0 1 0
## 2 5 nrc 0 0 0 0 0 1 1 0
## 3 6 nrc 0 0 0 0 0 1 2 0
## 4 13 nrc 1 0 1 1 0 1 0 1
## 5 19 nrc 0 0 0 0 0 2 3 0
## 6 20 nrc 0 3 0 0 0 0 1 0
## surprise trust
## 1 0 1
## 2 0 1
## 3 0 2
## 4 0 1
## 5 0 3
## 6 0 1
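With the emotion dimensions laid out as columns, we can plot how any of them evolves over the call. A minimal sketch I'm adding (illustrative choice of 'trust' vs 'fear'):
require(ggplot2)
ggplot(a, aes(x = index)) +
  geom_smooth(aes(y = trust), col = "darkgreen", se = FALSE) +
  geom_smooth(aes(y = fear), col = "red", se = FALSE) +
  labs(x = "doc", y = "NRC emotion word count")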
Suppose you want to see what joyful words most occurred in the corpus.
ibm_joy = textdf %>%
  unnest_tokens(word, text) %>%
  inner_join(nrc) %>%
  filter(sentiment == "joy") %>%
  count(word, sort = TRUE)
## Joining, by = "word"
ibm_joy %>% head()
## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 cash 25
## 2 income 24
## 3 good 21
## 4 kind 10
## 5 improvement 8
## 6 grow 7
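Before moving on, a hedged sketch I'm adding to compare the bing and AFINN scorings doc-by-doc (both senti.bing and senti.afinn carry a 'method' column precisely for this kind of comparison):
require(dplyr); require(tidyr); require(ggplot2)
bing_net = senti.bing %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  select(index, sentiment, method)
bind_rows(bing_net, senti.afinn) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")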
1.2 === Exercise ====
Q: Can we functionize common workflows in tidytext sentiment-an?
Say I ask you to write a func that * [1] takes an input corpus, * [2] applies a sentiment lexicon to it, * [3] outputs a DF with a sentiment score for each document and * [4] an associated visualization?
Q: Can we run such a func on the Nokia dataset? Potential homework?
Well, here below is my attempt at writing a func for the afinn
lexicon. You can write your own funcs for different lexica and they will yield different outputs.
Behold.
Afinn_Sentimt <- function(corpus0){

  textdf = tibble(text = corpus0) # make DF out of corpus first

  # build DF with doc index, sentimt & score
  a0 = textdf %>%
    mutate(linenumber = seq(1:nrow(textdf))) %>% # build line num variable
    ungroup() %>% unnest_tokens(word, text) %>%
    inner_join(get_sentiments("afinn")) %>% rename(sentiment = value)

  # for wordcloud output
  a1 = a0 %>% group_by(word) %>%
    summarise(senti_score = sum(sentiment), n = n()) %>%
    with(wordcloud(word, n, max.words = 200)) # plots wordcloud

  # for dataset output
  a2 = a0 %>%
    count(sentiment, index = linenumber %/% 1, sort = FALSE) %>%
    mutate(method = "afinn") %>% # creates a column with method name
    spread(sentiment, n, fill = 0)

  return(a2) } # func ends
Time to test the func we wrote above.
nokia = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')
require(stringr)
## Loading required package: stringr
nokia = str_replace_all(nokia, "<.*?>", "") # get rid of html junk
# test drive the func
system.time({a3 = Afinn_Sentimt(nokia)})
## Joining, by = "word"
## user system elapsed
## 0.359 0.008 0.309
a3
## # A tibble: 115 × 11
## index method `-4` `-3` `-2` `-1` `1` `2` `3` `4` `5`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 afinn 0 1 0 1 9 4 7 2 0
## 2 2 afinn 0 0 2 1 5 5 9 0 0
## 3 3 afinn 0 0 2 0 1 4 4 5 0
## 4 4 afinn 0 0 0 0 0 0 2 0 0
## 5 5 afinn 0 0 2 1 3 8 9 0 0
## 6 6 afinn 0 0 6 7 4 3 5 0 0
## 7 7 afinn 0 0 3 0 0 0 2 0 0
## 8 8 afinn 0 0 1 0 5 4 12 1 0
## 9 9 afinn 0 0 1 0 1 1 2 0 0
## 10 10 afinn 0 0 0 0 0 0 1 0 1
## # … with 105 more rows
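A small follow-up I'm adding: the per-score-value columns above can be collapsed into a single doc-level polarity number (assuming exactly the score columns shown; a corpus containing -5 words would need that column added too):
a3 %>%
  mutate(polarity = -4*`-4` - 3*`-3` - 2*`-2` - 1*`-1` +
                     1*`1` + 2*`2` + 3*`3` + 4*`4` + 5*`5`) %>%
  select(index, polarity) %>%
  arrange(desc(polarity)) %>% head()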
Chalo, dassall for now. Back to the slides.
Sudhir
1.3 “Sentiment An with SentimentR”
Let me first read in a common dataset for all our sentiment-an examples.
# IBM Q3 2016 analyst call transcript
x = readLines('https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/International%20Business%20Machines%20(IBM)%20Q3%202016%20Results%20-%20Earnings%20Call%20Transcript.txt')

textdf = as.data.frame(x)
head(textdf)
## x
## 1 International Business Machines Corporation. (NYSE:IBM)
## 2 Q3 2016 Results Earnings Conference Call
## 3 October 17, 2016, 05:00 PM ET
## 4 Executives
## 5 Patricia Murphy - Vice President-Investor Relations
## 6 Martin Schroeter - Senior Vice President and Chief Financial Officer
First let us load the required libraries
suppressPackageStartupMessages({
# loading required libraries
if(!require(sentimentr)) { install.packages("sentimentr")} # install if not already present
library(sentimentr)
})
Let's evaluate the sentiment in those sentences.
# Break each document into sentences. Then average sentiment of all sents in a doc.
out <- sentiment_by(get_sentences(textdf))
head(out, 10)
## element_id word_count sd ave_sentiment
## 1: 1 6 0.08838835 -0.06821079
## 2: 2 5 NA 0.11180340
## 3: 3 3 NA 0.00000000
## 4: 4 1 NA 0.00000000
## 5: 5 6 NA -0.20412415
## 6: 6 9 NA -0.16666667
## 7: 7 1 NA 0.00000000
## 8: 8 3 NA 0.00000000
## 9: 9 4 NA 0.00000000
## 10: 10 4 NA 0.00000000
Let's plot the sentiment output for better visualization.
plot(out)
If there exist other variables which we can use to group documents together and evaluate sentiment for the corpus, we can do so by passing a "by" value to the sentiment_by() function. By default it groups by the element_id. To illustrate this, we will use the presidential_debates_2012 data in R.
out <- with(
  presidential_debates_2012,
  sentiment_by(
    get_sentences(dialogue),
    list(person)
  )
)

out
## person word_count sd ave_sentiment
## 1: OBAMA 18319 0.2488916 0.10204767
## 2: ROMNEY 19924 0.2444782 0.09611597
## 3: CROWLEY 1672 0.2181662 0.19455290
## 4: LEHRER 765 0.2973360 0.15473364
## 5: QUESTION 583 0.1756778 0.03197751
## 6: SCHIEFFER 1445 0.2345187 0.08843478
plot(out)
Plotting by sentiment value in a sentence
plot(uncombine(out))
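The grouping isn't limited to a single variable either; a hedged extension I'm adding below groups by both speaker and debate (this uses the 'time' column documented for presidential_debates_2012):
out2 <- with(
  presidential_debates_2012,
  sentiment_by(
    get_sentences(dialogue),
    list(person, time)
  )
)
head(out2)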
1.4 “Valence Shifters for Sentiment-An”
1.4.1 Sentiment analysis with sentimentR
Loading required libraries
rm(list=ls())
suppressPackageStartupMessages({
if (!require(sentimentr)) {install.packages("sentimentr")} # install if not already present
library(sentimentr)
library(dplyr)
library(ggplot2) # for visualizing sentimt variation
})
1.4.1.1 Sentiment scoring using valence shifters
sentimentr uses valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) to compute the score. These shift the meaning of a sentence but are not taken into account by the usual dictionary-based packages.
Behold a simple example.
<- "I like to learn new things"
sent1 <- "I not so much like to learn new things"
sent2
sentiment(sent1)
## element_id sentence_id word_count sentiment
## 1: 1 1 6 0.8573214
sentiment(sent2)
## element_id sentence_id word_count sentiment
## 1: 1 1 9 0.3933333
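Where do these numbers come from? Roughly, sentimentr sums the (valence-shifted) polarities of the sentiment words in a sentence and divides by the square root of the sentence word count. A quick check I'm adding, using the -0.75 weight the default lexicon assigns to "hate" (visible in the extract_sentiment_terms() output further below):
-0.75 / sqrt(3) # = -0.4330127, matching sentiment("I hate it.") below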
1.4.2 Valence Shifters in action
Consider a list of simple-ish sentences and how sentimentr
treats the various valence shifters contained therein.
To see what valence shifters are available in the lexicon
package, run the following:
require(lexicon)
## Loading required package: lexicon
##
## Attaching package: 'lexicon'
## The following object is masked from 'package:sentimentr':
##
## available_data
hash_valence_shifters
## x y
## 1: absolutely 2
## 2: acute 2
## 3: acutely 2
## 4: ain't 1
## 5: aint 1
## ---
## 136: whereas 4
## 137: won't 1
## 138: wont 1
## 139: wouldn't 1
## 140: wouldnt 1
hash_valence_shifters[y==1][1:8] # to see top 8 negators
## x y
## 1: ain't 1
## 2: aint 1
## 3: aren't 1
## 4: arent 1
## 5: can't 1
## 6: cannot 1
## 7: cant 1
## 8: couldn't 1
hash_valence_shifters[y==2][1:8] # to see amplifiers wala list
## x y
## 1: absolutely 2
## 2: acute 2
## 3: acutely 2
## 4: certain 2
## 5: certainly 2
## 6: colossal 2
## 7: colossally 2
## 8: considerably 2
hash_valence_shifters[y==3][1:8] # for de-amplifiers list
## x y
## 1: almost 3
## 2: barely 3
## 3: faintly 3
## 4: few 3
## 5: hardly 3
## 6: incredibly 3
## 7: kind of 3
## 8: kinda 3
hash_valence_shifters[y==4] # for adversative conjunctions
## x y
## 1: although 4
## 2: but 4
## 3: despite all that 4
## 4: despite all this 4
## 5: despite that 4
## 6: despite this 4
## 7: however 4
## 8: that being said 4
## 9: whereas 4
sentiment("I like it.") # simple sent
## element_id sentence_id word_count sentiment
## 1: 1 1 3 0.2886751
sentiment("I don't like it.") # negation
## element_id sentence_id word_count sentiment
## 1: 1 1 4 -0.25
sentiment("I hate it.") # simple sent
## element_id sentence_id word_count sentiment
## 1: 1 1 3 -0.4330127
sentiment("I don't hate it.") # negation
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.375
sentiment("But I don't hate it.") # negation with adverserial conjunction
## element_id sentence_id word_count sentiment
## 1: 1 1 5 0.7546729
1.4.2.1 De-amplifiers or downtoners
I’d expect de-amplifiers like may, somewhat, sort of etc. to dampen the sentiment score of a sentence. Let’s see whether and to what extent that happens.
sentiment("I not like it.") # another negation
## element_id sentence_id word_count sentiment
## 1: 1 1 4 -0.25
sentiment("I may not like it.") # negation with a de-amplifier
## element_id sentence_id word_count sentiment
## 1: 1 1 5 -0.2236068
sentiment("I almost don't like it.") # negation with a de-amplifier
## element_id sentence_id word_count sentiment
## 1: 1 1 5 -0.04472136
sentiment("I hardly like it.") # de-amplifier or downtoner
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.05
sentiment("I sort of like it.") # more de-amplifier or downtoner
## element_id sentence_id word_count sentiment
## 1: 1 1 5 0
sentiment("I somewhat like it.") # more de-amplifier or downtoner
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.05
1.4.2.2 Amplifiers or intensifiers
Amplifiers like ‘really’, ‘absolutely’, ‘surely’ etc. will intensify the valence in a sentence.
sentiment("I really like it.") # amplifier or intensifier
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.45
sentiment("I never like it.") # negation + amplifier
## element_id sentence_id word_count sentiment
## 1: 1 1 4 -0.25
sentiment("I sure like it.") # straight amplifier
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.45
1.4.2.3 Repeat negations
sentiment("I'm not happy.") # single negation
## element_id sentence_id word_count sentiment
## 1: 1 1 3 -0.4330127
sentiment("I'm not unhappy.") # double negation
## element_id sentence_id word_count sentiment
## 1: 1 1 3 0.4330127
sentiment("I don't feel not unhappy.") # triple negation!
## element_id sentence_id word_count sentiment
## 1: 1 1 5 -0.3354102
1.4.2.4 Adversative conjunctions
sentiment("But I don't like it.") # Adversative conjunction + negation
## element_id sentence_id word_count sentiment
## 1: 1 1 5 -0.5031153
sentiment("I didn't like it but now I like it.") # adversative conjunction
## element_id sentence_id word_count sentiment
## 1: 1 1 9 0.3748333
sentiment("I didn't like it and now I like it.") # non-adversative conjunction
## element_id sentence_id word_count sentiment
## 1: 1 1 9 0
So what all did we see above?
1.4.3 Analyzing groups of sentences
Sometimes we want to analyse sentences as a group rather than one at a time.
We can use the sentiment_by() function to group the sentences according to a criterion.
It averages the sentiment across the sentences in each group and outputs an aggregated value.
<- "I hate reading. But I love comic books."
sent3
sentiment(sent3)
## element_id sentence_id word_count sentiment
## 1: 1 1 3 -0.3752777
## 2: 1 2 5 0.7546729
sentiment_by(sent3)
## element_id word_count sd ave_sentiment
## 1: 1 8 0.7989957 0.1896976
1.4.4 Extracting words from score calculation
We can extract the words which were used for polarity calculation and also the polarity they contributed.
sent3 # view sent3 again
## [1] "I hate reading. But I love comic books."
words <- extract_sentiment_terms(sent3)
attributes(extract_sentiment_terms(sent3))$elements
## element_id sentence_id words polarity
## 1: 1 1 i 0.00
## 2: 1 1 hate -0.75
## 3: 1 1 reading 0.10
## 4: 1 2 but 0.00
## 5: 1 2 i 0.00
## 6: 1 2 love 0.75
## 7: 1 2 comic 0.00
## 8: 1 2 books 0.00
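sentimentr also ships a highlight() helper that renders sentences shaded by their polarity in an HTML page (it opens in the browser, so it is not run in this markdown); a hedged one-liner:
sentiment_by(get_sentences(sent3)) %>% highlight()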
1.4.5 Applying to Nokia dataset
How about we apply this to a size-able dataset and not just a few small sentences here and there, eh?
Plan next is to apply the above to the Nokia Lumia reviews dataset.
nokia = readLines('https://github.com/sudhir-voleti/sample-data-sets/raw/master/text%20analysis%20data/amazon%20nokia%20lumia%20reviews.txt')

review1 = nokia[1] # try first review
sentiment_by(review1) # get aggregate polarity for the whole review
## element_id word_count sd ave_sentiment
## 1: 1 423 0.372013 0.2655574
# get polarity by indiv sentence
a0.df = review1 %>%
  get_sentences() %>% # inbuilt sentence-parsing
  sentiment() # sentiment for each sentence parsed
a0.df
## element_id sentence_id word_count sentiment
## 1: 1 1 23 -0.13032151
## 2: 1 2 55 0.37080992
## 3: 1 3 23 1.07384923
## 4: 1 4 40 0.27669930
## 5: 1 5 24 0.52051657
## 6: 1 6 39 0.25620505
## 7: 1 7 84 0.29786742
## 8: 1 8 37 0.68225580
## 9: 1 9 56 0.02004459
## 10: 1 10 19 -0.24776899
## 11: 1 11 15 0.15491933
## 12: 1 12 8 -0.08838835
# get aggreg polarity for the whole corpus
system.time({ nokia_senti = sentiment_by(nokia) }) # 0.39 secs
## user system elapsed
## 0.602 0.000 0.380
nokia_senti[1:15,] # view first 15 rows
## element_id word_count sd ave_sentiment
## 1: 1 423 0.3720130 0.26555736
## 2: 2 517 0.3470145 0.38225896
## 3: 3 323 0.2560640 0.25550177
## 4: 4 101 0.1999131 0.36400051
## 5: 5 354 0.3845797 0.35866699
## 6: 6 458 0.2798667 0.07156705
## 7: 7 64 0.2149912 0.03425437
## 8: 8 336 0.3872232 0.34279928
## 9: 9 163 0.2720118 0.06086719
## 10: 10 57 0.4243636 0.35988667
## 11: 11 480 0.2032872 0.25488575
## 12: 12 254 0.3346242 0.14056631
## 13: 13 390 0.2290323 0.47240608
## 14: 14 248 0.2289725 0.25562258
## 15: 15 30 0.2801002 0.28385364
1.4.6 Visualizing Sentiment over Doc length
How might sentiment levels vary over the length of a document? Is there an upward trend? A downward trend? No trend at all? How to know?
What better way, in fact, than to visualize the same and see for ourselves?
Am using ggplot2 below. Code is straightforward to follow.
require(ggplot2)
p <- ggplot(a0.df, aes(x = sentence_id, y = sentiment)) +
  geom_smooth(col="blue", se=FALSE) + geom_hline(yintercept=0) +
  geom_smooth(method="lm", formula=y~x, col="red", se=FALSE)
p
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
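The same idea scales up to the whole corpus; a sketch I'm adding that plots each review's average sentiment (from nokia_senti above) against its position in the file:
ggplot(nokia_senti, aes(x = element_id, y = ave_sentiment)) +
  geom_point(alpha = 0.5) + geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE, col = "blue")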
1.5 === Exercise time. ===
Your task: Find the highest and lowest valence words respectively in the highest and lowest sentiment bearing sentences in each of Nokia’s 120 reviews.
How would you approach this problem?
Based on what we’ve seen so far, I’d suggest the following.
Write a small unit func to extract the highest and lowest valence sentences for a doc. Remember, a unit func takes one row (doc) as its input.
In the unit func, write code to identify the top and bottom valence-laden words and their senti_score in each of the sentences extracted.
Now write a wrapper func that takes the corpus as input and loops over each doc in the corpus.
Pre-define a DF with as many rows as there are docs in the corpus. Populate this DF with output from the unit func in each turn of the loop.
Check output not just for one corpus (say, nokia) but others also, to detect anomalies and errors for correction. A rough skeleton sketch follows below.
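Here is that skeleton, strictly an untested outline under the plan above (the names unit_func and nokia_extremes are mine, and edge cases such as empty reviews are ignored):
unit_func <- function(doc){ # doc = one review (a character scalar)
  sents <- sentiment(get_sentences(doc)) # per-sentence polarity
  hi <- which.max(sents$sentiment) # most positive sentence id
  lo <- which.min(sents$sentiment) # most negative sentence id
  terms <- attributes(extract_sentiment_terms(get_sentences(doc)))$elements
  hi_word <- terms %>% filter(sentence_id == hi) %>% slice_max(polarity, n = 1, with_ties = FALSE)
  lo_word <- terms %>% filter(sentence_id == lo) %>% slice_min(polarity, n = 1, with_ties = FALSE)
  data.frame(hi_sent_id = hi, hi_word = hi_word$words,
             lo_sent_id = lo, lo_word = lo_word$words)
}

# wrapper: loop over each doc in the corpus and stack the rows
nokia_extremes <- do.call(rbind, lapply(nokia, unit_func))
head(nokia_extremes)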
Closing with that for now.
Ciao
Sudhir
1.6 “Cluster An funcs applied to text data”
1.6.1 Introduction
Previously, we saw: * (i) what cluster-an is, * (ii) what cluster-an does.
Now we will see: * (iii) how to find the optimal number of clusters, * (iv) how to interpret the results via tables and/or displays.
Time now to extend that primer wala learning to actual text, then. Some Qs to think about: * [1] What might cluster-an on text data mean? * [2] On what text data objects would you apply it? * [3] What kind of result displays would be most meaningful for interpretation purposes? * [4] Etc.
In what follows, I present a short markdown guide on applying cluster-an principles to text. In the process, I also demonstrate, on a small scale, the use of the source() func, which enables us to read previously created funcs directly into our workspace, and the use of piping to chain functions together into one seamless workflow.
Open this page in a new tab: https://github.com/sudhir-voleti/code-chunks/blob/master/cba%20tidytext%20funcs%20for%20git%20upload.R
Behold.
## setup
rm(list=ls()) # clear workspace
# sourcing in our session 1 funcs directly into R
suppressPackageStartupMessages({
if (!require(tidyverse)) {install.packages("tidyverse")}; library(tidyverse)
source("https://raw.githubusercontent.com/sudhir-voleti/code-chunks/master/cba%20tidytext%20funcs%20for%20git%20upload.R")
})
Notice how the use of source() enabled us to fill our workspace with yesterday's funcs. Consider the possibilities that entails for expanding scope, sharing workflows and collaborating on complex projects across teams using github's full functionality …
OK, back to cluster-an. First things first. How many clusters are there in the solution? We'll use scree plots to find that out. The func below applies to DTMs.
Another thing about the test data we'll be using here: it is primary data from a 2008 web survey of 5000+ regular shoppers at a mid-sized regional US supermarket chain.
The retailer was launching a store brand line of ice-creams and the survey Q being analyzed asked respondents: > “Which flavors of ice cream do you prefer? Please be as specific as possible”.
Respondents gave varying answers. Some skipped the Q. Below, let us load the corpus & inspect it.
= readLines("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/Ice-cream-dataset.txt")
ice_cream
str(ice_cream)
## chr [1:5165] "Vanilla, chocolate, cookies and cream, seasonal variations" ...
head(ice_cream, 10)
## [1] "Vanilla, chocolate, cookies and cream, seasonal variations"
## [2] ""
## [3] "vanilla & chocolate"
## [4] "Chocolate/peanut butter swirl; Moose Tracks"
## [5] "chocolate chip cookie dough"
## [6] "Chocolate, cookies n cream, butter pecan, vanilla fudge. "
## [7] "all of them my varied family member will eat anything. A really good rich vanilla is the most important becase that goes with everything."
## [8] "Chocolate chip cookie dough!!!!, cinnamon, vanilla bean, cake flavored"
## [9] ""
## [10] ""
5000+ rows in the dataset. Some empty. The kind of responses you can see above.
Now I define a func to build scree-plots for cluster-an.
1.6.2 Func 1: Build scree plot to find optimal #clusters
build_kmeans_scree <- function(mydata){ # rows are units, colms are basis variables

  # Determine number of clusters
  set.seed(seed = 0000) # set seed for reproducible work
  wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) # wss is within group sum of squares

  for (i in 2:15) wss[i] <- sum( # checking model fit for 2 to 15 clusters
    kmeans(mydata, centers = i)$withinss) # note use of kmeans() func

  # scree plot
  plot(1:15, wss, type="b",
       xlab="Number of Clusters",
       ylab="Within groups sum of squares")

} # func ends
Time now to test-drive above func with ice cream corpus. Forward, ahoy!
# pipe together sourced funcs into a single workflow. :)
system.time({
  dtm_ice_cream = ice_cream %>%
    text.clean() %>%
    dtm_build()
}) # 1.33 secs
## Joining, by = "word"
## user system elapsed
## 0.428 0.016 0.909
dim(dtm_ice_cream) # [1] 2208 922
## [1] 2208 922
system.time({ build_kmeans_scree(dtm_ice_cream) }) # 7.53 secs.
## user system elapsed
## 2.524 0.072 2.590
Again, above, I want to point out the use of piping operator %>%
to chain together distinct funcs.
From the scree plot, we identify the optimal k. Let us say k=8 clusters is optimal (illustrative).
Next, below, we divide the DTM into cluster-wise pieces and display each for interpretation purposes.
1.6.3 Func 2: Build display aids to view text-based clusters
display.clusters <- function(dtm, k){ # k = optimal num of clusters

  # K-Means Cluster Analysis
  fit <- kmeans(dtm, k) # k cluster solution

  for (i1 in 1:max(fit$cluster)){
    # windows()
    dtm_cluster = dtm[(fit$cluster == i1),]
    distill.cog(dtm_cluster)
  } # i1 loop ends

} # func ends
# test driving the func
system.time({ display.clusters(dtm_ice_cream, 3) })
## user system elapsed
## 1.583 0.016 1.592
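To interpret the clusters numerically as well, here is a hedged add-on I'm including (it assumes dtm_ice_cream from the sourced dtm_build() step behaves like a doc x term matrix, and uses k = 3 as in the test drive):
set.seed(0)
fit = kmeans(dtm_ice_cream, 3) # 3-cluster solution, as in the test drive above
for (i1 in 1:max(fit$cluster)){
  cat("\nCluster", i1, "(", sum(fit$cluster == i1), "docs ) - top 10 terms:\n")
  print(head(sort(colSums(as.matrix(dtm_ice_cream[(fit$cluster == i1), ])), decreasing = TRUE), 10))
}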
OK. Dassit for this markdown. Back to the slides, class.
Sudhir