12.2 Lab 11: Descriptive statistics

Generally, we can combine the data management functions of the dplyr package, such as group_by() and summarise(), with summary functions such as mean(), sd(), min() and max() to compute descriptive statistics across the whole dataset or within groups. We’ll work with the tweets dataset again. You can find a description here.

library(DBI)
db <- dbConnect(RSQLite::SQLite(), "./www/tweets-sentiment-db.sqlite")
dbListTables(db)
## [1] "table_tweets"
dbListFields(db, "table_tweets")
## [1] "target" "ids"    "date"   "flag"   "user"   "text"
data <- dbGetQuery(db, "SELECT  target, ids, date, user, text
           FROM table_tweets
           ORDER BY RANDOM()
           LIMIT 200000")
dbDisconnect(db)
nrow(data) # check how many rows there are
## [1] 200000

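Because the rows are sampled with ORDER BY RANDOM(), every run draws a different set of 200,000 tweets, so the counts and averages shown below will differ somewhat when you run the code yourself.

Get the number of tweets per user: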

library(dplyr)
# Number of tweets per user (20 most active users)
data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n()) %>%
  arrange(desc(n.tweets)) %>%
  slice(1:20)
user n.tweets
lost_dog 72
tweetpet 40
SongoftheOss 39
webwoke 38
mcraddictal 36
VioletsCRUK 34
what_bugs_u 32
DarkPiano 31
nuttychris 31
Dogbook 30
SallytheShizzle 30
wowlew 30
enamoredsoul 29
keza34 29
Broooooke_ 28
torilovesbradie 28
twebbstack 27
Jayme1988 26
Karen230683 26
tsarnick 26
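
The group_by() / summarise(n()) / arrange() chain above can also be written with dplyr's count() shorthand. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins for the sampled tweets, not part of the lab data); it only illustrates that both forms give the same per-user counts.

```r
library(dplyr)

# Hypothetical toy data standing in for the sampled tweets
toy <- data.frame(
  user   = c("ann", "bob", "ann", "cat", "ann", "bob"),
  target = c(0, 4, 4, 0, 4, 0)
)

# Long form, as used in the lab
long.form <- toy %>%
  group_by(user) %>%
  summarise(n.tweets = n()) %>%
  arrange(desc(n.tweets))

# Shorthand: count() groups, tallies and (with sort = TRUE) orders in one call
short.form <- toy %>%
  count(user, sort = TRUE, name = "n.tweets")

long.form
short.form
```

Which form to use is mostly a matter of taste; the long form is easier to extend once further statistics are added to summarise().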

Aggregate the data by user and get the number of tweets as well as the average sentiment:

library(dplyr)
# Mean of target
# target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n(),
                   mean.target = mean(target, na.rm = TRUE)) %>%
  arrange(desc(n.tweets)) %>%
  slice(1:20)
user n.tweets mean.target
lost_dog 72 0.0000000
tweetpet 40 0.0000000
SongoftheOss 39 2.0512821
webwoke 38 0.7368421
mcraddictal 36 1.0000000
VioletsCRUK 34 3.1764706
what_bugs_u 32 4.0000000
DarkPiano 31 4.0000000
nuttychris 31 1.4193548
Dogbook 30 1.7333333
SallytheShizzle 30 1.3333333
wowlew 30 0.0000000
enamoredsoul 29 3.4482759
keza34 29 4.0000000
Broooooke_ 28 2.5714286
torilovesbradie 28 2.0000000
twebbstack 27 3.1111111
Jayme1988 26 2.6153846
Karen230683 26 2.4615385
tsarnick 26 3.5384615
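
The same pattern extends to the other statistics mentioned at the start of the lab: sd(), min() and max(). Below is a minimal sketch on a made-up data frame (toy and the column names sd.target, min.target and max.target are hypothetical and chosen only for illustration). It computes the statistics once across the whole dataset and once per user.

```r
library(dplyr)

# Hypothetical toy data standing in for the sampled tweets
toy <- data.frame(
  user   = c("ann", "ann", "ann", "bob", "bob", "cat"),
  target = c(0, 4, 4, 0, 0, 4)
)

# Statistics across the whole dataset (no grouping)
overall <- toy %>%
  summarise(mean.target = mean(target, na.rm = TRUE),
            sd.target   = sd(target, na.rm = TRUE),
            min.target  = min(target, na.rm = TRUE),
            max.target  = max(target, na.rm = TRUE))

# The same statistics within each user
by.user <- toy %>%
  group_by(user) %>%
  summarise(n.tweets    = n(),
            mean.target = mean(target, na.rm = TRUE),
            sd.target   = sd(target, na.rm = TRUE),  # NA for users with a single tweet
            min.target  = min(target, na.rm = TRUE),
            max.target  = max(target, na.rm = TRUE))

overall
by.user
```

Note that sd() returns NA when a group contains only one observation, which will apply to many users in a random sample of tweets.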

Correlate the number of tweets with the sentiment:

# Correlation between number of tweets and sentiment
data.agg <- data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n(),
                   mean.target = mean(target, na.rm = TRUE)) %>%
  arrange(desc(n.tweets))
cor(data.agg$n.tweets, data.agg$mean.target)
## [1] 0.03800391
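
A coefficient of about 0.04 is very close to zero, so in this sample there is almost no linear association between how often a user tweets and their average sentiment. To get a confidence interval, or a rank-based coefficient that is less sensitive to the few very active users, one could use cor.test() and the method argument of cor(). The following is a minimal sketch on made-up aggregated data (toy.agg and its values are hypothetical and only mimic the structure of data.agg).

```r
# Hypothetical aggregated data with the same columns as data.agg
toy.agg <- data.frame(
  n.tweets    = c(1, 1, 2, 3, 5, 8, 13, 21),
  mean.target = c(4, 0, 2, 3, 1, 2.5, 2, 1.5)
)

# Pearson correlation with a significance test and 95% confidence interval
pearson <- cor.test(toy.agg$n.tweets, toy.agg$mean.target)
pearson$estimate   # same value that cor() returns
pearson$conf.int   # 95% confidence interval for the correlation

# Spearman rank correlation, less affected by the skewed tweet counts
spearman <- cor(toy.agg$n.tweets, toy.agg$mean.target, method = "spearman")
spearman
```

On the real data, the same calls would take data.agg$n.tweets and data.agg$mean.target as arguments.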