Big data and Social Science

12.2 Lab 11: Descriptive statistics

Generally, we can use the data management functions in the dplyr package to compute descriptive statistics such as mean(), sd(), min(), and max(), either across the whole dataset or within groups. We’ll work with the tweets dataset again. You can find a description here.
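As a minimal sketch of the two cases mentioned above (whole dataset versus groups), the following uses a small made-up data frame rather than the tweets data. The object names `toy`, `overall`, and `by_user` are hypothetical and only serve the illustration.

```r
library(dplyr)

# Toy data standing in for the tweets table (values are made up)
toy <- data.frame(
  user   = c("a", "a", "a", "b", "b"),
  target = c(0, 4, 4, 0, 0)
)

# Statistics across the whole dataset: one row of results
overall <- toy %>%
  summarise(mean.target = mean(target), sd.target = sd(target),
            min.target = min(target), max.target = max(target))

# The same statistics within each group: one row per user
by_user <- toy %>%
  group_by(user) %>%
  summarise(mean.target = mean(target), sd.target = sd(target),
            min.target = min(target), max.target = max(target))

overall
by_user
```

Adding group_by() before summarise() is the only difference between the two calls.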

library(DBI)
db <- dbConnect(RSQLite::SQLite(), "./www/tweets-sentiment-db.sqlite")
dbListTables(db)

## [1] "table_tweets"

dbListFields(db, "table_tweets")

## [1] "target" "ids"    "date"   "flag"   "user"   "text"

data <- dbGetQuery(db, "SELECT  target, ids, date, user, text
           FROM table_tweets
           ORDER BY RANDOM()
           LIMIT 200000")
dbDisconnect(db)

nrow(data) # check how many rows there are

## [1] 200000

Get the number of tweets per user:

library(dplyr)
# Number of tweets per user (20 most active users)
data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n()) %>%
  arrange(desc(n.tweets)) %>% slice(1:20)

user	n.tweets
lost_dog	72
tweetpet	40
SongoftheOss	39
webwoke	38
mcraddictal	36
VioletsCRUK	34
what_bugs_u	32
DarkPiano	31
nuttychris	31
Dogbook	30
SallytheShizzle	30
wowlew	30
enamoredsoul	29
keza34	29
Broooooke_	28
torilovesbradie	28
twebbstack	27
Jayme1988	26
Karen230683	26
tsarnick	26
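
A count like this could also be computed inside the database, so that only the aggregated rows are transferred to R. The sketch below assumes an in-memory SQLite database filled with made-up rows. The table name mirrors table_tweets, but `con`, `toy`, and `counts` are hypothetical names. Run against the real database, such a query would count over the full table rather than over the random sample of 200,000 rows used above.

```r
library(DBI)

# In-memory SQLite database with a made-up stand-in for table_tweets
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  user = c("a", "a", "a", "b", "b", "c"),
  text = c("t1", "t2", "t3", "t4", "t5", "t6")
)
dbWriteTable(con, "table_tweets", toy)

# Count tweets per user inside the database and return only the top rows
counts <- dbGetQuery(con, "SELECT user, COUNT(*) AS n_tweets
           FROM table_tweets
           GROUP BY user
           ORDER BY n_tweets DESC
           LIMIT 20")
dbDisconnect(con)

counts
```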

Aggregate the data by user and get the number of tweets as well as the average sentiment:

library(dplyr)
# Number of tweets and mean of target per user
# target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n(), mean.target = mean(target, na.rm = TRUE)) %>%
  arrange(desc(n.tweets)) %>%
  slice(1:20)

user	n.tweets	mean.target
lost_dog	72	0.0000000
tweetpet	40	0.0000000
SongoftheOss	39	2.0512821
webwoke	38	0.7368421
mcraddictal	36	1.0000000
VioletsCRUK	34	3.1764706
what_bugs_u	32	4.0000000
DarkPiano	31	4.0000000
nuttychris	31	1.4193548
Dogbook	30	1.7333333
SallytheShizzle	30	1.3333333
wowlew	30	0.0000000
enamoredsoul	29	3.4482759
keza34	29	4.0000000
Broooooke_	28	2.5714286
torilovesbradie	28	2.0000000
twebbstack	27	3.1111111
Jayme1988	26	2.6153846
Karen230683	26	2.4615385
tsarnick	26	3.5384615
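
Because target is coded 0/2/4, mean.target lies on a 0 to 4 scale, which can be awkward to read. One possible alternative, sketched here on made-up data, is the share of a user's tweets that are positive, computed as the mean of a logical vector. The column name `share.positive` and the objects `toy` and `toy.agg` are hypothetical.

```r
library(dplyr)

# Made-up tweets using the coding from above: 0 = negative, 4 = positive
toy <- data.frame(
  user   = c("a", "a", "a", "a", "b", "b"),
  target = c(4, 4, 4, 0, 0, 0)
)

toy.agg <- toy %>%
  group_by(user) %>%
  summarise(
    n.tweets       = n(),
    mean.target    = mean(target, na.rm = TRUE),
    # target == 4 is TRUE/FALSE, so its mean is the proportion of positive tweets
    share.positive = mean(target == 4, na.rm = TRUE)
  )

toy.agg
```

When a user has no neutral tweets (target = 2), share.positive is simply mean.target divided by 4.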

Correlate the number of tweets with the sentiment:

# Correlation between number of tweets and sentiment
data.agg <- data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n(), mean.target = mean(target, na.rm = TRUE)) %>%
  arrange(desc(n.tweets))
cor(data.agg$n.tweets, data.agg$mean.target)

## [1] 0.03800391
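
A coefficient this close to zero indicates at most a very weak linear relationship. To judge the uncertainty around such an estimate, one option is cor.test() from base R, and a rank-based (Spearman) correlation may be more robust when a few users tweet far more than the rest. The sketch below runs on simulated aggregates, not on the tweets data, and `sim.agg`, `pearson`, and `spearman` are hypothetical names.

```r
set.seed(1)

# Simulated per-user aggregates (not the tweets data): counts are skewed,
# sentiment is drawn independently of the counts
sim.agg <- data.frame(
  n.tweets    = rpois(500, lambda = 2) + 1,
  mean.target = runif(500, min = 0, max = 4)
)

# Pearson correlation with a confidence interval and p-value
pearson <- cor.test(sim.agg$n.tweets, sim.agg$mean.target)

# Rank-based alternative, less sensitive to a few very active users
spearman <- cor(sim.agg$n.tweets, sim.agg$mean.target, method = "spearman")

pearson$estimate
pearson$conf.int
spearman
```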