12.2 Lab 11: Descriptive statistics
In general, we can use the data management functions in the dplyr package to compute descriptive statistics such as mean(), sd(), min(), and max(), either across the whole dataset or within groups. We'll work with the tweets dataset again. You can find a description here.
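As a minimal sketch of the idea, the following computes the four statistics first across a whole data frame and then within groups. The data frame `toy` and the column names `sd.target`, `min.target`, and `max.target` are made up for illustration; only `user`, `target`, `n.tweets`, and `mean.target` mirror names used in this lab.

```r
library(dplyr)

# Hypothetical toy data: three users with sentiment scores
toy <- tibble(
  user   = c("a", "a", "a", "b", "b", "c"),
  target = c(0, 4, 4, 0, 0, 4)
)

# Statistics across the whole dataset (one row of output)
overall <- toy %>%
  summarise(
    mean.target = mean(target),
    sd.target   = sd(target),
    min.target  = min(target),
    max.target  = max(target)
  )

# The same statistics within each group (one row per user)
by.user <- toy %>%
  group_by(user) %>%
  summarise(
    n.tweets    = n(),
    mean.target = mean(target),
    sd.target   = sd(target),
    min.target  = min(target),
    max.target  = max(target)
  )

print(overall)
print(by.user)
```

Note that sd() returns NA for a group with a single observation, so users with only one tweet will have a missing standard deviation.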
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "./www/tweets-sentiment-db.sqlite")
dbListTables(db)
## [1] "table_tweets"
dbListFields(db, "table_tweets")
## [1] "target" "ids" "date" "flag" "user" "text"
data <- dbGetQuery(db, "SELECT target, ids, date, user, text
FROM table_tweets
ORDER BY RANDOM()
LIMIT 200000")
dbDisconnect(db)
nrow(data) # check how many rows there are
## [1] 200000
Get the number of tweets per user:
library(dplyr)
# Number of tweets per user (20 most active users)
data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n()) %>%
  arrange(desc(n.tweets)) %>%
  slice(1:20)
user | n.tweets |
---|---|
lost_dog | 72 |
tweetpet | 40 |
SongoftheOss | 39 |
webwoke | 38 |
mcraddictal | 36 |
VioletsCRUK | 34 |
what_bugs_u | 32 |
DarkPiano | 31 |
nuttychris | 31 |
Dogbook | 30 |
SallytheShizzle | 30 |
wowlew | 30 |
enamoredsoul | 29 |
keza34 | 29 |
Broooooke_ | 28 |
torilovesbradie | 28 |
twebbstack | 27 |
Jayme1988 | 26 |
Karen230683 | 26 |
tsarnick | 26 |
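The group_by() / summarise(n = n()) / arrange(desc()) pattern above can also be written more compactly with dplyr's count(). The sketch below uses a small made-up data frame (`toy`) rather than the tweets data, and assumes dplyr 1.0.0 or later for slice_head().

```r
library(dplyr)

# Hypothetical toy data: one row per tweet
toy <- tibble(user = c("a", "b", "a", "c", "a", "b"))

# count() groups, tallies, and (with sort = TRUE) orders by descending count
top.users <- toy %>%
  count(user, name = "n.tweets", sort = TRUE) %>%
  slice_head(n = 2)  # keep the two most active users

print(top.users)
```

Whether to use count() or the explicit pipeline is mostly a matter of taste; the explicit version is easier to extend with further summary columns.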
Aggregate the data by user and get the number of tweets as well as the average sentiment:
library(dplyr)
# Number of tweets and mean of target per user (20 most active users)
# target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n(),
                   mean.target = mean(target, na.rm = TRUE)) %>%
  arrange(desc(n.tweets)) %>%
  slice(1:20)
user | n.tweets | mean.target |
---|---|---|
lost_dog | 72 | 0.0000000 |
tweetpet | 40 | 0.0000000 |
SongoftheOss | 39 | 2.0512821 |
webwoke | 38 | 0.7368421 |
mcraddictal | 36 | 1.0000000 |
VioletsCRUK | 34 | 3.1764706 |
what_bugs_u | 32 | 4.0000000 |
DarkPiano | 31 | 4.0000000 |
nuttychris | 31 | 1.4193548 |
Dogbook | 30 | 1.7333333 |
SallytheShizzle | 30 | 1.3333333 |
wowlew | 30 | 0.0000000 |
enamoredsoul | 29 | 3.4482759 |
keza34 | 29 | 4.0000000 |
Broooooke_ | 28 | 2.5714286 |
torilovesbradie | 28 | 2.0000000 |
twebbstack | 27 | 3.1111111 |
Jayme1988 | 26 | 2.6153846 |
Karen230683 | 26 | 2.4615385 |
tsarnick | 26 | 3.5384615 |
Correlate the number of tweets with the sentiment:
# Correlation between number of tweets and sentiment
data.agg <- data %>%
  group_by(user) %>%
  dplyr::summarise(n.tweets = n(),
                   mean.target = mean(target, na.rm = TRUE)) %>%
  arrange(desc(n.tweets))
cor(data.agg$n.tweets, data.agg$mean.target)
cor(data.agg$n.tweets, data.agg$mean.target)
## [1] 0.03800391
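cor() returns Pearson's correlation coefficient by default. As a hedged sketch of how one might probe such a result further, the following uses made-up per-user aggregates (`toy.agg`) to compare Pearson's r with Spearman's rank correlation, which may be a reasonable alternative because counts such as n.tweets are often heavily skewed, and uses cor.test() from base R's stats package to obtain a confidence interval and p-value.

```r
# Hypothetical per-user aggregates (not taken from the tweets data)
toy.agg <- data.frame(
  n.tweets    = c(1, 2, 3, 4, 5),
  mean.target = c(0, 2, 2, 4, 4)
)

# Pearson correlation (the default of cor())
r.pearson <- cor(toy.agg$n.tweets, toy.agg$mean.target)

# Spearman rank correlation, less sensitive to skew and outliers
r.spearman <- cor(toy.agg$n.tweets, toy.agg$mean.target, method = "spearman")

# Significance test and confidence interval for the Pearson correlation
test <- cor.test(toy.agg$n.tweets, toy.agg$mean.target)

print(r.pearson)
print(r.spearman)
print(test$conf.int)
print(test$p.value)
```

On the real data the same calls would take data.agg$n.tweets and data.agg$mean.target as arguments.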