8 Tutorial 8: Data Analysis
After working through Tutorial 8, you’ll…
- understand how to calculate basic descriptive statistics in R.
- understand how to do inferential tests in R.
Data
For this tutorial, we’ll use a fake data set I created for this purpose: “coded_data_tutorial7.csv” (via OLAT/Materials/Data for R).
The data set contains manual codings of Instagram posts of two outlets: the BBC (N = 200) and the Guardian (N = 200). In total, the sample contains 400 observations.
The five variables included here are:
- PostID: the individual number of each coded post
- Platform: the platform from which each post stems (here: Instagram)
- Outlet: the outlet from which each post stems (here: Guardian, BBC)
- Engaging_Language: whether each post contains engaging language (0 = no, 1 = yes)
- Count_Engaging_Words: the count of engaging words in each post (numeric, starting from 0)
Read in the data set:
data <- read.csv2("coded_data_tutorial7.csv")
This is what the data looks like in R:
head(data)
## PostID Platform Outlet Engaging_Language Count_Engaging_Words
## 1 ID01 Instagram BBC 0 5
## 2 ID02 Instagram BBC 0 4
## 3 ID03 Instagram BBC 0 3
## 4 ID04 Instagram BBC 1 3
## 5 ID05 Instagram BBC 1 3
## 6 ID06 Instagram BBC 0 3
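If the import does not look like the output above, it helps to recall what read.csv2() assumes by default: fields separated by semicolons and commas as decimal marks. The following is a minimal sketch using a small, made-up file written to a temporary location (the file contents and the object names tmp and example are hypothetical, not part of the tutorial data):

```r
# Hypothetical mini data set, written as a semicolon-separated file
tmp <- tempfile(fileext = ".csv")
writeLines(c("PostID;Outlet;Score",
             "ID01;BBC;1,5",
             "ID02;GUARDIAN;2,0"), tmp)

# read.csv2() uses sep = ";" and dec = "," by default,
# so Score is read as a numeric column (1.5 and 2.0)
example <- read.csv2(tmp)
str(example)
```

If your own file uses commas as separators and points as decimal marks, read.csv() is the matching function.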
8.1 Descriptive Statistics
8.1.1 Counts
For many variables - especially those with nominal or ordinal scale levels - you may want to count values. For instance, you may want to know how many posts contain engaging language. You can get this information using count() from the dplyr package (remember to activate this package first!).
In short, we summarize how many observations contain either 0 (no engaging language) or 1 (engaging language).
library(tidyverse)
data %>%
  count(Engaging_Language)
## Engaging_Language n
## 1 0 203
## 2 1 197
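As a quick cross-check without dplyr, base R's table() returns the same kind of frequency count. A minimal sketch on made-up values (the toy data frame is hypothetical, not the tutorial data set):

```r
# Hypothetical toy data, not the tutorial data set
toy <- data.frame(Engaging_Language = c(0, 1, 1, 0, 1))

# Base R: absolute frequencies of each value
counts <- table(toy$Engaging_Language)
counts
```

Here, the value 0 occurs twice and the value 1 three times.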
8.1.2 Percentages
Sometimes, it is more helpful to know how often a specific value occurs in relative rather than absolute terms, for instance as a percentage. To calculate this, we use an additional command called prop.table() inside mutate() to create a new variable called prop, which contains the share of posts (in percent) with and without engaging language:
data %>%
  count(Engaging_Language) %>%
  mutate(prop = prop.table(n)*100)
## Engaging_Language n prop
## 1 0 203 50.75
## 2 1 197 49.25
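To see what prop.table() does here, it can help to apply it to the two counts directly: it simply divides each count by the sum of all counts. A small sketch using the counts from the output above (the object name n is chosen for illustration):

```r
# Counts of posts without (203) and with (197) engaging language
n <- c(203, 197)

# prop.table() divides each value by the total ...
prop.table(n) * 100

# ... which is the same as doing the division by hand
n / sum(n) * 100
```

Both lines return 50.75 and 49.25, matching the prop column.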
8.1.3 Mean, standard deviation, etc.
For variables with metric scale levels, you may be interested in other types of summary statistics. For instance, you may want to know the mean number of engaging words used in posts. The dataset contains the count of engaging words used in each post in the variable Count_Engaging_Words.
We summarize the mean of engaging words by creating a new variable summarizing the value of Count_Engaging_Words via summarize() and calculating the mean using mean():
data %>%
  summarize(mean = mean(Count_Engaging_Words))
## mean
## 1 3.84
Alternatively, you can get the median with median() instead of the mean. This is useful, for instance, if extreme values skew the distribution of the variable, in which case the median may better represent a “typical” value:
data %>%
  summarize(median = median(Count_Engaging_Words))
## median
## 1 3
Another useful value is the standard deviation via sd():
data %>%
  summarize(sd = sd(Count_Engaging_Words))
## sd
## 1 1.595545
Lastly, you already know how to round numeric data to a number of decimals of your choice (for instance, to report them in a seminar paper):
data %>%
  summarize(mean = mean(Count_Engaging_Words)) %>%
  mutate(mean = round(mean, 1))
## mean
## 1 3.8
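The statistics above do not have to be computed one at a time: summarize() accepts several name = function pairs in a single call. A minimal sketch on made-up values (the toy data frame is hypothetical and deliberately contains one extreme value to show how mean and median can diverge):

```r
library(dplyr)

# Hypothetical toy data with one extreme value (8)
toy <- data.frame(Count_Engaging_Words = c(2, 3, 3, 4, 8))

# Mean, median, and standard deviation in one summarize() call,
# with the standard deviation rounded to two decimals
result <- toy %>%
  summarize(mean   = mean(Count_Engaging_Words),
            median = median(Count_Engaging_Words),
            sd     = sd(Count_Engaging_Words)) %>%
  mutate(sd = round(sd, 2))
result
```

In this toy example, the single extreme value pulls the mean (4) above the median (3).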
8.1.4 Aggregating data
In some cases, you may also wish to get descriptive data for groups within your data. For instance, you may want to know how the use of engaging language varies across outlets.
Using the dplyr package, we use the following code to get this information:
- We define the object to which functions should be applied: data
- We hand this object to our pipeline: %>%
- We group the data by outlet, so that all following steps are carried out separately for each outlet: group_by(Outlet)
- We shift the object back to the pipeline: %>%
- We count the values of Engaging_Language within each outlet and create a new variable, here prop, which contains the percentage of posts with and without engaging language per outlet: count(Engaging_Language) %>% mutate(prop = prop.table(n)*100)
data %>%
  group_by(Outlet) %>%
  count(Engaging_Language) %>%
  mutate(prop = prop.table(n)*100)
## # A tibble: 4 x 4
## # Groups: Outlet [2]
## Outlet Engaging_Language n prop
## <chr> <int> <int> <dbl>
## 1 BBC 0 91 45.5
## 2 BBC 1 109 54.5
## 3 GUARDIAN 0 112 56
## 4 GUARDIAN 1 88 44
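The same grouping logic works for metric variables: after group_by(), summarize() returns one row per group. A minimal sketch on made-up values (the toy data frame and the object name by_outlet are hypothetical, not the tutorial data set):

```r
library(dplyr)

# Hypothetical toy data, not the tutorial data set
toy <- data.frame(
  Outlet = c("BBC", "BBC", "BBC", "GUARDIAN", "GUARDIAN", "GUARDIAN"),
  Count_Engaging_Words = c(5, 4, 3, 2, 3, 4)
)

# Number of posts, mean, and standard deviation per outlet
by_outlet <- toy %>%
  group_by(Outlet) %>%
  summarize(n    = n(),
            mean = mean(Count_Engaging_Words),
            sd   = sd(Count_Engaging_Words))
by_outlet
```

The result has one row for each outlet, with the statistics computed only from that outlet's posts.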
8.2 Multivariate statistics
R offers a lot of options for doing inferential tests. Here, we will focus on two tests, Chi-Square tests and t-tests, to analyze to what extent the BBC uses more or less engaging language than the Guardian.
8.2.1 Chi-Square test
The Chi-Square test is used if we have categorical dependent and independent variables. Say we want to check whether the independent, binary variable Outlet correlates with the dependent, binary variable Engaging_Language.
In R, this is easily done with the crosstab() command from the tidycomm package:
library(tidycomm)
data %>%
  crosstab(Engaging_Language, Outlet, percentages = TRUE, chi_square = TRUE)
## Chi-square = 4.000900, df = 1.000000, p = 0.045476, V = 0.100011
## # A tibble: 2 x 3
## Outlet `0` `1`
## <chr> <dbl> <dbl>
## 1 BBC 0.448 0.553
## 2 GUARDIAN 0.552 0.447
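Here, it seems that the BBC uses engaging language more often than the Guardian. Note that crosstab() reports column-wise percentages: of all posts containing engaging language, 55% stem from the BBC and 45% from the Guardian. (Within outlets, 54.5% of BBC posts and 44% of Guardian posts contain engaging language, as shown in section 8.1.4.) The Chi-Square test shows that this difference is significant, with Chi-Square(1) = 4.00, p < .05. However, the effect is comparatively weak, with Cramer’s V = .10.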
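If you want to retrace these numbers without tidycomm, the test can be sketched in base R from the four cell counts shown in section 8.1.4. This is an illustration under the assumption that the default chisq.test() settings are comparable; the object names counts, test, and v are chosen for this example:

```r
# Cell counts taken from the grouped output in section 8.1.4
counts <- matrix(c(91, 109,
                   112, 88),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Outlet = c("BBC", "GUARDIAN"),
                                 Engaging_Language = c("0", "1")))

# Base R chi-square test; for 2 x 2 tables, Yates' continuity
# correction is applied by default (correct = TRUE)
test <- chisq.test(counts)
test

# Cramer's V = sqrt(chi-square / (N * (min(rows, columns) - 1)))
v <- sqrt(unname(test$statistic) / (sum(counts) * (min(dim(counts)) - 1)))
v

# Row percentages: share of posts with/without engaging language per outlet
prop.table(counts, margin = 1) * 100
```

With these counts, the continuity-corrected test gives a chi-square value of about 4.00 with df = 1 and a Cramer's V of about .10, in line with the crosstab() output. The last line returns the within-outlet percentages, which are often easier to interpret than column percentages when outlets are the groups being compared.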
8.2.2 T-test
The t-test is used if we have a categorical independent variable with two groups and a metric dependent variable. Say we want to check whether some outlets use more engaging words than others, as indicated by Count_Engaging_Words. To see whether the independent, binary variable Outlet is related to the dependent, metric variable Count_Engaging_Words, we use the t_test() command from the tidycomm package:
data %>%
  t_test(Outlet, Count_Engaging_Words)
## # A tibble: 1 x 10
## Variable M_BBC SD_BBC M_GUARDIAN SD_GUARDIAN Delta_M t df p d
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Count_Engaging_Words 4.36 1.64 3.32 1.37 1.04 6.89 398 2.23e-11 0.689
Regarding the number of engaging words used in social media posts, it also seems that the BBC uses more engaging words per post (M = 4.36, SD = 1.64) than the Guardian (M = 3.32, SD = 1.37). A t-test shows that this difference is significant, with t(398) = 6.89, p < .001. With d = .69, the effect is medium to large by Cohen’s conventional benchmarks.
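To see how t and d relate to the group means and standard deviations, both can be recomputed by hand from the summary statistics in the output above. The sketch below assumes a pooled-variance (equal variances) t-test, which the reported df = 398 = 400 - 2 suggests; all object names are chosen for illustration:

```r
# Summary statistics as reported in the t_test() output above
m_bbc <- 4.36; sd_bbc <- 1.64; n_bbc <- 200
m_gua <- 3.32; sd_gua <- 1.37; n_gua <- 200

# Pooled standard deviation (assumes equal variances in both groups)
sd_pooled <- sqrt(((n_bbc - 1) * sd_bbc^2 + (n_gua - 1) * sd_gua^2) /
                    (n_bbc + n_gua - 2))

# Cohen's d: mean difference in units of the pooled standard deviation
d <- (m_bbc - m_gua) / sd_pooled

# t statistic, degrees of freedom, and two-sided p value
t_value <- (m_bbc - m_gua) / (sd_pooled * sqrt(1 / n_bbc + 1 / n_gua))
df      <- n_bbc + n_gua - 2
p_value <- 2 * pt(-abs(t_value), df)

round(c(d = d, t = t_value, df = df), 2)
p_value
```

Because the inputs are already rounded to two decimals, the results can differ from the t_test() output in the last digit.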