8 Tutorial 8: Data Analysis

After working through Tutorial 8, you’ll…

understand how to calculate basic descriptive statistics in R.
understand how to do inferential tests in R.

Data

For this tutorial, we’ll use a fake data set that I created for this purpose: “coded_data_tutorial7.csv” (via OLAT/Materials/Data for R).

The data set contains manual codings of Instagram posts of two outlets: the BBC (N = 200) and the Guardian (N = 200). In total, the sample contains 400 observations.

The five variables included here are:

PostID: the individual number of each coded post
Platform: the platform from which each post stems (here: Instagram)
Outlet: the outlet from which each post stems (here: Guardian, BBC)
Engaging_Language: whether each post contains engaging language (0 = no, 1 = yes)
Count_Engaging_Words: the count of engaging words in each post (numeric, starting from 0)

Read in the data set:

data <- read.csv2("coded_data_tutorial7.csv")

This is what the data looks like in R:

head(data)

##   PostID  Platform Outlet Engaging_Language Count_Engaging_Words
## 1   ID01 Instagram    BBC                 0                    5
## 2   ID02 Instagram    BBC                 0                    4
## 3   ID03 Instagram    BBC                 0                    3
## 4   ID04 Instagram    BBC                 1                    3
## 5   ID05 Instagram    BBC                 1                    3
## 6   ID06 Instagram    BBC                 0                    3
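If an import ever returns a single column instead of five, the separator is a likely cause: read.csv2() expects semicolon-separated values (and decimal commas), whereas read.csv() expects comma-separated values. The following is a minimal sketch using a made-up temporary file; toy_file and toy are hypothetical names and not part of the tutorial data.

```r
# Write a tiny, made-up semicolon-separated file to a temporary location
toy_file <- tempfile(fileext = ".csv")
writeLines(c("PostID;Outlet;Count_Engaging_Words",
             "ID01;BBC;5",
             "ID02;GUARDIAN;2"),
           toy_file)

# read.csv2() splits on ";" and returns one column per variable
toy <- read.csv2(toy_file)
str(toy)

# read.csv() splits on "," and therefore finds only one column here
ncol(read.csv(toy_file))
```

Checking the result with str() or head() right after the import is a cheap way to catch this kind of problem early.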

8.1 Descriptive Statistics

8.1.1 Counts

For many variables - especially those with nominal or ordinal scale levels - you may want to count values. For instance, you may want to know how many posts contain engaging language. You can get this information using count() from the dplyr package (remember to activate this package first!).

In short, we summarize how many observations contain either 0 (no engaging language) or 1 (engaging language).

library(tidyverse)
data %>%
  count(Engaging_Language)

##   Engaging_Language   n
## 1                 0 203
## 2                 1 197
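For a quick check without loading any packages, base R's table() gives the same kind of count. The sketch below uses a made-up vector of five codings (toy_codes is a hypothetical name), not the tutorial data.

```r
# Made-up codings for five posts (0 = no engaging language, 1 = yes)
toy_codes <- c(0, 1, 1, 0, 1)

# table() counts how often each value occurs
toy_counts <- table(toy_codes)
toy_counts
```

Unlike count(), table() returns a table object rather than a data frame, so count() is usually the more convenient choice inside a pipeline.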

8.1.2 Percentages

Sometimes, it is more helpful to know how often a specific value occurs in relative rather than absolute terms, for instance as a percentage. To calculate this, we use an additional command called prop.table() to create a new variable called prop, which contains the share of posts with and without engaging language in percent:

data %>%
  count(Engaging_Language) %>%
  mutate(prop = prop.table(n)*100)

##   Engaging_Language   n  prop
## 1                 0 203 50.75
## 2                 1 197 49.25
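To see what prop.table() does, it may help to apply it directly to the two counts from 8.1.1. This is only an illustrative sketch; the vector n is typed in by hand here rather than computed from the data.

```r
# Counts taken from the output in 8.1.1
n <- c(`0` = 203, `1` = 197)

# prop.table() divides each count by the sum of all counts
prop <- prop.table(n) * 100
prop

# The same calculation written out by hand
n / sum(n) * 100
```

Both lines return 50.75 and 49.25, the values shown in the prop column above.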

8.1.3 Mean, standard deviation, etc.

For variables with metric scale levels, you may be interested in other types of summary statistics. For instance, you may want to know the mean number of engaging words used in posts. The dataset contains the count of engaging words used in each post in the variable Count_Engaging_Words.

We summarize the mean of engaging words by creating a new variable summarizing the value of Count_Engaging_Words via summarize() and calculating the mean using mean():

data %>%
  summarize(mean = mean(Count_Engaging_Words))

##   mean
## 1 3.84

Alternatively, you can get the median with median() instead of the mean. This is useful, for instance, if extreme values skew the distribution of the variable, in which case the median may represent a “typical” value better than the mean:

data %>%
  summarize(median = median(Count_Engaging_Words))

##   median
## 1      3

Another useful value is the standard deviation via sd():

data %>%
  summarize(sd = sd(Count_Engaging_Words))

##         sd
## 1 1.595545

Lastly, you already know how to round numeric data to a number of decimals of your choice (for instance, to report them in a seminar paper):

data %>%
  summarize(mean = mean(Count_Engaging_Words)) %>%
  mutate(mean = round(mean, 1))

##   mean
## 1  3.8
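The individual steps above can also be combined: summarize() accepts several statistics at once. The following sketch uses a made-up data frame of six posts (toy is a hypothetical name), so the numbers differ from those of the tutorial data.

```r
library(dplyr)

# Made-up counts of engaging words for six posts
toy <- data.frame(Count_Engaging_Words = c(5, 4, 3, 3, 3, 3))

# Several summary statistics can be computed in a single summarize() call
toy_summary <- toy %>%
  summarize(mean   = mean(Count_Engaging_Words),
            median = median(Count_Engaging_Words),
            sd     = sd(Count_Engaging_Words)) %>%
  # Round every column of the result to two decimals
  mutate(across(everything(), ~ round(.x, 2)))
toy_summary
```

If a variable contains missing values, mean(), median(), and sd() return NA unless you set na.rm = TRUE.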

8.1.4 Aggregating data

In some cases, you may also wish to get descriptive data for groups within your data. For instance, you may want to know how the use of engaging language varies across outlets.

Using the dplyr package, we use the following code to get this information:

We define the object to which functions should be applied: data
We hand this object to our pipeline: %>%
We aggregate the data to a higher level, here by outlets: group_by(Outlet)
We shift the object back to the pipeline: %>%
We create a new variable, here prop, which contains the percentage of posts with and without engaging language within each outlet: count(Engaging_Language) %>% mutate(prop = prop.table(n)*100)

data %>%
  group_by(Outlet) %>%
  count(Engaging_Language) %>%
  mutate(prop = prop.table(n)*100)

## # A tibble: 4 x 4
## # Groups:   Outlet [2]
##   Outlet   Engaging_Language     n  prop
##   <chr>                <int> <int> <dbl>
## 1 BBC                      0    91  45.5
## 2 BBC                      1   109  54.5
## 3 GUARDIAN                 0   112  56  
## 4 GUARDIAN                 1    88  44
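The same group_by() logic also works for metric variables when combined with summarize(). The sketch below uses a made-up data frame with three posts per outlet (toy is a hypothetical name), so the results are not those of the tutorial data.

```r
library(dplyr)

# Made-up data: three posts per outlet
toy <- data.frame(
  Outlet = c("BBC", "BBC", "BBC", "GUARDIAN", "GUARDIAN", "GUARDIAN"),
  Count_Engaging_Words = c(5, 4, 3, 2, 3, 4)
)

# One row per outlet with group size, mean, and standard deviation
toy_by_outlet <- toy %>%
  group_by(Outlet) %>%
  summarize(n    = n(),
            mean = mean(Count_Engaging_Words),
            sd   = sd(Count_Engaging_Words))
toy_by_outlet
```

Because the data is grouped first, every statistic inside summarize() is calculated separately for each outlet.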

8.2 Multivariate statistics

R offers a lot of options for doing inferential tests. Here, we will focus on two tests, Chi-Square tests and t-tests, to analyze to what extent the BBC uses more or less engaging language than the Guardian.

8.2.1 Chi-Square test

The Chi-Square test is used if both the dependent and the independent variable are categorical. Say we want to check whether the independent, binary variable Outlet is related to the dependent, binary variable Engaging_Language.

In R, this is easily done with the crosstab() command from the tidycomm package:

library(tidycomm)
data %>%
  crosstab(Engaging_Language, Outlet, percentages = TRUE, chi_square = TRUE)

## Chi-square = 4.000900, df = 1.000000, p = 0.045476, V = 0.100011

## # A tibble: 2 x 3
##   Outlet     `0`   `1`
##   <chr>    <dbl> <dbl>
## 1 BBC      0.448 0.553
## 2 GUARDIAN 0.552 0.447

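Here, it seems that the BBC uses engaging language more often than the Guardian: of all posts containing engaging language, about 55% stem from the BBC and about 45% from the Guardian. Note that the percentages in the table are column-wise, i.e., they add up to 100% within each value of Engaging_Language. The Chi-Square test shows that this difference is significant, with Chi-Square(1) = 4, p < .05. However, the effect is comparatively weak, with Cramér’s V = .1.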
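As a cross-check, the test can be reproduced in base R from the four counts reported in 8.1.4. This is a sketch, not necessarily how tidycomm computes its result internally; the object names tab, res, and v are hypothetical, and the formula for Cramér's V is the standard textbook one.

```r
# Counts taken from the grouped output in 8.1.4
tab <- matrix(c(91, 109,
                112, 88),
              nrow = 2, byrow = TRUE,
              dimnames = list(Outlet = c("BBC", "GUARDIAN"),
                              Engaging_Language = c("0", "1")))

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
res <- chisq.test(tab)
res

# Cramér's V = sqrt(chi-square / (N * (min(rows, cols) - 1)))
v <- sqrt(unname(res$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
v

# Row percentages: share of posts without/with engaging language per outlet
prop.table(tab, margin = 1) * 100
```

The row percentages (margin = 1) answer the question "what share of each outlet's posts contains engaging language?" and correspond to the prop column in 8.1.4.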

8.2.2 T-test

The t-test is used if we have a categorical independent variable and a metric dependent variable. Say we want to check whether some outlets use more engaging words than others, as indicated by Count_Engaging_Words. To see whether the independent, binary variable Outlet is related to the dependent, metric variable Count_Engaging_Words, we use the t_test() command from the tidycomm package:

data %>%
  t_test(Outlet, Count_Engaging_Words)

## # A tibble: 1 x 10
##   Variable             M_BBC SD_BBC M_GUARDIAN SD_GUARDIAN Delta_M     t    df        p     d
##   <chr>                <dbl>  <dbl>      <dbl>       <dbl>   <dbl> <dbl> <dbl>    <dbl> <dbl>
## 1 Count_Engaging_Words  4.36   1.64       3.32        1.37    1.04  6.89   398 2.23e-11 0.689

Regarding the number of engaging words used in social media posts, it also seems that the BBC uses more engaging words (M = 4.36, SD = 1.64) than the Guardian (M = 3.32, SD = 1.37). A t-test shows that this difference is significant, with t(398) = 6.89, p < .001. This effect is medium to strong, with an effect size of d = .69.
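To make the link between the reported values more tangible, the effect size and the test statistic can be recomputed by hand from the means and standard deviations in the output. The sketch assumes a pooled-standard-deviation Cohen's d and an equal-variance t-test, which is consistent with df = 398 but is an assumption about the calculation, not something the output states. All object names are hypothetical.

```r
# Summary statistics taken from the t_test() output above (200 posts per outlet)
m_bbc <- 4.36; sd_bbc <- 1.64; n_bbc <- 200
m_gua <- 3.32; sd_gua <- 1.37; n_gua <- 200

# Pooled standard deviation across both groups
sd_pooled <- sqrt(((n_bbc - 1) * sd_bbc^2 + (n_gua - 1) * sd_gua^2) /
                    (n_bbc + n_gua - 2))

# Cohen's d: mean difference in units of the pooled standard deviation
d <- (m_bbc - m_gua) / sd_pooled

# t statistic of the equal-variance t-test and its degrees of freedom
t_value <- (m_bbc - m_gua) / (sd_pooled * sqrt(1 / n_bbc + 1 / n_gua))
df_value <- n_bbc + n_gua - 2

# Two-sided p-value
p_value <- 2 * pt(-abs(t_value), df_value)

round(c(d = d, t = t_value, df = df_value), 2)
```

Because the inputs are already rounded to two decimals, the results can deviate slightly in the last digit from the values that t_test() reports.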

8.3 Take Aways

Descriptive statistics: Important commands: count(), prop.table(), mean(), median(), sd()
Multivariate statistics: Important commands: crosstab(), t_test()