7 Tutorial 7: Descriptive statistics

After working through Tutorial 7, you’ll…

understand how to calculate basic descriptive statistics in R.

Data

For this tutorial, we’ll again use the data set we already used in Tutorial 6: Control structures & functions in R: “data_tutorial6.txt” (via OLAT/Materials/Data for R).

Remember: The data set consists of data that is completely made up - a survey with 1000 citizens in Europe.

The five variables included here are:

country: the country in which each citizen was living at the time of the survey (France/Germany/Italy/Switzerland)
date: the date on which each citizen was surveyed (from 2021-09-20 to 2021-10-03)
gender: each citizen’s gender (female/male/NA)
trust_politics: how much each citizen trusts the political system (from 1 = no trust at all to 4 = a lot of trust)
trust_news_media: how much each citizen trusts the news media (from 1 = no trust at all to 4 = a lot of trust)

Read in the data set:

data <- read.csv2("data_tutorial6.txt")

This is what the data looks like in R:

head(data)

##       country       date gender trust_politics trust_news_media
## 1     Germany 2021-09-20 female              3                1
## 2 Switzerland 2021-10-02   male              2                1
## 3      France 2021-09-21   <NA>              1                3
## 4       Italy 2021-10-03   male              2                2
## 5     Germany 2021-09-21 female              3                1
## 6 Switzerland 2021-09-20   male              1                2

7.1 Counts

For many variables - especially those with nominal or ordinal scale levels - you may want to count values. For instance, you may want to know how often each country occurs in our data set. You can get this information using table().

table(data$country)

## 
##      France     Germany       Italy Switzerland 
##         250         250         250         250
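If you need relative rather than absolute frequencies, the counts returned by table() can be converted to shares with prop.table(). The following is a minimal sketch on a small made-up vector; the object toy_country is hypothetical and not part of the tutorial data.

```r
# Hypothetical toy vector, not the tutorial data
toy_country <- c("France", "Germany", "Germany", "Italy", NA)

# Absolute counts; by default, table() leaves out missing values
counts <- table(toy_country)
counts

# Relative frequencies in percent (shares of the non-missing values)
shares <- round(prop.table(counts) * 100, 1)
shares

# Count missing values as a category of their own
counts_with_na <- table(toy_country, useNA = "ifany")
counts_with_na
```

Note that table() ignores missing values unless you ask for them via useNA, so the shares above refer to non-missing values only.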

7.1.1 Mean, standard deviation, etc.

For variables with metric scale levels, you may want to get other information. For instance, you may want to know the mean level of trust in politics. You already know the command mean() for getting this information:

mean(data$trust_politics)

## [1] 2.545

Or you could get the median with median() instead of the mean. This is useful, for instance, if extreme values skew the distribution of the variable: in that case, the median may represent a “typical” value better than the mean.

median(data$trust_politics)

## [1] 3
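To see why the median can be the more robust choice, consider this small sketch. The vector toy_values is made up for illustration and contains one extreme observation.

```r
# Hypothetical toy values with one extreme observation (not the tutorial data)
toy_values <- c(1, 2, 2, 3, 3, 100)

toy_mean <- mean(toy_values)      # pulled upwards by the extreme value 100
toy_median <- median(toy_values)  # stays close to the typical values

toy_mean
toy_median
```

Here, the mean is 18.5 while the median is 2.5: a single extreme value is enough to move the mean far away from most observations.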

Another useful value is the standard deviation via sd()…

sd(data$trust_politics)

## [1] 1.101448

… or the smallest or biggest value using min() and max():

min(data$trust_politics) #lowest value

## [1] 1

max(data$trust_politics) #highest value

## [1] 4

The summary() function offers you most of this information with one simple function:

summary(data$trust_politics)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.545   4.000   4.000

Lastly, you already know how to round numeric data to a number of decimals of your choice (for instance, to report them in a seminar paper):

mean(data$trust_politics) #without rounding

## [1] 2.545

round(mean(data$trust_politics),2) #round to two decimals

## [1] 2.54

7.1.2 Missing values

For many data sets, it is also important to know which information is not in there - i.e., missing data. You may not remember, but in Tutorial 3: Objects & structures in R you already encountered a command for counting the number of missing cases for a variable of your choice (here both in base R and the dplyr version):

# base R
sum(is.na(data$gender))

## [1] 341

# dplyr (the pipe %>% is only available once the package is loaded)
library(dplyr)
data$gender %>% is.na() %>% sum()

## [1] 341

We could also reduce our data set to complete cases only, i.e., to rows without a missing value in any variable, using complete.cases():

data_complete <- data[complete.cases(data),]
sum(is.na(data_complete$gender))

## [1] 0
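Missing values also affect the descriptive statistics introduced above. The following sketch uses a made-up data frame (toy is hypothetical, not the tutorial data) to show how to count missing values per variable, how mean() reacts to them, and what complete.cases() removes.

```r
# Hypothetical toy data frame with missing values (not the tutorial data)
toy <- data.frame(
  gender = c("female", NA, "male", "female"),
  trust  = c(3, 2, NA, 4)
)

# Number of missing values per variable
na_per_variable <- colSums(is.na(toy))
na_per_variable

# mean() returns NA as soon as one value is missing ...
mean(toy$trust)

# ... unless missing values are removed explicitly via na.rm = TRUE
mean_without_na <- mean(toy$trust, na.rm = TRUE)
mean_without_na

# complete.cases() drops every row with a missing value in any variable
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

In this toy example, two of the four rows are dropped even though each variable has only one missing value, so it is worth checking how many cases remain.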

7.2 Aggregating data

In some cases, we may also wish to get descriptive data for several groups within our data. For instance, we may want to know how respondents’ trust in politics varies by country.

Using the dplyr structure, we use the following code to get this information:

We define the object to which functions should be applied: data
We hand this object to our pipeline: %>%
We group the data by the variable of interest, here by country: group_by(country)
We shift the object back to the pipeline: %>%
We create a new variable, here mean_trust_politics, which includes the respondents’ mean trust in politics for each country: summarise(mean_trust_politics = mean(trust_politics))

data %>% 
  group_by(country) %>% 
  summarise(mean_trust_politics = mean(trust_politics))

## # A tibble: 4 x 2
##   country     mean_trust_politics
##   <chr>                     <dbl>
## 1 France                     2.47
## 2 Germany                    2.54
## 3 Italy                      2.54
## 4 Switzerland                2.63
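summarise() can also compute several statistics per group in one step. This is a minimal sketch, assuming the dplyr package is installed; the data frame toy and the variable names mean_trust, sd_trust and n are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not the tutorial data
toy <- data.frame(
  country        = c("France", "France", "Germany", "Germany"),
  trust_politics = c(1, 3, 2, 4)
)

toy_summary <- toy %>%
  group_by(country) %>%
  summarise(
    mean_trust = mean(trust_politics), # mean per country
    sd_trust   = sd(trust_politics),   # standard deviation per country
    n          = n()                   # number of respondents per country
  )

toy_summary
```

Each argument to summarise() becomes one column in the result, with one row per country.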

Again, we could have also done this with base R. The corresponding lines of code would have just been a bit longer:

mean(data$trust_politics[data$country=="France"])

## [1] 2.468

mean(data$trust_politics[data$country=="Germany"])

## [1] 2.54

mean(data$trust_politics[data$country=="Italy"])

## [1] 2.544

mean(data$trust_politics[data$country=="Switzerland"])

## [1] 2.628
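Base R also offers more compact alternatives to repeating the same line for every country, for example tapply() and aggregate(). The sketch below uses a made-up data frame (toy is hypothetical, not the tutorial data).

```r
# Hypothetical toy data, not the tutorial data
toy <- data.frame(
  country        = c("France", "France", "Germany", "Germany"),
  trust_politics = c(1, 3, 2, 4)
)

# tapply() applies mean() to trust_politics separately for each country
means_by_country <- tapply(toy$trust_politics, toy$country, mean)
means_by_country

# aggregate() returns the same information as a data frame
means_df <- aggregate(trust_politics ~ country, data = toy, FUN = mean)
means_df
```

tapply() returns a named vector, whereas aggregate() returns a data frame, which may be more convenient if you want to keep working with the result.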

Aggregating data may also be helpful in other cases.

For instance, say we want to calculate the absolute and relative frequency with which female participants took part in the survey by country.

In short, we want to create a table containing the absolute number of female respondents in each country as well as what percentage of each country’s sample this represents.

Using the commands we already know, we could easily do that using the dplyr logic:

data %>%
  # group by country, then by gender
  group_by(country, gender) %>%

  # create a variable containing the number of respondents for each country by gender
  summarise(n = n()) %>%

  # create a variable containing the % of respondents for each country by gender
  mutate(freq = n / sum(n) * 100) %>%

  # keep only values for female respondents (exclude male/NA)
  filter(gender == "female")

## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

## # A tibble: 4 x 4
## # Groups:   country [4]
##   country     gender     n  freq
##   <chr>       <chr>  <int> <dbl>
## 1 France      female    85  34  
## 2 Germany     female    90  36  
## 3 Italy       female    82  32.8
## 4 Switzerland female    75  30
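The message printed by summarise() is a reminder that the result is still grouped by country, which is exactly why sum(n) is calculated within each country. If you prefer to state this explicitly (and silence the message), you can set the .groups argument. The following is a sketch on made-up data, assuming dplyr version 1.0.0 or later; toy and toy_freq are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data, not the tutorial data
toy <- data.frame(
  country = c("France", "France", "France", "Germany", "Germany"),
  gender  = c("female", "male", "female", "female", "male")
)

toy_freq <- toy %>%
  group_by(country, gender) %>%
  # keep the grouping by country (the default behaviour), stated explicitly
  summarise(n = n(), .groups = "drop_last") %>%
  # sum(n) is therefore computed within each country
  mutate(freq = n / sum(n) * 100) %>%
  # remove the remaining grouping once it is no longer needed
  ungroup() %>%
  filter(gender == "female")

toy_freq
```

Had we used .groups = "drop" instead, sum(n) would refer to the whole data set and the percentages would no longer be shares within each country.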

Again, we could have also done that using base R:

#create cross table including country and gender info
#(named freq_table so that the object does not mask the function table())
freq_table <- as.data.frame(table(data$country, data$gender))
#create relative % values
freq_table$rel[freq_table$Var1=="France"] <- freq_table$Freq[freq_table$Var1=="France"]/
  nrow(data[data$country=="France",])*100

freq_table$rel[freq_table$Var1=="Germany"] <- freq_table$Freq[freq_table$Var1=="Germany"]/
  nrow(data[data$country=="Germany",])*100

freq_table$rel[freq_table$Var1=="Switzerland"] <- freq_table$Freq[freq_table$Var1=="Switzerland"]/
  nrow(data[data$country=="Switzerland",])*100

freq_table$rel[freq_table$Var1=="Italy"] <- freq_table$Freq[freq_table$Var1=="Italy"]/
  nrow(data[data$country=="Italy",])*100

#include only information for female respondents
freq_table <- freq_table[freq_table$Var2=="female",]

freq_table

##          Var1   Var2 Freq  rel
## 1      France female   85 34.0
## 2     Germany female   90 36.0
## 3       Italy female   82 32.8
## 4 Switzerland female   75 30.0
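A shorter base R route to the same kind of result is prop.table() with its margin argument. This is a sketch on made-up data (toy is hypothetical, not the tutorial data); useNA = "ifany" is set so that respondents with missing gender still count towards each country’s total, as in the calculation above.

```r
# Hypothetical toy data, not the tutorial data
toy <- data.frame(
  country = c("France", "France", "France", "Germany", "Germany"),
  gender  = c("female", "male", NA, "female", "male")
)

# Cross table of country by gender; useNA keeps missing values as a category
counts <- table(toy$country, toy$gender, useNA = "ifany")

# margin = 1 computes shares within each row, i.e., within each country
shares <- prop.table(counts, margin = 1) * 100

# Percentage of female respondents per country
female_share <- round(shares[, "female"], 1)
female_share
```

Without useNA, rows with missing gender would be left out of the table and the percentages would refer to respondents with known gender only.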

7.3 Takeaways

Descriptive statistics: Important commands: table(), mean(), median(), sd(), min(), max(), summary(), round(), is.na(), sum(is.na(x)), complete.cases()
Aggregating data: Important commands: group_by(), summarise(), mutate(), filter()

7.4 More tutorials on this

You still have questions? The following tutorials & papers can help you with that:

Let’s keep going: Tutorial 8: Visualizing data