7 Tutorial 7: Descriptive statistics
After working through Tutorial 7, you’ll…
- understand how to calculate basic descriptive statistics in R.
Data
For this tutorial, we’ll again use the data set we already used in Tutorial 6: Control structures & functions in R: “data_tutorial6.txt” (via OLAT/Materials/Data for R).
Remember: The data set consists of data that is completely made up - a survey with 1000 citizens in Europe.
The five variables included here are:
- country: the country in which each citizen was living at the time of the survey (France/Germany/Italy/Switzerland)
- date: the date on which each citizen was surveyed (from 2021-09-20 to 2021-10-03)
- gender: each citizen’s gender (female/male/NA)
- trust_politics: how much each citizen trusts the political system (from 1 = no trust at all to 4 = a lot of trust)
- trust_news_media: how much each citizen trusts the news media (from 1 = no trust at all to 4 = a lot of trust)
Read in the data set:
data <- read.csv2("data_tutorial 6.txt")
This is what the data looks like in R:
head(data)
## country date gender trust_politics trust_news_media
## 1 Germany 2021-09-20 female 3 1
## 2 Switzerland 2021-10-02 male 2 1
## 3 France 2021-09-21 <NA> 1 3
## 4 Italy 2021-10-03 male 2 2
## 5 Germany 2021-09-21 female 3 1
## 6 Switzerland 2021-09-20 male 1 2
7.1 Counts
For many variables - especially those with nominal or ordinal scale levels - you may want to count values. For instance, you may want to know how often each different country in our data set was mentioned. You can get this information using table().
table(data$country)
##
## France Germany Italy Switzerland
## 250 250 250 250
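If you also need relative frequencies, base R's prop.table() turns a table of counts into proportions. The following is a minimal sketch on a small made-up vector; the object names countries, counts and shares are assumptions for illustration and not part of the tutorial data set:

```r
# toy vector, made up for illustration (not the tutorial data set)
countries <- c("France", "Germany", "Germany", "Italy")

# absolute counts per value
counts <- table(countries)
counts

# relative frequencies in percent
shares <- prop.table(counts) * 100
shares
```

In this toy vector, Germany occurs twice among four values, so it accounts for 50 percent.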
7.1.1 Mean, standard deviation, etc.
For variables with metric scale levels, you may want to get other information. For instance, you may want to know the mean level of trust in politics. You already know the command mean() for getting this information:
mean(data$trust_politics)
## [1] 2.545
Or you could get the median with median() instead of the mean. This is useful, for instance, if extreme values skew the distribution of the variable, in which case the median may represent a “typical” value better than the mean:
median(data$trust_politics)
## [1] 3
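To see why extreme values matter, consider the following small sketch. The vector x is made up for illustration and contains a single extreme value:

```r
# toy values, made up for illustration: four small values and one extreme value
x <- c(1, 2, 2, 3, 100)

mean(x)   # 21.6, pulled upwards by the extreme value
median(x) # 2, not affected by the extreme value
```

Here, the median stays close to where most values lie, while the mean does not.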
Another useful value is the standard deviation via sd()…
sd(data$trust_politics)
## [1] 1.101448
… or the smallest or biggest value using min() and max():
min(data$trust_politics) #lowest value
## [1] 1
max(data$trust_politics) #highest value
## [1] 4
The summary() function returns most of this information in a single call:
summary(data$trust_politics)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.545 4.000 4.000
Lastly, you already know how to round numeric data to a number of decimals of your choice (for instance, to report them in a seminar paper):
mean(data$trust_politics) #without rounding
## [1] 2.545
round(mean(data$trust_politics),2) #round to two decimals
## [1] 2.54
7.1.2 Missing values
For many data sets, it is also important to know a little about the information that is missing from them - i.e., missing data. You may not remember, but in Tutorial 3: Objects & structures in R you already encountered a command for counting the number of missing cases for a variable of your choice (shown here in both base R and dplyr):
# base R
sum(is.na(data$gender))
## [1] 341
# dplyr
library(dplyr) #provides the pipe operator %>%
data$gender %>% is.na() %>% sum() #version in dplyr
## [1] 341
We could also reduce our data set to only complete cases using complete.cases():
data_complete <- data[complete.cases(data),]
sum(is.na(data_complete$gender))
## [1] 0
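Missing values also affect the descriptive functions introduced above: mean(), median(), sd(), min() and max() return NA as soon as their input contains a missing value. The argument na.rm = TRUE drops missing values for that one calculation without changing the data. A minimal sketch, assuming a made-up vector trust that is not part of the tutorial data set:

```r
# toy vector with one missing value, made up for illustration
trust <- c(3, 2, NA, 4, 1)

sum(is.na(trust))          # 1 missing value
mean(trust)                # NA, because of the missing value
mean(trust, na.rm = TRUE)  # 2.5, mean of the four observed values
```

Unlike complete.cases(), this approach keeps all rows in the data set and only ignores missing values within the single calculation.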
7.2 Aggregating data
In some cases, you may also wish to get descriptive statistics for several groups within the data. For instance, we may want to know how respondents’ trust in politics varies by country.
Using the dplyr structure, we use the following code to get this information:
- We define the object to which functions should be applied: data
- We hand this object to our pipeline: %>%
- We aggregate the data to a higher level, here by country: group_by(country)
- We shift the object back to the pipeline: %>%
- We create a new variable, here mean_trust_politics, which contains the respondents’ mean trust in politics for each country: summarise(mean_trust_politics = mean(trust_politics))
data %>%
  group_by(country) %>%
  summarise(mean_trust_politics = mean(trust_politics))
## # A tibble: 4 x 2
## country mean_trust_politics
## <chr> <dbl>
## 1 France 2.47
## 2 Germany 2.54
## 3 Italy 2.54
## 4 Switzerland 2.63
Again, we could have also done this with base R. The corresponding lines of code would have just been a bit longer:
mean(data$trust_politics[data$country=="France"])
## [1] 2.468
mean(data$trust_politics[data$country=="Germany"])
## [1] 2.54
mean(data$trust_politics[data$country=="Italy"])
## [1] 2.544
mean(data$trust_politics[data$country=="Switzerland"])
## [1] 2.628
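Base R can also compute all group means in a single call instead of one line per country, for example with tapply(). The following is a sketch on a small made-up data frame; toy and group_means are names assumed for illustration, and the values are not taken from the tutorial data set:

```r
# toy data frame, made up for illustration (not the tutorial data set)
toy <- data.frame(
  country = c("France", "France", "Germany", "Germany"),
  trust_politics = c(1, 3, 2, 4)
)

# apply mean() to trust_politics separately for each value of country
group_means <- tapply(toy$trust_politics, toy$country, mean)
group_means
```

tapply() takes the values to summarise, the grouping variable, and the function to apply, and returns one result per group.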
Aggregating data may also be helpful in other cases.
For instance, say we want to calculate the absolute and relative frequency with which female participants took part in the survey by country.
In short, we want to create a table containing the absolute number of female respondents in each country as well as what percentage of each country’s sample this represents.
Using the commands we already know, we could easily do that using the dplyr logic:
data %>%
#group by country, then by gender
group_by(country, gender) %>%
#create variable incl. the number of respondents for each country by gender
summarise(n=n())%>%
#create a variable containing the % of respondents for each country by gender
mutate(freq=n/sum(n)*100) %>%
#keep only values for female respondents (exclude male/NA)
filter(gender == "female")
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: country [4]
## country gender n freq
## <chr> <chr> <int> <dbl>
## 1 France female 85 34
## 2 Germany female 90 36
## 3 Italy female 82 32.8
## 4 Switzerland female 75 30
Again, we could have also done that using base R:
#create cross table including country and gender info
table <- as.data.frame(table(data$country, data$gender))
#create relative % values
table$rel[table$Var1=="France"] <- table$Freq[table$Var1=="France"]/
  nrow(data[data$country=="France",])*100
table$rel[table$Var1=="Germany"] <- table$Freq[table$Var1=="Germany"]/
  nrow(data[data$country=="Germany",])*100
table$rel[table$Var1=="Switzerland"] <- table$Freq[table$Var1=="Switzerland"]/
  nrow(data[data$country=="Switzerland",])*100
table$rel[table$Var1=="Italy"] <- table$Freq[table$Var1=="Italy"]/
  nrow(data[data$country=="Italy",])*100
#include only information for female respondents
table <- table[table$Var2=="female",]
table
## Var1 Var2 Freq rel
## 1 France female 85 34.0
## 2 Germany female 90 36.0
## 3 Italy female 82 32.8
## 4 Switzerland female 75 30.0
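A shorter base R variant is possible with prop.table(), which computes row-wise shares when called with margin = 1. The sketch below uses a small made-up data frame; the names toy, counts, shares and result are assumptions for illustration. It assumes, as in the calculation above, that percentages should refer to each country's full sample including respondents with missing gender, which is why missing values are kept as their own column via useNA = "ifany":

```r
# toy data frame, made up for illustration (not the tutorial data set)
toy <- data.frame(
  country = c("France", "France", "France", "France", "Germany", "Germany"),
  gender  = c("female", "male", "male", NA, "female", "male")
)

# cross table of country by gender; missing gender is kept as its own column
counts <- table(toy$country, toy$gender, useNA = "ifany")

# row-wise percentages: each country's row sums to 100
shares <- prop.table(counts, margin = 1) * 100

# keep only the information for female respondents
result <- data.frame(
  country = rownames(counts),
  n       = as.vector(counts[, "female"]),
  rel     = as.vector(shares[, "female"])
)
result
```

In this toy example, one of four French respondents and one of two German respondents is female, giving 25 and 50 percent.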
7.3 Take Aways
- Descriptive statistics: Important commands: table(), mean(), median(), sd(), min(), max(), summary(), is.na(), sum(is.na(x)), complete.cases(), round()
- Aggregating data: Important commands: group_by(), summarise(), mutate()
7.4 More tutorials on this
You still have questions? The following tutorials & papers can help you with that:
- R Cookbook by J.D. Long and P. Teetor, Tutorial 9
- YaRrr! The Pirate’s Guide to R by N.D. Phillips, Tutorial 6
Let’s keep going: Tutorial 8: Visualizing data