Chapter 8 Descriptive statistics

Now that we have learned more about data frames and how we can index them, we are now going to look at descriptive statistics.
Suppose we have a group with 1000 persons and their lengths, then we are likely not interested in the length of all individuals, but we are interested in measures of central tendency (such as the mean/median length of the group) and measures of variability (such as the standard deviation and variance).

Within this chapter, we will look at both measures of central tendency and measures of variability.

8.1 Measures of central tendency

8.1.1 Mean

We already saw the mean function at the beginning of the book when we first looked at functions and the mean is a typical example of a central tendency measure. The mean is obtained by adding all numbers together and then dividing by the number of observations.

If we do this with the numbers: 5, 18, 64, 208, 4, and 927 and assign them to a variable then we can use the mean() function to find out the mean of these numbers is.

numbers <- c(5, 18, 64, 208, 4, 927)
mean(numbers)
## [1] 204.3333

And then we see that the mean of these numbers is 204.33.

The mean function also works on data frames. If we load the iris dataset again, we can use the mean() function to see what the average Sepal.Length is. We can do this as follows:

data(iris)

mean(iris$Sepal.Length)
## [1] 5.843333

Here we gave a column from the iris dataset as an argument to the mean() function. Again, I used the dollar sign to select the column, but other ways will work as well:

mean(iris[["Sepal.Length"]])
## [1] 5.843333

When we use the mean function, there must be no missing values in our numbers. If there is at least 1 missing value, then the result becomes NA (not available).

numbers2 <- c(1, 4, 6, 10, NA)
mean(numbers2)
## [1] NA

We can see this in the example above where we have the numbers 1, 4, 6, 10, and NA, and then we use the mean() function the result is NA.

If we still want to receive the mean without the missing number, then there is an option for this. If we take a look at the documentation of the mean() function with:

?mean

then we see the following:

Documentation mean function

Figure 8.1: Documentation mean function

Here we see that the mean function has three arguments, namely x, trim, and na.rm. The x argument denotes an R object. In the examples above, we used the mean function on vectors. However, we can also use the mean function on columns from data frames. We also see the na.rm argument, and if this is set to TRUE then it will calculate the mean without the missing numbers.

If we use the na.rm argument in the example with the vector numbers2, we see that it still calculates the mean of the numbers 1, 4, 6, and 10:

mean(numbers2, na.rm = TRUE)
## [1] 5.25

and that this results in 5.25.

8.1.2 Median

Like the mean, the median is also a measure of central tendency. The median is always the middle number of multiple observations. As an example, we make a vector with the numbers 1 to 5 in it and save it in the variable called numbers3. Then, we can use the median() function to find out the median of this example:

Here we see that the median of this example is 3. This is correct as we have an odd number of numbers which means that the median is 3 because it has 2 numbers on the left and 2 numbers on the right.

numbers3 <- c(1, 2, 3, 4, 5)
median(numbers3)
## [1] 3

If we have an even number of observations, R will calculate the average of the middle 2 digits. To illustrate this, we create a new vector called numbers4 with the digits 1 to 6 in it, and then we use the median() function again:

numbers4 <- c(1, 2, 3, 4, 5, 6)
median(numbers4)
## [1] 3.5

Now we see that the result is 3.5. Because we don’t have a middle number anymore R will take the average from the numbers 3 and 4 which results in 3.5.

We can also use the median function on data frames. For example, we can obtain the median of the Petal.Width column from the iris dataset like this:

median(iris$Petal.Width)
## [1] 1.3

Finally, the median() function has a na.rm argument just like the mean() function in case there are missing values.

8.2 Measures of variability

8.2.1 Standard deviation

The standard deviation can be obtained by using the sd() function. If we use this function in our example numbers3 with 1 to 5 in it:

numbers3
## [1] 1 2 3 4 5
sd(numbers3)
## [1] 1.581139

Then we see that the standard deviation from this case is 1.58.

8.2.2 Variance

We also have the variance, and from the variance, we can calculate the standard deviation. The standard deviation is the square root of the variance. The higher the variance, the more the numbers deviate from the mean. We can obtain the variance in R by using the var() function. If we use the var() function on our numbers3 vector just like we did for the standard deviation:

var(numbers3)
## [1] 2.5

then the result is 2.5. And to prove that the standard deviation can be calculated by taking the square root of the variance, we can use the square root function called sqrt().

sqrt(2.5)
## [1] 1.581139

Then we see that this is indeed correct.

One thing to note about the standard deviation and the variance is that these are calculated for a sample and not for the population.

8.2.3 Interquartile range

Finally, we have the interquartile range as the last measure of variability. Usually, we refer to this as the 1st quartile (also 25th percentile), the 3rd quartile (also 75th percentile), and the 2nd quartile which is the median. If the median is the middle number, then the 1st quartile is the 1/4th number, and the 3rd quartile is the 3/4 number. The interquartile range is calculated as the difference between the 3rd quartile and the 1st quartile.

There are 2 ways in R to calculate the interquartile range. We can use the IQR() function which means interquartile range. To demonstrate this function, we create a vector with multiple observations called example:

example <- c(1, 4, 6, 8, 9, 11, 12, 13, 16, 17, 18, 20, 26)

The median of these numbers is 12 because that is the middle number.

median(example)
## [1] 12

The interquartile range is the difference between the 3/4th number and the 1/4th number. In this case, these numbers are 17 and 8. If we use the IQR() function:

IQR(example)
## [1] 9

then we see that the result is 9 (17-8).

If we want to see the 1/4th number and 3/4th number, then we can use the quantile() function. This function takes the data and then the number you want in fractional form as arguments. For example, if you want the 3/4th number, then we type the following:

quantile(example, 0.75)
## 75% 
##  17

Here we see that the 3rd quartile is 17. Again, we can do the same for the 1st quartile:

quantile(example, 0.25)
## 25% 
##   8

And then, we can calculate the interquartile range as the difference between the 3/4th number and the 1/4th number, and that is 17-8 = 9.

8.3 Summary

Another easy way to obtain the minimum, the 1st quartile, the median, the mean, the 3rd quartile, and the maximum of numeric variables is by using the summary() function. For example, if we use the summary() function for the variable example:

summary(example)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   12.00   12.38   17.00   26.00

then we get all these numbers back. This summary() function can be used on vectors as well as on data frames. For example, if we use the summary() function on our iris dataset, then we see that it returns all these numbers for us per column.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

For example, we see that the average for Sepal.Length is 5.84, and that the maximum Petal.Length is 6.9. One thing to note is that we only get these back for numerical variables. For example, if we look at Species (which is a factor), we see that we only get back how often it occurs. And that makes sense because we cannot calculate a maximum species or average species.

8.4 Functions and the apply functions

Within this book, we have mainly applied functions, but we have not yet written these ourselves. Now we are going to look at how we can make functions, how we can execute them, and how we can apply functions over multiple things with the already existing apply() functions.

We can create a function by assigning the function to a variable. For this example, we are going to make a function that adds 2 for each number, and we will call this function add2.

A function always starts with function() and can have one or multiple inputs within the brackets. Afterward, a function will start with “{” and end with “}”. It is important that “{” should be on the same line after function(), and “}” should be on a new line at the very end. These { } are called curly brackets. Within these curly brackets, you will find the code that will be applied to the input. If we look at the following example:

add2 <- function(x) { 
  x + 2
}

then we see that we have created a function called add2. This function takes as input x and then adds 2 to x. X can be anything, if we only specify a single number it will work. Additionally, we can also specify multiple numbers for x, or we can specify a column of a data frame, for example.

If we enter the number 18 as x, and then we use the function add2() and put the number 18 inside the brackets:

add2(18)
## [1] 20

we see that the output is now 20 which corresponds to 18+2.

We can specify multiple numbers for this function by using a vector which we call examplevector for now, and that will work as well:

examplevector <- c(10, 15, 18, 27, 30, 40)
add2(examplevector)
## [1] 12 17 20 29 32 42

Here we see that the function has added 2 to every number in the vector.

This example function was quite simple, but you can make a function as extensive or as difficult as you want it to be. If we had typed 18+2 or examplevector + 2, we could have achieved the same thing without writing a function as we can see below.

examplevector + 2
## [1] 12 17 20 29 32 42

Now we move on to the apply() functions. These are several functions that all have “apply” in the name. We have the normal apply(), but also lapply(), sapply(), tapply(), and a few others that can all be used for slightly different purposes. Within this book, we will only discuss the 4 apply functions listed above. These apply functions are mainly used to apply a function over several things, and we will now take a look at some examples.

We start with the apply() function, and to show what this function does, we create a 3 by 3 matrix with the numbers 1 to 9 in it and call it examplematrix.

examplematrix <- matrix(1:9, nrow = 3)
examplematrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Now we have a 3 by 3 matrix with the numbers 1 through 9 in it. Suppose for example, that you want to know the sum of all numbers in a row or a column, then we can use the apply() function for this purpose.

The apply function takes as input X for the data, and MARGIN to indicate whether it should be per column or row, and lastly a function. If we want the sum of the examplematrix per row we can specify MARGIN = 1 to indicate that it should be per row, and we can specify sum to indicate that we want the sum of all these numbers:

apply(X = examplematrix, MARGIN = 1, sum)
## [1] 12 15 18

As output, we see 12, 15, and 18. We can deduce that the 12 is obtained by adding 1 + 4 + 7 and that 15 is obtained by 2 + 5 + 8 and so on. Furthermore, we can also do this per column and the only thing we need to change is the MARGIN. If you specify MARGIN = 2, then the apply() function will do everything on a column basis and if it is set to 1 then it will apply everything per row basis.

apply(X = examplematrix, MARGIN = 2, sum)
## [1]  6 15 24

If we run the sum function again on a column basis, then we see 6, 15, and 24 as output and this corresponds to adding all numbers per column.

For now, we only looked at the sum function, but you can apply all standard functions such as median, mean, minus, and so on. You can even specify your own functions here.

lapply()

The lapply() function expects a list as input X followed by a function. It is easy to remember that the “l” of lapply() is for lists. For example, if we look at an example with a vector with the numbers 1 to 5 in it and we want to get the sum of these numbers with the lapply function. Then, it will not work very well:

examplevector2 <- c(1:5)
lapply(X = examplevector2, sum)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

As output, we see that we got a list back and that it didn’t do what we wanted to achieve in the first place. This is because the lapply() function expects a list as input and not a vector. To show what the lapply() function does, we make 3 vectors called A, B, and C with the numbers 1 to 10, 10 to 19, and 20 to 29 in them respectively. Then, we combine these vectors into a list called examplelist.

A <- c(1:10)
B <- c(10:19)
C <- c(20:29)
examplelist <- list(A, B, C)
examplelist
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##  [1] 10 11 12 13 14 15 16 17 18 19
## 
## [[3]]
##  [1] 20 21 22 23 24 25 26 27 28 29

Now, if we use lapply() and specify the example list as data followed by the sum function, then this will happen:

lapply(X = examplelist, sum)
## [[1]]
## [1] 55
## 
## [[2]]
## [1] 145
## 
## [[3]]
## [1] 245

As output, we see that all numbers are added per element of the list. For example, the numbers 1 to 10 were in [[1]] and the numbers 10 to 19 in [[2]]. Furthermore, we see that R will return a list as output. This means that the sums of the numbers from the list are in double parentheses as well per element.

sapply()

Sapply() works the same as lapply(), but it simplifies the output. The output of lapply is a list as we saw earlier, and if we use sapply, then it will change this output to vector when possible. If we use the sapply() function with the same example, then we see the following.

sapply(X = examplelist, sum)
## [1]  55 145 245

Now, we see that the sums are still the same, but the output is no longer in double brackets like with lists, but now it returns the output as a vector.

tapply()

The tapply function is specifically for factors. To show an example of this, we load the iris dataset again.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The iris dataset had a factor variable called Species, and this indicated the kind of iris flower. If we are interested in the median Sepal.Width per iris species, then we can use the tapply() function.

The tapply() function takes as input: X for the data, INDEX to specify the factor variable, and last a function.

So, if we want the median Sepal.Width per iris species, we can type the following:

tapply(X = iris$Sepal.Width, INDEX = iris$Species, median)
##     setosa versicolor  virginica 
##        3.4        2.8        3.0

Here we specified the Sepal.Width column from the iris dataset as X argument, then the Species column from the iris dataset as the index and we specified the median function to get the median.

The output shows that the setosa species has the highest median of Sepal.Width.

It is important to mention that the apply functions assume that all data is either numeric or logical. If you want to apply the mean function on, for example, factors or characters, then you will get NA as a result. For example, if we use sapply() to get the mean of all columns:

sapply(X = iris, mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##     5.843333     3.057333     3.758000     1.199333           NA

We see a warning that the argument is not numeric or a logical data type. R will still return the output for the rest of the columns, so we can see the average values for Sepal.Length, Sepal.Width, and so on for the iris dataset.

So, we need to give only numeric columns as data arguments for the apply functions. Since we are working with a small dataset with only 5 columns, we know that the 5th column species is a factor. We can select the numeric columns in this way:

head(iris[, 1:4])
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

I used the head() function here to show the first 6 observations, but if you omit the head() function, then it will select the numeric columns only.

If we specify iris[, 1:4] for the previous example as data argument, then it will work fine and we will no longer have warnings:

sapply(X = iris[, 1:4], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Furthermore, there are several ways to select numeric columns only. The dplyr library has a useful function that we can use for that. First, we will need to install this package, and then we can load it again in the following way:

library(dplyr)

In the dplyr package, there is a function called select_if and this function selects things if it meets the criteria. We can type select_if(iris, is.numeric) to select the numeric columns from the iris dataset.

So, if we enter that as the X argument we can do the same thing again. This solution is better if we have a large dataset, and if we don’t want to manually select the numeric columns. Again, if we use this within the sapply function, we see that the result is the same:

sapply(X = select_if(iris, is.numeric), mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Once again, this is a good moment to show that we can also use the pipe operator %>%. For example, we can first select all numeric data, and then use it for our sapply() function:

select_if(iris, is.numeric) %>%
  sapply(mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

This is especially helpful if we use multiple functions in a row because it makes the code easier to read.