6.2 Summary statistics

Ok, now that we can create vectors, let’s learn the basic descriptive statistics functions. We’ll start with functions that apply to continuous data. Continuous data is data that, generally speaking, can take on an infinite number of values. Height and weight are good examples of continuous data. Table 6.1 contains common functions for continuous, numeric vectors. Each of them takes a numeric vector as an argument, and most return a scalar (range() returns a vector containing the minimum and maximum, and summary() returns a table).

Table 6.1: Summary statistic functions for continuous data.
Function | Example | Result
sum(x), prod(x) | sum(1:10) | 55
min(x), max(x) | min(1:10) | 1
mean(x), median(x) | mean(1:10) | 5.5
sd(x), var(x), range(x) | sd(1:10) | 3.03
quantile(x, probs) | quantile(1:10, probs = .2) | 2.8
summary(x) | summary(1:10) | Min. = 1.00, 1st Qu. = 3.25, Median = 5.50, Mean = 5.50, 3rd Qu. = 7.75, Max. = 10.00

Let’s calculate some descriptive statistics from some pirate-related data. I’ll create a vector called tattoos that contains the number of tattoos on 10 random pirates.

tattoos <- c(4, 50, 2, 39, 4, 20, 4, 8, 10, 100)

Now, we can calculate several descriptive statistics on this vector by using the summary statistics functions:

min(tattoos)
## [1] 2
mean(tattoos)
## [1] 24
sd(tattoos)
## [1] 31
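
The other functions in Table 6.1 work the same way. Here is a quick sketch applying a few of them to the same data. I re-create the tattoos vector so the chunk runs on its own, and the object name tattoo_quartiles is just one I made up for this example. quantile() is called with R's default method.

```r
# Re-create the tattoos vector from above
tattoos <- c(4, 50, 2, 39, 4, 20, 4, 8, 10, 100)

max(tattoos)      # Largest value
median(tattoos)   # Middle value of the sorted data
range(tattoos)    # Minimum and maximum, as a two-element vector

# 25th and 75th percentiles (hypothetical object name, R's default method)
tattoo_quartiles <- quantile(tattoos, probs = c(.25, .75))
tattoo_quartiles

summary(tattoos)  # Several of the above in one table
```

Notice that the median (9) is far below the mean (24.1): the one pirate with 100 tattoos pulls the mean upward, while the median barely notices him.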

6.2.1 length()

Figure 6.1: According to this article published in 2015 in Plos One, when it comes to people, length may matter for some. But trust me, for vectors it always does.

Vectors have one dimension: their length. Later on, when you combine vectors into higher-dimensional objects, like matrices and dataframes, you will need to make sure that all the vectors you combine have the same length. But, when you want to know the length of a vector, don’t stare at your computer screen and count the elements one by one! (That said, I must admit that I still do this sometimes…). Instead, use the length() function. The length() function takes a vector as an argument, and returns a scalar representing the number of elements in the vector:

a <- 1:10
length(a)  # How many elements are in a?
## [1] 10

b <- seq(from = 1, to = 100, length.out = 20)
length(b)  # How many elements are in b?
## [1] 20

length(c("This", "character", "vector", "has", "six", "elements."))
## [1] 6
length("This character scalar has just one element.")
## [1] 1

Get used to the length() function, people: you’ll be using it a lot!
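
One place length() shows up all the time is inside other calculations. As a small sketch (the vector values and the name manual_mean are made up for illustration), here is the mean computed by hand as the sum of the elements divided by the number of elements:

```r
# A small numeric vector to work with (values chosen for illustration)
a <- c(2, 4, 6, 8, 10)

# The mean is the sum of the elements divided by the number of elements
manual_mean <- sum(a) / length(a)
manual_mean

# Check that this matches the built-in mean() function
manual_mean == mean(a)
```

In practice you would just call mean(), but the same pattern is handy whenever a calculation depends on how many elements a vector has.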

6.2.2 Additional numeric vector functions

Table 6.2 contains additional functions that you will find useful when managing numeric vectors:

Table 6.2: Vector summary functions for continuous data.
Function | Description | Example | Result
round(x, digits) | Round elements in x to digits digits | round(c(2.231, 3.1415), digits = 1) | 2.2, 3.1
ceiling(x), floor(x) | Round elements in x up (or down) to the nearest integer | ceiling(c(5.1, 7.9)) | 6, 8
x %% y | Modular arithmetic (i.e. x mod y) | 7 %% 3 | 1
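
To see the functions in Table 6.2 side by side, here is a short sketch on one vector. The values in x and the name remainders are made up for this example.

```r
# Some decimal values to round (chosen for illustration)
x <- c(2.231, 3.1415, 5.1, 7.9)

round(x, digits = 1)  # Round to one decimal place
ceiling(x)            # Round up to the next integer
floor(x)              # Round down to the previous integer

# Modular arithmetic: the remainder after dividing by 2
# is 0 for even numbers and 1 for odd numbers
remainders <- 1:6 %% 2
remainders
```

Like most functions in this chapter, these work on every element of the vector at once, so each call returns a vector with the same length as its input.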

6.2.3 Sample statistics from random samples

Now that you know how to calculate summary statistics, let’s take a closer look at how R draws random samples using the rnorm() and runif() functions. In the next code chunk, I’ll use rnorm() to draw a sample of 5 values from a Normal distribution with a mean of 10 and a standard deviation of 5. I’ll then calculate summary statistics from this sample using mean() and sd():

# A sample of 5 values from a Normal dist with mean = 10 and sd = 5
x <- rnorm(n = 5, mean = 10, sd = 5)

# What are the mean and standard deviation of the sample?
mean(x)
## [1] 11
sd(x)
## [1] 2.5

As you can see, the mean and standard deviation of our sample vector are in the neighborhood of the population values of 10 and 5, but they aren’t exactly the same because these are sample data. With only 5 values, the sample standard deviation of 2.5 is quite far off. If we take a much larger sample (say, 100,000 values), the sample statistics should get much closer to the population values:

# A sample of 100,000 values from a Normal dist with mean = 10, sd = 5
y <- rnorm(n = 100000, mean = 10, sd = 5)

mean(y)
## [1] 10
sd(y)
## [1] 5

Yep, sure enough our new sample y (containing 100,000 values) has a sample mean and standard deviation much closer (almost identical) to the population values than our sample x (containing only 5 values). This is an example of what is called the law of large numbers. Google it.
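
If you'd like to watch this happen step by step, here is a rough sketch that repeats the exercise for several sample sizes. The sample sizes, the seed value, and the names sizes, results and u are my own choices for illustration. set.seed() just makes the random draws repeatable. The last lines do the same check with runif(), where the population mean of a uniform distribution between 0 and 10 is 5.

```r
# Fix the random seed so the draws are repeatable (seed value is arbitrary)
set.seed(100)

# Sample sizes to compare (chosen for illustration)
sizes <- c(5, 50, 500, 5000, 50000)

# For each size, draw a Normal sample and record its mean and sd
results <- t(sapply(sizes, function(n) {
  draws <- rnorm(n = n, mean = 10, sd = 5)
  c(n = n, mean = mean(draws), sd = sd(draws))
}))

# One row per sample size; the mean and sd columns
# should settle toward 10 and 5 as n grows
results

# Same idea with a uniform distribution between 0 and 10
u <- runif(n = 100000, min = 0, max = 10)
mean(u)  # Should be close to the population mean of 5
```

The small samples can land quite far from 10 and 5, while the rows for the largest samples should sit very close to the population values.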