6.2 Summary statistics
Ok, now that we can create vectors, let’s learn the basic descriptive statistics functions. We’ll start with functions that apply to continuous data. Continuous data is data that, generally speaking, can take on an infinite number of values. Height and weight are good examples of continuous data. Table 6.1 contains common functions for continuous, numeric vectors. Each of them takes a numeric vector as an argument and returns either a scalar or (in the case of summary()) a table as a result.
Let’s calculate some descriptive statistics from some pirate-related data. I’ll create a vector called tattoos that contains the number of tattoos from 10 random pirates.
tattoos <- c(4, 50, 2, 39, 4, 20, 4, 8, 10, 100)
Now, we can calculate several descriptive statistics on this vector by using the summary statistics functions:
min(tattoos)
## [1] 2
mean(tattoos)
## [1] 24
sd(tattoos)
## [1] 31
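As a further sketch, here are a few other summary functions applied to the same vector. The choice of median(), max(), sum(), and summary() is an assumption about which functions Table 6.1 lists; all four are base R functions. The vector is redefined so the chunk runs on its own.

```r
# Number of tattoos from 10 pirates (same values as in the text)
tattoos <- c(4, 50, 2, 39, 4, 20, 4, 8, 10, 100)

median(tattoos)   # middle value of the sorted vector: 9
max(tattoos)      # largest value: 100
sum(tattoos)      # total across all 10 pirates: 241

# summary() returns several statistics at once:
# minimum, quartiles, median, mean, and maximum
summary(tattoos)
```

Notice that the median (9) is far below the mean (about 24): a few heavily tattooed pirates pull the mean upward.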
6.2.1 length()
Vectors have one dimension: their length. Later on, when you combine vectors into higher-dimensional objects, like matrices and dataframes, you will need to make sure that all the vectors you combine have the same length. But when you want to know the length of a vector, don’t stare at your computer screen and count the elements one by one! (That said, I must admit that I still do this sometimes…) Instead, use the length() function. The length() function takes a vector as an argument and returns a scalar representing the number of elements in the vector:
a <- 1:10
length(a) # How many elements are in a?
## [1] 10
b <- seq(from = 1, to = 100, length.out = 20)
length(b) # How many elements are in b?
## [1] 20
length(c("This", "character", "vector", "has", "six", "elements."))
## [1] 6
length("This character scalar has just one element.")
## [1] 1
Get used to the length() function, people. You’ll be using it a lot!
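To make the point about matching lengths concrete, here is a minimal sketch that checks two vectors with length() before combining them into a dataframe. The vector names and values (pirate_name, pirate_age) are made up for illustration.

```r
# Hypothetical data: two vectors meant to describe the same 4 pirates
pirate_name <- c("Astrid", "Lea", "Sarina", "Remon")
pirate_age  <- c(30, 25, 25, 29)

# Check that the vectors have the same length before combining them
length(pirate_name) == length(pirate_age)   # TRUE: both have 4 elements

# Combine them into a dataframe with one row per pirate
pirates <- data.frame(name = pirate_name, age = pirate_age)
nrow(pirates)   # 4 rows, one per element of each vector
```

A quick length() comparison like this is a cheap way to catch a missing or extra value before it causes trouble further down the line.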
6.2.2 Additional numeric vector functions
Table 6.2 contains additional functions that you will find useful when managing numeric vectors:
||round(x, digits)||Round elements in x to digits decimal places||
||ceiling(x), floor(x)||Round elements in x to the next highest (or lowest) integer||
||x %% y||Modular arithmetic (i.e. x mod y)||
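The functions that Table 6.2 appears to describe, namely round(), ceiling(), floor(), and the %% operator, can be tried out as follows. The input values are made up for illustration.

```r
# Round each element to 1 decimal place
round(c(2.231, 3.1415), digits = 1)   # 2.2 3.1

# Round each element up or down to the nearest integer
ceiling(c(5.1, 7.9))   # 6 8
floor(c(5.1, 7.9))     # 5 7

# Remainder after dividing 7 by 3
7 %% 3   # 1
```

Like most R functions, these are vectorized: give them a vector and they operate on every element at once.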
6.2.3 Sample statistics from random samples
Now that you know how to calculate summary statistics, let’s take a closer look at how R draws random samples using the rnorm() and runif() functions. In the next code chunk, I’ll draw a vector of 5 values from a Normal distribution with a mean of 10 and a standard deviation of 5. I’ll then calculate summary statistics from this sample using mean() and sd():
# 5 samples from a Normal dist with mean = 10 and sd = 5
x <- rnorm(n = 5, mean = 10, sd = 5)
# What are the mean and standard deviation of the sample?
mean(x)
## [1] 11
sd(x)
## [1] 2.5
As you can see, the mean and standard deviation of our sample vector are in the neighborhood of the population values of 10 and 5, but they aren’t exactly the same because this is a small random sample (your own values will differ each time you run the code). If we take a much larger sample (say, 100,000), the sample statistics should get much closer to the population values:
# 100,000 samples from a Normal dist with mean = 10, sd = 5
y <- rnorm(n = 100000, mean = 10, sd = 5)
mean(y)
## [1] 10
sd(y)
## [1] 5
Yep, sure enough our new sample y (containing 100,000 values) has a sample mean and standard deviation much closer (almost identical) to the population values than our sample x (containing only 5 values). This is an example of what is called the law of large numbers. Google it.
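If you want to watch this happen across several sample sizes at once, here is a minimal sketch. The seed value, the particular sample sizes, and the object names (sample_sizes, results) are arbitrary choices for illustration.

```r
# Fix the random seed so the sketch is reproducible (the seed value is arbitrary)
set.seed(100)

# Sample sizes to compare
sample_sizes <- c(5, 100, 10000, 1000000)

# For each sample size, draw from a Normal dist with mean = 10 and sd = 5,
# then record the sample mean and standard deviation
results <- sapply(sample_sizes, function(n) {
  draws <- rnorm(n = n, mean = 10, sd = 5)
  c(mean = mean(draws), sd = sd(draws))
})
colnames(results) <- sample_sizes

# One column per sample size, one row per statistic
results
```

Reading the columns from left to right, the sample mean and standard deviation should typically settle closer and closer to 10 and 5 as the sample size grows.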