## 6.5 Standardization (z-score)

A common task in statistics is to standardize variables – also known as calculating z-scores. Standardizing a vector puts it on a common scale, which allows you to compare it to other (standardized) variables. To standardize a vector, you subtract its mean from each value, and then divide the result by the vector’s standard deviation.

If the concept of z-scores is new to you – don’t worry. In the next worked example, you’ll see how it can help you compare two sets of data. But for now, let’s see how easy it is to standardize a vector using basic arithmetic.

Let’s say you have some data. We’ll assign it to a new vector called `a`, then calculate its mean and standard deviation with the `mean()` and `sd()` functions:

```
a <- c(5, 3, 7, 5, 5, 3, 4)
mean(a)
## [1] 4.6
sd(a)
## [1] 1.4
```

Ok. Now we’ll create a new vector called `a.z`, which is a standardized version of `a`. To do this, we’ll simply subtract the mean of the vector, then divide by the standard deviation.

`a.z <- (a - mean(a)) / sd(a)`

Now let’s look at the standardized values:

```
a.z
## [1] 0.31 -1.12 1.74 0.31 0.31 -1.12 -0.41
```
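If you standardize vectors often, the same arithmetic can be wrapped in a small helper. The sketch below uses a hypothetical function name, `standardize()`, which is not part of base R (`a.scaled` is likewise just an illustrative name). It also compares the result to base R’s `scale()`, which centers and scales by default but returns a one-column matrix rather than a plain vector.

```r
# Example data from the text
a <- c(5, 3, 7, 5, 5, 3, 4)

# Hypothetical helper (not a base R function): returns the z-scores of x
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

a.z <- standardize(a)

# Base R's scale() does the same centering and scaling, but returns
# a matrix, so convert it back to a plain vector before comparing
a.scaled <- as.numeric(scale(a))

all.equal(a.z, a.scaled)
## [1] TRUE
```

Writing the arithmetic out by hand, as in the text, is still the clearest way to see what a z-score is; the helper just saves typing.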

The mean of `a.z` should now be 0, and the standard deviation of `a.z` should now be 1. Let’s make sure:

```
mean(a.z)
## [1] 2e-16
sd(a.z)
## [1] 1
```

Sweet. Oh, don’t worry that the mean of `a.z` doesn’t look like exactly zero. Written without scientific notation, the result is 0.000000000000000198. For all intents and purposes, that’s 0. The result is not exactly 0 because R, like most software, stores numbers in binary floating point with a limited number of digits, so tiny rounding errors can creep into arithmetic.
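To see this kind of rounding error in isolation, here is a minimal sketch using only base R. It shows a classic case where an exact comparison with `==` fails, and how `all.equal()`, which compares numbers within a small tolerance, can be used instead to check that a result is “close enough” to what you expect.

```r
a <- c(5, 3, 7, 5, 5, 3, 4)
a.z <- (a - mean(a)) / sd(a)

# Decimal fractions such as 0.1 have no exact binary representation,
# so even simple sums can be off by a tiny amount
0.1 + 0.2 == 0.3
## [1] FALSE

# all.equal() compares numbers within a small tolerance instead
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE

# The same tolerant comparison confirms the standardized mean is 0
isTRUE(all.equal(mean(a.z), 0))
## [1] TRUE
```

As a rule of thumb, avoid testing computed decimal numbers with `==`; a tolerant comparison is usually what you actually mean.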

### 6.5.1 Ex: Evaluating a competition

Your gluten-intolerant first mate just perished in a tragic soy sauce incident and it’s time to promote another member of your crew to the newly vacated position. Of course, only two qualities really matter for a pirate: rope-climbing, and grogg drinking. Therefore, to see which of your crew deserves the promotion, you decide to hold a climbing and drinking competition. In the climbing competition, you measure how many feet of rope a pirate can climb in an hour. In the drinking competition, you measure how many mugs of grogg they can drink in a minute. Five pirates volunteer for the competition – here are their results:

| pirate | grogg (mugs per minute) | climbing (feet per hour) |
|---|---|---|
| Heidi | 12 | 100 |
| Andrew | 8 | 520 |
| Becki | 1 | 430 |
| Madisen | 6 | 200 |
| David | 2 | 700 |

We can represent the main results with two vectors, `grogg` and `climbing`:

```
grogg <- c(12, 8, 1, 6, 2)
climbing <- c(100, 520, 430, 200, 700)
```

Now you’ve got the data, but there’s a problem: the scales of the numbers are very different. While the grogg numbers range from 1 to 12, the climbing numbers have a much larger range from 100 to 700. This makes it difficult to compare the two sets of numbers directly.
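To put numbers on the difference in scale, a quick sketch with base R’s `range()` and `sd()` shows how far apart the two sets of results are, both in their extremes and in their spread:

```r
grogg <- c(12, 8, 1, 6, 2)
climbing <- c(100, 520, 430, 200, 700)

# Smallest and largest value in each event
range(grogg)
## [1]  1 12
range(climbing)
## [1] 100 700

# Spread of each event, in its own units
round(sd(grogg), 1)
## [1] 4.5
round(sd(climbing), 1)
## [1] 242.3
```

A difference of 5 is large for grogg but tiny for climbing, which is why the raw numbers can’t be compared or averaged directly.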

To solve this problem, we’ll use standardization. Let’s create new standardized vectors called `grogg.z` and `climbing.z`:

```
grogg.z <- (grogg - mean(grogg)) / sd(grogg)
climbing.z <- (climbing - mean(climbing)) / sd(climbing)
```

Now let’s look at the final results:

```
grogg.z
## [1] 1.379 0.489 -1.068 0.044 -0.845
climbing.z
## [1] -1.20 0.54 0.17 -0.78 1.28
```

It looks like there were two outstanding performances in particular. In the grogg drinking competition, the first pirate (Heidi) had a z-score of about 1.4. We can interpret this by saying that Heidi drank 1.4 standard deviations more grogg than the average pirate. In the climbing competition, the fifth pirate (David) had a z-score of about 1.3. Here, we would conclude that David climbed 1.3 standard deviations more rope than the average pirate.
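If “standard deviations above the average” still feels abstract, it can help to translate a z-score back into the original units. The following sketch does this for Heidi’s grogg score and then reverses the standardization for the whole vector; the name `grogg.raw` is introduced here purely for illustration.

```r
grogg <- c(12, 8, 1, 6, 2)
grogg.z <- (grogg - mean(grogg)) / sd(grogg)

# Heidi's z-score: about 1.38 standard deviations above the mean
grogg.z[1]

# The same distance in mugs: about 6.2 mugs above the mean of 5.8
grogg.z[1] * sd(grogg)

# Reversing the standardization recovers the original data
grogg.raw <- grogg.z * sd(grogg) + mean(grogg)
all.equal(grogg.raw, grogg)
## [1] TRUE
```

In other words, standardizing loses no information about the ordering or relative spacing of the scores; it only changes the units.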

But which pirate was the best on average across both events? To answer this, let’s create a combined z-score for each pirate by averaging their z-scores across the two events. We’ll do this by adding the two z-scores and dividing by two. This will tell us how well, on average, each pirate did relative to their fellow pirates.

`average.z <- (grogg.z + climbing.z) / 2`

Let’s look at the result:

```
round(average.z, 1)
## [1] 0.1 0.5 -0.5 -0.4 0.2
```

The highest average z-score belongs to the second pirate (Andrew), who had an average z-score of 0.5. The first and last pirates, who each did well in one event, did poorly in the other.
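Rather than counting positions in the output by eye, you could attach the pirates’ names to the vector and let R find the winner. This is a sketch using base R’s `names()`, `sort()`, and `which.max()`; the `pirate` vector is an assumption built from the names in the results table.

```r
pirate <- c("Heidi", "Andrew", "Becki", "Madisen", "David")
grogg <- c(12, 8, 1, 6, 2)
climbing <- c(100, 520, 430, 200, 700)

grogg.z <- (grogg - mean(grogg)) / sd(grogg)
climbing.z <- (climbing - mean(climbing)) / sd(climbing)
average.z <- (grogg.z + climbing.z) / 2

# Attach names so the output is easier to read
names(average.z) <- pirate

# Sort from best to worst average performance
# (order: Andrew, David, Heidi, Madisen, Becki)
round(sort(average.z, decreasing = TRUE), 1)

# Name of the pirate with the highest average z-score
names(which.max(average.z))
## [1] "Andrew"
```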

Moral of the story: promote the pirate who can drink *and* climb.