4 Descriptive Statistics

Descriptive statistics are a first step from raw data towards something more meaningful. The most common descriptive statistics either identify the middle of the data (mean, median) or describe how spread out the data is around the middle (percentiles, standard deviation). The statistics we calculate as descriptive statistics will be useful for many of the more advanced lessons we’ll encounter later, but they are important on their own as well.

Descriptive statistics are useful for exactly what the name suggests: describing something, specifically data. Why does data need to be described? Because raw data is difficult to digest, and a single data point doesn’t tell us very much.

Table continues below
  district school county grades students
410 61747 Moraga Elementary Contra Costa KK-08 1885
Table continues below
  teachers calworks lunch computer expenditure income
410 98.42 0.0506 0.3033 241 5593 31.05
  english read math
410 1.963 697.4 695.7

Let’s say I have data on all schools in California. I can look at the raw data and see that Moraga Elementary has 1885 students. That’s good to know, but it raises questions. Is that a lot? A raw data point like that without context is generally useless. Someone in that school district might be able to appreciate what the number 1885 students means implicitly. You might be thinking about whether it’s larger or smaller than you remember your elementary school being, or the school of a child/relative.

Adding more raw data doesn’t necessarily help either. Let’s look at the student enrollments of all the schools.

195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, 987, 103, 487, 649, 852, 491, 421, 6880, 2688, 440, 475, 2538, 476, 2357, 1588, 7306, 2601, 847, 452, 4142, 2102, 10012, 2488, 25151, 2267, 1657, 284, 5370, 2471, 15386, 184, 1217, 6219, 4258, 1235, 16244, 814, 27176, 10696, 8935, 1600, 9028, 10625, 7151, 2404, 5804, 2253, 2807, 3074, 723, 5138, 20927, 3017, 957, 1639, 4340, 5079, 6639, 1154, 237, 2987, 499, 11474, 1088, 2660, 353, 329, 252, 175, 3835, 314, 4458, 1313, 474, 1114, 1358, 11629, 6195, 499, 417, 300, 457, 146, 460, 354, 1841, 3760, 500, 5112, 146, 2141, 610, 337, 4501, 5718, 19402, 3401, 2621, 426, 205, 13668, 342, 6518, 239, 2911, 6272, 10218, 1735, 474, 544, 1987, 418, 196, 2208, 1255, 1469, 7114, 1962, 7761, 216, 224, 7887, 752, 9328, 548, 104, 275, 443, 10337, 806, 227, 8416, 149, 220, 4612, 590, 133, 2440, 133, 519, 222, 285, 3129, 2019, 5620, 9775, 246, 7210, 21338, 477, 727, 374, 18255, 8787, 797, 140, 235, 8294, 2409, 150, 3981, 2326, 501, 470, 575, 3519, 474, 223, 92, 4971, 2617, 242, 780, 324, 140, 181, 516, 108, 419, 12567, 287, 6201, 577, 170, 164, 382, 1221, 2214, 4523, 793, 1678, 536, 307, 347, 168, 532, 3272, 2045, 156, 1129, 3669, 157, 4928, 103, 175, 4153, 280, 865, 8735, 412, 6373, 332, 2903, 565, 586, 5068, 859, 145, 649, 1789, 775, 777, 3518, 19294, 7661, 158, 117, 160, 511, 2770, 551, 5205, 6437, 1712, 370, 3182, 139, 11855, 1068, 2295, 1510, 579, 1012, 1212, 119, 590, 546, 248, 461, 6312, 285, 2325, 11885, 564, 12380, 3772, 895, 5714, 105, 1449, 510, 160, 433, 3186, 5010, 717, 3548, 868, 507, 822, 1792, 1202, 515, 1354, 1252, 823, 2231, 271, 309, 3005, 966, 7710, 762, 1708, 9850, 129, 10619, 4521, 580, 2569, 6022, 670, 14708, 6601, 675, 2458, 144, 573, 721, 992, 8432, 244, 7116, 830, 160, 1588, 2272, 1425, 245, 1349, 400, 4632, 224, 576, 451, 900, 118, 1457, 4734, 3303, 6055, 424, 2801, 187, 129, 188, 1212, 2596, 4925, 6257, 868, 3787, 6423, 678, 162, 8529, 1862, 1452, 155, 2536, 567, 953, 296, 198, 218, 734, 189, 2528, 
2987, 208, 379, 145, 706, 878, 594, 139, 2089, 326, 516, 449, 297, 1579, 383, 81, 5259, 1960, 151, 946, 2707, 919, 945, 738, 164, 167, 125, 1091, 134, 600, 1803, 158, 2392, 526, 141, 235, 3280, 1254, 948, 15228, 81, 2768, 535, 2542, 1940, 1059, 2340, 3469, 2106, 478, 1885, 2422, 1318, 220, 687, 2341, 984, 3724, 441, 101 and 1778

Now we have a lot more data, but it doesn’t tell us much. It’s more than we can quickly interpret. We can’t judge whether Moraga is bigger than normal or smaller, because it’s hard to get a grasp of what the data is telling us or find trends with just raw figures.

We can better appreciate Moraga’s enrollment through calculating some descriptive statistics for California schools in order to supply the context. We can produce summary statistics with the command summary(), and use pander() to make the output more visually appealing. pander() on its own won’t do anything, but if it is wrapped around summary() we get a prettier table.

If you want to use pander in your own tables, you will need to install it once on your computer with install.packages("pander") and make sure that it’s active by running library(pander).
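As a minimal sketch of that workflow, the code below runs summary() on a small sample and only calls pander() if the package is installed. The vector name students and the choice of the first ten enrollment figures from the list above are assumptions made for illustration; the chapter’s own table uses the full data.

```r
# Hypothetical sample: the first ten enrollment figures from the list above
students <- c(195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247)

# Base R summary: minimum, quartiles, median, mean, and maximum
enrollment_summary <- summary(students)
print(enrollment_summary)

# pander() only changes how the table is rendered,
# so it is skipped here if the package is not installed
if (requireNamespace("pander", quietly = TRUE)) {
  pander::pander(enrollment_summary)
}
```

With only ten values the numbers will not match the full table, but the layout of the output is the same.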

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    81.0   379.0   950.5  2629.0  3008.0 27180.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
81 379 950.5 2629 3008 27180

This is a fairly typical list of summary statistics. We should start in the middle, with two values that can be used to measure ‘central tendency’. Median and mean are both designed to help us understand the middle of the data.

4.1 Median

  • Median. This is the middle value of your data. If you lined up 9 numbers in a row (from lowest to highest) it would be the 5th number in that line. If you had 45, it would be the 23rd number. It’s always the middle value. If you have an even number of values, it is the average of the two middle numbers.

Let’s imagine I asked 19 students how large their family is. To calculate the median by hand, you can write the numbers in order and find the middle number. In R, you can use median().

## [1] 3
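The survey responses themselves are not shown, so the sketch below uses a hypothetical vector called family: 19 made-up family sizes chosen so that the median is 3. It finds the middle value by hand and then with median().

```r
# Hypothetical family sizes for 19 students (the chapter's own data is not shown)
family <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5)

# By hand: sort the values and take the one in the middle position
sorted <- sort(family)
middle_position <- (length(sorted) + 1) / 2  # the 10th of 19 values
median_by_hand <- sorted[middle_position]
print(median_by_hand)

# Built-in equivalent
print(median(family))
```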

4.2 Mean

  • Mean. This is the average value in your data. To calculate it, you add up all the individual values and divide the total by the number of observations. Or in R, you can use mean().

## [1] 3
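A short sketch of the same calculation, again on a hypothetical family vector (19 made-up family sizes, not the chapter’s actual responses), comparing the by-hand arithmetic with mean():

```r
# Hypothetical family sizes for 19 students (the chapter's own data is not shown)
family <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5)

# By hand: add everything up and divide by the number of observations
mean_by_hand <- sum(family) / length(family)
print(mean_by_hand)

# Built-in equivalent
print(mean(family))
```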

4.3 Mode

Finally, let’s talk about a third measure of central tendency, which doesn’t appear in the summary statistics list above. It’s used more rarely than mean or median, but it still has an important role: mode. The mode is the most frequent figure or value to appear in a list. Unfortunately, R doesn’t have a built-in function that is as clear as mean() or median() to find the mode. I’ll create a function below to help us calculate it. Luckily, we won’t be asked to calculate the mode very often.

## [1] 3
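The function used to produce that result is not shown, so here is one possible way to write it. The name get_mode and the family vector are both hypothetical; the approach simply counts each value with table() and returns whichever appears most often.

```r
# Hypothetical helper: returns the most frequent value(s) in a vector
get_mode <- function(x) {
  counts <- table(x)                             # how often each value appears
  modes <- names(counts)[counts == max(counts)]  # value(s) with the top count
  # table() stores the values as text, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

# Hypothetical family sizes for 19 students (the chapter's own data is not shown)
family <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5)
print(get_mode(family))
```

If two values tie for most frequent, this version returns both rather than picking one arbitrarily.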

The mode is 3 as well. That means 3 appeared more often than any other number in the data, or that 3 is the most typical number. Mean and median are more useful for numeric data, but if we had a list of words (categorical data) the mode would be useful. For instance, what if I surveyed my class on their favorite type of pizza?

## Warning in mean.default(pizza): argument is not numeric or logical:
## returning NA
## [1] NA
## [1] "pepperoni"

We get a warning and a result of NA when we try to take the mean, because there is no average answer. There is a most common answer though: pepperoni. If I pick a student at random, the most likely favorite pizza type is pepperoni, even if that isn’t the pick of the majority.
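To make that concrete, the sketch below uses a hypothetical pizza vector of ten made-up survey answers. mean() returns NA for text, while counting the answers with table() still identifies the most common one.

```r
# Hypothetical survey answers (the chapter's own responses are not shown)
pizza <- c("pepperoni", "cheese", "pepperoni", "veggie", "sausage",
           "pepperoni", "cheese", "hawaiian", "veggie", "pepperoni")

# mean() cannot average text: it warns and returns NA
# (the warning is suppressed here only to keep the output tidy)
pizza_mean <- suppressWarnings(mean(pizza))
print(pizza_mean)

# The mode still works: count each answer and take the most frequent
counts <- table(pizza)
most_common <- names(counts)[which.max(counts)]
print(most_common)
```

Here pepperoni is the mode with 4 of 10 votes, which is the most common answer but not a majority.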

Sometimes, like in the example above about family size, the mean, median, and mode all have the same value. Why do we need multiple measures then? Because they often aren’t the same, and that tells us something important about the data.

In the California school data the median was 950.5 and the mean was 2629. That is a sizable gap. And the school we started with, Moraga, is larger than the median but smaller than the mean. What does that mean? It means the data is skewed.

Let’s look back at data we made up earlier for the size of families.

Here, half of the data falls above the mean/median, and half below. If half the data is above and below the mean (or if it is close), the median and mean will tell a very similar story.

But we didn’t get similar results in California. Let’s look at the distribution for the data on student enrollment.

4.4 Skew

That doesn’t look evenly distributed at all. Rather, it’s skewed. Skew just means not symmetrical, which in this context means that the data is not spread evenly around the mean and median like in our earlier example. It’s heavily skewed to the right, or dragged out to that end. There’s a long tail of data that is much larger than the mean and median. Those big schools have a large impact on the mean, but less of one on the median. To illustrate that, let’s say I ask two more people about their families, and they both have a lot of relatives (17 and 19).

## [1] 4.428571
## [1] 3

The median stays the same, even if which person’s family supplies the middle answer has shifted. However, the mean has increased by over 1 because of those extreme answers. The same issue is present in the schools data. While most schools are small (half have fewer than one thousand students), there are many very large ones that drag the mean upwards. That isn’t bad; it’s why we have two measures of central tendency.
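The effect is easy to reproduce. The sketch below starts from a hypothetical family vector (19 made-up values with a mean and median of 3, standing in for the data that is not shown) and appends the two large families of 17 and 19 mentioned above.

```r
# Hypothetical family sizes for 19 students (the chapter's own data is not shown)
family <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5)

# Add the two very large families
family_extended <- c(family, 17, 19)

# The mean is pulled upward by the two extreme values
print(mean(family))
print(mean(family_extended))

# The median does not move
print(median(family))
print(median(family_extended))
```

Two extra observations out of 21 are enough to move the mean from 3 to about 4.43, while the median stays at 3.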

So which one is best? To some degree, it depends on what you’re trying to answer.

If the median and mean differ significantly it is probably best to report both, or use the median. The median will always be in the middle of your data (the exact middle) so it’s a more consistent measure of central tendency.

These issues show up in the real world. What is a better measure of how the average American is doing financially, median income or mean income? Median income in 2017 was $31,786, while the mean was $48,150. Someone wanting to argue that Americans are doing well would probably want to highlight the mean, while those arguing for greater social protections might cite the median. With income, we have another skewed distribution, as high earners drag the data out to the right while many Americans are clustered at the other end of the distribution. It’s somewhat dated, but see the graph below for an example of how the distribution of family incomes looked in 2011.

wikimedia commons


4.5 Percentile

When we’re using skewed data, it’s really useful to use percentiles to figure out where a specific observation falls in the data. The median is the 50th percentile, because 50 percent of the data is below it and 50 percent is above it. The 25th percentile, also known as the 1st quartile (that was in the summary statistics above), has 25 percent of the data below it and 75 percent above it. For California schools, 25 percent of schools have 379 students or fewer, while 75 percent have more. The 3rd quartile, or 75th percentile, was 3008. That means 75 percent of schools have fewer than 3008 students, and only 25 percent have more.

Moraga had 1885 students, so it has more students than at least half the schools in California. What percent of schools does it have more than? I can add a column to the data frame (assigning it the name percentile) to figure out the percentile for each school in the data.

## [1] 0.6261905

Moraga is in the 63rd percentile, so it has more students than 63 percent of California schools.
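The exact command used for that column is not shown, so the sketch below is one way it could be done, using ecdf(). The data frame schools, its five rows, and the column names are hypothetical. Note the assumption that a school’s percentile is the share of schools with enrollment at or below its own.

```r
# Hypothetical miniature data frame (the real data has many more rows)
schools <- data.frame(
  school = c("A", "B", "C", "D", "E"),
  students = c(195, 888, 1885, 2247, 6880)
)

# ecdf() builds a function that returns the share of observations
# at or below a given value; apply it to every school's enrollment
schools$percentile <- ecdf(schools$students)(schools$students)
print(schools)

# Look up the percentile for the school with 1885 students
moraga_percentile <- schools$percentile[schools$students == 1885]
print(moraga_percentile)
```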

Let’s go back to our summary statistics from earlier to finish discussing all the metrics contained there.

Min. 1st Qu. Median Mean 3rd Qu. Max.
81 379 950.5 2629 3008 27180

We’ve discussed the 1st Qu., Median, Mean, and 3rd Qu. That leaves the Min and Max, which represent the lowest and highest figures in the data. The smallest school in the data has 81 students, while the largest has 27,176 students (shown rounded as 27180 in the table). Those figures help us get a feel for how spread out our data is. The fact that the largest observation is about 10 times larger than the mean indicates there are a few large observations, and a lot of small schools all clustered below the mean.

So the median is probably more useful in this case, if we’re trying to understand how big the typical school is. That’ll be true with most data that is skewed, or doesn’t follow a normal distribution. But in other cases the mean will be just as useful, and we’ll use it for the calculation of other statistics as well.

For instance, we’ll use the mean to calculate the standard deviation of data.

4.6 Standard Deviation

Standard deviation is a measure of the variability of your data. We’ve discussed two measures of the middle (mean and median); now we want to know where all the other data fall around that middle. Are they very close to the middle? Do they spread out really wide?

To calculate the standard deviation by hand we need to:

  1. Calculate the mean
  2. Subtract the mean from each individual observation, and square the result
  3. Calculate the mean of the squared differences (R’s sd() divides by the number of observations minus 1 rather than by the number of observations)
  4. Calculate the square root of that mean

That’s a mouthful. Or, we can use the command sd(). sd() takes care of all the intermediate steps outlined above.
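The steps can be sketched in a few lines to confirm that they match sd(). The vector x is a made-up example. The one detail worth noticing is that sd() divides the summed squared differences by n - 1 (the sample standard deviation), so the by-hand version does the same.

```r
# Made-up example data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Step 1: calculate the mean
x_mean <- mean(x)

# Step 2: subtract the mean from each observation and square the result
squared_diff <- (x - x_mean)^2

# Step 3: average the squared differences, dividing by n - 1 as sd() does
variance <- sum(squared_diff) / (length(x) - 1)

# Step 4: take the square root of that single figure
sd_by_hand <- sqrt(variance)
print(sd_by_hand)

# Built-in equivalent
print(sd(x))
```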

Standard deviation indicates how dispersed your data is, or how widely it spreads around the mean. Data that has a small standard deviation generally falls very close to the mean; data with a large standard deviation is highly dispersed. The standard deviation gives you evidence of how representative the mean is of the data. If the data is very dispersed, each individual observation might be far from the mean.

Let’s imagine you’re choosing where to go for dinner. There are two new places you’ve heard about and want to check out; you look at Yelp and see they have really similar ratings (out of 5). We’ll call one Oscar’s and one Luis’s (based on restaurants I like in my home town) and look at the average ratings at both.

## [1] 4.133333
## [1] 4.12

That’s pretty close. It’s tough to pick between them. So you look closer and notice that Luis’s has really high variance in its reviews. There are a lot of 5s, but also a lot of 1s. Oscar’s, on the other hand, is more consistently rated around a 4. For Luis’s, the mean isn’t very indicative of the typical experience, but for Oscar’s you know what to expect with just that number. That’s because Luis’s data is more dispersed.

## [1] 1.547709
## [1] 0.6
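The individual ratings are not shown, so the vectors luis and oscar below are hypothetical: they were picked to be consistent with the story (many 5s and some 1s for Luis’s, mostly 4s for Oscar’s) and with the means and standard deviations printed above. Treat them as an illustration rather than the original data.

```r
# Hypothetical ratings: mostly 5s with a handful of 1s
luis <- c(rep(1, 5), rep(3, 3), rep(5, 22))

# Hypothetical ratings: clustered around 4
oscar <- c(rep(3, 3), rep(4, 16), rep(5, 6))

# The averages are nearly identical
print(mean(luis))
print(mean(oscar))

# The standard deviations are not: Luis's ratings are far more spread out
print(sd(luis))
print(sd(oscar))
```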

Why? It turns out that Luis’s brother works as a chef, and is awful. So anytime anyone rates the restaurant after eating one of the dishes cooked by him, it gets a bad review. But the other cooks are top notch. On the other hand, Oscar’s chefs are far more consistent. So the choice would depend on whether you want a chance at the better meal and are willing to take a risk on getting food poisoning, or if you’d rather just know that your food will be good, but not great.

So which restaurant do you want to go to? You plan ahead and call Luis’s to find out if his brother is working the day of your dinner, and finding out that he is home sick (he ate his own cooking apparently) you make a reservation for Luis’s.

Let’s circle back to California schools. We know how large Moraga is roughly. Let’s look at some of the other variables that are in the data.

Table continues below
  district school county grades students
410 61747 Moraga Elementary Contra Costa KK-08 1885
Table continues below
  teachers calworks lunch computer expenditure income
410 98.42 0.0506 0.3033 241 5593 31.05
  english read math percentile
410 1.963 697.4 695.7 0.6262

Read and math refer to the average scores for the school on state achievement tests. Notice that the figure is itself a descriptive statistic, being the average score for students at the school. Moraga students got 695.7 on math. Is that good or bad? The numbers don’t mean anything without context. Is the score out of 696, meaning that Moraga was nearly perfect, or is it out of 10,000, meaning students were nowhere near perfect? Let’s look again at summary statistics for the state to try and understand.

Min. 1st Qu. Median Mean 3rd Qu. Max.
605.4 639.4 652.4 653.3 665.8 709.5

Moraga scores above the mean and median for the state. That’s good, and they were also above the 3rd quartile, meaning that at least 75 percent of schools performed worse.

Notice that the median and mean for this figure are very close together. That means reporting either statistic will be good enough, and that the data isn’t skewed in either direction. The data forms what we would call a normal distribution (we used that term above). Let’s look at the data for math scores in a graph.

You can see the outline of a bell there, even if it’s still imperfect. That’s really useful, because it means that roughly half of the observations fall above the mean and half below. That means we can talk about the distance of an individual school from the average in standard units with the standard deviation.

We know Moraga did well on the tests, but how well?

## [1] 42.4

They scored 42 points higher than the average school, but that doesn’t help us understand just how well they did. To get a better idea, we need to know how widely school math scores were distributed, to better appreciate how much better 42 points makes a school.

## [1] 18.7542

The standard deviation is 18.75. It probably doesn’t feel like that tells you much yet, but it will. The great thing about standard deviations is that when they’re calculated for normally distributed data (like average math scores in California schools), we can use them to figure out just how far above or below average a given school is.

That’s because in a normal distribution half of the data falls above the mean and half below, and it does so symmetrically when measured in standard deviations. The data also falls around the mean in a specific form or shape.

credit: Wikipedia


In the figure above, which has a mean of 0, 50 percent of the data is below the mean and 50 percent is above it (the Greek character σ stands for standard deviation). But just as importantly, we know where that 50 percent falls. 34.1 percent of the data falls within 1 standard deviation above the mean and 13.6 percent falls between 1 and 2 standard deviations, so 47.7 percent falls between the mean and 2 standard deviations. And those figures are symmetrical on both sides.

That might not seem exciting yet, but let’s go back to our earlier question: how good at math is Moraga? First, let’s see how many standard deviations it is above the mean for math scores. We need to find the difference between the Moraga score and the average score for the state, and divide that by the standard deviation.

## [1] 2.258554

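Moraga is 2.26 standard deviations above the mean. Since 50 percent of schools fall below the mean and another 47.7 percent fall between the mean and 2 standard deviations above it, Moraga scores better than roughly 98 percent of schools in the state. That’s pretty good.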
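The calculation can be sketched with the rounded figures quoted in this chapter. The variable names are hypothetical, and because the state mean is rounded to 653.3 the result differs slightly in the third decimal from the output above. pnorm() then gives the share of a perfectly normal distribution that lies below a given number of standard deviations, which is only an approximation for real test scores.

```r
# Figures quoted in the chapter (the state mean is rounded)
moraga_math <- 695.7
state_mean <- 653.3
state_sd <- 18.7542

# Distance from the mean, measured in standard deviations
z <- (moraga_math - state_mean) / state_sd
print(z)

# Share of a normal distribution falling below that point
share_below <- pnorm(z)
print(share_below)
```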

That may all sound a lot like the percentiles that we calculated earlier. In fact, Moraga is in the 98th percentile for math scores. So why do we need two figures? Percentiles are more flexible, and can be useful for any data no matter the distribution.

However, standard deviation is useful for understanding how far from average a result is, and it lets us compare results that are measured in different units. Units refers to how we measure something, whether it be students, or math scores, or shoes, or dollars, or anything.

Moraga is 42 points above average in math scores, and their parents reported 15,735 more dollars than average in annual income. Which of those numbers is more impressive or further from the mean? Well, 15,735 is larger than 42, but because income and math scores are in different units, it’s a lot like comparing apples to pterodactyls. But standard deviations allow us to figure out the relative distance both have from the means.

## [1] 2.258554
## [1] 2.177644

They’re both well above average, but the math score is slightly more above average.

In addition to comparing different units, the standard deviation is used in calculating a lot of the tests we use to determine whether numbers are meaningful. It will come up a lot in future chapters, so understanding the basic idea as a descriptive statistic is worth your time.

In this chapter, we’ve gone over one way to summarize data and to make raw data and figures more understandable for ourselves and others. Let’s review the R commands we’ve done in this chapter.

We can call in data that is already loaded into R with data(). Let’s use a data set about arrest rates in American states.

To see the first 6 lines of the data we can use the command head(), or to see the last 6 we can use tail().

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
##               Murder Assault UrbanPop Rape
## Vermont          2.2      48       32 11.2
## Virginia         8.5     156       63 20.7
## Washington       4.0     145       73 26.2
## West Virginia    5.7      81       39  9.3
## Wisconsin        2.6      53       66 10.8
## Wyoming          6.8     161       60 15.6
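The output above matches R’s built-in USArrests data set, so the commands were presumably along these lines. The n argument in the last call is an optional extra, shown here in case you want a different number of rows.

```r
# Load the built-in data set into the workspace
data("USArrests")

# First six rows (the default for head)
print(head(USArrests))

# Last six rows (the default for tail)
print(tail(USArrests))

# Optional: ask for a specific number of rows
print(head(USArrests, n = 3))
```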

We can use summary() to view the descriptive statistics we went over earlier.

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

In order to make a more attractive table in Markdown, we want to use pander(). We need to load pander, though, because it’s an additional package. If you haven’t installed it before, use install.packages("pander") before calling it into use with library(pander).

Murder Assault UrbanPop Rape
Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
Median : 7.250 Median :159.0 Median :66.00 Median :20.10
Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00

To calculate the mean or median for just one column, we can use mean() or median(). Here they are for the Murder column.

## [1] 7.788
## [1] 7.25

And finally, we can calculate the dispersion or standard deviation with sd().

## [1] 4.35551
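Putting the review together, the sketch below runs the summary commands on the built-in USArrests data set. The figures printed above correspond to the Murder column, selected here with the $ operator.

```r
# Load the built-in data set
data("USArrests")

# Descriptive statistics for every column at once
print(summary(USArrests))

# Middle of a single column
murder_mean <- mean(USArrests$Murder)
murder_median <- median(USArrests$Murder)
print(murder_mean)
print(murder_median)

# Spread of the same column
murder_sd <- sd(USArrests$Murder)
print(murder_sd)
```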