Chapter 2 Descriptive Measures

2.1 Module Overview

In the previous module, we thought about descriptive statistics using tables and graphs. Next, we summarize data by computing numbers. Some of these numbers you may already be familiar with, such as averages and percentiles. Numbers used to describe data are called descriptive measures. We also extend our conversation on descriptive measures for quantitative variables to include the relationship between two variables.

Module Learning Objectives/Outcomes

After completing Module 2, you will be able to:

  1. Calculate and interpret measures of center.
  2. Calculate and interpret measures of variation.
  3. Summarize data using boxplots.
  4. Calculate and interpret a correlation coefficient.
  5. Calculate and interpret a regression line.
  6. Use a regression line to make predictions.

This module’s outcomes correspond to course outcomes (1) organize, summarize, and interpret data in tabular, graphical, and pictorial formats; (2) organize and interpret bivariate data and learn simple linear regression and correlation; and (6) apply statistical inference techniques of parameter estimation such as point estimation and confidence interval estimation.

2.2 Measures of Central Tendency

One research question we might ask is: what values are most common or most likely?

Mode: the most commonly occurring value.

Mean: this is what we usually think of as the “average.” Denoted \(\bar{x}\). Add up all of the values and divide by the number of observations (\(n\)): \[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \sum_{i=1}^n \frac{x_i}{n} \] where \(x_i\) denotes the \(i\)th observation and \(\sum_{i=1}^n\) is the sum of all observations from 1 through \(n\). This is called summation notation.

Median: the middle number when the data are ordered from smallest to largest.

  • If there are an odd number of observations, this will be the number in the middle:
    {1, 3, 7, 9, 9} has median 7
  • If there are an even number of observations, there will be two numbers in the middle. The median will be their average.
    {1, 2, 4, 7, 9, 9} has median \(\frac{4+7}{2}=5.5\)
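
As a quick check of these definitions, here is a minimal Python sketch (using only the built-in statistics module) applied to the example datasets above; the variable names are just for illustration.

```python
# Measures of center for the small example datasets above.
from statistics import mean, median, mode

odd_data = [1, 3, 7, 9, 9]
even_data = [1, 2, 4, 7, 9, 9]

print(mean(odd_data))     # (1 + 3 + 7 + 9 + 9) / 5 = 5.8
print(median(odd_data))   # middle value of the ordered data: 7
print(median(even_data))  # average of the two middle values: (4 + 7) / 2 = 5.5
print(mode(odd_data))     # most commonly occurring value: 9
```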

The mean is sensitive to extreme values and skew. The median is not!

\(x\): 1, 3, 7, 9, 9, with mean \(\bar{x} = \frac{29}{5} = 5.8\)
\(y\): 1, 3, 7, 9, 45, with mean \(\bar{y} = \frac{65}{5} = 13\)

Notice how swapping that 9 for a 45 changes the mean a lot! But the median is 7 for both \(x\) and \(y\).

Because the median is not affected by extreme observations or skew, we say it is a resistant measure or that it is robust.

Which measure should we use?

  • Mean: symmetric, numeric data
  • Median: skewed, numeric data
  • Mode: categorical data

Note: If the mean and median are roughly equal, it is reasonable to assume the distribution is roughly symmetric.

2.3 Measures of Variability

How much do the data vary?

Should we care? Yes! The more variable the data, the harder it is to be confident in our measures of center!

If you live in a place with extremely variable weather, it is going to be much harder to be confident in how to dress for tomorrow’s weather… but if you live in a place where the weather is always the same, it’s much easier to be confident in what you plan to wear.

We want to think about how far observations are from the measure of center.

One easy way to think about variability is the range of the data: \[\text{range} = \text{maximum} - \text{minimum}\] This is quick and convenient, but it is extremely sensitive to outliers! It also takes into account only two of the observations - we would prefer a measure of variability that takes into account all the observations.

Deviation is the distance of an observation from the mean: \(x - \bar{x}\). If we want to think about how far - on average - a typical observation is from the center, our intuition might be to take the average deviation… but it turns out that summing up the deviations will always result in 0! Conceptually, this is because the deviations below the mean (negative numbers) and the deviations above the mean (positive numbers) cancel each other out.

One way to deal with this is to make all of the numbers positive, which we accomplish by squaring each deviation.

| \(x\) | Deviation \(x - \bar{x}\) | Squared Deviation \((x - \bar{x})^2\) |
|---|---|---|
| 2 | -1.2 | 1.44 |
| 5 | 1.8 | 3.24 |
| 3 | -0.2 | 0.04 |
| 4 | 0.8 | 0.64 |
| 2 | -1.2 | 1.44 |
| \(\bar{x}=3.2\) | Total = 0 | Total = 6.8 |

Variance (denoted \(s^2\)) is the average squared distance from the mean: \[ s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \] where \(n\) is the sample size. Notice that we divide by \(n-1\) and NOT by \(n\). There are some mathematical reasons why we do this, but the short version is that it’ll be a better estimate when we move into Inference.

Finally, we come to standard deviation (denoted \(s\)). \[s = \sqrt{s^2}\] The standard deviation is the square root of the variance. We say that a “typical” observation is within about one standard deviation of the mean (between \(\bar{x}-s\) and \(\bar{x}+s\)).
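
Here is a minimal Python sketch of the variance and standard deviation calculations, reproducing the deviation table above:

```python
# Sample variance and standard deviation for the five observations above.
data = [2, 5, 3, 4, 2]
n = len(data)

xbar = sum(data) / n                            # 3.2
squared_devs = [(x - xbar) ** 2 for x in data]  # 1.44, 3.24, 0.04, 0.64, 1.44
s2 = sum(squared_devs) / (n - 1)                # 6.8 / 4 = 1.7 (note: divide by n - 1)
s = s2 ** 0.5                                   # about 1.30

print(xbar, s2, s)
```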

We will think about one more measure of variability, the interquartile range, in the next section.

2.4 Measures of Position

The interquartile range (IQR) measures the spread of the middle 50% of the data.

Recall that the median cuts the data in half: 50% of the data is below the median and 50% is above it. The median is also called the 50th percentile. In general, the \(p\)th percentile is the value below which \(p\)% of the data falls.

To get the middle 50%, we will split the data into four parts:

| 1: 25% | 2: 25% | 3: 25% | 4: 25% |

The 25th and 75th percentiles, along with the median, divide the data into four parts. We call these three measurements the quartiles:

  • Q1, the first quartile, is the median of the lower 50% of the data (the values below the median).
  • Q2, the second quartile, is the median.
  • Q3, the third quartile, is the median of the upper 50% of the data (the values above the median).

Example: Consider {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

  • Cutting the data in half: {1, 2, 3, 4, 5 | 6, 7, 8, 9, 10}, the median (Q2) is \(\frac{5+6}{2}=5.5\).
  • Q1 is the median of {1, 2, 3, 4, 5}, or 3
  • Q3 is the median of {6, 7, 8, 9, 10}, or 8

Then the interquartile range is \[ \text{IQR} = \text{Q3}-\text{Q1} \]
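
The following Python sketch implements the median-of-the-halves convention described above and reproduces the example; note that statistical software often uses slightly different percentile conventions, so its quartiles may differ a little.

```python
# Quartiles and IQR using the "median of each half" convention from the text.
def median(values):
    v = sorted(values)
    n = len(v)
    mid = n // 2
    return v[mid] if n % 2 == 1 else (v[mid - 1] + v[mid]) / 2

def quartiles(values):
    v = sorted(values)
    n = len(v)
    lower = v[: n // 2]        # values below the median
    upper = v[(n + 1) // 2 :]  # values above the median
    return median(lower), median(v), median(upper)

data = list(range(1, 11))      # 1, 2, ..., 10
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, q3 - q1)     # 3, 5.5, 8, and IQR = 5
```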

This is another measure of variability and is resistant to extreme values. In general, we prefer the mean and standard deviation when the data are symmetric and we prefer the median and IQR when the data are skewed.

2.4.1 Box Plots

Our measures of position are the foundation for constructing what we call a box plot, which summarizes the data with 5 statistics plus any extreme observations.

Drawing a box plot:

  1. Draw the vertical axis to include all possible values in the data.
  2. Draw a horizontal line at the median, at Q1, and at Q3. Use these to form a box.
  3. Draw the whiskers. The whiskers’ upper limit is \(\text{Q3} + 1.5\times\text{IQR}\) and the lower limit is \(\text{Q1} - 1.5\times\text{IQR}\).
  4. The actual whiskers are then drawn at the closest point within each limit.
  5. Any points outside the whisker limits are included as individual points. These are potential outliers.
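
Below is a rough Python sketch of steps 3–5 (the whisker limits and the outlier check), reusing the median-of-the-halves quartile convention from the previous section; the example data are made up for illustration.

```python
# Whisker limits and potential outliers for a box plot.
def median(v):
    v = sorted(v)
    n = len(v)
    return v[n // 2] if n % 2 else (v[n // 2 - 1] + v[n // 2]) / 2

def boxplot_summary(values):
    v = sorted(values)
    n = len(v)
    q1, q2, q3 = median(v[: n // 2]), median(v), median(v[(n + 1) // 2 :])
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in v if lower_limit <= x <= upper_limit]
    outliers = [x for x in v if x < lower_limit or x > upper_limit]
    # The whiskers are drawn at the closest observations within the limits.
    return {"box": (q1, q2, q3),
            "whiskers": (min(inside), max(inside)),
            "potential outliers": outliers}

print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
# {'box': (3, 5.5, 8), 'whiskers': (1, 9), 'potential outliers': [30]}
```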

(Potential) outliers can help us…

  • examine skew (outliers in the negative direction suggest left skew; outliers in the positive direction suggest right skew).
  • identify issues with data collection or entry, especially if the value of the outliers doesn’t make sense.

2.4.2 Descriptive Measures for Populations

So far, we’ve thought about calculating various descriptive statistics from a sample, but our long-term goal is to estimate descriptive information about a population. At the population level, these values are called parameters.

When we find a measure of center, spread, or position, we use a sample to calculate a single value. These single values are called point estimates and they are used to estimate the corresponding population parameter. For example, we use \(\bar{x}\) to estimate the population mean, denoted \(\mu\) (Greek letter “mu”) and \(s\) to estimate the population standard deviation, denoted \(\sigma\) (Greek letter “sigma”).

| Point Estimate | Parameter |
|---|---|
| sample mean: \(\bar{x}\) | population mean: \(\mu\) |
| sample standard deviation: \(s\) | population standard deviation: \(\sigma\) |

…and so on and so forth. For each quantity we calculate from a sample (point estimate), there is some corresponding unknown population level value (parameter) that we wish to estimate.

We will discuss this in more detail when we discuss Random Variables and Statistical Inference.

2.5 Regression and Correlation

From your previous math classes, you should have a passing familiarity with linear equations like \(y=mx+b\). In statistics, we write these as \[y=b_0 + b_1x\] where \(b_0\) and \(b_1\) are constants, \(x\) is the independent variable, and \(y\) is the dependent variable. The graph of a linear function is always a (straight) line.

The y-intercept is \(b_0\), the value the dependent variable takes when the independent variable \(x=0\). The slope is \(b_1\), the change in \(y\) for a 1-unit change in \(x\).

A scatterplot shows the relationship between two (numeric) variables.

At a glance, we can see that (in general) heavier cars have lower MPG. We call this type of data bivariate data. Now consider a different example.

This relationship can be modeled perfectly with a straight line: \[ y = 8 + 3.25x \] When we can do this - model a relationship perfectly - we know the exact value of \(y\) whenever we know the value of \(x\). This is nice (we would love to be able to do this all the time!) but typically data is more complex than this.

Linear regression takes the idea of fitting a line and allows the relationship to be imperfect. Imagine in the previous scenario that you buy an $8 pound of coffee each month and individual coffees cost $3.25… but what if your pound of coffee didn’t always cost $8? Or your coffee drinks didn’t always cost $3.25? In this case, you might get a plot that looks something like this:

The linear regression line looks like \[ y = \beta_0 + \beta_1x + \epsilon \]

  • \(\beta\) is the Greek letter “beta.”
  • \(\beta_0\) and \(\beta_1\) are constants.
  • Error (the fact that the points don’t all line up perfectly) is represented by \(\epsilon\).

Think of this as the 2-dimensional version of a point estimate!

We estimate \(\beta_0\) and \(\beta_1\) using data and denote the estimated line by \[ \hat{y} = b_0 + b_1x \]

  • \(\hat{y}\), “y-hat,” is the estimated value of \(y\).
  • \(b_0\) is the estimate for \(\beta_0\).
  • \(b_1\) is the estimate for \(\beta_1\).

We drop the error term \(\epsilon\) when we estimate the constants for a regression line; we assume that the mean error is 0, so on average we can ignore this error.

We use a regression line to make predictions about \(y\) using values of \(x\).

  • \(y\) is the response variable.
  • \(x\) is the predictor variable.

Example: (from OpenIntro Statistics 8.1.2) Researchers captured 104 brushtail possums and took a variety of body measurements on each before releasing them back into the wild. We consider two measurements for each possum: total body length and head length.

Clearly, the relationship isn’t perfectly linear, but there does appear to be some kind of linear relationship (as body length increases, head length also increases). We want to try to use body length (\(x\)) to predict head length (\(y\)).

The regression model for these data is \[\hat{y}=42.7 + 0.57x\]

To predict the head length for a possum with a body length of 80cm, we just need to plug in 80 for body length (\(x\)): \[\hat{y}=42.7 + 0.57(80) = 88.3\text{mm}.\] Note: because the regression line is built using the data’s original units (cm for body length, mm for head length), the regression line will preserve those units. That means that when we plugged in a value in cm, the equation spit out a predicted value in mm.
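
A tiny Python sketch of this prediction (the function name here is ours, just for illustration):

```python
# Predicted head length (mm) from body length (cm), using the fitted line above.
def predict_head_length(body_length_cm):
    return 42.7 + 0.57 * body_length_cm

print(predict_head_length(80))  # 42.7 + 0.57 * 80 = 88.3 (mm)
```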

2.5.1 Correlation

We’ve talked about the strength of linear relationships, but it would be nice to formalize this concept. The correlation between two variables describes the strength of their linear relationship. It always takes values between -1 and 1. We denote the correlation (or correlation coefficient) by \(R\): \[R = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\times\frac{y_i - \bar{y}}{s_y}\right)\] where \(s_x\) and \(s_y\) are the respective standard deviations for \(x\) and \(y\). The sample size \(n\) is the total number of \((x,y)\) pairs.

Example: Consider

| \(x\) | \(y\) |
|---|---|
| 1 | 3 |
| 2 | 3 |
| 3 | 4 |
| \(\bar{x} = 2\) | \(\bar{y} = 3.333\) |
| \(s_x = 1\) | \(s_y = 0.577\) |

Like we did with variance/standard deviation, I recommend using a table to calculate the correlation between \(x\) and \(y\):

| \(x - \bar{x}\) | \(\frac{x - \bar{x}}{s_x}\) | \(y - \bar{y}\) | \(\frac{y - \bar{y}}{s_y}\) | \(\frac{x - \bar{x}}{s_x}\times\frac{y - \bar{y}}{s_y}\) |
|---|---|---|---|---|
| -1 | -1 | -0.333 | -0.577 | 0.577 |
| 0 | 0 | -0.333 | -0.577 | 0.000 |
| 1 | 1 | 0.667 | 1.155 | 1.155 |
| | | | | sum = 1.732 |

So \(R = \frac{1}{3-1}(1.732) = 0.866\)
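
The same calculation as a short Python sketch, following the correlation formula directly:

```python
# Correlation coefficient R for the three (x, y) pairs in the example.
x = [1, 2, 3]
y = [3, 3, 4]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5
sy = (sum((yi - ybar) ** 2 for yi in y) / (n - 1)) ** 0.5

R = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)
print(R)  # about 0.866
```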

Correlations

  • close to -1 suggest strong, negative linear relationships.
  • close to +1 suggest strong, positive linear relationships.
  • close to 0 suggest little-to-no linear relationship.

Note: the sign of the correlation will match the sign of the slope!

  • If \(R < 0\), there is a downward trend and \(b_1 < 0\).
  • If \(R > 0\), there is an upward trend and \(b_1 > 0\).
  • If \(R \approx 0\), there is little-to-no linear relationship and \(b_1 \approx 0\).

A final note: correlations only represent linear trends. Consider the following scatterplot:

Obviously there’s a strong relationship between \(x\) and \(y\). In fact, there’s a perfect relationship here: \(y = x^2\). But the correlation between \(x\) and \(y\) is 0! This is one reason why it’s important to examine the data both through visual and numeric measures.

2.5.2 Finding a Regression Line

Residuals are the leftover stuff (variation) in the data after accounting for model fit: \[\text{data} = \text{prediction} + \text{residual}\] Each observation has its own residual. The residual for an observation \((x,y)\) is the difference between observed (\(y\)) and predicted (\(\hat{y}\)): \[e = y - \hat{y}\] We denote the residuals by \(e\) and find \(\hat{y}\) by plugging \(x\) into the regression equation. If an observation lands above the regression line, \(e > 0\). If below, \(e < 0\).

When we estimate the parameters for the regression, our goal is to get each residual as close to 0 as possible. We might think to try minimizing \[\sum_{i=1}^n e_i = \sum_{i=1}^n (y_i - \hat{y}_i)\] but, just like the deviations earlier, positive and negative residuals cancel out, and minimizing this sum would simply reward very large negative residuals. As with the variance, we will use squares to shift the focus to magnitude: \[\begin{align} \sum_{i=1}^n e_i^2 &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\ & = \sum_{i=1}^n [y_i - (b_0 + b_1 x_i)]^2 \end{align}\] The values \(b_0\) and \(b_1\) that minimize this sum of squared residuals make up our regression line.

This is a calculus-free course, so we’ll skip the proof of the minimization part. The slope can be estimated as \[b_1 = \frac{s_y}{s_x}\times R\] and the intercept as \[b_0 = \bar{y} - b_1 \bar{x}\]
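
As a sketch, here are these formulas applied in Python to the same small dataset used in the correlation example, along with the resulting residuals:

```python
# Least-squares slope and intercept via b1 = (sy / sx) * R and b0 = ybar - b1 * xbar.
x = [1, 2, 3]
y = [3, 3, 4]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5
sy = (sum((yi - ybar) ** 2 for yi in y) / (n - 1)) ** 0.5
R = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)

b1 = (sy / sx) * R        # slope: 0.5
b0 = ybar - b1 * xbar     # intercept: about 2.33

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, residuals)  # the residuals sum to (essentially) zero
```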

2.5.2.1 Coefficient of Determination

With the correlation and regression line in hand, we will add one last piece for considering the fit of a regression line. The coefficient of determination, \(R^2\), is the square of the correlation coefficient. This value tells us how much of the variability in the response variable \(y\) is accounted for (explained) by the regression. An easy way to interpret this value is to assign it a letter grade. For example, if \(R^2 = 0.84\), the predictive capabilities of the regression line get a B.

Example: Consider two measurements taken on the Old Faithful Geyser in Yellowstone National Park: eruptions, the length of each eruption and waiting, the time between eruptions. Each is measured in minutes.

There does appear to be some kind of linear relationship here, so we will see if we can use the wait time to predict the eruption duration. The sample statistics for these data are

| | waiting | eruptions |
|---|---|---|
| mean | \(\bar{x}=70.90\) | \(\bar{y}=3.49\) |
| sd | \(s_x=13.60\) | \(s_y=1.14\) |

with correlation \(R = 0.90\).

Since we want to use wait time to predict eruption duration, wait time is \(x\) and eruption duration is \(y.\) Then \[b_1 = \frac{1.14}{13.60}\times 0.90 \approx 0.076 \] and \[b_0 = 3.49 - 0.076\times 70.90 \approx -1.87\] so the estimated regression line is \[\hat{y} = -1.87 + 0.076x\]

To interpret \(b_1\), the slope, we would say that for a one-minute increase in waiting time, we would predict a 0.076 minute increase in eruption duration. The intercept is a little bit trickier. Plugging in 0 for \(x\), we get a predicted eruption duration of \(-1.87\) minutes. There are two issues with this. First, a negative eruption duration doesn’t make sense… but it also doesn’t make sense to have a waiting time of 0 minutes.

It’s important to stop and think about our predictions. Sometimes, the numbers don’t make sense and it’s easy to see that there’s something wrong with the prediction. Other times, these issues are more insidious. Usually, these issues result from what we call extrapolation: applying a model estimate to values outside of the data’s range for \(x\). Our linear model is only an approximation, and we don’t know anything about how the relationship behaves outside the scope of our data!

Consider the following data with the best fit line drawn on the scatterplot.

The best fit line is \[\hat{y} = 2.69 + 0.179x\] and the correlation is \(R=0.877\). Then the coefficient of determination is \(R^2 = 0.767\) (think: a C grade), so the model has decent predictive capabilities. More precisely, the model accounts for 76.7% of the variability in \(y\). Now suppose we wanted to predict the value of \(y\) when \(x=0.1\): \[\hat{y} = 2.69 + 0.179\times0.1 \approx 2.71\] This seems like a perfectly reasonable number… But what if I told you that I generated the data using the model \(y = 2\ln(x) + \text{random error}\)? (If you’re not familiar with the natural log, \(\ln\), don’t worry about it! You won’t need to use it.) The true (population) best-fit model would look like this:

The vertical lines at \(x=5\) and \(x=20\) show the bounds of our data. The blue dot at \(x=0.1\) is the predicted value \(\hat{y}\) based on the linear model. The dashed horizontal line helps demonstrate just how far this estimate is from the true population value! This does not mean there’s anything inherently wrong with our model. If it works well from \(x=5\) to \(x=20\), great, it’s doing its job!
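
To see the extrapolation problem numerically, here is a hedged simulation sketch; the noise level, sample size, and random seed are assumptions for illustration, so the fitted coefficients will not exactly match the numbers above.

```python
# Fit a line to data generated from y = 2 ln(x) + noise on [5, 20],
# then (unwisely) extrapolate to x = 0.1.
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, for reproducibility
x = np.linspace(5, 20, 60)
y = 2 * np.log(x) + rng.normal(0, 0.3, size=x.size)

b1, b0 = np.polyfit(x, y, 1)            # least-squares slope and intercept

x_new = 0.1                             # far outside the observed range [5, 20]
print("linear prediction:", b0 + b1 * x_new)    # roughly 2.7
print("true mean value:  ", 2 * np.log(x_new))  # 2 ln(0.1), about -4.6
```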