6 Correlations

Have you ever heard someone say “that’s just correlation, not causation!”? We’ll talk about both parts of that statement in this chapter. Yes, correlation isn’t causation, but it’s also an important part of figuring out whether something causes something else. Let’s start with the definition of correlation, and we’ll talk about causation at the end.

According to the Free Dictionary.com, correlation is

A relationship or connection between two things based on co-occurrence or pattern of change

That works. Let’s break it apart.

A correlation can be calculated for two things, say murder and urban population. A co-occurrence or pattern of change means that the two things happen together in some pattern. If we look at a state with a higher assault rate, we’ll probably find that it also has a higher murder rate (Chapter 5). The two things, high murder and high assault rates, occur in the same states, and low murder and low assault rates occur in the same states.

We can figure out what that relationship is more formally by calculating the correlation coefficient.

6.1 Types of Correlations

Correlation coefficients can range from 1 to -1, and can fall anywhere in between.

credit: Wikipedia

A correlation of 1 shows that the relationship is perfectly positive. A positive relationship indicates that if an observation is higher on one variable, it’ll be higher on the other as well. If assault rates perfectly predicted murder rates, then we’d have a correlation of 1. We don’t, though; it was .69, so assault is a good predictor but not a perfect one.

A correlation of -1 means the opposite: if something is higher on one quality, it’ll be lower on the other. To visualize a strong negative correlation, let’s look at the percentage of students on free lunches at schools and their math scores (free lunches are a typical measure of poverty).

Schools with more students on free lunches do worse on math tests on average. Schools with fewer students on free lunches perform better. Knowing one variable will help you to predict the other, but not perfectly.

What about something with a weaker relationship? Let’s see how well school expenditures correlate with math test scores. You might expect schools that spend more to perform better, because they can hire better teachers and offer smaller classes and more activities, right?

Actually, the relationship is pretty weak in the data. A correlation coefficient of .15 is positive, meaning that a higher value in one variable does predict a higher value in the other, but not by much. The data is generally bundled up in the middle.
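To make the two ends of the scale concrete, here is a small sketch using made-up numbers (x, y_up, and y_down are invented for illustration and are not from the chapter's data). When one variable is an exact straight-line function of the other, cor() returns 1 or -1 depending on the direction of the line.

```r
# Made-up values, purely for illustration.
x <- 1:10

y_up <- 2 * x + 1     # rises in lockstep with x
y_down <- 50 - 3 * x  # falls in lockstep with x

cor(x, y_up)    # perfect positive relationship
cor(x, y_down)  # perfect negative relationship
```

Real data almost never lines up this neatly, which is why the examples in this section land somewhere between the two extremes.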

6.2 Strength of Correlations

How do we evaluate whether a correlation is strong or not? There are general rules for what is a strong, moderate, or weak correlation, although they aren’t set in stone.

A strong correlation is between .5 (or -.5) and 1 (-1).

A moderate correlation is between .5 (-.5) and .3 (-.3).

A weak correlation is between .3 (-.3) and .1 (-.1).

And if it falls between .1 and -.1, it’s just said to not correlate.

Those aren’t firm rules. How much stronger is a correlation that is .31 vs .29? Not very much. Thus, those categories should be taken as rough guidelines.
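As a rough sketch, the guidelines above can be written as a small R function. The name describe_correlation() is hypothetical (it is not a built-in R function), and assigning the exact cutoffs (.5, .3, .1) to the stronger category is an assumption, since the guidelines don't say which side a boundary value belongs to.

```r
# Hypothetical helper: label a correlation using the rough guideline cutoffs.
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  size <- abs(r)  # strength ignores whether the correlation is positive or negative
  ifelse(size >= 0.5, "strong",
    ifelse(size >= 0.3, "moderate",
      ifelse(size >= 0.1, "weak", "none")))
}

describe_correlation(c(0.69, -0.15, 0.31, 0.29, 0.05))
# "strong" "weak" "moderate" "weak" "none"
```

Notice that .31 and .29 receive different labels even though they are nearly identical, which is exactly why the categories should be treated as rough guidelines.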

In the end, we’ll use a bit more math to really evaluate how strong a correlation is in a few chapters. There are tests to figure out not just whether the two variables move up and down together, but whether that relationship could have arisen just by chance. For now though, correlation is an effective way to communicate whether two variables are related.

6.3 Calculating Correlation

To really understand the correlation coefficient, it is necessary to walk through the math, which gives us an opportunity to talk about standardizing units.

Let’s say we want to know if there is any relationship between parental income and math scores for schools. Do we think schools with higher average parental incomes do better or worse on standardized tests? We can answer that question using the California schools data we opened in a previous chapter (from the package AER).

One issue is that income and tests come in different units. Income is measured in dollars while tests are in points.

Math scores:
Min. 1st Qu. Median Mean 3rd Qu. Max.
605.4 639.4 652.4 653.3 665.8 709.5

Parental income:
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.335 10.64 13.73 15.32 17.63 55.33

What we need to do is standardize the units so that we can properly compare them no matter what is being measured.

  1. First we need to calculate the mean and standard deviation for income and math
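A minimal sketch of this first step follows. To keep it runnable on its own, it uses a tiny made-up data frame called schools (five hypothetical schools) as a stand-in for the California schools data; the object names are also just illustrative choices.

```r
# Made-up numbers for five schools, standing in for the real data.
schools <- data.frame(
  income = c(10, 12, 15, 20, 30),      # average parental income
  math   = c(630, 640, 650, 655, 680)  # average math score
)

# Save the mean and standard deviation of each variable as separate objects.
mean_income <- mean(schools$income)
sd_income <- sd(schools$income)
mean_math <- mean(schools$math)
sd_math <- sd(schools$math)

# Print all four values together.
c(mean_income = mean_income, sd_income = sd_income,
  mean_math = mean_math, sd_math = sd_math)
```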

We can use the mean and standard deviation to standardize the units for income and math scores. We’ll add a new column for each variable that records how many standard deviations each school’s income or math score is from that variable’s mean.
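Here is one way the standardizing step might look, again on a small made-up data frame (schools and the _std column names are hypothetical stand-ins, not the chapter's actual objects). Each value has the mean subtracted and is then divided by the standard deviation.

```r
# Made-up numbers for five schools, standing in for the real data.
schools <- data.frame(
  income = c(10, 12, 15, 20, 30),
  math   = c(630, 640, 650, 655, 680)
)

# Mean and standard deviation, saved as separate objects.
mean_income <- mean(schools$income)
sd_income <- sd(schools$income)
mean_math <- mean(schools$math)
sd_math <- sd(schools$math)

# New columns: distance from the mean, measured in standard deviations.
schools$income_std <- (schools$income - mean_income) / sd_income
schools$math_std <- (schools$math - mean_math) / sd_math

schools
```

After standardizing, each new column has a mean of 0 and a standard deviation of 1, whatever units the original variable was measured in.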

And yes, we could have done that without saving the mean and standard deviation as separate objects. I thought it would be useful to break it down into each step, but you could have skipped those earlier steps.

Okay, now we have both variables in standardized units, which means the fact they’re measured in dollars and points no longer matters. We just know how far from the mean both are.

We can now calculate the correlation coefficient by multiplying the two standardized values for each school and saving that as a new variable called IxM (for income x math). Then we’ll take the sum of those products with the command sum(). Each product tells us whether a school sits on the same side of the mean for both income and math scores: it is positive when both are above (or both below) average, and negative when one is above and the other is below. If the two variables are positively correlated, a school with parental income that is higher than average should also have higher than average test scores. By taking the sum of those products, we can see how the two variables move together across every school in the data.

The final step in calculating the correlation coefficient is to divide that sum by the size of the data set after subtracting one. We can figure out the size by asking R how many rows are in the data with nrow(). The correlation coefficient is generally identified with the letter r, no relation to the R software.
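Written as a formula, the steps so far amount to the expression below. The notation is my own shorthand rather than anything defined earlier in the chapter: x and y are the two variables (here income and math), bars mark their means, s_x and s_y are their standard deviations, and n is the number of rows.

```latex
r = \frac{1}{n - 1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each term in parentheses is one of the standardized values, their product is the IxM column, and the sum divided by n - 1 is the correlation coefficient.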

And yes, there is a way to make R do all the work for us. The command is cor() with the two variables inserted inside the parentheses. But doesn’t it feel good to have done that for ourselves?
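The sketch below runs the whole calculation from start to finish and checks it against cor(). It uses a small made-up data frame (schools, with five hypothetical schools) rather than the California data, so the resulting number is not the .699 discussed in this chapter; the point is only that the by-hand result and cor() agree.

```r
# Made-up numbers for five schools, standing in for the real data.
schools <- data.frame(
  income = c(10, 12, 15, 20, 30),
  math   = c(630, 640, 650, 655, 680)
)

# Standardize each variable in a single step (no separate mean/sd objects).
schools$income_std <- (schools$income - mean(schools$income)) / sd(schools$income)
schools$math_std <- (schools$math - mean(schools$math)) / sd(schools$math)

# Multiply the standardized values school by school (income x math).
schools$IxM <- schools$income_std * schools$math_std

# Sum the products and divide by the number of rows minus one.
r_by_hand <- sum(schools$IxM) / (nrow(schools) - 1)

# Let R do the work and compare the two answers.
r_builtin <- cor(schools$income, schools$math)
all.equal(r_by_hand, r_builtin)  # TRUE
```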

So what does a correlation of .699 mean? By the guidelines above, it’s a strong positive correlation. Let’s visualize.

Knowing a school’s average parental income is generally a good predictor of its math scores. Schools with wealthier parents have students who perform better. There are a lot of reasons for that association, but what we’ve shown here is that they are associated, at least in this data.

6.4 Multiple correlations

Correlation can only be calculated for two variables at a time - a single pair of variables. But we can run correlations for multiple pairs simultaneously.

Let’s look at the correlations for math scores and parental income again, but let’s also include the number of computers at the school, the number of students, and reading scores.

          income    read      math      students  computer
income    1         0.6978    0.6994    0.02839   0.09434
read      0.6978    1         0.9229    -0.1884   -0.109
math      0.6994    0.9229    1         -0.1109   -0.03295
students  0.02839   -0.1884   -0.1109   1         0.9289
computer  0.09434   -0.109    -0.03295  0.9289    1

What the table shows you is the correlation between each pair of variables. Each number is the correlation coefficient for the variables in that row and column. Each number is shown twice because each variable has both a row and a column.
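A table like this can be produced by handing cor() a whole data frame of numeric columns instead of two variables. The sketch below uses a small made-up data frame (schools and all of its values are hypothetical), so the numbers will not match the table above; it is only meant to show the shape of the result.

```r
# Made-up numbers for five schools, standing in for the real data.
schools <- data.frame(
  income   = c(10, 12, 15, 20, 30),
  read     = c(635, 642, 648, 660, 678),
  math     = c(630, 640, 650, 655, 680),
  students = c(2500, 400, 1800, 900, 300)
)

# Correlate every column with every other column.
cor_table <- cor(schools)
round(cor_table, 2)

# Pull out a single pair by row and column name.
cor_table["income", "math"]
```

The diagonal is always 1 (each variable against itself), and the table is mirrored across that diagonal, which is why every coefficient appears twice.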

In the very top left corner is the number 1, which is for the row income and column income. A variable perfectly predicts itself, so that number doesn’t tell us much.

The next number in that row shows the correlation between parental income and reading test scores: .698. That nearly matches the correlation between math scores and income we found earlier.

Let’s jump around a bit and test ourselves. What is the correlation between reading test scores and math test scores? I’ll give you a second…


It’s .92. That’s a nearly perfect correlation. Schools that do better on the math test also do better on the reading test. If you only know how good your school is at math, you can be pretty confident the reading scores will be similarly good.

One more. What is the correlation between the number of computers in the school and reading scores?

See it? Interestingly, it’s -.109. That’s a weak, negative correlation: the more computers there are, the worse students do. So computers make students do worse because they’re focusing so much on coding that they don’t learn to read? That’s one explanation of the data, but it’s almost certainly wrong. Look at the correlation between the size of a school’s enrollment and its reading scores: -.188. Larger schools do worse on the tests, and they also tend to have more computers (.93). Thus, it can be dangerous to interpret the relationship between two variables based just on their correlation.
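One way to see how school size could produce a pattern like this is to simulate it. In the sketch below every number is invented: school size drives both the number of computers and the reading scores, while computers have no effect on reading at all. The variable names and coefficients are assumptions chosen for illustration, not estimates from the California data.

```r
set.seed(42)  # make the simulated numbers reproducible
n <- 5000

# Invented school sizes; size is the only thing driving the other two variables.
students <- runif(n, min = 200, max = 4000)
computer <- 0.1 * students + rnorm(n, sd = 20)           # bigger schools own more computers
read_score <- 700 - 0.01 * students + rnorm(n, sd = 20)  # bigger schools score a bit lower

cor(students, computer)    # strongly positive
cor(computer, read_score)  # negative, even though computers do nothing here

# Compare only schools of similar size and the relationship largely vanishes.
similar_size <- students > 1900 & students < 2100
cor(computer[similar_size], read_score[similar_size])
```

In this made-up world the negative correlation between computers and reading comes entirely from school size, which is the kind of alternative explanation a correlation on its own cannot rule out.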

6.5 Correlation and Causation

Do we now know that higher parental income causes school math scores to be higher? Not just because of the correlation, no.

For one thing, correlations can’t determine which variable is causing the other to increase. Could school math scores cause parental incomes to rise? That’s unlikely, but the correlation coefficient alone can’t tell us what is causing what to increase. All we know is that they both rise together in our data.

There are three criteria for asserting causality (that A causes B):

  1. Co-variation
  2. Temporal Precedence
  3. Elimination of Extraneous Variables

Co-variation is what we’ve been measuring. As one variable moves, the other variable moves in unison.

Temporal precedence refers to the timing of the two variables. In order for A to cause B, A must precede B. I cause the TV to turn on by pushing a button; you wouldn’t say the TV turning on caused me to push the button. The measurement for parental income comes before the math tests here, so we do have temporal precedence. And math scores didn’t help parents to earn more, unless the state has introduced some sort of test reward system for parents.

So what about extraneous variables? We don’t just need to prove that income and math scores are correlated and that the income preceded the tests. We need to prove that nothing else could explain the relationship. Is the parental income really the cause of the scores, or is it the higher education of parents (which helps them earn more)? Is it because those parents could help their children with math homework at night, or because they could afford math camps in the summer? There are lots of things that correlate with parental income and that would also correlate with school math scores. Until we can eliminate all of those possibilities, we can’t say for sure that parental income causes higher math scores.

These issues have arisen in the real world. A while back someone realized that children with more books in their home did better on reading tests. So nonprofits and schools started giving away books, to try and ensure every student would have books and do better on tests.

What happened? Not much. The books themselves didn’t make a difference. What mattered was having parents who would read books to their children every night, along with many other factors (having a consistent home to store them in, parents who could afford books, etc.). That’s why it’s important to eliminate every other possible explanation.

Let’s look at one more example. Homeschooled students do better than those in public school. Great! Let’s homeschool everyone, right?

Well, homeschooled students do better on average, but that’s probably related to the fact that their families have enough income for one parent to stay home and not work regularly, and that they have a parent who feels comfortable teaching (and likely has a high level of education themselves). Just based on that, I’m guessing that shutting down public schools and sending everyone home won’t send scores skyrocketing. There is still a correlation between homeschooling and scores, but it may not be causal.

6.6 Brief Review

Why do we care about correlations? Because we like to explain things, and two things correlating helps us begin to explain them.

There’s been shifting polling in a race for the Senate - what might explain that? Changes in support might correlate with voters beginning to pay more attention, or the discovery of some scandal, or changes in the voter pool. Any explanation will arise from a correlation between two things.

In order to calculate a correlation coefficient all you need is the cor() command and the names of the two variables you are interested in studying.