Chapter 7 Correlation and Regression

7.1 Correlation Does Not Imply Causation…

The above chart comes from Tyler Vigen’s extraordinary Spurious Correlations blog. It’s worth your time to go through the charts there and see what he has found in what must be an enormous wealth of data.

Correlation does not imply causation is literally one of the oldest principles in the practice of statistics. It is parroted endlessly in introductory statistics courses by me and people like me. It is absolutely true if we look through enough pairs of variables, we will find some pairs of things that are totally unrelated – like cheese consumption and civil engineering PhD completion – that appear to be connected.

It is also true – and probably more dangerous than coincidence – that there are some variables that correlate not because one causes the other but because a third variable causes both. Take, for example, the famous and probably apocryphal example of the correlation between ice cream sales and violent crime: the idea there is that ambient temperature is correlated with both ice cream consumption and crime rates. Now, the ice cream part of it is most likely just humorous conjecture, and there have been real studies that link heat with aggression, but it is less likely that the link between crime rates with outdoor temperatures is due to heat causing violence, but rather due to higher rates of social interaction in warmer weather that create more opportunities for interpersonal violence.

The decline of sales of digital cameras over the past decade is a well-researched trend. Here, here’s the research:

And here’s another trend you may find intriguing:

The conclusion is clear: athleisure killed the digital camera.

Or maybe not. But something led to the decline of sales of digital cameras.

7.1.1 …but it Doesn’t Imply NOT Causation either

There is another annual trend that correlates inversely with sales of digital cameras:

Again, correlation does not imply causation is a true statement. But it is simultaneously true that when there is causation, there is also correlation. Correlation techniques – more specifically, regression techniques – are the foundation of most classical and some Bayesian statistical approaches, and as such are widely used in controlled experimentation: the gold standard of establishing causation. So, yes, correlation doesn’t by itself establish cause-and-effect, but it is a good place to start looking.

7.2 Correlation

7.2.1 The Product Moment Coefficient \(r\)

Correlation is most frequently measured by the Classical statistic \(r\): the Pearson Product Moment Coefficient. It is the coefficient of the standardized score of a predictor variable Direction and Magnitude

There are two main features of interest to the value of \(r\): its sign (positive or negative) and its magnitude (its absolute value). The sign of \(r\) indicates the direction of the correlation. A positive value of \(r\) indicates that increases in one of a pair of variables correspond generally to increases in the other member of the pair and vice versa: this type of correlation is known as direct or simply positive. A negative value of \(r\) indicates that increases in one of a pair of variables correspond generally to decreases in the other member of the pair and vice versa: that type of correlation is known as indirect, inverse, or simply negative.

The absolute value of \(r\) is a measure of how closely the two variables are related. Absolute values of \(r\) that are closer to 0 indicate that there is a loose relationship between the variables; absolute values of \(r\) that are closer to 1 indicate a tighter relationship between the variables. Jacob Cohen145 recommends the following guidelines for interpreting the absolute value of \(r\):

Magnitude of \(r\) Effect Size
\(r\gt 0.5\) Strong
\(0.3 \lt r \lt 0.5\) Moderate
\(0.1\lt r\lt 0.3\) Weak
Cohen considered \(r\) statistics with absolute value less than 0.1 too unlikely to be statistically significant to include. They do exist, but if you wanted to cite Cohen’s guidelines to describe their magnitude you’d have to go with something like ‘smaller than small’

The magnitude of \(r\) is an example of an effect size: a statistic that indicates the degree of association (in the case of correlation) or difference (in pretty much every other case) between groups. As with all effect size interpretation guidelines, they are merely intended to be a general framework: the evaluation of any effect size should be interpreted in the context of similar scientific investigations. As Cohen himself wrote:

“there is a certain risk inherent in offering conventional operational definitions for those terms for use in power analysis in as diverse a field of inquiry as behavioral science”

Figure 7.1 illustrates the variety of sign and direction of \(r\) in scatterplots.

Scatterplots Indicating Positive and Negative; Strong, Moderate, and Weak Correlations

Figure 7.1: Scatterplots Indicating Positive and Negative; Strong, Moderate, and Weak Correlations Hypothesis Testing and Statistical Significance

The null hypothesis in correlational analysis is based on the assumption that \(r\)146 is equal to, greater than or equal to, or less than or equal to 0. A one-tailed test can indicate that a positive correlation is expected:

\[H_0: r \le 0\] \[H_1: r > 0\] or that a negative correlation is expected:

\[H_0: r \ge 0\] \[H_1: r < 0.\] A two-tailed hypothesis suggests that either a positive or a negative correlation may be considered significant:

\[H_0: r=0\] \[H_1: r \ne 0\] Critical values of \(r\) can be found by consulting tables like this one. The critical values of \(r\) are the smallest values given the df – which for a correlation is equal to \(n-2\) – that represents a statistically significant result given the desired \(\alpha\)-rate and type of test (one- or two-tailed). Assumptions of Parametric Correlation

Like most parametric tests in the Classical framework, the results of parametric correlation are based on assumptions about the structure of the data (the assumptions of frequentist parametric tests will get it’s own discussion later. The main assumption to be concerned about147 is the assumption of normality.

The normality assumption is that the data are sampled from a normal distribution. The sample data themselves do not have to be normally distributed – a common misconception – but the structure of the sample data should be such that they plausibly could have come from a normal distribution. If the observed data were not sampled from a normal distribution, the correlation will not be as strong as it would if they were. If there are any concerns about the normality assumption, the best and easiest way to proceed is to back up a parametric correlation with one of the nonparametric correlations described below: usually Spearman’s \(\rho\), sometimes Kendall’s \(\tau\), or, if the data are distributed in a particular way, Goodman and Kruskal’s \(\gamma\).

The secondary assumption for the \(r\) statistic is that the variables being correlated have a linear relationship. As covered below, \(r\) is the coefficient for a linear function: that function does a much better job of modeling the data if the data scatter in a linear function about the line.

If the data are not sampled from a normal distribution and/or linearly related, then \(r\) just won’t work as well. It’s not that the earth beneath the analyst opens up and swallows them whole when assumptions of classical parametric tests are violated: it’s just that they don’t wory as intended, leading to type-II errors (if anything).

7.2.2 Parametric Correlation Example

Let’s use the following set of \(n=10\) paired observations, each with an \(x\) value and a \(y\) value. Please note that it’s pretty rare – but certainly not impossible – to run a correlation with just 10 pairs in an actual scientific study; we’ll just limit it to 10 here because anything more than that gets hard to follow. More importantly, please note that \(n\) refers to pairs of observations in the context of correlation and regression. We get \(n\) not from the number of numbers (which, in this case, will be 20) but by the number of entities being measured, and with correlation each pair of \(x\) and \(y\) are different measurements related to the same thing. For example, imagine we were seeing if there were a correlation between height and weight and we had \(n=10\) people in our dataset: we would have 10 values for heights and 10 values for weights which is 20 total numbers, but the 20 numbers still refer to 10 people.

Where were we? Right, the data we are going to use for our examples. It’s a generic set of \(n=10\) observations numbered sequentially from \(1 \to 10\), 10 observations for a generic \(x\) variable, and 10 observations for a generic \(y\) variable.

\(n\) \(x\) \(y\)
1 4.18 1.73
2 4.70 1.86
3 6.43 3.61
4 5.04 2.09
5 5.25 2.00
6 6.27 4.07
7 5.34 2.56
8 4.21 0.33
9 4.17 1.49
10 4.67 1.46

And here is a scatterplot, with \(x\) on the \(x\)-axis and \(y\) on the \(y\)-axis:

Scatterplot of the Sample Data

Figure 7.2: Scatterplot of the Sample Data Definitional Formula

The product-moment correlation is the sum of the products of the standardized scores (\(z\)-scores) of each pair of variables divided by \(n-1\):

\[r=\frac{\sum_{i=1}^n z_{x_i}z_{y_i}}{n-1}\] This formula is thus known as the definitional formula: it serves as both a description of and a formula for \(r\). To illustrate, we fill the below table out using our sample data:

Observed Values
\(z\)-transformed Values
\(z\) product
Obs \(i\) \(x_i\) \(y_i\) \(z_{x_i}\) \(z_{y_1}\) \(z_{x_i}z_{y_i}\)
1 4.18 1.73 -1.03 -0.36 0.37
2 4.70 1.86 -0.40 -0.24 0.10
3 6.43 3.61 1.72 1.38 2.37
4 5.04 2.09 0.02 -0.03 0.00
5 5.25 2.00 0.27 -0.11 -0.03
6 6.27 4.07 1.52 1.81 2.75
7 5.34 2.56 0.38 0.41 0.16
8 4.21 0.33 -1.00 -1.66 1.66
9 4.17 1.49 -1.05 -0.58 0.61
10 4.67 1.46 -0.44 -0.61 0.27

Taking the sum of the rightmost column and dividing by \(n-1\), we get the \(r\) statistic:


We can check that math using the cor.test() command in R (which we could have just done in the first place):

cor.test(x, y)
##  Pearson's product-moment correlation
## data:  x and y
## t = 6.4736, df = 8, p-value = 0.0001934
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6777747 0.9803541
## sample estimates:
##       cor 
## 0.9163524 Computational Formula

\(r\) can alternatively and equivalently be calculated using what is known as the computational formula:

\[r=\frac{\sum_{i=1}^n \left(x-\bar{x} \right) \left(y-\bar{y} \right)}{\sqrt{SS_xSS_y}}\]


\[SS_x=\sum_{i=1}^n (x-\bar{x})^2\] and

\[SS_x=\sum_{i=1}^n (x-\bar{x})^2\] That formula is called the computational formula because apparently it’s easier to calculate. I don’t see it. But, here’s how we would approach using the computational formula:

Observed Values
Product of Deviations
Squared Deviations
Obs \(i\) \(x_i\) \(y_i\) \((x-\bar{x})\) \((y-\bar{y})\) \((x-\bar{x})(y-\bar{y})\) \((x-\bar{x})^2\) \((y-\bar{y})^2\)
1 4.18 1.73 -0.85 -0.39 0.33 0.72 0.15
2 4.70 1.86 -0.33 -0.26 0.08 0.11 0.07
3 6.43 3.61 1.40 1.49 2.09 1.97 2.22
4 5.04 2.09 0.01 -0.03 0.00 0.00 0.00
5 5.25 2.00 0.22 -0.12 -0.03 0.05 0.01
6 6.27 4.07 1.24 1.95 2.43 1.55 3.80
7 5.34 2.56 0.31 0.44 0.14 0.10 0.19
8 4.21 0.33 -0.82 -1.79 1.46 0.67 3.20
9 4.17 1.49 -0.86 -0.63 0.54 0.73 0.40
10 4.67 1.46 -0.36 -0.66 0.23 0.13 0.44

\[\sum(x-\bar{x})(y-\bar{y})=7.2782\] \[SS_x=\sum(x-\bar{x})^2=6.01504\] \[SS_y=\sum(y-\bar{y})^2=10.4878\]


See? Same answer.

7.2.3 Nonparametric correlation

The \(r\) statistic is the standard value used to describe the correlation between two variables. It is based on the assumptions that the data are sampled from a normal distribution and that the data have a linear relationship. If the data violate those assumptions – or even if we are concerned that the data might violate those assumptions – nonparametric correlations are a viable alternative.

Nonparametric correlations – and nonparametric tests in general – are so-named because they do not make inferences about population parameters: they do not involve means, nor standard deviations, nor combinations of the two (as \(r\) is), in either the null or alternative hypotheses. They are, instead, based on the cumulative likelihood of the observed pattern of the data or more extreme unobserved patterns of the data.148

For example, consider the following data:

\(x\) \(y\)
1 1
2 2
3 3
4 4
5 5
6 6

Please notice that \(x\) and \(y\) are perfectly correlated: in fact, they are exactly the same. What is the probability of that happening? To answer that, we need to know how many possible combinations of \(x\) and \(y\) there are, so we turn to permutation.

To find the number of combinations of \(x\) and \(y\), we can leave one of the variables fixed as is and then find out all possible orders of the other variable: that will give us the number of possible pairs. That is: if we leave all the \(x\) values where they are, we just need to know how many ways we could shuffle around the \(y\) values (or we could leave \(y\) fixed and shuffle \(x\) – it doesn’t matter). The number of possible orders for the \(y\) variable (assuming we left \(x\) fixed) is given by:

\[nPr=~_6P_6=\frac{6!}{0!}=720\] So, there are 720 possible patterns for \(x\) and \(y\). The probability of this one pattern is therefore \(1/720=0.0014\). If we have a one-tailed test where we are expecting the relationship between \(x\) and \(y\) to be positive, then the observed pattern of the data is the most extreme pattern possible (can’t get more positive agreement than we have), and \(0.0014\) is the \(p\)-value of a nonparametric correlation (specifically, the Spearman correlation).

cor.test(1:6, 1:6, method="spearman", alternative="greater")
##  Spearman's rank correlation rho
## data:  1:6 and 1:6
## S = 0, p-value = 0.001389
## alternative hypothesis: true rho is greater than 0
## sample estimates:
## rho 
##   1

If we have a two-tailed test where we do not have an expectation on the direction of the relationship, then there would be one more equally extreme possible pattern: the case where the variables went in perfectly opposite directions (\(x=\{1,~2,~3,~4,~5,~6\}\) and \(y=\{6,~5,~4,~3,~2,~1\}\) or vice versa). With two patterns being as extreme as the observed pattern, the \(p\)-value would then be \(2/720=0.0028\).

cor.test(1:6, 1:6, method="spearman", alternative="two.sided")
##  Spearman's rank correlation rho
## data:  1:6 and 1:6
## S = 0, p-value = 0.002778
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##   1 Spearman’s \(\rho\)

The \(\rho\) is the \(r\) of the ranks: \(rho\) is also known as the rank correlation. There are fancy equations to derive \(\rho\) from the ranks, but I have no idea why you would need to use them instead of just correlating the ranks of the data

Returning to our example data, we can rank each value relative to the other values of the same variable from 1 to \(n\). Usually the ranks go from smallest to largest, but it doesn’t really matter so much as the same convention is followed for both the \(x\) and the \(y\) values. In the case of ties, take the average of the ranks above and below the tie cluster. If, for example, the data were \(\{4, 7, 7, 8\}\), the ranks would be \(\{1, 2.5, 2.5, 4\}\).

\(x\) \(y\) \(rank(x)\) \(rank(y)\)
4.18 1.73 2 4
4.70 1.86 5 5
6.43 3.61 10 9
5.04 2.09 6 7
5.25 2.00 7 6
6.27 4.07 9 10
5.34 2.56 8 8
4.21 0.33 3 1
4.17 1.49 1 3
4.67 1.46 4 2

We then could proceed to calculate the correlation \(r\) between the ranks. Or, even better: we can calculate the \(\rho\) correlation for the observed data:

cor.test(x, y, method="spearman")
##  Spearman's rank correlation rho
## data:  x and y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8787879

And note that the \(\rho\) is the same value as if we calculated \(r\) for the ranks149

cor.test(xrank, yrank, method="pearson")
##  Pearson's product-moment correlation
## data:  xrank and yrank
## t = 5.2086, df = 8, p-value = 0.0008139
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5577927 0.9710980
## sample estimates:
##       cor 
## 0.8787879 Kendall’s \(\tau\)

Where \(\rho\) is based on patterns of ranks, Kendall’s \(\tau\) correlation is based on patterns of agreement. In the annotated scatterplot in Figure 7.3, the lines connecting four of the possible pairs between the five datapoints – the concordant pairs – move in the same direction (from southwest to north east), while the line connecting one pair of the datapoints – the discordant pair moves in the opposite direction.

Pairwise Comparisons of Sample Dataset with Mixed Concordance and Discordance (Example 3)

Figure 7.3: Pairwise Comparisons of Sample Dataset with Mixed Concordance and Discordance (Example 3)

The \(\tau\) for the example data (the original example data, not the data for Figure 7.3) is:

cor.test(x, y, method="kendall")
##  Kendall's rank correlation tau
## data:  x and y
## T = 39, p-value = 0.002213
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.7333333 A Bayesian Nonparametric Approach to Correlation

There is not a lot said about the differences between Classical vs. Bayesian approaches to correlation and regression on this page, mainly because this is an area where Bayesians don’t have many snarky things to say about it. Bayesians may take issue with the assumptions or with the way that \(p\)-values are calculated – and both of those potential problems are addressed in the Bayesian regression approach discussed below – but they are pretty much down with the concept itself.

A recently developed approach150 involves taking the Kendall \(\tau\) concept of concordant pairs and discordant pairs and treating the two pair possibilities as one would successes and failures in a binomial context. Using the number of concordant pairs \(T_c\) as the number of discordant pairs \(T_d\) as \(s\) and \(f\), we can analyze the concordance parameter \(\phi_c\) the same way we analyze the probability \(\pi\) of \(s\): with a \(\beta\) distribution. The \(\phi_c\) parameter is related to Kendall’s \(\tau\):

\[\phi_c=\frac{\tau+1}{2}\] and Figure 7.4 shows how the \(\beta\) distribution of \(\phi_c\) relates to scatterplots of data described by various values of \(\tau\).

$\hat{\tau}$ Values ($n=20$) and Corresponding Posterior $\beta$ Distributions of $\phi_c$

Figure 7.4: \(\hat{\tau}\) Values (\(n=20\)) and Corresponding Posterior \(\beta\) Distributions of \(\phi_c\)

There is a slight problem with the \(\tau\) statistic in the case where values are tied (either within a variable or if multiple pairs are tied with each other). The difference is small, but for a more accurate correction for ties, see the bonus content at the end of this page. Goodman and Kruskal’s \(\gamma\)

Imagine that we wanted to compare the letter grades received by a group of 50 high school students in two of their different classes. Here is a cross-tabulation of their grades:

Algebra Grade
Social Studies Grade A 6 2 1 2 0
Social Studies Grade B 2 6 2 1 1
Social Studies Grade C 1 1 7 1 1
Social Studies Grade D 1 1 0 6 3
Social Studies Grade F 0 0 0 0 5

Goodman and Kruskal’s \(\gamma\) correlation allows us to make this comparison. The essential tenet of the \(\gamma\) correlation is this: if there were a perfect, positive correlation between the two categories, then all of the students would be represented on the diagonal of the table (everybody who got an “A” in one class would get an “A” in the other, everybody who got a “B” in one class would get a “B” in the other, etc.). If there were a perfect, inverse correlation, then all of the students would be represented on the reverse diagonal. The \(\gamma\), which like the other correlation coefficients discussed here ranges from \(-1\) to 1, is a measure to the extent which categories agree. That means that the \(\gamma\) can be used for categorical data, but also for numbers that can be placed into categories (like, the way that ranges of numerical grades can be converted into categories of letter grades).

GoodmanKruskalGamma(table(social.studies, algebra), conf.level = 0.95)
##     gamma 
## 0.6769596 0.4685468 0.8853724

7.2.4 Nonlinear correlation

One of the assumptions of the \(r\) correlation is that there is a linear relationship between the variables. If there is a non-linear relationship between the variables, one option is to use a nonparametric correlation like \(\rho\) or \(\tau\)151. If there is a clear pattern to the non-linearity, one might also consider transforming one of the variables until the relationship is linear, calculating \(r\), and then transforming it back.

As an example, the data represented in Figure 7.5 are the same example data that we have been using throughout this page, except that the \(y\) variable has been exponentiated: that is, we have raised the number \(e\) to the power of each value of \(y\)

Scatterplot of the Sample Data $x$ and $e^y$

Figure 7.5: Scatterplot of the Sample Data \(x\) and \(e^y\)

We can still calculate \(r\) for these data despite the relationship now looking vaguely non-linear: the correlation between \(x\) and \(e^y\) is 0.86. However, that obscures the relationship between the variables.152 In this case, we can take the natural logarithm \(\ln{e^y}\) to get \(y\) back, and then calculate an \(r\) of 0.92 as we did before. We would then report the nonlinear correlation \(r\), or report \(r\) as the correlation between one variable and the transformation of the other.

The only thing that we can’t do is transform a variable, calculate \(r\), and then report \(r\) without disclosing the transformation. That’s just cheating.

There are, of course, whole classes of variables for which we use transformations, usually in the context of regression. To name a few examples from the unit on probability distributions, when the \(y\) variable is distributed as a logistic distribution, we transform the \(y\) variable in the procedure known as logistic regression; when the \(y\) variable is distributed as a Poisson distribution, we use Poisson regression, and when the \(y\) variable is distributed as a \(\beta\) distribution, we use \(beta\) regression. That’s one area of statistics where the names really are pretty helpful.

A couple of caveats before we move on: some patterns aren’t as easy to spot as logarithms (or, squares or sine waves), and most nonlinear correlations are better handled by multiple regression (sadly, beyond the scope of this course).

7.3 Regression

Correlation is an indicator of a relationship between variables. Regression builds on correlation to create predictive models. If \(x\) is our predictor variable and \(y\) is our predicted variable, then we can use the correlation between the observed values \(x\) and \(y\) to make a model that predicts unobserved values of \(y\) based on unobserved values of \(x\).

7.3.1 The Least-Squares Regression Line

Our regression model is summarized by the least-squares regression line. The least-squares line is the function out of all possible functions of its type that results in the smallest squared distance between each observed \((x, y)\) pair and the function itself. In visual terms, it is the line that minimizes the sum of the squared distances between each point and the line: move the line in any direction and the sum of the squared distances would increase. There are two principal ways of expressing the least-squares regression line: the standardized regression equation and the raw-score regression equation. Standardized Regression Equation

The standardized regression equation describes the relationship between the standardized values of \(x\) and the standardized values of \(y\). The standardized values of the variables are the \(z\)-scores of the variables: thus, the standardized regression equation is also called the \(z\)-score form of the regression equation. The standardized regression equation indicates that the estimated \(z\)-score of the \(y\) variable is equal to the product of \(r\) and the \(z\)-score of the \(y\) variable:


Figure 7.6 is a scatterplot of the \(z\)-scores for the \(x\) and \(y\) variables in the example introduced above. The blue line is the least-squares regression line in standardized (\(z\)-score) form. Please note that when \(z_x=0\), then \(rz_x=\hat{z_y}=0\), so the line – as do all lines made by standardized regression equations – passes through the origin \((0,~0)\).

Scatterplot of $z_x$ and $z_y$

Figure 7.6: Scatterplot of \(z_x\) and \(z_y\) Raw-score regression equation

Technically, the standardized regression equation would be all you would need to model the relationship between two variables – if you knew the mean and standard deviation of each variable and had the time to derive \(x\) and \(y\) values from \(z_x\) and \(z_y\) values. But, the combination of those statistics and the time and/or patience to convert back-and-forth from standardized scores are rarely available, especially when consuming reports on somebody else’s research. It is far more meaningful (and convenient) to put the standardized equation into a form where changes in the plain old raw \(x\) variable correspond to changes in the plain old raw \(y\) variable. That format is known as the raw-score form of the regression equation, and it’s almost always the kind of regression formula one would actually encounter in published work.

The form of the raw-score equation is:

\[\hat{y}=ax+b\] where \(a\) is the slope of the raw-score form of the least-squared regression line and \(b\) is the \(y\)-intercept of the line (that is, the value of the line when \(x=0\) and the line intersects with the \(y\)-axis).

We get the raw-score equation by converting the \(z\)-score form of the equation. Wait, actually, 99.9% of the time we get the raw-score equation using software in the first place, but if we did need to calculate the raw-score equation by hand153, we would first find \(r\), which would give us the standardized equation \(\hat{z_y}=rz_x\), which we would then convert to the raw-score form via the following steps:

  1. Find the intercept \(b\)

The intercept (of any equation in the Cartesian Plane) is the point at which a line intersects with the \(y\)-axis, meaning that it is also the point on a line where \(x=0\). Thus, our first job is to find what the value of \(y\) is when \(x=0\). We don’t yet have a way to relate \(x=0\) to any particular \(y\) value, but we do have a way to connect \(z_x\) to \(z_y\): the standardized regression equation \(\hat{z_y}=rz_x\).

First, we can find the value of \(z_x\) for \(x=0\) using the \(z\)-score equation \(z=\frac{x-\bar{x}}{sd}\) and plugging 0 in for \(x\). Using our sample data, \(\bar{x}=5.026\) and \(sd_x=0.8175193\), thus:

\[z_x=\frac{0-5.026}{0.8175193}=-6.147867\] From the standardized regression equation, we know that \(\hat{z_y}=rz_x\), so the value of \(\hat{z_y}\) when \(x=0\) is:

\[\hat{z_y}=rz_x=(0.9163524)(-6.147867)=-5.633613\] And then, using the \(z\)-score formula again and knowing that the mean of \(y\) is 2.12 and the sd of \(y\) is 1.0794958, we can solve for \(y\):

\[-5.633613=\frac{y-2.12}{1.079496}\] \[y=-3.961463\] We know then that the raw-score least-squares regression line passes through the point \((0, -3.961463)\), so the intercept \(b=-3.961463\).

It takes two points to find the slope of a line. We already have one – the \(y\)-intercept \((0, -3.961463)\) – and any other one will do. I prefer finding the point of the line where \(x=1\), just because it makes one step of the math marginally easier down the line.

Using the same procedure as above, when \(x\)=1, \(z_x=-4.924654\), \(z_y=-4.512719\), and \(y=-2.751461\). The slope of the line is the change in \(y\) divided by the change in \(x\) \(\left( \frac{\Delta y}{\Delta x} \right)\):

\[a=\frac{\Delta y}{\Delta x}=\frac{-2.751461--3.961463}{1-0}=1.210002\] Thus, our least-squares regression equation in raw-score form is (after a bit of rounding):


Scatterplot of $x$ and $y$ Featuring Our Least-Squares Regression Line in Raw-Score Form

Figure 7.7: Scatterplot of \(x\) and \(y\) Featuring Our Least-Squares Regression Line in Raw-Score Form

But, if all of that algebra isn’t quite your cup of tea, this method is quicker:

model<-lm(y~x) ## "lm" stands for Linear Model
## Call:
## lm(formula = y ~ x)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80264 -0.22414  0.00656  0.33794  0.63366 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.9615     0.9505  -4.168 0.003133 ** 
## x             1.2100     0.1869   6.474 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.4584 on 8 degrees of freedom
## Multiple R-squared:  0.8397, Adjusted R-squared:  0.8197 
## F-statistic: 41.91 on 1 and 8 DF,  p-value: 0.0001934 A note on the hat that \(y\) is wearing

The jaunty cap sported by \(y\) and \(z_y\) in this section indicates that \(y\) and \(z_y\) are estimates. \(rz_x\) does not equal exactly \(z_y\), nor does \(ax+b\) equal exactly \(y\). If they did, then all of the points on the scatterplot would be right on the regression line. There is always error in prediction, hence, the \(\hat{hat}\).

7.3.2 The Proportionate Reduction in Error (\(R^2\))

Once you have \(r\), \(R^2\) is pretty easy to calculate:


\(R^2\) as a statistic has special value. It is known – in stats classes only, but known – as the proportionate reduction in error. That title means that \(R^2\) represents the predictive advantage conferred by using the model to predict values of the \(y\) variable over using the mean of \(y\) \((\bar{y})\) to make predictions.

In model prediction terms, \(R^2\) is the proportion of the variance in \(y\) that is explained by the model. Recall that the least-squares regression line minimizes the squared distance between the points in a scatterplot and the line. The line represents predictions and the distance represents the prediction error (usually just called the error). The smaller the total squared distance between the points and the line, the lower the error, and the greater the value of \(R^2\).

Mathematically – in addition to being the square of the \(r\) statistic – \(R^2\) is given by:

\[R^2=\frac{SS_{total}-SS_{error}}{SS_{total}}\] where \(SS_{total}\) is the sum of the squared deviations in the \(y\) variable:

\[\sum (y-\bar{y})^2\] and \(SS_{error}\) is the sum of the squared distances between each \(y\) value predicted by the model at each point \(x\) and the observed \(y\) predicted by the model for each \(x\):

\[\sum (y_{pred}-y_{obs})^2\] The table below lists the squared deviations of \(y\), the predictions \(y_{pred}=ax+b\), and the squared prediction errors154 \((y_{pred}-y_{obs})^2\):

Obs \(x\) \(y\) \((y-\bar{y})^2\) \(y_{pred}\) \((y_{pred}-y_{obs})^2\)
1 4.18 1.73 0.1521 1.0963 0.4015757
2 4.70 1.86 0.0676 1.7255 0.0180902
3 6.43 3.61 2.2201 3.8188 0.0435974
4 5.04 2.09 0.0009 2.1369 0.0021996
5 5.25 2.00 0.0144 2.3910 0.1528810
6 6.27 4.07 3.8025 3.6252 0.1978470
7 5.34 2.56 0.1936 2.4999 0.0036120
8 4.21 0.33 3.2041 1.1326 0.6441668
9 4.17 1.49 0.3969 1.0842 0.1646736
10 4.67 1.46 0.4356 1.6892 0.0525326

The sum of the column \((y-\bar{y})^2\) gives the \(SS_{total}\), which is 10.4878. The sum of the column \((y_{pred}-y_{obs})^2\) gives us the \(SS_{error}\), which is 1.6811761. Thus, the proportionate reduction of error – better known to like everybody on Earth as \(R^2\) – is:

\[R^2=\frac{SS_{total}-SS_{error}}{SS_{total}}=0.8397017\] And if we take the square root of 0.8397017, we get the now-hopefully-familiar \(r\) value of \(0.9163524\).

7.4 Bonus Content

7.4.1 Correcting the Kendall’s \(\tau\) Software Estimate

As a general mathematical rule, there are \(\frac{n(n-1)}{2}\) possible comparisons between \(n\) pairs of numbers. We will call the number of concordant-order comparisons of a given dataset \(n_c\). Thus, for the data in Example 1:

A comparison between any two pairs of observations is considered concordant-order if and only if the difference in the ranks of the corresponding variables in each pair is of the same sign.

A comparison between any two pairs of observations is considered discordant-order if and only if the difference in the ranks of the corresponding variables in each pair is of the opposite sign.

We will call the number of discordant-order comparisons of a given dataset \(n_d\). Calculation of \(\hat{\tau}\) Without Ties

If there are no ties among the \(x\) values nor among the \(y\) values in the data, we can calculate the sample Kendall \(\hat{\tau}\) statistic using the following formula:

\[\hat{\tau}=\frac{2(n_c-n_d)}{n(n-1)}.\] With Ties

When making pairwise rank comparisons, ties in the data reduce the number of valid comparisons that can be made. For the purposes of calculating \(\hat{\tau}\) when there are tied data, there are three types of ties to consider:

  1. \(x\) ties: tied values for the \(x\) variable

  2. \(y\) ties: tied values for the \(y\) variable

  3. \(xy\) ties: pairs where the \(x\) observation and the \(y\) observation are each members of a tie cluster (that is, the pair is doubly tied)

For example, please consider the following data (note: this dataset has been created so that both the \(x\) variables, the \(y\) variables, and the alphabetical pair labels are all in generally ascending order for simplicity of presentation: note that this does not need to be the case with observed data):

pair \(x\) \(y\) \(rank_x\) \(rank_y\)
A 1 11 1.0 1.0
B 2 12 3.0 2.0
C 2 13 3.0 3.5
D 2 13 3.0 3.5
E 3 14 5.0 5.0
F 4 15 6.0 6.0
G 5 16 7.5 7.0
H 5 17 7.5 8.0
I 6 18 9.0 9.5
J 7 18 10.0 9.5

Let us consider, in turn, the \(x\), the \(y\), and the \(xy\) ties:

  1. \(x\) ties: In the example data, there are two tied-\(x\) clusters. One tied-\(x\) cluster comprises the \(x\) observations of pair B, pair C, and pair D: each pair has an \(x\) value of 2, and those three \(x\) values are tied for 3rd lowest of the \(x\) set. The other tie cluster comprises the \(x\) observations of pair G and pair H, and those \(x\) values are tied for 7.5th lowest. Thus, there are two \(x\) tie clusters: one of 3 members and one of 2 members. Let \(t_{xi}\) denote the number of tied-\(x\) members \(t_x\) of each cluster \(i\) and let \(m_x\) denote the number of tied-\(x\) clusters. We will then define a variable \(T_x\) to denote the number of comparisons that are removed from the total possible comparisons used to calculate the \(\hat{\tau}\) statistic as such:


For the example data:


  1. \(y\) ties: In the example data, there are also two tied-\(y\) clusters. One tied-\(y\) cluster comprises the \(y\) observations of pair C and pair D: each has an \(y\) value of 13, and those two \(y\) values are tied for 3.5th lowest of the \(y\) set. The other tie cluster comprises the \(y\) observations of pair I and pair J, each with an observed \(y\) value of 18 and tied for 9.5th highest position. Thus, there are two \(y\) tie clusters, each with 2 members. Let \(t_{yi}\) denote the number of tied-\(y\) members \(t_y\) of each cluster \(i\) and let \(m_y\) denote the number of tied-\(y\) clusters. We will then define a variable \(T_y\) to denote the number of comparisons that are removed from the total possible comparisons used to calculate the \(\hat{\tau}\) statistic as such:


For the _example data:


  1. \(xy\) ties: There are two data pairs in the example data that share the same \(x\) rank \(y\) rank – pair C and pair D – and the pairs in each are said to be doubly tied or, that they constitute an \(xy\) cluster. Therefore, for this dataset there is one \(xy\) cluster with two members – the \(x\) and \(y\) values of pair C and the \(x\) and \(y\) values of pair D. Let \(t_{xyi}\) denote the number of members \(t_{xy}\) of each \(xy\) cluster \(i\) and let \(m_{xy}\) denote the number of \(xy\) clusters. We will then define a variable \(T_{xy}\) to denote the number of comparisons that are removed from the total possible comparisons used to calculate the \(\hat{\tau}\) statistic as such:


For the example data:


Earlier, we noted that the maximum number of comparisons between pairs of data – which we can denote \(n_{max}\) – is equal to \(\frac{n(n-1)}{2}\) (where \(n\) equals the number of pairs). In the cases of Example 1, Example 2, and Example 3, there were no \(x\), \(y\), or \(xy\) ties, and so that relationship gave the value for the maximum number of comparisons that was used to calculate \(\hat{\tau}\). Generally, the equation to find \(n_{max}\) – whether or not \(T_X=0\), \(T_Y=0\), and/or \(T_{XY}=0\) is the following:


The above equation removes the number of possible comparisons lost to ties by subtracting \(T_x\) and \(T_y\) from the total possible number \(\frac{n(n-1)}{2}\), and then adjusts for those ties that were double-counted in the process by re-adding \(T_{xy}\).

With the number of available comparisons now adjusted for the possible presence of ties, we may now re-present the equation for the \(\hat{\tau}\) statistic as:

\[\hat{\tau}=\frac{n_c-n_d}{n_{max}}\] \(\hat{\tau}\) for larger data sets (\(\tau\)-b correction)

To this point, the example datasets we have examined have been so small that identifying and counting concordant-order comparisons, discordant-order comparisons, tie clusters, and the numbers of members of tie clusters has been relatively easy to do. To calculate the Kendall \(\hat{\tau}\) with larger datasets invites a software-based solution, but there is a problem specific to the \(\hat{\tau}\) calculation that can lead to inaccurate estimates when there are tied data in a set.

That problem stems from the specific correction for ties proposed by Kendall (1945), following Daniels (1944), in an equation for a correlation known as the Kendall tau-b correlation \(\hat{\tau_b}\):


In addition to being endorsed by Kendall himself, this equation has been adopted, for example, in R (as well as SAS, Stata, and SPSS), which brings us to our software problem. If there are no ties in a dataset, then the \(\hat{\tau_b}\) denominator simplfies to \(\sqrt{(\frac{n(n-1)}{2})(\frac{n(n-1)}{2})}=\frac{n(n-1)}{2}\), which is the same denominator as in the calculation of \(\hat{\tau}\) when there are no ties. However, when working in R, a correlation test (cor.test()) with the Kendall option (method="kendall") will return \(\hat{\tau_b}\) instead of the desired \(\hat{\tau}\).

If the Kendall correlation statistic is calculated using a program – such as R – that is programmed to return the \(\hat{\tau_b}\) statistic rather than the \(\hat{\tau}\) statistic, then the \(\hat{\tau}\) statistic can be recovered by multiplying by the denominator of the \(\hat{\tau_b}\) calculation and dividing by the proper \(\hat{\tau}\) calculation (in other words: multiply by the wrong denominator to get rid of it and then divide by the correct denominator):


Given that we may be working with larger datasets in using this correction, it may seem that we have traded one problem for another: the above equation may give us the desired estimate of \(\hat{\tau}\), but how do we find \(T_X\), \(T_Y\), \(T_XY\), \(n_c\), and \(n_d\) without using complex, tedious, and likely error-prone counting procedures? The process of counting ties can be handled with coding conditional logic. Alternatively, the R language has several commands in its base package that can be exploited for that purpose. One way to do this is using the table() command. The table() command in R applied to an array, will return a table with each of the observed values as the header (the “names”) and the frequency of each of those values. For example, given an array \(x\) with values [5, 5, 5, 8, 8, 13]], table(x) will return:

x<-c(5, 5, 5, 8, 8, 13)
## x
##  5  8 13 
##  3  2  1

We can restrict the table to a subset with only those members of the array that appear more than once:

## x
## 5 8 
## 3 2

And then, we can remove the header row with the unname() command, leaving us with an array of the sizes of each tie cluster in the original array \(x\):

## [1] 3 2

We then can use the count of values in that resulting array to represent the number of tie clusters in the original array and the values in that resulting array to represent the number of members in each tie cluster in the original array, a procedure we can use to calculate \(T_X\) and \(T_Y\).

To find doubly-tied values in a pair array – as we need to calculate \(T_XY\) – only a minor modification is needed. We can take two arrays (for example, \(x\) and \(y\)), pair them into a dataframe, and then the rest of the procedure is the same. Thus, given the data from Example 4, :

x<-c(1, 2, 2, 2, 3, 4, 5, 5, 6, 7)
y<-c(11, 12, 13, 13, 14, 15, 16, 17, 18, 18)
## [1] 2

we reach the same conclusion as we did when counting by hand: there is one \(xy\) cluster with two members in this dataset.

However we count the numbers of tie clusters and the count of members in each tie cluster, once \(T_X\), \(T_Y\), and \(T_XY\) are determined, there are algebraic means for calculating \(n_c\) and \(n_d\).

The value of \(n_c\) follows from algebraic rearrangement of the above equations:


The value of \(n_d\) is then found as the difference between \(n_{max}\) and \(n_c\):


  1. Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic press.↩︎

  2. Normally we would use a Greek letter to indicate a population parameter when discussing null and alternative hypotheses, but the letter \(\rho\) is going to be used for something else later on.↩︎

  3. And, honestly, not that concerned about↩︎

  4. The binomial test used as the Classical example of Classical and Bayesian Inference is an example of a nonparametric test. In that test, the patterns being analyzed were the number of observed successes or more in the number of trials given the assumption that success and failure were equally likely. In the Classical approach to that example, we were not concerned with, parameters like the mean number of successes or the standard deviation of the number of successes in the population.↩︎

  5. The \(p\)-values are different, though, as they are based on permutations. For significance, use the value given by software or feel free to consult this old-ass table↩︎

  6. Full disclosure: I’m an author on a manuscript under review as of 10/2020 describing this approach. Also, by “recently developed,” I mean like “working on it right now.”↩︎

  7. Nonparametric alternatives are usually a good choice when the assumptions of classical parametric tests are violated.↩︎

  8. This is a case where violating assumptions makes type-II errors more likely: the \(r\) value is smaller than it would be otherwise, increasing the chances of missing a real effect.↩︎

  9. Calculating a raw-score regression equation by hand isn’t quite as ridiculous as I’m making it seem. If you’re ever in a situation where you need to work with models that aren’t part of software packages (for example, this one), it’s really helpful to know these steps.↩︎

  10. Note: prediction errors are commonly referred to as residuals.↩︎