6.4 Case Study: Pearson, Spearman, Kendall
This case study works with two data sets. dat1
is composed of two continuous variables; dat2
is composed of two ordinal variables.
A researcher investigates the relationship between cholesterol
concentration and time spent watching TV, time_tv
, in n = 100 otherwise healthy 45 to 65 year old men (dat1
). This data set will meet the conditions for Pearson, so we try all three tests on it.
A researcher investigates the relationship between the level of agreement with the statement “Taxes are too high” (tax_too_high
, four level ordinal) and participant income
level (three level ordinal) (dat2
). The ordinal variables rule out Pearson, leaving Spearman and Kendall.
Conditions
Pearson’s correlation applies when X and Y are continuous (interval or ratio) paired variables with 1) a linear relationship that 2) has no significant outliers, and 3) are bivariate normal. Spearman’s rho and Kendall’s tau only require that X and Y be at least ordinal with 1) a monotonically increasing or decreasing relationship.
- Linearity and Monotonicity. A visual inspection of a scatterplot should find a linear relationship (Pearson) or monotonic relationship (Spearman and Kendall).
Pearson’s correlation additionall requires
- No Outliers. Identify outliers with the scatterplot.
- Normality. Bivariate normality is difficult to assess. Instead, check that each variable is individually normally distributed. Use the Shapiro-Wilk test.
Linearity / Monotonicity
Assess linearity and monotonicity with a scatter plot. dat1
is plotted on the left in Figure 6.1. A second version that fails the linearity test is shown to the right. If the linear relationship assumption fails, consider transforming the variable instead of reverting to Spearman or Kendall.
The ordinal variable data set dat2
is plotted in Figure 6.2.
No Outliers
Pearson’s correlation requires no outliers. Both plots in Figure 6.1 are free of outliers. If there were outliers, check whether they are data entry errors or measurement errors and fix or discard them. If the outliers are genuine, leave them in if they do not affect the conclusion. You can also try tranforming the variable. Failing all that, revert to the Spearman’s rho or Kendall’s tau.
Bivariate Normality
Bivariate normality is difficult to assess. If two variables are bivariate normal, they will each be individually normal as well. That’s the best you can hope to check for. Use the Shapiro-Wilk test.
shapiro.test(cs$dat1$time_tv)
##
## Shapiro-Wilk normality test
##
## data: cs$dat1$time_tv
## W = 0.97989, p-value = 0.1304
shapiro.test(cs$dat1$cholesterol)
##
## Shapiro-Wilk normality test
##
## data: cs$dat1$cholesterol
## W = 0.97594, p-value = 0.06387
If a variable is not normally distributed, you can transform it, carry on regardless since the Pearson correlation is fairly robust to deviations from normality, or revert to Spearman and Kendall.
Test
Calculate Pearson’s correlation, Spearman’s rho, or Kendall’s tau. dat
meets the assumptions for Pearson’s correlation, but try Spearman’s rho and Kendall’s tau too, just to see how close they come to Pearson. dat2
only meets the assumptions for Spearman and Kendall.
Pearson’s Correlation
dat1
met the conditions for Pearson’s correlation.
$cc_pearson <-
(cscor.test(cs$dat1$cholesterol, cs$dat1$time_tv, method = "pearson")
)
##
## Pearson's product-moment correlation
##
## data: cs$dat1$cholesterol and cs$dat1$time_tv
## t = 3.9542, df = 98, p-value = 0.0001451
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1882387 0.5288295
## sample estimates:
## cor
## 0.3709418
r = 0.37 falls in the range of a “moderate” linear relationship. \(r^2\) = 14% is the coefficient of determination. Interpret it as the percent of variability in one variable that is explained by the other. If you are not testing a hypothesis (HO: \(\rho \ne 0\)), you can report just report r. Otherwise, include the p-value. Report your results like this:
A Pearson’s product-moment correlation was run to assess the relationship between cholesterol concentration and daily time spent watching TV in males aged 45 to 65 years. One hundred participants were recruited.
Preliminary analyses showed the relationship to be linear with both variables normally distributed, as assessed by Shapiro-Wilk’s test (p > .05), and there were no outliers.
There was a statistically significant, moderate positive correlation between daily time spent watching TV and cholesterol concentration, r(98) = 0.37, p < .0005, with time spent watching TV explaining 14% of the variation in cholesterol concentration.
You wouldn’t use Spearman’s rho or Kendall’s tau here since the more precise Pearson’s correlation is available. But just out of curiosity, here are the correlations using those two measures.
cor(cs$dat1$cholesterol, cs$dat1$time_tv, method = "kendall")
## [1] 0.2383971
cor(cs$dat1$cholesterol, cs$dat1$time_tv, method = "spearman")
## [1] 0.3322122
Both Kendall and Spearman produced more conservative estimates of the strength of the relationship - Kendall especially so.
You probably wouldn’t use linear regression here either because it describes the linear relationship between a response variable and changes to an independent explanatory variable. Even though we are reluctant to interpret a regression model in terms of causality, that is what is implied the formulation y ~ x
and independence assumption in of X. Nevertheless, correlation and regression are related. The slope term in a simple linear regression of the normalized values equals the Pearson correlation.
lm(
~ x,
y data = cs$dat1 %>% mutate(y = scale(cholesterol), x = scale(time_tv))
)
##
## Call:
## lm(formula = y ~ x, data = cs$dat1 %>% mutate(y = scale(cholesterol),
## x = scale(time_tv)))
##
## Coefficients:
## (Intercept) x
## -3.092e-16 3.709e-01
Spearman’s Rho
Data set dat2
did not meet the conditions for Pearson’s correlation, so use Spearman’s rho and/or Kendall’s tau.
Start with Spearman’s rho. Recall that Spearman’s rho is just the Pearson correlation applied to the ranks. Recall also that the Pearson’s correlation is just the covariance divided by the product of the standard deviations. You can quickly calculate it by hand.
cov(rank(cs$dat2$income), rank(cs$dat2$tax_too_high)) /
sd(rank(cs$dat2$income)) * sd(rank(cs$dat2$tax_too_high))) (
## [1] 0.6024641
Use the function though. I don’t get why cor.test
requires x
and y
be numeric.
$spearman <-
(cscor.test(
as.numeric(cs$dat2$tax_too_high),
as.numeric(cs$dat2$income),
method = "spearman")
)
## Warning in cor.test.default(as.numeric(cs$dat2$tax_too_high),
## as.numeric(cs$dat2$income), : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: as.numeric(cs$dat2$tax_too_high) and as.numeric(cs$dat2$income)
## S = 914.33, p-value = 0.001837
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.6024641
Interpret the statistic using the same rule of thumb as for Pearson’s correlation. A rho over .5 is a strong correlation.
A Spearman’s rank-order correlation was run to assess the relationship between income level and views towards income taxes in 24 participants. Preliminary analysis showed the relationship to be monotonic, as assessed by visual inspection of a scatterplot. There was a statistically significant, strong positive correlation between income level and views towards income taxes, \(r_s\) = 0.602, p = 0.002.
Kendall’s Tau
Recall that Kendall’s tau is a function of concordant (C), discordant (D) and tied (Tx and Ty) pairs of observations. The manual calculation is a little more involved. Here is the function instead.
$kendall <-
(cscor.test(
as.numeric(cs$dat2$tax_too_high),
as.numeric(cs$dat2$income),
method = "kendall")
)
## Warning in cor.test.default(as.numeric(cs$dat2$tax_too_high),
## as.numeric(cs$dat2$income), : Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: as.numeric(cs$dat2$tax_too_high) and as.numeric(cs$dat2$income)
## z = 2.9686, p-value = 0.002991
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.5345225
As with the case study on cholesterol and television, Kendall’s tau was more conservative than Spearman’s rho. \(\tau_b\) = 0.535 is still in the “strong” range, though just barely.
A Kendall’s tau-b correlation was run to assess the relationship between income level and views towards income taxes amongst 24 participants. Preliminary analysis showed the relationship to be monotonic, as assessed by visual inspection of a scatterplot. There was a statistically significant, strong positive correlation between income level and the view that taxes were too high, \(\tau_b\) = 0.535, p = 0.003.