Chapter 18 Associations between continuous variables
Motivating scenarios: We are interested to estimate an association between two continuous variables, and test if this association is differs from zero.
Learning goals: By the end of this chapter you should be able to
- Summarize the association between to continuous variables as
- Covariance AND
- Correlation
- Covariance AND
- Estimate uncertainty in these estimates by
- By bootstrapping AND
- Naively taking
R
’s output at face value.
- By bootstrapping AND
- Test the null hypothesis of no association by
- A permutation test
- Naively accepting
R
’s output at face value.
- A permutation test
18.1 Associations between continuous variables
One of the most common statistical questions we come across is “are the two variables associated?”
For instance, we might want to know if people who study more get better grades.
Such an association would be quite interesting, and if it arises in a randomized controlled experiment, it would suggest that studying caused this outcome.
18.2 Example: Disease and sperm viability
It is hypothesized that there is a trade off between investment in disease resistance and reproduction. To look into this, Simmons and Roberts (2005), assayed sperm viability and lysozyme activity (a measure of defense against bacteria) in 39 male crickets. Let’s use this to explore associations between continuous variables.
ggplot(crickets , aes(x = spermViability, y = lysozyme))+
geom_point()+
labs(x = "Sperm viability (%)", y = expression(paste("Lysozyme activity (",mm^2,")")))
18.3 Summarizing associations between two categorical variables
We focus on two common summaries of associations between two continuous variables, X and Y, the covariance and the correlation.
Both these summaries describe the tendency for values of X and Y to jointly deviate from the mean.
- If larger values of X tend to come with larger values of Y the covariance (and correlation) will be positive.
- If smaller values of X tend to come with larger values of Y the covariance (and correlation) will be negative.
- A covariance (or correlation) of zero means that – on average – X and Y don’t predictably deviate from their means in the same direction. This does not guarantee that X and Y are independent – if they are related by a curve rather than a line, X could predict Y well even if the covariance (or correlation) was zero. It does mean that the best fit line between X and Y will be flat.
For both these summaries X and Y are arbitrary - the covariance between sperm viability and lysozyme activity is the same as the covariance between lysozyme activity and sperm viability.
18.3.1 Covariance
The covariance quantifies the shared deviation of X and Y from the mean as follows:
\[\begin{equation} \begin{split} \text{Sample covariance between X and Y} = Cov(X,Y) &=\frac{\sum{\Big((X_i-\overline{X})\times (Y_i-\overline{Y})\Big)}}{n-1}\\ \end{split} \tag{18.1} \end{equation}\]
The numerator in Eq. (18.1) is known as the ‘Sum of Cross Products’. This should remind you of the numerator in the variance calculation (Sum of squares = \(\sum(X-\overline{X})(X-\overline{X})\)) – in fact the covariance of X with itself is its variance. The covariance can range from \(-\infty\) to \(\infty\), but in practice cannot exceed the variance in X or Y.
Some algebra can rearrange Eq. (18.1) to become Eq. (18.2). These two different looking equations are identical. Often, computing the covariance from (18.1) is more convenient, but this isn’t always true.
\[\begin{equation} \begin{split} \text{Sample covariance between X and Y} = Cov(X,Y) &= \frac{n-1}{n}(\overline{XY} - \overline{X} \times \overline{Y})\\ \end{split} \tag{18.2} \end{equation}\]
This is also equal to \(Cov(X,Y) = \frac{n-1}{n}(\overline{XY} - \overline{X} \times \overline{Y})\). Which is sometimes easier to work with.
We can calculate the covariance by explicitly writing our math out in R or by using the cov()
function:
%>%
crickets mutate(x_prods = (spermViability-mean(spermViability)) * (lysozyme-mean(lysozyme))) %>%
summarise(n = n(),
sum_x_prods = sum(x_prods),
cov_math_a = sum_x_prods / (n-1),
cov_math_b = (n/(n-1)) * sum(mean(spermViability * lysozyme) -
mean(spermViability) * mean(lysozyme)),
cov_r = cov(spermViability, lysozyme))
## # A tibble: 1 × 5
## n sum_x_prods cov_math_a cov_math_b cov_r
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 40 -54.1 -1.39 -1.39 -1.39
Reassuringly we get the same result for all derivations– higher quality sperm come from parents with less lysozyme activity.
Covariance - a visual explanation
Some people have trouble intuiting a covariance, as the mean of the product of the difference X and Y and their means (just typing it is a mouthful).
So I’m hoping the visuals in Fig 18.2 help – if they don’t then skip them. To calculate the covariance,
- Draw a rectangle around a given data point and the means of X and Y.
- Find the area of that rectangle.
- If X is less than its mean and Y is greater than its mean (or vice versa) put a negative sign in front of this area.
- Do this for all data points.
- Add this all up and divide by \(n - 1\).
18.3.2 Correlation
We found a covariance in sperm viability and lysozyme activity of -1.388. Is this large or small – I don’t know.
We therefore usually present a more interpretable summary of the association between variables – the correlation, which standardizes the covariance by dividing through by the product of the standard deviations in X and Y, \(s_X\) and \(s_Y\), respectively:
\[\begin{equation} \begin{split} \text{Sample corelation between X and Y} = Cor(X,Y) &= \frac{Cov(X,Y)}{s_X \times s_Y}\\ \end{split} \tag{18.2} \end{equation}\]
Again we can push through with math, or use the cor()
function in R to find the correlation
%>%
crickets summarise(n = n(),
cov_r = cov(spermViability, lysozyme),
cor_math = cov_r / (sd(spermViability) * sd(lysozyme)),
cor_r = cor(spermViability, lysozyme))
## # A tibble: 1 × 4
## n cov_r cor_math cor_r
## <int> <dbl> <dbl> <dbl>
## 1 40 -1.39 -0.328 -0.328
So the correlation is -0.328. It’s not clear if this is a substantial correlation – as this depends on our field and expectations etc. But we can meaningfully compare this to correlations between any other traits to put it in a broader context.
18.4 Bootstrap to quatify uncertainty
Like all estimates, a sample correlation (or covariance) can deviate from the population parameter by chance.
Again we can use the bootstrap – resampling from our sample with replacement to approximate the sampling distribution — to quantify the impact of sampling error on our estimates.
Bootstrap to quatify uncertainty: Step 1. Resample.
We first make a bootstrap replicates by resampling our data with replacement to make a new potential data set (the same size as our data).
<- crickets %>%
crickets_one_resample slice_sample(prop = 1, replace = TRUE)
So, we now have a data set of 40 combinations of sperm viability and lysozyme activity resampled with replacement from our actual data set.
Bootstrap to quantify uncertainty: Step 2. Summarize a bootstrapped resample.
We then summarize our bootstrap replicate with our estimate of interest – for now we’ll focus on the correlation, but we could have chosen the covariance if we wanted.
<- crickets_one_resample %>%
crickets_one_boot summarise(cor_vals = cor(spermViability, lysozyme))
crickets_one_boot
## # A tibble: 1 × 1
## cor_vals
## <dbl>
## 1 -0.446
From this resampled data set, we estimate a correlation of -0.446 – as compared to the correlation of -0.328 observed in the real data.
Bootstrap to quantify uncertainty: Step 2. Repeat many times to approximate the sampling distribution.
So, we resampled our data with replacement and estimated a correlation from this bootstrapped data set to imagine what other results we could have gotten by sampling error. Let’s repeat steps 1 and 2 a bunch of times (lets say ten thousand) to generate a sampling distribution!
<- 1e4
n_reps
<- replicate(n = n_reps,simplify = FALSE,
crickets_boots expr =crickets %>%
slice_sample(prop = 1, replace = TRUE)%>%
summarise(mean_spermViability = mean(spermViability),
mean_lysozyme = mean(lysozyme),
cor_vals = cor(spermViability, lysozyme)))%>%
bind_rows()
So, this is the bootstrapped sampling distribution. (I only show the first thousand values in table below).
Bootstrap to quantify uncertainty: Step 4. Summarize the bootstrap distribution.
Now that we’ve approximated the sampling distribution for the correlation, we can use it to summarize the uncertainty in our estimate of the correlation. Remember our common summaries of uncertainty:
- The standard error which describes the expected deviation between a parameter and an estimate of as the standard deviation of the sampling distribution.
<- summarise(crickets_boots, se = sd(cor_vals))
crickets_SE_cor crickets_SE_cor
## # A tibble: 1 × 1
## se
## <dbl>
## 1 0.144
- A confidence interval which puts reasonable bounds on our estimate.
<- crickets_boots %>%
crickets_CI_cor summarise(CI = quantile(cor_vals, prob = c(0.025,.975))) %>%
mutate( bound = c("lower","upper"))
crickets_CI_cor
## # A tibble: 2 × 2
## CI bound
## <dbl> <chr>
## 1 -0.581 lower
## 2 -0.0170 upper
We can visualize the bootstrapped distribution, with red dashed lines denoting 95% confidence intervals.
ggplot(crickets_boots, aes(x = cor_vals)) +
geom_histogram(bins = 100, color = "white", size = .3)+
geom_vline(data = crickets_CI_cor, aes(xintercept = CI),
color = "red", lty = 2 )+
theme_light()+
scale_y_continuous(expand = c(0,0))+
scale_x_continuous(limits = c(-1,1), expand = c(0,0))+
labs(title = "Bootstrap distribution of correlations",
subtitle = "Sperm viability & lysozyme size in crickets.")
So we find find a negative correlation between sperm motility and lyzosyme activity, with an estimated correlation coefficient of -0.328, and Standard error of 0.1442, and a 95% confidence interval between -0.581 and -0.017.
18.5 Testing the null hypothesis of no association
So, we estimate a negative correlation between sperm viability and lysozyme activity. As our 95% confidence interval excluded zero, it seems unlikely that the data came from a null population with a true correlation of zero. Now, let’s formally test the null hypothesis that such an association could easily be generated by chance sampling from a null population with a true correlation of zero. Here’s how we do it!
- State the null hypothesis and its alternative.
- Null hypotheses: There is NO linear association between our two variables. e.g. Crickets with more viable sperm do not have more or less lysozyme activity, on average, than crickets with less viable sperm.
- Alternate hypotheses: There is a linear association between our two variables. e.g. Crickets with more viable sperm do have more or less lysozyme activity, on average, than crickets with less viable sperm.
- Null hypotheses: There is NO linear association between our two variables. e.g. Crickets with more viable sperm do not have more or less lysozyme activity, on average, than crickets with less viable sperm.
- Calculate a test statistic to summarize our data.
- Compare the observed test statistic to the sampling distribution of this statistic from the null model.
- Interpret these results. If the test statistic is in an extreme tail of this sampling distribution, we reject the null hypothesis, otherwise we do not.
18.5.1 By permutation
First let’s test this null by permutation. Remember in permutation we generate a sampling distribution under the null by shuffling explanatory and response variables. We need a test statistic – so let’s take the correlation. (p-values and conclusions will be the same if we chose the covariance).
18.5.1.1 Shuffle labels
- Unlike the bootstrap – which aims to approximate the sampling distribution around our data, we permute to generate a sampling distribution under the null of no association.
- So we now shuffle the association between our explanatory and response variables. We do so by reasigning the explanatory variable (sperm viability) o to a resampling of its values without replacement with the
sample()
function (the orders will be shuffled, but all data will be there and associations will be maintained).
# shuffle the relationship between explanatory and response variable
<- crickets %>%
crickets_one_perm mutate(spermViability = sample(spermViability, replace = FALSE))
Unlike bootstrapping, here every observation in our initial data set is present in each permuted sample - but we’ve randomly swapped spermViability
across individuals with known lysozyme
activity. The permutation thus reflects one random sample from the null model of no association because we’ve randomly associated X and Y many times.
18.5.1.2 And summarize permuted data
Now let us summarize our permuted data. Again we’ll focus on the correlation, but we could have used a different test statistic, like the covariance.
<- crickets_one_perm %>%
crickets_one_perm_cor summarise(cor_vals = cor(spermViability, lysozyme))
crickets_one_perm_cor
## # A tibble: 1 × 1
## cor_vals
## <dbl>
## 1 -0.114
So, our permutation yielded a correlation of -0.114 which is greater than to correlation of -0.328 found in the data. We can’t make much of this one sample, so we need to build a sampling distribution under the null to see if our result is indeed an unusual outcome of the null model.
18.5.1.3 Permute and summarize the data many times to generate a sampling distribution for the null model
So, we now replicate what we did a bunch of times to generate a sampling distribution under the null.
<- 1e4
n_reps # shuffle the relationship between explanatory and response variable
<- replicate(n = n_reps, simplify = FALSE,
crickets_permsexpr = crickets %>%
mutate(spermViability = sample(spermViability, replace = FALSE))%>%
summarise(cor_vals = cor(spermViability, lysozyme))) %>%
bind_rows()
Unlike bootstrapping, here every observation in our initial data set is present in each permuted sample - but we’ve randomly swapped spermViability
across individuals with known lysozyme
activity. The permutation thus reflects the null model of no association because we’ve randomly associated X and Y many times. So, we now have a data set of ten thousand estimates of the correlation between sperm viability and lysozyme activity, if there was no correlation between these variables (constraining our data to have the same values for each variable as we observed initially).
18.5.1.4 Let’s compare the correlations of permuted data to what we found in the real data
<- crickets %>%
crickets_cor summarise(obs_cor = cor(spermViability, lysozyme)) %>%
pull()
<- crickets_perms %>%
crickets_perms mutate(actual_cor = crickets_cor,
as_or_more_extreme = (abs(cor_vals) >= abs(crickets_cor)))
Let’s have a look at these permuted summaries and how they compare to the actual data. (I only show the first thousand values in table below).
We can then find the p-value as the proportion of permutations that were as or more extreme than the estimate from the data.
%>%
crickets_perms summarise(pvalue = mean(as_or_more_extreme))
## # A tibble: 1 × 1
## pvalue
## <dbl>
## 1 0.0389
18.5.1.5 Plot the permuted data
ggplot( crickets_perms , aes(x = cor_vals, fill = as_or_more_extreme))+
geom_histogram() +
scale_fill_manual(values = c("grey","black"))+
theme_tufte()+
theme(legend.position = "bottom")+
labs(title = "Permuted distribution of cricket data.")+
annotate(x = -.5,y =1000, label = sprintf("p = %s", pull(cricket_p_val)%>%round(digits = 3)), geom = "text")
18.5.1.6 Find a p-value
Remember the p-value is the probability that we would observe our test statistic, or something more extreme if the null were true. While it is clear from Figure 18.3 that it would be quite strange to see a result as extreme as we did if the null were true, we want to quantify this weirdness with a p-value.
We can approximate a p-value as the proportion of permutations with a correlation as or more extreme as our observed correlation:
<- crickets_perms %>%
cricket_p_val summarise(pvalue = mean(as_or_more_extreme))
cricket_p_val
## # A tibble: 1 × 1
## pvalue
## <dbl>
## 1 0.0389
18.5.2 With math
The permutation above did the job. We came to a solid conclusion. But there are a few downsides
- It can be a bit slow – it takes my computer like ten seconds to permute this data set 100,000 times. While ten seconds isn’t a lot of time it would add up if for example we wanted to do this for large datasets with millions of variables.
- It is influenced by chance. Ever permutation will have a slightly different p-value due to the chance outcomes of random shuffling. While this shouldn’t impact our conclusions it’s a bit unsatisfying.
For these reasons, we usually only apply a permutation to cases where data do not meet the assumptions of the more traditional mathematical approach.
Later in this term we will think a lot about the t-distribution, which is a common and handy statistical distribution. A t-value of one means our estimate is one standard error away from the parameter specified by the null, a t-value of two means our estimate is one standard error away… etc.. etc… etc…
For now know:
- We can approximate the standard error of the correlation, \(SE_r\), as \(\sqrt{\frac{1-r^2}{n-2}}\), where \(r\) is our estimated correlation coefficient. So for the cricket data - we can approximate the standard error as \(\sqrt{\frac{1- (-0.328)^2}{40-2}} = 0.153\), a value similar to but not identical to the SE of
pull(crickets_SE) %>%round(digits = 3)
we estimated from the permutation. Don’t stress about this slight difference – this is statistics – we want to be approximately right.
- We can find \(t\) as \(r / SE_r\) – approximately
-0.328/0.153 = -2.14
in our case.
- In many cases we can use this \(t\) value and the sample size to estimate significance. We’ll track this down in a later class.
- Calculating the confidence interval is gross.
- R can do stats for us.
We can test the null hypothesis of zero correlation with the cor.test()
function in R. The cor.test()
function needs x
and y
as vectors so we have to pull()
them out of our tibble.
cor.test(x = crickets %>% pull(spermViability),
y = crickets %>% pull(lysozyme))
##
## Pearson's product-moment correlation
##
## data: crickets %>% pull(spermViability) and crickets %>% pull(lysozyme)
## t = -2.1393, df = 38, p-value = 0.03889
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.58012395 -0.01821532
## sample estimates:
## cor
## -0.3278643
We find a P-value nearly identical to what we got by permutation and a confidence interval pretty close to ours…. WoooHooo stats works.
Note that R’s test output is easy for humans to read but hard for us to analyze further. The tidy()
function in the broom
package can tidy this output.
library(broom)
cor.test(x = crickets %>% pull(spermViability),
y = crickets %>% pull(lysozyme)) %>%
tidy()
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 -0.328 -2.14 0.0389 38 -0.580 -0.0182 Pearson's… two.sided
18.5.3 Care in interpreting (and running) correlation coefficients.
The correlation looks at how \(X\) and \(Y\) jointly deviate from their means. As such, interpreting a correlation is only meaningful when both \(X\) and \(Y\) are random.
Additionally, correlations summarize linear relationships and will fail to capture more complex relationships between variables.
Finally, the \(t\) test-based approach assumes that both variable come from a normal distribution… more on this later.
18.6 Attenuation: Random noise in X and/or Y brings correlations closer to zero.
We rarely measure all values of X and Y perfectly. In this example
- Each measurement of sperm viability will differ from the individual’s true sperm viability % because of sampling error.
- Lysozyme activity will be off because of both sampling error and imprecision of the ruler etc…
If we are lucky this imprecision is randomly distributed across the true values for each data point.
Such random imprecision impacts our estimate of the correlation because random noise will impact each observation of \(X\) and \(Y\) independently. As such their measured association will be, on average weaker (i.e. closer to zero), than the actual correlation in the real world. This is called attenuation. Two consequences of attenuation are
- Even if imprecision in each value of \(X\) and \(Y\) is unbiased, our estimated correlation will biased to be closer to zero than the actual correlation. Therefore measured correlations are, on average underestimates of true correlations.
- The more precise each measurement, the less bias in our estimate of the correlation coefficient.
18.6.1 Attenuation demonstration
Don’t take my word for the reality of attenuation - let’s simulate it! Let’s take our crickets data set with an actual correlation of, -0.328, and pretend this is a real population.
Now, let’s use the jitter()
function to add
- Random noise in \(X\),
spermViability
.
- Random noise in \(Y\),
lysozyme
- Random noise in both variables,
And compare our estimated correlation in these noisier estimates to our initial estimate of the correlation. To do so, let’s first copy our initial data set a bunch on times with the rep_sample_n()
function in the infer
package. In both cases lets add a standard deviation’s random noise
<- 10000
n_reps
<- crickets %>%
many_crickets rep_sample_n(size = nrow(crickets), replace = FALSE, reps = n_reps)
<- many_crickets %>%
noisy_x mutate(spermViability = jitter(spermViability, amount = sd(spermViability) )) %>%
summarise(est_cor = cor(spermViability, lysozyme)) %>%
mutate(noisy = "x")
<- many_crickets %>%
noisy_y mutate(lysozyme = jitter(lysozyme, amount = sd( lysozyme) )) %>%
summarise(est_cor = cor(spermViability, lysozyme)) %>%
mutate(noisy = "y")
<- many_crickets %>%
noisy_xy mutate(lysozyme = jitter(lysozyme, amount = sd( lysozyme)),
spermViability = jitter(spermViability, amount = sd(spermViability) )) %>%
summarise(est_cor = cor(spermViability, lysozyme)) %>%
mutate(noisy = "xy")
### Make a plot
bind_rows(noisy_x, noisy_y, noisy_xy) %>%
mutate(noisy = fct_reorder(noisy, est_cor)) %>%
ggplot(aes(x = est_cor , fill = noisy)) +
facet_wrap(~noisy, labeller = "label_both", ncol = 1) +
geom_density(show.legend = FALSE) +
geom_vline(xintercept = crickets %>% summarise(cor(spermViability, lysozyme)) %>% pull() ) +
geom_vline(data = . %>% group_by(noisy) %>% summarise(est_cor = mean(est_cor)),
aes(xintercept = est_cor), color = "white")+
annotate(x = crickets %>% summarise(cor(spermViability, lysozyme)) %>% pull() ,
y = 6, label = "correlation in data", geom = "text", hjust = 1.1)+
geom_label(data = . %>% group_by(noisy) %>% summarise(est_cor = mean(est_cor)),
aes(y=6),label = "Avg. noisy correlation", hjust = 0, color = "white", show.legend = FALSE)+
scale_y_continuous(limits = c(0,6.5))
18.7 Homework
18.7.1 Guess the correlation
Play three games of Guess the Correlation, and report your highe score.