# 16 Covariance, Correlation, and Least-squares regression

Before we can start on least-squares regression, we need to review covariance and correlation.

## 16.1 Covariance and Correlation

Covariance and Correlation provide a numerical measure of how and how strongly two random variables $$X$$ and $$Y$$ are related.

If you look at two plots of the same data below, you can see that the strength of the (linear) relationship between the two variables seems to depend on the scale used in the plots. This tells us that a graphical tool such as a scatter plot has limitations when it comes to judging correlation.

This demonstrates the need to develop a measurement for the correlation of two random variables that is scale and unit invariant.

Let’s start with a summary of useful formulas and properties.

1. Recall the formulas for the expected value and the variance:
$$\small E[X]= \sum_{ \text{all x} }x \cdot P(X=x)$$ for a discrete random variable $$X$$, and
$$\small E[X]= \int_{ \text{all x} } x \cdot f(x)dx$$ for a continuous random variable $$X$$.
$$\small Var[X] = E[(X-E[X])^2] = E[X^2]-(E[X])^2$$

2. For real valued constants a and b we have
$$\small E[aX+b] = aE[X]+b$$ and
$$\small VAR[aX+b] = a^2 VAR[X].$$

3. The standardized score or z-score of a random variable $$\small X$$ with mean $$\mu$$ and standard deviation $$\sigma$$ is $\small Z = Z_X = \frac{X-\mu}{\sigma}.$ Note that $$E[Z]=0$$ and $$VAR[Z]=1$$.

4. If $$\small X$$ and $$\small Y$$ are two independent random variables, then
$$\small E[XY]=E[X]E[Y]$$ and
$$\small VAR[X \pm Y]=VAR[X]+VAR[Y].$$

5. For two random variables $$\small X$$ and $$\small Y$$, the covariance is defined as $\small Cov[X,Y] = E[(X-\mu_{X})(Y-\mu_{Y})]$ where $$\small \mu_{X}$$ and $$\small \mu_{Y}$$ are the expected values of $$\small X$$ and $$\small Y$$, respectively.
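As a quick sketch in R (on simulated data, not data from the text), the definition above can be estimated by plugging in sample means. Note that R's built-in `cov()` divides by $$\small n-1$$ rather than $$\small n$$, so we rescale before comparing:

```r
set.seed(2)  # simulated data for illustration only
x <- rnorm(500)
y <- 0.5 * x + rnorm(500, sd = 0.3)
n <- length(x)
# plug-in estimate of E[(X - mu_X)(Y - mu_Y)]
cov_def <- mean((x - mean(x)) * (y - mean(y)))
# cov() uses the n-1 denominator, so rescale to compare
all.equal(cov_def, cov(x, y) * (n - 1) / n)   # TRUE
```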

Assignment Assume $$\small X$$ and $$\small Y$$ are two random variables. For practice, prove the following:

P1 Prove $$\small Cov[X,Y] = Cov[Y,X]$$

P2 Prove $$\small Cov[aX+b,Y] = aCov[X,Y]$$

P3 Prove $$\small Cov[X,Y] = E[XY] - E[X]E[Y]$$.

P4 Prove $$\small Cov[X+Y,Z] = Cov[X,Z] + Cov[Y,Z]$$

P5 Prove $\small Cov[\sum_{i=1}^n X_i,\sum_{j=1}^m Y_j] = \sum_{i=1}^n \sum_{j=1}^m Cov(X_i, Y_j)$

P6 Prove If X and Y are independent, then Cov(X,Y)=0.
Note that the converse is false. Cov(X,Y)=0 does not imply X and Y are independent.

P7 Prove $$\small Cov(X,c) = 0$$ for any real constant $$\small c$$.

P8 Prove $$\small VAR[X \pm Y] = VAR[X] + VAR[Y] \pm 2 Cov(X,Y)$$

P9 Prove If X and Y are independent, then $$\small VAR[X+Y] = VAR[X] + VAR[Y]$$.
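While the proofs are left as exercises, identities such as P3 and P8 are easy to sanity-check numerically in R. The sketch below (simulated data, not from the text) uses divide-by-n (plug-in) moments, so the identities hold exactly up to floating-point error:

```r
set.seed(3)  # simulated data for illustration only
x <- rnorm(200)
y <- x + rnorm(200)
cv <- function(a, b) mean((a - mean(a)) * (b - mean(b)))  # plug-in covariance
vr <- function(a) cv(a, a)                                # plug-in variance
# P3: Cov[X,Y] = E[XY] - E[X]E[Y]
all.equal(cv(x, y), mean(x * y) - mean(x) * mean(y))  # TRUE
# P8: Var[X + Y] = Var[X] + Var[Y] + 2 Cov(X,Y)
all.equal(vr(x + y), vr(x) + vr(y) + 2 * cv(x, y))    # TRUE
```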

Let’s compute the covariance of Bwt and Hwt for cats from the cats data set in the MASS package. We convert the body weight from kg to grams and re-compute the covariance. Finally, we convert grams to pounds (1 lb = 453.592 g) and re-compute the covariance. You see that the covariance is unit dependent.

par(mfrow=c(1,3))
plot(cats$Bwt, cats$Hwt, xlab="Body weight in kg",ylab="Heart weight")
cb1000 <- cats$Bwt*1000
plot(cb1000, cats$Hwt, xlab="Body weight in gr",ylab="Heart weight")
cblb <- cb1000/453.592
plot(cblb, cats$Hwt, xlab="Body weight in lb",ylab="Heart weight")
print(paste("Using kg, the co-variance is",cov(cats$Bwt,cats$Hwt)))
#> [1] "Using kg, the co-variance is 0.950112665112665"
print(paste("Using gr, the co-variance is",cov(cb1000, cats$Hwt)))
#> [1] "Using gr, the co-variance is 950.112665112665"
print(paste("Using lb, the co-variance is",cov(cblb, cats$Hwt)))
#> [1] "Using lb, the co-variance is 2.09464158343327"

A unit-invariant measure of the relationship between two random variables $$\small X$$ and $$\small Y$$ is the correlation coefficient $\rho (X,Y) = \frac{Cov[X,Y]}{\sigma_X \sigma_Y},$ where $$\sigma_X$$ and $$\sigma_Y$$ are the respective standard deviations of $$\small X$$ and $$\small Y$$.

Assignment P10 Prove Multiplying the variables by constants, or adding constants, does not change the correlation coefficient, i.e. $\small \rho(a+bX, c+dY) = \rho(X,Y).$

Next, we shall prove that $$\small | \rho(X,Y) | \le 1$$.

Theorem For two random variables $$\small X$$ and $$\small Y$$, we have $$\small | \rho(X,Y) | \le 1.$$

Proof We use the standardized versions of the random variables $$\small X$$ and $$\small Y$$: $$\small Z_X = \frac{X-\mu_X}{\sigma_X}$$ and $$\small Z_Y = \frac{Y-\mu_Y}{\sigma_Y}$$. Note that $$\small \mu_{Z_X} = \mu_{Z_Y} = 0$$ and $$\small \sigma_{Z_X} = \sigma_{Z_Y} = 1$$. We have

\begin{align} \ \small 0 &\small\le Var[Z_X + Z_Y] = Var[Z_X]+ Var[Z_Y] + 2 Cov(Z_X,Z_Y) &\small by \hspace{3pt} P8 \\ & \small = 2 + 2 Cov(Z_X,Z_Y) = 2 + 2 \frac{Cov(Z_X,Z_Y)}{\sigma_{Z_X} \sigma_{Z_Y}} &\small because \hspace{3pt} \sigma_{Z_X} = \sigma_{Z_Y} = 1 \\ &\small = 2+2\rho(Z_X,Z_Y) & \small definition \hspace{3pt} of \hspace{3pt} \rho\\ &\small = 2+2 \rho(X,Y)&\small by \hspace{3pt} P10 \end{align}

Similarly, we get $$\small 0 \le 2-2\rho(X,Y)$$. Put together, we have $$\small 0 \le 2 \pm 2 \rho(X,Y)$$, i.e. $$\small \rho(X,Y) \le 1$$ and $$\small -\rho(X,Y) \le 1$$. This means $$\small |\rho(X,Y)| \le 1$$ $$\hspace{300pt}\square$$

Assignment P11 Prove $$\small | \rho(X,Y) | = 1$$ iff $$\small Y=aX+b$$.

A positive correlation coefficient means the data are positively correlated, a negative one that they are negatively correlated.
However, the magnitude of the correlation coefficient does not give the slope; it measures how “perfectly” the data line up. As you showed in P11, $$\small | \rho(X,Y) | = 1$$ means the data form a perfect line. In that case (perfectly correlated variables), the slope of the resulting line is $$\small s_y/s_x$$.

## 16.2 (Linear) least squares regression

When trying to find the line of best fit (aka regression line, least squares regression line, trend line), we first need to define what “best” means. For us, “best” means that the distance between the observed y-values (the data) and the predicted y-values (the line) is minimized. We compute the coefficients $$\small a$$ and $$\small b$$ in $$\small y=a+bx$$ in such a way that the sum of the squares ($$\small ssq$$) of the vertical differences between the data points and $$\small y=a+bx$$ is minimized. Because the line is determined by the parameters $$\small a$$ and $$\small b$$, we interpret $$\small ssq$$ as a function $$\small ssq(a,b)$$ of $$\small a$$ and $$\small b$$. Assume we have paired data $$\small (x_i,y_i), i=1,...,n$$. We then minimize $\small ssq(a,b) = \sum_{i=1}^n (y_i - (a + b x_i))^2 .$

If you have already taken Calculus 3, you will recognize this as a simple two-variable optimization problem: find the partial derivatives of $$\small ssq(a,b)$$ with respect to $$\small a$$ and $$\small b$$, set them to zero, and solve for $$\small a$$ and $$\small b$$. In practice, we will use R to do that.

Note: if you just want to match the sample data, this is a deterministic problem, and the least squares regression line is derived without making any assumptions about the data. You could use a similar process to find non-linear regression curves. However, if you want to use the regression line derived from your sample to estimate the regression line for the whole population, then this becomes a statistical problem and we need to make assumptions. More about this later.
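To see the optimization view in action, here is a small sketch (simulated data) that minimizes $$\small ssq(a,b)$$ numerically with `optim()` and compares the result with `lm()`, which solves the same problem in closed form:

```r
set.seed(4)  # simulated data for illustration only
x <- seq(0, 10, by = 0.1)
y <- 2 + 3 * x + rnorm(length(x), sd = 1)
ssq <- function(p) sum((y - (p[1] + p[2] * x))^2)  # p = c(a, b)
fit_num <- optim(c(0, 0), ssq)   # numerical minimization of ssq(a, b)
fit_lm  <- lm(y ~ x)             # closed-form least squares
round(fit_num$par, 3)            # close to a = 2, b = 3
round(unname(coef(fit_lm)), 3)   # essentially the same values
```

Both approaches agree to several decimal places; `lm()` is preferred in practice because it is exact and also returns standard errors and diagnostics.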
## 16.3 Pearson’s correlation coefficient

Pearson’s correlation coefficient $$\small r_{Pearson}$$ or $$\small r_{x,y}$$ measures the strength and direction of the linear relationship between two variables. You can think of it as the sample analog to the correlation $$\small \rho$$ introduced above. Given a sample of $$\small (x,y)$$ values, the sample Pearson correlation coefficient is computed as $\small r_{Pearson}= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{ \sum_{i=1}^n(y_i-\bar{y})^2}}$

Note that, if we have a population, this is the correlation coefficient introduced above.

## 16.4 Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient assesses whether there is a monotone relationship between the ranks of two variables. This relationship may be linear or not. It works similarly to Pearson’s, except that we use the rank of each variable instead of the value of the variable. The advantage is that you can use it with qualitative ordinal data.

After assigning each value a rank, there may be ties. There are different methods of dealing with ties. Assuming the 2nd and 3rd data points are tied, the ranking could be: 1 3 3 4, 1 2 2 3, 1 2 2 4, 1 2 3 4, 1 2.5 2.5 4, ….

If $$\small r(x_i)$$ stands for the rank of $$\small x_i$$, then the Spearman correlation coefficient is computed as $\small r_{Spearman}= r_S = \frac{\sum_{i=1}^n (r(x_i)-\bar{r}(x_i))(r(y_i)-\bar{r}(y_i))}{\sqrt{\sum_{i=1}^n (r(x_i)-\bar{r}(x_i))^2} \sqrt{ \sum_{i=1}^n (r(y_i)-\bar{r}(y_i))^2}}$

You will sometimes see an alternative formula involving $$\small d_i = r(x_i)-r(y_i)$$: $\small r_{Spearman} = 1-\frac{6 \sum d_i^2}{n(n^2-1)}.$

Careful! This formula can only be used if you have no ties, i.e. all ranks are distinct integers.

Example The example below shows data with a perfect rank correlation, but imperfect linear correlation.
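First, though, a quick sketch (simulated data) checking both formulas against R’s built-in cor(); note that Spearman’s coefficient is just Pearson’s coefficient applied to the ranks:

```r
set.seed(5)  # simulated data for illustration only
x <- rnorm(50)
y <- 2 * x + rnorm(50)
# Pearson's r straight from the formula above
r_formula <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
all.equal(r_formula, cor(x, y, method = "pearson"))               # TRUE
# Spearman's r = Pearson's r computed on the ranks
all.equal(cor(rank(x), rank(y)), cor(x, y, method = "spearman"))  # TRUE
```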
#> Warning: package 'ggplot2' was built under R version 4.3.2

x <- c(0,1,2,3,5,6,8,9,11,7)
y <- 10-x^3
data <- data.frame(x,y)
ggplot(data, aes(x, y))+geom_point()+labs(title="Some sample data")
#> [1] "Pearson correlation coefficient: -0.893628265531117"
#> [1] "Spearman correlation coefficient: -1"

## 16.5 Sample investigation of the geyser data from MASS

Let’s put all this into practice using the geyser data from the MASS package. The MASS package contains a data set called geyser which contains data on waiting time between eruptions of Old Faithful and length of said eruptions.

### 16.5.1 Graphical investigation

We start by looking at a graph of our data.

ggplot(geyser, aes(waiting, duration))+geom_point()+
  labs(title="duration vs waiting time")

Looking at the graph, we find that the duration seems to be inversely correlated with the waiting time. Calculating the correlation coefficients confirms this.

print(paste("Pearson correlation coefficient:",
  cor(geyser$duration, geyser$waiting, method="pearson")))
#> [1] "Pearson correlation coefficient: -0.644623006047782"
print(paste("Spearman correlation coefficient:",
  cor(geyser$duration, geyser$waiting, method="spearman")))
#> [1] "Spearman correlation coefficient: -0.650089254703162"

To compute the equation of the regression line, you use the lm (linear model) function. I want to use the outputs later, so I am calling the result using Pearson glmP, and the one from Spearman glmS.

x <- geyser$waiting
y <- geyser$duration
glmP <- lm(y~x)
glmP
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Coefficients:
#> (Intercept)            x
#>     7.31314     -0.05327
yr <- rank(y)
xr <- rank(x)
glmS <- lm(yr~xr)
glmS
#>
#> Call:
#> lm(formula = yr ~ xr)
#>
#> Coefficients:
#> (Intercept)           xr
#>    247.2607      -0.6484

We can use the confint function to find confidence intervals for the parameters $$\small a$$ and $$\small b$$, the y-intercept and the slope.

confint(glmP,level=0.95)
#>                   2.5 %      97.5 %
#> (Intercept)  6.78191691  7.84437178
#> x           -0.06048661 -0.04605735
confint(glmP,level=0.99)
#>                   0.5 %      99.5 %
#> (Intercept)  6.61334245  8.01294624
#> x           -0.06277603 -0.04376793
confint(glmS,level=0.95)
#>                    2.5 %      97.5 %
#> (Intercept) 232.2846765 262.2367392
#> xr           -0.7349511  -0.5618584

I now want to plot the data with the regression lines added:

library(gridExtra)
interceptp <- glmP$coefficients[1]
slopep <- glmP$coefficients[2]
intercepts <- glmS$coefficients[1]
slopes <- glmS$coefficients[2]
plot1 <- ggplot(geyser, aes(x=waiting, y=duration))+
  geom_point() +
  geom_abline(intercept = interceptp, slope = slopep, color="red",linewidth=1.5)+
  labs(title="Pearson's correlation, geyser data")
plot2 <- ggplot(geyser, aes(x=rank(waiting), y=rank(duration)))+
  geom_point() +
  geom_abline(intercept = intercepts, slope = slopes, color="red",linewidth=1.5)+
  labs(title="Spearman's correlation, geyser data")
grid.arrange(plot1, plot2, ncol=2)

As we will be (mostly) using numerical data at the ratio level, we will from now on use Pearson’s correlation.

## 16.6 Formal hypothesis test for correlation

We can now proceed to formal hypothesis testing. In particular, we will test:

$\small H_0: \text{there is no correlation between x and y}$
$\small H_A: \text{there is correlation between x and y}$

As mentioned above, we have to satisfy some prerequisites. Formally, they are: Let $$\small (x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)$$ be points with fixed values $$\small x_i$$ and random values $$\small Y_i$$.

1. The x-values are fixed, not random (so in our example, the waiting times are fixed or given).
2. Each $$\small Y_i$$ value is independent of the other $$\small Y_j$$ values.
3. $$\small Y_i \sim \mathcal{N}(\mu_i,\sigma^2)$$. In our example, that means that for any given wait time, the possible duration is random and normally distributed. This one is really hard to check; you would need many observations for each $$\small x_i$$ to be able to run a test. Luckily, this assumption is also the least important.
4. The relationship between the x-values and the means $$\small \mu_i$$ is linear. This means that when you plot the $$\small (x_i,y_i)$$ pairs, you should observe a linear-ish shape, not for example a parabola. This assumption is critical, but often violated in practice. If the violation is “small”, it is accepted practice to go on, but if the violation is large, then p-values, confidence intervals, and standard errors may all be incorrect.
5. The residuals $$\small \epsilon_i = Y_i - \mu_i$$ are independent (this is critical), all have the same variance, and are normally distributed.

Our data wasn’t that linear, but also not obviously non-linear. Let’s proceed, but let’s keep in mind that our results may not be reliable. We will next check a plot of the residuals.

The test statistic is $$\small t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$, which under $$\small H_0$$ follows a t-distribution with $$\small n-2$$ degrees of freedom. We can compute it by hand:

y <- geyser$duration
r <- cor(x,y,method="pearson")
n <- length(x)
df <- n-2
t <- r/sqrt(1-r^2)*sqrt(n-2)
t
#> [1] -14.53136
p <- pt(t, df=df,lower.tail=TRUE)*2
p
#> [1] 1.648496e-36

In R, just use the function cor(x,y,method="pearson") to find the correlation coefficient and the function cor.test(x,y,method="pearson") to test for significance. Let’s do this for the geyser data:

cor.test(geyser$waiting,geyser$duration,method="pearson")
#>
#>  Pearson's product-moment correlation
#>
#> data:  geyser$waiting and geyser$duration
#> t = -14.531, df = 297, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.7064014 -0.5730975
#> sample estimates:
#>       cor
#> -0.644623

You see a very small p-value, so we reject the null hypothesis and accept the alternative: the true correlation is not 0. Observe that we also receive a 95% confidence interval for the correlation. But also remember that we violated some of the prerequisites, so these results are not fully reliable.

## 16.7 The coefficient of determination $$\small R^2$$

Assume you have data points $$(x_i, y_i)$$, that $$\bar y$$ is the mean of the $$y_i$$, and that you have estimates $$f(x_i) = f_i$$. Then the coefficient of determination is computed as $R^2 = 1-\frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar y )^2}= 1-\frac{SS_{residuals}}{SS_{total}}$

$$SS$$ stands for sums of squares. If you are using linear regression, then $$R^2 = r_{Pearson}^2$$. $$R^2$$ can be re-written as $R^2 = \frac{\sum_i (f_i - \bar y)^2}{\sum_i (y_i - \bar y )^2}= \frac{\sum_i (f_i - \bar y)^2 /n}{\sum_i (y_i - \bar y )^2/n} = \frac{SS_{explained}/n}{SS_{total}/n},$ which is the ratio of the explained variance to the total variance.

### 16.7.1 Another example

Here, I generate linear data with some normally distributed errors. Because I generated the data myself, I know the prerequisites are met.
x <- seq(0,100, 0.5)
mean <- runif(201,-0.1, 0.1)
e <- rnorm(201,0,0.5)
y <- 10-x/10 +mean+e
plot(x,y)
data <- data.frame(x,y)

Here is the code for linear and rank correlation:

glmP <- lm(y~x)
glmP
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Coefficients:
#> (Intercept)            x
#>     10.0447      -0.1007
summary(glmP)
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.9507 -0.3899 -0.0289  0.3113  1.8677
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.044714   0.071545  140.40   <2e-16 ***
#> x           -0.100651   0.001238  -81.33   <2e-16 ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.5091 on 199 degrees of freedom
#> Multiple R-squared:  0.9708, Adjusted R-squared:  0.9706
#> F-statistic:  6614 on 1 and 199 DF,  p-value: < 2.2e-16
yr <- rank(y)
xr <- rank(x)
glmS <- lm(yr~xr)
glmS
#>
#> Call:
#> lm(formula = yr ~ xr)
#>
#> Coefficients:
#> (Intercept)           xr
#>    200.6158      -0.9863
confint(glmP)
#>                  2.5 %      97.5 %
#> (Intercept)  9.9036308 10.18579751
#> x           -0.1030918 -0.09821066
confint(glmS)
#>                  2.5 %      97.5 %
#> (Intercept) 197.929359 203.3022832
#> xr           -1.009359  -0.9632316

And a few ggplots showing data, regression lines, and residuals.

library(gridExtra)
interceptp <- glmP$coefficients[1]
slopep <- glmP$coefficients[2]
intercepts <- glmS$coefficients[1]
slopes <- glmS$coefficients[2]
p1 <- ggplot(data, aes(x=x, y=y))+
  geom_point() +
  geom_abline(intercept = interceptp, slope = slopep, color="red",linewidth=1)+
  labs(title=paste("Pearson's correlation",round(cor(x, y, method="pearson"),3)))
p2 <- ggplot(data, aes(x=xr, y=yr))+
  geom_point() +
  geom_abline(intercept = intercepts, slope = slopes, color="red",linewidth=1)+
  labs(x="Rank x",y="Rank y",title=paste("Spearman's correlation",round(cor(x, y, method="spearman"),3)))
grid.arrange(p1, p2, ncol=2)
p3 <- ggplot(glmP, aes(x, glmP$residuals))+geom_point()+
geom_hline(yintercept=0, color="purple", size=1)+
labs(x="x",y="Residual",title="Residual plot")
#> Warning: Using size aesthetic for lines was deprecated in ggplot2
#> 3.4.0.
#> ℹ Please use linewidth instead.
#> This warning is displayed once every 8 hours.
#> Call lifecycle::last_lifecycle_warnings() to see where
#> this warning was generated.

p4 <- ggplot(glmP, aes(sample=glmP$residuals))+stat_qq(distribution = qnorm, dparams=list(0,sd(glmP$residuals)))+
geom_abline(intercept=0, slope=1,color="red")+
labs(title="Q-Q plot of residuals, Normal distribution")

p5 <- ggplot(glmP, aes(x=glmP$fitted.values, y=glmP$residuals))+geom_point()+
geom_hline(yintercept=0, color="purple", size=1)+
labs(title="residuals vs fits plot ",x="fits",y="residuals")
grid.arrange(p3,p4,p5, ncol=3)

Finally we look at the diagnostic plots. Note that even here, the lines are not perfectly straight and the tails are too fat (see Q-Q plot), even though the residuals were generated from a normal distribution.

par(mfrow=c(2,2))
plot(glmP)
The correlation test reveals that there is indeed a negative correlation between x and y, estimated to be contained in the interval [-0.9888441, -0.9806059] with 95% confidence.

cor.test(x,y)
#>
#>  Pearson's product-moment correlation
#>
#> data:  x and y
#> t = -81.325, df = 199, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.9888441 -0.9806059
#> sample estimates:
#>        cor
#> -0.9852867

Computing $$\small R^2$$ tells us that almost 97% of the variance (or variation) in y is explained by the linear regression model.

rsquare <- cor(x,y)^2
rsquare
#> [1] 0.9707899
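We can also verify the sums-of-squares identity for $$\small R^2$$ directly. The sketch below uses freshly simulated data (generated the same way as in the example above, with variable names changed to avoid clobbering the chapter’s x and y) and checks that $$\small 1 - SS_{residuals}/SS_{total}$$ matches both $$\small r_{Pearson}^2$$ and the value lm reports:

```r
set.seed(6)  # freshly simulated data for illustration only
xs <- seq(0, 100, 0.5)
ys <- 10 - xs/10 + rnorm(length(xs), sd = 0.5)
fit <- lm(ys ~ xs)
r2 <- 1 - sum(residuals(fit)^2) / sum((ys - mean(ys))^2)  # 1 - SS_res/SS_tot
all.equal(r2, cor(xs, ys)^2)           # TRUE
all.equal(r2, summary(fit)$r.squared)  # TRUE
```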

## 16.8 Non-linear regression - a few remarks

To use a non-linear model, you use nls, non-linear least squares. You have to / get to specify the formula you think works, the data frame to use, and starting values for the parameters in the model. In the example below, I used data I generated, but I am not going to show you how.

The plot shows that the data follow some sort of parabola, so it looks like a quadratic model might be worth trying.

We start by plotting the data. There is no need to pretty up the plot at this point.

ggplot(data, aes(x=x, y=y))+geom_point()

The data looks quadratic, so we will try a quadratic model. You have to enter the general quadratic formula $$\small ax^2+bx+c$$ and some starting values.

model <- nls(y ~ a*x^2 + b*x + c, data=data, start=list(a=1, b=1, c=0.1))
model
#> Nonlinear regression model
#>   model: y ~ a * x^2 + b * x + c
#>    data: data
#>        a        b        c
#>  1.06392 -0.04993 -0.06709
#>  residual sum-of-squares: 4.694
#>
#> Number of iterations to convergence: 1
#> Achieved convergence tolerance: 3.861e-08
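An aside: because this quadratic model is linear in its parameters $$\small a, b, c$$, the same fit could also be obtained with lm; nls is really needed only when the model is non-linear in the parameters. Here is a sketch on simulated quadratic data (the chapter’s data-generating code is not shown, so this uses made-up coefficients):

```r
set.seed(8)  # simulated quadratic data for illustration only
xq <- seq(-2, 2, by = 0.05)
yq <- 1.06 * xq^2 - 0.05 * xq - 0.07 + rnorm(length(xq), sd = 0.15)
fit_nls <- nls(yq ~ a*xq^2 + b*xq + c, start = list(a = 1, b = 1, c = 0.1))
fit_lm  <- lm(yq ~ I(xq^2) + xq)
round(unname(coef(fit_nls)), 4)  # a, b, c from nls
round(unname(coef(fit_lm))[c(2, 3, 1)], 4)  # same values, reordered from lm
```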

Next, we extract the parameters $$\small a, b, c$$ and plot the data together with the estimated function.

par(mfrow=c(1,2))
a <- coef(model)[1]
b <- coef(model)[2]
c <- coef(model)[3]
#option 1 computing the estimated values directly
y_est <- a*x^2 + b*x + c
plot(x,y)
lines(x, y_est, col="red")
#option 2 defining a function for estimated values
qfit <- function(x, a, b, c) {a*x^2+b*x+c}
plot(x, y)
ymod <- qfit(x, a=coef(model)[1], b=coef(model)[2], c=coef(model)[3])
lines(x,ymod, type="l", col="steelblue")

Both options give us the same graph, use whichever you prefer. We see that there are at least three outliers, at $$\small x=-2,-1.7, 0$$.
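A third option, avoiding manual coefficient extraction altogether, is to let fitted() produce the estimated values from the model object. A sketch on simulated quadratic data (since the chapter’s data-generating code is not shown):

```r
set.seed(9)  # simulated quadratic data for illustration only
xq <- seq(-2, 2, by = 0.05)
yq <- 1.1 * xq^2 - 0.05 * xq + rnorm(length(xq), sd = 0.15)
model <- nls(yq ~ a*xq^2 + b*xq + c, start = list(a = 1, b = 1, c = 0.1))
plot(xq, yq)
lines(xq, fitted(model), col = "red")  # same curve as computing a*x^2 + b*x + c by hand
```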

Finally, I plot the residuals and test for normality.

library(nortest)
plot(x,residuals(model))
abline(h=0, col="red")