Correlation

Suppose we have two random variables $X$ and $Y$, and that we observe a random sample of $n$ observations $(x_i, y_i)$ from the bivariate population. We can assess the relationship between the two random variables using their covariance or, more conveniently because it does not depend on the units of measurement, their correlation.

The covariance of two random variables $X, Y$ is defined as

$\mathrm{Cov}(X,Y) = \mathrm{E}(XY)-\mathrm{E}(X)\mathrm{E}(Y).$

The correlation of two random variables $X, Y$ is defined as

$\rho(X,Y) =\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}.$
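As a small illustration of the two definitions, the sketch below evaluates $\mathrm{Cov}(X,Y)$ and $\rho(X,Y)$ for a discrete joint distribution. The probabilities in `joint_pmf` and the helper `expectation` are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from math import sqrt

# Hypothetical joint probability mass function P(X = x, Y = y),
# made up for illustration only.
joint_pmf = {
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.4,
}


def expectation(g):
    """Return E[g(X, Y)] under joint_pmf."""
    return sum(p * g(x, y) for (x, y), p in joint_pmf.items())


e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)

# Cov(X, Y) = E(XY) - E(X)E(Y)
cov_xy = expectation(lambda x, y: x * y) - e_x * e_y

# Var(X) = E(X^2) - E(X)^2, and similarly for Y
var_x = expectation(lambda x, y: x * x) - e_x ** 2
var_y = expectation(lambda x, y: y * y) - e_y ** 2

# rho(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y))
rho = cov_xy / sqrt(var_x * var_y)

print(f"Cov(X, Y) = {cov_xy:.2f}, rho(X, Y) = {rho:.2f}")
```

Here $\mathrm{E}(X)=\mathrm{E}(Y)=0.5$ and $\mathrm{E}(XY)=0.4$, so the covariance is $0.15$ and, with both variances equal to $0.25$, the correlation is $0.6$.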

Calculation of the correlation coefficient

Let $\bar{x}$ and $\bar{y}$ denote the sample means of the $x_i$ ’s and $y_i$ ’s respectively.
Let $S_{xx} = \sum_{i=1}^n(x_i-\bar{x})^2$, which can also be written as $\sum_{i=1}^nx_i^2-(\sum_{i=1}^nx_i)^2/n$. This is often called the corrected sum of squares of $X$.
Similarly define $S_{yy} = \sum_{i=1}^n(y_i-\bar{y})^2$ , the corrected sum of squares for $Y$ .
Let $S_{xy} =\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^nx_iy_i-(\sum_{i=1}^nx_i\sum_{i=1}^ny_i)/n$ , which is called the corrected sum of products of $X$ and $Y$ .
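The equivalence between the definitional and shortcut forms can be checked numerically. The following is a minimal sketch on a hypothetical five-point sample; the data values and variable names are made up for illustration.

```python
# Hypothetical sample, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Definitional forms: sums of squared (or cross) deviations from the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Shortcut forms: raw sums corrected by the product of totals divided by n.
s_xx_short = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_yy_short = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
s_xy_short = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

print(s_xx, s_yy, s_xy)                    # 10.0 6.0 6.0
print(s_xx_short, s_yy_short, s_xy_short)  # 10.0 6.0 6.0
```

The shortcut forms need only running totals, which is convenient for hand calculation. In floating-point arithmetic they can lose accuracy when the mean is large relative to the spread, so the definitional forms are usually the safer choice in code.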

$S_{xx}/(n-1)$ is an unbiased estimator of $\sigma_x^2 = \mathrm{Var} (X)$
$S_{yy}/(n-1)$ is an unbiased estimator of $\sigma_y^2 = \mathrm{Var} (Y)$
$S_{xy}/(n-1)$ is an unbiased estimator of $\rho\sigma_x\sigma_y = \mathrm{Cov} (X,Y)$

Thus the population correlation $\rho = \mathrm{Cov}(X,Y)/\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}$ can be estimated by replacing each of $\mathrm{Cov}(X,Y)$, $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ by its unbiased estimator. The factors of $1/(n-1)$ cancel, giving

$\begin{aligned} r &=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\\ &=\frac{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\frac{1}{n-1}\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}\\ &=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}\end{aligned}.$

$r$ is the sample correlation coefficient.
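The calculation of $r$ can be sketched in a few lines of Python. The function name `sample_correlation` and the data are hypothetical, introduced only for illustration; the function follows the formula $r = S_{xy}/\sqrt{S_{xx}S_{yy}}$ directly.

```python
from math import sqrt


def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0 or s_yy == 0:
        # r is undefined when either variable shows no variation.
        raise ValueError("r is undefined when a variable is constant")
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical sample, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = sample_correlation(x, y)
print(round(r, 4))  # 0.7746
```

For this sample $S_{xx}=10$, $S_{yy}=6$ and $S_{xy}=6$, so $r = 6/\sqrt{60} \approx 0.7746$.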

Interpretation of $r$

The sample correlation coefficient is a measure of the linear association between $X$ and $Y$ , i.e. how close the points $(x_i,y_i)$ are to a straight line when plotted on a scatterplot.

A value of $r$ that is close to $+1$ (or $-1$ ) indicates that the points lie close to a line of positive (or negative) slope.
A value of $r$ close to 0 indicates that there is little linear association (although there may be some other form of relationship).
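To illustrate these cases, the sketch below computes $r$ for three hypothetical data sets: an exact line of positive slope, an exact line of negative slope, and an exact quadratic relationship on values of $x$ placed symmetrically about zero. The helper `sample_correlation` and all data values are assumptions made for this example.

```python
from math import sqrt


def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired observations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical x values, symmetric about zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]

y_up = [2 * xi + 1 for xi in x]     # exact line, positive slope
y_down = [-3 * xi + 4 for xi in x]  # exact line, negative slope
y_quad = [xi ** 2 for xi in x]      # exact relationship, but not linear

r_up = sample_correlation(x, y_up)
r_down = sample_correlation(x, y_down)
r_quad = sample_correlation(x, y_quad)

print(r_up, r_down, r_quad)
```

The two straight lines give $r = 1$ and $r = -1$. The quadratic data give $r = 0$ even though $y$ is completely determined by $x$, because the positive and negative cross-deviations cancel exactly.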

It is therefore important to interpret the correlation coefficient in conjunction with a scatterplot. Some typical scatterplots are sketched below.

A correlation between two variables $X$ and $Y$ may arise because

change in $X$ causes change in $Y$ or
change in $Y$ causes change in $X$ or
there is no causal relationship, but both $X$ and $Y$ may be influenced by some other variable which is unrecorded.

The existence of a correlation does not itself allow us to decide which of these interpretations is correct. Other evidence is needed before a causal relationship can be inferred.
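The third possibility can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variable `z`, the noise levels, the sample size and the seed are all hypothetical. Here `x` and `y` are each generated from `z` plus independent noise, so neither causes the other.

```python
import random
from math import sqrt


def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired observations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / sqrt(s_xx * s_yy)


random.seed(0)  # fixed seed so the run is repeatable
n = 1000

# z plays the role of the unrecorded variable (hypothetical).
z = [random.gauss(0.0, 1.0) for _ in range(n)]

# x and y each depend on z plus independent noise; neither depends on the other.
x = [zi + random.gauss(0.0, 0.5) for zi in z]
y = [zi + random.gauss(0.0, 0.5) for zi in z]

r_xy = sample_correlation(x, y)
print(f"r between x and y: {r_xy:.2f}")
```

Under these assumptions the population correlation between $X$ and $Y$ is $1/1.25 = 0.8$, so the sample value should be strongly positive even though there is no causal link between the two variables. If `z` were not recorded, the data alone would not reveal this.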
