Correlation
Suppose we have two random variables \(X\) and \(Y\), and that we observe a random sample of \(n\) observations \((x_i, y_i)\) from the bivariate population. We can assess the relationship between the two random variables using their covariance or, more conveniently because it does not depend on the units of measurement, their correlation.
The covariance of two random variables \(X, Y\) is defined as
\[\mathrm{Cov}(X,Y) = \mathrm{E}(XY)-\mathrm{E}(X)\mathrm{E}(Y).\]
The correlation of two random variables \(X, Y\) is defined as
\[\rho(X,Y) =\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}.\]
Calculation of the correlation coefficient
Let \(\bar{x}\) and \(\bar{y}\) denote the sample means of the \(x_i\)’s and \(y_i\)’s respectively.
Let \(S_{xx} = \sum_{i=1}^n(x_i-\bar{x})^2\), which can be written as \(\sum_{i=1}^nx_i^2-(\sum_{i=1}^nx_i)^2/n\). This is often called the corrected sum of squares of \(X\).
Similarly define \(S_{yy} = \sum_{i=1}^n(y_i-\bar{y})^2\), the corrected sum of squares for \(Y\).
Let \(S_{xy} =\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^nx_iy_i-(\sum_{i=1}^nx_i\sum_{i=1}^ny_i)/n\), which is called the corrected sum of products of \(X\) and \(Y\).
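The shortcut forms above follow by expanding the product and using \(\sum_{i=1}^n x_i = n\bar{x}\) and \(\sum_{i=1}^n y_i = n\bar{y}\). A brief sketch of the step for \(S_{xy}\):
\[\begin{aligned} \sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) &= \sum_{i=1}^n x_iy_i - \bar{y}\sum_{i=1}^n x_i - \bar{x}\sum_{i=1}^n y_i + n\bar{x}\bar{y}\\ &= \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y}\\ &= \sum_{i=1}^n x_iy_i - \Big(\sum_{i=1}^n x_i\sum_{i=1}^n y_i\Big)/n. \end{aligned}\]
Setting \(y_i = x_i\) throughout gives the corresponding expression for \(S_{xx}\), and likewise for \(S_{yy}\).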
\(S_{xx}/(n-1)\) is an unbiased estimator of \(\sigma_x^2 = \mathrm{Var} (X)\)
\(S_{yy}/(n-1)\) is an unbiased estimator of \(\sigma_y^2 = \mathrm{Var} (Y)\)
\(S_{xy}/(n-1)\) is an unbiased estimator of \(\rho\sigma_x\sigma_y = \mathrm{Cov} (X,Y)\)
Thus the population correlation \[\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}\] can be estimated by replacing each of \(\mathrm{Cov}(X,Y)\), \(\mathrm{Var}(X)\) and \(\mathrm{Var}(Y)\) by its unbiased estimator to give
\[\begin{aligned} r &=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\\ &=\frac{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\frac{1}{n-1}\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}\\ &=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}\end{aligned}.\]
\(r\) is the sample correlation coefficient.
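The calculation can be sketched in a few lines of Python. The data set below is made up purely for illustration, and the variable names (`s_xx`, `s_yy`, `s_xy`) are our own choice, mirroring the notation above. The sketch computes the corrected sums both from their definitions and from the shortcut forms, then forms \(r\).

```python
import math

# Hypothetical sample of n = 5 paired observations (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# Sample means.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and products, from the definitions.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# The same quantities from the shortcut forms.
s_xx_short = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_xy_short = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Sample correlation coefficient; the 1/(n-1) factors cancel.
r = s_xy / math.sqrt(s_xx * s_yy)

print(s_xx, s_yy, s_xy)   # 10.0 6.0 6.0
print(round(r, 4))        # 0.7746
```

For these values \(S_{xx} = 10\), \(S_{yy} = 6\) and \(S_{xy} = 6\), so \(r = 6/\sqrt{60} \approx 0.77\).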
Interpretation of \(r\)
The sample correlation coefficient is a measure of the linear association between \(X\) and \(Y\), i.e. how close the points \((x_i,y_i)\) are to a straight line when plotted on a scatterplot.
A value of \(r\) that is close to \(+1\) (or \(-1\)) indicates that the points lie close to a line of positive (or negative) slope.
A value of \(r\) close to 0 indicates that there is little linear association (although there may be some other form of relationship).
It is therefore important to interpret the correlation coefficient in conjunction with a scatterplot. Some typical scatterplots are sketched below.
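As a small numerical illustration of why \(r\) should not be read on its own, the sketch below compares two made-up data sets on the same \(x\) values: one exactly linear and one exactly quadratic. The helper `sample_correlation` is a hypothetical name introduced here; it simply evaluates \(S_{xy}/\sqrt{S_{xx}S_{yy}}\).

```python
import math


def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired samples x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative x values, symmetric about zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]

# y is an exact linear function of x, so the points lie on a line.
y_linear = [2.0 * xi + 1.0 for xi in x]

# y is an exact quadratic function of x: a perfect, but non-linear, relationship.
y_quadratic = [xi ** 2 for xi in x]

r_linear = sample_correlation(x, y_linear)
r_quadratic = sample_correlation(x, y_quadratic)

print(r_linear)     # 1.0
print(r_quadratic)  # 0.0
```

In the quadratic case \(y\) is completely determined by \(x\), yet \(r = 0\) because the positive and negative products in \(S_{xy}\) cancel. A scatterplot would reveal the curve immediately.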
A correlation between two variables \(X\) and \(Y\) may arise because
change in \(X\) causes change in \(Y\) or
change in \(Y\) causes change in \(X\) or
there is no causal relationship, but both \(X\) and \(Y\) may be influenced by some other variable which is unrecorded.
The existence of a correlation does not itself allow us to decide which of these interpretations is correct. Other evidence is needed before a causal relationship can be inferred.
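The third possibility can be illustrated with a small simulation. This is a sketch under assumptions of our own: an unrecorded variable \(Z\) is drawn from a standard normal distribution, and \(X\) and \(Y\) are each set to \(Z\) plus independent normal noise with standard deviation 0.5. Neither \(X\) nor \(Y\) influences the other, yet their sample correlation comes out strongly positive (under this model the population correlation is \(1/1.25 = 0.8\)).

```python
import math
import random


def sample_correlation(x, y):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired samples x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


# Fixed seed so the simulation is reproducible.
rng = random.Random(0)
n = 2000

# Unrecorded common cause Z (assumed standard normal for illustration).
z = [rng.gauss(0.0, 1.0) for _ in range(n)]

# X and Y each depend on Z plus their own independent noise;
# there is no direct causal link between X and Y.
x = [zi + rng.gauss(0.0, 0.5) for zi in z]
y = [zi + rng.gauss(0.0, 0.5) for zi in z]

r_xy = sample_correlation(x, y)
print(round(r_xy, 2))
```

The data alone cannot distinguish this situation from one in which \(X\) drives \(Y\) directly, which is why other evidence is needed.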