Section 5 Linear Predictors

In this document, I restrict my attention to “Linear Predictors” of the form shown below. It follows that my predicted response variable will vary linearly, and hence continuously, with the explanatory variables.

Definition 5.1 (Linear Prediction Functions) Let \(\underset{(r \times 1)}{Z}\) denote an r-dimensional vector of Explanatory Random Variables, \(\underset{(1 \times 1)}{Y}\) a univariate “response” Random Variable, \(\overline{\underset{(1 \times 1)}{Y}}\) a Linear Predictor of \(Y\), \(b_{0}\) a constant and \(\underset{(r \times 1)}{b_{Z}}\) an r-dimensional vector of constant coefficients. Then:

\[\overline{Y} =b_{0} + b_{Z}^{'}Z\]

\(\square\)
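
To make the definition concrete, here is a minimal Python sketch of such a predictor; the parameter values and the use of NumPy are illustrative assumptions, not part of the definition.

```python
import numpy as np

def linear_predictor(z, b0, b_z):
    """Evaluate Y_bar = b0 + b_z' z for a single observation z."""
    return b0 + b_z @ z

# Illustrative (assumed) parameter values with r = 2 explanatory variables:
b0 = 1.0
b_z = np.array([0.5, -2.0])
print(linear_predictor(np.array([3.0, 1.0]), b0, b_z))  # 1.0 + 1.5 - 2.0 = 0.5
```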

If I can transform my data so that it appears to be multivariate normally distributed, then the best predictor is in fact linear in the explanatory variables (see Johnson, Wichern, and others (2014), page 404), so in that scenario the restriction to linear predictors costs nothing. In any case, linear models have the non-technical advantage of being easier to interpret and manipulate. We can use them to better understand our data before moving to more advanced techniques.

As we see below, we can measure their explanatory power by calculating the mean square prediction error and the multiple correlation coefficient.

Best Linear Predictors

In keeping with the approach to Multivariate Analysis in Johnson, Wichern, and others (2014), I will measure the accuracy of my linear prediction models using a mean square error criterion. That is, the technically best prediction model will have the lowest value for the Mean Square Prediction Error.

I can now define my prediction problem as follows:

Definition 5.2 (Prediction Problem) Let \(Y\), \(Z_{1}\), \(Z_{2}\), …, \(Z_{r}\) represent univariate random variables. We assume these variables are generated from a joint distribution with population mean \(\underset{(r+1) \times 1}{\mu}\) and population covariance matrix \(\underset{(r+1) \times (r+1)}{\Sigma}\), which we assume to have full rank.

We denote the linear predictor random variable by \(\overline{Y}\), and the prediction error random variable by \(\epsilon\). We define these quantities as follows:

\[\begin{align} \overline{Y}&=b_{0} + b_{Z}^{'}Z\\ \epsilon&=Y-\overline{Y} \end{align}\]

The objective of my linear prediction problem is to minimise the Mean Square Prediction Error (MSE) \[MSE=E(\epsilon^2)\] by varying the values of parameters \(b_{0}\) and \(b_{Z}\).

\(\square\)
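
As a concrete (assumed) instance of this problem, the following sketch draws from a hypothetical multivariate normal population, with illustrative values for \(\mu\) and \(\Sigma\), and estimates the MSE of a candidate predictor by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed joint population for (Y, Z1, Z2): illustrative mu and full-rank Sigma.
mu = np.array([2.0, 0.0, 1.0])
Sigma = np.array([[4.0, 1.5, 0.8],
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
draws = rng.multivariate_normal(mu, Sigma, size=200_000)
y, z = draws[:, 0], draws[:, 1:]

def mse(b0, b_z):
    """Monte Carlo estimate of E(epsilon^2) = E[(Y - b0 - b_z' Z)^2]."""
    eps = y - (b0 + z @ b_z)
    return np.mean(eps ** 2)

# Predicting with the mean alone (b_z = 0) leaves MSE close to sigma_YY = 4:
print(mse(2.0, np.zeros(2)))
```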

As will be shown later, an equivalent objective is to maximise the population multiple correlation coefficient.

Definition 5.3 (Population Multiple Correlation Coefficient) If we partition the mean vector and covariance matrix of the response and explanatory variables as follows:

\[\mu = \begin{bmatrix} \underset{(1 \times 1)}{\mu_{Y}}\\ \underset{(r \times 1)}{\mu_{Z}} \end{bmatrix}\]

\[\Sigma = \begin{bmatrix} \underset{(1 \times 1)}{\sigma_{YY}} & \underset{(1 \times r)}{\sigma_{YZ}}\\ \underset{(r \times 1)}{\sigma_{ZY}} & \underset{(r \times r)}{\sigma_{ZZ}} \end{bmatrix}\]

We can define the Population Multiple Correlation Coefficient (\(\rho_{Y(Z)}\)):

\[\rho_{Y(Z)} := \sqrt{\frac{\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}}\]

\(\square\)
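
Given a partitioned covariance matrix, \(\rho_{Y(Z)}\) is a one-line computation; the matrix below is an illustrative example, not taken from the text:

```python
import numpy as np

# Illustrative full-rank covariance matrix for (Y, Z1, Z2), partitioned
# as in Definition 5.3: row/column 0 is Y, the rest are Z.
Sigma = np.array([[4.0, 1.5, 0.8],
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
sigma_YY = Sigma[0, 0]      # (1 x 1)
sigma_ZY = Sigma[1:, 0]     # (r x 1)
sigma_ZZ = Sigma[1:, 1:]    # (r x r)

rho_YZ = np.sqrt(sigma_ZY @ np.linalg.solve(sigma_ZZ, sigma_ZY) / sigma_YY)
print(rho_YZ)
```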

Understanding Best Linear Predictors

The following theoretical arguments reinforce our understanding of the meaning of \(\rho_{Y(Z)}\). They establish (for proofs see Johnson, Wichern, and others (2014), Section 7.8, page 401):

  • \(\rho_{Y(Z)}^2\) as an upper bound on the squared correlation between \(Y\) and any linear predictor of the form \(b_{0}+b_{Z}^{'}Z\)
  • the interpretation of \(\rho_{Y(Z)}^2\) as the proportion of the variation of \(Y\) explained by the best linear predictor
  • that the best linear predictor attains this bound, i.e. its squared correlation with \(Y\) equals \(\rho_{Y(Z)}^2\).

We start with two definitions:

Definition 5.4 (Cauchy-Schwarz inequality) Given two vectors \(\underset{(p \times 1)}{b}\) and \(\underset{(p \times 1)}{d}\), the square of their scalar product is bounded from above: \[(b^{'}d)^2 \leqslant (b^{'}b)(d^{'}d)\]

\(\square\)

Definition 5.5 (Extended Cauchy-Schwarz inequality) For a positive definite matrix \(\underset{(p \times p)}{B}\), the extended Cauchy-Schwarz inequality is \[(b^{'}d)^2 \leqslant (b^{'}Bb)(d^{'}B^{-1}d)\] \(\square\)
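
Both inequalities are easy to verify numerically. The sketch below uses random vectors and builds a positive definite \(B\) as \(A^{'}A + I\); all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
b = rng.normal(size=p)
d = rng.normal(size=p)

# Plain Cauchy-Schwarz: (b'd)^2 <= (b'b)(d'd)
assert (b @ d) ** 2 <= (b @ b) * (d @ d)

# Extended version: B = A'A + I is positive definite by construction.
A = rng.normal(size=(p, p))
B = A.T @ A + np.eye(p)
assert (b @ d) ** 2 <= (b @ B @ b) * (d @ np.linalg.solve(B, d))
```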

Proposition 5.1 (Upper Bound on Correlation between Response Variable and Linear Predictor) If we set \(d=\underset{(r \times 1)}{\sigma_{ZY}}\), \(B=\underset{(r \times r)}{\sigma_{ZZ}}\) and \(b=\underset{(r \times 1)}{b_{Z}}\) in the extended Cauchy-Schwarz inequality, we obtain the bound \[(b_{Z}^{'}\sigma_{ZY})^2 \leqslant (b_{Z}^{'}\sigma_{ZZ}b_{Z})(\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY})\]

Identifying \(Cov(b_{0}+b_{Z}^{'}Z,Y)=Cov(b_{Z}^{'}Z,Y)=b_{Z}^{'}Cov(Z,Y)=b_{Z}^{'}\sigma_{ZY}\), we obtain: \[[Cov(b_{0}+b_{Z}^{'}Z,Y)]^2 \leqslant (b_{Z}^{'}\sigma_{ZZ}b_{Z})(\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY})\]

Dividing both sides by \(Var(b_{0}+b_{Z}^{'}Z)Var(Y)=(b_{Z}^{'}\sigma_{ZZ}b_{Z})\sigma_{YY}\) gives the equivalent statement for the correlation: \[[Corr(b_{0}+b_{Z}^{'}Z,Y)]^2 \leqslant \frac{\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}=\rho_{Y(Z)}^2\]

\(\square\)
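
This bound can be checked numerically for arbitrary coefficient vectors, reusing the illustrative covariance matrix from the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 1.5, 0.8],      # illustrative (Y, Z1, Z2) covariance
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
sigma_YY, sigma_ZY, sigma_ZZ = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]
rho_sq = sigma_ZY @ np.linalg.solve(sigma_ZZ, sigma_ZY) / sigma_YY

for _ in range(1000):
    b_z = rng.normal(size=2)
    # Corr(b0 + b_z'Z, Y)^2 from the population moments (b0 drops out):
    corr_sq = (b_z @ sigma_ZY) ** 2 / ((b_z @ sigma_ZZ @ b_z) * sigma_YY)
    assert corr_sq <= rho_sq + 1e-12    # Proposition 5.1
```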

To reinforce our understanding of the meaning of \(\rho_{Y(Z)}\) and its connection with the MSE, let us establish matrix identities, in terms of \(\mu\) and \(\Sigma\), for the best linear predictor of \(Y\) given \(Z\):

Proposition 5.2 (Matrix identity for minimising the Mean Square Prediction Error)

The Mean Square Error can be decomposed as follows:

\[\begin{equation} \begin{split} \mathrm{MSE}(b_{0},b_{Z}) & =\mathrm{E}(Y-b_{0}-b_{Z}^{'}Z)^2\\ & =\mathrm{E}(Y-b_{0}-b_{Z}^{'}Z + (\mu_{Y}-b_{Z}^{'}\mu_{Z}) -(\mu_{Y}-b_{Z}^{'}\mu_{Z}))^2\\ & =\mathrm{E}(Y-\mu_{Y} -(b_{Z}^{'}Z-b_{Z}^{'}\mu_{Z})+(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z}))^2\\ & =\mathrm{E}(Y-\mu_{Y})^2+\mathrm{E}(b_{Z}^{'}(Z-\mu_{Z}))^2+(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z})^2 -2\mathrm{E}(b_{Z}^{'}(Z-\mu_{Z})(Y-\mu_{Y}))\\ & =\sigma_{YY}+b_{Z}^{'}\mathrm{E}((Z-\mu_{Z})(Z-\mu_{Z})^{'})b_{Z}+(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z})^2 -2b_{Z}^{'}\mathrm{E}((Z-\mu_{Z})(Y-\mu_{Y}))\\ & = \sigma_{YY} + b_{Z}^{'}\sigma_{ZZ}b_{Z}+(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z})^2-2b_{Z}^{'}\sigma_{ZY}\\ & = \sigma_{YY} + b_{Z}^{'}\sigma_{ZZ}b_{Z}+(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z})^2-2b_{Z}^{'}\sigma_{ZY}+\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}-\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}\\ &= \sigma_{YY}-\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}+(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z})^2+(b_{Z}-\sigma_{ZZ}^{-1}\sigma_{ZY})^{'}\sigma_{ZZ}(b_{Z}-\sigma_{ZZ}^{-1}\sigma_{ZY}) \end{split} \end{equation}\]

(The cross terms involving the constant \(\mu_{Y}-b_{0}-b_{Z}^{'}\mu_{Z}\) vanish in the fourth line because \(\mathrm{E}(Y-\mu_{Y})=0\) and \(\mathrm{E}(Z-\mu_{Z})=0\).)

Minimisation is achieved by choosing the parameter values:

\[\begin{align} b_{Z} &= \sigma_{ZZ}^{-1}\sigma_{ZY}\\ b_{0} &= \mu_{Y}-b_{Z}^{'}\mu_{Z} \end{align}\]

The minimum value of the MSE then becomes:

\[MSE_{min}=\sigma_{YY}-\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}\]

\(\square\)
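
In code, the optimal coefficients and the minimal MSE follow directly from the partitioned moments; the sketch below reuses the illustrative \(\mu\) and \(\Sigma\) from earlier:

```python
import numpy as np

mu = np.array([2.0, 0.0, 1.0])             # illustrative (mu_Y, mu_Z)
Sigma = np.array([[4.0, 1.5, 0.8],
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
mu_Y, mu_Z = mu[0], mu[1:]
sigma_YY, sigma_ZY, sigma_ZZ = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]

b_Z = np.linalg.solve(sigma_ZZ, sigma_ZY)  # sigma_ZZ^{-1} sigma_ZY
b_0 = mu_Y - b_Z @ mu_Z
mse_min = sigma_YY - sigma_ZY @ b_Z        # sigma_YY - sigma_ZY' sigma_ZZ^{-1} sigma_ZY
print(b_Z, b_0, mse_min)
```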

So from Proposition 5.2 above, the Best Linear Predictor of \(Y\) using \(Z\) has coefficients \(b_{Z}= \sigma_{ZZ}^{-1}\sigma_{ZY}\) and \(b_{0}= \mu_{Y}-b_{Z}^{'}\mu_{Z}\). A link with the Multiple Correlation Coefficient can now be established as follows:

Proposition 5.3 (Link between Mean Squared Prediction Error and Multiple Correlation Coefficient) We can rearrange the formula for the minimum mean squared prediction error as follows:

\[\begin{align} MSE_{min}&=\sigma_{YY}-\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}\\ &=\sigma_{YY}-\sigma_{YY}\left[\frac{\sigma_{ZY}^{'}\sigma_{ZZ}^{-1}\sigma_{ZY}}{\sigma_{YY}}\right]\\ &=\sigma_{YY}[1-\rho_{Y(Z)}^2] \end{align}\]

This means that the Population Multiple Correlation Coefficient measures the “power” of the Best Linear Predictor to predict Y. If \(\rho_{Y(Z)}=0\) then

\[MSE_{min}=\sigma_{YY}\]

The linear predictor has explained none of the variation in the Response Variable. \(\square\)
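
The identity is straightforward to confirm numerically, again with the illustrative covariance matrix used above:

```python
import numpy as np

Sigma = np.array([[4.0, 1.5, 0.8],
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
sigma_YY, sigma_ZY, sigma_ZZ = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]

explained = sigma_ZY @ np.linalg.solve(sigma_ZZ, sigma_ZY)
mse_min = sigma_YY - explained
rho_sq = explained / sigma_YY
assert np.isclose(mse_min, sigma_YY * (1.0 - rho_sq))   # Proposition 5.3
```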

Furthermore:

Corollary 5.1 (Maximising Correlation between Response Variable and Linear Predictor) The best linear predictor of \(Y\) minimises the Mean Square Prediction Error and maximises the correlation coefficient with \(Y\). In fact these two quantities are directly related.

We see this by taking the expression for the correlation in Proposition 5.1 and setting \[b_{Z}= \sigma_{ZZ}^{-1}\sigma_{ZY}\] With this choice both sides of the extended Cauchy-Schwarz inequality are equal, so the squared correlation attains its upper bound \(\rho_{Y(Z)}^2\).

\(\square\)
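
A short numerical check that the optimal coefficients attain the bound of Proposition 5.1 (same illustrative \(\Sigma\)):

```python
import numpy as np

Sigma = np.array([[4.0, 1.5, 0.8],
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
sigma_YY, sigma_ZY, sigma_ZZ = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]

b_Z = np.linalg.solve(sigma_ZZ, sigma_ZY)    # the best coefficients
corr_sq = (b_Z @ sigma_ZY) ** 2 / ((b_Z @ sigma_ZZ @ b_Z) * sigma_YY)
rho_sq = sigma_ZY @ b_Z / sigma_YY
assert np.isclose(corr_sq, rho_sq)           # the bound is attained
```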

References

Johnson, Richard Arnold, Dean W. Wichern, and others. 2014. Applied Multivariate Statistical Analysis. Vol. 4. Prentice-Hall, New Jersey.