Section 4 Key Elements

Multivariate statistics is concerned with the properties of multi-dimensional random vectors. We study the properties of the p-dimensional Random Vector \(\underset{(p \times 1)}{X}\) through repeated measurements. For ease of study we arrange our n observations into an n by p data matrix (see Johnson, Wichern, and others (2014), page 49):

Definition 4.1 (Data Matrix) \[\underset{(n \times p)}{X}=\begin{bmatrix}\underset{(1 \times p)}{X_{1}^{'}}\\\underset{(1 \times p)}{X_{2}^{'}}\\\vdots\\\underset{(1 \times p)}{X_{n}^{'}}\end{bmatrix}= \begin{bmatrix}X_{11} & X_{12} & \dots& X_{1p}\\X_{21} & X_{22} &\dots & X_{2p}\\\vdots&\vdots&\ddots&\vdots\\X_{n1} & X_{n2} &\dots&X_{np}\end{bmatrix}\]

\(\square\)
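As a concrete illustration (with made-up numbers), the data matrix can be held as a NumPy array whose rows are the transposed observation vectors \(X_{i}^{'}\):

```python
import numpy as np

# Hypothetical sample: n = 4 observations of p = 3 variables
# (say height, weight, age); one observation vector X_i per row.
X = np.array([
    [170.0, 65.0, 30.0],
    [165.0, 58.0, 25.0],
    [180.0, 80.0, 40.0],
    [175.0, 72.0, 35.0],
])

n, p = X.shape          # n = 4 observations, p = 3 variables
X_1 = X[0]              # first observation vector X_1', shape (p,)
```

Row i of the array plays the role of \(X_{i}^{'}\), so slicing a row recovers a single p-dimensional observation.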

A key simplifying assumption in Multivariate Analysis is that our observations constitute a Random Sample. The p components within any observation vector \(X_{i}\) are allowed to be correlated, but components belonging to different observations must be independent. In terms of the data matrix, only components within the same row may have non-zero correlation. Furthermore, we assume that the joint density function \(f(\cdot)\) is the same for every observation vector \(X_{i}\). In symbols:

Definition 4.2 (Random Sample) \[\begin{equation} \begin{split} f(X_{1},X_{2},\dots,X_{n}) & =f_{1}(X_{1})f_{2}(X_{2})\cdots f_{n}(X_{n})\\ & =f(X_{1})f(X_{2})\cdots f(X_{n}) \end{split} \end{equation}\]

\(\square\)
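A small simulation (with an assumed bivariate normal population and a made-up covariance matrix) makes the two halves of the assumption visible: components within a row are correlated, while the same component across different rows is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20_000, 2
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])   # within-row correlation of 0.8
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Within a row: the two components are strongly correlated.
within = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# Across rows: rows are drawn independently, so X_{i,1} and X_{i+1,1}
# are uncorrelated.
across = np.corrcoef(X[:-1, 0], X[1:, 0])[0, 1]
```

Here `within` comes out near 0.8 while `across` hovers near 0, matching the "correlation only within a row" picture.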

In Multivariate Analysis, we study the properties of the sample using three fundamental objects:

Definition 4.3 (Sample Mean, Sample Covariance and Sample Correlation) The Sample Mean \(\overline{X}\), Sample Covariance Matrix \(S\) and Sample Correlation Matrix \(R\) are defined as follows:

\[\begin{align} \underset{(p \times 1)}{\overline{X}}&=\frac{\underset{(p \times n)}{X^{´}}\underset{(n \times 1)}{1}}{n}\\\\ \underset{(p \times p)}{S}&=\frac{X^{´}(I-\frac{11^{´}}{n})X}{n-1}\\\\ \underset{(p \times p)}{R}&=D^{-\frac{1}{2}}SD^{-\frac{1}{2}} \end{align}\]

where \(D^{-\frac{1}{2}}\) is the diagonal matrix with entries \(D^{-\frac{1}{2}}_{ii}=\frac{1}{\sqrt{S_{ii}}}\).

\(\square\)
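The three matrix formulas in Definition 4.3 translate directly into NumPy. The sketch below (using a made-up data matrix) computes \(\overline{X}\), \(S\) and \(R\) from the formulas and checks them against NumPy's built-in estimators:

```python
import numpy as np

# Made-up data matrix: n = 5 observations, p = 3 variables.
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
])
n, p = X.shape
ones = np.ones((n, 1))

# Sample mean: X'1 / n, a (p x 1) vector.
x_bar = (X.T @ ones) / n

# Sample covariance: X'(I - 11'/n)X / (n - 1), a (p x p) matrix.
C = np.eye(n) - ones @ ones.T / n       # centering matrix I - 11'/n
S = X.T @ C @ X / (n - 1)

# Sample correlation: D^{-1/2} S D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv_sqrt @ S @ D_inv_sqrt

# The matrix formulas agree with NumPy's built-in estimators.
assert np.allclose(x_bar.ravel(), X.mean(axis=0))
assert np.allclose(S, np.cov(X, rowvar=False))
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```

Note that `np.cov` uses the same \(n-1\) divisor as the definition, and that the diagonal of \(R\) is identically 1, as a correlation matrix requires.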

Each of these objects is of fundamental importance because it captures different geometrical properties of the sample (see Johnson, Wichern, and others (2014), Section 2.2, pages 49 and 136).

Proposition 4.1 (Geometrical Properties of the Sample) We start by considering the sample data matrix as p measurement vectors \(Y_{i}\) in an n-dimensional space:

\[\underset{(n \times p)}{X}= \begin{bmatrix}X_{11} & X_{12} & \dots& X_{1p}\\X_{21} & X_{22} &\dots & X_{2p}\\\vdots&\vdots&\ddots&\vdots\\X_{n1} & X_{n2} &\dots&X_{np}\end{bmatrix}=\begin{bmatrix}\underset{(n \times 1)}{Y_{1}} & \underset{(n \times 1)}{Y_{2}} & \dots& \underset{(n \times 1)}{Y_{p}}\end{bmatrix}\]

We decompose each n-dimensional measurement vector into components parallel and perpendicular to the unit vector \(\frac{1}{\sqrt{n}}\), i.e. \[Y_{i}=(Y_{i} \bullet \frac{1}{\sqrt{n}})\frac{1}{\sqrt{n}}+d_{i}\] where the deviation vector is defined as \(d_{i}=Y_{i}-(Y_{i} \bullet \frac{1}{\sqrt{n}})\frac{1}{\sqrt{n}}=Y_{i}-\overline{X}_{i}1\) and \(\underset{(n \times 1)}{1}=[1,1,\dots,1]^{´}\). We can then make the following identifications,

\[\begin{align} \overline{X}_{i}&=\frac{Y_{i} \bullet 1}{n}\\ S_{ii}&=\frac{\mid d_{i}\mid ^{2}}{n-1}\\ R_{ij}&=\frac{d_{i} \bullet d_{j}}{\mid d_{i} \mid \mid d_{j} \mid}\\ \end{align}\]

\(\square\)
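These identifications can be verified numerically. The sketch below (on simulated data) forms the deviation vectors \(d_{i}=Y_{i}-\overline{X}_{i}1\), checks that each is perpendicular to \(1\), and confirms that their squared lengths and angles reproduce \(S_{ii}\) and \(R_{ij}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
one = np.ones(n)

Y = [X[:, i] for i in range(p)]           # p measurement vectors in R^n
d = [y - y.mean() * one for y in Y]       # deviation vectors d_i = Y_i - X̄_i 1

# Each deviation vector is perpendicular to the equiangular vector 1.
assert all(abs(di @ one) < 1e-8 for di in d)

# S_ii = |d_i|^2 / (n - 1): squared length gives the sample variance.
S = np.cov(X, rowvar=False)
assert np.allclose([di @ di / (n - 1) for di in d], np.diag(S))

# R_ij is the cosine of the angle between d_i and d_j.
R = np.corrcoef(X, rowvar=False)
cos01 = d[0] @ d[1] / (np.linalg.norm(d[0]) * np.linalg.norm(d[1]))
assert np.isclose(cos01, R[0, 1])
```

The last check is the geometric heart of the proposition: sample correlation is nothing more than the cosine of the angle between deviation vectors.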

As in the univariate case, certain “regularities” govern the sampling distribution of \(\overline{X}\) and \(S\) (see Johnson, Wichern, and others (2014), page 175, section 4.5). These results are stated without proof below:

Proposition 4.2 (Large Sample Properties of Sample Mean and Covariance) Let \(X_{1},X_{2},...,X_{n}\) be a random sample from a population with \(E[X]=\mu\) and finite, non-singular \(Cov[X]=\Sigma\).

As the number of observations \(n\) increases without bound (and in any case \(n \gg p\)):

\[\begin{align} \overline{X} &\overset{Prob}{\longrightarrow} \mu \\ S &\overset{Prob}{\longrightarrow} \Sigma \\ \sqrt{n}(\overline{X}-\mu) &\overset{Dist}{\longrightarrow} \mathcal{N}_{p}(0,\Sigma) \\ n(\overline{X}-\mu)^{´}S^{-1}(\overline{X}-\mu) &\overset{Dist}{\longrightarrow} \chi^2_{p} \end{align}\] \(\square\)
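A quick Monte Carlo sketch (with a made-up \(\mu\) and \(\Sigma\)) illustrates the limit for the quadratic form. For \(n\) much larger than \(p\), the statistic \(n(\overline{X}-\mu)^{´}S^{-1}(\overline{X}-\mu)\) behaves like a chi-square with \(p\) degrees of freedom, so its sample mean and variance should come out near \(p\) and \(2p\):

```python
import numpy as np

rng = np.random.default_rng(42)
p, n, reps = 3, 500, 2000
mu = np.array([1.0, -2.0, 0.5])          # assumed population mean
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # a fixed non-singular covariance

stat = np.empty(reps)
for r in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    diff = x_bar - mu
    # n (x̄ - μ)' S^{-1} (x̄ - μ), with solve() in place of an explicit inverse
    stat[r] = n * diff @ np.linalg.solve(S, diff)

# A chi-square with p degrees of freedom has mean p and variance 2p;
# the simulated statistic should be close to both.
print(stat.mean(), stat.var())
```

With \(n=500\) and \(p=3\), the printed mean and variance land near 3 and 6, consistent with the stated limit.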

Since in the limit of large sample sizes the sampling distribution of \(\overline{X}\) is (multivariate) normal, we can apply statistical results developed for Multivariate Normal Random Vectors. For example, the Multivariate Analysis of Variance Model for comparing g population mean vectors (see Johnson, Wichern, and others (2014), chapter 6, page 301).

References

Johnson, Richard Arnold, Dean W. Wichern, and others. 2014. Applied Multivariate Statistical Analysis. Vol. 4. Prentice-Hall, New Jersey.