25.3 Principal Components

  • Unsupervised learning
  • find important features
  • reduce the dimensions of the data set
  • “decorrelate” multivariate vectors that have dependence.
  • uses eigenvector/eigenvalue decomposition of covariance (correlation) matrices.

According to the “spectral decomposition theorem”, if $\mathbf{\Sigma}_{p \times p}$ is a positive semi-definite, symmetric, real matrix, then there exists an orthogonal matrix $\mathbf{A}$ such that $\mathbf{A}' \mathbf{\Sigma} \mathbf{A} = \mathbf{\Lambda}$, where $\mathbf{\Lambda}$ is a diagonal matrix containing the eigenvalues of $\mathbf{\Sigma}$.

$$
\mathbf{\Lambda} =
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_p
\end{pmatrix}
$$

$$
\mathbf{A} =
\begin{pmatrix}
\mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_p
\end{pmatrix}
$$

The $i$-th column of $\mathbf{A}$, $\mathbf{a}_i$, is the $i$-th $p \times 1$ eigenvector of $\mathbf{\Sigma}$ that corresponds to the eigenvalue $\lambda_i$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. Alternatively, expressed as a matrix decomposition:

$$
\mathbf{\Sigma} = \mathbf{A} \mathbf{\Lambda} \mathbf{A}'
$$

$$
\mathbf{\Sigma} = \mathbf{A}
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_p
\end{pmatrix}
\mathbf{A}' = \sum_{i=1}^p \lambda_i \mathbf{a}_i \mathbf{a}_i'
$$

where the outer product $\mathbf{a}_i \mathbf{a}_i'$ is a $p \times p$ matrix of rank 1.
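
As a quick numerical check (a minimal sketch; the 3 x 3 matrix below is an arbitrary positive definite example, not taken from the text), eigen() returns the eigenvalues and eigenvectors, and the rank-1 sum reproduces the original matrix:

# verify Sigma = A Lambda A' = sum_i lambda_i a_i a_i' for a small symmetric matrix
Sig <- matrix(c(4, 1, 1,
                1, 2, 0,
                1, 0, 3), nrow = 3, byrow = TRUE)
e   <- eigen(Sig)
A   <- e$vectors                 # columns are the eigenvectors a_i
Lam <- diag(e$values)            # diagonal matrix of eigenvalues
A %*% Lam %*% t(A)               # reproduces Sig
Reduce("+", lapply(1:3, function(i) e$values[i] * A[, i] %*% t(A[, i])))  # same, as a sum of rank-1 matrices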

For example,

$$
\mathbf{x} \sim N_2(\mathbf{\mu}, \mathbf{\Sigma})
$$

$$
\mathbf{\mu} =
\begin{pmatrix}
5 \\ 12
\end{pmatrix};
\quad
\mathbf{\Sigma} =
\begin{pmatrix}
4 & 1 \\
1 & 2
\end{pmatrix}
$$

library(MASS)
mu = as.matrix(c(5, 12))
Sigma = matrix(c(4, 1, 1, 2), nrow = 2, byrow = T)
sim <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
plot(sim[, 1], sim[, 2])

Here,

$$
\mathbf{A} =
\begin{pmatrix}
0.9239 & -0.3827 \\
0.3827 & 0.9239
\end{pmatrix}
$$

The columns of $\mathbf{A}$ are the eigenvectors for the decomposition.
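
The matrix $\mathbf{A}$ above can be obtained with eigen() (a small sketch; as noted later, each eigenvector is only determined up to a sign, so some columns may be flipped relative to the matrix shown):

# eigen decomposition of the example covariance matrix defined above
e <- eigen(Sigma)
e$values    # approximately 4.414 and 1.586
e$vectors   # columns are the eigenvectors; signs may differ by -1 from A above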

Under matrix multiplication ($\mathbf{A}' \mathbf{\Sigma} \mathbf{A}$ or $\mathbf{A}' \mathbf{A}$), the off-diagonal elements are equal to 0.

Multiplying the data by this matrix (i.e., projecting the data onto the orthogonal axes), the distribution of the resulting data (i.e., the “scores”) is

$$
N_2(\mathbf{A}' \mathbf{\mu}, \mathbf{A}' \mathbf{\Sigma} \mathbf{A}) = N_2(\mathbf{A}' \mathbf{\mu}, \mathbf{\Lambda})
$$

Equivalently,

$$
\mathbf{y} = \mathbf{A}' \mathbf{x} \sim N_2 \left[
\begin{pmatrix}
9.2119 \\ 9.1733
\end{pmatrix},
\begin{pmatrix}
4.4144 & 0 \\
0 & 1.5859
\end{pmatrix}
\right]
$$

A_matrix = matrix(c(0.9239, -0.3827, 0.3827, 0.9239),
                  nrow = 2,
                  byrow = T)
t(A_matrix) %*% A_matrix
#>          [,1]     [,2]
#> [1,] 1.000051 0.000000
#> [2,] 0.000000 1.000051

sim1 <-
    mvrnorm(
        n = 1000,
        mu = t(A_matrix) %*% mu,
        Sigma = t(A_matrix) %*% Sigma %*% A_matrix
    )
plot(sim1[, 1], sim1[, 2])

There is no more dependence in the data structure, as the plot shows.
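
A quick check (a sketch; exact values vary from simulation to simulation) is that the sample covariance of the scores is approximately diagonal, with diagonal entries close to the eigenvalues:

round(cov(sim1), 2)   # roughly diag(4.41, 1.59); off-diagonal entries near 0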

Notes:

  • The $i$-th eigenvalue is the variance of a linear combination of the elements of $\mathbf{x}$; $var(y_i) = var(\mathbf{a}_i' \mathbf{x}) = \lambda_i$

  • The values on the transformed set of axes (i.e., the $y_i$’s) are called the scores. These are the orthogonal projections of the data onto the “new principal component axes.”

  • The variance of $y_1$ is greater than that of any other possible projection

Covariance matrix decomposition and projection onto orthogonal axes = PCA

25.3.1 Population Principal Components

Consider $p \times 1$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ which are iid with $var(\mathbf{x}_i) = \mathbf{\Sigma}$.

  • The first PC is the linear combination $y_1 = \mathbf{a}_1' \mathbf{x} = a_{11} x_1 + \cdots + a_{1p} x_p$ with $\mathbf{a}_1' \mathbf{a}_1 = 1$ such that $var(y_1)$ is the maximum of all linear combinations of $\mathbf{x}$ that have unit length

  • The second PC is the linear combination $y_2 = \mathbf{a}_2' \mathbf{x} = a_{21} x_1 + \cdots + a_{2p} x_p$ with $\mathbf{a}_2' \mathbf{a}_2 = 1$ such that $var(y_2)$ is the maximum of all linear combinations of $\mathbf{x}$ that have unit length and are uncorrelated with $y_1$ (i.e., $cov(\mathbf{a}_1' \mathbf{x}, \mathbf{a}_2' \mathbf{x}) = 0$)

  • This continues for all $y_i$ up to $y_p$

The $\mathbf{a}_i$’s are the columns of the matrix $\mathbf{A}$ in the spectral decomposition $\mathbf{A}' \mathbf{\Sigma} \mathbf{A} = \mathbf{\Lambda}$, where $var(y_1) = \lambda_1, \ldots, var(y_p) = \lambda_p$. The total variance of $\mathbf{x}$ is

$$
var(x_1) + \cdots + var(x_p) = tr(\mathbf{\Sigma}) = \lambda_1 + \cdots + \lambda_p = var(y_1) + \cdots + var(y_p)
$$

Data Reduction

To reduce the dimension of data from p (original) to k dimensions without much “loss of information”, we can use properties of the population principal components

  • Suppose $\mathbf{\Sigma} \approx \sum_{i=1}^k \lambda_i \mathbf{a}_i \mathbf{a}_i'$. Even though the true variance-covariance matrix has rank $p$, it can be well approximated by a matrix of rank $k$ ($k < p$)

  • New “traits” are linear combinations of the measured traits. We can attempt to make meaningful interpretations of the combinations (with orthogonality constraints).

  • The proportion of the total variance accounted for by the j-th principal component is

$$
\frac{var(y_j)}{\sum_{i=1}^p var(y_i)} = \frac{\lambda_j}{\sum_{i=1}^p \lambda_i}
$$

  • The proportion of the total variation accounted for by the first $k$ principal components is $\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i}$

  • In the example above, $4.4144/(4+2) = 0.735$ of the total variability is explained by the first principal component (see the sketch below)
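
That calculation can be done directly from the eigenvalues of the example covariance matrix (a small sketch reusing Sigma from the simulation code above):

# proportion of total variance explained by the first PC of the example Sigma
lam <- eigen(Sigma)$values
all.equal(sum(diag(Sigma)), sum(lam))   # tr(Sigma) equals the sum of the eigenvalues
lam[1] / sum(lam)                       # about 0.735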

25.3.2 Sample Principal Components

Since Σ is unknown, we use

$$
\mathbf{S} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'
$$

Let $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$ be the eigenvalues of $\mathbf{S}$ and $\hat{\mathbf{a}}_1, \hat{\mathbf{a}}_2, \ldots, \hat{\mathbf{a}}_p$ denote the eigenvectors of $\mathbf{S}$.

Then, the i-th sample principal component score (or principal component or score) is

$$
\hat{y}_{ij} = \sum_{k=1}^p \hat{a}_{ik} x_{kj} = \hat{\mathbf{a}}_i' \mathbf{x}_j
$$

Properties of Sample Principal Components

  • The estimated variance of $y_i = \hat{\mathbf{a}}_i' \mathbf{x}_j$ is $\hat{\lambda}_i$

  • The sample covariance between $\hat{y}_i$ and $\hat{y}_{i'}$ is 0 when $i \neq i'$

  • The proportion of the total sample variance accounted for by the $i$-th sample principal component is $\frac{\hat{\lambda}_i}{\sum_{k=1}^p \hat{\lambda}_k}$

  • The estimated correlation between the i-th principal component score and the l-th attribute of x is

$$
r_{x_l, \hat{y}_i} = \frac{\hat{a}_{il} \sqrt{\hat{\lambda}_i}}{\sqrt{s_{ll}}}
$$

  • The correlation coefficient is typically used to interpret the components (i.e., if this correlation is high, it suggests that the $l$-th original trait is important in the $i$-th principal component). According to Johnson and Wichern (2002), pp. 433-434, $r_{x_l, \hat{y}_i}$ only measures the univariate contribution of an individual $X$ to a component $Y$ without taking into account the presence of the other $X$’s. Hence, some prefer the $\hat{a}_{il}$ coefficients to interpret the principal components.

  • Both $r_{x_l, \hat{y}_i}$ and $\hat{a}_{il}$ are referred to as “loadings”

To use k principal components, we must calculate the scores for each data vector in the sample

$$
\mathbf{y}_j =
\begin{pmatrix}
y_{1j} \\ y_{2j} \\ \vdots \\ y_{kj}
\end{pmatrix}
=
\begin{pmatrix}
\hat{\mathbf{a}}_1' \mathbf{x}_j \\
\hat{\mathbf{a}}_2' \mathbf{x}_j \\
\vdots \\
\hat{\mathbf{a}}_k' \mathbf{x}_j
\end{pmatrix}
=
\begin{pmatrix}
\hat{\mathbf{a}}_1' \\
\hat{\mathbf{a}}_2' \\
\vdots \\
\hat{\mathbf{a}}_k'
\end{pmatrix}
\mathbf{x}_j
$$
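
A small sketch (simulated data, purely for illustration) of computing the sample scores and the loading-type correlations by hand and checking them against prcomp; note that the scores here are computed from mean-centered data, as prcomp does:

# sample PCs by hand vs. prcomp, reusing mu and Sigma from the earlier simulation
X  <- mvrnorm(n = 100, mu = mu, Sigma = Sigma)
S  <- cov(X)
e  <- eigen(S)
Xc <- scale(X, center = TRUE, scale = FALSE)      # center each column
scores_manual <- Xc %*% e$vectors                 # scores: hat(a)_i' (x_j - xbar)
head(abs(scores_manual) - abs(prcomp(X)$x))       # agrees with prcomp up to sign

# estimated correlation between the l-th trait and the i-th PC:
# r = hat(a)_il * sqrt(hat(lambda)_i) / sqrt(s_ll)
sweep(e$vectors %*% diag(sqrt(e$values)), 1, sqrt(diag(S)), "/")
cor(X, scores_manual)                             # same values, up to numerical precision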

Issues:

  • Large-sample theory exists for the eigenvalues and eigenvectors of sample covariance matrices if inference is necessary. But we do not do inference with PCA; we only use it as an exploratory or descriptive analysis.

  • PCA is not invariant to changes in scale (exception: if all traits are rescaled by multiplying by the same constant, such as feet to inches).

    • PCA based on the correlation matrix $\mathbf{R}$ is different from that based on the covariance matrix $\mathbf{\Sigma}$

    • PCA for the correlation matrix is just rescaling each trait to have unit variance

    • Transform $\mathbf{x}$ to $\mathbf{z}$, where $z_{ij} = (x_{ij} - \bar{x}_i)/\sqrt{s_{ii}}$; the denominator is what affects the PCA

    • After transformation, $cov(\mathbf{z}) = \mathbf{R}$

    • PCA on $\mathbf{R}$ is calculated in the same way as that on $\mathbf{S}$ (where $\hat{\lambda}_1 + \cdots + \hat{\lambda}_p = p$)

    • Whether to use $\mathbf{R}$ or $\mathbf{S}$ depends on the purpose of the PCA.

      • If the scales of the observations differ, the correlation matrix is preferable; if the variances are dramatically different, an analysis based on the covariance matrix can be dominated by the large-variance traits.
    • How many PCs to use can be guided by

      • Scree graphs: plot the eigenvalues against their indices. Look for the “elbow” where the steep decline in the graph suddenly flattens out, or for big gaps.

      • Minimum percent of total variation (e.g., choose enough components to account for 50% or 90% of the total variation); can be used for interpretation.

      • Kaiser’s rule: use only those PCs with eigenvalues larger than 1 (applied to PCA on the correlation matrix); this is ad hoc.

      • Compare the eigenvalue scree plot of the data to the scree plot obtained when the data are randomized (see the sketch after this list).
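
The randomization comparison in the last bullet can be sketched as follows (X here is a made-up numeric data matrix, just for illustration; with real data the leading observed eigenvalues would typically sit above the randomized curve):

# compare the observed scree plot to the average scree plot of randomized data
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)             # placeholder data matrix
obs_eig  <- eigen(cor(X))$values
perm_eig <- replicate(200, eigen(cor(apply(X, 2, sample)))$values)  # permute each column independently
plot(obs_eig, type = "b", xlab = "Component", ylab = "Eigenvalue")
lines(rowMeans(perm_eig), type = "b", lty = 2)    # average eigenvalues under randomization
legend("topright", legend = c("observed", "randomized"), lty = 1:2)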

25.3.3 Application

PCA on the covariance matrix is usually not preferred because PCA is not invariant to changes in scale; hence, PCA on the correlation matrix is generally preferred.

This also addresses the problem of multicollinearity

The eigenvectors may differ by a multiplication of -1 across implementations, but the interpretation is the same.

library(tidyverse)
## Read in and check data
stock <- read.table("images/stock.dat")
names(stock) <- c("allied", "dupont", "carbide", "exxon", "texaco")
str(stock)
#> 'data.frame':    100 obs. of  5 variables:
#>  $ allied : num  0 0.027 0.1228 0.057 0.0637 ...
#>  $ dupont : num  0 -0.04485 0.06077 0.02995 -0.00379 ...
#>  $ carbide: num  0 -0.00303 0.08815 0.06681 -0.03979 ...
#>  $ exxon  : num  0.0395 -0.0145 0.0862 0.0135 -0.0186 ...
#>  $ texaco : num  0 0.0435 0.0781 0.0195 -0.0242 ...

## Covariance matrix of data
cov(stock)
#>               allied       dupont      carbide        exxon       texaco
#> allied  0.0016299269 0.0008166676 0.0008100713 0.0004422405 0.0005139715
#> dupont  0.0008166676 0.0012293759 0.0008276330 0.0003868550 0.0003109431
#> carbide 0.0008100713 0.0008276330 0.0015560763 0.0004872816 0.0004624767
#> exxon   0.0004422405 0.0003868550 0.0004872816 0.0008023323 0.0004084734
#> texaco  0.0005139715 0.0003109431 0.0004624767 0.0004084734 0.0007587370

## Correlation matrix of data
cor(stock)
#>            allied    dupont   carbide     exxon    texaco
#> allied  1.0000000 0.5769244 0.5086555 0.3867206 0.4621781
#> dupont  0.5769244 1.0000000 0.5983841 0.3895191 0.3219534
#> carbide 0.5086555 0.5983841 1.0000000 0.4361014 0.4256266
#> exxon   0.3867206 0.3895191 0.4361014 1.0000000 0.5235293
#> texaco  0.4621781 0.3219534 0.4256266 0.5235293 1.0000000

# cov(scale(stock)) # gives the same result

## PCA with covariance
cov_pca <- prcomp(stock) 
# uses singular value decomposition for calculation and an N -1 divisor
# alternatively, princomp can do PCA via spectral decomposition, 
# but it has worse numerical accuracy

# eigen values
cov_results <- data.frame(eigen_values = cov_pca$sdev ^ 2)
cov_results %>%
    mutate(proportion = eigen_values / sum(eigen_values),
           cumulative = cumsum(proportion)) 
#>   eigen_values proportion cumulative
#> 1 0.0035953867 0.60159252  0.6015925
#> 2 0.0007921798 0.13255027  0.7341428
#> 3 0.0007364426 0.12322412  0.8573669
#> 4 0.0005086686 0.08511218  0.9424791
#> 5 0.0003437707 0.05752091  1.0000000
# first 2 PCs account for 73% variance in the data

# eigen vectors
cov_pca$rotation # prcomp calls rotation
#>               PC1         PC2        PC3         PC4         PC5
#> allied  0.5605914  0.73884565 -0.1260222 -0.28373183  0.20846832
#> dupont  0.4698673 -0.09286987 -0.4675066  0.68793190 -0.28069055
#> carbide 0.5473322 -0.65401929 -0.1140581 -0.50045312  0.09603973
#> exxon   0.2908932 -0.11267353  0.6099196  0.43808002  0.58203935
#> texaco  0.2842017  0.07103332  0.6168831 -0.06227778 -0.72784638
# princomp calls loadings.

# first PC = overall average
# second PC compares Allied to Carbide

## PCA with correlation
#same as scale(stock) %>% prcomp
cor_pca <- prcomp(stock, scale = T)



# eigen values
cor_results <- data.frame(eigen_values = cor_pca$sdev ^ 2)
cor_results %>%
    mutate(proportion = eigen_values / sum(eigen_values),
           cumulative = cumsum(proportion))
#>   eigen_values proportion cumulative
#> 1    2.8564869 0.57129738  0.5712974
#> 2    0.8091185 0.16182370  0.7331211
#> 3    0.5400440 0.10800880  0.8411299
#> 4    0.4513468 0.09026936  0.9313992
#> 5    0.3430038 0.06860076  1.0000000

# the first eigenvalue corresponds to less variance 
# than in the PCA based on the covariance matrix

# eigen vectors
cor_pca$rotation
#>               PC1        PC2        PC3        PC4        PC5
#> allied  0.4635405  0.2408499 -0.6133570  0.3813727 -0.4532876
#> dupont  0.4570764  0.5090997  0.1778996  0.2113068  0.6749814
#> carbide 0.4699804  0.2605774  0.3370355 -0.6640985 -0.3957247
#> exxon   0.4216770 -0.5252647  0.5390181  0.4728036 -0.1794482
#> texaco  0.4213291 -0.5822416 -0.4336029 -0.3812273  0.3874672
# interpretation of PC2 is different from above: 
# it is a comparison of Allied, Dupont, and Carbide to Exxon and Texaco 

Covid Example

To reduce the collinearity problem in this dataset, we can use the principal components as regressors.

load('images/MOcovid.RData')
covidpca <- prcomp(ndat[,-1],scale = T,center = T)

covidpca$rotation[,1:2]
#>                                                          PC1         PC2
#> X..Population.in.Rural.Areas                      0.32865838  0.05090955
#> Area..sq..miles.                                  0.12014444 -0.28579183
#> Population.density..sq..miles.                   -0.29670124  0.28312922
#> Literacy.rate                                    -0.12517700 -0.08999542
#> Families                                         -0.25856941  0.16485752
#> Area.of.farm.land..sq..miles.                     0.02101106 -0.31070363
#> Number.of.farms                                  -0.03814582 -0.44809679
#> Average.value.of.all.property.per.farm..dollars. -0.05410709  0.14404306
#> Estimation.of.rurality..                         -0.19040210  0.12089501
#> Male..                                            0.02182394 -0.09568768
#> Number.of.Physcians.per.100.000                  -0.31451606  0.13598026
#> average.age                                       0.29414708  0.35593459
#> X0.4.age.proportion                              -0.11431336 -0.23574057
#> X20.44.age.proportion                            -0.32802128 -0.22718550
#> X65.and.over.age.proportion                       0.30585033  0.32201626
#> prop..White..nonHisp                              0.35627561 -0.14142646
#> prop..Hispanic                                   -0.16655381 -0.15105342
#> prop..Black                                      -0.33333359  0.24405802


# Variability of each principal component: pr.var
pr.var <- covidpca$sdev ^ 2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
plot(
    pve,
    xlab = "Principal Component",
    ylab = "Proportion of Variance Explained",
    ylim = c(0, 0.5),
    type = "b"
)


plot(
    cumsum(pve),
    xlab = "Principal Component",
    ylab = "Cumulative Proportion of Variance Explained",
    ylim = c(0, 1),
    type = "b"
)


# the first six principal components account for around 80% of the variance. 


#using base lm function for PC regression
pcadat <- data.frame(covidpca$x[, 1:6])
pcadat$y <- ndat$Y
pcr.man <- lm(log(y) ~ ., pcadat)
mean(pcr.man$residuals ^ 2)
#> [1] 0.03453371

#comparison to lm w/o prin comps
lm.fit <- lm(log(Y) ~ ., data = ndat)
mean(lm.fit$residuals ^ 2)
#> [1] 0.02335128

The MSE for the PC-based model is larger than that of the regular regression, because models with a large degree of collinearity can still fit well; collinearity mainly inflates the variance of the coefficient estimates rather than hurting the in-sample fit.

The pcr function in the pls package can be used to fit principal component regression; cross-validation can be used to choose the number of components in the model.
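
A sketch of that approach (assuming the pls package is installed; ndat and the log(Y) response are the same as above):

library(pls)
# principal component regression with cross-validation to guide the number of components
pcr.fit <- pcr(log(Y) ~ ., data = ndat, scale = TRUE, validation = "CV")
summary(pcr.fit)                             # CV error and % variance explained by number of components
validationplot(pcr.fit, val.type = "MSEP")   # look for where the CV error levels off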

References

Johnson, Richard Arnold, and Dean W. Wichern. 2002. Applied Multivariate Statistical Analysis. Prentice Hall.