25.3 Principal Components
- Unsupervised learning
- find important features
- reduce the dimensions of the data set
- “decorrelate” multivariate vectors that have dependence.
- uses eigenvector/eigenvalue decomposition of covariance (correlation) matrices.
According to the “spectral decomposition theorem,” if $\Sigma_{p \times p}$ is a positive semi-definite, symmetric, real matrix, then there exists an orthogonal matrix $A$ such that $A'\Sigma A = \Lambda$, where $\Lambda$ is a diagonal matrix containing the eigenvalues of $\Sigma$:

$$
\Lambda =
\begin{pmatrix}
\lambda_1 & 0 & \dots & 0 \\
0 & \lambda_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \lambda_p
\end{pmatrix}
\qquad
A = \begin{pmatrix} a_1 & a_2 & \dots & a_p \end{pmatrix}
$$

The $i$-th column of $A$, $a_i$, is the $i$-th $p \times 1$ eigenvector of $\Sigma$ corresponding to the eigenvalue $\lambda_i$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$. Alternatively, expressed as a matrix decomposition:

$$
\Sigma = A \Lambda A' =
A
\begin{pmatrix}
\lambda_1 & 0 & \dots & 0 \\
0 & \lambda_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \lambda_p
\end{pmatrix}
A'
= \sum_{i=1}^p \lambda_i a_i a_i'
$$

where the outer product $a_i a_i'$ is a $p \times p$ matrix of rank 1.
For example, suppose

$$
x \sim N_2(\mu, \Sigma), \qquad
\mu = \begin{pmatrix} 5 \\ 12 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 4 & 1 \\ 1 & 2 \end{pmatrix}
$$
library(MASS)
mu = as.matrix(c(5, 12))
Sigma = matrix(c(4, 1, 1, 2), nrow = 2, byrow = T)
sim <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)  # simulate 1,000 draws from N2(mu, Sigma)
plot(sim[, 1], sim[, 2])
Here,

$$
A = \begin{pmatrix} 0.9239 & -0.3827 \\ 0.3827 & 0.9239 \end{pmatrix}
$$

The columns of $A$ are the eigenvectors for the decomposition.
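These can be recovered in R with eigen() as a quick check (any column may differ from $A$ by a factor of $-1$):
eigen(Sigma)$vectors  # columns are the eigenvectors of Sigma
eigen(Sigma)$values   # approximately 4.41 and 1.59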
Under the matrix multiplications ($A'\Sigma A$ or $A'A$), the off-diagonal elements are 0.

Multiplying the data by this matrix (i.e., projecting the data onto the orthogonal axes) gives scores whose distribution is

$$
N_2(A'\mu, A'\Sigma A) = N_2(A'\mu, \Lambda)
$$

Equivalently,

$$
y = A'x \sim N_2\left[
\begin{pmatrix} 9.2119 \\ 9.1733 \end{pmatrix},
\begin{pmatrix} 4.4144 & 0 \\ 0 & 1.5859 \end{pmatrix}
\right]
$$
A_matrix = matrix(c(0.9239, -0.3827, 0.3827, 0.9239),
                  nrow = 2,
                  byrow = T)
t(A_matrix) %*% A_matrix  # approximately the identity: A is orthogonal
#> [,1] [,2]
#> [1,] 1.000051 0.000000
#> [2,] 0.000000 1.000051
# simulate directly from the distribution of the scores, N2(A'mu, A'Sigma A)
sim1 <- mvrnorm(
  n = 1000,
  mu = t(A_matrix) %*% mu,
  Sigma = t(A_matrix) %*% Sigma %*% A_matrix
)
plot(sim1[, 1], sim1[, 2])
The plot of sim1 shows no more dependence in the data structure.

Notes:

- The $i$-th eigenvalue is the variance of a linear combination of the elements of $x$: $var(y_i) = var(a_i'x) = \lambda_i$
- The values on the transformed set of axes (i.e., the $y_i$'s) are called the scores. These are the orthogonal projections of the data onto the “new principal component axes” (see the sketch below)
- The variance of $y_1$ is greater than that of any other possible projection
- Covariance matrix decomposition and projection onto orthogonal axes = PCA
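A quick check of these notes in R, projecting the simulated data onto the eigenvector axes (sim and A_matrix are defined above):
scores <- sim %*% A_matrix    # rows are y' = x'A, i.e., the scores y = A'x
round(cov(scores), 2)         # approximately Lambda = diag(4.41, 1.59)
round(colMeans(scores), 2)    # approximately A'mu = (9.21, 9.17)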
25.3.1 Population Principal Components
Consider $p \times 1$ vectors $x_1, \dots, x_n$, which are iid with $var(x_i) = \Sigma$.

- The first PC is the linear combination $y_1 = a_1'x = a_{11}x_1 + \dots + a_{1p}x_p$, with $a_1'a_1 = 1$, such that $var(y_1)$ is the maximum of all linear combinations of $x$ that have unit length.
- The second PC is the linear combination $y_2 = a_2'x = a_{21}x_1 + \dots + a_{2p}x_p$, with $a_2'a_2 = 1$, such that $var(y_2)$ is the maximum of all linear combinations of $x$ that have unit length and are uncorrelated with $y_1$ (i.e., $cov(a_1'x, a_2'x) = 0$).
- This continues for all $y_i$ up to $y_p$.

The $a_i$'s are the columns of the matrix $A$ in the decomposition $A'\Sigma A = \Lambda$, where $var(y_1) = \lambda_1, \dots, var(y_p) = \lambda_p$. The total variance of $x$ is

$$
var(x_1) + \dots + var(x_p) = tr(\Sigma) = \lambda_1 + \dots + \lambda_p = var(y_1) + \dots + var(y_p)
$$
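For the $2 \times 2$ example above, this is easy to verify in R:
sum(diag(Sigma))          # tr(Sigma) = 4 + 2 = 6
sum(eigen(Sigma)$values)  # also 6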
Data Reduction
To reduce the dimension of the data from $p$ (original) to $k$ dimensions without much “loss of information,” we can use properties of the population principal components.

Suppose $\Sigma \approx \sum_{i=1}^k \lambda_i a_i a_i'$. Even though the true variance-covariance matrix has rank $p$, it can be well approximated by a matrix of rank $k$ ($k < p$); a small sketch follows below.

The new “traits” are linear combinations of the measured traits. We can attempt to give these combinations a meaningful interpretation (subject to the orthogonality constraints).

The proportion of the total variance accounted for by the $j$-th principal component is

$$
\frac{var(y_j)}{\sum_{i=1}^p var(y_i)} = \frac{\lambda_j}{\sum_{i=1}^p \lambda_i}
$$

The proportion of the total variation accounted for by the first $k$ principal components is $\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i}$.

In the example above, $4.4144 / (4 + 2) = 0.735$ of the total variability can be explained by the first principal component.
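The same quantities can be computed in R from the example Sigma, along with the rank-1 approximation mentioned above:
lambda <- eigen(Sigma)$values
a1 <- eigen(Sigma)$vectors[, 1]
lambda[1] / sum(lambda)   # ~ 0.735, proportion explained by the first PC
lambda[1] * a1 %*% t(a1)  # rank-1 approximation of Sigma (compare with Sigma)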
25.3.2 Sample Principal Components
Since $\Sigma$ is unknown, we use the sample covariance matrix

$$
S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'
$$

Let $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \dots \ge \hat{\lambda}_p \ge 0$ be the eigenvalues of $S$ and $\hat{a}_1, \hat{a}_2, \dots, \hat{a}_p$ denote the corresponding eigenvectors of $S$.

Then the $i$-th sample principal component score (or principal component, or score) is

$$
\hat{y}_{ij} = \sum_{k=1}^p \hat{a}_{ik} x_{kj} = \hat{a}_i' x_j
$$
Properties of Sample Principal Components
- The estimated variance of $y_i = \hat{a}_i'x_j$ is $\hat{\lambda}_i$.
- The sample covariance between $\hat{y}_i$ and $\hat{y}_{i'}$ is 0 when $i \ne i'$.
- The proportion of the total sample variance accounted for by the $i$-th sample principal component is $\frac{\hat{\lambda}_i}{\sum_{k=1}^p \hat{\lambda}_k}$.
- The estimated correlation between the $i$-th principal component score and the $l$-th attribute of $x$ is

$$
r_{x_l, \hat{y}_i} = \frac{\hat{a}_{il}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{ll}}}
$$

- The correlation coefficient is typically used to interpret the components (i.e., if this correlation is high, then it suggests that the $l$-th original trait is important in the $i$-th principal component). According to R. A. Johnson, Wichern, et al. (2002), pp. 433-434, $r_{x_l, \hat{y}_i}$ only measures the univariate contribution of an individual $X$ to a component $Y$, without taking into account the presence of the other $X$'s. Hence, some prefer the $\hat{a}_{il}$ coefficients to interpret the principal components.
- $r_{x_l, \hat{y}_i}$ and $\hat{a}_{il}$ are referred to as “loadings.”
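A small sketch of this formula, using prcomp on the simulated sim data from earlier (here $l = 1$, $i = 1$):
pc <- prcomp(sim)                              # covariance-based PCA of sim
pc$rotation[1, 1] * pc$sdev[1] / sd(sim[, 1])  # a_il * sqrt(lambda_i) / sqrt(s_ll)
cor(sim[, 1], pc$x[, 1])                       # same value: the loading as a correlation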
To use $k$ principal components, we must calculate the scores for each data vector in the sample:

$$
y_j =
\begin{pmatrix} y_{1j} \\ y_{2j} \\ \vdots \\ y_{kj} \end{pmatrix}
=
\begin{pmatrix} \hat{a}_1'x_j \\ \hat{a}_2'x_j \\ \vdots \\ \hat{a}_k'x_j \end{pmatrix}
=
\begin{pmatrix} \hat{a}_1' \\ \hat{a}_2' \\ \vdots \\ \hat{a}_k' \end{pmatrix}
x_j
$$
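A minimal sketch of the score computation, again with the simulated sim data (prcomp centers $x_j$ by default; centering only shifts the scores by a constant, and signs may flip column-wise):
a_hat  <- eigen(cov(sim))$vectors               # columns are the a_hat_i
x_cent <- scale(sim, center = TRUE, scale = FALSE)
head(x_cent %*% a_hat)                          # scores computed by hand
head(prcomp(sim)$x)                             # same, up to column sign flips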
Issues:

- Large-sample theory exists for the eigenvalues and eigenvectors of sample covariance matrices if inference is necessary. But we do not do inference with PCA; we only use it for exploratory or descriptive analysis.
- PCA is not invariant to changes in scale (exception: if all traits are rescaled by multiplying by the same constant, such as feet to inches).
  - PCA based on the correlation matrix $R$ is different from that based on the covariance matrix $\Sigma$.
  - PCA on the correlation matrix amounts to rescaling each trait to have unit variance: transform $x$ to $z$ where $z_{ij} = (x_{ij} - \bar{x}_i)/\sqrt{s_{ii}}$, where the denominator affects the PCA. After this transformation, $cov(z) = R$.
  - PCA on $R$ is calculated in the same way as that on $S$ (where $\hat{\lambda}_1 + \dots + \hat{\lambda}_p = p$); see the sketch after this list.
- The use of $R$ or $S$ depends on the purpose of the PCA.
  - If the scales of the observations are similar, the covariance matrix is preferable; but if they are dramatically different, the analysis can be dominated by the large-variance traits.
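A quick sketch of the correlation-matrix points above, using the simulated sim data from earlier:
all.equal(prcomp(sim, scale. = TRUE)$sdev,  # PCA on R ...
          prcomp(scale(sim))$sdev)          # ... equals PCA on the standardized data
sum(eigen(cor(sim))$values)                 # eigenvalues of R sum to p = 2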
How many PCs to use can be guided by:

- Scree graphs: plot the eigenvalues against their indices. Look for the “elbow” where the steep decline in the graph suddenly flattens out, or for big gaps.
- A minimum percent of total variation (e.g., choose enough components to account for 50% or 90% of the total variance); this can also be used for interpretation.
- Kaiser’s rule: use only those PCs with eigenvalues larger than 1 (applied to PCA on the correlation matrix); an ad hoc rule.
- Compare the eigenvalue scree plot of the data to the scree plot when the data are randomized (sketched below).
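A sketch of the scree plot and the randomization comparison; X here is only a placeholder for a numeric data matrix, not an object defined above:
# X is a placeholder for any numeric data matrix
pca_obs  <- prcomp(X, scale. = TRUE)
X_perm   <- apply(X, 2, sample)               # permute each column independently
pca_perm <- prcomp(X_perm, scale. = TRUE)
plot(pca_obs$sdev^2, type = "b", xlab = "Component", ylab = "Eigenvalue")
lines(pca_perm$sdev^2, type = "b", lty = 2)   # keep components above the dashed curve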
25.3.3 Application
PCA on the covariance matrix is usually not preferred, because PCA is not invariant to changes in scale; hence, PCA on the correlation matrix is generally preferred.

This also addresses the problem of multicollinearity.

The eigenvectors may differ by a multiplication of $-1$ across implementations, but the interpretation is the same.
library(tidyverse)
## Read in and check data
stock <- read.table("images/stock.dat")
names(stock) <- c("allied", "dupont", "carbide", "exxon", "texaco")
str(stock)
#> 'data.frame': 100 obs. of 5 variables:
#> $ allied : num 0 0.027 0.1228 0.057 0.0637 ...
#> $ dupont : num 0 -0.04485 0.06077 0.02995 -0.00379 ...
#> $ carbide: num 0 -0.00303 0.08815 0.06681 -0.03979 ...
#> $ exxon : num 0.0395 -0.0145 0.0862 0.0135 -0.0186 ...
#> $ texaco : num 0 0.0435 0.0781 0.0195 -0.0242 ...
## Covariance matrix of data
cov(stock)
#> allied dupont carbide exxon texaco
#> allied 0.0016299269 0.0008166676 0.0008100713 0.0004422405 0.0005139715
#> dupont 0.0008166676 0.0012293759 0.0008276330 0.0003868550 0.0003109431
#> carbide 0.0008100713 0.0008276330 0.0015560763 0.0004872816 0.0004624767
#> exxon 0.0004422405 0.0003868550 0.0004872816 0.0008023323 0.0004084734
#> texaco 0.0005139715 0.0003109431 0.0004624767 0.0004084734 0.0007587370
## Correlation matrix of data
cor(stock)
#> allied dupont carbide exxon texaco
#> allied 1.0000000 0.5769244 0.5086555 0.3867206 0.4621781
#> dupont 0.5769244 1.0000000 0.5983841 0.3895191 0.3219534
#> carbide 0.5086555 0.5983841 1.0000000 0.4361014 0.4256266
#> exxon 0.3867206 0.3895191 0.4361014 1.0000000 0.5235293
#> texaco 0.4621781 0.3219534 0.4256266 0.5235293 1.0000000
# cov(scale(stock)) # gives the same result
## PCA with covariance
cov_pca <- prcomp(stock)
# prcomp uses singular value decomposition for calculation and an n - 1 divisor
# alternatively, princomp can do PCA via spectral decomposition,
# but it has worse numerical accuracy
# eigen values
cov_results <- data.frame(eigen_values = cov_pca$sdev ^ 2)
cov_results %>%
mutate(proportion = eigen_values / sum(eigen_values),
cumulative = cumsum(proportion))
#> eigen_values proportion cumulative
#> 1 0.0035953867 0.60159252 0.6015925
#> 2 0.0007921798 0.13255027 0.7341428
#> 3 0.0007364426 0.12322412 0.8573669
#> 4 0.0005086686 0.08511218 0.9424791
#> 5 0.0003437707 0.05752091 1.0000000
# the first 2 PCs account for 73% of the variance in the data
# eigen vectors
cov_pca$rotation # prcomp calls them "rotation"
#> PC1 PC2 PC3 PC4 PC5
#> allied 0.5605914 0.73884565 -0.1260222 -0.28373183 0.20846832
#> dupont 0.4698673 -0.09286987 -0.4675066 0.68793190 -0.28069055
#> carbide 0.5473322 -0.65401929 -0.1140581 -0.50045312 0.09603973
#> exxon 0.2908932 -0.11267353 0.6099196 0.43808002 0.58203935
#> texaco 0.2842017 0.07103332 0.6168831 -0.06227778 -0.72784638
# princomp calls them "loadings"
# first PC = overall average
# second PC compares Allied to Carbide
## PCA with correlation
#same as scale(stock) %>% prcomp
cor_pca <- prcomp(stock, scale = T)
# eigen values
cor_results <- data.frame(eigen_values = cor_pca$sdev ^ 2)
cor_results %>%
mutate(proportion = eigen_values / sum(eigen_values),
cumulative = cumsum(proportion))
#> eigen_values proportion cumulative
#> 1 2.8564869 0.57129738 0.5712974
#> 2 0.8091185 0.16182370 0.7331211
#> 3 0.5400440 0.10800880 0.8411299
#> 4 0.4513468 0.09026936 0.9313992
#> 5 0.3430038 0.06860076 1.0000000
# the first eigenvalue corresponds to less variance
# than PCA based on the covariance matrix
# eigen vectors
cor_pca$rotation
#> PC1 PC2 PC3 PC4 PC5
#> allied 0.4635405 0.2408499 -0.6133570 0.3813727 -0.4532876
#> dupont 0.4570764 0.5090997 0.1778996 0.2113068 0.6749814
#> carbide 0.4699804 0.2605774 0.3370355 -0.6640985 -0.3957247
#> exxon 0.4216770 -0.5252647 0.5390181 0.4728036 -0.1794482
#> texaco 0.4213291 -0.5822416 -0.4336029 -0.3812273 0.3874672
# interpretation of PC2 is different from above:
# it is a comparison of Allied, Dupont and Carbide to Exxon and Texaco
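A biplot is another common way to inspect the scores and loadings together (not shown in the output above):
biplot(cor_pca, cex = 0.6)   # PC1-PC2 biplot of the correlation-based fit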
Covid Example
To reduce the collinearity problem in this dataset, we can use principal components as regressors.
load('images/MOcovid.RData')
covidpca <- prcomp(ndat[, -1], scale = T, center = T)
covidpca$rotation[,1:2]
#> PC1 PC2
#> X..Population.in.Rural.Areas 0.32865838 0.05090955
#> Area..sq..miles. 0.12014444 -0.28579183
#> Population.density..sq..miles. -0.29670124 0.28312922
#> Literacy.rate -0.12517700 -0.08999542
#> Families -0.25856941 0.16485752
#> Area.of.farm.land..sq..miles. 0.02101106 -0.31070363
#> Number.of.farms -0.03814582 -0.44809679
#> Average.value.of.all.property.per.farm..dollars. -0.05410709 0.14404306
#> Estimation.of.rurality.. -0.19040210 0.12089501
#> Male.. 0.02182394 -0.09568768
#> Number.of.Physcians.per.100.000 -0.31451606 0.13598026
#> average.age 0.29414708 0.35593459
#> X0.4.age.proportion -0.11431336 -0.23574057
#> X20.44.age.proportion -0.32802128 -0.22718550
#> X65.and.over.age.proportion 0.30585033 0.32201626
#> prop..White..nonHisp 0.35627561 -0.14142646
#> prop..Hispanic -0.16655381 -0.15105342
#> prop..Black -0.33333359 0.24405802
# Variability of each principal component: pr.var
pr.var <- covidpca$sdev ^ 2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
plot(
pve,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 0.5),
type = "b"
)
plot(
cumsum(pve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1),
type = "b"
)
# the first six principal components account for around 80% of the variance
# using the base lm function for PC regression
pcadat <- data.frame(covidpca$x[, 1:6])
pcadat$y <- ndat$Y
pcr.man <- lm(log(y) ~ ., pcadat)
mean(pcr.man$residuals ^ 2)
#> [1] 0.03453371
# comparison to lm without principal components
lm.fit <- lm(log(Y) ~ ., data = ndat)
mean(lm.fit$residuals ^ 2)
#> [1] 0.02335128
The MSE for the PC-based model is larger than that for the regular regression, because models with a large degree of collinearity can still perform well.
The pcr function in the pls package can be used for fitting PC regression (it will select the optimal number of components in the model).
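A minimal sketch of such a fit, mirroring the manual lm fit above on the same ndat data, with cross-validation to guide the choice of the number of components:
library(pls)
pcr_fit <- pcr(log(Y) ~ ., data = ndat, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")   # choose ncomp at the lowest CV error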