8 Attribute analysis
Installation/loading libraries/data used
if (!require(highcharter)) install.packages('highcharter')
library(highcharter)
if (!require(openxlsx)) install.packages('openxlsx')
library(openxlsx)
if (!require(plotly)) install.packages('plotly')
library(plotly)
if (!require(ggplot2)) install.packages('ggplot2')
library(ggplot2)
if (!require(tidyverse)) install.packages('tidyverse')
library(tidyverse)
8.1 Introduction
In general, attribute analysis consists of studying, comparing and manipulating the variables of our data analysis, transforming or combining them to create new variables (or eliminating some) in order to obtain a new collection of variables that represents the information contained in the initial collection better and more compactly. In what follows, to develop the concepts of this topic, we will consider that our variables are a collection of attributes, such as population, area, GDP, etc., associated with the countries of the world, as they appear in the owid_country table. We will denote each attribute (or variable) by the vector $x_j=(x_{1j},x_{2j},\ldots,x_{mj})$, where $m$ represents the number of countries. That is, if the first attribute ($j=1$) is the population, then $x_{i1}$ represents the population of country $i$.
The first phase of our attribute analysis is to compare each pair of attributes $x_j$ and $x_{j'}$ with each other. The easiest way to compare two variables is to calculate their correlation, $r$, which is given by the formula:
$$r=\frac{\sum_{i=1}^{m}(x_{ij}-\bar{x}_j)(x_{ij'}-\bar{x}_{j'})}{\sqrt{\sum_{i=1}^{m}(x_{ij}-\bar{x}_j)^2\,\sum_{i=1}^{m}(x_{ij'}-\bar{x}_{j'})^2}}$$
where $\bar{x}_j$ and $\bar{x}_{j'}$ represent the means of the respective variables. The correlation value $r$ lies in the range between -1 and 1 and can be interpreted as follows:
- If $r=1$, the variable $x_{j'}$ can be obtained exactly and linearly from $x_j$; when the value of one variable increases, the other also increases, and if we draw the scatter plot of both variables, all points fall exactly on the regression line.
- If $r=0$, there is no linear relationship between the two variables and the scatter plot will normally show a dispersed cloud of points.
- If $r=-1$, the case is identical to $r=1$, with the only difference that, in this case, when one variable increases, the other decreases.
In general, the closer the correlation is to 1 or -1, the stronger the linear relationship between the variables and the better aligned their values will be on the regression line. An important property of the correlation is that it is independent of the unit of measurement of the variables (as long as the transformation between measurements is linear); that is, it does not matter whether we measure a population in millions or in thousands, or a temperature in degrees Celsius or Fahrenheit: the correlation of these variables with other variables gives the same value regardless of the unit of measurement used.
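We can check this property numerically with a couple of made-up vectors (the numbers below are purely illustrative and are not taken from owid_country):
pop_millions <- c(10, 25, 47, 83, 211)      # a made-up population attribute, in millions
life_exp     <- c(71, 78, 83, 81, 76)       # a made-up life expectancy attribute, in years
cor(pop_millions, life_exp)                 # correlation with population in millions
cor(pop_millions * 1e6, life_exp)           # same value with population measured in persons
cor(pop_millions, (life_exp * 9/5) + 32)    # same value after a Celsius-to-Fahrenheit-style rescaling
All three calls return exactly the same value.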
It may be that the correlation between two variables improves if we first apply a non-linear transformation (such as a logarithm) to one of them (or to both) before calculating the correlation. This is what we do in the example dashboard Dashboard3.Rmd from the previous topic, where we study how to transform the attributes, using the logarithm or the Box-Cox function, to study the relationship between pairs of attributes. For example, we observe that if we apply a logarithm to GDP per capita, its correlation with life expectancy improves greatly. This tells us that the relationship between the two variables is non-linear.
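A quick way to check this effect numerically could look like the following sketch (it assumes that the owid_country table contains columns named gdp_per_capita and life_expectancy):
# Compare the correlation with life expectancy before and after the log transform
owid_country %>%
  select(gdp_per_capita, life_expectancy) %>%
  drop_na() %>%
  summarise(
    r_linear = cor(gdp_per_capita, life_expectancy),       # correlation with the raw variable
    r_log    = cor(log(gdp_per_capita), life_expectancy)   # correlation after the log transform
  )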
We will say that a variable $x_j$ does not contribute new information to our data analysis if it can be obtained from the other variables. That is:
$$x_{ij}=F(\{x_{ik}\}_{k\neq j}) \quad \forall i$$
for a certain function $F$. Therefore, we can remove $x_j$ from our data analysis without losing information. In particular, if the correlation between two variables is close to 1 or -1, we can remove one of them, because one can be obtained from the other. For example, the owid_country table stores as indicators the percentage of people over 65 years old and the percentage of people over 70 years old. These two indicators are closely related to each other. If we use the dashboard Dashboard3.Rmd, we observe that the correlation is 0.99 and, therefore, one can be accurately calculated as a function of the other from the regression line, which in this case is given by
aged_70_older = 0.68 ⋅ aged_65_older − 0.39
Therefore, we could eliminate the indicator of the percentage of people over 70 years of age from our data analysis without losing relevant information. Another example of variables that we could discard are variables that have a constant value (in this case the function $F$ would be a constant). These types of variables do not contribute anything when it comes to discriminating between countries.
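The correlation and the regression line above could be reproduced directly in R with something like the following sketch (using the column names of owid_country):
# Correlation and regression line between the two age indicators
ages <- owid_country %>%
  select(aged_65_older, aged_70_older) %>%
  drop_na()
cor(ages$aged_65_older, ages$aged_70_older)       # approximately 0.99
lm(aged_70_older ~ aged_65_older, data = ages)    # intercept and slope of the regression line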
In this chapter we are going to study how to combine the variables and reduce their number without losing relevant information. This problem is called dimensionality reduction.
A limitation of the techniques that we are going to analyze is that we will only use linear combinations of the variables, that is, sums of variables multiplied by coefficients. This leaves out more complex dependencies, such as those given by the logarithm function or the Box-Cox transformation used in the dashboard Dashboard3.Rmd. One way to add non-linear transformations would be, before starting our linear analysis, to add to the table (or substitute into it) variables that are the result of applying non-linear transformations to the initial variables. For example, in view of the results of the dashboard Dashboard3.Rmd, we could replace the variable gdp_per_capita with log(gdp_per_capita), which has a better linear relationship with the other variables than the initial one.
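A minimal sketch of that substitution (the table name owid_country_log is just an illustrative choice) could be:
# Replace gdp_per_capita by its logarithm before starting the linear analysis
owid_country_log <- owid_country %>%
  mutate(log_gdp_per_capita = log(gdp_per_capita)) %>%  # add the transformed variable
  select(-gdp_per_capita)                               # remove the original one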
8.2 The correlation matrix
As we saw in the previous section, a simple way to eliminate variables is to study the correlation between pairs of them: if the correlation is very close to 1, we can eliminate one of the two. To do this, we use the correlation matrix, where each value of the matrix corresponds to the value of the correlation, $r$, between two variables. For example, in the following interactive graph we visualize the correlation matrix of the owid_country table. Non-numeric variables must first be removed, since the correlation is only calculated for numerical variables, and the NA values must also be removed before doing the calculation.
owid_country %>%
  select(-location, -continent, -iso_code) %>%  # remove non-numeric variables from the table
  cor(use = 'complete.obs') %>%                 # correlation matrix, using only complete observations (drops NA)
  hchart()                                      # interactive plot of the correlation matrix
Figure 8.1: Correlation matrix using hchart
From the exploration of the correlation values, we observe that aged_65_older and aged_70_older are closely related, that population and cardiovasc_death_rate are only weakly related to the rest of the variables, and that extreme_poverty, in general, has a negative correlation with the rest of the variables, that is, when this variable increases, the others decrease.
8.3 Principal Component Analysis (PCA)
Principal component analysis (PCA) transforms the variables $x_j$ into new variables $y_j$ by making linear combinations of the initial ones. The objective of PCA is to find the linear combinations of the $x_j$ that maximize the variance of the $y_j$ in descending order, that is, $y_1$ is the linear combination with the greatest possible variance and $y_n$ is the linear combination with the lowest possible variance. When the variance of a $y_j$ is very small, we consider that it does not provide relevant information and we discard it; in this way we reduce the dimensionality, keeping only the variables $y_j$ with significant variance, which are what we call principal components.
Theoretical foundation
Necessary algebraic concepts
Definition: The dot product of two vectors $u$ and $v$ of size $n$ is defined as $(u,v)=u_1v_1+\cdots+u_nv_n$. We will always consider the vectors as column vectors, so we can write $(u,v)=u^Tv$, where $u^T$ is the transposed vector.
Definition: A square matrix $U$ is orthonormal (or unitary) if $U^TU=Id$ is satisfied, that is, its transpose multiplied by the matrix itself is the identity matrix. Therefore, for an orthonormal matrix the inverse is the transpose.
Theorem: The column vectors of an orthonormal matrix $U=(u_1,\cdots,u_n)$ (with $u_k=(u_{1k},\ldots,u_{nk})$) satisfy $(u_k,u_k)=1$ and, if $k\neq k'$, then $(u_k,u_{k'})=0$, that is, the vectors are perpendicular.
Definition: The real number $\lambda_k$ is an eigenvalue of the square matrix $A$ if there exists a non-null vector $u_k=(u_{1k},\ldots,u_{nk})$, called eigenvector, such that $Au_k=\lambda_k u_k$. If $u_k$ is an eigenvector, then multiplying it by any non-zero number gives another eigenvector. Therefore we can always rescale the eigenvector so that it satisfies $(u_k,u_k)=1$.
Theorem: If the square matrix $A$ of dimension $n$ is symmetric, then it has $n$ eigenvalues that we can order in descending order, $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n$, and it also has an orthonormal matrix of eigenvectors $U=(u_1,u_2,\ldots,u_n)$ formed by the eigenvectors $u_k$ in columns.
Theorem: If $X$ is any matrix of size $m\times n$, then $A=X^TX$ is a square matrix of size $n\times n$, it is symmetric, and all its eigenvalues are greater than or equal to zero.
Definition: Given two variables $x_j$ and $x_{j'}$, the covariance $Cov(x_j,x_{j'})$ between them is defined as:
$$Cov(x_j,x_{j'})=\frac{\sum_{i=1}^{m}(x_{ij}-\bar{x}_j)(x_{ij'}-\bar{x}_{j'})}{m-1}$$
where $m$ is the number of observations.
Definition: If $X=(x_1,x_2,\ldots,x_n)$ is a matrix representing $n$ variables $x_1,x_2,\ldots,x_n$, the covariance matrix of $X$ is the matrix $\Sigma$ of dimension $n\times n$ whose elements are $\Sigma_{jj'}=Cov(x_j,x_{j'})$.
Definition: Normalizing a variable $x_j$ consists of subtracting its mean and dividing by its standard deviation. Normalized variables have mean zero and variance and standard deviation equal to 1.
Theorem: If $X=(x_1,x_2,\ldots,x_n)$ is a matrix representing $n$ normalized variables $x_1,x_2,\ldots,x_n$, then the covariance matrix $\Sigma$ satisfies:
$$\Sigma=\frac{X^TX}{m-1}$$
and moreover
$$\lambda_1+\lambda_2+\cdots+\lambda_n=n$$
where the $\lambda_j$ are the eigenvalues of the covariance matrix $\Sigma$. Furthermore, since the covariance matrix $\Sigma$ is proportional to $X^TX$, its eigenvectors are the same, and the eigenvalues of $X^TX$ are those of $\Sigma$ multiplied by $m-1$.
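These last results can be checked numerically; the following sketch uses three of the owid_country indicators (any small numeric table would serve):
# Numerical check: for normalized variables, cov(X) = t(X) %*% X / (m - 1)
# and the eigenvalues of the covariance matrix add up to n
X <- owid_country %>%
  select(human_development_index, aged_65_older, aged_70_older) %>%
  drop_na() %>%
  scale()                                  # normalized variables: mean 0, sd 1
m <- nrow(X)
max(abs(cov(X) - t(X) %*% X / (m - 1)))    # practically zero: both expressions coincide
sum(eigen(cov(X))$values)                  # equals n = 3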
Mathematical foundation of PCA
Let us consider $n$ variables $x_j$ (the column vectors of our table), and the observations in the form of a matrix $X=(x_{ij})$ of size $m\times n$. Let us replace $X$ with a new collection of variables $Y=XU$, where $U$ is an orthonormal matrix of size $n$. That is, the new variables $y_j=Xu_j$ are linear combinations of the previous ones, obtained from an orthonormal matrix, which can be written as:
$$\begin{aligned}
y_1&=u_{11}x_1+u_{21}x_2+\cdots+u_{n1}x_n\\
y_2&=u_{12}x_1+u_{22}x_2+\cdots+u_{n2}x_n\\
&\cdots\\
y_n&=u_{1n}x_1+u_{2n}x_2+\cdots+u_{nn}x_n
\end{aligned}$$
Principal component analysis is a technique for choosing the orthonormal matrix $U$ in such a way that the variables $y_j=Xu_j$ (the principal components) are ordered in descending order of their variance. That is, the first principal component $y_1=Xu_1$ corresponds to the linear combination of the original variables with the greatest possible variance; $y_2=Xu_2$ is the combination orthogonal to the previous one ($(u_1,u_2)=0$) with the greatest possible variance; $y_3=Xu_3$ is the combination orthogonal to the previous ones ($(u_1,u_3)=0$ and $(u_2,u_3)=0$) with the greatest possible variance, and so on. The greater the variance of $y_j$, the more information the variable contains in terms of, for example, distinguishing between different observations. Note that a variable with a constant value (that is, with zero variance) does not provide any information to discern between some observations and others. Given that the variables $x_j$ can have very different magnitudes and variances, to balance the contribution of each variable they are first normalized, that is, their mean is subtracted and they are divided by their standard deviation. Therefore, we will consider that the variables $x_j$ have mean zero and standard deviation 1. If the mean of each variable $x_j$ is zero, then $y_j=Xu_j$, which is a linear combination of the $x_j$, will also have mean zero and therefore its variance will be
$$Var(y_j)=\frac{\sum_{i=1}^{m}y_{ij}^2}{m-1}=\frac{y_j^Ty_j}{m-1}=\frac{u_j^TX^TXu_j}{m-1}$$
The key to maximizing the variance is the eigenvectors of the matrix $A=X^TX$. If $\lambda_j$ are the eigenvalues of $A$ ordered in descending order, that is, $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n$, and $u_j$ are their corresponding eigenvectors, then
$$Var(y_j)=\frac{u_j^TX^TXu_j}{m-1}=\frac{u_j^T\lambda_ju_j}{m-1}=\frac{\lambda_ju_j^Tu_j}{m-1}=\frac{\lambda_j}{m-1}$$
Therefore, the procedure to calculate the principal components of $X$ can be divided into the following steps (a short R sketch of them is given after the list):
1. Normalize the original variables $x_j$ so that they have mean zero and variance 1.
2. Calculate the eigenvalues $\lambda_j$ (sorted in descending order) and the eigenvectors $u_j$ of the matrix $X^TX$.
3. The principal components are $y_j=Xu_j$ and their variance is $\lambda_j/(m-1)$.
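A minimal sketch of these three steps using base R could be as follows (the signs of the eigenvector columns may differ from those returned by prcomp, which is what we will actually use below):
# Manual PCA following the three steps above
pca_manual <- function(X) {
  Xn  <- scale(X)                  # step 1: normalize the variables (mean 0, sd 1)
  eig <- eigen(t(Xn) %*% Xn)       # step 2: eigenvalues/eigenvectors of X^T X (in descending order)
  list(
    sdev     = sqrt(eig$values / (nrow(Xn) - 1)),  # standard deviations of the components
    rotation = eig$vectors,                        # matrix U of eigenvectors
    x        = Xn %*% eig$vectors                  # step 3: principal components Y = XU
  )
}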
PCA Example
To illustrate the process of calculating the principal components we are going to take a simple case extracted from the owid_country table, where we will use 3 variables ($x_1$ = human_development_index, $x_2$ = aged_65_older and $x_3$ = aged_70_older) and the observations from 5 countries (China, Canada, Nigeria, Brazil and Germany). To calculate the PCA the table can only contain numerical variables. To preserve the names of the countries, the column_to_rownames function is used, which assigns the countries as the names of the rows; in this way they are managed as an element external to the table and do not interfere with the calculation of the PCA.
data <- owid_country %>%
  filter( # we filter the 5 countries that will be used
    location == "China" |
      location == "Canada" |
      location == "Nigeria" |
      location == "Brazil" |
      location == "Germany"
  ) %>%
  column_to_rownames(var = "location") %>%
  select(human_development_index, aged_65_older, aged_70_older)
data
## human_development_index aged_65_older aged_70_older
## Brazil 0.765 8.552 5.060
## Canada 0.929 16.984 10.797
## China 0.761 10.641 5.929
## Germany 0.947 21.453 15.957
## Nigeria 0.539 2.751 1.447
Next we print the normalized variables, obtained with the scale function, and their correlation matrix, which shows a high correlation between the variables.
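The chunk that produces this output is not shown; it would be something along the lines of:
# Normalize the variables and compute their correlation matrix
data_norm <- scale(data)   # subtract the mean and divide by the standard deviation
data_norm                  # normalized variables (printed below)
cor(data_norm)             # correlation matrix of the normalized variables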
## human_development_index aged_65_older aged_70_older
## Brazil -0.1409164 -0.4824390 -0.4932478
## Canada 0.8552170 0.6718445 0.5253853
## China -0.1652124 -0.1964691 -0.3389525
## Germany 0.9645488 1.2836202 1.4415691
## Nigeria -1.5136370 -1.2765565 -1.1347540
## attr(,"scaled:center")
## human_development_index aged_65_older aged_70_older
## 0.7882 12.0762 7.8380
## attr(,"scaled:scale")
## human_development_index aged_65_older aged_70_older
## 0.1646366 7.3049644 5.6320575
Therefore the matrix $X$ is defined by:
$$\begin{array}{lrrr}
 & x_1 & x_2 & x_3\\
\text{Brazil} & -0.14 & -0.48 & -0.49\\
\text{Canada} & 0.86 & 0.67 & 0.53\\
\text{China} & -0.17 & -0.20 & -0.34\\
\text{Germany} & 0.96 & 1.28 & 1.44\\
\text{Nigeria} & -1.51 & -1.28 & -1.13
\end{array}$$
The new variables $y_j$ that we are going to construct as a linear combination of $x_1$, $x_2$ and $x_3$ have the form:
$$y_j=u_{1j}\begin{pmatrix}-0.14\\0.86\\-0.17\\0.96\\-1.51\end{pmatrix}+u_{2j}\begin{pmatrix}-0.48\\0.67\\-0.20\\1.28\\-1.28\end{pmatrix}+u_{3j}\begin{pmatrix}-0.49\\0.53\\-0.34\\1.44\\-1.13\end{pmatrix}=Xu_j$$
Since $y_j$ always has zero mean, its variance is $\frac{y_j^Ty_j}{4}=\frac{u_j^TX^TXu_j}{4}$. If $u_j$ is a normalized eigenvector of $X^TX$ for the eigenvalue $\lambda_j$, then the variance of $y_j$ is $\frac{u_j^TX^TXu_j}{4}=\frac{\lambda_ju_j^Tu_j}{4}=\frac{\lambda_j}{4}$.
We now compute the matrix $X^TX$:
$$X^TX=\begin{pmatrix}4.00 & 3.85 & 3.68\\3.85 & 4.00 & 3.96\\3.68 & 3.96 & 4.00\end{pmatrix}$$
Next, by hand or using any mathematical calculation software, we compute the orthonormal matrix $U$ with the eigenvectors of $X^TX$:
$$U=\begin{pmatrix}0.57 & 0.78 & 0.26\\0.58 & -0.16 & -0.79\\0.58 & -0.61 & 0.55\end{pmatrix}$$
At the same time we obtain the eigenvalues of $X^TX$, which are $\lambda_1=11.66$, $\lambda_2=0.33$ and $\lambda_3=0.01$ (we observe that their sum is approximately $12=n\cdot(m-1)=3\cdot 4$), and therefore the standard deviations of the $y_j$ are $\sigma_j=\sqrt{\lambda_j/4}$, that is: $\sigma_1=1.71$, $\sigma_2=0.29$, $\sigma_3=0.06$. The principal components are then:
$$\begin{aligned}
y_1&=0.57\cdot x_1+0.58\cdot x_2+0.58\cdot x_3\\
y_2&=0.78\cdot x_1-0.16\cdot x_2-0.61\cdot x_3\\
y_3&=0.26\cdot x_1-0.79\cdot x_2+0.55\cdot x_3
\end{aligned}$$
and, therefore, the new table of values, $Y=XU$, is given by:
$$\begin{array}{lrrr}
 & y_1 & y_2 & y_3\\
\text{Brazil} & -0.65 & 0.27 & 0.08\\
\text{Canada} & 1.18 & 0.24 & -0.02\\
\text{China} & -0.40 & 0.11 & -0.07\\
\text{Germany} & 2.13 & -0.33 & 0.02\\
\text{Nigeria} & -2.27 & -0.28 & -0.00
\end{array}$$
PCA calculation with prcomp
The prcomp function automatically performs the entire process of calculating the principal components. Let's use it for the previous example:
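The call itself is not shown in the text; given the description below, it would presumably be:
# PCA of the 3-variable, 5-country example; scale = TRUE normalizes the variables first
pca1 <- prcomp(data, scale = TRUE)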
The function prcomp returns a structure (in this example we have called it pca1) that stores, in the sdev field, the standard deviations of the principal components $y_j$; in the rotation field, the matrix of eigenvectors $U$; and in the x field, the principal components $Y=XU$. These results are printed below. It can be seen that they match what we calculated by hand for this example in the previous section.
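The chunk producing these printouts is not shown; it presumably amounts to accessing the three fields:
pca1$sdev       # standard deviations of the principal components
pca1$rotation   # matrix U of eigenvectors
pca1$x          # principal components Y = XU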
## [1] 1.70716366 0.28734312 0.05501071
## PC1 PC2 PC3
## human_development_index 0.5708408 0.7790411 0.2592986
## aged_65_older 0.5845568 -0.1638432 -0.7946375
## aged_70_older 0.5765709 -0.6051863 0.5489222
## PC1 PC2 PC3
## Brazil -0.6468462 0.2677714 0.0760700585
## Canada 1.1838460 0.2382161 -0.0237206224
## China -0.4045875 0.1086123 -0.0727761422
## Germany 2.1321196 -0.3313071 0.0214026069
## Nigeria -2.2645318 -0.2832927 -0.0009759008
PCA visualization
The main objective of PCA is to reduce dimensionality, keeping the first principal components, which accumulate the greatest variance, and discarding the rest. The criterion for deciding how many principal components we keep is based precisely on the analysis of their variance. Normally, the variance of the principal components $y_j$ is analyzed in relative terms, that is, as the percentage that the variance of $y_j$ represents with respect to the global variance given by the sum of all the variances ($\sum_j Var(y_j)$). To visualize this, we use an interactive bar chart, created with plotly, with the percentage of variance explained by each principal component.
p <- tibble(
  label = paste("PC", 1:length(pca1$sdev)),           # create labels for the horizontal axis
  varPercent = pca1$sdev^2 / sum(pca1$sdev^2) * 100   # percentage of explained variance
) %>%
  ggplot(aes(x = label, y = varPercent)) +  # create the bar chart (made interactive below with ggplotly)
  geom_bar(stat = "identity") +
  labs(x = "Principal components",
       y = "Percentage of explained variance")
ggplotly(p)
Figure 8.2: Percentage of variance explained by each principal component using ggplotly
To decide how many principal components we keep we can, for example, keep the principal components that accumulate more than 90% of the explained variance. Another criterion would be, for example, to eliminate the principal components whose explained variance is less than 2%. For this example, with the first criterion we would keep only the first principal component, and with the second criterion the first two.
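Both criteria can be checked directly from pca1, for example:
# Percentage of variance explained by each principal component
varPercent <- pca1$sdev^2 / sum(pca1$sdev^2) * 100
which(cumsum(varPercent) > 90)[1]   # first position where the accumulated variance exceeds 90% (here: 1)
which(varPercent >= 2)              # components explaining at least 2% of the variance (here: 1 and 2)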
A common way to visualize the results of PCA is a scatter plot showing the values of the first two principal components for each observation, that is, the points $(y_{i1},y_{i2})$ for $i=1,\ldots,m$; in this way we can see how the values of the first two components are distributed. Furthermore, to visualize the contribution of each variable $x_j$ to the first two components, for each $j$ a segment is drawn from $(0,0)$ to $(u_{j1},u_{j2})$, where $u_{j1}$ is the coefficient of $x_j$ in the linear combination of the first component $y_1$ and $u_{j2}$ is the coefficient of $x_j$ in the linear combination of the second component $y_2$. If this segment is aligned with the x-axis, it means that $x_j$ contributes little to the second component, and if the segment also has a large magnitude, it means that $x_j$ contributes more to the first component than the rest of the variables. To make this graph we will use the hchart function:
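The plotting chunk is not shown; since highcharter provides an hchart method for prcomp objects, it is presumably just:
hchart(pca1)   # interactive graph of the first two principal components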
Figure 8.3: Illustrative graph of the first two principal components using hchart
Next we are going to reproduce the experiment, but taking all the indicators and all the countries. First we prepare the data and calculate the PCA. Since the PCA calculation only allows numerical variables, we have to eliminate the non-numeric variables from the table. To preserve the names of the countries, instead of managing location as a variable, we use it as the name of each record; this way it does not interfere with the calculation of the PCA.
data <- owid_country %>%
  na.omit() %>%                               # remove records with some NA
  column_to_rownames(var = "location") %>%    # assign the location field as the name of the rows
  select(-continent, -iso_code)               # remove non-numeric variables
pca2 <- prcomp(data, scale = TRUE)            # we calculate the PCA
Now we make the bar chart with the percentage of variance explained by each principal component. By default, the bar chart automatically sorts the x-axis labels alphabetically; to keep the original order of the table unchanged we use the fct_inorder function.
p <- tibble(
  label = fct_inorder(paste("PC", 1:length(pca2$sdev))),  # labels kept in their original order
  varPercent = pca2$sdev^2 / sum(pca2$sdev^2) * 100       # percentage of explained variance
) %>%
  ggplot(aes(x = label, y = varPercent)) +
  geom_bar(stat = "identity") +
  labs(x = "Principal components",
       y = "Percentage of explained variance")
ggplotly(p)
Figure 8.4: Percentage of variance explained by each principal component using ggplotly
Finally, we present the graph that illustrates the result for the first two components:
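As before, the plotting chunk is presumably just the hchart method applied to pca2:
hchart(pca2)   # interactive graph of the first two principal components for all countries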
Figure 8.5: Illustrative graph of the first two principal components using hchart