8 Attribute analysis

Installation/loading of the libraries and data used

if (!require(highcharter)) install.packages('highcharter') 
library(highcharter) 
if (!require(openxlsx)) install.packages('openxlsx') 
library(openxlsx) 
if (!require(plotly)) install.packages('plotly') 
library(plotly)
if (!require(ggplot2)) install.packages('ggplot2') 
library(ggplot2)
if (!require(tidyverse)) install.packages('tidyverse') 
library(tidyverse)
owid_country <- read.xlsx("https://ctim.es/AEDV/data/owid_country.xlsx",sheet=1) %>%
  as_tibble()

8.1 Introduction

In general, we can consider that attribute analysis consists of studying, comparing and manipulating the variables of our data analysis, transforming or combining them to create new variables (or eliminate some) in order to obtain a new collection of variables that represents the information contained globally in the initial collection better and more compactly. In what follows, to develop the concepts of this topic, we will consider that our variables are a collection of attributes, such as population, area, GDP, etc., associated with the countries of the world, as they appear in the owid_country table. We will denote each attribute (or variable) by the vector $x_j=(x_{1j},x_{2j},\ldots,x_{mj})$, where $m$ represents the number of countries. That is, if the first attribute ($j=1$) is the population, then $x_{i1}$ represents the population of country $i$.

The first phase of our attribute analysis is to compare each pair of attributes $x_j$ and $x_{j'}$ with each other. The easiest way to compare two variables is to calculate their correlation, $r$, which is given by the formula:

$$r=\frac{\sum_{i=1}^m (x_{ij}-\bar{x}_j)(x_{ij'}-\bar{x}_{j'})}{\sqrt{\sum_{i=1}^m (x_{ij}-\bar{x}_j)^2}\,\sqrt{\sum_{i=1}^m (x_{ij'}-\bar{x}_{j'})^2}}$$

where $\bar{x}_j$ and $\bar{x}_{j'}$ represent the means of the respective variables. The correlation value $r$ lies in the range between -1 and 1 and can be interpreted as follows:

  • If $r=1$, the variable $x_{j'}$ can be obtained exactly and linearly from $x_j$: if the value of one variable increases the other also increases, and if we draw the scatter plot of both variables, all points fall exactly on the regression line.
  • If $r=0$, there is no linear relationship between the variables and the scatter plot will normally show a dispersed cloud of points.
  • If $r=-1$, the case is identical to $r=1$ with the only difference that, in this case, when one variable increases, the other decreases.

In general, the closer the correlation is to 1 or -1, the stronger the linear relationship between the variables and the better aligned their values will be on the regression line. An important property of the correlation is that it is independent of the unit of measurement of the variables (as long as the transformation between the measurements is linear); that is, it does not matter whether we measure a population in millions or in thousands, or a temperature in degrees Celsius or Fahrenheit: the correlation of these variables with other variables gives the same value regardless of the unit of measurement used.
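
As a quick check of this property (a small sketch, not part of the original analysis), we can verify in R that rescaling a variable does not change its correlation with another:

r1 <- cor(owid_country$population, owid_country$life_expectancy,
          use = "complete.obs")
r2 <- cor(owid_country$population/1000, owid_country$life_expectancy,
          use = "complete.obs") # population measured in thousands
all.equal(r1, r2) # TRUE: the correlation does not depend on the unit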

It may be that the correlation between the variables improves if we first apply a non-linear transformation (such as a logarithm) to one of them (or both) before calculating the correlation. This is what we do in the example dashboard Dashboard3.Rmd from the previous topic, where we study how to transform the attributes using the logarithm or the Box-Cox function in order to study the relationship between pairs of attributes. For example, we observe that if we apply a logarithm to GDP per capita, its correlation with life expectancy improves greatly. This tells us that the relationship between the two variables is non-linear.
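
We can see this effect directly with a minimal check (assuming the column names gdp_per_capit and life_expectancy used in the correlation matrix of the next section):

cor(owid_country$gdp_per_capit, owid_country$life_expectancy,
    use = "complete.obs")      # correlation with the raw variable
cor(log(owid_country$gdp_per_capit), owid_country$life_expectancy,
    use = "complete.obs")      # correlation after the logarithm (noticeably higher)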

We will say that a variable $x_j$ does not contribute new information to our data analysis if it can be obtained from the other variables. That is:

$$x_{ij}=F(\{x_{ik}\}_{k\neq j}) \quad \forall i$$
for a certain function $F$. Therefore, we can remove $x_j$ from our data analysis without losing information. In particular, if the correlation between two variables is close to 1 or -1, we can remove one of them, because one can be obtained from the other. For example, in the owid_country table the percentage of people over 65 years old and the percentage of people over 70 years old are stored as indicators. These two indicators are closely related to each other. If we use the dashboard Dashboard3.Rmd, we observe that the correlation factor is 0.99 and, therefore, one can be accurately calculated as a function of the other from the regression line, which in this case is given by

$$\text{aged\_70\_older}=0.68\cdot\text{aged\_65\_older}-0.39$$
Therefore, we could eliminate the indicator of the percentage of people over 70 years of age from our data analysis without losing relevant information. Another example of variables that we could discard are variables that have a constant value (in this case the function $F$ would be a constant). These types of variables do not contribute anything when it comes to discriminating between countries.
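
This regression line can be reproduced directly in R (a quick sketch; the coefficients should come out close to the ones quoted above):

cor(owid_country$aged_65_older, owid_country$aged_70_older,
    use = "complete.obs")                              # approximately 0.99
lm(aged_70_older ~ aged_65_older, data = owid_country) # intercept and slope of the regression line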

In this chapter we are going to study how to combine the variables and reduce their number without losing relevant information. This problem is called dimensionality reduction.

A limitation of the techniques that we are going to analyze is that we will only use linear combinations of the variables, that is, sums of variables multiplied by coefficients. This leaves out more complex dependencies such as those given by the logarithm function or the Box-Cox transformation used in the dashboard Dashboard3.Rmd. One way to incorporate non-linear transformations would be, before starting our linear analysis, to add to the table (or substitute in it) variables that are the result of applying non-linear transformations to the initial variables. For example, in view of the results of the dashboard Dashboard3.Rmd, we could replace the variable gdp_per_capit with log(gdp_per_capit), which has a better linear relationship with the other variables than the initial variable.

8.2 The correlation matrix

As we saw in the previous section, a simple way to eliminate variables is to study the correlation between pairs of them: if the correlation is very close to 1 or -1, we can eliminate one of the two. To do this, the correlation matrix is used, where each entry of the matrix corresponds to the value of the correlation, $r$, between two variables. For example, in the following interactive graph we visualize the correlation matrix of the owid_country table. Non-numeric variables must first be removed, since the correlation is only calculated for numerical variables, and the NA values must also be removed before doing the calculation.

owid_country %>%
   select(-location,-continent,-iso_code) %>% # remove non-numeric variables from the table
   cor(use='complete.obs') %>% # correlation matrix calculation removing NA previously
   hchart() # interactive drawing correlation matrix

Figure 8.1: Correlation matrix using hchart

From the exploration of the correlation values, we observe that aged_65_older and aged_70_older are very closely related, that population and cardiovasc_death_rat are only weakly related to the rest of the variables, and that extreme_poverty, in general, has a negative correlation with the rest of the variables, that is, when this variable increases, the others decrease.
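
If we prefer to extract the most related pairs programmatically instead of reading them off the graph, one possible sketch (the 0.9 threshold is an arbitrary choice) is:

C <- owid_country %>%
   select(-location,-continent,-iso_code) %>%
   cor(use='complete.obs')
C[upper.tri(C, diag = TRUE)] <- NA   # keep each pair of variables only once
as.data.frame(as.table(C)) %>%       # one row per pair with its correlation
   filter(!is.na(Freq), abs(Freq) > 0.9) %>%
   arrange(desc(abs(Freq)))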

8.3 Principal Component Analysis (PCA)

Principal component analysis, PCA, transforms the variables $x_j$ into new ones, $y_j$, by making linear combinations of the initial ones. The objective of PCA is to find the linear combinations of the $x_j$ that maximize the variance of the $y_j$ in descending order, that is, $y_1$ is the linear combination with the greatest possible variance and $y_n$ is the linear combination with the lowest possible variance. When the variance of a $y_j$ is very small, we consider that it does not provide relevant information and we discard it; in this way we reduce the dimensionality, leaving only the variables $y_j$ with significant variance, which are what we call principal components.

Theoretical foundation

Necessary algebraic concepts

Definition: The dot product of two vectors $u$ and $v$ of size $n$ is defined as $(u,v)=u_1v_1+\cdots+u_nv_n$. We will always consider the vectors as columns, that is, we can write $(u,v)=u^Tv$, where $u^T$ is the transposed vector.

Definition: A square matrix $U$ is orthonormal (or unitary) if $U^TU=Id$ is satisfied, that is, its transpose multiplied by the matrix itself is the identity matrix. Therefore, for an orthonormal matrix the inverse is the transpose.

Theorem: The column vectors of an orthonormal matrix $U=(u_1,\ldots,u_n)$ (with $u_k=(u_{1k},\ldots,u_{nk})$) satisfy $(u_k,u_k)=1$ and, if $k\neq k'$, $(u_k,u_{k'})=0$, that is, the vectors are perpendicular.

Definition: The real number $\lambda_k$ is an eigenvalue of the square matrix $A$ if there exists a non-null vector $u_k=(u_{1k},\ldots,u_{nk})$, called an eigenvector, such that $Au_k=\lambda_ku_k$. If $u_k$ is an eigenvector, then, when multiplied by any non-zero number, it is still an eigenvector. Therefore, we can always adjust the eigenvector so that it satisfies $(u_k,u_k)=1$.

Theorem: If the square matrix $A$ of dimension $n$ is symmetric, then it has $n$ eigenvalues that we can order in descending order, $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n$, and it also has an orthonormal matrix of eigenvectors $U=(u_1,u_2,\ldots,u_n)$ formed by the eigenvectors $u_k$ in columns.

Theorem: If $X$ is any matrix of size $m\times n$, then $A=X^TX$ is a square matrix of size $n\times n$, symmetric, and all its eigenvalues are greater than or equal to zero.

Definition: Given two variables $x_j$ and $x_{j'}$, the covariance $Cov(x_j,x_{j'})$ between them is defined as:

$$Cov(x_j,x_{j'})=\frac{\sum_{i=1}^m (x_{ij}-\bar{x}_j)(x_{ij'}-\bar{x}_{j'})}{m-1}$$
where $m$ is the number of observations.

Definition: If $X=(x_1,x_2,\ldots,x_n)$ is a matrix representing $n$ variables $x_1,x_2,\ldots,x_n$, the covariance matrix of $X$ is the matrix $\Sigma$ of dimension $n\times n$ whose elements are $\Sigma_{jj'}=Cov(x_j,x_{j'})$.

Definition: Normalizing a variable $x_j$ consists of subtracting its mean and dividing it by its standard deviation. Normalized variables have a mean of zero and a variance and standard deviation equal to 1.
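
For example, normalizing a variable by hand gives the same result as R's scale function (a small check; life_expectancy is used here but any numeric column would do):

x <- owid_country$life_expectancy
x_norm <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) # normalization by hand
all.equal(x_norm, as.numeric(scale(x)))                     # TRUE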

Theorem: If $X=(x_1,x_2,\ldots,x_n)$ is a matrix representing $n$ normalized variables $x_1,x_2,\ldots,x_n$, then the covariance matrix $\Sigma$ satisfies:

$$\Sigma=\frac{X^TX}{m-1}$$
moreover,

$$\lambda_1+\lambda_2+\cdots+\lambda_n=n$$
where the $\lambda_j$ are the eigenvalues of the covariance matrix $\Sigma$. Furthermore, since the covariance matrix $\Sigma$ is proportional to $X^TX$, its eigenvectors are the same and the eigenvalues of $X^TX$ are those of $\Sigma$ multiplied by $m-1$.
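
A quick numerical check of this theorem (a sketch using three columns of owid_country, anticipating the example of the next sections):

X <- owid_country %>%
   filter(location %in% c("Brazil","Canada","China","Germany","Nigeria")) %>%
   select(human_development_index, aged_65_older, aged_70_older) %>%
   scale()                                                     # matrix of normalized variables
m <- nrow(X)
all.equal(cov(X), t(X) %*% X/(m-1), check.attributes = FALSE)  # TRUE
sum(eigen(cov(X))$values)                                      # equals n = 3, the number of variables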

Mathematical foundation of PCA

Let us consider $n$ variables $x_j$ (the column vectors of our table) and the observations in the form of a matrix $X=(x_{ij})$ of size $m\times n$. Let us replace $X$ with a new collection of variables $Y=XU$, where $U$ is an orthonormal matrix of size $n$. That is, the new variables $y_j=Xu_j$ are linear combinations of the previous ones obtained from an orthonormal matrix, which can be expressed as:

$$\begin{aligned} y_1&=u_{11}x_1+u_{21}x_2+\cdots+u_{n1}x_n\\ y_2&=u_{12}x_1+u_{22}x_2+\cdots+u_{n2}x_n\\ &\;\;\vdots\\ y_n&=u_{1n}x_1+u_{2n}x_2+\cdots+u_{nn}x_n \end{aligned}$$

Principal component analysis is a technique for choosing the orthonormal matrix $U$ in such a way that the variables $y_j=Xu_j$ (the principal components) are ordered in descending order of their variance. That is, the first principal component $y_1=Xu_1$ corresponds to the linear combination of the original variables that has the greatest possible variance; $y_2=Xu_2$ is the combination orthogonal to the previous one ($(u_1,u_2)=0$) with the greatest possible variance; $y_3=Xu_3$ is the combination orthogonal to the previous ones ($(u_1,u_3)=0$ and $(u_2,u_3)=0$) with the greatest possible variance, and so on. The greater the variance of $y_j$, the more information the variable contains in terms of, for example, distinguishing between different observations. Note that a variable with a constant value (that is, zero variance) does not provide any information to discern between some observations and others. Given that the variables $x_j$ can have very different magnitudes and variances, to balance the contribution of each variable they are first normalized, that is, their mean is subtracted and they are divided by their standard deviation. That is, we will consider that the variables $x_j$ have a mean of zero and a standard deviation of 1. If the mean of each variable $x_j$ is zero, then $y_j=Xu_j$, which is a linear combination of the $x_j$, will also have mean zero and therefore its variance will be

$$Var(y_j)=\frac{\sum_{i=1}^m y_{ij}^2}{m-1}=\frac{y_j^Ty_j}{m-1}=\frac{u_j^TX^TXu_j}{m-1}$$
The key to maximizing the variance is the eigenvectors of the matrix $A=X^TX$. If the $\lambda_j$ are the eigenvalues of $A$ ordered in descending order, that is, $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n$, and the $u_j$ are their corresponding eigenvectors, then

$$Var(y_j)=\frac{u_j^TX^TXu_j}{m-1}=\frac{u_j^T\lambda_ju_j}{m-1}=\frac{\lambda_ju_j^Tu_j}{m-1}=\frac{\lambda_j}{m-1}$$
Therefore, the procedure to calculate the principal components of $X$ can be divided into the following steps:

  • Normalize the original variables $x_j$ so that they have mean zero and variance 1.

  • Calculate the eigenvalues $\lambda_j$ (sorted in descending order) and the eigenvectors $u_j$ of the matrix $X^TX$.

  • The principal components are $y_j=Xu_j$ and their variance is $\frac{\lambda_j}{m-1}$.
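
These three steps translate almost directly into R. A minimal sketch, using R's built-in mtcars dataset as a stand-in for any numeric table:

X <- scale(mtcars)                 # 1. normalize the original variables
eig <- eigen(t(X) %*% X)           # 2. eigenvalues and eigenvectors of X^T X (in descending order)
Y <- X %*% eig$vectors             # 3. principal components y_j = X u_j
eig$values/(nrow(X) - 1)           #    their variances lambda_j/(m-1)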

PCA Example

To illustrate the process of calculating the principal components we are going to take a simple case extracted from the owid_country table, where we will use 3 variables ($x_1$ = human_development_index, $x_2$ = aged_65_older and $x_3$ = aged_70_older) and the observations from 5 countries (China, Canada, Nigeria, Brazil and Germany). To calculate the PCA, the table can only have numerical variables. To preserve the names of the countries, the column_to_rownames function is used, which assigns the countries as the names of the rows; in this way they are managed as an element external to the table and do not interfere with the calculation of the PCA.

data <- owid_country %>%
   filter( # we filter the 5 countries that will be used
     location == "China" |
     location == "Canada" |
     location == "Nigeria" |
     location == "Brazil" |
     location == "Germany"
   ) %>%
   column_to_rownames(var="location") %>%
   select(human_development_index,aged_65_older,aged_70_older)
data
##         human_development_index aged_65_older aged_70_older
## Brazil                    0.765         8.552         5.060
## Canada                    0.929        16.984        10.797
## China                     0.761        10.641         5.929
## Germany                   0.947        21.453        15.957
## Nigeria                   0.539         2.751         1.447

Next we print the normalized variables, obtained with the scale function, and their correlation matrix, which shows a high correlation between the variables.

data %>%
   scale()
##         human_development_index aged_65_older aged_70_older
## Brazil               -0.1409164    -0.4824390    -0.4932478
## Canada                0.8552170     0.6718445     0.5253853
## China                -0.1652124    -0.1964691    -0.3389525
## Germany               0.9645488     1.2836202     1.4415691
## Nigeria              -1.5136370    -1.2765565    -1.1347540
## attr(,"scaled:center")
## human_development_index           aged_65_older           aged_70_older 
##                  0.7882                 12.0762                  7.8380 
## attr(,"scaled:scale")
## human_development_index           aged_65_older           aged_70_older 
##               0.1646366               7.3049644               5.6320575
data %>%
   scale() %>%
   cor(use='complete.obs') %>%
   hchart()

Therefore, the matrix $X$ is defined by:

$$X=\begin{array}{r|rrr} & x_1 & x_2 & x_3\\ \hline \text{Brazil} & -0.14 & -0.48 & -0.49\\ \text{Canada} & 0.86 & 0.67 & 0.53\\ \text{China} & -0.17 & -0.20 & -0.34\\ \text{Germany} & 0.96 & 1.28 & 1.44\\ \text{Nigeria} & -1.51 & -1.28 & -1.13 \end{array}$$

The new variables $y_j$ that we are going to construct as linear combinations of $x_1$, $x_2$ and $x_3$ have the form:

$$y_j=u_{1j}\begin{pmatrix}-0.14\\0.86\\-0.17\\0.96\\-1.51\end{pmatrix}+u_{2j}\begin{pmatrix}-0.48\\0.67\\-0.20\\1.28\\-1.28\end{pmatrix}+u_{3j}\begin{pmatrix}-0.49\\0.53\\-0.34\\1.44\\-1.13\end{pmatrix}=Xu_j$$

Since $y_j$ always has zero mean, its variance is $\frac{y_j^Ty_j}{4}=\frac{u_j^TX^TXu_j}{4}$. If $u_j$ is a normalized eigenvector of $X^TX$ for the eigenvalue $\lambda_j$, then the variance of $y_j$ is $\frac{u_j^TX^TXu_j}{4}=\frac{\lambda_ju_j^Tu_j}{4}=\frac{\lambda_j}{4}$.

We now compute the matrix $X^TX$:

$$X^TX=\begin{pmatrix} 4.00 & 3.85 & 3.68\\ 3.85 & 4.00 & 3.96\\ 3.68 & 3.96 & 4.00 \end{pmatrix}$$

Next, by hand or using any mathematical calculation software, we compute the orthonormal matrix $U$ with the eigenvectors of $X^TX$:

$$U=\begin{pmatrix} 0.57 & 0.78 & 0.26\\ 0.58 & -0.16 & -0.79\\ 0.58 & -0.61 & 0.55 \end{pmatrix}$$

Simultaneously, the eigenvalues of $X^TX$ are computed, which are $\lambda_1=11.66$, $\lambda_2=0.33$ and $\lambda_3=0.02$ (we observe that the sum of the eigenvalues is approximately $12=n(m-1)=3\cdot 4$), and therefore the standard deviations of the $y_j$ are $\sigma_j=\sqrt{\lambda_j/4}$, that is: $\sigma_1=1.71$, $\sigma_2=0.29$, $\sigma_3=0.07$. The principal components are then:

$$\begin{aligned} y_1&=0.57x_1+0.58x_2+0.58x_3\\ y_2&=0.78x_1-0.16x_2-0.61x_3\\ y_3&=0.26x_1-0.79x_2+0.55x_3 \end{aligned}$$

and, therefore, the new table of values, $Y=XU$, is given by:

$$Y=\begin{array}{r|rrr} & y_1 & y_2 & y_3\\ \hline \text{Brazil} & -0.65 & 0.27 & 0.08\\ \text{Canada} & 1.18 & 0.24 & -0.02\\ \text{China} & -0.40 & 0.11 & -0.07\\ \text{Germany} & 2.13 & -0.33 & 0.02\\ \text{Nigeria} & -2.27 & -0.28 & 0.00 \end{array}$$
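
The whole hand computation can be checked with the eigen function (a verification sketch; note that the sign of each eigenvector, and hence of each column of $Y$, is arbitrary and may come out flipped):

X <- scale(data)                # normalized 5 x 3 matrix of the example
eig <- eigen(t(X) %*% X)
eig$values                      # approximately 11.66, 0.33 and 0.02
sqrt(eig$values/4)              # standard deviations of the principal components
round(X %*% eig$vectors, 2)     # the table Y = XU above (up to the sign of each column)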

PCA calculation with prcomp

The prcomp function automatically performs the entire process for calculating the principal components. Let’s use it for the previous example:

pca1 <- prcomp(data,scale = TRUE)

The function prcomp returns a structure (in this example we have called it pca1) that stores, in the sdev field, the standard deviations of the principal components $y_j$; in the rotation field, the matrix $U$ of eigenvectors; and in the x field, the principal components $Y=XU$. These results are printed below. It can be seen that they are the same as those we calculated for this example in the previous section.

pca1$sdev # standard deviations of principal components
## [1] 1.70716366 0.28734312 0.05501071
pca1$rotation # matrix U of eigenvectors of X^TX
##                               PC1        PC2        PC3
## human_development_index 0.5708408  0.7790411  0.2592986
## aged_65_older           0.5845568 -0.1638432 -0.7946375
## aged_70_older           0.5765709 -0.6051863  0.5489222
pca1$x # matrix Y=XU that determines the new variables (main components)
##                PC1        PC2           PC3
## Brazil  -0.6468462  0.2677714  0.0760700585
## Canada   1.1838460  0.2382161 -0.0237206224
## China   -0.4045875  0.1086123 -0.0727761422
## Germany  2.1321196 -0.3313071  0.0214026069
## Nigeria -2.2645318 -0.2832927 -0.0009759008
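
As a consistency check (not part of the original workflow), the x field is exactly the normalized data multiplied by the rotation matrix:

all.equal(pca1$x, scale(data) %*% pca1$rotation, check.attributes = FALSE) # TRUE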

PCA display

The main objective of PCA is to reduce dimensionality, keeping the first principal components, which accumulate the greatest variance, and discarding the rest. The criterion for deciding how many principal components we keep is based precisely on the analysis of their variance. Normally, the variance of the principal components $y_j$ is analyzed in relative terms, that is, as the percentage that the variance of $y_j$ represents with respect to the global variance given by the sum of all the variances ($\sum_j Var(y_j)$). To visualize this, we use an interactive bar chart, created with plotly, with the percentage of variance explained by each principal component.

p <- tibble(
   label=paste("PC",1:length(pca1$sdev)), # create labels for the horizontal axis
   varPercent = pca1$sdev^2/sum(pca1$sdev^2) * 100 # calculation percentage of explained variance
) %>%
   ggplot(aes(x=label,y=varPercent)) + # create bar chart (made interactive below with ggplotly)
     geom_bar(stat = "identity") +
     labs(x= "Main Components",
          y= "Percentage of explained variance")
ggplotly(p)

Figure 8.2: Percentage of variance explained by each principal component using ggplotly

To decide how many principal components to keep, we can, for example, keep the principal components that accumulate more than 90% of the explained variance. Another criterion would be, for example, to eliminate the principal components whose explained variance is less than 2%. For this example, with the first criterion we would keep only the first principal component, and with the second criterion the first two.
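
Both criteria can be applied numerically from the explained-variance percentages (a short sketch for pca1):

varPercent <- pca1$sdev^2/sum(pca1$sdev^2) * 100
cumsum(varPercent)      # criterion 1: keep components until the accumulated percentage exceeds 90%
which(varPercent >= 2)  # criterion 2: keep the components explaining at least 2% of the variance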

A common graph to visualize the results of PCA is a scatter plot showing the values of the first two principal components for each observation, that is, the points $(y_{i1},y_{i2})$ for $i=1,\ldots,m$; in this way we can see how the values of the first two components are distributed. Furthermore, to visualize the contribution of each variable $x_j$ to the first two components, for each $j$ a segment is drawn from $(0,0)$ to $(u_{j1},u_{j2})$, where $u_{j1}$ is the coefficient of $x_j$ in the linear combination of the first component $y_1$ and $u_{j2}$ is the coefficient of $x_j$ in the linear combination of the second component $y_2$. If this segment is aligned with the x-axis, it means that $x_j$ contributes little to the second component, and if the segment is also of large magnitude, it means that $x_j$ contributes more to the first component than the rest of the variables. To make this graph we will use the hchart function:

hchart(pca1)

Figure 8.3: Illustrative graph of the first two principal components using hchart
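
The same kind of graph can also be obtained with base R's biplot function, as a static (non-interactive) alternative to hchart:

biplot(pca1)   # scatter plot of the first two components with the variable segments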

Next we are going to reproduce the experiment, but taking all the indicators and all the countries. First we prepare the data and calculate the PCA. Since the PCA calculation only allows numerical variables, we have to eliminate the non-numeric variables from the table. To preserve the names of the countries, instead of managing location as a variable, we manage it as the name of each record; this way it does not interfere in the calculation of the PCA.

data <- owid_country %>%
   na.omit() %>% # remove records with some NA
   column_to_rownames(var="location") %>% # assign the location field as the name of the rows
   select(-continent,-iso_code) # remove non-numeric variables
pca2 <- prcomp(data,scale = TRUE) # we calculate the PCA

Now we make the bar graph with the percentage of variance explained by each principal component. By default, the bar chart automatically sorts the x-axis labels alphabetically; to keep the initial order of the table unchanged we use the fct_inorder function.

p <- tibble(
   label=fct_inorder(paste("PC",1:length(pca2$sdev))),
   varPercent = pca2$sdev^2/sum(pca2$sdev^2) * 100
) %>%
   ggplot(aes(x=label,y=varPercent)) +
     geom_bar(stat = "identity") +
     labs(x= "Main Components",
           y= "Percentage of explained variance"
     )
ggplotly(p)

Figure 8.4: Percentage of variance explained by each principal component using ggplotly

Finally, we present the graph that illustrates the result for the first two components.

hchart(pca2)

Figure 8.5: Illustrative graph of the first two principal components using hchart
