8 Attribute analysis

Installation/loading libraries/data used

if (!require(highcharter)) install.packages('highcharter') 
library(highcharter) 
if (!require(openxlsx)) install.packages('openxlsx') 
library(openxlsx) 
if (!require(plotly)) install.packages('plotly') 
library(plotly)
if (!require(ggplot2)) install.packages('ggplot2') 
library(ggplot2)
if (!require(tidyverse)) install.packages('tidyverse') 
library(tidyverse)
owid_country <- read.xlsx("https://ctim.es/AEDV/data/owid_country.xlsx",sheet=1) %>%
  as_tibble()

8.1 Introduction

In general, attribute analysis consists of studying, comparing and manipulating the variables of our data analysis, transforming or combining them to create new variables (or eliminate some) in order to obtain a collection of variables that represents, better and more compactly, the information contained globally in the initial collection. In what follows, to develop the concepts of this topic, we will consider that our variables are a collection of attributes, such as population, area, GDP, etc., associated with the countries of the world, as they appear in the owid_country table. We will denote each attribute (or variable) by the vector \(x_j=(x_{1j},x_{2j},\dots,x_{mj})\), where \(m\) represents the number of countries. That is, if the first attribute (\(j=1\)) is the population, then \(x_{i1}\) represents the population of country \(i\).

The first phase of our attribute analysis is to compare each pair of attributes \(x_j\) and \(x_{j'}\) with each other. The easiest way to compare variables is to calculate their correlation, \(r\), which is given by the formula:

\[ r=\frac{\sum _{i=1}^m(x_{ij}-\bar x_j)(x_{i{j'}}-\bar x_{j'})} {\sqrt{\sum _{i=1}^m(x_{ij}-\bar x_j)^2\sum _{i=1}^m(x_{i{j'}}-\bar x_{j' })^2}} \]

where \(\bar x_{j}\) and \(\bar x_{j'}\) represent the means of the respective variables. The correlation value \(r\) lies in the range between -1 and 1 and can be interpreted as follows:

  • If \(r=1\), the variable \(x_{j'}\) can be obtained exactly and linearly from \(x_{j}\): when the value of one variable increases the other also increases, and if we draw the scatter plot of both variables, all the points fall exactly on the regression line.
  • If \(r=0\), there is no linear relationship between the two variables and the scatter plot will normally show a dispersed cloud of points.
  • If \(r=-1\), the case is identical to \(r=1\), with the only difference that, in this case, when one variable increases, the other decreases.

In general, the closer the correlation is to 1 or -1, the stronger the linear relationship between the variables and the better aligned their values will be on the regression line. An important property of the correlation is that it is independent of the unit of measurement of the variables (as long as the transformation between the measurements is linear): it does not matter whether we measure a population in millions or in thousands, or a temperature in degrees Celsius or Fahrenheit, the correlation of these variables with other variables gives the same value regardless of the unit of measurement used.
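
As a small illustration with synthetic data (a sketch, not taken from the owid_country table), we can check that a linear change of units, such as converting degrees Celsius to Fahrenheit, leaves the correlation unchanged:

set.seed(1)
celsius <- runif(50, 0, 35)            # temperatures in degrees Celsius
y <- 2 * celsius + rnorm(50, sd = 5)   # a second, related variable
fahrenheit <- celsius * 9 / 5 + 32     # linear change of unit
cor(celsius, y)                        # same correlation value...
cor(fahrenheit, y)                     # ...as this one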

It may be that the correlation between the variables improves if we first apply a non-linear transformation (such as a logarithm) to one of them (or both) before calculating the correlation. This is what we do in the example dashboard Dashboard3.Rmd from the previous topic, where we transform the attributes, using the logarithm or the Box-Cox function, to study the relationship between pairs of attributes. For example, we observe that if we apply a logarithm to GDP per capita, its correlation with life expectancy improves greatly. This tells us that the relationship between the two variables is non-linear.
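
We can check this directly on the table (a sketch, assuming the columns are called gdp_per_capit and life_expectancy; adjust the names to those actually present in owid_country, and note that the logarithm requires strictly positive values):

owid_country %>%
   summarise(
     r_linear = cor(gdp_per_capit, life_expectancy, use = "complete.obs"),
     r_log    = cor(log(gdp_per_capit), life_expectancy, use = "complete.obs")
   )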

We will say that a variable \(x_j\) does not contribute new information to our data analysis if it can be obtained from the other variables. That is:

\[ x_{ij}=F(\{x_{ik}\}_{k\neq j}) \quad \forall i \] for a certain function \(F\). In that case, we can remove \(x_j\) from our data analysis without losing information. In particular, if the correlation between two variables is close to 1 or -1, we can remove one of them, because one can be obtained from the other. For example, the owid_country table stores as indicators the percentage of people over 65 years old and the percentage of people over 70 years old. These two indicators are closely related to each other. If we use the dashboard Dashboard3.Rmd, we observe that the correlation coefficient is \(0.99\), and therefore one can be accurately calculated as a function of the other from the regression line, which in this case is given by

\[ \text{aged_70_older} = 0.68\cdot \text{aged_65_older} -0.39 \] Therefore, we could eliminate the indicator of the percentage of people over 70 years of age from our data analysis without losing relevant information. Other examples of variables that we could discard are those that have a constant value (in this case the function \(F\) would be a constant). These types of variables do not contribute anything when it comes to discriminating between countries.
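
The coefficients of this regression line can be checked with lm (a sketch; the exact values may vary slightly depending on which rows with missing values are dropped):

lm(aged_70_older ~ aged_65_older, data = owid_country) %>%
   coef() # intercept and slope of the regression line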

In this chapter we are going to study how to combine the variables and reduce their number without losing relevant information. This problem is called dimensionality reduction.

A limitation of the techniques that we are going to analyze is that we will only use linear combinations of the variables, that is, sums of variables multiplied by coefficients. This leaves out more complex dependencies such as those given by the logarithm function or the Box-Cox transformation used in the dashboard Dashboard3.Rmd. One way to incorporate non-linear transformations would be, before starting our linear analysis, to add to the table (or substitute in it) variables that are the result of applying non-linear transformations to the initial variables. For example, in view of the results of the dashboard Dashboard3.Rmd, we could replace the variable gdp_per_capit with log(gdp_per_capit), which has a better linear relationship with the other variables than the initial variable.
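
For example, the substitution could be done with mutate before any further analysis (a sketch, assuming the column is called gdp_per_capit as above and contains only positive values):

owid_country_log <- owid_country %>%
   mutate(gdp_per_capit = log(gdp_per_capit)) # replace the variable by its logarithm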

8.2 The correlation matrix

As we saw in the previous section, a simple way to eliminate variables is to study the correlation between pairs of them; if the correlation is very close to 1 or -1, we can eliminate one variable of the pair. To do this we use the correlation matrix, where each value of the matrix corresponds to the value of the correlation, \(r\), between two variables. For example, in the following interactive graph we visualize the correlation matrix of the owid_country table. Non-numeric variables must first be removed, since the correlation is only calculated for numerical variables, and the NA values must also be removed before doing the calculation.

owid_country %>%
   select(-location,-continent,-iso_code) %>% # remove non-numeric variables from the table
   cor(use='complete.obs') %>% # correlation matrix calculation removing NA previously
   hchart() # interactive drawing correlation matrix

Figure 8.1: Correlation matrix using hchart

From the exploration of the correlation values, we observe that aged_65_older and aged_70_older are closely related, that population and cardiovasc_death_rat have very little relationship with the rest of the variables, and that extreme_poverty, in general, has a negative correlation with the rest of the variables, that is, when this variable increases, the others decrease.
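
The highly correlated pairs can also be listed programmatically (a sketch): we keep each pair only once and look for correlations above 0.9 in absolute value, the candidates for removing one variable of the pair.

r <- owid_country %>%
   select(-location,-continent,-iso_code) %>%
   cor(use = 'complete.obs')
r[lower.tri(r, diag = TRUE)] <- NA   # keep each pair of variables only once
which(abs(r) > 0.9, arr.ind = TRUE)  # indices of the highly correlated pairs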

8.3 Principal Component Analysis (PCA)

Principal component analysis (PCA) transforms the variables \(x_j\) into new ones, \(y_j\), by making linear combinations of the initial ones. The objective of PCA is to find the linear combinations of the \(x_j\) that maximize the variance of the \(y_j\) in descending order, that is, \(y_1\) is the linear combination with the greatest possible variance and \(y_n\) is the one with the lowest possible variance. When the variance of a \(y_j\) is very small, we consider that it does not provide relevant information and we discard it; in this way we reduce the dimensionality, keeping only the variables \(y_j\) with significant variance, which are what we call the principal components.

Theoretical foundation

Necessary algebraic concepts

Definition: The dot product of two vectors \(u\) and \(v\) of size n is defined as \((u,v)=u_1v_1+\cdots + u_nv_n\). We will always consider the vectors vertically, that is, we can write \((u,v)=u^Tv\), where \(u^T\) is the transposed vector.

Definition: A square matrix \(U\) is orthonormal (or unitary) if \(U^TU=Id\) is satisfied, that is, the transposed by itself is the identity matrix. Therefore, for an orthonormal matrix the inverse is the transpose.

Theorem: The column vectors of an orthonormal matrix \(U=(u_1,\cdots,u_n)\) (with \(u_k=(u_{1k},\dots,u_{nk})\)) satisfy \((u_k,u_k)=1\) and, if \(k\neq k'\), \((u_k,u_{k'})=0\), that is, the vectors are perpendicular.

Definition: the real number \(\lambda_k\) is an eigenvalue of the square matrix \(A\), if there exists a vector \(u_k=(u_{1k},\dots,u_{nk})\), not null, called eigenvector, such that \(Au_k=\lambda_k u_k\). If \(u_k\) is an eigenvector, then when multiplied by any non-zero number, it is still an eigenvector. Therefore we can always adjust the eigenvector so that it meets \((u_k,u_k)=1\).

Theorem: If the square matrix \(A\) of dimension n is symmetric, then it has n eigenvalues that we can order in descending order, \(\lambda_1 \geq \lambda_{2} \geq \cdots \geq \lambda_n\), and it also has an orthonormal matrix of eigenvectors \(U=(u_1,u_2,\dots,u_n)\) formed by the eigenvectors \(u_k\) in columns.

Theorem: If \(X\) is any matrix of size \(m\times n\), then \(A=X^TX\) is a square matrix of size \(n\times n\), symmetric, and all its eigenvalues are greater than or equal to zero.
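
These properties are easy to check numerically (a small sketch with random data): for \(A=X^TX\), the eigenvalues returned by eigen() are non-negative and the eigenvectors form an orthonormal matrix.

set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4) # random 5x4 matrix
A <- t(X) %*% X                            # square, symmetric 4x4 matrix
e <- eigen(A)                              # eigenvalues come sorted in decreasing order
e$values                                   # all greater than or equal to zero
round(t(e$vectors) %*% e$vectors, 10)      # identity matrix: U is orthonormal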

Definition: Given two variables \(x_j\) and \(x_{j'}\), the covariance \(Cov(x_{j},x_{j'})\) between them is defined as:

\[ Cov(x_{j},x_{j'})=\frac{\sum _{i=1}^m(x_{ij}-\bar x_j)(x_{i{j'}}-\bar x_{ j'})} {m-1} \] where \(m\) is the number of observations.

Definition: If \(X=(x_1,x_2,...,x_n)\) is a matrix representing \(n\) variables \(x_1,x_2,...,x_n\), the covariance matrix of \(X\) is given by the matrix \(\Sigma\) of dimension \(n\times n\) whose elements are \(\Sigma_{jj'}=Cov(x_{j},x_{j'})\)

Definition: Normalizing a variable, \(x_j\), consists of subtracting its mean and dividing it by its standard deviation. The normalized variables have a mean of zero and a variance and standard deviation equal to 1.
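
In R, this is what the scale function does (a minimal sketch):

x <- c(2, 5, 7, 10)
(x - mean(x)) / sd(x)  # subtract the mean and divide by the standard deviation
as.numeric(scale(x))   # scale() gives the same result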

Theorem: If \(X=(x_1,x_2,...,x_n)\) is a matrix representing \(n\) normalized variables \(x_1,x_2,...,x_n\), then the covariance matrix \(\Sigma\) satisfies:

\[ \Sigma=\frac{X^TX}{m-1} \] moreover

\[ \lambda_1+\lambda_2+\cdots+\lambda_n=n \] where \(\lambda_j\) are the eigenvalues of the covariance matrix \(\Sigma\). Furthermore, since the covariance matrix \(\Sigma\) is proportional to \(X^TX\), its eigenvectors are the same and the eigenvalues of \(X^TX\) are those of \(\Sigma\) multiplied by \(m-1\).
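
A numerical check of this theorem (a sketch with random data): after normalizing, cov(X) coincides with \(X^TX/(m-1)\) and its eigenvalues add up to \(n\).

set.seed(1)
m <- 100; n <- 4
X <- scale(matrix(rnorm(m * n), nrow = m))  # n normalized variables
all.equal(cov(X), t(X) %*% X / (m - 1), check.attributes = FALSE) # TRUE
sum(eigen(cov(X))$values)                   # equal to n = 4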

Mathematical foundation of PCA

Let us consider n variables \(x_j\) (the column vectors of our table), and the observations in the form of a matrix \(X=(x_{ij})\) of size \(m\times n\). Let’s replace \(X\) with a new collection of variables \(Y=XU\), where \(U\) is an orthonormal matrix of size n. That is, the new variables \(y_j=Xu_j\) are linear combinations of the previous ones obtained from an orthonormal matrix, which can be expressed as:

\[ y_1=u_{11}x_1+u_{21}x_2+\cdots+u_{n1}x_n\\ y_2=u_{12}x_1+u_{22}x_2+\cdots+u_{n2}x_n\\ \cdots \\ y_n=u_{1n}x_1+u_{2n}x_2+\cdots+u_{nn}x_n\\ \]

Principal component analysis is a technique for choosing the orthonormal matrix \(U\) in such a way that the variables \(y_j=Xu_j\) (the principal components) are ordered in descending order of their variance: the first principal component \(y_1=Xu_1\) corresponds to the linear combination of the original variables that has the greatest possible variance, \(y_2=Xu_2\) is the combination orthogonal to the previous one (\((u_1,u_2)=0\)) with the greatest possible variance, \(y_3=Xu_3\) is the combination orthogonal to the previous ones (\((u_1,u_3)=0\) and \((u_2,u_3)=0\)) with the greatest possible variance, and so on. The greater the variance of \(y_j\), the more information the variable contains in terms of, for example, distinguishing between different observations. Note that a variable with a constant value (that is, zero variance) does not provide any information to discern between some observations and others.

Given that the variables \(x_j\) can have very different magnitudes and variances, to balance the contribution of each variable they are first normalized, that is, their mean is subtracted and they are divided by their standard deviation. Therefore, we will consider that the variables \(x_j\) have a mean of zero and a standard deviation of 1. If the mean of each variable \(x_j\) is zero, then \(y_j=Xu_j\), which is a linear combination of the \(x_j\), will also have mean zero, and therefore its variance will be

\[\text{Var}(y_j)=\frac{\sum _{i=1}^m y_{ij}^2}{m-1}=\frac{y_j^Ty_j}{m-1}=\frac{u_j^TX^TXu_j}{m-1}\] The key to maximizing the variance is the eigenvectors of the matrix \(A=X^TX\). If \(\lambda_j\) are the eigenvalues of \(A\) ordered in descending order, that is, \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n\), and \(u_j\) are their corresponding eigenvectors, then

\[\text{Var}(y_j)=\frac{u_j^TX^TXu_j}{m-1}=\frac{u_j^T\lambda_j u_j}{m-1}=\frac{\lambda_ju_j^Tu_j}{m-1}=\frac{\lambda_j}{m-1}\] Therefore, the procedure to calculate the principal components of \(X\) can be divided into the following steps (a code sketch follows the list):

  • Normalize the original variables \(x_j\) so that they have mean zero and variance 1.

  • Calculate the eigenvalues \(\lambda_j\) (sorted in descending order) and the eigenvectors \(u_j\) of the matrix \(X^TX\).

  • The principal components are \(y_j=Xu_j\) and their variances are \(\frac{\lambda_j}{m-1}\).
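
These three steps can be written directly in base R — a minimal sketch (the helper name pca_by_hand is ours, not from any package); the result matches prcomp() up to the arbitrary sign of each column:

pca_by_hand <- function(data) {
   X <- scale(as.matrix(data))          # step 1: normalize to mean 0, variance 1
   e <- eigen(t(X) %*% X)               # step 2: eigenvalues and eigenvectors of X^TX
   list(
     sdev     = sqrt(e$values / (nrow(X) - 1)), # standard deviations of the y_j
     rotation = e$vectors,              # matrix U of eigenvectors
     x        = X %*% e$vectors         # step 3: principal components Y = XU
   )
}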

PCA Example

To illustrate the process of calculating the principal components we are going to take a simple case extracted from the owid_country table, where we will use 3 variables (\(x_1\)=human_development_index, \(x_2\)=aged_65_older and \(x_3\)=aged_70_older) and the observations of 5 countries (China, Canada, Nigeria, Brazil and Germany). To calculate the PCA the table can only contain numerical variables. To preserve the names of the countries, the column_to_rownames function is used, which assigns the countries as the names of the rows; in this way they are managed as an element external to the table and do not interfere with the calculation of the PCA.

data <- owid_country %>%
   filter( # we filter the 5 countries that will be used
     location == "China" |
     location == "Canada" |
     location == "Nigeria" |
     location == "Brazil" |
     location == "Germany"
   ) %>%
   column_to_rownames(var="location") %>%
   select(human_development_index,aged_65_older,aged_70_older)
data
##         human_development_index aged_65_older aged_70_older
## Brazil                    0.765         8.552         5.060
## Canada                    0.929        16.984        10.797
## China                     0.761        10.641         5.929
## Germany                   0.947        21.453        15.957
## Nigeria                   0.539         2.751         1.447

Next we print the normalized variables, using the scale function, and their correlation matrix, which shows a high correlation between the variables.

data %>%
   scale()
##         human_development_index aged_65_older aged_70_older
## Brazil               -0.1409164    -0.4824390    -0.4932478
## Canada                0.8552170     0.6718445     0.5253853
## China                -0.1652124    -0.1964691    -0.3389525
## Germany               0.9645488     1.2836202     1.4415691
## Nigeria              -1.5136370    -1.2765565    -1.1347540
## attr(,"scaled:center")
## human_development_index           aged_65_older           aged_70_older 
##                  0.7882                 12.0762                  7.8380 
## attr(,"scaled:scale")
## human_development_index           aged_65_older           aged_70_older 
##               0.1646366               7.3049644               5.6320575
data %>%
   scale() %>%
   cor(use='complete.obs') %>%
   hchart()

Therefore the matrix \(X\) is defined by:

\[ \begin{array} [c]{cccc} & x_{1} & x_{2} & x_{3}\\ \text{Brazil} & -0.14 & -0.48 & -0.49\\ \text{Canada} & 0.86 & 0.67 & 0.53\\ \text{China} & -0.17 & -0.20 & -0.34\\ \text{Germany} & 0.96 & 1.28 & 1.44\\ \text{Nigeria} & -1.51 & -1.28 & -1.13 \end{array} \] The new variables \(y_j\) that we are going to construct as a linear combination of \(x_1,x_2\) and \(x_3\) have the form:

\[y_{j}=u_{1j}\left( \begin{array}{c} -0.14 \\ 0.86 \\ -0.17 \\ 0.96 \\ -1.51 \end{array} \right) +u_{2j}\left( \begin{array}{c} -0.48 \\ 0.67 \\ -0.20 \\ 1.28 \\ -1.28 \end{array} \right) +u_{3j}\left( \begin{array}{c} -0.49 \\ 0.53 \\ -0.34 \\ 1.44 \\ -1.13 \end{array} \right) =Xu_{j}\]

Since \(y_j\) always has zero mean, its variance is \(\frac{y_j^Ty_j}{4}=\frac{u_j^TX^TXu_j}{4}\). If \(u_j\) is a normalized eigenvector of \(X^TX\) for the eigenvalue \(\lambda_j\), then the variance of \(y_j\) is \(\frac{u_j^TX^TXu_j}{4}=\frac{\lambda_ju_j^Tu_j}{4}=\frac{\lambda_j}{4}\).

We now compute the matrix \(X^{T}X\): \[ X^{T}X=\left( \begin{array}{ccc} 4.00 & 3.85 & 3.68\\ 3.85 & 4.00 & 3.96\\ 3.68 & 3.96 & 4.00 \end{array} \right) \] Next, by hand, or using any mathematical calculation software, we compute the orthonormal matrix \(U\) with the eigenvectors of \(X^TX\):

\[U=\left( \begin{array}{ccc} 0.57 & 0.78 & 0.26\\ 0.58 & -0.16 & -0.79\\ 0.58 & -0.61 & 0.55 \end{array} \right) \]

Simultaneously, the eigenvalues of \(X^{T}X\) are computed, which are: \(\lambda _{1}=11.66\), \(\lambda _{2}=0.33\), and \(\lambda _{3}=0.02\) (we observe that the sum of the eigenvalues is approximately \(12=n\cdot(m-1)=3\cdot4\)). Therefore the standard deviations of the \(y_j\) are \(\sigma_j=\sqrt{\lambda_j/4}\), that is: \(\sigma_1=1.71\), \(\sigma_2=0.29\), \(\sigma_3=0.07\). The principal components are then:

\[ \begin{array}[c]{c} y_{1}=0.57 \cdot x_{1}+0.58 \cdot x_{2}+0.58 \cdot x_{3}\\ y_{2}=0.78 \cdot x_{1}-0.16 \cdot x_{2}-0.61 \cdot x_{3}\\ y_{3}=0.26 \cdot x_{1}-0.79 \cdot x_{2}+0.55 \cdot x_{3} \end{array} \]

and, therefore, the new table of values, \(Y=XU\), is given by:

\[ \begin{array}[c]{cccc} & y_{1} & y_{2} & y_{3}\\ \text{Brazil} & -0.65 & 0.27 & 0.08\\ \text{Canada} & 1.18 & 0.24 & -0.02\\ \text{China} & -0.40 & 0.11 & -0.07\\ \text{Germany} & 2.13 & -0.33 & 0.02\\ \text{Nigeria} & -2.27 & -0.28 & -0.00 \end{array} \]
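
Before moving on to prcomp, we can reproduce this calculation with the base R function eigen() (a sketch; the signs of the eigenvectors, and hence of each \(y_j\), may be flipped with respect to the tables above):

X <- scale(data)       # normalized matrix X
e <- eigen(t(X) %*% X)
e$values               # eigenvalues: approximately 11.66, 0.33, 0.02
e$vectors              # matrix U (up to column signs)
X %*% e$vectors        # table Y = XU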

PCA calculation with prcomp

The prcomp function automatically performs the entire process for calculating the principal components. Let’s use it for the previous example:

pca1 <- prcomp(data,scale = TRUE)

The prcomp function returns a structure (in this example we have called it pca1) that stores, in the sdev field, the standard deviations of the principal components \(y_j\); in the rotation field, the matrix of eigenvectors \(U\); and in the x field, the principal components \(Y=XU\). These results are printed below. It can be seen that they are the same as what we calculated for this example in the previous section.

pca1$sdev # standard deviations of principal components
## [1] 1.70716366 0.28734312 0.05501071
pca1$rotation # matrix U of eigenvectors of X^TX
##                               PC1        PC2        PC3
## human_development_index 0.5708408  0.7790411  0.2592986
## aged_65_older           0.5845568 -0.1638432 -0.7946375
## aged_70_older           0.5765709 -0.6051863  0.5489222
pca1$x # matrix Y=XU that determines the new variables (main components)
##                PC1        PC2           PC3
## Brazil  -0.6468462  0.2677714  0.0760700585
## Canada   1.1838460  0.2382161 -0.0237206224
## China   -0.4045875  0.1086123 -0.0727761422
## Germany  2.1321196 -0.3313071  0.0214026069
## Nigeria -2.2645318 -0.2832927 -0.0009759008
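
As a quick consistency check (a sketch), we can verify that the components stored in pca1$x are the normalized data multiplied by the rotation matrix, that is, \(Y=XU\):

all.equal(pca1$x,
          scale(data) %*% pca1$rotation,
          check.attributes = FALSE) # expected to be TRUE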

PCA visualization

The main objective of PCA is to reduce dimensionality, keeping the first principal components, which accumulate the greatest variance, and discarding the rest. The criterion for deciding how many principal components we keep is based precisely on the analysis of their variance. Normally, the variance of the principal components \(y_j\) is analyzed in relative terms, that is, as the percentage that the variance of \(y_j\) represents with respect to the global variance given by the sum of all the variances (\(\sum_jVar(y_j)\)). To visualize this, we use an interactive bar chart, created with plotly, with the percentage of variance explained by each principal component.

p <- tibble(
   label=paste("PC",1:length(pca1$sdev)), # create labels for the horizontal axis
   varPercent = pca1$sdev^2/sum(pca1$sdev^2) * 100 # calculation percentage of explained variance
) %>%
   ggplot(aes(x=label,y=varPercent)) + # create interactive bar chart
     geom_bar(stat = "identity") +
     labs(x= "Main Components",
          y= "Percentage of explained variance")
ggplotly(p)

Figure 8.2: Percentage of variance explained by each principal component using ggplotly

To decide how many principal components we keep, we can, for example, keep the principal components that accumulate more than 90% of the explained variance. Another criterion would be, for example, to eliminate the principal components whose explained variance is less than 2%. For this example, with the first criterion we would only be left with the first principal component, and with the second criterion with the first two.
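
The first criterion can be applied directly in code (a minimal sketch using the pca1 object from the small example; varPercent and n_keep are just illustrative names):

varPercent <- pca1$sdev^2 / sum(pca1$sdev^2) * 100 # percentage of explained variance
cumsum(varPercent)                                 # cumulative percentage
n_keep <- which(cumsum(varPercent) >= 90)[1]       # first component reaching 90%
n_keep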

A common way to visualize the results of PCA is to make a scatter plot with the values of the first two principal components for each observation, that is, the points \((y_{i1}, y_{i2})\) for \(i=1,\cdots,m\); in this way we can see how the values of the first two components are distributed. Furthermore, to visualize the contribution of each variable \(x_j\) to the first two components, for each \(j\) a segment is drawn that goes from \((0,0)\) to \((u_{j1},u_{j2})\), for \(j=1,\cdots,n\): \(u_{j1}\) is the coefficient of \(x_j\) in the linear combination of the first component \(y_1\) and \(u_{j2}\) is the coefficient of \(x_j\) in the linear combination of the second component \(y_2\). If this segment is aligned with the x-axis, it means that \(x_j\) contributes little to the second component, and if, in addition, the segment has a large magnitude, it means that \(x_j\) contributes more to the first component than the rest of the variables. To make this graph we will use the hchart function:

hchart(pca1)

Figure 8.3: Illustrative graph of the first two principal components using hchart

Next we are going to reproduce the experiment, but taking all the indicators and all the countries. First we prepare the data and calculate the PCA. Since the PCA calculation only allows numerical variables, we have to eliminate non-numeric variables from the table. To preserve the names of the countries, instead of managing location as a variable, we manage it as the name of each record; this way it does not interfere in the calculation of the PCA.

data <- owid_country %>%
   na.omit() %>% # remove records with some NA
   column_to_rownames(var="location") %>% # assign the location field as the name of the rows
   select(-continent,-iso_code) # remove non-numeric variables
pca2 <- prcomp(data,scale = TRUE) # we calculate the PCA

Now we make the bar graph with the percentage of variance explained by each principal component. By default, the bar chart automatically sorts the x-axis labels alphabetically; to keep the initial order of the table unchanged we use fct_inorder.

p <- tibble(
   label=fct_inorder(paste("PC",1:length(pca2$sdev))),
   varPercent = pca2$sdev^2/sum(pca2$sdev^2) * 100
) %>%
   ggplot(aes(x=label,y=varPercent)) +
     geom_bar(stat = "identity") +
     labs(x= "Main Components",
           y= "Percentage of explained variance"
     )
ggplotly(p)

Figure 8.4: Percentage of variance explained by each principal component using ggplotly

Finally, we present the graph that illustrates the result for the first two components.

hchart(pca2)

Figure 8.5: Illustrative graph of the first two principal components using hchart
