# Chapter 9 R Lab 8 - 27/05/2022

In this lecture we will learn how to implement unsupervised methods like the principal component analysis (PCA) and clustering.

The following packages are required: `MASS`

(included automatically in the `R`

release), `tidyverse`

, `ggcorrplot`

, `ggfortify`

, `factoextra`

.

`library(tidyverse)`

For this lab we will use the data contained in the file `development-indicators.csv`

. It refers to 39 countries and 8 numerical variables including variables from The Worldwide Governance Indicators (WGI) project
about six dimensions of governance (Voice and Accountability, Political Stability and Absence of Violence/Terrorism, Government Effectiveness, Regulatory Quality, Rule of Law, Control of Corruption; see http://info.worldbank.org/governance/wgi/). We also have the GDP per capita and the total population of each country.

We start by importing the data from the csv file:

```
<- read.csv("/Users/marco/Google Drive Personale/Formazione/3_Unibg/2021-22/ML_tutor/RLabs/Lab8/development-indicators.csv")
dev glimpse(dev)
```

```
## Rows: 39
## Columns: 9
## $ countryname <chr> "Albania", "Austria", "Belgium", "Bulg…
## $ GDPpercapita <dbl> 5353.245, 50121.554, 46345.403, 9828.1…
## $ PopulationTotal <int> 2854191, 8879920, 11502704, 6975761, 3…
## $ WGIVoiceandAccountability <dbl> 0.151804656, 1.326598525, 1.369119644,…
## $ WGIPoliticalStabilityNoViolence <dbl> 0.1185695, 0.9801227, 0.4811862, 0.540…
## $ WGIGovernmentEffectiveness <dbl> -0.061330900, 1.490911484, 1.032227039…
## $ WGIRegulatoryQuality <dbl> 0.27437982, 1.45911550, 1.29099905, 0.…
## $ WGIRuleofLaw <dbl> -0.41117939, 1.88305128, 1.36371493, 0…
## $ WGIControlofCorruption <dbl> -0.528757572, 1.545428872, 1.551960588…
```

Note that the first column is not numerical as it contains the names of the countries.

We first set a property of the `dev`

dataframe which is given by the row names. Automatically they are given by the values from 1 to 39:

`rownames(dev) `

```
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39"
```

We want instead the row names to be given by names of the countries. This will be useful later for some graphical representation:

`rownames(dev) = dev$countryname`

We start the analysis with some summary statistics as the mean value and the variance for each quantitative variable:

```
# Compute the mean vector
%>%
dev select_if(is.numeric) %>%
summarise_all(mean)
```

```
## GDPpercapita PopulationTotal WGIVoiceandAccountability
## 1 32105.22 19081565 0.8020129
## WGIPoliticalStabilityNoViolence WGIGovernmentEffectiveness
## 1 0.5512366 0.8447155
## WGIRegulatoryQuality WGIRuleofLaw WGIControlofCorruption
## 1 0.960771 0.8245179 0.7254922
```

```
# Compute the var vector
%>%
dev select_if(is.numeric) %>%
summarise_all(var)
```

```
## GDPpercapita PopulationTotal WGIVoiceandAccountability
## 1 679811901 8.864497e+14 0.5407094
## WGIPoliticalStabilityNoViolence WGIGovernmentEffectiveness
## 1 0.3637493 0.5651742
## WGIRegulatoryQuality WGIRuleofLaw WGIControlofCorruption
## 1 0.4674545 0.7624714 0.9573218
```

The funtion `select_if`

is used to select only the quantitative variables and `summarise_all`

to apply the desired function (mean or variance) to all the columns jointly. We note that the variables are caracterized by different mean and variability levels. This suggests that we will have to scale/standardize the variables.

To study the relationship between the quantitative variables (all the not the first column) we produce the plot of the correlation matrix. It is a square (8 by 8) picture whose pixels display the correlations between variables. We show only the lower part as the correlation matrix is symmetric.

```
library(ggcorrplot)
ggcorrplot(cor(dev[,-1]),
hc.order = TRUE,#sort the variables
type = "lower", #only lower part
lab = TRUE) #display the corr values
```

The WGI indexes are positively and strongly correlated between themselves and also GDP. On the contrary the population size is negatively correlated with the other variables (and the values are smaller in absolute value).

## 9.1 Principal component analysis

We use the function `prcomp`

to compute the principal components. In particular, with the available data we can estimate \(\min(39-1,8)=8\) PCs. We use the options `center=T`

and `scale=T`

in order to standardize the data and have variables with zero mean and unit variance.

```
= prcomp(dev[,-1], #only the quantitative reg.
pca center = T, #zero-mean data
scale = T) #unit variance data
```

The output is a `list`

with the following components:

`names(pca)`

`## [1] "sdev" "rotation" "center" "scale" "x"`

The object `pca$rotation`

contains the **loadings** (i.e. the \(\phi\) coefficients which define the PCs \(Z_1,Z_2,\ldots,Z_8\) as linear combination of the original variables). They are given in a 8 by 8 matrix:

`$rotation pca`

```
## PC1 PC2 PC3 PC4
## GDPpercapita 0.35065863 -0.10884405 -0.65979307 -0.54248121
## PopulationTotal -0.07780194 -0.93033624 -0.06631828 0.29795108
## WGIVoiceandAccountability 0.37551195 0.02992773 0.49650096 -0.07851315
## WGIPoliticalStabilityNoViolence 0.33790638 0.28434137 -0.43261729 0.75474206
## WGIGovernmentEffectiveness 0.38985094 -0.15155311 0.02567168 0.07074205
## WGIRegulatoryQuality 0.39188875 -0.05722542 0.32191732 0.05698426
## WGIRuleofLaw 0.39825461 -0.03860397 0.14758281 0.02037978
## WGIControlofCorruption 0.38939461 -0.11466650 -0.02309969 -0.18020107
## PC5 PC6 PC7 PC8
## GDPpercapita 0.34207063 -0.02709433 0.08651057 -0.10172637
## PopulationTotal 0.15865111 0.06550130 -0.07258085 0.02279921
## WGIVoiceandAccountability 0.61814890 -0.14788749 -0.22181790 0.39017386
## WGIPoliticalStabilityNoViolence 0.15896599 0.09131232 -0.04873938 0.11041881
## WGIGovernmentEffectiveness -0.38684127 -0.74577134 0.31258960 0.12590117
## WGIRegulatoryQuality 0.05985597 0.42675083 0.66717303 -0.32473318
## WGIRuleofLaw -0.15796955 -0.05362014 -0.58666632 -0.66760776
## WGIControlofCorruption -0.52178154 0.47286419 -0.22058956 0.50764272
```

The object `pca$x`

contains the **scores**, i.e. the projection of each country on each PC. This is a 39 by 8 matrix:

`$x pca`

```
## PC1 PC2 PC3 PC4
## Albania -2.81786457 0.83001739 0.05108225 -0.018111859
## Austria 2.20698529 0.15224557 0.02963347 -0.020206216
## Belgium 1.32303736 -0.01870776 0.32319894 -0.614801009
## Bulgaria -1.70556778 0.72499406 -0.01475751 0.433419354
## Bosnia and Herzegovina -3.79709805 0.71066570 -0.04188091 -0.714621571
## Belarus -4.03540331 0.69148706 -1.63463929 0.223958116
## Switzerland 3.47865606 0.04335362 -0.78991039 -0.279683041
## Czech Republic 0.39774118 0.47704781 0.21824356 0.655434401
## Germany 1.96501320 -2.40883714 0.32247456 0.260633664
## Denmark 3.01873631 0.06419015 -0.01475849 -0.304109766
## Spain 0.07977494 -0.99827534 0.43233977 0.057618727
## Estonia 1.25714266 0.41253728 0.83069018 0.006727737
## Finland 3.04903480 0.04146407 0.50805437 -0.177387799
## France 1.07966660 -1.87466237 0.41046152 -0.029540498
## United Kingdom 1.69001074 -1.86335313 0.40940542 0.111710472
## Greece -1.27759766 0.43505960 0.30629767 -0.119108855
## Croatia -1.09881576 0.83543615 -0.11148224 0.518963376
## Hungary -1.20430410 0.61702249 -0.33216718 0.579155511
## Ireland 2.31873368 0.21520412 -0.69667814 -0.668446291
## Iceland 2.81629074 0.65431469 -0.92349496 0.349248307
## Italy -0.69867269 -1.16888880 -0.03804094 0.300457221
## Lithuania 0.43699756 0.64811923 0.42977646 0.487354780
## Luxembourg 3.86050157 0.19742203 -1.63307475 -1.052842773
## Latvia 0.09629027 0.49911756 0.68497217 0.080995143
## Moldova -3.57033808 0.68928681 0.14264838 -0.638653442
## North Macedonia -2.69031446 0.71309171 0.19057057 -0.226298205
## Malta 0.34519750 0.90602080 -0.04581242 0.598224095
## Montenegro -2.29403011 0.69660531 0.08309784 -0.288168349
## Netherlands 2.80336422 -0.31506613 0.38119005 -0.182931454
## Norway 3.50279864 0.09866698 -0.32728648 -0.393764280
## Poland -0.65388391 -0.46440786 0.28125029 0.499791874
## Portugal 0.77250469 0.52038703 0.19207523 0.802437223
## Romania -1.85464178 0.43533503 -0.04141094 0.402764544
## Russian Federation -4.67157360 -3.91588928 -1.16400038 0.585045050
## Serbia -2.86654847 0.56393933 0.03073970 -0.284038079
## Slovak Republic -0.29452821 0.68211974 0.24272548 0.463753359
## Slovenia 0.57129359 0.60439086 0.21821522 0.268451833
## Sweden 3.01058493 -0.01844561 0.27916501 -0.021188677
## Ukraine -4.54917400 -1.11300876 0.81108691 -1.652242627
## PC5 PC6 PC7 PC8
## Albania 0.200474887 -0.002546889 0.248366468 0.1360118889
## Austria -0.181603079 -0.080446950 -0.249320635 -0.2336041513
## Belgium -0.001155303 0.230287837 -0.248072496 0.1178371228
## Bulgaria 0.124652221 -0.064634026 0.181718586 0.1194217216
## Bosnia and Herzegovina 0.047505355 0.214006487 -0.397973553 -0.2009921036
## Belarus -1.198305165 0.209574421 -0.005468077 0.2469325092
## Switzerland 0.044280084 -0.222770743 0.046531497 -0.0002634139
## Czech Republic 0.141771009 0.041314405 0.113632754 -0.2395791897
## Germany -0.095454353 0.281137805 -0.017567292 0.0437147654
## Denmark -0.369178252 -0.250733556 -0.134603990 0.1680749483
## Spain 0.236655891 -0.172030163 -0.078518147 -0.0356247808
## Estonia -0.471049794 0.333557102 0.146466897 0.0823879330
## Finland -0.555863358 -0.055109886 0.015239228 -0.0084845038
## France -0.060382629 -0.001684818 -0.004444178 -0.1472409098
## United Kingdom -0.177782949 0.284844489 -0.082551681 -0.0302763270
## Greece 0.529554376 -0.222929268 -0.056931031 0.2927051751
## Croatia 0.118227338 0.008129946 -0.054583063 0.0930524718
## Hungary -0.096877444 -0.067240837 0.025117567 -0.2351358037
## Ireland 0.434007322 0.163167794 0.253701778 -0.1196710146
## Iceland 0.092538336 -0.011201678 -0.260391849 0.0399224679
## Italy 0.908059152 0.216684838 0.169402818 0.2041751115
## Lithuania -0.081980599 -0.134027693 0.059654183 -0.0294204168
## Luxembourg 0.498573728 0.041236234 0.162795148 -0.0280359072
## Latvia -0.255799615 -0.296470226 0.239228566 -0.2683203000
## Moldova 0.036439257 0.079181053 -0.038049637 -0.0849354411
## North Macedonia -0.071722744 0.135793255 0.425895489 -0.1471140451
## Malta 0.500029978 -0.278440377 -0.074725422 -0.0877018477
## Montenegro -0.360036600 0.048380163 0.038213202 -0.1013204792
## Netherlands -0.291717513 0.039805509 0.142931645 0.0215738478
## Norway 0.027780376 -0.059959927 0.021406993 -0.0116389130
## Poland 0.052880188 0.312941013 0.162100234 0.1729828354
## Portugal 0.120546617 -0.312306220 -0.266819717 0.1949610022
## Romania 0.550878777 0.501754762 -0.419659237 -0.1403020154
## Russian Federation -0.142621762 -0.323862058 0.032412624 -0.1127249624
## Serbia -0.064010371 -0.160019043 -0.038239527 -0.0642576991
## Slovak Republic 0.264357349 0.018620517 0.182612043 0.0875085683
## Slovenia -0.195727167 -0.167744319 -0.155263188 0.0543467489
## Sweden -0.366194368 0.031509922 -0.003337368 0.0890345078
## Ukraine 0.108250827 -0.307768877 -0.080907633 0.1620005993
```

We are interested in discovering which is the variability of each PC also with respect to the total variability of the original data (which is equal to 8 given that we work with scaled data):

`summary(pca)`

```
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4691 1.0391 0.57446 0.51451 0.36396 0.20640 0.18143
## Proportion of Variance 0.7621 0.1349 0.04125 0.03309 0.01656 0.00533 0.00411
## Cumulative Proportion 0.7621 0.8970 0.93829 0.97138 0.98793 0.99326 0.99737
## PC8
## Standard deviation 0.14495
## Proportion of Variance 0.00263
## Cumulative Proportion 1.00000
```

This table provides the standard deviation of each PC, together with the corresponding ratio with the total variability (8). The last line is the cumulative proportion of explained variance. We see that alone PC1 explains 76.21% of the total variability. If we use PC1 and PC2 we explain about 90% of the variability (and we loose about 10%). This information will be crucial for deciding how many PCs we want to keep. In this respect the **screeplot** is extremely useful. We provide here below two versions of the screeplot.

- The standard version of the screeplot is given by the
`screeplot`

function. It displays the variance of each PC. It is evident that the first PC has the highest variability and it will contribute the most.

`screeplot(pca) `

- We will display the relevant information by using
`ggplot`

. This will require to prepare the data manually to be fed into`ggplot`

. The dataframe we want to create has 8 rows (one for each PC) and two columns reporting the name of the PC and the cumulative proportion of explained variance (we will use the`cumsum`

function for computing the cumulative sum of the PC standard deviation contained in`pca$sdev`

):

```
= data.frame(PC = paste0("PC",1:8),
var_explained_df var_explained=(cumsum(pca$sdev^2))/sum((pca$sdev)^2))
var_explained_df
```

```
## PC var_explained
## 1 PC1 0.7620825
## 2 PC2 0.8970356
## 3 PC3 0.9382855
## 4 PC4 0.9713752
## 5 PC5 0.9879336
## 6 PC6 0.9932589
## 7 PC7 0.9973737
## 8 PC8 1.0000000
```

```
%>%
var_explained_df ggplot() +
geom_point(aes(x=PC,y=var_explained),size=4) +
geom_line(aes(x=PC,y=var_explained, group="a")) +
ggtitle("Scree plot: PCA on scaled data")
```

Note that we use the geometry `geom_line`

even if on the x-axis of the plot we don’t have a quantitative variable but a factor with 8 levels. This requires to specify that all the points are of the same group and the aesthetics should be the same (`group="a"`

where `a`

is arbitrary).

As already seen before, the first 2 PCs explain about 90% of the variability. Including an additional PC will lead to about 94%; with 8 PCs we explain the entire variability but we are not performing any dimension reduction. We decise to keep 2 PCs and try to give an interpretation to PC1 and PC2 considering their relationship with the original variables. We start by analysis the loading

`$rotation[,1:2] pca`

```
## PC1 PC2
## GDPpercapita 0.35065863 -0.10884405
## PopulationTotal -0.07780194 -0.93033624
## WGIVoiceandAccountability 0.37551195 0.02992773
## WGIPoliticalStabilityNoViolence 0.33790638 0.28434137
## WGIGovernmentEffectiveness 0.38985094 -0.15155311
## WGIRegulatoryQuality 0.39188875 -0.05722542
## WGIRuleofLaw 0.39825461 -0.03860397
## WGIControlofCorruption 0.38939461 -0.11466650
```

The loading of PC1 are very similar for all the variables but `PopulationTotal`

. Thus, we could conclude that PC1 is a summary of all the WGI indexes and of the wealth of a country (GDP). The loading of PC2 have a particular value which is the one referring to `PopulationTotal`

(it is the largest in absolute value, while the others are smaller). Thus, we conclude that PC2 represents the size of a country in terms of the population. To have a confirmation of this we can plot the correlation between the scores of PC1 and the original variables using the `ggcorrplot`

introduced before:

```
ggcorrplot(cor(cbind(dev[,-1],pca$x[,1])),
hc.order = TRUE,
type = "lower",
lab = TRUE)
```

It results that PC1 is positevely and strongly correlated with the WGI variables and with GDP. The higher these variables, the higher PC1 (and viceversa). Let’s do the same for the scores of PC2:

```
ggcorrplot(cor(cbind(dev[,-1],pca$x[,2])),
hc.order = TRUE,
type = "lower",
lab = TRUE)
```

PC2 is basically uncorrelated or mildly correlated with all the variables except PopulationTotal. In fact the correlation in the latter case is equal to -0.97 (the higher the total population, the lower the PC2 score).

It’s now time to produce the **biplot** able to show in the same chart the loadings (referred to the original variables) and the scores (related to the countries) with respect to PC1 and PC2. We have a simple version obtained as follows:

```
biplot(pca,
xlabs=dev[,1],#country names
scale=0,#scaled data
cex=0.5) #smaller labels
```

or a nicer version obtained by using the `autoplot`

function in the `ggfortify`

package (the interpretation will be the same):

`library(ggfortify)`

`## Warning: package 'ggfortify' was built under R version 4.0.5`

```
# this function uses the rownames of dev to get the country names
autoplot(pca,
data = dev,
scale = 0, #scaled data
label = TRUE,
loadings = TRUE,
loadings.colour = 'blue',
loadings.label = TRUE,
loadings.label.size = 3)
```

Let’s start commenting the blue arrows (loadings). The arrow pointing down (along the top-down direction of PC2) refers to PopulationTotal: we know PC2 and PopulationTotal are strongly and negatively correlated. All the other arrows are close (meaning the corresponding variables are correlated) and pointing toward right (along the left-right direction of PC1): this confirms that PC1 is explaining the WGI dimensions and the wealth of a country. Let’s move to the countries now (the scores): all the countries which lie close in the plot are similar (see for example Italy and Spain) with respect to PC1 and PC2. Russian Federation is isolated in the bottom left corner: it means that it has a large population (with respect to the average of all the countries) and a low value of wealth and governance. Ukraine has a very low value of PC1 meaning that its values of GDP and WGI indexes are below the average. We will see in the next section how to cluster the countries with respect to the available information summarized by PC1 and PC2.

## 9.2 Clustering

We will now implement clustering, i.e. create groups of similar countries considering the 8 original quantitative variables. We will implement both hierarchical and K-means clustering. For both the methods we need the matrix of standardized data obtained using the `scale`

function:

`= scale(dev[,-1]) scaledevnum `

We also display, as a preliminary analysis, the distance matrix representing the similarity between the countries (using the Euclidean distance and the 8 scaled variables):

`library(factoextra)`

`## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa`

`fviz_dist(get_dist(scaledevnum))`

This is a 39 by 39 picture with color pixel representing the distance between two countries. The bluer the color the higher the dissimilarity. On the diagonal the distance is zero. This picture is symmetric with respect to the diagonal. We expect that similar countries will be in the same cluster.

### 9.2.1 Hierarchical clustering

Hierarchical clustering is implemented by the `hclust`

function which takes in input the **distance matrix** obtained by `dist()`

. It is also necessary to specify the linkage function (here below we use the `complete`

function):

```
= hclust(dist(scaledevnum),
Hclusters method = "complete")
```

We immediately plot the **dendogram** for visualizing the clustering process (remember that we start from 39 singleton cluster and we end with 1 single big cluster):

```
plot(Hclusters,
cex=0.7,#smaller text
hang=-1) #to align at the bottom all the country names
```

Note that the plotted country names are retrieved by the row names we set at the very beginning. Remember that the countries that join first are the most similar. We have to decide where to cut the tree (see the y-axis). We cut at the height equal to 6, but this is something that should be tuned.

```
= cutree(Hclusters, h=6)
HclustersCut HclustersCut
```

```
## Albania Austria Belgium
## 1 2 3
## Bulgaria Bosnia and Herzegovina Belarus
## 3 1 1
## Switzerland Czech Republic Germany
## 2 3 2
## Denmark Spain Estonia
## 2 3 3
## Finland France United Kingdom
## 2 2 2
## Greece Croatia Hungary
## 3 3 3
## Ireland Iceland Italy
## 2 2 3
## Lithuania Luxembourg Latvia
## 3 2 3
## Moldova North Macedonia Malta
## 1 1 3
## Montenegro Netherlands Norway
## 1 2 2
## Poland Portugal Romania
## 3 3 3
## Russian Federation Serbia Slovak Republic
## 1 1 3
## Slovenia Sweden Ukraine
## 3 2 1
```

`HclustersCut`

contains the specification of the corresponding cluster for each country. We can thus count how many countries we have in each cluster and arrange the countries by cluster:

`table(HclustersCut)`

```
## HclustersCut
## 1 2 3
## 9 13 17
```

`data.frame(HclustersCut) %>% arrange(HclustersCut)`

```
## HclustersCut
## Albania 1
## Bosnia and Herzegovina 1
## Belarus 1
## Moldova 1
## North Macedonia 1
## Montenegro 1
## Russian Federation 1
## Serbia 1
## Ukraine 1
## Austria 2
## Switzerland 2
## Germany 2
## Denmark 2
## Finland 2
## France 2
## United Kingdom 2
## Ireland 2
## Iceland 2
## Luxembourg 2
## Netherlands 2
## Norway 2
## Sweden 2
## Belgium 3
## Bulgaria 3
## Czech Republic 3
## Spain 3
## Estonia 3
## Greece 3
## Croatia 3
## Hungary 3
## Italy 3
## Lithuania 3
## Latvia 3
## Malta 3
## Poland 3
## Portugal 3
## Romania 3
## Slovak Republic 3
## Slovenia 3
```

The first cluster, containing 9 countries, is the one displayed on the left in the dendrogram (with Russian Federation and North Macedonia). The second cluster is composed by 13 countries including Austria and Norway (it’s on the right in the dendrogram). The last one, with 17 countries, include Italy and Spain.

### 9.2.2 K-means clustering

With K-means clustering we have to specify how many clusters we want to create, we will try with 3 (see the `centers`

option of the `kmeans`

function). As the final solution is sensitive to the starting point we run the algorithm `nstart`

times (automatically the bset solution with the lowest total within sum of squares will be returned):

```
set.seed(44)
= kmeans(scaledevnum,
cl3 centers = 3, #n. of clusters
iter.max = 1000,
nstart = 25) #run the alg. 25 times
```

We check the final clustering as follows

` cl3`

```
## K-means clustering with 3 clusters of sizes 9, 15, 15
##
## Cluster means:
## GDPpercapita PopulationTotal WGIVoiceandAccountability
## 1 -0.9746276 0.16773189 -1.47343645
## 2 -0.4348758 -0.12587149 0.03634483
## 3 1.0196523 0.02523236 0.84771704
## WGIPoliticalStabilityNoViolence WGIGovernmentEffectiveness
## 1 -1.3878111 -1.3062546
## 2 0.2149136 -0.2161675
## 3 0.6177731 0.9999202
## WGIRegulatoryQuality WGIRuleofLaw WGIControlofCorruption
## 1 -1.4305953 -1.3897438 -1.2237505
## 2 -0.1109633 -0.1795747 -0.3856619
## 3 0.9693205 1.0134210 1.1199122
##
## Clustering vector:
## Albania Austria Belgium
## 1 3 3
## Bulgaria Bosnia and Herzegovina Belarus
## 2 1 1
## Switzerland Czech Republic Germany
## 3 2 3
## Denmark Spain Estonia
## 3 2 3
## Finland France United Kingdom
## 3 3 3
## Greece Croatia Hungary
## 2 2 2
## Ireland Iceland Italy
## 3 3 2
## Lithuania Luxembourg Latvia
## 2 3 2
## Moldova North Macedonia Malta
## 1 1 2
## Montenegro Netherlands Norway
## 1 3 3
## Poland Portugal Romania
## 2 2 2
## Russian Federation Serbia Slovak Republic
## 1 1 2
## Slovenia Sweden Ukraine
## 2 3 1
##
## Within cluster sum of squares by cluster:
## [1] 35.96384 21.34451 33.57328
## (between_SS / total_SS = 70.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
```

`data.frame(cl3$cluster) %>% arrange(cl3.cluster)`

```
## cl3.cluster
## Albania 1
## Bosnia and Herzegovina 1
## Belarus 1
## Moldova 1
## North Macedonia 1
## Montenegro 1
## Russian Federation 1
## Serbia 1
## Ukraine 1
## Bulgaria 2
## Czech Republic 2
## Spain 2
## Greece 2
## Croatia 2
## Hungary 2
## Italy 2
## Lithuania 2
## Latvia 2
## Malta 2
## Poland 2
## Portugal 2
## Romania 2
## Slovak Republic 2
## Slovenia 2
## Austria 3
## Belgium 3
## Switzerland 3
## Germany 3
## Denmark 3
## Estonia 3
## Finland 3
## France 3
## United Kingdom 3
## Ireland 3
## Iceland 3
## Luxembourg 3
## Netherlands 3
## Norway 3
## Sweden 3
```

`table(cl3$cluster)`

```
##
## 1 2 3
## 9 15 15
```

We use the following code for plotting the clustering. As we are considering 8 variables it is not possible to create a plot in 8 dimensions. The function `autoplot`

automatically performs a dimension reduction with… PCA! The first two PC2 are used for the graphical representation. They are the same PCs that we have commented about in the previous section.

```
autoplot(cl3,
data = scaledevnum,
scale = 0,#scaled data
label = TRUE,
frame = TRUE,
frame.type = "norm")
```

What we obtain is very similar to the output of the hierarchical clustering with the dendrogram cut at 6.

We know that K-means algorithm minimizes the total within cluster variability (and at the same time to maximize the between cluster variability). This information is given in the last part of the output of `kmeans`

:

` cl3`

```
## K-means clustering with 3 clusters of sizes 9, 15, 15
##
## Cluster means:
## GDPpercapita PopulationTotal WGIVoiceandAccountability
## 1 -0.9746276 0.16773189 -1.47343645
## 2 -0.4348758 -0.12587149 0.03634483
## 3 1.0196523 0.02523236 0.84771704
## WGIPoliticalStabilityNoViolence WGIGovernmentEffectiveness
## 1 -1.3878111 -1.3062546
## 2 0.2149136 -0.2161675
## 3 0.6177731 0.9999202
## WGIRegulatoryQuality WGIRuleofLaw WGIControlofCorruption
## 1 -1.4305953 -1.3897438 -1.2237505
## 2 -0.1109633 -0.1795747 -0.3856619
## 3 0.9693205 1.0134210 1.1199122
##
## Clustering vector:
## Albania Austria Belgium
## 1 3 3
## Bulgaria Bosnia and Herzegovina Belarus
## 2 1 1
## Switzerland Czech Republic Germany
## 3 2 3
## Denmark Spain Estonia
## 3 2 3
## Finland France United Kingdom
## 3 3 3
## Greece Croatia Hungary
## 2 2 2
## Ireland Iceland Italy
## 3 3 2
## Lithuania Luxembourg Latvia
## 2 3 2
## Moldova North Macedonia Malta
## 1 1 2
## Montenegro Netherlands Norway
## 1 3 3
## Poland Portugal Romania
## 2 2 2
## Russian Federation Serbia Slovak Republic
## 1 1 2
## Slovenia Sweden Ukraine
## 2 3 1
##
## Within cluster sum of squares by cluster:
## [1] 35.96384 21.34451 33.57328
## (between_SS / total_SS = 70.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
```

We use the information about the total within cluster sum of squares (`wss`

) to decide an optimal number of cluster. By passing to the function `fviz_nbclust`

the data (`scaledevnum`

) and the method we want to apply (`kmeans`

) we obtain a plot:

```
fviz_nbclust(scaledevnum,
method = "wss") kmeans,
```

When we increase the number of cluster the total wss will get smaller and smaller (but obviously from an interpretability point of view we would like to have a reduced number of groups). We try to identify in the plot an elbow. We see for example that after 5 cluster we have an increase in the total wss. This is a good suggestion to use 5 clusters.

```
set.seed(44)
= kmeans(scaledevnum,
cl5 centers = 5, #n. of clusters
iter.max = 1000,
nstart = 25) #run the alg. 25 times
autoplot(cl5,
data = scaledevnum,
scale = 0,#scaled data
label = TRUE,
frame = TRUE,
frame.type = "norm")
```

```
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
```

For two clusters which are very small (the red and the blue) it is not possible to draw the ellipse.

## 9.3 Exercises Lab 8

The RMarkdown file **Lab7_home_exercises_empty.Rmd** contains the exercise for Lab. 7. Write your solutions in the RMarkdown file and produce the final html file containing your code, your comments and all the outputs.