8 Unsupervised Learning
This chapter covers the concepts from the Unsupervised Learning session. We discuss in more detail how to implement several dimensionality reduction and clustering methods. Make sure that you go over the video and slides before working through this chapter. Note that some of the material was already covered in Data Mining; however, we repeat how to implement these methods in R for didactic reasons.
8.1 Dimensionality reduction
In a lot of cases, you are confronted with a large number of variables. Visualizing their relationships in a single graph is then often impossible. Such a plot requires reducing the number of variables to a minimum while retaining as much of the variance as possible. Another problem, especially in parametric methods, is the frequent presence of multicollinearity. To solve both of these issues, you can make use of principal component analysis (PCA), one of the most well-known feature extraction methods.
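To make the idea concrete before we turn to the football data, here is a minimal sketch on a small synthetic data set (the variables x and y below are made up for illustration). PCA rotates the data onto orthogonal axes, ordered by the variance they capture, so two highly correlated variables collapse into essentially one component.
# Toy illustration: two strongly correlated variables
set.seed(42)
toy <- data.frame(x = rnorm(100))
toy$y <- toy$x + rnorm(100, sd = 0.3)  # y is almost a copy of x

# prcomp() from the stats package; scale. = TRUE standardizes first
toy_pca <- prcomp(toy, scale. = TRUE)
summary(toy_pca)  # PC1 should capture nearly all of the variance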
We will show you how to implement PCA in R, making use of the football data set also used in Session 2. To perform PCA in R, you can use the princomp function of the stats package, which is available in R by default. However, a better choice is the psych package, since it offers more options.
# load the packages
if (!require("pacman")) install.packages("pacman")
require("pacman", character.only = TRUE, quietly = TRUE)
p_load(tidyverse, psych)
# Set working directory
setwd("C:\\Users\\matbogae\\OneDrive - UGent\\PPA22\\PredictiveAnalytics\\Book_2022\\data codeBook_2022")
# Read in the data
football <- read_csv("football_players.csv")
## Rows: 10390 Columns: 32
## -- Column specification --------------------------
## Delimiter: ","
## chr (1): player_name
## dbl (31): overall_rating, potential, crossing,...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
football %>%
  glimpse()
## Rows: 10,390
## Columns: 32
## $ player_name <chr> "Aaron Appindangoye",~
## $ overall_rating <dbl> 63.60000, 66.96970, 6~
## $ potential <dbl> 67.60000, 74.48485, 7~
## $ crossing <dbl> 48.60000, 70.78788, 6~
## $ finishing <dbl> 43.60000, 49.45455, 5~
## $ heading_accuracy <dbl> 70.60000, 52.93939, 5~
## $ short_passing <dbl> 60.60000, 62.27273, 6~
## $ volleys <dbl> 43.60000, 29.15152, 5~
## $ dribbling <dbl> 50.60000, 61.09091, 6~
## $ curve <dbl> 44.60000, 61.87879, 6~
## $ free_kick_accuracy <dbl> 38.60000, 62.12121, 5~
## $ long_passing <dbl> 63.60000, 63.24242, 6~
## $ ball_control <dbl> 48.60000, 61.78788, 6~
## $ acceleration <dbl> 60.00000, 76.00000, 7~
## $ sprint_speed <dbl> 64.00000, 74.93939, 7~
## $ agility <dbl> 59.00000, 75.24242, 7~
## $ reactions <dbl> 46.60000, 67.84848, 5~
## $ balance <dbl> 65.00000, 84.72727, 8~
## $ shot_power <dbl> 54.60000, 65.90909, 6~
## $ jumping <dbl> 58.00000, 75.30303, 6~
## $ stamina <dbl> 54.00000, 72.87879, 7~
## $ strength <dbl> 76.00000, 51.75758, 7~
## $ long_shots <dbl> 34.60000, 54.12121, 5~
## $ aggression <dbl> 65.80000, 65.06061, 5~
## $ interceptions <dbl> 52.20000, 57.87879, 4~
## $ positioning <dbl> 44.60000, 51.48485, 6~
## $ vision <dbl> 53.60000, 57.45455, 6~
## $ penalties <dbl> 47.60000, 53.12121, 6~
## $ marking <dbl> 63.80000, 69.39394, 2~
## $ standing_tackle <dbl> 66.00000, 68.78788, 2~
## $ sliding_tackle <dbl> 67.80000, 71.51515, 2~
## $ gk_reflexes <dbl> 7.60000, 12.90909, 13~
# Since PCA can only be applied to numeric data, select the
# numerics
nums <- football %>%
  select_if(is.numeric)
# Before applying PCA, you should always check the means and
# the variances. If the differences in means and variances are
# large, you should scale the data
nums %>%
  summarize_all(mean)
## # A tibble: 1 x 31
## overall_rating potential crossing finishing
## <dbl> <dbl> <dbl> <dbl>
## 1 66.9 72.1 53.0 47.9
## # ... with 27 more variables:
## # heading_accuracy <dbl>, short_passing <dbl>,
## # volleys <dbl>, dribbling <dbl>, curve <dbl>,
## # free_kick_accuracy <dbl>, long_passing <dbl>,
## # ball_control <dbl>, acceleration <dbl>,
## # sprint_speed <dbl>, agility <dbl>,
## # reactions <dbl>, balance <dbl>, ...
nums %>%
  summarize_all(var)
## # A tibble: 1 x 31
## overall_rating potential crossing finishing
## <dbl> <dbl> <dbl> <dbl>
## 1 37.9 32.7 259. 327.
## # ... with 27 more variables:
## # heading_accuracy <dbl>, short_passing <dbl>,
## # volleys <dbl>, dribbling <dbl>, curve <dbl>,
## # free_kick_accuracy <dbl>, long_passing <dbl>,
## # ball_control <dbl>, acceleration <dbl>,
## # sprint_speed <dbl>, agility <dbl>,
## # reactions <dbl>, balance <dbl>, ...
# We notice a huge difference in variance for several
# variables. The differences in means are acceptable; however,
# gk_reflexes has a much lower mean, so it is best to scale
# your data
football_sc <- nums %>%
  scale()
describe(football_sc)
## [322090 obs.]
## min: -4.72187984569954 - max: 4.4700190235591 - NAs: 0 (0%) - 322059 unique values
# Another thing that you should do is check the
# correlations: some correlation should be present for PCA.
# Correlations should be above 0.30 for several
# variables. Since we notice moderate to high
# correlations between several variables, PCA is advisable
p_load(corrplot)
corr <- cor(nums, use = "complete.obs")
corrplot(corr, method = "circle")
Now that the data is prepared, we can perform a PCA. Let’s perform PCA with and without rotation. We will use the raw data, because the principal function uses the correlation matrix to construct the principal components, and hence the data does not need to be scaled first. Question: why does the data not need to be scaled?
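A quick empirical check hints at the answer (a sketch using the numeric data from above): correlations are invariant to centering and scaling, so principal(), which operates on the correlation matrix, yields the same solution either way.
# Correlations do not change under scaling, so principal() is
# unaffected by whether we scale first
all.equal(cor(nums), cor(scale(nums)))  # should return TRUE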
# Perform Principal Components Analysis (PCA)
pca <- principal(nums, nfactors = 10, rotate = "none")
# Check output
print(pca, sort = TRUE) #with sort=TRUE, the variables are automatically sorted
## Principal Components Analysis
## Call: principal(r = nums, nfactors = 10, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PC1 PC2 PC3 PC4
## ball_control 12 0.95 -0.02 -0.06 -0.06
## short_passing 6 0.91 0.18 -0.06 -0.04
## dribbling 8 0.91 -0.25 -0.14 -0.02
## positioning 25 0.87 -0.27 0.06 -0.14
## long_shots 22 0.87 -0.23 0.08 -0.21
## crossing 3 0.87 0.02 -0.19 0.08
## curve 9 0.86 -0.18 -0.05 -0.03
## vision 26 0.86 -0.08 0.10 -0.02
## shot_power 18 0.85 -0.08 0.13 -0.26
## free_kick_accuracy 10 0.82 -0.10 0.01 -0.11
## volleys 7 0.81 -0.36 0.12 -0.22
## penalties 27 0.80 -0.20 0.19 -0.25
## finishing 4 0.76 -0.48 0.12 -0.27
## long_passing 11 0.75 0.31 -0.02 0.11
## stamina 20 0.72 0.37 -0.11 0.07
## gk_reflexes 31 -0.71 -0.21 0.44 0.29
## acceleration 13 0.71 -0.26 -0.29 0.36
## sprint_speed 14 0.71 -0.20 -0.26 0.30
## agility 15 0.70 -0.36 -0.23 0.40
## balance 17 0.63 -0.16 -0.32 0.40
## heading_accuracy 5 0.58 0.44 0.12 -0.39
## marking 28 0.22 0.92 -0.19 0.06
## standing_tackle 29 0.27 0.91 -0.19 0.05
## sliding_tackle 30 0.25 0.90 -0.22 0.09
## interceptions 24 0.34 0.86 -0.05 0.08
## aggression 23 0.45 0.73 0.09 -0.07
## strength 21 0.03 0.55 0.50 -0.38
## overall_rating 1 0.50 0.12 0.74 0.31
## reactions 16 0.55 0.10 0.61 0.30
## potential 2 0.45 0.01 0.59 0.43
## jumping 19 0.14 0.29 0.26 0.29
## PC5 PC6 PC7 PC8 PC9
## ball_control -0.01 -0.09 -0.10 -0.01 -0.08
## short_passing -0.17 -0.04 -0.10 0.00 -0.18
## dribbling 0.01 -0.09 -0.03 0.04 -0.05
## positioning 0.05 0.03 0.03 -0.12 -0.01
## long_shots -0.08 0.08 0.04 0.07 0.01
## crossing -0.16 -0.02 0.10 0.15 0.05
## curve -0.18 0.05 0.05 0.20 0.12
## vision -0.25 0.13 0.03 -0.14 -0.18
## shot_power 0.07 0.01 0.01 0.05 0.00
## free_kick_accuracy -0.29 0.16 0.05 0.20 0.16
## volleys 0.03 0.04 -0.01 0.02 0.07
## penalties 0.00 0.12 -0.09 -0.11 0.13
## finishing 0.12 0.02 -0.03 -0.04 0.01
## long_passing -0.40 0.07 -0.02 0.06 -0.28
## stamina 0.21 0.00 0.34 -0.10 -0.15
## gk_reflexes -0.22 0.12 0.16 0.03 0.00
## acceleration 0.32 -0.20 0.07 0.05 0.05
## sprint_speed 0.38 -0.26 0.10 0.08 0.04
## agility 0.10 0.04 0.07 0.00 -0.03
## balance 0.04 0.27 -0.13 -0.32 0.07
## heading_accuracy 0.37 -0.09 -0.25 -0.07 0.06
## marking -0.07 -0.03 -0.07 0.03 0.08
## standing_tackle -0.09 -0.02 -0.06 0.03 0.06
## sliding_tackle -0.08 -0.02 -0.06 0.05 0.08
## interceptions -0.14 0.04 0.02 -0.02 0.07
## aggression 0.12 0.06 0.18 -0.17 0.05
## strength 0.30 -0.10 0.23 0.05 -0.10
## overall_rating -0.09 -0.10 -0.03 -0.01 0.03
## reactions -0.06 0.03 0.15 -0.11 0.20
## potential -0.05 -0.32 -0.27 0.03 -0.07
## jumping 0.61 0.52 -0.14 0.24 -0.10
## PC10 h2 u2 com
## ball_control -0.07 0.95 0.0517 1.1
## short_passing -0.06 0.94 0.0629 1.3
## dribbling -0.06 0.92 0.0791 1.3
## positioning -0.08 0.88 0.1192 1.3
## long_shots 0.09 0.89 0.1120 1.4
## crossing -0.07 0.86 0.1377 1.3
## curve -0.01 0.88 0.1244 1.4
## vision -0.05 0.89 0.1131 1.4
## shot_power 0.17 0.85 0.1453 1.4
## free_kick_accuracy 0.12 0.89 0.1071 1.7
## volleys -0.03 0.86 0.1355 1.6
## penalties 0.11 0.83 0.1657 1.7
## finishing -0.04 0.92 0.0818 2.1
## long_passing 0.01 0.92 0.0841 2.3
## stamina -0.04 0.86 0.1415 2.5
## gk_reflexes 0.08 0.93 0.0706 2.8
## acceleration 0.06 0.94 0.0599 3.0
## sprint_speed 0.06 0.93 0.0660 3.0
## agility -0.04 0.85 0.1466 2.5
## balance 0.21 0.94 0.0641 4.1
## heading_accuracy -0.13 0.93 0.0738 4.4
## marking 0.01 0.96 0.0435 1.3
## standing_tackle 0.01 0.96 0.0359 1.3
## sliding_tackle 0.00 0.95 0.0495 1.4
## interceptions -0.03 0.90 0.1023 1.4
## aggression 0.13 0.84 0.1605 2.2
## strength 0.17 0.90 0.1035 4.2
## overall_rating 0.02 0.93 0.0726 2.3
## reactions -0.29 0.94 0.0598 3.6
## potential 0.14 0.94 0.0576 4.1
## jumping -0.06 0.99 0.0068 4.2
##
## PC1 PC2 PC3 PC4 PC5
## SS loadings 14.69 5.59 2.42 1.76 1.44
## Proportion Var 0.47 0.18 0.08 0.06 0.05
## Cumulative Var 0.47 0.65 0.73 0.79 0.84
## Proportion Explained 0.52 0.20 0.09 0.06 0.05
## Cumulative Proportion 0.52 0.72 0.81 0.87 0.92
## PC6 PC7 PC8 PC9 PC10
## SS loadings 0.70 0.50 0.40 0.34 0.32
## Proportion Var 0.02 0.02 0.01 0.01 0.01
## Cumulative Var 0.86 0.87 0.89 0.90 0.91
## Proportion Explained 0.02 0.02 0.01 0.01 0.01
## Cumulative Proportion 0.94 0.96 0.98 0.99 1.00
##
## Mean item complexity = 2.2
## Test of the hypothesis that 10 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.01
## with the empirical chi square 1939.19 with prob < 4.6e-282
##
## Fit based upon off diagonal values = 1
# Check the loadings to see what the PCs are about,
# based on the highest loadings for each PC
pca$loadings
##
## Loadings:
## PC1 PC2 PC3 PC4
## overall_rating 0.497 0.122 0.741 0.310
## potential 0.448 0.594 0.425
## crossing 0.868 -0.190
## finishing 0.761 -0.485 0.123 -0.265
## heading_accuracy 0.578 0.438 0.120 -0.392
## short_passing 0.910 0.177
## volleys 0.812 -0.361 0.123 -0.223
## dribbling 0.906 -0.249 -0.145
## curve 0.864 -0.182
## free_kick_accuracy 0.822 -0.103 -0.114
## long_passing 0.750 0.308 0.108
## ball_control 0.955
## acceleration 0.712 -0.261 -0.292 0.357
## sprint_speed 0.707 -0.199 -0.256 0.302
## agility 0.702 -0.357 -0.228 0.402
## reactions 0.549 0.104 0.612 0.300
## balance 0.633 -0.164 -0.324 0.399
## shot_power 0.854 0.129 -0.258
## jumping 0.136 0.288 0.264 0.293
## stamina 0.716 0.367 -0.111
## strength 0.555 0.499 -0.380
## long_shots 0.871 -0.230 -0.206
## aggression 0.447 0.727
## interceptions 0.345 0.861
## positioning 0.872 -0.266 -0.141
## vision 0.860
## penalties 0.797 -0.200 0.185 -0.248
## marking 0.219 0.921 -0.193
## standing_tackle 0.273 0.913 -0.188
## sliding_tackle 0.252 0.901 -0.219
## gk_reflexes -0.713 -0.211 0.441 0.291
## PC5 PC6 PC7 PC8
## overall_rating
## potential -0.325 -0.272
## crossing -0.158 0.150
## finishing 0.117
## heading_accuracy 0.371 -0.247
## short_passing -0.165
## volleys
## dribbling
## curve -0.183 0.203
## free_kick_accuracy -0.286 0.164 0.204
## long_passing -0.401
## ball_control
## acceleration 0.315 -0.198
## sprint_speed 0.383 -0.263 0.104
## agility 0.100
## reactions 0.154 -0.111
## balance 0.271 -0.130 -0.322
## shot_power
## jumping 0.613 0.518 -0.145 0.238
## stamina 0.210 0.340 -0.101
## strength 0.300 -0.103 0.225
## long_shots
## aggression 0.115 0.184 -0.171
## interceptions -0.145
## positioning -0.122
## vision -0.248 0.126 -0.135
## penalties 0.115 -0.109
## marking
## standing_tackle
## sliding_tackle
## gk_reflexes -0.221 0.122 0.161
## PC9 PC10
## overall_rating
## potential 0.143
## crossing
## finishing
## heading_accuracy -0.128
## short_passing -0.175
## volleys
## dribbling
## curve 0.115
## free_kick_accuracy 0.163 0.121
## long_passing -0.276
## ball_control
## acceleration
## sprint_speed
## agility
## reactions 0.199 -0.289
## balance 0.210
## shot_power 0.169
## jumping -0.103
## stamina -0.153
## strength -0.103 0.171
## long_shots
## aggression 0.126
## interceptions
## positioning
## vision -0.179
## penalties 0.134 0.106
## marking
## standing_tackle
## sliding_tackle
## gk_reflexes
##
## PC1 PC2 PC3 PC4 PC5
## SS loadings 14.688 5.588 2.425 1.756 1.440
## Proportion Var 0.474 0.180 0.078 0.057 0.046
## Cumulative Var 0.474 0.654 0.732 0.789 0.835
## PC6 PC7 PC8 PC9 PC10
## SS loadings 0.702 0.505 0.401 0.340 0.322
## Proportion Var 0.023 0.016 0.013 0.011 0.010
## Cumulative Var 0.858 0.874 0.887 0.898 0.909
We can see that the first PC loads high on dribbling, ball_control, long_shots, shot_power, and vision, so we could say that this PC is about ball handling and football intelligence. Looking at PC2, we see that it loads on sliding_tackle, interceptions, and aggression, so this can be interpreted as defensive skills. As an exercise, try to interpret the other PCs.
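Rather than scanning the full loadings table by eye, a small helper can list the highest-loading variables per component (this function is our own sketch, not part of the psych package):
# For each PC, show the variables with the largest absolute loadings
top_loadings <- function(pca_obj, n = 5) {
    L <- unclass(pca_obj$loadings)
    apply(L, 2, function(col) names(sort(abs(col), decreasing = TRUE))[1:n])
}
top_loadings(pca)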
# Check the score to see which instances score high on
# certain PCs
head(pca$scores)
## PC1 PC2 PC3 PC4
## [1,] -0.3916228 0.7864349 -0.9646160 -1.04977239
## [2,] 0.5358301 0.4865027 -0.9412600 1.77302473
## [3,] 0.5546452 -0.9598542 -0.4630592 0.09590248
## [4,] -0.3624813 1.3420597 0.2677645 -0.84453112
## [5,] -0.3246658 1.7466316 0.7398201 0.29588598
## [6,] 1.4602375 -0.8321987 1.2660571 0.33471345
## PC5 PC6 PC7 PC8
## [1,] -0.1857767 -0.5709827 -1.2941700 -0.2079504
## [2,] -0.1457836 1.1401279 -0.4622942 0.4218186
## [3,] 0.5638614 0.2841281 0.1767969 -0.2646224
## [4,] -0.3313796 -0.3017848 -1.1603707 1.7917709
## [5,] 0.2321783 0.3631449 -1.9867849 -0.8659367
## [6,] -0.3264924 0.1431034 0.1072922 0.2416088
## PC9 PC10
## [1,] -0.791041954 1.7251407
## [2,] 1.288064235 0.7615998
## [3,] -2.230868898 2.4339190
## [4,] -1.274688621 -0.8312371
## [5,] 0.179336223 -1.5820336
## [6,] 0.008017534 -0.8480305
# How many PCs to keep? Check the eigenvalues. We should go
# for 5 in this case
pca$values  # based on the Kaiser criterion: eigenvalues > 1
## [1] 14.68817338 5.58781625 2.42484785
## [4] 1.75561797 1.44033982 0.70155485
## [7] 0.50497148 0.40075652 0.33988059
## [10] 0.32209933 0.26697913 0.25997298
## [13] 0.23301811 0.22980121 0.21847876
## [16] 0.18690795 0.16991019 0.15128742
## [19] 0.14675786 0.13845947 0.11959460
## [22] 0.11392462 0.10610977 0.10095204
## [25] 0.08096761 0.07461259 0.06823503
## [28] 0.06325777 0.04192702 0.03764140
## [31] 0.02514642
# Make Cattell's scree plot of the eigenvalues. This suggests
# choosing 5 or 6
plot(pca$values, type = "b", ylab = "Eigenvalues", xlab = "Number of Principal Components",
main = "Cattell's Scree plot")
# Check the cumulative PVE. We want at least 70%, so we could
# go for 3
cumsum(prop.table(pca$values))
## [1] 0.4738120 0.6540642 0.7322851 0.7889179
## [5] 0.8353805 0.8580113 0.8743007 0.8872283
## [9] 0.8981922 0.9085825 0.9171947 0.9255810
## [13] 0.9330977 0.9405106 0.9475583 0.9535876
## [17] 0.9590686 0.9639488 0.9686830 0.9731494
## [21] 0.9770073 0.9806823 0.9841052 0.9873617
## [25] 0.9899735 0.9923804 0.9945815 0.9966221
## [29] 0.9979746 0.9991888 1.0000000
# Plot the cumulative PVE. Not so clear-cut, maybe 4 or 5
plot(cumsum(prop.table(pca$values)), type = "b", ylab = "Cumulative Proportion of Variance Explained (%)",
xlab = "Number of Principal Components", main = "Selecting the optimal number of PCs")
# MORE ADVANCED METHODS
# Horn's parallel analysis. By default, fa.parallel takes
# the mean of the eigenvalues of the random matrices. By
# specifying quant=0.95, the 95th percentile (as originally
# in Horn (1965)) is used instead.
hpa <- fa.parallel(x = nums, fm = "ml", fa = "pc", n.iter = 50)
## Parallel analysis suggests that the number of factors = NA and the number of components = 5
Since most methods suggest approximately 5 PCs, 5 PCs would be a good final choice. It is always advisable to try out different methods and check for consistency. Remember that you can also apply a varimax rotation to increase interpretability.
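As a quick consistency check, we can put the suggestions of the criteria used above next to each other (a sketch; we assume hpa$ncomp holds the parallel-analysis suggestion returned by fa.parallel):
# Number of PCs suggested by each criterion
c(kaiser   = sum(pca$values > 1),
  pve_70   = which(cumsum(prop.table(pca$values)) >= 0.70)[1],
  parallel = hpa$ncomp)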
# Rotations
pca2 <- principal(nums, nfactors = 5, rotate = "varimax")
# Check the interpretation. You can see that the first PC is
# now clearly ball handling or even striking skills.
# Especially PC2 is clearer, since it loads very high on
# marking and tackles, which are defensive skills
pca2$loadings
##
## Loadings:
## RC1 RC2 RC4 RC3
## overall_rating 0.294 0.119 0.893
## potential 0.215 0.152 0.814
## crossing 0.706 0.329 0.422 0.124
## finishing 0.870 -0.293 0.197
## heading_accuracy 0.514 0.462
## short_passing 0.767 0.453 0.254 0.180
## volleys 0.876 -0.150 0.188 0.144
## dribbling 0.829 0.458
## curve 0.810 0.110 0.313 0.168
## free_kick_accuracy 0.811 0.170 0.159 0.178
## long_passing 0.566 0.564 0.168 0.297
## ball_control 0.837 0.263 0.357 0.142
## acceleration 0.446 0.819
## sprint_speed 0.444 0.772
## agility 0.473 0.762 0.161
## reactions 0.337 0.140 0.792
## balance 0.374 0.113 0.718
## shot_power 0.855 0.110 0.130 0.152
## jumping -0.163 0.170 0.270 0.320
## stamina 0.452 0.534 0.364 0.105
## strength 0.343 -0.506 0.188
## long_shots 0.900 0.179 0.156
## aggression 0.221 0.754 0.169
## interceptions 0.927 0.157
## positioning 0.861 0.285 0.153
## vision 0.794 0.185 0.199 0.304
## penalties 0.842 0.191
## marking 0.966
## standing_tackle 0.975
## sliding_tackle 0.966
## gk_reflexes -0.634 -0.441 -0.318 0.359
## RC5
## overall_rating 0.129
## potential
## crossing -0.148
## finishing 0.157
## heading_accuracy 0.591
## short_passing
## volleys
## dribbling
## curve -0.148
## free_kick_accuracy -0.191
## long_passing -0.274
## ball_control
## acceleration 0.119
## sprint_speed 0.216
## agility
## reactions 0.119
## balance -0.133
## shot_power 0.203
## jumping 0.635
## stamina 0.272
## strength 0.618
## long_shots
## aggression 0.329
## interceptions
## positioning 0.102
## vision -0.142
## penalties 0.128
## marking
## standing_tackle
## sliding_tackle
## gk_reflexes -0.234
##
## RC1 RC2 RC4 RC3 RC5
## SS loadings 11.217 6.038 3.945 2.907 1.790
## Proportion Var 0.362 0.195 0.127 0.094 0.058
## Cumulative Var 0.362 0.557 0.684 0.778 0.835
# Make a biplot to check the differences between rotation
# and no rotation. To make a good biplot, only select the
# first two PCs
pca <- principal(nums, nfactors = 2, rotate = "none")
pca2 <- principal(nums, nfactors = 2, rotate = "varimax")
biplot(pca)
biplot(pca2)
If there are nonlinear relationships, it can be good to also try out kernel PCA. If you do not have a response variable, it is of course hard to tell whether there are nonlinearities in the data. Let’s make a synthetic data set to show the power of kernel PCA when the data is highly nonlinear.
# Load the following github directory to make a special
# data set
p_load_current_gh("elbamos/clusteringdatasets")
# Make moons data set
synt <- data.frame(make_moons(n_samples = 1000, noise = 0.02))
ggplot(synt, aes(samples.1, samples.2, col = as.factor(labels))) +
geom_point()
# As you can see, this data is not linearly separable. So
# let's perform PCA and see what happens: even after PCA, the
# data remains not linearly separable
pca3 <- principal(synt[, 1:2], nfactors = 2, rotate = "varimax")
scores <- as.data.frame(cbind(pca3$scores, synt$labels))
ggplot(scores, aes(RC1, RC2, col = as.factor(V3))) + geom_point()
# Perform kernel PCA
# As you can see, the data is now linearly separable.
# Note that this is highly dependent upon your choice of
# sigma
p_load(kernlab)
pcak <- kpca(data.matrix(synt[, 1:2]), kernel = "rbfdot", kpar = list(sigma = 15),
    features = 2)

# Plot the scores for kPC1 and kPC2
scoresk <- as.data.frame(cbind(rotated(pcak), synt$labels))
ggplot(scoresk, aes(V1, V2, col = as.factor(V3))) + geom_point()
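Since the result depends strongly on sigma, a sketch of one way to pick a starting value: kernlab ships the sigest() heuristic, which returns quantiles of reasonable sigma values for the RBF kernel on a given data matrix.
# Heuristic range of sigma values for the RBF kernel
sigest(data.matrix(synt[, 1:2]))  # 10%, 50%, and 90% quantiles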
8.2 Clustering
Clustering is often used in market segmentation and can be very handy to get more insight into your target population. Recall that there are two broad categories: hierarchical and nonhierarchical methods.
We will not discuss all the distance metrics in detail but mention them along the way.
8.2.1 Partitional clustering
Probably the most popular partitional clustering method is k-means. The algorithm is simple and intuitive and works well in a lot of cases. Recall that k-means works with Euclidean distances, so scaling the data is crucial. K-means is included in R by default in the stats package, so we don’t need to load any other packages.
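To see what kmeans() does under the hood, here is a sketch of a single Lloyd iteration on the scaled data; the helper lloyd_step is purely illustrative and not part of any package. k-means simply alternates this assign-and-recompute step until the assignments stop changing.
# One Lloyd iteration: assign each point to its nearest center,
# then recompute each center as the mean of its assigned points
lloyd_step <- function(X, centers) {
    d <- sapply(seq_len(nrow(centers)),
                function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    assignment <- max.col(-d)  # index of the nearest center per point
    new_centers <- apply(X, 2, function(col) tapply(col, assignment, mean))
    list(assignment = assignment, centers = new_centers)
}
# Start from 4 random observations and take one step
init <- football_sc[sample(nrow(football_sc), 4), ]
step1 <- lloyd_step(football_sc, init)
table(step1$assignment)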
#We will just use the football data set to see whether we can come up with a
#good clustering solution
#Cluster the scaled dataset in 4 groups (one for each position:
#goalie, defender, midfielder and attacker)
km <- kmeans(x = football_sc,
             centers = 4,
             nstart = 50)
# Plot
plot(x = football_sc,
     col = (km$cluster + 1),
     pch = 20,
     cex = 1.5, main = "k-means clustering")
points(km$centers, col = 2:max(km$cluster + 1), cex = 5, pch = 10)
# adds the centers, although not clearly visible
# To make a cluster plot based on the PCs, use the cluster package
p_load(cluster)
clusplot(x = football_sc, clus = km$cluster, shade = TRUE, color = TRUE, lines = 0)
# An even nicer plotting option is the fviz_cluster function of the factoextra package
# This uses a ggplot-style theme
p_load(factoextra)
fviz_cluster(km, data = football_sc,
geom = "point",
ellipse.type = "convex")
We have successfully created clusters; however, how do we know what all these clusters mean? A good solution can be to build a decision tree with the cluster memberships as the dependent variable and the other variables as independent variables.
p_load(rpart)
p_load(rpart.plot)
# Use the unscaled data for better interpretation. Can you
# guess the interpretation of each cluster?
tree_data <- data.frame(nums, cluster = as.factor(km$cluster))
dt <- rpart(cluster ~ ., tree_data)
rpart.plot(dt)
# You can also just have a look at the mean scores per
# cluster. Can you guess the interpretation now?
tree_data %>%
  group_by(cluster) %>%
  summarise_all(mean)
## # A tibble: 4 x 32
## cluster overall_rating potential crossing
## <fct> <dbl> <dbl> <dbl>
## 1 1 66.5 71.2 18.9
## 2 2 65.2 71.4 54.9
## 3 3 64.1 69.7 48.1
## 4 4 71.7 75.8 66.3
## # ... with 28 more variables: finishing <dbl>,
## # heading_accuracy <dbl>, short_passing <dbl>,
## # volleys <dbl>, dribbling <dbl>, curve <dbl>,
## # free_kick_accuracy <dbl>, long_passing <dbl>,
## # ball_control <dbl>, acceleration <dbl>,
## # sprint_speed <dbl>, agility <dbl>,
## # reactions <dbl>, balance <dbl>, ...
Whereas k-means uses the Euclidean distance by default, k-medoids allows you to set the distance measure yourself. Because the cluster center in k-medoids is a true data point (and not a fictitious one as in k-means), it can easily work with any type of data. To calculate the distances, you can just use the dist function of the stats package; check ?dist to see the various options.
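As a quick illustration of how the choice of metric changes the dissimilarities, compare two methods on a tiny toy matrix (the values are made up):
# Three observations, two variables
m <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 0))
dist(m, method = "euclidean")  # straight-line distance: d(a, b) = 5
dist(m, method = "manhattan")  # city-block distance: d(a, b) = 3 + 4 = 7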
# To implement kmedoids we use the cluster package
p_load(cluster)
# Use a random subsample because distance computations can
# take a while
random <- sample(1:nrow(nums), 1000, replace = FALSE)
# Calculate the dissimilarity matrix with the city-block distance
ds <- stats::dist(football_sc[random, ], method = "manhattan")
# Perform k-medoids
kmd <- pam(x = ds, k = 4, diss = TRUE)  # diss=TRUE means we supply a dissimilarity matrix to x
# Plot
clusplot(x = football_sc[random, ], clus = kmd$cluster, shade = TRUE,
color = TRUE, lines = 0)
# Have a look at the medoids. What is the difference?
nums[random, ][kmd$medoids, ] %>%
  as_tibble()
## # A tibble: 4 x 31
## overall_rating potential crossing finishing
## <dbl> <dbl> <dbl> <dbl>
## 1 64.3 74.1 63.3 64
## 2 63.9 67.7 58.4 32
## 3 70.8 74.3 66.7 60.6
## 4 65.2 75.2 22 17.7
## # ... with 27 more variables:
## # heading_accuracy <dbl>, short_passing <dbl>,
## # volleys <dbl>, dribbling <dbl>, curve <dbl>,
## # free_kick_accuracy <dbl>, long_passing <dbl>,
## # ball_control <dbl>, acceleration <dbl>,
## # sprint_speed <dbl>, agility <dbl>,
## # reactions <dbl>, balance <dbl>, ...
As an exercise, you can also build a decision tree and see what it says about the clusters; a sketch is given below.
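A possible solution to that exercise, mirroring the k-means tree above (a sketch on the same random subsample; kmd$clustering holds the k-medoids memberships):
# Build a tree on the unscaled subsample with the k-medoids labels
tree_data_kmd <- data.frame(nums[random, ], cluster = as.factor(kmd$clustering))
dt_kmd <- rpart(cluster ~ ., tree_data_kmd)
rpart.plot(dt_kmd)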
8.2.2 Hierarchical clustering
One of the downsides of partitional clustering is that you have to specify the number of clusters upfront. Hierarchical clustering solves this issue and allows you to make the ‘cut’ yourself. We will focus on agglomerative clustering, which applies a bottom-up approach.
Again, hierarchical clustering is included in the stats
package of R, so no need to load a new package.
# Calculate dissimilarity
ds <- stats::dist(football_sc, method = "euclidean")
# Perform hierarchical clustering with Ward's method (you
# can experiment with others as well)
hclustering <- hclust(ds, method = "ward.D")
# Plot the dendrogram
plot(hclustering)
abline(h = 3000, col = "red", lwd = 3)
# plot a red line across the dendrogram, which gives you 4
# clusters
# Show the 4 clusters
plot(hclustering, cex = 0.6)
rect.hclust(hclustering, k = 4, border = 2:5)
# Cut the tree
cluster_memberships <- cutree(hclustering, 7)
# We can also plot the total distance against the number of
# clusters and look for an elbow. The elbow is around 3 or 5
data.frame(distance = hclustering$height, clusters = cluster_memberships[-1]) %>%
  group_by(clusters) %>%
  dplyr::summarise(means = sum(distance)) %>%
  mutate(cumul = cumsum(means)) %>%
  ggplot(aes(clusters, cumul)) + geom_line()
# So let's keep it in the middle and pick 4 just like
# before
cluster_memberships <- cutree(hclustering, 4)
# Plot the cluster memberships
fviz_cluster(list(data = football_sc, cluster = cluster_memberships),
geom = "point", ellipse.type = "convex")
8.2.3 Advanced clustering
When the data is highly nonlinear, traditional methods like k-means or k-medoids often fall short, mainly because these methods are based on compactness rather than connectivity. To make the example clearer, we will use the moons data from the kernel PCA example again and apply k-means, spectral clustering, and DBSCAN.
# Remember the moon data
ggplot(synt, aes(samples.1, samples.2, col = as.factor(labels))) +
geom_point()
# Using k-means on this example gives the following. As you
# can see, this is not really what we want
km_synt <- kmeans(x = synt[, 1:2], centers = 2, nstart = 50)
ggplot(synt, aes(samples.1, samples.2, col = as.factor(km_synt$cluster))) +
geom_point()
# Let's try spectral clustering. We can just use the kernlab
# package from the kernel PCA example
specc_synt <- specc(x = data.matrix(synt[, 1:2]), centers = 2,
    iterations = 200, kernel = "rbfdot", kpar = list(sigma = 15))
# there is also an option to choose kpar automatically
# If we plot this, you can nicely see that this is the
# solution that we want
ggplot(synt, aes(samples.1, samples.2, col = as.factor(specc_synt@.Data))) +
geom_point()
# Let us now do the same with DBSCAN
p_load(dbscan)
# Note that in DBSCAN you should not even set the number of
# clusters!
db_synt <- dbscan(synt[, 1:2], eps = 0.15)
# And plot it ...
ggplot(synt, aes(samples.1, samples.2, col = as.factor(db_synt$cluster))) +
geom_point()
# Let's try another dataset for fun
data("face")
qplot(x = face$x, y = face$y)
# Spectral
specc_face <- specc(x = data.matrix(face), centers = 5, iterations = 200,
    kernel = "rbfdot")
ggplot(face, aes(x, y, col = as.factor(specc_face@.Data))) +
geom_point()
# DBSCAN
db_face <- dbscan(face, eps = 0.15)
ggplot(face, aes(x, y, col = as.factor(db_face$cluster))) + geom_point()
8.2.4 Cluster evaluation
The question now of course arises: how do we pick the best clustering solution? We have tried out several methods, but we need to know which one to pick. In theory, we proposed several metrics: the silhouette index, the Davies-Bouldin index, and the Dunn index. A decision tree can also be built (cf. supra); however, this is mainly useful for interpretation purposes.
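To make the silhouette index concrete: for an observation i, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b), so values near 1 indicate a well-clustered point. The helper below is our own illustrative sketch, applied to the earlier k-means solution on the 1000-observation subsample:
# Silhouette of a single observation, computed by hand
sil_one <- function(i, D, cl) {
    own <- cl == cl[i]
    own[i] <- FALSE  # exclude the point itself
    a <- mean(D[i, own])  # mean distance within own cluster
    b <- min(tapply(D[i, !(cl == cl[i])], cl[cl != cl[i]], mean))  # nearest other cluster
    (b - a) / max(a, b)
}
D_small <- as.matrix(dist(football_sc[random, ]))
sil_one(1, D_small, km$cluster[random])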
# Let's rebuild our football clusters with k-means. Since we
# don't know the optimal number, let's make a loop and
# decide upon the aforementioned metrics: check cluster
# validity for multiple values of k
kms <- list()
for (k in 2:6) {
    kms[[k]] <- kmeans(x = football_sc, centers = k, nstart = 50)
}
kms[[1]] <- NULL
# Silhouette index (maximize)
ds <- dist(football_sc, method = "euclidean")
sapply(kms, function(km) summary(silhouette(km$cluster, ds))$avg.width)
## [1] 0.4286009 0.2482518 0.2307827 0.2111661
## [5] 0.1759047
# Dunn index (maximize)
p_load(clValid)
sapply(kms, function(km) dunn(ds, km$cluster))
## [1] 0.08765093 0.09559838 0.09472543 0.02668480
## [5] 0.08928985
# Davies-Bouldin index (minimize)
p_load(clusterSim)
sapply(kms, function(km) index.DB(football_sc, km$cluster)$DB)
## [1] 0.9744974 1.3988414 1.4341121 1.4800272
## [5] 1.6716166
# Choose optimal number of clusters k using all cluster
# validation measures (24 measures)
p_load(NbClust)
nbc <- NbClust(data = football_sc[random, ], min.nc = 2, max.nc = 5,
    method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
# Plot, optimal number of clusters was suggested as 3
fviz_cluster(list(data = football_sc, cluster = kms[[2]]$cluster),
geom = "point", ellipse.type = "convex")
# Also much evidence for 2: compare
fviz_cluster(list(data = football_sc, cluster = kms[[1]]$cluster),
geom = "point", ellipse.type = "convex")
Looking at the plot, it seems as if both options yield suboptimal results. Therefore, always check your results, as an alternative solution might be required. We could try the spectral clustering approach with the suggested number of clusters (i.e., 3). However, spectral clustering mostly produces high-quality clusterings on small data sets and has limited applicability to large-scale problems due to its computational complexity. If you don’t believe me, try running this line of code:
# spectral3 <- specc(x=football_sc, centers=3,
# iterations=200, kernel='rbfdot', kpar = list(sigma = 15))
# fviz_cluster(list(data = football_sc, cluster =
# kms[[1]]$cluster), geom = 'point', ellipse.type =
# 'convex' )
Some may have experienced issues due to limited working memory (Error: cannot allocate vector of size …). This can be countered by fitting your clusters on a smaller sample, as is done in the example below (with 1000 samples). Note that the determination of the optimal number of clusters with the NbClust method was also based on a limited sample size.
# Let's rebuild our football clusters with k-means. Since we
# don't know the optimal number, let's make a loop and
# decide upon the aforementioned metrics: check cluster
# validity for multiple values of k
kms <- list()

smaller_sample <- football_sc[random, ]

for (k in 2:6) {
    kms[[k]] <- kmeans(x = smaller_sample, centers = k, nstart = 50)
}
kms[[1]] <- NULL
# Silhouette index (maximize)
ds <- dist(smaller_sample, method = "euclidean")
sapply(kms, function(km) summary(silhouette(km$cluster, ds))$avg.width)
## [1] 0.4363922 0.2315783 0.2269328 0.2138149
## [5] 0.1822053
# Dunn index (maximize)
p_load(clValid)
sapply(kms, function(km) dunn(ds, km$cluster))
## [1] 0.1286129 0.1227784 0.1430272 0.1506762
## [5] 0.1585677
# Davies-Bouldin index (minimize)
p_load(clusterSim)
sapply(kms, function(km) index.DB(smaller_sample, km$cluster)$DB)
## [1] 0.9631697 1.4681063 1.4323039 1.4379298
## [5] 1.6415087
# Choose optimal number of clusters k using all cluster
# validation measures (24 measures)
p_load(NbClust)
nbc <- NbClust(data = smaller_sample, min.nc = 2, max.nc = 5,
    method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
# Since the number of clusters was determined on this sample,
# we can directly use the best partition detected by NbClust.
# Plot
fviz_cluster(list(data = smaller_sample, cluster = nbc$Best.partition),
geom = "point", ellipse.type = "convex")