3 Chapter 3: Hierarchical & Model-Based Clustering (MBC)


3.1 Choosing the number of clusters – another approach & metric

  • Silhouette Width Plots
    • Another way to ‘assess’ cluster adequacy is through a plot that depicts the degree of homogeneity within each cluster.
    • One advantage Silhouette Plots have over other methods of evaluation is that they require only a dissimilarity or distance measure.
      • They do not require a feature set, as C(g) does
    • Note: Why does C(g) require a feature set? Recall the definition.

$$C(g) = \frac{\operatorname{tr}(B)/(g-1)}{\operatorname{tr}(W)/(n-g)}$$

3.1.1 Silhouette Width (cont.)

$$B = \sum_{m=1}^{g} n_m (\bar{x}_m - \bar{x})(\bar{x}_m - \bar{x})'$$

$$W = \sum_{m=1}^{g} \sum_{l=1}^{n_m} (x_{ml} - \bar{x}_m)(x_{ml} - \bar{x}_m)'$$

  • For each point, we can evaluate its average dissimilarity, or distance from every other point in the cluster.
    • Say we are using Euclidean distance (but it can be any distance metric on pairs (i, j)):

$$a(m,i) = \frac{1}{n_m - 1} \sum_{l \in \{1,\ldots,n_m\} \setminus i} \sqrt{\sum_{k=1}^{p} (x_{m,ik} - x_{m,lk})^2}$$

    • a(m,i) is the average (Euclidean) distance of the ith point in the mth cluster from all remaining points in that cluster.
      • The square root term is the distance of the ith point from the lth point in that same cluster.
  • For any other cluster:
    • take that same point i and evaluate its average dissimilarity, or distance, from all the points in that other cluster.
    • We’ll call this measurement ‘D’:

$$D(C,m,i) = \frac{1}{n_C} \sum_{l \in \{1,\ldots,n_C\}} \sqrt{\sum_{k=1}^{p} (x_{m,ik} - x_{C,lk})^2}$$

    • D(C,m,i) is the average distance of point i in cluster m from every point in cluster C.
      • The square root term is the Euclidean distance of the ith point in cluster m from the lth point in cluster C.
  • Next, we find the neighboring cluster that is ‘closest’ to point i in cluster m, and call that average distance b: $b(m,i) = \min_{c \neq m} D(c,m,i)$
    • b(m,i) is the average distance of point i in cluster m from every point in the nearest cluster to i, other than the cluster it’s in.
    • The idea is to compare a(m,i) and b(m,i). Homogenous clusters have large b(m,i) relative to a(m,i).
  • The silhouette width $SW_{im}$ for point i in cluster m is defined: $SW_{im} = \frac{b(m,i) - a(m,i)}{\max\{a(m,i),\, b(m,i)\}}$. A small numerical check of this definition is sketched below.
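The sketch below is illustrative only (not part of the original notes): it computes $a(m,i)$, $b(m,i)$, and the silhouette widths ‘by hand’ for a toy two-cluster k-means solution, then checks the result against cluster::silhouette(). The simulated data and object names are assumptions for illustration.
Code
# Illustrative sketch: silhouette widths from the definition above,
# checked against cluster::silhouette().
library(cluster)
set.seed(1)
X  <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 4), ncol = 2))  # toy data
cl <- kmeans(X, 2, nstart = 25)$cluster
d  <- as.matrix(dist(X))                               # pairwise Euclidean distances
sw.byhand <- sapply(seq_len(nrow(X)), function(i) {
  same <- cl == cl[i]
  a <- mean(d[i, same & seq_len(nrow(X)) != i])        # a(m,i): within-cluster average
  b <- min(tapply(d[i, !same], cl[!same], mean))       # b(m,i): nearest other cluster
  (b - a) / max(a, b)
})
all.equal(sw.byhand, as.numeric(silhouette(cl, dist(X))[, "sil_width"]))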

3.2 Silhouette width properties

  • Scaled between -1 and 1.
  • Can be negative
    • Why could this happen? Shouldn’t the point have been classified to the ‘closer’ cluster?
      • Think: what would happen if you moved the point?
    • It is rare for a large share of observations to have negative silhouette widths.
  • Plots of silhouette width, organized by cluster, are often informative and can be used to choose an appropriate cluster ‘solution.’
  • The average silhouette width across all clusters and observations is also used as a global measure of goodness of fit (a short computation is sketched below).
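As a hedged illustration (assuming the penguins data frame used throughout these notes is already loaded and cleaned), the average silhouette width can be compared across candidate numbers of clusters:
Code
# Sketch: average silhouette width for k = 2..6 k-means solutions;
# larger is better, all else equal. Assumes `penguins` is in memory.
library(cluster)
d.peng <- dist(penguins[, 3:5])
avg.sw <- sapply(2:6, function(k) {
  km <- kmeans(penguins[, 3:5], k, nstart = 100)
  mean(silhouette(km$cluster, d.peng)[, "sil_width"])
})
names(avg.sw) <- 2:6
round(avg.sw, 3)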

3.3 Silhouette plots

  • We look at silhouette width plots for penguins and crabs data.
    • Think about which ‘solution’ you would choose.
    • Look for label switching (some crosstabs might help)
Code
set.seed(2011001)
crabs <- read.dta("./Datasets/crabs.dta")
km.penguins.3<-kmeans(penguins[,3:5],3,nstart=100)
km.penguins.4<-kmeans(penguins[,3:5],4,nstart=100)
xtabs(~km.penguins.3$clust+km.penguins.4$clust)
##                    km.penguins.4$clust
## km.penguins.3$clust   1   2   3   4
##                   1   1   0   0 114
##                   2   0  73  52   0
##                   3  89   4   0   0
Code
km.crabs.4<-kmeans(crabs[,-(1:3)],4,nstart=100)
km.crabs.7<-kmeans(crabs[,-(1:3)],7,nstart=100)
xtabs(~km.crabs.4$clust+km.crabs.7$clust)
##                 km.crabs.7$clust
## km.crabs.4$clust  1  2  3  4  5  6  7
##                1  0  0 24 39  0  0  4
##                2  0 18  0  0  0  0 19
##                3 19  0  0  0  0 15  0
##                4  0  0 14  0 37 11  0

3.3.1 Silhouette plots: penguins

Code
par(mfrow=c(1,2))
plot(silhouette(km.penguins.3$clust,dist(penguins[,3:5])))
plot(silhouette(km.penguins.4$clust,dist(penguins[,3:5])))

3.3.2 Silhouette plots: crabs

Code
par(mfrow=c(1,2))
plot(silhouette(km.crabs.4$clust,dist(crabs[,-(1:3)])))
plot(silhouette(km.crabs.7$clust,dist(crabs[,-(1:3)])))

Code
par(mfrow=c(1,1))

3.3.3 Exercise 1

Exercise link

3.4 Partitioning About Medoids (PAM)

  • Clustering from dissimilarity matrices using PAM
    • PAM is similar to the k-means algorithm in that it involves building a partition of the data into k groups
      • k is specified from the outset.
    • hclust can also operate on a dissimilarity matrix as long as the agglomeration rule does not require the original data
      • In the centroid and Ward’s methods you need to construct a mean of features. R’s implementation of the centroid method works around this.
  • PAM differs from k-means in at least 2 ways
    • The dissimilarity matrix used to establish the clustering can be arbitrary.
      • It can reflect subjective opinions (e.g., which cereals are most similar; even Q-sort results).
    • The objective criterion – what makes a clustering ‘good’ – is different
      • We can’t use C(g) unless we use the version revised for distances without raw feature vectors.
    • While k-means minimizes the sum of squared Euclidean distances between observations and their group mean, PAM minimizes the sum of dissimilarities between each medoid and the observations assigned to its cluster.
  • How PAM works:
    • There is an initial phase in which k ‘centers’ or medoids are assigned.
      • These medoids are specific observations that appear ‘close’ (in terms of dissimilarity) to a subset of the other observations (the medoid plays the role that the centroid plays in k-means).
    • The process of choosing the initial k medoids is somewhat complex, but basically involves sequentially adding new ‘seeds’ that pull away some observations from existing seeds, because the new seeds are closer to some subset of observations.
      • This repeats until all k seeds are chosen.
    • There is a swap phase, in which alternative choices for medoids are examined.
      • The algorithm will swap a medoid choice if that swap improves an objective function that is based on terms in component a of the silhouette width calculation.
      • The average distance between medoid and all cluster elements must improve if a medoid is to be swapped out.
  • Once all k medoids have been (re-)chosen, each observation is assigned to a group based on the nearest medoid.
    • So an observation may be closest to group A’s medoid, but its average dissimilarity with observations in group A is actually larger than its average dissimilarity with observations in another group, B.
      • This results in a negative silhouette width.
      • The problem is that it isn’t close enough to the medoid of B to be placed in that cluster.
    • Thus, the choice of medoids is quite important, which is why the swap phase is so extensive (and sometimes time-consuming).
    • In some sense, the clustering is the easy part, given the medoids.
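Below is a small illustrative sketch (not part of the original notes) of running PAM directly on a dissimilarity matrix for the crabs data used above, and comparing the medoid-based partition to the 4-group k-means solution:
Code
# Sketch: PAM on a dissimilarity matrix; any dissimilarity measure would do here.
# Assumes the crabs data frame and km.crabs.4 from the code above are in memory.
library(cluster)
d.crabs   <- dist(crabs[, -(1:3)])
pam.crabs <- pam(d.crabs, k = 4)
pam.crabs$medoids                                  # indices of the chosen medoid observations
xtabs(~ pam.crabs$clustering + km.crabs.4$cluster) # compare to the k-means partition
plot(silhouette(pam.crabs$clustering, d.crabs))    # PAM pairs naturally with silhouette plots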

3.4.1 Exercise 2

Exercise link

3.5 Model-based Clustering

  • Mixtures of densities
    • The difference between model-based clustering (MBC) and most of the methods that have preceded this:
      • A plausible generating mechanism is fully developed
      • Cluster assignments are based on these
    • What is a generating mechanism?
      • It’s a statistical rule for creating new observations
      • Called Data Generating Process (DGP)

3.5.1 MBC/DGP Example

  • Assume two groups
    • No matter which group you are in, the observation is a draw from a binomial distribution with n=100.
      • This is the count of heads in 100 ‘coin flips,’ but the probability of heads depends on which group you are in.
    • In group 1, Pr(Heads)=0.25
      • Typical observations are 25, 27, etc. The mean, or expected value is 25.
    • In group 2, Pr(Heads)=0.75
      • Expected value 75 (out of 100).
  • Look at a sample with 50 observations from group 1 and 50 from group 2.
Code
set.seed(2011)
y1 <- rbinom(50,100,.25)
y2 <- rbinom(50,100,.75)
plot(density(c(y1,y2)),main='Mixture of two binomials')

  • If we believe that a proposed generating mechanism could represent the data reasonably well, then we can build a statistical model with ‘open’ parameters that govern the way observations occur.
    • We can find the best fitting version of that model that is consistent with the data.
  • Example of a DGP from another course
    • If we believe that a simple linear model (DGP) for Y is $Y = b_0 + b_1 X_1 + b_2 X_2 + \varepsilon$, assuming $\varepsilon \sim N(0, \sigma^2)$.
      • Then we can find the ‘best’ values of b0, b1, b2, and σ2 that are consistent with our observations.
  • One version of this process, searching for the ‘best’ values of the parameters that are consistent with the observations is known as maximum likelihood estimation, or MLE.
    • We call these ‘best’ estimates MLEs, and usually denote them with a ‘hat,’ as in $\hat{\sigma}^2$, once the model is fit, because they are estimated.
    • We define ‘best’ in terms of a likelihood function, which is a mathematical representation of the generating mechanism.
      • In words, we want to find the parameters that were most likely to generate these observations.
  • Continuing with this example, in regression, we implicitly assume that data are generated as follows:
    • Take the value of X1 and X2, scale them by b1, and b2, sum them along with b0.
    • This is b0+b1X1+b2X2.
    • Take this sum (b0+b1X1+b2X2) and add a ‘random error’ drawn from a normal distribution with mean zero and variance σ2.
      • Call this random error ε.
    • Then a new observation Y, given X1 and X2, is b0+b1X1+b2X2+ε.
  • If the data are generated this way, then the distribution of an observation, given parameters b0, b1, b2, and σ2, can be written down as follows: $Y \mid X_1, X_2, b_0, b_1, b_2, \sigma^2 \sim N(b_0 + b_1 X_1 + b_2 X_2,\ \sigma^2)$
    • You read this: Y given X1, X2,b0, b1, b2, σ2 is distributed as a normal with mean b0+b1X1+b2X2 and variance σ2.
    • This conditional probability of observing a specific Y, given everything else, is a specific normal density:

$$P(Y \mid X_1, X_2, b_0, b_1, b_2, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(Y - (b_0 + b_1 X_1 + b_2 X_2)\right)^2}{2\sigma^2}\right)$$

  • We have a candidate value for Y, X1, X2,b0, b1, b2, and σ2, and we just want to know the value of the above density at this point.

3.5.2 Example

  • Set Y=2;X1=1;X2=1;b0=1;b1=1;b2=1;σ2=1.
    • Then $b_0 + b_1 X_1 + b_2 X_2 = 3$, and the above expression reduces to $\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(2-3)^2} = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}} \approx 0.242$.

    • The goal is to find ‘good’ (even ‘best’) values for b0,b1,b2, σ2.

  • So let’s also try $b_0 = 0;\ b_1 = 1;\ b_2 = 1;\ \sigma^2 = 1$, which yields $b_0 + b_1 X_1 + b_2 X_2 = 2$, so when Y=2 the estimated error term is 0, and the above expression reduces to $\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(2-2)^2} = \frac{1}{\sqrt{2\pi}} e^{0} \approx 0.399$.
  • The second set of parameters is ‘better’ than the first – for this single observation Y=2.
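A quick numerical check of this worked example (not from the original notes):
Code
# Evaluate the normal density of Y = 2 under the two candidate parameter sets.
Y <- 2; X1 <- 1; X2 <- 1
dens <- function(b0, b1, b2, sigma2) {
  dnorm(Y, mean = b0 + b1 * X1 + b2 * X2, sd = sqrt(sigma2))
}
dens(b0 = 1, b1 = 1, b2 = 1, sigma2 = 1)   # mean 3: about 0.242
dens(b0 = 0, b1 = 1, b2 = 1, sigma2 = 1)   # mean 2: about 0.399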

3.5.3 Exercise 3

Exercise link

3.5.4 Likelihood-based estimation (cont.)

  • We ‘invert’ the notation associated with the probability of observing the outcome and say that the likelihood of parameters b0, b1, b2, and σ2, given the data (observations X1, X2, and Y), is $$L(b_0, b_1, b_2, \sigma^2 \mid Y, X_1, X_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(Y - (b_0 + b_1 X_1 + b_2 X_2)\right)^2}{2\sigma^2}\right)$$

3.5.5 Likelihood for mixture of binomials

  • The mixture of binomials prior example, in which observations come from one of two different binomials (e.g., one with probability of success 0.25, and one with 0.75) can be used to generate data, and thus can also be expressed as a likelihood
    • Its form is a bit more complicated, because there are two different generating mechanisms, depending on the cluster.

    • If the underlying success probability is $p_1 = 0.25$, and we observe Y successes out of T binomial trials, then: $$\Pr(Y \mid T, p_1 = .25) = \binom{T}{Y} p_1^{Y} (1 - p_1)^{T-Y} = \binom{T}{Y} (.25)^{Y} (.75)^{T-Y}$$

    • If the underlying success probability is $p_2 = 0.75$, and we observe Y successes out of T binomial trials, then: $$\Pr(Y \mid T, p_2 = .75) = \binom{T}{Y} p_2^{Y} (1 - p_2)^{T-Y} = \binom{T}{Y} (.75)^{Y} (.25)^{T-Y}$$

3.5.6 Notes:

  • They are not identical (look closely at exponents).
  • Let’s write these two probability formulas down as likelihoods:

$$L_1(p_1 \mid T, Y) = \binom{T}{Y} p_1^{Y} (1 - p_1)^{T-Y}$$

$$L_2(p_2 \mid T, Y) = \binom{T}{Y} p_2^{Y} (1 - p_2)^{T-Y}$$

  • If we also assign cluster memberships $C_1, C_2, \ldots, C_N$ to our N observations $Y_1, Y_2, \ldots, Y_N$, then the likelihood of observing all N observations (given known cluster memberships) is: $$L(p_1, p_2, C_1, \ldots, C_N \mid T, Y_1, \ldots, Y_N) = \prod_{i=1}^{N} L_{C_i}(p_{C_i} \mid T, Y_i)$$

  • Our goal is to find the ‘best’ choices for the memberships $C_1, C_2, \ldots, C_N$ and the ‘best’ probabilities $p_1$ and $p_2$, i.e., those that plausibly generated our observed data. A small likelihood computation for candidate values is sketched below.
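The sketch below (illustrative only) evaluates the log of this likelihood for a given assignment of the y1/y2 observations generated earlier (T = 100 trials), under two candidate choices of (p1, p2):
Code
# Log-likelihood of the two-binomial model for a *given* membership vector Ci.
# Assumes y1 and y2 from the earlier simulation are in memory.
y  <- c(y1, y2)
Ci <- rep(1:2, each = 50)                  # a candidate membership vector
loglik <- function(p, Ci, y, T = 100) {
  sum(dbinom(y, size = T, prob = p[Ci], log = TRUE))
}
loglik(p = c(0.25, 0.75), Ci, y)           # near the truth: high log-likelihood
loglik(p = c(0.50, 0.50), Ci, y)           # a poor choice: much lower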

3.5.7 Example: mixture of three binomials.

  • We have three different probabilities generating the binomial draws.
    • Say, p1=0.3, p2=0.5, and p3=0.7.
    • We will generate 25 with p1=0.3, 25 with p2=0.5, and 50 with p3=0.7.
  • Here is the density (distribution of outcomes in 100 trials, split into 3 clusters, as just described) under this simulation example:
Code
set.seed(2011)
dat <- simBinomMix(c(25,25,50),100,c(.3,.5,.7),2011)
plot(density(dat[,1],bw=3),main='Density: three groups',xlab='Mixture of three binomials (Sample)')

3.6 Iterative algorithm

  • Algorithm for identifying the clusters and binomial probabilities:
    • You choose the number of groups
    • You set relative size of each group, but it’s a starting value only
      • You can assume they are equal sized, to initialize.
    1. Assign each observation to one of the groups, arbitrarily (yes, randomly).
    2. Evaluate the MLE for p1, p2, p3 given the cluster assignments.
    3. Identify the new ML (best) cluster assignments given the ‘new’ p1, p2, p3.
      • Re-assign the clusters based on these MLEs.
    4. Go back to step 2.
      • Stop after a fixed number of iterations, or when no new reassignments occur, whichever comes first.
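A minimal sketch of this hard-assignment loop is given below (illustrative only; the notes later use their own learnBinMix.single() function). It assumes T = 100 trials and the y1/y2 data generated earlier:
Code
# Hard-assignment estimation for a mixture of k binomials (minimal sketch).
hardBinMix <- function(y, k, T = 100, n.iter = 20) {
  Ci <- sample(1:k, length(y), replace = TRUE)         # step 1: random start
  for (it in 1:n.iter) {
    p  <- tapply(y, Ci, mean) / T                      # step 2: MLE of each p given Ci
    ll <- sapply(p, function(pg) dbinom(y, T, pg, log = TRUE))
    newCi <- max.col(ll)                               # step 3: best cluster per observation
    if (all(newCi == Ci)) break                        # step 4: stop when nothing moves
    Ci <- newCi
  }
  list(probs = p, cluster = Ci)
}
hardBinMix(c(y1, y2), k = 2)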

3.6.1 Example (cont.)

  • Four iterations of this algorithm, applied to the simulated data, are shown for illustrative purposes.
    • The x-axis is ‘jittered’ about zero to reveal the density.
    • Binomial trial values are on the y-axis.
    • The random assignment is in the top left panel.
    • Current p1,p2,p3 estimates are given by arrows to the right.

3.6.2 Example (4 iterations of the learning alg.)

Code
par(mfrow=c(2,2))
learnBinMix.single(dat[,1],3,plot=T,n.iter=4)

## $probs
##        1        2        3 
## 31.34615 49.28000 69.42857 
## 
## $cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Code
par(mfrow=c(1,1))

3.6.3 Notes

  • The final clustering has p1=0.697, p2=0.503, and p3=0.295.
    • The number in each group is: 49, 26, 25.
    • ‘label switching’ occurs, which is not unusual.
      • Why would an algorithm ‘care’ about the order of the probabilities?
    • The size of each cluster is about right, which is remarkable.
    • Things don’t always work out as nicely as in this example.
      • We might recover the probabilities well, but not the proportions in each group, etc., depending on the separation of the clusters.

3.6.4 MBC: Mixture densities – soft clustering

  • We make a subtle adjustment to the use of a ‘mixture’ to generate data.

    • What if assignment to the different groups happens in some random way?
    • For example, first we flip a coin:
      • if it is heads, you are in group 1; tails, group 2.
    • Once we know what group you are in, you get a binomial draw with a group-specific probability, as before.
  • This subtle difference in the generating mechanism leads to a difference in the likelihood.

    • In particular, we no longer include the specific cluster memberships in the likelihood function.
      • Each observation Yi could have been in either of the two groups, so the likelihood is assessed by exploring both possibilities.
  • Reusing L1 and L2 from the binomial draws example, above, we write a different mixture density likelihood: $$L(p_1, p_2, \pi_1, \pi_2 \mid T, Y_1, \ldots, Y_N) = \prod_{i=1}^{N} \sum_{j=1}^{2} \pi_j L_j(p_j \mid T, Y_i)$$

  • The mixing proportions are π1,π2. The rates of success are p1, p2.

  • The term $\sum_{j=1}^{2} \pi_j L_j(p_j \mid T, Y_i)$ can be understood as follows:

    • Observation Yi has a probability π1 of being in group 1, and conditional likelihood L1(p1|T,Yi), given group (cluster) 1 membership.
    • Similarly, Yi has a probability π2 of being in group 2, and conditional likelihood $L_2(p_2 \mid T, Y_i)$, given that cluster membership.
    • Since cluster membership is mutually exclusive and exhaustive (π1+ π2 = 1), the total likelihood for observation Yi is the sum of these two likelihoods
      • (using the law of total probability)
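A one-observation illustration of this weighted sum, with assumed values:
Code
# Mixture likelihood contribution of a single observation: Y_i = 30 successes
# in T = 100 trials, with mixing proportions (0.5, 0.5) and p = (0.25, 0.75).
Ti <- 100; Yi <- 30
pis <- c(0.5, 0.5); p <- c(0.25, 0.75)
sum(pis * dbinom(Yi, size = Ti, prob = p))   # pi1*L1 + pi2*L2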

3.6.5 Iterative algorithm

  • The MLEs for these likelihoods can be obtained iteratively
    • Often with the use of an ‘EM’ algorithm (described below)
      • EM proceeds in a manner similar to the prior, ‘hard clustering’ approach.
  • Here is a sketch for the mixture of two binomials:
    • You fix the number of groups
    • You set a ‘prior’ set of beliefs about the mixing proportions π1,π2.
      • We often use a prior of ‘equal proportions’ when we don’t know a better choice. (π1=π2=0.5)
    • You must also have a set of initial guesses for the parameters p1, p2.
  1. Start by assigning an observation-specific set of ‘membership probabilities’ to each observation.

    • We might begin by giving them all equal probability of being in any group.
    • We call these zi and they are vectors; every observation has a probability corresponding to each of the possible groups.
    • For the example, there are two groups, so for observation Y1, we might set z1=(0.5,0.5).
    • These z’s are sometimes called hidden data.
  2. Determine the MLE for p1, p2 given these cluster assignment probabilities. This is based on a slightly different likelihood:

    • The zi are ‘missing data’ of a sort.
      • We don’t really know to which cluster each observation belongs, but our best guess is maintained in the $z_i$, which act as weights.
  3. Use the MLE (current best) for p1, p2 to generate an improved set of guesses for the zi.

    • Use the zi to update π1,π2.
  4. Go back to 2

    • Stop after a fixed number of iterations, or when the change in the likelihood is extremely small, whichever comes first.
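A minimal EM sketch for the two-binomial mixture described above (illustrative only; it assumes T = 100 trials, the y1/y2 data generated earlier, and starting values chosen for convenience):
Code
# EM for a mixture of two binomials: the E-step computes the z's, and the
# M-step updates p and the mixing proportions using the z's as weights.
emBinMix2 <- function(y, T = 100, n.iter = 50) {
  p   <- c(0.3, 0.6)                          # initial guesses for p1, p2
  pis <- c(0.5, 0.5)                          # equal mixing proportions to start
  for (it in 1:n.iter) {
    num <- cbind(pis[1] * dbinom(y, T, p[1]), pis[2] * dbinom(y, T, p[2]))
    z   <- num / rowSums(num)                 # E-step: membership probabilities
    p   <- colSums(z * y) / (colSums(z) * T)  # M-step: weighted binomial MLEs
    pis <- colMeans(z)                        # M-step: updated mixing proportions
  }
  list(p = p, pi = pis, z = z)
}
fit <- emBinMix2(c(y1, y2))
round(fit$p, 3); round(fit$pi, 3)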

3.6.6 Notes

  • We never assign observations to clusters
    • But we do change our beliefs about cluster membership, so that eventually, we may have z1=(0.99,0.01) when we are fairly certain that Y1 belongs to cluster 1.
  • This approach is called ‘soft’ clustering because we never are completely sure where each observation belongs.
    • We allow for the chance that it might belong in any of the clusters.
  • Contrast it to the ‘hard’ clustering we did in the first discussion, where we assigned cluster membership at each stage.
  • An implementation of mixture models for a variety of generating mechanisms is available in the R 'flexmix' package.
    • We are able to fit the data from the previous mixture of 3 binomials as follows:

3.6.7 Flexmix example

Code
newDat <- data.frame(successes=dat[,1],failures=100-dat[,1]) 
fitMix3 <- flexmix::flexmix(cbind(successes,failures)~1,data=newDat,k=3,model=list(FLXMRglm(cbind(successes,failures)~.,family='binomial')))
summary(fitMix3)
## 
## Call:
## flexmix::flexmix(formula = cbind(successes, failures) ~ 1, data = newDat, 
##     k = 3, model = list(FLXMRglm(cbind(successes, failures) ~ 
##         ., family = "binomial")))
## 
##        prior size post>0 ratio
## Comp.1 0.222   22     41 0.537
## Comp.2 0.264   26     60 0.433
## Comp.3 0.514   52     63 0.825
## 
## 'log Lik.' -395.996 (df=5)
## AIC: 801.992   BIC: 815.0178
Code
print(lapply(fitMix3@components,"[[",1))
## $Comp.1
## $coef
## (Intercept) 
##  -0.8463143 
## 
## 
## $Comp.2
## $coef
## (Intercept) 
##  -0.1236218 
## 
## 
## $Comp.3
## $coef
## (Intercept) 
##   0.7971141
Code
expit <- function(x) exp(x)/(1+exp(x))
probs <- rep(NA,3)
for(i in 1:3) probs[i] <- expit(fitMix3@components[[i]][[1]]@parameters$coef)
round(probs,2)
## [1] 0.30 0.47 0.69
  • Taking the inverse-logit of the three intercepts yields success probabilities of 0.30, 0.47, and 0.69, respectively.
    • If you match these to their counts, you can see that these estimates are quite close to the ‘truth’ as described by our data generating process.

3.7 MBC as finite mixture of multivariate normals

  • In practice, with continuous outcomes, and multiple features, we use a mixture of multivariate normal densities as our plausible generating mechanism, and our model for the likelihood, given the data.
    • Each multivariate normal potentially has its own, unique mean vector and covariance structure, with the gth group’s structure specified by, Σg.
    • μg is a vector that identifies the center of the cluster, and Σg describes the shape of the spread around that center through its covariance.
    • The mixing proportions, π, play the same role as before.
    • It uses ‘missing data’ z’s, as before, to maintain our best guesses as to cluster membership as the algorithm proceeds.
    • Because this method is likelihood based, we can compare potential ‘solutions’ using information criteria (this is a lot like using a likelihood ratio test with nested models).
    • We can even choose the number of groups using these information criteria.

3.8 Mclust

  • This has been implemented as ‘Mclust’ in R
    • Mclust is unique in the parsimony of its covariance structures
    • The GaussianMixture class in Python’s sklearn.mixture module is a general implementation.
  • The likelihood, for the mixture of normals, model-based clustering approach, for m clusters, is: $$L(\mu_1, \ldots, \mu_m, \Sigma_1, \ldots, \Sigma_m, \pi_1, \ldots, \pi_m \mid Y_1, \ldots, Y_N) = \prod_{i=1}^{N} \sum_{g=1}^{m} \pi_g L_g(\mu_g, \Sigma_g \mid Y_i)$$ where $$L_g(\mu_g, \Sigma_g \mid Y_i) = \frac{1}{(2\pi)^{k/2} |\Sigma_g|^{1/2}} \exp\!\left(-\tfrac{1}{2}(Y_i - \mu_g)' \Sigma_g^{-1} (Y_i - \mu_g)\right)$$ and k is the number of features in each observation $Y_i$.
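As an illustration (not from the original notes), this likelihood can be evaluated directly for candidate parameters, using mvtnorm::dmvnorm for each component density; the toy data, means, and covariances below are assumptions:
Code
# Sketch: mixture-of-normals log-likelihood for a candidate set of parameters.
library(mvtnorm)
set.seed(1)
Y   <- rbind(rmvnorm(50, c(0, 0)), rmvnorm(50, c(4, 4)))  # toy 2-feature data
mu  <- list(c(0, 0), c(4, 4))
Sig <- list(diag(2), diag(2))
pis <- c(0.5, 0.5)
comp <- sapply(1:2, function(g) pis[g] * dmvnorm(Y, mean = mu[[g]], sigma = Sig[[g]]))
sum(log(rowSums(comp)))                                   # the mixture log-likelihood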

3.8.1 Mclust implementation

  • An important variant of this general form for the model is to reparameterize cluster-specific covariance Σg so that simpler forms are possible. Sketch:
    • The most general form of an eigenvalue/eigenvector decomposition is $\Sigma_g = \lambda_g D_g A_g D_g^{T}$.
      • λg controls the volume of the covariance ellipsoid.
      • Dg controls the orientation (rotation wrt coordinate axes).
      • Ag controls the shape (spherical, football-shaped, cigar-shaped, etc.)
    • Letter codes are used that refer to Volume, Shape, and Orientation being Equal, Variable, or in some instances Identity (the simplest shape or orientation); the table of codes is reproduced in the next subsection (their ‘k’ is our ‘g’). A small numeric illustration of the decomposition follows.
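Here is a small illustrative sketch (with assumed values) of how the volume, shape, and orientation pieces combine into a cluster covariance:
Code
# Build Sigma_g = lambda * D %*% A %*% t(D) for two features (assumed values).
theta   <- pi / 6
D       <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)  # orientation (rotation)
A       <- diag(c(2, 1/2))                                                   # shape, det(A) = 1
lambda  <- 1.5                                                               # volume
Sigma.g <- lambda * D %*% A %*% t(D)
Sigma.g
eigen(Sigma.g)$values   # equals lambda * diag(A), confirming the roles of the pieces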

3.8.2 Parsimonious covariance types

3.8.3 Exercise 4

Exercise link

  • Mclust examines a wide range of covariance structures and a range of choices for the number of groups, and selects the best one using the Bayesian Information Criterion (BIC). The result includes:
    • The means and covariance structures for the different multivariate normals that define the clusters
    • The probability of membership in each group for each subject.
    • The ‘best’ choice for the number of groups.

3.8.4 Exercise 5

Exercise link

3.8.5 Mclust example: penguins

Code
mcl.penguins <- Mclust(penguins[,3:5])
plot(mcl.penguins,what="BIC")

  • BIC values in the search across different numbers of clusters (x-axis) and different parameterizations of Σ (separate lines).
    • Mclust picks the model with the highest BIC.

3.9 Example: penguins (pairs plot)

Code
pairs(penguins[,3:5],col=1+mcl.penguins$class) #1+ starts at red

  • Mclust picked a 4 cluster solution (we know that the ‘truth’ is 3, but the data point to 4)
    • The cluster assignment is based on the highest probability within the ‘z’ vector for that observation.
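A brief sketch of how to inspect those soft assignments for the fit above:
Code
# Membership probabilities (the 'z' matrix) and the implied hard classification.
head(round(mcl.penguins$z, 3))                       # one row per penguin, one column per cluster
table(mcl.penguins$classification ==
        apply(mcl.penguins$z, 1, which.max))         # classification = column with the largest z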

3.9.1 Mclust example: penguins

Code
par(mfrow=c(1,2))
plot(mcl.penguins,dimen=c(1,3),what="classification")
plot(mcl.penguins,dimen=c(1,3),what="uncertainty")

  • Left panel: cluster means and shape of the covariance (scatterplot of features 1 and 3)
  • Right panel: uncertainty is given by the size of the point.
    • It’s plotted as a relative measure, based on 1 minus the probability of the most likely group.
    • In this case about 90% of the cases have over 90% probability of belonging to the assigned cluster, according to Mclust (details not shown).

3.9.2 Mclust example: crabs

Code
mcl.crabs <- Mclust(crabs[,-(1:3)])
plot(mcl.crabs,dimen=c(2,3), what="BIC")

  • BIC values in the search for best model.
    • It is a bad sign when the best model is at the ‘boundary’ with 9 groups.
    • You can see how a variety of solutions might work for this data.
Code
cols <- mclust.options("classPlotColors")
syms <- mclust.options("classPlotSymbols")
par(mfrow=c(1,2))
plot(mcl.crabs,dimen=c(2,3), what="classification")
plot(mcl.crabs,dimen=c(2,3), what="uncertainty")

  • Left panel: cluster means and shape of the covariance (the scatterplot of features two and three reveals the most).
    • Notice that these do not ‘cross’ the space between, unlike other methods.
  • Right panel: uncertainty.
    • For this data, some clusters have observations with 30% or 40% uncertainty (could easily be misclassified).

3.9.3 Example: crabs (PC space)

Code
pc.crabs <- princomp(crabs[,-(1:3)],cor=T)$scores
par(mfrow=c(1,2))
plot(pc.crabs[,c(2,3)],col=cols[mcl.crabs$class],pch=syms[mcl.crabs$class],main='Classification')
plot(pc.crabs[,c(2,3)],col=cols[mcl.crabs$class],pch=16,cex=pmax(.3,sqrt(6*mcl.crabs$uncertainty)),main='Classification Uncertainty')

  • Left panel: Principal components 2 and 3, shaded by cluster membership.
    • These seem fairly separated (for the crabs data, anyway).
  • Right panel: uncertainty plotted for principal components 2 and 3.
    • Notice the areas of overlap are most uncertain.

3.9.4 Example: Male Life Expectancy

Code
maleLifeExp <- read.dta("./Datasets/malelifeexp.dta")
mcl.life<-Mclust(maleLifeExp[,2:5])
plot(mcl.life,what="BIC")

  • BIC values in the search for a good model.
    • The nine-cluster model is chosen over the two-cluster model.
      • This is a problem.
    • Redo the fit, restricting to at most 5 clusters.

3.9.5 Example: Male Life Expectancy (G in 2 to 5)

Code
mcl.life<-Mclust(maleLifeExp[,2:5],G=2:5)
plot(mcl.life,what="BIC")

Code
plot(mcl.life,what="classification")

  • Clusters are fairly well separated in some of these plots.
Code
par(mfrow=c(1,2))
plot(mcl.life,dimen=c(1,3),what="classification")
plot(mcl.life,dimen=c(1,3),what="uncertainty")

  • Left panel: cluster means and shape of the covariance (scatterplot of features 1 and 3)
  • Right panel: uncertainty plot. No point has large uncertainty.

3.9.6 Example: Eurowork

Code
eurowork <- read.dta("./Datasets/eurowork.dta")
mcl.euro<-Mclust(eurowork[,-1])
plot(mcl.euro,what="BIC")

  • BIC values in the search for best model.
    • It looks like a 5 group solution.
Code
par(mfrow=c(1,2))
plot(mcl.euro,dimen=5:6,what="classification")
plot(mcl.euro,dimen=5:6,what="uncertainty")

  • Left panel: SER vs. CON, with colors and symbols denoting different clusters.
  • Right panel: uncertainty is extremely low here, and where it does occur it is in an expected (border) case.
    • Maybe it’s clearer in high-dimensional space.
Code
pc.euro <- princomp(eurowork[,-1])$scores
plot(pc.euro[,2:3],col=1+mcl.euro$class,pch='.',type='n')
text(pc.euro[,2:3],as.character(eurowork[,1]),col=cols[mcl.euro$class],cex=.8)

  • Principal component plot with names of countries given.
  • Some of the sorting follows the east/west divide.

3.10 Lab