7.5 Canonical LDA

In real life, we usually don’t know what potential shoppers find important, but we do have an idea of, for example, their income, their age, and their professional status. It would therefore be useful to test how well we can predict cluster membership (profile of importance ratings) based on respondent characteristics (income, age, professional), which are also called segmentation variables. The predictive formula could then be used to predict the cluster membership of new potential shoppers. To find the right formula, we use linear discriminant analysis (LDA). But first let’s have a look at the averages of income, age, and professional per cluster:
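
The code below sketches one way to obtain these per-cluster averages, assuming the k-means cluster assignments are stored as a factor km.group in the equipment data (as in the model call shown further below) and that professional is a factor with levels non-professional and professional:

```r
library(dplyr)

# Average income, age, and share of professionals per cluster
equipment %>%
  group_by(km.group) %>%
  summarise(
    income       = mean(income),
    age          = mean(age),
    professional = mean(professional == "professional") # share of professionals
  )
```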

## # A tibble: 3 x 4
##   km.group income   age professional
##   <fct>     <dbl> <dbl>        <dbl>
## 1 cl1        32.1  30.9        0.5  
## 2 cl2        48.3  44.2        0.333
## 3 cl3        47.5  49          0.75

We see that clusters 2 and 3 are somewhat similar in terms of income and age, but differ in the extent to which they consist of professionals. Cluster 1 differs from clusters 2 and 3 in that it is younger and less wealthy.

We can now use LDA to test how well we can predict cluster membership based on income, age, and professional:
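
A minimal way to do this is with lda() from the MASS package, using the same formula that appears in the model output further below (lda_model is simply the name we give the fitted object here):

```r
library(MASS)

# Fit the LDA: predict cluster membership from income, age, and professional
lda_model <- lda(km.group ~ income + age + professional, data = equipment, CV = FALSE)
```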

Let’s see how well the LDA has done:
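
One way to build such a confusion table is sketched below: cross-tabulate the actual cluster memberships against the classes predicted by the model (the column labels lda1–lda3 in the output suggest the predicted classes were relabelled, but the counts are what matters):

```r
# Predicted cluster for each observation in the data
lda_pred <- predict(lda_model)$class

# Rows: actual cluster (km.group); columns: LDA prediction
table(equipment$km.group, lda_pred)
```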

##      
##       lda1 lda2 lda3
##   cl1   12    2    0
##   cl2    3   12    3
##   cl3    2    3    3

We see, for example, that of the 14 observations in cluster 1, LDA correctly predicts 12 to be in cluster 1, but wrongly predicts 2 to be in cluster 2 (and none to be in cluster 3).

The overall prediction accuracy can be obtained as follows:
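
A minimal sketch, reusing the lda_pred object from above: turn the counts into proportions of the total sample and sum the diagonal, which holds the correctly classified observations:

```r
# Proportion of all observations in each cell of the confusion table
lda_prop <- prop.table(table(equipment$km.group, lda_pred))
lda_prop

# Overall accuracy: the proportions on the diagonal add up to the hit rate
sum(diag(lda_prop))
```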

##      
##        lda1  lda2  lda3
##   cl1 0.300 0.050 0.000
##   cl2 0.075 0.300 0.075
##   cl3 0.050 0.075 0.075
## [1] 0.675

Say we want to predict the cluster membership of new people for whom we only have income, age, and professional status, but not their cluster membership. We could look at the formula that the LDA has derived from the data of people for whom we did have cluster membership:
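
Printing the fitted model object (lda_model from above) shows the priors, the group means, and the coefficients of the linear discriminants:

```r
# Inspect the fitted model: priors, group means, discriminant coefficients
lda_model
```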

## Call:
## lda(km.group ~ income + age + professional, data = equipment, 
##     CV = FALSE)
## 
## Prior probabilities of groups:
##  cl1  cl2  cl3 
## 0.35 0.45 0.20 
## 
## Group means:
##       income      age professionalprofessional
## cl1 32.14286 30.92857                0.5000000
## cl2 48.33333 44.22222                0.3333333
## cl3 47.50000 49.00000                0.7500000
## 
## Coefficients of linear discriminants:
##                                 LD1         LD2
## income                   0.02718175 -0.04456448
## age                      0.08017200  0.03838914
## professionalprofessional 0.42492950  2.10385035
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7776 0.2224

We see that the LDA has retained two discriminant dimensions (this is where it differs from logistic regression, which is unidimensional). The first dimension accounts for 77.76 percent of the between-cluster variance (the separation between the clusters), the second dimension for the remaining 22.24 percent. The table with coefficients gives us the formula for each dimension: Discriminant Score 1 = 0.03 \(\times\) income + 0.08 \(\times\) age + 0.42 \(\times\) professional, and Discriminant Score 2 = -0.04 \(\times\) income + 0.04 \(\times\) age + 2.1 \(\times\) professional. To assign a new observation to a cluster, we would first calculate the discriminant scores of the cluster centroids and of the new observation (by filling in the average values of income, age, and professional of each cluster, and the observation's own values, in the discriminant functions), then calculate the geometrical distances between the new observation and the cluster centroids in this discriminant space, and finally assign the observation to the cluster whose centroid is closest. This is quite a hassle, so we are lucky that R provides a simple way to do this.
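
As a peek at what happens under the hood, predict() can return the discriminant scores (LD1 and LD2) that these two formulas produce for the observations in our data, assuming the lda_model object fitted above:

```r
# Discriminant scores (LD1, LD2) for each observation, as computed
# by the two linear discriminant functions above
head(predict(lda_model)$x)
```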

Let’s create some new observations first:
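
A sketch of how such data could be created (new_data is just a name we choose here; the values match the output below):

```r
library(tibble)

new_data <- tribble(
  ~income, ~age, ~professional,
       65,   20, "professional",
       65,   35, "non-professional",
       35,   45, "non-professional",
       35,   60, "professional"
)

new_data
```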

## # A tibble: 4 x 3
##   income   age professional    
##    <dbl> <dbl> <chr>           
## 1     65    20 professional    
## 2     65    35 non-professional
## 3     35    45 non-professional
## 4     35    60 professional

Now let’s predict cluster membership for these “new” people:
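
We can feed these observations to predict() and attach the predicted classes as a new column (again assuming the lda_model and new_data objects from above):

```r
# Predict cluster membership for the new observations and store it
new_data %>%
  mutate(prediction = predict(lda_model, newdata = new_data)$class)
```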

## # A tibble: 4 x 4
##   income   age professional     prediction
##    <dbl> <dbl> <chr>            <fct>     
## 1     65    20 professional     cl1       
## 2     65    35 non-professional cl2       
## 3     35    45 non-professional cl2       
## 4     35    60 professional     cl3