7.5 Canonical LDA
In real life, we usually don’t know what potential shoppers find important, but we do have an idea of, for example, their income, their age, and their professional status. It would therefore be useful to test how well we can predict cluster membership (profile of importance ratings) based on respondent characteristics (income, age, professional), which are also called segmentation variables. The predictive formula could then be used to predict the cluster membership of new potential shoppers. To find the right formula, we use linear discriminant analysis (LDA). But first, let’s have a look at the averages of income, age, and professional per cluster:
equipment %>%
  group_by(km.group) %>% # Group equipment by cluster.
  summarize(income = mean(income),
            age = mean(age),
            professional = mean(as.numeric(professional) - 1))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## km.group income age professional
## <fct> <dbl> <dbl> <dbl>
## 1 cl1 32.1 30.9 0.5
## 2 cl2 48.3 44.2 0.333
## 3 cl3 47.5 49 0.75
# We cannot take the mean of professional directly because it is a factor variable,
# so we ask R to treat it as a numeric variable. The numeric version of professional
# takes on the values 1 and 2, so we subtract 1 to get a 0/1 variable whose mean is
# the proportion of professionals in each cluster.
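If you want to convince yourself of this coding, here is a quick illustrative check (assuming professional has the two levels shown in the data, with "non-professional" ordered before "professional"):
levels(equipment$professional) # The order of the levels determines the 1/2 coding.
table(as.numeric(equipment$professional) - 1) # 0 = non-professional, 1 = professional.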
We see that clusters 2 and 3 are somewhat similar in terms of income and age, but differ in the extent to which they consist of professionals. Cluster 1 differs from clusters 2 and 3 in that it is younger and less wealthy.
We can now use LDA to test how well we can predict cluster membership based on income, age, and professional:
library("MASS") # We need the MASS package. Install it first if needed.
lda.cluster3 <- lda(km.group ~ income + age + professional, data=equipment, CV=TRUE) # CV = TRUE returns leave-one-out cross-validated class predictions, which we store in the next step.
equipment <- equipment %>%
  mutate(class = factor(lda.cluster3$class, labels = c("lda1","lda2","lda3"))) # Save the LDA predictions (stored in lda.cluster3$class) as a factor.
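Besides the predicted classes, the cross-validated fit also stores the posterior probability of each cluster for every respondent (a quick illustrative peek; lda() with CV = TRUE returns these in $posterior):
head(round(lda.cluster3$posterior, 2)) # Leave-one-out probability of each cluster, per respondent.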
Let’s see how well the LDA has done:
ct <- table(equipment$km.group, equipment$class) # Cross-tabulate the actual clusters (rows) against the LDA predictions (columns).
ct
##
## lda1 lda2 lda3
## cl1 12 2 0
## cl2 3 12 3
## cl3 2 3 3
We see, for example, that of the 14 observations in cluster 1, LDA correctly predicts 12 to be in cluster 1, but wrongly assigns 2 to cluster 2 (and none to cluster 3).
The overall prediction accuracy can be obtained by converting the counts to proportions and summing the diagonal (and because we set CV = TRUE, this is a leave-one-out cross-validated accuracy):
prop.table(ct) # Each cell as a proportion of all 40 observations.
sum(diag(prop.table(ct))) # The proportion of correctly classified observations.
##
## lda1 lda2 lda3
## cl1 0.300 0.050 0.000
## cl2 0.075 0.300 0.075
## cl3 0.050 0.075 0.075
## [1] 0.675
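Equivalently, we could compare the predicted and the actual cluster respondent by respondent (a small sketch; the factor labels differ, lda1 versus cl1, so we compare the underlying level numbers, which are in the same order):
mean(as.integer(equipment$km.group) == as.integer(equipment$class)) # Should also give 0.675.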
Say we want to predict the cluster membership of new people for whom we have only income, age, and professional status. We could look at the formula that the LDA has derived from the data of the people whose cluster membership we did know:
lda.cluster3.formula <- lda(km.group ~ income + age + professional, data=equipment, CV=FALSE) # CV = FALSE (the default) returns the fitted model, including the discriminant functions we can use for prediction.
lda.cluster3.formula
## Call:
## lda(km.group ~ income + age + professional, data = equipment,
## CV = FALSE)
##
## Prior probabilities of groups:
## cl1 cl2 cl3
## 0.35 0.45 0.20
##
## Group means:
## income age professionalprofessional
## cl1 32.14286 30.92857 0.5000000
## cl2 48.33333 44.22222 0.3333333
## cl3 47.50000 49.00000 0.7500000
##
## Coefficients of linear discriminants:
## LD1 LD2
## income 0.02718175 -0.04456448
## age 0.08017200 0.03838914
## professionalprofessional 0.42492950 2.10385035
##
## Proportion of trace:
## LD1 LD2
## 0.7776 0.2224
We see that the LDA has retained two discriminant dimensions (and this is where it differs from logistic regression, which is unidimensional). The first dimension explains 77.76 percent of the between-group variance in km.group, the second dimension explains 22.24 percent. The table with coefficients gives us the formula for each dimension:

Discriminant Score 1 = 0.03 \(\times\) income + 0.08 \(\times\) age + 0.42 \(\times\) professional

Discriminant Score 2 = -0.04 \(\times\) income + 0.04 \(\times\) age + 2.10 \(\times\) professional

To assign a new observation to a cluster, we first calculate the average discriminant scores of each cluster and the discriminant scores of the new observation (by filling in the (average) values of income, age, and professional in the discriminant functions), then calculate the geometrical distances between the scores of the new observation and the average scores of the clusters, and finally assign the observation to the cluster that is closest in this geometrical space. This is quite a hassle, so we are lucky that R provides a simple way to do it.
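To make these steps concrete, here is a minimal sketch of the manual procedure (purely illustrative, using a hypothetical new shopper; note that predict() for lda also takes the prior probabilities of the groups into account, so its assignments can differ slightly from this pure nearest-centroid rule):
scores <- predict(lda.cluster3.formula)$x # Discriminant scores of the original observations.
centroids <- apply(scores, 2, tapply, equipment$km.group, mean) # Average discriminant scores per cluster.
new_obs <- data.frame(income = 65, age = 20, professional = "professional") # One hypothetical new shopper.
new_score <- predict(lda.cluster3.formula, new_obs)$x # The discriminant scores of the new shopper.
distances <- sqrt(rowSums(sweep(centroids, 2, drop(new_score))^2)) # Euclidean distance to each cluster centroid.
which.min(distances) # The closest cluster wins.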
Let’s create some new observations first:
# The tibble function can be used to create a new data frame.
# To define a variable within the data frame, first provide the name of the
# variable (e.g., income), then provide its values:
new_data <- tibble(income = c(65, 65, 35, 35),
                   age = c(20, 35, 45, 60),
                   professional = c("professional","non-professional","non-professional","professional"))
# check out the new data:
new_data
## # A tibble: 4 x 3
## income age professional
## <dbl> <dbl> <chr>
## 1 65 20 professional
## 2 65 35 non-professional
## 3 35 45 non-professional
## 4 35 60 professional
Now let’s predict cluster membership for these “new” people:
new_data <- new_data %>%
  mutate(prediction = predict(lda.cluster3.formula, new_data)$class)
# Create a new column called prediction in new_data and store in it the predicted
# cluster (accessed by $class) from the LDA fitted on the old data (the fit with CV = FALSE).
# have a look at the prediction:
new_data
## # A tibble: 4 x 4
## income age professional prediction
## <dbl> <dbl> <chr> <fct>
## 1 65 20 professional cl1
## 2 65 35 non-professional cl2
## 3 35 45 non-professional cl2
## 4 35 60 professional cl3
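If we also want to know how confident the model is in these assignments, predict() returns the posterior probability of each cluster as well (a quick illustrative look, using the objects from above):
round(predict(lda.cluster3.formula, new_data)$posterior, 2) # Probability of each cluster, per new shopper.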