21.4 Discriminant Analysis

Suppose we have two or more different populations from which an observation could come. Discriminant analysis seeks to determine which of the possible populations an observation comes from while making as few mistakes as possible.

Notation

Similar to MANOVA, let \(\mathbf{y}_{j1},\mathbf{y}_{j2},\dots, \mathbf{y}_{jn_j} \sim iid f_j (\mathbf{y})\) for \(j = 1,\dots, h\)

Let \(f_j(\mathbf{y})\) be the density function for population j. Note that each vector \(\mathbf{y}\) contains measurements on all \(p\) traits.

  1. Assume that each observation is from one of \(h\) possible populations.
  2. We want to form a discriminant rule that will allocate an observation \(\mathbf{y}\) to population j when \(\mathbf{y}\) is in fact from this population

21.4.1 Known Populations

The maximum likelihood discriminant rule for assigning an observation \(\mathbf{y}\) to one of the \(h\) populations allocates \(\mathbf{y}\) to the population that gives the largest likelihood to \(\mathbf{y}\)

Consider the likelihood for a single observation \(\mathbf{y}\), which has the form \(f_j (\mathbf{y})\) where j is the true population.

Since \(j\) is unknown, to make the likelihood as large as possible, we should choose the value of \(j\) that makes \(f_j (\mathbf{y})\) as large as possible.


Consider a simple univariate example. Suppose we have data from one of two binomial populations.

  • The first population has \(n= 10\) trials with success probability \(p = .5\)

  • The second population has \(n= 10\) trials with success probability \(p = .7\)

  • To which population would we assign an observation of \(y = 7\)?

  • Note:

    • \(f(y = 7|n = 10, p = .5) = .117\)

    • \(f(y = 7|n = 10, p = .7) = .267\) where \(f(.)\) is the binomial likelihood.

    • Hence, we choose the second population (see the quick check below).
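As a quick check, the two binomial likelihoods above can be computed with dbinom():

dbinom(7, size = 10, prob = 0.5)   # 0.117
dbinom(7, size = 10, prob = 0.7)   # 0.267
# the second likelihood is larger, so y = 7 is assigned to the second population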


Another example

We have 2 populations, where

  • First population: \(N(\mu_1, \sigma^2_1)\)

  • Second population: \(N(\mu_2, \sigma^2_2)\)

The likelihood for a single observation is

\[ f_j (y) = (2\pi \sigma^2_j)^{-1/2} \exp\{ -\frac{1}{2}(\frac{y - \mu_j}{\sigma_j})^2\} \]

Consider a likelihood ratio rule

\[ \begin{aligned} \Lambda &= \frac{\text{likelihood of y from pop 1}}{\text{likelihood of y from pop 2}} \\ &= \frac{f_1(y)}{f_2(y)} \\ &= \frac{\sigma_2}{\sigma_1} \exp\{-\frac{1}{2}[(\frac{y - \mu_1}{\sigma_1})^2- (\frac{y - \mu_2}{\sigma_2})^2] \} \end{aligned} \]

Hence, we classify into

  • pop 1 if \(\Lambda >1\)

  • pop 2 if \(\Lambda <1\)

  • for ties, flip a coin

Another way to think:

We classify into population 1 if the “standardized distance” of y from \(\mu_1\) is less than the “standardized distance” of y from \(\mu_2\); this is referred to as a quadratic discriminant rule.

(Significant simplification occurs in the special case where \(\sigma_1^2 = \sigma_2^2 = \sigma^2\).)

Thus, we classify into population 1 if

\[ (y - \mu_2)^2 > (y - \mu_1)^2 \]

or

\[ |y- \mu_2| > |y - \mu_1| \]

and

\[ -2 \log (\Lambda) = -2y \frac{(\mu_1 - \mu_2)}{\sigma^2} + \frac{(\mu_1^2 - \mu_2^2)}{\sigma^2} = \beta y + \alpha \]

Thus, we classify into population 1 if this is less than 0.

Discriminant classification rule is linear in y in this case.
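As a small illustration of this linear rule, here is a minimal R sketch with hypothetical values \(\mu_1 = 0\), \(\mu_2 = 3\), \(\sigma^2 = 1\) (these numbers are not from the text):

mu1 <- 0; mu2 <- 3; sigma2 <- 1   # hypothetical parameters
y <- 1.2                          # hypothetical observation

# -2 log(Lambda) = beta * y + alpha
neg2logLambda <- -2 * y * (mu1 - mu2) / sigma2 + (mu1^2 - mu2^2) / sigma2
if (neg2logLambda < 0) "classify into population 1" else "classify into population 2"
# equivalently, |y - mu1| < |y - mu2| gives the same allocation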


21.4.1.1 Multivariate Expansion

Suppose that there are 2 populations

  • \(N_p(\mathbf{\mu}_1, \mathbf{\Sigma}_1)\)

  • \(N_p(\mathbf{\mu}_2, \mathbf{\Sigma}_2)\)

\[ \begin{aligned} -2 \log(\frac{f_1 (\mathbf{x})}{f_2 (\mathbf{x})}) &= \log|\mathbf{\Sigma}_1| + (\mathbf{x} - \mathbf{\mu}_1)' \mathbf{\Sigma}^{-1}_1 (\mathbf{x} - \mathbf{\mu}_1) \\ &- [\log|\mathbf{\Sigma}_2|+ (\mathbf{x} - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1}_2 (\mathbf{x} - \mathbf{\mu}_2) ] \end{aligned} \]

Again, we classify into population 1 if this quantity is less than 0 and into population 2 otherwise. As in the univariate case with unequal variances, this is a quadratic discriminant rule.

If the covariance matrices are equal, \(\mathbf{\Sigma}_1 = \mathbf{\Sigma}_2 = \mathbf{\Sigma}\), we classify into population 1 if

\[ (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1}\mathbf{x} - \frac{1}{2} (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} (\mathbf{\mu}_1 + \mathbf{\mu}_2) \ge 0 \]

This linear discriminant rule is also referred to as Fisher’s linear discriminant function

By assuming the covariance matrices are equal, we assume that the shape and orientation of the two populations must be the same (which can be a strong restriction).

When \(\mathbf{\mu}_1, \mathbf{\mu}_2, \mathbf{\Sigma}\) are known, the probability of misclassification can be determined:

\[ \begin{aligned} P(2|1) &= P(\text{classify into pop 2} \mid \mathbf{x} \text{ is from pop 1}) \\ &= P\left((\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} \mathbf{x} \le \frac{1}{2} (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} (\mathbf{\mu}_1 + \mathbf{\mu}_2) \,\Big|\, \mathbf{x} \sim N(\mathbf{\mu}_1, \mathbf{\Sigma})\right) \\ &= \Phi(-\frac{1}{2} \delta) \end{aligned} \]

where

  • \(\delta^2 = (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} (\mathbf{\mu}_1 - \mathbf{\mu}_2)\)

  • \(\Phi\) is the standard normal cdf


Suppose there are \(h\) possible populations, distributed as \(N_p (\mathbf{\mu}_j, \mathbf{\Sigma})\) for \(j = 1,\dots,h\). Then the maximum likelihood (linear) discriminant rule allocates \(\mathbf{y}\) to the population j that minimizes the squared Mahalanobis distance

\[ (\mathbf{y} - \mathbf{\mu}_j)' \mathbf{\Sigma}^{-1} (\mathbf{y} - \mathbf{\mu}_j) \]
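A minimal sketch of this rule in R, assuming known (hypothetical) means and a common covariance matrix, and using the built-in mahalanobis() function:

mu <- list(c(0, 0), c(3, 1), c(1, 4))   # hypothetical mu_j for h = 3 populations
Sigma <- matrix(c(1, .3, .3, 1), 2)     # hypothetical common covariance
y <- c(2, 2)                            # observation to classify

d2 <- sapply(mu, function(m) mahalanobis(y, center = m, cov = Sigma))
which.min(d2)   # allocate y to the population with the smallest squared distance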


Bayes Discriminant Rules

If we know that population j has prior probability \(\pi_j\) (assume \(\pi_j > 0\)), we can form the Bayes discriminant rule.

This rule allocates an observation \(\mathbf{y}\) to the population for which \(\pi_j f_j (\mathbf{y})\) is maximized (a small numerical check follows the note below).

Note:

  • The maximum likelihood discriminant rule is a special case of the Bayes discriminant rule, obtained by setting all \(\pi_j = 1/h\)
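A small numerical check of the Bayes rule, reusing the binomial example above with hypothetical priors \(\pi = (0.9, 0.1)\):

prior <- c(0.9, 0.1)                              # hypothetical priors
lik <- dbinom(7, size = 10, prob = c(0.5, 0.7))   # f_j(y = 7) for the two populations
prior * lik
which.max(prior * lik)   # allocate to the population maximizing pi_j * f_j(y)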


Optimal Properties of Bayes Discriminant Rules

  • let \(p_{ii}\) be the probability of correctly assigning an observation from population i

  • then one rule (with probabilities \(p_{ii}\) ) is as good as another rule (with probabilities \(p_{ii}'\) ) if \(p_{ii} \ge p_{ii}'\) for all \(i = 1,\dots, h\)

  • The first rule is better than the alternative if \(p_{ii} > p_{ii}'\) for at least one i.

  • A rule for which there is no better alternative is called admissible

  • Bayes Discriminant Rules are admissible

  • If we use prior probabilities, we can form the posterior probability of a correct allocation, \(\sum_{i=1}^h \pi_i p_{ii}\)

  • Bayes Discriminant Rules have the largest possible posterior probability of correct allocation with respect to the prior

  • These properties show that Bayes Discriminant rule is our best approach.


Unequal Cost

  • We may also want to consider the cost of misallocation. Define \(c_{ij}\) to be the cost associated with allocating a member of population j to population i.

  • Assume that

    • \(c_{ij} >0\) for all \(i \neq j\)

    • \(c_{ij} = 0\) if \(i = j\)

  • We can determine the expected loss for an observation allocated to population i as \(\sum_j c_{ij} p_{ij}\), where the \(p_{ij}\) are the probabilities of allocating an observation from population j into population i

  • We want to minimize the expected loss of our rule. Using Bayes discrimination, allocate \(\mathbf{y}\) to the population j which minimizes \(\sum_{k \neq j} c_{jk} \pi_k f_k(\mathbf{y})\)

  • We could assign equal prior probabilities to each group and get a maximum-likelihood-type rule. Here, we would allocate \(\mathbf{y}\) to the population j which minimizes \(\sum_{k \neq j}c_{jk} f_k(\mathbf{y})\)

Example:

Two binomial populations, each with \(n = 10\) trials, with success probabilities \(p_1 = .5\) and \(p_2 = .7\).

Suppose the prior probability of being in the first population is .9.

However, suppose the cost of inappropriately allocating into the first population is 1 and the cost of incorrectly allocating into the second population is 5.

In this case, we pick population 1 over population 2

In general, we consider two regions, \(R_1\) and \(R_2\) associated with population 1 and 2:

\[ R_1: \frac{f_1 (\mathbf{x})}{f_2 (\mathbf{x})} \ge \frac{c_{12} \pi_2}{c_{21} \pi_1} \]

\[ R_2: \frac{f_1 (\mathbf{x})}{f_2 (\mathbf{x})} < \frac{c_{12} \pi_2}{c_{21} \pi_1} \]

where \(c_{12}\) is the cost of assigning a member of population 2 to population 1.
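Checking the cost example above in R (assuming the observation in question is the \(y = 7\) from the earlier binomial example):

f1 <- dbinom(7, 10, 0.5)   # 0.117
f2 <- dbinom(7, 10, 0.7)   # 0.267
pi1 <- 0.9; pi2 <- 0.1
c12 <- 1   # cost of assigning a member of population 2 to population 1
c21 <- 5   # cost of assigning a member of population 1 to population 2

f1 / f2                     # likelihood ratio, about 0.44
(c12 * pi2) / (c21 * pi1)   # threshold, about 0.02
# the ratio exceeds the threshold, so the observation falls in R1 (population 1)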


21.4.1.2 Discrimination Under Estimation

Suppose we know the form of the distributions for the populations of interest, but we still have to estimate the parameters.

Example:

we know the distributions are multivariate normal, but we have to estimate the means and variances

The maximum likelihood discriminant rule allocates an observation \(\mathbf{y}\) to population j when j maximizes the function

\[ f_j (\mathbf{y} |\hat{\theta}) \]

where \(\hat{\theta}\) are the maximum likelihood estimates of the unknown parameters


For instance, suppose we have 2 multivariate normal populations with distinct means but a common variance-covariance matrix.

The MLEs of \(\mathbf{\mu}_1\) and \(\mathbf{\mu}_2\) are \(\mathbf{\bar{y}}_1\) and \(\mathbf{\bar{y}}_2\), and the estimate of the common \(\mathbf{\Sigma}\) is \(\mathbf{S}\).

Thus, an estimated discriminant rule could be formed by substituting these sample values for the population values


21.4.2 Probabilities of Misclassification

When the distributions are known exactly, we can determine the misclassification probabilities exactly. However, when we need to estimate the population parameters, we also have to estimate the probability of misclassification.

  • Naive method

    • Plugging the parameter estimates into the formulas for the misclassification probabilities gives estimates of those probabilities.

    • But this will tend to be optimistic when the number of samples in one or more populations is small.

  • Resubstitution method

    • Use the proportion of the samples from population i that would be allocated to another population as an estimate of the misclassification probability

    • But also optimistic when the number of samples is small

  • Jack-knife estimates:

    • The above two methods use each observation both to estimate the parameters and to estimate the misclassification probabilities of the resulting discriminant rule

    • Alternatively, we determine the discriminant rule based upon all of the data except the k-th observation from the j-th population

    • then, determine if the k-th observation would be misclassified under this rule

    • perform this process for all \(n_j\) observations in population j. An estimate of the misclassification probability is the fraction of the \(n_j\) observations that were misclassified

    • repeat the process for other \(i \neq j\) populations

    • This method is more reliable than the others, but also computationally intensive

  • Cross-Validation


Summary

Consider the group-specific densities \(f_j (\mathbf{x})\) for multivariate vector \(\mathbf{x}\).

Assuming equal misclassification costs, the Bayes classification probability that \(\mathbf{x}\) belongs to the j-th population is

\[ p(j |\mathbf{x}) = \frac{\pi_j f_j (\mathbf{x})}{\sum_{k=1}^h \pi_k f_k (\mathbf{x})} \]

\(j = 1,\dots, h\)

where there are \(h\) possible groups.

We then classify into the group for which this probability of membership is largest

Alternatively, we can write this in terms of a generalized squared distance formulation

\[ D_j^2 (\mathbf{x}) = d_j^2 (\mathbf{x})+ g_1(j) + g_2 (j) \]

where

  • \(d_j^2(\mathbf{x}) = (\mathbf{x} - \mathbf{\mu}_j)' \mathbf{V}_j^{-1} (\mathbf{x} - \mathbf{\mu}_j)\) is the squared Mahalanobis distance from \(\mathbf{x}\) to the centroid of group j, and

    • \(\mathbf{V}_j = \mathbf{S}_j\) if the within group covariance matrices are not equal

    • \(\mathbf{V}_j = \mathbf{S}_p\) if a pooled covariance estimate is appropriate

and

\[ g_1(j) = \begin{cases} \ln |\mathbf{S}_j| & \text{within group covariances are not equal} \\ 0 & \text{pooled covariance} \end{cases} \]

\[ g_2(j) = \begin{cases} -2 \ln \pi_j & \text{prior probabilities are not equal} \\ 0 & \text{prior probabilities are equal} \end{cases} \]

then, the posterior probability of belonging to group j is

\[ p(j| \mathbf{x}) = \frac{\exp(-.5 D_j^2(\mathbf{x}))}{\sum_{k=1}^h \exp(-.5 D^2_k (\mathbf{x}))} \]

where \(j = 1,\dots , h\)

and \(\mathbf{x}\) is classified into group j if \(p(j | \mathbf{x})\) is largest for \(j = 1,\dots,h\) (or, \(D_j^2(\mathbf{x})\) is smallest).
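A minimal R sketch of this generalized squared distance and the resulting posterior probabilities, using made-up means, covariance matrices, and priors for two groups (none of the values below come from the text):

mu    <- list(c(0, 0), c(2, 1))                      # group centroids
V     <- list(diag(2), matrix(c(2, .5, .5, 1), 2))   # unequal covariances S_j
prior <- c(.7, .3)                                   # unequal priors pi_j
x     <- c(1, 0.5)                                   # observation to classify

D2 <- sapply(1:2, function(j) {
    d2 <- mahalanobis(x, center = mu[[j]], cov = V[[j]])   # d_j^2(x)
    d2 + log(det(V[[j]])) - 2 * log(prior[j])              # add g_1(j) and g_2(j)
})
exp(-0.5 * D2) / sum(exp(-0.5 * D2))   # posterior probabilities p(j | x)
which.min(D2)                          # classify into the group with smallest D_j^2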


21.4.3 Unknown Populations/ Nonparametric Discrimination

When the multivariate data are not Gaussian, or their distributional form is unknown, we can use the following methods.

21.4.3.1 Kernel Methods

We approximate \(f_j (\mathbf{x})\) by a kernel density estimate

\[ \hat{f}_j(\mathbf{x}) = \frac{1}{n_j} \sum_{i = 1}^{n_j} K_j (\mathbf{x} - \mathbf{x}_i) \]

where

  • \(K_j (.)\) is a kernel function satisfying \(\int K_j(\mathbf{z})d\mathbf{z} =1\)

  • \(\mathbf{x}_i\) , \(i = 1,\dots , n_j\) is a random sample from the j-th population.

Thus, after finding \(\hat{f}_j (\mathbf{x})\) for each of the \(h\) populations, the posterior probability of group membership is

\[ p(j |\mathbf{x}) = \frac{\pi_j \hat{f}_j (\mathbf{x})}{\sum_{k=1}^h \pi_k \hat{f}_k (\mathbf{x})} \]

where \(j = 1,\dots, h\)

There are different choices for the kernel function:

  • Uniform

  • Normal

  • Epanechnikov

  • Biweight

  • Triweight

With these kernels, we have to pick the “radius” (or variance, width, window width, bandwidth) of the kernel, which is a smoothing parameter (the larger the radius, the smoother the kernel estimate of the density).

To select the smoothing parameter, we can use the following approach.

If we believe the populations are close to multivariate normal, then

\[ R = \left(\frac{4/(2p+1)}{n_j}\right)^{1/(p+4)} \]

But since we do not know this for sure, we might choose several different values and select the one that gives the best out-of-sample or cross-validated discrimination.

Moreover, we also have to decide whether to use a different kernel smoothness for each population, which is analogous to the choice between individual and pooled covariances in the classical methodology.
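A rough sketch of kernel discriminant analysis with a Gaussian product kernel and the normal-reference radius above, using the iris data purely for illustration (a hand-rolled estimate, not a packaged routine):

X   <- as.matrix(iris[, 1:4])
grp <- iris$Species
p   <- ncol(X)
prior <- table(grp) / nrow(X)

## normal-reference radius for each group
radius <- sapply(split(as.data.frame(X), grp),
                 function(d) (4 / ((2 * p + 1) * nrow(d)))^(1 / (p + 4)))

## Gaussian kernel density estimate of f_j at a point x
fhat <- function(x, Xj, R) {
    z <- sweep(Xj, 2, x) / R
    mean(exp(-0.5 * rowSums(z^2)) / ((2 * pi)^(p / 2) * R^p))
}

x_new <- X[1, ]   # score the first observation as an example
fj <- mapply(function(d, R) fhat(x_new, as.matrix(d), R),
             split(as.data.frame(X), grp), radius)
prior * fj / sum(prior * fj)   # posterior probabilities p(j | x)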


21.4.3.2 Nearest Neighbor Methods

The nearest neighbor (also known as k-nearest neighbor) method performs the classification of a new observation vector based on the group membership of its nearest neighbors. In practice, we find

\[ d_{ij}^2 (\mathbf{x}, \mathbf{x}_i) = (\mathbf{x} - \mathbf{x}_i)' \mathbf{V}_j^{-1}(\mathbf{x} - \mathbf{x}_i) \]

which is the distance between the vector \(\mathbf{x}\) and the i-th observation in group j

We consider different choices for \(\mathbf{V}_j\)

For example,

\[ \mathbf{V}_j = \mathbf{S}_p \\ \mathbf{V}_j = \mathbf{S}_j \\ \mathbf{V}_j = \mathbf{I} \\ \mathbf{V}_j = diag (\mathbf{S}_p) \]

We find the \(k\) observations that are closest to \(\mathbf{x}\) (where users pick k). Then we classify into the most common population, weighted by the prior.
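A small sketch of this rule with \(\mathbf{V}_j = \mathbf{I}\) (Euclidean distance) and \(k = 3\), classifying a hypothetical new observation; other choices of \(\mathbf{V}_j\) only change the cov argument:

X   <- as.matrix(iris[, 1:4])
grp <- iris$Species
x_new <- c(5.8, 3.0, 4.4, 1.3)   # hypothetical new observation

d2 <- mahalanobis(X, center = x_new, cov = diag(ncol(X)))   # V_j = I
nn <- order(d2)[1:3]   # the k = 3 nearest observations
table(grp[nn])         # classify into the most common group among them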


21.4.3.3 Modern Discriminant Methods

Note:

Logistic regression (with or without random effects) is a flexible model-based procedure for classification between two populations.

The extension of logistic regression to the multi-group setting is polychotomous logistic regression (or multinomial regression).
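As a brief illustration (not from the text), polychotomous logistic regression can be fit with nnet::multinom; the iris data, used again later in this section, serve only as an example:

library(nnet)
fit_multinom <- multinom(Species ~ ., data = iris, trace = FALSE)
table(truth = iris$Species, predicted = predict(fit_multinom))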

The machine learning and pattern recognition literatures are growing rapidly, with a strong focus on nonlinear discriminant analysis methods such as:

  • radial basis function networks

  • support vector machines

  • multilayer perceptrons (neural networks)


The general framework

\[ g_j (\mathbf{x}) = \sum_{l = 1}^m w_{jl}\phi_l (\mathbf{x}; \mathbf{\theta}_l) + w_{j0} \]

where

  • \(j = 1,\dots, h\)

  • \(m\) nonlinear basis functions \(\phi_l\), each of which has \(n_m\) parameters given by \(\theta_l = \{ \theta_{lk}: k = 1, \dots , n_m \}\)

We assign \(\mathbf{x}\) to the j-th population if \(g_j(\mathbf{x})\) is the maximum for all \(j = 1,\dots, h\)

Development usually focuses on the choice and estimation of the basis functions \(\phi_l\) and on the estimation of the weights \(w_{jl}\).
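A hedged sketch of this framework, assuming Gaussian radial basis functions with centers chosen by k-means (here \(m = 10\)) and weights estimated by multinomial logistic regression via nnet::multinom; all of these choices are illustrative, not prescribed by the text:

library(nnet)

set.seed(1)
X <- as.matrix(iris[, 1:4])
centers <- kmeans(X, centers = 10)$centers   # the theta_l (basis-function centers)
width   <- median(dist(centers))             # a common width for all phi_l

## phi_l(x) = exp(-||x - center_l||^2 / (2 * width^2))
rbf_features <- function(X, centers, width) {
    D2 <- as.matrix(dist(rbind(X, centers)))[1:nrow(X), -(1:nrow(X)), drop = FALSE]^2
    exp(-D2 / (2 * width^2))
}

Phi <- rbf_features(X, centers, width)
fit <- multinom(iris$Species ~ Phi, trace = FALSE)   # estimates the weights w_jl, w_j0
mean(predict(fit) == iris$Species)                   # training accuracy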

More details: Statistical Pattern Recognition (Wiley Online Books).


21.4.4 Application

library(class)
## 
## Attaching package: 'class'
## The following object is masked from 'package:reshape':
## 
##     condense
library(klaR)
## Warning: package 'klaR' was built under R version 4.0.5
library(MASS)
library(tidyverse)

## Read in the data
crops <- read.table("images/crops.txt")
names(crops) <- c("crop", "y1", "y2", "y3", "y4")
str(crops)
## 'data.frame':    36 obs. of  5 variables:
##  $ crop: chr  "Corn" "Corn" "Corn" "Corn" ...
##  $ y1  : int  16 15 16 18 15 15 12 20 24 21 ...
##  $ y2  : int  27 23 27 20 15 32 15 23 24 25 ...
##  $ y3  : int  31 30 27 25 31 32 16 23 25 23 ...
##  $ y4  : int  33 30 26 23 32 15 73 25 32 24 ...
## Read in test data
crops_test <- read.table("images/crops_test.txt")
names(crops_test) <- c("crop", "y1", "y2", "y3", "y4")
str(crops_test)
## 'data.frame':    5 obs. of  5 variables:
##  $ crop: chr  "Corn" "Soybeans" "Cotton" "Sugarbeets" ...
##  $ y1  : int  16 21 29 54 32
##  $ y2  : int  27 25 24 23 32
##  $ y3  : int  31 23 26 21 62
##  $ y4  : int  33 24 28 54 16

21.4.4.1 LDA

The default prior is proportional to the group sample sizes, and lda and qda do not fit a constant or intercept term.

## Linear discriminant analysis
lda_mod <- lda(crop ~ y1 + y2 + y3 + y4,
               data = crops)
lda_mod
## Call:
## lda(crop ~ y1 + y2 + y3 + y4, data = crops)
## 
## Prior probabilities of groups:
##     Clover       Corn     Cotton   Soybeans Sugarbeets 
##  0.3055556  0.1944444  0.1666667  0.1666667  0.1666667 
## 
## Group means:
##                  y1       y2       y3       y4
## Clover     46.36364 32.63636 34.18182 36.63636
## Corn       15.28571 22.71429 27.42857 33.14286
## Cotton     34.50000 32.66667 35.00000 39.16667
## Soybeans   21.00000 27.00000 23.50000 29.66667
## Sugarbeets 31.00000 32.16667 20.00000 40.50000
## 
## Coefficients of linear discriminants:
##              LD1          LD2         LD3          LD4
## y1 -6.147360e-02  0.009215431 -0.02987075 -0.014680566
## y2 -2.548964e-02  0.042838972  0.04631489  0.054842132
## y3  1.642126e-02 -0.079471595  0.01971222  0.008938745
## y4  5.143616e-05 -0.013917423  0.05381787 -0.025717667
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.7364 0.1985 0.0576 0.0075
## Look at accuracy on the training data
lda_fitted <- predict(lda_mod,newdata = crops)
# Contingency table
lda_table <- table(truth = crops$crop, fitted = lda_fitted$class)
lda_table
##             fitted
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          6    0      3        0          2
##   Corn            0    6      0        1          0
##   Cotton          3    0      1        2          0
##   Soybeans        0    1      1        3          1
##   Sugarbeets      1    1      0        2          2
# training accuracy is only 18/36 = 0.5 (not good)

## Posterior probabilities of membership
crops_post <- cbind.data.frame(crops,
                               crop_pred = lda_fitted$class,
                               lda_fitted$posterior)
crops_post <- crops_post %>%
    mutate(missed = crop != crop_pred)
head(crops_post)
##   crop y1 y2 y3 y4 crop_pred     Clover      Corn    Cotton  Soybeans
## 1 Corn 16 27 31 33      Corn 0.08935164 0.4054296 0.1763189 0.2391845
## 2 Corn 15 23 30 30      Corn 0.07690181 0.4558027 0.1420920 0.2530101
## 3 Corn 16 27 27 26      Corn 0.09817815 0.3422454 0.1365315 0.3073105
## 4 Corn 18 20 25 23      Corn 0.10521511 0.3633673 0.1078076 0.3281477
## 5 Corn 15 15 31 32      Corn 0.05879921 0.5753907 0.1173332 0.2086696
## 6 Corn 15 32 32 15  Soybeans 0.09723648 0.3278382 0.1318370 0.3419924
##   Sugarbeets missed
## 1 0.08971545  FALSE
## 2 0.07219340  FALSE
## 3 0.11573442  FALSE
## 4 0.09546233  FALSE
## 5 0.03980738  FALSE
## 6 0.10109590   TRUE
# the posterior probability of Corn membership is much higher than the prior

## LOOCV
# leave-one-out cross validation for linear discriminant analysis
# cannot run the predict function on the CV = TRUE object because it already contains the within-sample (leave-one-out) predictions
lda_cv <- lda(crop ~ y1 + y2 + y3 + y4,
              data = crops, CV = TRUE)
# Contingency table
lda_table_cv <- table(truth = crops$crop, fitted = lda_cv$class)
lda_table_cv
##             fitted
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          4    3      1        0          3
##   Corn            0    4      1        2          0
##   Cotton          3    0      0        2          1
##   Soybeans        0    1      1        3          1
##   Sugarbeets      2    1      0        2          1
## Predict the test data
lda_pred <- predict(lda_mod, newdata = crops_test)

## Make a contingency table with truth and most likely class
table(truth=crops_test$crop, predict=lda_pred$class)
##             predict
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          0    0      1        0          0
##   Corn            0    1      0        0          0
##   Cotton          0    0      0        1          0
##   Soybeans        0    0      0        1          0
##   Sugarbeets      1    0      0        0          0

LDA didn’t do well on either the within-sample or the out-of-sample data.

21.4.4.2 QDA

## Quadratic discriminant analysis
qda_mod <- qda(crop ~ y1 + y2 + y3 + y4,
               data = crops)

## Look at accuracy on the training data
qda_fitted <- predict(qda_mod, newdata = crops)
# Contingency table
qda_table <- table(truth = crops$crop, fitted = qda_fitted$class)
qda_table
##             fitted
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          9    0      0        0          2
##   Corn            0    7      0        0          0
##   Cotton          0    0      6        0          0
##   Soybeans        0    0      0        6          0
##   Sugarbeets      0    0      1        1          4
## LOOCV
qda_cv <- qda(crop ~ y1 + y2 + y3 + y4,
              data = crops, CV = TRUE)
# Contingency table
qda_table_cv <- table(truth = crops$crop, fitted = qda_cv$class)
qda_table_cv
##             fitted
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          9    0      0        0          2
##   Corn            3    2      0        0          2
##   Cotton          3    0      2        0          1
##   Soybeans        3    0      0        2          1
##   Sugarbeets      3    0      1        1          1
## Predict the test data
qda_pred <- predict(qda_mod, newdata = crops_test)
## Make a contingency table with truth and most likely class
table(truth = crops_test$crop, predict = qda_pred$class)
##             predict
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          1    0      0        0          0
##   Corn            0    1      0        0          0
##   Cotton          0    0      1        0          0
##   Soybeans        0    0      0        1          0
##   Sugarbeets      0    0      0        0          1

21.4.4.3 KNN

knn takes design matrices of the features (train and test) rather than a formula interface.

## Design matrices
X_train <- crops %>%
    dplyr::select(-crop)
X_test <- crops_test %>%
    dplyr::select(-crop)
Y_train <- crops$crop
Y_test <- crops_test$crop

## Nearest neighbors with 2 neighbors
knn_2 <- knn(X_train, X_train, Y_train, k = 2)
table(truth = Y_train, fitted = knn_2)
##             fitted
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          7    0      3        1          0
##   Corn            0    6      0        1          0
##   Cotton          0    0      5        0          1
##   Soybeans        0    0      0        5          1
##   Sugarbeets      1    0      2        0          3
## Accuracy
mean(Y_train==knn_2)
## [1] 0.7222222
## Performance on test data
knn_2_test <- knn(X_train, X_test, Y_train, k = 2)
table(truth = Y_test, predict = knn_2_test)
##             predict
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          1    0      0        0          0
##   Corn            0    1      0        0          0
##   Cotton          0    0      0        0          1
##   Soybeans        0    0      0        1          0
##   Sugarbeets      0    0      0        0          1
## Accuracy
mean(Y_test==knn_2_test)
## [1] 0.8
## Nearest neighbors with 3 neighbors
knn_3 <- knn(X_train, X_train, Y_train, k = 3)
table(truth = Y_train, fitted = knn_3)
##             fitted
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          8    0      2        1          0
##   Corn            0    5      0        2          0
##   Cotton          1    0      3        0          2
##   Soybeans        0    1      0        4          1
##   Sugarbeets      0    0      0        2          4
## Accuracy
mean(Y_train==knn_3)
## [1] 0.6666667
## Performance on test data
knn_3_test <- knn(X_train, X_test, Y_train, k = 3)
table(truth = Y_test, predict = knn_3_test)
##             predict
## truth        Clover Corn Cotton Soybeans Sugarbeets
##   Clover          1    0      0        0          0
##   Corn            0    1      0        0          0
##   Cotton          0    0      1        0          0
##   Soybeans        0    0      0        1          0
##   Sugarbeets      0    0      0        0          1
## Accuracy
mean(Y_test==knn_3_test)
## [1] 1

21.4.4.4 Stepwise

Stepwise discriminant analysis using the stepclass function in the klaR package.

step <- stepclass(
    crop ~ y1 + y2 + y3 + y4,
    data = crops,
    method = "qda",
    improvement = 0.15
)
##  `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
## 36 observations of 4 variables in 5 classes; direction: both
## stop criterion: improvement less than 15%.
## correctness rate: 0.46667;  in: "y1";  variables (1): y1 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        0.16
step$process
##    step var varname result.pm
## 0 start   0      -- 0.0000000
## 1    in   1      y1 0.4666667
step$performance.measure
## [1] "correctness rate"

Iris Data

library(dplyr)
data('iris')
set.seed(1)
samp <-
    sample.int(nrow(iris), size = floor(0.70 * nrow(iris)), replace = F)

train.iris <- iris[samp,] %>% mutate_if(is.numeric,scale)
test.iris <- iris[-samp,] %>% mutate_if(is.numeric,scale)

library(ggplot2)
iris.model <- lda(Species ~ ., data = train.iris)
#pred
pred.lda <- predict(iris.model, test.iris)
table(truth = test.iris$Species, prediction = pred.lda$class)
##             prediction
## truth        setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         17         0
##   virginica       0          0        13
plot(iris.model)

iris.model.qda <- qda(Species~.,data=train.iris)
#pred
pred.qda <- predict(iris.model.qda,test.iris)
table(truth=test.iris$Species,prediction=pred.qda$class)
##             prediction
## truth        setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         16         1
##   virginica       0          0        13

21.4.4.5 PCA with Discriminant Analysis

We can use PCA for dimension reduction before applying discriminant analysis.

zeros <- as.matrix(read.table("images/mnist0_train_b.txt"))
nines <- as.matrix(read.table("images/mnist9_train_b.txt"))
train <- rbind(zeros[1:1000, ], nines[1:1000, ])
train <- train / 255 #divide by 255 per notes (so ranges from 0 to 1)
train <- t(train) #each column is an observation
image(matrix(train[, 1], nrow = 28), main = 'Example image, unrotated')

test <- rbind(zeros[2501:3000, ], nines[2501:3000, ])
test <- test / 255
test <- t(test)
y.train <- c(rep(0, 1000), rep(9, 1000))
y.test <- c(rep(0, 500), rep(9, 500))


library(MASS)
pc <- prcomp(t(train))
train.large <- data.frame(cbind(y.train, pc$x[, 1:10]))
large <- lda(y.train ~ ., data = train.large)
#the test data set needs to be constructed with the same 10 principal components
test.large <- data.frame(cbind(y.test, predict(pc, t(test))[, 1:10]))
pred.lda <- predict(large, test.large)
table(truth = test.large$y.test, prediction = pred.lda$class)
##      prediction
## truth   0   9
##     0 491   9
##     9   5 495
large.qda <- qda(y.train~.,data=train.large)
#prediction
pred.qda <- predict(large.qda,test.large)
table(truth=test.large$y.test,prediction=pred.qda$class)
##      prediction
## truth   0   9
##     0 493   7
##     9   3 497