## 6.4 LDA

Linear Discriminant Analysis (LDA) is a supervised machine learning classification (binary or multimonial) and dimension reduction method. LDA finds linear combinations of variables that best “discriminate” the response classes.

One approach (Welch) to LDA assumes the predictor variables are continuous random variables normally distributed and with equal variance. You will typically scale the data to meet these conditions.

For a response variable of $$k$$ levels, LDA produces $$k-1$$ discriminants using Bayes Theorem.

$Pr[Y = C_l | X] = \frac{P[Y = C_l] P[X | Y = C_l]}{\sum_{l=1}^C Pr[Y = C_l] Pr[X | Y = C_l]}$

The probability that $$Y$$ equals class level $$C_l$$ given the predictors $$X$$ equals the prior probability of $$Y$$ multiplied by the probability of observing $$X$$ if $$Y = C_l$$ divided by the sum of all priors and probabilities of $$X$$ given the priors. The predicted value for any $$X$$ is just the $$C_l$$ with the maximimum probability.

One way to calculate the probabilities is by assuming $$X$$ has a multivariate normal distribution with means $$\mu_l$$ and common variance $$\Sigma$$. Then the linear discriminant function group $$l$$ is

$X'\Sigma^{-1}\mu_l - 0.5 \mu_l^{'}\Sigma^{-1}\mu_l + \log(Pr[Y = C_l])$ The theoretical means and covariance matrix is estimated by the sample mean $$\mu = \bar{x}_l$$ and covariance $$\Sigma = S$$, and the population predictors $$X$$ are replaced with the sample predictors $$u$$.

Another approach (Fisher) to LDA is to find a linear combination of predictors that maximizes the between-group covariance matrix $$B$$ relative to the within-group covariance matrix $$W$$.

$\frac{b'Bb}{b'Wb}$

The solution to this optimization problem is the eigenvector corresponding to the largest eigenvalue of $$W^{-1}B$$. This vector is a linear discrminant. Solving for two-group setting gives the discriminant function $$S^{-1}(\bar{x}_1 - \bar{x}_2)$$ where $$S^{-1}$$ is the inverse of the covariance matrix of the data and $$\bar{x}_1$$ and $$\bar{x}_2$$ are the means of each predictor in response groups 1 and 2. In practice, a new sample, $$u$$, is projected onto the discriminant function as $$uS^{-1}(\bar{x}_1 - \bar{x}_2)$$, which returns a discriminant score. A new sample is then classified into group 1 if the sample is closer to the group 1 mean than the group 2 mean in the projection:

$\left| b'(u - \bar{x}_1) \right| - \left| b'(u - \bar{x}_2) \right| < 0$

In general, the model requires $$CP + P(P + 1)/2$$ parameters with $$P$$ predictors and $$C$$ classes. The value of the extra parameters in LDA models is that the between-predictor correlations are explicitly handled by the model. This should provide some advantage to LDA over logistic regression when there are substantial correlations, although both models will break down when the multicollinearity becomes extreme.

Fisher’s formulation is intuitive, easy to solve mathematically, and, unlike Welch’s approach, involves no assumptions about the underlying distributions of the data.

In practice, it is best to center and scale predictors and remove near-zero-variance predictors. If the matrix is still not invertible, use PLS or regularization.