4.3 Linear Discriminant Analysis

Suppose \(f_k(x)\) is the class-conditional density of X in class \(G = k\), and let \(\pi_k\) be the prior probability of class \(k\), with \(\sum_{k=1}^K \pi_k = 1\). Bayes' theorem gives us (4.7):

\[ Pr(G = k \mid X = x) = \cfrac{f_k(x)\pi_k}{\sum_{l=1}^K f_l(x)\pi_l} \]
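As a minimal numerical sketch of (4.7), the posterior is just the products \(f_k(x)\pi_k\) renormalized to sum to one (the density and prior values below are made up, purely for illustration):

```python
import numpy as np

# Hypothetical class densities f_k(x) evaluated at one point x, and priors pi_k
# (illustrative values only, not from the text).
f = np.array([0.05, 0.30, 0.10])   # f_1(x), f_2(x), f_3(x)
pi = np.array([0.5, 0.3, 0.2])     # pi_1, pi_2, pi_3, summing to 1

# Bayes theorem (4.7): Pr(G = k | X = x) is proportional to f_k(x) * pi_k
posterior = f * pi / np.sum(f * pi)
print(posterior)                   # posterior probabilities, summing to 1
```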

Suppose that we model each class density as multivariate Gaussian (4.8):

\[ f_k(x) = \cfrac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)} \]
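As a sketch, (4.8) can be evaluated directly with NumPy (the parameter values below are illustrative, not from the text):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density (4.8) evaluated at a point x."""
    p = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Illustrative parameters for one class
mu_k = np.array([0.0, 1.0])
Sigma_k = np.array([[1.0, 0.3],
                    [0.3, 2.0]])
print(gaussian_density(np.array([0.5, 0.5]), mu_k, Sigma_k))
```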

Linear discriminant analysis (LDA) arises when \(\Sigma_k = \Sigma \text{ }\forall k\). The log-ratio between two classes \(k\) and \(l\) is (4.9):

\[ \begin{align} log \cfrac{Pr(G=k|X=x)}{Pr(G=l|X=x)} &= log \cfrac{f_k(x)}{f_l(x)} + log \cfrac{\pi_k}{\pi_l}\\ &= log \cfrac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l), \end{align} \]

an equation linear in \(x\).
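To fill in the step between the two lines of (4.9): with a common covariance \(\Sigma\) the normalization factors cancel in the ratio \(f_k(x)/f_l(x)\), and the quadratic terms \(x^T\Sigma^{-1}x\) cancel in the difference of the exponents:

\[ \begin{align} log \cfrac{f_k(x)}{f_l(x)} &= -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)\\ &= x^T\Sigma^{-1}(\mu_k-\mu_l) - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l), \end{align} \]

which is where the linearity in \(x\) comes from.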

From (4.9) we see that the linear discriminant functions (4.10):

\[ \delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + log \pi_k \]

are an equivalent description of the decision rule, with \(G(x) = argmax_k \delta_k(x)\).
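A minimal sketch of the rule \(G(x) = argmax_k \delta_k(x)\) from (4.10), assuming the parameters \(\mu_k\), \(\Sigma\), \(\pi_k\) are known (the values below are made up):

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """Linear discriminant functions (4.10): delta_k(x) for each class k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
        for mu, pi in zip(mus, priors)
    ])

# Illustrative parameters for three classes in p = 2 dimensions
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 2.0])]
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
priors = [0.5, 0.3, 0.2]

x = np.array([1.0, 0.5])
deltas = lda_discriminants(x, mus, Sigma, priors)
print(deltas, "-> class", np.argmax(deltas) + 1)   # G(x) = argmax_k delta_k(x)
```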

In practice we do not know the parameters of the Gaussian distributions, and we estimate them from the training data:

\[ \hat{\pi}_k = N_k/N, \qquad \hat{\mu}_k = \sum_{g_i=k} x_i/N_k, \qquad \hat{\Sigma} = \sum_{k=1}^K \sum_{g_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T/(N-K), \]

where \(N_k\) is the number of class-\(k\) observations.
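A sketch of these plug-in estimates, assuming the training data are an \(N \times p\) array `X` and a length-\(N\) integer label array `g` (the function and variable names are my own):

```python
import numpy as np

def estimate_lda_params(X, g):
    """Plug-in estimates: class priors N_k/N, class means, and the pooled covariance."""
    g = np.asarray(g)
    classes = np.unique(g)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(g == k) for k in classes])        # hat(pi)_k = N_k / N
    mus = np.array([X[g == k].mean(axis=0) for k in classes])    # hat(mu)_k
    Sigma = np.zeros((p, p))                                     # pooled within-class covariance
    for mu, k in zip(mus, classes):
        D = X[g == k] - mu
        Sigma += D.T @ D
    Sigma /= (N - K)
    return priors, mus, Sigma
```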

With two classes, the LDA rule classifies to class 2 if (4.11):

\[ x^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2}\hat{\mu}_2^T\hat{\Sigma}^{-1}\hat{\mu}_2 - \frac{1}{2}\hat{\mu}_1^T\hat{\Sigma}^{-1}\hat{\mu}_1 + log(N_1/N) - log(N_2/N) \]
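A direct transcription of the two-class rule (4.11), with the hatted quantities passed in as arguments (the names are my own); note that \(log(N_1/N) - log(N_2/N) = log(N_1/N_2)\):

```python
import numpy as np

def classify_two_class(x, mu1, mu2, Sigma_hat, N1, N2):
    """Two-class LDA rule (4.11): return class 2 if the linear score exceeds the cut-point."""
    score = x @ np.linalg.solve(Sigma_hat, mu2 - mu1)
    cut = (0.5 * mu2 @ np.linalg.solve(Sigma_hat, mu2)
           - 0.5 * mu1 @ np.linalg.solve(Sigma_hat, mu1)
           + np.log(N1 / N2))          # = log(N1/N) - log(N2/N)
    return 2 if score > cut else 1
```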

Suppose we code the targets in the two classes as +1 and -1. It is easy to show that the coefficient vector from least squares is proportional to the LDA direction given in (4.11). However, unless \(N_1 = N_2\), the intercepts are different. (TODO: solve exercise 4.11)

The derivation of the LDA direction via least squares does not use a Gaussian assumption; only the derivation of the intercept or cut-point via (4.11) does. Thus, if the Gaussian assumption is in doubt, it makes sense to instead choose the cut-point that empirically minimizes the training error.
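A quick numerical check of the proportionality claim on simulated two-class data (the data, seed, and variable names below are my own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-class data
N1, N2, p = 60, 40, 3
X1 = rng.normal(size=(N1, p)) + np.array([1.0, 0.0, -1.0])
X2 = rng.normal(size=(N2, p)) + np.array([-0.5, 1.0, 0.5])
X = np.vstack([X1, X2])
y = np.concatenate([-np.ones(N1), np.ones(N2)])     # targets coded -1 / +1

# Least-squares fit with an intercept column
A = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0][1:]     # drop the intercept

# LDA direction from the pooled covariance estimate
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N1 + N2 - 2)
lda_dir = np.linalg.solve(S, mu2 - mu1)

print(beta / lda_dir)   # entrywise ratio is a common constant: the directions are proportional
```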

With more than two classes, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems of that approach.

Quadratic discriminant functions

If the \(\Sigma_k\) are not assumed to be equal, then we get the quadratic discriminant functions (QDA) (4.12):

\[ \delta_k(x) = -\cfrac{1}{2}log|\Sigma_k| - \cfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + log \pi_k \]
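A sketch of (4.12) for a single class \(k\), with the class parameters passed in directly (the names are my own); the decision rule is still the argmax over classes:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """Quadratic discriminant function (4.12) for class k."""
    sign, logdet = np.linalg.slogdet(Sigma_k)   # log|Sigma_k|, numerically stable
    diff = x - mu_k
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma_k, diff) + np.log(pi_k)
```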