4.3.3 Reduced-Rank Linear Discriminant Analysis

The K centroids (p-dimensional) lie in an affine subspace of dimension \(\le K - 1\). Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they contribute equally to each class. Thus, we might as well project the sphered data \(X^{*}\) onto this subspace.

If \(K = 3\), this could allow us to view the data in a two-dimensional plot.

If \(K > 3\), we might then ask for an \(L < K - 1\) dimensional subspace \(H_L \subseteq H_{K-1}\) that is optimal for LDA in some sense. In summary, finding the sequence of optimal subspaces for LDA involves the following steps:

- Compute the \(K \times p\) matrix of class centroids \(\mathbf{M}\) and the common (within-class) covariance matrix \(\mathbf{W}\).
- Compute \(\mathbf{M}^{*} = \mathbf{M}\mathbf{W}^{-1/2}\) using the eigen-decomposition of \(\mathbf{W}\).
- Compute \(\mathbf{B}^{*}\), the covariance matrix of \(\mathbf{M}^{*}\) (\(\mathbf{B}\) for between-class covariance), and its eigen-decomposition \(\mathbf{B}^{*} = \mathbf{V}^{*}\mathbf{D}_B\mathbf{V}^{*T}\). The columns \(v_l^{*}\) of \(\mathbf{V}^{*}\), in sequence from first to last, define the coordinates of the optimal subspaces.

Combining all these operations, the \(l\)th discriminant variable is given by \(Z_l = v_l^T X\) with \(v_l = \mathbf{W}^{-1/2} v_l^{*}\).

TODO: Implement the algorithm and plot Figure 4.8.
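As a rough sketch of the algorithm above (assuming NumPy; the function name `discriminant_coordinates` and the pooled estimate of \(\mathbf{W}\) via `np.cov` are choices of mine, not from the text):

```python
import numpy as np

def discriminant_coordinates(X, y):
    """Reduced-rank LDA via the sphering steps above: returns Z (N x (K-1))
    and the directions v_l (columns of V), so that Z = X @ V."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    N, K = X.shape[0], len(classes)

    # K x p matrix of class centroids M and pooled within-class covariance W.
    M = np.vstack([X[y == k].mean(axis=0) for k in classes])
    W = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
            for k in classes) / (N - K)

    # W^{-1/2} from the eigen-decomposition of W.
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(d)) @ U.T

    # Sphere the centroids, M* = M W^{-1/2}, and take B* = cov(M*).
    M_star = M @ W_inv_sqrt
    B_star = np.cov(M_star, rowvar=False)

    # eigh returns eigenvalues in ascending order; reverse so v_1* comes first.
    _, V_star = np.linalg.eigh(B_star)
    V_star = V_star[:, ::-1]

    # v_l = W^{-1/2} v_l*  and  Z_l = v_l^T X.
    V = W_inv_sqrt @ V_star[:, :K - 1]
    return X @ V, V
```

Plotting the first two columns of the returned \(Z\) against each other, colored by class, should give a view in the style of Figure 4.8; reproducing that figure exactly would additionally require the dataset used there.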

Fisher arrived at this decomposition via a different route. He posed the problem:

Find the linear combination \(Z = a^T X\) such that the between-class variance is maximized relative to the within-class variance.

The between-class variance of \(Z\) is \(a^T\mathbf{B}a\) and the within-class variance is \(a^T\mathbf{W}a\), where \(\mathbf{W}\) is defined earlier and \(\mathbf{B}\) is the covariance matrix of the class centroid matrix \(\mathbf{M}\). Note that \(\mathbf{B}+\mathbf{W}=\mathbf{T}\), where \(\mathbf{T}\) is the total covariance matrix of \(X\), ignoring class information.

Fisher’s problem amounts to maximizing the Rayleigh quotient (4.15): \[ \max_{a} \cfrac{a^T\mathbf{B}a}{a^T\mathbf{W}a} \]

or equivalently (4.16):

\[ \max_{a} a^T\mathbf{B}a \text{ subject to } a^T\mathbf{W}a = 1. \]
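A quick way to see the equivalence of (4.15) and (4.16), sketched via the standard Lagrangian argument (not spelled out in the text): the quotient in (4.15) is invariant to rescaling \(a\), so we may fix \(a^T\mathbf{W}a = 1\) and look for stationary points of

\[ \mathcal{L}(a, \lambda) = a^T\mathbf{B}a - \lambda\,(a^T\mathbf{W}a - 1), \qquad \frac{\partial\mathcal{L}}{\partial a} = 2\mathbf{B}a - 2\lambda\mathbf{W}a = 0 \;\Longrightarrow\; \mathbf{W}^{-1}\mathbf{B}\,a = \lambda\, a. \]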

This is a generalized eigenvalue problem, with \(a\) given by the eigenvector of \(\mathbf{W}^{-1}\mathbf{B}\) corresponding to its largest eigenvalue.

- The optimal \(a_1\) is identical to \(v_1\) defined above.
- The next direction \(a_2\), orthogonal in \(\mathbf{W}\) to \(a_1\), is \(a_2 = v_2\), and so on.
- The \(a_l\) are referred to as discriminant coordinates, or canonical variates.
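As a numerical cross-check of Fisher's route (again assuming NumPy/SciPy; `fisher_directions` is a hypothetical name), the sketch below solves (4.15)-(4.16) directly as the generalized eigenproblem \(\mathbf{B}a = \lambda\mathbf{W}a\). Up to sign and scaling of the columns, its output should agree with the \(v_l\) from the earlier sketch:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y):
    """Fisher's route: solve the generalized eigenproblem B a = lambda W a."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    N, K = X.shape[0], len(classes)

    M = np.vstack([X[y == k].mean(axis=0) for k in classes])
    W = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
            for k in classes) / (N - K)
    B = np.cov(M, rowvar=False)  # between-class covariance of the centroid matrix M

    # eigh(B, W) solves the symmetric-definite generalized problem and normalizes
    # its eigenvectors so that a^T W a = 1 -- exactly constraint (4.16).
    # Eigenvalues come back ascending, so reverse to put a_1 first.
    _, A = eigh(B, W)
    return A[:, ::-1][:, :K - 1]
```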

To summarize the developments so far:

TODO: Finish the last paragraphs.