If \(\mathcal{G}\) has \(K\) classes, there are \(K\) indicators \(Y_k,\ k = 1, \dots, K\), with \(Y_k = 1\) if \(G = k\) and \(Y_k = 0\) otherwise.
The \(N\) training instances form an \(N \times K\) indicator response matrix \(\mathbf{Y}\).
We fit a linear regression model to the columns of \(\mathbf{Y}\) simultaneously, giving fitted values (4.3): \[ \hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \]
with the \((p+1)\times K\) coefficient matrix \(\hat{\mathbf{B}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\). Here \(\mathbf{X}\) is the model matrix with \(p+1\) columns, including a leading column of 1's for the intercept.
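As a concrete illustration, here is a minimal NumPy sketch of the fit. The function name `fit_indicator_regression` and the assumption that class labels are coded \(0, \dots, K-1\) are my own, not from the text:

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Fit a linear regression to the N x K indicator response matrix.

    X : (N, p) array of inputs, without the intercept column.
    g : (N,) integer array of class labels coded 0, ..., K-1.
    Returns the (p + 1) x K coefficient matrix B_hat.
    """
    N = X.shape[0]
    # Indicator response matrix Y: Y[i, k] = 1 exactly when g[i] == k.
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0
    # Model matrix with a leading column of 1's for the intercept.
    X1 = np.hstack([np.ones((N, 1)), X])
    # B_hat = (X^T X)^{-1} X^T Y, computed via a least-squares solve
    # rather than an explicit inverse for numerical stability.
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B_hat
```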
A new input x is classified as follows:
- Compute the fitted output \(\hat{f}(x)^T=(1, x^T)\hat{\mathbf{B}}\), a \(K\)-vector;
- Identify the largest component and classify accordingly (4.4): \[ \hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x) \]
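Continuing the sketch above, the two steps might look like this (the helper name `classify` is mine):

```python
import numpy as np

def classify(X_new, B_hat):
    """Classify each row of X_new by the largest fitted component, as in (4.4)."""
    N = X_new.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X_new])  # prepend the intercept column
    F_hat = X1 @ B_hat                        # row i is the fitted K-vector f_hat(x_i)^T
    return np.argmax(F_hat, axis=1)           # index of the largest component
```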
Some properties:
- \(\sum_{k \in \mathcal{G}} \hat{f}_k(x) = 1\) for any \(x\), as long as the model contains an intercept.
- However, \(\hat{f}_k(x)\) can be negative or greater than 1, so the fitted values do not form a valid probability distribution.

These violations do not by themselves mean the approach won't work as a classifier.
A more simplistic viewpoint is to construct targets \(t_k\) for each class, where \(t_k\) is the \(k\)th column of the \(K \times K\) identity matrix, and then fit the linear model by least squares (4.5): \[ \min_{\mathbf{B}} \sum_{i=1}^N \| y_i - [(1, x_i^T)\mathbf{B}]^T\|^2 \]
The criterion is a sum of squared Euclidean distances of the fitted vectors from their targets. A new observation is classified to the closest target (4.6):
\[ \hat{G}(x) = \arg\min_{k} \| \hat{f}(x) - t_k \|^2 \]
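A sketch of the closest-target rule under the same assumptions as above (the function name is mine); it produces the same labels as `classify`:

```python
import numpy as np

def classify_closest_target(X_new, B_hat):
    """Classify each row of X_new to the closest target t_k, as in (4.6)."""
    N, K = X_new.shape[0], B_hat.shape[1]
    X1 = np.hstack([np.ones((N, 1)), X_new])
    F_hat = X1 @ B_hat                        # fitted K-vectors
    targets = np.eye(K)                       # row k is t_k, the kth column of the identity
    # Squared Euclidean distance from every fitted vector to every target.
    d2 = ((F_hat[:, None, :] - targets[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)
```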
The sum-of-squared-norm criterion is exactly the criterion for multiple-response linear regression, just viewed slightly differently: the components of (4.5) decouple and can be rearranged as a separate linear model for each element.
The closest-target classification rule (4.6) is exactly the same as the maximum-fitted-component criterion (4.4): since \(\|\hat{f}(x) - t_k\|^2 = \|\hat{f}(x)\|^2 - 2\hat{f}_k(x) + 1\), the distance is smallest for the class with the largest fitted component.
There is a serious problem with the regression approach when \(K \geq 3\): because of the rigid nature of the regression model, classes can be masked by others. A loose but general rule is that if \(K \geq 3\) classes are lined up, polynomial terms in the inputs up to degree \(K - 1\) might be needed to resolve them.
TODO: implement FIGURE 4.2
TODO: implement FIGURE 4.3
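As a possible starting point for these figure TODOs, here is a small sketch of the masking experiment behind Figures 4.2 and 4.3: three Gaussian classes lined up in two dimensions, classified by indicator regression with linear and then quadratic features. It reuses `fit_indicator_regression` and `classify` from the sketches above; the data-generating settings (means, variances, sample sizes, seed) are my own assumptions, not taken from the text, and plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 200
# Three spherical Gaussian classes lined up along the first coordinate.
means = np.array([[-4.0, 0.0], [0.0, 0.0], [4.0, 0.0]])
X = np.vstack([rng.normal(m, 1.0, size=(n, 2)) for m in means])
g = np.repeat(np.arange(K), n)

def training_error(features):
    B_hat = fit_indicator_regression(features, g, K)   # from the earlier sketch
    return np.mean(classify(features, B_hat) != g)     # from the earlier sketch

# Linear features: the middle class is masked and rarely, if ever, predicted.
print("linear    training error:", training_error(X))

# Augment with degree-2 polynomial terms (K - 1 = 2) to resolve the masking.
X_quad = np.hstack([X, X**2, X[:, :1] * X[:, 1:]])
print("quadratic training error:", training_error(X_quad))
```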