4.2 Linear Regression of an Indicator Matrix

We fit a linear regression model to each of the columns of \(\mathbf{Y}\) simultaneously, where \(\mathbf{Y}\) is the \(N \times K\) indicator response matrix whose rows each contain a single 1 marking the observation's class (4.3): \[ \hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \]

where \(\mathbf{\hat{B}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\) is the \((p+1)\times K\) coefficient matrix and \(\mathbf{X}\) is the model matrix with \(p+1\) columns, including a leading column of 1's for the intercept.

A new input \(x\) is classified by computing its fitted output \(\hat{f}(x)^T = (1, x^T)\mathbf{\hat{B}}\), a \(K\)-vector, and picking the largest component (4.4): \[ \hat{G}(x) = \underset{k \in \mathcal{G}}{argmax}\ \hat{f}_k(x) \]
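As a minimal NumPy sketch of the procedure (the data `X`, labels `g`, and `x_new` below are invented placeholders, not from the text):

```python
import numpy as np

# Synthetic stand-ins: X is N x p, g holds integer class labels in {0, ..., K-1}.
rng = np.random.default_rng(0)
N, p, K = 150, 2, 3
X = rng.normal(size=(N, p))
g = rng.integers(0, K, size=N)

Y = np.eye(K)[g]                          # N x K indicator response matrix
X1 = np.column_stack([np.ones(N), X])     # model matrix with leading intercept column

# B_hat = (X^T X)^{-1} X^T Y, computed via least squares for numerical stability
B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)      # shape (p + 1, K)

def classify(x, B=B_hat):
    """Fitted output f_hat(x)^T = (1, x^T) B; classify to the largest component."""
    f_hat = np.concatenate(([1.0], x)) @ B           # length-K fitted vector
    return int(np.argmax(f_hat))

x_new = rng.normal(size=p)
print(classify(x_new))
```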

Some properties:

- \(\sum_{k \in \mathcal{G}} \hat{f}_k(x) = 1\) for any \(x\), as long as the model contains an intercept
- \(\hat{f}_k(x)\) can be negative or greater than 1, so the fitted values are not reliable estimates of probabilities
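The first property can be checked numerically with the sketch above (it relies on \(\mathbf{X}\) containing the intercept column and on each row of \(\mathbf{Y}\) summing to 1):

```python
F_hat = X1 @ B_hat                          # fitted vectors for the training inputs
print(np.allclose(F_hat.sum(axis=1), 1.0))  # True: each fitted vector sums to 1
```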

A more simplistic viewpoint is to construct a target \(t_k\) for each class, where \(t_k\) is the kth column of the \(K \times K\) identity matrix \(\mathbf{I}_K\), and then fit the linear model by least squares (4.5): \[ \underset{\mathbf{B}}{\min} \sum_{i=1}^N \| y_i - [(1, x_i^T)\mathbf{B}]^T\|^2 \]

The criterion is a sum of squared Euclidean distances of the fitted vectors from their targets. A new observation is classified by computing its fitted vector \(\hat{f}(x)\) and choosing the closest target (4.6):

\[ \hat{G}(x) = \underset{k}{argmin} \| \hat{f}(x) - t_k \|^2 \]
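Since each target \(t_k\) has unit norm, \(\| \hat{f}(x) - t_k \|^2 = \|\hat{f}(x)\|^2 - 2\hat{f}_k(x) + 1\), so minimizing over \(k\) picks the same class as maximizing \(\hat{f}_k(x)\). Continuing the sketch above (reusing the assumed `B_hat`, `classify`, and `X`), a quick numerical check:

```python
def classify_closest_target(x, B=B_hat, K=K):
    """Rule (4.6): classify to the target t_k closest to the fitted vector."""
    f_hat = np.concatenate(([1.0], x)) @ B
    targets = np.eye(K)                               # t_k = kth column (= row) of I_K
    dists = np.sum((f_hat - targets) ** 2, axis=1)    # squared distance to each target
    return int(np.argmin(dists))

# Agrees with the argmax rule (4.4) on every training point
print(all(classify(x) == classify_closest_target(x) for x in X))
```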

There is a serious problem with the regression approach when K >= 3, and it is especially prevalent when K is large. Because of the rigid nature of the regression model, classes can be masked by others. A loose but general rule is that if K >= 3 classes are lined up, polynomial terms up to degree K - 1 might be needed to resolve them.
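The masking effect is easy to reproduce in a small simulation: three Gaussian classes with means lined up in one dimension, fit once with a linear basis and once with a quadratic one. The sketch below is an illustration under assumed data (means -4, 0, 4; equal class sizes), not a reproduction of the book's figures:

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_class, K = 100, 3
g = np.repeat(np.arange(K), n_per_class)                     # labels 0, 1, 2
x = rng.normal(loc=np.array([-4.0, 0.0, 4.0])[g])[:, None]   # classes lined up in 1-D
Y = np.eye(K)[g]                                             # indicator response matrix

def indicator_regression_predict(F):
    """Fit Y on the basis F (plus intercept) by least squares, classify by argmax."""
    F1 = np.column_stack([np.ones(len(F)), F])
    B, *_ = np.linalg.lstsq(F1, Y, rcond=None)
    return np.argmax(F1 @ B, axis=1)

pred_linear = indicator_regression_predict(x)                          # basis (1, x)
pred_quad = indicator_regression_predict(np.column_stack([x, x**2]))   # basis (1, x, x^2)

# The middle class is masked by the linear fit (it essentially never attains the
# maximum); adding the degree K - 1 = 2 term resolves it.
print("predictions per class, linear:   ", np.bincount(pred_linear, minlength=K))
print("predictions per class, quadratic:", np.bincount(pred_quad, minlength=K))
```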

TODO: implement FIGURE 4.2

TODO: implement FIGURE 4.3