Our predictor G(x) takes values in a discrete set \(\mathcal{G}\).
We can divide the input space into regions labeled according to the classification.
The decision boundaries between these regions are linear; this is what we mean by linear methods for classification.
Suppose there are K classes and we fit a linear model for each: \(\hat{f}_k(x)=\hat{\beta}_{k0} + \hat{\beta}_k^Tx\). The decision boundary between classes k and l is the set where \(\hat{f}_k(x)=\hat{f}_l(x)\), that is, \(\{ x : (\hat{\beta}_{k0}-\hat{\beta}_{l0}) + (\hat{\beta}_{k}-\hat{\beta}_{l})^Tx = 0 \}\), an affine set or hyperplane. This regression approach is a member of a class of methods that model discriminant functions \(\delta_k(x)\) for each class, and then classify x to the class with the largest value for its discriminant function. Methods that model the posterior probabilities Pr(G = k | X = x) are also in this class. Clearly, if either the \(\delta_k(x)\) or Pr(G = k | X = x) are linear in x, then the decision boundaries will be linear.
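To make the discriminant-function view concrete, here is a minimal NumPy sketch of the regression approach: each class gets a least-squares fit to its indicator response, and a point is assigned to the class with the largest fitted value. The function names and data layout are illustrative assumptions, not from the text.

```python
import numpy as np

def fit_linear_discriminants(X, y, K):
    """Return a (K, p+1) matrix whose k-th row is (beta_k0, beta_k).

    X: (n, p) inputs; y: integer class labels in {0, ..., K-1}.
    """
    n, p = X.shape
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0                  # indicator response matrix
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend intercept column
    # Least-squares fit of each indicator column: hat{f}_k(x) = beta_k0 + beta_k^T x
    B, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return B.T

def classify(B, X):
    """Assign each x to the class with the largest fitted discriminant value."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ B.T, axis=1)

# The boundary between classes k and l is where the fitted values tie:
# {x : (beta_k0 - beta_l0) + (beta_k - beta_l)^T x = 0}, a hyperplane.
```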
Actually, all we require is that some monotone transformation of \(\delta_k\) or of Pr(G = k | X = x) be linear in x for the decision boundaries to be linear. For example, a popular model for the posterior probabilities of two classes is (4.1):
\[ \begin{aligned} \Pr(G = 1 \mid X = x) &= \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)},\\ \Pr(G = 2 \mid X = x) &= \frac{1}{1 + \exp(\beta_0 + \beta^T x)}. \end{aligned} \]
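As a quick illustration of (4.1), the sketch below evaluates the two posterior probabilities at a given x; the coefficient values are made up for the example, not fitted from data.

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Two-class logistic posteriors from (4.1); they sum to 1 by construction."""
    z = beta0 + beta @ x
    p1 = np.exp(z) / (1.0 + np.exp(z))   # Pr(G = 1 | X = x)
    p2 = 1.0 / (1.0 + np.exp(z))         # Pr(G = 2 | X = x)
    return p1, p2

# Illustrative coefficients
p1, p2 = posteriors(np.array([0.5, -1.2]), beta0=0.3, beta=np.array([2.0, 1.0]))
```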
Here the monotone transformation is the logit (or log-odds) transformation, \(\log[p/(1-p)]\); in fact we see that (4.2):
\[ \log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = 2 \mid X = x)} = \beta_0 + \beta^T x. \]
The decision boundary is the set of points for which the log-odds are zero, the hyperplane \(\{x : \beta_0 + \beta^T x = 0\}\).
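Because the log-odds in (4.2) are linear in x, classifying by the larger posterior reduces to checking the sign of the linear score. A small sketch, again with illustrative coefficients:

```python
import numpy as np

def predict_class(X, beta0, beta):
    """Assign class 1 where beta0 + x^T beta > 0 (i.e. Pr(G=1|x) > 1/2), else class 2."""
    scores = beta0 + X @ beta
    return np.where(scores > 0, 1, 2)

# Points on opposite sides of the hyperplane {x : beta0 + beta^T x = 0}
# receive different labels: this returns array([2, 1]).
labels = predict_class(np.array([[-0.5, -1.2], [1.5, 0.3]]),
                       beta0=0.3, beta=np.array([2.0, 1.0]))
```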