All the models described have a smoothing or complexity parameter that has to be determined:

- the multiplier of the penalty term;
- the width of the kernel;
- the number of basis functions.
We cannot use RSS on the training data to determine these parameters, since we would always pick those that gave interpolating fits and hence zero residuals.
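As a quick illustration (a minimal sketch on simulated data, with an assumed true function \(f(x)=\sin 4x\) and noise level 0.3, not taken from the text), the training RSS of a kNN fit is exactly zero at \(k=1\), so it can never be used to choose \(k\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=n)   # Y = f(X) + eps

def knn_fit(x0, x, y, k):
    """k-nearest-neighbour regression estimate at x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

for k in (1, 5, 20):
    resid = y - np.array([knn_fit(xi, x, y, k) for xi in x])
    print(f"k={k:2d}  training RSS = {np.sum(resid**2):.4f}")
# k=1 interpolates the training data (RSS = 0), so minimizing training RSS
# would always select k=1 regardless of how well it generalizes.
```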
The kNN regression fit \(\hat{f_k}(x_0)\) illustrates the competing forces that affect the predictive ability of such approximations. Suppose the data arise from a model \(Y = f(X) + \varepsilon\) with \(E(\varepsilon)=0\) and \(Var(\varepsilon) = \sigma^2\). We assume that the values of \(x_i\) in the sample are fixed. The expected prediction error (EPE) at \(x_0\) is:
\[ \begin{align} EPE_k(x_0) &= E[(Y - \hat{f_k}(x_0))^2|X=x_0]\\ &=\sigma^2 + [Bias^2(\hat{f_k}(x_0)) + Var_\tau(\hat{f_k}(x_0))]\\ &=\sigma^2 + \left[f(x_0) - \frac{1}{k}\sum_{l=1}^k{f(x_{(l)})} \right]^2 + \frac{\sigma^2}{k} \end{align} \]
The subscript in parentheses, \((l)\), indicates the sequence of nearest neighbors to \(x_0\), so \(x_{(l)}\) is the \(l\)-th closest training point. There are three terms in this expression:
\(\sigma^2\) is the irreducible error: it is beyond our control, even if we know the true \(f(x_0)\).
The squared bias term, \([E_\tau(\hat{f_k}(x_0))-f(x_0)]^2\), measures the difference between the true mean \(f(x_0)\) and the expected value of the estimate, where the expectation averages over the randomness in the training data. This term will most likely increase with \(k\) if the true function is reasonably smooth, since larger neighborhoods pull in points farther from \(x_0\).
The variance term, which decreases as the inverse of \(k\). Expanding it, and using the independence of the errors \(\varepsilon_l\) (so the cross terms vanish in expectation):
\[ \begin{align} Var_\tau(\hat{f_k}(x_0)) &= E_\tau\left[\hat{f_k}(x_0) - E_\tau(\hat{f_k}(x_0))\right]^2\\ &= E_\tau\left[\frac{1}{k}\sum_{l=1}^k (f(x_{(l)}) + \varepsilon_l) - \frac{1}{k}\sum_{l=1}^k{f(x_{(l)})}\right]^2\\ &= E_\tau\left[\frac{1}{k}\sum_{l=1}^k \varepsilon_l\right]^2\\ &= \frac{1}{k^2}E_\tau\left[\sum_{l=1}^k \varepsilon_l\right]^2\\ &= \frac{1}{k^2}E_\tau\left[\sum_{l=1}^k \varepsilon_l^2\right]\\ &= \frac{\sigma^2}{k} \end{align} \]
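A quick Monte Carlo check of the \(\sigma^2/k\) result (a sketch with an assumed \(f(x)=\sin 4x\) and \(\sigma=0.5\); the \(x_i\) are held fixed across training sets, as in the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma, x0 = 100, 10, 0.5, 0.5
x = np.sort(rng.uniform(0, 1, n))           # x_i held fixed across training sets
f = lambda t: np.sin(4 * t)                 # assumed true regression function
neighbors = np.argsort(np.abs(x - x0))[:k]  # the k nearest x_i to x0 (also fixed)

# Draw many training sets by resampling only the errors eps_i
fits = []
for _ in range(20_000):
    y = f(x) + rng.normal(scale=sigma, size=n)
    fits.append(y[neighbors].mean())        # \hat{f_k}(x0)

print("empirical Var :", np.var(fits))
print("sigma^2 / k   :", sigma**2 / k)      # the two should agree closely
```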
As the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease; the opposite behavior occurs as the model complexity is decreased. For kNN the complexity is controlled by \(k\): small \(k\) gives a more complex, more local fit, while large \(k\) gives a smoother but more biased one.
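A small simulation makes the tradeoff concrete (again a sketch under the same assumed \(f(x)=\sin 4x\), \(\sigma=0.5\), and fixed design): as \(k\) grows, the variance shrinks roughly like \(\sigma^2/k\) while the squared bias grows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, x0 = 100, 0.5, 0.5
f = lambda t: np.sin(4 * t)                  # assumed true regression function
x = np.sort(rng.uniform(0, 1, n))            # fixed design points
order = np.argsort(np.abs(x - x0))           # neighbors of x0, closest first

for k in (1, 3, 10, 30, 90):
    f_nbrs = f(x[order[:k]])                 # true f at the k nearest x_i
    # \hat{f_k}(x0) over many training sets: mean of (f(x_(l)) + eps_l)
    fits = np.array([(f_nbrs + rng.normal(scale=sigma, size=k)).mean()
                     for _ in range(10_000)])
    bias2 = (fits.mean() - f(x0)) ** 2
    var = fits.var()
    print(f"k={k:2d}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"EPE={sigma**2 + bias2 + var:.4f}")
# Small k (high complexity): low bias, high variance.
# Large k (low complexity):  larger bias, variance ~ sigma^2/k.
```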