(2.12) It suffices to minimize EPE pointwise: \[f(x) = \operatorname{argmin}_c E_{Y|X}\big([Y - c]^2 \mid X = x\big)\]
(2.13) The solution is: \[f(x) = E(Y \mid X = x)\]
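To see why, expand the pointwise squared error about the conditional mean (a standard decomposition; the cross term vanishes):
\[E_{Y|X}\big([Y - c]^2 \mid X = x\big) = E_{Y|X}\big([Y - E(Y \mid X = x)]^2 \mid X = x\big) + [E(Y \mid X = x) - c]^2,\]
and only the second term depends on \(c\), so it is minimized by \(c = E(Y \mid X = x)\).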
The nearest-neighbor methods attempt to directly implement this recipe using the training data. Since there is typically at most one observation at any point x, we settle for: \[\hat{f}(x) = \text{Ave}(y_i|x_i \in N_k(x))\]
For large training sample size N, the points in the neighborhood are likely to be close to x, and as k gets large the average will get more stable.
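As a concrete sketch of this averaging rule, here is a minimal k-nearest-neighbor regressor in Python; the function name and the toy data are illustrative, not from the text:

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k):
    """Predict f(x0) as the average of y over the k nearest training points."""
    # Euclidean distances from the query point to every training input
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k closest training points, i.e. the neighborhood N_k(x0)
    neighbors = np.argsort(dists)[:k]
    # Ave(y_i | x_i in N_k(x0))
    return y_train[neighbors].mean()

# Toy usage: noisy linear data, prediction at a single query point
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 1))
y_train = 2 * X_train[:, 0] + rng.normal(scale=0.1, size=100)
print(knn_regress(np.array([0.5]), X_train, y_train, k=10))
```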
How does linear regression fit into this framework?
\[f(x) \approx x^T\beta\]
Plugging this into the EPE (2.9) and differentiating, we can solve for β: \[\beta = [E(XX^T)]^{-1}E(XY)\]
The least squares solution \(\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) amounts to replacing the expectations above by averages over the training data. Note: estimating each expectation by a sample average introduces a factor of \(1/N\), so \(E(XX^T)\) is estimated by \(\frac{1}{N}\mathbf{X}^T\mathbf{X}\) and \(E(XY)\) by \(\frac{1}{N}\mathbf{X}^T\mathbf{y}\); the two \(1/N\) factors cancel when the inverse is applied, leaving exactly the least squares formula.
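A small numerical check of this cancellation, using synthetic data and hypothetical variable names: the least squares formula and the "sample averages in place of expectations" formula yield the same \(\hat{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 500, 3
X = rng.normal(size=(N, p))                   # training inputs, rows are x_i^T
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares: beta_hat = (X^T X)^{-1} X^T y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate written with sample averages standing in for E(XX^T) and E(XY);
# the 1/N factors cancel, so the result is identical
beta_avg = np.linalg.solve(X.T @ X / N, X.T @ y / N)

print(np.allclose(beta_ls, beta_avg))         # True
```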