We have broken down the MSE into two components: variance and squared bias. Such a decomposition is always possible and is known as the bias-variance decomposition.
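In general, for any estimator \(\hat{\theta}\) of \(\theta\), the identity follows by adding and subtracting \(E(\hat{\theta})\) inside the square (a standard step, spelled out here for reference): \[
\begin{align}
E(\hat{\theta}-\theta)^2 & = E\left(\hat{\theta}-E(\hat{\theta}) + E(\hat{\theta})-\theta\right)^2\\
& = E\left(\hat{\theta}-E(\hat{\theta})\right)^2 + \left(E(\hat{\theta})-\theta\right)^2\\
& = \text{Var}(\hat{\theta}) + \text{Bias}^2(\hat{\theta})
\end{align}
\]
The cross term vanishes because \(E\left(\hat{\theta}-E(\hat{\theta})\right) = 0\).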
(2.26) Suppose that the relationship between Y and X is linear with some noise:
\[Y = X^T\beta + \varepsilon \]
where \(\varepsilon \sim N(0, \sigma^2)\) and we fit the model by least squares to the training data. For a test point \(x_0\) we have \(\hat{y_0}=x_0^T\hat{\beta}\), which can be written as \(\hat{y_0} = x_0^T\beta + \sum_{i=1}^N {l_i(x_0)\varepsilon_i}\), where \(l_i(x_0)\) is the \(i\)th element of \(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\).
We can get \(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\) from \((x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T)^T\) by using two matrix properties, together with the symmetry of \(\mathbf{X}^T\mathbf{X}\):
\((\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T\)
\((\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1}\)
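Spelling out the substitution (a worked step filled in here): \[
\begin{align}
\hat{y_0} & = x_0^T\hat{\beta} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty\\
& = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \varepsilon)\\
& = x_0^T\beta + x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon
\end{align}
\]
and since \(\mathbf{X}^T\mathbf{X}\) is symmetric, \((x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T)^T = \mathbf{X}((\mathbf{X}^T\mathbf{X})^T)^{-1}x_0 = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\), so the row vector multiplying \(\varepsilon\) has \(l_i(x_0)\) as its \(i\)th entry.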
Under this model the least squares estimates are unbiased, so the expected prediction error is: \[
\begin{align}
\text{EPE}(x_0) & = E_{y_0|x_0}E_\tau(y_0-\hat{y_0})^2\\
& = \text{Var}(y_0|x_0) + \text{Var}_\tau(\hat{y_0}) + \text{Bias}^2(\hat{y_0})\\
& = \sigma^2 + E_{\tau}x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\sigma^2 + 0^2
\end{align}
\]
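To see why the first line splits this way, write \(y_0 = x_0^T\beta + \varepsilon_0\), where \(\varepsilon_0\) is independent of the training set \(\tau\) (a step spelled out here): \[
\begin{align}
E_{y_0|x_0}E_\tau(y_0-\hat{y_0})^2 & = E(\varepsilon_0 + x_0^T\beta - \hat{y_0})^2\\
& = E(\varepsilon_0^2) + E_\tau(x_0^T\beta - \hat{y_0})^2\\
& = \sigma^2 + \text{Var}_\tau(\hat{y_0}) + (E_\tau\hat{y_0} - x_0^T\beta)^2
\end{align}
\]
The cross term vanishes because \(E(\varepsilon_0) = 0\) and \(\varepsilon_0\) is independent of \(\hat{y_0}\); the squared bias \((E_\tau\hat{y_0} - x_0^T\beta)^2\) is \(0\) because the least squares estimates are unbiased.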
(2.27) and we can find the variance: \[
\begin{align}
\text{Var}_\tau(\hat{y_0}) & = E_\tau(\hat{y_0} - E_\tau(\hat{y_0}))^2\\
& = E_\tau(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon)^2\\
& = E_\tau(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon\varepsilon^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0)
\end{align}
\]
where the second line uses \(\hat{y_0} - E_\tau(\hat{y_0}) = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon\), and the third follows because this quantity is a scalar, so its square equals itself times its transpose. Since \(\varepsilon\) is independent of \(\mathbf{X}\) and \(E(\varepsilon\varepsilon^T)=\sigma^2\mathbf{I}_N\), we can simplify further: \[\text{Var}_\tau(\hat{y_0}) = \sigma^2x_0^{T}E_\tau[(\mathbf{X}^T\mathbf{X})^{-1}]x_0\]
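We can check this formula numerically by simulating many training sets (a minimal sketch, not from the text; the sizes `N`, `p`, `n_sims` and the test point `x0` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma, n_sims = 50, 3, 1.0, 20000   # hypothetical sizes for the check
beta = rng.normal(size=p)                 # arbitrary true coefficients
x0 = rng.normal(size=p)                   # fixed test point

preds = np.empty(n_sims)
inv_sum = np.zeros((p, p))
for s in range(n_sims):
    X = rng.normal(size=(N, p))                    # fresh training inputs per tau
    eps = rng.normal(scale=sigma, size=N)
    y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares fit
    preds[s] = x0 @ beta_hat
    inv_sum += np.linalg.inv(X.T @ X)

mc_var = preds.var()                               # Monte Carlo Var_tau(y0_hat)
formula = sigma**2 * x0 @ (inv_sum / n_sims) @ x0  # sigma^2 x0^T E[(X^T X)^-1] x0
print(mc_var, formula)                             # the two should agree closely
```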
(2.28) If \(N\) is large and \(\tau\) were selected at random, and assuming \(E(X) = 0\), then \(\mathbf{X}^T\mathbf{X} \to N\,\text{Cov}(X)\).
Proof: by definition of covariance, \(\text{Cov}(X) = E[(X-E(X))(X-E(X))^T] = E(XX^T)\) when \(E(X) = 0\), and by the law of large numbers \(\frac{\mathbf{X}^T\mathbf{X}}{N} = \frac{1}{N}\sum_{i=1}^N x_ix_i^T \to E(XX^T)\) as \(N \to \infty\).
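The convergence is easy to see in simulation (a small sketch; the covariance matrix and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])              # hypothetical true Cov(X), with E(X) = 0
L = np.linalg.cholesky(cov)

for N in (100, 10_000, 1_000_000):
    X = rng.normal(size=(N, 2)) @ L.T     # N i.i.d. rows with covariance cov
    print(N, np.abs(X.T @ X / N - cov).max())   # max entrywise error shrinks with N
```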
and we can derive that: \[
\begin{align}
E_{x_0}\text{EPE}(x_0) & \approx E_{x_0}x_0^{T}\text{Cov}^{-1}(X)x_0\sigma^2/N+\sigma^2\\
& = \text{trace}[\text{Cov}^{-1}(X)\text{Cov}(x_0)]\sigma^2/N+\sigma^2\\
& = \sigma^2(p/N)+\sigma^2
\end{align}
\]
The second line uses the fact that \(x_0^T\mathbf{A}x_0\) is a scalar and hence equals its own trace, so \(E_{x_0}[x_0^T\mathbf{A}x_0] = E_{x_0}\text{trace}[\mathbf{A}x_0x_0^T] = \text{trace}[\mathbf{A}\,\text{Cov}(x_0)]\) when \(E(x_0)=0\); the last line follows because \(x_0\) is drawn from the same distribution as \(X\), so \(\text{Cov}^{-1}(X)\text{Cov}(x_0) = \mathbf{I}_p\), whose trace is \(p\).
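The \(\sigma^2(p/N)+\sigma^2\) law can also be verified end to end by Monte Carlo (a sketch under the same assumptions, with arbitrary choices of `N`, `sigma`, and the values of `p` swept):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma, n_sims = 100, 1.0, 4000         # hypothetical training size and replications

for p in (2, 5, 10, 20):
    beta = rng.normal(size=p)             # arbitrary true coefficients
    sq_errs = np.empty(n_sims)
    for s in range(n_sims):
        X = rng.normal(size=(N, p))                  # training inputs, E(X) = 0
        y = X @ beta + rng.normal(scale=sigma, size=N)
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        x0 = rng.normal(size=p)                      # test point from the same law
        y0 = x0 @ beta + rng.normal(scale=sigma)
        sq_errs[s] = (y0 - x0 @ beta_hat) ** 2
    print(p, sq_errs.mean(), sigma**2 * (p / N) + sigma**2)
```

The simulated expected prediction error should track \(\sigma^2(p/N)+\sigma^2\), growing linearly in \(p\) and shrinking toward \(\sigma^2\) as \(N\) grows.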