2.5 Local Methods in High Dimensions

(2.25) Mean squared error in estimating \(f(x_0)\) at a test point \(x_0\):

Assume that the relationship between \(X\) and \(Y\) is deterministic, \(Y = f(X)\), and let \(\hat{y_0}\) denote the prediction at \(x_0\) computed from a training sample \(\tau\), so the expectation \(E_\tau\) below is taken over training samples.

\[ \begin{align} \text{MSE}(x_0) & = E_\tau[f(x_0) - \hat{y_0}]^2\\ & = E_\tau[(f(x_0) - E_\tau(\hat{y_0})) + (E_\tau(\hat{y_0}) - \hat{y_0})]^2\\ & = E_\tau[(E_\tau(\hat{y_0}) - \hat{y_0})^2 + 2(f(x_0) - E_\tau(\hat{y_0}))(E_\tau(\hat{y_0}) - \hat{y_0})+ (f(x_0) - E_\tau(\hat{y_0}))^2]\\ & = E_\tau[(E_\tau(\hat{y_0}) - \hat{y_0})^2] + E_\tau[(E_\tau(\hat{y_0}) - f(x_0))^2]\\ & = E_\tau[\hat{y_0} - E_\tau(\hat{y_0})]^2 + [E_\tau(\hat{y_0}) - f(x_0)]^2\\ & = Var_\tau(\hat{y_0}) + Bias^2(\hat{y_0}) \end{align} \]

We have broken down the MSE into two components: variance and squared bias. The cross term vanishes because \(f(x_0) - E_\tau(\hat{y_0})\) is constant with respect to \(\tau\) and \(E_\tau[E_\tau(\hat{y_0}) - \hat{y_0}] = 0\). Such a decomposition is always possible and is known as the bias-variance decomposition.
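As a quick sanity check, here is a minimal simulation sketch (my own construction, not from the book) that estimates \(f(0)\) with 1-nearest neighbour under the noiseless model \(Y = f(X)\), using an arbitrary smooth \(f\) and uniform inputs, and confirms that the Monte Carlo MSE equals the empirical variance plus squared bias:

import numpy as np

rng = np.random.default_rng(0)
n_train, n_sims, p = 100, 5000, 5

def f(x):
    # arbitrary smooth target, chosen only for illustration
    return np.exp(-8 * np.sum(x ** 2, axis=-1))

x_0 = np.zeros(p)
estimates = np.empty(n_sims)
for s in range(n_sims):
    train_x = rng.uniform(-1, 1, size=(n_train, p))            # a fresh training set tau
    train_y = f(train_x)                                        # noiseless responses, Y = f(X)
    nearest = np.argmin(np.sum((train_x - x_0) ** 2, axis=1))
    estimates[s] = train_y[nearest]                             # 1-NN estimate of f(x_0)

mse = np.mean((f(x_0) - estimates) ** 2)
var = np.var(estimates)
bias_sq = (np.mean(estimates) - f(x_0)) ** 2
print(mse, var + bias_sq)   # identical up to floating-point error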

(2.26) Suppose that the relationship between Y and X is linear with some noise:

\[Y = X^T\beta + \varepsilon \]

where \(\varepsilon \sim N(0, \sigma^2)\) and we fit the model by least squares to the training data. For a test point \(x_0\) we have \(\hat{y_0}=x_0^T\hat{\beta}\) which can be written as \(\hat{y_0} = x_0^T\beta + \sum_{i=1}^N {l_i(x_0)\varepsilon_i}\) where \(l_i(x_0)\) is the \(i\)th element of \(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\)

Proof:

\[ \begin{align} \hat{\beta} & =(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\ & =(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \varepsilon) \\ & =\beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon \end{align} \]

and by plugging \(\hat{\beta}\) into the linear model at \(x_0\):

\[ \begin{align} \hat{y_0} & = x_0^T(\beta+(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon) \\ & = x_0^T\beta+x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon \end{align} \]

we can get \(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0\) by transposing \(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\), using two matrix identities (together with the symmetry of \(\mathbf{X}^T\mathbf{X}\)):

  1. \((\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T\)

  2. \((\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1}\)
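The representation \(\hat{y_0} = x_0^T\beta + \sum_{i=1}^N l_i(x_0)\varepsilon_i\) is easy to confirm numerically. Below is a minimal sketch (the data-generating choices are arbitrary, picked only for illustration):

import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 50, 3, 0.5

X = rng.normal(size=(N, p))                   # training inputs (no intercept, as in the text)
beta = rng.normal(size=p)                     # true coefficients
eps = rng.normal(scale=sigma, size=N)         # noise
y = X @ beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least squares fit
x_0 = rng.normal(size=p)                      # an arbitrary test point

l = X @ np.linalg.solve(X.T @ X, x_0)         # l_i(x_0): the i-th element of X (X^T X)^{-1} x_0
print(np.isclose(x_0 @ beta_hat, x_0 @ beta + l @ eps))   # True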

(2.27) Under this model the least squares estimates are unbiased, so the expected prediction error is: \[ \begin{align} \text{EPE}(x_0) & = E_{y_0|x_0}E_\tau(y_0-\hat{y_0})^2\\ & = \text{Var}(y_0|x_0) + \text{Var}_\tau(\hat{y_0}) + \text{Bias}^2(\hat{y_0})\\ & = \sigma^2 + E_{\tau}x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\sigma^2 + 0^2 \end{align} \]

Proof: \[ \begin{align} \text{EPE}(x_0) & = E_{y_0|x_0}E_\tau(y_0-\hat{y_0})^2\\ & = E_{y_0|x_0}E_\tau((y_0 - f(x_0)) + (f(x_0) - \hat{y_0}))^2\\ & = E_{y_0|x_0}E_\tau(y_0 - f(x_0))^2 + 2E_{y_0|x_0}E_\tau(y_0 - f(x_0))(f(x_0) - \hat{y_0}) + E_{y_0|x_0}E_\tau(f(x_0) - \hat{y_0})^2\\ & = U_1 + U_2 + U_3 \end{align} \]

There are three components, \(U_1\), \(U_2\), and \(U_3\), which we expand in turn.

\(U_1 = E_{y_0|x_0}E_\tau(y_0 - f(x_0))^2 = E_{y_0|x_0}(y_0-f(x_0))^2 = \sigma^2\)

Note: \(f(x_0) = E_{y_0|x_0}(y_0)\)

\(U_2 = 2E_{y_0|x_0}E_\tau(y_0 - f(x_0))(f(x_0) - \hat{y_0}) = 0\)

Note: \(E_{y_0|x_0}(y_0-f(x_0)) = 0\), and \(y_0\) is independent of \(\hat{y_0}\) (which depends only on the training set), so the cross term factors and vanishes.

\(U_3\):

\[ \begin{align} U_3 & = E_{y_0|x_0}E_\tau(f(x_0) - \hat{y_0})^2\\ & = E_{y_0|x_0}E_\tau((\hat{y_0} - E_\tau(\hat{y_0})) + (E_\tau(\hat{y_0}) - f(x_0)))^2\\ & = E_{y_0|x_0}E_\tau(\hat{y_0} - E_\tau(\hat{y_0}))^2 + 2E_{y_0|x_0}E_\tau[(\hat{y_0} - E_\tau(\hat{y_0}))(E_\tau(\hat{y_0}) - f(x_0))] + E_{y_0|x_0}E_\tau(E_\tau(\hat{y_0}) - f(x_0))^2\\ & = E_\tau(\hat{y_0} - E_\tau(\hat{y_0}))^2 + (E_\tau(\hat{y_0}) - f(x_0))^2\\ & = \text{Var}_\tau(\hat{y_0}) + \text{Bias}_\tau^2(\hat{y_0}) \end{align} \]

Finally if we sum all \(U_i\) we get: \[\text{EPE}(x_0) = U_1+U_2+U_3 = \sigma^2 + 0 + (\text{Var}_\tau(\hat{y_0}) + \text{Bias}_\tau^2(\hat{y_0}))\]

\(E_\tau(\hat{y_0}) = E_\tau(x_0^T\beta + \sum_{i=1}^N {l_i(x_0)\varepsilon_i})=x_0^T\beta + E_\tau(\sum_{i=1}^N {l_i(x_0)\varepsilon_i}) = x_0^T\beta + 0\), since each \(\varepsilon_i\) has mean zero and is independent of \(\mathbf{X}\); thus \(\text{Bias}_\tau(\hat{y_0}) = 0\).

and we can find the variance: \[ \begin{align} \text{Var}_\tau(\hat{y_0}) & = E_\tau(\hat{y_0} - E_\tau(\hat{y_0})) ^ 2\\ & = E_\tau(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon)^2\\ & = E_\tau(x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon\varepsilon^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0) \end{align} \]

where \(E(\varepsilon\varepsilon^T\mid\mathbf{X})=\sigma^2\mathbf{I}_N\) (the noise is independent of the inputs), so we can simplify further: \[\text{Var}_\tau(\hat{y_0}) = \sigma^2x_0^{T}E_\tau[(\mathbf{X}^T\mathbf{X})^{-1}]x_0\]
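Both the unbiasedness and this variance formula can be checked by simulation. The following is a minimal sketch (my own setup, not from the book): it draws many training sets, fits least squares on each, and compares the empirical mean and variance of \(\hat{y_0}\) with \(x_0^T\beta\) and \(\sigma^2x_0^{T}E_\tau[(\mathbf{X}^T\mathbf{X})^{-1}]x_0\):

import numpy as np

rng = np.random.default_rng(0)
N, p, sigma, n_sims = 40, 3, 1.0, 20000

beta = np.array([1.0, -2.0, 0.5])
x_0 = np.array([0.3, -0.1, 0.7])

y_hats, quad_forms = [], []
for _ in range(n_sims):
    X = rng.normal(size=(N, p))                    # a fresh training set tau
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hats.append(x_0 @ beta_hat)                  # prediction at x_0
    quad_forms.append(x_0 @ np.linalg.solve(X.T @ X, x_0))   # x_0^T (X^T X)^{-1} x_0

y_hats = np.array(y_hats)
print("bias:    ", y_hats.mean() - x_0 @ beta)     # approximately 0
print("variance:", y_hats.var(), "vs", sigma ** 2 * np.mean(quad_forms))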

(2.28) If \(N\) is large and \(\tau\) is selected at random, and assuming \(E(X) = 0\), then \(\mathbf{X}^T\mathbf{X} \to N\,\text{Cov}(X)\).

Proof: By definition of covariance, \(\text{Cov}(X) = E[(X-E(X))(X-E(X))^T] = E(XX^T)\) since \(E(X)=0\), and by the law of large numbers the sample average \(\frac{1}{N}\sum_{i=1}^N x_ix_i^T = \frac{\mathbf{X}^T\mathbf{X}}{N}\) converges to \(E(XX^T)\) for large \(N\).

and we can derive that: \[ \begin{align} E_{x_0}\text{EPE}(x_0) & \approx E_{x_0}x_0^{T}\text{Cov}^{-1}(X)x_0\sigma^2/N+\sigma^2\\ & = \text{trace}[\text{Cov}^{-1}(X)\text{Cov}(x_0)]\sigma^2/N+\sigma^2\\ & = \sigma^2(p/N)+\sigma^2 \end{align} \] The second step uses the fact that a scalar equals its own trace: \(E_{x_0}[x_0^T\mathbf{A}x_0] = E_{x_0}[\text{trace}(\mathbf{A}x_0x_0^T)] = \text{trace}(\mathbf{A}\,\text{Cov}(x_0))\) when \(E(x_0)=0\). The last step uses \(\text{Cov}(x_0)=\text{Cov}(X)\), so the trace equals \(\text{trace}(\mathbf{I}_p)=p\).
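Both approximations used here, \(\mathbf{X}^T\mathbf{X}/N \approx \text{Cov}(X)\) and \(E_{x_0}[x_0^T\text{Cov}^{-1}(X)x_0] = p\), can be checked numerically; the sketch below uses an arbitrary zero-mean Gaussian distribution for the inputs, chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)
N, p = 100000, 5

true_cov = np.diag([1.0, 2.0, 0.5, 3.0, 1.5])         # a known diagonal covariance
X = rng.normal(size=(N, p)) * np.sqrt(np.diag(true_cov))

# X^T X / N approaches Cov(X) for large N
print(np.allclose(X.T @ X / N, true_cov, atol=0.05))  # True

# E[x_0^T Cov^{-1}(X) x_0] = trace(Cov^{-1}(X) Cov(x_0)) = p when x_0 is drawn like X
quad = np.einsum('ij,jk,ik->i', X, np.linalg.inv(true_cov), X)
print(quad.mean(), "is close to p =", p)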

FIGURE 2.9 (reproduced by the simulation below): the expected prediction error of 1-nearest neighbour relative to least squares at \(x_0=0\), for a linear and a cubic target function, as the dimension \(p\) grows.

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

simulations = 10000
sample_size = 500

# least squares error: fit OLS with an intercept and return the squared prediction error at x_0
def least_square_error(x_0, y_0, train_x, train_y):
    X = np.c_[np.ones((len(train_x), 1)), train_x]
    beta = np.linalg.lstsq(X, train_y, rcond = None)[0]
    return (np.dot(np.array([1, *x_0]), beta) - y_0) ** 2

# 1-nearest neighbor error
def nn_error(x_0, y_0, train_x, train_y):
    dist = ((train_x - x_0) ** 2).sum(axis=1)   # squared distances to x_0
    return (y_0 - train_y[dist.argmin()]) ** 2  # squared error of the nearest neighbour's response

cubic_epe_ratio = []
linear_epe_ratio = []

for p in range(1, 11):
    least_square_epe = [0, 0]
    nn_epe = [0, 0]
    for _ in range(simulations):
        error = np.random.standard_normal(sample_size)
        train_x = np.random.uniform(-1, 1, size=(sample_size, p))
        # two targets: f(x) = x_1 (linear) and f(x) = (x_1 + 1)^3 / 2 (cubic), each with noise
        train_y = [train_x[:, 0] + error, 
                   0.5 * (train_x[:, 0] + 1) ** 3 + error]
        x_0 = np.zeros(p)
        # test responses at x_0 = 0: f(0) = 0 (linear) and f(0) = 0.5 (cubic), plus noise
        y_0 = [np.random.standard_normal(),
               0.5 + np.random.standard_normal()]
        for i in range(2):
            least_square_epe[i] += least_square_error(x_0, y_0[i], 
                                                      train_x, train_y[i])
            nn_epe[i] += nn_error(x_0, y_0[i], 
                                  train_x, train_y[i])
    for i in range(2):
        least_square_epe[i] /= simulations
        nn_epe[i] /= simulations
    linear_epe_ratio.append(nn_epe[0] / least_square_epe[0])
    cubic_epe_ratio.append(nn_epe[1] / least_square_epe[1])

# plot
fig = plt.figure(1, figsize = (8, 8))
axes = fig.add_subplot(1, 1, 1)

axes.set_title("Expected Prediction Error of 1NN vs. OLS")

axes.plot(np.arange(1, 11), linear_epe_ratio, '-o',
          color = 'orange', label = "Linear")

axes.plot(np.arange(1, 11), cubic_epe_ratio, '-o',
          color = 'blue', label = "Cubic")

axes.legend()
axes.set_xlabel("Dimension")
axes.set_ylabel("EPE Ratio")
plt.show()