3.5 Methods Using Derived Input Directions

In many situations we have a large number of inputs, often highly correlated. The methods in this section produce a small number of linear combinations \(\mathbf{Z}\) of the original inputs \(\mathbf{X}\), and these derived inputs are then used in place of \(\mathbf{X}\) in the regression.

3.5.1 Principal Components Regression

Principal components regression (PCR) forms the derived input columns \(\mathbf{z}_m=\mathbf{X}v_m\), where \(v_m\) is the \(m\)th principal component direction of \(\mathbf{X}\), and then regresses \(\mathbf{y}\) on \(\mathbf{z}_1,...,\mathbf{z}_M\) for some \(M \le p\). Since the \(\mathbf{z}_m\) are orthogonal, this regression is just a sum of univariate regressions (3.61): \[ \hat{\mathbf{y}}_{(M)}^{pcr} = \overline{y}\mathbf{1} + \sum_{m=1}^M \hat{\theta}_m\mathbf{z}_m \]

where \(\hat{\theta}_m = \langle \mathbf{z}_m, \mathbf{y} \rangle / \langle \mathbf{z}_m, \mathbf{z}_m\rangle\). Since the \(\mathbf{z}_m\) are linear combinations of the original \(\mathbf{x}_j\), we can express the solution in terms of coefficients of the \(\mathbf{x}_j\) (3.62):

\[ \hat{\beta}^{pcr}(M) = \sum_{m=1}^M \hat{\theta}_m v_m \]

Proof: substituting \(\mathbf{z}_m = \mathbf{X}v_m\) into (3.61) gives \(\hat{\mathbf{y}}_{(M)}^{pcr} = \overline{y}\mathbf{1} + \sum_{m=1}^M \hat{\theta}_m \mathbf{X}v_m = \overline{y}\mathbf{1} + \mathbf{X}\big(\sum_{m=1}^M \hat{\theta}_m v_m\big)\), so the coefficient vector on the \(\mathbf{x}_j\) is \(\sum_{m=1}^M \hat{\theta}_m v_m\).

  • As with ridge regression, PCR depends on the scaling of the inputs, so we first standardize them.

  • Note that if \(M = p\) we would get back the usual least squares estimates, since the columns of \(\mathbf{Z} = \mathbf{UD}\) span the column space of \(\mathbf{X}\).

  • PCR is very similar to ridge regression: both operate via the principal components of \(\mathbf{X}\). Ridge regression shrinks the coefficients of the principal components (shrinking more for components with smaller variance), while PCR discards the \(p - M\) components with the smallest variance entirely; see the sketch below.
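
To make the PCR construction concrete, here is a minimal sketch (not from the text) that computes \(\hat{\beta}^{pcr}(M)\) via the SVD; it assumes the columns of \(\mathbf{X}\) are already centered and standardized, and the function name and arguments are illustrative.

# Principal components regression (sketch, not from the text).
import numpy as np

def PCR(X, y, M):
    """Return (intercept, beta) for PCR with M components; X columns assumed centered."""
    n, p = X.shape
    # Columns of V are the principal component directions v_m;
    # the derived inputs are z_m = X v_m (equivalently the columns of U D).
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    beta = np.zeros(p)
    for m in range(M):
        z_m = X @ V[:, m]
        theta_m = np.dot(z_m, y) / np.dot(z_m, z_m)   # univariate regression coefficient
        beta += theta_m * V[:, m]                     # beta^pcr(M) = sum_m theta_m v_m
    return y.mean(), beta                             # fitted values: y.mean() + X @ beta

Applied with \(M = p\) to centered inputs, this reproduces the least squares fit, consistent with the note above.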

3.5.2 Partial Least Squares

  • Partial Least Squares (PLS) begins by computing \(\hat{\varphi}_{1j} = \langle \mathbf{x}_j, \mathbf{y} \rangle\) for each j.

  • Then we construct the derived input \(\mathbf{z}_1=\sum_j \hat{\varphi}_{1j} \mathbf{x}_j\), which is the first partial least squares direction.

  • The outcome \(\mathbf{y}\) is regressed on \(\mathbf{z}_1\), giving the coefficient \(\hat{\theta}_1\).

  • Then we orthogonalize \(\mathbf{x}_1,..., \mathbf{x}_p\) with respect to \(\mathbf{z}_1\).

  • We continue this process until \(M \le p\) directions have been obtained.

Note: PLS produces orthogonal inputs (or directions) \(\mathbf{z}_1,...,\mathbf{z}_M\), and if \(M = p\) we would get the least squares estimates; using \(M < p\) directions gives a reduced regression.

# Algorithm 3.3 Partial Least Squares.
import numpy as np
import pandas as pd

# Load the prostate data and split off the training observations.
df = pd.read_csv('../data/prostate/prostate.data', delimiter='\t', index_col=0)
mask_train = df.pop('train')
df_y = df.pop('lpsa')

train_x = df[mask_train == 'T']
train_y = df_y[mask_train == 'T']

# Center the response and scale each centered input column to unit norm.
train_x_centered = train_x - train_x.mean(axis=0)
train_x_centered /= np.linalg.norm(train_x_centered, axis=0)
train_y_centered = train_y - train_y.mean()

def PLS(X, y, M=None):
    """Algorithm 3.3: fitted values from the first M partial least squares directions."""
    X = X.copy()
    n, p = X.shape
    M = p if M is None else M

    y_fit = y.mean() * np.ones(n)   # start from the mean fit (y is already centered here)
    Z = np.zeros((n, M))
    theta = np.zeros(M)

    for m in range(M):
        # phi_mj = <x_j, y> on the current (orthogonalized) inputs; z_m = sum_j phi_mj x_j
        for j in range(p):
            phi = np.dot(X[:, j], y)
            Z[:, m] += phi * X[:, j]

        # theta_m = <z_m, y> / <z_m, z_m>; update the fitted values
        theta[m] = np.dot(Z[:, m], y) / np.dot(Z[:, m], Z[:, m])
        y_fit += theta[m] * Z[:, m]

        # Orthogonalize each x_j with respect to z_m before the next direction
        for j in range(p):
            X[:, j] -= np.dot(Z[:, m], X[:, j]) / np.dot(Z[:, m], Z[:, m]) * Z[:, m]
    return y_fit
y_fit = PLS(train_x_centered.values, train_y_centered.values)
train_error = np.mean((y_fit - train_y_centered) ** 2)
print('Train error: ', train_error)
Train error:  0.43919976805833433

It can be shown that PLS seeks directions that have both high variance and high correlation with the response, whereas PCR keys only on high variance. Specifically, the \(m\)th PLS direction \(\hat{\varphi}_m\) solves:

\[ \max_{\alpha}\ \mathrm{Corr}^2(\mathbf{y}, \mathbf{X}\alpha)\,\mathrm{Var}(\mathbf{X}\alpha) \quad \text{subject to } \|\alpha\| = 1,\ \alpha^T\mathbf{S}\hat{\varphi}_\ell = 0,\ \ell = 1, ..., m - 1 \]

where \(\mathbf{S}\) is the sample covariance matrix of the \(\mathbf{x}_j\). The conditions \(\alpha^T\mathbf{S}\hat{\varphi}_\ell = 0\) ensure that \(\mathbf{z}_m = \mathbf{X}\alpha\) is uncorrelated with all of the previous directions \(\mathbf{z}_\ell = \mathbf{X}\hat{\varphi}_\ell\).
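
As a quick numerical check of this characterization for \(m = 1\) (a sketch, not from the text): since \(\mathrm{Corr}^2(\mathbf{y}, \mathbf{X}\alpha)\,\mathrm{Var}(\mathbf{X}\alpha) = \mathrm{Cov}^2(\mathbf{y}, \mathbf{X}\alpha)/\mathrm{Var}(\mathbf{y})\), the first direction maximizes \(\mathrm{Cov}^2(\mathbf{y}, \mathbf{X}\alpha)\) over unit vectors \(\alpha\), and with the centered data used here the maximizer is proportional to \(\mathbf{X}^T\mathbf{y}\), i.e. the \(\hat{\varphi}_1\) of Algorithm 3.3. The snippet below compares it against random unit vectors, reusing the centered training data from above.

# Sketch: the first PLS direction (proportional to X^T y) maximizes Cov^2(y, X alpha)
# over unit vectors alpha.  Reuses train_x_centered / train_y_centered from above.
import numpy as np

rng = np.random.default_rng(0)
X = train_x_centered.values
y = train_y_centered.values

phi1 = X.T @ y                  # phi_1j = <x_j, y>
phi1 /= np.linalg.norm(phi1)

def cov2(alpha):
    return np.cov(y, X @ alpha)[0, 1] ** 2   # Cov^2(y, X alpha)

best_random = max(cov2(a / np.linalg.norm(a))
                  for a in rng.standard_normal((1000, X.shape[1])))
print(cov2(phi1) >= best_random)   # expected: True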