import numpy as np
import pandas as pd
from scipy import stats
df = pd.read_csv('../data/prostate/prostate.data', delimiter='\t', index_col=0)
mask_train = df.pop('train')
df_y = df.pop('lpsa')
train_x = df[mask_train == 'T']
train_y = df_y[mask_train == 'T']
train_x_centered = train_x - train_x.mean(axis = 0)
train_x_centered /= np.linalg.norm(train_x_centered, axis=0)
train_y_centered = train_y - train_y.mean()
Least angle regression (LAR) uses a similar strategy to forward stepwise regression, but only enters “as much” of a predictor as it deserves.
Algorithm 3.2
1. Standardize the predictors to have mean zero and unit norm. Start with the residual \(\mathbf{r} = \mathbf{y} - \mathbf{\overline{y}}\) and \(\beta_1, \dots, \beta_p = 0\).
2. Find the predictor \(\mathbf{x}_j\) most correlated with \(\mathbf{r}\).
3. Move \(\beta_j\) from 0 towards its least-squares coefficient \(\langle \mathbf{x}_j, \mathbf{r} \rangle\), until some other competitor \(\mathbf{x}_k\) has as much correlation with the current residual as does \(\mathbf{x}_j\).
4. Move \(\beta_j\) and \(\beta_k\) in the direction defined by their joint least-squares coefficient of the current residual on \((\mathbf{x}_j, \mathbf{x}_k)\), until some other competitor \(\mathbf{x}_l\) has as much correlation with the current residual.
5. Continue in this way until all \(p\) predictors have been entered. After \(\min(N-1, p)\) steps, we arrive at the full least-squares solution.
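As a quick illustration of step 2 (not from the text), the first variable to enter on the standardized prostate training data prepared above is simply the column with the largest absolute inner product with the centered response; with unit-norm predictors this is equivalent to the largest absolute correlation.
# Step 2 of Algorithm 3.2 on the prostate training data prepared above.
# With unit-norm columns, <x_j, r> is proportional to the correlation with the residual.
r0 = train_y_centered.to_numpy()
corr0 = train_x_centered.to_numpy().T @ r0
print('first predictor to enter:', train_x_centered.columns[np.argmax(np.abs(corr0))])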
Suppose at the beginning of the kth step:
\(\mathcal{A}_k\) is the active set of variables
\(\beta_{\mathcal{A}_k}\) is the coefficient vector for these variables at this step,
\(\mathbf{r}_k=\mathbf{y} - \mathbf{X}_{\mathcal{A}_k}\beta_{\mathcal{A}_k}\) is the current residual,
then the direction for this step is (3.55):
\[\delta_k = (\mathbf{X}_{\mathcal{A}_k}^T\mathbf{X}_{\mathcal{A}_k})^{-1}\mathbf{X}_{\mathcal{A}_k}^T\mathbf{r}_k\]
The coefficient profile then evolves as \(\beta_{\mathcal{A}_k}(\alpha)=\beta_{\mathcal{A}_k} + \alpha \cdot \delta_k\), and the fit vector evolves as \(\hat{f}_k(\alpha)=\hat{f}_k + \alpha \cdot \mathbf{u}_k\), where \(\mathbf{u}_k = \mathbf{X}_{\mathcal{A}_k}\delta_k\) is the new fit direction.
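A minimal sketch of (3.55), assuming `X_A` holds the active standardized columns and `r_k` is the current residual; `lar_direction` is a hypothetical helper name, and a least-squares solve is used in place of the explicit inverse.
# delta_k is the least-squares coefficient of the current residual on the active
# columns; solving via lstsq is equivalent to the matrix formula above without
# forming (X_A^T X_A)^{-1} explicitly.
def lar_direction(X_A, r_k):
    delta_k, *_ = np.linalg.lstsq(X_A, r_k, rcond=None)
    return delta_k

# The fit direction is then u_k = X_A @ delta_k, so that f_k(alpha) = f_k + alpha * u_k
# tracks beta_Ak(alpha) = beta_Ak + alpha * delta_k.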
def lars(X, y):
    """Least angle regression (Algorithm 3.2) for standardized X and centered y."""
    n, p = X.shape
    mu = np.zeros_like(y)                  # current fit vector
    beta = np.zeros(p)                     # current coefficients
    for _ in range(p):
        c = X.T @ (y - mu)                 # correlations with the current residual
        c_abs = np.abs(c)
        c_max = c_abs.max()
        active = np.isclose(c_abs, c_max)  # variables tied at the maximal correlation
        signs = np.where(c[active] > 0, 1, -1)
        X_active = signs * X[:, active]    # sign-adjusted active columns
        G = X_active.T @ X_active
        Ginv = np.linalg.inv(G)
        A = Ginv.sum() ** (-0.5)
        w = A * Ginv.sum(axis=1)
        u = X_active @ w                   # equiangular direction
        gamma = c_max / A                  # full step, used once all variables are active
        if not np.all(active):
            a = X.T @ u
            complement = np.invert(active)
            cc = c[complement]
            ac = a[complement]
            candidates = np.concatenate([(c_max - cc) / (A - ac),
                                         (c_max + cc) / (A + ac)])
            gamma = candidates[candidates >= 0].min()  # step at which a new variable joins
        mu += gamma * u
        beta[active] += gamma * w * signs  # coefficient update along the equiangular direction
    return mu, beta

y_fit, beta = lars(train_x_centered.to_numpy(), train_y_centered.to_numpy())
train_error = np.mean((y_fit - train_y_centered) ** 2)
print ('Beta: ', beta)
print ('train error: ', train_error)
Beta: [10.06592433 6.58925225 -1.95834047 4.00665484 5.62572779 -1.65081245
-0.20495795 3.9589639 ]
train error: 0.43919976805833433
Algorithm 3.2a (Lasso Modification)
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.
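A sketch of how the `lars` loop above could be adapted to implement step 4a, assuming an explicit active set is maintained so that a dropped variable is not immediately re-selected; this is an illustrative reading of the modification, not code from the text.
def lars_lasso(X, y, max_steps=50):
    """LAR with the lasso modification (step 4a): keep an explicit active set and,
    when an active coefficient would cross zero, stop there, drop the variable,
    and recompute the joint direction on the next pass. A sketch only."""
    n, p = X.shape
    mu, beta = np.zeros(n), np.zeros(p)
    active = [int(np.argmax(np.abs(X.T @ y)))]           # most correlated predictor enters first
    for _ in range(max_steps):
        c = X.T @ (y - mu)                               # correlations with the current residual
        c_max = np.abs(c[active]).max()
        if c_max < 1e-12:
            break
        signs = np.sign(c[active])
        X_a = X[:, active] * signs
        Ginv = np.linalg.inv(X_a.T @ X_a)
        A = Ginv.sum() ** -0.5
        w = A * Ginv.sum(axis=1)
        u = X_a @ w                                      # equiangular direction, as in lars()
        gamma, joiner = c_max / A, None                  # default: step to the joint LS fit
        inactive = [j for j in range(p) if j not in active]
        if inactive:
            a = X.T @ u
            with np.errstate(divide='ignore', invalid='ignore'):
                cand = np.concatenate([(c_max - c[inactive]) / (A - a[inactive]),
                                       (c_max + c[inactive]) / (A + a[inactive])])
            cand_idx = np.concatenate([inactive, inactive])
            ok = cand > 1e-12
            if ok.any() and cand[ok].min() < gamma:      # an inactive variable ties first
                gamma = cand[ok].min()
                joiner = int(cand_idx[ok][np.argmin(cand[ok])])
        d = signs * w                                    # beta[active] moves as beta[active] + gamma * d
        with np.errstate(divide='ignore', invalid='ignore'):
            gamma_zero = -beta[active] / d               # step at which each active coef hits zero
        gamma_zero[~(gamma_zero > 1e-12)] = np.inf
        dropper = None
        if gamma_zero.min() < gamma:                     # 4a: a coefficient crosses zero first
            gamma = gamma_zero.min()
            joiner, dropper = None, active[int(np.argmin(gamma_zero))]
        beta[active] += gamma * d
        mu += gamma * u
        if dropper is not None:
            beta[dropper] = 0.0                          # drop the variable from the active set
            active.remove(dropper)
        elif joiner is not None:
            active.append(joiner)
        else:
            break                                        # no event left: least-squares fit reached
    return mu, beta
Called as lars_lasso(train_x_centered.to_numpy(), train_y_centered.to_numpy()), it should end at the same least-squares fit as lars, with the two paths differing only where a coefficient crosses zero along the way.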
The LAR (lasso) algorithm is extremely efficient, requiring the same order of computation as a single least-squares fit using the \(p\) predictors.
Heuristic argument for why LAR and the lasso are similar
Suppose \(\mathcal{A}\) is the active set of variables at some stage of the algorithm. These variables are tied in their absolute inner product with the current residual, which we can express as (3.56): \[\mathbf{x}_j^T(\mathbf{y}-\mathbf{X}\beta)=\lambda \cdot s_j, \quad j \in \mathcal{A},\] where \(s_j \in \{-1, 1\}\) is the sign of the inner product and \(\lambda\) is the common value.
Also \(|\mathbf{x}_j^T(\mathbf{y}-\mathbf{X}\beta)| \le \lambda\) for \(j \notin \mathcal{A}\). Now consider the lasso criterion (3.57):
\[R(\beta)=\frac{1}{2}||\mathbf{y}-\mathbf{X}\beta||_2^2 + \lambda||\beta||_1\]
Let \(\mathcal{B}\) be the active set of variables in the solution for a given value of \(\lambda\). For these variables \(R(\beta)\) is differentiable, and the stationarity conditions give (3.58):
\[\mathbf{x}_j^T(\mathbf{y}-\mathbf{X}\beta)=\lambda \cdot sign(\beta_j), j \in \mathcal{B}\]
Comparing (3.56) and (3.58), we see that they are identical only if the sign of \(\beta_j\) matches the sign of the inner product. That is why the LAR algorithm and the lasso start to differ when an active coefficient passes through zero: condition (3.58) is violated for that variable, and it is dropped from \(\mathcal{B}\). The stationarity conditions for the non-active variables require that (3.59):
\[|\mathbf{x}_j^T(\mathbf{y}-\mathbf{X}\beta)|\le \lambda, j \notin \mathcal{B}\]
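A small numerical check of (3.58)–(3.59) that could be run against any candidate lasso solution; `beta_hat` and `lam` are hypothetical inputs here, not quantities computed in the text.
# Verify the stationarity conditions for a candidate lasso solution `beta_hat`
# at penalty `lam`: (3.58) on the active coordinates, (3.59) on the rest.
def check_lasso_conditions(X, y, beta_hat, lam, tol=1e-6):
    grad = X.T @ (y - X @ beta_hat)              # x_j^T (y - X beta) for every j
    active = np.abs(beta_hat) > tol
    ok_active = np.allclose(grad[active], lam * np.sign(beta_hat[active]), atol=1e-4)
    ok_inactive = np.all(np.abs(grad[~active]) <= lam + 1e-4)
    return bool(ok_active and ok_inactive)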
Degrees-of-Freedom Formula for LAR and Lasso
We define the degrees of freedom of the fitted vector \(\hat{\mathbf{y}}\) as:
\[ df(\hat{y})=\frac{1}{\sigma^2}\sum_{i=1}^N Cov(\hat{y}_i,y_i) \]
This makes intuitive sense: the harder we fit to the data, the larger this covariance and hence \(df(\hat{\mathbf{y}})\).
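A Monte Carlo sketch of this definition under made-up assumptions: a fixed simulated design `X_sim`, known noise level `sigma`, and an ordinary least-squares fit on the first `k` columns, for which the estimate should come out close to `k`. None of these names or values come from the text.
# Estimate df(yhat) = (1/sigma^2) * sum_i Cov(yhat_i, y_i) by simulation.
rng = np.random.default_rng(0)
n_sim, p_sim, k, sigma = 67, 8, 3, 1.0
X_sim = rng.standard_normal((n_sim, p_sim))
f_true = X_sim @ rng.standard_normal(p_sim)                 # fixed true mean function

n_reps = 2000
Y = f_true + sigma * rng.standard_normal((n_reps, n_sim))   # simulated responses, one per row
H = X_sim[:, :k] @ np.linalg.pinv(X_sim[:, :k])             # hat matrix of the k-column OLS fit
Y_hat = Y @ H.T                                             # fitted values for every replicate

cov_terms = ((Y_hat - Y_hat.mean(axis=0)) * (Y - Y.mean(axis=0))).mean(axis=0)
print('estimated df:', cov_terms.sum() / sigma ** 2, ' expected:', k)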