3.4.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized RSS (3.41): \[ \hat{\beta}^{ridge} = \underset{\beta}{argmin} \left\{ \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2 +\gamma\sum_{j=1}^p{\beta_j^2} \right\} \] Here \(\gamma \ge 0\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\gamma\), the more the coefficients are shrunk toward zero (and toward each other). The same idea is also used in neural networks, where it is known as weight decay.

An equivalent way to write the ridge problem is (3.42): \[ \begin{equation} \hat{\beta}^{ridge} = \underset{\beta}{argmin} \left\{ \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2 \right\},\\ \text{subject to } \sum_{j=1}^p {\beta_j^2 \le t} \end{equation} \]

There is a one-to-one correspondence between \(\gamma\) in (3.41) and \(t\) in (3.42). When there are many correlated variables in a linear regression, their coefficients can become poorly determined and exhibit high variance: a large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint as in (3.42), this problem is alleviated.
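
To make this concrete, here is a minimal numpy sketch (my own illustration with synthetic data, not an example from the text) comparing least squares and ridge estimates for two nearly collinear predictors whose true coefficients are both 1:

```python
import numpy as np

# Two nearly collinear predictors: the least squares coefficients are poorly
# determined (often large and opposite in sign), while the ridge penalty
# keeps them small and similar.
rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.01 * rng.normal(size=N)      # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=N)         # true coefficients: (1, 1)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
gamma = 1.0
beta_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(2), X.T @ y)

print("least squares:", beta_ls)         # unstable, can be far from (1, 1)
print("ridge:        ", beta_ridge)      # both close to 1
```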

The ridge solutions are not equivariant under scaling of the inputs, so one normally standardizes the inputs before solving (3.41).

Notice that the intercept \(\beta_0\) has been left out of the penalty term: penalizing the intercept would make the procedure depend on the origin chosen for the response, since adding a constant to each target \(y_i\) would then not simply result in a shift of the predictions by the same amount. It can be shown that the solution to (3.41) separates into two parts after reparametrizing with centered inputs, i.e. replacing \(x_{ij}\) by \(x_{ij}-\overline{x}_j\): 1. estimate \(\beta_0\) as \(\overline{y}\); 2. estimate the remaining coefficients by a ridge regression without intercept, using the centered \(x_{ij}\).

Proof:

\[ \begin{equation} \hat{\beta}^{ridge} = \underset{\beta}{argmin} \left\{ \sum_{i=1}^N \left( y_i- \beta_0-\sum_{j=1}^p\overline{x}_j\beta_j -\sum_{j=1}^p{(x_{ij} - \overline{x}_j)\beta_j} \right)^2 +\gamma\sum_{j=1}^p{\beta_j^2} \right\}\\ \hat{\beta}^{ridge} = \underset{\beta}{argmin} \left\{ \sum_{i=1}^N \left( y_i- \beta_0^C -\sum_{j=1}^p{(x_{ij} - \overline{x}_j)\beta_j} \right)^2 +\gamma\sum_{j=1}^p{\beta_j^2} \right\} \end{equation} \]

where \(\beta_0^C = \beta_0-\sum_{j=1}^p\overline{x}_j\beta_j\). Setting the derivative with respect to \(\beta_0^C\) to zero gives: \[ \begin{align} \beta_0^C &= \frac{1}{N}(\sum_{i=1}^N y_i- \sum_{i=1}^N\sum_{j=1}^p{x_{ij}\beta_j + \sum_{i=1}^N\sum_{j=1}^p\overline{x}_j\beta_j}) \\ &=\frac{1}{N}(\sum_{i=1}^N y_i- \sum_{j=1}^p\beta_j\sum_{i=1}^N x_{ij} + N\sum_{j=1}^p\overline{x}_j\beta_j) \\ &=\frac{1}{N}(\sum_{i=1}^N y_i- \sum_{j=1}^p\beta_j\overline{x}_jN + N\sum_{j=1}^p\overline{x}_j\beta_j) \\ &=\overline{y} \end{align} \]

This completes the proof.

From now on we assume that this centering has been done, so that the input matrix \(\mathbf{X}\) has \(p\) (rather than \(p+1\)) columns. Writing the criterion (3.41) in matrix form (3.43):

\[ RSS(\gamma)=(\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \gamma\beta^T\beta \]

The ridge regression solutions are easily seen to be (3.44): \[ \hat{\beta}^{ridge}=(\mathbf{X}^T\mathbf{X}+\gamma\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \]

where I is the \(p\times p\) identity matrix. Notice that the solution is again a linear function of y.

Proof:

\[ \begin{align} \cfrac{\partial{RSS(\gamma)}}{\partial{\beta}} &=\cfrac{\partial{((\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \gamma\beta^T\beta)}}{\partial{\beta}}\\ &=\cfrac{\partial{\left( \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\beta + (\mathbf{X}\beta)^T\mathbf{X}\beta + \gamma\beta^T\beta \right)}}{\partial{\beta}}\\ &=-2\mathbf{y}^T\mathbf{X}+2\beta^T\mathbf{X}^T\mathbf{X}+2\gamma\beta^T \end{align} \]

We set the first derivative to zero:

\[ \begin{equation} \cfrac{\partial{RSS(\gamma)}}{\partial{\beta}} = 0\\ -\mathbf{y}^T\mathbf{X}+\beta^T\mathbf{X}^T\mathbf{X}+\gamma\beta^T=0\\ \beta^T(\mathbf{X}^T\mathbf{X}+\gamma\mathbf{I})=\mathbf{y}^T\mathbf{X}\\ \beta^T=\mathbf{y}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\gamma\mathbf{I})^{-1}\\ \beta=(\mathbf{X}^T\mathbf{X}+\gamma\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \end{equation} \]
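
A minimal numpy sketch of this closed form (an illustration under the centering convention above, not the book's code):

```python
import numpy as np

def ridge_coefficients(X, y, gamma):
    """Ridge fit via the closed form (X^T X + gamma I)^{-1} X^T y.

    The inputs are centered first, so the intercept is simply mean(y),
    matching the two-step argument above.  (In practice the columns of X
    would typically also be standardized, as noted earlier.)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)                 # centered inputs
    beta0 = y.mean()                        # intercept estimated as y-bar
    yc = y - beta0                          # centered response
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + gamma * np.eye(p), Xc.T @ yc)
    return beta0, beta
```

With \(\gamma=0\) this reduces to the ordinary least squares fit (when \(\mathbf{X}^T\mathbf{X}\) is invertible), and larger \(\gamma\) shrinks the coefficients as discussed above.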

Figure 3.8 shows the ridge coefficient estimates for the prostate cancer example, plotted as functions of \(df(\gamma)\), the effective degrees of freedom implied by the penalty \(\gamma\). In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates: \(\hat{\beta}^{ridge}=\hat{\beta}^{ls}/(1+\gamma)\).

TODO: Implement Figure 3.8.
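
A rough, self-contained sketch of how such a plot could be produced; the data below are synthetic stand-ins for the standardized prostate predictors, so this only illustrates the mechanics, not the actual figure:

```python
import numpy as np
import matplotlib.pyplot as plt

# Ridge coefficient profiles plotted against the effective degrees of
# freedom df(gamma), in the spirit of Figure 3.8.
rng = np.random.default_rng(1)
N, p = 67, 8                                   # roughly the prostate training-set dimensions
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized inputs
y = X @ rng.normal(size=p) + rng.normal(size=N)
y = y - y.mean()

d = np.linalg.svd(X, compute_uv=False)         # singular values of X
gammas = np.logspace(-2, 5, 200)
paths = np.array([np.linalg.solve(X.T @ X + g * np.eye(p), X.T @ y)
                  for g in gammas])
dfs = np.array([np.sum(d**2 / (d**2 + g)) for g in gammas])

plt.plot(dfs, paths)
plt.xlabel(r"df($\gamma$)")
plt.ylabel("coefficients")
plt.show()
```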

The ridge estimate can also be derived as the mean or mode of a posterior distribution, with a suitably chosen prior. Suppose \(y_i \sim N(\beta_0+x_i^T\beta, \sigma^2)\) and \(\beta_j \sim N(0, \tau^2)\), independently of one another. Then the (negative) log-posterior density of \(\beta\), with \(\sigma^2\) and \(\tau^2\) assumed known, is proportional to the expression in curly braces in (3.41), with \(\gamma=\sigma^2/\tau^2\).

TODO: proof.
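
A sketch of the argument (my own filling-in, under the stated assumptions): the posterior density of \(\beta\) is proportional to the likelihood times the prior, so the negative log-posterior is, up to an additive constant, \[ \begin{align} -\log p(\beta \mid \mathbf{y}) &= \frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i-\beta_0-\sum_{j=1}^p x_{ij}\beta_j\right)^2 + \frac{1}{2\tau^2}\sum_{j=1}^p\beta_j^2 + \text{const}\\ &= \frac{1}{2\sigma^2}\left[\sum_{i=1}^N\left(y_i-\beta_0-\sum_{j=1}^p x_{ij}\beta_j\right)^2 + \frac{\sigma^2}{\tau^2}\sum_{j=1}^p\beta_j^2\right] + \text{const} \end{align} \] Maximizing the posterior therefore amounts to minimizing the criterion in (3.41) with \(\gamma=\sigma^2/\tau^2\); and since the posterior is Gaussian in \(\beta\), its mode coincides with its mean.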

The singular value decomposition (SVD) of the centered input matrix \(\mathbf{X}\) gives us some insight into the nature of ridge regression. The SVD of the \(N \times p\) matrix \(\mathbf{X}\) has the form (3.45): \[ \mathbf{X}=\mathbf{UD}\mathbf{V}^T \]

- \(\mathbf{U}\) is an \(N \times p\) orthogonal matrix whose columns span the column space of \(\mathbf{X}\).
- \(\mathbf{V}\) is a \(p \times p\) orthogonal matrix whose columns span the row space of \(\mathbf{X}\).
- \(\mathbf{D}\) is a \(p \times p\) diagonal matrix with diagonal entries \(d_1 \ge d_2 \ge \dots \ge d_p \ge 0\), called the singular values of \(\mathbf{X}\). If one or more \(d_j = 0\), \(\mathbf{X}\) is singular.

Using the SVD, and assuming for simplicity that \(\mathbf{X}\) has full column rank (so that \(\mathbf{D}\) is invertible), we can write the least squares fitted vector as (3.46):

\[ \begin{align} \mathbf{X}\hat{\beta}^{ls} &= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ &= \mathbf{UD}\mathbf{V}^T(\mathbf{VD}\mathbf{U}^T\mathbf{UD}\mathbf{V}^T)^{-1}\mathbf{VD}\mathbf{U}^T\mathbf{y}\\ &= \mathbf{UD}\mathbf{V}^T(\mathbf{VD}\mathbf{D}\mathbf{V}^T)^{-1}\mathbf{VD}\mathbf{U}^T\mathbf{y}\\ &= \mathbf{UD}\mathbf{V}^T[\mathbf{V}^T]^{-1}\mathbf{D}^{-1}\mathbf{D}^{-1}\mathbf{V}^{-1}\mathbf{VD}\mathbf{U}^T\mathbf{y}\\ &= \mathbf{UD}\mathbf{D}^{-1}\mathbf{D}^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}\\ &= \mathbf{U}\mathbf{U}^T\mathbf{y} \end{align} \]

Now the ridge solutions are (3.47):

\[ \begin{align} \mathbf{X}\hat{\beta}^{ridge}&=\mathbf{X}(\mathbf{X}^T\mathbf{X}+\gamma\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\\ &=\mathbf{UD}\mathbf{V}^T (\mathbf{V}\mathbf{D}^2\mathbf{V}^T+\gamma\mathbf{V}\mathbf{V}^T)^{-1} \mathbf{VD}\mathbf{U}^T\mathbf{y}\\ &=\mathbf{UD}\mathbf{V}^T (\mathbf{V}(\mathbf{D}^2+\gamma\mathbf{I})\mathbf{V}^T)^{-1} \mathbf{VD}\mathbf{U}^T\mathbf{y}\\ &=\mathbf{UD}\mathbf{V}^T \mathbf{V}(\mathbf{D}^2+\gamma\mathbf{I})^{-1}\mathbf{V}^T \mathbf{VD}\mathbf{U}^T\mathbf{y}\\ &=\mathbf{UD}(\mathbf{D}^2+\gamma\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}\\ &=\sum_{j=1}^p\mathbf{u}_j\cfrac{d_j^2}{d_j^2+\gamma}\mathbf{u}_j^T\mathbf{y} \end{align} \]

- Ridge regression computes the coordinates of \(\mathbf{y}\) w.r.t. the orthonormal basis \(\mathbf{U}\).
- It then shrinks these coordinates of \(\mathbf{y}\) by the factors \(d_j^2/(d_j^2+\gamma)\).
- A greater amount of shrinkage is applied to the coordinates of basis vectors with smaller \(d_j^2\).
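
A quick numerical check of these identities (an illustration with synthetic data, not from the text):

```python
import numpy as np

# Verify that the SVD shrinkage form in (3.47) matches the closed-form ridge
# fit, and that the least squares fitted vector equals U U^T y as in (3.46).
rng = np.random.default_rng(2)
N, p, gamma = 50, 5, 2.0
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                              # centered inputs
y = rng.normal(size=N)

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # X = U D V^T
fit_closed = X @ np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)
fit_svd = U @ (d**2 / (d**2 + gamma) * (U.T @ y))   # sum_j u_j [d_j^2/(d_j^2+gamma)] u_j^T y
fit_ls = X @ np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(fit_closed, fit_svd))             # True
print(np.allclose(fit_ls, U @ (U.T @ y)))           # True: least squares fit is U U^T y
```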

Principal Components

What does a small value of \(d_j^2\) mean? The sample covariance matrix is \(\mathbf{S}=\mathbf{X}^T\mathbf{X}/N\), and from the SVD (3.48): \[ \mathbf{X}^T\mathbf{X} = \mathbf{VD}^2\mathbf{V}^T \]

which is the eigen decomposition of \(\mathbf{X}^T\mathbf{X}\) (and of \(\mathbf{S}\), up to a factor \(N\)). The eigenvectors \(v_j\), the columns of \(\mathbf{V}\), are called the principal component directions of \(\mathbf{X}\).

The first principal component (PC) direction \(v_1\) has the property that \(\mathbf{z}_1 = \mathbf{X}v_1\) has the largest sample variance amongst all normalized linear combinations of the columns of \(\mathbf{X}\); this variance is: \[ \begin{align} Var(\mathbf{z}_1) &= Var(\mathbf{X}v_1)\\ &= \frac{1}{N} v_1^T\mathbf{VD}^2\mathbf{V}^Tv_1\\ &= \frac{1}{N} \begin{pmatrix} 1&0 & \dots & 0 \end{pmatrix} \mathbf{D}^2 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}\\ &=\frac{d_1^2}{N} \end{align} \]

In fact \(\mathbf{z}_1 = \mathbf{X}v_1 = \mathbf{UD}\mathbf{V}^{T}v_1= \mathbf{u}_1d_1\), and \(\mathbf{z}_1\) is called the first principal component of \(\mathbf{X}\). Hence small singular values \(d_j\) correspond to directions in the column space of \(\mathbf{X}\) with small variance, and these are the directions that ridge regression shrinks the most.
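
A small numerical illustration of this relationship (synthetic data, my own check):

```python
import numpy as np

# For centered X: z1 = X v1 equals d1 * u1, and its sample variance is d1^2 / N.
rng = np.random.default_rng(3)
N, p = 200, 4
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                        # centered inputs

U, d, Vt = np.linalg.svd(X, full_matrices=False)
z1 = X @ Vt[0]                                # first principal component z1 = X v1
print(np.allclose(z1, d[0] * U[:, 0]))        # True: z1 = u1 d1
print(np.isclose(z1.var(), d[0]**2 / N))      # True: Var(z1) = d1^2 / N
```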

Effective degrees of freedom

In Figure 3.7 we have plotted the estimated prediction error versus the quantity: \[ \begin{align} df(\gamma) &= tr[\mathbf{X}(\mathbf{X}^T\mathbf{X}+\gamma\mathbf{I})^{-1}\mathbf{X}^T]\\ &= tr(\mathbf{H_{\gamma}})\\ &= tr[\mathbf{UD}(\mathbf{D}^2+\gamma\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T]\\ &= \sum_{j=1}^p \cfrac{d_j^2}{d_j^2+\gamma} \end{align} \]

This monotone decreasing function of \(\gamma\) is the effective degrees of freedom of the ridge regression fit. Note that:

- \(\gamma = 0\) gives \(df(\gamma)=p\).
- \(\gamma \rightarrow \infty\) gives \(df(\gamma) \rightarrow 0\).
- Although all \(p\) coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by \(\gamma\).
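
As a sanity check, the two expressions for \(df(\gamma)\), the trace of the ridge hat matrix and the sum of shrinkage factors from the SVD, agree numerically (illustrative, synthetic data):

```python
import numpy as np

# df(gamma) computed two ways: trace of the ridge hat matrix H_gamma, and the
# sum of the shrinkage factors d_j^2 / (d_j^2 + gamma) from the SVD.
rng = np.random.default_rng(4)
N, p, gamma = 60, 6, 3.0
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                        # centered inputs
d = np.linalg.svd(X, compute_uv=False)

H = X @ np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T)
print(np.isclose(np.trace(H), np.sum(d**2 / (d**2 + gamma))))   # True
print(np.sum(d**2 / (d**2 + 0.0)))                              # equals p when gamma = 0
```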