3.3 Subset Selection

There are two reasons why we are not satisfied with the least squares estimates:

- Prediction accuracy: the least squares estimates often have low bias but large variance; accuracy can sometimes be improved by shrinking or setting some coefficients to zero.
- Interpretation: with a large number of predictors, we often prefer a smaller subset that exhibits the strongest effects.

3.3.1 Best-Subset Selection

Best-subset regression finds, for each k, the subset of size k that gives the smallest RSS. An efficient algorithm, the leaps and bounds procedure, makes this feasible for p as large as 30 or 40. The question of how to choose k involves the tradeoff between bias and variance; typically we choose the smallest model that minimizes an estimate of expected prediction error.
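As a concrete illustration, here is a minimal brute-force sketch in Python (the function name `best_subsets` and the NumPy-based fitting are our own choices; it enumerates every subset directly rather than using leaps and bounds, so it is only practical for small p):

```python
# Brute-force best-subset selection: for each size k, enumerate all
# subsets and keep the one with the smallest residual sum of squares.
from itertools import combinations

import numpy as np


def best_subsets(X, y):
    """Return, for each size k, the column subset minimizing RSS.

    X: (N, p) matrix of predictors; y: (N,) response.
    """
    n, p = X.shape
    # k = 0 corresponds to the intercept-only model.
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)
    return best
```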

TODO: implement FIGURE 3.5

3.3.2 Forward- and Backward-Stepwise Selection

Rather than searching through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them.

Forward-stepwise selection.

Forward-stepwise selection starts with the intercept, and sequentially adds the predictor that most improves the fit. Clever updating algorithms can exploit the QR decomposition of the current fit to rapidly establish the next candidate. Like best-subset regression, the subset size k must be determined.
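A minimal sketch of the greedy loop is below. Instead of the QR-updating trick mentioned above, each candidate model is simply refit from scratch with `numpy.linalg.lstsq` (a simplification of our own); the greedy structure is the same.

```python
import numpy as np


def forward_stepwise(X, y):
    """Greedily add the predictor that most reduces RSS; return the path."""
    n, p = X.shape
    active, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = active + [j]
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
        remaining.remove(best_j)
        path.append((list(active), best_rss))
    return path  # one entry per subset size k = 1..p
```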

Forward-stepwise selection is a greedy algorithm; however, there are reasons why it might be preferred:

- Computational: for large p we cannot compute the best-subset sequence.
- Statistical: it will have lower variance, but perhaps more bias.

Backward-stepwise selection.

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit (i.e., the one with the smallest Z-score). Backward selection can be used only when N > p.
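A minimal sketch, assuming N > p so that the full fit and the variance estimate exist (the function name and the stopping choice of deleting until the active set is empty are our own):

```python
import numpy as np


def backward_stepwise(X, y):
    """Repeatedly drop the predictor with the smallest absolute Z-score."""
    n, p = X.shape
    active = list(range(p))
    path = [list(active)]
    while active:
        Xk = np.column_stack([np.ones(n), X[:, active]])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        df = n - Xk.shape[1]                      # requires N > p at the start
        sigma2 = float(resid @ resid) / df        # unbiased variance estimate
        v = np.diag(np.linalg.inv(Xk.T @ Xk))     # diagonal of (X^T X)^{-1}
        z = beta / np.sqrt(sigma2 * v)            # Z-scores, intercept first
        j = int(np.argmin(np.abs(z[1:])))         # weakest non-intercept term
        active.pop(j)
        path.append(list(active))
    return path
```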

Hybrid stepwise selection.

Some software packages implement hybrid stepwise-selection strategies that consider both forward and backward moves at each step and select the better of the two (e.g., the step function in R, which uses the AIC criterion to weigh the choices).
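A rough sketch in the spirit of R's step follows; the Gaussian AIC formula used here (n·log(RSS/n) + 2·df, up to constants) and the stopping rule are our own simplifications, not R's exact implementation.

```python
import numpy as np


def aic(X, y, cols):
    """Gaussian AIC (up to constants) for the model on the given columns."""
    n = len(y)
    Xk = np.column_stack([np.ones(n)] + ([X[:, cols]] if cols else []))
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    rss = float(np.sum((y - Xk @ beta) ** 2))
    return n * np.log(rss / n) + 2 * Xk.shape[1]


def hybrid_stepwise(X, y):
    """At each step try every single addition and deletion; take the best."""
    p = X.shape[1]
    active, current = [], aic(X, y, [])
    while True:
        moves = [(aic(X, y, active + [j]), active + [j])
                 for j in range(p) if j not in active]
        moves += [(aic(X, y, [c for c in active if c != j]),
                   [c for c in active if c != j]) for j in active]
        if not moves:
            break
        best, cols = min(moves, key=lambda m: m[0])
        if best >= current:
            break                     # no move improves AIC: stop
        current, active = best, cols
    return active
```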

3.3.3 Forward-Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts with an intercept equal to \(\overline{y}\), and centered predictors with coefficients initially all 0. At each step it identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and adds it to the current coefficient for that variable. This continues until none of the variables have correlation with the residual.
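A minimal sketch of this loop; the tolerance `eps` and the iteration cap are our own stand-ins for "no remaining correlation":

```python
import numpy as np


def forward_stagewise(X, y, eps=1e-8, max_iter=10_000):
    """Forward-stagewise regression on centered predictors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                  # centered predictors
    beta = np.zeros(p)
    resid = y - y.mean()                     # intercept fixed at y-bar
    for _ in range(max_iter):
        corr = Xc.T @ resid                  # inner products with the residual
        j = int(np.argmax(np.abs(corr)))     # most correlated variable
        if abs(corr[j]) < eps:
            break                            # residual uncorrelated with all X_j
        # Simple least squares coefficient of the residual on X_j,
        # added to the running coefficient for that variable.
        delta = corr[j] / (Xc[:, j] @ Xc[:, j])
        beta[j] += delta
        resid -= delta * Xc[:, j]
    return beta
```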

3.3.4 Prostate Cancer Data Example

TODO: