Why does L1 regularization encourage coefficients to shrink to zero?
basics
loss
l1
l2
lasso
ridge
Author
Madiyar Aitbayev
Published
December 27, 2024
Introduction
Regularization is a common method for dealing with overfitting in Machine Learning (ML). The simplest and most widely used methods are L1 (lasso) and L2 (ridge) regularization. Both are well covered in numerous tutorials and books; however, I could not find a good geometric or intuitive explanation of why L1 encourages coefficients to shrink to exactly zero. This post tries to fill that gap.
The post contains collapsed code sections that are used to produce the visualizations. They’re optional, hence collapsed.
The Lasso
The lasso regression is a linear regression model that shrinks the coefficients by imposing a constraint on their magnitude. Namely, it constrains the sum of the absolute values of the coefficients:
\[
\begin{align}
\hat{\beta}^{lasso} = \underset{\beta}{argmin} & \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2\\
\text{ subject to } & \sum_{j=1}^p|\beta_j| \le t
\end{align}
\]
The above is the same as equation (3.51) from “The Elements of Statistical Learning” (ESL) book by Hastie, Tibshirani, and Friedman.
We can also write the lasso in the equivalent Lagrangian form (3.52), which penalizes the sum of the absolute values of the coefficients:
\[
\hat{\beta}^{lasso} = \underset{\beta}{argmin} \left\{ \frac{1}{2}\sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2 + \lambda\sum_{j=1}^p|\beta_j| \right\}
\]
The equations (3.51) and (3.52) are equivalent for appropriately matched \(\lambda\) and \(t\) hyperparameters. The latter form (3.52) is the one more commonly used in ML.
Making \(t\) sufficiently small shrinks some of the coefficients to exactly zero; we will give the geometric intuition for this later. Thus, the lasso can be used for feature selection, i.e., for identifying less important features.
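To make this concrete, here is a minimal sketch using scikit-learn's Lasso on synthetic data; the dataset and the value of alpha (which plays the role of \(\lambda\) in (3.52)) are arbitrary choices for illustration:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # the three irrelevant coefficients come out exactly 0.0
The relevant coefficients are also shrunk, but only the irrelevant ones are driven exactly to zero, which is what makes the lasso usable for feature selection.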
The Ridge
The ridge regression is similar to the lasso, except it penalizes the sum of squares of the coefficients (3.41):
\[
\hat{\beta}^{ridge} = \underset{\beta}{argmin} \left\{ \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2 + \lambda\sum_{j=1}^p\beta_j^2 \right\}
\]
An equivalent way to write the ridge problem is (3.42):
\[
\begin{align}
\hat{\beta}^{ridge} = \underset{\beta}{argmin} & \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2\\
\text{ subject to } & \sum_{j=1}^p\beta_j^2 \le t
\end{align}
\]
The ridge regression also shrinks the coefficients towards zero when \(t\) is sufficiently small; however, the coefficients typically do not become exactly zero.
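For contrast, the same synthetic setup with scikit-learn's Ridge (again, alpha is an arbitrary illustrative choice) shrinks every coefficient a little, but typically leaves none of them exactly at zero:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
print(ridge.coef_)  # all five coefficients are nonzero, just smaller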
Ridge vs Lasso
Prior explanations
The best post I found so far is https://explained.ai/regularization, which explains the difference between L1 and L2 empirically by simulating random loss functions. I really encourage you to check out the tutorial.
Another explanation is given by the amazing The Elements of Statistical Learning book:
The constraint region for L2 is the disk \(\beta_1^2+\beta_2^2 \le t\) (right figure), while it is the diamond \(|\beta_1|+|\beta_2| \le t\) for L1 (left figure). Both methods find the first intersection point between the elliptical contours (the loss) and the constraint region. The corners of the diamond have one parameter equal to zero. In higher-dimensional spaces the diamond becomes a rhomboid with many corners, so there are many more opportunities to intersect at a corner.
The explanations given by the book and other sources are self-contained, but I could not fully grasp what is so special about the corners compared with the other points on the edges. There are only 4 corners and infinitely many points on the edges, so shouldn't the probability of touching one of the other points be higher?
Geometric Explanation
Setup
We will use a simple loss function with circular contours instead of elliptical ones:
\[
loss(\beta_1, \beta_2) = 2{(\beta_1 - c_x)}^2 + 2{(\beta_2 - c_y)}^2 + 100
\]
Once we understand the intuition for circles, it is easy to extend it to other contours, such as elliptical ones. We will use \(c_x=15\) and \(c_y=5\) most of the time.
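For reference, here is a minimal sketch of the loss helper that the collapsed plotting code below assumes; the name loss and the keyword arguments cx and cy are taken from the calls in that code, but the exact signature is an assumption:
def loss(beta1, beta2, cx=15, cy=5):
    # Circular contours centered at (cx, cy); the +100 offset only lifts
    # the surface in the 3D plots and does not move the minimum.
    return 2 * (beta1 - cx) ** 2 + 2 * (beta2 - cy) ** 2 + 100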
Let’s visualize our loss \(2{(\beta_1 - 15)}^2 + 2{(\beta_2 - 5)}^2 + 100\) with the L1 constraint \(t=5\), i.e., \(|\beta_1| + |\beta_2| \le 5\), in 3D:
plot3d(Reg.L1, t=5)
In lasso regression, we are looking for the point \((\beta_1, \beta_2)\) within the diamond that has the lowest loss; it is marked with the red point in the figure above. The global minimum without any constraint is marked with “x”.
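The red point can be found with a simple brute-force grid search over the constraint region. Below is a minimal sketch of how the argmin_within_constraint helper used by the collapsed plotting code could look; the real helper presumably reads cx, cy and beta_range as globals and the Reg enum mirrors the one used elsewhere in this post, so treat the exact signature as an assumption:
from enum import Enum
import numpy as np

class Reg(Enum):
    L1 = 1
    L2 = 2

def argmin_within_constraint(reg: Reg, t: float, cx=15, cy=5,
                             beta_range=(-20, 20), n=1001):
    # Evaluate the loss on a dense grid and mask out every point that
    # falls outside the constraint region (diamond for L1, disk for L2).
    b = np.linspace(*beta_range, n)
    B1, B2 = np.meshgrid(b, b)
    Z = 2 * (B1 - cx) ** 2 + 2 * (B2 - cy) ** 2 + 100  # same loss as above
    if reg == Reg.L1:
        inside = np.abs(B1) + np.abs(B2) <= t
    else:
        inside = B1 ** 2 + B2 ** 2 <= t ** 2
    Z = np.where(inside, Z, np.inf)
    i, j = np.unravel_index(np.argmin(Z), Z.shape)
    return B1[i, j], B2[i, j]

print(argmin_within_constraint(Reg.L1, t=5))  # the red point, near the corner (5, 0)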
The same visualization, but with the L2 constraint \(t=5\), i.e., \(\beta_1^2+\beta_2^2 \le 5^2\):
plot3d(Reg.L2, t=5)
The corresponding 2D visualizations, similar to the ones given in The Elements of Statistical Learning book:
code for plot2d
def loss_contour(ax):
    beta0 = np.linspace(*beta_range, 100)
    beta1 = np.linspace(*beta_range, 100)
    B0, B1 = np.meshgrid(beta0, beta1)
    Z = loss(B0, B1, cx=cx, cy=cy)
    ax.contour(B0, B1, Z, levels=50, linewidths=.5, cmap='coolwarm', vmax=vmax)
    # draw the global minima
    ax.plot([cx], [cy], marker='x', markersize=10, color='black')

def ax2d_init(ax):
    ax.set_aspect('equal')
    ax.set_xlabel("$\\beta_1$", labelpad=0)
    ax.set_ylabel("$\\beta_2$", labelpad=0)
    ax.tick_params(axis='x', pad=0)
    ax.tick_params(axis='y', pad=0)
    ax.set_xlim(*beta_range)
    ax.set_ylim(*beta_range)
    # draw axes
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')

def plot2d(regs: list[Reg], t: float):
    fig = plt.figure(figsize=(4 * len(regs), 4))
    axes = fig.subplots(1, len(regs))
    for ax in axes:
        ax2d_init(ax)
    for reg, ax in zip(regs, axes):
        # draw the regularization safe region
        ax.add_patch(make_reg_shape(reg=reg, t=t))
        loss_contour(ax)
        # draw minima within constraint
        mx, my = argmin_within_constraint(reg, t)
        ax.plot([mx], [my], marker='.', markersize=10, color='r')
    plt.tight_layout()
    plt.show()
plot2d([Reg.L1, Reg.L2], t=5)
Explanation
In both cases (L1, L2), we’re looking for a contour that just touches the constraint region. For example, when we have the following case:
Then it is better to shrink the constraint region (i.e., the diamond) until the two touch at a single point, as shown below. Note that a contour has the same loss at every point along it.
When does the lasso pick a corner of the diamond over any other point on the edges? To simplify the problem, let's fix a specific circle among the contours: a circle with a fixed radius \(r\). Now we want to come up with a formula for the probability that a random tangent circle with radius \(r\) touches our diamond at a corner versus at any other point.
The above problem becomes much simpler with a visualization:
from IPython.display import HTML

HTML(html5_video_tangent_circle_trajectories(Reg.L1, t=5, radius=6))
The animation above illustrates all tangent circle trajectories. The circle is drawn green when it touches the diamond at a corner and blue when it touches at any other point. The total length of all blue trajectories (touching along the edges) is \(4\times \sqrt{2}\times t\), the perimeter of the diamond, while the total length of all green trajectories (touching at the corners) is \(2\pi \times r\), where \(t\) is the constraint and \(r\) is the radius of the circle. This means the circle is more likely to touch at a corner when \(t\) is sufficiently small or \(r\) is sufficiently large.
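If we assume the tangency point is equally likely to land anywhere along the trajectory of the tangent circle's center (a simplifying assumption of this sketch), the two lengths directly give the probability of touching at a corner:
\[
P(\text{corner}) = \frac{2\pi r}{2\pi r + 4\sqrt{2}\,t}
\]
A quick numeric check, using the same \(t=5\) and \(r=6\) as the animation:
import math

def corner_probability(t: float, r: float) -> float:
    # Arcs around the 4 corners have total length 2*pi*r,
    # segments parallel to the 4 edges have total length 4*sqrt(2)*t.
    corner_len = 2 * math.pi * r
    edge_len = 4 * math.sqrt(2) * t
    return corner_len / (corner_len + edge_len)

print(corner_probability(t=5, r=6))   # ~0.57
print(corner_probability(t=1, r=6))   # ~0.87: smaller t favors the corners
print(corner_probability(t=5, r=20))  # ~0.82: larger r favors the corners
We illustrate one such case below: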
In ridge (L2), there are no special points like corners: every point on the boundary of the disk has the same probability of being touched by a random tangent circle with radius \(r\):