Why does L1 regularization encourage coefficients to shrink to zero?
basics
loss
l1
l2
lasso
ridge
Author
Madiyar Aitbayev
Published
December 27, 2024
Introduction
Regularization is a common method for dealing with overfitting in Machine Learning (ML). The simplest and most widely used methods are L1 (lasso) and L2 (ridge) regularization. Both are well covered in numerous tutorials and books; however, I could not find a good geometric or intuitive explanation of why L1 encourages coefficients to shrink to exactly zero. This post tries to fill that gap.
The post contains collapsed code sections that are used to produce the visualizations. They’re optional, hence collapsed.
The Lasso
The lasso regression is a linear regression model that shrinks the coefficients by imposing a constraint on their magnitude. Namely, it constrains the sum of the absolute values of the coefficients:
\[
\begin{align}
\hat{\beta}^{lasso} = \underset{\beta}{argmin} & \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2\\
\text{ subject to } & \sum_{j=1}^p|\beta_j| \le t
\end{align}
\]
The above is the same as equation (3.51) from “The Elements of Statistical Learning” (ESL) book by Hastie, Tibshirani, and Friedman.
We can also write the lasso in the equivalent Lagrangian form (3.52), which penalizes the sum of the absolute values of the coefficients:
\[
\hat{\beta}^{lasso} = \underset{\beta}{argmin} \left\{ \frac{1}{2}\sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2 + \lambda\sum_{j=1}^p|\beta_j| \right\}
\]
The equations (3.51) and (3.52) are equivalent for appropriately matched \(\lambda\) and \(t\) hyperparameters. The latter form (3.52) is the one more commonly used in ML.
Making \(t\) sufficiently small shrinks some of the coefficients to exactly zero; we will give the geometric intuition for this later. Thus, the lasso can be used for feature selection, i.e., for identifying less important features.
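To make this concrete, here is a minimal sketch using scikit-learn's Lasso on synthetic data; the dataset and the value of alpha (which plays the role of \(\lambda\) in (3.52)) are arbitrary choices for illustration:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # the three irrelevant coefficients come out exactly 0.0
The relevant coefficients are also shrunk, but only the irrelevant ones are driven exactly to zero, which is what makes the lasso usable for feature selection.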
The Ridge
The ridge regression is similar to the lasso, except it penalizes the sum of squares of the coefficients (3.41):
\[
\hat{\beta}^{ridge} = \underset{\beta}{argmin} \left\{ \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2 + \lambda\sum_{j=1}^p\beta_j^2 \right\}
\]
An equivalent way to write the ridge problem is (3.42):
\[
\begin{align}
\hat{\beta}^{ridge} = \underset{\beta}{argmin} & \sum_{i=1}^N \left( y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j} \right)^2\\
\text{ subject to } & \sum_{j=1}^p\beta_j^2 \le t
\end{align}
\]
The ridge regression also shrinks the coefficients towards zero when \(t\) is sufficiently small; however, the coefficients typically do not become exactly zero.
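For contrast, the same synthetic setup with scikit-learn's Ridge (again, alpha is an arbitrary illustrative choice) shrinks every coefficient a little, but typically leaves none of them exactly at zero:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)
print(ridge.coef_)  # all five coefficients are nonzero, just smaller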
Ridge vs Lasso
Prior explanations
The best post I found so far is https://explained.ai/regularization, which explains the difference between L1 and L2 empirically by simulating random loss functions. I really encourage you to check out the tutorial.
Another explanation is given by the amazing The Elements of Statistical Learning book:
The constraint region for L2 is the disk \(\beta_1^2+\beta_2^2 \le t\) (right figure), while it is the diamond \(|\beta_1|+|\beta_2| \le t\) for L1 (left figure). Both methods find the first intersection point between the elliptical contours (the loss) and the constraint region. The corners of the diamond have one parameter equal to zero. In higher-dimensional spaces the diamond becomes a rhomboid with many corners, so there are many more opportunities to intersect at a corner.
The explanations given by the book and other sources are self-contained, but I could not fully grasp what is so special about the corners compared with the other points on the edges. There are only 4 corners and infinitely many points on the edges, so shouldn't the probability of touching one of the other points be higher?
Geometric Explanation
Setup
We will use a simple loss function with circular contours instead of elliptical ones:
\[
loss(\beta_1, \beta_2) = 2{(\beta_1 - c_x)}^2 + 2{(\beta_2 - c_y)}^2 + 100
\]
Once we understand the intuition for circles, it is easy to extend it to other contours, such as elliptical ones. We will use \(c_x=15\) and \(c_y=5\) most of the time.
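For reference, here is a minimal sketch of the loss helper that the collapsed plotting code below assumes; the name loss and the keyword arguments cx and cy are taken from the calls in that code, but the exact signature is an assumption:
def loss(beta1, beta2, cx=15, cy=5):
    # Circular contours centered at (cx, cy); the +100 offset only lifts
    # the surface in the 3D plots and does not move the minimum.
    return 2 * (beta1 - cx) ** 2 + 2 * (beta2 - cy) ** 2 + 100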
Let’s visualize our loss \(2{(\beta_1 - 15)}^2 + 2{(\beta_2 - 5)}^2 + 100\) with the L1 constraint \(t=5\), i.e., \(|\beta_1| + |\beta_2| \le 5\), in 3D:
plot3d(Reg.L1, t=5)
In lasso regression, we are looking for the point \((\beta_1, \beta_2)\) within the diamond that has the lowest loss; it is marked with the red point in the figure above. The global minimum without any constraint is marked with “x”.
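The red point can be found with a simple brute-force grid search over the constraint region. Below is a minimal sketch of how the argmin_within_constraint helper used by the collapsed plotting code could look; the real helper presumably reads cx, cy and beta_range as globals and the Reg enum mirrors the one used elsewhere in this post, so treat the exact signature as an assumption:
from enum import Enum
import numpy as np

class Reg(Enum):
    L1 = 1
    L2 = 2

def argmin_within_constraint(reg: Reg, t: float, cx=15, cy=5,
                             beta_range=(-20, 20), n=1001):
    # Evaluate the loss on a dense grid and mask out every point that
    # falls outside the constraint region (diamond for L1, disk for L2).
    b = np.linspace(*beta_range, n)
    B1, B2 = np.meshgrid(b, b)
    Z = 2 * (B1 - cx) ** 2 + 2 * (B2 - cy) ** 2 + 100  # same loss as above
    if reg == Reg.L1:
        inside = np.abs(B1) + np.abs(B2) <= t
    else:
        inside = B1 ** 2 + B2 ** 2 <= t ** 2
    Z = np.where(inside, Z, np.inf)
    i, j = np.unravel_index(np.argmin(Z), Z.shape)
    return B1[i, j], B2[i, j]

print(argmin_within_constraint(Reg.L1, t=5))  # the red point, near the corner (5, 0)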
The same visualization, but with the L2 constraint \(t=5\), i.e., \(\beta_1^2+\beta_2^2 \le 5^2\):
plot3d(Reg.L2, t=5)
The corresponding 2D visualizations, similar to the ones given in The Elements of Statistical Learning book:
code for plot2d
def loss_contour(ax):
    beta0 = np.linspace(*beta_range, 100)
    beta1 = np.linspace(*beta_range, 100)
    B0, B1 = np.meshgrid(beta0, beta1)
    Z = loss(B0, B1, cx=cx, cy=cy)
    ax.contour(B0, B1, Z, levels=50, linewidths=.5, cmap='coolwarm', vmax=vmax)
    # draw the global minima
    ax.plot([cx], [cy], marker='x', markersize=10, color='black')

def ax2d_init(ax):
    ax.set_aspect('equal')
    ax.set_xlabel("$\\beta_1$", labelpad=0)
    ax.set_ylabel("$\\beta_2$", labelpad=0)
    ax.tick_params(axis='x', pad=0)
    ax.tick_params(axis='y', pad=0)
    ax.set_xlim(*beta_range)
    ax.set_ylim(*beta_range)
    # draw axes
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')

def plot2d(regs: list[Reg], t: float):
    fig = plt.figure(figsize=(4 * len(regs), 4))
    axes = fig.subplots(1, len(regs))
    for ax in axes:
        ax2d_init(ax)
    for reg, ax in zip(regs, axes):
        # draw the regularization safe region
        ax.add_patch(make_reg_shape(reg=reg, t=t))
        loss_contour(ax)
        # draw minima within constraint
        mx, my = argmin_within_constraint(reg, t)
        ax.plot([mx], [my], marker='.', markersize=10, color='r')
    plt.tight_layout()
    plt.show()
plot2d([Reg.L1, Reg.L2], t=5)
Explanation
In both cases (L1, L2), we’re looking for a contour that just touches the constraint region. For example, when we have the following case:
Then it is better to shrink the constraint region (i.e., the diamond) until the two touch at a single point, as shown below. Note that a contour has the same loss at every point along it.
When does the lasso pick a corner of the diamond over any other point on the edges? To simplify the problem, let's fix a specific circle among the contours: a circle with a fixed radius \(r\). Now we want to come up with a formula for the probability that a random tangent circle with radius \(r\) touches our diamond at a corner versus at any other point.
The above problem becomes much simpler with a visualization:
from IPython.display import HTML

HTML(html5_video_tangent_circle_trajectories(Reg.L1, t=5, radius=6))
The animation above illustrates all tangent circle trajectories. The circle is drawn green when it touches the diamond at a corner and blue when it touches at any other point. The total length of all blue trajectories (touching along the edges) is \(4\times \sqrt{2}\times t\), the perimeter of the diamond, while the total length of all green trajectories (touching at the corners) is \(2\pi \times r\), where \(t\) is the constraint and \(r\) is the radius of the circle. This means the circle is more likely to touch at a corner when \(t\) is sufficiently small or \(r\) is sufficiently large.
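If we assume the tangency point is equally likely to land anywhere along the trajectory of the tangent circle's center (a simplifying assumption of this sketch), the two lengths directly give the probability of touching at a corner:
\[
P(\text{corner}) = \frac{2\pi r}{2\pi r + 4\sqrt{2}\,t}
\]
A quick numeric check, using the same \(t=5\) and \(r=6\) as the animation:
import math

def corner_probability(t: float, r: float) -> float:
    # Arcs around the 4 corners have total length 2*pi*r,
    # segments parallel to the 4 edges have total length 4*sqrt(2)*t.
    corner_len = 2 * math.pi * r
    edge_len = 4 * math.sqrt(2) * t
    return corner_len / (corner_len + edge_len)

print(corner_probability(t=5, r=6))   # ~0.57
print(corner_probability(t=1, r=6))   # ~0.87: smaller t favors the corners
print(corner_probability(t=5, r=20))  # ~0.82: larger r favors the corners
We illustrate one such case below: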
In ridge (L2), there are no special points like corners: every point on the boundary of the disk has the same probability of being touched by a random tangent circle with radius \(r\):