3.2.3 Multiple Regression From Simple Univariate Regression

Suppose we have a univariate (p = 1) model with no intercept (3.23): \[Y=X\beta+\varepsilon\]

The least squares estimate and residuals are (3.24): \[ \begin{equation} \hat{\beta} = \cfrac{\sum_1^N {x_iy_i}}{\sum_1^N {x_i^2}} \\ r_i = y_i - x_i\hat{\beta} \end{equation} \]

With the inner product: \[ \begin{equation} \hat{\beta} = \cfrac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x}\rangle}\\ \mathbf{r} = \mathbf{y} - \mathbf{x}\hat{\beta} \end{equation} \]
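
In numpy, (3.24) is a one-liner. A minimal sketch on made-up data (the true coefficient 2.0 and the noise scale are arbitrary):

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(50)                       # single input, no intercept
y = 2.0 * x + 0.1 * rng.randn(50)       # toy response

beta_hat = np.dot(x, y) / np.dot(x, x)  # (3.24): <x, y> / <x, x>
r = y - x * beta_hat                    # residual vector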

Suppose that the columns of the matrix \(\mathbf{X}\) are orthogonal, that is, \(\langle \mathbf{x}_j, \mathbf{x}_k \rangle = 0\) for all \(j \neq k\). Then it is easy to check that \(\hat{\beta}_j = \langle \mathbf{x}_j, \mathbf{y} \rangle / \langle \mathbf{x}_j, \mathbf{x}_j \rangle\), i.e. the inputs have no effect on each other's parameter estimates.
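
A small numerical check, on a made-up design whose two columns are orthogonal, confirms that the univariate estimates coincide with the full multiple least squares solution:

import numpy as np

X = np.array([[ 1.,  1.],
              [ 1., -1.],
              [-1.,  1.],
              [-1., -1.]])              # orthogonal columns
y = np.array([1., 2., 3., 4.])

# univariate estimates <x_j, y> / <x_j, x_j>
uni = np.array([X[:, j] @ y / (X[:, j] @ X[:, j]) for j in range(X.shape[1])])

# full multiple least squares solution
multi = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(uni, multi))          # True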

Suppose next that we have an intercept and a single input \(\mathbf{x}\) (3.27): \[\hat{\beta}_1 = \cfrac{\langle \mathbf{x} - \overline{x}\mathbf{1}, \mathbf{y} \rangle}{ \langle \mathbf{x} - \overline{x}\mathbf{1}, \mathbf{x} - \overline{x}\mathbf{1} \rangle}\]

We can view this estimate as the result of two simple regressions:

  1. Regress x on 1 to produce the residual \(\mathbf{z} = \mathbf{x} - \overline{x}\mathbf{1}\)

  2. Regress y on the residual z to give the coefficient \(\hat{\beta}_1\).

Here "regress \(\mathbf{b}\) on \(\mathbf{a}\)" means a simple univariate regression with no intercept, producing the coefficient \(\hat{\gamma}=\langle \mathbf{a},\mathbf{b} \rangle / \langle \mathbf{a}, \mathbf{a}\rangle\) and the residual vector \(\mathbf{b} - \hat{\gamma}\mathbf{a}\).
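
The recipe is easy to verify on made-up data: the slope obtained by regressing \(\mathbf{y}\) on the centered \(\mathbf{x}\) matches the slope of an ordinary fit with intercept (np.polyfit is used here only as a convenient reference):

import numpy as np

rng = np.random.RandomState(1)
x = rng.randn(30)
y = 1.5 + 0.8 * x + 0.2 * rng.randn(30)

z = x - x.mean()                        # step 1: regress x on 1, keep the residual
beta1 = np.dot(z, y) / np.dot(z, z)     # step 2: regress y on z

slope, intercept = np.polyfit(x, y, 1)  # reference fit with intercept
print(np.isclose(beta1, slope))         # True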

This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1.

Algorithm 3.1 Regression by Successive Orthogonalization

  1. Initialize \(\mathbf{z}_0 = \mathbf{x}_0 = \mathbf{1}\).

  2. For \(j = 1, 2, \cdots, p\):

    • Regress \(\mathbf{x}_j\) on \(\mathbf{z}_0,\cdots,\mathbf{z}_{j - 1}\) to produce coefficients \(\hat{\gamma}_{lj}=\langle \mathbf{z}_l, \mathbf{x}_j \rangle / \langle \mathbf{z}_l,\mathbf{z}_l \rangle\), \(l=0,\cdots,j-1\), and the residual vector \(\mathbf{z}_j=\mathbf{x}_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\mathbf{z}_k\).

  3. Regress \(\mathbf{y}\) on the residual \(\mathbf{z}_p\) to give the estimate \(\hat{\beta}_p\).

import numpy as np
import pandas as pd
from scipy import stats, linalg

# Load the prostate data, split off the train/test indicator and the response,
# and standardize the predictors.
df = pd.read_csv('../data/prostate/prostate.data', delimiter='\t', index_col=0)
mask_train = df.pop('train')
df_y = df.pop('lpsa')
df = df.apply(stats.zscore)

def orthogonalize(X):
    """Algorithm 3.1: successively orthogonalize the columns of X.

    Returns Z, whose columns are orthogonal, and the upper triangular Gamma
    with entries gamma_{lj}, so that X = Z @ Gamma.
    """
    p = X.shape[1]
    G = np.eye(p)
    Z = X.copy()
    for j in range(1, p):
        # Regress x_j on z_0, ..., z_{j-1}
        for l in range(j):
            G[l, j] = np.dot(Z[:, l], X[:, j]) / np.dot(Z[:, l], Z[:, l])
        # Residual: z_j = x_j - sum_k gamma_{kj} z_k
        for k in range(j):
            Z[:, j] -= G[k, j] * Z[:, k]
    return Z, G
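
As a sanity check (not in the original listing), the training design with an added intercept column, here called X_train, should give a Z with mutually orthogonal columns and satisfy X = ZΓ from (3.30):

X_train = np.c_[np.ones((mask_train == 'T').sum()), df[mask_train == 'T'].to_numpy()]
Z, G = orthogonalize(X_train)

ZtZ = Z.T @ Z
print(np.allclose(ZtZ, np.diag(np.diag(ZtZ))))  # Z^T Z is (numerically) diagonal
print(np.allclose(Z @ G, X_train))              # X = Z Gamma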

The result of this algorithm is (3.28):

\[\hat{\beta}_p=\cfrac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p,\mathbf{z}_p \rangle}\]

If \(\mathbf{x}_p\) is highly correlated with some of the other \(\mathbf{x}_k\)’s, the residual vector \(\mathbf{z}_p\) will be close to zero, and from (3.28) the coefficient \(\hat{\beta}_p\) will be very unstable.

From (3.28) we also obtain an alternative formula for the variance estimates, (3.29):

\[Var(\hat{\beta}_p) = \cfrac{\sigma^2}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}=\cfrac{\sigma^2}{||\mathbf{z}_p||^2} \]

In other words, the precision with which we can estimate \(\hat{\beta}_p\) depends on the length of the residual vector \(\mathbf{z}_p\); this represents how much of \(\mathbf{x}_p\) is unexplained by the other \(\mathbf{x}_k\)’s.
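
A made-up collinear example makes this concrete, reusing the orthogonalize function defined above (sigma2 = 1 is just a placeholder for the noise variance):

rng = np.random.RandomState(2)
N = 100
x1 = rng.randn(N)
x2 = x1 + 0.01 * rng.randn(N)           # nearly a copy of x1
X = np.c_[np.ones(N), x1, x2]

Z, _ = orthogonalize(X)
z_p = Z[:, -1]
sigma2 = 1.0                            # placeholder noise variance
print(np.dot(z_p, z_p))                 # ||z_p||^2 is tiny
print(sigma2 / np.dot(z_p, z_p))        # so Var(beta_p) in (3.29) is huge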

Algorithm 3.1 is known as the Gram–Schmidt procedure for multiple regression. We can represent step 2 of Algorithm 3.1 in matrix form (3.30):

\[\mathbf{X}=\mathbf{Z\Gamma}\]

where \(\mathbf{Z}\) has as columns the \(\mathbf{z}_j\) (in order), and \(\mathbf{\Gamma}\) is the upper triangular matrix with entries \(\hat{\gamma}_{kj}\). Introducing the diagonal matrix \(\mathbf{D}\) with \(D_{jj}=\|\mathbf{z}_j\|\), we get (3.31):

\[\mathbf{X}=\mathbf{Z}\mathbf{D}^{-1}\mathbf{D}\mathbf{\Gamma}=\mathbf{QR}\]

the so-called QR decomposition of \(\mathbf{X}\), with \(\mathbf{Q} = \mathbf{Z}\mathbf{D}^{-1}\) and \(\mathbf{R} = \mathbf{D}\mathbf{\Gamma}\). Here \(\mathbf{Q}\) is an N × (p + 1) orthogonal matrix, \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\), and \(\mathbf{R}\) is a (p + 1) × (p + 1) upper triangular matrix.

The least squares solution is given by:

\[ \hat{\beta}=\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \]

Proof: Starting from the normal equations, substituting \(\mathbf{X} = \mathbf{QR}\), and cancelling \(\mathbf{R}^T\) (which is nonsingular when the columns of \(\mathbf{X}\) are linearly independent): \[ \begin{equation} \mathbf{X}^T\mathbf{y}=\mathbf{X}^T\mathbf{X}\hat{\beta}\\ \mathbf{R}^T\mathbf{Q}^T\mathbf{y}=\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R}\hat{\beta}\\ \mathbf{R}^T\mathbf{Q}^T\mathbf{y}=\mathbf{R}^T\mathbf{R}\hat{\beta}\\ \mathbf{Q}^T\mathbf{y}=\mathbf{R}\hat{\beta} \end{equation} \] And the predicted training values:

\[ \hat{\mathbf{y}}=\mathbf{QQ}^T\mathbf{y} \]

Proof:

\[ \begin{align} \hat{\mathbf{y}}&=\mathbf{X}\hat{\beta}\\ &=\mathbf{QR}\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}\\ &=\mathbf{QQ}^T\mathbf{y} \end{align} \]
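
A quick check of this identity on made-up data, using numpy's built-in QR factorization as a shortcut:

rng = np.random.RandomState(3)
X = np.c_[np.ones(20), rng.randn(20, 3)]
y = rng.randn(20)

Q, R = np.linalg.qr(X)                  # reduced QR: Q is N x (p+1)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(Q @ Q.T @ y, X @ beta_hat))  # y_hat = Q Q^T y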

From the QR decomposition we can obtain not just \(\hat{\beta}_p\), but also the entire multiple least squares fit.

Proof: We showed above that \[ \mathbf{R}\hat{\beta}=\mathbf{Q}^T\mathbf{y} \]

which can be expanded into: \[ \begin{equation} \begin{bmatrix} R_{00} & R_{01} & \dots & R_{0p} \\ 0 & R_{11} & \dots & R_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & R_{pp} \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = \begin{bmatrix} {Q_{0}}^T\mathbf{y} \\ {Q_{1}}^T\mathbf{y} \\ \vdots \\ {Q_{p}}^T\mathbf{y} \end{bmatrix} \end{equation} \] where \(Q_j\) denotes the j-th column of \(\mathbf{Q}\).

Now, by applying backward substitution, we can obtain the entire multiple least squares fit. For example, to find \(\hat{\beta}_p\): \[ \begin{equation} R_{pp}\hat{\beta}_p = {Q_{p}}^T\mathbf{y}\\ \hat{\beta}_p = \cfrac{\langle Q_p, \mathbf{y} \rangle}{R_{pp}}=\cfrac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p,\mathbf{z}_p \rangle} \end{equation} \] since \(Q_p = \mathbf{z}_p/\|\mathbf{z}_p\|\) and \(R_{pp} = \|\mathbf{z}_p\|\).
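
To make the back substitution explicit, here is a minimal hand-rolled version (the least_squares_qr function below relies on scipy's solve_triangular, which does the same thing):

def back_substitute(R, b):
    # Solve R beta = b for upper triangular R, working from the last row upwards
    p = len(b)
    beta = np.zeros(p)
    for j in range(p - 1, -1, -1):
        beta[j] = (b[j] - R[j, j + 1:] @ beta[j + 1:]) / R[j, j]
    return beta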

def least_squares_qr(data_x, data_y):
    # Prepend the intercept column and run Algorithm 3.1
    X = np.c_[np.ones((len(data_x), 1)), data_x]
    Z, G = orthogonalize(X)

    # QR decomposition (3.31): Q = Z D^{-1}, R = D Gamma
    D = linalg.norm(Z, axis=0)
    Q = Z / D
    R = np.diag(D) @ G
    # Back-substitute R beta = Q^T y
    beta = linalg.solve_triangular(R, Q.T @ data_y)
    return beta

beta = least_squares_qr(df[mask_train == 'T'].to_numpy(), df_y[mask_train == 'T'].to_numpy())
print("Coefficient: ", beta)
Coefficient:  [ 2.46493292  0.67601634  0.26169361 -0.14073374  0.20906052  0.30362332
 -0.28700184 -0.02119493  0.26557614]
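
As a final sanity check (not part of the original output), the coefficients agree with numpy's general least squares solver:

X_train = np.c_[np.ones((mask_train == 'T').sum()), df[mask_train == 'T'].to_numpy()]
y_train = df_y[mask_train == 'T'].to_numpy()
beta_np = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
print(np.allclose(beta, beta_np))       # True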