import numpy as np
import pandas as pd
from scipy import stats, linalg

# Load the prostate data set; split off the train/test indicator and the response.
df = pd.read_csv('../data/prostate/prostate.data', delimiter='\t', index_col=0)
mask_train = df.pop('train')
df_y = df.pop('lpsa')
# Standardize the predictors to zero mean and unit variance.
df = df.apply(stats.zscore)
def orthogonalize(X):
    """Gram-Schmidt: return Z with orthogonal columns and the unit upper
    triangular matrix G of regression coefficients such that X = Z @ G."""
    p = X.shape[1]
    G = np.eye(p)
    Z = X.copy()
    for j in range(1, p):
        # Regress x_j on the already-orthogonalized columns z_0, ..., z_{j-1}.
        for l in range(j):
            G[l, j] = np.dot(Z[:, l], X[:, j]) / np.dot(Z[:, l], Z[:, l])
        # Subtract the projections to leave the residual z_j.
        for k in range(j):
            Z[:, j] -= G[k, j] * Z[:, k]
    return Z, G

Suppose we have a univariate (p = 1) model with no intercept (3.23): \[Y=X\beta+\varepsilon\]
The least squares estimate and residuals are (3.24): \[ \begin{equation} \hat{\beta} = \cfrac{\sum_1^N {x_iy_i}}{\sum_1^N {x_i^2}} \\ r_i = y_i - x_i\hat{\beta} \end{equation} \]
With the inner product: \[ \begin{equation} \hat{\beta} = \cfrac{\langle \mathbf{x}, \mathbf{y} \rangle}{\langle \mathbf{x}, \mathbf{x}\rangle}\\ \mathbf{r} = \mathbf{y} - \mathbf{x}\hat{\beta} \end{equation} \]
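As a quick numeric sketch of (3.24), using simulated data rather than the prostate set (the numpy import from the first cell is assumed):

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)      # true slope of 2, no intercept

beta_hat = np.dot(x, y) / np.dot(x, x)            # <x, y> / <x, x>
r = y - x * beta_hat                              # residuals
print(beta_hat)                                   # roughly 2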
Suppose that the columns of the matrix X are orthogonal; that is, \(\langle \mathbf{x}_j, \mathbf{x}_k \rangle = 0\) for all \(j \neq k\). Then it is easy to check that \(\hat{\beta}_j = \langle \mathbf{x}_j, \mathbf{y} \rangle / \langle \mathbf{x}_j, \mathbf{x}_j \rangle\), i.e., the inputs have no effect on each other's parameter estimates.
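A small check of this claim on a simulated orthogonal design (not part of the original text): the joint least squares coefficients coincide with the univariate ones.

rng = np.random.default_rng(1)
X_orth, _ = np.linalg.qr(rng.normal(size=(100, 3)))   # columns are orthonormal
y = X_orth @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta_joint = np.linalg.lstsq(X_orth, y, rcond=None)[0]
beta_univ = np.array([X_orth[:, j] @ y / (X_orth[:, j] @ X_orth[:, j]) for j in range(3)])
print(np.allclose(beta_joint, beta_univ))             # True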
Suppose next that we have an intercept and a single input x (3.27): \[\hat{\beta}_1 = \cfrac{\langle \mathbf{x} - \overline{x}\mathbf{1}, \mathbf{y} \rangle}{ \langle \mathbf{x} - \overline{x}\mathbf{1}, \mathbf{x} - \overline{x}\mathbf{1} \rangle}\]
We can view the estimate as the result of two simple regressions:
1. Regress x on 1 to produce the residual \(\mathbf{z} = \mathbf{x} - \overline{x}\mathbf{1}\);
2. Regress y on the residual z to give the coefficient \(\hat{\beta}_1\).
Here, to regress b on a means computing the coefficient \(\hat{\gamma}=\langle \mathbf{a},\mathbf{b} \rangle / \langle \mathbf{a}, \mathbf{a}\rangle\) and the residual vector \(\mathbf{b} - \hat{\gamma}\mathbf{a}\).
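A minimal sketch of this two-step recipe on simulated data; the slope it produces matches an ordinary simple linear fit (here checked against np.polyfit):

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, size=80)
y = 1.5 + 0.8 * x + rng.normal(scale=0.3, size=80)

z = x - x.mean()                            # step 1: residual of x regressed on 1
beta_1 = np.dot(z, y) / np.dot(z, z)        # step 2: regress y on z
slope_ols = np.polyfit(x, y, 1)[0]          # slope from a standard simple fit
print(np.isclose(beta_1, slope_ols))        # True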
This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1.
Algorithm 3.1: Regression by Successive Orthogonalization
1. Initialize \(\mathbf{z}_0 = \mathbf{x}_0 = \mathbf{1}\).
2. For \(j = 1, 2, \cdots, p\):
- Regress \(\mathbf{x}_j\) on \(\mathbf{z}_0, \ldots, \mathbf{z}_{j-1}\) to produce the coefficients \(\hat{\gamma}_{lj}=\langle \mathbf{z}_l, \mathbf{x}_j \rangle / \langle \mathbf{z}_l,\mathbf{z}_l \rangle\), \(l=0,\cdots,j-1\), and the residual vector \(\mathbf{z}_j=\mathbf{x}_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\mathbf{z}_k\).
3. Regress \(\mathbf{y}\) on the residual \(\mathbf{z}_p\) to give the estimate \(\hat{\beta}_p\).
The result of this algorithm is (3.28):
\[\hat{\beta}_p=\cfrac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p,\mathbf{z}_p \rangle}\]
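To illustrate (3.28), here is a sketch on simulated data using the orthogonalize() function defined above: the ratio \(\langle \mathbf{z}_p, \mathbf{y} \rangle / \langle \mathbf{z}_p, \mathbf{z}_p \rangle\) matches the last coefficient of a full least squares fit.

rng = np.random.default_rng(3)
X = np.c_[np.ones(60), rng.normal(size=(60, 4))]       # intercept column as in Algorithm 3.1
y = X @ np.array([1.0, 0.5, -1.0, 0.0, 2.0]) + rng.normal(scale=0.2, size=60)

Z, G = orthogonalize(X)
z_p = Z[:, -1]
beta_p = np.dot(z_p, y) / np.dot(z_p, z_p)             # (3.28)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(beta_p, beta_full[-1]))               # True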
If \(\mathbf{x}_p\) is highly correlated with some of the other \(\mathbf{x}_k\)'s, the residual vector \(\mathbf{z}_p\) will be close to zero, and from (3.28) the coefficient \(\hat{\beta}_p\) will be unstable.
From (3.28) we also obtain an alternative formula for the variance estimates, (3.29):
\[Var(\hat{\beta}_p) = \cfrac{\sigma^2}{\langle \mathbf{z}_p, \mathbf{z}_p \rangle}=\cfrac{\sigma^2}{||\mathbf{z}_p||^2} \]
In other words, the precision with which we can estimate \(\hat{\beta}_p\) depends on the length of the residual vector \(\mathbf{z}_p\); this represents how much of \(\mathbf{x}_p\) is unexplained by the other \(\mathbf{x}_k\)'s.
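As a sanity check of (3.29) on simulated data (again reusing orthogonalize() from above), \(\sigma^2/\|\mathbf{z}_p\|^2\) agrees with the usual expression \(\sigma^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{pp}\); here \(\sigma^2\) is estimated from the residuals with \(N - p - 1\) degrees of freedom.

rng = np.random.default_rng(4)
N, p = 60, 4
X = np.c_[np.ones(N), rng.normal(size=(N, p))]
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=N)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2 = np.sum((y - X @ beta) ** 2) / (N - p - 1)     # unbiased estimate of sigma^2

Z, _ = orthogonalize(X)
var_zp = sigma2 / np.dot(Z[:, -1], Z[:, -1])           # (3.29)
var_inv = sigma2 * np.linalg.inv(X.T @ X)[-1, -1]      # classical formula
print(np.isclose(var_zp, var_inv))                     # True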
Algorithm 3.1 is known as the Gram–Schmidt procedure for multiple regression. We can represent step 2 of Algorithm 3.1 in matrix form (3.30):
\[\mathbf{X}=\mathbf{Z\Gamma}\]
where \(\mathbf{Z}\) has as columns the \(\mathbf{z}_j\) (in order), and \(\mathbf{\Gamma}\) is the upper triangular matrix with entries \(\hat{\gamma}_{kj}\). Introducing the diagonal matrix \(\mathbf{D}\) with \(D_{jj}=\|\mathbf{z}_j\|\), we get (3.31):
\[\mathbf{X}=\mathbf{Z}\mathbf{D}^{-1}\mathbf{D}\mathbf{\Gamma}=\mathbf{QR}\]
the so-called QR decomposition of \(\mathbf{X}\). Here \(\mathbf{Q}\) is an N × (p +1) orthogonal matrix, \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\), and R is a (p + 1) × (p + 1) upper triangular matrix.
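A brief check (simulated data, orthogonalize() from above) that the \(\mathbf{Q}\) and \(\mathbf{R}\) built this way behave as claimed in (3.31):

rng = np.random.default_rng(5)
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]

Z, G = orthogonalize(X)
D = np.linalg.norm(Z, axis=0)                          # D_jj = ||z_j||
Q, R = Z / D, np.diag(D) @ G

print(np.allclose(Q @ R, X))                           # X = QR
print(np.allclose(Q.T @ Q, np.eye(X.shape[1])))        # Q^T Q = I
print(np.allclose(R, np.triu(R)))                      # R is upper triangular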
The least squares solution is given by:
\[ \hat{\beta}=\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \]
Proof: starting from the normal equations and substituting \(\mathbf{X}=\mathbf{QR}\): \[ \begin{equation} \mathbf{X}^T\mathbf{y}=\mathbf{X}^T\mathbf{X}\hat{\beta}\\ \mathbf{R}^T\mathbf{Q}^T\mathbf{y}=\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R}\hat{\beta}\\ \mathbf{R}^T\mathbf{Q}^T\mathbf{y}=\mathbf{R}^T\mathbf{R}\hat{\beta}\\ \mathbf{Q}^T\mathbf{y}=\mathbf{R}\hat{\beta} \end{equation} \] The predicted training values are:
\[ \hat{\mathbf{y}}=\mathbf{QQ}^T\mathbf{y} \]
Proof:
\[ \begin{align} \hat{\mathbf{y}}&=\mathbf{X}\hat{\beta}\\ &=\mathbf{QR}\mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}\\ &=\mathbf{QQ}^T\mathbf{y} \end{align} \]
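A short sketch (simulated data, same construction of Q and R as above) confirming that \(\hat{\mathbf{y}} = \mathbf{QQ}^T\mathbf{y}\) coincides with \(\mathbf{X}\hat{\beta}\):

rng = np.random.default_rng(6)
X = np.c_[np.ones(40), rng.normal(size=(40, 3))]
y = rng.normal(size=40)

Z, G = orthogonalize(X)
D = np.linalg.norm(Z, axis=0)
Q, R = Z / D, np.diag(D) @ G

beta_hat = linalg.solve_triangular(R, Q.T @ y)         # solve R beta = Q^T y
print(np.allclose(Q @ Q.T @ y, X @ beta_hat))          # True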
From the QR decomposition we can obtain not just \(\hat{\beta}_p\), but also the entire multiple least squares fit.
Proof: We can easily derive that: \[ \mathbf{R}\hat{\beta}=\mathbf{Q}^T\mathbf{y} \]
which can be expanded into: \[ \begin{equation} \begin{bmatrix} R_{00} & R_{01} & \dots & R_{0p} \\ 0 & R_{11} & \dots & R_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & R_{pp} \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = \begin{bmatrix} {Q_{0}}^T\mathbf{y} \\ {Q_{1}}^T\mathbf{y} \\ \vdots \\ {Q_{p}}^T\mathbf{y} \end{bmatrix} \end{equation} \]
Now by applying backward substitution we can obtain the entire multiple least squares fit. For example, to find \(\hat{\beta}_p\): \[ \begin{equation} R_{pp}\hat{\beta}_p = {Q_{p}}^T\mathbf{y}\\ \hat{\beta}_p = \cfrac{\langle Q_p, \mathbf{y} \rangle}{R_{pp}}=\cfrac{\langle \mathbf{z}_p, \mathbf{y} \rangle}{\langle \mathbf{z}_p,\mathbf{z}_p \rangle} \end{equation} \]
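Here is the back substitution written out by hand on simulated data (the least_squares_qr function below delegates this step to scipy's solve_triangular); the two agree.

rng = np.random.default_rng(7)
X = np.c_[np.ones(30), rng.normal(size=(30, 3))]
y = rng.normal(size=30)

Z, G = orthogonalize(X)
D = np.linalg.norm(Z, axis=0)
Q, R = Z / D, np.diag(D) @ G
rhs = Q.T @ y

beta_bs = np.zeros(R.shape[0])
for j in range(R.shape[0] - 1, -1, -1):                # solve R beta = Q^T y from the bottom up
    beta_bs[j] = (rhs[j] - R[j, j + 1:] @ beta_bs[j + 1:]) / R[j, j]
print(np.allclose(beta_bs, linalg.solve_triangular(R, rhs)))   # True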
def least_squares_qr(data_x, data_y):
    # Prepend a column of ones for the intercept.
    X = np.c_[np.ones((len(data_x), 1)), data_x]
    # Gram-Schmidt: X = Z @ G with Z orthogonal, G unit upper triangular.
    Z, G = orthogonalize(X)
    D = linalg.norm(Z, axis=0)
    Q = Z / D                  # orthonormal columns
    R = np.diag(D) @ G         # upper triangular
    # Solve R beta = Q^T y by back substitution.
    beta = linalg.solve_triangular(R, Q.T @ data_y)
    return beta
beta = least_squares_qr(df[mask_train == 'T'].to_numpy(), df_y[mask_train == 'T'].to_numpy())
print ("Coefficient: ", beta)Coefficient: [ 2.46493292 0.67601634 0.26169361 -0.14073374 0.20906052 0.30362332
-0.28700184 -0.02119493 0.26557614]
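As a final sanity check (assuming the same standardized training split as above), numpy's general least squares solver should return the same coefficients as the QR route:

X_train = np.c_[np.ones(len(df[mask_train == 'T'])), df[mask_train == 'T'].to_numpy()]
y_train = df_y[mask_train == 'T'].to_numpy()
beta_np = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
print(np.allclose(beta, beta_np))                      # expected: True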