3.2 Linear Regression Models and Least Squares

We have an input vector \(X^T=(X_1,...,X_p)\) and want to predict a real-valued output \(Y\). The linear regression model has the form:

\[f(X) = \beta_0 + \sum_{j=1}^p {X_j\beta_j}\]

Typically we have a set of training data \((x_1, y_1),...,(x_N, y_N)\) from which to estimate the parameters \(\beta\). The most popular estimation method is least squares, in which we pick \(\beta\) to minimize the residual sum of squares (3.2): \[ \begin{align} RSS(\beta)&=\sum_{i=1}^N(y_i-f(x_i))^2\\ &=\sum_{i=1}^N\Big(y_i-\beta_0-\sum_{j=1}^p{x_{ij}\beta_j}\Big)^2 \end{align} \]

How do we minimize (3.2)? Writing (3.2) in matrix notation gives (3.3): \[RSS(\beta)=(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)\]

Differentiating with respect to \(\beta\) we obtain: \[ \begin{align} \frac{\partial{RSS}}{\partial\beta} = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta) \end{align} \]

Assuming that \(\mathbf{X}\) has full column rank, \(\mathbf{X}^T\mathbf{X}\) is positive definite (and so is the second derivative \(2\mathbf{X}^T\mathbf{X}\)), so we set the first derivative to zero: \[\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)=0\]

and the unique solution is: \[\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
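To make this concrete, here is a minimal NumPy sketch (not from the text) that computes \(\hat{\beta}\) on simulated data; the data, the coefficients `beta_true`, and all variable names are invented for illustration. Solving the normal equations directly is fine for small problems, though a QR-based routine such as `np.linalg.lstsq` is preferred numerically.

import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, p predictors, plus an intercept column of ones
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least squares estimate: solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)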

The predicted value at an input vector \(x_0\) is given by \(\hat{f}(x_0)=(1:x_0)^T\hat{\beta}\), and the fitted values at the training inputs are:

\[\hat{y}=\mathbf{X}\hat{\beta}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

The matrix \(\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is sometimes called the “hat” matrix.

Geometrical representation of the least squares estimate: We denote the column vectors of \(\mathbf{X}\) by \(\mathbf{x}_0, \mathbf{x}_1, ..., \mathbf{x}_p\), with \(\mathbf{x}_0 \equiv 1\). These vectors span a subspace of \(\mathbb{R}^N\), also referred to as the column space of \(\mathbf{X}\). We minimize \(RSS(\beta)=||\mathbf{y}-\mathbf{X}\beta||^2\) by choosing \(\hat{\beta}\) so that the residual vector \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to this subspace; the orthogonality is expressed by \(\mathbf{X}^T(\mathbf{y}-\mathbf{X}\hat{\beta})=0\). The hat matrix \(\mathbf{H}\) computes the orthogonal projection of \(\mathbf{y}\) onto this subspace, and hence it is also known as a projection matrix.
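Continuing the simulated example above, a quick numerical check (my own illustration, not from the text) of the geometric picture: the residual vector is orthogonal to every column of \(\mathbf{X}\), and \(\mathbf{H}\) is a symmetric, idempotent projection matrix.

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
y_hat = H @ y                           # fitted values: the projection of y
residual = y - y_hat

print(np.allclose(X.T @ residual, 0.0))  # residual orthogonal to the column space of X
print(np.allclose(H @ H, H))             # idempotent
print(np.allclose(H, H.T))               # symmetric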

Sampling properties of \(\hat{\beta}\): In order to pin down the sampling properties of \(\hat{\beta}\), we assume that the observations \(y_i\) are uncorrelated and have constant variance \(\sigma^2\), and that the \(x_i\) are fixed (non-random). Writing \(\mathbf{y}=\mathbf{X}\beta+\varepsilon\), we have \(\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=\beta+(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon\), so \(E(\hat{\beta})=\beta\) and the variance-covariance matrix is given by (3.8):

\[ \begin{align} Var(\hat{\beta}) &= E\left[(\hat{\beta}-E(\hat{\beta}))(\hat{\beta}-E(\hat{\beta}))^T\right]\\ &= E\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{\varepsilon}\mathbf{\varepsilon}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\right]\\ &= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \end{align} \]

One estimates the variance \(\sigma^2\) by: \[ \hat{\sigma}^2 = \frac{1}{N-p-1} \sum_{i=1}^N(y_i-\hat{y_i})^2 \]

The N-p-1 rather than N in the denominator makes \(\hat{\sigma}^2\) an unbiased estimate of \(\sigma^2\): \(E(\hat{\sigma}^2)=\sigma^2\).
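Again continuing the simulated example, a short sketch of the estimate \(\hat{\sigma}^2\) and the resulting estimate of (3.8); the variable names are mine.

# Unbiased estimate of sigma^2: divide the residual sum of squares by N - p - 1
sigma2_hat = residual @ residual / (N - p - 1)

# Estimated variance-covariance matrix of beta_hat, as in (3.8)
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(var_beta_hat))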

Proof: \[ \begin{align} \hat{\varepsilon} &= \mathbf{y} - \mathbf{\hat{y}}\\ &= \mathbf{X}\beta + \varepsilon - \mathbf{X}\hat{\beta}\\ &= \mathbf{X}\beta + \varepsilon - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \varepsilon)\\ &= \varepsilon - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon\\ &= (\mathbf{I}_N - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T)\varepsilon\\ &= (\mathbf{I}_N - \mathbf{H})\varepsilon \end{align} \]

and we would like to find \(E(\hat{\varepsilon}^T\hat{\varepsilon})\), the expected residual sum of squares:

\[ \begin{align} E[\hat{\varepsilon}^T\hat{\varepsilon}] &= E\left[\varepsilon^T(\mathbf{I}_N - \mathbf{H})^T(\mathbf{I}_N - \mathbf{H})\varepsilon\right]\\ &= E\left[tr\left(\varepsilon^T(\mathbf{I}_N - \mathbf{H})^T(\mathbf{I}_N - \mathbf{H})\varepsilon\right)\right]\\ &= E\left[tr\left((\mathbf{I}_N - \mathbf{H})^T(\mathbf{I}_N - \mathbf{H})\varepsilon\varepsilon^T\right)\right]\\ &= \sigma^2 tr\left((\mathbf{I}_N - \mathbf{H})^T(\mathbf{I}_N - \mathbf{H})\right)\\ &= \sigma^2 tr(\mathbf{I}_N - \mathbf{H})\\ &= \sigma^2\left(tr(\mathbf{I}_N) - tr(\mathbf{I}_{p+1})\right)\\ &= \sigma^2(N-p-1) \end{align} \]

Note that both \(\mathbf{H}\) and \(\mathbf{I}_N-\mathbf{H}\) are symmetric and idempotent, which is what lets the trace collapse from \(tr\left((\mathbf{I}_N - \mathbf{H})^T(\mathbf{I}_N - \mathbf{H})\right)\) to \(tr(\mathbf{I}_N - \mathbf{H})\), and \(tr(\mathbf{H}) = tr\left(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\right) = tr\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\right) = tr(\mathbf{I}_{p+1}) = p+1\).
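A quick numerical check of these two facts, continuing the simulated example above:

print(np.isclose(np.trace(H), p + 1))                  # tr(H) = p + 1
print(np.isclose(np.trace(np.eye(N) - H), N - p - 1))  # tr(I_N - H) = N - p - 1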

Inferences about the parameters and the model: We now assume that the deviations of \(Y\) around its expectation are additive and Gaussian. Hence (3.9):

\[ \begin{align} Y &=E(Y|X_1,...,X_p)+\varepsilon\\ &= \beta_0 + \sum_{j=1}^p{X_j\beta_j} + \varepsilon \end{align} \]

where \(\varepsilon \sim N(0, \sigma^2)\).

Under (3.9), it is easy to show that (3.10): \[ \hat{\beta} \sim N(\beta, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2) \]

Also (3.11): \[(N-p-1)\hat{\sigma}^2 \sim \sigma^2\chi_{N-p-1}^2\]

a chi-squared distribution with \(N-p-1\) degrees of freedom. In addition, \(\hat{\beta}\) and \(\hat{\sigma}^2\) are statistically independent.

Hypothesis test: To test \(H_0: \beta_j = 0\) we form the standardized coefficient or Z-score: \[ z_j=\frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} \]

where \(v_j\) is the jth diagonal element of \((\mathbf{X}^T\mathbf{X})^{-1}\). Under the null hypothesis, \(z_j\) is distributed as \(t_{N-p-1}\), and hence a large absolute value of \(z_j\) will lead to rejection. If \(\hat{\sigma}\) were replaced by the known value \(\sigma\), then \(z_j\) would have a standard normal distribution. The difference between the tail quantiles of a t-distribution and a standard normal becomes negligible as the sample size increases; see Figure 3.3 below:

# Figure 3.3
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm, t
   
fig = plt.figure(figsize = (12, 8))
axes = fig.add_subplot(1, 1, 1)

# Two-tailed probabilities P(|Z| > z) under the standard normal and under
# t-distributions with 30 and 100 degrees of freedom
z = np.linspace(1.9, 3, 500)

normal_probabilities = 1 - norm.cdf(z) + norm.cdf(-z)
t_30_probabilities = 1 - t.cdf(z, 30) + t.cdf(-z, 30)
t_100_probabilities = 1 - t.cdf(z, 100) + t.cdf(-z, 100)

axes.plot(z, normal_probabilities, color='C0', label = 'normal')
axes.plot(z, t_30_probabilities, color='C1', label = '$t_{30}$')
axes.plot(z, t_100_probabilities, color='C2', label = '$t_{100}$')

xlim = axes.get_xlim()

# Draw dashed reference lines at tail probabilities 0.05 and 0.01,
# and mark the quantile at which each curve attains them
for y in [0.01, 0.05]:
    axes.plot(xlim, [y, y], '--', color = 'gray', 
              scalex = False, scaley = False)

    for index, probs in enumerate([normal_probabilities, t_30_probabilities,
                                   t_100_probabilities]):
        x = z[np.abs(probs - y).argmin()]
        axes.plot([x, x], [0, y], '--', color = f"C{index}",
                  scalex = False, scaley = False)
    
axes.legend()
axes.set_xlabel('Z')
axes.set_ylabel('Tail Probabilities')
plt.show()
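As a companion to the figure, here is a sketch (continuing the simulated example from earlier; the variable names are mine) that computes the Z-scores for each coefficient and the corresponding two-sided p-values from the \(t_{N-p-1}\) distribution.

from scipy.stats import t as t_dist

# v_j is the j-th diagonal element of (X^T X)^{-1}
v = np.diag(np.linalg.inv(X.T @ X))

# Z-scores: beta_hat_j / (sigma_hat * sqrt(v_j))
z_scores = beta_hat / np.sqrt(sigma2_hat * v)

# Two-sided p-values under H0: beta_j = 0, using the t distribution with N - p - 1 df
p_values = 2 * t_dist.sf(np.abs(z_scores), df=N - p - 1)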

Test for the significance of groups of coefficients simultaneously: We use the F statistic (3.13):

\[F=\frac{(RSS_0-RSS_1) / (p_1 - p_0)}{RSS_1/(N-p_1-1)}\]

where \(RSS_1\) is the residual sum of squares for the bigger model with \(p_1+1\) parameters, and \(RSS_0\) the same for the nested smaller model with \(p_0+1\) parameters, i.e. with \(p_1-p_0\) coefficients constrained to be zero. Under the null hypothesis that the smaller model is correct, the F statistic has an \(F_{p_1-p_0,N-p_1-1}\) distribution. The \(z_j\) defined above is equivalent to the F statistic for dropping the single coefficient \(\beta_j\) from the model.
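A sketch of the F test on the simulated data, dropping the last predictor from the full model; the split into `X0` versus `X` and all names are for illustration only. For a single dropped coefficient, the resulting F equals the square of the corresponding Z-score.

from scipy.stats import f as f_dist

# Smaller, nested model: drop the last column of X (p0 = p - 1 predictors)
X0 = X[:, :-1]
beta0_hat, *_ = np.linalg.lstsq(X0, y, rcond=None)

rss1 = np.sum((y - X @ beta_hat) ** 2)    # bigger model, p1 = p predictors
rss0 = np.sum((y - X0 @ beta0_hat) ** 2)  # smaller model

df_num, df_den = 1, N - p - 1             # p1 - p0 = 1 and N - p1 - 1
F = ((rss0 - rss1) / df_num) / (rss1 / df_den)
p_value = f_dist.sf(F, df_num, df_den)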

Similarly, we can isolate \(\beta_j\) in (3.10) to obtain a \(1-2\alpha\) confidence interval for \(\beta_j\) (3.14): \[(\hat{\beta_j} - z^{(1-\alpha)}v_j^{\frac{1}{2}}\hat{\sigma}, \hat{\beta_j} + z^{(1-\alpha)}v_j^{\frac{1}{2}}\hat{\sigma})\]
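A minimal sketch of the interval (3.14) for each coefficient in the simulated example, using the normal quantile \(z^{(1-\alpha)}\) as in the formula (for small \(N\) one would typically substitute the \(t_{N-p-1}\) quantile instead).

from scipy.stats import norm

alpha = 0.025                    # gives a 1 - 2*alpha = 95% interval
z_quantile = norm.ppf(1 - alpha)
se = np.sqrt(sigma2_hat * v)     # sigma_hat * sqrt(v_j)

# One row per coefficient: (lower, upper)
conf_intervals = np.column_stack([beta_hat - z_quantile * se,
                                  beta_hat + z_quantile * se])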

In a similar fashion we can obtain an approximate confidence set for the entire parameter vector \(\beta\) (3.15): \[ C_{\beta} = \{{\beta|(\hat{\beta}-\beta)^T\mathbf{X}^T\mathbf{X}(\hat{\beta}-\beta)} \le \hat{\sigma}^2{\chi_{p+1}^2}^{(1-\alpha)} \}\]

where \({\chi_{l}^2}^{(1-\alpha)}\) is the \(1-\alpha\) percentile of the chi-squared distribution on \(l\) degrees of freedom.
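To close the loop on the simulated example, a quick check (my own illustration) of whether the true coefficient vector `beta_true` used to generate the data falls inside the confidence set (3.15).

from scipy.stats import chi2

alpha = 0.05
diff = beta_hat - beta_true
lhs = diff @ (X.T @ X) @ diff
rhs = sigma2_hat * chi2.ppf(1 - alpha, df=p + 1)
print(lhs <= rhs)   # True if beta_true lies inside the confidence set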