3.2.2 The Gauss–Markov Theorem

The Gauss–Markov theorem states that the least squares estimates of the parameters \(\beta\) have the smallest variance among all linear unbiased estimates. We focus on estimation of any linear combination of the parameters, \(\theta=\alpha^T\beta\); for example, predictions \(f(x_0)={x_0}^T\beta\) are of this form. The least squares estimate of \(\theta\) is (3.17):

\[ \hat{\theta}=\alpha^T\hat{\beta} = \alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
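As a concrete anchor for the notation, here is a minimal sketch (the design matrix, coefficients, noise level, and combination vector are all hypothetical, made-up values) that computes \(\hat{\beta}\) by least squares and then \(\hat{\theta}=\alpha^T\hat{\beta}\) as in (3.17):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: design matrix X (n x p), true coefficients beta,
# and a response generated as y = X beta + noise.
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, -0.5])
y = X @ beta + rng.normal(size=n)

# Least squares fit, then the estimate of theta = a^T beta as in (3.17).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
a = np.array([1.0, 1.0, 1.0])   # e.g. a prediction point x0, so theta = f(x0)
theta_hat = a @ beta_hat
print(theta_hat)
```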

The estimate \(\alpha^T\hat{\beta}\) is unbiased, since (3.18):

\[ \begin{align} E(\alpha^T\hat{\beta}) &= E( \alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y})\\ &= \alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{y})\\ &= \alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\beta\\ &= \alpha^T\beta \end{align} \]
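A quick Monte Carlo check can corroborate (3.18): holding \(\mathbf{X}\) fixed and redrawing the noise many times, the average of \(\alpha^T\hat{\beta}\) should come out close to \(\alpha^T\beta\). A self-contained sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed design, true coefficients, and combination vector.
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, -0.5])
a = np.array([1.0, 1.0, 1.0])

# Redraw only the noise; X stays fixed across replications.
reps = 10_000
estimates = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[r] = a @ beta_hat

print(estimates.mean(), a @ beta)   # the two values should agree closely
```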

The Gauss–Markov theorem states that if \(\tilde{\theta}=\mathbf{c}^T\mathbf{y}\) is any other linear estimator that is unbiased for \(\alpha^T\beta\), that is, \(E(\mathbf{c}^T\mathbf{y})=\alpha^T\beta\), then (3.19):

\[ Var(\alpha^T\hat{\beta})\le Var(\mathbf{c}^T\mathbf{y}) \]

Proof:

Write \(\mathbf{c}^T\mathbf{y}=(\alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{d}^T)\mathbf{y}\), where \(\mathbf{d}^T = \mathbf{c}^T - \alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\); then:

\[ \begin{align} E(\mathbf{c}^T\mathbf{y})&= E((\alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{d}^T)\mathbf{y})\\ &= E((\alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{d}^T)(\mathbf{X}\beta + \varepsilon))\\ &= \alpha^T\beta + \mathbf{d}^T\mathbf{X}\beta \end{align} \]

Since \(\mathbf{c}^T\mathbf{y}\) is unbiased, we must have \(\mathbf{d}^T\mathbf{X}=0\). Then:

\[ \begin{align} Var(\mathbf{c}^T\mathbf{y}) &= \mathbf{c}^{T}Var(\mathbf{y})\mathbf{c} = \sigma^2\mathbf{c}^T\mathbf{c}\\ &= \sigma^2(\alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{d}^T)(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\alpha + \mathbf{d})\\ &= \sigma^2(\alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\alpha+\mathbf{d}^T\mathbf{d})\\ &= Var(\alpha^T\hat{\beta})+\sigma^2\mathbf{d}^T\mathbf{d} \end{align} \]

The cross terms vanish because \(\mathbf{d}^T\mathbf{X}=0\), and since \(\mathbf{d}^T\mathbf{d}\ge 0\), the proof is complete.
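The inequality (3.19) can also be checked numerically. The sketch below (with a hypothetical \(\mathbf{X}\), \(\alpha\), and \(\sigma^2\)) builds a competing unbiased weight vector \(\mathbf{c} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\alpha + \mathbf{d}\), with \(\mathbf{d}\) orthogonal to the column space of \(\mathbf{X}\) so that \(\mathbf{d}^T\mathbf{X}=0\), and compares \(\sigma^2\mathbf{c}^T\mathbf{c}\) with \(\sigma^2\alpha^T(\mathbf{X}^T\mathbf{X})^{-1}\alpha\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed design, combination vector, and noise variance.
n, p, sigma2 = 50, 3, 1.0
X = rng.normal(size=(n, p))
a = np.array([1.0, 1.0, 1.0])

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                 # hat (projection) matrix

# Least squares weights: theta_hat = c0^T y.
c0 = X @ XtX_inv @ a

# A competing unbiased linear estimator c^T y with c = c0 + d, where d is
# a random vector projected off the column space of X (so d^T X = 0).
z = rng.normal(size=n)
d = z - H @ z
print(np.abs(d @ X).max())            # numerically ~0: the competitor is unbiased

var_ls  = sigma2 * (c0 @ c0)          # = sigma^2 a^T (X^T X)^{-1} a
var_alt = sigma2 * ((c0 + d) @ (c0 + d))
print(var_ls <= var_alt)              # True: the Gauss-Markov inequality (3.19)
```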

Consider the mean squared error of an estimator \(\tilde{\theta}\) in estimating \(\theta\) (3.20): \[ \begin{align} MSE(\tilde{\theta}) &= E(\tilde{\theta} - \theta)^2\\ &= Var(\tilde{\theta})+[E(\tilde{\theta}) - \theta]^2 \end{align} \]

The first term is the variance, while the second term is the squared bias. The theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may exist a biased estimator with smaller MSE.
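To illustrate the last point, consider the simple shrinkage estimator \(\lambda\hat{\theta}\) with \(0<\lambda<1\), whose mean squared error is \(\lambda^2 Var(\hat{\theta}) + (1-\lambda)^2\theta^2\); when \(Var(\hat{\theta})\) is large relative to \(\theta^2\), this beats the unbiased estimate. A hedged numerical sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical noisy setting: small n, large sigma^2, modest theta = a^T beta.
n, p, sigma2 = 20, 3, 25.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, -0.5])
a = np.array([1.0, 1.0, 1.0])
theta = a @ beta

# Exact MSEs under the model: the unbiased LS estimate has MSE equal to its
# variance; the shrunken estimate trades some bias for much less variance.
var_ls = sigma2 * (a @ np.linalg.inv(X.T @ X) @ a)
lam = 0.5
mse_shrunk = lam**2 * var_ls + (1 - lam)**2 * theta**2

print(var_ls, mse_shrunk)   # here the biased estimator has the smaller MSE
```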

Mean squared error is directly related to prediction accuracy. Consider the prediction of the new response at input \(x_0\) (3.21):

\[ Y_0 = f(x_0) + \varepsilon_0 \]

Then the expected prediction error (EPE) of an estimate \(\tilde{f}(x_0)={x_0}^T\tilde{\beta}\) is (3.22): \[ \begin{align} E(Y_0 - \tilde{f}(x_0))^2 &= \sigma^2 + E({x_0}^T\tilde{\beta}-f(x_0))^2\\ &= \sigma^2 + MSE(\tilde{f}(x_0)) \end{align} \]
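The first equality uses \(E(\varepsilon_0)=0\) and the independence of \(\varepsilon_0\) from the training data that produced \(\tilde{\beta}\), so the cross term in the expansion vanishes:

\[ \begin{align} E(Y_0 - \tilde{f}(x_0))^2 &= E(\varepsilon_0 + f(x_0) - {x_0}^T\tilde{\beta})^2\\ &= E(\varepsilon_0^2) + 2\,E(\varepsilon_0)\,E(f(x_0) - {x_0}^T\tilde{\beta}) + E(f(x_0) - {x_0}^T\tilde{\beta})^2\\ &= \sigma^2 + MSE(\tilde{f}(x_0)) \end{align} \]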

Therefore, EPE and MSE differ only by the constant \(\sigma^2\), which is the variance of the new observation \(Y_0\) and does not depend on the estimator.