4.4.1 Fitting Logistic Regression Models

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of \(G\) given \(X\). Writing \(p_k(x_i;\theta)=\Pr(G=k \mid X=x_i;\theta)\), the log-likelihood for N observations is (4.19): \[ l(\theta)=\sum_{i=1}^N \log p_{g_i}(x_i; \theta), \]

We discuss the two-class case in detail, since the algorithms simplify considerably. It is convenient to code the two-class \(g_i\) via a 0/1 response \(y_i\), where \(y_i=1\) when \(g_i=1\) and \(y_i=0\) when \(g_i=2\). Let \(p(x;\beta)=p_1(x;\beta)\), so that \(p_2(x;\beta)=1-p(x;\beta)\); from the logistic model, \(p(x_i;\beta)=e^{\beta^Tx_i}/(1+e^{\beta^Tx_i})\). The log-likelihood can be written (4.20): \[ \begin{align} l(\beta) &= \sum_{i=1}^N \left \{y_i \log p(x_i; \beta) + (1-y_i)\log(1 - p(x_i;\beta)) \right \}\\ &= \sum_{i=1}^N \left \{ y_i\left(\beta^Tx_i - \log(1+e^{\beta^Tx_i})\right) - (1-y_i)\log(1+e^{\beta^Tx_i}) \right\}\\ &= \sum_{i=1}^N \left \{ y_i\beta^Tx_i - \log (1+e^{\beta^Tx_i}) \right\} \end{align} \]

Here \(\beta = \{\beta_{10}, \beta_1\}\), and we assume that the vector of inputs \(x_i\) includes the constant term 1.
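As a minimal numerical sketch of (4.20), assuming NumPy, a design matrix `X` whose first column is the constant 1, and a 0/1 response vector `y` (the helper name `log_likelihood` is ours, not the book's):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Two-class log-likelihood (4.20): sum_i { y_i * beta^T x_i - log(1 + e^{beta^T x_i}) }."""
    eta = X @ beta                        # linear predictor beta^T x_i for each observation
    # log(1 + e^eta) computed stably as logaddexp(0, eta)
    return np.sum(y * eta - np.logaddexp(0.0, eta))
```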

To maximize the log-likelihood, we set its derivatives to zero. These score equations are (4.21): \[ \begin{align} \cfrac{\partial l(\beta)}{\partial \beta} &=\cfrac{\partial \left [ \sum_{i=1}^N \left \{ y_i\beta^Tx_i - \log (1+e^{\beta^Tx_i}) \right\} \right]} {\partial \beta}\\ &= \sum_{i=1}^N \left \{ y_ix_i - \cfrac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}} x_i\right\}\\ &= \sum_{i=1}^N x_i(y_i-p(x_i; \beta)) \end{align} \]

which are \(p+1\) equations nonlinear in \(\beta\). Notice that since the first component of \(x_i\) is 1, the first score equation specifies that \(\sum_{i=1}^N y_i = \sum_{i=1}^N p(x_i;\beta)\): the expected number of class ones matches the observed number (and hence also the number of class twos).
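Under the same assumptions (NumPy, `X` with a leading column of ones, 0/1 vector `y`; `score` is a name of our choosing), the score equations (4.21) can be sketched as:

```python
import numpy as np

def score(beta, X, y):
    """Score vector (4.21): sum_i x_i (y_i - p(x_i; beta)) = X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities p(x_i; beta)
    return X.T @ (y - p)
```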

To solve the score equations (4.21), we use the Newton-Raphson algorithm, which requires the second derivative or Hessian matrix (4.22): \[ \begin{align} \cfrac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} &= \cfrac{\partial \left[ \sum_{i=1}^N x_iy_i - x_i\cfrac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}\right]}{ \partial \beta^T}\\ &= \cfrac{\partial \left[ \sum_{i=1}^N -x_i\cfrac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}\right]}{ \partial \beta^T}\\ &= \sum_{i=1}^N -x_i x_i^T \cfrac{ e^{\beta^Tx_i}(1+e^{\beta^Tx_i}) - e^{2\beta^Tx_i} }{ (1+e^{\beta^Tx_i})^2 }\\ &= \sum_{i=1}^N -x_i x_i^T \cfrac{ e^{\beta^Tx_i} }{ 1+e^{\beta^Tx_i} } \cfrac{ 1 }{ 1+e^{\beta^Tx_i} }\\ &= -\sum_{i=1}^N x_ix_i^Tp(x_i;\beta)(1-p(x_i;\beta)) \end{align} \]
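A matching sketch of the Hessian (4.22); scaling the rows of `X` by the weights \(p(x_i;\beta)(1-p(x_i;\beta))\) avoids forming an \(N \times N\) diagonal matrix explicitly (again an illustrative helper, not a prescribed implementation):

```python
import numpy as np

def hessian(beta, X):
    """Hessian (4.22): -sum_i x_i x_i^T p(x_i; beta)(1 - p(x_i; beta)) = -X^T W X."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)                        # diagonal of W
    return -(X * w[:, None]).T @ X           # -X^T diag(w) X without building diag(w)
```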

Starting with \(\beta^{old}\), a single Newton update is (4.23): \[ \beta^{new}=\beta^{old} - \left ( \cfrac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} \right)^{-1} \cfrac{\partial l(\beta)}{\partial \beta}, \] where the derivatives are evaluated at \(\beta^{old}\).
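A single update (4.23) might then be sketched as follows, evaluating the score and Hessian at \(\beta^{old}\) and solving a linear system rather than inverting the Hessian (the name `newton_step` is ours):

```python
import numpy as np

def newton_step(beta_old, X, y):
    """One Newton-Raphson update (4.23): beta_new = beta_old - H^{-1} g."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_old)))     # p(x_i; beta_old)
    g = X.T @ (y - p)                             # score (4.21)/(4.24)
    H = -(X * (p * (1.0 - p))[:, None]).T @ X     # Hessian (4.22)/(4.25)
    return beta_old - np.linalg.solve(H, g)       # solve H d = g instead of forming H^{-1}
```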

It is convenient to write the score and the Hessian in matrix notation (4.24, 4.25): \[ \begin{align} \cfrac{\partial l(\beta)}{\partial \beta} &= \mathbf{X}^T(\mathbf{y}-\mathbf{p})\\ \cfrac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} &= -\mathbf{X}^T\mathbf{W}\mathbf{X} \end{align} \] where \(\mathbf{y}\) denotes the vector of \(y_i\) values, \(\mathbf{X}\) the \(N \times (p+1)\) matrix of \(x_i\) values, \(\mathbf{p}\) the vector of fitted probabilities with \(i\)th element \(p(x_i;\beta^{old})\), and \(\mathbf{W}\) an \(N \times N\) diagonal matrix of weights with \(i\)th diagonal element \(p(x_i;\beta^{old})(1-p(x_i;\beta^{old}))\).

The Newton step is thus (4.26): \[ \begin{align} \beta^{new} &= \beta^{old} + (\mathbf{X}^T\mathbf{WX})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p})\\ &= (\mathbf{X}^T\mathbf{WX})^{-1}\mathbf{X}^T\mathbf{W} \left(\mathbf{X}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})\right)\\ &= (\mathbf{X}^T\mathbf{WX})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z} \end{align} \]

In the second and third lines we have re-expressed the Newton step as a weighted least squares step, with the response (4.27): \[ \mathbf{z}=\mathbf{X}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}), \]

also known as the adjusted response. These equations get solved repeatedly, since at each iteration \(\mathbf{p}\) changes, and hence so do \(\mathbf{W}\) and \(\mathbf{z}\). This algorithm is referred to as iteratively reweighted least squares (IRLS), since each iteration solves the weighted least squares problem (4.28): \[ \beta^{new} \leftarrow \underset{\beta}{\arg\min}\; (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W}(\mathbf{z}-\mathbf{X}\beta). \] It seems that \(\beta = 0\) is a good starting value for the iterative procedure, although convergence is never guaranteed.
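Putting the pieces together, a compact IRLS sketch of (4.26)–(4.28) might look as follows (NumPy assumed; the probability clipping and the stopping rule are pragmatic choices of ours, not part of the text):

```python
import numpy as np

def irls(X, y, n_iter=25, tol=1e-8):
    """Iteratively reweighted least squares for logistic regression, eqs. (4.26)-(4.28)."""
    beta = np.zeros(X.shape[1])                      # beta = 0 as the starting value
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))               # fitted probabilities
        p = np.clip(p, 1e-10, 1.0 - 1e-10)           # guard against zero weights below
        w = p * (1.0 - p)                            # diagonal of W
        z = eta + (y - p) / w                        # adjusted response (4.27)
        XtW = X.T * w                                # X^T W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z) # weighted least squares (4.26)/(4.28)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Solving the normal equations \(\mathbf{X}^T\mathbf{W}\mathbf{X}\,\beta = \mathbf{X}^T\mathbf{W}\mathbf{z}\) directly mirrors (4.26); a more careful implementation would use a weighted QR decomposition for numerical stability.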