2.3.3 From Least Squares to Nearest Neighbors
First, generate 10 means \(m_k\) from a bivariate Gaussian distribution for each color:
\(N((1, 0)^T, \textbf{I})\) for BLUE
\(N((0, 1)^T, \textbf{I})\) for ORANGE
Then, for each color, generate 100 observations as follows:
For each observation, pick one of its 10 means \(m_k\) at random with probability 1/10.
Then draw the observation from \(N(m_k, \textbf{I}/5)\).
%matplotlib inline
import random

import numpy as np
import matplotlib.pyplot as plt

sample_size = 100


def generate_data(size, mean):
    # 10 cluster means drawn from N(mean, I), then `size` observations,
    # each drawn from N(m_k, I/5) for a randomly chosen mean m_k
    identity = np.identity(2)
    m = np.random.multivariate_normal(mean, identity, 10)
    return np.array([
        np.random.multivariate_normal(random.choice(m), identity / 5)
        for _ in range(size)
    ])


def plot_data(orange_data, blue_data):
    axes.plot(orange_data[:, 0], orange_data[:, 1], 'o', color='orange')
    axes.plot(blue_data[:, 0], blue_data[:, 1], 'o', color='blue')


blue_data = generate_data(sample_size, [1, 0])
orange_data = generate_data(sample_size, [0, 1])
data_x = np.r_[blue_data, orange_data]
data_y = np.r_[np.zeros(sample_size), np.ones(sample_size)]

# plotting
fig = plt.figure(figsize=(8, 8))
axes = fig.add_subplot(1, 1, 1)
plot_data(orange_data, blue_data)
plt.show()
2.3.1 Linear Models and Least Squares
\[\hat{Y} = \hat{\beta_0} + \sum_{j=1}^{p} X_j\hat{\beta_j}\]
where \(\hat{\beta_0}\) is the intercept, also known as the bias. It is convenient to include the constant variable 1 in \(X\) and \(\hat{\beta_0}\) in the vector of coefficients \(\hat{\beta}\), and then write the model in vector form as:
\[\hat{Y} = X^T\hat{\beta} \]
Residual sum of squares
How do we fit the linear model to a set of training data? We pick the coefficients \(\beta\) that minimize the residual sum of squares:
\[RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta) ^ 2 = (\textbf{y} - \textbf{X}\beta)^T (\textbf{y} - \textbf{X}\beta)\]
where \(\textbf{X}\) is an \(N \times p\) matrix with each row an input vector, and \(\textbf{y}\) is an N-vector of the outputs in the training set. Differentiating w.r.t. β we get the normal equations:
\[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0\]
If \(\mathbf{X}^T\mathbf{X}\) is nonsingular, then the unique solution is given by:
\[\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
class LinearRegression:
    def fit(self, X, y):
        # prepend a constant 1 to each input so the intercept is absorbed into beta
        X = np.c_[np.ones((X.shape[0], 1)), X]
        # normal equations: beta = (X^T X)^{-1} X^T y
        self.beta = np.linalg.inv(X.T @ X) @ X.T @ y
        return self

    def predict(self, x):
        return np.dot(self.beta, np.r_[1, x])


model = LinearRegression().fit(data_x, data_y)
print("beta =", model.beta)
beta = [ 0.52677771 -0.15145005 0.15818643]
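Since explicitly inverting \(\mathbf{X}^T\mathbf{X}\) can be numerically fragile, one way to sanity-check the coefficients (a small addition, not part of the original notebook) is NumPy's least-squares solver, which should reproduce essentially the same \(\hat{\beta}\):

# cross-check against np.linalg.lstsq, which solves the same least-squares
# problem without forming an explicit inverse
X1 = np.c_[np.ones((data_x.shape[0], 1)), data_x]
beta_lstsq, *_ = np.linalg.lstsq(X1, data_y, rcond=None)
print("beta (lstsq) =", beta_lstsq)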
Example of the linear model in a classification context
The fitted values \(\hat{Y}\) are converted to a fitted class variable \(\hat{G}\) according to the rule:
\[
\begin{equation}
\hat{G} = \begin{cases}
\text{ORANGE} & \text{ if } \hat{Y} \gt 0.5 \\
\text{BLUE } & \text{ if } \hat{Y} \leq 0.5
\end{cases}
\end{equation}
\]
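Before plotting the decision regions, here is a minimal sketch (an addition, not from the original notebook) that applies this rule to the training data itself and reports the training misclassification rate, recalling that BLUE was coded as 0 and ORANGE as 1:

# apply the 0.5 threshold to the fitted values of the training points
fitted_orange = np.array([model.predict(x) > 0.5 for x in data_x])
actual_orange = data_y == 1
print("training error:", np.mean(fitted_orange != actual_orange))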
from itertools import filterfalse, product


def plot_grid(orange_grid, blue_grid):
    axes.plot(orange_grid[:, 0], orange_grid[:, 1], '.', zorder=0.001,
              color='orange', alpha=0.3, scalex=False, scaley=False)
    axes.plot(blue_grid[:, 0], blue_grid[:, 1], '.', zorder=0.001,
              color='blue', alpha=0.3, scalex=False, scaley=False)


# build a 50 x 50 grid covering the current plot limits and classify each point
plot_xlim = axes.get_xlim()
plot_ylim = axes.get_ylim()
grid = np.array([*product(np.linspace(*plot_xlim, 50), np.linspace(*plot_ylim, 50))])

is_orange = lambda x: model.predict(x) > 0.5
orange_grid = np.array([*filter(is_orange, grid)])
blue_grid = np.array([*filterfalse(is_orange, grid)])

axes.clear()
axes.set_title("Linear Regression of 0/1 Response")
plot_data(orange_data, blue_data)
plot_grid(orange_grid, blue_grid)

# decision boundary: the line where the fitted value equals 0.5
find_y = lambda x: (0.5 - model.beta[0] - x * model.beta[1]) / model.beta[2]
axes.plot(plot_xlim, [*map(find_y, plot_xlim)], color='black',
          scalex=False, scaley=False)
fig
2.3.2 Nearest-Neighbor Methods
\[\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i\]
where \(N_k(x)\) is the neighborhood of \(x\) defined by the \(k\) closest points \(x_i\) in the training sample.
class KNeighborsRegressor:
    def __init__(self, k):
        self._k = k

    def fit(self, X, y):
        self._X = X
        self._y = y
        return self

    def predict(self, x):
        X, y, k = self._X, self._y, self._k
        # squared Euclidean distances from x to every training point
        distances = ((X - x) ** 2).sum(axis=1)
        # average the responses of the k nearest neighbors
        return np.mean(y[distances.argpartition(k)[:k]])
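As an optional cross-check (assuming scikit-learn is available; the original notebook does not use it), the class above should agree with scikit-learn's KNeighborsRegressor up to tie-breaking among equidistant neighbors:

# compare predictions at an arbitrary query point
from sklearn.neighbors import KNeighborsRegressor as SkKNeighborsRegressor

knn_ours = KNeighborsRegressor(15).fit(data_x, data_y)
knn_sklearn = SkKNeighborsRegressor(n_neighbors=15).fit(data_x, data_y)
point = np.array([0.5, 0.5])
print(knn_ours.predict(point), knn_sklearn.predict(point.reshape(1, -1))[0])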
def plot_k_nearest_neighbors(k):
    model = KNeighborsRegressor(k).fit(data_x, data_y)
    is_orange = lambda x: model.predict(x) > 0.5
    orange_grid = np.array([*filter(is_orange, grid)])
    blue_grid = np.array([*filterfalse(is_orange, grid)])
    axes.clear()
    axes.set_title(str(k) + "-Nearest Neighbor Classifier")
    plot_data(orange_data, blue_data)
    plot_grid(orange_grid, blue_grid)


plot_k_nearest_neighbors(1)
fig
It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, yet the effective number of parameters is N/k, which is generally larger than the p parameters of a least-squares fit. Note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.
plot_k_nearest_neighbors(15)
fig
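To connect this with the N/k discussion above, here is a minimal sketch (not from the original notebook; the chosen k values are arbitrary) reporting the training misclassification rate for a few values of k. With k = 1 each training point is its own nearest neighbor, so the training error is zero, even though the classifier is far more complex than the linear fit.

def training_error(k):
    # fraction of training points misclassified by the 0.5 threshold rule
    model = KNeighborsRegressor(k).fit(data_x, data_y)
    predictions = np.array([model.predict(x) > 0.5 for x in data_x])
    return np.mean(predictions != (data_y == 1))

for k in (1, 15, 50):
    print(k, training_error(k))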