LinearRegression

class numpy_ml.linear_models.LinearRegression(fit_intercept=True)[source]

An ordinary least squares regression model fit via the normal equation.

Notes

Given data matrix X and target vector y, the maximum-likelihood estimate for the regression coefficients, \(\beta\), is:

\[\hat{\beta} = \Sigma^{-1} \mathbf{X}^\top \mathbf{y}\]

where \(\Sigma^{-1} = (\mathbf{X}^\top \mathbf{X})^{-1}\).
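As an illustrative numpy-only sketch (not the library's implementation), the normal-equation estimate above can be computed directly on toy data; all variable names here are made up for the example:

```python
import numpy as np

# Toy data: N=50 examples of dimension M=3, targets generated from known weights
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
```

In practice `np.linalg.lstsq` or `np.linalg.pinv` is numerically preferable to forming the explicit inverse, but the explicit form mirrors the equation above.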

Parameters:fit_intercept (bool) – Whether to fit an intercept term in addition to the model coefficients. Default is True.
update(X, y)[source]

Incrementally update the least-squares coefficients for a set of new examples.

Notes

The recursive least-squares algorithm [1] [2] is used to efficiently update the regression parameters as new examples become available. For a single new example \((\mathbf{x}_{t+1}, \mathbf{y}_{t+1})\), the parameter updates are

\[\beta_{t+1} = \left( \mathbf{X}_{1:t}^\top \mathbf{X}_{1:t} + \mathbf{x}_{t+1}\mathbf{x}_{t+1}^\top \right)^{-1} \left( \mathbf{X}_{1:t}^\top \mathbf{Y}_{1:t} + \mathbf{x}_{t+1}^\top \mathbf{y}_{t+1} \right)\]

where \(\beta_{t+1}\) are the updated regression coefficients and \(\mathbf{X}_{1:t}\) and \(\mathbf{Y}_{1:t}\) are the examples and targets observed from timestep 1 to t.

In the single-example case, the RLS algorithm uses the Sherman-Morrison formula [3] to avoid re-inverting the covariance matrix on each new update. In the multi-example case (i.e., where \(\mathbf{X}_{t+1}\) and \(\mathbf{y}_{t+1}\) are matrices of N examples each), we use the generalized Woodbury matrix identity [4] to update the inverse covariance. This comes at a performance cost, but is still more performant than doing multiple single-example updates if N is large.
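A minimal numpy sketch of the rank-one Sherman-Morrison update described above (illustrative only; these are not the library's internal variable names):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(20, 4)       # examples observed so far
x_new = rng.randn(4)       # a single new example

# Inverse of X^T X for the examples seen so far
A_inv = np.linalg.inv(X.T @ X)

# Sherman-Morrison: (A + u u^T)^{-1} = A^{-1} - (A^{-1} u)(A^{-1} u)^T / (1 + u^T A^{-1} u)
Au = A_inv @ x_new
A_inv_updated = A_inv - np.outer(Au, Au) / (1.0 + x_new @ Au)

# Equivalent to re-inverting from scratch, but O(M^2) per update instead of O(M^3)
direct = np.linalg.inv(X.T @ X + np.outer(x_new, x_new))
```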

References

[1]Gauss, C. F. (1821). _Theoria combinationis observationum erroribus minimis obnoxiae_. Werke, 4. Göttingen.
[2]https://en.wikipedia.org/wiki/Recursive_least_squares_filter
[3]https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula
[4]https://en.wikipedia.org/wiki/Woodbury_matrix_identity
Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
fit(X, y)[source]

Fit the regression coefficients via maximum likelihood.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Use the trained model to generate predictions on a new collection of data points.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

RidgeRegression

class numpy_ml.linear_models.RidgeRegression(alpha=1, fit_intercept=True)[source]

A ridge regression model fit via the normal equation.

Notes

Given data matrix X and target vector y, the ridge estimate for the regression coefficients, \(\beta\), is:

\[\hat{\beta} = \left(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{y}\]

It turns out that this estimate for \(\beta\) also corresponds to the MAP estimate if we assume a multivariate Gaussian prior on the model coefficients:

\[\beta \sim \mathcal{N}(\mathbf{0}, \frac{1}{2M} \mathbf{I})\]

Note that this assumes that the data matrix X has been standardized and the target values y centered at 0.
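A numpy-only sketch of the ridge normal equation above (illustrative; the toy data and names are made up, and `alpha` plays the role of the class's alpha parameter):

```python
import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(100, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(100)
alpha = 1.0

# Ridge normal equation: beta_hat = (X^T X + alpha * I)^{-1} X^T y
M = X.shape[1]
beta_ridge = np.linalg.inv(X.T @ X + alpha * np.eye(M)) @ X.T @ y

# For comparison: the unregularized OLS solution
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

One way to see the effect of the penalty: the L2 norm of the ridge solution never exceeds that of the OLS solution, since each coefficient is shrunk toward zero in the singular basis of X.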

Parameters:
  • alpha (float) – L2 regularization coefficient. Higher values correspond to larger penalty on the L2 norm of the model coefficients. Default is 1.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the model coefficients. Default is True.
fit(X, y)[source]

Fit the regression coefficients via maximum likelihood.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Use the trained model to generate predictions on a new collection of data points.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

LogisticRegression

class numpy_ml.linear_models.LogisticRegression(penalty='l2', gamma=0, fit_intercept=True)[source]

A simple logistic regression model fit via gradient descent on the penalized negative log likelihood.

Notes

For logistic regression, the penalized negative log likelihood of the targets y under the current model is

\[- \log \mathcal{L}(\mathbf{b}, \mathbf{y}) = -\frac{1}{N} \left[ \left( \sum_{i=1}^N y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right) - R(\mathbf{b}, \gamma) \right]\]

where

\[\begin{split}R(\mathbf{b}, \gamma) = \left\{ \begin{array}{lr} \frac{\gamma}{2} ||\mathbf{b}||_2^2 & :\texttt{ penalty = 'l2'}\\ \gamma ||\mathbf{b}||_1 & :\texttt{ penalty = 'l1'} \end{array} \right.\end{split}\]

is a regularization penalty, \(\gamma\) is a regularization weight, N is the number of examples in y, and b is the vector of model coefficients.
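The penalized objective above can be sketched in plain numpy as follows; `penalized_nll` is a hypothetical helper for illustration, not a function in the library:

```python
import numpy as np

def penalized_nll(b, X, y, gamma=0.1, penalty="l2"):
    """Penalized negative log likelihood for logistic regression (illustrative sketch)."""
    y_hat = 1.0 / (1.0 + np.exp(-X @ b))   # sigmoid predictions
    eps = np.finfo(float).eps              # guard against log(0)
    nll = -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    reg = 0.5 * gamma * np.sum(b ** 2) if penalty == "l2" else gamma * np.sum(np.abs(b))
    return (nll + reg) / X.shape[0]

rng = np.random.RandomState(3)
X = rng.randn(30, 2)
y = (X[:, 0] > 0).astype(float)

# With b = 0, every prediction is 0.5, so the loss is log(2) per example
loss0 = penalized_nll(np.zeros(2), X, y)
```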

Parameters:
  • penalty ({'l1', 'l2'}) – The type of regularization penalty to apply on the coefficients b. Default is 'l2'.
  • gamma (float) – The regularization weight. Larger values correspond to larger regularization penalties, and a value of 0 indicates no penalty. Default is 0.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for beta will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
fit(X, y, lr=0.01, tol=1e-07, max_iter=10000000.0)[source]

Fit the regression coefficients via gradient descent on the negative log likelihood.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N,)) – The binary targets for each of the N examples in X.
  • lr (float) – The gradient descent learning rate. Default is 0.01.
  • tol (float) – The convergence tolerance for the gradient descent updates. Default is 1e-7.
  • max_iter (float) – The maximum number of iterations to run the gradient descent solver. Default is 1e7.
predict(X)[source]

Use the trained model to generate prediction probabilities on a new collection of data points.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z,)) – The model prediction probabilities for the items in X.

BayesianLinearRegressionUnknownVariance

class numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance(alpha=1, beta=2, b_mean=0, b_V=None, fit_intercept=True)[source]

Bayesian linear regression model with unknown variance and conjugate Normal-Gamma prior on b and \(\sigma^2\).

Notes

Uses a conjugate Normal-Gamma prior on b and \(\sigma^2\). The joint and marginal posteriors over error variance and model parameters are:

\[\begin{split}b, \sigma^2 &\sim \text{NG}(b_{mean}, b_{V}, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ b &\sim \mathcal{N}(b_{mean}, \sigma^2 \cdot b_V)\end{split}\]
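As a hedged illustration of how such a posterior can be computed, here is the standard conjugate Normal-Gamma update in plain numpy. This is the textbook form; the exact parameterization this class uses internally may differ, and all variable names are invented for the sketch:

```python
import numpy as np

rng = np.random.RandomState(4)
N, M = 40, 3
X = rng.randn(N, M)
y = X @ rng.randn(M) + rng.randn(N)

# Prior hyperparameters (mirroring the class defaults: alpha=1, beta=2, b_mean=0, b_V=I)
alpha, beta = 1.0, 2.0
b_mean, b_V = np.zeros(M), np.eye(M)

# Standard conjugate Normal-Gamma posterior update
b_V_inv = np.linalg.inv(b_V)
V_post = np.linalg.inv(b_V_inv + X.T @ X)
mu_post = V_post @ (b_V_inv @ b_mean + X.T @ y)
alpha_post = alpha + N / 2.0
beta_post = beta + 0.5 * (y @ y + b_mean @ b_V_inv @ b_mean
                          - mu_post @ np.linalg.inv(V_post) @ mu_post)

# A MAP-style prediction uses the posterior mean coefficients
y_map = X @ mu_post
```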
Parameters:
  • alpha (float) – The shape parameter for the Inverse-Gamma prior on \(\sigma^2\). Must be strictly greater than 0. Default is 1.
  • beta (float) – The scale parameter for the Inverse-Gamma prior on \(\sigma^2\). Must be strictly greater than 0. Default is 2.
  • b_mean (ndarray of shape (M,) or float) – The mean of the Gaussian prior on b. If a float, assume b_mean is np.ones(M) * b_mean. Default is 0.
  • b_V (ndarray of shape (M, M) or (M,) or None) – A symmetric positive definite matrix that, when multiplied element-wise by \(\sigma^2\), gives the covariance matrix for the Gaussian prior on b. If a list, assume b_V = diag(b_V). If None, assume b_V is the identity matrix. Default is None.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
fit(X, y)[source]

Compute the posterior over model parameters using the data in X and y.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Return the MAP prediction for the targets associated with X.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

BayesianLinearRegressionKnownVariance

class numpy_ml.linear_models.BayesianLinearRegressionKnownVariance(b_mean=0, b_sigma=1, b_V=None, fit_intercept=True)[source]

Bayesian linear regression model with known error variance and conjugate Gaussian prior on model parameters.

Notes

Uses a conjugate Gaussian prior on the model coefficients. The posterior over model parameters is

\[b \mid b_{mean}, \sigma^2, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)\]

Ridge regression is a special case of this model where \(b_{mean} = \mathbf{0}\), \(\sigma = 1\), and \(b_V = \mathbf{I}\) (i.e., the prior on b is a zero-mean, unit covariance Gaussian).
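The standard conjugate Gaussian posterior for known error variance can be sketched in plain numpy; under the defaults (b_mean=0, b_sigma=1, b_V=I) the posterior mean coincides with the ridge solution for alpha=1, which is one way to check the special-case claim above. The variable names here are invented for the sketch:

```python
import numpy as np

rng = np.random.RandomState(5)
N, M = 60, 4
X = rng.randn(N, M)
y = X @ rng.randn(M) + 0.5 * rng.randn(N)

# Prior: b ~ N(b_mean, sigma^2 * b_V), using the class defaults
sigma = 1.0
b_mean, b_V = np.zeros(M), np.eye(M)

# Standard conjugate Gaussian posterior over b for known error variance
b_V_inv = np.linalg.inv(b_V)
precision = b_V_inv + X.T @ X
mean_post = np.linalg.inv(precision) @ (b_V_inv @ b_mean + X.T @ y)
cov_post = sigma ** 2 * np.linalg.inv(precision)

# With these defaults, mean_post equals the ridge estimate with alpha = 1
ridge = np.linalg.inv(X.T @ X + np.eye(M)) @ X.T @ y
```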

Parameters:
  • b_mean (ndarray of shape (M,) or float) – The mean of the Gaussian prior on b. If a float, assume b_mean is np.ones(M) * b_mean. Default is 0.
  • b_sigma (float) – A scaling term for covariance of the Gaussian prior on b. Default is 1.
  • b_V (ndarray of shape (M, M) or (M,) or None) – A symmetric positive definite matrix that, when multiplied element-wise by b_sigma^2, gives the covariance matrix for the Gaussian prior on b. If a list, assume b_V = diag(b_V). If None, assume b_V is the identity matrix. Default is None.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
fit(X, y)[source]

Compute the posterior over model parameters using the data in X and y.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Return the MAP prediction for the targets associated with X.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The MAP predictions for the targets associated with the items in X.