LinearRegression

class numpy_ml.linear_models.LinearRegression(fit_intercept=True)[source]

A weighted linear least-squares regression model.

Notes

In weighted linear least-squares regression [1], a real-valued target vector, y, is modeled as a linear combination of covariates, X, and model coefficients, \(\beta\):

\[y_i = \beta^\top \mathbf{x}_i + \epsilon_i\]

In this equation \(\epsilon_i \sim \mathcal{N}(0, \sigma^2_i)\) is the error term associated with example \(i\), and \(\sigma^2_i\) is the variance of the corresponding example.

Under this model, the maximum-likelihood estimate for the regression coefficients, \(\beta\), is:

\[\hat{\beta} = \Sigma^{-1} \mathbf{X}^\top \mathbf{Wy}\]

where \(\Sigma^{-1} = (\mathbf{X}^\top \mathbf{WX})^{-1}\) and W is a diagonal matrix of weights, with each entry inversely proportional to the variance of the corresponding measurement. When W is the identity matrix the examples are weighted equally and the model reduces to standard linear least squares [2].
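
For concreteness, the closed-form estimate above can be computed directly in NumPy. The snippet below is only an illustrative sketch on synthetic data (the problem size and noise model are assumptions), not the library's internal implementation:

    import numpy as np

    rng = np.random.RandomState(0)
    N, M = 100, 3                                   # assumed problem size
    X = rng.randn(N, M)
    true_beta = np.array([2.0, -1.0, 0.5])
    var = rng.uniform(0.1, 2.0, size=N)             # per-example noise variance
    y = X @ true_beta + rng.randn(N) * np.sqrt(var)

    W = np.diag(1.0 / var)                          # weights = reciprocal of each variance

    # beta_hat = (X^T W X)^{-1} X^T W y
    sigma_inv = np.linalg.inv(X.T @ W @ X)
    beta_hat = sigma_inv @ X.T @ W @ y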

References

[1]https://en.wikipedia.org/wiki/Weighted_least_squares
[2]https://en.wikipedia.org/wiki/General_linear_model
Parameters:

fit_intercept (bool) – Whether to fit an intercept term in addition to the model coefficients. Default is True.

Variables:
  • beta (ndarray of shape (M, K) or None) – Fitted model coefficients.
  • sigma_inv (ndarray of shape (M, M) or None) – Inverse of the data covariance matrix, \(\Sigma^{-1} = (\mathbf{X}^\top \mathbf{WX})^{-1}\).
update(X, y, weights=None)[source]

Incrementally update the linear least-squares coefficients for a set of new examples.

Notes

The recursive least-squares algorithm [3] [4] is used to efficiently update the regression parameters as new examples become available. For a single new example \((\mathbf{x}_{t+1}, \mathbf{y}_{t+1})\), the parameter updates are

\[\beta_{t+1} = \left( \mathbf{X}_{1:t}^\top \mathbf{X}_{1:t} + \mathbf{x}_{t+1}\mathbf{x}_{t+1}^\top \right)^{-1} \left( \mathbf{X}_{1:t}^\top \mathbf{Y}_{1:t} + \mathbf{x}_{t+1} \mathbf{y}_{t+1}^\top \right)\]

where \(\beta_{t+1}\) are the updated regression coefficients, and \(\mathbf{X}_{1:t}\) and \(\mathbf{Y}_{1:t}\) are the covariates and targets observed from timesteps 1 through t.

In the single-example case, the RLS algorithm uses the Sherman-Morrison formula [5] to avoid re-inverting the covariance matrix on each new update. In the multi-example case (i.e., where \(\mathbf{X}_{t+1}\) and \(\mathbf{y}_{t+1}\) are matrices of N examples each), we use the generalized Woodbury matrix identity [6] to update the inverse covariance. This comes at a performance cost, but is still faster than performing multiple single-example updates when N is large.
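
As an illustration of the single-example case, a Sherman-Morrison update of the inverse covariance and coefficients might look like the following sketch (the function name and bookkeeping are assumptions for exposition, not the library's internals):

    import numpy as np

    def sherman_morrison_update(S, Xty, x, y):
        """One recursive least-squares step via the Sherman-Morrison formula.

        S   : (M, M) current inverse covariance, (X_{1:t}^T X_{1:t})^{-1}
        Xty : (M, K) running sum X_{1:t}^T Y_{1:t}
        x   : (M,)   new covariate vector
        y   : (K,)   new target vector
        """
        Sx = S @ x
        # (A + x x^T)^{-1} = A^{-1} - (A^{-1} x x^T A^{-1}) / (1 + x^T A^{-1} x)
        S_new = S - np.outer(Sx, Sx) / (1.0 + x @ Sx)
        Xty_new = Xty + np.outer(x, y)
        beta_new = S_new @ Xty_new
        return S_new, Xty_new, beta_new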

References

[3]Gauss, C. F. (1821). Theoria combinationis observationum erroribus minimis obnoxiae. Werke, 4. Göttingen.
[4]https://en.wikipedia.org/wiki/Recursive_least_squares_filter
[5]https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula
[6]https://en.wikipedia.org/wiki/Woodbury_matrix_identity
Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K
  • weights (ndarray of shape (N,) or None) – Weights associated with the examples in X. Examples with larger weights exert greater influence on model fit. When y is a vector (i.e., K = 1), weights should be set to the reciprocal of the variance for each measurement (i.e., \(w_i = 1/\sigma^2_i\)). When K > 1, it is assumed that all columns of y share the same weight \(w_i\). If None, examples are weighted equally, resulting in the standard linear least squares update. Default is None.
Returns:

self (LinearRegression instance)

fit(X, y, weights=None)[source]

Fit regression coefficients via maximum likelihood.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
  • weights (ndarray of shape (N,) or None) – Weights associated with the examples in X. Examples with larger weights exert greater influence on model fit. When y is a vector (i.e., K = 1), weights should be set to the reciprocal of the variance for each measurement (i.e., \(w_i = 1/\sigma^2_i\)). When K > 1, it is assumed that all columns of y share the same weight \(w_i\). If None, examples are weighted equally, resulting in the standard linear least squares update. Default is None.
Returns:

self (LinearRegression instance)

predict(X)[source]

Use the trained model to generate predictions on a new collection of data points.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.
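
A typical workflow with this class, fitting in batch and then folding in new examples with update, might look like the following (the data here are synthetic and purely illustrative):

    import numpy as np
    from numpy_ml.linear_models import LinearRegression

    rng = np.random.RandomState(12345)
    coef = np.array([1.0, -2.0, 0.5])
    X = rng.randn(100, 3)
    y = X @ coef + 0.1 * rng.randn(100)

    model = LinearRegression(fit_intercept=True)
    model.fit(X, y)                      # batch fit via the normal equations

    X_new = rng.randn(10, 3)
    y_new = X_new @ coef + 0.1 * rng.randn(10)
    model.update(X_new, y_new)           # recursive least-squares update

    y_pred = model.predict(rng.randn(5, 3))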

RidgeRegression

class numpy_ml.linear_models.RidgeRegression(alpha=1, fit_intercept=True)[source]

A ridge regression model with maximum likelihood fit via the normal equations.

Notes

Ridge regression is a biased estimator for linear models which adds an additional penalty proportional to the L2-norm of the model coefficients to the standard mean-squared-error loss:

\[\mathcal{L}_{Ridge} = (\mathbf{y} - \mathbf{X} \beta)^\top (\mathbf{y} - \mathbf{X} \beta) + \alpha ||\beta||_2^2\]

where \(\alpha\) is a weight controlling the severity of the penalty.

Given data matrix X and target vector y, the maximum-likelihood estimate for ridge coefficients, \(\beta\), is:

\[\hat{\beta} = \left(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{y}\]

This estimate for \(\beta\) also corresponds to the MAP estimate under a multivariate Gaussian prior on the model coefficients, provided the data matrix X has been standardized and the target values y centered at 0:

\[\beta \sim \mathcal{N}\left(\mathbf{0}, \frac{1}{2M} \mathbf{I}\right)\]
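
A minimal NumPy sketch of the normal-equations estimate above, assuming no intercept term and synthetic data:

    import numpy as np

    rng = np.random.RandomState(0)
    N, M, alpha = 50, 10, 1.0
    X = rng.randn(N, M)
    y = X @ rng.randn(M) + rng.randn(N)

    # beta_hat = (X^T X + alpha * I)^{-1} X^T y
    beta_hat = np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)
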
Parameters:
  • alpha (float) – L2 regularization coefficient. Larger values correspond to larger penalty on the L2 norm of the model coefficients. Default is 1.
  • fit_intercept (bool) – Whether to fit an additional intercept term. Default is True.
Variables:

beta (ndarray of shape (M, K) or None) – Fitted model coefficients.

fit(X, y)[source]

Fit the regression coefficients via maximum likelihood.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
Returns:

self (RidgeRegression instance)

predict(X)[source]

Use the trained model to generate predictions on a new collection of data points.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

LogisticRegression

class numpy_ml.linear_models.LogisticRegression(penalty='l2', gamma=0, fit_intercept=True)[source]

A simple binary logistic regression model fit via gradient descent on the penalized negative log likelihood.

Notes

In simple binary logistic regression, the entries in a binary target vector \(\mathbf{y} = (y_1, \ldots, y_N)\) are assumed to have been drawn from a series of independent Bernoulli random variables with expected values \(p_1, \ldots, p_N\). The binary logistic regression model represents the logit of these unknown mean parameters as a linear function of the model coefficients, \(\mathbf{b}\), and the covariates for the corresponding example, \(\mathbf{x}_i\):

\[\text{Logit}(p_i) = \log \left( \frac{p_i}{1 - p_i} \right) = \mathbf{b}^\top\mathbf{x}_i\]

The model predictions \(\hat{\mathbf{y}}\) are the expected values of the Bernoulli parameters for each example:

\[\hat{y}_i = \mathbb{E}[y_i \mid \mathbf{x}_i] = \sigma(\mathbf{b}^\top \mathbf{x}_i)\]

where \(\sigma\) is the logistic sigmoid function \(\sigma(x) = \frac{1}{1 + e^{-x}}\). Under this model, the (penalized) negative log likelihood of the targets y is

\[- \log \mathcal{L}(\mathbf{b}, \mathbf{y}) = -\frac{1}{N} \left[ \left( \sum_{i=1}^N y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right) - R(\mathbf{b}, \gamma) \right]\]

where

\[\begin{split}R(\mathbf{b}, \gamma) = \left\{ \begin{array}{lr} \frac{\gamma}{2} ||\mathbf{b}||_2^2 & :\texttt{ penalty = 'l2'}\\ \gamma ||\mathbf{b}||_1 & :\texttt{ penalty = 'l1'} \end{array} \right.\end{split}\]

is a regularization penalty, \(\gamma\) is a regularization weight, N is the number of examples in y, \(\hat{y}_i\) is the model prediction on example i, and b is the vector of model coefficients.
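
The loss above translates directly into NumPy. The following is a sketch of the penalized objective as written; whether the penalty is scaled by 1/N exactly as shown is read off the formula above rather than the implementation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def penalized_nll(b, X, y, gamma=0.0, penalty="l2"):
        """Penalized negative log likelihood for binary logistic regression."""
        N = X.shape[0]
        y_hat = sigmoid(X @ b)
        nll = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        if penalty == "l2":
            reg = 0.5 * gamma * np.sum(b ** 2)
        else:  # "l1"
            reg = gamma * np.sum(np.abs(b))
        return nll + reg / N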

Parameters:
  • penalty ({'l1', 'l2'}) – The type of regularization penalty to apply on the coefficients beta. Default is ‘l2’.
  • gamma (float) – The regularization weight. Larger values correspond to larger regularization penalties, and a value of 0 indicates no penalty. Default is 0.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for beta will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
Variables:

beta (ndarray of shape (M, 1) or None) – Fitted model coefficients.

fit(X, y, lr=0.01, tol=1e-07, max_iter=10000000.0)[source]

Fit the regression coefficients via gradient descent on the negative log likelihood.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N,)) – The binary targets for each of the N examples in X.
  • lr (float) – The gradient descent learning rate. Default is 0.01.
  • tol (float) – Training terminates when the difference in loss between successive iterations falls below this value. Default is 1e-7.
  • max_iter (float) – The maximum number of iterations to run the gradient descent solver. Default is 1e7.
predict(X)[source]

Use the trained model to generate prediction probabilities on a new collection of data points.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z,)) – The model prediction probabilities for the items in X.
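
Example usage on a synthetic binary classification problem (the data and hyperparameter values are illustrative only):

    import numpy as np
    from numpy_ml.linear_models import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = (X @ rng.randn(5) + 0.5 * rng.randn(200) > 0).astype(int)

    clf = LogisticRegression(penalty="l2", gamma=0.1, fit_intercept=True)
    clf.fit(X, y, lr=0.01, max_iter=1e4)

    probs = clf.predict(X)                 # predicted P(y = 1 | x)
    preds = (probs >= 0.5).astype(int)     # thresholded class labels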

BayesianLinearRegressionUnknownVariance

class numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance(alpha=1, beta=2, mu=0, V=None, fit_intercept=True)[source]

Bayesian linear regression model with unknown variance. Assumes a conjugate normal-inverse-gamma joint prior on the model parameters and error variance.

Notes

The current model uses a conjugate normal-inverse-gamma joint prior on model parameters b and error variance \(\sigma^2\). The joint and marginal posteriors over each are:

\[\begin{split}\mathbf{b}, \sigma^2 &\sim \mathcal{N}\text{-}\Gamma^{-1}(\mu, \mathbf{V}^{-1}, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ \mathbf{b} \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 \mathbf{V})\end{split}\]
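
For reference, the standard conjugate update for this prior can be sketched in NumPy as below. This follows the textbook normal-inverse-gamma posterior for a scalar target (y of shape (N,)) and is not necessarily the exact parameterization the library uses internally:

    import numpy as np

    def nig_posterior(X, y, mu0, V0, alpha0, beta0):
        """Textbook conjugate normal-inverse-gamma posterior update (sketch)."""
        N = X.shape[0]
        V0_inv = np.linalg.inv(V0)
        Vn = np.linalg.inv(V0_inv + X.T @ X)
        mun = Vn @ (V0_inv @ mu0 + X.T @ y)
        alphan = alpha0 + N / 2.0
        betan = beta0 + 0.5 * (y @ y + mu0 @ V0_inv @ mu0 - mun @ np.linalg.solve(Vn, mun))
        return mun, Vn, alphan, betan
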
Parameters:
  • alpha (float) – The shape parameter for the Inverse-Gamma prior on \(\sigma^2\). Must be strictly greater than 0. Default is 1.
  • beta (float) – The scale parameter for the Inverse-Gamma prior on \(\sigma^2\). Must be strictly greater than 0. Default is 2.
  • mu (ndarray of shape (M,) or float) – The mean of the Gaussian prior on b. If a float, assume mu is np.ones(M) * mu. Default is 0.
  • V (ndarray of shape (M, M) or (M,) or None) – A symmetric positive definite matrix that when multiplied element-wise by \(\sigma^2\) gives the covariance matrix for the Gaussian prior on b. If a 1D array, assume V = diag(V). If None, assume V is the identity matrix. Default is None.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
Variables:
  • posterior (dict or None) – Frozen random variables for the posterior distributions \(P(\sigma^2 \mid X)\) and \(P(b \mid X, \sigma^2)\).
  • posterior_predictive (dict or None) – Frozen random variable for the posterior predictive distribution, \(P(y \mid X)\). This value is only set following a call to predict.
fit(X, y)[source]

Compute the posterior over model parameters using the data in X and y.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
Returns:

self (BayesianLinearRegressionUnknownVariance instance)

predict(X)[source]

Return the MAP prediction for the targets associated with X.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

BayesianLinearRegressionKnownVariance

class numpy_ml.linear_models.BayesianLinearRegressionKnownVariance(mu=0, sigma=1, V=None, fit_intercept=True)[source]

Bayesian linear regression model with known error variance and conjugate Gaussian prior on model parameters.

Notes

Uses a conjugate Gaussian prior on the model coefficients b. The posterior over model coefficients is then

\[\mathbf{b} \mid \mu, \sigma^2, \mathbf{V} \sim \mathcal{N}(\mu, \sigma^2 \mathbf{V})\]

Ridge regression is a special case of this model where \(\mu = \mathbf{0}\), \(\sigma = 1\), and \(\mathbf{V} = \mathbf{I}\) (i.e., the prior on the model coefficients b is a zero-mean Gaussian with unit covariance).
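
The corresponding conjugate posterior update can be sketched in NumPy as follows (a textbook derivation for the K = 1 case, not necessarily the library's exact code path):

    import numpy as np

    def gaussian_posterior(X, y, mu0, V0, sigma):
        """Posterior over b when the error variance sigma^2 is known (sketch)."""
        V0_inv = np.linalg.inv(V0)
        Vn = np.linalg.inv(V0_inv + X.T @ X)
        mun = Vn @ (V0_inv @ mu0 + X.T @ y)
        return mun, sigma ** 2 * Vn        # b | X, y ~ N(mun, sigma^2 * Vn)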

Parameters:
  • mu (ndarray of shape (M,) or float) – The mean of the Gaussian prior on b. If a float, assume mu is np.ones(M) * mu. Default is 0.
  • sigma (float) – The square root of the scaling term for the covariance of the Gaussian prior on b (i.e., the prior covariance is \(\sigma^2 \mathbf{V}\)). Default is 1.
  • V (ndarray of shape (M, M) or (M,) or None) – A symmetric positive definite matrix that when multiplied element-wise by sigma ** 2 gives the covariance matrix for the Gaussian prior on b. If a 1D array, assume V = diag(V). If None, assume V is the identity matrix. Default is None.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
Variables:
  • posterior (dict or None) – Frozen random variable for the posterior distribution \(P(b \mid X, \sigma^2)\).
  • posterior_predictive (dict or None) – Frozen random variable for the posterior predictive distribution, \(P(y \mid X)\). This value is only set following a call to predict.
fit(X, y)[source]

Compute the posterior over model parameters using the data in X and y.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Return the MAP prediction for the targets associated with X.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:y_pred (ndarray of shape (Z, K)) – The MAP predictions for the targets associated with the items in X.

GaussianNBClassifier

class numpy_ml.linear_models.GaussianNBClassifier(eps=1e-06)[source]

A naive Bayes classifier for real-valued data.

Notes

The naive Bayes model assumes the features of each training example \(\mathbf{x}\) are mutually independent given the example label y:

\[P(\mathbf{x}_i \mid y_i) = \prod_{j=1}^M P(x_{i,j} \mid y_i)\]

where \(M\) is the dimension of the \(i^{th}\) example \(\mathbf{x}_i\) and \(y_i\) is the label associated with the \(i^{th}\) example.

Combining the conditional independence assumption with a simple application of Bayes’ theorem gives the naive Bayes classification rule:

\[\begin{split}\hat{y} &= \arg \max_y P(y \mid \mathbf{x}) \\ &= \arg \max_y P(y) P(\mathbf{x} \mid y) \\ &= \arg \max_y P(y) \prod_{j=1}^M P(x_j \mid y)\end{split}\]

In the final expression, the prior class probability \(P(y)\) can be specified in advance or estimated empirically from the training data.

In the Gaussian version of the naive Bayes model, the feature likelihood is assumed to be normally distributed for each class:

\[\mathbf{x}_i \mid y_i = c, \theta \sim \mathcal{N}(\mu_c, \Sigma_c)\]

where \(\theta\) is the set of model parameters: \(\{\mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K\}\), \(K\) is the total number of unique classes present in the data, and the parameters for the Gaussian associated with class \(c\), \(\mu_c\) and \(\Sigma_c\) (where \(1 \leq c \leq K\)), are estimated via MLE from the set of training examples with label \(c\).
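
The per-class MLE and the resulting classification rule can be sketched from scratch in a few lines of NumPy (the function names and the exact handling of eps are assumptions for exposition, not the library's code):

    import numpy as np

    def fit_gaussian_nb(X, y, eps=1e-6):
        """MLE for the per-class feature means, variances, and priors (sketch)."""
        labels = np.unique(y)
        K, M = len(labels), X.shape[1]
        mean, sigma, prior = np.zeros((K, M)), np.zeros((K, M)), np.zeros(K)
        for k, c in enumerate(labels):
            Xc = X[y == c]
            mean[k] = Xc.mean(axis=0)
            sigma[k] = Xc.var(axis=0) + eps
            prior[k] = Xc.shape[0] / X.shape[0]
        return labels, mean, sigma, prior

    def predict_gaussian_nb(X, labels, mean, sigma, prior):
        """Choose the class with the highest log-posterior under the Gaussian likelihood."""
        log_post = np.stack([
            np.log(prior[k])
            - 0.5 * np.sum(np.log(2 * np.pi * sigma[k]))
            - 0.5 * np.sum((X - mean[k]) ** 2 / sigma[k], axis=1)
            for k in range(len(labels))
        ], axis=1)
        return labels[np.argmax(log_post, axis=1)]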

Parameters:

eps (float) – A value added to the variance to prevent numerical error. Default is 1e-6.

Variables:
  • parameters (dict) – Dictionary of model parameters: “mean”, the (K, M) array of feature means under each class, “sigma”, the (K, M) array of feature variances under each class, and “prior”, the (K,) array of empirical prior probabilities for each class label.
  • hyperparameters (dict) – Dictionary of model hyperparameters
  • labels (ndarray of shape (K,)) – An array containing the unique class labels for the training examples.
fit(X, y)[source]

Fit the model parameters via maximum likelihood.

Notes

The model parameters are stored in the parameters attribute. The following keys are present:

“mean”: ndarray of shape (K, M)
Feature means for each of the K label classes
“sigma”: ndarray of shape (K, M)
Feature variances for each of the K label classes
“prior”: ndarray of shape (K,)
Prior probability of each of the K label classes, estimated empirically from the training data
Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M
  • y (ndarray of shape (N,)) – The class label for each of the N examples in X
Returns:

self (GaussianNBClassifier instance)

predict(X)[source]

Use the trained classifier to predict the class label for each example in X.

Parameters:X (ndarray of shape (N, M)) – A dataset of N examples, each of dimension M
Returns:labels (ndarray of shape (N,)) – The predicted class labels for each example in X
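
Example usage on a toy two-class problem (synthetic data, for illustration only):

    import numpy as np
    from numpy_ml.linear_models import GaussianNBClassifier

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
    y = np.array([0] * 50 + [1] * 50)

    nb = GaussianNBClassifier(eps=1e-6)
    nb.fit(X, y)
    labels = nb.predict(X)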

GeneralizedLinearModel

class numpy_ml.linear_models.GeneralizedLinearModel(link, fit_intercept=True, tol=1e-05, max_iter=100)[source]

A generalized linear model with maximum likelihood fit via iteratively reweighted least squares (IRLS).

Notes

The generalized linear model (GLM) [7] [8] assumes that each target/dependent variable \(y_i\) in the target vector \(\mathbf{y} = (y_1, \ldots, y_n)\) has been drawn independently from a pre-specified distribution in the exponential family [11] with unknown mean \(\mu_i\). The GLM models a (one-to-one, continuous, differentiable) function, g, of this mean value as a linear combination of the model parameters \(\mathbf{b}\) and observed covariates, \(\mathbf{x}_i\):

\[g(\mathbb{E}[y_i \mid \mathbf{x}_i]) = g(\mu_i) = \mathbf{b}^\top \mathbf{x}_i\]

where g is known as the “link function” associated with the GLM. The choice of link function is informed by the instance of the exponential family the target is drawn from. Common examples:

Distribution   Link       Formula
Normal         Identity   \(g(x) = x\)
Bernoulli      Logit      \(g(x) = \log(x) - \log(1 - x)\)
Binomial       Logit      \(g(x) = \log(x) - \log(n - x)\)
Poisson        Log        \(g(x) = \log(x)\)

An iteratively reweighted least squares (IRLS) algorithm [9] can be employed to find the maximum likelihood estimate for the model parameters \(\mathbf{b}\) in any instance of the generalized linear model. IRLS is equivalent to Fisher scoring [10], which itself is a slight modification of the classic Newton-Raphson method for finding the zeros of the first derivative of the model log likelihood.
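
As a concrete instance, IRLS for the Bernoulli/logit case can be sketched as below; the other links follow the same pattern with different working weights and responses. This is an expository sketch, not the library's solver:

    import numpy as np

    def irls_logit(X, y, tol=1e-5, max_iter=100):
        """IRLS / Fisher scoring for a Bernoulli GLM with a logit link (sketch)."""
        b = np.zeros(X.shape[1])
        for _ in range(max_iter):
            eta = X @ b
            mu = 1.0 / (1.0 + np.exp(-eta))     # inverse link (logistic sigmoid)
            s = mu * (1.0 - mu)                 # working weights
            z = eta + (y - mu) / s              # working response
            b_new = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * z))
            if np.max(np.abs(b_new - b)) < tol:
                return b_new
            b = b_new
        return b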

References

[7]Nelder, J., & Wedderburn, R. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3): 370–384.
[8]https://en.wikipedia.org/wiki/Generalized_linear_model
[9]https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
[10]https://en.wikipedia.org/wiki/Scoring_algorithm
[11]https://en.wikipedia.org/wiki/Exponential_family
Parameters:
  • link ({'identity', 'logit', 'log'}) – The link function to use during modeling.
  • fit_intercept (bool) – Whether to fit an intercept term in addition to the model coefficients. Default is True.
  • tol (float) – The convergence tolerance: the minimum difference between successive IRLS iterations. Default is 1e-5.
  • max_iter (int) – The maximum number of iteratively reweighted least squares iterations to run during fitting. Default is 100.
Variables:

beta (ndarray of shape (M, 1) or None) – Fitted model coefficients.

fit(X, y)[source]

Find the maximum likelihood GLM coefficients via IRLS.

Parameters:
  • X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
  • y (ndarray of shape (N,)) – The targets for each of the N examples in X.
Returns:

self (GeneralizedLinearModel instance)

predict(X)[source]

Use the trained model to generate predictions for the distribution means, \(\mu\), associated with the collection of data points in X.

Parameters:X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.
Returns:mu_pred (ndarray of shape (Z,)) – The model predictions for the expected value of the target associated with each item in X.
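
Example usage with a log link on synthetic Poisson-like data (the data-generating process here is an assumption for illustration):

    import numpy as np
    from numpy_ml.linear_models import GeneralizedLinearModel

    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)
    mu = np.exp(X @ np.array([0.3, -0.2, 0.1]))    # log-mean linear in X
    y = rng.poisson(mu).astype(float)

    glm = GeneralizedLinearModel(link="log", fit_intercept=True)
    glm.fit(X, y)
    mu_pred = glm.predict(X)                        # predicted means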