# LinearRegression¶

class numpy_ml.linear_models.LinearRegression(fit_intercept=True)[source]

An ordinary least squares regression model fit via the normal equation.

Notes

Given data matrix X and target vector y, the maximum-likelihood estimate for the regression coefficients, $$\beta$$, is:

$\hat{\beta} = \Sigma^{-1} \mathbf{X}^\top \mathbf{y}$

where $$\Sigma^{-1} = (\mathbf{X}^\top \mathbf{X})^{-1}$$.
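The normal equation is a one-line computation in plain NumPy. The sketch below (illustrative variable names, not the library's internals) recovers known coefficients from synthetic data:

```python
import numpy as np

# Toy data: N=50 examples, M=3 features, known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.01 * rng.normal(size=50)

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta_hat)  # close to [2.0, -1.0, 0.5]
```

In practice `np.linalg.lstsq` is preferred over explicitly inverting $$\mathbf{X}^\top \mathbf{X}$$, which can be ill-conditioned; the explicit inverse is shown here only to mirror the equation above.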

Parameters: fit_intercept (bool) – Whether to fit an intercept term in addition to the model coefficients. Default is True.
update(X, y)[source]

Incrementally update the least-squares coefficients for a set of new examples.

Notes

The recursive least-squares algorithm [1] [2] is used to efficiently update the regression parameters as new examples become available. For a single new example $$(\mathbf{x}_{t+1}, \mathbf{y}_{t+1})$$, the parameter updates are

$\beta_{t+1} = \left( \mathbf{X}_{1:t}^\top \mathbf{X}_{1:t} + \mathbf{x}_{t+1}\mathbf{x}_{t+1}^\top \right)^{-1} \left( \mathbf{X}_{1:t}^\top \mathbf{Y}_{1:t} + \mathbf{x}_{t+1} \mathbf{y}_{t+1}^\top \right)$

where $$\beta_{t+1}$$ are the updated regression coefficients, and $$\mathbf{X}_{1:t}$$ and $$\mathbf{Y}_{1:t}$$ are the examples and targets observed from timesteps 1 through t.

In the single-example case, the RLS algorithm uses the Sherman-Morrison formula [3] to avoid re-inverting the covariance matrix on each new update. In the multi-example case (i.e., where $$\mathbf{X}_{t+1}$$ and $$\mathbf{y}_{t+1}$$ are matrices of N examples each), we use the generalized Woodbury matrix identity [4] to update the inverse covariance. This carries a performance cost, but is still cheaper than performing N single-example updates when N is large.
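A minimal sketch of the single-example case (illustrative, not the library's actual implementation): the Sherman-Morrison identity updates the inverse covariance without a fresh matrix inversion, and the result matches the batch normal-equation solution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, 3.0])

# Batch solution on the first 19 examples
A_inv = np.linalg.inv(X[:19].T @ X[:19])          # (X^T X)^{-1}
beta = A_inv @ X[:19].T @ y[:19]

# Sherman-Morrison:
#   (A + x x^T)^{-1} = A^{-1} - (A^{-1} x x^T A^{-1}) / (1 + x^T A^{-1} x)
x_new, y_new = X[19], y[19]
Ax = A_inv @ x_new
A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x_new @ Ax)

# RLS coefficient update with the refreshed inverse
beta = A_inv @ (X[:19].T @ y[:19] + x_new * y_new)

# Agrees with re-solving the normal equation on all 20 examples
batch = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.allclose(beta, batch))
```

The saving is that the update costs O(M^2) per example instead of the O(M^3) of a full inversion.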

References

 [1] Gauss, C. F. (1821). _Theoria combinationis observationum erroribus minimis obnoxiae_, Werke, 4. Göttingen.
Parameters:

- X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
- y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
fit(X, y)[source]

Fit the regression coefficients via maximum likelihood.

Parameters:

- X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
- y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Use the trained model to generate predictions on a new collection of data points.

Parameters:

- X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.

Returns:

- y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

# RidgeRegression¶

class numpy_ml.linear_models.RidgeRegression(alpha=1, fit_intercept=True)[source]

A ridge regression model fit via the normal equation.

Notes

Given data matrix X and target vector y, the penalized least-squares estimate for the ridge coefficients, $$\beta$$, is:

$\hat{\beta} = \left(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{y}$

It turns out that this estimate for $$\beta$$ also corresponds to the MAP estimate if we assume a multivariate Gaussian prior on the model coefficients:

$\beta \sim \mathcal{N}(\mathbf{0}, \frac{1}{2M} \mathbf{I})$

Note that this assumes that the data matrix X has been standardized and the target values y centered at 0.
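The ridge estimate is straightforward to compute directly from the equation above. The following NumPy sketch (synthetic data, illustrative names) also shows the characteristic effect of the penalty: the ridge solution has a smaller L2 norm than the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

alpha = 1.0
M = X.shape[1]

# Ridge normal equation: beta_hat = (X^T X + alpha * I)^{-1} X^T y
beta_ridge = np.linalg.inv(X.T @ X + alpha * np.eye(M)) @ X.T @ y

# alpha = 0 recovers the OLS solution; larger alpha shrinks the norm
beta_ols = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```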

Parameters:

- alpha (float) – L2 regularization coefficient. Higher values correspond to a larger penalty on the L2 norm of the model coefficients. Default is 1.
- fit_intercept (bool) – Whether to fit an intercept term in addition to the model coefficients. Default is True.
fit(X, y)[source]

Fit the ridge regression coefficients via the normal equation.

Parameters:

- X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
- y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Use the trained model to generate predictions on a new collection of data points.

Parameters:

- X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.

Returns:

- y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

# LogisticRegression¶

class numpy_ml.linear_models.LogisticRegression(penalty='l2', gamma=0, fit_intercept=True)[source]

A simple logistic regression model fit via gradient descent on the penalized negative log likelihood.

Notes

For logistic regression, the penalized negative log likelihood of the targets y under the current model is

$- \log \mathcal{L}(\mathbf{b}, \mathbf{y}) = -\frac{1}{N} \left[ \left( \sum_{i=1}^N y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right) - R(\mathbf{b}, \gamma) \right]$

where

$\begin{split}R(\mathbf{b}, \gamma) = \left\{ \begin{array}{lr} \frac{\gamma}{2} ||\mathbf{b}||_2^2 & :\texttt{ penalty = 'l2'}\\ \gamma ||\mathbf{b}||_1 & :\texttt{ penalty = 'l1'} \end{array} \right.\end{split}$

is a regularization penalty, $$\gamma$$ is a regularization weight, N is the number of examples in y, and b is the vector of model coefficients.
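A minimal gradient-descent sketch of this objective with the L2 penalty (plain NumPy; the data, hyperparameters, and variable names are illustrative, not the library's internals):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
N, M = 200, 2
X = rng.normal(size=(N, M))
y = (X[:, 0] - X[:, 1] > 0).astype(float)   # linearly separable labels

gamma = 0.1          # L2 regularization weight
lr = 0.1
b = np.zeros(M)
for _ in range(500):
    y_hat = sigmoid(X @ b)
    # Gradient of the penalized NLL (L2 case): (1/N) [X^T (y_hat - y) + gamma * b]
    grad = (X.T @ (y_hat - y) + gamma * b) / N
    b = b - lr * grad

acc = np.mean((sigmoid(X @ b) > 0.5) == y.astype(bool))
print(acc)  # high accuracy on this separable toy problem
```

Note that with the L1 penalty the objective is non-differentiable at 0, so plain gradient descent uses a subgradient there; the sketch covers only the smooth L2 case.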

Parameters:

- penalty ({'l1', 'l2'}) – The type of regularization penalty to apply to the coefficients b. Default is 'l2'.
- gamma (float) – The regularization weight. Larger values correspond to larger regularization penalties; a value of 0 indicates no penalty. Default is 0.
- fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
fit(X, y, lr=0.01, tol=1e-07, max_iter=10000000.0)[source]

Fit the regression coefficients via gradient descent on the negative log likelihood.

Parameters:

- X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
- y (ndarray of shape (N,)) – The binary targets for each of the N examples in X.
- lr (float) – The gradient descent learning rate. Default is 0.01.
- tol (float) – The convergence tolerance: training stops when the change in loss between iterations falls below tol. Default is 1e-7.
- max_iter (float) – The maximum number of iterations to run the gradient descent solver. Default is 1e7.
predict(X)[source]

Use the trained model to generate prediction probabilities on a new collection of data points.

Parameters:

- X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.

Returns:

- y_pred (ndarray of shape (Z,)) – The model prediction probabilities for the items in X.

# BayesianLinearRegressionUnknownVariance¶

class numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance(alpha=1, beta=2, b_mean=0, b_V=None, fit_intercept=True)[source]

Bayesian linear regression model with unknown variance and conjugate Normal-Gamma prior on b and $$\sigma^2$$.

Notes

Uses a conjugate Normal-Gamma prior on b and $$\sigma^2$$. The joint and marginal posteriors over error variance and model parameters are:

$\begin{split}b, \sigma^2 &\sim \text{NG}(b_{mean}, b_{V}, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ b &\sim \mathcal{N}(b_{mean}, \sigma^2 \cdot b_V)\end{split}$
Parameters:

- alpha (float) – The shape parameter for the Inverse-Gamma prior on $$\sigma^2$$. Must be strictly greater than 0. Default is 1.
- beta (float) – The scale parameter for the Inverse-Gamma prior on $$\sigma^2$$. Must be strictly greater than 0. Default is 2.
- b_mean (ndarray of shape (M,) or float) – The mean of the Gaussian prior on b. If a float, assume b_mean is np.ones(M) * b_mean. Default is 0.
- b_V (ndarray of shape (M, M) or (M,) or None) – A symmetric positive definite matrix that, when multiplied element-wise by $$\sigma^2$$, gives the covariance matrix for the Gaussian prior on b. If a list, assume b_V = diag(b_V). If None, assume b_V is the identity matrix. Default is None.
- fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
fit(X, y)[source]

Compute the posterior over model parameters using the data in X and y.

Parameters:

- X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
- y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Return the MAP prediction for the targets associated with X.

Parameters:

- X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.

Returns:

- y_pred (ndarray of shape (Z, K)) – The model predictions for the items in X.

# BayesianLinearRegressionKnownVariance¶

class numpy_ml.linear_models.BayesianLinearRegressionKnownVariance(b_mean=0, b_sigma=1, b_V=None, fit_intercept=True)[source]

Bayesian linear regression model with known error variance and conjugate Gaussian prior on model parameters.

Notes

Uses a conjugate Gaussian prior on the model coefficients. The posterior over model parameters is

$b \mid b_{mean}, \sigma^2, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)$

Ridge regression is a special case of this model where $$b_{mean} = 0$$, $$\sigma = 1$$, and $$b_V = I$$ (i.e., the prior on b is a zero-mean, unit-covariance Gaussian).
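With known error variance the posterior mean has a closed form via the standard Gaussian conjugacy result: for prior $$b \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)$$ and likelihood $$y \sim \mathcal{N}(Xb, \sigma^2 I)$$, the posterior is $$\mathcal{N}(\mu_n, \sigma^2 V_n)$$ with $$V_n = (b_V^{-1} + X^\top X)^{-1}$$ and $$\mu_n = V_n (b_V^{-1} b_{mean} + X^\top y)$$. The sketch below assumes that textbook parameterization (not necessarily the library's exact internals) and checks the ridge correspondence noted above:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
true_b = np.array([1.0, -1.0, 2.0])
sigma = 1.0
y = X @ true_b + sigma * rng.normal(size=50)

# Prior: b ~ N(b_mean, sigma^2 * b_V)
b_mean = np.zeros(3)
b_V = np.eye(3)

# Conjugate posterior update
V_n = np.linalg.inv(np.linalg.inv(b_V) + X.T @ X)
mu_n = V_n @ (np.linalg.inv(b_V) @ b_mean + X.T @ y)

# With b_mean = 0 and b_V = I this is exactly ridge regression with alpha = 1
ridge = np.linalg.inv(X.T @ X + np.eye(3)) @ X.T @ y
print(np.allclose(mu_n, ridge))
```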

Parameters:

- b_mean (ndarray of shape (M,) or float) – The mean of the Gaussian prior on b. If a float, assume b_mean is np.ones(M) * b_mean. Default is 0.
- b_sigma (float) – A scaling term for the covariance of the Gaussian prior on b. Default is 1.
- b_V (ndarray of shape (M, M) or (M,) or None) – A symmetric positive definite matrix that, when multiplied element-wise by b_sigma^2, gives the covariance matrix for the Gaussian prior on b. If a list, assume b_V = diag(b_V). If None, assume b_V is the identity matrix. Default is None.
- fit_intercept (bool) – Whether to fit an intercept term in addition to the coefficients in b. If True, the estimates for b will have M + 1 dimensions, where the first dimension corresponds to the intercept. Default is True.
fit(X, y)[source]

Compute the posterior over model parameters using the data in X and y.

Parameters:

- X (ndarray of shape (N, M)) – A dataset consisting of N examples, each of dimension M.
- y (ndarray of shape (N, K)) – The targets for each of the N examples in X, where each target has dimension K.
predict(X)[source]

Return the MAP prediction for the targets associated with X.

Parameters:

- X (ndarray of shape (Z, M)) – A dataset consisting of Z new examples, each of dimension M.

Returns:

- y_pred (ndarray of shape (Z, K)) – The MAP predictions for the targets associated with the items in X.