Linear models
#############
Ordinary and Weighted Linear Least Squares
============================================
In weighted linear least-squares regression (WLS), a real-valued target
:math:`y_i` is modeled as a linear combination of covariates
:math:`\mathbf{x}_i` and model coefficients **b**:
.. math::
y_i = \mathbf{b}^\top \mathbf{x}_i + \epsilon_i
In the above equation, :math:`\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)` is a
normally distributed error term with variance :math:`\sigma_i^2`. Ordinary
least squares (OLS) is a special case of this model where the variance is fixed
across all examples, i.e., :math:`\sigma_i = \sigma_j \ \forall i,j`. The
maximum likelihood model parameters, :math:`\hat{\mathbf{b}}_{WLS}`, are those
that minimize the weighted squared error between the model predictions and the
true values:
.. math::

    \mathcal{L} = ||\mathbf{W}^{0.5} (\mathbf{y} - \mathbf{Xb})||_2^2
where :math:`\mathbf{W}` is a diagonal matrix of the example weights. In OLS,
:math:`\mathbf{W}` is the identity matrix. The maximum likelihood estimate for
the model parameters can be computed in closed-form using the normal equations:
.. math::
\hat{\mathbf{b}}_{WLS} =
(\mathbf{X}^\top \mathbf{WX})^{-1} \mathbf{X}^\top \mathbf{Wy}
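As a quick illustration, the weighted normal equations can be evaluated
directly with NumPy. The sketch below is illustrative only (the function and
variable names are not part of the library's API), and it solves the linear
system rather than forming the inverse explicitly:

.. code-block:: python

    import numpy as np

    def wls_fit(X, y, weights=None):
        """Illustrative solve of b_hat = (X^T W X)^{-1} X^T W y (OLS if weights is None)."""
        N = X.shape[0]
        W = np.eye(N) if weights is None else np.diag(weights)
        # Solve (X^T W X) b = X^T W y instead of explicitly inverting X^T W X
        return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # Toy usage: recover known coefficients from noisy data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    b_true = np.array([1.0, -2.0, 0.5])
    y = X @ b_true + rng.normal(scale=0.1, size=100)
    b_ols = wls_fit(X, y)                              # OLS (identity weights)
    b_wls = wls_fit(X, y, weights=np.full(100, 2.0))   # equal weights give the same estimate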
**Models**
- :class:`~numpy_ml.linear_models.LinearRegression`
Ridge Regression
==================
Ridge regression uses the same simple linear regression model but adds a
penalty on the `L2`-norm of the coefficients to the loss function. This is
sometimes known as Tikhonov regularization.
In particular, the ridge model is the same as the OLS model:
.. math::

    \mathbf{y} = \mathbf{Xb} + \mathbf{\epsilon}
where :math:`\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})`,
except now the error for the model is calculated as
.. math::

    \mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2 + \alpha ||\mathbf{b}||_2^2
The penalized least-squares estimate for the model parameters **b**
(equivalently, the MAP estimate under a zero-mean Gaussian prior on **b**) can
be computed in closed form via the adjusted normal equation:
.. math::
\hat{\mathbf{b}}_{Ridge} =
(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}
where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1}
\mathbf{X}^\top` is the pseudoinverse / Moore-Penrose inverse adjusted for
the `L2` penalty on the model coefficients.
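The adjusted normal equation translates directly into a few lines of NumPy. A
minimal sketch (names are illustrative, not the RidgeRegression implementation
itself); note that :math:`\alpha = 0` recovers the OLS solution:

.. code-block:: python

    import numpy as np

    def ridge_fit(X, y, alpha=1.0):
        """Illustrative solve of b_hat = (X^T X + alpha * I)^{-1} X^T y."""
        M = X.shape[1]
        # The alpha * I term keeps X^T X + alpha * I well conditioned / invertible
        return np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)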
**Models**
- :class:`~numpy_ml.linear_models.RidgeRegression`
Bayesian Linear Regression
============================
In its general form, Bayesian linear regression extends the simple linear
regression model by introducing priors on model parameters *b* and/or the
error variance :math:`\sigma^2`.
The introduction of a prior allows us to quantify the uncertainty in our
parameter estimates for b by replacing the MLE point estimate in simple
linear regression with an entire posterior *distribution*, :math:`p(b \mid X, y,
\sigma)`, simply by applying Bayes' rule:

.. math::

    p(b \mid X, y, \sigma) = \frac{ p(y \mid X, b, \sigma) \ p(b \mid \sigma) }{p(y \mid X, \sigma)}
We can also quantify the uncertainty in our predictions :math:`y^*` for some new
data :math:`X^*` with the posterior predictive distribution:
.. math::

    p(y^* \mid X^*, X, y) = \int_{b} p(y^* \mid X^*, b) \ p(b \mid X, y) \ \text{d}b
Depending on the choice of prior it may be impossible to compute an
analytic form for the posterior / posterior predictive distribution. In
these cases, it is common to use approximations, either via MCMC or
variational inference.
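When the posterior is only available through samples (e.g., draws produced by
MCMC), the predictive integral above is typically approximated by averaging
over those draws. A minimal sketch, assuming a hypothetical array
``b_samples`` holding posterior samples of *b*:

.. code-block:: python

    import numpy as np

    def predictive_mean(X_star, b_samples):
        """Monte Carlo estimate of E[y* | X*, X, y] ~= (1/S) sum_s X* b_s."""
        # b_samples has shape (S, M); X_star has shape (N*, M)
        return (X_star @ b_samples.T).mean(axis=1)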
Known variance
--------------------------------
If we happen to already know the error variance :math:`\sigma^2`, the conjugate
prior on `b` is Gaussian. A common parameterization is:
.. math::
b | \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)
where :math:`\mu`, :math:`\sigma` and :math:`V` are hyperparameters. Ridge
regression is a special case of this model where :math:`\mu = 0`,
:math:`\sigma = 1` and :math:`V = I` (i.e., the prior on *b* is a zero-mean,
unit covariance Gaussian).
Due to the conjugacy of the above prior with the Gaussian likelihood, there
exists a closed-form solution for the posterior over the model
parameters:
.. math::

    A &= (V^{-1} + X^\top X)^{-1} \\
    \mu_b &= A V^{-1} \mu + A X^\top y \\
    \Sigma_b &= \sigma^2 A
The model posterior is then
.. math::
b \mid X, y \sim \mathcal{N}(\mu_b, \Sigma_b)
We can also compute a closed-form solution for the posterior predictive
distribution:
.. math::

    y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \Sigma_b X^{* \top} + I)
where :math:`X^*` is the matrix of new data we wish to predict, and :math:`y^*`
are the predicted targets for those data.
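As a rough sketch of the closed-form updates above (illustrative names only,
and assuming the hyperparameters :math:`\mu`, :math:`V`, and :math:`\sigma`
are given):

.. code-block:: python

    import numpy as np

    def blr_posterior(X, y, mu, V, sigma=1.0):
        """Posterior mean and covariance of b given X, y under a N(mu, sigma^2 V) prior."""
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        return mu_b, sigma ** 2 * A

    def blr_posterior_predictive(X_star, mu_b, cov_b):
        """Mean and covariance of y* | X*, X, y."""
        return X_star @ mu_b, X_star @ cov_b @ X_star.T + np.eye(X_star.shape[0])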
**Models**
- :class:`~numpy_ml.linear_models.BayesianLinearRegressionKnownVariance`
Unknown variance
--------------------------------
If *both* *b* and the error variance :math:`\sigma^2` are unknown, the
conjugate prior for the Gaussian likelihood is the Normal-Inverse-Gamma
distribution (univariate likelihood) or the Normal-Inverse-Wishart
distribution (multivariate likelihood).
**Univariate**
.. math::

    b, \sigma^2 &\sim \text{NIG}(\mu, V, \alpha, \beta) \\
    \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\
    b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)
where :math:`\alpha, \beta, V`, and :math:`\mu` are parameters of the
prior.
**Multivariate**
.. math::
b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\
\Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\
b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)
where :math:`\mu, \lambda, \Psi`, and :math:`\rho` are
parameters of the prior.
Due to the conjugacy of the above priors with the Gaussian likelihood,
there exists a closed-form solution for the posterior over the model
parameters:
.. math::

    \sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\
    b \mid X, y, \sigma^2 &\sim \mathcal{N}(\mu_b, \Sigma_b)

where

.. math::

    B &= y - X \mu \\
    \text{shape} &= N + \alpha \\
    \text{scale} &= \frac{1}{\text{shape}} \left( \alpha \beta + B^\top (X V X^\top + I)^{-1} B \right) \\
    A &= (V^{-1} + X^\top X)^{-1} \\
    \mu_b &= A V^{-1} \mu + A X^\top y \\
    \Sigma_b &= \sigma^2 A
We can also compute a closed-form solution for the posterior predictive distribution:
.. math::

    y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{* \top} + I)
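A hedged sketch of these updates (illustrative names only; the conditional
posterior over *b* is summarized here by plugging in the posterior mean of
:math:`\sigma^2`, which requires shape > 1):

.. code-block:: python

    import numpy as np

    def blr_unknown_variance_posterior(X, y, mu, V, alpha, beta):
        N = X.shape[0]
        B = y - X @ mu
        shape = N + alpha
        scale = (alpha * beta + B @ np.linalg.inv(X @ V @ X.T + np.eye(N)) @ B) / shape
        # Posterior mean of sigma^2 under InverseGamma(shape, scale); assumes shape > 1
        sigma2 = scale / (shape - 1)
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        return (shape, scale), (mu_b, sigma2 * A)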
**Models**
- :class:`~numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance`
Naive Bayes Classifier
========================
The naive Bayes model assumes the features of a training example
:math:`\mathbf{x}` are mutually independent given the example label :math:`y`:
.. math::
P(\mathbf{x}_i \mid y_i) = \prod_{j=1}^M P(x_{i,j} \mid y_i)
where :math:`M` is the number of features (the dimensionality) of the
:math:`i^{th}` example :math:`\mathbf{x}_i` and :math:`y_i` is the label
associated with the :math:`i^{th}` example.
Combining this conditional independence assumption with a simple application of
Bayes' theorem gives the naive Bayes classification rule:
.. math::
\hat{y} &= \arg \max_y P(y \mid \mathbf{x}) \\
&= \arg \max_y P(y) P(\mathbf{x} \mid y) \\
&= \arg \max_y P(y) \prod_{j=1}^M P(x_j \mid y)
The prior class probability :math:`P(y)` can be specified in advance or
estimated empirically from the training data.
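A compact sketch of this decision rule with Gaussian feature likelihoods and
empirical class priors, computed in log space for numerical stability
(illustrative only, not the GaussianNBClassifier implementation):

.. code-block:: python

    import numpy as np

    def gaussian_nb_predict(X_train, y_train, X_test, eps=1e-9):
        classes = np.unique(y_train)
        scores = []
        for c in classes:
            Xc = X_train[y_train == c]
            log_prior = np.log(Xc.shape[0] / X_train.shape[0])  # empirical P(y = c)
            mean, var = Xc.mean(axis=0), Xc.var(axis=0) + eps
            # log P(y) + sum_j log N(x_j; mean_j, var_j)
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_test - mean) ** 2 / var).sum(axis=1)
            scores.append(log_prior + log_lik)
        return classes[np.argmax(np.column_stack(scores), axis=1)]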
**Models**
- :class:`~numpy_ml.linear_models.GaussianNBClassifier`
Generalized Linear Model
==========================
The generalized linear model (GLM) assumes that each target/dependent variable
:math:`y_i` in target vector :math:`\mathbf{y} = (y_1, \ldots, y_n)`, has been
drawn independently from a pre-specified distribution in the exponential family
with unknown mean :math:`\mu_i`. The GLM models a (one-to-one, continuous,
differentiable) function, *g*, of this mean value as a linear combination of
the model parameters :math:`\mathbf{b}` and observed covariates,
:math:`\mathbf{x}_i` :
.. math::
g(\mathbb{E}[y_i \mid \mathbf{x}_i]) =
g(\mu_i) = \mathbf{b}^\top \mathbf{x}_i
where *g* is known as the link function. The choice of link function is
informed by the member of the exponential family from which the target is
drawn.
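For example, a Poisson-distributed target is usually paired with a log link,
and a Bernoulli-distributed target with a logit link. A short sketch of
evaluating the GLM mean for a few common links (illustrative names only):

.. code-block:: python

    import numpy as np

    def glm_mean(X, b, link="log"):
        """Return mu_i = g^{-1}(b^T x_i) for a given link function g."""
        eta = X @ b                           # linear predictor
        if link == "identity":                # Gaussian target (ordinary linear regression)
            return eta
        if link == "log":                     # Poisson target
            return np.exp(eta)
        if link == "logit":                   # Bernoulli target (logistic regression)
            return 1.0 / (1.0 + np.exp(-eta))
        raise ValueError(f"Unrecognized link: {link}")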
**Models**
- :class:`~numpy_ml.linear_models.GeneralizedLinearModel`
.. toctree::
:maxdepth: 2
:hidden:
numpy_ml.linear_models.lm