Linear models
#############
Ordinary and Weighted Linear Least Squares
============================================
In weighted linear least-squares regression (WLS), a real-valued target
:math:`y_i` is modeled as a linear combination of covariates
:math:`\mathbf{x}_i` and model coefficients **b**:
.. math::
y_i = \mathbf{b}^\top \mathbf{x}_i + \epsilon_i
In the above equation, :math:`\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)` is a
normally distributed error term with variance :math:`\sigma_i^2`. Ordinary
least squares (OLS) is a special case of this model where the variance is fixed
across all examples, i.e., :math:`\sigma_i = \sigma_j \ \forall i,j`. The
maximum likelihood model parameters, :math:`\hat{\mathbf{b}}_{WLS}`, are those
that minimize the weighted squared error between the model predictions and the
true values:
.. math::

    \mathcal{L} = ||\mathbf{W}^{0.5} (\mathbf{y} - \mathbf{Xb})||_2^2
where :math:`\mathbf{W}` is a diagonal matrix of the example weights. In OLS,
:math:`\mathbf{W}` is the identity matrix. The maximum likelihood estimate for
the model parameters can be computed in closed-form using the normal equations:
.. math::
\hat{\mathbf{b}}_{WLS} =
(\mathbf{X}^\top \mathbf{WX})^{-1} \mathbf{X}^\top \mathbf{Wy}
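As a quick illustration, the weighted normal equations can be evaluated
directly with NumPy. The sketch below is illustrative only (the function and
variable names are not part of the library's API), and it solves the linear
system rather than forming the inverse explicitly:

.. code-block:: python

    import numpy as np

    def wls_fit(X, y, weights=None):
        """Illustrative solve of b_hat = (X^T W X)^{-1} X^T W y (OLS if weights is None)."""
        N = X.shape[0]
        W = np.eye(N) if weights is None else np.diag(weights)
        # Solve (X^T W X) b = X^T W y instead of explicitly inverting X^T W X
        return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # Toy usage: recover known coefficients from noisy data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    b_true = np.array([1.0, -2.0, 0.5])
    y = X @ b_true + rng.normal(scale=0.1, size=100)
    b_ols = wls_fit(X, y)                              # OLS (identity weights)
    b_wls = wls_fit(X, y, weights=np.full(100, 2.0))   # equal weights give the same estimate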
**Models**
- :class:`~numpy_ml.linear_models.LinearRegression`
Ridge Regression
==================
Ridge regression uses the same simple linear regression model but adds a
penalty on the `L2`-norm of the coefficients to the loss function. This is
sometimes known as Tikhonov regularization.
In particular, the ridge model is the same as the OLS model:
.. math::

    \mathbf{y} = \mathbf{Xb} + \mathbf{\epsilon}
where :math:`\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})`,
except now the error for the model is calculated as
.. math::

    \mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2 + \alpha ||\mathbf{b}||_2^2
The penalized least-squares estimate for the model parameters **b**
(equivalently, the MAP estimate under a zero-mean Gaussian prior on **b**) can
be computed in closed form via the adjusted normal equation:
.. math::
\hat{\mathbf{b}}_{Ridge} =
(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}
where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1}
\mathbf{X}^\top` is the pseudoinverse / Moore-Penrose inverse adjusted for
the `L2` penalty on the model coefficients.
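The adjusted normal equation translates directly into a few lines of NumPy. A
minimal sketch (names are illustrative, not the RidgeRegression implementation
itself); note that :math:`\alpha = 0` recovers the OLS solution:

.. code-block:: python

    import numpy as np

    def ridge_fit(X, y, alpha=1.0):
        """Illustrative solve of b_hat = (X^T X + alpha * I)^{-1} X^T y."""
        M = X.shape[1]
        # The alpha * I term keeps X^T X + alpha * I well conditioned / invertible
        return np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)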
**Models**
- :class:`~numpy_ml.linear_models.RidgeRegression`
Bayesian Linear Regression
============================
In its general form, Bayesian linear regression extends the simple linear
regression model by introducing priors on model parameters *b* and/or the
error variance :math:`\sigma^2`.
The introduction of a prior allows us to quantify the uncertainty in our
parameter estimates for b by replacing the MLE point estimate in simple
linear regression with an entire posterior *distribution*, :math:`p(b \mid X, y,
\sigma)`, simply by applying Bayes' rule:

.. math::

    p(b \mid X, y, \sigma) = \frac{ p(y \mid X, b, \sigma) \ p(b \mid \sigma) }{p(y \mid X, \sigma)}
We can also quantify the uncertainty in our predictions :math:`y^*` for some new
data :math:`X^*` with the posterior predictive distribution:
.. math::

    p(y^* \mid X^*, X, y) = \int_{b} p(y^* \mid X^*, b) \ p(b \mid X, y) \ \text{d}b
Depending on the choice of prior it may be impossible to compute an
analytic form for the posterior / posterior predictive distribution. In
these cases, it is common to use approximations, either via MCMC or
variational inference.
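When the posterior is only available through samples (e.g., draws produced by
MCMC), the predictive integral above is typically approximated by averaging
over those draws. A minimal sketch, assuming a hypothetical array
``b_samples`` holding posterior samples of *b*:

.. code-block:: python

    import numpy as np

    def predictive_mean(X_star, b_samples):
        """Monte Carlo estimate of E[y* | X*, X, y] ~= (1/S) sum_s X* b_s."""
        # b_samples has shape (S, M); X_star has shape (N*, M)
        return (X_star @ b_samples.T).mean(axis=1)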
Known variance
--------------------------------
If we happen to already know the error variance :math:`\sigma^2`, the conjugate
prior on `b` is Gaussian. A common parameterization is:
.. math::
b | \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)
where :math:`\mu`, :math:`\sigma` and :math:`V` are hyperparameters. Ridge
regression is a special case of this model where :math:`\mu = 0`,
:math:`\sigma = 1` and :math:`V = I` (i.e., the prior on *b* is a zero-mean,
unit covariance Gaussian).
Due to the conjugacy of the above prior with the Gaussian likelihood, there
exists a closed-form solution for the posterior over the model
parameters:
.. math::

    A &= (V^{-1} + X^\top X)^{-1} \\
    \mu_b &= A V^{-1} \mu + A X^\top y \\
    \Sigma_b &= \sigma^2 A
The model posterior is then
.. math::
b \mid X, y \sim \mathcal{N}(\mu_b, \Sigma_b)
We can also compute a closed-form solution for the posterior predictive
distribution:
.. math::

    y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \Sigma_b X^{* \top} + I)
where :math:`X^*` is the matrix of new data we wish to predict, and :math:`y^*`
are the predicted targets for those data.
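As a rough sketch of the closed-form updates above (illustrative names only,
and assuming the hyperparameters :math:`\mu`, :math:`V`, and :math:`\sigma`
are given):

.. code-block:: python

    import numpy as np

    def blr_posterior(X, y, mu, V, sigma=1.0):
        """Posterior mean and covariance of b given X, y under a N(mu, sigma^2 V) prior."""
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        return mu_b, sigma ** 2 * A

    def blr_posterior_predictive(X_star, mu_b, cov_b):
        """Mean and covariance of y* | X*, X, y."""
        return X_star @ mu_b, X_star @ cov_b @ X_star.T + np.eye(X_star.shape[0])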
**Models**
- :class:`~numpy_ml.linear_models.BayesianLinearRegressionKnownVariance`
Unknown variance
--------------------------------
If *both* *b* and the error variance :math:`\sigma^2` are unknown, the
conjugate prior for the Gaussian likelihood is the Normal-Inverse-Gamma
distribution (univariate likelihood) or the Normal-Inverse-Wishart
distribution (multivariate likelihood).
**Univariate**
.. math::

    b, \sigma^2 &\sim \text{NIG}(\mu, V, \alpha, \beta) \\
    \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\
    b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)
where :math:`\alpha, \beta, V`, and :math:`\mu` are parameters of the
prior.
**Multivariate**
.. math::
b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\
\Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\
b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)
where :math:`\mu, \lambda, \Psi`, and :math:`\rho` are
parameters of the prior.
Due to the conjugacy of the above priors with the Gaussian likelihood,
there exists a closed-form solution for the posterior over the model
parameters:
.. math::

    \sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\
    b \mid X, y, \sigma^2 &\sim \mathcal{N}(\mu_b, \Sigma_b)

where

.. math::

    B &= y - X \mu \\
    \text{shape} &= N + \alpha \\
    \text{scale} &= \frac{1}{\text{shape}} \left( \alpha \beta + B^\top (X V X^\top + I)^{-1} B \right) \\
    A &= (V^{-1} + X^\top X)^{-1} \\
    \mu_b &= A V^{-1} \mu + A X^\top y \\
    \Sigma_b &= \sigma^2 A
We can also compute a closed-form solution for the posterior predictive distribution:
.. math::

    y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{* \top} + I)
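A hedged sketch of these updates (illustrative names only; the conditional
posterior over *b* is summarized here by plugging in the posterior mean of
:math:`\sigma^2`, which requires shape > 1):

.. code-block:: python

    import numpy as np

    def blr_unknown_variance_posterior(X, y, mu, V, alpha, beta):
        N = X.shape[0]
        B = y - X @ mu
        shape = N + alpha
        scale = (alpha * beta + B @ np.linalg.inv(X @ V @ X.T + np.eye(N)) @ B) / shape
        # Posterior mean of sigma^2 under InverseGamma(shape, scale); assumes shape > 1
        sigma2 = scale / (shape - 1)
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        return (shape, scale), (mu_b, sigma2 * A)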
**Models**
- :class:`~numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance`
Naive Bayes Classifier
========================
The naive Bayes model assumes the features of a training example
:math:`\mathbf{x}` are mutually independent given the example label :math:`y`:
.. math::
P(\mathbf{x}_i \mid y_i) = \prod_{j=1}^M P(x_{i,j} \mid y_i)
where :math:`M` is the number of features (the dimensionality) of the
:math:`i^{th}` example :math:`\mathbf{x}_i` and :math:`y_i` is the label
associated with the :math:`i^{th}` example.
Combining this conditional independence assumption with a simple application of
Bayes' theorem gives the naive Bayes classification rule:
.. math::
\hat{y} &= \arg \max_y P(y \mid \mathbf{x}) \\
&= \arg \max_y P(y) P(\mathbf{x} \mid y) \\
&= \arg \max_y P(y) \prod_{j=1}^M P(x_j \mid y)
The prior class probability :math:`P(y)` can be specified in advance or
estimated empirically from the training data.
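A compact sketch of this decision rule with Gaussian feature likelihoods and
empirical class priors, computed in log space for numerical stability
(illustrative only, not the GaussianNBClassifier implementation):

.. code-block:: python

    import numpy as np

    def gaussian_nb_predict(X_train, y_train, X_test, eps=1e-9):
        classes = np.unique(y_train)
        scores = []
        for c in classes:
            Xc = X_train[y_train == c]
            log_prior = np.log(Xc.shape[0] / X_train.shape[0])  # empirical P(y = c)
            mean, var = Xc.mean(axis=0), Xc.var(axis=0) + eps
            # log P(y) + sum_j log N(x_j; mean_j, var_j)
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_test - mean) ** 2 / var).sum(axis=1)
            scores.append(log_prior + log_lik)
        return classes[np.argmax(np.column_stack(scores), axis=1)]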
**Models**
- :class:`~numpy_ml.linear_models.GaussianNBClassifier`
Generalized Linear Model
==========================
The generalized linear model (GLM) assumes that each target/dependent variable
:math:`y_i` in target vector :math:`\mathbf{y} = (y_1, \ldots, y_n)`, has been
drawn independently from a pre-specified distribution in the exponential family
with unknown mean :math:`\mu_i`. The GLM models a (one-to-one, continuous,
differentiable) function, *g*, of this mean value as a linear combination of
the model parameters :math:`\mathbf{b}` and observed covariates,
:math:`\mathbf{x}_i` :
.. math::
g(\mathbb{E}[y_i \mid \mathbf{x}_i]) =
g(\mu_i) = \mathbf{b}^\top \mathbf{x}_i
where *g* is known as the link function. The choice of link function is
informed by the member of the exponential family from which the target is
drawn.
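For example, a Poisson-distributed target is usually paired with a log link,
and a Bernoulli-distributed target with a logit link. A short sketch of
evaluating the GLM mean for a few common links (illustrative names only):

.. code-block:: python

    import numpy as np

    def glm_mean(X, b, link="log"):
        """Return mu_i = g^{-1}(b^T x_i) for a given link function g."""
        eta = X @ b                           # linear predictor
        if link == "identity":                # Gaussian target (ordinary linear regression)
            return eta
        if link == "log":                     # Poisson target
            return np.exp(eta)
        if link == "logit":                   # Bernoulli target (logistic regression)
            return 1.0 / (1.0 + np.exp(-eta))
        raise ValueError(f"Unrecognized link: {link}")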
**Models**
- :class:`~numpy_ml.linear_models.GeneralizedLinearModel`
.. toctree::
:maxdepth: 2
:hidden:
numpy_ml.linear_models.lm