# Linear models¶

The linear models module contains several popular instances of the generalized linear model (GLM).

## Linear Regression

The simple linear regression model is

$\mathbf{y} = \mathbf{Xb} + \epsilon$

where

$\epsilon \sim \mathcal{N}(0, \sigma^2 I)$

In probabilistic terms this corresponds to

$\begin{split}\mathbf{y} - \mathbf{Xb} &\sim \mathcal{N}(0, \sigma^2 I) \\ \mathbf{y} \mid \mathbf{X}, \mathbf{b} &\sim \mathcal{N}(\mathbf{Xb}, \sigma^2 I)\end{split}$

The loss for the model is simply the squared error between the model predictions and the true values:

$\mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2$

The MLE for the model parameters $\mathbf{b}$ can be computed in closed form via the normal equation:

$\mathbf{b}_{MLE} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$

where $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is known as the pseudoinverse (Moore-Penrose inverse) of $\mathbf{X}$.
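As a quick sketch in NumPy (the data here is synthetic and purely illustrative), the normal-equation solution can be computed directly:

```python
import numpy as np

# Illustrative synthetic data: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=100)

# Normal equation: b_MLE = (X^T X)^{-1} X^T y.
# np.linalg.pinv computes the Moore-Penrose pseudoinverse directly.
b_mle = np.linalg.pinv(X) @ y

# Equivalent, but more numerically stable: solve the linear system
# rather than forming the inverse explicitly.
b_solve = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice a least-squares solver such as `np.linalg.lstsq` (QR/SVD-based) is preferred over explicitly inverting $\mathbf{X}^\top \mathbf{X}$, which can be ill-conditioned.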


## Ridge Regression

Ridge regression uses the same linear model but adds a penalty on the L2-norm of the coefficients to the loss function. This is sometimes known as Tikhonov regularization.

In particular, the ridge model is still simply

$\mathbf{y} = \mathbf{Xb} + \epsilon$

where

$\epsilon \sim \mathcal{N}(0, \sigma^2 I)$

except now the loss for the model is calculated as

$\mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2 + \alpha ||\mathbf{b}||_2^2$

The MLE for the model parameters $\mathbf{b}$ can be computed in closed form via the adjusted normal equation:

$\mathbf{b}_{MLE} = (\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top \mathbf{y}$

where $(\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top$ is the pseudoinverse / Moore-Penrose inverse adjusted for the L2 penalty on the model coefficients.
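A minimal sketch of the adjusted normal equation in NumPy (synthetic data; `alpha` is the penalty hyperparameter):

```python
import numpy as np

# Illustrative synthetic data: 50 samples, 4 features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
b_true = np.array([2.0, 0.0, -1.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=50)

alpha = 1.0  # strength of the L2 penalty
M = X.shape[1]

# Adjusted normal equation: b = (X^T X + alpha * I)^{-1} X^T y
b_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)

# For comparison: the unpenalized OLS solution
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

Increasing `alpha` shrinks the solution toward zero: the L2-norm of `b_ridge` never exceeds that of `b_ols`.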


## Bayesian Linear Regression

In its general form, Bayesian linear regression extends the simple linear regression model by introducing priors on the model parameters $b$ and/or the error variance $\sigma^2$.

The introduction of a prior allows us to quantify the uncertainty in our parameter estimates for $b$ by replacing the MLE point estimate in simple linear regression with an entire posterior distribution, $p(b \mid X, y, \sigma)$, simply by applying Bayes' rule:

$p(b \mid X, y, \sigma) = \frac{ p(y \mid X, b, \sigma) \ p(b \mid \sigma) }{p(y \mid X, \sigma)}$

We can also quantify the uncertainty in our predictions $y^*$ for some new data $X^*$ with the posterior predictive distribution:

$p(y^* \mid X^*, X, y) = \int_{b} p(y^* \mid X^*, b) \ p(b \mid X, y) \ db$

Depending on the choice of prior it may be impossible to compute an analytic form for the posterior / posterior predictive distribution. In these cases, it is common to use approximations, either via MCMC or variational inference.

#### Known variance

If we happen to already know the error variance $\sigma^2$, the conjugate prior on $b$ is Gaussian. A common parameterization is:

$b \mid \sigma, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)$

where $b_{mean}$, $\sigma$ and $b_V$ are hyperparameters. Ridge regression is a special case of this model where $b_{mean} = 0$, $\sigma = 1$ and $b_V = I$ (i.e., the prior on $b$ is a zero-mean Gaussian with identity covariance).

Due to the conjugacy of the above prior with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters:

$\begin{split}A &= (b_V^{-1} + X^\top X)^{-1} \\ \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\ \text{cov}_b &= \sigma^2 A \\\end{split}$

The model posterior is then

$b \mid X, y \sim \mathcal{N}(\mu_b, \text{cov}_b)$

We can also compute a closed-form solution for the posterior predictive distribution:

$y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \text{cov}_b X^{* \top} + I)$

where $X^*$ is the matrix of new data we wish to predict, and $y^*$ are the predicted targets for those data.
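The conjugate update above can be sketched in NumPy (synthetic data; `sigma2`, `b_mean`, and `b_V` are the assumed hyperparameters):

```python
import numpy as np

# Illustrative synthetic data with known error variance
rng = np.random.default_rng(2)
N, M = 200, 2
X = rng.normal(size=(N, M))
b_true = np.array([1.0, -1.0])
sigma2 = 0.25  # known error variance
y = X @ b_true + rng.normal(scale=np.sqrt(sigma2), size=N)

b_mean = np.zeros(M)  # prior mean on b
b_V = np.eye(M)       # prior covariance scale

# Conjugate posterior update:
#   A     = (b_V^{-1} + X^T X)^{-1}
#   mu_b  = A b_V^{-1} b_mean + A X^T y
#   cov_b = sigma^2 A
b_V_inv = np.linalg.inv(b_V)
A = np.linalg.inv(b_V_inv + X.T @ X)
mu_b = A @ b_V_inv @ b_mean + A @ X.T @ y
cov_b = sigma2 * A

# Posterior predictive for 5 new inputs X_star
X_star = rng.normal(size=(5, M))
pred_mean = X_star @ mu_b
pred_cov = X_star @ cov_b @ X_star.T + np.eye(5)
```

With a zero prior mean and identity $b_V$, `mu_b` coincides with the ridge solution at $\alpha = 1$, which illustrates the connection noted above.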


#### Unknown variance

If both $b$ and the error variance $\sigma^2$ are unknown, the conjugate prior for the Gaussian likelihood is the Normal-Gamma distribution (univariate likelihood) or the Normal-Inverse-Wishart distribution (multivariate likelihood).

**Univariate**

$\begin{split}b, \sigma^2 &\sim \text{NG}(b_{mean}, b_{V}, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ b \mid \sigma^2 &\sim \mathcal{N}(b_{mean}, \sigma^2 b_{V})\end{split}$

where $\alpha, \beta, b_{V}$, and $b_{mean}$ are parameters of the prior.

**Multivariate**

$\begin{split}b, \Sigma &\sim \mathcal{NIW}(b_{mean}, \lambda, \Psi, \rho) \\ \Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\ b \mid \Sigma &\sim \mathcal{N}(b_{mean}, \frac{1}{\lambda} \Sigma)\end{split}$

where $b_{mean}, \lambda, \Psi$, and $\rho$ are parameters of the prior.

Due to the conjugacy of the above priors with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters:

$\begin{split}B &= y - X b_{mean} \\ \text{shape} &= N + \alpha \\ \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X b_V X^\top + I)^{-1} B) \\\end{split}$

so that

$\begin{split}\sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\ A &= (b_V^{-1} + X^\top X)^{-1} \\ \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\ \text{cov}_b &= \sigma^2 A\end{split}$

The model posterior is then

$b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \text{cov}_b)$

We can also compute a closed-form solution for the posterior predictive distribution:

$y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{* \top} + I)$
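The univariate update can be sketched in NumPy as follows (synthetic data; the prior hyperparameters are illustrative, and $b$ is conditioned on a simple point estimate of $\sigma^2$ rather than integrating it out, which would yield a Student-t posterior):

```python
import numpy as np

# Illustrative synthetic data with unknown error variance
rng = np.random.default_rng(3)
N, M = 100, 2
X = rng.normal(size=(N, M))
b_true = np.array([0.5, 2.0])
y = X @ b_true + rng.normal(scale=1.0, size=N)

# Normal-Gamma prior hyperparameters (illustrative values)
alpha, beta = 1.0, 1.0
b_mean = np.zeros(M)
b_V = np.eye(M)

# Posterior over sigma^2: InverseGamma(shape, scale)
B = y - X @ b_mean
shape = N + alpha
scale = (alpha * beta + B @ np.linalg.solve(X @ b_V @ X.T + np.eye(N), B)) / shape

# Posterior over b, plugging in the posterior scale as a rough
# point estimate of sigma^2 (an assumption for this sketch)
b_V_inv = np.linalg.inv(b_V)
A = np.linalg.inv(b_V_inv + X.T @ X)
mu_b = A @ b_V_inv @ b_mean + A @ X.T @ y
cov_b = scale * A
```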
