Linear models

The linear models module contains several popular instances of the generalized linear model (GLM).

Linear Regression

The simple linear regression model is

\[\mathbf{y} = \mathbf{Xb} + \mathbf{\epsilon}\]

where

\[\epsilon \sim \mathcal{N}(0, \sigma^2 I)\]

In probabilistic terms this corresponds to

\[\begin{split}\mathbf{y} - \mathbf{Xb} &\sim \mathcal{N}(0, \sigma^2 I) \\ \mathbf{y} \mid \mathbf{X}, \mathbf{b} &\sim \mathcal{N}(\mathbf{Xb}, \sigma^2 I)\end{split}\]

The loss for the model is simply the squared error between the model predictions and the true values:

\[\mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2\]

The maximum likelihood estimate (MLE) for the model parameters \(\mathbf{b}\) can be computed in closed form via the normal equation:

\[\mathbf{b}_{MLE} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\]

where \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) is the Moore-Penrose pseudoinverse of \(\mathbf{X}\) (assuming \(\mathbf{X}\) has full column rank).
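As a concrete illustration, here is a minimal NumPy sketch of the normal-equation solution. The array names X, y, and b_mle are ours for illustration, not the library's API, and the data are synthetic:

import numpy as np

# Minimal sketch of the normal equation, assuming X is an (N, M) design matrix
# and y an (N,)-vector of targets; synthetic data stands in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
b_true = np.array([2.0, -1.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=100)

# b_MLE = (X^T X)^{-1} X^T y; np.linalg.pinv computes the Moore-Penrose pseudoinverse.
b_mle = np.linalg.pinv(X) @ y

In practice np.linalg.lstsq is numerically preferable; the pseudoinverse form is shown only to mirror the equation above.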

Models

Ridge Regression

Ridge regression uses the same simple linear regression model but adds a penalty on the L2-norm of the coefficients to the loss function. This is sometimes known as Tikhonov regularization.

In particular, the ridge model is still simply

\[\mathbf{y} = \mathbf{Xb} + \mathbf{\epsilon}\]

where

\[\epsilon \sim \mathcal{N}(0, \sigma^2 I)\]

except now the loss for the model is calculated as

\[\mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2 + \alpha ||\mathbf{b}||_2^2\]

The ridge estimate for the model parameters \(\mathbf{b}\) (equivalently, the MAP estimate under a zero-mean Gaussian prior on \(\mathbf{b}\)) can be computed in closed form via the adjusted normal equation:

\[\mathbf{b}_{ridge} = (\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top \mathbf{y}\]

where \((\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top\) is the Moore-Penrose pseudoinverse adjusted for the L2 penalty on the model coefficients.
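A hedged NumPy sketch of this update (the function name and signature are illustrative, not the library's API):

import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Illustrative only: solve (X^T X + alpha * I) b = X^T y for b."""
    M = X.shape[1]
    # Solving the linear system avoids forming the inverse explicitly,
    # but implements exactly the adjusted normal equation above.
    return np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)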

Models

Bayesian Linear Regression

In its general form, Bayesian linear regression extends the simple linear regression model by introducing priors on the model parameters \(b\) and/or the error variance \(\sigma^2\).

The introduction of a prior allows us to quantify the uncertainty in our parameter estimates for \(b\) by replacing the MLE point estimate of simple linear regression with a full posterior distribution, \(p(b \mid X, y, \sigma)\), obtained by applying Bayes' rule:

\[p(b \mid X, y) = \frac{ p(y \mid X, b) p(b \mid \sigma) }{p(y \mid X)}\]

We can also quantify the uncertainty in our predictions \(y^*\) for some new data \(X^*\) with the posterior predictive distribution:

\[p(y^* \mid X^*, X, y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) db\]

Depending on the choice of prior, it may be impossible to compute an analytic form for the posterior or posterior predictive distribution. In these cases, it is common to use approximations, either via MCMC or variational inference.
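For example, given draws of \(b\) from such an approximate posterior, the posterior predictive mean can be estimated by averaging the conditional predictions over the draws. A minimal sketch, assuming posterior_samples is an (S, M) array of draws produced elsewhere (e.g., by an MCMC sampler that is not shown here):

import numpy as np

def predictive_mean(X_star, posterior_samples):
    """Monte Carlo estimate of E[y* | X*, X, y] from posterior draws of b."""
    preds = X_star @ posterior_samples.T   # (N*, S): one prediction per draw
    return preds.mean(axis=1)              # average over the S draws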

Known variance


If we happen to already know the error variance \(\sigma^2\), the conjugate prior on \(b\) is Gaussian. A common parameterization is:

\[b | \sigma, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)\]

where \(b_{mean}\), \(\sigma\), and \(b_V\) are hyperparameters. Ridge regression is a special case of this model where \(b_{mean} = 0\), \(\sigma = 1\), and \(b_V = I\) (i.e., the prior on \(b\) is a zero-mean Gaussian with identity covariance).

Due to the conjugacy of the above prior with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters:

\[\begin{split}A &= (b_V^{-1} + X^\top X)^{-1} \\ \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\ \text{cov}_b &= \sigma^2 A \\\end{split}\]

The model posterior is then

\[b \mid X, y \sim \mathcal{N}(\mu_b, \text{cov}_b)\]

We can also compute a closed-form solution for the posterior predictive distribution:

\[y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \text{cov}_b X^{* \top} + \sigma^2 I)\]

where \(X^*\) is the matrix of new data points and \(y^*\) are the corresponding predicted targets.
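A NumPy sketch of these closed-form updates, with variable names mirroring the formulas above rather than the library's API:

import numpy as np

def posterior_known_variance(X, y, b_mean, b_V, sigma):
    """Posterior N(mu_b, cov_b) over b when the noise variance sigma^2 is known."""
    b_V_inv = np.linalg.inv(b_V)
    A = np.linalg.inv(b_V_inv + X.T @ X)
    mu_b = A @ b_V_inv @ b_mean + A @ X.T @ y
    cov_b = sigma ** 2 * A
    return mu_b, cov_b

def posterior_predictive(X_star, mu_b, cov_b, sigma):
    """Predictive mean and covariance for new inputs X_star."""
    mean = X_star @ mu_b
    # Observation-noise term follows from the N(0, sigma^2 I) noise model.
    cov = X_star @ cov_b @ X_star.T + sigma ** 2 * np.eye(X_star.shape[0])
    return mean, cov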

Models

Unknown variance


If both \(b\) and the error variance \(\sigma^2\) are unknown, the conjugate prior for the Gaussian likelihood is the Normal-Inverse-Gamma distribution (univariate likelihood) or the Normal-Inverse-Wishart distribution (multivariate likelihood).

Univariate

\[\begin{split}b, \sigma^2 &\sim \text{NIG}(b_{mean}, b_{V}, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ b \mid \sigma^2 &\sim \mathcal{N}(b_{mean}, \sigma^2 b_{V})\end{split}\]

where \(\alpha, \beta, b_{V}\), and \(b_{mean}\) are parameters of the prior.
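To make the generative story concrete, here is a hedged sketch of drawing from this prior with SciPy, assuming InverseGamma(\(\alpha, \beta\)) denotes the usual shape/scale parameterization (which matches scipy.stats.invgamma); the helper name is ours:

import numpy as np
from scipy import stats

def sample_nig_prior(b_mean, b_V, alpha, beta, rng=None):
    """Draw (b, sigma^2) from the Normal-Inverse-Gamma prior described above."""
    rng = np.random.default_rng(rng)
    sigma2 = stats.invgamma.rvs(a=alpha, scale=beta, random_state=rng)
    b = rng.multivariate_normal(b_mean, sigma2 * b_V)
    return b, sigma2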

Multivariate

\[\begin{split}b, \Sigma &\sim \mathcal{NIW}(b_{mean}, \lambda, \Psi, \rho) \\ \Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\ b \mid \Sigma &\sim \mathcal{N}(b_{mean}, \frac{1}{\lambda} \Sigma)\end{split}\]

where \(b_{mean}, \lambda, \Psi\), and \(\rho\) are parameters of the prior.
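Similarly, a hedged sketch of drawing from the Normal-Inverse-Wishart prior using scipy.stats.invwishart (note that SciPy requires the degrees of freedom \(\rho\) to be at least the dimension of \(b\); the helper name is ours):

import numpy as np
from scipy import stats

def sample_niw_prior(b_mean, lam, Psi, rho, rng=None):
    """Draw (b, Sigma) from the Normal-Inverse-Wishart prior described above."""
    rng = np.random.default_rng(rng)
    Sigma = stats.invwishart.rvs(df=rho, scale=Psi, random_state=rng)
    b = rng.multivariate_normal(b_mean, Sigma / lam)
    return b, Sigma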

Due to the conjugacy of the above priors with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters. For the univariate (Normal-Inverse-Gamma) model, define:

\[\begin{split}B &= y - X b_{mean} \\ \text{shape} &= N + \alpha \\ \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X b_V X^\top + I)^{-1} B)\end{split}\]

where \(N\) is the number of rows in \(X\). The posterior over the error variance is then

\[\sigma^2 \mid X, y \sim \text{InverseGamma}(\text{shape}, \text{scale})\]

and, defining

\[\begin{split}A &= (b_V^{-1} + X^\top X)^{-1} \\ \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\ \text{cov}_b &= \sigma^2 A\end{split}\]

the model posterior is

\[b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \text{cov}_b)\]
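A NumPy sketch of the univariate update. The names mirror the formulas above; plugging the InverseGamma posterior mean in for \(\sigma^2\) is our simplification, since the text leaves \(\sigma^2\) as a random variable:

import numpy as np

def posterior_unknown_variance(X, y, b_mean, b_V, alpha, beta):
    """Illustrative closed-form update for the univariate unknown-variance model."""
    N = X.shape[0]
    B = y - X @ b_mean
    shape = N + alpha
    scale = (alpha * beta + B @ np.linalg.inv(X @ b_V @ X.T + np.eye(N)) @ B) / shape

    # Assumption: use the InverseGamma(shape, scale) posterior mean as a point
    # estimate of sigma^2 (valid for shape > 1) to form the conditional posterior on b.
    sigma2 = scale / (shape - 1)
    b_V_inv = np.linalg.inv(b_V)
    A = np.linalg.inv(b_V_inv + X.T @ X)
    mu_b = A @ b_V_inv @ b_mean + A @ X.T @ y
    cov_b = sigma2 * A
    return (shape, scale), (mu_b, cov_b)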

We can also compute a closed-form solution for the posterior predictive distribution:

\[y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{* \top} + \sigma^2 I)\]

Models