Linear models

Ordinary and Weighted Linear Least Squares

In weighted linear least-squares regression (WLS), a real-valued target \(y_i\) is modeled as a linear combination of covariates \(\mathbf{x}_i\) and model coefficients \(\mathbf{b}\):

\[y_i = \mathbf{b}^\top \mathbf{x}_i + \epsilon_i\]

In the above equation, \(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)\) is a normally distributed error term with variance \(\sigma_i^2\). Ordinary least squares (OLS) is a special case of this model where the variance is fixed across all examples, i.e., \(\sigma_i = \sigma_j \ \forall i,j\). The maximum likelihood model parameters, \(\hat{\mathbf{b}}_{WLS}\), are those that minimize the weighted squared error between the model predictions and the true values:

\[\mathcal{L} = ||\mathbf{W}^{1/2}(\mathbf{y} - \mathbf{Xb})||_2^2\]

where \(\mathbf{W}\) is a diagonal matrix of the example weights. In OLS, \(\mathbf{W}\) is the identity matrix. The maximum likelihood estimate for the model parameters can be computed in closed-form using the normal equations:

\[\hat{\mathbf{b}}_{WLS} = (\mathbf{X}^\top \mathbf{WX})^{-1} \mathbf{X}^\top \mathbf{Wy}\]
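
As a concrete illustration, the normal equations can be evaluated directly with NumPy. The sketch below uses randomly generated example data (the variable names and data are purely illustrative), and solves the linear system rather than forming an explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))                        # design matrix (rows = examples)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)
w = rng.uniform(0.5, 2.0, size=N)                  # per-example weights

W = np.diag(w)
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (X^T W X)^{-1} X^T W y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)          # OLS: the special case W = I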

Ridge Regression

Ridge regression uses the same simple linear regression model but adds a penalty on the L2-norm of the coefficients to the loss function. This is sometimes known as Tikhonov regularization.

In particular, the ridge model is the same as the OLS model:

\[\mathbf{y} = \mathbf{Xb} + \mathbf{\epsilon}\]

where \(\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), except now the error for the model is calculated as

\[\mathcal{L} = ||\mathbf{y} - \mathbf{Xb}||_2^2 + \alpha ||\mathbf{b}||_2^2\]

The estimate for the model parameters \(\mathbf{b}\) that minimizes this penalized loss (equivalently, the MAP estimate under a zero-mean Gaussian prior on \(\mathbf{b}\)) can be computed in closed form via the adjusted normal equation:

\[\hat{\mathbf{b}}_{Ridge} = (\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\]

where \((\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top\) is the Moore-Penrose pseudoinverse adjusted for the L2 penalty on the model coefficients.
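
A minimal NumPy sketch of the adjusted normal equation; the data and the value of \(\alpha\) are illustrative:

import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
X = rng.normal(size=(N, M))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)

alpha = 1.0                                        # strength of the L2 penalty
b_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)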

Bayesian Linear Regression

In its general form, Bayesian linear regression extends the simple linear regression model by introducing priors on model parameters b and/or the error variance \(\sigma^2\).

The introduction of a prior allows us to quantify the uncertainty in our estimate of b: rather than the single MLE point estimate of simple linear regression, we obtain a full posterior distribution, \(p(b \mid X, y, \sigma)\), via Bayes' rule:

\[p(b \mid X, y, \sigma) = \frac{ p(y \mid X, b, \sigma) \, p(b \mid \sigma) }{p(y \mid X, \sigma)}\]

We can also quantify the uncertainty in our predictions \(y^*\) for some new data \(X^*\) with the posterior predictive distribution:

\[p(y^* \mid X^*, X, y) = \int_{b} p(y^* \mid X^*, b) \, p(b \mid X, y) \ \text{d}b\]

Depending on the choice of prior, it may be impossible to compute an analytic form for the posterior or posterior predictive distribution. In these cases, it is common to use approximations, either via MCMC or variational inference.
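
For example, given posterior draws of b (from MCMC or any other approximate inference scheme), the integral above can be approximated by Monte Carlo. The sketch below assumes a known noise variance sigma2 and an (S, M) array b_samples of posterior draws; both names are placeholders, not part of any library API:

import numpy as np

def posterior_predictive_mc(X_star, b_samples, sigma2, rng=None):
    # Approximate p(y* | X*, X, y) by averaging p(y* | X*, b) over posterior
    # draws of b: one noisy prediction per draw.
    rng = np.random.default_rng() if rng is None else rng
    means = b_samples @ X_star.T                   # (S, N*) predictive means
    noise = rng.normal(scale=np.sqrt(sigma2), size=means.shape)
    return means + noise                           # samples of y* for each row of X*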

Known variance

If the error variance \(\sigma^2\) is already known, the conjugate prior on b is Gaussian. A common parameterization is:

\[b \mid \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)\]

where \(\mu\), \(\sigma\) and \(V\) are hyperparameters. Ridge regression is a special case of this model where \(\mu = 0\), \(\sigma = 1\) and \(V = I\) (i.e., the prior on b is a zero-mean, unit covariance Gaussian, and the MAP estimate coincides with the ridge solution for \(\alpha = 1\)).

Due to the conjugacy of the above prior with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters:

\[\begin{split}A &= (V^{-1} + X^\top X)^{-1} \\ \mu_b &= A V^{-1} \mu + A X^\top y \\ \Sigma_b &= \sigma^2 A \\\end{split}\]

The model posterior is then

\[b \mid X, y \sim \mathcal{N}(\mu_b, \Sigma_b)\]

We can also compute a closed-form solution for the posterior predictive distribution:

\[y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \Sigma_b X^{* \top} + \sigma^2 I)\]

where \(X^*\) is the matrix of new data we wish to predict, and \(y^*\) are the predicted targets for those data.
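
A minimal NumPy sketch of these updates; mu, V, and sigma2 denote the prior mean, prior covariance factor, and known noise variance, and the function names are illustrative:

import numpy as np

def posterior_known_variance(X, y, mu, V, sigma2):
    # A = (V^{-1} + X^T X)^{-1},  mu_b = A V^{-1} mu + A X^T y,  Sigma_b = sigma^2 A
    V_inv = np.linalg.inv(V)
    A = np.linalg.inv(V_inv + X.T @ X)
    mu_b = A @ V_inv @ mu + A @ X.T @ y
    Sigma_b = sigma2 * A
    return mu_b, Sigma_b

def posterior_predictive(X_star, mu_b, Sigma_b, sigma2):
    # y* | X*, X, y ~ N(X* mu_b,  X* Sigma_b X*^T + sigma^2 I)
    mean = X_star @ mu_b
    cov = X_star @ Sigma_b @ X_star.T + sigma2 * np.eye(X_star.shape[0])
    return mean, cov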

Unknown variance

If both b and the error variance \(\sigma^2\) are unknown, the conjugate prior for the Gaussian likelihood is the Normal-Inverse-Gamma distribution (univariate likelihood) or the Normal-Inverse-Wishart distribution (multivariate likelihood).

Univariate

\[\begin{split}b, \sigma^2 &\sim \text{NIG}(\mu, V, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)\end{split}\]

where \(\alpha, \beta, V\), and \(\mu\) are parameters of the prior.

Multivariate

\[\begin{split}b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\ \Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\ b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)\end{split}\]

where \(\mu, \lambda, \Psi\), and \(\rho\) are parameters of the prior.

Due to the conjugacy of the above priors with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters:

\[\begin{split}B &= y - X \mu \\ \text{shape} &= N + \alpha \\ \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X V X^\top + I)^{-1} B) \\\end{split}\]

so that

\[\begin{split}\sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\ A &= (V^{-1} + X^\top X)^{-1} \\ \mu_b &= A V^{-1} \mu + A X^\top y \\ \Sigma_b &= \sigma^2 A\end{split}\]

The model posterior is then

\[b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \Sigma_b)\]

We can also compute a closed-form solution for the posterior predictive distribution:

\[y^* \mid X^*, X, y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{* \top} + \sigma^2 I)\]
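
A minimal NumPy sketch of the univariate unknown-variance updates above; mu, V, alpha, and beta are the prior hyperparameters, and a single inverse-gamma draw of sigma2 is used to form the conditional posterior over b (the function name is illustrative):

import numpy as np

def posterior_unknown_variance(X, y, mu, V, alpha, beta, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]

    B = y - X @ mu
    shape = N + alpha
    scale = (alpha * beta + B @ np.linalg.inv(X @ V @ X.T + np.eye(N)) @ B) / shape

    # sigma^2 | X, y ~ InverseGamma(shape, scale); one way to sample it is to
    # divide the scale parameter by a Gamma(shape, 1) draw.
    sigma2 = scale / rng.gamma(shape)

    # Conditional on sigma^2, the posterior over b is Gaussian with the same
    # A and mu_b as in the known-variance case.
    V_inv = np.linalg.inv(V)
    A = np.linalg.inv(V_inv + X.T @ X)
    mu_b = A @ V_inv @ mu + A @ X.T @ y
    Sigma_b = sigma2 * A
    return sigma2, mu_b, Sigma_b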

Naive Bayes Classifier

The naive Bayes model assumes the features of a training example \(\mathbf{x}_i\) are conditionally independent given the example label \(y_i\):

\[P(\mathbf{x}_i \mid y_i) = \prod_{j=1}^M P(x_{i,j} \mid y_i)\]

where \(M\) is the number of features in the \(i^{th}\) example \(\mathbf{x}_i\) and \(y_i\) is the label associated with the \(i^{th}\) example.

Combining this conditional independence assumption with a simple application of Bayes’ theorem gives the naive Bayes classification rule:

\[\begin{split}\hat{y} &= \arg \max_y P(y \mid \mathbf{x}) \\ &= \arg \max_y P(y) P(\mathbf{x} \mid y) \\ &= \arg \max_y P(y) \prod_{j=1}^M P(x_j \mid y)\end{split}\]

The prior class probability \(P(y)\) can be specified in advance or estimated empirically from the training data.
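
As one concrete instance, the sketch below assumes Gaussian class-conditional likelihoods \(P(x_j \mid y)\) (a common choice for continuous features) and estimates the class priors empirically from the training data; the function names are illustrative, not a library API:

import numpy as np

def fit_gaussian_nb(X, y):
    # Estimate P(y) and per-class feature means/variances from the data.
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X_new, classes, priors, means, variances):
    # Score each class with log P(y) + sum_j log N(x_j; mean_{y,j}, var_{y,j}).
    diff = X_new[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]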

Generalized Linear Model

The generalized linear model (GLM) assumes that each target/dependent variable \(y_i\) in the target vector \(\mathbf{y} = (y_1, \ldots, y_n)\) has been drawn independently from a pre-specified distribution in the exponential family with unknown mean \(\mu_i\). The GLM models a (one-to-one, continuous, differentiable) function, \(g\), of this mean value as a linear combination of the model parameters \(\mathbf{b}\) and observed covariates \(\mathbf{x}_i\):

\[g(\mathbb{E}[y_i \mid \mathbf{x}_i]) = g(\mu_i) = \mathbf{b}^\top \mathbf{x}_i\]

where \(g\) is known as the link function. The choice of link function is determined by the member of the exponential family the target is assumed to be drawn from (e.g., the log link for a Poisson target, the logit link for a Bernoulli target).
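
As one concrete example (a sketch of the general idea, not this document's fitting procedure), a Poisson GLM with the canonical log link, \(g(\mu) = \log \mu\), can be fit with iteratively reweighted least squares (IRLS):

import numpy as np

def fit_poisson_glm_irls(X, y, n_iter=25):
    # Iteratively reweighted least squares for g(mu) = log(mu) = Xb.
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ b                      # linear predictor
        mu = np.exp(eta)                 # inverse link
        W = mu                           # working weights (diagonal of W)
        z = eta + (y - mu) / mu          # working response
        b = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return b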
