# Linear models

## Ordinary and Weighted Linear Least Squares

In weighted linear least-squares regression (WLS), a real-valued target
\(y_i\) is modeled as a linear combination of covariates
\(\mathbf{x}_i\) and model coefficients \(\mathbf{b}\):

\[y_i = \mathbf{b}^\top \mathbf{x}_i + \epsilon_i\]

In the above equation, \(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)\) is a normally distributed error term with variance \(\sigma_i^2\). Ordinary least squares (OLS) is a special case of this model where the variance is fixed across all examples, i.e., \(\sigma_i = \sigma_j \ \forall i,j\). The maximum likelihood model parameters, \(\hat{\mathbf{b}}_{WLS}\), are those that minimize the weighted squared error between the model predictions and the true values:

\[\mathcal{L} = (\mathbf{y} - \mathbf{X}\mathbf{b})^\top \mathbf{W} (\mathbf{y} - \mathbf{X}\mathbf{b})\]

where \(\mathbf{W}\) is a diagonal matrix of the example weights. In OLS, \(\mathbf{W}\) is the identity matrix. The maximum likelihood estimate for the model parameters can be computed in closed form using the normal equations:

\[\hat{\mathbf{b}}_{WLS} = (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{y}\]
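As a concrete illustration, the weighted normal equations can be evaluated directly with NumPy. This is a standalone sketch on synthetic data; the variable names are illustrative, not from any particular library:

```python
import numpy as np

# Synthetic toy data: n examples, d features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_b = np.array([2.0, -1.0, 0.5])
y = X @ true_b + rng.normal(scale=0.1, size=100)

# Per-example weights (W is diagonal); all-ones weights recover OLS.
w = np.ones(100)
W = np.diag(w)

# Normal equations: b_hat = (X^T W X)^{-1} X^T W y.
# Solving the linear system is preferred over forming an explicit inverse.
b_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

With unit weights the solution coincides with the ordinary least-squares estimate, e.g. the one returned by `np.linalg.lstsq`.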


## Ridge Regression

Ridge regression uses the same simple linear regression model but adds a penalty on the L2-norm of the coefficients to the loss function. This is sometimes known as Tikhonov regularization.

In particular, the ridge model is the same as the OLS model:

\[\mathbf{y} = \mathbf{X}\mathbf{b} + \epsilon\]

where \(\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), except now the error for the model is calculated as

\[\mathcal{L} = ||\mathbf{y} - \mathbf{X}\mathbf{b}||_2^2 + \alpha ||\mathbf{b}||_2^2\]

The optimal model parameters \(\mathbf{b}\) can be computed in closed form via the adjusted normal equation:

\[\hat{\mathbf{b}}_{Ridge} = (\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}\]

where \((\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top\) is the Moore-Penrose pseudoinverse adjusted for the L2 penalty on the model coefficients.
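A minimal NumPy sketch of the adjusted normal equation (synthetic data; the penalty strength `alpha` is an illustrative hyperparameter choice):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 3.0]) + rng.normal(scale=0.5, size=50)

alpha = 1.0  # L2 penalty strength (illustrative value)
d = X.shape[1]

# Adjusted normal equation: b_ridge = (X^T X + alpha * I)^{-1} X^T y.
b_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# For comparison: as alpha -> 0, the solution approaches the OLS estimate.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Any positive `alpha` shrinks the L2-norm of the coefficient vector relative to the OLS solution.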


## Bayesian Linear Regression

In its general form, Bayesian linear regression extends the simple linear
regression model by introducing priors on model parameters *b* and/or the
error variance \(\sigma^2\).

The introduction of a prior allows us to quantify the uncertainty in our
parameter estimates for \(b\) by replacing the MLE point estimate in simple
linear regression with an entire posterior *distribution*, \(p(b \mid X, y,
\sigma)\), computed simply by applying Bayes' rule:

\[p(b \mid X, y, \sigma) = \frac{p(y \mid X, b, \sigma) \ p(b \mid \sigma)}{p(y \mid X, \sigma)}\]

We can also quantify the uncertainty in our predictions \(y^*\) for some new data \(X^*\) with the posterior predictive distribution:

\[p(y^* \mid X^*, X, y) = \int_b p(y^* \mid X^*, b) \ p(b \mid X, y, \sigma) \ db\]

Depending on the choice of prior it may be impossible to compute an analytic form for the posterior / posterior predictive distribution. In these cases, it is common to use approximations, either via MCMC or variational inference.

#### Known variance

If we happen to already know the error variance \(\sigma^2\), the conjugate prior on \(b\) is Gaussian. A common parameterization is:

\[b \mid \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)\]

where \(\mu\), \(\sigma\), and \(V\) are hyperparameters. Ridge
regression is a special case of this model where \(\mu = \mathbf{0}\),
\(\sigma = 1\), and \(V = \mathbf{I}\) (i.e., the prior on \(b\) is a zero-mean,
unit covariance Gaussian).

Due to the conjugacy of the above prior with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters:

\[\begin{split}A &= (V^{-1} + X^\top X)^{-1} \\ \mu_b &= A (V^{-1} \mu + X^\top y)\end{split}\]

The model posterior is then

\[b \mid X, y, \sigma \sim \mathcal{N}(\mu_b, \sigma^2 A)\]

We can also compute a closed-form solution for the posterior predictive distribution:

\[y^* \mid X^*, X, y \sim \mathcal{N}\left(X^* \mu_b, \ \sigma^2 (X^* A {X^*}^\top + I)\right)\]

where \(X^*\) is the matrix of new data we wish to predict, and \(y^*\) are the predicted targets for those data.
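The standard conjugate-Gaussian updates can be written out directly in NumPy. This is a sketch on synthetic data with illustrative hyperparameter values, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=1.0, size=30)

# Known noise scale and prior hyperparameters (illustrative values).
sigma = 1.0
mu0 = np.zeros(2)   # prior mean on b
V = np.eye(2)       # prior covariance scale

# Closed-form posterior for the conjugate Gaussian prior:
#   A    = (V^{-1} + X^T X)^{-1}
#   mu_b = A (V^{-1} mu0 + X^T y)
#   b | X, y ~ N(mu_b, sigma^2 A)
V_inv = np.linalg.inv(V)
A = np.linalg.inv(V_inv + X.T @ X)
mu_b = A @ (V_inv @ mu0 + X.T @ y)
post_cov = sigma**2 * A

# Posterior predictive for new inputs X*: Gaussian with the moments above.
X_new = rng.normal(size=(5, 2))
pred_mean = X_new @ mu_b
pred_cov = sigma**2 * (X_new @ A @ X_new.T + np.eye(5))
```

Note that with \(\mu = 0\) and \(V = I\) the posterior mean reduces to the ridge estimate with \(\alpha = 1\), matching the special case described above.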


#### Unknown variance

If *both* *b* and the error variance \(\sigma^2\) are unknown, the
conjugate prior for the Gaussian likelihood is the Normal-Gamma
distribution (univariate likelihood) or the Normal-Inverse-Wishart
distribution (multivariate likelihood).

**Univariate**

\[\begin{split}b, \sigma^2 &\sim \text{NG}(\mu, V, \alpha, \beta) \\ \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\ b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)\end{split}\]

where \(\alpha, \beta, V\), and \(\mu\) are parameters of the prior.

**Multivariate**

\[\begin{split}b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\ \Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\ b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)\end{split}\]

where \(\mu, \lambda, \Psi\), and \(\rho\) are parameters of the prior.

Due to the conjugacy of the above priors with the Gaussian likelihood, there exists a closed-form solution for the posterior over the model parameters. In the univariate case, the model posterior is

\[b, \sigma^2 \mid X, y \sim \text{NG}(\mu_n, V_n, \alpha_n, \beta_n)\]

where

\[\begin{split}V_n &= (V^{-1} + X^\top X)^{-1} \\ \mu_n &= V_n (V^{-1} \mu + X^\top y) \\ \alpha_n &= \alpha + \frac{n}{2} \\ \beta_n &= \beta + \frac{1}{2}\left(\mu^\top V^{-1} \mu + y^\top y - \mu_n^\top V_n^{-1} \mu_n\right)\end{split}\]

and \(n\) is the number of training examples. We can also compute a closed-form solution for the posterior predictive distribution, which in the univariate case is a Student's \(t\) distribution:

\[y^* \mid X^*, X, y \sim t_{2\alpha_n}\left(X^* \mu_n, \ \frac{\beta_n}{\alpha_n}\left(I + X^* V_n {X^*}^\top\right)\right)\]

The multivariate (Normal-Inverse-Wishart) case admits analogous closed-form updates.
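The univariate conjugate updates can be sketched in NumPy as follows (synthetic data; the prior hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.5, size=n)

# Prior hyperparameters for the NG prior (illustrative values).
mu0 = np.zeros(d)
V = np.eye(d)
alpha0, beta0 = 1.0, 1.0

# Standard conjugate updates for the univariate-likelihood case:
V_inv = np.linalg.inv(V)
V_n = np.linalg.inv(V_inv + X.T @ X)
mu_n = V_n @ (V_inv @ mu0 + X.T @ y)
alpha_n = alpha0 + n / 2
beta_n = beta0 + 0.5 * (mu0 @ V_inv @ mu0 + y @ y
                        - mu_n @ np.linalg.inv(V_n) @ mu_n)

# Posterior mean of sigma^2 under InverseGamma(alpha_n, beta_n).
sigma2_mean = beta_n / (alpha_n - 1)
```

Here the posterior mean of \(\sigma^2\) should land near the true noise variance (0.25 in this synthetic setup), up to the pull of the prior.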


## Naive Bayes Classifier

The naive Bayes model assumes the features of a training example \(\mathbf{x}\) are mutually independent given the example label \(y\):

\[P(\mathbf{x}_i \mid y_i) = \prod_{j=1}^{M} P(x_{i,j} \mid y_i)\]

where \(M\) is the rank (i.e., the number of features) of the \(i^{th}\) example \(\mathbf{x}_i\) and \(y_i\) is the label associated with the \(i^{th}\) example.

Combining this conditional independence assumption with a simple application of Bayes' theorem gives the naive Bayes classification rule:

\[\begin{split}\hat{y} &= \arg\max_y P(y \mid \mathbf{x}) \\ &= \arg\max_y \ P(y) P(\mathbf{x} \mid y) \\ &= \arg\max_y \ P(y) \prod_{j=1}^{M} P(x_j \mid y)\end{split}\]

The prior class probability \(P(y)\) can be specified in advance or estimated empirically from the training data.
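A minimal Gaussian naive Bayes sketch of the rule above (the Gaussian per-feature likelihood is an illustrative choice; the text does not fix a particular feature distribution). Prediction is done in log space to avoid underflow:

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate class priors P(y) and per-class feature means/variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Return argmax_y [log P(y) + sum_j log P(x_j | y)] for each row of X."""
    # Gaussian log-density of each feature under each class, summed over j.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None]
                      + (X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    return classes[np.argmax(np.log(priors)[None] + log_lik, axis=1)]

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
params = fit_gnb(X_train, y_train)
preds = predict_gnb(X_train, *params)
```

The small constant added to the variances guards against zero variance when a feature is constant within a class.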


## Generalized Linear Model

The generalized linear model (GLM) assumes that each target/dependent variable
\(y_i\) in target vector \(\mathbf{y} = (y_1, \ldots, y_n)\), has been
drawn independently from a pre-specified distribution in the exponential family
with unknown mean \(\mu_i\). The GLM models a (one-to-one, continuous,
differentiable) function, *g*, of this mean value as a linear combination of
the model parameters \(\mathbf{b}\) and observed covariates,
\(\mathbf{x}_i\):

\[g(\mu_i) = \mathbf{b}^\top \mathbf{x}_i\]

where *g* is known as the link function. The choice of link function is
informed by the instance of the exponential family the target is drawn from.
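For example, a Poisson target with the canonical log link can be fit by gradient ascent on the log-likelihood. This is an illustrative sketch on synthetic data, not a reference fitting procedure (practical implementations typically use iteratively reweighted least squares):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
true_b = np.array([0.5, -0.3])
# Poisson GLM: g(mu_i) = log(mu_i) = x_i^T b, so mu_i = exp(x_i^T b).
y = rng.poisson(np.exp(X @ true_b))

b = np.zeros(2)
lr = 0.01
for _ in range(2000):
    mu = np.exp(X @ b)        # inverse link: mu_i = g^{-1}(x_i^T b)
    grad = X.T @ (y - mu)     # gradient of the Poisson log-likelihood in b
    b += lr * grad / len(y)
```

With the canonical link, the log-likelihood gradient takes the simple form \(X^\top(y - \mu)\), which is what the update above exploits.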
