# Loss functions

## CrossEntropy

class numpy_ml.neural_nets.losses.CrossEntropy

A cross-entropy loss.

Notes

For a one-hot target y and predicted class probabilities $$\hat{\mathbf{y}}$$, the cross entropy is

$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_i y_i \log \hat{y}_i$

static loss(y, y_pred)

Compute the cross-entropy (log) loss.

Notes

This method returns the sum (not the average!) of the losses for each sample.

Parameters:

- y (ndarray of shape (n, m)) – Class labels (one-hot with m possible classes) for each of n examples.
- y_pred (ndarray of shape (n, m)) – Probabilities of each of m classes for the n examples in the batch.

Returns:

- loss (float) – The sum of the cross-entropy across classes and examples.

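
The summed loss described above can be sketched in a few lines of NumPy. This is an illustration of the formula, not the library's implementation. It uses the standard negative log-likelihood sign convention, and the function name cross_entropy_sum and the small constant eps (added to avoid log(0)) are assumptions of this sketch.

```python
import numpy as np


def cross_entropy_sum(y, y_pred, eps=1e-12):
    """Summed cross-entropy for one-hot y and probabilities y_pred, both (n, m)."""
    # eps is an assumption of this sketch: it keeps log() finite when y_pred == 0
    return -np.sum(y * np.log(y_pred + eps))


# Two examples, three classes; true classes are 0 and 1
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])

# Only the probability assigned to the true class contributes to each row
loss = cross_entropy_sum(y, y_pred)
```

Because the result is a sum over examples, it grows with the batch size; divide by the number of rows if a per-example average is needed.
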
static grad(y, y_pred)

Compute the gradient of the cross entropy loss with regard to the softmax input, z.

Notes

The gradient for this method goes through both the cross-entropy loss AND the softmax non-linearity to return $$\frac{\partial \mathcal{L}}{\partial \mathbf{z}}$$ (rather than $$\frac{\partial \mathcal{L}}{\partial \text{softmax}(\mathbf{z})}$$).

In particular, let:

$\mathcal{L}(\mathbf{z}) = \text{cross_entropy}(\text{softmax}(\mathbf{z})).$

The current method computes:

$\begin{split}\frac{\partial \mathcal{L}}{\partial \mathbf{z}} &= \text{softmax}(\mathbf{z}) - \mathbf{y} \\ &= \hat{\mathbf{y}} - \mathbf{y}\end{split}$

Parameters:

- y (ndarray of shape (n, m)) – A one-hot encoding of the true class labels. Each row constitutes a training example, and each column is a different class.
- y_pred (ndarray of shape (n, m)) – The network predictions for the probability of each of m class labels on each of n examples in a batch.

Returns:

- grad (ndarray of shape (n, m)) – The gradient of the cross-entropy loss with respect to the input to the softmax function.
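
The identity above (gradient with respect to z equals softmax(z) minus y) can be checked numerically. The sketch below is self-contained NumPy and does not call the library. The helper names softmax and loss_from_logits are assumptions of this example, and the loss uses the negative log-likelihood sign convention.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def loss_from_logits(y, z):
    # Summed cross-entropy of softmax(z) against one-hot targets y
    return -np.sum(y * np.log(softmax(z)))


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))                 # logits: 4 examples, 3 classes
y = np.eye(3)[rng.integers(0, 3, size=4)]   # random one-hot targets

# Closed-form gradient from the derivation above
analytic = softmax(z) - y

# Central finite differences, one logit at a time
h = 1e-6
numeric = np.zeros_like(z)
for idx in np.ndindex(*z.shape):
    dz = np.zeros_like(z)
    dz[idx] = h
    numeric[idx] = (loss_from_logits(y, z + dz) - loss_from_logits(y, z - dz)) / (2 * h)

max_err = np.abs(analytic - numeric).max()
```
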

## SquaredError

class numpy_ml.neural_nets.losses.SquaredError

A squared-error / L2 loss.

Notes

For real-valued target y and predictions $$\hat{\mathbf{y}}$$, the squared error is

$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = 0.5 ||\hat{\mathbf{y}} - \mathbf{y}||_2^2$

static loss(y, y_pred)

Compute the squared error between y and y_pred.

Parameters:

- y (ndarray of shape (n, m)) – Ground truth values for each of n examples.
- y_pred (ndarray of shape (n, m)) – Predictions for the n examples in the batch.

Returns:

- loss (float) – The sum of the squared error across dimensions and examples.

static grad(y, y_pred, z, act_fn)

Gradient of the squared error loss with respect to the pre-nonlinearity input, z.

Notes

The current method computes the gradient $$\frac{\partial \mathcal{L}}{\partial \mathbf{z}}$$, where

$\begin{split}\mathcal{L}(\mathbf{z}) &= \text{squared_error}(\mathbf{y}, g(\mathbf{z})) \\ g(\mathbf{z}) &= \text{act_fn}(\mathbf{z})\end{split}$

The gradient with respect to $$\mathbf{z}$$ is then

$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = (g(\mathbf{z}) - \mathbf{y}) \left( \frac{\partial g}{\partial \mathbf{z}} \right)$

Parameters:

- y (ndarray of shape (n, m)) – Ground truth values for each of n examples.
- y_pred (ndarray of shape (n, m)) – Predictions for the n examples in the batch.
- z (ndarray of shape (n, m)) – The pre-nonlinearity input to act_fn.
- act_fn (Activation object) – The activation function for the output layer of the network.

Returns:

- grad (ndarray of shape (n, m)) – The gradient of the squared error loss with respect to z.
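
As an illustration of the chain rule above, the sketch below takes g to be the logistic sigmoid, whose elementwise derivative is g(z)(1 - g(z)), and compares the closed-form gradient with finite differences. The choice of sigmoid and the function names are assumptions of this example. It does not use the library's Activation objects.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def squared_error(y, y_pred):
    # 0.5 * squared L2 norm, summed over all dimensions and examples
    return 0.5 * np.sum((y_pred - y) ** 2)


def squared_error_grad(y, z):
    # (g(z) - y) * dg/dz, with dg/dz = g(z) * (1 - g(z)) for the sigmoid
    g = sigmoid(z)
    return (g - y) * g * (1.0 - g)


rng = np.random.default_rng(0)
z = rng.normal(size=(3, 2))
y = rng.uniform(size=(3, 2))

analytic = squared_error_grad(y, z)

# Central finite differences on L(z) = squared_error(y, sigmoid(z))
h = 1e-6
numeric = np.zeros_like(z)
for idx in np.ndindex(*z.shape):
    dz = np.zeros_like(z)
    dz[idx] = h
    numeric[idx] = (
        squared_error(y, sigmoid(z + dz)) - squared_error(y, sigmoid(z - dz))
    ) / (2 * h)

max_err = np.abs(analytic - numeric).max()
```
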

## NCELoss

class numpy_ml.neural_nets.losses.NCELoss(n_classes, noise_sampler, num_negative_samples, optimizer=None, init='glorot_uniform', subtract_log_label_prob=True)

A noise contrastive estimation (NCE) loss function.

Notes

Noise contrastive estimation is a candidate sampling method often used to reduce the computational challenge of training a softmax layer on problems with a large number of output classes. It proceeds by training a logistic regression model to discriminate between samples from the true data distribution and samples from an artificial noise distribution.

It can be shown that as the ratio of negative samples to data samples goes to infinity, the gradient of the NCE loss converges to the original softmax gradient.

For input data X, target labels targets, loss parameters W and b, and noise samples noise sampled from the noise distribution Q, the NCE loss is

$\text{NCE}(X, targets) = \text{cross_entropy}(\mathbf{y}_{targets}, \hat{\mathbf{y}}_{targets}) + \text{cross_entropy}(\mathbf{y}_{noise}, \hat{\mathbf{y}}_{noise})$

where

$\begin{split}\hat{\mathbf{y}}_{targets} &= \sigma(\mathbf{W}[targets] \mathbf{X} + \mathbf{b}[targets] - \log Q(targets)) \\ \hat{\mathbf{y}}_{noise} &= \sigma(\mathbf{W}[noise] \mathbf{X} + \mathbf{b}[noise] - \log Q(noise))\end{split}$

In the above equations, $$\sigma$$ is the logistic sigmoid function, and $$Q(x)$$ corresponds to the probability of the values in x under Q.
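
The equations above can be sketched directly in NumPy. This is a simplified illustration rather than the NCELoss class itself: it assumes X is a 2-D array of shape (n_ex, n_in), W has shape (n_classes, n_in), b has shape (n_classes,), Q is an explicit probability vector over classes, and the same noise samples are shared by every example in the batch. The name nce_loss_sketch and the eps constant are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def nce_loss_sketch(X, targets, noise, W, b, Q, eps=1e-12):
    # Logit of the true class for each example, shape (n_ex,)
    t_logits = np.sum(W[targets] * X, axis=1) + b[targets] - np.log(Q[targets])
    # Logits of the shared noise classes for each example, shape (n_ex, n_noise)
    n_logits = X @ W[noise].T + b[noise] - np.log(Q[noise])

    p_target = sigmoid(t_logits)
    p_noise = sigmoid(n_logits)

    # Binary cross-entropy: label 1 for true targets, label 0 for noise samples
    loss_targets = -np.sum(np.log(p_target + eps))
    loss_noise = -np.sum(np.log(1.0 - p_noise + eps))
    return loss_targets + loss_noise


rng = np.random.default_rng(0)
n_classes, n_in, n_ex, k = 50, 8, 4, 5

W = rng.normal(scale=0.1, size=(n_classes, n_in))
b = np.zeros(n_classes)
Q = np.full(n_classes, 1.0 / n_classes)      # uniform noise distribution
X = rng.normal(size=(n_ex, n_in))
targets = rng.integers(0, n_classes, size=n_ex)
noise = rng.choice(n_classes, size=k, p=Q)   # k negative samples drawn from Q

loss = nce_loss_sketch(X, targets, noise, W, b, Q)
```

Each example contributes one logistic term for its target and k terms for the noise classes, so the cost per example depends on k rather than on n_classes.
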

References

  Gutmann, M. & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS, 13: 297-304.
  Mnih, A. & Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. ICML, 29: 1751-1758.

Parameters:

- n_classes (int) – The total number of output classes in the model.
- noise_sampler (DiscreteSampler instance) – The negative sampler. Defines a distribution over all classes in the dataset.
- num_negative_samples (int) – The number of negative samples to draw for each target / batch of targets.
- init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
- optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
- subtract_log_label_prob (bool) – Whether to subtract the log of the probability of each label under the noise distribution from its respective logit. Set to False for negative sampling, True for NCE. Default is True.

Attributes:

- gradients (dict) – The accumulated parameter gradients.
- parameters (dict) – The loss parameter values.
- hyperparameters (dict) – The loss hyperparameter values.
- derived_variables (dict) – Useful intermediate values computed during the loss computation.

hyperparameters

freeze()

Freeze the loss parameters at their current values so they can no longer be updated.

unfreeze()

Unfreeze the loss parameters so they can be updated.

flush_gradients()

Erase all of the loss's derived variables and gradients.

update(cur_loss=None)

Update the loss parameters using the accrued gradients and optimizer. Flush all gradients once the update is complete.

loss(X, target, neg_samples=None, retain_derived=True)

Compute the NCE loss for a collection of inputs and associated targets.

Parameters:

- X (ndarray of shape (n_ex, n_c, n_in)) – Layer input. A minibatch of n_ex examples, where each example is an n_c by n_in matrix (e.g., the matrix of n_c context embeddings, each of dimensionality n_in, for a CBOW model).
- target (ndarray of shape (n_ex,)) – Integer indices of the target class(es) for each example in the minibatch (e.g., the target word id for an example in a CBOW model).
- neg_samples (ndarray of shape (num_negative_samples,) or None) – An optional array of negative samples to use during the loss calculation. These will be used instead of samples drawn from self.noise_sampler. Default is None.
- retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through with regard to this input. Default is True.

Returns:

- loss (float) – The NCE loss summed over the minibatch and samples.
- y_pred (ndarray of shape (n_ex, n_c)) – The network predictions for the conditional probability of each target given each context: entry (i, j) gives the predicted probability of target i under context vector j.

grad(retain_grads=True, update_params=True)

Compute the gradient of the NCE loss with regard to the inputs, weights, and biases.

Parameters:

- retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
- update_params (bool) – Whether to perform a single step of gradient descent on the layer weights and bias using the calculated gradients. If retain_grads is False, this option is ignored and the parameter gradients are not updated. Default is True.

Returns:

- dLdX (ndarray of shape (n_ex, n_in) or list of arrays) – The gradient of the loss with regard to the layer input(s) X.

## VAELoss

class numpy_ml.neural_nets.losses.VAELoss

The variational lower bound for a variational autoencoder with Bernoulli units.

Notes

The VLB is the sum of the binary cross entropy between the true input and the predicted output (the “reconstruction loss”) and the KL divergence between the learned variational distribution $$q$$ and the prior, $$p$$, assumed to be a unit Gaussian.

$\text{VAELoss} = \text{cross_entropy}(\mathbf{y}, \hat{\mathbf{y}}) + \mathbb{KL}[q \ || \ p]$

where $$\mathbb{KL}[q \ || \ p]$$ is the Kullback-Leibler divergence between the distributions $$q$$ and $$p$$.

References

  Kingma, D. P. & Welling, M. (2014). “Auto-encoding variational Bayes”. arXiv preprint arXiv:1312.6114. https://arxiv.org/pdf/1312.6114.pdf

static loss(y, y_pred, t_mean, t_log_var)

Variational lower bound for a Bernoulli VAE.

Parameters:

- y (ndarray of shape (n_ex, N)) – The original images.
- y_pred (ndarray of shape (n_ex, N)) – The VAE reconstruction of the images.
- t_mean (ndarray of shape (n_ex, T)) – Mean of the variational distribution $$q(t \mid x)$$.
- t_log_var (ndarray of shape (n_ex, T)) – Log of the variance vector of the variational distribution $$q(t \mid x)$$.

Returns:

- loss (float) – The VLB, averaged across the batch.

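
A minimal NumPy sketch of this quantity, assuming q is a Gaussian with diagonal covariance so that the KL divergence to the unit Gaussian prior has the closed form -0.5 * sum(1 + log_var - mean^2 - exp(log_var)). The function name vae_loss_sketch and the eps constant are assumptions of this example, not the library's code.

```python
import numpy as np


def vae_loss_sketch(y, y_pred, t_mean, t_log_var, eps=1e-12):
    # Reconstruction term: binary cross-entropy summed over the N pixels of each example
    rec = -np.sum(
        y * np.log(y_pred + eps) + (1.0 - y) * np.log(1.0 - y_pred + eps), axis=1
    )
    # KL[q || p] for a diagonal Gaussian q and a unit Gaussian prior p, per example
    kl = -0.5 * np.sum(1.0 + t_log_var - t_mean ** 2 - np.exp(t_log_var), axis=1)
    # Average the per-example totals across the batch
    return np.mean(rec + kl)


y = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])   # 2 examples, N = 3
y_pred = np.full_like(y, 0.5)                      # maximally uncertain reconstruction
t_mean = np.zeros((2, 2))                          # T = 2 latent dimensions
t_log_var = np.zeros((2, 2))                       # unit variance, so q equals the prior

# With q equal to the prior the KL term vanishes and only the BCE remains
loss = vae_loss_sketch(y, y_pred, t_mean, t_log_var)
```
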
static grad(y, y_pred, t_mean, t_log_var)

Compute the gradient of the VLB with regard to the network parameters.

Parameters:

- y (ndarray of shape (n_ex, N)) – The original images.
- y_pred (ndarray of shape (n_ex, N)) – The VAE reconstruction of the images.
- t_mean (ndarray of shape (n_ex, T)) – Mean of the variational distribution $$q(t \mid x)$$.
- t_log_var (ndarray of shape (n_ex, T)) – Log of the variance vector of the variational distribution $$q(t \mid x)$$.

Returns:

- dY_pred (ndarray of shape (n_ex, N)) – The gradient of the VLB with regard to y_pred.
- dLogVar (ndarray of shape (n_ex, T)) – The gradient of the VLB with regard to t_log_var.
- dMean (ndarray of shape (n_ex, T)) – The gradient of the VLB with regard to t_mean.

## WGAN_GPLoss

class numpy_ml.neural_nets.losses.WGAN_GPLoss(lambda_=10)

The loss function for a Wasserstein GAN [*] [†] with gradient penalty.

Notes

Assuming an optimal critic, minimizing this quantity with respect to the generator parameters corresponds to minimizing the Wasserstein-1 (earth-mover) distance between the fake and real data distributions.

The formula for the WGAN-GP critic loss is

$\begin{split}\text{WGANLoss} &= \sum_{x \in X_{real}} p(x) D(x) - \sum_{x' \in X_{fake}} p(x') D(x') \\ \text{WGANLossGP} &= \text{WGANLoss} + \lambda (||\nabla_{X_{interp}} D(X_{interp})||_2 - 1)^2\end{split}$

where

$\begin{split}X_{fake} &= \text{Generator}(\mathbf{z}) \\ X_{interp} &= \alpha X_{real} + (1 - \alpha) X_{fake} \\\end{split}$

and

$\begin{split}\mathbf{z} &\sim \mathcal{N}(0, \mathbb{1}) \\ \alpha &\sim \text{Uniform}(0, 1)\end{split}$
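
The interpolation and gradient-penalty terms defined above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: one alpha is drawn per example, the penalty is averaged over the minibatch, and the critic is a hypothetical linear function D(x) = x @ w so that its gradient with respect to x is simply w.

```python
import numpy as np


def interpolate(X_real, X_fake, rng):
    # One alpha ~ Uniform(0, 1) per example, broadcast across features
    alpha = rng.uniform(0.0, 1.0, size=(X_real.shape[0], 1))
    return alpha * X_real + (1.0 - alpha) * X_fake


def gradient_penalty(grad_interp, lambda_=10.0):
    # grad_interp: (n_ex, n_feats) gradient of the critic output w.r.t. X_interp
    norms = np.linalg.norm(grad_interp, axis=1)   # per-example L2 norm
    return lambda_ * np.mean((norms - 1.0) ** 2)


rng = np.random.default_rng(0)
X_real = rng.normal(size=(4, 2))
X_fake = rng.normal(size=(4, 2))
X_interp = interpolate(X_real, X_fake, rng)

# Toy linear critic (hypothetical): dD/dx = w for every interpolated example
w = np.array([3.0, 4.0])
grad_interp = np.tile(w, (X_interp.shape[0], 1))

# ||w||_2 = 5, so the penalty is 10 * (5 - 1)^2 = 160
gp = gradient_penalty(grad_interp)
```

The penalty is zero only when every per-example gradient has unit norm, which is how the term pushes the critic toward being 1-Lipschitz along the interpolated points.
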

References

 [*] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017) “Improved training of Wasserstein GANs” Advances in Neural Information Processing Systems, 31: 5769-5779.
 [†] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014) “Generative adversarial nets” Advances in Neural Information Processing Systems, 27: 2672-2680.

Parameters:

- lambda_ (float) – The gradient penalty coefficient. Default is 10.

loss(Y_fake, module, Y_real=None, gradInterp=None)

Computes the generator and critic loss using the WGAN-GP value function.

Parameters:

- Y_fake (ndarray of shape (n_ex,)) – The output of the critic for X_fake.
- module ({'C', 'G'}) – Whether to calculate the loss for the critic (‘C’) or the generator (‘G’). If calculating loss for the critic, Y_real and gradInterp must not be None.
- Y_real (ndarray of shape (n_ex,) or None) – The output of the critic for X_real. Default is None.
- gradInterp (ndarray of shape (n_ex, n_feats) or None) – The gradient of the critic output for X_interp with respect to X_interp. Default is None.

Returns:

- loss (float) – Depending on the setting for module, either the critic or generator loss, averaged over examples in the minibatch.

grad(Y_fake, module, Y_real=None, gradInterp=None)

Computes the gradient of the generator or critic loss with regard to its inputs.

Parameters:

- Y_fake (ndarray of shape (n_ex,)) – The output of the critic for X_fake.
- module ({'C', 'G'}) – Whether to calculate the gradient for the critic loss (‘C’) or the generator loss (‘G’). If calculating grads for the critic, Y_real and gradInterp must not be None.
- Y_real (ndarray of shape (n_ex,) or None) – The output of the critic for X_real. Default is None.
- gradInterp (ndarray of shape (n_ex, n_feats) or None) – The gradient of the critic output on X_interp with respect to X_interp. Default is None.

Returns:

- grads (tuple) – If module == ‘C’, returns a 3-tuple containing the gradient of the critic loss with regard to (Y_fake, Y_real, gradInterp). If module == ‘G’, returns the gradient of the generator with regard to Y_fake.