Loss functions
CrossEntropy
class numpy_ml.neural_nets.losses.CrossEntropy
A cross-entropy loss.
Notes
For a one-hot target y and predicted class probabilities \(\hat{\mathbf{y}}\), the cross entropy is
\[\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_i y_i \log \hat{y}_i\]
static loss(y, y_pred)
Compute the cross-entropy (log) loss.
Notes
This method returns the sum (not the average!) of the losses for each sample.
Returns: loss (float) – The sum of the cross-entropy across classes and examples.
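As a rough illustration of the "sum, not average" behaviour noted above, here is a minimal NumPy sketch. It is not the library's implementation: the function name `cross_entropy_sum` and the `eps` guard against `log(0)` are assumptions made for this example, and it uses the usual negative log-likelihood convention (leading minus sign).

```python
import numpy as np

def cross_entropy_sum(y, y_pred):
    """Summed cross-entropy for one-hot targets y and class probabilities y_pred.

    Hypothetical helper for illustration only; not the numpy_ml API.
    """
    eps = np.finfo(float).eps  # assumption: small constant to avoid log(0)
    return -np.sum(y * np.log(y_pred + eps))

# Two examples, three classes
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])

total = cross_entropy_sum(y, y_pred)  # summed over classes and examples
mean = total / y.shape[0]             # divide by n to get a per-example average
```

Because the result is a sum, its magnitude grows with the batch size; dividing by the number of examples, as in the last line, gives a batch-size-independent value.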
static grad(y, y_pred)
Compute the gradient of the cross-entropy loss with respect to the softmax input, z.
Notes
The gradient for this method goes through both the cross-entropy loss AND the softmax non-linearity to return \(\frac{\partial \mathcal{L}}{\partial \mathbf{z}}\) (rather than \(\frac{\partial \mathcal{L}}{\partial \text{softmax}(\mathbf{z})}\)).
In particular, let:
\[\mathcal{L}(\mathbf{z}) = \text{cross_entropy}(\text{softmax}(\mathbf{z})).\]
The current method computes:
\[\begin{split}\frac{\partial \mathcal{L}}{\partial \mathbf{z}} &= \text{softmax}(\mathbf{z}) - \mathbf{y} \\ &= \hat{\mathbf{y}} - \mathbf{y}\end{split}\]
Returns: grad (ndarray of shape (n, m)) – The gradient of the cross-entropy loss with respect to the input to the softmax function.
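The identity above can be checked numerically. The following sketch (with hypothetical helper names `softmax`, `ce_sum`, and `ce_grad_wrt_z`, none of which are taken from the library) compares softmax(z) - y against a central finite-difference estimate of the summed cross-entropy.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_sum(y, z):
    # Summed cross-entropy of softmax(z) against one-hot targets y
    return -np.sum(y * np.log(softmax(z)))

def ce_grad_wrt_z(y, z):
    # Gradient through both the loss and the softmax: y_hat - y
    return softmax(z) - y

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))                 # n = 4 examples, m = 3 classes
y = np.eye(3)[rng.integers(0, 3, size=4)]   # random one-hot targets

analytic = ce_grad_wrt_z(y, z)

# Central finite differences, one entry of z at a time
numeric = np.zeros_like(z)
h = 1e-6
for idx in np.ndindex(*z.shape):
    zp, zm = z.copy(), z.copy()
    zp[idx] += h
    zm[idx] -= h
    numeric[idx] = (ce_sum(y, zp) - ce_sum(y, zm)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

Each row of the gradient sums to zero, since both the softmax output and the one-hot target sum to one.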
SquaredError
class numpy_ml.neural_nets.losses.SquaredError
A squared-error / L2 loss.
Notes
For real-valued target y and predictions \(\hat{\mathbf{y}}\), the squared error is
\[\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = 0.5 ||\hat{\mathbf{y}} - \mathbf{y}||_2^2\]
static loss(y, y_pred)
Compute the squared error between y and y_pred.
Returns: loss (float) – The sum of the squared error across dimensions and examples.
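A minimal sketch of the formula above, assuming `y` and `y_pred` are 2-D arrays of the same shape. The helper name `squared_error_sum` is hypothetical and this is not the library's code.

```python
import numpy as np

def squared_error_sum(y, y_pred):
    # 0.5 * squared L2 (Frobenius) norm of the residual, summed over
    # all dimensions and examples
    return 0.5 * np.linalg.norm(y_pred - y) ** 2

y = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.5, 2.0], [2.0, 4.0]])

loss = squared_error_sum(y, y_pred)  # 0.5 * (0.5**2 + 1.0**2)
```

The factor of 0.5 cancels the 2 produced by differentiating the square, which keeps the gradient expression free of constants.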
static grad(y, y_pred, z, act_fn)
Gradient of the squared error loss with respect to the pre-nonlinearity input, z.
Notes
The current method computes the gradient \(\frac{\partial \mathcal{L}}{\partial \mathbf{z}}\), where
\[\begin{split}\mathcal{L}(\mathbf{z}) &= \text{squared_error}(\mathbf{y}, g(\mathbf{z})) \\ g(\mathbf{z}) &= \text{act_fn}(\mathbf{z})\end{split}\]
The gradient with respect to \(\mathbf{z}\) is then
\[\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = (g(\mathbf{z}) - \mathbf{y}) \left( \frac{\partial g}{\partial \mathbf{z}} \right)\]
Parameters:
- y (ndarray of shape (n, m)) – Ground truth values for each of n examples.
- y_pred (ndarray of shape (n, m)) – Predictions for the n examples in the batch.
- act_fn (Activation object) – The activation function for the output layer of the network.
Returns: grad (ndarray of shape (n, m)) – The gradient of the squared error loss with respect to z.
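To make the chain rule concrete, the sketch below takes g to be an elementwise logistic sigmoid (an assumption for this example; the library accepts any Activation object) and checks the analytic gradient against finite differences. The helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Elementwise derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def squared_error_grad_wrt_z(y, z):
    # (g(z) - y) * g'(z), with the product taken elementwise
    return (sigmoid(z) - y) * sigmoid_grad(z)

rng = np.random.default_rng(1)
z = rng.normal(size=(3, 2))
y = rng.uniform(size=(3, 2))

analytic = squared_error_grad_wrt_z(y, z)

def loss_of_z(z_):
    return 0.5 * np.sum((sigmoid(z_) - y) ** 2)

# Central finite differences as an independent check
numeric = np.zeros_like(z)
h = 1e-6
for idx in np.ndindex(*z.shape):
    zp, zm = z.copy(), z.copy()
    zp[idx] += h
    zm[idx] -= h
    numeric[idx] = (loss_of_z(zp) - loss_of_z(zm)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

For an elementwise activation the Jacobian of g is diagonal, so the product in the formula reduces to the elementwise multiplication used here.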
NCELoss
class numpy_ml.neural_nets.losses.NCELoss(n_classes, noise_sampler, num_negative_samples, optimizer=None, init='glorot_uniform', subtract_log_label_prob=True)
A noise contrastive estimation (NCE) loss function.
Notes
Noise contrastive estimation is a candidate sampling method often used to reduce the computational challenge of training a softmax layer on problems with a large number of output classes. It proceeds by training a logistic regression model to discriminate between samples from the true data distribution and samples from an artificial noise distribution.
It can be shown that as the ratio of negative samples to data samples goes to infinity, the gradient of the NCE loss converges to the original softmax gradient.
For input data X, target labels targets, loss parameters W and b, and noise samples noise sampled from the noise distribution Q, the NCE loss is
\[\text{NCE}(X, targets) = \text{cross_entropy}(\mathbf{y}_{targets}, \hat{\mathbf{y}}_{targets}) + \text{cross_entropy}(\mathbf{y}_{noise}, \hat{\mathbf{y}}_{noise})\]
where
\[\begin{split}\hat{\mathbf{y}}_{targets} &= \sigma(\mathbf{W}[targets] \mathbf{X} + \mathbf{b}[targets] - \log Q(targets)) \\ \hat{\mathbf{y}}_{noise} &= \sigma(\mathbf{W}[noise] \mathbf{X} + \mathbf{b}[noise] - \log Q(noise))\end{split}\]
In the above equations, \(\sigma\) is the logistic sigmoid function, and \(Q(x)\) corresponds to the probability of the values in x under Q.
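The equations above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the library's implementation: it assumes one input vector per example (X of shape (n_ex, n_in)), a single set of noise samples shared across the batch, binary labels of 1 for targets and 0 for noise, and a hypothetical `subtract_log_q` flag that toggles the log Q correction.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss_sketch(X, targets, noise, W, b, Q, subtract_log_q=True):
    """X: (n_ex, n_in); targets: (n_ex,); noise: (n_noise,);
    W: (n_classes, n_in); b and Q: (n_classes,). Hypothetical helper."""
    eps = np.finfo(float).eps  # assumption: guard against log(0)
    # Logit of the true class for each example: shape (n_ex,)
    logit_t = np.sum(W[targets] * X, axis=1) + b[targets]
    # Logits for every (example, noise class) pair: shape (n_ex, n_noise)
    logit_n = X @ W[noise].T + b[noise]
    if subtract_log_q:
        logit_t = logit_t - np.log(Q[targets])
        logit_n = logit_n - np.log(Q[noise])
    p_t = sigmoid(logit_t)
    p_n = sigmoid(logit_n)
    # Binary cross-entropy: label 1 for targets, label 0 for noise samples
    return -np.sum(np.log(p_t + eps)) - np.sum(np.log(1.0 - p_n + eps))

rng = np.random.default_rng(0)
n_classes, n_in, n_ex, n_noise = 10, 4, 5, 3
Q = np.full(n_classes, 1.0 / n_classes)          # uniform noise distribution
X = rng.normal(size=(n_ex, n_in))
targets = rng.integers(0, n_classes, size=n_ex)
noise = rng.choice(n_classes, size=n_noise, p=Q)  # draw negatives from Q
W = 0.1 * rng.normal(size=(n_classes, n_in))
b = np.zeros(n_classes)

loss = nce_loss_sketch(X, targets, noise, W, b, Q)
# With all-zero parameters the logits reduce to -log Q (or 0 without the correction)
loss_zero = nce_loss_sketch(X, targets, noise, np.zeros_like(W), b, Q)
loss_no_q = nce_loss_sketch(X, targets, noise, np.zeros_like(W), b, Q,
                            subtract_log_q=False)
```

Only the rows of W and b indexed by the targets and the noise samples are touched, which is the source of the computational savings relative to a full softmax over all classes.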
References
[1] Gutmann, M. & Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS, 13: 297-304.
[2] Mnih, A. & Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. ICML, 29: 1751-1758.
Parameters:
- n_classes (int) – The total number of output classes in the model.
- noise_sampler (DiscreteSampler instance) – The negative sampler. Defines a distribution over all classes in the dataset.
- num_negative_samples (int) – The number of negative samples to draw for each target / batch of targets.
- init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
- optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
- subtract_log_label_prob (bool) – Whether to subtract the log of the probability of each label under the noise distribution from its respective logit. Set to False for negative sampling, True for NCE. Default is True.
Variables: - gradients (dict) – The accumulated parameter gradients.
- parameters (dict) – The loss parameter values.
- hyperparameters (dict) – The loss hyperparameter values.
- derived_variables (dict) – Useful intermediate values computed during the loss computation.
freeze()
Freeze the loss parameters at their current values so they can no longer be updated.
update(cur_loss=None)
Update the loss parameters using the accrued gradients and optimizer. Flush all gradients once the update is complete.
loss(X, target, neg_samples=None, retain_derived=True)
Compute the NCE loss for a collection of inputs and associated targets.
Parameters:
- X (ndarray of shape (n_ex, n_c, n_in)) – Layer input. A minibatch of n_ex examples, where each example is an n_c by n_in matrix (e.g., the matrix of n_c context embeddings, each of dimensionality n_in, for a CBOW model).
- target (ndarray of shape (n_ex,)) – Integer indices of the target class(es) for each example in the minibatch (e.g., the target word id for an example in a CBOW model).
- neg_samples (ndarray of shape (num_negative_samples,) or None) – An optional array of negative samples to use during the loss calculation. These will be used instead of samples drawn from self.noise_sampler. Default is None.
- retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, the layer is not expected to backpropagate through this input. Default is True.
Returns:
- loss (float) – The NCE loss summed over the minibatch and samples.
- y_pred (ndarray of shape (n_ex, n_c)) – The network predictions for the conditional probability of each target given each context: entry (i, j) gives the predicted probability of target i under context vector j.
grad(retain_grads=True, update_params=True)
Compute the gradient of the NCE loss with respect to the inputs, weights, and biases.
Parameters: - retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
- update_params (bool) – Whether to perform a single step of gradient descent on the layer weights and bias using the calculated gradients. If retain_grads is False, this option is ignored and the parameter gradients are not updated. Default is True.
Returns: dLdX (ndarray of shape (n_ex, n_in) or list of arrays) – The gradient of the loss with respect to the layer input(s) X.
VAELoss
class numpy_ml.neural_nets.losses.VAELoss
The variational lower bound for a variational autoencoder with Bernoulli units.
Notes
The VLB is the sum of the binary cross entropy between the true input and the predicted output (the “reconstruction loss”) and the KL divergence between the learned variational distribution \(q\) and the prior, \(p\), assumed to be a unit Gaussian.
\[\text{VAELoss} = \text{cross_entropy}(\mathbf{y}, \hat{\mathbf{y}}) + \mathbb{KL}[q \ || \ p]\]
where \(\mathbb{KL}[q \ || \ p]\) is the Kullback-Leibler divergence between the distributions \(q\) and \(p\).
References
[1] Kingma, D. P. & Welling, M. (2014). “Auto-encoding variational Bayes”. arXiv preprint arXiv:1312.6114. https://arxiv.org/pdf/1312.6114.pdf
static loss(y, y_pred, t_mean, t_log_var)
Variational lower bound for a Bernoulli VAE.
Parameters:
- y (ndarray of shape (n_ex, N)) – The original images.
- y_pred (ndarray of shape (n_ex, N)) – The VAE reconstruction of the images.
- t_mean (ndarray of shape (n_ex, T)) – Mean of the variational distribution \(q(t \mid x)\).
- t_log_var (ndarray of shape (n_ex, T)) – Log of the variance vector of the variational distribution \(q(t \mid x)\).
Returns: loss (float) – The VLB, averaged across the batch.
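The two terms of the loss can be written out directly. The sketch below is an illustration under stated assumptions, not the library's code: it assumes a diagonal Gaussian q with mean `t_mean` and log-variance `t_log_var`, a unit Gaussian prior (for which the KL term has the closed form used here), and clipping of `y_pred` to avoid `log(0)`. The function name `vae_loss_sketch` is hypothetical.

```python
import numpy as np

def vae_loss_sketch(y, y_pred, t_mean, t_log_var):
    eps = np.finfo(float).eps
    y_pred = np.clip(y_pred, eps, 1 - eps)  # assumption: guard against log(0)
    # Reconstruction term: binary cross-entropy summed over the N pixels
    rec = -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred), axis=1)
    # KL[q || p] for diagonal Gaussian q and unit Gaussian p, summed over T dims
    kl = -0.5 * np.sum(1 + t_log_var - t_mean ** 2 - np.exp(t_log_var), axis=1)
    # Average the per-example totals across the batch
    return np.mean(rec + kl)

y = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
zeros = np.zeros((2, 3))

# q equal to the prior (zero mean, unit variance): the KL term vanishes
rec_only = vae_loss_sketch(y, y_pred, zeros, zeros)
# Shifting the mean to 1 in each of the 3 latent dims adds 0.5 * 3 = 1.5
with_kl = vae_loss_sketch(y, y_pred, np.ones((2, 3)), zeros)
```

The KL term acts as a regularizer: it is zero when q matches the prior and grows as the variational mean or variance drifts away from it.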
static grad(y, y_pred, t_mean, t_log_var)
Compute the gradient of the VLB with respect to the network parameters.
Parameters:
- y (ndarray of shape (n_ex, N)) – The original images.
- y_pred (ndarray of shape (n_ex, N)) – The VAE reconstruction of the images.
- t_mean (ndarray of shape (n_ex, T)) – Mean of the variational distribution \(q(t \mid x)\).
- t_log_var (ndarray of shape (n_ex, T)) – Log of the variance vector of the variational distribution \(q(t \mid x)\).
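As a hedged illustration, the sketch below derives gradients of a batch-averaged Bernoulli VLB with respect to `y_pred`, `t_mean`, and `t_log_var`, and checks them with finite differences. Which quantities the library method actually returns is not stated here, so the choice of these three inputs, the loss definition `vlb`, and the helper names are all assumptions of this example.

```python
import numpy as np

def vlb(y, y_pred, t_mean, t_log_var):
    # Binary cross-entropy plus closed-form Gaussian KL, averaged over the batch
    rec = -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred), axis=1)
    kl = -0.5 * np.sum(1 + t_log_var - t_mean ** 2 - np.exp(t_log_var), axis=1)
    return np.mean(rec + kl)

def vlb_grads(y, y_pred, t_mean, t_log_var):
    n_ex = y.shape[0]
    d_y_pred = ((1 - y) / (1 - y_pred) - y / y_pred) / n_ex
    d_mean = t_mean / n_ex
    d_log_var = 0.5 * (np.exp(t_log_var) - 1) / n_ex
    return d_y_pred, d_mean, d_log_var

def numeric_grad(f, x, h=1e-6):
    # Central finite differences, one entry at a time
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp, xm = x.copy(), x.copy()
        xp[idx] += h
        xm[idx] -= h
        g[idx] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=(3, 4)).astype(float)
y_pred = rng.uniform(0.1, 0.9, size=(3, 4))
t_mean = rng.normal(size=(3, 2))
t_log_var = rng.normal(size=(3, 2))

d_y_pred, d_mean, d_log_var = vlb_grads(y, y_pred, t_mean, t_log_var)

err_y_pred = np.max(np.abs(
    d_y_pred - numeric_grad(lambda v: vlb(y, v, t_mean, t_log_var), y_pred)))
err_mean = np.max(np.abs(
    d_mean - numeric_grad(lambda v: vlb(y, y_pred, v, t_log_var), t_mean)))
err_log_var = np.max(np.abs(
    d_log_var - numeric_grad(lambda v: vlb(y, y_pred, t_mean, v), t_log_var)))
```

The gradients with respect to `t_mean` and `t_log_var` come only from the KL term in this sketch; in a full VAE the reconstruction term also contributes to them through the sampled latent code.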
WGAN_GPLoss
class numpy_ml.neural_nets.losses.WGAN_GPLoss(lambda_=10)
The loss function for a Wasserstein GAN [*] [†] with gradient penalty.
Notes
Assuming an optimal critic, minimizing this quantity wrt. the generator parameters corresponds to minimizing the Wasserstein-1 (earth-mover) distance between the fake and real data distributions.
The formula for the WGAN-GP critic loss is
\[\begin{split}\text{WGANLoss} &= \sum_{x \in X_{real}} p(x) D(x) - \sum_{x' \in X_{fake}} p(x') D(x') \\ \text{WGANLossGP} &= \text{WGANLoss} + \lambda (||\nabla_{X_{interp}} D(X_{interp})||_2 - 1)^2\end{split}\]
where
\[\begin{split}X_{fake} &= \text{Generator}(\mathbf{z}) \\ X_{interp} &= \alpha X_{real} + (1 - \alpha) X_{fake}\end{split}\]
and
\[\begin{split}\mathbf{z} &\sim \mathcal{N}(0, \mathbb{1}) \\ \alpha &\sim \text{Uniform}(0, 1)\end{split}\]
References
[*] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017) “Improved training of Wasserstein GANs” Advances in Neural Information Processing Systems, 31: 5769-5779.
[†] Goodfellow, I. J, Abadie, P. A., Mirza, M., Xu, B., Farley, D. W., Ozair, S., Courville, A., & Bengio, Y. (2014) “Generative adversarial nets” Advances in Neural Information Processing Systems, 27: 2672-2680.
Parameters: lambda_ (float) – The gradient penalty coefficient. Default is 10.
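The interpolation and gradient-penalty terms in the Notes can be sketched with a toy critic. Everything here is an assumption made for illustration: the critic is linear, D(x) = x · w, so its gradient with respect to any input is simply w; the fake samples are random draws standing in for Generator(z); expectations are replaced by minibatch means; and the WGANLoss term follows the formula as written in this section (sign conventions for the critic objective vary between presentations).

```python
import numpy as np

rng = np.random.default_rng(3)
n_ex, n_feats, lambda_ = 8, 5, 10.0

X_real = rng.normal(loc=1.0, size=(n_ex, n_feats))
X_fake = rng.normal(loc=-1.0, size=(n_ex, n_feats))  # stand-in for Generator(z)

# One alpha ~ Uniform(0, 1) per example, broadcast across features
alpha = rng.uniform(size=(n_ex, 1))
X_interp = alpha * X_real + (1 - alpha) * X_fake

# Toy linear critic: D(x) = x . w, hence dD/dx = w at every x
w = rng.normal(size=n_feats)

def critic(X):
    return X @ w

grad_interp = np.tile(w, (n_ex, 1))  # gradient of D at each interpolated point

# Penalize deviation of the per-example gradient norm from 1
grad_norm = np.linalg.norm(grad_interp, axis=1)
penalty = lambda_ * np.mean((grad_norm - 1.0) ** 2)

# WGANLoss term as written above (real minus fake), using minibatch means
wgan = critic(X_real).mean() - critic(X_fake).mean()
wgan_gp = wgan + penalty
```

With a real neural-network critic the gradient at X_interp would come from backpropagation rather than a closed form, which is presumably why the methods below take it as the separate `gradInterp` argument.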
loss(Y_fake, module, Y_real=None, gradInterp=None)
Compute the generator or critic loss using the WGAN-GP value function.
Parameters:
- Y_fake (ndarray of shape (n_ex,)) – The output of the critic for X_fake.
- module ({'C', 'G'}) – Whether to calculate the loss for the critic (‘C’) or the generator (‘G’). If calculating loss for the critic, Y_real and gradInterp must not be None.
- Y_real (ndarray of shape (n_ex,) or None) – The output of the critic for X_real. Default is None.
- gradInterp (ndarray of shape (n_ex, n_feats) or None) – The gradient of the critic output for X_interp wrt. X_interp. Default is None.
Returns: loss (float) – Depending on the setting for module, either the critic or generator loss, averaged over examples in the minibatch.
grad(Y_fake, module, Y_real=None, gradInterp=None)
Compute the gradient of the generator or critic loss with respect to its inputs.
Parameters:
- Y_fake (ndarray of shape (n_ex,)) – The output of the critic for X_fake.
- module ({'C', 'G'}) – Whether to calculate the gradient for the critic loss (‘C’) or the generator loss (‘G’). If calculating grads for the critic, Y_real and gradInterp must not be None.
- Y_real (ndarray of shape (n_ex,) or None) – The output of the critic for X_real. Default is None.
- gradInterp (ndarray of shape (n_ex, n_feats) or None) – The gradient of the critic output on X_interp wrt. X_interp. Default is None.
Returns: grads (tuple) – If module == ‘C’, returns a 3-tuple containing the gradient of the critic loss with respect to (Y_fake, Y_real, gradInterp). If module == ‘G’, returns the gradient of the generator with respect to Y_fake.