Full networks

WGAN_GP

class numpy_ml.neural_nets.models.WGAN_GP(g_hidden=512, init='he_uniform', optimizer='RMSProp(lr=0.0001)', debug=False)[source]

A Wasserstein generative adversarial network (WGAN) architecture with gradient penalty (GP).

Notes

In contrast to a regular WGAN, WGAN-GP uses a gradient penalty on the critic rather than weight clipping to encourage the 1-Lipschitz constraint:

\[| \text{Critic}(\mathbf{x}_1) - \text{Critic}(\mathbf{x}_2) | \leq |\mathbf{x}_1 - \mathbf{x}_2 | \ \ \ \ \forall \mathbf{x}_1, \mathbf{x}_2\]

In other words, the critic must have input gradients with a norm of at most 1 under the \(\mathbf{X}_{real}\) and \(\mathbf{X}_{fake}\) data distributions.

To enforce this constraint, WGAN-GP penalizes the model if the norm of the critic's input gradients moves away from a target norm of 1. See WGAN_GPLoss for more details.
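
As a rough illustration of the penalized quantity, the following NumPy sketch computes the penalty term for a toy linear critic. It is illustrative only: the gradient_penalty helper, the lambda_ argument name, and all shapes are hypothetical and are not the library's WGAN_GPLoss implementation.

import numpy as np

def gradient_penalty(grad_wrt_interp, lambda_=10.0):
    # Penalty term: lambda * E[(||dC/dx at x_interp||_2 - 1)^2]
    norms = np.linalg.norm(grad_wrt_interp, axis=1)
    return lambda_ * np.mean((norms - 1) ** 2)

rng = np.random.RandomState(0)
X_real, X_fake = rng.randn(128, 16), rng.randn(128, 16)

# Sample points uniformly along lines between real and fake examples; in a real
# model the critic's input gradient would be evaluated at X_interp via backprop
alpha = rng.rand(128, 1)
X_interp = alpha * X_real + (1 - alpha) * X_fake

# Toy linear critic C(x) = x @ w has input gradient w for every example, so the
# gradient at each interpolate can be written down analytically for this demo
w = rng.randn(16)
grad_interp = np.tile(w, (128, 1))
print(gradient_penalty(grad_interp, lambda_=10.0))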

In contrast to a standard WGAN, WGAN-GP avoids using BatchNorm in the critic, as correlation between samples in a batch can impact the stability of the gradient penalty.

WGAN-GP architecture:

X_real ------------------------|
                                >---> [Critic] --> Y_out
Z --> [Generator] --> X_fake --|

where [Generator] is

FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4

and [Critic] is

FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4

and

\[Z \sim \mathcal{N}(0, 1)\]

Parameters:
  • g_hidden (int) – The number of units in the critic and generator hidden layers. Default is 512.
  • init (str) – The weight initialization strategy. Valid entries are {‘glorot_normal’, ‘glorot_uniform’, ‘he_normal’, ‘he_uniform’, ‘std_normal’, ‘trunc_normal’}. Default is “he_uniform”.
  • optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates. If None, use the SGD optimizer with default parameters. Default is “RMSProp(lr=0.0001)”.
  • debug (bool) – Whether to store additional intermediate output within self.derived_variables. Default is False.
hyperparameters[source]
parameters[source]
derived_variables[source]
gradients[source]
forward(X, module, retain_derived=True)[source]

Perform the forward pass for either the generator or the critic.

Parameters:
  • X (ndarray of shape (batchsize, *)) – Input data
  • module ({'C' or 'G'}) – Whether to perform the forward pass for the critic (‘C’) or for the generator (‘G’).
  • retain_derived (bool) – Whether to retain the variables calculated during the forward pass for later use during backprop. If False, the module will not be expected to backprop with respect to this input. Default is True.
Returns:

  • out (ndarray of shape (batchsize, *)) – The output of the final layer of the module.
  • Xs (dict) – A dictionary with layer ids as keys and values corresponding to the input to each intermediate layer during the forward pass. Useful during debugging.

backward(grad, module, retain_grads=True)[source]

Perform the backward pass for either the generator or the critic.

Parameters:
  • grad (ndarray of shape (batchsize, *) or list of arrays) – Gradient of the loss with respect to module output(s).
  • module ({'C' or 'G'}) – Whether to perform the backward pass for the critic (‘C’) or for the generator (‘G’).
  • retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns:

  • out (ndarray of shape (batchsize, *)) – The gradient of the loss with respect to the module input.
  • dXs (dict) – A dictionary with layer ids as keys and values corresponding to the input to each intermediate layer during the backward pass. Useful during debugging.

update_critic(X_real)[source]

Compute parameter gradients for the critic on a single minibatch.

Parameters:X_real (ndarray of shape (batchsize, n_feats)) – Input data.
Returns:C_loss (float) – The critic loss on the current data.
update_generator(X_shape)[source]

Compute parameter gradients for the generator on a single minibatch.

Parameters:X_shape (tuple of (batchsize, n_feats)) – Shape for the input batch.
Returns:G_loss (float) – The generator loss on the fake data (generated during the critic update).
flush_gradients(module)[source]

Reset parameter gradients to 0 after an update.

update(module, module_loss=None)[source]

Perform gradient updates and flush gradients upon completion.

fit(X_real, lambda_, n_steps=1000, batchsize=128, c_updates_per_epoch=5, verbose=True)[source]

Fit WGAN_GP on a training dataset.

Parameters:
  • X_real (ndarray of shape (n_ex, n_feats)) – Training dataset
  • lambda_ (float) – Gradient penalty coefficient for the critic loss.
  • n_steps (int) – The maximum number of generator updates to perform. Default is 1000.
  • batchsize (int) – Number of examples to use in each training minibatch. Default is 128.
  • c_updates_per_epoch (int) – The number of critic updates to perform at each generator update. Default is 5.
  • verbose (bool) – Print loss values after each update. If False, only print loss every 100 steps. Default is True.
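
A minimal usage sketch based on the constructor and fit signatures above; the toy dataset and the specific hyperparameter values are illustrative, not recommendations.

import numpy as np
from numpy_ml.neural_nets.models import WGAN_GP

# Hypothetical toy dataset: 500 examples with 16 features each
X_real = np.random.randn(500, 16)

gan = WGAN_GP(g_hidden=128, optimizer="RMSProp(lr=0.0001)")
gan.fit(X_real, lambda_=10.0, n_steps=200, batchsize=64, c_updates_per_epoch=5)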

BernoulliVAE

class numpy_ml.neural_nets.models.BernoulliVAE(T=5, latent_dim=256, enc_conv1_pad=0, enc_conv2_pad=0, enc_conv1_out_ch=32, enc_conv2_out_ch=64, enc_conv1_stride=1, enc_pool1_stride=2, enc_conv2_stride=1, enc_pool2_stride=1, enc_conv1_kernel_shape=(5, 5), enc_pool1_kernel_shape=(2, 2), enc_conv2_kernel_shape=(5, 5), enc_pool2_kernel_shape=(2, 2), optimizer='RMSProp(lr=0.0001)', init='glorot_uniform')[source]

A variational autoencoder (VAE) with 2D convolutional encoder and Bernoulli input and output units.

Notes

The VAE architecture is

                |-- t_mean ----|
X -> [Encoder] -|              |--> [Sampler] -> [Decoder] -> X_recon
                |-- t_log_var -|

where [Encoder] is

Conv1 -> ReLU -> MaxPool1 -> Conv2 -> ReLU ->
    MaxPool2 -> Flatten -> FC1 -> ReLU -> FC2

[Decoder] is

FC1 -> FC2 -> Sigmoid

and [Sampler] draws a sample from the distribution

\[\mathcal{N}(\text{t_mean}, \exp \left\{\text{t_log_var}\right\} I)\]

using the reparameterization trick.
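
A minimal NumPy sketch of the reparameterized sample; the reparameterize helper below is hypothetical and is not the library's sampler code.

import numpy as np

def reparameterize(t_mean, t_log_var, rng=np.random):
    # Draw t ~ N(t_mean, exp(t_log_var) * I) as t_mean + std * eps with eps ~ N(0, I),
    # which keeps the sample differentiable with respect to t_mean and t_log_var
    eps = rng.standard_normal(t_mean.shape)
    return t_mean + np.exp(0.5 * t_log_var) * eps  # std = sqrt(exp(t_log_var))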

Parameters:
  • T (int) – The dimension of the variational parameter t. Default is 5.
  • enc_conv1_pad (int) – The padding for the first convolutional layer of the encoder. Default is 0.
  • enc_conv1_stride (int) – The stride for the first convolutional layer of the encoder. Default is 1.
  • enc_conv1_out_ch (int) – The number of output channels for the first convolutional layer of the encoder. Default is 32.
  • enc_conv1_kernel_shape (tuple) – The number of rows and columns in each filter of the first convolutional layer of the encoder. Default is (5, 5).
  • enc_pool1_kernel_shape (tuple) – The number of rows and columns in the receptive field of the first max pool layer of the encoder. Default is (2, 2).
  • enc_pool1_stride (int) – The stride for the first MaxPool layer of the encoder. Default is 2.
  • enc_conv2_pad (int) – The padding for the second convolutional layer of the encoder. Default is 0.
  • enc_conv2_out_ch (int) – The number of output channels for the second convolutional layer of the encoder. Default is 64.
  • enc_conv2_kernel_shape (tuple) – The number of rows and columns in each filter of the second convolutional layer of the encoder. Default is (5, 5).
  • enc_conv2_stride (int) – The stride for the second convolutional layer of the encoder. Default is 1.
  • enc_pool2_stride (int) – The stride for the second MaxPool layer of the encoder. Default is 1.
  • enc_pool2_kernel_shape (tuple) – The number of rows and columns in the receptive field of the second max pool layer of the encoder. Default is (2, 2).
  • latent_dim (int) – The dimension of the output for the first FC layer of the encoder. Default is 256.
  • optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates. If None, use the SGD optimizer with default parameters. Default is “RMSProp(lr=0.0001)”.
  • init (str) – The weight initialization strategy. Valid entries are {‘glorot_normal’, ‘glorot_uniform’, ‘he_normal’, ‘he_uniform’, ‘std_normal’, ‘trunc_normal’}. Default is ‘glorot_uniform’.
parameters[source]
hyperparameters[source]
derived_variables[source]
gradients[source]
forward(X_train)[source]

VAE forward pass

backward(X_train, X_recon)[source]

VAE backward pass

update(cur_loss=None)[source]

Perform gradient updates

flush_gradients()[source]

Reset parameter gradients after update

fit(X_train, n_epochs=20, batchsize=128, verbose=True)[source]

Fit the VAE to a training dataset.

Parameters:
  • X_train (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume
  • n_epochs (int) – The maximum number of training epochs to run. Default is 20.
  • batchsize (int) – The desired number of examples in each training batch. Default is 128.
  • verbose (bool) – Print batch information during training. Default is True.
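
A minimal usage sketch based on the signatures above; the binarized random input volume stands in for real image data, and all values are illustrative.

import numpy as np
from numpy_ml.neural_nets.models import BernoulliVAE

# Hypothetical binarized inputs: 200 single-channel 28 x 28 "images" in {0, 1},
# matching the Bernoulli output units of the decoder
X_train = (np.random.rand(200, 28, 28, 1) > 0.5).astype(float)

vae = BernoulliVAE(T=5, latent_dim=256)
vae.fit(X_train, n_epochs=2, batchsize=64)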

Word2Vec

class numpy_ml.neural_nets.models.Word2Vec(context_len=5, min_count=None, skip_gram=False, max_tokens=None, embedding_dim=300, filter_stopwords=True, noise_dist_power=0.75, init='glorot_uniform', num_negative_samples=64, optimizer='SGD(lr=0.1)')[source]

A word2vec model supporting both continuous bag of words (CBOW) and skip-gram architectures, with training via noise contrastive estimation.

Parameters:
  • context_len (int) – The number of words to the left and right of the current word to use as context during training. Larger values result in more training examples and thus can lead to higher accuracy at the expense of additional training time. Default is 5.
  • min_count (int or None) – Minimum number of times a token must occur in order to be included in vocab. If None, include all tokens from corpus_fp in vocab. Default is None.
  • skip_gram (bool) – Whether to train the skip-gram or CBOW model. The skip-gram model is trained to predict the surrounding context words, words[i - context:i] and words[i + 1:i + 1 + context], given the target word i; the CBOW model predicts the target word i from its surrounding context. Default is False.
  • max_tokens (int or None) – Only add the max_tokens most frequent tokens that occur more than min_count times to the vocabulary. If None, add all tokens that occur more than min_count times. Default is None.
  • embedding_dim (int) – The number of dimensions in the final word embeddings. Default is 300.
  • filter_stopwords (bool) – Whether to remove stopwords before encoding the words in the corpus. Default is True.
  • noise_dist_power (float) – The power the unigram count is raised to when computing the noise distribution for negative sampling. A value of 0 corresponds to a uniform distribution over tokens, and a value of 1 corresponds to a distribution proportional to the token unigram counts (see the sketch after this parameter list). Default is 0.75.
  • init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
  • num_negative_samples (int) – The number of negative samples to draw from the noise distribution for each positive training sample. If 0, use the hierarchical softmax formulation of the model instead. Default is 64.
  • optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update method. If None, use the SGD optimizer with default parameters. Default is “SGD(lr=0.1)”.
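
A minimal NumPy sketch of how noise_dist_power shapes the negative-sampling noise distribution; the noise_distribution helper and the unigram counts are hypothetical.

import numpy as np

def noise_distribution(unigram_counts, power=0.75):
    # p_i is proportional to count_i ** power
    scaled = unigram_counts ** power
    return scaled / scaled.sum()

counts = np.array([100.0, 50.0, 10.0, 1.0])   # hypothetical unigram counts
print(noise_distribution(counts, power=0.75))
print(noise_distribution(counts, power=0.0))  # uniform over tokens
print(noise_distribution(counts, power=1.0))  # proportional to raw counts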

Notes

The word2vec model is outlined in [1].

CBOW architecture:

w_{t-R}   ----|
w_{t-R+1} ----|
...            --> Average --> Embedding layer --> [NCE Layer / HSoftmax] --> P(w_{t} | w_{...})
w_{t+R-1} ----|
w_{t+R}   ----|

Skip-gram architecture:

                                                       |-->  P(w_{t-R} | w_{t})
                                                       |-->  P(w_{t-R+1} | w_{t})
w_{t} --> Embedding layer --> [NCE Layer / HSoftmax] --|     ...
                                                       |-->  P(w_{t+R-1} | w_{t})
                                                       |-->  P(w_{t+R} | w_{t})

where \(w_{i}\) is the one-hot representation of the word at position i within a sentence in the corpus and R is the length of the context window on either side of the target word.
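
The sketch below shows how (context, target) pairs might be formed from a sequence of word IDs for a window of radius R. It is illustrative only; context_windows is hypothetical and is not the library's minibatcher.

def context_windows(token_ids, R=5):
    # CBOW pairing: (context word IDs, target word ID) for each position i; for
    # skip-gram, each (target, context word) combination becomes its own example
    pairs = []
    for i, target in enumerate(token_ids):
        left = token_ids[max(0, i - R):i]
        right = token_ids[i + 1:i + 1 + R]
        pairs.append((left + right, target))
    return pairs

print(context_windows([10, 11, 12, 13, 14], R=2))
# [([11, 12], 10), ([10, 12, 13], 11), ([10, 11, 13, 14], 12), ...]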

References

[1] Mikolov et al. (2013). “Distributed representations of words and phrases and their compositionality.” Proceedings of the 26th International Conference on Neural Information Processing Systems. https://arxiv.org/pdf/1310.4546.pdf
parameters[source]

Model parameters

hyperparameters[source]

Model hyperparameters

derived_variables[source]

Variables computed during model operation

gradients[source]

Model parameter gradients

forward(X, targets, retain_derived=True)[source]

Evaluate the network on a single minibatch.

Parameters:
  • X (ndarray of shape (n_ex, n_in)) – Layer input, representing a minibatch of n_ex examples, each consisting of n_in integer word indices
  • targets (ndarray of shape (n_ex,)) – Target word index for each example in the minibatch.
  • retain_derived (bool) – Whether to retain the variables calculated during the forward pass for later use during backprop. If False, the layer will not be expected to backprop with respect to this input. Default is True.
Returns:

  • loss (float) – The loss associated with the current minibatch
  • y_pred (ndarray of shape (n_ex,)) – The conditional probabilities of the words in targets given the corresponding example / context in X.

backward()[source]

Compute the gradient of the loss wrt the current network parameters.

update(cur_loss=None)[source]

Perform gradient updates

flush_gradients()[source]

Reset parameter gradients after update

get_embedding(word_ids)[source]

Retrieve the embeddings for a collection of word IDs.

Parameters:word_ids (ndarray of shape (M,)) – An array of word IDs to retrieve embeddings for.
Returns:embeddings (ndarray of shape (M, n_out)) – The embedding vectors for each of the M word IDs.
minibatcher(corpus_fps, encoding)[source]

A minibatch generator for skip-gram and CBOW models.

Parameters:
  • corpus_fps (str or list of strs) – The filepath / list of filepaths to the document(s) to be encoded. Each document is expected to be encoded as a newline-separated string of text, with adjacent tokens separated by a whitespace character.
  • encoding (str) – Specifies the text encoding for the corpus. This value is passed directly to Python’s open builtin. Common entries are either ‘utf-8’ (no header byte) or ‘utf-8-sig’ (header byte).
Yields:
  • X (list of length batchsize or ndarray of shape (batchsize, n_in)) – The context IDs for a minibatch of batchsize examples. If self.skip_gram is False, X will be a ragged list consisting of batchsize variable-length lists. If self.skip_gram is True, all sublists will be of the same length (n_in) and X will be returned as an ndarray of shape (batchsize, n_in).
  • target (ndarray of shape (batchsize, 1)) – The target IDs associated with each example in X.
fit(corpus_fps, encoding='utf-8-sig', n_epochs=20, batchsize=128, verbose=True)[source]

Learn word2vec embeddings for the documents in corpus_fps.

Parameters:
  • corpus_fps (str or list of strs) – The filepath / list of filepaths to the document(s) to be encoded. Each document is expected to be encoded as a newline-separated string of text, with adjacent tokens separated by a whitespace character.
  • encoding (str) – Specifies the text encoding for the corpus. Common entries are either ‘utf-8’ (no header byte) or ‘utf-8-sig’ (header byte). Default is ‘utf-8-sig’.
  • n_epochs (int) – The maximum number of training epochs to run. Default is 20.
  • batchsize (int) – The desired number of examples in each training batch. Default is 128.
  • verbose (bool) – Print batch information during training. Default is True.
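
A minimal usage sketch based on the signatures above. The corpus path "./corpus.txt" is hypothetical; any newline-separated, whitespace-tokenized text file (or list of files) would do.

import numpy as np
from numpy_ml.neural_nets.models import Word2Vec

w2v = Word2Vec(context_len=5, embedding_dim=300, skip_gram=False)
w2v.fit("./corpus.txt", encoding="utf-8-sig", n_epochs=5, batchsize=128)

# Retrieve the embedding vectors for a few word IDs
# (assumes IDs 0-2 exist in the learned vocabulary)
embeddings = w2v.get_embedding(np.array([0, 1, 2]))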