Modules

BidirectionalLSTM

class numpy_ml.neural_nets.modules.BidirectionalLSTM(n_out, act_fn=None, gate_fn=None, merge_mode='concat', init='glorot_uniform', optimizer=None)[source]

A single bidirectional long short-term memory (LSTM) layer.

Parameters:
  • n_out (int) – The dimension of a single hidden state / output on a given timestep
  • act_fn (Activation object or None) – The activation function for computing A[t]. If not specified, use Tanh by default.
  • gate_fn (Activation object or None) – The gate function for computing the update, forget, and output gates. If not specified, use Sigmoid by default.
  • merge_mode ({"sum", "multiply", "concat", "average"}) – Mode by which outputs of the forward and backward LSTMs will be combined. Default is ‘concat’.
  • optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update method. If None, use the SGD optimizer with default parameters. Default is None.
  • init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
forward(X)[source]

Run a forward pass across all timesteps in the input.

Parameters: X (ndarray of shape (n_ex, n_in, n_t)) – Input consisting of n_ex examples each of dimensionality n_in and extending for n_t timesteps.
Returns: Y (ndarray of shape (n_ex, n_out, n_t)) – The value of the hidden state for each of the n_ex examples across each of the n_t timesteps.
backward(dLdA)[source]

Run a backward pass across all timesteps in the input.

Parameters: dLdA (ndarray of shape (n_ex, n_out, n_t)) – The gradient of the loss with respect to the layer output for each of the n_ex examples across all n_t timesteps.
Returns: dLdX (ndarray of shape (n_ex, n_in, n_t)) – The gradient of the loss with respect to the layer input for each of the n_ex examples across each of the n_t timesteps.
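
For concreteness, a minimal usage sketch following the shapes in the docstrings above (the sizes, the random input, the dummy loss gradient, and the final update() call are illustrative assumptions, not part of this reference):

    import numpy as np
    from numpy_ml.neural_nets.modules import BidirectionalLSTM

    n_ex, n_in, n_t = 8, 10, 25                # illustrative minibatch dimensions
    X = np.random.randn(n_ex, n_in, n_t)       # input of shape (n_ex, n_in, n_t)

    lstm = BidirectionalLSTM(n_out=32, merge_mode="sum")
    Y = lstm.forward(X)                        # hidden state for every timestep
    dLdX = lstm.backward(np.ones_like(Y))      # gradient of a dummy loss L = sum(Y)
    lstm.update()                              # apply accumulated gradients (SGD by default)
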
derived_variables[source]

A dictionary of intermediate values computed during the forward/backward passes.

gradients[source]

A dictionary of the accumulated module parameter gradients.

parameters[source]

A dictionary of the module parameters.

hyperparameters[source]

A dictionary of the module hyperparameters.

MultiHeadedAttentionModule

class numpy_ml.neural_nets.modules.MultiHeadedAttentionModule(n_heads=8, dropout_p=0, init='glorot_uniform', optimizer=None)[source]

A multi-headed attention module.

Notes

Multi-head attention allows a model to jointly attend to information from different representation subspaces at different positions. With a single head, this information would get averaged away when the attention weights are combined with the values:

\[\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = [\text{head}_1; ...; \text{head}_h] \mathbf{W}^{(O)}\]

where

\[\text{head}_i = \text{SDP_attention}( \mathbf{Q W}_i^{(Q)}, \mathbf{K W}_i^{(K)}, \mathbf{V W}_i^{(V)})\]

and the projection weights are parameter matrices:

\[\begin{split}\mathbf{W}_i^{(Q)} &\in \mathbb{R}^{(\text{kqv_dim} \ \times \ \text{latent_dim})} \\ \mathbf{W}_i^{(K)} &\in \mathbb{R}^{(\text{kqv_dim} \ \times \ \text{latent_dim})} \\ \mathbf{W}_i^{(V)} &\in \mathbb{R}^{(\text{kqv_dim} \ \times \ \text{latent_dim})} \\ \mathbf{W}^{(O)} &\in \mathbb{R}^{(\text{n_heads} \cdot \text{latent_dim} \ \times \ \text{kqv_dim})}\end{split}\]

Importantly, the current module explicitly assumes that

\[\text{kqv_dim} = \text{dim(query)} = \text{dim(keys)} = \text{dim(values)}\]

and that

\[\text{latent_dim} = \text{kqv_dim} / \text{n_heads}\]

[MH Attention Head h]:

K --> W_h^(K) ------\
V --> W_h^(V) ------- > DP_Attention --> head_h
Q --> W_h^(Q) ------/

The full [MultiHeadedAttentionModule] then becomes

      -----------------
K --> | [Attn Head 1] | --> head_1 --\
V --> | [Attn Head 2] | --> head_2 --\
Q --> |      ...      |      ...       --> Concat --> W^(O) --> MH_out
      | [Attn Head Z] | --> head_Z --/
      -----------------

Due to the reduced dimension of each head, the total computational cost is similar to that of a single attention head with full (i.e., kqv_dim) dimensionality.
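
As a worked illustration of the equations above (not the module's internal implementation), the following NumPy sketch computes multi-head attention with kqv_dim = 512 and n_heads = 8, so latent_dim = 512 / 8 = 64; the (n_ex, n_t, kqv_dim) input layout and the random weights are illustrative assumptions:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    n_ex, n_t, kqv_dim, n_heads = 4, 10, 512, 8
    latent_dim = kqv_dim // n_heads            # 64

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n_ex, n_t, kqv_dim)) for _ in range(3))

    # One (kqv_dim, latent_dim) projection per head for each of Q, K, and V
    W_Q, W_K, W_V = (rng.standard_normal((n_heads, kqv_dim, latent_dim)) for _ in range(3))
    W_O = rng.standard_normal((n_heads * latent_dim, kqv_dim))

    heads = []
    for h in range(n_heads):
        Qh, Kh, Vh = Q @ W_Q[h], K @ W_K[h], V @ W_V[h]            # (n_ex, n_t, latent_dim)
        scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(latent_dim)  # scaled dot-product
        heads.append(softmax(scores) @ Vh)                         # attention-weighted values

    MH_out = np.concatenate(heads, axis=-1) @ W_O                  # (n_ex, n_t, kqv_dim)
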

Parameters:
  • n_heads (int) – The number of simultaneous attention heads to use. Note that the larger n_heads, the smaller the dimensionality of any single head, since latent_dim = kqv_dim / n_heads. Default is 8.
  • dropout_p (float in [0, 1)) – The dropout probability during training, applied to the output of the softmax in each dot-product attention head. If 0, no dropout is applied. Default is 0.
  • init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
  • optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
forward(Q, K, V)[source]

Compute the attention-weighted output for a minibatch of queries Q, keys K, and values V.

backward(dLdy)[source]

Run a backward pass through the attention module, computing the gradient of the loss with respect to the module parameters and inputs.

derived_variables[source]

A dictionary of intermediate values computed during the forward/backward passes.

gradients[source]

A dictionary of the accumulated module parameter gradients.

parameters[source]

A dictionary of the module parameters.

hyperparameters[source]

A dictionary of the module hyperparameters.

SkipConnectionConvModule

class numpy_ml.neural_nets.modules.SkipConnectionConvModule(out_ch1, out_ch2, kernel_shape1, kernel_shape2, kernel_shape_skip, pad1=0, pad2=0, stride1=1, stride2=1, act_fn=None, epsilon=1e-05, momentum=0.9, stride_skip=1, optimizer=None, init='glorot_uniform')[source]

A ResNet-like “convolution” shortcut module.

Notes

In contrast to SkipConnectionIdentityModule, the additional conv2d_skip and batchnorm_skip layers in the shortcut path allow adjusting the dimensions of X to match the output of the main set of convolutions.

X -> Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D -> + -> Act_fn
 \_____________________ Conv2D -> Batchnorm2D __________________/
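
In functional form, the data flow in the diagram above is roughly the following sketch; conv1, bn1, conv2, bn2, conv_skip, bn_skip, and act_fn are stand-ins for the module's internal layers, not its actual attribute names:

    def skip_connection_conv(X, conv1, bn1, conv2, bn2, conv_skip, bn_skip, act_fn):
        # Main path: Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D
        main = bn2(conv2(bn1(act_fn(conv1(X)))))
        # Skip path: Conv2D -> BatchNorm2D, projecting X to the shape of `main`
        shortcut = bn_skip(conv_skip(X))
        return act_fn(main + shortcut)
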

References

[1]He et al. (2015). “Deep residual learning for image recognition.” https://arxiv.org/pdf/1512.03385.pdf
Parameters:
  • out_ch1 (int) – The number of filters/kernels to compute in the first convolutional layer.
  • out_ch2 (int) – The number of filters/kernels to compute in the second convolutional layer.
  • kernel_shape1 (2-tuple) – The dimension of a single 2D filter/kernel in the first convolutional layer.
  • kernel_shape2 (2-tuple) – The dimension of a single 2D filter/kernel in the second convolutional layer.
  • kernel_shape_skip (2-tuple) – The dimension of a single 2D filter/kernel in the “skip” convolutional layer.
  • stride1 (int) – The stride/hop of the convolution kernels in the first convolutional layer. Default is 1.
  • stride2 (int) – The stride/hop of the convolution kernels in the second convolutional layer. Default is 1.
  • stride_skip (int) – The stride/hop of the convolution kernels in the “skip” convolutional layer. Default is 1.
  • pad1 (int, tuple, or 'same') – The number of rows/columns of 0’s to pad the input to the first convolutional layer with. Default is 0.
  • pad2 (int, tuple, or 'same') – The number of rows/columns of 0’s to pad the input to the second convolutional layer with. Default is 0.
  • act_fn (Activation object or None) – The activation function for computing Y[t]. If None, use the identity \(f(x) = x\) by default. Default is None.
  • epsilon (float) – A small smoothing constant to use during BatchNorm2D computation to avoid divide-by-zero errors. Default is 1e-5.
  • momentum (float) – The momentum term for the running mean/running std calculations in the BatchNorm2D layers. The closer this is to 1, the less weight will be given to the mean/std of the current batch (i.e., higher smoothing). Default is 0.9.
  • init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
  • optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update method. If None, use the SGD optimizer with default parameters. Default is None.
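
A hedged configuration example (the channel counts, kernel shapes, and padding below are hypothetical values chosen for illustration): a block whose 1x1 skip convolution projects the input channels so the shortcut can be added to the out_ch2-channel output of the main path.

    from numpy_ml.neural_nets.modules import SkipConnectionConvModule

    block = SkipConnectionConvModule(
        out_ch1=64,
        out_ch2=128,
        kernel_shape1=(3, 3),
        kernel_shape2=(3, 3),
        kernel_shape_skip=(1, 1),   # 1x1 conv on the skip path matches channel dims
        pad1="same",
        pad2="same",
    )
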
parameters[source]

A dictionary of the module parameters.

hyperparameters[source]

A dictionary of the module hyperparameters.

derived_variables[source]

A dictionary of intermediate values computed during the forward/backward passes.

gradients[source]

A dictionary of the accumulated module parameter gradients.

forward(X, retain_derived=True)[source]

Compute the layer output given input volume X.

Parameters:
  • X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
  • retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, the layer is not expected to backprop with respect to this input. Default is True.
Returns:

Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The module output volume.

backward(dLdY, retain_grads=True)[source]

Compute the gradient of the loss with respect to the module parameters.

Parameters:
  • dLdY (ndarray of shape (n_ex, out_rows, out_cols, out_ch) or list of arrays) – The gradient(s) of the loss with respect to the module output(s).
  • retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns:

dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss with respect to the module input volume.

SkipConnectionIdentityModule

class numpy_ml.neural_nets.modules.SkipConnectionIdentityModule(out_ch, kernel_shape1, kernel_shape2, stride1=1, stride2=1, act_fn=None, epsilon=1e-05, momentum=0.9, optimizer=None, init='glorot_uniform')[source]

A ResNet-like “identity” shortcut module.

Notes

The identity module enforces “same” padding during each convolution to ensure that the module output has the same dimensions as its input.

X -> Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D -> + -> Act_fn
 \______________________________________________________________/
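
In functional form, the diagram above corresponds roughly to the following sketch; conv1, bn1, conv2, bn2, and act_fn are stand-ins for the module's internal layers, with the “same” padding ensuring that `main` keeps the shape of X:

    def skip_connection_identity(X, conv1, bn1, conv2, bn2, act_fn):
        # Main path: Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D
        main = bn2(conv2(bn1(act_fn(conv1(X)))))
        # Identity shortcut: add the unmodified input back in
        return act_fn(main + X)
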

References

[1]He et al. (2015). “Deep residual learning for image recognition.” https://arxiv.org/pdf/1512.03385.pdf
Parameters:
  • out_ch (int) – The number of filters/kernels to compute in the first convolutional layer.
  • kernel_shape1 (2-tuple) – The dimension of a single 2D filter/kernel in the first convolutional layer.
  • kernel_shape2 (2-tuple) – The dimension of a single 2D filter/kernel in the second convolutional layer.
  • stride1 (int) – The stride/hop of the convolution kernels in the first convolutional layer. Default is 1.
  • stride2 (int) – The stride/hop of the convolution kernels in the second convolutional layer. Default is 1.
  • act_fn (Activation object or None) – The activation function for computing Y[t]. If None, use the identity \(f(x) = x\) by default. Default is None.
  • epsilon (float) – A small smoothing constant to use during BatchNorm2D computation to avoid divide-by-zero errors. Default is 1e-5.
  • momentum (float) – The momentum term for the running mean/running std calculations in the BatchNorm2D layers. The closer this is to 1, the less weight will be given to the mean/std of the current batch (i.e., higher smoothing). Default is 0.9.
  • optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
  • init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
parameters[source]

A dictionary of the module parameters.

hyperparameters[source]

A dictionary of the module hyperparameters.

derived_variables[source]

A dictionary of intermediate values computed during the forward/backward passes.

gradients[source]

A dictionary of the accumulated module parameter gradients.

forward(X, retain_derived=True)[source]

Compute the module output given input volume X.

Parameters:
  • X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
  • retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, the layer is not expected to backprop with respect to this input. Default is True.
Returns:

Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The module output volume.

backward(dLdY, retain_grads=True)[source]

Compute the gradient of the loss with respect to the module parameters.

Parameters:
  • dLdY (ndarray of shape (n_ex, out_rows, out_cols, out_ch) or list of arrays) – The gradient(s) of the loss with respect to the module output(s).
  • retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns:

dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss with respect to the module input volume.

WavenetResidualModule

class numpy_ml.neural_nets.modules.WavenetResidualModule(ch_residual, ch_dilation, dilation, kernel_width, optimizer=None, init='glorot_uniform')[source]

A WaveNet-like residual block with causal dilated convolutions.

*Skip path in* >-------------------------------------------> + ---> *Skip path out*
                  Causal      |--> Tanh --|                  |
*Main    |--> Dilated Conv1D -|           * --> 1x1 Conv1D --|
 path >--|                    |--> Sigm --|                  |
 in*     |-------------------------------------------------> + ---> *Main path out*
                             *Residual path*

On the final block, the output of the skip path is further processed to produce the network predictions.
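
In functional form, the block's data flow is roughly the following sketch; dilated_conv and conv_1x1 are stand-ins for the module's causal dilated Conv1D and 1x1 Conv1D layers, and the gated activation follows the diagram above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def wavenet_residual_block(X_main, X_skip, dilated_conv, conv_1x1):
        H = dilated_conv(X_main)                       # causal dilated convolution
        Z = np.tanh(H) * sigmoid(H)                    # gated activation unit
        Z = conv_1x1(Z)                                # 1x1 convolution
        Y_skip = Z if X_skip is None else X_skip + Z   # skip pathway, accumulated across blocks
        Y_main = X_main + Z                            # residual (main) pathway
        return Y_main, Y_skip
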

References

[1]van den Oord et al. (2016). “Wavenet: a generative model for raw audio”. https://arxiv.org/pdf/1609.03499.pdf
Parameters:
  • ch_residual (int) – The number of output channels for the 1x1 Conv1D layer in the main path.
  • ch_dilation (int) – The number of output channels for the causal dilated Conv1D layer in the main path.
  • dilation (int) – The dilation rate for the causal dilated Conv1D layer in the main path.
  • kernel_width (int) – The width of the causal dilated Conv1D kernel in the main path.
  • init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
  • optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
parameters[source]

A dictionary of the module parameters.

hyperparameters[source]

A dictionary of the module hyperparameters.

derived_variables[source]

A dictionary of intermediate values computed during the forward/backward passes.

gradients[source]

A dictionary of the module parameter gradients.

forward(X_main, X_skip=None)[source]

Compute the module output on a single minibatch.

Parameters:
  • X_main (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
  • X_skip (ndarray of shape (n_ex, in_rows, in_cols, in_ch), or None) – The output of the preceding skip-connection if this is not the first module in the network.
Returns:

  • Y_main (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The output of the main pathway.
  • Y_skip (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The output of the skip-connection pathway.

backward(dY_skip, dY_main=None)[source]

Run a backward pass through the module, computing the gradients of the loss with respect to the module parameters and inputs.