Modules
BidirectionalLSTM
class numpy_ml.neural_nets.modules.BidirectionalLSTM(n_out, act_fn=None, gate_fn=None, merge_mode='concat', init='glorot_uniform', optimizer=None)
A single bidirectional long short-term memory (LSTM) layer.
Parameters: - n_out (int) – The dimension of a single hidden state / output on a given timestep.
- act_fn (Activation object or None) – The activation function for computing A[t]. If not specified, use Tanh by default.
- gate_fn (Activation object or None) – The gate function for computing the update, forget, and output gates. If not specified, use Sigmoid by default.
- merge_mode ({"sum", "multiply", "concat", "average"}) – Mode by which outputs of the forward and backward LSTMs will be combined. Default is ‘concat’.
- optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update method. If None, use the SGD optimizer with default parameters. Default is None.
- init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
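The four merge_mode options can be illustrated on a single timestep. The sketch below is plain Python rather than the library's code: the helper name `merge` and the example states are hypothetical, and it only assumes that each mode combines the forward and backward hidden states elementwise (or by concatenation) in the usual way.

```python
def merge(fwd, bwd, mode="concat"):
    """Combine one forward and one backward hidden state (lists of floats)."""
    if len(fwd) != len(bwd):
        raise ValueError("forward and backward states must have equal length")
    if mode == "sum":
        return [f + b for f, b in zip(fwd, bwd)]
    if mode == "multiply":
        return [f * b for f, b in zip(fwd, bwd)]
    if mode == "average":
        return [(f + b) / 2 for f, b in zip(fwd, bwd)]
    if mode == "concat":
        # Concatenation doubles the feature dimension in this sketch.
        return list(fwd) + list(bwd)
    raise ValueError(f"unknown merge_mode: {mode!r}")


h_fwd = [0.5, -1.0]  # hypothetical forward hidden state, n_out = 2
h_bwd = [1.5, 2.0]   # hypothetical backward hidden state, n_out = 2

for mode in ("sum", "multiply", "average", "concat"):
    print(mode, merge(h_fwd, h_bwd, mode))
```

Note that in this sketch 'concat' yields a vector of length 2 * n_out, while the other three modes keep length n_out.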
forward(X)
Run a forward pass across all timesteps in the input.
Parameters: X (ndarray of shape (n_ex, n_in, n_t)) – Input consisting of n_ex examples, each of dimensionality n_in and extending for n_t timesteps.
Returns: Y (ndarray of shape (n_ex, n_out, n_t)) – The value of the hidden state for each of the n_ex examples across each of the n_t timesteps.
backward(dLdA)
Run a backward pass across all timesteps in the input.
Parameters: dLdA (ndarray of shape (n_ex, n_out, n_t)) – The gradient of the loss with respect to the layer output for each of the n_ex examples across all n_t timesteps.
Returns: dLdX (ndarray of shape (n_ex, n_in, n_t)) – The gradient of the loss with respect to the layer input for each of the n_ex examples across all n_t timesteps.
MultiHeadedAttentionModule
class numpy_ml.neural_nets.modules.MultiHeadedAttentionModule(n_heads=8, dropout_p=0, init='glorot_uniform', optimizer=None)
A multi-headed attention module.
Notes
Multi-head attention allows a model to jointly attend to information from different representation subspaces at different positions. With a single head, this information would get averaged away when the attention weights are combined with the values.
\[\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = [\text{head}_1; ...; \text{head}_h] \mathbf{W}^{(O)}\]
where
\[\text{head}_i = \text{SDP_attention}( \mathbf{Q W}_i^{(Q)}, \mathbf{K W}_i^{(K)}, \mathbf{V W}_i^{(V)})\]
and the projection weights are parameter matrices:
\[\begin{split}\mathbf{W}_i^{(Q)} &\in \mathbb{R}^{(\text{kqv_dim} \ \times \ \text{latent_dim})} \\ \mathbf{W}_i^{(K)} &\in \mathbb{R}^{(\text{kqv_dim} \ \times \ \text{latent_dim})} \\ \mathbf{W}_i^{(V)} &\in \mathbb{R}^{(\text{kqv_dim} \ \times \ \text{latent_dim})} \\ \mathbf{W}^{(O)} &\in \mathbb{R}^{(\text{n_heads} \cdot \text{latent_dim} \ \times \ \text{kqv_dim})}\end{split}\]
Importantly, the current module explicitly assumes that
\[\text{kqv_dim} = \text{dim(query)} = \text{dim(keys)} = \text{dim(values)}\]
and that
\[\text{latent_dim} = \text{kqv_dim / n_heads}\]
[MH Attention Head h]:
    K --> W_h^(K) ------\
    V --> W_h^(V) ------- > DP_Attention --> head_h
    Q --> W_h^(Q) ------/
The full [MultiHeadedAttentionModule] then becomes
          -----------------
    K --> | [Attn Head 1] | --> head_1 --\
    V --> | [Attn Head 2] | --> head_2 --\
    Q --> |      ...      |      ...       --> Concat --> W^(O) --> MH_out
          | [Attn Head Z] | --> head_Z --/
          -----------------
Due to the reduced dimension of each head, the total computational cost is similar to that of a single attention head with full (i.e., kqv_dim) dimensionality.
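The shape bookkeeping behind this claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the divisibility check is an assumption of this example rather than documented library behavior. It counts the projection weights implied by the shapes given in the Notes and shows that the total does not depend on n_heads.

```python
def multihead_shapes(kqv_dim, n_heads):
    """Projection shapes implied by latent_dim = kqv_dim / n_heads."""
    if kqv_dim % n_heads != 0:
        # Assumption of this sketch: the division must be exact.
        raise ValueError("kqv_dim must be divisible by n_heads")
    latent_dim = kqv_dim // n_heads
    return {
        "W_i^(Q)": (kqv_dim, latent_dim),
        "W_i^(K)": (kqv_dim, latent_dim),
        "W_i^(V)": (kqv_dim, latent_dim),
        "W^(O)": (n_heads * latent_dim, kqv_dim),
    }


def projection_weight_count(kqv_dim, n_heads):
    """Total number of weights across all heads plus the output projection."""
    shapes = multihead_shapes(kqv_dim, n_heads)
    per_head = sum(r * c for name, (r, c) in shapes.items() if name != "W^(O)")
    out_rows, out_cols = shapes["W^(O)"]
    return n_heads * per_head + out_rows * out_cols


for heads in (1, 2, 8):
    print(heads, multihead_shapes(512, heads)["W_i^(Q)"], projection_weight_count(512, heads))
```

With kqv_dim = 512, every choice of n_heads gives the same count of 4 * 512 * 512 weights, since each head shrinks in proportion to the number of heads.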
Parameters: - n_heads (int) – The number of simultaneous attention heads to use. Note that the larger n_heads, the smaller the dimensionality of any single head, since latent_dim = kqv_dim / n_heads. Default is 8.
- dropout_p (float in [0, 1)) – The dropout probability during training, applied to the output of the softmax in each dot-product attention head. If 0, no dropout is applied. Default is 0.
- init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
- optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
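To make the role of dropout_p concrete, here is a minimal single-head sketch in plain Python. It assumes that SDP_attention denotes standard scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, and that dropout is the usual "inverted" variant applied to the softmax weights; both are assumptions of this example, and the function names are hypothetical rather than the library's.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sdp_attention(Q, K, V, dropout_p=0.0, rng=None):
    """Q: n_q x d, K: n_k x d, V: n_k x d_v, all as lists of lists."""
    d = len(K[0])
    rng = rng or random.Random(0)
    out = []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        if dropout_p > 0:
            # Inverted dropout on the attention weights (assumed convention).
            keep = 1.0 - dropout_p
            weights = [w / keep if rng.random() < keep else 0.0 for w in weights]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]  # identical keys, so both receive equal weight
V = [[2.0], [4.0]]
print(sdp_attention(Q, K, V))
```

Because the two keys are identical, the attention weights are 0.5 each and the output is the mean of the two values.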
SkipConnectionConvModule
class numpy_ml.neural_nets.modules.SkipConnectionConvModule(out_ch1, out_ch2, kernel_shape1, kernel_shape2, kernel_shape_skip, pad1=0, pad2=0, stride1=1, stride2=1, act_fn=None, epsilon=1e-05, momentum=0.9, stride_skip=1, optimizer=None, init='glorot_uniform')
A ResNet-like “convolution” shortcut module.
Notes
In contrast to SkipConnectionIdentityModule, the additional conv2d_skip and batchnorm_skip layers in the shortcut path allow adjusting the dimensions of X to match the output of the main set of convolutions.
    X -> Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D -> + -> Act_fn
     \_____________________ Conv2D -> Batchnorm2D __________________/
References
[1] He et al. (2015). “Deep residual learning for image recognition.” https://arxiv.org/pdf/1512.03385.pdf
Parameters: - out_ch1 (int) – The number of filters/kernels to compute in the first convolutional layer.
- out_ch2 (int) – The number of filters/kernels to compute in the second convolutional layer.
- kernel_shape1 (2-tuple) – The dimension of a single 2D filter/kernel in the first convolutional layer.
- kernel_shape2 (2-tuple) – The dimension of a single 2D filter/kernel in the second convolutional layer.
- kernel_shape_skip (2-tuple) – The dimension of a single 2D filter/kernel in the “skip” convolutional layer.
- stride1 (int) – The stride/hop of the convolution kernels in the first convolutional layer. Default is 1.
- stride2 (int) – The stride/hop of the convolution kernels in the second convolutional layer. Default is 1.
- stride_skip (int) – The stride/hop of the convolution kernels in the “skip” convolutional layer. Default is 1.
- pad1 (int, tuple, or 'same') – The number of rows/columns of 0’s to pad the input to the first convolutional layer with. Default is 0.
- pad2 (int, tuple, or 'same') – The number of rows/columns of 0’s to pad the input to the second convolutional layer with. Default is 0.
- act_fn (Activation object or None) – The activation function for computing Y[t]. If None, use the identity \(f(x) = x\) by default. Default is None.
- epsilon (float) – A small smoothing constant to use during BatchNorm2D computation to avoid divide-by-zero errors. Default is 1e-5.
- momentum (float) – The momentum term for the running mean/running std calculations in the BatchNorm2D layers. The closer this is to 1, the less weight will be given to the mean/std of the current batch (i.e., higher smoothing). Default is 0.9.
- init (str) – The weight initialization strategy. Valid entries are {‘glorot_normal’, ‘glorot_uniform’, ‘he_normal’, ‘he_uniform’}. Default is ‘glorot_uniform’.
- optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update method. If None, use the SGD optimizer with default parameters. Default is None.
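The effect of momentum on the running statistics can be sketched as an exponential moving average. This is a minimal illustration that assumes the common update rule running = momentum * running + (1 - momentum) * batch_stat; the function name is hypothetical and the library's exact update may differ.

```python
def update_running(running, batch_stat, momentum=0.9):
    """Exponential moving average of a batch statistic (assumed update rule)."""
    return momentum * running + (1.0 - momentum) * batch_stat


running_mean = 0.0
batch_mean = 10.0  # hypothetical mean of the current batch

# A momentum close to 1 moves the running mean only slightly toward the batch mean.
slow = update_running(running_mean, batch_mean, momentum=0.9)
fast = update_running(running_mean, batch_mean, momentum=0.5)
print(slow, fast)
```

Under this rule, momentum = 0.9 gives the current batch a weight of roughly 0.1, which matches the description of higher momentum producing heavier smoothing.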
derived_variables
A dictionary of intermediate values computed during the forward/backward passes.
forward(X, retain_derived=True)
Compute the layer output given input volume X.
Parameters: - X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
- retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The module output volume.
backward(dLdY, retain_grads=True)
Compute the gradient of the loss with respect to the module parameters.
Parameters: - dLdY (ndarray of shape (n_ex, out_rows, out_cols, out_ch) or list of arrays) – The gradient(s) of the loss with respect to the module output(s).
- retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss with respect to the module input volume.
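For the addition at the end of the module to be well defined, the main path and the skip path must produce volumes of the same spatial size. The sketch below checks this with the standard convolution output formula, floor((in + 2 * pad - kernel) / stride) + 1. It assumes symmetric integer padding and an unpadded skip convolution; the helper names and the example settings are hypothetical.

```python
def conv_out_dim(in_dim, kernel, pad=0, stride=1):
    """Output length along one axis for symmetric zero padding `pad`."""
    return (in_dim + 2 * pad - kernel) // stride + 1


def main_path_dim(in_dim, k1, k2, pad1=0, pad2=0, stride1=1, stride2=1):
    """Length after the two stacked convolutions of the main path."""
    mid = conv_out_dim(in_dim, k1, pad1, stride1)
    return conv_out_dim(mid, k2, pad2, stride2)


# Hypothetical settings: the main path halves a 32-pixel axis.
in_rows = 32
main = main_path_dim(in_rows, k1=3, k2=3, pad1=1, pad2=1, stride1=2, stride2=1)

# A 1x1 skip convolution with stride 2 and no padding (assumed) gives the same size.
skip = conv_out_dim(in_rows, kernel=1, pad=0, stride=2)
print(main, skip)
```

If the two values differ for a given choice of kernel_shape_skip and stride_skip, the elementwise sum in the diagram cannot be formed, so it is worth running this kind of check before constructing the module.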
SkipConnectionIdentityModule
class numpy_ml.neural_nets.modules.SkipConnectionIdentityModule(out_ch, kernel_shape1, kernel_shape2, stride1=1, stride2=1, act_fn=None, epsilon=1e-05, momentum=0.9, optimizer=None, init='glorot_uniform')
A ResNet-like “identity” shortcut module.
Notes
The identity module enforces same padding during each convolution to ensure module output has same dims as its input.
    X -> Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D -> + -> Act_fn
     \______________________________________________________________/
References
[1] He et al. (2015). “Deep residual learning for image recognition.” https://arxiv.org/pdf/1512.03385.pdf
Parameters: - out_ch (int) – The number of filters/kernels to compute in the first convolutional layer.
- kernel_shape1 (2-tuple) – The dimension of a single 2D filter/kernel in the first convolutional layer.
- kernel_shape2 (2-tuple) – The dimension of a single 2D filter/kernel in the second convolutional layer.
- stride1 (int) – The stride/hop of the convolution kernels in the first convolutional layer. Default is 1.
- stride2 (int) – The stride/hop of the convolution kernels in the second convolutional layer. Default is 1.
- act_fn (Activation object or None) – The activation function for computing Y[t]. If None, use the identity \(f(x) = x\) by default. Default is None.
- epsilon (float) – A small smoothing constant to use during BatchNorm2D computation to avoid divide-by-zero errors. Default is 1e-5.
- momentum (float) – The momentum term for the running mean/running std calculations in the BatchNorm2D layers. The closer this is to 1, the less weight will be given to the mean/std of the current batch (i.e., higher smoothing). Default is 0.9.
- optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
- init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
derived_variables
A dictionary of intermediate values computed during the forward/backward passes.
forward(X, retain_derived=True)
Compute the module output given input volume X.
Parameters: - X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
- retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The module output volume.
backward(dLdY, retain_grads=True)
Compute the gradient of the loss with respect to the layer parameters.
Parameters: - dLdY (ndarray of shape (n_ex, out_rows, out_cols, out_ch) or list of arrays) – The gradient(s) of the loss with respect to the module output(s).
- retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss with respect to the module input volume.
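The "same" padding mentioned in the Notes can be worked out per axis. The sketch below uses one common convention (target output length ceil(in / stride), with any odd leftover padding placed on the trailing side); this convention and the helper names are assumptions of the example, and the library may split the padding differently.

```python
import math


def same_padding(in_dim, kernel, stride=1):
    """Return (leading, trailing) zero padding for one axis (assumed convention)."""
    out_dim = math.ceil(in_dim / stride)
    total = max((out_dim - 1) * stride + kernel - in_dim, 0)
    leading = total // 2
    return leading, total - leading


def padded_out_dim(in_dim, kernel, stride=1):
    """Output length along one axis after applying same_padding."""
    leading, trailing = same_padding(in_dim, kernel, stride)
    return (in_dim + leading + trailing - kernel) // stride + 1


for k in (3, 4, 5):
    print(k, same_padding(32, k), padded_out_dim(32, k))
```

With stride 1 the output length equals the input length for every kernel size, which is the property the identity shortcut relies on when X is added to the output of the second BatchNorm2D.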
WavenetResidualModule
class numpy_ml.neural_nets.modules.WavenetResidualModule(ch_residual, ch_dilation, dilation, kernel_width, optimizer=None, init='glorot_uniform')
A WaveNet-like residual block with causal dilated convolutions.
    *Skip path in* >-------------------------------------------> + ---> *Skip path out*
                      Causal      |--> Tanh --|                  |
    *Main    |--> Dilated Conv1D -|           * --> 1x1 Conv1D --|
     path >--|                    |--> Sigm --|                  |
     in*     |-------------------------------------------------> + ---> *Main path out*
                                 *Residual path*
On the final block, the output of the skip path is further processed to produce the network predictions.
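The data flow in the diagram can be traced for a single timestep. The sketch below is a plain-Python reading of the diagram, not the library's code: it assumes the dilated convolution output feeds both the Tanh and Sigm branches, that the 1x1 convolution acts as a matrix multiply over channels, and that a missing skip input is treated as zero. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(W, v):
    """A 1x1 convolution at one timestep is a matrix-vector product over channels."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def wavenet_step(x_main, x_skip, dilated_out, W_1x1):
    """One timestep of the block.

    x_main, x_skip: vectors with ch_residual entries (x_skip may be None).
    dilated_out:    output of the causal dilated conv, ch_dilation entries.
    W_1x1:          ch_residual x ch_dilation weight matrix.
    """
    # Gated activation: elementwise product of the Tanh and Sigm branches.
    gated = [math.tanh(c) * sigmoid(c) for c in dilated_out]
    z = conv1x1(W_1x1, gated)
    # Residual path: add the block input to the 1x1 conv output.
    main_out = [x + zi for x, zi in zip(x_main, z)]
    # Skip path: accumulate the same 1x1 conv output (assumed zero if no skip input).
    skip_out = list(z) if x_skip is None else [s + zi for s, zi in zip(x_skip, z)]
    return main_out, skip_out


main_out, skip_out = wavenet_step([2.0], [3.0], [1.0, 0.0], [[1.0, 1.0]])
print(main_out, skip_out)
```

Both outputs differ from their inputs by the same amount, the 1x1 convolution of the gated activation, which is why stacking blocks lets the skip path accumulate contributions from every layer.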
References
[1] van den Oord et al. (2016). “Wavenet: a generative model for raw audio”. https://arxiv.org/pdf/1609.03499.pdf
Parameters: - ch_residual (int) – The number of output channels for the 1x1 Conv1D layer in the main path.
- ch_dilation (int) – The number of output channels for the causal dilated Conv1D layer in the main path.
- dilation (int) – The dilation rate for the causal dilated Conv1D layer in the main path.
- kernel_width (int) – The width of the causal dilated Conv1D kernel in the main path.
- init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
- optimizer (str or Optimizer object or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
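How dilation and kernel_width interact is easiest to see on a single-channel signal. The sketch below is a minimal causal dilated convolution in plain Python, assuming zero padding on the left so that each output depends only on the current and earlier samples; the function names are hypothetical and this is not the library's Conv1D implementation.

```python
def causal_dilated_conv1d(x, w, dilation=1):
    """Single-channel causal convolution; w[-1] multiplies the current sample."""
    K = len(w)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(K):
            idx = t - (K - 1 - k) * dilation
            if idx >= 0:  # samples before the start are treated as zero
                acc += w[k] * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_width, dilation):
    """Number of input timesteps spanned by one output of a single layer."""
    return (kernel_width - 1) * dilation + 1


x = [1, 2, 3, 4, 5]
y = causal_dilated_conv1d(x, w=[1, 1], dilation=2)  # y[t] = x[t - 2] + x[t]
print(y, receptive_field(kernel_width=2, dilation=2))
```

Doubling the dilation from one block to the next is the usual way to grow the receptive field quickly without increasing kernel_width.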
derived_variables
A dictionary of intermediate values computed during the forward/backward passes.
forward(X_main, X_skip=None)
Compute the module output on a single minibatch.
Parameters: - X_main (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
- X_skip (ndarray of shape (n_ex, in_rows, in_cols, in_ch), or None) – The output of the preceding skip-connection if this is not the first module in the network.
Returns: - X_main (