# Layers

## LayerBase

class numpy_ml.neural_nets.layers.layers.LayerBase(optimizer=None)[source]

An abstract base class inherited by all neural network layers

forward(z, **kwargs)[source]

Perform a forward pass through the layer

backward(out, **kwargs)[source]

Perform a backward pass through the layer

freeze()[source]

Freeze the layer parameters at their current values so they can no longer be updated.

unfreeze()[source]

Unfreeze the layer parameters so they can be updated.

flush_gradients()[source]

Erase all the layer’s derived variables and gradients.

update(cur_loss=None)[source]

Update the layer parameters using the accrued gradients and layer optimizer. Flush all gradients once the update is complete.

set_params(summary_dict)[source]

Set the layer parameters from a dictionary of values.

Parameters:

• summary_dict (dict) – A dictionary of layer parameters and hyperparameters. If a required parameter or hyperparameter is not included within summary_dict, this method will use the value in the current layer’s summary() method.

Returns:

• layer (Layer object) – The newly-initialized layer.

summary()[source]

Return a dict of the layer parameters, hyperparameters, and ID.

## Add

class numpy_ml.neural_nets.layers.Add(act_fn=None, optimizer=None)[source]

An “addition” layer that returns the sum of its inputs, passed through an optional nonlinearity.

Parameters:

• act_fn (str, Activation object, or None) – The element-wise output nonlinearity used in computing the final output. If None, use the identity function $$f(x) = x$$. Default is None.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters:

• X (list of length n_inputs) – A list of tensors, all of the same shape.
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, *)) – The sum over the n_ex examples.

backward(dLdY, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters:

• dLdY (ndarray of shape (n_ex, *)) – The gradient of the loss wrt. the layer output Y.
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (list of length n_inputs) – The gradient of the loss wrt. each input in X.
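
The forward and backward behaviour described above can be sketched in plain NumPy. This is a minimal illustration rather than the library's implementation: the helper names `add_forward` and `add_backward` are hypothetical, and the backward pass assumes the identity activation, in which case every input receives the same upstream gradient.

```python
import numpy as np


def add_forward(X, act_fn=None):
    """Sum a list of equally shaped arrays, then apply an optional nonlinearity."""
    out = np.sum(np.stack(X, axis=0), axis=0)
    return out if act_fn is None else act_fn(out)


def add_backward(dLdY, n_inputs):
    """With the identity activation, d(sum)/d(input_i) = 1 for every input."""
    return [dLdY.copy() for _ in range(n_inputs)]


X = [np.ones((2, 3)), 2 * np.ones((2, 3)), 3 * np.ones((2, 3))]
Y = add_forward(X)
dX = add_backward(np.full((2, 3), 0.5), n_inputs=len(X))
```

With a non-identity `act_fn`, the upstream gradient would first be multiplied by the derivative of the activation before being copied to each input.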

## BatchNorm1D

class numpy_ml.neural_nets.layers.BatchNorm1D(momentum=0.9, epsilon=1e-05, optimizer=None)[source]

A batch normalization layer for 1D inputs.

Notes

BatchNorm is an attempt to address the problem of internal covariate shift (ICS) during training by normalizing layer inputs.

ICS refers to the change in the distribution of layer inputs during training as a result of the changing parameters of the previous layer(s). ICS can make it difficult to train models with saturating nonlinearities, and in general can slow training by requiring a lower learning rate.

Equations [train]:

Y = scaler * norm(X) + intercept
norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)


Equations [test]:

Y = scaler * running_norm(X) + intercept
running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)


In contrast to LayerNorm1D, the BatchNorm layer calculates the mean and var across the batch rather than the output features. This has two disadvantages:

1. It is highly affected by batch size: smaller mini-batch sizes increase the variance of the estimates for the global mean and variance.

2. It is difficult to apply in RNNs – one must fit a separate BatchNorm layer for each time-step.
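
The train and test equations above can be sketched as follows. This is a NumPy illustration under stated assumptions, not the library's code: the function names are hypothetical, and the running statistics are assumed to follow an exponential moving average weighted by `momentum`, which matches the description of the momentum parameter but is not spelled out in the equations.

```python
import numpy as np


def batchnorm1d_train(X, scaler, intercept, running_mean, running_var,
                      momentum=0.9, epsilon=1e-5):
    """Normalize with the statistics of the current batch (axis 0 = examples)."""
    mean, var = X.mean(axis=0), X.var(axis=0)
    Y = scaler * (X - mean) / np.sqrt(var + epsilon) + intercept
    # Assumed update rule: exponential moving average of the batch statistics
    running_mean = momentum * running_mean + (1.0 - momentum) * mean
    running_var = momentum * running_var + (1.0 - momentum) * var
    return Y, running_mean, running_var


def batchnorm1d_test(X, scaler, intercept, running_mean, running_var,
                     epsilon=1e-5):
    """Normalize with the stored running statistics instead of batch statistics."""
    return scaler * (X - running_mean) / np.sqrt(running_var + epsilon) + intercept


rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
scaler, intercept = np.ones(4), np.zeros(4)
Y, rm, rv = batchnorm1d_train(X, scaler, intercept, np.zeros(4), np.ones(4))
Y_test = batchnorm1d_test(X, scaler, intercept, rm, rv)
```

After the training call, each column of `Y` has approximately zero mean and unit variance, while `Y_test` depends only on the stored running estimates.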

Parameters:

• momentum (float) – The momentum term for the running mean/running std calculations. The closer this is to 1, the less weight will be given to the mean/std of the current batch (i.e., higher smoothing). Default is 0.9.
• epsilon (float) – A small smoothing constant to use during computation of norm(X) to avoid divide-by-zero errors. Default is 1e-5.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

reset_running_stats()[source]

Reset the running mean and variance estimates to 0 and 1, respectively.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters:

• X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples.
• retain_derived (bool) – Whether to use the current input to adjust the running mean and running_var computations. Setting this to False is the same as freezing the layer for the current input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, n_in)) – Layer output for each of the n_ex examples.

backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters:

• dLdy (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer output Y.
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer input X.

## BatchNorm2D

class numpy_ml.neural_nets.layers.BatchNorm2D(momentum=0.9, epsilon=1e-05, optimizer=None)[source]

A batch normalization layer for two-dimensional inputs with an additional channel dimension.

Notes

BatchNorm is an attempt to address the problem of internal covariate shift (ICS) during training by normalizing layer inputs.

ICS refers to the change in the distribution of layer inputs during training as a result of the changing parameters of the previous layer(s). ICS can make it difficult to train models with saturating nonlinearities, and in general can slow training by requiring a lower learning rate.

Equations [train]:

Y = scaler * norm(X) + intercept
norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)


Equations [test]:

Y = scaler * running_norm(X) + intercept
running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)


In contrast to LayerNorm2D, the BatchNorm layer calculates the mean and var across the batch rather than the output features. This has two disadvantages:

1. It is highly affected by batch size: smaller mini-batch sizes increase the variance of the estimates for the global mean and variance.

2. It is difficult to apply in RNNs – one must fit a separate BatchNorm layer for each time-step.
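
For a 4D input volume, the training-time equation can be sketched as below. The sketch assumes that one mean and one variance are computed per channel, pooled over the batch and both spatial axes; that pooling choice and the function name `batchnorm2d_train` are assumptions made for illustration.

```python
import numpy as np


def batchnorm2d_train(X, scaler, intercept, epsilon=1e-5):
    """X has shape (n_ex, in_rows, in_cols, in_ch); statistics are per channel."""
    mean = X.mean(axis=(0, 1, 2))  # shape (in_ch,)
    var = X.var(axis=(0, 1, 2))    # shape (in_ch,)
    return scaler * (X - mean) / np.sqrt(var + epsilon) + intercept


rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=3.0, size=(8, 5, 5, 3))
Y = batchnorm2d_train(X, scaler=np.ones(3), intercept=np.zeros(3))
```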

Parameters:

• momentum (float) – The momentum term for the running mean/running std calculations. The closer this is to 1, the less weight will be given to the mean/std of the current batch (i.e., higher smoothing). Default is 0.9.
• epsilon (float) – A small smoothing constant to use during computation of norm(X) to avoid divide-by-zero errors. Default is 1e-5.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

reset_running_stats()[source]

Reset the running mean and variance estimates to 0 and 1, respectively.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Notes

Equations [train]:

Y = scaler * norm(X) + intercept
norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)


Equations [test]:

Y = scaler * running_norm(X) + intercept
running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)


In contrast to LayerNorm2D, the BatchNorm layer calculates the mean and var across the batch rather than the output features.

Parameters:

• X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – Input volume containing the in_rows x in_cols-dimensional features for a minibatch of n_ex examples.
• retain_derived (bool) – Whether to use the current input to adjust the running mean and running_var computations. Setting this to False is the same as freezing the layer for the current input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – Layer output for each of the n_ex examples.

backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters:

• dLdy (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss wrt. the layer output Y.
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss wrt. the layer input X.

## Conv1D

class numpy_ml.neural_nets.layers.Conv1D(out_ch, kernel_width, pad=0, stride=1, dilation=0, act_fn=None, init='glorot_uniform', optimizer=None)[source]

Apply a one-dimensional convolution kernel over an input volume.

Notes

Equations:

out = act_fn(pad(X) * W + b)
out_dim = floor(1 + (n_rows_in + pad_left + pad_right - kernel_width) / stride)


where ‘*’ denotes the cross-correlation operation with stride s and dilation d.
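
A loop-based sketch of the cross-correlation with stride and dilation is given below. It is an illustration under assumptions, not the vectorized routine the layer uses: the function name `conv1d_naive`, the weight layout `(kernel_width, in_ch, out_ch)`, and the use of the effective (dilated) kernel width in the output-length formula are choices made here. With `dilation=0`, the output length reduces to the `out_dim` equation above.

```python
import numpy as np


def conv1d_naive(X, W, b, stride=1, pad=(0, 0), dilation=0):
    """X: (n_ex, l_in, in_ch); W: (kernel_width, in_ch, out_ch); b: (out_ch,)."""
    n_ex, l_in, _ = X.shape
    kernel_width, _, out_ch = W.shape
    pad_left, pad_right = pad
    X_pad = np.pad(X, ((0, 0), (pad_left, pad_right), (0, 0)))  # zero padding

    # Width spanned by the kernel once `dilation` gaps separate its taps
    eff_width = kernel_width * (dilation + 1) - dilation
    l_out = (l_in + pad_left + pad_right - eff_width) // stride + 1

    Y = np.zeros((n_ex, l_out, out_ch))
    for t in range(l_out):
        # Input positions touched by the kernel at output step t
        idx = t * stride + np.arange(kernel_width) * (dilation + 1)
        window = X_pad[:, idx, :]  # (n_ex, kernel_width, in_ch)
        Y[:, t, :] = np.tensordot(window, W, axes=([1, 2], [0, 1])) + b
    return Y


X = np.arange(10, dtype=float).reshape(1, 10, 1)
W, b = np.ones((3, 1, 1)), np.zeros(1)
Y_strided = conv1d_naive(X, W, b, stride=2)    # windows start at 0, 2, 4, 6
Y_dilated = conv1d_naive(X, W, b, dilation=1)  # taps at t, t + 2, t + 4
```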

Parameters:

• out_ch (int) – The number of filters/kernels to compute in the current layer.
• kernel_width (int) – The width of a single 1D filter/kernel in the current layer.
• act_fn (str, Activation object, or None) – The activation function for computing Y[t]. If None, use the identity function $$f(x) = x$$ by default. Default is None.
• pad (int, tuple, or {'same', 'causal'}) – The number of rows/columns to zero-pad the input with. If ‘same’, calculate padding to ensure the output length matches the input length. If ‘causal’, compute padding such that the output both has the same length as the input AND output[t] does not depend on input[t + 1:]. Default is 0.
• stride (int) – The stride/hop of the convolution kernels as they move over the input volume. Default is 1.
• dilation (int) – Number of pixels inserted between kernel elements. Effective kernel shape after dilation is: [kernel_rows * (d + 1) - d, kernel_cols * (d + 1) - d]. Default is 0.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output given input volume X.

Parameters:

• X (ndarray of shape (n_ex, l_in, in_ch)) – The input volume consisting of n_ex examples, each of length l_in and with in_ch input channels.
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, l_out, out_ch)) – The layer output.

backward(dLdy, retain_grads=True)[source]

Compute the gradient of the loss with respect to the layer parameters.

Notes

Relies on im2col() and col2im() to vectorize the gradient calculation. See the private method _backward_naive() for a more straightforward implementation.

Parameters:

• dLdy (ndarray of shape (n_ex, l_out, out_ch) or list of arrays) – The gradient(s) of the loss with respect to the layer output(s).
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (ndarray of shape (n_ex, l_in, in_ch)) – The gradient of the loss with respect to the layer input volume.

## Conv2D

class numpy_ml.neural_nets.layers.Conv2D(out_ch, kernel_shape, pad=0, stride=1, dilation=0, act_fn=None, optimizer=None, init='glorot_uniform')[source]

Apply a two-dimensional convolution kernel over an input volume.

Notes

Equations:

out = act_fn(pad(X) * W + b)
n_rows_out = floor(1 + (n_rows_in + pad_left + pad_right - filter_rows) / stride)
n_cols_out = floor(1 + (n_cols_in + pad_top + pad_bottom - filter_cols) / stride)


where ‘*’ denotes the cross-correlation operation with stride s and dilation d.
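
The output-size equations can be checked with a small helper. The function `conv2d_out_shape` and its padding tuple layout (two values for the row axis, then two for the column axis) are hypothetical, introduced only to evaluate the formulas above for dilation 0.

```python
import math


def conv2d_out_shape(in_rows, in_cols, kernel_shape, pad=(0, 0, 0, 0), stride=1):
    """Evaluate the output-size equations for a 2D cross-correlation.

    `pad` is assumed to hold (row_before, row_after, col_before, col_after).
    """
    filter_rows, filter_cols = kernel_shape
    pr1, pr2, pc1, pc2 = pad
    out_rows = math.floor(1 + (in_rows + pr1 + pr2 - filter_rows) / stride)
    out_cols = math.floor(1 + (in_cols + pc1 + pc2 - filter_cols) / stride)
    return out_rows, out_cols


same_shape = conv2d_out_shape(32, 32, (3, 3), pad=(1, 1, 1, 1), stride=1)
strided_shape = conv2d_out_shape(28, 28, (5, 5), stride=2)
print(same_shape)     # (32, 32)
print(strided_shape)  # (12, 12)
```

A 3 x 3 kernel with one pixel of padding on every side and stride 1 preserves the spatial size, while a 5 x 5 kernel with stride 2 and no padding maps 28 x 28 to 12 x 12.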

Parameters:

• out_ch (int) – The number of filters/kernels to compute in the current layer.
• kernel_shape (2-tuple) – The dimension of a single 2D filter/kernel in the current layer.
• act_fn (str, Activation object, or None) – The activation function for computing Y[t]. If None, use the identity function $$f(X) = X$$ by default. Default is None.
• pad (int, tuple, or 'same') – The number of rows/columns to zero-pad the input with. Default is 0.
• stride (int) – The stride/hop of the convolution kernels as they move over the input volume. Default is 1.
• dilation (int) – Number of pixels inserted between kernel elements. Effective kernel shape after dilation is: [kernel_rows * (d + 1) - d, kernel_cols * (d + 1) - d]. Default is 0.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output given input volume X.

Parameters:

• X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The layer output.

backward(dLdy, retain_grads=True)[source]

Compute the gradient of the loss with respect to the layer parameters.

Notes

Relies on im2col() and col2im() to vectorize the gradient calculation.

See the private method _backward_naive() for a more straightforward implementation.

Parameters:

• dLdy (ndarray of shape (n_ex, out_rows, out_cols, out_ch) or list of arrays) – The gradient(s) of the loss with respect to the layer output(s).
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss with respect to the layer input volume.

## Deconv2D

class numpy_ml.neural_nets.layers.Deconv2D(out_ch, kernel_shape, pad=0, stride=1, act_fn=None, optimizer=None, init='glorot_uniform')[source]

Apply a two-dimensional “deconvolution” to an input volume.

Notes

The term “deconvolution” in this context does not correspond with the deconvolution operation in mathematics. More accurately, this layer is computing a transposed convolution / fractionally-strided convolution.

Parameters:

• out_ch (int) – The number of filters/kernels to compute in the current layer.
• kernel_shape (2-tuple) – The dimension of a single 2D filter/kernel in the current layer.
• act_fn (str, Activation object, or None) – The activation function for computing Y[t]. If None, use Affine activations by default. Default is None.
• pad (int, tuple, or 'same') – The number of rows/columns to zero-pad the input with. Default is 0.
• stride (int) – The stride/hop of the convolution kernels as they move over the input volume. Default is 1.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output given input volume X.

Parameters:

• X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch).
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The layer output.

backward(dLdY, retain_grads=True)[source]

Compute the gradient of the loss with respect to the layer parameters.

Notes

Relies on im2col() and col2im() to vectorize the gradient calculations.

Parameters:

• dLdY (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The gradient of the loss with respect to the layer output.
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss with respect to the layer input volume.

## DotProductAttention

class numpy_ml.neural_nets.layers.DotProductAttention(scale=True, dropout_p=0, init='glorot_uniform', optimizer=None)[source]

A single “attention head” layer using a dot-product for the scoring function.

Notes

The equations for a dot product attention layer are:

$\begin{split}\mathbf{Z} &= \mathbf{K Q}^\top \ \ \ \ &&\text{if scale = False} \\ &= \mathbf{K Q}^\top / \sqrt{d_k} \ \ \ \ &&\text{if scale = True} \\ \mathbf{Y} &= \text{dropout}(\text{softmax}(\mathbf{Z})) \mathbf{V}\end{split}$

Parameters:

• scale (bool) – Whether to scale the key-query dot product by the square root of the key/query vector dimensionality before applying the Softmax. This is useful, since the scale of the dot product will otherwise increase as query / key dimensions grow. Default is True.
• dropout_p (float in [0, 1)) – The dropout probability during training, applied to the output of the softmax. If 0, no dropout is applied. Default is 0.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’. Unused.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None. Unused.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

freeze()[source]

Freeze the layer parameters at their current values so they can no longer be updated.

unfreeze()[source]

Unfreeze the layer parameters so they can be updated.

forward(Q, K, V, retain_derived=True)[source]

Compute the attention-weighted output of a collection of keys, values, and queries.

Notes

In the most abstract (i.e., hand-wave-y) sense:

• Query vectors ask questions
• Key vectors advertise their relevancy to questions
• Value vectors give possible answers to questions
• The dot product between Key and Query vectors provides scores for each of the n_ex different Value vectors

For a single query and n key-value pairs, dot-product attention (with scaling) is:

w0 = dropout(softmax( (query @ key[0]) / sqrt(d_k) ))
w1 = dropout(softmax( (query @ key[1]) / sqrt(d_k) ))
...
wn = dropout(softmax( (query @ key[n]) / sqrt(d_k) ))

y = np.array([w0, ..., wn]) @ values
(1 × n_ex)      (n_ex × d_v)


In words, keys and queries are combined via dot-product to produce a score, which is then passed through a softmax to produce a weight on each value vector in Values. We elementwise multiply each value vector by its weight, and then take the elementwise sum of each weighted value vector to get the $$1 \times d_v$$ output for the current example.

In vectorized form,

$\mathbf{Y} = \text{dropout}( \text{softmax}(\mathbf{KQ}^\top / \sqrt{d_k}) ) \mathbf{V}$

Parameters:

• Q (ndarray of shape (n_ex, *, d_k)) – A set of n_ex query vectors packed into a single matrix. Optional middle dimensions can be used to specify, e.g., the number of parallel attention heads.
• K (ndarray of shape (n_ex, *, d_k)) – A set of n_ex key vectors packed into a single matrix. Optional middle dimensions can be used to specify, e.g., the number of parallel attention heads.
• V (ndarray of shape (n_ex, *, d_v)) – A set of n_ex value vectors packed into a single matrix. Optional middle dimensions can be used to specify, e.g., the number of parallel attention heads.
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, *, d_v)) – The attention-weighted output values.

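
The vectorized form can be sketched in NumPy as follows. This is an illustration under assumptions rather than the layer's code: dropout is omitted, the helper names are hypothetical, and the scores are arranged so that row i holds the weights for query i (the product is taken as `Q @ K.T`, whereas the Notes write it as KQ^T).

```python
import numpy as np


def softmax(Z, axis=-1):
    """Numerically stable softmax along `axis`."""
    Z = Z - Z.max(axis=axis, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=axis, keepdims=True)


def dot_product_attention(Q, K, V, scale=True):
    """Q, K: (n_ex, d_k); V: (n_ex, d_v). Dropout is omitted in this sketch."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1)  # one row of scores per query
    if scale:
        scores = scores / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 2))
Y, weights = dot_product_attention(Q, K, V)
```

When all keys are identical, every query assigns uniform weights and each output row reduces to the mean of the value vectors.
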
backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters:

• dLdy (ndarray of shape (n_ex, *, d_v)) – The gradient of the loss wrt. the layer output Y.
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dQ (ndarray of shape (n_ex, *, d_k) or list of arrays) – The gradient of the loss wrt. the layer query matrix/matrices Q.
• dK (ndarray of shape (n_ex, *, d_k) or list of arrays) – The gradient of the loss wrt. the layer key matrix/matrices K.
• dV (ndarray of shape (n_ex, *, d_v) or list of arrays) – The gradient of the loss wrt. the layer value matrix/matrices V.

## Embedding

class numpy_ml.neural_nets.layers.Embedding(n_out, vocab_size, pool=None, init='glorot_uniform', optimizer=None)[source]

An embedding layer.

Notes

Equations:

Y = W[x]


NB. This layer must be the first in a neural network as the gradients do not get passed back through to the inputs.

Parameters:

• n_out (int) – The dimensionality of the embeddings.
• vocab_size (int) – The total number of items in the vocabulary. All integer indices are expected to range between 0 and vocab_size - 1.
• pool ({'sum', 'mean', None}) – If not None, apply this function to the collection of n_in encodings in each example to produce a single, pooled embedding. Default is None.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

lookup(ids)[source]

Return the embeddings associated with the IDs in ids.

Parameters:

• ids (ndarray of shape (M,)) – An array of M IDs to retrieve embeddings for.

Returns:

• embeddings (ndarray of shape (M, n_out)) – The embedding vectors for each of the M IDs.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Notes

Equations:

Y = W[x]

Parameters:

• X (ndarray of shape (n_ex, n_in) or list of length n_ex) – Layer input, representing a minibatch of n_ex examples. If self.pool is None, each example must consist of exactly n_in integer token IDs. Otherwise, X can be a ragged array, with each example consisting of a variable number of token IDs.
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through with regard to this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, n_in, n_out)) – Embeddings for each coordinate of each of the n_ex examples.

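
The lookup `Y = W[x]` and the optional pooling step can be sketched with NumPy integer indexing. The function name `embed` and the `(vocab_size, n_out)` layout of `W` are assumptions for illustration; the pooled branch returns one vector per example, which is how the description of `pool` is interpreted here.

```python
import numpy as np


def embed(W, X, pool=None):
    """W: (vocab_size, n_out) embedding table; X: sequences of integer IDs."""
    if pool is None:
        # Every example must have the same number of IDs
        return W[np.asarray(X)]  # (n_ex, n_in, n_out)
    reduce_fn = {"sum": np.sum, "mean": np.mean}[pool]
    # Ragged input is allowed: each example is pooled to a single vector
    return np.stack([reduce_fn(W[np.asarray(ids)], axis=0) for ids in X])


W = np.arange(12, dtype=float).reshape(6, 2)     # vocab_size = 6, n_out = 2
Y = embed(W, [[0, 1], [2, 3]])                   # shape (2, 2, 2)
Y_sum = embed(W, [[0, 1, 2], [5]], pool="sum")   # shape (2, 2)
```
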
backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to embedding weights.

Notes

Because the items in X are interpreted as indices, we cannot compute the gradient of the layer output wrt. X.

Parameters:

• dLdy (ndarray of shape (n_ex, n_in, n_out) or list of arrays) – The gradient(s) of the loss wrt. the layer output(s).
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

## Flatten

class numpy_ml.neural_nets.layers.Flatten(keep_dim='first', optimizer=None)[source]

Flatten a multidimensional input into a 2D matrix.

Parameters:

• keep_dim ({'first', 'last', -1}) – The dimension of the original input to retain. Typically used for retaining the minibatch dimension. If -1, flatten all dimensions. Default is ‘first’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters:

• X (ndarray) – Input volume to flatten.
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (*out_dims)) – Flattened output. If keep_dim is ‘first’, X is reshaped to (X.shape[0], -1); if keep_dim is ‘last’, to (-1, X.shape[-1]).

backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters:

• dLdy (ndarray of shape (*out_dims)) – The gradient of the loss wrt. the layer output Y.
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dX (ndarray of shape (*in_dims) or list of arrays) – The gradient of the loss wrt. the layer input(s) X.
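
For the default `keep_dim='first'` case, the forward and backward passes amount to a pair of reshapes. The sketch below uses hypothetical helper names and covers only that case: the forward pass remembers the input shape, and the backward pass restores it.

```python
import numpy as np


def flatten_forward(X):
    """Keep the first (minibatch) dimension and flatten the rest."""
    in_dims = X.shape
    return X.reshape(X.shape[0], -1), in_dims


def flatten_backward(dLdY, in_dims):
    """Flatten has no parameters, so backprop only undoes the reshape."""
    return dLdY.reshape(in_dims)


X = np.arange(24, dtype=float).reshape(2, 3, 4)
Y, in_dims = flatten_forward(X)            # Y has shape (2, 12)
dX = flatten_backward(np.ones_like(Y), in_dims)
```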

## FullyConnected

class numpy_ml.neural_nets.layers.FullyConnected(n_out, act_fn=None, init='glorot_uniform', optimizer=None)[source]

A fully-connected (dense) layer.

Notes

A fully connected layer computes the function

$\mathbf{Y} = f( \mathbf{WX} + \mathbf{b} )$

where f is the activation nonlinearity, W and b are parameters of the layer, and X is the minibatch of input examples.
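
A NumPy sketch of the forward pass and its gradients is given below. It makes two assumptions that are not stated in the Notes: examples are stored as rows of X, so the affine map is written `X @ W + b` with W of shape `(n_in, n_out)`, and f is fixed to tanh so that the derivative can be written explicitly. The function names are hypothetical.

```python
import numpy as np


def fc_forward_tanh(X, W, b):
    """X: (n_ex, n_in); W: (n_in, n_out); b: (n_out,)."""
    Z = X @ W + b
    return np.tanh(Z), Z


def fc_backward_tanh(dLdY, X, W, Z):
    """Chain rule through tanh, then through the affine map."""
    dZ = dLdY * (1.0 - np.tanh(Z) ** 2)  # tanh'(Z) = 1 - tanh(Z)^2
    dW = X.T @ dZ                        # (n_in, n_out)
    db = dZ.sum(axis=0)                  # (n_out,)
    dX = dZ @ W.T                        # (n_ex, n_in)
    return dX, dW, db


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 2))
b = np.zeros(2)
Y, Z = fc_forward_tanh(X, W, b)
dX, dW, db = fc_backward_tanh(np.ones_like(Y), X, W, Z)
```

Passing `np.ones_like(Y)` as the upstream gradient corresponds to the loss L = sum(Y), which makes the result easy to compare against a finite-difference estimate.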

Parameters:

• n_out (int) – The dimensionality of the layer output.
• act_fn (str, Activation object, or None) – The element-wise output nonlinearity used in computing Y. If None, use the identity function $$f(X) = X$$. Default is None.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters:

• X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples.
• retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.

Returns:

• Y (ndarray of shape (n_ex, n_out)) – Layer output for each of the n_ex examples.

backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters:

• dLdy (ndarray of shape (n_ex, n_out) or list of arrays) – The gradient(s) of the loss wrt. the layer output(s).
• retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.

Returns:

• dLdX (ndarray of shape (n_ex, n_in) or list of arrays) – The gradient of the loss wrt. the layer input(s) X.

## LSTM

class numpy_ml.neural_nets.layers.LSTM(n_out, act_fn='Tanh', gate_fn='Sigmoid', init='glorot_uniform', optimizer=None)[source]

A single long short-term memory (LSTM) RNN layer.

Parameters:

• n_out (int) – The dimension of a single hidden state / output on a given timestep.
• act_fn (str, Activation object, or None) – The activation function for computing A[t]. Default is ‘Tanh’.
• gate_fn (str, Activation object, or None) – The gate function for computing the update, forget, and output gates. Default is ‘Sigmoid’.
• init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’.
• optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.

hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X)[source]

Run a forward pass across all timesteps in the input.

Parameters:

• X (ndarray of shape (n_ex, n_in, n_t)) – Input consisting of n_ex examples each of dimensionality n_in and extending for n_t timesteps.

Returns:

• Y (ndarray of shape (n_ex, n_out, n_t)) – The value of the hidden state for each of the n_ex examples across each of the n_t timesteps.

backward(dLdA)[source]

Run a backward pass across all timesteps in the input.

Parameters:

• dLdA (ndarray of shape (n_ex, n_out, n_t)) – The gradient of the loss with respect to the layer output for each of the n_ex examples across all n_t timesteps.

Returns:

• dLdX (ndarray of shape (n_ex, n_in, n_t)) – The gradient of the loss with respect to the layer input for each of the n_ex examples across each of the n_t timesteps.

derived_variables[source]

Return a dictionary containing any intermediate variables computed during the forward / backward passes.

gradients[source]

Return a dictionary of the gradients computed during the backward pass

parameters[source]

Return a dictionary of the current layer parameters

freeze()[source]

Freeze the layer parameters at their current values so they can no longer be updated.

unfreeze()[source]

Unfreeze the layer parameters so they can be updated.

set_params(summary_dict)[source]

Set the layer parameters from a dictionary of values.

Parameters:

• summary_dict (dict) – A dictionary of layer parameters and hyperparameters. If a required parameter or hyperparameter is not included within summary_dict, this method will use the value in the current layer’s summary() method.

Returns:

• layer (Layer object) – The newly-initialized layer.

flush_gradients()[source]

Erase all the layer’s derived variables and gradients.

update()[source]

Update the layer parameters using the accrued gradients and layer optimizer. Flush all gradients once the update is complete.

## LSTMCell

class numpy_ml.neural_nets.layers.LSTMCell(n_out, act_fn='Tanh', gate_fn='Sigmoid', init='glorot_uniform', optimizer=None)[source]

A single step of a long short-term memory (LSTM) RNN.

Notes

Notation:

• Z[t] is the input to each of the gates at timestep t
• A[t] is the value of the hidden state at timestep t
• Cc[t] is the value of the candidate cell/memory state at timestep t
• C[t] is the value of the final cell/memory state at timestep t
• Gf[t] is the output of the forget gate at timestep t
• Gu[t] is the output of the update gate at timestep t
• Go[t] is the output of the output gate at timestep t

Equations:

Z[t]  = stack([A[t-1], X[t]])
Gf[t] = gate_fn(Wf @ Z[t] + bf)
Gu[t] = gate_fn(Wu @ Z[t] + bu)
Go[t] = gate_fn(Wo @ Z[t] + bo)
Cc[t] = act_fn(Wc @ Z[t] + bc)
C[t]  = Gf[t] * C[t-1] + Gu[t] * Cc[t]
A[t]  = Go[t] * act_fn(C[t])


where ‘@’ indicates the dot/matrix product and ‘*’ indicates elementwise multiplication.
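The equations above can be sketched for a single timestep in plain NumPy. The function name `lstm_step`, the `params` dictionary, the row-vector convention (`Z @ W` instead of `W @ Z`), the zero initial states, and the hard-coded Tanh/Sigmoid choices are assumptions for illustration; this is not the library's code.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(Xt, A_prev, C_prev, params):
    """One LSTM timestep.

    Xt: (n_ex, n_in); A_prev and C_prev: (n_ex, n_out).
    Uses a row-vector convention, so each W has shape (n_out + n_in, n_out).
    """
    Z = np.hstack([A_prev, Xt])  # Z[t] = stack([A[t-1], X[t]])
    Gf = sigmoid(Z @ params["Wf"] + params["bf"])  # forget gate
    Gu = sigmoid(Z @ params["Wu"] + params["bu"])  # update gate
    Go = sigmoid(Z @ params["Wo"] + params["bo"])  # output gate
    Cc = np.tanh(Z @ params["Wc"] + params["bc"])  # candidate cell state
    C = Gf * C_prev + Gu * Cc  # final cell state
    A = Go * np.tanh(C)  # hidden state
    return A, C


rng = np.random.default_rng(0)
n_ex, n_in, n_out = 4, 3, 5
params = {}
for g in "fuoc":
    params["W" + g] = rng.normal(scale=0.1, size=(n_out + n_in, n_out))
    params["b" + g] = np.zeros((1, n_out))

A0 = np.zeros((n_ex, n_out))  # assumed zero initial hidden state
C0 = np.zeros((n_ex, n_out))  # assumed zero initial cell state
A1, C1 = lstm_step(rng.normal(size=(n_ex, n_in)), A0, C0, params)
```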

Parameters: n_out (int) – The dimension of a single hidden state / output on a given timestep. act_fn (str, Activation object, or None) – The activation function for computing A[t]. Default is ‘Tanh’. gate_fn (str, Activation object, or None) – The gate function for computing the update, forget, and output gates. Default is ‘Sigmoid’. init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(Xt)[source]

Compute the layer output for a single timestep.

Parameters: Xt (ndarray of shape (n_ex, n_in)) – Input at timestep t consisting of n_ex examples each of dimensionality n_in.
Returns: At (ndarray of shape (n_ex, n_out)) – The value of the hidden state at timestep t for each of the n_ex examples. Ct (ndarray of shape (n_ex, n_out)) – The value of the cell/memory state at timestep t for each of the n_ex examples.
backward(dLdAt)[source]

Backprop for a single timestep.

Parameters: dLdAt (ndarray of shape (n_ex, n_out)) – The gradient of the loss wrt. the layer outputs (i.e., hidden states) at timestep t.
Returns: dLdXt (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer inputs at timestep t.
flush_gradients()[source]

Erase all the layer’s derived variables and gradients.

## LayerNorm1D

class numpy_ml.neural_nets.layers.LayerNorm1D(epsilon=1e-05, optimizer=None)[source]

A layer normalization layer for 1D inputs.

Notes

In contrast to BatchNorm1D, the LayerNorm layer calculates the mean and variance across features rather than across examples in the batch. This makes the mean and variance estimates independent of batch size and permits straightforward application in RNNs.

Equations [train & test]:

Y = scaler * norm(X) + intercept
norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)


Also in contrast to BatchNorm1D, scaler and intercept are applied elementwise to norm(X).
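A minimal NumPy sketch of these equations follows. The function name `layer_norm_1d` and the choice of `scaler` and `intercept` as length-`n_in` vectors are assumptions for illustration; the statistics are taken per example, over the feature axis.

```python
import numpy as np


def layer_norm_1d(X, scaler, intercept, epsilon=1e-5):
    """Normalize each row (example) of X over its n_in features."""
    mean = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    X_norm = (X - mean) / np.sqrt(var + epsilon)
    return scaler * X_norm + intercept  # applied elementwise


rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(4, 8))
Y = layer_norm_1d(X, scaler=np.ones(8), intercept=np.zeros(8))

# The result for one example does not depend on the rest of the batch
Y_single = layer_norm_1d(X[:1], scaler=np.ones(8), intercept=np.zeros(8))
```

Because each example is normalized with its own statistics, the output for a row is the same whether it is processed alone or as part of a larger minibatch.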

Parameters: epsilon (float) – A small smoothing constant to use during computation of norm(X) to avoid divide-by-zero errors. Default is 1e-5. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters: X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples. retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, n_in)) – Layer output for each of the n_ex examples.
backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters: dLdy (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer output Y. retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dX (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer input X.

## LayerNorm2D

class numpy_ml.neural_nets.layers.LayerNorm2D(epsilon=1e-05, optimizer=None)[source]

A layer normalization layer for 2D inputs with an additional channel dimension.

Notes

In contrast to BatchNorm2D, the LayerNorm layer calculates the mean and variance across features rather than across examples in the batch. This makes the mean and variance estimates independent of batch size and permits straightforward application in RNNs.

Equations [train & test]:

Y = scaler * norm(X) + intercept
norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)


Also in contrast to BatchNorm2D, scaler and intercept are applied elementwise to norm(X).
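For a 4D input volume, the same equations can be sketched by taking the statistics over every non-batch axis. The function name `layer_norm_2d`, the reduction over rows, columns, and channels together, and the per-element shapes of `scaler` and `intercept` are assumptions for illustration.

```python
import numpy as np


def layer_norm_2d(X, scaler, intercept, epsilon=1e-5):
    """Normalize each example of X over rows, columns, and channels."""
    axes = (1, 2, 3)
    mean = X.mean(axis=axes, keepdims=True)
    var = X.var(axis=axes, keepdims=True)
    X_norm = (X - mean) / np.sqrt(var + epsilon)
    return scaler * X_norm + intercept  # applied elementwise


rng = np.random.default_rng(0)
n_ex, in_rows, in_cols, in_ch = 2, 4, 4, 3
X = rng.normal(loc=-1.0, scale=3.0, size=(n_ex, in_rows, in_cols, in_ch))
scaler = np.ones((in_rows, in_cols, in_ch))
intercept = np.zeros((in_rows, in_cols, in_ch))
Y = layer_norm_2d(X, scaler, intercept)
```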

Parameters: epsilon (float) – A small smoothing constant to use during computation of norm(X) to avoid divide-by-zero errors. Default is 1e-5. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Notes

Equations [train & test]:

Y = scaler * norm(X) + intercept
norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)

Parameters: X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – Input volume containing the in_rows by in_cols-dimensional features for a minibatch of n_ex examples. retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – Layer output for each of the n_ex examples.
backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters: dLdy (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss wrt. the layer output Y. retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss wrt. the layer input X.

## Multiply

class numpy_ml.neural_nets.layers.Multiply(act_fn=None, optimizer=None)[source]

A multiplication layer that returns the elementwise product of its inputs, passed through an optional nonlinearity.

Parameters: act_fn (str, Activation object, or None) – The element-wise output nonlinearity used in computing the final output. If None, use the identity function $$f(x) = x$$. Default is None. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters: X (list of length n_inputs) – A list of tensors, all of the same shape. retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, *)) – The elementwise product of the inputs in X for each of the n_ex examples.
backward(dLdY, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters: dLdY (ndarray of shape (n_ex, *)) – The gradient of the loss wrt. the layer output Y. retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dX (list of length n_inputs) – The gradient of the loss wrt. each input in X.
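The forward and backward passes of an elementwise-product layer can be sketched as follows, assuming the identity nonlinearity (`act_fn=None`). The function names `multiply_forward` and `multiply_backward` are hypothetical. The gradient wrt. each input is the upstream gradient times the product of all the other inputs.

```python
import numpy as np


def multiply_forward(Xs):
    """Elementwise product of a list of same-shape arrays."""
    out = np.ones_like(Xs[0], dtype=float)
    for X in Xs:
        out = out * X
    return out


def multiply_backward(dLdY, Xs):
    """Gradient wrt. each input: dLdY times the product of the other inputs."""
    grads = []
    for i in range(len(Xs)):
        others = [X for j, X in enumerate(Xs) if j != i]
        prod_others = multiply_forward(others) if others else np.ones_like(dLdY)
        grads.append(dLdY * prod_others)
    return grads


X1 = np.array([[1.0, 2.0], [3.0, 4.0]])
X2 = np.array([[2.0, 0.0], [1.0, 5.0]])
X3 = np.array([[3.0, 1.0], [2.0, 2.0]])
Y = multiply_forward([X1, X2, X3])
dX = multiply_backward(np.ones_like(Y), [X1, X2, X3])
```

Computing the product of the other inputs directly, rather than dividing Y by each input, keeps the gradient well defined when an input contains zeros.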

## Pool2D

class numpy_ml.neural_nets.layers.Pool2D(kernel_shape, stride=1, pad=0, mode='max', optimizer=None)[source]

A single two-dimensional pooling layer.

Parameters: kernel_shape (2-tuple) – The dimension of a single 2D filter/kernel in the current layer. stride (int) – The stride/hop of the pooling kernel as it moves over the input volume. Default is 1. pad (int, tuple, or 'same') – The number of rows/columns of 0’s to pad the input. Default is 0. mode ({"max", "average"}) – The pooling function to apply. Default is "max". optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output given input volume X.

Parameters: X (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The input volume consisting of n_ex examples, each with dimension (in_rows, in_cols, in_ch). retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The layer output.
backward(dLdY, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters: dLdY (ndarray of shape (n_ex, out_rows, out_cols, out_ch)) – The gradient of the loss wrt. the layer output Y. retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dX (ndarray of shape (n_ex, in_rows, in_cols, in_ch)) – The gradient of the loss wrt. the layer input X.
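To make the input and output shapes concrete, here is a loop-based sketch of max pooling over a channels-last volume. It assumes `mode='max'` and no padding; the function name `max_pool2d` is hypothetical, and the explicit loops favor clarity over speed.

```python
import numpy as np


def max_pool2d(X, kernel_shape, stride=1):
    """Max pooling over an (n_ex, in_rows, in_cols, in_ch) volume, no padding."""
    n_ex, in_rows, in_cols, in_ch = X.shape
    fr, fc = kernel_shape
    out_rows = (in_rows - fr) // stride + 1
    out_cols = (in_cols - fc) // stride + 1
    Y = np.zeros((n_ex, out_rows, out_cols, in_ch))
    for i in range(out_rows):
        for j in range(out_cols):
            r0, c0 = i * stride, j * stride
            window = X[:, r0:r0 + fr, c0:c0 + fc, :]
            # Reduce over the spatial extent of the window, per channel
            Y[:, i, j, :] = window.max(axis=(1, 2))
    return Y


X = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
Y = max_pool2d(X, kernel_shape=(2, 2), stride=2)
```

Pooling acts on each channel independently, so the number of output channels equals the number of input channels.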

## RNN

class numpy_ml.neural_nets.layers.RNN(n_out, act_fn='Tanh', init='glorot_uniform', optimizer=None)[source]

A single vanilla (Elman)-RNN layer.

Parameters: n_out (int) – The dimension of a single hidden state / output on a given timestep. act_fn (str, Activation object, or None) – The activation function for computing A[t]. Default is ‘Tanh’. init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X)[source]

Run a forward pass across all timesteps in the input.

Parameters: X (ndarray of shape (n_ex, n_in, n_t)) – Input consisting of n_ex examples each of dimensionality n_in and extending for n_t timesteps.
Returns: Y (ndarray of shape (n_ex, n_out, n_t)) – The value of the hidden state for each of the n_ex examples across each of the n_t timesteps.
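The time-major unrolling implied by these shapes can be sketched as a loop over the last axis of X. The function name `rnn_forward`, the zero initial hidden state, the single merged bias `b`, the Tanh activation, and the row-vector convention are assumptions for illustration.

```python
import numpy as np


def rnn_forward(X, W_ax, W_aa, b):
    """Unroll a vanilla RNN over the last axis of X.

    X: (n_ex, n_in, n_t); returns hidden states of shape (n_ex, n_out, n_t).
    """
    n_ex, n_in, n_t = X.shape
    n_out = W_aa.shape[0]
    A_prev = np.zeros((n_ex, n_out))  # assumed zero initial hidden state
    Y = np.zeros((n_ex, n_out, n_t))
    for t in range(n_t):
        A_prev = np.tanh(X[:, :, t] @ W_ax + A_prev @ W_aa + b)
        Y[:, :, t] = A_prev
    return Y


rng = np.random.default_rng(0)
n_ex, n_in, n_out, n_t = 2, 3, 4, 5
X = rng.normal(size=(n_ex, n_in, n_t))
W_ax = rng.normal(scale=0.5, size=(n_in, n_out))
W_aa = rng.normal(scale=0.5, size=(n_out, n_out))
b = np.zeros(n_out)
Y = rnn_forward(X, W_ax, W_aa, b)
```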
backward(dLdA)[source]

Run a backward pass across all timesteps in the input.

Parameters: dLdA (ndarray of shape (n_ex, n_out, n_t)) – The gradient of the loss with respect to the layer output for each of the n_ex examples across all n_t timesteps.
Returns: dLdX (ndarray of shape (n_ex, n_in, n_t)) – The gradient of the loss with respect to the layer input for each of the n_ex examples across all n_t timesteps.
derived_variables[source]

Return a dictionary containing any intermediate variables computed during the forward / backward passes.

gradients[source]

Return a dictionary of the gradients computed during the backward pass.

parameters[source]

Return a dictionary of the current layer parameters.

set_params(summary_dict)[source]

Set the layer parameters from a dictionary of values.

Parameters: summary_dict (dict) – A dictionary of layer parameters and hyperparameters. If a required parameter or hyperparameter is not included within summary_dict, this method will use the value in the current layer’s summary() method.
Returns: layer (Layer object) – The newly-initialized layer.
freeze()[source]

Freeze the layer parameters at their current values so they can no longer be updated.

unfreeze()[source]

Unfreeze the layer parameters so they can be updated.

flush_gradients()[source]

Erase all the layer’s derived variables and gradients.

update()[source]

Update the layer parameters using the accrued gradients and layer optimizer. Flush all gradients once the update is complete.

## RNNCell

class numpy_ml.neural_nets.layers.RNNCell(n_out, act_fn='Tanh', init='glorot_uniform', optimizer=None)[source]

A single step of a vanilla (Elman) RNN.

Notes

At timestep t, the vanilla RNN cell computes

$\begin{split}\mathbf{Z}^{(t)} &= \mathbf{W}_{ax} \mathbf{X}^{(t)} + \mathbf{b}_{ax} + \mathbf{W}_{aa} \mathbf{A}^{(t-1)} + \mathbf{b}_{aa} \\ \mathbf{A}^{(t)} &= f(\mathbf{Z}^{(t)})\end{split}$

where

• $$\mathbf{X}^{(t)}$$ is the input at time t
• $$\mathbf{A}^{(t)}$$ is the hidden state at timestep t
• f is the layer activation function
• $$\mathbf{W}_{ax}$$ and $$\mathbf{b}_{ax}$$ are the weights and bias for the input-to-hidden layer
• $$\mathbf{W}_{aa}$$ and $$\mathbf{b}_{aa}$$ are the weights and bias for the hidden-to-hidden layer
Parameters: n_out (int) – The dimension of a single hidden state / output on a given timestep. act_fn (str, Activation object, or None) – The activation function for computing A[t]. Default is ‘Tanh’. init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(Xt)[source]

Compute the layer output for a single timestep.

Parameters: Xt (ndarray of shape (n_ex, n_in)) – Input at timestep t consisting of n_ex examples each of dimensionality n_in.
Returns: At (ndarray of shape (n_ex, n_out)) – The value of the hidden state at timestep t for each of the n_ex examples.
backward(dLdAt)[source]

Backprop for a single timestep.

Parameters: dLdAt (ndarray of shape (n_ex, n_out)) – The gradient of the loss wrt. the layer outputs (i.e., hidden states) at timestep t.
Returns: dLdXt (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer inputs at timestep t.
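A single-timestep forward and backward pass for the vanilla RNN cell can be sketched as below, assuming f is Tanh and a row-vector convention. The function names are hypothetical. The backward function only propagates the gradient arriving at A[t] to X[t]; it ignores any gradient flowing back from later timesteps through the hidden-to-hidden weights.

```python
import numpy as np


def rnn_cell_forward(Xt, A_prev, W_ax, b_ax, W_aa, b_aa):
    """A[t] = tanh(X[t] W_ax + b_ax + A[t-1] W_aa + b_aa)."""
    Zt = Xt @ W_ax + b_ax + A_prev @ W_aa + b_aa
    return np.tanh(Zt)


def rnn_cell_backward(dLdAt, At, W_ax):
    """Gradient of the loss wrt. X[t], given the gradient wrt. A[t]."""
    dLdZt = dLdAt * (1.0 - At ** 2)  # derivative of tanh
    return dLdZt @ W_ax.T


rng = np.random.default_rng(0)
n_ex, n_in, n_out = 2, 3, 4
Xt = rng.normal(size=(n_ex, n_in))
A_prev = rng.normal(size=(n_ex, n_out))
W_ax = rng.normal(size=(n_in, n_out))
W_aa = rng.normal(size=(n_out, n_out))
b_ax = np.zeros(n_out)
b_aa = np.zeros(n_out)

At = rnn_cell_forward(Xt, A_prev, W_ax, b_ax, W_aa, b_aa)
# Take the loss to be sum(A[t]), so the upstream gradient is all ones
dLdXt = rnn_cell_backward(np.ones_like(At), At, W_ax)
```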
flush_gradients()[source]

Erase all the layer’s derived variables and gradients.

## RBM

class numpy_ml.neural_nets.layers.RBM(n_out, K=1, init='glorot_uniform', optimizer=None)[source]

A Restricted Boltzmann machine with Bernoulli visible and hidden units.

Parameters: n_out (int) – The number of output dimensions/units. K (int) – The number of contrastive divergence steps to run before computing a single gradient update. Default is 1. init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

CD_update(X)[source]

Perform a single contrastive divergence-k training update using the visible inputs X as a starting point for the Gibbs sampler.

Parameters: X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples. Each feature in X should ideally be binary-valued, although it is possible to also train on real-valued features ranging between (0, 1) (e.g., grayscale images).
forward(V, K=None, retain_derived=True)[source]

Perform the CD-k “forward pass” of visible inputs into hidden units and back.

Notes

This implementation follows [1]’s recommendations for the RBM forward pass:

• Use real-valued probabilities for both the data and the visible unit reconstructions.
• Only the final update of the hidden units should use the actual probabilities – all others should be sampled binary states.
• When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states.

References

 [1] Hinton, G. (2010). “A practical guide to training restricted Boltzmann machines”. UTML TR 2010-003
Parameters: V (ndarray of shape (n_ex, n_in)) – Visible input, representing the n_in-dimensional features for a minibatch of n_ex examples. Each feature in V should ideally be binary-valued, although it is possible to also train on real-valued features ranging between (0, 1) (e.g., grayscale images). K (int) – The number of contrastive divergence steps to run before computing the gradient update. If None, use self.K. Default is None. retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
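The three recommendations listed in the notes can be sketched as a small Gibbs-sampling loop. The function name `cd_k`, the parameter names and shapes, and the requirement that K is at least 1 are assumptions for illustration; this is not the library's code.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd_k(V, W, b_in, b_out, K=1, rng=None):
    """Run K >= 1 Gibbs steps starting from the visible data V.

    V: (n_ex, n_in); W: (n_in, n_out); b_in: (n_in,); b_out: (n_out,).
    Returns the hidden probabilities given the data, plus the final
    visible and hidden probabilities of the reconstruction.
    """
    rng = np.random.default_rng() if rng is None else rng
    p_H0 = sigmoid(V @ W + b_out)  # hidden probabilities given the data
    H = (rng.random(p_H0.shape) < p_H0).astype(float)  # sampled binary states
    for k in range(K):
        p_V = sigmoid(H @ W.T + b_in)  # real-valued visible reconstruction
        p_H = sigmoid(p_V @ W + b_out)  # hidden probabilities
        if k < K - 1:
            # Intermediate hidden updates use sampled binary states;
            # the final update keeps the probabilities
            H = (rng.random(p_H.shape) < p_H).astype(float)
    return p_H0, p_V, p_H


rng = np.random.default_rng(0)
n_ex, n_in, n_out = 5, 6, 3
V = (rng.random((n_ex, n_in)) < 0.5).astype(float)
W = rng.normal(scale=0.1, size=(n_in, n_out))
p_H0, p_V, p_H = cd_k(V, W, np.zeros(n_in), np.zeros(n_out), K=3, rng=rng)
```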
backward(retain_grads=True, *args)[source]

Perform a gradient update on the layer parameters via the contrastive divergence equations.

Parameters: retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
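The contrastive divergence statistics can be sketched as the difference between data-driven ("positive") and reconstruction-driven ("negative") pairwise products, using probabilities rather than binary states. The function name `cd_gradients`, the averaging over the minibatch, and the sign convention (a direction that increases the data likelihood) are assumptions; an optimizer that minimizes a loss would use the negation.

```python
import numpy as np


def cd_gradients(V, p_H0, p_V, p_H):
    """Positive minus negative statistics, averaged over the minibatch.

    V, p_V: (n_ex, n_in) data and reconstruction probabilities.
    p_H0, p_H: (n_ex, n_out) hidden probabilities for data and reconstruction.
    """
    n_ex = V.shape[0]
    dW = (V.T @ p_H0 - p_V.T @ p_H) / n_ex
    db_in = (V - p_V).mean(axis=0)  # visible bias statistics
    db_out = (p_H0 - p_H).mean(axis=0)  # hidden bias statistics
    return dW, db_in, db_out


V = np.array([[1.0, 0.0], [0.0, 1.0]])
p_H0 = np.array([[0.9, 0.2, 0.5], [0.1, 0.7, 0.5]])
# A perfect reconstruction yields zero updates
dW, db_in, db_out = cd_gradients(V, p_H0, p_V=V, p_H=p_H0)
```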
reconstruct(X, n_steps=10, return_prob=False)[source]

Reconstruct an input X by running the trained Gibbs sampler for n_steps-worth of CD-k.

Parameters: X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples. Each feature in X should ideally be binary-valued, although it is possible to also train on real-valued features ranging between (0, 1) (e.g., grayscale images). If X has missing values, it may be sufficient to mark them with random entries and allow the reconstruction to impute them. n_steps (int) – The number of Gibbs sampling steps to perform when generating the reconstruction. Default is 10. return_prob (bool) – Whether to return the real-valued feature probabilities for the reconstruction or the binary samples. Default is False.
Returns: V (ndarray of shape (n_ex, n_in)) – The reconstruction (or feature probabilities if return_prob is True) of the visible input X after running the Gibbs sampler for n_steps.

## Softmax

class numpy_ml.neural_nets.layers.Softmax(dim=-1, optimizer=None)[source]

A softmax nonlinearity layer.

Notes

This is implemented as a layer rather than an activation primarily because it requires retaining the layer input in order to compute the softmax gradients properly. In other words, in contrast to other simple activations, the softmax function and its gradient are not computed elementwise, and thus are more easily expressed as a layer.

The softmax function computes:

$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$

where $$x_i$$ is the i-th element of input example x.
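A minimal NumPy sketch of this formula is shown below. Subtracting the per-row maximum before exponentiating is a standard stability trick that leaves the result unchanged; whether the library does the same is an assumption, and the function name `softmax` is only illustrative.

```python
import numpy as np


def softmax(X, dim=-1):
    """Softmax along `dim`, shifted by the max for numerical stability."""
    shifted = X - X.max(axis=dim, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=dim, keepdims=True)


X = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
Y = softmax(X)
```

Without the shift, the second row would overflow in `np.exp`; with it, equal inputs map to equal probabilities.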

Parameters: dim (int) – The dimension in X along which the softmax will be computed. Default is -1. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None. Unused for this layer.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters: X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples. retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, n_out)) – Layer output for each of the n_ex examples.
backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters: dLdy (ndarray of shape (n_ex, n_out) or list of arrays) – The gradient(s) of the loss wrt. the layer output(s). retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dLdX (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer input X.
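Because the softmax is not elementwise, its backward pass multiplies the upstream gradient by the Jacobian diag(y) - y yᵀ of each example. The sketch below computes that product without forming the Jacobian. It assumes a 2D input with the softmax taken over the last axis, and the function names are hypothetical.

```python
import numpy as np


def softmax(X):
    e = np.exp(X - X.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def softmax_backward(dLdy, Y):
    """Row-wise product of dLdy with the softmax Jacobian diag(y) - y y^T."""
    inner = (dLdy * Y).sum(axis=-1, keepdims=True)
    return Y * (dLdy - inner)


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Y = softmax(X)
dLdy = rng.normal(size=(3, 4))
dLdX = softmax_backward(dLdy, Y)
```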

## SparseEvolution

class numpy_ml.neural_nets.layers.SparseEvolution(n_out, zeta=0.3, epsilon=20, act_fn=None, init='glorot_uniform', optimizer=None)[source]

A sparse Erdős-Rényi layer with evolutionary rewiring via the sparse evolutionary training (SET) algorithm.

Notes

$Y = f( (\mathbf{W} \odot \mathbf{W}_{mask}) \mathbf{X} + \mathbf{b} )$

where $$\odot$$ is the elementwise multiplication operation, f is the layer activation function, and $$\mathbf{W}_{mask}$$ is an evolved binary mask.
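The masked forward pass can be sketched as follows, with the identity activation. The helper names are hypothetical, and the connection probability `epsilon * (n_in + n_out) / (n_in * n_out)` used to draw the initial mask is an assumption based on the usual Erdős-Rényi initialization in SET, not something stated in this reference. A row-vector convention is used, so W has shape (n_in, n_out).

```python
import numpy as np


def init_mask(n_in, n_out, epsilon=20, rng=None):
    """Random binary mask; each connection is kept with probability p (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    p = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    return (rng.random((n_in, n_out)) < p).astype(float)


def sparse_forward(X, W, W_mask, b):
    """Apply only the unmasked weights: Y = X (W * W_mask) + b."""
    return X @ (W * W_mask) + b


rng = np.random.default_rng(0)
n_ex, n_in, n_out = 8, 100, 100
W_mask = init_mask(n_in, n_out, epsilon=20, rng=rng)
W = rng.normal(scale=0.1, size=(n_in, n_out))
b = np.zeros(n_out)
X = rng.normal(size=(n_ex, n_in))
Y = sparse_forward(X, W, W_mask, b)
```

Weights at masked-out positions have no effect on Y, so they receive no useful gradient until the mask changes.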

Parameters: n_out (int) – The dimensionality of the layer output. zeta (float) – Proportion of the positive and negative weights closest to zero to drop after each training update. Default is 0.3. epsilon (float) – Layer sparsity parameter. Default is 20. act_fn (str, Activation object, or None) – The element-wise output nonlinearity used in computing Y. If None, use the identity function $$f(X) = X$$. Default is None. init ({'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}) – The weight initialization strategy. Default is ‘glorot_uniform’. optimizer (str, Optimizer object, or None) – The optimization strategy to use when performing gradient updates within the update() method. If None, use the SGD optimizer with default parameters. Default is None.
hyperparameters[source]

Return a dictionary containing the layer hyperparameters.

forward(X, retain_derived=True)[source]

Compute the layer output on a single minibatch.

Parameters: X (ndarray of shape (n_ex, n_in)) – Layer input, representing the n_in-dimensional features for a minibatch of n_ex examples. retain_derived (bool) – Whether to retain the variables calculated during the forward pass for use later during backprop. If False, this suggests the layer will not be expected to backprop through wrt. this input. Default is True.
Returns: Y (ndarray of shape (n_ex, n_out)) – Layer output for each of the n_ex examples.
backward(dLdy, retain_grads=True)[source]

Backprop from layer outputs to inputs.

Parameters: dLdy (ndarray of shape (n_ex, n_out) or list of arrays) – The gradient(s) of the loss wrt. the layer output(s). retain_grads (bool) – Whether to include the intermediate parameter gradients computed during the backward pass in the final parameter update. Default is True.
Returns: dLdX (ndarray of shape (n_ex, n_in)) – The gradient of the loss wrt. the layer input X.
update()[source]

Update parameters using current gradients and evolve network connections via SET.
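The rewiring step can be sketched as: drop a proportion zeta of the active positive weights and of the active negative weights closest to zero, then regrow the same number of connections at random inactive positions. The function name `set_rewire`, the small-normal re-initialization of regrown weights, and the possibility that a just-dropped position is regrown are assumptions for illustration, not the library's exact procedure.

```python
import numpy as np


def set_rewire(W, W_mask, zeta=0.3, rng=None):
    """Drop the weakest active connections and regrow as many at random."""
    rng = np.random.default_rng() if rng is None else rng
    shape = W.shape
    w = W.ravel().copy()
    m = W_mask.ravel().copy()
    active = w * m

    pos_idx = np.flatnonzero(active > 0)
    neg_idx = np.flatnonzero(active < 0)
    # Positive weights in ascending order: the first ones are closest to zero
    drop_pos = pos_idx[np.argsort(active[pos_idx])][: int(zeta * pos_idx.size)]
    # Negative weights in descending order: the first ones are closest to zero
    drop_neg = neg_idx[np.argsort(-active[neg_idx])][: int(zeta * neg_idx.size)]
    drop = np.concatenate([drop_pos, drop_neg])

    m[drop] = 0.0
    w[drop] = 0.0

    # Regrow the same number of connections at random inactive positions
    inactive = np.flatnonzero(m == 0)
    grow = rng.choice(inactive, size=drop.size, replace=False)
    m[grow] = 1.0
    w[grow] = rng.normal(scale=0.01, size=drop.size)  # assumed re-initialization
    return w.reshape(shape), m.reshape(shape)


rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10))
W_mask = (rng.random((20, 10)) < 0.3).astype(float)
W_new, mask_new = set_rewire(W, W_mask, zeta=0.3, rng=rng)
```

The number of active connections is preserved, so the layer keeps the same sparsity level while its connectivity pattern evolves.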