Activations

Popular (and some not-so-popular) activation functions for use within arbitrary neural networks.

Affine

class numpy_ml.neural_nets.activations.Affine(slope=1, intercept=0)[source]

An affine activation function.

Parameters:
  • slope (float) – Activation slope. Default is 1.
  • intercept (float) – Intercept/offset term. Default is 0.
fn(z)[source]

Evaluate the Affine activation on the elements of input z.

\[\text{Affine}(z_i) = \text{slope} \times z_i + \text{intercept}\]
grad(x)[source]

Evaluate the first derivative of the Affine activation on the elements of input x.

\[\frac{\partial \text{Affine}}{\partial x_i} = \text{slope}\]
grad2(x)[source]

Evaluate the second derivative of the Affine activation on the elements of input x.

\[\frac{\partial^2 \text{Affine}}{\partial x_i^2} = 0\]
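
A minimal usage sketch (hypothetical values; this assumes fn, grad, and grad2 operate elementwise on NumPy arrays, as the formulas above indicate):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import Affine
>>> act = Affine(slope=2.0, intercept=1.0)
>>> z = np.array([-1.0, 0.0, 3.0])
>>> assert np.allclose(act.fn(z), 2.0 * z + 1.0)   # slope * z + intercept
>>> assert np.allclose(act.grad(z), 2.0)           # first derivative is the slope
>>> assert np.allclose(act.grad2(z), 0.0)          # second derivative is zero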

ELU

class numpy_ml.neural_nets.activations.ELU(alpha=1.0)[source]

An exponential linear unit (ELU).

Notes

ELUs are intended to address the fact that ReLUs are strictly nonnegative and therefore have an average activation greater than zero, which can shift the input distribution of downstream layers (internal covariate shift) and slow learning. ELUs address this by (1) taking on negative values when \(x < 0\), which (2) are bounded below by \(-\alpha\). As with LeakyReLU, the negative activations help push the average unit activation towards 0. Unlike LeakyReLU, however, the negative branch saturates, which gives greater robustness to large negative inputs: the unit encodes the degree to which a feature is present but does not quantify the degree of its “absence” (negative activation). [*]

Parameters:alpha (float) – Slope of negative segment. Default is 1.

References

[*]Clevert, D. A., Unterthiner, T., Hochreiter, S. (2016). “Fast and accurate deep network learning by exponential linear units (ELUs)”. 4th International Conference on Learning Representations.
fn(z)[source]

Evaluate the ELU activation on the elements of input z.

\[\begin{split}\text{ELU}(z_i) &= z_i \ \ \ \ &&\text{if }z_i > 0 \\ &= \alpha (e^{z_i} - 1) \ \ \ \ &&\text{otherwise}\end{split}\]
grad(x)[source]

Evaluate the first derivative of the ELU activation on the elements of input x.

\[\begin{split}\frac{\partial \text{ELU}}{\partial x_i} &= 1 \ \ \ \ &&\text{if } x_i > 0 \\ &= \alpha e^{x_i} \ \ \ \ &&\text{otherwise}\end{split}\]
grad2(x)[source]

Evaluate the second derivative of the ELU activation on the elements of input x.

\[\begin{split}\frac{\partial^2 \text{ELU}}{\partial x_i^2} &= 0 \ \ \ \ &&\text{if } x_i > 0 \\ &= \alpha e^{x_i} \ \ \ \ &&\text{otherwise}\end{split}\]
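
A quick check of the piecewise definition above (a sketch; assumes fn and grad follow the formulas and accept NumPy arrays):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import ELU
>>> elu = ELU(alpha=1.0)
>>> z = np.array([-3.0, -0.5, 0.0, 2.0])
>>> # negative inputs saturate towards -alpha; positive inputs pass through unchanged
>>> assert np.allclose(elu.fn(z), np.where(z > 0, z, 1.0 * (np.exp(z) - 1)))
>>> assert np.allclose(elu.grad(z), np.where(z > 0, 1.0, 1.0 * np.exp(z)))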

Exponential

class numpy_ml.neural_nets.activations.Exponential[source]

An exponential (base e) activation function.

fn(z)[source]

Evaluate the exponential activation on the elements of input z.

\[\text{Exponential}(z_i) = e^{z_i}\]
grad(x)[source]

Evaluate the first derivative of the exponential activation on the elements of input x.

\[\frac{\partial \text{Exponential}}{\partial x_i} = e^{x_i}\]
grad2(x)[source]

Evaluate the second derivative of the exponential activation on the elements of input x.

\[\frac{\partial^2 \text{Exponential}}{\partial x_i^2} = e^{x_i}\]
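
For the exponential unit the activation and both of its derivatives coincide; a short sanity check (a sketch, assuming elementwise NumPy semantics):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import Exponential
>>> act = Exponential()
>>> z = np.array([-2.0, 0.0, 1.5])
>>> assert np.allclose(act.fn(z), np.exp(z))
>>> assert np.allclose(act.fn(z), act.grad(z))    # e^z is its own derivative
>>> assert np.allclose(act.grad(z), act.grad2(z))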

HardSigmoid

class numpy_ml.neural_nets.activations.HardSigmoid[source]

A “hard” sigmoid activation function.

Notes

The hard sigmoid is a piecewise linear approximation of the logistic sigmoid that is cheaper to compute.

fn(z)[source]

Evaluate the hard sigmoid activation on the elements of input z.

\[\begin{split}\text{HardSigmoid}(z_i) &= 0 \ \ \ \ &&\text{if }z_i < -2.5 \\ &= 0.2 z_i + 0.5 \ \ \ \ &&\text{if }-2.5 \leq z_i \leq 2.5 \\ &= 1 \ \ \ \ &&\text{if }z_i > 2.5\end{split}\]
grad(x)[source]

Evaluate the first derivative of the hard sigmoid activation on the elements of input x.

\[\begin{split}\frac{\partial \text{HardSigmoid}}{\partial x_i} &= 0.2 \ \ \ \ &&\text{if } -2.5 \leq x_i \leq 2.5\\ &= 0 \ \ \ \ &&\text{otherwise}\end{split}\]
grad2(x)[source]

Evaluate the second derivative of the hard sigmoid activation on the elements of input x.

\[\frac{\partial^2 \text{HardSigmoid}}{\partial x_i^2} = 0\]
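
The piecewise definition above is equivalent to clipping the line 0.2 z + 0.5 to [0, 1]; a sketch comparing the two forms and the true logistic sigmoid (assumes fn follows the formula above):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import HardSigmoid, Sigmoid
>>> z = np.linspace(-4.0, 4.0, 9)
>>> hs = HardSigmoid()
>>> assert np.allclose(hs.fn(z), np.clip(0.2 * z + 0.5, 0.0, 1.0))
>>> # the linear approximation stays within ~0.1 of the true sigmoid
>>> assert np.max(np.abs(hs.fn(z) - Sigmoid().fn(z))) < 0.1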

Identity

class numpy_ml.neural_nets.activations.Identity[source]

Identity activation function.

Notes

Identity is syntactic sugar for Affine with slope = 1 and intercept = 0.

fn(z)[source]

Evaluate the Affine activation on the elements of input z.

\[\text{Affine}(z_i) = \text{slope} \times z_i + \text{intercept}\]
grad(x)[source]

Evaluate the first derivative of the Affine activation on the elements of input x.

\[\frac{\partial \text{Affine}}{\partial x_i} = \text{slope}\]
grad2(x)[source]

Evaluate the second derivative of the Affine activation on the elements of input x.

\[\frac{\partial^2 \text{Affine}}{\partial x_i^2} = 0\]
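
Since Identity is just Affine with slope 1 and intercept 0, the two produce identical outputs; a brief check (a sketch):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import Identity, Affine
>>> z = np.array([-2.0, 0.0, 4.0])
>>> assert np.allclose(Identity().fn(z), Affine(slope=1, intercept=0).fn(z))
>>> assert np.allclose(Identity().fn(z), z)   # the input passes through unchanged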

LeakyReLU

class numpy_ml.neural_nets.activations.LeakyReLU(alpha=0.3)[source]

‘Leaky’ version of a rectified linear unit (ReLU).

Notes

Leaky ReLUs [†] are designed to address the “dying ReLU” problem, in which standard ReLU units receive zero gradient for negative inputs, by allowing a small non-zero gradient when x is negative.

Parameters:alpha (float) – Activation slope when x < 0. Default is 0.3.

References

[†]Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). “Rectifier nonlinearities improve neural network acoustic models.” Proceedings of the 30th International Conference on Machine Learning, 30.
fn(z)[source]

Evaluate the leaky ReLU function on the elements of input z.

\[\begin{split}\text{LeakyReLU}(z_i) &= z_i \ \ \ \ &&\text{if } z_i > 0 \\ &= \alpha z_i \ \ \ \ &&\text{otherwise}\end{split}\]
grad(x)[source]

Evaluate the first derivative of the leaky ReLU function on the elements of input x.

\[\begin{split}\frac{\partial \text{LeakyReLU}}{\partial x_i} &= 1 \ \ \ \ &&\text{if }x_i > 0 \\ &= \alpha \ \ \ \ &&\text{otherwise}\end{split}\]
grad2(x)[source]

Evaluate the second derivative of the leaky ReLU function on the elements of input x.

\[\frac{\partial^2 \text{LeakyReLU}}{\partial x_i^2} = 0\]
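
A short comparison with the standard ReLU on negative inputs, where the leaky variant keeps a small nonzero gradient (a sketch, assuming the formulas above):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import LeakyReLU, ReLU
>>> z = np.array([-4.0, -1.0, -0.1])
>>> assert np.allclose(ReLU().grad(z), 0.0)                 # no gradient for z < 0
>>> assert np.allclose(LeakyReLU(alpha=0.3).grad(z), 0.3)   # small constant gradient instead
>>> assert np.allclose(LeakyReLU(alpha=0.3).fn(z), 0.3 * z)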

ReLU

class numpy_ml.neural_nets.activations.ReLU[source]

A rectified linear activation function.

Notes

“ReLU units can be fragile during training and can ‘die’. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold.

For example, you may find that as much as 40% of your network can be ‘dead’ (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.” [‡]

References

[‡]Karpathy, A. “CS231n: Convolutional neural networks for visual recognition.”
fn(z)[source]

Evaluate the ReLU function on the elements of input z.

\[\begin{split}\text{ReLU}(z_i) &= z_i \ \ \ \ &&\text{if }z_i > 0 \\ &= 0 \ \ \ \ &&\text{otherwise}\end{split}\]
grad(x)[source]

Evaluate the first derivative of the ReLU function on the elements of input x.

\[\begin{split}\frac{\partial \text{ReLU}}{\partial x_i} &= 1 \ \ \ \ &&\text{if }x_i > 0 \\ &= 0 \ \ \ \ &&\text{otherwise}\end{split}\]
grad2(x)[source]

Evaluate the second derivative of the ReLU function on the elements of input x.

\[\frac{\partial^2 \text{ReLU}}{\partial x_i^2} = 0\]
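
The “dying ReLU” issue described above shows up directly in the gradient: once a unit's pre-activations are all negative, no gradient flows through it. A minimal illustration (a sketch):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import ReLU
>>> relu = ReLU()
>>> z = np.array([-3.0, -1.2, -0.01])       # a “dead” unit: every pre-activation is negative
>>> assert np.allclose(relu.fn(z), 0.0)
>>> assert np.allclose(relu.grad(z), 0.0)   # zero gradient, so the unit cannot recover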

SELU

class numpy_ml.neural_nets.activations.SELU[source]

A scaled exponential linear unit (SELU).

Notes

SELU units, when used in conjunction with proper weight initialization and regularization techniques, encourage neuron activations to converge towards zero mean and unit variance without the explicit use of normalization layers such as batchnorm.

For SELU units, the \(\alpha\) and \(\text{scale}\) values are constants chosen so that the mean and variance of the inputs are preserved between consecutive layers. As such, the authors propose initializing the weights with LeCun-normal initialization, \(w_{ij} \sim \mathcal{N}(0, 1 / \text{fan\_in})\), and using the dropout variant \(\alpha\)-dropout for regularization. [§]

See the reference for more information (especially the appendix ;-) ).

References

[§]Klambauer, G., Unterthiner, T., & Hochreiter, S. (2017). “Self-normalizing neural networks.” Advances in Neural Information Processing Systems, 30.
fn(z)[source]

Evaluate the SELU activation on the elements of input z.

\[\text{SELU}(z_i) = \text{scale} \times \text{ELU}(z_i, \alpha)\]

which is simply

\[\begin{split}\text{SELU}(z_i) &= \text{scale} \times z_i \ \ \ \ &&\text{if }z_i > 0 \\ &= \text{scale} \times \alpha (e^{z_i} - 1) \ \ \ \ &&\text{otherwise}\end{split}\]
grad(x)[source]

Evaluate the first derivative of the SELU activation on the elements of input x.

\[\begin{split}\frac{\partial \text{SELU}}{\partial x_i} &= \text{scale} \ \ \ \ &&\text{if } x_i > 0 \\ &= \text{scale} \times \alpha e^{x_i} \ \ \ \ &&\text{otherwise}\end{split}\]
grad2(x)[source]

Evaluate the second derivative of the SELU activation on the elements of input x.

\[\begin{split}\frac{\partial^2 \text{SELU}}{\partial x_i^2} &= 0 \ \ \ \ &&\text{if } x_i > 0 \\ &= \text{scale} \times \alpha e^{x_i} \ \ \ \ &&\text{otherwise}\end{split}\]
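
SELU is scale × ELU(z, α) with fixed constants; a sketch of that identity, assuming the implementation uses the standard values α ≈ 1.6733 and scale ≈ 1.0507 reported in Klambauer et al. (2017):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import SELU, ELU
>>> alpha, scale = 1.6732632423543772, 1.0507009873554805   # constants from the paper
>>> z = np.array([-2.0, -0.5, 0.0, 1.0])
>>> assert np.allclose(SELU().fn(z), scale * ELU(alpha=alpha).fn(z))
>>> assert np.allclose(SELU().grad(z), scale * ELU(alpha=alpha).grad(z))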

GELU

class numpy_ml.neural_nets.activations.GELU(approximate=True)[source]

A Gaussian error linear unit (GELU). [¶]

Notes

A ReLU alternative. GELU weights inputs by their value rather than gating them by their sign, as vanilla ReLUs do.

Parameters:approximate (bool) – Whether to use a faster but less precise approximation to the Gauss error function when calculating the unit activation and gradient. Default is True.

References

[¶]Hendrycks, D., & Gimpel, K. (2016). “Bridging nonlinearities and stochastic regularizers with Gaussian error linear units.” CoRR.
fn(z)[source]

Compute the GELU function on the elements of input z.

\[\text{GELU}(z_i) = z_i P(Z \leq z_i) = z_i \Phi(z_i) = z_i \cdot \frac{1}{2}\left(1 + \text{erf}(z_i/\sqrt{2})\right)\]
grad(x)[source]

Evaluate the first derivative of the GELU function on the elements of input x.

\[\frac{\partial \text{GELU}}{\partial x_i} = \frac{1}{2} + \frac{1}{2}\left(\text{erf}(\frac{x}{\sqrt{2}}) + \frac{x \cdot \text{erf}'(\frac{x}{\sqrt{2}})}{\sqrt{2}}\right)\]

where \(\text{erf}'(x) = \frac{2}{\sqrt{\pi}} \cdot \exp\{-x^2\}\).

grad2(x)[source]

Evaluate the second derivative of the GELU function on the elements of input x.

\[\frac{\partial^2 \text{GELU}}{\partial x_i^2} = \frac{1}{2\sqrt{2}} \left[ 2\, \text{erf}'(\frac{x}{\sqrt{2}}) + \frac{x}{\sqrt{2}}\, \text{erf}''(\frac{x}{\sqrt{2}}) \right]\]

where \(\text{erf}'(x) = \frac{2}{\sqrt{\pi}} \cdot \exp\{-x^2\}\) and \(\text{erf}''(x) = \frac{-4x}{\sqrt{\pi}} \cdot \exp\{-x^2\}\).
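
A sketch comparing the exact erf-based form above with a direct computation; this assumes approximate=False evaluates the exact formula (scipy.special.erf is used only to build the reference value):

>>> import numpy as np
>>> from scipy.special import erf
>>> from numpy_ml.neural_nets.activations import GELU
>>> z = np.linspace(-3.0, 3.0, 13)
>>> gelu = GELU(approximate=False)   # exact erf-based form
>>> assert np.allclose(gelu.fn(z), z * 0.5 * (1.0 + erf(z / np.sqrt(2))))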

Sigmoid

class numpy_ml.neural_nets.activations.Sigmoid[source]

A logistic sigmoid activation function.

fn(z)[source]

Evaluate the logistic sigmoid, \(\sigma\), on the elements of input z.

\[\sigma(z_i) = \frac{1}{1 + e^{-z_i}}\]
grad(x)[source]

Evaluate the first derivative of the logistic sigmoid on the elements of x.

\[\frac{\partial \sigma}{\partial x_i} = \sigma(x_i) (1 - \sigma(x_i))\]
grad2(x)[source]

Evaluate the second derivative of the logistic sigmoid on the elements of x.

\[\frac{\partial^2 \sigma}{\partial x_i^2} = \frac{\partial \sigma}{\partial x_i} (1 - 2 \sigma(x_i))\]
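
The derivative identities above are easy to check numerically (a sketch, assuming elementwise NumPy semantics):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import Sigmoid
>>> sig = Sigmoid()
>>> x = np.array([-2.0, 0.0, 0.5, 3.0])
>>> s = sig.fn(x)
>>> assert np.allclose(sig.grad(x), s * (1 - s))                # sigma * (1 - sigma)
>>> assert np.allclose(sig.grad2(x), sig.grad(x) * (1 - 2 * s))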

SoftPlus

class numpy_ml.neural_nets.activations.SoftPlus[source]

A softplus activation function.

Notes

In contrast to ReLU, the softplus activation is differentiable everywhere (including at 0). It is, however, more expensive to compute.

The derivative of the softplus activation is the logistic sigmoid.

fn(z)[source]

Evaluate the softplus activation on the elements of input z.

\[\text{SoftPlus}(z_i) = \log(1 + e^{z_i})\]
grad(x)[source]

Evaluate the first derivative of the softplus activation on the elements of input x.

\[\frac{\partial \text{SoftPlus}}{\partial x_i} = \frac{e^{x_i}}{1 + e^{x_i}}\]
grad2(x)[source]

Evaluate the second derivative of the softplus activation on the elements of input x.

\[\frac{\partial^2 \text{SoftPlus}}{\partial x_i^2} = \frac{e^{x_i}}{(1 + e^{x_i})^2}\]
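
The notes state that the derivative of the softplus is the logistic sigmoid; a short check of that relationship (a sketch):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import SoftPlus, Sigmoid
>>> x = np.array([-3.0, -0.5, 0.0, 2.0])
>>> sp = SoftPlus()
>>> assert np.allclose(sp.fn(x), np.log(1 + np.exp(x)))
>>> assert np.allclose(sp.grad(x), Sigmoid().fn(x))   # softplus' = logistic sigmoid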

Tanh

class numpy_ml.neural_nets.activations.Tanh[source]

A hyperbolic tangent activation function.

fn(z)[source]

Compute the tanh function on the elements of input z.

\[\tanh(z_i) = \frac{e^{z_i} - e^{-z_i}}{e^{z_i} + e^{-z_i}}\]
grad(x)[source]

Evaluate the first derivative of the tanh function on the elements of input x.

\[\frac{\partial \tanh}{\partial x_i} = 1 - \tanh(x_i)^2\]
grad2(x)[source]

Evaluate the second derivative of the tanh function on the elements of input x.

\[\frac{\partial^2 \tanh}{\partial x_i^2} = -2 \tanh(x_i) \left(\frac{\partial \tanh}{\partial x_i}\right)\]
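
A finite-difference sanity check of the first derivative, using only the documented fn and grad methods (a sketch; the step size h is arbitrary):

>>> import numpy as np
>>> from numpy_ml.neural_nets.activations import Tanh
>>> t, h = Tanh(), 1e-5
>>> x = np.array([-1.5, 0.0, 0.7, 2.0])
>>> numeric = (t.fn(x + h) - t.fn(x - h)) / (2 * h)   # central difference
>>> assert np.allclose(numeric, t.grad(x), atol=1e-6)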