Optimizers

Popular gradient-based strategies for optimizing parameters in neural networks.

For a discussion regarding the generalization performance of the solutions found via different optimization strategies, see:

[1] Wilson et al. (2017). “The marginal value of adaptive gradient methods in machine learning.” Proceedings of the 31st Conference on Neural Information Processing Systems. https://arxiv.org/pdf/1705.08292.pdf

OptimizerBase

class numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase(lr, scheduler=None)[source]

Bases: abc.ABC

An abstract base class for all Optimizer objects.

This should never be used directly.

step()[source]

Increment the optimizer step counter by 1

reset_step()[source]

Reset the step counter to zero

copy()[source]

Return a copy of the optimizer object

set_params(hparam_dict=None, cache_dict=None)[source]

Set the parameters of the optimizer object from a dictionary

update(param, param_grad, param_name, cur_loss=None)[source]

Compute the update for a given parameter. Abstract; each concrete optimizer below implements its own version.
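
As an illustration of this interface, the sketch below subclasses OptimizerBase to implement plain gradient descent. It is a hypothetical example, not part of the library: the subclass name and its _lr attribute are assumptions, and only the documented methods (step, update) and the (lr, scheduler) constructor signature are relied on.

from numpy_ml.neural_nets.optimizers.optimizers import OptimizerBase

class PlainGD(OptimizerBase):
    """Hypothetical optimizer: theta <- theta - lr * grad."""

    def __init__(self, lr=0.01):
        super().__init__(lr, scheduler=None)
        self._lr = lr  # stored locally rather than assuming base-class internals

    def update(self, param, param_grad, param_name, cur_loss=None):
        self.step()  # documented: increment the optimizer step counter by 1
        return param - self._lr * param_grad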

SGD

class numpy_ml.neural_nets.optimizers.SGD(lr=0.01, momentum=0.0, clip_norm=None, lr_scheduler=None, **kwargs)[source]

Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase

A stochastic gradient descent optimizer.

Notes

For model parameters \(\theta\), averaged parameter gradients \(\nabla_{\theta} \mathcal{L}\), and learning rate \(\eta\), the SGD update at timestep t is

\[\begin{split}\text{update}^{(t)} &= \text{momentum} \cdot \text{update}^{(t-1)} + \eta^{(t)} \nabla_{\theta} \mathcal{L}\\ \theta^{(t+1)} &\leftarrow \theta^{(t)} - \text{update}^{(t)}\end{split}\]
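
Written out in NumPy, the update rule above is a two-liner (a minimal sketch for illustration; the variable names are not the library's internals):

import numpy as np

def sgd_momentum_step(param, grad, prev_update, lr=0.01, momentum=0.9):
    update = momentum * prev_update + lr * grad  # update^(t)
    return param - update, update                # theta^(t+1) and the update to carry forward

theta, velocity = np.zeros(3), np.zeros(3)
theta, velocity = sgd_momentum_step(theta, np.array([1.0, -2.0, 0.5]), velocity)
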
Parameters:
  • lr (float) – Learning rate for SGD. If lr_scheduler is not None, this is used as the starting learning rate. Default is 0.01.
  • momentum (float in range [0, 1]) – The fraction of the previous update to add to the current update. If 0, no momentum is applied. Default is 0.
  • clip_norm (float) – If not None, all param gradients are scaled to have maximum l2 norm of clip_norm before computing update. Default is None.
  • lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)[source]

Compute the SGD update for a given parameter

Parameters:
  • param (ndarray of shape (n, m)) – The value of the parameter to be updated.
  • param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
  • param_name (str) – The name of the parameter.
  • cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling (e.g., by KingScheduler). Default is None.
Returns:

updated_params (ndarray of shape (n, m)) – The value of param after applying the momentum update.
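
A hypothetical usage sketch of the documented constructor and update method (assuming the package is installed; shapes are arbitrary):

import numpy as np
from numpy_ml.neural_nets.optimizers import SGD

opt = SGD(lr=0.01, momentum=0.9)
W = np.random.randn(5, 3)   # parameter to be updated
dW = np.random.randn(5, 3)  # gradient of the loss with respect to W
W = opt.update(W, dW, "W")  # returns W after the momentum update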

AdaGrad

class numpy_ml.neural_nets.optimizers.AdaGrad(lr=0.01, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)[source]

Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase

An AdaGrad optimizer.

Notes

Weights that receive large gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased.

Equations:

cache[t] = cache[t-1] + grad[t] ** 2
update[t] = lr * grad[t] / (np.sqrt(cache[t]) + eps)
param[t+1] = param[t] - update[t]

Note that the ** and / operations are elementwise
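
The same equations in executable NumPy (an illustrative sketch, not the library's internal code):

import numpy as np

def adagrad_step(param, grad, cache, lr=0.01, eps=1e-7):
    cache = cache + grad ** 2                    # running sum of squared gradients
    update = lr * grad / (np.sqrt(cache) + eps)  # per-weight effective learning rate
    return param - update, cache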

“A downside of Adagrad … is that the monotonic learning rate usually proves too aggressive and stops learning too early.” [1]

References

[1] Karpathy, A. “CS231n: Convolutional neural networks for visual recognition.” https://cs231n.github.io/neural-networks-3/
Parameters:
  • lr (float) – Global learning rate. Default is 0.01.
  • eps (float) – Smoothing term to avoid divide-by-zero errors in the update calc. Default is 1e-7.
  • clip_norm (float or None) – If not None, all param gradients are scaled to have maximum L2 norm of clip_norm before computing update. Default is None.
  • lr_scheduler (str or Scheduler object or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)[source]

Compute the AdaGrad update for a given parameter.

Notes

Adjusts the learning rate of each weight based on the magnitudes of its gradients (big gradient -> small lr, small gradient -> big lr).

Parameters:
  • param (ndarray of shape (n, m)) – The value of the parameter to be updated
  • param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name
  • param_name (str) – The name of the parameter
  • cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling (e.g., by KingScheduler). Default is None.
Returns:

updated_params (ndarray of shape (n, m)) – The value of param after applying the AdaGrad update

Adam

class numpy_ml.neural_nets.optimizers.Adam(lr=0.001, decay1=0.9, decay2=0.999, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)[source]

Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase

Adam (adaptive moment estimation) optimization algorithm.

Notes

Designed to combine the advantages of AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings.
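
For reference, the standard Adam update (Kingma & Ba, 2015) can be sketched in NumPy as below; this is the textbook formulation, and the exact internals of this class (e.g., how bias correction is handled) are not reproduced here and may differ:

import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, decay1=0.9, decay2=0.999, eps=1e-7):
    m = decay1 * m + (1 - decay1) * grad        # running estimate of the first moment
    v = decay2 * v + (1 - decay2) * grad ** 2   # running estimate of the second moment
    m_hat = m / (1 - decay1 ** t)               # bias-corrected first moment (t starts at 1)
    v_hat = v / (1 - decay2 ** t)               # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v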

Parameters:
  • lr (float) – Learning rate for update. This parameter is ignored if using NoamScheduler. Default is 0.001.
  • decay1 (float) – The rate of decay to use for the running estimate of the first moment (mean) of the gradient. Default is 0.9.
  • decay2 (float) – The rate of decay to use for the running estimate of the second moment (variance) of the gradient. Default is 0.999.
  • eps (float) – Constant term to avoid divide-by-zero errors during the update calc. Default is 1e-7.
  • clip_norm (float) – If not None, all param gradients are scaled to have maximum l2 norm of clip_norm before computing update. Default is None.
  • lr_scheduler (str, or Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)[source]

Compute the Adam update for a given parameter.

Parameters:
  • param (ndarray of shape (n, m)) – The value of the parameter to be updated.
  • param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
  • param_name (str) – The name of the parameter.
  • cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling (e.g., by KingScheduler). Default is None.
Returns:

updated_params (ndarray of shape (n, m)) – The value of param after applying the Adam update.

RMSProp

class numpy_ml.neural_nets.optimizers.RMSProp(lr=0.001, decay=0.9, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)[source]

Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase

RMSProp optimizer.

Notes

RMSProp was proposed as a refinement of AdaGrad to reduce its aggressive, monotonically decreasing learning rate.

RMSProp keeps a decaying average of past squared gradients (an estimate of their second moment) as its cache, rather than the ever-growing sum of squared gradients that AdaGrad accumulates.

Equations:

cache[t] = decay * cache[t-1] + (1 - decay) * grad[t] ** 2
update[t] = lr * grad[t] / (np.sqrt(cache[t]) + eps)
param[t+1] = param[t] - update[t]

Note that the ** and / operations are elementwise.
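
The same equations in NumPy (illustrative only); compared with the AdaGrad sketch, only the cache line changes, with a decaying average replacing the cumulative sum:

import numpy as np

def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-7):
    cache = decay * cache + (1 - decay) * grad ** 2  # decaying average of squared gradients
    update = lr * grad / (np.sqrt(cache) + eps)
    return param - update, cache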

Parameters:
  • lr (float) – Learning rate for update. Default is 0.001.
  • decay (float in [0, 1]) – Rate of decay for the moving average. Typical values are [0.9, 0.99, 0.999]. Default is 0.9.
  • eps (float) – Constant term to avoid divide-by-zero errors during the update calc. Default is 1e-7.
  • clip_norm (float or None) – If not None, all param gradients are scaled to have maximum l2 norm of clip_norm before computing update. Default is None.
  • lr_scheduler (str or Scheduler object or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)[source]

Compute the RMSProp update for a given parameter.

Parameters:
  • param (ndarray of shape (n, m)) – The value of the parameter to be updated
  • param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name
  • param_name (str) – The name of the parameter
  • cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling (e.g., by KingScheduler). Default is None.
Returns:

updated_params (ndarray of shape (n, m)) – The value of param after applying the RMSProp update.