# Optimizers

Popular gradient-based strategies for optimizing parameters in neural networks.

For a discussion regarding the generalization performance of the solutions found via different optimization strategies, see:

 [1] Wilson et al. (2017) “The marginal value of adaptive gradient methods in machine learning”, Proceedings of the 31st Conference on Neural Information Processing Systems https://arxiv.org/pdf/1705.08292.pdf

## OptimizerBase

class numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase(lr, scheduler=None)

Bases: abc.ABC

An abstract base class for all Optimizer objects.

This should never be used directly.

step()

Increment the optimizer step counter by 1.

reset_step()

Reset the step counter to zero.

copy()

Return a copy of the optimizer object.

set_params(hparam_dict=None, cache_dict=None)

Set the parameters of the optimizer object from a dictionary.

update(param, param_grad, param_name, cur_loss=None)
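
As a rough illustration of how an abstract optimizer base class with this interface can be laid out, the sketch below uses Python's `abc` module. It is not the numpy_ml source: the class names `ToyOptimizerBase` and `PlainGD` and the attributes `cur_step` and `cache` are hypothetical, chosen only to mirror the methods listed above.

```python
from abc import ABC, abstractmethod
from copy import deepcopy


class ToyOptimizerBase(ABC):
    """Hypothetical stand-in for an optimizer base class (not numpy_ml's code)."""

    def __init__(self, lr):
        self.lr = lr
        self.cur_step = 0  # assumed name: number of step() calls so far
        self.cache = {}    # assumed name: per-parameter state keyed by param_name

    def step(self):
        """Increment the step counter by 1."""
        self.cur_step += 1

    def reset_step(self):
        """Reset the step counter to zero."""
        self.cur_step = 0

    def copy(self):
        """Return an independent copy, including any cached state."""
        return deepcopy(self)

    @abstractmethod
    def update(self, param, param_grad, param_name, cur_loss=None):
        """Return the updated value of `param`; subclasses must implement this."""


class PlainGD(ToyOptimizerBase):
    """Smallest possible subclass: vanilla gradient descent."""

    def update(self, param, param_grad, param_name, cur_loss=None):
        return param - self.lr * param_grad


opt = PlainGD(lr=0.1)
new_w = opt.update(1.0, 2.0, "w")  # one gradient-descent step on a scalar
opt.step()
```

Because `update` is declared abstract, instantiating `ToyOptimizerBase` directly raises a `TypeError`, which matches the note that the base class should never be used directly.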

## SGD

class numpy_ml.neural_nets.optimizers.SGD(lr=0.01, momentum=0.0, clip_norm=None, lr_scheduler=None, **kwargs)

Notes

For model parameters $$\theta$$, averaged parameter gradients $$\nabla_{\theta} \mathcal{L}$$, and learning rate $$\eta$$, the SGD update at timestep t is

$\begin{split}\text{update}^{(t)} &= \text{momentum} \cdot \text{update}^{(t-1)} + \eta^{(t)} \nabla_{\theta} \mathcal{L}\\ \theta^{(t+1)} &\leftarrow \theta^{(t)} - \text{update}^{(t)}\end{split}$

Parameters:

- lr (float) – Learning rate for SGD. If lr_scheduler is not None, this is used as the starting learning rate. Default is 0.01.
- momentum (float in range [0, 1]) – The fraction of the previous update to add to the current update. If 0, no momentum is applied. Default is 0.
- clip_norm (float or None) – If not None, all parameter gradients are rescaled to have a maximum L2 norm of clip_norm before the update is computed. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.

update(param, param_grad, param_name, cur_loss=None)

Compute the SGD update for a given parameter

Parameters:

- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.

Returns:

- updated_params (ndarray of shape (n, m)) – The value of param after applying the momentum update.
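
The momentum update from the Notes above, together with the clip_norm rescaling described in the parameter list, can be sketched in plain NumPy. This is a minimal illustration under stated assumptions rather than the library's implementation: the function names `clip_grad` and `sgd_momentum_step` are hypothetical, and the previous update is passed in and returned explicitly instead of being stored on an optimizer object.

```python
import numpy as np


def clip_grad(grad, clip_norm=None):
    """Rescale `grad` so its L2 norm is at most `clip_norm` (no-op if None)."""
    if clip_norm is None:
        return grad
    norm = np.linalg.norm(grad)
    return grad * (clip_norm / norm) if norm > clip_norm else grad


def sgd_momentum_step(param, grad, prev_update, lr=0.01, momentum=0.0, clip_norm=None):
    """One SGD step; returns (new_param, update) so the caller can carry `update` forward."""
    grad = clip_grad(grad, clip_norm)
    update = momentum * prev_update + lr * grad  # update[t]
    return param - update, update                # theta[t+1], update[t]


# Toy problem (an assumption for illustration): minimize sum(theta ** 2).
theta = np.array([1.0, -2.0])
update = np.zeros_like(theta)
for _ in range(3):
    grad = 2.0 * theta  # gradient of sum(theta ** 2)
    theta, update = sgd_momentum_step(theta, grad, update, lr=0.1, momentum=0.9)
```

With momentum set to 0 the `prev_update` term vanishes and the step reduces to plain gradient descent.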

## AdaGrad

class numpy_ml.neural_nets.optimizers.AdaGrad(lr=0.01, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)

Notes

Because each weight's step is divided by the square root of its accumulated squared gradients, weights that receive large gradients have their effective learning rate reduced, while weights that receive small or infrequent updates have theirs increased relative to the rest.

Equations:

cache[t] = cache[t-1] + grad[t] ** 2
update[t] = lr * grad[t] / (np.sqrt(cache[t]) + eps)
param[t+1] = param[t] - update[t]


Note that the ** and / operations are elementwise.
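
The three equations above translate almost line for line into NumPy. The following is a small sketch, not the library's code: `adagrad_step` is a hypothetical helper that takes the running cache as an argument and returns it, and the toy gradient values are made up to contrast a large-gradient coordinate with a small-gradient one.

```python
import numpy as np


def adagrad_step(param, grad, cache, lr=0.01, eps=1e-7):
    """One AdaGrad step; returns (new_param, new_cache)."""
    cache = cache + grad ** 2                    # cache[t]
    update = lr * grad / (np.sqrt(cache) + eps)  # update[t], elementwise
    return param - update, cache                 # param[t+1], cache[t]


param = np.array([1.0, 1.0])
cache = np.zeros_like(param)
grad = np.array([10.0, 0.1])  # one large-gradient and one small-gradient coordinate
for _ in range(3):
    param, cache = adagrad_step(param, grad, cache, lr=0.1)
```

Both coordinates move by nearly the same amount at every step despite the hundredfold difference in gradient size, and the step shrinks as the cache grows, which is the monotonic decay that the quotation below refers to.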

“A downside of Adagrad … is that the monotonic learning rate usually proves too aggressive and stops learning too early.” [1]

References

 [1] Karpathy, A. “CS231n: Convolutional neural networks for visual recognition” https://cs231n.github.io/neural-networks-3/

Parameters:

- lr (float) – Global learning rate. Default is 0.01.
- eps (float) – Smoothing term to avoid divide-by-zero errors in the update calculation. Default is 1e-7.
- clip_norm (float or None) – If not None, all parameter gradients are rescaled to have a maximum L2 norm of clip_norm before the update is computed. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.

update(param, param_grad, param_name, cur_loss=None)

Notes

Adjusts the effective learning rate of each weight based on the accumulated magnitudes of its gradients: weights with large past gradients take smaller steps, and weights with small past gradients take comparatively larger ones.

Parameters:

- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.

Returns:

- updated_params (ndarray of shape (n, m)) – The value of param after applying the AdaGrad update.

## Adam

class numpy_ml.neural_nets.optimizers.Adam(lr=0.001, decay1=0.9, decay2=0.999, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)

Notes

Designed to combine the advantages of AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings.
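
The notes above do not spell out the update, so the sketch below follows the standard Adam formulation with bias-corrected moment estimates. Whether it matches the library's implementation term by term is an assumption; `adam_step` and its explicit `mean`, `var`, and `t` arguments are hypothetical names, while `lr`, `decay1`, `decay2`, and `eps` mirror the constructor arguments and their defaults.

```python
import numpy as np


def adam_step(param, grad, mean, var, t, lr=0.001, decay1=0.9, decay2=0.999, eps=1e-7):
    """One Adam step at 1-indexed timestep `t`; returns (new_param, mean, var)."""
    mean = decay1 * mean + (1 - decay1) * grad      # running first moment
    var = decay2 * var + (1 - decay2) * grad ** 2   # running second moment
    m_hat = mean / (1 - decay1 ** t)                # bias-corrected estimates
    v_hat = var / (1 - decay2 ** t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    return param - update, mean, var


param = np.array([1.0, -1.0])
mean = np.zeros_like(param)
var = np.zeros_like(param)
grad = np.array([0.5, -3.0])  # made-up gradient, held fixed for illustration
for t in range(1, 4):
    param, mean, var = adam_step(param, grad, mean, var, t)
```

The bias correction matters most in the first few steps, when `mean` and `var` are still close to their zero initialization; at `t = 1` it makes the step size roughly `lr` in every coordinate regardless of the gradient's scale.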

Parameters:

- lr (float) – Learning rate for update. This parameter is ignored if using NoamScheduler. Default is 0.001.
- decay1 (float) – The rate of decay for the running estimate of the first moment (mean) of the gradient. Default is 0.9.
- decay2 (float) – The rate of decay for the running estimate of the second moment (uncentered variance) of the gradient. Default is 0.999.
- eps (float) – Constant term to avoid divide-by-zero errors during the update calculation. Default is 1e-7.
- clip_norm (float or None) – If not None, all parameter gradients are rescaled to have a maximum L2 norm of clip_norm before the update is computed. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.

update(param, param_grad, param_name, cur_loss=None)

Compute the Adam update for a given parameter.

Parameters:

- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.

Returns:

- updated_params (ndarray of shape (n, m)) – The value of param after applying the Adam update.

## RMSProp

class numpy_ml.neural_nets.optimizers.RMSProp(lr=0.001, decay=0.9, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)

RMSProp optimizer.

Notes

RMSProp was proposed as a refinement of AdaGrad to reduce its aggressive, monotonically decreasing learning rate.

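RMSProp keeps an exponentially decaying average of past squared gradients (a running estimate of the second moment) in its cache, so recent gradients dominate and older ones are gradually forgotten. AdaGrad's cache, by contrast, sums all past squared gradients and can only grow.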

Equations:

cache[t] = decay * cache[t-1] + (1 - decay) * grad[t] ** 2
update[t] = lr * grad[t] / (np.sqrt(cache[t]) + eps)
param[t+1] = param[t] - update[t]


Note that the ** and / operations are elementwise.
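
The equations above can be written as a short NumPy function. This is a sketch under assumptions, not the library's code: `rmsprop_step` is a hypothetical helper that threads the cache through its arguments and return value, and the constant toy gradient is chosen only to show how the step size behaves over time.

```python
import numpy as np


def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-7):
    """One RMSProp step; returns (new_param, new_cache)."""
    cache = decay * cache + (1 - decay) * grad ** 2  # cache[t]
    update = lr * grad / (np.sqrt(cache) + eps)      # update[t], elementwise
    return param - update, cache                     # param[t+1], cache[t]


param = np.array([5.0])
cache = np.zeros_like(param)
grad = np.array([2.0])  # constant gradient, for illustration only
step_sizes = []
for _ in range(200):
    new_param, cache = rmsprop_step(param, grad, cache)
    step_sizes.append(float((param - new_param)[0]))
    param = new_param
```

Under a constant gradient the cache converges to `grad ** 2`, so the step size settles near `lr` instead of shrinking toward zero as it would with AdaGrad's ever-growing sum.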

Parameters:

- lr (float) – Learning rate for update. Default is 0.001.
- decay (float in [0, 1]) – Rate of decay for the moving average. Typical values are [0.9, 0.99, 0.999]. Default is 0.9.
- eps (float) – Constant term to avoid divide-by-zero errors during the update calculation. Default is 1e-7.
- clip_norm (float or None) – If not None, all parameter gradients are rescaled to have a maximum L2 norm of clip_norm before the update is computed. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.

update(param, param_grad, param_name, cur_loss=None)

Compute the RMSProp update for a given parameter.

Parameters:

- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.

Returns:

- updated_params (ndarray of shape (n, m)) – The value of param after applying the RMSProp update.