Optimizers
Popular gradient-based strategies for optimizing parameters in neural networks.
For a discussion regarding the generalization performance of the solutions found via different optimization strategies, see:
[1] Wilson et al. (2017). “The marginal value of adaptive gradient methods in machine learning”, Proceedings of the 31st Conference on Neural Information Processing Systems. https://arxiv.org/pdf/1705.08292.pdf
OptimizerBase
class numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase(lr, scheduler=None)
Bases: abc.ABC
An abstract base class for all Optimizer objects.
It should never be instantiated directly.
SGD
class numpy_ml.neural_nets.optimizers.SGD(lr=0.01, momentum=0.0, clip_norm=None, lr_scheduler=None, **kwargs)
Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase
A stochastic gradient descent optimizer.
Notes
For model parameters \(\theta\), averaged parameter gradients \(\nabla_{\theta} \mathcal{L}\), and learning rate \(\eta\), the SGD update at timestep t is
\[\begin{split}\text{update}^{(t)} &= \text{momentum} \cdot \text{update}^{(t-1)} + \eta^{(t)} \nabla_{\theta} \mathcal{L}\\ \theta^{(t+1)} &\leftarrow \theta^{(t)} - \text{update}^{(t)}\end{split}\]
Parameters:
- lr (float) – Learning rate for SGD. If lr_scheduler is not None, this is used as the starting learning rate. Default is 0.01.
- momentum (float in range [0, 1]) – The fraction of the previous update to add to the current update. If 0, no momentum is applied. Default is 0.
- clip_norm (float or None) – If not None, all param gradients are scaled to have maximum L2 norm of clip_norm before computing update. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
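The clip_norm rescaling described in the parameter list above can be sketched in plain NumPy. This is a minimal illustration, not the library's code: it assumes each parameter's gradient is clipped independently using the L2 norm of the flattened array, and the helper name clip_grad_norm is hypothetical.

```python
import numpy as np


def clip_grad_norm(grad, clip_norm=None):
    """Rescale `grad` so its L2 norm is at most `clip_norm` (hypothetical helper)."""
    if clip_norm is None:
        return grad  # clipping disabled
    norm = np.linalg.norm(grad)  # Frobenius norm for a 2-D array
    if norm > clip_norm:
        # Shrink the gradient while preserving its direction.
        grad = grad * (clip_norm / norm)
    return grad


grad = np.array([[3.0, 4.0]])  # L2 norm is 5
clipped = clip_grad_norm(grad, clip_norm=1.0)
print(clipped, np.linalg.norm(clipped))
```

Gradients whose norm is already below clip_norm pass through unchanged, so the clip only affects unusually large steps.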
update(param, param_grad, param_name, cur_loss=None)
Compute the SGD update for a given parameter.
Parameters:
- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.
Returns: updated_params (ndarray of shape (n, m)) – The value of param after applying the momentum update.
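The SGD update rule from the Notes can be sketched as a standalone function. This is a minimal sketch under stated assumptions: a constant learning rate, no gradient clipping, and a toy quadratic loss. The name sgd_momentum_step is hypothetical and is not part of numpy_ml.

```python
import numpy as np


def sgd_momentum_step(param, grad, prev_update, lr=0.01, momentum=0.0):
    """One SGD step following the update rule in the Notes (hypothetical helper)."""
    # update[t] = momentum * update[t-1] + lr * grad
    update = momentum * prev_update + lr * grad
    # theta[t+1] = theta[t] - update[t]
    return param - update, update


theta = np.array([[1.0, -2.0]])
velocity = np.zeros_like(theta)  # update[0] is taken to be zero
for _ in range(3):
    grad = 2.0 * theta  # gradient of the toy loss sum(theta ** 2)
    theta, velocity = sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9)
print(theta)
```

The caller has to carry the previous update between steps; an optimizer object would typically store this state per parameter name.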
AdaGrad
class numpy_ml.neural_nets.optimizers.AdaGrad(lr=0.01, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)
Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase
An AdaGrad optimizer.
Notes
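Weights that receive large gradients will have their effective learning rate reduced quickly, while weights that receive small or infrequent updates will keep a comparatively larger effective learning rate. Because the cache of squared gradients only grows, the effective learning rate of any weight never increases over time.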
Equations:
    cache[t] = cache[t-1] + grad[t] ** 2
    update[t] = lr * grad[t] / (np.sqrt(cache[t]) + eps)
    param[t+1] = param[t] - update[t]
Note that the ** and / operations are elementwise.
“A downside of Adagrad … is that the monotonic learning rate usually proves too aggressive and stops learning too early.” [1]
References
[1] Karpathy, A. “CS231n: Convolutional neural networks for visual recognition”. https://cs231n.github.io/neural-networks-3/
Parameters:
- lr (float) – Global learning rate. Default is 0.01.
- eps (float) – Smoothing term to avoid divide-by-zero errors in the update calc. Default is 1e-7.
- clip_norm (float or None) – If not None, all param gradients are scaled to have maximum L2 norm of clip_norm before computing update. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)
Compute the AdaGrad update for a given parameter.
Notes
Adjusts the learning rate of each weight based on the magnitudes of its gradients (big gradient -> small lr, small gradient -> big lr).
Parameters:
- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.
Returns: updated_params (ndarray of shape (n, m)) – The value of param after applying the AdaGrad update.
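The AdaGrad equations listed in the Notes translate almost line for line into NumPy. The sketch below assumes a constant learning rate and no gradient clipping; adagrad_step is a hypothetical helper, not the library's update method. The two coordinates receive gradients that differ by a factor of 100, to show how the per-weight scaling equalizes the step size.

```python
import numpy as np


def adagrad_step(param, grad, cache, lr=0.01, eps=1e-7):
    """One AdaGrad step following the equations in the Notes (hypothetical helper)."""
    cache = cache + grad ** 2                     # running sum of squared gradients
    update = lr * grad / (np.sqrt(cache) + eps)   # elementwise scaling
    return param - update, cache


w = np.array([[1.0, 1.0]])
cache = np.zeros_like(w)
for _ in range(2):
    grad = np.array([[10.0, 0.1]])  # one large and one small gradient
    w, cache = adagrad_step(w, grad, cache, lr=0.1)
    print(w, cache)
```

Both coordinates move by nearly the same amount on each step even though their raw gradients differ by two orders of magnitude, and the second step is smaller than the first because the cache has grown.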
Adam
class numpy_ml.neural_nets.optimizers.Adam(lr=0.001, decay1=0.9, decay2=0.999, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)
Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase
Adam (adaptive moment estimation) optimization algorithm.
Notes
Designed to combine the advantages of AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings.
Parameters:
- lr (float) – Learning rate for update. This parameter is ignored if using NoamScheduler. Default is 0.001.
- decay1 (float) – The rate of decay to use for the running estimate of the first moment (mean) of the gradient. Default is 0.9.
- decay2 (float) – The rate of decay to use for the running estimate of the second moment (uncentered variance) of the gradient. Default is 0.999.
- eps (float) – Constant term to avoid divide-by-zero errors during the update calc. Default is 1e-7.
- clip_norm (float or None) – If not None, all param gradients are scaled to have maximum L2 norm of clip_norm before computing update. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)
Compute the Adam update for a given parameter.
Parameters:
- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.
Returns: updated_params (ndarray of shape (n, m)) – The value of param after applying the Adam update.
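This page gives no equations for Adam, so the sketch below follows the commonly used formulation with bias-corrected moment estimates. It is an assumption that the library computes the update in exactly this form (for example, where eps is added); the helper adam_step and its state arguments m, v, and t are hypothetical names chosen for illustration. The keyword names decay1, decay2, and eps mirror the constructor signature above.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.001, decay1=0.9, decay2=0.999, eps=1e-7):
    """One Adam step with bias correction (hypothetical helper, t starts at 1)."""
    m = decay1 * m + (1 - decay1) * grad        # running mean of the gradient
    v = decay2 * v + (1 - decay2) * grad ** 2   # running mean of the squared gradient
    m_hat = m / (1 - decay1 ** t)               # correct the bias toward zero at small t
    v_hat = v / (1 - decay2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


w = np.array([[1.0, -1.0]])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = 2.0 * w  # gradient of the toy loss sum(w ** 2)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

With zero-initialized moments, the first step has magnitude close to lr in every coordinate regardless of the gradient's scale, which is one reason Adam is relatively insensitive to gradient magnitude.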
RMSProp
class numpy_ml.neural_nets.optimizers.RMSProp(lr=0.001, decay=0.9, eps=1e-07, clip_norm=None, lr_scheduler=None, **kwargs)
Bases: numpy_ml.neural_nets.optimizers.optimizers.OptimizerBase
RMSProp optimizer.
Notes
RMSProp was proposed as a refinement of AdaGrad to reduce its aggressive, monotonically decreasing learning rate.
RMSProp scales each update by an exponentially decaying average of the past squared gradients (second moment) rather than by AdaGrad's running sum of all past squared gradients, so the influence of old gradients fades over time.
Equations:
    cache[t] = decay * cache[t-1] + (1 - decay) * grad[t] ** 2
    update[t] = lr * grad[t] / (np.sqrt(cache[t]) + eps)
    param[t+1] = param[t] - update[t]
Note that the ** and / operations are elementwise.
Parameters:
- lr (float) – Learning rate for update. Default is 0.001.
- decay (float in [0, 1]) – Rate of decay for the moving average. Typical values are [0.9, 0.99, 0.999]. Default is 0.9.
- eps (float) – Constant term to avoid divide-by-zero errors during the update calc. Default is 1e-7.
- clip_norm (float or None) – If not None, all param gradients are scaled to have maximum L2 norm of clip_norm before computing update. Default is None.
- lr_scheduler (str, Scheduler object, or None) – The learning rate scheduler. If None, use a constant learning rate equal to lr. Default is None.
update(param, param_grad, param_name, cur_loss=None)
Compute the RMSProp update for a given parameter.
Parameters:
- param (ndarray of shape (n, m)) – The value of the parameter to be updated.
- param_grad (ndarray of shape (n, m)) – The gradient of the loss function with respect to param_name.
- param_name (str) – The name of the parameter.
- cur_loss (float or None) – The training or validation loss for the current minibatch. Used for learning rate scheduling, e.g., by KingScheduler. Default is None.
Returns: updated_params (ndarray of shape (n, m)) – The value of param after applying the RMSProp update.
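The RMSProp equations from the Notes can be sketched in NumPy as follows. The example assumes a constant learning rate and no gradient clipping, and rmsprop_step is a hypothetical helper rather than the library's update method. It applies one step with large gradients followed by steps with zero gradients, to show that the cache decays back toward zero instead of growing without bound as in AdaGrad.

```python
import numpy as np


def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-7):
    """One RMSProp step following the equations in the Notes (hypothetical helper)."""
    # Exponentially decaying average of squared gradients.
    cache = decay * cache + (1 - decay) * grad ** 2
    update = lr * grad / (np.sqrt(cache) + eps)  # elementwise scaling
    return param - update, cache


w = np.array([[1.0, 1.0]])
cache = np.zeros_like(w)

# One step with a large and a small gradient.
w, cache = rmsprop_step(w, np.array([[10.0, 0.1]]), cache, lr=0.01)
big = cache.copy()

# Fifty steps with zero gradient: the parameters stay put and the cache decays.
for _ in range(50):
    w, cache = rmsprop_step(w, np.zeros_like(w), cache, lr=0.01)
print(big, cache)
```

Because the cache shrinks when recent gradients are small, the effective learning rate can recover after a burst of large gradients.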