This repo contains various experiments I'm doing with optimizers and learning rate schedulers.
## SCRAM (Scale and Rotation Invariant Momentum)
SCRAM is similar to the Lion optimizer, but normalizes each parameter's update by its root mean square (RMS) rather than taking the sign. This makes the optimizer invariant to orthonormal transformations that rotate channels into each other.
Recommended hyperparameters for a model where AdamW is best at lr=1e-4:
eps | learning rate | beta1 | beta2 |
---|---|---|---|
1e-15 | 1e-6 | 0.98 | 0.99 |
For best results, gradient clipping should be disabled.
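
Below is a minimal sketch of the update rule as described above, assuming per-tensor RMS normalization and Lion-style decoupled momentum betas; the function and buffer names are placeholders rather than this repo's actual API.

```python
import torch

@torch.no_grad()
def scram_step(param, grad, exp_avg, lr=1e-6, beta1=0.98, beta2=0.99, eps=1e-15):
    """One SCRAM-like step for a single tensor (illustrative sketch only)."""
    # Lion-style update direction: interpolate momentum and gradient with beta1.
    update = exp_avg.mul(beta1).add(grad, alpha=1 - beta1)
    # Normalize by the RMS instead of taking the sign (assumption: the RMS is
    # taken over the whole tensor).
    rms = update.pow(2).mean().sqrt()
    param.add_(update / (rms + eps), alpha=-lr)
    # The momentum buffer itself is updated with the slower beta2, as in Lion.
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, exp_avg
```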
## SIMON (Sigma Momentum)
An AdaBelief derivative that incorporates the momentum modifications from Lion and uses a slightly different way of calculating the standard deviation. In my tests it has been the best optimizer I've found for many problems.
Recommended hyperparameters for a model where AdamW is best at lr=1e-4:
eps | learning rate | beta1 | beta2 | rmsclip | layerwise | normalize |
---|---|---|---|---|---|---|
1e-15 | 1e-4 | 0.98 | 0.99 | False | False | False |
For best results, gradient clipping should be disabled.
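
For orientation only, here is a rough sketch of the kind of update the description above implies: an AdaBelief-style second moment of (gradient − momentum) combined with Lion-style decoupled momentum. SIMON's actual standard-deviation calculation differs slightly from plain AdaBelief, the rmsclip/layerwise/normalize options are not modelled, and all names are placeholders.

```python
import torch

@torch.no_grad()
def simon_step(param, grad, exp_avg, exp_var, lr=1e-4, beta1=0.98, beta2=0.99, eps=1e-15):
    """One SIMON-like step for a single tensor (illustrative sketch only)."""
    # Lion-style update direction: interpolate momentum and gradient with beta1.
    update = exp_avg.mul(beta1).add(grad, alpha=1 - beta1)
    # AdaBelief-style second moment: variance of the gradient around the momentum.
    # (Stand-in for SIMON's slightly different standard-deviation estimate.)
    diff = grad - exp_avg
    exp_var.mul_(beta2).addcmul_(diff, diff, value=1 - beta2)
    # Scale the update by the estimated standard deviation.
    param.addcdiv_(update, exp_var.sqrt().add_(eps), value=-lr)
    # The momentum buffer is updated with the slower beta2, as in Lion.
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
    return param, exp_avg, exp_var
```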
## ESGD (Ensemble Stochastic Gradient Descent)
A modification of stochastic gradient descent with momentum and filterwise normalization that simulates a very large ensemble of models by maintaining two copies of each weight and, at each optimization step, independently selecting at random which copy to use for each weight.
ESGD seems to be particularly good at adversarial training.
Recommended hyperparameters for a model where AdamW is best at lr=1e-4:
eps | learning rate | beta1 | beta2 | p | swap_ratio |
---|---|---|---|---|---|
1e-15 | 1e-4 | 0.99 | 0.99 | 0.5 | 0.99 |
For best results, gradient clipping should be disabled.
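
A rough sketch of the scheme described above, assuming that `p` is the per-weight probability of picking the first copy and that filterwise normalization means dividing by the RMS over each output channel; the swap_ratio behaviour is omitted and all names are placeholders.

```python
import torch

@torch.no_grad()
def esgd_select(weight_a, weight_b, p=0.5):
    """Assemble the active parameter for the forward/backward pass by picking
    each weight independently from copy A with probability p (sketch)."""
    mask = torch.rand_like(weight_a) < p
    return torch.where(mask, weight_a, weight_b), mask

@torch.no_grad()
def esgd_step(weight_a, weight_b, mask, grad, exp_avg, lr=1e-4, beta1=0.99, eps=1e-15):
    """Apply a momentum + filterwise-normalized update to whichever copy each
    weight was drawn from this step (illustrative sketch; swap_ratio omitted)."""
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Filterwise normalization: RMS over all dims except the first (output channels).
    # 1-D parameters such as biases fall back to per-element normalization here.
    dims = tuple(range(1, exp_avg.dim()))
    rms = exp_avg.pow(2).mean(dim=dims, keepdim=True).sqrt() if dims else exp_avg.abs()
    step = lr * exp_avg / (rms + eps)
    # The update only touches the copy that was selected for each weight.
    weight_a.sub_(step * mask)
    weight_b.sub_(step * ~mask)
    return weight_a, weight_b, exp_avg
```

In training code, `esgd_select` would build the weights actually used for the forward pass, and `esgd_step` would be called with the same mask after backpropagation.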