import matplotlib.pyplot as plt
import torch
torch.manual_seed(0)
from torch import nn
import torch.nn.functional as F
import torchzero as tz
from visualbench import FunctionDescent, test_functions
3. Adaptive methods¶
Adaptive methods use an adaptive per-parameter step size. Most of them are well suited for large scale stochastic optimization, i.e. training neural nets.
3.1 Adagrad¶
Adagrad divides the gradient by the square root of the sum of squares of past gradients, so weights whose gradients have been large get a smaller step size and vice versa. The update is: $$ v_t \leftarrow v_{t-1} + \nabla f(x_t)^2 $$ $$ x_{t+1} \leftarrow x_t - \eta \frac{\nabla f(x_t)}{\sqrt{v_t + \epsilon}} $$
Here $\eta$ is the step size, $v$ accumulates square gradients, and $\epsilon$ is a small scalar added to avoid division by 0.
This is actually the diagonal version of Adagrad; the full-matrix version and approximations to it are described later in this notebook.
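To make the update concrete, here is a minimal sketch of one diagonal Adagrad step on raw tensors (not torchzero's implementation; x, g and the state v are illustrative names):
def adagrad_step(x, g, v, lr=1e-1, eps=1e-8):
    # accumulate squared gradients
    v += g ** 2
    # per-parameter step size eta / sqrt(v + eps)
    x -= lr * g / (v + eps).sqrt()
    return x, v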
For testing we will use the Rosenbrock function: $$ f(x,y) = (1 - x)^2 + 100 * (y - x^2)^2 $$
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.Adagrad(), tz.m.LR(4e-1))
func.run(optimizer, max_steps=5000)
func.plot(log_contour=True)
finished in 8.3s., reached loss = 0.00462
<Axes: >
3.2 RMSprop¶
A drawback of Adagrad is that its simple accumulation never forgets old gradients that may no longer be relevant. RMSprop solves this by replacing the sum of squares of past gradients with an exponential moving average.
$$ v_t \leftarrow \beta v_{t-1} + (1 - \beta) \nabla f(x_t)^2 $$ $$ x_{t+1} \leftarrow x_t - \eta \frac{\nabla f(x_t)}{\sqrt{v_t + \epsilon}} $$
Here $\beta$ is the smoothing hyperparameter that determines how quickly old gradients are forgotten; the default value is 0.99. If $\beta=0$, RMSprop becomes equivalent to SignGD, i.e. it uses only the sign of the gradient.
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.RMSprop(smoothing=0.95, eps=1e-4), tz.m.LR(1e-2))
func.run(optimizer, max_steps=4000)
func.plot(log_contour=True)
finished in 7.4s., reached loss = 0.0222
<Axes: >
3.3 Adam¶
Adam is a modification of RMSprop where, instead of dividing the gradient, it divides an exponential moving average of gradients, i.e. momentum.
$$ m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \nabla f(x_t) $$ $$ v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \nabla f(x_t)^2 $$ $$ x_{t+1} \leftarrow x_t - \frac{\eta_{\text{Adam}} m_t}{\sqrt{v_t + \epsilon}} $$ Here $\beta_1$ controls gradient momentum and $\beta_2$ controls momentum of squared gradients.
Additionally Adam employs debiasing of the step size. The debiased step size is calculated as $$ \eta_{\text{Adam}} = \frac{\eta (1 - \beta_2^k)^{1/2}}{1 - \beta_1^k} $$ where $k$ is the current step, starting from 1.
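As a sanity check of the formulas, here is a minimal sketch of one Adam step with debiasing on raw tensors (not torchzero's implementation; the names are illustrative):
def adam_step(x, g, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # EMAs of gradients and squared gradients
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).add_(g ** 2, alpha=1 - beta2)
    # debiased step size, k is the current step starting from 1
    lr_adam = lr * (1 - beta2 ** k) ** 0.5 / (1 - beta1 ** k)
    x -= lr_adam * m / (v + eps).sqrt()
    return x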
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.Adam(beta1=0.95, beta2=0.99), tz.m.LR(3e-1))
func.run(optimizer, max_steps=500)
func.plot(log_contour=True)
finished in 1.0s., reached loss = 4.02e-05
<Axes: >
3.4 AdaHessian¶
AdaHessian is very similar to Adam, except squared gradients are replaced by squared estimates of the Hessian diagonal, obtained from Hessian-vector products with random vectors.
$$ m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \nabla f(x_t) $$ $$ D = \text{diag}(\mathbf{H}) = \mathbb{E}[\mathbf{z} \odot \mathbf{H}\mathbf{z}] $$ $$ v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) D^2 $$ $$ x_{t+1} \leftarrow x_t - \eta \frac{m_t}{\sqrt{v_t + \epsilon}} $$ Here $\mathbf{z}$ is a random Rademacher vector, and $D$ is approximated as $\mathbf{z} \odot \mathbf{H}\mathbf{z}$, where the Hessian-vector product $\mathbf{H}\mathbf{z}$ can be computed efficiently via autograd (two backward passes) or via finite differences (two gradient evaluations).
The Hessian diagonal is advantageous over squared gradients as it provides better scaling; for example, it allows AdaHessian to minimize a stretched quadratic function with a diagonal Hessian in a single step with a step size of 1.
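For illustration, here is one way to compute the $\mathbf{z} \odot \mathbf{H}\mathbf{z}$ estimate with autograd via double backward (a sketch, not torchzero's implementation):
def hessian_diag_estimate(loss, params):
    # first backward pass with create_graph=True to allow double backward
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # random Rademacher vectors z
    zs = [torch.randint_like(g, 0, 2) * 2 - 1 for g in grads]
    # second backward pass computes the Hessian-vector products Hz
    hvps = torch.autograd.grad(grads, params, grad_outputs=zs)
    # z ⊙ Hz is an unbiased estimate of diag(H)
    return [z * hvp for z, hvp in zip(zs, hvps)]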
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.AdaHessian(beta1=0.95, beta2=0.99, seed=0), tz.m.WarmupNormClip(10), tz.m.LR(1))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 5.8s., reached loss = 0.00339
<Axes: >
3.5 SophiaG, ESGD¶
SophiaG[1] and ESGD (equilibrated stochastic gradient descent)[2] also use Hessian-vector products with random vectors. ESGD is essentially Adagrad with squared gradients replaced by squared Hessian-vector products. SophiaG is similar to AdaHessian, but the Hessian-vector products are taken with normally distributed random vectors, they are not squared, and aggressive clipping of the update is used.
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.SophiaH(update_freq=1, seed=0), tz.m.LR(4e-2))
func.run(optimizer, max_steps=1000)
func.plot(log_contour=True)
finished in 2.3s., reached loss = 0.00977
<Axes: >
3.6 Lion¶
Lion (EvoLved Sign Momentum) is a method discovered by an evolutionary search of optimization algorithms. It essentially uses the sign of dampened momentum.
$$ d_t = sign(\beta_1 v_{t-1} + (1 - \beta_1) \nabla f(x_t)) $$ $$ v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \nabla f(x_t) $$ $$ x_{t+1} \leftarrow x_t - \eta d_t $$
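A minimal sketch of one Lion step on raw tensors (not torchzero's implementation):
def lion_step(x, g, m, lr=1e-4, beta1=0.9, beta2=0.99):
    # sign of interpolated (dampened) momentum
    d = (beta1 * m + (1 - beta1) * g).sign()
    # update the momentum buffer with the second beta
    m.mul_(beta2).add_(g, alpha=1 - beta2)
    x -= lr * d
    return x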
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.Lion(), tz.m.LR(2e-2))
func.run(optimizer, max_steps=1000)
func.plot(log_contour=True)
finished in 1.4s., reached loss = 9.61e-10
<Axes: >
3.7 Grams¶
Grams is Adam, except it uses the magnitude of Adam's update and the sign of the gradient. A similar modification can be applied to any other momentum-based optimizer.
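In other words (a sketch of the idea only; adam_update here is a hypothetical tensor holding Adam's proposed update):
def grams_update(adam_update, g):
    # magnitude of Adam's update, sign of the gradient
    return adam_update.abs() * g.sign()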
func = FunctionDescent('rosen')
optimizer = tz.Modular(
    func.parameters(),
    tz.m.Adam(beta1=0.95, beta2=0.95),
    tz.m.GradSign(),
    tz.m.LR(3e-1)
)
func.run(optimizer, max_steps=500)
func.plot(log_contour=True)
finished in 1.7s., reached loss = 0.00616
<Axes: >
3.8 LaProp¶
LaProp, like Adam, is a version of RMSprop with momentum and Adam debiasing. LaProp first divides the gradient by the exponential moving average of squared gradients, and then tracks an exponential moving average of that update.
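The order of operations is the only difference from Adam; here is a sketch without debiasing (not torchzero's implementation):
def laprop_step(x, g, m, v, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    # EMA of squared gradients
    v.mul_(beta2).add_(g ** 2, alpha=1 - beta2)
    # divide the gradient first...
    u = g / (v + eps).sqrt()
    # ...then track an EMA of the normalized update
    m.mul_(beta1).add_(u, alpha=1 - beta1)
    x -= lr * m
    return x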
func = FunctionDescent('rosen')
optimizer = tz.Modular(
    func.parameters(),
    tz.m.RMSprop(0.95),
    tz.m.Debias(beta1=None, beta2=0.95),
    tz.m.EMA(0.95),
    tz.m.Debias(beta1=0.95, beta2=None),
    tz.m.LR(3e-1)
)
func.run(optimizer, max_steps=500)
func.plot(log_contour=True)
finished in 1.0s., reached loss = 1.88e-05
<Axes: >
3.9 Adan¶
Adan (Adaptive Nesterov momentum) combines Nesterov momentum with adaptive per-parameter learning rates and is suggested for large-batch optimization. The update rule is quite long, so refer to the Adan paper for the full formulation.
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.Adan(), tz.m.LR(1e-1))
func.run(optimizer, max_steps=500)
func.plot(log_contour=True)
finished in 0.9s., reached loss = 3.32e-07
<Axes: >
3.10 MARS¶
MARS (Make vAriance Reduction Shine) is a variance reduction method which can actually be applied to any other optimizer. It calculates a variance-reduced gradient as follows: $$ c_t = \nabla f(x_t, \xi_t) + \gamma \frac{\beta}{1 - \beta} (\nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t)) $$ $$ \text{if } ||c_t||_2 > 1, \text{then } c_t \rightarrow \frac{c_t}{||c_t||_2} $$ Here $c_t$ is the variance-reduced gradient that can be passed to another optimizer, $\beta$ should be set to the same value as the momentum hyperparameter of that optimizer, and $\gamma$ is a hyperparameter controlling the scale of the correction.
Here $\nabla f(x_t, \xi_t)$ means the gradient at parameters $x_t$ with mini-batch $\xi_t$. If you look at the first formula, it also requires $\nabla f(x_{t-1}, \xi_t)$ - the gradient at the previous parameters $x_{t-1}$ with the current mini-batch $\xi_t$. Funnily enough, the authors suggest replacing $\nabla f(x_{t-1}, \xi_t)$ with the approximation $\nabla f(x_{t-1}, \xi_{t-1})$, which doesn't require an extra evaluation, and it performs just as well in their experiments despite essentially not doing any variance reduction anymore. Nonetheless, whatever the MARS correction does, it appears to consistently speed Adam up.
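A sketch of the correction on raw tensors, using the cheap approximation where the previous gradient comes from the previous mini-batch (not torchzero's implementation; the default γ here is only illustrative):
def mars_correction(g, g_prev, beta=0.95, gamma=0.025):
    # variance-reduced gradient
    c = g + gamma * (beta / (1 - beta)) * (g - g_prev)
    # clip to unit norm
    norm = c.norm()
    if norm > 1:
        c = c / norm
    return c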
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.MARSCorrection(beta=0.95), tz.m.Adam(beta1=0.95), tz.m.LR(1e-1))
func.run(optimizer, max_steps=500)
func.plot(log_contour=True)
finished in 1.2s., reached loss = 9.95e-07
<Axes: >
3.11 Rprop¶
Rprop uses only the sign of the gradient. It uses a per-parameter adaptive learning rate: if the gradient sign changed after an update, the learning rate for that weight is reduced and the update for that weight is undone; otherwise the learning rate is increased. Rprop requires essentially no tuning, as the learning rates are adapted automatically.
Because it relies on sign changes between consecutive gradients, it is not suitable for stochastic optimization. However it can be a very strong method for non-stochastic tasks; for example, in my experiments it outperformed L-BFGS on training PINNs, although SOAP then beat both.
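A simplified sketch of the step-size adaptation (without the weight backtracking, and not torchzero's implementation):
def rprop_step(x, g, g_prev, steps, grow=1.2, shrink=0.5, min_step=1e-6, max_step=1.0):
    agreement = g * g_prev
    # same sign: increase that weight's step size; sign flipped: decrease it
    steps[agreement > 0] *= grow
    steps[agreement < 0] *= shrink
    steps.clamp_(min_step, max_step)
    x -= steps * g.sign()
    return x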
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.Rprop(), tz.m.LR(1e-3))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 3.0s., reached loss = 0.0141
<Axes: >
3.12 Full-matrix Adagrad¶
The full-matrix version of Adagrad uses the following update rule: $$ G_t \leftarrow G_{t-1} + \nabla f(x_t) \nabla f(x_t)^T $$ $$ x_{t+1} \leftarrow x_t - \eta G_t^{-1/2}\nabla f(x_t) $$
On each step it adds the outer product of the gradient with itself to the accumulator $G$, and calculates the update as the inverse square root of $G$ times the gradient.
What this does is similar to ZCA whitening. If we consider a dataset of all past gradients, the accumulator $G$ is an approximation to the covariance matrix of that dataset. ZCA whitening would center the dataset and multiply it by $G^{-1/2}$, which makes the covariance of the transformed dataset the identity. Full-matrix Adagrad is the same except it doesn't center.
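A sketch of the full-matrix update for a small flat parameter vector, computing the inverse matrix square root via an eigendecomposition (not torchzero's implementation):
def full_matrix_adagrad_step(x, g, G, lr=1.0, eps=1e-8):
    # accumulate outer products of gradients
    G += torch.outer(g, g)
    # inverse matrix square root via eigendecomposition G = Q diag(L) Q^T
    L, Q = torch.linalg.eigh(G)
    G_inv_sqrt = Q @ torch.diag((L + eps).rsqrt()) @ Q.T
    x -= lr * G_inv_sqrt @ g
    return x, G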
In order to demonstrate the power of full-matrix Adagrad we will use the rotated quadratic function, defined as $$ f(x,y) = x^2 + y^2 + 1.99 * x * y $$
This is a very hard function, and because it is rotated (i.e. has large off-diagonal Hessian elements), diagonal methods take a large number of steps to converge (even Adam). Full-matrix Adagrad, however, takes a straight path toward the minimum, minimizing the function in far fewer steps.
fig, ax = plt.subplots(ncols=2, figsize=(12,6))
# diagonal Adagrad
func = FunctionDescent('ill2')
optimizer = tz.Modular(func.parameters(), tz.m.Adagrad(), tz.m.LR(2))
func.run(optimizer, max_steps=1000)
func.plot(log_contour=True, ax=ax[0])
ax[0].set_title("Diagonal Adagrad (1000 steps)")
# full-matrix Adagrad
func = FunctionDescent('ill2')
optimizer = tz.Modular(func.parameters(), tz.m.FullMatrixAdagrad(init='zeros', reg=1e-4), tz.m.WarmupNormClip(10), tz.m.LR(1))
func.run(optimizer, max_steps=100)
func.plot(log_contour=True, ax=ax[1])
ax[1].set_title("Full-matrix Adagrad (100 steps)")
plt.show()
finished in 1.9s., reached loss = 0.0196 finished in 0.2s., reached loss = 1.82e-05
3.13 Full-matrix Adam¶
The FullMatrixAdagrad module is flexible and allows defining a full-matrix version of many other algorithms. Recall that Adam applies preconditioning to momentum, and uses an exponential moving average instead of a sum. In torchzero, many algorithms that implement preconditioning have an inner argument, which accepts modules that the preconditioning will be applied to instead of the gradient, and we can put tz.m.EMA in it.
Since we are using Adam, we can up the difficulty of the function by making it even more stretched: $$ f(x,y) = x^2 + y^2 + 1.999 * x * y $$
fig, ax = plt.subplots(ncols=2, figsize=(12,6))
# diagonal Adam
func = FunctionDescent('ill3')
optimizer = tz.Modular(func.parameters(), tz.m.Adam(0.9, 0.95), tz.m.LR(1))
func.run(optimizer, max_steps=500)
func.plot(log_contour=True, ax=ax[0])
ax[0].set_title("Diagonal Adam (500 steps)")
# full-matrix Adam
func = FunctionDescent('ill3')
optimizer = tz.Modular(
    func.parameters(),
    tz.m.FullMatrixAdagrad(init='zeros', reg=1e-5, beta=0.95, inner=tz.m.EMA(0.9)),
    tz.m.Debias(0.9, 0.95),
    tz.m.LR(1)
)
func.run(optimizer, max_steps=100)
func.plot(log_contour=True, ax=ax[1])
ax[1].set_title("Full-matrix Adam (100 steps)")
plt.show()
finished in 0.8s., reached loss = 3.52e-05 finished in 0.2s., reached loss = 1.53e-06
3.14 Shampoo¶
A big drawback of full-matrix Adagrad is that it requires storing a large $n \times n$ matrix and calculating its inverse square root, so it is really only feasible up to around 5,000 parameters. Various methods exist that approximate $G$, for example Shampoo and SM3.
Shampoo utilizes the tensor structure of the parameters - for an $H \times W$ parameter it preconditions the rows and the columns separately, leading to $H^2 + W^2$ space. For example, a $100 \times 100$ parameter has $10{,}000$ entries, so it would require $10{,}000^2 = 10^8$ values with full-matrix Adagrad, and only $100^2 + 100^2 = 20{,}000$ with Shampoo.
The Shampoo update for a matrix parameter $X_t$ is as follows: $$ G_t = \nabla f(X_t) \\\\ L_t \leftarrow L_{t-1} + G_t G_t^T \\\\ R_t \leftarrow R_{t-1} + G_t^T G_t \\\\ X_{t+1} \leftarrow X_t - \eta L_t^{-1/4} G_t R_t^{-1/4} $$
Shampoo is also defined in a similar way for tensor parameters (with dimension larger than 2). I don't know why it uses $1/4$ instead of $1/2$, but there must be some good reason, although replacing it with $1/2$ has also been suggested in some follow-up papers.
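A sketch of the update for a single $H \times W$ matrix parameter, with the inverse fourth roots computed through eigendecompositions (not torchzero's implementation):
def inv_root(M, p, eps=1e-8):
    # inverse p-th root of a symmetric PSD matrix via eigendecomposition
    L, Q = torch.linalg.eigh(M)
    return Q @ torch.diag((L + eps) ** (-1.0 / p)) @ Q.T

def shampoo_step(X, G, L_acc, R_acc, lr=1e-1):
    # accumulate the H×H and W×W preconditioners
    L_acc += G @ G.T
    R_acc += G.T @ G
    X -= lr * inv_root(L_acc, 4) @ G @ inv_root(R_acc, 4)
    return X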
Shampoo is available in torchzero as tz.m.Shampoo. Similarly to tz.m.FullMatrixAdagrad, it is flexible and allows defining Shampoo versions of various algorithms.
On a two-dimensional function Shampoo is equivalent to full-matrix Adagrad, so we can't really visualize it.
3.15 Limited-memory Adagrad¶
Limited-memory Adagrad stores a history of the past $k$ gradients and computes an update equivalent to full-matrix Adagrad on those past $k$ gradients.
The update rule:
- take past $k$ gradient vectors and stack them as columns into a single matrix $G\in \mathbb{R}^{d\times k}$, where $d$ is dimensionality of the problem.
- calculate $M = G^TG$, which is a ${k\times k}$ matrix.
- add damping to $M$: $$ M \leftarrow M + \lambda I $$ Here $\lambda$ is the damping hyperparameter, which can be some small value like 1e-6.
- Calculate the eigendecomposition of $M$: $$ M = QΛQ^T $$
- Compute $U$: $$ U = GQΛ^{-1/2} $$
- So now we have $U$ and $\Sigma$ from SVD(G), but computed more efficiently. We want to compute $(GG^T)^{-1/2}$, which can be expressed through SVD of $G$: $$ GG^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma^2 U^T \\\\ (GG^T)^{-1/2} = (U \Sigma^2 U^T)^{-1/2} = U (\Sigma^2)^{-1/2} U^T = U \Sigma^{-1} U^T $$ therefore the update rule is $$ x_{t+1} \leftarrow x_t - \eta U_t Λ_t^{-1/2} U_t^T \nabla f(x_t) $$
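Putting the steps above together (a sketch on raw tensors, not torchzero's implementation):
def lm_adagrad_direction(G, g, damping=1e-6):
    # G holds the past k gradients as columns, g is the current gradient
    k = G.shape[1]
    M = G.T @ G + damping * torch.eye(k)
    # M = Q diag(L) Q^T
    L, Q = torch.linalg.eigh(M)
    L = L.clamp(min=1e-12)
    # U = G Q Λ^{-1/2} recovers the left singular vectors of G
    U = G @ Q @ torch.diag(L.rsqrt())
    # (GG^T)^{-1/2} g = U Λ^{-1/2} U^T g
    return U @ torch.diag(L.rsqrt()) @ (U.T @ g)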
func = FunctionDescent('ill2')
optimizer = tz.Modular(func.parameters(), tz.m.LMAdagrad(damping=0), tz.m.LR(1))
func.run(optimizer, max_steps=100)
func.plot(log_contour=True)
finished in 0.3s., reached loss = 6.24e-10
<Axes: >
3.16 Muon¶
Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimization algorithm made specifically for matrix parameters - ones that are involved in matrix multiplications. In a typical neural network, the weights of linear layers, convolutional layers, and the Q, K and V projections in a transformer are examples of such parameters.
Muon approximately orthogonalizes the update for those parameters using an efficient Newton-Schulz iteration. The motivation behind this can be found in the author's blog post - https://kellerjordan.github.io/posts/muon/. In Muon, orthogonalization is applied to Nesterov momentum, and non-matrix parameters are optimized using some other optimizer, for example Adam. Muon (ignoring momentum) is also equivalent to memoryless Shampoo, where the accumulators are reset after each step.
Since Muon only works on matrices, we can't easily visualize it on our 2D function, so I will just give an example of how Muon is constructed.
Authors suggest that embeddings, classifier heads, hidden gains/biases and first convolutional layer should be optimized using standard AdamW. Embeddings are not involved in matrix multiplications, while the other recommendations are empirical.
In torchzero, orthogonalization with a Newton-Schulz solver is implemented as the Orthogonalize module. Parameters can be split using the Split module based on a filter. The filter can either be a callable, a parameter itself, or a list of filters.
For example to select 2D+ parameters to optimize with Muon we can do this:
filter = lambda x: x.ndim >= 2
or to be more specific, pass an iterable of specific tensors for Muon:
filter = (model.c2.weight, model.c3.weight)
class TinyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 16, 3, 2)
        self.c2 = nn.Conv2d(16, 24, 3, 2)
        self.c3 = nn.Conv2d(24, 32, 2, 2)
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32*9, 10))

    def forward(self, x):
        x = F.elu(self.c1(x), inplace=True)
        x = F.elu(self.c2(x), inplace=True)
        x = F.elu(self.c3(x), inplace=True)
        return self.head(x)
model = TinyConvNet().cuda()
optimizer = tz.Modular(
    model.parameters(),
    # same hyperparams as https://github.com/KellerJordan/Muon
    tz.m.NAG(0.95),
    tz.m.Split(
        filter = (model.c2.weight, model.c3.weight),
        # or `filter = lambda x: x.ndim >= 2` to select all 2D+ params.
        true = tz.m.Orthogonalize(),
        false = [tz.m.Adam(0.9, 0.95), tz.m.Mul(1/66)],
    ),
    tz.m.LR(1e-2),
)
3.17 SOAP¶
SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) runs Adam in the eigenbasis of Shampoo's preconditioner. SOAP is more stable and computationally efficient than Shampoo, and converges faster. The update rule is quite long, so refer to the SOAP paper for the full formulation.
On 1D parameters SOAP uses a full-matrix preconditioner; however, it uses a different solver which appears to be significantly more robust, so it is able to minimize even the most extreme version of the rotated quadratic function in a few steps:
$$ f(x,y) = x^2 + y^2 + 1.999999 * x * y $$
No other adaptive method is able to minimize this; even quasi-Newton methods tend to fail due to the extremely large condition number of the Hessian.
func = FunctionDescent('ill6')
optimizer = tz.Modular(func.parameters(), tz.m.SOAP(), tz.m.LR(1e-1))
func.run(optimizer, max_steps=200)
func.plot(log_contour=True)
finished in 0.4s., reached loss = 7.75e-07
<Axes: >
3.18 Natural gradient¶
Natural gradient uses the Fisher information matrix (FIM) as the preconditioner, and once I manage to understand what the FIM does, I will promptly attempt to explain it here.
Typically we don't have access to the FIM. However, an approximation called the empirical Fisher information matrix (EFIM) exists when we are training a model on a batch of samples.
Given $k$ samples, we calculate $k$ per-sample gradients and stack them into a matrix $G\in \mathbb{R}^{d\times k}$.
Then empirical fisher is calculated as $F = GG^T$.
The update is then calculated as: $$ x_{t+1} \leftarrow x_t - \eta F_t^{-1}\nabla f(x_t) $$
You might recognize that this is very similar to full-matrix Adagrad, except for some reason it doesn't have a square root.
In practice, since $F=GG^T$ is a low-rank matrix, we don't need to form it explicitly; torchzero implements a more efficient way to calculate $F^{-1}\nabla f(x)$ and thus doesn't use $N^2$ memory.
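For intuition, here is one way to apply $F^{-1}$ (as a damped pseudo-inverse) without ever forming the $d \times d$ matrix, working only with the small $k \times k$ Gram matrix; this is just a sketch and not necessarily how torchzero does it:
def natural_gradient_direction(G, g, damping=1e-8):
    # G holds k per-sample gradients as columns, g is the (mean) gradient
    k = G.shape[1]
    M = G.T @ G + damping * torch.eye(k)
    # (GG^T)^+ g ≈ G M^{-2} G^T g
    w = torch.linalg.solve(M, torch.linalg.solve(M, G.T @ g))
    return G @ w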
Since natural gradient requires training on samples, we can't really visualize it on a 2D function. Instead here is a basic training example.
We pass loss=losses to opt.step, and it takes care of calculating a batch of gradients. The gradients are calculated in batched mode.
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
opt = tz.Modular(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)
for i in range(100):
    y_hat = model(X) # (64, 10)
    losses = (y_hat - y).pow(2).mean(0) # (10, )
    opt.step(loss=losses)
    if i % 10 == 0:
        print(f'{losses.mean() = }')
losses.mean() = tensor(1.1702, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.8838, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.6578, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.4738, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.3156, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.2008, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.1394, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.0933, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.0734, grad_fn=<MeanBackward0>) losses.mean() = tensor(0.0644, grad_fn=<MeanBackward0>)
3.19 Combinations¶
If one desires, one can create an abomination by chaining various optimizers together, although the practicality of such an approach remains questionable.
func = FunctionDescent('ill4')
optimizer = tz.Modular(
    func.parameters(),
    tz.m.experimental.FFTProjection(
        [tz.m.MARSCorrection(), tz.m.Adan(0.7), tz.m.SOAP(0.3), tz.m.NAG(-0.9), tz.m.LR(2e-1)]
    )
)
func.run(optimizer, max_steps=500)
func.plot(log_contour=True)
finished in 1.6s., reached loss = 1.14e-08
<Axes: >
3.20 What should I use?¶
I can suggest a very strong baseline:
opt = tz.Modular(
    model.parameters(),
    tz.m.SOAP(),
    tz.m.ClipNormByEMA(max_ema_growth=1.2),
    tz.m.RelativeWeightDecay(1e-2),
    tz.m.LR(lr),
)
Muon can also have very good performance on some models, and it is much cheaper to compute than SOAP.
Another strong baseline is PSGD Kron, although it has not yet been implemented in torchzero.