import torch
torch.manual_seed(0)
import torchzero as tz
from visualbench import FunctionDescent, test_functions
2. Momentum
2.1 Heavy-ball momentum
Momentum can be used to accelerate gradient descent and possibly other algorithms. The idea is to add "inertia" to gradient descent using Polyak's momentum formula: $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1}) $$ Here $\eta$ is the step size and $\beta$ is the momentum hyperparameter, often set to 0.9 or 0.95.
The formula can also be rewritten in the following way: $$ m_{t} \leftarrow \beta m_{t-1} + \nabla f(x_t) $$ $$ x_{t+1} \leftarrow x_t - \eta m_{t} $$ Here $m_t$ is the momentum buffer, which is used to update the parameters.
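To make the buffer form concrete, here is a minimal plain-PyTorch sketch of the heavy-ball update (for illustration only, not torchzero's implementation; the quadratic test function and hyperparameters are made up):

import torch

def f(x):                        # simple quadratic bowl for illustration
    return (x ** 2).sum()

x = torch.tensor([3.0, -2.0], requires_grad=True)
m = torch.zeros_like(x)          # momentum buffer m_t
lr, beta = 0.1, 0.9

for _ in range(100):
    loss = f(x)
    grad, = torch.autograd.grad(loss, x)
    m = beta * m + grad          # m_t <- beta * m_{t-1} + grad f(x_t)
    with torch.no_grad():
        x -= lr * m              # x_{t+1} <- x_t - lr * m_t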
Let's run it on the Rosenbrock function $$ f(x,y) = (1 - x)^2 + 100 (y - x^2)^2 $$
Standard gradient descent requires about 10,000 steps to reach a value of 1e-4, whereas the heavy-ball method requires only about 2,000 steps.
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.HeavyBall(0.95), tz.m.LR(2e-4))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 2.8s., reached loss = 0.000154
2.2 Nesterov's momentum
Heavy-ball momentum can oscillate a lot, as seen on the Rosenbrock function. Nesterov's momentum formula tends to reduce those oscillations, often leading to faster convergence.
$$ m_t \leftarrow \beta (m_{t-1} + \nabla f(x_t)) $$ $$ x_{t+1} \leftarrow x_t - \eta (\nabla f(x_t) + m_t) $$
Reference: Nesterov, Yurii. "A method for solving the convex programming problem with convergence rate O(1/k²)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.
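Here is the same kind of plain-PyTorch sketch, transcribing the two formulas above literally (for illustration only; `tz.m.NAG` may use a slightly different but equivalent formulation):

import torch

def f(x):                        # simple quadratic bowl for illustration
    return (x ** 2).sum()

x = torch.tensor([3.0, -2.0], requires_grad=True)
m = torch.zeros_like(x)
lr, beta = 0.1, 0.9

for _ in range(100):
    loss = f(x)
    grad, = torch.autograd.grad(loss, x)
    m = beta * (m + grad)            # m_t <- beta * (m_{t-1} + grad)
    with torch.no_grad():
        x -= lr * (grad + m)         # x_{t+1} <- x_t - lr * (grad + m_t)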
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.NAG(0.99), tz.m.LR(2e-4))
func.run(optimizer, max_steps=1000)
func.plot(log_contour=True)
finished in 1.6s., reached loss = 6.89e-09
2.3 Exponential moving average
Many optimizers, for example Adam, use an exponential moving average of gradients in place of momentum. The formula is similar to the heavy-ball formula, but with an extra $(1 - \beta)$ factor: $$ m_t \leftarrow \beta m_{t-1} + (1 - \beta) \nabla f(x_t) $$ $$ x_{t+1} \leftarrow x_t - \eta m_t $$
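A useful sanity check (this follows directly from the two formulas, assuming both modules implement them literally): unrolling both recursions with $m_0 = 0$ shows that the EMA buffer is just the heavy-ball buffer scaled by $(1 - \beta)$, $$ m_t^{\mathrm{EMA}} = (1 - \beta)\, m_t^{\mathrm{HB}} $$ so EMA with learning rate $\eta$ traces the same path as heavy ball with learning rate $\eta (1 - \beta)$. That is why the cell below uses `lr=4e-3` with $\beta = 0.95$, matching the `2e-4` of the heavy-ball example ($4 \times 10^{-3} \cdot 0.05 = 2 \times 10^{-4}$).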
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.EMA(0.95), tz.m.LR(4e-3))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 4.1s., reached loss = 0.000117
2.4 Cautious momentum
The idea of cautious updates is that whenever the signs of the gradient and the momentum for a weight differ, the update for that weight is zeroed. Cautioning can be applied to any momentum-based optimizer.
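A rough illustration of the masking rule (a sketch of the idea only, not necessarily how `tz.m.Cautious` is implemented internally; the tensors are made up):

import torch

grad = torch.tensor([0.5, -1.0,  2.0])     # hypothetical gradient
m    = torch.tensor([0.4,  0.3, -1.5])     # hypothetical momentum buffer

mask = (grad * m) > 0                       # True where the signs agree
update = torch.where(mask, m, torch.zeros_like(m))   # zeroed where they differ
print(update)                               # only the first component survives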
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.HeavyBall(0.95), tz.m.Cautious(), tz.m.LR(3e-4))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 3.7s., reached loss = 1e-05
2.5 Matrix momentum
Matrix momentum is an interesting and little-known modification of Polyak's momentum in which the momentum parameter $\beta$ is replaced by an implicit matrix. This lets the method use second-order information while only requiring one Hessian-vector product with the previous update on each step.
The method is based on the observation that in the later stages of training, the dynamics with momentum are similar to the dynamics without momentum but with a scaled learning rate: $$ u_{\mathrm{eff}} = \frac{u_0}{1 - \beta} $$ Here $u_{\mathrm{eff}}$ is the effective learning rate, $u_0$ is the actual learning rate and $\beta$ is the momentum hyperparameter.

In Newton's method the learning rate is replaced by the inverse of the Hessian: $$ x_{t+1} \leftarrow x_t - H(x_t)^{-1} \nabla f(x_t) $$ Since Newton's method converges very quickly, we would like the effective learning rate to be $H^{-1}$. Replacing the effective learning rate $u_{\mathrm{eff}}$ with the inverse Hessian $H^{-1}$ in the first formula (and treating $\beta$ as a matrix) gives: $$ H^{-1} = u_0 (I - \beta)^{-1} $$ To get a usable formula, multiply both sides by $(I - \beta)$, then multiply both sides by $H$: $$ H^{-1} (I - \beta) = u_0 (I - \beta)^{-1} (I - \beta) $$ $$ H^{-1} (I - \beta) = u_0 I $$ $$ H H^{-1} (I - \beta) = H (u_0 I) $$ $$ I (I - \beta) = u_0 H $$ $$ I - \beta = u_0 H $$ $$ \beta = I - u_0 H $$ So we have a formula for the momentum parameter $\beta$, which now becomes a matrix, such that the effective learning rate is $H^{-1}$.

Substituting $\beta = I - u_0 H$ into Polyak's momentum formula $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1}) $$ gives the matrix momentum update: $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + (I - u_0 H) (x_t - x_{t-1}) $$ Equivalently, with $s_t = x_t - x_{t-1}$: $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + s_t - u_0 H s_t $$

This never needs the Hessian or its inverse; all it requires is the Hessian-vector product $H s_t$. The biggest disadvantage is that $u_0$ needs to be tuned, and tuned very carefully: if it is too large the algorithm becomes unstable, and if it is too small the method is indistinguishable from plain gradient descent. The value of $u_0$ should be smaller than $\frac{1}{\lambda_{max}}$, where $\lambda_{max}$ is the largest eigenvalue of the Hessian, which can be estimated with the power method before training.
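To make the update rule concrete, here is a minimal plain-PyTorch sketch that computes $H s_t$ with a double-backward Hessian-vector product (an illustrative sketch only, not `tz.m.MatrixMomentum`'s implementation; the step sizes are chosen by hand for the Booth function, whose largest Hessian eigenvalue is 18):

import torch

def booth(x):  # same test function as the cell below
    return (x[0] + 2 * x[1] - 7) ** 2 + (2 * x[0] + x[1] - 5) ** 2

x = torch.tensor([0.0, 0.0], requires_grad=True)
x_prev = x.detach().clone()
lr, u0 = 1e-2, 0.05                  # u0 kept below 1/lambda_max for stability

for _ in range(200):
    loss = booth(x)
    grad, = torch.autograd.grad(loss, x, create_graph=True)
    s = x.detach() - x_prev                                # s_t = x_t - x_{t-1}
    hvp, = torch.autograd.grad(grad, x, grad_outputs=s)    # H s_t via double backward
    x_prev = x.detach().clone()
    with torch.no_grad():
        x += -lr * grad.detach() + s - u0 * hvp            # matrix momentum update
print(x)                             # approaches the Booth minimum at (1, 3)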
On the Booth function we can see that matrix momentum takes a much straighter path towards the minimum.
func = FunctionDescent('booth')
optimizer = tz.Modular(func.parameters(), tz.m.MatrixMomentum(lr=1e-2, mu=0.1))
func.run(optimizer, max_steps=100)
func.plot()
finished in 0.2s., reached loss = 1.56e-06
2.6 Nested momentum
As an experiment, it is possible to take the output of one momentum module and pass it to another momentum module.
func = FunctionDescent('rosen')
optimizer = tz.Modular(
func.parameters(),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.LR(1e-4),
)
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 4.5s., reached loss = 0.0011