import torch
torch.manual_seed(0)
import torchzero as tz
from visualbench import FunctionDescent, test_functions
2. Momentum
2.1 Heavy-ball momentum
Momentum can be used to accelerate gradient descent and possibly other algorithms. The idea is to add "inertia" to gradient descent using Polyak's momentum formula: $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1}) $$ Here $\eta$ is the step size and $\beta$ is the momentum hyperparameter, often set to 0.9 or 0.95.
The formula can also be rewritten in the following way: $$ m_{t} \leftarrow \beta m_{t-1} + \nabla f(x_t) $$ $$ x_{t+1} \leftarrow x_t - \eta m_{t} $$ Here $m_t$ is the momentum buffer, which is used to update the parameters.
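To make the buffer form concrete, here is a minimal plain-PyTorch sketch of the heavy-ball update (for illustration only, not torchzero's implementation; the quadratic test function and hyperparameters are made up):

import torch

def f(x):                        # simple quadratic bowl for illustration
    return (x ** 2).sum()

x = torch.tensor([3.0, -2.0], requires_grad=True)
m = torch.zeros_like(x)          # momentum buffer m_t
lr, beta = 0.1, 0.9

for _ in range(100):
    loss = f(x)
    grad, = torch.autograd.grad(loss, x)
    m = beta * m + grad          # m_t <- beta * m_{t-1} + grad f(x_t)
    with torch.no_grad():
        x -= lr * m              # x_{t+1} <- x_t - lr * m_t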
Let's run it on the Rosenbrock function $$ f(x,y) = (1 - x)^2 + 100 (y - x^2)^2 $$
Standard gradient descent requires about 10,000 steps to reach a value of 1e-4, whereas the heavy-ball method requires only about 2,000 steps.
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.HeavyBall(0.95), tz.m.LR(2e-4))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 2.8s., reached loss = 0.000154
2.2 Nesterov's momentum
Heavy-ball momentum can oscillate a lot, as seen on the Rosenbrock function. Nesterov's momentum formula tends to reduce those oscillations, often leading to faster convergence.
$$ m_t \leftarrow \beta (m_{t-1} + \nabla f(x_t)) $$ $$ x_{t+1} \leftarrow x_t - \eta (\nabla f(x_t) + m_t) $$
Reference: Nesterov, Yurii. "A method for solving the convex programming problem with convergence rate O(1/k²)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.
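Here is the same kind of plain-PyTorch sketch, transcribing the two formulas above literally (for illustration only; `tz.m.NAG` may use a slightly different but equivalent formulation):

import torch

def f(x):                        # simple quadratic bowl for illustration
    return (x ** 2).sum()

x = torch.tensor([3.0, -2.0], requires_grad=True)
m = torch.zeros_like(x)
lr, beta = 0.1, 0.9

for _ in range(100):
    loss = f(x)
    grad, = torch.autograd.grad(loss, x)
    m = beta * (m + grad)            # m_t <- beta * (m_{t-1} + grad)
    with torch.no_grad():
        x -= lr * (grad + m)         # x_{t+1} <- x_t - lr * (grad + m_t)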
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.NAG(0.99), tz.m.LR(2e-4))
func.run(optimizer, max_steps=1000)
func.plot(log_contour=True)
finished in 1.6s., reached loss = 6.89e-09
2.3 Exponential moving average
Many optimizers, for example Adam, use an exponential moving average of gradients in place of momentum. The formula is similar to the heavy-ball formula, but with an extra $(1 - \beta)$ factor: $$ m_t \leftarrow \beta m_{t-1} + (1 - \beta) \nabla f(x_t) $$ $$ x_{t+1} \leftarrow x_t - \eta m_t $$
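A useful sanity check (this follows directly from the two formulas, assuming both modules implement them literally): unrolling both recursions with $m_0 = 0$ shows that the EMA buffer is just the heavy-ball buffer scaled by $(1 - \beta)$, $$ m_t^{\mathrm{EMA}} = (1 - \beta)\, m_t^{\mathrm{HB}} $$ so EMA with learning rate $\eta$ traces the same path as heavy ball with learning rate $\eta (1 - \beta)$. That is why the cell below uses `lr=4e-3` with $\beta = 0.95$, matching the `2e-4` of the heavy-ball example ($4 \times 10^{-3} \cdot 0.05 = 2 \times 10^{-4}$).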
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.EMA(0.95), tz.m.LR(4e-3))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 4.1s., reached loss = 0.000117
2.4 Cautious momentum
The idea of cautious updates is that whenever the signs of the gradient and the momentum for a weight differ, the update for that weight is zeroed. Cautioning can be applied to any momentum-based optimizer.
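A rough illustration of the masking rule (a sketch of the idea only, not necessarily how `tz.m.Cautious` is implemented internally; the tensors are made up):

import torch

grad = torch.tensor([0.5, -1.0,  2.0])     # hypothetical gradient
m    = torch.tensor([0.4,  0.3, -1.5])     # hypothetical momentum buffer

mask = (grad * m) > 0                       # True where the signs agree
update = torch.where(mask, m, torch.zeros_like(m))   # zeroed where they differ
print(update)                               # only the first component survives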
func = FunctionDescent('rosen')
optimizer = tz.Modular(func.parameters(), tz.m.HeavyBall(0.95), tz.m.Cautious(), tz.m.LR(3e-4))
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 3.7s., reached loss = 1e-05
2.5 Matrix momentum
Matrix momentum is an interesting and little-known modification of Polyak's momentum in which the momentum parameter $\beta$ is replaced by an implicit matrix. This lets the method use second-order information while only requiring one Hessian-vector product with the previous update on each step.
The method is based on the observation that in the later stages of training, the dynamics with momentum are similar to the dynamics without momentum but with a scaled learning rate: $$ u_{\mathrm{eff}} = \frac{u_0}{1 - \beta} $$ Here $u_{\mathrm{eff}}$ is the effective learning rate, $u_0$ is the actual learning rate and $\beta$ is the momentum hyperparameter.

In Newton's method the learning rate is replaced by the inverse of the Hessian: $$ x_{t+1} \leftarrow x_t - H(x_t)^{-1} \nabla f(x_t) $$ Since Newton's method converges very quickly, we would like the effective learning rate to be $H^{-1}$. Replacing the effective learning rate $u_{\mathrm{eff}}$ with the inverse Hessian $H^{-1}$ in the first formula (and treating $\beta$ as a matrix) gives: $$ H^{-1} = u_0 (I - \beta)^{-1} $$ To get a usable formula, multiply both sides by $(I - \beta)$, then multiply both sides by $H$: $$ H^{-1} (I - \beta) = u_0 (I - \beta)^{-1} (I - \beta) $$ $$ H^{-1} (I - \beta) = u_0 I $$ $$ H H^{-1} (I - \beta) = H (u_0 I) $$ $$ I (I - \beta) = u_0 H $$ $$ I - \beta = u_0 H $$ $$ \beta = I - u_0 H $$ So we have a formula for the momentum parameter $\beta$, which now becomes a matrix, such that the effective learning rate is $H^{-1}$.

Substituting $\beta = I - u_0 H$ into Polyak's momentum formula $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1}) $$ gives the matrix momentum update: $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + (I - u_0 H) (x_t - x_{t-1}) $$ Equivalently, with $s_t = x_t - x_{t-1}$: $$ x_{t+1} \leftarrow x_t - \eta \nabla f(x_t) + s_t - u_0 H s_t $$

This never needs the Hessian or its inverse; all it requires is the Hessian-vector product $H s_t$. The biggest disadvantage is that $u_0$ needs to be tuned, and tuned very carefully: if it is too large the algorithm becomes unstable, and if it is too small the method is indistinguishable from plain gradient descent. The value of $u_0$ should be smaller than $\frac{1}{\lambda_{max}}$, where $\lambda_{max}$ is the largest eigenvalue of the Hessian, which can be estimated with the power method before training.
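To make the update rule concrete, here is a minimal plain-PyTorch sketch that computes $H s_t$ with a double-backward Hessian-vector product (an illustrative sketch only, not `tz.m.MatrixMomentum`'s implementation; the step sizes are chosen by hand for the Booth function, whose largest Hessian eigenvalue is 18):

import torch

def booth(x):  # same test function as the cell below
    return (x[0] + 2 * x[1] - 7) ** 2 + (2 * x[0] + x[1] - 5) ** 2

x = torch.tensor([0.0, 0.0], requires_grad=True)
x_prev = x.detach().clone()
lr, u0 = 1e-2, 0.05                  # u0 kept below 1/lambda_max for stability

for _ in range(200):
    loss = booth(x)
    grad, = torch.autograd.grad(loss, x, create_graph=True)
    s = x.detach() - x_prev                                # s_t = x_t - x_{t-1}
    hvp, = torch.autograd.grad(grad, x, grad_outputs=s)    # H s_t via double backward
    x_prev = x.detach().clone()
    with torch.no_grad():
        x += -lr * grad.detach() + s - u0 * hvp            # matrix momentum update
print(x)                             # approaches the Booth minimum at (1, 3)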
On the Booth function we can see that matrix momentum takes a much straighter path towards the minimum.
func = FunctionDescent('booth')
optimizer = tz.Modular(func.parameters(), tz.m.MatrixMomentum(lr=1e-2, mu=0.1))
func.run(optimizer, max_steps=100)
func.plot()
finished in 0.2s., reached loss = 1.56e-06
2.6 Nested momentum
As an experiment, it is possible to take the output of one momentum module and pass it to another momentum module.
func = FunctionDescent('rosen')
optimizer = tz.Modular(
func.parameters(),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.NAG(0.5),
tz.m.LR(1e-4),
)
func.run(optimizer, max_steps=2000)
func.plot(log_contour=True)
finished in 4.5s., reached loss = 0.0011