Gradient approximations¶
This subpackage contains modules that estimate the gradient using function values.
Classes:
- FDM – Approximate gradients via finite difference method.
- ForwardGradient – Forward gradient method.
- GaussianSmoothing – Gradient approximation via Gaussian smoothing method.
- GradApproximator – Base class for gradient approximations.
- MeZO – Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
- RDSA – Gradient approximation via Random-direction stochastic approximation (RDSA) method.
- RandomizedFDM – Gradient approximation via a randomized finite-difference method.
- SPSA – Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
FDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Approximate gradients via finite difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – magnitude of parameter perturbation. Defaults to 1e-3.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- target (Literal, default: 'closure') – what to set on var. Defaults to 'closure'.
Examples:
Plain FDM. Any gradient-based method can use FDM-estimated gradients.
fdm_ncg = tz.Optimizer(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
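To make the role of `formula` and `h` concrete, here is a dependency-free sketch of forward and central differences on a plain Python list. The function `fd_gradient`, its signature, and the formula strings it accepts are hypothetical and chosen for this illustration only; the module itself works on tensors through the closure.

```python
def fd_gradient(f, x, h=1e-3, formula="central"):
    """Estimate the gradient of f at x (a list of floats) from function values only."""
    x = list(x)  # work on a copy so the caller's list is untouched
    grad = []
    f0 = f(x) if formula == "forward" else None  # forward differences reuse f(x)
    for i in range(len(x)):
        orig = x[i]
        if formula == "central":
            x[i] = orig + h
            f_plus = f(x)
            x[i] = orig - h
            f_minus = f(x)
            grad.append((f_plus - f_minus) / (2 * h))
        elif formula == "forward":
            x[i] = orig + h
            grad.append((f(x) - f0) / h)
        else:
            raise ValueError(f"unknown formula: {formula!r}")
        x[i] = orig  # restore the coordinate before moving on
    return grad


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1]


central = fd_gradient(quadratic, [1.0, 2.0], formula="central")
forward = fd_gradient(quadratic, [1.0, 2.0], formula="forward")
```

Central differences cost two evaluations per coordinate, while forward differences cost one per coordinate plus a shared baseline. On this quadratic the central estimate is exact up to rounding, whereas the forward estimate of the first coordinate is off by roughly `h`.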
Source code in torchzero/modules/grad_approximation/fdm.py
ForwardGradient ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Forward gradient method.
This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.
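The idea can be sketched without any autograd library by using dual numbers for the Jacobian-vector product. Everything below (`Dual`, `jvp`, `forward_gradient`) is a hypothetical toy written for this illustration and is not torchzero's implementation; it only supports addition and multiplication.

```python
import random


class Dual:
    """A value together with its directional derivative (tangent)."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __mul__(self, other):
        o = self._lift(other)
        # product rule
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

    __rmul__ = __mul__


def jvp(f, x, v):
    """Return (f(x), directional derivative of f at x along v)."""
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return out.val, out.tan


def forward_gradient(f, x, n_samples=1, rng=random):
    """Average (grad f . v) * v over random Gaussian directions v."""
    estimate = [0.0] * len(x)
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in x]
        _, directional = jvp(f, x, v)
        for i, vi in enumerate(v):
            estimate[i] += directional * vi / n_samples
    return estimate


def quadratic(x):
    return x[0] * x[0] + 3.0 * x[1]


estimate = forward_gradient(quadratic, [1.0, 2.0], n_samples=1000, rng=random.Random(0))
```

Each sample is an unbiased but noisy estimate of the gradient, so increasing `n_samples` trades extra forward passes for lower variance.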
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- distribution (Literal, default: 'gaussian') – distribution for random gradient samples. Defaults to "gaussian".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- jvp_method (str, default: 'autograd') – how to calculate the Jacobian-vector product. Note that with 'forward' and 'central' this is equivalent to randomized finite difference. Defaults to 'autograd'.
- h (float, default: 0.001) – finite difference step size, used if jvp_method is set to 'forward' or 'central'. Defaults to 1e-3.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
Source code in torchzero/modules/grad_approximation/forward_gradient.py
PRE_MULTIPLY_BY_H
class-attribute
¶
GaussianSmoothing ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Gaussian smoothing method.
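As a rough sketch of the estimator described by Nesterov and Spokoiny, the following averages forward-difference slopes along many Gaussian directions. The function `gaussian_smoothing_gradient` is a hypothetical, list-based illustration and not the module's code; its parameter names merely mirror the ones documented below.

```python
import random


def gaussian_smoothing_gradient(f, x, h=1e-2, n_samples=100, seed=None):
    """Average forward-difference slopes along Gaussian directions u."""
    rng = random.Random(seed)
    f0 = f(x)  # baseline value, shared by all samples
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        f_pert = f([xi + h * ui for xi, ui in zip(x, u)])
        slope = (f_pert - f0) / h
        for i, ui in enumerate(u):
            grad[i] += slope * ui / n_samples
    return grad


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1]


smoothed = gaussian_smoothing_gradient(quadratic, [1.0, 2.0], n_samples=1000, seed=0)
```

Because the baseline `f(x)` is shared, each sample costs a single extra function evaluation, which is why a large `n_samples` stays affordable with a forward formula.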
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.01) – finite difference step size. Defaults to 1e-2.
- n_samples (int, default: 100) – number of random gradient samples. Defaults to 100.
- formula (Literal, default: 'forward2') – finite difference formula. Defaults to 'forward2'.
- distribution (Literal, default: 'gaussian') – distribution. Defaults to "gaussian".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – Seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
Source code in torchzero/modules/grad_approximation/rfdm.py
GradApproximator ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for gradient approximations.
This is an abstract class; to use it, subclass it and override approximate.
GradApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.
Parameters:
- defaults (dict[str, Any] | None, default: None) – dict with defaults. Defaults to None.
- target (str, default: 'closure') – whether to set var.grad, var.update or var.closure. Defaults to 'closure'.
Example:
Basic SPSA method implementation.
import torch
# GradApproximator and rademacher_like are torchzero helpers (import paths not shown here)

class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad()
    def approximate(self, closure, params, loss):
        # random +-1 directions, pre-scaled by the step size h
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            # pert already carries a factor of h, hence h**2 in the denominator
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus
Methods:
- approximate – Returns a tuple: (grad, loss, loss_approx). Make sure this resets parameters to their original values!
- pre_step – This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points.
Source code in torchzero/modules/grad_approximation/grad_approximator.py
approximate ¶
approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]
Returns a tuple: (grad, loss, loss_approx). Make sure this resets parameters to their original values!
Source code in torchzero/modules/grad_approximation/grad_approximator.py
pre_step ¶
This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points. This is useful, for example, to pre-generate new random perturbations.
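The split between a once-per-step hook and a repeatable approximation can be illustrated with a small stand-alone class. `ToyRandomApproximator` and its method signatures are hypothetical and do not match the real base class; the sketch only shows why pre-generating the direction keeps repeated estimates within one step consistent.

```python
import random


class ToyRandomApproximator:
    """Hypothetical stand-in showing the pre_step / approximate split."""

    def __init__(self, h=1e-3, pre_generate=True, seed=0):
        self.h = h
        self.pre_generate = pre_generate
        self.rng = random.Random(seed)
        self.direction = None

    def _sample(self, n):
        # Rademacher direction: each entry is -1 or +1
        return [self.rng.choice((-1.0, 1.0)) for _ in range(n)]

    def pre_step(self, n):
        # runs once per optimizer step
        if self.pre_generate:
            self.direction = self._sample(n)

    def approximate(self, f, x):
        # may run several times per step, e.g. inside a line search
        d = self.direction if self.pre_generate else self._sample(len(x))
        f_plus = f([xi + self.h * di for xi, di in zip(x, d)])
        f_minus = f([xi - self.h * di for xi, di in zip(x, d)])
        slope = (f_plus - f_minus) / (2 * self.h)
        return [slope * di for di in d]


def linear(x):
    return 2.0 * x[0] + 3.0 * x[1] + 5.0 * x[2]


approximator = ToyRandomApproximator()
approximator.pre_step(3)
first = approximator.approximate(linear, [0.0, 0.0, 0.0])
second = approximator.approximate(linear, [0.0, 0.0, 0.0])
```

With `pre_generate=True` both calls use the same direction, so `first` and `second` agree; with `pre_generate=False` each call would draw a fresh direction and a line search would compare estimates taken along different directions.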
Source code in torchzero/modules/grad_approximation/grad_approximator.py
MeZO ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
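The memory saving described in the MeZO paper comes from never storing the random direction: it is regenerated from a seed each time it is needed, and parameters are perturbed in place. The sketch below follows that idea on a plain Python list; `perturb_in_place` and `mezo_step` are hypothetical names, and the sketch makes no claim about how this module is implemented internally.

```python
import random


def perturb_in_place(params, seed, scale):
    """Add scale * z to params, regenerating z from seed instead of storing it."""
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)


def mezo_step(f, params, h=1e-3, lr=1e-2, seed=0):
    """One zeroth-order SGD step that keeps only a seed and two loss values."""
    perturb_in_place(params, seed, +h)       # params + h*z
    loss_plus = f(params)
    perturb_in_place(params, seed, -2 * h)   # params - h*z
    loss_minus = f(params)
    perturb_in_place(params, seed, +h)       # back to the original params
    projected_grad = (loss_plus - loss_minus) / (2 * h)
    # the gradient estimate is projected_grad * z, applied by regenerating z once more
    perturb_in_place(params, seed, -lr * projected_grad)
    return projected_grad


def sum_of_squares(x):
    return sum(xi * xi for xi in x)


params = [1.0, 2.0]
initial_loss = sum_of_squares(params)
for step in range(200):
    mezo_step(sum_of_squares, params, seed=step)  # a new seed gives a new direction
final_loss = sum_of_squares(params)
```

Using a different seed on every step is what makes each step explore a new direction; reusing one seed would restrict all updates to a single line.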
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central2') – finite difference formula. Defaults to 'central2'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
Source code in torchzero/modules/grad_approximation/rfdm.py
RDSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Random-direction stochastic approximation (RDSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central2') – finite difference formula. Defaults to 'central2'.
- distribution (Literal, default: 'gaussian') – distribution. Defaults to "gaussian".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – Seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
Source code in torchzero/modules/grad_approximation/rfdm.py
RandomizedFDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via a randomized finite-difference method.
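A minimal sketch of a randomized central difference, assuming a plain list of floats instead of tensors, shows how the `distribution` choice yields the SPSA-like and RDSA-like variants. The function `randomized_fd_gradient` is hypothetical and only borrows the parameter names documented below.

```python
import random


def randomized_fd_gradient(f, x, h=1e-3, n_samples=1,
                           distribution="rademacher", seed=None):
    """Central difference along random directions, averaged over n_samples."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        if distribution == "rademacher":   # SPSA-style directions
            d = [rng.choice((-1.0, 1.0)) for _ in x]
        elif distribution == "gaussian":   # RDSA-style directions
            d = [rng.gauss(0.0, 1.0) for _ in x]
        else:
            raise ValueError(f"unknown distribution: {distribution!r}")
        f_plus = f([xi + h * di for xi, di in zip(x, d)])
        f_minus = f([xi - h * di for xi, di in zip(x, d)])
        slope = (f_plus - f_minus) / (2 * h)  # directional derivative along d
        for i, di in enumerate(d):
            grad[i] += slope * di / n_samples
    return grad


def linear(x):
    return 2.0 * x[0] + 3.0 * x[1]


single = randomized_fd_gradient(linear, [0.0, 0.0], n_samples=1, seed=0)
averaged = randomized_fd_gradient(linear, [0.0, 0.0], n_samples=20000, seed=0)
```

Two function evaluations per sample are needed regardless of the number of parameters, which is the main appeal over coordinate-wise finite differences; the price is a noisy single-sample estimate that only matches the true gradient on average.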
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – Seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
Examples:
Simultaneous perturbation stochastic approximation (SPSA) method¶
SPSA is randomized FDM with a Rademacher distribution and a central formula.
spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
    tz.m.LR(1e-2)
)
Random-direction stochastic approximation (RDSA) method¶
RDSA is randomized FDM, usually with a Gaussian distribution and a central formula.
rdsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
    tz.m.LR(1e-2)
)
Gaussian smoothing method¶
GS uses many Gaussian samples, possibly with a larger finite difference step size.
gs = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
RandomizedFDM with momentum¶
Momentum might help by reducing the variance of the estimated gradients.
momentum_spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-3)
)
Source code in torchzero/modules/grad_approximation/rfdm.py
PRE_MULTIPLY_BY_H
class-attribute
¶
SPSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – Seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771