Gradient approximations¶
This subpackage contains modules that estimate the gradient using function values.
Classes:

- FDM – Approximate gradients via finite difference method.
- ForwardGradient – Forward gradient method.
- GaussianSmoothing – Gradient approximation via Gaussian smoothing method.
- GradApproximator – Base class for gradient approximations.
- MeZO – Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
- RDSA – Gradient approximation via Random-direction stochastic approximation (RDSA) method.
- RandomizedFDM – Gradient approximation via a randomized finite-difference method.
- SPSA – Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
FDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Approximate gradients via finite difference method.
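With a central difference formula, each partial derivative is estimated as (f(x + h·eᵢ) − f(x − h·eᵢ)) / (2h), so a full gradient requires on the order of two function evaluations per parameter.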
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- h (float, default: 0.001) – magnitude of parameter perturbation. Defaults to 1e-3.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- target (Literal, default: 'closure') – what to set on var. Defaults to 'closure'.
Examples:

Plain FDM:
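A minimal sketch, assuming the tz.m.LR module used in the RandomizedFDM examples below:

fdm = tz.Optimizer(
    model.parameters(),
    tz.m.FDM(),     # estimate gradients via finite differences
    tz.m.LR(1e-2),  # apply them with a fixed learning rate
)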
Any gradient-based method can use FDM-estimated gradients.
fdm_ncg = tz.Optimizer(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
Source code in torchzero/modules/grad_approximation/fdm.py
ForwardGradient ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Forward gradient method.
This method evaluates one or more directional derivatives along random directions via autograd Jacobian-vector products. It is very similar to randomized finite differences.
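The estimator from the reference below scales a random direction by its directional derivative. A minimal standalone sketch using torch.func.jvp (illustrative only, not this module's implementation):

import torch
from torch.func import jvp

def forward_gradient(f, x):
    # sample a random direction
    v = torch.randn_like(x)
    # one forward-mode pass gives the directional derivative <grad f(x), v>
    _, dir_deriv = jvp(f, (x,), (v,))
    # scaling the direction by its directional derivative gives an
    # unbiased estimate of grad f(x) when v ~ N(0, I)
    return dir_deriv * v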
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- distribution (Literal, default: 'gaussian') – distribution for random gradient samples. Defaults to "gaussian".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- jvp_method (str, default: 'autograd') – how to calculate the jacobian-vector product; note that with 'forward' and 'central' this is equivalent to randomized finite differences. Defaults to 'autograd'.
- h (float, default: 0.001) – finite difference step size if jvp_method is set to 'forward' or 'central'. Defaults to 1e-3.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
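Example:

A sketch following the pattern of the other examples on this page, assuming ForwardGradient is exposed under tz.m like the other modules:

fg = tz.Optimizer(
    model.parameters(),
    tz.m.ForwardGradient(n_samples=4),
    tz.m.LR(1e-3),
)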
Source code in torchzero/modules/grad_approximation/forward_gradient.py
PRE_MULTIPLY_BY_H class-attribute ¶
GaussianSmoothing ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Gaussian smoothing method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- h (float, default: 0.01) – finite difference step size. Defaults to 1e-2.
- n_samples (int, default: 100) – number of random gradient samples. Defaults to 100.
- formula (Literal, default: 'forward2') – finite difference formula. Defaults to 'forward2'.
- distribution (Literal, default: 'gaussian') – distribution. Defaults to "gaussian".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
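Example:

A sketch mirroring the Gaussian smoothing example under RandomizedFDM below, assuming GaussianSmoothing is exposed under tz.m like the other modules:

gs = tz.Optimizer(
    model.parameters(),
    tz.m.GaussianSmoothing(n_samples=100, h=1e-2),
    tz.m.LR(1e-2),
)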
Source code in torchzero/modules/grad_approximation/rfdm.py
GradApproximator ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for gradient approximations.
This is an abstract class; to use it, subclass it and override approximate.
GradApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.
Parameters:

- defaults (dict[str, Any] | None, default: None) – dict with defaults. Defaults to None.
- target (str, default: 'closure') – whether to set var.grad, var.update, or var.closure. Defaults to 'closure'.
Example:
Basic SPSA method implementation.
class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus
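Once defined, the subclass can be composed with other modules in the same way as the built-in approximators (a sketch; SPSA here is the class defined above):

opt = tz.Optimizer(
    model.parameters(),
    SPSA(h=1e-3),
    tz.m.LR(1e-3),
)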
Methods:

- approximate – Returns a tuple (grad, loss, loss_approx); make sure this resets parameters to their original values!
- pre_step – Runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points.
Source code in torchzero/modules/grad_approximation/grad_approximator.py
approximate ¶
approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]
Returns a tuple: (grad, loss, loss_approx). Make sure this resets parameters to their original values!
Source code in torchzero/modules/grad_approximation/grad_approximator.py
pre_step ¶
This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points. This is useful, for example, to pre-generate new random perturbations.
Source code in torchzero/modules/grad_approximation/grad_approximator.py
MeZO ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central2') – finite difference formula. Defaults to 'central2'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
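Example:

A sketch following the pattern of the other examples on this page, assuming MeZO is exposed under tz.m like the other modules:

mezo = tz.Optimizer(
    model.parameters(),
    tz.m.MeZO(),
    tz.m.LR(1e-3),
)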
Source code in torchzero/modules/grad_approximation/rfdm.py
RDSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Random-direction stochastic approximation (RDSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central2') – finite difference formula. Defaults to 'central2'.
- distribution (Literal, default: 'gaussian') – distribution. Defaults to "gaussian".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
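Example:

A sketch, assuming RDSA is exposed under tz.m; the equivalent construction via RandomizedFDM is shown in the RandomizedFDM examples below:

rdsa = tz.Optimizer(
    model.parameters(),
    tz.m.RDSA(),
    tz.m.LR(1e-2),
)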
Source code in torchzero/modules/grad_approximation/rfdm.py
RandomizedFDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via a randomized finite-difference method.
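Randomized finite differences estimate the gradient from directional derivatives along random directions: with the central formula, a single-sample estimate has the standard form v · (f(x + h·v) − f(x − h·v)) / (2h), where v is a random direction drawn from the chosen distribution.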
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
Examples:
Simultaneous perturbation stochastic approximation (SPSA) method¶
SPSA is randomized FDM with rademacher distribution and central formula.
spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="fd_central", distribution="rademacher"),
    tz.m.LR(1e-2)
)
Random-direction stochastic approximation (RDSA) method¶
RDSA is randomized FDM, usually with a gaussian distribution and the central formula.
rdsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="fd_central", distribution="gaussian"),
    tz.m.LR(1e-2)
)
Gaussian smoothing method¶
GS uses many gaussian samples, possibly with a larger finite difference step size.
gs = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
RandomizedFDM with momentum¶
Momentum might help by reducing the variance of the estimated gradients.
momentum_spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-3)
)
Source code in torchzero/modules/grad_approximation/rfdm.py
PRE_MULTIPLY_BY_H class-attribute ¶
SPSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:

- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
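Example:

A sketch, assuming SPSA is exposed under tz.m; equivalently, SPSA can be built from RandomizedFDM as shown in the RandomizedFDM examples above:

spsa = tz.Optimizer(
    model.parameters(),
    tz.m.SPSA(),
    tz.m.LR(1e-2),
)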