Gradient approximations¶
This subpackage contains modules that estimate the gradient using function values.
Classes:
- FDM – Approximate gradients via finite difference method.
- ForwardGradient – Forward gradient method.
- GaussianSmoothing – Gradient approximation via Gaussian smoothing method.
- GradApproximator – Base class for gradient approximations.
- MeZO – Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
- RDSA – Gradient approximation via random-direction stochastic approximation (RDSA) method.
- RandomizedFDM – Gradient approximation via a randomized finite-difference method.
- SPSA – Gradient approximation via simultaneous perturbation stochastic approximation (SPSA) method.
FDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Approximate gradients via finite difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – magnitude of parameter perturbation. Defaults to 1e-3.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- target (Literal, default: 'closure') – what to set on var. Defaults to 'closure'.
Examples:
Plain FDM:
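A minimal illustrative sketch (it assumes a model is defined, as in the other examples on this page, and pairs FDM with tz.m.LR):
fdm_sgd = tz.Modular(
    model.parameters(),
    tz.m.FDM(),
    tz.m.LR(1e-2)
)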
Any gradient-based method can use FDM-estimated gradients.
fdm_ncg = tz.Modular(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
Source code in torchzero/modules/grad_approximation/fdm.py
ForwardGradient ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Forward gradient method.
This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- distribution (Literal, default: 'gaussian') – distribution for random gradient samples. Defaults to "gaussian".
- beta (float, default: 0) – if this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- jvp_method (str, default: 'autograd') – how to calculate the jacobian-vector product; note that with 'forward' and 'central' this is equivalent to randomized finite differences. Defaults to 'autograd'.
- h (float, default: 0.001) – finite difference step size if jvp_method is set to 'forward' or 'central'. Defaults to 1e-3.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
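Example:
An illustrative sketch (it assumes ForwardGradient is exposed as tz.m.ForwardGradient, consistent with the other modules on this page, and that a model is defined):
fg = tz.Modular(
    model.parameters(),
    tz.m.ForwardGradient(),
    tz.m.LR(1e-2)
)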
References
Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
Source code in torchzero/modules/grad_approximation/forward_gradient.py
PRE_MULTIPLY_BY_H class-attribute ¶
GaussianSmoothing ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Gaussian smoothing method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.01) – finite difference step size. Defaults to 1e-2.
- n_samples (int, default: 100) – number of random gradient samples. Defaults to 100.
- formula (Literal, default: 'forward2') – finite difference formula. Defaults to 'forward2'.
- distribution (Literal, default: 'gaussian') – distribution. Defaults to "gaussian".
- beta (float, default: 0) – if this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
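Example:
An illustrative sketch (it assumes GaussianSmoothing is exposed as tz.m.GaussianSmoothing, consistent with the other modules on this page, and that a model is defined):
gs = tz.Modular(
    model.parameters(),
    tz.m.GaussianSmoothing(n_samples=100, h=1e-2),
    tz.m.LR(1e-2)
)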
References
Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
Source code in torchzero/modules/grad_approximation/rfdm.py
GradApproximator ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for gradient approximations.
This is an abstract class; to use it, subclass it and override approximate.
GradApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.
Parameters:
- defaults (dict[str, Any] | None, default: None) – dict with defaults. Defaults to None.
- target (str, default: 'closure') – whether to set var.grad, var.update or var.closure. Defaults to 'closure'.
Example:
Basic SPSA method implementation.
class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus
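The custom approximator above can then be composed like any other module. A minimal illustrative sketch (it assumes a model is defined and reuses tz.Modular and tz.m.LR as in the other examples on this page):
opt = tz.Modular(
    model.parameters(),
    SPSA(h=1e-3),
    tz.m.LR(1e-2)
)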
Methods:
- approximate – Returns a tuple: (grad, loss, loss_approx). Make sure this resets parameters to their original values!
- pre_step – This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points.
Source code in torchzero/modules/grad_approximation/grad_approximator.py
approximate ¶
approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]
Returns a tuple: (grad, loss, loss_approx). Make sure this resets parameters to their original values!
Source code in torchzero/modules/grad_approximation/grad_approximator.py
pre_step ¶
pre_step(var: Var) -> None
This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points. This is useful, for example, to pre-generate new random perturbations.
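A minimal sketch of how a subclass might use pre_step (illustrative only; MyApproximator and its _perturbations attribute are hypothetical and not part of the library):
class MyApproximator(GradApproximator):
    def pre_step(self, var):
        # invalidate cached perturbations once per step; approximate(), which may be
        # called several times during this step, regenerates and then reuses them
        self._perturbations = None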
Source code in torchzero/modules/grad_approximation/grad_approximator.py
MeZO ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central2') – finite difference formula. Defaults to 'central2'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
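Example:
An illustrative sketch (it assumes MeZO is exposed as tz.m.MeZO, consistent with the other modules on this page, and that a model is defined):
mezo = tz.Modular(
    model.parameters(),
    tz.m.MeZO(),
    tz.m.LR(1e-3)
)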
References
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
Source code in torchzero/modules/grad_approximation/rfdm.py
RDSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Random-direction stochastic approximation (RDSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central2') – finite difference formula. Defaults to 'central2'.
- distribution (Literal, default: 'gaussian') – distribution. Defaults to "gaussian".
- beta (float, default: 0) – if this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
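Example:
An illustrative sketch (it assumes RDSA is exposed as tz.m.RDSA, consistent with the other modules on this page, and that a model is defined):
rdsa = tz.Modular(
    model.parameters(),
    tz.m.RDSA(),
    tz.m.LR(1e-2)
)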
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
Source code in torchzero/modules/grad_approximation/rfdm.py
RandomizedFDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via a randomized finite-difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- beta (float, default: 0) – optional momentum for generated perturbations. If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
Examples:
Simultaneous perturbation stochastic approximation (SPSA) method¶
SPSA is randomized finite difference with the rademacher distribution and the central formula.
spsa = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
    tz.m.LR(1e-2)
)
Random-direction stochastic approximation (RDSA) method¶
RDSA is randomized finite difference, usually with a gaussian distribution and the central formula.
rdsa = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
    tz.m.LR(1e-2)
)
RandomizedFDM with momentum¶
Momentum might help by reducing the variance of the estimated gradients.
momentum_spsa = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-3)
)
Gaussian smoothing method¶
GS uses many gaussian samples with possibly a larger finite difference step size.
gs = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
SPSA-NewtonCG¶
NewtonCG with the hessian-vector product estimated via gradient differences calls the closure multiple times per step. If each closure call estimates gradients with different perturbations, NewtonCG is unable to produce useful directions.
By setting pre_generate to True, perturbations are generated once before each step, and each closure call estimates gradients using the same pre-generated perturbations. This way closure-based algorithms are able to use gradients estimated in a consistent way.
opt = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=10, pre_generate=True),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
SPSA-LBFGS¶
LBFGS uses a memory of past parameter and gradient differences. If past gradients were estimated with different perturbations, LBFGS directions will be useless.
To alleviate this, momentum can be added to the random perturbations so that they only change by a little bit and the history stays relevant. The momentum is determined by the beta parameter.
The disadvantage is that the subspace the algorithm is able to explore changes slowly.
Additionally, we reset the SPSA and LBFGS memory every 100 steps to remove the influence of old gradient estimates.
opt = tz.Modular(
    bench.parameters(),
    tz.m.ResetEvery(
        [tz.m.RandomizedFDM(n_samples=10, pre_generate=True, beta=0.99), tz.m.LBFGS()],
        steps=100,
    ),
    tz.m.Backtracking()
)
Source code in torchzero/modules/grad_approximation/rfdm.py
PRE_MULTIPLY_BY_H class-attribute ¶
SPSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
- h (float, default: 0.001) – finite difference step size. Defaults to 1e-3.
- n_samples (int, default: 1) – number of random gradient samples. Defaults to 1.
- formula (Literal, default: 'central') – finite difference formula. Defaults to 'central'.
- distribution (Literal, default: 'rademacher') – distribution. Defaults to "rademacher".
- beta (float, default: 0) – if this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
- pre_generate (bool, default: True) – whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
- seed (int | None | Generator, default: None) – seed for random generator. Defaults to None.
- target (Literal, default: 'closure') – what to set on var. Defaults to "closure".
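Example:
An illustrative sketch (it assumes SPSA is exposed as tz.m.SPSA, consistent with the other modules on this page, and that a model is defined):
spsa = tz.Modular(
    model.parameters(),
    tz.m.SPSA(),
    tz.m.LR(1e-2)
)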
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771