Adaptive¶
This subpackage contains adaptive methods such as Adam, RMSprop, and SOAP.
See also¶
- Momentum - momentum methods (heavy ball, nesterov momentum)
- Quasi-newton - quasi-newton methods
Classes:
- AEGD – AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.
- ASAM – Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52.
- AdaHessian – AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719).
- Adagrad – Adagrad, divides by sum of past squares of gradients.
- AdagradNorm – Adagrad-Norm, divides by sum of past means of squares of gradients.
- Adam – Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.
- Adan – Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677.
- AdaptiveHeavyBall – Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.
- BacktrackOnSignChange – Negates or undoes the update for parameters where the gradient or update sign changes.
- DualNormCorrection – Dual norm correction for dualizer-based optimizers (https://github.com/leloykun/adaptive-muon).
- ESGD – Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390).
- FullMatrixAdagrad – Full-matrix version of Adagrad; can be customized to make RMSprop or Adam (see examples).
- LMAdagrad – Limited-memory full-matrix Adagrad.
- Lion – Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.
- MARSCorrection – MARS variance reduction correction.
- MSAM – Momentum-SAM from https://arxiv.org/pdf/2401.12033.
- MSAMObjective – Momentum-SAM from https://arxiv.org/pdf/2401.12033.
- MatrixMomentum – Second order momentum method.
- MuonAdjustLR – LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
- NaturalGradient – Natural gradient approximated via the empirical Fisher information matrix.
- OrthoGrad – Applies ⟂Grad - projects gradients of an iterable of parameters to be orthogonal to the weights.
- Orthogonalize – Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of the update along the first 2 dims.
- RMSprop – Divides gradient by an EMA of gradient squares.
- Rprop – Resilient propagation. The update magnitude gets multiplied by nplus if the gradient didn't change sign, or nminus if it did.
- SAM – Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412.
- SOAP – SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) from https://arxiv.org/abs/2409.11321.
- ScaleLRBySignChange – The learning rate gets multiplied by nplus if the ascent/gradient didn't change sign, or nminus if it did.
- Shampoo – Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).
- SignConsistencyLRs – Outputs per-weight learning rates based on consecutive sign consistency.
- SignConsistencyMask – Outputs a mask of sign consistency of current and previous inputs.
- SophiaH – SophiaH optimizer from https://arxiv.org/abs/2305.14342.
Functions:
- orthogonalize_grads_ – Uses Newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.
- orthograd_ – Applies ⟂Grad - projects gradients of an iterable of parameters to be orthogonal to the weights.
AEGD ¶
Bases: torchzero.core.transform.Transform
AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.
Note
AEGD has a learning rate hyperparameter that can't really be removed from the update rule.
To avoid compounding learning rate modifications, remove the tz.m.LR
module if you had it.
Parameters:
-
eta
(float
) –step size. Defaults to 0.1.
-
c
(float
, default:1
) –c. Defaults to 1.
-
beta3
(float
) –third (squared) momentum. Defaults to 0.1.
-
eps
(float
) –epsilon. Defaults to 1e-8.
-
use_n_prev
(bool
) –whether to use previous gradient differences momentum.
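Example (a hedged sketch, assuming the tz.Modular pattern used elsewhere on this page; per the note above, no tz.m.LR module is added since eta acts as the step size):
opt = tz.Modular(
    model.parameters(),
    tz.m.AEGD(eta=0.1)
)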
Source code in torchzero/modules/adaptive/aegd.py
ASAM ¶
Bases: torchzero.modules.adaptive.sam.SAM
Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.
This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.
.. note:: This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.
Parameters:
-
rho
(float
, default:0.5
) –Neighborhood size. Defaults to 0.5.
-
p
(float
, default:2
) –norm of the SAM objective. Defaults to 2.
Examples:
ASAM-Adam:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.ASAM(),
tz.m.Adam(),
tz.m.LR(1e-2)
)
References
Kwon, J., Kim, J., Park, H., & Choi, I. K. (2021, July). Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning (pp. 5905-5914). PMLR. https://arxiv.org/abs/2102.11600
Source code in torchzero/modules/adaptive/sam.py
AdaHessian ¶
Bases: torchzero.core.module.Module
AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)
This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of random hessian-vector products.
Notes
- In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply AdaHessian preconditioning to another module's output.
- If you are using gradient estimators or reformulations, set hvp_method to "forward" or "central".
- This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).
Parameters:
-
beta1
(float
, default:0.9
) –first momentum. Defaults to 0.9.
-
beta2
(float
, default:0.999
) –second momentum for squared hessian diagonal estimates. Defaults to 0.999.
-
averaging
(bool
, default:True
) –whether to enable block diagonal averaging over 1st dimension on parameters that have 2+ dimensions. This can be set per-parameter in param groups.
-
block_size
(int
, default:None
) –size of block in the block-diagonal averaging.
-
update_freq
(int
, default:1
) –frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 1.
-
eps
(float
, default:1e-08
) –division stability epsilon. Defaults to 1e-8.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
fd_h
(float
, default:0.001
) –finite difference step size if
hvp_method
is "forward" or "central". Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed
(int | None
, default:None
) –seed for random vectors. Defaults to None.
-
inner
(Chainable | None
, default:None
) –Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to
inner
. 3. momentum and preconditioning are applied to the outputs of inner
.
Examples:¶
Using AdaHessian:
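A minimal sketch (an assumption mirroring the ESGD and SophiaH examples on this page; the LR value is illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.AdaHessian(),
    tz.m.LR(0.1)
)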
AdaHessian preconditioner can be applied to any other module by passing it to the inner
argument.
Turn off AdaHessian's first momentum to get just the preconditioning. Here is an example of applying
AdaHessian preconditioning to nesterov momentum (tz.m.NAG
):
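A sketch of this combination (an assumption based on the analogous ESGD and SophiaH inner examples; beta1=0 turns off AdaHessian's first momentum, and the values are illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.AdaHessian(beta1=0, inner=tz.m.NAG(0.9)),
    tz.m.LR(0.1)
)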
Source code in torchzero/modules/adaptive/adahessian.py
Adagrad ¶
Bases: torchzero.core.transform.Transform
Adagrad, divides by sum of past squares of gradients.
This implementation is identical to torch.optim.Adagrad
.
Parameters:
-
lr_decay
(float
, default:0
) –learning rate decay. Defaults to 0.
-
initial_accumulator_value
(float
, default:0
) –initial value of the sum of squares of gradients. Defaults to 0.
-
eps
(float
, default:1e-10
) –division epsilon. Defaults to 1e-10.
-
alpha
(float
, default:1
) –step size. Defaults to 1.
-
pow
(float
, default:2
) –power for gradients and accumulator root. Defaults to 2.
-
use_sqrt
(bool
, default:True
) –whether to take the root of the accumulator. Defaults to True.
-
inner
(Chainable | None
, default:None
) –Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
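Example (a hedged sketch, assuming the tz.Modular pattern used elsewhere on this page; the LR value is illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.Adagrad(),
    tz.m.LR(1e-2)
)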
Source code in torchzero/modules/adaptive/adagrad.py
AdagradNorm ¶
Bases: torchzero.core.transform.Transform
Adagrad-Norm, divides by sum of past means of squares of gradients.
Parameters:
-
lr_decay
(float
, default:0
) –learning rate decay. Defaults to 0.
-
initial_accumulator_value
(float
, default:0
) –initial value of the sum of squares of gradients. Defaults to 0.
-
eps
(float
, default:1e-10
) –division epsilon. Defaults to 1e-10.
-
alpha
(float
, default:1
) –step size. Defaults to 1.
-
pow
(float
, default:2
) –power for gradients and accumulator root. Defaults to 2.
-
use_sqrt
(bool
, default:True
) –whether to take the root of the accumulator. Defaults to True.
-
inner
(Chainable | None
, default:None
) –Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
Source code in torchzero/modules/adaptive/adagrad.py
Adam ¶
Bases: torchzero.core.transform.Transform
Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.
This implementation is identical to :code:torch.optim.Adam
.
Parameters:
-
beta1
(float
, default:0.9
) –momentum. Defaults to 0.9.
-
beta2
(float
, default:0.999
) –second momentum. Defaults to 0.999.
-
eps
(float
, default:1e-08
) –epsilon. Defaults to 1e-8.
-
alpha
(float
, default:1.0
) –learning rate. Defaults to 1.
-
amsgrad
(bool
, default:False
) –Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
-
pow
(float
, default:2
) –power used in second momentum power and root. Defaults to 2.
-
debiased
(bool
, default:True
) –whether to apply debiasing to momentums based on current step. Defaults to True.
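Example (a hedged sketch, assuming the tz.Modular pattern used elsewhere on this page; Adam produces the update and tz.m.LR sets the step size, with an illustrative value):
opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.LR(1e-3)
)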
Source code in torchzero/modules/adaptive/adam.py
Adan ¶
Bases: torchzero.core.transform.Transform
Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677
Parameters:
-
beta1
(float
, default:0.98
) –momentum. Defaults to 0.98.
-
beta2
(float
, default:0.92
) –momentum for gradient differences. Defaults to 0.92.
-
beta3
(float
, default:0.99
) –third (squared) momentum. Defaults to 0.99.
-
eps
(float
, default:1e-08
) –epsilon. Defaults to 1e-8.
-
use_n_prev
(bool
) –whether to use previous gradient differences momentum.
Example:
```python
opt = tz.Modular(
    model.parameters(),
    tz.m.Adan(),
    tz.m.LR(1e-3),
)
```
Reference: Xie, X., Zhou, P., Li, H., Lin, Z., & Yan, S. (2024). Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/2208.06677
Source code in torchzero/modules/adaptive/adan.py
AdaptiveHeavyBall ¶
Bases: torchzero.core.transform.Transform
Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.
This is related to conjugate gradient methods; it may be very good for non-stochastic convex objectives, but won't work on stochastic ones.
Note
The step size is determined by the algorithm, so learning rate modules shouldn't be used.
Parameters:
-
f_star
(int
, default:0
) –(estimated) minimal possible value of the objective function (lowest possible loss). Defaults to 0.
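Example (a hedged sketch; per the note above, no learning rate module is added because the algorithm determines its own step size):
opt = tz.Modular(
    model.parameters(),
    tz.m.AdaptiveHeavyBall()
)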
Source code in torchzero/modules/adaptive/adaptive_heavyball.py
BacktrackOnSignChange ¶
Bases: torchzero.core.transform.Transform
Negates or undoes the update for parameters where the gradient or update sign changes.
This is part of RProp update rule.
Parameters:
-
use_grad
(bool
, default:False
) –if True, tracks sign change of the gradient, otherwise tracks sign change of the update. Defaults to False.
-
backtrack
(bool
, default:True
) –if True, undoes the update when sign changes, otherwise negates it. Defaults to True.
Source code in torchzero/modules/adaptive/rprop.py
DualNormCorrection ¶
Bases: torchzero.core.transform.TensorwiseTransform
Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).
Orthogonalize already has this built in with the dual_norm_correction
setting.
Source code in torchzero/modules/adaptive/muon.py
ESGD ¶
Bases: torchzero.core.module.Module
Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)
This is similar to Adagrad, but it accumulates squared randomized hessian diagonal estimates instead of squared gradients.
.. note::
In most cases ESGD should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply ESGD preconditioning to another module's output.
.. note::
If you are using gradient estimators or reformulations, set :code:hvp_method
to "forward" or "central".
.. note::
This module requires a closure passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating HVPs.
The closure must accept a backward
argument (refer to documentation).
Parameters:
-
damping
(float
, default:0.0001
) –added to denominator for stability. Defaults to 1e-4.
-
update_freq
(int
, default:20
) –frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 20.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
fd_h
(float
, default:0.001
) –finite difference step size if :code:
hvp_method
is "forward" or "central". Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed
(int | None
, default:None
) –seed for random vectors. Defaults to None.
-
inner
(Chainable | None
, default:None
) –Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to :code:
inner
. 3. momentum and preconditioning are applied to the outputs of :code:inner
.
Examples:
Using ESGD:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.ESGD(),
tz.m.LR(0.1)
)
ESGD preconditioner can be applied to any other module by passing it to the :code:inner
argument. Here is an example of applying
ESGD preconditioning to nesterov momentum (:code:tz.m.NAG
):
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.ESGD(beta1=0, inner=tz.m.NAG(0.9)),
tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/esgd.py
FullMatrixAdagrad ¶
Bases: torchzero.core.transform.TensorwiseTransform
Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).
Note
A more memory-efficient version equivalent to full matrix Adagrad on last n gradients is implemented in tz.m.LMAdagrad
.
Parameters:
-
beta
(float | None
, default:None
) –momentum for gradient outer product accumulators. if None, uses sum. Defaults to None.
-
decay
(float | None
, default:None
) –decay for gradient outer product accumulators. Defaults to None.
-
sqrt
(bool
, default:True
) –whether to take the square root of the accumulator. Defaults to True.
-
concat_params
(bool
, default:True
) –if False, each parameter will have its own accumulator. Defaults to True.
-
precond_freq
(int
, default:1
) –frequency of updating the inverse square root of the accumulator. Defaults to 1.
-
init
(Literal[str]
, default:'identity'
) –how to initialize the accumulator. - "identity" - with identity matrix (default). - "zeros" - with zero matrix. - "ones" - with matrix of ones. - "GGT" - with the first outer product.
-
divide
(bool
, default:False
) –whether to divide the accumulator by number of gradients in it. Defaults to False.
-
inner
(Chainable | None
, default:None
) –inner modules to apply preconditioning to. Defaults to None.
Examples:¶
Plain full-matrix adagrad
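A hedged sketch of this variant (assuming the same tz.Modular pattern as the Full-matrix Adam example below; the LR value is illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(),
    tz.m.LR(1e-2),
)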
Full-matrix RMSprop
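A hedged sketch of this variant: using an EMA of gradient outer products via the beta parameter gives an RMSprop-like rule (the beta and LR values are illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(beta=0.99),
    tz.m.LR(1e-2),
)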
Full-matrix Adam
opt = tz.Modular(
model.parameters(),
tz.m.FullMatrixAdagrad(beta=0.999, inner=tz.m.EMA(0.9)),
tz.m.Debias(0.9, 0.999),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/adaptive/adagrad.py
LMAdagrad ¶
Bases: torchzero.core.transform.TensorwiseTransform
Limited-memory full matrix Adagrad.
The update rule is to stack recent gradients into M, compute U, S <- SVD(M), then calculate the update as U S^-1 Uᵀg. But it uses eigendecomposition on MᵀM to get U and S^2 because that is faster when you don't need V.
This is equivalent to full-matrix Adagrad on recent gradients.
Parameters:
-
history_size
(int
, default:100
) –number of past gradients to store. Defaults to 100.
-
update_freq
(int
, default:1
) –frequency of updating the preconditioner (U and S). Defaults to 1.
-
damping
(float
, default:0.0001
) –damping value. Defaults to 1e-4.
-
rdamping
(float
, default:0
) –value of damping relative to singular values norm. Defaults to 0.
-
order
(int
, default:1
) –order=2 means gradient differences are used in place of gradients. Higher order uses higher order differences. Defaults to 1.
-
true_damping
(bool
, default:True
) –If True, damping is added to squared singular values to mimic Adagrad. Defaults to True.
-
U_beta
(float | None
, default:None
) –momentum for U (too unstable, don't use). Defaults to None.
-
L_beta
(float | None
, default:None
) –momentum for L (too unstable, don't use). Defaults to None.
-
interval
(int
, default:1
) –Interval between gradients that are added to history (2 means every second gradient is used). Defaults to 1.
-
concat_params
(bool
, default:True
) –if True, treats all parameters as a single vector, meaning it will also whiten inter-parameters. Defaults to True.
-
inner
(Chainable | None
, default:None
) –preconditioner will be applied to output of this module. Defaults to None.
Examples:¶
Limited-memory Adagrad
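A hedged sketch of this variant (assuming the same pattern as the examples below; the LR value is illustrative):
optimizer = tz.Modular(
    model.parameters(),
    tz.m.LMAdagrad(),
    tz.m.LR(0.01)
)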
Adam with L-Adagrad preconditioner (for debiasing, the second beta of 0.999 is chosen arbitrarily)
optimizer = tz.Modular(
model.parameters(),
tz.m.LMAdagrad(inner=tz.m.EMA()),
tz.m.Debias(0.9, 0.999),
tz.m.LR(0.01)
)
Stable Adam with L-Adagrad preconditioner (this is what I would recommend)
optimizer = tz.Modular(
model.parameters(),
tz.m.LMAdagrad(inner=tz.m.EMA()),
tz.m.Debias(0.9, 0.999),
tz.m.ClipNormByEMA(max_ema_growth=1.2),
tz.m.LR(0.01)
)
Source code in torchzero/modules/adaptive/lmadagrad.py
Lion ¶
Bases: torchzero.core.transform.Transform
Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.
Parameters:
-
beta1
(float
, default:0.9
) –dampening for momentum. Defaults to 0.9.
-
beta2
(float
, default:0.99
) –momentum factor. Defaults to 0.99.
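Example (a hedged sketch, assuming the tz.Modular pattern used elsewhere on this page; the LR value is illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.Lion(),
    tz.m.LR(1e-4)
)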
Source code in torchzero/modules/adaptive/lion.py
MARSCorrection ¶
Bases: torchzero.core.transform.Transform
MARS variance reduction correction.
Place any other momentum-based optimizer after this, and make sure the beta
parameter matches the momentum used in that optimizer.
Parameters:
-
beta
(float
, default:0.9
) –use the same beta as you use in the momentum module. Defaults to 0.9.
-
scaling
(float
, default:0.025
) –controls the scale of gradient correction in variance reduction. Defaults to 0.025.
-
max_norm
(float
, default:1
) –clips norm of corrected gradients, None to disable. Defaults to 1.
Examples:¶
Mars-AdamW
optimizer = tz.Modular(
model.parameters(),
tz.m.MARSCorrection(beta=0.95),
tz.m.Adam(beta1=0.95, beta2=0.99),
tz.m.WeightDecay(1e-3),
tz.m.LR(0.1)
)
Mars-Lion
optimizer = tz.Modular(
model.parameters(),
tz.m.MARSCorrection(beta=0.9),
tz.m.Lion(beta1=0.9),
tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/mars.py
MSAM ¶
Bases: torchzero.core.transform.Transform
Momentum-SAM from https://arxiv.org/pdf/2401.12033.
This implementation expresses the update rule as a function of the gradient. This way it can be used as a drop-in replacement for momentum strategies in other optimizers.
To combine MSAM with other optimizers in the way done in the official implementation,
e.g. to make Adam_MSAM, use tz.m.MSAMObjective
module.
Note
MSAM has a learning rate hyperparameter that can't really be removed from the update rule.
To avoid compounding learning rate modifications, remove the tz.m.LR
module if you had it.
Parameters:
-
lr
(float
) –learning rate. Adding this module adds support for learning rate schedulers.
-
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
rho
(float
, default:0.3
) –perturbation strength. Defaults to 0.3.
-
weight_decay
(float
, default:0
) –weight decay. It is applied to perturbed parameters, so it is different from applying :code:
tz.m.WeightDecay
after MSAM. Defaults to 0. -
nesterov
(bool
, default:False
) –whether to use nesterov momentum formula. Defaults to False.
-
lerp
(bool
, default:False
) –whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.
Examples:
MSAM
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.MSAM(1e-3)
)
Adam with MSAM instead of exponential average. Note that this is different from Adam_MSAM.
To make Adam_MSAM and such, use the :code:tz.m.MSAMObjective
module.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.RMSprop(0.999, inner=tz.m.MSAM(1e-3)),
tz.m.Debias(0.9, 0.999),
)
Source code in torchzero/modules/adaptive/msam.py
MSAMObjective ¶
Bases: torchzero.modules.adaptive.msam.MSAM
Momentum-SAM from https://arxiv.org/pdf/2401.12033.
Note
Please make sure to place tz.m.LR
inside the modules
argument. For example,
tz.m.MSAMObjective([tz.m.Adam(), tz.m.LR(1e-3)])
. Putting LR after MSAM will lead
to an incorrect update rule.
Parameters:
-
modules
(Chainable
) –modules that will optimize the MSAM objective. Make sure :code:
tz.m.LR
is one of them. -
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
rho
(float
, default:0.3
) –perturbation strength. Defaults to 0.3.
-
nesterov
(bool
, default:False
) –whether to use nesterov momentum formula. Defaults to False.
-
lerp
(bool
, default:False
) –whether to use linear interpolation, if True, MSAM momentum becomes similar to exponential moving average. Defaults to False.
Examples:
AdamW-MSAM
.. code-block:: python
opt = tz.Modular(
bench.parameters(),
tz.m.MSAMObjective(
[tz.m.Adam(), tz.m.WeightDecay(1e-3), tz.m.LR(1e-3)],
rho=1.
)
)
Source code in torchzero/modules/adaptive/msam.py
MatrixMomentum ¶
Bases: torchzero.core.module.Module
Second order momentum method.
Matrix momentum is useful for convex objectives; for some reason it also has very good generalization on elastic net logistic regression.
Notes
- mu needs to be tuned very carefully. It is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable. I have devised an adaptive version of this - tz.m.AdaptiveMatrixMomentum - and it works well without having to tune mu, however the adaptive version doesn't work on stochastic objectives.
- In most cases MatrixMomentum should be the first module in the chain because it relies on autograd.
- This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument.
Parameters:
-
mu
(float
, default:0.1
) –this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
h
(float
, default:0.001
) –finite difference step size if hvp_method is set to finite difference. Defaults to 1e-3.
-
hvp_tfm
(Chainable | None
, default:None
) –optional module applied to hessian-vector products. Defaults to None.
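Example (a hedged sketch, assuming the tz.Modular pattern used elsewhere on this page; mu is problem-dependent and needs careful tuning as noted above, and the values shown are illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.MatrixMomentum(mu=0.1),
    tz.m.LR(1e-2)
)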
Reference
Orr, Genevieve, and Todd Leen. "Using curvature information for fast stochastic search." Advances in neural information processing systems 9 (1996).
Source code in torchzero/modules/adaptive/matrix_momentum.py
MuonAdjustLR ¶
Bases: torchzero.core.transform.Transform
LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
Orthogonalize already has this built in with the adjust_lr
setting, however you might want to move this to be later in the chain.
Source code in torchzero/modules/adaptive/muon.py
NaturalGradient ¶
Bases: torchzero.core.module.Module
Natural gradient approximated via empirical fisher information matrix.
To use this, either pass a vector of per-sample losses to the step method, or make sure
the closure returns it. Gradients will be calculated via batched autograd within this module;
you don't need to implement the backward pass. When using a closure, please add the backward
argument; it will always be False but it is required. See below for an example.
Note
Empirical fisher information matrix may give a really bad approximation in some cases.
If that is the case, set sqrt
to True to perform whitening instead, which is way more robust.
Parameters:
-
reg
(float
, default:1e-08
) –regularization parameter. Defaults to 1e-8.
-
sqrt
(bool
, default:False
) –if True, uses square root of empirical fisher information matrix. Both EFIM and its square root can be calculated and stored efficiently without ndim^2 memory. Square root whitens the gradient and often performs much better, especially when you try to use NGD with a vector that isn't strictly per-sample gradients, but rather for example different losses.
-
gn_grad
(bool
, default:False
) –if True, uses Gauss-Newton G^T @ f as the gradient, which is effectively sum weighted by value and is equivalent to squaring the values. This way you can solve least-squares objectives with a NGD-like algorithm. If False, uses sum of per-sample gradients. This has an effect when
sqrt=True
, and affects thegrad
attribute. Defaults to False. -
batched
(bool
, default:True
) –whether to use vmapping. Defaults to True.
Examples:
training a neural network:
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
opt = tz.Modular(
model.parameters(),
tz.m.NaturalGradient(),
tz.m.LR(3e-2)
)
for i in range(100):
y_hat = model(X) # (64, 10)
losses = (y_hat - y).pow(2).mean(0) # (10, )
opt.step(loss=losses)
if i % 10 == 0:
print(f'{losses.mean() = }')
training a neural network - closure version
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
opt = tz.Modular(
model.parameters(),
tz.m.NaturalGradient(),
tz.m.LR(3e-2)
)
def closure(backward=True):
y_hat = model(X) # (64, 10)
return (y_hat - y).pow(2).mean(0) # (10, )
for i in range(100):
losses = opt.step(closure)
if i % 10 == 0:
print(f'{losses.mean() = }')
minimizing the rosenbrock function with a mix of natural gradient, whitening and gauss-newton:
def rosenbrock(X):
x1, x2 = X
return torch.stack([(1 - x1).abs(), (10 * (x2 - x1**2).abs())])
X = torch.tensor([-1.1, 2.5], requires_grad=True)
opt = tz.Modular([X], tz.m.NaturalGradient(sqrt=True, gn_grad=True), tz.m.LR(0.05))
for iter in range(200):
losses = rosenbrock(X)
opt.step(loss=losses)
if iter % 20 == 0:
print(f'{losses.mean() = }')
Source code in torchzero/modules/adaptive/natural_gradient.py
OrthoGrad ¶
Bases: torchzero.core.transform.Transform
Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
Parameters:
-
eps
(float
, default:1e-08
) –epsilon added to the denominator for numerical stability. Defaults to 1e-8.
-
renormalize
(bool
, default:True
) –whether to graft projected gradient to original gradient norm. Defaults to True.
-
target
(Literal
, default:'update'
) –what to set on var. Defaults to 'update'.
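Example (a hedged sketch; OrthoGrad is chained before another optimizer here, following the module-chaining pattern used on this page, with an illustrative LR):
opt = tz.Modular(
    model.parameters(),
    tz.m.OrthoGrad(),
    tz.m.Adam(),
    tz.m.LR(1e-3)
)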
Source code in torchzero/modules/adaptive/orthograd.py
Orthogonalize ¶
Bases: torchzero.core.transform.TensorwiseTransform
Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.
To disable orthogonalization for a parameter, put it into a parameter group with "orthogonalize" = False. The Muon page says that embeddings and classifier heads should not be orthogonalized. Usually only matrix parameters that are directly used in matmuls should be orthogonalized.
To make Muon, use Split with Adam on 1d params
Parameters:
-
ns_steps
(int
, default:5
) –The number of Newton-Schulz iterations to run. Defaults to 5.
-
adjust_lr
(bool
, default:False
) –Enables LR adjustment based on parameter size from "Muon is Scalable for LLM Training". Defaults to False.
-
dual_norm_correction
(bool
, default:False
) –enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
-
method
(str
, default:'newton-schulz'
) –Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.
-
target
(str
, default:'update'
) –what to set on var.
Examples:¶
standard Muon with Adam fallback
opt = tz.Modular(
model.head.parameters(),
tz.m.Split(
# apply muon only to 2D+ parameters
filter = lambda t: t.ndim >= 2,
true = [
tz.m.HeavyBall(),
tz.m.Orthogonalize(),
tz.m.LR(1e-2),
],
false = tz.m.Adam()
),
tz.m.LR(1e-2)
)
Reference
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein - Muon: An optimizer for hidden layers in neural networks (2024) https://github.com/KellerJordan/Muon
Source code in torchzero/modules/adaptive/muon.py
RMSprop ¶
Bases: torchzero.core.transform.Transform
Divides gradient by EMA of gradient squares.
This implementation is identical to :code:torch.optim.RMSprop
.
Parameters:
-
smoothing
(float
, default:0.99
) –beta for exponential moving average of gradient squares. Defaults to 0.99.
-
eps
(float
, default:1e-08
) –epsilon for division. Defaults to 1e-8.
-
centered
(bool
, default:False
) –whether to center EMA of gradient squares using an additional EMA. Defaults to False.
-
debiased
(bool
, default:False
) –applies Adam debiasing. Defaults to False.
-
amsgrad
(bool
, default:False
) –Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
-
pow
(float
, default:2
) –power used in second momentum power and root. Defaults to 2.
-
init
(str
, default:'zeros'
) –how to initialize EMA, either "update" to use first update or "zeros". Defaults to "zeros".
-
inner
(Chainable | None
, default:None
) –Inner modules that are applied after updating EMA and before preconditioning. Defaults to None.
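Example (a hedged sketch, assuming the tz.Modular pattern used elsewhere on this page; the LR value is illustrative):
opt = tz.Modular(
    model.parameters(),
    tz.m.RMSprop(),
    tz.m.LR(1e-3)
)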
Source code in torchzero/modules/adaptive/rmsprop.py
Rprop ¶
Bases: torchzero.core.transform.Transform
Resilient propagation. The update magnitude gets multiplied by nplus
if gradient didn't change the sign,
or nminus
if it did. Then the update is applied with the sign of the current gradient.
Additionally, if gradient changes sign, the update for that weight is reverted. Next step, magnitude for that weight won't change.
Compared to pytorch this also implements backtracking update when sign changes.
This implementation is identical to :code:torch.optim.Rprop
if :code:backtrack
is set to False.
Parameters:
-
nplus
(float
, default:1.2
) –multiplicative increase factor for when ascent didn't change sign (default: 1.2).
-
nminus
(float
, default:0.5
) –multiplicative decrease factor for when ascent changed sign (default: 0.5).
-
lb
(float
, default:1e-06
) –minimum step size, can be None (default: 1e-6)
-
ub
(float
, default:50
) –maximum step size, can be None (default: 50)
-
backtrack
(bool
, default:True
) –if True, when ascent sign changes, undoes last weight update, otherwise sets update to 0. When this is False, this exactly matches pytorch Rprop. (default: True)
-
alpha
(float
, default:1
) –initial per-parameter learning rate (default: 1).
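Example (a hedged sketch; since alpha already sets a per-parameter step size, no tz.m.LR module is added here - this placement is an assumption, not from the source docs):
opt = tz.Modular(
    model.parameters(),
    tz.m.Rprop()
)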
Reference
Riedmiller, M., & Braun, H. (1993, March). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE international conference on neural networks (pp. 586-591). IEEE.
Source code in torchzero/modules/adaptive/rprop.py
SAM ¶
Bases: torchzero.core.module.Module
Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.
This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.
.. note:: This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.
Parameters:
-
rho
(float
, default:0.05
) –Neighborhood size. Defaults to 0.05.
-
p
(float
, default:2
) –norm of the SAM objective. Defaults to 2.
-
asam
(bool
, default:False
) –enables ASAM variant which makes perturbation relative to weight magnitudes. ASAM requires a much larger :code:
rho
, like 0.5 or 1. The :code:tz.m.ASAM
class is identical to setting this argument to True, but it has larger :code:rho
by default.
Examples:
SAM-SGD:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SAM(),
tz.m.LR(1e-2)
)
SAM-Adam:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SAM(),
tz.m.Adam(),
tz.m.LR(1e-2)
)
References
Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412. https://arxiv.org/abs/2010.01412#page=3.16
Source code in torchzero/modules/adaptive/sam.py
SOAP ¶
Bases: torchzero.core.transform.Transform
SOAP (ShampoO with Adam in the Preconditioner's eigenbasis from https://arxiv.org/abs/2409.11321).
Parameters:
-
beta1
(float
, default:0.95
) –beta for first momentum. Defaults to 0.95.
-
beta2
(float
, default:0.95
) –beta for second momentum. Defaults to 0.95.
-
shampoo_beta
(float | None
, default:0.95
) –beta for covariance matrices accumulators. Can be None, then it just sums them like Adagrad (which works worse). Defaults to 0.95.
-
precond_freq
(int
, default:10
) –How often to update the preconditioner. Defaults to 10.
-
merge_small
(bool
, default:True
) –Whether to merge small dims. Defaults to True.
-
max_dim
(int
, default:2000
) –Won't precondition dims larger than this. Defaults to 2_000.
-
precondition_1d
(bool
, default:True
) –Whether to precondition 1d params (SOAP paper sets this to False). Defaults to True.
-
eps
(float
, default:1e-08
) –epsilon for dividing first momentum by second. Defaults to 1e-8.
-
decay
(float | None
, default:None
) –Decays covariance matrix accumulators, this may be useful if
shampoo_beta
is None. Defaults to None. -
alpha
(float
, default:1
) –learning rate. Defaults to 1.
-
bias_correction
(bool
, default:True
) –enables adam bias correction. Defaults to True.
Examples:
SOAP:
.. code-block:: python
opt = tz.Modular(model.parameters(), tz.m.SOAP(), tz.m.LR(1e-3))
Stabilized SOAP:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SOAP(),
tz.m.NormalizeByEMA(max_ema_growth=1.2),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/soap.py
ScaleLRBySignChange ¶
Bases: torchzero.core.transform.Transform
learning rate gets multiplied by nplus
if ascent/gradient didn't change the sign,
or nminus
if it did.
This is part of RProp update rule.
Parameters:
-
nplus
(float
, default:1.2
) –learning rate gets multiplied by
nplus
if ascent/gradient didn't change the sign -
nminus
(float
, default:0.5
) –learning rate gets multiplied by
nminus
if ascent/gradient changed the sign -
lb
(float
, default:1e-06
) –lower bound for lr.
-
ub
(float
, default:50.0
) –upper bound for lr.
-
alpha
(float
, default:1.0
) –initial learning rate.
Source code in torchzero/modules/adaptive/rprop.py
Shampoo ¶
Bases: torchzero.core.transform.Transform
Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).
.. note:: Shampoo is usually grafted to another optimizer like Adam, otherwise it can be unstable. An example of how to do grafting is given below in the Examples section.
.. note::
Shampoo is a very computationally expensive optimizer, increase :code:update_freq
if it is too slow.
.. note::
SOAP optimizer usually outperforms Shampoo and is also not as computationally expensive. SOAP implementation is available as :code:tz.m.SOAP
.
Parameters:
-
decay
(float | None
, default:None
) –slowly decays preconditioners. Defaults to None.
-
beta
(float | None
, default:None
) –if None calculates sum as in standard shampoo, otherwise uses EMA of preconditioners. Defaults to None.
-
update_freq
(int
, default:10
) –preconditioner update frequency. Defaults to 10.
-
exp_override
(int | None
, default:2
) –matrix exponent override, if not set, uses 2*ndim. Defaults to 2.
-
merge_small
(bool
, default:True
) –whether to merge small dims on tensors. Defaults to True.
-
max_dim
(int
, default:2000
) –maximum dimension size for preconditioning. Defaults to 2_000.
-
precondition_1d
(bool
, default:True
) –whether to precondition 1d tensors. Defaults to True.
-
adagrad_eps
(float
, default:1e-08
) –epsilon for adagrad division for tensors where shampoo can't be applied. Defaults to 1e-8.
-
inner
(Chainable | None
, default:None
) –module applied after updating preconditioners and before applying preconditioning. For example if beta≈0.999 and
inner=tz.m.EMA(0.9)
, this becomes Adam with shampoo preconditioner (ignoring debiasing). Defaults to None.
Examples:
Shampoo grafted to Adam
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.GraftModules(
direction = tz.m.Shampoo(),
magnitude = tz.m.Adam(),
),
tz.m.LR(1e-3)
)
Adam with Shampoo preconditioner
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Shampoo(beta=0.999, inner=tz.m.EMA(0.9)),
tz.m.Debias(0.9, 0.999),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/adaptive/shampoo.py
SignConsistencyLRs ¶
Bases: torchzero.core.transform.Transform
Outputs per-weight learning rates based on consecutive sign consistency.
The learning rate for a weight is multiplied by :code:nplus
when two consecutive update signs are the same, otherwise it is multiplied by :code:nminus
. The learning rates are bounded to be in :code:(lb, ub)
range.
Examples:
GD scaled by consecutive gradient sign consistency
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Mul(tz.m.SignConsistencyLRs()),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/rprop.py
SignConsistencyMask ¶
Bases: torchzero.core.transform.Transform
Outputs a mask of sign consistency of current and previous inputs.
The output is 0 for weights where input sign changed compared to previous input, 1 otherwise.
Examples:
GD that skips update for weights where gradient sign changed compared to previous gradient.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Mul(tz.m.SignConsistencyMask()),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/rprop.py
SophiaH ¶
Bases: torchzero.core.module.Module
SophiaH optimizer from https://arxiv.org/abs/2305.14342
This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized hessian diagonal estimates, and the update is aggressively clipped.
.. note::
In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply SophiaH preconditioning to another module's output.
.. note::
If you are using gradient estimators or reformulations, set :code:hvp_method
to "forward" or "central".
.. note::
This module requires a closure passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating HVPs.
The closure must accept a backward
argument (refer to documentation).
Parameters:
-
beta1
(float
, default:0.96
) –first momentum. Defaults to 0.96.
-
beta2
(float
, default:0.99
) –momentum for hessian diagonal estimate. Defaults to 0.99.
-
update_freq
(int
, default:10
) –frequency of updating hessian diagonal estimate via a hessian-vector product. Defaults to 10.
-
precond_scale
(float
, default:1
) –scale of the preconditioner. Defaults to 1.
-
clip
(float
, default:1
) –clips update to (-clip, clip). Defaults to 1.
-
eps
(float
, default:1e-12
) –clips hessian diagonal estimate to be no less than this value. Defaults to 1e-12.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
fd_h
(float
, default:0.001
) –finite difference step size if :code:
hvp_method
is "forward" or "central". Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed
(int | None
, default:None
) –seed for random vectors. Defaults to None.
-
inner
(Chainable | None
, default:None
) –preconditioning is applied to the output of this module. Defaults to None.
Examples:
Using SophiaH:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SophiaH(),
tz.m.LR(0.1)
)
SophiaH preconditioner can be applied to any other module by passing it to the :code:inner
argument.
Turn off SophiaH's first momentum to get just the preconditioning. Here is an example of applying
SophiaH preconditioning to nesterov momentum (:code:tz.m.NAG
):
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SophiaH(beta1=0, inner=tz.m.NAG(0.96)),
tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/sophia_h.py
orthogonalize_grads_ ¶
orthogonalize_grads_(params: Iterable[Tensor], steps: int = 5, dual_norm_correction=False, method: Literal['newton-schulz', 'svd'] = 'newton-schulz')
Uses Newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.
This sets gradients in-place. Applies along first 2 dims (expected to be out_channels, in_channels
).
Note that the Muon page says that embeddings and classifier heads should not be orthogonalized.
Parameters:
- params (Iterable[Tensor]) – parameters that hold gradients to orthogonalize.
- steps (int, optional) – The number of Newton-Schulz iterations to run. Defaults to 5.
- dual_norm_correction (bool, optional) – enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
- method (str, optional) – Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.
Source code in torchzero/modules/adaptive/muon.py
orthograd_ ¶
Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
Parameters:
-
params
(Iterable[Tensor]
) –parameters that hold gradients to apply ⟂Grad to.
-
eps
(float
, default:1e-30
) –epsilon added to the denominator for numerical stability (default: 1e-30)
Reference
https://arxiv.org/abs/2501.04697