List of all modules¶
A somewhat categorized list of modules is also available in Modules.
Classes:
- AEGD – AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.
- ASAM – Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52
- Abs – Returns abs(input)
- AccumulateMaximum – Accumulates maximum of all past updates.
- AccumulateMean – Accumulates mean of all past updates.
- AccumulateMinimum – Accumulates minimum of all past updates.
- AccumulateProduct – Accumulates product of all past updates.
- AccumulateSum – Accumulates sum of all past updates.
- AdGD – AdGD and AdGD-2 (https://arxiv.org/abs/2308.02261)
- AdaHessian – AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)
- Adagrad – Adagrad, divides by sum of past squares of gradients.
- AdagradNorm – Adagrad-Norm, divides by sum of past means of squares of gradients.
- Adam – Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.
- Adan – Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677
- AdaptiveBacktracking – Adaptive backtracking line search. After each line search procedure, a new initial step size is set.
- AdaptiveBisection – A line search that evaluates the previous step size; if the value increased, backtracks until the value stops decreasing, otherwise forward-tracks until the value stops decreasing.
- AdaptiveHeavyBall – Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.
- Add – Adds other to tensors. other can be a number or a module.
- Alternate – Alternates between stepping with modules.
- Averaging – Average of past history_size updates.
- BBStab – Stabilized Barzilai-Borwein method (https://arxiv.org/abs/1907.06409).
- BFGS – Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.
- BacktrackOnSignChange – Negates or undoes the update for parameters where the gradient or update sign changes.
- Backtracking – Backtracking line search.
- BarzilaiBorwein – Barzilai-Borwein step size method.
- BinaryOperationBase – Base class for operations that use the update as the first operand. This is an abstract class; subclass it and override the transform method to use it.
- BirginMartinezRestart – The restart criterion for conjugate gradient methods designed by Birgin and Martínez.
- BoldDriver – Multiplies step size by nplus if loss decreased compared to the last iteration, otherwise multiplies by nminus.
- BroydenBad – Broyden's "bad" Quasi-Newton method.
- BroydenGood – Broyden's "good" Quasi-Newton method.
- CD – Coordinate descent. Proposes a descent direction along a single coordinate.
- Cautious – Negates update for parameters where update and gradient sign is inconsistent.
- CautiousWeightDecay – Cautious weight decay (https://arxiv.org/pdf/2510.12402).
- CenteredEMASquared – Maintains a centered exponential moving average of squared updates. This also maintains an additional exponential moving average of un-squared updates, the square of which is subtracted from the EMA.
- CenteredSqrtEMASquared – Maintains a centered exponential moving average of squared updates, outputs optionally debiased square root.
- Centralize – Centralizes the update.
- Clip – Clips tensors to be in (min, max) range. min and max can be None, numbers or modules.
- ClipModules – Calculates input(tensors).clip(min, max). min and max can be numbers or modules.
- ClipNorm – Clips update norm to be no larger than value.
- ClipNormByEMA – Clips norm to be no larger than the norm of an exponential moving average of past updates.
- ClipNormGrowth – Clips update norm growth.
- ClipValue – Clips update magnitude to be within (-value, value) range.
- ClipValueByEMA – Clips magnitude of update to be no larger than magnitude of exponential moving average of past (unclipped) updates.
- ClipValueGrowth – Clips update value magnitude growth.
- Clone – Clones input. May be useful to store some intermediate result and make sure it doesn't get affected by in-place operations.
- ConjugateDescent – Conjugate Descent (CD).
- CopyMagnitude – Returns other(tensors) with sign copied from tensors.
- CopySign – Returns tensors with sign copied from other(tensors).
- CubicRegularization – Cubic regularization.
- CustomUnaryOperation – Applies getattr(tensor, name) to each tensor.
- DFP – Davidon–Fletcher–Powell Quasi-Newton method.
- DNRTR – Diagonal quasi-newton method.
- DYHS – Dai-Yuan - Hestenes-Stiefel hybrid conjugate gradient method.
- DaiYuan – Dai–Yuan nonlinear conjugate gradient method.
- Debias – Multiplies the update by an Adam debiasing term based on first and/or second momentum.
- Debias2 – Multiplies the update by an Adam debiasing term based on the second momentum.
- DiagonalBFGS – Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
- DiagonalQuasiCauchi – Diagonal quasi-Cauchy method.
- DiagonalSR1 – Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
- DiagonalWeightedQuasiCauchi – Diagonal quasi-Cauchy method.
- DirectWeightDecay – Directly applies weight decay to parameters.
- Div – Divides tensors by other. other can be a number or a module.
- DivByLoss – Divides update by loss times alpha.
- DivModules – Calculates input / other. input and other can be numbers or modules.
- Dogleg – Dogleg trust region algorithm.
- Dropout – Applies dropout to the update.
- DualNormCorrection – Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).
- EMA – Maintains an exponential moving average of the update.
- EMASquared – Maintains an exponential moving average of squared updates.
- ESGD – Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)
- EscapeAnnealing – If parameters stop changing, this runs a backward annealing random search.
- Exp – Returns exp(input)
- ExpHomotopy
- FDM – Approximate gradients via the finite difference method.
- Fill – Outputs tensors filled with value.
- FillLoss – Outputs tensors filled with loss value times alpha.
- FletcherReeves – Fletcher–Reeves nonlinear conjugate gradient method.
- FletcherVMM – Fletcher's variable metric Quasi-Newton method.
- ForwardGradient – Forward gradient method.
- FullMatrixAdagrad – Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).
- GGT – GGT method from https://arxiv.org/pdf/1806.02958
- GGTBasis – Run another optimizer in GGT eigenbasis. The eigenbasis is rank-sized, so it is possible to run expensive…
- GaussNewton – Gauss-Newton method.
- GaussianSmoothing – Gradient approximation via the Gaussian smoothing method.
- Grad – Outputs the gradient.
- GradApproximator – Base class for gradient approximations.
- GradSign – Copies gradient sign to update.
- GradToNone – Sets the grad attribute to None on objective.
- GradientAccumulation – Uses n steps to accumulate gradients; after n gradients have been accumulated, they are passed to modules and parameters are updated.
- GradientCorrection – Estimates gradient at minima along search direction assuming function is quadratic.
- GradientSampling – Samples and aggregates gradients and values at perturbed points.
- Graft – Outputs direction output rescaled to have the same norm as magnitude output.
- GraftGradToUpdate – Outputs gradient grafted to update, that is, gradient rescaled to have the same norm as the update.
- GraftInputToOutput – Outputs tensors rescaled to have the same norm as magnitude(tensors).
- GraftOutputToInput – Outputs magnitude(tensors) rescaled to have the same norm as tensors.
- GraftToGrad – Grafts update to the gradient, that is, update is rescaled to have the same norm as the gradient.
- GraftToParams – Grafts update to the parameters, that is, update is rescaled to have the same norm as the parameters, but no smaller than eps.
- GramSchimdt – Outputs tensors made orthogonal to other(tensors) via Gram-Schmidt.
- Greenstadt1 – Greenstadt's first Quasi-Newton method.
- Greenstadt2 – Greenstadt's second Quasi-Newton method.
- HagerZhang – Hager-Zhang nonlinear conjugate gradient method.
- HeavyBall – Polyak's momentum (heavy-ball method).
- HestenesStiefel – Hestenes–Stiefel nonlinear conjugate gradient method.
- Horisho – Horisho's variable metric Quasi-Newton method.
- HpuEstimate – Returns y/||s||, where y is the difference between current and previous update (gradient), and s is the difference between current and previous parameters. The returned tensors are a finite difference approximation to hessian times previous update.
- ICUM – Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods.
- Identity – Identity operator that is argument-insensitive. This also can be used as identity hessian for trust region methods.
- ImprovedNewton – Improved Newton's Method (INM).
- IntermoduleCautious – Negates update on the main module where its sign doesn't match the output of the compare module.
- InverseFreeNewton – Inverse-free Newton's method.
- LBFGS – Limited-memory BFGS algorithm. A line search or trust region is recommended.
- LR – Learning rate. Adding this module also adds support for LR schedulers.
- LSR1 – Limited-memory SR1 algorithm. A line search or trust region is recommended.
- LambdaHomotopy
- LaplacianSmoothing – Applies laplacian smoothing via a fast Fourier transform solver which can improve generalization.
- LastAbsoluteRatio – Outputs ratio between absolute values of past two updates; the numerator is determined by the numerator argument.
- LastDifference – Outputs difference between past two updates.
- LastGradDifference – Outputs difference between past two gradients.
- LastProduct – Outputs product of past two updates.
- LastRatio – Outputs ratio between past two updates; the numerator is determined by the numerator argument.
- LerpModules – Does a linear interpolation of input(tensors) and end(tensors) based on a scalar weight.
- LevenbergMarquardt – Levenberg-Marquardt trust region algorithm.
- LineSearchBase – Base class for line searches.
- Lion – Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.
- LiuStorey – Liu-Storey nonlinear conjugate gradient method.
- LogHomotopy
- MARSCorrection – MARS variance reduction correction.
- MSAM – Momentum-SAM from https://arxiv.org/pdf/2401.12033.
- MSAMMomentum – Momentum-SAM from https://arxiv.org/pdf/2401.12033.
- MatrixMomentum – Second order momentum method.
- Maximum – Outputs maximum(tensors, other(tensors))
- MaximumModules – Outputs elementwise maximum of inputs that can be modules or numbers.
- McCormick – McCormick's Quasi-Newton method.
- MeZO – Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
- Mean – Outputs a mean of inputs that can be modules or numbers.
- MedianAveraging – Median of past history_size updates.
- Minimum – Outputs minimum(tensors, other(tensors))
- MinimumModules – Outputs elementwise minimum of inputs that can be modules or numbers.
- Mul – Multiplies tensors by other. other can be a number or a module.
- MulByLoss – Multiplies update by loss times alpha.
- MultiOperationBase – Base class for operations that use operands. This is an abstract class; subclass it and override the transform method to use it.
- Multistep – Performs steps inner steps with module per each step.
- MuonAdjustLR – LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
- NAG – Nesterov accelerated gradient method (Nesterov momentum).
- NanToNum – Converts nan, inf and -inf to numbers.
- NaturalGradient – Natural gradient approximated via empirical Fisher information matrix.
- Negate – Returns -input
- NegateOnLossIncrease – Uses an extra forward pass to evaluate loss at parameters+update.
- NewDQN – Diagonal quasi-newton method.
- NewSSM – Self-scaling Quasi-Newton method.
- Newton – Exact Newton's method via autograd.
- NewtonCG – Newton's method with a matrix-free conjugate gradient or minimal-residual solver.
- NewtonCGSteihaug – Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.
- NoiseSign – Outputs random tensors with sign copied from the update.
- Noop – Identity operator that is argument-insensitive. This also can be used as identity hessian for trust region methods.
- Normalize – Normalizes the update.
- NormalizeByEMA – Sets norm of the update to be the same as the norm of an exponential moving average of past updates.
- NystromPCG – Newton's method with a Nyström-preconditioned conjugate gradient solver.
- NystromSketchAndSolve – Newton's method with a Nyström sketch-and-solve solver.
- Ones – Outputs ones.
- Online – Allows certain modules to be used for mini-batch optimization.
- OrthoGrad – Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
- Orthogonalize – Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.
- PSB – Powell's Symmetric Broyden Quasi-Newton method.
- PSGDDenseNewton – Dense hessian preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
- PSGDKronNewton – Kron hessian preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
- PSGDKronWhiten – Kron whitening preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
- PSGDLRANewton – Low rank hessian preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
- PSGDLRAWhiten – Low rank whitening preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
- Params – Outputs parameters.
- Pearson – Pearson's Quasi-Newton method.
- PerturbWeights – Changes the closure so that it evaluates loss and gradients at weights perturbed by a random perturbation.
- PolakRibiere – Polak-Ribière-Polyak nonlinear conjugate gradient method.
- PolyakStepSize – Polyak's subgradient method with known or unknown f*.
- Pow – Takes tensors to the power of exponent. exponent can be a number or a module.
- PowModules – Calculates input ** exponent. input and exponent can be numbers or modules.
- PowellRestart – Powell's two restarting criteria for conjugate gradient methods.
- Previous – Maintains an update from n steps back, for example if n=1, returns the previous update.
- PrintLoss – Prints var.get_loss().
- PrintParams – Prints current parameters.
- PrintShape – Prints shapes of the update.
- PrintUpdate – Prints current update.
- Prod – Outputs product of inputs that can be modules or numbers.
- ProjectedGradientMethod – Projected gradient method. Directly projects the gradient onto subspace conjugate to past directions.
- ProjectedNewtonRaphson – Projected Newton-Raphson method.
- ProjectionBase – Base class for projections.
- RCopySign – Returns other(tensors) with sign copied from tensors.
- RDSA – Gradient approximation via the Random-direction stochastic approximation (RDSA) method.
- RDiv – Divides other by tensors. other can be a number or a module.
- RMSprop – Divides gradient by EMA of gradient squares.
- RPow – Takes other to the power of tensors. other can be a number or a module.
- RSub – Subtracts tensors from other. other can be a number or a module.
- Randn – Outputs tensors filled with random numbers from a normal distribution with mean 0 and variance 1.
- RandomHvp – Returns a hessian-vector product with a random vector, optionally times vector.
- RandomReinitialize – On each step, with probability p_reinit triggers reinitialization.
- RandomSample – Outputs tensors filled with random numbers from a distribution depending on the value of distribution.
- RandomStepSize – Uses a random global or layer-wise step size from low to high.
- RandomizedFDM – Gradient approximation via a randomized finite-difference method.
- Reciprocal – Returns 1 / input
- ReduceOperationBase – Base class for reduction operations like Sum, Prod, Maximum. This is an abstract class; subclass it and override the transform method to use it.
- Relative – Multiplies update by absolute parameter values to make it relative to their magnitude; min_value is the minimum allowed value to avoid getting stuck at 0.
- RelativeWeightDecay – Weight decay relative to the mean absolute value of update, gradient or parameters, depending on the value of the norm_input argument.
- RestartEvery – Resets the state every n steps.
- RestartOnStuck – Resets the state when update (difference in parameters) is zero for multiple steps in a row.
- RestartStrategyBase – Base class for restart strategies.
- Rprop – Resilient propagation. The update magnitude gets multiplied by nplus if gradient didn't change the sign.
- SAM – Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412
- SG2 – Second-order stochastic gradient.
- SOAP – SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) from https://arxiv.org/abs/2409.11321.
- SOAPBasis – Runs another optimizer in Shampoo eigenbases.
- SPSA – Gradient approximation via the Simultaneous perturbation stochastic approximation (SPSA) method.
- SPSA1 – One-measurement variant of SPSA. Unlike standard two-measurement SPSA, the estimated…
- SR1 – Symmetric Rank 1. This works best with a trust region.
- SSVM – Self-scaling variable metric Quasi-Newton method.
- SVRG – Stochastic variance reduced gradient method (SVRG).
- SaveBest – Saves best parameters found so far, ones that have lowest loss. Put this as the last module.
- ScalarProjection – Projection that splits all parameters into individual scalars.
- ScaleByGradCosineSimilarity – Multiplies the update by cosine similarity with gradient.
- ScaleLRBySignChange – Learning rate gets multiplied by nplus if ascent/gradient didn't change the sign.
- ScaleModulesByCosineSimilarity – Scales the output of the main module by its cosine similarity to the output…
- ScipyMinimizeScalar – Line search via scipy.optimize.minimize_scalar, which implements Brent, golden search and bounded Brent methods.
- Sequential – On each step, this sequentially steps with modules steps times.
- Shampoo – Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).
- ShorR – Shor's r-algorithm.
- Sign – Returns sign(input)
- SignConsistencyLRs – Outputs per-weight learning rates based on consecutive sign consistency.
- SignConsistencyMask – Outputs a mask of sign consistency of current and previous inputs.
- SixthOrder3P – Sixth-order iterative method.
- SixthOrder3PM2 – Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.
- SixthOrder5P – Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.
- SophiaH – SophiaH optimizer from https://arxiv.org/abs/2305.14342
- Split – Applies true modules to all parameters filtered by filter, and false modules to all other parameters.
- Sqrt – Returns sqrt(input)
- SqrtEMASquared – Maintains an exponential moving average of squared updates, outputs optionally debiased square root.
- SqrtHomotopy
- SquareHomotopy
- StepSize – Exactly the same as LR, except the lr parameter can be renamed to any other name to avoid clashes.
- StrongWolfe – Interpolation line search satisfying the strong Wolfe condition.
- Sub – Subtracts other from tensors. other can be a number or a module.
- SubModules – Calculates input - other. input and other can be numbers or modules.
- SubspaceNewton – Subspace Newton. Performs a Newton step in a subspace (random or spanned by past gradients).
- Sum – Outputs sum of inputs that can be modules or numbers.
- SumOfSquares – Sets loss to be the sum of squares of values returned by the closure.
- Switch – After steps steps, switches to the next module.
- TerminateAfterNEvaluations
- TerminateAfterNSeconds
- TerminateAfterNSteps
- TerminateAll
- TerminateAny
- TerminateByGradientNorm
- TerminateByUpdateNorm – Update is calculated as parameter difference.
- TerminateNever
- TerminateOnLossReached
- TerminateOnNoImprovement
- TerminationCriteriaBase
- ThomasOptimalMethod – Thomas's "optimal" Quasi-Newton method.
- Threshold – Outputs tensors thresholded such that values above threshold are set to value.
- To – Casts modules to the specified device and dtype.
- TrustCG – Trust region via the Steihaug-Toint Conjugate Gradient method.
- TrustRegionBase
- TwoPointNewton – Two-point Newton method with frozen derivative with third order convergence.
- UnaryLambda – Applies fn to input tensors.
- UnaryParameterwiseLambda – Applies fn to each input tensor.
- Uniform – Outputs tensors filled with random numbers from a uniform distribution between low and high.
- UpdateGradientSignConsistency – Compares update and gradient signs. Output will have 1s where signs match, and 0s where they don't.
- UpdateSign – Outputs gradient with sign copied from the update.
- UpdateToNone – Sets the update attribute to None on var.
- VectorProjection – Projection that concatenates all parameters into a vector.
- ViewAsReal – Views complex tensors as real tensors. Doesn't affect tensors that are already real.
- Warmup – Learning rate warmup, linearly increases the learning rate multiplier from start_lr to end_lr over steps steps.
- WarmupNormClip – Warmup via clipping of the update norm.
- WeightDecay – Weight decay.
- WeightDropout – Changes the closure so that it evaluates loss and gradients with random weights replaced with 0.
- WeightedAveraging – Weighted average of past len(weights) updates.
- WeightedMean – Outputs a weighted mean of inputs that can be modules or numbers.
- WeightedSum – Outputs a weighted sum of inputs that can be modules or numbers.
- Wrap – Wraps a PyTorch optimizer to use it as a module.
- Zeros – Outputs zeros.
Functions:
- clip_grad_norm_ – Clips gradient of an iterable of parameters to specified norm value.
- clip_grad_value_ – Clips gradient of an iterable of parameters at specified value.
- decay_weights_ – Directly decays weights in-place.
- normalize_grads_ – Normalizes gradient of an iterable of parameters to specified norm value.
- orthogonalize_grads_ – Computes the zeroth power / orthogonalization of gradients of an iterable of parameters.
- orthograd_ – Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
AEGD ¶
Bases: torchzero.core.transform.TensorTransform
AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.
Note
AEGD has a learning rate hyperparameter that can't really be removed from the update rule.
To avoid compounding learning rate modifications, remove the tz.m.LR module if you had it.
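For example, a minimal sketch of using AEGD on its own, following the chaining pattern used in the other examples on this page (assumes import torchzero as tz and an existing model); note there is no tz.m.LR in the chain:
opt = tz.Optimizer(
    model.parameters(),
    # AEGD's own lr hyperparameter controls the step size
    tz.m.AEGD(lr=0.1),
)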
Parameters:
-
lr(float, default:0.1) –learning rate (default: 0.1)
-
c(float, default:1) –term added to the original objective function (default: 1)
Source code in torchzero/modules/adaptive/aegd.py
ASAM ¶
Bases: torchzero.modules.adaptive.sam.SAM
Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.
This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.
Note
This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.
Parameters:
-
rho(float, default:0.5) –Neighborhood size. Defaults to 0.05.
-
p(float, default:2) –norm of the SAM objective. Defaults to 2.
Examples:¶
ASAM-SGD:
ASAM-Adam:
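The snippets below are a sketch of what these could look like, following the chaining pattern used in the other examples on this page (assumes import torchzero as tz and an existing model; step sizes are illustrative):
# ASAM-SGD: ASAM objective followed by a plain step size
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ASAM(),
    tz.m.LR(1e-2),
)
# ASAM-Adam: ASAM objective followed by Adam preconditioning
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ASAM(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)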
References: Kwon, J., Kim, J., Park, H., & Choi, I. K. (2021, July). ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning (pp. 5905-5914). PMLR.
Source code in torchzero/modules/adaptive/sam.py
Abs ¶
Bases: torchzero.core.transform.TensorTransform
Returns abs(input)
Source code in torchzero/modules/ops/unary.py
AccumulateMaximum ¶
Bases: torchzero.core.transform.TensorTransform
Accumulates maximum of all past updates.
Parameters:
-
decay(float, default:0) –decays the accumulator. Defaults to 0.
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateMean ¶
Bases: torchzero.core.transform.TensorTransform
Accumulates mean of all past updates.
Parameters:
-
decay(float, default:0) –decays the accumulator. Defaults to 0.
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateMinimum ¶
Bases: torchzero.core.transform.TensorTransform
Accumulates minimum of all past updates.
Parameters:
-
decay(float, default:0) –decays the accumulator. Defaults to 0.
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateProduct ¶
Bases: torchzero.core.transform.TensorTransform
Accumulates product of all past updates.
Parameters:
-
decay(float, default:0) –decays the accumulator. Defaults to 0.
-
target(Target, default:'update') –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateSum ¶
Bases: torchzero.core.transform.TensorTransform
Accumulates sum of all past updates.
Parameters:
-
decay(float, default:0) –decays the accumulator. Defaults to 0.
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AdGD ¶
Bases: torchzero.core.transform.TensorTransform
AdGD and AdGD-2 (https://arxiv.org/abs/2308.02261)
Source code in torchzero/modules/step_size/adaptive.py
AdaHessian ¶
Bases: torchzero.core.transform.Transform
AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)
This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of random hessian-vector products.
Notes
-
In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the
inner argument if you wish to apply AdaHessian preconditioning to another module's output. -
This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backward argument (refer to documentation).
Parameters:
-
beta1(float, default:0.9) –first momentum. Defaults to 0.9.
-
beta2(float, default:0.999) –second momentum for squared hessian diagonal estimates. Defaults to 0.999.
-
averaging(bool, default:True) –whether to enable block diagonal averaging over 1st dimension on parameters that have 2+ dimensions. This can be set per-parameter in param groups.
-
block_size(int, default:None) –size of block in the block-diagonal averaging.
-
update_freq(int, default:1) –frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 1.
-
eps(float, default:1e-08) –division stability epsilon. Defaults to 1e-8.
-
hvp_method(str, default:'autograd') –Determines how hessian-vector products are computed.
- "batched_autograd" - uses autograd with batched hessian-vector products. If a single hessian-vector product is evaluated, equivalent to "autograd". Faster than "autograd" but uses more memory.
- "autograd" - uses autograd hessian-vector products. If multiple hessian-vector products are evaluated, uses a for-loop. Slower than "batched_autograd" but uses less memory.
- "fd_forward" - uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product.
- "fd_central" - uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
Defaults to
"autograd". -
h(float, default:0.001) –The step size for finite difference if
hvp_method is "fd_forward" or "fd_central". Defaults to 1e-3. -
n_samples(int, default:1) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed(int | None, default:None) –seed for random vectors. Defaults to None.
-
inner(Chainable | None) –Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to
inner. 3. momentum and preconditioning are applied to the outputs of inner.
Examples:¶
Using AdaHessian:
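A basic setup might look like the following sketch (assumes import torchzero as tz and an existing model; the learning rate is illustrative):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.AdaHessian(),
    tz.m.LR(0.1),
)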
AdaHessian preconditioner can be applied to any other module by passing it to the inner argument.
Turn off AdaHessian's first momentum to get just the preconditioning. Here is an example of applying
AdaHessian preconditioning to nesterov momentum (tz.m.NAG):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.AdaHessian(beta1=0, inner=tz.m.NAG(0.9)),
    tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/adahessian.py
Adagrad ¶
Bases: torchzero.core.transform.TensorTransform
Adagrad, divides by sum of past squares of gradients.
This implementation is identical to torch.optim.Adagrad.
Parameters:
-
lr_decay(float, default:0) –learning rate decay. Defaults to 0.
-
initial_accumulator_value(float, default:0) –initial value of the sum of squares of gradients. Defaults to 0.
-
eps(float, default:1e-10) –division epsilon. Defaults to 1e-10.
-
alpha(float, default:1) –step size. Defaults to 1.
-
pow(float) –power for gradients and accumulator root. Defaults to 2.
-
use_sqrt(bool) –whether to take the root of the accumulator. Defaults to True.
-
inner(Chainable | None, default:None) –Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
Source code in torchzero/modules/adaptive/adagrad.py
AdagradNorm ¶
Bases: torchzero.core.transform.TensorTransform
Adagrad-Norm, divides by sum of past means of squares of gradients.
Parameters:
-
lr_decay(float, default:0) –learning rate decay. Defaults to 0.
-
initial_accumulator_value(float, default:0) –initial value of the sum of squares of gradients. Defaults to 0.
-
eps(float, default:1e-10) –division epsilon. Defaults to 1e-10.
-
alpha(float, default:1) –step size. Defaults to 1.
-
use_sqrt(bool, default:True) –whether to take the root of the accumulator. Defaults to True.
-
inner(Chainable | None, default:None) –Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
Source code in torchzero/modules/adaptive/adagrad.py
Adam ¶
Bases: torchzero.core.transform.TensorTransform
Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.
This implementation is identical to torch.optim.Adam.
Parameters:
-
beta1(float, default:0.9) –momentum. Defaults to 0.9.
-
beta2(float, default:0.999) –second momentum. Defaults to 0.999.
-
eps(float, default:1e-08) –epsilon. Defaults to 1e-8.
-
alpha(float, default:1.0) –learning rate. Defaults to 1.
-
amsgrad(bool, default:False) –Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
-
pow(float) –power used in second momentum power and root. Defaults to 2.
-
debias(bool, default:True) –whether to apply debiasing to momentums based on current step. Defaults to True.
Source code in torchzero/modules/adaptive/adam.py
Adan ¶
Bases: torchzero.core.transform.TensorTransform
Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677
Parameters:
-
beta1(float, default:0.98) –momentum. Defaults to 0.98.
-
beta2(float, default:0.92) –momentum for gradient differences. Defaults to 0.92.
-
beta3(float, default:0.99) –third (squared) momentum. Defaults to 0.99.
-
eps(float, default:1e-08) –epsilon. Defaults to 1e-8.
Example:
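A sketch of a typical setup, following the chaining pattern used elsewhere on this page (assumes import torchzero as tz and an existing model; the learning rate is illustrative):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adan(),
    tz.m.LR(1e-3),
)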
Reference: Xie, X., Zhou, P., Li, H., Lin, Z., & Yan, S. (2024). Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Source code in torchzero/modules/adaptive/adan.py
AdaptiveBacktracking ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Adaptive backtracking line search. After each line search procedure, a new initial step size is set such that optimal step size in the procedure would be found on the second line search iteration.
Parameters:
-
init(float, default:1.0) –initial step size. Defaults to 1.0.
-
beta(float, default:0.5) –multiplies each consecutive step size by this value. Defaults to 0.5.
-
c(float, default:0.0001) –sufficient decrease condition. Defaults to 1e-4.
-
condition(Literal, default:'armijo') –termination condition, only ones that do not use gradient at f(x+a*d) can be specified. - "armijo" - sufficient decrease condition. - "decrease" - any decrease in objective function value satisfies the condition.
"goldstein" can technically be specified but it doesn't make sense because there is no zoom stage. Defaults to 'armijo'.
-
maxiter(int, default:20) –maximum number of function evaluations per step. Defaults to 10.
-
target_iters(int, default:1) –sets next step size such that this number of iterations are expected to be performed until optimal step size is found. Defaults to 1.
-
nplus(float, default:2.0) –if initial step size is optimal, it is multiplied by this value. Defaults to 2.0.
-
scale_beta(float, default:0.0) –momentum for initial step size, at 0 disables momentum. Defaults to 0.0.
Source code in torchzero/modules/line_search/backtracking.py
AdaptiveBisection ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
A line search that evaluates previous step size, if value increased, backtracks until the value stops decreasing, otherwise forward-tracks until value stops decreasing.
Parameters:
-
init(float, default:1.0) –initial step size. Defaults to 1.0.
-
nplus(float, default:2) –multiplier to step size if initial step size is optimal. Defaults to 2.
-
nminus(float, default:0.5) –multiplier to step size if initial step size is too big. Defaults to 0.5.
-
maxiter(int, default:10) –maximum number of function evaluations per step. Defaults to 10.
-
adaptive(bool, default:True) –when enabled, if line search failed, step size will continue decreasing on the next step. Otherwise it will restart the line search from
init step size. Defaults to True.
Source code in torchzero/modules/line_search/adaptive.py
AdaptiveHeavyBall ¶
Bases: torchzero.core.transform.TensorTransform
Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.
Suitable for quadratic objectives with known f* (loss at minimum).
note
The step size is determined by the algorithm, so learning rate modules shouldn't be used.
Parameters:
-
f_star(int, default:0) –(estimated) minimal possible value of the objective function (lowest possible loss). Defaults to 0.
Source code in torchzero/modules/adaptive/adaptive_heavyball.py
Add ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Add other to tensors. other can be a number or a module.
If other is a module, this calculates tensors + other(tensors)
Source code in torchzero/modules/ops/binary.py
Alternate ¶
Bases: torchzero.core.module.Module
Alternates between stepping with modules.
That is, first step is performed with 1st module, second step with second module, etc.
Parameters:
-
steps(int | Iterable[int], default:1) –number of steps to perform with each module. Defaults to 1.
Examples:¶
Alternate between Adam, SignSGD and RMSprop
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Alternate(
        tz.m.Adam(),
        [tz.m.SignSGD(), tz.m.Mul(0.5)],
        tz.m.RMSprop(),
    ),
    tz.m.LR(1e-3),
)
Source code in torchzero/modules/misc/switch.py
LOOP
class-attribute
¶
bool(x) -> bool
Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.
Averaging ¶
Bases: torchzero.core.transform.TensorTransform
Average of past history_size updates.
Parameters:
-
history_size(int) –Number of past updates to average
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/momentum/averaging.py
BBStab ¶
Bases: torchzero.core.transform.TensorTransform
Stabilized Barzilai-Borwein method (https://arxiv.org/abs/1907.06409).
This clips the norm of the Barzilai-Borwein update by delta, where delta can be adaptive if c is specified.
Parameters:
-
c(float, default:0.2) –adaptive delta parameter. If
delta is set to None, first inf_iters updates are performed with non-stabilized Barzilai-Borwein step size. Then delta is set to norm of the update that had the smallest norm, and multiplied by c. Defaults to 0.2. -
delta(float | None, default:None) –Barzilai-Borwein update is clipped to this value. Set to
None to use an adaptive choice. Defaults to None. -
type(str, default:'geom') –one of "short" with formula sᵀy/yᵀy, "long" with formula sᵀs/sᵀy, or "geom" to use geometric mean of short and long. Defaults to "geom". Note that "long" corresponds to BB1stab and "short" to BB2stab, however I found that "geom" works really well.
-
inner(Chainable | None, default:None) –step size will be applied to outputs of this module. Defaults to None.
Source code in torchzero/modules/step_size/adaptive.py
BFGS ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.
Note
a line search or a trust region is recommended
Warning
this uses at least O(N^2) memory.
Parameters:
-
init_scale(float | Literal['auto'], default:'auto') –initial hessian matrix is set to identity times this.
"auto" corresponds to a heuristic from Nocedal & Wright, Numerical Optimization, pp. 142-143.
Defaults to "auto".
-
tol(float, default:1e-32) –tolerance on curvature condition. Defaults to 1e-32.
-
ptol(float | None, default:1e-32) –skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.
-
ptol_restart(bool, default:False) –whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
-
restart_interval(int | None | Literal['auto'], default:None) –interval between resetting the hessian approximation.
"auto" corresponds to number of decision variables + 1.
None - no resets.
Defaults to None.
-
beta(float | None, default:None) –momentum on H or B. Defaults to None.
-
update_freq(int, default:1) –frequency of updating H or B. Defaults to 1.
-
scale_first(bool, default:False) –whether to downscale first step before hessian approximation becomes available. Defaults to True.
-
scale_second(bool) –whether to downscale second step. Defaults to False.
-
concat_params(bool, default:True) –If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.
-
inner(Chainable | None, default:None) –preconditioning is applied to the output of this module. Defaults to None.
Examples:¶
BFGS with backtracking line search:
BFGS with trust region
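A sketch of both setups, following the chaining pattern used in the other examples on this page (assumes import torchzero as tz and an existing model; the trust region variant assumes tz.m.TrustCG wraps a hessian-maintaining module the way tz.m.CubicRegularization does further down this page):
# BFGS with backtracking line search
opt = tz.Optimizer(
    model.parameters(),
    tz.m.BFGS(),
    tz.m.Backtracking(),
)
# BFGS with trust region
# (inverse=False per the quasi-newton note in CubicRegularization below)
opt = tz.Optimizer(
    model.parameters(),
    tz.m.TrustCG(tz.m.BFGS(inverse=False)),
)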
Source code in torchzero/modules/quasi_newton/quasi_newton.py
BacktrackOnSignChange ¶
Bases: torchzero.core.transform.TensorTransform
Negates or undoes the update for parameters where the gradient or update sign changes.
This is part of RProp update rule.
Parameters:
-
use_grad(bool, default:False) –if True, tracks sign change of the gradient, otherwise track sign change of the update. Defaults to True.
-
backtrack(bool, default:True) –if True, undoes the update when sign changes, otherwise negates it. Defaults to True.
Source code in torchzero/modules/adaptive/rprop.py
Backtracking ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Backtracking line search.
Parameters:
-
init(float, default:1.0) –initial step size. Defaults to 1.0.
-
beta(float, default:0.5) –multiplies each consecutive step size by this value. Defaults to 0.5.
-
c(float, default:0.0001) –sufficient decrease condition. Defaults to 1e-4.
-
condition(Literal, default:'armijo') –termination condition, only ones that do not use gradient at f(x+a*d) can be specified. - "armijo" - sufficient decrease condition. - "decrease" - any decrease in objective function value satisfies the condition.
"goldstein" can technically be specified but it doesn't make sense because there is no zoom stage. Defaults to 'armijo'.
-
maxiter(int, default:10) –maximum number of function evaluations per step. Defaults to 10.
-
adaptive(bool, default:True) –when enabled, if line search failed, step size will continue decreasing on the next step. Otherwise it will restart the line search from
initstep size. Defaults to True.
Examples: Gradient descent with backtracking line search:
L-BFGS with backtracking line search:
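A sketch of both setups, following the chaining pattern used elsewhere on this page (assumes import torchzero as tz and an existing model):
# gradient descent with backtracking line search
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Backtracking(),
)
# L-BFGS with backtracking line search
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.Backtracking(),
)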
Source code in torchzero/modules/line_search/backtracking.py
BarzilaiBorwein ¶
Bases: torchzero.core.transform.TensorTransform
Barzilai-Borwein step size method.
Parameters:
-
type(str, default:'geom') –one of "short" with formula sᵀy/yᵀy, "long" with formula sᵀs/sᵀy, or "geom" to use geometric mean of short and long. Defaults to "geom".
-
fallback(float) –step size when denominator is less than 0 (will happen on negative curvature). Defaults to 1e-3.
-
inner(Chainable | None, default:None) –step size will be applied to outputs of this module. Defaults to None.
Source code in torchzero/modules/step_size/adaptive.py
BinaryOperationBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for operations that use update as the first operand. This is an abstract class, subclass it and override transform method to use it.
Methods:
-
transform–applies the operation to operands
Source code in torchzero/modules/ops/binary.py
transform ¶
transform(objective: Objective, update: list[Tensor], **operands: Any | list[Tensor]) -> Iterable[Tensor]
applies the operation to operands
BirginMartinezRestart ¶
Bases: torchzero.core.module.Module
The restart criterion for conjugate gradient methods designed by Birgin and Martínez.
This criterion restarts when the angle between dₖ₊₁ and −gₖ₊₁ is not acute enough.
The restart clears all states of module.
Parameters:
-
module(Module) –module to restart, should be a conjugate gradient or possibly a quasi-newton method.
-
cond(float, default:0.001) –Restart is performed whenever dᵀg > −cond·‖d‖·‖g‖. The default condition value of 1e-3 is suggested by Birgin and Martínez.
Reference
Birgin, Ernesto G., and José Mario Martínez. "A spectral conjugate gradient method for unconstrained optimization." Applied Mathematics & Optimization 43.2 (2001): 117-128.
Source code in torchzero/modules/restarts/restars.py
BoldDriver ¶
Bases: torchzero.core.transform.TensorTransform
Multiplies step size by nplus if loss decreased compared to last iteration, otherwise multiplies by nminus.
Source code in torchzero/modules/step_size/adaptive.py
BroydenBad ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Broyden's "bad" Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
BroydenGood ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Broyden's "good" Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
CD ¶
Bases: torchzero.core.module.Module
Coordinate descent. Proposes a descent direction along a single coordinate.
A line search such as tz.m.ScipyMinimizeScalar(maxiter=8) or a fixed step size can be used after this.
Parameters:
-
h(float, default:0.001) –finite difference step size. Defaults to 1e-3.
-
grad(bool, default:False) –if True, scales direction by gradient estimate. If False, the scale is fixed to 1. Defaults to True.
-
adaptive(bool, default:True) –whether to adapt finite difference step size, this requires an additional buffer. Defaults to True.
-
index(str, default:'cyclic2') –index selection strategy. - "cyclic" - repeatedly cycles through each coordinate, e.g.
1,2,3,1,2,3,... - "cyclic2" - cycles forward and then backward, e.g. 1,2,3,3,2,1,1,2,3,... (default). - "random" - picks coordinate randomly. -
threepoint(bool, default:True) –whether to use three points (three function evaluations) to determine descent direction. If False, uses two points, but then
adaptive can't be used. Defaults to True.
Source code in torchzero/modules/zeroth_order/cd.py
Cautious ¶
Bases: torchzero.core.transform.TensorTransform
Negates update for parameters where update and gradient sign is inconsistent. Optionally normalizes the update by the number of parameters that are not masked. This is meant to be used after any momentum-based modules.
Parameters:
-
normalize(bool, default:False) –renormalize update after masking. only has effect when mode is 'zero'. Defaults to False.
-
eps(float, default:1e-06) –epsilon for normalization. Defaults to 1e-6.
-
mode(str, default:'zero') –what to do with updates with inconsistent signs. - "zero" - set them to zero (as in paper) - "grad" - set them to the gradient (same as using update magnitude and gradient sign) - "backtrack" - negate them
Examples:¶
Cautious Adam
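A sketch of cautious Adam, placing tz.m.Cautious after the momentum-based module as described above (assumes import torchzero as tz and an existing model; the learning rate is illustrative):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.Cautious(),
    tz.m.LR(1e-2),
)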
References
Cautious Optimizers: Improving Training with One Line of Code. Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu
Source code in torchzero/modules/momentum/cautious.py
CautiousWeightDecay ¶
Bases: torchzero.core.transform.TensorTransform
Cautious weight decay (https://arxiv.org/pdf/2510.12402).
Weight decay but only applied to updates where update sign matches weight decay sign.
Parameters:
-
weight_decay(float) –weight decay scale.
-
ord(int, default:2) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
-
target(Target) –what to set on var. Defaults to 'update'.
Examples:¶
Adam with non-decoupled cautious weight decay
opt = tz.Optimizer(
    model.parameters(),
    tz.m.CautiousWeightDecay(1e-3),
    tz.m.Adam(),
    tz.m.LR(1e-3)
)
Adam with decoupled cautious weight decay that still scales with learning rate
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.CautiousWeightDecay(1e-3),
    tz.m.LR(1e-3)
)
Adam with fully decoupled cautious weight decay that doesn't scale with learning rate
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
    tz.m.CautiousWeightDecay(1e-6)
)
Source code in torchzero/modules/weight_decay/weight_decay.py
CenteredEMASquared ¶
Bases: torchzero.core.transform.TensorTransform
Maintains a centered exponential moving average of squared updates. This also maintains an additional exponential moving average of un-squared updates, square of which is subtracted from the EMA.
Parameters:
-
beta(float, default:0.99) –momentum value. Defaults to 0.999.
-
amsgrad(bool, default:False) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
pow(float, default:2) –power, absolute value is always used. Defaults to 2.
Source code in torchzero/modules/ops/higher_level.py
CenteredSqrtEMASquared ¶
Bases: torchzero.core.transform.TensorTransform
Maintains a centered exponential moving average of squared updates, outputs optionally debiased square root. This also maintains an additional exponential moving average of un-squared updates, square of which is subtracted from the EMA.
Parameters:
-
beta(float, default:0.99) –momentum value. Defaults to 0.999.
-
amsgrad(bool, default:False) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
debiased(bool, default:False) –whether to multiply the output by a debiasing term from the Adam method. Defaults to False.
-
pow(float, default:2) –power, absolute value is always used. Defaults to 2.
Source code in torchzero/modules/ops/higher_level.py
Centralize ¶
Bases: torchzero.core.transform.TensorTransform
Centralizes the update.
Parameters:
-
dim(int | Sequence[int] | str | None, default:None) –calculates norm along those dimensions. If list/tuple, tensors are centralized along all dimensions in
dim that they have. Can be set to "global" to centralize by global mean of all gradients concatenated to a vector. Defaults to None. -
inverse_dims(bool, default:False) –if True, the
dims argument is inverted, and all other dimensions are centralized.
min_size(int, default:2) –minimal size of a dimension to normalize along it. Defaults to 1.
Examples:
Standard gradient centralization:
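A sketch of what this might look like, following the chaining pattern used elsewhere on this page (assumes import torchzero as tz and an existing model; here the gradient is centralized before Adam, and values are illustrative):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Centralize(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)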
References: - Yong, H., Huang, J., Hua, X., & Zhang, L. (2020). Gradient centralization: A new optimization technique for deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16 (pp. 635-652). Springer International Publishing. https://arxiv.org/abs/2004.01461
Source code in torchzero/modules/clipping/clipping.py
Clip ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Clips tensors to be in (min, max) range. min and max can be None, numbers or modules.
If min and max are modules, this calculates tensors.clip(min(tensors), max(tensors)).
Source code in torchzero/modules/ops/binary.py
ClipModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates input(tensors).clip(min, max). min and max can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
ClipNorm ¶
Bases: torchzero.core.transform.TensorTransform
Clips update norm to be no larger than value.
Parameters:
-
max_norm(float) –value to clip norm to.
-
ord(float, default:2) –norm order. Defaults to 2.
-
dim(int | Sequence[int] | str | None, default:None) –calculates norm along those dimensions. If list/tuple, tensors are normalized along all dimensions in
dim that they have. Can be set to "global" to normalize by global norm of all gradients concatenated to a vector. Defaults to None. -
inverse_dims(bool, default:False) –if True, the
dims argument is inverted, and all other dimensions are normalized. -
min_size(int, default:1) –minimal number of elements in a parameter or slice to clip norm. Defaults to 1.
-
target(str) –what this affects.
Examples:
Gradient norm clipping:
Update norm clipping:
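Both uses differ only in where the module sits in the chain; a sketch following the pattern used elsewhere on this page (assumes import torchzero as tz and an existing model; the max_norm value is illustrative):
# gradient norm clipping: clip before other modules see the gradient
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ClipNorm(max_norm=1.0),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
# update norm clipping: clip the final update instead
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.ClipNorm(max_norm=1.0),
    tz.m.LR(1e-3),
)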
Source code in torchzero/modules/clipping/clipping.py
ClipNormByEMA ¶
Bases: torchzero.core.transform.TensorTransform
Clips norm to be no larger than the norm of an exponential moving average of past updates.
Parameters:
-
beta(float, default:0.99) –beta for the exponential moving average. Defaults to 0.99.
-
ord(float, default:2) –order of the norm. Defaults to 2.
-
eps(float) –epsilon for division. Defaults to 1e-6.
-
tensorwise(bool, default:True) –if True, norms are calculated parameter-wise, otherwise treats all parameters as single vector. Defaults to True.
-
max_ema_growth(float | None, default:1.5) –if specified, restricts how quickly exponential moving average norm can grow. The norm is allowed to grow by at most this value per step. Defaults to 1.5.
-
ema_init(str) –How to initialize exponential moving average on first step, "update" to use the first update or "zeros". Defaults to 'zeros'.
Source code in torchzero/modules/clipping/ema_clipping.py
NORMALIZE
class-attribute
¶
bool(x) -> bool
Returns True when the argument x is true, False otherwise. The builtins True and False are the only two instances of the class bool. The class bool is a subclass of the class int, and cannot be subclassed.
ClipNormGrowth ¶
Bases: torchzero.core.transform.TensorTransform
Clips update norm growth.
Parameters:
-
add(float | None, default:None) –additive clipping, next update norm is at most
previous norm + add. Defaults to None. -
mul(float | None, default:1.5) –multiplicative clipping, next update norm is at most
previous norm * mul. Defaults to 1.5. -
min_value(float | None, default:0.0001) –minimum value for multiplicative clipping to prevent collapse to 0. Next norm is at most :code:
max(prev_norm, min_value) * mul. Defaults to 1e-4. -
max_decay(float | None, default:2) –bounds the tracked multiplicative clipping decay to prevent collapse to 0. Next norm is at most :code:
max(previous norm * mul, max_decay). Defaults to 2. -
ord(float, default:2) –norm order. Defaults to 2.
-
tensorwise(bool, default:True) –if True, norms are calculated parameter-wise, otherwise treats all parameters as single vector. Defaults to True.
-
target(Target) –what to set on var. Defaults to "update".
Source code in torchzero/modules/clipping/growth_clipping.py
ClipValue ¶
Bases: torchzero.core.transform.TensorTransform
Clips update magnitude to be within (-value, value) range.
Parameters:
-
value(float) –value to clip to.
-
target(str) –refer to
target argument in documentation.
Examples:
Gradient clipping:
Update clipping:
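A sketch analogous to the ClipNorm examples above (assumes import torchzero as tz and an existing model; the clipping value is illustrative):
# gradient clipping
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ClipValue(value=1.0),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
# update clipping
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.ClipValue(value=1.0),
    tz.m.LR(1e-3),
)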
Source code in torchzero/modules/clipping/clipping.py
ClipValueByEMA ¶
Bases: torchzero.core.transform.TensorTransform
Clips magnitude of update to be no larger than magnitude of exponential moving average of past (unclipped) updates.
Parameters:
-
beta(float, default:0.99) –beta for the exponential moving average. Defaults to 0.99.
-
ema_init(str) –How to initialize exponential moving average on first step, "update" to use the first update or "zeros". Defaults to 'zeros'.
-
exp_avg_tfm(Chainable | None, default:None) –optional modules applied to exponential moving average before clipping by it. Defaults to None.
Source code in torchzero/modules/clipping/ema_clipping.py
ClipValueGrowth ¶
Bases: torchzero.core.transform.TensorTransform
Clips update value magnitude growth.
Parameters:
-
add(float | None, default:None) –additive clipping, next update is at most
previous update + add. Defaults to None. -
mul(float | None, default:1.5) –multiplicative clipping, next update is at most
previous update * mul. Defaults to 1.5. -
min_value(float | None, default:0.0001) –minimum value for multiplicative clipping to prevent collapse to 0. Next update is at most :code:
max(prev_update, min_value) * mul. Defaults to 1e-4. -
max_decay(float | None, default:2) –bounds the tracked multiplicative clipping decay to prevent collapse to 0. Next update is at most :code:
max(previous update * mul, max_decay). Defaults to 2. -
target(Target) –what to set on var. Defaults to "update".
Source code in torchzero/modules/clipping/growth_clipping.py
Clone ¶
Bases: torchzero.core.module.Module
Clones input. May be useful to store some intermediate result and make sure it doesn't get affected by in-place operations
Source code in torchzero/modules/ops/utility.py
ConjugateDescent ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Conjugate Descent (CD).
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
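A minimal sketch of the setup recommended in the note above, assuming an existing `model`:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and a closure is passed to opt.step
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ConjugateDescent(),
    tz.m.StrongWolfe(c2=0.1, a_init="first-order"),  # line search, as recommended in the note
)
```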
CopyMagnitude ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Returns other(tensors) with sign copied from tensors.
Source code in torchzero/modules/ops/binary.py
CopySign ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Returns tensors with sign copied from other(tensors).
Source code in torchzero/modules/ops/binary.py
CubicRegularization ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Cubic regularization.
Parameters:
-
hess_module(Module | None) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton and tz.m.GaussNewton. When using quasi-newton methods, set inverse=False when constructing them. -
eta(float, default:0.0) –if ratio of actual to predicted reduction is larger than this, step is accepted. When
hess_module is GaussNewton, this can be set to 0. Defaults to 0. -
nplus(float, default:3.5) –increase factor on successful steps. Defaults to 3.5.
-
nminus(float, default:0.25) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good(float, default:0.99) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus. -
rho_bad(float, default:0.0001) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus. -
init(float, default:1) –Initial trust region value. Defaults to 1.
-
maxiter(float, default:100) –maximum iterations when solving cubic subproblem. Defaults to 100.
-
eps(float, default:1e-08) –epsilon for the solver, defaults to 1e-8.
-
update_freq(int, default:1) –frequency of updating the hessian. Defaults to 1.
-
max_attempts(max_attempts, default:10) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
fallback(bool) –if
True, when hess_module maintains a hessian inverse which can't be inverted efficiently, it will be inverted anyway. When False (default), a RuntimeError will be raised instead. -
inner(Chainable | None, default:None) –preconditioning is applied to output of this module. Defaults to None.
Examples:
Cubic regularized newton
```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.CubicRegularization(tz.m.Newton()),
)
```
Source code in torchzero/modules/trust_region/cubic_regularization.py
CustomUnaryOperation ¶
Bases: torchzero.core.transform.TensorTransform
Applies getattr(tensor, name) to each tensor
Source code in torchzero/modules/ops/unary.py
DFP ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Davidon–Fletcher–Powell Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
DNRTR ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Diagonal quasi-newton method.
Reference
Andrei, Neculai. "A diagonal quasi-Newton updating method for unconstrained optimization." Numerical Algorithms 81.2 (2019): 575-590.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DYHS ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Dai-Yuan - Hestenes–Stiefel hybrid conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
DaiYuan ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Dai–Yuan nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1) after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
Debias ¶
Bases: torchzero.core.transform.TensorTransform
Multiplies the update by an Adam debiasing term based on first and/or second momentum.
Parameters:
-
beta1(float | None, default:None) –first momentum, should be the same as first momentum used in modules before. Defaults to None.
-
beta2(float | None, default:None) –second (squared) momentum, should be the same as second momentum used in modules before. Defaults to None.
-
alpha(float, default:1) –learning rate. Defaults to 1.
-
pow(float, default:2) –power, assumes absolute value is used. Defaults to 2.
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/higher_level.py
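A minimal sketch of pairing Debias with a momentum module, mirroring how Debias appears in the GGT examples further down this page; the beta value is illustrative and `model` is assumed to exist:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module
opt = tz.Optimizer(
    model.parameters(),
    tz.m.EMA(0.9),           # first-momentum EMA of the update
    tz.m.Debias(beta1=0.9),  # debias it with the same beta
    tz.m.LR(1e-3),
)
```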
Debias2 ¶
Bases: torchzero.core.transform.TensorTransform
Multiplies the update by an Adam debiasing term based on the second momentum.
Parameters:
-
beta(float | None, default:0.999) –second (squared) momentum, should be the same as second momentum used in modules before. Defaults to 0.999.
-
pow(float, default:2) –power, assumes absolute value is used. Defaults to 2.
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/higher_level.py
DiagonalBFGS ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DiagonalQuasiCauchi ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults
Diagonal quasi-Cauchy method.
Reference
Zhu M., Nazareth J. L., Wolkowicz H. The quasi-Cauchy relation and diagonal updating //SIAM Journal on Optimization. – 1999. – Т. 9. – №. 4. – С. 1192-1204.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DiagonalSR1 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DiagonalWeightedQuasiCauchi ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults
Diagonal weighted quasi-Cauchy method.
Reference
Leong, Wah June, Sharareh Enshaei, and Sie Long Kek. "Diagonal quasi-Newton methods via least change updating principle with weighted Frobenius norm." Numerical Algorithms 86 (2021): 1225-1241.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DirectWeightDecay ¶
Bases: torchzero.core.module.Module
Directly applies weight decay to parameters.
Parameters:
-
weight_decay(float) –weight decay scale.
-
ord(int, default:2) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
Source code in torchzero/modules/weight_decay/weight_decay.py
Div ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Divide tensors by other. other can be a number or a module.
If other is a module, this calculates tensors / other(tensors)
Source code in torchzero/modules/ops/binary.py
DivByLoss ¶
Bases: torchzero.core.transform.TensorTransform
Divides update by loss times alpha
Source code in torchzero/modules/misc/misc.py
DivModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates input / other. input and other can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
Dogleg ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Dogleg trust region algorithm.
Parameters:
-
hess_module(Module | None) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton and tz.m.GaussNewton. When using quasi-newton methods, set inverse=False when constructing them. -
eta(float, default:0.0) –if ratio of actual to predicted reduction is larger than this, step is accepted. When
hess_module is GaussNewton, this can be set to 0. Defaults to 0. -
nplus(float, default:2) –increase factor on successful steps. Defaults to 2.
-
nminus(float, default:0.25) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good(float, default:0.75) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus. -
rho_bad(float, default:0.25) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus. -
init(float, default:1) –Initial trust region value. Defaults to 1.
-
update_freq(int, default:1) –frequency of updating the hessian. Defaults to 1.
-
max_attempts(max_attempts, default:10) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
inner(Chainable | None, default:None) –preconditioning is applied to output of this module. Defaults to None.
Source code in torchzero/modules/trust_region/dogleg.py
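A minimal sketch mirroring the CubicRegularization example above, assuming the hessian module is the first positional argument and that `model` exists:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and a closure is passed to opt.step
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Dogleg(tz.m.Newton()),
)
```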
Dropout ¶
Bases: torchzero.core.transform.Transform
Applies dropout to the update.
For each weight, the update to that weight has probability p of being set to 0.
This can be used to implement gradient dropout or update dropout depending on placement.
Parameters:
-
p(float, default:0.5) –probability that update for a weight is replaced with 0. Defaults to 0.5.
-
graft(bool, default:False) –if True, update after dropout is rescaled to have the same norm as before dropout. Defaults to False.
-
target(Target) –what to set on var, refer to documentation. Defaults to 'update'.
Examples:¶
Gradient dropout.
Update dropout.
```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.Dropout(0.5),
    tz.m.LR(1e-3)
)
```
Source code in torchzero/modules/misc/regularization.py
DualNormCorrection ¶
Bases: torchzero.core.transform.TensorTransform
Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).
Orthogonalize already has this built in with the dual_norm_correction setting.
Source code in torchzero/modules/adaptive/muon.py
EMA ¶
Bases: torchzero.core.transform.TensorTransform
Maintains an exponential moving average of update.
Parameters:
-
momentum(float, default:0.9) –momentum (beta). Defaults to 0.9.
-
dampening(float, default:0) –momentum dampening. Defaults to 0.
-
debias(bool, default:False) –whether to debias the EMA like in Adam. Defaults to False.
-
lerp(bool, default:True) –whether to use linear interpolation. Defaults to True.
-
ema_init(str, default:'zeros') –initial values for the EMA, "zeros" or "update".
-
target(Target) –target to apply EMA to. Defaults to 'update'.
Source code in torchzero/modules/momentum/momentum.py
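A minimal placement sketch, assuming an existing `model` and the usual chain pattern used on this page:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module
opt = tz.Optimizer(
    model.parameters(),
    tz.m.EMA(0.9),  # exponential moving average of the update
    tz.m.LR(1e-2),
)
```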
EMASquared ¶
Bases: torchzero.core.transform.TensorTransform
Maintains an exponential moving average of squared updates.
Parameters:
-
beta(float, default:0.999) –momentum value. Defaults to 0.999.
-
amsgrad(bool, default:False) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
pow(float, default:2) –power, absolute value is always used. Defaults to 2.
Methods:
-
EMA_SQ_FN–Updates
exp_avg_sq_ with EMA of squared tensors, if max_exp_avg_sq_ is not None, updates it with maximum of EMA.
Source code in torchzero/modules/ops/higher_level.py
EMA_SQ_FN ¶
EMA_SQ_FN(tensors: TensorList, exp_avg_sq_: TensorList, beta: float | NumberList, max_exp_avg_sq_: TensorList | None, pow: float = 2)
Updates exp_avg_sq_ with EMA of squared tensors, if max_exp_avg_sq_ is not None, updates it with maximum of EMA.
Returns exp_avg_sq_ or max_exp_avg_sq_.
Source code in torchzero/modules/opt_utils.py
ESGD ¶
Bases: torchzero.core.transform.Transform
Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)
This is similar to Adagrad, but it accumulates squared randomized Hessian diagonal estimates instead of squared gradients.
Notes:
- In most cases ESGD should be the first module in the chain because it relies on autograd. Use the ``inner`` argument if you wish to apply ESGD preconditioning to another module's output.
- This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a ``backward`` argument (refer to documentation).
Args:
damping (float, optional): added to denominator for stability. Defaults to 1e-4.
update_freq (int, optional):
frequency of updating hessian diagonal estimate via a hessian-vector product.
This value can be increased to reduce computational cost. Defaults to 20.
hvp_method (str, optional):
Determines how hessian-vector products are computed.
- ``"batched_autograd"`` - uses autograd with batched hessian-vector products. If a single hessian-vector is evaluated, equivalent to ``"autograd"``. Faster than ``"autograd"`` but uses more memory.
- ``"autograd"`` - uses autograd hessian-vector products. If multiple hessian-vector products are evaluated, uses a for-loop. Slower than ``"batched_autograd"`` but uses less memory.
- ``"fd_forward"`` - uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product.
- ``"fd_central"`` - uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
Defaults to ``"autograd"``.
h (float, optional):
The step size for finite difference if ``hvp_method`` is
``"fd_forward"`` or ``"fd_central"``. Defaults to 1e-3.
n_samples (int, optional):
number of hessian-vector products with random vectors to evaluate each time when updating
the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
seed (int | None, optional): seed for random vectors. Defaults to None.
inner (Chainable | None, optional):
Inner module. If this is specified, operations are performed in the following order.
1. compute hessian diagonal estimate.
2. pass inputs to :code:`inner`.
3. momentum and preconditioning are applied to the outputs of :code:`inner`.
Examples:
Using ESGD:
```python
opt = tz.Optimizer(
model.parameters(),
tz.m.ESGD(),
tz.m.LR(0.1)
)
```
ESGD preconditioner can be applied to any other module by passing it to the :code:`inner` argument. Here is an example of applying
ESGD preconditioning to nesterov momentum (:code:`tz.m.NAG`):
```python
opt = tz.Optimizer(
model.parameters(),
tz.m.ESGD(beta1=0, inner=tz.m.NAG(0.9)),
tz.m.LR(0.1)
)
```
Source code in torchzero/modules/adaptive/esgd.py
EscapeAnnealing ¶
Bases: torchzero.core.module.Module
If parameters stop changing, this runs a backward annealing random search
Source code in torchzero/modules/misc/escape.py
Exp ¶
Bases: torchzero.core.transform.TensorTransform
Returns exp(input)
Source code in torchzero/modules/ops/unary.py
ExpHomotopy ¶
FDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Approximate gradients via finite difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h(float, default:0.001) –magnitude of parameter perturbation. Defaults to 1e-3.
-
formula(Literal, default:'central') –finite difference formula. Defaults to 'central'.
-
target(Literal, default:'closure') –what to set on var. Defaults to 'closure'.
Examples: plain FDM:
Any gradient-based method can use FDM-estimated gradients.
```python
fdm_ncg = tz.Optimizer(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
```
Source code in torchzero/modules/grad_approximation/fdm.py
Fill ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with value
Source code in torchzero/modules/ops/utility.py
FillLoss ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with loss value times alpha
Source code in torchzero/modules/misc/misc.py
FletcherReeves ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Fletcher–Reeves nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
FletcherVMM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Fletcher's variable metric Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. doi:10.1093/comjnl/13.3.317
Source code in torchzero/modules/quasi_newton/quasi_newton.py
ForwardGradient ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Forward gradient method.
This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
n_samples(int, default:1) –number of random gradient samples. Defaults to 1.
-
distribution(Literal, default:'gaussian') –distribution for random gradient samples. Defaults to "gaussian".
-
pre_generate(bool, default:True) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
jvp_method(str, default:'autograd') –how to calculate jacobian vector product, note that with
forwardand 'central' this is equivalent to randomized finite difference. Defaults to 'autograd'. -
h(float, default:0.001) –finite difference step size if jvp_method is set to
forward or central. Defaults to 1e-3. -
target(Literal, default:'closure') –what to set on var. Defaults to "closure".
References
Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
Source code in torchzero/modules/grad_approximation/forward_gradient.py
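A minimal usage sketch, assuming an existing `model`; the number of samples is illustrative:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and a closure is passed to opt.step
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ForwardGradient(n_samples=4),  # gradient estimate from 4 JVP samples
    tz.m.LR(1e-2),
)
```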
PRE_MULTIPLY_BY_H
class-attribute
¶
FullMatrixAdagrad ¶
Bases: torchzero.core.transform.TensorTransform
Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).
Note
A more memory-efficient version equivalent to full matrix Adagrad on last n gradients is implemented in tz.m.GGT.
Parameters:
-
reg(float, default:1e-12) –regularization, scale of identity matrix added to accumulator. Defaults to 1e-12.
-
precond_freq(int, default:1) –frequency of updating the inverse square root of the accumulator. Defaults to 1.
-
beta(float | None, default:None) –momentum for gradient outer product accumulators. if None, uses sum. Defaults to None.
-
beta_debias(bool, default:True) –whether to use debiasing, only has effect when
betais notNone. Defaults to True. -
init(Literal[str], default:'identity') –how to initialize the accumulator. - "identity" - with identity matrix (default). - "zeros" - with zero matrix. - "ones" - with matrix of ones. -"GGT" - with the first outer product
-
matrix_power(float, default:-0.5) –accumulator matrix power. Defaults to -1/2.
-
concat_params(bool, default:True) –if False, each parameter will have its own accumulator. Defaults to True.
-
inner(Chainable | None, default:None) –inner modules to apply preconditioning to. Defaults to None.
Examples:¶
Plain full-matrix adagrad
Full-matrix RMSprop
Full-matrix Adam
```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.FullMatrixAdagrad(beta=0.999, inner=tz.m.EMA(0.9)),
    tz.m.Debias(0.9, 0.999),
    tz.m.LR(1e-2),
)
```
Source code in torchzero/modules/adaptive/adagrad.py
GGT ¶
Bases: torchzero.core.transform.TensorTransform
GGT method from https://arxiv.org/pdf/1806.02958
The update rule is to stack recent gradients into M and compute eigendecomposition of M M^T via eigendecomposition of M^T M.
This is equivalent to full-matrix Adagrad on recent gradients.
Parameters:
-
history_size(int, default:100) –number of past gradients to store. Defaults to 100.
-
update_freq(int, default:1) –frequency of updating the preconditioner (U and S). Defaults to 1.
-
eig_tol(float, default:1e-07) –removes eigenvalues this much smaller than largest eigenvalue. Defaults to 1e-7.
-
truncate(int, default:None) –number of largest eigenvalues to keep. None to disable. Defaults to None.
-
damping(float, default:0.0001) –damping value. Defaults to 1e-4.
-
rdamping(float, default:0) –value of damping relative to largest eigenvalue. Defaults to 0.
-
concat_params(bool, default:True) –if True, treats all parameters as a single vector. Defaults to True.
-
inner(Chainable | None, default:None) –preconditioner will be applied to output of this module. Defaults to None.
Examples:¶
Limited-memory Adagrad
Adam with L-Adagrad preconditioner (for debiasing, second beta is 0.999 arbitrarily)
```python
optimizer = tz.Optimizer(
    model.parameters(),
    tz.m.GGT(inner=tz.m.EMA()),
    tz.m.Debias(0.9, 0.999),
    tz.m.LR(0.01)
)
```
Stable Adam with L-Adagrad preconditioner (this is what I would recommend)
```python
optimizer = tz.Optimizer(
    model.parameters(),
    tz.m.GGT(inner=tz.m.EMA()),
    tz.m.Debias(0.9, 0.999),
    tz.m.ClipNormByEMA(max_ema_growth=1.2),
    tz.m.LR(0.01)
)
```
Source code in torchzero/modules/adaptive/ggt.py
GGTBasis ¶
Bases: torchzero.core.transform.TensorTransform
Run another optimizer in GGT eigenbasis. The eigenbasis is rank-sized, so it is possible to run expensive
methods such as Full-matrix Adagrad/Adam.
The update rule is to stack recent gradients into M and compute eigendecomposition of M M^T via eigendecomposition of M^T M.
This is equivalent to full-matrix Adagrad on recent gradients.
Note
the buffers of the basis_opt are re-projected whenever basis changes. The reprojection logic is not implemented on all modules. Some supported modules are:
Adagrad, FullMatrixAdagrad, Adam, Adan, Lion, MARSCorrection, MSAMMomentum, RMSprop, GGT, EMA, HeavyBall, NAG, ClipNormByEMA, ClipValueByEMA, NormalizeByEMA, ClipValueGrowth, CoordinateMomentum, CubicAdam.
Additionally, most modules with no internal buffers are supported, e.g. Cautious, Sign, ClipNorm, Orthogonalize, etc. However, modules that use weight values, such as WeightDecay, can't be supported, as weights can't be projected.
Also, if you, for example, use EMA on the output of Pow(2), the exponential moving average will be reprojected as a gradient rather than as squared gradients. Use modules like EMASquared and SqrtEMASquared to get correct reprojections.
Parameters:
-
basis_opt(Chainable) –module or modules to run in GGT eigenbasis.
-
history_size(int, default:100) –number of past gradients to store, and rank of preconditioner. Defaults to 100.
-
update_freq(int, default:1) –frequency of updating the preconditioner (U and S). Defaults to 1.
-
eig_tol(float, default:1e-07) –removes eigenvalues this much smaller than largest eigenvalue. Defaults to 1e-7.
-
truncate(int, default:None) –number of largest eigenvalues to keep. None to disable. Defaults to None.
-
damping(float, default:0.0001) –damping value. Defaults to 1e-4.
-
rdamping(float, default:0) –value of damping relative to largest eigenvalue. Defaults to 0.
-
concat_params(bool) –if True, treats all parameters as a single vector. Defaults to True.
-
inner(Chainable | None, default:None) –output of this module is projected and
basis_optwill run on it, but preconditioners are updated from original gradients.
Examples:¶
Adam in GGT eigenbasis:
Full-matrix Adam in GGT eigenbasis. We can define full-matrix Adam through FullMatrixAdagrad.
```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.GGTBasis(
        [tz.m.FullMatrixAdagrad(beta=0.99, inner=tz.m.EMA(0.9, debias=True))]
    ),
    tz.m.LR(1e-3)
)
```
LaProp in GGT eigenbasis:
```python
# we define LaProp through other modules, moved it out for brevity
laprop = (
    tz.m.RMSprop(0.95),
    tz.m.Debias(beta1=None, beta2=0.95),
    tz.m.EMA(0.95),
    tz.m.Debias(beta1=0.95, beta2=None),
)

opt = tz.Optimizer(
    model.parameters(),
    tz.m.GGTBasis(laprop),
    tz.m.LR(1e-3)
)
```
Reference
Agarwal N. et al. Efficient full-matrix adaptive regularization //International Conference on Machine Learning. – PMLR, 2019. – С. 102-110.
Source code in torchzero/modules/basis/ggt_basis.py
GaussNewton ¶
Bases: torchzero.core.transform.Transform
Gauss-Newton method.
To use this, the closure should return a vector of values whose sum of squares is to be minimized.
Please add the backward argument; it will always be False, but it is required.
Gradients will be calculated via batched autograd within this module, so you don't need to
implement the backward pass. Please see below for an example.
Note
This method requires ndim^2 memory, however, if it is used within tz.m.TrustCG trust region,
the memory requirement is ndim*m, where m is number of values in the output.
Parameters:
-
reg(float, default:1e-08) –regularization parameter. Defaults to 1e-8.
-
update_freq(int, default:1) –frequency of computing the jacobian. When jacobian is not computed, only residuals are computed and updated. Defaults to 1.
-
batched(bool, default:True) –whether to use vmapping. Defaults to True.
Examples:
minimizing the rosenbrock function:
```python
def rosenbrock(X):
    x1, x2 = X
    return torch.stack([(1 - x1), 100 * (x2 - x1**2)])

X = torch.tensor([-1.1, 2.5], requires_grad=True)
opt = tz.Optimizer([X], tz.m.GaussNewton(), tz.m.Backtracking())

# define the closure for line search
def closure(backward=True):
    return rosenbrock(X)

# minimize
for iter in range(10):
    loss = opt.step(closure)
    print(f'{loss = }')
```
training a neural network with a matrix-free GN trust region:
```python
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))

opt = tz.Optimizer(
    model.parameters(),
    tz.m.TrustCG(tz.m.GaussNewton()),
)

def closure(backward=True):
    y_hat = model(X) # (64, 10)
    return (y_hat - y).pow(2).mean(0) # (10, )

for i in range(100):
    losses = opt.step(closure)
    if i % 10 == 0:
        print(f'{losses.mean() = }')
```
Source code in torchzero/modules/least_squares/gn.py
GaussianSmoothing ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Gaussian smoothing method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h(float, default:0.01) –finite difference step size if jvp_method is set to
forward or central. Defaults to 1e-2. -
n_samples(int, default:100) –number of random gradient samples. Defaults to 100.
-
formula(Literal, default:'forward2') –finite difference formula. Defaults to 'forward2'.
-
distribution(Literal, default:'gaussian') –distribution. Defaults to "gaussian".
-
pre_generate(bool, default:True) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed(int | None | Generator, default:None) –Seed for random generator. Defaults to None.
-
target(Literal, default:'closure') –what to set on var. Defaults to "closure".
References
Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
Source code in torchzero/modules/grad_approximation/rfdm.py
Grad ¶
Bases: torchzero.core.module.Module
Outputs the gradient
Source code in torchzero/modules/ops/utility.py
GradApproximator ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for gradient approximations.
This is an abstract class, to use it, subclass it and override approximate.
GradientApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.
Parameters:
-
defaults(dict[str, Any] | None, default:None) –dict with defaults. Defaults to None.
-
target(str, default:'closure') –whether to set
var.grad, var.update or var.closure. Defaults to 'closure'.
Example:
Basic SPSA method implementation.
```python
class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus
```
Methods:
-
approximate–Returns a tuple:
(grad, loss, loss_approx), make sure this resets parameters to their original values! -
pre_step–This runs once before each step, whereas
approximatemay run multiple times per step if further modules
Source code in torchzero/modules/grad_approximation/grad_approximator.py
approximate ¶
approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]
Returns a tuple: (grad, loss, loss_approx), make sure this resets parameters to their original values!
Source code in torchzero/modules/grad_approximation/grad_approximator.py
pre_step ¶
This runs once before each step, whereas approximate may run multiple times per step if further modules
evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations.
Source code in torchzero/modules/grad_approximation/grad_approximator.py
GradSign ¶
Bases: torchzero.core.transform.TensorTransform
Copies gradient sign to update.
Source code in torchzero/modules/misc/misc.py
GradToNone ¶
Bases: torchzero.core.module.Module
Sets grad attribute to None on objective.
Source code in torchzero/modules/ops/utility.py
GradientAccumulation ¶
Bases: torchzero.core.module.Module
Uses n steps to accumulate gradients; after n gradients have been accumulated, they are passed to :code:modules and parameters are updated.
Accumulating gradients for n steps is equivalent to increasing batch size by n. Increasing the batch size
is more computationally efficient, but sometimes it is not feasible due to memory constraints.
Note
Technically this can accumulate any inputs, including updates generated by previous modules. As long as this module is first, it will accumulate the gradients.
Parameters:
-
n(int) –number of gradients to accumulate.
-
mean(bool, default:True) –if True, uses mean of accumulated gradients, otherwise uses sum. Defaults to True.
-
stop(bool, default:True) –this module prevents next modules from stepping unless
n gradients have been accumulated. Setting this argument to False disables that. Defaults to True.
Examples:¶
Adam with gradients accumulated for 16 batches.
Source code in torchzero/modules/misc/gradient_accumulation.py
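A hedged sketch of the "Adam with gradients accumulated for 16 batches" setup named above. It assumes the accumulator is placed first in the chain, as the note suggests, and that n can be passed positionally; both are assumptions, and `model` is assumed to exist:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module;
# passing n=16 positionally is an assumption about the signature
opt = tz.Optimizer(
    model.parameters(),
    tz.m.GradientAccumulation(16),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
```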
GradientCorrection ¶
Bases: torchzero.core.transform.TensorTransform
Estimates the gradient at the minimum along the search direction, assuming the function is quadratic.
This can be useful as an inner module for second order methods with an inexact line search.
Example:¶
L-BFGS with gradient correction
```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LBFGS(inner=tz.m.GradientCorrection()),
    tz.m.Backtracking()
)
```
Reference
HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394
Source code in torchzero/modules/quasi_newton/quasi_newton.py
GradientSampling ¶
Bases: torchzero.core.reformulation.Reformulation
Samples and aggregates gradients and values at perturbed points.
This module can be used for gaussian homotopy and gradient sampling methods.
Parameters:
-
modules(Chainable | None, default:None) –modules that will be optimizing the modified objective. if None, returns gradient of the modified objective as the update. Defaults to None.
-
sigma(float, default:1.0) –initial magnitude of the perturbations. Defaults to 1.
-
n(int, default:100) –number of perturbations per step. Defaults to 100.
-
aggregate(str, default:'mean') –how to aggregate values and gradients - "mean" - uses mean of the gradients, as in gaussian homotopy. - "max" - uses element-wise maximum of the gradients. - "min" - uses element-wise minimum of the gradients. - "min-norm" - picks gradient with the lowest norm.
Defaults to 'mean'.
-
distribution(Literal, default:'gaussian') –distribution for random perturbations. Defaults to 'gaussian'.
-
include_x0(bool, default:True) –whether to include gradient at un-perturbed point. Defaults to True.
-
fixed(bool, default:True) –if True, perturbations do not get replaced by new random perturbations until termination criteria is satisfied. Defaults to True.
-
pre_generate(bool, default:True) –if True, perturbations are pre-generated before each step. This requires more memory to store all of them, but ensures they do not change when closure is evaluated multiple times. Defaults to True.
-
termination(TerminationCriteriaBase | Sequence[TerminationCriteriaBase] | None, default:None) –a termination criteria module, sigma will be multiplied by
decaywhen termination criteria is satisfied, and new perturbations will be generated iffixed. Defaults to None. -
decay(float, default:0.6666666666666666) –sigma multiplier on termination criteria. Defaults to 2/3.
-
reset_on_termination(bool, default:True) –whether to reset states of all other modules on termination. Defaults to True.
-
sigma_strategy(str | None, default:None) –strategy for adapting sigma. If condition is satisfied, sigma is multiplied by
sigma_nplus, otherwise it is multiplied by sigma_nminus. - "grad-norm" - at least sigma_target gradients should have lower norm than at the un-perturbed point. - "value" - at least sigma_target values (losses) should be lower than at the un-perturbed point. - None - doesn't use adaptive sigma. This introduces a side-effect to the closure, so it should be left at None if you use a trust region or line search to optimize the modified objective. Defaults to None.
-
sigma_target(int, default:0.2) –number of elements to satisfy the condition in
sigma_strategy. Defaults to 1. -
sigma_nplus(float, default:1.3333333333333333) –sigma multiplier when
sigma_strategycondition is satisfied. Defaults to 4/3. -
sigma_nminus(float, default:0.6666666666666666) –sigma multiplier when
sigma_strategycondition is not satisfied. Defaults to 2/3. -
seed(int | None, default:None) –seed. Defaults to None.
Source code in torchzero/modules/smoothing/sampling.py
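A minimal sketch of the modules=None form described above, where the smoothed gradient itself is used as the update; `model` and the sample count are assumptions:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and a closure is passed to opt.step;
# with modules=None the gradient of the modified objective becomes the update
opt = tz.Optimizer(
    model.parameters(),
    tz.m.GradientSampling(sigma=1.0, n=16),
    tz.m.LR(1e-2),
)
```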
Graft ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Outputs direction output rescaled to have the same norm as magnitude output.
Parameters:
-
direction(Chainable) –module to use the direction from
-
magnitude(Chainable) –module to use the magnitude from
-
tensorwise(bool, default:True) –whether to calculate norm per-tensor or globally. Defaults to True.
-
ord(float, default:2) –norm order. Defaults to 2.
-
eps(float, default:1e-06) –clips denominator to be no less than this value. Defaults to 1e-6.
-
strength(float, default:1) –strength of grafting. Defaults to 1.
Example:¶
Shampoo grafted to Adam
```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.GraftModules(
        direction = tz.m.Shampoo(),
        magnitude = tz.m.Adam(),
    ),
    tz.m.LR(1e-3)
)
```
Source code in torchzero/modules/ops/multi.py
GraftGradToUpdate ¶
Bases: torchzero.core.transform.TensorTransform
Outputs gradient grafted to update, that is gradient rescaled to have the same norm as the update.
Source code in torchzero/modules/misc/misc.py
GraftInputToOutput ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs tensors rescaled to have the same norm as magnitude(tensors).
Source code in torchzero/modules/ops/binary.py
GraftOutputToInput ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs magnitude(tensors) rescaled to have the same norm as tensors
Source code in torchzero/modules/ops/binary.py
GraftToGrad ¶
Bases: torchzero.core.transform.TensorTransform
Grafts update to the gradient, that is update is rescaled to have the same norm as the gradient.
Source code in torchzero/modules/misc/misc.py
GraftToParams ¶
Bases: torchzero.core.transform.TensorTransform
Grafts update to the parameters, that is update is rescaled to have the same norm as the parameters, but no smaller than eps.
Source code in torchzero/modules/misc/misc.py
GramSchimdt ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs tensors made orthogonal to other(tensors) via Gram-Schmidt.
Source code in torchzero/modules/ops/binary.py
Greenstadt1 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Greenstadt's first Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Greenstadt2 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Greenstadt's second Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
HagerZhang ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Hager-Zhang nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
HeavyBall ¶
Bases: torchzero.modules.momentum.momentum.EMA
Polyak's momentum (heavy-ball method).
Parameters:
-
momentum(float, default:0.9) –momentum (beta). Defaults to 0.9.
-
dampening(float, default:0) –momentum dampening. Defaults to 0.
-
debias(bool, default:False) –whether to debias the EMA like in Adam. Defaults to False.
-
lerp(bool, default:False) –whether to use linear interpolation, if True, this becomes exponential moving average. Defaults to False.
-
ema_init(str, default:'update') –initial values for the EMA, "zeros" or "update".
-
target(Target) –target to apply EMA to. Defaults to 'update'.
Source code in torchzero/modules/momentum/momentum.py
HestenesStiefel ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Hestenes–Stiefel nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
Horisho ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Hoshino's variable metric Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394
Source code in torchzero/modules/quasi_newton/quasi_newton.py
HpuEstimate ¶
Bases: torchzero.core.transform.TensorTransform
Returns y/||s||, where y is the difference between current and previous update (gradient), and s is the difference between current and previous parameters. The returned tensors are a finite difference approximation to the Hessian times the previous update.
Source code in torchzero/modules/misc/misc.py
ICUM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods due to only updating one column of the inverse hessian approximation per step.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Lopes, V. L., & Martínez, J. M. (1995). Convergence properties of the inverse column-updating method. Optimization Methods & Software, 6(2), 127–144. from https://www.ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-1993-76.pdf
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Identity ¶
Bases: torchzero.core.module.Module
Identity operator that is argument-insensitive. This also can be used as identity hessian for trust region methods.
Source code in torchzero/modules/ops/utility.py
ImprovedNewton ¶
Bases: torchzero.core.transform.Transform
Improved Newton's Method (INM).
Source code in torchzero/modules/second_order/inm.py
IntermoduleCautious ¶
Bases: torchzero.core.module.Module
Negates update of the :code:main module where its sign doesn't match the output of the compare module.
Parameters:
-
main(Chainable) –main module or sequence of modules whose update will be cautioned.
-
compare(Chainable) –modules or sequence of modules to compare the sign to.
-
normalize(bool, default:False) –renormalize update after masking. Defaults to False.
-
eps(float, default:1e-06) –epsilon for normalization. Defaults to 1e-6.
-
mode(str, default:'zero') –what to do with updates with inconsistent signs. - "zero" - set them to zero (as in paper) - "grad" - set them to the gradient (same as using update magnitude and gradient sign) - "backtrack" - negate them
Source code in torchzero/modules/momentum/cautious.py
InverseFreeNewton ¶
Bases: torchzero.core.transform.Transform
Inverse-free Newton's method.
Source code in torchzero/modules/second_order/ifn.py
LBFGS ¶
Bases: torchzero.core.transform.TensorTransform
Limited-memory BFGS algorithm. A line search or trust region is recommended.
Parameters:
-
history_size(int, default:10) –number of past parameter differences and gradient differences to store. Defaults to 10.
-
ptol(float | None, default:1e-32) –skips updating the history if maximum absolute value of parameter difference is less than this value. Defaults to 1e-32.
-
ptol_restart(bool, default:False) –If true, whenever parameter difference is less than
ptol, L-BFGS state will be reset. Defaults to False. -
gtol(float | None, default:1e-32) –skips updating the history if maximum absolute value of gradient difference is less than this value. Defaults to 1e-32.
-
ptol_restart(bool, default:False) –If true, whenever gradient difference is less than
gtol, L-BFGS state will be reset. Defaults to False. -
sy_tol(float | None, default:1e-32) –history will not be updated whenever s⋅y is less than this value (negative s⋅y means negative curvature)
-
scale_first(bool, default:True) –makes first step, when hessian approximation is not available, small to reduce number of line search iterations. Defaults to True.
-
update_freq(int, default:1) –how often to update L-BFGS history. Larger values may be better for stochastic optimization. Defaults to 1.
-
damping(Union, default:None) –damping to use, can be "powell" or "double". Defaults to None.
-
inner(Chainable | None, default:None) –optional inner modules applied after updating L-BFGS history and before preconditioning. Defaults to None.
Examples:¶
L-BFGS with line search
L-BFGS with trust region
Source code in torchzero/modules/quasi_newton/lbfgs.py
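A sketch of the "L-BFGS with line search" setup named above, following the Backtracking pattern from the GradientCorrection example on this page; `model` is assumed to exist:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and a closure is passed to opt.step
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.Backtracking(),  # line search, as recommended above
)
```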
LR ¶
Bases: torchzero.core.transform.TensorTransform
Learning rate. Adding this module also adds support for LR schedulers.
Source code in torchzero/modules/step_size/lr.py
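A minimal placement sketch, assuming an existing `model`; LR goes at the end of the chain, the pattern used throughout this page:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.LR(1e-3),  # scales the final update; adding this module enables LR scheduler support
)
```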
LSR1 ¶
Bases: torchzero.core.transform.TensorTransform
Limited-memory SR1 algorithm. A line search or trust region is recommended.
Parameters:
-
history_size(int, default:10) –number of past parameter differences and gradient differences to store. Defaults to 10.
-
ptol(float | None, default:None) –skips updating the history if maximum absolute value of parameter difference is less than this value. Defaults to None.
-
ptol_restart(bool, default:False) –If true, whenever parameter difference is less than
ptol, L-SR1 state will be reset. Defaults to False. -
gtol(float | None, default:None) –skips updating the history if maximum absolute value of gradient difference is less than this value. Defaults to None.
-
ptol_restart(bool, default:False) –If true, whenever gradient difference is less than
gtol, L-SR1 state will be reset. Defaults to False. -
scale_first(bool, default:False) –makes first step, when hessian approximation is not available, small to reduce number of line search iterations. Defaults to False.
-
update_freq(int, default:1) –how often to update L-SR1 history. Larger values may be better for stochastic optimization. Defaults to 1.
-
damping(Union, default:None) –damping to use, can be "powell" or "double". Defaults to None.
-
compact(bool) –if True, uses a compact representation version of L-SR1. It is much faster computationally, but less stable.
-
inner(Chainable | None, default:None) –optional inner modules applied after updating L-SR1 history and before preconditioning. Defaults to None.
Examples:¶
L-SR1 with line search
L-SR1 with trust region
Source code in torchzero/modules/quasi_newton/lsr1.py
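A sketch of the "L-SR1 with line search" setup named above, mirroring the L-BFGS sketch earlier; `model` is assumed to exist:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and a closure is passed to opt.step
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LSR1(),
    tz.m.Backtracking(),  # line search, as recommended above
)
```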
LambdaHomotopy ¶
Bases: torchzero.modules.misc.homotopy.HomotopyBase
Source code in torchzero/modules/misc/homotopy.py
LaplacianSmoothing ¶
Bases: torchzero.core.transform.TensorTransform
Applies laplacian smoothing via a fast Fourier transform solver which can improve generalization.
Parameters:
-
sigma(float, default:1) –controls the amount of smoothing. Defaults to 1.
-
layerwise(bool, default:True) –If True, applies smoothing to each parameter's gradient separately, Otherwise applies it to all gradients, concatenated into a single vector. Defaults to True.
-
min_numel(int, default:4) –minimum number of elements in a parameter to apply laplacian smoothing to. Only has effect if
layerwiseis True. Defaults to 4. -
target(str) –what to set on var.
Examples: Laplacian Smoothing Gradient Descent optimizer as in the paper
Reference
Osher, S., Wang, B., Yin, P., Luo, X., Barekat, F., Pham, M., & Lin, A. (2022). Laplacian smoothing gradient descent. Research in the Mathematical Sciences, 9(3), 55.
Source code in torchzero/modules/smoothing/laplacian.py
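A hedged sketch of the Laplacian Smoothing Gradient Descent setup named above: smoothing applied to raw gradients followed by a step size; `model` and the values are assumptions:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LaplacianSmoothing(sigma=1),
    tz.m.LR(1e-2),
)
```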
LastAbsoluteRatio ¶
Bases: torchzero.core.transform.TensorTransform
Outputs the ratio between absolute values of the past two updates; the numerator is determined by the numerator argument.
Source code in torchzero/modules/misc/misc.py
LastDifference ¶
Bases: torchzero.core.transform.TensorTransform
Outputs difference between past two updates.
Source code in torchzero/modules/misc/misc.py
LastGradDifference ¶
Bases: torchzero.core.module.Module
Outputs difference between past two gradients.
Source code in torchzero/modules/misc/misc.py
LastProduct ¶
Bases: torchzero.core.transform.TensorTransform
Outputs product of past two updates.
Source code in torchzero/modules/misc/misc.py
LastRatio ¶
Bases: torchzero.core.transform.TensorTransform
Outputs ratio between past two updates, the numerator is determined by numerator argument.
Source code in torchzero/modules/misc/misc.py
LerpModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Does a linear interpolation of input(tensors) and end(tensors) based on a scalar weight.
The output is given by output = input(tensors) + weight * (end(tensors) - input(tensors))
Source code in torchzero/modules/ops/multi.py
LevenbergMarquardt ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Levenberg-Marquardt trust region algorithm.
Parameters:
-
hess_module(Module | None) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton and tz.m.GaussNewton. When using quasi-newton methods, set inverse=False when constructing them. -
y(float, default:0) –when
y=0, identity matrix is added to hessian, when y=1, diagonal of the hessian approximation is added. Values in between interpolate. This should only be used with Gauss-Newton. Defaults to 0. -
eta(float, default:0.0) –if ratio of actual to predicted reduction is larger than this, step is accepted. When
hess_module is Newton or GaussNewton, this can be set to 0. Defaults to 0. -
nplus(float, default:3.5) –increase factor on successful steps. Defaults to 3.5.
-
nminus(float, default:0.25) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good(float, default:0.99) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus. -
rho_bad(float, default:0.0001) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus. -
init(float, default:1) –Initial trust region value. Defaults to 1.
-
update_freq(int, default:1) –frequency of updating the hessian. Defaults to 1.
-
max_attempts(max_attempts, default:10) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
adaptive(bool, default:False) –if True, trust radius is multiplied by square root of gradient norm.
-
fallback(bool, default:False) –if
True, when hess_module maintains a hessian inverse which can't be inverted efficiently, it will be inverted anyway. When False (default), a RuntimeError will be raised instead. -
inner(Chainable | None, default:None) –preconditioning is applied to output of this module. Defaults to None.
Examples:¶
Gauss-Newton with Levenberg-Marquardt trust-region
LM-SR1
Source code in torchzero/modules/trust_region/levenberg_marquardt.py
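A sketch of the "Gauss-Newton with Levenberg-Marquardt trust-region" setup named above, assuming the hessian module is the first positional argument as in the other trust region examples, and a closure that returns residuals as required by GaussNewton:
```python
import torchzero as tz

# assumes `model` is an existing torch.nn.Module and the closure returns
# a vector of residuals, as required by tz.m.GaussNewton
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.GaussNewton()),
)
```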
LineSearchBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for line searches.
This is an abstract class, to use it, subclass it and override search.
Parameters:
-
defaults(dict[str, Any] | None) –dictionary with defaults.
-
maxiter(int | None, default:None) –if this is specified, the search method will terminate upon evaluating the objective this many times, and step size with the lowest loss value will be used. This is useful when passing
make_objectiveto an external library which doesn't have a maxiter option. Defaults to None.
Other useful methods
- evaluate_f - returns loss with a given scalar step size
- evaluate_f_d - returns loss and directional derivative with a given scalar step size
- make_objective - creates a function that accepts a scalar step size and returns loss. This can be passed to a scalar solver, such as scipy.optimize.minimize_scalar.
- make_objective_with_derivative - creates a function that accepts a scalar step size and returns a tuple with loss and directional derivative. This can be passed to a scalar solver.
Examples:
Basic line search¶
This evaluates all step sizes in a range by using the self.evaluate_step_size method.
class GridLineSearch(LineSearch):
    def __init__(self, start, end, num):
        defaults = dict(start=start, end=end, num=num)
        super().__init__(defaults)

    @torch.no_grad
    def search(self, update, var):
        start = self.defaults["start"]
        end = self.defaults["end"]
        num = self.defaults["num"]

        lowest_loss = float("inf")
        best_step_size = 0.0

        for step_size in torch.linspace(start, end, num):
            loss = self.evaluate_step_size(step_size.item(), var=var, backward=False)
            if loss < lowest_loss:
                lowest_loss = loss
                best_step_size = step_size.item()

        return best_step_size
Using external solver via self.make_objective¶
Here we let the scipy.optimize.minimize_scalar solver find the best step size via self.make_objective
import scipy.optimize

class ScipyMinimizeScalar(LineSearch):
    def __init__(self, method: str | None = None):
        defaults = dict(method=method)
        super().__init__(defaults)

    @torch.no_grad
    def search(self, update, var):
        objective = self.make_objective(var=var)
        method = self.defaults["method"]
        res = scipy.optimize.minimize_scalar(objective, method=method)
        return res.x
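The custom line search can then be placed after a module that produces a search direction, for example (a sketch; as in the other examples, model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.PolakRibiere(),
    ScipyMinimizeScalar(method="brent"),
)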
Methods:
-
evaluate_f–evaluate function value at alpha
step_size. -
evaluate_f_d–evaluate function value and directional derivative in the direction of the update at step size
step_size. -
evaluate_f_d_g–evaluate function value, directional derivative, and gradient list at step size
step_size. -
search–Finds the step size to use
Source code in torchzero/modules/line_search/line_search.py
evaluate_f ¶
evaluate function value at alpha step_size.
Source code in torchzero/modules/line_search/line_search.py
evaluate_f_d ¶
evaluate function value and directional derivative in the direction of the update at step size step_size.
Source code in torchzero/modules/line_search/line_search.py
evaluate_f_d_g ¶
evaluate function value, directional derivative, and gradient list at step size step_size.
Source code in torchzero/modules/line_search/line_search.py
Lion ¶
Bases: torchzero.core.transform.TensorTransform
Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.
Parameters:
-
beta1(float, default:0.9) –dampening for momentum. Defaults to 0.9.
-
beta2(float, default:0.99) –momentum factor. Defaults to 0.99.
Source code in torchzero/modules/adaptive/lion.py
LiuStorey ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Liu-Storey nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
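For example (a sketch; model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LiuStorey(),
    tz.m.StrongWolfe(c2=0.1, a_init="first-order"),
)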
Source code in torchzero/modules/conjugate_gradient/cg.py
LogHomotopy ¶
MARSCorrection ¶
Bases: torchzero.core.transform.TensorTransform
MARS variance reduction correction.
Place any other momentum-based optimizer after this,
and make sure this module's beta parameter matches the momentum (beta) used in that optimizer.
Parameters:
-
beta(float, default:0.9) –use the same beta as you use in the momentum module. Defaults to 0.9.
-
scaling(float, default:0.025) –controls the scale of gradient correction in variance reduction. Defaults to 0.025.
-
max_norm(float, default:1) –clips norm of corrected gradients, None to disable. Defaults to 1.
Examples:¶
Mars-AdamW
optimizer = tz.Optimizer(
model.parameters(),
tz.m.MARSCorrection(beta=0.95),
tz.m.Adam(beta1=0.95, beta2=0.99),
tz.m.WeightDecay(1e-3),
tz.m.LR(0.1)
)
Mars-Lion
optimizer = tz.Optimizer(
model.parameters(),
tz.m.MARSCorrection(beta=0.9),
tz.m.Lion(beta1=0.9),
tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/mars.py
MSAM ¶
Bases: torchzero.core.transform.Transform
Momentum-SAM from https://arxiv.org/pdf/2401.12033.
Note
Please make sure to place tz.m.LR inside the modules argument. For example,
tz.m.MSAMObjective([tz.m.Adam(), tz.m.LR(1e-3)]). Putting LR after MSAM will lead
to an incorrect update rule.
Parameters:
-
modules(Chainable) –modules that will optimize the MSAM objective. Make sure
tz.m.LRis one of them. -
momentum(float, default:0.9) –momentum (beta). Defaults to 0.9.
-
rho(float, default:0.3) –perturbation strength. Defaults to 0.3.
-
nesterov(bool, default:False) –whether to use nesterov momentum formula. Defaults to False.
-
lerp(bool, default:False) –whether to use linear interpolation, if True, MSAM momentum becomes similar to exponential moving average. Defaults to False.
Examples: AdamW-MSAM
opt = tz.Optimizer(
bench.parameters(),
tz.m.MSAMObjective(
[tz.m.Adam(), tz.m.WeightDecay(1e-3), tz.m.LR(1e-3)],
rho=1.
)
)
Source code in torchzero/modules/adaptive/msam.py
MSAMMomentum ¶
Bases: torchzero.core.transform.TensorTransform
Momentum-SAM from https://arxiv.org/pdf/2401.12033.
This implementation expresses the update rule as a function of the gradient, so it can be used as a drop-in replacement for momentum strategies in other optimizers.
To combine MSAM with other optimizers in the way done in the official implementation,
e.g. to make Adam_MSAM, use tz.m.MSAMObjective module.
Note
MSAM has a learning rate hyperparameter that can't really be removed from the update rule.
To avoid compounding learning rate modifications, remove the tz.m.LR module if you had it.
Parameters:
-
lr(float) –learning rate. Adding this module adds support for learning rate schedulers.
-
momentum(float, default:0.9) –momentum (beta). Defaults to 0.9.
-
rho(float, default:0.3) –perturbation strength. Defaults to 0.3.
-
weight_decay(float, default:0) –weight decay. It is applied to perturbed parameters, so it is different from applying
tz.m.WeightDecay after MSAM. Defaults to 0. -
nesterov(bool, default:False) –whether to use nesterov momentum formula. Defaults to False.
-
lerp(bool, default:False) –whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.
Examples:¶
MSAM
Adam with MSAM instead of exponential average. Note that this is different from Adam_MSAM.
To make Adam_MSAM and such, use the tz.m.MSAMObjective module.
opt = tz.Optimizer(
model.parameters(),
tz.m.RMSprop(0.999, inner=tz.m.MSAM(1e-3)),
tz.m.Debias(0.9, 0.999),
)
Source code in torchzero/modules/adaptive/msam.py
MatrixMomentum ¶
Bases: torchzero.core.transform.Transform
Second order momentum method.
Matrix momentum is useful for convex objectives; it also, for some reason, generalizes very well on elastic net logistic regression.
Notes
-
muneeds to be tuned very carefully. It is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable. I have devised an adaptive version of this -tz.m.AdaptiveMatrixMomentum, and it works well without having to tunemu, however the adaptive version doesn't work on stochastic objectives. -
In most cases
MatrixMomentumshould be the first module in the chain because it relies on autograd. -
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backwardargument.
Parameters:
-
mu(float, default:0.1) –this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.
-
hvp_method(str, default:'autograd') –Determines how hessian-vector products are computed.
"batched_autograd"- uses autograd with batched hessian-vector products. If a single hessian-vector is evaluated, equivalent to"autograd". Faster than"autograd"but uses more memory."autograd"- uses autograd hessian-vector products. If multiple hessian-vector products are evaluated, uses a for-loop. Slower than"batched_autograd"but uses less memory."fd_forward"- uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product."fd_central"- uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
Defaults to
"autograd". -
h(float, default:0.001) –The step size for finite difference if
hvp_methodis"fd_forward"or"fd_central". Defaults to 1e-3. -
hvp_tfm(Chainable | None) –optional module applied to hessian-vector products. Defaults to None.
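A usage sketch with the kind of closure described in the notes (model, X, y and loss_fn are placeholders; refer to the documentation for the exact closure contract):
opt = tz.Optimizer(model.parameters(), tz.m.MatrixMomentum(mu=0.01), tz.m.LR(1e-2))

def closure(backward=True):
    loss = loss_fn(model(X), y)
    if backward:
        opt.zero_grad()
        loss.backward()
    return loss

opt.step(closure)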
Reference
Orr, Genevieve, and Todd Leen. "Using curvature information for fast stochastic search." Advances in neural information processing systems 9 (1996).
Source code in torchzero/modules/adaptive/matrix_momentum.py
Maximum ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs maximum(tensors, other(tensors))
Source code in torchzero/modules/ops/binary.py
MaximumModules ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs elementwise maximum of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
McCormick ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
McCormick's Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
This is "Algorithm 2", attributed to McCormick in this paper. However for some reason this method is also called Pearson's 2nd method in other sources.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
MeZO ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h(float, default:0.001) –finite difference step size if jvp_method is set to
forward or central. Defaults to 1e-3. -
n_samples(int, default:1) –number of random gradient samples. Defaults to 1.
-
formula(Literal, default:'central2') –finite difference formula. Defaults to 'central2'.
-
distribution(Literal, default:'rademacher') –distribution. Defaults to "rademacher". If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
-
target(Literal, default:'closure') –what to set on var. Defaults to "closure".
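A minimal usage sketch (model, X, y and loss_fn are placeholders); MeZO estimates gradients from forward passes, but the closure must still accept the backward argument:
opt = tz.Optimizer(model.parameters(), tz.m.MeZO(n_samples=4), tz.m.LR(1e-3))

def closure(backward=True):
    return loss_fn(model(X), y)

opt.step(closure)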
References
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
Source code in torchzero/modules/grad_approximation/rfdm.py
Mean ¶
Bases: torchzero.modules.ops.reduce.Sum
Outputs a mean of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
MedianAveraging ¶
Bases: torchzero.core.transform.TensorTransform
Median of past history_size updates.
Parameters:
-
history_size(int) –Number of past updates to average
-
target(Target) –target. Defaults to 'update'.
Source code in torchzero/modules/momentum/averaging.py
Minimum ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs minimum(tensors, other(tensors))
Source code in torchzero/modules/ops/binary.py
MinimumModules ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs elementwise minimum of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
Mul ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Multiply tensors by other. other can be a number or a module.
If other is a module, this calculates tensors * other(tensors)
Source code in torchzero/modules/ops/binary.py
MulByLoss ¶
Bases: torchzero.core.transform.TensorTransform
Multiplies update by loss times alpha
Source code in torchzero/modules/misc/misc.py
MultiOperationBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for operations that use operands. This is an abstract class, subclass it and override transform method to use it.
Methods:
-
transform–applies the operation to operands
Source code in torchzero/modules/ops/multi.py
transform ¶
Multistep ¶
Bases: torchzero.core.module.Module
Performs steps inner steps with module on each step.
The update is taken to be the parameter difference between parameters before and after the inner loop.
Source code in torchzero/modules/misc/multistep.py
MuonAdjustLR ¶
Bases: torchzero.core.transform.Transform
LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
Orthogonalize already has this built in via the adjust_lr setting; however, you might want to move this adjustment later in the chain.
Source code in torchzero/modules/adaptive/muon.py
NAG ¶
Bases: torchzero.core.transform.TensorTransform
Nesterov accelerated gradient method (nesterov momentum).
Parameters:
-
momentum(float, default:0.9) –momentum (beta). Defaults to 0.9.
-
dampening(float, default:0) –momentum dampening. Defaults to 0.
-
lerp(bool, default:False) –whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.
-
target(Target) –target to apply EMA to. Defaults to 'update'.
Source code in torchzero/modules/momentum/momentum.py
NanToNum ¶
Bases: torchzero.core.transform.TensorTransform
Convert nan, inf and -inf to numbers.
Parameters:
-
nan(optional, default:None) –the value to replace NaNs with. Default is zero.
-
posinf(optional, default:None) –if a Number, the value to replace positive infinity values with. If None, positive infinity values are replaced with the greatest finite value representable by input's dtype. Default is None.
-
neginf(optional, default:None) –if a Number, the value to replace negative infinity values with. If None, negative infinity values are replaced with the lowest finite value representable by input's dtype. Default is None.
Source code in torchzero/modules/ops/unary.py
NaturalGradient ¶
Bases: torchzero.core.transform.Transform
Natural gradient approximated via empirical fisher information matrix.
To use this, either pass a vector of per-sample losses to the step method, or make sure
the closure returns it. Gradients will be calculated via batched autograd within this module,
so you don't need to implement the backward pass. When using a closure, please add the backward argument;
it will always be False but it is required. See below for an example.
Note
Empirical fisher information matrix may give a really bad approximation in some cases.
If that is the case, set sqrt to True to perform whitening instead, which is way more robust.
Parameters:
-
reg(float, default:1e-08) –regularization parameter. Defaults to 1e-8.
-
sqrt(bool, default:False) –if True, uses square root of empirical fisher information matrix. Both EFIM and its square root can be calculated and stored efficiently without ndim^2 memory. Square root whitens the gradient and often performs much better, especially when you try to use NGD with a vector that isn't strictly per-sample gradients, but rather for example different losses.
-
gn_grad(bool, default:False) –if True, uses Gauss-Newton G^T @ f as the gradient, which is effectively sum weighted by value and is equivalent to squaring the values. That makes the kernel trick solver incorrect, but for some reason it still works. If False, uses sum of per-sample gradients. This has an effect when
sqrt=False, and affects thegradattribute. Defaults to False. -
batched(bool, default:True) –whether to use vmapping. Defaults to True.
Examples:
training a neural network:
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))

opt = tz.Optimizer(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)

for i in range(100):
    y_hat = model(X) # (64, 10)
    losses = (y_hat - y).pow(2).mean(0) # (10, )
    opt.step(loss=losses)

    if i % 10 == 0:
        print(f'{losses.mean() = }')
training a neural network - closure version
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))

opt = tz.Optimizer(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)

def closure(backward=True):
    y_hat = model(X) # (64, 10)
    return (y_hat - y).pow(2).mean(0) # (10, )

for i in range(100):
    losses = opt.step(closure)
    if i % 10 == 0:
        print(f'{losses.mean() = }')
minimizing the rosenbrock function with a mix of natural gradient, whitening and gauss-newton:
def rosenbrock(X):
    x1, x2 = X
    return torch.stack([(1 - x1).abs(), (10 * (x2 - x1**2).abs())])

X = torch.tensor([-1.1, 2.5], requires_grad=True)
opt = tz.Optimizer([X], tz.m.NaturalGradient(sqrt=True, gn_grad=True), tz.m.LR(0.05))

for iter in range(200):
    losses = rosenbrock(X)
    opt.step(loss=losses)

    if iter % 20 == 0:
        print(f'{losses.mean() = }')
Source code in torchzero/modules/adaptive/natural_gradient.py
Negate ¶
Bases: torchzero.core.transform.TensorTransform
Returns - input
Source code in torchzero/modules/ops/unary.py
NegateOnLossIncrease ¶
Bases: torchzero.core.module.Module
Uses an extra forward pass to evaluate the loss at parameters+update;
if the loss is larger than at parameters,
the update is set to 0 if backtrack=False, and to -update otherwise.
Source code in torchzero/modules/misc/multistep.py
NewDQN ¶
Bases: torchzero.modules.quasi_newton.diagonal_quasi_newton.DNRTR
Diagonal quasi-newton method.
Reference
Nosrati, Mahsa, and Keyvan Amini. "A new diagonal quasi-Newton algorithm for unconstrained optimization problems." Applications of Mathematics 69.4 (2024): 501-512.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
NewSSM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Self-scaling Quasi-Newton method.
Note
a line search such as tz.m.StrongWolfe() is required.
Warning
this uses roughly O(N^2) memory.
Reference
Moghrabi, I. A., Hassan, B. A., & Askar, A. (2022). New self-scaling quasi-newton methods for unconstrained optimization. Int. J. Math. Comput. Sci., 17, 1061U.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Newton ¶
Bases: torchzero.core.transform.Transform
Exact Newton's method via autograd.
Newton's method produces a direction that jumps to the stationary point of the quadratic approximation of the target function.
The update rule is given by (H + yI)⁻¹g, where H is the hessian, g is the gradient, and y is the damping parameter.
g can be the output of another module if it is specified via the inner argument.
Note
In most cases Newton should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply Newton preconditioning to another module's output.
Note
This module requires a closure to be passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating the hessian.
The closure must accept a backward argument (refer to documentation).
Parameters:
-
damping(float, default:0) –tikhonov regularizer value. Defaults to 0.
-
eigval_fn(Callable | None, default:None) –function to apply to eigenvalues, for example
torch.absorlambda L: torch.clip(L, min=1e-8). If this is specified, eigendecomposition will be used to invert the hessian. -
update_freq(int, default:1) –updates hessian every
update_freqsteps. -
precompute_inverse(bool, default:None) –if
True, whenever hessian is computed, also computes the inverse. This is more efficient whenupdate_freqis large. IfNone, this isTrueifupdate_freq >= 10. -
use_lstsq((bool, Optional), default:False) –if True, least squares will be used to solve the linear system, this can prevent it from exploding when hessian is indefinite. If False, tries cholesky, if it fails tries LU, and then least squares. If
eigval_fnis specified, eigendecomposition is always used and this argument is ignored. -
hessian_method(str, default:'batched_autograd') –Determines how hessian is computed.
"batched_autograd"- uses autograd to computendimbatched hessian-vector products. Faster than"autograd"but uses more memory."autograd"- uses autograd to computendimhessian-vector products using for loop. Slower than"batched_autograd"but uses less memory."functional_revrev"- usestorch.autograd.functionalwith "reverse-over-reverse" strategy and a for-loop. This is generally equivalent to"autograd"."functional_fwdrev"- usestorch.autograd.functionalwith vectorized "forward-over-reverse" strategy. Faster than"functional_fwdrev"but uses more memory ("batched_autograd"seems to be faster)"func"- usestorch.func.hessianwhich uses "forward-over-reverse" strategy. This method is the fastest and is recommended, however it is more restrictive and fails with some operators which is why it isn't the default."gfd_forward"- computesndimhessian-vector products via gradient finite difference using a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product."gfd_central"- computesndimhessian-vector products via gradient finite difference using a more accurate central formula which requires two gradient evaluations per hessian-vector product."fd"- uses function values to estimate gradient and hessian via finite difference. This uses less evaluations than chaining"gfd_*"aftertz.m.FDM."thoad"- usesthoadlibrary, can be significantly faster than pytorch but limited operator coverage.
Defaults to
"batched_autograd". -
h(float, default:0.001) –finite difference step size if hessian is compute via finite-difference.
-
inner(Chainable | None, default:None) –modules to apply hessian preconditioner to. Defaults to None.
See also¶
- tz.m.NewtonCG: uses a matrix-free conjugate gradient solver and hessian-vector products. Useful for large scale problems as it doesn't form the full hessian.
- tz.m.NewtonCGSteihaug: trust region version of tz.m.NewtonCG.
- tz.m.ImprovedNewton: Newton with additional rank one correction to the hessian, can be faster than Newton.
- tz.m.InverseFreeNewton: an inverse-free variant of Newton's method.
- tz.m.quasi_newton: large collection of quasi-newton methods that estimate the hessian.
Notes¶
Implementation details¶
(H + yI)⁻¹g is calculated by solving the linear system (H + yI)x = g.
The linear system is solved via cholesky decomposition, if that fails, LU decomposition, and if that fails, least squares. Least squares can be forced by setting use_lstsq=True.
Additionally, if eigval_fn is specified, eigendecomposition of the hessian is computed,
eigval_fn is applied to the eigenvalues, and (H + yI)⁻¹ is computed using the computed eigenvectors and transformed eigenvalues. This is generally more computationally expensive, but not by much.
Handling non-convexity¶
Standard Newton's method does not handle non-convexity well without some modifications. This is because it jumps to the stationary point, which may be a maximum of the quadratic approximation.
A modification to handle non-convexity is to modify the eigenvalues to be positive,
for example by setting eigval_fn = lambda L: L.abs().clip(min=1e-4).
Examples:¶
Newton's method with backtracking line search
Newton's method for non-convex optimization.
opt = tz.Optimizer(
model.parameters(),
tz.m.Newton(eigval_fn = lambda L: L.abs().clip(min=1e-4)),
tz.m.Backtracking()
)
Newton preconditioning applied to momentum
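A sketch of this example using the inner argument (model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Newton(inner=tz.m.EMA(0.9)),
    tz.m.Backtracking(),
)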
Source code in torchzero/modules/second_order/newton.py
NewtonCG ¶
Bases: torchzero.core.transform.Transform
Newton's method with a matrix-free conjugate gradient or minimal-residual solver.
Notes
-
In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the
innerargument if you wish to apply Newton preconditioning to another module's output. -
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backwardargument (refer to documentation).
Warning
CG may fail if hessian is not positive-definite.
Parameters:
-
maxiter(int | None, default:None) –Maximum number of iterations for the conjugate gradient solver. By default, this is set to the number of dimensions in the objective function, which is the theoretical upper bound for CG convergence. Setting this to a smaller value (truncated Newton) can still generate good search directions. Defaults to None.
-
tol(float, default:1e-08) –Relative tolerance for the conjugate gradient solver to determine convergence. Defaults to 1e-8.
-
reg(float, default:1e-08) –Regularization parameter (damping) added to the Hessian diagonal. This helps ensure the system is positive-definite. Defaults to 1e-8.
-
hvp_method(str, default:'autograd') –Determines how Hessian-vector products are evaluated.
"autograd"- uses autograd hessian-vector products. If multiple hessian-vector products are evaluated, uses a for-loop."fd_forward"- uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product."fd_central"- uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
For NewtonCG
"batched_autograd"is equivalent to"autograd". Defaults to"autograd". -
h(float, default:0.001) –The step size for finite difference if
hvp_methodis"fd_forward"or"fd_central". Defaults to 1e-3. -
warm_start(bool, default:False) –If
True, the conjugate gradient solver is initialized with the solution from the previous optimization step. This can accelerate convergence, especially in truncated Newton methods. Defaults to False. -
inner(Chainable | None, default:None) –NewtonCG will attempt to apply preconditioning to the output of this module.
Examples: Newton-CG with a backtracking line search:
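A sketch (model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.NewtonCG(),
    tz.m.Backtracking(),
)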
Truncated Newton method (useful for large-scale problems):
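A sketch of the truncated variant, capping the number of CG iterations via maxiter:
opt = tz.Optimizer(
    model.parameters(),
    tz.m.NewtonCG(maxiter=20),
    tz.m.Backtracking(),
)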
Source code in torchzero/modules/second_order/newton_cg.py
NewtonCGSteihaug ¶
Bases: torchzero.core.transform.Transform
Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.
Notes
-
In most cases NewtonCGSteihaug should be the first module in the chain because it relies on autograd. Use the
innerargument if you wish to apply Newton preconditioning to another module's output. -
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backwardargument (refer to documentation).
Parameters:
-
eta(float, default:0.0) –if ratio of actual to predicted reduction is larger than this, step is accepted. Defaults to 0.0.
-
nplus(float, default:3.5) –increase factor on successful steps. Defaults to 3.5.
-
nminus(float, default:0.25) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good(float, default:0.99) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus. -
rho_bad(float, default:0.0001) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus. -
init(float, default:1) –Initial trust region value. Defaults to 1.
-
max_attempts(int, default:100) –maximum number of trust radius reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 100.
-
max_history(int, default:100) –CG will store this many intermediate solutions, reusing them when trust radius is reduced instead of re-running CG. Each solution storage requires 2N memory. Defaults to 100.
-
boundary_tol(float | None, default:1e-06) –The trust region only increases when suggested step's norm is at least
(1-boundary_tol)*trust_region. This prevents increasing trust region when solution is not on the boundary. Defaults to 1e-6. -
maxiter(int | None, default:None) –maximum number of CG iterations per step. Each iteration requires one backward pass if
hvp_method="fd_forward", two otherwise. Defaults to None. -
miniter(int, default:1) –minimal number of CG iterations. This prevents making no progress when the initial guess is below tolerance. Defaults to 1.
-
tol(float, default:1e-08) –terminates CG when norm of the residual is less than this value. Defaults to 1e-8. -
-
reg(float, default:1e-08) –hessian regularization. Defaults to 1e-8.
-
solver(str, default:'cg') –solver, "cg" or "minres". "cg" is recommended. Defaults to 'cg'.
-
adapt_tol(bool, default:False) –if True, whenever trust radius collapses to smallest representable number, the tolerance is multiplied by 0.1. Defaults to False.
-
npc_terminate(bool, default:False) –whether to terminate CG/MINRES whenever negative curvature is detected. Defaults to False.
-
hvp_method(str, default:'fd_central') –either
"fd_forward"to use forward formula which requires one backward pass per hessian-vector product, or"fd_central"to use a more accurate central formula which requires two backward passes."fd_forward"is usually accurate enough. Defaults to"fd_forward". -
h(float, default:0.001) –finite difference step size. Defaults to 1e-3.
-
inner(Chainable | None, default:None) –applies preconditioning to output of this module. Defaults to None.
Examples:¶
Trust-region Newton-CG:
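A sketch (model is any torch.nn.Module); being a trust region method, it chooses its own step length, so typically no line search or tz.m.LR module is needed:
opt = tz.Optimizer(
    model.parameters(),
    tz.m.NewtonCGSteihaug(),
)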
Reference:¶
Steihaug, Trond. "The conjugate gradient method and trust regions in large scale optimization." SIAM Journal on Numerical Analysis 20.3 (1983): 626-637.
Source code in torchzero/modules/second_order/newton_cg.py
NoiseSign ¶
Bases: torchzero.core.transform.TensorTransform
Outputs random tensors with sign copied from the update.
Source code in torchzero/modules/misc/misc.py
Noop ¶
Bases: torchzero.core.module.Module
Identity operator that is argument-insensitive. This also can be used as identity hessian for trust region methods.
Source code in torchzero/modules/ops/utility.py
Normalize ¶
Bases: torchzero.core.transform.TensorTransform
Normalizes the update.
Parameters:
-
norm_value(float, default:1) –desired norm value.
-
ord(float, default:2) –norm order. Defaults to 2.
-
dim(int | Sequence[int] | str | None, default:None) –calculates norm along those dimensions. If list/tuple, tensors are normalized along all dimensions in
dimthat they have. Can be set to "global" to normalize by global norm of all gradients concatenated to a vector. Defaults to None. -
inverse_dims(bool, default:False) –if True, the
dim argument is inverted, and all other dimensions are normalized. -
min_size(int, default:1) –minimal size of a dimension to normalize along it. Defaults to 1.
-
target(str) –what this affects.
Examples: Gradient normalization:
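A sketch: placing Normalize first normalizes the raw gradient before other modules see it (model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Normalize(),
    tz.m.LR(1e-2),
)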
Update normalization:
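A sketch: placing Normalize after another module normalizes that module's output instead:
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.Normalize(),
    tz.m.LR(1e-2),
)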
Source code in torchzero/modules/clipping/clipping.py
NormalizeByEMA ¶
Bases: torchzero.modules.clipping.ema_clipping.ClipNormByEMA
Sets norm of the update to be the same as the norm of an exponential moving average of past updates.
Parameters:
-
beta(float, default:0.99) –beta for the exponential moving average. Defaults to 0.99.
-
ord(float, default:2) –order of the norm. Defaults to 2.
-
eps(float) –epsilon for division. Defaults to 1e-6.
-
tensorwise(bool, default:True) –if True, norms are calculated parameter-wise, otherwise treats all parameters as single vector. Defaults to True.
-
max_ema_growth(float | None, default:1.5) –if specified, restricts how quickly exponential moving average norm can grow. The norm is allowed to grow by at most this value per step. Defaults to 1.5.
-
ema_init(str) –How to initialize exponential moving average on first step, "update" to use the first update or "zeros". Defaults to 'zeros'.
Source code in torchzero/modules/clipping/ema_clipping.py
NORMALIZE
class-attribute
¶
NystromPCG ¶
Bases: torchzero.core.transform.Transform
Newton's method with a Nyström-preconditioned conjugate gradient solver.
Notes
-
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backwardargument (refer to documentation). -
In most cases NystromPCG should be the first module in the chain because it relies on autograd. Use the
innerargument if you wish to apply Newton preconditioning to another module's output.
Parameters:
-
rank(int, default:100) –size of the sketch for preconditioning, this many hessian-vector products will be evaluated before running the conjugate gradient solver. Larger value improves the preconditioning and speeds up conjugate gradient.
-
maxiter(int | None, default:None) –maximum number of iterations. By default this is set to the number of dimensions in the objective function, which is supposed to be enough for conjugate gradient to have guaranteed convergence. Setting this to a small value can still generate good enough directions. Defaults to None.
-
tol(float, default:1e-08) –relative tolerance for conjugate gradient solver. Defaults to 1e-8.
-
reg(float, default:1e-06) –regularization parameter. Defaults to 1e-6.
-
hvp_method(str, default:'batched_autograd') –Determines how Hessian-vector products are computed.
"batched_autograd"- uses autograd with batched hessian-vector products to compute the preconditioner. Faster than"autograd"but uses more memory."autograd"- uses autograd hessian-vector products, uses a for loop to compute the preconditioner. Slower than"batched_autograd"but uses less memory."fd_forward"- uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product."fd_central"- uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
Defaults to
"autograd". -
h(float, default:0.001) –The step size for finite difference if
hvp_methodis"fd_forward"or"fd_central". Defaults to 1e-3. -
inner(Chainable | None, default:None) –modules to apply hessian preconditioner to. Defaults to None.
-
seed(int | None, default:None) –seed for random generator. Defaults to None.
Examples:
NystromPCG with backtracking line search
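A sketch (model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.NystromPCG(rank=100),
    tz.m.Backtracking(),
)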
Reference
Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820
Source code in torchzero/modules/second_order/nystrom.py
NystromSketchAndSolve ¶
Bases: torchzero.core.transform.Transform
Newton's method with a Nyström sketch-and-solve solver.
Notes
-
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backwardargument (refer to documentation). -
In most cases NystromSketchAndSolve should be the first module in the chain because it relies on autograd. Use the
innerargument if you wish to apply Newton preconditioning to another module's output. -
If this is unstable, increase the
regparameter and tune the rank.
Parameters:
-
rank(int, default:100) –size of the sketch, this many hessian-vector products will be evaluated per step.
-
reg(float | None, default:0.01) –scale of identity matrix added to hessian. Note that if this is specified, nystrom sketch-and-solve is used to compute
(Q diag(L) Q.T + reg*I)x = b. It is very unstable when reg is small, i.e. smaller than 1e-4. If this is None, (Q diag(L) Q.T)x = b is computed by simply taking the reciprocal of eigenvalues. Defaults to 1e-2. -
eigv_tol(float, default:0) –all eigenvalues smaller than largest eigenvalue times
eigv_tol are removed. Defaults to 0. -
truncate(int | None, default:None) –keeps top
truncateeigenvalues. Defaults to None. -
damping(float, default:0) –scalar added to eigenvalues. Defaults to 0.
-
rdamping(float, default:0) –scalar multiplied by largest eigenvalue and added to eigenvalues. Defaults to 0.
-
update_freq(int, default:1) –frequency of updating preconditioner. Defaults to 1.
-
hvp_method(str, default:'batched_autograd') –Determines how Hessian-vector products are computed.
"batched_autograd"- uses autograd with batched hessian-vector products to compute the preconditioner. Faster than"autograd"but uses more memory."autograd"- uses autograd hessian-vector products, uses a for loop to compute the preconditioner. Slower than"batched_autograd"but uses less memory."fd_forward"- uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product."fd_central"- uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
Defaults to
"autograd". -
h(float, default:0.001) –The step size for finite difference if
hvp_methodis"fd_forward"or"fd_central". Defaults to 1e-3. -
inner(Chainable | None, default:None) –modules to apply hessian preconditioner to. Defaults to None.
-
seed(int | None, default:None) –seed for random generator. Defaults to None.
Examples: NystromSketchAndSolve with backtracking line search
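A sketch of the first example (model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.NystromSketchAndSolve(rank=100),
    tz.m.Backtracking(),
)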
Trust region NystromSketchAndSolve
References: - Frangella, Z., Rathore, P., Zhao, S., & Udell, M. (2024). SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates. SIAM Journal on Mathematics of Data Science, 6(4), 1173-1204. - Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752
Source code in torchzero/modules/second_order/nystrom.py
Ones ¶
Bases: torchzero.core.module.Module
Outputs ones
Source code in torchzero/modules/ops/utility.py
Online ¶
Bases: torchzero.core.module.Module
Allows certain modules to be used for mini-batch optimization.
Examples:
Online L-BFGS with Backtracking line search
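A sketch of the first example, assuming the L-BFGS module is available as tz.m.LBFGS (model is any torch.nn.Module):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Online(tz.m.LBFGS()),
    tz.m.Backtracking(),
)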
Online L-BFGS trust region
Source code in torchzero/modules/misc/multistep.py
OrthoGrad ¶
Bases: torchzero.core.transform.TensorTransform
Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
Parameters:
-
eps(float, default:1e-08) –epsilon added to the denominator for numerical stability (default: 1e-8)
-
renormalize(bool, default:True) –whether to graft projected gradient to original gradient norm. Defaults to True.
Source code in torchzero/modules/adaptive/orthograd.py
Orthogonalize ¶
Bases: torchzero.core.transform.TensorTransform
Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.
To disable orthogonalization for a parameter, put it into a parameter group with "orthogonalize" = False. The Muon page says that embeddings and classifier heads should not be orthogonalized. Usually only matrix parameters that are directly used in matmuls should be orthogonalized.
To make Muon, use Split with Adam on 1d params
Parameters:
-
adjust_lr(bool, default:False) –Enables LR adjustment based on parameter size from "Muon is Scalable for LLM Training". Defaults to False.
-
dual_norm_correction(bool, default:False) –enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
-
method(str, default:'newtonschulz') –Newton-Schulz is very fast, SVD is slow but can be more precise.
-
channel_first(bool, default:True) –if True, orthogonalizes along 1st two dimensions, otherwise along last 2. Other dimensions are considered batch dimensions.
Examples:¶
standard Muon with Adam fallback
opt = tz.Optimizer(
model.head.parameters(),
tz.m.Split(
# apply muon only to 2D+ parameters
filter = lambda t: t.ndim >= 2,
true = [
tz.m.HeavyBall(),
tz.m.Orthogonalize(),
tz.m.LR(1e-2),
],
false = tz.m.Adam()
),
tz.m.LR(1e-2)
)
Reference
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein - Muon: An optimizer for hidden layers in neural networks (2024) https://github.com/KellerJordan/Muon
Source code in torchzero/modules/adaptive/muon.py
PSB ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults
Powell's Symmetric Broyden Quasi-Newton method.
Note
a line search or a trust region is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
PSGDDenseNewton ¶
Bases: torchzero.core.transform.Transform
Dense hessian preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
Parameters:
-
init_scale(float | None, default:None) –initial scale of the preconditioner. If None, determined based on a heuristic. Defaults to None.
-
lr_preconditioner(float, default:0.1) –learning rate of the preconditioner. Defaults to 0.1.
-
betaL(float, default:0.9) –EMA factor for the L-smoothness constant wrt Q. Defaults to 0.9.
-
damping(float, default:1e-09) –adds small noise to hessian-vector product when updating the preconditioner. Defaults to 1e-9.
-
grad_clip_max_norm(float, default:inf) –clips norm of the update. Defaults to float("inf").
-
update_probability(float, default:1.0) –probability of updating preconditioner on each step. Defaults to 1.0.
-
dQ(str, default:'Q0.5EQ1.5') –geometry for preconditioner update. Defaults to "Q0.5EQ1.5".
-
hvp_method(Literal, default:'autograd') –how to compute hessian-vector products. Defaults to 'autograd'.
-
h(float, default:0.001) –if
hvp_methodis"fd_central"or"fd_forward", controls finite difference step size. Defaults to 1e-3. -
distribution(Literal, default:'normal') –distribution for random vectors for hessian-vector products. Defaults to 'normal'.
-
inner(Chainable | None, default:None) –preconditioning will be applied to output of this module. Defaults to None.
Examples:¶
Pure Dense Newton PSGD:
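A sketch mirroring the example below, without the inner EMA:
optimizer = tz.Optimizer(
    model.parameters(),
    tz.m.DenseNewton(),
    tz.m.LR(1e-3),
)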
Applying preconditioner to momentum:
optimizer = tz.Optimizer(
model.parameters(),
tz.m.DenseNewton(inner=tz.m.EMA(0.9)),
tz.m.LR(1e-3),
)
Source code in torchzero/modules/adaptive/psgd/psgd_dense_newton.py
PSGDKronNewton ¶
Bases: torchzero.core.transform.Transform
Kron hessian preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
Parameters:
-
max_dim(int, default:10000) –dimensions with size larger than this use diagonal preconditioner. Defaults to 10_000.
-
max_skew(float, default:1.0) –if memory used by full preconditioner (dim^2) is larger than total number of elements in a parameter times
max_skew, it uses a diagonal preconditioner. Defaults to 1.0. -
init_scale(float | None, default:None) –initial scale of the preconditioner. If None, determined based on a heuristic. Defaults to None.
-
lr_preconditioner(float, default:0.1) –learning rate of the preconditioner. Defaults to 0.1.
-
betaL(float, default:0.9) –EMA factor for the L-smoothness constant wrt Q. Defaults to 0.9.
-
damping(float, default:1e-09) –adds small noise to gradient when updating the preconditioner. Defaults to 1e-9.
-
grad_clip_max_amp(float, default:inf) –clips amplitude of the update. Defaults to float("inf").
-
update_probability(float, default:1.0) –probability of updating preconditioner on each step. Defaults to 1.0.
-
dQ(str, default:'Q0.5EQ1.5') –geometry for preconditioner update. Defaults to "Q0.5EQ1.5".
-
balance_probability(float, default:0.01) –probability of balancing the dynamic ranges of the factors of Q to avoid over/under-flow on each step. Defaults to 0.01.
-
hvp_method(Literal, default:'autograd') –how to compute hessian-vector products. Defaults to 'autograd'.
-
h(float, default:0.001) –if
hvp_methodis"fd_central"or"fd_forward", controls finite difference step size. Defaults to 1e-3. -
distribution(Literal, default:'normal') –distribution for random vectors for hessian-vector products. Defaults to 'normal'.
-
inner(Chainable | None, default:None) –preconditioning will be applied to output of this module. Defaults to None.
Examples:¶
Pure PSGD Kron Newton:
Applying preconditioner to momentum:
optimizer = tz.Optimizer(
model.parameters(),
tz.m.KronNewton(inner=tz.m.EMA(0.9)),
tz.m.LR(1e-3),
)
Source code in torchzero/modules/adaptive/psgd/psgd_kron_newton.py
PSGDKronWhiten ¶
Bases: torchzero.core.transform.TensorTransform
Kron whitening preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
Parameters:
-
max_dim(int, default:10000) –dimensions with size larger than this use diagonal preconditioner. Defaults to 10_000.
-
max_skew(float, default:1.0) –if memory used by full preconditioner (dim^2) is larger than total number of elements in a parameter times
max_skew, it uses a diagonal preconditioner. Defaults to 1.0. -
init_scale(float | None, default:None) –initial scale of the preconditioner. If None, determined from magnitude of the first gradient. Defaults to None.
-
lr_preconditioner(float, default:0.1) –learning rate of the preconditioner. Defaults to 0.1.
-
betaL(float, default:0.9) –EMA factor for the L-smoothness constant wrt Q. Defaults to 0.9.
-
damping(float, default:1e-09) –adds small noise to gradient when updating the preconditioner. Defaults to 1e-9.
-
grad_clip_max_amp(float, default:inf) –clips amplitude of the update. Defaults to float("inf").
-
update_probability(float, default:1.0) –probability of updating preconditioner on each step. Defaults to 1.0.
-
dQ(str, default:'Q0.5EQ1.5') –geometry for preconditioner update. Defaults to "Q0.5EQ1.5".
-
balance_probability(float, default:0.01) –probability of balancing the dynamic ranges of the factors of Q to avoid over/under-flow on each step. Defaults to 0.01.
-
inner(Chainable | None, default:None) –preconditioning will be applied to output of this module. Defaults to None.
Examples:¶
Pure PSGD Kron:
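A sketch mirroring the example below, without the inner EMA:
optimizer = tz.Optimizer(
    model.parameters(),
    tz.m.KronWhiten(),
    tz.m.LR(1e-3),
)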
Momentum into preconditioner (whitens momentum):
Updating the preconditioner from gradients and applying it to momentum:
optimizer = tz.Optimizer(
model.parameters(),
tz.m.KronWhiten(inner=tz.m.EMA(0.9)),
tz.m.LR(1e-3),
)
Source code in torchzero/modules/adaptive/psgd/psgd_kron_whiten.py
PSGDLRANewton ¶
Bases: torchzero.core.transform.Transform
Low rank hessian preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
Parameters:
-
rank(int, default:10) –Preconditioner has a diagonal part and a low rank part, whose rank is decided by this setting. Defaults to 10.
-
init_scale(float | None, default:None) –initial scale of the preconditioner. If None, determined based on a heuristic. Defaults to None.
-
lr_preconditioner(float, default:0.1) –learning rate of the preconditioner. Defaults to 0.1.
-
betaL(float, default:0.9) –EMA factor for the L-smoothness constant wrt Q. Defaults to 0.9.
-
damping(float, default:1e-09) –adds small noise to hessian-vector product when updating the preconditioner. Defaults to 1e-9.
-
grad_clip_max_norm(float, default:inf) –clips norm of the update. Defaults to float("inf").
-
update_probability(float, default:1.0) –probability of updating preconditioner on each step. Defaults to 1.0.
-
hvp_method(Literal, default:'autograd') –how to compute hessian-vector products. Defaults to 'autograd'.
-
h(float, default:0.001) –if
hvp_methodis"fd_central"or"fd_forward", controls finite difference step size. Defaults to 1e-3. -
distribution(Literal, default:'normal') –distribution for random vectors for hessian-vector products. Defaults to 'normal'.
-
inner(Chainable | None, default:None) –preconditioning will be applied to output of this module. Defaults to None.
Examples:¶
Pure LRA Newton PSGD:
Applying preconditioner to momentum:
Source code in torchzero/modules/adaptive/psgd/psgd_lra_newton.py
PSGDLRAWhiten ¶
Bases: torchzero.core.transform.TensorTransform
Low rank whitening preconditioner from Preconditioned Stochastic Gradient Descent (see https://github.com/lixilinx/psgd_torch)
Parameters:
-
rank(int, default:10) –Preconditioner has a diagonal part and a low rank part, whose rank is decided by this setting. Defaults to 10.
-
init_scale(float | None, default:None) –initial scale of the preconditioner. If None, determined based on a heuristic. Defaults to None.
-
lr_preconditioner(float, default:0.1) –learning rate of the preconditioner. Defaults to 0.1.
-
betaL(float, default:0.9) –EMA factor for the L-smoothness constant wrt Q. Defaults to 0.9.
-
damping(float, default:1e-09) –adds small noise to hessian-vector product when updating the preconditioner. Defaults to 1e-9.
-
grad_clip_max_norm(float) –clips norm of the update. Defaults to float("inf").
-
update_probability(float, default:1.0) –probability of updating preconditioner on each step. Defaults to 1.0.
-
concat_params(bool, default:True) –if True, treats all parameters as concatenated to a single vector. If False, each parameter is preconditioned separately. Defaults to True.
-
inner(Chainable | None, default:None) –preconditioning will be applied to output of this module. Defaults to None.
Examples:¶
Pure PSGD LRA:
Momentum into preconditioner (whitens momentum):
Updating the preconditioner from gradients and applying it to momentum:
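For instance (a sketch; the module composition follows the parameter descriptions above, and the hyperparameter values are illustrative):

```python
# Pure PSGD LRA
opt = tz.Optimizer(
    model.parameters(),
    tz.m.PSGDLRAWhiten(),
    tz.m.LR(1e-3),
)

# Momentum into preconditioner (whitens momentum)
opt = tz.Optimizer(
    model.parameters(),
    tz.m.EMA(0.9),
    tz.m.PSGDLRAWhiten(),
    tz.m.LR(1e-3),
)

# Updating the preconditioner from gradients and applying it to momentum
opt = tz.Optimizer(
    model.parameters(),
    tz.m.PSGDLRAWhiten(inner=tz.m.EMA(0.9)),
    tz.m.LR(1e-3),
)
```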
Source code in torchzero/modules/adaptive/psgd/psgd_lra_whiten.py
Params ¶
Bases: torchzero.core.module.Module
Outputs parameters
Source code in torchzero/modules/ops/utility.py
Pearson ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Pearson's Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
PerturbWeights ¶
Bases: torchzero.core.module.Module
Changes the closure so that it evaluates loss and gradients at weights perturbed by a random perturbation.
Can be disabled for a parameter by setting perturb=False in corresponding parameter group.
Parameters:
-
alpha(float, default:0.1) –multiplier for perturbation magnitude. Defaults to 0.1.
-
relative(bool, default:True) –whether to multiply perturbation by mean absolute value of the parameter. Defaults to True.
-
distribution(str, default:'normal') –distribution of the random perturbation. Defaults to 'normal'.
Source code in torchzero/modules/misc/regularization.py
PolakRibiere ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Polak-Ribière-Polyak nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
PolyakStepSize ¶
Bases: torchzero.core.transform.TensorTransform
Polyak's subgradient method with known or unknown f*.
Parameters:
-
f_star(float | None, default:0) –minimal possible value of the objective function. If not known, set to
None. Defaults to 0. -
y(float, default:1) –when
f_staris set to None, it is calculated asf_best - y. -
y_decay(float, default:0.001) –yis multiplied by(1 - y_decay)after each step. Defaults to 1e-3. -
max(float | None, default:None) –maximum possible step size. Defaults to None.
-
use_grad(bool, default:True) –if True, uses dot product of update and gradient to compute the step size. Otherwise, dot product of update with itself is used.
-
alpha(float, default:1) –multiplier to Polyak step-size. Defaults to 1.
Source code in torchzero/modules/step_size/adaptive.py
Pow ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Take tensors to the power of exponent. exponent can be a number or a module.
If exponent is a module, this calculates tensors ^ exponent(tensors)
Source code in torchzero/modules/ops/binary.py
PowModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates input ** exponent. input and exponent can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
PowellRestart ¶
Bases: torchzero.modules.restarts.restars.RestartStrategyBase
Powell's two restarting criteria for conjugate gradient methods.
The restart clears all states of modules.
Parameters:
-
modules(Chainable | None) –modules to reset. If None, resets all modules.
-
cond1(float | None, default:0.2) –criterion that checks for nonconjugacy of the search directions. Restart is performed whenever |g_k^T g_{k+1}| >= cond1 * ||g_{k+1}||^2. The default value of 0.2 is suggested by Powell. Can be None to disable this criterion.
-
cond2(float | None, default:0.2) –criterion that checks whether the direction is effectively downhill. Restart is performed when d^T g falls outside the range [-(1+cond2)*||g||^2, -(1-cond2)*||g||^2], i.e. outside [-1.2||g||^2, -0.8||g||^2] with the default value. Defaults to 0.2. Can be None to disable this criterion.
Reference
Powell, Michael James David. "Restart procedures for the conjugate gradient method." Mathematical programming 12.1 (1977): 241-254.
Source code in torchzero/modules/restarts/restars.py
Previous ¶
Bases: torchzero.core.transform.TensorTransform
Maintains an update from n steps back; for example, if n=1, returns the previous update.
Source code in torchzero/modules/misc/misc.py
PrintLoss ¶
Bases: torchzero.core.module.Module
Prints var.get_loss().
Source code in torchzero/modules/misc/debug.py
PrintParams ¶
Bases: torchzero.core.module.Module
Prints current parameters.
Source code in torchzero/modules/misc/debug.py
PrintShape ¶
Bases: torchzero.core.module.Module
Prints shapes of the update.
Source code in torchzero/modules/misc/debug.py
PrintUpdate ¶
Bases: torchzero.core.module.Module
Prints current update.
Source code in torchzero/modules/misc/debug.py
Prod ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs product of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
ProjectedGradientMethod ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Projected gradient method. Directly projects the gradient onto subspace conjugate to past directions.
Notes
- This method uses N^2 memory.
- This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order") after this.
- This is not the same as projected gradient descent.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171. (algorithm 5 in section 6)
Source code in torchzero/modules/conjugate_gradient/cg.py
ProjectedNewtonRaphson ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Projected Newton Raphson method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
This one is Algorithm 7.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
ProjectionBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for projections.
This is an abstract class, to use it, subclass it and override project and unproject.
Parameters:
-
modules(Chainable) –modules that will be applied in the projected domain.
-
project_update(bool, default:True) –whether to project the update. Defaults to True.
-
project_params(bool, default:False) –whether to project the params. This is necessary for modules that use closure. Defaults to False.
-
project_grad(bool, default:False) –whether to project the gradients (separately from update). Defaults to False.
-
defaults(dict[str, Any] | None, default:None) –dictionary with defaults. Defaults to None.
Methods:
-
project–projects
tensors. Note that this can be called multiple times per step withparams,grads, andupdate. -
unproject–unprojects
tensors. Note that this can be called multiple times per step withparams,grads, andupdate.
Source code in torchzero/modules/projections/projection.py
project ¶
project(tensors: list[Tensor], params: list[Tensor], grads: list[Tensor] | None, loss: Tensor | None, states: list[dict[str, Any]], settings: list[ChainMap[str, Any]], current: str) -> Iterable[Tensor]
projects tensors. Note that this can be called multiple times per step with params, grads, and update.
Source code in torchzero/modules/projections/projection.py
unproject ¶
unproject(projected_tensors: list[Tensor], params: list[Tensor], grads: list[Tensor] | None, loss: Tensor | None, states: list[dict[str, Any]], settings: list[ChainMap[str, Any]], current: str) -> Iterable[Tensor]
unprojects tensors. Note that this can be called multiple times per step with params, grads, and update.
Parameters:
-
projected_tensors(list[Tensor]) –projected tensors to unproject.
-
params(list[Tensor]) –original, unprojected parameters.
-
grads(list[Tensor] | None) –original, unprojected gradients
-
loss(Tensor | None) –loss at initial point.
-
states(list[dict[str, Any]]) –list of state dictionaries per each UNPROJECTED tensor.
-
settings(list[ChainMap[str, Any]]) –list of setting dictionaries per each UNPROJECTED tensor.
-
current(str) –string representing what is being unprojected, e.g. "params", "grads" or "update".
Returns:
-
Iterable[Tensor]–Iterable[torch.Tensor]: unprojected tensors of the same shape as params
Source code in torchzero/modules/projections/projection.py
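As a rough illustration of the ProjectionBase interface described above, here is a minimal (and deliberately trivial) custom projection; the exact import path and constructor usage are assumptions based on the parameter list above, so treat this as a sketch rather than a definitive recipe:

```python
import torchzero as tz

class NegateProjection(tz.m.ProjectionBase):
    """Hypothetical projection that runs the inner modules on negated tensors."""
    def __init__(self, modules):
        super().__init__(modules, project_update=True)

    def project(self, tensors, params, grads, loss, states, settings, current):
        # may be called with "update", "params" or "grads"; returns projected tensors
        return [-t for t in tensors]

    def unproject(self, projected_tensors, params, grads, loss, states, settings, current):
        # must invert `project`, returning tensors with the same shapes as the originals
        return [-t for t in projected_tensors]

opt = tz.Optimizer(model.parameters(), NegateProjection(tz.m.Adam()), tz.m.LR(1e-3))
```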
RCopySign ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Returns other(tensors) with sign copied from tensors.
Source code in torchzero/modules/ops/binary.py
RDSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Random-direction stochastic approximation (RDSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h(float, default:0.001) –finite difference step size if jvp_method is set to
forwardorcentral. Defaults to 1e-3. -
n_samples(int, default:1) –number of random gradient samples. Defaults to 1.
-
formula(Literal, default:'central2') –finite difference formula. Defaults to 'central2'.
-
distribution(Literal, default:'gaussian') –distribution. Defaults to "gaussian".
-
pre_generate(bool, default:True) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed(int | None | Generator, default:None) –Seed for random generator. Defaults to None.
-
target(Literal, default:'closure') –what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
Source code in torchzero/modules/grad_approximation/rfdm.py
RDiv ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Divide other by tensors. other can be a number or a module.
If other is a module, this calculates other(tensors) / tensors
Source code in torchzero/modules/ops/binary.py
RMSprop ¶
Bases: torchzero.core.transform.TensorTransform
Divides gradient by EMA of gradient squares.
This implementation is identical to :code:torch.optim.RMSprop.
Parameters:
-
smoothing(float, default:0.99) –beta for exponential moving average of gradient squares. Defaults to 0.99.
-
eps(float, default:1e-08) –epsilon for division. Defaults to 1e-8.
-
centered(bool, default:False) –whether to center EMA of gradient squares using an additional EMA. Defaults to False.
-
debias(bool, default:False) –applies Adam debiasing. Defaults to False.
-
amsgrad(bool, default:False) –Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
-
pow(float) –power used in second momentum power and root. Defaults to 2.
-
init(str, default:'zeros') –how to initialize EMA, either "update" to use first update or "zeros". Defaults to 'zeros'.
-
inner(Chainable | None, default:None) –Inner modules that are applied after updating EMA and before preconditioning. Defaults to None.
Source code in torchzero/modules/adaptive/rmsprop.py
RPow ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Take other to the power of tensors. other can be a number or a module.
If other is a module, this calculates other(tensors) ^ tensors
Source code in torchzero/modules/ops/binary.py
RSub ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Subtract tensors from other. other can be a number or a module.
If other is a module, this calculates other(tensors) - tensors
Source code in torchzero/modules/ops/binary.py
Randn ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with random numbers from a normal distribution with mean 0 and variance 1.
Source code in torchzero/modules/ops/utility.py
RandomHvp ¶
Bases: torchzero.core.module.Module
Returns a hessian-vector product with a random vector, optionally times vector
Source code in torchzero/modules/misc/misc.py
RandomReinitialize ¶
Bases: torchzero.core.module.Module
On each step, with probability p_reinit, reinitialization is triggered, whereby each weight is reset to its initial value with probability p_weights.
This modifies the parameters directly. Place it as the first module.
Parameters:
-
p_reinit(float, default:0.01) –probability to trigger reinitialization on each step. Defaults to 0.01.
-
p_weights(float, default:0.1) –probability for each weight to be set to initial value when reinitialization is triggered. Defaults to 0.1.
-
store_every(int | None, default:None) –if set, stores new initial values every this many steps. Defaults to None.
-
beta(float, default:0) –whenever
store_everyis triggered, uses linear interpolation with this beta. Ifstore_every=1, this can be set to some value close to 1 such as 0.999 to reinitialize to slow parameter EMA. Defaults to 0. -
reset(bool, default:False) –whether to reset states of other modules on reinitialization. Defaults to False.
-
seed(int | None, default:None) –random seed.
Source code in torchzero/modules/weight_decay/reinit.py
RandomSample ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with random numbers from distribution depending on value of distribution.
Source code in torchzero/modules/ops/utility.py
RandomStepSize ¶
Bases: torchzero.core.transform.TensorTransform
Uses random global or layer-wise step size from low to high.
Parameters:
-
low(float, default:0) –minimum learning rate. Defaults to 0.
-
high(float, default:1) –maximum learning rate. Defaults to 1.
-
parameterwise(bool, default:False) –if True, generate random step size for each parameter separately, if False generate one global random step size. Defaults to False.
Source code in torchzero/modules/step_size/lr.py
RandomizedFDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via a randomized finite-difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h(float, default:0.001) –finite difference step size if jvp_method is set to
forwardorcentral. Defaults to 1e-3. -
n_samples(int, default:1) –number of random gradient samples. Defaults to 1.
-
formula(Literal, default:'central') –finite difference formula. Defaults to 'central'.
-
distribution(Literal, default:'rademacher') –distribution. Defaults to "rademacher". If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
-
pre_generate(bool, default:True) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed(int | None | Generator, default:None) –Seed for random generator. Defaults to None.
-
target(Literal, default:'closure') –what to set on var. Defaults to "closure".
Examples:
Simultaneous perturbation stochastic approximation (SPSA) method¶
SPSA is randomized FDM with rademacher distribution and central formula.
spsa = tz.Optimizer(
model.parameters(),
tz.m.RandomizedFDM(formula="fd_central", distribution="rademacher"),
tz.m.LR(1e-2)
)
Random-direction stochastic approximation (RDSA) method¶
RDSA is randomized FDM with usually gaussian distribution and central formula.
rdsa = tz.Optimizer(
model.parameters(),
tz.m.RandomizedFDM(formula="fd_central", distribution="gaussian"),
tz.m.LR(1e-2)
)
Gaussian smoothing method¶
GS uses many gaussian samples with possibly a larger finite difference step size.
gs = tz.Optimizer(
model.parameters(),
tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
tz.m.NewtonCG(hvp_method="forward"),
tz.m.Backtracking()
)
RandomizedFDM with momentum¶
Momentum might help by reducing the variance of the estimated gradients.
momentum_spsa = tz.Optimizer(
model.parameters(),
tz.m.RandomizedFDM(),
tz.m.HeavyBall(0.9),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/grad_approximation/rfdm.py
PRE_MULTIPLY_BY_H
class-attribute
¶
Reciprocal ¶
Bases: torchzero.core.transform.TensorTransform
Returns 1 / input
Source code in torchzero/modules/ops/unary.py
ReduceOperationBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for reduction operations like Sum, Prod, Maximum. This is an abstract class, subclass it and override transform method to use it.
Methods:
-
transform–applies the operation to operands
Source code in torchzero/modules/ops/reduce.py
transform ¶
Relative ¶
Bases: torchzero.core.transform.TensorTransform
Multiplies update by absolute parameter values to make it relative to their magnitude; min_value is the minimum allowed value, to avoid getting stuck at 0.
Source code in torchzero/modules/misc/misc.py
RelativeWeightDecay ¶
Bases: torchzero.core.transform.TensorTransform
Weight decay relative to the mean absolute value of update, gradient or parameters depending on value of norm_input argument.
Parameters:
-
weight_decay(float, default:0.1) –relative weight decay scale.
-
ord(int, default:2) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
-
norm_input(str, default:'update') –determines what should weight decay be relative to. "update", "grad" or "params". Defaults to "update".
-
metric(Ords, default:'mad') –metric (norm, etc) that weight decay should be relative to. defaults to 'mad' (mean absolute deviation).
-
target(Target) –what to set on var. Defaults to 'update'.
Examples:¶
Adam with non-decoupled relative weight decay
opt = tz.Optimizer(
model.parameters(),
tz.m.RelativeWeightDecay(1e-1),
tz.m.Adam(),
tz.m.LR(1e-3)
)
Adam with decoupled relative weight decay
opt = tz.Optimizer(
model.parameters(),
tz.m.Adam(),
tz.m.RelativeWeightDecay(1e-1),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/weight_decay/weight_decay.py
RestartEvery ¶
Bases: torchzero.modules.restarts.restars.RestartStrategyBase
Resets the state every n steps
Parameters:
-
modules(Chainable | None) –modules to reset. If None, resets all modules.
-
steps(int | Literal['ndim']) –number of steps between resets. "ndim" to use number of parameters.
Source code in torchzero/modules/restarts/restars.py
RestartOnStuck ¶
Bases: torchzero.modules.restarts.restars.RestartStrategyBase
Resets the state when update (difference in parameters) is zero for multiple steps in a row.
Parameters:
-
modules(Chainable | None) –modules to reset. If None, resets all modules.
-
tol(float, default:None) –step is considered failed when maximum absolute parameter difference is smaller than this. Defaults to None (uses twice the smallest representable number).
-
n_tol(int, default:10) –number of failed consecutive steps required to trigger a reset. Defaults to 10.
Source code in torchzero/modules/restarts/restars.py
RestartStrategyBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Base class for restart strategies.
On each update/step this checks reset condition and if it is satisfied,
resets the modules before updating or stepping.
Methods:
-
should_reset–returns whether reset should occur
Source code in torchzero/modules/restarts/restars.py
Rprop ¶
Bases: torchzero.core.transform.TensorTransform
Resilient propagation. The update magnitude gets multiplied by nplus if gradient didn't change the sign,
or nminus if it did. Then the update is applied with the sign of the current gradient.
Additionally, if gradient changes sign, the update for that weight is reverted. Next step, magnitude for that weight won't change.
Compared to pytorch this also implements backtracking update when sign changes.
This implementation is identical to torch.optim.Rprop if backtrack is set to False.
Parameters:
-
nplus(float, default:1.2) –multiplicative increase factor for when ascent didn't change sign (default: 1.2).
-
nminus(float, default:0.5) –multiplicative decrease factor for when ascent changed sign (default: 0.5).
-
lb(float, default:1e-06) –minimum step size, can be None (default: 1e-6)
-
ub(float, default:50) –maximum step size, can be None (default: 50)
-
backtrack(bool, default:True) –if True, when ascent sign changes, undoes last weight update, otherwise sets update to 0. When this is False, this exactly matches pytorch Rprop. (default: True)
-
alpha(float, default:1) –initial per-parameter learning rate (default: 1).
Reference
Riedmiller, M., & Braun, H. (1993, March). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE international conference on neural networks (pp. 586-591). IEEE.
Source code in torchzero/modules/adaptive/rprop.py
SAM ¶
Bases: torchzero.core.transform.Transform
Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.
This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.
Note
This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.
Parameters:
-
rho(float, default:0.05) –Neighborhood size. Defaults to 0.05.
-
p(float, default:2) –norm of the SAM objective. Defaults to 2.
-
asam(bool, default:False) –enables ASAM variant which makes perturbation relative to weight magnitudes. ASAM requires a much larger
rho, like 0.5 or 1. The tz.m.ASAM class is identical to setting this argument to True, but it has a larger rho by default.
Examples:¶
SAM-SGD:
SAM-Adam:
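For instance (a sketch; the learning rates are illustrative, and a closure must be passed to opt.step as noted above):

```python
# SAM-SGD
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SAM(),
    tz.m.LR(1e-2),
)

# SAM-Adam
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SAM(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
```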
Source code in torchzero/modules/adaptive/sam.py
SG2 ¶
Bases: torchzero.core.transform.Transform
second-order stochastic gradient
2SPSA (second-order SPSA)
SG2 with line search
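For example, a sketch of the line-search composition (SG2's defaults are assumed; Backtracking is documented elsewhere in this reference):

```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SG2(),
    tz.m.Backtracking(),
)
```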
SG2 with trust region
opt = tz.Optimizer(
model.parameters(),
tz.m.LevenbergMarquardt(tz.m.SG2(beta=0.75, n_samples=4)),
)
Source code in torchzero/modules/quasi_newton/sg2.py
SOAP ¶
Bases: torchzero.core.transform.TensorTransform
SOAP (ShampoO with Adam in the Preconditioner's eigenbasis from https://arxiv.org/abs/2409.11321).
Parameters:
-
beta1(float, default:0.95) –beta for first momentum. Defaults to 0.95.
-
beta2(float, default:0.95) –beta for second momentum. Defaults to 0.95.
-
shampoo_beta(float | None, default:0.95) –beta for covariance matrices accumulators. Can be None, then it just sums them like Adagrad (which works worse). Defaults to 0.95.
-
precond_freq(int, default:10) –How often to update the preconditioner. Defaults to 10.
-
merge_small(bool, default:True) –Whether to merge small dims. Defaults to True.
-
max_dim(int, default:4096) –Won't precondition dims larger than this. Defaults to 4096.
-
precondition_1d(bool, default:True) –Whether to precondition 1d params (SOAP paper sets this to False). Defaults to True.
-
eps(float, default:1e-08) –epsilon for dividing first momentum by second. Defaults to 1e-8.
-
debias(bool, default:True) –enables adam bias correction. Defaults to True.
-
proj_exp_avg(bool, default:True) –if True, maintains exponential average of gradients (momentum) in the projected space. If False, in the original space. Defaults to True.
-
alpha(float, default:1) –learning rate. Defaults to 1.
-
inner(Chainable | None, default:None) –output of this module is projected and Adam will run on it, but preconditioners are updated from original gradients.
Examples:¶
SOAP:
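For instance (the learning rate is illustrative):

```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SOAP(),
    tz.m.LR(1e-3),
)
```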
Stabilized SOAP:
opt = tz.Optimizer(
model.parameters(),
tz.m.SOAP(),
tz.m.NormalizeByEMA(max_ema_growth=1.2),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/soap.py
SOAPBasis ¶
Bases: torchzero.core.transform.TensorTransform
Run another optimizer in Shampoo eigenbases.
Note
the buffers of the basis_opt are re-projected whenever basis changes. The reprojection logic is not implemented on all modules. Some supported modules are:
Adagrad, Adam, Adan, Lion, MARSCorrection, MSAMMomentum, RMSprop, EMA, HeavyBall, NAG, ClipNormByEMA, ClipValueByEMA, NormalizeByEMA, ClipValueGrowth, CoordinateMomentum, CubicAdam.
Additionally, most modules with no internal buffers are supported, e.g. Cautious, Sign, ClipNorm, Orthogonalize, etc. However, modules that use weight values, such as WeightDecay, can't be supported, as weights can't be projected.
Also, if you, for example, use EMA on the output of Pow(2), the exponential moving average will be reprojected as a gradient and not as squared gradients. Use modules like EMASquared or SqrtEMASquared to get correct reprojections.
Parameters:
-
basis_opt(Chainable) –module or modules to run in Shampoo eigenbases.
-
shampoo_beta(float | None, default:0.95) –beta for covariance matrices accumulators. Can be None, then it just sums them like Adagrad (which works worse). Defaults to 0.95.
-
precond_freq(int, default:10) –How often to update the preconditioner. Defaults to 10.
-
merge_small(bool, default:True) –Whether to merge small dims. Defaults to True.
-
max_dim(int, default:4096) –Won't precondition dims larger than this. Defaults to 4096.
-
precondition_1d(bool, default:True) –Whether to precondition 1d params (SOAP paper sets this to False). Defaults to True.
-
inner(Chainable | None, default:None) –output of this module is projected and
basis_optwill run on it, but preconditioners are updated from original gradients.
Examples:¶
SOAP with MARS and AMSGrad:
opt = tz.Optimizer(
model.parameters(),
tz.m.SOAPBasis([tz.m.MARSCorrection(0.95), tz.m.Adam(0.95, 0.95, amsgrad=True)]),
tz.m.LR(1e-3)
)
LaProp in Shampoo eigenbases (SOLP):
# we define LaProp through other modules, moved it out for brevity
laprop = (
tz.m.RMSprop(0.95),
tz.m.Debias(beta1=None, beta2=0.95),
tz.m.EMA(0.95),
tz.m.Debias(beta1=0.95, beta2=None),
)
opt = tz.Optimizer(
model.parameters(),
tz.m.SOAPBasis(laprop),
tz.m.LR(1e-3)
)
Lion in Shampoo eigenbases (works kinda well):
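For instance (a sketch; Lion's defaults are assumed and the learning rate is illustrative):

```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SOAPBasis(tz.m.Lion()),
    tz.m.LR(1e-4),
)
```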
Source code in torchzero/modules/basis/soap_basis.py
SPSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h(float, default:0.001) –finite difference step size if jvp_method is set to
forwardorcentral. Defaults to 1e-3. -
n_samples(int, default:1) –number of random gradient samples. Defaults to 1.
-
formula(Literal, default:'central') –finite difference formula. Defaults to 'central'.
-
distribution(Literal, default:'rademacher') –distribution. Defaults to "rademacher".
-
pre_generate(bool, default:True) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed(int | None | Generator, default:None) –Seed for random generator. Defaults to None.
-
target(Literal, default:'closure') –what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
Source code in torchzero/modules/grad_approximation/rfdm.py
SPSA1 ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
One-measurement variant of SPSA. Unlike standard two-measurement SPSA, the estimated gradient often won't be a descent direction, however the expectation is biased towards the descent direction. Therefore this variant of SPSA is only recommended for a specific class of problems where the objective function changes on each evaluation, for example feedback control problems.
Parameters:
-
h(float, default:0.001) –finite difference step size, recommended to set to same value as learning rate. Defaults to 1e-3.
-
n_samples(int, default:1) –number of random samples. Defaults to 1.
-
eps(float, default:1e-08) –measurement noise estimate. Defaults to 1e-8.
-
seed(int | None | Generator, default:None) –random seed. Defaults to None.
-
target(Literal, default:'closure') –what to set on closure. Defaults to "closure".
Source code in torchzero/modules/grad_approximation/spsa1.py
SR1 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Symmetric Rank 1. This works best with a trust region:
Parameters:
-
init_scale(float | Literal['auto'], default:'auto') –initial hessian matrix is set to identity times this.
"auto" corresponds to a heuristic from [1] p.142-143.
Defaults to "auto".
-
tol(float, default:1e-32) –tolerance for denominator in SR1 update rule as in [1] p.146. Defaults to 1e-32.
-
ptol(float | None, default:1e-32) –skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.
-
ptol_restart(bool, default:False) –whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
-
restart_interval(int | None | Literal['auto'], default:None) –interval between resetting the hessian approximation.
"auto" corresponds to number of decision variables + 1.
None - no resets.
Defaults to None.
-
beta(float | None, default:None) –momentum on H or B. Defaults to None.
-
update_freq(int, default:1) –frequency of updating H or B. Defaults to 1.
-
scale_first(bool, default:False) –whether to downscale first step before hessian approximation becomes available. Defaults to False.
-
scale_second(bool) –whether to downscale second step. Defaults to False.
-
concat_params(bool, default:True) –If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.
-
inner(Chainable | None, default:None) –preconditioning is applied to the output of this module. Defaults to None.
Examples:¶
SR1 with trust region
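For instance, a sketch using the LevenbergMarquardt trust region seen in the SG2 example above (whether this is the intended trust-region wrapper is an assumption; defaults are used throughout):

```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.SR1()),
)
```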
References:¶
[1]. Nocedal. Stephen J. Wright. Numerical Optimization
Source code in torchzero/modules/quasi_newton/quasi_newton.py
SSVM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Self-scaling variable metric Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Oren, S. S., & Spedicato, E. (1976). Optimal conditioning of self-scaling variable Metric algorithms. Mathematical Programming, 10(1), 70–90. doi:10.1007/bf01580654
Source code in torchzero/modules/quasi_newton/quasi_newton.py
SVRG ¶
Bases: torchzero.core.module.Module
Stochastic variance reduced gradient method (SVRG).
To use, put SVRG as the first module, it can be used with any other modules. To reduce variance of a gradient estimator, put the gradient estimator before SVRG.
First, it uses the first accum_steps batches to compute the full gradient at the initial parameters using gradient accumulation; the model is not updated during this phase.
Then it performs svrg_steps SVRG steps, each of which requires two forward and backward passes.
After svrg_steps steps, it goes back to the full gradient computation step.
As an alternative to gradient accumulation you can pass "full_closure" argument to the step method,
which should compute full gradients, set them to .grad attributes of the parameters,
and return full loss.
Parameters:
-
svrg_steps(int) –number of steps before calculating full gradient. This can be set to length of the dataloader.
-
accum_steps(int | None, default:None) –number of steps to accumulate the gradient for. Not used if "full_closure" is passed to the
stepmethod. If None, uses value ofsvrg_steps. Defaults to None. -
reset_before_accum(bool, default:True) –whether to reset all other modules when re-calculating full gradient. Defaults to True.
-
svrg_loss(bool, default:True) –whether to replace loss with SVRG loss (calculated by same formula as SVRG gradient). Defaults to True.
-
alpha(float, default:1) –multiplier to
g_full(x_0) - g_batch(x_0)term, can be annealed linearly from 1 to 0 as suggested in https://arxiv.org/pdf/2311.05589#page=6
Examples:¶
SVRG-LBFGS
opt = tz.Optimizer(
model.parameters(),
tz.m.SVRG(len(dataloader)),
tz.m.LBFGS(),
tz.m.Backtracking(),
)
For extra variance reduction one can use Online versions of algorithms, although it won't always help.
opt = tz.Optimizer(
model.parameters(),
tz.m.SVRG(len(dataloader)),
tz.m.Online(tz.m.LBFGS()),
tz.m.Backtracking(),
)
Variance reduction can also be applied to gradient estimators.
opt = tz.Optimizer(
model.parameters(),
tz.m.SPSA(),
tz.m.SVRG(100),
tz.m.LR(1e-2),
)
Notes¶
The SVRG gradient is computed as g_b(x) - alpha * (g_b(x_0) - g_f(x_0)), where:
- x is current parameters
- x_0 is initial parameters, where full gradient was computed
- g_b refers to mini-batch gradient at x or x_0
- g_f refers to full gradient at x_0.
The SVRG loss is computed using the same formula.
Source code in torchzero/modules/variance_reduction/svrg.py
SaveBest ¶
Bases: torchzero.core.module.Module
Saves best parameters found so far, ones that have lowest loss. Put this as the last module.
Adds the following attrs:
- best_params - a list of tensors with the best parameters.
- best_loss - loss value with best_params.
- load_best_parameters - a function that sets parameters to the best parameters.
Examples¶
```python
def rosenbrock(x, y):
    return (1 - x)**2 + (100 * (y - x**2))**2

xy = torch.tensor((-1.1, 2.5), requires_grad=True)
opt = tz.Optimizer(
    [xy],
    tz.m.NAG(0.999),
    tz.m.LR(1e-6),
    tz.m.SaveBest()
)

# optimize for 1000 steps
for i in range(1000):
    loss = rosenbrock(*xy)
    opt.zero_grad()
    loss.backward()
    opt.step(loss=loss)  # SaveBest needs closure or loss

# NAG overshot, but we saved the best params
print(f'{rosenbrock(*xy) = }')         # >> 3.6583
print(f"{opt.attrs['best_loss'] = }")  # >> 0.000627

# load best parameters
opt.attrs['load_best_params']()
print(f'{rosenbrock(*xy) = }')         # >> 0.000627
```
Source code in torchzero/modules/misc/misc.py
ScalarProjection ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Projection that splits all parameters into individual scalars.
Source code in torchzero/modules/projections/projection.py
ScaleByGradCosineSimilarity ¶
Bases: torchzero.core.transform.TensorTransform
Multiplies the update by cosine similarity with gradient. If cosine similarity is negative, naturally the update will be negated as well.
Parameters:
-
eps(float, default:1e-06) –epsilon for division. Defaults to 1e-6.
Examples:¶
Scaled Adam
opt = tz.Optimizer(
bench.parameters(),
tz.m.Adam(),
tz.m.ScaleByGradCosineSimilarity(),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/momentum/cautious.py
ScaleLRBySignChange ¶
Bases: torchzero.core.transform.TensorTransform
Learning rate gets multiplied by nplus if the ascent/gradient didn't change sign, or by nminus if it did.
This is part of RProp update rule.
Parameters:
-
nplus(float, default:1.2) –learning rate gets multiplied by
nplusif ascent/gradient didn't change the sign -
nminus(float, default:0.5) –learning rate gets multiplied by
nminusif ascent/gradient changed the sign -
lb(float, default:1e-06) –lower bound for lr.
-
ub(float, default:50.0) –upper bound for lr.
-
alpha(float, default:1.0) –initial learning rate.
Source code in torchzero/modules/adaptive/rprop.py
ScaleModulesByCosineSimilarity ¶
Bases: torchzero.core.module.Module
Scales the output of main module by its cosine similarity to the output
of compare module.
Parameters:
-
main(Chainable) –main module or sequence of modules whose update will be scaled.
-
compare(Chainable) –module or sequence of modules to compare to
-
eps(float, default:1e-06) –epsilon for division. Defaults to 1e-6.
Examples:¶
Adam scaled by similarity to RMSprop
opt = tz.Optimizer(
bench.parameters(),
tz.m.ScaleModulesByCosineSimilarity(
main = tz.m.Adam(),
compare = tz.m.RMSprop(0.999, debiased=True),
),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/momentum/cautious.py
ScipyMinimizeScalar ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Line search via :code:scipy.optimize.minimize_scalar which implements brent, golden search and bounded brent methods.
Parameters:
-
method(str | None, default:None) –"brent", "golden" or "bounded". Defaults to None.
-
maxiter(int | None, default:None) –maximum number of function evaluations the line search is allowed to perform. Defaults to None.
-
bracket(Sequence | None, default:None) –Either a triple (xa, xb, xc) satisfying xa < xb < xc and func(xb) < func(xa) and func(xb) < func(xc), or a pair (xa, xb) to be used as initial points for a downhill bracket search. Defaults to None.
-
bounds(Sequence | None, default:None) –For method ‘bounded’, bounds is mandatory and must have two finite items corresponding to the optimization bounds. Defaults to None.
-
tol(float | None, default:None) –Tolerance for termination. Defaults to None.
-
prev_init(bool, default:False) –uses previous step size as initial guess for the line search.
-
options(dict | None, default:None) –A dictionary of solver options. Defaults to None.
For more details on methods and arguments refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize_scalar.html
Source code in torchzero/modules/line_search/scipy.py
Sequential ¶
Bases: torchzero.core.module.Module
On each step, this sequentially steps with modules steps times.
The update is taken to be the parameter difference between parameters before and after the inner loop.
Source code in torchzero/modules/misc/multistep.py
Shampoo ¶
Bases: torchzero.core.transform.TensorTransform
Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).
Notes
Shampoo is usually grafted to another optimizer like Adam, otherwise it can be unstable. An example of how to do grafting is given below in the Examples section.
Shampoo is a very computationally expensive optimizer, increase update_freq if it is too slow.
SOAP optimizer usually outperforms Shampoo and is also not as computationally expensive. SOAP implementation is available as tz.m.SOAP.
Parameters:
-
update_freq(int) –preconditioner update frequency. Defaults to 10.
-
matrix_power(float | None, default:None) –overrides matrix exponent. By default uses
-1/grad.ndim. Defaults to None. -
merge_small(bool, default:True) –whether to merge small dims on tensors. Defaults to True.
-
max_dim(int, default:10000) –maximum dimension size for preconditioning. Defaults to 10_000.
-
precondition_1d(bool, default:True) –whether to precondition 1d tensors. Defaults to True.
-
adagrad_eps(float, default:1e-08) –epsilon for adagrad division for tensors where shampoo can't be applied. Defaults to 1e-8.
-
matrix_power_method(Literal, default:'eigh_abs') –how to compute matrix power.
-
beta(float | None, default:None) –if None calculates sum as in standard Shampoo, otherwise uses EMA of preconditioners. Defaults to None.
-
inner(Chainable | None, default:None) –module applied after updating preconditioners and before applying preconditioning. For example if beta≈0.999 and
inner=tz.m.EMA(0.9), this becomes Adam with shampoo preconditioner (ignoring debiasing). Defaults to None.
Examples:¶
Shampoo grafted to Adam
opt = tz.Optimizer(
model.parameters(),
tz.m.GraftModules(
direction = tz.m.Shampoo(),
magnitude = tz.m.Adam(),
),
tz.m.LR(1e-3)
)
Adam with Shampoo preconditioner
opt = tz.Optimizer(
model.parameters(),
tz.m.Shampoo(beta=0.999, inner=tz.m.EMA(0.9)),
tz.m.Debias(0.9, 0.999),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/adaptive/shampoo.py
ShorR ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Shor’s r-algorithm.
Note
-
A line search such as
[tz.m.StrongWolfe(a_init="quadratic", fallback=True), tz.m.Mul(1.2)]is required. Similarly to conjugate gradient, ShorR doesn't have an automatic step size scaling, so settinga_initin the line search is recommended. -
The line search should try to overstep a little, so it can help to multiply the direction returned by the line search by a value slightly larger than 1, such as 1.2.
References
These are the original references, but neither seems to be available online:
- Shor, N. Z., Utilization of the Operation of Space Dilatation in the Minimization of Convex Functions, Kibernetika, No. 1, pp. 6-12, 1970.
- Skokov, V. A., Note on Minimization Methods Employing Space Stretching, Kibernetika, No. 4, pp. 115-117, 1974.
An overview is available in Burke, James V., Adrian S. Lewis, and Michael L. Overton. "The Speed of Shor's R-algorithm." IMA Journal of numerical analysis 28.4 (2008): 711-720.
The reference by Skokov, V. A. describes a more efficient formula, which can also be found in Ansari, Zafar A., Limited Memory Space Dilation and Reduction Algorithms. Diss. Virginia Tech, 1998.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Sign ¶
Bases: torchzero.core.transform.TensorTransform
Returns sign(input)
Source code in torchzero/modules/ops/unary.py
SignConsistencyLRs ¶
Bases: torchzero.core.transform.TensorTransform
Outputs per-weight learning rates based on consecutive sign consistency.
The learning rate for a weight is multiplied by nplus when two consecutive update signs are the same, otherwise it is multiplied by nminus. The learning rates are bounded to be in the (lb, ub) range.
Examples:¶
GD scaled by consecutive gradient sign consistency
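For instance, a sketch that multiplies the gradient by the generated per-weight learning rates using the Mul binary op (this composition is an assumption; the learning rate is illustrative):

```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Mul(tz.m.SignConsistencyLRs()),
    tz.m.LR(1e-2),
)
```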
Source code in torchzero/modules/adaptive/rprop.py
SignConsistencyMask ¶
Bases: torchzero.core.transform.TensorTransform
Outputs a mask of sign consistency of current and previous inputs.
The output is 0 for weights where input sign changed compared to previous input, 1 otherwise.
Examples:¶
GD that skips update for weights where gradient sign changed compared to previous gradient.
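For instance, a sketch that multiplies the gradient by the 0/1 mask using the Mul binary op (this composition is an assumption; the learning rate is illustrative):

```python
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Mul(tz.m.SignConsistencyMask()),
    tz.m.LR(1e-2),
)
```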
Source code in torchzero/modules/adaptive/rprop.py
SixthOrder3P ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Sixth-order iterative method.
Abro, Hameer Akhtar, and Muhammad Mujtaba Shaikh. "A new time-efficient and convergent nonlinear solver." Applied Mathematics and Computation 355 (2019): 516-536.
Source code in torchzero/modules/second_order/multipoint.py
SixthOrder3PM2 ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.
Source code in torchzero/modules/second_order/multipoint.py
SixthOrder5P ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.
Source code in torchzero/modules/second_order/multipoint.py
SophiaH ¶
Bases: torchzero.core.transform.Transform
SophiaH optimizer from https://arxiv.org/abs/2305.14342
This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized hessian diagonal estimates, and the update is aggressively clipped.
Notes
-
In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the
innerargument if you wish to apply SophiaH preconditioning to another module's output. -
This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backwardargument (refer to documentation).
Parameters:
-
beta1(float, default:0.96) –first momentum. Defaults to 0.96.
-
beta2(float, default:0.99) –momentum for hessian diagonal estimate. Defaults to 0.99.
-
update_freq(int, default:10) –frequency of updating hessian diagonal estimate via a hessian-vector product. Defaults to 10.
-
precond_scale(float, default:1) –scale of the preconditioner. Defaults to 1.
-
clip(float, default:1) –clips update to (-clip, clip). Defaults to 1.
-
eps(float, default:1e-12) –clips hessian diagonal esimate to be no less than this value. Defaults to 1e-12.
-
hvp_method(str, default:'autograd') –Determines how Hessian-vector products are computed.
"batched_autograd"- uses autograd with batched hessian-vector products. If a single hessian-vector is evaluated, equivalent to"autograd". Faster than"autograd"but uses more memory."autograd"- uses autograd hessian-vector products. If multiple hessian-vector products are evaluated, uses a for-loop. Slower than"batched_autograd"but uses less memory."fd_forward"- uses gradient finite difference approximation with a less accurate forward formula which requires one extra gradient evaluation per hessian-vector product."fd_central"- uses gradient finite difference approximation with a more accurate central formula which requires two gradient evaluations per hessian-vector product.
Defaults to
"autograd". -
h(float, default:0.001) –The step size for finite difference if
hvp_methodis"fd_forward"or"fd_central". Defaults to 1e-3. -
n_samples(int, default:1) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed(int | None, default:None) –seed for random vectors. Defaults to None.
-
inner(Chainable | None) –preconditioning is applied to the output of this module. Defaults to None.
Examples:¶
Using SophiaH:
SophiaH preconditioner can be applied to any other module by passing it to the inner argument.
Turn off SophiaH's first momentum to get just the preconditioning. Here is an example of applying
SophiaH preconditioning to nesterov momentum (tz.m.NAG):
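For instance (a sketch; the learning rates are illustrative, disabling the first momentum via beta1=0 is an assumption, and the closure passed to opt.step must accept a backward argument):

```python
# Using SophiaH
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SophiaH(),
    tz.m.LR(1e-3),
)

# Applying SophiaH preconditioning to nesterov momentum (tz.m.NAG),
# with SophiaH's own first momentum turned off
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SophiaH(beta1=0, inner=tz.m.NAG(0.9)),
    tz.m.LR(1e-3),
)
```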
Source code in torchzero/modules/adaptive/sophia_h.py
Split ¶
Bases: torchzero.core.module.Module
Apply true modules to all parameters filtered by filter, apply false modules to all other parameters.
Parameters:
-
filter(Filter, bool]) –a filter that selects tensors to be optimized by
true. - tensor or iterable of tensors (e.g.encoder.parameters()). - function that takes in tensor and outputs a bool (e.g.lambda x: x.ndim >= 2). - a sequence of above (acts as "or", so returns true if any of them is true). -
true(Chainable | None) –modules that are applied to tensors where
filterisTrue. -
false(Chainable | None) –modules that are applied to tensors where
filterisFalse.
Examples:¶
Muon with Adam fallback using same hyperparams as https://github.com/KellerJordan/Muon
opt = tz.Optimizer(
model.parameters(),
tz.m.NAG(0.95),
tz.m.Split(
lambda p: p.ndim >= 2,
true = tz.m.Orthogonalize(),
false = [tz.m.Adam(0.9, 0.95), tz.m.Mul(1/66)],
),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/misc/split.py
Sqrt ¶
Bases: torchzero.core.transform.TensorTransform
Returns sqrt(input)
Source code in torchzero/modules/ops/unary.py
SqrtEMASquared ¶
Bases: torchzero.core.transform.TensorTransform
Maintains an exponential moving average of squared updates, outputs optionally debiased square root.
Parameters:
-
beta(float, default:0.999) –momentum value. Defaults to 0.999.
-
amsgrad(bool, default:False) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
debiased(bool, default:False) –whether to multiply the output by a debiasing term from the Adam method. Defaults to False.
-
pow(float, default:2) –power, absolute value is always used. Defaults to 2.
Methods:
-
SQRT_EMA_SQ_FN–Updates
exp_avg_sq_ with EMA of squared tensors and calculates its square root,
Source code in torchzero/modules/ops/higher_level.py
SQRT_EMA_SQ_FN ¶
SQRT_EMA_SQ_FN(tensors: TensorList, exp_avg_sq_: TensorList, beta: float | NumberList, max_exp_avg_sq_: TensorList | None, debiased: bool, step: int, pow: float = 2, ema_sq_fn: Callable = ema_sq_)
Updates exp_avg_sq_ with EMA of squared tensors and calculates its square root,
with optional AMSGrad and debiasing.
Returns new tensors.
Source code in torchzero/modules/opt_utils.py
SqrtHomotopy ¶
SquareHomotopy ¶
StepSize ¶
Bases: torchzero.core.transform.TensorTransform
This is exactly the same as LR, except the lr parameter can be renamed to any other name to avoid clashes.
Source code in torchzero/modules/step_size/lr.py
StrongWolfe ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Interpolation line search satisfying Strong Wolfe condition.
Parameters:
-
c1(float, default:0.0001) –sufficient descent condition. Defaults to 1e-4.
-
c2(float, default:0.9) –strong curvature condition. For CG set to 0.1. Defaults to 0.9.
-
a_init(str, default:'fixed') –strategy for initializing the initial step size guess. - "fixed" - uses a fixed value specified in
init_valueargument. - "first-order" - assumes first-order change in the function at iterate will be the same as that obtained at the previous step. - "quadratic" - interpolates quadratic to f(x_{-1}) and f_x. - "quadratic-clip" - same as quad, but uses min(1, 1.01*alpha) as described in Numerical Optimization. - "previous" - uses final step size found on previous iteration.For 2nd order methods it is usually best to leave at "fixed". For methods that do not produce well scaled search directions, e.g. conjugate gradient, "first-order" or "quadratic-clip" are recommended. Defaults to 'init'.
-
a_max(float, default:1000000000000.0) –upper bound for the proposed step sizes. Defaults to 1e12.
-
init_value(float, default:1) –initial step size. Used when
a_init="fixed", and with other strategies as fallback value. Defaults to 1. -
maxiter(int, default:25) –maximum number of line search iterations. Defaults to 25.
-
maxzoom(int, default:10) –maximum number of zoom iterations. Defaults to 10.
-
maxeval(int | None, default:None) –maximum number of function evaluations. Defaults to None.
-
tol_change(float, default:1e-09) –tolerance, terminates on small brackets. Defaults to 1e-9.
-
interpolation(str, default:'cubic') –What type of interpolation to use.
- "bisection" - uses the middle point. This is robust, especially if the objective function is non-smooth, but it may need more function evaluations.
- "quadratic" - minimizes a quadratic model, generally outperformed by "cubic".
- "cubic" - minimizes a cubic model; this is the most widely used interpolation strategy.
- "polynomial" - fits a polynomial to all points obtained during the line search.
- "polynomial2" - alternative polynomial fit, where if a point is outside of bounds, a lower degree polynomial is tried. This may have faster convergence than "cubic" and "polynomial".
Defaults to 'cubic'.
-
adaptive(bool, default:True) –if True, the initial step size is halved whenever the line search fails to find a good direction; once a good direction is found, the initial step size is reset to the original value. Defaults to True.
-
fallback(bool, default:False) –if True, when no point satisfies the strong Wolfe criteria, returns a point whose value is lower than the initial value even though the criteria are not satisfied. Defaults to False.
-
plus_minus(bool, default:False) –if True, enables the plus-minus variant, where if curvature is negative, line search is performed in the opposite direction. Defaults to False.
Examples:¶
Conjugate gradient method with strong Wolfe line search. Nocedal and Wright recommend setting c2 to 0.1 for CG. Since CG doesn't produce well-scaled directions, the initial step size can be estimated from function values with a_init="first-order".
opt = tz.Optimizer(
model.parameters(),
tz.m.PolakRibiere(),
tz.m.StrongWolfe(c2=0.1, a_init="first-order")
)
L-BFGS with strong Wolfe line search:
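A minimal sketch, assuming default LBFGS and StrongWolfe settings:
opt = tz.Optimizer(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.StrongWolfe()
)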
Source code in torchzero/modules/line_search/strong_wolfe.py
Sub ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Subtract other from tensors. other can be a number or a module.
If other is a module, this calculates :code:tensors - other(tensors)
Source code in torchzero/modules/ops/binary.py
SubModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates input - other. input and other can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
SubspaceNewton ¶
Bases: torchzero.core.transform.Transform
Subspace Newton. Performs a Newton step in a subspace (random or spanned by past gradients).
Parameters:
-
sketch_size(int, default:100) –size of the random sketch. This many hessian-vector products will need to be evaluated each step.
-
sketch_type(str, default:'common_directions') –
- "common_directions" - uses past steepest descent directions as the basis [2], orthonormalized online using Gram-Schmidt (default).
- "orthonormal" - random orthonormal basis. Orthonormality is necessary to use linear-operator-based modules such as trust region, but it can be slower to compute.
- "rows" - samples random rows.
- "topk" - samples the rows with the largest gradient magnitudes.
- "rademacher" - scaled random Rademacher basis, approximately orthonormal when the dimension is large.
- "mixed" - random orthonormal basis, but with four directions set to the gradient, slow and fast gradient EMAs, and the previous update direction.
-
damping(float, default:0) –hessian damping (scale of identity matrix added to hessian). Defaults to 0.
-
hvp_method(str, default:'batched_autograd') –how to compute hessian-matrix products:
- "batched_autograd" - uses batched autograd.
- "autograd" - uses unbatched autograd.
- "forward" - uses finite differences with the forward formula, performing 1 backward pass per Hvp.
- "central" - uses finite differences with the more accurate central formula, performing 2 backward passes per Hvp.
Defaults to "batched_autograd".
-
h(float, default:0.01) –finite difference step size. Defaults to 1e-2.
-
use_lstsq(bool, default:False) –whether to use least squares to solve
Hx=g. Defaults to False. -
update_freq(int, default:1) –frequency of updating the hessian. Defaults to 1.
-
H_tfm(Callable | None) –optional hessian transform. Takes two arguments, (hessian, gradient), and must return either a tuple (hessian, is_inverted), where is_inverted must be True if the transform inverted the hessian and False otherwise, or a single tensor which is used as the update.
Defaults to None.
-
eigval_fn(Callable | None, default:None) –optional eigenvalues transform, for example
torch.absorlambda L: torch.clip(L, min=1e-8). If this is specified, eigendecomposition will be used to invert the hessian. -
seed(int | None, default:None) –seed for random generator. Defaults to None.
-
inner(Chainable | None, default:None) –preconditions output of this module. Defaults to None.
Examples¶
RSN with line search
RSN with trust region
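Minimal sketches for the two examples above. Default SubspaceNewton settings are assumed, and the trust-region variant assumes that SubspaceNewton with an orthonormal sketch can be passed as the hess_module of tz.m.TrustCG (see the TrustCG entry below):
# RSN with a backtracking line search
opt = tz.Optimizer(
    model.parameters(),
    tz.m.SubspaceNewton(),
    tz.m.Backtracking(),
)
# RSN with a trust region
opt = tz.Optimizer(
    model.parameters(),
    tz.m.TrustCG(hess_module=tz.m.SubspaceNewton(sketch_type="orthonormal")),
)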
References
- Gower, Robert, et al. "RSN: randomized subspace Newton." Advances in Neural Information Processing Systems 32 (2019).
- Wang, Po-Wei, Ching-pei Lee, and Chih-Jen Lin. "The common-directions method for regularized empirical risk minimization." Journal of Machine Learning Research 20.58 (2019): 1-49.
Source code in torchzero/modules/second_order/rsn.py
Sum ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs sum of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
SumOfSquares ¶
Bases: torchzero.core.transform.Transform
Sets loss to be the sum of squares of values returned by the closure.
This is meant to be used to test least squares methods against ordinary minimization methods.
To use this, the closure should return a vector of values whose sum of squares is to be minimized.
The closure must still accept the backward argument; it will always be passed as False, but it is required.
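A minimal sketch of such a closure, assuming the usual closure-based opt.step(closure) interface and hypothetical model, inputs and targets tensors:
def closure(backward=True):
    # returns a vector of residuals; SumOfSquares turns them into a scalar
    # sum-of-squares loss, and backward is always passed as False
    return model(inputs) - targets

opt = tz.Optimizer(
    model.parameters(),
    tz.m.SumOfSquares(),
    tz.m.Adam(),
    tz.m.LR(1e-2),
)
opt.step(closure)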
Source code in torchzero/modules/least_squares/gn.py
Switch ¶
Bases: torchzero.modules.misc.switch.Alternate
Switches to the next module after steps steps.
Parameters:
-
steps(int | Iterable[int]) –Number of steps to perform with each module.
Examples:¶
Start with Adam, switch to L-BFGS after the 1000th step, and to truncated Newton after the 2000th step.
opt = tz.Optimizer(
model.parameters(),
tz.m.Switch(
[tz.m.Adam(), tz.m.LR(1e-3)],
[tz.m.LBFGS(), tz.m.Backtracking()],
[tz.m.NewtonCG(maxiter=20), tz.m.Backtracking()],
steps = (1000, 2000)
)
)
Source code in torchzero/modules/misc/switch.py
LOOP
class-attribute
¶
TerminateAfterNEvaluations ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAfterNSeconds ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAfterNSteps ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAll ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAny ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateByGradientNorm ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateByUpdateNorm ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
update is calculated as parameter difference
Source code in torchzero/modules/termination/termination.py
TerminateNever ¶
TerminateOnLossReached ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateOnNoImprovement ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminationCriteriaBase ¶
Bases: torchzero.core.module.Module
Source code in torchzero/modules/termination/termination.py
ThomasOptimalMethod ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Thomas's "optimal" Quasi-Newton method.
Note
a line search is recommended (a sketch is shown below the reference).
Warning
this uses at least O(N^2) memory.
Reference
Thomas, Stephen Walter. Sequential estimation techniques for quasi-Newton algorithms. Cornell University, 1975.
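A minimal sketch with a backtracking line search (any line search module should work):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.ThomasOptimalMethod(),
    tz.m.Backtracking(),
)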
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Threshold ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs tensors thresholded such that values above threshold are set to value.
Source code in torchzero/modules/ops/binary.py
To ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Cast modules to specified device and dtype
Source code in torchzero/modules/projections/cast.py
TrustCG ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Trust region via Steihaug-Toint Conjugate Gradient method.
Note
If you wish to use the exact hessian, use the matrix-free tz.m.NewtonCGSteihaug, which only uses hessian-vector products. While passing tz.m.Newton to this module is possible, it is usually less efficient.
Parameters:
-
hess_module(Module | None) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newtonandtz.m.GaussNewton. When using quasi-newton methods, setinverse=Falsewhen constructing them. -
eta(float, default:0.0) –if the ratio of actual to predicted reduction is larger than this, the step is accepted. When
hess_module is GaussNewton, this can be set to 0. Defaults to 0. -
nplus(float, default:3.5) –increase factor on successful steps. Defaults to 3.5.
-
nminus(float, default:0.25) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good(float, default:0.99) –if the ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus. -
rho_bad(float, default:0.0001) –if the ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus. -
init(float, default:1) –Initial trust region value. Defaults to 1.
-
update_freq(int, default:1) –frequency of updating the hessian. Defaults to 1.
-
reg(int, default:0) –regularization parameter for conjugate gradient. Defaults to 0.
-
max_attempts(int, default:10) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
boundary_tol(float | None, default:1e-06) –the trust region is only increased when the suggested step's norm is at least
(1-boundary_tol)*trust_region. This prevents increasing the trust region when the solution is not on the boundary. Defaults to 1e-6. -
prefer_exact(bool, default:True) –when exact solution can be easily calculated without CG (e.g. hessian is stored as scaled identity), uses the exact solution. If False, always uses CG. Defaults to True.
-
inner(Chainable | None, default:None) –preconditioning is applied to the output of this module. Defaults to None.
Examples:¶
Trust-SR1
opt = tz.Optimizer(
model.parameters(),
tz.m.TrustCG(hess_module=tz.m.SR1(inverse=False)),
)
Source code in torchzero/modules/trust_region/trust_cg.py
TrustRegionBase ¶
Bases: torchzero.core.module.Module, abc.ABC
Methods:
-
trust_region_apply–Solves the trust region subproblem and outputs
Objectivewith the solution direction. -
trust_region_update–updates the state of this module after H or B have been updated, if necessary
-
trust_solve–Solve Hx=g with a trust region penalty/bound defined by
radius
Source code in torchzero/modules/trust_region/trust_region.py
trust_region_apply ¶
trust_region_apply(objective: Objective, tensors: list[Tensor], H: LinearOperator | None) -> Objective
Solves the trust region subproblem and outputs Objective with the solution direction.
Source code in torchzero/modules/trust_region/trust_region.py
trust_region_update ¶
updates the state of this module after H or B have been updated, if necessary
trust_solve ¶
trust_solve(f: float, g: Tensor, H: LinearOperator, radius: float, params: list[Tensor], closure: Callable, settings: Mapping[str, Any]) -> Tensor
Solve Hx=g with a trust region penalty/bound defined by radius
Source code in torchzero/modules/trust_region/trust_region.py
TwoPointNewton ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
two-point Newton method with frozen derivative and third-order convergence.
Sharma, Janak Raj, and Deepak Kumar. "A fast and efficient composite Newton–Chebyshev method for systems of nonlinear equations." Journal of Complexity 49 (2018): 56-73.
Source code in torchzero/modules/second_order/multipoint.py
UnaryLambda ¶
Bases: torchzero.core.transform.TensorTransform
Applies fn to input tensors.
fn must accept and return a list of tensors.
Source code in torchzero/modules/ops/unary.py
UnaryParameterwiseLambda ¶
Bases: torchzero.core.transform.TensorTransform
Applies fn to each input tensor.
fn must accept and return a tensor.
Source code in torchzero/modules/ops/unary.py
Uniform ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with random numbers from uniform distribution between low and high.
Source code in torchzero/modules/ops/utility.py
UpdateGradientSignConsistency ¶
Bases: torchzero.core.transform.TensorTransform
Compares update and gradient signs. Output will have 1s where signs match, and 0s where they don't.
Parameters:
-
normalize(bool, default:False) –renormalize update after masking. Defaults to False.
-
eps(float, default:1e-06) –epsilon for normalization. Defaults to 1e-6.
Source code in torchzero/modules/momentum/cautious.py
UpdateSign ¶
Bases: torchzero.core.transform.TensorTransform
Outputs gradient with sign copied from the update.
Source code in torchzero/modules/misc/misc.py
UpdateToNone ¶
Bases: torchzero.core.module.Module
Sets update attribute to None on var.
Source code in torchzero/modules/ops/utility.py
VectorProjection ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
projection that concatenates all parameters into a vector
Source code in torchzero/modules/projections/projection.py
ViewAsReal ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
View complex tensors as real tensors. Tensors that are already real are unaffected.
Source code in torchzero/modules/projections/cast.py
Warmup ¶
Bases: torchzero.core.transform.TensorTransform
Learning rate warmup, linearly increases learning rate multiplier from start_lr to end_lr over steps steps.
Parameters:
-
steps(int, default:100) –number of steps to perform warmup for. Defaults to 100.
-
start_lr(float, default:1e-05) –initial learning rate multiplier on the first step. Defaults to 1e-5.
-
end_lr(float, default:1) –learning rate multiplier at the end and after warmup. Defaults to 1.
Example
Adam with 1000 steps warmup
opt = tz.Optimizer(
model.parameters(),
tz.m.Adam(),
tz.m.LR(1e-2),
tz.m.Warmup(steps=1000)
)
Source code in torchzero/modules/step_size/lr.py
WarmupNormClip ¶
Bases: torchzero.core.transform.TensorTransform
Warmup via clipping of the update norm.
Parameters:
-
start_norm(float, default:1e-05) –maximal norm on the first step. Defaults to 1e-5.
-
end_norm(float, default:1) –maximal norm on the last step. After that, norm clipping is disabled. Defaults to 1.
-
steps(int, default:100) –number of steps to perform warmup for. Defaults to 100.
Example
Adam with 1000 steps norm clip warmup
opt = tz.Optimizer(
model.parameters(),
tz.m.Adam(),
tz.m.WarmupNormClip(steps=1000),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/step_size/lr.py
WeightDecay ¶
Bases: torchzero.core.transform.TensorTransform
Weight decay.
Parameters:
-
weight_decay(float) –weight decay scale.
-
ord(int, default:2) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
-
target(Target) –what to set on var. Defaults to 'update'.
Examples:¶
Adam with non-decoupled weight decay
Adam with decoupled weight decay that still scales with learning rate
Adam with fully decoupled weight decay that doesn't scale with learning rate
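Minimal sketches for the three examples above, assuming that decoupling is controlled by where WeightDecay is placed relative to Adam and LR in the module chain:
# non-decoupled: the penalty is added before Adam, so it is preconditioned like the gradient
opt = tz.Optimizer(
    model.parameters(),
    tz.m.WeightDecay(1e-2),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
# decoupled but still scaled by the learning rate: applied after Adam, before LR
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.WeightDecay(1e-2),
    tz.m.LR(1e-3),
)
# fully decoupled: applied after LR, so it does not scale with the learning rate
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
    tz.m.WeightDecay(1e-5),
)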
Source code in torchzero/modules/weight_decay/weight_decay.py
WeightDropout ¶
Bases: torchzero.core.module.Module
Changes the closure so that it evaluates loss and gradients with random weights replaced with 0.
Dropout can be disabled for a parameter by setting use_dropout=False in corresponding parameter group.
Parameters:
-
p(float, default:0.5) –probability that any weight is replaced with 0. Defaults to 0.5.
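A minimal sketch, assuming WeightDropout is placed first so that downstream modules see gradients evaluated with dropped weights:
opt = tz.Optimizer(
    model.parameters(),
    tz.m.WeightDropout(p=0.5),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)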
Source code in torchzero/modules/misc/regularization.py
WeightedAveraging ¶
Bases: torchzero.core.transform.TensorTransform
Weighted average of past len(weights) updates.
Parameters:
-
weights(Sequence[float]) –a sequence of weights from oldest to newest.
-
target(Target) –target. Defaults to 'update'.
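A minimal sketch that averages the last four Adam updates with linearly increasing weights (ordered oldest to newest):
opt = tz.Optimizer(
    model.parameters(),
    tz.m.Adam(),
    tz.m.WeightedAveraging(weights=[0.1, 0.2, 0.3, 0.4]),
    tz.m.LR(1e-2),
)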
Source code in torchzero/modules/momentum/averaging.py
WeightedMean ¶
Bases: torchzero.modules.ops.reduce.WeightedSum
Outputs weighted mean of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
WeightedSum ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs a weighted sum of inputs that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
Wrap ¶
Bases: torchzero.core.module.Module
Wraps a pytorch optimizer to use it as a module.
Note
Custom param groups are supported only via set_param_groups; settings passed to the Optimizer are applied to all parameters.
Parameters:
-
opt_fn(Callable[..., Optimizer] | Optimizer) –function that takes in parameters and returns the optimizer, for example
torch.optim.Adamorlambda parameters: torch.optim.Adam(parameters, lr=1e-3) -
*args– -
**kwargs–Extra args to be passed to opt_fn. The function is called as
opt_fn(parameters, *args, **kwargs). -
use_param_groups(bool, default:True) –whether to pass settings given to the Optimizer on to the wrapped optimizer.
Note that the settings of the first parameter are used for all parameters, so per-parameter settings will be ignored.
Example:¶
wrapping pytorch_optimizer.StableAdamW
from pytorch_optimizer import StableAdamW
opt = tz.Optimizer(
model.parameters(),
tz.m.Wrap(StableAdamW, lr=1),
tz.m.Cautious(),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/wrappers/optim_wrapper.py
Zeros ¶
Bases: torchzero.core.module.Module
Outputs zeros
Source code in torchzero/modules/ops/utility.py
clip_grad_norm_ ¶
clip_grad_norm_(params: Iterable[Tensor], max_norm: float | None, ord: Union[Literal['mad', 'std', 'var', 'sum', 'l0', 'l1', 'l2', 'l3', 'l4', 'linf'], float, Tensor] = 2, dim: Union[int, Sequence[int], Literal['global'], NoneType] = None, inverse_dims: bool = False, min_size: int = 2, min_norm: float | None = None)
Clips gradient of an iterable of parameters to specified norm value. Gradients are modified in-place.
Parameters:
-
params(Iterable[Tensor]) –parameters with gradients to clip.
-
max_norm(float) –value to clip norm to.
-
ord(float, default:2) –norm order. Defaults to 2.
-
dim(int | Sequence[int] | str | None, default:None) –calculates the norm along those dimensions. If a list/tuple, tensors are normalized along all dimensions in
dim that they have. Can be set to "global" to normalize by the global norm of all gradients concatenated into a vector. Defaults to None. -
min_size(int, default:2) –minimal size of a dimension to normalize along it. Defaults to 2.
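A minimal usage sketch. The import path is an assumption based on the source file location, and model, compute_loss and optimizer are hypothetical:
from torchzero.modules.clipping.clipping import clip_grad_norm_

loss = compute_loss(model)  # hypothetical loss computation
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)  # clips gradient norms in-place
optimizer.step()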
Source code in torchzero/modules/clipping/clipping.py
clip_grad_value_ ¶
Clips gradient of an iterable of parameters at specified value. Gradients are modified in-place.
Parameters:
-
params(Iterable[Tensor]) –iterable of tensors with gradients to clip.
-
value(float | int) –maximum allowed value of gradients.
Source code in torchzero/modules/clipping/clipping.py
decay_weights_ ¶
directly decays weights in-place
Source code in torchzero/modules/weight_decay/weight_decay.py
normalize_grads_ ¶
normalize_grads_(params: Iterable[Tensor], norm_value: float, ord: Union[Literal['mad', 'std', 'var', 'sum', 'l0', 'l1', 'l2', 'l3', 'l4', 'linf'], float, Tensor] = 2, dim: Union[int, Sequence[int], Literal['global'], NoneType] = None, inverse_dims: bool = False, min_size: int = 1)
Normalizes gradient of an iterable of parameters to specified norm value. Gradients are modified in-place.
Parameters:
-
params(Iterable[Tensor]) –parameters with gradients to normalize.
-
norm_value(float) –value to normalize the norm to.
-
ord(float, default:2) –norm order. Defaults to 2.
-
dim(int | Sequence[int] | str | None, default:None) –calculates the norm along those dimensions. If a list/tuple, tensors are normalized along all dimensions in
dim that they have. Can be set to "global" to normalize by the global norm of all gradients concatenated into a vector. Defaults to None. -
inverse_dims(bool, default:False) –if True, the
dimsargument is inverted, and all other dimensions are normalized. -
min_size(int, default:1) –minimal size of a dimension to normalize along it. Defaults to 1.
Source code in torchzero/modules/clipping/clipping.py
orthogonalize_grads_ ¶
orthogonalize_grads_(params: Iterable[Tensor], dual_norm_correction=False, method: Literal['newtonschulz', 'ns5', 'polar_express', 'svd', 'qr', 'eigh'] = 'newtonschulz', channel_first: bool = True)
Computes the zeroth power / orthogonalization of gradients of an iterable of parameters.
This sets gradients in-place. Applies along first 2 dims (expected to be out_channels, in_channels).
Note that the Muon page says that embeddings and classifier heads should not be orthogonalized.
Parameters:
-
params(Iterable[Tensor]) –parameters that hold gradients to orthogonalize.
-
dual_norm_correction(bool, default:False) –enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
-
method(str, default:'newtonschulz') –Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.
-
channel_first(bool, default:True) –if True, orthogonalizes along the first two dimensions, otherwise along the last two. Other dimensions are considered batch dimensions.
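A minimal usage sketch. The import path is an assumption based on the source file location, and model, compute_loss and optimizer are hypothetical:
from torchzero.modules.adaptive.muon import orthogonalize_grads_

loss = compute_loss(model)  # hypothetical loss computation
loss.backward()
orthogonalize_grads_(model.parameters())  # orthogonalizes gradients in-place via Newton-Schulz
optimizer.step()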
Source code in torchzero/modules/adaptive/muon.py
orthograd_ ¶
Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
Parameters:
-
params(Iterable[Tensor]) –parameters that hold gradients to apply ⟂Grad to.
-
eps(float, default:1e-30) –epsilon added to the denominator for numerical stability (default: 1e-30)
reference https://arxiv.org/abs/2501.04697