List of all modules¶
A somewhat categorized list of modules is also available in Modules
Classes:
- AEGD – AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.
- ASAM – Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52
- Abs – Returns `abs(input)`
- AccumulateMaximum – Accumulates maximum of all past updates.
- AccumulateMean – Accumulates mean of all past updates.
- AccumulateMinimum – Accumulates minimum of all past updates.
- AccumulateProduct – Accumulates product of all past updates.
- AccumulateSum – Accumulates sum of all past updates.
- AdGD – AdGD and AdGD-2 (https://arxiv.org/abs/2308.02261)
- AdaHessian – AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)
- Adagrad – Adagrad, divides by sum of past squares of gradients.
- AdagradNorm – Adagrad-Norm, divides by sum of past means of squares of gradients.
- Adam – Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.
- Adan – Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677
- AdaptiveBacktracking – Adaptive backtracking line search. After each line search procedure, a new initial step size is set such that the optimal step size would be found on the second line search iteration.
- AdaptiveHeavyBall – Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.
- AdaptiveTracking – A line search that evaluates the previous step size; if the value increased, it backtracks until the value stops decreasing, otherwise it forward-tracks until the value stops decreasing.
- Add – Add `other` to tensors. `other` can be a number or a module.
- Alternate – Alternates between stepping with `modules`.
- Averaging – Average of past `history_size` updates.
- BBStab – Stabilized Barzilai-Borwein method (https://arxiv.org/abs/1907.06409).
- BFGS – Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.
- BacktrackOnSignChange – Negates or undoes update for parameters where gradient or update sign changes.
- Backtracking – Backtracking line search.
- BarzilaiBorwein – Barzilai-Borwein step size method.
- BinaryOperationBase – Base class for operations that use update as the first operand. This is an abstract class, subclass it and override the `transform` method to use it.
- BirginMartinezRestart – The restart criterion for conjugate gradient methods designed by Birgin and Martinez.
- BroydenBad – Broyden's "bad" Quasi-Newton method.
- BroydenGood – Broyden's "good" Quasi-Newton method.
- CCD – Cumulative coordinate descent. This updates one gradient coordinate at a time and accumulates it to the update direction.
- CCDLS – CCD with line search instead of adaptive step size.
- CD – Coordinate descent. Proposes a descent direction along a single coordinate.
- Cautious – Negates update for parameters where update and gradient sign is inconsistent.
- CenteredEMASquared – Maintains a centered exponential moving average of squared updates.
- CenteredSqrtEMASquared – Maintains a centered exponential moving average of squared updates, outputs optionally debiased square root.
- Centralize – Centralizes the update.
- Clip – Clip tensors to be in `(min, max)` range. `min` and `max` can be None, numbers or modules.
- ClipModules – Calculates `input(tensors).clip(min, max)`. `min` and `max` can be numbers or modules.
- ClipNorm – Clips update norm to be no larger than `value`.
- ClipNormByEMA – Clips norm to be no larger than the norm of an exponential moving average of past updates.
- ClipNormGrowth – Clips update norm growth.
- ClipValue – Clips update magnitude to be within `(-value, value)` range.
- ClipValueByEMA – Clips magnitude of update to be no larger than magnitude of exponential moving average of past (unclipped) updates.
- ClipValueGrowth – Clips update value magnitude growth.
- Clone – Clones input. May be useful to store some intermediate result and make sure it doesn't get affected by in-place operations.
- ConjugateDescent – Conjugate Descent (CD).
- CopyMagnitude – Returns `other(tensors)` with sign copied from tensors.
- CopySign – Returns tensors with sign copied from `other(tensors)`.
- CubicRegularization – Cubic regularization.
- CustomUnaryOperation – Applies `getattr(tensor, name)` to each tensor.
- DFP – Davidon–Fletcher–Powell Quasi-Newton method.
- DNRTR – Diagonal quasi-newton method.
- DYHS – Dai-Yuan - Hestenes–Stiefel hybrid conjugate gradient method.
- DaiYuan – Dai–Yuan nonlinear conjugate gradient method.
- Debias – Multiplies the update by an Adam debiasing term based on first and/or second momentum.
- Debias2 – Multiplies the update by an Adam debiasing term based on the second momentum.
- DiagonalBFGS – Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
- DiagonalQuasiCauchi – Diagonal quasi-Cauchy method.
- DiagonalSR1 – Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
- DiagonalWeightedQuasiCauchi – Diagonal quasi-Cauchy method.
- DirectWeightDecay – Directly applies weight decay to parameters.
- Div – Divide tensors by `other`. `other` can be a number or a module.
- DivByLoss – Divides update by loss times `alpha`.
- DivModules – Calculates `input / other`. `input` and `other` can be numbers or modules.
- Dogleg – Dogleg trust region algorithm.
- Dropout – Applies dropout to the update.
- DualNormCorrection – Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).
- EMA – Maintains an exponential moving average of update.
- EMASquared – Maintains an exponential moving average of squared updates.
- ESGD – Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)
- EscapeAnnealing – If parameters stop changing, this runs a backward annealing random search.
- Exp – Returns `exp(input)`
- ExpHomotopy
- FDM – Approximate gradients via finite difference method.
- Fill – Outputs tensors filled with `value`.
- FillLoss – Outputs tensors filled with loss value times `alpha`.
- FletcherReeves – Fletcher–Reeves nonlinear conjugate gradient method.
- FletcherVMM – Fletcher's variable metric Quasi-Newton method.
- ForwardGradient – Forward gradient method.
- FullMatrixAdagrad – Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).
- GaussNewton – Gauss-Newton method.
- GaussianSmoothing – Gradient approximation via Gaussian smoothing method.
- Grad – Outputs the gradient.
- GradApproximator – Base class for gradient approximations.
- GradSign – Copies gradient sign to update.
- GradToNone – Sets `grad` attribute to None on `var`.
- GradientAccumulation – Uses `n` steps to accumulate gradients; after `n` gradients have been accumulated, they are passed to `modules` and parameters are updated.
- GradientCorrection – Estimates gradient at minima along search direction assuming function is quadratic.
- GradientSampling – Samples and aggregates gradients and values at perturbed points.
- Graft – Outputs tensors rescaled to have the same norm as `magnitude(tensors)`.
- GraftGradToUpdate – Outputs gradient grafted to update, that is gradient rescaled to have the same norm as the update.
- GraftModules – Outputs `direction` output rescaled to have the same norm as `magnitude` output.
- GraftToGrad – Grafts update to the gradient, that is update is rescaled to have the same norm as the gradient.
- GraftToParams – Grafts update to the parameters, that is update is rescaled to have the same norm as the parameters, but no smaller than `eps`.
- GraftToUpdate – Outputs `magnitude(tensors)` rescaled to have the same norm as tensors.
- GramSchimdt – Outputs tensors made orthogonal to `other(tensors)` via Gram-Schmidt.
- Greenstadt1 – Greenstadt's first Quasi-Newton method.
- Greenstadt2 – Greenstadt's second Quasi-Newton method.
- HagerZhang – Hager-Zhang nonlinear conjugate gradient method.
- HeavyBall – Polyak's momentum (heavy-ball method).
- HestenesStiefel – Hestenes–Stiefel nonlinear conjugate gradient method.
- HigherOrderNewton – A basic arbitrary order newton's method with optional trust region and proximal penalty.
- Horisho – Horisho's variable metric Quasi-Newton method.
- HpuEstimate – Returns `y/||s||`, where `y` is the difference between current and previous update (gradient), and `s` is the difference between current and previous parameters. The returned tensors are a finite difference approximation to the Hessian times the previous update.
- ICUM – Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods.
- Identity – Identity operator that is argument-insensitive. This can also be used as identity hessian for trust region methods.
- IntermoduleCautious – Negates update on the `main` module where its sign doesn't match the output of the `compare` module.
- InverseFreeNewton – Inverse-free Newton's method.
- LBFGS – Limited-memory BFGS algorithm. A line search or trust region is recommended.
- LMAdagrad – Limited-memory full matrix Adagrad.
- LR – Learning rate. Adding this module also adds support for LR schedulers.
- LSR1 – Limited-memory SR1 algorithm. A line search or trust region is recommended.
- LambdaHomotopy
- LaplacianSmoothing – Applies laplacian smoothing via a fast Fourier transform solver which can improve generalization.
- LastAbsoluteRatio – Outputs ratio between absolute values of past two updates; the numerator is determined by the `numerator` argument.
- LastDifference – Outputs difference between past two updates.
- LastGradDifference – Outputs difference between past two gradients.
- LastProduct – Outputs product of past two updates.
- LastRatio – Outputs ratio between past two updates; the numerator is determined by the `numerator` argument.
- LerpModules – Does a linear interpolation of `input(tensors)` and `end(tensors)` based on a scalar `weight`.
- LevenbergMarquardt – Levenberg-Marquardt trust region algorithm.
- LineSearchBase – Base class for line searches.
- Lion – Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.
- LiuStorey – Liu-Storey nonlinear conjugate gradient method.
- LogHomotopy
- MARSCorrection – MARS variance reduction correction.
- MSAM – Momentum-SAM from https://arxiv.org/pdf/2401.12033.
- MSAMObjective – Momentum-SAM from https://arxiv.org/pdf/2401.12033.
- MatrixMomentum – Second order momentum method.
- Maximum – Outputs `maximum(tensors, other(tensors))`
- MaximumModules – Outputs elementwise maximum of `inputs` that can be modules or numbers.
- McCormick – McCormick's Quasi-Newton method.
- MeZO – Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
- Mean – Outputs a mean of `inputs` that can be modules or numbers.
- MedianAveraging – Median of past `history_size` updates.
- Minimum – Outputs `minimum(tensors, other(tensors))`
- MinimumModules – Outputs elementwise minimum of `inputs` that can be modules or numbers.
- Mul – Multiply tensors by `other`. `other` can be a number or a module.
- MulByLoss – Multiplies update by loss times `alpha`.
- MultiOperationBase – Base class for operations that use operands. This is an abstract class, subclass it and override the `transform` method to use it.
- Multistep – Performs `steps` inner steps with `module` per each step.
- MuonAdjustLR – LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
- NAG – Nesterov accelerated gradient method (nesterov momentum).
- NanToNum – Convert `nan`, `inf` and `-inf` to numbers.
- NaturalGradient – Natural gradient approximated via empirical fisher information matrix.
- Negate – Returns `-input`
- NegateOnLossIncrease – Uses an extra forward pass to evaluate loss at `parameters+update`.
- NewDQN – Diagonal quasi-newton method.
- NewSSM – Self-scaling Quasi-Newton method.
- Newton – Exact Newton's method via autograd.
- NewtonCG – Newton's method with a matrix-free conjugate gradient or minimal-residual solver.
- NewtonCGSteihaug – Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.
- NoiseSign – Outputs random tensors with sign copied from the update.
- Noop – Identity operator that is argument-insensitive. This can also be used as identity hessian for trust region methods.
- Normalize – Normalizes the update.
- NormalizeByEMA – Sets norm of the update to be the same as the norm of an exponential moving average of past updates.
- NystromPCG – Newton's method with a Nyström-preconditioned conjugate gradient solver.
- NystromSketchAndSolve – Newton's method with a Nyström sketch-and-solve solver.
- Ones – Outputs ones.
- Online – Allows certain modules to be used for mini-batch optimization.
- OrthoGrad – Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
- Orthogonalize – Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.
- PSB – Powell's Symmetric Broyden Quasi-Newton method.
- Params – Outputs parameters.
- Pearson – Pearson's Quasi-Newton method.
- PerturbWeights – Changes the closure so that it evaluates loss and gradients at weights perturbed by a random perturbation.
- PolakRibiere – Polak-Ribière-Polyak nonlinear conjugate gradient method.
- PolyakStepSize – Polyak's subgradient method with known or unknown f*.
- Pow – Take tensors to the power of `exponent`. `exponent` can be a number or a module.
- PowModules – Calculates `input ** exponent`. `input` and `exponent` can be numbers or modules.
- PowellRestart – Powell's two restarting criteria for conjugate gradient methods.
- Previous – Maintains an update from n steps back, for example if n=1, returns previous update.
- PrintLoss – Prints var.get_loss().
- PrintParams – Prints current parameters.
- PrintShape – Prints shapes of the update.
- PrintUpdate – Prints current update.
- Prod – Outputs product of `inputs` that can be modules or numbers.
- ProjectedGradientMethod – Projected gradient method. Directly projects the gradient onto subspace conjugate to past directions.
- ProjectedNewtonRaphson – Projected Newton-Raphson method.
- ProjectionBase – Base class for projections.
- RCopySign – Returns `other(tensors)` with sign copied from tensors.
- RDSA – Gradient approximation via Random-direction stochastic approximation (RDSA) method.
- RDiv – Divide `other` by tensors. `other` can be a number or a module.
- RGraft – Outputs `magnitude(tensors)` rescaled to have the same norm as tensors.
- RMSprop – Divides gradient by EMA of gradient squares.
- RPow – Take `other` to the power of tensors. `other` can be a number or a module.
- RSub – Subtract tensors from `other`. `other` can be a number or a module.
- Randn – Outputs tensors filled with random numbers from a normal distribution with mean 0 and variance 1.
- RandomHvp – Returns a hessian-vector product with a random vector.
- RandomSample – Outputs tensors filled with random numbers from a distribution depending on the value of `distribution`.
- RandomStepSize – Uses random global or layer-wise step size from `low` to `high`.
- RandomizedFDM – Gradient approximation via a randomized finite-difference method.
- Reciprocal – Returns `1 / input`
- ReduceOperationBase – Base class for reduction operations like Sum, Prod, Maximum. This is an abstract class, subclass it and override the `transform` method to use it.
- Relative – Multiplies update by absolute parameter values to make it relative to their magnitude; `min_value` is the minimum allowed value to avoid getting stuck at 0.
- RelativeWeightDecay – Weight decay relative to the mean absolute value of update, gradient or parameters, depending on the value of the `norm_input` argument.
- RestartEvery – Resets the state every n steps.
- RestartOnStuck – Resets the state when update (difference in parameters) is zero for multiple steps in a row.
- RestartStrategyBase – Base class for restart strategies.
- Rprop – Resilient propagation. The update magnitude gets multiplied by `nplus` if gradient didn't change the sign.
- SAM – Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412
- SOAP – SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) from https://arxiv.org/abs/2409.11321.
- SPSA – Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
- SR1 – Symmetric Rank 1. This works best with a trust region.
- SSVM – Self-scaling variable metric Quasi-Newton method.
- SVRG – Stochastic variance reduced gradient method (SVRG).
- SaveBest – Saves best parameters found so far, ones that have lowest loss. Put this as the last module.
- ScalarProjection – Projection that splits all parameters into individual scalars.
- ScaleByGradCosineSimilarity – Multiplies the update by cosine similarity with gradient.
- ScaleLRBySignChange – Learning rate gets multiplied by `nplus` if ascent/gradient didn't change the sign.
- ScaleModulesByCosineSimilarity – Scales the output of the `main` module by its cosine similarity to the output of another module.
- ScipyMinimizeScalar – Line search via `scipy.optimize.minimize_scalar`, which implements brent, golden search and bounded brent methods.
- Sequential – On each step, this sequentially steps with `modules` `steps` times.
- Shampoo – Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).
- ShorR – Shor's r-algorithm.
- Sign – Returns `sign(input)`
- SignConsistencyLRs – Outputs per-weight learning rates based on consecutive sign consistency.
- SignConsistencyMask – Outputs a mask of sign consistency of current and previous inputs.
- SixthOrder3P – Sixth-order iterative method.
- SixthOrder3PM2 – Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.
- SixthOrder5P – Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.
- SophiaH – SophiaH optimizer from https://arxiv.org/abs/2305.14342
- Split – Apply `true` modules to all parameters filtered by `filter`, apply `false` modules to all other parameters.
- Sqrt – Returns `sqrt(input)`
- SqrtEMASquared – Maintains an exponential moving average of squared updates, outputs optionally debiased square root.
- SqrtHomotopy
- SquareHomotopy
- StepSize – This is exactly the same as LR, except the `lr` parameter can be renamed to any other name to avoid clashes.
- StrongWolfe – Interpolation line search satisfying the Strong Wolfe condition.
- Sub – Subtract `other` from tensors. `other` can be a number or a module.
- SubModules – Calculates `input - other`. `input` and `other` can be numbers or modules.
- Sum – Outputs sum of `inputs` that can be modules or numbers.
- SumOfSquares – Sets loss to be the sum of squares of values returned by the closure.
- Switch – After `steps` steps switches to the next module.
- TerminateAfterNEvaluations
- TerminateAfterNSeconds
- TerminateAfterNSteps
- TerminateAll
- TerminateAny
- TerminateByGradientNorm
- TerminateByUpdateNorm – Update is calculated as parameter difference.
- TerminateNever
- TerminateOnLossReached
- TerminateOnNoImprovement
- TerminationCriteriaBase
- ThomasOptimalMethod – Thomas's "optimal" Quasi-Newton method.
- Threshold – Outputs tensors thresholded such that values above `threshold` are set to `value`.
- To – Cast modules to specified device and dtype.
- TrustCG – Trust region via Steihaug-Toint Conjugate Gradient method.
- TrustRegionBase
- TwoPointNewton – Two-point Newton method with frozen derivative with third order convergence.
- UnaryLambda – Applies `fn` to input tensors.
- UnaryParameterwiseLambda – Applies `fn` to each input tensor.
- Uniform – Outputs tensors filled with random numbers from a uniform distribution between `low` and `high`.
- UpdateGradientSignConsistency – Compares update and gradient signs. Output will have 1s where signs match, and 0s where they don't.
- UpdateSign – Outputs gradient with sign copied from the update.
- UpdateToNone – Sets `update` attribute to None on `var`.
- VectorProjection – Projection that concatenates all parameters into a vector.
- ViewAsReal – View complex tensors as real tensors. Doesn't affect tensors that are already real.
- Warmup – Learning rate warmup, linearly increases the learning rate multiplier from `start_lr` to `end_lr` over `steps` steps.
- WarmupNormClip – Warmup via clipping of the update norm.
- WeightDecay – Weight decay.
- WeightDropout – Changes the closure so that it evaluates loss and gradients with random weights replaced with 0.
- WeightedAveraging – Weighted average of past `len(weights)` updates.
- WeightedMean – Outputs weighted mean of `inputs` that can be modules or numbers.
- WeightedSum
- Wrap – Wraps a pytorch optimizer to use it as a module.
- Zeros – Outputs zeros.
Functions:
- clip_grad_norm_ – Clips gradient of an iterable of parameters to specified norm value.
- clip_grad_value_ – Clips gradient of an iterable of parameters at specified value.
- decay_weights_ – Directly decays weights in-place.
- normalize_grads_ – Normalizes gradient of an iterable of parameters to specified norm value.
- orthogonalize_grads_ – Uses Newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.
- orthograd_ – Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
AEGD ¶
Bases: torchzero.core.transform.Transform
AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.
Note
AEGD has a learning rate hyperparameter that can't really be removed from the update rule.
To avoid compounding learning rate modifications, remove the tz.m.LR module if you had it.
Parameters:
-
eta
(float
) –step size. Defaults to 0.1.
-
c
(float
, default:1
) –c. Defaults to 1.
-
beta3
(float
) –third (squared) momentum. Defaults to 0.1.
-
eps
(float
) –epsilon. Defaults to 1e-8.
-
use_n_prev
(bool
) –whether to use previous gradient differences momentum.
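A minimal construction sketch following the `tz.Modular` pattern used by the other examples on this page; it assumes `AEGD` accepts `eta` as a keyword argument as listed above, and omits `tz.m.LR` per the note.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# eta already acts as the step size, so no tz.m.LR module is chained after AEGD
opt = tz.Modular(
    model.parameters(),
    tz.m.AEGD(eta=0.1),
)
```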
Source code in torchzero/modules/adaptive/aegd.py
ASAM ¶
Bases: torchzero.modules.adaptive.sam.SAM
Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.
This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.
Note
This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.
Parameters:
-
rho
(float
, default:0.5
) –Neighborhood size. Defaults to 0.05.
-
p
(float
, default:2
) –norm of the SAM objective. Defaults to 2.
Examples:
ASAM-Adam:
```python
opt = tz.Modular(
    model.parameters(),
    tz.m.ASAM(),
    tz.m.Adam(),
    tz.m.LR(1e-2)
)
```
References
Kwon, J., Kim, J., Park, H., & Choi, I. K. (2021, July). Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning (pp. 5905-5914). PMLR. https://arxiv.org/abs/2102.11600
Source code in torchzero/modules/adaptive/sam.py
Abs ¶
Bases: torchzero.core.transform.Transform
Returns :code:abs(input)
Source code in torchzero/modules/ops/unary.py
AccumulateMaximum ¶
Bases: torchzero.core.transform.Transform
Accumulates maximum of all past updates.
Parameters:
-
decay
(float
, default:0
) –decays the accumulator. Defaults to 0.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateMean ¶
Bases: torchzero.core.transform.Transform
Accumulates mean of all past updates.
Parameters:
-
decay
(float
, default:0
) –decays the accumulator. Defaults to 0.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateMinimum ¶
Bases: torchzero.core.transform.Transform
Accumulates minimum of all past updates.
Parameters:
-
decay
(float
, default:0
) –decays the accumulator. Defaults to 0.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateProduct ¶
Bases: torchzero.core.transform.Transform
Accumulates product of all past updates.
Parameters:
-
decay
(float
, default:0
) –decays the accumulator. Defaults to 0.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AccumulateSum ¶
Bases: torchzero.core.transform.Transform
Accumulates sum of all past updates.
Parameters:
-
decay
(float
, default:0
) –decays the accumulator. Defaults to 0.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/accumulate.py
AdGD ¶
Bases: torchzero.core.transform.Transform
AdGD and AdGD-2 (https://arxiv.org/abs/2308.02261)
Source code in torchzero/modules/step_size/adaptive.py
AdaHessian ¶
Bases: torchzero.core.module.Module
AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)
This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of random hessian-vector products.
Notes
-
In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the
inner
argument if you wish to apply AdaHessian preconditioning to another module's output. -
If you are using gradient estimators or reformulations, set
hvp_method
to "forward" or "central". -
This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backward
argument (refer to documentation).
Parameters:
-
beta1
(float
, default:0.9
) –first momentum. Defaults to 0.9.
-
beta2
(float
, default:0.999
) –second momentum for squared hessian diagonal estimates. Defaults to 0.999.
-
averaging
(bool
, default:True
) –whether to enable block diagonal averaging over 1st dimension on parameters that have 2+ dimensions. This can be set per-parameter in param groups.
-
block_size
(int
, default:None
) –size of block in the block-diagonal averaging.
-
update_freq
(int
, default:1
) –frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 1.
-
eps
(float
, default:1e-08
) –division stability epsilon. Defaults to 1e-8.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
fd_h
(float
, default:0.001
) –finite difference step size if
hvp_method
is "forward" or "central". Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed
(int | None
, default:None
) –seed for random vectors. Defaults to None.
-
inner
(Chainable | None
, default:None
) –Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to
inner
. 3. momentum and preconditioning are applied to the outputs of `inner`
.
Examples:¶
Using AdaHessian:
The AdaHessian preconditioner can be applied to any other module by passing it to the `inner` argument. Turn off AdaHessian's first momentum to get just the preconditioning, for example when applying AdaHessian preconditioning to nesterov momentum (`tz.m.NAG`). Both cases are sketched below.
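A hedged sketch of these two setups, following the `tz.Modular` pattern used elsewhere on this page; it assumes `beta1=0` disables the first momentum and that `tz.m.NAG()` can be passed through `inner` as described above.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# plain AdaHessian, placed first in the chain since it relies on autograd
opt = tz.Modular(
    model.parameters(),
    tz.m.AdaHessian(),
    tz.m.LR(1e-2),
)

# AdaHessian preconditioning applied to nesterov momentum through `inner`,
# with AdaHessian's own first momentum assumed disabled via beta1=0
opt = tz.Modular(
    model.parameters(),
    tz.m.AdaHessian(beta1=0, inner=tz.m.NAG()),
    tz.m.LR(1e-2),
)
```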
Source code in torchzero/modules/adaptive/adahessian.py
Adagrad ¶
Bases: torchzero.core.transform.Transform
Adagrad, divides by sum of past squares of gradients.
This implementation is identical to torch.optim.Adagrad
.
Parameters:
-
lr_decay
(float
, default:0
) –learning rate decay. Defaults to 0.
-
initial_accumulator_value
(float
, default:0
) –initial value of the sum of squares of gradients. Defaults to 0.
-
eps
(float
, default:1e-10
) –division epsilon. Defaults to 1e-10.
-
alpha
(float
, default:1
) –step size. Defaults to 1.
-
pow
(float
, default:2
) –power for gradients and accumulator root. Defaults to 2.
-
use_sqrt
(bool
, default:True
) –whether to take the root of the accumulator. Defaults to True.
-
inner
(Chainable | None
, default:None
) –Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
Source code in torchzero/modules/adaptive/adagrad.py
AdagradNorm ¶
Bases: torchzero.core.transform.Transform
Adagrad-Norm, divides by sum of past means of squares of gradients.
Parameters:
-
lr_decay
(float
, default:0
) –learning rate decay. Defaults to 0.
-
initial_accumulator_value
(float
, default:0
) –initial value of the sum of squares of gradients. Defaults to 0.
-
eps
(float
, default:1e-10
) –division epsilon. Defaults to 1e-10.
-
alpha
(float
, default:1
) –step size. Defaults to 1.
-
pow
(float
, default:2
) –power for gradients and accumulator root. Defaults to 2.
-
use_sqrt
(bool
, default:True
) –whether to take the root of the accumulator. Defaults to True.
-
inner
(Chainable | None
, default:None
) –Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
Source code in torchzero/modules/adaptive/adagrad.py
Adam ¶
Bases: torchzero.core.transform.Transform
Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.
This implementation is identical to :code:torch.optim.Adam
.
Parameters:
-
beta1
(float
, default:0.9
) –momentum. Defaults to 0.9.
-
beta2
(float
, default:0.999
) –second momentum. Defaults to 0.999.
-
eps
(float
, default:1e-08
) –epsilon. Defaults to 1e-8.
-
alpha
(float
, default:1.0
) –learning rate. Defaults to 1.
-
amsgrad
(bool
, default:False
) –Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
-
pow
(float
, default:2
) –power used in second momentum power and root. Defaults to 2.
-
debiased
(bool
, default:True
) –whether to apply debiasing to momentums based on current step. Defaults to True.
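For reference, a minimal sketch that chains Adam with a learning rate, mirroring the Adan example further down this page; the parameter values are illustrative.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# Adam update rule followed by a learning rate of 1e-3
opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
```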
Source code in torchzero/modules/adaptive/adam.py
Adan ¶
Bases: torchzero.core.transform.Transform
Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677
Parameters:
-
beta1
(float
, default:0.98
) –momentum. Defaults to 0.98.
-
beta2
(float
, default:0.92
) –momentum for gradient differences. Defaults to 0.92.
-
beta3
(float
, default:0.99
) –third (squared) momentum. Defaults to 0.99.
-
eps
(float
, default:1e-08
) –epsilon. Defaults to 1e-8.
-
use_n_prev
(bool
) –whether to use previous gradient differences momentum.
Example:

```python
opt = tz.Modular(
    model.parameters(),
    tz.m.Adan(),
    tz.m.LR(1e-3),
)
```

Reference: Xie, X., Zhou, P., Li, H., Lin, Z., & Yan, S. (2024). Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/2208.06677
Source code in torchzero/modules/adaptive/adan.py
AdaptiveBacktracking ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Adaptive backtracking line search. After each line search procedure, a new initial step size is set such that optimal step size in the procedure would be found on the second line search iteration.
Parameters:
-
init
(float
, default:1.0
) –initial step size. Defaults to 1.0.
-
beta
(float
, default:0.5
) –multiplies each consecutive step size by this value. Defaults to 0.5.
-
c
(float
, default:0.0001
) –sufficient decrease condition. Defaults to 1e-4.
-
condition
(Literal
, default:'armijo'
) –termination condition, only ones that do not use gradient at f(x+a*d) can be specified. - "armijo" - sufficient decrease condition. - "decrease" - any decrease in objective function value satisfies the condition.
"goldstein" can techincally be specified but it doesn't make sense because there is not zoom stage. Defaults to 'armijo'.
-
maxiter
(int
, default:20
) –maximum number of function evaluations per step. Defaults to 10.
-
target_iters
(int
, default:1
) –sets next step size such that this number of iterations are expected to be performed until optimal step size is found. Defaults to 1.
-
nplus
(float
, default:2.0
) –if initial step size is optimal, it is multiplied by this value. Defaults to 2.0.
-
scale_beta
(float
, default:0.0
) –momentum for initial step size, at 0 disables momentum. Defaults to 0.0.
Source code in torchzero/modules/line_search/backtracking.py
AdaptiveHeavyBall ¶
Bases: torchzero.core.transform.Transform
Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.
This is related to conjugate gradient methods, it may be very good for non-stochastic convex objectives, but won't work on stochastic ones.
note
The step size is determined by the algorithm, so learning rate modules shouldn't be used.
Parameters:
-
f_star
(int
, default:0
) –(estimated) minimal possible value of the objective function (lowest possible loss). Defaults to 0.
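A minimal sketch, assuming the module is used on its own as the note suggests (no learning rate module).

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# the step size is determined by the algorithm, so tz.m.LR is intentionally omitted
opt = tz.Modular(
    model.parameters(),
    tz.m.AdaptiveHeavyBall(f_star=0),
)
```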
Source code in torchzero/modules/adaptive/adaptive_heavyball.py
AdaptiveTracking ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
A line search that evaluates previous step size, if value increased, backtracks until the value stops decreasing, otherwise forward-tracks until value stops decreasing.
Parameters:
-
init
(float
, default:1.0
) –initial step size. Defaults to 1.0.
-
nplus
(float
, default:2
) –multiplier to step size if initial step size is optimal. Defaults to 2.
-
nminus
(float
, default:0.5
) –multiplier to step size if initial step size is too big. Defaults to 0.5.
-
maxiter
(int
, default:10
) –maximum number of function evaluations per step. Defaults to 10.
-
adaptive
(bool
, default:True
) –when enabled, if line search failed, step size will continue decreasing on the next step. Otherwise it will restart the line search from
init
step size. Defaults to True.
Source code in torchzero/modules/line_search/adaptive.py
Add ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Add `other` to tensors. `other` can be a number or a module.
If `other` is a module, this calculates `tensors + other(tensors)`.
Source code in torchzero/modules/ops/binary.py
Alternate ¶
Bases: torchzero.core.module.Module
Alternates between stepping with `modules`.
That is, the first step is performed with the 1st module, the second step with the second module, etc.
Parameters:
-
steps
(int | Iterable[int]
, default:1
) –number of steps to perform with each module. Defaults to 1.
Examples:
Alternate between Adam, SignSGD and RMSprop
```python
opt = tz.Modular(
    model.parameters(),
    tz.m.Alternate(
        tz.m.Adam(),
        [tz.m.SignSGD(), tz.m.Mul(0.5)],
        tz.m.RMSprop(),
    ),
    tz.m.LR(1e-3),
)
```
Source code in torchzero/modules/misc/switch.py
LOOP class-attribute ¶
Averaging ¶
Bases: torchzero.core.transform.TensorwiseTransform
Average of past history_size
updates.
Parameters:
-
history_size
(int
) –Number of past updates to average
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/momentum/averaging.py
BBStab ¶
Bases: torchzero.core.transform.Transform
Stabilized Barzilai-Borwein method (https://arxiv.org/abs/1907.06409).
This clips the norm of the Barzilai-Borwein update by delta
, where delta
can be adaptive if c
is specified.
Parameters:
-
c
(float
, default:0.2
) –adaptive delta parameter. If
delta
is set to None, firstinf_iters
updates are performed with non-stabilized Barzilai-Borwein step size. Then delta is set to norm of the update that had the smallest norm, and multiplied byc
. Defaults to 0.2. -
delta
(float | None
, default:None
) –Barzilai-Borwein update is clipped to this value. Set to
None
to use an adaptive choice. Defaults to None. -
type
(str
, default:'geom'
) –one of "short" with formula sᵀy/yᵀy, "long" with formula sᵀs/sᵀy, or "geom" to use geometric mean of short and long. Defaults to "geom". Note that "long" corresponds to BB1stab and "short" to BB2stab, however I found that "geom" works really well.
-
inner
(Chainable | None
, default:None
) –step size will be applied to outputs of this module. Defaults to None.
Source code in torchzero/modules/step_size/adaptive.py
BFGS ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.
Note
a line search or a trust region is recommended
Warning
this uses at least O(N^2) memory.
Parameters:
-
init_scale
(float | Literal['auto']
, default:'auto'
) –initial hessian matrix is set to identity times this.
"auto" corresponds to a heuristic from Nocedal. Stephen J. Wright. Numerical Optimization p.142-143.
Defaults to "auto".
-
tol
(float
, default:1e-32
) –tolerance on curvature condition. Defaults to 1e-32.
-
ptol
(float | None
, default:1e-32
) –skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.
-
ptol_restart
(bool
, default:False
) –whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
-
restart_interval
(int | None | Literal['auto']
, default:None
) –interval between resetting the hessian approximation.
"auto" corresponds to number of decision variables + 1.
None - no resets.
Defaults to None.
-
beta
(float | None
, default:None
) –momentum on H or B. Defaults to None.
-
update_freq
(int
, default:1
) –frequency of updating H or B. Defaults to 1.
-
scale_first
(bool
, default:False
) –whether to downscale first step before hessian approximation becomes available. Defaults to True.
-
scale_second
(bool
) –whether to downscale second step. Defaults to False.
-
concat_params
(bool
, default:True
) –If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.
-
inner
(Chainable | None
, default:None
) –preconditioning is applied to the output of this module. Defaults to None.
Examples:¶
BFGS with backtracking line search:
BFGS with trust region
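A hedged sketch of these two setups. The trust-region variant assumes `tz.m.TrustCG` takes the hessian-approximation module as a `hess_module` argument and that `inverse=False` switches a quasi-newton module to maintaining the hessian rather than its inverse, as noted for the trust-region modules on this page.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# BFGS with a backtracking line search
opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(),
    tz.m.Backtracking(),
)

# BFGS with a trust region (hess_module / inverse=False are assumptions, see above)
opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(hess_module=tz.m.BFGS(inverse=False)),
)
```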
Source code in torchzero/modules/quasi_newton/quasi_newton.py
BacktrackOnSignChange ¶
Bases: torchzero.core.transform.Transform
Negates or undoes update for parameters where gradient or update sign changes.
This is part of RProp update rule.
Parameters:
-
use_grad
(bool
, default:False
) –if True, tracks sign change of the gradient, otherwise track sign change of the update. Defaults to True.
-
backtrack
(bool
, default:True
) –if True, undoes the update when sign changes, otherwise negates it. Defaults to True.
Source code in torchzero/modules/adaptive/rprop.py
Backtracking ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Backtracking line search.
Parameters:
-
init
(float
, default:1.0
) –initial step size. Defaults to 1.0.
-
beta
(float
, default:0.5
) –multiplies each consecutive step size by this value. Defaults to 0.5.
-
c
(float
, default:0.0001
) –sufficient decrease condition. Defaults to 1e-4.
-
condition
(Literal
, default:'armijo'
) –termination condition, only ones that do not use gradient at f(x+a*d) can be specified. - "armijo" - sufficient decrease condition. - "decrease" - any decrease in objective function value satisfies the condition.
"goldstein" can techincally be specified but it doesn't make sense because there is not zoom stage. Defaults to 'armijo'.
-
maxiter
(int
, default:10
) –maximum number of function evaluations per step. Defaults to 10.
-
adaptive
(bool
, default:True
) –when enabled, if line search failed, step size will continue decreasing on the next step. Otherwise it will restart the line search from
init
step size. Defaults to True.
Examples: Gradient descent with backtracking line search:
L-BFGS with backtracking line search:
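A hedged sketch of both setups following the `tz.Modular` pattern; note that line searches require a closure to be passed to the optimizer step.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# gradient descent with backtracking line search
opt = tz.Modular(
    model.parameters(),
    tz.m.Backtracking(),
)

# L-BFGS with backtracking line search
opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.Backtracking(),
)
```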
Source code in torchzero/modules/line_search/backtracking.py
BarzilaiBorwein ¶
Bases: torchzero.core.transform.Transform
Barzilai-Borwein step size method.
Parameters:
-
type
(str
, default:'geom'
) –one of "short" with formula sᵀy/yᵀy, "long" with formula sᵀs/sᵀy, or "geom" to use geometric mean of short and long. Defaults to "geom".
-
fallback
(float
) –step size when denominator is less than 0 (will happen on negative curvature). Defaults to 1e-3.
-
inner
(Chainable | None
, default:None
) –step size will be applied to outputs of this module. Defaults to None.
Source code in torchzero/modules/step_size/adaptive.py
BinaryOperationBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for operations that use update as the first operand. This is an abstract class, subclass it and override transform
method to use it.
Methods:
-
transform
–applies the operation to operands
Source code in torchzero/modules/ops/binary.py
BirginMartinezRestart ¶
Bases: torchzero.core.module.Module
The restart criterion for conjugate gradient methods designed by Birgin and Martinez.
This criterion restarts when the angle between d_{k+1} and -g_{k+1} is not acute enough.
The restart clears all states of module
.
Parameters:
-
module
(Module
) –module to restart, should be a conjugate gradient or possibly a quasi-newton method.
-
cond
(float
, default:0.001
) –Restart is performed whenever dᵀg > -cond·||d||·||g||. The default condition value of 1e-3 is suggested by Birgin and Martinez.
Reference
Birgin, Ernesto G., and José Mario Martínez. "A spectral conjugate gradient method for unconstrained optimization." Applied Mathematics & Optimization 43.2 (2001): 117-128.
Source code in torchzero/modules/restarts/restars.py
BroydenBad ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Broyden's "bad" Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
BroydenGood ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Broyden's "good" Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
CCD ¶
Bases: torchzero.core.module.Module
Cumulative coordinate descent. This updates one gradient coordinate at a time and accumulates it into the update direction. The coordinate to update is chosen at random, weighted by the magnitudes of the current update direction. As the update direction ceases to be a descent direction due to old accumulated coordinates, it is decayed.
Parameters:
-
pmin
(float
, default:0.1
) –multiplier to probability of picking the lowest magnitude gradient. Defaults to 0.1.
-
pmax
(float
, default:1.0
) –multiplier to probability of picking the largest magnitude gradient. Defaults to 1.0.
-
pow
(int
, default:2
) –power transform to probabilities. Defaults to 2.
-
decay
(float
, default:0.8
) –accumulated gradient decay on failed step. Defaults to 0.5.
-
decay2
(float
, default:0.2
) –decay multiplier decay on failed step. Defaults to 0.25.
-
nplus
(float
, default:1.5
) –step size increase on successful steps. Defaults to 1.5.
-
nminus
(float
, default:0.75
) –step size increase on unsuccessful steps. Defaults to 0.75.
Source code in torchzero/modules/zeroth_order/cd.py
CCDLS ¶
Bases: torchzero.core.module.Module
CCD with line search instead of adaptive step size.
Parameters:
-
pmin
(float
, default:0.1
) –multiplier to probability of picking the lowest magnitude gradient. Defaults to 0.1.
-
pmax
(float
, default:1.0
) –multiplier to probability of picking the largest magnitude gradient. Defaults to 1.0.
-
pow
(int
, default:2
) –power transform to probabilities. Defaults to 2.
-
decay
(float
, default:0.8
) –accumulated gradient decay on failed step. Defaults to 0.5.
-
decay2
(float
, default:0.2
) –decay multiplier decay on failed step. Defaults to 0.25.
-
maxiter
(int
, default:10
) –max number of line search iterations.
Source code in torchzero/modules/zeroth_order/cd.py
CD ¶
Bases: torchzero.core.module.Module
Coordinate descent. Proposes a descent direction along a single coordinate.
You can then put a line search such as `tz.m.ScipyMinimizeScalar` after it, or a fixed step size; see the sketch after the parameter list.
Parameters:
-
h
(float
, default:0.001
) –finite difference step size. Defaults to 1e-3.
-
grad
(bool
, default:True
) –if True, scales direction by gradient estimate. If False, the scale is fixed to 1. Defaults to True.
-
adaptive
(bool
, default:True
) –whether to adapt finite difference step size, this requires an additional buffer. Defaults to True.
-
index
(str
, default:'cyclic2'
) –index selection strategy. - "cyclic" - repeatedly cycles through each coordinate, e.g.
1,2,3,1,2,3,...
. - "cyclic2" - cycles forward and then backward, e.g1,2,3,3,2,1,1,2,3,...
(default). - "random" - picks coordinate randomly. -
threepoint
(bool
, default:True
) –whether to use three points (three function evaluations) to determine descent direction. If False, uses two points, but then
adaptive
can't be used. Defaults to True.
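A minimal sketch of the pairing suggested above; it assumes `tz.m.ScipyMinimizeScalar` can be used with its default arguments.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# coordinate descent direction followed by a scalar line search
opt = tz.Modular(
    model.parameters(),
    tz.m.CD(),
    tz.m.ScipyMinimizeScalar(),
)

# or with a fixed step size instead of a line search
opt = tz.Modular(
    model.parameters(),
    tz.m.CD(),
    tz.m.LR(1e-2),
)
```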
Source code in torchzero/modules/zeroth_order/cd.py
Cautious ¶
Bases: torchzero.core.transform.Transform
Negates update for parameters where update and gradient sign is inconsistent. Optionally normalizes the update by the number of parameters that are not masked. This is meant to be used after any momentum-based modules.
Parameters:
-
normalize
(bool
, default:False
) –renormalize update after masking. only has effect when mode is 'zero'. Defaults to False.
-
eps
(float
, default:1e-06
) –epsilon for normalization. Defaults to 1e-6.
-
mode
(str
, default:'zero'
) –what to do with updates with inconsistent signs. - "zero" - set them to zero (as in paper) - "grad" - set them to the gradient (same as using update magnitude and gradient sign) - "backtrack" - negate them
Examples:¶
Cautious Adam
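A hedged sketch of cautious Adam, placing `tz.m.Cautious` after the momentum-based module as the description above recommends.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# cautioning is applied to Adam's output before the learning rate
opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.Cautious(),
    tz.m.LR(1e-3),
)
```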
References
Cautious Optimizers: Improving Training with One Line of Code. Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu
Source code in torchzero/modules/momentum/cautious.py
CenteredEMASquared ¶
Bases: torchzero.core.transform.Transform
Maintains a centered exponential moving average of squared updates. This also maintains an additional exponential moving average of un-squared updates, square of which is subtracted from the EMA.
Parameters:
-
beta
(float
, default:0.99
) –momentum value. Defaults to 0.999.
-
amsgrad
(bool
, default:False
) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
pow
(float
, default:2
) –power, absolute value is always used. Defaults to 2.
Source code in torchzero/modules/ops/higher_level.py
CenteredSqrtEMASquared ¶
Bases: torchzero.core.transform.Transform
Maintains a centered exponential moving average of squared updates, outputs optionally debiased square root. This also maintains an additional exponential moving average of un-squared updates, square of which is subtracted from the EMA.
Parameters:
-
beta
(float
, default:0.99
) –momentum value. Defaults to 0.999.
-
amsgrad
(bool
, default:False
) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
debiased
(bool
, default:False
) –whether to multiply the output by a debiasing term from the Adam method. Defaults to False.
-
pow
(float
, default:2
) –power, absolute value is always used. Defaults to 2.
Source code in torchzero/modules/ops/higher_level.py
Centralize ¶
Bases: torchzero.core.transform.Transform
Centralizes the update.
Parameters:
-
dim
(int | Sequence[int] | str | None
, default:None
) –calculates norm along those dimensions. If list/tuple, tensors are centralized along all dimensions in
dim
that they have. Can be set to "global" to centralize by global mean of all gradients concatenated to a vector. Defaults to None. -
inverse_dims
(bool
, default:False
) –if True, the
dims
argument is inverted, and all other dimensions are centralized. -
min_size
(int
, default:2
) –minimal size of a dimension to normalize along it. Defaults to 1.
Examples:
Standard gradient centralization:
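A hedged sketch assuming default arguments, with centralization applied to the raw gradient at the start of the chain.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# centralize the gradient, then apply Adam and a learning rate
opt = tz.Modular(
    model.parameters(),
    tz.m.Centralize(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
```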
References: - Yong, H., Huang, J., Hua, X., & Zhang, L. (2020). Gradient centralization: A new optimization technique for deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16 (pp. 635-652). Springer International Publishing. https://arxiv.org/abs/2004.01461
Source code in torchzero/modules/clipping/clipping.py
Clip ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Clip tensors to be in `(min, max)` range. `min` and `max` can be None, numbers or modules.
If `min` and `max` are modules, this calculates `tensors.clip(min(tensors), max(tensors))`.
Source code in torchzero/modules/ops/binary.py
ClipModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates :code:input(tensors).clip(min, max)
. :code:min
and :code:max
can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
ClipNorm ¶
Bases: torchzero.core.transform.Transform
Clips update norm to be no larger than value
.
Parameters:
-
max_norm
(float
) –value to clip norm to.
-
ord
(float
, default:2
) –norm order. Defaults to 2.
-
dim
(int | Sequence[int] | str | None
, default:None
) –calculates norm along those dimensions. If list/tuple, tensors are normalized along all dimensions in
dim
that they have. Can be set to "global" to normalize by global norm of all gradients concatenated to a vector. Defaults to None. -
inverse_dims
(bool
, default:False
) –if True, the
dims
argument is inverted, and all other dimensions are normalized. -
min_size
(int
, default:1
) –minimal number of elements in a parameter or slice to clip norm. Defaults to 1.
-
target
(str
, default:'update'
) –what this affects.
Examples:
Gradient norm clipping:
Update norm clipping:
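A hedged sketch of both cases; it assumes `max_norm` can be passed positionally and that placing the module before or after the update rule is what distinguishes gradient clipping from update clipping.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# gradient norm clipping: clip before the update rule
opt = tz.Modular(
    model.parameters(),
    tz.m.ClipNorm(1.0),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)

# update norm clipping: clip after the update rule
opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.ClipNorm(1.0),
    tz.m.LR(1e-3),
)
```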
Source code in torchzero/modules/clipping/clipping.py
ClipNormByEMA ¶
Bases: torchzero.core.transform.Transform
Clips norm to be no larger than the norm of an exponential moving average of past updates.
Parameters:
-
beta
(float
, default:0.99
) –beta for the exponential moving average. Defaults to 0.99.
-
ord
(float
, default:2
) –order of the norm. Defaults to 2.
-
eps
(float
, default:1e-06
) –epsilon for division. Defaults to 1e-6.
-
tensorwise
(bool
, default:True
) –if True, norms are calculated parameter-wise, otherwise treats all parameters as single vector. Defaults to True.
-
max_ema_growth
(float | None
, default:1.5
) –if specified, restricts how quickly exponential moving average norm can grow. The norm is allowed to grow by at most this value per step. Defaults to 1.5.
-
ema_init
(str
, default:'zeros'
) –How to initialize exponential moving average on first step, "update" to use the first update or "zeros". Defaults to 'zeros'.
Source code in torchzero/modules/clipping/ema_clipping.py
NORMALIZE class-attribute ¶
ClipNormGrowth ¶
Bases: torchzero.core.transform.Transform
Clips update norm growth.
Parameters:
-
add
(float | None
, default:None
) –additive clipping, next update norm is at most
previous norm + add
. Defaults to None. -
mul
(float | None
, default:1.5
) –multiplicative clipping, next update norm is at most
previous norm * mul
. Defaults to 1.5. -
min_value
(float | None
, default:0.0001
) –minimum value for multiplicative clipping to prevent collapse to 0. Next norm is at most :code:
max(prev_norm, min_value) * mul
. Defaults to 1e-4. -
max_decay
(float | None
, default:2
) –bounds the tracked multiplicative clipping decay to prevent collapse to 0. Next norm is at most :code:
max(previous norm * mul, max_decay)
. Defaults to 2. -
ord
(float
, default:2
) –norm order. Defaults to 2.
-
parameterwise
(bool
, default:True
) –if True, norms are calculated parameter-wise, otherwise treats all parameters as single vector. Defaults to True.
-
target
(Literal
, default:'update'
) –what to set on var. Defaults to "update".
Source code in torchzero/modules/clipping/growth_clipping.py
ClipValue ¶
Bases: torchzero.core.transform.Transform
Clips update magnitude to be within (-value, value)
range.
Parameters:
-
value
(float
) –value to clip to.
-
target
(str
, default:'update'
) –refer to
target argument
in documentation.
Examples:
Gradient clipping:
Update clipping:
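A hedged sketch analogous to the ClipNorm examples, assuming `value` can be passed positionally.

```python
import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

# gradient clipping: clip before the update rule
opt = tz.Modular(
    model.parameters(),
    tz.m.ClipValue(1.0),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)

# update clipping: clip after the update rule
opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.ClipValue(1.0),
    tz.m.LR(1e-3),
)
```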
Source code in torchzero/modules/clipping/clipping.py
ClipValueByEMA ¶
Bases: torchzero.core.transform.Transform
Clips magnitude of update to be no larger than magnitude of exponential moving average of past (unclipped) updates.
Parameters:
-
beta
(float
, default:0.99
) –beta for the exponential moving average. Defaults to 0.99.
-
ema_init
(str
, default:'zeros'
) –How to initialize exponential moving average on first step, "update" to use the first update or "zeros". Defaults to 'zeros'.
-
ema_tfm
(Chainable | None
, default:None
) –optional modules applied to exponential moving average before clipping by it. Defaults to None.
Source code in torchzero/modules/clipping/ema_clipping.py
ClipValueGrowth ¶
Bases: torchzero.core.transform.TensorwiseTransform
Clips update value magnitude growth.
Parameters:
-
add
(float | None
, default:None
) –additive clipping, next update is at most
previous update + add
. Defaults to None. -
mul
(float | None
, default:1.5
) –multiplicative clipping, next update is at most
previous update * mul
. Defaults to 1.5. -
min_value
(float | None
, default:0.0001
) –minimum value for multiplicative clipping to prevent collapse to 0. Next update is at most :code:
max(prev_update, min_value) * mul
. Defaults to 1e-4. -
max_decay
(float | None
, default:2
) –bounds the tracked multiplicative clipping decay to prevent collapse to 0. Next update is at most :code:
max(previous update * mul, max_decay)
. Defaults to 2. -
target
(Literal
, default:'update'
) –what to set on var. Defaults to "update".
Source code in torchzero/modules/clipping/growth_clipping.py
Clone ¶
Bases: torchzero.core.module.Module
Clones input. May be useful to store some intermediate result and make sure it doesn't get affected by in-place operations
Source code in torchzero/modules/ops/utility.py
ConjugateDescent ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Conjugate Descent (CD).
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
CopyMagnitude ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Returns :code:other(tensors)
with sign copied from tensors.
Source code in torchzero/modules/ops/binary.py
CopySign ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Returns tensors with sign copied from :code:other(tensors)
.
Source code in torchzero/modules/ops/binary.py
CubicRegularization ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Cubic regularization.
Parameters:
-
hess_module
(Module | None
) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton
andtz.m.GaussNewton
. When using quasi-newton methods, setinverse=False
when constructing them. -
eta
(float
, default:0.0
) –if ratio of actual to predicted reduction is larger than this, step is accepted. When :code:
hess_module
is GaussNewton, this can be set to 0. Defaults to 0. -
nplus
(float
, default:3.5
) –increase factor on successful steps. Defaults to 3.5.
-
nminus
(float
, default:0.25
) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good
(float
, default:0.99
) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus
. -
rho_bad
(float
, default:0.0001
) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus
. -
init
(float
, default:1
) –Initial trust region value. Defaults to 1.
-
maxiter
(float
, default:100
) –maximum iterations when solving the cubic subproblem. Defaults to 100.
-
eps
(float
, default:1e-08
) –epsilon for the solver, defaults to 1e-8.
-
update_freq
(int
, default:1
) –frequency of updating the hessian. Defaults to 1.
-
max_attempts
(max_attempts
, default:10
) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
fallback
(bool
) –if
True
, whenhess_module
maintains hessian inverse which can't be inverted efficiently, it will be inverted anyway. WhenFalse
(default), aRuntimeError
will be raised instead. -
inner
(Chainable | None
, default:None
) –preconditioning is applied to output of this module. Defaults to None.
Examples:
Cubic regularized newton
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.CubicRegularization(tz.m.Newton()),
)
Source code in torchzero/modules/trust_region/cubic_regularization.py
CustomUnaryOperation ¶
Bases: torchzero.core.transform.Transform
Applies :code:getattr(tensor, name)
to each tensor
Source code in torchzero/modules/ops/unary.py
DFP ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Davidon–Fletcher–Powell Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
DNRTR ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Diagonal quasi-newton method.
Reference
Andrei, Neculai. "A diagonal quasi-Newton updating method for unconstrained optimization." Numerical Algorithms 81.2 (2019): 575-590.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DYHS ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Dai-Yuan - Hestenes–Stiefel hybrid conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
DaiYuan ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Dai–Yuan nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1)
after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
Debias ¶
Bases: torchzero.core.transform.Transform
Multiplies the update by an Adam debiasing term based on the first and/or second momentum.
Parameters:
-
beta1
(float | None
, default:None
) –first momentum, should be the same as first momentum used in modules before. Defaults to None.
-
beta2
(float | None
, default:None
) –second (squared) momentum, should be the same as second momentum used in modules before. Defaults to None.
-
alpha
(float
, default:1
) –learning rate. Defaults to 1.
-
pow
(float
, default:2
) –power, assumes absolute value is used. Defaults to 2.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
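Examples:
A hedged sketch pairing Debias with a first-momentum EMA (beta1 matches the EMA momentum, as the parameter description above suggests):
opt = tz.Modular(
    model.parameters(),
    tz.m.EMA(0.9),           # first momentum
    tz.m.Debias(beta1=0.9),  # Adam-style debiasing of that momentum
    tz.m.LR(1e-3),
)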
Source code in torchzero/modules/ops/higher_level.py
Debias2 ¶
Bases: torchzero.core.transform.Transform
Multiplies the update by an Adam debiasing term based on the second momentum.
Parameters:
-
beta
(float | None
, default:0.999
) –second (squared) momentum, should be the same as second momentum used in modules before. Defaults to 0.999.
-
pow
(float
, default:2
) –power, assumes absolute value is used. Defaults to 2.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/ops/higher_level.py
DiagonalBFGS ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DiagonalQuasiCauchi ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults
Diagonal quasi-Cauchy method.
Reference
Zhu M., Nazareth J. L., Wolkowicz H. The quasi-Cauchy relation and diagonal updating //SIAM Journal on Optimization. – 1999. – Т. 9. – №. 4. – С. 1192-1204.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DiagonalSR1 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DiagonalWeightedQuasiCauchi ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults
Diagonal weighted quasi-Cauchy method.
Reference
Leong, Wah June, Sharareh Enshaei, and Sie Long Kek. "Diagonal quasi-Newton methods via least change updating principle with weighted Frobenius norm." Numerical Algorithms 86 (2021): 1225-1241.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
DirectWeightDecay ¶
Bases: torchzero.core.module.Module
Directly applies weight decay to parameters.
Parameters:
-
weight_decay
(float
) –weight decay scale.
-
ord
(int
, default:2
) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
Source code in torchzero/modules/weight_decay/weight_decay.py
Div ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Divide tensors by :code:other
. :code:other
can be a number or a module.
If :code:other
is a module, this calculates :code:tensors / other(tensors)
Source code in torchzero/modules/ops/binary.py
DivByLoss ¶
Bases: torchzero.core.module.Module
Divides update by loss times :code:alpha
Source code in torchzero/modules/misc/misc.py
DivModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates :code:input / other
. :code:input
and :code:other
can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
Dogleg ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Dogleg trust region algorithm.
Parameters:
-
hess_module
(Module | None
) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton
andtz.m.GaussNewton
. When using quasi-newton methods, setinverse=False
when constructing them. -
eta
(float
, default:0.0
) –if ratio of actual to predicted reduction is larger than this, step is accepted. When :code:
hess_module
is GaussNewton, this can be set to 0. Defaults to 0. -
nplus
(float
, default:2
) –increase factor on successful steps. Defaults to 2.
-
nminus
(float
, default:0.25
) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good
(float
, default:0.75
) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus
. -
rho_bad
(float
, default:0.25
) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus
. -
init
(float
, default:1
) –Initial trust region value. Defaults to 1.
-
update_freq
(int
, default:1
) –frequency of updating the hessian. Defaults to 1.
-
max_attempts
(max_attempts
, default:10
) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
inner
(Chainable | None
, default:None
) –preconditioning is applied to output of this module. Defaults to None.
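Examples:
A hedged sketch, by analogy with the other trust region modules on this page (hess_module is the first positional argument):
opt = tz.Modular(
    model.parameters(),
    tz.m.Dogleg(tz.m.Newton()),
)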
Source code in torchzero/modules/trust_region/dogleg.py
Dropout ¶
Bases: torchzero.core.transform.Transform
Applies dropout to the update.
For each weight the update to that weight has :code:p
probability to be set to 0.
This can be used to implement gradient dropout or update dropout depending on placement.
Parameters:
-
p
(float
, default:0.5
) –probability that update for a weight is replaced with 0. Defaults to 0.5.
-
graft
(bool
, default:False
) –if True, update after dropout is rescaled to have the same norm as before dropout. Defaults to False.
-
target
(Literal
, default:'update'
) –what to set on var, refer to documentation. Defaults to 'update'.
Examples:
Gradient dropout.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Dropout(0.5),
tz.m.Adam(),
tz.m.LR(1e-3)
)
Update dropout.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Adam(),
tz.m.Dropout(0.5),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/misc/regularization.py
DualNormCorrection ¶
Bases: torchzero.core.transform.TensorwiseTransform
Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).
Orthogonalize already has this built in with the dual_norm_correction
setting.
Source code in torchzero/modules/adaptive/muon.py
EMA ¶
Bases: torchzero.core.transform.Transform
Maintains an exponential moving average of update.
Parameters:
-
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
dampening
(float
, default:0
) –momentum dampening. Defaults to 0.
-
debiased
(bool
, default:False
) –whether to debias the EMA like in Adam. Defaults to False.
-
lerp
(bool
, default:True
) –whether to use linear interpolation. Defaults to True.
-
ema_init
(str
, default:'zeros'
) –initial values for the EMA, "zeros" or "update".
-
target
(Literal
, default:'update'
) –target to apply EMA to. Defaults to 'update'.
Source code in torchzero/modules/momentum/momentum.py
EMASquared ¶
Bases: torchzero.core.transform.Transform
Maintains an exponential moving average of squared updates.
Parameters:
-
beta
(float
, default:0.999
) –momentum value. Defaults to 0.999.
-
amsgrad
(bool
, default:False
) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
pow
(float
, default:2
) –power, absolute value is always used. Defaults to 2.
Methods:
-
EMA_SQ_FN
–Updates
exp_avg_sq_
with EMA of squaredtensors
, ifmax_exp_avg_sq_
is not None, updates it with maximum of EMA.
Source code in torchzero/modules/ops/higher_level.py
EMA_SQ_FN ¶
EMA_SQ_FN(tensors: TensorList, exp_avg_sq_: TensorList, beta: float | NumberList, max_exp_avg_sq_: TensorList | None, pow: float = 2)
Updates exp_avg_sq_
with EMA of squared tensors
, if max_exp_avg_sq_
is not None, updates it with maximum of EMA.
Returns exp_avg_sq_
or max_exp_avg_sq_
.
Source code in torchzero/modules/functional.py
ESGD ¶
Bases: torchzero.core.module.Module
Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)
This is similar to Adagrad, but it accumulates squared randomized hessian diagonal estimates instead of squared gradients.
.. note::
In most cases ESGD should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply ESGD preconditioning to another module's output.
.. note::
If you are using gradient estimators or reformulations, set :code:hvp_method
to "forward" or "central".
.. note::
This module requires a closure passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating HVPs.
The closure must accept a backward
argument (refer to documentation).
Parameters:
-
damping
(float
, default:0.0001
) –added to denominator for stability. Defaults to 1e-4.
-
update_freq
(int
, default:20
) –frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 20.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
fd_h
(float
, default:0.001
) –finite difference step size if :code:
hvp_method
is "forward" or "central". Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed
(int | None
, default:None
) –seed for random vectors. Defaults to None.
-
inner
(Chainable | None
, default:None
) –Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to :code:
inner
. 3. momentum and preconditioning are applied to the outputs of :code:
.
Examples:
Using ESGD:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.ESGD(),
tz.m.LR(0.1)
)
ESGD preconditioner can be applied to any other module by passing it to the :code:inner
argument. Here is an example of applying
ESGD preconditioning to nesterov momentum (:code:tz.m.NAG
):
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.ESGD(beta1=0, inner=tz.m.NAG(0.9)),
tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/esgd.py
EscapeAnnealing ¶
Bases: torchzero.core.module.Module
If parameters stop changing, this runs a backward annealing random search
Source code in torchzero/modules/misc/escape.py
Exp ¶
Bases: torchzero.core.transform.Transform
Returns :code:exp(input)
Source code in torchzero/modules/ops/unary.py
ExpHomotopy ¶
FDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Approximate gradients via finite difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h
(float
, default:0.001
) –magnitude of parameter perturbation. Defaults to 1e-3.
-
formula
(Literal
, default:'central'
) –finite difference formula. Defaults to 'central'.
-
target
(Literal
, default:'closure'
) –what to set on var. Defaults to 'closure'.
Examples:
Plain FDM:
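A hedged sketch of the plain method (FDM estimates the gradients, so a simple first-order chain can follow it):
opt = tz.Modular(
    model.parameters(),
    tz.m.FDM(),
    tz.m.LR(1e-2),
)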
Any gradient-based method can use FDM-estimated gradients.
fdm_ncg = tz.Modular(
model.parameters(),
tz.m.FDM(),
# set hvp_method to "forward" so that it
# uses gradient difference instead of autograd
tz.m.NewtonCG(hvp_method="forward"),
tz.m.Backtracking()
)
Source code in torchzero/modules/grad_approximation/fdm.py
Fill ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with :code:value
Source code in torchzero/modules/ops/utility.py
FillLoss ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with loss value times :code:alpha
Source code in torchzero/modules/misc/misc.py
FletcherReeves ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Fletcher–Reeves nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
FletcherVMM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Fletcher's variable metric Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. doi:10.1093/comjnl/13.3.317
Source code in torchzero/modules/quasi_newton/quasi_newton.py
ForwardGradient ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Forward gradient method.
This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
n_samples
(int
, default:1
) –number of random gradient samples. Defaults to 1.
-
distribution
(Literal
, default:'gaussian'
) –distribution for random gradient samples. Defaults to "gaussian".
-
beta
(float
, default:0
) –If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
-
pre_generate
(bool
, default:True
) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
jvp_method
(str
, default:'autograd'
) –how to calculate jacobian vector product, note that with
forward
and 'central' this is equivalent to randomized finite difference. Defaults to 'autograd'. -
h
(float
, default:0.001
) –finite difference step size if jvp_method is set to
forward
orcentral
. Defaults to 1e-3. -
target
(Literal
, default:'closure'
) –what to set on var. Defaults to "closure".
References
Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
Source code in torchzero/modules/grad_approximation/forward_gradient.py
PRE_MULTIPLY_BY_H
class-attribute
¶
FullMatrixAdagrad ¶
Bases: torchzero.core.transform.TensorwiseTransform
Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).
Note
A more memory-efficient version equivalent to full matrix Adagrad on last n gradients is implemented in tz.m.LMAdagrad
.
Parameters:
-
beta
(float | None
, default:None
) –momentum for gradient outer product accumulators. if None, uses sum. Defaults to None.
-
decay
(float | None
, default:None
) –decay for gradient outer product accumulators. Defaults to None.
-
sqrt
(bool
, default:True
) –whether to take the square root of the accumulator. Defaults to True.
-
concat_params
(bool
, default:True
) –if False, each parameter will have its own accumulator. Defaults to True.
-
precond_freq
(int
, default:1
) –frequency of updating the inverse square root of the accumulator. Defaults to 1.
-
init
(Literal[str]
, default:'identity'
) –how to initialize the accumulator. - "identity" - with identity matrix (default). - "zeros" - with zero matrix. - "ones" - with matrix of ones. -"GGT" - with the first outer product
-
divide
(bool
, default:False
) –whether to divide the accumulator by number of gradients in it. Defaults to False.
-
inner
(Chainable | None
, default:None
) –inner modules to apply preconditioning to. Defaults to None.
Examples:¶
Plain full-matrix adagrad
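A hedged sketch; with beta=None (the default) the outer products are summed, which is the Adagrad behaviour described above:
opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(),
    tz.m.LR(1e-2),
)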
Full-matrix RMSprop
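A hedged sketch; setting beta turns the accumulator into an exponential moving average of outer products, giving a full-matrix RMSprop:
opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(beta=0.999),
    tz.m.LR(1e-2),
)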
Full-matrix Adam
opt = tz.Modular(
model.parameters(),
tz.m.FullMatrixAdagrad(beta=0.999, inner=tz.m.EMA(0.9)),
tz.m.Debias(0.9, 0.999),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/adaptive/adagrad.py
GaussNewton ¶
Bases: torchzero.core.module.Module
Gauss-newton method.
To use this, the closure should return a vector of values whose sum of squares is to be minimized.
The closure must accept a backward
argument; it will always be False, but it is required.
Gradients are calculated via batched autograd within this module, so you don't need to
implement the backward pass. Please see below for an example.
Note
This method requires ndim^2
memory, however, if it is used within tz.m.TrustCG
trust region,
the memory requirement is ndim*m
, where m
is number of values in the output.
Parameters:
-
reg
(float
, default:1e-08
) –regularization parameter. Defaults to 1e-8.
-
batched
(bool
, default:True
) –whether to use vmapping. Defaults to True.
Examples:
minimizing the rosenbrock function:
def rosenbrock(X):
x1, x2 = X
return torch.stack([(1 - x1), 100 * (x2 - x1**2)])
X = torch.tensor([-1.1, 2.5], requires_grad=True)
opt = tz.Modular([X], tz.m.GaussNewton(), tz.m.Backtracking())
# define the closure for line search
def closure(backward=True):
return rosenbrock(X)
# minimize
for iter in range(10):
loss = opt.step(closure)
print(f'{loss = }')
training a neural network with a matrix-free GN trust region:
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
opt = tz.Modular(
model.parameters(),
tz.m.TrustCG(tz.m.GaussNewton()),
)
def closure(backward=True):
y_hat = model(X) # (64, 10)
return (y_hat - y).pow(2).mean(0) # (10, )
for i in range(100):
losses = opt.step(closure)
if i % 10 == 0:
print(f'{losses.mean() = }')
Source code in torchzero/modules/least_squares/gn.py
GaussianSmoothing ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Gaussian smoothing method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h
(float
, default:0.01
) –finite difference step size if jvp_method is set to
forward
orcentral
. Defaults to 1e-2. -
n_samples
(int
, default:100
) –number of random gradient samples. Defaults to 100.
-
formula
(Literal
, default:'forward2'
) –finite difference formula. Defaults to 'forward2'.
-
distribution
(Literal
, default:'gaussian'
) –distribution. Defaults to "gaussian".
-
beta
(float
, default:0
) –If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
-
pre_generate
(bool
, default:True
) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed
(int | None | Generator
, default:None
) –Seed for random generator. Defaults to None.
-
target
(Literal
, default:'closure'
) –what to set on var. Defaults to "closure".
References
Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
Source code in torchzero/modules/grad_approximation/rfdm.py
Grad ¶
Bases: torchzero.core.module.Module
Outputs the gradient
Source code in torchzero/modules/ops/utility.py
GradApproximator ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for gradient approximations.
This is an abstract class, to use it, subclass it and override approximate
.
GradientApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.
Parameters:
-
defaults
(dict[str, Any] | None
, default:None
) –dict with defaults. Defaults to None.
-
target
(str
, default:'closure'
) –whether to set
var.grad
,var.update
or 'var.closure`. Defaults to 'closure'.
Example:
Basic SPSA method implementation.
class SPSA(GradApproximator):
def __init__(self, h=1e-3):
defaults = dict(h=h)
super().__init__(defaults)
@torch.no_grad
def approximate(self, closure, params, loss):
perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]
# evaluate params + perturbation
torch._foreach_add_(params, perturbation)
loss_plus = closure(False)
# evaluate params - perturbation
torch._foreach_sub_(params, perturbation)
torch._foreach_sub_(params, perturbation)
loss_minus = closure(False)
# restore original params
torch._foreach_add_(params, perturbation)
# calculate SPSA gradients
spsa_grads = []
for p, pert in zip(params, perturbation):
settings = self.settings[p]
h = settings['h']
d = (loss_plus - loss_minus) / (2*(h**2))
spsa_grads.append(pert * d)
# returns tuple: (grads, loss, loss_approx)
# loss must be with initial parameters
# since we only evaluated loss with perturbed parameters
# we only have loss_approx
return spsa_grads, None, loss_plus
Methods:
-
approximate
–Returns a tuple:
(grad, loss, loss_approx)
, make sure this resets parameters to their original values! -
pre_step
–This runs once before each step, whereas
approximate
may run multiple times per step if further modules
Source code in torchzero/modules/grad_approximation/grad_approximator.py
approximate ¶
approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]
Returns a tuple: (grad, loss, loss_approx)
, make sure this resets parameters to their original values!
Source code in torchzero/modules/grad_approximation/grad_approximator.py
pre_step ¶
pre_step(var: Var) -> None
This runs once before each step, whereas approximate
may run multiple times per step if further modules
evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations.
Source code in torchzero/modules/grad_approximation/grad_approximator.py
GradSign ¶
Bases: torchzero.core.transform.Transform
Copies gradient sign to update.
Source code in torchzero/modules/misc/misc.py
GradToNone ¶
Bases: torchzero.core.module.Module
Sets :code:grad
attribute to None on :code:var
.
Source code in torchzero/modules/ops/utility.py
GradientAccumulation ¶
Bases: torchzero.core.module.Module
Uses n
steps to accumulate gradients, after n
gradients have been accumulated, they are passed to :code:modules
and parameters are updated.
Accumulating gradients for n
steps is equivalent to increasing batch size by n
. Increasing the batch size
is more computationally efficient, but sometimes it is not feasible due to memory constraints.
Note
Technically this can accumulate any inputs, including updates generated by previous modules. As long as this module is first, it will accumulate the gradients.
Parameters:
-
n
(int
) –number of gradients to accumulate.
-
mean
(bool
, default:True
) –if True, uses mean of accumulated gradients, otherwise uses sum. Defaults to True.
-
stop
(bool
, default:True
) –this module prevents next modules from stepping unless
n
gradients have been accumulated. Setting this argument to False disables that. Defaults to True.
Examples:¶
Adam with gradients accumulated for 16 batches.
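A hedged sketch (assuming n is passed positionally; GradientAccumulation goes first so it accumulates raw gradients):
opt = tz.Modular(
    model.parameters(),
    tz.m.GradientAccumulation(16),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)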
Source code in torchzero/modules/misc/gradient_accumulation.py
GradientCorrection ¶
Bases: torchzero.core.transform.Transform
Estimates gradient at minima along search direction assuming function is quadratic.
This can be useful as an inner module for second order methods with an inexact line search.
Example:¶
L-BFGS with gradient correction
opt = tz.Modular(
model.parameters(),
tz.m.LBFGS(inner=tz.m.GradientCorrection()),
tz.m.Backtracking()
)
Reference
HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394
Source code in torchzero/modules/quasi_newton/quasi_newton.py
GradientSampling ¶
Bases: torchzero.core.reformulation.Reformulation
Samples and aggregates gradients and values at perturbed points.
This module can be used for gaussian homotopy and gradient sampling methods.
Parameters:
-
modules
(Chainable | None
, default:None
) –modules that will be optimizing the modified objective. if None, returns gradient of the modified objective as the update. Defaults to None.
-
sigma
(float
, default:1.0
) –initial magnitude of the perturbations. Defaults to 1.
-
n
(int
, default:100
) –number of perturbations per step. Defaults to 100.
-
aggregate
(str
, default:'mean'
) –how to aggregate values and gradients - "mean" - uses mean of the gradients, as in gaussian homotopy. - "max" - uses element-wise maximum of the gradients. - "min" - uses element-wise minimum of the gradients. - "min-norm" - picks gradient with the lowest norm.
Defaults to 'mean'.
-
distribution
(Literal
, default:'gaussian'
) –distribution for random perturbations. Defaults to 'gaussian'.
-
include_x0
(bool
, default:True
) –whether to include gradient at un-perturbed point. Defaults to True.
-
fixed
(bool
, default:True
) –if True, perturbations do not get replaced by new random perturbations until termination criteria is satisfied. Defaults to True.
-
pre_generate
(bool
, default:True
) –if True, perturbations are pre-generated before each step. This requires more memory to store all of them, but ensures they do not change when closure is evaluated multiple times. Defaults to True.
-
termination
(TerminationCriteriaBase | Sequence[TerminationCriteriaBase] | None
, default:None
) –a termination criteria module, sigma will be multiplied by
decay
when termination criteria is satisfied, and new perturbations will be generated iffixed
. Defaults to None. -
decay
(float
, default:0.6666666666666666
) –sigma multiplier on termination criteria. Defaults to 2/3.
-
reset_on_termination
(bool
, default:True
) –whether to reset states of all other modules on termination. Defaults to True.
-
sigma_strategy
(str | None
, default:None
) –strategy for adapting sigma. If condition is satisfied, sigma is multiplied by
sigma_nplus
, otherwise it is multiplied bysigma_nminus
. - "grad-norm" - at leastsigma_target
gradients should have lower norm than at un-perturbed point. - "value" - at leastsigma_target
values (losses) should be lower than at un-perturbed point. - None - doesn't use adaptive sigma. This introduces a side-effect to the closure, so it should be left at None if you use a trust region or line search to optimize the modified objective. Defaults to None.
-
sigma_target
(int
, default:0.2
) –number of elements to satisfy the condition in
sigma_strategy
. Defaults to 0.2. -
sigma_nplus
(float
, default:1.3333333333333333
) –sigma multiplier when
sigma_strategy
condition is satisfied. Defaults to 4/3. -
sigma_nminus
(float
, default:0.6666666666666666
) –sigma multiplier when
sigma_strategy
condition is not satisfied. Defaults to 2/3. -
seed
(int | None
, default:None
) –seed. Defaults to None.
Source code in torchzero/modules/smoothing/sampling.py
Graft ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs tensors rescaled to have the same norm as :code:magnitude(tensors)
.
Source code in torchzero/modules/ops/binary.py
GraftGradToUpdate ¶
Bases: torchzero.core.transform.Transform
Outputs gradient grafted to update, that is gradient rescaled to have the same norm as the update.
Source code in torchzero/modules/misc/misc.py
GraftModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Outputs :code:direction
output rescaled to have the same norm as :code:magnitude
output.
Parameters:
-
direction
(Chainable
) –module to use the direction from
-
magnitude
(Chainable
) –module to use the magnitude from
-
tensorwise
(bool
, default:True
) –whether to calculate norm per-tensor or globally. Defaults to True.
-
ord
(float
, default:2
) –norm order. Defaults to 2.
-
eps
(float
, default:1e-06
) –clips denominator to be no less than this value. Defaults to 1e-6.
-
strength
(float
, default:1
) –strength of grafting. Defaults to 1.
Example
Shampoo grafted to Adam
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.GraftModules(
direction = tz.m.Shampoo(),
magnitude = tz.m.Adam(),
),
tz.m.LR(1e-3)
)
Reference
Agarwal, N., Anil, R., Hazan, E., Koren, T., & Zhang, C. (2020). Disentangling adaptive gradient methods from learning rates. arXiv preprint arXiv:2002.11803. https://arxiv.org/pdf/2002.11803
Source code in torchzero/modules/ops/multi.py
GraftToGrad ¶
Bases: torchzero.core.transform.Transform
Grafts update to the gradient, that is update is rescaled to have the same norm as the gradient.
Source code in torchzero/modules/misc/misc.py
GraftToParams ¶
Bases: torchzero.core.transform.Transform
Grafts update to the parameters, that is update is rescaled to have the same norm as the parameters, but no smaller than :code:eps
.
Source code in torchzero/modules/misc/misc.py
GraftToUpdate ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs :code:magnitude(tensors)
rescaled to have the same norm as tensors
Source code in torchzero/modules/ops/binary.py
GramSchimdt ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
outputs tensors made orthogonal to other(tensors)
via Gram-Schmidt.
Source code in torchzero/modules/ops/binary.py
Greenstadt1 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Greenstadt's first Quasi-Newton method.
Note
a trust region or an accurate line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Greenstadt2 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Greenstadt's second Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
HagerZhang ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Hager-Zhang nonlinear conjugate gradient method,
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
HeavyBall ¶
Bases: torchzero.modules.momentum.momentum.EMA
Polyak's momentum (heavy-ball method).
Parameters:
-
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
dampening
(float
, default:0
) –momentum dampening. Defaults to 0.
-
debiased
(bool
, default:False
) –whether to debias the EMA like in Adam. Defaults to False.
-
lerp
(bool
, default:False
) –whether to use linear interpolation, if True, this becomes exponential moving average. Defaults to False.
-
ema_init
(str
, default:'update'
) –initial values for the EMA, "zeros" or "update".
-
target
(Literal
, default:'update'
) –target to apply EMA to. Defaults to 'update'.
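Examples:
A hedged sketch (momentum is the first positional argument):
opt = tz.Modular(
    model.parameters(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-2),
)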
Source code in torchzero/modules/momentum/momentum.py
HestenesStiefel ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Hestenes–Stiefel nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
Source code in torchzero/modules/conjugate_gradient/cg.py
HigherOrderNewton ¶
Bases: torchzero.core.module.Module
A basic arbitrary order newton's method with optional trust region and proximal penalty.
This constructs an nth order taylor approximation via autograd and minimizes it with
scipy.optimize.minimize
trust region newton solvers with optional proximal penalty.
The hessian of taylor approximation is easier to evaluate, plus it can be evaluated in a batched mode, so it can be more efficient in very specific instances.
Notes
- In most cases HigherOrderNewton should be the first module in the chain because it relies on extra autograd. Use the
inner
argument if you wish to apply Newton preconditioning to another module's output. - This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating higher order derivatives. The closure must accept a
backward
argument (refer to documentation). - this uses roughly O(N^order) memory and solving the subproblem is very expensive.
- "none" and "proximal" trust methods may generate subproblems that have no minima, causing divergence.
Args:
order (int, optional):
Order of the method, number of taylor series terms (orders of derivatives) used to approximate the function. Defaults to 4.
trust_method (str | None, optional):
Method used for trust region.
- "bounds" - the model is minimized within bounds defined by trust region.
- "proximal" - the model is minimized with penalty for going too far from current point.
- "none" - disables trust region.
Defaults to 'bounds'.
increase (float, optional): trust region multiplier on good steps. Defaults to 1.5.
decrease (float, optional): trust region multiplier on bad steps. Defaults to 0.75.
trust_init (float | None, optional):
initial trust region size. If none, defaults to 1 on :code:`trust_method="bounds"` and 0.1 on ``"proximal"``. Defaults to None.
trust_tol (float, optional):
Maximum ratio of expected loss reduction to actual reduction for trust region increase.
Should be 1 or higher. Defaults to 2.
de_iters (int | None, optional):
If this is specified, the model is minimized via differential evolution first to possibly escape local minima,
then it is passed to scipy.optimize.minimize. Defaults to None.
vectorize (bool, optional): whether to enable vectorized jacobians (usually faster). Defaults to True.
Source code in torchzero/modules/higher_order/higher_order_newton.py
Horisho ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Hoshino's variable metric Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394
Source code in torchzero/modules/quasi_newton/quasi_newton.py
HpuEstimate ¶
Bases: torchzero.core.transform.Transform
returns y/||s||
, where y
is difference between current and previous update (gradient), s
is difference between current and previous parameters. The returned tensors are a finite difference approximation to hessian times previous update.
Source code in torchzero/modules/misc/misc.py
ICUM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods due to only updating one column of the inverse hessian approximation per step.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Lopes, V. L., & Martínez, J. M. (1995). Convergence properties of the inverse column-updating method. Optimization Methods & Software, 6(2), 127–144. from https://www.ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-1993-76.pdf
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Identity ¶
Bases: torchzero.core.module.Module
Identity operator that is argument-insensitive. This also can be used as identity hessian for trust region methods.
Source code in torchzero/modules/ops/utility.py
IntermoduleCautious ¶
Bases: torchzero.core.module.Module
Negates the update from the :code:main
module where its sign doesn't match the output of the :code:compare
module.
Parameters:
-
main
(Chainable
) –main module or sequence of modules whose update will be cautioned.
-
compare
(Chainable
) –modules or sequence of modules to compare the sign to.
-
normalize
(bool
, default:False
) –renormalize update after masking. Defaults to False.
-
eps
(float
, default:1e-06
) –epsilon for normalization. Defaults to 1e-6.
-
mode
(str
, default:'zero'
) –what to do with updates with inconsistent signs. - "zero" - set them to zero (as in paper) - "grad" - set them to the gradient (same as using update magnitude and gradient sign) - "backtrack" - negate them
Source code in torchzero/modules/momentum/cautious.py
InverseFreeNewton ¶
Bases: torchzero.core.module.Module
Inverse-free newton's method
.. note::
In most cases Newton should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply Newton preconditioning to another module's output.
.. note::
This module requires a closure passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating the hessian.
The closure must accept a backward
argument (refer to documentation).
.. warning:: this uses roughly O(N^2) memory.
Reference
Massalski, Marcin, and Magdalena Nockowska-Rosiak. "INVERSE-FREE NEWTON'S METHOD." Journal of Applied Analysis & Computation 15.4 (2025): 2238-2257.
Source code in torchzero/modules/second_order/newton.py
LBFGS ¶
Bases: torchzero.core.transform.Transform
Limited-memory BFGS algorithm. A line search or trust region is recommended.
Parameters:
-
history_size
(int
, default:10
) –number of past parameter differences and gradient differences to store. Defaults to 10.
-
ptol
(float | None
, default:1e-32
) –skips updating the history if maximum absolute value of parameter difference is less than this value. Defaults to 1e-32.
-
ptol_restart
(bool
, default:False
) –If true, whenever parameter difference is less than
ptol
, L-BFGS state will be reset. Defaults to False. -
gtol
(float | None
, default:1e-32
) –skips updating the history if maximum absolute value of gradient difference is less than this value. Defaults to 1e-32.
-
gtol_restart
(bool
, default:False
) –If true, whenever gradient difference is less than
gtol
, L-BFGS state will be reset. Defaults to False. -
sy_tol
(float | None
, default:1e-32
) –history will not be updated whenever s⋅y is less than this value (negative s⋅y means negative curvature)
-
scale_first
(bool
, default:True
) –makes first step, when hessian approximation is not available, small to reduce number of line search iterations. Defaults to True.
-
update_freq
(int
, default:1
) –how often to update L-BFGS history. Larger values may be better for stochastic optimization. Defaults to 1.
-
damping
(Union
, default:None
) –damping to use, can be "powell" or "double". Defaults to None.
-
inner
(Chainable | None
, default:None
) –optional inner modules applied after updating L-BFGS history and before preconditioning. Defaults to None.
Examples:¶
L-BFGS with line search
L-BFGS with trust region
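Hedged sketches: the line-search variant follows the LBFGS + Backtracking pairing shown in the GradientCorrection example on this page; whether tz.m.TrustCG accepts the L-BFGS approximation directly is an assumption here:
# L-BFGS with a backtracking line search
opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.Backtracking(),
)
# L-BFGS inside a trust region (assumed to be supported by TrustCG)
opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(tz.m.LBFGS()),
)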
Source code in torchzero/modules/quasi_newton/lbfgs.py
LMAdagrad ¶
Bases: torchzero.core.transform.TensorwiseTransform
Limited-memory full matrix Adagrad.
The update rule is to stack recent gradients into M, compute U, S <- SVD(M), then calculate update as U S^-1 Uᵀg. But it uses eigendecomposition on MᵀM to get U and S^2 because that is faster when you don't need V.
This is equivalent to full-matrix Adagrad on recent gradients.
Parameters:
-
history_size
(int
, default:100
) –number of past gradients to store. Defaults to 100.
-
update_freq
(int
, default:1
) –frequency of updating the preconditioner (U and S). Defaults to 1.
-
damping
(float
, default:0.0001
) –damping value. Defaults to 1e-4.
-
rdamping
(float
, default:0
) –value of damping relative to singular values norm. Defaults to 0.
-
order
(int
, default:1
) –order=2 means gradient differences are used in place of gradients. Higher order uses higher order differences. Defaults to 1.
-
true_damping
(bool
, default:True
) –If True, damping is added to squared singular values to mimic Adagrad. Defaults to True.
-
U_beta
(float | None
, default:None
) –momentum for U (too unstable, don't use). Defaults to None.
-
L_beta
(float | None
, default:None
) –momentum for L (too unstable, don't use). Defaults to None.
-
interval
(int
, default:1
) –Interval between gradients that are added to history (2 means every second gradient is used). Defaults to 1.
-
concat_params
(bool
, default:True
) –if True, treats all parameters as a single vector, meaning it will also whiten inter-parameters. Defaults to True.
-
inner
(Chainable | None
, default:None
) –preconditioner will be applied to output of this module. Defaults to None.
Examples:¶
Limited-memory Adagrad
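A hedged sketch of the plain method with the defaults listed above:
optimizer = tz.Modular(
    model.parameters(),
    tz.m.LMAdagrad(),
    tz.m.LR(0.01)
)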
Adam with L-Adagrad preconditioner (for debiasing second beta is 0.999 arbitrarily)
optimizer = tz.Modular(
model.parameters(),
tz.m.LMAdagrad(inner=tz.m.EMA()),
tz.m.Debias(0.9, 0.999),
tz.m.LR(0.01)
)
Stable Adam with L-Adagrad preconditioner (this is what I would recommend)
optimizer = tz.Modular(
model.parameters(),
tz.m.LMAdagrad(inner=tz.m.EMA()),
tz.m.Debias(0.9, 0.999),
tz.m.ClipNormByEMA(max_ema_growth=1.2),
tz.m.LR(0.01)
)
Source code in torchzero/modules/adaptive/lmadagrad.py
LR ¶
Bases: torchzero.core.transform.Transform
Learning rate. Adding this module also adds support for LR schedulers.
Source code in torchzero/modules/step_size/lr.py
LSR1 ¶
Bases: torchzero.core.transform.Transform
Limited-memory SR1 algorithm. A line search or trust region is recommended.
Parameters:
-
history_size
(int
, default:10
) –number of past parameter differences and gradient differences to store. Defaults to 10.
-
ptol
(float | None
, default:None
) –skips updating the history if maximum absolute value of parameter difference is less than this value. Defaults to None.
-
ptol_restart
(bool
, default:False
) –If true, whenever parameter difference is less than
ptol
, L-SR1 state will be reset. Defaults to False. -
gtol
(float | None
, default:None
) –skips updating the history if if maximum absolute value of gradient difference is less than this value. Defaults to None.
-
gtol_restart
(bool
, default:False
) –If true, whenever gradient difference is less than
gtol
, L-SR1 state will be reset. Defaults to False. -
scale_first
(bool
, default:False
) –makes first step, when hessian approximation is not available, small to reduce number of line search iterations. Defaults to False.
-
update_freq
(int
, default:1
) –how often to update L-SR1 history. Larger values may be better for stochastic optimization. Defaults to 1.
-
damping
(Union
, default:None
) –damping to use, can be "powell" or "double". Defaults to None.
-
compact
(bool
) –if True, uses a compact representation version of L-SR1. It is much faster computationally, but less stable.
-
inner
(Chainable | None
, default:None
) –optional inner modules applied after updating L-SR1 history and before preconditioning. Defaults to None.
Examples:¶
L-SR1 with line search
L-SR1 with trust region
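Hedged sketches mirroring the L-BFGS examples (whether tz.m.TrustCG accepts the L-SR1 approximation directly is an assumption here):
# L-SR1 with a backtracking line search
opt = tz.Modular(
    model.parameters(),
    tz.m.LSR1(),
    tz.m.Backtracking(),
)
# L-SR1 inside a trust region (assumed to be supported by TrustCG)
opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(tz.m.LSR1()),
)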
Source code in torchzero/modules/quasi_newton/lsr1.py
LambdaHomotopy ¶
Bases: torchzero.modules.misc.homotopy.HomotopyBase
Source code in torchzero/modules/misc/homotopy.py
LaplacianSmoothing ¶
Bases: torchzero.core.transform.Transform
Applies laplacian smoothing via a fast Fourier transform solver which can improve generalization.
Parameters:
-
sigma
(float
, default:1
) –controls the amount of smoothing. Defaults to 1.
-
layerwise
(bool
, default:True
) –If True, applies smoothing to each parameter's gradient separately, Otherwise applies it to all gradients, concatenated into a single vector. Defaults to True.
-
min_numel
(int
, default:4
) –minimum number of elements in a parameter to apply laplacian smoothing to. Only has effect if
layerwise
is True. Defaults to 4. -
target
(str
, default:'update'
) –what to set on var.
Examples:
Laplacian Smoothing Gradient Descent optimizer as in the paper
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.LaplacianSmoothing(),
tz.m.LR(1e-2),
)
Reference
Osher, S., Wang, B., Yin, P., Luo, X., Barekat, F., Pham, M., & Lin, A. (2022). Laplacian smoothing gradient descent. Research in the Mathematical Sciences, 9(3), 55.
Source code in torchzero/modules/smoothing/laplacian.py
LastAbsoluteRatio ¶
Bases: torchzero.core.transform.Transform
Outputs ratio between absolute values of past two updates; the numerator is determined by the :code:numerator
argument.
Source code in torchzero/modules/misc/misc.py
LastDifference ¶
Bases: torchzero.core.transform.Transform
Outputs difference between past two updates.
Source code in torchzero/modules/misc/misc.py
LastGradDifference ¶
Bases: torchzero.core.module.Module
Outputs difference between past two gradients.
Source code in torchzero/modules/misc/misc.py
LastProduct ¶
Bases: torchzero.core.transform.Transform
Outputs product of past two updates.
Source code in torchzero/modules/misc/misc.py
LastRatio ¶
Bases: torchzero.core.transform.Transform
Outputs ratio between past two updates, the numerator is determined by :code:numerator
argument.
Source code in torchzero/modules/misc/misc.py
LerpModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Does a linear interpolation of :code:input(tensors)
and :code:end(tensors)
based on a scalar :code:weight
.
The output is given by :code:output = input(tensors) + weight * (end(tensors) - input(tensors))
Source code in torchzero/modules/ops/multi.py
LevenbergMarquardt ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Levenberg-Marquardt trust region algorithm.
Parameters:
-
hess_module
(Module | None
) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton
andtz.m.GaussNewton
. When using quasi-newton methods, setinverse=False
when constructing them. -
y
(float
, default:0
) –when
y=0
, identity matrix is added to hessian, wheny=1
, diagonal of the hessian approximation is added. Values between interpolate. This should only be used with Gauss-Newton. Defaults to 0. -
eta
(float
, default:0.0
) –if ratio of actual to predicted reduction is larger than this, step is accepted. When
hess_module
isNewton
orGaussNewton
, this can be set to 0. Defaults to 0.0. -
nplus
(float
, default:3.5
) –increase factor on successful steps. Defaults to 3.5.
-
nminus
(float
, default:0.25
) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good
(float
, default:0.99
) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus
. -
rho_bad
(float
, default:0.0001
) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus
. -
init
(float
, default:1
) –Initial trust region value. Defaults to 1.
-
update_freq
(int
, default:1
) –frequency of updating the hessian. Defaults to 1.
-
max_attempts
(max_attempts
, default:10
) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
fallback
(bool
, default:False
) –if
True
, whenhess_module
maintains hessian inverse which can't be inverted efficiently, it will be inverted anyway. WhenFalse
(default), aRuntimeError
will be raised instead. -
inner
(Chainable | None
, default:None
) –preconditioning is applied to output of this module. Defaults to None.
Examples:
Gauss-Newton with Levenberg-Marquardt trust-region
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.GaussNewton()),
)
LM-SR1
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False)),
)
First order trust region (hessian is assumed to be identity)
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.Identity()),
)
Source code in torchzero/modules/trust_region/levenberg_marquardt.py
LineSearchBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for line searches.
This is an abstract class, to use it, subclass it and override search
.
Parameters:
-
defaults
(dict[str, Any] | None
) –dictionary with defaults.
-
maxiter
(int | None
, default:None
) –if this is specified, the search method will terminate upon evaluating the objective this many times, and step size with the lowest loss value will be used. This is useful when passing
make_objective
to an external library which doesn't have a maxiter option. Defaults to None.
Other useful methods
evaluate_f
- returns loss with a given scalar step sizeevaluate_f_d
- returns loss and directional derivative with a given scalar step sizemake_objective
- creates a function that accepts a scalar step size and returns loss. This can be passed to a scalar solver, such as scipy.optimize.minimize_scalar.make_objective_with_derivative
- creates a function that accepts a scalar step size and returns a tuple with loss and directional derivative. This can be passed to a scalar solver.
Examples:
Basic line search¶
This evaluates all step sizes in a range by using the :code:self.evaluate_step_size
method.
class GridLineSearch(LineSearch):
    def __init__(self, start, end, num):
        defaults = dict(start=start, end=end, num=num)
        super().__init__(defaults)

    @torch.no_grad
    def search(self, update, var):
        start = self.defaults["start"]
        end = self.defaults["end"]
        num = self.defaults["num"]

        lowest_loss = float("inf")
        best_step_size = 0.0  # always overwritten on the first iteration since lowest_loss starts at inf

        for step_size in torch.linspace(start, end, num):
            loss = self.evaluate_step_size(step_size.item(), var=var, backward=False)
            if loss < lowest_loss:
                lowest_loss = loss
                best_step_size = step_size

        return best_step_size
Using external solver via self.make_objective¶
Here we let :code:scipy.optimize.minimize_scalar
solver find the best step size via :code:self.make_objective
class ScipyMinimizeScalar(LineSearch):
    def __init__(self, method: str | None = None):
        defaults = dict(method=method)
        super().__init__(defaults)

    @torch.no_grad
    def search(self, update, var):
        objective = self.make_objective(var=var)
        method = self.defaults["method"]
        res = self.scopt.minimize_scalar(objective, method=method)
        return res.x
Methods:
-
evaluate_f
–evaluate function value at alpha
step_size
. -
evaluate_f_d
–evaluate function value and directional derivative in the direction of the update at step size
step_size
. -
evaluate_f_d_g
–evaluate function value, directional derivative, and gradient list at step size
step_size
. -
search
–Finds the step size to use
Source code in torchzero/modules/line_search/line_search.py
evaluate_f ¶
evaluate_f(step_size: float, var: Var, backward: bool = False)
evaluate function value at alpha step_size
.
Source code in torchzero/modules/line_search/line_search.py
evaluate_f_d ¶
evaluate_f_d(step_size: float, var: Var)
evaluate function value and directional derivative in the direction of the update at step size step_size
.
Source code in torchzero/modules/line_search/line_search.py
evaluate_f_d_g ¶
evaluate_f_d_g(step_size: float, var: Var)
evaluate function value, directional derivative, and gradient list at step size step_size
.
Source code in torchzero/modules/line_search/line_search.py
Lion ¶
Bases: torchzero.core.transform.Transform
Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.
Parameters:
-
beta1
(float
, default:0.9
) –dampening for momentum. Defaults to 0.9.
-
beta2
(float
, default:0.99
) –momentum factor. Defaults to 0.99.
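Examples:
A minimal usage sketch (an assumption following the common pattern on this page; since Lion produces a sign-based update, it is followed by tz.m.LR):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.Lion(beta1=0.9, beta2=0.99),
    tz.m.LR(1e-4),
)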
Source code in torchzero/modules/adaptive/lion.py
LiuStorey ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Liu-Storey nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
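Example (a sketch of the pairing described in the note above; the StrongWolfe arguments are copied from it):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.LiuStorey(),
    tz.m.StrongWolfe(c2=0.1, a_init="first-order"),
)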
Source code in torchzero/modules/conjugate_gradient/cg.py
LogHomotopy ¶
MARSCorrection ¶
Bases: torchzero.core.transform.Transform
MARS variance reduction correction.
Place any other momentum-based optimizer after this,
and make sure the beta
parameter matches the momentum used in that optimizer.
Parameters:
-
beta
(float
, default:0.9
) –use the same beta as you use in the momentum module. Defaults to 0.9.
-
scaling
(float
, default:0.025
) –controls the scale of gradient correction in variance reduction. Defaults to 0.025.
-
max_norm
(float
, default:1
) –clips norm of corrected gradients, None to disable. Defaults to 1.
Examples:¶
Mars-AdamW
optimizer = tz.Modular(
    model.parameters(),
    tz.m.MARSCorrection(beta=0.95),
    tz.m.Adam(beta1=0.95, beta2=0.99),
    tz.m.WeightDecay(1e-3),
    tz.m.LR(0.1)
)
Mars-Lion
optimizer = tz.Modular(
    model.parameters(),
    tz.m.MARSCorrection(beta=0.9),
    tz.m.Lion(beta1=0.9),
    tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/mars.py
MSAM ¶
Bases: torchzero.core.transform.Transform
Momentum-SAM from https://arxiv.org/pdf/2401.12033.
This implementation expresses the update rule as a function of the gradient. This way it can be used as a drop-in replacement for momentum strategies in other optimizers.
To combine MSAM with other optimizers in the way done in the official implementation,
e.g. to make Adam_MSAM, use tz.m.MSAMObjective
module.
Note
MSAM has a learning rate hyperparameter that can't really be removed from the update rule.
To avoid compounding learning rate modifications, remove the
module if you had it.
Parameters:
-
lr
(float
) –learning rate. Adding this module adds support for learning rate schedulers.
-
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
rho
(float
, default:0.3
) –perturbation strength. Defaults to 0.3.
-
weight_decay
(float
, default:0
) –weight decay. It is applied to perturbed parameters, so it is different from applying :code:
tz.m.WeightDecay
after MSAM. Defaults to 0. -
nesterov
(bool
, default:False
) –whether to use nesterov momentum formula. Defaults to False.
-
lerp
(bool
, default:False
) –whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.
Examples:
MSAM
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.MSAM(1e-3)
)
Adam with MSAM instead of exponential average. Note that this is different from Adam_MSAM.
To make Adam_MSAM and such, use the :code:tz.m.MSAMObjective
module.
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.RMSprop(0.999, inner=tz.m.MSAM(1e-3)),
    tz.m.Debias(0.9, 0.999),
)
Source code in torchzero/modules/adaptive/msam.py
MSAMObjective ¶
Bases: torchzero.modules.adaptive.msam.MSAM
Momentum-SAM from https://arxiv.org/pdf/2401.12033.
Note
Please make sure to place tz.m.LR
inside the modules
argument. For example,
tz.m.MSAMObjective([tz.m.Adam(), tz.m.LR(1e-3)])
. Putting LR after MSAM will lead
to an incorrect update rule.
Parameters:
-
modules
(Chainable
) –modules that will optimize the MSAM objective. Make sure :code:
tz.m.LR
is one of them. -
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
rho
(float
, default:0.3
) –perturbation strength. Defaults to 0.3.
-
nesterov
(bool
, default:False
) –whether to use nesterov momentum formula. Defaults to False.
-
lerp
(bool
, default:False
) –whether to use linear interpolation, if True, MSAM momentum becomes similar to exponential moving average. Defaults to False.
Examples:
AdamW-MSAM
.. code-block:: python
opt = tz.Modular(
    bench.parameters(),
    tz.m.MSAMObjective(
        [tz.m.Adam(), tz.m.WeightDecay(1e-3), tz.m.LR(1e-3)],
        rho=1.
    )
)
Source code in torchzero/modules/adaptive/msam.py
MatrixMomentum ¶
Bases: torchzero.core.module.Module
Second order momentum method.
Matrix momentum is useful for convex objectives; for some reason it also generalizes very well on elastic net logistic regression.
Notes
-
mu
needs to be tuned very carefully. It is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable. I have devised an adaptive version of this -tz.m.AdaptiveMatrixMomentum
, and it works well without having to tunemu
, however the adaptive version doesn't work on stochastic objectives. -
In most cases
MatrixMomentum
should be the first module in the chain because it relies on autograd. -
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backward
argument.
Parameters:
-
mu
(float
, default:0.1
) –this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
h
(float
, default:0.001
) –finite difference step size if hvp_method is set to finite difference. Defaults to 1e-3.
-
hvp_tfm
(Chainable | None
, default:None
) –optional module applied to hessian-vector products. Defaults to None.
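Examples:
A minimal closure-based sketch (the model, data and mu value are placeholders; the closure signature with a backward flag follows the NaturalGradient example further down this page):
.. code-block:: python

model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 1))
X, y = torch.randn(64, 20), torch.randn(64, 1)

opt = tz.Modular(
    model.parameters(),
    tz.m.MatrixMomentum(mu=0.01),
    tz.m.LR(1e-2),
)

def closure(backward=True):
    loss = (model(X) - y).pow(2).mean()
    if backward:
        opt.zero_grad()
        loss.backward()
    return loss

for _ in range(100):
    opt.step(closure)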
Reference
Orr, Genevieve, and Todd Leen. "Using curvature information for fast stochastic search." Advances in neural information processing systems 9 (1996).
Source code in torchzero/modules/adaptive/matrix_momentum.py
Maximum ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs :code:maximum(tensors, other(tensors))
Source code in torchzero/modules/ops/binary.py
MaximumModules ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs elementwise maximum of :code:inputs
that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
McCormick ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
McCormick's Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
This is "Algorithm 2", attributed to McCormick in this paper. However for some reason this method is also called Pearson's 2nd method in other sources.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
MeZO ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h
(float
, default:0.001
) –finite difference step size if jvp_method is set to
forward
orcentral
. Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of random gradient samples. Defaults to 1.
-
formula
(Literal
, default:'central2'
) –finite difference formula. Defaults to 'central2'.
-
distribution
(Literal
, default:'rademacher'
) –distribution. Defaults to "rademacher".
-
target
(Literal
, default:'closure'
) –what to set on var. Defaults to "closure".
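Examples:
A minimal closure-based sketch (model, data and learning rate are placeholders; the backward branch is kept for the standard closure interface, although MeZO itself only needs forward passes):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.MeZO(n_samples=1),
    tz.m.LR(1e-3),
)

def closure(backward=True):
    loss = criterion(model(X), y)
    if backward:
        opt.zero_grad()
        loss.backward()
    return loss

opt.step(closure)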
References
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
Source code in torchzero/modules/grad_approximation/rfdm.py
Mean ¶
Bases: torchzero.modules.ops.reduce.Sum
Outputs a mean of :code:inputs
that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
MedianAveraging ¶
Bases: torchzero.core.transform.TensorwiseTransform
Median of past history_size
updates.
Parameters:
-
history_size
(int
) –Number of past updates to average
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/momentum/averaging.py
Minimum ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs :code:minimum(tensors, other(tensors))
Source code in torchzero/modules/ops/binary.py
MinimumModules ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs elementwise minimum of :code:inputs
that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
Mul ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Multiply tensors by :code:other
. :code:other
can be a number or a module.
If :code:other
is a module, this calculates :code:tensors * other(tensors)
Source code in torchzero/modules/ops/binary.py
MulByLoss ¶
Bases: torchzero.core.module.Module
Multiplies update by loss times :code:alpha
Source code in torchzero/modules/misc/misc.py
MultiOperationBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for operations that use operands. This is an abstract class, subclass it and override transform
method to use it.
Methods:
-
transform
–applies the operation to operands
Source code in torchzero/modules/ops/multi.py
Multistep ¶
Bases: torchzero.core.module.Module
Performs :code:steps
inner steps with :code:module
per each step.
The update is taken to be the parameter difference between parameters before and after the inner loop.
Source code in torchzero/modules/misc/multistep.py
MuonAdjustLR ¶
Bases: torchzero.core.transform.Transform
LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
Orthogonalize already has this built in with the adjust_lr
setting; however, you might want to move this module later in the chain.
Source code in torchzero/modules/adaptive/muon.py
NAG ¶
Bases: torchzero.core.transform.Transform
Nesterov accelerated gradient method (nesterov momentum).
Parameters:
-
momentum
(float
, default:0.9
) –momentum (beta). Defaults to 0.9.
-
dampening
(float
, default:0
) –momentum dampening. Defaults to 0.
-
lerp
(bool
, default:False
) –whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.
-
target
(Literal
, default:'update'
) –target to apply EMA to. Defaults to 'update'.
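Examples:
A minimal usage sketch following the SGD-like pattern used elsewhere on this page (the learning rate is a placeholder):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.NAG(momentum=0.9),
    tz.m.LR(1e-2),
)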
Source code in torchzero/modules/momentum/momentum.py
NanToNum ¶
Bases: torchzero.core.transform.Transform
Convert nan
, inf
and -inf
to numbers.
Parameters:
-
nan
(optional
, default:None
) –the value to replace NaNs with. Default is zero.
-
posinf
(optional
, default:None
) –if a Number, the value to replace positive infinity values with. If None, positive infinity values are replaced with the greatest finite value representable by input's dtype. Default is None.
-
neginf
(optional
, default:None
) –if a Number, the value to replace negative infinity values with. If None, negative infinity values are replaced with the lowest finite value representable by input's dtype. Default is None.
Source code in torchzero/modules/ops/unary.py
NaturalGradient ¶
Bases: torchzero.core.module.Module
Natural gradient approximated via empirical fisher information matrix.
To use this, either pass a vector of per-sample losses to the step method, or make sure
the closure returns it. Gradients will be calculated via batched autograd within this module,
so you don't need to implement the backward pass. When using a closure, please add the backward
argument;
it will always be False, but it is required. See below for an example.
Note
Empirical fisher information matrix may give a really bad approximation in some cases.
If that is the case, set sqrt
to True to perform whitening instead, which is way more robust.
Parameters:
-
reg
(float
, default:1e-08
) –regularization parameter. Defaults to 1e-8.
-
sqrt
(bool
, default:False
) –if True, uses square root of empirical fisher information matrix. Both EFIM and its square root can be calculated and stored efficiently without ndim^2 memory. Square root whitens the gradient and often performs much better, especially when you try to use NGD with a vector that isn't strictly per-sample gradients, but rather, for example, different losses.
-
gn_grad
(bool
, default:False
) –if True, uses Gauss-Newton G^T @ f as the gradient, which is effectively sum weighted by value and is equivalent to squaring the values. This way you can solve least-squares objectives with a NGD-like algorithm. If False, uses sum of per-sample gradients. This has an effect when
sqrt=True
, and affects thegrad
attribute. Defaults to False. -
batched
(bool
, default:True
) –whether to use vmapping. Defaults to True.
Examples:
training a neural network:
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))

opt = tz.Modular(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)

for i in range(100):
    y_hat = model(X) # (64, 10)
    losses = (y_hat - y).pow(2).mean(0) # (10, )
    opt.step(loss=losses)
    if i % 10 == 0:
        print(f'{losses.mean() = }')
training a neural network - closure version
X = torch.randn(64, 20)
y = torch.randn(64, 10)
model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))

opt = tz.Modular(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)

def closure(backward=True):
    y_hat = model(X) # (64, 10)
    return (y_hat - y).pow(2).mean(0) # (10, )

for i in range(100):
    losses = opt.step(closure)
    if i % 10 == 0:
        print(f'{losses.mean() = }')
minimizing the rosenbrock function with a mix of natural gradient, whitening and gauss-newton:
def rosenbrock(X):
    x1, x2 = X
    return torch.stack([(1 - x1).abs(), (10 * (x2 - x1**2).abs())])

X = torch.tensor([-1.1, 2.5], requires_grad=True)
opt = tz.Modular([X], tz.m.NaturalGradient(sqrt=True, gn_grad=True), tz.m.LR(0.05))

for iter in range(200):
    losses = rosenbrock(X)
    opt.step(loss=losses)
    if iter % 20 == 0:
        print(f'{losses.mean() = }')
Source code in torchzero/modules/adaptive/natural_gradient.py
Negate ¶
Bases: torchzero.core.transform.Transform
Returns :code:- input
Source code in torchzero/modules/ops/unary.py
NegateOnLossIncrease ¶
Bases: torchzero.core.module.Module
Uses an extra forward pass to evaluate loss at :code:parameters+update
,
if loss is larger than at :code:parameters
,
the update is set to 0 if :code:backtrack=False
and to :code:-update
otherwise
Source code in torchzero/modules/misc/multistep.py
NewDQN ¶
Bases: torchzero.modules.quasi_newton.diagonal_quasi_newton.DNRTR
Diagonal quasi-newton method.
Reference
Nosrati, Mahsa, and Keyvan Amini. "A new diagonal quasi-Newton algorithm for unconstrained optimization problems." Applications of Mathematics 69.4 (2024): 501-512.
Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
NewSSM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Self-scaling Quasi-Newton method.
Note
a line search such as tz.m.StrongWolfe()
is required.
Warning
this uses roughly O(N^2) memory.
Reference
Moghrabi, I. A., Hassan, B. A., & Askar, A. (2022). New self-scaling quasi-newton methods for unconstrained optimization. Int. J. Math. Comput. Sci., 17, 1061U.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Newton ¶
Bases: torchzero.core.module.Module
Exact newton's method via autograd.
Newton's method produces a direction jumping to the stationary point of quadratic approximation of the target function.
The update rule is given by (H + yI)⁻¹g
, where H
is the hessian and g
is the gradient, y
is the damping
parameter.
g
can be output of another module, if it is specifed in inner
argument.
Note
In most cases Newton should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply Newton preconditioning to another module's output.
Note
This module requires a closure to be passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating the hessian.
The closure must accept a backward
argument (refer to documentation).
Parameters:
-
damping
(float
, default:0
) –Tikhonov regularizer value. Set this to 0 when using trust region. Defaults to 0.
-
search_negative
((bool, Optional)
, default:False
) –if True, whenever a negative eigenvalue is detected, search direction is proposed along weighted sum of eigenvectors corresponding to negative eigenvalues.
-
use_lstsq
((bool, Optional)
, default:False
) –if True, least squares will be used to solve the linear system, this may generate reasonable directions when hessian is not invertible. If False, tries cholesky, if it fails tries LU, and then least squares. If
eigval_fn
is specified, eigendecomposition will always be used to solve the linear system and this argument will be ignored. -
hessian_method
(str
, default:'autograd'
) –how to calculate hessian. Defaults to "autograd".
-
vectorize
(bool
, default:True
) –whether to enable vectorized hessian. Defaults to True.
-
inner
(Chainable | None
, default:None
) –modules to apply hessian preconditioner to. Defaults to None.
-
H_tfm
(Callable | None
, default:None
) –optional hessian transforms, takes in two arguments -
(hessian, gradient)
.must return either a tuple:
(hessian, is_inverted)
with transformed hessian and a boolean value which must be True if transform inverted the hessian and False otherwise.Or it returns a single tensor which is used as the update.
Defaults to None.
-
eigval_fn
(Callable | None
, default:None
) –optional eigenvalues transform, for example
torch.abs
orlambda L: torch.clip(L, min=1e-8)
. If this is specified, eigendecomposition will be used to invert the hessian.
See also¶
tz.m.NewtonCG
: uses a matrix-free conjugate gradient solver and hessian-vector products, useful for large scale problems as it doesn't form the full hessian.tz.m.NewtonCGSteihaug
: trust region version oftz.m.NewtonCG
.tz.m.InverseFreeNewton
: an inverse-free variant of Newton's method.tz.m.quasi_newton
: large collection of quasi-newton methods that estimate the hessian.
Notes¶
Implementation details¶
(H + yI)⁻¹g
is calculated by solving the linear system (H + yI)x = g
.
The linear system is solved via cholesky decomposition, if that fails, LU decomposition, and if that fails, least squares.
Least squares can be forced by setting use_lstsq=True
, which may generate better search directions when linear system is overdetermined.
Additionally, if eigval_fn
is specified or search_negative
is True
,
eigendecomposition of the hessian is computed, eigval_fn
is applied to the eigenvalues,
and (H + yI)⁻¹
is computed using the computed eigenvectors and transformed eigenvalues.
This is generally more computationally expensive.
Handling non-convexity¶
Standard Newton's method does not handle non-convexity well without some modifications. This is because it jumps to the stationary point, which may be the maxima of the quadratic approximation.
The first modification to handle non-convexity is to modify the eigenvalues to be positive,
for example by setting eigval_fn = lambda L: L.abs().clip(min=1e-4)
.
Second modification is search_negative=True
, which will search along a negative curvature direction if one is detected.
This also requires an eigendecomposition.
The Newton direction can also be forced to be a descent direction by using tz.m.GradSign()
or tz.m.Cautious
,
but that may be significantly less efficient.
Examples:¶
Newton's method with backtracking line search
Newton preconditioning applied to momentum
Diagonal newton example. This will still evaluate the entire hessian so it isn't efficient, but if you wanted to see how diagonal newton behaves or compares to full newton, you can use this.
opt = tz.Modular(
    model.parameters(),
    tz.m.Newton(H_tfm = lambda H, g: g/H.diag()),
    tz.m.Backtracking()
)
Source code in torchzero/modules/second_order/newton.py
NewtonCG ¶
Bases: torchzero.core.module.Module
Newton's method with a matrix-free conjugate gradient or minimal-residual solver.
Notes
-
In most cases NewtonCG should be the first module in the chain because it relies on autograd. Use the
inner
argument if you wish to apply Newton preconditioning to another module's output. -
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backward
argument (refer to documentation).
Warning
CG may fail if hessian is not positive-definite.
Parameters:
-
maxiter
(int | None
, default:None
) –Maximum number of iterations for the conjugate gradient solver. By default, this is set to the number of dimensions in the objective function, which is the theoretical upper bound for CG convergence. Setting this to a smaller value (truncated Newton) can still generate good search directions. Defaults to None.
-
tol
(float
, default:1e-08
) –Relative tolerance for the conjugate gradient solver to determine convergence. Defaults to 1e-8.
-
reg
(float
, default:1e-08
) –Regularization parameter (damping) added to the Hessian diagonal. This helps ensure the system is positive-definite. Defaults to 1e-8.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
h
(float
, default:0.001
) –The step size for finite differences if :code:
hvp_method
is"forward"
or"central"
. Defaults to 1e-3. -
warm_start
(bool
, default:False
) –If
True
, the conjugate gradient solver is initialized with the solution from the previous optimization step. This can accelerate convergence, especially in truncated Newton methods. Defaults to False. -
inner
(Chainable | None
, default:None
) –NewtonCG will attempt to apply preconditioning to the output of this module.
Examples: Newton-CG with a backtracking line search:
Truncated Newton method (useful for large-scale problems):
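Sketches of the two setups named above (assumed, following the pattern of the other Newton-type modules on this page):
.. code-block:: python

# Newton-CG with a backtracking line search
opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCG(),
    tz.m.Backtracking()
)

# truncated Newton: cap the number of CG iterations
opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCG(maxiter=20),
    tz.m.Backtracking()
)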
Source code in torchzero/modules/second_order/newton_cg.py
NewtonCGSteihaug ¶
Bases: torchzero.core.module.Module
Newton's method with trust region and a matrix-free Steihaug-Toint conjugate gradient solver.
Notes
-
In most cases NewtonCGSteihaug should be the first module in the chain because it relies on autograd. Use the
inner
argument if you wish to apply Newton preconditioning to another module's output. -
This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a
backward
argument (refer to documentation).
Parameters:
-
eta
(float
, default:0.0
) –if ratio of actual to predicted reduction is larger than this, step is accepted. Defaults to 0.0.
-
nplus
(float
, default:3.5
) –increase factor on successful steps. Defaults to 3.5.
-
nminus
(float
, default:0.25
) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good
(float
, default:0.99
) –if ratio of actual to predicted reduction is larger than this, trust region size is multiplied by
nplus
. -
rho_bad
(float
, default:0.0001
) –if ratio of actual to predicted reduction is less than this, trust region size is multiplied by
nminus
. -
init
(float
, default:1
) –Initial trust region value. Defaults to 1.
-
max_attempts
(max_attempts
, default:100
) –maximum number of trust radius reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 100.
-
max_history
(int
, default:100
) –CG will store this many intermediate solutions, reusing them when trust radius is reduced instead of re-running CG. Each solution storage requires 2N memory. Defaults to 100.
-
boundary_tol
(float | None
, default:1e-06
) –The trust region only increases when suggested step's norm is at least
(1-boundary_tol)*trust_region
. This prevents increasing trust region when solution is not on the boundary. Defaults to 1e-6. -
maxiter
(int | None
, default:None
) –maximum number of CG iterations per step. Each iteration requires one backward pass if
hvp_method="forward"
, two otherwise. Defaults to None. -
miniter
(int
, default:1
) –minimal number of CG iterations. This prevents making no progress when the initial guess is already below tolerance. Defaults to 1.
-
tol
(float
, default:1e-08
) –terminates CG when norm of the residual is less than this value. Defaults to 1e-8.
-
reg
(float
, default:1e-08
) –hessian regularization. Defaults to 1e-8.
-
solver
(str
, default:'cg'
) –solver, "cg" or "minres". "cg" is recommended. Defaults to 'cg'.
-
adapt_tol
(bool
, default:True
) –if True, whenever trust radius collapses to smallest representable number, the tolerance is multiplied by 0.1. Defaults to True.
-
npc_terminate
(bool
, default:False
) –whether to terminate CG/MINRES whenever negative curvature is detected. Defaults to False.
-
hvp_method
(str
, default:'central'
) –either "forward" to use forward formula which requires one backward pass per Hvp, or "central" to use a more accurate central formula which requires two backward passes. "forward" is usually accurate enough. Defaults to "central".
-
h
(float
, default:0.001
) –finite difference step size. Defaults to 1e-3.
-
inner
(Chainable | None
, default:None
) –applies preconditioning to output of this module. Defaults to None.
Examples:¶
Trust-region Newton-CG:
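A sketch of the setup named above (assumed; the module manages its own trust region, so no line search or LR module is added):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.NewtonCGSteihaug(),
)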
Reference:¶
Steihaug, Trond. "The conjugate gradient method and trust regions in large scale optimization." SIAM Journal on Numerical Analysis 20.3 (1983): 626-637.
Source code in torchzero/modules/second_order/newton_cg.py
NoiseSign ¶
Bases: torchzero.core.transform.Transform
Outputs random tensors with sign copied from the update.
Source code in torchzero/modules/misc/misc.py
Noop ¶
Bases: torchzero.core.module.Module
Identity operator that is argument-insensitive. This also can be used as identity hessian for trust region methods.
Source code in torchzero/modules/ops/utility.py
Normalize ¶
Bases: torchzero.core.transform.Transform
Normalizes the update.
Parameters:
-
norm_value
(float
, default:1
) –desired norm value.
-
ord
(float
, default:2
) –norm order. Defaults to 2.
-
dim
(int | Sequence[int] | str | None
, default:None
) –calculates norm along those dimensions. If list/tuple, tensors are normalized along all dimensions in
dim
that they have. Can be set to "global" to normalize by global norm of all gradients concatenated to a vector. Defaults to None. -
inverse_dims
(bool
, default:False
) –if True, the
dims
argument is inverted, and all other dimensions are normalized. -
min_size
(int
, default:1
) –minimal size of a dimension to normalize along it. Defaults to 1.
-
target
(str
, default:'update'
) –what this affects.
Examples: Gradient normalization:
Update normalization:
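Sketches of the two setups named above (the target="grad" value for the gradient case is an assumption; the surrounding modules are placeholders):
.. code-block:: python

# gradient normalization (target="grad" is an assumption)
opt = tz.Modular(
    model.parameters(),
    tz.m.Normalize(norm_value=1, target="grad"),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)

# update normalization
opt = tz.Modular(
    model.parameters(),
    tz.m.Normalize(norm_value=1),
    tz.m.LR(1e-2),
)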
Source code in torchzero/modules/clipping/clipping.py
NormalizeByEMA ¶
Bases: torchzero.modules.clipping.ema_clipping.ClipNormByEMA
Sets norm of the update to be the same as the norm of an exponential moving average of past updates.
Parameters:
-
beta
(float
, default:0.99
) –beta for the exponential moving average. Defaults to 0.99.
-
ord
(float
, default:2
) –order of the norm. Defaults to 2.
-
eps
(float
, default:1e-06
) –epsilon for division. Defaults to 1e-6.
-
tensorwise
(bool
, default:True
) –if True, norms are calculated parameter-wise, otherwise treats all parameters as single vector. Defaults to True.
-
max_ema_growth
(float | None
, default:1.5
) –if specified, restricts how quickly exponential moving average norm can grow. The norm is allowed to grow by at most this value per step. Defaults to 1.5.
-
ema_init
(str
, default:'zeros'
) –How to initialize exponential moving average on first step, "update" to use the first update or "zeros". Defaults to 'zeros'.
Source code in torchzero/modules/clipping/ema_clipping.py
NORMALIZE
class-attribute
¶
NystromPCG ¶
Bases: torchzero.core.module.Module
Newton's method with a Nyström-preconditioned conjugate gradient solver. This tends to outperform NewtonCG but requires tuning sketch size. An adaptive version exists in https://arxiv.org/abs/2110.02820, I might implement it too at some point.
.. note::
This module requires a closure to be passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating HVPs.
The closure must accept a backward
argument (refer to documentation).
.. note::
In most cases NystromPCG should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply Newton preconditioning to another module's output.
Parameters:
-
sketch_size
(int
) –size of the sketch for preconditioning, this many hessian-vector products will be evaluated before running the conjugate gradient solver. Larger value improves the preconditioning and speeds up conjugate gradient.
-
maxiter
(int | None
, default:None
) –maximum number of iterations. By default this is set to the number of dimensions in the objective function, which is supposed to be enough for conjugate gradient to have guaranteed convergence. Setting this to a small value can still generate good enough directions. Defaults to None.
-
tol
(float
, default:0.001
) –relative tolerance for conjugate gradient solver. Defaults to 1e-3.
-
reg
(float
, default:1e-06
) –regularization parameter. Defaults to 1e-6.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
h
(float
, default:0.001
) –finite difference step size if :code:
hvp_method
is "forward" or "central". Defaults to 1e-3. -
inner
(Chainable | None
, default:None
) –modules to apply hessian preconditioner to. Defaults to None.
-
seed
(int | None
, default:None
) –seed for random generator. Defaults to None.
Examples:
NystromPCG with backtracking line search
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.NystromPCG(10),
    tz.m.Backtracking()
)
Reference
Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820
Source code in torchzero/modules/second_order/nystrom.py
NystromSketchAndSolve ¶
Bases: torchzero.core.module.Module
Newton's method with a Nyström sketch-and-solve solver.
.. note::
This module requires a closure to be passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating HVPs.
The closure must accept a backward
argument (refer to documentation).
.. note::
In most cases NystromSketchAndSolve should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply Newton preconditioning to another module's output.
.. note::
If this is unstable, increase the :code:reg
parameter and tune the rank.
.. note::
:code:tz.m.NystromPCG
usually outperforms this.
Parameters:
-
rank
(int
) –size of the sketch, this many hessian-vector products will be evaluated per step.
-
reg
(float
, default:0.001
) –regularization parameter. Defaults to 1e-3.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
h
(float
, default:0.001
) –finite difference step size if :code:
hvp_method
is "forward" or "central". Defaults to 1e-3. -
inner
(Chainable | None
, default:None
) –modules to apply hessian preconditioner to. Defaults to None.
-
seed
(int | None
, default:None
) –seed for random generator. Defaults to None.
Examples:
NystromSketchAndSolve with backtracking line search
.. code-block:: python
opt = tz.Modular(
    model.parameters(),
    tz.m.NystromSketchAndSolve(10),
    tz.m.Backtracking()
)
Reference
Frangella, Z., Tropp, J. A., & Udell, M. (2023). Randomized nyström preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718-752. https://arxiv.org/abs/2110.02820
Source code in torchzero/modules/second_order/nystrom.py
Ones ¶
Bases: torchzero.core.module.Module
Outputs ones
Source code in torchzero/modules/ops/utility.py
Online ¶
Bases: torchzero.core.module.Module
Allows certain modules to be used for mini-batch optimization.
Examples:
Online L-BFGS with Backtracking line search
Online L-BFGS trust region
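A sketch of the first setup (it assumes Online wraps the stateful module and that the L-BFGS module is exposed as tz.m.LBFGS):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.Online(tz.m.LBFGS()),
    tz.m.Backtracking()
)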
Source code in torchzero/modules/misc/multistep.py
OrthoGrad ¶
Bases: torchzero.core.transform.Transform
Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
Parameters:
-
eps
(float
, default:1e-08
) –epsilon added to the denominator for numerical stability. Defaults to 1e-8.
-
renormalize
(bool
, default:True
) –whether to graft projected gradient to original gradient norm. Defaults to True.
-
target
(Literal
, default:'update'
) –what to set on var. Defaults to 'update'.
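Examples:
A minimal sketch placing ⟂Grad before a standard optimizer (the Adam + LR combination is an assumption):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.OrthoGrad(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)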
Source code in torchzero/modules/adaptive/orthograd.py
Orthogonalize ¶
Bases: torchzero.core.transform.TensorwiseTransform
Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.
To disable orthogonalization for a parameter, put it into a parameter group with "orthogonalize" = False. The Muon page says that embeddings and classifier heads should not be orthogonalized. Usually only matrix parameters that are directly used in matmuls should be orthogonalized.
To make Muon, use Split with Adam on 1d params
Parameters:
-
ns_steps
(int
, default:5
) –The number of Newton-Schulz iterations to run. Defaults to 5.
-
adjust_lr
(bool
, default:False
) –Enables LR adjustment based on parameter size from "Muon is Scalable for LLM Training". Defaults to False.
-
dual_norm_correction
(bool
, default:False
) –enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
-
method
(str
, default:'newton-schulz'
) –Newton-Schulz is very fast, SVD is extremely slow but can be slightly more precise.
-
target
(str
, default:'update'
) –what to set on var.
Examples:¶
standard Muon with Adam fallback
opt = tz.Modular(
    model.head.parameters(),
    tz.m.Split(
        # apply muon only to 2D+ parameters
        filter = lambda t: t.ndim >= 2,
        true = [
            tz.m.HeavyBall(),
            tz.m.Orthogonalize(),
            tz.m.LR(1e-2),
        ],
        false = tz.m.Adam()
    ),
    tz.m.LR(1e-2)
)
Reference
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein - Muon: An optimizer for hidden layers in neural networks (2024) https://github.com/KellerJordan/Muon
Source code in torchzero/modules/adaptive/muon.py
PSB ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults
Powell's Symmetric Broyden Quasi-Newton method.
Note
a line search or a trust region is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Params ¶
Bases: torchzero.core.module.Module
Outputs parameters
Source code in torchzero/modules/ops/utility.py
Pearson ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Pearson's Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
PerturbWeights ¶
Bases: torchzero.core.module.Module
Changes the closure so that it evaluates loss and gradients at weights perturbed by a random perturbation.
Can be disabled for a parameter by setting :code:perturb=False
in corresponding parameter group.
Parameters:
-
alpha
(float
, default:0.1
) –multiplier for perturbation magnitude. Defaults to 0.1.
-
relative
(bool
, default:True
) –whether to multiply perturbation by mean absolute value of the parameter. Defaults to True.
-
distribution
(str
, default:'normal'
) –distribution of the random perturbation. Defaults to 'normal'.
Source code in torchzero/modules/misc/regularization.py
PolakRibiere ¶
Bases: torchzero.modules.conjugate_gradient.cg.ConguateGradientBase
Polak-Ribière-Polyak nonlinear conjugate gradient method.
Note
This requires step size to be determined via a line search, so put a line search like tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this.
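Example (a sketch of the pairing described in the note above; the StrongWolfe arguments are copied from it):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.PolakRibiere(),
    tz.m.StrongWolfe(c2=0.1, a_init="first-order"),
)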
Source code in torchzero/modules/conjugate_gradient/cg.py
PolyakStepSize ¶
Bases: torchzero.core.transform.Transform
Polyak's subgradient method with known or unknown f*.
Parameters:
-
f_star
(float | None
, default:0
) –minimal possible value of the objective function. If not known, set to
None
. Defaults to 0. -
y
(float
, default:1
) –when
f_star
is set to None, it is calculated asf_best - y
. -
y_decay
(float
, default:0.001
) –y
is multiplied by(1 - y_decay)
after each step. Defaults to 1e-3. -
max
(float | None
, default:None
) –maximum possible step size. Defaults to None.
-
use_grad
(bool
, default:True
) –if True, uses dot product of update and gradient to compute the step size. Otherwise, dot product of update with itself is used.
-
alpha
(float
, default:1
) –multiplier to Polyak step-size. Defaults to 1.
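Examples:
A minimal sketch with a known minimum f* = 0 (leaving out tz.m.LR is an assumption, since the module already produces a step size):
.. code-block:: python

opt = tz.Modular(
    model.parameters(),
    tz.m.PolyakStepSize(f_star=0),
)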
Source code in torchzero/modules/step_size/adaptive.py
Pow ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Take tensors to the power of :code:exponent
. :code:exponent
can be a number or a module.
If :code:exponent
is a module, this calculates :code:tensors ^ exponent(tensors)
Source code in torchzero/modules/ops/binary.py
PowModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates :code:input ** exponent
. :code:input
and :code:other
can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
PowellRestart ¶
Bases: torchzero.modules.restarts.restars.RestartStrategyBase
Powell's two restart criteria for conjugate gradient methods.
The restart clears all states of modules
.
Parameters:
-
modules
(Chainable | None
) –modules to reset. If None, resets all modules.
-
cond1
(float | None
, default:0.2
) –criterion that checks for nonconjugacy of the search directions. Restart is performed whenever g^T g_{k+1} >= cond1 * ||g_{k+1}||^2. The default value of 0.2 is suggested by Powell. Can be None to disable this criterion.
-
cond2
(float | None
, default:0.2
) –criterion that checks if direction is not effectively downhill. Restart is performed if -1.2||g||^2 < d^Tg < -0.8||g||^2. Defaults to 0.2. Can be None to disable that criterion.
Reference
Powell, Michael James David. "Restart procedures for the conjugate gradient method." Mathematical programming 12.1 (1977): 241-254.
Source code in torchzero/modules/restarts/restars.py
Previous ¶
Bases: torchzero.core.transform.TensorwiseTransform
Maintains an update from n steps back, for example if n=1, returns previous update
Source code in torchzero/modules/misc/misc.py
PrintLoss ¶
Bases: torchzero.core.module.Module
Prints var.get_loss().
Source code in torchzero/modules/misc/debug.py
PrintParams ¶
Bases: torchzero.core.module.Module
Prints current parameters.
Source code in torchzero/modules/misc/debug.py
PrintShape ¶
Bases: torchzero.core.module.Module
Prints shapes of the update.
Source code in torchzero/modules/misc/debug.py
PrintUpdate ¶
Bases: torchzero.core.module.Module
Prints current update.
Source code in torchzero/modules/misc/debug.py
Prod ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs product of :code:inputs
that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
ProjectedGradientMethod ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Projected gradient method. Directly projects the gradient onto subspace conjugate to past directions.
Notes
- This method uses N^2 memory.
- This requires step size to be determined via a line search, so put a line search like
tz.m.StrongWolfe(c2=0.1, a_init="first-order")
after this. - This is not the same as projected gradient descent.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171. (algorithm 5 in section 6)
Source code in torchzero/modules/conjugate_gradient/cg.py
ProjectedNewtonRaphson ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Projected Newton Raphson method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
This one is Algorithm 7.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
ProjectionBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for projections.
This is an abstract class, to use it, subclass it and override project
and unproject
.
Parameters:
-
modules
(Chainable
) –modules that will be applied in the projected domain.
-
project_update
(bool
, default:True
) –whether to project the update. Defaults to True.
-
project_params
(bool
, default:False
) –whether to project the params. This is necessary for modules that use closure. Defaults to False.
-
project_grad
(bool
, default:False
) –whether to project the gradients (separately from update). Defaults to False.
-
defaults
(dict[str, Any] | None
, default:None
) –dictionary with defaults. Defaults to None.
Methods:
-
project
–projects
tensors
. Note that this can be called multiple times per step withparams
,grads
, andupdate
. -
unproject
–unprojects
tensors
. Note that this can be called multiple times per step withparams
,grads
, andupdate
.
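A toy subclass sketch using the project/unproject signatures shown below; it runs the wrapped modules in a negated space purely to illustrate the API:
.. code-block:: python

class NegateProjection(ProjectionBase):
    # illustration only: "projects" by negating, so project followed by unproject is the identity
    def __init__(self, modules):
        super().__init__(modules, project_update=True, project_params=False, project_grad=False)

    def project(self, tensors, params, grads, loss, states, settings, current):
        return [-t for t in tensors]

    def unproject(self, projected_tensors, params, grads, loss, states, settings, current):
        return [-t for t in projected_tensors]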
Source code in torchzero/modules/projections/projection.py
project ¶
project(tensors: list[Tensor], params: list[Tensor], grads: list[Tensor] | None, loss: Tensor | None, states: list[dict[str, Any]], settings: list[ChainMap[str, Any]], current: str) -> Iterable[Tensor]
projects tensors
. Note that this can be called multiple times per step with params
, grads
, and update
.
Source code in torchzero/modules/projections/projection.py
unproject ¶
unproject(projected_tensors: list[Tensor], params: list[Tensor], grads: list[Tensor] | None, loss: Tensor | None, states: list[dict[str, Any]], settings: list[ChainMap[str, Any]], current: str) -> Iterable[Tensor]
unprojects tensors
. Note that this can be called multiple times per step with params
, grads
, and update
.
Parameters:
-
projected_tensors
(list[Tensor]
) –projected tensors to unproject.
-
params
(list[Tensor]
) –original, unprojected parameters.
-
grads
(list[Tensor] | None
) –original, unprojected gradients
-
loss
(Tensor | None
) –loss at initial point.
-
states
(list[dict[str, Any]]
) –list of state dictionaries per each UNPROJECTED tensor.
-
settings
(list[ChainMap[str, Any]]
) –list of setting dictionaries per each UNPROJECTED tensor.
-
current
(str
) –string representing what is being unprojected, e.g. "params", "grads" or "update".
Returns:
-
Iterable[Tensor]
–Iterable[torch.Tensor]: unprojected tensors of the same shape as params
Source code in torchzero/modules/projections/projection.py
RCopySign ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Returns :code:other(tensors)
with sign copied from tensors.
Source code in torchzero/modules/ops/binary.py
RDSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Random-direction stochastic approximation (RDSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h
(float
, default:0.001
) –finite difference step size if jvp_method is set to
forward
orcentral
. Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of random gradient samples. Defaults to 1.
-
formula
(Literal
, default:'central2'
) –finite difference formula. Defaults to 'central2'.
-
distribution
(Literal
, default:'gaussian'
) –distribution. Defaults to "gaussian".
-
beta
(float
, default:0
) –If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
-
pre_generate
(bool
, default:True
) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed
(int | None | Generator
, default:None
) –Seed for random generator. Defaults to None.
-
target
(Literal
, default:'closure'
) –what to set on var. Defaults to "closure".
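A minimal usage sketch (assumes import torchzero as tz and a model; the hyperparameters are illustrative). As with other gradient approximators, pass a closure to the step method:
opt = tz.Modular(
    model.parameters(),
    tz.m.RDSA(n_samples=4),
    tz.m.LR(1e-2),
)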
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
Source code in torchzero/modules/grad_approximation/rfdm.py
RDiv ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Divide :code:other
by tensors. :code:other
can be a number or a module.
If :code:other
is a module, this calculates :code:other(tensors) / tensors
Source code in torchzero/modules/ops/binary.py
RGraft ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs :code:magnitude(tensors)
rescaled to have the same norm as tensors
Source code in torchzero/modules/ops/binary.py
RMSprop ¶
Bases: torchzero.core.transform.Transform
Divides the gradient by an EMA of gradient squares.
This implementation is identical to :code:torch.optim.RMSprop
.
Parameters:
-
smoothing
(float
, default:0.99
) –beta for exponential moving average of gradient squares. Defaults to 0.99.
-
eps
(float
, default:1e-08
) –epsilon for division. Defaults to 1e-8.
-
centered
(bool
, default:False
) –whether to center EMA of gradient squares using an additional EMA. Defaults to False.
-
debiased
(bool
, default:False
) –applies Adam debiasing. Defaults to False.
-
amsgrad
(bool
, default:False
) –Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
-
pow
(float
, default:2
) –power used in second momentum power and root. Defaults to 2.
-
init
(str
, default:'zeros'
) –how to initialize the EMA, either "update" to use the first update or "zeros". Defaults to "zeros".
-
inner
(Chainable | None
, default:None
) –Inner modules that are applied after updating EMA and before preconditioning. Defaults to None.
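A minimal usage sketch (assumes import torchzero as tz and a model):
opt = tz.Modular(
    model.parameters(),
    tz.m.RMSprop(smoothing=0.99, debiased=True),
    tz.m.LR(1e-3),
)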
Source code in torchzero/modules/adaptive/rmsprop.py
RPow ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Take :code:other
to the power of tensors. :code:other
can be a number or a module.
If :code:other
is a module, this calculates :code:other(tensors) ^ tensors
Source code in torchzero/modules/ops/binary.py
RSub ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Subtract tensors from :code:other
. :code:other
can be a number or a module.
If :code:other
is a module, this calculates :code:other(tensors) - tensors
Source code in torchzero/modules/ops/binary.py
Randn ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with random numbers from a normal distribution with mean 0 and variance 1.
Source code in torchzero/modules/ops/utility.py
RandomHvp ¶
Bases: torchzero.core.module.Module
Returns a hessian-vector product with a random vector
Source code in torchzero/modules/misc/misc.py
RandomSample ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with random numbers from distribution depending on value of :code:distribution
.
Source code in torchzero/modules/ops/utility.py
RandomStepSize ¶
Bases: torchzero.core.transform.Transform
Uses random global or layer-wise step size from low
to high
.
Parameters:
-
low
(float
, default:0
) –minimum learning rate. Defaults to 0.
-
high
(float
, default:1
) –maximum learning rate. Defaults to 1.
-
parameterwise
(bool
, default:False
) –if True, generate random step size for each parameter separately, if False generate one global random step size. Defaults to False.
Source code in torchzero/modules/step_size/lr.py
RandomizedFDM ¶
Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator
Gradient approximation via a randomized finite-difference method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h
(float
, default:0.001
) –finite difference step size if jvp_method is set to
forward
orcentral
. Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of random gradient samples. Defaults to 1.
-
formula
(Literal
, default:'central'
) –finite difference formula. Defaults to 'central'.
-
distribution
(Literal
, default:'rademacher'
) –distribution of the random perturbations. Defaults to "rademacher".
-
beta
(float
, default:0
) –optional momentum for generated perturbations; if set to a value higher than zero, the perturbation direction changes gradually instead of being redrawn on each step, which may make it possible to use methods with memory. Defaults to 0.
-
pre_generate
(bool
, default:True
) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed
(int | None | Generator
, default:None
) –Seed for random generator. Defaults to None.
-
target
(Literal
, default:'closure'
) –what to set on var. Defaults to "closure".
Examples:
Simultaneous perturbation stochastic approximation (SPSA) method¶
SPSA is a randomized finite-difference method with the Rademacher distribution and the central formula.
spsa = tz.Modular(
model.parameters(),
tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
tz.m.LR(1e-2)
)
Random-direction stochastic approximation (RDSA) method¶
RDSA is a randomized finite-difference method, usually with the Gaussian distribution and the central formula.
rdsa = tz.Modular(
model.parameters(),
tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
tz.m.LR(1e-2)
)
RandomizedFDM with momentum¶
Momentum might help by reducing the variance of the estimated gradients.
momentum_spsa = tz.Modular(
model.parameters(),
tz.m.RandomizedFDM(),
tz.m.HeavyBall(0.9),
tz.m.LR(1e-3)
)
Gaussian smoothing method¶
GS uses many gaussian samples with possibly a larger finite difference step size.
gs = tz.Modular(
model.parameters(),
tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
tz.m.NewtonCG(hvp_method="forward"),
tz.m.Backtracking()
)
SPSA-NewtonCG¶
NewtonCG with hessian-vector product estimated via gradient difference calls closure multiple times per step. If each closure call estimates gradients with different perturbations, NewtonCG is unable to produce useful directions.
By setting pre_generate to True, perturbations are generated once before each step, and each closure call estimates gradients using the same pre-generated perturbations. This way closure-based algorithms are able to use gradients estimated in a consistent way.
opt = tz.Modular(
model.parameters(),
tz.m.RandomizedFDM(n_samples=10, pre_generate=True),
tz.m.NewtonCG(hvp_method="forward"),
tz.m.Backtracking()
)
SPSA-LBFGS¶
LBFGS uses a memory of past parameter and gradient differences. If past gradients were estimated with different perturbations, LBFGS directions will be useless.
To alleviate this, momentum can be added to the random perturbations so that they only change by a little bit and the history stays relevant. The momentum is determined by the :code:beta parameter.
The disadvantage is that the subspace the algorithm is able to explore changes slowly.
Additionally, we reset the SPSA and LBFGS memory every 100 steps to remove the influence of old gradient estimates.
opt = tz.Modular(
bench.parameters(),
tz.m.ResetEvery(
[tz.m.RandomizedFDM(n_samples=10, pre_generate=True, beta=0.99), tz.m.LBFGS()],
steps = 100,
),
tz.m.Backtracking()
)
Source code in torchzero/modules/grad_approximation/rfdm.py
PRE_MULTIPLY_BY_H
class-attribute
¶
Reciprocal ¶
Bases: torchzero.core.transform.Transform
Returns :code:1 / input
Source code in torchzero/modules/ops/unary.py
ReduceOperationBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for reduction operations like Sum, Prod, Maximum. This is an abstract class, subclass it and override transform
method to use it.
Methods:
-
transform
–applies the operation to operands
Source code in torchzero/modules/ops/reduce.py
Relative ¶
Bases: torchzero.core.transform.Transform
Multiplies update by absolute parameter values to make it relative to their magnitude, :code:min_value
is minimum allowed value to avoid getting stuck at 0.
Source code in torchzero/modules/misc/misc.py
RelativeWeightDecay ¶
Bases: torchzero.core.transform.Transform
Weight decay relative to the mean absolute value of update, gradient or parameters depending on value of norm_input
argument.
Parameters:
-
weight_decay
(float
, default:0.1
) –relative weight decay scale.
-
ord
(int
, default:2
) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
-
norm_input
(str
, default:'update'
) –determines what should weight decay be relative to. "update", "grad" or "params". Defaults to "update".
-
metric
(Ords
, default:'mad'
) –metric (norm, etc.) that weight decay should be relative to. Defaults to 'mad' (mean absolute deviation).
-
target
(Literal
, default:'update'
) –what to set on var. Defaults to 'update'.
Examples:¶
Adam with non-decoupled relative weight decay
Adam with decoupled relative weight decay
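A sketch of the two arrangements (the placement used for the decoupled variant is an assumption based on module ordering; assumes import torchzero as tz and a model):
# non-decoupled: the penalty passes through Adam's preconditioning
opt = tz.Modular(
    model.parameters(),
    tz.m.RelativeWeightDecay(0.1),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)

# decoupled (assumed placement): the penalty is applied after Adam
opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.RelativeWeightDecay(0.1),
    tz.m.LR(1e-3),
)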
Source code in torchzero/modules/weight_decay/weight_decay.py
RestartEvery ¶
Bases: torchzero.modules.restarts.restars.RestartStrategyBase
Resets the state every n steps
Parameters:
-
modules
(Chainable | None
) –modules to reset. If None, resets all modules.
-
steps
(int | Literal['ndim']
) –number of steps between resets. "ndim" to use number of parameters.
Source code in torchzero/modules/restarts/restars.py
RestartOnStuck ¶
Bases: torchzero.modules.restarts.restars.RestartStrategyBase
Resets the state when update (difference in parameters) is zero for multiple steps in a row.
Parameters:
-
modules
(Chainable | None
) –modules to reset. If None, resets all modules.
-
tol
(float
, default:None
) –step is considered failed when the maximum absolute parameter difference is smaller than this. Defaults to None (uses twice the smallest representable number)
-
n_tol
(int
, default:10
) –number of failed consecutive steps required to trigger a reset. Defaults to 10.
Source code in torchzero/modules/restarts/restars.py
RestartStrategyBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Base class for restart strategies.
On each update
/step
this checks reset condition and if it is satisfied,
resets the modules before updating or stepping.
Methods:
-
should_reset
–returns whether reset should occur
Source code in torchzero/modules/restarts/restars.py
Rprop ¶
Bases: torchzero.core.transform.Transform
Resilient propagation. The update magnitude gets multiplied by nplus
if gradient didn't change the sign,
or nminus
if it did. Then the update is applied with the sign of the current gradient.
Additionally, if gradient changes sign, the update for that weight is reverted. Next step, magnitude for that weight won't change.
Compared to pytorch this also implements backtracking update when sign changes.
This implementation is identical to :code:torch.optim.Rprop
if :code:backtrack
is set to False.
Parameters:
-
nplus
(float
, default:1.2
) –multiplicative increase factor for when ascent didn't change sign (default: 1.2).
-
nminus
(float
, default:0.5
) –multiplicative decrease factor for when ascent changed sign (default: 0.5).
-
lb
(float
, default:1e-06
) –minimum step size, can be None (default: 1e-6)
-
ub
(float
, default:50
) –maximum step size, can be None (default: 50)
-
backtrack
(bool
, default:True
) –if True, when ascent sign changes, undoes last weight update, otherwise sets update to 0. When this is False, this exactly matches pytorch Rprop. (default: True)
-
alpha
(float
, default:1
) –initial per-parameter learning rate (default: 1).
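A minimal usage sketch (assumes import torchzero as tz and a model; Rprop maintains its own per-parameter step sizes, so no separate LR module is added here):
opt = tz.Modular(
    model.parameters(),
    tz.m.Rprop(nplus=1.2, nminus=0.5),
)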
Reference
Riedmiller, M., & Braun, H. (1993, March). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE international conference on neural networks (pp. 586-591). IEEE.
Source code in torchzero/modules/adaptive/rprop.py
SAM ¶
Bases: torchzero.core.module.Module
Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412
SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.
This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.
.. note:: This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.
Parameters:
-
rho
(float
, default:0.05
) –Neighborhood size. Defaults to 0.05.
-
p
(float
, default:2
) –norm of the SAM objective. Defaults to 2.
-
asam
(bool
, default:False
) –enables ASAM variant which makes perturbation relative to weight magnitudes. ASAM requires a much larger :code:
rho
, like 0.5 or 1. The :code:tz.m.ASAM
class is identical to setting this argument to True, but it has a larger :code:rho
by default.
Examples:
SAM-SGD:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SAM(),
tz.m.LR(1e-2)
)
SAM-Adam:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SAM(),
tz.m.Adam(),
tz.m.LR(1e-2)
)
References
Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412. https://arxiv.org/abs/2010.01412#page=3.16
Source code in torchzero/modules/adaptive/sam.py
SOAP ¶
Bases: torchzero.core.transform.Transform
SOAP (ShampoO with Adam in the Preconditioner's eigenbasis from https://arxiv.org/abs/2409.11321).
Parameters:
-
beta1
(float
, default:0.95
) –beta for first momentum. Defaults to 0.95.
-
beta2
(float
, default:0.95
) –beta for second momentum. Defaults to 0.95.
-
shampoo_beta
(float | None
, default:0.95
) –beta for covariance matrices accumulators. Can be None, then it just sums them like Adagrad (which works worse). Defaults to 0.95.
-
precond_freq
(int
, default:10
) –How often to update the preconditioner. Defaults to 10.
-
merge_small
(bool
, default:True
) –Whether to merge small dims. Defaults to True.
-
max_dim
(int
, default:2000
) –Won't precondition dims larger than this. Defaults to 2_000.
-
precondition_1d
(bool
, default:True
) –Whether to precondition 1d params (SOAP paper sets this to False). Defaults to True.
-
eps
(float
, default:1e-08
) –epsilon for dividing first momentum by second. Defaults to 1e-8.
-
decay
(float | None
, default:None
) –Decays covariance matrix accumulators, this may be useful if
shampoo_beta
is None. Defaults to None. -
alpha
(float
, default:1
) –learning rate. Defaults to 1.
-
bias_correction
(bool
, default:True
) –enables adam bias correction. Defaults to True.
Examples:
SOAP:
.. code-block:: python
opt = tz.Modular(model.parameters(), tz.m.SOAP(), tz.m.LR(1e-3))
Stabilized SOAP:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SOAP(),
tz.m.NormalizeByEMA(max_ema_growth=1.2),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/soap.py
SPSA ¶
Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM
Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
Note
This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.
Parameters:
-
h
(float
, default:0.001
) –finite difference step size if jvp_method is set to
forward
orcentral
. Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of random gradient samples. Defaults to 1.
-
formula
(Literal
, default:'central'
) –finite difference formula. Defaults to 'central'.
-
distribution
(Literal
, default:'rademacher'
) –distribution. Defaults to "rademacher".
-
beta
(float
, default:0
) –If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
-
pre_generate
(bool
, default:True
) –whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
-
seed
(int | None | Generator
, default:None
) –Seed for random generator. Defaults to None.
-
target
(Literal
, default:'closure'
) –what to set on var. Defaults to "closure".
References
Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
Source code in torchzero/modules/grad_approximation/rfdm.py
SR1 ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Symmetric Rank 1 (SR1) quasi-Newton method. This works best with a trust region.
Parameters:
-
init_scale
(float | Literal['auto']
, default:'auto'
) –initial hessian matrix is set to identity times this.
"auto" corresponds to a heuristic from [1] p.142-143.
Defaults to "auto".
-
tol
(float
, default:1e-32
) –tolerance for denominator in SR1 update rule as in [1] p.146. Defaults to 1e-32.
-
ptol
(float | None
, default:1e-32
) –skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.
-
ptol_restart
(bool
, default:False
) –whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
-
restart_interval
(int | None | Literal['auto']
, default:None
) –interval between resetting the hessian approximation.
"auto" corresponds to number of decision variables + 1.
None - no resets.
Defaults to None.
-
beta
(float | None
, default:None
) –momentum on H or B. Defaults to None.
-
update_freq
(int
, default:1
) –frequency of updating H or B. Defaults to 1.
-
scale_first
(bool
, default:False
) –whether to downscale the first step, before a hessian approximation becomes available. Defaults to False.
-
scale_second
(bool
) –whether to downscale second step. Defaults to False.
-
concat_params
(bool
, default:True
) –If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.
-
inner
(Chainable | None
, default:None
) –preconditioning is applied to the output of this module. Defaults to None.
Examples:¶
SR1 with trust region
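A sketch consistent with the Trust-SR1 example given under tz.m.TrustCG later in this reference (assumes a model):
opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(hess_module=tz.m.SR1(inverse=False)),
)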
References:¶
[1] Nocedal, J., & Wright, S. J. Numerical Optimization.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
SSVM ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Self-scaling variable metric Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Oren, S. S., & Spedicato, E. (1976). Optimal conditioning of self-scaling variable Metric algorithms. Mathematical Programming, 10(1), 70–90. doi:10.1007/bf01580654
Source code in torchzero/modules/quasi_newton/quasi_newton.py
SVRG ¶
Bases: torchzero.core.module.Module
Stochastic variance reduced gradient method (SVRG).
To use, put SVRG as the first module, it can be used with any other modules. To reduce variance of a gradient estimator, put the gradient estimator before SVRG.
First, it uses the first accum_steps batches to compute the full gradient at the initial parameters via gradient accumulation; the model is not updated during this phase.
Then it performs svrg_steps SVRG steps, each of which requires two forward and backward passes.
After svrg_steps steps, it goes back to the full gradient computation.
As an alternative to gradient accumulation you can pass "full_closure" argument to the step
method,
which should compute full gradients, set them to .grad
attributes of the parameters,
and return full loss.
Parameters:
-
svrg_steps
(int
) –number of steps before calculating full gradient. This can be set to length of the dataloader.
-
accum_steps
(int | None
, default:None
) –number of steps to accumulate the gradient for. Not used if "full_closure" is passed to the
step
method. If None, uses value ofsvrg_steps
. Defaults to None. -
reset_before_accum
(bool
, default:True
) –whether to reset all other modules when re-calculating full gradient. Defaults to True.
-
svrg_loss
(bool
, default:True
) –whether to replace loss with SVRG loss (calculated by same formula as SVRG gradient). Defaults to True.
-
alpha
(float
, default:1
) –multiplier to
g_full(x_0) - g_batch(x_0)
term, can be annealed linearly from 1 to 0 as suggested in https://arxiv.org/pdf/2311.05589#page=6
Examples:¶
SVRG-LBFGS
opt = tz.Modular(
model.parameters(),
tz.m.SVRG(len(dataloader)),
tz.m.LBFGS(),
tz.m.Backtracking(),
)
For extra variance reduction one can use Online versions of algorithms, although it won't always help.
opt = tz.Modular(
model.parameters(),
tz.m.SVRG(len(dataloader)),
tz.m.Online(tz.m.LBFGS()),
tz.m.Backtracking(),
)
Variance reduction can also be applied to gradient estimators.
opt = tz.Modular(
model.parameters(),
tz.m.SPSA(),
tz.m.SVRG(100),
tz.m.LR(1e-2),
)
Notes¶
The SVRG gradient is computed as g_b(x) - alpha * (g_b(x_0) - g_f(x_0)), where:
, where:
- x is the current parameters
- x_0 is the initial parameters, where the full gradient was computed
- g_b refers to the mini-batch gradient at x or x_0
- g_f refers to the full gradient at x_0.
The SVRG loss is computed using the same formula.
Source code in torchzero/modules/variance_reduction/svrg.py
SaveBest ¶
Bases: torchzero.core.module.Module
Saves the best parameters found so far, i.e. those with the lowest loss. Put this as the last module.
Adds the following attrs:
- best_params - a list of tensors with the best parameters.
- best_loss - loss value with best_params.
- load_best_parameters - a function that sets parameters to the best parameters.
Examples¶
```python
def rosenbrock(x, y):
    return (1 - x)**2 + (100 * (y - x**2))**2

xy = torch.tensor((-1.1, 2.5), requires_grad=True)
opt = tz.Modular(
    [xy],
    tz.m.NAG(0.999),
    tz.m.LR(1e-6),
    tz.m.SaveBest()
)

# optimize for 1000 steps
for i in range(1000):
    loss = rosenbrock(*xy)
    opt.zero_grad()
    loss.backward()
    opt.step(loss=loss)  # SaveBest needs closure or loss

# NAG overshot, but we saved the best params
print(f'{rosenbrock(*xy) = }')        # >> 3.6583
print(f"{opt.attrs['best_loss'] = }") # >> 0.000627

# load best parameters
opt.attrs['load_best_params']()
print(f'{rosenbrock(*xy) = }')        # >> 0.000627
```
Source code in torchzero/modules/misc/misc.py
ScalarProjection ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Projection that splits all parameters into individual scalars.
Source code in torchzero/modules/projections/projection.py
ScaleByGradCosineSimilarity ¶
Bases: torchzero.core.transform.Transform
Multiplies the update by cosine similarity with gradient. If cosine similarity is negative, naturally the update will be negated as well.
Parameters:
-
eps
(float
, default:1e-06
) –epsilon for division. Defaults to 1e-6.
Examples:¶
Scaled Adam
opt = tz.Modular(
bench.parameters(),
tz.m.Adam(),
tz.m.ScaleByGradCosineSimilarity(),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/momentum/cautious.py
ScaleLRBySignChange ¶
Bases: torchzero.core.transform.Transform
The learning rate gets multiplied by nplus if the ascent/gradient didn't change sign, or by nminus if it did.
This is part of the Rprop update rule.
Parameters:
-
nplus
(float
, default:1.2
) –learning rate gets multiplied by
nplus
if ascent/gradient didn't change the sign -
nminus
(float
, default:0.5
) –learning rate gets multiplied by
nminus
if ascent/gradient changed the sign -
lb
(float
, default:1e-06
) –lower bound for lr.
-
ub
(float
, default:50.0
) –upper bound for lr.
-
alpha
(float
, default:1.0
) –initial learning rate.
Source code in torchzero/modules/adaptive/rprop.py
ScaleModulesByCosineSimilarity ¶
Bases: torchzero.core.module.Module
Scales the output of :code:main
module by its cosine similarity to the output
of :code:compare
module.
Parameters:
-
main
(Chainable
) –main module or sequence of modules whose update will be scaled.
-
compare
(Chainable
) –module or sequence of modules to compare to
-
eps
(float
, default:1e-06
) –epsilon for division. Defaults to 1e-6.
Examples:¶
Adam scaled by similarity to RMSprop
opt = tz.Modular(
bench.parameters(),
tz.m.ScaleModulesByCosineSimilarity(
main = tz.m.Adam(),
compare = tz.m.RMSprop(0.999, debiased=True),
),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/momentum/cautious.py
ScipyMinimizeScalar ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Line search via :code:scipy.optimize.minimize_scalar
which implements brent, golden search and bounded brent methods.
Parameters:
-
method
(str | None
, default:None
) –"brent", "golden" or "bounded". Defaults to None.
-
maxiter
(int | None
, default:None
) –maximum number of function evaluations the line search is allowed to perform. Defaults to None.
-
bracket
(Sequence | None
, default:None
) –Either a triple (xa, xb, xc) satisfying xa < xb < xc and func(xb) < func(xa) and func(xb) < func(xc), or a pair (xa, xb) to be used as initial points for a downhill bracket search. Defaults to None.
-
bounds
(Sequence | None
, default:None
) –For method ‘bounded’, bounds is mandatory and must have two finite items corresponding to the optimization bounds. Defaults to None.
-
tol
(float | None
, default:None
) –Tolerance for termination. Defaults to None.
-
options
(dict | None
, default:None
) –A dictionary of solver options. Defaults to None.
For more details on methods and arguments refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize_scalar.html
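A minimal usage sketch pairing this line search with a conjugate gradient direction, analogous to the StrongWolfe example below (the method and maxiter values are illustrative; assumes a model):
opt = tz.Modular(
    model.parameters(),
    tz.m.PolakRibiere(),
    tz.m.ScipyMinimizeScalar(method="brent", maxiter=20),
)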
Source code in torchzero/modules/line_search/scipy.py
Sequential ¶
Bases: torchzero.core.module.Module
On each step, this sequentially steps with :code:modules
:code:steps
times.
The update is taken to be the difference in parameters before and after the inner loop.
Source code in torchzero/modules/misc/multistep.py
Shampoo ¶
Bases: torchzero.core.transform.Transform
Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).
.. note:: Shampoo is usually grafted to another optimizer like Adam, otherwise it can be unstable. An example of how to do grafting is given below in the Examples section.
.. note::
Shampoo is a very computationally expensive optimizer, increase :code:update_freq
if it is too slow.
.. note::
SOAP optimizer usually outperforms Shampoo and is also not as computationally expensive. SOAP implementation is available as :code:tz.m.SOAP
.
Parameters:
-
decay
(float | None
, default:None
) –slowly decays preconditioners. Defaults to None.
-
beta
(float | None
, default:None
) –if None calculates sum as in standard shampoo, otherwise uses EMA of preconditioners. Defaults to None.
-
update_freq
(int
, default:10
) –preconditioner update frequency. Defaults to 10.
-
exp_override
(int | None
, default:2
) –matrix exponent override, if not set, uses 2*ndim. Defaults to 2.
-
merge_small
(bool
, default:True
) –whether to merge small dims on tensors. Defaults to True.
-
max_dim
(int
, default:2000
) –maximum dimension size for preconditioning. Defaults to 2_000.
-
precondition_1d
(bool
, default:True
) –whether to precondition 1d tensors. Defaults to True.
-
adagrad_eps
(float
, default:1e-08
) –epsilon for adagrad division for tensors where shampoo can't be applied. Defaults to 1e-8.
-
inner
(Chainable | None
, default:None
) –module applied after updating preconditioners and before applying preconditioning. For example if beta≈0.999 and
inner=tz.m.EMA(0.9)
, this becomes Adam with shampoo preconditioner (ignoring debiasing). Defaults to None.
Examples:
Shampoo grafted to Adam
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.GraftModules(
direction = tz.m.Shampoo(),
magnitude = tz.m.Adam(),
),
tz.m.LR(1e-3)
)
Adam with Shampoo preconditioner
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Shampoo(beta=0.999, inner=tz.m.EMA(0.9)),
tz.m.Debias(0.9, 0.999),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/adaptive/shampoo.py
ShorR ¶
Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy
Shor’s r-algorithm.
Note
A line search such as tz.m.StrongWolfe(a_init="quadratic", fallback=True)
is required.
Similarly to conjugate gradient, ShorR doesn't have an automatic step size scaling,
so setting a_init
in the line search is recommended.
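A minimal usage sketch following the note above (assumes a model):
opt = tz.Modular(
    model.parameters(),
    tz.m.ShorR(),
    tz.m.StrongWolfe(a_init="quadratic", fallback=True),
)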
References
Shor, N. Z. (1985). Minimization Methods for Non-differentiable Functions. New York: Springer.
Burke, James V., Adrian S. Lewis, and Michael L. Overton. "The Speed of Shor's R-algorithm." IMA Journal of numerical analysis 28.4 (2008): 711-720. - good overview.
Ansari, Zafar A. Limited Memory Space Dilation and Reduction Algorithms. Diss. Virginia Tech, 1998. - this is where a more efficient formula is described.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Sign ¶
Bases: torchzero.core.transform.Transform
Returns :code:sign(input)
Source code in torchzero/modules/ops/unary.py
SignConsistencyLRs ¶
Bases: torchzero.core.transform.Transform
Outputs per-weight learning rates based on consecutive sign consistency.
The learning rate for a weight is multiplied by :code:nplus
when two consecutive update signs are the same, otherwise it is multiplied by :code:nminus
. The learning rates are bounded to be in :code:(lb, ub)
range.
Examples:
GD scaled by consecutive gradient sign consistency
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Mul(tz.m.SignConsistencyLRs()),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/rprop.py
SignConsistencyMask ¶
Bases: torchzero.core.transform.Transform
Outputs a mask of sign consistency of current and previous inputs.
The output is 0 for weights where input sign changed compared to previous input, 1 otherwise.
Examples:
GD that skips update for weights where gradient sign changed compared to previous gradient.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Mul(tz.m.SignConsistencyMask()),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/rprop.py
SixthOrder3P ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Sixth-order iterative method.
Abro, Hameer Akhtar, and Muhammad Mujtaba Shaikh. "A new time-efficient and convergent nonlinear solver." Applied Mathematics and Computation 355 (2019): 516-536.
Source code in torchzero/modules/second_order/multipoint.py
SixthOrder3PM2 ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Wang, Xiaofeng, and Yang Li. "An efficient sixth-order Newton-type method for solving nonlinear systems." Algorithms 10.2 (2017): 45.
Source code in torchzero/modules/second_order/multipoint.py
SixthOrder5P ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Argyros, Ioannis K., et al. "Extended convergence for two sixth order methods under the same weak conditions." Foundations 3.1 (2023): 127-139.
Source code in torchzero/modules/second_order/multipoint.py
SophiaH ¶
Bases: torchzero.core.module.Module
SophiaH optimizer from https://arxiv.org/abs/2305.14342
This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized hessian diagonal estimates, and the update is aggressively clipped.
.. note::
In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the :code:inner
argument if you wish to apply SophiaH preconditioning to another module's output.
.. note::
If you are using gradient estimators or reformulations, set :code:hvp_method
to "forward" or "central".
.. note::
This module requires a closure passed to the optimizer step,
as it needs to re-evaluate the loss and gradients for calculating HVPs.
The closure must accept a backward
argument (refer to documentation).
Parameters:
-
beta1
(float
, default:0.96
) –first momentum. Defaults to 0.96.
-
beta2
(float
, default:0.99
) –momentum for hessian diagonal estimate. Defaults to 0.99.
-
update_freq
(int
, default:10
) –frequency of updating hessian diagonal estimate via a hessian-vector product. Defaults to 10.
-
precond_scale
(float
, default:1
) –scale of the preconditioner. Defaults to 1.
-
clip
(float
, default:1
) –clips update to (-clip, clip). Defaults to 1.
-
eps
(float
, default:1e-12
) –clips the hessian diagonal estimate to be no less than this value. Defaults to 1e-12.
-
hvp_method
(str
, default:'autograd'
) –Determines how Hessian-vector products are evaluated.
"autograd"
: Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient."forward"
: Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation."central"
: Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
-
fd_h
(float
, default:0.001
) –finite difference step size if :code:
hvp_method
is "forward" or "central". Defaults to 1e-3. -
n_samples
(int
, default:1
) –number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
-
seed
(int | None
, default:None
) –seed for random vectors. Defaults to None.
-
inner
(Chainable | None
, default:None
) –preconditioning is applied to the output of this module. Defaults to None.
Examples:
Using SophiaH:
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SophiaH(),
tz.m.LR(0.1)
)
SophiaH preconditioner can be applied to any other module by passing it to the :code:inner
argument.
Turn off SophiaH's first momentum to get just the preconditioning. Here is an example of applying
SophiaH preconditioning to nesterov momentum (:code:tz.m.NAG
):
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.SophiaH(beta1=0, inner=tz.m.NAG(0.96)),
tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/sophia_h.py
Split ¶
Bases: torchzero.core.module.Module
Apply true
modules to all parameters filtered by filter
, apply false
modules to all other parameters.
Parameters:
-
filter
(Filter, bool]
) –a filter that selects tensors to be optimized by
true
. - tensor or iterable of tensors (e.g.encoder.parameters()
). - function that takes in tensor and outputs a bool (e.g.lambda x: x.ndim >= 2
). - a sequence of above (acts as "or", so returns true if any of them is true). -
true
(Chainable | None
) –modules that are applied to tensors where
filter
isTrue
. -
false
(Chainable | None
) –modules that are applied to tensors where
filter
isFalse
.
Examples:¶
Muon with Adam fallback using same hyperparams as https://github.com/KellerJordan/Muon
opt = tz.Modular(
model.parameters(),
tz.m.NAG(0.95),
tz.m.Split(
lambda p: p.ndim >= 2,
true = tz.m.Orthogonalize(),
false = [tz.m.Adam(0.9, 0.95), tz.m.Mul(1/66)],
),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/misc/split.py
Sqrt ¶
Bases: torchzero.core.transform.Transform
Returns :code:sqrt(input)
Source code in torchzero/modules/ops/unary.py
SqrtEMASquared ¶
Bases: torchzero.core.transform.Transform
Maintains an exponential moving average of squared updates, outputs optionally debiased square root.
Parameters:
-
beta
(float
, default:0.999
) –momentum value. Defaults to 0.999.
-
amsgrad
(bool
, default:False
) –whether to maintain maximum of the exponential moving average. Defaults to False.
-
debiased
(bool
, default:False
) –whether to multiply the output by a debiasing term from the Adam method. Defaults to False.
-
pow
(float
, default:2
) –power, absolute value is always used. Defaults to 2.
Methods:
-
SQRT_EMA_SQ_FN
–Updates
exp_avg_sq_
with EMA of squaredtensors
and calculates it's square root,
Source code in torchzero/modules/ops/higher_level.py
SQRT_EMA_SQ_FN ¶
SQRT_EMA_SQ_FN(tensors: TensorList, exp_avg_sq_: TensorList, beta: float | NumberList, max_exp_avg_sq_: TensorList | None, debiased: bool, step: int, pow: float = 2, ema_sq_fn: Callable = ema_sq_)
Updates exp_avg_sq_
with EMA of squared tensors
and calculates its square root,
with optional AMSGrad and debiasing.
Returns new tensors.
Source code in torchzero/modules/functional.py
SqrtHomotopy ¶
SquareHomotopy ¶
StepSize ¶
Bases: torchzero.core.transform.Transform
This is exactly the same as LR, except the lr
parameter can be renamed to any other name to avoid clashes.
Source code in torchzero/modules/step_size/lr.py
StrongWolfe ¶
Bases: torchzero.modules.line_search.line_search.LineSearchBase
Interpolation line search satisfying Strong Wolfe condition.
Parameters:
-
c1
(float
, default:0.0001
) –sufficient descent condition. Defaults to 1e-4.
-
c2
(float
, default:0.9
) –strong curvature condition. For CG set to 0.1. Defaults to 0.9.
-
a_init
(str
, default:'fixed'
) –strategy for initializing the initial step size guess. - "fixed" - uses a fixed value specified in
init_value
argument. - "first-order" - assumes first-order change in the function at iterate will be the same as that obtained at the previous step. - "quadratic" - interpolates quadratic to f(x_{-1}) and f_x. - "quadratic-clip" - same as quad, but uses min(1, 1.01*alpha) as described in Numerical Optimization. - "previous" - uses final step size found on previous iteration.For 2nd order methods it is usually best to leave at "fixed". For methods that do not produce well scaled search directions, e.g. conjugate gradient, "first-order" or "quadratic-clip" are recommended. Defaults to 'init'.
-
a_max
(float
, default:1000000000000.0
) –upper bound for the proposed step sizes. Defaults to 1e12.
-
init_value
(float
, default:1
) –initial step size. Used when
a_init
="fixed", and with other strategies as fallback value. Defaults to 1. -
maxiter
(int
, default:25
) –maximum number of line search iterations. Defaults to 25.
-
maxzoom
(int
, default:10
) –maximum number of zoom iterations. Defaults to 10.
-
maxeval
(int | None
, default:None
) –maximum number of function evaluations. Defaults to None.
-
tol_change
(float
, default:1e-09
) –tolerance, terminates on small brackets. Defaults to 1e-9.
-
interpolation
(str
, default:'cubic'
) –What type of interpolation to use. - "bisection" - uses the middle point. This is robust, especially if the objective function is non-smooth, however it may need more function evaluations. - "quadratic" - minimizes a quadratic model, generally outperformed by "cubic". - "cubic" - minimizes a cubic model - this is the most widely used interpolation strategy. - "polynomial" - fits a polynomial to all points obtained during line search. - "polynomial2" - alternative polynomial fit, where if a point is outside of bounds, a lower degree polynomial is tried. This may have faster convergence than "cubic" and "polynomial".
Defaults to 'cubic'.
-
adaptive
(bool
, default:True
) –if True, the initial step size will be halved when line search failed to find a good direction. When a good direction is found, initial step size is reset to the original value. Defaults to True.
-
fallback
(bool
, default:False
) –if True, when no point satisfied strong wolfe criteria, returns a point with value lower than initial value that doesn't satisfy the criteria. Defaults to False.
-
plus_minus
(bool
, default:False
) –if True, enables the plus-minus variant, where if curvature is negative, line search is performed in the opposite direction. Defaults to False.
Examples:¶
Conjugate gradient method with strong wolfe line search. Nocedal, Wright recommend setting c2 to 0.1 for CG. Since CG doesn't produce well scaled directions, initial alpha can be determined from function values by a_init="first-order"
.
opt = tz.Modular(
model.parameters(),
tz.m.PolakRibiere(),
tz.m.StrongWolfe(c2=0.1, a_init="first-order")
)
LBFGS strong wolfe line search:
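A minimal sketch (default StrongWolfe settings are kept, which the a_init note above suggests is appropriate for second-order methods; assumes a model):
opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.StrongWolfe(),
)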
Source code in torchzero/modules/line_search/strong_wolfe.py
Sub ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Subtract :code:other
from tensors. :code:other
can be a number or a module.
If :code:other
is a module, this calculates :code:tensors - other(tensors)
Source code in torchzero/modules/ops/binary.py
SubModules ¶
Bases: torchzero.modules.ops.multi.MultiOperationBase
Calculates :code:input - other
. :code:input
and :code:other
can be numbers or modules.
Source code in torchzero/modules/ops/multi.py
Sum ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Outputs sum of :code:inputs
that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
SumOfSquares ¶
Bases: torchzero.core.module.Module
Sets loss to be the sum of squares of values returned by the closure.
This is meant to be used to test least squares methods against ordinary minimization methods.
To use this, the closure should return a vector of values whose sum of squares is to be minimized.
The closure must still accept the backward argument; it will always be False, but it is required. A sketch is given below.
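A minimal sketch of the closure contract described above (the linear least-squares setup and the chained modules are illustrative; assumes import torch and import torchzero as tz):
A, b = torch.randn(10, 3), torch.randn(10)
x = torch.zeros(3, requires_grad=True)

def closure(backward=True):
    # returns a vector of residuals; SumOfSquares minimizes their sum of squares
    return A @ x - b

opt = tz.Modular([x], tz.m.SumOfSquares(), tz.m.LBFGS(), tz.m.Backtracking())
for _ in range(10):
    opt.step(closure)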
Source code in torchzero/modules/least_squares/gn.py
Switch ¶
Bases: torchzero.modules.misc.switch.Alternate
After :code:steps
steps switches to the next module.
Parameters:
-
steps
(int | Iterable[int]
) –Number of steps to perform with each module.
Examples:
Start with Adam, switch to L-BFGS after 1000th step and Truncated Newton on 2000th step.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Switch(
[tz.m.Adam(), tz.m.LR(1e-3)],
[tz.m.LBFGS(), tz.m.Backtracking()],
[tz.m.NewtonCG(maxiter=20), tz.m.Backtracking()],
steps = (1000, 2000)
)
)
Source code in torchzero/modules/misc/switch.py
LOOP
class-attribute
¶
TerminateAfterNEvaluations ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAfterNSeconds ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAfterNSteps ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAll ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateAny ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateByGradientNorm ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateByUpdateNorm ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
The update is calculated as the parameter difference.
Source code in torchzero/modules/termination/termination.py
TerminateNever ¶
TerminateOnLossReached ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminateOnNoImprovement ¶
Bases: torchzero.modules.termination.termination.TerminationCriteriaBase
Source code in torchzero/modules/termination/termination.py
TerminationCriteriaBase ¶
Bases: torchzero.core.module.Module
Source code in torchzero/modules/termination/termination.py
ThomasOptimalMethod ¶
Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults
Thomas's "optimal" Quasi-Newton method.
Note
a line search is recommended.
Warning
this uses at least O(N^2) memory.
Reference
Thomas, Stephen Walter. Sequential estimation techniques for quasi-Newton algorithms. Cornell University, 1975.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
Threshold ¶
Bases: torchzero.modules.ops.binary.BinaryOperationBase
Outputs tensors thresholded such that values above :code:threshold
are set to :code:value
.
Source code in torchzero/modules/ops/binary.py
To ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
Cast modules to specified device and dtype
Source code in torchzero/modules/projections/cast.py
TrustCG ¶
Bases: torchzero.modules.trust_region.trust_region.TrustRegionBase
Trust region via Steihaug-Toint Conjugate Gradient method.
.. note::
If you wish to use exact hessian, use the matrix-free :code:`tz.m.NewtonCGSteihaug`
which only uses hessian-vector products. While passing ``tz.m.Newton`` to this
is possible, it is usually less efficient.
Parameters:
-
hess_module
(Module | None
) –A module that maintains a hessian approximation (not hessian inverse!). This includes all full-matrix quasi-newton methods,
tz.m.Newton
andtz.m.GaussNewton
. When using quasi-newton methods, setinverse=False
when constructing them. -
eta
(float
, default:0.0
) –if the ratio of actual to predicted reduction is larger than this, the step is accepted. When :code:
hess_module
is GaussNewton, this can be set to 0. Defaults to 0.
nplus
(float
, default:3.5
) –increase factor on successful steps. Defaults to 3.5.
-
nminus
(float
, default:0.25
) –decrease factor on unsuccessful steps. Defaults to 0.25.
-
rho_good
(float
, default:0.99
) –if the ratio of actual to predicted reduction is larger than this, the trust region size is multiplied by
nplus
. -
rho_bad
(float
, default:0.0001
) –if the ratio of actual to predicted reduction is less than this, the trust region size is multiplied by
nminus
. -
init
(float
, default:1
) –Initial trust region value. Defaults to 1.
-
update_freq
(int
, default:1
) –frequency of updating the hessian. Defaults to 1.
-
reg
(int
, default:0
) –regularization parameter for conjugate gradient. Defaults to 0.
-
max_attempts
(int
, default:10
) –maximum number of trust region size reductions per step. A zero update vector is returned when this limit is exceeded. Defaults to 10.
-
boundary_tol
(float | None
, default:1e-06
) –The trust region only increases when suggested step's norm is at least
(1-boundary_tol)*trust_region
. This prevents increasing the trust region when the solution is not on the boundary. Defaults to 1e-6. -
prefer_exact
(bool
, default:True
) –when exact solution can be easily calculated without CG (e.g. hessian is stored as scaled identity), uses the exact solution. If False, always uses CG. Defaults to True.
-
inner
(Chainable | None
, default:None
) –preconditioning is applied to the output of this module. Defaults to None.
Examples:
Trust-SR1
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.TrustCG(hess_module=tz.m.SR1(inverse=False)),
)
Source code in torchzero/modules/trust_region/trust_cg.py
TrustRegionBase ¶
Bases: torchzero.core.module.Module
, abc.ABC
Methods:
-
trust_region_apply
–Solves the trust region subproblem and outputs
Var
with the solution direction. -
trust_region_update
–updates the state of this module after H or B have been updated, if necessary
-
trust_solve
–Solve Hx=g with a trust region penalty/bound defined by
radius
Source code in torchzero/modules/trust_region/trust_region.py
trust_region_apply ¶
Solves the trust region subproblem and outputs Var
with the solution direction.
Source code in torchzero/modules/trust_region/trust_region.py
trust_region_update ¶
trust_region_update(var: Var, H: LinearOperator | None) -> None
updates the state of this module after H or B have been updated, if necessary
trust_solve ¶
trust_solve(f: float, g: Tensor, H: LinearOperator, radius: float, params: list[Tensor], closure: Callable, settings: Mapping[str, Any]) -> Tensor
Solve Hx=g with a trust region penalty/bound defined by radius
Source code in torchzero/modules/trust_region/trust_region.py
TwoPointNewton ¶
Bases: torchzero.modules.second_order.multipoint.HigherOrderMethodBase
Two-point Newton method with frozen derivative and third-order convergence.
Sharma, Janak Raj, and Deepak Kumar. "A fast and efficient composite Newton–Chebyshev method for systems of nonlinear equations." Journal of Complexity 49 (2018): 56-73.
Source code in torchzero/modules/second_order/multipoint.py
UnaryLambda ¶
Bases: torchzero.core.transform.Transform
Applies :code:fn
to input tensors.
:code:fn
must accept and return a list of tensors.
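A minimal sketch (the clamping function is illustrative; assumes a model):
opt = tz.Modular(
    model.parameters(),
    tz.m.UnaryLambda(lambda tensors: [t.clamp(-1, 1) for t in tensors]),
    tz.m.LR(1e-2),
)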
Source code in torchzero/modules/ops/unary.py
UnaryParameterwiseLambda ¶
Bases: torchzero.core.transform.TensorwiseTransform
Applies :code:fn
to each input tensor.
:code:fn
must accept and return a tensor.
Source code in torchzero/modules/ops/unary.py
Uniform ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with random numbers from uniform distribution between :code:low
and :code:high
.
Source code in torchzero/modules/ops/utility.py
UpdateGradientSignConsistency ¶
Bases: torchzero.core.transform.Transform
Compares update and gradient signs. Output will have 1s where signs match, and 0s where they don't.
Parameters:
-
normalize
(bool
, default:False
) –renormalize update after masking. Defaults to False.
-
eps
(float
, default:1e-06
) –epsilon for normalization. Defaults to 1e-6.
Source code in torchzero/modules/momentum/cautious.py
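Conceptually the output is a per-element agreement mask; the snippet below illustrates the idea with plain tensors (handling of exact zeros and the optional renormalization may differ in the actual module).
.. code-block:: python
    import torch

    update = torch.tensor([0.5, -0.2, 0.1])
    grad = torch.tensor([0.3, 0.4, -0.1])
    mask = (update * grad > 0).float()   # 1 where update and gradient signs agree
    # mask == tensor([1., 0., 0.])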
UpdateSign ¶
Bases: torchzero.core.transform.Transform
Outputs gradient with sign copied from the update.
Source code in torchzero/modules/misc/misc.py
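In tensor terms this corresponds to copying the sign of the update onto the gradient's magnitudes, e.g. (illustrative, not the module's source):
.. code-block:: python
    import torch

    grad = torch.tensor([0.3, -0.4, 0.2])
    update = torch.tensor([-1.0, -2.0, 3.0])
    out = torch.copysign(grad, update)   # magnitudes of grad, signs of update
    # out == tensor([-0.3000, -0.4000, 0.2000])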
UpdateToNone ¶
Bases: torchzero.core.module.Module
Sets :code:update
attribute to None on :code:var
.
Source code in torchzero/modules/ops/utility.py
VectorProjection ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
projection that concatenates all parameters into a vector
Source code in torchzero/modules/projections/projection.py
ViewAsReal ¶
Bases: torchzero.modules.projections.projection.ProjectionBase
View complex tensors as real tensors. Doesn't affect tensors that are already real.
Source code in torchzero/modules/projections/cast.py
Warmup ¶
Bases: torchzero.core.transform.Transform
Learning rate warmup, linearly increases learning rate multiplier from :code:start_lr
to :code:end_lr
over :code:steps
steps.
Parameters:
-
steps
(int
, default:100
) –number of steps to perform warmup for. Defaults to 100.
-
start_lr
(float
, default:1e-05
) –initial learning rate multiplier on first step. Defaults to 1e-5.
-
end_lr
(float
, default:1
) –learning rate multiplier at the end and after warmup. Defaults to 1.
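The implied schedule is a linear interpolation of the multiplier over the warmup window; a minimal sketch (exact endpoint and step-counting conventions in the module may differ):
.. code-block:: python
    def warmup_multiplier(step: int, steps: int = 100, start_lr: float = 1e-5, end_lr: float = 1.0) -> float:
        # linear ramp from start_lr to end_lr, then held at end_lr after warmup
        t = min(step / steps, 1.0)
        return start_lr + (end_lr - start_lr) * t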
Example
Adam with 1000 steps warmup
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Adam(),
tz.m.LR(1e-2),
tz.m.Warmup(steps=1000)
)
Source code in torchzero/modules/step_size/lr.py
WarmupNormClip ¶
Bases: torchzero.core.transform.Transform
Warmup via clipping of the update norm.
Parameters:
-
start_norm
(float
, default:1e-05
) –maximal norm on the first step. Defaults to 1e-5.
-
end_norm
(float
, default:1
) –maximal norm on the last step. After that, norm clipping is disabled. Defaults to 1.
-
steps
(int
, default:100
) –number of steps to perform warmup for. Defaults to 100.
Example
Adam with 1000 steps norm clip warmup
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Adam(),
tz.m.WarmupNormClip(steps=1000),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/step_size/lr.py
WeightDecay ¶
Bases: torchzero.core.transform.Transform
Weight decay.
Parameters:
-
weight_decay
(float
) –weight decay scale.
-
ord
(int
, default:2
) –order of the penalty, e.g. 1 for L1 and 2 for L2. Defaults to 2.
-
target
(Literal
, default:'update'
) –what to set on var. Defaults to 'update'.
Examples:¶
Adam with non-decoupled weight decay
Adam with decoupled weight decay that still scales with learning rate
Adam with fully decoupled weight decay that doesn't scale with learning rate
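The code for these three examples did not survive extraction; the snippets below are a plausible reconstruction based on the module-ordering convention used by the other examples in this document (placement relative to Adam and LR determines how the decay couples with the step), and the decay values are chosen purely for illustration.
.. code-block:: python
    # Non-decoupled: decay is added to the update before Adam's preconditioning
    opt = tz.Modular(
        model.parameters(),
        tz.m.WeightDecay(1e-2),
        tz.m.Adam(),
        tz.m.LR(1e-2),
    )

    # Decoupled but still scaled by the learning rate: decay applied after Adam, before LR
    opt = tz.Modular(
        model.parameters(),
        tz.m.Adam(),
        tz.m.WeightDecay(1e-2),
        tz.m.LR(1e-2),
    )

    # Fully decoupled, independent of the learning rate: decay applied after LR
    opt = tz.Modular(
        model.parameters(),
        tz.m.Adam(),
        tz.m.LR(1e-2),
        tz.m.WeightDecay(1e-4),
    )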
Source code in torchzero/modules/weight_decay/weight_decay.py
WeightDropout ¶
Bases: torchzero.core.module.Module
Changes the closure so that it evaluates loss and gradients with random weights replaced with 0.
Dropout can be disabled for a parameter by setting :code:use_dropout=False
in corresponding parameter group.
Parameters:
-
p
(float
, default:0.5
) –probability that any weight is replaced with 0. Defaults to 0.5.
-
graft
(bool
, default:True
) –if True, parameters after dropout are rescaled to have the same norm as before dropout. Defaults to True.
Source code in torchzero/modules/misc/regularization.py
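A hypothetical usage sketch with the parameters listed above:
.. code-block:: python
    opt = tz.Modular(
        model.parameters(),
        tz.m.WeightDropout(p=0.5, graft=True),  # evaluate the closure with roughly half the weights zeroed
        tz.m.Adam(),
        tz.m.LR(1e-2),
    )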
WeightedAveraging ¶
Bases: torchzero.core.transform.TensorwiseTransform
Weighted average of past len(weights)
updates.
Parameters:
-
weights
(Sequence[float]
) –a sequence of weights from oldest to newest.
-
target
(Literal
, default:'update'
) –target. Defaults to 'update'.
Source code in torchzero/modules/momentum/averaging.py
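A hypothetical usage sketch, weighting recent updates more heavily (weights are ordered oldest to newest per the parameter description):
.. code-block:: python
    opt = tz.Modular(
        model.parameters(),
        tz.m.WeightedAveraging(weights=[0.1, 0.3, 0.6]),  # oldest -> newest
        tz.m.LR(1e-2),
    )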
WeightedMean ¶
Bases: torchzero.modules.ops.reduce.WeightedSum
Outputs weighted mean of :code:inputs
that can be modules or numbers.
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
bool
WeightedSum ¶
Bases: torchzero.modules.ops.reduce.ReduceOperationBase
Source code in torchzero/modules/ops/reduce.py
USE_MEAN
class-attribute
¶
bool
Wrap ¶
Bases: torchzero.core.module.Module
Wraps a pytorch optimizer to use it as a module.
.. note::
Custom param groups are supported only by set_param_groups
, settings passed to Modular will be ignored.
Parameters:
-
opt_fn
(Callable[..., Optimizer] | Optimizer
) –function that takes in parameters and returns the optimizer, for example :code:
torch.optim.Adam
or :code:lambda parameters: torch.optim.Adam(parameters, lr=1e-3)
-
*args
–extra positional arguments passed to opt_fn. -
**kwargs
–Extra args to be passed to opt_fn. The function is called as :code:
opt_fn(parameters, *args, **kwargs)
.
Example
wrapping pytorch_optimizer.StableAdamW
.. code-block:: py
from pytorch_optimizer import StableAdamW
opt = tz.Modular(
model.parameters(),
tz.m.Wrap(StableAdamW, lr=1),
tz.m.Cautious(),
tz.m.LR(1e-2)
)
Source code in torchzero/modules/wrappers/optim_wrapper.py
Zeros ¶
Bases: torchzero.core.module.Module
Outputs zeros
Source code in torchzero/modules/ops/utility.py
clip_grad_norm_ ¶
clip_grad_norm_(params: Iterable[Tensor], max_norm: float | None, ord: Union[Literal['mad', 'std', 'var', 'sum', 'l0', 'l1', 'l2', 'l3', 'l4', 'linf'], float, Tensor] = 2, dim: Union[int, Sequence[int], Literal['global'], NoneType] = None, inverse_dims: bool = False, min_size: int = 2, min_norm: float | None = None)
Clips gradient of an iterable of parameters to specified norm value. Gradients are modified in-place.
Parameters:
-
params
(Iterable[Tensor]
) –parameters with gradients to clip.
-
max_norm
(float | None
) –value to clip norm to.
-
ord
(float
, default:2
) –norm order. Defaults to 2.
-
dim
(int | Sequence[int] | str | None
, default:None
) –calculates norm along those dimensions. If list/tuple, tensors are normalized along all dimensions in
dim
that they have. Can be set to "global" to normalize by global norm of all gradients concatenated to a vector. Defaults to None. -
min_size
(int
, default:2
) –minimal size of a dimension to normalize along it. Defaults to 2.
Source code in torchzero/modules/clipping/clipping.py
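A usage sketch; the import path below follows the "Source code" location above and is an assumption (the function may also be re-exported elsewhere in the package).
.. code-block:: python
    import torch
    from torchzero.modules.clipping.clipping import clip_grad_norm_  # path assumed from the source listing

    model = torch.nn.Linear(10, 2)
    loss = model(torch.randn(4, 10)).pow(2).sum()
    loss.backward()

    # clip the global L2 norm of all gradients to 1.0
    clip_grad_norm_(model.parameters(), max_norm=1.0, ord=2, dim="global")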
clip_grad_value_ ¶
Clips gradients of an iterable of parameters at the specified value. Gradients are modified in-place.
Parameters:
-
params
(Iterable[Tensor]
) –iterable of tensors with gradients to clip.
-
value
(float | int
) –maximum allowed value of the gradient.
Source code in torchzero/modules/clipping/clipping.py
decay_weights_ ¶
Directly decays weights in-place.
Source code in torchzero/modules/weight_decay/weight_decay.py
normalize_grads_ ¶
normalize_grads_(params: Iterable[Tensor], norm_value: float, ord: Union[Literal['mad', 'std', 'var', 'sum', 'l0', 'l1', 'l2', 'l3', 'l4', 'linf'], float, Tensor] = 2, dim: Union[int, Sequence[int], Literal['global'], NoneType] = None, inverse_dims: bool = False, min_size: int = 1)
Normalizes gradient of an iterable of parameters to specified norm value. Gradients are modified in-place.
Parameters:
-
params
(Iterable[Tensor]
) –parameters with gradients to clip.
-
norm_value
(float
) –value to normalize the norm to.
-
ord
(float
, default:2
) –norm order. Defaults to 2.
-
dim
(int | Sequence[int] | str | None
, default:None
) –calculates norm along those dimensions. If list/tuple, tensors are normalized along all dimensions in
dim
that they have. Can be set to "global" to normalize by global norm of all gradients concatenated to a vector. Defaults to None. -
inverse_dims
(bool
, default:False
) –if True, the
dim
argument is inverted, and all other dimensions are normalized. -
min_size
(int
, default:1
) –minimal size of a dimension to normalize along it. Defaults to 1.
Source code in torchzero/modules/clipping/clipping.py
orthogonalize_grads_ ¶
orthogonalize_grads_(params: Iterable[Tensor], steps: int = 5, dual_norm_correction=False, method: Literal['newton-schulz', 'svd'] = 'newton-schulz')
Uses Newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.
This sets gradients in-place. Applies along first 2 dims (expected to be out_channels, in_channels
).
Note that the Muon page says that embeddings and classifier heads should not be orthogonalized.
Parameters:
-
params
(Iterable[Tensor]
) –parameters that hold gradients to orthogonalize.
-
steps
(int
, default:5
) –number of Newton-Schulz iterations to run. Defaults to 5.
-
dual_norm_correction
(bool
, default:False
) –enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
-
method
(str
, default:'newton-schulz'
) –Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise. Defaults to 'newton-schulz'.
Source code in torchzero/modules/adaptive/muon.py
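For reference, the classic cubic Newton-Schulz iteration for orthogonalizing a matrix looks like the sketch below; the library (like Muon) may use a tuned quintic polynomial with different coefficients, so treat this as an illustration of the idea rather than the actual kernel.
.. code-block:: python
    import torch

    def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
        # Cubic Newton-Schulz iteration converging to the "zeroth power" of G,
        # i.e. U @ V.T from the SVD G = U S V.T.
        X = G / (G.norm() + 1e-7)          # scale so all singular values are <= 1
        transpose = X.shape[0] > X.shape[1]
        if transpose:                       # work with the wide orientation
            X = X.T
        for _ in range(steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X   # singular values s -> 1.5 s - 0.5 s^3 -> 1
        return X.T if transpose else X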
orthograd_ ¶
Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.
Parameters:
-
params
(Iterable[Tensor]
) –parameters that hold gradients to apply ⟂Grad to.
-
eps
(float
, default:1e-30
) –epsilon added to the denominator for numerical stability (default: 1e-30)
Reference: https://arxiv.org/abs/2501.04697
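The projection itself is a single Gram-Schmidt step; a minimal out-of-place sketch (the library modifies .grad in place and may additionally rescale the result):
.. code-block:: python
    import torch

    def orthograd(w: torch.Tensor, g: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
        # remove the component of the gradient g that is parallel to the weights w
        wf, gf = w.flatten(), g.flatten()
        proj = (wf @ gf) / (wf @ wf + eps)
        return g - proj * w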