Miscellaneous¶
This subpackage contains a number of uncategorized modules, notably gradient accumulation, switching, automatic resetting, and random restarts.
Classes:

- Alternate – Alternates between stepping with :code:`modules`.
- DivByLoss – Divides update by loss times :code:`alpha`.
- Dropout – Applies dropout to the update.
- EscapeAnnealing – If parameters stop changing, this runs a backward annealing random search.
- ExpHomotopy
- FillLoss – Outputs tensors filled with loss value times :code:`alpha`.
- GradSign – Copies gradient sign to update.
- GradientAccumulation – Uses n steps to accumulate gradients; after n gradients have been accumulated, they are passed to :code:`modules` and parameters are updated.
- GraftGradToUpdate – Outputs gradient grafted to update, that is, gradient rescaled to have the same norm as the update.
- GraftToGrad – Grafts update to the gradient, that is, update rescaled to have the same norm as the gradient.
- GraftToParams – Grafts update to the parameters, that is, update rescaled to have the same norm as the parameters, but no smaller than :code:`eps`.
- HpuEstimate – Returns y/||s||, where y is the difference between the current and previous update (gradient) and s is the difference between the current and previous parameters. The returned tensors are a finite-difference approximation to the Hessian times the previous update.
- LambdaHomotopy
- LastAbsoluteRatio – Outputs the ratio between absolute values of the past two updates; the numerator is determined by the :code:`numerator` argument.
- LastDifference – Outputs difference between past two updates.
- LastGradDifference – Outputs difference between past two gradients.
- LastProduct – Outputs product of past two updates.
- LastRatio – Outputs the ratio between the past two updates; the numerator is determined by the :code:`numerator` argument.
- LogHomotopy
- MulByLoss – Multiplies update by loss times :code:`alpha`.
- Multistep – Performs :code:`steps` inner steps with :code:`module` on each step.
- NegateOnLossIncrease – Uses an extra forward pass to evaluate loss at :code:`parameters+update`.
- NoiseSign – Outputs random tensors with sign copied from the update.
- Online – Allows certain modules to be used for mini-batch optimization.
- PerturbWeights – Changes the closure so that it evaluates loss and gradients at weights perturbed by a random perturbation.
- Previous – Maintains an update from n steps back; for example, if n=1, returns the previous update.
- PrintLoss – Prints var.get_loss().
- PrintParams – Prints current parameters.
- PrintShape – Prints shapes of the update.
- PrintUpdate – Prints current update.
- RandomHvp – Returns a Hessian-vector product with a random vector.
- Relative – Multiplies update by absolute parameter values to make it relative to their magnitude; :code:`min_value` is the minimum allowed value to avoid getting stuck at 0.
- SaveBest – Saves the best parameters found so far, the ones with the lowest loss. Put this as the last module.
- Sequential – On each step, this sequentially steps with :code:`modules` :code:`steps` times.
- Split – Applies :code:`true` modules to all parameters selected by :code:`filter`, and :code:`false` modules to all other parameters.
- SqrtHomotopy
- SquareHomotopy
- Switch – After :code:`steps` steps, switches to the next module.
- UpdateSign – Outputs gradient with sign copied from the update.
- WeightDropout – Changes the closure so that it evaluates loss and gradients with random weights replaced with 0.
Alternate ¶
Bases: torchzero.core.module.Module
Alternates between stepping with :code:`modules`. That is, the first step is performed with the 1st module, the second step with the 2nd module, and so on.
Parameters:
- steps (int | Iterable[int], default: 1) – number of steps to perform with each module. Defaults to 1.
Examples:
Alternate between Adam, SignSGD and RMSprop
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Alternate(
tz.m.Adam(),
[tz.m.SignSGD(), tz.m.Mul(0.5)],
tz.m.RMSprop(),
),
tz.m.LR(1e-3),
)
Source code in torchzero/modules/misc/switch.py
LOOP class-attribute ¶
DivByLoss ¶
Bases: torchzero.core.module.Module
Divides update by loss times :code:`alpha`.
Source code in torchzero/modules/misc/misc.py
Dropout ¶
Bases: torchzero.core.transform.Transform
Applies dropout to the update.
For each weight, the update to that weight has :code:`p` probability of being set to 0.
This can be used to implement gradient dropout or update dropout depending on placement.
Parameters:
- p (float, default: 0.5) – probability that the update for a weight is replaced with 0. Defaults to 0.5.
- graft (bool, default: False) – if True, update after dropout is rescaled to have the same norm as before dropout. Defaults to False.
- target (Literal, default: 'update') – what to set on var, refer to documentation. Defaults to 'update'.
Examples:
Gradient dropout.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Dropout(0.5),
tz.m.Adam(),
tz.m.LR(1e-3)
)
Update dropout.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Adam(),
tz.m.Dropout(0.5),
tz.m.LR(1e-3)
)
Source code in torchzero/modules/misc/regularization.py
EscapeAnnealing ¶
Bases: torchzero.core.module.Module
If parameters stop changing, this runs a backward annealing random search
Source code in torchzero/modules/misc/escape.py
ExpHomotopy ¶
FillLoss ¶
Bases: torchzero.core.module.Module
Outputs tensors filled with loss value times :code:alpha
Source code in torchzero/modules/misc/misc.py
GradSign ¶
Bases: torchzero.core.transform.Transform
Copies gradient sign to update.
Source code in torchzero/modules/misc/misc.py
GradientAccumulation ¶
Bases: torchzero.core.module.Module
Uses n steps to accumulate gradients; after n gradients have been accumulated, they are passed to :code:`modules` and parameters are updated.
Accumulating gradients for n steps is equivalent to increasing the batch size by a factor of n. Increasing the batch size is more computationally efficient, but sometimes it is not feasible due to memory constraints.
Note
Technically this can accumulate any inputs, including updates generated by previous modules. As long as this module is first, it will accumulate the gradients.
Parameters:
- n (int) – number of gradients to accumulate.
- mean (bool, default: True) – if True, uses the mean of accumulated gradients, otherwise uses the sum. Defaults to True.
- stop (bool, default: True) – this module prevents the next modules from stepping until n gradients have been accumulated. Setting this argument to False disables that. Defaults to True.
Examples:¶
Adam with gradients accumulated for 16 batches.
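A minimal sketch of this setup, following the tz.Modular pattern used by the other examples on this page (the positional :code:`n` argument is an assumption):

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # accumulate 16 gradients, then pass their mean to the modules below
        tz.m.GradientAccumulation(16),
        tz.m.Adam(),
        tz.m.LR(1e-3),
    )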
Source code in torchzero/modules/misc/gradient_accumulation.py
GraftGradToUpdate ¶
Bases: torchzero.core.transform.Transform
Outputs gradient grafted to update, that is gradient rescaled to have the same norm as the update.
Source code in torchzero/modules/misc/misc.py
GraftToGrad ¶
Bases: torchzero.core.transform.Transform
Grafts update to the gradient, that is update is rescaled to have the same norm as the gradient.
Source code in torchzero/modules/misc/misc.py
GraftToParams ¶
Bases: torchzero.core.transform.Transform
Grafts update to the parameters, that is, the update is rescaled to have the same norm as the parameters, but no smaller than :code:`eps`.
Source code in torchzero/modules/misc/misc.py
HpuEstimate ¶
Bases: torchzero.core.transform.Transform
Returns y/||s||, where y is the difference between the current and previous update (gradient), and s is the difference between the current and previous parameters. The returned tensors are a finite-difference approximation to the Hessian times the previous update.
Source code in torchzero/modules/misc/misc.py
LambdaHomotopy ¶
Bases: torchzero.modules.misc.homotopy.HomotopyBase
Source code in torchzero/modules/misc/homotopy.py
LastAbsoluteRatio ¶
Bases: torchzero.core.transform.Transform
Outputs the ratio between absolute values of the past two updates; the numerator is determined by the :code:`numerator` argument.
Source code in torchzero/modules/misc/misc.py
LastDifference ¶
Bases: torchzero.core.transform.Transform
Outputs difference between past two updates.
Source code in torchzero/modules/misc/misc.py
LastGradDifference ¶
Bases: torchzero.core.module.Module
Outputs difference between past two gradients.
Source code in torchzero/modules/misc/misc.py
LastProduct ¶
Bases: torchzero.core.transform.Transform
Outputs product of past two updates.
Source code in torchzero/modules/misc/misc.py
LastRatio ¶
Bases: torchzero.core.transform.Transform
Outputs the ratio between the past two updates; the numerator is determined by the :code:`numerator` argument.
Source code in torchzero/modules/misc/misc.py
LogHomotopy ¶
MulByLoss ¶
Bases: torchzero.core.module.Module
Multiplies update by loss times :code:`alpha`.
Source code in torchzero/modules/misc/misc.py
Multistep ¶
Bases: torchzero.core.module.Module
Performs :code:`steps` inner steps with :code:`module` on each step.
The update is taken to be the parameter difference between parameters before and after the inner loop.
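A hypothetical usage sketch; the exact constructor signature is an assumption based on the parameter names above:

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # run 4 Adam steps internally; the outer update is the total parameter change
        tz.m.Multistep(module=[tz.m.Adam(), tz.m.LR(1e-3)], steps=4),
    )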
Source code in torchzero/modules/misc/multistep.py
NegateOnLossIncrease ¶
Bases: torchzero.core.module.Module
Uses an extra forward pass to evaluate loss at :code:`parameters+update`; if the loss is larger than at :code:`parameters`, the update is set to 0 if :code:`backtrack=False` and to :code:`-update` otherwise.
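A hypothetical placement sketch; ordering and hyperparameters are illustrative assumptions, and the optimizer must be stepped with a closure so the extra forward pass can be made:

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.Adam(),
        tz.m.LR(1e-3),
        # undo (or negate) the step if it increased the loss
        tz.m.NegateOnLossIncrease(),
    )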
Source code in torchzero/modules/misc/multistep.py
NoiseSign ¶
Bases: torchzero.core.transform.Transform
Outputs random tensors with sign copied from the update.
Source code in torchzero/modules/misc/misc.py
Online ¶
Bases: torchzero.core.module.Module
Allows certain modules to be used for mini-batch optimization.
Examples:
Online L-BFGS with Backtracking line search
Online L-BFGS trust region
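A minimal sketch of the first example, assuming :code:`Online` wraps the module it adapts for mini-batch use:

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # wrap L-BFGS so it can be used with mini-batches
        tz.m.Online(tz.m.LBFGS()),
        tz.m.Backtracking(),
    )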
Source code in torchzero/modules/misc/multistep.py
PerturbWeights ¶
Bases: torchzero.core.module.Module
Changes the closure so that it evaluates loss and gradients at weights perturbed by a random perturbation.
Can be disabled for a parameter by setting :code:`perturb=False` in the corresponding parameter group.
Parameters:
- alpha (float, default: 0.1) – multiplier for perturbation magnitude. Defaults to 0.1.
- relative (bool, default: True) – whether to multiply the perturbation by the mean absolute value of the parameter. Defaults to True.
- distribution (str, default: 'normal') – distribution of the random perturbation. Defaults to 'normal'.
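A hypothetical placement sketch; since PerturbWeights modifies the closure, it is placed before the modules that consume the gradient (ordering and values are illustrative assumptions):

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # evaluate loss and gradients at randomly perturbed weights
        tz.m.PerturbWeights(alpha=0.1),
        tz.m.Adam(),
        tz.m.LR(1e-3),
    )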
Source code in torchzero/modules/misc/regularization.py
Previous ¶
Bases: torchzero.core.transform.TensorwiseTransform
Maintains an update from n steps back; for example, if n=1, returns the previous update.
Source code in torchzero/modules/misc/misc.py
PrintLoss ¶
Bases: torchzero.core.module.Module
Prints var.get_loss().
Source code in torchzero/modules/misc/debug.py
PrintParams ¶
Bases: torchzero.core.module.Module
Prints current parameters.
Source code in torchzero/modules/misc/debug.py
PrintShape ¶
Bases: torchzero.core.module.Module
Prints shapes of the update.
Source code in torchzero/modules/misc/debug.py
PrintUpdate ¶
Bases: torchzero.core.module.Module
Prints current update.
Source code in torchzero/modules/misc/debug.py
RandomHvp ¶
Bases: torchzero.core.module.Module
Returns a Hessian-vector product with a random vector.
Source code in torchzero/modules/misc/misc.py
Relative ¶
Bases: torchzero.core.transform.Transform
Multiplies update by absolute parameter values to make it relative to their magnitude; :code:`min_value` is the minimum allowed value to avoid getting stuck at 0.
Source code in torchzero/modules/misc/misc.py
SaveBest ¶
Bases: torchzero.core.module.Module
Saves the best parameters found so far, the ones with the lowest loss. Put this as the last module.
Adds the following attrs:
- best_params – a list of tensors with the best parameters.
- best_loss – loss value with best_params.
- load_best_parameters – a function that sets parameters to the best parameters.
Examples¶
```python
def rosenbrock(x, y):
    return (1 - x)**2 + (100 * (y - x**2))**2

xy = torch.tensor((-1.1, 2.5), requires_grad=True)
opt = tz.Modular(
    [xy],
    tz.m.NAG(0.999),
    tz.m.LR(1e-6),
    tz.m.SaveBest()
)

# optimize for 1000 steps
for i in range(1000):
    loss = rosenbrock(*xy)
    opt.zero_grad()
    loss.backward()
    opt.step(loss=loss)  # SaveBest needs closure or loss

# NAG overshot, but we saved the best params
print(f'{rosenbrock(*xy) = }')  # >> 3.6583
print(f"{opt.attrs['best_loss'] = }")  # >> 0.000627

# load best parameters
opt.attrs['load_best_params']()
print(f'{rosenbrock(*xy) = }')  # >> 0.000627
```
Source code in torchzero/modules/misc/misc.py
Sequential ¶
Bases: torchzero.core.module.Module
On each step, this sequentially steps with :code:`modules` :code:`steps` times.
The update is taken to be the parameter difference between parameters before and after the inner loop.
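A hypothetical usage sketch; the constructor signature is an assumption based on the parameter names above:

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # step twice with the inner chain; the outer update is the total parameter change
        tz.m.Sequential([tz.m.Adam(), tz.m.LR(1e-3)], steps=2),
    )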
Source code in torchzero/modules/misc/multistep.py
Split ¶
Bases: torchzero.core.module.Module
Applies :code:`true` modules to all parameters selected by :code:`filter`, and :code:`false` modules to all other parameters.
Parameters:
- filter (Filter) – a filter that selects tensors to be optimized by :code:`true`. Can be:
  - a tensor or an iterable of tensors (e.g. encoder.parameters());
  - a function that takes in a tensor and outputs a bool (e.g. lambda x: x.ndim >= 2);
  - a sequence of the above (acts as "or", so returns true if any of them is true).
- true (Chainable | None) – modules that are applied to tensors where :code:`filter` is True.
- false (Chainable | None) – modules that are applied to tensors where :code:`filter` is False.
Examples:¶
Muon with Adam fallback using same hyperparams as https://github.com/KellerJordan/Muon
opt = tz.Modular(
model.parameters(),
tz.m.NAG(0.95),
tz.m.Split(
lambda p: p.ndim >= 2,
true = tz.m.Orthogonalize(),
false = [tz.m.Adam(0.9, 0.95), tz.m.Mul(1/66)],
),
tz.m.LR(1e-2),
)
Source code in torchzero/modules/misc/split.py
SqrtHomotopy ¶
SquareHomotopy ¶
Switch ¶
Bases: torchzero.modules.misc.switch.Alternate
After :code:`steps` steps, switches to the next module.
Parameters:
- steps (int | Iterable[int]) – number of steps to perform with each module.
Examples:
Start with Adam, switch to L-BFGS after the 1000th step, and to truncated Newton after the 2000th step.
.. code-block:: python
opt = tz.Modular(
model.parameters(),
tz.m.Switch(
[tz.m.Adam(), tz.m.LR(1e-3)],
[tz.m.LBFGS(), tz.m.Backtracking()],
[tz.m.NewtonCG(maxiter=20), tz.m.Backtracking()],
steps = (1000, 2000)
)
)
Source code in torchzero/modules/misc/switch.py
LOOP class-attribute ¶
UpdateSign ¶
Bases: torchzero.core.transform.Transform
Outputs gradient with sign copied from the update.
Source code in torchzero/modules/misc/misc.py
WeightDropout ¶
Bases: torchzero.core.module.Module
Changes the closure so that it evaluates loss and gradients with random weights replaced with 0.
Dropout can be disabled for a parameter by setting :code:`use_dropout=False` in the corresponding parameter group.
Parameters:
- p (float, default: 0.5) – probability that any weight is replaced with 0. Defaults to 0.5.
- graft (bool, default: True) – if True, parameters after dropout are rescaled to have the same norm as before dropout. Defaults to True.
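A hypothetical placement sketch; since WeightDropout modifies the closure, it is placed before the modules that consume the gradient (ordering and values are illustrative assumptions):

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        # evaluate loss and gradients with each weight zeroed with probability 0.5
        tz.m.WeightDropout(0.5),
        tz.m.Adam(),
        tz.m.LR(1e-3),
    )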