Adaptive

This subpackage contains adaptive methods e.g. Adam, RMSprop, SOAP, etc.
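
The modules here are chained inside tz.Modular, usually together with a learning-rate module such as tz.m.LR. A minimal sketch, assuming torchzero is imported as tz and model is an ordinary torch.nn.Module (the chosen modules and learning rate are illustrative):

import torch
import torchzero as tz

model = torch.nn.Linear(10, 1)

opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),   # adaptive update rule from this subpackage
    tz.m.LR(1e-3), # explicit learning-rate module
)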

Classes:

  • AEGD

    AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.

  • ASAM

    Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52

  • AdaHessian

    AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)

  • Adagrad

    Adagrad, divides by sum of past squares of gradients.

  • AdagradNorm

    Adagrad-Norm, divides by sum of past means of squares of gradients.

  • Adam

    Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.

  • Adan

    Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677

  • AdaptiveHeavyBall

    Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.

  • BacktrackOnSignChange

    Negates or undoes the update for parameters where the gradient or update sign changes.

  • DualNormCorrection

    Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).

  • ESGD

    Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)

  • FullMatrixAdagrad

    Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).

  • LMAdagrad

    Limited-memory full matrix Adagrad.

  • Lion

    Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.

  • MARSCorrection

    MARS variance reduction correction.

  • MSAM

    Momentum-SAM from https://arxiv.org/pdf/2401.12033.

  • MSAMObjective

    Momentum-SAM from https://arxiv.org/pdf/2401.12033.

  • MatrixMomentum

    Second order momentum method.

  • MuonAdjustLR

    LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).

  • NaturalGradient

    Natural gradient approximated via the empirical Fisher information matrix.

  • OrthoGrad

    Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.

  • Orthogonalize

    Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.

  • RMSprop

    Divides gradient by EMA of gradient squares.

  • Rprop

    Resilient propagation. The update magnitude gets multiplied by nplus if the gradient didn't change sign, and by nminus if it did.

  • SAM

    Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412

  • SOAP

    SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) from https://arxiv.org/abs/2409.11321.

  • ScaleLRBySignChange

    The learning rate gets multiplied by nplus if the ascent/gradient didn't change sign, and by nminus if it did.

  • Shampoo

    Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).

  • SignConsistencyLRs

    Outputs per-weight learning rates based on consecutive sign consistency.

  • SignConsistencyMask

    Outputs a mask of sign consistency of current and previous inputs.

  • SophiaH

    SophiaH optimizer from https://arxiv.org/abs/2305.14342

Functions:

  • orthogonalize_grads_

    Uses Newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.

  • orthograd_

    Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.

AEGD

Bases: torchzero.core.transform.Transform

AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.

Note

AEGD has a learning rate hyperparameter that can't really be removed from the update rule. To avoid compounding learning rate modifications, remove the tz.m.LR module if you had it.

Parameters:

  • eta (float) –

    step size. Defaults to 0.1.

  • c (float, default: 1 ) –

    c. Defaults to 1.

  • beta3 (float) –

    third (squared) momentum. Defaults to 0.1.

  • eps (float) –

    epsilon. Defaults to 1e-8.

  • use_n_prev (bool) –

    whether to use previous gradient differences momentum.
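
A minimal usage sketch, following the note above: the step size is set through AEGD's own lr argument, so no tz.m.LR module is chained, and because the update uses the loss value a closure returning the loss should be passed to opt.step (values are illustrative).

opt = tz.Modular(
    model.parameters(),
    tz.m.AEGD(lr=0.1),
)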

Source code in torchzero/modules/adaptive/aegd.py
class AEGD(Transform):
    """AEGD (Adaptive gradient descent with energy) from https://arxiv.org/abs/2010.05109#page=10.26.

    Note:
        AEGD has a learning rate hyperparameter that can't really be removed from the update rule.
        To avoid compounding learning rate modifications, remove the ``tz.m.LR`` module if you had it.

    Args:
        eta (float, optional): step size. Defaults to 0.1.
        c (float, optional): c. Defaults to 1.
        beta3 (float, optional): third (squared) momentum. Defaults to 0.1.
        eps (float, optional): epsilon. Defaults to 1e-8.
        use_n_prev (bool, optional):
            whether to use previous gradient differences momentum.
    """
    def __init__(
        self,
        lr: float = 0.1,
        c: float = 1,
    ):
        defaults=dict(c=c,lr=lr)
        super().__init__(defaults, uses_loss=True)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        assert loss is not None
        tensors = TensorList(tensors)

        c,lr=unpack_dicts(settings, 'c','lr', cls=NumberList)
        r = unpack_states(states, tensors, 'r', init=lambda t: torch.full_like(t, float(loss+c[0])**0.5), cls=TensorList)

        update = aegd_(
            f=loss,
            g=tensors,
            r_=r,
            c=c,
            eta=lr,
        )

        return update

ASAM

Bases: torchzero.modules.adaptive.sam.SAM

Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52

SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.

This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.

Note

This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.

Parameters:

  • rho (float, default: 0.5 ) –

    Neighborhood size. Defaults to 0.5.

  • p (float, default: 2 ) –

    norm of the SAM objective. Defaults to 2.

Examples:

ASAM-Adam:

opt = tz.Modular(
    model.parameters(),
    tz.m.ASAM(),
    tz.m.Adam(),
    tz.m.LR(1e-2)
)
References

Kwon, J., Kim, J., Park, H., & Choi, I. K. (2021, July). Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning (pp. 5905-5914). PMLR. https://arxiv.org/abs/2102.11600

Source code in torchzero/modules/adaptive/sam.py
class ASAM(SAM):
    """Adaptive Sharpness-Aware Minimization from https://arxiv.org/pdf/2102.11600#page=6.52

    SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value.
    It performs two forward and backward passes per step.

    This implementation modifies the closure to return loss and calculate gradients
    of the SAM objective. All modules after this will use the modified objective.

    .. note::
        This module requires a closure passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients at two points on each step.

    Args:
        rho (float, optional): Neighborhood size. Defaults to 0.5.
        p (float, optional): norm of the SAM objective. Defaults to 2.

    Examples:
        ASAM-Adam:

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.ASAM(),
                tz.m.Adam(),
                tz.m.LR(1e-2)
            )

    References:
        Kwon, J., Kim, J., Park, H., & Choi, I. K. (2021, July). Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning (pp. 5905-5914). PMLR. https://arxiv.org/abs/2102.11600
    """
    def __init__(self, rho: float = 0.5, p: float = 2, eps=1e-10):
        super().__init__(rho=rho, p=p, eps=eps, asam=True)

AdaHessian

Bases: torchzero.core.module.Module

AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)

This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of random hessian-vector products.

Notes
  • In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply AdaHessian preconditioning to another module's output.

  • If you are using gradient estimators or reformulations, set hvp_method to "forward" or "central".

  • This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).

Parameters:

  • beta1 (float, default: 0.9 ) –

    first momentum. Defaults to 0.9.

  • beta2 (float, default: 0.999 ) –

    second momentum for squared hessian diagonal estimates. Defaults to 0.999.

  • averaging (bool, default: True ) –

    whether to enable block diagonal averaging over 1st dimension on parameters that have 2+ dimensions. This can be set per-parameter in param groups.

  • block_size (int, default: None ) –

    size of block in the block-diagonal averaging.

  • update_freq (int, default: 1 ) –

    frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 1.

  • eps (float, default: 1e-08 ) –

    division stability epsilon. Defaults to 1e-8.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • fd_h (float, default: 0.001 ) –

    finite difference step size if hvp_method is "forward" or "central". Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.

  • seed (int | None, default: None ) –

    seed for random vectors. Defaults to None.

  • inner (Chainable | None, default: None ) –

    Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to inner. 3. momentum and preconditioning are applied to the outputs of inner.

Examples:

Using AdaHessian:

opt = tz.Modular(
    model.parameters(),
    tz.m.AdaHessian(),
    tz.m.LR(0.1)
)

AdaHessian preconditioner can be applied to any other module by passing it to the inner argument. Turn off AdaHessian's first momentum to get just the preconditioning. Here is an example of applying AdaHessian preconditioning to nesterov momentum (tz.m.NAG):

opt = tz.Modular(
    model.parameters(),
    tz.m.AdaHessian(beta1=0, inner=tz.m.NAG(0.9)),
    tz.m.LR(0.1)
)

Source code in torchzero/modules/adaptive/adahessian.py
class AdaHessian(Module):
    """AdaHessian: An Adaptive Second Order Optimizer for Machine Learning (https://arxiv.org/abs/2006.00719)

    This is similar to Adam, but the second momentum is replaced by square root of an exponential moving average of random hessian-vector products.

    Notes:
        - In most cases AdaHessian should be the first module in the chain because it relies on autograd. Use the ``inner`` argument if you wish to apply AdaHessian preconditioning to another module's output.

        - If you are using gradient estimators or reformulations, set ``hvp_method`` to "forward" or "central".

        - This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a ``backward`` argument (refer to documentation).

    Args:
        beta1 (float, optional): first momentum. Defaults to 0.9.
        beta2 (float, optional): second momentum for squared hessian diagonal estimates. Defaults to 0.999.
        averaging (bool, optional):
            whether to enable block diagonal averaging over 1st dimension on parameters that have 2+ dimensions.
            This can be set per-parameter in param groups.
        block_size (int, optional):
            size of block in the block-diagonal averaging.
        update_freq (int, optional):
            frequency of updating hessian diagonal estimate via a hessian-vector product.
            This value can be increased to reduce computational cost. Defaults to 1.
        eps (float, optional):
            division stability epsilon. Defaults to 1e-8.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        fd_h (float, optional): finite difference step size if ``hvp_method`` is "forward" or "central". Defaults to 1e-3.
        n_samples (int, optional):
            number of hessian-vector products with random vectors to evaluate each time when updating
            the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
        seed (int | None, optional): seed for random vectors. Defaults to None.
        inner (Chainable | None, optional):
            Inner module. If this is specified, operations are performed in the following order.
            1. compute hessian diagonal estimate.
            2. pass inputs to ``inner``.
            3. momentum and preconditioning are applied to the outputs of ``inner``.

    ## Examples:

    Using AdaHessian:

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.AdaHessian(),
        tz.m.LR(0.1)
    )
    ```

    AdaHessian preconditioner can be applied to any other module by passing it to the ``inner`` argument.
    Turn off AdaHessian's first momentum to get just the preconditioning. Here is an example of applying
    AdaHessian preconditioning to nesterov momentum (``tz.m.NAG``):
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.AdaHessian(beta1=0, inner=tz.m.NAG(0.9)),
        tz.m.LR(0.1)
    )
    ```

    """
    def __init__(
        self,
        beta1: float = 0.9,
        beta2: float = 0.999,
        averaging: bool = True,
        block_size: int | None = None,
        update_freq: int = 1,
        eps: float = 1e-8,
        hessian_power: float = 1,
        hvp_method: Literal['autograd', 'forward', 'central'] = 'autograd',
        fd_h: float = 1e-3,
        n_samples = 1,
        seed: int | None = None,
        inner: Chainable | None = None
    ):
        defaults = dict(beta1=beta1, beta2=beta2, update_freq=update_freq, averaging=averaging, block_size=block_size, eps=eps, hessian_power=hessian_power, hvp_method=hvp_method, n_samples=n_samples, fd_h=fd_h, seed=seed)
        super().__init__(defaults)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def step(self, var):
        params = var.params
        settings = self.settings[params[0]]
        hvp_method = settings['hvp_method']
        fd_h = settings['fd_h']
        update_freq = settings['update_freq']
        n_samples = settings['n_samples']

        seed = settings['seed']
        generator = self.get_generator(params[0].device, seed)

        beta1, beta2, eps, averaging, block_size, hessian_power = self.get_settings(params,
            'beta1', 'beta2', 'eps', 'averaging', 'block_size', "hessian_power", cls=NumberList)

        exp_avg, D_exp_avg_sq = self.get_state(params, 'exp_avg', 'h_exp_avg', cls=TensorList)

        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        closure = var.closure
        assert closure is not None

        D = None
        if step % update_freq == 0:

            rgrad=None
            for i in range(n_samples):
                u = [_rademacher_like(p, generator=generator) for p in params]

                Hvp, rgrad = self.Hvp(u, at_x0=True, var=var, rgrad=rgrad, hvp_method=hvp_method,
                                     h=fd_h, normalize=True, retain_grad=i < n_samples-1)
                Hvp = tuple(Hvp)

                if D is None: D = Hvp
                else: torch._foreach_add_(D, Hvp)

            assert D is not None
            if n_samples > 1: torch._foreach_div_(D, n_samples)

            D = TensorList(D).zipmap_args(_block_average, block_size, averaging)

        update = var.get_update()
        if 'inner' in self.children:
            update = apply_transform(self.children['inner'], tensors=update, params=params, grads=var.grad, var=var)

        var.update = adahessian(
            tensors=TensorList(update),
            D=TensorList(D) if D is not None else None,
            exp_avg_=exp_avg,
            D_exp_avg_sq_=D_exp_avg_sq,
            beta1=beta1,
            beta2=beta2,
            update_freq=update_freq,
            eps=eps,
            hessian_power=hessian_power,
            step=step,
        )
        return var

Adagrad

Bases: torchzero.core.transform.Transform

Adagrad, divides by sum of past squares of gradients.

This implementation is identical to torch.optim.Adagrad.

Parameters:

  • lr_decay (float, default: 0 ) –

    learning rate decay. Defaults to 0.

  • initial_accumulator_value (float, default: 0 ) –

    initial value of the sum of squares of gradients. Defaults to 0.

  • eps (float, default: 1e-10 ) –

    division epsilon. Defaults to 1e-10.

  • alpha (float, default: 1 ) –

    step size. Defaults to 1.

  • pow (float, default: 2 ) –

    power for gradients and accumulator root. Defaults to 2.

  • use_sqrt (bool, default: True ) –

    whether to take the root of the accumulator. Defaults to True.

  • inner (Chainable | None, default: None ) –

    Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
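
A minimal usage sketch with an explicit learning-rate module (the value is illustrative):

opt = tz.Modular(
    model.parameters(),
    tz.m.Adagrad(),
    tz.m.LR(1e-2),
)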

Source code in torchzero/modules/adaptive/adagrad.py
class Adagrad(Transform):
    """Adagrad, divides by sum of past squares of gradients.

    This implementation is identical to ``torch.optim.Adagrad``.

    Args:
        lr_decay (float, optional): learning rate decay. Defaults to 0.
        initial_accumulator_value (float, optional): initial value of the sum of squares of gradients. Defaults to 0.
        eps (float, optional): division epsilon. Defaults to 1e-10.
        alpha (float, optional): step size. Defaults to 1.
        pow (float, optional): power for gradients and accumulator root. Defaults to 2.
        use_sqrt (bool, optional): whether to take the root of the accumulator. Defaults to True.
        inner (Chainable | None, optional): Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
    """
    def __init__(
        self,
        lr_decay: float = 0,
        initial_accumulator_value: float = 0,
        eps: float = 1e-10,
        alpha: float = 1,
        pow: float = 2,
        use_sqrt: bool = True,
        divide: bool=False,
        beta:float | None = None,
        decay: float | None = None,
        inner: Chainable | None = None,
    ):
        defaults = dict(alpha = alpha, lr_decay = lr_decay, initial_accumulator_value=initial_accumulator_value,
                        eps = eps, pow=pow, use_sqrt = use_sqrt, divide=divide, beta=beta, decay=decay)
        super().__init__(defaults=defaults, uses_grad=False)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        tensors = TensorList(tensors)
        step = self.global_state['step'] = self.global_state.get('step', 0) + 1

        lr_decay,alpha,eps = unpack_dicts(settings, 'lr_decay', 'alpha', 'eps', cls=NumberList)

        pow, use_sqrt, divide = itemgetter('pow', 'use_sqrt', 'divide')(settings[0])

        sq_sum = unpack_states(states, tensors, 'sq_sum', cls=TensorList)

        # initialize accumulator on 1st step
        if step == 1:
            sq_sum.set_(tensors.full_like([s['initial_accumulator_value'] for s in settings]))

        return adagrad_(
            tensors,
            sq_sum_=sq_sum,
            alpha=alpha,
            lr_decay=lr_decay,
            eps=eps,
            step=step,
            pow=pow,
            use_sqrt=use_sqrt,
            divide=divide,

            beta = self.defaults["beta"],
            decay = self.defaults["decay"],
            # inner args
            inner=self.children.get("inner", None),
            params=params,
            grads=grads,
        )

AdagradNorm

Bases: torchzero.core.transform.Transform

Adagrad-Norm, divides by sum of past means of squares of gradients.

Parameters:

  • lr_decay (float, default: 0 ) –

    learning rate decay. Defaults to 0.

  • initial_accumulator_value (float, default: 0 ) –

    initial value of the sum of squares of gradients. Defaults to 0.

  • eps (float, default: 1e-10 ) –

    division epsilon. Defaults to 1e-10.

  • alpha (float, default: 1 ) –

    step size. Defaults to 1.

  • pow (float, default: 2 ) –

    power for gradients and accumulator root. Defaults to 2.

  • use_sqrt (bool, default: True ) –

    whether to take the root of the accumulator. Defaults to True.

  • inner (Chainable | None, default: None ) –

    Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
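
Unlike tz.m.Adagrad, this variant keeps a single scalar accumulator (see the apply_tensors code below), so it rescales the whole update by one adaptive factor. A minimal usage sketch (learning rate is illustrative):

opt = tz.Modular(
    model.parameters(),
    tz.m.AdagradNorm(),
    tz.m.LR(1e-2),
)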

Source code in torchzero/modules/adaptive/adagrad.py
class AdagradNorm(Transform):
    """Adagrad-Norm, divides by sum of past means of squares of gradients.

    Args:
        lr_decay (float, optional): learning rate decay. Defaults to 0.
        initial_accumulator_value (float, optional): initial value of the sum of squares of gradients. Defaults to 0.
        eps (float, optional): division epsilon. Defaults to 1e-10.
        alpha (float, optional): step size. Defaults to 1.
        pow (float, optional): power for gradients and accumulator root. Defaults to 2.
        use_sqrt (bool, optional): whether to take the root of the accumulator. Defaults to True.
        inner (Chainable | None, optional): Inner modules that are applied after updating accumulator and before preconditioning. Defaults to None.
    """
    def __init__(
        self,
        lr_decay: float = 0,
        initial_accumulator_value: float = 0,
        eps: float = 1e-10,
        alpha: float = 1,
        pow: float = 2,
        use_sqrt: bool = True,
        divide: bool=False,
        beta:float | None = None,
        decay: float | None = None,
        inner: Chainable | None = None,
    ):
        defaults = dict(alpha = alpha, lr_decay = lr_decay, initial_accumulator_value=initial_accumulator_value,
                        eps = eps, pow=pow, use_sqrt = use_sqrt, divide=divide, beta=beta, decay=decay)
        super().__init__(defaults=defaults, uses_grad=False)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        tensors = TensorList(tensors)
        step = self.global_state['step'] = self.global_state.get('step', 0) + 1
        lr_decay,alpha,eps = unpack_dicts(settings, 'lr_decay', 'alpha', 'eps', cls=NumberList)

        use_sqrt, divide, initial_accumulator_value = itemgetter('use_sqrt', 'divide', "initial_accumulator_value")(settings[0])

        accumulator = self.global_state.get("accumulator", initial_accumulator_value)

        d, self.global_state["accumulator"] = adagrad_norm_(
            tensors,
            accumulator=accumulator,
            alpha=alpha,
            lr_decay=lr_decay,
            eps=eps,
            step=step,
            use_sqrt=use_sqrt,
            divide=divide,

            beta = self.defaults["beta"],
            decay = self.defaults["decay"],
            # inner args
            inner=self.children.get("inner", None),
            params=params,
            grads=grads,
        )

        return d

Adam

Bases: torchzero.core.transform.Transform

Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.

This implementation is identical to torch.optim.Adam.

Parameters:

  • beta1 (float, default: 0.9 ) –

    momentum. Defaults to 0.9.

  • beta2 (float, default: 0.999 ) –

    second momentum. Defaults to 0.999.

  • eps (float, default: 1e-08 ) –

    epsilon. Defaults to 1e-8.

  • alpha (float, default: 1.0 ) –

    learning rate. Defaults to 1.

  • amsgrad (bool, default: False ) –

    Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.

  • pow (float, default: 2 ) –

    power used in second momentum power and root. Defaults to 2.

  • debiased (bool, default: True ) –

    whether to apply debiasing to momentums based on current step. Defaults to True.
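
A minimal usage sketch; amsgrad=True switches the denominator to the running maximum of the squared-gradient EMA, and the learning rate is illustrative:

opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(amsgrad=True),
    tz.m.LR(1e-3),
)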

Source code in torchzero/modules/adaptive/adam.py
class Adam(Transform):
    """Adam. Divides gradient EMA by EMA of gradient squares with debiased step size.

    This implementation is identical to :code:`torch.optim.Adam`.

    Args:
        beta1 (float, optional): momentum. Defaults to 0.9.
        beta2 (float, optional): second momentum. Defaults to 0.999.
        eps (float, optional): epsilon. Defaults to 1e-8.
        alpha (float, optional): learning rate. Defaults to 1.
        amsgrad (bool, optional): Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
        pow (float, optional): power used in second momentum power and root. Defaults to 2.
        debiased (bool, optional): whether to apply debiasing to momentums based on current step. Defaults to True.
    """
    def __init__(
        self,
        beta1: float = 0.9,
        beta2: float = 0.999,
        eps: float = 1e-8,
        amsgrad: bool = False,
        alpha: float = 1.,
        pow: float = 2,
        debiased: bool = True,
        inner: Chainable | None = None
    ):
        defaults=dict(beta1=beta1,beta2=beta2,eps=eps,alpha=alpha,amsgrad=amsgrad,pow=pow,debiased=debiased)
        super().__init__(defaults, uses_grad=False)

        if inner is not None: self.set_child('inner', inner)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        step = self.global_state['step'] = self.global_state.get('step', 0) + 1

        beta1,beta2,eps,alpha=unpack_dicts(settings, 'beta1','beta2','eps','alpha', cls=NumberList)
        amsgrad,pow,debiased = itemgetter('amsgrad','pow','debiased')(settings[0])

        if amsgrad:
            exp_avg, exp_avg_sq, max_exp_avg_sq = unpack_states(states, tensors, 'exp_avg', 'exp_avg_sq', 'max_exp_avg_sq', cls=TensorList)
        else:
            exp_avg, exp_avg_sq = unpack_states(states, tensors, 'exp_avg', 'exp_avg_sq', cls=TensorList)
            max_exp_avg_sq = None


        return adam_(
            tensors=TensorList(tensors),
            exp_avg_=exp_avg,
            exp_avg_sq_=exp_avg_sq,
            alpha=alpha,
            beta1=beta1,
            beta2=beta2,
            eps=eps,
            step=step,
            pow=pow,
            debiased=debiased,
            max_exp_avg_sq_=max_exp_avg_sq,

            # inner args
            inner=self.children.get("inner", None),
            params=params,
            grads=grads,

        )

Adan

Bases: torchzero.core.transform.Transform

Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677

Parameters:

  • beta1 (float, default: 0.98 ) –

    momentum. Defaults to 0.98.

  • beta2 (float, default: 0.92 ) –

    momentum for gradient differences. Defaults to 0.92.

  • beta3 (float, default: 0.99 ) –

    third (squared) momentum. Defaults to 0.99.

  • eps (float, default: 1e-08 ) –

    epsilon. Defaults to 1e-8.

  • use_n_prev (bool) –

    whether to use previous gradient differences momentum.

Example:

opt = tz.Modular(
    model.parameters(),
    tz.m.Adan(),
    tz.m.LR(1e-3),
)

Reference

Xie, X., Zhou, P., Li, H., Lin, Z., & Yan, S. (2024). Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/2208.06677

Source code in torchzero/modules/adaptive/adan.py
class Adan(Transform):
    """Adaptive Nesterov Momentum Algorithm from https://arxiv.org/abs/2208.06677

    Args:
        beta1 (float, optional): momentum. Defaults to 0.98.
        beta2 (float, optional): momentum for gradient differences. Defaults to 0.92.
        beta3 (float, optional): third (squared) momentum. Defaults to 0.99.
        eps (float, optional): epsilon. Defaults to 1e-8.
        use_n_prev (bool, optional):
            whether to use previous gradient differences momentum.

    Example:
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.Adan(),
        tz.m.LR(1e-3),
    )
    ```
    Reference:
        Xie, X., Zhou, P., Li, H., Lin, Z., & Yan, S. (2024). Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/2208.06677
    """
    def __init__(
        self,
        beta1: float = 0.98,
        beta2: float = 0.92,
        beta3: float = 0.99,
        eps: float = 1e-8,
    ):
        defaults=dict(beta1=beta1,beta2=beta2,beta3=beta3,eps=eps)
        super().__init__(defaults, uses_grad=False)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        tensors = TensorList(tensors)
        step = self.global_state['step'] = self.global_state.get('step', 0) + 1

        beta1,beta2,beta3,eps=unpack_dicts(settings, 'beta1','beta2','beta3','eps', cls=NumberList)
        g_prev, m, v, n = unpack_states(states, tensors, 'g_prev','m','v','n', cls=TensorList)

        update = adan_(
            g=tensors,
            g_prev_=g_prev,
            m_=m,
            v_=v,
            n_=n,
            beta1=beta1,
            beta2=beta2,
            beta3=beta3,
            eps=eps,
            step=step,
        )

        return update

AdaptiveHeavyBall

Bases: torchzero.core.transform.Transform

Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.

This is related to conjugate gradient methods; it can work very well on non-stochastic convex objectives, but won't work on stochastic ones.

Note

The step size is determined by the algorithm, so learning rate modules shouldn't be used.

Parameters:

  • f_star (float, default: 0 ) –

    (estimated) minimal possible value of the objective function (lowest possible loss). Defaults to 0.
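
A minimal usage sketch; the step size is chosen by the algorithm, so no tz.m.LR module is chained, and since the loss value is used a closure returning the loss should be passed to opt.step:

opt = tz.Modular(
    model.parameters(),
    tz.m.AdaptiveHeavyBall(f_star=0),
)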

Source code in torchzero/modules/adaptive/adaptive_heavyball.py
class AdaptiveHeavyBall(Transform):
    """Adaptive heavy ball from https://hal.science/hal-04832983v1/file/OJMO_2024__5__A7_0.pdf.

    This is related to conjugate gradient methods; it can work very well on non-stochastic convex objectives, but won't work on stochastic ones.

    note:
        The step size is determined by the algorithm, so learning rate modules shouldn't be used.

    Args:
        f_star (float, optional):
            (estimated) minimal possible value of the objective function (lowest possible loss). Defaults to 0.
    """
    def __init__(self, f_star: float = 0):
        defaults = dict(f_star=f_star)
        super().__init__(defaults, uses_grad=False, uses_loss=True)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        assert loss is not None
        tensors = TensorList(tensors)
        f_star = self.defaults['f_star']

        f_prev = self.global_state.get('f_prev', None)
        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', init=[params,tensors], cls=TensorList)

        if f_prev is None:
            self.global_state['f_prev'] = loss
            h = 2*(loss - f_star) / tensors.dot(tensors)
            return h * tensors

        update = adaptive_heavy_ball(f=loss, f_star=f_star, f_prev=f_prev, g=tensors, g_prev=g_prev, p=TensorList(params), p_prev=p_prev)

        self.global_state['f_prev'] = loss
        p_prev.copy_(params)
        g_prev.copy_(tensors)
        return update

BacktrackOnSignChange

Bases: torchzero.core.transform.Transform

Negates or undoes the update for parameters where the gradient or update sign changes.

This is part of RProp update rule.

Parameters:

  • use_grad (bool, default: False ) –

    if True, tracks sign changes of the gradient, otherwise tracks sign changes of the update. Defaults to False.

  • backtrack (bool, default: True ) –

    if True, undoes the update when sign changes, otherwise negates it. Defaults to True.
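
An illustrative chain (the placement after the update rule and the hyperparameters are assumptions, not a prescribed recipe): with the defaults, the module undoes the update for parameters whose update sign flipped since the previous step.

opt = tz.Modular(
    model.parameters(),
    tz.m.Adam(),
    tz.m.BacktrackOnSignChange(),
    tz.m.LR(1e-2),
)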

Source code in torchzero/modules/adaptive/rprop.py
class BacktrackOnSignChange(Transform):
    """Negates or undoes update for parameters where where gradient or update sign changes.

    This is part of RProp update rule.

    Args:
        use_grad (bool, optional):
            if True, tracks sign changes of the gradient,
            otherwise tracks sign changes of the update. Defaults to False.
        backtrack (bool, optional):
            if True, undoes the update when sign changes, otherwise negates it.
            Defaults to True.

    """
    def __init__(self, use_grad = False, backtrack = True, target: Target = 'update'):
        defaults = dict(use_grad=use_grad, backtrack=backtrack, target=target)
        super().__init__(defaults, uses_grad=use_grad)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        tensors = as_tensorlist(tensors)
        use_grad = settings[0]['use_grad']
        backtrack = settings[0]['backtrack']

        if use_grad: cur = as_tensorlist(grads)
        else: cur = tensors

        tensors = backtrack_on_sign_change_(
            tensors_ = tensors,
            cur = cur,
            prev_ = unpack_states(states, tensors, 'prev', cls=TensorList),
            backtrack = backtrack,
            step = step,
        )

        return tensors

DualNormCorrection

Bases: torchzero.core.transform.TensorwiseTransform

Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon). Orthogonalize already has this built in with the dual_norm_correction setting.
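
Since Orthogonalize has the same correction built in, a hedged sketch is to enable it there rather than chaining this module separately (the keyword name follows the dual_norm_correction setting mentioned above and is an assumption; learning rate is illustrative):

opt = tz.Modular(
    model.parameters(),
    tz.m.Orthogonalize(dual_norm_correction=True),
    tz.m.LR(1e-2),
)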

Source code in torchzero/modules/adaptive/muon.py
class DualNormCorrection(TensorwiseTransform):
    """Dual norm correction for dualizer based optimizers (https://github.com/leloykun/adaptive-muon).
    Orthogonalize already has this built in with the `dual_norm_correction` setting."""
    def __init__(self, target: Target='update'):
        super().__init__({}, uses_grad=True, target=target)

    def apply_tensor(self, tensor, param, grad, loss, state, setting):
        assert grad is not None
        if (tensor.ndim >= 2) and (tensor.size(0) > 1) and (tensor.size(1) > 1):
            return _dual_norm_correction(tensor, grad, batch_first=False)
        return tensor

ESGD

Bases: torchzero.core.module.Module

Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)

This is similar to Adagrad, but it accumulates squared randomized hessian diagonal estimates instead of squared gradients.

Notes
  • In most cases ESGD should be the first module in the chain because it relies on autograd. Use the inner argument if you wish to apply ESGD preconditioning to another module's output.

  • If you are using gradient estimators or reformulations, set hvp_method to "forward" or "central".

  • This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument (refer to documentation).

Parameters:

  • damping (float, default: 0.0001 ) –

    added to denominator for stability. Defaults to 1e-4.

  • update_freq (int, default: 20 ) –

    frequency of updating hessian diagonal estimate via a hessian-vector product. This value can be increased to reduce computational cost. Defaults to 20.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • fd_h (float, default: 0.001 ) –

    finite difference step size if hvp_method is "forward" or "central". Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.

  • seed (int | None, default: None ) –

    seed for random vectors. Defaults to None.

  • inner (Chainable | None, default: None ) –

    Inner module. If this is specified, operations are performed in the following order. 1. compute hessian diagonal estimate. 2. pass inputs to inner. 3. momentum and preconditioning are applied to the outputs of inner.

Examples:

Using ESGD:

opt = tz.Modular(
    model.parameters(),
    tz.m.ESGD(),
    tz.m.LR(0.1)
)

ESGD preconditioner can be applied to any other module by passing it to the inner argument. Here is an example of applying ESGD preconditioning to nesterov momentum (tz.m.NAG):

opt = tz.Modular(
    model.parameters(),
    tz.m.ESGD(inner=tz.m.NAG(0.9)),
    tz.m.LR(0.1)
)
Source code in torchzero/modules/adaptive/esgd.py
class ESGD(Module):
    """Equilibrated Gradient Descent (https://arxiv.org/abs/1502.04390)

    This is similar to Adagrad, but it accumulates squared randomized hessian diagonal estimates instead of squared gradients.

    .. note::
        In most cases ESGD should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply ESGD preconditioning to another module's output.

    .. note::
        If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".

    .. note::
        This module requires a closure passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients for calculating HVPs.
        The closure must accept a ``backward`` argument (refer to documentation).

    Args:
        damping (float, optional): added to denominator for stability. Defaults to 1e-4.
        update_freq (int, optional):
            frequency of updating hessian diagonal estimate via a hessian-vector product.
            This value can be increased to reduce computational cost. Defaults to 20.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        fd_h (float, optional): finite difference step size if :code:`hvp_method` is "forward" or "central". Defaults to 1e-3.
        n_samples (int, optional):
            number of hessian-vector products with random vectors to evaluate each time when updating
            the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
        seed (int | None, optional): seed for random vectors. Defaults to None.
        inner (Chainable | None, optional):
            Inner module. If this is specified, operations are performed in the following order.
            1. compute hessian diagonal estimate.
            2. pass inputs to :code:`inner`.
            3. momentum and preconditioning are applied to the outputs of :code:`inner`.

    Examples:
        Using ESGD:

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.ESGD(),
                tz.m.LR(0.1)
            )

        ESGD preconditioner can be applied to any other module by passing it to the :code:`inner` argument. Here is an example of applying
        ESGD preconditioning to nesterov momentum (:code:`tz.m.NAG`):

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.ESGD(inner=tz.m.NAG(0.9)),
                tz.m.LR(0.1)
            )

    """
    def __init__(
        self,
        damping: float = 1e-4,
        update_freq: int = 20,
        hvp_method: Literal['autograd', 'forward', 'central'] = 'autograd',
        fd_h: float = 1e-3,
        n_samples = 1,
        seed: int | None = None,
        inner: Chainable | None = None
    ):
        defaults = dict(damping=damping, update_freq=update_freq, hvp_method=hvp_method, n_samples=n_samples, fd_h=fd_h, seed=seed)
        super().__init__(defaults)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def step(self, var):
        params = var.params
        settings = self.settings[params[0]]
        hvp_method = settings['hvp_method']
        fd_h = settings['fd_h']
        update_freq = settings['update_freq']
        n_samples = settings['n_samples']

        seed = settings['seed']
        generator = None
        if seed is not None:
            if 'generator' not in self.global_state:
                self.global_state['generator'] = torch.Generator(params[0].device).manual_seed(seed)
            generator = self.global_state['generator']

        damping = self.get_settings(params, 'damping', cls=NumberList)
        D_sq_acc = self.get_state(params, 'D_sq_acc', cls=TensorList)
        i = self.global_state.get('i', 0)

        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        closure = var.closure
        assert closure is not None

        D = None
        if step % update_freq == 0:

            rgrad=None
            for j in range(n_samples):
                u = [torch.randn(p.size(), generator=generator, device=p.device, dtype=p.dtype) for p in params]

                Hvp, rgrad = self.Hvp(u, at_x0=True, var=var, rgrad=rgrad, hvp_method=hvp_method,
                                     h=fd_h, normalize=True, retain_grad=j < n_samples-1)

                if D is None: D = Hvp
                else: torch._foreach_add_(D, Hvp)

            assert D is not None
            if n_samples > 1: torch._foreach_div_(D, n_samples)

            D = TensorList(D)

        update = var.get_update()
        if 'inner' in self.children:
            update = apply_transform(self.children['inner'], tensors=update, params=params, grads=var.grad, var=var)

        var.update, self.global_state['i'] = esgd_(
            tensors_=TensorList(update),
            D=TensorList(D) if D is not None else None,
            D_sq_acc_=D_sq_acc,
            damping=damping,
            update_freq=update_freq,
            step=step,
            i=i,
        )
        return var

FullMatrixAdagrad

Bases: torchzero.core.transform.TensorwiseTransform

Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).

Note

A more memory-efficient version equivalent to full matrix Adagrad on last n gradients is implemented in tz.m.LMAdagrad.

Parameters:

  • beta (float | None, default: None ) –

    momentum for gradient outer product accumulators. if None, uses sum. Defaults to None.

  • decay (float | None, default: None ) –

    decay for gradient outer product accumulators. Defaults to None.

  • sqrt (bool, default: True ) –

    whether to take the square root of the accumulator. Defaults to True.

  • concat_params (bool, default: True ) –

    if False, each parameter will have its own accumulator. Defaults to True.

  • precond_freq (int, default: 1 ) –

    frequency of updating the inverse square root of the accumulator. Defaults to 1.

  • init (Literal[str], default: 'identity' ) –

    how to initialize the accumulator.

    • "identity" - with identity matrix (default).
    • "zeros" - with zero matrix.
    • "ones" - with matrix of ones.
    • "GGT" - with the first outer product.

  • divide (bool, default: False ) –

    whether to divide the accumulator by number of gradients in it. Defaults to False.

  • inner (Chainable | None, default: None ) –

    inner modules to apply preconditioning to. Defaults to None.

Examples:

Plain full-matrix adagrad

opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(),
    tz.m.LR(1e-2),
)

Full-matrix RMSprop

opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(beta=0.99),
    tz.m.LR(1e-2),
)

Full-matrix Adam

opt = tz.Modular(
    model.parameters(),
    tz.m.FullMatrixAdagrad(beta=0.999, inner=tz.m.EMA(0.9)),
    tz.m.Debias(0.9, 0.999),
    tz.m.LR(1e-2),
)

Source code in torchzero/modules/adaptive/adagrad.py
class FullMatrixAdagrad(TensorwiseTransform):
    """Full-matrix version of Adagrad, can be customized to make RMSprop or Adam (see examples).

    Note:
        A more memory-efficient version equivalent to full matrix Adagrad on last n gradients is implemented in ``tz.m.LMAdagrad``.

    Args:
        beta (float | None, optional): momentum for gradient outer product accumulators. if None, uses sum. Defaults to None.
        decay (float | None, optional): decay for gradient outer product accumulators. Defaults to None.
        sqrt (bool, optional): whether to take the square root of the accumulator. Defaults to True.
        concat_params (bool, optional): if False, each parameter will have its own accumulator. Defaults to True.
        precond_freq (int, optional): frequency of updating the inverse square root of the accumulator. Defaults to 1.
        init (Literal[str], optional):
            how to initialize the accumulator.
            - "identity" - with identity matrix (default).
            - "zeros" - with zero matrix.
            - "ones" - with matrix of ones.
             -"GGT" - with the first outer product
        divide (bool, optional): whether to divide the accumulator by number of gradients in it. Defaults to False.
        inner (Chainable | None, optional): inner modules to apply preconditioning to. Defaults to None.

    ## Examples:

    Plain full-matrix adagrad
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.FullMatrixAdagrad(),
        tz.m.LR(1e-2),
    )
    ```

    Full-matrix RMSprop
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.FullMatrixAdagrad(beta=0.99),
        tz.m.LR(1e-2),
    )
    ```

    Full-matrix Adam
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.FullMatrixAdagrad(beta=0.999, inner=tz.m.EMA(0.9)),
        tz.m.Debias(0.9, 0.999),
        tz.m.LR(1e-2),
    )
    ```
    """
    def __init__(
        self,
        beta: float | None = None,
        decay: float | None = None,
        sqrt: bool = True,
        concat_params=True,
        precond_freq: int = 1,
        init: Literal["identity", "zeros", "ones", "GGT"] = "identity",
        reg: float = 1e-12,
        divide: bool = False,
        inner: Chainable | None = None,
    ):
        defaults = dict(beta=beta, decay=decay, sqrt=sqrt, precond_freq=precond_freq, init=init, divide=divide, reg=reg)
        super().__init__(defaults, uses_grad=False, concat_params=concat_params, inner=inner,)

    @torch.no_grad
    def update_tensor(self, tensor, param, grad, loss, state, setting):
        G = tensor.ravel()
        GG = torch.outer(G, G)
        decay = setting['decay']
        beta = setting['beta']
        init = setting['init']

        if 'GG' not in state:
            if init == 'identity': state['GG'] = torch.eye(GG.size(0), device=GG.device, dtype=GG.dtype)
            elif init == 'zeros': state['GG'] =  torch.zeros_like(GG)
            elif init == 'ones': state['GG'] = torch.ones_like(GG)
            elif init == 'GGT': state['GG'] = GG.clone()
            else: raise ValueError(init)
        if decay is not None: state['GG'].mul_(decay)

        if beta is not None: state['GG'].lerp_(GG, 1-beta)
        else: state['GG'].add_(GG)
        state['i'] = state.get('i', 0) + 1 # number of GGTs in sum

    @torch.no_grad
    def apply_tensor(self, tensor, param, grad, loss, state, setting):
        step = state.get('step', 0)
        state['step'] = step + 1

        GG: torch.Tensor = state['GG']
        sqrt = setting['sqrt']
        divide = setting['divide']
        precond_freq = setting['precond_freq']
        reg = setting['reg']

        if divide: GG = GG/state.get('i', 1)

        if reg != 0:
            GG = GG + torch.eye(GG.size(0), device=GG.device, dtype=GG.dtype).mul_(reg)

        if tensor.numel() == 1:
            GG = GG.squeeze()
            if sqrt: return tensor / GG.sqrt()
            return tensor / GG

        try:
            if sqrt:
                if "B" not in state or step % precond_freq == 0:
                    B = state["B"] = matrix_power_eigh(GG, -1/2)
                else:
                    B = state["B"]

            else: return torch.linalg.solve(GG, tensor.ravel()).view_as(tensor) # pylint:disable = not-callable

        except torch.linalg.LinAlgError:
            # fallback to diagonal AdaGrad
            denom = GG.diagonal()
            if sqrt: denom = denom.sqrt()
            return tensor.div_(denom + max(reg, 1e-12))

        return (B @ tensor.ravel()).view_as(tensor)

LMAdagrad

Bases: torchzero.core.transform.TensorwiseTransform

Limited-memory full matrix Adagrad.

The update rule is to stack recent gradients into M, compute U, S <- SVD(M), then calculate the update as U S^-1 Uᵀg. It uses an eigendecomposition of MᵀM to get U and S^2, because that is faster when V is not needed.

This is equivalent to full-matrix Adagrad on recent gradients.
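
A small self-contained sketch of the equivalence described above (dimensions and tensors are illustrative): the left singular vectors and squared singular values of the gradient matrix M can be recovered from an eigendecomposition of the small MᵀM matrix, so both routes produce the same U S^-1 Uᵀ g update.

import torch

d, k = 50, 5                        # parameter dimension, number of stored gradients
M = torch.randn(d, k)               # columns are recent gradients
g = torch.randn(d)

# route 1: thin SVD of M
U, S, _ = torch.linalg.svd(M, full_matrices=False)
upd_svd = U @ ((U.T @ g) / S)

# route 2: eigendecomposition of the k x k matrix MᵀM
S2, V = torch.linalg.eigh(M.T @ M)  # eigenvalues are the squared singular values
U2 = (M @ V) / S2.sqrt()            # recover left singular vectors column-wise
upd_eig = U2 @ ((U2.T @ g) / S2.sqrt())

print(torch.allclose(upd_svd, upd_eig, atol=1e-4))  # True up to numerical error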

Parameters:

  • history_size (int, default: 100 ) –

    number of past gradients to store. Defaults to 100.

  • update_freq (int, default: 1 ) –

    frequency of updating the preconditioner (U and S). Defaults to 1.

  • damping (float, default: 0.0001 ) –

    damping value. Defaults to 1e-4.

  • rdamping (float, default: 0 ) –

    value of damping relative to singular values norm. Defaults to 0.

  • order (int, default: 1 ) –

    order=2 means gradient differences are used in place of gradients. Higher order uses higher order differences. Defaults to 1.

  • true_damping (bool, default: True ) –

    If True, damping is added to squared singular values to mimic Adagrad. Defaults to True.

  • U_beta (float | None, default: None ) –

    momentum for U (too unstable, don't use). Defaults to None.

  • L_beta (float | None, default: None ) –

    momentum for L (too unstable, don't use). Defaults to None.

  • interval (int, default: 1 ) –

    Interval between gradients that are added to history (2 means every second gradient is used). Defaults to 1.

  • concat_params (bool, default: True ) –

    if True, treats all parameters as a single vector, meaning it will also whiten across parameters. Defaults to True.

  • inner (Chainable | None, default: None ) –

    preconditioner will be applied to output of this module. Defaults to None.

Examples:

Limited-memory Adagrad

optimizer = tz.Modular(
    model.parameters(),
    tz.m.LMAdagrad(),
    tz.m.LR(0.1)
)

Adam with L-Adagrad preconditioner (the second debiasing beta of 0.999 is chosen arbitrarily)

optimizer = tz.Modular(
    model.parameters(),
    tz.m.LMAdagrad(inner=tz.m.EMA()),
    tz.m.Debias(0.9, 0.999),
    tz.m.LR(0.01)
)

Stable Adam with L-Adagrad preconditioner (this is what I would recommend)

optimizer = tz.Modular(
    model.parameters(),
    tz.m.LMAdagrad(inner=tz.m.EMA()),
    tz.m.Debias(0.9, 0.999),
    tz.m.ClipNormByEMA(max_ema_growth=1.2),
    tz.m.LR(0.01)
)
Reference: Agarwal N. et al. Efficient full-matrix adaptive regularization. International Conference on Machine Learning. PMLR, 2019, pp. 102-110.

Source code in torchzero/modules/adaptive/lmadagrad.py
class LMAdagrad(TensorwiseTransform):
    """
    Limited-memory full matrix Adagrad.

    The update rule is to stack recent gradients into M, compute U, S <- SVD(M), then calculate update as U S^-1 Uᵀg.
    But it uses eigendecomposition on MᵀM to get U and S^2 because that is faster when you don't need V.

    This is equivalent to full-matrix Adagrad on recent gradients.

    Args:
        history_size (int, optional): number of past gradients to store. Defaults to 100.
        update_freq (int, optional): frequency of updating the preconditioner (U and S). Defaults to 1.
        damping (float, optional): damping value. Defaults to 1e-4.
        rdamping (float, optional): value of damping relative to singular values norm. Defaults to 0.
        order (int, optional):
            order=2 means gradient differences are used in place of gradients. Higher order uses higher order differences. Defaults to 1.
        true_damping (bool, optional):
            If True, damping is added to squared singular values to mimic Adagrad. Defaults to True.
        U_beta (float | None, optional): momentum for U (too unstable, don't use). Defaults to None.
        L_beta (float | None, optional): momentum for L (too unstable, don't use). Defaults to None.
        interval (int, optional): Interval between gradients that are added to history (2 means every second gradient is used). Defaults to 1.
        concat_params (bool, optional): if True, treats all parameters as a single vector, meaning it will also whiten across parameters. Defaults to True.
        inner (Chainable | None, optional): preconditioner will be applied to output of this module. Defaults to None.

    ## Examples:

    Limited-memory Adagrad

    ```python
    optimizer = tz.Modular(
        model.parameters(),
        tz.m.LMAdagrad(),
        tz.m.LR(0.1)
    )
    ```
    Adam with L-Adagrad preconditioner (the second debiasing beta of 0.999 is chosen arbitrarily)

    ```python
    optimizer = tz.Modular(
        model.parameters(),
        tz.m.LMAdagrad(inner=tz.m.EMA()),
        tz.m.Debias(0.9, 0.999),
        tz.m.LR(0.01)
    )
    ```

    Stable Adam with L-Adagrad preconditioner (this is what I would recommend)

    ```python
    optimizer = tz.Modular(
        model.parameters(),
        tz.m.LMAdagrad(inner=tz.m.EMA()),
        tz.m.Debias(0.9, 0.999),
        tz.m.ClipNormByEMA(max_ema_growth=1.2),
        tz.m.LR(0.01)
    )
    ```
    Reference:
        Agarwal N. et al. Efficient full-matrix adaptive regularization. International Conference on Machine Learning. PMLR, 2019, pp. 102-110.
    """

    def __init__(
        self,
        history_size: int = 100,
        update_freq: int = 1,
        damping: float = 1e-4,
        rdamping: float = 0,
        order: int = 1,
        true_damping: bool = True,
        U_beta: float | None = None,
        L_beta: float | None = None,
        interval: int = 1,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        # history is still updated each step so Precondition's update_freq has different meaning
        defaults = dict(history_size=history_size, update_freq=update_freq, damping=damping, rdamping=rdamping, true_damping=true_damping, order=order, U_beta=U_beta, L_beta=L_beta)
        super().__init__(defaults, uses_grad=False, concat_params=concat_params, inner=inner, update_freq=interval)

    @torch.no_grad
    def update_tensor(self, tensor, param, grad, loss, state, setting):
        order = setting['order']
        history_size = setting['history_size']
        update_freq = setting['update_freq']
        damping = setting['damping']
        rdamping = setting['rdamping']
        U_beta = setting['U_beta']
        L_beta = setting['L_beta']

        if 'history' not in state: state['history'] = deque(maxlen=history_size)
        history = state['history']

        if order == 1:
            t = tensor.clone().view(-1)
            history.append(t)
        else:

            # if order=2, history is of gradient differences, order 3 is differences between differences, etc
            # scaled by parameter differences
            cur_p = param.clone()
            cur_g = tensor.clone()
            eps = torch.finfo(cur_p.dtype).tiny * 2
            for i in range(1, order):
                if f'prev_g_{i}' not in state:
                    state[f'prev_p_{i}'] = cur_p
                    state[f'prev_g_{i}'] = cur_g
                    break

                s = cur_p - state[f'prev_p_{i}']
                y = cur_g - state[f'prev_g_{i}']
                state[f'prev_p_{i}'] = cur_p
                state[f'prev_g_{i}'] = cur_g
                cur_p = s
                cur_g = y

                if i == order - 1:
                    cur_g = cur_g / torch.linalg.norm(cur_p).clip(min=eps) # pylint:disable=not-callable
                    history.append(cur_g.view(-1))

        step = state.get('step', 0)
        if step % update_freq == 0 and len(history) != 0:
            U, L = lm_adagrad_update(history, damping=damping, rdamping=rdamping)
            maybe_lerp_(state, U_beta, 'U', U)
            maybe_lerp_(state, L_beta, 'L', L)

        if len(history) != 0:
            state['step'] = step + 1 # do not increment if no history (gathering s_ks and y_ks)

    @torch.no_grad
    def apply_tensor(self, tensor, param, grad, loss, state, setting):
        U = state.get('U', None)
        if U is None:
            # make a conservative step to avoid issues due to different GD scaling
            return tensor.clip_(-0.1, 0.1) # pyright:ignore[reportArgumentType]

        L = state['L']
        update = lm_adagrad_apply(tensor.view(-1), U, L).view_as(tensor)

        return update

Lion

Bases: torchzero.core.transform.Transform

Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.

Parameters:

  • beta1 (float, default: 0.9 ) –

    dampening for momentum. Defaults to 0.9.

  • beta2 (float, default: 0.99 ) –

    momentum factor. Defaults to 0.99.

Source code in torchzero/modules/adaptive/lion.py
class Lion(Transform):
    """Lion (EvoLved Sign Momentum) optimizer from https://arxiv.org/abs/2302.06675.

    Args:
        beta1 (float, optional): dampening for momentum. Defaults to 0.9.
        beta2 (float, optional): momentum factor. Defaults to 0.99.
    """

    def __init__(self, beta1: float = 0.9, beta2: float = 0.99):
        defaults = dict(beta1=beta1, beta2=beta2)
        super().__init__(defaults, uses_grad=False)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        beta1, beta2 = unpack_dicts(settings, 'beta1', 'beta2', cls=NumberList)
        exp_avg = unpack_states(states, tensors, 'ema', cls=TensorList)
        return lion_(TensorList(tensors),exp_avg,beta1,beta2)
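
Below is a minimal usage sketch (not part of the original docs; it assumes torchzero is imported as tz and follows the same tz.Modular / tz.m.LR chaining pattern used by the other examples on this page):

```python
import torch.nn as nn
import torchzero as tz  # assumed import alias, matching the examples on this page

model = nn.Linear(20, 10)

# Lion produces a sign-based update, so the actual step size comes from tz.m.LR
optimizer = tz.Modular(
    model.parameters(),
    tz.m.Lion(beta1=0.9, beta2=0.99),
    tz.m.LR(1e-4),
)
```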

MARSCorrection

Bases: torchzero.core.transform.Transform

MARS variance reduction correction.

Place any other momentum-based optimizer after this, and make sure the beta parameter matches the momentum used in that optimizer.

Parameters:

  • beta (float, default: 0.9 ) –

    use the same beta as you use in the momentum module. Defaults to 0.9.

  • scaling (float, default: 0.025 ) –

    controls the scale of gradient correction in variance reduction. Defaults to 0.025.

  • max_norm (float, default: 1 ) –

    clips norm of corrected gradients, None to disable. Defaults to 1.

Examples:

Mars-AdamW

optimizer = tz.Modular(
    model.parameters(),
    tz.m.MARSCorrection(beta=0.95),
    tz.m.Adam(beta1=0.95, beta2=0.99),
    tz.m.WeightDecay(1e-3),
    tz.m.LR(0.1)
)

Mars-Lion

optimizer = tz.Modular(
    model.parameters(),
    tz.m.MARSCorrection(beta=0.9),
    tz.m.Lion(beta1=0.9),
    tz.m.LR(0.1)
)

Source code in torchzero/modules/adaptive/mars.py
class MARSCorrection(Transform):
    """MARS variance reduction correction.

    Place any other momentum-based optimizer after this,
    and make sure the ``beta`` parameter matches the momentum used in that optimizer.

    Args:
        beta (float, optional): use the same beta as you use in the momentum module. Defaults to 0.9.
        scaling (float, optional): controls the scale of gradient correction in variance reduction. Defaults to 0.025.
        max_norm (float, optional): clips norm of corrected gradients, None to disable. Defaults to 1.

    ## Examples:

    Mars-AdamW
    ```python
    optimizer = tz.Modular(
        model.parameters(),
        tz.m.MARSCorrection(beta=0.95),
        tz.m.Adam(beta1=0.95, beta2=0.99),
        tz.m.WeightDecay(1e-3),
        tz.m.LR(0.1)
    )
    ```

    Mars-Lion
    ```python
    optimizer = tz.Modular(
        model.parameters(),
        tz.m.MARSCorrection(beta=0.9),
        tz.m.Lion(beta1=0.9),
        tz.m.LR(0.1)
    )
    ```

    """
    def __init__(
        self,
        beta: float = 0.9,
        scaling: float = 0.025,
        max_norm: float | None = 1,
    ):
        defaults=dict(beta=beta, scaling=scaling, max_norm=max_norm)
        super().__init__(defaults, uses_grad=False)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        prev = unpack_states(states, tensors, 'prev', init=tensors, cls=TensorList)
        beta, scaling = unpack_dicts(settings, 'beta', 'scaling', cls=NumberList)
        max_norm = settings[0]['max_norm']

        return mars_correction_(
            tensors_=TensorList(tensors),
            prev_=prev,
            beta=beta,
            scaling=scaling,
            max_norm=max_norm,
        )

MSAM

Bases: torchzero.core.transform.Transform

Momentum-SAM from https://arxiv.org/pdf/2401.12033.

This implementation expresses the update rule as a function of the gradient. This way it can be used as a drop-in replacement for momentum strategies in other optimizers.

To combine MSAM with other optimizers in the way done in the official implementation, e.g. to make Adam_MSAM, use the tz.m.MSAMObjective module.

Note: MSAM has a learning rate hyperparameter that can't really be removed from the update rule. To avoid compounding learning rate modifications, remove the tz.m.LR module if you had it.

Parameters:

  • lr (float) –

    learning rate. Adding this module adds support for learning rate schedulers.

  • momentum (float, default: 0.9 ) –

    momentum (beta). Defaults to 0.9.

  • rho (float, default: 0.3 ) –

    perturbation strength. Defaults to 0.3.

  • weight_decay (float, default: 0 ) –

    weight decay. It is applied to the perturbed parameters, so it is different from applying tz.m.WeightDecay after MSAM. Defaults to 0.

  • nesterov (bool, default: False ) –

    whether to use nesterov momentum formula. Defaults to False.

  • lerp (bool, default: False ) –

    whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.

Examples:

MSAM

opt = tz.Modular(
    model.parameters(),
    tz.m.MSAM(1e-3)
)

Adam with MSAM instead of exponential average. Note that this is different from Adam_MSAM. To make Adam_MSAM and such, use the tz.m.MSAMObjective module.

opt = tz.Modular(
    model.parameters(),
    tz.m.RMSprop(0.999, inner=tz.m.MSAM(1e-3)),
    tz.m.Debias(0.9, 0.999),
)
Source code in torchzero/modules/adaptive/msam.py
class MSAM(Transform):
    """Momentum-SAM from https://arxiv.org/pdf/2401.12033.

    This implementation expresses the update rule as a function of the gradient. This way it can be used as a drop-in
    replacement for momentum strategies in other optimizers.

    To combine MSAM with other optimizers in the way done in the official implementation,
    e.g. to make Adam_MSAM, use the ``tz.m.MSAMObjective`` module.

    Note:
        MSAM has a learning rate hyperparameter that can't really be removed from the update rule.
        To avoid compounding learning rate modifications, remove the ``tz.m.LR`` module if you had it.

    Args:
        lr (float): learning rate. Adding this module adds support for learning rate schedulers.
        momentum (float, optional): momentum (beta). Defaults to 0.9.
        rho (float, optional): perturbation strength. Defaults to 0.3.
        weight_decay (float, optional):
            weight decay. It is applied to the perturbed parameters, so it is different
            from applying :code:`tz.m.WeightDecay` after MSAM. Defaults to 0.
        nesterov (bool, optional): whether to use nesterov momentum formula. Defaults to False.
        lerp (bool, optional):
            whether to use linear interpolation, if True, this becomes similar to exponential moving average. Defaults to False.

    Examples:
        MSAM

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.MSAM(1e-3)
            )

        Adam with MSAM instead of exponential average. Note that this is different from Adam_MSAM.
        To make Adam_MSAM and such, use the :code:`tz.m.MSAMObjective` module.

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.RMSprop(0.999, inner=tz.m.MSAM(1e-3)),
                tz.m.Debias(0.9, 0.999),
            )
    """
    _USES_LR = True
    def __init__(self, lr: float, momentum:float=0.9, rho:float=0.3,  weight_decay:float=0, nesterov=False, lerp=False,):
        defaults = dict(momentum=momentum,rho=rho, nesterov=nesterov, lerp=lerp, weight_decay=weight_decay)
        if self._USES_LR: defaults['lr'] = lr
        super().__init__(defaults, uses_grad=False)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        velocity = unpack_states(states, tensors, 'velocity', cls=TensorList)
        s = self.settings[params[0]]
        lerp = s['lerp']
        nesterov = s['nesterov']

        if self._USES_LR:
            lr, momentum, rho, weight_decay = unpack_dicts(settings, 'lr','momentum','rho','weight_decay', cls=NumberList)

        else:
            lr=None
            momentum,rho,weight_decay = unpack_dicts(settings, 'momentum','rho','weight_decay', cls=NumberList)

        return msam_(
            TensorList(tensors),
            params=TensorList(params),
            velocity_=velocity,
            momentum=momentum,
            lr=lr,
            rho=rho,
            weight_decay=weight_decay,
            nesterov=nesterov,
            lerp=lerp,

            # inner args
            inner=self.children.get("modules", None),
            grads=grads,
        )

MSAMObjective

Bases: torchzero.modules.adaptive.msam.MSAM

Momentum-SAM from https://arxiv.org/pdf/2401.12033.

Note

Please make sure to place tz.m.LR inside the modules argument. For example, tz.m.MSAMObjective([tz.m.Adam(), tz.m.LR(1e-3)]). Putting LR after MSAM will lead to an incorrect update rule.

Parameters:

  • modules (Chainable) –

    modules that will optimize the MSAM objective. Make sure tz.m.LR is one of them.

  • momentum (float, default: 0.9 ) –

    momentum (beta). Defaults to 0.9.

  • rho (float, default: 0.3 ) –

    perturbation strength. Defaults to 0.3.

  • nesterov (bool, default: False ) –

    whether to use nesterov momentum formula. Defaults to False.

  • lerp (bool, default: False ) –

    whether to use linear interpolation, if True, MSAM momentum becomes similar to exponential moving average. Defaults to False.

Examples:

AdamW-MSAM

opt = tz.Modular(
    bench.parameters(),
    tz.m.MSAMObjective(
        [tz.m.Adam(), tz.m.WeightDecay(1e-3), tz.m.LR(1e-3)],
        rho=1.
    )
)
Source code in torchzero/modules/adaptive/msam.py
class MSAMObjective(MSAM):
    """Momentum-SAM from https://arxiv.org/pdf/2401.12033.

    Note:
        Please make sure to place ``tz.m.LR`` inside the ``modules`` argument. For example,
        ``tz.m.MSAMObjective([tz.m.Adam(), tz.m.LR(1e-3)])``. Putting LR after MSAM will lead
        to an incorrect update rule.

    Args:
        modules (Chainable): modules that will optimize the MSAM objective. Make sure :code:`tz.m.LR` is one of them.
        momentum (float, optional): momentum (beta). Defaults to 0.9.
        rho (float, optional): perturbation strength. Defaults to 0.3.
        nesterov (bool, optional): whether to use nesterov momentum formula. Defaults to False.
        lerp (bool, optional):
            whether to use linear interpolation, if True, MSAM momentum becomes similar to exponential moving average.
            Defaults to False.

    Examples:
        AdamW-MSAM

        .. code-block:: python

            opt = tz.Modular(
                bench.parameters(),
                tz.m.MSAMObjective(
                    [tz.m.Adam(), tz.m.WeightDecay(1e-3), tz.m.LR(1e-3)],
                    rho=1.
                )
            )
    """
    _USES_LR = False
    def __init__(self, modules: Chainable, momentum:float=0.9, rho:float=0.3, weight_decay:float=0, nesterov=False, lerp=False):
        super().__init__(lr=0, momentum=momentum, rho=rho, weight_decay=weight_decay, nesterov=nesterov, lerp=lerp)
        self.set_child('modules', modules)

MatrixMomentum

Bases: torchzero.core.module.Module

Second order momentum method.

Matrix momentum is useful for convex objectives; for some reason it also generalizes very well on elastic net logistic regression.

Notes
  • mu needs to be tuned very carefully. It is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable. I have devised an adaptive version of this - tz.m.AdaptiveMatrixMomentum - which works well without having to tune mu; however, the adaptive version doesn't work on stochastic objectives.

  • In most cases MatrixMomentum should be the first module in the chain because it relies on autograd.

  • This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a backward argument.

Parameters:

  • mu (float, default: 0.1 ) –

    this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • h (float, default: 0.001 ) –

    finite difference step size if hvp_method is set to finite difference. Defaults to 1e-3.

  • hvp_tfm (Chainable | None, default: None ) –

    optional module applied to hessian-vector products. Defaults to None.

Reference

Orr, Genevieve, and Todd Leen. "Using curvature information for fast stochastic search." Advances in neural information processing systems 9 (1996).

Source code in torchzero/modules/adaptive/matrix_momentum.py
class MatrixMomentum(Module):
    """Second order momentum method.

    Matrix momentum is useful for convex objectives; for some reason it also generalizes very well on elastic net logistic regression.

    Notes:
        - ``mu`` needs to be tuned very carefully. It is supposed to be smaller than (1/largest eigenvalue), otherwise this will be very unstable. I have devised an adaptive version of this - ``tz.m.AdaptiveMatrixMomentum`` - which works well without having to tune ``mu``; however, the adaptive version doesn't work on stochastic objectives.

        - In most cases ``MatrixMomentum`` should be the first module in the chain because it relies on autograd.

        - This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a ``backward`` argument.

    Args:
        mu (float, optional): this has a similar role to (1 - beta) in normal momentum. Defaults to 0.1.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        h (float, optional): finite difference step size if hvp_method is set to finite difference. Defaults to 1e-3.
        hvp_tfm (Chainable | None, optional): optional module applied to hessian-vector products. Defaults to None.

    Reference:
        Orr, Genevieve, and Todd Leen. "Using curvature information for fast stochastic search." Advances in neural information processing systems 9 (1996).
    """

    def __init__(
        self,
        lr:float,
        mu=0.1,
        hvp_method: Literal["autograd", "forward", "central"] = "autograd",
        h: float = 1e-3,
        adaptive:bool = False,
        adapt_freq: int | None = None,
        hvp_tfm: Chainable | None = None,
    ):
        defaults = dict(lr=lr, mu=mu, hvp_method=hvp_method, h=h, adaptive=adaptive, adapt_freq=adapt_freq)
        super().__init__(defaults)

        if hvp_tfm is not None:
            self.set_child('hvp_tfm', hvp_tfm)

    def reset_for_online(self):
        super().reset_for_online()
        self.clear_state_keys('p_prev')

    @torch.no_grad
    def update(self, var):
        assert var.closure is not None
        p = TensorList(var.params)
        p_prev = self.get_state(p, 'p_prev', init=var.params)

        hvp_method = self.defaults['hvp_method']
        h = self.defaults['h']
        step = self.global_state.get("step", 0)
        self.global_state["step"] = step + 1

        if step > 0:
            s = p - p_prev

            Hs, _ = self.Hvp(s, at_x0=True, var=var, rgrad=None, hvp_method=hvp_method, h=h, normalize=True, retain_grad=False)
            Hs = [t.detach() for t in Hs]

            if 'hvp_tfm' in self.children:
                Hs = TensorList(apply_transform(self.children['hvp_tfm'], Hs, params=p, grads=var.grad, var=var))

            self.store(p, ("Hs", "s"), (Hs, s))

            # -------------------------------- adaptive mu ------------------------------- #
            if self.defaults["adaptive"]:
                g = TensorList(var.get_grad())

                if self.defaults["adapt_freq"] is None:
                    # ---------------------------- deterministic case ---------------------------- #
                    g_prev = self.get_state(var.params, "g_prev", cls=TensorList)
                    y = g - g_prev
                    g_prev.copy_(g)
                    denom = y.global_vector_norm()
                    denom = denom.clip(min=torch.finfo(denom.dtype).tiny * 2)
                    self.global_state["mu_mul"] = s.global_vector_norm() / denom

                else:
                    # -------------------------------- stochastic -------------------------------- #
                    adapt_freq = self.defaults["adapt_freq"]

                    # we start on the 1st step, and want to adapt when we start, so use (step - 1)
                    if (step - 1) % adapt_freq == 0:
                        assert var.closure is not None
                        params = TensorList(var.params)
                        p_cur = params.clone()

                        # move to previous params and evaluate p_prev with current mini-batch
                        params.copy_(self.get_state(var.params, 'p_prev'))
                        with torch.enable_grad():
                            var.closure()
                        g_prev = [p.grad if p.grad is not None else torch.zeros_like(p) for p in params]
                        y = g - g_prev

                        # move back to current params
                        params.copy_(p_cur)

                        denom = y.global_vector_norm()
                        denom = denom.clip(min=torch.finfo(denom.dtype).tiny * 2)
                        self.global_state["mu_mul"] = s.global_vector_norm() / denom

        torch._foreach_copy_(p_prev, var.params)

    @torch.no_grad
    def apply(self, var):
        update = TensorList(var.get_update())
        lr,mu = self.get_settings(var.params, "lr", 'mu', cls=NumberList)

        if "mu_mul" in self.global_state:
            mu = mu * self.global_state["mu_mul"]

        # --------------------------------- 1st step --------------------------------- #
        # p_prev is not available so make a small step
        step = self.global_state["step"]
        if step == 1:
            if self.defaults["adaptive"]: self.get_state(var.params, "g_prev", init=var.get_grad())
            update.mul_(lr) # separate so that initial_step_size can clip correctly
            update.mul_(initial_step_size(update, 1e-7))
            return var

        # -------------------------- matrix momentum update -------------------------- #
        s, Hs = self.get_state(var.params, 's', 'Hs', cls=TensorList)

        update.mul_(lr).sub_(s).add_(Hs*mu)
        var.update = update
        return var
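
Below is a usage sketch (not from the original docs). It assumes torchzero is imported as tz and that the closure follows the usual convention of performing the backward pass when backward=True:

```python
import torch
import torch.nn as nn
import torchzero as tz  # assumed import alias

X, y = torch.randn(64, 20), torch.randn(64, 1)
model = nn.Sequential(nn.Linear(20, 32), nn.ELU(), nn.Linear(32, 1))

opt = tz.Modular(
    model.parameters(),
    tz.m.MatrixMomentum(lr=1e-2, mu=0.1, hvp_method="autograd"),
)

# the closure must accept a ``backward`` argument so the module can
# re-evaluate the loss and gradients when computing Hessian-vector products
def closure(backward=True):
    loss = (model(X) - y).pow(2).mean()
    if backward:
        model.zero_grad()
        loss.backward()
    return loss

for _ in range(100):
    opt.step(closure)
```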

MuonAdjustLR

Bases: torchzero.core.transform.Transform

LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master). Orthogonalize already has this built in with the adjust_lr setting; however, you might want to move this later in the chain.

Source code in torchzero/modules/adaptive/muon.py
class MuonAdjustLR(Transform):
    """LR adjustment for Muon from "Muon is Scalable for LLM Training" (https://github.com/MoonshotAI/Moonlight/tree/master).
    Orthogonalize already has this built in with the `adjust_lr` setting; however, you might want to move this later in the chain."""
    def __init__(self, alpha: float = 1, target: Target='update'):
        defaults = dict(alpha=alpha)
        super().__init__(defaults=defaults, uses_grad=False, target=target)

    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        alphas = [s['alpha'] for s in settings]
        # only tensors with at least 2 dims get the Muon LR adjustment; others pass through unchanged
        tensors_alphas = [(t, adjust_lr_for_muon(a, t.shape)) for t, a in zip(tensors, alphas) if _is_at_least_2d(t)]
        if tensors_alphas:
            ts = [i[0] for i in tensors_alphas]
            a = [i[1] for i in tensors_alphas]
            torch._foreach_mul_(ts, a)  # in-place, so the matching entries of ``tensors`` are scaled
        return tensors
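
An illustrative sketch of the "move it later in the chain" use case mentioned above (not from the original docs; the modules placed between Orthogonalize and MuonAdjustLR are arbitrary placeholders):

```python
import torch.nn as nn
import torchzero as tz  # assumed import alias

model = nn.Linear(128, 128)

opt = tz.Modular(
    model.parameters(),
    tz.m.HeavyBall(),
    tz.m.Orthogonalize(adjust_lr=False),     # skip the built-in adjustment here
    tz.m.ClipNormByEMA(max_ema_growth=1.2),  # some intermediate module
    tz.m.MuonAdjustLR(),                     # apply the Muon LR adjustment later in the chain
    tz.m.LR(1e-2),
)
```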

NaturalGradient

Bases: torchzero.core.module.Module

Natural gradient approximated via empirical fisher information matrix.

To use this, either pass a vector of per-sample losses to the step method, or make sure the closure returns it. Gradients are calculated via batched autograd within this module, so you don't need to implement the backward pass. When using a closure, add the backward argument; it will always be False, but it is required. See below for an example.

Note

Empirical fisher information matrix may give a really bad approximation in some cases. If that is the case, set sqrt to True to perform whitening instead, which is way more robust.

Parameters:

  • reg (float, default: 1e-08 ) –

    regularization parameter. Defaults to 1e-8.

  • sqrt (bool, default: False ) –

    if True, uses square root of empirical fisher information matrix. Both EFIM and its square root can be calculated and stored efficiently without ndim^2 memory. Square root whitens the gradient and often performs much better, especially when you try to use NGD with a vector that isn't strictly per-sample gradients, but rather for example different losses.

  • gn_grad (bool, default: False ) –

    if True, uses Gauss-Newton G^T @ f as the gradient, which is effectively sum weighted by value and is equivalent to squaring the values. This way you can solve least-squares objectives with a NGD-like algorithm. If False, uses sum of per-sample gradients. This has an effect when sqrt=True, and affects the grad attribute. Defaults to False.

  • batched (bool, default: True ) –

    whether to use vmapping. Defaults to True.

Examples:

training a neural network:

X = torch.randn(64, 20)
y = torch.randn(64, 10)

model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
opt = tz.Modular(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)

for i in range(100):
    y_hat = model(X) # (64, 10)
    losses = (y_hat - y).pow(2).mean(0) # (10, )
    opt.step(loss=losses)
    if i % 10 == 0:
        print(f'{losses.mean() = }')

training a neural network - closure version

X = torch.randn(64, 20)
y = torch.randn(64, 10)

model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
opt = tz.Modular(
    model.parameters(),
    tz.m.NaturalGradient(),
    tz.m.LR(3e-2)
)

def closure(backward=True):
    y_hat = model(X) # (64, 10)
    return (y_hat - y).pow(2).mean(0) # (10, )

for i in range(100):
    losses = opt.step(closure)
    if i % 10 == 0:
        print(f'{losses.mean() = }')

minimizing the rosenbrock function with a mix of natural gradient, whitening and gauss-newton:

def rosenbrock(X):
    x1, x2 = X
    return torch.stack([(1 - x1).abs(), (10 * (x2 - x1**2).abs())])

X = torch.tensor([-1.1, 2.5], requires_grad=True)
opt = tz.Modular([X], tz.m.NaturalGradient(sqrt=True, gn_grad=True), tz.m.LR(0.05))

for iter in range(200):
    losses = rosenbrock(X)
    opt.step(loss=losses)
    if iter % 20 == 0:
        print(f'{losses.mean() = }')

Source code in torchzero/modules/adaptive/natural_gradient.py
class NaturalGradient(Module):
    """Natural gradient approximated via empirical fisher information matrix.

    To use this, either pass a vector of per-sample losses to the step method, or make sure
    the closure returns it. Gradients are calculated via batched autograd within this module,
    so you don't need to implement the backward pass. When using a closure, add the ``backward`` argument;
    it will always be False, but it is required. See below for an example.

    Note:
        Empirical fisher information matrix may give a really bad approximation in some cases.
        If that is the case, set ``sqrt`` to True to perform whitening instead, which is way more robust.

    Args:
        reg (float, optional): regularization parameter. Defaults to 1e-8.
        sqrt (bool, optional):
            if True, uses square root of empirical fisher information matrix. Both EFIM and its square
            root can be calculated and stored efficiently without ndim^2 memory. Square root
            whitens the gradient and often performs much better, especially when you try to use NGD
            with a vector that isn't strictly per-sample gradients, but rather for example different losses.
        gn_grad (bool, optional):
            if True, uses Gauss-Newton G^T @ f as the gradient, which is effectively sum weighted by value
            and is equivalent to squaring the values. This way you can solve least-squares
            objectives with a NGD-like algorithm. If False, uses sum of per-sample gradients.
            This has an effect when ``sqrt=True``, and affects the ``grad`` attribute.
            Defaults to False.
        batched (bool, optional): whether to use vmapping. Defaults to True.

    Examples:

    training a neural network:
    ```python
    X = torch.randn(64, 20)
    y = torch.randn(64, 10)

    model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
    opt = tz.Modular(
        model.parameters(),
        tz.m.NaturalGradient(),
        tz.m.LR(3e-2)
    )

    for i in range(100):
        y_hat = model(X) # (64, 10)
        losses = (y_hat - y).pow(2).mean(0) # (10, )
        opt.step(loss=losses)
        if i % 10 == 0:
            print(f'{losses.mean() = }')
    ```

    training a neural network - closure version
    ```python
    X = torch.randn(64, 20)
    y = torch.randn(64, 10)

    model = nn.Sequential(nn.Linear(20, 64), nn.ELU(), nn.Linear(64, 10))
    opt = tz.Modular(
        model.parameters(),
        tz.m.NaturalGradient(),
        tz.m.LR(3e-2)
    )

    def closure(backward=True):
        y_hat = model(X) # (64, 10)
        return (y_hat - y).pow(2).mean(0) # (10, )

    for i in range(100):
        losses = opt.step(closure)
        if i % 10 == 0:
            print(f'{losses.mean() = }')
    ```

    minimizing the rosenbrock function with a mix of natural gradient, whitening and gauss-newton:
    ```python
    def rosenbrock(X):
        x1, x2 = X
        return torch.stack([(1 - x1).abs(), (10 * (x2 - x1**2).abs())])

    X = torch.tensor([-1.1, 2.5], requires_grad=True)
    opt = tz.Modular([X], tz.m.NaturalGradient(sqrt=True, gn_grad=True), tz.m.LR(0.05))

    for iter in range(200):
        losses = rosenbrock(X)
        opt.step(loss=losses)
        if iter % 20 == 0:
            print(f'{losses.mean() = }')
    ```
    """
    def __init__(self, reg:float = 1e-8, sqrt:bool=False, gn_grad:bool=False, batched:bool=True, ):
        super().__init__(defaults=dict(batched=batched, reg=reg, sqrt=sqrt, gn_grad=gn_grad))

    @torch.no_grad
    def update(self, var):
        params = var.params
        batched = self.defaults['batched']
        gn_grad = self.defaults['gn_grad']

        closure = var.closure
        assert closure is not None

        with torch.enable_grad():
            f = var.get_loss(backward=False) # n_out
            assert isinstance(f, torch.Tensor)
            G_list = jacobian_wrt([f.ravel()], params, batched=batched)

        var.loss = f.sum()
        G = self.global_state["G"] = flatten_jacobian(G_list) # (n_samples, ndim)

        if gn_grad:
            g = self.global_state["g"] = G.H @ f.detach()

        else:
            g = self.global_state["g"] = G.sum(0)

        var.grad = vec_to_tensors(g, params)

        # set closure to calculate scalar value for line searches etc
        if var.closure is not None:
            def ngd_closure(backward=True):
                if backward:
                    var.zero_grad()
                    with torch.enable_grad():
                        loss = closure(False)
                        if gn_grad: loss = loss.pow(2)
                        loss = loss.sum()
                        loss.backward()
                    return loss

                loss = closure(False)
                if gn_grad: loss = loss.pow(2)
                return loss.sum()

            var.closure = ngd_closure

    @torch.no_grad
    def apply(self, var):
        params = var.params
        reg = self.defaults['reg']
        sqrt = self.defaults['sqrt']

        G: torch.Tensor = self.global_state['G'] # (n_samples, n_dim)

        if sqrt:
            # this computes U, S <- SVD(M), then computes the update as U S^-1 Uᵀg,
            # but it does so through an eigendecomposition
            U, L = lm_adagrad_update(G.H, reg, 0)
            if U is None or L is None: return var

            v = lm_adagrad_apply(self.global_state["g"], U, L)
            var.update = vec_to_tensors(v, params)
            return var

        GGT = G @ G.H # (n_samples, n_samples)

        if reg != 0:
            GGT.add_(torch.eye(GGT.size(0), device=GGT.device, dtype=GGT.dtype).mul_(reg))

        z, _ = torch.linalg.solve_ex(GGT, torch.ones_like(GGT[0])) # pylint:disable=not-callable
        v = G.H @ z

        var.update = vec_to_tensors(v, params)
        return var


    def get_H(self, var):
        if "G" not in self.global_state: return linear_operator.ScaledIdentity()
        G = self.global_state['G']
        return linear_operator.AtA(G)

OrthoGrad

Bases: torchzero.core.transform.Transform

Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.

Parameters:

  • eps (float, default: 1e-08 ) –

    epsilon added to the denominator for numerical stability (default: 1e-8)

  • renormalize (bool, default: True ) –

    whether to graft projected gradient to original gradient norm. Defaults to True.

  • target (Literal, default: 'update' ) –

    what to set on var. Defaults to 'update'.

Source code in torchzero/modules/adaptive/orthograd.py
class OrthoGrad(Transform):
    """Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.

    Args:
        eps (float, optional): epsilon added to the denominator for numerical stability (default: 1e-8)
        renormalize (bool, optional): whether to graft projected gradient to original gradient norm. Defaults to True.
        target (Target, optional): what to set on var. Defaults to 'update'.
    """
    def __init__(self, eps: float = 1e-8, renormalize=True, target: Target = 'update'):
        defaults = dict(eps=eps, renormalize=renormalize)
        super().__init__(defaults, uses_grad=False, target=target)

    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        eps = settings[0]['eps']
        renormalize = settings[0]['renormalize']

        params = as_tensorlist(params)
        target = as_tensorlist(tensors)

        scale = params.dot(target)/(params.dot(params) + eps)
        if renormalize:
            norm = target.global_vector_norm()
            target -= params * scale
            target *= (norm / target.global_vector_norm())
            return target

        target -= params * scale
        return target
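
Below is a minimal usage sketch (not from the original docs; pairing ⟂Grad with Adam here is purely illustrative):

```python
import torch.nn as nn
import torchzero as tz  # assumed import alias

model = nn.Linear(20, 10)

# project gradients to be orthogonal to the weights, then run Adam on the result
optimizer = tz.Modular(
    model.parameters(),
    tz.m.OrthoGrad(),
    tz.m.Adam(),
    tz.m.LR(1e-3),
)
```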

Orthogonalize

Bases: torchzero.core.transform.TensorwiseTransform

Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.

To disable orthogonalization for a parameter, put it into a parameter group with "orthogonalize" = False. The Muon page says that embeddings and classifier heads should not be orthogonalized. Usually only matrix parameters that are directly used in matmuls should be orthogonalized.

To make Muon, use Split with Adam on 1d params

Parameters:

  • ns_steps (int, default: 5 ) –

    The number of Newton-Schulz iterations to run. Defaults to 5.

  • adjust_lr (bool, default: False ) –

    Enables LR adjustment based on parameter size from "Muon is Scalable for LLM Training". Defaults to False.

  • dual_norm_correction (bool, default: False ) –

    enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.

  • method (str, default: 'newton-schulz' ) –

    Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.

  • target (str, default: 'update' ) –

    what to set on var.

Examples:

standard Muon with Adam fallback

opt = tz.Modular(
    model.head.parameters(),
    tz.m.Split(
        # apply muon only to 2D+ parameters
        filter = lambda t: t.ndim >= 2,
        true = [
            tz.m.HeavyBall(),
            tz.m.Orthogonalize(),
            tz.m.LR(1e-2),
        ],
        false = tz.m.Adam()
    ),
    tz.m.LR(1e-2)
)

Reference

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein - Muon: An optimizer for hidden layers in neural networks (2024) https://github.com/KellerJordan/Muon

Source code in torchzero/modules/adaptive/muon.py
class Orthogonalize(TensorwiseTransform):
    """Uses Newton-Schulz iteration or SVD to compute the zeroth power / orthogonalization of update along first 2 dims.

    To disable orthogonalization for a parameter, put it into a parameter group with "orthogonalize" = False.
    The Muon page says that embeddings and classifier heads should not be orthogonalized.
    Usually only matrix parameters that are directly used in matmuls should be orthogonalized.

    To make Muon, use Split with Adam on 1d params

    Args:
        ns_steps (int, optional):
            The number of Newton-Schulz iterations to run. Defaults to 5.
        adjust_lr (bool, optional):
            Enables LR adjustment based on parameter size from "Muon is Scalable for LLM Training". Defaults to False.
        dual_norm_correction (bool, optional):
            enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
        method (str, optional):
            Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.
        target (str, optional):
            what to set on var.

    ## Examples:

    standard Muon with Adam fallback
    ```py
    opt = tz.Modular(
        model.head.parameters(),
        tz.m.Split(
            # apply muon only to 2D+ parameters
            filter = lambda t: t.ndim >= 2,
            true = [
                tz.m.HeavyBall(),
                tz.m.Orthogonalize(),
                tz.m.LR(1e-2),
            ],
            false = tz.m.Adam()
        ),
        tz.m.LR(1e-2)
    )
    ```

    Reference:
        Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein - Muon: An optimizer for hidden layers in neural networks (2024) https://github.com/KellerJordan/Muon
    """
    def __init__(self, ns_steps=5, adjust_lr=False, dual_norm_correction=False,
                 method: Literal['newton-schulz', 'svd'] = 'newton-schulz', target:Target='update'):
        defaults = dict(orthogonalize=True, ns_steps=ns_steps, dual_norm_correction=dual_norm_correction, adjust_lr=adjust_lr, method=method.lower())
        super().__init__(uses_grad=False, defaults=defaults, target=target)

    @torch.no_grad
    def apply_tensor(self, tensor, param, grad, loss, state, setting):
        orthogonalize, ns_steps, dual_norm_correction, adjust_lr, method = itemgetter(
            'orthogonalize', 'ns_steps', 'dual_norm_correction', 'adjust_lr', 'method')(setting)

        if not orthogonalize: return tensor

        if _is_at_least_2d(tensor):

            X = _orthogonalize_tensor(tensor, ns_steps, method)

            if dual_norm_correction:
                X = _dual_norm_correction(X, tensor, batch_first=False)

            if adjust_lr:
                X.mul_(adjust_lr_for_muon(1, param.shape))

            return X.view_as(param)

        return tensor

RMSprop

Bases: torchzero.core.transform.Transform

Divides gradient by an EMA of gradient squares.

This implementation is identical to torch.optim.RMSprop.

Parameters:

  • smoothing (float, default: 0.99 ) –

    beta for exponential moving average of gradient squares. Defaults to 0.99.

  • eps (float, default: 1e-08 ) –

    epsilon for division. Defaults to 1e-8.

  • centered (bool, default: False ) –

    whether to center EMA of gradient squares using an additional EMA. Defaults to False.

  • debiased (bool, default: False ) –

    applies Adam debiasing. Defaults to False.

  • amsgrad (bool, default: False ) –

    Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.

  • pow (float, default: 2 ) –

    power used in second momentum power and root. Defaults to 2.

  • init (str, default: 'zeros' ) –

    how to initialize EMA, either "update" to use first update or "zeros". Defaults to "zeros".

  • inner (Chainable | None, default: None ) –

    Inner modules that are applied after updating EMA and before preconditioning. Defaults to None.

Source code in torchzero/modules/adaptive/rmsprop.py
class RMSprop(Transform):
    """Divides graient by EMA of gradient squares.

    This implementation is identical to :code:`torch.optim.RMSprop`.

    Args:
        smoothing (float, optional): beta for exponential moving average of gradient squares. Defaults to 0.99.
        eps (float, optional): epsilon for division. Defaults to 1e-8.
        centered (bool, optional): whether to center EMA of gradient squares using an additional EMA. Defaults to False.
        debiased (bool, optional): applies Adam debiasing. Defaults to False.
        amsgrad (bool, optional): Whether to divide by maximum of EMA of gradient squares instead. Defaults to False.
        pow (float, optional): power used in second momentum power and root. Defaults to 2.
        init (str, optional): how to initialize EMA, either "update" to use first update or "zeros". Defaults to "zeros".
        inner (Chainable | None, optional):
            Inner modules that are applied after updating EMA and before preconditioning. Defaults to None.
    """
    def __init__(
        self,
        smoothing: float = 0.99,
        eps: float = 1e-8,
        centered: bool = False,
        debiased: bool = False,
        amsgrad: bool = False,
        pow: float = 2,
        init: Literal["zeros", "update"] = "zeros",
        inner: Chainable | None = None,
    ):
        defaults = dict(smoothing=smoothing,eps=eps,centered=centered,debiased=debiased,amsgrad=amsgrad,pow=pow,init=init)
        super().__init__(defaults=defaults, uses_grad=False)

        if inner is not None:
            self.set_child('inner', inner)

    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        step = self.global_state['step'] = self.global_state.get('step', 0) + 1
        smoothing, eps = unpack_dicts(settings, 'smoothing', 'eps', cls=NumberList)
        centered, debiased, amsgrad, pow, init = itemgetter('centered','debiased','amsgrad','pow','init')(settings[0])

        exp_avg_sq = unpack_states(states, tensors, 'exp_avg_sq', cls=TensorList)
        exp_avg = unpack_states(states, tensors, 'exp_avg', cls=TensorList) if centered else None
        max_exp_avg_sq = unpack_states(states, tensors, 'max_exp_avg_sq', cls=TensorList) if amsgrad else None

        if init == 'update' and step == 1:
            exp_avg_sq.set_([t**2 for t in tensors])
            if exp_avg is not None: exp_avg.set_([t.clone() for t in tensors])

        return rmsprop_(
            TensorList(tensors),
            exp_avg_sq_=exp_avg_sq,
            smoothing=smoothing,
            eps=eps,
            debiased=debiased,
            step=step,
            exp_avg_=exp_avg,
            max_exp_avg_sq_=max_exp_avg_sq,
            pow=pow,

            # inner args
            inner=self.children.get("inner", None),
            params=params,
            grads=grads,
        )
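
Below is a minimal usage sketch (not from the original docs; it follows the tz.Modular pattern used throughout this page):

```python
import torch.nn as nn
import torchzero as tz  # assumed import alias

model = nn.Linear(20, 10)

optimizer = tz.Modular(
    model.parameters(),
    tz.m.RMSprop(smoothing=0.99, eps=1e-8),
    tz.m.LR(1e-3),
)
```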

Rprop

Bases: torchzero.core.transform.Transform

Resilient propagation. The update magnitude gets multiplied by nplus if the gradient didn't change sign, or nminus if it did. Then the update is applied with the sign of the current gradient.

Additionally, if the gradient changes sign, the update for that weight is reverted, and on the next step the magnitude for that weight won't change.

Compared to pytorch this also implements the backtracking update when the sign changes.

This implementation is identical to torch.optim.Rprop if backtrack is set to False.

Parameters:

  • nplus (float, default: 1.2 ) –

    multiplicative increase factor for when ascent didn't change sign (default: 1.2).

  • nminus (float, default: 0.5 ) –

    multiplicative decrease factor for when ascent changed sign (default: 0.5).

  • lb (float, default: 1e-06 ) –

    minimum step size, can be None (default: 1e-6)

  • ub (float, default: 50 ) –

    maximum step size, can be None (default: 50)

  • backtrack (float, default: True ) –

    if True, when ascent sign changes, undoes last weight update, otherwise sets update to 0. When this is False, this exactly matches pytorch Rprop. (default: True)

  • alpha (float, default: 1 ) –

    initial per-parameter learning rate (default: 1).

Reference: Riedmiller, M., & Braun, H. (1993, March). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE international conference on neural networks (pp. 586-591). IEEE.

Source code in torchzero/modules/adaptive/rprop.py
class Rprop(Transform):
    """
    Resilient propagation. The update magnitude gets multiplied by `nplus` if gradient didn't change the sign,
    or `nminus` if it did. Then the update is applied with the sign of the current gradient.

    Additionally, if gradient changes sign, the update for that weight is reverted.
    Next step, magnitude for that weight won't change.

    Compared to pytorch this also implements backtracking update when sign changes.

    This implementation is identical to :code:`torch.optim.Rprop` if :code:`backtrack` is set to False.

    Args:
        nplus (float): multiplicative increase factor for when ascent didn't change sign (default: 1.2).
        nminus (float): multiplicative decrease factor for when ascent changed sign (default: 0.5).
        lb (float): minimum step size, can be None (default: 1e-6)
        ub (float): maximum step size, can be None (default: 50)
        backtrack (float):
            if True, when ascent sign changes, undoes last weight update, otherwise sets update to 0.
            When this is False, this exactly matches pytorch Rprop. (default: True)
        alpha (float): initial per-parameter learning rate (default: 1).

    reference
        *Riedmiller, M., & Braun, H. (1993, March). A direct adaptive method for faster backpropagation learning:
        The RPROP algorithm. In IEEE international conference on neural networks (pp. 586-591). IEEE.*
    """
    def __init__(
        self,
        nplus: float = 1.2,
        nminus: float = 0.5,
        lb: float = 1e-6,
        ub: float = 50,
        backtrack=True,
        alpha: float = 1,
    ):
        defaults = dict(nplus = nplus, nminus = nminus, alpha = alpha, lb = lb, ub = ub, backtrack=backtrack)
        super().__init__(defaults, uses_grad=False)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        nplus, nminus, lb, ub, alpha = unpack_dicts(settings, 'nplus', 'nminus', 'lb', 'ub', 'alpha', cls=NumberList)
        prev, allowed, magnitudes = unpack_states(
            states, tensors,
            'prev','allowed','magnitudes',
            init=[torch.zeros_like, _bool_ones_like, torch.zeros_like],
            cls = TensorList,
        )

        tensors = rprop_(
            tensors_ = as_tensorlist(tensors),
            prev_ = prev,
            allowed_ = allowed,
            magnitudes_ = magnitudes,
            nplus = nplus,
            nminus = nminus,
            lb = lb,
            ub = ub,
            alpha = alpha,
            backtrack=settings[0]['backtrack'],
            step=step,
        )

        return tensors
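
Below is a minimal usage sketch (not from the original docs). Rprop adapts its per-weight step magnitudes on its own, so alpha only sets the initial magnitude; like the original algorithm, it is best suited to full-batch gradients:

```python
import torch.nn as nn
import torchzero as tz  # assumed import alias

model = nn.Linear(20, 10)

# alpha sets the initial per-parameter step size; nplus/nminus grow or shrink it
# depending on whether the gradient sign stayed the same or flipped
optimizer = tz.Modular(
    model.parameters(),
    tz.m.Rprop(nplus=1.2, nminus=0.5, lb=1e-6, ub=50, alpha=1e-2),
)
```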

SAM

Bases: torchzero.core.module.Module

Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412

SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value. It performs two forward and backward passes per step.

This implementation modifies the closure to return loss and calculate gradients of the SAM objective. All modules after this will use the modified objective.

Note: This module requires a closure passed to the optimizer step, as it needs to re-evaluate the loss and gradients at two points on each step.

Parameters:

  • rho (float, default: 0.05 ) –

    Neighborhood size. Defaults to 0.05.

  • p (float, default: 2 ) –

    norm of the SAM objective. Defaults to 2.

  • asam (bool, default: False ) –

    enables the ASAM variant, which makes the perturbation relative to weight magnitudes. ASAM requires a much larger rho, like 0.5 or 1. The tz.m.ASAM class is identical to setting this argument to True, but it has a larger rho by default.

Examples:

SAM-SGD:

opt = tz.Modular(
    model.parameters(),
    tz.m.SAM(),
    tz.m.LR(1e-2)
)

SAM-Adam:

opt = tz.Modular(
    model.parameters(),
    tz.m.SAM(),
    tz.m.Adam(),
    tz.m.LR(1e-2)
)
References

Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412. https://arxiv.org/abs/2010.01412#page=3.16

Source code in torchzero/modules/adaptive/sam.py
class SAM(Module):
    """Sharpness-Aware Minimization from https://arxiv.org/pdf/2010.01412

    SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value.
    It performs two forward and backward passes per step.

    This implementation modifies the closure to return loss and calculate gradients
    of the SAM objective. All modules after this will use the modified objective.

    .. note::
        This module requires a closure passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients at two points on each step.

    Args:
        rho (float, optional): Neighborhood size. Defaults to 0.05.
        p (float, optional): norm of the SAM objective. Defaults to 2.
        asam (bool, optional):
            enables ASAM variant which makes perturbation relative to weight magnitudes.
            ASAM requires a much larger :code:`rho`, like 0.5 or 1.
            The :code:`tz.m.ASAM` class is identical to setting this argument to True, but
            it has larger :code:`rho` by default.

    Examples:
        SAM-SGD:

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.SAM(),
                tz.m.LR(1e-2)
            )

        SAM-Adam:

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.SAM(),
                tz.m.Adam(),
                tz.m.LR(1e-2)
            )

    References:
        Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412. https://arxiv.org/abs/2010.01412#page=3.16
    """
    def __init__(self, rho: float = 0.05, p: float = 2, eps=1e-10, asam=False):
        defaults = dict(rho=rho, p=p, eps=eps, asam=asam)
        super().__init__(defaults)

    @torch.no_grad
    def step(self, var):

        params = var.params
        closure = var.closure
        zero_grad = var.zero_grad
        if closure is None: raise RuntimeError("SAM requires a closure passed to the optimizer step")
        p, rho = self.get_settings(var.params, 'p', 'rho', cls=NumberList)
        s = self.defaults
        eps = s['eps']
        asam = s['asam']

        # 1/p + 1/q = 1
        # okay, authors of SAM paper, I will manually solve your equation
        # so q = -p/(1-p)
        q = -p / (1-p)
        # as a validation for 2 it is -2 / -1 = 2

        @torch.no_grad
        def sam_closure(backward=True):
            orig_grads = None
            if not backward:
                # if backward is False, make sure this doesn't modify gradients
                # to avoid issues
                orig_grads = [p.grad for p in params]

            # gradient at initial parameters
            zero_grad()
            with torch.enable_grad():
                closure()

            grad = TensorList(p.grad if p.grad is not None else torch.zeros_like(p) for p in params)
            grad_abs = grad.abs()

            # compute e
            term1 = grad.sign().mul_(rho)
            term2 = grad_abs.pow(q-1)

            if asam:
                grad_abs.mul_(torch._foreach_abs(params))

            denom = grad_abs.pow_(q).sum().pow(1/p)

            e = term1.mul_(term2).div_(denom.clip(min=eps))

            if asam:
                e.mul_(torch._foreach_pow(params, 2))

            # calculate loss and gradient approximation of inner problem
            torch._foreach_add_(params, e)
            if backward:
                zero_grad()
                with torch.enable_grad():
                    # this sets .grad attributes
                    sam_loss = closure()

            else:
                sam_loss = closure(False)

            # and restore initial parameters
            torch._foreach_sub_(params, e)

            if orig_grads is not None:
                for param,orig_grad in zip(params, orig_grads):
                    param.grad = orig_grad

            return sam_loss

        var.closure = sam_closure
        return var

SOAP

Bases: torchzero.core.transform.Transform

SOAP (ShampoO with Adam in the Preconditioner's eigenbasis from https://arxiv.org/abs/2409.11321).

Parameters:

  • beta1 (float, default: 0.95 ) –

    beta for first momentum. Defaults to 0.95.

  • beta2 (float, default: 0.95 ) –

    beta for second momentum. Defaults to 0.95.

  • shampoo_beta (float | None, default: 0.95 ) –

    beta for covariance matrices accumulators. Can be None, then it just sums them like Adagrad (which works worse). Defaults to 0.95.

  • precond_freq (int, default: 10 ) –

    How often to update the preconditioner. Defaults to 10.

  • merge_small (bool, default: True ) –

    Whether to merge small dims. Defaults to True.

  • max_dim (int, default: 2000 ) –

    Won't precondition dims larger than this. Defaults to 2_000.

  • precondition_1d (bool, default: True ) –

    Whether to precondition 1d params (SOAP paper sets this to False). Defaults to True.

  • eps (float, default: 1e-08 ) –

    epsilon for dividing first momentum by second. Defaults to 1e-8.

  • decay (float | None, default: None ) –

    Decays covariance matrix accumulators, this may be useful if shampoo_beta is None. Defaults to None.

  • alpha (float, default: 1 ) –

    learning rate. Defaults to 1.

  • bias_correction (bool, default: True ) –

    enables adam bias correction. Defaults to True.

Examples:

SOAP:

opt = tz.Modular(model.parameters(), tz.m.SOAP(), tz.m.LR(1e-3))

Stabilized SOAP:

opt = tz.Modular(
    model.parameters(),
    tz.m.SOAP(),
    tz.m.NormalizeByEMA(max_ema_growth=1.2),
    tz.m.LR(1e-2)
)
Source code in torchzero/modules/adaptive/soap.py
class SOAP(Transform):
    """SOAP (ShampoO with Adam in the Preconditioner's eigenbasis from https://arxiv.org/abs/2409.11321).

    Args:
        beta1 (float, optional): beta for first momentum. Defaults to 0.95.
        beta2 (float, optional): beta for second momentum. Defaults to 0.95.
        shampoo_beta (float | None, optional):
            beta for covariance matrices accumulators. Can be None, then it just sums them like Adagrad (which works worse). Defaults to 0.95.
        precond_freq (int, optional): How often to update the preconditioner. Defaults to 10.
        merge_small (bool, optional): Whether to merge small dims. Defaults to True.
        max_dim (int, optional): Won't precondition dims larger than this. Defaults to 2_000.
        precondition_1d (bool, optional):
            Whether to precondition 1d params (SOAP paper sets this to False). Defaults to True.
        eps (float, optional):
            epsilon for dividing first momentum by second. Defaults to 1e-8.
        decay (float | None, optional):
            Decays covariance matrix accumulators, this may be useful if `shampoo_beta` is None. Defaults to None.
        alpha (float, optional):
            learning rate. Defaults to 1.
        bias_correction (bool, optional):
            enables adam bias correction. Defaults to True.

    Examples:
        SOAP:

        .. code-block:: python

            opt = tz.Modular(model.parameters(), tz.m.SOAP(), tz.m.LR(1e-3))

        Stabilized SOAP:

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.SOAP(),
                tz.m.NormalizeByEMA(max_ema_growth=1.2),
                tz.m.LR(1e-2)
            )
    """
    def __init__(
        self,
        beta1: float = 0.95,
        beta2: float = 0.95,
        shampoo_beta: float | None = 0.95,
        precond_freq: int = 10,
        merge_small: bool = True,
        max_dim: int = 2_000,
        precondition_1d: bool = True,
        eps: float = 1e-8,
        decay: float | None = None,
        alpha: float = 1,
        bias_correction: bool = True,
    ):
        defaults = dict(
            beta1=beta1,
            beta2=beta2,
            shampoo_beta=shampoo_beta,
            precond_freq=precond_freq,
            merge_small=merge_small,
            max_dim=max_dim,
            precondition_1d=precondition_1d,
            eps=eps,
            decay=decay,
            bias_correction=bias_correction,
            alpha=alpha,
        )
        super().__init__(defaults, uses_grad=False)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        updates = []
        # update preconditioners
        for i,(p,t, state, setting) in enumerate(zip(params, tensors, states, settings)):
            beta1, beta2, shampoo_beta, merge_small, max_dim, precondition_1d, eps,alpha = itemgetter(
                'beta1', 'beta2', 'shampoo_beta', 'merge_small', 'max_dim', 'precondition_1d', 'eps','alpha')(setting)

            if merge_small:
                t, state['flat_sizes'], state['sort_idxs'] = _merge_small_dims(t, max_dim)

            # initialize state on 1st step
            if 'GG' not in state:
                state["exp_avg"] = torch.zeros_like(t)
                state["exp_avg_sq_projected"] = torch.zeros_like(t)

                if not precondition_1d and t.ndim <= 1:
                    state['GG'] = []

                else:
                    state['GG'] = [torch.zeros(s, s, dtype=t.dtype, device=t.device) if 1<s<max_dim else None for s in t.shape]

                # either scalar parameter, 1d with precondition_1d=False, or all dims are too big.
                if all(i is None for i in state['GG']):
                    state['GG'] = None

                if state['GG'] is not None:
                    update_soap_covariances_(t, GGs_=state['GG'], beta=shampoo_beta)
                    try: state['Q'] = get_orthogonal_matrix(state['GG'])
                    except torch.linalg.LinAlgError as e:
                        warnings.warn(f"torch.linalg.eigh raised an error when initializing SOAP Q matrices on 1st step, diagonal preconditioning will be used for this parameter. The error was:\n{e}")
                        state["GG"] = None

                state['step'] = 0
                updates.append(tensors[i].clip(-0.1, 0.1))
                continue  # skip the 1st step as in https://github.com/nikhilvyas/SOAP/blob/main/soap.py
                # a clipped update is returned above instead of zeros so as not to disrupt subsequent modules

            # Projecting gradients to the eigenbases of Shampoo's preconditioner
            # i.e. projecting to the eigenbases of matrices in state['GG']
            t_projected = None
            if state['GG'] is not None:
                t_projected = project(t, state['Q'])

            # exponential moving averages
            # this part could use foreach ops, but the gain would be small compared to the preconditioning cost
            exp_avg: torch.Tensor = state["exp_avg"]
            exp_avg_sq_projected: torch.Tensor = state["exp_avg_sq_projected"]

            exp_avg.lerp_(t, 1-beta1)

            if t_projected is None:
                exp_avg_sq_projected.mul_(beta2).addcmul_(t, t, value=1-beta2)
            else:
                exp_avg_sq_projected.mul_(beta2).addcmul_(t_projected, t_projected, value=1-beta2)

            # project exponential moving averages if they are accumulated unprojected
            exp_avg_projected = exp_avg
            if t_projected is not None:
                exp_avg_projected = project(exp_avg, state['Q'])

            denom = exp_avg_sq_projected.sqrt().add_(eps)
            # print(f'{t_projected = }, {exp_avg = }, {exp_avg_projected = }, {exp_avg_sq = }, {exp_avg_sq_projected = }, {denom = }')

            # Projecting back the preconditioned (by Adam) exponential moving average of gradients
            # to the original space
            update = exp_avg_projected / denom

            if t_projected is not None:
                update = project_back(update, state["Q"])

            if setting['bias_correction']:
                bias_correction1 = 1.0 - beta1 ** (state["step"]+1)
                bias_correction2 = 1.0 - beta2 ** (state["step"]+1)
                update *= ((bias_correction2 ** .5) / bias_correction1) * alpha
            elif alpha is not None:
                update *= alpha

            if merge_small:
                update = _unmerge_small_dims(update, state['flat_sizes'], state['sort_idxs'])

            updates.append(update)
            state["step"] += 1

            # Update is done after the gradient step to avoid using current gradients in the projection.
            if state['GG'] is not None:
                update_soap_covariances_(t, state['GG'], shampoo_beta)
                if state['step'] % setting['precond_freq'] == 0:
                    try:
                        state['Q'], state['exp_avg_sq_projected'] = get_orthogonal_matrix_QR(exp_avg_sq_projected, state['GG'], state['Q'])
                    except torch.linalg.LinAlgError:
                        pass
        return updates

ScaleLRBySignChange

Bases: torchzero.core.transform.Transform

The learning rate gets multiplied by nplus if the ascent/gradient didn't change sign, or by nminus if it did.

This is part of RProp update rule.

Parameters:

  • nplus (float, default: 1.2 ) –

    learning rate gets multiplied by nplus if ascent/gradient didn't change the sign

  • nminus (float, default: 0.5 ) –

    learning rate gets multiplied by nminus if ascent/gradient changed the sign

  • lb (float, default: 1e-06 ) –

    lower bound for lr.

  • ub (float, default: 50.0 ) –

    upper bound for lr.

  • alpha (float, default: 1.0 ) –

    initial learning rate.

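Examples:

A minimal usage sketch (an illustrative assumption, not from the original documentation): the per-parameter learning rates produced by this module scale the update, followed by a global LR module as in the other examples on this page.

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.ScaleLRBySignChange(),
        tz.m.LR(1e-2)
    )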
Source code in torchzero/modules/adaptive/rprop.py
class ScaleLRBySignChange(Transform):
    """
    learning rate gets multiplied by `nplus` if ascent/gradient didn't change the sign,
    or `nminus` if it did.

    This is part of RProp update rule.

    Args:
        nplus (float): learning rate gets multiplied by `nplus` if ascent/gradient didn't change the sign
        nminus (float): learning rate gets multiplied by `nminus` if ascent/gradient changed the sign
        lb (float): lower bound for lr.
        ub (float): upper bound for lr.
        alpha (float): initial learning rate.

    """

    def __init__(
        self,
        nplus: float = 1.2,
        nminus: float = 0.5,
        lb=1e-6,
        ub=50.0,
        alpha=1.0,
        use_grad=False,
        target: Target = "update",
    ):
        defaults = dict(nplus=nplus, nminus=nminus, alpha=alpha, lb=lb, ub=ub, use_grad=use_grad)
        super().__init__(defaults, uses_grad=use_grad, target=target)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        tensors = as_tensorlist(tensors)
        use_grad = settings[0]['use_grad']
        if use_grad: cur = as_tensorlist(grads)
        else: cur = tensors

        nplus, nminus, lb, ub = unpack_dicts(settings, 'nplus', 'nminus', 'lb', 'ub', cls=NumberList)
        prev, lrs = unpack_states(states, tensors, 'prev', 'lrs', cls=TensorList)

        if step == 0:
            lrs.set_(tensors.full_like([s['alpha'] for s in settings]))

        tensors = scale_by_sign_change_(
            tensors_ = tensors,
            cur = cur,
            prev_ = prev,
            lrs_ = lrs,
            nplus = nplus,
            nminus = nminus,
            lb = lb,
            ub = ub,
            step = step,
        )
        return tensors

Shampoo

Bases: torchzero.core.transform.Transform

Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).

.. note:: Shampoo is usually grafted to another optimizer like Adam, otherwise it can be unstable. An example of how to do grafting is given below in the Examples section.

.. note:: Shampoo is a very computationally expensive optimizer; increase :code:`update_freq` if it is too slow.

.. note:: The SOAP optimizer usually outperforms Shampoo and is not as computationally expensive. The SOAP implementation is available as :code:`tz.m.SOAP`.

Parameters:

  • decay (float | None, default: None ) –

    slowly decays preconditioners. Defaults to None.

  • beta (float | None, default: None ) –

    if None calculates sum as in standard shampoo, otherwise uses EMA of preconditioners. Defaults to None.

  • update_freq (int, default: 10 ) –

    preconditioner update frequency. Defaults to 10.

  • exp_override (int | None, default: 2 ) –

    matrix exponent override, if not set, uses 2*ndim. Defaults to 2.

  • merge_small (bool, default: True ) –

    whether to merge small dims on tensors. Defaults to True.

  • max_dim (int, default: 2000 ) –

    maximum dimension size for preconditioning. Defaults to 2_000.

  • precondition_1d (bool, default: True ) –

    whether to precondition 1d tensors. Defaults to True.

  • adagrad_eps (float, default: 1e-08 ) –

    epsilon for adagrad division for tensors where shampoo can't be applied. Defaults to 1e-8.

  • inner (Chainable | None, default: None ) –

    module applied after updating preconditioners and before applying preconditioning. For example if beta≈0.999 and inner=tz.m.EMA(0.9), this becomes Adam with shampoo preconditioner (ignoring debiasing). Defaults to None.

Examples:

Shampoo grafted to Adam

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.GraftModules(
            direction = tz.m.Shampoo(),
            magnitude = tz.m.Adam(),
        ),
        tz.m.LR(1e-3)
    )

Adam with Shampoo preconditioner

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.Shampoo(beta=0.999, inner=tz.m.EMA(0.9)),
        tz.m.Debias(0.9, 0.999),
        tz.m.LR(1e-3)
    )
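
Per the note above, Shampoo's cost can be reduced by updating the preconditioners less often. A hedged sketch where the value of update_freq is an illustrative assumption:

.. code-block:: python

    # update the preconditioners every 50 steps to reduce cost (illustrative value)
    opt = tz.Modular(
        model.parameters(),
        tz.m.GraftModules(
            direction = tz.m.Shampoo(update_freq=50),
            magnitude = tz.m.Adam(),
        ),
        tz.m.LR(1e-3)
    )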
Source code in torchzero/modules/adaptive/shampoo.py
class Shampoo(Transform):
    """Shampoo from Preconditioned Stochastic Tensor Optimization (https://arxiv.org/abs/1802.09568).

    .. note::
        Shampoo is usually grafted to another optimizer like Adam, otherwise it can be unstable. An example of how to do grafting is given below in the Examples section.

    .. note::
        Shampoo is a very computationally expensive optimizer, increase :code:`update_freq` if it is too slow.

    .. note::
        SOAP optimizer usually outperforms Shampoo and is also not as computationally expensive. SOAP implementation is available as :code:`tz.m.SOAP`.

    Args:
        decay (float | None, optional): slowly decays preconditioners. Defaults to None.
        beta (float | None, optional):
            if None calculates sum as in standard shampoo, otherwise uses EMA of preconditioners. Defaults to None.
        update_freq (int, optional): preconditioner update frequency. Defaults to 10.
        exp_override (int | None, optional): matrix exponent override, if not set, uses 2*ndim. Defaults to 2.
        merge_small (bool, optional): whether to merge small dims on tensors. Defaults to True.
        max_dim (int, optional): maximum dimension size for preconditioning. Defaults to 2_000.
        precondition_1d (bool, optional): whether to precondition 1d tensors. Defaults to True.
        adagrad_eps (float, optional): epsilon for adagrad division for tensors where shampoo can't be applied. Defaults to 1e-8.
        inner (Chainable | None, optional):
            module applied after updating preconditioners and before applying preconditioning.
            For example if beta≈0.999 and `inner=tz.m.EMA(0.9)`, this becomes Adam with shampoo preconditioner (ignoring debiasing).
            Defaults to None.

    Examples:
        Shampoo grafted to Adam

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.GraftModules(
                    direction = tz.m.Shampoo(),
                    magnitude = tz.m.Adam(),
                ),
                tz.m.LR(1e-3)
            )

        Adam with Shampoo preconditioner

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.Shampoo(beta=0.999, inner=tz.m.EMA(0.9)),
                tz.m.Debias(0.9, 0.999),
                tz.m.LR(1e-3)
            )
    """
    def __init__(
        self,
        decay: float | None = None,
        beta: float | None = None,
        reg: float = 1e-12,
        update_freq: int = 10,
        exp_override: int | None = 2,
        merge_small: bool = True,
        max_dim: int = 2_000,
        precondition_1d: bool = True,
        adagrad_eps: float = 1e-8,
        inner: Chainable | None = None,
    ):
        defaults = dict(decay=decay, beta=beta, update_freq=update_freq, exp_override=exp_override, merge_small=merge_small, max_dim=max_dim, precondition_1d=precondition_1d,adagrad_eps=adagrad_eps, reg=reg)
        super().__init__(defaults, uses_grad=False)

        if inner is not None:
            self.set_child('inner', inner)

    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        merged_tensors = [] # target with merged dims

        # update preconditioners
        for i,(t,state, setting) in enumerate(zip(tensors, states, settings)):
            beta, update_freq, exp_override, merge_small, max_dim, precondition_1d, reg = itemgetter(
                'beta', 'update_freq', 'exp_override', 'merge_small', 'max_dim', 'precondition_1d', "reg")(setting)

            if merge_small:
                t, state['flat_sizes'], state['sort_idxs'] = _merge_small_dims(t, max_dim)

            merged_tensors.append(t)

            # initialize accumulators and preconditioners for each dim on 1st step
            if 'accumulators' not in state:

                if not precondition_1d and t.ndim <= 1:
                    state['accumulators'] = []

                else:
                    state['accumulators'] = [torch.eye(s, dtype=t.dtype, device=t.device) if 1<s<max_dim else None for s in t.shape]
                    state['preconditioners'] = [torch.eye(s, dtype=t.dtype, device=t.device) if 1<s<max_dim else None for s in t.shape]

                # either scalar parameter, 1d with precondition_1d=False, or too big, then basic diagonal preconditioner is used.
                if all(i is None for i in state['accumulators']):
                    state['diagonal_accumulator'] = torch.zeros_like(t)

                state['step'] = 0

            # update preconditioners
            if 'diagonal_accumulator' in state:
                update_diagonal_(t, state['diagonal_accumulator'], beta)
            else:
                update_shampoo_preconditioner_(
                    t,
                    accumulators_=state['accumulators'],
                    preconditioners_=state['preconditioners'],
                    step=state['step'],
                    update_freq=update_freq,
                    exp_override=exp_override,
                    beta=beta,
                    reg=reg,
                )

        # inner step
        if 'inner' in self.children:
            tensors = apply_transform(self.children['inner'], tensors, params=params, grads=grads)

            # have to merge small dims again
            merged_tensors = [] # target with merged dims
            for i,(t,state, setting) in enumerate(zip(tensors, states, settings)):
                if setting['merge_small']:
                    t, state['flat_sizes'], state['sort_idxs'] = _merge_small_dims(t, setting['max_dim'])
                merged_tensors.append(t)

        # precondition
        for i,(t,state, setting) in enumerate(zip(merged_tensors, states, settings)):
            decay, merge_small, adagrad_eps= itemgetter('decay', 'merge_small', 'adagrad_eps')(setting)

            if 'diagonal_accumulator' in state:
                tensors[i] = apply_diagonal_(t, state['diagonal_accumulator'], decay=decay, eps=adagrad_eps)
            else:
                tensors[i] = apply_shampoo_preconditioner(t, preconditioners_=state['preconditioners'], decay=decay)

            if merge_small:
                tensors[i] = _unmerge_small_dims(tensors[i], state['flat_sizes'], state['sort_idxs'])

            state['step'] += 1

        return tensors

SignConsistencyLRs

Bases: torchzero.core.transform.Transform

Outputs per-weight learning rates based on consecutive sign consistency.

The learning rate for a weight is multiplied by :code:`nplus` when two consecutive update signs are the same, otherwise it is multiplied by :code:`nminus`. The learning rates are bounded to the :code:`(lb, ub)` range.

Examples:

GD scaled by consecutive gradient sign consistency

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.Mul(tz.m.SignConsistencyLRs()),
        tz.m.LR(1e-2)
    )
Source code in torchzero/modules/adaptive/rprop.py
class SignConsistencyLRs(Transform):
    """Outputs per-weight learning rates based on consecutive sign consistency.

    The learning rate for a weight is multiplied by :code:`nplus` when two consecutive update signs are the same, otherwise it is multiplied by :code:`nminus`. The learning rates are bounded to the :code:`(lb, ub)` range.

    Examples:

        GD scaled by consecutive gradient sign consistency

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.Mul(tz.m.SignConsistencyLRs()),
                tz.m.LR(1e-2)
            )

    """
    def __init__(
        self,
        nplus: float = 1.2,
        nminus: float = 0.5,
        lb: float | None = 1e-6,
        ub: float | None = 50,
        alpha: float = 1,
        target: Target = 'update'
    ):
        defaults = dict(nplus = nplus, nminus = nminus, alpha = alpha, lb = lb, ub = ub)
        super().__init__(defaults, uses_grad=False, target = target)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        target = as_tensorlist(tensors)
        nplus, nminus, lb, ub = unpack_dicts(settings, 'nplus', 'nminus', 'lb', 'ub', cls=NumberList)
        prev, lrs = unpack_states(states, tensors, 'prev', 'lrs', cls=TensorList)

        if step == 0:
            lrs.set_(target.full_like([s['alpha'] for s in settings]))

        target = sign_consistency_lrs_(
            tensors = target,
            prev_ = prev,
            lrs_ = lrs,
            nplus = nplus,
            nminus = nminus,
            lb = lb,
            ub = ub,
            step = step,
        )
        return target.clone()

SignConsistencyMask

Bases: torchzero.core.transform.Transform

Outputs a mask of sign consistency of current and previous inputs.

The output is 0 for weights where input sign changed compared to previous input, 1 otherwise.

Examples:

GD that skips update for weights where gradient sign changed compared to previous gradient.

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.Mul(tz.m.SignConsistencyMask()),
        tz.m.LR(1e-2)
    )
Source code in torchzero/modules/adaptive/rprop.py
class SignConsistencyMask(Transform):
    """
    Outputs a mask of sign consistency of current and previous inputs.

    The output is 0 for weights where input sign changed compared to previous input, 1 otherwise.

    Examples:

        GD that skips update for weights where gradient sign changed compared to previous gradient.

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.Mul(tz.m.SignConsistencyMask()),
                tz.m.LR(1e-2)
            )

    """
    def __init__(self,target: Target = 'update'):
        super().__init__({}, uses_grad=False, target = target)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        prev = unpack_states(states, tensors, 'prev', cls=TensorList)
        mask = prev.mul_(tensors).gt_(0)
        prev.copy_(tensors)
        return mask

SophiaH

Bases: torchzero.core.module.Module

SophiaH optimizer from https://arxiv.org/abs/2305.14342

This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized Hessian diagonal estimates, and the update is aggressively clipped.

.. note:: In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply SophiaH preconditioning to another module's output.

.. note:: If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".

.. note:: This module requires a closure to be passed to the optimizer step, as it needs to re-evaluate the loss and gradients for calculating HVPs. The closure must accept a ``backward`` argument (refer to the documentation).

Parameters:

  • beta1 (float, default: 0.96 ) –

    first momentum. Defaults to 0.96.

  • beta2 (float, default: 0.99 ) –

    momentum for hessian diagonal estimate. Defaults to 0.99.

  • update_freq (int, default: 10 ) –

    frequency of updating hessian diagonal estimate via a hessian-vector product. Defaults to 10.

  • precond_scale (float, default: 1 ) –

    scale of the preconditioner. Defaults to 1.

  • clip (float, default: 1 ) –

    clips update to (-clip, clip). Defaults to 1.

  • eps (float, default: 1e-12 ) –

    clips the Hessian diagonal estimate to be no less than this value. Defaults to 1e-12.

  • hvp_method (str, default: 'autograd' ) –

    Determines how Hessian-vector products are evaluated.

    • "autograd": Use PyTorch's autograd to calculate exact HVPs. This requires creating a graph for the gradient.
    • "forward": Use a forward finite difference formula to approximate the HVP. This requires one extra gradient evaluation.
    • "central": Use a central finite difference formula for a more accurate HVP approximation. This requires two extra gradient evaluations. Defaults to "autograd".
  • fd_h (float, default: 0.001 ) –

    finite difference step size if :code:`hvp_method` is "forward" or "central". Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of hessian-vector products with random vectors to evaluate each time when updating the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.

  • seed (int | None, default: None ) –

    seed for random vectors. Defaults to None.

  • inner (Chainable | None, default: None ) –

    preconditioning is applied to the output of this module. Defaults to None.

Examples:

Using SophiaH:

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.SophiaH(),
        tz.m.LR(0.1)
    )

The SophiaH preconditioner can be applied to any other module by passing it to the :code:`inner` argument. Turn off SophiaH's first momentum to get just the preconditioning. Here is an example of applying SophiaH preconditioning to Nesterov momentum (:code:`tz.m.NAG`):

.. code-block:: python

    opt = tz.Modular(
        model.parameters(),
        tz.m.SophiaH(beta1=0, inner=tz.m.NAG(0.96)),
        tz.m.LR(0.1)
    )
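
Since SophiaH re-evaluates the loss and gradients for Hessian-vector products, the optimizer step needs a closure that accepts a ``backward`` argument. A hedged sketch of what such a closure might look like (the exact convention, and names like ``criterion``, are assumptions; refer to the library documentation):

.. code-block:: python

    # hedged sketch: a closure accepting a `backward` flag (assumed convention)
    def closure(backward=True):
        loss = criterion(model(inputs), targets)
        if backward:
            opt.zero_grad()   # assumes the usual torch.optim-style interface
            loss.backward()
        return loss

    loss = opt.step(closure)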
Source code in torchzero/modules/adaptive/sophia_h.py
class SophiaH(Module):
    """SophiaH optimizer from https://arxiv.org/abs/2305.14342

    This is similar to Adam, but the second momentum is replaced by an exponential moving average of randomized Hessian diagonal estimates, and the update is aggressively clipped.

    .. note::
        In most cases SophiaH should be the first module in the chain because it relies on autograd. Use the :code:`inner` argument if you wish to apply SophiaH preconditioning to another module's output.

    .. note::
        If you are using gradient estimators or reformulations, set :code:`hvp_method` to "forward" or "central".

    .. note::
        This module requires a closure to be passed to the optimizer step,
        as it needs to re-evaluate the loss and gradients for calculating HVPs.
        The closure must accept a ``backward`` argument (refer to documentation).

    Args:
        beta1 (float, optional): first momentum. Defaults to 0.96.
        beta2 (float, optional): momentum for hessian diagonal estimate. Defaults to 0.99.
        update_freq (int, optional):
            frequency of updating hessian diagonal estimate via a hessian-vector product. Defaults to 10.
        precond_scale (float, optional):
            scale of the preconditioner. Defaults to 1.
        clip (float, optional):
            clips update to (-clip, clip). Defaults to 1.
        eps (float, optional):
            clips the Hessian diagonal estimate to be no less than this value. Defaults to 1e-12.
        hvp_method (str, optional):
            Determines how Hessian-vector products are evaluated.

            - ``"autograd"``: Use PyTorch's autograd to calculate exact HVPs.
              This requires creating a graph for the gradient.
            - ``"forward"``: Use a forward finite difference formula to
              approximate the HVP. This requires one extra gradient evaluation.
            - ``"central"``: Use a central finite difference formula for a
              more accurate HVP approximation. This requires two extra
              gradient evaluations.
            Defaults to "autograd".
        fd_h (float, optional): finite difference step size if :code:`hvp_method` is "forward" or "central". Defaults to 1e-3.
        n_samples (int, optional):
            number of hessian-vector products with random vectors to evaluate each time when updating
            the preconditioner. Larger values may lead to better hessian diagonal estimate. Defaults to 1.
        seed (int | None, optional): seed for random vectors. Defaults to None.
        inner (Chainable | None, optional): preconditioning is applied to the output of this module. Defaults to None.

    Examples:
        Using SophiaH:

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.SophiaH(),
                tz.m.LR(0.1)
            )

        SophiaH preconditioner can be applied to any other module by passing it to the :code:`inner` argument.
        Turn off SophiaH's first momentum to get just the preconditioning. Here is an example of applying
        SophiaH preconditioning to nesterov momentum (:code:`tz.m.NAG`):

        .. code-block:: python

            opt = tz.Modular(
                model.parameters(),
                tz.m.SophiaH(beta1=0, inner=tz.m.NAG(0.96)),
                tz.m.LR(0.1)
            )

    """
    def __init__(
        self,
        beta1: float = 0.96,
        beta2: float = 0.99,
        update_freq: int = 10,
        precond_scale: float = 1,
        clip: float = 1,
        eps: float = 1e-12,
        hvp_method: Literal['autograd', 'forward', 'central'] = 'autograd',
        fd_h: float = 1e-3,
        n_samples = 1,
        seed: int | None = None,
        inner: Chainable | None = None
    ):
        defaults = dict(beta1=beta1, beta2=beta2, update_freq=update_freq, precond_scale=precond_scale, clip=clip, eps=eps, hvp_method=hvp_method, n_samples=n_samples, fd_h=fd_h, seed=seed)
        super().__init__(defaults)

        if inner is not None:
            self.set_child('inner', inner)

    @torch.no_grad
    def step(self, var):
        params = var.params
        settings = self.settings[params[0]]
        hvp_method = settings['hvp_method']
        fd_h = settings['fd_h']
        update_freq = settings['update_freq']
        n_samples = settings['n_samples']

        seed = settings['seed']
        generator = None
        if seed is not None:
            if 'generator' not in self.global_state:
                self.global_state['generator'] = torch.Generator(params[0].device).manual_seed(seed)
            generator = self.global_state['generator']

        beta1, beta2, precond_scale, clip, eps = self.get_settings(params,
            'beta1', 'beta2', 'precond_scale', 'clip', 'eps', cls=NumberList)

        exp_avg, h_exp_avg = self.get_state(params, 'exp_avg', 'h_exp_avg', cls=TensorList)

        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        closure = var.closure
        assert closure is not None

        h = None
        if step % update_freq == 0:

            rgrad=None
            for i in range(n_samples):
                u = [torch.randn(p.shape, device=p.device, dtype=p.dtype, generator=generator) for p in params]

                Hvp, rgrad = self.Hvp(u, at_x0=True, var=var, rgrad=rgrad, hvp_method=hvp_method,
                                     h=fd_h, normalize=True, retain_grad=i < n_samples-1)
                Hvp = tuple(Hvp)

                if h is None: h = Hvp
                else: torch._foreach_add_(h, Hvp)

            assert h is not None
            if n_samples > 1: torch._foreach_div_(h, n_samples)

        update = var.get_update()
        if 'inner' in self.children:
            update = apply_transform(self.children['inner'], tensors=update, params=params, grads=var.grad, var=var)

        var.update = sophia_H(
            tensors=TensorList(update),
            h=TensorList(h) if h is not None else None,
            exp_avg_=exp_avg,
            h_exp_avg_=h_exp_avg,
            beta1=beta1,
            beta2=beta2,
            update_freq=update_freq,
            precond_scale=precond_scale,
            clip=clip,
            eps=eps,
            step=step,
        )
        return var

orthogonalize_grads_

orthogonalize_grads_(params: Iterable[Tensor], steps: int = 5, dual_norm_correction=False, method: Literal['newton-schulz', 'svd'] = 'newton-schulz')

Uses Newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.

This sets gradients in-place. Applies along first 2 dims (expected to be out_channels, in_channels).

Note that the Muon page says that embeddings and classifier heads should not be orthogonalized.

Parameters:

  • params (Iterable[Tensor]) –

    parameters that hold gradients to orthogonalize.

  • steps (int, default: 5 ) –

    The number of Newton-Schulz iterations to run. Defaults to 5.

  • dual_norm_correction (bool, default: False ) –

    enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.

  • method (str, default: 'newton-schulz' ) –

    Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.

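A hedged usage sketch (training-loop details and the ``hidden_weights`` collection are illustrative assumptions): orthogonalize gradients in place after the backward pass and before the optimizer step, excluding embeddings and classifier heads as the note above suggests:

.. code-block:: python

    # illustrative sketch; `hidden_weights` is an assumed list of 2D+ hidden-layer weights
    loss = criterion(model(inputs), targets)
    loss.backward()
    orthogonalize_grads_(hidden_weights, steps=5)  # embeddings / classifier head excluded
    optimizer.step()
    optimizer.zero_grad()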
Source code in torchzero/modules/adaptive/muon.py
def orthogonalize_grads_(
    params: Iterable[torch.Tensor],
    steps: int = 5,
    dual_norm_correction=False,
    method: Literal["newton-schulz", "svd"] = "newton-schulz",
):
    """Uses newton-Schulz iteration to compute the zeroth power / orthogonalization of gradients of an iterable of parameters.

    This sets gradients in-place. Applies along first 2 dims (expected to be `out_channels, in_channels`).

    Note that the Muon page says that embeddings and classifier heads should not be orthogonalized.
    Args:
        params (abc.Iterable[torch.Tensor]): parameters that hold gradients to orthogonalize.
        steps (int, optional):
            The number of Newton-Schulz iterations to run. Defaults to 5.
        dual_norm_correction (bool, optional):
            enables dual norm correction from https://github.com/leloykun/adaptive-muon. Defaults to False.
        method (str, optional):
            Newton-Schulz is very fast; SVD is extremely slow but can be slightly more precise.
    """
    for p in params:
        if (p.grad is not None) and _is_at_least_2d(p.grad):
            X = _orthogonalize_tensor(p.grad, steps, method)
            if dual_norm_correction: X = _dual_norm_correction(X, p.grad, batch_first=False)
            p.grad.set_(X.view_as(p)) # pyright:ignore[reportArgumentType]

orthograd_

orthograd_(params: Iterable[Tensor], eps: float = 1e-30)

Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.

Parameters:

  • params (Iterable[Tensor]) –

    parameters that hold gradients to apply ⟂Grad to.

  • eps (float, default: 1e-30 ) –

    epsilon added to the denominator for numerical stability (default: 1e-30)

Reference: https://arxiv.org/abs/2501.04697

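A hedged usage sketch (training-loop details are illustrative assumptions): project the gradients to be orthogonal to the weights right after the backward pass, then take a regular optimizer step:

.. code-block:: python

    # illustrative sketch of applying ⟂Grad inside a training loop
    loss = criterion(model(inputs), targets)
    loss.backward()
    orthograd_(model.parameters())
    optimizer.step()
    optimizer.zero_grad()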
Source code in torchzero/modules/adaptive/orthograd.py
def orthograd_(params: Iterable[torch.Tensor], eps: float = 1e-30):
    """Applies ⟂Grad - projects gradient of an iterable of parameters to be orthogonal to the weights.

    Args:
        params (abc.Iterable[torch.Tensor]): parameters that hold gradients to apply ⟂Grad to.
        eps (float, optional): epsilon added to the denominator for numerical stability (default: 1e-30)

    reference
        https://arxiv.org/abs/2501.04697
    """
    params = as_tensorlist(params).with_grad()
    grad = params.grad
    grad -= (params.dot(grad)/(params.dot(params) + eps)) * params