Gradient approximations

This subpackage contains modules that estimate the gradient using function values.

Classes:

  • FDM

    Approximate gradients via finite difference method.

  • ForwardGradient

    Forward gradient method.

  • GaussianSmoothing

    Gradient approximation via Gaussian smoothing method.

  • GradApproximator

    Base class for gradient approximations.

  • MeZO

    Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

  • RDSA

    Gradient approximation via Random-direction stochastic approximation (RDSA) method.

  • RandomizedFDM

    Gradient approximation via a randomized finite-difference method.

  • SPSA

    Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

FDM

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Approximate gradients via finite difference method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    magnitude of parameter perturbation. Defaults to 1e-3.

  • formula (Literal, default: 'central' ) –

    finite difference formula. Defaults to 'central'.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to 'closure'.
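
As a rough illustration of what `h` and `formula` control, here is a minimal pure-Python sketch of coordinate-wise forward and central differences. The names `fd_gradient` and `quadratic` are hypothetical and not part of torchzero, and the library's own formulas may differ in detail.

```python
def fd_gradient(f, x, h=1e-3, formula="central"):
    """Estimate the gradient of f at x by perturbing one coordinate at a time."""
    x = list(x)
    grad = []
    # the forward formula reuses a single evaluation at the unperturbed point
    f0 = f(x) if formula == "forward" else None
    for i in range(len(x)):
        orig = x[i]
        if formula == "forward":
            x[i] = orig + h
            d = (f(x) - f0) / h
        elif formula == "central":
            x[i] = orig + h
            f_plus = f(x)
            x[i] = orig - h
            f_minus = f(x)
            d = (f_plus - f_minus) / (2 * h)
        else:
            raise ValueError(formula)
        x[i] = orig  # restore the coordinate before moving on
        grad.append(d)
    return grad


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2


point = [1.0, -2.0]  # the true gradient here is [2, -12]
g_central = fd_gradient(quadratic, point, formula="central")
g_forward = fd_gradient(quadratic, point, formula="forward")
```

In this sketch the central formula costs two evaluations per coordinate and is exact for quadratics up to rounding, while the forward formula costs one extra evaluation per coordinate and carries an error proportional to `h`.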

Examples:

Plain FDM:

fdm = tz.Modular(model.parameters(), tz.m.FDM(), tz.m.LR(1e-2))

Any gradient-based method can use FDM-estimated gradients.

fdm_ncg = tz.Modular(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)

Source code in torchzero/modules/grad_approximation/fdm.py
class FDM(GradApproximator):
    """Approximate gradients via finite difference method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): magnitude of parameter perturbation. Defaults to 1e-3.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        target (GradTarget, optional): what to set on var. Defaults to 'closure'.

    Examples:
    plain FDM:

    ```python
    fdm = tz.Modular(model.parameters(), tz.m.FDM(), tz.m.LR(1e-2))
    ```

    Any gradient-based method can use FDM-estimated gradients.
    ```python
    fdm_ncg = tz.Modular(
        model.parameters(),
        tz.m.FDM(),
        # set hvp_method to "forward" so that it
        # uses gradient difference instead of autograd
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```
    """
    def __init__(self, h: float=1e-3, formula: _FD_Formula = 'central', target: GradTarget = 'closure'):
        defaults = dict(h=h, formula=formula)
        super().__init__(defaults, target=target)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        grads = []
        loss_approx = None

        for p in params:
            g = torch.zeros_like(p)
            grads.append(g)

            settings = self.settings[p]
            h = settings['h']
            fd_fn = _FD_FUNCS[settings['formula']]

            p_flat = p.ravel(); g_flat = g.ravel()
            for i in range(len(p_flat)):
                loss, loss_approx, d = fd_fn(closure=closure, param=p_flat, idx=i, h=h, v_0=loss)
                g_flat[i] = d

        return grads, loss, loss_approx

ForwardGradient

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Forward gradient method.

This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.
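
The estimator described above can be sketched without autograd by using dual numbers for the Jacobian-vector product. This is a minimal sketch under stated assumptions: the `Dual` class and `forward_gradient` function are hypothetical illustrations, not torchzero code, and the module itself evaluates the product through autograd.

```python
import random


class Dual:
    """Minimal dual number (value + tangent) supporting + and *."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(
            self.value * other.value,
            self.value * other.tangent + self.tangent * other.value,
        )

    __rmul__ = __mul__


def forward_gradient(f, x, n_samples, rng):
    """Average (directional derivative along v) * v over random Gaussian v."""
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in x]
        out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
        d = out.tangent  # directional derivative of f along v
        for i, vi in enumerate(v):
            grad[i] += d * vi / n_samples
    return grad


def quadratic(x):
    return x[0] * x[0] + 3.0 * x[1] * x[1]


rng = random.Random(0)
estimate = forward_gradient(quadratic, [1.0, -2.0], n_samples=20000, rng=rng)
```

A single sample gives a noisy but unbiased estimate for Gaussian directions; averaging many samples, as `n_samples` does, brings it closer to the true gradient `[2, -12]`.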

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • distribution (Literal, default: 'gaussian' ) –

    distribution for random gradient samples. Defaults to "gaussian".

  • beta (float, default: 0 ) –

    If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • jvp_method (str, default: 'autograd' ) –

    how to calculate the Jacobian-vector product. Note that with 'forward' and 'central' this is equivalent to randomized finite difference. Defaults to 'autograd'.

  • h (float, default: 0.001 ) –

    finite difference step size, used if jvp_method is set to 'forward' or 'central'. Defaults to 1e-3.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.

Source code in torchzero/modules/grad_approximation/forward_gradient.py
class ForwardGradient(RandomizedFDM):
    """Forward gradient method.

    This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.


    Args:
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        distribution (Distributions, optional): distribution for random gradient samples. Defaults to "gaussian".
        beta (float, optional):
            If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        jvp_method (str, optional):
            how to calculate the Jacobian-vector product. Note that with `forward` and `central` this is equivalent to randomized finite difference. Defaults to 'autograd'.
        h (float, optional): finite difference step size, used if jvp_method is set to `forward` or `central`. Defaults to 1e-3.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
    """
    PRE_MULTIPLY_BY_H = False
    def __init__(
        self,
        n_samples: int = 1,
        distribution: Distributions = "gaussian",
        beta: float = 0,
        pre_generate = True,
        jvp_method: Literal['autograd', 'forward', 'central'] = 'autograd',
        h: float = 1e-3,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples, distribution=distribution, beta=beta, target=target, pre_generate=pre_generate, seed=seed)
        self.defaults['jvp_method'] = jvp_method

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        settings = self.settings[params[0]]
        n_samples = settings['n_samples']
        jvp_method = settings['jvp_method']
        h = settings['h']
        distribution = settings['distribution']
        default = [None]*n_samples
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))
        generator = self._get_generator(settings['seed'], params)

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]
            if prt[0] is None:
                prt = params.sample_like(distribution=distribution, variance=1, generator=generator)

            else: prt = TensorList(prt)

            if jvp_method == 'autograd':
                with torch.enable_grad():
                    loss, d = jvp(partial(closure, False), params=params, tangent=prt)

            elif jvp_method == 'forward':
                loss, d = jvp_fd_forward(partial(closure, False), params=params, tangent=prt, v_0=loss, normalize=True, h=h)

            elif jvp_method == 'central':
                loss_approx, d = jvp_fd_central(partial(closure, False), params=params, tangent=prt, normalize=True, h=h)

            else: raise ValueError(jvp_method)

            if grad is None: grad = prt * d
            else: grad += prt * d

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)
        return grad, loss, loss_approx

PRE_MULTIPLY_BY_H class-attribute

PRE_MULTIPLY_BY_H = False

GaussianSmoothing

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Gaussian smoothing method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.01 ) –

    finite difference step size. Defaults to 1e-2.

  • n_samples (int, default: 100 ) –

    number of random gradient samples. Defaults to 100.

  • formula (Literal, default: 'forward2' ) –

    finite difference formula. Defaults to 'forward2'.

  • distribution (Literal, default: 'gaussian' ) –

    distribution for random perturbations. Defaults to "gaussian".

  • beta (float, default: 0 ) –

    If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".
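
A minimal sketch of the idea behind these parameters, assuming a forward-difference estimator averaged over Gaussian directions. The function name `gaussian_smoothing_gradient` is hypothetical and the code is not the torchzero implementation.

```python
import random


def gaussian_smoothing_gradient(f, x, h=1e-2, n_samples=100, seed=0):
    """Average forward-difference estimates along random Gaussian directions."""
    rng = random.Random(seed)
    f0 = f(x)  # evaluated once and reused for every sample
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + h * ui for xi, ui in zip(x, u)]
        d = (f(shifted) - f0) / h  # approximate directional derivative along u
        for i, ui in enumerate(u):
            grad[i] += d * ui / n_samples
    return grad


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2


estimate = gaussian_smoothing_gradient(quadratic, [1.0, -2.0], n_samples=20000)
repeat = gaussian_smoothing_gradient(quadratic, [1.0, -2.0], n_samples=20000)
```

With a forward formula each sample costs one extra function evaluation, which is why a large `n_samples` is affordable here. Fixing the seed makes the estimate reproducible.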

References

Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf

Source code in torchzero/modules/grad_approximation/rfdm.py
class GaussianSmoothing(RandomizedFDM):
    """
    Gradient approximation via Gaussian smoothing method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-2.
        n_samples (int, optional): number of random gradient samples. Defaults to 100.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'forward2'.
        distribution (Distributions, optional): distribution. Defaults to "gaussian".
        beta (float, optional):
            If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".


    References:
        Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
    """
    def __init__(
        self,
        h: float = 1e-2,
        n_samples: int = 100,
        formula: _FD_Formula = "forward2",
        distribution: Distributions = "gaussian",
        beta: float = 0,
        pre_generate = True,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples,formula=formula,distribution=distribution,beta=beta,pre_generate=pre_generate,target=target,seed=seed)

GradApproximator

Bases: torchzero.core.module.Module, abc.ABC

Base class for gradient approximations. This is an abstract class; to use it, subclass it and override approximate.

GradApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.

Parameters:

  • defaults (dict[str, Any] | None, default: None ) –

    dict with defaults. Defaults to None.

  • target (str, default: 'closure' ) –

    whether to set var.grad, var.update or var.closure. Defaults to 'closure'.

Example:

Basic SPSA method implementation.

class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus
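
The division by `2*(h**2)` in the example above may look unusual. Because each perturbation already carries a factor of `h`, multiplying it by `d` yields the usual central-difference estimate along the sign vector. A small pure-Python check of that arithmetic, with hypothetical names and no torchzero dependency:

```python
import random


def spsa_two_ways(f, x, h=1e-3, seed=0):
    """Return the SPSA estimate computed as in the example and in textbook form."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in x]
    pert = [s * h for s in signs]  # perturbation already scaled by h
    f_plus = f([xi + pi for xi, pi in zip(x, pert)])
    f_minus = f([xi - pi for xi, pi in zip(x, pert)])
    # form used in the example: scaled perturbation times diff / (2 h^2)
    d = (f_plus - f_minus) / (2 * h ** 2)
    scaled = [pi * d for pi in pert]
    # textbook form: sign times diff / (2 h)
    reference = [s * (f_plus - f_minus) / (2 * h) for s in signs]
    return scaled, reference


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2


scaled, reference = spsa_two_ways(quadratic, [1.0, -2.0])
```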

Methods:

  • approximate

    Returns a tuple (grad, loss, loss_approx). Make sure this resets parameters to their original values!

  • pre_step

    This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points.

Source code in torchzero/modules/grad_approximation/grad_approximator.py
class GradApproximator(Module, ABC):
    """Base class for gradient approximations.
    This is an abstract class, to use it, subclass it and override `approximate`.

    GradApproximator modifies the closure to evaluate the estimated gradients,
    and further closure-based modules will use the modified closure.

    Args:
        defaults (dict[str, Any] | None, optional): dict with defaults. Defaults to None.
        target (str, optional):
            whether to set `var.grad`, `var.update` or `var.closure`. Defaults to 'closure'.

    Example:

    Basic SPSA method implementation.
    ```python
    class SPSA(GradApproximator):
        def __init__(self, h=1e-3):
            defaults = dict(h=h)
            super().__init__(defaults)

        @torch.no_grad
        def approximate(self, closure, params, loss):
            perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

            # evaluate params + perturbation
            torch._foreach_add_(params, perturbation)
            loss_plus = closure(False)

            # evaluate params - perturbation
            torch._foreach_sub_(params, perturbation)
            torch._foreach_sub_(params, perturbation)
            loss_minus = closure(False)

            # restore original params
            torch._foreach_add_(params, perturbation)

            # calculate SPSA gradients
            spsa_grads = []
            for p, pert in zip(params, perturbation):
                settings = self.settings[p]
                h = settings['h']
                d = (loss_plus - loss_minus) / (2*(h**2))
                spsa_grads.append(pert * d)

            # returns tuple: (grads, loss, loss_approx)
            # loss must be with initial parameters
            # since we only evaluated loss with perturbed parameters
            # we only have loss_approx
            return spsa_grads, None, loss_plus
    ```
    """
    def __init__(self, defaults: dict[str, Any] | None = None, target: GradTarget = 'closure'):
        super().__init__(defaults)
        self._target: GradTarget = target

    @abstractmethod
    def approximate(self, closure: Callable, params: list[torch.Tensor], loss: torch.Tensor | None) -> tuple[Iterable[torch.Tensor], torch.Tensor | None, torch.Tensor | None]:
        """Returns a tuple: ``(grad, loss, loss_approx)``, make sure this resets parameters to their original values!"""

    def pre_step(self, var: Var) -> None:
        """This runs once before each step, whereas `approximate` may run multiple times per step if further modules
        evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations."""

    @torch.no_grad
    def step(self, var):
        self.pre_step(var)

        if var.closure is None: raise RuntimeError("Gradient approximation requires closure")
        params, closure, loss = var.params, var.closure, var.loss

        if self._target == 'closure':

            def approx_closure(backward=True):
                if backward:
                    # set loss to None because closure might be evaluated at different points
                    grad, l, l_approx = self.approximate(closure=closure, params=params, loss=None)
                    for p, g in zip(params, grad): p.grad = g
                    return l if l is not None else closure(False)
                return closure(False)

            var.closure = approx_closure
            return var

        # if var.grad is not None:
        #     warnings.warn('Using grad approximator when `var.grad` is already set.')
        grad,loss,loss_approx = self.approximate(closure=closure, params=params, loss=loss)
        if loss_approx is not None: var.loss_approx = loss_approx
        if loss is not None: var.loss = var.loss_approx = loss
        if self._target == 'grad': var.grad = list(grad)
        elif self._target == 'update': var.update = list(grad)
        else: raise ValueError(self._target)
        return var

approximate

approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]

Returns a tuple (grad, loss, loss_approx). Make sure this resets parameters to their original values!

Source code in torchzero/modules/grad_approximation/grad_approximator.py
@abstractmethod
def approximate(self, closure: Callable, params: list[torch.Tensor], loss: torch.Tensor | None) -> tuple[Iterable[torch.Tensor], torch.Tensor | None, torch.Tensor | None]:
    """Returns a tuple: ``(grad, loss, loss_approx)``, make sure this resets parameters to their original values!"""

pre_step

pre_step(var: Var) -> None

This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations.
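
The split between the two hooks can be sketched with a toy class that draws its random direction in `pre_step` and only reads it in `approximate`. `ToyApproximator` is a hypothetical stand-in written in plain Python; it does not subclass or reproduce the torchzero base class.

```python
import random


class ToyApproximator:
    """Hypothetical stand-in showing the pre_step / approximate lifecycle."""

    def __init__(self, h=1e-3, seed=0):
        self.h = h
        self.rng = random.Random(seed)
        self.signs = None

    def pre_step(self, n_params):
        # runs once per optimizer step: draw a fresh Rademacher direction
        self.signs = [self.rng.choice((-1.0, 1.0)) for _ in range(n_params)]

    def approximate(self, f, x):
        # may run several times per step; always reuses the stored direction
        h = self.h
        f_plus = f([xi + h * s for xi, s in zip(x, self.signs)])
        f_minus = f([xi - h * s for xi, s in zip(x, self.signs)])
        d = (f_plus - f_minus) / (2 * h)
        return [d * s for s in self.signs]


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2


approx = ToyApproximator()
approx.pre_step(n_params=2)
signs_used = list(approx.signs)
g_a = approx.approximate(quadratic, [1.0, -2.0])  # first evaluation point
g_b = approx.approximate(quadratic, [1.5, -2.0])  # second point, same direction
```

Both estimates lie along the same sign vector because `approximate` never redraws it; only the next call to `pre_step` would.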

Source code in torchzero/modules/grad_approximation/grad_approximator.py
def pre_step(self, var: Var) -> None:
    """This runs once before each step, whereas `approximate` may run multiple times per step if further modules
    evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations."""

MeZO

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central2' ) –

    finite difference formula. Defaults to 'central2'.

  • distribution (Literal, default: 'rademacher' ) –

    distribution for random perturbations. Defaults to "rademacher".

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".
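
The memory saving comes from regenerating each perturbation from a seed instead of storing it. The source below derives the seed from the current step; the following pure-Python sketch mimics that idea with hypothetical names (`perturbation_stream`, `mezo_style_gradient`) and a list of floats in place of tensors.

```python
import random


def perturbation_stream(n, seed, h):
    """Yield the same h-scaled Rademacher perturbation whenever the seed is the same."""
    rng = random.Random(seed)
    for _ in range(n):
        yield h * rng.choice((-1.0, 1.0))


def mezo_style_gradient(f, x, step, h=1e-3):
    """Central difference along a perturbation that is regenerated, never stored."""
    seed = 1_000_000 * step  # seed derived from the step (single sample, index 0)
    for i, z in enumerate(perturbation_stream(len(x), seed, h)):
        x[i] += z  # move to x + z in place
    f_plus = f(x)
    for i, z in enumerate(perturbation_stream(len(x), seed, h)):
        x[i] -= 2.0 * z  # move to x - z in place
    f_minus = f(x)
    for i, z in enumerate(perturbation_stream(len(x), seed, h)):
        x[i] += z  # restore the original parameters
    d = (f_plus - f_minus) / (2.0 * h)
    return [d * z / h for z in perturbation_stream(len(x), seed, h)]


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2


params = [1.0, -2.0]
grad = mezo_style_gradient(quadratic, params, step=3)
```

The perturbation is produced four times from the same seed, trading a little extra random number generation for not keeping a second copy of the parameters.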

References

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333

Source code in torchzero/modules/grad_approximation/rfdm.py
class MeZO(GradApproximator):
    """Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central2'.
        distribution (Distributions, optional): distribution for random perturbations. Defaults to "rademacher".
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
    """

    def __init__(self, h: float=1e-3, n_samples: int = 1, formula: _FD_Formula = 'central2',
                 distribution: Distributions = 'rademacher', target: GradTarget = 'closure'):

        defaults = dict(h=h, formula=formula, n_samples=n_samples, distribution=distribution)
        super().__init__(defaults, target=target)

    def _seeded_perturbation(self, params: list[torch.Tensor], distribution, seed, h):
        prt = TensorList(params).sample_like(
            distribution=distribution,
            variance=h,
            generator=torch.Generator(params[0].device).manual_seed(seed)
        )
        return prt

    def pre_step(self, var):
        h = NumberList(self.settings[p]['h'] for p in var.params)

        n_samples = self.defaults['n_samples']
        distribution = self.defaults['distribution']

        step = var.current_step

        # create functions that generate a deterministic perturbation from seed based on current step
        prt_fns = []
        for i in range(n_samples):

            prt_fn = partial(self._seeded_perturbation, params=var.params, distribution=distribution, seed=1_000_000*step + i, h=h)
            prt_fns.append(prt_fn)

        self.global_state['prt_fns'] = prt_fns

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        h = NumberList(self.settings[p]['h'] for p in params)
        settings = self.settings[params[0]]
        n_samples = settings['n_samples']
        fd_fn = _RFD_FUNCS[settings['formula']]
        prt_fns = self.global_state['prt_fns']

        grad = None
        for i in range(n_samples):
            loss, loss_approx, d = fd_fn(closure=closure, params=params, p_fn=prt_fns[i], h=h, f_0=loss)
            if grad is None: grad = prt_fns[i]().mul_(d)
            else: grad += prt_fns[i]().mul_(d)

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)
        return grad, loss, loss_approx

RDSA

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Random-direction stochastic approximation (RDSA) method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central2' ) –

    finite difference formula. Defaults to 'central2'.

  • distribution (Literal, default: 'gaussian' ) –

    distribution for random perturbations. Defaults to "gaussian".

  • beta (float, default: 0 ) –

    If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

Source code in torchzero/modules/grad_approximation/rfdm.py
class RDSA(RandomizedFDM):
    """
    Gradient approximation via Random-direction stochastic approximation (RDSA) method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central2'.
        distribution (Distributions, optional): distribution. Defaults to "gaussian".
        beta (float, optional):
            If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

    """
    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        formula: _FD_Formula = "central2",
        distribution: Distributions = "gaussian",
        beta: float = 0,
        pre_generate = True,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples,formula=formula,distribution=distribution,beta=beta,pre_generate=pre_generate,target=target,seed=seed)

RandomizedFDM

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Gradient approximation via a randomized finite-difference method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central' ) –

    finite difference formula. Defaults to 'central'.

  • distribution (Literal, default: 'rademacher' ) –

    distribution for random perturbations. Defaults to "rademacher".

  • beta (float, default: 0 ) –

    optional momentum for generated perturbations. If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".
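
How `distribution`, `n_samples` and `h` interact can be sketched with a small central-difference estimator in plain Python. The names `sample_direction` and `randomized_fd` are hypothetical, and this is an illustration of the general technique rather than the torchzero implementation.

```python
import random


def sample_direction(n, distribution, rng):
    """Draw one random direction from the named distribution."""
    if distribution == "rademacher":
        return [rng.choice((-1.0, 1.0)) for _ in range(n)]
    if distribution == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    raise ValueError(distribution)


def randomized_fd(f, x, h=1e-3, n_samples=1, distribution="rademacher", seed=0):
    """Average central-difference estimates along n_samples random directions."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = sample_direction(len(x), distribution, rng)
        f_plus = f([xi + h * ui for xi, ui in zip(x, u)])
        f_minus = f([xi - h * ui for xi, ui in zip(x, u)])
        d = (f_plus - f_minus) / (2 * h)  # directional derivative along u
        for i, ui in enumerate(u):
            grad[i] += d * ui / n_samples
    return grad


def quadratic(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2


single = randomized_fd(quadratic, [1.0, -2.0])
spsa_like = randomized_fd(quadratic, [1.0, -2.0], n_samples=20000)
rdsa_like = randomized_fd(quadratic, [1.0, -2.0], n_samples=20000, distribution="gaussian")
```

A single sample costs two function evaluations regardless of dimension, but it only recovers the gradient component along one random direction. Averaging more samples reduces the variance at proportional cost.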

Examples:

Simultaneous perturbation stochastic approximation (SPSA) method

SPSA is randomized finite difference with the Rademacher distribution and the central formula.

spsa = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
    tz.m.LR(1e-2)
)
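
The practical appeal of SPSA is its cost: one sample needs two function evaluations regardless of the number of parameters, whereas classical coordinate-wise central differences need two per parameter. The sketch below counts evaluations for both. It is an illustration in plain Python, and `CountingFunction`, `coordinate_fd_grad`, `spsa_grad` and `sphere` are hypothetical helpers, not torchzero APIs.

```python
import random

class CountingFunction:
    """Wrap a function and count how many times it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)

def coordinate_fd_grad(f, x, h=1e-3):
    """Central differences along each coordinate axis: 2 * len(x) evaluations."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h
        x_minus = list(x)
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def spsa_grad(f, x, h=1e-3, rng=random):
    """One SPSA sample: 2 evaluations regardless of len(x)."""
    u = [rng.choice((-1.0, 1.0)) for _ in x]
    d = (f([xi + h * ui for xi, ui in zip(x, u)])
         - f([xi - h * ui for xi, ui in zip(x, u)])) / (2 * h)
    return [d * ui for ui in u]

def sphere(x):
    return sum(xi * xi for xi in x)

x0 = [0.5] * 50
f_coord = CountingFunction(sphere)
f_spsa = CountingFunction(sphere)
coordinate_fd_grad(f_coord, x0)
spsa_grad(f_spsa, x0, rng=random.Random(0))
print(f_coord.calls, f_spsa.calls)  # prints "100 2"
```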

Random-direction stochastic approximation (RDSA) method

RDSA is randomized finite difference, usually with a Gaussian distribution, and the central formula.

rdsa = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
    tz.m.LR(1e-2)
)

RandomizedFDM with momentum

Momentum might help by reducing the variance of the estimated gradients.

momentum_spsa = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-3)
)

Gaussian smoothing method

GS uses many gaussian samples with possibly a larger finite difference step size.

gs = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
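
As a rough illustration of the Gaussian smoothing idea, the sketch below averages forward-difference estimates over many Gaussian directions and reuses the unperturbed function value for every sample, so `n_samples` directions cost `n_samples + 1` evaluations. This is a plain-Python sketch under those assumptions, not the library's code. `gaussian_smoothing_grad`, `linear` and `counter` are hypothetical names.

```python
import random

def gaussian_smoothing_grad(f, x, h=1e-1, n_samples=100, rng=None):
    """Forward-difference gradient estimate averaged over Gaussian directions."""
    rng = rng or random.Random()
    f0 = f(x)  # evaluated once and shared by all samples
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        # forward-difference directional derivative along u
        d = (f([xi + h * ui for xi, ui in zip(x, u)]) - f0) / h
        for i, ui in enumerate(u):
            grad[i] += d * ui / n_samples
    return grad

counter = {"calls": 0}

def linear(x):
    # hypothetical objective with true gradient [3, -2]
    counter["calls"] += 1
    return 3.0 * x[0] - 2.0 * x[1]

estimate = gaussian_smoothing_grad(linear, [0.0, 0.0], n_samples=5000, rng=random.Random(0))
print([round(g, 2) for g in estimate], counter["calls"])
```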

SPSA-NewtonCG

NewtonCG with hessian-vector product estimated via gradient difference calls closure multiple times per step. If each closure call estimates gradients with different perturbations, NewtonCG is unable to produce useful directions.

By setting pre_generate to True, perturbations are generated once before each step, and each closure call estimates gradients using the same pre-generated perturbations. This way closure-based algorithms are able to use gradients estimated in a consistent way.

opt = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=10, pre_generate=True),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
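
To see why consistent perturbations matter, consider a Hessian-vector product formed from the difference of two gradient estimates. The sketch below uses a small quadratic and hypothetical helpers (`fd_grad_along`, `hvp_forward`) in plain Python. With the same perturbation at both points, the difference is a meaningful projected curvature estimate. With different perturbations, it is dominated by the mismatch between the two estimates.

```python
def fd_grad_along(f, x, u, h=1e-3):
    """Central-difference gradient estimate using a given perturbation u."""
    d = (f([xi + h * ui for xi, ui in zip(x, u)])
         - f([xi - h * ui for xi, ui in zip(x, u)])) / (2 * h)
    return [d * ui for ui in u]

def hvp_forward(f, x, v, u_at_x, u_at_shifted, eps=1e-2):
    """Forward-difference Hessian-vector product from two gradient estimates."""
    g0 = fd_grad_along(f, x, u_at_x)
    x_shifted = [xi + eps * vi for xi, vi in zip(x, v)]
    g1 = fd_grad_along(f, x_shifted, u_at_shifted)
    return [(b - a) / eps for a, b in zip(g0, g1)]

def f(x):
    # Hessian is diag(2, 4)
    return x[0] ** 2 + 2.0 * x[1] ** 2

x = [1.0, 1.0]
v = [1.0, 0.0]
u = [1.0, -1.0]

# same perturbation for both closure calls (what pre-generation provides)
consistent = hvp_forward(f, x, v, u, u)
# a different perturbation for the second closure call
inconsistent = hvp_forward(f, x, v, u, [1.0, 1.0])

print("consistent:", [round(c, 6) for c in consistent])
print("inconsistent:", [round(c, 2) for c in inconsistent])
```

For this quadratic the consistent result is close to `[2, -2]`, which is `(u . Hv) u`, while the inconsistent one is two orders of magnitude larger and carries no curvature information.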

SPSA-LBFGS

LBFGS uses a memory of past parameter and gradient differences. If past gradients were estimated with different perturbations, LBFGS directions will be useless.

To alleviate this, momentum can be added to the random perturbations so that they change only slightly between steps and the history stays relevant. The momentum is determined by the beta parameter. The disadvantage is that the subspace the algorithm is able to explore changes slowly.

Additionally we will reset SPSA and LBFGS memory every 100 steps to remove influence from old gradient estimates.

opt = tz.Modular(
    model.parameters(),
    tz.m.ResetEvery(
        [tz.m.RandomizedFDM(n_samples=10, pre_generate=True, beta=0.99), tz.m.LBFGS()],
        steps = 100,
    ),
    tz.m.Backtracking()
)
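
The perturbation momentum can be pictured as a linear interpolation between the previous perturbation and a freshly sampled one, `beta * old + (1 - beta) * fresh`, which is what the lerp call in the source below computes. The following plain-Python sketch (with hypothetical helpers `lerp_perturbation` and `cosine`) shows how little the direction moves for `beta=0.99`.

```python
import random

def lerp_perturbation(old, fresh, beta):
    """Blend the previous perturbation with a fresh one: beta*old + (1-beta)*fresh."""
    return [beta * o + (1.0 - beta) * n for o, n in zip(old, fresh)]

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

rng = random.Random(0)
dim = 1000
prev = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
fresh = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
blended = lerp_perturbation(prev, fresh, beta=0.99)

# independent draws are nearly orthogonal, the blended one stays close to prev
print("prev vs fresh:  ", round(cosine(prev, fresh), 3))
print("prev vs blended:", round(cosine(prev, blended), 5))
```

Because consecutive directions are almost parallel, gradient differences stored by a method with memory remain comparable, at the price of exploring new directions slowly.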

Source code in torchzero/modules/grad_approximation/rfdm.py
class RandomizedFDM(GradApproximator):
    """Gradient approximation via a randomized finite-difference method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        distribution (Distributions, optional): distribution of the random perturbations. Defaults to "rademacher".
        beta (float, optional): optional momentum for generated perturbations.
            If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    Examples:
    #### Simultaneous perturbation stochastic approximation (SPSA) method

    SPSA is randomized finite difference with the Rademacher distribution and the central formula.
    ```py
    spsa = tz.Modular(
        model.parameters(),
        tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
        tz.m.LR(1e-2)
    )
    ```

    #### Random-direction stochastic approximation (RDSA) method

    RDSA is randomized finite difference, usually with a Gaussian distribution, and the central formula.

    ```
    rdsa = tz.Modular(
        model.parameters(),
        tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
        tz.m.LR(1e-2)
    )
    ```

    #### RandomizedFDM with momentum

    Momentum might help by reducing the variance of the estimated gradients.

    ```
    momentum_spsa = tz.Modular(
        model.parameters(),
        tz.m.RandomizedFDM(),
        tz.m.HeavyBall(0.9),
        tz.m.LR(1e-3)
    )
    ```

    #### Gaussian smoothing method

    GS uses many gaussian samples with possibly a larger finite difference step size.

    ```
    gs = tz.Modular(
        model.parameters(),
        tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```

    #### SPSA-NewtonCG

    NewtonCG with hessian-vector product estimated via gradient difference
    calls closure multiple times per step. If each closure call estimates gradients
    with different perturbations, NewtonCG is unable to produce useful directions.

    By setting pre_generate to True, perturbations are generated once before each step,
    and each closure call estimates gradients using the same pre-generated perturbations.
    This way closure-based algorithms are able to use gradients estimated in a consistent way.

    ```
    opt = tz.Modular(
        model.parameters(),
        tz.m.RandomizedFDM(n_samples=10, pre_generate=True),
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```

    #### SPSA-LBFGS

    LBFGS uses a memory of past parameter and gradient differences. If past gradients
    were estimated with different perturbations, LBFGS directions will be useless.

    To alleviate this, momentum can be added to the random perturbations so that they change only
    slightly between steps and the history stays relevant. The momentum is determined by the `beta` parameter.
    The disadvantage is that the subspace the algorithm is able to explore changes slowly.

    Additionally we will reset SPSA and LBFGS memory every 100 steps to remove influence from old gradient estimates.

    ```
    opt = tz.Modular(
        model.parameters(),
        tz.m.ResetEvery(
            [tz.m.RandomizedFDM(n_samples=10, pre_generate=True, beta=0.99), tz.m.LBFGS()],
            steps = 100,
        ),
        tz.m.Backtracking()
    )
    ```
    """
    PRE_MULTIPLY_BY_H = True
    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        formula: _FD_Formula = "central",
        distribution: Distributions = "rademacher",
        beta: float = 0,
        pre_generate = True,
        seed: int | None | torch.Generator = None,
        target: GradTarget = "closure",
    ):
        defaults = dict(h=h, formula=formula, n_samples=n_samples, distribution=distribution, beta=beta, pre_generate=pre_generate, seed=seed)
        super().__init__(defaults, target=target)

    def reset(self):
        self.state.clear()
        generator = self.global_state.get('generator', None) # avoid resetting generator
        self.global_state.clear()
        if generator is not None: self.global_state['generator'] = generator
        for c in self.children.values(): c.reset()

    def _get_generator(self, seed: int | None | torch.Generator, params: list[torch.Tensor]):
        if 'generator' not in self.global_state:
            if isinstance(seed, torch.Generator): self.global_state['generator'] = seed
            elif seed is not None: self.global_state['generator'] = torch.Generator(params[0].device).manual_seed(seed)
            else: self.global_state['generator'] = None
        return self.global_state['generator']

    def pre_step(self, var):
        h, beta = self.get_settings(var.params, 'h', 'beta')

        n_samples = self.defaults['n_samples']
        distribution = self.defaults['distribution']
        pre_generate = self.defaults['pre_generate']

        if pre_generate:
            params = TensorList(var.params)
            generator = self._get_generator(self.defaults['seed'], var.params)
            perturbations = [params.sample_like(distribution=distribution, variance=1, generator=generator) for _ in range(n_samples)]

            if self.PRE_MULTIPLY_BY_H:
                # perturbations are flattened sample-major, so repeat the per-parameter h values once per sample
                torch._foreach_mul_([p for l in perturbations for p in l], [v for _ in range(n_samples) for v in h])

            if all(i==0 for i in beta):
                # just use pre-generated perturbations
                for param, prt in zip(params, zip(*perturbations)):
                    self.state[param]['perturbations'] = prt

            else:
                # lerp old and new perturbations. This makes the subspace change gradually
                # which in theory might improve algorithms with history
                for i,p in enumerate(params):
                    state = self.state[p]
                    if 'perturbations' not in state: state['perturbations'] = [p[i] for p in perturbations]

                cur = [self.state[p]['perturbations'][:n_samples] for p in params]
                cur_flat = [p for l in cur for p in l]
                new_flat = [p for l in zip(*perturbations) for p in l]
                betas = [1-v for b in beta for v in [b]*n_samples]
                torch._foreach_lerp_(cur_flat, new_flat, betas)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        orig_params = params.clone() # store to avoid small changes due to float imprecision
        loss_approx = None

        h = NumberList(self.settings[p]['h'] for p in params)
        settings = self.settings[params[0]]
        n_samples = settings['n_samples']
        fd_fn = _RFD_FUNCS[settings['formula']]
        default = [None]*n_samples
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))
        distribution = settings['distribution']
        generator = self._get_generator(settings['seed'], params)

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]

            if prt[0] is None:
                prt = params.sample_like(distribution=distribution, generator=generator, variance=1).mul_(h)

            else: prt = TensorList(prt)

            loss, loss_approx, d = fd_fn(closure=closure, params=params, p_fn=lambda: prt, h=h, f_0=loss)
            # here `d` is a numberlist of directional derivatives, due to per parameter `h` values.

            # support for per-sample values which gives better estimate
            if d[0].numel() > 1: d = d.map(torch.mean)

            if grad is None: grad = prt * d
            else: grad += prt * d

        params.set_(orig_params)
        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)

        # mean if got per-sample values
        if loss is not None:
            if loss.numel() > 1:
                loss = loss.mean()

        if loss_approx is not None:
            if loss_approx.numel() > 1:
                loss_approx = loss_approx.mean()

        return grad, loss, loss_approx

PRE_MULTIPLY_BY_H class-attribute

PRE_MULTIPLY_BY_H = True

SPSA

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central' ) –

    finite difference formula. Defaults to 'central'.

  • distribution (Literal, default: 'rademacher' ) –

    distribution. Defaults to "rademacher".

  • beta (float, default: 0 ) –

    If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

Source code in torchzero/modules/grad_approximation/rfdm.py
class SPSA(RandomizedFDM):
    """
    Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        distribution (Distributions, optional): distribution. Defaults to "rademacher".
        beta (float, optional):
            If this is set to a value higher than zero, instead of using directional derivatives in a new random direction on each step, the direction changes gradually with momentum based on this value. This may make it possible to use methods with memory. Defaults to 0.
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
    """