Gradient approximations

This subpackage contains modules that estimate the gradient using function values.

Classes:

  • FDM

    Approximate gradients via finite difference method.

  • ForwardGradient

    Forward gradient method.

  • GaussianSmoothing

    Gradient approximation via Gaussian smoothing method.

  • GradApproximator

    Base class for gradient approximations.

  • MeZO

    Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

  • RDSA

    Gradient approximation via Random-direction stochastic approximation (RDSA) method.

  • RandomizedFDM

    Gradient approximation via a randomized finite-difference method.

  • SPSA

    Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

FDM

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Approximate gradients via finite difference method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    magnitude of parameter perturbation. Defaults to 1e-3.

  • formula (Literal, default: 'central' ) –

    finite difference formula. Defaults to 'central'.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to 'closure'.

Examples:

Plain FDM:

fdm = tz.Optimizer(model.parameters(), tz.m.FDM(), tz.m.LR(1e-2))

Any gradient-based method can use FDM-estimated gradients.

fdm_ncg = tz.Optimizer(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
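
To make the cost of coordinate-wise finite differences concrete, here is a minimal dependency-free sketch of a two-point central difference. It assumes that 'central' denotes the two-point formula; it is not torchzero's implementation, and `central_difference_gradient` and `quadratic` are hypothetical names used only for illustration.

```python
from typing import Callable, List


def central_difference_gradient(
    f: Callable[[List[float]], float], x: List[float], h: float = 1e-3
) -> List[float]:
    """Estimate the gradient of f at x, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        original = x[i]
        x[i] = original + h      # evaluate f at x + h * e_i
        f_plus = f(x)
        x[i] = original - h      # evaluate f at x - h * e_i
        f_minus = f(x)
        x[i] = original          # restore the coordinate before moving on
        grad.append((f_plus - f_minus) / (2 * h))
    return grad


def quadratic(x: List[float]) -> float:
    # f(x) = x0^2 + 3 * x1, so the exact gradient is (2 * x0, 3)
    return x[0] ** 2 + 3 * x[1]


point = [1.0, -2.0]
estimate = central_difference_gradient(quadratic, point)
```

In this sketch every coordinate costs two function evaluations, so a model with n scalar parameters needs 2n evaluations per gradient estimate. That is why the randomized approximators in this subpackage are usually the more practical choice for large models.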

Source code in torchzero/modules/grad_approximation/fdm.py
class FDM(GradApproximator):
    """Approximate gradients via finite difference method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): magnitude of parameter perturbation. Defaults to 1e-3.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        target (GradTarget, optional): what to set on var. Defaults to 'closure'.

    Examples:
    plain FDM:

    ```python
    fdm = tz.Optimizer(model.parameters(), tz.m.FDM(), tz.m.LR(1e-2))
    ```

    Any gradient-based method can use FDM-estimated gradients.
    ```python
    fdm_ncg = tz.Optimizer(
        model.parameters(),
        tz.m.FDM(),
        # set hvp_method to "forward" so that it
        # uses gradient difference instead of autograd
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```
    """
    def __init__(self, h: float=1e-3, formula: _FD_Formula = 'central', target: GradTarget = 'closure'):
        defaults = dict(h=h, formula=formula)
        super().__init__(defaults, target=target)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        grads = []
        loss_approx = None

        for p in params:
            g = torch.zeros_like(p)
            grads.append(g)

            settings = self.settings[p]
            h = settings['h']
            fd_fn = _FD_FUNCS[settings['formula']]

            p_flat = p.ravel(); g_flat = g.ravel()
            for i in range(len(p_flat)):
                loss, loss_approx, d = fd_fn(closure=closure, param=p_flat, idx=i, h=h, v_0=loss)
                g_flat[i] = d

        return grads, loss, loss_approx

ForwardGradient

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Forward gradient method.

This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • distribution (Literal, default: 'gaussian' ) –

    distribution for random gradient samples. Defaults to "gaussian".

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • jvp_method (str, default: 'autograd' ) –

    how to calculate the Jacobian-vector product. With 'forward' or 'central' this is equivalent to randomized finite difference. Defaults to 'autograd'.

  • h (float, default: 0.001 ) –

    finite difference step size, used only if jvp_method is set to 'forward' or 'central'. Defaults to 1e-3.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

References

Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
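
The forward gradient estimator can be sketched without any autograd library by using dual numbers for the Jacobian-vector product. This is only an illustration of the idea described above, under the assumption of Gaussian directions. `Dual`, `jvp`, `forward_gradient` and `objective` are hypothetical helpers, not torchzero or PyTorch APIs.

```python
import random


class Dual:
    """Number carrying a value and its directional derivative (tangent)."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (a * b)' = a' * b + a * b'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def jvp(f, x, v):
    """Return f(x) and the derivative of f at x along v from one forward pass."""
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return out.value, out.tangent


def forward_gradient(f, x, n_samples, rng):
    """Average (directional derivative along v) * v over random directions v."""
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in x]
        _, d = jvp(f, x, v)
        for i, vi in enumerate(v):
            grad[i] += d * vi / n_samples
    return grad


def objective(x):
    # f(x) = x0 * x1 + 2 * x0, so the exact gradient is (x1 + 2, x0)
    return x[0] * x[1] + 2 * x[0]


point = [1.5, -0.5]
value, directional = jvp(objective, point, [1.0, 0.0])  # derivative along e_0
estimate = forward_gradient(objective, point, n_samples=20000, rng=random.Random(0))
```

A single sample gives a noisy estimate that is correct only in expectation. Increasing `n_samples` lowers the variance at the cost of one extra forward pass per sample.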

Source code in torchzero/modules/grad_approximation/forward_gradient.py
class ForwardGradient(RandomizedFDM):
    """Forward gradient method.

    This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.


    Args:
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        distribution (Distributions, optional): distribution for random gradient samples. Defaults to "gaussian".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        jvp_method (str, optional):
            how to calculate the Jacobian-vector product. With `forward` or `central` this is equivalent to randomized finite difference. Defaults to 'autograd'.
        h (float, optional): finite difference step size, used only if jvp_method is set to `forward` or `central`. Defaults to 1e-3.
        target (GradTarget, optional): what to set on var. Defaults to "closure".
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.

    References:
        Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
    """
    PRE_MULTIPLY_BY_H = False
    def __init__(
        self,
        n_samples: int = 1,
        distribution: Distributions = "gaussian",
        pre_generate = True,
        jvp_method: Literal['autograd', 'forward', 'central'] = 'autograd',
        h: float = 1e-3,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples, distribution=distribution, target=target, pre_generate=pre_generate, seed=seed)
        self.defaults['jvp_method'] = jvp_method

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        fs = self.settings[params[0]]
        n_samples = fs['n_samples']
        jvp_method = fs['jvp_method']
        h = fs['h']
        distribution = fs['distribution']
        default = [None]*n_samples
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))
        generator = self.get_generator(params[0].device, self.defaults['seed'])

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]
            if prt[0] is None:
                prt = params.sample_like(distribution=distribution, variance=1, generator=generator)

            else: prt = TensorList(prt)

            if jvp_method == 'autograd':
                with torch.enable_grad():
                    loss, d = jvp(partial(closure, False), params=params, tangent=prt)

            elif jvp_method == 'forward':
                loss, d = jvp_fd_forward(partial(closure, False), params=params, tangent=prt, v_0=loss, h=h)

            elif jvp_method == 'central':
                loss_approx, d = jvp_fd_central(partial(closure, False), params=params, tangent=prt, h=h)

            else: raise ValueError(jvp_method)

            if grad is None: grad = prt * d
            else: grad += prt * d

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)
        return grad, loss, loss_approx

PRE_MULTIPLY_BY_H class-attribute

PRE_MULTIPLY_BY_H = False


GaussianSmoothing

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Gaussian smoothing method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.01 ) –

    finite difference step size. Defaults to 1e-2.

  • n_samples (int, default: 100 ) –

    number of random gradient samples. Defaults to 100.

  • formula (Literal, default: 'forward2' ) –

    finite difference formula. Defaults to 'forward2'.

  • distribution (Literal, default: 'gaussian' ) –

    distribution. Defaults to "gaussian".

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
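
As a rough illustration of the Gaussian smoothing estimator, the sketch below averages forward differences along many Gaussian directions. It assumes that 'forward2' denotes the two-point forward difference; `gaussian_smoothing_gradient` and `linear` are hypothetical names, and the code is not taken from torchzero.

```python
import random


def gaussian_smoothing_gradient(f, x, h=1e-2, n_samples=100, rng=None):
    """Average forward differences of f along Gaussian directions."""
    rng = rng or random.Random()
    f_0 = f(x)                        # shared by every forward difference
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + h * ui for xi, ui in zip(x, u)]
        d = (f(shifted) - f_0) / h    # forward difference along u
        for i, ui in enumerate(u):
            grad[i] += d * ui / n_samples
    return grad


def linear(x):
    # exact gradient is (1, 2)
    return 1.0 * x[0] + 2.0 * x[1]


estimate = gaussian_smoothing_gradient(
    linear, [0.3, -0.7], h=1e-2, n_samples=20000, rng=random.Random(0)
)
```

Because the unperturbed value `f_0` is shared, this sketch needs `n_samples + 1` function evaluations per estimate, compared with `2 * n_samples` for a central formula.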

Source code in torchzero/modules/grad_approximation/rfdm.py
class GaussianSmoothing(RandomizedFDM):
    """
    Gradient approximation via Gaussian smoothing method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-2.
        n_samples (int, optional): number of random gradient samples. Defaults to 100.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'forward2'.
        distribution (Distributions, optional): distribution. Defaults to "gaussian".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".


    References:
        Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
    """
    def __init__(
        self,
        h: float = 1e-2,
        n_samples: int = 100,
        formula: _FD_Formula = "forward2",
        distribution: Distributions = "gaussian",
        pre_generate: bool = True,
        return_approx_loss: bool = False,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples,formula=formula,distribution=distribution,pre_generate=pre_generate,target=target,seed=seed, return_approx_loss=return_approx_loss)

GradApproximator

Bases: torchzero.core.module.Module, abc.ABC

Base class for gradient approximations. This is an abstract class; to use it, subclass it and override approximate.

GradApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.

Parameters:

  • defaults (dict[str, Any] | None, default: None ) –

    dict with defaults. Defaults to None.

  • target (str, default: 'closure' ) –

    whether to set var.grad, var.update or var.closure. Defaults to 'closure'.
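
The closure replacement used with target='closure' can be pictured with a small dependency-free toy. This is a sketch of the general idea only: `ToyProblem` and `make_approx_closure` are hypothetical, and a central difference stands in for whatever `approximate` would compute.

```python
from typing import Callable, List


class ToyProblem:
    """Holds parameters, gradients and a closure, loosely like an optimizer step."""

    def __init__(self, params: List[float]):
        self.params = params
        self.grads: List[float] = [0.0] * len(params)
        self.true_backward_calls = 0

    def closure(self, backward: bool = True) -> float:
        if backward:
            self.true_backward_calls += 1  # would be loss.backward() in PyTorch
        return sum(p * p for p in self.params)


def make_approx_closure(problem: ToyProblem, h: float = 1e-3) -> Callable[[bool], float]:
    def approx_closure(backward: bool = True) -> float:
        loss = problem.closure(False)      # true gradients are never requested
        if backward:
            for i, original in enumerate(problem.params):
                problem.params[i] = original + h
                f_plus = problem.closure(False)
                problem.params[i] = original - h
                f_minus = problem.closure(False)
                problem.params[i] = original   # restore before the next coordinate
                problem.grads[i] = (f_plus - f_minus) / (2 * h)
        return loss
    return approx_closure


problem = ToyProblem([1.0, -3.0])
closure = make_approx_closure(problem)
loss = closure(True)   # later modules call this exactly like the original closure
```

Modules that call the wrapped closure with `backward=True` receive estimated gradients without knowing that no backward pass took place.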

Example:

Basic SPSA method implementation.

class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus
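
The divisor `2*(h**2)` in this example may look different from the textbook SPSA formula. A short derivation shows they agree, writing $\Delta$ for the Rademacher sample so that the stored perturbation is $h\Delta$:

$$
\hat{g} \;=\; h\Delta \cdot \frac{f(x + h\Delta) - f(x - h\Delta)}{2h^{2}}
\;=\; \Delta \, \frac{f(x + h\Delta) - f(x - h\Delta)}{2h}
$$

Since each entry of $\Delta$ is $\pm 1$, multiplying by $\Delta_i$ is the same as dividing by it, which recovers the usual SPSA estimate. The extra factor of $h$ in the denominator only cancels the $h$ already folded into the perturbation.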

Methods:

  • approximate

    Returns a tuple: (grad, loss, loss_approx), make sure this resets parameters to their original values!

  • pre_step

    This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points.

Source code in torchzero/modules/grad_approximation/grad_approximator.py
class GradApproximator(Module, ABC):
    """Base class for gradient approximations.
    This is an abstract class; to use it, subclass it and override `approximate`.

    GradApproximator modifies the closure to evaluate the estimated gradients,
    and further closure-based modules will use the modified closure.

    Args:
        defaults (dict[str, Any] | None, optional): dict with defaults. Defaults to None.
        target (str, optional):
            whether to set `var.grad`, `var.update` or `var.closure`. Defaults to 'closure'.

    Example:

    Basic SPSA method implementation.
    ```python
    class SPSA(GradApproximator):
        def __init__(self, h=1e-3):
            defaults = dict(h=h)
            super().__init__(defaults)

        @torch.no_grad
        def approximate(self, closure, params, loss):
            perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

            # evaluate params + perturbation
            torch._foreach_add_(params, perturbation)
            loss_plus = closure(False)

            # evaluate params - perturbation
            torch._foreach_sub_(params, perturbation)
            torch._foreach_sub_(params, perturbation)
            loss_minus = closure(False)

            # restore original params
            torch._foreach_add_(params, perturbation)

            # calculate SPSA gradients
            spsa_grads = []
            for p, pert in zip(params, perturbation):
                settings = self.settings[p]
                h = settings['h']
                d = (loss_plus - loss_minus) / (2*(h**2))
                spsa_grads.append(pert * d)

            # returns tuple: (grads, loss, loss_approx)
            # loss must be with initial parameters
            # since we only evaluated loss with perturbed parameters
            # we only have loss_approx
            return spsa_grads, None, loss_plus
    ```
    """
    def __init__(self, defaults: dict[str, Any] | None = None, return_approx_loss:bool=False, target: GradTarget = 'closure'):
        super().__init__(defaults)
        self._target: GradTarget = target
        self._return_approx_loss = return_approx_loss

    @abstractmethod
    def approximate(self, closure: Callable, params: list[torch.Tensor], loss: torch.Tensor | None) -> tuple[Iterable[torch.Tensor], torch.Tensor | None, torch.Tensor | None]:
        """Returns a tuple: ``(grad, loss, loss_approx)``, make sure this resets parameters to their original values!"""

    def pre_step(self, objective: Objective) -> None:
        """This runs once before each step, whereas `approximate` may run multiple times per step if further modules
        evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations."""

    @torch.no_grad
    def update(self, objective):
        self.pre_step(objective)

        if objective.closure is None: raise RuntimeError("Gradient approximation requires closure")
        params, closure, loss = objective.params, objective.closure, objective.loss

        if self._target == 'closure':

            def approx_closure(backward=True):
                if backward:
                    # set loss to None because closure might be evaluated at different points
                    grad, l, l_approx = self.approximate(closure=closure, params=params, loss=None)
                    for p, g in zip(params, grad): p.grad = g
                    if l is not None: return l
                    if self._return_approx_loss and l_approx is not None: return l_approx
                    return closure(False)

                return closure(False)

            objective.closure = approx_closure
            return

        # if var.grad is not None:
        #     warnings.warn('Using grad approximator when `var.grad` is already set.')
        grad, loss, loss_approx = self.approximate(closure=closure, params=params, loss=loss)
        if loss_approx is not None: objective.loss_approx = loss_approx
        if loss is not None: objective.loss = objective.loss_approx = loss
        if self._target == 'grad': objective.grads = list(grad)
        elif self._target == 'update': objective.updates = list(grad)
        else: raise ValueError(self._target)
        return

    def apply(self, objective):
        return objective

approximate

approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]

Returns a tuple: (grad, loss, loss_approx), make sure this resets parameters to their original values!

Source code in torchzero/modules/grad_approximation/grad_approximator.py
@abstractmethod
def approximate(self, closure: Callable, params: list[torch.Tensor], loss: torch.Tensor | None) -> tuple[Iterable[torch.Tensor], torch.Tensor | None, torch.Tensor | None]:
    """Returns a tuple: ``(grad, loss, loss_approx)``, make sure this resets parameters to their original values!"""

pre_step

pre_step(objective: Objective) -> None

This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations.
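
The split between the two hooks can be sketched with a toy approximator that draws its direction in `pre_step` and reuses it in `approximate`. `ToySPSA` and `sphere` are hypothetical and the signatures are simplified. The point is only that repeated gradient evaluations within one step see the same direction.

```python
import random
from typing import List


class ToySPSA:
    """Toy approximator: pre_step draws a direction, approximate reuses it."""

    def __init__(self, h: float = 1e-3, seed: int = 0):
        self.h = h
        self.rng = random.Random(seed)
        self.direction: List[float] = []

    def pre_step(self, n_params: int) -> None:
        # runs once per optimizer step
        self.direction = [self.rng.choice((-1.0, 1.0)) for _ in range(n_params)]

    def approximate(self, f, x: List[float]) -> List[float]:
        # may run several times per step, for example inside a line search
        z, h = self.direction, self.h
        f_plus = f([xi + h * zi for xi, zi in zip(x, z)])
        f_minus = f([xi - h * zi for xi, zi in zip(x, z)])
        d = (f_plus - f_minus) / (2 * h)
        return [d * zi for zi in z]


def sphere(x: List[float]) -> float:
    return sum(xi * xi for xi in x)


approximator = ToySPSA()
approximator.pre_step(n_params=3)
first_direction = list(approximator.direction)
grad_at_start = approximator.approximate(sphere, [1.0, 2.0, 3.0])
grad_at_trial = approximator.approximate(sphere, [0.9, 1.9, 2.9])  # a trial point
direction_after_two_calls = list(approximator.direction)
```

Keeping the direction fixed within a step makes gradients at different trial points comparable, which matters for modules such as line searches.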

Source code in torchzero/modules/grad_approximation/grad_approximator.py
def pre_step(self, objective: Objective) -> None:
    """This runs once before each step, whereas `approximate` may run multiple times per step if further modules
    evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations."""

MeZO

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central2' ) –

    finite difference formula. Defaults to 'central2'.

  • distribution (Literal, default: 'rademacher' ) –

    distribution of the random perturbations. Defaults to "rademacher".

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
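
The memory saving in MeZO comes from regenerating the perturbation from a seed instead of storing it. Below is a minimal single-sample sketch of that idea using only the standard library. The seed rule mirrors the `1_000_000*step + i` pattern in the source listing with i = 0, while `perturb_in_place`, `mezo_gradient` and `sphere` are hypothetical names.

```python
import random
from typing import List


def perturb_in_place(x: List[float], seed: int, scale: float) -> None:
    """Add scale * z to x, where z is regenerated from the seed on every call."""
    rng = random.Random(seed)
    for i in range(len(x)):
        x[i] += scale * rng.choice((-1.0, 1.0))


def mezo_gradient(f, x: List[float], step: int, h: float = 1e-3) -> List[float]:
    seed = 1_000_000 * step             # one seed per step, no stored perturbation
    perturb_in_place(x, seed, +h)       # x + h * z
    f_plus = f(x)
    perturb_in_place(x, seed, -2 * h)   # x - h * z
    f_minus = f(x)
    perturb_in_place(x, seed, +h)       # back to x, up to rounding
    d = (f_plus - f_minus) / (2 * h)
    rng = random.Random(seed)           # regenerate z once more for the estimate
    return [d * rng.choice((-1.0, 1.0)) for _ in x]


def sphere(x: List[float]) -> float:
    return sum(xi * xi for xi in x)


params = [1.0, -2.0, 0.5]
before = list(params)
grad = mezo_gradient(sphere, params, step=3)
```

The trade-off is extra random number generation in exchange for never holding a second parameter-sized buffer.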

Source code in torchzero/modules/grad_approximation/rfdm.py
class MeZO(GradApproximator):
    """Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central2'.
        distribution (Distributions, optional): distribution of the random perturbations. Defaults to "rademacher".
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
    """

    def __init__(self, h: float=1e-3, n_samples: int = 1, formula: _FD_Formula = 'central2',
                 distribution: Distributions = 'rademacher', return_approx_loss: bool = False, target: GradTarget = 'closure'):

        defaults = dict(h=h, formula=formula, n_samples=n_samples, distribution=distribution)
        super().__init__(defaults, return_approx_loss=return_approx_loss, target=target)

    def _seeded_perturbation(self, params: list[torch.Tensor], distribution, seed, h):
        prt = TensorList(params).sample_like(
            distribution=distribution,
            variance=h,
            generator=torch.Generator(params[0].device).manual_seed(seed)
        )
        return prt

    def pre_step(self, objective):
        h = NumberList(self.settings[p]['h'] for p in objective.params)

        n_samples = self.defaults['n_samples']
        distribution = self.defaults['distribution']

        step = objective.current_step

        # create functions that generate a deterministic perturbation from seed based on current step
        prt_fns = []
        for i in range(n_samples):

            prt_fn = partial(self._seeded_perturbation, params=objective.params, distribution=distribution, seed=1_000_000*step + i, h=h)
            prt_fns.append(prt_fn)

        self.global_state['prt_fns'] = prt_fns

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        h = NumberList(self.settings[p]['h'] for p in params)
        n_samples = self.defaults['n_samples']
        fd_fn = _RFD_FUNCS[self.defaults['formula']]

        prt_fns = self.global_state['prt_fns']

        grad = None
        for i in range(n_samples):
            loss, loss_approx, d = fd_fn(closure=closure, params=params, p_fn=prt_fns[i], h=h, f_0=loss)
            if grad is None: grad = prt_fns[i]().mul_(d)
            else: grad += prt_fns[i]().mul_(d)

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)
        return grad, loss, loss_approx

RDSA

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Random-direction stochastic approximation (RDSA) method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central2' ) –

    finite difference formula. Defaults to 'central2'.

  • distribution (Literal, default: 'gaussian' ) –

    distribution. Defaults to "gaussian".

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

Source code in torchzero/modules/grad_approximation/rfdm.py
class RDSA(RandomizedFDM):
    """
    Gradient approximation via Random-direction stochastic approximation (RDSA) method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central2'.
        distribution (Distributions, optional): distribution. Defaults to "gaussian".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

    """
    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        formula: _FD_Formula = "central2",
        distribution: Distributions = "gaussian",
        pre_generate: bool = True,
        return_approx_loss: bool = False,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples,formula=formula,distribution=distribution,pre_generate=pre_generate,target=target,seed=seed, return_approx_loss=return_approx_loss)

RandomizedFDM

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Gradient approximation via a randomized finite-difference method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central' ) –

    finite difference formula. Defaults to 'central'.

  • distribution (Literal, default: 'rademacher' ) –

    distribution of the random perturbations. Defaults to "rademacher".

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".
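
A brief derivation of why different perturbation distributions can be used interchangeably: assume the directional derivative is exact and the direction $u$ has zero mean and identity covariance, which holds for both Rademacher and standard Gaussian samples. Then the single-sample estimate is unbiased:

$$
\mathbb{E}\!\left[\left(\nabla f(x)^{\top} u\right) u\right]
\;=\; \mathbb{E}\!\left[u u^{\top}\right] \nabla f(x)
\;=\; \nabla f(x)
$$

With finite differences the directional derivative is itself approximate, so the estimate carries an additional error that depends on h and on the chosen formula.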

Examples:

Simultaneous perturbation stochastic approximation (SPSA) method

SPSA is randomized FDM with rademacher distribution and central formula.

spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
    tz.m.LR(1e-2)
)

Random-direction stochastic approximation (RDSA) method

RDSA is randomized FDM, usually with a gaussian distribution and a central formula.

rdsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
    tz.m.LR(1e-2)
)

Gaussian smoothing method

GS uses many gaussian samples with possibly a larger finite difference step size.

gs = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)

RandomizedFDM with momentum

Momentum might help by reducing the variance of the estimated gradients.

momentum_spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-3)
)
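
The variance-reduction claim can be checked on a toy signal. The sketch below feeds noisy estimates of a constant gradient through a heavy-ball buffer and compares variances. It assumes the classic update v = beta * v + g, which may not match how `tz.m.HeavyBall` is parameterized, and `heavy_ball_average` is a hypothetical helper.

```python
import random
import statistics


def heavy_ball_average(noisy_grads, beta=0.9):
    """Heavy-ball buffer v = beta * v + g, rescaled by (1 - beta) for comparison."""
    velocity = 0.0
    averaged = []
    for g in noisy_grads:
        velocity = beta * velocity + g
        averaged.append((1 - beta) * velocity)
    return averaged


rng = random.Random(0)
true_gradient = 1.0
noisy = [true_gradient + rng.gauss(0.0, 1.0) for _ in range(5000)]
smoothed = heavy_ball_average(noisy)

raw_variance = statistics.pvariance(noisy)
# skip the warm-up, where the buffer is still filling
smoothed_variance = statistics.pvariance(smoothed[100:])
smoothed_mean = statistics.mean(smoothed[100:])
```

In a real optimization run the gradient changes between steps, so momentum trades lower variance for some lag behind the current gradient.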

Source code in torchzero/modules/grad_approximation/rfdm.py
class RandomizedFDM(GradApproximator):
    """Gradient approximation via a randomized finite-difference method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        distribution (Distributions, optional): distribution to sample the random perturbations from. Defaults to "rademacher".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    Examples:
    #### Simultaneous perturbation stochastic approximation (SPSA) method

    SPSA is randomized FDM with a Rademacher distribution and the central formula.
    ```py
    spsa = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(formula="fd_central", distribution="rademacher"),
        tz.m.LR(1e-2)
    )
    ```

    #### Random-direction stochastic approximation (RDSA) method

    RDSA is randomized FDM, usually with a Gaussian distribution and the central formula.
    ```py
    rdsa = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(formula="fd_central", distribution="gaussian"),
        tz.m.LR(1e-2)
    )
    ```

    #### Gaussian smoothing method

    GS uses many Gaussian samples, possibly with a larger finite-difference step size.
    ```py
    gs = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```

    #### RandomizedFDM with momentum

    Momentum might help by reducing the variance of the estimated gradients.
    ```py
    momentum_spsa = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(),
        tz.m.HeavyBall(0.9),
        tz.m.LR(1e-3)
    )
    ```
    """
    PRE_MULTIPLY_BY_H = True
    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        formula: _FD_Formula = "central",
        distribution: Distributions = "rademacher",
        pre_generate: bool = True,
        return_approx_loss: bool = False,
        seed: int | None | torch.Generator = None,
        target: GradTarget = "closure",
    ):
        defaults = dict(h=h, formula=formula, n_samples=n_samples, distribution=distribution, pre_generate=pre_generate, seed=seed)
        super().__init__(defaults, return_approx_loss=return_approx_loss, target=target)


    def pre_step(self, objective):
        h = self.get_settings(objective.params, 'h')
        pre_generate = self.defaults['pre_generate']

        if pre_generate:
            n_samples = self.defaults['n_samples']
            distribution = self.defaults['distribution']

            params = TensorList(objective.params)
            generator = self.get_generator(params[0].device, self.defaults['seed'])
            perturbations = [params.sample_like(distribution=distribution, variance=1, generator=generator) for _ in range(n_samples)]

            # this is false for ForwardGradient where h isn't used and it subclasses this
            if self.PRE_MULTIPLY_BY_H:
                torch._foreach_mul_([p for l in perturbations for p in l], [v for vv in h for v in [vv]*n_samples])

            for param, prt in zip(params, zip(*perturbations)):
                self.state[param]['perturbations'] = prt

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        h = NumberList(self.settings[p]['h'] for p in params)
        n_samples = self.defaults['n_samples']
        distribution = self.defaults['distribution']
        fd_fn = _RFD_FUNCS[self.defaults['formula']]

        default = [None]*n_samples
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))
        generator = self.get_generator(params[0].device, self.defaults['seed'])

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]

            if prt[0] is None:
                prt = params.sample_like(distribution=distribution, generator=generator, variance=1).mul_(h)

            else: prt = TensorList(prt)

            loss, loss_approx, d = fd_fn(closure=closure, params=params, p_fn=lambda: prt, h=h, f_0=loss)
            # here `d` is a numberlist of directional derivatives, due to per parameter `h` values.

            # support for per-sample values which gives better estimate
            if d[0].numel() > 1: d = d.map(torch.mean)

            if grad is None: grad = prt * d
            else: grad += prt * d

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)

        # mean if got per-sample values
        if loss is not None:
            if loss.numel() > 1:
                loss = loss.mean()

        if loss_approx is not None:
            if loss_approx.numel() > 1:
                loss_approx = loss_approx.mean()

        return grad, loss, loss_approx

PRE_MULTIPLY_BY_H class-attribute

PRE_MULTIPLY_BY_H = True


SPSA

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

  • h (float, default: 0.001 ) –

    finite difference step size. Defaults to 1e-3.

  • n_samples (int, default: 1 ) –

    number of random gradient samples. Defaults to 1.

  • formula (Literal, default: 'central' ) –

    finite difference formula. Defaults to 'central'.

  • distribution (Literal, default: 'rademacher' ) –

    distribution. Defaults to "rademacher".

  • pre_generate (bool, default: True ) –

    whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.

  • seed (int | None | Generator, default: None ) –

    Seed for random generator. Defaults to None.

  • target (Literal, default: 'closure' ) –

    what to set on var. Defaults to "closure".

References

Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

Source code in torchzero/modules/grad_approximation/rfdm.py
class SPSA(RandomizedFDM):
    """
    Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        distribution (Distributions, optional): distribution. Defaults to "rademacher".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
    """