Gradient approximations¶

This subpackage contains modules that estimate the gradient using function values.

Classes:

FDM –

Approximate gradients via finite difference method.
ForwardGradient –

Forward gradient method.
GaussianSmoothing –

Gradient approximation via Gaussian smoothing method.
GradApproximator –

Base class for gradient approximations.
MeZO –

Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.
RDSA –

Gradient approximation via Random-direction stochastic approximation (RDSA) method.
RandomizedFDM –

Gradient approximation via a randomized finite-difference method.
SPSA –

Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.
SPSA1 –

One-measurement variant of SPSA. Unlike standard two-measurement SPSA, the estimated

FDM ¶

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Approximate gradients via finite difference method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

h (float, default: 0.001 ) –

magnitude of parameter perturbation. Defaults to 1e-3.
formula (Literal, default: 'central' ) –

finite difference formula. Defaults to 'central'.
target (Literal, default: 'closure' ) –

what to set on var. Defaults to 'closure'.

Examples:

Plain FDM:

fdm = tz.Optimizer(model.parameters(), tz.m.FDM(), tz.m.LR(1e-2))

Any gradient-based method can use FDM-estimated gradients.

fdm_ncg = tz.Optimizer(
    model.parameters(),
    tz.m.FDM(),
    # set hvp_method to "forward" so that it
    # uses gradient difference instead of autograd
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)
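
For intuition about what a finite-difference gradient involves, here is a minimal standalone sketch of a coordinate-wise central difference in plain Python. The helper `central_difference_gradient` and the test function `quadratic` are illustrative assumptions and are not part of torchzero; the library's own formulas are the ones looked up through `_FD_FUNCS` in the source below.

```python
def central_difference_gradient(f, x, h=1e-3):
    """Estimate the gradient of f at x, one coordinate at a time."""
    x = list(x)  # work on a copy so the caller's point is untouched
    grad = []
    for i in range(len(x)):
        original = x[i]
        x[i] = original + h
        f_plus = f(x)
        x[i] = original - h
        f_minus = f(x)
        x[i] = original  # restore the coordinate before moving on
        grad.append((f_plus - f_minus) / (2 * h))
    return grad


def quadratic(x):
    # f(x) = x0^2 + 3*x1^2, whose exact gradient is (2*x0, 6*x1)
    return x[0] ** 2 + 3 * x[1] ** 2


estimate = central_difference_gradient(quadratic, [1.0, -2.0])
```

In this sketch every coordinate costs two function evaluations, so a gradient over n parameters costs 2n evaluations. This is why coordinate-wise finite differences are usually only practical for small problems.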

Source code in torchzero/modules/grad_approximation/fdm.py

class FDM(GradApproximator):
    """Approximate gradients via finite difference method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): magnitude of parameter perturbation. Defaults to 1e-3.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        target (GradTarget, optional): what to set on var. Defaults to 'closure'.

    Examples:
    plain FDM:

    ```python
    fdm = tz.Optimizer(model.parameters(), tz.m.FDM(), tz.m.LR(1e-2))
    ```

    Any gradient-based method can use FDM-estimated gradients.
    ```python
    fdm_ncg = tz.Optimizer(
        model.parameters(),
        tz.m.FDM(),
        # set hvp_method to "forward" so that it
        # uses gradient difference instead of autograd
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```
    """
    def __init__(self, h: float=1e-3, formula: _FD_Formula = 'central', target: GradTarget = 'closure'):
        defaults = dict(h=h, formula=formula)
        super().__init__(defaults, target=target)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        grads = []
        loss_approx = None

        for p in params:
            g = torch.zeros_like(p)
            grads.append(g)

            settings = self.settings[p]
            h = settings['h']
            fd_fn = _FD_FUNCS[settings['formula']]

            p_flat = p.ravel(); g_flat = g.ravel()
            for i in range(len(p_flat)):
                loss, loss_approx, d = fd_fn(closure=closure, param=p_flat, idx=i, h=h, v_0=loss)
                g_flat[i] = d

        return grads, loss, loss_approx

ForwardGradient ¶

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Forward gradient method.

This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.
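
The estimator can be sketched in plain Python as follows: draw a random direction v, obtain the directional derivative d along v, and use d * v as the gradient sample. This is a toy under stated assumptions: `forward_gradient` and `jvp_along` are hypothetical names, and the exact dot product with a known gradient stands in for the autograd jacobian-vector product used by the module.

```python
import random


def forward_gradient(directional_derivative, dim, n_samples=1, rng=None):
    """Average d * v over random directions v, where d is the derivative along v."""
    rng = rng or random.Random(0)
    grad = [0.0] * dim
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        d = directional_derivative(v)
        for i in range(dim):
            grad[i] += d * v[i] / n_samples
    return grad


# Toy objective f(x) = x0^2 + 3*x1^2 at x = (1, -2); its exact gradient is (2, -12).
true_grad = [2.0, -12.0]


def jvp_along(v):
    # exact directional derivative, standing in for an autograd JVP
    return sum(g * v_i for g, v_i in zip(true_grad, v))


estimate = forward_gradient(jvp_along, dim=2, n_samples=20000)
```

A single sample always points along its random direction v, so individual estimates are noisy. Averaging many samples brings the estimate close to the true gradient, which is what `n_samples` controls.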

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

n_samples (int, default: 1 ) –

number of random gradient samples. Defaults to 1.
distribution (Literal, default: 'gaussian' ) –

distribution for random gradient samples. Defaults to "gaussian".
pre_generate (bool, default: True ) –

whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
jvp_method (str, default: 'autograd' ) –

how to calculate the jacobian-vector product. With 'forward' and 'central' this is equivalent to randomized finite difference. Defaults to 'autograd'.
h (float, default: 0.001 ) –

finite difference step size, used if jvp_method is set to 'forward' or 'central'. Defaults to 1e-3.
target (Literal, default: 'closure' ) –

what to set on var. Defaults to "closure".

References

Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.

Source code in torchzero/modules/grad_approximation/forward_gradient.py

class ForwardGradient(RandomizedFDM):
    """Forward gradient method.

    This method samples one or more directional derivatives evaluated via autograd jacobian-vector products. This is very similar to randomized finite difference.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.


    Args:
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        distribution (Distributions, optional): distribution for random gradient samples. Defaults to "gaussian".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        jvp_method (str, optional):
            how to calculate the jacobian-vector product. With `forward` and `central` this is equivalent to randomized finite difference. Defaults to 'autograd'.
        h (float, optional): finite difference step size, used if jvp_method is set to `forward` or `central`. Defaults to 1e-3.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Baydin, A. G., Pearlmutter, B. A., Syme, D., Wood, F., & Torr, P. (2022). Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
    """
    PRE_MULTIPLY_BY_H = False
    def __init__(
        self,
        n_samples: int = 1,
        distribution: Distributions = "gaussian",
        pre_generate = True,
        jvp_method: Literal['autograd', 'forward', 'central'] = 'autograd',
        h: float = 1e-3,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples, distribution=distribution, target=target, pre_generate=pre_generate, seed=seed)
        self.defaults['jvp_method'] = jvp_method

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        fs = self.settings[params[0]]
        n_samples = fs['n_samples']
        jvp_method = fs['jvp_method']
        h = fs['h']
        distribution = fs['distribution']
        default = [None]*n_samples
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))
        generator = self.get_generator(params[0].device, self.defaults['seed'])

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]
            if prt[0] is None:
                prt = params.sample_like(distribution=distribution, variance=1, generator=generator)

            else: prt = TensorList(prt)

            if jvp_method == 'autograd':
                with torch.enable_grad():
                    loss, d = jvp(partial(closure, False), params=params, tangent=prt)

            elif jvp_method == 'forward':
                loss, d = jvp_fd_forward(partial(closure, False), params=params, tangent=prt, v_0=loss, h=h)

            elif jvp_method == 'central':
                loss_approx, d = jvp_fd_central(partial(closure, False), params=params, tangent=prt, h=h)

            else: raise ValueError(jvp_method)

            if grad is None: grad = prt * d
            else: grad += prt * d

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)
        return grad, loss, loss_approx

PRE_MULTIPLY_BY_H `class-attribute` ¶

PRE_MULTIPLY_BY_H = False


GaussianSmoothing ¶

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Gaussian smoothing method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

h (float, default: 0.01 ) –

finite difference step size. Defaults to 1e-2.
n_samples (int, default: 100 ) –

number of random gradient samples. Defaults to 100.
formula (Literal, default: 'forward2' ) –

finite difference formula. Defaults to 'forward2'.
distribution (Literal, default: 'gaussian' ) –

distribution. Defaults to "gaussian".
pre_generate (bool, default: True ) –

whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
seed (int | None | Generator, default: None ) –

Seed for random generator. Defaults to None.
target (Literal, default: 'closure' ) –

what to set on var. Defaults to "closure".
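
As a rough sketch of the idea behind these defaults (many gaussian samples with a forward difference), the plain-Python function below averages forward-difference directional derivatives over random gaussian directions. The names `gaussian_smoothing_gradient` and `quadratic` are illustrative assumptions, not torchzero API, and the module's own formulas may differ in detail.

```python
import random


def gaussian_smoothing_gradient(f, x, h=1e-2, n_samples=100, seed=0):
    """Forward-difference estimate averaged over gaussian directions."""
    rng = random.Random(seed)
    f_0 = f(x)  # evaluated once and reused for every sample
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [x_i + h * u_i for x_i, u_i in zip(x, u)]
        d = (f(shifted) - f_0) / h
        for i, u_i in enumerate(u):
            grad[i] += d * u_i / n_samples
    return grad


def quadratic(x):
    # exact gradient at (1, -2) is (2, -12)
    return x[0] ** 2 + 3 * x[1] ** 2


estimate = gaussian_smoothing_gradient(quadratic, [1.0, -2.0], n_samples=5000)
```

Because the forward formula reuses the value at the unperturbed point, this sketch needs n_samples + 1 function evaluations rather than 2 * n_samples.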

References

Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf

Source code in torchzero/modules/grad_approximation/rfdm.py

class GaussianSmoothing(RandomizedFDM):
    """
    Gradient approximation via Gaussian smoothing method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-2.
        n_samples (int, optional): number of random gradient samples. Defaults to 100.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'forward2'.
        distribution (Distributions, optional): distribution. Defaults to "gaussian".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".


    References:
        Yurii Nesterov, Vladimir Spokoiny. (2015). Random Gradient-Free Minimization of Convex Functions. https://gwern.net/doc/math/2015-nesterov.pdf
    """
    def __init__(
        self,
        h: float = 1e-2,
        n_samples: int = 100,
        formula: _FD_Formula = "forward2",
        distribution: Distributions = "gaussian",
        pre_generate: bool = True,
        return_approx_loss: bool = False,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples,formula=formula,distribution=distribution,pre_generate=pre_generate,target=target,seed=seed, return_approx_loss=return_approx_loss)

GradApproximator ¶

Bases: torchzero.core.module.Module, abc.ABC

Base class for gradient approximations. This is an abstract class, to use it, subclass it and override approximate.

GradApproximator modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure.
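
To illustrate what modifying the closure means, the toy below wraps a closure so that calling it with backward=True stores a finite-difference estimate instead of running backpropagation. It is a plain-Python sketch under stated assumptions: `make_approx_closure`, `forward_difference` and the `state` dict are hypothetical names, and the real logic is in the `update` method shown in the source below.

```python
def make_approx_closure(closure, estimate_gradient, state):
    """Return a closure that stores an estimated gradient when backward=True."""
    def approx_closure(backward=True):
        if backward:
            # only function values are used: the inner closure never backpropagates
            state["grad"] = estimate_gradient(lambda: closure(False))
        return closure(False)
    return approx_closure


state = {"x": 3.0, "grad": None}


def closure(backward=True):
    # a real closure would zero gradients and call loss.backward() when backward is True
    return state["x"] ** 2


def forward_difference(evaluate, h=1e-6):
    original = state["x"]
    f_0 = evaluate()
    state["x"] = original + h
    f_1 = evaluate()
    state["x"] = original  # restore the parameter exactly
    return (f_1 - f_0) / h


wrapped = make_approx_closure(closure, forward_difference, state)
loss = wrapped()  # backward defaults to True, so state["grad"] is filled
```

Any later code that calls `wrapped()` sees an ordinary closure, which is why modules placed after a gradient approximator can stay unaware that the gradients are estimated.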

Parameters:

defaults (dict[str, Any] | None, default: None ) –

dict with defaults. Defaults to None.
target (str, default: 'closure' ) –

whether to set var.grad, var.update or var.closure. Defaults to 'closure'.

Example:

Basic SPSA method implementation.

class SPSA(GradApproximator):
    def __init__(self, h=1e-3):
        defaults = dict(h=h)
        super().__init__(defaults)

    @torch.no_grad
    def approximate(self, closure, params, loss):
        perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

        # evaluate params + perturbation
        torch._foreach_add_(params, perturbation)
        loss_plus = closure(False)

        # evaluate params - perturbation
        torch._foreach_sub_(params, perturbation)
        torch._foreach_sub_(params, perturbation)
        loss_minus = closure(False)

        # restore original params
        torch._foreach_add_(params, perturbation)

        # calculate SPSA gradients
        spsa_grads = []
        for p, pert in zip(params, perturbation):
            settings = self.settings[p]
            h = settings['h']
            d = (loss_plus - loss_minus) / (2*(h**2))
            spsa_grads.append(pert * d)

        # returns tuple: (grads, loss, loss_approx)
        # loss must be with initial parameters
        # since we only evaluated loss with perturbed parameters
        # we only have loss_approx
        return spsa_grads, None, loss_plus

Methods:

approximate –

Returns a tuple: (grad, loss, loss_approx), make sure this resets parameters to their original values!
pre_step –

This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points.

Source code in torchzero/modules/grad_approximation/grad_approximator.py

class GradApproximator(Module, ABC):
    """Base class for gradient approximations.
    This is an abstract class, to use it, subclass it and override `approximate`.

    GradApproximator modifies the closure to evaluate the estimated gradients,
    and further closure-based modules will use the modified closure.

    Args:
        defaults (dict[str, Any] | None, optional): dict with defaults. Defaults to None.
        target (str, optional):
            whether to set `var.grad`, `var.update` or `var.closure`. Defaults to 'closure'.

    Example:

    Basic SPSA method implementation.
    ```python
    class SPSA(GradApproximator):
        def __init__(self, h=1e-3):
            defaults = dict(h=h)
            super().__init__(defaults)

        @torch.no_grad
        def approximate(self, closure, params, loss):
            perturbation = [rademacher_like(p) * self.settings[p]['h'] for p in params]

            # evaluate params + perturbation
            torch._foreach_add_(params, perturbation)
            loss_plus = closure(False)

            # evaluate params - perturbation
            torch._foreach_sub_(params, perturbation)
            torch._foreach_sub_(params, perturbation)
            loss_minus = closure(False)

            # restore original params
            torch._foreach_add_(params, perturbation)

            # calculate SPSA gradients
            spsa_grads = []
            for p, pert in zip(params, perturbation):
                settings = self.settings[p]
                h = settings['h']
                d = (loss_plus - loss_minus) / (2*(h**2))
                spsa_grads.append(pert * d)

            # returns tuple: (grads, loss, loss_approx)
            # loss must be with initial parameters
            # since we only evaluated loss with perturbed parameters
            # we only have loss_approx
            return spsa_grads, None, loss_plus
    ```
    """
    def __init__(self, defaults: dict[str, Any] | None = None, return_approx_loss:bool=False, target: GradTarget = 'closure'):
        super().__init__(defaults)
        self._target: GradTarget = target
        self._return_approx_loss = return_approx_loss

    @abstractmethod
    def approximate(self, closure: Callable, params: list[torch.Tensor], loss: torch.Tensor | None) -> tuple[Iterable[torch.Tensor], torch.Tensor | None, torch.Tensor | None]:
        """Returns a tuple: ``(grad, loss, loss_approx)``, make sure this resets parameters to their original values!"""

    def pre_step(self, objective: Objective) -> None:
        """This runs once before each step, whereas `approximate` may run multiple times per step if further modules
        evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations."""

    @torch.no_grad
    def update(self, objective):
        self.pre_step(objective)

        if objective.closure is None: raise RuntimeError("Gradient approximation requires closure")
        params, closure, loss = objective.params, objective.closure, objective.loss

        if self._target == 'closure':

            def approx_closure(backward=True):
                if backward:
                    # set loss to None because closure might be evaluated at different points
                    grad, l, l_approx = self.approximate(closure=closure, params=params, loss=None)
                    for p, g in zip(params, grad): p.grad = g
                    if l is not None: return l
                    if self._return_approx_loss and l_approx is not None: return l_approx
                    return closure(False)

                return closure(False)

            objective.closure = approx_closure
            return

        # if var.grad is not None:
        #     warnings.warn('Using grad approximator when `var.grad` is already set.')
        grad, loss, loss_approx = self.approximate(closure=closure, params=params, loss=loss)
        if loss_approx is not None: objective.loss_approx = loss_approx
        if loss is not None: objective.loss = objective.loss_approx = loss
        if self._target == 'grad': objective.grads = list(grad)
        elif self._target == 'update': objective.updates = list(grad)
        else: raise ValueError(self._target)
        return

    def apply(self, objective):
        return objective

approximate ¶

approximate(closure: Callable, params: list[Tensor], loss: Tensor | None) -> tuple[Iterable[Tensor], Tensor | None, Tensor | None]

Returns a tuple: (grad, loss, loss_approx), make sure this resets parameters to their original values!

Source code in torchzero/modules/grad_approximation/grad_approximator.py

@abstractmethod
def approximate(self, closure: Callable, params: list[torch.Tensor], loss: torch.Tensor | None) -> tuple[Iterable[torch.Tensor], torch.Tensor | None, torch.Tensor | None]:
    """Returns a tuple: ``(grad, loss, loss_approx)``, make sure this resets parameters to their original values!"""

pre_step ¶

pre_step(objective: Objective) -> None

This runs once before each step, whereas approximate may run multiple times per step if further modules evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations.
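
The split between the two hooks can be sketched with a toy class: directions generated in `pre_step` stay fixed for the whole step, while directions sampled on demand change on every evaluation. `ToyRandomizedApproximator` and its methods are hypothetical and only mimic the behaviour described for the `pre_generate` option.

```python
import random


class ToyRandomizedApproximator:
    """Toy model of the pre_step / approximate split (not the torchzero class)."""

    def __init__(self, dim, pre_generate=True, seed=0):
        self.dim = dim
        self.pre_generate = pre_generate
        self.rng = random.Random(seed)
        self.direction = None

    def _sample(self):
        return [self.rng.choice((-1.0, 1.0)) for _ in range(self.dim)]

    def pre_step(self):
        # runs once per optimizer step
        self.direction = self._sample() if self.pre_generate else None

    def direction_for_evaluation(self):
        # runs every time a later module asks for a gradient estimate
        if self.direction is not None:
            return self.direction
        return self._sample()


fixed = ToyRandomizedApproximator(dim=64, pre_generate=True)
fixed.pre_step()
same_direction = fixed.direction_for_evaluation() == fixed.direction_for_evaluation()

fresh = ToyRandomizedApproximator(dim=64, pre_generate=False)
fresh.pre_step()
different_direction = fresh.direction_for_evaluation() != fresh.direction_for_evaluation()
```

Keeping the direction fixed within a step matters for modules such as line searches, which compare estimates taken at several points and would otherwise be comparing gradients along unrelated directions.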

Source code in torchzero/modules/grad_approximation/grad_approximator.py

def pre_step(self, objective: Objective) -> None:
    """This runs once before each step, whereas `approximate` may run multiple times per step if further modules
    evaluate gradients at multiple points. This is useful for example to pre-generate new random perturbations."""

MeZO ¶

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

h (float, default: 0.001 ) –

finite difference step size. Defaults to 1e-3.
n_samples (int, default: 1 ) –

number of random gradient samples. Defaults to 1.
formula (Literal, default: 'central2' ) –

finite difference formula. Defaults to 'central2'.
distribution (Literal, default: 'rademacher' ) –

distribution of the random perturbations. Defaults to "rademacher".
target (Literal, default: 'closure' ) –

what to set on var. Defaults to "closure".
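
The source below rebuilds each perturbation from a seed derived from the current step instead of storing it. A minimal plain-Python sketch of that idea follows; `seeded_perturbation` and `mezo_style_gradient` are hypothetical helpers, the Rademacher draw uses the standard library rather than torch, and only one sample per step is shown.

```python
import random


def seeded_perturbation(dim, seed, h):
    """Rebuild the same Rademacher direction from its seed, scaled by h."""
    rng = random.Random(seed)
    return [h * rng.choice((-1.0, 1.0)) for _ in range(dim)]


def mezo_style_gradient(f, x, step, h=1e-3):
    """Two-point estimate that perturbs x in place and regenerates the direction on demand."""
    seed = 1_000_000 * step  # single sample per step, following the seed scheme in the source

    def shift(scale):
        # regenerate the direction from the seed and move x along it
        for i, z_i in enumerate(seeded_perturbation(len(x), seed, h)):
            x[i] += scale * z_i

    shift(+1.0)
    f_plus = f(x)
    shift(-2.0)
    f_minus = f(x)
    shift(+1.0)  # back to the starting point, up to floating-point rounding

    # the direction already carries a factor h, hence the division by 2*h*h
    d = (f_plus - f_minus) / (2 * h * h)
    return [d * z_i for z_i in seeded_perturbation(len(x), seed, h)]


x = [0.5]
grad = mezo_style_gradient(lambda p: 5.0 * p[0], x, step=3)
```

The trade-off in this sketch is extra random number generation in exchange for not holding a second parameter-sized buffer between evaluations.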

References

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333

Source code in torchzero/modules/grad_approximation/rfdm.py

class MeZO(GradApproximator):
    """Gradient approximation via memory-efficient zeroth order optimizer (MeZO) - https://arxiv.org/abs/2305.17333.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central2'.
        distribution (Distributions, optional): distribution of the random perturbations. Defaults to "rademacher".
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., & Arora, S. (2023). Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36, 53038-53075. https://arxiv.org/abs/2305.17333
    """

    def __init__(self, h: float=1e-3, n_samples: int = 1, formula: _FD_Formula = 'central2',
                 distribution: Distributions = 'rademacher', return_approx_loss: bool = False, target: GradTarget = 'closure'):

        defaults = dict(h=h, formula=formula, n_samples=n_samples, distribution=distribution)
        super().__init__(defaults, return_approx_loss=return_approx_loss, target=target)

    def _seeded_perturbation(self, params: list[torch.Tensor], distribution, seed, h):
        prt = TensorList(params).sample_like(
            distribution=distribution,
            variance=h,
            generator=torch.Generator(params[0].device).manual_seed(seed)
        )
        return prt

    def pre_step(self, objective):
        h = NumberList(self.settings[p]['h'] for p in objective.params)

        n_samples = self.defaults['n_samples']
        distribution = self.defaults['distribution']

        step = objective.current_step

        # create functions that generate a deterministic perturbation from seed based on current step
        prt_fns = []
        for i in range(n_samples):

            prt_fn = partial(self._seeded_perturbation, params=objective.params, distribution=distribution, seed=1_000_000*step + i, h=h)
            prt_fns.append(prt_fn)

        self.global_state['prt_fns'] = prt_fns

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        h = NumberList(self.settings[p]['h'] for p in params)
        n_samples = self.defaults['n_samples']
        fd_fn = _RFD_FUNCS[self.defaults['formula']]

        prt_fns = self.global_state['prt_fns']

        grad = None
        for i in range(n_samples):
            loss, loss_approx, d = fd_fn(closure=closure, params=params, p_fn=prt_fns[i], h=h, f_0=loss)
            if grad is None: grad = prt_fns[i]().mul_(d)
            else: grad += prt_fns[i]().mul_(d)

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)
        return grad, loss, loss_approx

RDSA ¶

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Random-direction stochastic approximation (RDSA) method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

h (float, default: 0.001 ) –

finite difference step size. Defaults to 1e-3.
n_samples (int, default: 1 ) –

number of random gradient samples. Defaults to 1.
formula (Literal, default: 'central2' ) –

finite difference formula. Defaults to 'central2'.
distribution (Literal, default: 'gaussian' ) –

distribution. Defaults to "gaussian".
pre_generate (bool, default: True ) –

whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
seed (int | None | Generator, default: None ) –

Seed for random generator. Defaults to None.
target (Literal, default: 'closure' ) –

what to set on var. Defaults to "closure".

References

Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

Source code in torchzero/modules/grad_approximation/rfdm.py

class RDSA(RandomizedFDM):
    """
    Gradient approximation via Random-direction stochastic approximation (RDSA) method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central2'.
        distribution (Distributions, optional): distribution. Defaults to "gaussian".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

    """
    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        formula: _FD_Formula = "central2",
        distribution: Distributions = "gaussian",
        pre_generate: bool = True,
        return_approx_loss: bool = False,
        target: GradTarget = "closure",
        seed: int | None | torch.Generator = None,
    ):
        super().__init__(h=h, n_samples=n_samples,formula=formula,distribution=distribution,pre_generate=pre_generate,target=target,seed=seed, return_approx_loss=return_approx_loss)

RandomizedFDM ¶

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

Gradient approximation via a randomized finite-difference method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

h (float, default: 0.001 ) –

finite difference step size. Defaults to 1e-3.
n_samples (int, default: 1 ) –

number of random gradient samples. Defaults to 1.
formula (Literal, default: 'central' ) –

finite difference formula. Defaults to 'central'.
distribution (Literal, default: 'rademacher' ) –

distribution of the random perturbations. Defaults to "rademacher".
pre_generate (bool, default: True ) –

whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
seed (int | None | Generator, default: None ) –

Seed for random generator. Defaults to None.
target (Literal, default: 'closure' ) –

what to set on var. Defaults to "closure".

Examples:

Simultaneous perturbation stochastic approximation (SPSA) method¶

SPSA is randomized FDM with rademacher distribution and central formula.

spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
    tz.m.LR(1e-2)
)

Random-direction stochastic approximation (RDSA) method¶

RDSA is randomized FDM, usually with a gaussian distribution and the central formula.

rdsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
    tz.m.LR(1e-2)
)

Gaussian smoothing method¶

GS uses many gaussian samples with possibly a larger finite difference step size.

gs = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
    tz.m.NewtonCG(hvp_method="forward"),
    tz.m.Backtracking()
)

RandomizedFDM with momentum¶

Momentum might help by reducing the variance of the estimated gradients.

momentum_spsa = tz.Optimizer(
    model.parameters(),
    tz.m.RandomizedFDM(),
    tz.m.HeavyBall(0.9),
    tz.m.LR(1e-3)
)
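
The variance-reduction claim can be checked on a toy problem: feed noisy scalar estimates of a fixed gradient into a momentum buffer and compare the spread before and after. The update used here is the classical heavy-ball accumulation, which is an assumption for illustration and is not taken from the `tz.m.HeavyBall` implementation.

```python
import random
import statistics

rng = random.Random(0)
true_gradient = 2.0
momentum = 0.9

raw, smoothed = [], []
buffer = 0.0
for _ in range(5000):
    noisy = true_gradient + rng.gauss(0.0, 1.0)  # stand-in for one randomized estimate
    buffer = momentum * buffer + noisy  # classical heavy-ball accumulation
    raw.append(noisy)
    smoothed.append((1.0 - momentum) * buffer)  # rescaled to the gradient's magnitude

# discard the warm-up while the buffer is still filling
raw_var = statistics.variance(raw[500:])
smoothed_var = statistics.variance(smoothed[500:])
```

In a real optimization run the parameters move between steps, so the buffer also mixes in gradients from older points. The reduced noise therefore comes at the cost of some lag.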

Source code in torchzero/modules/grad_approximation/rfdm.py

class RandomizedFDM(GradApproximator):
    """Gradient approximation via a randomized finite-difference method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        distribution (Distributions, optional): distribution of the random perturbations. Defaults to "rademacher".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    Examples:
    #### Simultaneous perturbation stochastic approximation (SPSA) method

    SPSA is randomized FDM with rademacher distribution and central formula.
    ```py
    spsa = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(formula="central", distribution="rademacher"),
        tz.m.LR(1e-2)
    )
    ```

    #### Random-direction stochastic approximation (RDSA) method

    RDSA is randomized FDM, usually with a Gaussian distribution and the central formula.
    ```py
    rdsa = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(formula="central", distribution="gaussian"),
        tz.m.LR(1e-2)
    )
    ```

    #### Gaussian smoothing method

    Gaussian smoothing (GS) averages many Gaussian samples, possibly with a larger finite difference step size.
    ```py
    gs = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(n_samples=100, distribution="gaussian", formula="forward2", h=1e-1),
        tz.m.NewtonCG(hvp_method="forward"),
        tz.m.Backtracking()
    )
    ```

    #### RandomizedFDM with momentum

    Momentum might help by reducing the variance of the estimated gradients.
    ```py
    momentum_spsa = tz.Optimizer(
        model.parameters(),
        tz.m.RandomizedFDM(),
        tz.m.HeavyBall(0.9),
        tz.m.LR(1e-3)
    )
    ```
    """
    PRE_MULTIPLY_BY_H = True
    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        formula: _FD_Formula = "central",
        distribution: Distributions = "rademacher",
        pre_generate: bool = True,
        return_approx_loss: bool = False,
        seed: int | None | torch.Generator = None,
        target: GradTarget = "closure",
    ):
        defaults = dict(h=h, formula=formula, n_samples=n_samples, distribution=distribution, pre_generate=pre_generate, seed=seed)
        super().__init__(defaults, return_approx_loss=return_approx_loss, target=target)


    def pre_step(self, objective):
        h = self.get_settings(objective.params, 'h')
        pre_generate = self.defaults['pre_generate']

        if pre_generate:
            n_samples = self.defaults['n_samples']
            distribution = self.defaults['distribution']

            params = TensorList(objective.params)
            generator = self.get_generator(params[0].device, self.defaults['seed'])
            perturbations = [params.sample_like(distribution=distribution, variance=1, generator=generator) for _ in range(n_samples)]

            # ForwardGradient subclasses this class and sets PRE_MULTIPLY_BY_H to False because it doesn't use h
            if self.PRE_MULTIPLY_BY_H:
                torch._foreach_mul_([p for l in perturbations for p in l], [v for vv in h for v in [vv]*n_samples])

            for param, prt in zip(params, zip(*perturbations)):
                self.state[param]['perturbations'] = prt

    @torch.no_grad
    def approximate(self, closure, params, loss):
        params = TensorList(params)
        loss_approx = None

        h = NumberList(self.settings[p]['h'] for p in params)
        n_samples = self.defaults['n_samples']
        distribution = self.defaults['distribution']
        fd_fn = _RFD_FUNCS[self.defaults['formula']]

        default = [None]*n_samples
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))
        generator = self.get_generator(params[0].device, self.defaults['seed'])

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]

            if prt[0] is None:
                prt = params.sample_like(distribution=distribution, generator=generator, variance=1).mul_(h)

            else: prt = TensorList(prt)

            loss, loss_approx, d = fd_fn(closure=closure, params=params, p_fn=lambda: prt, h=h, f_0=loss)
            # here `d` is a numberlist of directional derivatives, due to per parameter `h` values.

            # support for per-sample values which gives better estimate
            if d[0].numel() > 1: d = d.map(torch.mean)

            if grad is None: grad = prt * d
            else: grad += prt * d

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)

        # mean if got per-sample values
        if loss is not None:
            if loss.numel() > 1:
                loss = loss.mean()

        if loss_approx is not None:
            if loss_approx.numel() > 1:
                loss_approx = loss_approx.mean()

        return grad, loss, loss_approx

PRE_MULTIPLY_BY_H `class-attribute` ¶

PRE_MULTIPLY_BY_H = True

SPSA ¶

Bases: torchzero.modules.grad_approximation.rfdm.RandomizedFDM

Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

Note

This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients, and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

Parameters:

h (float, default: 0.001 ) –

finite difference step size. Defaults to 1e-3.
n_samples (int, default: 1 ) –

number of random gradient samples. Defaults to 1.
formula (Literal, default: 'central' ) –

finite difference formula. Defaults to 'central'.
distribution (Literal, default: 'rademacher' ) –

distribution of the random perturbations. Defaults to "rademacher".
pre_generate (bool, default: True ) –

whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
seed (int | None | Generator, default: None ) –

Seed for random generator. Defaults to None.
target (Literal, default: 'closure' ) –

what to set on var. Defaults to "closure".
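
The effect of `pre_generate` can be illustrated with a toy estimator. This is a sketch under stated assumptions, not torchzero's code: `TinySPSA` and its methods are hypothetical and only mimic the `pre_step` / `approximate` split. When the direction is generated once per step, repeated evaluations within that step see the same estimate; otherwise each evaluation draws a new direction.

```python
import numpy as np

class TinySPSA:
    """Toy SPSA estimator showing the effect of pre-generating the direction."""

    def __init__(self, h=1e-3, pre_generate=True, seed=0):
        self.h = h
        self.pre_generate = pre_generate
        self.rng = np.random.default_rng(seed)
        self.delta = None

    def _sample(self, shape):
        # Rademacher direction: entries are -1 or +1 with equal probability
        return self.rng.choice([-1.0, 1.0], size=shape)

    def pre_step(self, x):
        # called once per optimizer step
        self.delta = self._sample(x.shape) if self.pre_generate else None

    def approximate(self, f, x):
        # reuse the stored direction if there is one, otherwise draw a new one
        delta = self.delta if self.delta is not None else self._sample(x.shape)
        d = (f(x + self.h * delta) - f(x - self.h * delta)) / (2 * self.h)
        return d * delta
```

A consistent direction matters for modules that call the closure several times per step, such as line searches, because they then compare values derived from the same gradient estimate.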

References

Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771

Source code in torchzero/modules/grad_approximation/rfdm.py

class SPSA(RandomizedFDM):
    """
    Gradient approximation via Simultaneous perturbation stochastic approximation (SPSA) method.

    Note:
        This module is a gradient approximator. It modifies the closure to evaluate the estimated gradients,
        and further closure-based modules will use the modified closure. All modules after this will use estimated gradients.

    Args:
        h (float, optional): finite difference step size. Defaults to 1e-3.
        n_samples (int, optional): number of random gradient samples. Defaults to 1.
        formula (_FD_Formula, optional): finite difference formula. Defaults to 'central'.
        distribution (Distributions, optional): distribution of the random perturbations. Defaults to "rademacher".
        pre_generate (bool, optional):
            whether to pre-generate gradient samples before each step. If samples are not pre-generated, whenever a method performs multiple closure evaluations, the gradient will be evaluated in different directions each time. Defaults to True.
        seed (int | None | torch.Generator, optional): Seed for random generator. Defaults to None.
        target (GradTarget, optional): what to set on var. Defaults to "closure".

    References:
        Chen, Y. (2021). Theoretical study and comparison of SPSA and RDSA algorithms. arXiv preprint arXiv:2107.12771. https://arxiv.org/abs/2107.12771
    """

SPSA1 ¶

Bases: torchzero.modules.grad_approximation.grad_approximator.GradApproximator

One-measurement variant of SPSA. Unlike standard two-measurement SPSA, the estimated gradient often won't be a descent direction; however, its expectation is biased towards the descent direction. This variant is therefore only recommended for a specific class of problems where the objective function changes on each evaluation, for example feedback control problems.
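
The one-measurement estimate can be sketched in NumPy as below. This follows the general form in the referenced paper, where a single perturbed measurement is divided by the perturbation. It is an illustration only: `spsa1_grad` is a hypothetical name, and torchzero's scaling and its `eps` term are not reproduced here.

```python
import numpy as np

def spsa1_grad(f, x, h=1e-3, rng=None):
    """One-measurement SPSA estimate: one evaluation of f per sample (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    # for Rademacher entries 1/delta_i == delta_i, so dividing by the
    # perturbation h * delta is the same as multiplying by delta / h
    return f(x + h * delta) / h * delta
```

Because the unperturbed value is never subtracted, each sample carries a zero-mean term proportional to `f(x) / h`, which is why individual estimates are far noisier than two-measurement ones.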

Parameters:

h (float, default: 0.001 ) –

finite difference step size, recommended to set to same value as learning rate. Defaults to 1e-3.
n_samples (int, default: 1 ) –

number of random samples. Defaults to 1.
eps (float, default: 1e-08 ) –

measurement noise estimate. Defaults to 1e-8.
pre_generate (bool, default: False ) –

whether to pre-generate the random perturbations before each step. Defaults to False.
seed (int | None | Generator, default: None ) –

random seed. Defaults to None.
target (Literal, default: 'closure' ) –

what to set on closure. Defaults to "closure".

Reference

SPALL, JAMES C. "A One-measurement Form of Simultaneous Perturbation Stochastic Approximation."

Source code in torchzero/modules/grad_approximation/spsa1.py

class SPSA1(GradApproximator):
    """One-measurement variant of SPSA. Unlike standard two-measurement SPSA, the estimated
    gradient often won't be a descent direction; however, its expectation is biased towards
    the descent direction. This variant is therefore only recommended for a specific
    class of problems where the objective function changes on each evaluation,
    for example feedback control problems.

    Args:
        h (float, optional):
            finite difference step size, recommended to set to same value as learning rate. Defaults to 1e-3.
        n_samples (int, optional): number of random samples. Defaults to 1.
        eps (float, optional): measurement noise estimate. Defaults to 1e-8.
        pre_generate (bool, optional): whether to pre-generate the random perturbations before each step. Defaults to False.
        seed (int | None | torch.Generator, optional): random seed. Defaults to None.
        target (GradTarget, optional): what to set on closure. Defaults to "closure".

    Reference:
        [SPALL, JAMES C. "A One-measurement Form of Simultaneous Perturbation Stochastic Approximation."](https://www.jhuapl.edu/spsa/PDF-SPSA/automatica97_one_measSPSA.pdf)
    """

    def __init__(
        self,
        h: float = 1e-3,
        n_samples: int = 1,
        eps: float = 1e-8, # measurement noise
        pre_generate: bool = False,
        seed: int | None | torch.Generator = None,
        target: GradTarget = "closure",
    ):
        defaults = dict(h=h, eps=eps, n_samples=n_samples, pre_generate=pre_generate, seed=seed)
        super().__init__(defaults, target=target)


    def pre_step(self, objective):

        if self.defaults['pre_generate']:

            params = TensorList(objective.params)
            generator = self.get_generator(params[0].device, self.defaults['seed'])

            n_samples = self.defaults['n_samples']
            h = self.get_settings(objective.params, 'h')

            perturbations = [params.rademacher_like(generator=generator) for _ in range(n_samples)]
            torch._foreach_mul_([p for l in perturbations for p in l], [v for vv in h for v in [vv]*n_samples])

            for param, prt in zip(params, zip(*perturbations)):
                self.state[param]['perturbations'] = prt

    @torch.no_grad
    def approximate(self, closure, params, loss):
        generator = self.get_generator(params[0].device, self.defaults['seed'])

        params = TensorList(params)
        orig_params = params.clone() # store to avoid small changes due to float imprecision
        loss_approx = None

        h, eps = self.get_settings(params, "h", "eps", cls=NumberList)
        n_samples = self.defaults['n_samples']

        default = [None]*n_samples
        # perturbations are pre-multiplied by h
        perturbations = list(zip(*(self.state[p].get('perturbations', default) for p in params)))

        grad = None
        for i in range(n_samples):
            prt = perturbations[i]

            if prt[0] is None:
                prt = params.rademacher_like(generator=generator).mul_(h)

            else: prt = TensorList(prt)

            params += prt
            L = closure(False)
            params.copy_(orig_params)

            sample = prt * ((L + eps) / h)
            if grad is None: grad = sample
            else: grad += sample

        assert grad is not None
        if n_samples > 1: grad.div_(n_samples)

        return grad, loss, loss_approx
