Line search

This subpackage contains line search methods.

See also

  • Step size - step size selection methods like Barzilai-Borwein and Polyak's step size.
  • Trust region - trust region methods.

Classes:

  • AdaptiveBacktracking

    Adaptive backtracking line search. After each line search procedure, a new initial step size is set so that the optimal step size would be found on the second iteration.

  • AdaptiveTracking

    A line search that evaluates the previous step size; if the value increased, it backtracks until the value stops decreasing, otherwise it forward-tracks until the value stops decreasing.

  • Backtracking

    Backtracking line search.

  • LineSearchBase

    Base class for line searches.

  • ScipyMinimizeScalar

    Line search via scipy.optimize.minimize_scalar, which implements the Brent, golden-section, and bounded Brent methods.

  • StrongWolfe

    Interpolation line search satisfying Strong Wolfe condition.

AdaptiveBacktracking

Bases: torchzero.modules.line_search.line_search.LineSearchBase

Adaptive backtracking line search. After each line search procedure, a new initial step size is set such that the optimal step size from that procedure would be found on the second iteration of the next line search.

Parameters:

  • init (float, default: 1.0 ) –

    initial step size. Defaults to 1.0.

  • beta (float, default: 0.5 ) –

    multiplies each consecutive step size by this value. Defaults to 0.5.

  • c (float, default: 0.0001 ) –

    constant for the sufficient decrease (Armijo) condition. Defaults to 1e-4.

  • condition (Literal, default: 'armijo' ) –

    termination condition; only conditions that do not use the gradient at f(x+a*d) can be specified.

      - "armijo" - sufficient decrease condition.
      - "decrease" - any decrease in the objective function value satisfies the condition.

    "goldstein" can technically be specified, but it doesn't make sense because there is no zoom stage. Defaults to 'armijo'.

  • maxiter (int, default: 20 ) –

    maximum number of function evaluations per step. Defaults to 20.

  • target_iters (int, default: 1 ) –

    sets the next initial step size such that this many iterations are expected to be performed before the optimal step size is found. Defaults to 1.

  • nplus (float, default: 2.0 ) –

    if initial step size is optimal, it is multiplied by this value. Defaults to 2.0.

  • scale_beta (float, default: 0.0 ) –

    momentum for initial step size, at 0 disables momentum. Defaults to 0.0.
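
Example: a minimal usage sketch, analogous to the Backtracking examples further down this page; it assumes the module is exposed as tz.m.AdaptiveBacktracking like the other line searches here.

opt = tz.Modular(
    model.parameters(),
    tz.m.AdaptiveBacktracking(target_iters=1)
)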

Source code in torchzero/modules/line_search/backtracking.py
class AdaptiveBacktracking(LineSearchBase):
    """Adaptive backtracking line search. After each line search procedure, a new initial step size is set
    such that optimal step size in the procedure would be found on the second line search iteration.

    Args:
        init (float, optional): initial step size. Defaults to 1.0.
        beta (float, optional): multiplies each consecutive step size by this value. Defaults to 0.5.
        c (float, optional): sufficient decrease condition. Defaults to 1e-4.
        condition (TerminationCondition, optional):
            termination condition, only ones that do not use gradient at f(x+a*d) can be specified.
            - "armijo" - sufficient decrease condition.
            - "decrease" - any decrease in objective function value satisfies the condition.

            "goldstein" can techincally be specified but it doesn't make sense because there is not zoom stage.
            Defaults to 'armijo'.
        maxiter (int, optional): maximum number of function evaluations per step. Defaults to 10.
        target_iters (int, optional):
            sets next step size such that this number of iterations are expected
            to be performed until optimal step size is found. Defaults to 1.
        nplus (float, optional):
            if initial step size is optimal, it is multiplied by this value. Defaults to 2.0.
        scale_beta (float, optional):
            momentum for initial step size, at 0 disables momentum. Defaults to 0.0.
    """
    def __init__(
        self,
        init: float = 1.0,
        beta: float = 0.5,
        c: float = 1e-4,
        condition: TerminationCondition = 'armijo',
        maxiter: int = 20,
        target_iters = 1,
        nplus = 2.0,
        scale_beta = 0.0,
    ):
        defaults=dict(init=init,beta=beta,c=c,condition=condition,maxiter=maxiter,target_iters=target_iters,nplus=nplus,scale_beta=scale_beta)
        super().__init__(defaults=defaults)

        self.global_state['beta_scale'] = 1.0
        self.global_state['initial_scale'] = 1.0

    def reset(self):
        super().reset()
        self.global_state['beta_scale'] = 1.0
        self.global_state['initial_scale'] = 1.0

    @torch.no_grad
    def search(self, update, var):
        init, beta, c,condition, maxiter, target_iters, nplus, scale_beta=itemgetter(
            'init','beta','c','condition', 'maxiter','target_iters','nplus','scale_beta')(self.defaults)

        objective = self.make_objective(var=var)

        # directional derivative (0 if c = 0 because it is not needed)
        if c == 0: d = 0
        else: d = -sum(t.sum() for t in torch._foreach_mul(var.get_grad(), update))

        # scale beta
        beta = beta * self.global_state['beta_scale']

        # scale step size so that decrease is expected at target_iters
        init = init * self.global_state['initial_scale']

        step_size = backtracking_line_search(objective, d, init=init, beta=beta, c=c, condition=condition, maxiter=maxiter)

        # found an alpha that reduces loss
        if step_size is not None:

            # update initial_scale
            # initial step size satisfied conditions, increase initial_scale by nplus
            if step_size == init and target_iters > 0:
                self.global_state['initial_scale'] *= nplus ** target_iters

                # clip by maximum possible value to avoid overflow exception
                self.global_state['initial_scale'] = min(
                    self.global_state['initial_scale'],
                    torch.finfo(var.params[0].dtype).max / 2,
                )

            else:
                # otherwise make initial_scale such that target_iters iterations will satisfy armijo
                init_target = step_size
                for _ in range(target_iters):
                    init_target = init_target / beta

                self.global_state['initial_scale'] = _lerp(
                    self.global_state['initial_scale'], init_target / init, 1-scale_beta
                )

            # revert beta_scale
            self.global_state['beta_scale'] = min(1.0, self.global_state['beta_scale'] * math.sqrt(1.5))

            return step_size

        # on fail reduce beta scale value
        self.global_state['beta_scale'] /= 1.5
        return 0

AdaptiveTracking

Bases: torchzero.modules.line_search.line_search.LineSearchBase

A line search that first evaluates the previous step size; if the objective value increased, it backtracks until the value stops decreasing, otherwise it forward-tracks until the value stops decreasing.

Parameters:

  • init (float, default: 1.0 ) –

    initial step size. Defaults to 1.0.

  • nplus (float, default: 2 ) –

    multiplier to step size if initial step size is optimal. Defaults to 2.

  • nminus (float, default: 0.5 ) –

    multiplier to step size if initial step size is too big. Defaults to 0.5.

  • maxiter (int, default: 10 ) –

    maximum number of function evaluations per step. Defaults to 10.

  • adaptive (bool, default: True ) –

    when enabled, if line search failed, step size will continue decreasing on the next step. Otherwise it will restart the line search from init step size. Defaults to True.
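
Example: a minimal usage sketch, assuming the module is exposed as tz.m.AdaptiveTracking like the other line searches on this page.

opt = tz.Modular(
    model.parameters(),
    tz.m.AdaptiveTracking(init=1.0, nplus=2, nminus=0.5)
)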

Source code in torchzero/modules/line_search/adaptive.py
class AdaptiveTracking(LineSearchBase):
    """A line search that evaluates previous step size, if value increased, backtracks until the value stops decreasing,
    otherwise forward-tracks until value stops decreasing.

    Args:
        init (float, optional): initial step size. Defaults to 1.0.
        nplus (float, optional): multiplier to step size if initial step size is optimal. Defaults to 2.
        nminus (float, optional): multiplier to step size if initial step size is too big. Defaults to 0.5.
        maxiter (int, optional): maximum number of function evaluations per step. Defaults to 10.
        adaptive (bool, optional):
            when enabled, if line search failed, step size will continue decreasing on the next step.
            Otherwise it will restart the line search from ``init`` step size. Defaults to True.
    """
    def __init__(
        self,
        init: float = 1.0,
        nplus: float = 2,
        nminus: float = 0.5,
        maxiter: int = 10,
        adaptive=True,
    ):
        defaults=dict(init=init,nplus=nplus,nminus=nminus,maxiter=maxiter,adaptive=adaptive)
        super().__init__(defaults=defaults)

    def reset(self):
        super().reset()

    @torch.no_grad
    def search(self, update, var):
        init, nplus, nminus, maxiter, adaptive = itemgetter(
            'init', 'nplus', 'nminus', 'maxiter', 'adaptive')(self.defaults)

        objective = self.make_objective(var=var)

        # scale a_prev
        a_prev = self.global_state.get('a_prev', init)
        if adaptive: a_prev = a_prev * self.global_state.get('init_scale', 1)

        a_init = a_prev
        if a_init < torch.finfo(var.params[0].dtype).tiny * 2:
            a_init = torch.finfo(var.params[0].dtype).max / 2

        step_size, f, niter = adaptive_tracking(
            objective,
            a_init=a_init,
            maxiter=maxiter,
            nplus=nplus,
            nminus=nminus,
        )

        # found an alpha that reduces loss
        if step_size != 0:
            assert (var.loss is None) or (math.isfinite(f) and f < var.loss)
            self.global_state['init_scale'] = 1

            # if niter == 1, forward tracking failed to decrease function value compared to f_a_prev
            if niter == 1 and step_size >= a_init: step_size *= nminus

            self.global_state['a_prev'] = step_size
            return step_size

        # on fail reduce init_scale so the next search starts from a smaller step size
        self.global_state['init_scale'] = self.global_state.get('init_scale', 1) * nminus**maxiter
        self.global_state['a_prev'] = init
        return 0

Backtracking

Bases: torchzero.modules.line_search.line_search.LineSearchBase

Backtracking line search.

Parameters:

  • init (float, default: 1.0 ) –

    initial step size. Defaults to 1.0.

  • beta (float, default: 0.5 ) –

    multiplies each consecutive step size by this value. Defaults to 0.5.

  • c (float, default: 0.0001 ) –

    constant for the sufficient decrease (Armijo) condition. Defaults to 1e-4.

  • condition (Literal, default: 'armijo' ) –

    termination condition; only conditions that do not use the gradient at f(x+a*d) can be specified.

      - "armijo" - sufficient decrease condition (illustrated by the sketch after the examples below).
      - "decrease" - any decrease in the objective function value satisfies the condition.

    "goldstein" can technically be specified, but it doesn't make sense because there is no zoom stage. Defaults to 'armijo'.

  • maxiter (int, default: 10 ) –

    maximum number of function evaluations per step. Defaults to 10.

  • adaptive (bool, default: True ) –

    when enabled, if line search failed, step size will continue decreasing on the next step. Otherwise it will restart the line search from init step size. Defaults to True.

Examples:

Gradient descent with backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.Backtracking()
)

L-BFGS with backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.Backtracking()
)
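
For reference, a standalone sketch of the Armijo backtracking loop that modules like this one run internally. The helper backtracking_line_search shown in the source below has a similar but not necessarily identical interface; here phi is the loss as a function of the step size and d is the directional derivative along the search direction.

def backtracking_sketch(phi, f_0, d, init=1.0, beta=0.5, c=1e-4, maxiter=10):
    # try init, init*beta, init*beta**2, ... until the Armijo condition holds
    a = init
    for _ in range(maxiter):
        # Armijo sufficient decrease: phi(a) <= phi(0) + c * a * d
        if phi(a) <= f_0 + c * a * d:
            return a
        a *= beta
    return None  # no step size satisfied the condition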

Source code in torchzero/modules/line_search/backtracking.py
class Backtracking(LineSearchBase):
    """Backtracking line search.

    Args:
        init (float, optional): initial step size. Defaults to 1.0.
        beta (float, optional): multiplies each consecutive step size by this value. Defaults to 0.5.
        c (float, optional): sufficient decrease condition. Defaults to 1e-4.
        condition (TerminationCondition, optional):
            termination condition, only ones that do not use gradient at f(x+a*d) can be specified.
            - "armijo" - sufficient decrease condition.
            - "decrease" - any decrease in objective function value satisfies the condition.

            "goldstein" can techincally be specified but it doesn't make sense because there is not zoom stage.
            Defaults to 'armijo'.
        maxiter (int, optional): maximum number of function evaluations per step. Defaults to 10.
        adaptive (bool, optional):
            when enabled, if line search failed, step size will continue decreasing on the next step.
            Otherwise it will restart the line search from ``init`` step size. Defaults to True.

    Examples:
    Gradient descent with backtracking line search:

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.Backtracking()
    )
    ```

    L-BFGS with backtracking line search:
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LBFGS(),
        tz.m.Backtracking()
    )
    ```

    """
    def __init__(
        self,
        init: float = 1.0,
        beta: float = 0.5,
        c: float = 1e-4,
        condition: TerminationCondition = 'armijo',
        maxiter: int = 10,
        adaptive=True,
    ):
        defaults=dict(init=init,beta=beta,c=c,condition=condition,maxiter=maxiter,adaptive=adaptive)
        super().__init__(defaults=defaults)

    def reset(self):
        super().reset()

    @torch.no_grad
    def search(self, update, var):
        init, beta, c, condition, maxiter, adaptive = itemgetter(
            'init', 'beta', 'c', 'condition', 'maxiter', 'adaptive')(self.defaults)

        objective = self.make_objective(var=var)

        # directional derivative (0 if c == 0 because it is not needed)
        if c == 0: d = 0
        else: d = -sum(t.sum() for t in torch._foreach_mul(var.get_grad(), var.get_update()))

        # scale init
        init_scale = self.global_state.get('init_scale', 1)
        if adaptive: init = init * init_scale

        step_size = backtracking_line_search(objective, d, init=init, beta=beta,c=c, condition=condition, maxiter=maxiter)

        # found an alpha that reduces loss
        if step_size is not None:
            #self.global_state['beta_scale'] = min(1.0, self.global_state['beta_scale'] * math.sqrt(1.5))
            self.global_state['init_scale'] = 1
            return step_size

        # on fail set init_scale to continue decreasing the step size
        # or set to large step size when it becomes too small
        if adaptive:
            finfo = torch.finfo(var.params[0].dtype)
            if init_scale <= finfo.tiny * 2:
                self.global_state["init_scale"] = finfo.max / 2
            else:
                self.global_state['init_scale'] = init_scale * beta**maxiter
        return 0

LineSearchBase

Bases: torchzero.core.module.Module, abc.ABC

Base class for line searches.

This is an abstract class; to use it, subclass it and override search.

Parameters:

  • defaults (dict[str, Any] | None) –

    dictionary with defaults.

  • maxiter (int | None, default: None ) –

    if this is specified, the search method will terminate upon evaluating the objective this many times, and step size with the lowest loss value will be used. This is useful when passing make_objective to an external library which doesn't have a maxiter option. Defaults to None.

Other useful methods
  • evaluate_f - returns loss with a given scalar step size
  • evaluate_f_d - returns loss and directional derivative with a given scalar step size
  • make_objective - creates a function that accepts a scalar step size and returns loss. This can be passed to a scalar solver, such as scipy.optimize.minimize_scalar.
  • make_objective_with_derivative - creates a function that accepts a scalar step size and returns a tuple with loss and directional derivative. This can be passed to a scalar solver.

Examples:

This evaluates all step sizes in a range by using the self.evaluate_f method.

class GridLineSearch(LineSearch):
    def __init__(self, start, end, num):
        defaults = dict(start=start,end=end,num=num)
        super().__init__(defaults)

    @torch.no_grad
    def search(self, update, var):

        start = self.defaults["start"]
        end = self.defaults["end"]
        num = self.defaults["num"]

        lowest_loss = float("inf")
        best_step_size = 0.0

        for step_size in torch.linspace(start,end,num):
            loss = self.evaluate_f(step_size.item(), var=var, backward=False)
            if loss < lowest_loss:
                lowest_loss = loss
                best_step_size = step_size.item()

        return best_step_size

Using external solver via self.make_objective

Here we let the scipy.optimize.minimize_scalar solver find the best step size via self.make_objective.

class ScipyMinimizeScalar(LineSearch):
    def __init__(self, method: str | None = None):
        defaults = dict(method=method)
        super().__init__(defaults)

        # the search below uses self.scopt, so import scipy.optimize here
        import scipy.optimize
        self.scopt = scipy.optimize

    @torch.no_grad
    def search(self, update, var):
        objective = self.make_objective(var=var)
        method = self.defaults["method"]

        res = self.scopt.minimize_scalar(objective, method=method)
        return res.x
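
A further sketch (not part of the library) showing evaluate_f and evaluate_f_d from the list above: a single quadratic-interpolation step built from the loss and directional derivative at step size 0 and the loss at one trial step size. The class name and its default value are illustrative only.

class QuadraticStepLS(LineSearchBase):
    def __init__(self, init: float = 1.0):
        defaults = dict(init=init)
        super().__init__(defaults)

    @torch.no_grad
    def search(self, update, var):
        a = self.defaults["init"]

        # loss and directional derivative at step size 0
        f_0, d_0 = self.evaluate_f_d(0.0, var=var)
        # loss at the trial step size
        f_a = self.evaluate_f(a, var=var)

        # minimizer of the quadratic through (0, f_0) with slope d_0 and (a, f_a)
        denom = 2.0 * (f_a - f_0 - d_0 * a)
        if denom <= 0:
            return a  # no interior minimum, keep the trial step
        return max(0.0, -d_0 * a * a / denom)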

Methods:

  • evaluate_f

    evaluate function value at alpha step_size.

  • evaluate_f_d

    evaluate function value and directional derivative in the direction of the update at step size step_size.

  • evaluate_f_d_g

    evaluate function value, directional derivative, and gradient list at step size step_size.

  • search

    Finds the step size to use

Source code in torchzero/modules/line_search/line_search.py
class LineSearchBase(Module, ABC):
    """Base class for line searches.

    This is an abstract class, to use it, subclass it and override `search`.

    Args:
        defaults (dict[str, Any] | None): dictionary with defaults.
        maxiter (int | None, optional):
            if this is specified, the search method will terminate upon evaluating
            the objective this many times, and step size with the lowest loss value will be used.
            This is useful when passing `make_objective` to an external library which
            doesn't have a maxiter option. Defaults to None.

    Other useful methods:
        * ``evaluate_f`` - returns loss with a given scalar step size
        * ``evaluate_f_d`` - returns loss and directional derivative with a given scalar step size
        * ``make_objective`` - creates a function that accepts a scalar step size and returns loss. This can be passed to a scalar solver, such as scipy.optimize.minimize_scalar.
        * ``make_objective_with_derivative`` - creates a function that accepts a scalar step size and returns a tuple with loss and directional derivative. This can be passed to a scalar solver.

    Examples:

    #### Basic line search

    This evaluates all step sizes in a range by using the :code:`self.evaluate_f` method.
    ```python
    class GridLineSearch(LineSearch):
        def __init__(self, start, end, num):
            defaults = dict(start=start,end=end,num=num)
            super().__init__(defaults)

        @torch.no_grad
        def search(self, update, var):

            start = self.defaults["start"]
            end = self.defaults["end"]
            num = self.defaults["num"]

            lowest_loss = float("inf")
            best_step_size = 0.0

            for step_size in torch.linspace(start,end,num):
                loss = self.evaluate_f(step_size.item(), var=var, backward=False)
                if loss < lowest_loss:
                    lowest_loss = loss
                    best_step_size = step_size.item()

            return best_step_size
    ```

    #### Using external solver via self.make_objective

    Here we let :code:`scipy.optimize.minimize_scalar` solver find the best step size via :code:`self.make_objective`

    ```python
    class ScipyMinimizeScalar(LineSearch):
        def __init__(self, method: str | None = None):
            defaults = dict(method=method)
            super().__init__(defaults)

            # the search below uses self.scopt, so import scipy.optimize here
            import scipy.optimize
            self.scopt = scipy.optimize

        @torch.no_grad
        def search(self, update, var):
            objective = self.make_objective(var=var)
            method = self.defaults["method"]

            res = self.scopt.minimize_scalar(objective, method=method)
            return res.x
    ```
    """
    def __init__(self, defaults: dict[str, Any] | None, maxiter: int | None = None):
        super().__init__(defaults)
        self._maxiter = maxiter
        self._reset()

    def _reset(self):
        self._current_step_size: float = 0
        self._lowest_loss = float('inf')
        self._best_step_size: float = 0
        self._current_iter = 0
        self._initial_params = None

    def set_step_size_(
        self,
        step_size: float,
        params: list[torch.Tensor],
        update: list[torch.Tensor],
    ):
        if not math.isfinite(step_size): return

         # fixes overflow when backtracking keeps increasing alpha after converging
        step_size = max(min(tofloat(step_size), 1e36), -1e36)

        # skip if parameters are already at suggested step size
        if self._current_step_size == step_size: return

        # this was basically causing floating point imprecision to build up
        #if False:
        # if abs(alpha) < abs(step_size) and step_size != 0:
        #     torch._foreach_add_(params, update, alpha=alpha)

        # else:
        assert self._initial_params is not None
        if step_size == 0:
            new_params = [p.clone() for p in self._initial_params]
        else:
            new_params = torch._foreach_sub(self._initial_params, update, alpha=step_size)
        for c, n in zip(params, new_params):
            set_storage_(c, n)

        self._current_step_size = step_size

    def _set_per_parameter_step_size_(
        self,
        step_size: Sequence[float],
        params: list[torch.Tensor],
        update: list[torch.Tensor],
    ):
        # if not np.isfinite(step_size): step_size = [0 for _ in step_size]
        # alpha = [self._current_step_size - s for s in step_size]
        # if any(a!=0 for a in alpha):
        #     torch._foreach_add_(params, torch._foreach_mul(update, alpha))
        assert self._initial_params is not None
        if not np.isfinite(step_size).all(): step_size = [0 for _ in step_size]

        if any(s!=0 for s in step_size):
            new_params = torch._foreach_sub(self._initial_params, torch._foreach_mul(update, step_size))
        else:
            new_params = [p.clone() for p in self._initial_params]

        for c, n in zip(params, new_params):
            set_storage_(c, n)

    def _loss(self, step_size: float, var: Var, closure, params: list[torch.Tensor],
              update: list[torch.Tensor], backward:bool=False) -> float:

        # if step_size is 0, we might already know the loss
        if (var.loss is not None) and (step_size == 0):
            return tofloat(var.loss)

        # check max iter
        if self._maxiter is not None and self._current_iter >= self._maxiter: raise MaxLineSearchItersReached
        self._current_iter += 1

        # set new lr and evaluate loss with it
        self.set_step_size_(step_size, params=params, update=update)
        if backward:
            with torch.enable_grad(): loss = closure()
        else:
            loss = closure(False)

        # if it is the best so far, record it
        if loss < self._lowest_loss:
            self._lowest_loss = tofloat(loss)
            self._best_step_size = step_size

        # if evaluated loss at step size 0, set it to var.loss
        if step_size == 0:
            var.loss = loss
            if backward: var.grad = [p.grad if p.grad is not None else torch.zeros_like(p) for p in params]

        return tofloat(loss)

    def _loss_derivative_gradient(self, step_size: float, var: Var, closure,
                         params: list[torch.Tensor], update: list[torch.Tensor]):
        # if step_size is 0, we might already know the derivative
        if (var.grad is not None) and (step_size == 0):
            loss = self._loss(step_size=step_size,var=var,closure=closure,params=params,update=update,backward=False)
            derivative = - sum(t.sum() for t in torch._foreach_mul(var.grad, update))

        else:
            # loss with a backward pass sets params.grad
            loss = self._loss(step_size=step_size,var=var,closure=closure,params=params,update=update,backward=True)

            # directional derivative
            derivative = - sum(t.sum() for t in torch._foreach_mul([p.grad if p.grad is not None
                                                                    else torch.zeros_like(p) for p in params], update))

        assert var.grad is not None
        return loss, tofloat(derivative), var.grad

    def _loss_derivative(self, step_size: float, var: Var, closure,
                         params: list[torch.Tensor], update: list[torch.Tensor]):
        return self._loss_derivative_gradient(step_size=step_size, var=var,closure=closure,params=params,update=update)[:2]

    def evaluate_f(self, step_size: float, var: Var, backward:bool=False):
        """evaluate function value at alpha `step_size`."""
        closure = var.closure
        if closure is None: raise RuntimeError('line search requires closure')
        return self._loss(step_size=step_size, var=var, closure=closure, params=var.params,update=var.get_update(),backward=backward)

    def evaluate_f_d(self, step_size: float, var: Var):
        """evaluate function value and directional derivative in the direction of the update at step size `step_size`."""
        closure = var.closure
        if closure is None: raise RuntimeError('line search requires closure')
        return self._loss_derivative(step_size=step_size, var=var, closure=closure, params=var.params,update=var.get_update())

    def evaluate_f_d_g(self, step_size: float, var: Var):
        """evaluate function value, directional derivative, and gradient list at step size `step_size`."""
        closure = var.closure
        if closure is None: raise RuntimeError('line search requires closure')
        return self._loss_derivative_gradient(step_size=step_size, var=var, closure=closure, params=var.params,update=var.get_update())

    def make_objective(self, var: Var, backward:bool=False):
        closure = var.closure
        if closure is None: raise RuntimeError('line search requires closure')
        return partial(self._loss, var=var, closure=closure, params=var.params, update=var.get_update(), backward=backward)

    def make_objective_with_derivative(self, var: Var):
        closure = var.closure
        if closure is None: raise RuntimeError('line search requires closure')
        return partial(self._loss_derivative, var=var, closure=closure, params=var.params, update=var.get_update())

    def make_objective_with_derivative_and_gradient(self, var: Var):
        closure = var.closure
        if closure is None: raise RuntimeError('line search requires closure')
        return partial(self._loss_derivative_gradient, var=var, closure=closure, params=var.params, update=var.get_update())

    @abstractmethod
    def search(self, update: list[torch.Tensor], var: Var) -> float:
        """Finds the step size to use"""

    @torch.no_grad
    def step(self, var: Var) -> Var:
        self._reset()

        params = var.params
        self._initial_params = [p.clone() for p in params]
        update = var.get_update()

        try:
            step_size = self.search(update=update, var=var)
        except MaxLineSearchItersReached:
            step_size = self._best_step_size

        # set loss_approx
        if var.loss_approx is None: var.loss_approx = self._lowest_loss

        # this is last module - set step size to found step_size times lr
        if var.is_last:
            if var.last_module_lrs is None:
                self.set_step_size_(step_size, params=params, update=update)

            else:
                self._set_per_parameter_step_size_([step_size*lr for lr in var.last_module_lrs], params=params, update=update)

            var.stop = True; var.skip_update = True
            return var

        # revert parameters and multiply update by step size
        self.set_step_size_(0, params=params, update=update)
        torch._foreach_mul_(var.update, step_size)
        return var

evaluate_f

evaluate_f(step_size: float, var: Var, backward: bool = False)

evaluate function value at alpha step_size.

Source code in torchzero/modules/line_search/line_search.py
def evaluate_f(self, step_size: float, var: Var, backward:bool=False):
    """evaluate function value at alpha `step_size`."""
    closure = var.closure
    if closure is None: raise RuntimeError('line search requires closure')
    return self._loss(step_size=step_size, var=var, closure=closure, params=var.params,update=var.get_update(),backward=backward)

evaluate_f_d

evaluate_f_d(step_size: float, var: Var)

evaluate function value and directional derivative in the direction of the update at step size step_size.

Source code in torchzero/modules/line_search/line_search.py
def evaluate_f_d(self, step_size: float, var: Var):
    """evaluate function value and directional derivative in the direction of the update at step size `step_size`."""
    closure = var.closure
    if closure is None: raise RuntimeError('line search requires closure')
    return self._loss_derivative(step_size=step_size, var=var, closure=closure, params=var.params,update=var.get_update())

evaluate_f_d_g

evaluate_f_d_g(step_size: float, var: Var)

evaluate function value, directional derivative, and gradient list at step size step_size.

Source code in torchzero/modules/line_search/line_search.py
def evaluate_f_d_g(self, step_size: float, var: Var):
    """evaluate function value, directional derivative, and gradient list at step size `step_size`."""
    closure = var.closure
    if closure is None: raise RuntimeError('line search requires closure')
    return self._loss_derivative_gradient(step_size=step_size, var=var, closure=closure, params=var.params,update=var.get_update())

search

search(update: list[Tensor], var: Var) -> float

Finds the step size to use

Source code in torchzero/modules/line_search/line_search.py
@abstractmethod
def search(self, update: list[torch.Tensor], var: Var) -> float:
    """Finds the step size to use"""

ScipyMinimizeScalar

Bases: torchzero.modules.line_search.line_search.LineSearchBase

Line search via scipy.optimize.minimize_scalar, which implements the Brent, golden-section, and bounded Brent methods.

Parameters:

  • method (str | None, default: None ) –

    "brent", "golden" or "bounded". Defaults to None.

  • maxiter (int | None, default: None ) –

    maximum number of function evaluations the line search is allowed to perform. Defaults to None.

  • bracket (Sequence | None, default: None ) –

    Either a triple (xa, xb, xc) satisfying xa < xb < xc and func(xb) < func(xa) and func(xb) < func(xc), or a pair (xa, xb) to be used as initial points for a downhill bracket search. Defaults to None.

  • bounds (Sequence | None, default: None ) –

    For method ‘bounded’, bounds is mandatory and must have two finite items corresponding to the optimization bounds. Defaults to None.

  • tol (float | None, default: None ) –

    Tolerance for termination. Defaults to None.

  • options (dict | None, default: None ) –

    A dictionary of solver options. Defaults to None.

For more details on methods and arguments refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize_scalar.html
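
Example: a minimal usage sketch, assuming the module is exposed as tz.m.ScipyMinimizeScalar like the other line searches on this page.

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.ScipyMinimizeScalar(method="bounded", bounds=(0, 2), maxiter=20)
)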

Source code in torchzero/modules/line_search/scipy.py
class ScipyMinimizeScalar(LineSearchBase):
    """Line search via :code:`scipy.optimize.minimize_scalar` which implements brent, golden search and bounded brent methods.

    Args:
        method (str | None, optional): "brent", "golden" or "bounded". Defaults to None.
        maxiter (int | None, optional): maximum number of function evaluations the line search is allowed to perform. Defaults to None.
        bracket (Sequence | None, optional):
            Either a triple (xa, xb, xc) satisfying xa < xb < xc and func(xb) < func(xa) and  func(xb) < func(xc), or a pair (xa, xb) to be used as initial points for a downhill bracket search. Defaults to None.
        bounds (Sequence | None, optional):
            For method ‘bounded’, bounds is mandatory and must have two finite items corresponding to the optimization bounds. Defaults to None.
        tol (float | None, optional): Tolerance for termination. Defaults to None.
        options (dict | None, optional): A dictionary of solver options. Defaults to None.

    For more details on methods and arguments refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize_scalar.html

    """
    def __init__(
        self,
        method: str | None = None,
        maxiter: int | None = None,
        bracket=None,
        bounds=None,
        tol: float | None = None,
        options=None,
    ):
        defaults = dict(method=method,bracket=bracket,bounds=bounds,tol=tol,options=options,maxiter=maxiter)
        super().__init__(defaults)

        import scipy.optimize
        self.scopt = scipy.optimize


    @torch.no_grad
    def search(self, update, var):
        objective = self.make_objective(var=var)
        method, bracket, bounds, tol, options, maxiter = itemgetter(
            'method', 'bracket', 'bounds', 'tol', 'options', 'maxiter')(self.defaults)

        if maxiter is not None:
            options = dict(options) if isinstance(options, Mapping) else {}
            options['maxiter'] = maxiter

        res = self.scopt.minimize_scalar(objective, method=method, bracket=bracket, bounds=bounds, tol=tol, options=options)
        return res.x

StrongWolfe

Bases: torchzero.modules.line_search.line_search.LineSearchBase

Interpolation line search satisfying Strong Wolfe condition.

Parameters:

  • c1 (float, default: 0.0001 ) –

    sufficient descent condition. Defaults to 1e-4.

  • c2 (float, default: 0.9 ) –

    strong curvature condition. For CG set to 0.1. Defaults to 0.9.

  • a_init (str, default: 'fixed' ) –

    strategy for initializing the initial step size guess.

      - "fixed" - uses a fixed value specified in the init_value argument.
      - "first-order" - assumes the first-order change in the function at the current iterate will be the same as that obtained at the previous step.
      - "quadratic" - interpolates a quadratic to f(x_{-1}) and f(x).
      - "quadratic-clip" - same as "quadratic", but uses min(1, 1.01*alpha) as described in Numerical Optimization.
      - "previous" - uses the final step size found on the previous iteration.

    For 2nd order methods it is usually best to leave this at "fixed". For methods that do not produce well-scaled search directions, e.g. conjugate gradient, "first-order" or "quadratic-clip" are recommended. Defaults to 'fixed'.

  • a_max (float, default: 1000000000000.0 ) –

    upper bound for the proposed step sizes. Defaults to 1e12.

  • init_value (float, default: 1 ) –

    initial step size. Used when a_init="fixed", and with other strategies as fallback value. Defaults to 1.

  • maxiter (int, default: 25 ) –

    maximum number of line search iterations. Defaults to 25.

  • maxzoom (int, default: 10 ) –

    maximum number of zoom iterations. Defaults to 10.

  • maxeval (int | None, default: None ) –

    maximum number of function evaluations. Defaults to None.

  • tol_change (float, default: 1e-09 ) –

    tolerance, terminates on small brackets. Defaults to 1e-9.

  • interpolation (str, default: 'cubic' ) –

    what type of interpolation to use.

      - "bisection" - uses the middle point. This is robust, especially if the objective function is non-smooth, however it may need more function evaluations.
      - "quadratic" - minimizes a quadratic model, generally outperformed by "cubic".
      - "cubic" - minimizes a cubic model - this is the most widely used interpolation strategy.
      - "polynomial" - fits a polynomial to all points obtained during the line search.
      - "polynomial2" - alternative polynomial fit where, if a point is outside of bounds, a lower-degree polynomial is tried. This may have faster convergence than "cubic" and "polynomial".

    Defaults to 'cubic'.

  • adaptive (bool, default: True ) –

    if True, the initial step size will be halved when line search failed to find a good direction. When a good direction is found, initial step size is reset to the original value. Defaults to True.

  • fallback (bool, default: False ) –

    if True, when no point satisfied strong wolfe criteria, returns a point with value lower than initial value that doesn't satisfy the criteria. Defaults to False.

  • plus_minus (bool, default: False ) –

    if True, enables the plus-minus variant, where if curvature is negative, line search is performed in the opposite direction. Defaults to False.

Examples:

Conjugate gradient method with strong Wolfe line search. Nocedal and Wright recommend setting c2 to 0.1 for CG. Since CG doesn't produce well-scaled directions, the initial alpha can be determined from function values with a_init="first-order".

opt = tz.Modular(
    model.parameters(),
    tz.m.PolakRibiere(),
    tz.m.StrongWolfe(c2=0.1, a_init="first-order")
)

L-BFGS with strong Wolfe line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(),
    tz.m.StrongWolfe()
)
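
For reference, the two conditions the search enforces can be written as the following standalone check (a sketch, not library code; d_0 and d_a denote the directional derivatives along the search direction at step sizes 0 and a):

def satisfies_strong_wolfe(f_0, d_0, f_a, d_a, a, c1=1e-4, c2=0.9):
    # sufficient decrease: f(x + a*p) <= f(x) + c1 * a * d_0
    armijo = f_a <= f_0 + c1 * a * d_0
    # strong curvature: |d_a| <= c2 * |d_0|
    curvature = abs(d_a) <= c2 * abs(d_0)
    return armijo and curvature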

Source code in torchzero/modules/line_search/strong_wolfe.py
class StrongWolfe(LineSearchBase):
    """Interpolation line search satisfying Strong Wolfe condition.

    Args:
        c1 (float, optional): sufficient descent condition. Defaults to 1e-4.
        c2 (float, optional): strong curvature condition. For CG set to 0.1. Defaults to 0.9.
        a_init (str, optional):
            strategy for initializing the initial step size guess.
            - "fixed" - uses a fixed value specified in `init_value` argument.
            - "first-order" - assumes first-order change in the function at iterate will be the same as that obtained at the previous step.
            - "quadratic" - interpolates quadratic to f(x_{-1}) and f_x.
            - "quadratic-clip" - same as quad, but uses min(1, 1.01*alpha) as described in Numerical Optimization.
            - "previous" - uses final step size found on previous iteration.

            For 2nd order methods it is usually best to leave at "fixed".
            For methods that do not produce well scaled search directions, e.g. conjugate gradient,
            "first-order" or "quadratic-clip" are recommended. Defaults to 'init'.
        a_max (float, optional): upper bound for the proposed step sizes. Defaults to 1e12.
        init_value (float, optional):
            initial step size. Used when ``a_init``="fixed", and with other strategies as fallback value. Defaults to 1.
        maxiter (int, optional): maximum number of line search iterations. Defaults to 25.
        maxzoom (int, optional): maximum number of zoom iterations. Defaults to 10.
        maxeval (int | None, optional): maximum number of function evaluations. Defaults to None.
        tol_change (float, optional): tolerance, terminates on small brackets. Defaults to 1e-9.
        interpolation (str, optional):
            What type of interpolation to use.
            - "bisection" - uses the middle point. This is robust, especially if the objective function is non-smooth, however it may need more function evaluations.
            - "quadratic" - minimizes a quadratic model, generally outperformed by "cubic".
            - "cubic" - minimizes a cubic model - this is the most widely used interpolation strategy.
            - "polynomial" - fits a a polynomial to all points obtained during line search.
            - "polynomial2" - alternative polynomial fit, where if a point is outside of bounds, a lower degree polynomial is tried.
            This may have faster convergence than "cubic" and "polynomial".

            Defaults to 'cubic'.
        adaptive (bool, optional):
            if True, the initial step size will be halved when line search failed to find a good direction.
            When a good direction is found, initial step size is reset to the original value. Defaults to True.
        fallback (bool, optional):
            if True, when no point satisfied strong wolfe criteria,
            returns a point with value lower than initial value that doesn't satisfy the criteria. Defaults to False.
        plus_minus (bool, optional):
            if True, enables the plus-minus variant, where if curvature is negative, line search is performed
            in the opposite direction. Defaults to False.


    ## Examples:

    Conjugate gradient method with strong wolfe line search. Nocedal, Wright recommend setting c2 to 0.1 for CG. Since CG doesn't produce well scaled directions, initial alpha can be determined from function values by ``a_init="first-order"``.

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.PolakRibiere(),
        tz.m.StrongWolfe(c2=0.1, a_init="first-order")
    )
    ```

    LBFGS strong wolfe line search:
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LBFGS(),
        tz.m.StrongWolfe()
    )
    ```

    """
    def __init__(
        self,
        c1: float = 1e-4,
        c2: float = 0.9,
        a_init: Literal['first-order', 'quadratic', 'quadratic-clip', 'previous', 'fixed'] = 'fixed',
        a_max: float = 1e12,
        init_value: float = 1,
        maxiter: int = 25,
        maxzoom: int = 10,
        maxeval: int | None = None,
        tol_change: float = 1e-9,
        interpolation: Literal["quadratic", "cubic", "bisection", "polynomial", 'polynomial2'] = 'cubic',
        adaptive = True,
        fallback:bool = False,
        plus_minus = False,
    ):
        defaults=dict(init_value=init_value,init=a_init,a_max=a_max,c1=c1,c2=c2,maxiter=maxiter,maxzoom=maxzoom, fallback=fallback,
                      maxeval=maxeval, adaptive=adaptive, interpolation=interpolation, plus_minus=plus_minus, tol_change=tol_change)
        super().__init__(defaults=defaults)

        self.global_state['initial_scale'] = 1.0

    @torch.no_grad
    def search(self, update, var):
        self._g_prev = self._f_prev = None
        objective = self.make_objective_with_derivative(var=var)

        init_value, init, c1, c2, a_max, maxiter, maxzoom, maxeval, interpolation, adaptive, plus_minus, fallback, tol_change = itemgetter(
            'init_value', 'init', 'c1', 'c2', 'a_max', 'maxiter', 'maxzoom',
            'maxeval', 'interpolation', 'adaptive', 'plus_minus', 'fallback', 'tol_change')(self.defaults)

        dir = as_tensorlist(var.get_update())
        grad_list = var.get_grad()

        g_0 = -sum(t.sum() for t in torch._foreach_mul(grad_list, dir))
        f_0 = var.get_loss(False)
        dir_norm = dir.global_vector_norm()

        inverted = False
        if plus_minus and g_0 > 0:
            original_objective = objective
            def inverted_objective(a):
                l, g_a = original_objective(-a)
                return l, -g_a
            objective = inverted_objective
            inverted = True

        # --------------------- determine initial step size guess -------------------- #
        init = init.lower().strip()

        a_init = init_value
        if init == 'fixed':
            pass # use init_value

        elif init == 'previous':
            if 'a_prev' in self.global_state:
                a_init = self.global_state['a_prev']

        elif init == 'first-order':
            if 'g_prev' in self.global_state and g_0 < -torch.finfo(dir[0].dtype).tiny * 2:
                a_prev = self.global_state['a_prev']
                g_prev = self.global_state['g_prev']
                if g_prev < 0:
                    a_init = a_prev * g_prev / g_0

        elif init in ('quadratic', 'quadratic-clip'):
            if 'f_prev' in self.global_state and g_0 < -torch.finfo(dir[0].dtype).tiny * 2:
                f_prev = self.global_state['f_prev']
                if f_0 < f_prev:
                    a_init = 2 * (f_0 - f_prev) / g_0
                    if init == 'quadratic-clip': a_init = min(1, 1.01*a_init)
        else:
            raise ValueError(init)

        if adaptive:
            a_init *= self.global_state.get('initial_scale', 1)

        strong_wolfe = _StrongWolfe(
            f=objective,
            f_0=f_0,
            g_0=g_0,
            d_norm=dir_norm,
            a_init=a_init,
            a_max=a_max,
            c1=c1,
            c2=c2,
            maxiter=maxiter,
            maxzoom=maxzoom,
            maxeval=maxeval,
            tol_change=tol_change,
            interpolation=interpolation,
        )

        a, f_a, g_a = strong_wolfe.search()
        if inverted and a is not None: a = -a
        if f_a is not None and (f_a > f_0 or not math.isfinite(f_a)): a = None

        if fallback:
            if a is None or a==0 or not math.isfinite(a):
                lowest = min(strong_wolfe.history.items(), key=lambda x: x[1][0])
                if lowest[1][0] < f_0:
                    a = lowest[0]
                    f_a, g_a = lowest[1]
                    if inverted: a = -a

        if a is not None and a != 0 and math.isfinite(a):
            self.global_state['initial_scale'] = 1
            self.global_state['a_prev'] = a
            self.global_state['f_prev'] = f_0
            self.global_state['g_prev'] = g_0
            return a

        # fail
        if adaptive:
            self.global_state['initial_scale'] = self.global_state.get('initial_scale', 1) * 0.5
            finfo = torch.finfo(dir[0].dtype)
            if self.global_state['initial_scale'] < finfo.tiny * 2:
                self.global_state['initial_scale'] = finfo.max / 2

        return 0