Quasi-Newton methods

This subpackage contains quasi-Newton methods that estimate the Hessian using gradient information.

See also

  • Conjugate gradient - a computationally cheaper alternative to quasi-Newton methods.
  • Second order - "true" second-order methods.
  • Line search - quasi-Newton methods usually require either a line search or a trust region (see the sketch below).
  • Trust region - trust regions can use a Hessian estimated by any of the quasi-Newton methods.
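
All modules in this subpackage are combined through tz.Modular in the same way as the per-class examples further down this page. A minimal sketch pairing a quasi-Newton update with a backtracking line search (any other module from this subpackage can be substituted for BFGS):

opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(),
    tz.m.Backtracking()
)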

Classes:

  • BFGS

    Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.

  • BroydenBad

    Broyden's "bad" Quasi-Newton method.

  • BroydenGood

    Broyden's "good" Quasi-Newton method.

  • DFP

    Davidon–Fletcher–Powell Quasi-Newton method.

  • DNRTR

    Diagonal quasi-newton method.

  • DiagonalBFGS

    Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

  • DiagonalQuasiCauchi

    Diagonal quasi-Cauchy method.

  • DiagonalSR1

    Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

  • DiagonalWeightedQuasiCauchi

    Diagonal weighted quasi-Cauchy method.

  • FletcherVMM

    Fletcher's variable metric Quasi-Newton method.

  • GradientCorrection

    Estimates the gradient at the minimum along the search direction, assuming the function is quadratic.

  • Greenstadt1

    Greenstadt's first Quasi-Newton method.

  • Greenstadt2

    Greenstadt's second Quasi-Newton method.

  • Horisho

    Horisho's variable metric Quasi-Newton method.

  • ICUM

    Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods.

  • LBFGS

    Limited-memory BFGS algorithm. A line search or trust region is recommended.

  • LSR1

    Limited-memory SR1 algorithm. A line search or trust region is recommended.

  • McCormick

    McCormick's Quasi-Newton method.

  • NewDQN

    Diagonal quasi-newton method.

  • NewSSM

    Self-scaling Quasi-Newton method.

  • PSB

    Powell's Symmetric Broyden Quasi-Newton method.

  • Pearson

    Pearson's Quasi-Newton method.

  • ProjectedNewtonRaphson

    Projected Newton Raphson method.

  • SR1

    Symmetric Rank 1. This works best with a trust region.

  • SSVM

    Self-scaling variable metric Quasi-Newton method.

  • ShorR

    Shor’s r-algorithm.

  • ThomasOptimalMethod

    Thomas's "optimal" Quasi-Newton method.

BFGS

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.

Note

a line search or a trust region is recommended

Warning

this uses at least O(N^2) memory.

Parameters:

  • init_scale (float | Literal['auto'], default: 'auto' ) –

    initial hessian matrix is set to identity times this.

    "auto" corresponds to a heuristic from Nocedal. Stephen J. Wright. Numerical Optimization p.142-143.

    Defaults to "auto".

  • tol (float, default: 1e-32 ) –

    tolerance on curvature condition. Defaults to 1e-32.

  • ptol (float | None, default: 1e-32 ) –

    skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.

  • ptol_restart (bool, default: False ) –

    whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.

  • restart_interval (int | None | Literal['auto'], default: None ) –

    interval between resetting the hessian approximation.

    "auto" corresponds to number of decision variables + 1.

    None - no resets.

    Defaults to None.

  • beta (float | None, default: None ) –

    momentum on H or B. Defaults to None.

  • update_freq (int, default: 1 ) –

    frequency of updating H or B. Defaults to 1.

  • scale_first (bool, default: False ) –

    whether to downscale the first step, before the Hessian approximation becomes available. Defaults to False.

  • scale_second (bool) –

    whether to downscale second step. Defaults to False.

  • concat_params (bool, default: True ) –

    If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.

  • inner (Chainable | None, default: None ) –

    preconditioning is applied to the output of this module. Defaults to None.

Examples:

BFGS with backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(),
    tz.m.Backtracking()
)

BFGS with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.BFGS(inverse=False)),
)
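
The restart and update-frequency options from the parameter list above can be added to either setup; a hedged sketch with illustrative values:

opt = tz.Modular(
    model.parameters(),
    tz.m.BFGS(restart_interval="auto", update_freq=2),
    tz.m.Backtracking()
)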

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class BFGS(_InverseHessianUpdateStrategyDefaults):
    """Broyden–Fletcher–Goldfarb–Shanno Quasi-Newton method. This is usually the most stable quasi-newton method.

    Note:
        a line search or a trust region is recommended

    Warning:
        this uses at least O(N^2) memory.

    Args:
        init_scale (float | Literal["auto"], optional):
            initial hessian matrix is set to identity times this.

            "auto" corresponds to a heuristic from Nocedal. Stephen J. Wright. Numerical Optimization p.142-143.

            Defaults to "auto".
        tol (float, optional):
            tolerance on curvature condition. Defaults to 1e-32.
        ptol (float | None, optional):
            skips update if maximum difference between current and previous gradients is less than this, to avoid instability.
            Defaults to 1e-32.
        ptol_restart (bool, optional): whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
        restart_interval (int | None | Literal["auto"], optional):
            interval between resetting the hessian approximation.

            "auto" corresponds to number of decision variables + 1.

            None - no resets.

            Defaults to None.
        beta (float | None, optional): momentum on H or B. Defaults to None.
        update_freq (int, optional): frequency of updating H or B. Defaults to 1.
        scale_first (bool, optional):
            whether to downscale first step before hessian approximation becomes available. Defaults to True.
        scale_second (bool, optional): whether to downscale second step. Defaults to False.
        concat_params (bool, optional):
            If true, all parameters are treated as a single vector.
            If False, the update rule is applied to each parameter separately. Defaults to True.
        inner (Chainable | None, optional): preconditioning is applied to the output of this module. Defaults to None.

    ## Examples:

    BFGS with backtracking line search:

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.BFGS(),
        tz.m.Backtracking()
    )
    ```

    BFGS with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LevenbergMarquardt(tz.m.BFGS(inverse=False)),
    )
    ```
    """

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return bfgs_H_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return bfgs_B_(B=B, s=s, y=y, tol=setting['tol'])

BroydenBad

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Broyden's "bad" Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
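
A minimal usage sketch following the note above; it reuses the modular pattern from the BFGS examples, with the strong Wolfe line search mentioned elsewhere on this page as the accurate line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.BroydenBad(),
    tz.m.StrongWolfe()
)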

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class BroydenBad(_InverseHessianUpdateStrategyDefaults):
    """Broyden's "bad" Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_bad_H_(H=H, s=s, y=y)
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_bad_B_(B=B, s=s, y=y)

BroydenGood

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Broyden's "good" Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class BroydenGood(_InverseHessianUpdateStrategyDefaults):
    """Broyden's "good" Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_good_H_(H=H, s=s, y=y)
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return broyden_good_B_(B=B, s=s, y=y)

DFP

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Davidon–Fletcher–Powell Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.
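
A minimal trust-region sketch following the BFGS example; passing inverse=False is assumed to be accepted here in the same way as for BFGS and SR1:

opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.DFP(inverse=False)),
)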

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class DFP(_InverseHessianUpdateStrategyDefaults):
    """Davidon–Fletcher–Powell Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return dfp_H_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return dfp_B(B=B, s=s, y=y, tol=setting['tol'])

DNRTR

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Diagonal quasi-newton method.

Reference

Andrei, Neculai. "A diagonal quasi-Newton updating method for unconstrained optimization." Numerical Algorithms 81.2 (2019): 575-590.
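
A minimal construction sketch using the bound parameters from the signature below (the values shown are the defaults), combined with a backtracking line search:

opt = tz.Modular(
    model.parameters(),
    tz.m.DNRTR(lb=1e-2, ub=1e5),
    tz.m.Backtracking()
)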

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DNRTR(HessianUpdateStrategy):
    """Diagonal quasi-newton method.

    Reference:
        Andrei, Neculai. "A diagonal quasi-Newton updating method for unconstrained optimization." Numerical Algorithms 81.2 (2019): 575-590.
    """
    def __init__(
        self,
        lb: float = 1e-2,
        ub: float = 1e5,
        init_scale: float | Literal["auto"] = "auto",
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None | Literal['auto'] = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(lb=lb, ub=ub)
        super().__init__(
            defaults=defaults,
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=False,
            inner=inner,
        )

    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_wqc_B_(B=B, s=s, y=y)

    def modify_B(self, B, state, setting):
        return _truncate(B, setting['lb'], setting['ub'])

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalBFGS

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalBFGS(_InverseHessianUpdateStrategyDefaults):
    """Diagonal BFGS. This is simply BFGS with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful."""
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_bfgs_H_(H=H, s=s, y=y, tol=setting['tol'])

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalQuasiCauchi

Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults

Diagonal quasi-Cauchy method.

Reference

Zhu, M., Nazareth, J. L., & Wolkowicz, H. The quasi-Cauchy relation and diagonal updating. SIAM Journal on Optimization 9.4 (1999): 1192-1204.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalQuasiCauchi(_HessianUpdateStrategyDefaults):
    """Diagonal quasi-cauchi method.

    Reference:
        Zhu M., Nazareth J. L., Wolkowicz H. The quasi-Cauchy relation and diagonal updating //SIAM Journal on Optimization. – 1999. – Т. 9. – №. 4. – С. 1192-1204.
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_qc_B_(B=B, s=s, y=y)

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalSR1

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalSR1(_InverseHessianUpdateStrategyDefaults):
    """Diagonal SR1. This is simply SR1 with only the diagonal being updated and used. It doesn't satisfy the secant equation but may still be useful."""
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_sr1_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_sr1_(H=B, s=y, y=s, tol=setting['tol'])

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

DiagonalWeightedQuasiCauchi

Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults

Diagonal weighted quasi-Cauchy method.

Reference

Leong, Wah June, Sharareh Enshaei, and Sie Long Kek. "Diagonal quasi-Newton methods via least change updating principle with weighted Frobenius norm." Numerical Algorithms 86 (2021): 1225-1241.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class DiagonalWeightedQuasiCauchi(_HessianUpdateStrategyDefaults):
    """Diagonal quasi-cauchi method.

    Reference:
        Leong, Wah June, Sharareh Enshaei, and Sie Long Kek. "Diagonal quasi-Newton methods via least change updating principle with weighted Frobenius norm." Numerical Algorithms 86 (2021): 1225-1241.
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return diagonal_wqc_B_(B=B, s=s, y=y)

    def initialize_P(self, size:int, device, dtype, is_inverse:bool): return torch.ones(size, device=device, dtype=dtype)

FletcherVMM

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Fletcher's variable metric Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. doi:10.1093/comjnl/13.3.317

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class FletcherVMM(_InverseHessianUpdateStrategyDefaults):
    """
    Fletcher's variable metric Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. doi:10.1093/comjnl/13.3.317
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return fletcher_vmm_H_(H=H, s=s, y=y, tol=setting['tol'])

GradientCorrection

Bases: torchzero.core.transform.Transform

Estimates the gradient at the minimum along the search direction, assuming the function is quadratic.

This can be useful as an inner module for second-order methods with an inexact line search.

Example:

L-BFGS with gradient correction

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(inner=tz.m.GradientCorrection()),
    tz.m.Backtracking()
)

Reference

HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class GradientCorrection(Transform):
    """
    Estimates gradient at minima along search direction assuming function is quadratic.

    This can useful as inner module for second order methods with inexact line search.

    ## Example:
    L-BFGS with gradient correction

    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LBFGS(inner=tz.m.GradientCorrection()),
        tz.m.Backtracking()
    )
    ```

    Reference:
        HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394

    """
    def __init__(self):
        super().__init__(None, uses_grad=False)

    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        if 'p_prev' not in states[0]:
            p_prev = unpack_states(states, tensors, 'p_prev', init=params)
            g_prev = unpack_states(states, tensors, 'g_prev', init=tensors)
            return tensors

        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', cls=TensorList)
        g_hat = gradient_correction(TensorList(tensors), params-p_prev, tensors-g_prev)

        p_prev.copy_(params)
        g_prev.copy_(tensors)
        return g_hat

Greenstadt1

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Greenstadt's first Quasi-Newton method.

Note

a trust region or an accurate line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Greenstadt1(_InverseHessianUpdateStrategyDefaults):
    """Greenstadt's first Quasi-Newton method.

    Note:
        a trust region or an accurate line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return greenstadt1_H_(H=H, s=s, y=y, g_prev=g_prev)

Greenstadt2

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Greenstadt's second Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Greenstadt2(_InverseHessianUpdateStrategyDefaults):
    """Greenstadt's second Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return greenstadt2_H_(H=H, s=s, y=y)

Horisho

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Horisho's variable metric Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Horisho(_InverseHessianUpdateStrategyDefaults):
    """
    Horisho's variable metric Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        HOSHINO, S. (1972). A Formulation of Variable Metric Methods. IMA Journal of Applied Mathematics, 10(3), 394–403. doi:10.1093/imamat/10.3.394
    """

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return hoshino_H_(H=H, s=s, y=y, tol=setting['tol'])

ICUM

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods due to only updating one column of the inverse hessian approximation per step.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Lopes, V. L., & Martínez, J. M. (1995). Convergence properties of the inverse column-updating method. Optimization Methods & Software, 6(2), 127–144. from https://www.ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-1993-76.pdf
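
A minimal usage sketch following the note above, using the backtracking line search from the other examples on this page:

opt = tz.Modular(
    model.parameters(),
    tz.m.ICUM(),
    tz.m.Backtracking()
)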

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ICUM(_InverseHessianUpdateStrategyDefaults):
    """
    Inverse Column-updating Quasi-Newton method. This is computationally cheaper than other Quasi-Newton methods
    due to only updating one column of the inverse hessian approximation per step.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Lopes, V. L., & Martínez, J. M. (1995). Convergence properties of the inverse column-updating method. Optimization Methods & Software, 6(2), 127–144. from https://www.ime.unicamp.br/sites/default/files/pesquisa/relatorios/rp-1993-76.pdf
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return icum_H_(H=H, s=s, y=y)

LBFGS

Bases: torchzero.core.transform.Transform

Limited-memory BFGS algorithm. A line search or trust region is recommended.

Parameters:

  • history_size (int, default: 10 ) –

    number of past parameter differences and gradient differences to store. Defaults to 10.

  • ptol (float | None, default: 1e-32 ) –

    skips updating the history if the maximum absolute value of the parameter difference is less than this value. Defaults to 1e-32.

  • ptol_restart (bool, default: False ) –

    If True, whenever the parameter difference is less than ptol, L-BFGS state will be reset. Defaults to False.

  • gtol (float | None, default: 1e-32 ) –

    skips updating the history if the maximum absolute value of the gradient difference is less than this value. Defaults to 1e-32.

  • gtol_restart (bool, default: False ) –

    If True, whenever the gradient difference is less than gtol, L-BFGS state will be reset. Defaults to False.

  • sy_tol (float | None, default: 1e-32 ) –

    history will not be updated whenever s⋅y is less than this value (negative s⋅y means negative curvature)

  • scale_first (bool, default: True ) –

    makes the first step small, when no Hessian approximation is available yet, to reduce the number of line search iterations. Defaults to True.

  • update_freq (int, default: 1 ) –

    how often to update L-BFGS history. Larger values may be better for stochastic optimization. Defaults to 1.

  • damping (DampingStrategyType, default: None ) –

    damping to use, can be "powell" or "double". Defaults to None.

  • inner (Chainable | None, default: None ) –

    optional inner modules applied after updating L-BFGS history and before preconditioning. Defaults to None.

Examples:

L-BFGS with line search

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(100),
    tz.m.Backtracking()
)

L-BFGS with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(tz.m.LBFGS())
)
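
The damping and update_freq options from the parameter list can be added to either setup; a hedged sketch where "powell" is taken from the parameter description and the other values are illustrative:

opt = tz.Modular(
    model.parameters(),
    tz.m.LBFGS(history_size=10, damping="powell", update_freq=4),
    tz.m.Backtracking()
)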

Source code in torchzero/modules/quasi_newton/lbfgs.py
class LBFGS(Transform):
    """Limited-memory BFGS algorithm. A line search or trust region is recommended.

    Args:
        history_size (int, optional):
            number of past parameter differences and gradient differences to store. Defaults to 10.
        ptol (float | None, optional):
            skips updating the history if maximum absolute value of
            parameter difference is less than this value. Defaults to 1e-10.
        ptol_restart (bool, optional):
            If true, whenever parameter difference is less then ``ptol``,
            L-BFGS state will be reset. Defaults to None.
        gtol (float | None, optional):
            skips updating the history if if maximum absolute value of
            gradient difference is less than this value. Defaults to 1e-10.
        ptol_restart (bool, optional):
            If true, whenever gradient difference is less then ``gtol``,
            L-BFGS state will be reset. Defaults to None.
        sy_tol (float | None, optional):
            history will not be updated whenever s⋅y is less than this value (negative s⋅y means negative curvature)
        scale_first (bool, optional):
            makes first step, when hessian approximation is not available,
            small to reduce number of line search iterations. Defaults to True.
        update_freq (int, optional):
            how often to update L-BFGS history. Larger values may be better for stochastic optimization. Defaults to 1.
        damping (DampingStrategyType, optional):
            damping to use, can be "powell" or "double". Defaults to None.
        inner (Chainable | None, optional):
            optional inner modules applied after updating L-BFGS history and before preconditioning. Defaults to None.

    ## Examples:

    L-BFGS with line search
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LBFGS(100),
        tz.m.Backtracking()
    )
    ```

    L-BFGS with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.TrustCG(tz.m.LBFGS())
    )
    ```
    """
    def __init__(
        self,
        history_size=10,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        gtol_restart: bool = False,
        sy_tol: float = 1e-32,
        scale_first:bool=True,
        update_freq = 1,
        damping: DampingStrategyType = None,
        inner: Chainable | None = None,
    ):
        defaults = dict(
            history_size=history_size,
            scale_first=scale_first,
            ptol=ptol,
            gtol=gtol,
            ptol_restart=ptol_restart,
            gtol_restart=gtol_restart,
            sy_tol=sy_tol,
            damping = damping,
        )
        super().__init__(defaults, uses_grad=False, inner=inner, update_freq=update_freq)

        self.global_state['s_history'] = deque(maxlen=history_size)
        self.global_state['y_history'] = deque(maxlen=history_size)
        self.global_state['sy_history'] = deque(maxlen=history_size)

    def _reset_self(self):
        self.state.clear()
        self.global_state['step'] = 0
        self.global_state['s_history'].clear()
        self.global_state['y_history'].clear()
        self.global_state['sy_history'].clear()

    def reset(self):
        self._reset_self()
        for c in self.children.values(): c.reset()

    def reset_for_online(self):
        super().reset_for_online()
        self.clear_state_keys('p_prev', 'g_prev')
        self.global_state.pop('step', None)

    @torch.no_grad
    def update_tensors(self, tensors, params, grads, loss, states, settings):
        p = as_tensorlist(params)
        g = as_tensorlist(tensors)
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        # history of s and k
        s_history: deque[TensorList] = self.global_state['s_history']
        y_history: deque[TensorList] = self.global_state['y_history']
        sy_history: deque[torch.Tensor] = self.global_state['sy_history']

        ptol = self.defaults['ptol']
        gtol = self.defaults['gtol']
        ptol_restart = self.defaults['ptol_restart']
        gtol_restart = self.defaults['gtol_restart']
        sy_tol = self.defaults['sy_tol']
        damping = self.defaults['damping']

        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', cls=TensorList)

        # 1st step - there are no previous params and grads, lbfgs will do normalized SGD step
        if step == 0:
            s = None; y = None; sy = None
        else:
            s = p - p_prev
            y = g - g_prev

            if damping is not None:
                s, y = apply_damping(damping, s=s, y=y, g=g, H=self.get_H())

            sy = s.dot(y)
            # damping to be added here

        below_tol = False
        # tolerance on parameter difference to avoid exploding after converging
        if ptol is not None:
            if s is not None and s.abs().global_max() <= ptol:
                if ptol_restart:
                    self._reset_self()
                sy = None
                below_tol = True

        # tolerance on gradient difference to avoid exploding when there is no curvature
        if gtol is not None:
            if y is not None and y.abs().global_max() <= gtol:
                if gtol_restart: self._reset_self()
                sy = None
                below_tol = True

        # store previous params and grads
        if not below_tol:
            p_prev.copy_(p)
            g_prev.copy_(g)

        # update effective preconditioning state
        if sy is not None and sy > sy_tol:
            assert s is not None and y is not None and sy is not None

            s_history.append(s)
            y_history.append(y)
            sy_history.append(sy)

    def get_H(self, var=...):
        s_history = [tl.to_vec() for tl in self.global_state['s_history']]
        y_history = [tl.to_vec() for tl in self.global_state['y_history']]
        sy_history = self.global_state['sy_history']
        return LBFGSLinearOperator(s_history, y_history, sy_history)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        scale_first = self.defaults['scale_first']

        tensors = as_tensorlist(tensors)

        s_history = self.global_state['s_history']
        y_history = self.global_state['y_history']
        sy_history = self.global_state['sy_history']

        # precondition
        dir = lbfgs_Hx(
            x=tensors,
            s_history=s_history,
            y_history=y_history,
            sy_history=sy_history,
        )

        # scale 1st step
        if scale_first and self.global_state.get('step', 1) == 1:
            dir *= initial_step_size(dir, eps=1e-7)

        return dir

LSR1

Bases: torchzero.core.transform.Transform

Limited-memory SR1 algorithm. A line search or trust region is recommended.

Parameters:

  • history_size (int, default: 10 ) –

    number of past parameter differences and gradient differences to store. Defaults to 10.

  • ptol (float | None, default: None ) –

    skips updating the history if maximum absolute value of parameter difference is less than this value. Defaults to None.

  • ptol_restart (bool, default: False ) –

    If True, whenever the parameter difference is less than ptol, L-SR1 state will be reset. Defaults to False.

  • gtol (float | None, default: None ) –

    skips updating the history if the maximum absolute value of the gradient difference is less than this value. Defaults to None.

  • gtol_restart (bool, default: False ) –

    If True, whenever the gradient difference is less than gtol, L-SR1 state will be reset. Defaults to False.

  • scale_first (bool, default: False ) –

    makes the first step small, when no Hessian approximation is available yet, to reduce the number of line search iterations. Defaults to False.

  • update_freq (int, default: 1 ) –

    how often to update L-SR1 history. Larger values may be better for stochastic optimization. Defaults to 1.

  • damping (DampingStrategyType, default: None ) –

    damping to use, can be "powell" or "double". Defaults to None.

  • compact (bool) –

    if True, uses a compact-representation version of L-SR1. It is much faster computationally, but less stable.

  • inner (Chainable | None, default: None ) –

    optional inner modules applied after updating L-SR1 history and before preconditioning. Defaults to None.

Examples:

L-SR1 with line search

opt = tz.Modular(
    model.parameters(),
    tz.m.LSR1(),
    tz.m.StrongWolfe(c2=0.1, fallback=True)
)

L-SR1 with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.TrustCG(tz.m.LSR1())
)

Source code in torchzero/modules/quasi_newton/lsr1.py
class LSR1(Transform):
    """Limited-memory SR1 algorithm. A line search or trust region is recommended.

    Args:
        history_size (int, optional):
            number of past parameter differences and gradient differences to store. Defaults to 10.
        ptol (float | None, optional):
            skips updating the history if maximum absolute value of
            parameter difference is less than this value. Defaults to None.
        ptol_restart (bool, optional):
            If true, whenever parameter difference is less then ``ptol``,
            L-SR1 state will be reset. Defaults to None.
        gtol (float | None, optional):
            skips updating the history if if maximum absolute value of
            gradient difference is less than this value. Defaults to None.
        ptol_restart (bool, optional):
            If true, whenever gradient difference is less then ``gtol``,
            L-SR1 state will be reset. Defaults to None.
        scale_first (bool, optional):
            makes first step, when hessian approximation is not available,
            small to reduce number of line search iterations. Defaults to False.
        update_freq (int, optional):
            how often to update L-SR1 history. Larger values may be better for stochastic optimization. Defaults to 1.
        damping (DampingStrategyType, optional):
            damping to use, can be "powell" or "double". Defaults to None.
        compact (bool, optional):
            if True, uses a compact representation verstion of L-SR1. It is much faster computationally, but less stable.
        inner (Chainable | None, optional):
            optional inner modules applied after updating L-SR1 history and before preconditioning. Defaults to None.

    ## Examples:

    L-SR1 with line search
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.SR1(),
        tz.m.StrongWolfe(c2=0.1, fallback=True)
    )
    ```

    L-SR1 with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.TrustCG(tz.m.LSR1())
    )
    ```
    """
    def __init__(
        self,
        history_size=10,
        ptol: float | None = None,
        ptol_restart: bool = False,
        gtol: float | None = None,
        gtol_restart: bool = False,
        scale_first:bool=False,
        update_freq = 1,
        damping: DampingStrategyType = None,
        inner: Chainable | None = None,
    ):
        defaults = dict(
            history_size=history_size,
            scale_first=scale_first,
            ptol=ptol,
            gtol=gtol,
            ptol_restart=ptol_restart,
            gtol_restart=gtol_restart,
            damping = damping,
        )
        super().__init__(defaults, uses_grad=False, inner=inner, update_freq=update_freq)

        self.global_state['s_history'] = deque(maxlen=history_size)
        self.global_state['y_history'] = deque(maxlen=history_size)

    def _reset_self(self):
        self.state.clear()
        self.global_state['step'] = 0
        self.global_state['s_history'].clear()
        self.global_state['y_history'].clear()

    def reset(self):
        self._reset_self()
        for c in self.children.values(): c.reset()

    def reset_for_online(self):
        super().reset_for_online()
        self.clear_state_keys('p_prev', 'g_prev')
        self.global_state.pop('step', None)

    @torch.no_grad
    def update_tensors(self, tensors, params, grads, loss, states, settings):
        p = as_tensorlist(params)
        g = as_tensorlist(tensors)
        step = self.global_state.get('step', 0)
        self.global_state['step'] = step + 1

        # history of s and k
        s_history: deque = self.global_state['s_history']
        y_history: deque = self.global_state['y_history']

        ptol = self.defaults['ptol']
        gtol = self.defaults['gtol']
        ptol_restart = self.defaults['ptol_restart']
        gtol_restart = self.defaults['gtol_restart']
        damping = self.defaults['damping']

        p_prev, g_prev = unpack_states(states, tensors, 'p_prev', 'g_prev', cls=TensorList)

        # 1st step - there are no previous params and grads, lsr1 will do normalized SGD step
        if step == 0:
            s = None; y = None; sy = None
        else:
            s = p - p_prev
            y = g - g_prev

            if damping is not None:
                s, y = apply_damping(damping, s=s, y=y, g=g, H=self.get_H())

            sy = s.dot(y)
            # damping to be added here

        below_tol = False
        # tolerance on parameter difference to avoid exploding after converging
        if ptol is not None:
            if s is not None and s.abs().global_max() <= ptol:
                if ptol_restart: self._reset_self()
                sy = None
                below_tol = True

        # tolerance on gradient difference to avoid exploding when there is no curvature
        if gtol is not None:
            if y is not None and y.abs().global_max() <= gtol:
                if gtol_restart: self._reset_self()
                sy = None
                below_tol = True

        # store previous params and grads
        if not below_tol:
            p_prev.copy_(p)
            g_prev.copy_(g)

        # update effective preconditioning state
        if sy is not None:
            assert s is not None and y is not None and sy is not None

            s_history.append(s)
            y_history.append(y)

    def get_H(self, var=...):
        s_history = [tl.to_vec() for tl in self.global_state['s_history']]
        y_history = [tl.to_vec() for tl in self.global_state['y_history']]
        return LSR1LinearOperator(s_history, y_history)

    @torch.no_grad
    def apply_tensors(self, tensors, params, grads, loss, states, settings):
        scale_first = self.defaults['scale_first']

        tensors = as_tensorlist(tensors)

        s_history = self.global_state['s_history']
        y_history = self.global_state['y_history']

        # precondition
        dir = lsr1_Hx(
            x=tensors,
            s_history=s_history,
            y_history=y_history,
        )

        # scale 1st step
        if scale_first and self.global_state.get('step', 1) == 1:
            dir *= initial_step_size(dir, eps=1e-7)

        return dir

McCormick

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

McCormick's Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

This is "Algorithm 2", attributed to McCormick in this paper. However for some reason this method is also called Pearson's 2nd method in other sources.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class McCormick(_InverseHessianUpdateStrategyDefaults):
    """McCormicks's Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

        This is "Algorithm 2", attributed to McCormick in this paper. However for some reason this method is also called Pearson's 2nd method in other sources.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return mccormick_H_(H=H, s=s, y=y)

NewDQN

Bases: torchzero.modules.quasi_newton.diagonal_quasi_newton.DNRTR

Diagonal quasi-newton method.

Reference

Nosrati, Mahsa, and Keyvan Amini. "A new diagonal quasi-Newton algorithm for unconstrained optimization problems." Applications of Mathematics 69.4 (2024): 501-512.

Source code in torchzero/modules/quasi_newton/diagonal_quasi_newton.py
class NewDQN(DNRTR):
    """Diagonal quasi-newton method.

    Reference:
        Nosrati, Mahsa, and Keyvan Amini. "A new diagonal quasi-Newton algorithm for unconstrained optimization problems." Applications of Mathematics 69.4 (2024): 501-512.
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return new_dqn_B_(B=B, s=s, y=y)

NewSSM

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Self-scaling Quasi-Newton method.

Note

a line search such as tz.m.StrongWolfe() is required.

Warning

this uses roughly O(N^2) memory.

Reference

Moghrabi, I. A., Hassan, B. A., & Askar, A. (2022). New self-scaling quasi-newton methods for unconstrained optimization. Int. J. Math. Comput. Sci., 17, 1061U.
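
Since a line search is required, a minimal usage sketch following the note above:

opt = tz.Modular(
    model.parameters(),
    tz.m.NewSSM(),
    tz.m.StrongWolfe()
)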

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class NewSSM(HessianUpdateStrategy):
    """Self-scaling Quasi-Newton method.

    Note:
        a line search such as ``tz.m.StrongWolfe()`` is required.

    Warning:
        this uses roughly O(N^2) memory.

    Reference:
        Moghrabi, I. A., Hassan, B. A., & Askar, A. (2022). New self-scaling quasi-newton methods for unconstrained optimization. Int. J. Math. Comput. Sci., 17, 1061U.
    """
    def __init__(
        self,
        type: Literal[1, 2] = 1,
        init_scale: float | Literal["auto"] = "auto",
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        super().__init__(
            defaults=dict(type=type),
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        f = state['f']
        f_prev = state['f_prev']
        return new_ssm1(H=H, s=s, y=y, f=f, f_prev=f_prev, type=setting['type'], tol=setting['tol'])

PSB

Bases: torchzero.modules.quasi_newton.quasi_newton._HessianUpdateStrategyDefaults

Powell's Symmetric Broyden Quasi-Newton method.

Note

a line search or a trust region is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class PSB(_HessianUpdateStrategyDefaults):
    """Powell's Symmetric Broyden Quasi-Newton method.

    Note:
        a line search or a trust region is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Spedicato, E., & Huang, Z. (1997). Numerical experience with newton-like methods for nonlinear algebraic systems. Computing, 58(1), 69–89. doi:10.1007/bf02684472
    """
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return psb_B_(B=B, s=s, y=y)

Pearson

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Pearson's Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class Pearson(_InverseHessianUpdateStrategyDefaults):
    """
    Pearson's Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return pearson_H_(H=H, s=s, y=y)

ProjectedNewtonRaphson

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Projected Newton Raphson method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

This one is Algorithm 7.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ProjectedNewtonRaphson(HessianUpdateStrategy):
    """
    Projected Newton Raphson method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Pearson, J. D. (1969). Variable metric methods of minimisation. The Computer Journal, 12(2), 171–178. doi:10.1093/comjnl/12.2.171.

        This one is Algorithm 7.
    """
    def __init__(
        self,
        init_scale: float | Literal["auto"] = 'auto',
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None | Literal['auto'] = 'auto',
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        super().__init__(
            init_scale=init_scale,
            tol=tol,
            ptol = ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        if 'R' not in state: state['R'] = torch.eye(H.size(-1), device=H.device, dtype=H.dtype)
        H, R = projected_newton_raphson_H_(H=H, R=state['R'], s=s, y=y)
        state["R"] = R
        return H

    def reset_P(self, P, s, y, inverse, init_scale, state):
        assert inverse
        if 'R' not in state: state['R'] = torch.eye(P.size(-1), device=P.device, dtype=P.dtype)
        P.copy_(state["R"])

SR1

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Symmetric Rank 1. This works best with a trust region:

tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False))

Parameters:

  • init_scale (float | Literal['auto'], default: 'auto' ) –

    initial hessian matrix is set to identity times this.

    "auto" corresponds to a heuristic from [1] p.142-143.

    Defaults to "auto".

  • tol (float, default: 1e-32 ) –

    tolerance for denominator in SR1 update rule as in [1] p.146. Defaults to 1e-32.

  • ptol (float | None, default: 1e-32 ) –

    skips update if maximum difference between current and previous gradients is less than this, to avoid instability. Defaults to 1e-32.

  • ptol_restart (bool, default: False ) –

    whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.

  • restart_interval (int | None | Literal['auto'], default: None ) –

    interval between resetting the hessian approximation.

    "auto" corresponds to number of decision variables + 1.

    None - no resets.

    Defaults to None.

  • beta (float | None, default: None ) –

    momentum on H or B. Defaults to None.

  • update_freq (int, default: 1 ) –

    frequency of updating H or B. Defaults to 1.

  • scale_first (bool, default: False ) –

    whether to downscale the first step, before the Hessian approximation becomes available. Defaults to False.

  • scale_second (bool) –

    whether to downscale second step. Defaults to False.

  • concat_params (bool, default: True ) –

    If true, all parameters are treated as a single vector. If False, the update rule is applied to each parameter separately. Defaults to True.

  • inner (Chainable | None, default: None ) –

    preconditioning is applied to the output of this module. Defaults to None.

Examples:

SR1 with trust region

opt = tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False)),
)

References:
[1] Nocedal, J., & Wright, S. J. Numerical Optimization.
Source code in torchzero/modules/quasi_newton/quasi_newton.py
class SR1(_InverseHessianUpdateStrategyDefaults):
    """Symmetric Rank 1. This works best with a trust region:
    ```python
    tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False))
    ```

    Args:
        init_scale (float | Literal["auto"], optional):
            initial hessian matrix is set to identity times this.

            "auto" corresponds to a heuristic from [1] p.142-143.

            Defaults to "auto".
        tol (float, optional):
            tolerance for denominator in SR1 update rule as in [1] p.146. Defaults to 1e-32.
        ptol (float | None, optional):
            skips update if maximum difference between current and previous gradients is less than this, to avoid instability.
            Defaults to 1e-32.
        ptol_restart (bool, optional): whether to reset the hessian approximation when ptol tolerance is not met. Defaults to False.
        restart_interval (int | None | Literal["auto"], optional):
            interval between resetting the hessian approximation.

            "auto" corresponds to number of decision variables + 1.

            None - no resets.

            Defaults to None.
        beta (float | None, optional): momentum on H or B. Defaults to None.
        update_freq (int, optional): frequency of updating H or B. Defaults to 1.
        scale_first (bool, optional):
            whether to downscale first step before hessian approximation becomes available. Defaults to True.
        scale_second (bool, optional): whether to downscale second step. Defaults to False.
        concat_params (bool, optional):
            If true, all parameters are treated as a single vector.
            If False, the update rule is applied to each parameter separately. Defaults to True.
        inner (Chainable | None, optional): preconditioning is applied to the output of this module. Defaults to None.

    ### Examples:

    SR1 with trust region
    ```python
    opt = tz.Modular(
        model.parameters(),
        tz.m.LevenbergMarquardt(tz.m.SR1(inverse=False)),
    )
    ```

    ###  References:
        [1]. Nocedal. Stephen J. Wright. Numerical Optimization
    """

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return sr1_(H=H, s=s, y=y, tol=setting['tol'])
    def update_B(self, B, s, y, p, g, p_prev, g_prev, state, setting):
        return sr1_(H=B, s=y, y=s, tol=setting['tol'])

SSVM

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Self-scaling variable metric Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Oren, S. S., & Spedicato, E. (1976). Optimal conditioning of self-scaling variable Metric algorithms. Mathematical Programming, 10(1), 70–90. doi:10.1007/bf01580654
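
A minimal usage sketch pairing this module with a line search, following the pattern used throughout this page:

opt = tz.Modular(
    model.parameters(),
    tz.m.SSVM(),
    tz.m.StrongWolfe()
)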

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class SSVM(HessianUpdateStrategy):
    """
    Self-scaling variable metric Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Oren, S. S., & Spedicato, E. (1976). Optimal conditioning of self-scaling variable Metric algorithms. Mathematical Programming, 10(1), 70–90. doi:10.1007/bf01580654
    """
    def __init__(
        self,
        switch: tuple[float,float] | Literal[1,2,3,4] = 3,
        init_scale: float | Literal["auto"] = 'auto',
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(switch=switch)
        super().__init__(
            defaults=defaults,
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return ssvm_H_(H=H, s=s, y=y, g=g, switch=setting['switch'], tol=setting['tol'])

ShorR

Bases: torchzero.modules.quasi_newton.quasi_newton.HessianUpdateStrategy

Shor’s r-algorithm.

Note

A line search such as tz.m.StrongWolfe(a_init="quadratic", fallback=True) is required. Like conjugate gradient, ShorR has no automatic step-size scaling, so setting a_init in the line search is recommended.

References

Shor, N. Z. (1985). Minimization Methods for Non-differentiable Functions. New York: Springer.

Burke, James V., Adrian S. Lewis, and Michael L. Overton. "The Speed of Shor's R-algorithm." IMA Journal of numerical analysis 28.4 (2008): 711-720. - good overview.

Ansari, Zafar A. Limited Memory Space Dilation and Reduction Algorithms. Diss. Virginia Tech, 1998. - this is where a more efficient formula is described.
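
A minimal usage sketch following the note above (the line-search arguments are copied from that note):

opt = tz.Modular(
    model.parameters(),
    tz.m.ShorR(),
    tz.m.StrongWolfe(a_init="quadratic", fallback=True)
)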

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ShorR(HessianUpdateStrategy):
    """Shor’s r-algorithm.

    Note:
        A line search such as ``tz.m.StrongWolfe(a_init="quadratic", fallback=True)`` is required.
        Similarly to conjugate gradient, ShorR doesn't have an automatic step size scaling,
        so setting ``a_init`` in the line search is recommended.

    References:
        S HOR , N. Z. (1985) Minimization Methods for Non-differentiable Functions. New York: Springer.

        Burke, James V., Adrian S. Lewis, and Michael L. Overton. "The Speed of Shor's R-algorithm." IMA Journal of numerical analysis 28.4 (2008): 711-720. - good overview.

        Ansari, Zafar A. Limited Memory Space Dilation and Reduction Algorithms. Diss. Virginia Tech, 1998. - this is where a more efficient formula is described.
    """

    def __init__(
        self,
        alpha=0.5,
        init_scale: float | Literal["auto"] = 1,
        tol: float = 1e-32,
        ptol: float | None = 1e-32,
        ptol_restart: bool = False,
        gtol: float | None = 1e-32,
        restart_interval: int | None | Literal['auto'] = None,
        beta: float | None = None,
        update_freq: int = 1,
        scale_first: bool = False,
        concat_params: bool = True,
        # inverse: bool = True,
        inner: Chainable | None = None,
    ):
        defaults = dict(alpha=alpha)
        super().__init__(
            defaults=defaults,
            init_scale=init_scale,
            tol=tol,
            ptol=ptol,
            ptol_restart=ptol_restart,
            gtol=gtol,
            restart_interval=restart_interval,
            beta=beta,
            update_freq=update_freq,
            scale_first=scale_first,
            concat_params=concat_params,
            inverse=True,
            inner=inner,
        )

    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        return shor_r_(H=H, y=y, alpha=setting['alpha'])

ThomasOptimalMethod

Bases: torchzero.modules.quasi_newton.quasi_newton._InverseHessianUpdateStrategyDefaults

Thomas's "optimal" Quasi-Newton method.

Note

a line search is recommended.

Warning

this uses at least O(N^2) memory.

Reference

Thomas, Stephen Walter. Sequential estimation techniques for quasi-Newton algorithms. Cornell University, 1975.

Source code in torchzero/modules/quasi_newton/quasi_newton.py
class ThomasOptimalMethod(_InverseHessianUpdateStrategyDefaults):
    """
    Thomas's "optimal" Quasi-Newton method.

    Note:
        a line search is recommended.

    Warning:
        this uses at least O(N^2) memory.

    Reference:
        Thomas, Stephen Walter. Sequential estimation techniques for quasi-Newton algorithms. Cornell University, 1975.
    """
    def update_H(self, H, s, y, p, g, p_prev, g_prev, state, setting):
        if 'R' not in state: state['R'] = torch.eye(H.size(-1), device=H.device, dtype=H.dtype)
        H, state['R'] = thomas_H_(H=H, R=state['R'], s=s, y=y)
        return H

    def reset_P(self, P, s, y, inverse, init_scale, state):
        super().reset_P(P, s, y, inverse, init_scale, state)
        for st in self.state.values():
            st.pop("R", None)